# LlamaIndex

LlamaIndex is a framework designed to help you build applications powered by Large Language Models (LLMs), such as chatbots, AI assistants, and translation tools. One of its most valuable capabilities is enriching the knowledge of your LLM with **your own data**, enabling the model to answer questions about **personal, organizational, or domain-specific information** that it wasn't originally trained on.

::: note
**Before you run this notebook**
- You need an OpenAI API key available as `OPENAI_API_KEY`.
- Cells using LlamaParse require a LlamaCloud API key (`LLAMA_CLOUD_API_KEY`).
- If you do not have these keys, skip the LlamaParse/OpenAI sections and follow the local examples.
:::
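
Before running anything, you may want to check which of the keys listed above are actually set. The snippet below is a minimal sketch using only the standard library; `missing_keys` and `REQUIRED_KEYS` are hypothetical names introduced for illustration and are not part of LlamaIndex.

```python
import os

# Hypothetical helper, not part of LlamaIndex: the key names come from the note above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "LLAMA_CLOUD_API_KEY")


def missing_keys(env, required=REQUIRED_KEYS):
    """Return the names in `required` that are unset or blank in `env`."""
    return [name for name in required if not (env.get(name) or "").strip()]


missing = missing_keys(os.environ)
if missing:
    print("Missing keys:", ", ".join(missing))
else:
    print("All keys found.")
```

If `LLAMA_CLOUD_API_KEY` is reported as missing, the LlamaParse cells later in the notebook are the ones to skip.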

# Setup: Installing Required Libraries

Before we begin, we need to install the necessary Python libraries. Run the cell below to install all dependencies for this notebook.

In [None]:
# Install required libraries with working versions
!pip install -q llama-index-core==0.14.6 llama-index-embeddings-openai==0.5.1 \
    llama-index-llms-openai==0.6.6 openai==1.109.1 \
    chromadb==1.2.2 llama-index-vector-stores-chroma==0.5.3 \
    llama-index-readers-file llama-parse

print("✅ All libraries installed successfully!")
print("⚠️  IMPORTANT: Please restart your kernel/runtime now before running the next cell!")

# 1. Data Connectors

LlamaIndex uses data connectors to **ingest information** from a wide range of **structured and unstructured sources**.

The simplest way to load the data is using `SimpleDirectoryReader` which supports various file types such as:

- csv - comma-separated values
- docx - Microsoft Word
- ipynb - Jupyter Notebook
- pdf - Portable Document Format
- ppt, .pptm, .pptx - Microsoft PowerPoint
- ...and many more.

A data connector takes your data from these different formats and puts it into a uniform, organized structure so it can be used within your LLM application.

You can find all supported file types in [the documentation](https://docs.llamaindex.ai/en/stable/module_guides/loading/simpledirectoryreader/#simpledirectoryreader).

Let's first configure the OpenAI API key and model names, and then import `SimpleDirectoryReader`:

In [None]:
import os

# Configure OpenAI API key
OPENAI_API_KEY = None

try:
    from google.colab import userdata  # type: ignore
    OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
    if OPENAI_API_KEY:
        print('✅ API key loaded from Colab secrets')
except Exception:
    pass

if not OPENAI_API_KEY:
    OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

if not OPENAI_API_KEY:
    try:
        from getpass import getpass
        print('💡 To use Colab secrets: Go to 🔑 (left sidebar) → Add new secret → Name: OPENAI_API_KEY')
        OPENAI_API_KEY = getpass('Enter your OpenAI API Key: ')
    except Exception as exc:
        raise ValueError('❌ ERROR: No API key provided! Set OPENAI_API_KEY as an environment variable or Colab secret.') from exc

if not OPENAI_API_KEY or OPENAI_API_KEY.strip() == '':
    raise ValueError('❌ ERROR: No API key provided!')

os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY

print('✅ Authentication configured!')

OPENAI_MODEL = 'gpt-5-nano'  # Using gpt-5-nano for cost efficiency
print(f'🤖 Selected Model: {OPENAI_MODEL}')

OPENAI_EMBED_MODEL = 'text-embedding-3-small'
print(f'🧠 Embedding Model: {OPENAI_EMBED_MODEL}')


In [None]:
from llama_index.core import SimpleDirectoryReader

We will load the PDF file called "charter.pdf" (stored in the "data" folder in the notebook's directory), which contains the Charter of Fundamental Rights of the European Union.

> NOTE: In this notebook we will use the asynchronous (async) versions of the data connectors, via `await` and `.aload_data()`. This avoids conflicts with the notebook's already-running event loop. You don't need to understand all the internals; just know that `await` is the keyword that tells Python "this step might take a while, pause here until it's done".
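
To make the role of `await` more concrete, here is a small standard-library sketch. `slow_load` is a hypothetical stand-in for a loader such as `.aload_data()`; it only sleeps briefly and returns a string, so nothing here touches LlamaIndex.

```python
import asyncio


async def slow_load(name, delay=0.01):
    """Hypothetical stand-in for an async loader such as `.aload_data()`."""
    await asyncio.sleep(delay)  # pause here without blocking the event loop
    return f"loaded {name}"


async def main():
    # Awaiting a single call: execution resumes once its result is ready.
    first = await slow_load("charter.pdf")
    # Awaiting several calls together: they run concurrently, results keep their order.
    rest = await asyncio.gather(slow_load("a.pdf"), slow_load("b.pdf"))
    return [first, *rest]


# In a plain script we have to start the event loop ourselves.
# In a notebook cell the loop is already running, so you would write
# `results = await main()` instead of calling asyncio.run().
results = asyncio.run(main())
print(results)
```

The last comment is the reason this notebook uses `await ...aload_data()` directly in its cells.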


In [None]:
# Generating documents
documents = await SimpleDirectoryReader(input_files = ["data/charter.pdf"]).aload_data()

When this data connector processes a PDF, it doesn't treat the whole file as a single block of text. Instead, it splits the PDF into pages and returns each page as a separate **document object**.

In [None]:
# The number of pages in the original PDF file == The number of document objects
len(documents)

Let's display the text of the second document where we can see the Table of Contents:

In [None]:
print(documents[1].text)

Each document includes metadata such as `file_name`, `file_type`, `creation_date`, etc.:

In [None]:
documents[1].metadata

### 📝 EXERCISE 1: Load and Explore Documents (5-7 minutes)

**What you'll practice:** Using SimpleDirectoryReader to load documents and inspect their properties.

**Your task:**
1. Check how many document objects were created from the PDF file
2. Display the text content of the first document (index 0)
3. Print the metadata for the first document
4. Think about: Why is the PDF split into multiple document objects? What does each object represent?

**Hint:** 
- Use `len(documents)` to count documents
- Access properties with `documents[0].text` and `documents[0].metadata`

**Expected outcome:** You'll see that each document object corresponds to one page of the PDF, keeping the content organized and making it easier to track where information comes from.

In [None]:
# YOUR CODE HERE
# Example solution structure:
# 
# print(f"Total documents loaded: {len(documents)}")
# print(f"\nFirst document text (first 300 characters):")
# print(documents[0].text[:300])
# print(f"\nFirst document metadata:")
# print(documents[0].metadata)

# 2. Creating the Index and Querying
Next, we'll build a vector database to store our embeddings. We'll use `VectorStoreIndex.from_documents()`, which automatically **breaks each document into smaller pieces called nodes** based on length. Each node keeps the metadata of its parent document, so we don't lose context. Once the nodes are created, they are passed to an embedding model (`text-embedding-ada-002` from OpenAI by default).
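
The idea that every node inherits its parent's metadata can be pictured with a toy example. This is only a sketch: `ToyNode` and `split_into_nodes` are hypothetical names, the split is by character count rather than the token-aware splitting LlamaIndex performs, and the metadata values are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    """Hypothetical, simplified stand-in for a LlamaIndex node."""
    text: str
    metadata: dict = field(default_factory=dict)


def split_into_nodes(text, metadata, max_chars=40):
    """Cut `text` into pieces of at most `max_chars`, copying the parent's metadata."""
    return [
        ToyNode(text=text[start:start + max_chars], metadata=dict(metadata))
        for start in range(0, len(text), max_chars)
    ]


page_text = "Human dignity is inviolable. It must be respected and protected."
page_metadata = {"file_name": "charter.pdf", "page_label": "illustrative"}

nodes = split_into_nodes(page_text, page_metadata)
for node in nodes:
    print(repr(node.text), node.metadata)
```

Because each piece carries a copy of the page's metadata, an answer built from a node can still be traced back to its source file and page.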

In [None]:
# Creating the index
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex.from_documents(documents)

Next, we'll turn the index into a query engine so that we can ask questions.

Behind the scenes, the workflow looks like this:
1. **Query Embedding**: Our text query is embedded into a vector.
2. **Retriever**: The query vector is compared against the embeddings stored in the index, and the retriever returns the most relevant nodes. LlamaIndex uses **cosine** similarity by default.
3. **Response Synthesizer**: Combines the retrieved nodes with our query to generate a prompt, which is then passed to an LLM to produce an answer. LlamaIndex uses `gpt-3.5-turbo` from OpenAI by default.
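
The retrieval step above can be sketched in a few lines of plain Python. The three-dimensional vectors below are hand-made assumptions standing in for real embeddings, and `retrieve` is a hypothetical helper, not the LlamaIndex retriever.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Hand-made toy "embeddings"; real ones have hundreds or thousands of dimensions.
node_vectors = {
    "node about dignity": [1.0, 0.0, 0.0],
    "node about freedoms": [0.0, 1.0, 0.0],
    "node about equality": [0.6, 0.8, 0.0],
}
query_vector = [0.9, 0.1, 0.0]


def retrieve(query, vectors, top_k=2):
    """Return the names of the `top_k` vectors most similar to `query`."""
    ranked = sorted(vectors, key=lambda name: cosine(query, vectors[name]), reverse=True)
    return ranked[:top_k]


top_nodes = retrieve(query_vector, node_vectors)

# The synthesizer step: retrieved context plus the question form the prompt.
prompt = "Context:\n" + "\n".join(top_nodes) + "\n\nQuestion: What is Title 1 about?"
print(prompt)
```

Only the assembled prompt is sent to the LLM, which is why retrieval quality largely determines answer quality.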

In [None]:
# Setting the index as query engine
query_engine = index.as_query_engine()

# Querying
print(query_engine.query("What is Title 1 about?"))

# 3. Making Data Persistent

By default, `VectorStoreIndex` keeps all data in memory. However, LlamaIndex has its own built-in persistence mechanism.

We will use the `persist()` method, which saves the index into the "my_storage" folder. In the code cell below, if the folder "my_storage" does not exist yet, the code will:
- load the PDF file from the "data" folder
- build a new index
- persist that index to disk inside "my_storage"

If the folder "my_storage" already exists, the code instead:
- creates a `StorageContext` object pointing to this folder
- reloads the previously saved index directly
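
The same build-once, reload-later logic can be tried without any API calls. The sketch below is an assumption-laden toy: it stores a small dictionary as JSON in a temporary folder, and `build_index` and `load_or_build` are hypothetical helpers rather than LlamaIndex functions.

```python
import json
import tempfile
from pathlib import Path


def build_index():
    """Hypothetical stand-in for the expensive embedding step."""
    return {"nodes": ["page 1 text", "page 2 text"]}


def load_or_build(persist_dir):
    """Reload a saved toy index if the folder exists, otherwise build and save it."""
    persist_dir = Path(persist_dir)
    index_file = persist_dir / "index.json"
    if not persist_dir.exists():
        index = build_index()
        persist_dir.mkdir(parents=True)
        index_file.write_text(json.dumps(index))
        return index, "built"
    return json.loads(index_file.read_text()), "reloaded"


with tempfile.TemporaryDirectory() as tmp:
    storage = Path(tmp) / "my_storage"
    first_index, first_status = load_or_build(storage)    # folder missing: build and save
    second_index, second_status = load_or_build(storage)  # folder present: reload

print(first_status, second_status)
```

With a real index, the practical benefit is that the embedding calls (and their cost) happen only on the first run.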

In [None]:
import os
from llama_index.core import StorageContext, load_index_from_storage

# Directory where the index is persisted
PERSIST_DIR = "./my_storage"

if not os.path.exists(PERSIST_DIR):
    # Loading the documents and creating the index
    documents = await SimpleDirectoryReader(input_files = ["data/charter.pdf"]).aload_data()
    index = VectorStoreIndex.from_documents(documents)
    # Storing
    index.storage_context.persist(persist_dir = PERSIST_DIR)
else:
    # Reloading the existing index
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)

Now we can start running queries against it:

### 📝 EXERCISE 2: Query Your Index (7-10 minutes)

**What you'll practice:** Querying a vector index and understanding how LlamaIndex retrieves information.

**Your task:**
1. Think of a question about the EU Charter document (e.g., "What rights do children have?", "What is Article 10 about?", "What freedoms are protected?")
2. Query the index using your question
3. Print the response
4. Try a second question and compare the answers
5. Think about: How does the answer quality depend on your question phrasing?

**Hint:** Use the query engine that was created from the persistent index:
```python
query_engine = index.as_query_engine()
response = query_engine.query("Your question")
print(response)
```

**Expected outcome:** You should get relevant answers based on the document content. More specific questions typically yield better, more focused answers.

In [None]:
# YOUR CODE HERE
# Example solution structure:
# 
# question1 = "Your first question here"
# response1 = query_engine.query(question1)
# print(f"Question 1: {question1}")
# print(f"Answer: {response1}\n")
# 
# question2 = "Your second question here"
# response2 = query_engine.query(question2)
# print(f"Question 2: {question2}")
# print(f"Answer: {response2}")

In [None]:
query_engine = index.as_query_engine()
response = query_engine.query("Can you summarize Title 2?")
print(response)

# 4. LlamaParse

If your dataset includes different file types or documents with complex layouts (such as tables, multi-column text or embedded images), you can use `LlamaParse`. This parser is part of LlamaCloud and is designed to convert documents into structured outputs while preserving layout features far more accurately than generic readers.

To use this parser, you'll first need a **LlamaCloud account**. Go to www.llamaindex.ai and sign up. Then navigate to the **API keys** section and click **Generate New Key**. Be sure to copy and store this secret key in a safe place. For security reasons, it will not be shown again in your account.

> NOTE: You can also make your LlamaParse API key and base URL load automatically every time your terminal starts, so you don't have to set them manually in every session. Open your terminal and edit your shell configuration file, for example with `nano ~/.zshrc`. At the end of the file, add the following lines. Then run `source ~/.zshrc`.
>
> `export LLAMA_CLOUD_API_KEY="YOUR_EU_KEY"`
>
> `export LLAMA_CLOUD_API_BASE="https://api.cloud.eu.llamaindex.ai"`


In [None]:
from llama_parse import LlamaParse

parser = LlamaParse(
    result_type = "text",
    base_url = "https://api.cloud.eu.llamaindex.ai",  # Calling the EU LlamaCloud endpoint
    verbose = True
)

In [None]:
documents = await parser.aload_data("./data/charter.pdf")

Let's again display the text of the second document - the parser preserves layout features like headings better than a simple text extractor like `SimpleDirectoryReader`:

In [None]:
print(documents[1].text)

## 4.1 Using LlamaParse - PDF with tables into Markdown

### 📝 EXERCISE 3: Compare SimpleDirectoryReader vs LlamaParse (10-12 minutes)

**What you'll practice:** Understanding the differences between basic and advanced document parsing.

**Your task:**
1. Look at the parsed output from LlamaParse for the livestock_poultry.pdf document
2. Display a different page/document from the parsed results (try index 5 or 6)
3. Observe how tables and structured data are represented
4. Think about: When would you use LlamaParse instead of SimpleDirectoryReader?

**Key differences to notice:**
- **SimpleDirectoryReader**: Fast, simple, treats all text uniformly
- **LlamaParse**: Preserves structure (tables, headings, lists) as Markdown, better for complex layouts

**Hint:** Access different pages with `pdf_doc[index].text` where index is 0 to len(pdf_doc)-1

**Expected outcome:** You'll see that LlamaParse preserves table structure and formatting that would be lost with basic text extraction.

In [None]:
# YOUR CODE HERE
# Example solution structure:
# 
# print(f"Total pages/documents in PDF: {len(pdf_doc)}")
# 
# # Display a different page
# page_index = 5  # Try different numbers
# print(f"\nContent from page/document {page_index}:")
# print(pdf_doc[page_index].text[:600])  # Show first 600 characters
# 
# # Analyze the structure
# print("\nObservations:")
# print("- Are tables preserved?")
# print("- Are headings clearly marked?")
# print("- Is the layout structure maintained?")

Now let's try `LlamaParse` on a PDF called "livestock_poultry.pdf" that contains not only text but also **several tables**. `LlamaParse` will return the content in **Markdown format**, which makes the document far easier for an LLM to interpret.

In the code cell below, we initialize the parser that connects to the LlamaCloud API. We need to set `base_url`, which specifies which regional LlamaCloud endpoint to use. In this case, we're pointing to the EU server.

In [None]:
# Parsing PDF
parser = LlamaParse(
    result_type = "markdown",
    base_url = "https://api.cloud.eu.llamaindex.ai",
    verbose = True
)

Now we can send a PDF file to the parser:

In [None]:
pdf_doc = await parser.aload_data("./data/livestock_poultry.pdf")

Let's print the document with index 8. Compare this Markdown output with the original PDF (page 9). Notice how the layout is preserved. This is what makes `LlamaParse` valuable: instead of flattening tables into plain text, it captures structure in a way that downstream models can use effectively.

In [None]:
preview = pdf_doc[8].text[:500]
print(preview)

## 4.2 Parsing different file types

In this section, we'll see how to use LlamaParse to handle documents of different types, such as PDFs and Word files, and bring them into a single search workflow.

Instead of writing separate code for each format, we can map file extensions to the same parser and let `SimpleDirectoryReader` automatically process everything in a folder.

First, we'll initialize a parser:

In [None]:
parser = LlamaParse(result_type = "markdown",
                    base_url = "https://api.cloud.eu.llamaindex.ai",
                    verbose = True)

Next, we'll map file extensions to the parser:

In [None]:
file_extractor = {
    ".pdf": parser,
    ".docx": parser
}

Now we can tell `SimpleDirectoryReader` to scan a folder with files "charter.pdf", "livestock_poultry.pdf" and "vacation_policy.docx". If it finds a `.pdf` or `.docx`, it will use our parser to process it. The result is a list of document objects where each page or section is stored as Markdown text.
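
The extension-to-parser mapping is essentially a dictionary lookup on the file suffix. The sketch below imitates that dispatch with plain functions; `parse_with_llamaparse`, `parse_as_plain_text`, `load_all`, and the file "notes.txt" are hypothetical and only illustrate the idea, not how `SimpleDirectoryReader` is implemented.

```python
from pathlib import Path


def parse_with_llamaparse(path):
    """Hypothetical stand-in for the LlamaParse parser."""
    return f"markdown from {path.name}"


def parse_as_plain_text(path):
    """Hypothetical fallback for extensions without a custom parser."""
    return f"plain text from {path.name}"


# Same shape as the `file_extractor` dictionary above: suffix -> parser.
toy_file_extractor = {".pdf": parse_with_llamaparse, ".docx": parse_with_llamaparse}


def load_all(paths, extractor_by_suffix, default=parse_as_plain_text):
    """Pick a parser per file based on its suffix and collect the results."""
    results = []
    for path in map(Path, paths):
        parse = extractor_by_suffix.get(path.suffix.lower(), default)
        results.append(parse(path))
    return results


loaded = load_all(
    ["data/charter.pdf", "data/vacation_policy.docx", "data/notes.txt"],
    toy_file_extractor,
)
print(loaded)
```

Adding support for another format then only means adding one more entry to the dictionary.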

In [None]:
documents = await SimpleDirectoryReader(
    input_dir = "./data/",
    file_extractor = file_extractor
).aload_data()

Now we are going to create embeddings for our documents. As we already know, when we build `VectorStoreIndex`, it automatically splits text into chunks before embedding, but this uses default settings.

However, we can use `SentenceSplitter` to gain explicit control over how that chunking happens:
- `chunk_size`: sets the maximum length of each chunk (keeps chunks small enough to fit into the embedding model and LLM context window)
- `chunk_overlap`: defines how much content is repeated between consecutive chunks
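
To see how these two parameters interact, consider a toy sliding-window chunker. It is a sketch under simplifying assumptions: "tokens" are whitespace-separated words, and `chunk_tokens` is a hypothetical helper; the real `SentenceSplitter` uses a tokenizer and tries to respect sentence boundaries.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split `tokens` into windows of `chunk_size` that share `chunk_overlap` tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


tokens = "a b c d e f g h i j".split()  # 10 toy tokens
chunks = chunk_tokens(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last token of the previous one, so a sentence that straddles a boundary is not lost entirely to either side.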


::: info
**Why chunk size matters**

When we embed an entire document as a single vector, we are effectively averaging all of its topics into one point in space. For multi-topic articles this produces a diluted signal: a query about "zero trust" might rank poorly because the vector also carries equally strong signals for other sections such as cryptography or phishing. Chunking breaks the document into focused segments so each embedding represents one coherent idea, dramatically improving retrieval precision.

**Trade-offs**
- *Chunks that are too large (1000+ tokens)* keep the full context but blend unrelated concepts, reducing similarity scores and hurting recall.
- *Chunks that are too small (50 tokens)* deliver crisp matches but may lose the surrounding context the LLM needs when generating an answer.

**Mitigation strategies**
1. Start with a balanced window (e.g., 256–512 tokens) and adjust based on your corpus.
2. Introduce overlap (e.g., 10–20% of the chunk size) so important sentences near boundaries appear in both neighbouring chunks.
3. During retrieval, fetch neighbouring chunks or stitch together the original document spans so the LLM receives enough context to respond reliably.

This approach preserves the semantic focus needed for accurate vector search while still giving the downstream LLM the broader context it needs.
:::
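
The dilution effect described in the box can be checked with a small calculation. The sketch assumes an idealized setup: each section of a document gets its own axis in a three-dimensional toy space, and the whole-document embedding is modelled as the plain average of the section vectors. Real embedding models are not literal averages, so treat the numbers as an illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Toy axes: [zero trust, cryptography, phishing]
section_vectors = {
    "zero trust": [1.0, 0.0, 0.0],
    "cryptography": [0.0, 1.0, 0.0],
    "phishing": [0.0, 0.0, 1.0],
}
query = [1.0, 0.0, 0.0]  # a query purely about zero trust

whole_doc = mean_vector(list(section_vectors.values()))
whole_score = cosine(query, whole_doc)
best_chunk_score = max(cosine(query, vector) for vector in section_vectors.values())

print(f"whole document: {whole_score:.3f}")
print(f"best chunk:     {best_chunk_score:.3f}")
```

In this toy setup the whole-document vector scores about 0.577 against the query, while the matching chunk scores 1.0, which is the gap that chunking is meant to recover.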


In [None]:
from llama_index.core.node_parser import SentenceSplitter

# Split into nodes (chunks)
splitter = SentenceSplitter(
    chunk_size = 512,        # each chunk holds at most about 512 tokens
    chunk_overlap = 50)      # roughly the last 50 tokens of one chunk are repeated at the start of the next

nodes = splitter.get_nodes_from_documents(documents)

The next step is to build a vector index:

In [None]:
# Creating embeddings from "nodes" (pre-split nodes go to the constructor, not from_documents)
index = VectorStoreIndex(nodes)

# Wrapping the index in a query engine
query_engine = index.as_query_engine()

In the code cell below, the question is converted into a vector embedding which is compared against all stored embeddings (nodes) in the vector index. The nodes whose embeddings are most similar (highest cosine score) are selected as "relevant" and combined with the query and passed to an LLM to generate the answer:

In [None]:
# Running the query
print(query_engine.query("How many days can be carried over into the next calendar year?"))

In [None]:
# Running the query
print(query_engine.query("What are Brazil's top five pork export markets?"))

In [None]:
# Running the query
print(query_engine.query("What are the citizens' rights?"))

## 4.3 Using a different LLM

Up to now, we've built a vector index using the default embedding model and the default LLM. But both of these can be customized. By default, LlamaIndex uses OpenAI's `text-embedding-ada-002` for embeddings and `gpt-3.5-turbo` for the LLM.

In the example below, we'll rebuild our index with a different embedding model (`text-embedding-3-small`) and then use a different LLM (`gpt-5-nano`) to generate answers:

In [None]:
from llama_index.embeddings.openai import OpenAIEmbedding

# Building a new index with a different embedding model
pdf_index = VectorStoreIndex.from_documents(
    pdf_doc,
    embed_model = OpenAIEmbedding(model = OPENAI_EMBED_MODEL)
)

In [None]:
from llama_index.llms.openai import OpenAI

# Using the new LLM with the index built above (pdf_index)
query_engine = pdf_index.as_query_engine(llm = OpenAI(model=OPENAI_MODEL))
response = query_engine.query("What is the forecasted percentage change of global export of pork between 2024 and 2025?")
print(response)