# LlamaIndex

LlamaIndex is a framework for building applications powered by Large Language Models (LLMs), such as chatbots, AI assistants, and translation tools. One of its most valuable capabilities is enriching the knowledge of your LLM with **your own data**, enabling the model to answer questions about **personal, organizational, or domain-specific information** that it wasn’t originally trained on.

> NOTE: **Before running any code in this notebook, you need to complete three quick setup steps**:

Step 1: Store Your API Keys in Colab Secrets 🔑

  This notebook requires two API keys. Follow these steps to securely store
  them:

  1. Click the 🔑 icon in the left sidebar of Colab (it says "Secrets" when
  you hover over it)
  2. Add your OpenAI API key:
    - Click "Add new secret"
    - Name: OPENAI_API_KEY
    - Value: Paste your OpenAI API key

  3. Add your LlamaParse API key:
    - Click "Add new secret" again
    - Name: LLAMA_CLOUD_API_KEY
    - Value: Paste your LlamaParse API key


  📌 Note: If you don't have these API keys yet:
  - OpenAI API key: Get it from https://platform.openai.com/api-keys
  - LlamaParse API key: Get it from https://cloud.llamaindex.ai (sign up,
  then go to API Keys section)

  ---
  Step 2: Create a Data Folder 📁

  1. In the Colab file browser (left sidebar), you'll see
  your current files
  2. Right-click in the empty space
  3. Select "New folder"
  4. Name it exactly: data

  ---
  Step 3: Upload Documents to the Data Folder 📄

  You need to upload three documents (two PDFs and one Word file) to the data folder:

  1. Download these files (from course materials):
    - charter.pdf - Charter of Fundamental Rights of the European Union
    - livestock_poultry.pdf - Livestock and poultry data with tables
    - vacation_policy.docx - Sample vacation policy document
  2. Upload them to Colab:
    - Click on the data folder you just created
    - Click the upload icon at the top of the file browser
    - Select all three files and upload them




# Setup: Installing Required Libraries

Before we begin, we need to install the necessary Python libraries. Run the cell below to install all dependencies for this notebook.

In [1]:
# Install required libraries with working versions
!pip install -q llama-index-core==0.14.6 llama-index-embeddings-openai==0.5.1 \
    llama-index-llms-openai==0.6.6 openai==1.109.1 \
    chromadb==1.2.2 llama-index-vector-stores-chroma==0.5.3 \
    llama-index-readers-file llama-parse

print("✅ All libraries installed successfully!")
print("⚠️  IMPORTANT: Please restart your kernel/runtime now before running the next cell!")


# 1. Data Connectors

LlamaIndex uses data connectors to **ingest information** from a wide range of **structured and unstructured sources**.

The simplest way to load the data is using `SimpleDirectoryReader` which supports various file types such as:

- csv - comma-separated values
- docx - Microsoft Word
- ipynb - Jupyter Notebook
- pdf - Portable Document Format
- ppt, pptm, pptx - Microsoft PowerPoint
- ...and many more.

A data connector takes your data from these different formats and puts it together in a uniform, organized way so that it can be used within your LLM application.

You can find all supported file types in [the documentation](https://docs.llamaindex.ai/en/stable/module_guides/loading/simpledirectoryreader/#simpledirectoryreader).

First, let's configure the OpenAI API key and choose the models, then import `SimpleDirectoryReader`:

In [1]:
import os

# Configure OpenAI API key
OPENAI_API_KEY = None

try:
    from google.colab import userdata  # type: ignore
    OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
    if OPENAI_API_KEY:
        print('✅ API key loaded from Colab secrets')
except Exception:
    pass

if not OPENAI_API_KEY:
    OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

if not OPENAI_API_KEY:
    try:
        from getpass import getpass
        print('💡 To use Colab secrets: Go to 🔑 (left sidebar) → Add new secret → Name: OPENAI_API_KEY')
        OPENAI_API_KEY = getpass('Enter your OpenAI API Key: ')
    except Exception as exc:
        raise ValueError('❌ ERROR: No API key provided! Set OPENAI_API_KEY as an environment variable or Colab secret.') from exc

if not OPENAI_API_KEY or OPENAI_API_KEY.strip() == '':
    raise ValueError('❌ ERROR: No API key provided!')

os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY

print('✅ Authentication configured!')

OPENAI_MODEL = 'gpt-5-nano'  # Using gpt-5-nano for cost efficiency
print(f'🤖 Selected Model: {OPENAI_MODEL}')

OPENAI_EMBED_MODEL = 'text-embedding-3-small'
print(f'🧠 Embedding Model: {OPENAI_EMBED_MODEL}')


✅ API key loaded from Colab secrets
✅ Authentication configured!
🤖 Selected Model: gpt-5-nano
🧠 Embedding Model: text-embedding-3-small


In [2]:
from llama_index.core import SimpleDirectoryReader

We will load the PDF file called "charter.pdf" (stored in "data" folder in notebook's directory) containing the Charter of Fundamental Rights of the European Union.  

> NOTE: In this notebook we will use the asynchronous (async) versions of the data connectors, via `await` and `.aload_data()`. A notebook already runs its own event loop, and the async calls cooperate with that loop instead of conflicting with it. You don’t need to understand all the internals; just know that `await` is the keyword that tells Python "this step might take a while, pause here until it’s done".


In [3]:
# Generating documents
documents = await SimpleDirectoryReader(input_files = ["data/charter.pdf"]).aload_data()
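
Top-level `await` works in the cell above because the notebook is already running an event loop. In a plain Python script there is no such loop, so the same call would need to live inside a coroutine started with `asyncio.run`. The sketch below illustrates the pattern with a hypothetical stand-in loader (`fake_aload_data` is not part of LlamaIndex):

```python
import asyncio


async def fake_aload_data(paths):
    """Hypothetical stand-in for an async loader such as `.aload_data()`."""
    await asyncio.sleep(0)  # yield control, as a real I/O call would
    return [f"contents of {p}" for p in paths]


async def main():
    # `await` pauses here until the loader has finished
    return await fake_aload_data(["data/charter.pdf"])


# Outside a notebook there is no running event loop, so we start one explicitly
loaded = asyncio.run(main())
print(loaded)
```

Inside the notebook itself, keep using `await` directly as shown in the cells.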

When this data connector processes a PDF, it doesn’t treat the whole file as a single block of text. Instead, it splits the PDF into pages, and each page is returned as a separate **document object**.

In [4]:
# The number of pages in the original PDF file == The number of document objects
len(documents)

17

Let's display the text of the second document where we can see the Table of Contents:

In [5]:
print(documents[1].text)

 
Table of Contents  
Page 
PREAMBLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393  
TITLE I DIGNITY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 394  
TITLE II FREEDOMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395  
TITLE III EQUALITY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397  
TITLE IV SOLIDARITY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399  
TITLE V CITIZENS' RIGHTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401  
TITLE VI JUSTICE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403  
TITLE VII GENERAL PROVISIONS GOVERNING THE INTERPRETATION AND 
APPLICATION OF THE CHARTER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

Each document includes metadata such as `file_name`, `file_type`, `creation_date`, etc.:

In [6]:
documents[1].metadata

{'page_label': '390',
 'file_name': 'charter.pdf',
 'file_path': 'data/charter.pdf',
 'file_type': 'application/pdf',
 'file_size': 1049657,
 'creation_date': '2025-10-28',
 'last_modified_date': '2025-10-28'}
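
Note that `page_label` is the page number printed in the PDF, not the position in the `documents` list: the document at index 1 carries the label '390'. A small lookup makes it easy to go from a printed page number to a list position. The sketch below uses plain dictionaries shaped like the metadata above; only the '390' entry comes from the notebook output, and the neighbouring labels are assumed for illustration. With the real data, the list could be built as `[d.metadata for d in documents]`.

```python
# Illustrative metadata dicts; only the '390' entry is taken from the
# notebook output, the neighbouring labels are assumptions.
metadatas = [
    {"page_label": "389", "file_name": "charter.pdf"},
    {"page_label": "390", "file_name": "charter.pdf"},
    {"page_label": "391", "file_name": "charter.pdf"},
]

# Map each printed page label to its position in the list
index_by_label = {m["page_label"]: i for i, m in enumerate(metadatas)}

position = index_by_label["390"]
print(position)

# .get() avoids a KeyError when a metadata field is missing
author = metadatas[position].get("author", "N/A")
print(author)
```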

### 📝 EXERCISE 1: Load and Explore Documents


**Your task:**
1. Check how many document objects were created from the PDF file
2. Display the text content of the first document
3. Print the metadata for the first document
4. Think about: Why is the PDF split into multiple document objects? What does each object represent?

**Hint:**
- Use `len(documents)` to count documents
- Access properties with `documents[0].text` and `documents[0].metadata`



In [7]:
# YOUR CODE HERE


# 2. Creating the Index and Querying
Next, we’ll build a vector database to store our embeddings. We'll use `VectorStoreIndex.from_documents()` which automatically **breaks each document into smaller pieces called nodes** based on length. Each node keeps the metadata of its parent document, so we don’t lose context. Once the nodes are created, they are passed to an embedding model - `text-embedding-ada-002` from OpenAI by default.

In [8]:
# Creating the index
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex.from_documents(documents)

Next, we’ll turn the index into a query engine so that we can ask questions.

Behind the scenes, the workflow looks like this:
1. **Query Embedding**: Our text query is embedded into a vector
2. **Retriever**: Query vector is compared against the embeddings stored in the index and retriever returns the most relevant nodes - LlamaIndex uses **cosine** similarity by default
3. **Response Synthesizer**: Combines the retrieved nodes with our query to generate a prompt, which is then passed to an LLM to produce an answer - LlamaIndex uses `gpt-3.5-turbo` from OpenAI by default.

In [9]:
# Setting the index as query engine
query_engine = index.as_query_engine()

# Querying
response = query_engine.query("What is Title 1 about?")
print(response)

Title I is about fundamental human rights and dignity, emphasizing the protection and respect for human life, integrity, and prohibiting practices such as torture, slavery, discrimination, and ensuring equality between men and women.


## 2.1 Understanding Retrieval: What's Happening Behind the Scenes?

In the previous example, we asked a question and received an answer. Let's understand this workflow:


**1. Embedding the Query**
Our question ("What is Title 1 about?") is converted into a numerical vector (embedding) using the same embedding model that was used to embed documents. This ensures the query and documents exist in the same semantic space and can be meaningfully compared.

**2. Retrieval (The "R" in RAG)**
LlamaIndex compares the query embedding against all the chunk embeddings stored in the index using cosine similarity. It then retrieves the most semantically similar chunks—these are the pieces of documents that are most likely to contain the answer to our question.

By default, LlamaIndex retrieves the **top 2 most similar chunks**. These chunks become the "context" that will be sent to the LLM.

**3. Generation (The "G" in RAG)**
The retrieved chunks are combined with our original question and sent to the LLM in a prompt that essentially says: "Here are some relevant document excerpts. Based ONLY on these excerpts, answer the following question."

The LLM reads the context, synthesizes the information, and generates a natural language answer.
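
The three steps can be pictured with a toy example in plain Python. This is only a sketch: the three-dimensional vectors are invented stand-ins for real embeddings (which have hundreds or thousands of dimensions), and the prompt wording is illustrative, not the prompt LlamaIndex actually sends.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Invented toy "embeddings" for three chunks
chunks = {
    "Title I: Dignity": [0.9, 0.1, 0.0],
    "Title II: Freedoms": [0.1, 0.9, 0.1],
    "General provisions": [0.2, 0.2, 0.9],
}

# Step 1: pretend this is the embedding of "What is Title 1 about?"
query_vector = [0.8, 0.3, 0.1]

# Step 2 (retrieval): rank chunks by similarity and keep the top 2
ranked = sorted(chunks, key=lambda name: cosine(query_vector, chunks[name]), reverse=True)
top_k = ranked[:2]

# Step 3 (generation): the retrieved text is placed in the prompt with the question
prompt = "Context:\n" + "\n".join(top_k) + "\n\nQuestion: What is Title 1 about?"
print(top_k)
```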



Let's see this retrieval process in action.

In [10]:
# Query the index and inspect what was retrieved
query_text = "What is Title 1 about?"
response = query_engine.query(query_text)

print("=" * 80)
print("QUESTION:")
print("=" * 80)
print(query_text)
print("\n")

print("=" * 80)
print("FINAL ANSWER FROM LLM:")
print("=" * 80)
print(response)
print("\n")

print("=" * 80)
print("RETRIEVED CHUNKS (What the LLM actually saw as context):")
print("=" * 80)

# Inspect the source nodes (retrieved chunks)
for i, node in enumerate(response.source_nodes, 1):
    print(f"\n📄 CHUNK {i}")
    print(f"   Relevance Score: {node.score:.4f} (higher = more similar to query)")
    print(f"   Source: {node.metadata.get('file_name', 'N/A')} | Page: {node.metadata.get('page_label', 'N/A')}")
    print(f"   Text Preview (first 300 chars):")
    print(f"   {node.text[:300]}...")
    print("-" * 80)

QUESTION:
What is Title 1 about?


FINAL ANSWER FROM LLM:
Title I is about fundamental human rights and dignity, emphasizing the protection and respect for human life, integrity, and prohibiting practices such as torture, slavery, discrimination, and ensuring equality between men and women.


RETRIEVED CHUNKS (What the LLM actually saw as context):

📄 CHUNK 1
   Relevance Score: 0.7727 (higher = more similar to query)
   Source: charter.pdf | Page: 394
   Text Preview (first 300 chars):
   TITLE I  
DIGNITY 
Article 1  
Human dignity  
Human dignity is inviolable. It must be respected and protected.  
Article 2  
Right to life  
1. Everyone has the right to life.  
2. No one shall be condemned to the death penalty, or executed.  
Article 3  
Right to the integrity of the person  
1. E...
--------------------------------------------------------------------------------

📄 CHUNK 2
   Relevance Score: 0.7634 (higher = more similar to query)
   Source: charter.pdf | Page: 398
   Text Previe

### Key Observations

From the output above, notice several important things:

1. **The LLM's answer is grounded in specific chunks** - The answer didn't come from the LLM's training data. It came from the chunks that were retrieved from your document.

2. **Relevance scores guide retrieval** - Each chunk has a similarity score (typically between 0 and 1). Higher scores mean the chunk is more semantically similar to our query. LlamaIndex uses these scores to rank chunks and select the most relevant ones.

3. **Metadata is preserved** - Each chunk remembers where it came from (file name, page number, etc.). This is crucial for citation and traceability.

4. **Chunks are contextual excerpts** - Notice that each chunk is a portion of a document, not the entire document. This is why chunking strategy (which we'll explore more later) is so important: chunks must be large enough to contain meaningful information but small enough to be focused and relevant.
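
Because the metadata travels with every chunk, building a source list for an answer takes only a few lines. The sketch below uses the scores and page labels printed above as plain data; `format_citation` is a hypothetical helper, not a LlamaIndex function. With a real response, the pairs could be collected as `(node.score, node.metadata)` for each `node` in `response.source_nodes`.

```python
# (score, metadata) pairs copied from the retrieved chunks shown above
retrieved = [
    (0.7727, {"file_name": "charter.pdf", "page_label": "394"}),
    (0.7634, {"file_name": "charter.pdf", "page_label": "398"}),
]


def format_citation(score, metadata):
    """Turn one chunk's metadata into a short, human-readable source line."""
    name = metadata.get("file_name", "unknown file")
    page = metadata.get("page_label", "?")
    return f"{name}, p. {page} (score {score:.2f})"


citations = [format_citation(score, meta) for score, meta in retrieved]
print("\n".join(citations))
```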



### Controlling Retrieval: The `similarity_top_k` Parameter

By default, LlamaIndex retrieves the **top 2** most similar chunks. But you can control this behavior using the `similarity_top_k` parameter when creating your query engine.

**Trade-offs:**
- **Fewer chunks (k=1-2):** Faster, more focused, but might miss relevant information
- **More chunks (k=5-10):** Better coverage, but more noise and higher LLM costs (more tokens to process)
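
To make the trade-off concrete, the sketch below takes a list of invented `(score, text)` pairs and shows how the amount of context, and the score of the weakest chunk included, change as `k` grows. Word count is used here only as a rough proxy for token cost.

```python
# Hypothetical (score, text) pairs, already sorted from most to least similar
scored_chunks = [
    (0.88, "Freedom to conduct a business is recognised."),
    (0.87, "The Union is founded on the universal values of human dignity."),
    (0.86, "Principles may be implemented by legislative and executive acts."),
    (0.80, "Everyone has the right to education."),
    (0.75, "Everyone has the right to life."),
]


def context_for(k):
    """Return the k best chunks and a rough size estimate (word count)."""
    selected = scored_chunks[:k]
    words = sum(len(text.split()) for _, text in selected)
    return selected, words


for k in (1, 2, 5):
    selected, words = context_for(k)
    lowest = selected[-1][0]
    print(f"k={k}: {len(selected)} chunks, ~{words} words, lowest score {lowest}")
```

A larger `k` always adds context, but each extra chunk is less similar to the query than the ones before it.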

Let's experiment with different values:

In [11]:
# Create a query engine that retrieves top 5 chunks instead of default 2
query_engine_expanded = index.as_query_engine(similarity_top_k=5)

query_text = "What freedoms are protected in the EU Charter?"
response = query_engine_expanded.query(query_text)

print(f"QUESTION: {query_text}\n")
print(f"ANSWER: {response}\n")
print("=" * 80)
print(f"Retrieved {len(response.source_nodes)} chunks:")
print("=" * 80)

for i, node in enumerate(response.source_nodes, 1):
    print(f"\nChunk {i} | Score: {node.score:.4f} | Page: {node.metadata.get('page_label', 'N/A')}")
    print(f"Preview: {node.text[:150]}...")

QUESTION: What freedoms are protected in the EU Charter?

ANSWER: Freedom of expression, freedom of assembly and association, freedom of the arts and sciences, right to education, freedom to choose an occupation and right to engage in work, right to liberty and security, respect for private and family life, protection of personal data, right to marry and right to found a family, freedom of thought, conscience and religion.

Retrieved 5 chunks:

Chunk 1 | Score: 0.8797 | Page: 397
Preview: Article 16  
Freedom to conduct a business  
The freedom to conduct a business in accordance with Union law and national laws and practices 
is recogn...

Chunk 2 | Score: 0.8752 | Page: 393
Preview: The European Parliament, the Council and the Commission solemnly proclaim the following text as 
the Charter of Fundamental Rights of the European Uni...

Chunk 3 | Score: 0.8704 | Page: 405
Preview: 5. The provisions of this Charter which contain principles may be implemented by legislative and 
executiv

**💡 Practical Tip:** Start with the default (`similarity_top_k=2`) and only increase it if you notice that answers are incomplete or missing information that you know exists in your documents. You can always inspect the retrieved chunks to diagnose whether retrieval is the problem.

### 📝 EXERCISE 2.1: Inspect Retrieval for Your Own Query

**What you'll practice:** Understanding the retrieval process by examining which chunks are selected for different queries.

**Your task:**
1. Create a query about something specific in the EU Charter (e.g., "What are children's rights?", "What does Article 8 say?")
2. Use the query engine to get an answer
3. Inspect the retrieved chunks using `response.source_nodes`
4. Print the relevance scores and text previews for each chunk
5. Think about: Do the retrieved chunks actually contain the information needed to answer your question? Are the scores reasonable?

**Expected outcome:** You'll see exactly which parts of your documents were used to generate the answer, helping you understand whether the retrieval stage is working correctly.

In [12]:
# YOUR CODE HERE

# 3. Making Data Persistent

By default, `VectorStoreIndex` keeps all data in memory. However, LlamaIndex has its own built-in persistence mechanism.

We will use the `persist()` method, which saves the index to the "my_storage" folder. In the code cell below, if the folder "my_storage" does not exist yet, the code will:
- load the PDF file from the "data" folder
- build a new index
- persist that index to disk inside "my_storage"

If the folder "my_storage" already exists, the code instead:
- creates a `StorageContext` object pointing to this folder
- reloads the previously saved index directly

In [13]:
import os
from llama_index.core import StorageContext, load_index_from_storage

# A directory
PERSIST_DIR = "./my_storage"

if not os.path.exists(PERSIST_DIR):
    # Loading the documents and creating the index
    documents = await SimpleDirectoryReader(input_files = ["data/charter.pdf"]).aload_data()
    index = VectorStoreIndex.from_documents(documents)
    # Storing
    index.storage_context.persist(persist_dir = PERSIST_DIR)
else:
    # Reloading the existing index
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)
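
One consequence of this pattern: once "my_storage" exists, the PDF is never read again, so changes to the source file are not picked up until the folder is deleted. To see what was actually written to disk, a small helper like the one below can list the folder contents. `list_persisted_files` is a hypothetical helper written for this notebook; the demo runs on a throwaway folder, and in practice you would pass `PERSIST_DIR`.

```python
import os
import tempfile


def list_persisted_files(persist_dir):
    """Return (file name, size in bytes) for every file in persist_dir."""
    entries = []
    for name in sorted(os.listdir(persist_dir)):
        path = os.path.join(persist_dir, name)
        if os.path.isfile(path):
            entries.append((name, os.path.getsize(path)))
    return entries


# Demo on a temporary folder; use "./my_storage" in the notebook instead
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "example.json"), "w") as fh:
        fh.write("{}")
    demo_listing = list_persisted_files(demo_dir)

print(demo_listing)
```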

Now we can start running queries against it:

In [14]:
query_engine = index.as_query_engine()
response = query_engine.query("Can you summarize Title 2?")
print(response)

Title II focuses on freedoms. It includes articles on the right to liberty and security, respect for private and family life, protection of personal data, the right to marry and found a family, freedom of thought, conscience, and religion.


### 📝 EXERCISE 2: Query Your Index



**Your task:**
1. Think of a question about the EU Charter document (e.g., "What rights do children have?", "What is Article 10 about?", "What freedoms are protected?")
2. Query the index using your question
3. Print the response
4. Try a second question and compare the answers
5. Think about: How does the answer quality depend on your question phrasing?



In [15]:
# YOUR CODE HERE


# 4. LlamaParse

If your dataset includes different file types or documents with complex layouts (such as tables, multi-column text or embedded images), you can use `LlamaParse`. This parser is part of LlamaCloud and is designed to convert documents into structured outputs while preserving layout features far more accurately than generic readers.

To use this parser, you’ll first need a **LlamaCloud account**. Go to www.llamaindex.ai and sign up. Then navigate to the **API keys** section and click **Generate New Key**. Be sure to copy and store this secret key in a safe place. For security reasons, it will not be shown again in your account.

> NOTE: You can also make your LlamaParse API key and base URL load automatically every time your terminal starts. This way, you don’t have to set them manually in every session. Open your terminal and edit your shell file - type `nano ~/.zshrc`. At the end of the file, add the following lines. Then run `source ~/.zshrc`.
>
> `export LLAMA_CLOUD_API_KEY="YOUR_EU_KEY"`
>
> `export LLAMA_CLOUD_API_BASE="api.cloud.eu.llamaindex.ai"`


In [16]:
import os

# Configure LlamaParse API key
LLAMA_CLOUD_API_KEY = None

try:
    from google.colab import userdata
    LLAMA_CLOUD_API_KEY = userdata.get('LLAMA_CLOUD_API_KEY')
    if LLAMA_CLOUD_API_KEY:
        print('✅ LlamaParse API key loaded from Colab secrets')
except Exception:
    pass

if not LLAMA_CLOUD_API_KEY:
    LLAMA_CLOUD_API_KEY = os.getenv('LLAMA_CLOUD_API_KEY')

if not LLAMA_CLOUD_API_KEY:
    try:
        from getpass import getpass
        print('💡 To use Colab secrets: Go to 🔑 (left sidebar) → Add new secret → Name: LLAMA_CLOUD_API_KEY')
        LLAMA_CLOUD_API_KEY = getpass('Enter your LlamaParse API Key: ')
    except Exception as exc:
        raise ValueError(
            '❌ ERROR: No LlamaParse API key provided! Set LLAMA_CLOUD_API_KEY as an environment variable or Colab secret.'
        ) from exc

if not LLAMA_CLOUD_API_KEY or LLAMA_CLOUD_API_KEY.strip() == '':
    raise ValueError('❌ ERROR: No LlamaParse API key provided!')

os.environ['LLAMA_CLOUD_API_KEY'] = LLAMA_CLOUD_API_KEY

print('✅ LlamaParse authentication configured!')


✅ LlamaParse API key loaded from Colab secrets
✅ LlamaParse authentication configured!


In [17]:
from llama_parse import LlamaParse

parser = LlamaParse(
    result_type = "text",
    base_url = "https://api.cloud.eu.llamaindex.ai",  # Calling the EU LlamaCloud endpoint
    verbose = True
)

In [18]:
documents = await parser.aload_data("./data/charter.pdf")

Started parsing the file under job_id e1d5779b-b7f3-4dd1-90b3-e3b0daea563d


Let's again display the text of the second document - the parser preserves layout features like headings better than a simple text extractor like `SimpleDirectoryReader`:

In [19]:
print(documents[1].text)


C 202/390  EN    Official Journal of the European Union    7.6.2016

                 Table of Contents

                                                                                                                                   Page

PREAMBLE      . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .        393

TITLE I       DIGNITY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .              394

TITLE II      FREEDOMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                 395

TITLE III     EQUALITY              . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    397

TITLE IV      SOLIDARITY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                 399

TITLE V       CITIZENS' RIGHTS                  . . . . . . . . . . . . . . .

## 4.1 Using LlamaParse - PDF with tables into Markdown

Now let’s try `LlamaParse` on a PDF called "livestock_poultry.pdf" that contains not only text but also **several tables**. `LlamaParse` will return the content in **Markdown format**, which makes the document far easier for an LLM to interpret.

In the code cell below, we initialize the parser that connects to the LlamaCloud API - we need to set `base_url` that specifies which regional LlamaCloud endpoint to use. In this case, we’re pointing to the EU server.

In [20]:
# Parsing PDF
parser = LlamaParse(
    result_type = "markdown",
    base_url = "https://api.cloud.eu.llamaindex.ai",
    verbose = True
)

Now we can send a PDF file to the parser:

In [21]:
pdf_doc = await parser.aload_data("./data/livestock_poultry.pdf")

Started parsing the file under job_id 4723152b-c798-4c7d-8b81-b7daebae5538


Let's print the document with index 8. Compare this Markdown output with the original PDF (page 9). Notice how the layout is preserved. This is what makes `LlamaParse` valuable: instead of flattening tables into plain text, it captures structure in a way that downstream models can use effectively.

In [22]:
preview = pdf_doc[8].text[:500]
print(preview)



Cattle Stocks - Top Countries Summary

# (in 1,000 head)

# 1. Total Cattle Beg. Stks

| Country        | 2021    | 2022    | 2023    | 2024    | 2025    | 2025    |
| -------------- | ------- | ------- | ------- | ------- | ------- | ------- |
| India          | 305,500 | 306,700 | 307,400 | 307,420 | 307,490 | 307,490 |
| Brazil         | 193,195 | 193,780 | 194,365 | 192,572 | 186,875 | 186,875 |
| China          | 95,621  | 98,172  | 102,160 | 105,090 | 104,000 | 104,900 |
| European Union
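
Markdown tables are also easy to process programmatically, which is part of why this format is convenient downstream. As a rough sketch, the snippet below parses the first few columns of two rows from the output above using only the standard library; `parse_markdown_table` is a hypothetical helper that handles simple pipe tables only.

```python
# First columns of two rows copied from the Markdown output above
markdown_table = """\
| Country        | 2021    | 2022    | 2023    |
| -------------- | ------- | ------- | ------- |
| India          | 305,500 | 306,700 | 307,400 |
| Brazil         | 193,195 | 193,780 | 194,365 |
"""


def parse_markdown_table(text):
    """Split a simple pipe table into a header list and a list of row lists."""
    lines = [line.strip() for line in text.strip().splitlines()]
    cells = [[c.strip() for c in line.strip("|").split("|")] for line in lines]
    header, rows = cells[0], cells[2:]  # cells[1] is the |---| separator row
    return header, rows


header, rows = parse_markdown_table(markdown_table)

# Look up India's 2022 value (in 1,000 head) and convert it to an integer
india_2022 = int(rows[0][header.index("2022")].replace(",", ""))
print(header)
print(india_2022)
```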


### 📝 EXERCISE 3: Compare SimpleDirectoryReader vs LlamaParse



**Your task:**
1. Look at the parsed output from LlamaParse for the livestock_poultry.pdf document
2. Display a different page/document from the parsed results
3. Observe how tables and structured data are represented
4. Think about: When would you use LlamaParse instead of SimpleDirectoryReader?



In [23]:
# YOUR CODE HERE


## 4.2 Parsing different file types

In this section, we’ll see how to use LlamaParse to handle documents of different types, such as PDFs and Word files, and bring them into a single search workflow.

Instead of writing separate code for each format, we can map file extensions to the same parser and let `SimpleDirectoryReader` automatically process everything in a folder.

First, we'll initialize a parser:

In [24]:
parser = LlamaParse(result_type = "markdown",
                    base_url = "https://api.cloud.eu.llamaindex.ai",
                    verbose = True)

Next, we'll map file extensions to the parser:

In [25]:
file_extractor = {
    ".pdf": parser,
    ".docx": parser
}

Now we can tell `SimpleDirectoryReader` to scan a folder with files "charter.pdf", "livestock_poultry.pdf" and "vacation_policy.docx". If it finds a `.pdf` or `.docx`, it will use our parser to process it. The result is a list of document objects where each page or section is stored as Markdown text.

In [26]:
documents = await SimpleDirectoryReader(
    input_dir = "./data/",
    file_extractor = file_extractor
).aload_data()

Started parsing the file under job_id 4fb5fe16-bb08-4eeb-a63c-5faa2db84e87
Started parsing the file under job_id 8268e283-29e2-4fae-bbcc-26d5a1e2d04e
Started parsing the file under job_id 5fcf4d7a-f68f-4930-a594-1c14640b68c7
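
Conceptually, the `file_extractor` mapping is a dispatch table: the reader looks at each file's extension and hands the file to the matching parser. The sketch below mimics that idea in plain Python. Both handler functions are hypothetical placeholders, "notes.txt" is an invented file name, and the fallback for unmapped extensions is an assumption about how such a dispatcher would typically behave.

```python
from pathlib import Path


def parse_with_llamaparse(path):
    """Hypothetical placeholder for the LlamaParse-backed parser."""
    return f"markdown from {path.name}"


def parse_with_default_reader(path):
    """Hypothetical placeholder for a reader's built-in handling."""
    return f"plain text from {path.name}"


# Same shape as `file_extractor`: file extension -> handler
extractors = {".pdf": parse_with_llamaparse, ".docx": parse_with_llamaparse}


def load(path_str):
    """Pick a handler based on the file extension, with a default fallback."""
    path = Path(path_str)
    handler = extractors.get(path.suffix.lower(), parse_with_default_reader)
    return handler(path)


files = ["data/charter.pdf", "data/vacation_policy.docx", "data/notes.txt"]
results = [load(p) for p in files]
print(results)
```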


Now we are going to create embeddings for our documents. As we already know, when we build `VectorStoreIndex`, it automatically splits text into chunks before embedding, but this uses default settings.

However, we can use `SentenceSplitter` to gain explicit control over how that chunking happens:
- `chunk_size`: sets the maximum length of each chunk (keeps chunks small enough to fit into the embedding model and LLM context window)
- `chunk_overlap`: defines how much content is repeated between consecutive chunks



**Chunk size matters**

When we embed an entire document as a single vector, we are effectively averaging all of its topics into one point in space. For multi-topic articles this produces a diluted signal: a query about "zero trust" might rank poorly because the vector also carries equally strong signals for other sections such as cryptography or phishing. Chunking breaks the document into focused segments so each embedding represents one coherent idea, dramatically improving retrieval precision.
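
The dilution effect can be checked with a little arithmetic. The sketch below assumes, as a simplification, that a whole-document embedding behaves like the average of its section embeddings; real embedding models do not literally average, but the intuition carries over. The three-dimensional vectors are invented, with one axis per topic.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Toy topic axes: [zero trust, cryptography, phishing]
section_vectors = [
    [1.0, 0.0, 0.0],  # section about zero trust
    [0.0, 1.0, 0.0],  # section about cryptography
    [0.0, 0.0, 1.0],  # section about phishing
]
query = [1.0, 0.0, 0.0]  # a question about zero trust

# Whole-document embedding approximated as the mean of its sections
whole_doc = [sum(col) / len(section_vectors) for col in zip(*section_vectors)]

focused_score = cosine(query, section_vectors[0])
diluted_score = cosine(query, whole_doc)
print(round(focused_score, 3), round(diluted_score, 3))
```

The focused chunk matches the query perfectly, while the averaged vector scores noticeably lower even though the document contains exactly the same section.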

**Trade-offs**
- *Chunks that are too large (1000+ tokens)* keep the full context but blend unrelated concepts, reducing similarity scores and hurting recall.
- *Chunks that are too small (50 tokens)* deliver crisp matches but may lose the surrounding context the LLM needs when generating an answer.

**Mitigation strategies**
1. Start with a balanced window (e.g., 256–512 tokens) and adjust based on your corpus.
2. Introduce overlap (e.g., 10–20% of the chunk size) so important sentences near boundaries appear in both neighbouring chunks.
3. During retrieval, fetch neighbouring chunks or stitch together the original document spans so the LLM receives enough context to respond reliably.

This approach preserves the semantic focus needed for accurate vector search while still giving the downstream LLM the broader context it needs.
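
To see what chunk size and overlap do mechanically, here is a minimal sliding-window chunker. It is a simplified sketch that counts whitespace-separated words; `SentenceSplitter` measures length in tokens and also tries to respect sentence boundaries, so its output will differ.

```python
def chunk_words(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size words that share chunk_overlap words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = "one two three four five six seven eight nine ten"
chunks = chunk_words(sample, chunk_size=4, chunk_overlap=1)
print(chunks)
```

Each chunk starts with the last word of the previous one, so a sentence that straddles a boundary is still fully visible in at least one chunk when the overlap is large enough.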



In [27]:
from llama_index.core.node_parser import SentenceSplitter

# Split into nodes (chunks)
splitter = SentenceSplitter(
    chunk_size = 512,        # each chunk holds at most 512 tokens
    chunk_overlap = 50)      # up to 50 tokens from the end of one chunk are repeated at the start of the next

nodes = splitter.get_nodes_from_documents(documents)

The next step is to build a vector index:

In [28]:
# Creating embeddings from "nodes"
# (nodes are passed to the constructor; from_documents() expects Document objects)
index = VectorStoreIndex(nodes)

# Wrapping the index in a query engine
query_engine = index.as_query_engine()

In the code cell below, the question is converted into a vector embedding, which is compared against all stored embeddings (nodes) in the vector index. The nodes whose embeddings are most similar (highest cosine score) are selected as "relevant", combined with the query, and passed to an LLM to generate the answer:

In [29]:
# Running the query
print(query_engine.query("How many days can be carried over into the next calendar year?"))

The number of days that can be carried over into the next calendar year is 15 days.


In [30]:
# Running the query
print(query_engine.query("What are brazil top five pork export markets?"))

China, Philippines, Chile, Japan, and Hong Kong.


In [31]:
# Running the query
print(query_engine.query("What are the citizens' rights?"))

The citizens' rights include the right to vote and stand as a candidate at elections to the European Parliament and municipal elections, the right to good administration, the right of access to documents, the right to refer cases of maladministration to the European Ombudsman, the right to petition the European Parliament, and the freedom of movement and residence within the territory of the Member States.


## 4.3 Using a different LLM

Up to now, we’ve built a vector index using the default embedding model and the default LLM. But both of these can be customized. By default, LlamaIndex uses OpenAI’s `text-embedding-ada-002` for embeddings and `gpt-3.5-turbo` for the LLM.

In the example below, we’ll rebuild our index with a different embedding model - `text-embedding-3-small`, and then use a different LLM - `gpt-5-nano` to generate answers:

In [32]:
from llama_index.embeddings.openai import OpenAIEmbedding

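# Building a new index with a different embedding model
# (the keyword is `embed_model`; an unknown keyword such as `embedding` would not select the model)
pdf_index = VectorStoreIndex.from_documents(
    pdf_doc,
    embed_model = OpenAIEmbedding(model = OPENAI_EMBED_MODEL)
)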

In [33]:
from llama_index.llms.openai import OpenAI

# Using the new LLM with the index built in the previous cell
query_engine = pdf_index.as_query_engine(llm = OpenAI(model=OPENAI_MODEL))
response = query_engine.query("What is the forecasted percentage change of global export of pork between 2024 and 2025?")
print(response)

-1% (a decline of about one percent from 2024 to 2025).
