# LlamaIndex

LlamaIndex is a framework designed to help you build applications powered by Large Language Models (LLMs), such as chatbots, AI assistants, and translation tools. One of its most valuable capabilities is enriching the knowledge of your LLM with **your own data**, enabling the model to answer questions about **personal, organizational, or domain-specific information** that it wasn’t originally trained on.

# 1. Data Connectors

LlamaIndex uses data connectors to **ingest information** from a wide range of **structured and unstructured sources**.

The simplest way to load data is with `SimpleDirectoryReader`, which supports many file types, such as:

- .csv - comma-separated values
- .docx - Microsoft Word
- .ipynb - Jupyter Notebook
- .pdf - Portable Document Format
- .ppt, .pptm, .pptx - Microsoft PowerPoint
- ...and many more.

A data connector takes your data from these different formats and puts it into a uniform, organized form so it can be used within your LLM application.

You can find all supported file types in [the documentation](https://docs.llamaindex.ai/en/stable/module_guides/loading/simpledirectoryreader/#simpledirectoryreader).

Let's import `SimpleDirectoryReader`:

In [None]:
from llama_index.core import SimpleDirectoryReader

We will load the PDF file called "charter.pdf" (stored in the "data" folder in the notebook's directory), which contains the Charter of Fundamental Rights of the European Union.

> NOTE: In this notebook we will use the asynchronous (async) versions of data connectors using `await` and `.aload_data()`. It helps everything run more smoothly and prevents technical errors with the notebook’s event loop. You don’t need to understand all the internals, just know that `await` is the keyword that tells Python "this step might take a while, pause here until it’s done".


In [None]:
# Generating documents
documents = await SimpleDirectoryReader(input_files = ["data/charter.pdf"]).aload_data()
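To see what `await` does in isolation, here is a minimal sketch that does not use LlamaIndex at all. The coroutine `fake_load` and the file names are hypothetical stand-ins for a slow loading step. The sketch is written as a plain script, so it starts the event loop with `asyncio.run()`; inside a notebook cell you would write `await main()` instead.

```python
import asyncio

async def fake_load(name, delay):
    # Stand-in for a slow I/O step such as reading or parsing a file
    await asyncio.sleep(delay)
    return f"{name} loaded"

async def main():
    # gather() runs both coroutines concurrently;
    # await pauses here until both have finished
    return await asyncio.gather(
        fake_load("first.pdf", 0.02),
        fake_load("second.pdf", 0.01),
    )

results = asyncio.run(main())
print(results)
```

The results come back in the order the coroutines were passed to `gather()`, even though the second one finishes first.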

When this data connector processes a PDF, it doesn’t treat the whole file as a single block of text. Instead, it splits the PDF into pages and returns each page as a **document object**.

In [None]:
# The number of pages in the original PDF file == The number of document objects
len(documents)

17

Let's display the text of the second document where we can see the Table of Contents:

In [None]:
print(documents[1].text)

 
Table of Contents  
Page 
PREAMBLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393  
TITLE I DIGNITY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 394  
TITLE II FREEDOMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395  
TITLE III EQUALITY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397  
TITLE IV SOLIDARITY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399  
TITLE V CITIZENS' RIGHTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401  
TITLE VI JUSTICE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403  
TITLE VII GENERAL PROVISIONS GOVERNING THE INTERPRETATION AND 
APPLICATION OF THE CHARTER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

Each document includes metadata such as `file_name`, `file_type`, `creation_date`, etc.:

In [None]:
documents[1].metadata

{'page_label': '390',
 'file_name': 'charter.pdf',
 'file_path': 'data/charter.pdf',
 'file_type': 'application/pdf',
 'file_size': 1049657,
 'creation_date': '2025-08-19',
 'last_modified_date': '2025-08-19'}

# 2. Creating the Index and Querying
Next, we’ll build a vector database to store our embeddings. We'll use `VectorStoreIndex.from_documents()` which automatically **breaks each document into smaller pieces called nodes** based on length. Each node keeps the metadata of its parent document, so we don’t lose context. Once the nodes are created, they are passed to an embedding model - `text-embedding-ada-002` from OpenAI by default.

In [None]:
# Creating the index
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex.from_documents(documents)

2025-09-18 17:59:45,969 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


Next, we’ll turn the index into a query engine so that we can ask questions.

Behind the scenes, the workflow looks like this:
1. **Query Embedding**: Our text query is embedded into a vector.
2. **Retriever**: The query vector is compared against the embeddings stored in the index, and the retriever returns the most relevant nodes - LlamaIndex uses **cosine** similarity by default.
3. **Response Synthesizer**: Combines the retrieved nodes with our query to generate a prompt, which is then passed to an LLM to produce an answer - LlamaIndex uses `gpt-3.5-turbo` from OpenAI by default.
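The retrieval step can be sketched in plain Python. This is a toy illustration rather than LlamaIndex's internal code: the three-dimensional vectors, the node names, and the helper `retrieve` are made up for the example, whereas real embeddings have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, node_vecs, top_k=2):
    # Score every stored node against the query and keep the best top_k
    scored = [
        (node_id, cosine_similarity(query_vec, vec))
        for node_id, vec in node_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical embeddings for three nodes
toy_nodes = {
    "dignity": [0.9, 0.1, 0.0],
    "freedoms": [0.1, 0.9, 0.0],
    "justice": [0.0, 0.2, 0.9],
}
toy_query = [1.0, 0.2, 0.0]

top = retrieve(toy_query, toy_nodes, top_k=2)
for node_id, score in top:
    print(node_id, round(score, 3))
```

The node whose vector points in nearly the same direction as the query ranks first, regardless of vector length, which is why cosine similarity is a common choice for comparing embeddings.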

In [None]:
# Setting the index as query engine
query_engine = index.as_query_engine()

# Querying
print(query_engine.query("What is Title 1 about?"))

2025-09-18 18:01:07,684 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-09-18 18:01:09,009 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Title I is about fundamental human rights and dignity, emphasizing the protection and respect for human life, integrity, and prohibiting practices such as the death penalty, torture, slavery, and discrimination.


# 3. Making Data Persistent

By default, `VectorStoreIndex` keeps all data in memory. However, LlamaIndex has its own built-in persistence mechanism.

We will use the `persist()` method, which saves the index to the "my_storage" folder. In the code cell below, if the folder "my_storage" does not exist yet, the code will:
- load the PDF file from the "data" folder
- build a new index
- persist that index to disk inside "my_storage"

If the folder "my_storage" already exists, the code will instead:
- create a `StorageContext` object pointing to this folder
- reload the previously saved index directly

In [None]:
import os
from llama_index.core import StorageContext, load_index_from_storage

# Directory where the index is persisted
PERSIST_DIR = "./my_storage"

if not os.path.exists(PERSIST_DIR):
    # Loading the documents and creating the index
    documents = await SimpleDirectoryReader(input_files = ["data/charter.pdf"]).aload_data()
    index = VectorStoreIndex.from_documents(documents)
    # Storing
    index.storage_context.persist(persist_dir = PERSIST_DIR)
else:
    # Reloading the existing index
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)

2025-09-18 18:06:01,779 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"


Now we can start running queries against it:

In [None]:
query_engine = index.as_query_engine()
response = query_engine.query("Can you summarize Title 2?")
print(response)

2025-09-18 18:07:16,020 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-09-18 18:07:16,833 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Title II of the document focuses on various freedoms. It covers the right to liberty and security, respect for private and family life, protection of personal data, the right to marry and found a family, freedom of thought, conscience, and religion.


# 4. LlamaParse

If your dataset includes different file types or documents with complex layouts (such as tables, multi-column text or embedded images), you can use `LlamaParse`. This parser is part of LlamaCloud and is designed to convert documents into structured outputs while preserving layout features far more accurately than generic readers.

To use this parser, you’ll first need a **LlamaCloud account**. Go to www.llamaindex.ai and sign up. Then navigate to the **API keys** section and click **Generate New Key**. Be sure to copy this secret key and store it in a safe place. For security reasons, it will not be shown again in your account.

> NOTE: You can also make your LlamaParse API key and base URL load automatically every time your terminal starts. This way, you don’t have to set them manually in every session. Open your terminal and edit your shell file - type `nano ~/.zshrc`. At the end of the file, add the following lines. Then run `source ~/.zshrc`.
>
> `export LLAMA_CLOUD_API_KEY="YOUR_EU_KEY"`
>
> `export LLAMA_CLOUD_API_BASE="api.cloud.eu.llamaindex.ai"`


In [None]:
from llama_parse import LlamaParse

parser = LlamaParse(
    result_type = "text",
    base_url = "https://api.cloud.eu.llamaindex.ai",  # Calling the EU LlamaCloud endpoint
    verbose = True
)

In [None]:
documents = await parser.aload_data("./data/charter.pdf")

2025-09-18 18:14:35,826 - INFO - HTTP Request: POST https://api.cloud.eu.llamaindex.ai/api/parsing/upload "HTTP/1.1 200 OK"


Started parsing the file under job_id d641e88d-b25c-4e08-b607-bae12af278df


2025-09-18 18:14:37,193 - INFO - HTTP Request: GET https://api.cloud.eu.llamaindex.ai/api/parsing/job/d641e88d-b25c-4e08-b607-bae12af278df "HTTP/1.1 200 OK"
2025-09-18 18:14:39,717 - INFO - HTTP Request: GET https://api.cloud.eu.llamaindex.ai/api/parsing/job/d641e88d-b25c-4e08-b607-bae12af278df "HTTP/1.1 200 OK"
2025-09-18 18:14:43,608 - INFO - HTTP Request: GET https://api.cloud.eu.llamaindex.ai/api/parsing/job/d641e88d-b25c-4e08-b607-bae12af278df "HTTP/1.1 200 OK"
2025-09-18 18:14:48,114 - INFO - HTTP Request: GET https://api.cloud.eu.llamaindex.ai/api/parsing/job/d641e88d-b25c-4e08-b607-bae12af278df "HTTP/1.1 200 OK"
2025-09-18 18:14:51,882 - INFO - HTTP Request: GET https://api.cloud.eu.llamaindex.ai/api/parsing/job/d641e88d-b25c-4e08-b607-bae12af278df/result/text "HTTP/1.1 200 OK"


Let's display the text of the second document again. The parser preserves layout features such as headings better than a simple text extractor like `SimpleDirectoryReader`:

In [None]:
print(documents[1].text)


C 202/390  EN                     Official Journal of the European Union                                                                      7.6.2016

                                          Table of Contents

                                                                                                                                              Page

           PREAMBLE      . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .        393

           TITLE I       DIGNITY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .              394

           TITLE II      FREEDOMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .                 395

           TITLE III     EQUALITY              . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .    397

           TITLE IV      SOLIDARITY . . . . . . . 

## 4.1 Using LlamaParse - PDF with tables into Markdown

Now let’s try `LlamaParse` on a PDF called "livestock_poultry.pdf" that contains not only text but also **several tables**. `LlamaParse` will return the content in **Markdown format**, which makes the document far easier for an LLM to interpret.

In the code cell below, we initialize the parser that connects to the LlamaCloud API. We set `base_url`, which specifies the regional LlamaCloud endpoint to use. In this case, we’re pointing to the EU server.

In [None]:
# Parsing PDF
parser = LlamaParse(
    result_type = "markdown",
    base_url = "https://api.cloud.eu.llamaindex.ai",
    verbose = True
)

Now we can send a PDF file to the parser:

In [None]:
pdf_doc = await parser.aload_data("./data/livestock_poultry.pdf")

2025-09-18 18:38:46,624 - INFO - HTTP Request: POST https://api.cloud.eu.llamaindex.ai/api/parsing/upload "HTTP/1.1 200 OK"


Started parsing the file under job_id 75c128e5-afaf-49e4-95b6-329aeea0f8d6


2025-09-18 18:38:48,377 - INFO - HTTP Request: GET https://api.cloud.eu.llamaindex.ai/api/parsing/job/75c128e5-afaf-49e4-95b6-329aeea0f8d6 "HTTP/1.1 200 OK"
2025-09-18 18:38:50,851 - INFO - HTTP Request: GET https://api.cloud.eu.llamaindex.ai/api/parsing/job/75c128e5-afaf-49e4-95b6-329aeea0f8d6 "HTTP/1.1 200 OK"
2025-09-18 18:38:54,262 - INFO - HTTP Request: GET https://api.cloud.eu.llamaindex.ai/api/parsing/job/75c128e5-afaf-49e4-95b6-329aeea0f8d6 "HTTP/1.1 200 OK"
2025-09-18 18:38:58,606 - INFO - HTTP Request: GET https://api.cloud.eu.llamaindex.ai/api/parsing/job/75c128e5-afaf-49e4-95b6-329aeea0f8d6 "HTTP/1.1 200 OK"
2025-09-18 18:39:04,647 - INFO - HTTP Request: GET https://api.cloud.eu.llamaindex.ai/api/parsing/job/75c128e5-afaf-49e4-95b6-329aeea0f8d6 "HTTP/1.1 200 OK"
2025-09-18 18:39:10,175 - INFO - HTTP Request: GET https://api.cloud.eu.llamaindex.ai/api/parsing/job/75c128e5-afaf-49e4-95b6-329aeea0f8d6 "HTTP/1.1 200 OK"
2025-09-18 18:39:15,605 - INFO - HTTP Request: GET https:/

Let's print the document with index 8. Compare this Markdown output with the original PDF (page 9). Notice how the layout is preserved. This is what makes `LlamaParse` valuable: instead of flattening tables into plain text, it captures structure in a way that downstream models can use effectively.

In [None]:
print(pdf_doc[8].text[:10000])


# Cattle Stocks - Top Countries Summary

# (in 1,000 head)

|                        | 2021    | 2022    | 2023    | 2024    | 2025    | 2025    |
| ---------------------- | ------- | ------- | ------- | ------- | ------- | ------- |
| Total Cattle Beg. Stks |         |         |         |         | Oct     | Apr     |
| India                  | 305,500 | 306,700 | 307,400 | 307,420 | 307,490 | 307,490 |
| Brazil                 | 193,195 | 193,780 | 194,365 | 192,572 | 186,875 | 186,875 |
| China                  | 95,621  | 98,172  | 102,160 | 105,090 | 104,000 | 104,900 |
| European Union         | 76,551  | 75,705  | 74,808  | 73,745  | 72,300  | 71,822  |
| Argentina              | 53,540  | 53,400  | 54,100  | 52,800  | 53,200  | 52,370  |
| Australia              | 23,021  | 23,944  | 25,800  | 27,080  | 27,020  | 27,260  |
| Mexico                 | 17,000  | 17,314  | 17,763  | 17,840  | 17,735  | 17,965  |
| Russia                 | 17,953  | 17,798  | 17,435  | 17,285  | 16

## 4.2 Parsing different file types

In this section, we’ll see how to use LlamaParse to handle documents of different types, such as PDFs and Word files, and bring them into a single search workflow.

Instead of writing separate code for each format, we can map file extensions to the same parser and let `SimpleDirectoryReader` automatically process everything in a folder.

First, we'll initialize a parser:

In [None]:
parser = LlamaParse(result_type = "markdown",
                    base_url = "https://api.cloud.eu.llamaindex.ai",
                    verbose = True)

Next, we'll map file extensions to the parser:

In [None]:
file_extractor = {
    ".pdf": parser,
    ".docx": parser
}
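The mapping above is a plain dictionary keyed by file extension. As a rough illustration of how such a lookup could behave (a simplified stand-in, not the actual `SimpleDirectoryReader` code), the hypothetical helper `pick_extractor` below chooses a handler by suffix and falls back to a default for unmapped extensions. The lower-casing of the suffix, the string handlers, and the file names are assumptions made for this sketch.

```python
from pathlib import Path

def pick_extractor(path, file_extractor, default="default reader"):
    # Look up the handler by (lower-cased) file extension,
    # falling back to a default when the extension is not mapped
    suffix = Path(path).suffix.lower()
    return file_extractor.get(suffix, default)

# Strings stand in for parser objects in this toy example
toy_extractor = {".pdf": "LlamaParse", ".docx": "LlamaParse"}

for name in ["report.pdf", "policy.docx", "notes.txt"]:
    print(name, "->", pick_extractor(name, toy_extractor))
```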

Now we can tell `SimpleDirectoryReader` to scan a folder containing the files "charter.pdf", "livestock_poultry.pdf" and "vacation_policy.docx". If it finds a `.pdf` or `.docx` file, it will use our parser to process it. The result is a list of document objects where each page or section is stored as Markdown text.

In [None]:
documents = await SimpleDirectoryReader(
    input_dir = "./data/",
    file_extractor = file_extractor
).aload_data()

2025-09-18 18:58:30,085 - INFO - HTTP Request: POST https://api.cloud.eu.llamaindex.ai/api/parsing/upload "HTTP/1.1 200 OK"
2025-09-18 18:58:30,141 - INFO - HTTP Request: POST https://api.cloud.eu.llamaindex.ai/api/parsing/upload "HTTP/1.1 200 OK"


Started parsing the file under job_id 444ca46f-544e-49b4-8b01-7f23e85f6e7b
Started parsing the file under job_id 91f803e4-c119-4fa2-9769-6accecbb9a03


2025-09-18 18:58:30,986 - INFO - HTTP Request: POST https://api.cloud.eu.llamaindex.ai/api/parsing/upload "HTTP/1.1 200 OK"


Started parsing the file under job_id 2c3820d1-0af8-4c57-ad95-566d77e20498


2025-09-18 18:58:31,638 - INFO - HTTP Request: GET https://api.cloud.eu.llamaindex.ai/api/parsing/job/444ca46f-544e-49b4-8b01-7f23e85f6e7b "HTTP/1.1 200 OK"
2025-09-18 18:58:31,642 - INFO - HTTP Request: GET https://api.cloud.eu.llamaindex.ai/api/parsing/job/91f803e4-c119-4fa2-9769-6accecbb9a03 "HTTP/1.1 200 OK"
2025-09-18 18:58:32,574 - INFO - HTTP Request: GET https://api.cloud.eu.llamaindex.ai/api/parsing/job/2c3820d1-0af8-4c57-ad95-566d77e20498 "HTTP/1.1 200 OK"
2025-09-18 18:58:33,879 - INFO - HTTP Request: GET https://api.cloud.eu.llamaindex.ai/api/parsing/job/91f803e4-c119-4fa2-9769-6accecbb9a03 "HTTP/1.1 200 OK"
2025-09-18 18:58:34,506 - INFO - HTTP Request: GET https://api.cloud.eu.llamaindex.ai/api/parsing/job/444ca46f-544e-49b4-8b01-7f23e85f6e7b "HTTP/1.1 200 OK"
2025-09-18 18:58:34,764 - INFO - HTTP Request: GET https://api.cloud.eu.llamaindex.ai/api/parsing/job/2c3820d1-0af8-4c57-ad95-566d77e20498 "HTTP/1.1 200 OK"
2025-09-18 18:58:35,036 - INFO - HTTP Request: GET https:/

Now we are going to create embeddings for our documents. As we already know, when we build `VectorStoreIndex`, it automatically splits text into chunks before embedding, but this uses default settings.

However, we can use `SentenceSplitter` to gain explicit control over how that chunking happens:
- `chunk_size`: sets the maximum length of each chunk (keeps chunks small enough to fit into the embedding model and LLM context window)
- `chunk_overlap`: defines how much content is repeated between consecutive chunks
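To make the interaction of these two parameters concrete, here is a deliberately simplified sliding-window sketch. It is not how `SentenceSplitter` is implemented: the real splitter counts tokens with a tokenizer and tries to respect sentence boundaries, whereas the hypothetical `chunk_tokens` below treats whitespace-separated words as tokens and cuts at fixed positions.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    # Each new chunk starts (chunk_size - chunk_overlap) tokens after the previous one
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the input
        if start + chunk_size >= len(tokens):
            break
    return chunks

words = "one two three four five six seven eight nine ten".split()
chunks = chunk_tokens(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

With `chunk_size=4` and `chunk_overlap=1`, the last word of each chunk reappears as the first word of the next one, so text that straddles a chunk boundary is not cut off from its immediate context.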


In [None]:
from llama_index.core.node_parser import SentenceSplitter

# Split into nodes (chunks)
splitter = SentenceSplitter(
    chunk_size = 512,        # each chunk will be at most about 512 tokens long
    chunk_overlap = 50)      # roughly the last 50 tokens of one chunk will also appear at the start of the next

nodes = splitter.get_nodes_from_documents(documents)

The next step is to build a vector index:

In [None]:
# Creating embeddings from "nodes" (already split, so we pass them to the constructor directly)
index = VectorStoreIndex(nodes)

# Wrapping the index in a query engine
query_engine = index.as_query_engine()

In the code cell below, the question is converted into a vector embedding, which is compared against all stored embeddings (nodes) in the vector index. The nodes whose embeddings are most similar (highest cosine score) are selected as "relevant", combined with the query, and passed to an LLM to generate the answer:

In [None]:
# Running the query
print(query_engine.query("How many days can be carried over into the next calendar year?"))

2025-09-18 19:22:46,634 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-09-18 19:22:47,888 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-09-18 19:22:49,890 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


Up to 10 unused vacation days may be carried over into the next calendar year.


In [None]:
# Running the query
print(query_engine.query("What are brazil top five pork export markets?"))

2025-09-18 19:23:10,315 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-09-18 19:23:11,031 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


China, Philippines, Chile, Japan, Hong Kong


In [None]:
# Running the query
print(query_engine.query("What are the citizens' rights?"))

2025-09-18 19:32:37,467 - INFO - HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
2025-09-18 19:32:42,601 - INFO - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"


The citizens' rights include the right to vote and stand as a candidate at elections to the European Parliament and municipal elections, the right to good administration, the right of access to documents, the right to refer cases of maladministration to the European Ombudsman, the right to petition the European Parliament, and the freedom of movement and residence within the territory of the Member States.


## 4.3 Using a different LLM

Up to now, we’ve built a vector index using the default embedding model and the default LLM. But both of these can be customized. By default, LlamaIndex uses OpenAI’s `text-embedding-ada-002` for embeddings and `gpt-3.5-turbo` for the LLM.

In the example below, we’ll rebuild our index with a different embedding model - `text-embedding-3-small`, and then use a different LLM - `gpt-4o-mini` to generate answers:

In [None]:
from llama_index.embeddings.openai import OpenAIEmbedding

# Building a new index + new embedding model
pdf_index = VectorStoreIndex.from_documents(
    pdf_doc,
    embed_model = OpenAIEmbedding(model = "text-embedding-3-small")
)

In [None]:
from llama_index.llms.openai import OpenAI

# Querying the newly built index with a different LLM
query_engine = pdf_index.as_query_engine(llm = OpenAI(model="gpt-4o-mini"))
response = query_engine.query("What is the forecasted percentage change of global export of pork between 2024 and 2025?")
print(response)