# Build Your First RAG System

This walkthrough builds a RAG pipeline in five stages:

1. Data Ingestion
2. Indexing
3. Retrieval
4. Response Synthesis
5. Querying

## Install Required Packages

Install the required packages by running the command below in Anaconda Prompt (on Windows) or a terminal (on Linux or macOS):

pip install llama-index python-dotenv

## Environment Variables

It is recommended to store API keys in a `.env` file, separate from the code.
Follow these steps:
1. Create a text file named `.env`.
2. Enter your API key in this format: OPENAI_API_KEY='sk-e8943u9ru4982............'
3. Save and close the file.

Then, as shown below, you can pass the path of the `.env` file to the `load_dotenv` function, or let `find_dotenv` locate the file for you.
This loads any API keys stored in the `.env` file into the environment.

## Start

In [1]:
import os

In [2]:
from dotenv import load_dotenv, find_dotenv

In [3]:
#load_dotenv('/home/santhosh/Projects/courses/Pinnacle/.env')
load_dotenv(find_dotenv(), override=True)

True

In [5]:
# Retrieve the OpenAI API key from environment variables
OPENAI_API_KEY = os.environ['OPENAI_API_KEY']
# Confirm the key is set without echoing its value into the notebook output
'OPENAI_API_KEY' in os.environ

True

This setup keeps the API key out of the source code and makes it easy to configure. Keep your `.env` file private, never print the key in a notebook you intend to share, and exclude the file from version control.
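
To make the mechanism concrete, here is a rough sketch of what loading a `.env` file amounts to: reading `KEY=VALUE` lines and copying them into `os.environ`. It is a simplified stand-in, not how `python-dotenv` is actually implemented. The function name `load_env_file` and the variable `DEMO_API_KEY` are made up for illustration.

```python
import os
import tempfile


def load_env_file(path, override=False):
    """Simplified stand-in for load_dotenv: copy KEY=VALUE lines into os.environ."""
    loaded = {}
    with open(path, encoding="utf-8") as handle:
        for raw_line in handle:
            line = raw_line.strip()
            # Skip blank lines, comments, and lines without an '=' sign
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            key = key.strip()
            # Drop optional quotes around the value
            value = value.strip().strip("'\"")
            # Existing variables win unless override=True
            if override or key not in os.environ:
                os.environ[key] = value
                loaded[key] = value
    return loaded


# Demo with a throwaway file and a made-up variable (not a real key)
with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as tmp:
    tmp.write("# demo file\nDEMO_API_KEY='not-a-real-key'\n")
    demo_path = tmp.name

loaded = load_env_file(demo_path, override=True)
os.remove(demo_path)

# Print only the variable names, never the values of real keys
print(sorted(loaded))
```

The `override` flag in this sketch plays the same role as the `override=True` argument passed to `load_dotenv` above: it decides whether values from the file replace variables that are already set.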


# Stage 1: Data Ingestion

## Data Loaders


We start by loading the data from a PDF file. For this, we will use the SimpleDirectoryReader class from LlamaIndex.

In [8]:
from llama_index.core import SimpleDirectoryReader

In [9]:
documents = SimpleDirectoryReader(input_files=['files/us_constitution.pdf']).load_data()

We can then check the type of the `documents` variable and the total number of pages read from the PDF:

In [10]:
# Check the datatype of the loaded documents
type(documents)

list

In [11]:
# total number of pages read from the PDF
len(documents)

41

To understand the structure of the loaded documents, let's retrieve the first document, which corresponds to the first page of the PDF:


In [12]:
# Retrieve the first document (essentially the first page in the PDF)
documents[0]

Document(id_='cb8d2111-ccd7-42e4-adb2-0d62bbd434be', embedding=None, metadata={'page_label': '1', 'file_name': 'us_constitution.pdf', 'file_path': 'files\\us_constitution.pdf', 'file_type': 'application/pdf', 'file_size': 170876, 'creation_date': '2025-06-07', 'last_modified_date': '2025-06-07'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text_resource=MediaResource(embeddings=None, data=None, text='The United States Constitution \n W e the People of the United States, in Order to form a more perfect \n Union, establish Justice, insure domestic T ranquility , provide for the \n common defence, promote the general W elfare, and secure the \n Blessings of Liberty to ourselves and our Posterity , d

We can also access specific attributes of the document, such as its ID and metadata:

In [13]:
# Get the ID of the first document
documents[0].id_

'cb8d2111-ccd7-42e4-adb2-0d62bbd434be'

In [14]:
documents[0].doc_id

'cb8d2111-ccd7-42e4-adb2-0d62bbd434be'

In [15]:
# Get the metadata of the first document
documents[0].metadata

{'page_label': '1',
 'file_name': 'us_constitution.pdf',
 'file_path': 'files\\us_constitution.pdf',
 'file_type': 'application/pdf',
 'file_size': 170876,
 'creation_date': '2025-06-07',
 'last_modified_date': '2025-06-07'}

In [16]:
# Get the text content of the first document
print(documents[0].text)

The United States Constitution 
 W e the People of the United States, in Order to form a more perfect 
 Union, establish Justice, insure domestic T ranquility , provide for the 
 common defence, promote the general W elfare, and secure the 
 Blessings of Liberty to ourselves and our Posterity , do ordain and 
 establish this Constitution for the United States of America. 
The Constitutional Con v ention 
 Article I 
 Section 1: Congress 
 All legislative Powers herein granted shall be vested in a Congress of 
 the United States, which shall consist of a Senate and House of 
 Representatives. 
Section 2: The House of Representatives 
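
Since each loaded document exposes `text` and `metadata`, a small helper can tell us which pages mention a given word. The sketch below is illustrative only: `pages_mentioning` is a hypothetical helper, and the sample objects are stand-ins with made-up text rather than the actual pages of the PDF.

```python
from types import SimpleNamespace


def pages_mentioning(documents, keyword):
    """Return the page labels of documents whose text contains `keyword`."""
    needle = keyword.lower()
    return [
        doc.metadata.get("page_label")
        for doc in documents
        if needle in doc.text.lower()
    ]


# Stand-in objects exposing the same two attributes used above
sample_documents = [
    SimpleNamespace(text="Sample page about legislative powers", metadata={"page_label": "1"}),
    SimpleNamespace(text="Sample page about something else", metadata={"page_label": "2"}),
]

print(pages_mentioning(sample_documents, "legislative"))
```

With the real data you would call `pages_mentioning(documents, "Senate")`. Note that the extracted text above contains stray spaces (for example "T ranquility"), so an exact substring match can miss words that the PDF extractor split apart.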


## Embedding Model

Next, we need to prepare our document for embedding and interaction with a large language model. We will use the OpenAI API for this purpose.

In [17]:
# Embedding Model
from llama_index.embeddings.openai import OpenAIEmbedding

In [18]:
# Initialize the embedding model
embed_model = OpenAIEmbedding(model="text-embedding-3-large")
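
An embedding model maps text to a vector of numbers so that texts with similar meaning end up close together. A common way to measure that closeness is cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; real embeddings are produced by the model and have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Define similarity with a zero vector as 0.0 to avoid dividing by zero
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for a query and two text chunks
query_vec = [1.0, 0.0, 1.0]
close_vec = [0.9, 0.1, 0.8]
far_vec = [0.0, 1.0, 0.0]

# The chunk pointing in nearly the same direction scores higher
print(cosine_similarity(query_vec, close_vec) > cosine_similarity(query_vec, far_vec))  # True
```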

## LLM

Similarly, let's set up our large language model (LLM):

In [19]:
# LLM
from llama_index.llms.openai import OpenAI

In [20]:
# Initialize the large language model
llm = OpenAI(model="gpt-4o-mini")

# Stage 2: Indexing

In [21]:
# Indexing
from llama_index.core import VectorStoreIndex

Here, we use the `VectorStoreIndex` class to create an index from the loaded documents. We pass the documents and the embedding model to the `from_documents` method, which splits the documents into chunks (nodes) and embeds each one. The LLM is not needed at this stage; it comes into play when we synthesize responses.

In [22]:
# Create an index from the documents using the embedding model
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)
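
To give a feel for the chunking that happens during indexing, here is a toy word-based splitter with overlap. It is only a sketch of the idea: LlamaIndex uses its own splitter that works on sentences and tokens, and the function name `chunk_words` and its parameters are made up for this example.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into word-based chunks; consecutive chunks share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current chunk has reached the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks


# Twelve placeholder words: w0 ... w11
sample = " ".join(f"w{i}" for i in range(12))
for chunk in chunk_words(sample, chunk_size=5, overlap=2):
    print(chunk)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some words twice.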

# Stage 3: Retrieval

Next, we set up a retriever to query our indexed documents. This allows us to retrieve relevant information based on our queries.

In [23]:
# Setting up the Index as Retriever
retriever = index.as_retriever()

The `as_retriever` method converts our index into a retriever, and the `retrieve` method allows us to query the index.

In [24]:
# Retrieve the nodes most relevant to the query "What is US consulate?"
retrieved_nodes = retriever.retrieve("What is US consulate?")
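
Conceptually, the retriever embeds the query, scores every stored chunk against it, and returns the best matches. The following is a minimal sketch of that idea under loud assumptions: the `embed` function is a toy bag-of-words counter rather than a learned embedding model, the chunks are made-up sample sentences, and `top_k=2` simply mirrors the two nodes inspected in this walkthrough.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts (a real embedding model is learned, not counted)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, top_k=2):
    """Return the top_k (score, chunk) pairs, highest score first."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Made-up sample chunks, not the actual indexed text
chunks = [
    "the president shall appoint ambassadors and consuls",
    "all legislative powers shall be vested in a congress",
    "the senate shall choose their other officers",
]

for score, chunk in retrieve("who appoints consuls", chunks):
    print(f"{score:.2f}  {chunk}")
```

A retriever always returns its top matches, even when none of them is a good answer to the question, so it is worth inspecting the retrieved text as we do next.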

We can check the metadata of the retrieved nodes to understand the source of the information. The metadata provides details such as the page label, file name, file path, and file type:

In [25]:
# Get the metadata of the first retrieved node
retrieved_nodes[0].metadata

{'page_label': '1',
 'file_name': 'us_constitution.pdf',
 'file_path': 'files\\us_constitution.pdf',
 'file_type': 'application/pdf',
 'file_size': 170876,
 'creation_date': '2025-06-07',
 'last_modified_date': '2025-06-07'}

Let's access the ID of the first retrieved node, which uniquely identifies it:

In [26]:
# Access the ID of the first retrieved node
retrieved_nodes[0].id_

'7c7cfe83-7aa6-4679-a020-fcfb176cb0dc'

Similarly, we can access the node_id attribute, which typically holds the same value:

In [27]:
# Access the node_id of the first retrieved node
retrieved_nodes[0].node_id

'7c7cfe83-7aa6-4679-a020-fcfb176cb0dc'

Next, let's explore the `node` attribute of the retrieved node. This attribute contains a `TextNode` object, which holds the node's metadata and text content, along with a reference to the source document it was created from:

In [28]:
# Access the full node object of the first retrieved node
retrieved_nodes[0].node

TextNode(id_='7c7cfe83-7aa6-4679-a020-fcfb176cb0dc', embedding=None, metadata={'page_label': '1', 'file_name': 'us_constitution.pdf', 'file_path': 'files\\us_constitution.pdf', 'file_type': 'application/pdf', 'file_size': 170876, 'creation_date': '2025-06-07', 'last_modified_date': '2025-06-07'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='cb8d2111-ccd7-42e4-adb2-0d62bbd434be', node_type='4', metadata={'page_label': '1', 'file_name': 'us_constitution.pdf', 'file_path': 'files\\us_constitution.pdf', 'file_type': 'application/pdf', 'file_size': 170876, 'creation_date': '2025-06-07', 'last_modified_date': '2025-06-07'}, hash='957cfd697d9ed8ed93e148b4984bd8058ff7be341dfd2b196fe1d72017d2cc69')}, metadata_tem

We can also extract and inspect the text content of this node to understand the retrieved information better:

In [29]:
# Access the text content of the first retrieved node
print(retrieved_nodes[0].text)

The United States Constitution 
 W e the People of the United States, in Order to form a more perfect 
 Union, establish Justice, insure domestic T ranquility , provide for the 
 common defence, promote the general W elfare, and secure the 
 Blessings of Liberty to ourselves and our Posterity , do ordain and 
 establish this Constitution for the United States of America. 
The Constitutional Con v ention 
 Article I 
 Section 1: Congress 
 All legislative Powers herein granted shall be vested in a Congress of 
 the United States, which shall consist of a Senate and House of 
 Representatives. 
Section 2: The House of Representatives


In [30]:
retrieved_nodes[1].metadata

{'page_label': '16',
 'file_name': 'us_constitution.pdf',
 'file_path': 'files\\us_constitution.pdf',
 'file_type': 'application/pdf',
 'file_size': 170876,
 'creation_date': '2025-06-07',
 'last_modified_date': '2025-06-07'}

In [31]:
print(retrieved_nodes[1].text)

He shall have Power , by and with the Advice and Consent of the 
 Senate, to make T reaties, provided two thirds of the Senators present 
 concur; and he shall nominate, and by and with the Advice and 
 Consent of the Senate, shall appoint Ambassadors, other public 
 Ministers and Consuls, Judges of the supreme Court, and all other 
 Of ficers of the United States, whose Appointments are not herein 
 otherwise provided for , and which shall be established by Law: but the 
 Congress may by Law vest the Appointment of such inferior Of ficers, 
 as they think proper , in the President alone, in the Courts of Law , or in 
 the Heads of Departments. 
 The President shall have Power to fill up all V acancies that may happen 
 during the Recess of the Senate, by granting Commissions which shall 
 expire at the End of their next Session. 
Section 3 
 He shall from time to time give to the Congress Information of the State 
 of the Union, and recommend to their Consideration such Measures as 
 

# Stage 4: Response Synthesis


We need to synthesize responses from our large language model (LLM). For this, we use the `get_response_synthesizer` function:

In [32]:
from llama_index.core import get_response_synthesizer

Here, the `get_response_synthesizer` function takes our LLM as an argument and returns a synthesizer object that will help generate coherent responses to our queries.

In [33]:
# Initialize the response synthesizer with the LLM
response_synthesizer = get_response_synthesizer(llm=llm)
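
At its core, response synthesis means combining the retrieved chunks and the question into a single prompt for the LLM. The sketch below shows that idea only. The function `build_prompt` and the template wording are assumptions made for illustration and are not LlamaIndex's actual prompt templates.

```python
def build_prompt(query, context_chunks):
    """Combine retrieved chunks and the user's question into one prompt string."""
    context = "\n---\n".join(context_chunks)
    return (
        "Context information is below.\n"
        f"{context}\n"
        "Given the context information and not prior knowledge, answer the query.\n"
        f"Query: {query}\n"
        "Answer:"
    )


# Placeholder chunks standing in for retrieved node text
prompt = build_prompt(
    "Who appoints ambassadors?",
    ["chunk one text", "chunk two text"],
)
print(prompt)
```

The resulting string is what would be sent to the LLM. When the retrieved chunks do not fit into a single prompt, a synthesizer has to split the work across several LLM calls, which is one reason to use a dedicated component for this step.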

# Stage 5: Query Engine

Next, we set up a query engine. This engine will allow us to query our indexed documents and receive synthesized responses from the LLM:

In [34]:
# Create a query engine using the index, LLM, and response synthesizer
query_engine = index.as_query_engine(llm=llm, response_synthesizer=response_synthesizer)

We use the `as_query_engine` method from our index object to create a query engine, passing the LLM and response synthesizer as arguments.

With our query engine ready, we can now query the LLM using natural language:


In [35]:
# Query the LLM using the query engine
response = query_engine.query("What is us consulate?")  

In this command, we send the question "What is us consulate?" to the query engine and store the result in the `response` variable.

To view the response generated by the LLM, we can access the `response` attribute:


In [36]:
# View the response from the LLM
response.response 

'A U.S. consulate is a diplomatic mission that represents the United States in a foreign country. It is responsible for assisting U.S. citizens abroad, promoting American interests, and facilitating trade and communication between the U.S. and the host country. Consulates handle various functions, including issuing visas, providing support to American citizens, and fostering economic and cultural relations.'

This returns the synthesized answer to our query.

We can further analyze the response by checking its length and inspecting the source nodes used to generate it. The following commands return the number of characters in the response and the number of source nodes, respectively.

In [37]:
# Check the length of the response
len(response.response) # number of characters in the response

409

In [38]:
# Check the number of source nodes
len(response.source_nodes)  # list of 2 nodes

2

In [37]:
# Access the ID of the first source node
response.source_nodes[0].id_

'3c11e22d-2950-4697-aee3-a6dc137fed3d'

In [38]:
# Access the metadata of the first source node
response.source_nodes[0].metadata

{'page_label': '4',
 'file_name': 'transformers.pdf',
 'file_path': 'data/transformers.pdf',
 'file_type': 'application/pdf',
 'file_size': 2215244,
 'creation_date': '2024-06-11',
 'last_modified_date': '2024-03-27'}

In [39]:
response.source_nodes[1].id_

'c32c5c7e-fd92-4147-8e3d-79817f93a273'

In [40]:
response.source_nodes[1].metadata

{'page_label': '6',
 'file_name': 'transformers.pdf',
 'file_path': 'data/transformers.pdf',
 'file_type': 'application/pdf',
 'file_size': 2215244,
 'creation_date': '2024-06-11',
 'last_modified_date': '2024-03-27'}
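
Rather than inspecting source nodes one at a time, we can summarize them in a loop. The helper below is a hypothetical convenience function, not part of LlamaIndex. It assumes each source node exposes `metadata` (as shown above) and a numeric `score` attribute, which is not shown in this walkthrough. The demo uses stand-in objects with made-up scores.

```python
from types import SimpleNamespace


def summarize_sources(source_nodes):
    """Return one (file_name, page_label, score) tuple per source node."""
    return [
        (
            node.metadata.get("file_name"),
            node.metadata.get("page_label"),
            node.score,
        )
        for node in source_nodes
    ]


# Stand-ins for source nodes; the scores are made up
fake_nodes = [
    SimpleNamespace(metadata={"file_name": "example.pdf", "page_label": "4"}, score=0.5),
    SimpleNamespace(metadata={"file_name": "example.pdf", "page_label": "6"}, score=0.25),
]

for file_name, page, score in summarize_sources(fake_nodes):
    print(f"{file_name} p.{page} score={score:.3f}")
```

With a real response, the call would be `summarize_sources(response.source_nodes)`. Checking the file names this way is a quick sanity test that the answer was grounded in the documents you intended to index.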

# End to End RAG Pipeline

In this final section, we will integrate everything we have learned to create a complete end-to-end Retrieval-Augmented Generation (RAG) pipeline. This pipeline will read documents, index them, and allow us to query the indexed data using a large language model (LLM).

Let's walk through the entire process step by step:

- First, we import the necessary classes and load our documents. The `SimpleDirectoryReader` class from LlamaIndex reads every document in the 'data' directory and stores the result in the `documents` variable.

- Next, we need a large language model (LLM) and an embedding model. For this demonstration, we reuse the `llm` and `embed_model` objects initialized in the earlier stages.

- With our documents and models ready, we create an index with the `VectorStoreIndex` class, passing the loaded documents and the embedding model. The index enables efficient retrieval of information from our documents.

- We then set up a query engine, created from our index and LLM, that lets us query the indexed documents in natural language.

- Finally, we use the query engine to ask a question. In this example, we ask about the different types of Transformer models.

- The `query` method retrieves the relevant chunks from the index and passes them, together with the question, to the LLM, which synthesizes a response. The response is then printed to the console.

In [41]:
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load data from the specified directory
documents = SimpleDirectoryReader("data").load_data()

# `llm` and `embed_model` were initialized in the earlier stages and are reused here

# Create an index from the documents using the embedding model
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)

# Create a query engine from the index and LLM
query_engine = index.as_query_engine(llm=llm)

# Query the engine and print the response
print(query_engine.query("What are the different types of Transformer Models?").response)

The Transformer models mentioned include the base model and the big model. The base model achieves a BLEU score of 27.3 for English-to-German translation and 38.1 for English-to-French translation, while the big model achieves higher scores of 28.4 and 41.0, respectively. These models differ in their configurations and training costs, with the big model outperforming the base model and other previously published models.


In [48]:
print(query_engine.query("Why do we need positional encodings in transformer?").response)

Positional encodings are needed in transformers because the model itself doesn't have any inherent sense of position or order of the sequence elements. Unlike recurrent neural networks, transformers process input data in parallel rather than sequentially, which makes them more efficient but also means they don't inherently understand the order of the data. Positional encodings are used to give the model some information about the relative positions of the elements in the sequence.


In [49]:
print(query_engine.query("What are Encoder and Decoder blocks in transformer?").response)

The encoder and decoder are key components of the Transformer model architecture. The encoder is made up of a stack of six identical layers, each with two sub-layers. The first sub-layer is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. Residual connections are employed around each of the two sub-layers, followed by layer normalization. 

The decoder, like the encoder, is composed of a stack of six identical layers. However, in addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Residual connections are also used around each of the sub-layers in the decoder, followed by layer normalization. The self-attention sub-layer in the decoder stack is modified to prevent positions from attending to subsequent positions, ensuring that the predictions for a given position can depend only on the known outputs at p

In [50]:
query = "If I want to generate document embeddings, then which type of Transformer Architecture I must choose?"
print(query_engine.query(query).response)

The Transformer architecture you should choose for generating document embeddings is the Encoder part of the Transformer model. The encoder maps an input sequence of symbol representations to a sequence of continuous representations, which can be used as document embeddings. It is composed of a stack of identical layers, each with two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network.


In [51]:
query = """If I want to generate document embeddings, 
then which type of Transformer Architecture I must choose among Encoders, Decoders or Encoder-Decorder?"""

print(query_engine.query(query).response)

To generate document embeddings, you should choose the Encoder part of the Transformer Architecture. The Encoder maps an input sequence of symbol representations to a sequence of continuous representations, which can be used as document embeddings.


By following these steps, we have created a fully functional end-to-end RAG pipeline. This pipeline can ingest documents, index them, and answer natural language queries using a powerful combination of LlamaIndex and OpenAI's models. This demonstrates the practical application of RAG systems in extracting and synthesizing information from large datasets.
