# Build Your First RAG System

This notebook builds a Retrieval-Augmented Generation (RAG) system in five stages:

1. Data Ingestion
2. Indexing
3. Retrieval
4. Response Synthesis
5. Querying

## Install Required Packages

Install the required package by running the following command in Anaconda Prompt (Windows) or a terminal (Linux or macOS):

pip install llama-index

## Environment Variables

It is recommended to store API keys in a '.env' file, separate from the code.
Please follow the steps below.
1. Create a text file named '.env'.
2. Enter your API key in this format: OPENAI_API_KEY='sk-e82............'
3. Save and close the file.

Then, as shown below, you can pass the path of the '.env' file to the 'load_dotenv' function.
This will load any API keys stored in the '.env' file.
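
To make this step concrete, here is a minimal, standard-library-only sketch of what loading a '.env' file amounts to. It is a simplified stand-in and not python-dotenv's implementation: the function name `load_env_file` is hypothetical, and it only handles plain `KEY=value` lines with optional quotes.

```python
import os


def load_env_file(path, override=False):
    """Read KEY=value lines from `path` and copy them into os.environ.

    Hypothetical helper for illustration only. Blank lines, comments, and
    lines without '=' are skipped. Existing environment variables are kept
    unless override=True.
    """
    loaded = {}
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            key = key.strip()
            # Drop surrounding single or double quotes around the value
            value = value.strip().strip("'\"")
            if override or key not in os.environ:
                os.environ[key] = value
            loaded[key] = value
    return loaded
```

In practice, `load_dotenv` from python-dotenv covers more cases than this sketch, so it is the better choice outside a demonstration.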

## Start

In [None]:
import os

In [None]:
# Outside Colab: load the variables from the .env file.
# This needs the python-dotenv package (pip install python-dotenv), which provides the `dotenv` module.
#from dotenv import load_dotenv, find_dotenv
#load_dotenv('D:/.env')

In [None]:
# In Colab: read the OpenAI API key from the secret store (saved under the name 'ChatGPT')
# and expose it as an environment variable
from google.colab import userdata
os.environ['OPENAI_API_KEY'] = userdata.get('ChatGPT')
OPENAI_API_KEY = os.environ['OPENAI_API_KEY']

Reading the key from a secret store or a `.env` file, rather than hard-coding it, keeps it out of the notebook and makes it easy to change. Keep your `.env` file secure and never commit it to version control.


# Stage 1: Data Ingestion

## Data Loaders


We start by loading the data from a PDF file. For this, we will use the SimpleDirectoryReader class from LlamaIndex.

In [None]:
!pip install llama-index-core
from llama_index.core import SimpleDirectoryReader



In [None]:
!pip install llama-index-readers-file




In [None]:
documents = SimpleDirectoryReader(input_files=['transformers.pdf']).load_data()

We can then check the type of the `documents` variable and the total number of pages read from the PDF:

In [None]:
# Check the datatype of the loaded documents
type(documents)

list

In [None]:
# total number of pages read from the PDF
len(documents)

15

In [None]:
print(documents[1].text)

1 Introduction
Recurrent neural networks, long short-term memory [ 13] and gated recurrent [ 7] neural networks
in particular, have been firmly established as state of the art approaches in sequence modeling and
transduction problems such as language modeling and machine translation [ 35,2,5]. Numerous
efforts have since continued to push the boundaries of recurrent language models and encoder-decoder
architectures [38, 24, 15].
Recurrent models typically factor computation along the symbol positions of the input and output
sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden
states ht, as a function of the previous hidden state ht−1and the input for position t. This inherently
sequential nature precludes parallelization within training examples, which becomes critical at longer
sequence lengths, as memory constraints limit batching across examples. Recent work has achieved
significant improvements in computational efficiency through factor

**To understand the structure of the loaded documents, let's retrieve the first document, which corresponds to the first page of the PDF:**


In [None]:
# Retrieve the first document (essentially the first page in the PDF)
documents[0]

Document(id_='8e7b7bce-3239-4e5c-a77f-e9ca441d257d', embedding=None, metadata={'page_label': '1', 'file_name': 'transformers.pdf', 'file_path': 'transformers.pdf', 'file_type': 'application/pdf', 'file_size': 2215244, 'creation_date': '2024-10-09', 'last_modified_date': '2024-10-09'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.comNoam Shazeer∗\nGoogle Brain\nnoam@google.comNiki Parmar∗\nGoogle Research\nnikip@google.comJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nlli

We can also access specific attributes of the document, such as its ID and metadata:

In [None]:
# Get the ID of the first document
documents[0].id_

'ab068b18-42c2-4ac6-bcec-d81108d11c95'

In [None]:
documents[0].doc_id

'ab068b18-42c2-4ac6-bcec-d81108d11c95'

In [None]:
# Get the metadata of the first document
documents[0].metadata

{'page_label': '1',
 'file_name': 'transformers.pdf',
 'file_path': 'data/transformers.pdf',
 'file_type': 'application/pdf',
 'file_size': 2215244,
 'creation_date': '2024-05-28',
 'last_modified_date': '2024-05-28'}

In [None]:
# Get the text content of the first document
print(documents[0].text)

Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.comNoam Shazeer∗
Google Brain
noam@google.comNiki Parmar∗
Google Research
nikip@google.comJakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.comAidan N. Gomez∗ †
University of Toronto
aidan@cs.toronto.eduŁukasz Kaiser∗
Google Brain
lukaszkaiser@google.com
Illia Polosukhin∗ ‡
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Experime
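
Since every page becomes its own document with `text` and `metadata` attributes, a quick per-file summary is a useful check after ingestion. The sketch below is illustrative only: `summarize_documents` is a hypothetical helper, and `SimpleNamespace` objects stand in for the loaded documents so the example runs without LlamaIndex installed.

```python
from collections import defaultdict
from types import SimpleNamespace


def summarize_documents(documents):
    """Count pages and characters per source file.

    Assumes each item exposes `.text` and `.metadata['file_name']`,
    as the loaded documents above do.
    """
    stats = defaultdict(lambda: {"pages": 0, "characters": 0})
    for doc in documents:
        name = doc.metadata.get("file_name", "unknown")
        stats[name]["pages"] += 1
        stats[name]["characters"] += len(doc.text)
    return dict(stats)


# Stand-in documents for demonstration (not real PDF content)
demo_docs = [
    SimpleNamespace(text="abc", metadata={"file_name": "transformers.pdf"}),
    SimpleNamespace(text="de", metadata={"file_name": "transformers.pdf"}),
]
print(summarize_documents(demo_docs))
```

With the real data, the call would be `summarize_documents(documents)`.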

## Embedding Model

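Next, we need an embedding model to convert the document text into vectors. This notebook uses the open-source `BAAI/bge-small-en-v1.5` model from Hugging Face, which runs locally. An OpenAI embedding model is included as a commented-out alternative.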

In [None]:
!pip install llama-index-embeddings-huggingface

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings

# Load a small open-source embedding model from Hugging Face
embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)

In [None]:
# OpenAI alternative (left commented out; this notebook uses the Hugging Face model above)
#from llama_index.embeddings.openai import OpenAIEmbedding
#embed_model = OpenAIEmbedding(model="text-embedding-3-large") #'text-embedding-3-small')
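
To build intuition for what an embedding model does, the sketch below maps text to a fixed-length unit vector and compares two vectors with a dot product. It is a deliberately crude bag-of-words hashing scheme, not how `bge-small-en-v1.5` or any neural embedding model works, and the names `toy_embed` and `similarity` are hypothetical.

```python
import hashlib
import math
import re


def toy_embed(text, dim=64):
    """Map text to a unit-length vector by hashing each word into a bucket."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    # Return the zero vector unchanged for empty input
    return [x / norm for x in vec] if norm else vec


def similarity(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


v1 = toy_embed("Attention is all you need")
v2 = toy_embed("need you all: is ATTENTION")  # same words, different order
print(round(similarity(v1, v2), 6))
```

Because this toy scheme ignores word order, the two sentences receive the same vector. A real embedding model captures meaning and context, which is why the notebook uses one.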

## LLM

Similarly, let's set up our large language model (LLM):

In [None]:
# Install the Gemini integration for LlamaIndex and the Google client library
!pip install -q llama-index-llms-gemini google-generativeai

In [None]:
# Read the Gemini API key from Colab's secret store and expose it as an environment variable.
# Never print the key: notebook outputs are saved and shared along with the code.
from google.colab import userdata
os.environ['GEMINI_API_KEY'] = userdata.get('gemini_api_key')
GEMINI_API_KEY = os.environ['GEMINI_API_KEY']

In [None]:
# Initialize the large language model
from llama_index.llms.gemini import Gemini
llm = Gemini(api_key=GEMINI_API_KEY, model="models/gemini-pro")

# OpenAI alternative (left commented out):
#from llama_index.llms.openai import OpenAI
#llm = OpenAI(model="gpt-4o")  # or 'gpt-3.5-turbo'

# Stage 2: Indexing

In [None]:
# Indexing
from llama_index.core import VectorStoreIndex

Here, we use the `VectorStoreIndex` class to create an index from the loaded documents. We pass the documents and the embedding model to the `from_documents` method. The LLM is not needed at this stage, because building the index only requires embeddings.

In [None]:
# Create an index from the documents using the embedding model
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model) #, llm=llm)
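
Before embedding, an index typically splits each document into smaller chunks (nodes). The sketch below shows the idea with overlapping word windows. It is a simplification under stated assumptions: LlamaIndex uses its own splitter, and the function name `chunk_words` and its parameters are hypothetical.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into chunks of `chunk_size` words that share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(12))
for chunk in chunk_words(sample, chunk_size=5, overlap=2):
    print(chunk)
```

The overlap keeps a sentence that straddles a chunk boundary at least partly visible in both neighbouring chunks.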

# Stage 3: Retrieval

Next, we set up a retriever over our indexed documents. Given a query, it returns the chunks (nodes) most relevant to it.

In [None]:
# Setting up the Index as Retriever
retriever = index.as_retriever()

The `as_retriever` method converts our index into a retriever, and the `retrieve` method allows us to query the index.

In [None]:
# Retrieve the nodes most relevant to the query
retrieved_nodes = retriever.retrieve("How to design the large language model")
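
Conceptually, a vector retriever scores every indexed chunk against the query vector and keeps the best few. The following is a minimal sketch of that ranking step using cosine similarity over plain Python lists. It is not LlamaIndex's implementation, and `cosine` and `top_k` are hypothetical names.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=2):
    """Return the k (chunk_id, score) pairs with the highest similarity.

    `indexed` is a list of (chunk_id, vector) pairs.
    """
    scored = ((chunk_id, cosine(query_vec, vec)) for chunk_id, vec in indexed)
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Tiny hand-made vectors for demonstration
indexed = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
print([chunk_id for chunk_id, _ in top_k([1.0, 0.1], indexed, k=2)])
```

A production vector store usually replaces this linear scan with an approximate nearest-neighbour index, but the ranking idea is the same.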

We can check the metadata of the retrieved nodes to see where the information came from. The metadata includes the page label, file name, file path, file type, and other details.

In [None]:
# Get the metadata of the second retrieved node (index 1)
retrieved_nodes[1].metadata

{'page_label': '12',
 'file_name': 'transformers.pdf',
 'file_path': 'transformers.pdf',
 'file_type': 'application/pdf',
 'file_size': 2215244,
 'creation_date': '2024-10-09',
 'last_modified_date': '2024-10-09'}

Let's access the ID of the first retrieved node, which uniquely identifies it:

In [None]:
# Access the ID of the first retrieved node
retrieved_nodes[0].id_

'f492b420-1814-4715-80fb-83d9feaa164a'

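Similarly, we can access the `node_id` attribute, which holds the same value as `id_`. IDs are regenerated whenever the index is rebuilt, so outputs captured in different runs, as appears to be the case for the two shown here, will not match.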

In [None]:
# Access the node_id of the first retrieved node
retrieved_nodes[0].node_id

'976a6833-1f49-407c-8e2e-b1f3517840fd'

Next, let's explore the `node` attribute of the retrieved node. This attribute contains a `TextNode` object, which holds everything stored for the retrieved chunk, including its metadata, relationships, and text content.

In [None]:
# Access the full node object of the first retrieved node
retrieved_nodes[0].node

TextNode(id_='f492b420-1814-4715-80fb-83d9feaa164a', embedding=None, metadata={'page_label': '1', 'file_name': 'transformers.pdf', 'file_path': 'transformers.pdf', 'file_type': 'application/pdf', 'file_size': 2215244, 'creation_date': '2024-10-09', 'last_modified_date': '2024-10-09'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='8e7b7bce-3239-4e5c-a77f-e9ca441d257d', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'page_label': '1', 'file_name': 'transformers.pdf', 'file_path': 'transformers.pdf', 'file_type': 'application/pdf', 'file_size': 2215244, 'creation_date': '2024-10-09', 'last_modified_date': '2024-10-09'}, hash='98077663a5fa91871705d4873c39accfacfe015d66836b3202aab137a814fc5c')}, text='Provide

We can also extract and inspect the text content of this node to understand the retrieved information better:

In [None]:
# Access the text content of the first retrieved node
print(retrieved_nodes[0].text)

Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.comNoam Shazeer∗
Google Brain
noam@google.comNiki Parmar∗
Google Research
nikip@google.comJakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.comAidan N. Gomez∗ †
University of Toronto
aidan@cs.toronto.eduŁukasz Kaiser∗
Google Brain
lukaszkaiser@google.com
Illia Polosukhin∗ ‡
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Experime

# Stage 4: Response Synthesis


We need to synthesize responses from our large language model (LLM). For this, we use the `get_response_synthesizer` function:

In [None]:
from llama_index.core import get_response_synthesizer

Here, the `get_response_synthesizer` function takes our LLM as an argument and returns a synthesizer object that will help generate coherent responses to our queries.

In [None]:
# Initialize the response synthesizer with the LLM
response_synthesizer = get_response_synthesizer(llm=llm)
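
One simple synthesis strategy is to place all retrieved chunks into a single prompt and ask the LLM to answer from that context only. The sketch below illustrates that idea. It is not the strategy or prompt that `get_response_synthesizer` uses internally: the template text, `build_prompt`, `synthesize`, and the `complete` callable are all assumptions made for this example.

```python
PROMPT_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context}\n"
    "---------------------\n"
    "Using only the context above, answer the question.\n"
    "Question: {question}\n"
    "Answer:"
)


def build_prompt(question, chunks):
    """Join the retrieved chunks and insert them into the prompt template."""
    context = "\n\n".join(chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


def synthesize(question, chunks, complete):
    """Answer `question` from `chunks` using `complete`, a prompt -> text callable."""
    if not chunks:
        return "No context was retrieved for this question."
    return complete(build_prompt(question, chunks)).strip()


# A fake LLM stands in for a real model so the sketch runs offline
print(synthesize("What is attention?", ["chunk one"], lambda prompt: " (model output) "))
```

Restricting the model to the supplied context is what lets a RAG system decline to answer when the documents do not cover the question.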

# Stage 5: Query Engine

Next, we set up a query engine. This engine will allow us to query our indexed documents and receive synthesized responses from the LLM:

In [None]:
# Create a query engine using the index, LLM, and response synthesizer
query_engine = index.as_query_engine(llm=llm, response_synthesizer=response_synthesizer)

We use the `as_query_engine` method from our index object to create a query engine, passing the LLM and response synthesizer as arguments.

With our query engine ready, we can now query the LLM using natural language:


In [None]:
# Query the LLM using the query engine
response = query_engine.query("Summarize the document in 50 words")

In this command, we ask the query engine to summarize the document in 50 words and store the result in the `response` variable.

To view the response generated by the LLM, we can access the `response` attribute:


In [None]:
# View the response from the LLM
response.response

'This document introduces a novel neural network architecture called the Transformer, which is based solely on attention mechanisms, dispensing with recurrence and convolutions. The Transformer is shown to be more effective than existing sequence transduction models on a wide range of tasks, including translation, language modeling, and abstractive summarization.'

This returns the synthesized answer to our query.

We can further analyze the response by checking its length in characters and the number of source nodes used to generate it:

In [None]:
# Check the length of the response
len(response.response) # number of characters in the response

364

In [None]:
# Check the number of source nodes
len(response.source_nodes)  # list of 2 nodes

2

In [None]:
# Access the ID of the first source node
response.source_nodes[0].id_

'976a6833-1f49-407c-8e2e-b1f3517840fd'

In [None]:
# Access the metadata of the first source node
response.source_nodes[0].metadata

{'page_label': '4',
 'file_name': 'transformers.pdf',
 'file_path': 'data/transformers.pdf',
 'file_type': 'application/pdf',
 'file_size': 2215244,
 'creation_date': '2024-05-28',
 'last_modified_date': '2024-05-28'}

In [None]:
response.source_nodes[1].id_

'0ad43a2d-29e2-41a5-a8c9-796d53a207f4'

In [None]:
response.source_nodes[1].metadata

{'page_label': '6',
 'file_name': 'transformers.pdf',
 'file_path': 'data/transformers.pdf',
 'file_type': 'application/pdf',
 'file_size': 2215244,
 'creation_date': '2024-05-28',
 'last_modified_date': '2024-05-28'}
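
The source-node metadata can be turned into short citations to show alongside an answer. The sketch below works on plain metadata dictionaries shaped like the ones printed above. `format_citations` is a hypothetical helper, not part of LlamaIndex.

```python
def format_citations(metadatas):
    """Build de-duplicated 'file, p. N' strings from metadata dictionaries."""
    seen = set()
    citations = []
    for metadata in metadatas:
        key = (metadata.get("file_name", "unknown file"), metadata.get("page_label", "?"))
        if key not in seen:
            seen.add(key)
            citations.append(f"{key[0]}, p. {key[1]}")
    return citations


# Metadata copied from the two source nodes shown above (trimmed to the relevant keys)
example = [
    {"page_label": "4", "file_name": "transformers.pdf"},
    {"page_label": "6", "file_name": "transformers.pdf"},
]
print(format_citations(example))
```

With a live response object, the input would be `[node.metadata for node in response.source_nodes]`.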

# End to End RAG Pipeline

In this final section, we will integrate everything we have learned to create a complete end-to-end Retrieval-Augmented Generation (RAG) pipeline. This pipeline will read documents, index them, and allow us to query the indexed data using a large language model (LLM).

Let's walk through the entire process step by step:

- First, we import the necessary classes and load our documents. The `SimpleDirectoryReader` class from LlamaIndex reads every file in the 'data' directory and stores the result in the `documents` variable.

- Next, we reuse the large language model (LLM) and embedding model initialized earlier, which are available as `llm` and `embed_model`.

- With our documents and models ready, we create an index with the `VectorStoreIndex` class, which embeds the loaded documents using the embedding model so that relevant passages can be retrieved efficiently.

- We then create a query engine from the index and the LLM, which lets us query the indexed documents in natural language.

- Finally, we use the query engine to ask a question. The `query` method retrieves the most relevant chunks from the index, passes them to the LLM together with the question, and returns the synthesized response, which is printed to the console. In this example, we ask about the different types of Transformer models.




In [None]:
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load data from the specified directory
documents = SimpleDirectoryReader("data").load_data()

# Reuse the LLM and embedding model initialized earlier in the notebook (`llm`, `embed_model`)

# Create an index from the documents using the embedding model
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)

# Create a query engine from the index and LLM
query_engine = index.as_query_engine(llm=llm)

# Query the LLM and print the response
print(query_engine.query("What are the different types of Transformer Models?").response)

The context does not provide information on different types of Transformer Models.


In [None]:
print(query_engine.query("Why do we need positional encodings in transformer?").response)

Positional encodings are needed in transformers because the model itself doesn't have any inherent sense of position or order of the sequence elements. Unlike recurrent neural networks, transformers process input data in parallel rather than sequentially, which makes them more efficient but also means they don't inherently understand the order of the data. Positional encodings are used to give the model some information about the relative positions of the elements in the sequence.


In [None]:
print(query_engine.query("What are Encoder and Decoder blocks in transformer?").response)

The encoder and decoder are key components of the Transformer model architecture. The encoder is made up of a stack of six identical layers, each with two sub-layers. The first sub-layer is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network. Residual connections are employed around each of the two sub-layers, followed by layer normalization. 

The decoder, like the encoder, is composed of a stack of six identical layers. However, in addition to the two sub-layers in each encoder layer, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Residual connections are also used around each of the sub-layers in the decoder, followed by layer normalization. The self-attention sub-layer in the decoder stack is modified to prevent positions from attending to subsequent positions, ensuring that the predictions for a given position can depend only on the known outputs at p

In [None]:
query = "If I want to generate document embeddings, then which type of Transformer Architecture I must choose?"
print(query_engine.query(query).response)

The Transformer architecture you should choose for generating document embeddings is the Encoder part of the Transformer model. The encoder maps an input sequence of symbol representations to a sequence of continuous representations, which can be used as document embeddings. It is composed of a stack of identical layers, each with two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network.


In [None]:
query = """If I want to generate document embeddings,
then which type of Transformer Architecture I must choose among Encoders, Decoders or Encoder-Decoder?"""

print(query_engine.query(query).response)

To generate document embeddings, you should choose the Encoder part of the Transformer Architecture. The Encoder maps an input sequence of symbol representations to a sequence of continuous representations, which can be used as document embeddings.
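
The repeated query cells above can be folded into a small helper that runs a list of questions and collects the answers. This is a sketch under stated assumptions: `ask_all` is a hypothetical name, and `EchoEngine` is a stand-in that only mimics the `.query(...).response` shape used in this notebook so the example runs without an LLM.

```python
from types import SimpleNamespace


def ask_all(query_engine, questions):
    """Run each question through the query engine and collect the answer text."""
    answers = {}
    for question in questions:
        result = query_engine.query(question)
        answers[question] = result.response
    return answers


class EchoEngine:
    """Hypothetical stand-in exposing the same .query(...).response shape."""

    def query(self, question):
        return SimpleNamespace(response=f"(stub answer to: {question})")


for question, answer in ask_all(EchoEngine(), ["q1", "q2"]).items():
    print(question, "->", answer)
```

Passing the real `query_engine` instead of `EchoEngine()` would send each question through the full retrieval and synthesis pipeline.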


By following these steps, we have created a working end-to-end RAG pipeline. It ingests documents, indexes them, and answers natural language queries using LlamaIndex together with a Hugging Face embedding model and Google's Gemini LLM, with OpenAI models as a commented-out alternative. This demonstrates how a RAG system extracts and synthesizes information from a document collection.
