Retrieval Augmented Generation (RAG)

In [1]:
# Import the os module to interact with the operating system
import os

# Import the llama_index library 
import llama_index

# Import the load_dotenv function from the dotenv library to load environment variables from a .env file
from dotenv import load_dotenv

# Import the openai library to interact with OpenAI's API
import openai

# Load environment variables from a .env file into the environment
load_dotenv()

True

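The code below makes the OpenAI API key available to the script through the 'OPENAI_API_KEY' environment variable. load_dotenv() has already copied the value from the .env file into the process environment, so os.getenv('OPENAI_API_KEY') simply reads it back. If the key is missing from the .env file, os.getenv returns None and the key cannot be set, so check the .env file first if this cell fails.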

In [2]:
# Read the key that load_dotenv() placed in the environment and fail early if it is missing
api_key = os.getenv('OPENAI_API_KEY')
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file")
os.environ['OPENAI_API_KEY'] = api_key
# openai.api_key = 'sk-...tTM7'



In [3]:
# Import the VectorStoreIndex and SimpleDirectoryReader classes from the llama_index.core module
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Create an instance of SimpleDirectoryReader to read files from the "data" directory
pdfs = SimpleDirectoryReader("data").load_data()

# The load_data() method reads and loads the data from the specified directory

This snippet imports the necessary classes from the llama_index.core module and uses SimpleDirectoryReader to load data (presumably PDF files) from the "data" directory. The loaded data is stored in the pdfs variable.

In [4]:
# pdfs

In [5]:
# Create an instance of VectorStoreIndex from the documents loaded into pdfs
# The from_documents method is used to build the index from the provided documents
# The show_progress parameter, when set to True, displays the progress of the indexing process
index = VectorStoreIndex.from_documents(pdfs, show_progress=True)

Parsing nodes: 100%|██████████| 547/547 [00:01<00:00, 534.16it/s]
Generating embeddings: 100%|██████████| 642/642 [00:07<00:00, 81.08it/s]
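
The progress output above suggests that 547 loaded documents were split into 642 nodes before embedding, which is what happens when long documents are broken into smaller chunks. The following is a minimal sketch of that idea using a toy word-based splitter. It is not LlamaIndex's node parser; the function name `split_into_chunks`, the chunk size, the overlap, and the sample documents are all assumptions made for illustration.

```python
# Toy chunker: NOT LlamaIndex's node parser, only an illustration of why
# the number of nodes can exceed the number of loaded documents.
def split_into_chunks(words, chunk_size, overlap):
    """Split a list of words into chunks of at most chunk_size words,
    where consecutive chunks share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        # Stop once the current chunk has reached the end of the document
        if start + chunk_size >= len(words):
            break
    return chunks

# Hypothetical documents: one short page and one longer page
documents = [
    "a short page".split(),
    ("word " * 25).split(),
]

# Each document yields one or more chunks ("nodes")
nodes = [
    chunk
    for doc in documents
    for chunk in split_into_chunks(doc, chunk_size=10, overlap=2)
]
print(len(documents), "documents ->", len(nodes), "nodes")
```

The short page fits in a single chunk while the longer one is split into several, so the node count ends up larger than the document count.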


In [6]:
index  # The VectorStoreIndex holds an embedding vector for each text chunk (node), not individual word embeddings

<llama_index.core.indices.vector_store.base.VectorStoreIndex at 0x28a9d92fe80>

In [7]:
# Convert the VectorStoreIndex instance into a query engine
# The as_query_engine method prepares the index to handle queries
# query_engine = index.as_query_engine()

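The commented-out line would convert the VectorStoreIndex instance into a query engine using the as_query_engine method, which prepares the index to handle queries with default settings, and store it in the query_engine variable. It is left commented out here because a customized query engine is built in the next cell instead.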

In [3]:
# Import the VectorIndexRetriever class from the llama_index.core.retrievers module
from llama_index.core.retrievers import VectorIndexRetriever

# Import the RetrieverQueryEngine class from the llama_index.core.query_engine module
from llama_index.core.query_engine import RetrieverQueryEngine

# Import the SimilarityPostprocessor class from the llama_index.core.postprocessor module
from llama_index.core.postprocessor import SimilarityPostprocessor

# Create an instance of VectorIndexRetriever with the index and set similarity_top_k to 4
# This specifies that the retriever should return the 4 nodes (text chunks) most similar to a query
retriever = VectorIndexRetriever(index=index, similarity_top_k=4)

# Create an instance of SimilarityPostprocessor with a similarity_cutoff of 0.70
# This postprocessor will filter out results with a similarity score below 0.70
postprocessor = SimilarityPostprocessor(similarity_cutoff=0.70)

# Create an instance of RetrieverQueryEngine with the specified retriever and postprocessor
# The node_postprocessors parameter is used to apply postprocessing steps to the retrieved nodes
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[postprocessor]
)

NameError: name 'index' is not defined

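This snippet sets up a more customized query engine. It depends on the index built earlier; the NameError shown above appears when the cell is run before the index exists (for example, after a kernel restart).

* VectorIndexRetriever retrieves the 4 nodes (text chunks) most similar to the query.
* SimilarityPostprocessor filters out retrieved nodes whose similarity score is below the 0.70 threshold.
* RetrieverQueryEngine is initialized with the retriever and postprocessor, so only sufficiently similar nodes are used to answer a query.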
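
The two retrieval steps described above (take the top 4 by similarity, then drop anything below 0.70) can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names `cosine_similarity`, `retrieve`, and `apply_cutoff`, the two-dimensional vectors, and the node IDs are hypothetical, and cosine similarity is assumed as the scoring function. It does not reproduce LlamaIndex's internals.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, node_vecs, top_k):
    """Return the top_k (node_id, score) pairs, highest score first."""
    scored = [
        (node_id, cosine_similarity(query_vec, vec))
        for node_id, vec in node_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

def apply_cutoff(scored, cutoff):
    """Keep only results whose score is at least the cutoff."""
    return [(node_id, score) for node_id, score in scored if score >= cutoff]

# Hypothetical embeddings for five nodes (real embeddings have far more dimensions)
node_vecs = {
    "node_a": [1.0, 0.0],
    "node_b": [0.8, 0.6],
    "node_c": [0.0, 1.0],
    "node_d": [-1.0, 0.0],
    "node_e": [0.6, 0.8],
}
query_vec = [1.0, 0.0]

top = retrieve(query_vec, node_vecs, top_k=4)  # mirrors similarity_top_k=4
kept = apply_cutoff(top, cutoff=0.70)          # mirrors similarity_cutoff=0.70

print([node_id for node_id, _ in top])
print([node_id for node_id, _ in kept])
```

Note that the cutoff can leave fewer than 4 nodes, or none at all, if the retrieved nodes are only weakly related to the query.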

In [22]:
# The query engine retrieves relevant nodes from the index and uses them to answer queries.
query_engine

<llama_index.core.query_engine.retriever_query_engine.RetrieverQueryEngine at 0x28a9da76440>

The following snippet uses the query engine to ask the question "What is a CNN?" and then prints the response it returns.

In [28]:
response = query_engine.query("What is a CNN?")
print(response)

A CNN, or Convolutional Neural Network, is a type of neural network that is specifically designed for processing and analyzing visual data, such as images. It consists of neurons with learnable weights and biases, where each neuron receives inputs, performs a dot product operation, and may apply a non-linearity. CNNs are structured to efficiently handle image inputs, allowing for the encoding of specific properties into the network architecture. This design choice leads to more effective implementation of the forward function and a reduction in the number of parameters in the network compared to traditional neural networks.
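
Conceptually, an answer like the one above comes from two steps: the retrieved text chunks are placed into a prompt together with the question, and the language model then answers from that prompt. The sketch below shows only the prompt-assembly step. The function `build_prompt`, the template wording, and the sample chunks are assumptions for illustration and are not the prompt template LlamaIndex actually uses.

```python
def build_prompt(question, retrieved_chunks):
    """Combine retrieved text chunks and a question into a single prompt string."""
    context = "\n---\n".join(retrieved_chunks)
    return (
        "Context information is below.\n"
        f"{context}\n"
        "Using only the context above, answer the question.\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Hypothetical chunks standing in for the nodes returned by the retriever
chunks = [
    "Convolutional networks are neural networks designed for image inputs.",
    "Each neuron receives inputs, performs a dot product, and may apply a non-linearity.",
]

prompt = build_prompt("What is a CNN?", chunks)
print(prompt)
```

Because the model sees only what is placed in the prompt, the quality of the retrieved chunks largely determines the quality of the answer.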


In [24]:
response = query_engine.query("What is a RNN?")
print(response)

A Recurrent Neural Network (RNN) is a type of neural network that is designed to operate over sequences of vectors. Unlike Vanilla Neural Networks, which accept fixed-sized inputs and produce fixed-sized outputs using a fixed number of computational steps, RNNs can process sequences in the input, output, or both. This capability allows RNNs to learn patterns and dependencies in sequential data, making them particularly effective for tasks involving sequences like text generation, speech recognition, and time series prediction.


In [25]:
response = query_engine.query("Explain the transformer architecture.")
print(response)

The Transformer model architecture consists of an encoder and a decoder. The encoder is made up of a stack of identical layers, each containing two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network. Residual connections and layer normalization are applied around each sub-layer. The decoder also consists of a stack of identical layers, with an additional third sub-layer that performs multi-head attention over the output of the encoder stack. The self-attention mechanism in the decoder is modified to prevent positions from attending to subsequent positions. This architecture allows for parallelization and draws global dependencies between input and output using self-attention without relying on recurrent neural networks or convolution.


In [29]:
response = query_engine.query("What is Attention is All You Need?")
print(response)

Attention Is All You Need is a paper that introduces a new network architecture called the Transformer, which is based solely on attention mechanisms. This architecture eliminates the need for complex recurrent or convolutional neural networks typically used in sequence transduction models. The Transformer model has shown superior quality in machine translation tasks, is more parallelizable, and requires significantly less training time compared to existing models.


In [30]:
from llama_index.core.response.pprint_utils import pprint_response

pprint_response(response, show_source=True)

Final Response: Attention Is All You Need is a paper that introduces a
new network architecture called the Transformer, which is based solely
on attention mechanisms. This architecture eliminates the need for
complex recurrent or convolutional neural networks typically used in
sequence transduction models. The Transformer model has shown superior
quality in machine translation tasks, is more parallelizable, and
requires significantly less training time compared to existing models.
______________________________________________________________________
Source Node 1/4
Node ID: 8549d20a-b81f-4927-8072-69886d41f63e
Similarity: 0.8405749229591184
Text: Provided proper attribution is provided, Google hereby grants
permission to reproduce the tables and figures in this paper solely
for use in journalistic or scholarly works. Attention Is All You Need
Ashish Vaswani∗ Google Brain avaswani@google.comNoam Shazeer∗ Google
Brain noam@google.comNiki Parmar∗ Google Research
nikip@google.comJakob Usz

Storing the index

In [6]:
# Import necessary modules and classes
import os.path
from llama_index.core import (
    VectorStoreIndex,  # For creating and handling vector store indices
    SimpleDirectoryReader,  # For reading documents from a directory
    StorageContext,  # For managing storage contexts
    load_index_from_storage,  # For loading an index from storage
)

# Define the directory where the storage will be persisted
PERSIST_DIR = "./storage"

# Check if the storage directory already exists
if not os.path.exists(PERSIST_DIR):
    # If the storage directory does not exist, load the documents and create the index
    documents = SimpleDirectoryReader("data").load_data()  # Read documents from the "data" directory
    index = VectorStoreIndex.from_documents(documents)  # Create an index from the loaded documents
    
    # Store the created index for later use
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # If the storage directory exists, load the existing index
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)  # Create a storage context from the existing directory
    index = load_index_from_storage(storage_context)  # Load the index from the storage context

# Create a query engine from the index, regardless of whether it was newly created or loaded from storage
query_engine = index.as_query_engine()

# Query the index with a specific question
response = query_engine.query("Summarize Attention is all you need in 250 words.")

# Print the response from the query engine
print(response)


The paper "Attention Is All You Need" introduces a new network architecture called the Transformer, which relies solely on attention mechanisms, eliminating the need for recurrent or convolutional neural networks in sequence transduction models. By connecting the encoder and decoder through attention, the Transformer model proves to be more efficient, parallelizable, and quicker to train compared to existing models. Experimental results on machine translation tasks demonstrate the Transformer's superiority in quality, achieving significant improvements in BLEU scores on tasks like English-to-German and English-to-French translations. The model's success extends beyond translation tasks, showing promising results in English constituency parsing as well. The paper highlights the benefits of self-attention, such as improved computational performance for long sequences and potentially more interpretable models. Through detailed training procedures, hardware specifications, and optimization

1. Imports:

Various components from the llama_index.core module are imported to handle vector indices, reading documents, managing storage contexts, and loading indices from storage.

2. Storage Check and Index Creation/Loading:

The script checks if a directory (./storage) exists to determine if a previously stored index is available.
If the directory does not exist, it reads documents from the data directory, creates an index from these documents, and persists the index.
If the directory exists, it loads the index from the stored context.

3. Query Engine:

Whether the index is newly created or loaded from storage, it is converted into a query engine.
A query is made to the index asking, "Summarize Attention is all you need in 250 words.", and the response is printed.
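
The check-then-build-or-load pattern described above can be sketched without LlamaIndex. The example below is a minimal stand-in that persists a toy "index" as JSON in a temporary directory. The names `build_index` and `load_or_build`, the file name `index.json`, and the word-count "index" are all hypothetical and serve only to show the control flow.

```python
import json
import os
import tempfile

def build_index(documents):
    """Stand-in for the expensive indexing step: map each document to its word count."""
    return {doc: len(doc.split()) for doc in documents}

def load_or_build(persist_path, documents):
    """Load the index from disk if it exists; otherwise build it and save it."""
    if os.path.exists(persist_path):
        with open(persist_path, "r", encoding="utf-8") as f:
            return json.load(f), "loaded"
    index = build_index(documents)
    with open(persist_path, "w", encoding="utf-8") as f:
        json.dump(index, f)
    return index, "built"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "index.json")
    docs = ["attention is all you need", "convolutional networks"]
    # First call: nothing on disk yet, so the index is built and persisted
    first_index, first_source = load_or_build(path, docs)
    # Second call: the persisted file exists, so it is loaded instead
    second_index, second_source = load_or_build(path, docs)

print(first_source, second_source)  # built loaded
```

One consequence of this pattern is that the stored index is reused even if the source documents have changed since it was built, so the storage directory needs to be deleted (or the index rebuilt) after updating the data.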