# Retrieval-Augmented Generation

In [1]:
# Load environment variables (e.g. OPENAI_API_KEY) from a local .env file
from dotenv import load_dotenv
load_dotenv()

True

In [3]:
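import os

# Read the API key loaded from .env and fail early if it is missing
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
if OPENAI_API_KEY is None:
    raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file")
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY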

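# LlamaIndex (these import paths match the pre-0.10 package layout;
# later releases expose the same names under llama_index.core)
from llama_index import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    StorageContext,
    load_index_from_storage,
)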

# Check whether a persisted index already exists
PERSIST_DIR = "./storage"
if not os.path.exists(PERSIST_DIR):
    # Load the documents, build the index, and persist it to disk
    documents = SimpleDirectoryReader("data").load_data()
    index = VectorStoreIndex.from_documents(documents=documents)
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # Load the existing index from disk
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context=storage_context)


#documents = SimpleDirectoryReader("data").load_data()

In [7]:
#index= VectorStoreIndex.from_documents(documents=documents, show_progress=True)

Parsing nodes: 100%|██████████| 57/57 [00:00<00:00, 91.17it/s] 
Generating embeddings: 100%|██████████| 129/129 [00:04<00:00, 26.68it/s]


In [33]:
query_engine = index.as_query_engine()
query_engine

<llama_index.query_engine.retriever_query_engine.RetrieverQueryEngine at 0x12da9d84940>

In [34]:
response = query_engine.query("How the attention works")

In [35]:
response.response

'The attention mechanism in the Transformer model is based on a compatibility function between a query and a set of key-value pairs. The input consists of queries and keys of dimension dk, and values of dimension dv. The attention function computes the dot products of the queries with the keys, scales the dot products by dividing them by the square root of the dimension dk, and applies a softmax function to obtain the weights assigned to each value. The output is then computed as a weighted sum of the values, where the weights are the attention weights. This allows the model to focus on different parts of the input sequence when making predictions.'
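
The response above describes scaled dot-product attention in words. As a rough illustration of those steps (dot products of queries with keys, division by the square root of dk, softmax, weighted sum of values), here is a minimal pure-Python sketch. It is not part of the notebook's pipeline and not LlamaIndex code; the function names `softmax` and `scaled_dot_product_attention` and the toy vectors are assumptions made for this example.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Hypothetical helper: queries and keys are lists of d_k-dim vectors,
    values is a list of d_v-dim vectors (one per key)."""
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs, all_weights = [], []
    for q in queries:
        # Compatibility of this query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the weighted sum of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy data: one query that matches the first key more closely than the second
queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
outputs, weights = scaled_dot_product_attention(queries, keys, values)
print("weights:", weights[0])
print("output:", outputs[0])
```

Because the query is closer to the first key, the first value vector receives the larger weight and the output lands nearer to it.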

In [14]:
response = query_engine.query("What is Attention")
response.response

'Attention is a mechanism used in sequence transduction models that allows the model to focus on specific parts of the input sequence when generating the output. It involves mapping a query and a set of key-value pairs to an output, where the output is computed as a weighted sum of the values. The weights are determined by a compatibility function between the query and the corresponding key.'

In [15]:
from llama_index.response.pprint_utils import pprint_response
pprint_response(response, show_source=True)
print(response)

Final Response: Attention is a mechanism used in sequence transduction
models that allows the model to focus on specific parts of the input
sequence when generating the output. It involves mapping a query and a
set of key-value pairs to an output, where the output is computed as a
weighted sum of the values. The weights are determined by a
compatibility function between the query and the corresponding key.
______________________________________________________________________
Source Node 1/2
Node ID: 8a58ccf2-81a7-4652-a57f-326bce8109fd
Similarity: 0.8079673385653757
Text: Attention Is All You Need Ashish Vaswani∗ Google Brain
avaswani@google.comNoam Shazeer∗ Google Brain noam@google.comNiki
Parmar∗ Google Research nikip@google.comJakob Uszkoreit∗ Google
Research usz@google.com Llion Jones∗ Google Research
llion@google.comAidan N. Gomez∗† University of Toronto
aidan@cs.toronto.eduŁukasz Kaiser∗ Google Brain lukasz...
_____________________________________________________________________

In [25]:
from llama_index.retrievers import VectorIndexRetriever
from llama_index.query_engine import RetrieverQueryEngine
from llama_index.indices.postprocessor import SimilarityPostprocessor

retriever = VectorIndexRetriever(index=index, similarity_top_k=4)
postprocessor = SimilarityPostprocessor(similarity_cutoff=0.80)

query_engine = RetrieverQueryEngine(retriever=retriever, node_postprocessors=[postprocessor])
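
To make the two knobs in the cell above more concrete, the sketch below mimics what `similarity_top_k=4` and `similarity_cutoff=0.80` do conceptually: rank nodes by similarity to the query, keep the top k, then drop anything scoring below the cutoff. This is an illustration only, assuming cosine similarity over embedding vectors; the `cosine_similarity` and `retrieve` helpers and the toy 2-D "embeddings" are hypothetical and are not the LlamaIndex implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, node_vecs, top_k=4, cutoff=0.80):
    """Hypothetical helper: return (node_id, score) pairs for the top_k most
    similar nodes, keeping only those whose score is at least `cutoff`."""
    scored = [(node_id, cosine_similarity(query_vec, vec))
              for node_id, vec in node_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # best match first
    top = scored[:top_k]                                 # retriever step
    return [(node_id, score) for node_id, score in top if score >= cutoff]  # cutoff step

# Toy 2-D vectors standing in for real embeddings
node_vecs = {
    "a": [1.0, 0.0],
    "b": [1.0, 1.0],
    "c": [0.0, 1.0],
    "d": [2.0, 0.5],
}
kept = retrieve([1.0, 0.0], node_vecs, top_k=3, cutoff=0.80)
for node_id, score in kept:
    print(node_id, round(score, 4))
```

Note that the cutoff can leave fewer than `top_k` nodes, or none at all, so a strict threshold may leave the query engine with little or no context to answer from.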

In [26]:
response = query_engine.query("how to best of understanding the attention?")

In [None]:
response

In [27]:
from llama_index.response.pprint_utils import pprint_response
pprint_response(response, show_source=True)
print(response)

Final Response: The best way to understand attention is to think of it
as a mechanism that allows a model to focus on specific parts of the
input when making predictions. In the context of the Transformer model
described in the given information, attention is used to connect the
encoder and decoder. It computes the compatibility between a query and
a set of key-value pairs, and then uses this compatibility to assign
weights to the values. These weighted values are then used to make
predictions. The attention mechanism used in the Transformer model is
called "Scaled Dot-Product Attention", which involves computing dot
products between queries and keys, and then scaling the dot products
by a factor of 1/sqrt(dk). This scaling helps to prevent the dot
products from becoming too large and affecting the gradients during
training.
______________________________________________________________________
Source Node 1/4
Node ID: 8a58ccf2-81a7-4652-a57f-326bce8109fd
Similarity: 0.8156380603711351