In [19]:
!pip install -r requirements.txt



In [20]:
# Load environment variables from the .env file
import os
from dotenv import load_dotenv

load_dotenv()
MODEL = 'llama3.1'

In [21]:
# Load the embedding model
from langchain_community.embeddings import HuggingFaceBgeEmbeddings

model_name = "BAAI/bge-base-en-v1.5"  # BGE embedding model, well suited to retrieval
model_kwargs = {"device": "cpu"}
encode_kwargs = {"normalize_embeddings": True}

hf = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)
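
A note on `normalize_embeddings`: when it is set, each vector is scaled to unit length, so a plain dot product between two embeddings equals their cosine similarity. The sketch below illustrates this with hand-written toy vectors. The helper names `l2_normalize`, `dot`, and `cosine` are hypothetical and not part of LangChain.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity of two vectors of any length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors standing in for real embeddings (assumed values).
a, b = [3.0, 4.0], [4.0, 3.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

print(cosine(a, b))         # cosine similarity of the raw vectors
print(dot(a_unit, b_unit))  # the same quantity, computed as a dot product
```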

In [22]:
### If you wish to use Ollama embeddings
#from langchain_community.embeddings import OllamaEmbeddings
#embeddings = OllamaEmbeddings(model=MODEL)

In [23]:
from langchain_community.llms import Ollama
from langchain_core.output_parsers import StrOutputParser

model = Ollama(model=MODEL)
embeddings = hf
parser = StrOutputParser()

# The chain receives the model's output and parses it into a str
chain = model | parser
chain.invoke("Tell me about yourself")

"I'm a large language model, which means I'm a computer program designed to process and generate human-like text. I don't have a physical body or personal experiences like humans do, but I can still share some information about my capabilities and characteristics.\n\nHere are a few things you might find interesting:\n\n1. **Language understanding**: I've been trained on a massive corpus of text data, which allows me to comprehend and respond to a wide range of topics and questions.\n2. **Knowledge base**: My training data includes a vast amount of information from various sources, including books, articles, research papers, and websites. This means I can provide answers to many different types of questions, from simple definitions to more complex topics like science and history.\n3. **Conversational skills**: While I don't have personal opinions or emotions, I'm designed to engage in conversations that feel natural and productive. I can respond to questions, discuss topics, and even te

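The `|` operator builds a sequence in which each step's output becomes the next step's input. The toy class below mimics that idea with Python's `__or__`. `Step`, `fake_model`, and `to_str` are hypothetical stand-ins, and this is only a sketch of the concept, not how LangChain implements its runnables.

```python
class Step:
    """Minimal stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # Compose left to right: run self first, then feed the result to other.
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)

# Hypothetical stand-ins for the model and the output parser.
fake_model = Step(lambda prompt: {"text": f"echo: {prompt}"})
to_str = Step(lambda message: message["text"])

toy_chain = fake_model | to_str
print(toy_chain.invoke("Tell me about yourself"))
```
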
In [24]:
from langchain_community.document_loaders import PyPDFLoader
# Load the PDF and split it into chunks (the data the chain will retrieve from)
loader = PyPDFLoader("attention_paper.pdf")
pages = loader.load_and_split()
pages

[Document(metadata={'source': 'attention_paper.pdf', 'page': 0}, page_content='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.comNoam Shazeer∗\nGoogle Brain\nnoam@google.comNiki Parmar∗\nGoogle Research\nnikip@google.comJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.comAidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.eduŁukasz Kaiser∗\nGoogle Brain\nlukaszkaiser@google.com\nIllia Polosukhin∗ ‡\nillia.polosukhin@gmail.com\nAbstract\nThe dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks that include an encoder and a decoder. The best\nperforming models also connect the encoder and decoder through an attention\nmechanism. We propose a new simple network architecture, the Transfor
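
`load_and_split` returns the PDF as a list of `Document` chunks. To show the general idea of chunking, here is a fixed-size character splitter with overlap. It is a simplified illustration under assumed parameters, not the splitter LangChain uses. `split_text`, `chunk_size`, and `overlap` are names made up for this example.

```python
def split_text(text, chunk_size=40, overlap=10):
    """Cut text into fixed-size character windows that overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sample = "The Transformer relies entirely on attention to draw global dependencies."
for chunk in split_text(sample):
    print(repr(chunk))
```

The overlap repeats the tail of one chunk at the head of the next, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.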

In [25]:
from langchain_core.prompts import PromptTemplate

template = """ 
Answer the question accurately based on the context below. 
If you can't answer the question, reply "Sorry, I don't know".

Context: {context}
Question: {question}
"""
prompt = PromptTemplate.from_template(template)
print(prompt.format(context="Context", question="Question"))

 
Answer the question accurately based on the context below. 
If you can't answer the question, reply "Sorry, I don't know".

Context: Context
Question: Question
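
The template's placeholders determine which keys the chain expects as input. A minimal sketch using only the standard library shows how placeholder names can be read out of a format string and then filled in. `toy_template` is a shortened, hypothetical copy of the template above.

```python
from string import Formatter

toy_template = "Context: {context}\nQuestion: {question}\n"

# Collect the placeholder names that appear in the template.
variables = [name for _, name, _, _ in Formatter().parse(toy_template) if name]
print(variables)

# Fill the placeholders, as prompt.format(...) does for the real template.
filled = toy_template.format(context="Context", question="Question")
print(filled)
```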



In [26]:
# Rebuild the chain so the prompt template feeds the model
chain = prompt | model | parser
chain.input_schema.schema()

{'title': 'PromptInput',
 'type': 'object',
 'properties': {'context': {'title': 'Context', 'type': 'string'},
  'question': {'title': 'Question', 'type': 'string'}},
 'required': ['context', 'question']}

In [27]:
chain.invoke(
    {
        "context": "The 7 Habits of Highly Effective People was authored by Stephen R. Covey",
        "question": "Which book did Stephen R. Covey author?"
    }
)

'The 7 Habits of Highly Effective People.'

In [28]:
# In-memory vector store used for retrieval
from langchain_community.vectorstores import DocArrayInMemorySearch

vectorstore = DocArrayInMemorySearch.from_documents(
    pages,
    embedding=embeddings
)

In [29]:
retriever = vectorstore.as_retriever()
retriever.invoke("Model Architecture")

[Document(metadata={'source': 'attention_paper.pdf', 'page': 2}, page_content='Figure 1: The Transformer - model architecture.\nThe Transformer follows this overall architecture using stacked self-attention and point-wise, fully\nconnected layers for both the encoder and decoder, shown in the left and right halves of Figure 1,\nrespectively.\n3.1 Encoder and Decoder Stacks\nEncoder: The encoder is composed of a stack of N= 6 identical layers. Each layer has two\nsub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position-\nwise fully connected feed-forward network. We employ a residual connection [ 11] around each of\nthe two sub-layers, followed by layer normalization [ 1]. That is, the output of each sub-layer is\nLayerNorm( x+ Sublayer( x)), where Sublayer( x)is the function implemented by the sub-layer\nitself. To facilitate these residual connections, all sub-layers in the model, as well as the embedding\nlayers, produce outputs of dimensio
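
Conceptually, the retriever embeds the query, scores it against every stored chunk, and returns the closest matches. The sketch below shows that idea with a toy word-count "embedding" in place of the BGE model. `embed`, `cosine`, `search`, and the sample `docs` are all hypothetical and only illustrate the ranking step.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def search(query, docs, k=2):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

docs = [
    "the encoder is a stack of identical layers",
    "we trained on eight GPUs",
    "the decoder is also a stack of identical layers",
]
print(search("encoder stack", docs, k=1))
```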

In [32]:
from operator import itemgetter

chain = (
    {
        "context": itemgetter("question") | retriever,  # extract the question and pass it to the retriever
        "question": itemgetter("question"),  # pass the question through unchanged
    }
    | prompt
    | model
    | parser
)

chain.invoke({"question" : "Tell me about Encoders?"})

"The text does not provide a comprehensive overview of encoders in general. However, it mentions encoders in the context of sequence transduction tasks and discusses how they are used in conjunction with decoders.\n\nSpecifically, it talks about the following:\n\n1. **Encoder-Decoder Attention Layers**: In these layers, queries come from the previous decoder layer, while memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence.\n2. **Self-Attention Layers in the Encoder**: The encoder contains self-attention layers where each position can attend to all positions in the previous layer of the encoder.\n\nIt is implied that the encoder's purpose is to process the input sequence and extract relevant information, which is then used by the decoder to produce the output sequence.\n\nThere is no detailed explanation of encoders as a concept or their general use cases. If you have specific questions 

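In the chain above, the retriever's list of `Document` objects is placed into the prompt as-is, so the model also sees metadata and list punctuation. One common option is to join only the chunk text first. The sketch below uses a hypothetical `Doc` dataclass as a stand-in for LangChain's `Document`, and `format_docs` is a helper name assumed for this example.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_docs(docs):
    """Join the text of each retrieved chunk, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)

retrieved = [
    Doc("The encoder is composed of a stack of N = 6 identical layers.", {"page": 2}),
    Doc("The decoder is also composed of a stack of N = 6 identical layers.", {"page": 2}),
]
print(format_docs(retrieved))
```

In the real chain this could slot in as `itemgetter("question") | retriever | format_docs`, since a plain function piped after a runnable is wrapped as a chain step.
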
In [33]:
# Test the chain on a few questions
questions = [
    "What is Scaled Dot-Product Attention?",
    "What is Multi-Head Attention?",
    "Tell me about Training Data and Batching?",
    "Which GPUs were used to train the model?",
    "How many stacks is the decoder composed of?",
]

for question in questions:
    print(f"Question: {question}")
    print(f"Answer: {chain.invoke({'question': question})}")
    print()

Question: What is Scaled Dot-Product Attention?
Answer: The Scaled Dot-Product Attention is a type of attention mechanism used in deep learning models, particularly in the Transformer architecture for sequence-to-sequence tasks.

In this context, it refers to the dot-product attention mechanism with scaling and a softmax function applied to it. This mechanism allows the model to attend to different parts of the input sequence when generating output values.

More specifically, the Scaled Dot-Product Attention involves three steps:

1. Computing the dot product between the query and key vectors.
2. Scaling the result by dividing by the square root of the dimensionality of the key vector (to prevent the softmax normalization from dominating the gradients).
3. Applying a softmax function to normalize the scaled attention scores.

This mechanism is described in detail in Vaswani et al.'s paper on the Transformer model, where it is used as the primary attention component for processing input
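
For reference, the three steps listed in the answer correspond to the paper's formula Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V. Below is a minimal pure-Python sketch on toy matrices. The input values are made up, and real implementations use batched tensor operations instead of nested loops.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(keys[0])
    output = []
    for q in queries:
        # Steps 1 and 2: dot product with each key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        # Step 3: normalize the scores into attention weights.
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return output

Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
print(attention(Q, K, V))
```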

In [34]:
# Stream the answer chunk by chunk
for chunk in chain.stream({"question": "What is Multi-Head Attention?"}):
    print(chunk, end="", flush=True)  # print each chunk as soon as it arrives

Multi-Head Attention is a mechanism used in the Transformer architecture to jointly attend to information from different representation subspaces at each position, as opposed to just using a single attention mechanism to attend to all positions simultaneously. This allows the model to learn complex interactions between different parts of the input sequence.

In the context of the provided text, Multi-Head Attention is described as a sub-layer in the decoder stack that performs multi-head attention over the output of the encoder stack. This means that the model will attend to different aspects of the input sequence at each position, using multiple "heads" or attention mechanisms to do so.

The benefits of Multi-Head Attention are not explicitly stated in the provided text, but it is implied to be a key innovation of the Transformer architecture, allowing for better performance on machine translation tasks.
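
Streaming works because the chain yields partial output as it is produced, and `flush=True` pushes each piece to the screen immediately. The pattern can be sketched with a plain generator. `fake_stream` is a hypothetical stand-in for `chain.stream`, and it splits on spaces only for illustration.

```python
def fake_stream(text):
    """Yield a response word by word, as a streaming model would yield chunks."""
    for word in text.split(" "):
        yield word + " "

for chunk in fake_stream("Multi-Head Attention runs several attention heads in parallel."):
    print(chunk, end="", flush=True)
print()
```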

In [35]:
# Batching: run the chain on all questions with parallel calls
chain.batch([{"question": q} for q in questions])

['Scaled Dot-Product Attention is a type of self-attention mechanism used in deep learning models, such as the Transformer architecture. It was introduced in the paper "Attention Is All You Need" by Vaswani et al.\n\nIn essence, Scaled Dot-Product Attention allows the model to attend to different positions in the input sequence and weigh their importance for computing the output. This is done by calculating a compatibility score between each pair of positions using the dot product of two vectors (the query and key vectors), scaling this score with a learnable factor (the scale parameter), and then applying a softmax function to normalize the scores.\n\nThe formula for Scaled Dot-Product Attention is:\n\nAttention(Q, K, V) = softmax( Q*K^T / sqrt(dk) ) * V\n\nwhere Q is the query vector, K is the key vector, V is the value vector, and dk is the dimension of the key vector.',
 'Multi-Head Attention is a sub-layer in the Transformer architecture that performs self-attention over the outpu
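
The idea behind batching can be sketched with a thread pool: each input is processed concurrently and the results come back in input order. This is an illustration of the pattern only. `answer` is a hypothetical stand-in for `chain.invoke`, and `batch`, `max_workers`, and `toy_questions` are names assumed for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def answer(question):
    """Hypothetical stand-in for chain.invoke({'question': question})."""
    return f"answer to: {question}"

def batch(func, inputs, max_workers=4):
    """Run func over inputs concurrently, returning results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, inputs))

toy_questions = ["What is Multi-Head Attention?", "Which GPUs were used?"]
print(batch(answer, toy_questions))
```

Threads suit this workload because each call spends most of its time waiting on the model server, not computing in Python.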