## Retrieval Augmented Generation (RAG)

Retrieval Augmented Generation (RAG) [[1](https://arxiv.org/abs/2005.11401v4)] is an advanced NLP technique that enhances the quality and reliability of Large Language Models (LLMs) by grounding them in external knowledge sources.

In practice, this approach combines information retrieval with text generation as follows:
1. Given a user query (prompt), the system searches a large external knowledge base (such as a vector index) for relevant passages.
2. It then augments the original query with this retrieved information.
3. The LLM generates a response based on both the original query and the retrieved context.
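
The three steps above can be sketched in plain Python. This is only a toy illustration: the passages are made up, the word-overlap score stands in for a real vector search, and `generate` is a hypothetical placeholder for the LLM call.

```python
# Toy knowledge base (made-up passages for illustration).
KNOWLEDGE_BASE = [
    "Llama 3.1 expands the context length to 128K tokens.",
    "FAISS is a library for efficient similarity search.",
    "Chunk overlap helps preserve context across chunk boundaries.",
]

def words(text):
    """Lowercased set of words with surrounding punctuation removed."""
    return {w.strip(".,?!:;").lower() for w in text.split()} - {""}

def retrieve(query, passages, k=1):
    """Step 1: rank passages by the number of words shared with the query."""
    q = words(query)
    return sorted(passages, key=lambda p: len(q & words(p)), reverse=True)[:k]

def augment(query, passages):
    """Step 2: place the retrieved passages in front of the query."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

def generate(prompt):
    """Step 3: hypothetical stub; a real system would call an LLM here."""
    return f"[LLM response to a prompt of {len(prompt)} characters]"

query = "What is the context length of Llama 3.1?"
top = retrieve(query, KNOWLEDGE_BASE, k=1)
prompt = augment(query, top)
print(prompt)
print(generate(prompt))
```

The rest of this notebook replaces each stub with a real component: an embedding-based retriever, a prompt assembled by LangChain, and a hosted LLM.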

Key benefits of implementing RAG in LLM-based systems include:
1. More factual and specific response generation.
2. Easy incorporation of updated knowledge by modifying the retrieval corpus without retraining the LLM.
3. A form of interpretability, since the response can cite the retrieved passages used for generation.

[1] Lewis P, et al. 2020. *Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks*. [arXiv:2005.11401](https://arxiv.org/abs/2005.11401v4)

In this notebook, we'll build a basic knowledge base from example documents, split them into chunks, index the embedded chunks in a vector store, and build a conversational chain with history:

<p align="center">
  <img src="https://github.com/dcarpintero/generative-ai-101/blob/main/static/retrieval_augmented_generation.png?raw=1">
</p>

### 1. Build up Knowledge Base

The most common approach in RAG is to create dense vector representations of the knowledge base in order to calculate the semantic similarity to a given user query.
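
As a rough illustration of what "semantic similarity" means here, the sketch below ranks two chunks by cosine similarity to a query. The 3-dimensional vectors and chunk names are hypothetical; real embedding models produce vectors with hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings; a real model would compute these from text.
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "chunk_a": [0.9, 0.1, 0.8],  # points in nearly the same direction as the query
    "chunk_b": [0.0, 1.0, 0.0],  # orthogonal to the query
}

# Rank chunks from most to least similar to the query.
ranked = sorted(
    doc_vecs,
    key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
    reverse=True,
)
print(ranked)
```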

In this basic example, we will take two sources related to the Llama 3.1 model, split them into chunks, embed them using an open-source embedding model, and load them into a vector store.

In [1]:
%pip install langchain langchain-community langchain-huggingface sentence-transformers faiss-cpu bs4 --quiet | tail -n 1



#### 1.1 Document Ingestion

We first load the documents from web URLs:

In [2]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader(["https://ai.meta.com/blog/meta-llama-3-1/",
                        "https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md"])
docs = loader.load()



#### 1.2 Chunking Documents for RAG

A critical step in implementing Retrieval-Augmented Generation (RAG) is splitting documents into appropriate chunks. This process ensures that semantically relevant content is grouped together, optimizing retrieval accuracy and context preservation. In this section we will explore how to effectively chunk our documents using LangChain.

##### Why Chunking Matters

1. **Semantic Coherence**: Proper chunking keeps related information together, improving the relevance of retrieved content.
2. **Context Window Optimization**: Chunks should fit within the LLM's context window for efficient processing.
3. **Retrieval Precision**: Well-defined chunks enable more accurate and targeted information retrieval.

##### Using LangChain's Text Splitters

LangChain offers various text splitters, with the `RecursiveCharacterTextSplitter` being a recommended choice for generic text. This splitter tries to keep paragraphs (then sentences, then words) together for as long as possible, since those tend to be the most strongly semantically related pieces of text.
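
To make the recursive idea concrete, here is a simplified sketch. It is not LangChain's implementation: the function name is hypothetical, it ignores overlap, and it only tries three separators before falling back to a hard cut.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split text into chunks of at most chunk_size characters,
    trying coarse separators before finer ones."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging pieces while they fit
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks

sample = (
    "First paragraph about RAG.\n\n"
    "Second paragraph, which is somewhat longer than the first one.\n\n"
    "Third."
)
chunks = recursive_split(sample, chunk_size=40)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

Short paragraphs survive intact, while the long one is broken at word boundaries rather than mid-word.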

In [3]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
splits = text_splitter.split_documents(docs)

Let's inspect the second and third chunks:

In [4]:
from IPython.display import display, Markdown

def md(s):
    display(Markdown(s))

In [5]:
md(splits[1].page_content)
md(splits[2].page_content)

Our approachResearchProduct experiencesLlamaBlogTry Meta AILarge Language ModelIntroducing Llama 3.1: Our most capable models to dateJuly 23, 2024•15 minute readTakeaways:Meta is committed to openly accessible AI. Read Mark Zuckerberg’s letter detailing why open source is good for developers, good for Meta, and good for the world.Bringing open intelligence to all, our latest models expand context length to 128K, add support across eight languages, and include Llama 3.1 405B—the first

context length to 128K, add support across eight languages, and include Llama 3.1 405B—the first frontier-level open source AI model.Llama 3.1 405B is in a class of its own, with unmatched flexibility, control, and state-of-the-art capabilities that rival the best closed source models. Our new model will enable the community to unlock new workflows, such as synthetic data generation and model distillation.We’re continuing to build out Llama to be a system by providing more components that work

We can see that there is indeed an overlap between those chunks:

In [6]:
md(splits[1].page_content[-100:])
md(splits[2].page_content[:100])

and context length to 128K, add support across eight languages, and include Llama 3.1 405B—the first

context length to 128K, add support across eight languages, and include Llama 3.1 405B—the first fro
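
The overlap is not necessarily exactly `chunk_overlap` characters, since the splitter cuts at separators; the setting acts as an upper bound. A small helper (hypothetical, not part of LangChain) can measure the actual shared text between two consecutive chunks. The sample strings below are shortened stand-ins for real chunks.

```python
def shared_overlap(left, right):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return size
    return 0

# Shortened stand-ins for two consecutive chunks.
left = "context length to 128K, add support across eight languages"
right = "add support across eight languages, and include Llama 3.1 405B"

n = shared_overlap(left, right)
print(n, repr(left[len(left) - n:]))
```

In the notebook, `shared_overlap(splits[1].page_content, splits[2].page_content)` would report the overlap for the two chunks shown above.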

You might also experiment with chunking strategies at https://chunkviz.up.railway.app/, a tool that highlights splits and overlaps for common splitters:

![RAG Chunking](https://github.com/dcarpintero/generative-ai-101/blob/main/static/rag_chunking.png?raw=1)

#### 1.3 Embedding Transformation & Indexing

Let's load the chunks into a vector store with an open-source embedding model. In this example we use [FAISS](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/), a similarity search library that scales to large datasets and supports GPU acceleration (here we installed the CPU build, `faiss-cpu`):

In [7]:
%%capture
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings

embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
db = FAISS.from_documents(documents = splits,
                          embedding = HuggingFaceEmbeddings(model_name=embedding_model))

### 2. Foundation Models on Groq

You can get a Groq API key at https://console.groq.com/keys:

In [8]:
import os
from getpass import getpass

GROQ_API_TOKEN = getpass()
os.environ["GROQ_API_KEY"] = GROQ_API_TOKEN

··········
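
When re-running the notebook, it can be convenient to prompt only if the key is not already set. A minimal sketch, assuming a hypothetical helper `ensure_api_key`; the `env` and `prompt` parameters exist only to make the function easy to try without typing a real key.

```python
import os
from getpass import getpass

def ensure_api_key(name, env=os.environ, prompt=getpass):
    """Return the key stored under `name`, prompting only when it is missing."""
    if not env.get(name):
        env[name] = prompt(f"{name}: ")
    return env[name]

# Demonstration with a plain dict and a fake prompt instead of real input.
demo_env = {}
first = ensure_api_key("GROQ_API_KEY", env=demo_env, prompt=lambda msg: "dummy-key")
second = ensure_api_key("GROQ_API_KEY", env=demo_env, prompt=lambda msg: "ignored")
print(first == second)
```

In the notebook itself, calling `ensure_api_key("GROQ_API_KEY")` with the defaults would fall back to `getpass` only on the first run.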


In this example we will use [Llama3-8b](https://ai.meta.com/blog/meta-llama-3-1/), served by Groq under the model ID `llama3-8b-8192`:

In [11]:
!pip install langchain-groq
from langchain_groq import ChatGroq

# temperature=0 keeps the generated answers as deterministic as possible
llm = ChatGroq(temperature=0, model_name="llama3-8b-8192")

Collecting langchain-groq
  Downloading langchain_groq-0.2.3-py3-none-any.whl.metadata (3.0 kB)
Collecting groq<1,>=0.4.1 (from langchain-groq)
  Downloading groq-0.14.0-py3-none-any.whl.metadata (14 kB)
Downloading langchain_groq-0.2.3-py3-none-any.whl (14 kB)
Downloading groq-0.14.0-py3-none-any.whl (109 kB)
Installing collected packages: groq, langchain-groq
Successfully installed groq-0.14.0 langchain-groq-0.2.3


### 3. Generate a Retrieval-Augmented Response with LangChain

In [12]:
from langchain.chains import ConversationalRetrievalChain

chat_history = []
chain = ConversationalRetrievalChain.from_llm(llm,
                                              db.as_retriever(),
                                              return_source_documents=True)

We ask a very specific question about Llama 3.1, namely the size of its context length. The LLM-generated response should be '128K':

![RAG Source](https://github.com/dcarpintero/generative-ai-101/blob/main/static/rag_source.png?raw=1)

##### 3.1 Model Inference with RAG & Source Citation

In [13]:
user_query = "how long is the context length in Llama 3.1 405B?"
llm_output = chain.invoke({"question": user_query, "chat_history": chat_history})

md(llm_output['answer'])

According to the text, the context length in Llama 3.1 405B is 128K.

Because we set `return_source_documents=True`, LangChain includes the retrieved sources in the response:

In [14]:
llm_output['source_documents']

[Document(id='240cde7a-9713-4fa3-99e6-558109118fd0', metadata={'source': 'https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md', 'title': 'llama-models/models/llama3_1/MODEL_CARD.md at main · meta-llama/llama-models · GitHub', 'description': 'Utilities intended for use with Llama models. Contribute to meta-llama/llama-models development by creating an account on GitHub.', 'language': 'en'}, page_content='Language\n\nLlama 3.1 8B Instruct\n\nLlama 3.1 70B Instruct\n\nLlama 3.1 405B Instruct\n\n\n\nGeneral\n\nMMLU (5-shot, macro_avg/acc)\n\nPortuguese\n   \n62.12\n   \n80.13\n   \n84.95\n   \n\n\nSpanish\n   \n62.45\n   \n80.05\n   \n85.08\n   \n\n\nItalian\n   \n61.63\n   \n80.4\n   \n85.04\n   \n\n\nGerman\n   \n60.59\n   \n79.27\n   \n84.36\n   \n\n\nFrench\n   \n62.34\n   \n79.82\n   \n84.66\n   \n\n\nHindi\n   \n50.88\n   \n74.52\n   \n80.31\n   \n\n\nThai\n   \n50.32\n   \n72.95\n   \n78.21'),
 Document(id='8b557c4a-6419-407b-b1ac-6b88cd12279c', metada

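Let's inspect the first source. This particular chunk is a multilingual MMLU benchmark table from the model card and does not itself state the context length, so the answer was presumably grounded in one of the other retrieved chunks: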

In [15]:
md(llm_output['source_documents'][0].page_content)

General, MMLU (5-shot, macro_avg/acc):

| Language | Llama 3.1 8B Instruct | Llama 3.1 70B Instruct | Llama 3.1 405B Instruct |
| --- | --- | --- | --- |
| Portuguese | 62.12 | 80.13 | 84.95 |
| Spanish | 62.45 | 80.05 | 85.08 |
| Italian | 61.63 | 80.4 | 85.04 |
| German | 60.59 | 79.27 | 84.36 |
| French | 62.34 | 79.82 | 84.66 |
| Hindi | 50.88 | 74.52 | 80.31 |
| Thai | 50.32 | 72.95 | 78.21 |
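
For a more compact citation than the full document dump, one could list only the distinct source URLs. The sketch below uses a stand-in `Doc` class (an assumption that mirrors only the `page_content` and `metadata` fields) and a hypothetical helper `cite_sources`.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Stand-in for a retrieved document; only the fields used here."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def cite_sources(documents):
    """Return the distinct source URLs, preserving retrieval order."""
    seen = []
    for doc in documents:
        source = doc.metadata.get("source", "unknown")
        if source not in seen:
            seen.append(source)
    return seen

# Made-up retrieval result: two chunks from one page, one from another.
retrieved = [
    Doc("chunk 1", {"source": "https://ai.meta.com/blog/meta-llama-3-1/"}),
    Doc("chunk 2", {"source": "https://ai.meta.com/blog/meta-llama-3-1/"}),
    Doc("chunk 3", {"source": "https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md"}),
]
for url in cite_sources(retrieved):
    print(url)
```

In the notebook, `cite_sources(llm_output['source_documents'])` would apply the same idea to the real retrieval result.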

##### 3.2 Follow-up Question with Chat History

In [16]:
chat_history = [(user_query, llm_output["answer"])]

Including the chat history allows the model to correctly infer the intent, namely that the user is asking about the context length of the '8b model':

In [17]:
user_query = "what about the 8b model?"
llm_output = chain.invoke({"question": user_query, "chat_history": chat_history})
md(llm_output['answer'])

According to the text, the context length in the 8B model is 128K.
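
This works because the chain first combines the chat history and the follow-up into a standalone question, and only then queries the retriever. A rough sketch of how such a rewrite prompt could be assembled follows; the template wording and the helper name are illustrative assumptions, not LangChain's exact prompt.

```python
def build_condense_prompt(chat_history, follow_up):
    """Format (question, answer) pairs plus a follow-up into a rewrite prompt."""
    turns = "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in chat_history)
    return (
        "Given the conversation below, rephrase the follow-up question "
        "as a standalone question.\n\n"
        f"Chat history:\n{turns}\n\n"
        f"Follow-up question: {follow_up}\n"
        "Standalone question:"
    )

history = [(
    "how long is the context length in Llama 3.1 405B?",
    "According to the text, the context length in Llama 3.1 405B is 128K.",
)]
condense_prompt = build_condense_prompt(history, "what about the 8b model?")
print(condense_prompt)
```

An LLM completing this prompt would be expected to produce a question that mentions both the 8B model and the context length, which gives the retriever far more to work with than the bare follow-up.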

##### 3.3 Same Question without Chat History is Not Accurate

In [18]:
user_query = "what about the 8b model?"
llm_output = chain.invoke({"question": user_query, "chat_history": []})
md(llm_output['answer'])

The text does not mention the "8b model". It does mention quantizing the 405B model from 16-bit (BF16) to 8-bit (FP8) numerics, but it does not mention an "8b model" specifically.

Without chat history, the retriever appears to fetch passages that merely approximate the semantic meaning of the word 'model' in the user question, and it fails to retrieve any information about the context length:

In [19]:
for doc in llm_output['source_documents']:
    md(doc.page_content)

Introducing Llama 3.1: Our most capable models to date

this blog post.)While this is our biggest model yet, we believe there’s still plenty of new ground to explore in the future, including more device-friendly sizes, additional modalities, and more investment at the agent platform layer.As always, we look forward to seeing all the amazing products and experiences the community will build with these models.This work was supported by our partners across the AI community. We’d like to thank and acknowledge (in alphabetical order): Accenture, Amazon

parameter model to improve the post-training quality of our smaller models.To support large-scale production inference for a model at the scale of the 405B, we quantized our models from 16-bit (BF16) to 8-bit (FP8) numerics, effectively lowering the compute requirements needed and allowing the model to run within a single server node.Instruction and chat fine-tuningWith Llama 3.1 405B, we strove to improve the helpfulness, quality, and detailed instruction-following capability of the model in

translation. With the release of the 405B model, we’re poised to supercharge innovation—with unprecedented opportunities for growth and exploration. We believe the latest generation of Llama will ignite new applications and modeling paradigms, including synthetic data generation to enable the improvement and training of smaller models, as well as model distillation—a capability that has never been achieved at this scale in open source.As part of this latest release, we’re introducing upgraded

### 4. Model Hallucination without RAG

Note that without RAG, the model generates an incorrect response, and the user cannot verify the information since no sources are available:

In [20]:
result = llm.invoke("how long is the context length in Llama 3.1 405B?")
md(result.content)

According to the official documentation, the context length in LLaMA 3.1 405B is 2048 tokens.

In other words, the model can process and respond to input sequences of up to 2048 tokens (or characters) in length.