## <font color='purple'>__Retrieval Augmented Generation__</font>

<font color='purple'>Retrieval Augmented Generation (RAG)</font> is a powerful paradigm in natural language processing that combines the strengths of information retrieval and language generation. This approach involves retrieving relevant information from a large dataset and using that information to enhance the generation of accurate text.  

The phrase <font color='purple'>Retrieval Augmented Generation</font> comes from a 2020 [paper by Lewis et al. from Facebook AI](https://research.facebook.com/publications/retrieval-augmented-generation-for-knowledge-intensive-nlp-tasks/). The idea is to use a pre-trained language model (LM) to generate text, and a separate retrieval system to find relevant documents to condition the LM on.



#### <font color='purple'>How it Works</font>

From start to finish, RAG relies on five steps:
![RAG steps](./images/rag.png)  

**1. Load**  
Load documents from different source files (_url, csv, pdf, txt_) in diverse locations (s3 storage, public sites, etc.)  

**2. Transform**  
Prepare larger documents for retrieval by splitting the data into chunks.  

**3. Embed**  
Create embeddings for documents to capture the semantic meaning of the text. This later enables models to efficiently find other pieces of text that are similar.  

**4. Store**  
Vector stores support efficient storage and search of document embeddings.  

**5. Retrieve**  
Relevant information is retrieved to produce more informed and context-aware responses.
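
To make the five steps concrete, here is a minimal, dependency-free sketch. It is an illustration only: a bag-of-words counter stands in for a real embedding model, a plain Python list stands in for a vector store, and all names (`split_into_chunks`, `embed`, `retrieve`) and sample texts are hypothetical.

```python
import math
import re
from collections import Counter

# 1. Load: in a real pipeline this text would come from files, URLs, or S3.
documents = [
    "Mistral 7B is a language model released by Mistral AI. "
    "It is distributed under the Apache 2.0 license.",
    "FAISS is a library for similarity search over dense vectors. "
    "It is often used as a vector store.",
]

# 2. Transform: split each document into sentence-sized chunks.
def split_into_chunks(text):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

chunks = [chunk for doc in documents for chunk in split_into_chunks(doc)]

# 3. Embed: a toy bag-of-words vector stands in for a neural embedding.
def embed(text):
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# 4. Store: keep (chunk, vector) pairs in a list, the simplest possible "vector store".
store = [(chunk, embed(chunk)) for chunk in chunks]

# 5. Retrieve: rank the stored chunks by similarity to the query.
def retrieve(query, top_k=1):
    query_vec = embed(query)
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in ranked[:top_k]]

top_chunks = retrieve("Which license is Mistral 7B distributed under?")
print(top_chunks)
```

A production pipeline swaps each toy piece for a real component (a document loader, a sentence-aware splitter, a neural embedding model, and a vector database), but the flow of data stays the same.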

At runtime, the retrieved chunks are passed to the LLM together with the user query, so the generated response draws on the retrieved content rather than on the model's training data alone.
![RAG runtime](./images/basic_rag.png)  

_Taken from: [https://docs.llamaindex.ai/en/stable/_static/getting_started/basic_rag.png](https://docs.llamaindex.ai/en/stable/_static/getting_started/basic_rag.png)_
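
The runtime flow in the diagram comes down to placing the retrieved text in front of the user's question before calling the LLM. The sketch below shows one way this could look. The function name `build_rag_prompt` and the template wording are assumptions made for illustration, not the exact template any particular library uses.

```python
def build_rag_prompt(question, retrieved_chunks):
    """Combine retrieved chunks and the user question into one prompt string."""
    # Number each chunk so the model (and the reader) can tell them apart.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        "Using only the context above, answer the question.\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_rag_prompt(
    "What license is Mistral 7B released under?",
    ["Mistral 7B is released under the Apache 2.0 license."],
)
print(prompt)
```

The resulting string is what gets sent to the LLM, so the model answers with the retrieved text in view.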


#### <font color='purple'>Sample Use Cases</font>

- **Question Answering Systems or Conversational Agents**:  
Retrieve information from vast knowledge bases, such as multiple PDF or CSV files, and incorporate it into the response (today's example).

- **Context Creation**:  
Enhance the generation of informative text by pulling in relevant details from a wide range of sources.

- **Code Generation**:  
Assist in generating code snippets by retrieving information from programming knowledge bases. 

- **Prevent Hallucinations**:  
Bring in external knowledge to check whether an LLM response is a hallucination. 
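
As a rough illustration of the last use case, the sketch below flags response sentences that share few words with any retrieved source. This is a deliberately crude heuristic under stated assumptions: the function name `unsupported_sentences`, the word-overlap measure, and the `0.5` threshold are all hypothetical, and word overlap is only a weak proxy for factual support.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def unsupported_sentences(response, sources, threshold=0.5):
    """Return response sentences whose word overlap with every source is below threshold."""
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", response.strip()):
        words = tokenize(sentence)
        if not words:
            continue
        # Fraction of the sentence's words that appear in the best-matching source.
        best = max(
            (len(words & tokenize(src)) / len(words) for src in sources),
            default=0.0,
        )
        if best < threshold:
            flagged.append(sentence)
    return flagged

sources = ["Mistral 7B is released under the Apache 2.0 license."]
response = (
    "Mistral 7B is released under the Apache 2.0 license. "
    "It was trained on the moon."
)
flagged = unsupported_sentences(response, sources)
print(flagged)
```

Here the second sentence is flagged because almost none of its words appear in the source.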

#### <font color='purple'>Example: Retrieving Information Non-existent in Training</font>

One way to use RAG is to feed the LLM up-to-date information.  
Llama 2 was trained between January 2023 and July 2023, and the Mistral 7B model was released in September 2023. Let's ask Llama 2 a question about the Mistral 7B model.

```
# Run the query through Llama2 13B chat model with test_llama2.py  
query = "[INST]What is a Mistral 7B language model?[/INST]"
```

__Output:__  
> [INST]What is a Mistral 7B language model?[/INST]  I'm not familiar with a "Mistral 7B language model." It's possible that this is a custom or proprietary language model developed by a specific organization or individual, and not a widely known or used model.
> 
> There are many language models available, each with their own strengths and weaknesses, and it's important to choose the right model for your specific use case. Some popular language models include BERT, RoBERTa, XLNet, and transformers. These models have been pre-trained on large datasets and can be fine-tuned for specific tasks such as sentiment analysis, question answering, and text classification.
> 
> If you have any more information about the Mistral 7B language model, such as its capabilities, performance, or the organization that developed it, I may be able to provide more assistance.
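
The `[INST]` and `[/INST]` markers in the query follow the Llama 2 chat prompt convention, which wraps each user turn in these tags and optionally places a system prompt inside `<<SYS>>` tags. A minimal sketch of a helper that builds such a prompt is shown below. The function name `format_llama2_prompt` is hypothetical, and the sketch assumes the tokenizer adds the beginning-of-sequence token itself.

```python
def format_llama2_prompt(user_message, system_prompt=None):
    """Wrap a single user turn in Llama 2 chat tags, with an optional system prompt."""
    if system_prompt:
        return (
            f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
            f"{user_message} [/INST]"
        )
    return f"[INST] {user_message} [/INST]"

query = format_llama2_prompt("What is a Mistral 7B language model?")
print(query)
```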


The release announcement for the Mistral 7B language model can be found here: https://mistral.ai/news/announcing-mistral-7b/

![Mistral 7B announcement](./images/mistral.png)  

To resolve this, we can use RAG to supply the model with details from the release announcement. The code below takes the contents of the webpage and follows the five steps outlined above to retrieve the relevant information.

```python
# === Load libraries
from pathlib import Path
import logging
import sys
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import StorageContext, load_index_from_storage
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.faiss import FaissVectorStore
import faiss
import requests
from bs4 import BeautifulSoup
import time

start_time = time.time()

logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

# === Download web content
url_link = "https://mistral.ai/news/announcing-mistral-7b/"
response = requests.get(url_link)
soup = BeautifulSoup(response.content, 'html.parser')
webpage_content = soup.get_text().strip()
data_folder = Path("./test_data")
if not data_folder.exists():
    data_folder.mkdir()
txt_file = data_folder / "webpage_content.txt"
with open(txt_file, "w") as f:
    f.write(webpage_content)

# Settings for embedding model
embedding_name = "intfloat/multilingual-e5-large-instruct"
embedding_chunk_size = 512
embed_model = HuggingFaceEmbedding(model_name=embedding_name, max_length=512)

# Settings for LLM
LLAMA2_13B_CHAT = "/kellogg/data/llm_models_opensource/llama2_meta_huggingface/models--meta-llama--Llama-2-13b-chat-hf/snapshots/29655417e51232f4f2b9b5d3e1418e5a9b04e80e"
selected_model = LLAMA2_13B_CHAT
llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    tokenizer_name=selected_model,
    model_name=selected_model,
    device_map="auto",
    # generate_kwargs={"temperature": 0.5, "top_p": 0.9, "top_k": 2, "do_sample": True},
)

# === 1. Load
documents = SimpleDirectoryReader(data_folder).load_data()

# Set up FAISS vector store
d = 1024 # embedding dimension
faiss_index = faiss.IndexFlatL2(d)
vector_store = FaissVectorStore(faiss_index=faiss_index)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents,
    # === 2. Transform
    transformations=[SentenceSplitter(chunk_size=embedding_chunk_size)],
    # === 3. Embed
    embed_model=embed_model,
    # === 4. Store
    storage_context=storage_context
)

# === 5. Retrieve
query_engine = index.as_query_engine(llm=llm)

print("====================================")
print(f"Selected LLM: {selected_model}")
query = "[INST]What is a Mistral 7B language model?[/INST]"
response = query_engine.query(query)
print("====================================")
print(f"Query: {query}")
print("Response: ")
print(response)
print("====================================")

print(f"Execution time: {time.time() - start_time} seconds")
```

__Output:__  
```
Selected LLM: /kellogg/data/llm_models_opensource/llama2_meta_huggingface/models--meta-llama--Llama-2-13b-chat-hf/snapshots/29655417e51232f4f2b9b5d3e1418e5a9b04e80e

Batches:   0%|          | 0/1 [00:00<?, ?it/s]
Batches: 100%|██████████| 1/1 [00:00<00:00, 66.66it/s]
====================================
Query: [INST]What is a Mistral 7B language model?[/INST]
Response: 
Based on the provided context information, a Mistral 7B language model is a powerful language model developed by Mistral AI. It is a 7.3 billion parameter model that outperforms other 7B models on various benchmarks and approaches the performance of CodeLlama 7B on code tasks while remaining good at English tasks. The model is released under the Apache 2.0 license and can be used without restrictions. It is easy to fine-tune for any task and has been demonstrated to outperform Llama 2 13B chat.
====================================
```
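
The script builds its vector store on `faiss.IndexFlatL2`, a flat index that compares the query against every stored vector using L2 (Euclidean) distance. As a rough illustration of that idea only, and not of how FAISS is implemented, the sketch below performs the same kind of exhaustive search in plain Python. The function names and the two-dimensional sample vectors are hypothetical. The real embeddings in the script have 1024 dimensions.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flat_l2_search(stored_vectors, query_vector, k=2):
    """Compare the query with every stored vector; return the k nearest as (index, distance)."""
    scored = [
        (i, squared_l2(vec, query_vector))
        for i, vec in enumerate(stored_vectors)
    ]
    # Smaller distance means more similar, so sort in ascending order.
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]

stored = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
neighbors = flat_l2_search(stored, [0.9, 1.2], k=2)
print(neighbors)
```

Ranking by squared distance gives the same ordering as ranking by the true distance, and an exhaustive scan like this is exact but grows linearly with the number of stored vectors.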

#### <font color='purple'>More sample scripts</font>  
More sample scripts can be found in the [scripts/rag](https://github.com/rs-kellogg/krs-openllm-cookbook/tree/main/scripts/rag) folder of our GitHub repo.