# Retrieval-Augmented Text Generation

In [None]:
import os
import langchain
import textwrap
import warnings

In [None]:
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
from langchain_community.llms import LlamaCpp
from langchain_core.callbacks import StreamingStdOutCallbackHandler
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_qdrant import Qdrant
from langchain_huggingface import HuggingFaceEmbeddings

In [None]:
from llama_cpp import Llama
from scipy import spatial
from qdrant_client import QdrantClient

In [None]:
from ssec_tutorials import (
    OLMO_MODEL,
    QDRANT_PATH,
    QDRANT_COLLECTION_NAME,
    download_qdrant_data,
)

In [None]:
warnings.filterwarnings("ignore")

Although LLMs are trained on large datasets, stale or incomplete training data can severely limit them. They face several challenges:

1. The models are trained on internet content, so they might not generate relevant output when prompted for information that is not publicly available on the internet.

2. The models are trained up to a certain date, so they might not generate relevant output when prompted about events and information that postdate the end of training.

3. The models are trained to be general-purpose. This means that they tend to produce generic outputs and might not perform as expected when prompted for specific deep-dive concepts related to a particular topic.

One way to dynamically integrate relevant external information is retrieval-augmented generation (RAG), which can help improve the reliability of LLM outputs. 

## RAG Framework

RAG addresses these issues by supplementing the prompt sent to the LLM with information that a retrieval model pulls from external sources, thereby giving the LLM more relevant input from which to generate responses. It allows you to use a pre-trained LLM without fine-tuning it or training your own LLM on your data.

![RAG Workflow](../../images/rag-workflow.webp)


Image Source: [Medium Blog](https://medium.com/@henryhengluo/intro-of-retrieval-augmented-generation-rag-and-application-demos-c1d9239ababf)

A RAG pipeline consists of three stages:

1. Retrieval
2. Augmentation
3. Generation
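
The three stages can be sketched end to end in plain Python. This is only a minimal illustration under loud assumptions: the three-sentence `KNOWLEDGE_BASE`, the word-overlap `retrieve` function, and the stub `generate` function are all hypothetical stand-ins, not the embedding-based retriever or the OLMo model used later in this tutorial.

```python
import re

# Hypothetical three-document knowledge base, used only for illustration.
KNOWLEDGE_BASE = [
    "Dark matter is matter that does not emit light but exerts gravity.",
    "Astropy provides tools for celestial coordinate transformations.",
    "The Sun is a star at the center of the solar system.",
]


def words(text):
    """Lowercase a text and return its set of words."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question, documents, k=1):
    """Retrieval: rank documents by how many words they share with the question."""
    ranked = sorted(
        documents,
        key=lambda doc: len(words(doc) & words(question)),
        reverse=True,
    )
    return ranked[:k]


def augment(question, documents):
    """Augmentation: place the retrieved text in the prompt next to the question."""
    context = "\n\n".join(documents)
    return f"Context: {context}\n\nQuestion: {question}"


def generate(prompt):
    """Generation: stand-in for an LLM call; it only reports the prompt size."""
    return f"(an LLM would answer here, given a {len(prompt)}-character prompt)"


question = "What is dark matter?"
retrieved = retrieve(question, KNOWLEDGE_BASE)
prompt = augment(question, retrieved)
print(prompt)
print(generate(prompt))
```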

## Retrieval

The retrieval phase can also be considered the data and query/prompt preparation phase, focusing on efficient information retrieval or data access. It contains tasks such as (1) indexing, (2) query manipulation, (3) data modification, (4) search, and (5) ranking, each of which can be tuned to improve your RAG pipeline. In this tutorial, we primarily focus on indexing and search.

`Indexing` enables fast and accurate information retrieval that sets up the context for any LLM to improve its response to a given user prompt or query. 

We will be indexing abstracts of astrophysics papers and the documentation of Astropy, a common core package for astronomy in Python.

### Embeddings

Embeddings, also called "vector embeddings," help LLMs develop a semantic understanding of the textual data they are trained on. In simpler terms, embedding models lay the groundwork for LLMs to perform tasks like sentence completion, similarity search, question answering, etc.

#### Vector

At the lowest level, machines only understand numeric values. For LLMs to work, natural language is converted into arrays of numeric values before being fed into the models. These arrays of numeric values are called "vectors."

An example of a vector: [2.5, 1.7, 3.3, 7.8]

The above is an example of a vector of size 4.

In [None]:
import numpy as np

vector = np.array([2.5, 1.7, 3.3, 7.8])
print(f"Vector: {vector}")

#### Tokens

We stated above that **"texts are converted into an array of numeric values called vectors"**.

But depending on your use case, each word, sentence, paragraph, or entire document can be represented as a vector. 

A token is the smallest unit of natural language that gets converted into a vector. It can be at the character level, sub-word level, word level, sentence level, paragraph level, or document level.

Example: Consider the text below.

`Earth is a planet of the solar system. There are 8 planets in the solar system. 
All planets revolve around the sun. Sun is a star.`


Case 1.) **Tokenizing the entire paragraph into a vector.**  
Tokenization: The entire paragraph is a single token.   
Vectorization: A single vector.  
Sample Vector Representation: [3.1, 6.8, 5.4, 8.0, 7.1]

Case 2.) **Tokenizing each sentence into vectors.**  
Tokenization: One token for each sentence (total 4 tokens)  
Vectorization: One vector for each sentence (total 4 vectors).   
Sample Vector Representation: [[1.2, 2.3, 3.8, 7.9, 0.8], [2.5, 3.0, 8.2, 6.6, 4.1], [3.2, 6.5, 8.1, 9.3, 1.4], [1.1, 0.7, 7.2, 3.5, 8.5]]

Case 3.) **Tokenizing each word in the paragraph into a vector.** There are 26 words in the paragraph, ignoring punctuation, and each word gets converted into a vector.  
Tokenization: One token for each word in the paragraph (26 tokens)  
Vectorization: One vector for each token (total 26 vectors).    
Sample Vector Representation: [[2.1, 3.2, 4.1, 9.8, 7.0], [8.2, 4.2, 7.1, 3.8, 2.0].....total 26 such representations]
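
The three cases can be checked with a few lines of Python applied to the example paragraph. This is a rough sketch that assumes sentences end with a period followed by whitespace and that a "word" is any run of letters or digits; real tokenizers use more careful rules.

```python
import re

text = (
    "Earth is a planet of the solar system. "
    "There are 8 planets in the solar system. "
    "All planets revolve around the sun. Sun is a star."
)

# Case 1: the whole paragraph is a single token.
paragraph_tokens = [text]

# Case 2: one token per sentence (split on whitespace that follows a period).
sentence_tokens = re.split(r"(?<=\.)\s+", text)

# Case 3: one token per word, ignoring punctuation.
word_tokens = re.findall(r"\w+", text)

print(len(paragraph_tokens), len(sentence_tokens), len(word_tokens))  # prints: 1 4 26
```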


#### Tokenizers

Tokenizers are components responsible for converting large texts into tokens (tokenization). Different types of pre-trained tokenizers are available. You can even train your own tokenizers. But for the scope of this tutorial, we will use a pre-trained one. 

Generally, a tokenizer works in two steps:

1. Break down the original text into tokens. These tokens could again be at the character, sub-word, word, sentence, paragraph, or document levels.
2. Assign a unique identifier to each of the tokens created.
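
The two steps can be sketched with a toy word-level tokenizer. The function names `tokenize`, `build_vocabulary`, and `encode` are hypothetical and chosen for this illustration only; pre-trained tokenizers usually work at the sub-word level and ship with a fixed vocabulary.

```python
import re


def tokenize(text):
    """Step 1: break the text into lowercase word-level tokens."""
    return re.findall(r"\w+", text.lower())


def build_vocabulary(texts):
    """Step 2: assign a unique integer ID to each distinct token."""
    vocabulary = {}
    for text in texts:
        for token in tokenize(text):
            # setdefault only assigns an ID the first time a token is seen
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


def encode(text, vocabulary):
    """Replace each token by its ID (raises KeyError for unseen tokens)."""
    return [vocabulary[token] for token in tokenize(text)]


vocabulary = build_vocabulary(["Earth is a planet.", "Sun is a star."])
print(vocabulary)
print(encode("Sun is a planet.", vocabulary))
```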

In [None]:
# For example, here is how you can split a short sentence into chunks of text
from langchain_text_splitters import CharacterTextSplitter

In [None]:
text_splitter = CharacterTextSplitter(
    separator=" ",
    chunk_size=10,
    chunk_overlap=0,
)
text_splitter.split_text(text="Earth is a planet in the solar system.")

[Learn more about how to split text into tokens in LangChain here.](https://python.langchain.com/v0.2/docs/how_to/split_by_token/) 

#### Embedding Models

A language model needs to understand how tokens are related to each other in the context of human language. To understand this semantic relationship, these tokens are converted into numerical vectors.

Embedding models are trained on these tokens to develop an "embedding space."

- Before training, the embedding model initializes an N-dimensional vector with random values for each token. (The value of N depends on the embedding model.)

- During training, the values of these vectors are updated across iterations. In this process, similar or related tokens are updated to have similarly valued vectors.

- After training, the collection of all the vectors corresponding to all the tokens is called the "embedding space."

- The embedding space is an encoded representation of the meanings of tokens and of inter-token relationships.
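
The idea of random initialization followed by iterative updates can be illustrated with a deliberately simplified toy. The update rule below, which just pulls the vectors of a hand-picked related pair toward each other, is an assumption made for illustration; real embedding models learn their vectors by minimizing a training loss over large text corpora.

```python
import math
import random

random.seed(0)  # make the random initialization reproducible

N = 4  # embedding dimension, tiny here for readability
tokens = ["earth", "mars", "hello"]

# Before training: one random N-dimensional vector per token.
embedding_space = {
    token: [random.uniform(-1.0, 1.0) for _ in range(N)] for token in tokens
}
distance_before = math.dist(embedding_space["earth"], embedding_space["mars"])

# "Training": repeatedly nudge the vectors of related tokens toward each other.
related_pairs = [("earth", "mars")]  # hypothetical training signal
learning_rate = 0.1
for _ in range(50):
    for left, right in related_pairs:
        u, v = embedding_space[left], embedding_space[right]
        for i in range(N):
            step = learning_rate * (v[i] - u[i])
            u[i] += step
            v[i] -= step

# After training: the related tokens have similarly valued vectors.
distance_after = math.dist(embedding_space["earth"], embedding_space["mars"])
print(f"earth-mars distance before: {distance_before:.3f}, after: {distance_after:.6f}")
```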

> We now embed our relevant documents (knowledge base) into a pre-trained embedding model. 

In [None]:
# Set up the embedding model; we are using the MiniLM model here
embeddings_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L12-v2"
)

In [None]:
query_result = embeddings_model.embed_query("Earth is a planet in the solar system.")

In [None]:
# Dimension of vector
len(query_result)

In [None]:
query_result[-3:]

In an embedding space, you can measure how similar two vectors are using the `dot product` or the `cosine similarity`.
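
Both measures are short enough to write out by hand. The sketch below uses small made-up vectors rather than real embeddings: the dot product grows with vector length, while cosine similarity depends only on direction.

```python
import math


def dot_product(a, b):
    """Sum of the element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    return dot_product(a, b) / (
        math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    )


a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, twice the length
c = [-3.0, 0.0, 1.0]  # perpendicular to a

print(dot_product(a, b), cosine_similarity(a, b))
print(dot_product(a, c), cosine_similarity(a, c))
```

SciPy's `spatial.distance.cosine` returns the cosine distance, that is, one minus this similarity, which is why the cells in this section compute `1 - spatial.distance.cosine(...)`.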

In [None]:
print(
    "Similarity:",
    1
    - spatial.distance.cosine(
        query_result,
        embeddings_model.embed_query("Mars is a planet in the solar system."),
    ),
)

In [None]:
print(
    "Similarity:",
    1
    - spatial.distance.cosine(
        query_result, embeddings_model.embed_query("Hello Tacoma.")
    ),
)

#### Vector Stores

Once the embeddings are created for our relevant documents or knowledge base, we need to store these embeddings in the database for fast retrieval. 

The type of databases that store these vector embeddings are called "Vector Stores." We will use a vector store called "Qdrant," as shown below. 

In the code below:
- The vector store works with the embedding model to create vector embeddings.
- The vector embeddings are stored in a collection in the Qdrant vector database.

We have already created a vector database that contains the astrophysics paper abstracts and Astropy's documentation. Please refer to the notebook in the Appendix for details.
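
Conceptually, a vector store keeps (vector, text) pairs and returns the texts whose vectors are closest to a query vector. The `TinyVectorStore` class below is a hypothetical in-memory sketch with hand-made 2-dimensional vectors; it is not how Qdrant is implemented, and production vector databases typically rely on approximate nearest-neighbour indexes instead of the brute-force scan shown here.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


class TinyVectorStore:
    """Hypothetical in-memory stand-in for a vector database."""

    def __init__(self):
        self._records = []  # list of (vector, text) pairs

    def add(self, vector, text):
        self._records.append((vector, text))

    def search(self, query_vector, k=1):
        """Return the k texts whose vectors are most similar to the query."""
        ranked = sorted(
            self._records,
            key=lambda record: cosine_similarity(query_vector, record[0]),
            reverse=True,
        )
        return [text for _, text in ranked[:k]]


store = TinyVectorStore()
store.add([0.9, 0.1], "Abstract about dark matter")
store.add([0.1, 0.9], "Documentation about coordinate transformations")

print(store.search([0.8, 0.2], k=1))
```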

In [None]:
download_qdrant_data()

In [None]:
QDRANT_PATH

In [None]:
QDRANT_COLLECTION_NAME

In [None]:
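# Set up Qdrant
if os.path.exists(QDRANT_PATH):
    print(f"Loading existing Qdrant collection '{QDRANT_COLLECTION_NAME}'")

    # Open the local, on-disk Qdrant database
    client = QdrantClient(path=QDRANT_PATH)

    # Wrap the collection as a LangChain vector store that embeds queries with our model
    qdrant = Qdrant(
        client=client,
        collection_name=QDRANT_COLLECTION_NAME,
        embeddings=embeddings_model,
    )
else:
    # Fail early instead of hitting a NameError on `qdrant` in later cells
    raise FileNotFoundError(f"Qdrant data not found at '{QDRANT_PATH}'")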

### Search

In [None]:
# Set up the retriever for a later step.
# MMR stands for Maximal Marginal Relevance:
# "MMR selects examples based on a combination of which examples are most
# similar to the inputs, while also optimizing for diversity. It does this by
# finding the examples with the embeddings that have the greatest cosine
# similarity with the inputs, and then iteratively adding them while penalizing
# them for closeness to already selected examples."
retriever = qdrant.as_retriever(search_type="mmr", search_kwargs={"k": 2})
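
The MMR selection described in the comment above can be sketched in plain Python. This is a minimal illustration of the general technique, not the code that runs inside the retriever: the `mmr` function, its `lambda_mult` weight (0.5 balances relevance and diversity equally), and the 2-dimensional vectors are all assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def mmr(query, candidates, k=2, lambda_mult=0.5):
    """Pick k candidate indices, trading off relevance against redundancy."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine_similarity(query, candidates[i])
            # Penalize closeness to anything that has already been selected.
            redundancy = max(
                (cosine_similarity(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [
    [1.0, 0.1],   # 0: very relevant
    [1.0, 0.12],  # 1: near-duplicate of candidate 0
    [0.8, -0.6],  # 2: less relevant, but different from candidate 0
]

top_k = sorted(
    range(len(candidates)),
    key=lambda i: cosine_similarity(query, candidates[i]),
    reverse=True,
)[:2]
print("Plain similarity search:", top_k)
print("MMR:", mmr(query, candidates, k=2))
```

Plain similarity search returns the two near-duplicates, whereas MMR keeps the most relevant candidate and then prefers the one that adds new information.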

In [None]:
retriever.invoke("What is dark matter?")

In [None]:
retriever.invoke("How can I perform celestial coordinate transformations?")

In [None]:
# Post-processing: join the text of the retrieved documents into one string
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

In [None]:
print(format_docs(retriever.invoke("What is dark matter?")))

In [None]:
print(
    format_docs(
        retriever.invoke("How can I perform celestial coordinate transformations?")
    )
)

## Augmentation & Generation

Now that we can retrieve the most relevant documents for a question, we can send them along with the prompt to give the LLM more context.

The result is referred to as a `retrieval-augmented prompt`.

In [None]:
olmo = LlamaCpp(
    model_path=str(OLMO_MODEL),
    temperature=0.8,
    verbose=False,
    n_ctx=2048,
    max_tokens=512,
)

In [None]:
# Create a prompt template using OLMo's tokenizer chat template we saw in module 1.
prompt_template = PromptTemplate.from_template(
    template=olmo.client.metadata["tokenizer.chat_template"],
    template_format="jinja2",
    partial_variables={"add_generation_prompt": True, "eos_token": "<|endoftext|>"},
)

In [None]:
prompt_template

In [None]:
# Test the prompt you want to send to OLMo.

question = "What is dark matter?"
context = format_docs(retriever.invoke(question))

prompt_template.format(
    messages=[
        {
            "role": "user",
            "content": f"""You are an astrophysics expert. Please answer the question on astrophysics based on the following context:

            Context: {context}
            
            Question: {question}""",
        }
    ]
)

In [None]:
# Test the prompt you want to send to OLMo.
question = "How can I perform celestial coordinate transformations?"
context = format_docs(retriever.invoke(question))

prompt_template.format(
    messages=[
        {
            "role": "user",
            "content": f"""You are an astrophysics expert. Please answer the question on astrophysics based on the following context:

            Context: {context}
            
            Question: {question}""",
        }
    ]
)

One way to generate the response with OLMo is to build `context` from the `question` beforehand, as shown above, create an `llm_chain`, and then `invoke` it with `messages`.

In [None]:
# Chain the prompt template and olmo
llm_chain = prompt_template | olmo
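
The `|` operator composes two steps so that the output of the first becomes the input of the second. The toy `Step` class below shows how such chaining can be built with Python's `__or__` method; it is a hypothetical illustration of the idea, not LangChain's implementation, and `fill_prompt` and `fake_llm` are made-up stand-ins for the prompt template and the model.

```python
class Step:
    """Toy pipeline step that supports chaining with the | operator."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # Build a new step that runs self first, then feeds the result to other.
        return Step(lambda value: other.invoke(self.invoke(value)))


fill_prompt = Step(lambda inputs: f"Question: {inputs['question']}")
fake_llm = Step(lambda prompt: f"[model reply to: {prompt}]")

chain = fill_prompt | fake_llm
print(chain.invoke({"question": "What is dark matter?"}))
```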

In [None]:
question = "What is dark matter?"
context = format_docs(retriever.invoke(question))

# Invoke the chain with a question and other parameters.
llm_chain.invoke(
    {
        "messages": [
            {
                "role": "user",
                "content": f"""You are an astrophysics expert. Please answer the question on astrophysics based on the following context:
    
                Context: {context}
                
                Question: {question}""",
            }
        ],
    },
    config={"callbacks": [StreamingStdOutCallbackHandler()]},
)

We can further use [LangChain's convenience functions](https://python.langchain.com/v0.2/docs/tutorials/rag/#built-in-chains) to streamline our pipeline using [create_stuff_documents_chain](https://api.python.langchain.com/en/latest/chains/langchain.chains.combine_documents.stuff.create_stuff_documents_chain.html) and [create_retrieval_chain](https://api.python.langchain.com/en/latest/chains/langchain.chains.retrieval.create_retrieval_chain.html).

`create_stuff_documents_chain` specifies how retrieved context is fed into a prompt and LLM. 

Looking at its signature, notice that it accepts a `prompt` argument of type `BasePromptTemplate`. For our pipeline, this prompt needs `context` and `input` as its input variables.

In [None]:
# Uncomment the line below and run the cell
# create_stuff_documents_chain?

In [None]:
# This is how we can transform our prompt_template, so that it accepts `context` and `input` as input_variables
transformed_prompt_template = PromptTemplate.from_template(
    prompt_template.partial(
        messages=[
            {
                "role": "user",
                "content": "You are an astrophysics expert. Please answer the question on astrophysics based on the following context. \
                            Context: {context} \
                            Question: {input}",
            }
        ]
    ).format()
)
transformed_prompt_template

In [None]:
document_chain = create_stuff_documents_chain(
    llm=olmo, prompt=transformed_prompt_template
)

We can run this by passing in the context directly:

In [None]:
question = "What is dark matter?"
document_chain.invoke(
    {
        "input": question,
        "context": retriever.invoke(question),
    }
)

However, we want the context to be dynamically generated using the passed input or question.

From LangChain's documentation: `create_retrieval_chain` adds the retrieval step and propagates the retrieved context through the chain, providing it alongside the final answer. It has input key `input`, and includes input, context, and answer in its output.
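
The data flow described above can be mimicked with plain functions. Everything in this sketch is a hypothetical stand-in (`toy_retriever`, `toy_document_chain`, `toy_retrieval_chain`); it only illustrates how an `input` key goes in and how `input`, `context`, and `answer` come out, not how LangChain builds the chain internally.

```python
def toy_retriever(question):
    """Hypothetical retriever: always returns the same two snippets."""
    return ["Snippet about dark matter.", "Snippet about galaxy rotation curves."]


def toy_document_chain(inputs):
    """Hypothetical document chain: stuffs the context into a prompt and 'answers'."""
    prompt = (
        "Context: " + "\n\n".join(inputs["context"]) + f"\nQuestion: {inputs['input']}"
    )
    return f"(answer generated from a {len(prompt)}-character prompt)"


def toy_retrieval_chain(inputs):
    # Retrieval step: look up context for the incoming question.
    context = toy_retriever(inputs["input"])
    # Generation step: pass both the question and the context along.
    answer = toy_document_chain({**inputs, "context": context})
    # The context is propagated to the output alongside the final answer.
    return {**inputs, "context": context, "answer": answer}


response = toy_retrieval_chain({"input": "What is dark matter?"})
print(sorted(response))
print(response["answer"])
```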

In [None]:
retrieval_chain = create_retrieval_chain(retriever, document_chain)

In [None]:
response = retrieval_chain.invoke({"input": "What is dark matter?"})

In [None]:
response

In [None]:
print(response["answer"])

In [None]:
response = retrieval_chain.invoke(
    {"input": "How many dimensions are there in the elemental abundances of stars?"}
)
print(response["answer"])