# Retrieval-Augmented Text Generation

In [3]:
import os
import langchain
import textwrap
import warnings

In [4]:
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
from langchain_community.llms import LlamaCpp
from langchain_core.callbacks import StreamingStdOutCallbackHandler
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_qdrant import Qdrant
from langchain_huggingface import HuggingFaceEmbeddings

In [18]:
from llama_cpp import Llama
from scipy import spatial
from qdrant_client import QdrantClient

In [26]:
from ssec_tutorials import OLMO_MODEL, QDRANT_PATH, QDRANT_COLLECTION_NAME, download_qdrant_data

In [6]:
warnings.filterwarnings('ignore')

Although LLMs are trained on large datasets, stale or missing data can severely limit them. They face several challenges:

1. The models are trained on internet content, so they might not generate relevant output when prompted for information that is not publicly available on the internet.

2. The models are trained up to a certain date, so they might not generate relevant output when prompted about events or information that appeared after the model's training cutoff.

3. The models are trained to be general-purpose. This means that they tend to produce generic outputs and might not perform as expected when prompted for deep dives into a specific topic.

One way to dynamically integrate relevant external information is retrieval-augmented generation (RAG), which can help improve the reliability of LLM outputs. 

## RAG Framework

RAG addresses these limitations by supplementing the prompt sent to the LLM with information retrieved from external sources, giving the LLM more relevant input from which to generate responses. It lets you use a pre-trained LLM without fine-tuning it or training your own LLM on your data.

![RAG Workflow](../../images/rag-workflow.webp)


Image Source: [Medium Blog](https://medium.com/@henryhengluo/intro-of-retrieval-augmented-generation-rag-and-application-demos-c1d9239ababf)

A RAG pipeline consists of three stages:

1. Retrieval
2. Augmentation
3. Generation
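
As a rough illustration of how the three stages fit together, here is a minimal sketch in plain Python. Everything in it is a stand-in chosen for this example: the three-sentence knowledge base is made up, `retrieve` ranks by simple word overlap, and `generate` is a stub that does not call any model.

```python
# Toy knowledge base; the documents are made up for illustration.
KNOWLEDGE_BASE = [
    "Dark matter does not emit light and is inferred from its gravity.",
    "Astropy provides tools for celestial coordinate transformations.",
    "A light curve records the brightness of an object over time.",
]


def words(text):
    """Lower-case a text and return its set of words without punctuation."""
    return {word.strip(".,?!").lower() for word in text.split()}


def retrieve(question, documents, k=1):
    """Retrieval: rank documents by how many words they share with the question."""
    ranked = sorted(
        documents,
        key=lambda doc: len(words(doc) & words(question)),
        reverse=True,
    )
    return ranked[:k]


def augment(question, documents):
    """Augmentation: place the retrieved text in the prompt next to the question."""
    context = "\n".join(documents)
    return f"Context: {context}\n\nQuestion: {question}"


def generate(prompt):
    """Generation: a stub standing in for the LLM call."""
    return f"(an LLM would answer here, given a prompt of {len(prompt)} characters)"


question = "What is dark matter?"
retrieved = retrieve(question, KNOWLEDGE_BASE)
prompt = augment(question, retrieved)
print(prompt)
print(generate(prompt))
```

Word overlap is only a placeholder for retrieval. It misses paraphrases and synonyms, which is the gap that embeddings are meant to close.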

## Retrieval

The retrieval phase can also be considered the data and query/prompt preparation phase, focusing on efficient information retrieval or data access. It includes tasks such as (1) indexing, (2) query manipulation, (3) data modification, (4) search, and (5) ranking. In this tutorial, we primarily focus on indexing and search.

`Indexing` enables fast and accurate information retrieval that sets up the context for any LLM to improve its response to a given user prompt or query. 

We will index abstracts of astrophysics papers and the documentation of Astropy, a common core package for astronomy in Python.

### Embeddings

Embeddings, also called "vector embeddings," help LLMs develop a semantic understanding of the textual data they are trained on. In simpler terms, embedding models lay the groundwork for LLMs to perform tasks like sentence completion, similarity search, question answering, etc.

#### Vector

At the lowest level, machines only understand numeric values. For LLMs to work, natural language is converted into arrays of numeric values before it is fed into the models. These arrays of numeric values are called "vectors."

An example of a vector: [2.5, 1.7, 3.3, 7.8]

The above is an example of a vector of size 4. 

In [7]:
import numpy as np

vector = np.array([2.5, 1.7, 3.3, 7.8])
print(f"Vector: {vector}") 

Vector: [2.5 1.7 3.3 7.8]


#### Tokens

We stated above that **"texts are converted into an array of numeric values called vectors"**.

But depending on your use case, each word, sentence, paragraph, or entire document can be represented as a vector. 

Tokens are the smallest natural language units converted into a vector. It could be at the character level, sub-word level, word level, sentence level, paragraph level, or document level.

Example: Consider the text below.

`Earth is a planet of the solar system. There are 8 planets in the solar system. 
All planets revolve around the sun. Sun is a star.`


Case 1.) **Tokenizing the entire paragraph into a vector.**  
Tokenization: The entire paragraph is a single token.   
Vectorization: A single vector.  
Sample Vector Representation: [3.1, 6.8, 5.4, 8.0, 7.1]

Case 2.) **Tokenizing each sentence into vectors.**  
Tokenization: One token for each sentence (total 4 tokens)  
Vectorization: One vector for each sentence (total 4 vectors).   
Sample Vector Representation: [[1.2, 2.3, 3.8, 7.9, 0.8], [2.5, 3.0, 8.2, 6.6, 4.1], [3.2, 6.5, 8.1, 9.3, 1.4], [1.1, 0.7, 7.2, 3.5, 8.5]]

Case 3.) **Tokenizing each word in the paragraph into a vector.**  
There are 26 words in the paragraph, ignoring punctuation, and each word is converted into a vector.  
Tokenization: One token for each word in the paragraph (total 26 tokens).  
Vectorization: One vector for each token (total 26 vectors).  
Sample Vector Representation: [[2.1, 3.2, 4.1, 9.8, 7.0], [8.2, 4.2, 7.1, 3.8, 2.0], ... 26 such representations in total]
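
The three cases can be checked with a few lines of standard-library Python. This is only a sketch of the counting: the regular expressions below are simple assumptions that work for this paragraph, not a general-purpose tokenizer.

```python
import re

paragraph = (
    "Earth is a planet of the solar system. "
    "There are 8 planets in the solar system. "
    "All planets revolve around the sun. Sun is a star."
)

# Case 1: the whole paragraph is a single token.
paragraph_tokens = [paragraph]

# Case 2: one token per sentence (split after each full stop).
sentence_tokens = re.split(r"(?<=\.)\s+", paragraph)

# Case 3: one token per word, ignoring punctuation.
word_tokens = re.findall(r"\w+", paragraph)

print(len(paragraph_tokens), len(sentence_tokens), len(word_tokens))
```

The three counts are 1, 4, and 26, matching the cases above. Each token would then be mapped to its own vector.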


#### Tokenizers

Tokenizers are components responsible for converting large texts into tokens (tokenization). Different types of pre-trained tokenizers are available. You can even train your own tokenizers. But for the scope of this tutorial, we will use a pre-trained one. 

Generally, a tokenizer performs the following steps:

1. Break down the original text into tokens. These tokens could again be at the character, sub-word, word, sentence, paragraph, or document levels.
2. Assign a unique identifier to each of the tokens created.
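
The two steps above can be sketched with a toy word-level tokenizer. The functions `build_vocabulary` and `encode` are hypothetical helpers written for this illustration. Pre-trained tokenizers typically work on sub-words and ship with a fixed vocabulary.

```python
def build_vocabulary(texts):
    """Assign a unique integer ID to every distinct word, in order of first appearance."""
    vocabulary = {}
    for text in texts:
        for token in text.lower().split():
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary)
    return vocabulary


def encode(text, vocabulary):
    """Step 1: split the text into word tokens. Step 2: look up each token's ID."""
    return [vocabulary[token] for token in text.lower().split()]


vocabulary = build_vocabulary(["the sun is a star", "the earth is a planet"])
token_ids = encode("the earth is a star", vocabulary)
print(vocabulary)
print(token_ids)
```

A word that is not in the vocabulary raises a `KeyError` in this sketch. Handling unseen words gracefully is one reason practical tokenizers fall back to smaller units such as sub-words or characters.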

In [8]:
# For example, here is how you can split a short sentence into chunks of text
from langchain_text_splitters import CharacterTextSplitter

In [7]:
text_splitter = CharacterTextSplitter(
    separator=" ",
    chunk_size=10,
    chunk_overlap=0,
)
text_splitter.split_text(text="Earth is a planet in the solar system.")

['Earth is a', 'planet in', 'the solar', 'system.']
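
To see why the chunks come out this way, here is a simplified sketch of the idea: split on the separator, then greedily pack pieces into chunks of at most `chunk_size` characters. The function `split_by_separator` is a hypothetical helper, not LangChain's implementation, and it ignores options such as `chunk_overlap`.

```python
def split_by_separator(text, separator=" ", chunk_size=10):
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters."""
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # The piece still fits, or it is a single oversized piece kept whole.
            current = candidate
        else:
            # The piece does not fit: close the current chunk and start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


chunks = split_by_separator("Earth is a planet in the solar system.")
print(chunks)
```

For this sentence the sketch produces the same four chunks as the cell above: `'Earth is a'` is exactly 10 characters, and adding `'planet'` would exceed the limit.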

[Learn more about how to split text into tokens in LangChain here.](https://python.langchain.com/v0.2/docs/how_to/split_by_token/) 

#### Embedding Models

A language model needs to understand how tokens are related to each other in the context of human language. To understand this semantic relationship, these tokens are converted into numerical vectors.

Embedding models are trained on these tokens to develop an "embedding space."

- Before the training, the embedding model initializes an N-dimensional 'vector' corresponding to each 'token' with random values. (Value of N depends on the embedding model)
  
- During the embedding model training, the values for these vectors are updated across iterations. In this process, similar or related tokens are updated to have similarly valued vectors.
  
- After the training, the collection of all the 'vectors' corresponding to all the tokens is called the "embedding space."

- "Embedding Space" is an encoded representation of meanings of tokens and inter-token relationships.
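
The following toy sketch mimics the process described above. Vectors start out random, and a made-up update rule nudges the vectors of related tokens toward each other. The update rule, the token list, and `related_pairs` are assumptions for illustration only. Real embedding models learn from large text corpora with neural networks and proper training objectives.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


random.seed(0)
N = 4  # embedding dimension; real models use hundreds of dimensions
tokens = ["earth", "mars", "hello"]

# Before training: one random N-dimensional vector per token.
embedding_space = {t: [random.uniform(-1, 1) for _ in range(N)] for t in tokens}
before = cosine_similarity(embedding_space["earth"], embedding_space["mars"])

# "Training": repeatedly pull the vectors of related tokens toward each other.
related_pairs = [("earth", "mars")]
learning_rate = 0.1
for _ in range(50):
    for left, right in related_pairs:
        a, b = embedding_space[left], embedding_space[right]
        embedding_space[left] = [x + learning_rate * (y - x) for x, y in zip(a, b)]
        embedding_space[right] = [y + learning_rate * (x - y) for x, y in zip(a, b)]

after = cosine_similarity(embedding_space["earth"], embedding_space["mars"])
print(f"earth/mars similarity before: {before:.3f}, after: {after:.3f}")
```

After the updates, the two related tokens point in nearly the same direction, while the unrelated token `hello` keeps its original random vector.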

> We now embed our relevant documents (knowledge base) using a pre-trained embedding model. 

In [9]:
# Set up the embedding model; we are using a MiniLM model here
embeddings_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L12-v2")

In [10]:
query_result = embeddings_model.embed_query("Earth is a planet in the solar system.")

In [11]:
# Dimension of vector
len(query_result)

384

In [12]:
query_result[-3:]

[0.05409970134496689, 0.07589352875947952, -0.04195248335599899]

In an embedding space, you can measure how similar two vectors are using the `dot product` or `cosine similarity`.

In [19]:
print("Similarity:", 1 - spatial.distance.cosine(query_result, embeddings_model.embed_query("Mars is a planet in the solar system.")))

Similarity: 0.7926038246541188


In [20]:
print("Similarity:", 1 - spatial.distance.cosine(query_result, embeddings_model.embed_query("Hello Tacoma.")))

Similarity: 0.08770955995845531
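
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. SciPy's `spatial.distance.cosine` returns the cosine *distance*, which is why the cells above compute `1 - distance`. A minimal plain-Python sketch of both measures, using small made-up vectors:

```python
import math


def dot_product(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product normalised by the lengths of both vectors."""
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
print(same_direction, orthogonal, opposite)
```

Up to floating-point rounding, the three values are 1, 0, and -1. Unlike the raw dot product, cosine similarity ignores vector length and compares direction only.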


#### Vector Stores

Once the embeddings are created for our relevant documents or knowledge base, we need to store these embeddings in the database for fast retrieval. 

Databases that store these vector embeddings are called "vector stores." We will use a vector store called "Qdrant," as shown below. 

In the code that sets up Qdrant later in this section:
- The vector store works together with the embedding model to create vector embeddings.
- The vector embeddings are stored in a Qdrant vector database collection.

We have already created a vector database that contains the astrophysics paper abstracts and Astropy's documentation. Please refer to the notebook in the Appendix. 
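
To make the idea of a vector store concrete, here is a minimal in-memory sketch. `ToyVectorStore`, `toy_embed`, and the five-word `VOCABULARY` are hypothetical names invented for this example. A real vector store such as Qdrant persists the data and uses index structures instead of scanning every entry.

```python
import math

VOCABULARY = ["dark", "matter", "coordinate", "frame", "planet"]


def toy_embed(text):
    """Stand-in embedding: count how often each vocabulary word occurs in the text."""
    words = [word.strip(".,?!") for word in text.lower().split()]
    return [float(words.count(term)) for term in VOCABULARY]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Keeps (text, vector) pairs in a list and scans all of them on every search."""

    def __init__(self, embed):
        self.embed = embed
        self.entries = []

    def add_texts(self, texts):
        # The store calls the embedding function and keeps the resulting vectors.
        for text in texts:
            self.entries.append((text, self.embed(text)))

    def similarity_search(self, query, k=1):
        query_vector = self.embed(query)
        ranked = sorted(
            self.entries,
            key=lambda entry: cosine_similarity(query_vector, entry[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore(toy_embed)
store.add_texts([
    "Dark matter dominates the matter content of the Universe.",
    "Astropy transforms a coordinate from one frame to another.",
    "Earth is a planet.",
])
results = store.similarity_search("What is dark matter?")
print(results)
```

The store embeds documents when they are added and embeds the query at search time, so the same embedding function must be used for both.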

In [22]:
download_qdrant_data()

Qdrant data already exists at /Users/a42/.cache/ssec_tutorials/scipy_qdrant


PosixPath('/Users/a42/.cache/ssec_tutorials/scipy_qdrant')

In [24]:
QDRANT_PATH

PosixPath('/Users/a42/.cache/ssec_tutorials/scipy_qdrant')

In [25]:
QDRANT_COLLECTION_NAME

'arxiv_astro-ph_abstracts_astropy_github_documentation'

In [23]:
# Setting up Qdrant
if os.path.exists(QDRANT_PATH):
    print(f"Loading existing Qdrant collection '{QDRANT_COLLECTION_NAME}'")
    
    client = QdrantClient(path=QDRANT_PATH)
    
    qdrant = Qdrant(
        client=client,
        collection_name=QDRANT_COLLECTION_NAME,
        embeddings=embeddings_model
    )

Loading existing Qdrant collection 'arxiv_astro-ph_abstracts_astropy_github_documentation'


### Search

In [27]:
# Set up the retriever for a later step.
# MMR stands for Maximal Marginal Relevance:
# "MMR selects examples based on a combination of which examples are most similar
# to the inputs, while also optimizing for diversity. It does this by finding the
# examples with the embeddings that have the greatest cosine similarity with the
# inputs, and then iteratively adding them while penalizing them for closeness to
# already selected examples."
retriever = qdrant.as_retriever(search_type="mmr", search_kwargs={"k": 2})
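
The selection rule behind MMR can be sketched in plain Python. This is an illustrative version of the general algorithm, not the code Qdrant or LangChain runs. The parameter name `relevance_weight` and the two-dimensional vectors are assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def mmr(query_vector, doc_vectors, k=2, relevance_weight=0.5):
    """Return the indices of k documents chosen by Maximal Marginal Relevance."""
    selected = []
    candidates = list(range(len(doc_vectors)))
    while candidates and len(selected) < k:

        def score(index):
            relevance = cosine_similarity(query_vector, doc_vectors[index])
            # Penalty: similarity to the closest document that was already selected.
            redundancy = max(
                (cosine_similarity(doc_vectors[index], doc_vectors[chosen]) for chosen in selected),
                default=0.0,
            )
            return relevance_weight * relevance - (1 - relevance_weight) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
docs = [
    [1.0, 0.10],   # very relevant
    [1.0, 0.12],   # near-duplicate of the first document
    [0.8, -0.60],  # less relevant, but different from the first two
]
diverse = mmr(query, docs, k=2, relevance_weight=0.5)
relevance_only = mmr(query, docs, k=2, relevance_weight=1.0)
print(diverse, relevance_only)
```

With equal weight on relevance and diversity, the near-duplicate is skipped in favour of the third document. With `relevance_weight=1.0` the rule reduces to plain similarity ranking and returns the two near-identical documents.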

In [28]:
retriever.invoke("What is dark matter?")

[Document(page_content='  I give a review of the development of the concept of dark matter. The dark\nmatter story passed through several stages from a minor observational puzzle to\na major challenge for theory of elementary particles. Modern data suggest that\ndark matter is the dominant matter component in the Universe, and that it\nconsists of some unknown non-baryonic particles. Properties of dark matter\nparticles determine the structure of the cosmic web.\n', metadata={'id': 1109.558, 'title': 'Dark matter', '_id': '363091ccc8f643fa9b51eed9aa157ad9', '_collection_name': 'arxiv_astro-ph_abstracts_astropy_github_documentation'}),
 Document(page_content='  Even though there are strong astrophysical and cosmological indications to\nsupport the existence of dark matter, its exact nature remains unknown. We\nexpect dark matter to produce standard model particles when annihilating or\ndecaying, assuming that it is composed of Weakly Interacting Massive Particles\n(WIMPs). These standar

In [29]:
retriever.invoke("How can I perform celestial coordinate transformations?")

 Document(page_content='  In Paper I, Greisen & Calabretta (2002) describe a generalized method for\nassigning physical coordinates to FITS image pixels. This paper implements this\nmethod for all spherical map projections likely to be of interest in astronomy.\nThe new methods encompass existing informal FITS spherical coordinate\nconventions and translations from them are described. Detailed examples of\nheader interpretation and construction are given.\n', metadata={'id': 'astro-ph/0207413', 'title': 'Representations of celestial coordinates in FITS', '_id': 'fc988d45866549c4ae09bed143c04b6d', '_collection_name': 'arxiv_astro-ph_abstracts_astropy_github_documentation'})]

In [30]:
# Post-processing
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

In [31]:
print(format_docs(retriever.invoke("What is dark matter?")))

  I give a review of the development of the concept of dark matter. The dark
matter story passed through several stages from a minor observational puzzle to
a major challenge for theory of elementary particles. Modern data suggest that
dark matter is the dominant matter component in the Universe, and that it
consists of some unknown non-baryonic particles. Properties of dark matter
particles determine the structure of the cosmic web.


  Even though there are strong astrophysical and cosmological indications to
support the existence of dark matter, its exact nature remains unknown. We
expect dark matter to produce standard model particles when annihilating or
decaying, assuming that it is composed of Weakly Interacting Massive Particles
(WIMPs). These standard model particles could in turn yield neutrinos that can
be detected by the IceCube neutrino telescope. The Milky Way is expected to be
permeated by a dark matter halo with an increased density towards its centre.
This halo is expe

In [32]:
print(format_docs(retriever.invoke("How can I perform celestial coordinate transformations?")))

.. _astropy-coordinates-transforming:

Transforming between Systems
****************************

`astropy.coordinates` supports a rich system for transforming
coordinates from one frame to another. While common astronomy frames
are built into Astropy, the transformation infrastructure is dynamic.
This means it allows users to define new coordinate frames and their
transformations. The topic of writing your own coordinate frame or
transforms is detailed in :ref:`astropy-coordinates-design`, and this
section is focused on how to *use* transformations.

The full list of built-in coordinate frames, the included transformations,
and the frame names are shown as a (clickable) graph in the
`~astropy.coordinates` API documentation.

Examples
--------

..
  EXAMPLE START
  Transforming Coordinates to Another Frame

The recommended method of transformation is shown below::

    >>> import astropy.units as u
    >>> from astropy.coordinates import SkyCoord
    >>> gc = SkyCoord(l=0*u.degree, b=4

## Augmentation & Generation

Now that we can retrieve the most relevant documents for a question, we can send them along with the prompt to give the LLM more context.

The combined prompt is also referred to as a `retrieval-augmented prompt`.
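
As a small sketch of what such a prompt looks like before any chat template is applied, the hypothetical helper `build_augmented_prompt` below joins the retrieved text and the question. The two document strings are shortened placeholders, not real retriever output.

```python
def build_augmented_prompt(question, documents):
    """Combine retrieved text and the user question into a single prompt string."""
    context = "\n\n".join(documents)
    lines = [
        "You are an astrophysics expert. Please answer the question on astrophysics "
        "based on the following context:",
        "",
        f"Context: {context}",
        "",
        f"Question: {question}",
    ]
    return "\n".join(lines)


documents = [
    "Modern data suggest that dark matter is the dominant matter component in the Universe.",
    "The exact nature of dark matter remains unknown.",
]
prompt = build_augmented_prompt("What is dark matter?", documents)
print(prompt)
```

Building the prompt from a list of lines keeps source-code indentation out of the text. A triple-quoted f-string written inside indented code carries its leading spaces into the prompt.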

In [33]:
olmo = LlamaCpp(
    model_path=str(OLMO_MODEL),
    temperature=0.8,
    verbose=False,  
    n_ctx=2048,
    max_tokens=512,
)

In [34]:
# Create a prompt template using OLMo's tokenizer chat template we saw in module 1.
prompt_template = PromptTemplate.from_template(
    template=olmo.client.metadata['tokenizer.chat_template'], 
    template_format="jinja2",
    partial_variables={"add_generation_prompt": True, "eos_token": "<|endoftext|>"},
)

In [35]:
prompt_template

PromptTemplate(input_variables=['messages'], partial_variables={'add_generation_prompt': True, 'eos_token': '<|endoftext|>'}, template="{{ eos_token }}{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n'  + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}", template_format='jinja2')

In [36]:
# Test the prompt you want to send to OLMo.

question = "What is dark matter?"
context = format_docs(retriever.invoke(question))

prompt_template.format(
    messages=[
        {
            "role": "user", 
            "content": f"""You are an astrophysics expert. Please answer the question on astrophysics based on the following context:

            Context: {context}
            
            Question: {question}"""
        }
    ]
)

'<|endoftext|>\n\n<|user|>\nYou are an astrophysics expert. Please answer the question on astrophysics based on the following context:\n\n            Context:   I give a review of the development of the concept of dark matter. The dark\nmatter story passed through several stages from a minor observational puzzle to\na major challenge for theory of elementary particles. Modern data suggest that\ndark matter is the dominant matter component in the Universe, and that it\nconsists of some unknown non-baryonic particles. Properties of dark matter\nparticles determine the structure of the cosmic web.\n\n\n  Even though there are strong astrophysical and cosmological indications to\nsupport the existence of dark matter, its exact nature remains unknown. We\nexpect dark matter to produce standard model particles when annihilating or\ndecaying, assuming that it is composed of Weakly Interacting Massive Particles\n(WIMPs). These standard model particles could in turn yield neutrinos that can\nbe

In [37]:
# Test the prompt you want to send to OLMo.
question = "How can I perform celestial coordinate transformations?"
context = format_docs(retriever.invoke(question))

prompt_template.format(
    messages=[
        {
            "role": "user", 
            "content": f"""You are an astrophysics expert. Please answer the question on astrophysics based on the following context:

            Context: {context}
            
            Question: {question}"""
        }
    ]
)



One way to generate the response with OLMo is to build the `context` from the `question` beforehand, as shown above, then create an `llm_chain` and `invoke` it with `messages`.

In [38]:
# Chain the prompt template and olmo
llm_chain = prompt_template | olmo
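
Conceptually, the `|` operator composes two steps so that the output of the left one becomes the input of the right one. The toy `Step` class below illustrates that idea with Python's `__or__` method. It is a hypothetical stand-in, not how LangChain implements its runnables, and `fake_llm` does not call any model.

```python
class Step:
    """Hypothetical stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # `self | other` builds a new step that feeds self's output into other.
        return Step(lambda value: other.invoke(self.invoke(value)))


fill_template = Step(lambda variables: f"<|user|>\n{variables['question']}\n<|assistant|>")
fake_llm = Step(lambda prompt: f"(model output for a {len(prompt)}-character prompt)")

toy_chain = fill_template | fake_llm
result = toy_chain.invoke({"question": "What is dark matter?"})
print(result)
```

Invoking `toy_chain` first fills the template and then passes the resulting string to the stubbed model, which is the same order of operations as in `prompt_template | olmo`.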

In [39]:
question = "What is dark matter?"
context = format_docs(retriever.invoke(question))

# Invoke the chain with a question and other parameters. 
llm_chain.invoke(
    {
        "messages":
            [{
                "role": "user", 
                "content": f"""You are an astrophysics expert. Please answer the question on astrophysics based on the following context:
    
                Context: {context}
                
                Question: {question}"""
            }
        ], 
    },
    config={
        'callbacks' : [StreamingStdOutCallbackHandler()]
    }
)

 Dark matter is a theoretical particle that is believed to make up most of the matter in the universe, but does not emit any light or radiation, making it invisible to our current technology. This mysterious substance was discovered through its gravitational effects on visible matter such as stars, galaxies, and the large-scale structure of the universe. Since it does not interact with light, dark matter is often referred to as "non-baryonic" because it does not contain any baryonic matter (i.e., ordinary atoms).

The exact nature of dark matter remains unknown, but based on current observations and theoretical models, it is believed that dark matter particles likely do not have charges or spin, and are stable, meaning they have not decayed into visible matter yet. Some possibilities for what dark matter may be include weakly interacting massive particles (WIMPs) such as neutralinos, axion-like particles (ALPs), or other unknown particles that could interact via gravity only.

In the c

' Dark matter is a theoretical particle that is believed to make up most of the matter in the universe, but does not emit any light or radiation, making it invisible to our current technology. This mysterious substance was discovered through its gravitational effects on visible matter such as stars, galaxies, and the large-scale structure of the universe. Since it does not interact with light, dark matter is often referred to as "non-baryonic" because it does not contain any baryonic matter (i.e., ordinary atoms).\n\nThe exact nature of dark matter remains unknown, but based on current observations and theoretical models, it is believed that dark matter particles likely do not have charges or spin, and are stable, meaning they have not decayed into visible matter yet. Some possibilities for what dark matter may be include weakly interacting massive particles (WIMPs) such as neutralinos, axion-like particles (ALPs), or other unknown particles that could interact via gravity only.\n\nIn 

We can further use [LangChain's convenience functions](https://python.langchain.com/v0.2/docs/tutorials/rag/#built-in-chains) to streamline our pipeline using [create_stuff_documents_chain](https://api.python.langchain.com/en/latest/chains/langchain.chains.combine_documents.stuff.create_stuff_documents_chain.html) and [create_retrieval_chain](https://api.python.langchain.com/en/latest/chains/langchain.chains.retrieval.create_retrieval_chain.html).

`create_stuff_documents_chain` specifies how retrieved context is fed into a prompt and LLM. 

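Looking at its signature, notice that it accepts a `prompt` argument of type `BasePromptTemplate`. The prompt must contain the input variable `context`, where the retrieved documents are inserted. We also use `input` for the question, which is the key that `create_retrieval_chain` passes along.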

In [40]:
# Uncomment below line and run the cell
#create_stuff_documents_chain?

In [41]:
# This is how we can transform our prompt_template so that it accepts `context` and `input` as input_variables
transformed_prompt_template = PromptTemplate.from_template(
    prompt_template.partial(
        messages=[
            {
                "role": "user", 
                "content": "You are an astrophysics expert. Please answer the question on astrophysics based on the following context. \
                            Context: {context} \
                            Question: {input}"
            }
        ]
    ).format()
)
transformed_prompt_template

PromptTemplate(input_variables=['context', 'input'], template='<|endoftext|>\n\n<|user|>\nYou are an astrophysics expert. Please answer the question on astrophysics based on the following context.                             Context: {context}                             Question: {input}\n\n\n<|assistant|>\n\n')

In [42]:
document_chain = create_stuff_documents_chain(
    llm=olmo, 
    prompt=transformed_prompt_template
)

We can run this by passing in the context directly:

In [43]:
question = "What is dark matter?"
document_chain.invoke({
    "input": question,
    "context": retriever.invoke(question),
})

'Dark matter is a theoretical phenomenon that explains the observed gravitational effects in the universe, which cannot be attributed to the presence of visible or ordinary matter. In the context provided, dark matter refers to non-baryonic matter particles that do not emit, absorb or reflect any electromagnetic radiation and are therefore invisible to us. The term "dark" does not refer to its color but rather to its lack of visibility due to this absence of electromagnetic radiation.\n\nDark matter is believed to make up approximately 85% of the total matter content in the universe (excluding ordinary particles that are composed of quarks and leptons). It has a significant impact on the cosmic structure, including the formation and evolution of galaxies, stars, and star systems. However, direct evidence for dark matter particles remains elusive due to their weak interactions with other types of matter.\n\nThe existence of dark matter is based on various observational indications from 

However, we want the context to be dynamically generated using the passed input or question.

From LangChain's documentation: `create_retrieval_chain` adds the retrieval step and propagates the retrieved context through the chain, providing it alongside the final answer. It has input key `input`, and includes input, context, and answer in its output.
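
The shape of that output can be sketched with plain functions. `make_retrieval_chain`, `toy_retrieve`, and `toy_answer` are hypothetical stand-ins written for this illustration. They only mirror the behaviour described in the quote above and do not use LangChain.

```python
def make_retrieval_chain(retrieve, answer_from_documents):
    """Return a function that retrieves context for the input and then answers with it."""

    def invoke(inputs):
        question = inputs["input"]
        documents = retrieve(question)
        answer = answer_from_documents({"input": question, "context": documents})
        # Input, retrieved context, and answer are all returned together.
        return {"input": question, "context": documents, "answer": answer}

    return invoke


def toy_retrieve(question):
    """Stub retriever that always returns the same made-up document."""
    return ["Dark matter is inferred from its gravitational effects."]


def toy_answer(inputs):
    """Stub for the document chain; a real one would call the LLM."""
    return f"(answer to '{inputs['input']}' using {len(inputs['context'])} document(s))"


toy_chain = make_retrieval_chain(toy_retrieve, toy_answer)
response = toy_chain({"input": "What is dark matter?"})
print(sorted(response))
print(response["answer"])
```

Returning the context alongside the answer makes it possible to inspect which documents the answer was based on.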

In [44]:
retrieval_chain = create_retrieval_chain(retriever, document_chain)

In [54]:
response = retrieval_chain.invoke({"input": "What is dark matter?"})

In [55]:
response

{'input': 'What is dark matter?',
 'context': [Document(page_content='  I give a review of the development of the concept of dark matter. The dark\nmatter story passed through several stages from a minor observational puzzle to\na major challenge for theory of elementary particles. Modern data suggest that\ndark matter is the dominant matter component in the Universe, and that it\nconsists of some unknown non-baryonic particles. Properties of dark matter\nparticles determine the structure of the cosmic web.\n', metadata={'id': 1109.558, 'title': 'Dark matter', '_id': '363091ccc8f643fa9b51eed9aa157ad9', '_collection_name': 'arxiv_astro-ph_abstracts_astropy_github_documentation'}),
  Document(page_content='  Even though there are strong astrophysical and cosmological indications to\nsupport the existence of dark matter, its exact nature remains unknown. We\nexpect dark matter to produce standard model particles when annihilating or\ndecaying, assuming that it is composed of Weakly Intera

In [56]:
print(response["answer"])

 Dark matter is a theoretical concept in astrophysics and cosmology that describes the existence of a form of matter that cannot be seen or detected directly, but still has an effect on the observable universe. This mysterious substance accounts for approximately 85% of the total mass-energy content of the universe, and its absence from visible light is known as "dark" matter.

Dark matter particles are assumed to have weakly interacting massive particles (WIMPs), which means that they interact very weakly with other subatomic particles or even light itself. These properties allow dark matter to be a significant component of our cosmic neighborhood without being easily detected by any direct observation method. The exact nature and properties of dark matter remain unknown, making it an intriguing and active field of research in astrophysics and particle physics.

In this context, the Milky Way is expected to be permeated with a dark matter halo that has a higher density towards its cen

In [57]:
response = retrieval_chain.invoke({"input": "How many dimensions are there in the elemental abundances of stars?"})
print(response["answer"])

 The provided context discusses the detailed abundance analysis of 140 stars using the method presented by Erspamer and North (2002). The study reveals a correlation between [Fe/H] and logg for stars within 1.8 to 2.0 Mo, but does not find an anticorrelation between vsini and elemental abundances as suggested in Varenne and Monier (1999).

To answer your question, the number of dimensions in the elemental abundances of stars is two - the spatial dimension represented by the location of the star in the solar neighbourhood, and the radial velocity dimension, which is also two-dimensional. In this context, "dimensions" refers to the number of independent variables being considered.
