In [1]:
import os

from langchain_community.vectorstores import Qdrant
from langchain_community.embeddings import HuggingFaceEmbeddings

from qdrant_client import QdrantClient

In [2]:
import langchain
import textwrap

In [3]:
from langchain_core.prompts import PromptTemplate
from llama_cpp import Llama

In [269]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_community.llms import LlamaCpp
from langchain_core.callbacks import StreamingStdOutCallbackHandler
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

# Retrieval-Augmented Text Generation

Although LLMs are trained on large datasets, stale or missing data can severely limit them. They face several challenges:

1. The models are trained on internet content, so they might not generate relevant output when prompted for information that is not publicly available on the internet.

2. The models are trained up to a certain date, so they might not generate relevant output when prompted about events or information that appeared after the model's training cutoff date.

3. The models are trained to be more generalized. This means that they can only produce generic outputs and might not perform as expected when prompted for specific deep-dive concepts related to a particular topic.

One way to dynamically integrate relevant external information is retrieval-augmented generation (RAG), which can help improve the reliability of LLM outputs. 

## RAG Framework

RAG addresses these issues by supplementing the prompt sent to the LLM with information from external sources, obtained through a retrieval model, thereby providing the LLM with more relevant input for generating responses. It allows you to use pre-trained LLMs without fine-tuning them or training your own LLM on your own data.

<center><img src="../../images/rag-workflow.webp" width="60%"/></center>


Image Source: [Medium Blog](https://medium.com/@henryhengluo/intro-of-retrieval-augmented-generation-rag-and-application-demos-c1d9239ababf)

A RAG pipeline combines three concepts:

1. Retrieval
2. Augmentation
3. Generation

## Retrieval

The retrieval phase can also be considered the data and query/prompt preparation phase, focusing on efficient information retrieval or data access. It includes tasks such as (1) indexing, (2) query manipulation, (3) data modification, (4) search, and (5) ranking, each of which can be tuned to improve your RAG pipeline. In this tutorial, we primarily focus on indexing and search.

`Indexing` enables fast and accurate information retrieval that sets up the context for any LLM to improve its response to a given user prompt or query. 

We will be indexing abstracts of arXiv astrophysics (astro-ph) papers and the documentation of Astropy, a common core package for astronomy in Python.

### Embeddings

Embeddings, also called "vector embeddings," help LLMs develop a semantic understanding of the textual data they are trained on. In simpler terms, embedding models lay the groundwork for LLMs to perform tasks like sentence completion, similarity search, question answering, etc.

#### Vector

At the lowest level, machines only understand numeric values. For LLMs to work, natural language is converted into arrays of numeric values before being fed into the models. These arrays of numeric values are called "vectors."

An example of a vector: [2.5, 1.7, 3.3, 7.8]

The above is an example of a vector of size 4. 

In [5]:
import numpy as np

vector = np.array([2.5, 1.7, 3.3, 7.8])
print(f"Vector: {vector}") 

Vector: [2.5 1.7 3.3 7.8]


#### Tokens

We stated above that **"texts are converted into an array of numeric values called vectors"**.

But depending on your use case, each word, sentence, paragraph, or entire document can be represented as a vector. 

Tokens are the smallest natural language units converted into a vector. It could be at the character level, sub-word level, word level, sentence level, paragraph level, or document level.

Example: Consider the text below.

`Earth is a planet of the solar system. There are 8 planets in the solar system. 
All planets revolve around the sun. Sun is a star.`


Case 1.) **Tokenizing the entire paragraph into a vector.**  
Tokenization: The entire paragraph is a single token.   
Vectorization: A single vector.  
Sample Vector Representation: [3.1, 6.8, 5.4, 8.0, 7.1]

Case 2.) **Tokenizing each sentence into vectors.**  
Tokenization: One token for each sentence (total 4 tokens)  
Vectorization: One vector for each sentence (total 4 vectors).   
Sample Vector Representation: [[1.2, 2.3, 3.8, 7.9, 0.8], [2.5, 3.0, 8.2, 6.6, 4.1], [3.2, 6.5, 8.1, 9.3, 1.4], [1.1, 0.7, 7.2, 3.5, 8.5]]

Case 3.) **Tokenizing each word in the paragraph into a vector.** There are 26 words in the paragraph, ignoring punctuation, and each word gets converted into a vector.  
Tokenization: One token for each word in the paragraph (total 26 tokens).  
Vectorization: One vector for each token (total 26 vectors).  
Sample Vector Representation: [[2.1, 3.2, 4.1, 9.8, 7.0], [8.2, 4.2, 7.1, 3.8, 2.0], ... (26 such representations in total)]
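
The three granularities above can be checked with a few lines of plain Python. This is only a sketch of the counting: it uses regular expressions rather than a real tokenizer, and the variable names are illustrative.

```python
import re

# The sample paragraph from the text above.
paragraph = (
    "Earth is a planet of the solar system. "
    "There are 8 planets in the solar system. "
    "All planets revolve around the sun. Sun is a star."
)

# Case 1: the whole paragraph is a single token.
paragraph_tokens = [paragraph]

# Case 2: one token per sentence (split after sentence-ending punctuation).
sentence_tokens = re.split(r"(?<=[.!?])\s+", paragraph)

# Case 3: one token per word, ignoring punctuation.
word_tokens = re.findall(r"\w+", paragraph)

print(len(paragraph_tokens), len(sentence_tokens), len(word_tokens))  # 1 4 26
```

Each token would then be mapped to its own vector, so the three cases yield 1, 4, and 26 vectors respectively.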


#### Tokenizers

Tokenizers are components responsible for converting large texts into tokens (tokenization). Different types of pre-trained tokenizers are available. You can even train your own tokenizers. But for the scope of this tutorial, we will use a pre-trained one. 

Generally, each tokenizer performs the following steps:

1. Break down the original text into tokens. These tokens could again be at the character, sub-word, word, sentence, paragraph, or document levels.
2. Assign a unique identifier to each of the tokens created.
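
As a rough illustration of these two steps, here is a toy word-level tokenizer. It is a sketch under simple assumptions (whitespace splitting, IDs assigned in order of first appearance); pre-trained tokenizers typically work on sub-words and ship with a fixed vocabulary. The helper name `build_vocab` is hypothetical.

```python
def build_vocab(tokens):
    """Assign a unique integer ID to each distinct token, in order of first appearance."""
    vocab = {}
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

text = "the sun is a star and the earth is a planet"

# Step 1: break the text into tokens (word level here).
tokens = text.split()

# Step 2: assign a unique identifier to each token.
vocab = build_vocab(tokens)
token_ids = [vocab[token] for token in tokens]

print(token_ids)  # [0, 1, 2, 3, 4, 5, 0, 6, 2, 3, 7]
```

Repeated words such as "the" map to the same ID, which is what lets a model treat them as the same token.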

In [6]:
# For example, here is how you can split a short sentence into chunks of text
from langchain_text_splitters import CharacterTextSplitter

In [7]:
text_splitter = CharacterTextSplitter(
    separator=" ",
    chunk_size=10,
    chunk_overlap=0,
)
text_splitter.split_text(text="Earth is a planet in the solar system.")

['Earth is a', 'planet in', 'the solar', 'system.']

[Learn more about how to split text into tokens in LangChain here.](https://python.langchain.com/v0.2/docs/how_to/split_by_token/) 
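
To see why the splitter above returns those four chunks, the following sketch greedily merges separator-delimited pieces until the next piece would exceed `chunk_size`. It is an approximation written for this example, not LangChain's implementation (which also handles overlap and other details), and `greedy_split` is a hypothetical helper.

```python
def greedy_split(text, separator=" ", chunk_size=10):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size characters."""
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            # The piece still fits (or it starts a new chunk, even if oversized).
            current = candidate
        else:
            # The piece does not fit: close the current chunk and start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

print(greedy_split("Earth is a planet in the solar system."))
# ['Earth is a', 'planet in', 'the solar', 'system.']
```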

#### Embedding Models

A language model needs to understand how tokens are related to each other in the context of human language. To understand this semantic relationship, these tokens are converted into numerical vectors.

Embedding models are trained on these tokens to develop an "embedding space."

- Before the training, the embedding model initializes an N-dimensional 'vector' corresponding to each 'token' with random values. (Value of N depends on the embedding model)
  
- During the embedding model training, the values for these vectors are updated across iterations. In this process, similar or related tokens are updated to have similarly valued vectors.
  
- After the training, the collection of all the 'vectors' corresponding to all the tokens is called the "embedding space."

- "Embedding Space" is an encoded representation of meanings of tokens and inter-token relationships.
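
The bullets above can be mimicked with a toy example. The sketch below starts from random vectors and repeatedly nudges two related tokens toward each other; this nudge is only a stand-in for training and is not the objective any real embedding model uses. The vocabulary, dimension, and learning rate are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

vocab = {"earth": 0, "mars": 1, "hello": 2}
N = 4  # embedding dimension (real models use hundreds)

# Before training: one random N-dimensional vector per token.
embedding_space = rng.normal(size=(len(vocab), N))

def distance(a, b):
    """Euclidean distance between the vectors of two tokens."""
    return float(np.linalg.norm(embedding_space[vocab[a]] - embedding_space[vocab[b]]))

before = distance("earth", "mars")

# Toy "training": pull the two related tokens toward each other on every iteration.
learning_rate = 0.1
for _ in range(10):
    earth = embedding_space[vocab["earth"]].copy()
    mars = embedding_space[vocab["mars"]].copy()
    embedding_space[vocab["earth"]] += learning_rate * (mars - earth)
    embedding_space[vocab["mars"]] += learning_rate * (earth - mars)

after = distance("earth", "mars")
print(f"earth-mars distance before: {before:.3f}, after: {after:.3f}")
```

After the loop, the two related tokens sit closer together than they started, while the unrelated token is untouched.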

> We now embed our relevant documents (knowledge base) using a pre-trained embedding model. 

In [8]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from sentence_transformers import SentenceTransformer, util

In [9]:
# Set up the embedding model; we are using the MiniLM model here
embeddings_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L12-v2")

In [10]:
query_result = embeddings_model.embed_query("Earth is a planet in the solar system.")

In [11]:
# Dimension of vector
len(query_result)

384

In [12]:
query_result[-3:]

[0.05409970507025719, 0.07589351385831833, -0.04195251315832138]

In an embedding space, you can measure how similar two vectors are using the `dot product` or `cosine similarity`.

In [13]:
print("Similarity:", util.dot_score(query_result, embeddings_model.embed_query("Mars is a planet in the solar system.")))

Similarity: tensor([[0.7926]])


In [14]:
print("Similarity:", util.dot_score(query_result, embeddings_model.embed_query("Hello Tacoma.")))

Similarity: tensor([[0.0877]])
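
For reference, the two similarity measures can be written out with NumPy. This is a minimal sketch with made-up two-dimensional vectors. If the embedding model returns unit-length vectors (an assumption worth checking for the model you use), the dot product and the cosine similarity give the same number.

```python
import numpy as np

def dot_similarity(a, b):
    """Dot product: grows with both alignment and vector length."""
    return float(np.dot(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: depends on direction only."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

print(dot_similarity(a, b))     # 24.0
print(cosine_similarity(a, b))  # 0.96

# After scaling both vectors to unit length, the dot product equals the cosine similarity.
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
print(np.isclose(dot_similarity(a_unit, b_unit), cosine_similarity(a, b)))  # True
```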


In [15]:
# Get the value of the max sequence_length
print(f"Model's maximum sequence length: {SentenceTransformer('sentence-transformers/all-MiniLM-L12-v2').max_seq_length}")

Model's maximum sequence length: 128


So, we should ensure that each chunk or individual document stays within this limit of 128 tokens, because anything longer is truncated before it is embedded and the cut-off content never reaches the embedding.
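
The effect of truncation can be sketched as follows. As a simplifying assumption, whitespace-separated words stand in for the model's tokens; the real tokenizer produces sub-word pieces and reserves room for special tokens, so actual counts will differ. `truncate_tokens` is a hypothetical helper.

```python
MAX_SEQ_LENGTH = 128  # limit reported by the model above

def truncate_tokens(tokens, max_len=MAX_SEQ_LENGTH):
    """Return the tokens that would be kept and the ones that would be dropped."""
    return tokens[:max_len], tokens[max_len:]

# Stand-in "tokens": whitespace-separated words (an approximation).
chunk = " ".join(f"word{i}" for i in range(150))
kept, dropped = truncate_tokens(chunk.split())

print(f"kept {len(kept)} tokens, dropped {len(dropped)}")  # kept 128 tokens, dropped 22
```

Everything in `dropped` would have no influence on the resulting vector, which is why chunks should be sized with the limit in mind.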

#### Vector Stores

Once the embeddings are created for our relevant documents or knowledge base, we need to store these embeddings in the database for fast retrieval. 

Databases that store these vector embeddings are called "vector stores." We will use a vector store called "Qdrant," as shown below. 

In the code below:
- The vector store works along with the embedding model to create vector embeddings.
- The vector embeddings are stored in a Qdrant vector database collection.

We have already created a vector database that contains the astrophysics paper abstracts and Astropy's documentation. For details on how it was built, please refer to the notebook in the Appendix. 
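
To make the idea of a vector store concrete, here is a tiny in-memory version. It is purely illustrative: `TinyVectorStore` and `toy_embed` are hypothetical names, the "embedding" is a keyword count rather than a real model, and Qdrant itself uses far more efficient indexing than this brute-force search.

```python
import numpy as np

class TinyVectorStore:
    """Hypothetical in-memory vector store, for illustration only."""

    def __init__(self, embed):
        self.embed = embed  # function mapping a string to a 1-D vector
        self.vectors = []
        self.texts = []

    def add_texts(self, texts):
        """Embed each text and store the vector next to the original text."""
        for text in texts:
            self.vectors.append(np.asarray(self.embed(text), dtype=float))
            self.texts.append(text)

    def similarity_search(self, query, k=1):
        """Return the k stored texts whose vectors have the highest cosine similarity to the query."""
        q = np.asarray(self.embed(query), dtype=float)
        matrix = np.vstack(self.vectors)
        scores = matrix @ q / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(q))
        top = np.argsort(scores)[::-1][:k]
        return [self.texts[i] for i in top]

def toy_embed(text):
    """Stand-in embedding: counts of a few keywords (not a real model)."""
    keywords = ["planet", "star", "galaxy"]
    words = text.lower().replace(".", "").split()
    # The small offset avoids zero-length vectors for texts without any keyword.
    return [words.count(keyword) + 1e-9 for keyword in keywords]

store = TinyVectorStore(toy_embed)
store.add_texts(["Earth is a planet.", "The Sun is a star.", "Andromeda is a galaxy."])
print(store.similarity_search("Which planet is red?", k=1))  # ['Earth is a planet.']
```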

In [16]:
# TODO: Fix module paths
qdrant_path = "../../resources/data/qdrant/scipy_qdrant/"

# TODO: Change collection name to 
qdrant_collection = "arxiv_astro-ph_abstracts_astropy_github_documentation"

In [17]:
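# Set up Qdrant
if os.path.exists(qdrant_path):
    print(f"Loading existing Qdrant collection '{qdrant_collection}'")

    client = QdrantClient(path=qdrant_path)

    qdrant = Qdrant(
        client=client,
        collection_name=qdrant_collection,
        embeddings=embeddings_model
    )
else:
    # Later cells rely on `qdrant`, so fail early if the database is missing
    raise FileNotFoundError(f"Qdrant database not found at '{qdrant_path}'")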

Loading existing Qdrant collection 'arxiv_astro-ph_abstracts_astropy_github_documentation'


### Search

In [19]:
# Set up the retriever for a later step.
# MMR stands for Maximal Marginal Relevance:
# "MMR selects examples based on a combination of which examples are most
# similar to the inputs, while also optimizing for diversity. It does this by
# finding the examples with the embeddings that have the greatest cosine
# similarity with the inputs, and then iteratively adding them while penalizing
# them for closeness to already selected examples."
retriever = qdrant.as_retriever(search_type="mmr", search_kwargs={"k": 2})
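
The MMR selection described in the comment above can be sketched in NumPy. This is a simplified version written for illustration, not the code the retriever runs: `mmr_select` is a hypothetical helper, and `lambda_mult` (the trade-off between relevance and diversity) and its default of 0.5 are assumptions of this sketch.

```python
import numpy as np

def mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Pick k document indices, balancing similarity to the query against redundancy."""
    query = np.asarray(query_vec, dtype=float)
    docs = np.asarray(doc_vecs, dtype=float)
    # Normalize so that dot products are cosine similarities.
    query = query / np.linalg.norm(query)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)

    relevance = docs @ query   # similarity of each document to the query
    pairwise = docs @ docs.T   # similarity between documents

    # Start with the single most relevant document.
    selected = [int(np.argmax(relevance))]
    while len(selected) < min(k, len(docs)):
        best_idx, best_score = -1, -np.inf
        for i in range(len(docs)):
            if i in selected:
                continue
            # Penalize closeness to anything that has already been selected.
            redundancy = max(pairwise[i, j] for j in selected)
            score = lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
    return selected

# Documents 0 and 1 are identical; document 2 is less relevant but different.
docs = [[1.0, 0.0], [1.0, 0.0], [0.8, 0.6]]
query = [1.0, 0.0]

print(mmr_select(query, docs, k=2, lambda_mult=1.0))  # [0, 1]  (relevance only)
print(mmr_select(query, docs, k=2, lambda_mult=0.3))  # [0, 2]  (diversity favored)
```

With pure relevance the duplicate is returned; once diversity is weighted in, the second slot goes to the document that adds new information.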

In [20]:
retriever.invoke("What is dark matter?")

[Document(page_content='  Modern cosmology successfully deals with the origin and the evolution of the\nUniverse at large scales, but it is unable to completely answer the question\nabout the nature of the fundamental objects that it is describing. As a matter\nof fact, about 95\\% of the constituents of the Universe is indeed completely\nunknown: it cannot be described in terms of known particles. Despite intense\nefforts to shed light on this literal darkness by dark matter and dark energy\ndirect and indirect searches, not much progress has been made so far. In this\nwork, we take a different perspective by reviewing and elaborating an old idea\nof studying the mass-radius distribution of structures in the Universe in\nrelationship with the fundamental forces acting on them. As we will describe in\ndetail, the distribution of the observed structures in the Universe is not\ncompletely random, but it reflects the intimate features of the involved\nparticles and the nature of the funda

In [21]:
retriever.invoke("How can I perform celestial coordinate transformations?")

[Document(page_content='  In Paper I, Greisen & Calabretta (2002) describe a generalized method for\nassigning physical coordinates to FITS image pixels. This paper implements this\nmethod for all spherical map projections likely to be of interest in astronomy.\nThe new methods encompass existing informal FITS spherical coordinate\nconventions and translations from them are described. Detailed examples of\nheader interpretation and construction are given.\n', metadata={'id': 'astro-ph/0207413', 'title': 'Representations of celestial coordinates in FITS', '_id': '97984f0ccf1c445cad6c671ea68446a4', '_collection_name': 'arxiv_astro-ph_abstracts_astropy_github_documentation'}),
 Document(page_content='  We successfully measured the trigonometric parallax of Sagittarius A* (Sgr\nA*) to be $117\\pm17$ micro-arcseconds ($\\mu$as) using the VLBI Exploration of\nRadio Astrometry (VERA) with the newly developed broad-band signal-processing\nsystem named OCTAVE-DAS. The measured parallax correspo

In [22]:
# Post-processing
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

In [23]:
print(format_docs(retriever.invoke("What is dark matter?")))

  Modern cosmology successfully deals with the origin and the evolution of the
Universe at large scales, but it is unable to completely answer the question
about the nature of the fundamental objects that it is describing. As a matter
of fact, about 95\% of the constituents of the Universe is indeed completely
unknown: it cannot be described in terms of known particles. Despite intense
efforts to shed light on this literal darkness by dark matter and dark energy
direct and indirect searches, not much progress has been made so far. In this
work, we take a different perspective by reviewing and elaborating an old idea
of studying the mass-radius distribution of structures in the Universe in
relationship with the fundamental forces acting on them. As we will describe in
detail, the distribution of the observed structures in the Universe is not
completely random, but it reflects the intimate features of the involved
particles and the nature of the fundamental interactions at play. The obse

In [113]:
print(format_docs(retriever.invoke("How can I perform celestial coordinate transformations?")))

  In Paper I, Greisen & Calabretta (2002) describe a generalized method for
assigning physical coordinates to FITS image pixels. This paper implements this
method for all spherical map projections likely to be of interest in astronomy.
The new methods encompass existing informal FITS spherical coordinate
conventions and translations from them are described. Detailed examples of
header interpretation and construction are given.


  We successfully measured the trigonometric parallax of Sagittarius A* (Sgr
A*) to be $117\pm17$ micro-arcseconds ($\mu$as) using the VLBI Exploration of
Radio Astrometry (VERA) with the newly developed broad-band signal-processing
system named OCTAVE-DAS. The measured parallax corresponds to a Galactocentric
distance at the Sun of $R_0 = 8.5^{+1.5}_{-1.1}$ kpc. By combining the
astrometric results with VERA and the Very Long Baseline Array (VLBA) over a
monitoring period of 25 years, the proper motion of Sgr A* is obtained to be
$(\mu_\alpha, \mu_\delta) = (-

## Augmentation & Generation

Now that we can retrieve the most relevant documents for a question, we can send the retrieved documents along with the prompt to give the LLM more context.

This combined input is also referred to as a `retrieval-augmented prompt`.
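
In its simplest form, a retrieval-augmented prompt is just the question with the retrieved text placed in front of it. The sketch below builds one with plain string formatting; `build_augmented_prompt` is a hypothetical helper and the two "abstracts" are placeholders, not real retrieval results.

```python
def build_augmented_prompt(question, retrieved_texts):
    """Combine retrieved passages and the user's question into a single prompt string."""
    context = "\n\n".join(retrieved_texts)
    return (
        "You are an astrophysics expert. Please answer the question on "
        "astrophysics based on the following context:\n\n"
        f"Context: {context}\n\n"
        f"Question: {question}"
    )

prompt = build_augmented_prompt(
    "What is dark matter?",
    ["First retrieved abstract (placeholder).", "Second retrieved abstract (placeholder)."],
)
print(prompt)
```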

In [274]:
# Make sure the model path is correct for your system!
# TODO: Fix model path to cache folder
olmo = LlamaCpp(
    model_path="../../resources/models/OLMo-7B-Instruct-GGUF/OLMo-7B-Instruct-Q4_K_M.gguf",
    temperature=0.8,
    verbose=False,  
    n_ctx=2048,
    max_tokens=512,
)

In [275]:
# Create a prompt template using OLMo's tokenizer chat template we saw in module 1.
prompt_template = PromptTemplate.from_template(
    template=olmo.client.metadata['tokenizer.chat_template'], 
    template_format="jinja2",
    partial_variables={"add_generation_prompt": True, "eos_token": "<|endoftext|>"},
)

In [276]:
prompt_template

PromptTemplate(input_variables=['messages'], partial_variables={'add_generation_prompt': True, 'eos_token': '<|endoftext|>'}, template="{{ eos_token }}{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n'  + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}", template_format='jinja2')
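
The Jinja template above is easier to follow when written as ordinary Python. The function below is a simplified re-implementation for reading purposes only: `render_chat` is a hypothetical helper, and its whitespace does not exactly match what the Jinja template produces.

```python
def render_chat(messages, add_generation_prompt=True, eos_token="<|endoftext|>"):
    """Simplified rendering of a list of chat messages into a single prompt string."""
    parts = [eos_token]
    for message in messages:
        if message["role"] == "user":
            parts.append("<|user|>\n" + message["content"])
        elif message["role"] == "assistant":
            parts.append("<|assistant|>\n" + message["content"] + eos_token)
    # After the last message, cue the model to answer as the assistant.
    if add_generation_prompt and messages:
        parts.append("<|assistant|>")
    return "\n".join(parts)

print(render_chat([{"role": "user", "content": "What is dark matter?"}]))
```

The rendered string starts with the end-of-text token, wraps each message in its role marker, and ends with `<|assistant|>` so that the model continues from there.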

In [277]:
# Test the prompt you want to send to OLMo.

question = "What is dark matter?"
context = format_docs(retriever.invoke(question))

prompt_template.format(
    messages=[
        {
            "role": "user", 
            "content": f"""You are an astrophysics expert. Please answer the question on astrophysics based on the following context:

            Context: {context}
            
            Question: {question}"""
        }
    ]
)

'<|endoftext|>\n\n<|user|>\nYou are an astrophysics expert. Please answer the question on astrophysics based on the following context:\n\n            Context:   Modern cosmology successfully deals with the origin and the evolution of the\nUniverse at large scales, but it is unable to completely answer the question\nabout the nature of the fundamental objects that it is describing. As a matter\nof fact, about 95\\% of the constituents of the Universe is indeed completely\nunknown: it cannot be described in terms of known particles. Despite intense\nefforts to shed light on this literal darkness by dark matter and dark energy\ndirect and indirect searches, not much progress has been made so far. In this\nwork, we take a different perspective by reviewing and elaborating an old idea\nof studying the mass-radius distribution of structures in the Universe in\nrelationship with the fundamental forces acting on them. As we will describe in\ndetail, the distribution of the observed structures 

In [278]:
# Test the prompt you want to send to OLMo.
question = "How can I perform celestial coordinate transformations?"
context = format_docs(retriever.invoke(question))

prompt_template.format(
    messages=[
        {
            "role": "user", 
            "content": f"""You are an astrophysics expert. Please answer the question on astrophysics based on the following context:

            Context: {context}
            
            Question: {question}"""
        }
    ]
)

'<|endoftext|>\n\n<|user|>\nYou are an astrophysics expert. Please answer the question on astrophysics based on the following context:\n\n            Context:   In Paper I, Greisen & Calabretta (2002) describe a generalized method for\nassigning physical coordinates to FITS image pixels. This paper implements this\nmethod for all spherical map projections likely to be of interest in astronomy.\nThe new methods encompass existing informal FITS spherical coordinate\nconventions and translations from them are described. Detailed examples of\nheader interpretation and construction are given.\n\n\n  We successfully measured the trigonometric parallax of Sagittarius A* (Sgr\nA*) to be $117\\pm17$ micro-arcseconds ($\\mu$as) using the VLBI Exploration of\nRadio Astrometry (VERA) with the newly developed broad-band signal-processing\nsystem named OCTAVE-DAS. The measured parallax corresponds to a Galactocentric\ndistance at the Sun of $R_0 = 8.5^{+1.5}_{-1.1}$ kpc. By combining the\nastrometri

One way to generate the response with OLMo is to build `context` from the `question` beforehand, as shown above, create an `llm_chain`, and then `invoke` it with `messages`.

In [279]:
# Chain the prompt template and olmo
llm_chain = prompt_template | olmo
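
The `|` operator used above composes two steps so that the output of the first becomes the input of the second. The toy class below illustrates that idea with Python's `__or__` method; `Step` is a hypothetical stand-in, not LangChain's implementation, and `fake_llm` only echoes its input.

```python
class Step:
    """Hypothetical stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # `a | b` builds a new step that runs a, then feeds its output to b.
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)

format_prompt = Step(lambda inputs: f"Question: {inputs['question']}")
fake_llm = Step(lambda prompt: f"[model output for: {prompt}]")

chain = format_prompt | fake_llm
print(chain.invoke({"question": "What is dark matter?"}))
# [model output for: Question: What is dark matter?]
```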

In [280]:
question = "What is dark matter?"
context = format_docs(retriever.invoke(question))

# Invoke the chain with a question and other parameters. 
llm_chain.invoke(
    {
        "messages":
            [{
                "role": "user", 
                "content": f"""You are an astrophysics expert. Please answer the question on astrophysics based on the following context:
    
                Context: {context}
                
                Question: {question}"""
            }
        ], 
    },
    config={
        'callbacks' : [StreamingStdOutCallbackHandler()]
    }
)

 Dark matter is a theoretical particle or collection of particles that, despite being invisible, is believed to make up approximately 95% of the matter in the universe. It is called "dark" because it does not emit light and therefore cannot be seen like other particles, such as those comprising stars and galaxies. Dark matter is thought to influence the motion and distribution of stars, galaxies, and entire celestial bodies due to its massive presence. Its properties are not well understood, but scientists currently believe it could be a type of weakly interacting massive particle (WIMP) or an axion, among other possibilities.

' Dark matter is a theoretical particle or collection of particles that, despite being invisible, is believed to make up approximately 95% of the matter in the universe. It is called "dark" because it does not emit light and therefore cannot be seen like other particles, such as those comprising stars and galaxies. Dark matter is thought to influence the motion and distribution of stars, galaxies, and entire celestial bodies due to its massive presence. Its properties are not well understood, but scientists currently believe it could be a type of weakly interacting massive particle (WIMP) or an axion, among other possibilities.'

We can further use [LangChain's convenience functions](https://python.langchain.com/v0.2/docs/tutorials/rag/#built-in-chains) to streamline our pipeline using [create_stuff_documents_chain](https://api.python.langchain.com/en/latest/chains/langchain.chains.combine_documents.stuff.create_stuff_documents_chain.html) and [create_retrieval_chain](https://api.python.langchain.com/en/latest/chains/langchain.chains.retrieval.create_retrieval_chain.html).

`create_stuff_documents_chain` specifies how retrieved context is fed into a prompt and LLM. 

Looking at its signature, notice that it accepts a `prompt` argument of type `BasePromptTemplate`. For our pipeline, the prompt needs `context` (where the retrieved documents are inserted) and `input` (the question) as its input variables.
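
A quick way to check which input variables an f-string style template exposes is Python's own `string.Formatter`. This is a sketch that relies only on the standard library, not on LangChain internals; `template_variables` is a hypothetical helper.

```python
from string import Formatter

def template_variables(template):
    """Return the set of {placeholder} names found in an f-string style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}

template = "Context: {context}\nQuestion: {input}"
print(sorted(template_variables(template)))  # ['context', 'input']
```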

In [288]:
# Uncomment below line and run the cell
#create_stuff_documents_chain?

In [298]:
# This is how we can transform our prompt_template so that it accepts `context` and `input` as input_variables
transformed_prompt_template = PromptTemplate.from_template(
    prompt_template.partial(
        messages=[
            {
                "role": "user", 
                "content": "You are an astrophysics expert. Please answer the question on astrophysics based on the following context. \
                            Context: {context} \
                            Question: {input}"
            }
        ]
    ).format()
)
transformed_prompt_template

PromptTemplate(input_variables=['context', 'input'], template='<|endoftext|>\n\n<|user|>\nYou are an astrophysics expert. Please answer the question on astrophysics based on the following context.                             Context: {context}                             Question: {input}\n\n\n<|assistant|>\n\n')

In [299]:
document_chain = create_stuff_documents_chain(
    llm=olmo, 
    prompt=transformed_prompt_template
)

We can run this by passing in the context directly:

In [300]:
question = "What is dark matter?"
document_chain.invoke({
    "input": question,
    "context": retriever.invoke(question),
})

" Dark Matter is a theoretical concept in astrophysics and cosmology that describes the existence of a type of matter that cannot be seen or detected using traditional methods such as light, energy, or radiation. In contrast to visible matter, which makes up approximately 85% of the universe's total mass-energy content, dark matter constitutes about 25% of this mass. It does not emit, absorb, or reflect any electromagnetic radiation and is therefore invisible to us.\n\nDark Matter is an essential component in modern cosmology, as its presence is required to reconcile the mathematical predictions of cosmological models with the large-scale structures observed in the universe such as galaxies and galaxy clusters. These structures form due to the gravitational clustering of dark matter particles. The nature of dark matter remains a mystery, and it is still an active area of research.\n\nOne of the possible candidates for dark matter is the axion, which was first introduced to solve a prob

However, we want the context to be dynamically generated using the passed input or question.

From LangChain's documentation: `create_retrieval_chain` adds the retrieval step and propagates the retrieved context through the chain, providing it alongside the final answer. It has input key `input`, and includes input, context, and answer in its output.
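
The data flow described in that quote can be sketched with plain functions. This is an illustration of the input and output keys only, not LangChain's implementation: `make_retrieval_chain`, `fake_retrieve`, and `fake_answer` are hypothetical names, and the retriever and answer step are stubs.

```python
def make_retrieval_chain(retrieve, answer_from_documents):
    """Build a function that retrieves context for `input` and then generates an answer."""
    def invoke(inputs):
        documents = retrieve(inputs["input"])
        answer = answer_from_documents({"input": inputs["input"], "context": documents})
        # The output carries the original input, the retrieved context, and the answer.
        return {"input": inputs["input"], "context": documents, "answer": answer}
    return invoke

def fake_retrieve(query):
    """Stub retriever: returns a single placeholder document."""
    return [f"abstract relevant to: {query}"]

def fake_answer(inputs):
    """Stub document chain: reports how many documents it received."""
    return f"Answer based on {len(inputs['context'])} document(s)."

chain = make_retrieval_chain(fake_retrieve, fake_answer)
result = chain({"input": "What is dark matter?"})
print(sorted(result))  # ['answer', 'context', 'input']
```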

In [301]:
retrieval_chain = create_retrieval_chain(retriever, document_chain)

In [303]:
response = retrieval_chain.invoke({"input": "What is dark matter?"})

In [304]:
response

{'input': 'What is dark matter?',
 'context': [Document(page_content='  Modern cosmology successfully deals with the origin and the evolution of the\nUniverse at large scales, but it is unable to completely answer the question\nabout the nature of the fundamental objects that it is describing. As a matter\nof fact, about 95\\% of the constituents of the Universe is indeed completely\nunknown: it cannot be described in terms of known particles. Despite intense\nefforts to shed light on this literal darkness by dark matter and dark energy\ndirect and indirect searches, not much progress has been made so far. In this\nwork, we take a different perspective by reviewing and elaborating an old idea\nof studying the mass-radius distribution of structures in the Universe in\nrelationship with the fundamental forces acting on them. As we will describe in\ndetail, the distribution of the observed structures in the Universe is not\ncompletely random, but it reflects the intimate features of the i

In [305]:
print(response["answer"])

 Dark matter is a theoretical entity in astrophysics and cosmology that is yet to be directly detected through its observable signals or particles. It is an unseen substance that accounts for approximately 85% of the mass content of the universe, making it an essential component in understanding the structure and evolution of the cosmos (1).

The mysterious nature of dark matter lies in its non-existent presence compared to the visible matter we can detect through our senses or instruments. Although astronomers have discovered numerous celestial objects that cannot be explained by ordinary matter alone, this unexplained mass is significantly more than what can be observed directly. The invisible, yet abundant dark matter particles interact with other matter only through gravity and do not emit, reflect, or absorb any electromagnetic radiation (2).

The term "dark" refers to the lack of direct detection of dark matter particles or their signatures in our laboratories. Scientists have hy

In [306]:
response = retrieval_chain.invoke({"input": "How can I perform celestial coordinate transformations?"})
print(response["answer"])

 To perform celestial coordinate transformations using the information provided in the context, follow these steps:

1. Familiarize yourself with spherical map projections commonly used in astronomy, such as Lambert conical projection and Cylindrical equidistant projection. These are discussed in Paper I of Greisen & Calabretta (2002).

2. Understand the new methods implemented in Paper I to assign physical coordinates to FITS image pixels. The paper discusses how to convert informal FITS conventions into these new methods, and vice versa. Detailed examples are provided.

3. Learn about header interpretation and construction. This involves using the information within the FITS headers to determine the position of objects in the sky and constructing proper motions and orbital velocities.

4. Apply the trigonometric parallax measurement obtained by VERA, VLBA, and monitoring over 25 years. The paper describes how to calculate angular orbital velocities using equatorial coordinates ($\mu_