In [23]:
import os

from langchain_community.vectorstores import Qdrant
from langchain_community.embeddings import HuggingFaceEmbeddings

from qdrant_client import QdrantClient

In [35]:
import langchain
import textwrap

In [33]:
from langchain_core.prompts import PromptTemplate
from llama_cpp import Llama

In [34]:
from langchain_community.llms import LlamaCpp
from langchain_core.callbacks import StreamingStdOutCallbackHandler

# Retrieval-Augmented Text Generation

Although LLMs are trained on large datasets, stale or incomplete training data can severely limit them. They face several challenges:

1. The models are trained on internet content, so they might not generate relevant output when prompted for information that is not publicly available on the internet.

2. The models are trained up to a certain date, so they might not generate relevant output when prompted about events or information that appeared after the training cutoff.

3. The models are trained to be more generalized. This means that they can only produce generic outputs and might not perform as expected when prompted for specific deep-dive concepts related to a particular topic.

One way to dynamically integrate relevant external information is retrieval-augmented generation (RAG), which can help improve the reliability of LLM outputs. 

## RAG Framework

RAG addresses these issues by supplementing the prompt sent to the LLM with information from external sources through a retrieval model, thereby providing the LLM with more relevant input to generate responses. It allows you to use pre-trained LLMs without fine-tuning them or training your own LLM on your own data.

<center><img src="../../images/rag-workflow.webp" width="60%"/></center>


Image Source: [Medium Blog](https://medium.com/@henryhengluo/intro-of-retrieval-augmented-generation-rag-and-application-demos-c1d9239ababf)

A RAG pipeline is built from three concepts:

1. Retrieval
2. Augmentation
3. Generation
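
As a rough illustration of how these three stages fit together, here is a minimal, dependency-free sketch. The function names (`retrieve`, `augment`, `generate`), the word-overlap scoring, the two-sentence knowledge base, and the stand-in generator are all assumptions made for illustration; they are not the components used in the rest of this tutorial.

```python
import re

# Toy knowledge base; these strings are made up for illustration.
knowledge_base = [
    "Dark matter does not emit light and is inferred from gravity.",
    "Astropy provides tools for celestial coordinate transformations.",
]

def words(text):
    """Lowercased set of words, used for the toy overlap score."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(question, documents, k=1):
    """Retrieval: rank documents by how many words they share with the question."""
    ranked = sorted(
        documents,
        key=lambda doc: len(words(question) & words(doc)),
        reverse=True,
    )
    return ranked[:k]

def augment(question, documents):
    """Augmentation: put the retrieved text in front of the question."""
    context = "\n".join(documents)
    return f"Context: {context}\n\nQuestion: {question}"

def generate(prompt):
    """Generation: stand-in for an LLM call; it only reports the prompt size."""
    return f"[an LLM would answer here, given a {len(prompt)}-character prompt]"

question = "What is dark matter?"
docs = retrieve(question, knowledge_base)
prompt = augment(question, docs)
print(generate(prompt))
```

A real pipeline swaps the word-overlap score for a similarity search over embeddings and the stand-in generator for an actual model.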

## Retrieval

The retrieval phase can also be considered the data and query/prompt preparation phase, focusing on efficient information retrieval or data access. This phase includes tasks such as (1) indexing, (2) query manipulation, (3) data modification, (4) search, and (5) ranking. In this tutorial, we primarily focus on indexing and search.

`Indexing` enables fast and accurate information retrieval that sets up the context for any LLM to improve its response to a given user prompt or query. 

We will be indexing abstracts of astrophysics papers and the documentation for Astropy, a common core package for astronomy in Python.

### Embeddings

Embeddings, also called "vector embeddings," help LLMs develop a semantic understanding of the textual data they are trained on. In simpler terms, embedding models lay the groundwork for LLMs to perform tasks like sentence completion, similarity search, and question answering.

#### Vector

At the lowest level, machines only understand numeric values. For LLMs to work, natural language is converted into arrays of numeric values before it is fed into the models. These arrays of numeric values are called "vectors."

An example of a vector: [2.5, 1.7, 3.3, 7.8]

The above is an example of a vector of size 4. 

In [5]:
import numpy as np

vector = np.array([2.5, 1.7, 3.3, 7.8])
print(f"Vector: {vector}") 

Vector: [2.5 1.7 3.3 7.8]


#### Tokens

We stated above that **"texts are converted into an array of numeric values called vectors"**.

But depending on your use case, each word, sentence, paragraph, or entire document can be represented as a vector. 

Tokens are the smallest natural language units that get converted into vectors. They can be defined at the character level, sub-word level, word level, sentence level, paragraph level, or document level.

Example: Consider the text below.

`Earth is a planet of the solar system. There are 8 planets in the solar system. 
All planets revolve around the sun. Sun is a star.`


Case 1.) **Tokenizing the entire paragraph into a vector.**  
Tokenization: The entire paragraph is a single token.   
Vectorization: A single vector.  
Sample Vector Representation: [3.1, 6.8, 5.4, 8.0, 7.1]

Case 2.) **Tokenizing each sentence into vectors.**  
Tokenization: One token for each sentence (total 4 tokens)  
Vectorization: One vector for each sentence (total 4 vectors).   
Sample Vector Representation: [[1.2, 2.3, 3.8, 7.9, 0.8], [2.5, 3.0, 8.2, 6.6, 4.1], [3.2, 6.5, 8.1, 9.3, 1.4], [1.1, 0.7, 7.2, 3.5, 8.5]]

Case 3.) **Tokenizing each word in the paragraph into a vector.**  
There are 26 words in the paragraph, ignoring punctuation, and each word is converted into a vector.  
Tokenization: One token for each word in the paragraph (total 26 tokens)  
Vectorization: One vector for each token (total 26 vectors).    
Sample Vector Representation: [[2.1, 3.2, 4.1, 9.8, 7.0], [8.2, 4.2, 7.1, 3.8, 2.0], ... 26 such representations in total]
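
The three cases can be checked with a few lines of plain Python. This is only a sketch of the counting step: the regular expressions below are a simple assumption about where sentences and words begin and end, and no vectors are produced yet.

```python
import re

text = (
    "Earth is a planet of the solar system. "
    "There are 8 planets in the solar system. "
    "All planets revolve around the sun. Sun is a star."
)

# Case 1: the whole paragraph is a single token.
paragraph_tokens = [text]

# Case 2: one token per sentence (split on whitespace that follows a period).
sentence_tokens = re.split(r"(?<=\.)\s+", text)

# Case 3: one token per word, ignoring punctuation.
word_tokens = re.findall(r"\w+", text)

print(len(paragraph_tokens), len(sentence_tokens), len(word_tokens))
```

This prints `1 4 26`, matching the token counts in the three cases.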


#### Tokenizers

Tokenizers are components responsible for converting large texts into tokens (tokenization). Different types of pre-trained tokenizers are available. You can even train your own tokenizers. But for the scope of this tutorial, we will use a pre-trained one. 

Generally, a tokenizer performs two steps:

1. Break down the original text into tokens. These tokens could again be at the character, sub-word, word, sentence, paragraph, or document levels.
2. Assign a unique identifier to each of the tokens created.
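
The two steps above can be sketched with a toy word-level tokenizer. The helper names `build_vocabulary` and `encode`, and the choice of `-1` for unknown tokens, are assumptions for illustration; pre-trained tokenizers usually work at the sub-word level and ship with a fixed vocabulary.

```python
def build_vocabulary(texts):
    """Steps 1 and 2: split texts into word tokens and give each new token the next free ID."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def encode(text, vocab, unknown_id=-1):
    """Convert a text into the list of IDs of its tokens."""
    return [vocab.get(token, unknown_id) for token in text.lower().split()]

vocab = build_vocabulary(["Earth is a planet", "Mars is a planet"])
ids = encode("Mars is a star", vocab)

print(vocab)
print(ids)
```

Here `star` never appeared in the texts used to build the vocabulary, so it is mapped to the unknown ID.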

In [6]:
# For example, here is how you can split a short sentence into chunks of text
from langchain_text_splitters import CharacterTextSplitter

In [7]:
text_splitter = CharacterTextSplitter(
    separator=" ",
    chunk_size=10,
    chunk_overlap=0,
)
text_splitter.split_text(text="Earth is a planet in the solar system.")

['Earth is a', 'planet in', 'the solar', 'system.']

[Learn more about how to split text into tokens in LangChain here.](https://python.langchain.com/v0.2/docs/how_to/split_by_token/) 

#### Embedding Models

A language model needs to understand how tokens are related to each other in the context of human language. To understand this semantic relationship, these tokens are converted into numerical vectors.

Embedding Models are trained upon these tokens to develop an "embedding space."

- Before the training, the embedding model initializes an N-dimensional 'vector' corresponding to each 'token' with random values. (Value of N depends on the embedding model)
  
- During the embedding model training, the values for these vectors are updated across iterations. In this process, similar or related tokens are updated to have similarly valued vectors.
  
- After the training, the collection of all the 'vectors' corresponding to all the tokens is called the "embedding space."

- "Embedding Space" is an encoded representation of meanings of tokens and inter-token relationships.
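
To make the training steps above concrete, here is a deliberately simplified sketch. The update rule (moving the vectors of a hand-picked related pair toward each other) is an assumption chosen for readability; it is not how MiniLM or any production embedding model is trained, and real models also push unrelated tokens apart.

```python
import math
import random

random.seed(0)  # make the "random" initialization reproducible
DIM = 4         # N, the embedding dimension (tiny here for readability)

tokens = ["earth", "mars", "hello"]

# Before training: one randomly initialized vector per token.
embedding_space = {t: [random.uniform(-1, 1) for _ in range(DIM)] for t in tokens}

def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

related_pairs = [("earth", "mars")]  # hand-picked "similar" tokens (assumption)
before = cosine(embedding_space["earth"], embedding_space["mars"])

# During training: nudge the vectors of related tokens toward each other.
learning_rate = 0.1
for _ in range(50):
    for a, b in related_pairs:
        va, vb = embedding_space[a], embedding_space[b]
        for i in range(DIM):
            step = learning_rate * (vb[i] - va[i])
            va[i] += step
            vb[i] -= step

# After training: related tokens have similarly valued vectors.
after = cosine(embedding_space["earth"], embedding_space["mars"])
print(f"earth/mars similarity before: {before:.3f}, after: {after:.3f}")
```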

> We now embed our relevant documents (knowledge base) using a pre-trained embedding model. 

In [8]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from sentence_transformers import SentenceTransformer, util

In [9]:
# Setup the embedding, we are using the MiniLM model here
embeddings_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

In [10]:
query_result = embeddings_model.embed_query("Earth is a planet in the solar system.")

In [11]:
# Dimension of vector
len(query_result)

384

In [12]:
query_result[-3:]

[-0.030922148376703262, 0.039250507950782776, 0.01346262451261282]

In an embedding space, you can measure how similar two vectors are using the `dot product` or the `cosine similarity`.

In [13]:
print("Similarity:", util.dot_score(query_result, embeddings_model.embed_query("Mars is a planet in the solar system.")))

Similarity: tensor([[0.7257]])


In [14]:
print("Similarity:", util.dot_score(query_result, embeddings_model.embed_query("Hello Tacoma.")))

Similarity: tensor([[0.0174]])
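
For reference, both similarity measures can be written out in a few lines of plain Python. This is a minimal sketch on small hand-made vectors, not the implementation used by `sentence_transformers`.

```python
import math

def dot_product(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, twice the length
c = [3.0, 0.0, -1.0]  # perpendicular to a

print(dot_product(a, b), cosine_similarity(a, b))
print(dot_product(a, c), cosine_similarity(a, c))
```

The dot product grows with the length of the vectors, while cosine similarity depends only on the angle between them. For vectors of length 1, the two measures give the same value.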


In [15]:
# Get the value of the max sequence_length
print(f"Model's maximum sequence length: {SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2').max_seq_length}")

Model's maximum sequence length: 256


The model therefore accepts at most 256 tokens per input. We should ensure that each chunk or individual document stays below this limit, because anything longer is truncated before processing and the information in the truncated part is lost.
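
One way to stay under such a limit is to split long texts into overlapping chunks before embedding them. The sketch below counts whitespace-separated words, which is only a rough proxy for model tokens (a single word can become several sub-word tokens), so in practice the word budget should sit well below the model limit. The function name `chunk_by_words` and its parameters are assumptions for illustration.

```python
def chunk_by_words(text, max_words, overlap=0):
    """Split text into chunks of at most max_words words, sharing `overlap` words between neighbors."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

long_text = " ".join(f"word{i}" for i in range(10))
chunks = chunk_by_words(long_text, max_words=4, overlap=1)

for chunk in chunks:
    print(chunk)
```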

#### Vector Stores

Once the embeddings are created for our relevant documents or knowledge base, we need to store these embeddings in the database for fast retrieval. 

Databases that store these vector embeddings are called "vector stores." We will use a vector store called "Qdrant," as shown below. 

In the code below:
- The vector store works along with the embedding model to create vector embeddings.
- Vector embeddings are stored in a collection in the Qdrant vector database.

We have already created a vector database that contains the astrophysics paper abstracts and Astropy's documentation. Please refer to the notebook in the Appendix for details. 
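
Conceptually, a vector store keeps `(vector, payload)` pairs and returns the payloads whose vectors are closest to a query vector. The class below is a hypothetical in-memory stand-in written for illustration; it is not Qdrant's API, and the two-dimensional vectors and payload strings are made up.

```python
import math

class TinyVectorStore:
    """In-memory stand-in for a vector store (hypothetical; not Qdrant's API)."""

    def __init__(self):
        self._items = []  # list of (vector, payload) pairs

    def add(self, vector, payload):
        """Store a vector together with the content it represents."""
        self._items.append((vector, payload))

    def search(self, query_vector, k=2):
        """Return the k payloads whose vectors are most similar to the query."""
        scored = [(self._cosine(query_vector, v), p) for v, p in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [payload for _, payload in scored[:k]]

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

store = TinyVectorStore()
store.add([0.9, 0.1], "abstract about dark matter")
store.add([0.1, 0.9], "docs about coordinate transformations")
store.add([0.8, 0.3], "abstract about WIMPs")

results = store.search([1.0, 0.0], k=2)
print(results)
```

This sketch compares the query against every stored vector. Dedicated vector databases typically use approximate nearest-neighbor indexes so that search stays fast for large collections.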

In [17]:
# TODO: Fix module paths
qdrant_path = "../../resources/data/scipy_qdrant/"

# TODO: Change collection name to 
qdrant_collection = "arxiv_astro-ph_abstracts"

In [18]:
model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

In [24]:
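# Load the existing on-disk Qdrant collection
if os.path.exists(qdrant_path):
    print(f"Loading existing Qdrant collection '{qdrant_collection}'")
    
    client = QdrantClient(path=qdrant_path)
    
    qdrant = Qdrant(
        client=client,
        collection_name=qdrant_collection,
        embeddings=model
    )
else:
    # Fail early instead of hitting a NameError on `qdrant` in later cells
    raise FileNotFoundError(f"Qdrant data not found at '{qdrant_path}'")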

Loading existing Qdrant collection 'arxiv_astro-ph_abstracts'


### Search

In [25]:
# Set up the retriever for a later step.
# "mmr" stands for Maximum Marginal Relevance:
# "MMR selects examples based on a combination of which examples are most
# similar to the inputs, while also optimizing for diversity. It does this by
# finding the examples with the embeddings that have the greatest cosine
# similarity with the inputs, and then iteratively adding them while
# penalizing them for closeness to already selected examples."
retriever = qdrant.as_retriever(search_type="mmr", search_kwargs={"k": 2})
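
The MMR idea described in the comment above can be sketched in plain Python. This is an illustrative implementation under stated assumptions, not the code that runs inside the retriever: the function name `mmr_select`, the weighting parameter `lambda_mult`, and the two-dimensional vectors are all chosen for this example.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr_select(query, candidates, k=2, lambda_mult=0.5):
    """Pick k candidate indices, trading relevance to the query against redundancy."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query, candidates[i])
            # Penalty: similarity to the closest already-selected candidate.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

query = [1.0, 0.0]
candidates = [
    [1.0, 0.10],   # 0: very relevant
    [1.0, 0.12],   # 1: near-duplicate of candidate 0
    [0.6, -0.80],  # 2: less relevant, but different from candidate 0
]

diverse = mmr_select(query, candidates, k=2, lambda_mult=0.5)
relevance_only = mmr_select(query, candidates, k=2, lambda_mult=1.0)
print(diverse, relevance_only)
```

With `lambda_mult=0.5` the near-duplicate is skipped in favor of the more distinct candidate, while `lambda_mult=1.0` reduces to a plain similarity ranking.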

In [26]:
retriever.invoke("What is dark matter?")

[Document(page_content="  Dark matter is one of the greatest unsolved mysteries in cosmology at the\npresent time. About 80% of the universe's gravitating matter is non-luminous,\nand its nature and distribution are for the most part unknown. In this paper,\nwe will outline the history, astrophysical evidence, candidates, and detection\nmethods of dark matter, with the goal to give the reader an accessible but\nrigorous introduction to the puzzle of dark matter. This review targets\nadvanced students and researchers new to the field of dark matter, and includes\nan extensive list of references for further study.\n", metadata={'id': '1006.2483', 'title': 'Dark Matter: A Primer', 'categories': 'hep-ph astro-ph.CO', '_id': '70c556bd7c644b62aa8ef4e50d312e51', '_collection_name': 'arxiv_astro-ph_abstracts'}),
 Document(page_content='  It is suggested that Dark Matter in the Universe is made of stars and black\nholes of WIMP matter.\n', metadata={'id': 'astro-ph/0204375', 'title': 'WIMP Star

In [27]:
retriever.invoke("How can I perform celestial coordinate transformations?")

[Document(page_content='  I present simple analytical equations to transform proper motion vectors from\nequatorial to Galactic coordinates.\n', metadata={'id': '1306.2945', 'title': 'Transformation of the equatorial proper motion to the Galactic system', 'categories': 'astro-ph.IM', '_id': '1439d4c0b3fb4aec9619d9a1cbfdfc0a', '_collection_name': 'arxiv_astro-ph_abstracts'}),
 Document(page_content='  In Paper I, Greisen & Calabretta (2002) describe a generalized method for\nassigning physical coordinates to FITS image pixels. This paper implements this\nmethod for all spherical map projections likely to be of interest in astronomy.\nThe new methods encompass existing informal FITS spherical coordinate\nconventions and translations from them are described. Detailed examples of\nheader interpretation and construction are given.\n', metadata={'id': 'astro-ph/0207413', 'title': 'Representations of celestial coordinates in FITS', 'categories': 'astro-ph', '_id': '31329efafff241c8a6c8891d490

In [28]:
# Post-processing
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

In [30]:
print(format_docs(retriever.invoke("What is dark matter?")))

  Dark matter is one of the greatest unsolved mysteries in cosmology at the
present time. About 80% of the universe's gravitating matter is non-luminous,
and its nature and distribution are for the most part unknown. In this paper,
we will outline the history, astrophysical evidence, candidates, and detection
methods of dark matter, with the goal to give the reader an accessible but
rigorous introduction to the puzzle of dark matter. This review targets
advanced students and researchers new to the field of dark matter, and includes
an extensive list of references for further study.


  It is suggested that Dark Matter in the Universe is made of stars and black
holes of WIMP matter.



In [29]:
print(format_docs(retriever.invoke("How can I perform celestial coordinate transformations?")))

  I present simple analytical equations to transform proper motion vectors from
equatorial to Galactic coordinates.


  In Paper I, Greisen & Calabretta (2002) describe a generalized method for
assigning physical coordinates to FITS image pixels. This paper implements this
method for all spherical map projections likely to be of interest in astronomy.
The new methods encompass existing informal FITS spherical coordinate
conventions and translations from them are described. Detailed examples of
header interpretation and construction are given.



## Augmentation

Now that we can retrieve the most relevant documents for a question, we can send the retrieved documents along with the prompt to give the LLM more context.

This can also be referred to as the `retrieval-augmented prompt.`
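
At its simplest, a retrieval-augmented prompt is plain string assembly. The helper below is a minimal sketch: the name `build_augmented_prompt` and the two short sample documents are assumptions for illustration. Building the prompt from separate pieces also keeps source-code indentation and stray whitespace from the documents out of the prompt text.

```python
INSTRUCTION = (
    "You are an astrophysics expert. Please answer the question on "
    "astrophysics based on the following context:"
)

def build_augmented_prompt(question, documents):
    """Combine the instruction, the retrieved documents, and the question."""
    context = "\n\n".join(doc.strip() for doc in documents)
    return f"{INSTRUCTION}\n\nContext: {context}\n\nQuestion: {question}"

prompt = build_augmented_prompt(
    "What is dark matter?",
    ["  Dark matter is non-luminous.\n", "  Its nature is unknown.\n"],
)
print(prompt)
```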

In [36]:
# Make sure the model path is correct for your system!
# TODO: Fix model path to cache folder
olmo = LlamaCpp(
    model_path="../../resources/models/OLMo-7B-Instruct-GGUF/OLMo-7B-Instruct-Q4_K_M.gguf",
    temperature=0.8,
    verbose=False,  
)

In [37]:
# Create a prompt template using OLMo's tokenizer chat template we saw in module 1.
prompt_template = PromptTemplate.from_template(
    template=olmo.client.metadata['tokenizer.chat_template'], 
    template_format="jinja2"
)

In [39]:
# Test the prompt you want to send to OLMo.
context = format_docs(retriever.invoke("What is dark matter?"))

question = "What is dark matter?"

prompt_template.format(
    messages=[
        {
            "role": "user", 
            "content": f"""You are an astrophysics expert. Please answer the question on astrophysics based on the following context:

            Context: {context}
            
            Question: {question}"""
        }
    ], 
    add_generation_prompt=True, 
    eos_token="<|endoftext|>"
)

"<|endoftext|>\n\n<|user|>\nYou are an astrophysics expert. Please answer the question on astrophysics based on the following context:\n\n            Context:   Dark matter is one of the greatest unsolved mysteries in cosmology at the\npresent time. About 80% of the universe's gravitating matter is non-luminous,\nand its nature and distribution are for the most part unknown. In this paper,\nwe will outline the history, astrophysical evidence, candidates, and detection\nmethods of dark matter, with the goal to give the reader an accessible but\nrigorous introduction to the puzzle of dark matter. This review targets\nadvanced students and researchers new to the field of dark matter, and includes\nan extensive list of references for further study.\n\n\n  It is suggested that Dark Matter in the Universe is made of stars and black\nholes of WIMP matter.\n\n            \n            Question: What is dark matter?\n\n\n<|assistant|>\n\n"

## Generation

In [41]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

In [60]:
# Chain the prompt template and olmo
llm_chain = prompt_template | olmo
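
The `|` operator builds a pipeline in which the output of the left-hand component becomes the input of the right-hand one. The class below is a hypothetical stand-in that mimics this behavior with Python's `__or__` method; it is not LangChain's implementation, and `format_prompt` and `fake_llm` are placeholders for the prompt template and the model.

```python
class Step:
    """Hypothetical chainable component (not LangChain's Runnable)."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # `a | b` returns a new step that feeds a's output into b.
        return Step(lambda value: other.invoke(self.invoke(value)))

format_prompt = Step(lambda inputs: f"Question: {inputs['question']}")
fake_llm = Step(lambda prompt: f"(answer to) {prompt}")

chain = format_prompt | fake_llm
answer = chain.invoke({"question": "What is dark matter?"})
print(answer)
```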

In [61]:
question = "What is dark matter?"
context = format_docs(retriever.invoke(question))

In [62]:
# Invoke the chain with a question and other parameters. 
llm_chain.invoke(
    {
        "messages":
            [{
                "role": "user", 
                "content": f"""You are an astrophysics expert. Please answer the question on astrophysics based on the following context:
    
                Context: {context}
                
                Question: {question}"""
            }
        ], 
        "add_generation_prompt": True, 
        "eos_token": "<|endoftext|>",
    },
    config={
        'callbacks' : [StreamingStdOutCallbackHandler()]
    }
)

 Dark matter is a theoretical entity that is currently the most popular solution to explaining the motion and observed properties of galaxies, stars, and other celestial bodies in the universe. Although it's difficult to visualize, its presence can be inferred from various astronomical observations. In this context, dark matter is made up of non-luminous materials that do not emit, reflect, or absorb light. Instead, their existence and properties are determined through their gravitational effects on visible matter like stars, gas, and dust (1).

The term "dark matter" originated from the observation that the distribution and motions of celestial bodies in the universe don't match up with the visible matter alone. Dark matter is estimated to account for around 85% of the total mass-energy content of the universe (2), which is significantly more massive than any known form of matter, including stars and galaxies.

To date, dark matter remains one of the greatest unsolved mysteries in cos

' Dark matter is a theoretical entity that is currently the most popular solution to explaining the motion and observed properties of galaxies, stars, and other celestial bodies in the universe. Although it\'s difficult to visualize, its presence can be inferred from various astronomical observations. In this context, dark matter is made up of non-luminous materials that do not emit, reflect, or absorb light. Instead, their existence and properties are determined through their gravitational effects on visible matter like stars, gas, and dust (1).\n\nThe term "dark matter" originated from the observation that the distribution and motions of celestial bodies in the universe don\'t match up with the visible matter alone. Dark matter is estimated to account for around 85% of the total mass-energy content of the universe (2), which is significantly more massive than any known form of matter, including stars and galaxies.\n\nTo date, dark matter remains one of the greatest unsolved mysteries