<a href="https://colab.research.google.com/github/ranjithsrajan/PyLab/blob/main/M4_AST_12_RAG_with_LangChain_C.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Applied Data Science and Machine Learning
## A program by IITM and TalentSprint
### Assignment 12: RAG - Retrieval Augmented Generation

**(with OpenAI LLMs)**

## Learning Objectives

At the end of the experiment, you will be able to:

1. Load the Documents
2. Splitting the documents into chunks
3. Embedding the chunks and storing them in vector db
4. Retrieving the relevant chunks to the query
 * Addressing Diversity
 * Addressing Specificity
5. Connecting with LLM to get a final grounded answer

## Introduction

> **RAG diagram:**
>
> <img src='https://drive.google.com/uc?id=1sCVvpsmtZEU1WSK1FFGMGHbEjrgtCNLi'>

---
---

> **Vector Store and Retrieval:**
>
> <img src='https://drive.google.com/uc?id=1_zX5gtSNrV8Qdx7Nz4_gMR8dCwvxCDS7' width=750px>

---
---

> **Embedding Model:**
>
> <img src='https://drive.google.com/uc?id=1HnvjGJ4HmpS-0wndpH-Q8cKMwIwWkTUe'>

---
---

> **Retrieval in Action:**
>
> <img src='https://drive.google.com/uc?id=1ry2TWFsewwqYP3Lw9muuPmbyuQqXwnYV' width=800px>

---
---

> **Example workflow with embedding model:**
>
><br>
>
> <img src='https://drive.google.com/uc?id=1zTuMMX54L2HrnmCYktTxVfMVrkIz8w15' width=600px>

---
---

### Setup Steps:

In [3]:
#@title Run this cell to complete the setup for this Notebook
from IPython import get_ipython

ipython = get_ipython()

notebook= "M4_AST_12_RAG_with_LangChain_C" #name of the notebook

def setup():
#  ipython.magic("sx pip3 install torch")
    from IPython.display import HTML, display
    display(HTML('<script src="https://dashboard.talentsprint.com/aiml/record_ip.html?traineeId={0}&recordId={1}"></script>'.format(getId(),submission_id)))
    print("Setup completed successfully")
    return

def submit_notebook():
    ipython.magic("notebook -e "+ notebook + ".ipynb")

    import requests, json, base64, datetime

    url = "https://dashboard.talentsprint.com/xp/app/save_notebook_attempts"
    if not submission_id:
      data = {"id" : getId(), "notebook" : notebook, "mobile" : getPassword(),"batch":"IITM-PG-ADSML-07"}
      r = requests.post(url, data = data)
      r = json.loads(r.text)

      if r["status"] == "Success":
          return r["record_id"]
      elif "err" in r:
        print(r["err"])
        return None
      else:
        print ("Something is wrong, the notebook will not be submitted for grading")
        return None

    elif getAnswer() and getComplexity() and getAdditional() and getConcepts() and getComments() and getMentorSupport():
      f = open(notebook + ".ipynb", "rb")
      file_hash = base64.b64encode(f.read())

      data = {"complexity" : Complexity, "additional" :Additional,
              "concepts" : Concepts, "record_id" : submission_id,
              "answer" : Answer, "id" : Id, "file_hash" : file_hash,
              "notebook" : notebook,
              "feedback_experiments_input" : Comments,
              "feedback_mentor_support": Mentor_support,"batch":"IITM-PG-ADSML-07"}
      r = requests.post(url, data = data)
      r = json.loads(r.text)
      if "err" in r:
        print(r["err"])
        return None
      else:
        print("Your submission is successful.")
        print("Ref Id:", submission_id)
        print("Date of submission: ", r["date"])
        print("Time of submission: ", r["time"])
        print("View your submissions: https://learn-iitm.talentsprint.com/notebook_submissions")
        #print("For any queries/discrepancies, please connect with mentors through the chat icon in LMS dashboard.")
        return submission_id
    else: submission_id


def getAdditional():
  try:
    if not Additional:
      raise NameError
    else:
      return Additional
  except NameError:
    print ("Please answer Additional Question")
    return None

def getComplexity():
  try:
    if not Complexity:
      raise NameError
    else:
      return Complexity
  except NameError:
    print ("Please answer Complexity Question")
    return None

def getConcepts():
  try:
    if not Concepts:
      raise NameError
    else:
      return Concepts
  except NameError:
    print ("Please answer Concepts Question")
    return None


# def getWalkthrough():
#   try:
#     if not Walkthrough:
#       raise NameError
#     else:
#       return Walkthrough
#   except NameError:
#     print ("Please answer Walkthrough Question")
#     return None

def getComments():
  try:
    if not Comments:
      raise NameError
    else:
      return Comments
  except NameError:
    print ("Please answer Comments Question")
    return None


def getMentorSupport():
  try:
    if not Mentor_support:
      raise NameError
    else:
      return Mentor_support
  except NameError:
    print ("Please answer Mentor support Question")
    return None

def getAnswer():
  try:
    if not Answer:
      raise NameError
    else:
      return Answer
  except NameError:
    print ("Please answer Question")
    return None


def getId():
  try:
    return Id if Id else None
  except NameError:
    return None

def getPassword():
  try:
    return password if password else None
  except NameError:
    return None

submission_id = None
### Setup
if getPassword() and getId():
  submission_id = submit_notebook()
  if submission_id:
    setup()
else:
  print ("Please complete Id and Password cells before running setup")



Setup completed successfully


### Install Dependencies

In [4]:
%%capture
!pip -q install langchain==0.3.27
!pip -q install openai==2.3.0
!pip -q install langchain-core==0.3.79
!pip -q install langchain-community==0.3.31
!pip -q install sentence-transformers==5.1.1
!pip -q install langchain-huggingface==0.3.1
!pip -q install langchain-experimental==0.3.4
!pip -q install langchainhub==0.1.21
!pip -q install langchain-openai==0.3.35
!pip -q install langchain-chroma==0.2.6
!pip -q install chromadb==1.1.1
!pip -q install pypdf==6.1.1

### Import Required Packages

In [5]:
import os
import openai
import numpy as np
from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import ChatOpenAI
from langchain_chroma import Chroma
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough

#### **Provide your OpenAI API key**

In [7]:
# Read OpenAI key from Colab Secrets

from google.colab import userdata

api_key = userdata.get('OPENAI_API_KEY')           # <-- change this as per your Colab secret's name
os.environ['OPENAI_API_KEY'] = api_key
openai.api_key = os.getenv('OPENAI_API_KEY')

### Load LLM

In [8]:
# Load Model

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)

In [9]:
# General query
response = llm.invoke("What is the Capital of India?")
print(response.content)

The capital of India is New Delhi.


### **Loading the documents**

[PDF Loader](https://docs.langchain.com/oss/javascript/integrations/document_loaders/file_loaders/pdf)

In [15]:
# UPLOAD the Docs first to this notebook, then run this cell

from langchain_community.document_loaders import PyPDFLoader

# Load PDF
loaders = [
    PyPDFLoader("/content/drive/MyDrive/pca_d1.pdf"),
    PyPDFLoader("/content/drive/MyDrive/ens_d2.pdf"),
    PyPDFLoader("/content/drive/MyDrive/ens_d2.pdf"),    # Loading duplicate documents on purpose
]

docs = []
for loader in loaders:
    docs.extend(loader.load())


In [14]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [16]:
len(docs)        # 7 pages were there in total from above documents

7

In [17]:
print(docs[0].page_content)

1 
 
 
N 
 
1 Principal Component Analysis 
In real world data analysis tasks we analyze complex data i.e. multi dimensional data. We plot the  
data and find various patterns in it or use it to train some machine learning models.  One way to  
think about dimensions is that suppose you have an data point x , if we consider this data point as 
a physical object then dimensions are merely a basis of view, like where is the data located when 
it is observed from horizontal axis or vertical axis. 
As the dimensions of data increases, the difficulty to visualize it and perform computations on 
it also increases. So, how to reduce the dimensions of a data:- 
• Remove the redundant dimensions 
• Only keep the most important dimensions  
Let us first try to understand some terms:- 
Variance : It is a measure of the variability or it simply measures how spread the data set is.  
Mathematically, it is the average squared deviation from the mean score. We use the following 
formula to compute va

In [18]:
docs

[Document(metadata={'producer': 'Microsoft® Word 2021', 'creator': 'Microsoft® Word 2021', 'creationdate': '2024-04-04T07:17:15+05:30', 'author': 'Ramendra Kumar', 'moddate': '2024-04-04T07:17:15+05:30', 'source': '/content/drive/MyDrive/pca_d1.pdf', 'total_pages': 3, 'page': 0, 'page_label': '1'}, page_content='1 \n \n \nN \n \n1 Principal Component Analysis \nIn real world data analysis tasks we analyze complex data i.e. multi dimensional data. We plot the  \ndata and find various patterns in it or use it to train some machine learning models.  One way to  \nthink about dimensions is that suppose you have an data point x , if we consider this data point as \na physical object then dimensions are merely a basis of view, like where is the data located when \nit is observed from horizontal axis or vertical axis. \nAs the dimensions of data increases, the difficulty to visualize it and perform computations on \nit also increases. So, how to reduce the dimensions of a data:- \n• Remove th

In [19]:
print(docs[0].page_content)

1 
 
 
N 
 
1 Principal Component Analysis 
In real world data analysis tasks we analyze complex data i.e. multi dimensional data. We plot the  
data and find various patterns in it or use it to train some machine learning models.  One way to  
think about dimensions is that suppose you have an data point x , if we consider this data point as 
a physical object then dimensions are merely a basis of view, like where is the data located when 
it is observed from horizontal axis or vertical axis. 
As the dimensions of data increases, the difficulty to visualize it and perform computations on 
it also increases. So, how to reduce the dimensions of a data:- 
• Remove the redundant dimensions 
• Only keep the most important dimensions  
Let us first try to understand some terms:- 
Variance : It is a measure of the variability or it simply measures how spread the data set is.  
Mathematically, it is the average squared deviation from the mean score. We use the following 
formula to compute va

### **Splitting of document**

[Recursively split by character](https://api.python.langchain.com/en/latest/text_splitters/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html)

[Split by character](https://api.python.langchain.com/en/latest/text_splitters/character/langchain_text_splitters.character.CharacterTextSplitter.html)

In [20]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [21]:
# Split
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 500,
    chunk_overlap = 50
)

In [22]:
splits = text_splitter.split_documents(docs)

print(len(splits))
print(len(splits[0].page_content) )
splits[0].page_content

25
498


'1 \n \n \nN \n \n1 Principal Component Analysis \nIn real world data analysis tasks we analyze complex data i.e. multi dimensional data. We plot the  \ndata and find various patterns in it or use it to train some machine learning models.  One way to  \nthink about dimensions is that suppose you have an data point x , if we consider this data point as \na physical object then dimensions are merely a basis of view, like where is the data located when \nit is observed from horizontal axis or vertical axis.'

In [23]:
splits[0]

Document(metadata={'producer': 'Microsoft® Word 2021', 'creator': 'Microsoft® Word 2021', 'creationdate': '2024-04-04T07:17:15+05:30', 'author': 'Ramendra Kumar', 'moddate': '2024-04-04T07:17:15+05:30', 'source': '/content/drive/MyDrive/pca_d1.pdf', 'total_pages': 3, 'page': 0, 'page_label': '1'}, page_content='1 \n \n \nN \n \n1 Principal Component Analysis \nIn real world data analysis tasks we analyze complex data i.e. multi dimensional data. We plot the  \ndata and find various patterns in it or use it to train some machine learning models.  One way to  \nthink about dimensions is that suppose you have an data point x , if we consider this data point as \na physical object then dimensions are merely a basis of view, like where is the data located when \nit is observed from horizontal axis or vertical axis.')

### **Embeddings**

Let's take our splits and embed them.

In [24]:
from langchain_openai import OpenAIEmbeddings

embedding = OpenAIEmbeddings(model='text-embedding-3-small')

In [25]:
embedding

OpenAIEmbeddings(client=<openai.resources.embeddings.Embeddings object at 0x7fa7efdf5940>, async_client=<openai.resources.embeddings.AsyncEmbeddings object at 0x7fa7ef89a570>, model='text-embedding-3-small', dimensions=None, deployment='text-embedding-ada-002', openai_api_version=None, openai_api_base=None, openai_api_type=None, openai_proxy=None, embedding_ctx_length=8191, openai_api_key=SecretStr('**********'), openai_organization=None, allowed_special=None, disallowed_special=None, chunk_size=1000, max_retries=2, request_timeout=None, headers=None, tiktoken_enabled=True, tiktoken_model_name=None, show_progress_bar=False, model_kwargs={}, skip_empty=False, default_headers=None, default_query=None, retry_min_seconds=4, retry_max_seconds=20, http_client=None, http_async_client=None, check_embedding_ctx_length=True)

### **Understanding similarity search with a toy example**

In [26]:
sentence1 = "i like dogs"
sentence2 = "i like cats"
sentence3 = "the weather is ugly, too hot outside"

In [27]:
embedding1 = embedding.embed_query(sentence1)
embedding2 = embedding.embed_query(sentence2)
embedding3 = embedding.embed_query(sentence3)

In [28]:
len(embedding1), len(embedding2), len(embedding3)

(1536, 1536, 1536)

In [29]:
embedding1[:10]

[0.01652492582798004,
 -0.033298008143901825,
 5.593090918409871e-06,
 0.006350534502416849,
 0.027861136943101883,
 -0.011934041976928711,
 -0.007687192410230637,
 0.0372685007750988,
 -0.07273223251104355,
 -0.022074593231081963]

In [30]:
import numpy as np

def cosine_similarity(vector1, vector2):
    # Ensure that the vectors are numpy arrays
    vector1 = np.array(vector1)
    vector2 = np.array(vector2)

    # Calculate the dot product of the vectors
    dot_product = np.dot(vector1, vector2)

    # Calculate the magnitude (norm) of the vectors
    norm_vector1 = np.linalg.norm(vector1)
    norm_vector2 = np.linalg.norm(vector2)

    # Compute cosine similarity
    if norm_vector1 == 0 or norm_vector2 == 0:
        return 0  # Avoid division by zero
    return dot_product / (norm_vector1 * norm_vector2)


In [31]:
cosine_similarity(embedding1, embedding2), cosine_similarity(embedding1, embedding3), cosine_similarity(embedding2, embedding3)

(np.float64(0.7222047452671774),
 np.float64(0.2025683886169236),
 np.float64(0.18210909214104934))

### **Vectorstores**

In [32]:
from langchain_chroma import Chroma       # Light-weight and in memory

In [33]:
persist_directory = 'docs/chroma/'
!rm -rf ./docs/chroma  # remove old database files if any

In [34]:
vectordb = Chroma.from_documents(
    documents=splits,                    # splits we created earlier
    embedding=embedding,
    persist_directory=persist_directory, # save the directory
)

In [35]:
print(vectordb._collection.count()) # same as number of splits

25


### **Similarity Search in Vector store**

Algorithms for retrieving relevant chunks In Vector databases,

In vector databases, algorithms for retrieving relevant chunks to a query are often based on **similarity search techniques**, primarily using nearest neighbor search.

Here are some common approaches:

> **Approximate Nearest Neighbor (ANN) Search:** Vector databases frequently use ANN algorithms to improve efficiency when searching for vectors that
are close to the query vector.
>
> Popular **ANN** algorithms include:

> 1. HNSW (Hierarchical Navigable Small World Graph): This is a graph-based approach that finds approximate nearest neighbors using a multi-
layered graph structure.

> 2. Faiss: An open-source library developed by Facebook, which uses various algorithms for fast similarity search, such as Product Quantization and
Inverted File System (IVF).

> 3. Annoy (Approximate Nearest Neighbors Oh Yeah): Developed by Spotify, it uses a forest of random projection trees for approximate nearest
neighbor search.


In [36]:
question = "How does ensemble method works?"

In [37]:
docs = vectordb.similarity_search(question, k=6)     # k --> No. of Document object to return

In [38]:
print(len(docs))

for i in range(len(docs)):
    print(docs[i].page_content)
    print('='*140)

6
Why use Ensemble Methods? 
Ensemble Methods are used in order to: 
• decrease variance (bagging) 
• decrease bias (boosting) 
• improve predictions (stacking) 
 
Bagging 
Bagging actually refers to Bootstrap Aggregators. 
Bagging tests multiple models on the data by sampling and replacing data i.e it utilizes bootstrap - 
ping. In turn, this reduces the noise and variance by utilizing multiple samples. Each hypothesis
Why use Ensemble Methods? 
Ensemble Methods are used in order to: 
• decrease variance (bagging) 
• decrease bias (boosting) 
• improve predictions (stacking) 
 
Bagging 
Bagging actually refers to Bootstrap Aggregators. 
Bagging tests multiple models on the data by sampling and replacing data i.e it utilizes bootstrap - 
ping. In turn, this reduces the noise and variance by utilizing multiple samples. Each hypothesis
considered. The product is bought by the user when the combined ratings of the group is positive. 
The user gets a fairer idea about the product when all 

### **Edge cases where failure may happen**

1. Lack of Diversity : Semantic search fetches all similar documents, but does not enforce diversity.

    - Notice that we're getting duplicate chunks (because of the duplicate `ens_d2.pdf` in the index). `docs[0]` and `docs[1]` are indentical.

  **Addressing Diversity - MMR (Maximum Marginal Relevance)**

Maximum Marginal Relevance (MMR) is a method used to retrieve relevant items to a query while avoiding redundancy. It does this by ensuring a balance between relevancy and diversity in the items retrieved.

<img src='https://miro.medium.com/v2/resize:fit:828/format:webp/1*U-9mPt5tBfPBPrwC4_oD1w.png'>

In [39]:
question = 'How ensemble method works?'
docs = vectordb.similarity_search(question, k=3)     # Without MMR

print(len(docs))

for i in range(len(docs)):
    print(docs[i].page_content)
    print('='*140)

3
Why use Ensemble Methods? 
Ensemble Methods are used in order to: 
• decrease variance (bagging) 
• decrease bias (boosting) 
• improve predictions (stacking) 
 
Bagging 
Bagging actually refers to Bootstrap Aggregators. 
Bagging tests multiple models on the data by sampling and replacing data i.e it utilizes bootstrap - 
ping. In turn, this reduces the noise and variance by utilizing multiple samples. Each hypothesis
Why use Ensemble Methods? 
Ensemble Methods are used in order to: 
• decrease variance (bagging) 
• decrease bias (boosting) 
• improve predictions (stacking) 
 
Bagging 
Bagging actually refers to Bootstrap Aggregators. 
Bagging tests multiple models on the data by sampling and replacing data i.e it utilizes bootstrap - 
ping. In turn, this reduces the noise and variance by utilizing multiple samples. Each hypothesis
considered. The product is bought by the user when the combined ratings of the group is positive. 
The user gets a fairer idea about the product when all 

**Example 1. Addressing Diversity - MMR-Maximum Marginal Relevance**

In [40]:
docs_with_mmr = vectordb.max_marginal_relevance_search(question, k=3, fetch_k=6)   # With MMR

print(len(docs_with_mmr))

for i in range(len(docs_with_mmr)):
    print(docs_with_mmr[i].page_content)
    print('='*140)

3
Why use Ensemble Methods? 
Ensemble Methods are used in order to: 
• decrease variance (bagging) 
• decrease bias (boosting) 
• improve predictions (stacking) 
 
Bagging 
Bagging actually refers to Bootstrap Aggregators. 
Bagging tests multiple models on the data by sampling and replacing data i.e it utilizes bootstrap - 
ping. In turn, this reduces the noise and variance by utilizing multiple samples. Each hypothesis
considered. The product is bought by the user when the combined ratings of the group is positive. 
The user gets a fairer idea about the product when all the ratings are combined. 
Here, the combination of ratings is done so that the decision making process of the user is made  
easy. 
Ensemble Methods refer to combining many different machine learning models in order to get a  
more powerful prediction. 
Thus, ensemble methods increase the accuracy of the predictions.
1  
 
Ensemble Methods 
Let us consider a real world situation which uses Ensemble Methods, which is, 

2. Lack of specificity:  The question may be from a particular doc but answer may contain information from other doc.

  **Addressing Specificity: Working with metadata - Manually**

  **Working with metadata using self-query retriever - Automatically**

**Example 2. Addressing Specificity: Working with metadata - Manually**

In [41]:
# Without metadata information
question = "What is variance?"

docs = vectordb.similarity_search(question, k=5)

for doc in docs:
    print({'page': doc.metadata['page'], 'source': doc.metadata['source']})    # metadata contains information about from which doc the answer has been fetched

{'page': 0, 'source': '/content/drive/MyDrive/pca_d1.pdf'}
{'page': 0, 'source': '/content/drive/MyDrive/ens_d2.pdf'}
{'page': 0, 'source': '/content/drive/MyDrive/ens_d2.pdf'}
{'page': 0, 'source': '/content/drive/MyDrive/pca_d1.pdf'}
{'page': 1, 'source': '/content/drive/MyDrive/pca_d1.pdf'}


We can filter the results based on metadata.

In [42]:
# With metadata information
question = "what is the role of variance in pca?"
docs = vectordb.similarity_search(
    question,
    k=5,
    filter={"source":'/content/ens_d2.pdf'}     # manually passing metadata, using metadata filter.
)

for doc in docs:
    print({'page': doc.metadata['page'], 'source': doc.metadata['source']})

In [43]:
# With metadata information + MMR

docs_with_mmr = vectordb.max_marginal_relevance_search(question,
                                                       k=2,
                                                       fetch_k=5,
                                                       filter={"source":'/content/ens_d2.pdf'}     # manually passing metadata, using metadata filter.
                                                       )

In [44]:
for i in range(len(docs_with_mmr)):
    print(docs_with_mmr[i].page_content)
    print('='*140)

[**Addressing Specificity -Automatically: Working with metadata using self-query retriever**](https://drive.google.com/file/d/1cwsZ19oCJFhQDEMfDmIWjEPOI-Fs-9iB/view?usp=sharing)

### **Additional tricks: Compression**

Another approach for improving the quality of retrieved docs is compression. Information most relevant to a query may be buried in a document with a lot of irrelevant text. Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

[Contextual compression](https://blog.langchain.com/improving-document-retrieval-with-contextual-compression/) is meant to fix this.

## **Retrieval**

**Vectorstore as a retriever**

**Better Approach**

In [45]:
# Without MMR
question = "What is principal component analysis?"
retriever = vectordb.as_retriever(search_kwargs={"k": 3})
docs = retriever.invoke(question)
docs

[Document(id='e3b88648-f845-450c-866f-9f94d3fa3dda', metadata={'author': 'Ramendra Kumar', 'page_label': '2', 'source': '/content/drive/MyDrive/pca_d1.pdf', 'moddate': '2024-04-04T07:17:15+05:30', 'page': 1, 'total_pages': 3, 'creationdate': '2024-04-04T07:17:15+05:30', 'producer': 'Microsoft® Word 2021', 'creator': 'Microsoft® Word 2021'}, page_content='2 \n \n \n \nSo, what does Principal Component Analysis (PCA) do? \nPCA finds a new set of dimensions (or a set of basis of views) such that all the dimensions are  \northogonal (and hence linearly independent) and ranked according to the variance of data along  \nthem. It means more important principle axis occurs first. (more important = more variance/more  \nspread out data) \n \nHow does PCA work? \n• Calculate the covariance matrix X of data points.'),
 Document(id='27bbfb8e-9b06-4615-b3d7-e5acd5eb9035', metadata={'total_pages': 3, 'page_label': '1', 'producer': 'Microsoft® Word 2021', 'source': '/content/drive/MyDrive/pca_d1.pdf'

In [46]:
# With MMR
retriever = vectordb.as_retriever(search_type="mmr", search_kwargs={"k": 2, "fetch_k":5})
docs = retriever.invoke(question)
docs

[Document(id='e3b88648-f845-450c-866f-9f94d3fa3dda', metadata={'page': 1, 'creationdate': '2024-04-04T07:17:15+05:30', 'creator': 'Microsoft® Word 2021', 'total_pages': 3, 'producer': 'Microsoft® Word 2021', 'page_label': '2', 'author': 'Ramendra Kumar', 'moddate': '2024-04-04T07:17:15+05:30', 'source': '/content/drive/MyDrive/pca_d1.pdf'}, page_content='2 \n \n \n \nSo, what does Principal Component Analysis (PCA) do? \nPCA finds a new set of dimensions (or a set of basis of views) such that all the dimensions are  \northogonal (and hence linearly independent) and ranked according to the variance of data along  \nthem. It means more important principle axis occurs first. (more important = more variance/more  \nspread out data) \n \nHow does PCA work? \n• Calculate the covariance matrix X of data points.'),
 Document(id='27bbfb8e-9b06-4615-b3d7-e5acd5eb9035', metadata={'page_label': '1', 'author': 'Ramendra Kumar', 'moddate': '2024-04-04T07:17:15+05:30', 'source': '/content/drive/MyDri

## **Augmentation**

In [47]:
from langchain_core.prompts import PromptTemplate                                    # To format prompts
from langchain_core.output_parsers import StrOutputParser                            # to transform the output of an LLM into a more usable format
from langchain.schema.runnable import RunnableParallel, RunnablePassthrough          # Required by LCEL (LangChain Expression Language)

In [48]:
# Build prompt
template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Always say "thanks for asking!" at the end of the answer.
{context}
Question: {question}
Helpful Answer:"""

QA_PROMPT = PromptTemplate(input_variables=["context", "question"], template=template)

## **Creating final RAG Chain**

> <img src='https://www.pinecone.io/_next/image/?url=https%3A%2F%2Fcdn.sanity.io%2Fimages%2Fvr8gru94%2Fproduction%2F63f8a8482c9ec06a8d7d1041514f87c06dd108a9-3442x942.png&w=3840&q=75' width=1200px>

[[Image source](https://www.pinecone.io/learn/series/langchain/langchain-expression-language/)]

Above figure describes the LCEL flow using `RunnableParallel` and `RunnablePassthrough`.

A Runnable is a **unit of execution** in the LangChain framework. It represents a specific task or operation that can be performed.

Examples of Runnables include data transformations, computations, or any other operation that can be **expressed** in the LCEL(LangChain expression language).

[Runnable Lambdas](https://api.python.langchain.com/en/latest/core/runnables/langchain_core.runnables.base.RunnableLambda.html) is a LangChain abstraction that allows us to turn Python functions into **pipe-compatible functions**, similar to the Runnable class.

[RunnablePassthrough](https://api.python.langchain.com/en/latest/core/runnables/langchain_core.runnables.passthrough.RunnablePassthrough.html) on its own allows you to pass inputs unchanged. This typically is **used in conjuction with [RunnableParallel](https://api.python.langchain.com/en/latest/core/runnables/langchain_core.runnables.base.RunnableParallel.html)** to pass data through to a new key in the map.

The **RunnableParallel** object allows us to define multiple values and operations, and run them all in parallel.

The **RunnablePassthrough** object is used as a “passthrough” that takes any input to the current component ('retrieval' in above figure) and allows us to provide it in the component output via the “question” key or any other custom key.

In [49]:

def get_context_info(question):
    retriever = vectordb.as_retriever(search_type="mmr", search_kwargs={"k": 3, "fetch_k":5})
    docs = retriever.invoke(question)
    return docs


In [50]:
from langchain_core.runnables import RunnableLambda

retrieval = RunnableParallel(
    {
        "context": RunnableLambda(lambda x: get_context_info(x["question"])),
        "question": RunnableLambda(lambda x: x["question"])
        }
    )

In [51]:
from pprint import pprint

pprint(retrieval.invoke({"question": "What is PCA ?"}))

{'context': [Document(id='e3b88648-f845-450c-866f-9f94d3fa3dda', metadata={'page_label': '2', 'creationdate': '2024-04-04T07:17:15+05:30', 'page': 1, 'producer': 'Microsoft® Word 2021', 'moddate': '2024-04-04T07:17:15+05:30', 'creator': 'Microsoft® Word 2021', 'author': 'Ramendra Kumar', 'source': '/content/drive/MyDrive/pca_d1.pdf', 'total_pages': 3}, page_content='2 \n \n \n \nSo, what does Principal Component Analysis (PCA) do? \nPCA finds a new set of dimensions (or a set of basis of views) such that all the dimensions are  \northogonal (and hence linearly independent) and ranked according to the variance of data along  \nthem. It means more important principle axis occurs first. (more important = more variance/more  \nspread out data) \n \nHow does PCA work? \n• Calculate the covariance matrix X of data points.'),
             Document(id='ec8fc7f8-5a51-43aa-ad9a-ed8c05078203', metadata={'author': 'Ramendra Kumar', 'producer': 'Microsoft® Word 2021', 'page': 1, 'moddate': '2024-04

In [52]:
pprint(retrieval.invoke({"question": "How ensemble methods works?"}))

{'context': [Document(id='efb09d85-eb32-456d-985d-af39c669a1b9', metadata={'creator': 'Microsoft® Word 2021', 'page_label': '1', 'source': '/content/drive/MyDrive/ens_d2.pdf', 'producer': 'Microsoft® Word 2021', 'page': 0, 'author': 'Abhinav', 'moddate': '2024-04-04T07:17:46+05:30', 'creationdate': '2024-04-04T07:17:46+05:30', 'total_pages': 2}, page_content='Why use Ensemble Methods? \nEnsemble Methods are used in order to: \n• decrease variance (bagging) \n• decrease bias (boosting) \n• improve predictions (stacking) \n \nBagging \nBagging actually refers to Bootstrap Aggregators. \nBagging tests multiple models on the data by sampling and replacing data i.e it utilizes bootstrap - \nping. In turn, this reduces the noise and variance by utilizing multiple samples. Each hypothesis'),
             Document(id='c44dbcbe-5c0b-40a4-9a90-b5fcc31467d9', metadata={'page_label': '1', 'creationdate': '2024-04-04T07:17:46+05:30', 'total_pages': 2, 'author': 'Abhinav', 'producer': 'Microsoft® Wo

In [53]:
# RAG Chain

rag_chain = (retrieval                     # Retrieval
             | QA_PROMPT                   # Augmentation
             | llm                         # Generation
             | StrOutputParser()
             )

In [54]:
response = rag_chain.invoke({"question": "What is PCA ?"})

response

'Principal Component Analysis (PCA) is a statistical technique used to reduce the dimensionality of data while preserving as much variance as possible. It achieves this by finding a new set of dimensions (or basis) that are orthogonal (linearly independent) and ranked according to the variance of the data along them. The process involves calculating the covariance matrix of the data points, determining the eigenvectors and eigenvalues, sorting the eigenvectors by their eigenvalues in decreasing order, and selecting the top k eigenvectors to form the new dimensions. PCA is particularly useful for simplifying data, visualizing it, and improving the performance of machine learning algorithms.\n\nThanks for asking!'

In [55]:
response = rag_chain.invoke({"question": "What is principal component analysis?"})

response

'Principal Component Analysis (PCA) is a statistical technique used in data analysis to reduce the dimensionality of complex, multi-dimensional data while preserving as much variance as possible. It achieves this by finding a new set of dimensions (or basis of views) that are orthogonal (linearly independent) and ranked according to the variance of the data along them. The principal axes that capture the most variance are prioritized, allowing for a more efficient representation of the data. PCA is particularly useful for visualizing data, identifying patterns, and preparing data for machine learning models.\n\nThanks for asking!'

In [56]:
response = rag_chain.invoke({"question": "How ensemble method works?"})

print(response)

Ensemble methods work by combining multiple machine learning models to improve the overall prediction accuracy. There are different techniques within ensemble methods, including:

1. **Bagging (Bootstrap Aggregating)**: This technique reduces variance by training multiple models on different subsets of the data, created through sampling with replacement. Each model makes predictions, and the final output is typically the average (for regression) or majority vote (for classification) of these predictions.

2. **Boosting**: This method aims to reduce bias by sequentially training models, where each new model focuses on the errors made by the previous ones. The predictions are combined to create a stronger overall model.

3. **Stacking**: In this approach, multiple models are trained, and their predictions are used as inputs to a higher-level model, which makes the final prediction. This can leverage the strengths of different models to improve accuracy.

Overall, ensemble methods enhance

In [57]:
# For queries that is not in documents
response = rag_chain.invoke({"question": "Who is the CEO of OpenAI "})

print(response)

I don't know. Thanks for asking!


[**Details of Chroma through LangChain**](https://python.langchain.com/docs/integrations/vectorstores/chroma/)

## Reusing Vector DB

### **Download the vector DB**

In [58]:
# Zip the entire folder
!zip -r /content/docs.zip /content/docs

  adding: content/docs/ (stored 0%)
  adding: content/docs/chroma/ (stored 0%)
  adding: content/docs/chroma/chroma.sqlite3 (deflated 58%)
  adding: content/docs/chroma/1e70a146-ac3e-4014-b014-74466990cb73/ (stored 0%)
  adding: content/docs/chroma/1e70a146-ac3e-4014-b014-74466990cb73/link_lists.bin (stored 0%)
  adding: content/docs/chroma/1e70a146-ac3e-4014-b014-74466990cb73/length.bin (deflated 70%)
  adding: content/docs/chroma/1e70a146-ac3e-4014-b014-74466990cb73/header.bin (deflated 63%)
  adding: content/docs/chroma/1e70a146-ac3e-4014-b014-74466990cb73/data_level0.bin (deflated 100%)


In [59]:
from google.colab import files
files.download("/content/docs.zip")

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>

### **Upload the vector db from previous step and unzip**

In [60]:
!unzip /content/docs.zip  -d /

Archive:  /content/docs.zip
replace /content/docs/chroma/chroma.sqlite3? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: /content/docs/chroma/chroma.sqlite3  
replace /content/docs/chroma/1e70a146-ac3e-4014-b014-74466990cb73/link_lists.bin? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
 extracting: /content/docs/chroma/1e70a146-ac3e-4014-b014-74466990cb73/link_lists.bin  
replace /content/docs/chroma/1e70a146-ac3e-4014-b014-74466990cb73/length.bin? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: /content/docs/chroma/1e70a146-ac3e-4014-b014-74466990cb73/length.bin  
replace /content/docs/chroma/1e70a146-ac3e-4014-b014-74466990cb73/header.bin? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: /content/docs/chroma/1e70a146-ac3e-4014-b014-74466990cb73/header.bin  
replace /content/docs/chroma/1e70a146-ac3e-4014-b014-74466990cb73/data_level0.bin? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: /content/docs/chroma/1e70a146-ac3e-4014-b014-74466990cb73/data_level0.bin  


In [61]:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

embedding = OpenAIEmbeddings(model='text-embedding-3-small')

vectordb = Chroma(persist_directory = 'docs/chroma/',
                  embedding_function = embedding
                  )

### Please answer the questions below to complete the experiment:




In [62]:
#@title One key advantage of RAG over a standalone LLM is: { run: "auto", form-width: "500px", display-mode: "form" }
Answer = "Ability to use updated external knowledge" #@param ["", "Smaller model size", "Faster image generation", "Ability to use updated external knowledge"]

In [63]:
#@title How was the experiment? { run: "auto", form-width: "500px", display-mode: "form" }
Complexity = "Good and Challenging for me" #@param ["","Too Simple, I am wasting time", "Good, But Not Challenging for me", "Good and Challenging for me", "Was Tough, but I did it", "Too Difficult for me"]


In [64]:
#@title If it was too easy, what more would you have liked to be added? If it was very difficult, what would you have liked to have been removed? { run: "auto", display-mode: "form" }
Additional = "NA" #@param {type:"string"}


In [65]:
#@title Can you identify the concepts from the lecture which this experiment covered? { run: "auto", vertical-output: true, display-mode: "form" }
Concepts = "Yes" #@param ["","Yes", "No"]


In [66]:
#@title  Text and image description/explanation and code comments within the experiment: { run: "auto", vertical-output: true, display-mode: "form" }
Comments = "Very Useful" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]


In [67]:
#@title Mentor Support: { run: "auto", vertical-output: true, display-mode: "form" }
Mentor_support = "Very Useful" #@param ["","Very Useful", "Somewhat Useful", "Not Useful", "Didn't use"]


In [68]:
#@title Run this cell to submit your notebook for grading { vertical-output: true }
try:
  if submission_id:
      return_id = submit_notebook()
      if return_id : submission_id = return_id
  else:
      print("Please complete the setup first.")
except NameError:
  print ("Please complete the setup first.")

Your submission is successful.
Ref Id: 2420
Date of submission:  16 Feb 2026
Time of submission:  20:41:16
View your submissions: https://learn-iitm.talentsprint.com/notebook_submissions
