# Retrieval-Augmented Language Models – Bridging LLMs with Efficient Knowledge Retrieval

#### Large Language Models (LLMs) are powerful, but they have limitations: their knowledge stops at the training cutoff, so they miss recent information, and they can hallucinate.

#### Retrieval-augmented language models, the basis of retrieval-augmented generation (RAG), address these problems by letting the model fetch relevant information from external sources instead of relying only on its training data.

![image.png](attachment:image.png)

#### This session will cover how retrieval-based models work, the different ways they retrieve information (like using sparse and dense retrieval methods), and how they improve accuracy and efficiency.

#### We will explore models like kNN-LMs, REALM, RETRO, and RAG, showing how they use retrieval to enhance responses.

![image.png](attachment:image.png)

#### Additionally, we will discuss strategies for improving retrieval, aligning retrieved knowledge with model outputs, and refining prompts for better results, especially in low-resource settings.

#### By combining retrieval with language models, we can build smaller, more efficient, and more reliable AI systems that provide accurate, well-supported answers in real-world applications.

## Load File

In [2]:
import pandas as pd

data = pd.read_csv("/content/sampled_reviews.csv")

data.head()

Unnamed: 0.1,Unnamed: 0,Id,Time,ProductId,UserId,Score,Summary,Text,combined
0,9953,9954,1344470400,B001CGTN1I,A3S3VSXEFXBMRC,5,What an amazing product!,I got this product because I used to buy chia ...,Title: What an amazing product!; Content: I go...
1,3850,3851,1337558400,B002HY8GNA,A3LC8ZA3XARKWX,5,completly addicted love them cant get enough,These Mega lollies are the best i eat 10 a day...,Title: completly addicted love them cant get e...
2,4962,4963,1309392000,B001DW2RGO,A18WGZSR2TB9RJ,3,Has an affect.,Six Hour Power does help to create a more ener...,Title: Has an affect.; Content: Six Hour Power...
3,3886,3887,1349136000,B005GX7GVW,A1I34N9LFOSCX7,5,Yum!,This soup cooks up quickly and is very yummy! ...,Title: Yum!; Content: This soup cooks up quick...
4,5437,5438,1346025600,B008YGWIZM,A33947M1Y587GX,5,"This stuff is the ""put on everything"" sauce",So I have sampled and bought my fair share of ...,"Title: This stuff is the ""put on everything"" s..."


# How to retrieve from a book or corpus?

## Sparse Retrieval

### TF-IDF

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [4]:
documents = data['combined'].tolist()

In [5]:
vectorizer = TfidfVectorizer()

tfidf_matrix = vectorizer.fit_transform(documents)

print("TF-IDF matrix shape:", tfidf_matrix.shape)

TF-IDF matrix shape: (2000, 9324)


In [6]:
feature_names = vectorizer.get_feature_names_out()
print("Number of features:", len(feature_names))
print("Some feature names:", feature_names[:10])

Number of features: 9324
Some feature names: ['00' '000' '008' '032' '05' '06' '062' '09' '090' '0xk6hzpjrkaed855hewp']


In [7]:
dense_rep = tfidf_matrix.toarray()
print("\nTF-IDF vector for first document:")
print(dense_rep[0])


TF-IDF vector for first document:
[0. 0. 0. ... 0. 0. 0.]
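
To use TF-IDF vectors for retrieval, a query is projected into the same term space and documents are ranked by cosine similarity. The sketch below does this from scratch on a toy corpus so the arithmetic is visible. The `toy_docs` strings and all helper names are illustrative assumptions, and the smoothed IDF `ln((1 + n) / (1 + df)) + 1` is the variant scikit-learn documents as its default, not something taken from this notebook's data.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration, not from sampled_reviews.csv)
toy_docs = [
    "great dog food for puppies",
    "spicy sauce for every meal",
    "my dog loves this food",
]

def tokenize(text):
    return text.lower().split()

def build_idf(tokenized_docs):
    # Smoothed inverse document frequency per term
    n = len(tokenized_docs)
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

def tfidf_vector(tokens, idf):
    # Sparse, L2-normalised TF-IDF vector; terms outside the vocabulary are ignored
    tf = Counter(t for t in tokens if t in idf)
    vec = {t: count * idf[t] for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def cosine(a, b):
    # Both vectors are already unit length, so the dot product is the cosine
    return sum(w * b.get(t, 0.0) for t, w in a.items())

tokenized = [tokenize(d) for d in toy_docs]
idf = build_idf(tokenized)
doc_vecs = [tfidf_vector(tokens, idf) for tokens in tokenized]

def tfidf_search(query, top_k=2):
    q_vec = tfidf_vector(tokenize(query), idf)
    scored = [(cosine(q_vec, d_vec), i) for i, d_vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: -pair[0])
    return [(i, round(score, 4)) for score, i in scored[:top_k]]

for doc_id, score in tfidf_search("dog food"):
    print(f"Doc {doc_id} (cosine: {score}) -> {toy_docs[doc_id]}")
```

With the notebook's own objects, the same idea would be `vectorizer.transform([query])` followed by a cosine similarity against `tfidf_matrix`.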


### BM25

In [12]:
!pip install rank_bm25


In [13]:
import numpy as np
from rank_bm25 import BM25Okapi

documents = data['combined'].tolist()

tokenized_docs = [doc.lower().split() for doc in documents]

bm25 = BM25Okapi(tokenized_docs)


In [14]:
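# Note: these query terms rarely occur in food reviews, so most documents score 0
# (see the output below); a query such as "dog food" gives more informative scores.
query = "bm25 retrieval function"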

query_tokens = query.lower().split()

scores = bm25.get_scores(query_tokens)

print("Query:", query)
for i, score in enumerate(scores):
    print(f"Doc {i} score: {score:.4f} -> {documents[i]}")

Query: bm25 retrieval function
Doc 0 score: 0.0000 -> Title: What an amazing product!; Content: I got this product because I used to buy chia water in Kreation cafe, and they sell one bottle for like $6. This little bag can make at least c hundred of these bottles, if not much more. What a great saving! I make a bottle (or two) of chia water, add agave syrup and stevia to it and slowly drink throughout the day. These seeds have so many benefits! And I LOVE the smooth sensation of them on my tongue. It's a fun water, and I am buying more. For now I have been using it for about 2 weeks, and my bag is still almost fool.
Doc 1 score: 0.0000 -> Title: completly addicted love them cant get enough; Content: These Mega lollies are the best i eat 10 a day so they go quickly, i'm not going to tell you how many ive bought since Jan. 1 this year. They are great i love them.
Doc 2 score: 0.0000 -> Title: Has an affect.; Content: Six Hour Power does help to create a more energetic and alert state.  
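
To make the scores above less opaque, here is a minimal from-scratch sketch of Okapi BM25 scoring with top-k ranking. It assumes the textbook formula with a `log(1 + ...)` IDF, `k1=1.5` and `b=0.75`; `rank_bm25` may handle the IDF term slightly differently, so the numbers are not expected to match the library exactly. The toy documents and the function name `bm25_scores` are hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with the Okapi BM25 formula."""
    n_docs = len(tokenized_docs)
    avgdl = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: in how many documents each term appears
    df = Counter(term for doc in tokenized_docs for term in set(doc))

    scores = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation (k1) and length normalisation (b)
            denom = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

toy_docs = [
    "great dog food for puppies",
    "spicy sauce for every meal",
    "my dog loves this dog food",
]
tokenized = [doc.lower().split() for doc in toy_docs]
scores = bm25_scores("dog food".split(), tokenized)

# Print only the best matches instead of every document
top_k = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:2]
for i in top_k:
    print(f"Doc {i} score: {scores[i]:.4f} -> {toy_docs[i]}")
```

Sorting and printing only the top few results keeps the output readable on a 2,000-document corpus, where most scores are zero for any given query.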

## Building an Inverted Index

In [15]:
# Build an inverted index (term -> ids of the documents containing it) from the TF-IDF matrix.
# CSC layout stores each column's row indices contiguously, so per-term lookups are cheap.
tfidf_csc = tfidf_matrix.tocsc()
inverted_index = {}

for term_idx, term in enumerate(feature_names):
    start, end = tfidf_csc.indptr[term_idx], tfidf_csc.indptr[term_idx + 1]
    inverted_index[term] = tfidf_csc.indices[start:end].tolist()

print("Indexed terms:", len(inverted_index))

# Example query: count how many query terms each document contains
query_terms = ["food", "dog"]
matched_docs = {}

for term in query_terms:
    for doc_id in inverted_index.get(term, []):
        matched_docs[doc_id] = matched_docs.get(doc_id, 0) + 1

# Sort documents by number of matching query terms
ranked_results = sorted(matched_docs.items(), key=lambda x: x[1], reverse=True)

print(f"\nQuery: {query_terms}")
print("Ranked Matching Documents (top 5):")
for doc_id, match_count in ranked_results[:5]:
    print(f"Doc {doc_id} (matches: {match_count}) -> {documents[doc_id]}")

## Dense Passage Retriever (DPR)

In [4]:
data = data.sample(n=500, random_state=42)

In [5]:
documents = data['combined'].tolist()

In [3]:
from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer
from transformers import DPRContextEncoder, DPRContextEncoderTokenizer
from sentence_transformers.util import cos_sim
import torch


question_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
question_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")

passage_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
passage_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

device = "cuda" if torch.cuda.is_available() else "cpu"
question_encoder = question_encoder.to(device)
passage_encoder = passage_encoder.to(device)

In [6]:
!pip install faiss-gpu-cu12

In [7]:
from tqdm import tqdm

In [34]:
def encode_passages_batched(passages, batch_size=8):
    all_embeddings = []

    for i in tqdm(range(0, len(passages), batch_size)):
        batch_passages = passages[i:i + batch_size]
        inputs = passage_tokenizer(batch_passages, padding=True, truncation=True, return_tensors="pt", max_length=512).to(device)

        with torch.no_grad():
            outputs = passage_encoder(**inputs)
            embeddings = outputs.pooler_output

        all_embeddings.append(embeddings.cpu())

    return torch.cat(all_embeddings, dim=0)

passage_embeddings = encode_passages_batched(documents, batch_size=128)

100%|██████████| 4/4 [00:15<00:00,  3.80s/it]


In [19]:
import faiss

In [20]:
import numpy as np

In [26]:
import faiss.contrib.torch_utils

In [29]:
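# Store the passage embeddings in a FAISS index

dimension = passage_embeddings.shape[1]
# IndexFlatL2 ranks by squared L2 distance (lower is closer). DPR was trained with
# inner-product similarity, so faiss.IndexFlatIP(dimension) would match it more closely.
index = faiss.IndexFlatL2(dimension)
# faiss.contrib.torch_utils (imported above) lets the index accept torch tensors directly
index.add(passage_embeddings)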

In [32]:
# Search the index with a query

def search_index(query, top_k=5):
  """
  Searches the index for the given query.

  Args:
    query: The query string.
    top_k: The number of top results to return.

  Returns:
    A list of tuples (document_index, distance) for the top matching documents.
    With IndexFlatL2 the distance is a squared L2 distance, so lower means closer.
  """
  # Encode the query with the DPR question encoder
  question_inputs = question_tokenizer(query, padding=True, truncation=True, return_tensors="pt").to(device)

  with torch.no_grad():
    question_embedding = question_encoder(**question_inputs).pooler_output.cpu()

  # D holds the distances and I the document indices, both of shape (1, top_k)
  D, I = index.search(question_embedding, k=top_k)

  return list(zip(I[0], D[0]))

# Example Usage
query = "Which dog foods are good for golden retriever?"  # Replace with your query
results = search_index(query)

print(f"Query: {query}")
for doc_index, score in results:
  print(f"Doc {doc_index} (Score: {score:.4f}) -> {documents[doc_index]}")


Query: Which dog foods are good for golden retriever?
Doc 124 (Score: 99.1531) -> Title: The Best Training Treat; Content: My dog's breeder sent home a baggie of cheese and egg flavored Charlie Bears for my Maltese puppy.  I used them to reward him from the get-go, and he will do just about anything if he knows this treat is coming.  A great training aid.  They are also small enough to give often, and are made with good ingredients.  I carry them in my pocket and my little guy keeps an eye on that pocket!
Doc 184 (Score: 99.6995) -> Title: You bet your life!; Content: I have pain and numbness down both arms from a pinched nerve at the bottom of my neck.  I started drinking this juice three months into what will likely be a four to six month recovery.  Because my doctor has put me on a long-term anti-inflammatory medication, I am not allowed to take over-the-counter medications like Advil and Tylenol.  I can say that taking a glass of this juice has the same pain-killing effect as takin
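
The scores printed above are squared L2 distances from `IndexFlatL2`, so smaller values rank first. DPR encoders are trained to rank by inner product, and the two measures can disagree when embeddings are not normalised. The toy sketch below shows such a disagreement; the two-dimensional vectors and passage names are made up purely for illustration.

```python
def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical 2-d "embeddings"; real DPR vectors have 768 dimensions
query_vec = [1.0, 1.0]
passages = {
    "short_vector": [0.9, 0.8],  # close to the query in space, small norm
    "long_vector": [3.0, 3.0],   # same direction as the query, large norm
}

# L2 ranking: smallest distance first
by_l2 = sorted(passages, key=lambda name: squared_l2(query_vec, passages[name]))
# Inner-product ranking: largest score first
by_ip = sorted(passages, key=lambda name: inner_product(query_vec, passages[name]), reverse=True)

print("Ranked by squared L2:   ", by_l2)
print("Ranked by inner product:", by_ip)
```

In FAISS, `faiss.IndexFlatIP(dimension)` is the inner-product counterpart of `IndexFlatL2`; with it, higher scores rank first.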

In [35]:
# prompt: chunk the documents in size of 500 tokens each and then run passage encoder

def chunk_documents(documents, tokenizer, max_length=500):
  """Chunks documents into smaller passages of a specified maximum length.

  Args:
    documents: A list of documents (strings).
    tokenizer: The tokenizer to use for tokenization.
    max_length: The maximum length of each chunk in tokens.

  Returns:
    A list of chunks (strings).
  """
  chunks = []
  for document in documents:
    tokens = tokenizer.tokenize(document)
    for i in range(0, len(tokens), max_length):
      chunk = tokenizer.convert_tokens_to_string(tokens[i:i + max_length])
      chunks.append(chunk)
  return chunks

# Chunk the documents
chunked_documents = chunk_documents(documents, passage_tokenizer)

In [37]:
len(chunked_documents)

508
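
The chunker above cuts at fixed 500-token boundaries, so a sentence that straddles a boundary is split across two chunks. A common variant is a sliding window with some overlap between consecutive chunks. The sketch below is one way to do that on a plain token list; the function name `chunk_with_overlap` and the `overlap` parameter are assumptions for illustration, not part of the notebook.

```python
def chunk_with_overlap(tokens, max_length=500, overlap=50):
    """Split a token list into windows of max_length that share `overlap` tokens."""
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must satisfy 0 <= overlap < max_length")
    stride = max_length - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + max_length])
        # Stop once a window reaches the end, to avoid a redundant tail chunk
        if start + max_length >= len(tokens):
            break
    return chunks

# Tiny demonstration with 12 placeholder tokens
tokens = [f"tok{i}" for i in range(12)]
chunks = chunk_with_overlap(tokens, max_length=5, overlap=2)
for chunk in chunks:
    print(chunk)
```

With the notebook's tokenizer, the token list would come from `passage_tokenizer.tokenize(document)` and each window would be turned back into text with `convert_tokens_to_string`, as in `chunk_documents`.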

In [41]:
!pip install -U langchain-community

In [43]:
# Wrap the two DPR encoders in a LangChain Embeddings class and index with FAISS.
# HuggingFaceEmbeddings expects a model name (a string), not a loaded DPR model object,
# so the encoders defined above are adapted directly instead.

from langchain_core.embeddings import Embeddings
from langchain_community.vectorstores import FAISS


class DPREmbeddings(Embeddings):
    """Passages go through the context encoder, queries through the question encoder."""

    def embed_documents(self, texts):
        # Reuses encode_passages_batched from the cell above
        return encode_passages_batched(texts, batch_size=32).tolist()

    def embed_query(self, text):
        inputs = question_tokenizer(text, truncation=True, max_length=512, return_tensors="pt").to(device)
        with torch.no_grad():
            return question_encoder(**inputs).pooler_output[0].cpu().tolist()


db = FAISS.from_texts(documents, DPREmbeddings())
query = "Which dog foods are good for golden retriever?"
results = db.similarity_search(query)
for doc in results:
    print(doc.page_content)


## ColBERT

In [None]:
data["doc_id"] = data.index
data = data[["doc_id", "combined"]]

In [None]:
def save_df_to_colbert_format(df, output_path, id_col="doc_id", text_col="combined"):
    with open(output_path, "w", encoding="utf-8") as f:
        for _, row in df.iterrows():
            doc_id = str(row[id_col])
            text = row[text_col].replace("\n", " ").replace("\t", " ").strip()
            f.write(f"{doc_id}\t{text}\n")

# Save to disk
save_df_to_colbert_format(data, "colbert_corpus.tsv")

In [None]:
from ragatouille import RAGPretrainedModel

# Load ColBERTv2
colbert_model = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# Documents: reuse the `documents` list of review strings built in the DPR section


# Index the documents
colbert_model.index(
    collection=documents,
    index_name="my-colbert-index",
    max_document_length=180  # token limit per doc
)

ImportError: cannot import name 'Trainer' from 'colbert' (/home/sahil/miniconda3/envs/ragtaxo/lib/python3.10/site-packages/colbert/__init__.py)

## Hypothetical Document Embeddings (HyDE)
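
The idea behind HyDE is to search with a hypothetical answer rather than with the query itself: a language model first writes a plausible answer document, that text is embedded, and the real corpus is searched for passages near it. The sketch below only illustrates the control flow. `generate_hypothetical_document` is a stand-in that returns a canned string where a real system would call an LLM, and `embed` is a bag-of-words stand-in for a dense encoder; both, along with the toy `corpus`, are assumptions.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration)
corpus = [
    "This kibble keeps my golden retriever healthy and his coat shiny.",
    "The hot sauce is great on tacos and eggs.",
    "Chia seeds make a refreshing drink.",
]

def embed(text):
    # Stand-in for a dense encoder: a bag-of-words count vector
    cleaned = text.lower().replace(".", "").replace("?", "")
    return Counter(cleaned.split())

def cosine(a, b):
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def generate_hypothetical_document(query):
    # Stand-in for an LLM call that would be prompted to answer `query`
    return "A healthy kibble for a golden retriever keeps the coat shiny."

def hyde_search(query, top_k=1):
    # 1) Write a hypothetical answer, 2) embed it, 3) retrieve real passages near it
    hypothetical_vec = embed(generate_hypothetical_document(query))
    ranked = sorted(
        range(len(corpus)),
        key=lambda i: cosine(hypothetical_vec, embed(corpus[i])),
        reverse=True,
    )
    return ranked[:top_k]

for doc_id in hyde_search("Which dog foods are good for golden retriever?"):
    print(f"Doc {doc_id} -> {corpus[doc_id]}")
```

In this notebook's setting, the hypothetical text would be encoded with the passage encoder and searched against the FAISS index in place of the raw question.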

# How to increase diversity?

1. Remove duplicate and near-duplicate passages.
2. Remove redundant passages using BM25.
3. Rephrase the query and retrieve with multiple query variants.
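
The duplicate-removal idea listed above can be sketched as a greedy filter over a ranked result list: keep a passage only if it is not too similar to anything already kept. Jaccard overlap on lowercase tokens and the `threshold=0.8` cut-off are assumptions chosen for illustration, and the helper names are hypothetical.

```python
def jaccard(a, b):
    """Token-set overlap between two strings, in [0, 1]."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def drop_near_duplicates(ranked_passages, threshold=0.8):
    """Walk the ranked list and keep passages that differ enough from those kept so far."""
    kept = []
    for passage in ranked_passages:
        if all(jaccard(passage, other) < threshold for other in kept):
            kept.append(passage)
    return kept

# Made-up retrieval results, best match first
retrieved = [
    "Great dog food, my puppy loves it",
    "great dog food, my puppy loves it",          # exact duplicate up to casing
    "Great dog food, my puppy really loves it",   # near duplicate
    "This soup cooks up quickly",
]

diverse = drop_near_duplicates(retrieved)
for passage in diverse:
    print(passage)
```

Because the list is processed in rank order, the highest-ranked member of each group of near-duplicates is the one that survives.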

## kNN-LM
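
A kNN-LM combines the base language model's next-token distribution with a distribution built from the nearest neighbours of the current context in a datastore of (context embedding, next token) pairs: p(w) = λ · p_kNN(w) + (1 − λ) · p_LM(w). The sketch below shows only this interpolation step. The neighbour distances, the LM probabilities, and the value of `lam` are made-up numbers, and a real system would obtain the neighbours from a vector index such as FAISS.

```python
import math
from collections import defaultdict

def knn_distribution(neighbors):
    """Turn (distance, next_token) neighbours into a distribution via softmax over -distance."""
    weights = defaultdict(float)
    for distance, token in neighbors:
        weights[token] += math.exp(-distance)
    total = sum(weights.values())
    return {token: weight / total for token, weight in weights.items()}

def interpolate(p_lm, p_knn, lam=0.25):
    """p(w) = lam * p_knn(w) + (1 - lam) * p_lm(w) over the union of both vocabularies."""
    vocab = set(p_lm) | set(p_knn)
    return {w: lam * p_knn.get(w, 0.0) + (1 - lam) * p_lm.get(w, 0.0) for w in vocab}

# Hypothetical base-LM distribution for the next token
p_lm = {"kibble": 0.2, "treats": 0.5, "sauce": 0.3}
# Hypothetical retrieved neighbours: (distance to current context, token that followed)
neighbors = [(0.1, "kibble"), (0.3, "kibble"), (1.5, "treats")]

p_knn = knn_distribution(neighbors)
p_final = interpolate(p_lm, p_knn, lam=0.5)

for token, prob in sorted(p_final.items(), key=lambda item: -item[1]):
    print(f"{token}: {prob:.3f}")
```

Here the retrieved neighbours shift probability toward "kibble" even though the base model preferred "treats", which is the effect the interpolation is meant to capture.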

# How to Improve RAG outputs?

![image.png](attachment:image.png)

![image.png](attachment:image.png)

# Content credits
- Graham Neubig’s lecture - https://phontron.com/class/anlp2024/assets/slides/anlp-10-rag.pdf
- ACL 2023 Tutorial - https://acl2023-retrieval-lm.github.io/