<a href="https://colab.research.google.com/github/remytr/RAG-System/blob/main/RAG_Systems.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **What is the project aim?**
The aim is to build a RAG (Retrieval-Augmented Generation) system and run systematic experiments to understand what makes retrieval work well.

## Loading the Dataset

Load the MS MARCO dataset from the Hugging Face Hub. MS MARCO is a collection of datasets used for deep learning in search.



In [None]:
from datasets import load_dataset

ds = load_dataset("microsoft/ms_marco", "v2.1")

In [None]:
ds.shape

{'validation': (101093, 6), 'train': (808731, 6), 'test': (101092, 6)}

In [None]:
print(ds)

DatasetDict({
    validation: Dataset({
        features: ['answers', 'passages', 'query', 'query_id', 'query_type', 'wellFormedAnswers'],
        num_rows: 101093
    })
    train: Dataset({
        features: ['answers', 'passages', 'query', 'query_id', 'query_type', 'wellFormedAnswers'],
        num_rows: 808731
    })
    test: Dataset({
        features: ['answers', 'passages', 'query', 'query_id', 'query_type', 'wellFormedAnswers'],
        num_rows: 101092
    })
})


In [None]:
# Index the row first so only one example is read, not the whole 'answers' column
print(ds['train'][0]['answers'])

['The immediate impact of the success of the manhattan project was the only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated.']


In [None]:
for i in range(3):
  print(ds['train'][i])

{'answers': ['The immediate impact of the success of the manhattan project was the only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated.'], 'passages': {'is_selected': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'passage_text': ['The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated.', 'The Manhattan Project and its atomic bomb helped bring an end to World War II. Its legacy of peaceful uses of atomic energy continues to have an impact on history and science.', 'Essay on The Manhattan Project - The Manhattan Project The Manhattan Project was to see if making an atomic bomb possible. The success of this projec

In [None]:
print(ds['train'].features)

{'answers': List(Value('string')), 'passages': {'is_selected': List(Value('int32')), 'passage_text': List(Value('string')), 'url': List(Value('string'))}, 'query': Value('string'), 'query_id': Value('int32'), 'query_type': Value('string'), 'wellFormedAnswers': List(Value('string'))}


In [None]:
print(ds['train'].column_names)

['answers', 'passages', 'query', 'query_id', 'query_type', 'wellFormedAnswers']


In [None]:
print(len(ds['train']))

808731


What is `is_selected`?
It is a list of `int32` flags, one per passage, each either 1 or 0. A value of 1 means the passage was marked as relevant to the query; 0 means the passage was not marked as relevant.



In [None]:
example = ds['train'][0]
selected = sum(example['passages']['is_selected'])
total = len(example['passages']['is_selected'])
print(selected / total *100)

# Only 10% of passages in first example are relevant to the query.

10.0
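
The single-row check above extends naturally to several rows. Below is a minimal sketch on made-up rows shaped like the `passages` field. The `selected_fraction` helper and the toy flags are assumptions for illustration, not values taken from MS MARCO.

```python
def selected_fraction(passages):
    """Fraction of passages in one row whose is_selected flag is 1."""
    flags = passages['is_selected']
    return sum(flags) / len(flags) if flags else 0.0

# Toy rows mimicking the nested 'passages' structure (hypothetical data)
toy_rows = [
    {'passages': {'is_selected': [1, 0, 0, 0]}},
    {'passages': {'is_selected': [0, 0, 0, 0, 0]}},
    {'passages': {'is_selected': [0, 1, 0, 1]}},
]

fractions = [selected_fraction(row['passages']) for row in toy_rows]

# Rows with at least one selected passage are the ones usable for evaluation
rows_with_selection = sum(1 for f in fractions if f > 0)

print(fractions)
print(rows_with_selection)
```

Rows whose fraction is zero have no marked passage at all, which matters later when choosing queries for evaluation.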


## Exploring FAISS

Installing FAISS (Facebook AI Similarity Search)

In [None]:
!pip install faiss-cpu


#### Example of using FAISS

In [None]:
import requests
from io import StringIO
import pandas as pd

In [None]:
res = requests.get('https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/sick2014/SICK_train.txt')
# create dataframe
data = pd.read_csv(StringIO(res.text), sep='\t')
data.head()

Unnamed: 0,pair_ID,sentence_A,sentence_B,relatedness_score,entailment_judgment
0,1,A group of kids is playing in a yard and an ol...,A group of boys in a yard is playing and a man...,4.5,NEUTRAL
1,2,A group of children is playing in the house an...,A group of kids is playing in a yard and an ol...,3.2,NEUTRAL
2,3,The young boys are playing outdoors and the ma...,The kids are playing outdoors near a man with ...,4.7,ENTAILMENT
3,5,The kids are playing outdoors near a man with ...,A group of kids is playing in a yard and an ol...,3.4,NEUTRAL
4,9,The young boys are playing outdoors and the ma...,A group of kids is playing in a yard and an ol...,3.7,NEUTRAL


In [None]:
# take all samples from sentence A
# and turn the column into a list
sentences = data['sentence_A'].tolist()
sentences[:5]

['A group of kids is playing in a yard and an old man is standing in the background',
 'A group of children is playing in the house and there is no man standing in the background',
 'The young boys are playing outdoors and the man is smiling nearby',
 'The kids are playing outdoors near a man with a smile',
 'The young boys are playing outdoors and the man is smiling nearby']

In [None]:
# we take all samples from both sentence A and B
sentences = data['sentence_A'].tolist()
sentence_b = data['sentence_B'].tolist()
sentences.extend(sentence_b)  # merge them
len(set(sentences))  # together we have ~4.8K unique sentences

4802

In [None]:
# urls = [
#     'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2012/MSRpar.train.tsv',
#     'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2012/MSRpar.test.tsv',
#     'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2012/OnWN.test.tsv',
#     'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2013/OnWN.test.tsv',
#     'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2014/OnWN.test.tsv',
#     'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2014/images.test.tsv',
#     'https://raw.githubusercontent.com/brmson/dataset-sts/master/data/sts/semeval-sts/2015/images.test.tsv'
# ]

In [None]:
# # each of these dataset have the same structure, so we loop through each creating our sentences data
# for url in urls:
#     res = requests.get(url)
#     # extract to dataframe
#     data = pd.read_csv(StringIO(res.text), sep='\t', header=None, on_bad_lines='skip')
#     # add to columns 1 and 2 to sentences list
#     sentences.extend(data[1].tolist())
#     sentences.extend(data[2].tolist())

In [None]:
# len(set(sentences))

In [None]:
# remove duplicates and NaN
sentences = [s for s in set(sentences) if isinstance(s, str)]
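
One caveat with deduplicating through `set()`: the resulting order of strings is not stable across Python sessions, so the position of a sentence in the list (and later in the index) can change between runs. A minimal sketch of an order-preserving alternative follows. The `dedupe_keep_order` helper and the toy input are hypothetical, not part of the notebook.

```python
def dedupe_keep_order(items):
    """Drop non-string entries (e.g. NaN) and duplicates, keeping first-seen order."""
    strings = (item for item in items if isinstance(item, str))
    # dict keys are unique and remember insertion order
    return list(dict.fromkeys(strings))

# Toy input with a duplicate and a NaN, as pandas might produce (hypothetical data)
toy_sentences = ['a cat sits', 'a dog runs', float('nan'), 'a cat sits', 'birds fly']
unique_sentences = dedupe_keep_order(toy_sentences)
print(unique_sentences)
```

With a stable order, row `i` of the embedding matrix maps to the same sentence on every run.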

In [None]:
from sentence_transformers import SentenceTransformer
# initialize sentence transformer model
model = SentenceTransformer('bert-base-nli-mean-tokens')
# create sentence embeddings
sentence_embeddings = model.encode(sentences)
sentence_embeddings.shape

(4802, 768)

In [None]:
# embedding dimensionality, needed to size the index
d = sentence_embeddings.shape[1]
d

768

In [None]:
import faiss

# flat (exhaustive) index using L2 distance
index = faiss.IndexFlatL2(d)

In [None]:
index.is_trained

True
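
To my understanding, a flat L2 index such as the `IndexFlatL2` created above does an exhaustive comparison of the query against every stored vector, ranking by squared Euclidean distance. The sketch below mirrors that idea in plain Python on toy 3-dimensional vectors. `flat_l2_search` is a hypothetical helper for illustration and is not part of the FAISS API.

```python
def flat_l2_search(vectors, query, k):
    """Return (squared_distance, position) pairs for the k stored vectors closest to query."""
    scored = [
        (sum((a - b) ** 2 for a, b in zip(vec, query)), pos)
        for pos, vec in enumerate(vectors)
    ]
    # Sorting the tuples orders by distance first, then by position
    return sorted(scored)[:k]

# Toy "embeddings" (hypothetical data, 3 dimensions instead of 768)
toy_vectors = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [3.0, 3.0, 3.0],
]
toy_query = [0.9, 0.1, 0.0]

top2 = flat_l2_search(toy_vectors, toy_query, k=2)
nearest_positions = [pos for _, pos in top2]
print(nearest_positions)
```

The returned positions play the role of row numbers into `sentences`, which is how a retrieved vector is mapped back to its text.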

## New Start

In [None]:
#!pip install datasets sentence-transformers faiss-cpu pandas numpy

In [None]:
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
#from faiss_index_class import FaissIndex # Import the class defined above
import pandas as pd
import numpy as np
import time

In [None]:
dataset = load_dataset('ms_marco', 'v1.1', split='validation', streaming=True)
print("Validation dataset loaded in streaming mode.")

Validation dataset loaded in streaming mode.


Below we filter the MS MARCO stream down to the rows that are usable for RAG evaluation. The dataset is large, and not every query has a passage that a human annotator marked as relevant (`is_selected` = 1).

Suppose you tested your RAG system on a query with no **ground truth context** marked and the system returned nothing. You could not tell whether the system failed or the query was simply untestable.

In [None]:

# Configuration
SAMPLE_SIZE = 1000
test_data_rows = []
query_count = 0

print(f"Starting to extract {SAMPLE_SIZE} valid test queries...")

# Iterate through the streaming dataset row by row
for row in dataset:

    # MS MARCO has a nested structure for passages
    passage_texts = row['passages']['passage_text']
    is_selected_flags = row['passages']['is_selected']

    # Find the ground truth context: the passages a human annotator
    # marked as relevant ('is_selected' is 1)
    relevant_passages = [
        text for text, is_selected in zip(passage_texts, is_selected_flags)
        if is_selected == 1
    ]

    # A valid test query must have at least one marked relevant passage
    if relevant_passages:

        # Collect the data points needed for evaluation
        test_data_rows.append({
            'query_id': row['query_id'],
            'query': row['query'],
            'answer': row['answers'][0],  # Take the first human-written answer
            'ground_truth_context': relevant_passages[0], # The exact passage text
        })

        query_count += 1

        # Stop once we hit the sample limit
        if query_count >= SAMPLE_SIZE:
            break

print(f"✅ Extracted {len(test_data_rows)} valid queries.")

Starting to extract 1000 valid test queries...
✅ Extracted 1000 valid queries.
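
With a `ground_truth_context` stored for each query, one simple way to score retrieval is the share of queries whose ground-truth passage appears among the top-k results. The sketch below assumes a `retrieve` callable that returns ranked passage texts. `hit_rate_at_k`, `toy_retrieve`, and the toy rows are hypothetical stand-ins, not the notebook's actual retriever or data.

```python
def hit_rate_at_k(test_rows, retrieve, k=5):
    """Share of queries whose ground-truth passage is among the top-k retrieved texts."""
    if not test_rows:
        return 0.0
    hits = 0
    for row in test_rows:
        top_k = retrieve(row['query'])[:k]
        if row['ground_truth_context'] in top_k:
            hits += 1
    return hits / len(test_rows)

# Toy rows shaped like the entries collected in test_data_rows (hypothetical data)
toy_test_rows = [
    {'query': 'q1', 'ground_truth_context': 'passage A'},
    {'query': 'q2', 'ground_truth_context': 'passage B'},
]

# Stand-in retriever: a fixed lookup instead of a real vector search
toy_results = {
    'q1': ['passage A', 'passage C'],
    'q2': ['passage C', 'passage D'],
}

def toy_retrieve(query):
    return toy_results[query]

score = hit_rate_at_k(toy_test_rows, toy_retrieve, k=2)
print(score)
```

Exact string matching only works if the indexed passages are stored verbatim. If passages are chunked or cleaned before indexing, the comparison would need to be looser.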
