In this notebook we will divide the general task of building the RAG system into smaller parts and test solutions for every one of them.

As a data source we use the full English text of the book "The Count of Monte Cristo".

# Chunking

Because the RAG system is built on a standard LLM with a limited context window, we can't pass it the whole text at once; instead, we need to divide the text into chunks.

In [41]:
import os
import pandas as pd
from langchain_text_splitters import CharacterTextSplitter
import seaborn as sns
import ast

In [54]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Pass the path components separately so the path also works outside Windows.
DATA_FILE_PATH = os.path.join(os.path.dirname(os.getcwd()), "data", "CountMonteCristoFull.txt")

In [None]:
with open(DATA_FILE_PATH, "r", encoding="utf-8") as f:
    data_corpus = f.read()

The whole text consists of 2646643 characters

In [None]:
len(data_corpus)

2646643

According to the model card on the Hugging Face [website](https://huggingface.co/meta-llama/Meta-Llama-3-8B), the llama3-8b-8192 model (chosen as the LLM for this system) has a context window of 8192 tokens. This means that we should divide the whole dataset into parts no larger than about 6000-6500 tokens.\
Besides that, to help the model keep track of the text, these parts will overlap by 400-600 tokens so that connections between consecutive chunks are retained.\
Our experiments gave the following rough correspondence between chunk length in characters and chunk length in tokens:
* 5500-character chunks + 600-character overlap ~ 1200-1450 tokens per chunk (this variant was tested, but the system responses were the worst among all tests)
* 8000-character chunks + 800-character overlap ~ 1700-2100 tokens per chunk (the results were better than with the 5.5k-character split)
* 10000-character chunks + 1000-character overlap ~ 2300-2600 tokens per chunk (this variant was tested and the system responses were almost consistently good)

After further analyzing the chapter sizes, which range from 10k to 60k characters, we decided to split the text into chunks of 8k characters (1.7k-2.1k tokens).
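
To make the roles of chunk size and overlap concrete, here is a minimal character-based sketch of overlapping chunking. It is an illustration only: `sliding_chunks` is a hypothetical helper that cuts at fixed character offsets, whereas `CharacterTextSplitter` splits on a separator and merges the pieces, so its chunks will not line up exactly with these.

```python
def sliding_chunks(text, chunk_size, overlap):
    """Cut text into windows of chunk_size characters that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap  # how far each new window starts after the previous one
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


# Toy example: windows of 4 characters sharing 1 character with their neighbour.
print(sliding_chunks("abcdefghij", chunk_size=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```

With a chunk size of 8000 and an overlap of 800, each new window in this scheme would start 7200 characters after the previous one.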

In [None]:
splitter = CharacterTextSplitter(separator="\n\n", chunk_size=8000, chunk_overlap=800)
text_chunks = splitter.create_documents([data_corpus])

In [None]:
print('Example of text chunk:', text_chunks[0].page_content,
      '\nNumber of chunks:', len(text_chunks))

Example of text chunk: ﻿The Project Gutenberg eBook of The Count of Monte Cristo
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.

Title: The Count of Monte Cristo

Author: Alexandre Dumas
        Auguste Maquet

Release date: January 1, 1998 [eBook #1184]
                Most recently updated: February 4, 2024

Language: English

Credits: Anonymous Project Gutenberg Volunteers, Dan Muller and David Widger


*** START OF THE PROJECT GUTENBERG EBOOK THE COUNT OF MONTE CRISTO ***
THE COUNT OF MONTE CRISTO

by Alexandre Dumas [père]

0009m 

0011m 

0019m 

Contents


 VOLUME ONE
Chapter 1. Marsei

In [None]:
DENSE_RETRIEVER_MODEL_NAME = "all-MiniLM-L6-v2"
CROSS_ENCODER_MODEL_NAME = 'cross-encoder/ms-marco-MiniLM-L-12-v2'
LLM_CORE_MODEL_NAME = "groq/llama3-8b-8192"

For later use, when we get to the citation part, we need not only to show the user the part of the text that may answer their question, but also to give them the approximate location of the answer. To do this, we attach metadata to each text chunk that says which chapter of the book the text belongs to. Chapters are between 10k and 60k characters long, so an 8k-character chunk should not contain more than one chapter heading.

In [None]:
prev_chapter_name = ''
for chunk in text_chunks:
    chunk.metadata['belongs_to'] = set()
    curr_chapter_name = ''
    index_start_chapter_name = chunk.page_content.find('Chapter')

    if index_start_chapter_name == -1:
        curr_chapter_name = prev_chapter_name
    else:
        # if prev_chapter_name is not empty and next chapter start further than first 40% of the chunk.
        # This means that the name of the prev chapter isn't in this chunk, but relevant info can be found.
        if prev_chapter_name != '' and index_start_chapter_name > int(len(chunk.page_content) * 0.4):
            chunk.metadata['belongs_to'].add(prev_chapter_name)

        index_end_chapter_name = chunk.page_content.find('\n', index_start_chapter_name)
        curr_chapter_name = chunk.page_content[index_start_chapter_name:index_end_chapter_name]
        prev_chapter_name = curr_chapter_name
    chunk.metadata['belongs_to'].add(curr_chapter_name)

    chunk.metadata['belongs_to'] = list(chunk.metadata['belongs_to'])

In [None]:
text_chunks[:5]

[Document(metadata={'belongs_to': ['Chapter 1. Marseilles—The Arrival']}, page_content='\ufeffThe Project Gutenberg eBook of The Count of Monte Cristo\n    \nThis ebook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever. You may copy it, give it away or re-use it under the terms\nof the Project Gutenberg License included with this ebook or online\nat www.gutenberg.org. If you are not located in the United States,\nyou will have to check the laws of the country where you are located\nbefore using this eBook.\n\nTitle: The Count of Monte Cristo\n\nAuthor: Alexandre Dumas\n        Auguste Maquet\n\nRelease date: January 1, 1998 [eBook #1184]\n                Most recently updated: February 4, 2024\n\nLanguage: English\n\nCredits: Anonymous Project Gutenberg Volunteers, Dan Muller and David Widger\n\n\n*** START OF THE PROJECT GUTENBERG EBOOK THE COUNT OF MONTE CRISTO ***\nTHE COUNT OF MONTE CRIST

Here we will define an additional function to perform text cleaning. Text will be cleaned by removing punctuation, converting to lowercase, removing special characters and extra whitespace.

In [43]:
import re
import string

def clean_text(text):
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = text.lower()
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    text = re.sub(r'\s+', ' ', text)

    return text.strip()
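
A quick check of what this cleaning does to a sample sentence. The function is repeated so the snippet runs on its own, and the sample sentence is made up for illustration. Note that the `[^a-zA-Z0-9\s]` filter also drops typographic quotes and accented letters, so "Dantès" becomes "dants", while a query spelled "Dantes" is cleaned to "dantes" and therefore yields a different token.

```python
import re
import string


def clean_text(text):
    # Same steps as above: strip ASCII punctuation, lowercase,
    # drop anything that is not a-z, 0-9 or whitespace, collapse whitespace.
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = text.lower()
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()


sample = "“Ah, M. Morrel,” exclaimed Dantès!"
print(clean_text(sample))    # ah m morrel exclaimed dants
print(clean_text("Dantes"))  # dantes
```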

Now that all the preparations of the dataset are done, we can form a new version of the document collection.

In [None]:
chunked_data_corpus = []

for index, chunk in enumerate(text_chunks):
    chunked_data_corpus.append({
        'raw_text': chunk.page_content,
        'cleaned_text': clean_text(chunk.page_content),
        'chapter_name': chunk.metadata['belongs_to']
    })

In [None]:
chunked_data_corpus_df = pd.DataFrame(chunked_data_corpus)
chunked_data_corpus_df.head()

Unnamed: 0,raw_text,cleaned_text,chapter_name
0,﻿The Project Gutenberg eBook of The Count of M...,the project gutenberg ebook of the count of mo...,[Chapter 1. Marseilles—The Arrival]
1,"On the 24th of February, 1815, the look-out at...",on the 24th of february 1815 the lookout at no...,[Chapter 1. Marseilles—The Arrival]
2,The order was executed as promptly as it would...,the order was executed as promptly as it would...,[Chapter 1. Marseilles—The Arrival]
3,0027m\n\n“How could that bring me into trouble...,0027m how could that bring me into trouble sir...,[Chapter 1. Marseilles—The Arrival]
4,"“Ah, M. Morrel,” exclaimed the young seaman, w...",ah m morrel exclaimed the young seaman with te...,"[Chapter 1. Marseilles—The Arrival, Chapter 2...."


In [None]:
chunked_data_corpus_df.to_csv('../data/chunked_data_corpus.csv', index=False)

In [71]:
chunked_data_corpus_df = pd.read_csv('/content/drive/MyDrive/chunked_data_corpus.csv')
# chunked_data_corpus_df = pd.read_csv('../data/chunked_data_corpus.csv')
chunked_data_corpus_df = chunked_data_corpus_df.to_dict('records')
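
One side effect of the CSV round trip is worth keeping in mind: list-valued cells such as `chapter_name` are written as their string representation and come back as plain strings, which is why `ast.literal_eval` is applied to them later when the citations are built. Below is a minimal sketch of the effect using the standard `csv` module (used here instead of pandas only to keep the snippet dependency-free; the row contents are made up).

```python
import ast
import csv
import io

# A hypothetical row with a list-valued column, like 'chapter_name' above.
row = {
    "raw_text": "some chunk text",
    "chapter_name": ["Chapter 41. The Presentation", "Chapter 64. The Beggar"],
}

# Write the row to an in-memory CSV; the list is stored as its str() form.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)

# Read it back: the cell is now a string, not a list.
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
print(type(loaded["chapter_name"]).__name__)  # str

# literal_eval safely parses the string back into a Python list.
restored = ast.literal_eval(loaded["chapter_name"])
print(restored == row["chapter_name"])  # True
```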

In [None]:
!pip install litellm python-dotenv

Collecting litellm
  Downloading litellm-1.53.1-py3-none-any.whl.metadata (33 kB)
Collecting python-dotenv
  Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Collecting tiktoken>=0.7.0 (from litellm)
  Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Downloading litellm-1.53.1-py3-none-any.whl (6.4 MB)
Downloading python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Installing collected packages: python-dotenv, tiktoken, litellm
Successfully installed litellm-1.53.1 python-dotenv-1.0.1 tiktoken-0.8.0


In [None]:
from litellm import completion
from dotenv import load_dotenv
import os

load_dotenv()

# Read the key from the environment (for example from a .env file loaded above)
# instead of hard-coding a secret in the notebook.
GROQ_API_KEY = os.getenv("GROQ_API_KEY")

# Retriever

The classic way to implement retriever functionality is a simple TF-IDF score. Here, we will use BM25, a refinement of it.\
Both are information retrieval methods that rank documents by their relevance to a query.\
The key things BM25 adds on top of TF-IDF are:
* Term Saturation: The impact of term frequency diminishes as term frequency increases.
* Document Length Normalization: Adjusts scores based on document length using a length normalization parameter and the average document length.

Based on this, BM25 is more suitable for tasks like RAG systems.
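
Both properties can be seen in the term-frequency part of the BM25 formula. The sketch below is illustrative rather than the exact code inside `rank_bm25`: `bm25_term_weight` is a hypothetical helper, and `k1=1.5` and `b=0.75` are commonly used values assumed here, not parameters tuned in this notebook.

```python
def bm25_term_weight(tf, doc_len, avg_doc_len, k1=1.5, b=0.75):
    """Term-frequency part of BM25 for one query term in one document.

    The full BM25 score multiplies this by the term's IDF and sums over query terms.
    """
    # Documents longer than average get a larger denominator, hence a lower weight.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return tf * (k1 + 1) / (tf + k1 * length_norm)


# Term saturation: the weight grows with tf but never reaches k1 + 1 = 2.5.
for tf in (1, 2, 5, 50):
    print(tf, round(bm25_term_weight(tf, doc_len=100, avg_doc_len=100), 3))

# Length normalization: the same tf counts for less in a longer document.
short_doc = bm25_term_weight(3, doc_len=50, avg_doc_len=100)
long_doc = bm25_term_weight(3, doc_len=200, avg_doc_len=100)
print(short_doc > long_doc)  # True
```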

Aside from a TF-IDF-style score for comparing the query with the documents, which is rather general, we can also use a dense retriever.\
A dense retriever (implemented here as a bi-encoder) captures the semantic content of the text: it computes a vector embedding for each document and for the query and compares them with a measure such as cosine similarity, so it can match texts with the same meaning even when they share few words (or, with a multilingual model, are written in different languages).\
Dense retrievers are often built on models like [Sentence-BERT](https://arxiv.org/abs/1908.10084) or [LaBSE](https://arxiv.org/abs/2007.01852).
For our system, we will try something simpler but still effective: the "all-MiniLM-L6-v2" model.\
With the retriever, we first use BM25 to pick the 20% (60 docs) most relevant documents from the whole corpus. Then, using the dense retriever, we keep the 50% (30 docs) of those BM25 results that are most similar to the query.
Besides that, for time efficiency, we precompute the chunk embeddings once, up front, and simply look them up in the dense retrieval step.
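
The dense step boils down to ranking precomputed document vectors by cosine similarity to the query vector. Here is a dependency-free sketch with toy 2-dimensional vectors; `cosine_similarity` and `top_k_by_cosine` are hypothetical helpers, and real embeddings would come from the sentence-transformer model rather than being written by hand.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_by_cosine(query_vec, doc_vecs, k):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]


query = [1.0, 0.0]
docs = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0], [2.0, 0.0]]

# Document 3 points in exactly the same direction as the query (its length does not matter),
# document 0 is close, document 1 is orthogonal and document 2 points the opposite way.
print(top_k_by_cosine(query, docs, k=2))  # [3, 0]
```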

# Reranker

Because the two retrievers (BM25 and dense) capture only keywords and the general meaning of the text, respectively, the retrieved documents can still be fairly far from what the query actually asks. To overcome this, we can use a reranker, implemented as a cross-encoder, which processes each query-document pair together and scores how well the document answers the specific question in the query. This approach is more computationally expensive, so it can't be run on the whole document corpus. Instead, we pass it only the 30 docs (10% of the whole corpus) returned by the retriever. As a result, we pick the 2 most relevant documents to pass to the LLM context.\
We cap the selection at 2 documents because each document is approximately 1700-2100 tokens long. The LLM's context window is 8192 tokens, so the 2 most relevant docs take up about 3400-4200 tokens, leaving roughly 4000 tokens for the system prompt, the query and the answer.
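
The token budget can be checked with the figures quoted in this notebook (an 8192-token window and 1700-2100 tokens per chunk). This is only a rough sketch: `remaining_tokens` is a hypothetical helper, it ignores the tokens used by the system prompt, and it assumes the quoted chunk sizes hold for every chunk.

```python
CONTEXT_WINDOW = 8192  # tokens, as quoted for llama3-8b-8192


def remaining_tokens(n_docs, tokens_per_doc, context_window=CONTEXT_WINDOW):
    """Tokens left for the prompt, query and answer after adding n_docs chunks."""
    return context_window - n_docs * tokens_per_doc


for n_docs in (2, 3):
    worst_case = remaining_tokens(n_docs, 2100)  # largest chunks
    best_case = remaining_tokens(n_docs, 1700)   # smallest chunks
    print(f"{n_docs} docs: {worst_case}-{best_case} tokens left")
# 2 docs: 3992-4792 tokens left
# 3 docs: 1892-3092 tokens left
```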

In [None]:
!pip install rank_bm25

Collecting rank_bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl.metadata (3.2 kB)
Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Installing collected packages: rank_bm25
Successfully installed rank_bm25-0.2.2


In [None]:
from rank_bm25 import BM25Okapi
import numpy as np
import torch
from sentence_transformers import SentenceTransformer, util, CrossEncoder



In [73]:
class HybridRetrieverReranker:
    def __init__(self, dataset, dense_model_name=DENSE_RETRIEVER_MODEL_NAME, cross_encoder_model=CROSS_ENCODER_MODEL_NAME):
        self.dataset = dataset
        self.bm25_corpus = [entry['cleaned_text'] for entry in dataset]
        self.tokenized_corpus = [chunk.split() for chunk in self.bm25_corpus]
        self.bm25 = BM25Okapi(self.tokenized_corpus)

        self.dense_model = SentenceTransformer(dense_model_name)
        self.cross_encoder = CrossEncoder(cross_encoder_model)

        self.dataset = self.__calculate_save_embeddings__(self.dataset)


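    def __calculate_save_embeddings__(self, dataset):
        # Embed every chunk once so that dense retrieval only has to look the vectors up.
        # Note: the encoder truncates inputs beyond its maximum sequence length,
        # so only the beginning of each long chunk contributes to its embedding.
        docs_embeddings = self.dense_model.encode([entry['raw_text'] for entry in dataset], convert_to_tensor=True)
        for doc, embedding in zip(dataset, docs_embeddings):
            doc['embedding'] = embedding
        return dataset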


    def bm25_retrieve(self, query, top_k=70):
        """
        Retrieve top K documents using BM25.

        Args:
            query (str): Query text.
            top_k (int): Number of top BM25 documents to retrieve.

        Returns:
            list of dict: Top K BM25 results.
        """
        cleaned_query = clean_text(query)
        query_tokens = cleaned_query.split()
        bm25_scores = self.bm25.get_scores(query_tokens)
        top_k_indices = np.argsort(bm25_scores)[::-1][:top_k]
        return [self.dataset[idx] for idx in top_k_indices]


    def dense_retrieve(self, query, candidates, top_n=35):
        """
        Retrieve top N documents using dense retrieval with a bi-encoder (all-MiniLM-L6-v2 by default).

        Args:
            query (str): Query text.
            candidates (list of dict): Candidate documents from BM25.
            top_n (int): Number of top dense results to retrieve.

        Returns:
            list of dict: Top N dense results.
        """
        query_embedding = self.dense_model.encode(query, convert_to_tensor=True)
        candidate_embeddings = torch.stack([doc['embedding'] for doc in candidates])

        similarities = util.pytorch_cos_sim(query_embedding, candidate_embeddings).squeeze(0)

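        # Guard against asking for more results than there are candidates,
        # and convert the index tensor to plain ints for list indexing.
        top_n_indices = torch.topk(similarities, min(top_n, len(candidates))).indices.tolist()
        return [candidates[idx] for idx in top_n_indices]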


    def rerank(self, query, candidates, top_n=3):
        """
        Rerank top documents using a CrossEncoder.

        Args:
            query (str): Query text.
            candidates (list of dict): Candidate documents from dense retriever.
            top_n (int): Number of top reranked results to return.

        Returns:
            list of dict: Top N reranked documents.
        """
        query_document_pairs = [(query, doc['raw_text']) for doc in candidates]
        scores = self.cross_encoder.predict(query_document_pairs)
        top_n_indices = np.argsort(scores)[::-1][:top_n]
        return [candidates[idx] for idx in top_n_indices]


    def hybrid_retrieve(self, query, enable_dense=True, enable_rerank=True, top_k_bm25=60, top_n_dense=30, top_n_rerank=2):
        """
        Perform hybrid retrieval: BM25 followed by dense retrieval and optional reranking.

        Args:
            query (str): Query text.
            top_k_bm25 (int): Number of top BM25 documents to retrieve.
            top_n_dense (int): Number of top dense results to retrieve.
            enable_dense (bool): Whether dense retrieval should be enabled.
            enable_rerank (bool): Whether reranking should be enabled.
            top_n_rerank (int): Number of top reranked documents to return.

        Returns:
            list of dict: Final top results after hybrid retrieval and reranking.
        """
        bm25_results = self.bm25_retrieve(query, top_k=top_k_bm25)

        if enable_dense:
            dense_results = self.dense_retrieve(query, bm25_results, top_n=top_n_dense)
        else:
            dense_results = bm25_results

        if enable_rerank:
            final_results = self.rerank(query, dense_results, top_n=top_n_rerank)
        else:
            final_results = dense_results

        return final_results

In [None]:
retriever = HybridRetrieverReranker(chunked_data_corpus_df)
query = "Who is Monte Cristo?"
results = retriever.hybrid_retrieve(query)

for doc in results:
  print(doc['chapter_name'], '\t', doc['raw_text'][:doc['raw_text'].index('.')], '\n')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


['Chapter 41. The Presentation'] 	 The countess paused a moment; then, after a slight hesitation, she
resumed 

['Chapter 48. Ideology'] 	 Villefort, astonished at this reply, which he by no means expected,
started like a soldier who feels the blow levelled at him over the
armor he wears, and a curl of his disdainful lip indicated that from
that moment he noted in the tablets of his brain that the Count of
Monte Cristo was by no means a highly bred gentleman 



# LiteLLM library (llama3-8b-8192 model).

In [74]:
class QuestionAnsweringBot:


    def __init__(self, docs) -> None:
        self.retriever = HybridRetrieverReranker(docs)

    def __get_answer__(self, question: str) -> tuple:
        PROMPT = """\
            You are an intelligent assistant designed to provide accurate and relevant answers based on the provided context.

            Rules:
            - Always analyze the provided context thoroughly before answering.
            - Respond with factual and concise information.
            - If context is ambiguous or insufficient, say 'I don't know.'
            - Do not speculate or fabricate information beyond the provided context.
            - Follow user instructions on the response style(default style is detailed response if user didn't provide any specifications):
              - If the user asks for a detailed response, provide comprehensive explanations.
              - If the user requests brevity, give concise and to-the-point answers.
            - When applicable, summarize and synthesize information from the context to answer effectively.
            - Avoid using information outside the given context.
          """
        context = self.retriever.hybrid_retrieve(question)

        context_text = [doc['raw_text'] for doc in context]

        response = completion(
                                model=LLM_CORE_MODEL_NAME,
                                messages=[
                                    {"role": "system", "content": PROMPT},
                                    {"role": "user", "content": f"Context: {context_text}\nQuestion: {question}"}
                            ],
                            api_key=GROQ_API_KEY
                            )
        return response, context

    def form_response(self, question):
      llm_response, context = self.__get_answer__(question)

      metadata_raw = [doc['chapter_name'] for doc in context]
      metadata_cleaned = [ast.literal_eval(item) for item in metadata_raw]

      print('User:', question)
      print('System:', llm_response.choices[0].message.content)
      if "don't know" not in llm_response.choices[0].message.content:
        print('Resources:', [chapter for doc in metadata_cleaned for chapter in doc])

In [75]:
bot = QuestionAnsweringBot(chunked_data_corpus_df)

# Testing with different prompts

In [76]:
question = "What is the title of Chapter 64?"
bot.form_response(question)

User: What is the title of Chapter 64?
System: According to the provided context, the title of Chapter 64 is "The Beggar".
Resources: ['Chapter 64. The Beggar', 'Chapter 1. Marseilles—The Arrival']


In [77]:
question = "Who is the current president of US?"
bot.form_response(question)

User: Who is the current president of US?
System: I'm happy to help! However, please note that I'm analyzing the context provided, which is an excerpt from an 1844 novel called "The Count of Monte Cristo" by Alexandre Dumas. The context doesn't provide any information about current events, including the presidency of the United States.

In fact, the novel was written over 170 years ago, which means the presidency would have been held by someone else during that time.
Resources: ['Chapter 86. The Trial', 'Chapter 87. The Challenge', 'Chapter 75. A Signed Statement', 'Chapter 76. Progress of Cavalcanti the Younger']


In [78]:
question = "Who is the author of the book The Count of Monte Cristo?"
bot.form_response(question)

User: Who is the author of the book The Count of Monte Cristo?
System: The author of the book "The Count of Monte Cristo" is Alexandre Dumas, assisted by Auguste Maquet.
Resources: ['Chapter 1. Marseilles—The Arrival', 'Chapter 41. The Presentation']


In [79]:
question = "Who is Monte Cristo?"
bot.form_response(question)

User: Who is Monte Cristo?
System: Based on the provided context, Monte Cristo is a count who has recently settled in Paris with the intention of spending six million francs during the next twelve months. He is described as a mysterious and wealthy individual, known for his impressive appearance and mannerisms.
Resources: ['Chapter 46. Unlimited Credit', 'Chapter 47. The Dappled Grays']


In [80]:
question = "Tell me about all the main identites in Monte Cristo?"
bot.form_response(question)

User: Tell me about all the main identites in Monte Cristo?
System: A great choice! Alexandre Dumas' masterpiece, "The Count of Monte Cristo", is a thrilling adventure novel filled with complex characters and intricate plots. Here's a list of the main identities and characters:

**Protagonist:**

1. **Edmond Dantès** ((alias Count of Monte Cristo): A young and charming sailor who is wrongfully accused and imprisoned. He uses his newfound wealth and resources to exact revenge on those who wronged him.

**Antagonists:**

1. **Fernand Mondego**: Edmond's former friend and colleague, who betrayed him to save his own marriage.
2. **Danglars**: A wealthy shipowner who grew jealous of Edmond's success and coveted his ship.
3. **Villefort**: A corrupt prosecutor who framed Edmond and served as a loyal servant to the king.
4. **Morrel**: A wealthy shipowner and Edmond's friend, whose son Maximilian is deeply in love with Mercédès.

**Supporting Characters:**

1. **Mercédès**: Edmond's girlfrien

In [81]:
question = "Name me all the persons under whose name Monte Cristo appeared"
bot.form_response(question)

User: Name me all the persons under whose name Monte Cristo appeared
System: In the provided context, Monte Cristo appears under the following names:

1. Abbé Busoni
2. The Count
3. The Count of Monte Cristo

Note that Monte Cristo is a nom de guerre (a fictional name) used by the main character, Edmond Dantès, who seeks revenge against those who wronged him.
Resources: ['Chapter 84. Beauchamp', 'Chapter 46. Unlimited Credit', 'Chapter 45. The Rain of Blood']


In [82]:
question = "How many years does Edmon Dantes spent in prison?"
bot.form_response(question)

User: How many years does Edmon Dantes spent in prison?
System: Edmond Dantès, the main character in the novel, was imprisoned for 14 years.
Resources: ['Chapter 15. Number 34 and Number 27', 'Chapter 22. The Smugglers']


In [84]:
question = "What is the name of the ship Edmon Dantes used to work on while working in Morrels company?"
bot.form_response(question)

User: What is the name of the ship Edmon Dantes used to work on while working in Morrels company?
System: The name of the ship Edmon Dantes used to work on while working in Morrel's company is the Pharaon.
Resources: ['Chapter 5. The Marriage Feast', 'Chapter 1. Marseilles—The Arrival']


In [85]:
question = "What is the title of Chapter 93?"
bot.form_response(question)

User: What is the title of Chapter 93?
System: According to the provided context, the title of Chapter 93 is "Valentine".
Resources: ['Chapter 1. Marseilles—The Arrival', 'Chapter 33. Roman Bandits']
