In this notebook we will divide the general task of building the RAG system into smaller parts and test solutions for every one of them.

As a data source we picked the full text of the English version of the book "The Count of Monte Cristo".

# Chunking

Because the core of the RAG system is a typical LLM with a limited context window, we cannot pass the whole text at once; instead, we need to divide it into chunks.
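To make the idea concrete, here is a minimal sketch of fixed-size chunking with overlap over raw characters. The function name `sliding_chunks` and the toy sizes are illustrative assumptions; the notebook itself relies on a LangChain splitter rather than on this code.

```python
def sliding_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, overlap=3)
print(chunks)
```

Each chunk begins with the last three characters of the previous one, which is the property the overlap is meant to provide.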

In [2]:
import os
import pandas as pd

from transformers import AutoTokenizer
from langchain_text_splitters import RecursiveCharacterTextSplitter, CharacterTextSplitter
import seaborn as sns

In [3]:
# Build the path from separate components so it works on any operating system.
DATA_FILE_PATH = os.path.join(os.path.dirname(os.getcwd()), "data", "CountMonteCristoFull.txt")

In [4]:
with open(DATA_FILE_PATH, "r", encoding="utf-8") as f:
    data_corpus = f.read()

The whole text consists of 2646643 characters

In [5]:
len(data_corpus)

2646643

According to the information provided on the Hugging Face [website](https://huggingface.co/meta-llama/Meta-Llama-3-8B), the llama3-8b-8192 model (chosen as the LLM for this system) has a context window of 8192 tokens. This means that we should divide the whole dataset into parts of roughly 6000-6500 tokens at most.\
In addition, to preserve the model's understanding of the text, consecutive parts will overlap by 400-600 tokens so that the connections between them are retained.\
Our experiments showed the following relationship between chunk length in characters and chunk length in tokens:
* 5500-character chunks + 600-character overlap ~ 1200-1450 tokens per chunk
* 8000-character chunks + 800-character overlap ~ 1700-2100 tokens per chunk
* 10000-character chunks + 1000-character overlap ~ 2300-2600 tokens per chunk
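The measurements above can be turned into a rough characters-per-token ratio and a worst-case count of chunks that fit into the context window. The sketch below uses only the numbers quoted in this section; the variable names are illustrative, and the calculation ignores the tokens taken by the system prompt and the question.

```python
# Measurements quoted above: chunk size in characters -> (min, max) tokens per chunk.
measurements = {
    5500: (1200, 1450),
    8000: (1700, 2100),
    10000: (2300, 2600),
}
CONTEXT_WINDOW = 8192  # tokens, as stated for llama3-8b-8192

for chars, (low, high) in measurements.items():
    # More tokens for the same text means fewer characters per token.
    ratio_low = chars / high
    ratio_high = chars / low
    print(f"{chars} chars: {ratio_low:.1f}-{ratio_high:.1f} characters per token")

# Worst case: every chunk placed in the prompt has the maximum observed token count.
chunks_that_fit = {
    chars: CONTEXT_WINDOW // high for chars, (_, high) in measurements.items()
}
print(chunks_that_fit)
```

Under these assumptions the ratio stays at roughly 4 characters per token, and about three of the largest chunks fit into the window before any prompt text is added.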

After further analysis of chapter sizes, which range from 10k to 60k characters, we decided to split the text into chunks of 10k characters (2.3k-2.6k tokens).

In [6]:
splitter = CharacterTextSplitter(separator="\n\n", chunk_size=10000, chunk_overlap=1000)
text_chunks = splitter.create_documents([data_corpus])

In [7]:
print('Example of text chunk:', text_chunks[0].page_content, 
      '\nNumber of chunks:', len(text_chunks)) 

Example of text chunk: ﻿The Project Gutenberg eBook of The Count of Monte Cristo
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.

Title: The Count of Monte Cristo

Author: Alexandre Dumas
        Auguste Maquet

Release date: January 1, 1998 [eBook #1184]
                Most recently updated: February 4, 2024

Language: English

Credits: Anonymous Project Gutenberg Volunteers, Dan Muller and David Widger


*** START OF THE PROJECT GUTENBERG EBOOK THE COUNT OF MONTE CRISTO ***
THE COUNT OF MONTE CRISTO

by Alexandre Dumas [père]

0009m 

0011m 

0019m 

Contents


 VOLUME ONE
Chapter 1. Marsei

In [8]:
tokenizer = AutoTokenizer.from_pretrained("mlabonne/Meta-Llama-3-8B")
tokenized_chunks = [tokenizer(chunk.page_content)['input_ids'] for chunk in text_chunks]



In [9]:
# Use the public attributes rather than the private underscore-prefixed ones.
tokenizer.bos_token, tokenizer.eos_token

('<|begin_of_text|>', '<|end_of_text|>')

In [10]:
print('Total number of tokenized chunks:', len(tokenized_chunks))

Total number of tokenized chunks: 294
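As a rough sanity check on the chunk count, one can estimate how many windows an idealized sliding-window split would need. This is only a sketch: `estimate_chunk_count` is a hypothetical helper, and a splitter that cuts at paragraph boundaries is not guaranteed to match the estimate.

```python
import math


def estimate_chunk_count(total_chars: int, chunk_size: int, overlap: int) -> int:
    """Number of windows needed when every window advances by chunk_size - overlap."""
    if total_chars <= chunk_size:
        return 1
    step = chunk_size - overlap
    # One full first window, then as many steps as needed to cover the rest.
    return 1 + math.ceil((total_chars - chunk_size) / step)


# Corpus length and splitter settings taken from the cells above.
estimate = estimate_chunk_count(2646643, chunk_size=10000, overlap=1000)
print(estimate)
```

For the figures used in this notebook the estimate is 294, which is in line with the count printed above.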


For later purposes, when we get to the citation part, we need not only to show the user the part of the text that may answer their question, but also to give them the approximate location of the answer. To do this, we attach metadata to each text chunk that tells which chapter of the book the chunk belongs to. Only a few chapters are shorter than the chosen chunk_size (10_000), so the odds of several chapter names appearing in one text chunk are low.

In [11]:
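prev_chapter_name = ''
for chunk in text_chunks:
    text = chunk.page_content
    # Position of the first chapter heading in this chunk, or -1 if there is none.
    index_start_chapter_name = text.find('Chapter')
    if index_start_chapter_name == -1:
        # No heading here: the chunk continues the previous chapter.
        curr_chapter_name = prev_chapter_name
    else:
        index_end_chapter_name = text.find('\n', index_start_chapter_name)
        if index_end_chapter_name == -1:
            # The heading is the last line of the chunk; take everything up to the end
            # instead of slicing with -1, which would drop the final character.
            index_end_chapter_name = len(text)
        curr_chapter_name = text[index_start_chapter_name:index_end_chapter_name]
        prev_chapter_name = curr_chapter_name
    chunk.metadata['belongs_to'] = curr_chapter_name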
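The loop above labels each chunk with the first heading it finds. As a possible alternative (not what the notebook does), the sketch below records every chapter a chunk touches, using a regular expression. It assumes that headings look like "Chapter <number>. <title>" on a line of their own; `chapters_per_chunk` and the toy chunks are hypothetical, and a table of contents would have to be removed first because its lines match the same pattern.

```python
import re

# Assumed heading format: "Chapter <number>. <title>" alone on a line.
CHAPTER_RE = re.compile(r"^Chapter \d+\. .+$", flags=re.MULTILINE)


def chapters_per_chunk(chunks: list[str]) -> list[list[str]]:
    """For each chunk, list the chapter carried over from earlier text plus every heading found."""
    result = []
    current = ""  # last heading seen so far
    for text in chunks:
        headings = [h.strip() for h in CHAPTER_RE.findall(text)]
        touched = ([current] if current else []) + headings
        # Drop repeats (overlap can duplicate a heading) while keeping the order.
        result.append(list(dict.fromkeys(touched)))
        if headings:
            current = headings[-1]
    return result


toy_chunks = [
    "Preface text\nChapter 1. Marseilles\nThe ship drew near.",
    "The crowd waited on the quay.",
    "End of chapter one.\nChapter 2. Father and Son\nShort text.\nChapter 3. The Catalans",
]
labels = chapters_per_chunk(toy_chunks)
print(labels)
```

Because `$` in multiline mode also matches at the end of the string, a heading that ends a chunk without a trailing newline is still captured in full.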

In [12]:
text_chunks[:5]

[Document(metadata={'belongs_to': 'Chapter 1. Marseilles—The Arrival'}, page_content='\ufeffThe Project Gutenberg eBook of The Count of Monte Cristo\n    \nThis ebook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever. You may copy it, give it away or re-use it under the terms\nof the Project Gutenberg License included with this ebook or online\nat www.gutenberg.org. If you are not located in the United States,\nyou will have to check the laws of the country where you are located\nbefore using this eBook.\n\nTitle: The Count of Monte Cristo\n\nAuthor: Alexandre Dumas\n        Auguste Maquet\n\nRelease date: January 1, 1998 [eBook #1184]\n                Most recently updated: February 4, 2024\n\nLanguage: English\n\nCredits: Anonymous Project Gutenberg Volunteers, Dan Muller and David Widger\n\n\n*** START OF THE PROJECT GUTENBERG EBOOK THE COUNT OF MONTE CRISTO ***\nTHE COUNT OF MONTE CRISTO\

Here we will define an additional function to perform text cleaning. Text will be cleaned by removing punctuation, converting to lowercase, removing special characters and extra whitespace.

In [1]:
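import re
import string

def clean_text(text):
    """Normalize a chunk: drop punctuation, lowercase, keep only ASCII letters, digits and spaces."""
    # Remove ASCII punctuation characters.
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = text.lower()
    # Remove everything that is not an ASCII letter, a digit or whitespace
    # (this also drops typographic quotes, dashes and accented letters).
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Collapse runs of whitespace, including newlines, into single spaces.
    text = re.sub(r'\s+', ' ', text)

    return text.strip()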
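One side effect of the cleaning above is that accented letters are deleted outright (for example, the "â" in "Château" disappears) and that an em dash between two words is removed without leaving a space, so the words fuse. If that matters for retrieval, a variant along the following lines could be tried. This is a sketch under my own assumptions, not part of the original pipeline; `clean_text_folded` is a hypothetical name.

```python
import re
import unicodedata


def clean_text_folded(text: str) -> str:
    """Variant of clean_text that folds accents to ASCII and turns dashes into spaces."""
    # Replace figure dash, en dash, em dash and horizontal bar with a space.
    text = re.sub(r"[\u2012-\u2015]", " ", text)
    # Decompose accented letters into a base letter plus combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = text.lower()
    # Keep ASCII letters, digits and whitespace; combining marks are dropped here.
    text = re.sub(r"[^a-z0-9\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip()


# "\u00e2" is a with circumflex, "\u2019" a typographic apostrophe, "\u2014" an em dash.
print(clean_text_folded("The Ch\u00e2teau d\u2019If"))
print(clean_text_folded("Marseilles\u2014The Arrival"))
```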

Now that all the preparations of the dataset are done, we can form a new version of the document collection.

In [13]:
chunked_data_corpus = []

for index, chunk in enumerate(text_chunks):
    chunked_data_corpus.append({
        'raw_text': chunk.page_content,
        'tokenized_text': tokenized_chunks[index],
        'cleaned_text': clean_text(chunk.page_content),
        'chapter_name': chunk.metadata['belongs_to']
    })

In [14]:
chunked_data_corpus_df = pd.DataFrame(chunked_data_corpus)
chunked_data_corpus_df.head()

| | raw_text | tokenized_text | cleaned_text | chapter_name |
|---|---|---|---|---|
| 0 | The Project Gutenberg eBook of The Count of M... | [128000, 3305, 791, 5907, 52686, 58610, 315, 5... | the project gutenberg ebook of the count of mo... | Chapter 1. Marseilles—The Arrival |
| 1 | “Why, you see, Edmond,” replied the owner, who... | [128000, 2118, 10445, 11, 499, 1518, 11, 3279,... | why you see edmond replied the owner who appea... | Chapter 1. Marseilles—The Arrival |
| 2 | “Sometimes one and the same thing,” said Morre... | [128000, 2118, 32148, 832, 323, 279, 1890, 324... | sometimes one and the same thing said morrel w... | Chapter 2. Father and Son |
| 3 | “Whom does this belong to?” he inquired.\n\n“T... | [128000, 2118, 1671, 316, 1587, 420, 9352, 311... | whom does this belong to he inquired to me to ... | Chapter 3. The Catalan |
| 4 | “Really; and you think this cousin pays her at... | [128000, 2118, 49885, 26, 323, 499, 1781, 420,... | really and you think this cousin pays her atte... | Chapter 3. The Catalans |


In [15]:
chunked_data_corpus_df.to_csv('../data/chunked_data_corpus.csv', index=False)
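One thing to keep in mind when the CSV file is read back: a column that holds Python lists, such as `tokenized_text`, is written as text, so it comes back as a string rather than a list. The sketch below illustrates this with the standard-library `csv` module as a stand-in for the pandas calls; the sample row is made up for the illustration. With pandas, the same parser could be passed through the `converters` argument of `pd.read_csv`.

```python
import ast
import csv
import io

# A made-up row with the same kind of columns as the DataFrame above.
rows = [{"chapter_name": "Chapter 2. Father and Son", "tokenized_text": [128000, 2118, 32148]}]

# Write the row to an in-memory CSV file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["chapter_name", "tokenized_text"])
writer.writeheader()
writer.writerows(rows)

# Read it back: every value is now a plain string.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw_value = loaded[0]["tokenized_text"]

# Parse the string representation back into a list of integers.
token_ids = ast.literal_eval(raw_value)
print(type(raw_value).__name__, token_ids)
```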

# LiteLLM library (llama3-8b-8192 model)

In [19]:
from litellm import completion
from dotenv import load_dotenv
import os

load_dotenv()

GROQ_API_KEY = os.getenv("GROQ_API_KEY")

In [100]:
class QuestionAnsweringBot:

    #def __init__(self, docs) -> None:
    #    self.retriever = Retriever(docs)

    def answer_question(self, question: str):  # returns the full LiteLLM response object, not a str
        PROMPT = """\
You are a helpful assistant that can answer questions. 

Rules:
- Reply with the answer only and nothing but the answer.
- Say 'I don't know' if you don't know the answer.
- Use the provided context.
        """

        context = """Chapter 1. Marseilles—The Arrival
Chapter 2. Father and Son
Chapter 3. The Catalans
Chapter 4. Conspiracy
Chapter 5. The Marriage Feast
Chapter 6. The Deputy Procureur du Roi
Chapter 7. The Examination
Chapter 8. The Château d’If
Chapter 9. The Evening of the Betrothal
Chapter 10. The King’s Closet at the Tuileries
Chapter 11. The Corsican Ogre
Chapter 12. Father and Son
Chapter 13. The Hundred Days
Chapter 14. The Two Prisoners
Chapter 15. Number 34 and Number 27
Chapter 16. A Learned Italian
Chapter 17. The Abbé’s Chamber
Chapter 18. The Treasure
Chapter 19. The Third Attack
Chapter 20. The Cemetery of the Château d’If
Chapter 21. The Island of Tiboulen
Chapter 22. The Smugglers
Chapter 23. The Island of Monte Cristo
Chapter 24. The Secret Cave
Chapter 25. The Unknown
Chapter 26. The Pont du Gard Inn
Chapter 27. The Story""" #self.retriever.get_docs(question)

        response = completion(
            model="groq/llama3-8b-8192",
            messages=[
                {"role": "system", "content": PROMPT},
                {"role": "user", "content": f"Context: {context}\nQuestion: {question}"},
            ],
        )
        return response
    
bot = QuestionAnsweringBot()
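The commented-out lines in the class refer to a `Retriever` with a `get_docs` method, whose implementation is not shown in this part of the notebook. Purely as an illustration of the interface, here is a hypothetical stand-in that ranks documents by how many words they share with the question. The class name `KeywordRetriever`, the `top_k` parameter and the sample documents are my assumptions; a real retriever would more likely use BM25 or embeddings.

```python
import re


class KeywordRetriever:
    """Hypothetical stand-in for the Retriever referenced in the commented-out code."""

    def __init__(self, docs: list[str]) -> None:
        self.docs = docs
        # Pre-compute the set of lowercase words for every document.
        self.doc_words = [self._words(doc) for doc in docs]

    @staticmethod
    def _words(text: str) -> set[str]:
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def get_docs(self, question: str, top_k: int = 2) -> str:
        """Return the top_k documents sharing the most words with the question."""
        query = self._words(question)
        ranked = sorted(
            range(len(self.docs)),
            key=lambda i: len(query & self.doc_words[i]),
            reverse=True,
        )
        return "\n\n".join(self.docs[i] for i in ranked[:top_k])


docs = [
    "Edmond Dantes arrives in Marseilles aboard the Pharaon.",
    "The abbe tells Dantes about a treasure hidden on the island of Monte Cristo.",
    "Villefort examines the letter addressed to Noirtier.",
]
retriever = KeywordRetriever(docs)
context = retriever.get_docs("Where is the treasure hidden?", top_k=1)
print(context)
```

The returned string could then replace the hard-coded `context` inside `answer_question`.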

In [None]:
question = "What is the 25th chapter?"

answer = bot.answer_question(question)
print(question)
print(answer.choices[0].message.content)

What is the 25th chapter?
The Unknown


In [103]:
question = "Who is the current president of US?"

answer = bot.answer_question(question)
print(question)
print(answer.choices[0].message.content)

Who is the current president of US?
I don't know


In [63]:
print(answer.choices)

[Choices(finish_reason='length', index=0, message=Message(content=' (Part 2)\nThe US is a country of 50 states, each with its own constitution and laws. The president of the US is the head of state and head of government of the US. The president is elected by the people of the US to a four-year term. The president is the commander-in-chief of the US armed forces. The president is the chief executive of the US government. The president is the head of the executive branch of the US government. The president is the head', role='assistant', tool_calls=None, function_call=None))]
