## RAG
* **Retrieval**: find the information in a document that is relevant to a query
* **Augmentation**: add the retrieved information to the prompt
* **Generation**: generate an answer from the retrieved information

In [1]:
!nvidia-smi

Thu Sep 26 17:32:57 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce RTX 3080 Ti     Off | 00000000:01:00.0 Off |                  N/A |
|  0%   51C    P8              18W / 350W |     10MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

## Steps to follow
1. Open a PDF
2. Format the text
3. Embed the chunks of text (turn each chunk into an embedding)
4. Build a retrieval system
5. Generate a prompt that incorporates the retrieved pieces of text
6. Generate an answer
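
The steps above can be sketched end to end with toy stand-ins. This is only an illustration, not the pipeline built in this notebook: a bag-of-words count replaces a real embedding model, and the final prompt is printed instead of being sent to a language model. The names `embed`, `retrieve`, and `build_prompt` are hypothetical, and the three chunks are shortened paraphrases of passages from the textbook.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, chunks: list[str], top_k: int = 1) -> list[str]:
    """Step 4: rank chunks by similarity to the query and keep the best ones."""
    query_vector = embed(query)
    ranked = sorted(chunks,
                    key=lambda chunk: cosine_similarity(query_vector, embed(chunk)),
                    reverse=True)
    return ranked[:top_k]

def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Step 5: place the retrieved text in front of the query."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return f"Answer the query using the context.\nContext:\n{context}\nQuery: {query}"

chunks = [
    "Water is involved in many enzymatic reactions.",
    "BMI does not take into account fat mass or fat distribution.",
    "Probiotic supplements positively affect intestinal microbial flora.",
]
query = "What are the limitations of BMI?"
top_chunks = retrieve(query, chunks)
prompt = build_prompt(query, top_chunks)
print(prompt)  # step 6 would pass this prompt to a language model
```

Word-count vectors only match exact words, so this sketch cannot find a relevant chunk that uses different vocabulary from the query.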

GitHub link format for raw files. A regular file link looks like this:
https://github.com/mrdbourke/simple-local-rag/blob/main/human-nutrition-text.pdf
To download the raw file, replace `blob` with `raw/refs/heads`:
https://github.com/mrdbourke/simple-local-rag/raw/refs/heads/main/human-nutrition-text.pdf
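
The link rewrite described above can be done in code. This is a minimal sketch: `github_blob_to_raw` is a hypothetical helper, and it assumes the part after `blob/` is a branch name (such as `main`), since `refs/heads` refers to branches.

```python
def github_blob_to_raw(url: str) -> str:
    """Rewrite a GitHub 'blob' page URL into the raw-file form."""
    # Replace only the first "/blob/" so a file path containing "blob" is left alone
    return url.replace("/blob/", "/raw/refs/heads/", 1)

blob_url = "https://github.com/mrdbourke/simple-local-rag/blob/main/human-nutrition-text.pdf"
raw_url = github_blob_to_raw(blob_url)
print(raw_url)
```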

In [6]:
import os
import requests
from tqdm.auto import tqdm

pdf_path = "human-nutrition-text.pdf"

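if not os.path.exists(pdf_path):
    print("File not exist ")

    url = "https://pressbooks.oer.hawaii.edu/humannutrition2/open/download?type=pdf"
    filename = pdf_path
    # Stream the response so the file is written in chunks instead of being held in memory
    response = requests.get(url, stream=True)
    if response.status_code == 200:
        # Content-Length may be missing, in which case the bar has no total
        total_bytes = int(response.headers.get("Content-Length", 0)) or None
        with open(filename, "wb") as f, tqdm(total=total_bytes, unit="B") as progress:
            # Read 8 KiB at a time; the default chunk size is a single byte
            for buffer in response.iter_content(chunk_size=8192):
                f.write(buffer)
                progress.update(len(buffer))
        print("File downloaded")
    else:
        print(f"Download failed with status code {response.status_code}")

else:
    print("File exists")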

File not exist 


  0%|          | 0/26891229.0 [00:00<?, ?it/s]

File downloaded


In [9]:
import fitz  # PyMuPDF

def text_formatter(text: str) -> str:
    """Replaces newlines with spaces and strips surrounding whitespace."""
    cleaned_text = text.replace("\n", " ").strip()

    return cleaned_text

def open_and_read_pdf(pdf_path: str) -> list[dict]:
    """Reads the PDF page by page and collects the text and basic counts for each page."""
    doc = fitz.open(pdf_path)
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(doc)):
        text = page.get_text()
        text = text_formatter(text=text)
        pages_and_texts.append({"page_number": page_number - 41,  # 41 subtracted because the book's own page numbering starts at PDF page 41
                                "page_char_count": len(text),
                                "page_word_count": len(text.split(" ")),
                                "page_sentence_count_raw": len(text.split('.')),
                                "text": text})

    return pages_and_texts

pages_and_texts = open_and_read_pdf(pdf_path)
pages_and_texts[:2]


0it [00:00, ?it/s]

[{'page_number': -41,
  'page_char_count': 29,
  'page_word_count': 4,
  'page_sentence_count_raw': 1,
  'text': 'Human Nutrition: 2020 Edition'},
 {'page_number': -40,
  'page_char_count': 0,
  'page_word_count': 1,
  'page_sentence_count_raw': 1,
  'text': ''}]
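
The offset of 41 explains the negative page numbers above: the first PDF page (index 0) becomes page -41. A small sketch of the mapping in both directions, using hypothetical helper names:

```python
PAGE_OFFSET = 41  # the book's own page numbering starts 41 pages into the PDF

def pdf_index_to_book_page(pdf_index: int) -> int:
    """Convert a zero-based PDF page index to the page number stored in pages_and_texts."""
    return pdf_index - PAGE_OFFSET

def book_page_to_pdf_index(book_page: int) -> int:
    """Convert a stored page number back to the zero-based PDF page index."""
    return book_page + PAGE_OFFSET

print(pdf_index_to_book_page(0))    # first PDF page (the cover): -41
print(pdf_index_to_book_page(100))  # 59
print(book_page_to_pdf_index(373))  # 414
```

Because the list index and the PDF index are the same, `pages_and_texts[book_page_to_pdf_index(n)]` looks up the entry for printed page `n`.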

In [10]:
import random
random.sample(pages_and_texts, 3)

[{'page_number': 373,
  'page_char_count': 732,
  'page_word_count': 120,
  'page_sentence_count_raw': 11,
  'text': 'available in the web-based textbook and not available in the  downloadable versions (EPUB, Digital PDF, Print_PDF, or  Open Document).  Learning activities may be used across various mobile  devices, however, for the best user experience it is strongly  recommended that users complete these activities using a  desktop or laptop computer and in Google Chrome.  \xa0 An interactive or media element has been  excluded from this version of the text. You can  view it online here:  http://pressbooks.oer.hawaii.edu/ humannutrition2/?p=246  \xa0 An interactive or media element has been  excluded from this version of the text. You can  view it online here:  http://pressbooks.oer.hawaii.edu/ humannutrition2/?p=246  Defining Protein  |  373'},
 {'page_number': 1011,
  'page_char_count': 1416,
  'page_word_count': 238,
  'page_sentence_count_raw': 16,
  'text': 'Protecting the Public 

In [11]:
import pandas as pd
df = pd.DataFrame(pages_and_texts)
df.head()

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | text |
|---|---|---|---|---|---|
| 0 | -41 | 29 | 4 | 1 | Human Nutrition: 2020 Edition |
| 1 | -40 | 0 | 1 | 1 | |
| 2 | -39 | 320 | 54 | 1 | Human Nutrition: 2020 Edition UNIVERSITY OF ... |
| 3 | -38 | 212 | 32 | 3 | Human Nutrition: 2020 Edition by University of... |
| 4 | -37 | 797 | 145 | 3 | Contents Preface University of Hawai‘i at Mā... |


In [12]:
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw |
|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 562.5 | 1148.0 | 198.3 | 14.18 |
| std | 348.86 | 560.38 | 95.76 | 9.54 |
| min | -41.0 | 0.0 | 1.0 | 1.0 |
| 25% | 260.75 | 762.0 | 134.0 | 8.0 |
| 50% | 562.5 | 1231.5 | 214.5 | 13.0 |
| 75% | 864.25 | 1603.5 | 271.0 | 19.0 |
| max | 1166.0 | 2308.0 | 429.0 | 82.0 |


Embedding models: see the Sentence Transformers documentation at https://www.sbert.net/docs

In [16]:
from spacy.lang.en import English
nlp = English()

# Add the rule-based sentencizer so that doc.sents is available
nlp.add_pipe('sentencizer')
doc = nlp("This is a sentence. This is another sentence.")
assert len(list(doc.sents)) == 2
list(doc.sents)

[This is a sentence., This is another sentence.]

In [17]:
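for item in tqdm(pages_and_texts, total=len(pages_and_texts)):
    # Split the page text into sentences with the spaCy sentencizer
    item['sentences'] = list(nlp(item['text']).sents)

    # Convert the spaCy Span objects to plain strings
    item['sentences'] = [str(sentence) for sentence in item['sentences']]

    # Count the sentences found on the page
    item['page_sentence_count_spacy'] = len(item['sentences'])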

  0%|          | 0/1208 [00:00<?, ?it/s]

In [18]:
pages_and_texts[100]

{'page_number': 59,
 'page_char_count': 629,
 'page_word_count': 109,
 'page_sentence_count_raw': 4,
 'text': 'Digestive  system  without  labels by  Mariana  Ruiz / Public  Domain  Knowing how to maintain the balance of friendly bacteria in your  intestines through proper diet can promote overall health. Recent  scientific studies have shown that probiotic supplements positively  affect intestinal microbial flora, which in turn positively affect  immune system function. As good nutrition is known to influence  immunity, there is great interest in using probiotic foods and other  immune-system-friendly foods as a way to prevent illness. In this  chapter we will explore not only immune system function, but also  Introduction  |  59',
 'sentences': ['Digestive  system  without  labels by  Mariana  Ruiz / Public  Domain  Knowing how to maintain the balance of friendly bacteria in your  intestines through proper diet can promote overall health.',
  'Recent  scientific studies have shown tha

In [19]:
df= pd.DataFrame(pages_and_texts)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_sentence_count_spacy |
|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 562.5 | 1148.0 | 198.3 | 14.18 | 10.32 |
| std | 348.86 | 560.38 | 95.76 | 9.54 | 6.3 |
| min | -41.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 25% | 260.75 | 762.0 | 134.0 | 8.0 | 5.0 |
| 50% | 562.5 | 1231.5 | 214.5 | 13.0 | 10.0 |
| 75% | 864.25 | 1603.5 | 271.0 | 19.0 | 15.0 |
| max | 1166.0 | 2308.0 | 429.0 | 82.0 | 28.0 |
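
In the statistics above, the raw count (mean 14.18) is higher than the spaCy count (mean 10.32). One likely reason is that `text.split('.')` splits on every period, including decimal points, and also returns a trailing empty piece. The sketch below illustrates this on two sentences adapted from the textbook. The regular expression is only a rough stand-in for a sentence splitter and is an assumption for illustration, not how the spaCy sentencizer works internally.

```python
import re

text = ("Water has a pH of 7.0, meaning it is not acidic or basic. "
        "Mucus is composed of more than 90 percent water.")

# Naive count: every period is treated as a sentence boundary
raw_pieces = text.split(".")
raw_count = len(raw_pieces)

# Rough boundary rule: split on whitespace that follows ., ! or ?
sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

print(raw_count)       # 4: the decimal point and the final period add extra pieces
print(len(sentences))  # 2
```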


## Chunking the individual sentences
Once each page is split into individual sentences, we group the sentences into chunks of **10**.

In [20]:
num_sentence_chunk_size = 10


def split_list(input_list: list,
               slice_size: int = num_sentence_chunk_size) -> list[list]:
    """Splits input_list into consecutive sublists of at most slice_size items."""
    return [input_list[i:i+slice_size] for i in range(0, len(input_list), slice_size)]

test_split = list(range(25))
split_list(test_split)

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 [20, 21, 22, 23, 24]]

In [22]:
for item in tqdm(pages_and_texts):
    item['sentence_chunks'] = split_list(input_list=item['sentences'], slice_size=num_sentence_chunk_size)
    item['num_chunks'] = len(item['sentence_chunks'])

  0%|          | 0/1208 [00:00<?, ?it/s]

In [24]:
random.sample(pages_and_texts, k=1)

[{'page_number': 136,
  'page_char_count': 1497,
  'page_word_count': 261,
  'page_sentence_count_raw': 15,
  'text': 'Source: National Heart, Lung, and Blood Institute. Accessed  November 4, 2012. https://www.nhlbi.nih.gov.  BMI Limitations  A BMI is a fairly simple measurement and does not take into account  fat mass or fat distribution in the body, both of which are additional  predictors of disease risk. Body fat weighs less than muscle mass.  Therefore, BMI can sometimes underestimate the amount of body  fat in overweight or obese people and overestimate it in more  muscular people. For instance, a muscular athlete will have more  muscle mass (which is heavier than fat mass) than a sedentary  individual of the same height. Based on their BMIs the muscular  athlete would be less “ideal” and may be categorized as more  overweight or obese than the sedentary individual; however this is  an infrequent problem with BMI calculation. Additionally, an older  person with osteoporosis (decre

In [25]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_sentence_count_raw | page_sentence_count_spacy | num_chunks |
|---|---|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 562.5 | 1148.0 | 198.3 | 14.18 | 10.32 | 1.53 |
| std | 348.86 | 560.38 | 95.76 | 9.54 | 6.3 | 0.64 |
| min | -41.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 25% | 260.75 | 762.0 | 134.0 | 8.0 | 5.0 | 1.0 |
| 50% | 562.5 | 1231.5 | 214.5 | 13.0 | 10.0 | 1.0 |
| 75% | 864.25 | 1603.5 | 271.0 | 19.0 | 15.0 | 2.0 |
| max | 1166.0 | 2308.0 | 429.0 | 82.0 | 28.0 | 3.0 |
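
The `num_chunks` column follows directly from the sentence count: a page with `n` sentences yields `ceil(n / 10)` chunks. The sketch below checks this for the sentence counts that appear in the summary above (0, 5, 10, 15, and 28). `expected_num_chunks` is a hypothetical helper, and `split_list` is repeated here only so the example runs on its own.

```python
import math

num_sentence_chunk_size = 10

def split_list(input_list: list, slice_size: int = num_sentence_chunk_size) -> list[list]:
    """Same slicing logic as the split_list function defined earlier."""
    return [input_list[i:i + slice_size] for i in range(0, len(input_list), slice_size)]

def expected_num_chunks(num_sentences: int, chunk_size: int = num_sentence_chunk_size) -> int:
    """Number of chunks needed to hold num_sentences sentences."""
    return math.ceil(num_sentences / chunk_size)

results = {}
for n in (0, 5, 10, 15, 28):
    chunks = split_list(list(range(n)))
    results[n] = len(chunks)
    print(n, "sentences ->", len(chunks), "chunk(s)")
```

A page with 28 sentences, the maximum found by spaCy, produces 3 chunks, which agrees with the maximum of `num_chunks`.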


In [26]:
import re
pages_and_chunks = []
for item in tqdm(pages_and_texts):
    for sentence_chunk in item['sentence_chunks']:
        chunk_dict = {}
        chunk_dict['page_number'] = item['page_number']

        # Join the sentences with a space so that "water.Mucus" becomes "water. Mucus",
        # then collapse any run of spaces into a single space
        joined_sentence_chunk = re.sub(r" {2,}", " ", " ".join(sentence_chunk)).strip()

        chunk_dict['sentence_chunk'] = joined_sentence_chunk

        chunk_dict['chunk_char_count'] = len(joined_sentence_chunk)
        chunk_dict['chunk_word_count'] = len(joined_sentence_chunk.split(' '))
        chunk_dict['chunk_token_count'] = len(joined_sentence_chunk) / 4    # Assuming 1 token = ~4 characters

        pages_and_chunks.append(chunk_dict)

len(pages_and_chunks)

  0%|          | 0/1208 [00:00<?, ?it/s]

1843

In [27]:
random.sample(pages_and_chunks, k=1)

[{'page_number': 158,
  'sentence_chunk': 'reactions as it can store a large amount of heat, is electrically neutral, and has a pH of 7.0, meaning it is not acidic or basic. Additionally, water is involved in many enzymatic reactions as an agent to break bonds or, by its removal from a molecule, to form bonds. Water As a Lubricant/Shock Absorber Many may view the slimy products of a sneeze as gross, but sneezing is essential for removing irritants and could not take place without water. Mucus, which is not only essential to discharge nasal irritants, is also required for breathing, transportation of nutrients along the gastrointestinal tract, and elimination of waste materials through the rectum. Mucus is composed of more than 90 percent water and a front-line defense against injury and foreign invaders. It protects tissues from irritants, entraps pathogens, and contains immune-system cells that destroy pathogens. Water is also the main component of the lubricating fluid between joints an
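
The `chunk_token_count` field relies on the rule of thumb that one token is roughly four characters. Embedding models accept only a limited number of tokens per input, so this estimate can be used to flag chunks that may be too long. The sketch below is illustrative only: the limit of 384 tokens and the names `estimate_token_count` and `exceeds_limit` are assumptions, and the actual limit should be taken from the documentation of the embedding model in use.

```python
HYPOTHETICAL_MAX_TOKENS = 384  # assumed limit for illustration only

def estimate_token_count(text: str, chars_per_token: float = 4.0) -> float:
    """Rough token estimate, assuming 1 token = ~4 characters."""
    return len(text) / chars_per_token

def exceeds_limit(chunk: dict, max_tokens: int = HYPOTHETICAL_MAX_TOKENS) -> bool:
    """True when the estimated token count is above the assumed model limit."""
    return chunk["chunk_token_count"] > max_tokens

# Placeholder chunks of 400 and 2000 characters
sample_chunks = []
for text in ("x" * 400, "x" * 2000):
    sample_chunks.append({"sentence_chunk": text,
                          "chunk_token_count": estimate_token_count(text)})

too_long = [chunk for chunk in sample_chunks if exceeds_limit(chunk)]
print([chunk["chunk_token_count"] for chunk in sample_chunks])  # [100.0, 500.0]
print(len(too_long))  # 1
```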