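## RAG (Retrieval-Augmented Generation)
* Retrieval: find the passages in a document that are relevant to a query
* Augmentation: add the retrieved passages to the prompt as context
* Generation: have an LLM generate an answer grounded in that context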
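
The three stages can be sketched in a few lines. This is a toy illustration only: it assumes a keyword-overlap score in place of real embeddings and a stub in place of an LLM, and the names `retrieve`, `augment`, and `generate` are hypothetical, not from any library.

```python
def retrieve(query: str, passages: list[str], top_k: int = 1) -> list[str]:
    """Rank passages by shared words with the query (stand-in for embedding similarity)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        passages,
        key=lambda p: len(query_words & set(p.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def augment(query: str, context: list[str]) -> str:
    """Place the retrieved passages in the prompt ahead of the question."""
    context_block = "\n".join(f"- {c}" for c in context)
    return f"Context:\n{context_block}\n\nQuestion: {query}\nAnswer:"


def generate(prompt: str) -> str:
    """Placeholder: a real pipeline would send the prompt to an LLM here."""
    return f"[LLM output for a prompt of {len(prompt)} characters]"


passages = [
    "Protein is made of amino acids.",
    "Vitamin C is found in citrus fruit.",
]
query = "What is protein made of?"
context = retrieve(query, passages)
prompt = augment(query, context)
print(prompt)
print(generate(prompt))
```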

In [1]:
!nvidia-smi

Thu Sep 26 17:32:57 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce RTX 3080 Ti     Off | 00000000:01:00.0 Off |                  N/A |
|  0%   51C    P8              18W / 350W |     10MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                    

## Steps to follow
1. Open a PDF
2. Format the text
3. Split the text into chunks and embed each chunk
4. Build a retrieval system
5. Build a prompt that incorporates the retrieved chunks
6. Generate an answer

GitHub link format for raw files: the URL of a file's page on GitHub contains `blob`, for example

https://github.com/mrdbourke/simple-local-rag/blob/main/human-nutrition-text.pdf

To download the file itself, replace `blob` with `raw/refs/heads`:

https://github.com/mrdbourke/simple-local-rag/raw/refs/heads/main/human-nutrition-text.pdf
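
The same substitution as a small helper. This is a minimal sketch: `github_blob_to_raw` is a hypothetical name, and it assumes the path segment after `blob` is a branch name (as with `main` here), not a tag or commit hash.

```python
def github_blob_to_raw(url: str) -> str:
    """Swap the first '/blob/' path segment for '/raw/refs/heads/'."""
    if "/blob/" not in url:
        raise ValueError(f"Not a GitHub blob URL: {url}")
    # Replace only the first occurrence so a file path containing 'blob' is left alone
    return url.replace("/blob/", "/raw/refs/heads/", 1)


blob_url = "https://github.com/mrdbourke/simple-local-rag/blob/main/human-nutrition-text.pdf"
raw_url = github_blob_to_raw(blob_url)
print(raw_url)
```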

In [6]:
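import os
import requests
from tqdm.auto import tqdm

pdf_path = "human-nutrition-text.pdf"

if not os.path.exists(pdf_path):
    print("File not exist ")

    url = "https://pressbooks.oer.hawaii.edu/humannutrition2/open/download?type=pdf"
    filename = pdf_path
    # stream=True defers the body download so it can be written chunk by chunk
    response = requests.get(url, stream=True, timeout=60)
    if response.status_code == 200:
        # Content-Length may be missing; tqdm then shows a counter without a total
        total_bytes = int(response.headers.get("Content-Length", 0)) or None
        with open(filename, "wb") as f, tqdm(total=total_bytes, unit="B", unit_scale=True) as progress:
            # iter_content() defaults to 1-byte chunks, so request larger ones
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
                progress.update(len(chunk))
        print("File downloaded")
    else:
        print(f"Download failed with status code {response.status_code}")

else:
    print("File exists")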

File not exist 


  0%|          | 0/26891229.0 [00:00<?, ?it/s]

File downloaded


In [9]:
import fitz

def text_formatter(text: str) -> str:
    """Replace newlines with spaces and strip leading/trailing whitespace."""
    cleaned_text = text.replace("\n", " ").strip()

    return cleaned_text

def open_and_read_pdf(pdf_path: str) -> list[dict]:
    """Read a PDF page by page and collect the text plus basic statistics for each page."""
    doc = fitz.open(pdf_path)
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(doc)):
        text = page.get_text()
        text = text_formatter(text=text)
        # Subtract 41 so page_number lines up with the page numbers printed in the book,
        # which start after the front matter
        pages_and_texts.append({"page_number": page_number - 41,
                                "page_char_count": len(text),
                                "page_word_count": len(text.split(" ")),  # note: "".split(" ") == [""], so an empty page counts 1
                                "page_senence_count_raw": len(text.split('.')),
                                "text": text})

    return pages_and_texts

pages_and_texts = open_and_read_pdf(pdf_path)
pages_and_texts[:2]


0it [00:00, ?it/s]

[{'page_number': -41,
  'page_char_count': 29,
  'page_word_count': 4,
  'page_senence_count_raw': 1,
  'text': 'Human Nutrition: 2020 Edition'},
 {'page_number': -40,
  'page_char_count': 0,
  'page_word_count': 1,
  'page_senence_count_raw': 1,
  'text': ''}]
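
The second page above is empty yet reports a `page_word_count` of 1. That follows from how `str.split(" ")` behaves: splitting an empty string on an explicit separator returns `['']`, and consecutive spaces produce empty strings that are counted as words. A minimal sketch of the difference, where `count_words` is a hypothetical helper and not part of the notebook:

```python
def count_words(text: str) -> int:
    """Count whitespace-separated words; empty or blank text gives 0."""
    return len(text.split())


empty_page = ""
title_page = "Human Nutrition: 2020 Edition"
double_spaced = "available in the  downloadable versions"  # two spaces after "the"

for sample in (empty_page, title_page, double_spaced):
    # Left: count used in the notebook; right: count with the no-argument split()
    print(len(sample.split(" ")), count_words(sample))
```

With no argument, `split()` treats any run of whitespace as one separator, so the counts are not inflated by the double spaces left behind when newlines are replaced.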

In [10]:
import random
random.sample(pages_and_texts, 3)

[{'page_number': 373,
  'page_char_count': 732,
  'page_word_count': 120,
  'page_senence_count_raw': 11,
  'text': 'available in the web-based textbook and not available in the  downloadable versions (EPUB, Digital PDF, Print_PDF, or  Open Document).  Learning activities may be used across various mobile  devices, however, for the best user experience it is strongly  recommended that users complete these activities using a  desktop or laptop computer and in Google Chrome.  \xa0 An interactive or media element has been  excluded from this version of the text. You can  view it online here:  http://pressbooks.oer.hawaii.edu/ humannutrition2/?p=246  \xa0 An interactive or media element has been  excluded from this version of the text. You can  view it online here:  http://pressbooks.oer.hawaii.edu/ humannutrition2/?p=246  Defining Protein  |  373'},
 {'page_number': 1011,
  'page_char_count': 1416,
  'page_word_count': 238,
  'page_senence_count_raw': 16,
  'text': 'Protecting the Public 

In [11]:
import pandas as pd
df = pd.DataFrame(pages_and_texts)
df.head()

| | page_number | page_char_count | page_word_count | page_senence_count_raw | text |
|---|---|---|---|---|---|
| 0 | -41 | 29 | 4 | 1 | Human Nutrition: 2020 Edition |
| 1 | -40 | 0 | 1 | 1 | |
| 2 | -39 | 320 | 54 | 1 | Human Nutrition: 2020 Edition UNIVERSITY OF ... |
| 3 | -38 | 212 | 32 | 3 | Human Nutrition: 2020 Edition by University of... |
| 4 | -37 | 797 | 145 | 3 | Contents Preface University of Hawai‘i at Mā... |


In [12]:
df.describe().round(2)

| | page_number | page_char_count | page_word_count | page_senence_count_raw |
|---|---|---|---|---|
| count | 1208.0 | 1208.0 | 1208.0 | 1208.0 |
| mean | 562.5 | 1148.0 | 198.3 | 14.18 |
| std | 348.86 | 560.38 | 95.76 | 9.54 |
| min | -41.0 | 0.0 | 1.0 | 1.0 |
| 25% | 260.75 | 762.0 | 134.0 | 8.0 |
| 50% | 562.5 | 1231.5 | 214.5 | 13.0 |
| 75% | 864.25 | 1603.5 | 271.0 | 19.0 |
| max | 1166.0 | 2308.0 | 429.0 | 82.0 |
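
These character counts matter for the embedding step, because embedding models and LLMs limit input length in tokens, not characters. A rough sketch of the conversion, assuming the common rule of thumb of about 4 characters per English token. That ratio is an assumption, not a value from this notebook, and real counts depend on the tokenizer. `estimate_tokens` is a hypothetical helper; the inputs are copied from the `page_char_count` column above.

```python
CHARS_PER_TOKEN = 4  # assumption: rough rule of thumb for English text


def estimate_tokens(char_count: float) -> float:
    """Approximate the token count from a character count."""
    return char_count / CHARS_PER_TOKEN


# Values copied from the page_char_count column of df.describe()
page_char_stats = {"mean": 1148.0, "75%": 1603.5, "max": 2308.0}
token_estimates = {name: estimate_tokens(chars) for name, chars in page_char_stats.items()}

for name, tokens in token_estimates.items():
    print(f"{name}: ~{tokens:.0f} tokens")
```

On the DataFrame itself, the same estimate could be stored per page with something like `df["page_token_estimate"] = df["page_char_count"] / 4`, where the column name is again hypothetical.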
