# Create and run a local RAG pipeline from scratch


## What is RAG?

RAG stands for Retrieval-Augmented Generation.

It was introduced in the paper [_Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks_](https://arxiv.org/abs/2005.11401).

The goal of RAG is to take information and pass it to an LLM so it can generate outputs based on that information.

- **Retrieval** --> Find relevant information given a query, e.g. "What are the macronutrients and what do they do?" --> retrieve passages related to macronutrients from a nutrition textbook.

- **Augmented** --> Take the relevant information and augment our input (prompt) to an LLM with it.

- **Generation** --> Pass the augmented prompt to an LLM so it can generate an output.
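
The three steps above can be sketched end to end in a few lines. This is a toy illustration only: retrieval here is plain word overlap rather than the vector search we build later, the passages are made up, and `generate` is a hypothetical stub standing in for a real LLM call.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase a string and return its set of word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Retrieval: rank passages by how many words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        passages,
        key=lambda passage: len(query_words & tokenize(passage)),
        reverse=True,
    )
    return ranked[:k]


def augment(query: str, context: list[str]) -> str:
    """Augmentation: place the retrieved passages into the prompt."""
    context_block = "\n".join(f"- {passage}" for passage in context)
    return f"Answer the query using only this context:\n{context_block}\n\nQuery: {query}"


def generate(prompt: str) -> str:
    """Generation: hypothetical stub where a real LLM call would go."""
    return f"[an LLM would answer here, given a {len(prompt)}-character prompt]"


# Made-up passages standing in for chunks of a nutrition textbook.
passages = [
    "Macronutrients are carbohydrates, lipids and proteins.",
    "Water makes up a large share of body weight.",
    "Proteins are macronutrients built from amino acids.",
]
query = "What are the macronutrients?"

top_passages = retrieve(query, passages, k=2)
prompt = augment(query, top_passages)
print(prompt)
print(generate(prompt))
```

Swapping the word-overlap ranking for embedding similarity, and the stub for a local LLM, gives the pipeline the rest of this notebook builds.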


In [3]:
import torch

## Why RAG?

The main goal of RAG is to improve the generation outputs of LLMs.

1. Prevent hallucinations - LLMs are capable of generating _good-looking_ text, but that doesn't mean it is factually correct. RAG can help an LLM generate answers grounded in relevant, factual passages.

2. Work with custom data - Many base LLMs are trained on internet-scale data. This means they have a fairly good understanding of language in general, but it also means their responses can be generic. RAG helps an LLM generate answers based on specific data.


## What can RAG be used for?

- Customer support Q&A chat -- Treat your existing support docs as a resource. When a customer asks a question, a retrieval system fetches the relevant documentation snippets and an LLM crafts those snippets into an answer.

- Email chain analysis -- Let's say you're a large insurance company with long chains of emails about customer claims. You could use a RAG pipeline to find relevant information in those emails and then use an LLM to process it into structured data.

- Company internal documentation chat

- Textbook Q&A -- Let's say you're a student with a 1200-page textbook to read. You could build a RAG pipeline to go through it and find passages relevant to your questions.

Common theme -- take the documents relevant to a query and process them with an LLM.

From this angle, you can consider an LLM as a calculator for words.


## Why Local?

Fun...

Privacy, speed and cost.

- Privacy -- If you have private documentation, maybe you don't want to send it to an API. You want to set up an LLM and run it on your own hardware.

- Speed -- Whenever you use an API, you have to send data across the internet, which takes time. Running locally means we don't have to wait for data transfer.

- Cost -- If you own your hardware, the main cost is already paid: there is an initial cost but little or no ongoing operational cost.

- No vendor lock-in -- If an API shuts down, you don't have to worry.


In [4]:
print(torch.backends.mps.is_available())

True


## Key terms

| Term                                | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| ----------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Token**                           | A sub-word piece of text. For example, "hello, world!" could be split into ["hello", ",", "world", "!"]. A token can be a whole word,<br> part of a word or group of punctuation characters. 1 token ~= 4 characters in English, 100 tokens ~= 75 words.<br> Text gets broken into tokens before being passed to an LLM.                                                                                                                                                                                                                                                                                  |
| **Embedding**                       | A learned numerical representation of a piece of data. For example, a sentence of text could be represented by a vector with<br> 768 values. Similar pieces of text (in meaning) will ideally have similar values.                                                                                                                                                                                                                                                                                                                                                                                        |
| **Embedding model**                 | A model designed to accept input data and output a numerical representation. For example, a text embedding model may take in 384 <br>tokens of text and turn it into a vector of size 768. An embedding model can be, and often is, different from an LLM. |
| **Similarity search/vector search** | Similarity search/vector search aims to find two vectors which are close together in high-dimensional space. For example, <br>two pieces of similar text passed through an embedding model should have a high similarity score, whereas two pieces of text about<br> different topics will have a lower similarity score. Common similarity score measures are dot product and cosine similarity. |
| **Large Language Model (LLM)**      | A model which has been trained to numerically represent the patterns in text. A generative LLM will continue a sequence when given a sequence. <br>For example, given a sequence of the text "hello, world!", a generative LLM may produce "we're going to build a RAG pipeline today!".<br> This generation will be highly dependent on the training data and prompt. |
| **LLM context window**              | The number of tokens an LLM can accept as input. For example, as of March 2024, GPT-4 has a default context window of 32k tokens<br> (about 96 pages of text) but can go up to 128k if needed. A recent open-source LLM from Google, Gemma (March 2024) has a context<br> window of 8,192 tokens (about 24 pages of text). A larger context window means an LLM can accept more relevant information<br> to assist with a query. For example, in a RAG pipeline, if a model has a larger context window, it can accept more reference items<br> from the retrieval system to aid with its generation. |
| **Prompt**                          | A common term for describing the input to a generative LLM. The idea of "[prompt engineering](https://en.wikipedia.org/wiki/Prompt_engineering)" is to structure a text-based<br> (or potentially image-based as well) input to a generative LLM in a specific way so that the generated output is ideal. This technique is<br> possible because of an LLM's capacity for in-context learning, as in, it is able to use its representation of language to break down <br>the prompt and recognize what a suitable output may be (note: the output of LLMs is probabilistic, so terms like "may output" are used). |
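
To make the two similarity measures from the table concrete, here is a minimal sketch in plain Python. The three-value vectors are hypothetical stand-ins for real embeddings (which typically have hundreds of values), and the helper names are illustrative rather than taken from any library.

```python
import math


def dot_product(a: list[float], b: list[float]) -> float:
    """Sum of element-wise products; sensitive to vector magnitude."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of the two vectors after scaling each to unit length."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)


# Hypothetical 3-dimensional "embeddings" for three pieces of text.
protein_text = [0.9, 0.1, 0.0]
amino_acid_text = [0.8, 0.2, 0.1]
car_engine_text = [0.0, 0.1, 0.9]

similar_score = cosine_similarity(protein_text, amino_acid_text)
different_score = cosine_similarity(protein_text, car_engine_text)
print(f"similar topics:   {similar_score:.3f}")
print(f"different topics: {different_score:.3f}")
```

Cosine similarity ignores vector length, so it is a common choice when embeddings are not normalized. For unit-length vectors the two measures give the same value.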


## What we're going to build

We're going to build a RAG pipeline that lets us chat with a PDF document, specifically an open-source [nutrition textbook](https://pressbooks.oer.hawaii.edu/humannutrition2/), ~1200 pages long.

You could call our project NutriChat!

We'll write the code to:

1. Open a PDF document (you could use almost any PDF here).
2. Format the text of the PDF textbook ready for an embedding model (this process is known as text splitting/chunking).
3. Embed all of the chunks of text in the textbook, turning them into numerical representations which we can store for later.
4. Build a retrieval system that uses vector search to find relevant chunks of text based on a query.
5. Create a prompt that incorporates the retrieved pieces of text.
6. Generate an answer to a query based on passages from the textbook.

The above steps can be broken down into two major sections:

1. Document preprocessing/embedding creation (steps 1-3).
2. Search and answer (steps 4-6).

And that's the structure we'll follow.

It's similar to the workflow outlined on the NVIDIA blog which [details a local RAG pipeline](https://developer.nvidia.com/blog/rag-101-demystifying-retrieval-augmented-generation-pipelines/).

<img src="https://github.com/mrdbourke/simple-local-rag/blob/main/images/simple-local-rag-workflow-flowchart.png?raw=true" alt="flowchart of a local RAG workflow" />


## 1. Document/Text Processing and Embedding Creation

Ingredients:

- PDF document of choice.
- Embedding model of choice.

Steps:

1. Import PDF document.
2. Process text for embedding (e.g. split into chunks of sentences).
3. Embed text chunks with embedding model.
4. Save embeddings to file for later use (embeddings saved to file will last for many years, or until you lose your hard drive).


### Import PDF Document

This will work with many other kinds of documents.

However, we'll start with PDF since many people have PDFs.

But just keep in mind, text files, email chains, support documentation, articles and more can also work.

We're going to pretend we're nutrition students at the University of Hawai'i, reading through the open-source PDF textbook [_Human Nutrition: 2020 Edition_](https://pressbooks.oer.hawaii.edu/humannutrition2/).

There are several libraries to open PDFs with Python but I found that [PyMuPDF](https://github.com/pymupdf/pymupdf) works quite well in many cases.

First we'll download the PDF if it doesn't exist.


In [5]:
import os 
import requests

# Get PDF path
pdf_path = "human-nutrition-text.pdf"

# Download the PDF if it does not exist
if not os.path.exists(pdf_path):
    print("[INFO] File doesn't exist, downloading...")

    # The URL of the PDF you want to download
    url = "https://pressbooks.oer.hawaii.edu/humannutrition2/open/download?type=pdf"

    # The local filename to save the downloaded file
    filename = pdf_path

    # Send a GET request to the URL
    response = requests.get(url)

    # Check if the request was successful
    if response.status_code == 200:
        # Open a file in binary write mode and save the content to it
        with open(filename, "wb") as file:
            file.write(response.content)
        print(f"The file has been downloaded and saved as {filename}")
    else:
        print(f"Failed to download the file. Status code: {response.status_code}")
else:
    print(f"File {pdf_path} exists.")

File human-nutrition-text.pdf exists.


PDF acquired!

We can import the pages of our PDF to text by first defining the PDF path and then opening and reading it with PyMuPDF (`import fitz`).

We'll write a small helper function to preprocess the text as it gets read. Note that not all text will be read in the same way, so keep this in mind when you prepare your text.

We'll save each page to a dictionary and then append that dictionary to a list for ease of use later.


In [7]:
import fitz
from tqdm.auto import tqdm

def text_formatter(text: str) -> str:
    """Performs minor formatting on text (replaces newlines with spaces)."""
    cleaned_text = text.replace("\n", " ").strip()
    return cleaned_text


def open_and_read_pdf(pdf_path: str) -> list[dict]:
    """Opens a PDF and returns one dictionary of text and rough stats per page."""
    doc = fitz.open(pdf_path)
    pages_and_text = []
    for page_number, page in tqdm(enumerate(doc)):
        text = page.get_text()
        text = text_formatter(text=text)
        pages_and_text.append({
            # Offset so numbering lines up with the page numbers printed in this PDF
            "page_number": page_number - 41,
            "page_char_count": len(text),
            "page_word_count": len(text.split(" ")),
            # Rough sentence count: split on ". "
            "page_sentence_count_row": len(text.split(". ")),
            # Rough estimate: 1 token ~= 4 characters
            "page_token_count": len(text) / 4,
            "text": text,
        })
    return pages_and_text

pages_and_text = open_and_read_pdf(pdf_path = pdf_path)
pages_and_text[:2]

1208it [00:01, 692.11it/s]


[{'page_number': -41,
  'page_char_count': 29,
  'page_word_count': 4,
  'page_sentence_count_row': 1,
  'page_token_count': 7.25,
  'text': 'Human Nutrition: 2020 Edition'},
 {'page_number': -40,
  'page_char_count': 0,
  'page_word_count': 1,
  'page_sentence_count_row': 1,
  'page_token_count': 0.0,
  'text': ''}]

In [8]:
import random 

random.sample(pages_and_text , k=3)

[{'page_number': 81,
  'page_char_count': 86,
  'page_word_count': 12,
  'page_sentence_count_row': 1,
  'page_token_count': 21.5,
  'text': 'http://pressbooks.oer.hawaii.edu/ humannutrition2/?p=84    The Digestive System  |  81'},
 {'page_number': 788,
  'page_char_count': 1315,
  'page_word_count': 226,
  'page_sentence_count_row': 18,
  'page_token_count': 328.75,
  'text': 'oranges. Additionally, since 1998, food manufacturers  have been required to add folate to cereals and other  grain products.2  Weight Gain during Pregnancy  During pregnancy, a mother’s body changes in many ways. One  of the most notable and significant changes is weight gain. If a  pregnant woman does not gain enough weight, her unborn baby  will be at risk. Poor weight gain, especially in the second and third  trimesters, could result not only in low birth weight, but also infant  mortality and intellectual disabilities. Therefore, it is vital for a  pregnant woman to maintain a healthy amount of weight gain.

In [9]:
import pandas as pd

df= pd.DataFrame(pages_and_text)
df.head()

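|     | page_number | page_char_count | page_word_count | page_sentence_count_row | page_token_count | text |
| --- | ----------- | --------------- | --------------- | ----------------------- | ---------------- | ---- |
| 0   | -41         | 29              | 4               | 1                       | 7.25             | Human Nutrition: 2020 Edition |
| 1   | -40         | 0               | 1               | 1                       | 0.0              |      |
| 2   | -39         | 320             | 54              | 1                       | 80.0             | Human Nutrition: 2020 Edition UNIVERSITY OF ... |
| 3   | -38         | 212             | 32              | 1                       | 53.0             | Human Nutrition: 2020 Edition by University of... |
| 4   | -37         | 797             | 147             | 3                       | 199.25           | Contents Preface University of Hawai‘i at Mā... |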


### Get some stats on the text

Let's perform a rough exploratory data analysis (EDA) to get an idea of the size of the texts (e.g. character counts, word counts etc) we're working with.

The different sizes of texts will be a good indicator of how we should split our texts.

Many embedding models have limits on the size of texts they can ingest, for example, the [`sentence-transformers`](https://www.sbert.net/docs/pretrained_models.html) model [`all-mpnet-base-v2`](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) has an input size of 384 tokens.

This means that the model has been trained to ingest texts of up to 384 tokens and turn them into embeddings (1 token ~= 4 characters ~= 0.75 words).

Texts over 384 tokens which are encoded by this model will be automatically truncated to 384 tokens, potentially losing some information.

We'll discuss this more in the embedding section.

For now, let's explore the DataFrame we created from our list of dictionaries.


In [10]:
df.describe().round(2)

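|       | page_number | page_char_count | page_word_count | page_sentence_count_row | page_token_count |
| ----- | ----------- | --------------- | --------------- | ----------------------- | ---------------- |
| count | 1208.0      | 1208.0          | 1208.0          | 1208.0                  | 1208.0           |
| mean  | 562.5       | 1148.0          | 199.5           | 10.52                   | 287.0            |
| std   | 348.86      | 560.38          | 95.83           | 6.55                    | 140.1            |
| min   | -41.0       | 0.0             | 1.0             | 1.0                     | 0.0              |
| 25%   | 260.75      | 762.0           | 134.0           | 5.0                     | 190.5            |
| 50%   | 562.5       | 1231.5          | 216.0           | 10.0                    | 307.88           |
| 75%   | 864.25      | 1603.5          | 272.0           | 15.0                    | 400.88           |
| max   | 1166.0      | 2308.0          | 430.0           | 39.0                    | 577.0            |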


Okay, looks like our average token count per page is 287.

For this particular use case, it means we could embed an average whole page with the `all-mpnet-base-v2` model (this model has an input capacity of 384 tokens).
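
An average page fits, but not every page does: the 75th percentile in the stats above is about 401 estimated tokens, so at least roughly a quarter of pages would be truncated. The sketch below shows one way to list such pages. It relies on the same rough 4-characters-per-token estimate rather than a real tokenizer, and `sample_pages` is hypothetical stand-in data (in the notebook you would pass `pages_and_text`).

```python
MAX_TOKENS = 384      # input limit quoted for all-mpnet-base-v2
CHARS_PER_TOKEN = 4   # rough heuristic for English text, not a real tokenizer


def estimate_tokens(text: str) -> float:
    """Estimate the token count from the character count."""
    return len(text) / CHARS_PER_TOKEN


def pages_over_limit(pages: list[dict], max_tokens: int = MAX_TOKENS) -> list[int]:
    """Return the page numbers whose estimated token count exceeds the limit."""
    return [
        page["page_number"]
        for page in pages
        if estimate_tokens(page["text"]) > max_tokens
    ]


# Hypothetical stand-in pages sized like the mean and the longest page above.
sample_pages = [
    {"page_number": 1, "text": "a" * 1148},  # 1148 chars / 4 = 287 tokens
    {"page_number": 2, "text": "a" * 2308},  # 2308 chars / 4 = 577 tokens
]

too_long = pages_over_limit(sample_pages)
print(f"Pages over {MAX_TOKENS} estimated tokens: {too_long}")
```

Pages flagged this way are the ones that benefit most from being split into smaller chunks before embedding.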


### Further text processing (splitting pages into sentences)

The ideal way of processing text before embedding it is still an active area of research.

A simple method I've found helpful is to break the text into chunks of sentences.

As in, chunk a page of text into groups of 5, 7, 10 or more sentences (these values are not set in stone and can be explored).

But we want to follow the workflow of:

`Ingest text -> split it into groups/chunks -> embed the groups/chunks -> use the embeddings`
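
The "split it into groups/chunks" step can be sketched as follows. This is a minimal illustration under assumptions: `split_list` is a hypothetical helper name, the chunk size of 10 is just one of the values mentioned above, and the sentences are placeholder data.

```python
def split_list(items: list[str], chunk_size: int) -> list[list[str]]:
    """Split a list into consecutive sub-lists of at most `chunk_size` items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


# Hypothetical page that has already been split into 25 sentences.
sentences = [f"Sentence {n}." for n in range(1, 26)]

chunks = split_list(sentences, chunk_size=10)
print([len(chunk) for chunk in chunks])  # 25 sentences -> [10, 10, 5]

# Join each group back into one string, ready to pass to an embedding model.
chunk_texts = [" ".join(chunk) for chunk in chunks]
print(chunk_texts[-1])
```

The last chunk of a page is often shorter than the rest, so very short leftovers may be worth filtering out or merging before embedding.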

Some options for splitting text into sentences:

1. Split into sentences with simple rules (e.g. split on ". " with `text = text.split(". ")`, like we did above).
2. Split into sentences with a natural language processing (NLP) library such as [spaCy](https://spacy.io/) or [nltk](https://www.nltk.org/).
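
To see why option 1 can be brittle, here is a small sketch of the rule-based split applied to a made-up string containing abbreviations. The function name and the example text are hypothetical.

```python
def naive_sentence_split(text: str) -> list[str]:
    """Split on '. ' and drop empty pieces; the separator itself is removed."""
    return [piece.strip() for piece in text.split(". ") if piece.strip()]


# Three real sentences, two of which contain an abbreviation with a period.
text = "Eat more fiber, e.g. whole grains. Dr. Smith agrees. Protein matters too."

pieces = naive_sentence_split(text)
for piece in pieces:
    print(repr(piece))
print(f"{len(pieces)} pieces from 3 real sentences")
```

The abbreviations "e.g." and "Dr." are each treated as a sentence boundary, which is the kind of case an NLP library's sentence splitter is designed to handle better.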

Why split into sentences?

- Easier to handle than larger pages of text (especially if pages are densely filled with text).
- Lets us get specific and find out which group of sentences was used to help generate an answer within a RAG pipeline.

> **Resource:** See [spaCy install instructions](https://spacy.io/usage).

Let's use spaCy to break our text into sentences since it's likely a bit more robust than just using `text.split(". ")`.
