# Create and run a local RAG pipeline from scratch


## What is RAG?

RAG stands for Retrieval Augmented Generation.

The goal of RAG is to take information and pass it to an LLM so it can generate outputs based on that information.

- **Retrieval** --> Find relevant information given a query, e.g. "What are the macronutrients and what do they do?" --> retrieves passages of text related to macronutrients from a nutrition textbook.

- **Augmented** --> Take the relevant information and augment our input (prompt) to an LLM with it.

- **Generation** --> Take the results of the two steps above and pass them to an LLM to generate an output.
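To make the three steps concrete, here is a minimal toy sketch. It is not the pipeline we build later: retrieval here is plain word overlap rather than vector search, the passages are made up for illustration, and `generate` is a hypothetical stand-in for a real LLM call.

```python
import re

# Toy "knowledge base" (made-up passages). In the real pipeline these
# would be chunks of text taken from the nutrition textbook.
PASSAGES = [
    "Macronutrients are carbohydrates, lipids and proteins. They provide energy.",
    "Vitamins and minerals are micronutrients needed in small amounts.",
    "Water makes up a large part of body weight and transports nutrients.",
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return the set of words it contains."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query: str, passages: list[str], top_k: int = 1) -> list[str]:
    """Retrieval: rank passages by how many words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        passages,
        key=lambda passage: len(query_words & tokenize(passage)),
        reverse=True,
    )
    return ranked[:top_k]


def augment(query: str, context: list[str]) -> str:
    """Augmentation: insert the retrieved passages into the prompt."""
    context_text = "\n".join(f"- {item}" for item in context)
    return (
        "Answer the query using the context.\n"
        f"Context:\n{context_text}\n"
        f"Query: {query}"
    )


def generate(prompt: str) -> str:
    """Generation: hypothetical placeholder, a real LLM would be called here."""
    return f"[an LLM would answer here, given a prompt of {len(prompt)} characters]"


query = "What are the macronutrients and what do they do?"
context = retrieve(query, PASSAGES)
prompt = augment(query, context)
print(prompt)
print(generate(prompt))
```

The shape is the same in the full pipeline: only `retrieve` (embeddings plus similarity search) and `generate` (a local LLM) get replaced with real components.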


In [3]:
import torch

## Why RAG?

The main goal of RAG is to improve the generated outputs of LLMs.

1. Prevent hallucinations - LLMs are capable of generating _good looking_ text, but that doesn't mean it is factually correct. RAG can help LLMs generate answers based on relevant passages that are factual.

2. Work with custom data - Many base LLMs are trained on internet-scale data. This means they have a fairly good understanding of language in general. However, it also means their responses can be generic in nature. RAG helps generate responses based on specific data.


## What can RAG be used for?

- Customer support Q&A chat -- Treat your existing support docs as a resource. When a customer asks a question, a retrieval system can retrieve relevant documentation snippets and an LLM can then craft those snippets into an answer.

- Email chain analysis -- Let's say you're a large insurance company and you have chains and chains of emails about customer claims. You could use a RAG pipeline to find relevant information in those emails and then use an LLM to process it into structured data.

- Company internal documentation chat

- Textbook Q&A -- Let's say you're a student and you've got a 1200-page textbook to read. You could build a RAG pipeline to go through it and find passages relevant to the questions you have.

Common theme -- take the documents relevant to a query and process them with an LLM.

From this angle, you can consider an LLM as a calculator for words.


## Why Local?

Fun...

Privacy, speed and cost.

- Privacy -- If you have private documentation, maybe you don't want to send it to an API. You want to set up an LLM and run it on your own hardware.

- Speed -- Whenever you use an API, you have to send data across the internet, which takes time. Running locally means we don't have to wait for data transfer.

- Cost -- If you own your hardware, the cost is already paid: there is an initial cost but little or no operating cost.

- No vendor lock-in -- If an API shuts down, you don't have to worry.


In [4]:
print(torch.backends.mps.is_available())

True


## Key terms

| Term                                | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| ----------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Token**                           | A sub-word piece of text. For example, "hello, world!" could be split into ["hello", ",", "world", "!"]. A token can be a whole word,<br> part of a word or group of punctuation characters. 1 token ~= 4 characters in English, 100 tokens ~= 75 words.<br> Text gets broken into tokens before being passed to an LLM.                                                                                                                                                                                                                                                                                  |
| **Embedding**                       | A learned numerical representation of a piece of data. For example, a sentence of text could be represented by a vector with<br> 768 values. Similar pieces of text (in meaning) will ideally have similar values.                                                                                                                                                                                                                                                                                                                                                                                        |
| **Embedding model**                 | A model designed to accept input data and output a numerical representation. For example, a text embedding model may take in 384 <br>tokens of text and turn it into a vector of size 768. An embedding model can be, and often is, different from an LLM. |
| **Similarity search/vector search** | Similarity search/vector search aims to find two vectors which are close together in high-dimensional space. For example, <br>two pieces of similar text passed through an embedding model should have a high similarity score, whereas two pieces of text about<br> different topics will have a lower similarity score. Common similarity score measures are dot product and cosine similarity. |
| **Large Language Model (LLM)**      | A model which has been trained to numerically represent the patterns in text. A generative LLM will continue a sequence when given a sequence. <br>For example, given a sequence of the text "hello, world!", a generative LLM may produce "we're going to build a RAG pipeline today!".<br> This generation will be highly dependent on the training data and prompt. |
| **LLM context window**              | The number of tokens an LLM can accept as input. For example, as of March 2024, GPT-4 has a default context window of 32k tokens<br> (about 96 pages of text) but can go up to 128k if needed. A recent open-source LLM from Google, Gemma (March 2024) has a context<br> window of 8,192 tokens (about 24 pages of text). A higher context window means an LLM can accept more relevant information<br> to assist with a query. For example, in a RAG pipeline, if a model has a larger context window, it can accept more reference items<br> from the retrieval system to aid with its generation. |
| **Prompt**                          | A common term for describing the input to a generative LLM. The idea of "[prompt engineering](https://en.wikipedia.org/wiki/Prompt_engineering)" is to structure a text-based<br> (or potentially image-based as well) input to a generative LLM in a specific way so that the generated output is ideal. This technique is<br> possible because of an LLM's capacity for in-context learning, as in, it is able to use its representation of language to break down <br>the prompt and recognize what a suitable output may be (note: the output of LLMs is probabilistic, so terms like "may output" are used). |
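A few of the terms above are easier to grasp with numbers. The sketch below uses made-up 3-value vectors (real embeddings have hundreds of values, e.g. 768) to compare dot product with cosine similarity, and applies the rough "1 token ~= 4 characters" rule from the table. The function names are illustrative, not from any library.

```python
import math


def dot_product(a: list[float], b: list[float]) -> float:
    """Sum of element-wise products. Grows with vector magnitude."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of the two vectors after scaling each to length 1.

    Only the direction of the vectors matters, not their magnitude.
    """
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)


def estimate_tokens(text: str) -> float:
    """Rough rule of thumb for English text: 1 token ~= 4 characters."""
    return len(text) / 4


# Made-up "embeddings" for illustration only.
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.0]   # same direction as vec_a, twice the magnitude
vec_c = [3.0, -1.0, 0.0]  # points in a very different direction

print("dot(a, b):   ", dot_product(vec_a, vec_b))
print("cosine(a, b):", cosine_similarity(vec_a, vec_b))
print("cosine(a, c):", cosine_similarity(vec_a, vec_c))
print("estimated tokens:", estimate_tokens("hello, world!"))
```

Because cosine similarity ignores magnitude, `vec_a` and `vec_b` score as maximally similar even though one is twice as long. If embeddings are already normalized to length 1, dot product and cosine similarity give the same ranking.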


## What we're going to build

We're going to build a RAG pipeline which enables us to chat with a PDF document, specifically an open-source [nutrition textbook](https://pressbooks.oer.hawaii.edu/humannutrition2/), ~1200 pages long.

You could call our project NutriChat!

We'll write the code to:

1. Open a PDF document (you could use almost any PDF here).
2. Format the text of the PDF textbook ready for an embedding model (this process is known as text splitting/chunking).
3. Embed all of the chunks of text in the textbook and turn them into numerical representation which we can store for later.
4. Build a retrieval system that uses vector search to find relevant chunks of text based on a query.
5. Create a prompt that incorporates the retrieved pieces of text.
6. Generate an answer to a query based on passages from the textbook.

The above steps can be broken down into two major sections:

1. Document preprocessing/embedding creation (steps 1-3).
2. Search and answer (steps 4-6).

And that's the structure we'll follow.

It's similar to the workflow outlined on the NVIDIA blog which [details a local RAG pipeline](https://developer.nvidia.com/blog/rag-101-demystifying-retrieval-augmented-generation-pipelines/).

<img src="https://github.com/mrdbourke/simple-local-rag/blob/main/images/simple-local-rag-workflow-flowchart.png?raw=true" alt="flowchart of a local RAG workflow" />


## 1. Document/Text Processing and Embedding Creation

Ingredients:

- PDF document of choice.
- Embedding model of choice.

Steps:

1. Import PDF document.
2. Process text for embedding (e.g. split into chunks of sentences).
3. Embed text chunks with embedding model.
4. Save embeddings to file for later use (embeddings will stay on file for many years or until you lose your hard drive).
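As a rough illustration of step 2, the sketch below splits text into sentences and groups them into chunks. It makes simplifying assumptions: the regex splitter is naive (it will trip on abbreviations like "e.g."), the sample text is made up, and a chunk size of 2 sentences is chosen only to keep the example small.

```python
import re


def split_into_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [sentence for sentence in sentences if sentence]


def chunk_sentences(sentences: list[str], chunk_size: int) -> list[str]:
    """Join consecutive sentences into chunks of at most `chunk_size` sentences."""
    return [
        " ".join(sentences[start:start + chunk_size])
        for start in range(0, len(sentences), chunk_size)
    ]


# Made-up sample text standing in for one page of the PDF.
page_text = (
    "Proteins are made of amino acids. Lipids store energy. "
    "Carbohydrates are sugars and starches. Water is essential. "
    "Vitamins are micronutrients."
)

sentences = split_into_sentences(page_text)
chunks = chunk_sentences(sentences, chunk_size=2)

for index, chunk in enumerate(chunks):
    print(index, chunk)
```

Each chunk is what would later be passed to the embedding model, so the chunk size needs to fit within the number of tokens that model accepts.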


### Import PDF Document

This will work with many other kinds of documents.

However, we'll start with PDF since many people have PDFs.

But just keep in mind, text files, email chains, support documentation, articles and more can also work.

We're going to pretend we're nutrition students at the University of Hawai'i, reading through the open-source PDF textbook [_Human Nutrition: 2020 Edition_](https://pressbooks.oer.hawaii.edu/humannutrition2/).

There are several libraries to open PDFs with Python but I found that [PyMuPDF](https://github.com/pymupdf/pymupdf) works quite well in many cases.

First we'll download the PDF if it doesn't exist.
