<a href="https://colab.research.google.com/github/rajkstats/uplimit_nlp/blob/main/rag_rk.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

> DUPLICATE THIS COLAB TO START WORKING ON IT. Using File > Save a copy to drive.

# Retrieval Augmented Generation
Welcome to the lab on Retrieval Augmented Generation (RAG)!  This is a light introduction to RAG, a
complex and incredibly useful application of Large Language Models.  In this lab, we will download a
corpus of text, ingest the text as chunks into a searchable vector database, then perform a basic
RAG question/answer inference.

In this lab, you'll learn:
- How to load a vector database from PDF Wikipedia articles.
- How to load text chunks as embeddings into a vector database (FAISS).
- How to perform RAG inference using a local LLM.
- (OPTIONAL) How to perform RAG inference using the OpenAI Chat Completion API.

Let's get started!

### Imports
As usual, we will install and import all the required libraries for this lab.

In [1]:
!pip install faiss-cpu transformers sentence-transformers bs4 pypdf ipywidgets wikipedia-api fpdf openai einops --quiet


In [2]:
import faiss
from transformers import AutoModelForCausalLM, AutoTokenizer
from sentence_transformers import SentenceTransformer
import requests
from bs4 import BeautifulSoup
import os
from tqdm import tqdm
from pathlib import Path
from pypdf import PdfReader
from collections import namedtuple
import numpy as np
import ipywidgets as widgets
from IPython.display import display
import openai
import wikipediaapi
from fpdf import FPDF
import re
import torch

torch.set_default_device('cuda')

### PDF Data Download (for reference only)
The following code is provided for you as a reference.  The two functions work together to (1) scrape a URL for PDF
links and then (2) download those PDFs to a local folder.

Note that we are not using these functions in the lab, but they are provided because you may find
them to be useful if you are trying to build your own RAG pipeline from an online corpus of PDFs for
work!

In [3]:
def find_pdf_links(url):
    """
    Finds and returns all the PDF links present in a webpage.

    Args:
    url (str): The URL of the webpage to scan for PDF links.

    Returns:
    list: A list of URLs (str) that are linked to PDF files.
    """
    # Send a GET request to the specified URL
    response = requests.get(url)

    # Check if the request was successful
    if response.status_code != 200:
        raise Exception(f"Failed to load page: {url}")

    # Parse the content with BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find all anchor tags, then filter out those with href ending in '.pdf'
    pdf_links = [a['href'] for a in soup.find_all('a', href=True) if a['href'].endswith('.pdf')]

    return pdf_links

def download_to_file(pdf_url, filepath):
    """
    Downloads a PDF from the given URL, saving it into the indicated directory.

    Args:
    pdf_url (str): The URL from where to download the PDF.
    filepath (str): The directory to save the PDF into; the filename is taken from the URL.

    Returns:
    int: Size in bytes of the downloaded PDF (0 if the download failed).
    """
    with requests.get(pdf_url, stream=True) as pdf_response:
        if pdf_response.status_code != 200:
            print(f"Failed to download PDF: {pdf_url}")
            return 0

        # Create a file to store the PDF
        with open(Path(filepath) / os.path.basename(pdf_url), 'wb') as f:
            for chunk in pdf_response.iter_content(chunk_size=8192):
                if chunk:  # filter out keep-alive new chunks
                    f.write(chunk)

            # Update total size
            total_size = f.tell()

    return total_size

#########
# USAGE #
#########
# url = 'https://SCRAPE_URL.example.com'
# pdf_links = find_pdf_links(url)
# total_size = 0
# download_dir = './pdf_dataset'
# if not os.path.exists(download_dir):
#     os.makedirs(download_dir)
# for pdf_link in tqdm(pdf_links):
#     total_size += download_to_file(pdf_link, download_dir)

# print(f'Total size of downloaded PDFs: {total_size / 1e6:.2f} MB')
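
`find_pdf_links` returns each `href` exactly as it appears in the page, so a relative link such as `/files/a.pdf` would not be downloadable as-is. The following is a minimal sketch of one way to handle that, assuming the standard library's `urllib.parse.urljoin`. The helper name `absolutize_links` and the example URLs are hypothetical and not part of the lab.

```python
from urllib.parse import urljoin

def absolutize_links(page_url, hrefs):
    """Resolve each href against the page URL so relative links become absolute."""
    return [urljoin(page_url, href) for href in hrefs]

# Hypothetical page URL and hrefs, for illustration only.
links = absolutize_links(
    "https://example.com/reports/index.html",
    ["/files/a.pdf", "b.pdf", "https://cdn.example.com/c.pdf"],
)
print(links)
```

Links that are already absolute pass through unchanged, so the helper could be applied to the whole list before downloading.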


### Wikipedia Data Download
In this lab, we will download a few articles from Wikipedia to use as reference material in our RAG
pipeline.  The following functions are provided for you. They download a set of Wikipedia articles
and save them as PDFs.  Note that converting to PDF is not strictly necessary here, but we think you might
find it useful, since PDF is a common format for professional reference documents (research
articles, business policies, the law, etc.).

In [26]:
wiki_wiki = wikipediaapi.Wikipedia('Uplimit Week 3 Project', 'en')

articles = ['PageRank', 'Knowledge graph']

def get_wikipedia_article(article_title):
    wiki_page = wiki_wiki.page(article_title)
    if not wiki_page.exists():
        raise Exception(f"Page {article_title} does not exist.")
    return wiki_page.text

def article_to_pdf(article_content, to_filename):
    latin_characters_only = ''.join(re.findall(r'[A-Za-z0-9\s.,!?]', article_content))
    pdf = FPDF()
    pdf.add_page()
    pdf.set_font("helvetica", size=12)
    pdf.multi_cell(w=190, h=10, txt=latin_characters_only)
    pdf.output(to_filename)

download_dir = './pdf_dataset'
if not os.path.exists(download_dir):
    os.makedirs(download_dir)

for article_title in tqdm(articles, 'Retrieving Wiki articles'):
    # Retrieve the wikipedia article
    article_content = get_wikipedia_article(article_title)
    # Convert the article to a PDF and save it locally
    article_to_pdf(article_content, download_dir + "/" + article_title + '.pdf')

Retrieving Wiki articles: 100%|██████████| 2/2 [00:00<00:00,  2.67it/s]
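
The regular expression in `article_to_pdf` keeps only letters, digits, whitespace, and `.,!?`, so hyphens, apostrophes, and parentheses are dropped from the articles (which is why the extracted text later in the lab contains words like "graphstructured"). A gentler filter might look like the sketch below. It assumes that the built-in `helvetica` font only needs the text to be representable in Latin-1. The helper name `to_latin1` is hypothetical, and swapping it in would change the extracted text, so the lab's built-in checks may no longer pass.

```python
def to_latin1(text, placeholder="?"):
    """Keep every character in the Latin-1 range; swap anything else for a placeholder."""
    return "".join(ch if ord(ch) < 256 else placeholder for ch in text)

# Invented sample: the curly apostrophe and the Greek letter fall outside Latin-1.
sample = "A graph-structured model (e.g. PageRank\u2019s \u03b1) isn't lost."
cleaned = to_latin1(sample)
print(cleaned)
```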


### Vector Embedding Model
Next, we'll create a text embedding model to process the PDF chunks.  We have chosen an open source model that
performs competitively on the Massive Text Embedding Benchmark (MTEB).  It does a decent job, but
feel free to try out another!

[MTEB LEADERBOARD LINK](https://huggingface.co/spaces/mteb/leaderboard)

In [27]:
embed_model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
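
One detail worth knowing if you experiment further: the model card for `nomic-embed-text-v1` describes task prefixes, such as `search_document: ` for text being indexed and `search_query: ` for questions. The lab code does not use them. The sketch below shows how they could be attached. The helper names are hypothetical, and adding prefixes would change the embeddings, so the lab's built-in checks may no longer pass.

```python
# Prefix strings as described on the model card (an assumption worth verifying there).
DOCUMENT_PREFIX = "search_document: "
QUERY_PREFIX = "search_query: "

def as_document(text):
    """Mark text as a passage to be stored in the index."""
    return DOCUMENT_PREFIX + text

def as_query(text):
    """Mark text as a user question to search with."""
    return QUERY_PREFIX + text

print(as_document("PageRank is a link analysis algorithm."))
print(as_query("What is PageRank?"))
```

In the lab these would wrap the strings passed to `embed_model.encode`, for example `embed_model.encode(as_query(user_query))`.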



### Calculate Chunk Embeddings
Now it's your turn!  You will complete the code below to calculate the text embeddings of each
text chunk.  Don't forget to save the PDF metadata!

In [28]:
# Create a record template to save the chunk metadata
# In a production system, this would be a database record.  Here, we'll just store it in local memory.
ChunkMetadata = namedtuple('ChunkMetadata', ['filename', 'location', 'text', 'embedding'])
chunk_metadata = []
for pdf in tqdm(os.listdir(download_dir), 'Chunking PDFs.'):
    reader = PdfReader(download_dir + '/' + pdf)
    for pageNum, page in enumerate(reader.pages):
        text = page.extract_text()
        # 1. Calculate the text embedding of the chunk
        embedding = embed_model.encode(text)
        # 2. Store the embedding and other metadata in the chunk_metadata list
        #    HINT:  A record can be created like this (use pageNum for location):
        #           ChunkMetadata('name.pdf', 2, 'the text...', [the embedding])
        chunk_metadata.append(ChunkMetadata(pdf, pageNum, text, embedding))

Chunking PDFs.: 100%|██████████| 3/3 [00:04<00:00,  1.45s/it]


Alright, now let's inspect the metadata for one of our chunks!  (We've added a little check here to
make sure you've calculated the chunks correctly.)

In [29]:
print(f"Extracted text: {chunk_metadata[2].text}")

Extracted text: General structure A network of entities, their semantic types, properties, and relationships. To
represent properties, categorical or numerical values are often used.
Supporting reasoning over inferred ontologies A knowledge graph acquires and integrates
information into an ontology and applies a reasoner to derive new knowledge.
There are, however, many knowledge graph representations for which some of these features are
not relevant.  For those knowledge graphs, this simpler definition may be more useful
A digital structure that represents knowledge as concepts and the relationships between them facts.
A knowledge graph can include an ontology that allows both humans and machines to understand
and reason about its contents.
Implementations
In addition to the above examples, the term has been used to describe open knowledge projects
such as YAGO and Wikidata federations like the Linked Open Data cloud a range of commercial
search tools, including Yahoos semantic search

In [31]:
assert chunk_metadata[2].text.startswith("General structure"), "Test Failed! Make sure you're loading the chunk text correctly!"
assert np.all(np.isclose(chunk_metadata[2].embedding[:2], [-1.09604634e-02,  3.91411260e-02], atol=0.1)), "Test Failed! Make sure you're loading the chunk embedding correctly!"
print("Looks like you've loaded the PDFs correctly! Let's move on to inference!")

Looks like you've loaded the PDFs correctly! Let's move on to inference!
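
The loop above treats each PDF page as one chunk. For longer pages, a common alternative is to split the text into fixed-size pieces that overlap slightly, so a sentence cut at a boundary still appears whole in one piece. Here is a minimal character-based sketch. The function name and the default sizes are assumptions for illustration, not something the lab uses.

```python
def split_into_chunks(text, chunk_size=500, overlap=50):
    """Split text into pieces of at most chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current piece reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks

pieces = split_into_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap=2)
print(pieces)
```

If you chunk this way, the `location` field would need to record both the page number and the position of the piece within the page.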


### Load Vector DB
Now, as in prior labs, we are going to load the text embeddings into a FAISS vector database.

In [32]:
# 1. Get all the chunk embeddings
embeddings = [chunk.embedding for chunk in chunk_metadata]
# 2. Convert the embeddings into a numpy array for loading into FAISS
embeddings = np.array(embeddings)
# 3. Normalize the embeddings so they are easily comparable
faiss.normalize_L2(embeddings)
# 4. Create the FAISS index
vector_db = faiss.IndexFlatL2(embeddings.shape[1])
# 5. Add the embeddings to the FAISS index
vector_db.add(embeddings)


# IMPORTANT:  It is very important that the order of the embeddings in the FAISS index is the same
# as the order in our metadata list.  This way, we can correlate the FAISS search results with our
# chunk metadata!
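
Step 3 above normalizes the embeddings before they go into an `IndexFlatL2`. The reason this works well is that, for unit-length vectors, squared L2 distance and cosine similarity carry the same information: `||a - b||^2 = 2 - 2 * (a . b)`. Ranking by smallest L2 distance is therefore the same as ranking by largest cosine similarity. The sketch below checks the identity on two toy vectors in plain Python, so it runs without FAISS or NumPy.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dot(a, b):
    """Dot product, which equals cosine similarity for unit vectors."""
    return sum(x * y for x, y in zip(a, b))

a = normalize([3.0, 4.0])
b = normalize([1.0, 0.0])
lhs = squared_l2(a, b)
rhs = 2 - 2 * dot(a, b)
print(lhs, rhs)
```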

### Inference Pipeline
Alright, we're all done with preprocessing our PDFs!  It was pretty straightforward, although in
production things can become much more complex when processing many GBs of diverse content.  Let's
move on to the inference pipeline!

The goal of the inference pipeline is to answer user queries based on the reference material we
ingested earlier.  Here are the steps:
1. Search the FAISS index based on the user query.
2. Retrieve the chunk metadata based on the search results.
3. Generate a response to the user query based on the chunk metadata.
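
The three steps above can be sketched on toy data before bringing in FAISS and a real model. Everything in this example is invented for illustration: the `Record` tuples, the hand-made two-dimensional "embeddings", and the stand-in for the language model, which only reports how much text it received.

```python
from collections import namedtuple

# Toy records with hand-made 2-D "embeddings" (invented for illustration).
Record = namedtuple("Record", ["filename", "location", "text", "embedding"])
records = [
    Record("a.pdf", 0, "PageRank ranks web pages.", [1.0, 0.0]),
    Record("a.pdf", 1, "Links act as votes.", [0.0, 1.0]),
    Record("b.pdf", 0, "Knowledge graphs link entities.", [0.7, 0.7]),
]

def search(query_vec, records, k):
    """Step 1: return the positions of the k nearest records (squared L2 distance)."""
    def sq_dist(vec):
        return sum((q - v) ** 2 for q, v in zip(query_vec, vec))
    ranked = sorted(range(len(records)), key=lambda i: sq_dist(records[i].embedding))
    return ranked[:k]

def retrieve(indices, records):
    """Step 2: look up the chunk text by position and join it into one string."""
    return "\n".join(records[i].text for i in indices)

def answer(user_query, reference, generate_llm):
    """Step 3: pass the query and reference text to any text-in, text-out callable."""
    return generate_llm(f"User query: {user_query}\n\nReference material: {reference}")

indices = search([0.9, 0.1], records, k=2)
reference = retrieve(indices, records)
# Stand-in for a real model: it only reports the prompt length.
reply = answer("What is PageRank?", reference, lambda prompt: f"[{len(prompt)} characters received]")
print(indices, reference, reply, sep="\n")
```

Step 2 only works because the positions returned by the search line up with the positions in the record list, which is the same ordering requirement noted for the FAISS index.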

Let's get started!

First, let's load up a local language model.  We'll be using Phi-2, a language model developed and released
by Microsoft.  Considering its small size (2.7B parameters), it's quite capable!  Note here that Phi-2 is a
*decoder* model.  That is, it repeatedly predicts the next token until it predicts an EOS (end of
sequence) token.  The Phi-2 model card recommends a specific prompt style for this kind of interaction,
which is called "instruct":  We instruct the model to answer our question.  (The other kind of
interaction is called "chat" and it isn't totally appropriate here.)

NOTE:  You must be running on a GPU (we recommend the free T4 on Colab) for this model to perform in a reasonable timeframe.
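
As a rough illustration of the "instruct" style, the Phi-2 model card describes prompts of the form `Instruct: <prompt>` followed by `Output:` on the next line. The helper below is a hypothetical sketch of that shape, not code from the model card. The lab's own `generate_rag_response` builds a prompt along the same lines, with the user query and reference material placed in between.

```python
def format_instruct_prompt(prompt):
    """Wrap a request in the Instruct/Output layout; the model continues after 'Output:'."""
    return f"Instruct: {prompt}\nOutput:"

example = format_instruct_prompt("Summarize PageRank in one sentence.")
print(example)
```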

In [33]:
# Load phi-2 model and tokenizer
phi2_model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype="auto", trust_remote_code=True)
phi2_tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
phi2_tokenizer.pad_token = phi2_tokenizer.eos_token

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [34]:
def generate_local_llm(text_input):
    # Tokenize the input text without returning the attention mask
    inputs = phi2_tokenizer(text_input, return_tensors="pt", return_attention_mask=False)
    # Calculate the total number of input tokens
    num_input_tokens = inputs['input_ids'].size(1)
    # Generate text with the total length of input tokens + 200
    outputs = phi2_model.generate(**inputs, max_length=num_input_tokens + 200, pad_token_id=phi2_tokenizer.eos_token_id)
    # Decode and return the generated text
    return phi2_tokenizer.batch_decode(outputs[:,num_input_tokens:])[0]

def generate_rag_response(user_query, vector_db, chunk_metadata, embed_model, generate_llm, num_chunks=1):

    # 1. Convert the user's query into a text embedding
    user_query_embed = embed_model.encode(user_query)
    # 2. Reshape the user's query for FAISS
    user_query_embed = np.array(user_query_embed).reshape(1, -1)
    # 3. Normalize the user's query embedding
    faiss.normalize_L2(user_query_embed)
    # 4. Search the FAISS index for chunks similar to the user's query (ignore the distances- they're already in order)
    _, indices = vector_db.search(user_query_embed, num_chunks)
    # 5. Collect the text content of the reference material into a single string
    reference_material = '\n'.join([chunk_metadata[chunk_index].text for chunk_index in indices[0]])
    print(reference_material)
    # 6. Create the LLM input using the user query and reference material
    instruction = "Instruct: Using only the reference material, answer the user's question."
    llm_input = (f"{instruction}\n\n"
                 f"User query: {user_query}\n\n"
                 f"Reference material: {reference_material}\n\n"
                  "Output: ")
    # 7. Generate an LLM response to the instruction
    response = generate_llm(llm_input)

    return response
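
Because `generate_local_llm` decodes the output without skipping special tokens, the returned string may end with the tokenizer's end-of-sequence marker, which for Phi-2 is `<|endoftext|>`. A small post-processing step can tidy that up. The helper name below is hypothetical. An alternative with the same effect is passing `skip_special_tokens=True` to `batch_decode`.

```python
EOS_MARKER = "<|endoftext|>"

def clean_response(raw, eos_marker=EOS_MARKER):
    """Keep only the text before the first end-of-sequence marker and trim whitespace."""
    return raw.split(eos_marker, 1)[0].strip()

# Invented model output, for illustration only.
cleaned_reply = clean_response(" PageRank ranks web pages by their links.\n<|endoftext|>")
print(cleaned_reply)
```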

### Try it out!
Alright, we're done with the inference side of RAG!  Let's use a widget to query our pipeline!

In [35]:
query_widget = widgets.Text(
    placeholder='Ask a question!',
    description='Query:',
)
display(query_widget)


In [38]:
# Try out the following prompt!
# "When was Python conceived?"
if query_widget.value == '':
    print('Please enter a query into the box above.')
else:
    print(query_widget.value)
    print()
    rag_response = generate_rag_response(
        user_query = query_widget.value,
        vector_db = vector_db,
        chunk_metadata = chunk_metadata,
        embed_model = embed_model,
        generate_llm = generate_local_llm)
    print(rag_response)

When was Python conceived?

Python is a highlevel, generalpurpose programming language. Its design philosophy emphasizes
code readability with the use of significant indentation.
Python is dynamically typed and garbagecollected. It supports multiple programming paradigms,
including structured particularly procedural, objectoriented and functional programming. It is often
described as a batteries included language due to its comprehensive standard library.
Guido van Rossum began working on Python in the late 1980s as a successor to the ABC
programming language and first released it in 1991 as Python 0.9.0. Python 2.0 was released in
2000. Python 3.0, released in 2008, was a major revision not completely backwardcompatible with
earlier versions. Python 2.7.18, released in 2020, was the last release of Python 2.
Python consistently ranks as one of the most popular programming languages, and has gained
widespread use in the machine learning community.
History
Python was invented in the lat

In [39]:
chunk_metadata

[ChunkMetadata(filename='Knowledge graph.pdf', location=0, text='In knowledge representation and reasoning, a knowledge graph is a knowledge base that uses a\ngraphstructured data model or topology to represent and operate on data. Knowledge graphs are\noften used to store interlinked descriptions of entities   objects, events, situations or abstract\nconcepts   while also encoding the freeform semantics or relationships underlying these entities.\nSince the development of the Semantic Web, knowledge graphs have often been associated with\nlinked open data projects, focusing on the connections between concepts and entities. They are\nalso historically associated with and used by search engines such as Google, Bing, Yext and Yahoo\nknowledgeengines and questionanswering services such as WolframAlpha, Apples Siri, and\nAmazon Alexa and social networks such as LinkedIn and Facebook.\nRecent developments in data science and machine learning, particularly in graph neural networks\nand repre
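
Displaying the whole list prints every chunk's full text and embedding, which is hard to read. A compact summary is often enough for a sanity check. This is a minimal sketch: the helper name `summarize_chunks` is hypothetical and the sample record is made up.

```python
from collections import namedtuple

# Same record layout as in the lab, redefined here so the sketch runs on its own.
ChunkMetadata = namedtuple("ChunkMetadata", ["filename", "location", "text", "embedding"])

def summarize_chunks(records, preview_chars=30):
    """Return one short line per chunk: file, page, text length, and a text preview."""
    lines = []
    for record in records:
        # Collapse newlines and repeated spaces so the preview stays on one line.
        preview = " ".join(record.text.split())[:preview_chars]
        lines.append(f"{record.filename} p{record.location} ({len(record.text)} chars): {preview}")
    return lines

sample = [ChunkMetadata("Knowledge graph.pdf", 0, "In knowledge representation\nand reasoning", [0.1, 0.2])]
summary = summarize_chunks(sample)
print("\n".join(summary))
```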

### (OPTIONAL) Generate with OpenAI API
The next part of the lab is completely optional.  It requires an OpenAI API token.

We will be replacing our mighty, but very small, Phi-2 model with the very sophisticated GPT-4 model
from OpenAI.  You should expect more concise, accurate responses from this model.

Please note that OpenAI does not offer an "instruct" API and their "completions" API has been deprecated.  The only
API for accessing GPT-4 is the Chat Completions API, which we will be using.  To get around this
restriction, we'll simply frame our instruct query as a user chat message.  The model will take care
of the rest!

In [40]:
from google.colab import userdata
oai_client = openai.Client(api_key=userdata.get('OPENAI_API_KEY'))

def generate_openai_llm(text_input):
    response = oai_client.chat.completions.create(
        model="gpt-4-turbo-preview",
        messages=[
            {"role": "user", "content": text_input}
        ]
    )
    return response.choices[0].message.content
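
The function above sends the entire prompt as a single user message. The Chat Completions message format also accepts a `system` role, so one possible variation is to put the instruction there and keep the query and reference material in the user message. The sketch below only builds the message list and makes no API call. The helper name `build_chat_messages` is hypothetical.

```python
def build_chat_messages(instruction, user_query, reference_material):
    """Split the prompt into a system instruction and a user message."""
    user_content = (f"User query: {user_query}\n\n"
                    f"Reference material: {reference_material}")
    return [
        {"role": "system", "content": instruction},
        {"role": "user", "content": user_content},
    ]

messages = build_chat_messages(
    "Using only the reference material, answer the user's question.",
    "What is PageRank?",
    "PageRank is a link analysis algorithm.",
)
print(messages)
```

The resulting list could be passed as the `messages` argument of `oai_client.chat.completions.create`.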

In [41]:
oai_query_widget = widgets.Text(
    placeholder='Ask a question!',
    description='Query:',
)
display(oai_query_widget)


In [42]:
print(oai_query_widget.value)
print()
oai_rag_response = generate_rag_response(
    user_query = oai_query_widget.value,
    vector_db = vector_db,
    chunk_metadata = chunk_metadata,
    embed_model = embed_model,
    generate_llm = generate_openai_llm)
print(oai_rag_response)

When was Python conceived?

Python is a highlevel, generalpurpose programming language. Its design philosophy emphasizes
code readability with the use of significant indentation.
Python is dynamically typed and garbagecollected. It supports multiple programming paradigms,
including structured particularly procedural, objectoriented and functional programming. It is often
described as a batteries included language due to its comprehensive standard library.
Guido van Rossum began working on Python in the late 1980s as a successor to the ABC
programming language and first released it in 1991 as Python 0.9.0. Python 2.0 was released in
2000. Python 3.0, released in 2008, was a major revision not completely backwardcompatible with
earlier versions. Python 2.7.18, released in 2020, was the last release of Python 2.
Python consistently ranks as one of the most popular programming languages, and has gained
widespread use in the machine learning community.
History
Python was invented in the lat

# All Done!
Thanks for going on this intro to RAG journey with us!  We hope you learned a lot about how decoder
and text embedding models can be used to answer complex questions!  One thing that we want to
emphasize before leaving is the power of the RAG technique:

Phi-2 is far smaller than GPT-4, but it was still able to provide satisfactory responses to our
queries. This is incredible!  It shows that this technique is scalable both UP and DOWN.  For
example, GPT-4 cannot currently run on a mobile or edge device, but a model the size of Phi-2
plausibly could!  Think about the possibilities!