<a href="https://colab.research.google.com/github/sheldonkemper/portfolio/blob/main/CAM_DS_Retrieval_augumented_generation(RAG)_2_2_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**First things first** - please go to 'File' and select 'Save a copy in Drive' so that you have your own version of this activity set up and ready to use.
Remember to update the portfolio index link to your own work once completed!

# Demonstration 2.2.1 Retrieval augmented generation
In this demonstration, you will use the LangChain package to perform RAG with LLMs, and learn how to:
- Load, preprocess, and split documents from multiple sources and different formats into manageable chunks.
- Combine document retrieval and LLM generation for context-aware answers.
- Generate embeddings and perform searches to retrieve relevant documents.
- Implement a full RAG pipeline combining a retriever and LLM.
- Use OpenAI models in addition to Hugging Face models.



**Important**: This demonstration uses closed-source models from OpenAI that require an API key. If you do not already have an account, register for one at the OpenAI developer platform. API keys are for personal use only and are subject to OpenAI’s rate limits. At the time of writing, OpenAI offered enough free usage to complete this activity, but a recent change means that you must add a small credit to your account before your requests will work. You will be reimbursed for this credit.

#### Get your OpenAI key

1. Log in at [OpenAI developer platform](https://platform.openai.com/api-keys).
2. Create a new secret key.
3. Copy and paste the key into a document for safe-keeping.
4. Paste the key into the code cell below, where it says 'REPLACE WITH YOUR OPENAI KEY'.

Note that each time you run a code cell that calls an OpenAI model, it sends a request to the OpenAI API. There are [OpenAI rate limits](https://platform.openai.com/docs/guides/rate-limits/usage-tiers); for example, for gpt-3.5-turbo, requests are limited to 3 per minute or 200 per day. Requests that exceed these limits will fail.
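One common way to cope with rate limits is to retry a failed request after a pause that doubles each time. The sketch below is a minimal illustration of that idea, not part of the original demonstration: the helper name `call_with_backoff` and its parameters are hypothetical, and it catches every exception for simplicity, whereas in practice you would catch only the rate-limit error raised by the client library you use.

```python
import time


def call_with_backoff(fn, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying with exponentially growing pauses if it raises.

    Hypothetical helper for illustration; `sleep` is injectable so the
    waiting behaviour can be checked without actually pausing.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            # Give up once the retry budget is used; otherwise wait and retry.
            if attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)  # 1 s, 2 s, 4 s, ...
```

You could wrap any model call in it, for example `call_with_backoff(lambda: rag_chain.invoke(question))`, once the chain has been defined.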

In [None]:

!pip install -q torch transformers accelerate bitsandbytes sentence-transformers faiss-gpu
!pip install datasets
!pip install langchain-community
!pip install rapidocr-onnxruntime
!pip install pypdf
!pip install langchain
!pip install langchain-chroma
!pip install openai

## Load the documents

### Use the CSV loader

In [None]:
from datasets import load_dataset

dataset = load_dataset("rajpurkar/squad")


In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 87599
    })
    validation: Dataset({
        features: ['id', 'title', 'context', 'question', 'answers'],
        num_rows: 10570
    })
})

In [None]:
dataset['train']['context'][0]

'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.'

In [None]:
dataset['train']['question'][0]

'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?'

In [None]:
dataset['train']['answers'][0]

{'text': ['Saint Bernadette Soubirous'], 'answer_start': [515]}

In [None]:
dataset['train']['answers'][0]['text']

['Saint Bernadette Soubirous']

In [None]:
texts = dataset['train']['context']
questions = dataset['train']['question']
answers = dataset['train']['answers']

In [None]:
answers = [ans['text'] for ans in answers]

In [None]:
import pandas as pd
df = pd.DataFrame()
df['context'] = texts
df['question'] = questions
df['answer'] = answers

In [None]:
df = df.sample(n=10)

In [None]:
df

| | context | question | answer |
|---|---|---|---|
| 45257 | After fans noticed Mercury's increasingly gaun... | What year was Freddie Mercury's final public a... | [1990] |
| 2404 | China Daily, a CCP-controlled news organizatio... | According to article Tibet has remained under ... | [the central government of China] |
| 26482 | The Houston Theater District, located downtown... | To what type of arts is Houston home? | [major performing arts] |
| 59587 | Detroit (/dᵻˈtrɔɪt/) is the most populous city... | How many people inhabit metro Detroit? | [5.3 million] |
| 7534 | Since the show's inception in 2002, ten of the... | As of 2012, how many finalists did American Id... | [131] |
| 25207 | According to the United States Census Bureau, ... | What is the area of water? | [6.290 square miles] |
| 32376 | The solution was automation, in the form of a ... | How did the Predictor display the information ... | [as a pointer mounted on the gun] |
| 60380 | In the UK and Ireland, "exhibition match" and ... | What are 'friendlies' to honor a player usuall... | [testimonial matches] |
| 29112 | The Lancashire economy relies strongly on the ... | Which direction does the M6 motorway run? | [north to south] |
| 79004 | The southern portion of the Point Loma peninsu... | What was the original name of today's Marine C... | [Camp Kearny] |


In [None]:
df.to_csv('squad.csv')

In [None]:
from langchain_community.document_loaders.csv_loader import CSVLoader


loader = CSVLoader(file_path='/content/squad.csv')
data = loader.load()
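The loader turns each CSV row into one document whose text lists every column as `name: value`. The sketch below imitates that rendering with the standard library only, so you can see where a leading `: 45257` line would come from. It is a simplified stand-in, not the loader's actual code; the function name `rows_to_page_content` and the sample text are made up for illustration.

```python
import csv
import io


def rows_to_page_content(csv_text):
    """Render each CSV row as newline-separated 'column: value' lines."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return ["\n".join(f"{col}: {val}" for col, val in row.items()) for row in reader]


# Tiny stand-in for squad.csv. The empty first header is the unnamed index
# column that DataFrame.to_csv() writes by default.
sample = ",context,question\n45257,After fans noticed...,What year...?\n"
pages = rows_to_page_content(sample)
print(pages[0])
```

The first line of the printed text has an empty column name because the DataFrame index was saved without a header. Calling `df.to_csv('squad.csv', index=False)` would leave that column out of the file.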

In [None]:
data

[Document(metadata={'source': '/content/squad.csv', 'row': 0}, page_content=': 45257\ncontext: After fans noticed Mercury\'s increasingly gaunt appearance in 1988, rumours began to spread that Mercury was suffering from AIDS. Mercury flatly denied this, insisting he was merely "exhausted" and too busy to provide interviews. The band decided to continue making albums, starting with The Miracle in 1989 and continuing with Innuendo in 1991. Despite his deteriorating health, the lead singer continued to contribute. For the last two albums made while Mercury was still alive, the band credited all songs to Queen, rather than specific members of the group, freeing them of internal conflict and differences. In 1990, Queen ended their contract with Capitol and signed with Disney\'s Hollywood Records, which has since remained the group\'s music catalogue owner in the United States and Canada. That same year, Mercury made his final public appearance when he joined the rest of Queen to collect the

### Load URLs and web pages

In [None]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://fourthrev.com/blog-announcing-data-science-career-accelerator/")




In [None]:
docs = loader.load()

In [None]:
docs

[Document(metadata={'source': 'https://fourthrev.com/blog-announcing-data-science-career-accelerator/', 'title': 'Announcing: New Data Science Career Accelerator | FourthRev', 'description': "We're launching our new Data Science Career Accelerator in collaboration with the University of Cambridge Institute of Continuing Education. Read more here.", 'language': 'en-US'}, page_content='\n\n\n\n\n\n\nAnnouncing: New Data Science Career Accelerator | FourthRev\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSkip to content\n\n\n\n\n\n\n\n\n \n\n\n\n\n\n\n\n\n\nCareer Accelerators\n\nProduct Management\nUX & UI Product Design\nDigital Marketing\nData Analytics\nData Science\n\n\nWhy FourthRev\n\nAbout Us\nHow it works\nCareers\n\n\nPartners\nResources\n\nAll resources\nBlog\n\n\n \n\nCareer Accelerators\n\nProduct Management\nUX & UI Product Design\nDigital Marketing\nData Analytics\nData Science\

### Load the PDF files

In [None]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("https://arxiv.org/pdf/1706.03762")


In [None]:
docs = loader.load()


In [None]:
docs[4].page_content

'output values. These are concatenated and once again projected, resulting in the final values, as\ndepicted in Figure 2.\nMulti-head attention allows the model to jointly attend to information from different representation\nsubspaces at different positions. With a single attention head, averaging inhibits this.\nMultiHead( Q, K, V ) = Concat(head 1, ...,head h)WO\nwhere head i= Attention( QWQ\ni, KWK\ni, V WV\ni)\nWhere the projections are parameter matrices WQ\ni∈Rdmodel×dk,WK\ni∈Rdmodel×dk,WV\ni∈Rdmodel×dv\nandWO∈Rhdv×dmodel.\nIn this work we employ h= 8 parallel attention layers, or heads. For each of these we use\ndk=dv=dmodel/h= 64 . Due to the reduced dimension of each head, the total computational cost\nis similar to that of single-head attention with full dimensionality.\n3.2.3 Applications of Attention in our Model\nThe Transformer uses multi-head attention in three different ways:\n•In "encoder-decoder attention" layers, the queries come from the previous decoder layer,\nand

## Split documents

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=30)

chunked_docs = splitter.split_documents(docs)
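To build intuition for `chunk_size` and `chunk_overlap`, here is a minimal sketch of a fixed-width sliding window over characters. It is a deliberate simplification: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs, lines, and spaces, so its chunks will not match this exactly. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which is what the overlap buys you: a sentence that straddles a boundary still appears whole in at least one chunk more often than it would with no overlap.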

In [None]:
chunked_docs

[Document(metadata={'source': 'https://arxiv.org/pdf/1706.03762', 'page': 0}, page_content='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.comNoam Shazeer∗\nGoogle Brain\nnoam@google.comNiki Parmar∗\nGoogle Research\nnikip@google.comJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.comAidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.eduŁukasz Kaiser∗'),
 Document(metadata={'source': 'https://arxiv.org/pdf/1706.03762', 'page': 0}, page_content='Google Brain\nlukaszkaiser@google.com\nIllia Polosukhin∗ ‡\nillia.polosukhin@gmail.com\nAbstract\nThe dominant sequence transduction models are based on complex recurrent or\nconvolutional neural networks that include an encoder and a decoder. The best\nperforming models also connect the enco

## Create the embeddings and the retriever

In [None]:
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings

embedding_function = HuggingFaceEmbeddings(model_name='BAAI/bge-base-en-v1.5')




## Set up vector databases

### FAISS vector database

In [None]:
db = FAISS.from_documents(chunked_docs, embedding_function)

### Chroma vector database

In [None]:
from langchain_chroma import Chroma
db = Chroma.from_documents(chunked_docs, embedding_function)

We need a way to retrieve the documents that are relevant to an unstructured query. For that, we will use the `as_retriever` method, with `db` as the backbone. Because `db` was reassigned in the previous cell, the retriever uses the Chroma store:
- `search_type="similarity"` means we want to perform a similarity search between the query and the documents.
- `search_kwargs={'k': 4}` instructs the retriever to return the top 4 results.


In [None]:
retriever = db.as_retriever(
    search_type="similarity",
    search_kwargs={'k': 4}
)

The vector database and retriever are now set up.
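Conceptually, a similarity retriever embeds the query, scores it against every stored chunk embedding, and returns the `k` best matches. The sketch below shows that idea with cosine similarity on toy two-dimensional vectors. It is an illustration under assumptions: the function names are hypothetical, real embeddings have hundreds of dimensions, and the vector stores used here may rank by a different distance (such as Euclidean) and use specialised indexes rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy vectors standing in for chunk embeddings.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
query_vec = [1.0, 0.1]
print(top_k(query_vec, doc_vecs, k=2))  # [0, 2]
```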

### Save the vector database

In [None]:
!mkdir '/content/docs'

In [None]:
persist_directory = '/content/docs'

In [None]:
!rm -rf /content/docs/*  # remove old database files if any

In [None]:
db = Chroma.from_documents(
    documents=chunked_docs,
    embedding=embedding_function,
    persist_directory=persist_directory
)

In [None]:
print(db._collection.count())

90


## Load the quantised model

For this example, we chose [`HuggingFaceH4/zephyr-7b-beta`](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta), a small but powerful model.

With many models being released every week, you may want to substitute this model with the latest one. The best way to keep track of open-source LLMs is to check the [open-source LLM leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

To reduce the memory needed to run the model, we will load a 4-bit quantised version:

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_name = 'HuggingFaceH4/zephyr-7b-beta'

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
tokenizer = AutoTokenizer.from_pretrained(model_name)
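A rough back-of-the-envelope calculation shows why 4-bit loading matters on a single GPU. The sketch below assumes roughly 7 billion weights and counts only the weights themselves. It ignores quantisation constants, layers that stay in higher precision, activations, and the attention cache, so treat the figures as lower bounds rather than measurements. The function name `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params = 7e9  # assumption: roughly 7 billion weights
print(weight_memory_gb(params, 16))  # 16-bit weights: 14.0
print(weight_memory_gb(params, 4))   # 4-bit weights: 3.5
```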


## Set up the LLM chain

Finally, we have all the pieces we need to set up the LLM chain.

First, create a `text-generation` pipeline using the loaded model and its tokeniser.

Next, create a prompt template. This should follow the format of the model, so if you substitute the model checkpoint, ensure that you use the appropriate formatting.

In [None]:
from langchain_community.llms import HuggingFacePipeline
from langchain_core.prompts import PromptTemplate
from transformers import pipeline
from langchain_core.output_parsers import StrOutputParser

text_generation_pipeline = pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    temperature=0.2,
    do_sample=True,
    repetition_penalty=1.1,
    return_full_text=True,
    max_new_tokens=400,
)

llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

prompt_template = """
<|system|>
Answer the question based on your knowledge. Use the following context to help:

{context}

</s>
<|user|>
{question}
</s>
<|assistant|>

 """

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template,
)

llm_chain = prompt | llm | StrOutputParser()



Next, combine the `llm_chain` with the retriever to create a RAG chain. The documents returned by the retriever fill the `{context}` placeholder, and `RunnablePassthrough` forwards the original question unchanged to the `{question}` placeholder:

In [None]:
from langchain_core.runnables import RunnablePassthrough

retriever = db.as_retriever()

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | llm_chain
)
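Because the retriever's output is passed straight into the prompt, the context is filled with the printed form of each `Document`, metadata included. One optional refinement is to join only the text of the retrieved chunks. The sketch below shows such a helper on its own. The `Document` class is a minimal stand-in defined just for this example, and `format_docs` is a hypothetical name, not something defined earlier in this demonstration.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Minimal stand-in for LangChain's Document, for illustration only."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    """Join the text of each retrieved document, dropping the metadata."""
    return "\n\n".join(doc.page_content for doc in docs)


docs = [
    Document("Self-attention relates different positions of a single sequence.", {"page": 1}),
    Document("Multi-head attention runs several attention layers in parallel.", {"page": 4}),
]
context = format_docs(docs)
print(context)
```

In a chain, a helper like this is usually piped after the retriever, for example `{"context": retriever | format_docs, "question": RunnablePassthrough()}`, so that the prompt receives plain text rather than object representations.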


## Compare the results

Let's see the difference RAG makes when answering questions that are specific to the paper.

In [None]:
question = "What is self attention according to the paper?"

First, let's see what kind of answer we can get with just the model itself, with no context added:

In [None]:
llm_chain.invoke({"context":"", "question": question})

'\n<|system|>\nAnswer the question based on your knowledge. Use the following context to help:\n\n\n\n</s>\n<|user|>\nWhat is self attention according to the paper?\n</s>\n<|assistant|>\n\n  According to the paper, self attention refers to a mechanism in deep learning models that allows for the direct interaction between different parts of an input sequence without the need for external context or queries. In other words, it enables a model to attend to multiple locations within an input sequence simultaneously, rather than relying solely on the relationships between the input and external queries or context. This technique has shown promising results in various natural language processing tasks, such as machine translation and text generation, by improving the ability of the model to capture long-range dependencies and generate more coherent and fluent outputs.'
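The output repeats the whole prompt before the answer because the pipeline was created with `return_full_text=True`. If you only want the answer, one option is to keep the text after the final `<|assistant|>` marker. The sketch below is a minimal illustration of that; the helper name `extract_answer` and the shortened sample string are made up for the example.

```python
def extract_answer(generated_text, marker="<|assistant|>"):
    """Return the text after the last assistant marker, or all of it if the marker is absent."""
    return generated_text.rsplit(marker, 1)[-1].strip()


# Shortened stand-in for a full-text generation that echoes its prompt.
raw = "<|user|>\nWhat is self attention?\n<|assistant|>\n\n  It relates positions of one sequence."
print(extract_answer(raw))
```

Alternatively, setting `return_full_text=False` when creating the pipeline should make it return only the newly generated text.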

In [None]:
rag_chain.invoke(question)

"\n<|system|>\nAnswer the question based on your knowledge. Use the following context to help:\n\n[Document(metadata={'page': 1, 'source': 'https://arxiv.org/pdf/1706.03762'}, page_content='described in section 3.2.\\nSelf-attention, sometimes called intra-attention is an attention mechanism relating different positions\\nof a single sequence in order to compute a representation of the sequence. Self-attention has been\\nused successfully in a variety of tasks including reading comprehension, abstractive summarization,\\ntextual entailment and learning task-independent sentence representations [4, 27, 28, 22].'), Document(metadata={'page': 14, 'source': 'https://arxiv.org/pdf/1706.03762'}, page_content='be\\njust\\n-\\nthis\\nis\\nwhat\\nwe\\nare\\nmissing\\n,\\nin\\nmy\\nopinion\\n.\\n<EOS>\\n<pad>Figure 5: Many of the attention heads exhibit behaviour that seems related to the structure of the\\nsentence. We give two such examples above, from two different heads from the encoder self

## Use OpenAI models instead

In [None]:
import os

os.environ['OPENAI_API_KEY'] = 'REPLACE WITH YOUR OPENAI KEY'

In [None]:
from langchain_community.llms import OpenAI

llm = OpenAI(model_name="gpt-3.5-turbo-instruct", temperature=0.0)

In [None]:
prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template,
)

llm_chain = prompt | llm | StrOutputParser()

In [None]:
question = "What is self attention according to the paper?"

In [None]:
llm_chain.invoke({"context":"", "question": question})

'\nSelf attention is a mechanism that allows a model to attend to different positions of a sequence in order to compute a representation of the sequence. It is used in natural language processing tasks such as machine translation and text summarization.'

In [None]:
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | llm_chain
)

In [None]:
rag_chain.invoke(question)

'\nSelf-attention is an attention mechanism that relates different positions of a single sequence in order to compute a representation of the sequence. It has been used successfully in various tasks such as reading comprehension, abstractive summarization, and textual entailment.'

## Key information
You have learned how to perform retrieval augmented generation (RAG) with both an open-source model and a closed-source model.

## Reflect
Compare the outputs from two or more models, and note your findings and observations.

> Select the pen from the toolbar to add your entry.