# RAG + LLM Assessment

In [1]:
!pip install "transformers>=4.32.0" "optimum>=1.12.0" > null
!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ > null
!pip install langchain > null
!pip install chromadb > null
!pip install sentence_transformers > null # ==2.2.2
!pip install unstructured > null
!pip install pdf2image > null
!pip install pdfminer.six > null
!pip install unstructured-pytesseract > null
!pip install unstructured-inference > null
!pip install faiss-gpu > null
!pip install pikepdf > null
!pip install pypdf > null
!pip install accelerate > null
!pip install pillow_heif > null
!pip install -i https://pypi.org/simple/ bitsandbytes > null

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
spacy 3.7.4 requires typer<0.10.0,>=0.3.0, but you have typer 0.12.3 which is incompatible.
weasel 0.3.4 requires typer<0.10.0,>=0.3.0, but you have typer 0.12.3 which is incompatible.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
imageio 2.31.6 requires pillow<10.1.0,>=8.3.2, but you have pillow 10.3.0 which is incompatible.

In [None]:
import os
# restart the Colab runtime so the freshly installed packages are picked up
os.kill(os.getpid(), 9)

Steps:

1. Choose a domain and collect a suitable dataset of documents (at least 5 documents - PDFs or HTML pages) to serve as the knowledge base for your RAG system. Select one of the following topics:
   * latest scientific papers from arxiv.org,
   * fiction books released,
   * legal documents, or
   * social media posts.

   Make sure that the documents are newer than the training dataset of the applied LLM. (20 points)

2. Create three prompts relevant to the dataset and one irrelevant prompt. (20 points)

3. Load an LLM with at least 5B parameters. (10 points)

4. Test the LLM with your prompts. The goal should be that without the collected dataset your model is unable to answer the question. If it gives you a good answer, select another question to answer and maybe a different dataset. (10 points)

5. Create a LangChain-based RAG system by setting up a vector database from the documents. (20 points)

6. Provide your three relevant and one irrelevant prompts to your RAG system. For the relevant prompts, your RAG system should return relevant answers, and for the irrelevant prompt, an empty answer. (20 points)

In [2]:
from google.colab import userdata
from google.colab import drive
from huggingface_hub import login
from langchain.chains import RetrievalQA
from langchain.document_loaders import UnstructuredPDFLoader
from langchain.document_loaders import UnstructuredURLLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores.utils import filter_complex_metadata
from langchain.vectorstores import FAISS
from textwrap import fill
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline, BitsAndBytesConfig

import os

import locale
# workaround: report UTF-8 as the preferred encoding (Colab shell commands need a UTF-8 locale)
locale.getpreferredencoding = lambda: "UTF-8"

**Model**

In [4]:
hugging_face_token = userdata.get('HF_TOKEN_R')
login(hugging_face_token, add_to_git_credential = True)

Token is valid (permission: read).
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [5]:
# not enough ram in colab to load either of these two models
# model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"
# model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
model_name = "meta-llama/Llama-2-7b-chat-hf"

quantization_config = BitsAndBytesConfig(load_in_8bit = True)
model = AutoModelForCausalLM.from_pretrained(
            model_name, device_map = 'auto',
            token = hugging_face_token,
            quantization_config = quantization_config,
            trust_remote_code = True
)




config.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.62k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]
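
The two shards listed above add up to roughly 13.5 GB, which is consistent with 16-bit weights for a model of about 6.7 billion parameters. A rough back-of-the-envelope sketch of what `load_in_8bit = True` saves, counting weights only (activations, the KV cache, and quantization overhead are ignored, and the helper name `weight_gigabytes` is illustrative):

```python
# Rough weight-memory estimate: bytes = parameters * bits / 8.
# Only weights are counted; runtime overhead is ignored (assumption).

def weight_gigabytes(n_params: float, bits_per_param: int) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

# Shard sizes reported by the download output above, in GB.
fp16_download_gb = 9.98 + 3.50

# Infer the parameter count, assuming the checkpoint stores 2 bytes per weight.
n_params = fp16_download_gb * 1e9 / 2

print(f"parameters  ~ {n_params / 1e9:.2f}B")
print(f"16-bit size ~ {weight_gigabytes(n_params, 16):.2f} GB")
print(f"8-bit size  ~ {weight_gigabytes(n_params, 8):.2f} GB")
```

Under these assumptions, 8-bit loading roughly halves the memory needed for the weights.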

In [8]:
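tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast = True, token = hugging_face_token)

# start from the model's default generation settings and override a few of them
gen_cfg = GenerationConfig.from_pretrained(model_name)
gen_cfg.max_new_tokens = 512        # cap on newly generated tokens per answer
gen_cfg.temperature = 0.00000001    # near-zero temperature: sampling is almost deterministic
gen_cfg.return_full_text = True
gen_cfg.do_sample = True
gen_cfg.repetition_penalty = 1.11   # mildly discourage repeated tokens

pipe = pipeline(
    task = 'text-generation',
    model = model,
    tokenizer = tokenizer,
    generation_config = gen_cfg
)

# wrap the transformers pipeline so LangChain can call it as an LLM
llm = HuggingFacePipeline(pipeline = pipe)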
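
Setting `temperature` to a tiny positive value while keeping `do_sample = True` makes sampling behave almost like greedy decoding, because dividing the logits by the temperature concentrates the softmax on the largest logit. A toy sketch with made-up logits (not the model's real scores):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, shifted by the max for stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.5, 0.3]  # hypothetical scores for three candidate tokens

for t in (1.0, 0.1, 1e-8):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 4) for p in probs])
```

In `transformers`, `do_sample = False` requests greedy decoding directly, which avoids relying on a near-zero temperature.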



In [6]:
template = """
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

{text}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

prompt = PromptTemplate(
    input_variables=["text"],
    template=template,
)
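
The template above uses Llama 3 header tokens, whereas the model loaded earlier is `meta-llama/Llama-2-7b-chat-hf`, whose chat format wraps the user turn in `[INST] ... [/INST]` with an optional `<<SYS>>` block. A minimal sketch of that format follows. The helper name `build_llama2_prompt` is illustrative, and the sketch assumes the tokenizer adds the beginning-of-sequence token itself:

```python
def build_llama2_prompt(user_message, system_message=None):
    """Format a single-turn prompt in the Llama 2 chat style.

    The BOS token is left to the tokenizer (assumption), so it is not
    written into the string here.
    """
    if system_message:
        return (
            f"[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n"
            f"{user_message} [/INST]"
        )
    return f"[INST] {user_message} [/INST]"

print(build_llama2_prompt("What is Baby Reindeer about?"))
```

The same string could serve as a `PromptTemplate` body by replacing the user message with a `{text}` placeholder.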

**Testing With Prompts**

In [9]:
# first prompt
# correct answer: Ambika Mod did play Emma Morley in the 2024 version of One Day
prompt_one   = "Did Ambika Mod ever play Emma Morley from One Day?"
result = llm(prompt.format(text = prompt_one))
print(fill(result.strip(), width = 100))

<|begin_of_text|><|start_header_id|>user<|end_header_id|>  Did Ambika Mod ever play Emma Morley from
One Day?<|eot_id|><|start_header_id|>assistant<|end_header_id|>  No, Ambika Mod did not play Emma
Morley in the movie One Day. The role of Emma Morley was played by actress Anne Hathaway.


In [10]:
# second prompt
# correct answer: the milk men
prompt_two   = "In the movie Unfrosted, who are implicated in the assassination of Kennedy?"
result = llm(prompt.format(text = prompt_two))
print(fill(result.strip(), width = 100))

<|begin_of_text|><|start_header_id|>user<|end_header_id|>  In the movie Unfrosted, who are
implicated in the assassination of Kennedy?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
The following individuals are implicated in the assassination of President John F. Kennedy in the
movie Unfrosted:  1. Lee Harvey Oswald - The man accused of shooting Kennedy from the sixth floor of
the Texas School Book Depository building in Dallas, Texas. 2. Lyndon B. Johnson - The Vice
President of the United States at the time of Kennedy's assassination, who became the 36th President
of the United States after Kennedy's death. Some conspiracy theories suggest that Johnson was
involved in a cover-up of the assassination. 3. John Connally - The Secretary of the Treasury under
Kennedy and a passenger in the presidential limousine during the assassination. Some conspiracy
theories suggest that Connally was involved in a plot to kill Kennedy. 4. Ruby Pearl - A local
nightclub owner who shot and kill

In [11]:
# third prompt
# correct answer: Donny, a struggling comedian, and his female stalker
prompt_three  = "What is Baby Reindeer about?"
result = llm(prompt.format(text = prompt_three))
print(fill(result.strip(), width = 100))

<|begin_of_text|><|start_header_id|>user<|end_header_id|>  What is Baby Reindeer
about?<|eot_id|><|start_header_id|>assistant<|end_header_id|>  Baby Reindeer is a popular Christmas
song that tells the story of a baby reindeer who is learning to fly. The song is often sung by
children and is a fun way to teach them about the holiday season.  The lyrics of Baby Reindeer are
as follows:  <|begin_of_text|>Oh, baby reindeer, oh, baby reindeer Learning to fly, learning to fly
You're so high up in the sky, baby reindeer Flying so high, flying so high  You're so fast, you're
so bright Flying through the night Oh, baby reindeer, oh, baby reindeer Learning to fly, learning to
fly</|eot_id|>  The song is typically sung to the tune of "Jingle Bells" and is a fun way for
children to learn about the holiday season. It can be sung while playing with toy reindeer or while
watching a video of real reindeer flying.  Overall, Baby Reindeer is a fun and educational song that
can help children learn about 

In [12]:
# irrelevant prompt
prompt_irrelevant   = "What is artificial intelligence?"
result = llm(prompt.format(text = prompt_irrelevant))
print(fill(result.strip(), width = 100))

<|begin_of_text|><|start_header_id|>user<|end_header_id|>  What is artificial
intelligence?<|eot_id|><|start_header_id|>assistant<|end_header_id|>  Artificial intelligence (AI)
refers to the development of computer systems that can perform tasks that typically require human
intelligence, such as visual perception, speech recognition, decision-making, and language
translation. AI systems use algorithms and machine learning techniques to analyze data and make
predictions or decisions based on that data.  There are several types of AI, including:  1. Narrow
or weak AI: This type of AI is designed to perform a specific task, such as playing chess or
recognizing faces. Narrow AI is the most common form of AI and is used in many applications,
including virtual assistants, image recognition software, and self-driving cars. 2. General or
strong AI: This type of AI is designed to perform any intellectual task that a human can do. General
AI has not yet been developed, but it is the goal of many
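
Each output above begins with the prompt itself, so the model's answer only starts after the final assistant header. A small sketch that keeps just the completion, assuming the generated string starts with the exact prompt that was sent (the helper name `strip_prompt` is illustrative):

```python
def strip_prompt(generated_text, prompt_text):
    """Return only the completion when the output echoes the prompt."""
    if generated_text.startswith(prompt_text):
        return generated_text[len(prompt_text):].strip()
    # Fall back to the full text if the prompt is not echoed verbatim.
    return generated_text.strip()

sent = "<prompt> What is artificial intelligence? <answer>"
echoed = sent + " Artificial intelligence (AI) refers to ..."
print(strip_prompt(echoed, sent))
```

The `text-generation` pipeline also accepts `return_full_text=False`, which asks it to return only the newly generated text.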

**Documents**

In [14]:
drive.mount('/content/drive')
info_folder = '/content/drive/My Drive/new_info_pdf/'
documents = os.listdir(info_folder)

filenames = [info_folder + document for document in documents]

Mounted at /content/drive


In [15]:
documents

['unfrosted.pdf',
 'baby_reindeer.pdf',
 'one_day.pdf',
 'shogun.pdf',
 'damsel.pdf']
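
`os.listdir` returns every entry in the folder, in arbitrary order. A hedged sketch that keeps only PDF files and sorts them, using `pathlib`; the function name `list_pdfs` and the temporary folder in the demo are illustrative:

```python
import tempfile
from pathlib import Path

def list_pdfs(folder):
    """Return sorted paths of the .pdf files directly inside folder."""
    return sorted(
        str(p) for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )

# Demo on a throwaway folder so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("shogun.pdf", "damsel.pdf", "notes.txt"):
        (Path(tmp) / name).touch()
    demo = [Path(p).name for p in list_pdfs(tmp)]

print(demo)
```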

**LangChain-based RAG System**

In [17]:
# load documents and split into chunks
loaders = [UnstructuredPDFLoader(filename) for filename in filenames]

# one splitter is enough: 1024-character chunks with 256 characters of overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=256)

chunked_pdf_doc = []

for loader in loaders:
    print("Loading raw document..." + loader.file_path)
    pdf_doc = loader.load()
    # drop metadata values of types that vector stores do not support
    updated_pdf_doc = filter_complex_metadata(pdf_doc)
    print("Splitting text...")
    # separate name so the earlier `documents` file list is not overwritten
    chunks = text_splitter.split_documents(updated_pdf_doc)
    chunked_pdf_doc.extend(chunks)

len(chunked_pdf_doc)

Loading raw document.../content/drive/My Drive/new_info_pdf/unfrosted.pdf


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


Splitting text...
Loading raw document.../content/drive/My Drive/new_info_pdf/baby_reindeer.pdf
Splitting text...
Loading raw document.../content/drive/My Drive/new_info_pdf/one_day.pdf
Splitting text...
Loading raw document.../content/drive/My Drive/new_info_pdf/shogun.pdf
Splitting text...
Loading raw document.../content/drive/My Drive/new_info_pdf/damsel.pdf
Splitting text...


207
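
With `chunk_size=1024` and `chunk_overlap=256`, consecutive chunks can share up to 256 characters, so a sentence cut at one boundary is more likely to appear whole in a neighbouring chunk. The sketch below is a simplified fixed-window splitter, not the separator-aware logic of `RecursiveCharacterTextSplitter`; it only illustrates how size and overlap interact:

```python
def sliding_chunks(text, chunk_size=1024, chunk_overlap=256):
    """Split text into fixed-size windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

# 3000 characters of synthetic text, purely for illustration.
sample = "".join(chr(ord("a") + i % 26) for i in range(3000))
chunks = sliding_chunks(sample)
print(len(chunks), [len(c) for c in chunks])
```

A larger overlap raises the chunk count, and with it the index size, in exchange for fewer facts being split across chunk boundaries.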

In [18]:
# vectorize the chunks, and store in a FAISS database
embeddings = HuggingFaceEmbeddings()
db_pdf = FAISS.from_documents(chunked_pdf_doc, embeddings)

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]



config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [19]:
# RAG prompt template: instructions and retrieved context go in the system turn
prompt_template = """
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Use the following context to answer the question at the end. Do not use any other information. If you can't find the relevant information in the context, just return an empty answer. Don't try to make up an answer.

{context}<|eot_id|><|start_header_id|>user<|end_header_id|>

{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

In [20]:
# prompt with placeholders for the retrieved context and the user question
prompt = PromptTemplate(
    template = prompt_template,
    input_variables = ["context", "question"]
)

# RetrievalQA chain: the "stuff" type puts all retrieved chunks into one prompt
Chain_web = RetrievalQA.from_chain_type(
    llm = llm,
    chain_type = "stuff",
    retriever = db_pdf.as_retriever(
        search_type = "similarity_score_threshold",
        # return at most 10 chunks whose relevance score is at least 0.1
        search_kwargs = {'k': 10, 'score_threshold': 0.1}
    ),
    chain_type_kwargs={"prompt": prompt},
)
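
The `similarity_score_threshold` search returns at most `k` chunks and drops any whose relevance score falls below `score_threshold`, so an off-topic question can retrieve nothing at all. A toy sketch of that filtering with hand-made 2-D vectors and plain cosine similarity follows. LangChain's actual relevance scoring depends on the vector store, so the vectors and scores here are only illustrative:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, indexed, k=10, score_threshold=0.1):
    """Return up to k (score, text) pairs scoring at least score_threshold."""
    scored = [(cosine(query_vec, vec), text) for text, vec in indexed]
    kept = [pair for pair in scored if pair[0] >= score_threshold]
    return sorted(kept, reverse=True)[:k]

# Hypothetical embeddings: both chunks point roughly along the first axis.
indexed = [
    ("One Day cast", [1.0, 0.0]),
    ("Unfrosted plot", [0.95, 0.05]),
]

on_topic = retrieve([1.0, 0.05], indexed)   # close to both chunks
off_topic = retrieve([0.0, 1.0], indexed)   # nearly orthogonal to both
print(len(on_topic), len(off_topic))
```

When the retriever returns nothing, the `{context}` slot in the prompt is left empty.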

**Testing the RAG System**

In [21]:
# first prompt
# correct answer: Ambika Mod did play Emma Morley in the 2024 version of One Day
result = Chain_web.invoke(prompt_one)
print(fill(result['result'].strip(), width = 100))

<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just return an empty answer. Don't try to make up an answer.
https://en.wikipedia.org/wiki/One_Day_(TV_series)  5/7  5/9/24, 6:49 AM  One Day (TV series) -
Wikipedia  It was a breakthrough for the careers of leads Ambika Mod and Leo Woodall.[17][18][19]
References  1. Ritman, Alex (30 November 2021). "Netflix Orders 5 U.K. Series, Including Adaptation
of David Nicholls' 'One Day' " (https://www.hollywoodreporter.com/tv/tv-news/netflix-david-nicholls-
one-day- adaptation-1235054656/). The Hollywood Reporter. Retrieved 30 November 2021.  2. Whittock,
Jesse (23 June 2022). " 'The White Lotus' Actor Leo Woodall & 'This Is Going To Hurt' Breakout
Ambika Mod To Lead Netflix Drama 'One Day' " (https://deadline.com/2022/06/netflix-dra ma-one-day-
casts-the-white-lotus-actor-leo-wood

In [22]:
# second prompt
# correct answer: the milk men
result = Chain_web.invoke(prompt_two)
print(fill(result['result'].strip(), width = 100))

<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just return an empty answer. Don't try to make up an answer.
https://en.wikipedia.org/wiki/Unfrosted  4/7  5/9/24, 6:44 AM  Unfrosted - Wikipedia  experience
that may leave the consumer feeling pangs of regret."[16] Metacritic, which uses a weighted average,
assigned the film a score of 43 out of 100, based on 29 critics, indicating "mixed or average"
reviews.[17]  Notes  a. A subtitle notes this as one of the film's few legitimate pieces of trivia
b. Though unnamed onscreen, Slattery and Hamm's characters are implied to be their Mad Men
characters Roger Sterling and Don Draper.[3]  References  1. "Unfrosted (12)"
(https://www.bbfc.co.uk/release/unfrosted-q29sbgvjdglvbjpwwc0xmdixmdy2). BBFC. May 2, 2024.
Retrieved May 3, 2024.  The following morning, Pop-Tarts sell out of ev

In [23]:
# third prompt
# correct answer: Donny, a struggling comedian, and his female stalker
result = Chain_web.invoke(prompt_three)
print(fill(result['result'].strip(), width = 100))

<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just return an empty answer. Don't try to make up an answer.  German outlet queer.de
said: "The hype is not for nothing. But this series requires strong nerves! It's best not to watch
it alone..."[16]  Audience viewership  Baby Reindeer debuted at number five on Netflix's Top 10 TV
English titles for the tracking week of 8–14 April 2024, with 10.4 million hours viewed.[17] On the
following week, it climbed to number one and garnered 52.8 million viewing hours.[18]  References
1. Venn, Lydia (18 April 2024). "Who was Llewellyn Harrison, whom Baby Reindeer is dedicated to?"
(https://www.cosmopolitan.com/uk/entertainment/a60532123/who-was-llewellyn-harrison-baby-rei ndeer-
dedication/). Cosmopolitan. Archived (https://web.archive.org/web/20240422011335/https://
www.cosmopolitan

In [24]:
# irrelevant prompt
result = Chain_web.invoke(prompt_irrelevant)
print(fill(result['result'].strip(), width = 100))



<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just return an empty answer. Don't try to make up an answer.
<|eot_id|><|start_header_id|>user<|end_header_id|>  What is artificial
intelligence?<|eot_id|><|start_header_id|>assistant<|end_header_id|>  Answer:
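
For the irrelevant prompt the retrieved context is empty, yet the prompt is still sent to the model, which starts writing `Answer:`. One way to guarantee an empty answer is to skip generation when nothing is retrieved. The sketch below uses stand-in callables rather than the real retriever and LLM; `answer_or_empty`, `fake_retrieve`, and `fake_generate` are hypothetical names:

```python
def answer_or_empty(question, retrieve_fn, generate_fn):
    """Call the generator only if the retriever found supporting context."""
    docs = retrieve_fn(question)
    if not docs:
        return ""  # nothing relevant retrieved: return an empty answer
    context = "\n\n".join(docs)
    return generate_fn(context, question)

# Stand-ins for the demo; a real setup would wrap the retriever and the LLM.
def fake_retrieve(question):
    if "Reindeer" in question:
        return ["Baby Reindeer debuted on Netflix's Top 10 in April 2024."]
    return []

def fake_generate(context, question):
    return f"(answer based on {len(context)} characters of context)"

empty = answer_or_empty("What is artificial intelligence?", fake_retrieve, fake_generate)
found = answer_or_empty("What is Baby Reindeer about?", fake_retrieve, fake_generate)
print(repr(empty))
print(found)
```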
