<a href="https://colab.research.google.com/github/rimmelb/AITAssignment7/blob/main/AIT_RAG_Assessment_ipynb_m%C3%A1solata.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG + LLM Assessment

Your task is to create a Retrieval-Augmented Generation (RAG) system using a Large Language Model (LLM). The RAG system should be able to retrieve relevant information from a knowledge base and generate coherent and informative responses to user queries.

Steps:

1. Choose a domain and collect a suitable dataset of documents (at least 5 documents - PDFs or HTML pages) to serve as the knowledge base for your RAG system. Select one of the following topics:
   * latest scientific papers from arxiv.org,
   * fiction books released,
   * legal documents, or
   * social media posts.

   Make sure that the documents are newer than the training data of the LLM you use. (20 points)

2. Create three prompts that are relevant to the dataset and one that is irrelevant. (20 points)

3. Load an LLM with at least 5B parameters. (10 points)

4. Test the LLM with your prompts. The goal should be that without the collected dataset your model is unable to answer the question. If it gives you a good answer, select another question to answer and maybe a different dataset. (10 points)

5. Create a LangChain-based RAG system by setting up a vector database from the documents. (20 points)

6. Provide your three relevant and one irrelevant prompts to your RAG system. For the relevant prompts, your RAG system should return relevant answers, and for the irrelevant prompt, an empty answer. (20 points)


In [1]:
!pip install "transformers>=4.32.0" "optimum>=1.12.0" > null
!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ > null
!pip install -i https://pypi.org/simple/ bitsandbytes > null
!pip install spacy==3.7.4 typer==0.10.0 imageio==2.31.6 pillow==10.1.0
!pip install langchain > null
!pip install chromadb > null
!pip install sentence_transformers > null # ==2.2.2
!pip install unstructured > null
!pip install pdf2image > null
!pip install pdfminer.six > null
!pip install unstructured-pytesseract > null
!pip install unstructured-inference > null
!pip install faiss-gpu > null
!pip install pikepdf > null
!pip install pypdf > null
!pip install accelerate
!pip install pillow_heif > null
!pip install -i https://pypi.org/simple/ bitsandbytes

Collecting typer==0.10.0
  Using cached typer-0.10.0-py3-none-any.whl (46 kB)
Collecting pillow==10.1.0
  Using cached Pillow-10.1.0-cp310-cp310-manylinux_2_28_x86_64.whl (3.6 MB)
INFO: pip is looking at multiple versions of spacy to determine which version is compatible with other requirements. This could take a while.
ERROR: Cannot install spacy==3.7.4 and typer==0.10.0 because these package versions have conflicting dependencies.

The conflict is caused by:
    The user requested typer==0.10.0
    spacy 3.7.4 depends on typer<0.10.0 and >=0.3.0

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts
Looking in indexes: https://pypi.org/simple/


In [None]:
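import os

# Kill the current process to restart the Colab runtime so the newly installed packages are loaded.
os.kill(os.getpid(), 9)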

In [1]:
import locale
from textwrap import fill

import accelerate
import bitsandbytes
from huggingface_hub import login
from langchain.chains import RetrievalQA
from langchain.document_loaders import UnstructuredURLLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import HuggingFacePipeline
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
# filter_complex_metadata drops metadata values that are not str, int, float or bool
from langchain.vectorstores.utils import filter_complex_metadata
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline, BitsAndBytesConfig

In [2]:
from getpass import getpass

# Never hard-code access tokens; prompt for the token at run time instead.
HUGGINGFACE_UAT = getpass("Hugging Face token: ")
login(HUGGINGFACE_UAT)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [3]:
#model_name = "google/gemma-2b-it" # 2B language model from Google
model_name = "meta-llama/Meta-Llama-3-8B" # 8B language model from Meta AI

# Load the weights in 8-bit via bitsandbytes to reduce GPU memory use.
quan = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             quantization_config=quan,
                                             trust_remote_code=True)

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

# Start from the model's default generation settings and override a few of them.
gen_cfg = GenerationConfig.from_pretrained(model_name)
gen_cfg.max_new_tokens = 512
gen_cfg.do_sample = True
gen_cfg.temperature = 0.0000001  # near-zero temperature makes sampling effectively greedy
gen_cfg.return_full_text = True  # the returned text includes the prompt
gen_cfg.repetition_penalty = 1.11

pipe = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    generation_config=gen_cfg
)

# Wrap the transformers pipeline so LangChain can use it as an LLM.
llm = HuggingFacePipeline(pipeline=pipe)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Question without related context

In [4]:
template_gemma = """
<bos><start_of_turn>user
{text}<end_of_turn>
<start_of_turn>model
"""

template_llama3 = """
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

{text}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

# Pick the chat template that matches the loaded model.
if "gemma" in model_name:
    template = template_gemma
else:
    template = template_llama3

prompt = PromptTemplate(
    input_variables=["text"],
    template=template,
)


In [5]:
text = "What kind of Edge Read operations GTX support?"
result = llm(prompt.format(text=text))
print(fill(result.strip(), width=100))

  warn_deprecated(
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>user<|end_header_id|>  What kind of Edge Read operations GTX
support?<|eot_id|><|start_header_id|>assistant<|end_header_id|>  1. Read the value of a single
register.  2. Read the value of multiple registers in one operation.  3. Read the value of all
registers in one operation.  Answer: 1, 2 and 3 are supported.  What is the maximum number of
registers that can be read in one operation? fıkır  Answer: The maximum number of registers that can
be read in one operation is 32.  What is the maximum number of registers that can be written in one
operation? fıkır  Answer: The maximum number of registers that can be written in one operation is
32.  What is the maximum number of registers that can be read or written in one operation? fıkır
Answer: The maximum number of registers that can be read or written in one operation is 64.  What is
the maximum number of registers that can be read or written in one operation? fıkır  Answer: The
maximum number of registe
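
Because the pipeline returns the prompt together with the completion, the output above is hard to read. A minimal sketch of a post-processing helper follows, assuming the Llama 3 template shown earlier; `extract_answer` and `ASSISTANT_HEADER` are hypothetical names that are not part of the notebook.

```python
# Hypothetical helper: keep only the text generated after the assistant header.
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"
END_OF_TURN = "<|eot_id|>"


def extract_answer(full_text: str, marker: str = ASSISTANT_HEADER) -> str:
    """Return the assistant's reply from a prompt-plus-completion string."""
    before, found, after = full_text.partition(marker)
    # If the header is missing, fall back to the whole text.
    answer = after if found else before
    # Cut at the first end-of-turn token, if the model emitted one.
    return answer.split(END_OF_TURN, 1)[0].strip()
```

With the cell above, this could be applied as `print(fill(extract_answer(result), width=100))`.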

With context

In [6]:
%%time

prompt_template_gemma = """
<bos><start_of_turn>user
Use the following context to answer the question at the end. Do not use any other information. If you can't find the relevant information in the context, just say you don't have enough information to answer the question. Don't try to make up an answer.

{context}

Question: {question}<end_of_turn>

<start_of_turn>model
"""

prompt_template_llama3 = """
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Use the following context to answer the question at the end. Do not use any other information. If you can't find the relevant information in the context, just say you don't have enough information to answer the question. Don't try to make up an answer.

{context}<|eot_id|><|start_header_id|>user<|end_header_id|>

{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

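# Select the RAG prompt that matches the loaded model, as for the plain template above.
prompt_template = prompt_template_gemma if "gemma" in model_name else prompt_template_llama3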


CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 8.11 µs


In [7]:
!wget -O document1.pdf --no-check-certificate "https://arxiv.org/pdf/2405.02269.pdf"
!wget -O document2.pdf --no-check-certificate "https://arxiv.org/pdf/2405.01804.pdf"
!wget -O document3.pdf --no-check-certificate "https://arxiv.org/pdf/2405.01448.pdf"
!wget -O document4.pdf --no-check-certificate "https://arxiv.org/pdf/2405.01312.pdf"
!wget -O document5.pdf --no-check-certificate "https://arxiv.org/pdf/2405.00764.pdf"

from langchain.document_loaders import UnstructuredPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

pdf_paths = [f"/content/document{i}.pdf" for i in range(1, 6)]
pdf_loaders = [UnstructuredPDFLoader(path) for path in pdf_paths]

# Split each document into 1024-character chunks that overlap by 512 characters.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=512)

chunked_web_doc = []
for loader in pdf_loaders:
    docs = filter_complex_metadata(loader.load())  # keep only str, int, float or bool metadata
    chunked_web_doc.extend(text_splitter.split_documents(docs))

print(len(chunked_web_doc))  # number of chunks that will be indexed

# Embed the chunks with the default HuggingFaceEmbeddings model and index them in FAISS.
embeddings = HuggingFaceEmbeddings()
db_web = FAISS.from_documents(chunked_web_doc, embeddings)

--2024-05-07 20:02:18--  https://arxiv.org/pdf/2405.02269.pdf
Resolving arxiv.org (arxiv.org)... 151.101.195.42, 151.101.67.42, 151.101.131.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.195.42|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://arxiv.org/pdf/2405.02269 [following]
--2024-05-07 20:02:18--  http://arxiv.org/pdf/2405.02269
Connecting to arxiv.org (arxiv.org)|151.101.195.42|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 180861 (177K) [application/pdf]
Saving to: ‘document1.pdf’


2024-05-07 20:02:18 (29.8 MB/s) - ‘document1.pdf’ saved [180861/180861]

--2024-05-07 20:02:18--  https://arxiv.org/pdf/2405.01804.pdf
Resolving arxiv.org (arxiv.org)... 151.101.195.42, 151.101.67.42, 151.101.131.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.195.42|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://arxiv.org/pdf/2405.01804 [following]
--2024-05-07 20:0
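
To make the effect of `chunk_size=1024` and `chunk_overlap=512` concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which also tries to split on separators such as paragraph and line breaks; `sliding_window_chunks` is a hypothetical name used only for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int = 1024, chunk_overlap: int = 512) -> list[str]:
    """Cut text into chunks of at most chunk_size characters that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks
```

With a 50% overlap, every position in the text (apart from the first and last half chunk) appears in two chunks, which roughly doubles the number of vectors stored in the index.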



In [8]:
prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
Chain_web = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db_web.as_retriever(search_type="similarity_score_threshold", search_kwargs={'k': 4, 'score_threshold': 0.2}),
    chain_type_kwargs={"prompt": prompt},
)
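
The retriever settings above combine a top-`k` cut-off with a minimum score. The sketch below illustrates that idea on plain Python lists, using cosine similarity as a stand-in score. This is an assumption for illustration: LangChain derives its relevance scores from the FAISS distances with its own normalization, so the actual numbers will differ. `cosine_similarity` and `retrieve` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, doc_vecs, k=4, score_threshold=0.2):
    """Return indices of at most k documents whose score reaches the threshold, best first."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    kept = [(score, idx) for score, idx in scored if score >= score_threshold]
    kept.sort(reverse=True)
    return [idx for _, idx in kept[:k]]
```

When no document reaches the threshold, the result is an empty list, so the `{context}` slot of the prompt stays empty.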

In [9]:
query = "What do GTX read-write transactions support?"
result = Chain_web.invoke(query)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just say you don't have enough information to answer the question. Don't try to make up
an answer.  read-write transaction is executed by its creator worker thread, a read-only transaction
can be executed by several OpenMP threads concurrently. GTX further implements a state protection
protocol to support edge-deltas block consolidation when it becomes overflow. GTX adopts the block
manager from [12] and manages garbage collection lazily. Memory blocks are recycled when no
concurrent and future transactions can see them.  3 GTX TRANSACTION OPERATIONS Each GTX transaction
is assigned a read timestamp (𝑟𝑡𝑠) at its creation time from a global read epoch, and it does not
know its write timestamp (𝑤𝑡𝑠) until it gets committed. GTX guarantees Snapshot Isolation [2] of its
transact

In [14]:
query = "What is GTX's concurrency control algorithm?"
result = Chain_web.invoke(query)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just say you don't have enough information to answer the question. Don't try to make up
an answer.  read-write transaction is executed by its creator worker thread, a read-only transaction
can be executed by several OpenMP threads concurrently. GTX further implements a state protection
protocol to support edge-deltas block consolidation when it becomes overflow. GTX adopts the block
manager from [12] and manages garbage collection lazily. Memory blocks are recycled when no
concurrent and future transactions can see them.  3 GTX TRANSACTION OPERATIONS Each GTX transaction
is assigned a read timestamp (𝑟𝑡𝑠) at its creation time from a global read epoch, and it does not
know its write timestamp (𝑤𝑡𝑠) until it gets committed. GTX guarantees Snapshot Isolation [2] of its
transact

In [None]:
query = "What is Planning in case of Private SPN Construction?"
result = Chain_web.invoke(query)
print(fill(result['result'].strip(), width=100))

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


{'query': 'What is Planning in case of Private SPN Construction?',
 'result': "\n<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nUse the following context to answer the question at the end. Do not use any other information. If you can't find the relevant information in the context, just say you don't have enough information to answer the question. Don't try to make up an answer.\n\nAlgorithm 2: Private SPN Construction PrivSPN(𝑇 , 𝜖) : table 𝑇 , total privacy budget 𝜖. Input Output : A tree of SPN 𝑡 = (parent, left, right)\n\n1 op, 𝜖op, ¯𝜖 ← Planning(𝑇 , 𝜖); 2 parent, ( (cid:101)S𝐿, (cid:101)S𝑅) ← ParentGen(𝑇 , op, 𝜖op); 3 left, right ← ChildrenGen(𝑇 , op, (cid:101)S𝐿, (cid:101)S𝑅, ¯𝜖); 4 return (parent, left, right) 5 procedure ParentGen(𝑇 , op, 𝜖op) if op = OP.LEAF then 6 Δ(hist) (cid:101)𝐻 ← hist(𝑇 ) + Lap( 𝜖op else if op = OP.SUM then\n\n); parent ← LeafNode( (cid:101)𝐻 );\n\n7\n\n8\n\n9\n\n10\n\n11\n\n(cid:101)S𝐿, (cid:101)S𝑅 ← RowSplit(𝑇 , 𝜖op ); parent ← SumNode( 

Irrelevant question (not covered by the documents)

In [15]:
query = "What are the risk factors of NVIDIA?"
result = Chain_web.invoke(query)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not use any other information. If you can't find the relevant information in
the context, just say you don't have enough information to answer the question. Don't try to make up
an answer.  <|eot_id|><|start_header_id|>user<|end_header_id|>  What are the risk factors of
NVIDIA?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
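
In the run above the retriever found no matching context, yet the model was still called with an empty context. One possible way to guarantee an empty answer for irrelevant prompts is to skip generation when nothing is retrieved. The sketch below assumes two hypothetical callables, `retrieve` and `generate`, which in this notebook could wrap the retriever built from `db_web` and the `Chain_web` chain.

```python
from typing import Callable, Sequence


def answer_with_guard(
    question: str,
    retrieve: Callable[[str], Sequence[str]],
    generate: Callable[[str, Sequence[str]], str],
) -> str:
    """Return an empty answer when no context is retrieved; otherwise call the generator."""
    contexts = retrieve(question)
    if not contexts:
        return ""  # irrelevant question: do not call the LLM at all
    return generate(question, contexts)
```

This also saves one model call for every question that falls outside the knowledge base.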
