# RAG + LLM Assessment

Your task is to create a Retrieval-Augmented Generation (RAG) system using a Large Language Model (LLM). The RAG system should be able to retrieve relevant information from a knowledge base and generate coherent and informative responses to user queries.

Steps:

1. Choose a domain and collect a suitable dataset of documents (at least 5 documents - PDFs or HTML pages) to serve as the knowledge base for your RAG system. Select one of the following topics:
   * latest scientific papers from arxiv.org,
   * fiction books released,
   * legal documents, or
   * social media posts.

   Make sure that the documents are newer than the training dataset of the applied LLM. (20 points)

2. Create three prompts that are relevant to the dataset, and one irrelevant prompt. (20 points)

3. Load an LLM with at least 5B parameters. (10 points)

4. Test the LLM with your prompts. The goal should be that without the collected dataset your model is unable to answer the question. If it gives you a good answer, select another question to answer and maybe a different dataset. (10 points)

5. Create a LangChain-based RAG system by setting up a vector database from the documents. (20 points)

6. Provide your three relevant and one irrelevant prompts to your RAG system. For the relevant prompts, your RAG system should return relevant answers, and for the irrelevant prompt, an empty answer. (20 points)


Begin by installing the required packages!

In [3]:
!pip install "transformers>=4.32.0" "optimum>=1.12.0" > null  # quotes stop the shell from treating >= as a redirect
!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ > null
!pip install langchain > null
!pip install chromadb > null
!pip install sentence_transformers > null # ==2.2.2
!pip install unstructured > null
!pip install pdf2image > null
!pip install pdfminer.six > null
!pip install unstructured-pytesseract > null
!pip install unstructured-inference > null
!pip install faiss-gpu > null
!pip install pikepdf > null
!pip install pypdf > null
!pip install accelerate > null
!pip install pillow_heif > null
!pip install -i https://pypi.org/simple/ bitsandbytes > null

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
spacy 3.7.4 requires typer<0.10.0,>=0.3.0, but you have typer 0.12.3 which is incompatible.
weasel 0.3.4 requires typer<0.10.0,>=0.3.0, but you have typer 0.12.3 which is incompatible.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
imageio 2.31.6 requires pillow<10.1.0,>=8.3.2, but you have pillow 10.3.0 which is incompatible.

Important: Restart the kernel after installing the packages.

In [None]:
import os
os.kill(os.getpid(), 9)

Imports!

In [1]:
from huggingface_hub import login
from langchain.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline, BitsAndBytesConfig
from textwrap import fill
from langchain.prompts import PromptTemplate
import locale
from langchain.document_loaders import UnstructuredURLLoader
from langchain.vectorstores.utils import filter_complex_metadata # 'filter_complex_metadata' removes complex metadata that are not in str, int, float or bool format
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

from langchain.chains import RetrievalQA

locale.getpreferredencoding = lambda: "UTF-8"

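# Read your private Hugging Face User Access Token from an environment variable
# instead of hard-coding it; the token is needed to access models with gated licences.
# The variable name HUGGINGFACE_UAT is an assumption; set it before running this cell.
import os
HUGGINGFACE_UAT = os.environ["HUGGINGFACE_UAT"]
login(HUGGINGFACE_UAT)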

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


Get five recent papers from arXiv (all computer graphics papers that folks in my lab at school have mentioned). They were all published recently, so they should be newer than the model's training data.

In [2]:
arxiv_paper_urls = ["https://arxiv.org/html/2404.07984", "https://arxiv.org/pdf/2404.17672", "https://arxiv.org/pdf/2311.06979", "https://arxiv.org/pdf/2404.15228", "https://arxiv.org/pdf/2404.07503"]

web_loader = UnstructuredURLLoader(
    urls=arxiv_paper_urls, mode="elements", strategy="fast",
    )
web_doc = web_loader.load()
updated_web_doc = filter_complex_metadata(web_doc)

text_splitter = RecursiveCharacterTextSplitter(chunk_size=2048, chunk_overlap=512)
chunked_web_doc = text_splitter.split_documents(updated_web_doc)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
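
To make the `chunk_size=2048, chunk_overlap=512` settings in the cell above concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification: `RecursiveCharacterTextSplitter` also tries to split on separators such as paragraph breaks, which this sketch ignores. The helper name `sliding_chunks` and the sample string are hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window start moves forward
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"  # hypothetical stand-in for a document
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.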


Create 3 relevant prompts and 1 irrelevant one.

In [3]:
template = """
<bos><start_of_turn>user
{text}<end_of_turn>
<start_of_turn>model
"""

prompt = PromptTemplate(
    input_variables=["text"],
    template=template,
)

relevant_1 = "What is an issue with Cap3D?"
relevant_2 = "What is a benefit of synthetic data?"
relevant_3 = "What is visual program induction?"

irrelevant = "Who is the main character of the new novel They Thought I Was Dead: Sandy's Story?"
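
The template in the cell above wraps each question in Gemma's chat turn markers. As a rough sketch of what the model actually receives, the snippet below fills the same template with plain `str.format`, which stands in here for `PromptTemplate.format` (an assumption that holds for a single `{text}` variable). The helper name `build_prompt` is hypothetical.

```python
# Same template text as in the cell above.
template = """
<bos><start_of_turn>user
{text}<end_of_turn>
<start_of_turn>model
"""


def build_prompt(question: str) -> str:
    """Fill the single {text} slot; stands in for PromptTemplate.format."""
    return template.format(text=question)


formatted = build_prompt("What is visual program induction?")
print(formatted)
```

The prompt ends with the opening of the model's turn, so the model continues from there with its answer.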

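Load the model! The plan was to use Meta-Llama-3-8B-Instruct, which has 8B parameters, but the code below falls back to google/gemma-2b-it, which has only 2B parameters (see the comments in the cell).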

In [4]:
model_name = "google/gemma-2b-it" # NOTE: this model is only 2B parameters

# I could not figure out how to get quantization to work with the
# "meta-llama/Meta-Llama-3-8B-Instruct" model (or models of other sizes).
# It kept saying that I needed to install the "accelerate" module, which was already installed.
# I tried debugging this for a long time and nothing seemed to work.

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             trust_remote_code=True)

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

gen_cfg = GenerationConfig.from_pretrained(model_name)
gen_cfg.max_new_tokens=512
gen_cfg.temperature=0.0000001 # 0.0 # For RAG we would like to have deterministic answers
gen_cfg.return_full_text=True
gen_cfg.do_sample=True
gen_cfg.repetition_penalty=1.11

pipe=pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    generation_config=gen_cfg
)

llm = HuggingFacePipeline(pipeline=pipe)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Gemma's activation function should be approximate GeLU and not exact GeLU.
Changing the activation function to `gelu_pytorch_tanh`.if you want to use the legacy `gelu`, edit the `model.config` to set `hidden_activation=gelu`   instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details.
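
The generation settings in the cell above pair `do_sample=True` with a tiny temperature to get near-deterministic answers. The sketch below shows why that works, using a temperature-scaled softmax over made-up logits: as the temperature approaches zero, almost all probability lands on the highest-scoring token. The logit values and the helper name are hypothetical.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens

warm = softmax_with_temperature(logits, temperature=1.0)
cold = softmax_with_temperature(logits, temperature=1e-7)

print([round(p, 3) for p in warm])  # probability spread over all three tokens
print([round(p, 3) for p in cold])  # essentially all probability on the first token
```

Setting `do_sample=False` is the more direct way to ask `transformers` for greedy decoding, and it may be worth trying in place of the tiny temperature.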


Attempt the prompts without the dataset loaded in!

In [5]:
result_relevant_1 = fill(llm(prompt.format(text=relevant_1)).strip(), width=100)
result_relevant_2 = fill(llm(prompt.format(text=relevant_2)).strip(), width=100)
result_relevant_3 = fill(llm(prompt.format(text=relevant_3)).strip(), width=100)

result_irrelevant = fill(llm(prompt.format(text=irrelevant)).strip(), width=100)


print(result_relevant_1)
print(result_relevant_2)
print(result_relevant_3)

print(result_irrelevant)

<bos><start_of_turn>user What is an issue with Cap3D?<end_of_turn> <start_of_turn>model **Issues
with Cap3D:**  **1. Performance Issues:** - Cap3D can be computationally expensive, especially for
complex models and scenes. - High polygon count models or materials can cause performance issues on
low-end hardware.  **2. Memory Usage:** - Cap3D can consume significant memory resources, especially
when rendering high-resolution or detailed models. - This can lead to system crashes or slow
performance.  **3. Compatibility Issues:** - Some older versions of Cap3D may not be compatible with
the latest version of Blender or other software. - Ensure that you are using a compatible version of
software.  **4. Bug Reports and Known Issues:** - Cap3D has a history of bug reports and known
issues. - It is important to stay updated with the latest patch notes and fixes.  **5. Limited
Support for Advanced Features:** - While Cap3D supports many advanced features in Blender, some
features may not be fu

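As we can see, the model got all four questions wrong without the dataset! For the first question, for example, it describes Cap3D as if it were rendering software that works with Blender. Hopefully it will do better with RAG.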

Create a vector database!

In [6]:
embeddings = HuggingFaceEmbeddings() # default model_name="sentence-transformers/all-mpnet-base-v2"

In [7]:
%%time

# Create the vectorized db with FAISS
db_web = FAISS.from_documents(chunked_web_doc, embeddings)

CPU times: user 17.2 s, sys: 88.9 ms, total: 17.3 s
Wall time: 17.6 s


In [8]:
prompt_template = """
<bos><start_of_turn>user
Use the following context to Answer the question at the end. Do not use any other information. If you can't find the relevant information in the context, just say you don't have enough information to answer the question. Don't try to make up an answer.

{context}

Question: {question}<end_of_turn>

<start_of_turn>model"""

prompt = PromptTemplate(template=prompt_template, input_variables=["context", "question"])
Chain_web = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    # retriever=db.as_retriever(search_type="similarity_score_threshold", search_kwargs={'k': 5, 'score_threshold': 0.8})
    # Similarity search is the default way to retrieve documents relevant to a query; MMR is available by setting search_type="mmr".
    # k defines how many documents are returned; it defaults to 4.
    # score_threshold sets a minimum relevance for the documents returned when using the "similarity_score_threshold" search type.
    # return_source_documents=True, # Optional parameter, returns the source documents used to answer the question
    retriever=db_web.as_retriever(search_type="similarity_score_threshold", search_kwargs={'k': 10, 'score_threshold': 0.1}),
    chain_type_kwargs={"prompt": prompt},
)
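
The retriever comments in the cell above describe `k` and `score_threshold`. Below is a minimal sketch of that idea as a toy in-memory search: score every chunk by cosine similarity, drop those under the threshold, and keep the best `k`. The three-dimensional vectors are made up; real embeddings come from the `HuggingFaceEmbeddings` model, and LangChain's relevance score for FAISS may be normalised differently from raw cosine similarity, so treat the numbers as illustrative only.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def retrieve(query_vec, index, k=4, score_threshold=0.0):
    """Return up to k (score, text) pairs whose score is at least score_threshold."""
    scored = [(cosine(query_vec, vec), text) for text, vec in index]
    kept = [pair for pair in scored if pair[0] >= score_threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)  # best match first
    return kept[:k]


# Hypothetical 3-dimensional "embeddings"; real ones have hundreds of dimensions.
toy_index = [
    ("chunk about Cap3D captions", [0.9, 0.1, 0.0]),
    ("chunk about synthetic data", [0.1, 0.9, 0.1]),
    ("chunk about program induction", [0.0, 0.2, 0.9]),
]

on_topic = retrieve([1.0, 0.0, 0.0], toy_index, k=2, score_threshold=0.5)
off_topic = retrieve([-1.0, -1.0, -1.0], toy_index, k=2, score_threshold=0.5)

print(on_topic)   # only the Cap3D chunk passes the threshold
print(off_topic)  # nothing passes, so the list is empty
```

In this sketch an off-topic query retrieves nothing at all. With a permissive threshold such as the 0.1 used above, most chunks are likely to pass, so the chain depends on the prompt instructions rather than the retriever to decline an irrelevant question.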

In [9]:
result_relevant_1 = Chain_web.invoke(relevant_1)
result_relevant_2 = Chain_web.invoke(relevant_2)
result_relevant_3 = Chain_web.invoke(relevant_3)

result_irrelevant = Chain_web.invoke(irrelevant)

print(fill(result_relevant_1['result'].strip(), width=100))
print(fill(result_relevant_2['result'].strip(), width=100))
print(fill(result_relevant_3['result'].strip(), width=100))

print(fill(result_irrelevant['result'].strip(), width=100))



<bos><start_of_turn>user Use the following context to Answer the question at the end. Do not use any
other information. If you can't find the relevant information in the context, just say you don't
have enough information to answer the question. Don't try to make up an answer.  3.1 Issues in Cap3D
B Dataset: more details & results  B.1 Extra dataset details B.2 Captions: overcome the failure
cases in Cap3D B.3 Captions: Ours vs. Cap3D vs. human-authored B.4 Captions: Ours vs. ablated
variants B.5 Diffu: Ours vs. bottom 6-views vs. horizontal 6-views B.6 Failure cases B.7 Human
evaluation details  4.1 Correction of Cap3D Captions  B.2 Captions: overcome the failure cases in
Cap3D  B.2 Captions: overcome the failure cases in Cap3D  3 Method  3.1 Issues in Cap3D 3.2
DiffuRank Formulation 3.3 New 3D Captioning Framework  Image-Based Method: Approximately
10⁢k10𝑘10k10 italic_k renderings in Cap3D dataset were identified as having all-grey images, likely
due to rendering issues within the Ca
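
The printed text above starts with the instruction and the retrieved chunks because `chain_type="stuff"` pastes every retrieved chunk into the `{context}` slot, and the pipeline echoes the prompt back along with the answer. Here is a small sketch of both steps, assuming chunks are joined with blank lines and that the answer is whatever follows the last `<start_of_turn>model` marker. The helper names and the placeholder reply are hypothetical.

```python
# Same RAG template text as in the cell above.
prompt_template = """
<bos><start_of_turn>user
Use the following context to Answer the question at the end. Do not use any other information. If you can't find the relevant information in the context, just say you don't have enough information to answer the question. Don't try to make up an answer.

{context}

Question: {question}<end_of_turn>

<start_of_turn>model"""

MODEL_MARKER = "<start_of_turn>model"


def stuff_prompt(chunks: list[str], question: str) -> str:
    """Paste all retrieved chunks into the single {context} slot."""
    context = "\n\n".join(chunks)  # assumed separator between chunks
    return prompt_template.format(context=context, question=question)


def strip_echoed_prompt(full_text: str) -> str:
    """Keep only the text after the last model-turn marker."""
    return full_text.rsplit(MODEL_MARKER, 1)[-1].strip()


chunks = ["3.1 Issues in Cap3D", "4.1 Correction of Cap3D Captions"]
full_prompt = stuff_prompt(chunks, "What is an issue with Cap3D?")

# Placeholder standing in for a model reply that echoes the prompt.
echoed = full_prompt + "\nPLACEHOLDER ANSWER"
print(strip_echoed_prompt(echoed))
```

Stripping the echoed prompt this way would leave only the answers to print, which makes the four results easier to compare.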

Conclusion:
- Question 1: According to the context, an issue with Cap3D is that it has problems with rendering all-grey images.
- Question 2: The passage does not specify what a benefit of synthetic data is, so I cannot answer this question from the provided context.
- Question 3: It presents the program in the form of an animation and describes the changes made by the edit generator and state evaluator
- Question 4: I cannot answer this question from the context because the context does not provide information about the main character of the novel.

The model does a lot better at answering the questions when given the retrieved context! The answers to questions 1 and 3 are particularly good. Question 4 is meant to be irrelevant, so declining to answer was the expected output. Question 2, however, is a case where I think the model could have found an answer if it had been given more context, but it could not with what was retrieved.
