# RAG + LLM Assessment

Your task is to create a Retrieval-Augmented Generation (RAG) system using a Large Language Model (LLM). The RAG system should be able to retrieve relevant information from a knowledge base and generate coherent and informative responses to user queries.

Steps:

1. Choose a domain and collect a suitable dataset of documents (at least 5 documents - PDFs or HTML pages) to serve as the knowledge base for your RAG system. Select one of the following topics:
   * latest scientific papers from arxiv.org,
   * fiction books released,
   * legal documents, or
   * social media posts.

   Make sure that the documents are newer than the training dataset of the applied LLM. (20 points)

2. Create three relevant prompts to the dataset, and one irrelevant prompt. (20 points)

3. Load an LLM with at least 5B parameters. (10 points)

4. Test the LLM with your prompts. The goal should be that without the collected dataset your model is unable to answer the question. If it gives you a good answer, select another question to answer and maybe a different dataset. (10 points)

5. Create a LangChain-based RAG system by setting up a vector database from the documents. (20 points)

6. Provide your three relevant and one irrelevant prompts to your RAG system. For the relevant prompts, your RAG system should return relevant answers, and for the irrelevant prompt, an empty answer. (20 points)


In [1]:
# Quote the version specifiers so the shell does not treat ">=" as a redirect; discard pip's output
!pip install "transformers>=4.32.0" "optimum>=1.12.0" > /dev/null
!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ > /dev/null
!pip install langchain > /dev/null
!pip install chromadb > /dev/null
!pip install sentence_transformers > /dev/null # ==2.2.2
!pip install unstructured > /dev/null
!pip install pdf2image > /dev/null
!pip install pdfminer.six > /dev/null
!pip install unstructured-pytesseract > /dev/null
!pip install unstructured-inference > /dev/null
!pip install faiss-gpu > /dev/null
!pip install pikepdf > /dev/null
!pip install pypdf > /dev/null
!pip install accelerate > /dev/null
!pip install pillow_heif > /dev/null
!pip install -i https://pypi.org/simple/ bitsandbytes > /dev/null

In [None]:
import os
# Kill the kernel process so the runtime restarts and picks up the freshly installed packages
os.kill(os.getpid(), 9)

In [1]:
from huggingface_hub import login
from langchain.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline, BitsAndBytesConfig
from textwrap import fill
from langchain.prompts import PromptTemplate
import locale
from langchain.document_loaders import UnstructuredURLLoader
from langchain.vectorstores.utils import filter_complex_metadata # 'filter_complex_metadata' removes complex metadata that are not in str, int, float or bool format
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

from langchain.chains import RetrievalQA

locale.getpreferredencoding = lambda: "UTF-8"

# Set your private Hugging Face User Access Token (UAT) here
# to access gated models whose license you have accepted
HUGGINGFACE_UAT = ""  # token removed before sharing
login(HUGGINGFACE_UAT)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


## 1. Topic: Latest scientific papers about security and cryptography from arxiv.org (20 points)


In [36]:
# All papers were published after April 28th, 2024
# "IDPFilter: Mitigating Interdependent Privacy Issues in Third-Party Apps"
!wget -O document1.pdf --no-check-certificate "https://arxiv.org/pdf/2405.01411"

# "KDPrint: Passive Authentication using Keystroke Dynamics-to-Image Encoding via Standardization"
!wget -O document2.pdf --no-check-certificate "https://arxiv.org/pdf/2405.01080"

# "Learnable Linguistic Watermarks for Tracing Model Extraction Attacks on Large Language Models"
!wget -O document3.pdf --no-check-certificate "https://arxiv.org/pdf/2405.01509"

# "Hiding Sensitive Information Using PDF Steganography"
!wget -O document4.pdf --no-check-certificate "https://arxiv.org/pdf/2405.00865"

# "Error Correction Capabilities of Non-Linear Cryptographic Hash Functions"
!wget -O document5.pdf --no-check-certificate "https://arxiv.org/pdf/2405.01495"

# "Purify Unlearnable Examples via Rate-Constrained Variational Autoencoders"
!wget -O document6.pdf --no-check-certificate "https://arxiv.org/pdf/2405.01460"

# "Towards Trust Proof for Secure Confidential Virtual Machines"
!wget -O document7.pdf --no-check-certificate "https://arxiv.org/pdf/2405.01030"


--2024-05-06 00:41:50--  https://arxiv.org/pdf/2405.01411
Resolving arxiv.org (arxiv.org)... 151.101.131.42, 151.101.3.42, 151.101.67.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.131.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25297840 (24M) [application/pdf]
Saving to: ‘document1.pdf’


2024-05-06 00:41:50 (188 MB/s) - ‘document1.pdf’ saved [25297840/25297840]

--2024-05-06 00:41:50--  https://arxiv.org/pdf/2405.01080
Resolving arxiv.org (arxiv.org)... 151.101.131.42, 151.101.3.42, 151.101.67.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.131.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12161435 (12M) [application/pdf]
Saving to: ‘document2.pdf’


2024-05-06 00:41:50 (159 MB/s) - ‘document2.pdf’ saved [12161435/12161435]

--2024-05-06 00:41:51--  https://arxiv.org/pdf/2405.01509
Resolving arxiv.org (arxiv.org)... 151.101.131.42, 151.101.3.42, 151.101.67.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.131.
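
As a quick sanity check on the recency requirement, new-style arXiv identifiers start with a `YYMM` prefix, so the submission month can be read directly from each URL. The sketch below is only illustrative: the helper name `arxiv_submission_month` and the `cutoff` value are assumptions, not part of the notebook, and the real cutoff has to be taken from the documentation of the model you load.

```python
import re

# New-style arXiv identifiers look like "YYMM.NNNNN", optionally followed by a version ("v2").
_ARXIV_ID = re.compile(r"(\d{2})(\d{2})\.\d{4,5}(v\d+)?$")


def arxiv_submission_month(url_or_id):
    """Return the (year, month) encoded in a new-style arXiv identifier."""
    match = _ARXIV_ID.search(url_or_id)
    if match is None:
        raise ValueError(f"no new-style arXiv identifier found in {url_or_id!r}")
    year = 2000 + int(match.group(1))
    month = int(match.group(2))
    if not 1 <= month <= 12:
        raise ValueError(f"invalid month in arXiv identifier: {url_or_id!r}")
    return year, month


urls = [
    "https://arxiv.org/pdf/2405.01411",
    "https://arxiv.org/pdf/2405.01080",
    "https://arxiv.org/pdf/2405.01509",
    "https://arxiv.org/pdf/2405.00865",
    "https://arxiv.org/pdf/2405.01495",
    "https://arxiv.org/pdf/2405.01460",
    "https://arxiv.org/pdf/2405.01030",
]

# Hypothetical placeholder: replace with the documented training cutoff of your model.
cutoff = (2023, 12)

for url in urls:
    submitted = arxiv_submission_month(url)
    status = "newer than cutoff" if submitted > cutoff else "NOT newer than cutoff"
    print(url, submitted, status)
```

Tuples compare element by element, so `(2024, 5) > (2023, 12)` checks the year first and the month second.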

## 2. Create three relevant prompts to the dataset, and one irrelevant prompt. (20 points)

In [71]:
question_1 = "Explain what IDPFilter is, as proposed in a paper published in 2024?"
question_2 = "How can one cryptographically hide malwares in a PDF document?"
question_3 = "What is the benefit of encoding keystokes as images for training keystroke authentication mechanism, as proposed in a paper published in 2024?"
irrelevant_question = "Where is the name of the U.S. President?"

## 3. Load an LLM with at least 5B parameters. (10 points)

In [4]:
model_name = "meta-llama/Meta-Llama-3-8B-Instruct" # 8B language model from Meta AI

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             quantization_config=quantization_config,
                                             trust_remote_code=True)

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

gen_cfg = GenerationConfig.from_pretrained(model_name)
gen_cfg.max_new_tokens=512
gen_cfg.temperature=0.0000001 # 0.0 # For RAG we would like to have deterministic answers
gen_cfg.return_full_text=True
gen_cfg.do_sample=True
gen_cfg.repetition_penalty=1.11

pipe=pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    generation_config=gen_cfg
)

llm_original = HuggingFacePipeline(pipeline=pipe) # Original model


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
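
To see why `load_in_8bit=True` helps on a single GPU, here is a rough back-of-the-envelope estimate of the memory needed for the weights alone. It is a sketch under stated assumptions: "8B" is taken as a round 8 billion parameters, and activations, the KV cache, and any quantization overhead are ignored, so real usage will be higher.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate memory for the model weights only, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


# Assumption: "8B" treated as exactly 8e9 parameters for the estimate.
n_params = 8e9

for label, bits in [("float16", 16), ("int8", 8)]:
    print(f"{label}: ~{weight_memory_gib(n_params, bits):.1f} GiB")
```

Halving the bits per parameter halves the weight footprint, which is what makes the model fit on smaller accelerators.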


## 4. Test the LLM with your prompts. The goal should be that without the collected dataset your model is unable to answer the question. If it gives you a good answer, select another question to answer and maybe a different dataset. (10 points)

In [72]:
template_original = """
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

{text}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

prompt_original = PromptTemplate(
    input_variables=["text"],
    template=template_original,
)

In [73]:
result = llm_original(prompt_original.format(text=question_1))
print(fill(result.strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>user<|end_header_id|>  Explain what IDPFilter is, as proposed in
a paper published in 2024?<|eot_id|><|start_header_id|>assistant<|end_header_id|> I apologize, but
since the paper you're referring to was published in 2024, it's likely that I don't have access to
that information yet. As of my training data cutoff in 2022, there may not be any papers published
in 2024.  However, if you can provide me with more context or details about the paper, such as its
title, authors, or a brief summary, I'd be happy to try and help you understand what IDPFilter is.
Alternatively, you can also try searching for the paper online or checking academic databases like
Google Scholar or ResearchGate to see if they have any information on the topic.
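
The printed result echoes the full prompt before the model's reply. If only the reply is wanted, one option is to cut the text at the last assistant header. This is a minimal sketch; `extract_answer` is a hypothetical helper, not something the notebook or LangChain provides.

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"


def extract_answer(full_text):
    """Return the text after the last assistant header, or the whole text if no header is present."""
    _, header, answer = full_text.rpartition(ASSISTANT_HEADER)
    return answer.strip() if header else full_text.strip()


example = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "What is RAG?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"
    "Retrieval-Augmented Generation."
)
print(extract_answer(example))
```

It could be applied as `print(fill(extract_answer(result), width=100))` in the cells of this section.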


In [74]:
result = llm_original(prompt_original.format(text=question_2))
print(fill(result.strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>user<|end_header_id|>  How can one cryptographically hide
malwares in a PDF document?<|eot_id|><|start_header_id|>assistant<|end_header_id|> I cannot provide
information or guidance on illegal or harmful activities, including hiding malware in a PDF
document. Can I help you with something else?


In [75]:
result = llm_original(prompt_original.format(text=question_3))
print(fill(result.strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>user<|end_header_id|>  What is the benefit of encoding keystokes
as images for training keystroke authentication mechanism, as proposed in a paper published in
2024?<|eot_id|><|start_header_id|>assistant<|end_header_id|> I apologize, but since I'm an AI
trained until 2021, I don't have information about papers published in 2024. Additionally, I
couldn't find any specific information on encoding keystrokes as images for training keystroke
authentication mechanisms.  However, I can provide some general insights on keystroke authentication
and image-based approaches:  Keystroke authentication is a biometric method that analyzes the unique
patterns of typing behavior, such as timing, pressure, and movement, to verify an individual's
identity. Encoding keystrokes as images could potentially be used to create a visual representation
of the typing pattern, which might help improve the accuracy or robustness of the authentication
system.  Some potential bene

## 5. Create a LangChain-based RAG system by setting up a vector database from the documents. (20 points)



In [85]:
template = """
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Use the following context to answer the question at the end. Do not use any other information. If there is no relevant information in the context, just say I don't have enough information. Don't try to make up an answer.

{context}<|eot_id|><|start_header_id|>user<|end_header_id|>

{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

prompt = PromptTemplate(
    input_variables=["context", "question"],  # must match the {context} and {question} placeholders
    template=template,
)
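
To make the two placeholders concrete: with the `stuff` chain type used later in this notebook, the retrieved chunks are concatenated into a single string that replaces `{context}`, and the user query replaces `{question}`. The sketch below mimics that with plain string formatting. `build_rag_prompt` is a hypothetical helper, and the `"\n\n"` separator is an assumption about how the chunks are joined.

```python
RAG_TEMPLATE = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "Use the following context to answer the question at the end. "
    "Do not use any other information. If there is no relevant information "
    "in the context, just say I don't have enough information. "
    "Don't try to make up an answer.\n\n"
    "{context}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
    "{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"
)


def build_rag_prompt(chunks, question, separator="\n\n"):
    """Join the retrieved chunks into one context string and fill the template."""
    return RAG_TEMPLATE.format(context=separator.join(chunks), question=question)


print(build_rag_prompt(["First retrieved chunk.", "Second retrieved chunk."], "What is IDPFilter?"))
```

With an empty list of chunks the context section is simply blank, so the system instruction is the only thing steering the model toward "I don't have enough information".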

In [77]:
from langchain.document_loaders import UnstructuredPDFLoader
import os
loaders = [UnstructuredPDFLoader(fn) for fn in ["/content/document1.pdf",
                                                "/content/document2.pdf",
                                                "/content/document3.pdf",
                                                "/content/document4.pdf",
                                                "/content/document5.pdf",
                                                "/content/document6.pdf",
                                                "/content/document7.pdf"]]

In [78]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# One splitter is reused for all documents: chunks of up to 1024 characters with a 256-character overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=256)
chunked_pdf_doc = []

for loader in loaders:
    print("Loading raw document..." + loader.file_path)
    pdf_doc = loader.load()
    # Drop metadata values that are not str, int, float or bool
    updated_pdf_doc = filter_complex_metadata(pdf_doc)
    print("Splitting text...")
    documents = text_splitter.split_documents(updated_pdf_doc)
    chunked_pdf_doc.extend(documents)

len(chunked_pdf_doc)

Loading raw document.../content/document1.pdf
Splitting text...
Loading raw document.../content/document2.pdf
Splitting text...
Loading raw document.../content/document3.pdf
Splitting text...
Loading raw document.../content/document4.pdf
Splitting text...
Loading raw document.../content/document5.pdf
Splitting text...
Loading raw document.../content/document6.pdf
Splitting text...
Loading raw document.../content/document7.pdf
Splitting text...


542
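
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a fixed-width sliding window. This is a simplified stand-in, not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraph and line breaks); `sliding_chunks` is a hypothetical name.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-width windows where consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.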

In [79]:
%%time
embeddings = HuggingFaceEmbeddings()
db_pdf = FAISS.from_documents(chunked_pdf_doc, embeddings)



CPU times: user 11.7 s, sys: 160 ms, total: 11.9 s
Wall time: 12.7 s
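
Conceptually, the vector store maps every chunk to an embedding vector and answers a query by returning the chunks whose vectors are closest to the query vector. The brute-force sketch below shows that idea with cosine similarity on toy 2-D vectors. It is an assumption-laden illustration: FAISS uses optimized indexes, and the metric and scoring it applies in this notebook may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k):
    """Return (index, score) pairs for the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]


# Toy embeddings standing in for real sentence-embedding vectors.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k([1.0, 0.0], doc_vecs, k=2))
```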


In [86]:
llm = HuggingFacePipeline(pipeline=pipe) # Same model, not retrained; the retriever supplies context from the seven papers
Chain_pdf = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db_pdf.as_retriever(search_type="similarity_score_threshold", search_kwargs={'k': 4, 'score_threshold': 0.2}),
    chain_type_kwargs={"prompt": prompt},
)
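
The `similarity_score_threshold` search type is what allows the retriever to return nothing for an off-topic question: at most `k` chunks are kept, and only those whose relevance score reaches `score_threshold`. The sketch below shows that filtering step in isolation. `filter_by_threshold` is a hypothetical helper, and it assumes relevance scores where higher means more relevant.

```python
def filter_by_threshold(scored_chunks, k=4, score_threshold=0.2):
    """Keep at most k chunks, best first, dropping any whose score is below the threshold."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, score in ranked[:k] if score >= score_threshold]


# Made-up scores for illustration only.
relevant_query = [("chunk about IDPFilter", 0.62), ("chunk about hashing", 0.15), ("chunk about APIs", 0.41)]
irrelevant_query = [("chunk about IDPFilter", 0.05), ("chunk about hashing", 0.02)]

print(filter_by_threshold(relevant_query))
print(filter_by_threshold(irrelevant_query))
```

When every score falls below the threshold the context is empty, which is consistent with the blank context visible in the output for the irrelevant question at the end of this notebook.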

## 6. Provide your three relevant and one irrelevant prompts to your RAG system. For the relevant prompts, your RAG system should return relevant answers, and for the irrelevant prompt, an empty answer. (20 points)



In [87]:
result = Chain_pdf.invoke(question_1)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not use any other information. If there is no relevant information in the
context, just say I don't have enough information. Don't try to make up an answer.  5.5 IDPFilter:
mitigation scheme  Our IDP mitigation scheme comprises three distinct interactions with regard to a
given pair (app, user) during IDPFilter’s operation (see Fig. 8):  1. Filter setup. Initially, the
IDP service provides a standard API to all inter- ested third-party applications, encompassing
various filtering schemes under  IDPFilter: Mitigating Interdependent Privacy Issues in Third-Party
Apps  Fig. 6. Overview of the IDPFilter API framework  different constraints. Third-party
application developers can select and inte- grate the services that best align with their
requirements. The Filter Setup can be done at any time the users use the service.  23 third parties
are stakeholders outside of 

In [88]:
result = Chain_pdf.invoke(question_2)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not use any other information. If there is no relevant information in the
context, just say I don't have enough information. Don't try to make up an answer.  6 Conclusion  By
utilizing an adaptable strategy of least-significant bit insertion on string representations of
floating- point operands to perform imperceptible changes to PDF stream operators, this paper
presents a novel PDF steganography method. Our method has a higher carrying capacity than previous
PDF operator-based methods due to the use of all viable operators found within a PDF file. Our case
study of embedding malware into a given PDF cover file justifies effectiveness of our proposed
approach. Acknowledgments. This work was supported by US National Science Foundation under grant
number CNS-1928349, CNS-2225424, and 2043022-DGE.  References  [1] Document management — Portable
document format — Par

In [89]:
result = Chain_pdf.invoke(question_3)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not use any other information. If there is no relevant information in the
context, just say I don't have enough information. Don't try to make up an answer.  Figure 1: The
overall user authentication process of the proposed system.  3  KDPrint: Passive Authentication
using Keystroke Dynamics-to-Image Encoding via Standardization  3.1 System overview  10  KDPrint:
Passive Authentication using Keystroke Dynamics-to-Image Encoding via Standardization  6 Conclusion
5  (1)  KDPrint: Passive Authentication using Keystroke Dynamics-to-Image Encoding via
Standardization  indicates the size of the marker (*) at that position. Color represents the color
of the marker, with three channels for red, green, and blue. Lastly, Opacity indicates the
transparency of the marker. Due to the standardization of data, in the ideal user input, all
keystrokes converge to the origin, resu

In [90]:
result = Chain_pdf.invoke(irrelevant_question)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not use any other information. If there is no relevant information in the
context, just say I don't have enough information. Don't try to make up an answer.
<|eot_id|><|start_header_id|>user<|end_header_id|>  Where is the name of the U.S.
President?<|eot_id|><|start_header_id|>assistant<|end_header_id|> I don't have enough information.
