# RAG + LLM Assessment

Your task is to create a Retrieval-Augmented Generation (RAG) system using a Large Language Model (LLM). The RAG system should be able to retrieve relevant information from a knowledge base and generate coherent and informative responses to user queries.

Steps:

1. Choose a domain and collect a suitable dataset of documents (at least 5 documents - PDFs or HTML pages) to serve as the knowledge base for your RAG system. Select one of the following topics:
   * latest scientific papers from arxiv.org,
   * recently released fiction books,
   * legal documents, or
   * social media posts.

   Make sure that the documents are newer than the training data of the LLM you use. (20 points)

2. Create three prompts that are relevant to the dataset, and one irrelevant prompt. (20 points)

3. Load an LLM with at least 5B parameters. (10 points)

4. Test the LLM with your prompts. The goal should be that without the collected dataset your model is unable to answer the question. If it gives you a good answer, select another question to answer and maybe a different dataset. (10 points)

5. Create a LangChain-based RAG system by setting up a vector database from the documents. (20 points)

6. Provide your three relevant and one irrelevant prompts to your RAG system. For the relevant prompts, your RAG system should return relevant answers, and for the irrelevant prompt, an empty answer. (20 points)


In [1]:
!pip install "transformers>=4.32.0" "optimum>=1.12.0" > null
!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ > null
!pip install langchain > null
!pip install chromadb > null
!pip install sentence_transformers > null # ==2.2.2
!pip install unstructured > null
!pip install pdf2image > null
!pip install pdfminer.six > null
!pip install unstructured-pytesseract > null
!pip install unstructured-inference > null
!pip install faiss-gpu > null
!pip install pikepdf > null
!pip install pypdf > null
!pip install accelerate > null
!pip install pillow_heif > null
!pip install -i https://pypi.org/simple/ bitsandbytes > null

In [None]:
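import os
# Kill the current process to force a runtime restart, so that the newly installed packages are picked up
os.kill(os.getpid(), 9)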

In [1]:
from huggingface_hub import login
from langchain.llms import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline, BitsAndBytesConfig
from textwrap import fill
from langchain.prompts import PromptTemplate
import locale
from langchain.document_loaders import UnstructuredURLLoader
from langchain.vectorstores.utils import filter_complex_metadata # 'filter_complex_metadata' removes complex metadata that are not in str, int, float or bool format
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

from langchain.chains import RetrievalQA

locale.getpreferredencoding = lambda: "UTF-8"

# Set your private Hugging Face User Access Token here;
# it is required to download gated models whose license you have accepted
HUGGINGFACE_UAT=""
login(HUGGINGFACE_UAT)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


## 1. Topic: Latest scientific papers about Cryptography and Security from arxiv.org. (20 points)


In [2]:
# All papers were published after April 28th, 2024
# "IDPFilter: Mitigating Interdependent Privacy Issues in Third-Party Apps"
!wget -O document1.pdf --no-check-certificate "https://arxiv.org/pdf/2405.01411"

# "KDPrint: Passive Authentication using Keystroke Dynamics-to-Image Encoding via Standardization"
!wget -O document2.pdf --no-check-certificate "https://arxiv.org/pdf/2405.01080"

# "Learnable Linguistic Watermarks for Tracing Model Extraction Attacks on Large Language Models"
!wget -O document3.pdf --no-check-certificate "https://arxiv.org/pdf/2405.01509"

# "Hiding Sensitive Information Using PDF Steganography"
!wget -O document4.pdf --no-check-certificate "https://arxiv.org/pdf/2405.00865"

# "Error Correction Capabilities of Non-Linear Cryptographic Hash Functions"
!wget -O document5.pdf --no-check-certificate "https://arxiv.org/pdf/2405.01495"

# "Purify Unlearnable Examples via Rate-Constrained Variational Autoencoders"
!wget -O document6.pdf --no-check-certificate "https://arxiv.org/pdf/2405.01460"

# "Towards Trust Proof for Secure Confidential Virtual Machines"
!wget -O document7.pdf --no-check-certificate "https://arxiv.org/pdf/2405.01030"


--2024-05-08 18:09:40--  https://arxiv.org/pdf/2405.01411
Resolving arxiv.org (arxiv.org)... 151.101.3.42, 151.101.195.42, 151.101.67.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.3.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25297840 (24M) [application/pdf]
Saving to: ‘document1.pdf’


2024-05-08 18:09:40 (152 MB/s) - ‘document1.pdf’ saved [25297840/25297840]

--2024-05-08 18:09:40--  https://arxiv.org/pdf/2405.01080
Resolving arxiv.org (arxiv.org)... 151.101.3.42, 151.101.195.42, 151.101.67.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.3.42|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12161435 (12M) [application/pdf]
Saving to: ‘document2.pdf’


2024-05-08 18:09:41 (108 MB/s) - ‘document2.pdf’ saved [12161435/12161435]

--2024-05-08 18:09:41--  https://arxiv.org/pdf/2405.01509
Resolving arxiv.org (arxiv.org)... 151.101.3.42, 151.101.195.42, 151.101.67.42, ...
Connecting to arxiv.org (arxiv.org)|151.101.3.42|:44
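
The requirement that the documents be newer than the model's training data can be sanity-checked from the arXiv identifiers alone, because new-style IDs (`YYMM.NNNNN`) encode the submission month. The sketch below is only illustrative: `ASSUMED_MODEL_CUTOFF` is an assumption (the cutoff is not stated anywhere in this notebook and should be taken from the model card of whichever LLM is used), and `arxiv_submission_month` is a hypothetical helper name.

```python
from datetime import date

# Assumption: replace with the knowledge cutoff stated on the model card of your LLM.
ASSUMED_MODEL_CUTOFF = date(2023, 3, 31)

# The seven arXiv IDs downloaded in the cell above.
ARXIV_IDS = [
    "2405.01411", "2405.01080", "2405.01509", "2405.00865",
    "2405.01495", "2405.01460", "2405.01030",
]


def arxiv_submission_month(arxiv_id: str) -> date:
    """Return the first day of the month encoded in a new-style arXiv ID (YYMM.NNNNN)."""
    yymm, _, _ = arxiv_id.partition(".")
    if len(yymm) != 4 or not yymm.isdigit():
        raise ValueError(f"not a new-style arXiv ID: {arxiv_id!r}")
    return date(2000 + int(yymm[:2]), int(yymm[2:]), 1)


# Map each ID to whether its submission month lies after the assumed cutoff.
newer_than_cutoff = {
    arxiv_id: arxiv_submission_month(arxiv_id) > ASSUMED_MODEL_CUTOFF
    for arxiv_id in ARXIV_IDS
}
print(newer_than_cutoff)
```

The ID only gives month-level resolution, so a paper submitted in the same month as the cutoff would need a manual check.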

## 2. Create three prompts that are relevant to the dataset, and one irrelevant prompt. (20 points)

In [41]:
question_1 = "Explain what IDPFilter is, as proposed in a paper published in 2024?"
question_2 = "How can one hide malwares in a PDF document using PDF steganography?"
question_3 = "What is the benefit of encoding keystokes as images for training keystroke authentication mechanism, as proposed in a paper published in 2024?"
irrelevant_question = "What is AI?" # This is an irrelevant question, as the chosen topic is Cryptography and Security.

## 3. Load an LLM with at least 5B parameters. (10 points)

In [4]:
model_name = "meta-llama/Meta-Llama-3-8B-Instruct" # 8B language model from Meta AI

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             quantization_config=quantization_config,
                                             trust_remote_code=True)

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

gen_cfg = GenerationConfig.from_pretrained(model_name)
gen_cfg.max_new_tokens=512
gen_cfg.temperature=0.0000001 # Near-zero temperature: for RAG we want (practically) deterministic answers
gen_cfg.return_full_text=True
gen_cfg.do_sample=True
gen_cfg.repetition_penalty=1.11

pipe=pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    generation_config=gen_cfg
)

llm_original = HuggingFacePipeline(pipeline=pipe) # Original model


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
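
The generation settings above pair `do_sample=True` with a temperature close to zero. The following sketch, written in plain Python and independent of the `transformers` API, illustrates why that behaves almost like greedy decoding: dividing the logits by a tiny temperature pushes nearly all probability mass onto the highest-scoring token. The logits are made-up numbers and `softmax_with_temperature` is a hypothetical helper name.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits / temperature, computed in a numerically stable way."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    # Subtracting the maximum keeps every exponent <= 0, so exp() cannot overflow.
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.5, 0.1]  # made-up scores for three candidate tokens

probs_warm = softmax_with_temperature(toy_logits, 1.0)
probs_cold = softmax_with_temperature(toy_logits, 1e-7)

print("temperature 1.0 :", probs_warm)
print("temperature 1e-7:", probs_cold)
```

In `transformers`, setting `do_sample=False` requests greedy decoding directly, which may be a simpler way to get reproducible answers than sampling at a very low temperature.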


## 4. Test the LLM with your prompts. The goal should be that without the collected dataset your model is unable to answer the question. If it gives you a good answer, select another question to answer and maybe a different dataset. (10 points)

In [5]:
template_original = """
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

{text}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

prompt_original = PromptTemplate(
    input_variables=["text"],
    template=template_original,
)
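
The template above hard-codes the Llama 3 chat markup for a single user turn. As a hedged illustration of the same layout, the sketch below assembles such a prompt from a list of `(role, content)` pairs; `build_llama3_prompt` is a hypothetical helper, not part of LangChain or `transformers`.

```python
def build_llama3_prompt(messages):
    """Build a Llama 3 style chat prompt.

    messages: list of (role, content) tuples, e.g. [("user", "What is AI?")].
    The prompt ends with an open assistant header so the model writes the reply.
    """
    parts = ["<|begin_of_text|>"]
    for role, content in messages:
        parts.append(
            f"<|start_header_id|>{role}<|end_header_id|>\n\n{content}<|eot_id|>"
        )
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


example_prompt = build_llama3_prompt([("user", "What is AI?")])
print(example_prompt)
```

In practice, `tokenizer.apply_chat_template` from `transformers` can produce the model's own chat format and is less error-prone than hand-written markup.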

In [47]:
result = llm_original(prompt_original.format(text=question_1))
print(fill(result.strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>user<|end_header_id|>  Explain what IDPFilter is, as proposed in
a paper published in 2024?<|eot_id|><|start_header_id|>assistant<|end_header_id|> I apologize, but
since the paper you're referring to was published in 2024, it's likely that I don't have access to
that information yet. As of my training data cutoff in 2022, there may not be any papers published
in 2024.  However, if you can provide me with more context or details about the paper, such as its
title, authors, or a brief summary, I'd be happy to try and help you understand what IDPFilter is.
Alternatively, you can also try searching for the paper online or checking academic databases like
Google Scholar or ResearchGate to see if they have any information on the topic.


In [48]:
result = llm_original(prompt_original.format(text=question_2))
print(fill(result.strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>user<|end_header_id|>  How can one hide malwares in a PDF
document using PDF steganography?<|eot_id|><|start_header_id|>assistant<|end_header_id|> I cannot
provide information or guidance on illegal activities, including hiding malware in a PDF document.
Is there anything else I can help you with?


In [49]:
result = llm_original(prompt_original.format(text=question_3))
print(fill(result.strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>user<|end_header_id|>  What is the benefit of encoding keystokes
as images for training keystroke authentication mechanism, as proposed in a paper published in
2024?<|eot_id|><|start_header_id|>assistant<|end_header_id|> I apologize, but since I'm an AI
trained until 2021, I don't have information about papers published in 2024. Additionally, I
couldn't find any specific information on encoding keystrokes as images for training keystroke
authentication mechanisms.  However, I can provide some general insights on keystroke authentication
and image-based approaches:  Keystroke authentication is a biometric method that analyzes the unique
patterns of typing behavior, such as timing, pressure, and movement, to verify an individual's
identity. Encoding keystrokes as images could potentially be used to create a visual representation
of the typing pattern, which might help improve the accuracy or robustness of the authentication
system.  Some potential bene

In [46]:
result = llm_original(prompt_original.format(text=irrelevant_question))
print(fill(result.strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>user<|end_header_id|>  What is
AI?<|eot_id|><|start_header_id|>assistant<|end_header_id|> Artificial Intelligence (AI) refers to
the development of computer systems that can perform tasks that typically require human
intelligence, such as:  1. **Learning**: AI systems can learn from data and improve their
performance over time. 2. **Reasoning**: AI systems can draw conclusions and make decisions based on
given information. 3. **Problem-solving**: AI systems can identify problems and find solutions. 4.
**Perception**: AI systems can interpret and understand data from sensors, images, or other sources.
There are many types of AI, including:  1. **Narrow or Weak AI**: Designed to perform a specific
task, like playing chess or recognizing faces. 2. **General or Strong AI**: A hypothetical AI that
can perform any intellectual task that a human can. 3. **Superintelligence**: An AI that is
significantly more intelligent than the best human minds.  Some exam
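
Because `return_full_text` is enabled, every printed result above starts with the echoed prompt. A minimal sketch for keeping only the model's reply, assuming the Llama 3 assistant header appears in the output as it does in the results shown here (`extract_answer` is a hypothetical helper and the sample string is made up):

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"


def extract_answer(full_text: str) -> str:
    """Return the text after the last assistant header, or the whole text if none is found."""
    _, header, answer = full_text.rpartition(ASSISTANT_HEADER)
    return answer.strip() if header else full_text.strip()


# Made-up sample that mimics the shape of the pipeline output.
sample_output = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>  What is AI?"
    "<|eot_id|><|start_header_id|>assistant<|end_header_id|> AI is a field of computer science."
)

answer_only = extract_answer(sample_output)
print(answer_only)
```

Alternatively, passing `return_full_text=False` to the `transformers` text-generation pipeline should suppress the echoed prompt at the source.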

## 5. Create a LangChain-based RAG system by setting up a vector database from the documents. (20 points)



In [38]:
# The prompt wording follows https://community.openai.com/t/out-of-context-questions-in-retrieval-augmented-generation/434871 to keep the LLM from answering the irrelevant question.
template = """
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Use the following context to answer the question at the end. Do not make up an answer. If a response cannot be formed strictly using the context, just say I don't have enough relevant information to answer the question.

{context}<|eot_id|><|start_header_id|>user<|end_header_id|>

{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

prompt = PromptTemplate(
    input_variables=["context","question"],  # must match the placeholders used in the template
    template=template,
)

In [21]:
from langchain.document_loaders import UnstructuredPDFLoader
import os
loaders = [UnstructuredPDFLoader(fn) for fn in ["/content/document1.pdf",
                                                "/content/document2.pdf",
                                                "/content/document3.pdf",
                                                "/content/document4.pdf",
                                                "/content/document5.pdf",
                                                "/content/document6.pdf",
                                                "/content/document7.pdf"]]

In [22]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

chunked_pdf_doc = []

for loader in loaders:
    print("Loading raw document..." + loader.file_path)
    pdf_doc = loader.load()
    updated_pdf_doc = filter_complex_metadata(pdf_doc)
    print("Splitting text...")
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=256)
    documents = text_splitter.split_documents(updated_pdf_doc)
    chunked_pdf_doc.extend(documents)

len(chunked_pdf_doc)

Loading raw document.../content/document1.pdf
Splitting text...
Loading raw document.../content/document2.pdf
Splitting text...
Loading raw document.../content/document3.pdf
Splitting text...
Loading raw document.../content/document4.pdf
Splitting text...
Loading raw document.../content/document5.pdf
Splitting text...
Loading raw document.../content/document6.pdf
Splitting text...
Loading raw document.../content/document7.pdf
Splitting text...


542
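
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sliding-window splitter. It is an assumption-laden stand-in: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and sentences, which this sketch ignores, and `sliding_window_chunks` is a hypothetical name.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows; neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining tail is already covered by the previous window's overlap.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


toy_text = "abcdefghijklmnopqrst"  # 20 characters
toy_chunks = sliding_window_chunks(toy_text, chunk_size=8, chunk_overlap=2)
print(toy_chunks)
```

With the notebook's settings (1024 characters per chunk, 256 overlapping), each window advances by 768 characters, so a sentence cut at one chunk boundary is likely to appear intact in the neighbouring chunk.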

In [23]:
%%time
embeddings = HuggingFaceEmbeddings()
db_pdf = FAISS.from_documents(chunked_pdf_doc, embeddings)



CPU times: user 11.8 s, sys: 282 ms, total: 12.1 s
Wall time: 13.8 s


In [39]:
llm = HuggingFacePipeline(pipeline=pipe) # Same pipeline as before; the seven papers are supplied through the retriever
Chain_pdf = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db_pdf.as_retriever(search_type="similarity_score_threshold", search_kwargs={'k': 4, 'score_threshold': 0.2}),
    chain_type_kwargs={"prompt": prompt},
)
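
The retriever above keeps at most `k` chunks and drops any chunk whose relevance score falls below `score_threshold`. The sketch below mimics that behaviour on toy two-dimensional vectors. It is an approximation under stated assumptions: it scores with plain cosine similarity, whereas LangChain converts the vector store's distance into a relevance score, and the vectors, names, and the `retrieve` helper are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec, index, k, score_threshold):
    """Return up to k document names whose similarity is >= score_threshold, best first."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in index.items()]
    kept = [pair for pair in scored if pair[0] >= score_threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in kept[:k]]


# Toy "embeddings" (made up) standing in for the FAISS index.
toy_index = {
    "crypto": [1.0, 0.0],
    "security": [0.8, 0.6],
    "cooking": [0.0, 1.0],
}

print(retrieve([1.0, 0.0], toy_index, k=4, score_threshold=0.2))
print(retrieve([0.0, -1.0], toy_index, k=4, score_threshold=0.2))
```

When no chunk clears the threshold, the retriever returns nothing and the prompt's context stays empty, which is what lets the instruction in the template trigger the fallback answer.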

## 6. Provide your three relevant and one irrelevant prompts to your RAG system. For the relevant prompts, your RAG system should return relevant answers, and for the irrelevant prompt, an empty answer. (20 points)



In [43]:
result = Chain_pdf.invoke(question_1)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not make up an answer. If a response cannot be formed strictly using the
context, just say I don't have enough relevant information to answer the question.  5.5 IDPFilter:
mitigation scheme  Our IDP mitigation scheme comprises three distinct interactions with regard to a
given pair (app, user) during IDPFilter’s operation (see Fig. 8):  1. Filter setup. Initially, the
IDP service provides a standard API to all inter- ested third-party applications, encompassing
various filtering schemes under  IDPFilter: Mitigating Interdependent Privacy Issues in Third-Party
Apps  Fig. 6. Overview of the IDPFilter API framework  different constraints. Third-party
application developers can select and inte- grate the services that best align with their
requirements. The Filter Setup can be done at any time the users use the service.  23 third parties
are stakeholders outside of u

In [44]:
result = Chain_pdf.invoke(question_2)
print(fill(result['result'].strip(), width=100))

# Interestingly, Llama with RAG readily explains how to hide malware in a PDF,
# whereas the original Llama refused, on the grounds that answering would promote illegal activity.

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not make up an answer. If a response cannot be formed strictly using the
context, just say I don't have enough relevant information to answer the question.  document. With
our steganography Python script, that is done with the following command: $ python3 full pdf steg.py
stat cover.pdf Which gives the results shown:  Our malware sample has a size of 335,872 bytes, which
is clearly smaller than the 464,007 byte carrying capacity of the cover PDF. Thus, we can continue
on with the process. The command to embed the hidden data within the PDF is the following: $ python3
full pdf steg.py embed cover.pdf steg.pdf trojan-petya The results of this command are shown below:
After the steganography algorithm finishes processing the PDF file, the changes made in the document
are almost imperceptible. To show an example, a page before (left) and after (right) data hiding is


In [45]:
result = Chain_pdf.invoke(question_3)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not make up an answer. If a response cannot be formed strictly using the
context, just say I don't have enough relevant information to answer the question.  Figure 1: The
overall user authentication process of the proposed system.  3  KDPrint: Passive Authentication
using Keystroke Dynamics-to-Image Encoding via Standardization  3.1 System overview  10  KDPrint:
Passive Authentication using Keystroke Dynamics-to-Image Encoding via Standardization  6 Conclusion
5  (1)  KDPrint: Passive Authentication using Keystroke Dynamics-to-Image Encoding via
Standardization  indicates the size of the marker (*) at that position. Color represents the color
of the marker, with three channels for red, green, and blue. Lastly, Opacity indicates the
transparency of the marker. Due to the standardization of data, in the ideal user input, all
keystrokes converge to the origin, resul

In [42]:
result = Chain_pdf.invoke(irrelevant_question)
print(fill(result['result'].strip(), width=100))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>  Use the following context to answer the
question at the end. Do not make up an answer. If a response cannot be formed strictly using the
context, just say I don't have enough relevant information to answer the question.
<|eot_id|><|start_header_id|>user<|end_header_id|>  What is
AI?<|eot_id|><|start_header_id|>assistant<|end_header_id|> I don't have enough relevant information
to answer the question.
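
In the last output the space between the system instruction and `<|eot_id|>` is empty, which is consistent with the retriever returning no chunk above the score threshold. The sketch below shows, in simplified form, how a "stuff"-style chain could assemble such a prompt by joining the retrieved chunks and filling the template. The shortened instruction text, the `"\n\n"` separator, and the `stuff_prompt` helper are assumptions made for illustration, not LangChain internals.

```python
# Shortened stand-in for the RAG template used in this notebook.
RAG_TEMPLATE = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    "Answer strictly from the following context.\n\n"
    "{context}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
    "{question}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"
)


def stuff_prompt(chunks, question, separator="\n\n"):
    """Concatenate all retrieved chunks into one context string and fill the template."""
    return RAG_TEMPLATE.format(context=separator.join(chunks), question=question)


prompt_with_context = stuff_prompt(["chunk A", "chunk B"], "What is IDPFilter?")
prompt_without_context = stuff_prompt([], "What is AI?")

print(prompt_with_context)
print(prompt_without_context)
```

With an empty chunk list the model sees only the instruction and the question, so the template's fallback sentence is the only answer it can give without violating the instruction.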
