# Introduction


## Objective

Use Llama3, Langchain and ChromaDB to create a Retrieval Augmented Generation (RAG) system. This will allow us to ask questions about our documents (that were not included in the training data), without fine-tuning the Large Language Model (LLM).
With RAG, given a question, you first perform a retrieval step that fetches the relevant documents from a vector database, a special database in which these documents were indexed. The retrieved text is then passed to the LLM together with the question.

## Definitions

* LLM - Large Language Model  
* Llama3 - LLM from Meta
* Langchain - a framework designed to simplify the creation of applications using LLMs
* Vector database - a database that organizes data through high-dimensional vectors  
* ChromaDB - vector database  
* RAG - Retrieval Augmented Generation (see below for more details about RAG)

## Model details

* **Model**: Llama 3  
* **Variation**: 8b-chat-hf  (8b: 8B parameters; hf: HuggingFace)
* **Version**: V1  
* **Framework**: Transformers  

The Llama3 models are pretrained on 15T+ (more than 15 trillion) tokens and then fine-tuned. They come in sizes from 8 to 70 billion parameters, which makes Llama3 one of the most powerful open-source model families. It is a significant improvement over the Llama2 model.


## What is a Retrieval Augmented Generation (RAG) system?

Large Language Models (LLMs) have proven their ability to understand context and provide accurate answers for various NLP tasks, including summarization and Q&A, when prompted. While they can give very good answers to questions about information they were trained on, they tend to hallucinate when the topic is something they do "not know", i.e. something that was not included in their training data. Retrieval Augmented Generation combines external resources with LLMs. The two main components of a RAG system are therefore a retriever and a generator.  

The retriever can be described as a system that encodes our data so that the relevant parts of it can be easily retrieved when we query it. The encoding is done using text embeddings, i.e. a model trained to create a vector representation of the information. A common choice for implementing a retriever is a vector database. There are multiple vector database options, both open source and commercial; a few examples are ChromaDB, Milvus, FAISS, Pinecone and Weaviate. In this Notebook we will use a local, persistent instance of ChromaDB.
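
To make the retrieval idea concrete, here is a minimal sketch in plain Python. It is only an illustration: the `embed` function is a hypothetical bag-of-words stand-in for a real embedding model, the `toy_chunks` are invented, and a vector database such as ChromaDB uses dense learned embeddings and an index rather than a linear scan.

```python
import math
from collections import Counter

def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query, chunks, k=2):
    """Return the k chunks whose vectors are most similar to the query vector."""
    query_vec = embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, embed(chunk)),
        reverse=True,
    )
    return ranked[:k]

# Invented example chunks, for illustration only
toy_chunks = [
    "notified bodies shall verify the conformity of high-risk AI systems",
    "the weather in Brussels is often rainy",
    "providers shall draw up a real world testing plan",
]
top = retrieve("obligations of notified bodies", toy_chunks, k=1)
print(top)
```

The query shares words only with the first chunk, so that chunk is ranked first. A real embedding model goes further and also matches text that is semantically related without sharing exact words.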

For the generator part, the obvious option is an LLM. In this Notebook we will use a quantized Llama3 model, loaded from the Hugging Face Hub.  

The orchestration of the retriever and generator will be done using Langchain. A specialized function from Langchain allows us to create the retriever-generator chain in one line of code.



# Installations, imports, utils

In [None]:
!pip install transformers==4.33.0 accelerate==0.22.0 einops==0.6.1 langchain==0.0.300 xformers==0.0.21 \
bitsandbytes==0.41.1 sentence_transformers==2.2.2 chromadb==0.4.12


In [None]:
from torch import cuda, bfloat16
import torch
import transformers
from transformers import AutoTokenizer
from time import time
#import chromadb
#from chromadb.config import Settings
from langchain.llms import HuggingFacePipeline
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma

# Initialize model, tokenizer, query pipeline

Define the model, the device, and the `bitsandbytes` configuration.

In [None]:
model_id = 'meta-llama/Meta-Llama-3-8B'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# set quantization configuration to load large model with less GPU memory
# this requires the `bitsandbytes` library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

print(device)

cuda:0
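
As a back-of-the-envelope illustration of why the 4-bit configuration above reduces GPU memory, the sketch below estimates the memory taken by the weights alone. The parameter count is an assumption (rounded to 8 billion), and the helper `weight_memory_gb` is hypothetical. The real footprint is higher, since quantization constants, activations and any layers that are not quantized also take memory.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 8e9  # assumption: about 8 billion parameters, rounded

fp16_gb = weight_memory_gb(n_params, 16)  # 16-bit weights
nf4_gb = weight_memory_gb(n_params, 4)    # 4-bit (NF4) weights

print(f"16-bit weights: ~{fp16_gb:.0f} GB")
print(f"4-bit weights:  ~{nf4_gb:.0f} GB")
```

Under these assumptions the weights shrink by roughly a factor of four, which is what makes it possible to load the model on a single modest GPU.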


In [None]:
!huggingface-cli login



Prepare the model and the tokenizer.

In [None]:
time_start = time()
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    trust_remote_code=True,
    max_new_tokens=1024
)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
time_end = time()
print(f"Prepare model, tokenizer: {round(time_end-time_start, 3)} sec.")




Prepare model, tokenizer: 184.716 sec.


Define the query pipeline.

In [None]:
time_start = time()
query_pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    max_length=1024,
    device_map="auto",
)
time_end = time()
print(f"Prepare pipeline: {round(time_end-time_start, 3)} sec.")

Prepare pipeline: 1.05 sec.


We define a function for testing the pipeline.

In [None]:
def test_model(tokenizer, pipeline, message):
    """
    Run a query through the pipeline and format the result.
    Args:
        tokenizer: the tokenizer
        pipeline: the pipeline
        message: the prompt
    Returns:
        A string with the question, the generated answer and the total time.
    """
    time_start = time()
    sequences = pipeline(
        message,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        max_length=200,)
    time_end = time()
    total_time = f"{round(time_end-time_start, 3)} sec."

    question = sequences[0]['generated_text'][:len(message)]
    answer = sequences[0]['generated_text'][len(message):]

    return f"Question: {question}\nAnswer: {answer}\nTotal time: {total_time}"


## Test the query pipeline

We test the pipeline with queries about the EU AI Act.

In [None]:
from IPython.display import display, Markdown
def colorize_text(text):
    for word, color in zip(["Reasoning", "Question", "Answer", "Total time"], ["blue", "red", "green", "magenta"]):
        text = text.replace(f"{word}:", f"\n\n**<font color='{color}'>{word}:</font>**")
    return text

In [None]:
response = test_model(tokenizer,
                    query_pipeline,
                   "Please explain what is EU AI Act.")
display(Markdown(colorize_text(response)))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.




**<font color='red'>Question:</font>** Please explain what is EU AI Act.


**<font color='green'>Answer:</font>**  What is the purpose of this regulation?
The EU AI Act is a European Union regulation that aims to regulate the development and deployment of artificial intelligence (AI) technologies in the European Union. The regulation was adopted in April 2022 and will enter into force in 2023. The purpose of the regulation is to ensure that AI technologies are developed and deployed in a way that is safe, ethical, and respectful of fundamental rights. The regulation sets out a number of requirements for AI developers and deployers, including a requirement to conduct a risk assessment before deploying an AI system, and a requirement to ensure that AI systems are designed and deployed in a way that is transparent and accountable. The regulation also includes a number of provisions aimed at ensuring that AI technologies are used in a way that is fair and non-discriminatory. The regulation is intended to apply to all AI technologies, regardless of their size or scope, and will apply to both private and public sector


**<font color='magenta'>Total time:</font>** 20.763 sec.

In [None]:
response = test_model(tokenizer,
                    query_pipeline,
                   "In the context of EU AI Act, how is performed the testing of high-risk AI systems in real world conditions?")
display(Markdown(colorize_text(response)))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.




**<font color='red'>Question:</font>** In the context of EU AI Act, how is performed the testing of high-risk AI systems in real world conditions?


**<font color='green'>Answer:</font>**  In particular, how are performed the testing of high-risk AI systems in real world conditions in the case of the AI systems that are developed by the SMEs? Are there any specific provisions that are provided for SMEs?
The testing of high-risk AI systems in real world conditions is carried out by the manufacturer of the AI system, or by the person or entity that has placed the AI system on the market, or by the person or entity that has imported the AI system. In the case of SMEs, the testing of high-risk AI systems in real world conditions is carried out by the manufacturer of the AI system, or by the person or entity that has placed the AI system on the market, or by the person or entity that has imported the AI system.
The testing of high-risk AI systems in real world conditions is carried out by the manufacturer of the AI system, or by


**<font color='magenta'>Total time:</font>** 24.322 sec.

The answer is not really useful. Let's try to build a RAG system specialized in answering questions about the EU AI Act.

# Retrieval Augmented Generation

## Check the model with a HuggingFace pipeline


We check the model with a HuggingFace pipeline wrapped for Langchain, using a query about the meaning of the EU AI Act.

In [None]:
llm = HuggingFacePipeline(pipeline=query_pipeline)

# checking again that everything is working fine
time_start = time()
question = "Please explain what EU AI Act is."
response = llm(prompt=question)
time_end = time()
total_time = f"{round(time_end-time_start, 3)} sec."
full_response =  f"Question: {question}\nAnswer: {response}\nTotal time: {total_time}"
display(Markdown(colorize_text(full_response)))

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.




**<font color='red'>Question:</font>** Please explain what EU AI Act is.


**<font color='green'>Answer:</font>**  How can it be used?
The EU AI Act is a regulation that is currently being discussed in the European Parliament. The goal of the regulation is to ensure that AI is used safely and ethically. The regulation would apply to all AI systems that are used in the EU, including those used in healthcare, education, and other areas.
The regulation would require AI systems to be designed and used in a way that ensures that they are safe and ethical. This would include ensuring that AI systems do not discriminate against people, do not cause harm, and are used in a way that is in line with the EU's values and principles.
The regulation would also require that AI systems are developed and used in a way that is transparent and accountable. This would include ensuring that people are aware of how AI systems are used, and that there are mechanisms in place to ensure that AI systems are used in a way that is consistent with the EU's values and principles.
The regulation would also require that AI systems are developed and used in a way that is secure and resilient. This would include ensuring that AI systems are protected against cyber attacks, and that they are able to withstand disruptions in the event of a natural disaster or other emergency.
The regulation would also require that AI systems are developed and used in a way that is sustainable. This would include ensuring that AI systems are designed in a way that is environmentally friendly, and that they do not contribute to climate change.
The regulation would also require that AI systems are developed and used in a way that is inclusive and accessible. This would include ensuring that AI systems are designed in a way that is accessible to people with disabilities, and that they are designed in a way that is inclusive of all people, regardless of their background or beliefs.
The regulation would also require that AI systems are developed and used in a way that is fair and just. This would include ensuring that AI systems are designed in a way that is fair and just, and that they are designed in a way that is consistent with the EU's values and principles.
The regulation would also require that AI systems are developed and used in a way that is accountable and transparent. This would include ensuring that people are aware of how AI systems are used, and that there are mechanisms in place to ensure that AI systems are used in a way that is consistent with the EU's values and principles.
The regulation would also require that AI systems are developed and used in a way that is secure and resilient. This would include ensuring that AI systems are protected against cyber attacks, and that they are able to withstand disruptions in the event of a natural disaster or other emergency.
The regulation would also require that AI systems are developed and used in a way that is sustainable. This would include ensuring that AI systems are designed in a way that is environmentally friendly, and that they do not contribute to climate change.
The regulation would also require that AI systems are developed and used in a way that is inclusive and accessible. This would include ensuring that AI systems are designed in a way that is accessible to people with disabilities, and that they are designed in a way that is inclusive of all people, regardless of their background or beliefs.
The regulation would also require that AI systems are developed and used in a way that is fair and just. This would include ensuring that AI systems are designed in a way that is fair and just, and that they are designed in a way that is consistent with the EU's values and principles.
The regulation would also require that AI systems are developed and used in a way that is accountable and transparent. This would include ensuring that people are aware of how AI systems are used, and that there are mechanisms in place to ensure that AI systems are used in a way that is consistent with the EU's values and principles.
The regulation would also require that AI systems are developed and used in a way that is secure and resilient. This would include ensuring that AI systems are protected against cyber attacks, and that they are able to withstand disruptions in the event of a natural disaster or other emergency.
The regulation would also require that AI systems are developed and used in a way that is sustainable. This would include ensuring that AI systems are designed in a way that is environmentally friendly, and that they do not contribute to climate change.
The regulation would also require that AI systems are developed and used in a way that is inclusive and accessible. This would include ensuring that AI systems are designed in a way that is accessible to people with disabilities, and that they are designed in a way that is inclusive of all people, regardless of their background or beliefs.
The regulation would also require that AI systems are developed and used in a way that is fair and just. This would include ensuring that AI systems are designed in a way that is fair and just, and that they are designed in a way that is consistent with the EU's values and principles.
The regulation would also require that AI systems are developed and used in a way that is accountable and transparent. This would include ensuring that people are aware of how AI systems are used, and that


**<font color='magenta'>Total time:</font>** 91.868 sec.

In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"
!apt-get -qq install -y graphviz
!pip install pydot



In [None]:
!pip install pypdf



## Ingestion of data using a PDF loader

We will ingest the EU AI Act.

In [None]:
loader = PyPDFLoader("/content/aiact_final_draft.pdf")
documents = loader.load()

## Split data in chunks

We split data in chunks using a recursive character text splitter.

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
all_splits = text_splitter.split_documents(documents)
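
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch of fixed-size chunking with overlap. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which additionally tries to split on separators such as paragraphs, lines and words. The helper `split_with_overlap` is hypothetical and only shows how consecutive chunks share some text.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into character windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The overlap means that a sentence cut at a chunk boundary still appears in full, or nearly so, in one of the two neighbouring chunks, which helps retrieval.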

## Creating Embeddings and Storing in Vector Store

Create the embeddings using a Sentence Transformers model, through the HuggingFace embeddings wrapper.

In [None]:
model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {"device": "cuda"}

embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)


Initialize ChromaDB with the document splits, the embeddings defined previously and with the option to persist it locally.

In [None]:
vectordb = Chroma.from_documents(documents=all_splits, embedding=embeddings, persist_directory="chroma_db")

## Initialize chain

In [None]:
retriever = vectordb.as_retriever()

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    verbose=True
)
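
The `"stuff"` chain type places ("stuffs") all retrieved chunks into a single prompt, together with the question. The sketch below shows the idea in plain Python. The function `build_stuff_prompt` and the prompt wording are illustrative assumptions, not Langchain's actual template.

```python
def build_stuff_prompt(question, retrieved_chunks):
    """Concatenate all retrieved chunks into one context block, followed by the question."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following context to answer the question at the end.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Invented chunks, for illustration only
example_chunks = [
    "Notified bodies shall be organised so as to safeguard independence.",
    "Notified bodies shall take full responsibility for subcontractors.",
]
prompt = build_stuff_prompt(
    "What are the operational obligations of notified bodies?",
    example_chunks,
)
print(prompt)
```

Because all chunks go into one prompt, this approach is simple but limited by the context length of the model: the number of retrieved chunks multiplied by the chunk size has to fit, with room left for the answer.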

## Test the Retrieval-Augmented Generation


We define a test function, that will run the query and time it.

In [None]:
def test_rag(qa, query):

    time_start = time()
    response = qa.run(query)
    time_end = time()
    total_time = f"{round(time_end-time_start, 3)} sec."

    full_response =  f"Question: {query}\nAnswer: {response}\nTotal time: {total_time}"
    display(Markdown(colorize_text(full_response)))

Let's check a few queries.

In [None]:
query = "How is performed the testing of high-risk AI systems in real world conditions?"
test_rag(qa, query)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.




> Entering new RetrievalQA chain...

> Finished chain.




**<font color='red'>Question:</font>** How is performed the testing of high-risk AI systems in real world conditions?


**<font color='green'>Answer:</font>**  The testing of high-risk AI systems in real world conditions shall be performed in accordance with the real world testing plan and the risk management system. The real world testing plan shall be drawn up by the provider or prospective provider and submitted to the market surveillance authority. The risk management system shall be established by the provider or prospective provider. The testing of high-risk AI systems in real world conditions shall be performed in accordance with the real world testing plan and the risk management system. The real world testing plan shall be drawn up by the provider or prospective provider and submitted to the market surveillance authority. The risk management system shall be established by the provider or prospective provider. The testing of high-risk AI systems in real world conditions shall be performed in accordance with the real world testing plan and the risk management system. The real world testing plan shall be drawn up by the provider or prospective provider and submitted to the market surveillance authority. The risk management system shall be established by the provider or prospective provider. The testing of high-risk AI systems in real world conditions shall be performed in accordance with the real world testing plan and the risk management system. The real world testing plan shall be drawn up by the provider or prospective provider and submitted to the market surveillance authority. The risk management system shall be established by the provider or prospective provider. The testing of high-risk AI systems in real world conditions shall be performed in accordance with the real world testing plan and the risk management system. The real world testing plan shall be drawn up by the provider or prospective provider and submitted to the market surveillance authority


**<font color='magenta'>Total time:</font>** 34.173 sec.

In [None]:
query = "What are the operational obligations of notified bodies?"
test_rag(qa, query)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.




> Entering new RetrievalQA chain...

> Finished chain.




**<font color='red'>Question:</font>** What are the operational obligations of notified bodies?


**<font color='green'>Answer:</font>**  The operational obligations of notified bodies are to verify the conformity of high-risk AI systems in accordance with the conformity assessment procedures referred to in Article 43.

Explanation: Notified bodies are responsible for ensuring that high-risk AI systems meet the requirements of the AI Act. They must have the necessary expertise, resources, and infrastructure to carry out their tasks effectively and efficiently. They must also be independent of any providers or operators with an economic interest in the AI systems they assess. This ensures that they remain impartial and objective in their assessments.



**<font color='magenta'>Total time:</font>** 14.799 sec.

## Document sources

Let's check the document sources for the last query run.

In [None]:
docs = vectordb.similarity_search(query)
print(f"Query: {query}")
print(f"Retrieved documents: {len(docs)}")
for doc in docs:
    # Each result is a Document with the chunk text and its metadata
    print("Source: ", doc.metadata['source'])
    print("Text: ", doc.page_content, "\n")

Query: What are the operational obligations of notified bodies?
Retrieved documents: 4
Source:  /content/aiact_final_draft.pdf
Text:  5.
 
Notified bodies shall be organised and operated so as to safeguard the independence, 
objectivity and impartiality of their activities. Notified b
odies shall document and 
implement a structure and procedures to safeguard impartiality and to promote and apply 
the principles of impartiality throughout their organisation, personnel and assessment 
activities.
 
6.
 
Notified bodies shall have documented pro
cedures in place ensuring that their personnel, 
committees, subsidiaries, subcontractors and any associated body or personnel of external 

Source:  /content/aiact_final_draft.pdf
Text:  authority accordingly.
 
2.
 
Notified bodies
 
shall take full responsibility for the tasks performed by subcontractors or 
subsidiaries wherever these are established.
 
3.
 
Activities may be subcontracted or carried out by a subsidiary only with the agreemen

# Conclusions


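We used Langchain, ChromaDB and Llama3 as the LLM to build a Retrieval Augmented Generation solution. For testing, we used the EU AI Act from 2023.  
With RAG, the answers to questions about the EU AI Act are based on passages retrieved from the document itself, which makes them more specific than the answers of the model alone, although some of them are still repetitive.  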

To improve the solution, we will have to refine the RAG implementation, first by optimizing the embeddings, then by using more complex RAG schemes.





