<a href="https://colab.research.google.com/github/tinkvu/LLM-EU-AI-Act/blob/main/Notebooks/RAG_Implementation_using_OrcawiseLLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What is a Retrieval Augmented Generation (RAG) system?

Large Language Models (LLMs) have showcased their ability to grasp contextual information and deliver accurate responses across various Natural Language Processing (NLP) tasks, such as summarization and question answering, when prompted. While these models excel in providing precise answers related to the information they were trained on, they often struggle with 'hallucinating' responses when confronted with topics not present in their training data. Retrieval Augmented Generation bridges the gap by amalgamating external resources with LLMs. A RAG system primarily comprises two core components: a retriever and a generator.

The retriever component serves as a system capable of encoding data in a manner that facilitates the retrieval of relevant information when queried. This encoding leverages text embeddings, involving a model trained to generate vector representations of the information. An optimal approach for implementing a retriever is to employ a vector database. Numerous options exist, including both open-source and commercial products. Examples include **ChromaDB, Mevius, FAISS, Pinecone, and Weaviate**. In this Notebook, we will utilize a local instance of `ChromaDB`.

Regarding the generator component, the LLM in this Notebook that we are gonna to utilize is the `LLaMA v2` model available through HuggingFace (You may also access the model from the Kaggle Models collection if you prefer).

The coordination of the retriever and generator will be streamlined through Langchain. A specialized function within Langchain allows us to construct the retriever-generator combination with just one line of code.

# Installations and Imports

In [1]:
# Install specific package versions required for our RAG task using pip
!pip install torch==2.0.1 transformers==4.33.0 accelerate==0.24.1 einops==0.7.0 langchain==0.0.326 bitsandbytes==0.41.1 xformers==0.0.21 sentence_transformers==2.2.2 chromadb==0.4.15

Collecting torch==2.0.1
  Downloading torch-2.0.1-cp310-cp310-manylinux1_x86_64.whl (619.9 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m619.9/619.9 MB[0m [31m1.7 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting transformers==4.33.0
  Downloading transformers-4.33.0-py3-none-any.whl (7.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.6/7.6 MB[0m [31m97.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting accelerate==0.24.1
  Downloading accelerate-0.24.1-py3-none-any.whl (261 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m261.4/261.4 kB[0m [31m33.8 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting einops==0.7.0
  Downloading einops-0.7.0-py3-none-any.whl (44 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m44.6/44.6 kB[0m [31m6.5 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting langchain==0.0.326
  Downloading langchain-0.0.326-py3-none-any.whl (1.9 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

In [6]:
pip install peft

Collecting peft
  Downloading peft-0.11.1-py3-none-any.whl (251 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/251.6 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [91m━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m[90m━━━━━━━━━━━━━━━━━━━━[0m [32m122.9/251.6 kB[0m [31m3.4 MB/s[0m eta [36m0:00:01[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m251.6/251.6 kB[0m [31m4.2 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: peft
Successfully installed peft-0.11.1


In [14]:
# Import necessary libraries and modules
from torch import cuda, bfloat16
import torch
import transformers
from transformers import AutoTokenizer, T5Tokenizer
from time import time
from langchain.llms import HuggingFacePipeline
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings, OpenAIEmbeddings
from langchain.chains import RetrievalQA
from langchain.vectorstores import FAISS

In [7]:
from peft import PeftModel, PeftConfig
from transformers import AutoModelForSeq2SeqLM


#### NOTE: To download the LLama 2 model, you must first log in to your Hugging Face account. However, please ensure that you have obtained the necessary model access permissions from both Meta and HuggingFace prior to doing so.

In [4]:
# Log in to your Hugging Face model hub account using your access token
from huggingface_hub import notebook_login
notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

# Load our Model

In [8]:

config = PeftConfig.from_pretrained("Orcawise/eu_ai_act_orcawise_july12")
base_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

# Define the model to be used from the Hugging Face model hub
model = PeftModel.from_pretrained(base_model, "Orcawise/eu_ai_act_orcawise_july12")

# Determine the device for inference (GPU if available, else CPU)
device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# Configure quantization settings for efficient GPU memory usage through 'bitsandbytes' library
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

adapter_config.json:   0%|          | 0.00/603 [00:00<?, ?B/s]



config.json:   0%|          | 0.00/1.40k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/990M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

adapter_model.safetensors:   0%|          | 0.00/7.12M [00:00<?, ?B/s]

In [9]:
device

'cuda:0'

In [10]:
# Define the model text generation parameters
generate_kwargs = dict(
#         streamer=streamer,
#         max_new_tokens= 1024,
        do_sample=True,
        top_p= 0.9,
        top_k= 50,
        temperature= 0.6,
        num_beams=1,
        repetition_penalty= 1.2,
    )

In [18]:
# Record the starting time
time_1 = time()

# Load the tokenizer for the specified model
#tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer = T5Tokenizer.from_pretrained('google/flan-t5-base')
# Create a text generation pipeline with specified settings
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    model_kwargs = generate_kwargs
)

# Record the ending time and calculate the preparation time
time_2 = time()
print(f"Prepare pipeline: {round(time_2-time_1, 3)} sec.")

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/2.54k [00:00<?, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. If you see this, DO NOT PANIC! This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
The model 'PeftModelForSeq2SeqLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'LlamaForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'FalconForCausalLM', 'GitFo

Prepare pipeline: 3.716 sec.


#### Define a function for testing the text generation pipeline.

In [19]:
# This function generates the text answer based on a given prompt using the pipeline, and print
# the result.
def test_model(tokenizer, pipeline, prompt_to_test):
    # Time the text generation process
    time_1 = time()
    # Generate text based on the prompt with specified settings
    sequences = pipeline(
        prompt_to_test,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        max_length=500,
    )
    time_2 = time()
    # Print the time taken for inference
    print(f"Test inference: {round(time_2-time_1, 3)} sec.")
    # Print the generated text
    for seq in sequences:
        print(f"Result: {seq['generated_text']}")

In [23]:
# Test the text generation function with a specific prompt
test_model(tokenizer, pipeline,
           "What is EU AI Act")

Test inference: 42.227 sec.
Result: What is EU AI ActArticle Article  Article AI AI Article Article Article Article Article AI AI Article AI AI Article Article Article Article Article AI AI AI Member Article Article Article Article Article Article Article Article Article AI AI AI Provide AI Rec Article Article Article Article Article Article Article Article AI AI Article Article Provide AI Article Article Article Article Article EU Article Article Article Article Article Provide AI AI Article Article Article Article AI Article Article AI AI Article Article Article Article Article AI AI AI AI  Article Article Article Article Article AI Article Article Article Article Article Article AI AI AI Article EU Member Article Anne Article Anne AI AI AI Article Article Article Article Article Article Article Article Article Article Article Article Article Article Article  Article Article AI AI AI AI AI AI Article AI Article Article Article Article Article Article  AI AI AI Article AI AI Article A

#### Check the model with a HuggingFace pipeline

In [None]:
# Create a HuggingFacePipeline using the previously defined text generation pipeline
llm = HuggingFacePipeline(pipeline=pipeline)

# Use the pipeline to generate text based on a specific prompt
# Check that the model and pipeline are working as expected
llm(prompt="Provide a brief explanation of climate change and its effects on the environment. Summarize it in 100 words.")

# Retrieval Augmented Generation

### Using Text loder for data ingestion

In [None]:
# Define a TextLoader to load documents from a file
# Load some latest data of your choice to test the model for RAG
loader = TextLoader("/kaggle/input/qa-testing/random_mix_news.txt", encoding="utf8")
# In my case, I randomly selected a few topics and incorporated them into the text file.

# Load the documents from the file
documents = loader.load()

### Split data in chunks

In [None]:
# Initialize a text splitter to break down documents into smaller text chunks
# The 'chunk_size' defines the size of each chunk, and 'chunk_overlap' specifies the overlap between chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)

# Split the loaded documents into smaller text chunks
all_splits = text_splitter.split_documents(documents)

### Generating Embeddings and Storing Them in a Vector Store

In [None]:
# Specify the model for generating sentence embeddings
model_name = "sentence-transformers/all-mpnet-base-v2"

# Define model-specific keyword arguments, such as the device for inference
model_kwargs = {"device": "cuda"}

# Create embeddings using the HuggingFace model and the specified settings
embeddings = HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)

### Create a Chroma vector database from the split documents

In [None]:
# Using the specified embeddings and persisting the data in the "chroma_db" directory
vectordb = Chroma.from_documents(documents=all_splits, embedding=embeddings,
                                 persist_directory="chroma_db")

In [None]:
# Create a retriever from the Chroma vector database
retriever = vectordb.as_retriever()

# Initialize a RetrievalQA object with specified settings
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    verbose=True
)

### Conclude by testing the RAG system with inquiries related to your input data

In [None]:
# Define a function to test the RetrievalQA system
def test_rag(qa, query):
    # Print the query being tested
    print(f"Query: {query}\n")

    # Measure the inference time
    time_1 = time()
    result = qa.run(query)

    time_2 = time()

    # Print the inference time and the result
    print(f"Inference time: {round(time_2-time_1, 3)} sec.")
    print("\nResult: ", result)

In [None]:
# Specify a instruction that you want the Llama 2 model to follow for RAG task
instruction = "You must provide only one answer, and it must be based solely on the context provided. Don't try to make up answers or speculate beyond what is provided."

In [None]:
query = "Nam Joo-hyuk was accused of what? Give me a detailed report."
test_rag(qa, query + instruction)

In [None]:
query = "Who is Nam Joo-hyuk?"
test_rag(qa, query + instruction)

In [None]:
query = "Tell me the plot summary of My Demon. Summarize it in 200 words."
test_rag(qa, query + instruction)

In [None]:
query = "Name the cast members of My Demon."
test_rag(qa, query + instruction)

In [None]:
query = "What is 'Big Lemon & Paeroa bottle'? Can you explain in detail including little of its history?"
test_rag(qa, query + instruction)

In [None]:
query = "Can you tell me how to load Llama-2 model faster in several bullet points."
test_rag(qa, query + instruction)