[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/vngabriel/rag-for-llms/blob/main/integrating_vector_database_with_llms.ipynb)

# Integrating a Vector Database with a Large Language Model (LLM)


## Installing dependencies

In [None]:
!pip install -qU \
    datasets==2.14.5 \
    einops==0.6.1 \
    accelerate==0.20.3 \
    chromadb

## Importing dependencies

In [None]:
from pprint import pprint

import chromadb
from chromadb.utils import embedding_functions
from datasets import load_dataset
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

## Running ChromaDB as a remote database (only works on a local machine or in production)

Start the ChromaDB server with the following command:

`chroma run --host localhost --port 8000 --path ./chromadb`

Then, in the `VectorStore` class below, replace `chromadb.Client()` with `chromadb.HttpClient(host="localhost", port=8000)`.
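As a rough sketch, the switch between the in-memory client and the HTTP client could be driven by configuration instead of a manual code edit. The environment variable names `CHROMA_HOST` and `CHROMA_PORT` and both helper functions are assumptions made for illustration and are not part of the notebook; `chromadb.Client` and `chromadb.HttpClient` are the only ChromaDB calls involved.

```python
import os


def chroma_target(env=None):
    """Return ("http", host, port) when CHROMA_HOST is set, else ("memory", None, None).

    CHROMA_HOST and CHROMA_PORT are hypothetical variable names.
    """
    env = os.environ if env is None else env
    host = env.get("CHROMA_HOST")
    if not host:
        return ("memory", None, None)
    return ("http", host, int(env.get("CHROMA_PORT", "8000")))


def make_chroma_client(env=None):
    """Build the ChromaDB client that matches the configuration."""
    import chromadb  # imported lazily so chroma_target works without chromadb

    mode, host, port = chroma_target(env)
    if mode == "http":
        # Requires a server started with `chroma run --host ... --port ...`
        return chromadb.HttpClient(host=host, port=port)
    # Default: in-memory client, as used in the notebook
    return chromadb.Client()
```

With a helper like this, `VectorStore.__init__` could call `make_chroma_client()` in place of `chromadb.Client()`, so the same notebook runs unchanged in Colab and against a local server.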

## Code

### Class to handle Vector Database

In [None]:
class VectorStore:
    """Wraps a ChromaDB collection and the embedding function used to fill and query it."""

    def __init__(self, collection_name):
        # Initialize the embedding model (ChromaDB's default embedding function)
        self.embedding_model = embedding_functions.DefaultEmbeddingFunction()

        # In-memory client; use chromadb.HttpClient instead to reach a remote server
        self.chroma_client = chromadb.Client()
        print(f"Heartbeat: {self.chroma_client.heartbeat()}")
        print(f"Version: {self.chroma_client.get_version()}")

        # Reuse the collection if it already exists, otherwise create it
        self.collection = self.chroma_client.create_collection(
            name=collection_name, get_or_create=True
        )

        print(f"All collections: {self.chroma_client.list_collections()}")

    def populate_vectors(self, dataset):
        """Populate the vector store with embeddings from a dataset."""
        for i, item in enumerate(dataset):
            # The embedding covers instruction + context, but only the context is stored
            combined_text = f"{item['instruction']}. {item['context']}"
            embeddings = self.embedding_model([combined_text])
            self.collection.add(
                embeddings=embeddings, documents=[item["context"]], ids=[f"id_{i}"]
            )

    def search_context(self, query, n_results=1):
        """Search the ChromaDB collection for the context most relevant to a query."""
        query_embeddings = self.embedding_model([query])
        return self.collection.query(
            query_embeddings=query_embeddings, n_results=n_results
        )


### Class to handle language model

In [None]:
class Model:

    def __init__(self):
        # https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0
        model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
        self.pipeline, self.tokenizer = self.initialize_model(model_name)

    def initialize_model(self, model_name):
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name)

        # Use the first GPU when available; device=-1 keeps the pipeline on the CPU
        device = 0 if torch.cuda.is_available() else -1
        pipeline = transformers.pipeline(
            "text-generation",
            model=model,
            tokenizer=tokenizer,
            device=device,
            max_length=1000,
        )

        return pipeline, tokenizer

    def generate_answer(self, prompt):
        sequences = self.pipeline(prompt)

        return sequences[0]["generated_text"]

## Creating the Vector Database

In [None]:
# Load only the training split of the dataset
train_dataset = load_dataset("databricks/databricks-dolly-15k", split="train")
# Filter the dataset to only include entries with the 'closed_qa' category
closed_qa_dataset = train_dataset.filter(lambda example: example['category'] == "closed_qa")

print()
pprint(closed_qa_dataset[0])
print()
pprint(closed_qa_dataset[-1])


{'category': 'closed_qa',
 'context': 'Virgin Australia, the trading name of Virgin Australia Airlines '
            'Pty Ltd, is an Australian-based airline. It is the largest '
            'airline by fleet size to use the Virgin brand. It commenced '
            'services on 31 August 2000 as Virgin Blue, with two aircraft on a '
            'single route. It suddenly found itself as a major airline in '
            "Australia's domestic market after the collapse of Ansett "
            'Australia in September 2001. The airline has since grown to '
            'directly serve 32 cities in Australia, from hubs in Brisbane, '
            'Melbourne and Sydney.',
 'instruction': 'When did Virgin Australia start operating?',
 'response': 'Virgin Australia commenced services on 31 August 2000 as Virgin '
             'Blue, with two aircraft on a single route.'}

{'category': 'closed_qa',
 'context': 'The term one-child policy refers to a population planning '
            'initiative in

In [None]:
vector_store = VectorStore("knowledge-base")
vector_store.populate_vectors(closed_qa_dataset)

Heartbeat: 1722469521872045046
Version: 0.5.5
All collections: [Collection(id=058e3330-dcfd-48f3-a928-15987b965afb, name=knowledge-base)]
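`populate_vectors` embeds and adds one record per call. A batched variant may be noticeably faster on larger datasets, since both the embedding function and `collection.add` accept lists. The following is a minimal sketch under that assumption; `batched`, `populate_in_batches`, and the default batch size of 64 are hypothetical and not part of the notebook.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of at most `size` consecutive items."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def populate_in_batches(collection, embed, dataset, batch_size=64):
    """Embed and add records in batches; return the number of records added.

    `collection` needs an `add(embeddings=, documents=, ids=)` method and
    `embed` maps a list of strings to a list of vectors, as in VectorStore.
    """
    total = 0
    for chunk in batched(enumerate(dataset), batch_size):
        # Same text layout as populate_vectors: "<instruction>. <context>"
        texts = [f"{item['instruction']}. {item['context']}" for _, item in chunk]
        collection.add(
            embeddings=embed(texts),
            documents=[item["context"] for _, item in chunk],
            ids=[f"id_{i}" for i, _ in chunk],
        )
        total += len(chunk)
    return total
```

It could be called as `populate_in_batches(vector_store.collection, vector_store.embedding_model, closed_qa_dataset)`, producing the same IDs and documents as the loop in `populate_vectors`.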


### Add custom data to Vector Database

In [None]:
custom_data = {
    "category": "closed_qa",
    "context": "Jhulia Rayssa Mendes Leal (Imperatriz, January 4, 2008) is a Brazilian skateboarder, Olympic runner-up at the 2020 Summer Olympics in Tokyo, being the youngest Brazilian Olympic medalist. Furthermore, she is the Pan-American champion, winning the gold medal in street skateboarding at the 2023 Pan-American Games, held in Santiago, Chile, marking the first gold medal for the Brazilian delegation in this edition of the Pan. In 2024, she won bronze at the Paris Olympic Games, becoming the youngest Brazilian athlete to win two medals at two different Olympic Games. Rayssa is also a gold medalist at the X Games and 2022 world champion. Popularly called 'A Fadinha' (The Little Fairy), Rayssa earned this nickname after her video performing skateboarding tricks dressed as a fairy went viral on the internet at the age of seven. Since then, she has become well-known in the Brazilian skate scene and on social media. Her success in competitions made her a recognized athlete in the skateboarding world.",
    "instruction": "Who is Rayssa Leal?",
    "response": "Rayssa Leal, also known as 'A Fadinha' (The Little Fairy), is a Brazilian skateboarder and the youngest Brazilian Olympic medalist, having won silver at the 2020 Tokyo Olympics and bronze at the 2024 Paris Olympics. She is also a Pan-American champion, X Games gold medalist, and 2022 world champion in street skateboarding."
}

combined_text = f"{custom_data['instruction']}. {custom_data['context']}"
embeddings = vector_store.embedding_model([combined_text])
vector_store.collection.add(
    embeddings=embeddings, documents=[custom_data["context"]], ids=["id_1000000000"]
)

In [None]:
vector_store.collection.get(ids=["id_1000000000"])

{'ids': ['id_1000000000'],
 'embeddings': None,
 'metadatas': [None],
 'documents': ["Jhulia Rayssa Mendes Leal (Imperatriz, January 4, 2008) is a Brazilian skateboarder, Olympic runner-up at the 2020 Summer Olympics in Tokyo, being the youngest Brazilian Olympic medalist. Furthermore, she is the Pan-American champion, winning the gold medal in street skateboarding at the 2023 Pan-American Games, held in Santiago, Chile, marking the first gold medal for the Brazilian delegation in this edition of the Pan. In 2024, she won bronze at the Paris Olympic Games, becoming the youngest Brazilian athlete to win two medals at two different Olympic Games. Rayssa is also a gold medalist at the X Games and 2022 world champion. Popularly called 'A Fadinha' (The Little Fairy), Rayssa earned this nickname after her video performing skateboarding tricks dressed as a fairy went viral on the internet at the age of seven. Since then, she has become well-known in the Brazilian skate scene and on social med
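The custom record above uses the hand-picked ID `id_1000000000` to stay clear of the `id_{i}` values generated from the dataset. One possible alternative, sketched here as an assumption rather than the notebook's approach, is to derive the ID from the text itself so that the same document always maps to the same ID. The helper name `stable_id` and the 16-character digest length are illustrative choices.

```python
import hashlib


def stable_id(text, prefix="id_"):
    """Derive a deterministic ID from the text (hypothetical helper)."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    # 16 hex characters keep the ID short; collisions are unlikely at this scale
    return f"{prefix}{digest[:16]}"
```

Combined with `collection.upsert(...)` instead of `collection.add(...)`, re-running the cell would then overwrite the existing record rather than attempt to add it a second time.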

## Creating the model

In [None]:
model = Model()

## Generating text

### Question 1

In [None]:
user_question = "Who is the president of the USA"

#### Without context

In [None]:
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": user_question},
]
prompt = model.pipeline.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate an answer to the user question using the LLM
answer = model.generate_answer(prompt)
print("\nAnswer:\n")
print(answer.split("<|assistant|>")[-1].strip())


Answer:

The current president of the United States of America is Joe Biden. He was sworn in as the 46th president of the United States on January 20, 2021.
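The pipeline returns the prompt followed by the generated text, which is why the cell above keeps only what comes after the last `<|assistant|>` marker. That step can be wrapped in a small helper; the name `extract_reply` is hypothetical, and the marker is taken from the notebook code.

```python
def extract_reply(generated_text, marker="<|assistant|>"):
    """Return the text after the last assistant marker, stripped of whitespace."""
    _, found, reply = generated_text.rpartition(marker)
    if not found:
        # No marker present: fall back to the whole text
        return generated_text.strip()
    return reply.strip()
```

For example, `print(extract_reply(answer))` would replace `print(answer.split("<|assistant|>")[-1].strip())`. The text-generation pipeline also accepts `return_full_text=False`, which should return only the newly generated text and may make the split unnecessary.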


#### With context

In [None]:
context_response = vector_store.search_context(user_question)
print(context_response)

{'ids': [['id_1075']], 'distances': [[0.8493637442588806]], 'metadatas': [[None]], 'embeddings': None, 'documents': [['From Simple English Wikipedia, the free encyclopedia\nPresident of the\nUnited States of America\nSeal of the President of the United States.svg\nSeal of the President of the United States\nFlag of the President of the United States.svg\nFlag of the President of the United States\nJoe Biden presidential portrait.jpg\nIncumbent\nJoe Biden\nsince January 20, 2021\nExecutive branch of the U.S. government\nExecutive Office of the President\nStyle\t\nMr. President\n(informal)\nThe Honorable\n(formal)\nHis Excellency\n(diplomatic)\nType\t\nHead of state\nHead of government\nAbbreviation\tPOTUS\nMember of\t\nCabinet\nDomestic Policy Council\nNational Economic Council\nNational Security Council\nResidence\tWhite House\nSeat\tWashington, D.C.\nAppointer\tElectoral College\nTerm length\tFour years, renewable once\nConstituting instrument\tConstitution of the United States\nInaug
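The query response is nested: each key holds one inner list per query, so the hits for a single question sit at index `0`. A lower distance means a closer match. As a sketch, the response can be flattened into tuples and optionally filtered by distance; the helper `flatten_hits` and the `max_distance` threshold are assumptions for illustration, and a suitable threshold would have to be chosen empirically.

```python
def flatten_hits(response, max_distance=None):
    """Turn a single-query response into (id, distance, document) tuples.

    Hits are sorted from closest to farthest; hits beyond `max_distance`
    are dropped when a threshold is given.
    """
    hits = list(
        zip(response["ids"][0], response["distances"][0], response["documents"][0])
    )
    if max_distance is not None:
        hits = [hit for hit in hits if hit[1] <= max_distance]
    return sorted(hits, key=lambda hit: hit[1])
```

Calling `flatten_hits(context_response)` on the output above would give a one-element list whose first field is `'id_1075'`.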

In [None]:
context = "".join(context_response["documents"][0])[:900]

messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": f"{context}\n\n{user_question}"},
]
prompt = model.pipeline.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate an answer using the model, incorporating the fetched context
enriched_answer = model.generate_answer(prompt)

print("Answer:\n")
print(enriched_answer.split("<|assistant|>")[-1].strip())

Answer:

The president of the United States of America is Joe Biden, who was sworn in as the 46th president of the United States on January 20, 2021. Biden is the incumbent president, having been elected in the 2020 presidential election. The president is the head of state and head of government of the United States, and is the leader of the executive branch of the government. The president is also the head of the National Security Council, the Domestic Policy Council, and the National Economic Council. The president is elected for a four-year term, with the inauguration ceremony taking place on the first Monday of January each year. The president is the first in the line of succession, and the vice president serves as the acting president in the event of the president's death, resignation, or inability to perform their duties. The president is also the commander-in-chief of the United States Armed Forces.
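The "with context" cells all follow the same pattern: join the retrieved documents, cut the result to 900 characters, and place it before the question in the user message. A minimal sketch of that pattern as a reusable function follows; `build_rag_messages` and its parameter names are hypothetical, while the system prompt and the 900-character limit are copied from the notebook.

```python
SYSTEM_PROMPT = "You are a friendly chatbot who always responds in the style of a pirate"


def build_rag_messages(question, documents, max_context_chars=900, system_prompt=SYSTEM_PROMPT):
    """Build chat messages with the retrieved context placed before the question."""
    # Same truncation as the notebook: a character limit, not a token limit
    context = "".join(documents)[:max_context_chars]
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"{context}\n\n{question}"},
    ]
```

It would be used as `messages = build_rag_messages(user_question, context_response["documents"][0])`, followed by the same `apply_chat_template` and `generate_answer` calls as in the cell above.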


### Question 2

In [None]:
user_question = "Who is Rayssa Leal"

#### Without context

In [None]:
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": user_question},
]
prompt = model.pipeline.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate an answer to the user question using the LLM
answer = model.generate_answer(prompt)

print("Answer:\n")
print(answer.split("<|assistant|>")[-1].strip())

Answer:

Rayssa Leal is a Brazilian model and actress who is best known for her work in the fashion industry. She has appeared in numerous high-profile fashion campaigns and has been featured in numerous magazines, including Vogue, Elle, and GQ. Rayssa has also appeared in several films and television shows, including "The Crown," "The Crown," and "The Crown." She is known for her natural beauty, curvaceous figure, and effortless style.


#### With context

In [None]:
context_response = vector_store.search_context(user_question)
print(context_response)

{'ids': [['id_1000000000']], 'distances': [[0.7231837511062622]], 'metadatas': [[None]], 'embeddings': None, 'documents': [["Jhulia Rayssa Mendes Leal (Imperatriz, January 4, 2008) is a Brazilian skateboarder, Olympic runner-up at the 2020 Summer Olympics in Tokyo, being the youngest Brazilian Olympic medalist. Furthermore, she is the Pan-American champion, winning the gold medal in street skateboarding at the 2023 Pan-American Games, held in Santiago, Chile, marking the first gold medal for the Brazilian delegation in this edition of the Pan. In 2024, she won bronze at the Paris Olympic Games, becoming the youngest Brazilian athlete to win two medals at two different Olympic Games. Rayssa is also a gold medalist at the X Games and 2022 world champion. Popularly called 'A Fadinha' (The Little Fairy), Rayssa earned this nickname after her video performing skateboarding tricks dressed as a fairy went viral on the internet at the age of seven. Since then, she has become well-known in the 

In [None]:
context = "".join(context_response["documents"][0])[:900]

messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": f"{context}\n\n{user_question}"},
]
prompt = model.pipeline.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate an answer using the model, incorporating the fetched context
enriched_answer = model.generate_answer(prompt)

print("Answer:\n")
print(enriched_answer.split("<|assistant|>")[-1].strip())

Answer:

Jhulia Rayssa Mendes Leal (Imperatriz, January 4, 2008) is a Brazilian skateboarder, Olympic runner-up at the 2020 Summer Olympics in Tokyo, being the youngest Brazilian Olympic medalist. She is also a gold medalist at the X Games and 2022 world champion. Rayssa is popularly called "A Fadinha" (The Little Fairy) after her viral video of skateboarding tricks dressed as a fairy went viral on the internet at the age of seven. She has become well-known in the Brazilian skate scene and has earned the nickname "The Little Fairy" for her unique style and talent.


### Question 3

In [None]:
user_question = "How many Olympic medals did Rayssa Leal win"

#### With context

In [None]:
context_response = vector_store.search_context(user_question)
print(context_response)
context = "".join(context_response["documents"][0])[:900]

messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": f"{context}\n\n{user_question}"},
]
prompt = model.pipeline.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate an answer using the model, incorporating the fetched context
enriched_answer = model.generate_answer(prompt)

print("\nAnswer:\n")
print(enriched_answer.split("<|assistant|>")[-1].strip())

{'ids': [['id_1000000000']], 'distances': [[0.6633252501487732]], 'metadatas': [[None]], 'embeddings': None, 'documents': [["Jhulia Rayssa Mendes Leal (Imperatriz, January 4, 2008) is a Brazilian skateboarder, Olympic runner-up at the 2020 Summer Olympics in Tokyo, being the youngest Brazilian Olympic medalist. Furthermore, she is the Pan-American champion, winning the gold medal in street skateboarding at the 2023 Pan-American Games, held in Santiago, Chile, marking the first gold medal for the Brazilian delegation in this edition of the Pan. In 2024, she won bronze at the Paris Olympic Games, becoming the youngest Brazilian athlete to win two medals at two different Olympic Games. Rayssa is also a gold medalist at the X Games and 2022 world champion. Popularly called 'A Fadinha' (The Little Fairy), Rayssa earned this nickname after her video performing skateboarding tricks dressed as a fairy went viral on the internet at the age of seven. Since then, she has become well-known in the 