# Vector Databases with LLMs

**Content:** using an embedding database (vector DB) to store information that is provided to an LLM so it can be incorporated into its responses.

**Model:** [TinyLlama-1.1B-Chat-v1.0](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0) - a compact 1.1B-parameter chat model, small enough to run in a free notebook environment.

**Environment:** Kaggle (free)

**Keywords:** Vector Database, ChromaDB, RAG, Embeddings.

[Source](https://github.com/peremartra/Large-Language-Model-Notebooks-Course/blob/main/2-Vector%20Databases%20with%20LLMs/how-to-use-a-embedding-database-with-a-llm-from-hf.ipynb)

# 1. Set Up

## 1.1. Import & Load the Libraries

In [13]:
!pip install -q transformers # ==4.41.2

In [14]:
# to transform the sentences into fixed-length vectors (i.e., embeddings)
!pip install -q sentence-transformers #==2.2.2

[xformers](https://github.com/facebookresearch/xformers) provides utilities that facilitate working with transformer models. Installing it can avoid errors when working with the model and embeddings; the install line below is commented out, so uncomment it if you run into such errors.

In [15]:
#!pip install -q xformers==0.0.23

In [16]:
# vector DB to store embeddings - easy to use & open source
!pip install -q chromadb #==0.4.20 

In [17]:
!pip install --upgrade -q chromadb

In [18]:
import numpy as np   # a library for numerical computing.
import pandas as pd   # a library for data manipulation

## 1.2. Load the Dataset

**/kaggle/input/**: a read-only directory automatically created by Kaggle

In [19]:
news = pd.read_csv('/kaggle/input/topic-labeled-news-dataset/labelled_newscatcher_dataset.csv', sep=';')
MAX_NEWS = 1000   # maximum number of news items to load into the vector DB
DOCUMENT="title"
TOPIC="topic"

ChromaDB requires every record to have a unique identifier, so create a new column called `id`:

In [20]:
news["id"] = news.index
news.head()

|   | topic | link | domain | published_date | title | lang | id |
|---|---|---|---|---|---|---|---|
| 0 | SCIENCE | https://www.eurekalert.org/pub_releases/2020-0... | eurekalert.org | 2020-08-06 13:59:45 | A closer look at water-splitting's solar fuel ... | en | 0 |
| 1 | SCIENCE | https://www.pulse.ng/news/world/an-irresistibl... | pulse.ng | 2020-08-12 15:14:19 | An irresistible scent makes locusts swarm, stu... | en | 1 |
| 2 | SCIENCE | https://www.express.co.uk/news/science/1322607... | express.co.uk | 2020-08-13 21:01:00 | Artificial intelligence warning: AI will know ... | en | 2 |
| 3 | SCIENCE | https://www.ndtv.com/world-news/glaciers-could... | ndtv.com | 2020-08-03 22:18:26 | Glaciers Could Have Sculpted Mars Valleys: Study | en | 3 |
| 4 | SCIENCE | https://www.thesun.ie/tech/5742187/perseid-met... | thesun.ie | 2020-08-12 19:54:36 | Perseid meteor shower 2020: What time and how ... | en | 4 |


Select a small subset of the news, since the free environment has limited resources.

In [21]:
subset_news = news.head(MAX_NEWS)

# 2. Vector Database

## 2.1. Import & configure

Import the **Settings** class from the **chromadb.config** module. It is used to change the settings of the ChromaDB system and customize its behavior.

In [22]:
import chromadb
from chromadb.config import Settings

In [23]:
chroma_client = chromadb.PersistentClient(path="/path/to/persist/directory")

## 2.2. Filling & Querying

In [24]:
from datetime import datetime

Data in ChromaDB is stored in collections. If a collection with the same name already exists, it must be deleted before it can be created again.

Create the collection by calling **create_collection**:

In [25]:
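# Append a Unix timestamp so each run gets a fresh collection name.
# (strftime("%s") is not portable across platforms, so use timestamp() instead.)
collection_name = "news_collection" + str(int(datetime.now().timestamp()))

# list_collections() may return Collection objects or plain names, depending on the ChromaDB version.
existing_names = [getattr(c, "name", c) for c in chroma_client.list_collections()]
if collection_name in existing_names:
    chroma_client.delete_collection(name=collection_name)

collection = chroma_client.create_collection(name=collection_name)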

Add the data to the collection. The `add` function needs `ids` and `documents` (or precomputed `embeddings`); `metadatas` is optional, and is passed here so that each document carries its topic.

- `documents`: the text to embed and store. In this dataset, it's the *title* column.
- `metadatas`: one dictionary per document; here it holds the topic.
- `ids`: a unique id for each row. Ids MUST be unique, so they are generated from the range of MAX_NEWS.
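The three arguments are parallel lists: the n-th document, metadata entry, and id all describe the same row. Below is a minimal sketch of building and sanity-checking them before calling `add`. The helper name `build_add_arguments` and the two sample rows are illustrative assumptions, not part of ChromaDB or the dataset loader.

```python
def build_add_arguments(rows, document_key="title", topic_key="topic"):
    """Build parallel documents/metadatas/ids lists from a list of dict rows."""
    documents = [row[document_key] for row in rows]
    metadatas = [{topic_key: row[topic_key]} for row in rows]
    # Derive ids from the actual number of rows so the lists always match.
    ids = [f"id{i}" for i in range(len(rows))]

    # The lists must have equal length and the ids must be unique.
    assert len(documents) == len(metadatas) == len(ids)
    assert len(set(ids)) == len(ids)
    return documents, metadatas, ids


# Hypothetical rows standing in for subset_news.
sample_rows = [
    {"title": "Glaciers Could Have Sculpted Mars Valleys: Study", "topic": "SCIENCE"},
    {"title": "Toshiba shuts the lid on laptops after 35 years", "topic": "TECHNOLOGY"},
]
documents, metadatas, ids = build_add_arguments(sample_rows)
print(ids)
```

Deriving the ids from `len(rows)` rather than a fixed constant keeps the lists aligned even if the dataset has fewer rows than expected.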

In [26]:
collection.add(
    documents=subset_news[DOCUMENT].tolist(),
    metadatas=[{TOPIC: topic} for topic in subset_news[TOPIC].tolist()],
    ids=[f"id{x}" for x in range(MAX_NEWS)],
)

/root/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz: 100%|██████████| 79.3M/79.3M [00:01<00:00, 81.6MiB/s]


Once the information is stored in the database, it can be queried for the data that matches our needs.

The search runs over the content of the documents and doesn't look for an exact word or phrase. The results are based on the **similarity** between the **search terms** and the **content of documents**.

The **metadata** is not used in the similarity search itself, but it can be used for **filtering** or **refining** the results.
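To make the idea of similarity-based search concrete, here is a minimal sketch that ranks a few texts against a query using cosine similarity. The three-dimensional vectors are invented toy values (real embeddings have hundreds of dimensions), and cosine similarity is only one common measure; the distance function ChromaDB actually applies depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" (made-up values, for illustration only).
toy_embeddings = {
    "gaming laptop deals": [0.9, 0.1, 0.0],
    "new notebook computer announced": [0.8, 0.3, 0.1],
    "meteor shower tonight": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.2, 0.0]  # pretend embedding of the query "laptop"

# Sort texts from most to least similar to the query.
ranked = sorted(
    toy_embeddings,
    key=lambda text: cosine_similarity(query_vector, toy_embeddings[text]),
    reverse=True,
)
for text in ranked:
    print(f"{cosine_similarity(query_vector, toy_embeddings[text]):.3f}  {text}")
```

Note that the second text ranks high even though it never contains the word "laptop", which is the behavior described above.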

In [27]:
results = collection.query(query_texts=["laptop"], n_results=10 )

print(results)

{'ids': [['id173', 'id829', 'id117', 'id535', 'id141', 'id218', 'id390', 'id273', 'id56', 'id900']], 'embeddings': None, 'documents': [['The Legendary Toshiba is Officially Done With Making Laptops', '3 gaming laptop deals you can’t afford to miss today', 'Lenovo and HP control half of the global laptop market', 'Asus ROG Zephyrus G14 gaming laptop announced in India', 'Acer Swift 3 featuring a 10th-generation Intel Ice Lake CPU, 2K screen, and more launched in India for INR 64999 (US$865)', "Apple's Next MacBook Could Be the Cheapest in Company's History", "Features of Huawei's Desktop Computer Revealed", 'Redmi to launch its first gaming laptop on August 14: Here are all the details', 'Toshiba shuts the lid on laptops after 35 years', 'This is the cheapest Windows PC by a mile and it even has a spare SSD slot']], 'uris': None, 'included': ['metadatas', 'documents', 'distances'], 'data': None, 'metadatas': [[{'topic': 'TECHNOLOGY'}, {'topic': 'TECHNOLOGY'}, {'topic': 'TECHNOLOGY'}, {'
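The dictionary returned by `query` holds one inner list per query text, which is why the output above is doubly nested. A minimal sketch of walking that structure follows, using a hand-written dictionary in the same shape; the distance values are made up for illustration, and `flatten_first_query` is a hypothetical helper name.

```python
# Hand-written stand-in shaped like the query output; distances are invented.
sample_results = {
    "ids": [["id173", "id829"]],
    "documents": [[
        "The Legendary Toshiba is Officially Done With Making Laptops",
        "3 gaming laptop deals you can’t afford to miss today",
    ]],
    "metadatas": [[{"topic": "TECHNOLOGY"}, {"topic": "TECHNOLOGY"}]],
    "distances": [[0.86, 0.94]],
}


def flatten_first_query(results):
    """Return one (id, document, topic, distance) tuple per hit of the first query."""
    return [
        (doc_id, doc, meta["topic"], dist)
        for doc_id, doc, meta, dist in zip(
            results["ids"][0],
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        )
    ]


hits = flatten_first_query(sample_results)
for doc_id, doc, topic, dist in hits:
    print(f"{doc_id} [{topic}] {dist:.2f} {doc}")
```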

## 2.3. Vector MAP (to Test!)

In [42]:
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

In [36]:
get_doc = collection.get(ids="id141", 
                       include=["documents", "embeddings"])

In [39]:
word_vectors = get_doc["embeddings"]
word_list = get_doc["documents"]

In [40]:
word_list

['Acer Swift 3 featuring a 10th-generation Intel Ice Lake CPU, 2K screen, and more launched in India for INR 64999 (US$865)']

In [41]:
word_vectors

array([[-8.08560848e-02, -4.99637052e-02, -2.37774849e-02,
        -1.10536022e-02,  2.66577117e-02, -4.47933301e-02,
        -2.88966335e-02,  2.66561043e-02,  1.43972272e-03,
        -1.64078418e-02,  6.53492734e-02, -6.90199286e-02,
        -5.74807823e-02,  1.01116151e-02,  5.04303500e-02,
        -2.05776445e-03,  7.25640804e-02, -1.24373689e-01,
         1.06594423e-02, -1.09420463e-01, -1.14324046e-02,
        -1.03760120e-02, -2.06108317e-02, -2.43940949e-02,
         7.82847628e-02,  5.82055887e-03,  2.33177263e-02,
        -8.24382976e-02, -2.72650588e-02,  4.66747722e-03,
         4.34018858e-03,  3.25280502e-02, -2.60309745e-02,
         7.96390548e-02,  4.21820618e-02, -1.21199943e-01,
         4.90708388e-02, -7.62584656e-02,  4.33162451e-02,
        -8.36045742e-02, -7.14040175e-02, -1.87925138e-02,
         3.60493883e-02,  4.28456254e-02,  2.57604383e-02,
         3.97251472e-02, -7.09129497e-03,  3.51899490e-02,
         2.73690969e-02,  9.28983930e-03, -3.91616635e-0

# 3. Vector DB & LLM

## 3.1. Load the model & Create the Prompt

- **AutoTokenizer:** a utility class for tokenizing text inputs in a way that is **compatible** with various pre-trained language models.
- **AutoModelForCausalLM:** provides an interface to pre-trained language models specifically designed for language generation tasks using causal language modeling (e.g., GPT models).
- **pipeline:** provides a simple interface for performing various NLP tasks.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0" # "databricks/dolly-v2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
lm_model = AutoModelForCausalLM.from_pretrained(model_id)

In [None]:
# Initialize the pipeline
pipe = pipeline(
    "text-generation",
    model=lm_model,
    tokenizer=tokenizer,
    max_new_tokens=256, # a short response is sufficient for this task.
    device_map="auto", # let the library automatically select the most appropriate device (CPU or GPU) for text generation
)

## 3.2. Create the extended prompt

The prompt has **two parts**:
- The **relevant context**, which is the information retrieved from the database.
- The **user's question**.
- The **user's question**.

We only need to **join** the two parts together to create the prompt!

**Note:** Limit the context length to avoid memory issues with documents containing very large text.
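One way to apply the note above is to stop adding documents once a character budget is reached, rather than cutting a document in the middle. The sketch below assumes a hypothetical helper `build_context`; its 5120-character default mirrors the commented-out slice in the notebook code, and a character count is only a rough proxy for the model's token limit.

```python
def build_context(documents, max_chars=5120):
    """Join documents with a '#' prefix, stopping before max_chars is exceeded."""
    parts = []
    used = 0
    for doc in documents:
        piece = f"#{doc}"
        # +1 accounts for the joining space between pieces.
        extra = len(piece) + (1 if parts else 0)
        if used + extra > max_chars:
            break
        parts.append(piece)
        used += extra
    return " ".join(parts)


# Hypothetical documents standing in for results["documents"][0].
docs = ["first headline", "second headline", "third headline"]
print(build_context(docs, max_chars=35))
```

With a budget of 35 characters, only the first two headlines fit, so the third is dropped whole instead of being truncated.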

In [46]:
question = "Can I buy a new Toshiba laptop?"
context = " ".join([f"#{str(i)}" for i in results["documents"][0]])
#context = context[0:5120]
prompt_template = f"""
Relevant context: {context}
Considering the relevant context, answer the question.
Question: {question}
Answer: """

print(prompt_template)


Relevant context: #The Legendary Toshiba is Officially Done With Making Laptops #3 gaming laptop deals you can’t afford to miss today #Lenovo and HP control half of the global laptop market #Asus ROG Zephyrus G14 gaming laptop announced in India #Acer Swift 3 featuring a 10th-generation Intel Ice Lake CPU, 2K screen, and more launched in India for INR 64999 (US$865) #Apple's Next MacBook Could Be the Cheapest in Company's History #Features of Huawei's Desktop Computer Revealed #Redmi to launch its first gaming laptop on August 14: Here are all the details #Toshiba shuts the lid on laptops after 35 years #This is the cheapest Windows PC by a mile and it even has a spare SSD slot
Considering the relevant context, answer the question.
Question: Can I buy a new Toshiba laptop?
Answer: 


Send the prompt to the model:

In [47]:
lm_response = pipe(prompt_template)
print(lm_response[0]["generated_text"])


Relevant context: #The Legendary Toshiba is Officially Done With Making Laptops #3 gaming laptop deals you can’t afford to miss today #Lenovo and HP control half of the global laptop market #Asus ROG Zephyrus G14 gaming laptop announced in India #Acer Swift 3 featuring a 10th-generation Intel Ice Lake CPU, 2K screen, and more launched in India for INR 64999 (US$865) #Apple's Next MacBook Could Be the Cheapest in Company's History #Features of Huawei's Desktop Computer Revealed #Redmi to launch its first gaming laptop on August 14: Here are all the details #Toshiba shuts the lid on laptops after 35 years #This is the cheapest Windows PC by a mile and it even has a spare SSD slot
Considering the relevant context, answer the question.
Question: Can I buy a new Toshiba laptop?
Answer: 
Based on the information provided, it is not possible to purchase a new Toshiba laptop at the moment. The company has ceased manufacturing laptops, and there is no indication that they will be resuming prod
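As the output above shows, the generated text echoes the full prompt before the answer. A minimal sketch of keeping only the answer is given below; `extract_answer` is a hypothetical helper, and the example strings are shortened stand-ins for the real prompt and output. The text-generation pipeline also accepts a `return_full_text` argument intended for this purpose, if your installed version supports it.

```python
def extract_answer(generated_text, prompt):
    """Return only the text generated after the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    # Fall back to the full text if the prompt is not echoed verbatim.
    return generated_text.strip()


# Shortened stand-ins for prompt_template and lm_response[0]["generated_text"].
example_prompt = "Question: Can I buy a new Toshiba laptop?\nAnswer: "
example_output = example_prompt + "\nBased on the information provided, it is not possible."
print(extract_answer(example_output, example_prompt))
```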