## Vector Databases

<img src="../assets/module_5/vector_banner.jpg" height="25%">

We started this workshop with **text representation** as one of the key components of any NLP system.
As we progressed from a simple Bag of Words setup to highly contextualised Transformer models, we arrived at rich, dense representations.
The utility of such representations has also grown many times over, from word/sentence representations to features that can be used for a number of downstream tasks.

These representations, also called vectors or embedding vectors, are long series of numbers. Storing and retrieving them efficiently requires specialised database management systems called **Vector Databases**.

Vector Databases are particularly suited for handling data in the form of vectors, embeddings, or feature representations, which are commonly used in various applications like machine learning, natural language processing, computer vision, and recommendation systems.

Key Features:
- High-dimensional Data Support
- Similarity Search
- Indexing Techniques
- Dimensionality Reduction
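
To make the *Similarity Search* feature concrete, here is a minimal sketch of a brute-force nearest-neighbour lookup using cosine similarity. The helper names (`cosine_similarity`, `top_k`) and the toy 3-dimensional vectors are made up for illustration; real vector databases typically rely on approximate indexes rather than scoring every vector.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, vectors, k=1):
    """Brute-force search: score every stored vector and keep the k best."""
    scored = [(cosine_similarity(query, v), idx) for idx, v in enumerate(vectors)]
    scored.sort(reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]

# Toy 3-dimensional "embeddings" (illustrative values, not real model output)
stored = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
nearest = top_k([0.9, 0.1, 0.0], stored, k=2)
print(nearest)
```

The query points mostly along the first axis, so the first stored vector ranks highest, followed by the one that lies between the first two axes.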

There are a number of different off-the-shelf options available, such as:
- [ChromaDB](https://www.trychroma.com/)
- [Pinecone](https://www.pinecone.io/)
- [Milvus](https://milvus.io/)
- [Weaviate](https://weaviate.io/)
- [Aerospike](https://aerospike.com/)
- [OpenSearch](https://opensearch.org/)


## Let us Begin with Installation

<a target="_blank" href="https://colab.research.google.com/github/raghavbali/llm_workshop_dhs23/blob/main/module_05/1.vector_databases_hf_inference_endpoint.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

In [1]:
%%capture
# install dependencies
!pip install -q chromadb
!pip install retry
#!pip install -U sentence-transformers

## Imports

In [2]:
import json
import requests
import pandas as pd
from retry import retry

import chromadb
from chromadb.api.types import Documents, Embeddings

## HuggingFace Inference EndPoints 🤗

Another key offering from HuggingFace is *[Inference Endpoints](https://huggingface.co/inference-endpoints)*.
These endpoints provide access to hundreds of large models hosted on HuggingFace infra for easy use.

All you need is a quick [sign-up](https://huggingface.co/login) and an API Key and bingo!


## Sentence Transformers

This is an amazing Python framework initially proposed along with the seminal [Sentence-BERT](https://www.sbert.net/) paper.
It provides clean high-level interfaces to easily use Language Models for computing text embeddings for various use-cases.

In this notebook we will leverage pretrained models supported by Sentence Transformers through a hosted endpoint, rather than using the package directly.

The **Massive Text Embedding Benchmark (MTEB) Leaderboard** is a [leaderboard](https://huggingface.co/spaces/mteb/leaderboard) that keeps track of the state-of-the-art embedding models.

<img src="../assets/module_5/mteb.png">

> Source : [HuggingFace](https://huggingface.co/spaces/mteb/leaderboard)

## MPNET Model

- This model maps sentences/paragraphs to a 768-dimensional vector space and is optimised for question-answering tasks.
- The model card is available [here](https://huggingface.co/pinecone/mpnet-retriever-discourse).

In [3]:
EMB_MODEL_ID = 'pinecone/mpnet-retriever-discourse'
HF_TOKEN = ''
EMB_API_URL = f"https://api-inference.huggingface.co/pipeline/feature-extraction/{EMB_MODEL_ID}"
HEADERS = {"Authorization": f"Bearer {HF_TOKEN}"}

## Embeddings using 🤗 Inference Endpoint
- We set up a utility function that takes a list of sentences as input and returns their embeddings as the response
- We use the ``retry`` package to give the API sufficient time, and a few attempts, to respond
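
To illustrate what a retry decorator does conceptually, here is a simplified sketch. `simple_retry` and `flaky_call` are hypothetical names written for this example; this is not the `retry` package's actual implementation, only an approximation of the behaviour we rely on (call again after a pause, give up after a fixed number of attempts).

```python
import functools
import time

def simple_retry(tries=3, delay=0.0):
    """Hypothetical, simplified stand-in for a retry decorator."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == tries:
                        raise  # out of attempts: surface the last error
                    time.sleep(delay)  # wait before trying again
        return wrapper
    return decorator

calls = {"count": 0}

@simple_retry(tries=3, delay=0.0)
def flaky_call():
    """Fails twice, then succeeds, mimicking a model that is still loading."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("model is still loading")
    return "ok"

result = flaky_call()
print(result, calls["count"])
```

A delay of a few seconds (the notebook uses 10) gives a hosted model time to load before the next attempt.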

In [4]:
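@retry(tries=3, delay=10)
def get_embeddings(texts):
    # POST the sentences to the feature-extraction endpoint
    response = requests.post(EMB_API_URL, headers=HEADERS, json={"inputs": texts})
    result = response.json()
    # a successful call returns a list with one embedding per input text
    if isinstance(result, list):
        return result
    # anything else (e.g. an {"error": ...} payload) raises, which triggers a retry
    raise RuntimeError(f"The API did not return embeddings: {result}")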

In [5]:
sample_texts = [
        "Another key offering from HuggingFace is Inference Endpoints. These endpoints provide access to hundreds of large models hosted on HuggingFace infra for easy use.",
        "This is an amazing python framework initially proposed along with the seminal paper titled Sentence-BERT. It provides clean high-level interfaces to easily use Language Models for computing text embeddings for various use-cases."
        ]

In [6]:
# generate embeddings
sample_embeddings = get_embeddings(sample_texts)
sample_embeddings



[[-0.5249431729316711,
  -0.04365357756614685,
  0.5124772787094116,
  0.21908250451087952,
  0.45604899525642395,
  -0.41028520464897156,
  -0.16367770731449127,
  -0.5703942775726318,
  0.2737348973751068,
  -0.0625704824924469,
  0.5730516910552979,
  -0.07522936910390854,
  0.16366876661777496,
  -0.48901644349098206,
  -0.6579405069351196,
  -0.4555187225341797,
  -1.1214828491210938,
  0.33167019486427307,
  0.26991039514541626,
  0.2382402867078781,
  -0.06299390643835068,
  -1.1421911716461182,
  -0.3283173441886902,
  -0.20789942145347595,
  -0.4092184603214264,
  -0.44582730531692505,
  -0.030791839584708214,
  -0.4313756227493286,
  0.5941035747528076,
  -0.0744466707110405,
  -0.14593197405338287,
  0.2835100293159485,
  -0.41431736946105957,
  -1.0458682775497437,
  -0.17988894879817963,
  0.006162078119814396,
  -0.3257652521133423,
  0.30198347568511963,
  -0.034965742379426956,
  0.4035990536212921,
  -0.8219583034515381,
  0.17588070034980774,
  0.0834009051322937,
  -

In [7]:
# check embedding length
len(sample_embeddings[0])

768
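
With 768 numbers per text, storage grows quickly as the corpus grows. The following back-of-the-envelope sketch estimates the raw size of the vectors alone. It assumes each value is stored as a 4-byte float (float32), which is an assumption for illustration and not something the endpoint or database guarantees; `embedding_storage_bytes` is a hypothetical helper.

```python
def embedding_storage_bytes(num_vectors, dim=768, bytes_per_value=4):
    """Raw storage for the vectors alone, ignoring index and metadata overhead."""
    return num_vectors * dim * bytes_per_value

# one 768-dimensional vector: 768 * 4 bytes
per_vector = embedding_storage_bytes(1)
# one million such vectors
one_million = embedding_storage_bytes(1_000_000)

print(per_vector, "bytes per vector")
print(one_million / 1024**3, "GiB for one million vectors")
```

Estimates like this are one reason dedicated indexing and dimensionality reduction matter once a collection grows beyond a handful of documents.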

## Vector Database: ChromaDB

As mentioned above, there are a number of offerings available. For this workshop we will make use of [ChromaDB](https://www.trychroma.com/).

It is simple to set up and easy to use. The following figure shows the overall flow:

<img src="../assets/module_5/chroma_workflow.png">

> Source : [ChromaDB](https://docs.trychroma.com/)

### Create an Instance of the Database Client

In [8]:
# in memory
chroma_client = chromadb.Client()
# save to disk: client = chromadb.PersistentClient(path="/path/to/data")

In [9]:
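def create_db_and_load_data(chroma_client, collection_name, embedding_func, documents):
    # create a collection that embeds documents with the supplied function
    db = chroma_client.create_collection(name=collection_name,
                                         embedding_function=embedding_func)
    # add the documents one at a time, using their position as the id
    for i, d in enumerate(documents):
        db.add(
            documents=d,
            ids=str(i)
        )
    return db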

## Insert Data

Now that we have a utility to interact with the vector database, let us add some data to it and see how it goes.

In [10]:
db = create_db_and_load_data(chroma_client=chroma_client,
                             collection_name="nlp_llm_workshop",
                             embedding_func=get_embeddings,
                             documents=sample_texts)

In [11]:
pd.DataFrame(db.peek(0))

| | ids | embeddings | metadatas | documents |
|---|---|---|---|---|
| 0 | 0 | [-0.5249432921409607, -0.04365356266498566, 0.... | | Another key offering from HuggingFace is Infer... |
| 1 | 1 | [-0.5217247009277344, 0.5370818376541138, -0.2... | | This is an amazing python framework initially ... |


## Retrieve Documents

In [12]:
question = "HuggingFace Key Offering"

In [13]:
db.query(query_texts=[question], n_results=1)

{'ids': [['0']],
 'distances': [[169.98214721679688]],
 'metadatas': [[None]],
 'embeddings': None,
 'documents': [['Another key offering from HuggingFace is Inference Endpoints. These endpoints provide access to hundreds of large models hosted on HuggingFace infra for easy use.']]}
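
The distance of roughly 170 may look large. A plausible explanation, assuming the collection uses squared Euclidean (L2) distance (we believe this is ChromaDB's default, but please check the documentation for your version), is that the embeddings are not normalised to unit length, so vector magnitudes inflate the score. The sketch below uses made-up 2-dimensional vectors and hypothetical helpers (`squared_l2`, `normalize`) to show the effect.

```python
import math

def squared_l2(a, b):
    """Sum of squared coordinate differences between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, but twice as long

raw_distance = squared_l2(a, b)                          # driven by magnitude
unit_distance = squared_l2(normalize(a), normalize(b))   # direction only

print(raw_distance, unit_distance)
```

For unit-length vectors the squared L2 distance always falls between 0 and 4, which makes scores easier to compare across queries.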

In [14]:
def get_relevant_documents(query, db):
  relevant_doc = db.query(query_texts=[query], n_results=1)['documents'][0][0]
  return relevant_doc

In [15]:
# search using embeddings
get_relevant_documents(question, db)

'Another key offering from HuggingFace is Inference Endpoints. These endpoints provide access to hundreds of large models hosted on HuggingFace infra for easy use.'

## HuggingFace Powered Question Answering Setup

Similar to the embedding endpoint, HF also lets us directly leverage hosted models for tasks such as:
- Text Generation
- Question Answering, etc.

We could also use local setups like GPT4ALL with LangChain, the OpenAI APIs or even HuggingFace transformers. For this exercise, we will focus on **HuggingFace Endpoints** for **QA tasks**.

We will make use of the [roberta-base-squad2](https://huggingface.co/deepset/roberta-base-squad2) model.

In [16]:
QA_MODEL_ID = 'deepset/roberta-base-squad2'
HF_TOKEN = ''
QA_API_URL = f"https://api-inference.huggingface.co/models/{QA_MODEL_ID}"
HEADERS = {"Authorization": f"Bearer {HF_TOKEN}"}

In [17]:
def get_answer(question, context):
    # the QA endpoint expects both the question and the context to search in
    payload = {
        "question": question,
        "context": context
    }
    response = requests.post(QA_API_URL, headers=HEADERS, json=payload)
    try:
        decoded_response = response.json()
        # returns a tuple of (answer, confidence score, error)
        return decoded_response['answer'], decoded_response['score'], ""
    except Exception as ex:
        # e.g. an error payload without an 'answer' key, or a non-JSON body
        return "Apologies but I could not find any relevant answer", 0.0, ex

In [18]:
print(question)

HuggingFace Key Offering


In [19]:
context = get_relevant_documents(question, db)
context

'Another key offering from HuggingFace is Inference Endpoints. These endpoints provide access to hundreds of large models hosted on HuggingFace infra for easy use.'

In [20]:
get_answer(question,context)

('Inference Endpoints', 0.18490087985992432, '')
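
The retrieve-then-answer steps above can be tied together in a single helper. This is a minimal sketch under stated assumptions: `answer_question`, `fake_retrieve` and `fake_read` are hypothetical names, and the stubs stand in for the real calls so that the example runs without any API access. In the notebook, `retrieve` could be `lambda q: get_relevant_documents(q, db)` and `read` could be `get_answer`.

```python
def answer_question(question, retrieve, read, min_score=0.0):
    """Retrieve a context for the question, then extract an answer from it."""
    context = retrieve(question)
    answer, score, error = read(question, context)
    # treat errors and low-confidence answers as "no answer"
    if error or score < min_score:
        return {"answer": None, "score": score, "context": context}
    return {"answer": answer, "score": score, "context": context}

# Stub retriever and reader (illustrative only, no network calls)
def fake_retrieve(question):
    return "Another key offering from HuggingFace is Inference Endpoints."

def fake_read(question, context):
    return "Inference Endpoints", 0.18, ""

result = answer_question("HuggingFace Key Offering", fake_retrieve, fake_read)
strict = answer_question("HuggingFace Key Offering", fake_retrieve, fake_read,
                         min_score=0.5)
print(result["answer"], strict["answer"])
```

The optional `min_score` threshold is a design choice for this sketch: it lets the caller reject low-confidence answers instead of returning them unconditionally.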