<h1 align="center">
  <a href="https://uptrain.ai">
    <img width="300" src="https://user-images.githubusercontent.com/108270398/214240695-4f958b76-c993-4ddd-8de6-8668f4d0da84.png" alt="UpTrain">
  </a>
</h1>

<div style="text-align: center;">

# Creating Vector Embedding using Qdrant and Evaluating using UpTrain

## Why Use a Vector Database?
Vector databases store data as high-dimensional vectors, enabling fast and efficient similarity search and retrieval of data based on their vector representations.

This is particularly useful for large language models (LLMs), which need to process vast amounts of data and find relevant information quickly.

## Qdrant
Qdrant (pronounced: quadrant) is a vector similarity search engine and vector database. It provides a production-ready service with a convenient API to store, search, and manage points—vectors with an additional payload. Qdrant is tailored to extended filtering support, making it useful for all sorts of neural network or semantic-based matching, faceted search, and other applications.

## Example
A use case of vector databases in a customer support LLM is to act as a knowledge extension for the LLM and provide context from the enterprise. The vector database can be queried to retrieve existing similar information, eliminating the need to use sensitive enterprise data to train or fine-tune the LLM. Every time a question is asked, the question gets converted to an LLM-specific embedding, which is used to retrieve relevant context from the vector database.

</div>




#workflow

Evaluating Semantic Search Results with UpTrain and Qdrant
This Jupyter notebook explores the integration of UpTrain's LLM evaluation platform with Qdrant's vector search engine for a robust and insightful evaluation of retrieved search results.

### The workflow outlined here consists of three key steps:

#### Data Preparation and Embedding:
 We'll begin by processing and embedding our data using a suitable sentence transformer model. This creates high-dimensional vector representations capturing the semantic meaning of each data point.
#### Vector Search with Qdrant:
We'll leverage Qdrant's efficient vector search capabilities to retrieve relevant results based on a query vector, allowing us to explore the semantic relationships within the data.
#### UpTrain Evaluation of Retrieved Content:
 Utilizing UpTrain's pre-built and custom evaluation checks, we'll analyze the quality of the retrieved search results. This assessment focuses on aspects like response relevance, factual accuracy, and completeness, providing valuable insights into the effectiveness of the search process.

### Step 1 : Install all the libraries and make necessary imports

In [1]:
# Install required libraries
!pip install qdrant-client>=1.1.1
!pip install -U sentence-transformers
!pip install uptrain

[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorboard 2.15.1 requires protobuf<4.24,>=3.19.6, but you have protobuf 4.25.2 which is incompatible.
tensorflow-metadata 1.14.0 requires protobuf<4.21,>=3.20.3, but you have protobuf 4.25.2 which is incompatible.[0m[31m
[0mCollecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m86.0/86.0 kB[0m [31m3.4 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting sentencepiece (from sentence-transformers)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m17.0 MB/s[0m eta [36m0:00:00[0m
Building wheels for collected packages: senten

In [2]:
# Import necessary libraries
import polars as pl  # For data manipulation
from qdrant_client import models, QdrantClient  # Qdrant client for vector search
from sentence_transformers import SentenceTransformer  # Model for generating embeddings
from uptrain import APIClient, Evals  # UpTrain client for embedding evaluation

### Step 2: Define your document to be embedded

In [3]:
# Load data (adjust based on your data source)
# Assuming you have a list of texts for Qdrant embedding and evaluation

# Let's make a semantic search for Sci-Fi books!
texts= [
  { "name": "The Time Machine", "description": "A man travels through time and witnesses the evolution of humanity.",
   "author": "H.G. Wells", "year": 1895 },
  { "name": "Ender's Game", "description": "A young boy is trained to become a military leader in a war against an alien race.",
   "author": "Orson Scott Card", "year": 1985 },
  { "name": "Brave New World", "description": "A dystopian society where people are genetically engineered and conditioned to conform to a strict social hierarchy.",
   "author": "Aldous Huxley", "year": 1932 },
  { "name": "The Hitchhiker's Guide to the Galaxy", "description": "A comedic science fiction series following the misadventures of an unwitting human and his alien friend.",
   "author": "Douglas Adams", "year": 1979 },
  { "name": "Dune", "description": "A desert planet is the site of political intrigue and power struggles.",
   "author": "Frank Herbert", "year": 1965 },
  { "name": "Snow Crash", "description": "A futuristic world where the internet has evolved into a virtual reality metaverse.",
   "author": "Neal Stephenson", "year": 1992 },
  { "name": "The War of the Worlds", "description": "A Martian invasion of Earth throws humanity into chaos.",
   "author": "H.G. Wells", "year": 1898 },
  { "name": "The Andromeda Strain", "description": "A deadly virus from outer space threatens to wipe out humanity.",
   "author": "Michael Crichton", "year": 1969 },
  { "name": "The Left Hand of Darkness", "description": "A human ambassador is sent to a planet where the inhabitants are genderless and can change gender at will."
  , "author": "Ursula K. Le Guin", "year": 1969 },

]
  # Replace with your actual text data


### Step 3: Choose your embedding model and create a memory instance for the same

In [4]:
# Create sentence transformer model
encoder = SentenceTransformer('all-MiniLM-L6-v2')  # Choose the embedding model


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


.gitattributes:   0%|          | 0.00/1.18k [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

data_config.json:   0%|          | 0.00/39.3k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

train_script.py:   0%|          | 0.00/13.2k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

In [5]:
qdrant = QdrantClient(":memory:") # Create in-memory Qdrant instance

In [6]:
# Create collection to store books
qdrant.recreate_collection(
    collection_name="my_books",
    vectors_config=models.VectorParams(
        size=encoder.get_sentence_embedding_dimension(), # Vector size is defined by used model
        distance=models.Distance.COSINE
    )
)


True

### Step 4: Generate embedding and vectorise from the defined text in the previous step


In [7]:
# Generate embeddings and upload to Qdrant
embeddings = encoder.encode(texts)  # Generate embeddings for the texts
qdrant.upload_records(
    collection_name="my_books",  # Upload to the specified collection
    records=[models.Record(id=idx, vector=embedding.tolist()) for idx, embedding in enumerate(embeddings)]  # Create Qdrant records
)



  qdrant.upload_records(


In [8]:
# Let's vectorize descriptions and upload to qdrant

qdrant.upload_records(
    collection_name="my_books",
    records=[
        models.Record(
            id=idx,
            vector=encoder.encode(doc["description"]).tolist(),
            payload=doc
        ) for idx, doc in enumerate(texts)
    ]
)

  qdrant.upload_records(


### Step 5: Get your UpTrain API key and define it for later use


In [9]:
# Define UpTrain API key
UPTRAIN_API_KEY = "up-a********************f8e45e"  # Replace with your UpTrain API key


 <h2 align="center"> Open-source framework: </h2>

You can evaluate your responses via the open-source version by providing your OpenAI API key to run evaluations. UpTrain leverages a pipeline comprising GPT-3.5 calls for the same.



*   Evaluates the quality of search results retrieved from a Qdrant vector search using UpTrain's language model evaluation capabilities.
*   Assesses relevance, factual accuracy, and completeness of the responses,
Provides insights into the strengths and weaknesses of the retrieved content



### Step 6: Run your desired evaluations on the OSS platform by UpTrain

In [11]:
# Imports (updated)
from uptrain import EvalLLM, Evals, APIClient
import json


# Evaluation using UpTrain's EvalLLM
eval_llm = EvalLLM(openai_api_key="sk-************************f9")  # Use EvalLLM for embedding evaluation

hits = qdrant.search(
    collection_name="my_books",
    query_vector=encoder.encode("Aliens attacking our planet").tolist(),
    limit=3
)



results = []
for hit in hits:
    print(hit.payload.get("text", ""))
    print(hit.payload, "score:", hit.score)  # Print initial search results

    embedding_to_evaluate = hit.vector
    data = [{
        'question': "Give a list of book with Aliens attack our planet",
        'context': "",
        'response': hit.payload.get("text", "")
    }]
    results.append(eval_llm.evaluate(
        data=data,
        checks=[Evals.CONTEXT_RELEVANCE] #choose your evaluation metrics accordingly
    ))

print(json.dumps(results, indent=3))


[32m2024-01-27 14:54:46.935[0m | [1mINFO    [0m | [36muptrain.framework.evalllm[0m:[36mevaluate[0m:[36m100[0m - [1mSending evaluation request for rows 0 to <50 to the Uptrain[0m



{'name': 'The War of the Worlds', 'description': 'A Martian invasion of Earth throws humanity into chaos.', 'author': 'H.G. Wells', 'year': 1898} score: 0.5115954004839619


[32m2024-01-27 14:54:50.663[0m | [1mINFO    [0m | [36muptrain.framework.evalllm[0m:[36mevaluate[0m:[36m100[0m - [1mSending evaluation request for rows 0 to <50 to the Uptrain[0m
[32m2024-01-27 14:54:53.785[0m | [1mINFO    [0m | [36muptrain.framework.evalllm[0m:[36mevaluate[0m:[36m100[0m - [1mSending evaluation request for rows 0 to <50 to the Uptrain[0m
[32m2024-01-27 14:54:56.704[0m | [1mINFO    [0m | [36muptrain.framework.evalllm[0m:[36mevaluate[0m:[36m100[0m - [1mSending evaluation request for rows 0 to <50 to the Uptrain[0m



{'name': 'The Andromeda Strain', 'description': 'A deadly virus from outer space threatens to wipe out humanity.', 'author': 'Michael Crichton', 'year': 1969} score: 0.409348291261325


[32m2024-01-27 14:54:58.821[0m | [1mINFO    [0m | [36muptrain.framework.evalllm[0m:[36mevaluate[0m:[36m100[0m - [1mSending evaluation request for rows 0 to <50 to the Uptrain[0m
[32m2024-01-27 14:55:00.937[0m | [1mINFO    [0m | [36muptrain.framework.evalllm[0m:[36mevaluate[0m:[36m100[0m - [1mSending evaluation request for rows 0 to <50 to the Uptrain[0m
[32m2024-01-27 14:55:02.451[0m | [1mINFO    [0m | [36muptrain.framework.evalllm[0m:[36mevaluate[0m:[36m100[0m - [1mSending evaluation request for rows 0 to <50 to the Uptrain[0m



{'name': "The Hitchhiker's Guide to the Galaxy", 'description': 'A comedic science fiction series following the misadventures of an unwitting human and his alien friend.', 'author': 'Douglas Adams', 'year': 1979} score: 0.34362304984912984


[32m2024-01-27 14:55:04.586[0m | [1mINFO    [0m | [36muptrain.framework.evalllm[0m:[36mevaluate[0m:[36m100[0m - [1mSending evaluation request for rows 0 to <50 to the Uptrain[0m
[32m2024-01-27 14:55:07.005[0m | [1mINFO    [0m | [36muptrain.framework.evalllm[0m:[36mevaluate[0m:[36m100[0m - [1mSending evaluation request for rows 0 to <50 to the Uptrain[0m


[
   [
      {
         "question": "Give a list of book with Aliens attack our planet",
         "context": "",
         "response": "",
         "score_context_relevance": 0.5,
         "explanation_context_relevance": "Step 1: Read the question and the extracted context carefully.\nStep 2: Determine if the extracted context provides a list of books with aliens attacking our planet.\nStep 3: If the context provides a complete list, select option (A). If the context provides some relevant information but not a complete list, select option (B). If the context does not provide any relevant information, select option (C).\nStep 4: Double-check the selected option to ensure it is the most accurate choice.\n\n0.5\n0.5"
      }
   ],
   [
      {
         "question": "Give a list of book with Aliens attack our planet",
         "context": "",
         "response": "",
         "score_context_relevance": 0.0,
         "explanation_context_relevance": "1. Read the question: \"Give a list of bo

# [Alternate]: Using UpTrain Managed Service and visualizing results on UpTrain Dashboards
You can create a free UpTrain account [here](https://uptrain.ai/)
and get free trial credits. If you want more trial credits, [book a call with the maintainers of UpTrain here](https://calendly.com/uptrain-sourabh/30min).

UpTrain Managed service provides:

Dashboards with advanced drill-down and filtering options
Insights and common topics among failing cases
Observability and real-time monitoring of production data
Regression testing via seamless integration with your CI/CD pipelines

In [12]:
uptrain_client = APIClient(uptrain_api_key="up-a*********************45e")


hits = qdrant.search(
    collection_name="my_books",
    query_vector=encoder.encode("Aliens attack our planet").tolist(),
    limit=3
)

results = []
for hit in hits:
    print(hit.payload, "score:", hit.score)  # Print initial search results

    embedding_to_evaluate = hit.vector
    data = [{
        'question': "Aliens attack our planet",
        'context': "",
        'response': hit.payload.get("text", "")
    }]
    results.append( uptrain_client.log_and_evaluate(
        "uptrain-qdrant integration",
        data=data,
        checks=[ Evals.CONTEXT_RELEVANCE]
    ))

print(json.dumps(results, indent=3))

[32m2024-01-27 14:56:05.641[0m | [1mINFO    [0m | [36muptrain.framework.remote[0m:[36mlog_and_evaluate[0m:[36m507[0m - [1mSending evaluation request for rows 0 to <50 to the Uptrain server[0m


{'name': 'The War of the Worlds', 'description': 'A Martian invasion of Earth throws humanity into chaos.', 'author': 'H.G. Wells', 'year': 1898} score: 0.526554060446077


[32m2024-01-27 14:59:09.415[0m | [1mINFO    [0m | [36muptrain.framework.remote[0m:[36mlog_and_evaluate[0m:[36m507[0m - [1mSending evaluation request for rows 0 to <50 to the Uptrain server[0m


{'name': 'The Andromeda Strain', 'description': 'A deadly virus from outer space threatens to wipe out humanity.', 'author': 'Michael Crichton', 'year': 1969} score: 0.4260536877911949


[32m2024-01-27 14:59:10.932[0m | [1mINFO    [0m | [36muptrain.framework.remote[0m:[36mlog_and_evaluate[0m:[36m507[0m - [1mSending evaluation request for rows 0 to <50 to the Uptrain server[0m


{'name': "The Hitchhiker's Guide to the Galaxy", 'description': 'A comedic science fiction series following the misadventures of an unwitting human and his alien friend.', 'author': 'Douglas Adams', 'year': 1979} score: 0.361734363792688
[
   [
      {
         "question": "Aliens attack our planet",
         "context": "",
         "response": "",
         "score_context_relevance": 0.5,
         "explanation_context_relevance": "Step 1: Read the question \"Aliens attack our planet\" and the extracted context.\nStep 2: Compare the semantic similarity of the extracted context with the question.\nStep 3: Determine if the extracted context contains sufficient information to answer the given question completely, or if it contains relevant but incomplete information to form the answer, or if it doesn't have any relevant information at all.\nStep 4: Based on the comparison, select the appropriate option: (A) The extracted context can answer the given question completely. (B) The extracted c

You can access the uptrain dashboards at https://demo.uptrain.ai/dashboard/ by using the above defined UPTRAIN_API_KEY

