## Scalability of Vector Databases

- Purpose: Vector databases are designed to handle and search through vast amounts of data efficiently. If you have tens of millions or even billions of data objects, traditional databases might slow down, but vector databases are optimized for this scale.
- How It Works: They compare vectors using distance or similarity measures (for example, cosine similarity) to retrieve relevant objects in real time. This is particularly useful for tasks like vector search or semantic search.
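
As a rough illustration of the "mathematical operations" mentioned above, the sketch below ranks a few stored vectors by cosine similarity to a query. The three-dimensional vectors and their labels are made up for the example; real embeddings usually have hundreds of dimensions, and a real vector database avoids scanning every vector like this.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings (assumption: not produced by any real model)
stored = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.3],
}
query = [0.85, 0.15, 0.05]

# Rank stored items from most to least similar to the query
ranked = sorted(
    stored,
    key=lambda name: cosine_similarity(query, stored[name]),
    reverse=True,
)
print(ranked)
```

The item whose vector points in nearly the same direction as the query ranks first, and the unrelated one ranks last.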

## Approximate Nearest Neighbors (ANN)

- Concept: Instead of comparing a query vector to every stored vector one by one, ANN algorithms narrow the search to a small set of candidates that are likely to be in the neighborhood of the query vector. This speeds up the search process.
- Trade-off: The trade-off with ANN is that it might not always find the exact nearest neighbors, but it significantly boosts performance.
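
To make the trade-off concrete, here is a minimal sketch that compares an exact brute-force search with a simple partition-based approximate search. This is one of the simplest ANN strategies, not the method any particular database uses. The choice of the first 10 vectors as coarse "centroids" is an assumption made to keep the example short; real systems typically learn centroids with k-means.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

random.seed(0)
vectors = [[random.random(), random.random()] for _ in range(1000)]

# Assumption: the first 10 vectors act as coarse centroids
centroids = vectors[:10]
buckets = {c: [] for c in range(len(centroids))}
for idx, v in enumerate(vectors):
    nearest_c = min(range(len(centroids)), key=lambda c: dist(v, centroids[c]))
    buckets[nearest_c].append(idx)

def exact_search(query):
    """Compare the query with every stored vector."""
    return min(range(len(vectors)), key=lambda i: dist(query, vectors[i]))

def approximate_search(query):
    """Compare the query only with vectors in the closest bucket."""
    c = min(range(len(centroids)), key=lambda c: dist(query, centroids[c]))
    candidates = buckets[c]
    best = min(candidates, key=lambda i: dist(query, vectors[i]))
    return best, len(candidates)

query = [0.5, 0.5]
exact = exact_search(query)
approx, compared = approximate_search(query)

print("exact search compared", len(vectors), "vectors")
print("approximate search compared", compared, "vectors")
print("same result:", exact == approx)
```

The approximate search looks at far fewer vectors, but if the true nearest neighbor sits just across a bucket boundary it will be missed. That is the accuracy cost described above.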

## Hierarchical Navigable Small Worlds (HNSW)

- Algorithm: HNSW is a popular ANN algorithm used in vector databases. It builds a multi-layer graph in which each vector is a point linked to nearby points.
- Upper Levels: The top levels hold only a few points, which are far apart.
- Lower Levels: Each lower level holds more points, which are closer together; the lowest level holds every point.
- Search: Begin at the top level and move between distant points toward the one closest to the query. Then drop down a level and repeat from that point until the closest neighbors are found at the lowest level.
- Efficiency: Because each level narrows the search, only a small fraction of the vectors is ever compared with the query. This lets the database scale to billions of objects while still searching in real time.
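
The layered search described above can be sketched with a toy model. This is a heavily simplified illustration, not the real HNSW algorithm: the layers are built by random sampling and brute-force neighbor linking, and the search tracks a single candidate instead of a candidate list. The names `build_layers` and `greedy_search` and all parameter values are hypothetical.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_layers(vectors, num_layers=3, keep=0.2, m=5, seed=0):
    """Layer 0 holds every point; each higher layer keeps a random
    fraction of the layer below. Inside a layer, each node is linked
    to its m nearest neighbors (brute force, unlike real HNSW)."""
    rng = random.Random(seed)
    members = [list(range(len(vectors)))]
    for _ in range(num_layers - 1):
        below = members[-1]
        size = max(1, int(len(below) * keep))
        members.append(rng.sample(below, size))
    layers = []
    for nodes in members:
        graph = {}
        for i in nodes:
            others = sorted(
                (j for j in nodes if j != i),
                key=lambda j: dist(vectors[i], vectors[j]),
            )
            graph[i] = others[:m]
        layers.append(graph)
    return layers

def greedy_search(vectors, layers, query):
    """Start at the sparse top layer, walk greedily toward the query,
    then drop to the next layer and continue from the same node."""
    current = next(iter(layers[-1]))  # entry point in the top layer
    for graph in reversed(layers):
        while True:
            best = min(
                graph[current],
                key=lambda j: dist(query, vectors[j]),
                default=current,
            )
            if dist(query, vectors[best]) >= dist(query, vectors[current]):
                break  # no neighbor is closer: descend one layer
            current = best
    return current

random.seed(1)
vectors = [[random.random(), random.random()] for _ in range(200)]
layers = build_layers(vectors)
query = [0.3, 0.7]

entry = next(iter(layers[-1]))
found = greedy_search(vectors, layers, query)
print("entry point distance:", round(dist(query, vectors[entry]), 3))
print("result distance:", round(dist(query, vectors[found]), 3))
```

The long hops in the sparse upper layers cover distance quickly, and the dense bottom layer refines the result. The greedy walk can stop at a local minimum, which is why the result is approximate.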


## Why Use Vector Databases?
- Speed: They can handle large-scale data and perform searches quickly.
- Optimization: They are optimized for mathematical operations on vectors, making them suitable for applications like image and text search.

## Ways to measure performance of a vector DB
Three key metrics are commonly used to evaluate how well a vector database performs:

1 - Recall:

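- What It Is: Recall measures the fraction of the true nearest neighbors that the database actually returns for a query. A recall of 0.9 means that 9 out of 10 of the truly closest items were found.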
- Analogy: Imagine you're looking for all your friends at a crowded party. Recall is like counting how many of your friends you actually find compared to how many are really there. The higher the recall, the more friends you find.
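
A minimal sketch of how recall could be computed, assuming you already have the true nearest neighbors (for example, from a brute-force search) and the results returned by an ANN index. The function name `recall_at_k` and the ID lists are hypothetical.

```python
def recall_at_k(retrieved_ids, true_ids):
    """Fraction of the true nearest neighbors found in the results."""
    true_set = set(true_ids)
    if not true_set:
        return 0.0
    return len(true_set & set(retrieved_ids)) / len(true_set)

# Hypothetical IDs: the ANN index missed item 25 and returned 40 instead
true_neighbors = [3, 7, 12, 25, 31]
ann_results = [3, 7, 12, 40, 31]

recall = recall_at_k(ann_results, true_neighbors)
print(recall)  # 4 of the 5 true neighbors were found: 0.8
```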

2 - Queries Per Second (QPS):

- What It Is: QPS measures how fast the database can handle search requests.
- Analogy: Think of QPS like the speed of a cashier at a busy store. The more customers (queries) the cashier can help per second, the faster the service. Higher QPS means the database can handle more searches quickly.
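
One simple way to estimate QPS is to time a batch of queries and divide the count by the elapsed time. In the sketch below, `dummy_search` is a hypothetical stand-in for a real search call, and the measurement is single-threaded, so it ignores the concurrency a real benchmark would include.

```python
import time

def measure_qps(search_fn, queries):
    """Run every query once and return queries per second."""
    start = time.perf_counter()
    for q in queries:
        search_fn(q)
    elapsed = time.perf_counter() - start
    return len(queries) / elapsed if elapsed > 0 else float("inf")

def dummy_search(query):
    # Hypothetical stand-in for a vector search: rank 1000 items
    # by their distance to the query and keep the top 10
    return sorted(range(1000), key=lambda i: abs(i - query))[:10]

qps = measure_qps(dummy_search, list(range(200)))
print(f"{qps:.0f} queries per second")
```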

3 - Memory Usage:

- What It Is: This metric shows how much computer memory (RAM) the database uses to perform searches.
- Analogy: Imagine your computer's memory as a bookshelf. The more memory the database uses, the more space it takes up on the shelf. Using less memory is better because it leaves more room for other tasks.
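
As a back-of-the-envelope sketch, the memory needed just to hold the raw vectors is the number of vectors times the number of dimensions times the bytes per value. The figures below (one million vectors, 768 dimensions, 4-byte floats) are assumptions for illustration, and the estimate excludes index structures such as the HNSW graph, which add to the total.

```python
def raw_vector_memory_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Memory for the raw vectors only, excluding any index overhead."""
    return num_vectors * dimensions * bytes_per_value

# Assumed workload: 1 million vectors with 768 float32 dimensions
total_bytes = raw_vector_memory_bytes(1_000_000, 768)
gib = total_bytes / 1024**3
print(f"{total_bytes} bytes, about {gib:.2f} GiB")
```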


## Trade-offs
- Speed vs. Accuracy: Often, there's a trade-off between recall and QPS. If you want faster searches (higher QPS), you might miss some similar items (lower recall).
- Memory Efficiency: Some databases reduce memory usage with techniques such as vector compression (for example, product quantization), usually at some cost in recall.

# CRUD Operations

In [10]:
import requests
import os
import json
import weaviate
from weaviate.embedded import EmbeddedOptions

In [11]:
# Download the data from https://raw.githubusercontent.com/weaviate-tutorials/quickstart/main/data/jeopardy_tiny.json

response = requests.get('https://raw.githubusercontent.com/weaviate-tutorials/quickstart/main/data/jeopardy_tiny.json')
data = json.loads(response.text)

# Helper to pretty-print JSON, then preview the parsed data
def json_print(data):
    print(json.dumps(data, indent=2))

print(type(data))
print(len(data))
json_print(data[0])


<class 'list'>
10
{
  "Category": "SCIENCE",
  "Question": "This organ removes excess glucose from the blood & stores it as glycogen",
  "Answer": "Liver"
}


In [12]:
# Connect to the Weaviate Cloud instance (API keys are read from environment variables)
cohere_api_key = os.getenv("COHERE_APIKEY")

weaviate_api_key = os.getenv("WEAVIATE_API_KEY")

auth_config = weaviate.AuthApiKey(api_key=weaviate_api_key)

client = weaviate.Client(
    url="https://e2pxfwhqioinxijlmnqxw.c0.europe-west3.gcp.weaviate.cloud",
    auth_client_secret=auth_config,
    additional_headers={
        "X-Cohere-Api-Key": cohere_api_key
        #"X-Google-Studio-Api-Key": ai_studio_api_key
    }
)

In [17]:
# Delete the "Question" class if it already exists
if client.schema.exists("Question"):
    client.schema.delete_class("Question")
    print("Class Deleted Successfully")

Class Deleted Successfully


In [18]:
# Create class object
class_obj = {
    "class": "Question",
    "vectorizer": "text2vec-cohere",
}

client.schema.create_class(class_obj)
print("Class created")

Class created


In [22]:
# Create multiple objects by specifying question, answer, and category properties
objects = [
    {
        "question": "What is the capital of France?",
        "answer": "Paris",
        "category": "Geography"
    },
    {
        "question": "Who is the current President of France?",
        "answer": "Emmanuel Macron",
        "category": "Politics"
    },
    {
        "question": "What is the name of the largest city in France?",
        "answer": "Paris",
        "category": "Geography"
    },
    {
        "question": "Who is the current Prime Minister of Spain?",
        "answer": "Pedro Sánchez",
        "category": "Politics"
    },
    {
        "question": "What is the name of the largest city in Spain?",
        "answer": "Madrid",
        "category": "Geography"
    },
    # Add more objects here
]

# Create each object separately
for obj in objects:
    uuid = client.data_object.create(obj, class_name="Question")
    print("Data object created with UUID:", uuid)

Data object created with UUID: 798c6b47-4d7c-4471-a9c4-533140ed176c
Data object created with UUID: 437ec891-34f8-4327-b7f7-477ac49b7519
Data object created with UUID: a9a56964-1d84-474c-ad80-1e4f99775455
Data object created with UUID: 545b6f97-54ae-4f85-9b79-fe0e5ad116aa
Data object created with UUID: 3d581d45-ac55-42eb-a28e-36080f83c758


In [27]:
# Read back the last object created above, using its UUID
# (after the loop, `uuid` holds the UUID of the final object)
objects_uuid = uuid
objects = client.data_object.get_by_id(objects_uuid, class_name="Question")

# Print the retrieved object
print("Objects retrieved successfully:")
print(json.dumps(objects, indent=2))

Objects retrieved successfully:
{
  "class": "Question",
  "creationTimeUnix": 1725454008570,
  "id": "3d581d45-ac55-42eb-a28e-36080f83c758",
  "lastUpdateTimeUnix": 1725454008570,
  "properties": {
    "answer": "Madrid",
    "category": "Geography",
    "question": "What is the name of the largest city in Spain?"
  },
  "vectorWeights": null
}


In [30]:
# Extract the vector for this object
data_object = client.data_object.get_by_id("3d581d45-ac55-42eb-a28e-36080f83c758", class_name="Question", with_vector=True)
print("Object retrieved successfully:")
print(json.dumps(data_object, indent=2))

Object retrieved successfully:
{
  "class": "Question",
  "creationTimeUnix": 1725454008570,
  "id": "3d581d45-ac55-42eb-a28e-36080f83c758",
  "lastUpdateTimeUnix": 1725454008570,
  "properties": {
    "answer": "Madrid",
    "category": "Geography",
    "question": "What is the name of the largest city in Spain?"
  },
  "vector": [
    0.009674072,
    0.024749756,
    0.06390381,
    0.022109985,
    -0.031341553,
    0.020889282,
    -0.011665344,
    -0.025344849,
    -0.022277832,
    -0.06304932,
    0.017974854,
    -0.009918213,
    -0.028625488,
    -0.011306763,
    -0.08300781,
    -0.008903503,
    -0.0010280609,
    -0.00630188,
    0.011222839,
    0.021118164,
    0.0047798157,
    0.032287598,
    0.004558563,
    -0.0390625,
    -0.07122803,
    0.0362854,
    -0.025100708,
    -0.0053215027,
    -0.022888184,
    -0.06518555,
    -0.014419556,
    0.036743164,
    0.07495117,
    0.036712646,
    -0.01902771,
    -0.026443481,
    -0.026321411,
    -0.04876709,
    ... (output truncated)

In [31]:
# Update the object's answer (the other properties are left unchanged)
client.data_object.update(
    uuid=objects_uuid,
    class_name="Question",
    data_object={
        "answer": "Updated Answer",
    },
)

In [32]:
# View the updated object
data_object = client.data_object.get_by_id(objects_uuid, class_name="Question")
print("Object retrieved successfully:")
print(json.dumps(data_object, indent=2))

Object retrieved successfully:
{
  "class": "Question",
  "creationTimeUnix": 1725454008570,
  "id": "3d581d45-ac55-42eb-a28e-36080f83c758",
  "lastUpdateTimeUnix": 1725458160296,
  "properties": {
    "answer": "Updated Answer",
    "category": "Geography",
    "question": "What is the name of the largest city in Spain?"
  },
  "vectorWeights": null
}


In [33]:
# Delete the object from class using uuid

client.data_object.delete(uuid=objects_uuid, class_name="Question")
print("Object deleted successfully")

Object deleted successfully


In [36]:
# Count the objects remaining in the Question class (4 of the original 5 after the delete)
json_print(client.query.aggregate("Question").with_meta_count().do())

{
  "data": {
    "Aggregate": {
      "Question": [
        {
          "meta": {
            "count": 4
          }
        }
      ]
    }
  }
}
