# Vector Databases

## What is a Vector Database?

A vector database is a specialized database designed to store and retrieve data as vectors. These vectors are numerical representations (embeddings) of objects such as text, images, or products, which lets us find similar objects using a similarity measure such as cosine similarity or Euclidean distance. In simple terms, instead of looking for exact matches as traditional databases do, vector databases focus on **finding similar things**. For example, if you search for a product, the database returns items similar to it, such as products with related features or uses.
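
To make "finding similar things" concrete, here is a minimal sketch in plain Python. The three-dimensional vectors and product names are made up for illustration (real embeddings have hundreds or thousands of dimensions), and `most_similar` is a hypothetical helper, not part of any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings"; the values are invented for this example.
products = {
    "running shoes": [0.9, 0.1, 0.0],
    "trail sneakers": [0.8, 0.2, 0.1],
    "coffee maker": [0.0, 0.1, 0.9],
}


def most_similar(query, items, k=2):
    """Return the k item names whose vectors are most similar to the query."""
    ranked = sorted(
        items,
        key=lambda name: cosine_similarity(query, items[name]),
        reverse=True,
    )
    return ranked[:k]


query_vector = [0.85, 0.15, 0.05]  # hypothetical embedding of "jogging shoes"
print(most_similar(query_vector, products))
```

The two shoe products rank ahead of the coffee maker because their vectors point in nearly the same direction as the query vector.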

### How ANN (Approximate Nearest Neighbors) and HNSW (Hierarchical Navigable Small World) Work

#### Approximate Nearest Neighbors (ANN)
When dealing with high-dimensional vectors (like text embeddings or image embeddings), finding the exact nearest neighbors means comparing the query against every stored vector, which is computationally expensive and slow. Instead of searching for **exact** nearest neighbors, ANN focuses on finding **approximately** similar vectors quickly. ANN is a technique where you sacrifice a little bit of accuracy to gain **a lot of speed**.

In real-world applications, this trade-off is usually acceptable because most of the time, an approximate match is good enough. For example, when you're searching for similar products or images, the system doesn’t need to find the **perfect** match but just one that’s close enough.
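
The trade-off can be sketched with a simple bucket-based index (an inverted-file style approach, chosen here only because it is short; it is not the algorithm Weaviate uses). Vectors are grouped around a few centroids, and a query scans only the closest buckets. All names and sizes below are assumptions for illustration.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exact_search(query, vectors):
    """Linear scan: compares the query with every stored vector."""
    best = min(range(len(vectors)), key=lambda i: squared_distance(query, vectors[i]))
    return best, len(vectors)


def build_buckets(vectors, centroids):
    """Assign each vector index to the bucket of its nearest centroid."""
    buckets = [[] for _ in centroids]
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: squared_distance(v, centroids[c]))
        buckets[nearest].append(i)
    return buckets


def approximate_search(query, vectors, centroids, buckets, n_probe=1):
    """Scan only the n_probe buckets whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: squared_distance(query, centroids[c]))
    candidates = [i for c in order[:n_probe] for i in buckets[c]]
    best = min(candidates, key=lambda i: squared_distance(query, vectors[i]))
    return best, len(candidates)


random.seed(0)
vectors = [[random.random() for _ in range(8)] for _ in range(2000)]
centroids = random.sample(vectors, 20)  # centroids drawn from the data itself
buckets = build_buckets(vectors, centroids)
query = [random.random() for _ in range(8)]

exact_id, exact_cost = exact_search(query, vectors)
approx_id, approx_cost = approximate_search(query, vectors, centroids, buckets, n_probe=3)
print(f"exact:  id={exact_id}, vectors compared={exact_cost}")
print(f"approx: id={approx_id}, vectors compared={approx_cost}")
```

The approximate search compares far fewer vectors, at the risk of missing the true nearest neighbor when it sits in a bucket that was not probed. Raising `n_probe` trades speed back for accuracy.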

#### Hierarchical Navigable Small World (HNSW)

HNSW is one of the most popular algorithms used for fast **ANN** searches. Here’s a breakdown of how it works:

1. **Building a Graph:**
   HNSW creates a graph structure where each point (vector) is connected to a few other vectors. These connections are made in such a way that the graph resembles a "small-world network," meaning any point can be reached from any other in a small number of steps.

2. **Hierarchy of Layers:**
   The graph is divided into several layers. The top layer contains only a few points and connections, while the bottom layer contains every point. Think of it like a multi-level building: the higher you are, the fewer rooms there are to navigate between, and as you go lower, the number of rooms and connections increases. When you search for a vector, you start at the top and move down layer by layer, getting closer to the nearest neighbor.

3. **Navigating the Graph:**
   When a query vector comes in, HNSW starts by searching in the top layer and moves downward. At each layer, it looks for the most similar vectors based on the distance between them. As it moves through the layers, the search becomes more precise, and it narrows down to the closest vectors in the final layer. The process is fast because it reduces the search space in each layer.

4. **Efficiency Gains:**
   By organizing the graph in layers and connecting each vector only to a small number of other vectors, HNSW makes the search process **quick and efficient**, even in very large datasets.
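
The four steps above can be sketched as a toy index. This is a heavily simplified illustration, not Weaviate's implementation: neighbors are chosen by brute force at insert time, node degrees are never pruned, and the search keeps a single candidate per layer instead of a candidate list. The class name `TinyHNSW` and the parameter `m` are assumptions made for this example.

```python
import math
import random


class TinyHNSW:
    """Toy layered graph index for illustration only."""

    def __init__(self, m=8, seed=0):
        self.m = m                    # neighbors linked per insert (assumed value)
        self.rng = random.Random(seed)
        self.vectors = []             # node id -> vector
        self.layers = []              # layers[l]: node id -> set of neighbor ids
        self.entry = None             # entry point, a node on the top layer
        self.entry_level = -1

    def _random_level(self):
        # Each additional layer is reached with probability 1/m,
        # so higher layers hold progressively fewer nodes.
        level = 0
        while self.rng.random() < 1.0 / self.m:
            level += 1
        return level

    def add(self, vector):
        node = len(self.vectors)
        self.vectors.append(vector)
        level = self._random_level()
        while len(self.layers) <= level:
            self.layers.append({})
        for l in range(level + 1):
            # Simplification: pick neighbors by brute force within the layer.
            members = sorted(self.layers[l], key=lambda n: math.dist(vector, self.vectors[n]))
            nearest = members[: self.m]
            self.layers[l][node] = set(nearest)
            for n in nearest:
                self.layers[l][n].add(node)  # links are bidirectional
        if level > self.entry_level:
            self.entry, self.entry_level = node, level
        return node

    def _greedy(self, query, start, layer):
        # Move to any neighbor that is closer to the query; stop at a local minimum.
        current = start
        improved = True
        while improved:
            improved = False
            for n in self.layers[layer][current]:
                if math.dist(query, self.vectors[n]) < math.dist(query, self.vectors[current]):
                    current, improved = n, True
        return current

    def search(self, query):
        # Start at the top layer and reuse each layer's result as the next start.
        current = self.entry
        for layer in range(self.entry_level, -1, -1):
            current = self._greedy(query, current, layer)
        return current


rng = random.Random(42)
points = [[rng.random() for _ in range(4)] for _ in range(500)]
index = TinyHNSW(m=8)
for p in points:
    index.add(p)

query = [0.5, 0.5, 0.5, 0.5]
found = index.search(query)
exact = min(range(len(points)), key=lambda i: math.dist(query, points[i]))
print("layer sizes (bottom to top):", [len(layer) for layer in index.layers])
print("greedy result:", found, "exact nearest:", exact)
```

Because the search is greedy, the result is not guaranteed to be the exact nearest neighbor. Production implementations keep a list of candidates on the bottom layer to improve recall.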

#### Diagram of HNSW Process
![HNSW Diagram](https://images.app.goo.gl/8UNoRCgStboxjhaw9)

## Using Vector Databases
There are many options available for vector databases, both commercial and open-source. Throughout the course we will be using [Weaviate](https://weaviate.io/) as our vector database. Weaviate is an open-source, cloud-native vector database that allows you to store, search, and rank vectors. It pretty much ticks all the boxes for what we need in a vector database. 

### Using Weaviate
There are several ways to install and use Weaviate. You can run it with Docker, use their serverless cloud instance, or deploy it through cloud marketplaces such as AWS, Azure, and Google Cloud. For now we will use Embedded Weaviate, which is quite easy to set up and good enough for experimentation.

### Getting Started with Weaviate

In [8]:
import weaviate
from dotenv import load_dotenv
import os

load_dotenv("./../../.env")

# Never print the API key itself; only check that it was loaded.
openai_api_key = os.getenv("OPENAI_API_KEY")
if not openai_api_key:
    raise RuntimeError("OPENAI_API_KEY is not set")

client = weaviate.connect_to_embedded(
    headers={
        "X-OpenAI-Api-Key": openai_api_key
    }
)




#### Collections

A collection is a group of objects that share the same properties. For example, a collection named Article will contain all the articles in your database. Similarly, another collection named Author would contain all the authors. Each object in a collection is stored together with its vector.

In [11]:
from weaviate.classes.config import Property, DataType, Configure

if client.collections.exists("Article"):
    client.collections.delete("Article")

client.collections.create(
    "Article",
    properties=[
        Property(name="title", data_type=DataType.TEXT),
        Property(name="body", data_type=DataType.TEXT, vectorize_property_name=True),
        Property(name="date", data_type=DataType.DATE)
    ],
    vectorizer_config=Configure.Vectorizer.text2vec_openai(
        model="text-embedding-3-small"
    )
)

<weaviate.collections.collection.sync.Collection at 0x11e090f10>



#### Create Objects

For now, we are not creating the vectors for our objects manually. Instead, Weaviate handles that for us through the vectorizer configured on the collection.

In [16]:
article = client.collections.get("Article")

uuid = article.data.insert({
    "title": "Machine Learning",
    "body": "Machine learning is a subset of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.",
    "date": "2021-10-01T00:00:00Z",
    "category": "ML"
})

uuid

UUID('497564c2-993c-4b5b-bb09-82e43517b9aa')

#### Query Objects Using Similarity Search

Weaviate uses a distance metric to express how similar two vectors are: the smaller the distance, the more similar the vectors. The collection above uses cosine distance, which is derived from cosine similarity using the formula below:

$$
\operatorname{distance}(a, b) = 1 - \operatorname{cosine\_similarity}(a, b)
$$

The cosine distance is a value between 0 and 2, where 0 means the vectors point in the same direction (most similar), 1 means they are orthogonal, and 2 means they point in opposite directions (least similar).
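
As a quick check of that range, here is a small sketch in plain Python. `cosine_distance` and `certainty_from_distance` are hypothetical helper names. The certainty mapping `1 - distance / 2` is how Weaviate derives `certainty` from cosine distance, and it agrees with the distance and certainty values in the query output later in this notebook.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity, so the result lies in [0, 2]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def certainty_from_distance(distance):
    """Map a cosine distance in [0, 2] to a certainty in [0, 1]."""
    return 1.0 - distance / 2.0


print(cosine_distance([1.0, 0.0], [2.0, 0.0]))   # same direction
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
print(certainty_from_distance(0.576338529586792))
```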

In case you want to use other distance metrics such as dot product or Euclidean distance, refer to [this doc page](https://weaviate.io/developers/weaviate/manage-data/collections#specify-a-distance-metric) to understand how to configure them.

[This doc](https://weaviate.io/developers/weaviate/config-refs/distances#distance-vs-certainty) lists all the distance metrics that Weaviate supports.


In [37]:
from weaviate.classes.query import MetadataQuery
import textwrap

response = article.query.near_text(
    query="ai and ml",
    return_metadata=MetadataQuery(distance=True, certainty=True), # return distance and certainty metrics
    include_vector=True # include each returned object's vector in the response
)


def print_objects(objects):
    """Print the ID, metadata, and properties of each retrieved object."""
    for obj in objects:
        print(f"ID: {obj.uuid.int}")
        print(f"Distance: {obj.metadata.distance}, Certainty: {obj.metadata.certainty}")
        print(f"Title: {obj.properties['title']}")
        print(f"Date: {obj.properties['date']}")
        print(f"Category: {obj.properties['category']}")
        print(f"Body: {textwrap.shorten(obj.properties['body'], width=100)}")
        print()


print_objects(response.objects)

ID: 97643186083395424410299917816510790058
Distance: 0.576338529586792, Certainty: 0.711830735206604
Title: Machine Learning
Date: 2021-10-01 00:00:00+00:00
Category: ML
Body: Machine learning is a subset of artificial intelligence (AI) that provides systems the ability [...]

ID: 145885670213049490926508863968058307604
Distance: 0.8108642101287842, Certainty: 0.5945678949356079
Title: Hello World
Date: 2021-10-01 00:00:00+00:00
Category: general
Body: This is the first article



#### Read Objects

In [40]:
# fetch_object_by_id returns a single object, not a list
fetched = article.query.fetch_object_by_id(response.objects[0].uuid)
fetched.properties

{'title': 'Machine Learning',
 'date': datetime.datetime(2021, 10, 1, 0, 0, tzinfo=datetime.timezone.utc),
 'body': 'Machine learning is a subset of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.',
 'category': 'ML'}

#### Update Objects

In [41]:
article.data.update(
    uuid=response.objects[1].uuid,
    properties={
        "title": "Deep Learning",
        "body": "Deep learning is a subset of machine learning that is based on artificial neural networks. The learning process is based on the structure of the human brain. Deep learning algorithms attempt to draw similar conclusions as humans would by continually analyzing data with a given logical structure.",
    }
)

In [43]:
# avoid naming the variable `object`, which shadows the Python builtin
updated = article.query.fetch_object_by_id(response.objects[1].uuid)
updated.properties

{'title': 'Deep Learning',
 'date': datetime.datetime(2021, 10, 1, 0, 0, tzinfo=datetime.timezone.utc),
 'body': 'Deep learning is a subset of machine learning that is based on artificial neural networks. The learning process is based on the structure of the human brain. Deep learning algorithms attempt to draw similar conclusions as humans would by continually analyzing data with a given logical structure.',
 'category': 'general'}

#### Delete Object

In [46]:
article.data.delete_by_id(uuid=response.objects[1].uuid)

deleted = article.query.fetch_object_by_id(response.objects[1].uuid)
deleted  # None, because the object no longer exists