# Vector databases

<!-- testing: ignore -->

## Learning outcomes

By the end of this session, you should be able to:

* Define vector databases and semantic vectors in your own words.
* List common applications for vector databases.
* Describe how to use semantic vector representations to solve a business problem.
* List the limitations of vector databases and semantic vectors.

## Vector databases introduction

Vector databases are specialized databases designed to efficiently handle data in vector form. Vectors are arrays of floating-point values. Together, those values encode other, more complex data, such as text. Vector representations can improve performance across a wide range of applications, including machine learning (ML) and information retrieval. To scale these data-driven applications, vector databases support the rapid retrieval of relevant vectors from large collections. Traditional relational databases are not optimized for vector-based queries and can be extremely slow when performing those operations. Vector databases support ML and AI at scale and in production environments by making vector data more accessible.

## Semantic vectors introduction

As stated above, vectors are arrays of floating-point numbers. These vectors serve as compact, dense representations of complex data, such as text, images, sound, or video. They are also fixed-length: no matter the size of the object being encoded, the resulting vector is always the same size. Vectors are easier for machines to process than those other complex data types. The vectors typically used in vector databases encode semantic (i.e., meaning) information about the entities.

One consequence of using semantic vectors is that the distance between two vectors measures the relatedness of the underlying entities. Small distances suggest high relatedness, and large distances suggest low relatedness. For example, the vectors for "dog" and "cat" should be closer to each other than the vectors for "dog" and "penguin". Semantic vectors are important in modern artificial intelligence (AI) applications because ML algorithms require data to be numerically encoded. Semantic vectors are numerically encoded data where distance is meaningful. If data can be encoded as a semantic vector, it can be used in ML.
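To make the idea of "distance as relatedness" concrete, here is a minimal sketch. The three-dimensional vectors are made up for illustration and are not produced by any model. Real embeddings have hundreds or thousands of dimensions.

```python
import numpy as np

# Made-up 3-dimensional vectors for illustration only;
# real embeddings come from a trained model.
toy_vectors = {
    "dog": np.array([0.9, 0.8, 0.1]),
    "cat": np.array([0.8, 0.9, 0.2]),
    "penguin": np.array([0.3, 0.2, 0.9]),
}

def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return float(np.linalg.norm(a - b))

dog_cat = euclidean_distance(toy_vectors["dog"], toy_vectors["cat"])
dog_penguin = euclidean_distance(toy_vectors["dog"], toy_vectors["penguin"])

print(f"dog to cat:     {dog_cat:.3f}")
print(f"dog to penguin: {dog_penguin:.3f}")
```

With these toy values, "dog" sits closer to "cat" than to "penguin", which is the pattern we hope a good embedding model produces.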

Semantic vectors are also called vector embeddings or simply embeddings.

## Vector database operations and applications

Vector databases can perform the standard database operations of insertion, update, deletion, and retrieval of specific records. Since vector databases store semantic vectors, they also support additional operations. The most common vector operation is finding the vectors most similar to a given query vector, also known as nearest neighbor search. Since each vector is high-dimensional (hundreds to thousands of floats) and there are often many vectors (thousands, millions, or billions), this can be computationally expensive. Nearest neighbor search can support many applications:

* **Recommendations**: Given a query entity, find nearby related entities.

* **Search**: Given a query entity, return results ranked by relevance.

* **Clustering**: Group entities by similarity.

* **Anomaly detection**: Identify entities that are dissimilar to a set of reference entities.

* **Classification**: Assign a label to an entity.
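The simplest form of nearest neighbor search compares the query against every stored vector. The sketch below uses NumPy and randomly generated stand-in vectors. The function name `nearest_neighbors` is hypothetical, and this brute-force approach is shown only to illustrate the operation, not how any particular vector database implements it.

```python
import numpy as np

def nearest_neighbors(query, vectors, k=3):
    """Return indices and distances of the k stored vectors closest to query."""
    # Euclidean distance from the query to every stored vector
    distances = np.linalg.norm(vectors - query, axis=1)
    # Sorting every distance is fine for small data but slow at scale
    order = np.argsort(distances)[:k]
    return order, distances[order]

rng = np.random.default_rng(seed=0)
stored = rng.normal(size=(1_000, 8))  # 1,000 made-up vectors, 8 dimensions each
query = stored[42] + 0.01             # a query very close to stored vector 42

indices, dists = nearest_neighbors(query, stored, k=3)
print(indices)
print(dists)
```

Every query scans all stored vectors, so the cost grows with both the number of vectors and their dimensionality. That cost is what makes the operation expensive at scale.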

## Demo of segmenting businesses with vector embeddings

One common use case for semantic vectors and vector databases is clustering text data. Semantic vectors are currently one of the most effective ways to encode text data for use as input features in machine learning clustering. Clustering is grouping items together based on similarity. In the case of semantic vectors, similar items are those whose vectors are a small distance apart.

In the following demo, we are going to segment businesses. We'll cluster service stations from the United Kingdom together based on vector representations of their respective text profiles.

Here is a preview of the steps:

1. Get embeddings for each service station profile.
1. Cluster similar service station profiles together based on those embeddings.
1. Evaluate the clusters, and try to improve them.

We'll use OpenAI embeddings. OpenAI is better known for ChatGPT, but the same organization also provides high-quality text embeddings through its API. Similar large language models (LLMs) power both ChatGPT and text embeddings.

In [1]:
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

Call the OpenAI API to get an embedding for a piece of text.

In [2]:
# Text to embed
text_string = "Hello, world!"

# Embedding model
model_id = "text-embedding-ada-002"

# Get the embedding of the text
embedding = openai.Embedding.create(input=text_string, engine=model_id)['data'][0]['embedding']
embedding[:2]

[0.0014531221240758896, 0.0028919042088091373]

In [3]:
print(f"Each embedding is composed of {len(embedding):,} floats.")

Each embedding is composed of 1,536 floats.


### What happens if you have a lot of vectors?

If you had a separate embedding for each of 10 million customer profiles, how large would that data be in bytes?


1. A 64-bit float is 8 bytes in size.
2. Each vector has 1,536 floats.
3. There are 10 million such vectors.

The size of each vector in bytes would be 1,536 floats × 8 bytes/float = 12,288 bytes.

The size of 10 million such vectors would be 10,000,000 vectors × 12,288 bytes/vector = 122,880,000,000 bytes, or about 122.88 GB.

Thus, a vector database would be useful.
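The same arithmetic can be checked in a few lines of Python. The helper name `embedding_storage_bytes` is hypothetical. The 4-byte case is an added assumption for comparison: some systems store embeddings as 32-bit floats, which halves the footprint.

```python
def embedding_storage_bytes(n_vectors, n_dims, bytes_per_float):
    """Raw storage for the vector values only (no indexes or metadata)."""
    return n_vectors * n_dims * bytes_per_float

n_vectors = 10_000_000
n_dims = 1_536

size_float64 = embedding_storage_bytes(n_vectors, n_dims, bytes_per_float=8)
# Assumption for comparison: storing each value as a 32-bit float instead
size_float32 = embedding_storage_bytes(n_vectors, n_dims, bytes_per_float=4)

print(f"64-bit floats: {size_float64:,} bytes ({size_float64 / 1e9:.2f} GB)")
print(f"32-bit floats: {size_float32:,} bytes ({size_float32 / 1e9:.2f} GB)")
```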


As mentioned earlier, we can measure the distance between two vector embeddings. The smaller the distance, the more similar the items. Let's write a helper function that passes a string to the OpenAI API and casts the returned Python list as a NumPy vector.

In [4]:
import numpy as np

def get_embedding(string, model_id="text-embedding-ada-002"):
    embedding_as_list = openai.Embedding.create(input=string, engine=model_id)['data'][0]['embedding']
    embedding_as_array = np.array(embedding_as_list)
    return embedding_as_array

Let's get the vector representations for a couple of animals.

In [5]:
dog_vector = get_embedding("dog")
cat_vector = get_embedding("cat")
penguin_vector = get_embedding("penguin")

We can now compute the Euclidean distance, also known as the "as the crow flies" distance, between pairs of those embeddings.

In [6]:
distance_dog_to_cat = np.linalg.norm(dog_vector - cat_vector)
distance_dog_to_cat

0.5225002841541313

In [7]:
distance_dog_to_penguin = np.linalg.norm(dog_vector - penguin_vector)
distance_dog_to_penguin

0.617644981045929

The distance between dog and cat is smaller than the distance between dog and penguin, which means that "dog" is semantically closer to "cat" than to "penguin".

We can use that same logic to compare the text from longer documents. Here are profiles of fictional service stations in the United Kingdom.

In [8]:
profiles = [
    "Green Fuel Oasis, Surrey: Nestled in the picturesque countryside of Surrey, Green Fuel Oasis is a sustainable service station that offers a unique blend of eco-conscious services. Not only can you refuel your vehicle with clean, renewable energy options like battery recharging, bioethanol, and  biodiesel.",
    "Coastal Retreat Rest Stop, Cornwall: Located along the rugged coastline of Cornwall, this service station is a haven for travelers seeking relaxation. With breathtaking ocean views, it's the perfect spot to take a break, refuel, and savor locally sourced seafood at the seafood grill. Don't forget to explore the coastal walking trails nearby.",
    "TechHub Express, Manchester: For the tech-savvy traveler, TechHub Express in Manchester is a cutting-edge service station. Offering fast Wi-Fi, stations for electric vehicles, a variety of biofuels, and a state-of-the-art VR gaming lounge, it's a place where you can recharge both your car and your devices while having a blast.",
    "Highland Haven, Scotland: Situated amidst the stunning Scottish Highlands, Highland Haven is a serene service station catering to adventurers exploring the rugged landscapes. Sample traditional Scottish fare at the cozy café, all while gazing at panoramic mountain vistas.",
    "Countryside Classic, Cotswolds: Experience the charm of the Cotswolds at Countryside Classic, a service station that embodies the region's quintessential beauty. This idyllic spot and serves up classic British tea and scones in a quaint, cottage-style café surrounded by rolling hills and historic villages."
]

Looking at the profiles, which ones seem similar to each other?

In [9]:
for profile in profiles:
    print(profile)
    print("#"*20)

Green Fuel Oasis, Surrey: Nestled in the picturesque countryside of Surrey, Green Fuel Oasis is a sustainable service station that offers a unique blend of eco-conscious services. Not only can you refuel your vehicle with clean, renewable energy options like battery recharging, bioethanol, and  biodiesel.
####################
Coastal Retreat Rest Stop, Cornwall: Located along the rugged coastline of Cornwall, this service station is a haven for travelers seeking relaxation. With breathtaking ocean views, it's the perfect spot to take a break, refuel, and savor locally sourced seafood at the seafood grill. Don't forget to explore the coastal walking trails nearby.
####################
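TechHub Express, Manchester: For the tech-savvy traveler, TechHub Express in Manchester is a cutting-edge service station. Offering fast Wi-Fi, stations for electric vehicles, a variety of biofuels, and a state-of-the-art VR gaming lounge, it's a place where you can recharge both your car and your devices while having a blast.
####################
Highland Haven, Scotland: Situated amidst the stunning Scottish Highlands, Highland Haven is a serene service station catering to adventurers exploring the rugged landscapes. Sample traditional Scottish fare at the cozy café, all while gazing at panoramic mountain vistas.
####################
Countryside Classic, Cotswolds: Experience the charm of the Cotswolds at Countryside Classic, a service station that embodies the region's quintessential beauty. This idyllic spot and serves up classic British tea and scones in a quaint, cottage-style café surrounded by rolling hills and historic villages.
####################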

Let's call the OpenAI API to get an embedding for each profile.

In [10]:
profiles_vectorized  = []

for profile in profiles:
    embedding = openai.Embedding.create(input=profile, engine=model_id)['data'][0]['embedding']
    profiles_vectorized.append(embedding)

Each text profile has been encoded as a single vector.

In [11]:
for profile in profiles_vectorized:
    print(profile[:3])

[0.012512183748185635, -0.006786952260881662, -0.012209794484078884]
[0.0034489489626139402, -0.013403340242803097, 0.01063619926571846]
[-0.0031966629903763533, -0.0052318209782242775, -0.007783764973282814]
[0.0053925723768770695, -0.01785260997712612, 0.012269635684788227]
[-0.0010894560255110264, 0.0055679637007415295, 0.0067053623497486115]
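Before clustering, it can help to look at how far apart every pair of vectors is. The sketch below is one way to build such a pairwise distance table with NumPy. It uses randomly generated stand-in vectors rather than the real profile embeddings, so the numbers themselves carry no meaning.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the profile embeddings: 5 made-up vectors, 6 dimensions each
vectors = rng.normal(size=(5, 6))

# Broadcasting builds every pairwise difference (shape 5 x 5 x 6),
# then the norm over the last axis collapses each difference to a distance
differences = vectors[:, None, :] - vectors[None, :, :]
distance_matrix = np.linalg.norm(differences, axis=-1)

print(np.round(distance_matrix, 2))
```

Row `i`, column `j` holds the Euclidean distance between vector `i` and vector `j`, so the diagonal is zero and the matrix is symmetric.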


Since the data has been encoded as numerical values, it is amenable to machine learning. We are going to perform k-means clustering. K-means clustering is a popular unsupervised machine learning algorithm used for partitioning a dataset into a set of distinct groups or clusters. K-means clustering does this by iteratively assigning data points to the cluster whose centroid (center) is nearest and updating the centroids based on the mean of the assigned data points, ultimately seeking to minimize the sum of squared distances between data points and their respective centroids. We'll use Python's scikit-learn library, the most popular machine learning library for traditional machine learning.
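The assign-then-update loop described above can be sketched in plain NumPy. This is a simplified illustration on made-up two-dimensional points, not scikit-learn's implementation. The function name `kmeans_sketch` and its fixed iteration count are assumptions made for brevity.

```python
import numpy as np

def kmeans_sketch(points, k, n_iterations=10, seed=0):
    """Simplified k-means: alternate assignment and centroid update steps."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iterations):
        # Assignment step: each point joins the cluster with the nearest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return labels, centroids

rng = np.random.default_rng(1)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(20, 2))  # made-up group near (0, 0)
blob_b = rng.normal(loc=5.0, scale=0.1, size=(20, 2))  # made-up group near (5, 5)
points = np.vstack([blob_a, blob_b])

labels, centroids = kmeans_sketch(points, k=2)
print(labels)
print(np.round(centroids, 2))
```

In practice, a library implementation such as scikit-learn's `KMeans` adds smarter initialization and a convergence check.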

In [13]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, # Try n_clusters=2 and n_clusters=3
                n_init='auto',
                random_state=40,
               )

kmeans.fit(profiles_vectorized);

Let's look at the results to see which text profile ended up in which group.

In [14]:
import pandas as pd

df = pd.DataFrame({
    "cluster": kmeans.labels_,
    "text": profiles,
}).sort_values(by=['cluster']).reset_index(drop=True)
df

| | cluster | text |
|---|---|---|
| 0 | 0 | Coastal Retreat Rest Stop, Cornwall: Located a... |
| 1 | 0 | Highland Haven, Scotland: Situated amidst the ... |
| 2 | 0 | Countryside Classic, Cotswolds: Experience the... |
| 3 | 1 | Green Fuel Oasis, Surrey: Nestled in the pictu... |
| 4 | 1 | TechHub Express, Manchester: For the tech-savv... |


Looking at the two clusters, one cluster is focused on the "food" topic, and the other cluster is focused on the "tech" topic.

Looking at the text of the two profiles in the tech cluster, they have few distinctive words in common. Instead of matching on common words, the embeddings pick up on the semantic similarities. As a result, profiles are matched on the underlying meaning behind the words.

In [15]:
print(profiles[0])
print("#"*20)
print(profiles[2])

Green Fuel Oasis, Surrey: Nestled in the picturesque countryside of Surrey, Green Fuel Oasis is a sustainable service station that offers a unique blend of eco-conscious services. Not only can you refuel your vehicle with clean, renewable energy options like battery recharging, bioethanol, and  biodiesel.
####################
TechHub Express, Manchester: For the tech-savvy traveler, TechHub Express in Manchester is a cutting-edge service station. Offering fast Wi-Fi, stations for electric vehicles, a variety of biofuels, and a state-of-the-art VR gaming lounge, it's a place where you can recharge both your car and your devices while having a blast.
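The preview listed evaluating the clusters as the final step. One possible way to compare `n_clusters=2` against `n_clusters=3` is the silhouette score from scikit-learn, where values closer to 1 indicate tighter, better-separated clusters. This metric is a suggestion rather than part of the original demo, and the sketch uses made-up stand-in vectors instead of the profile embeddings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Stand-in data: two tight, well-separated groups of made-up 4-dimensional vectors
vectors = np.vstack([
    rng.normal(loc=0.0, scale=0.05, size=(10, 4)),
    rng.normal(loc=1.0, scale=0.05, size=(10, 4)),
])

scores = {}
for k in (2, 3):
    labels = KMeans(n_clusters=k, n_init=10, random_state=40).fit_predict(vectors)
    # Silhouette score ranges from -1 to 1; higher means better-separated clusters
    scores[k] = silhouette_score(vectors, labels)
    print(f"n_clusters={k}: silhouette score = {scores[k]:.3f}")
```

Because the stand-in data contains two groups by construction, two clusters should score higher than three. With real embeddings, the best value is not known in advance, so it is worth trying several.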


### Extending this example with vector databases

In this example, we only had a handful of examples. It is more common to have thousands or millions of examples. ML algorithms become very computationally intensive at that scale, especially algorithms like k-means clustering. Vector databases are designed to manage vectors and vector-based operations at that scale, with algorithms optimized for vector computation. Because vector databases focus on those operations rather than on general-purpose computing, they can be much faster at specific vector tasks.

One primary reason to use vector databases is to persist the vectors for later. Right now, the vectors are only in memory. If the computer restarts, the vector information will be lost. Vector databases can also cache (that is, store) the embeddings so that we do not need to call the OpenAI API again for the same item. This is important because every call to the OpenAI API costs money. In addition, reading from a vector database should be faster and more reliable than an API call.
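The persist-and-reuse pattern can be sketched with Python's built-in `sqlite3` module. SQLite is not a vector database and offers no nearest neighbor search. It stands in here only to show how stored embeddings avoid repeated API calls. The class `EmbeddingCache` and the function `fake_embed` are hypothetical, and `fake_embed` replaces the real, paid API call.

```python
import sqlite3
import numpy as np

class EmbeddingCache:
    """Store embeddings keyed by text so each text is embedded only once."""

    def __init__(self, path, embed_fn):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS embeddings (text TEXT PRIMARY KEY, vector BLOB)"
        )
        self.embed_fn = embed_fn

    def get(self, text):
        row = self.conn.execute(
            "SELECT vector FROM embeddings WHERE text = ?", (text,)
        ).fetchone()
        if row is not None:
            # Cache hit: rebuild the array from the stored bytes
            return np.frombuffer(row[0], dtype=np.float64)
        # Cache miss: compute the embedding, then store it for next time
        vector = np.asarray(self.embed_fn(text), dtype=np.float64)
        self.conn.execute(
            "INSERT INTO embeddings (text, vector) VALUES (?, ?)",
            (text, vector.tobytes()),
        )
        self.conn.commit()
        return vector

calls = []

def fake_embed(text):
    """Hypothetical stand-in for a paid embedding API call."""
    calls.append(text)
    return [float(len(text)), 1.0, 2.0]

# ":memory:" keeps the example self-contained; a file path would persist across restarts
cache = EmbeddingCache(":memory:", fake_embed)
first = cache.get("dog")
second = cache.get("dog")

print(f"Embedding function was called {len(calls)} time(s).")
```

The second lookup is served from the table, so the embedding function runs only once for the same text.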

## Common vector database solutions

The following is a selection of the most common vector databases:

* **pgvector**: A Postgres extension to store vector embeddings and perform vector similarity search. The advantages of pgvector are that it is open source and works with Postgres, a popular relational database management system (RDBMS).

* **Elasticsearch**: A search system that includes a vector database. The advantages of Elasticsearch are that it is a popular search system and is open source.

* **Pinecone**: A fully managed vector database designed for machine learning applications. The advantages of Pinecone are that it is highly performant and scalable.

* **Milvus**: An open-source vector database. The advantages of Milvus are that it is free to use and supports the most common vector database operations.

## The advantages and limitations of vector databases

As discussed previously, the primary advantages of vector databases are speed, scalability, and fundamental database guarantees for vector data types, such as the ACID properties (Atomicity, Consistency, Isolation, and Durability).

There are also limitations to vector databases. Vector databases are currently at the "Peak of Inflated Expectations" in the Gartner Hype Cycle. Vector databases provide value, but the current hype might be inflating that value. This is similar to the position that MongoDB was in several years ago: MongoDB has proven useful as a document database but was overused for a period of time. Another limitation is that the specific business use case might not be clear yet. Before adopting a vector database, make sure there is a clear ROI for the project and organization. Probably the biggest limitation is that a vector database is yet another separate tool to maintain. Businesses might want vector operations but not need a separate system just to support those operations. A vector extension for an existing database, such as pgvector for Postgres, might provide enough functionality with minimal additional complexity.

## Other vector systems

There might not be a need for a vector database solution, but it might be useful to have vector capabilities. The following are solutions that provide vector-based capabilities. 

* **Faiss (Facebook AI Similarity Search)**: Faiss is a library for efficient similarity search and clustering of vectors. It is written in C++ with complete wrappers for Python and NumPy.

* **Annoy (Approximate Nearest Neighbors Oh Yeah)**: Annoy is another library for approximate nearest neighbor search. It is optimized for high-dimensional vectors and provides Python bindings. Annoy was developed by Spotify, whose recommendations are often powered by vector embeddings of songs and related information.

## Limitations of semantic vectors

Semantic vectors only summarize their training data. Whatever social biases, stereotypes, and negative sentiments towards certain groups are in the training data will be reflected in the vector representations. It is a best practice to evaluate a set of vector embeddings for your specific use case. It might make sense to train custom vector embeddings or add control logic to minimize bias in end-user applications.

## Conclusion

* Vector databases store vectors and have specialized operations for those vectors.

* The vectors are fixed-length arrays of floating point numbers that represent semantic information about complex objects. 

* Finding the nearest neighbors of vectors can solve a wide variety of business problems.

*Copyright &copy; 2023 Pragmatic Institute. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.*