# Vector Database Basics

Vector databases help us store, manage, and query the embeddings we created for generative AI, recommenders, and search engines.

Across many of the common use cases, users often find that they need to manage more than just vectors.
To make it easier for practitioners, vector databases should store and manage all of the data they need:
- embedding vectors
- categorical metadata
- numerical metadata
- timeseries metadata
- text / pdf / images / video / point clouds

And support a wide range of query workloads:
- Vector search (may require ANN-index)
- Keyword search (requires full text search index)
- SQL (for filtering)

For this exercise we'll use LanceDB since it's open source and easy to setup

In [2]:
# pip install -U --quiet lancedb pandas pydantic [This has been pre-installed for you]

## Creating tables and adding data

Let's create a LanceDB table called `cats_and_dogs` under the local database directory `~/.lancedb`.
This table should have 4 fields:
- the embedding vector
- a string field indicating the species (either "cat" or "dog")
- the breed
- average weight in pounds

We're going to use pydantic to make this easier. First let's create a pydantic model with those fields

In [3]:
from lancedb.pydantic import vector, LanceModel

class CatsAndDogs(LanceModel):
    vector: vector(2)
    species: str
    breed: str
    weight: float

In [4]:
CatsAndDogs

__main__.CatsAndDogs

Now connect to a local db at ~/.lancedb and create an empty LanceDB table called "cats_and_dogs"

In [5]:
import lancedb

db = lancedb.connect('~/.lancedb')
table_name = "cats_and_dogs"
db.drop_table(table_name, ignore_missing=True)
table = db.create_table(table_name, schema=CatsAndDogs)
table

LanceTable(connection=LanceDBConnection(/Users/shlba/.lancedb), name="cats_and_dogs")

In [6]:
!ls

'Exercise 3 - Vector Database Basics.ipynb'


Let's add some data

First some cats

In [7]:
data = [
    CatsAndDogs(
        vector=[1., 0.],
        species="cat",
        breed="shorthair",
        weight=12.,
    ),
    CatsAndDogs(
        vector=[-1., 0.],
        species="cat",
        breed="himalayan",
        weight=9.5,
    ),
]

Now call the `LanceTable.add` API to insert these two records into the table

In [8]:
table.add([dict(d) for d in data])

Let's preview the data

In [9]:
table.head().to_pandas()

Unnamed: 0,vector,species,breed,weight
0,"[1.0, 0.0]",cat,shorthair,12.0
1,"[-1.0, 0.0]",cat,himalayan,9.5


Now let's add some dogs

In [10]:
data = [
    CatsAndDogs(
        vector=[0., 10.],
        species="dog",
        breed="samoyed",
        weight=47.5,
    ),
    CatsAndDogs(
        vector=[0, -1.],
        species="dog",
        breed="corgi",
        weight=26.,
    )
]

In [11]:
table.add([dict(d) for d in data])

In [12]:
table.head().to_pandas()

Unnamed: 0,vector,species,breed,weight
0,"[1.0, 0.0]",cat,shorthair,12.0
1,"[-1.0, 0.0]",cat,himalayan,9.5
2,"[0.0, 10.0]",dog,samoyed,47.5
3,"[0.0, -1.0]",dog,corgi,26.0


## Querying tables

Vector databases allow us to retrieve data for generative AI applications. Let's see how that's done.

Let's say we have a new animal that has embedding [10.5, 10.], what would you expect the most similar animal will be?
Can you use the table we created above to answer the question?

**HINT** you'll need to use the `search` API for LanceTable and `limit` / `to_df` APIs. For examples you can refer to [LanceDB documentation](https://lancedb.github.io/lancedb/basic/#how-to-search-for-approximate-nearest-neighbors).

In [13]:
table.search([10.5, 10.]).limit(1).to_pandas()

Unnamed: 0,vector,species,breed,weight,_distance
0,"[0.0, 10.0]",dog,samoyed,47.5,110.25


Now what if we use cosine distance instead? Would you expect that we get the same answer? Why or why not?

**HINT** you can add a call to `metric` in the call chain

In [14]:
table.search([10.5, 10.]).metric("cosine").limit(1).to_pandas()

Unnamed: 0,vector,species,breed,weight,_distance
0,"[1.0, 0.0]",cat,shorthair,12.0,0.275862


## Filtering tables

In practice, we often need to specify more than just a search vector for good quality retrieval. Oftentimes we need to filter the metadata as well.

Please write code to retrieve two most similar examples to the embedding [10.5, 10.] but only show the results that is a cat.

In [15]:
table.search([10.5, 10.]).limit(2).where("species='cat'").to_df()

  table.search([10.5, 10.]).limit(2).where("species='cat'").to_df()


Unnamed: 0,vector,species,breed,weight,_distance
0,"[1.0, 0.0]",cat,shorthair,12.0,190.25


## Creating ANN indices

For larger tables (e.g., >1M rows), searching through all of the vectors becomes quite slow. Here is where the Approximate Nearest Neighbor (ANN) index comes into play. While there are many different ANN indexing algorithms, they all have the same purpose - to drastically limit the search space as much as possible while losing as little accuracy as possible

For this problem we will create an ANN index on a LanceDB table and see how that impacts performance

### First let's create some data

Given the constraints of the classroom workspace, we'll complete this exercise by creating 100,000 vectors with 16D in a new table. Here the embedding values don't matter, so we simply generate random embeddings as a 2D numpy array. We then use the vec_to_table function to convert that in to an Arrow table, which can then be added to the table.

In [16]:
from lance.vector import vec_to_table
import numpy as np

mat = np.random.randn(1_000_000, 16)
table_name = "exercise3_ann"
db.drop_table(table_name, ignore_missing=True)
table = db.create_table(table_name, vec_to_table(mat))

### Let's establish a baseline without an index

Before we create the index, let's make sure know what we need to compare against.

We'll generate a random query vector and record it's value in the `query` variable so we can use the same query vector with and without the ANN index.

In [17]:
query = np.random.randn(16)
table.search(query).limit(10).to_df()

  table.search(query).limit(10).to_df()


Unnamed: 0,vector,_distance
0,"[-0.4611668, 1.3445679, -1.149227, 0.122381434...",2.857921
1,"[-0.5036552, 0.47419232, -0.7161889, 0.2983751...",3.085279
2,"[0.1495944, 0.9422459, -1.1396422, 0.29438668,...",3.088433
3,"[-0.080440834, 0.59383655, -0.66264814, 0.1655...",3.119587
4,"[0.0722229, 0.81390333, -0.08871777, 0.2153509...",3.140486
5,"[-0.49463022, 1.0360136, -0.5930591, 0.1494499...",3.857721
6,"[-0.7658261, 0.73778623, -0.7359389, 0.0382106...",3.859715
7,"[-0.60303307, 0.6784003, 0.014111206, 0.767907...",3.966603
8,"[-0.41171795, 0.6027729, -0.746797, 0.23270445...",4.026614
9,"[-0.57377183, -0.06744377, -0.04169583, 0.7258...",4.034904


Please write code to compute the average latency of this query

In [18]:
%timeit table.search(np.random.randn(16)).limit(10).to_arrow()

27.7 ms ± 637 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)


### Now let's create an index

There are many possible index types ranging from hash based to tree based to partition based to graph based.
For this task, we'll create an IVFPQ index (partition-based index with product quantization compression) using LanceDB.

Please create an IVFPQ index on the LanceDB table such that each partition is 4000 rows and each PQ subvector is 8D.

**HINT** 
1. Total vectors / number of partitions = number of vectors in each partition
2. Total dimensions / number of subvectors = number of dimensions in each subvector
3. This step can take about 7-10 minutes to process and execute in the classroom workspace.

In [19]:
table.create_index(num_partitions=512, num_sub_vectors=16)

Now let's search through the data again. Notice how the answers now appear different.
This is because an ANN index is always a tradeoff between latency and accuracy.

In [20]:
table.search(query).limit(10).to_df()

  table.search(query).limit(10).to_df()


Unnamed: 0,vector,_distance
0,"[-0.4611668, 1.3445679, -1.149227, 0.122381434...",2.856782
1,"[-0.5036552, 0.47419232, -0.7161889, 0.2983751...",3.082088
2,"[0.1495944, 0.9422459, -1.1396422, 0.29438668,...",3.087055
3,"[-0.080440834, 0.59383655, -0.66264814, 0.1655...",3.120408
4,"[0.0722229, 0.81390333, -0.08871777, 0.2153509...",3.202635
5,"[-0.7658261, 0.73778623, -0.7359389, 0.0382106...",3.842851
6,"[-0.49463022, 1.0360136, -0.5930591, 0.1494499...",3.849876
7,"[-0.60303307, 0.6784003, 0.014111206, 0.767907...",3.986332
8,"[-0.41171795, 0.6027729, -0.746797, 0.23270445...",4.039896
9,"[-0.57377183, -0.06744377, -0.04169583, 0.7258...",4.099825


Now write code to compute the average latency for querying the same table using the ANN index.

**SOLUTION** The index is implementation detail, so it should just be running the same code as above. You should see almost an order of magnitude speed-up. On larger datasets, this performance difference should be even more pronounced.

In [23]:
%timeit table.search(np.random.randn(16)).limit(10).to_arrow();

837 μs ± 7.11 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


## Deleting rows

Like with other kinds of databases, you should be able to remove rows from the table.
Let's go back to our tables of cats and dogs

In [24]:
table = db["cats_and_dogs"]

In [25]:
len(table)

4

Can you use the `delete` API to remove all of the cats from the table?

**HINT** use a SQL like filter string to specify which rows to delete from the table

In [27]:
table.delete("species='cat'")

In [28]:
len(table)

2

## What if I messed up?

Errors is a common occurrence in AI. What's hard about errors in vector search is that oftentimes a bad vector doesn't cause a crash but just creates non-sensical answers. So to be able to rollback the state of the database is very important for debugging and reproducibility

So far we've accumulated 4 actions on the table:
1. creation of the table
2. added cats
3. added dogs
4. deleted cats

What if you realized that you should have deleted the dogs instead of the cats?

Here we can see the 4 versions that correspond to the 4 actions we've done

In [29]:
table.list_versions()

[{'version': 1,
  'timestamp': datetime.datetime(2024, 8, 5, 12, 34, 40, 142068),
  'metadata': {}},
 {'version': 2,
  'timestamp': datetime.datetime(2024, 8, 5, 12, 34, 43, 742035),
  'metadata': {}},
 {'version': 3,
  'timestamp': datetime.datetime(2024, 8, 5, 12, 34, 50, 606746),
  'metadata': {}},
 {'version': 4,
  'timestamp': datetime.datetime(2024, 8, 5, 12, 39, 17, 493338),
  'metadata': {}}]

Please write code to restore the version still containing the whole dataset

In [30]:
table = db["cats_and_dogs"]

In [31]:
len(table)

2

In [32]:
# restore to version 3

table.restore(3)

In [33]:
# delete the dogs instead

table.delete("species='dog'")

In [34]:
table.list_versions()

[{'version': 1,
  'timestamp': datetime.datetime(2024, 8, 5, 12, 34, 40, 142068),
  'metadata': {}},
 {'version': 2,
  'timestamp': datetime.datetime(2024, 8, 5, 12, 34, 43, 742035),
  'metadata': {}},
 {'version': 3,
  'timestamp': datetime.datetime(2024, 8, 5, 12, 34, 50, 606746),
  'metadata': {}},
 {'version': 4,
  'timestamp': datetime.datetime(2024, 8, 5, 12, 39, 17, 493338),
  'metadata': {}},
 {'version': 5,
  'timestamp': datetime.datetime(2024, 8, 5, 12, 43, 2, 317954),
  'metadata': {}},
 {'version': 6,
  'timestamp': datetime.datetime(2024, 8, 5, 12, 43, 18, 434969),
  'metadata': {}}]

In [35]:
table.to_pandas()

Unnamed: 0,vector,species,breed,weight
0,"[1.0, 0.0]",cat,shorthair,12.0
1,"[-1.0, 0.0]",cat,himalayan,9.5


## Dropping a table

You can also choose to drop a table, which also completely removes the data.
Note that this operation is not reversible.

In [36]:
"cats_and_dogs" in db

True

Write code to irrevocably remove the table "cats_and_dogs" from the database

In [37]:
db.drop_table("cats_and_dogs")

How would you verify that the table has indeed been deleted?

In [39]:
table.name

'cats_and_dogs'

In [40]:
table.name in db

False

## Summary

Congrats, in this exercise you've learned the basic operations of vector databases from creating tables, to adding data, and to querying the data. You've learned how to create indices and you saw first hand how it changes the performance and the accuracy. Lastly, you've learned how to debug and rollback when errors happen.