# Timescale-vector

> Python library for storing vector data in Postgres


## Install

```sh
pip install timescale_vector
```

## Basic Usage

Load your Postgres credentials. The safest way is to keep them in a `.env` file:

In [None]:
from dotenv import load_dotenv, find_dotenv
import os

In [None]:
_ = load_dotenv(find_dotenv()) 
service_url  = os.environ['TIMESCALE_SERVICE_URL'] 

Next, create the client. 

This takes three arguments: 

* A connection string
* The name of the collection
* Number of dimensions

In this tutorial, we use the async client. A sync client with an almost identical interface is also available.

In [None]:
#| hide
import asyncpg

In [None]:
#| hide
con = await asyncpg.connect(service_url)
await con.execute("DROP TABLE IF EXISTS my_data;")
await con.execute("DROP TABLE IF EXISTS my_data_with_time_partition;")
await con.close()

In [None]:
from timescale_vector import client 

In [None]:
vec  = client.Async(service_url, "my_data", 2)

Next, create the tables for the collection:

In [None]:
await vec.create_tables()

Next, insert some data. The data record contains:

* A UUID to uniquely identify the embedding
* A json blob of metadata about the embedding
* The text the embedding represents
* The embedding itself

Because this data already includes UUIDs, we only allow upserts.

In [None]:
import uuid

In [None]:
await vec.upsert([
    (uuid.uuid4(), '''{"animal":"fox"}''', "the brown fox", [1.0,1.3]),
    (uuid.uuid4(), '''{"animal":"fox", "action":"jump"}''', "jumped over the", [1.0,10.8]),
])

Now you can query for similar items:

In [None]:
await vec.search([1.0, 9.0])

[<Record id=UUID('e5dbaa7c-081b-4131-be18-c81ce47fc864') metadata={'action': 'jump', 'animal': 'fox'} contents='jumped over the' embedding=array([ 1. , 10.8], dtype=float32) distance=0.00016793422934946456>,
 <Record id=UUID('2cdb8cbd-5dd7-4555-926a-5efafb4b1cf0') metadata={'animal': 'fox'} contents='the brown fox' embedding=array([1. , 1.3], dtype=float32) distance=0.14489260377438218>]
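
The `distance` values in the output above are consistent with cosine distance (1 minus cosine similarity). That the client uses this metric by default is an inference from the numbers, not something stated here. A minimal pure-Python sketch, where `cosine_distance` is an illustrative helper and not part of the library:

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

query = [1.0, 9.0]
near = cosine_distance(query, [1.0, 10.8])  # embedding of "jumped over the"
far = cosine_distance(query, [1.0, 1.3])    # embedding of "the brown fox"
print(near, far)
```

The two values agree with the distances shown above up to float32 rounding, which is why "jumped over the" is ranked first.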

You can specify the number of records to return.

In [None]:
await vec.search([1.0, 9.0], limit=1)

[<Record id=UUID('e5dbaa7c-081b-4131-be18-c81ce47fc864') metadata={'action': 'jump', 'animal': 'fox'} contents='jumped over the' embedding=array([ 1. , 10.8], dtype=float32) distance=0.00016793422934946456>]

You can also filter on the metadata with a simple dictionary:

In [None]:
await vec.search([1.0, 9.0], limit=1, filter={"action": "jump"})

[<Record id=UUID('e5dbaa7c-081b-4131-be18-c81ce47fc864') metadata={'action': 'jump', 'animal': 'fox'} contents='jumped over the' embedding=array([ 1. , 10.8], dtype=float32) distance=0.00016793422934946456>]

You can also pass a list of filter dictionaries; an item is returned if it matches any of them:

In [None]:
await vec.search([1.0, 9.0], limit=2, filter=[{"action": "jump"}, {"animal": "fox"}])

[<Record id=UUID('e5dbaa7c-081b-4131-be18-c81ce47fc864') metadata={'action': 'jump', 'animal': 'fox'} contents='jumped over the' embedding=array([ 1. , 10.8], dtype=float32) distance=0.00016793422934946456>,
 <Record id=UUID('2cdb8cbd-5dd7-4555-926a-5efafb4b1cf0') metadata={'animal': 'fox'} contents='the brown fox' embedding=array([1. , 1.3], dtype=float32) distance=0.14489260377438218>]
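
The filter semantics can be illustrated without a database. The sketch below assumes that a single dictionary matches when every key/value pair in it appears in the record's metadata, and that a list matches when any of its dictionaries does. `matches` is a hypothetical helper for illustration only; the library evaluates filters inside Postgres.

```python
def matches(metadata, flt):
    """Return True if metadata satisfies a dict filter or a list of dict filters."""
    if isinstance(flt, dict):
        # Assumed semantics: every key/value pair in the filter must be present.
        return all(k in metadata and metadata[k] == v for k, v in flt.items())
    # A list of filters matches when at least one dictionary matches.
    return any(matches(metadata, f) for f in flt)

rows = [
    {"animal": "fox"},
    {"animal": "fox", "action": "jump"},
]

single = [m for m in rows if matches(m, {"action": "jump"})]
either = [m for m in rows if matches(m, [{"action": "jump"}, {"animal": "fox"}])]
print(len(single), len(either))
```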

You can access the fields of a record as follows:

In [None]:
records = await vec.search([1.0, 9.0], limit=1, filter={"action": "jump"})
records[0][client.SEARCH_RESULT_ID_IDX]

UUID('e5dbaa7c-081b-4131-be18-c81ce47fc864')

In [None]:
records[0][client.SEARCH_RESULT_METADATA_IDX]

{'action': 'jump', 'animal': 'fox'}

In [None]:
records[0][client.SEARCH_RESULT_CONTENTS_IDX]

'jumped over the'

In [None]:
records[0][client.SEARCH_RESULT_EMBEDDING_IDX]

array([ 1. , 10.8], dtype=float32)

In [None]:
records[0][client.SEARCH_RESULT_DISTANCE_IDX]

0.00016793422934946456

You can delete by ID:

In [None]:
await vec.delete_by_ids([records[0][client.SEARCH_RESULT_ID_IDX]])

[]

Or you can delete by metadata filters:

In [None]:
await vec.delete_by_metadata({"action": "jump"})

[]

To delete all records, use:

In [None]:
await vec.delete_all()

## Advanced Usage

### Indexing

Indexing speeds up queries over your data. 

By default, we set up indexes to query your data by UUID and metadata.

If you have many rows, you also need to set up an index on the embedding. You can create a timescale-vector index on the table with:

In [None]:
await vec.create_embedding_index(client.TimescaleVectorIndex())

You can drop the index with:

In [None]:
await vec.drop_embedding_index()

While we recommend the timescale-vector index type, two more index types are available:

* The pgvector ivfflat index
* The pgvector hnsw index

Usage examples below:

In [None]:
await vec.create_embedding_index(client.IvfflatIndex())
await vec.drop_embedding_index()
await vec.create_embedding_index(client.HNSWIndex())
await vec.drop_embedding_index()

Note that it is very important to create the ivfflat index only after you have data in the table, because ivfflat derives its clusters from the rows present when the index is built.

The community is actively working on new indexing methods for embeddings. As they become available, we will add them to our client as well.

### Time-partitioning

In many use cases with large numbers of embeddings, time is an important component of the data. For example, when embedding news stories, you often search by time as well as similarity (e.g. stories related to bitcoin in the past week, or stories about Clinton in November 2016).

Yet searching by two components, similarity and time, is traditionally challenging for approximate nearest neighbor (ANN) indexes and makes the similarity-search index less effective.

One approach to solving this is partitioning the data by time and creating ANN indexes on each partition individually. Then, during search you can:

 * Step 1: filter out partitions that don't match the time predicate
 * Step 2: perform the similarity search on all matching partitions
 * Step 3: combine all the results from each partition in step 2, rerank, and filter out results by time.

Step 1 makes the search a lot more efficient by filtering out whole swaths of data in one go.
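
The three steps can be sketched in plain Python. This illustrates the idea only and is not how TimescaleDB hypertables are implemented: partitions are a dictionary of fixed-width time buckets, similarity is a brute-force cosine distance rather than an ANN index, and every function name is hypothetical.

```python
import math
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))

def partition_key(ts, interval):
    """Index of the fixed-width time bucket that contains ts."""
    return (ts - EPOCH) // interval

def build_partitions(rows, interval):
    partitions = {}
    for row in rows:  # row = (timestamp, contents, embedding)
        partitions.setdefault(partition_key(row[0], interval), []).append(row)
    return partitions

def search(partitions, interval, query, start, end, limit):
    first, last = partition_key(start, interval), partition_key(end, interval)
    scored = []
    for key, rows in partitions.items():
        # Step 1: skip partitions that cannot overlap [start, end).
        if not first <= key <= last:
            continue
        # Step 2: similarity search inside each remaining partition.
        scored.extend((cosine_distance(query, r[2]), r) for r in rows)
    # Step 3: combine, apply the exact time predicate, rerank, and truncate.
    in_range = [(d, r) for d, r in scored if start <= r[0] < end]
    in_range.sort(key=lambda pair: pair[0])
    return [r for _, r in in_range[:limit]]

interval = timedelta(hours=6)
rows = [
    (datetime(2018, 8, 10, 15, 30), "the brown fox", [1.0, 1.2]),
    (datetime(2018, 8, 10, 16, 0), "jumped over the", [1.0, 10.8]),
    (datetime(2018, 8, 12, 9, 0), "lazy dog", [1.0, 9.0]),
]
partitions = build_partitions(rows, interval)
results = search(partitions, interval, [1.0, 9.0],
                 datetime(2018, 8, 10), datetime(2018, 8, 11), limit=2)
print([r[1] for r in results])
```

The "lazy dog" row is the closest match by similarity, but its partition is pruned in step 1 because it falls outside the requested time range.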

Timescale-vector supports time partitioning using TimescaleDB's hypertables. To use this feature, simply indicate the length in time for each partition when creating the client:

In [None]:
from datetime import timedelta
from datetime import datetime

In [None]:
vec = client.Async(service_url, "my_data_with_time_partition", 2, time_partition_interval=timedelta(hours=6))
await vec.create_tables()

Then insert data whose IDs are version 1 UUIDs; the time component of the UUID specifies the time of the embedding.
For example, to create an embedding for the current time:

In [None]:
id = uuid.uuid1()
await vec.upsert([(id, {"key": "val"}, "the brown fox", [1.0, 1.2])])

To insert data for a specific time in the past, create the UUID using our `uuid_from_time` function:

In [None]:
specific_datetime = datetime(2018, 8, 10, 15, 30, 0)
await vec.upsert([(client.uuid_from_time(specific_datetime), {"key": "val"}, "the brown fox", [1.0, 1.2])])
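
A version 1 UUID embeds a timestamp counted in 100-nanosecond intervals since 15 October 1582, which is what lets the ID double as the partitioning time. The sketch below reads that timestamp back using only the standard library. `uuid1_to_datetime` is a hypothetical helper for illustration and is not part of `timescale_vector`.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Epoch used by version 1 UUID timestamps.
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)

def uuid1_to_datetime(u):
    """Return the UTC timestamp embedded in a version 1 UUID."""
    if u.version != 1:
        raise ValueError("only version 1 UUIDs carry a timestamp")
    # u.time counts 100-nanosecond intervals since GREGORIAN_EPOCH.
    return GREGORIAN_EPOCH + timedelta(microseconds=u.time // 10)

new_id = uuid.uuid1()
embedded = uuid1_to_datetime(new_id)
print(embedded)
```

A random `uuid.uuid4()` has no time component, so the helper rejects it rather than returning a meaningless date.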

You can then query the data by specifying a `uuid_time_filter` in the search call:

In [None]:
rec = await vec.search([1.0, 2.0], limit=4, uuid_time_filter=client.UUIDTimeRange(specific_datetime-timedelta(days=7), specific_datetime+timedelta(days=7)))

## Development

Please note that this project is developed with [nbdev](https://nbdev.fast.ai/). Please see that website for the development process.