This repo contains a collection of datasets, inspired by ann-benchmarks, for similar-vector search with additional filtering conditions.
More and more applications are now using vector similarity search in their products. The task of approximate nearest neighbor (ANN) search has gone beyond the scope of academic research and the narrow circle of huge IT corporations.
In this regard, the issue of supplementing vector search with application business logic is becoming more and more relevant.
It is no longer enough to simply search for similar dishes by photo; you need to search for them only in restaurants that are in the delivery area.
It is not enough to search for all items similar by description; you also need to consider price ranges, stock availability, etc.
It's not enough to find candidates for a job position based on similar skills; you also have to consider location, level of spoken language, and seniority.
You name it.
Classical approaches to ANN, and their implementations in many libraries, were usually customized for benchmarks, where the search speed among all vectors is the only comparison criterion.
Because of this, they had to sacrifice many functions that are useful in other situations: the ability to quickly delete, insert, and modify stored values, as well as storing metadata and filtering on it.
Description | Num vectors | Dim | Distance | Filters | Link |
---|---|---|---|---|---|
all-MiniLM-L6-v2 ArXiv titles | 2 138 591 | 384 | Cosine | match keyword / range | link |
Efficientnet encoded H&M Clothes | 105 100 | 2048 | Cosine | match keyword | link |
LAION Sample encoded with CLIP | 100 000 | 512 | Cosine | range | link |
Random vectors \ random payload | 1 000 000 | 100 | Cosine | match keyword | link |
Random vectors \ random payload | 1 000 000 | 100 | Cosine | match int | link |
Random vectors \ random payload | 1 000 000 | 100 | Cosine | range | link |
Random vectors \ random payload | 1 000 000 | 100 | Cosine | geo-radius | link |
Random vectors \ random payload | 100 000 | 2048 | Cosine | match keyword | link |
Random vectors \ random payload | 100 000 | 2048 | Cosine | match int | link |
Random vectors \ random payload | 100 000 | 2048 | Cosine | range | link |
Random vectors \ random payload | 100 000 | 2048 | Cosine | geo-radius | link |
Each dataset consists of the following files:
- `vectors.npy` - Numpy matrix of vectors. Shape `num_vectors x dim`
- `payloads.jsonl` - payload values, associated with vectors. Number of lines equal to `num_vectors`
- `tests.jsonl` - collection of queries with filtering conditions and expected results. Contains fields:
  - `query` - vector to be used for similarity search
  - `conditions` - filtering conditions of 3 possible types: `match`, `range`, and `geo`
  - `closest_ids` - IDs of records, expected to be found with given query
  - `closest_scores` - similarity scores of associated IDs
{
  "query": [-0.034, -0.185, -0.21, ...],
  "conditions": {
    "and": [
      {
        "department_name": {
          "match": {
            "value": "Divided Shoes"
          }
        }
      }
    ]
  },
  "closest_ids": [565, 15631, 100747, ...],
  "closest_scores": [0.734, 0.698, 0.697, 0.689, ...]
}
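
To make the file layout concrete, here is a minimal Python sketch of how the three files could be loaded and how one test entry could be checked with an exact (brute-force) filtered search. It is an illustration under stated assumptions, not the method used to produce the expected results. The helper names (`load_jsonl`, `matches`, `filtered_top_k`) are hypothetical. The sketch assumes that IDs are zero-based row indices into `vectors.npy` and `payloads.jsonl`, that payload fields are scalar values, and that no vector has zero norm. Only the `match` condition shown in the example above is handled; the layouts of `range` and `geo` conditions are not shown in this section, so the sketch does not guess them.

```python
import json

import numpy as np


def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts (one per non-empty line)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def matches(payload, conditions):
    """Check a payload against an {"and": [...]} block of `match` conditions."""
    for clause in conditions.get("and", []):
        for field, cond in clause.items():
            if "match" not in cond:
                # `range` and `geo` layouts are not shown here, so they are not guessed.
                raise NotImplementedError(f"unsupported condition: {sorted(cond)}")
            # Assumption: the payload field holds a single scalar value.
            if payload.get(field) != cond["match"]["value"]:
                return False
    return True


def filtered_top_k(vectors, payloads, query, conditions, k=10):
    """Exact cosine search restricted to records whose payload passes the filter."""
    # Assumption: record ID == zero-based row index.
    allowed = np.array(
        [i for i, p in enumerate(payloads) if matches(p, conditions)], dtype=int
    )
    if allowed.size == 0:
        return [], []
    candidates = vectors[allowed]
    q = np.asarray(query, dtype=np.float64)
    scores = candidates @ q / (np.linalg.norm(candidates, axis=1) * np.linalg.norm(q))
    order = np.argsort(-scores)[:k]  # highest similarity first
    return allowed[order].tolist(), scores[order].tolist()


# Tiny in-memory stand-in for a dataset. With real files one would use:
#   vectors = np.load("vectors.npy")
#   payloads = load_jsonl("payloads.jsonl")
#   tests = load_jsonl("tests.jsonl")
vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
payloads = [
    {"department_name": "Divided Shoes"},
    {"department_name": "Other"},
    {"department_name": "Divided Shoes"},
]
conditions = {"and": [{"department_name": {"match": {"value": "Divided Shoes"}}}]}

ids, scores = filtered_top_k(vectors, payloads, [1.0, 0.0], conditions, k=2)
print(ids, scores)
```

For a real test entry, the returned `ids` could be compared with `closest_ids`, for example as the fraction of expected IDs that were found. Brute force is used here only because it is exact and easy to verify; an ANN index with filtering support would replace `filtered_top_k` in an actual benchmark.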