# Create a vector database from embeddings

_Note_: The actual implementation for finding similar cyclists uses the collaborative filtering learner object that we stored on AWS S3. A vector database (e.g. FAISS or Pinecone) is a more advanced alternative: once training has produced the optimized embeddings, you can load them into a vector database for easier management and similarity search. The code for creating and updating the vector database would go in `scripts/train.py`. This brief notebook is a primer for such a solution.

## Imports

In [1]:
import faiss
import numpy as np
import pandas as pd
from fastai.collab import load_learner

## Create vector db

In [2]:
learn = load_learner("../data/learner.pkl")

In [3]:
# Cyclist embeddings; copy so the in-place normalization below does not modify the learner's weights
vectors = learn.model.u_weight.weight.detach().numpy().copy()

In [4]:
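# Normalize in place so that L2 distance ranks neighbours like cosine similarity
faiss.normalize_L2(vectors)
# Exact (brute-force) index over the embedding dimension
index = faiss.IndexFlatL2(vectors.shape[1])
index.add(vectors)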
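
Because the vectors are L2-normalized before they are added, the squared L2 distance that `IndexFlatL2` reports is a monotone function of cosine similarity: for unit vectors, ‖a − b‖² = 2 − 2·cos(a, b). The dependency-free sketch below checks that identity on two made-up vectors. The helper names are illustrative and not part of FAISS.

```python
import math

def normalize(v):
    """Return v scaled to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Two arbitrary example vectors (not real cyclist embeddings)
a = normalize([1.0, 2.0, 3.0])
b = normalize([-2.0, 0.5, 1.0])

d2 = squared_l2(a, b)
cos = cosine(a, b)

# For unit vectors the two quantities are tied together
identity_holds = math.isclose(d2, 2 - 2 * cos)
```

Under this relation a distance of 0 means identical direction, 2 means orthogonal embeddings, and 4 means opposite direction, so distances can be converted to cosine similarity with `1 - d / 2`.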

In [5]:
# Optional: persist the index to disk and load it again later
# faiss.write_index(index, "../api/faiss_cyclists.index")
# index = faiss.read_index("../api/faiss_cyclists.index")

In [6]:
cyclist = "VAN AERT Wout"

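# Row of the query cyclist in the embedding matrix
idx = [i for i, r in enumerate(learn.model.classes["rider"]) if r == cyclist][0]
search_vector = vectors[idx, :]

# FAISS expects a 2D array of shape (n_queries, dim)
_vector = np.array([search_vector])
faiss.normalize_L2(_vector)  # already unit length; kept as a safeguard

# k=index.ntotal ranks every cyclist; `ann` holds the row ids of the neighbours
distances, ann = index.search(_vector, k=index.ntotal)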

results = pd.DataFrame({"distances": distances[0],
                        "ann": ann[0],
                        "cyclist": learn.dls.classes["rider"][ann[0]]})
results

| | distances | ann | cyclist |
|---|---|---|---|
| 0 | 0.000000 | 1767 | VAN AERT Wout |
| 1 | 0.492887 | 1303 | PEDERSEN Mads |
| 2 | 0.625946 | 1344 | PHILIPSEN Jasper |
| 3 | 0.795972 | 978 | LAPORTE Christophe |
| 4 | 0.836278 | 646 | GIRMAY Biniam |
| ... | ... | ... | ... |
| 1992 | 3.273132 | 1944 | YATES Adam |
| 1993 | 3.374993 | 1371 | POGAČAR Tadej |
| 1994 | 3.446501 | 1945 | YATES Simon |
| 1995 | 3.497690 | 758 | HINDLEY Jai |
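
Searching with `k=index.ntotal` ranks every cyclist, including the query rider at distance 0. For a recommendation-style lookup you usually want only the closest few other riders. The sketch below is one way to trim the output, assuming the rows come back sorted by ascending distance, as in the table above. `top_k_others` is a hypothetical helper, not a FAISS function, and the toy data stands in for the real search output.

```python
def top_k_others(distances_row, ids_row, names, query_id, k=5):
    """Return up to k (name, distance) pairs, skipping the query itself.

    Assumes distances_row and ids_row are aligned and sorted by
    ascending distance.
    """
    results = []
    for dist, idx in zip(distances_row, ids_row):
        idx = int(idx)
        if idx == query_id:
            continue  # the query is always its own nearest neighbour
        results.append((names[idx], float(dist)))
        if len(results) == k:
            break
    return results

# Toy data standing in for one row of search output
names = ["rider A", "rider B", "rider C", "rider D"]
distances_row = [0.0, 0.4, 0.9, 3.1]
ids_row = [2, 0, 3, 1]

closest = top_k_others(distances_row, ids_row, names, query_id=2, k=2)
```

With the notebook's variables, the call would look like `top_k_others(distances[0], ann[0], learn.dls.classes["rider"], idx)`. Passing a smaller `k` to `index.search` directly avoids ranking all riders in the first place.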
