## Create a vector database from embeddings

_Note_: In the actual implementation for finding similar cyclists, we stuck with the collaborative learner model object that we stored on AWS S3. Creating a vector database (e.g. with FAISS or Pinecone) is a more advanced alternative: once training has produced the optimized embeddings, you could load them into a vector database for easier management and similarity search. The code for creating and updating the vector database would go in `scripts/train.py`. This brief notebook is a primer for such a solution.

## Imports

In [1]:
import faiss
import numpy as np
import pandas as pd
from fastai.collab import load_learner

## Create vector db

In [2]:
learn = load_learner("../api/learner.pkl")

In [3]:
# Cyclist embeddings; copy them so the in-place normalization below does not alter the model weights
vectors = learn.model.u_weight.weight.detach().cpu().numpy().copy()

In [4]:
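# Normalize rows to unit length in place, so L2 distance ranks neighbors like cosine similarity
faiss.normalize_L2(vectors)
# Exact (brute-force) index; it reports squared L2 distances
index = faiss.IndexFlatL2(vectors.shape[1])
index.add(vectors)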
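
The reason for normalizing before using an L2 index can be checked with a small NumPy-only sketch. It does not use FAISS or the real embeddings: `vectors` here is a hypothetical random matrix standing in for the cyclist embeddings, and the search is done exhaustively by hand, which is roughly what a flat index computes. For unit-length vectors, squared L2 distance and cosine similarity are tied by `||a - b||^2 = 2 - 2 * cos(a, b)`, so both produce the same neighbor ranking.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the embedding matrix (rows = cyclists).
vectors = rng.normal(size=(100, 16))

# Row-wise L2 normalization (the notebook does this in place with faiss.normalize_L2).
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

query = unit[7]

# Squared Euclidean distance from the query to every row (exhaustive search).
sq_dist = ((unit - query) ** 2).sum(axis=1)

# For unit vectors, cosine similarity is simply the dot product.
cosine = unit @ query

# Identity for unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b).
identity_holds = np.allclose(sq_dist, 2.0 - 2.0 * cosine)

# Ascending distance and descending cosine give the same top neighbors.
by_distance = np.argsort(sq_dist)
by_cosine = np.argsort(-cosine)
```

The nearest neighbor of the query is the query itself (`by_distance[0]` is `7`), which is also why the first row of a full search result has distance zero.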

In [5]:
# Optional: persist the index to disk and load it back later
# faiss.write_index(index, "../api/faiss_cyclists.index")
# index = faiss.read_index("../api/faiss_cyclists.index")

In [6]:
search_vector = vectors[2628, :]  # Wout van Aert

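# FAISS expects a 2D array of queries; np.array([...]) also makes a copy
_vector = np.array([search_vector])
faiss.normalize_L2(_vector)  # harmless here: the rows were already normalized above

# Rank every cyclist by distance to the query (k=index.ntotal returns the full ordering)
distances, ann = index.search(_vector, k=index.ntotal)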

results = pd.DataFrame({"distances": distances[0],
                        "ann": ann[0],
                        "cyclist": learn.dls.classes["rider"][ann[0]]})
results

| | distances | ann | cyclist |
|---|---|---|---|
| 0 | 0.000000 | 2628 | VAN AERT Wout |
| 1 | 0.002423 | 2006 | PHILIPSEN Jasper |
| 2 | 0.003056 | 1382 | KRISTOFF Alexander |
| 3 | 0.003733 | 1213 | JAKOBSEN Fabio |
| 4 | 0.003798 | 572 | DAINESE Alberto |
| ... | ... | ... | ... |
| 2926 | 0.134902 | 255 | BILBAO Pello |
| 2927 | 0.141526 | 2204 | ROGLIČ Primož |
| 2928 | 0.143936 | 415 | CARAPAZ Richard |
| 2929 | 0.170000 | 787 | EVENEPOEL Remco |
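
The `distances` column can be easier to read as cosine similarity. The sketch below assumes the values are squared L2 distances between unit-length vectors, which is what `IndexFlatL2` should return after normalization; under that assumption, `cos = 1 - d / 2`. The helper name `squared_l2_to_cosine` is hypothetical and not part of FAISS, and the sample values are copied from the results above.

```python
import numpy as np


def squared_l2_to_cosine(sq_distances):
    """Convert squared L2 distances between unit vectors to cosine similarity."""
    return 1.0 - np.asarray(sq_distances, dtype="float64") / 2.0


# A few distances copied from the results above.
sample = {
    "VAN AERT Wout": 0.000000,
    "PHILIPSEN Jasper": 0.002423,
    "EVENEPOEL Remco": 0.170000,
}

similarities = dict(zip(sample, squared_l2_to_cosine(list(sample.values()))))
```

Under this conversion, the query itself has similarity 1, and the most distant cyclist in the results still has a similarity of about 0.915.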
