# Creating and searching against vector databases with TM-Vec

To build a protein database that is stored compactly as vector embeddings, we will:
1. Generate a DB of protein vectors
2. Convert our output to a FAISS DB (for search)
3. Search against our DB and plot the results

In [1]:
# import the required libraries
import skbio
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import faiss

## Building a vector database

We can feed our FASTA file directly to the tmvec build-db __CLI__ command, which writes our
vectors to a .npz file at the specified output path.

This command takes as input:
1. --input-fasta: A FASTA file containing your sequences.
2. --output: The path to write the database to (here, test_db/bagel_fasta produces test_db/bagel_fasta.npz).

After this, we will load the embeddings from the resulting .npz file into a NumPy array, which can then be converted to a FAISS index.

In [1]:
!tmvec build-db --input-fasta bagel.fa --output test_db/bagel_fasta
# add flag functionality

[2024-06-14 05:34:58] cli.build_db INFO: Removed 0 sequences longer than 1024 residues.
[2024-06-14 05:35:43] cli.build_db INFO: Please, do not move or rename input FASTA file, TM-Vec model and config files and ProtT5 model. They will be used to for sequence search. It is a design feature to ensure the consistency of the models used. If the location has chaged you can manually modify the database.


In [5]:
embedded_data = np.load('test_db/bagel_fasta.npz')
# pull the embedding matrix out of the .npz archive
vectors = embedded_data['embeddings']


## Plot the ordination results

With our database built, we can use the tmvec search __CLI__ command to search proteins against it and return the k-nearest-neighbor results.

Finally, we can use the embed_vec_to_ordination function to create ordination objects from our search results and plot them.

In [2]:
!tmvec search --input-fasta bagel.fa --database test_db/bagel_fasta.npz --output test_db/bagel_search_results

[2024-06-14 05:36:34] cli.search INFO: Removed 0 sequences longer than 1024 residues.
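
To give a feel for what an ordination of the embeddings involves, here is a rough sketch that projects an embedding matrix onto its first two principal components. It does not use embed_vec_to_ordination, whose signature is not shown here; plain PCA via NumPy's SVD is swapped in instead, and `pca_coordinates` and the random input are hypothetical stand-ins.

```python
import numpy as np


def pca_coordinates(vectors, n_components=2):
    """Project row vectors onto their top principal components.

    Hypothetical helper for illustration; not part of TM-Vec or scikit-bio.
    """
    X = np.asarray(vectors, dtype=np.float64)
    centered = X - X.mean(axis=0)
    # SVD of the centered data: rows of vt are the principal axes.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    coords = u[:, :n_components] * s[:n_components]
    # Proportion of total variance captured by each component.
    explained = (s ** 2) / np.sum(s ** 2)
    return coords, explained[:n_components]


# Stand-in for the embedding matrix; the shape is illustrative only.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(20, 512))
coords, explained = pca_coordinates(vectors)
```

The two columns of `coords` can then be passed as x and y to a scatter plot, for example with `sns.scatterplot`, with `explained` used to label the axes.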
