# Scalable nearest genome search with Bacformer

This tutorial shows how to use Bacformer embeddings and the associated corpus of over 1.3M genomes to search for similar genomes.

We provide a dataset of precomputed Bacformer genome embeddings, with associated metadata, for genomes collected from a variety of sources.

You can embed a query genome with Bacformer and run a fast, scalable search for the $k$ closest genomes (by L2 distance) in the dataset of over 1.3M genomes. The nearest genomes can then be used to annotate the query genome and to investigate similar ones.
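
To make the search step concrete, here is a minimal sketch of what an exact $k$-nearest search by L2 distance computes. It is a brute-force illustration in plain Python, not the FAISS implementation used later in this tutorial; the helper name `nearest_neighbours` and the toy 3-dimensional vectors are hypothetical and stand in for real Bacformer embeddings.

```python
import heapq
import math


def nearest_neighbours(query, corpus, k):
    """Return the k (distance, index) pairs closest to `query` by L2 distance."""
    # score every corpus vector against the query (brute force, O(n) per query)
    scored = ((math.dist(query, vec), idx) for idx, vec in enumerate(corpus))
    # keep only the k smallest distances, sorted in ascending order
    return heapq.nsmallest(k, scored)


# toy "embeddings" (hypothetical values, for illustration only)
toy_corpus = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [3.0, 3.0, 3.0],
]
toy_query = [0.9, 0.1, 0.0]

neighbours = nearest_neighbours(toy_query, toy_corpus, k=2)
for dist, idx in neighbours:
    print(f"corpus index {idx}: L2 distance {dist:.4f}")
```

A scan like this touches every vector for every query, which is why the tutorial builds a FAISS index over the 1.3M embeddings instead.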

Before you start, make sure you have Bacformer installed, along with the [datasets>=2.21](https://pypi.org/project/datasets/) and [faiss-cpu](https://pypi.org/project/faiss-cpu/) packages. The latter two can be installed with:
```bash
pip install -U "datasets[faiss]>=2.21"
```

```python
import numpy as np
import pandas as pd

from datasets import load_dataset
```

## Step 1: Download the dataset with Bacformer genome embeddings

Download the precomputed dataset with Bacformer genome embeddings and the associated metadata.

```python
ds = load_dataset("macwiatrak/bacformer-genome-embeddings-corpus", split="test")
```

## Step 2: Compute the FAISS index for scalable, fast vector search

Use the [faiss](https://github.com/facebookresearch/faiss) package to compute an index which allows for efficient similarity search.

```python
ds.add_faiss_index(column="genome_embedding")
```

```
100%|██████████| 1364/1364 [00:07<00:00, 187.80it/s]

Dataset({
    features: ['genome_name', 'source', 'n_seqs', 'n_contigs', 'genome_length', 'n50', 'completeness', 'contamination', 'derived_from_sample', 'env', 'phylum', 'class', 'order', 'family', 'genus', 'species', 'env_clean', 'taxid', 'genome_embedding', '__index_level_0__'],
    num_rows: 1363276
})
```

## Step 3: Get the $k$ most similar genomes

Fetch a query vector from the existing dataset, then use it together with the FAISS index computed in step 2 to search for similar genomes in the dataset and view their metadata.

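```python
# use an existing genome from the dataset as the query; FAISS works on float32 vectors
query_vector = np.array(ds[0]["genome_embedding"], dtype=np.float32)
# set k (i.e. the number of most similar genomes to retrieve)
k = 10
# retrieve k + 1 hits, because the query comes from the dataset and is returned as its own nearest neighbour
dists, examples = ds.get_nearest_examples(
    "genome_embedding",
    query_vector,
    k=k + 1,
)
# convert the examples to a DataFrame, dropping the first row (the query genome itself)
examples = pd.DataFrame(examples)[1:].drop(columns=["genome_embedding"])
# add the distances returned by FAISS (flat L2 indexes typically report squared L2 distances)
examples["l2_distance"] = dists[1:]
examples.head()
```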

| | genome_name | source | n_seqs | n_contigs | genome_length | n50 | completeness | contamination | derived_from_sample | env | phylum | class | order | family | genus | species | env_clean | taxid | `__index_level_0__` | l2_distance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | spire_mag_00829374 | SPIRE | 1965.0 | 25.0 | 2353457.0 | 124487.0 | 99.83 | 5.29 | SAMN07491241 | host-associated:animal host:mammalian host:hum... | Actinobacteriota | Actinomycetia | Actinomycetales | Bifidobacteriaceae | Bifidobacterium | Bifidobacterium longum | Human host | | 736918 | 0.054596 |
| 2 | spire_mag_01587903 | SPIRE | 1803.0 | 27.0 | 2135822.0 | 105354.0 | 91.63 | 0.03 | SAMN08993539 | host-associated:animal host:mammalian host:hum... | Actinobacteriota | Actinomycetia | Actinomycetales | Bifidobacteriaceae | Bifidobacterium | Bifidobacterium longum | Human host | | 751325 | 0.054921 |
| 3 | spire_mag_02980720 | SPIRE | 1832.0 | 27.0 | 2211833.0 | 133051.0 | 99.95 | 0.02 | SAMN17719258 | host-associated:animal host:mammalian host:hum... | Actinobacteriota | Actinomycetia | Actinomycetales | Bifidobacteriaceae | Bifidobacterium | Bifidobacterium longum | Human host | | 385705 | 0.061194 |
| 4 | spire_mag_02997990 | SPIRE | 1822.0 | 27.0 | 2203378.0 | 104325.0 | 99.92 | 0.03 | SAMEA14101315 | host-associated:animal host:mammalian host:hum... | Actinobacteriota | Actinomycetia | Actinomycetales | Bifidobacteriaceae | Bifidobacterium | Bifidobacterium longum | Human host | | 397971 | 0.064117 |
| 5 | spire_mag_01286748 | SPIRE | 1856.0 | 31.0 | 2243106.0 | 111647.0 | 99.95 | 0.02 | SAMN09259925 | host-associated:animal host:mammalian host:hum... | Actinobacteriota | Actinomycetia | Actinomycetales | Bifidobacteriaceae | Bifidobacterium | Bifidobacterium longum | Human host | | 637493 | 0.065715 |
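
The retrieved neighbours can then be used to suggest an annotation for the query genome. One simple option, sketched below, is a majority vote over a metadata field such as `species`, restricted to neighbours that pass basic quality thresholds. This is an illustration under assumptions rather than a method prescribed by Bacformer: the helper `majority_label` is hypothetical, and the default thresholds (completeness of at least 90, contamination of at most 5) are arbitrary choices. The example rows reuse the `completeness`, `contamination` and `species` values from the output above.

```python
from collections import Counter


def majority_label(neighbours, field, min_completeness=90.0, max_contamination=5.0):
    """Return the most common `field` value among quality-filtered neighbours and its vote share."""
    kept = [
        row
        for row in neighbours
        if row["completeness"] >= min_completeness
        and row["contamination"] <= max_contamination
        and isinstance(row.get(field), str)  # skip missing labels (None / NaN)
    ]
    if not kept:
        return None, 0.0
    label, votes = Counter(row[field] for row in kept).most_common(1)[0]
    return label, votes / len(kept)


# values copied from the five neighbours shown above
rows = [
    {"completeness": 99.83, "contamination": 5.29, "species": "Bifidobacterium longum"},
    {"completeness": 91.63, "contamination": 0.03, "species": "Bifidobacterium longum"},
    {"completeness": 99.95, "contamination": 0.02, "species": "Bifidobacterium longum"},
    {"completeness": 99.92, "contamination": 0.03, "species": "Bifidobacterium longum"},
    {"completeness": 99.95, "contamination": 0.02, "species": "Bifidobacterium longum"},
]

label, share = majority_label(rows, "species")
print(label, share)
```

With the DataFrame from step 3, the same helper could be called as `majority_label(examples.to_dict("records"), "species")`. Note that the first neighbour is excluded here because its contamination (5.29) exceeds the illustrative threshold.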


----------------------

#### Voilà, you made it 👏! 

If you run into any issues or have questions, please open an issue on GitHub: https://github.com/macwiatrak/Bacformer/issues.