<a href="https://colab.research.google.com/github/tymor22/tm-vec/blob/master/Search_use_TM_Vec_search_to_search_for_relaetd_sequences.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Notes:
1. To use TM-Vec and DeepBlast, you need to install TM-Vec, DeepBlast, and the Hugging Face transformers library.
2. You will also need to download the ProtT5-XL-UniRef50 encoder (the large language model that TM-Vec and DeepBlast use), the trained TM-Vec model, and the trained DeepBlast model. The ProtT5-XL-UniRef50 checkpoint is very large (~11.3 GB), so unless your GPU has more memory than the model requires, you may have to use a CPU runtime on Google Colab.
3. This notebook demonstrates how TM-Vec can be used to search large protein databases for proteins related to a set of query proteins.


<h3>Searching for related protein sequences using a trained TM-Vec model, and then aligning the related sequences using DeepBlast</h3>

**1. Install the relevant libraries including tm-vec, deepblast, and the huggingface transformers library**

In [None]:
!pip install tm-vec

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
!pip install deepblast

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
!pip install -q SentencePiece transformers

In [None]:
# On a CPU runtime, install faiss-cpu
!pip install faiss-cpu
# On a GPU runtime, use instead: !pip install faiss-gpu

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting faiss-cpu
  Downloading faiss_cpu-1.7.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.0 MB)
Installing collected packages: faiss-cpu
Successfully installed faiss-cpu-1.7.3
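
Before moving on, it can help to confirm that the packages installed above are importable in the current runtime. The sketch below uses only the standard library. `missing_packages` is a hypothetical helper (not part of TM-Vec), and the module names are the ones imported in the next step of this notebook.

```python
import importlib.util

def missing_packages(module_names):
    """Return the top-level modules that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

# Top-level modules imported later in this notebook.
required = ["torch", "transformers", "tm_vec", "faiss", "numpy", "pandas"]
missing = missing_packages(required)
print("Missing modules:", missing if missing else "none")
```

If the list is not empty, rerun the matching `pip install` cell (and restart the runtime if Colab asks for it).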


<b>2. Load the relevant libraries</b>

In [None]:
import gc
import re

import numpy as np
import pandas as pd

import torch
from torch.utils.data import Dataset
from transformers import T5EncoderModel, T5Tokenizer

from tm_vec.embed_structure_model import trans_basic_block, trans_basic_block_Config
from tm_vec.tm_vec_utils import featurize_prottrans, embed_tm_vec

import faiss

<b>3. Load the ProtT5-XL-UniRef50 tokenizer and model</b>

In [None]:
tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50", do_lower_case=False)

In [None]:
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_uniref50")

Downloading:   0%|          | 0.00/11.3G [00:00<?, ?B/s]

Some weights of the model checkpoint at Rostlab/prot_t5_xl_uniref50 were not used when initializing T5EncoderModel: ['decoder.block.6.layer.0.SelfAttention.v.weight', 'decoder.block.1.layer.2.DenseReluDense.wo.weight', 'decoder.block.18.layer.2.DenseReluDense.wo.weight', 'decoder.block.4.layer.1.layer_norm.weight', 'decoder.block.11.layer.0.layer_norm.weight', 'decoder.block.19.layer.0.SelfAttention.o.weight', 'decoder.block.13.layer.1.EncDecAttention.v.weight', 'decoder.block.14.layer.2.layer_norm.weight', 'decoder.block.10.layer.0.SelfAttention.v.weight', 'decoder.block.19.layer.0.SelfAttention.q.weight', 'decoder.block.7.layer.0.SelfAttention.k.weight', 'decoder.block.19.layer.0.SelfAttention.k.weight', 'decoder.block.3.layer.0.SelfAttention.o.weight', 'decoder.block.14.layer.1.EncDecAttention.k.weight', 'decoder.block.0.layer.1.EncDecAttention.q.weight', 'decoder.block.22.layer.0.SelfAttention.q.weight', 'decoder.block.9.layer.1.EncDecAttention.q.weight', 'decoder.block.2.layer.1.E

In [None]:
gc.collect()

75

<b>4. Move the model to your GPU if one is available, and switch the model to inference mode</b>

In [None]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

In [None]:
print(device)

cpu


In [None]:
model = model.to(device)
model = model.eval()
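
The notes at the top warn that the GPU must have more memory than the model needs. A rough pre-check can be sketched as below. `fits_on_gpu` is a hypothetical helper, the 1.2 headroom factor is an arbitrary assumption, and the 11.3 GB checkpoint size is used as a conservative upper bound, since the decoder weights listed in the warning above are not loaded into `T5EncoderModel`. In the notebook, the total GPU memory could be read from `torch.cuda.get_device_properties(0).total_memory`.

```python
def fits_on_gpu(model_bytes, gpu_total_bytes, headroom=1.2):
    """Rough check: does the GPU hold the weights plus some working memory?

    headroom is an assumed safety factor, not a measured requirement.
    """
    return gpu_total_bytes >= model_bytes * headroom

GB = 1e9
prot_t5_bytes = 11.3 * GB  # checkpoint download size reported in this notebook

# Example GPU sizes (illustrative values only).
for gpu_gb in (16, 12, 40):
    device_name = "cuda:0" if fits_on_gpu(prot_t5_bytes, gpu_gb * GB) else "cpu"
    print(f"{gpu_gb} GB GPU -> {device_name}")
```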

<b>5. Download a trained TM-Vec model, its configuration file, and a trained DeepBlast model</b>

In [None]:
!wget https://users.flatironinstitute.org/thamamsy/public_www/tm_vec_cath_model.ckpt

--2022-12-21 17:10:34--  https://users.flatironinstitute.org/thamamsy/public_www/tm_vec_cath_model.ckpt
Resolving users.flatironinstitute.org (users.flatironinstitute.org)... 144.121.86.9
Connecting to users.flatironinstitute.org (users.flatironinstitute.org)|144.121.86.9|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://users.flatironinstitute.org/~thamamsy/tm_vec_cath_model.ckpt [following]
--2022-12-21 17:10:34--  https://users.flatironinstitute.org/~thamamsy/tm_vec_cath_model.ckpt
Reusing existing connection to users.flatironinstitute.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 207922348 (198M) [application/octet-stream]
Saving to: ‘tm_vec_cath_model.ckpt’


2022-12-21 17:10:36 (101 MB/s) - ‘tm_vec_cath_model.ckpt’ saved [207922348/207922348]



In [None]:
!wget https://users.flatironinstitute.org/thamamsy/public_www/tm_vec_cath_model_params.json

--2022-12-21 17:10:36--  https://users.flatironinstitute.org/thamamsy/public_www/tm_vec_cath_model_params.json
Resolving users.flatironinstitute.org (users.flatironinstitute.org)... 144.121.86.9
Connecting to users.flatironinstitute.org (users.flatironinstitute.org)|144.121.86.9|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://users.flatironinstitute.org/~thamamsy/tm_vec_cath_model_params.json [following]
--2022-12-21 17:10:36--  https://users.flatironinstitute.org/~thamamsy/tm_vec_cath_model_params.json
Reusing existing connection to users.flatironinstitute.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 181 [application/json]
Saving to: ‘tm_vec_cath_model_params.json’


2022-12-21 17:10:36 (51.1 MB/s) - ‘tm_vec_cath_model_params.json’ saved [181/181]



In [None]:
!wget https://users.flatironinstitute.org/jmorton/public_www/deepblast-public-data/checkpoints/deepblast-lstm4x.pt

--2022-12-21 17:10:37--  https://users.flatironinstitute.org/jmorton/public_www/deepblast-public-data/checkpoints/deepblast-lstm4x.pt
Resolving users.flatironinstitute.org (users.flatironinstitute.org)... 144.121.86.9
Connecting to users.flatironinstitute.org (users.flatironinstitute.org)|144.121.86.9|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://users.flatironinstitute.org/~jmorton/deepblast-public-data/checkpoints/deepblast-lstm4x.pt [following]
--2022-12-21 17:10:37--  https://users.flatironinstitute.org/~jmorton/deepblast-public-data/checkpoints/deepblast-lstm4x.pt
Reusing existing connection to users.flatironinstitute.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 618153154 (590M) [application/octet-stream]
Saving to: ‘deepblast-lstm4x.pt’


2022-12-21 17:11:02 (24.0 MB/s) - ‘deepblast-lstm4x.pt’ saved [618153154/618153154]
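
Interrupted downloads are a common cause of checkpoint loading errors, so it may be worth comparing the files on disk with the byte counts in the `wget` logs above. This is a minimal sketch. `check_download` is a hypothetical helper, and the expected sizes are copied from this notebook's logs, so a mismatch could also mean that the hosted files were updated later.

```python
import os

# Byte counts reported by the wget logs above.
expected_sizes = {
    "tm_vec_cath_model.ckpt": 207922348,
    "tm_vec_cath_model_params.json": 181,
    "deepblast-lstm4x.pt": 618153154,
}

def check_download(path, expected_bytes):
    """Return 'ok', 'missing', or 'size mismatch' for a downloaded file."""
    if not os.path.exists(path):
        return "missing"
    if os.path.getsize(path) != expected_bytes:
        return "size mismatch"
    return "ok"

for name, size in expected_sizes.items():
    print(f"{name}: {check_download(name, size)}")
```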



<b>6. Load the trained TM-Vec model</b>

In [None]:
#TM-Vec model paths
tm_vec_model_cpnt = "tm_vec_cath_model.ckpt"
tm_vec_model_config = "tm_vec_cath_model_params.json"

#Load the TM-Vec model
tm_vec_model_config = trans_basic_block_Config.from_json(tm_vec_model_config)
model_deep = trans_basic_block.load_from_checkpoint(tm_vec_model_cpnt, config=tm_vec_model_config)
model_deep = model_deep.to(device)
model_deep = model_deep.eval()

<b>7. Load one of our TM-Vec embedding databases and the associated metadata, or use one of your own (i.e. embed your own collection of protein sequences)</b>

In [None]:
!wget https://users.flatironinstitute.org/thamamsy/public_www/embeddings_cath_s100_final.npy
!wget https://users.flatironinstitute.org/thamamsy/public_www/embeddings_cath_s100_w_metadata.tsv

--2022-12-21 17:43:28--  https://users.flatironinstitute.org/thamamsy/public_www/embeddings_cath_s100_final.npy
Resolving users.flatironinstitute.org (users.flatironinstitute.org)... 144.121.86.9
Connecting to users.flatironinstitute.org (users.flatironinstitute.org)|144.121.86.9|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://users.flatironinstitute.org/~thamamsy/embeddings_cath_s100_final.npy [following]
--2022-12-21 17:43:28--  https://users.flatironinstitute.org/~thamamsy/embeddings_cath_s100_final.npy
Reusing existing connection to users.flatironinstitute.org:443.
HTTP request sent, awaiting response... 200 OK
Length: 251345024 (240M) [application/octet-stream]
Saving to: ‘embeddings_cath_s100_final.npy’


2022-12-21 17:43:52 (9.95 MB/s) - ‘embeddings_cath_s100_final.npy’ saved [251345024/251345024]

--2022-12-21 17:43:53--  https://users.flatironinstitute.org/thamamsy/public_www/embeddings_cath_s100_w_metadata.tsv
Resolving users

<b>8. Load or paste some sequences that you would like to query the database with</b>

In [None]:
sequences = ["AETCZAO","SKTZP"]
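
Query sequences are often stored in a FASTA file rather than pasted by hand. The following is a minimal standard-library sketch of a FASTA reader. `read_fasta` and the record names are hypothetical, and the two example records reuse the sequences above.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

example = [">query_1", "AETCZAO", ">query_2", "SKT", "ZP"]
records = read_fasta(example)
sequences = [seq for _, seq in records]
print(sequences)
```

Because the function accepts any iterable of lines, an open file handle can be passed to it directly.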

<b>9. Embed your query sequences using the same TM-Vec model used to make the embeddings database</b>



In [None]:
# Embed the query sequences one at a time
embed_all_sequences = []
for sequence in sequences:
    # ProtT5 residue-level features for a single sequence
    protrans_sequence = featurize_prottrans([sequence], model, tokenizer, device)
    # TM-Vec embedding of that sequence
    embedded_sequence = embed_tm_vec(protrans_sequence, model_deep, device)
    embed_all_sequences.append(embedded_sequence)

# Convert the query embeddings into a single numpy array
queries = np.concatenate(embed_all_sequences, axis=0)
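
The ProtT5 model card describes preprocessing each sequence by replacing the rare residues U, Z, O and B with X and separating residues with spaces. Whether `featurize_prottrans` applies exactly this step internally is an assumption here. The sketch below only previews what such preprocessing would do to the example queries, which contain Z and O. `preview_prott5_input` is a hypothetical helper.

```python
import re

def preview_prott5_input(sequence):
    """Map rare residues (U, Z, O, B) to X and space-separate the residues."""
    cleaned = re.sub(r"[UZOB]", "X", sequence.upper())
    return " ".join(cleaned)

for seq in ["AETCZAO", "SKTZP"]:
    print(seq, "->", preview_prott5_input(seq))
```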

<b>10. Load and index the lookup database</b>

In [None]:
# Load the database that we will query.
# Make sure the database was embedded with the same TM-Vec model that is applied to the queries (here, the CATH model and the CATH database).
lookup_database = np.load("embeddings_cath_s100_final.npy")
metadata_for_lookup_database = pd.read_csv("embeddings_cath_s100_w_metadata.tsv", sep="\t")
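
The faiss Python API works on C-contiguous `float32` arrays, and `faiss.normalize_L2` modifies its argument in place. If you bring your own embeddings, a conversion along these lines may be needed before indexing. `as_faiss_array` is a hypothetical helper and the demo array is made up.

```python
import numpy as np

def as_faiss_array(embeddings):
    """Return a C-contiguous float32 2-D array, copying only if needed."""
    array = np.ascontiguousarray(embeddings, dtype=np.float32)
    if array.ndim != 2:
        raise ValueError(f"expected a 2-D array, got shape {array.shape}")
    return array

# float64 and non-contiguous on purpose (reversed column view).
demo = np.arange(6, dtype=np.float64).reshape(3, 2)[:, ::-1]
converted = as_faiss_array(demo)
print(converted.dtype, converted.flags["C_CONTIGUOUS"], converted.shape)
```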

In [None]:
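# Normalize the queries in place so that inner product equals cosine similarity
faiss.normalize_L2(queries)

# Build a flat (exact) inner-product index over the normalized database
d = lookup_database.shape[1]
index = faiss.IndexFlatIP(d)
faiss.normalize_L2(lookup_database)
index.add(lookup_database)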
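
`IndexFlatIP` ranks database vectors by inner product. Once both the queries and the database are L2-normalized, that inner product equals the cosine similarity, which this notebook reads as a predicted TM-score. The NumPy sketch below checks this equivalence on random stand-in embeddings. `l2_normalize` is a hypothetical helper that returns a copy, unlike `faiss.normalize_L2`.

```python
import numpy as np

def l2_normalize(x):
    """Row-wise L2 normalization (returns a copy)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)

rng = np.random.default_rng(0)
db = rng.normal(size=(5, 8)).astype(np.float32)  # stand-in database embeddings
q = rng.normal(size=(2, 8)).astype(np.float32)   # stand-in query embeddings

# Inner product of normalized vectors ...
inner = l2_normalize(q) @ l2_normalize(db).T
# ... versus cosine similarity computed from the raw vectors.
cosine = (q @ db.T) / (
    np.linalg.norm(q, axis=1)[:, None] * np.linalg.norm(db, axis=1)[None, :]
)
print(np.allclose(inner, cosine, atol=1e-5))
```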

<b>11. Return the k nearest neighbors to query sequences</b>

In [None]:
k = 10  # Number of nearest neighbors to return per query (user-defined)
D, I = index.search(queries, k)
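
`index.search` returns two arrays of shape `(n_queries, k)`: `D` holds the scores and `I` the row numbers of the matching database entries, ordered from best to worst for an inner-product index. As an illustration of those semantics, the sketch below does the same exact search with NumPy on toy data. `brute_force_search` is a hypothetical helper, not a faiss function.

```python
import numpy as np

def brute_force_search(queries, database, k):
    """Exact top-k inner-product search; returns (scores, indices) like D, I."""
    scores = queries @ database.T
    order = np.argsort(-scores, axis=1)[:, :k]
    return np.take_along_axis(scores, order, axis=1), order

database = np.eye(4, dtype=np.float32)  # four orthogonal unit vectors
toy_queries = np.array([[0.9, 0.1, 0.0, 0.0],
                        [0.0, 0.2, 0.3, 0.8]], dtype=np.float32)
D_demo, I_demo = brute_force_search(toy_queries, database, k=2)
print(I_demo)
print(D_demo)
```

For databases of this size, the faiss index does the same job much faster. The sketch is only meant to show what `D` and `I` contain.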

In [None]:
print("TM scores for the nearest neighbors")
D

TM scores for the nearest neighbors


array([[0.95880437, 0.95266336, 0.9511788 , 0.95073986, 0.95073986,
        0.94380045, 0.9383105 , 0.936745  , 0.9239427 , 0.9090662 ],
       [0.9884029 , 0.9884029 , 0.9806186 , 0.9792333 , 0.9774759 ,
        0.97331107, 0.97142786, 0.9710368 , 0.9658628 , 0.9590106 ]],
      dtype=float32)
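
A TM-score above 0.5 is commonly interpreted as indicating that two structures share the same fold, so a simple way to read the scores is to count the neighbors above a chosen cutoff. The sketch below uses the scores printed above. `count_hits` is a hypothetical helper, and the 0.95 cutoff is an arbitrary stricter example.

```python
import numpy as np

# Scores copied from the output above (rows: queries, columns: neighbors).
D_example = np.array([
    [0.95880437, 0.95266336, 0.9511788, 0.95073986, 0.95073986,
     0.94380045, 0.9383105, 0.936745, 0.9239427, 0.9090662],
    [0.9884029, 0.9884029, 0.9806186, 0.9792333, 0.9774759,
     0.97331107, 0.97142786, 0.9710368, 0.9658628, 0.9590106],
], dtype=np.float32)

def count_hits(scores, threshold):
    """Number of neighbors per query whose score exceeds the threshold."""
    return (scores > threshold).sum(axis=1)

print(count_hits(D_example, 0.5))
print(count_hits(D_example, 0.95))
```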

In [None]:
# Get the metadata for the top neighbor of each query
near_meta = []
for i in range(I.shape[0]):
    meta = metadata_for_lookup_database.iloc[I[i, 0]]
    near_meta.append(meta)
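
The loop above keeps only the best neighbor per query. To keep all k neighbors, the score and index arrays can be flattened into one record per (query, neighbor) pair, as in this minimal sketch. `flatten_hits` is a hypothetical helper and the toy values are made up. In the notebook, each `db_row` could then be looked up with `metadata_for_lookup_database.iloc[...]`.

```python
def flatten_hits(scores, indices):
    """Turn per-query score/index rows into one record per (query, neighbor)."""
    records = []
    for query_id, (score_row, index_row) in enumerate(zip(scores, indices)):
        for rank, (score, db_row) in enumerate(zip(score_row, index_row), start=1):
            records.append({"query": query_id, "rank": rank,
                            "db_row": int(db_row), "score": float(score)})
    return records

# Toy values; the database row numbers here are illustrative, not real results.
demo_scores = [[0.96, 0.95], [0.99, 0.98]]
demo_indices = [[7, 3], [1, 4]]
for record in flatten_hits(demo_scores, demo_indices):
    print(record)
```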

In [None]:
# Metadata for the first query's nearest neighbor
near_meta[0]

Cath_ID                                  1q16A02
CATH_full               cath|4_3_0|1q16A02/28-40
Cath_Domain                              1q16A02
Class                                          6
Architecture                                  10
Topology                                     250
Homology                                     300
S35_cluster                                    1
S60_cluster                                    1
S95_cluster                                    1
S100_cluster                                   1
S100_count                                     1
Domain_length                                 13
Structure_resolution                         1.9
Name: 14679, dtype: object
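
CATH classifications are usually written as a dotted C.A.T.H code (Class, Architecture, Topology, Homology). The sketch below builds that code from the metadata fields shown above, using a plain dict as a stand-in for the pandas row. `cath_code` is a hypothetical helper.

```python
def cath_code(meta):
    """Join Class, Architecture, Topology and Homology into a C.A.T.H code."""
    keys = ("Class", "Architecture", "Topology", "Homology")
    return ".".join(str(meta[key]) for key in keys)

# Values copied from the metadata printed above.
top_hit = {"Cath_Domain": "1q16A02", "Class": 6, "Architecture": 10,
           "Topology": 250, "Homology": 300}
print(top_hit["Cath_Domain"], cath_code(top_hit))
```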