# Generate Protein Embeddings

As an example, here's how to process the frog (*Xenopus tropicalis*) reference proteome with ESM1b.

You can find proteome FASTA files on Ensembl: https://uswest.ensembl.org/info/data/ftp/index.html

We use a pretrained Transformer model from https://github.com/facebookresearch/esm. These models were trained on hundreds of millions of protein sequences from across the tree of life.

**NOTE:** These protein embedding scripts require an older version of the ESM repo. After cloning it, check out commit
[`839c5b82c6cd9e18baa7a88dcbed3bd4b6d48e47`](https://github.com/facebookresearch/esm/commit/839c5b82c6cd9e18baa7a88dcbed3bd4b6d48e47).

**Clone the ESM repo** before continuing; its location is used as `ESM_PATH` in Step 1.
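
Cloning and pinning the repo can be scripted. The following is a minimal sketch, not part of this repository: the helper names `esm_setup_commands` and `setup_esm` are hypothetical, and it assumes `git` is available on your `PATH`.

```python
import subprocess
from pathlib import Path

# Repository URL and commit hash are the ones given in the note above.
ESM_REPO_URL = "https://github.com/facebookresearch/esm"
ESM_COMMIT = "839c5b82c6cd9e18baa7a88dcbed3bd4b6d48e47"


def esm_setup_commands(clone_dir):
    """Return the git commands that clone ESM and pin it to the required commit."""
    clone_dir = str(Path(clone_dir))
    return [
        ["git", "clone", ESM_REPO_URL, clone_dir],
        ["git", "-C", clone_dir, "checkout", ESM_COMMIT],
    ]


def setup_esm(clone_dir):
    """Run the commands in order, raising on the first failure."""
    for cmd in esm_setup_commands(clone_dir):
        subprocess.run(cmd, check=True)
```

Calling `setup_esm("/path/to/esm")` (a placeholder path) would leave a checkout at the required commit; that directory is what `ESM_PATH` should point to.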

## Step 1: Download reference proteome

In [2]:
!mkdir -p data

In [34]:
import os
NAME = "Xenopus_tropicalis.UCB_Xtro_10.0.pep.all"  # Change this to the name of the reference proteome you want
DATA_PATH = os.path.join(os.path.abspath(os.getcwd()), "data")  # Path to the data directory (you can use the one in this directory)
ESM_PATH = "/lfs/local/0/yanay/esm/"  # Change this to the path you cloned the ESM repo to
TORCH_HOME = "/dfs/project/cross-species/yanay/torch_home"  # Change this to the directory where PyTorch should cache model weights
DEVICE = 6  # Index of the GPU to use (passed to CUDA_VISIBLE_DEVICES)

In [39]:
!wget 'https://ftp.ensembl.org/pub/release-108/fasta/xenopus_tropicalis/pep/Xenopus_tropicalis.UCB_Xtro_10.0.pep.all.fa.gz' \
        -O data/{NAME}.fa.gz

--2022-11-14 15:30:40--  https://ftp.ensembl.org/pub/release-108/fasta/xenopus_tropicalis/pep/Xenopus_tropicalis.UCB_Xtro_10.0.pep.all.fa.gz
Resolving ftp.ensembl.org (ftp.ensembl.org)... 193.62.193.139
Connecting to ftp.ensembl.org (ftp.ensembl.org)|193.62.193.139|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11316909 (11M) [application/x-gzip]
Saving to: ‘data/Xenopus_tropicalis.UCB_Xtro_10.0.pep.all.fa.gz’


2022-11-14 15:31:06 (429 KB/s) - ‘data/Xenopus_tropicalis.UCB_Xtro_10.0.pep.all.fa.gz’ saved [11316909/11316909]

FINISHED --2022-11-14 15:31:06--
Total wall clock time: 26s
Downloaded: 1 files, 11M in 26s (429 KB/s)


In [40]:
# -f overwrites an existing data/{NAME}.fa instead of prompting, so the cell can be re-run
!gunzip -f data/{NAME}.fa.gz
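
If you would rather stay in Python than shell out to `wget` and `gunzip`, the download and decompression can be done with the standard library. This is a sketch under assumptions: `download` and `gunzip_file` are hypothetical helpers, not part of this repository. Unlike `gunzip`, this version keeps the `.gz` file and overwrites the output without prompting.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path


def download(url, dest):
    """Download url to dest, skipping the download if dest already exists."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urllib.request.urlretrieve(url, str(dest))
    return dest


def gunzip_file(gz_path, out_path):
    """Decompress gz_path into out_path, overwriting any existing file."""
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return Path(out_path)


# Example usage with the URL and NAME from the cells above:
# gz = download(url, f"data/{NAME}.fa.gz")
# gunzip_file(gz, f"data/{NAME}.fa")
```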


## Step 2: Clean Fasta

In [45]:
!python clean_fasta.py \
--data_path=./data/{NAME}.fa \
--save_path=./data/{NAME}.clean.fa

Number of original sequences = 49,792
100%|█████████████████████████████████| 49792/49792 [00:00<00:00, 146933.93it/s]
Number of cleaned sequences = 49,792

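
`clean_fasta.py` reports the number of sequences before and after cleaning. To double-check those counts independently, a small FASTA reader is enough. The helpers below (`read_fasta`, `summarize_fasta`) are hypothetical and only illustrate the check; they do not reproduce whatever cleaning `clean_fasta.py` performs.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line and header is not None:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarize_fasta(path):
    """Return (number of records, length of the longest sequence)."""
    count, longest = 0, 0
    for _, seq in read_fasta(path):
        count += 1
        longest = max(longest, len(seq))
    return count, longest
```

Running `summarize_fasta` on both `./data/{NAME}.fa` and `./data/{NAME}.clean.fa` should give the same record counts that the script printed.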

## Step 3: Run ESM

In [None]:
# THE MODELS ARE VERY LARGE AND TAKE A WHILE TO RUN

!export TORCH_HOME={TORCH_HOME}; cd {ESM_PATH}/scripts/; \
CUDA_VISIBLE_DEVICES={DEVICE} python extract.py esm1b_t33_650M_UR50S \
{DATA_PATH}/{NAME}.clean.fa \
{DATA_PATH}/{NAME}.clean.fa_esm1b \
--include mean  --truncate

Transferred model to GPU
Read /dfs/project/cross-species/yanay/code/SPEAR/protein_embeddings/data/Xenopus_tropicalis.UCB_Xtro_10.0.pep.all.clean.fa with 49792 sequences
Processing 1 of 8762 batches (73 sequences)
Processing 2 of 8762 batches (61 sequences)
Processing 3 of 8762 batches (57 sequences)
Processing 4 of 8762 batches (53 sequences)
Processing 5 of 8762 batches (51 sequences)
Processing 6 of 8762 batches (50 sequences)
Processing 7 of 8762 batches (48 sequences)
Processing 8 of 8762 batches (47 sequences)
Processing 9 of 8762 batches (46 sequences)
Processing 10 of 8762 batches (45 sequences)
Processing 11 of 8762 batches (44 sequences)
Processing 12 of 8762 batches (43 sequences)
Processing 13 of 8762 batches (42 sequences)
Processing 14 of 8762 batches (41 sequences)
Processing 15 of 8762 batches (41 sequences)
Processing 16 of 8762 batches (40 sequences)
Processing 17 of 8762 batches (40 sequences)
Processing 18 of 8762 batches (39 sequences)
...
Processing 178 of 8762 batches (21 sequences)
...
Processing 190 of 8762 batches (20 sequences)
...
Processing 357 of 8762 batches (16 sequences)
...
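
The number of sequences per batch falls as the run progresses (73 in the first batch, 39 by batch 18). That pattern is what you would expect if sequences are sorted by length and packed into batches under a fixed token budget, so batches of longer sequences hold fewer of them. The sketch below illustrates this kind of batching. It is a simplified illustration, not the code in `extract.py`, and the budget of 4096 tokens and the 2 extra tokens per sequence are assumptions chosen for the example.

```python
def token_budget_batches(lengths, toks_per_batch=4096, extra_toks_per_seq=2):
    """Group sequence indices into batches whose padded size fits a token budget.

    Sequences are visited from shortest to longest. A batch is closed when
    adding the next sequence would make (longest length * batch size) exceed
    toks_per_batch. A sequence longer than the budget gets a batch of its own.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, current, max_len = [], [], 0
    for i in order:
        size = lengths[i] + extra_toks_per_seq
        if current and max(size, max_len) * (len(current) + 1) > toks_per_batch:
            batches.append(current)
            current, max_len = [], 0
        max_len = max(max_len, size)
        current.append(i)
    if current:
        batches.append(current)
    return batches


# 100 sequences of 50 residues: 52 tokens each, so 78 fit in a 4096-token batch.
sizes = [len(b) for b in token_budget_batches([50] * 100)]
print(sizes)  # [78, 22]
```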

## Step 4: Convert to Embeddings File

In [None]:
!python sequence_model/map_gene_symbol_to_protein_ids.py \
    --fasta_path ./data/{NAME}.fa \
    --save_path ./data/{NAME}.gene_symbol_to_protein_ID.json


!python sequence_model/convert_protein_embeddings_to_gene_embeddings.py \
    --embedding_dir ./data/{NAME}.clean.fa_esm1b \
    --gene_symbol_to_protein_ids_path ./data/{NAME}.gene_symbol_to_protein_ID.json \
    --embedding_model ESM1b \
    --save_path ./data/{NAME}.gene_symbol_to_embedding_ESM1b.pt
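
To make the first command concrete: Ensembl peptide FASTA headers start with the protein ID and carry space-separated `key:value` fields, including `gene_symbol:` when a symbol is assigned. Grouping protein IDs by that field could look like the sketch below. It is an assumption about the general approach, not the code in `map_gene_symbol_to_protein_ids.py`; the function name `map_gene_symbols` is hypothetical, and the real script may treat version suffixes, letter case, or missing symbols differently.

```python
import re
from collections import defaultdict

# Matches the value of the gene_symbol field, e.g. "gene_symbol:abc" -> "abc".
GENE_SYMBOL_RE = re.compile(r"gene_symbol:(\S+)")


def map_gene_symbols(headers):
    """Group protein IDs by the gene_symbol field of Ensembl-style FASTA headers."""
    mapping = defaultdict(list)
    for header in headers:
        header = header.lstrip(">")
        fields = header.split()
        if not fields:
            continue
        match = GENE_SYMBOL_RE.search(header)
        if match is None:
            continue  # headers without a gene symbol are skipped in this sketch
        mapping[match.group(1)].append(fields[0])
    return dict(mapping)


# Made-up headers in the Ensembl layout, for illustration only.
example = [
    ">P1.1 pep chr:1 gene:G1 transcript:T1 gene_symbol:abc description:x",
    ">P2.1 pep chr:1 gene:G1 transcript:T2 gene_symbol:abc",
    ">P3.1 pep chr:2 gene:G2 transcript:T3",
]
print(map_gene_symbols(example))  # {'abc': ['P1.1', 'P2.1']}
```

A gene with several protein isoforms maps to several protein IDs, which is why the second command needs this mapping to turn per-protein embeddings into one embedding per gene symbol.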


## Step 5: Running SPEAR

In [None]:
# Your final embeddings will be located at: 
os.path.abspath(f"./data/{NAME}.gene_symbol_to_embedding_ESM1b.pt")