# AMP - Antimicrobial Peptides

After cleaning the original CSV files, we converted them to FASTA format.
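The conversion itself is not shown in this notebook. A minimal sketch of such a step, using only the standard library, might look like the following. The function name `csv_to_fasta` and the column names `id` and `sequence` are assumptions for illustration, not the names used in our actual CSV files.

```python
import csv
import io


def csv_to_fasta(csv_text, id_col="id", seq_col="sequence"):
    """Convert CSV text to FASTA text, one record per non-empty sequence.

    `id_col` and `seq_col` are hypothetical column names; adjust them
    to match the real CSV header.
    """
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        seq = row[seq_col].strip().upper()
        if not seq:
            continue  # skip rows without a sequence
        records.append(f">{row[id_col].strip()}\n{seq}")
    return "\n".join(records) + "\n"


# Tiny made-up example: two peptides, the second one in lower case
example_csv = "id,sequence\npep1,GIGKFLHSAKKFGKAFVGEIMNS\npep2,kwklfkki\n"
fasta_text = csv_to_fasta(example_csv)
print(fasta_text)
```

Since `extract.py` names each output file after the FASTA header, it is safest to keep the identifiers short and free of characters that are awkward in file names.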

These FASTA files are the input to the ESM [`extract.py` script](https://github.com/facebookresearch/esm/blob/main/scripts/extract.py), which efficiently extracts embeddings in bulk.

The script `scripts/extract.py` stores embeddings in PyTorch `.pt` files (generated by `torch.save`), one file per FASTA sequence.
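To read one of these files back, a sketch along the following lines should work. It assumes the layout described in the ESM repository for `--include mean`: a dictionary with a `label` entry and a `mean_representations` entry keyed by layer number. The helper name `load_mean_embedding` and the stand-in file `demo_peptide.pt` are made up for this example, so the code runs without the real data.

```python
import tempfile
from pathlib import Path

import torch


def load_mean_embedding(pt_file, layer):
    """Return (label, vector) from one file written with --include mean."""
    record = torch.load(pt_file, map_location="cpu")
    return record["label"], record["mean_representations"][layer]


# Stand-in file with the assumed layout; 1280 is the embedding size
# of the 650M-parameter ESM-1b and ESM-1v models.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "demo_peptide.pt"
    demo_record = {
        "label": "demo_peptide",
        "mean_representations": {33: torch.zeros(1280)},
    }
    torch.save(demo_record, demo_file)
    label, vector = load_mean_embedding(demo_file, layer=33)

print(label, tuple(vector.shape))
```

With the real output folders, `pt_file` would be one of the files inside the embedding directory and `layer` would match the value passed to `--repr_layers`.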


In [2]:
# Import dependencies
import os

Import file utilities

In [3]:
# Make the local scripts folder importable
import sys
sys.path.append('../scripts')

# Import file utilities as fu
import file_utilities as fu

Create a path for the script `extract.py`.

In [4]:
# Path for extract.py
esm_scripts_path = '/home/damir/.cache/torch/hub/facebookresearch_esm_main/scripts'
extract = os.path.join(esm_scripts_path, 'extract.py')
extract

'/home/damir/.cache/torch/hub/facebookresearch_esm_main/scripts/extract.py'

Initialize arguments

In [5]:
# Define arguments for the file_paths function and the embedding script
# The first 4 are constant in this notebook
ptmodel = 'esm'
task = 'amp'
file_base = 'all_data'
pool = 'mean'
# The last 2 may change through the notebook
model = 'esm1v_t33_650M_UR90S_1'
emb_layer = 33

<br>

## all_data Dataset

### ESM ESM-1v model - esm1v_t33_650M_UR90S_1

#### Pooling Operation:  `mean`

Call the `file_paths` function to prepare the paths. The default root data folder is *../data*.

In [6]:
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/amp/all_data.fa 
 ../data/amp/esm/all_data/amp_all_esm1v_mean
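The source of `fu.file_paths` is not shown here. Judging only from the printed paths, the naming scheme could be reproduced roughly as below. `sketch_paths` is a hypothetical stand-in, not the real function: it returns two values rather than three, and the shortening rules (the first `_`-separated token of `file_base` and of `model`) are guesses that happen to match the output above.

```python
def sketch_paths(ptmodel, task, file_base, model, pool, root="../data"):
    """Hypothetical stand-in for fu.file_paths: (embedding dir, FASTA path)."""
    base_short = file_base.split("_")[0]   # 'all_data' -> 'all'
    model_short = model.split("_")[0]      # 'esm1v_t33_650M_UR90S_1' -> 'esm1v'
    emb_dir = f"{task}_{base_short}_{model_short}_{pool}"
    path_pt = f"{root}/{task}/{ptmodel}/{file_base}/{emb_dir}"
    path_fa = f"{root}/{task}/{file_base}.fa"
    return path_pt, path_fa


demo_pt, demo_fa = sketch_paths(
    "esm", "amp", "all_data", "esm1v_t33_650M_UR90S_1", "mean"
)
print(demo_fa)
print(demo_pt)
```

Keeping the model and pooling operation in the folder name is what lets several embedding runs for the same dataset sit side by side.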


<br>

Run the embedding script for: `esm - amp - all_data - esm1v - mean`.  
The script reads the FASTA file and creates `.pt` files with embeddings, one for each FASTA sequence.

In [7]:
%%time
# Run embedding script
%run "{extract}" "{model}" "{path_fa}" "{path_pt}" --repr_layers "{emb_layer}" --include "{pool}" 



Transferred model to GPU
Read ../data/amp/all_data.fa with 4042 sequences
Processing 1 of 36 batches (292 sequences)
Processing 2 of 36 batches (250 sequences)
Processing 3 of 36 batches (227 sequences)
Processing 4 of 36 batches (204 sequences)
Processing 5 of 36 batches (195 sequences)
Processing 6 of 36 batches (178 sequences)
Processing 7 of 36 batches (170 sequences)
Processing 8 of 36 batches (159 sequences)
Processing 9 of 36 batches (151 sequences)
Processing 10 of 36 batches (146 sequences)
Processing 11 of 36 batches (136 sequences)
Processing 12 of 36 batches (132 sequences)
Processing 13 of 36 batches (124 sequences)
Processing 14 of 36 batches (120 sequences)
Processing 15 of 36 batches (117 sequences)
Processing 16 of 36 batches (110 sequences)
Processing 17 of 36 batches (107 sequences)
Processing 18 of 36 batches (104 sequences)
Processing 19 of 36 batches (99 sequences)
Processing 20 of 36 batches (95 sequences)
Processing 21 of 36 batches (91 sequences)
Processing 22 
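Outside a notebook, the `%run` call corresponds to an ordinary command line. The sketch below only assembles that command as an argument list, using the positional arguments and the `--repr_layers` and `--include` flags already used in this notebook. The helper name `build_extract_command` and the relative path `scripts/extract.py` are assumptions for illustration.

```python
import shlex
import sys


def build_extract_command(extract, model, path_fa, path_pt, emb_layer, pool):
    """Assemble the extract.py call as an argument list for subprocess."""
    return [
        sys.executable, extract, model, path_fa, path_pt,
        "--repr_layers", str(emb_layer),
        "--include", pool,
    ]


cmd = build_extract_command(
    extract="scripts/extract.py",  # assumed location of the ESM script
    model="esm1v_t33_650M_UR90S_1",
    path_fa="../data/amp/all_data.fa",
    path_pt="../data/amp/esm/all_data/amp_all_esm1v_mean",
    emb_layer=33,
    pool="mean",
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell quoting problems if a path ever contains spaces.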

<br>

### ESM ESM-1b model - esm1b_t33_650M_UR50S

#### Pooling Operation:  `mean`

Update arguments and prepare paths

In [8]:
# Update arguments
model = 'esm1b_t33_650M_UR50S'
emb_layer = 33
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/amp/all_data.fa 
 ../data/amp/esm/all_data/amp_all_esm1b_mean


<br>

Run the embedding script for: `esm - amp - all_data - esm1b - mean`.  
The script reads the FASTA file and creates `.pt` files with embeddings, one for each FASTA sequence.

In [9]:
%%time
# Run embedding script
%run "{extract}" "{model}" "{path_fa}" "{path_pt}" --repr_layers "{emb_layer}" --include "{pool}" 

Transferred model to GPU
Read ../data/amp/all_data.fa with 4042 sequences
Processing 1 of 36 batches (292 sequences)
Processing 2 of 36 batches (250 sequences)
Processing 3 of 36 batches (227 sequences)
Processing 4 of 36 batches (204 sequences)
Processing 5 of 36 batches (195 sequences)
Processing 6 of 36 batches (178 sequences)
Processing 7 of 36 batches (170 sequences)
Processing 8 of 36 batches (159 sequences)
Processing 9 of 36 batches (151 sequences)
Processing 10 of 36 batches (146 sequences)
Processing 11 of 36 batches (136 sequences)
Processing 12 of 36 batches (132 sequences)
Processing 13 of 36 batches (124 sequences)
Processing 14 of 36 batches (120 sequences)
Processing 15 of 36 batches (117 sequences)
Processing 16 of 36 batches (110 sequences)
Processing 17 of 36 batches (107 sequences)
Processing 18 of 36 batches (104 sequences)
Processing 19 of 36 batches (99 sequences)
Processing 20 of 36 batches (95 sequences)
Processing 21 of 36 batches (91 sequences)
Processing 22 

<br>

**Check the folders**

In [12]:
base = os.path.split(path_pt)[0]
!tree -nDhL 1 "{base}" --dirsfirst

../data/amp/esm/all_data
├── [4.0K Sep  6 17:55]  amp_all_esm1b_mean
└── [4.0K Sep  6 17:46]  amp_all_esm1v_mean

2 directories, 0 files


Print the total size and number of `.pt` files in each embedding folder.

In [11]:
# Print the total size and number of pt files in each embedding folder
fu.emb_files_stats(path_pt)

amp_all_esm1b_mean consumes: 22.86MB in 4042 files
amp_all_esm1v_mean consumes: 22.86MB in 4042 files
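The implementation of `fu.emb_files_stats` is not shown. A comparable check can be sketched with `pathlib` alone, as below. The names `folder_stats` and `print_stats` are hypothetical, and so are the choices to walk the sibling folders of a parent directory and to count 1 MB as 10^6 bytes.

```python
import tempfile
from pathlib import Path


def folder_stats(folder):
    """Return (total size in bytes, number of .pt files) for one folder."""
    files = [p for p in Path(folder).glob("*.pt") if p.is_file()]
    return sum(p.stat().st_size for p in files), len(files)


def print_stats(parent):
    """Print one summary line per embedding folder found under `parent`."""
    for sub in sorted(Path(parent).iterdir()):
        if sub.is_dir():
            size, count = folder_stats(sub)
            print(f"{sub.name} consumes: {size / 1e6:.2f}MB in {count} files")


# Stand-in folder with three dummy 1000-byte files, so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo_embeddings"
    demo.mkdir()
    for i in range(3):
        (demo / f"seq{i}.pt").write_bytes(b"\0" * 1000)
    demo_size, demo_count = folder_stats(demo)
    print_stats(tmp)
```

A file count equal to the number of sequences reported by the embedding script (4042 here) is a quick sign that no sequence was skipped.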
