# ACP - Anticancer Peptides

After cleaning the original CSV files, we converted them to FASTA format.
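The conversion step itself is not shown in this notebook. Below is a minimal sketch of one way to do it, assuming a CSV file with `id` and `sequence` columns. The function name `csv_to_fasta` and both column names are hypothetical, so the real files may differ.

```python
import csv


def csv_to_fasta(csv_path, fasta_path, id_col="id", seq_col="sequence"):
    """Write one FASTA record per CSV row and return the record count.

    The column names are assumptions made for illustration only.
    """
    n_records = 0
    with open(csv_path, newline="") as src, open(fasta_path, "w") as dst:
        for row in csv.DictReader(src):
            sequence = row[seq_col].strip().upper()
            if not sequence:
                continue  # skip rows without a sequence
            dst.write(f">{row[id_col].strip()}\n{sequence}\n")
            n_records += 1
    return n_records
```

Each record is written as a header line followed by a single sequence line, which suits short peptides.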

These FASTA files are the input to the ESM [`extract.py` script](https://github.com/facebookresearch/esm/blob/main/scripts/extract.py), which extracts embeddings in bulk.

The script `scripts/extract.py` stores the embeddings in PyTorch `.pt` files (written with `torch.save`), one file per FASTA sequence.


In [1]:
# Import dependencies
import os

Import file utilities

In [2]:
# Make the local scripts folder importable
import sys
sys.path.append('../scripts')

# Local helper module with file utilities
import file_utilities as fu

Build the path to the script `extract.py`.

In [3]:
# Path for extract.py
esm_scripts_path = '/home/damir/.cache/torch/hub/facebookresearch_esm_main/scripts'
extract = os.path.join(esm_scripts_path, 'extract.py')
extract

'/home/damir/.cache/torch/hub/facebookresearch_esm_main/scripts/extract.py'

Initialize arguments

In [4]:
# Arguments for fu.file_paths and extract.py
# The first three stay constant in this notebook
ptmodel = 'esm'
task = 'acp'
pool = 'mean'
# The last three may change throughout the notebook
file_base = 'train'
model = 'esm1v_t33_650M_UR90S_1'
emb_layer = 33

<br>

## Train Dataset

### ESM-1v model - esm1v_t33_650M_UR90S_1

#### Pooling Operation:  `mean`

Call the function `fu.file_paths` to prepare the paths. The default root data folder is *../data*.

In [6]:
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/acp/train_data.fa 
 ../data/acp/esm/train/acp_train_esm1v_mean
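The implementation of `fu.file_paths` is not shown here. The following is a hypothetical reconstruction that only reproduces the two paths printed above. The name `sketch_paths`, the way the short model name is derived, and the omission of the middle return value are all assumptions.

```python
import os


def sketch_paths(ptmodel, task, file_base, model, pool, root="../data"):
    """Hypothetical stand-in for fu.file_paths (not the author's code).

    Returns (embedding_folder, fasta_file), following the naming pattern
    visible in the printed output.
    """
    short_model = model.split("_")[0]  # assumed: 'esm1v_t33_...' -> 'esm1v'
    path_fa = os.path.join(root, task, f"{file_base}_data.fa")
    path_pt = os.path.join(
        root, task, ptmodel, file_base,
        f"{task}_{file_base}_{short_model}_{pool}",
    )
    return path_pt, path_fa
```

With the arguments defined earlier, `sketch_paths(ptmodel, task, file_base, model, pool)` yields the same FASTA path and embedding folder as the output above.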


<br>

Run the embedding script for: `esm - acp - train - esm1v - mean`.  
The script reads the FASTA file and creates `.pt` files with embeddings, one for each FASTA sequence.

In [7]:
%%time
# Run embedding script
%run "{extract}" "{model}" "{path_fa}" "{path_pt}" --repr_layers "{emb_layer}" --include "{pool}" 



Transferred model to GPU
Read ../data/acp/train_data.fa with 1378 sequences
Processing 1 of 10 batches (292 sequences)
Processing 2 of 10 batches (215 sequences)
Processing 3 of 10 batches (178 sequences)
Processing 4 of 10 batches (157 sequences)
Processing 5 of 10 batches (132 sequences)
Processing 6 of 10 batches (117 sequences)
Processing 7 of 10 batches (105 sequences)
Processing 8 of 10 batches (91 sequences)
Processing 9 of 10 batches (80 sequences)
Processing 10 of 10 batches (11 sequences)
CPU times: user 32.4 s, sys: 14.4 s, total: 46.8 s
Wall time: 51.3 s
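Outside a notebook, the same call can be made without the `%run` magic. This is a minimal sketch, assuming `extract.py` is launched as a separate process with the current Python interpreter. The helper names `build_extract_cmd` and `run_extract` are hypothetical, and only the arguments already used above are passed.

```python
import subprocess
import sys


def build_extract_cmd(extract, model, path_fa, path_pt, emb_layer, pool):
    """Assemble the argument list that mirrors the %run line above."""
    return [
        sys.executable, str(extract),
        str(model), str(path_fa), str(path_pt),
        "--repr_layers", str(emb_layer),
        "--include", str(pool),
    ]


def run_extract(*args):
    """Run extract.py and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_extract_cmd(*args), check=True)
```

A call such as `run_extract(extract, model, path_fa, path_pt, emb_layer, pool)` would then take the place of the `%run` cell. Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.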


<br>

### ESM-1b model - esm1b_t33_650M_UR50S

#### Pooling Operation: `mean`

Update arguments and prepare paths

In [9]:
# Update arguments
model = 'esm1b_t33_650M_UR50S'
emb_layer = 33
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/acp/train_data.fa 
 ../data/acp/esm/train/acp_train_esm1b_mean


<br>

Run the embedding script for: `esm - acp - train - esm1b - mean`.  
The script reads the FASTA file and creates `.pt` files with embeddings, one for each FASTA sequence.

In [10]:
%%time
# Run embedding script
%run "{extract}" "{model}" "{path_fa}" "{path_pt}" --repr_layers "{emb_layer}" --include "{pool}" 

Downloading: "https://dl.fbaipublicfiles.com/fair-esm/models/esm1b_t33_650M_UR50S.pt" to /home/damir/.cache/torch/hub/checkpoints/esm1b_t33_650M_UR50S.pt
Downloading: "https://dl.fbaipublicfiles.com/fair-esm/regression/esm1b_t33_650M_UR50S-contact-regression.pt" to /home/damir/.cache/torch/hub/checkpoints/esm1b_t33_650M_UR50S-contact-regression.pt


Transferred model to GPU
Read ../data/acp/train_data.fa with 1378 sequences
Processing 1 of 10 batches (292 sequences)
Processing 2 of 10 batches (215 sequences)
Processing 3 of 10 batches (178 sequences)
Processing 4 of 10 batches (157 sequences)
Processing 5 of 10 batches (132 sequences)
Processing 6 of 10 batches (117 sequences)
Processing 7 of 10 batches (105 sequences)
Processing 8 of 10 batches (91 sequences)
Processing 9 of 10 batches (80 sequences)
Processing 10 of 10 batches (11 sequences)
CPU times: user 1min 1s, sys: 59.4 s, total: 2min 1s
Wall time: 5min 10s


<br>

**Check the folders**

In [7]:
base = os.path.split(path_pt)[0]
!tree -nDhL 1 "{base}" --dirsfirst

../data/acp/esm/train
├── [4.0K Sep  6 16:40]  acp_train_esm1b_mean
└── [4.0K Sep  6 16:01]  acp_train_esm1v_mean

2 directories, 0 files


Print the total size and number of pt files in each embedding folder

In [12]:
# Print the total size and number of pt files in each embedding folder
fu.emb_files_stats(path_pt)

acp_train_esm1b_mean consumes: 7.79MB in 1378 files
acp_train_esm1v_mean consumes: 7.79MB in 1378 files
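`fu.emb_files_stats` is a local helper whose code is not shown. Below is a rough sketch of how such a summary could be computed with the standard library. It assumes the helper scans the sibling folders of `path_pt` and that MB means 1024 × 1024 bytes. Both are assumptions, and `folder_stats` and `print_stats` are hypothetical names.

```python
from pathlib import Path


def folder_stats(path_pt):
    """Return {folder_name: (size_in_mb, n_files)} for each sibling
    embedding folder of path_pt, counting only *.pt files."""
    stats = {}
    for folder in sorted(Path(path_pt).parent.iterdir()):
        if not folder.is_dir():
            continue
        files = list(folder.glob("*.pt"))
        total_bytes = sum(f.stat().st_size for f in files)
        stats[folder.name] = (total_bytes / 1024**2, len(files))
    return stats


def print_stats(path_pt):
    """Print one summary line per embedding folder."""
    for name, (size_mb, n_files) in folder_stats(path_pt).items():
        print(f"{name} consumes: {size_mb:.2f}MB in {n_files} files")
```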


<br>

## Test Dataset

### ESM-1v model - esm1v_t33_650M_UR90S_1

#### Pooling Operation:  `mean`

Update arguments and prepare paths

In [5]:
# Update arguments
model = 'esm1v_t33_650M_UR90S_1'
file_base = 'test'
emb_layer = 33
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/acp/test_data.fa 
 ../data/acp/esm/test/acp_test_esm1v_mean


<br>

Run the embedding script for: `esm - acp - test - esm1v - mean`.  
The script reads the FASTA file and creates `.pt` files with embeddings, one for each FASTA sequence.

In [7]:
%%time
# Run embedding script
%run "{extract}" "{model}" "{path_fa}" "{path_pt}" --repr_layers "{emb_layer}" --include "{pool}" 

Transferred model to GPU
Read ../data/acp/test_data.fa with 344 sequences
Processing 1 of 3 batches (175 sequences)
Processing 2 of 3 batches (113 sequences)
Processing 3 of 3 batches (56 sequences)
CPU times: user 15 s, sys: 7.91 s, total: 22.9 s
Wall time: 23 s


<br>

### ESM-1b model - esm1b_t33_650M_UR50S

#### Pooling Operation: `mean`

Update arguments and prepare paths

In [8]:
# Update arguments
model = 'esm1b_t33_650M_UR50S'
emb_layer = 33
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/acp/test_data.fa 
 ../data/acp/esm/test/acp_test_esm1b_mean


<br>

Run the embedding script for: `esm - acp - test - esm1b - mean`.  
The script reads the FASTA file and creates `.pt` files with embeddings, one for each FASTA sequence.

In [9]:
%%time
# Run embedding script
%run "{extract}" "{model}" "{path_fa}" "{path_pt}" --repr_layers "{emb_layer}" --include "{pool}" 

Transferred model to GPU
Read ../data/acp/test_data.fa with 344 sequences
Processing 1 of 3 batches (175 sequences)
Processing 2 of 3 batches (113 sequences)
Processing 3 of 3 batches (56 sequences)
CPU times: user 15.2 s, sys: 4.98 s, total: 20.2 s
Wall time: 22.5 s


<br>

**Check the folders**

In [10]:
base = os.path.split(path_pt)[0]
!tree -nDhL 1 "{base}" --dirsfirst

../data/acp/esm/test
├── [4.0K Sep 10 20:50]  acp_test_esm1b_mean
└── [4.0K Sep 10 20:47]  acp_test_esm1v_mean

2 directories, 0 files


Print the total size and number of pt files in each embedding folder

In [11]:
# Print the total size and number of pt files in each embedding folder
fu.emb_files_stats(path_pt)

acp_test_esm1b_mean consumes: 1.95MB in 344 files
acp_test_esm1v_mean consumes: 1.95MB in 344 files
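Matching file counts suggest, but do not prove, that every sequence was embedded. Below is a small sketch of a stricter check. It assumes `extract.py` names each output file after the full FASTA header line (`<header>.pt`), which should be verified against the installed ESM version. `fasta_labels` and `missing_embeddings` are hypothetical helpers.

```python
from pathlib import Path


def fasta_labels(path_fa):
    """Return the header of every FASTA record, without the leading '>'."""
    with open(path_fa) as handle:
        return [line[1:].strip() for line in handle if line.startswith(">")]


def missing_embeddings(path_fa, path_pt):
    """List FASTA labels that have no '<label>.pt' file in path_pt."""
    existing = {f.stem for f in Path(path_pt).glob("*.pt")}
    return [label for label in fasta_labels(path_fa) if label not in existing]
```

If the assumption holds, `missing_embeddings(path_fa, path_pt)` should return an empty list for each dataset and model.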
