# ACP - Anticancer Peptides

After cleaning the original CSV files, we converted them to FASTA format.
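The conversion itself is not part of this notebook. As a rough sketch of what such a step could look like using only the standard library, the snippet below turns CSV text into FASTA text. The column names `id` and `sequence`, the function name `csv_to_fasta`, and the two example rows are all made up for illustration and are not taken from the project's data.

```python
import csv
import io


def csv_to_fasta(csv_text, id_col="id", seq_col="sequence"):
    """Convert CSV text to FASTA text (column names are assumptions)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        seq = row[seq_col].strip().upper()
        if not seq:
            continue  # skip rows that carry no sequence
        # One header line starting with '>' followed by one sequence line
        records.append(f">{row[id_col].strip()}\n{seq}")
    return "\n".join(records) + "\n"


# Made-up example rows, only to show the input and output shapes
example_csv = "id,sequence\nacp_1,FLPLLAGLAANFLPKIFCKITRK\nacp_2,glfdiikkiaesf\n"
fasta_text = csv_to_fasta(example_csv)
print(fasta_text)
```

Writing one header line and one sequence line per record keeps the file easy to inspect and to count with simple line-based tools.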

These FASTA files are the input to the ESM [`extract.py` script](https://github.com/facebookresearch/esm/blob/main/scripts/extract.py), which extracts embeddings in bulk.

The script `scripts/extract.py` stores embeddings in PyTorch `.pt` files (generated by `torch.save`), one file per FASTA sequence.


In [1]:
# Import dependencies
import os

Import file utilities

In [2]:
# Make the project's scripts folder importable
import sys
sys.path.append('../../scripts')

# Project helper module used below to build paths and summarize embedding folders
import file_utilities as fu

Build the path to the `extract.py` script.

In [3]:
# Path to extract.py in the local torch hub cache (machine-specific)
esm_scripts_path = '/home/damir/.cache/torch/hub/facebookresearch_esm_main/scripts'
extract = os.path.join(esm_scripts_path, 'extract.py')
extract

'/home/damir/.cache/torch/hub/facebookresearch_esm_main/scripts/extract.py'

Initialize arguments

In [4]:
# Define arguments for the file_paths function
# First 3 are constant in this notebook
ptmodel = 'esm'
task = 'acp'
pool = 'mean'  
# The last 3 arguments change over the course of the notebook
file_base = 'train'
model = 'esm1v_t33_650M_UR90S_1'
emb_layer = 33

<br>

## Train Dataset

### ESM-1v model - esm1v_t33_650M_UR90S_1

#### Pooling Operation:  `mean`

Call the function `fu.file_paths` to prepare the paths. The default root data folder is *../../data*.

In [5]:
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../../data/acp/train_data.fa 
 ../../data/acp/esm/train/acp_train_esm1v_mean
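The implementation of `fu.file_paths` is not shown in this notebook, but the printed paths suggest a simple naming scheme. The sketch below is a hypothetical stand-in that reproduces the two printed paths from the same arguments. The function name `sketch_paths`, the `data_root` parameter, and the rule of taking the model tag from the text before the first underscore are assumptions; the second value that `fu.file_paths` returns (discarded as `_` above) is unknown and is left out.

```python
def sketch_paths(ptmodel, task, file_base, model, pool, data_root="../../data"):
    """Hypothetical stand-in for fu.file_paths, inferred from the printed paths."""
    # Assumption: the short model tag is the text before the first underscore,
    # e.g. 'esm1v_t33_650M_UR90S_1' -> 'esm1v'
    model_tag = model.split("_")[0]
    # FASTA input, e.g. ../../data/acp/train_data.fa
    path_fa = "/".join([data_root, task, f"{file_base}_data.fa"])
    # Output folder for the .pt files, e.g. ../../data/acp/esm/train/acp_train_esm1v_mean
    folder = f"{task}_{file_base}_{model_tag}_{pool}"
    path_pt = "/".join([data_root, task, ptmodel, file_base, folder])
    return path_pt, path_fa


demo_pt, demo_fa = sketch_paths("esm", "acp", "train", "esm1v_t33_650M_UR90S_1", "mean")
print(demo_fa)
print(demo_pt)
```

Encoding the task, split, model, and pooling in the folder name keeps each embedding run in its own directory, so later runs in this notebook do not overwrite earlier ones.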


<br>

Run the embedding script for: `esm - acp - train - esm1v - mean`.  
The script reads the FASTA file and creates `.pt` files with embeddings, one for each FASTA sequence.

In [6]:
%%time
# Run embedding script
%run "{extract}" "{model}" "{path_fa}" "{path_pt}" --repr_layers "{emb_layer}" --include "{pool}" 



Transferred model to GPU
Read ../../data/acp/train_data.fa with 1378 sequences
Processing 1 of 10 batches (292 sequences)
Processing 2 of 10 batches (215 sequences)
Processing 3 of 10 batches (178 sequences)
Processing 4 of 10 batches (157 sequences)
Processing 5 of 10 batches (132 sequences)
Processing 6 of 10 batches (117 sequences)
Processing 7 of 10 batches (105 sequences)
Processing 8 of 10 batches (91 sequences)
Processing 9 of 10 batches (80 sequences)
Processing 10 of 10 batches (11 sequences)
CPU times: user 31.1 s, sys: 11.7 s, total: 42.8 s
Wall time: 52.6 s
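The `%run` magic above only works inside IPython or Jupyter. As a sketch of how the same call could be assembled for a plain Python script, the snippet below builds the equivalent argument list. The helper name `build_extract_command` and the placeholder script path `extract.py` are assumptions; the positional arguments and the `--repr_layers` and `--include` flags are the ones used in the cell above.

```python
import shlex
import sys


def build_extract_command(extract_script, model_name, fasta_path, output_dir, layer, include):
    """Build the argument list for extract.py (helper name is hypothetical)."""
    return [
        sys.executable, extract_script,
        model_name, fasta_path, output_dir,   # positional arguments, in this order
        "--repr_layers", str(layer),          # layer whose representations are saved
        "--include", include,                 # e.g. 'mean' for mean-pooled embeddings
    ]


cmd = build_extract_command(
    "extract.py",  # placeholder; use the real path to extract.py
    "esm1v_t33_650M_UR90S_1",
    "../../data/acp/train_data.fa",
    "../../data/acp/esm/train/acp_train_esm1v_mean",
    33,
    "mean",
)
# Show the command without the interpreter path
print(shlex.join(cmd[1:]))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.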


<br>

### ESM-1b model - esm1b_t33_650M_UR50S

#### Pooling Operation:  `mean`

Update arguments and prepare paths

In [7]:
# Update arguments
model = 'esm1b_t33_650M_UR50S'
emb_layer = 33
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../../data/acp/train_data.fa 
 ../../data/acp/esm/train/acp_train_esm1b_mean


<br>

Run the embedding script for: `esm - acp - train - esm1b - mean`.  
The script reads the FASTA file and creates `.pt` files with embeddings, one for each FASTA sequence.

In [8]:
%%time
# Run embedding script
%run "{extract}" "{model}" "{path_fa}" "{path_pt}" --repr_layers "{emb_layer}" --include "{pool}" 

Transferred model to GPU
Read ../../data/acp/train_data.fa with 1378 sequences
Processing 1 of 10 batches (292 sequences)
Processing 2 of 10 batches (215 sequences)
Processing 3 of 10 batches (178 sequences)
Processing 4 of 10 batches (157 sequences)
Processing 5 of 10 batches (132 sequences)
Processing 6 of 10 batches (117 sequences)
Processing 7 of 10 batches (105 sequences)
Processing 8 of 10 batches (91 sequences)
Processing 9 of 10 batches (80 sequences)
Processing 10 of 10 batches (11 sequences)
CPU times: user 29.4 s, sys: 14.6 s, total: 43.9 s
Wall time: 49.3 s


<br>

**Check the folders**

In [9]:
base = os.path.split(path_pt)[0]
!tree -nDhL 1 "{base}" --dirsfirst

../../data/acp/esm/train
├── [4.0K Oct  2 19:32]  acp_train_esm1b_mean
└── [4.0K Oct  2 19:30]  acp_train_esm1v_mean

2 directories, 0 files


Print the total size and number of pt files in each embedding folder

In [10]:
# Print the total size and number of pt files in each embedding folder
fu.emb_files_stats(path_pt)

acp_train_esm1b_mean consumes: 7.79MB in 1378 files
acp_train_esm1v_mean consumes: 7.79MB in 1378 files
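The source of `fu.emb_files_stats` is not shown here. A minimal sketch of how such a summary could be computed with the standard library follows. The function name `folder_stats` is hypothetical, and whether the project's helper reports MB as 10^6 or 2^20 bytes is not known, so the sketch returns raw bytes and only converts for printing.

```python
import os
import tempfile


def folder_stats(folder, suffix=".pt"):
    """Return (file count, total size in bytes) for files ending in `suffix`."""
    count, total = 0, 0
    with os.scandir(folder) as entries:
        for entry in entries:
            if entry.is_file() and entry.name.endswith(suffix):
                count += 1
                total += entry.stat().st_size
    return count, total


# Demo on a throwaway folder with three 1 KiB dummy files and one unrelated file
with tempfile.TemporaryDirectory() as tmp:
    for i in range(3):
        with open(os.path.join(tmp, f"seq_{i}.pt"), "wb") as fh:
            fh.write(b"\0" * 1024)
    with open(os.path.join(tmp, "notes.txt"), "w") as fh:
        fh.write("ignored")
    n_files, n_bytes = folder_stats(tmp)
    print(f"{n_files} files, {n_bytes / 2**20:.2f}MB")
```

Applied to each subfolder of `base`, a function like this would give per-folder counts comparable to the output above.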


<br>

## Test Dataset

### ESM-1v model - esm1v_t33_650M_UR90S_1

#### Pooling Operation:  `mean`

Update arguments and prepare paths

In [11]:
# Update arguments
model = 'esm1v_t33_650M_UR90S_1'
file_base = 'test'
emb_layer = 33
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../../data/acp/test_data.fa 
 ../../data/acp/esm/test/acp_test_esm1v_mean


<br>

Run the embedding script for: `esm - acp - test - esm1v - mean`.  
The script reads the FASTA file and creates `.pt` files with embeddings, one for each FASTA sequence.

In [12]:
%%time
# Run embedding script
%run "{extract}" "{model}" "{path_fa}" "{path_pt}" --repr_layers "{emb_layer}" --include "{pool}" 

Transferred model to GPU
Read ../../data/acp/test_data.fa with 344 sequences
Processing 1 of 3 batches (175 sequences)
Processing 2 of 3 batches (113 sequences)
Processing 3 of 3 batches (56 sequences)
CPU times: user 14.6 s, sys: 12.7 s, total: 27.3 s
Wall time: 28 s


<br>

### ESM-1b model - esm1b_t33_650M_UR50S

#### Pooling Operation:  `mean`

Update arguments and prepare paths

In [13]:
# Update arguments
model = 'esm1b_t33_650M_UR50S'
emb_layer = 33
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../../data/acp/test_data.fa 
 ../../data/acp/esm/test/acp_test_esm1b_mean


<br>

Run the embedding script for: `esm - acp - test - esm1b - mean`.  
The script reads the FASTA file and creates `.pt` files with embeddings, one for each FASTA sequence.

In [14]:
%%time
# Run embedding script
%run "{extract}" "{model}" "{path_fa}" "{path_pt}" --repr_layers "{emb_layer}" --include "{pool}" 

Transferred model to GPU
Read ../../data/acp/test_data.fa with 344 sequences
Processing 1 of 3 batches (175 sequences)
Processing 2 of 3 batches (113 sequences)
Processing 3 of 3 batches (56 sequences)
CPU times: user 13.8 s, sys: 10.8 s, total: 24.6 s
Wall time: 25.1 s


<br>

**Check the folders**

In [15]:
base = os.path.split(path_pt)[0]
!tree -nDhL 1 "{base}" --dirsfirst

../../data/acp/esm/test
├── [4.0K Oct  2 19:35]  acp_test_esm1b_mean
└── [4.0K Oct  2 19:34]  acp_test_esm1v_mean

2 directories, 0 files


Print the total size and number of pt files in each embedding folder

In [16]:
# Print the total size and number of pt files in each embedding folder
fu.emb_files_stats(path_pt)

acp_test_esm1b_mean consumes: 1.95MB in 344 files
acp_test_esm1v_mean consumes: 1.95MB in 344 files
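In the logs above, the number of sequences read (1378 for train, 344 for test) matches the number of `.pt` files in each folder. A small check like the one sketched below could make that comparison explicit. The function names and the demo data are hypothetical; the check only assumes that every FASTA record starts with a `>` header line and that each record produces one `.pt` file.

```python
import os
import tempfile


def count_fasta_records(fasta_path):
    """Count records by counting header lines that start with '>'."""
    with open(fasta_path) as fh:
        return sum(1 for line in fh if line.startswith(">"))


def count_pt_files(folder):
    """Count the .pt files directly inside `folder`."""
    return sum(1 for name in os.listdir(folder) if name.endswith(".pt"))


# Demo with a two-record FASTA file and two empty placeholder .pt files
with tempfile.TemporaryDirectory() as tmp:
    fasta_path = os.path.join(tmp, "demo.fa")
    with open(fasta_path, "w") as fh:
        fh.write(">seq_1\nACDE\n>seq_2\nFGHIK\n")
    emb_dir = os.path.join(tmp, "emb")
    os.mkdir(emb_dir)
    for label in ("seq_1", "seq_2"):
        open(os.path.join(emb_dir, f"{label}.pt"), "wb").close()
    n_records = count_fasta_records(fasta_path)
    n_embeddings = count_pt_files(emb_dir)
    counts_match = n_records == n_embeddings
    print(f"{n_records} records, {n_embeddings} embedding files, match: {counts_match}")
```

In this notebook the same check would be called with `path_fa` and `path_pt` after each embedding run.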
