# ACP - Anticancer Peptides

##### After cleaning our original csv files, we converted them to fasta format.
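
The conversion itself is not shown in this notebook. As a rough illustration only, a minimal standard-library sketch might look like the following. The function name `csv_to_fasta`, the column names `id` and `sequence`, and the toy rows are all assumptions made for this example, not the project's actual cleaning code or data.

```python
import csv
import io


def csv_to_fasta(csv_text, id_col="id", seq_col="sequence"):
    """Convert CSV text to FASTA text: a '>header' line followed by the sequence.

    The column names are hypothetical; adjust them to the real CSV header.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for row in reader:
        seq = row[seq_col].strip().upper()
        if not seq:
            continue  # skip rows that have no sequence
        lines.append(f">{row[id_col].strip()}")
        lines.append(seq)
    return "\n".join(lines) + "\n"


# Toy rows made up for illustration (not taken from the ACP dataset)
example_csv = "id,sequence,label\nacp_1,ACDEFGHIK,1\nacp_2,lmnpqrstvwy,0\n"
fasta_text = csv_to_fasta(example_csv)
print(fasta_text)
```

Each record becomes a header line, later used as the key in the embedding file, followed by one sequence line.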

These fasta files are input to the [embedding script](https://github.com/tbepler/prose/blob/main/embed_sequences.py) developed by [Tristan Bepler](mailto:tbepler@gmail.com).  
The script writes embeddings out as an HDF5 file using the sequence headers as keys and embedding vectors as values.

In this project we use pretrained models from [Facebook's Evolutionary Scale Modeling (ESM)](https://github.com/facebookresearch/esm) and [Protein Sequence Embeddings (ProSE)](https://github.com/tbepler/prose). Both accept fasta-formatted files as input, but they write embeddings in different formats. ESM stores embeddings in PyTorch `.pt` files (generated by `torch.save`), one file per fasta sequence, while ProSE writes all embeddings to a single HDF5 file (`.h5`).

To store all embeddings in one data format, we follow the format used by ESM.  
Hence, as the last step, we convert the HDF5 file generated by the ProSE embedding script to `.pt` files.


In [1]:
# Import dependencies
import os

Import file utilities

In [2]:
# Import the script from a different folder
import sys  
sys.path.append('../../scripts')

# import file utilities as fu
import file_utilities as fu

Initialize arguments

In [3]:
# Define arguments for the file_paths function
# The first two are constant in this notebook
ptmodel = 'prose'
task = 'acp'
# The last three change throughout the notebook
file_base = 'train'
model = 'prose_dlm'
pool = 'avg'  


<br>

## Train Dataset

### ProSE DLM model - prose_dlm

- **Pooling Operation:  `avg`**

Call the utility function `file_paths` to prepare the paths. The default root data folder is *../../data*.

In [4]:
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../../data/acp/train_data.fa 
 ../../data/acp/prose/train/acp_train_dlm_avg.h5 
 ../../data/acp/prose/train/acp_train_dlm_avg
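
The source of `fu.file_paths` is not shown here, but the printed paths suggest a simple naming scheme. The sketch below is a hypothetical re-creation inferred only from that output; the function name `file_paths_sketch`, the `root` parameter, and the prefix-stripping rule are assumptions.

```python
from pathlib import PurePosixPath


def file_paths_sketch(ptmodel, task, file_base, model, pool, root="../../data"):
    """Hypothetical re-creation of the path layout printed above."""
    # 'prose_dlm' -> 'dlm': drop the pretrained-model prefix from the model name
    prefix = ptmodel + "_"
    short = model[len(prefix):] if model.startswith(prefix) else model
    stem = f"{task}_{file_base}_{short}_{pool}"
    emb_dir = PurePosixPath(root) / task / ptmodel / file_base
    fa = PurePosixPath(root) / task / f"{file_base}_data.fa"
    return str(emb_dir / stem), str(emb_dir / (stem + ".h5")), str(fa)


pt, h5, fa = file_paths_sketch("prose", "acp", "train", "prose_dlm", "avg")
print(fa, h5, pt, sep="\n")
```

The return order (pt folder, h5 file, fasta file) mirrors the unpacking used in the cell above.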


<br>

Run the embedding script for: `prose - acp - train - dlm - avg`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [5]:
%%time
# Run embedding script
%run ../../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE DLM model
# writing: ../../data/acp/prose/train/acp_train_dlm_avg.h5
# embedding with pool=avg
# 1372 sequences processed...

CPU times: user 15.6 s, sys: 2.53 s, total: 18.1 s
Wall time: 20.5 s
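
The `--pool` option reduces the per-residue embeddings of a sequence to one fixed-length vector. The following is a minimal pure-Python sketch of the three operations used in this notebook, assuming pooling is applied along the sequence-length axis. The function name `pool_embedding` and the toy numbers are illustrative only; the embedding script itself operates on tensors.

```python
def pool_embedding(residue_vectors, pool):
    """Reduce an L x D list of per-residue vectors to one length-D vector."""
    columns = list(zip(*residue_vectors))  # transpose: D columns of L values
    if pool == "avg":
        return [sum(col) / len(col) for col in columns]
    if pool == "max":
        return [max(col) for col in columns]
    if pool == "sum":
        return [sum(col) for col in columns]
    raise ValueError(f"unknown pooling operation: {pool}")


# Toy embedding: 3 residues with 2 features each (real vectors are much wider)
toy = [[1.0, 4.0], [2.0, 6.0], [3.0, 2.0]]
pooled = {name: pool_embedding(toy, name) for name in ("avg", "max", "sum")}
print(pooled)
```

Whatever the peptide length, the pooled vector has the same dimension, which would explain why all embedding files in a folder have a similar size.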



Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [7]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 52.1 s, sys: 973 ms, total: 53.1 s
Wall time: 58.7 s
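
The implementation of `fu.convert_h5_to_pt` is not shown in this notebook. As a sketch of what such a conversion could look like, the code below reads every top-level dataset from the HDF5 file with `h5py` and saves it with `torch.save`. The names `h5_to_pt_sketch` and `safe_name`, the record layout (`label` plus a `<pool>_representations` entry), and the file-naming rule are assumptions for illustration, loosely modeled on the one-file-per-sequence convention of ESM.

```python
import os


def safe_name(header):
    """Turn a fasta header into a file-system friendly name (assumed rule)."""
    return "".join(c if c.isalnum() or c in "-_." else "_" for c in header)


def h5_to_pt_sketch(path_h5, out_dir, pool):
    """Write one .pt file per top-level dataset of the HDF5 file."""
    # Third-party packages, imported lazily so safe_name works without them
    import h5py
    import torch

    os.makedirs(out_dir, exist_ok=True)
    with h5py.File(path_h5, "r") as h5:
        for header in h5.keys():
            # Read the whole dataset into memory and wrap it as a tensor
            vector = torch.from_numpy(h5[header][()])
            # Assumed record layout; the real utility may store it differently
            record = {"label": header, f"{pool}_representations": vector}
            torch.save(record, os.path.join(out_dir, safe_name(header) + ".pt"))


print(safe_name("acp 1|train"))
```

The sketch assumes that headers do not contain `/`, since HDF5 would interpret that character as a group separator.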


<br>

- **Pooling Operation:  `max`**

Update arguments and prepare paths

In [8]:
# Update arguments
pool = 'max'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../../data/acp/train_data.fa 
 ../../data/acp/prose/train/acp_train_dlm_max.h5 
 ../../data/acp/prose/train/acp_train_dlm_max


<br>

Run the embedding script for: `prose - acp - train - dlm - max`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [9]:
%%time
# Run embedding script
%run ../../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE DLM model
# writing: ../../data/acp/prose/train/acp_train_dlm_max.h5
# embedding with pool=max
# 1361 sequences processed...

CPU times: user 14.2 s, sys: 1.76 s, total: 15.9 s
Wall time: 16.3 s



Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [10]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 49.2 s, sys: 452 ms, total: 49.7 s
Wall time: 55.2 s


<br>

- **Pooling Operation:  `sum`**

Update arguments and prepare paths

In [11]:
# Update arguments
pool = 'sum'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../../data/acp/train_data.fa 
 ../../data/acp/prose/train/acp_train_dlm_sum.h5 
 ../../data/acp/prose/train/acp_train_dlm_sum


<br>

Run the embedding script for: `prose - acp - train - dlm - sum`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [12]:
%%time
# Run embedding script
%run ../../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE DLM model
# writing: ../../data/acp/prose/train/acp_train_dlm_sum.h5
# embedding with pool=sum

CPU times: user 13.7 s, sys: 2.26 s, total: 16 s
Wall time: 16.4 s


Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [13]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 48.4 s, sys: 605 ms, total: 49 s
Wall time: 54.3 s


<br>

### ProSE MT model - prose_mt

- **Pooling Operation:  `avg`**

Update arguments and prepare paths

In [14]:
# Update arguments
model = 'prose_mt'
pool = 'avg'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../../data/acp/train_data.fa 
 ../../data/acp/prose/train/acp_train_mt_avg.h5 
 ../../data/acp/prose/train/acp_train_mt_avg


<br>

Run the embedding script for: `prose - acp - train - mt - avg`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [15]:
%%time
# Run embedding script
%run ../../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE MT model
# writing: ../../data/acp/prose/train/acp_train_mt_avg.h5
# embedding with pool=avg

CPU times: user 14.6 s, sys: 1.91 s, total: 16.5 s
Wall time: 17.3 s


Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [16]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 49.6 s, sys: 807 ms, total: 50.4 s
Wall time: 56 s


<br>

- **Pooling Operation:  `max`**

Update arguments and prepare paths

In [17]:
# Update arguments
pool = 'max'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../../data/acp/train_data.fa 
 ../../data/acp/prose/train/acp_train_mt_max.h5 
 ../../data/acp/prose/train/acp_train_mt_max


<br>

Run the embedding script for: `prose - acp - train - mt - max`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [18]:
%%time
# Run embedding script
%run ../../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE MT model
# writing: ../../data/acp/prose/train/acp_train_mt_max.h5
# embedding with pool=max
# 1374 sequences processed...

CPU times: user 14.3 s, sys: 2.11 s, total: 16.4 s
Wall time: 16.8 s



Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [19]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 48.4 s, sys: 600 ms, total: 49 s
Wall time: 54.6 s


<br>

- **Pooling Operation:  `sum`**

Update arguments and prepare paths

In [20]:
# Update arguments
pool = 'sum'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../../data/acp/train_data.fa 
 ../../data/acp/prose/train/acp_train_mt_sum.h5 
 ../../data/acp/prose/train/acp_train_mt_sum


<br>

Run the embedding script for: `prose - acp - train - mt - sum`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [21]:
%%time
# Run embedding script
%run ../../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE MT model
# writing: ../../data/acp/prose/train/acp_train_mt_sum.h5
# embedding with pool=sum
# 1372 sequences processed...

CPU times: user 14.5 s, sys: 1.82 s, total: 16.3 s
Wall time: 16.7 s



Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [22]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 48.2 s, sys: 557 ms, total: 48.8 s
Wall time: 54.4 s


<br>

**Check the folders**

In [23]:
base = os.path.split(path_pt)[0]
!tree -nDhL 1 "{base}" --dirsfirst

../../data/acp/prose/train
├── [4.0K Oct  2 19:42]  acp_train_dlm_avg
├── [4.0K Oct  2 19:55]  acp_train_dlm_max
├── [4.0K Oct  2 19:56]  acp_train_dlm_sum
├── [4.0K Oct  2 19:57]  acp_train_mt_avg
├── [4.0K Oct  2 19:59]  acp_train_mt_max
├── [4.0K Oct  2 20:00]  acp_train_mt_sum
├── [ 33M Oct  2 19:41]  acp_train_dlm_avg.h5
├── [ 33M Oct  2 19:54]  acp_train_dlm_max.h5
├── [ 33M Oct  2 19:55]  acp_train_dlm_sum.h5
├── [ 33M Oct  2 19:56]  acp_train_mt_avg.h5
├── [ 33M Oct  2 19:58]  acp_train_mt_max.h5
└── [ 33M Oct  2 19:59]  acp_train_mt_sum.h5

6 directories, 6 files


Print the total size and number of pt files in each embedding folder

In [24]:
# Print the total size and number of pt files in each embedding folder
fu.emb_files_stats(path_pt)

acp_train_dlm_avg consumes: 33.45MB in 1378 files
acp_train_dlm_max consumes: 33.45MB in 1378 files
acp_train_dlm_sum consumes: 33.45MB in 1378 files
acp_train_mt_avg consumes: 33.45MB in 1378 files
acp_train_mt_max consumes: 33.45MB in 1378 files
acp_train_mt_sum consumes: 33.45MB in 1378 files
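
For readers without access to `file_utilities`, a comparable report can be produced with the standard library. The sketch below is a hypothetical stand-in for `fu.emb_files_stats`: it takes the parent folder directly, and it assumes that 1 MB means 10^6 bytes, which may differ from the original utility. It is demonstrated on a temporary folder with dummy files rather than on the real embeddings.

```python
import os
import tempfile


def folder_stats(parent):
    """Return {subfolder: (size_in_MB, n_pt_files)} for each subfolder of parent."""
    stats = {}
    for entry in sorted(os.scandir(parent), key=lambda e: e.name):
        if not entry.is_dir():
            continue
        files = [f for f in os.scandir(entry.path)
                 if f.is_file() and f.name.endswith(".pt")]
        total = sum(f.stat().st_size for f in files)
        stats[entry.name] = (total / 1e6, len(files))
    return stats


# Self-contained demo: one subfolder holding two dummy 1000-byte files
with tempfile.TemporaryDirectory() as parent:
    folder = os.path.join(parent, "acp_train_dlm_avg")
    os.makedirs(folder)
    for i in range(2):
        with open(os.path.join(folder, f"{i}.pt"), "wb") as fh:
            fh.write(b"\0" * 1000)
    demo_stats = folder_stats(parent)

for name, (size_mb, count) in demo_stats.items():
    print(f"{name} consumes: {size_mb:.2f}MB in {count} files")
```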


<br>

## Test Dataset

### ProSE DLM model - prose_dlm

- **Pooling Operation:  `avg`**

Update arguments and prepare paths

In [25]:
# Update arguments
file_base = 'test'
model = 'prose_dlm'
pool = 'avg'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../../data/acp/test_data.fa 
 ../../data/acp/prose/test/acp_test_dlm_avg.h5 
 ../../data/acp/prose/test/acp_test_dlm_avg


<br>

Run the embedding script for: `prose - acp - test - dlm - avg`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [26]:
%%time
# Run embedding script
%run ../../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE DLM model
# writing: ../../data/acp/prose/test/acp_test_dlm_avg.h5
# embedding with pool=avg
# 334 sequences processed...

CPU times: user 4.27 s, sys: 326 ms, total: 4.59 s
Wall time: 4.7 s



Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [27]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 11.7 s, sys: 246 ms, total: 11.9 s
Wall time: 13.3 s


<br>

- **Pooling Operation:  `max`**

Update arguments and prepare paths

In [28]:
# Update arguments
pool = 'max'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../../data/acp/test_data.fa 
 ../../data/acp/prose/test/acp_test_dlm_max.h5 
 ../../data/acp/prose/test/acp_test_dlm_max


<br>

Run the embedding script for: `prose - acp - test - dlm - max`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [29]:
%%time
# Run embedding script
%run ../../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE DLM model
# writing: ../../data/acp/prose/test/acp_test_dlm_max.h5
# embedding with pool=max
# 332 sequences processed...

CPU times: user 4.06 s, sys: 530 ms, total: 4.59 s
Wall time: 4.7 s



Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [30]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 13 s, sys: 136 ms, total: 13.2 s
Wall time: 14.6 s


<br>

- **Pooling Operation:  `sum`**

Update arguments and prepare paths

In [31]:
# Update arguments
pool = 'sum'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../../data/acp/test_data.fa 
 ../../data/acp/prose/test/acp_test_dlm_sum.h5 
 ../../data/acp/prose/test/acp_test_dlm_sum


<br>

Run the embedding script for: `prose - acp - test - dlm - sum`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [32]:
%%time
# Run embedding script
%run ../../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE DLM model
# writing: ../../data/acp/prose/test/acp_test_dlm_sum.h5
# embedding with pool=sum
# 341 sequences processed...

CPU times: user 4.08 s, sys: 608 ms, total: 4.69 s
Wall time: 4.77 s



Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [33]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 12.5 s, sys: 107 ms, total: 12.6 s
Wall time: 14.1 s


<br>

### ProSE MT model - prose_mt

- **Pooling Operation:  `avg`**

Update arguments and prepare paths

In [34]:
# Update arguments
model = 'prose_mt'
pool = 'avg'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../../data/acp/test_data.fa 
 ../../data/acp/prose/test/acp_test_mt_avg.h5 
 ../../data/acp/prose/test/acp_test_mt_avg


<br>

Run the embedding script for: `prose - acp - test - mt - avg`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [35]:
%%time
# Run embedding script
%run ../../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE MT model
# writing: ../../data/acp/prose/test/acp_test_mt_avg.h5
# embedding with pool=avg
# 331 sequences processed...

CPU times: user 4.46 s, sys: 579 ms, total: 5.04 s
Wall time: 5.13 s



Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [36]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 13.2 s, sys: 84 ms, total: 13.3 s
Wall time: 14.7 s


<br>

- **Pooling Operation:  `max`**

Update arguments and prepare paths

In [37]:
# Update arguments
pool = 'max'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../../data/acp/test_data.fa 
 ../../data/acp/prose/test/acp_test_mt_max.h5 
 ../../data/acp/prose/test/acp_test_mt_max


<br>

Run the embedding script for: `prose - acp - test - mt - max`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [38]:
%%time
# Run embedding script
%run ../../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE MT model
# writing: ../../data/acp/prose/test/acp_test_mt_max.h5
# embedding with pool=max
# 333 sequences processed...

CPU times: user 4.53 s, sys: 704 ms, total: 5.24 s
Wall time: 5.34 s



Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [39]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 12.6 s, sys: 117 ms, total: 12.7 s
Wall time: 14.2 s


<br>

- **Pooling Operation:  `sum`**

Update arguments and prepare paths

In [40]:
# Update arguments
pool = 'sum'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../../data/acp/test_data.fa 
 ../../data/acp/prose/test/acp_test_mt_sum.h5 
 ../../data/acp/prose/test/acp_test_mt_sum


<br>

Run the embedding script for: `prose - acp - test - mt - sum`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [41]:
%%time
# Run embedding script
%run ../../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE MT model
# writing: ../../data/acp/prose/test/acp_test_mt_sum.h5
# embedding with pool=sum
# 335 sequences processed...

CPU times: user 4.79 s, sys: 692 ms, total: 5.49 s
Wall time: 5.47 s



Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [42]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 13.4 s, sys: 337 ms, total: 13.7 s
Wall time: 15.3 s


<br>

**Check the folders**

In [43]:
base = os.path.split(path_pt)[0]
!tree -nDhL 1 "{base}" --dirsfirst

../../data/acp/prose/test
├── [4.0K Oct  2 20:00]  acp_test_dlm_avg
├── [4.0K Oct  2 20:01]  acp_test_dlm_max
├── [4.0K Oct  2 20:01]  acp_test_dlm_sum
├── [4.0K Oct  2 20:01]  acp_test_mt_avg
├── [4.0K Oct  2 20:02]  acp_test_mt_max
├── [4.0K Oct  2 20:02]  acp_test_mt_sum
├── [8.2M Oct  2 20:00]  acp_test_dlm_avg.h5
├── [8.2M Oct  2 20:00]  acp_test_dlm_max.h5
├── [8.2M Oct  2 20:01]  acp_test_dlm_sum.h5
├── [8.2M Oct  2 20:01]  acp_test_mt_avg.h5
├── [8.2M Oct  2 20:01]  acp_test_mt_max.h5
└── [8.2M Oct  2 20:02]  acp_test_mt_sum.h5

6 directories, 6 files


Print the total size and number of pt files in each embedding folder

In [44]:
# Print the total size and number of pt files in each embedding folder
fu.emb_files_stats(path_pt)

acp_test_dlm_avg consumes: 8.35MB in 344 files
acp_test_dlm_max consumes: 8.35MB in 344 files
acp_test_dlm_sum consumes: 8.35MB in 344 files
acp_test_mt_avg consumes: 8.35MB in 344 files
acp_test_mt_max consumes: 8.35MB in 344 files
acp_test_mt_sum consumes: 8.35MB in 344 files
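
The twelve runs in this notebook differ only in `file_base`, `model`, and `pool`. As a sketch of how those combinations could be enumerated in one place, the snippet below builds them with `itertools.product`. It only prints the run labels; wiring each iteration to `fu.file_paths`, the embedding script, and `fu.convert_h5_to_pt` is left as an assumption about how one might restructure the notebook.

```python
from itertools import product

file_bases = ["train", "test"]
models = ["prose_dlm", "prose_mt"]
pools = ["avg", "max", "sum"]

# Same order as the sections above: dataset, then model, then pooling operation
runs = list(product(file_bases, models, pools))
for file_base, model, pool in runs:
    # In the notebook, each iteration would prepare the paths, run the
    # embedding script, and convert the resulting h5 file to pt files.
    print(f"prose - acp - {file_base} - {model} - {pool}")
```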
