# ACP - Anticancer Peptides

##### After cleaning our original csv files, we converted them to fasta format.
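
The conversion itself is not shown in this notebook. As a rough sketch only, it could look like the following, assuming each cleaned CSV has one column with a unique identifier and one with the peptide sequence. The column names `id` and `sequence` and the function name `csv_to_fasta` are hypothetical.

```python
import csv


def csv_to_fasta(csv_path, fasta_path, id_col="id", seq_col="sequence"):
    """Write one FASTA record per CSV row and return the number of records.

    The column names are assumptions; adjust them to the real CSV layout.
    """
    count = 0
    with open(csv_path, newline="") as src, open(fasta_path, "w") as dst:
        for row in csv.DictReader(src):
            header = row[id_col].strip()
            sequence = row[seq_col].strip()
            if not header or not sequence:
                continue  # skip rows with a missing identifier or sequence
            # FASTA record: a '>' header line followed by the sequence line
            dst.write(f">{header}\n{sequence}\n")
            count += 1
    return count
```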

These fasta files are the input to the [embedding script](https://github.com/tbepler/prose/blob/main/embed_sequences.py) developed by [Tristan Bepler](mailto:tbepler@gmail.com).  
The script writes embeddings out as an HDF5 file using the sequence headers as keys and embedding vectors as values.

In this project we use pretrained models from [Facebook's Evolutionary Scale Modeling (ESM)](https://github.com/facebookresearch/esm) and [Protein Sequence Embeddings (ProSE)](https://github.com/tbepler/prose). Both accept fasta-formatted files as input, but they write embeddings in different formats. ESM stores embeddings in PyTorch `.pt` files (generated by `torch.save`), one file per fasta sequence, while ProSE writes all embeddings to a single HDF5 file (`.h5`).

To store all embeddings in one data format, we follow the format used by ESM.   
Hence, as the last step, we convert the HDF5 file generated by the ProSE embedding script to `.pt` files.
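
The conversion step can be sketched roughly as follows, assuming a flat HDF5 file in which every top-level key is a sequence header and every value is one pooled embedding vector. The function names `pt_filename` and `h5_to_pt` and the saved dictionary layout (`label` plus an entry keyed by the pooling name) are assumptions for illustration; they are not necessarily what `convert_h5_to_pt` in `file_utilities` or ESM itself writes.

```python
import os


def pt_filename(header):
    """Map a sequence header to a .pt file name, replacing path separators."""
    return header.replace("/", "_").replace("\\", "_") + ".pt"


def h5_to_pt(path_h5, out_dir, pool):
    """Write one .pt file per dataset in the HDF5 file; return the file count."""
    # Third-party imports are local so pt_filename stays usable without them.
    import h5py
    import torch

    os.makedirs(out_dir, exist_ok=True)
    count = 0
    with h5py.File(path_h5, "r") as h5:
        for header in h5.keys():
            # Assumes every top-level key is a dataset holding one embedding.
            embedding = torch.from_numpy(h5[header][()])
            record = {"label": header, pool: embedding}  # assumed layout
            torch.save(record, os.path.join(out_dir, pt_filename(header)))
            count += 1
    return count
```

A file written this way can be read back with `torch.load`, which returns the saved dictionary.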


In [1]:
# Import dependencies
import os

Import file utilities

In [2]:
# Add the scripts folder to the module search path
import sys
sys.path.append('../scripts')

# Import file utilities as fu
import file_utilities as fu

Initialize arguments

In [3]:
# Define arguments for the file_paths function
# The first two are constant in this notebook
ptmodel = 'prose'
task = 'acp'
# The last three arguments change throughout the notebook
file_base = 'train'
model = 'prose_dlm'
pool = 'avg'  


<br>

## Train Dataset

### ProSE DLM model - prose_dlm

- **Pooling Operation:  `avg`**

Call the utility function `file_paths` to prepare the paths. The default root data folder is *../data*.

In [4]:
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../data/acp/train_data.fa 
 ../data/acp/prose/train/acp_train_dlm_avg.h5 
 ../data/acp/prose/train/acp_train_dlm_avg
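
The printed paths follow a regular naming pattern. The sketch below reproduces that pattern from the five arguments; it is inferred only from the output above, and the function name `sketch_file_paths` and its `root` parameter are hypothetical stand-ins, not the actual implementation in `file_utilities`.

```python
import os


def sketch_file_paths(ptmodel, task, file_base, model, pool, root="../data"):
    """Rebuild the three paths printed above from the five arguments."""
    # 'prose_dlm' -> 'dlm': drop the pretrained-model prefix if present.
    prefix = ptmodel + "_"
    short_model = model[len(prefix):] if model.startswith(prefix) else model
    stem = f"{task}_{file_base}_{short_model}_{pool}"
    path_fa = os.path.join(root, task, f"{file_base}_data.fa")
    path_pt = os.path.join(root, task, ptmodel, file_base, stem)
    path_h5 = path_pt + ".h5"
    return path_pt, path_h5, path_fa
```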


<br>

Run the embedding script for: `prose - acp - train - dlm - avg`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [10]:
%%time
# Run embedding script
%run ../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE DLM model
# writing: ../data/acp/prose/train/acp_train_dlm_avg.h5
# embedding with pool=avg
# 1361 sequences processed...

CPU times: user 17.6 s, sys: 3.23 s, total: 20.8 s
Wall time: 21.6 s

Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [11]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 48.3 s, sys: 448 ms, total: 48.8 s
Wall time: 53.8 s


<br>

- **Pooling Operation:  `max`**

Update arguments and prepare paths

In [12]:
# Update arguments
pool = 'max'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../data/acp/train_data.fa 
 ../data/acp/prose/train/acp_train_dlm_max.h5 
 ../data/acp/prose/train/acp_train_dlm_max


<br>

Run the embedding script for: `prose - acp - train - dlm - max`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [13]:
%%time
# Run embedding script
%run ../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE DLM model
# writing: ../data/acp/prose/train/acp_train_dlm_max.h5
# embedding with pool=max
# 1368 sequences processed...

CPU times: user 13.9 s, sys: 1.63 s, total: 15.5 s
Wall time: 15.8 s

Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [14]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 47.6 s, sys: 391 ms, total: 48 s
Wall time: 53.4 s


<br>

- **Pooling Operation:  `sum`**

Update arguments and prepare paths

In [15]:
# Update arguments
pool = 'sum'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../data/acp/train_data.fa 
 ../data/acp/prose/train/acp_train_dlm_sum.h5 
 ../data/acp/prose/train/acp_train_dlm_sum


<br>

Run the embedding script for: `prose - acp - train - dlm - sum`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [16]:
%%time
# Run embedding script
%run ../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE DLM model
# writing: ../data/acp/prose/train/acp_train_dlm_sum.h5
# embedding with pool=sum
# 1375 sequences processed...

CPU times: user 14 s, sys: 1.83 s, total: 15.9 s
Wall time: 16.3 s

Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [17]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 48.7 s, sys: 513 ms, total: 49.2 s
Wall time: 54.5 s


<br>

### ProSE MT model - prose_mt

- **Pooling Operation:  `avg`**

Update arguments and prepare paths

In [18]:
# Update arguments
model = 'prose_mt'
pool = 'avg'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../data/acp/train_data.fa 
 ../data/acp/prose/train/acp_train_mt_avg.h5 
 ../data/acp/prose/train/acp_train_mt_avg


<br>

Run the embedding script for: `prose - acp - train - mt - avg`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [19]:
%%time
# Run embedding script
%run ../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE MT model
# writing: ../data/acp/prose/train/acp_train_mt_avg.h5
# embedding with pool=avg
# 1372 sequences processed...

CPU times: user 14.4 s, sys: 1.68 s, total: 16 s
Wall time: 16.4 s

Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [20]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 47.4 s, sys: 414 ms, total: 47.8 s
Wall time: 52.9 s


<br>

- **Pooling Operation:  `max`**

Update arguments and prepare paths

In [21]:
# Update arguments
pool = 'max'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../data/acp/train_data.fa 
 ../data/acp/prose/train/acp_train_mt_max.h5 
 ../data/acp/prose/train/acp_train_mt_max


<br>

Run the embedding script for: `prose - acp - train - mt - max`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [22]:
%%time
# Run embedding script
%run ../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE MT model
# writing: ../data/acp/prose/train/acp_train_mt_max.h5
# embedding with pool=max
# 1374 sequences processed...

CPU times: user 14.5 s, sys: 1.69 s, total: 16.2 s
Wall time: 16.4 s

Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [23]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 47.4 s, sys: 459 ms, total: 47.9 s
Wall time: 53.3 s


<br>

- **Pooling Operation:  `sum`**

Update arguments and prepare paths

In [24]:
# Update arguments
pool = 'sum'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../data/acp/train_data.fa 
 ../data/acp/prose/train/acp_train_mt_sum.h5 
 ../data/acp/prose/train/acp_train_mt_sum


<br>

Run the embedding script for: `prose - acp - train - mt - sum`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [25]:
%%time
# Run embedding script
%run ../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE MT model
# writing: ../data/acp/prose/train/acp_train_mt_sum.h5
# embedding with pool=sum
# 1368 sequences processed...

CPU times: user 14.6 s, sys: 1.64 s, total: 16.2 s
Wall time: 16.5 s

Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [26]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 48.3 s, sys: 533 ms, total: 48.8 s
Wall time: 54.1 s


<br>

**Check the folders**

In [5]:
base = os.path.split(path_pt)[0]
!tree -nDhL 1 "{base}" --dirsfirst

../data/acp/prose/train
├── [4.0K Sep  5 15:10]  acp_train_dlm_avg
├── [4.0K Sep  5 15:18]  acp_train_dlm_max
├── [4.0K Sep  5 15:28]  acp_train_dlm_sum
├── [4.0K Sep  5 15:44]  acp_train_mt_avg
├── [4.0K Sep  5 15:48]  acp_train_mt_max
├── [4.0K Sep  5 15:50]  acp_train_mt_sum
├── [ 33M Sep  5 15:09]  acp_train_dlm_avg.h5
├── [ 33M Sep  5 15:17]  acp_train_dlm_max.h5
├── [ 33M Sep  5 15:26]  acp_train_dlm_sum.h5
├── [ 33M Sep  5 15:43]  acp_train_mt_avg.h5
├── [ 33M Sep  5 15:47]  acp_train_mt_max.h5
└── [ 33M Sep  5 15:49]  acp_train_mt_sum.h5

6 directories, 6 files


Print the total size and number of pt files in each embedding folder

In [28]:
# Print the total size and number of pt files in each embedding folder
fu.emb_files_stats(path_pt)

acp_train_dlm_avg consumes: 33.45MB in 1378 files
acp_train_dlm_max consumes: 33.45MB in 1378 files
acp_train_dlm_sum consumes: 33.45MB in 1378 files
acp_train_mt_avg consumes: 33.45MB in 1378 files
acp_train_mt_max consumes: 33.45MB in 1378 files
acp_train_mt_sum consumes: 33.45MB in 1378 files
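
A summary like the one above can be produced with the standard library alone. The following is a minimal sketch, assuming the statistics cover every folder that sits next to `path_pt` and that 1 MB means 10^6 bytes; the names `sketch_emb_files_stats` and `print_stats` are hypothetical and the real `emb_files_stats` may compute sizes differently.

```python
import os


def sketch_emb_files_stats(path_pt):
    """Return (folder, size_in_MB, n_files) for each folder next to path_pt."""
    base = os.path.dirname(path_pt)
    stats = []
    for entry in sorted(os.scandir(base), key=lambda e: e.name):
        if not entry.is_dir():
            continue  # skip the .h5 files that sit beside the folders
        files = [f for f in os.scandir(entry.path)
                 if f.is_file() and f.name.endswith(".pt")]
        size_mb = sum(f.stat().st_size for f in files) / 1e6  # assumed unit
        stats.append((entry.name, size_mb, len(files)))
    return stats


def print_stats(stats):
    """Print one line per folder in the same style as the output above."""
    for name, size_mb, n_files in stats:
        print(f"{name} consumes: {size_mb:.2f}MB in {n_files} files")
```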


<br>

## Test Dataset

### ProSE DLM model - prose_dlm

- **Pooling Operation:  `avg`**

Update arguments and prepare paths

In [4]:
# Update arguments
file_base = 'test'
model = 'prose_dlm'
pool = 'avg'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../data/acp/test_data.fa 
 ../data/acp/prose/test/acp_test_dlm_avg.h5 
 ../data/acp/prose/test/acp_test_dlm_avg


<br>

Run the embedding script for: `prose - acp - test - dlm - avg`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [5]:
%%time
# Run embedding script
%run ../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE DLM model
# writing: ../data/acp/prose/test/acp_test_dlm_avg.h5
# embedding with pool=avg
# 331 sequences processed...

CPU times: user 5.71 s, sys: 1.37 s, total: 7.08 s
Wall time: 9.34 s

Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [7]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 11.6 s, sys: 114 ms, total: 11.7 s
Wall time: 13 s


<br>

- **Pooling Operation:  `max`**

Update arguments and prepare paths

In [8]:
# Update arguments
pool = 'max'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../data/acp/test_data.fa 
 ../data/acp/prose/test/acp_test_dlm_max.h5 
 ../data/acp/prose/test/acp_test_dlm_max


<br>

Run the embedding script for: `prose - acp - test - dlm - max`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [9]:
%%time
# Run embedding script
%run ../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE DLM model
# writing: ../data/acp/prose/test/acp_test_dlm_max.h5
# embedding with pool=max
# 333 sequences processed...

CPU times: user 4.11 s, sys: 519 ms, total: 4.63 s
Wall time: 4.83 s

Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [10]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 11.6 s, sys: 102 ms, total: 11.7 s
Wall time: 13.1 s


<br>

- **Pooling Operation:  `sum`**

Update arguments and prepare paths

In [11]:
# Update arguments
pool = 'sum'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../data/acp/test_data.fa 
 ../data/acp/prose/test/acp_test_dlm_sum.h5 
 ../data/acp/prose/test/acp_test_dlm_sum


<br>

Run the embedding script for: `prose - acp - test - dlm - sum`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [12]:
%%time
# Run embedding script
%run ../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE DLM model
# writing: ../data/acp/prose/test/acp_test_dlm_sum.h5
# embedding with pool=sum
# 341 sequences processed...

CPU times: user 3.9 s, sys: 622 ms, total: 4.52 s
Wall time: 4.71 s

Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [13]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 11.7 s, sys: 132 ms, total: 11.8 s
Wall time: 13.2 s


<br>

### ProSE MT model - prose_mt

- **Pooling Operation:  `avg`**

Update arguments and prepare paths

In [14]:
# Update arguments
model = 'prose_mt'
pool = 'avg'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../data/acp/test_data.fa 
 ../data/acp/prose/test/acp_test_mt_avg.h5 
 ../data/acp/prose/test/acp_test_mt_avg


<br>

Run the embedding script for: `prose - acp - test - mt - avg`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [15]:
%%time
# Run embedding script
%run ../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE MT model
# writing: ../data/acp/prose/test/acp_test_mt_avg.h5
# embedding with pool=avg
# 338 sequences processed...

CPU times: user 4.46 s, sys: 701 ms, total: 5.16 s
Wall time: 5.7 s

Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [16]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 12 s, sys: 62.4 ms, total: 12.1 s
Wall time: 13.4 s


<br>

- **Pooling Operation:  `max`**

Update arguments and prepare paths

In [17]:
# Update arguments
pool = 'max'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../data/acp/test_data.fa 
 ../data/acp/prose/test/acp_test_mt_max.h5 
 ../data/acp/prose/test/acp_test_mt_max


<br>

Run the embedding script for: `prose - acp - test - mt - max`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [18]:
%%time
# Run embedding script
%run ../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE MT model
# writing: ../data/acp/prose/test/acp_test_mt_max.h5
# embedding with pool=max
# 329 sequences processed...

CPU times: user 4.36 s, sys: 824 ms, total: 5.19 s
Wall time: 5.39 s

Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [19]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 13.3 s, sys: 152 ms, total: 13.5 s
Wall time: 14.9 s


<br>

- **Pooling Operation:  `sum`**

Update arguments and prepare paths

In [20]:
# Update arguments
pool = 'sum'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../data/acp/test_data.fa 
 ../data/acp/prose/test/acp_test_mt_sum.h5 
 ../data/acp/prose/test/acp_test_mt_sum


<br>

Run the embedding script for: `prose - acp - test - mt - sum`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [21]:
%%time
# Run embedding script
%run ../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE MT model
# writing: ../data/acp/prose/test/acp_test_mt_sum.h5
# embedding with pool=sum
# 340 sequences processed...

CPU times: user 4.52 s, sys: 591 ms, total: 5.11 s
Wall time: 5.28 s

Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [22]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 11.9 s, sys: 147 ms, total: 12.1 s
Wall time: 13.5 s


<br>

**Check the folders**

In [23]:
base = os.path.split(path_pt)[0]
!tree -nDhL 1 "{base}" --dirsfirst

../data/acp/prose/test
├── [4.0K Sep 10 20:35]  acp_test_dlm_avg
├── [4.0K Sep 10 20:37]  acp_test_dlm_max
├── [4.0K Sep 10 20:37]  acp_test_dlm_sum
├── [4.0K Sep 10 20:37]  acp_test_mt_avg
├── [4.0K Sep 10 20:38]  acp_test_mt_max
├── [4.0K Sep 10 20:38]  acp_test_mt_sum
├── [8.2M Sep 10 20:33]  acp_test_dlm_avg.h5
├── [8.2M Sep 10 20:37]  acp_test_dlm_max.h5
├── [8.2M Sep 10 20:37]  acp_test_dlm_sum.h5
├── [8.2M Sep 10 20:37]  acp_test_mt_avg.h5
├── [8.2M Sep 10 20:38]  acp_test_mt_max.h5
└── [8.2M Sep 10 20:38]  acp_test_mt_sum.h5

6 directories, 6 files


Print the total size and number of pt files in each embedding folder

In [24]:
# Print the total size and number of pt files in each embedding folder
fu.emb_files_stats(path_pt)

acp_test_dlm_avg consumes: 8.35MB in 344 files
acp_test_dlm_max consumes: 8.35MB in 344 files
acp_test_dlm_sum consumes: 8.35MB in 344 files
acp_test_mt_avg consumes: 8.35MB in 344 files
acp_test_mt_max consumes: 8.35MB in 344 files
acp_test_mt_sum consumes: 8.35MB in 344 files
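
The twelve runs in this notebook repeat the same steps for every combination of dataset, model, and pooling operation. As a sketch, the combinations and the matching embedding commands could be generated in a loop. The path pattern is inferred from the printed paths, the flags `--model`, `--pool`, and `-o` are the ones used in the `%run` cells, and invoking the script as `python ../prose/embed_sequences.py` is an assumption. The function name `embedding_runs` is hypothetical.

```python
from itertools import product

FILE_BASES = ("train", "test")
MODELS = ("prose_dlm", "prose_mt")
POOLS = ("avg", "max", "sum")


def embedding_runs(task="acp", ptmodel="prose", root="../data"):
    """Yield (path_fa, path_h5, command) for every dataset/model/pool combination."""
    for file_base, model, pool in product(FILE_BASES, MODELS, POOLS):
        short_model = model.split("_", 1)[1]  # 'prose_dlm' -> 'dlm'
        path_fa = f"{root}/{task}/{file_base}_data.fa"
        path_h5 = (f"{root}/{task}/{ptmodel}/{file_base}/"
                   f"{task}_{file_base}_{short_model}_{pool}.h5")
        # Assumed command-line equivalent of the %run cells in this notebook
        command = ["python", "../prose/embed_sequences.py",
                   "--model", model, "--pool", pool, "-o", path_h5, path_fa]
        yield path_fa, path_h5, command


# Print the commands instead of executing them
for path_fa, path_h5, command in embedding_runs():
    print(" ".join(command))
```

Each command list could then be passed to `subprocess.run(command, check=True)` to execute the run outside the notebook.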
