# AMP - Antimicrobial Peptides

After cleaning our original csv files, we converted them to fasta format.

These fasta files are input to the [embedding script](https://github.com/tbepler/prose/blob/main/embed_sequences.py) developed by [Tristan Bepler](mailto:tbepler@gmail.com).  
The script writes embeddings out as an HDF5 file using the sequence headers as keys and embedding vectors as values.
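The csv-to-fasta conversion mentioned above is not shown in this notebook. As a minimal sketch of what such a step could look like, the snippet below turns csv text into fasta records using only the standard library. The function name `csv_to_fasta` and the column names `id` and `sequence` are assumptions made for illustration, not the names used in our cleaning code.

```python
import csv
import io


def csv_to_fasta(csv_text, id_col="id", seq_col="sequence"):
    """Convert csv text to fasta text, one record per row.

    The column names are hypothetical; adjust them to the real csv files.
    """
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        header = row[id_col].strip()
        sequence = row[seq_col].strip().upper()
        if header and sequence:  # skip rows with a missing id or sequence
            records.append(f">{header}\n{sequence}\n")
    return "".join(records)


# Two toy peptides, not taken from the AMP dataset
example_csv = "id,sequence\npep_1,GLFDIVKKVV\npep_2,kwklfkkiek\n"
fasta_text = csv_to_fasta(example_csv)
print(fasta_text)
```

Each fasta header written this way later becomes a key in the HDF5 file, so headers should be unique.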

In this project we use pretrained models from [Facebook's Evolutionary Scale Modeling (ESM)](https://github.com/facebookresearch/esm) and [Protein Sequence Embeddings (ProSE)](https://github.com/tbepler/prose). Both accept fasta-formatted files as input, but they write embeddings in different formats. ESM stores embeddings in PyTorch `.pt` files (generated by `torch.save`), one file per fasta sequence, while ProSE writes all embeddings to a single HDF5 file (`.h5`).

To store all embeddings in the same data format, we follow the format used by ESM.   
Hence, as the last step, we convert the HDF5 file generated by the ProSE embedding script to `.pt` files.


In [1]:
# Import dependencies
import os

Import file utilities

In [2]:
# Make the scripts folder importable
import sys
sys.path.append('../../scripts')

# Import the file utilities module as fu
import file_utilities as fu

Initialize arguments

In [3]:
# Define arguments for the file_paths function
# The first three arguments are constant in this notebook (amp only)
ptmodel = 'prose'
task = 'amp'
file_base = 'all_data'
# The last two arguments change throughout the notebook
model = 'prose_dlm'
pool = 'avg'  
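
The notebook below walks through every combination of `model` and `pool` by hand. As a compact overview of which combinations are covered, the following sketch enumerates them with `itertools.product`. The names `models`, `pools`, and `runs` are introduced here for illustration only and are not used by the cells that follow.

```python
from itertools import product

models = ["prose_dlm", "prose_mt"]
pools = ["avg", "max", "sum"]

# product() varies the last iterable fastest: all pools for one model first
runs = list(product(models, pools))
for model_name, pool_name in runs:
    print(model_name, pool_name)
```

This yields six runs, in the same order as the sections of this notebook.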


<br>

## all_data Dataset

### ProSE DLM model - prose_dlm

- **Pooling Operation:  `avg`**

Call the utility function `file_paths` to prepare the paths. The default root data folder is *../../data*.

In [4]:
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../../data/amp/all_data.fa 
 ../../data/amp/prose/all_data/amp_all_dlm_avg.h5 
 ../../data/amp/prose/all_data/amp_all_dlm_avg
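
The source of `fu.file_paths` is not part of this notebook. The sketch below is a hypothetical re-creation that reproduces the three paths printed above; the shortening rules (`all_data` to `all`, `prose_dlm` to `dlm`) are inferred from that output and are assumptions, as is the function name `file_paths_sketch`.

```python
import posixpath


def file_paths_sketch(ptmodel, task, file_base, model, pool, root="../../data"):
    """Hypothetical re-creation of the path layout printed by the notebook."""
    short_base = file_base.split("_")[0]   # 'all_data'  -> 'all' (assumed rule)
    short_model = model.split("_")[-1]     # 'prose_dlm' -> 'dlm' (assumed rule)
    stem = f"{task}_{short_base}_{short_model}_{pool}"
    emb_dir = posixpath.join(root, task, ptmodel, file_base)
    path_fa = posixpath.join(root, task, f"{file_base}.fa")  # input fasta file
    path_pt = posixpath.join(emb_dir, stem)                   # folder for .pt files
    path_h5 = path_pt + ".h5"                                 # ProSE HDF5 output
    return path_pt, path_h5, path_fa


paths = file_paths_sketch("prose", "amp", "all_data", "prose_dlm", "avg")
print(*paths, sep="\n")
```

Keeping the `.h5` file and the `.pt` folder under the same stem makes it easy to see which HDF5 file each folder was converted from.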


<br>

Run the embedding script for: `prose - amp - all_data - dlm - avg`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [5]:
%%time
# Run embedding script
%run ../../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE DLM model
# writing: ../../data/amp/prose/all_data/amp_all_dlm_avg.h5
# embedding with pool=avg

CPU times: user 57.3 s, sys: 11.6 s, total: 1min 8s
Wall time: 1min 10s
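
To make the effect of the `--pool` argument concrete, here is a small pure-Python sketch. It assumes that pooling reduces the per-residue embeddings (one vector per amino acid) along the sequence dimension to a single fixed-size vector; the function `pool_embedding` and the toy numbers are illustrative and not part of ProSE.

```python
def pool_embedding(per_residue, pool):
    """Reduce an L x D list of per-residue vectors to a single D-vector."""
    columns = list(zip(*per_residue))  # transpose: one tuple per embedding dimension
    if pool == "avg":
        return [sum(col) / len(col) for col in columns]
    if pool == "max":
        return [max(col) for col in columns]
    if pool == "sum":
        return [sum(col) for col in columns]
    raise ValueError(f"unknown pooling operation: {pool}")


# A toy "sequence" of length 3 with 2-dimensional residue embeddings
toy = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
pooled_avg = pool_embedding(toy, "avg")
pooled_max = pool_embedding(toy, "max")
pooled_sum = pool_embedding(toy, "sum")
print(pooled_avg, pooled_max, pooled_sum)
```

Whatever the sequence length, the pooled vector has the same dimension, which is why the `.h5` files for the three pooling operations have the same size.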


Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [6]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 2min 18s, sys: 1.49 s, total: 2min 20s
Wall time: 2min 36s
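
The body of `fu.convert_h5_to_pt` is not shown in this notebook. The following is a minimal sketch of how such a conversion could be written with `h5py` and `torch`, assuming the HDF5 file is flat (one dataset per fasta header). The function name `convert_h5_to_pt_sketch`, the record keys `label`, `pool`, and `embedding`, and the index-based file names are assumptions; they are only loosely modeled on the one-file-per-sequence layout used by ESM.

```python
import os

import h5py
import torch


def convert_h5_to_pt_sketch(path_h5, path_pt, pool):
    """Write one .pt file per HDF5 dataset; return the number of files written."""
    os.makedirs(path_pt, exist_ok=True)
    count = 0
    with h5py.File(path_h5, "r") as h5:
        for i, header in enumerate(h5.keys()):
            # Read the whole dataset into memory as a NumPy array
            vector = torch.from_numpy(h5[header][()])
            # Hypothetical record layout, keeping the header as a label
            record = {"label": header, "pool": pool, "embedding": vector}
            # Index-based names avoid characters that are invalid in file names
            torch.save(record, os.path.join(path_pt, f"{i}.pt"))
            count += 1
    return count
```

A converted file can then be read back with `torch.load`, which returns the saved dictionary.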


<br> 

- **Pooling Operation:  `max`**

Update arguments and prepare paths

In [7]:
# Update arguments
pool = 'max'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../../data/amp/all_data.fa 
 ../../data/amp/prose/all_data/amp_all_dlm_max.h5 
 ../../data/amp/prose/all_data/amp_all_dlm_max


<br>

Run the embedding script for: `prose - amp - all_data - dlm - max`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [8]:
%%time
# Run embedding script
%run ../../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE DLM model
# writing: ../../data/amp/prose/all_data/amp_all_dlm_max.h5
# embedding with pool=max
# 4038 sequences processed...

CPU times: user 54 s, sys: 6.73 s, total: 1min
Wall time: 1min 1s


Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [9]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 2min 20s, sys: 1.38 s, total: 2min 21s
Wall time: 2min 37s


<br> 

- **Pooling Operation:  `sum`**

Update arguments and prepare paths

In [10]:
# Update arguments
pool = 'sum'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../../data/amp/all_data.fa 
 ../../data/amp/prose/all_data/amp_all_dlm_sum.h5 
 ../../data/amp/prose/all_data/amp_all_dlm_sum


<br>

Run the embedding script for: `prose - amp - all_data - dlm - sum`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [11]:
%%time
# Run embedding script
%run ../../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE DLM model
# writing: ../../data/amp/prose/all_data/amp_all_dlm_sum.h5
# embedding with pool=sum

CPU times: user 54 s, sys: 5.95 s, total: 59.9 s
Wall time: 1min


Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [12]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 2min 18s, sys: 1.12 s, total: 2min 19s
Wall time: 2min 35s


<br>

### ProSE MT model - prose_mt

- **Pooling Operation:  `avg`**

Update arguments and prepare paths

In [13]:
# Update arguments
model = 'prose_mt'
pool = 'avg'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../../data/amp/all_data.fa 
 ../../data/amp/prose/all_data/amp_all_mt_avg.h5 
 ../../data/amp/prose/all_data/amp_all_mt_avg


<br>

Run the embedding script for: `prose - amp - all_data - mt - avg`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [14]:
%%time
# Run embedding script
%run ../../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE MT model
# writing: ../../data/amp/prose/all_data/amp_all_mt_avg.h5
# embedding with pool=avg

CPU times: user 53.1 s, sys: 25.6 s, total: 1min 18s
Wall time: 1min 19s


Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [15]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 3min 10s, sys: 2.2 s, total: 3min 12s
Wall time: 3min 26s


<br>

- **Pooling Operation:  `max`**

Update arguments and prepare paths

In [16]:
# Update arguments
pool = 'max'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../../data/amp/all_data.fa 
 ../../data/amp/prose/all_data/amp_all_mt_max.h5 
 ../../data/amp/prose/all_data/amp_all_mt_max


<br>

Run the embedding script for: `prose - amp - all_data - mt - max`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [17]:
%%time
# Run embedding script
%run ../../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE MT model
# writing: ../../data/amp/prose/all_data/amp_all_mt_max.h5
# embedding with pool=max
# 4038 sequences processed...

CPU times: user 55 s, sys: 8.85 s, total: 1min 3s
Wall time: 1min 4s


Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [18]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 2min 31s, sys: 1.47 s, total: 2min 32s
Wall time: 2min 47s


<br>

- **Pooling Operation:  `sum`**

Update arguments and prepare paths

In [19]:
# Update arguments
pool = 'sum'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../../data/amp/all_data.fa 
 ../../data/amp/prose/all_data/amp_all_mt_sum.h5 
 ../../data/amp/prose/all_data/amp_all_mt_sum


<br>  

Run the embedding script for: `prose - amp - all_data - mt - sum`.   
The script reads the fasta file and creates the h5 file with embeddings.

In [20]:
%%time
# Run embedding script
%run ../../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE MT model
# writing: ../../data/amp/prose/all_data/amp_all_mt_sum.h5
# embedding with pool=sum
# 4040 sequences processed...

CPU times: user 54.3 s, sys: 6.28 s, total: 1min
Wall time: 1min 1s


Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [21]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 2min 21s, sys: 1.56 s, total: 2min 22s
Wall time: 2min 38s


<br>

**Check the folders**

In [22]:
base = os.path.split(path_pt)[0]
!tree -nDhL 1 "{base}" --dirsfirst

../../data/amp/prose/all_data
├── [4.0K Oct  2 22:12]  amp_all_dlm_avg
├── [4.0K Oct  2 22:16]  amp_all_dlm_max
├── [4.0K Oct  2 22:19]  amp_all_dlm_sum
├── [4.0K Oct  2 22:24]  amp_all_mt_avg
├── [4.0K Oct  2 22:28]  amp_all_mt_max
├── [4.0K Oct  2 22:32]  amp_all_mt_sum
├── [ 97M Oct  2 22:09]  amp_all_dlm_avg.h5
├── [ 97M Oct  2 22:13]  amp_all_dlm_max.h5
├── [ 97M Oct  2 22:17]  amp_all_dlm_sum.h5
├── [ 97M Oct  2 22:21]  amp_all_mt_avg.h5
├── [ 97M Oct  2 22:25]  amp_all_mt_max.h5
└── [ 97M Oct  2 22:29]  amp_all_mt_sum.h5

6 directories, 6 files


Print the total size and number of pt files in each embedding folder

In [23]:
# Print the total size and number of pt files in each embedding folder
fu.emb_files_stats(path_pt)

amp_all_dlm_avg consumes: 98.11MB in 4042 files
amp_all_dlm_max consumes: 98.11MB in 4042 files
amp_all_dlm_sum consumes: 98.11MB in 4042 files
amp_all_mt_avg consumes: 98.11MB in 4042 files
amp_all_mt_max consumes: 98.11MB in 4042 files
amp_all_mt_sum consumes: 98.11MB in 4042 files
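
For readers without access to `file_utilities`, the summary above can be approximated with the standard library alone. The sketch below is a hypothetical stand-in for `fu.emb_files_stats`: it assumes the function looks at every folder next to `path_pt` and counts only `.pt` files, which matches the output format shown above but is not confirmed by the source.

```python
import os


def emb_files_stats_sketch(path_pt):
    """Summarise every sibling embedding folder of `path_pt`."""
    base = os.path.dirname(path_pt)
    lines = []
    for name in sorted(os.listdir(base)):
        folder = os.path.join(base, name)
        if not os.path.isdir(folder):
            continue  # skip the .h5 files stored next to the folders
        files = [f for f in os.listdir(folder) if f.endswith(".pt")]
        size = sum(os.path.getsize(os.path.join(folder, f)) for f in files)
        lines.append(f"{name} consumes: {size / 1024**2:.2f}MB in {len(files)} files")
    return lines
```

Calling `print('\n'.join(emb_files_stats_sketch(path_pt)))` would print one line per embedding folder.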
