# AMP - Antimicrobial Peptides

After cleaning our original csv files, we converted them to fasta format.
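
The conversion step can be sketched as follows. This is a minimal illustration, not the code used in this project: the function name `csv_to_fasta` and the column names `id` and `sequence` are assumptions and would need to match the cleaned csv files.

```python
import csv


def csv_to_fasta(csv_path, fasta_path, id_col="id", seq_col="sequence"):
    """Write one fasta record per csv row and return the number of records.

    The column names are hypothetical; adjust them to the cleaned csv files.
    """
    count = 0
    with open(csv_path, newline="") as src, open(fasta_path, "w") as dst:
        for row in csv.DictReader(src):
            seq = row[seq_col].strip().upper()
            if not seq:
                continue  # skip rows without a sequence
            # The header line becomes the key of the embedding later on
            dst.write(f">{row[id_col].strip()}\n{seq}\n")
            count += 1
    return count
```

Each record gets a header line starting with `>` followed by the sequence on the next line, so the header can later serve as the key of the embedding.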

These fasta files are the input to the [embedding script](https://github.com/tbepler/prose/blob/main/embed_sequences.py) developed by [Tristan Bepler](mailto:tbepler@gmail.com).  
The script writes embeddings out as an HDF5 file using the sequence headers as keys and embedding vectors as values.

In this project we use pretrained models from [Facebook's Evolutionary Scale Modeling (ESM)](https://github.com/facebookresearch/esm) and [Protein Sequence Embeddings (ProSE)](https://github.com/tbepler/prose). Both accept fasta-formatted files as input, but they write embeddings in different formats. ESM stores embeddings in PyTorch `.pt` files (generated by `torch.save`), one file per fasta sequence, while ProSE writes all embeddings to a single HDF5 file (`.h5`).

To store all embeddings in the same data format, we follow the format used by ESM.   
Hence, as the last step, we convert the HDF5 file generated by the ProSE embedding script to `.pt` files.


In [1]:
# Import dependencies
import os

Import file utilities

In [2]:
# Import the script from different folder
import sys  
sys.path.append('../scripts')

# Import file utilities as fu
import file_utilities as fu

Initialize arguments

In [3]:
# Define arguments for the file_paths function
# The first three (amp only) stay constant in this notebook
ptmodel = 'prose'
task = 'amp'
file_base = 'all_data'
# The last two arguments change throughout the notebook
model = 'prose_dlm'
pool = 'avg'
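
The `pool` argument controls how the per-residue embeddings of a sequence (one vector per amino acid) are reduced to a single fixed-length vector. The sketch below illustrates the three reductions used in this notebook on plain Python lists. The helper name `pool_embedding` is hypothetical; as far as we understand, the ProSE script performs the equivalent reduction on tensors.

```python
def pool_embedding(residue_vectors, pool="avg"):
    """Reduce a list of per-residue vectors (L x D) to one vector of length D."""
    if not residue_vectors:
        raise ValueError("need at least one residue vector")
    columns = list(zip(*residue_vectors))  # transpose to D columns of length L
    if pool == "avg":
        return [sum(col) / len(col) for col in columns]
    if pool == "max":
        return [max(col) for col in columns]
    if pool == "sum":
        return [sum(col) for col in columns]
    raise ValueError(f"unknown pooling operation: {pool}")


# Toy example: a 3-residue sequence with 2-dimensional embeddings
toy = [[1.0, 4.0], [2.0, 0.0], [3.0, 2.0]]
print(pool_embedding(toy, "avg"))  # [2.0, 2.0]
print(pool_embedding(toy, "max"))  # [3.0, 4.0]
print(pool_embedding(toy, "sum"))  # [6.0, 6.0]
```

Whatever the sequence length, every pooled vector has the same dimension, which is why the embeddings can be stored and compared uniformly.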


<br>

## all_data Dataset

### ProSE DLM model - prose_dlm

- **Pooling Operation:  `avg`**

Call the utility function `file_paths` to prepare the paths. The default root data folder is *../data*.

In [4]:
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../data/amp/all_data.fa 
 ../data/amp/prose/all_data/amp_all_dlm_avg.h5 
 ../data/amp/prose/all_data/amp_all_dlm_avg
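
The printed paths suggest how they are assembled from the arguments. The sketch below reproduces that layout. It is not the actual `file_utilities.file_paths` implementation: the name `sketch_file_paths`, the `root` parameter and the name-shortening rules (`all_data` to `all`, `prose_dlm` to `dlm`) are assumptions inferred from the output above.

```python
import os


def sketch_file_paths(ptmodel, task, file_base, model, pool, root="../data"):
    """Return (path_pt, path_h5, path_fa) following the layout printed above."""
    short_base = file_base.split("_")[0]   # 'all_data'  -> 'all' (assumed rule)
    short_model = model.split("_")[-1]     # 'prose_dlm' -> 'dlm' (assumed rule)
    stem = f"{task}_{short_base}_{short_model}_{pool}"
    emb_dir = os.path.join(root, task, ptmodel, file_base)
    path_fa = os.path.join(root, task, f"{file_base}.fa")
    path_pt = os.path.join(emb_dir, stem)          # folder for the .pt files
    path_h5 = os.path.join(emb_dir, f"{stem}.h5")  # HDF5 file written by ProSE
    return path_pt, path_h5, path_fa
```

Because only `model` and `pool` change between runs, each combination maps to its own `.h5` file and its own folder of `.pt` files under the same parent directory.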


<br>

Run the embedding script for: `prose - amp - all_data - dlm - avg`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [6]:
%%time
# Run embedding script
%run ../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE DLM model
# writing: ../data/amp/prose/all_data/amp_all_dlm_avg.h5
# embedding with pool=avg
                                                                                

Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [11]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 3min 1s, sys: 1.5 s, total: 3min 2s
Wall time: 3min 21s
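
For readers without access to `file_utilities`, the conversion can be sketched as below. This is a minimal illustration under stated assumptions, not the actual `convert_h5_to_pt`: the function name `sketch_convert_h5_to_pt`, the record layout (`label` and `embedding` keys) and the file naming are hypothetical, and the `pool` argument of the real utility is omitted because its role is not shown in this notebook. It requires `h5py` and `torch`.

```python
import os

import h5py
import torch


def sketch_convert_h5_to_pt(path_h5, path_pt):
    """Write one .pt file per top-level dataset in the HDF5 file; return the count."""
    os.makedirs(path_pt, exist_ok=True)
    count = 0
    with h5py.File(path_h5, "r") as h5:
        for key in h5.keys():
            item = h5[key]
            if not isinstance(item, h5py.Dataset):
                continue  # headers containing '/' become groups; not handled here
            emb = torch.from_numpy(item[()])  # read the full array into memory
            # Assumed record layout and file naming, chosen for illustration only
            record = {"label": key, "embedding": emb}
            torch.save(record, os.path.join(path_pt, f"{key}.pt"))
            count += 1
    return count
```

A converted file can then be read back with `torch.load`, which returns the dictionary saved for that sequence.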


<br> 

- **Pooling Operation:  `max`**

Update arguments and prepare paths

In [6]:
# Update arguments
pool = 'max'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../data/amp/all_data.fa 
 ../data/amp/prose/all_data/amp_all_dlm_max.h5 
 ../data/amp/prose/all_data/amp_all_dlm_max


<br>

Run the embedding script for: `prose - amp - all_data - dlm - max`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [20]:
%%time
# Run embedding script
%run ../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE DLM model
# writing: ../data/amp/prose/all_data/amp_all_dlm_max.h5
# embedding with pool=max
# 4039 sequences processed...

CPU times: user 54.4 s, sys: 6.03 s, total: 1min
Wall time: 1min 1s


                                                                                

Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [7]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 2min 58s, sys: 2.13 s, total: 3min
Wall time: 3min 16s


<br> 

- **Pooling Operation:  `sum`**

Update arguments and prepare paths

In [8]:
# Update arguments
pool = 'sum'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../data/amp/all_data.fa 
 ../data/amp/prose/all_data/amp_all_dlm_sum.h5 
 ../data/amp/prose/all_data/amp_all_dlm_sum


<br>

Run the embedding script for: `prose - amp - all_data - dlm - sum`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [9]:
%%time
# Run embedding script
%run ../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE DLM model
# writing: ../data/amp/prose/all_data/amp_all_dlm_sum.h5
# embedding with pool=sum
# 4041 sequences processed...

CPU times: user 57.8 s, sys: 8.06 s, total: 1min 5s
Wall time: 1min 7s


                                                                                

Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [10]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 2min 13s, sys: 1.78 s, total: 2min 15s
Wall time: 2min 30s


<br>

### ProSE MT model - prose_mt

- **Pooling Operation:  `avg`**

Update arguments and prepare paths

In [4]:
# Update arguments
model = 'prose_mt'
pool = 'avg'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../data/amp/all_data.fa 
 ../data/amp/prose/all_data/amp_all_mt_avg.h5 
 ../data/amp/prose/all_data/amp_all_mt_avg


<br>

Run the embedding script for: `prose - amp - all_data - mt - avg`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [5]:
%%time
# Run embedding script
%run ../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE MT model
# writing: ../data/amp/prose/all_data/amp_all_mt_avg.h5
# embedding with pool=avg
# 4038 sequences processed...

CPU times: user 56.5 s, sys: 9.3 s, total: 1min 5s
Wall time: 1min 7s


                                                                                

Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [6]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 2min 21s, sys: 1.62 s, total: 2min 23s
Wall time: 2min 38s


<br>

- **Pooling Operation:  `max`**

Update arguments and prepare paths

In [7]:
# Update arguments
pool = 'max'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../data/amp/all_data.fa 
 ../data/amp/prose/all_data/amp_all_mt_max.h5 
 ../data/amp/prose/all_data/amp_all_mt_max


<br>

Run the embedding script for: `prose - amp - all_data - mt - max`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [8]:
%%time
# Run embedding script
%run ../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE MT model
# writing: ../data/amp/prose/all_data/amp_all_mt_max.h5
# embedding with pool=max
                                                                                

CPU times: user 54.3 s, sys: 7.49 s, total: 1min 1s
Wall time: 1min 2s


Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [9]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 2min 15s, sys: 2.14 s, total: 2min 17s
Wall time: 2min 32s


<br>

- **Pooling Operation:  `sum`**

Update arguments and prepare paths

In [10]:
# Update arguments
pool = 'sum'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../data/amp/all_data.fa 
 ../data/amp/prose/all_data/amp_all_mt_sum.h5 
 ../data/amp/prose/all_data/amp_all_mt_sum


<br>  

Run the embedding script for: `prose - amp - all_data - mt - sum`.   
The script reads the fasta file and creates the h5 file with embeddings.

In [11]:
%%time
# Run embedding script
%run ../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE MT model
# writing: ../data/amp/prose/all_data/amp_all_mt_sum.h5
# embedding with pool=sum
# 4038 sequences processed...

CPU times: user 54.1 s, sys: 6.08 s, total: 1min
Wall time: 1min 1s


                                                                                

Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [12]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 2min 12s, sys: 1.3 s, total: 2min 14s
Wall time: 2min 29s


<br>

**Check the folders**

In [5]:
base = os.path.split(path_pt)[0]
!tree -nDhL 1 "{base}" --dirsfirst

../data/amp/prose/all_data
├── [4.0K Sep  4 17:39]  amp_all_dlm_avg
├── [4.0K Sep  4 22:27]  amp_all_dlm_max
├── [4.0K Sep  4 22:38]  amp_all_dlm_sum
├── [4.0K Sep  4 22:56]  amp_all_mt_avg
├── [4.0K Sep  4 23:05]  amp_all_mt_max
├── [4.0K Sep  4 23:16]  amp_all_mt_sum
├── [ 97M Sep  4 17:21]  amp_all_dlm_avg.h5
├── [ 97M Sep  4 22:04]  amp_all_dlm_max.h5
├── [ 97M Sep  4 22:35]  amp_all_dlm_sum.h5
├── [ 97M Sep  4 22:53]  amp_all_mt_avg.h5
├── [ 97M Sep  4 23:01]  amp_all_mt_max.h5
└── [ 97M Sep  4 23:11]  amp_all_mt_sum.h5

6 directories, 6 files


Print the total size and number of pt files in each embedding folder

In [6]:
# Print the total size and number of pt files in each embedding folder
fu.emb_files_stats(path_pt)

amp_all_dlm_avg consumes: 98.11MB in 4042 files
amp_all_dlm_max consumes: 98.11MB in 4042 files
amp_all_dlm_sum consumes: 98.11MB in 4042 files
amp_all_mt_avg consumes: 98.11MB in 4042 files
amp_all_mt_max consumes: 98.11MB in 4042 files
amp_all_mt_sum consumes: 98.11MB in 4042 files
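
A report like the one above can be reproduced with the standard library alone. The sketch below is not the actual `emb_files_stats`: the names `sketch_emb_files_stats` and `report` are hypothetical, the function takes the parent folder rather than `path_pt`, and the megabyte definition (10^6 bytes) is an assumption. It also checks that every embedding folder holds the same number of files.

```python
import os


def sketch_emb_files_stats(base_dir):
    """Return {folder: (n_pt_files, total_bytes)} for each subfolder of base_dir."""
    stats = {}
    for entry in sorted(os.scandir(base_dir), key=lambda e: e.name):
        if not entry.is_dir():
            continue  # skip the .h5 files that sit next to the folders
        pt_files = [f for f in os.scandir(entry.path)
                    if f.is_file() and f.name.endswith(".pt")]
        stats[entry.name] = (len(pt_files), sum(f.stat().st_size for f in pt_files))
    return stats


def report(stats):
    """Print one line per folder and return True if all file counts agree."""
    for name, (n_files, n_bytes) in stats.items():
        print(f"{name} consumes: {n_bytes / 1e6:.2f}MB in {n_files} files")
    return len({n for n, _ in stats.values()}) <= 1
```

In this notebook it could be called as `report(sketch_emb_files_stats(os.path.split(path_pt)[0]))`; a return value of `False` would indicate that at least one conversion produced a different number of files.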
