## AMP - Antimicrobial Peptides

In [1]:
# Import dependencies
import torch
import h5py
# import esm
import os
import numpy as np

In [2]:
# Import the helper script from a different folder
import sys  
sys.path.append('../scripts')

import prepare_paths as pp

In [3]:
# Define arguments for the file_paths function
# First three (amp only) are constant in this notebook
ptmodel = 'prose'
task = 'amp'
file_base = 'all_data'
# The last two arguments change throughout the notebook
model = 'prose_dlm'
pool = 'avg'  


### all_data Dataset

### ProSE DLM model - prose_dlm

**Pooling Operation:  avg**

Call the `file_paths` function from `prepare_paths` to build the input and output paths. The default root data folder is *../data*.

In [4]:
# Prepare paths
path_pt, path_h5, path_fa = pp.file_paths(ptmodel, task, file_base, model, pool)

In [5]:
print(path_fa, '\n', path_h5, '\n', path_pt)

../data/amp/all_data.fa 
 ../data/amp/prose/all_data/amp_all_dlm_avg.h5 
 ../data/amp/prose/all_data/amp_all_dlm_avg
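
The printed paths suggest a simple naming scheme. The sketch below reproduces that layout from the five arguments. It is inferred only from the output above and is not the actual `prepare_paths` implementation: the function name `sketch_file_paths`, the `root` parameter, and the rules for shortening `file_base` and `model` are assumptions.

```python
import posixpath


def sketch_file_paths(ptmodel, task, file_base, model, pool, root='../data'):
    """Hypothetical reconstruction of the path layout printed above."""
    # Assumption: 'all_data' is shortened to 'all', 'prose_dlm' to 'dlm'
    short_base = file_base.split('_')[0]
    short_model = model.split('_')[-1]
    stem = f'{task}_{short_base}_{short_model}_{pool}'

    # FASTA input lives directly under the task folder
    path_fa = posixpath.join(root, task, f'{file_base}.fa')
    # Embedding outputs live under <task>/<ptmodel>/<file_base>/
    path_pt = posixpath.join(root, task, ptmodel, file_base, stem)
    path_h5 = f'{path_pt}.h5'
    return path_pt, path_h5, path_fa


path_pt_s, path_h5_s, path_fa_s = sketch_file_paths(
    'prose', 'amp', 'all_data', 'prose_dlm', 'avg')
print(path_fa_s, '\n', path_h5_s, '\n', path_pt_s)
```

With this scheme, changing only `pool` changes the h5 file name and the pt folder name, while the FASTA path stays the same.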


Run the embedding script for: `prose - amp - all_data - dlm - avg`.  
The script reads the FASTA file and creates the h5 file with embeddings.

In [6]:
%%time
# Run embedding script
%run ../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE DLM model
# writing: ../data/amp/prose/all_data/amp_all_dlm_avg.h5
# embedding with pool=avg
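
The `pool` option collapses the per-residue embeddings of a sequence into one fixed-length vector. As a rough illustration, assuming a per-residue matrix of shape `(sequence_length, embedding_dim)`, the two pooling operations used in this notebook can be sketched with NumPy. The helper `pool_embedding` and the toy matrix are hypothetical and are not taken from the ProSE code.

```python
import numpy as np


def pool_embedding(z, pool):
    """Collapse a (length, dim) matrix into a single (dim,) vector."""
    if pool == 'avg':
        return z.mean(axis=0)  # average over residues
    if pool == 'max':
        return z.max(axis=0)   # element-wise maximum over residues
    raise ValueError(f'unsupported pool: {pool}')


# Toy example: 3 residues, 2 embedding dimensions
z = np.array([[1.0, 4.0],
              [3.0, 2.0],
              [5.0, 0.0]])

avg_vec = pool_embedding(z, 'avg')  # array([3., 2.])
max_vec = pool_embedding(z, 'max')  # array([5., 4.])
print(avg_vec, max_vec)
```

Either way, every sequence ends up with a vector of the same size regardless of its length, which is what lets each embedding be stored as a single entry in the h5 file.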
                                                                                

The helper function `convert_h5_to_pt` converts the h5 file to a set of pt files, one for each sequence embedding.

In [10]:
# Converts the h5 file to pt files, one per sequence embedding.
def convert_h5_to_pt(path_h5, path_pt, pool):
    os.makedirs(path_pt, exist_ok=True)
    with h5py.File(path_h5, 'r') as hf:
        for key in hf.keys():
            # Read the whole dataset into a NumPy array before building the tensor
            t = torch.tensor(hf[key][()])
            # One dict per sequence: its label and its pooled embedding
            dd = {'label': key, f'{pool}_representations': {'layer': t}}
            torch.save(dd, f'{os.path.join(path_pt, key)}.pt')

Read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [11]:
%%time
# Convert h5 file to pt files
convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 3min 1s, sys: 1.5 s, total: 3min 2s
Wall time: 3min 21s
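
To check the result, a saved file can be loaded back and its contents inspected. The sketch below is self-contained: it writes one record with the same dictionary layout as `convert_h5_to_pt` to a temporary folder and reloads it. The label `seq_0` and the 4-dimensional zero vector are placeholders, not real data from this notebook.

```python
import os
import tempfile

import torch

pool = 'avg'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'seq_0.pt')

    # Same layout as the records written by convert_h5_to_pt
    torch.save({'label': 'seq_0',
                f'{pool}_representations': {'layer': torch.zeros(4)}},
               path)

    # Load the record back and pull out the pooled embedding
    record = torch.load(path)
    vec = record[f'{pool}_representations']['layer']

print(record['label'], tuple(vec.shape))
```

For a real file, replace `path` with `os.path.join(path_pt, key) + '.pt'` for one of the keys in the h5 file.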


**Pooling Operation:  max**

In [None]:
# Prepare paths
pool = 'max'
path_pt, path_h5, path_fa = pp.file_paths(ptmodel, task, file_base, model, pool)

Run the embedding script for: `prose - amp - all_data - dlm - max`.  
The script reads the FASTA file and creates the h5 file with embeddings.

In [None]:
# Run embedding script
%run ../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"