# DBP - DNA-Binding Proteins

After cleaning our original CSV files, we converted them to FASTA format.
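
The conversion code itself is not part of this notebook. A minimal sketch of such a step is shown below; the column names `id` and `sequence` and the helper `csv_to_fasta` are assumptions for illustration, and the real CSV layout may differ.

```python
import csv
import io
import textwrap


def csv_to_fasta(csv_text, id_col="id", seq_col="sequence", width=60):
    """Convert CSV text to FASTA text (column names are assumptions)."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        seq = row[seq_col].strip().upper()
        if not seq:
            continue  # skip rows without a sequence
        header = ">" + row[id_col].strip()
        # Wrap the sequence to a fixed line width, as is common in FASTA files
        records.append("\n".join([header] + textwrap.wrap(seq, width)))
    return "\n".join(records) + "\n"


example_csv = "id,sequence,label\nP1,MKTAYIAKQR,1\nP2,GSHMLEDPVA,0\n"
fasta_text = csv_to_fasta(example_csv)
print(fasta_text)
```

Each CSV row becomes one FASTA record whose header is the identifier, which is what later ends up as the key of the embedding.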

These FASTA files are the input to the [embedding script](https://github.com/tbepler/prose/blob/main/embed_sequences.py) developed by [Tristan Bepler](mailto:tbepler@gmail.com).  
The script writes the embeddings to an HDF5 file, using the sequence headers as keys and the embedding vectors as values.

In this project we use pretrained models from [Facebook's Evolutionary Scale Modeling (ESM)](https://github.com/facebookresearch/esm) and [Protein Sequence Embeddings (ProSE)](https://github.com/tbepler/prose). Both accept FASTA-formatted files as input, but they write embeddings in different formats. ESM stores embeddings in PyTorch `.pt` files (generated by `torch.save`), one file per FASTA sequence, while ProSE writes all embeddings to a single HDF5 file (`.h5`).

To store all embeddings in the same data format, we follow the format used by ESM.  
Hence, as the last step, we convert the HDF5 file generated by the ProSE embedding script to `.pt` files.
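
The conversion is done later by `fu.convert_h5_to_pt`, whose source is not shown in this notebook. The following is only a sketch of what such a conversion might look like: the function name `h5_to_pt_sketch` and the record layout (a dict with `label` plus the vector stored under the pooling name) are assumptions, not the actual implementation or the exact ESM layout.

```python
import os
import tempfile

import h5py
import numpy as np
import torch


def h5_to_pt_sketch(path_h5, path_pt, pool):
    """Write one .pt file per HDF5 dataset (hypothetical record layout)."""
    os.makedirs(path_pt, exist_ok=True)
    count = 0
    with h5py.File(path_h5, "r") as h5:
        for key in h5.keys():
            # Each dataset is assumed to hold one pooled embedding vector
            vector = torch.from_numpy(np.asarray(h5[key][()], dtype=np.float32))
            # Assumed layout: sequence header plus the vector under the pool name
            record = {"label": key, pool: vector}
            torch.save(record, os.path.join(path_pt, f"{key}.pt"))
            count += 1
    return count


# Small self-contained demonstration on a temporary HDF5 file
with tempfile.TemporaryDirectory() as tmp:
    demo_h5 = os.path.join(tmp, "demo_avg.h5")
    with h5py.File(demo_h5, "w") as h5:
        h5.create_dataset("seq1", data=np.ones(4, dtype=np.float32))
        h5.create_dataset("seq2", data=np.zeros(4, dtype=np.float32))
    n_written = h5_to_pt_sketch(demo_h5, os.path.join(tmp, "demo_avg"), "avg")
    loaded = torch.load(os.path.join(tmp, "demo_avg", "seq1.pt"))
```

Sequence headers that contain characters that are not valid in file names would need sanitizing first; that is left out of the sketch.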


In [1]:
# Import dependencies
import torch
import h5py
import os
import numpy as np

Import file utilities

In [2]:
# Make the scripts folder importable
import sys
sys.path.append('../scripts')

# Import the file utilities as fu
import file_utilities as fu

Initialize arguments

In [3]:
# Define arguments for the file_paths function
# First two are constant in this notebook
ptmodel = 'prose'
task = 'dbp'
# The last three arguments change throughout the notebook
file_base = 'train'
model = 'prose_dlm'
pool = 'avg'  
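
The sections below update `file_base`, `model`, and `pool` by hand. As a sketch, the full set of runs covered by this notebook can be enumerated with `itertools.product`; the loop variable names are chosen so that they do not overwrite the arguments defined above.

```python
from itertools import product

file_bases = ["train", "test"]
models = ["prose_dlm", "prose_mt"]
pools = ["avg", "max", "sum"]

# Same order as the sections below: dataset, then model, then pooling
runs = list(product(file_bases, models, pools))
for run_base, run_model, run_pool in runs:
    print(run_base, run_model, run_pool)
```

This gives 2 datasets x 2 models x 3 pooling operations, that is, 12 embedding runs in total.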


<br>

## Train Dataset

### ProSE DLM model - prose_dlm

- **Pooling Operation:  `avg`**

Call the utility function `file_paths` to prepare the paths. The default root data folder is *../data*.

In [4]:
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../data/dna_binding/train_prose.fa 
 ../data/dna_binding/prose/train/dbp_train_dlm_avg.h5 
 ../data/dna_binding/prose/train/dbp_train_dlm_avg
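
The source of `file_paths` is not included here. The sketch below reproduces the path layout printed above; the function name `file_paths_sketch`, the `TASK_FOLDERS` mapping, and the prefix-stripping rule are inferred from that output and are assumptions, not the actual `file_utilities` code.

```python
import os

# Inferred from the printed paths; the real mapping lives in file_utilities
TASK_FOLDERS = {"dbp": "dna_binding"}


def file_paths_sketch(ptmodel, task, file_base, model, pool, root="../data"):
    """Return (path_pt, path_h5, path_fa) in the layout printed above."""
    task_dir = os.path.join(root, TASK_FOLDERS[task])
    # 'prose_dlm' -> 'dlm': drop the pretrained-model prefix if present
    prefix = ptmodel + "_"
    short_model = model[len(prefix):] if model.startswith(prefix) else model
    path_fa = os.path.join(task_dir, f"{file_base}_{ptmodel}.fa")
    stem = f"{task}_{file_base}_{short_model}_{pool}"
    path_pt = os.path.join(task_dir, ptmodel, file_base, stem)
    path_h5 = path_pt + ".h5"
    return path_pt, path_h5, path_fa


print(file_paths_sketch("prose", "dbp", "train", "prose_dlm", "avg"))
```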


<br>

Run the embedding script for: `prose - dbp - train - dlm - avg`.  
The script reads the FASTA file and creates the HDF5 file with the embeddings.

In [5]:
%%time
# Run embedding script
%run ../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE DLM model
# writing: ../data/dna_binding/prose/train/dbp_train_dlm_avg.h5
# embedding with pool=avg
# 14015 sequences processed...

CPU times: user 27min 36s, sys: 10min 9s, total: 37min 45s
Wall time: 39min 35s
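
The `--pool` option reduces the per-residue embedding matrix (sequence length x embedding dimension) to one fixed-size vector per sequence. The sketch below illustrates the three reductions used in this notebook with NumPy on toy values; it shows the idea only and is not the code of the ProSE script.

```python
import numpy as np

# Toy per-residue embeddings: 3 residues, 2 embedding dimensions
z = np.array([[1.0, 4.0],
              [2.0, 0.0],
              [3.0, 2.0]])

pooled = {
    "avg": z.mean(axis=0),  # mean over residues
    "max": z.max(axis=0),   # element-wise maximum over residues
    "sum": z.sum(axis=0),   # sum over residues
}
for name, vec in pooled.items():
    print(name, vec)
```

Whatever the sequence length, each pooled vector has the embedding dimension as its length, which is why all output files for a dataset have the same size.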


                                                                                

Use the utility function `convert_h5_to_pt` to read the HDF5 file and create one `.pt` file per embedding in the `path_pt` folder.

In [6]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 8min 54s, sys: 6.32 s, total: 9min
Wall time: 9min 54s


<br>

- **Pooling Operation:  `max`**

Update arguments and prepare paths

In [7]:
# Update arguments
pool = 'max'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../data/dna_binding/train_prose.fa 
 ../data/dna_binding/prose/train/dbp_train_dlm_max.h5 
 ../data/dna_binding/prose/train/dbp_train_dlm_max


<br>

Run the embedding script for: `prose - dbp - train - dlm - max`.  
The script reads the FASTA file and creates the HDF5 file with the embeddings.

In [8]:
%%time
# Run embedding script
%run ../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE DLM model
# writing: ../data/dna_binding/prose/train/dbp_train_dlm_max.h5
# embedding with pool=max
# 14015 sequences processed...

CPU times: user 28min 19s, sys: 10min 10s, total: 38min 30s
Wall time: 40min 17s


                                                                                

Use the utility function `convert_h5_to_pt` to read the HDF5 file and create one `.pt` file per embedding in the `path_pt` folder.

In [9]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 8min 52s, sys: 6.88 s, total: 8min 58s
Wall time: 9min 53s


<br>

- **Pooling Operation:  `sum`**

Update arguments and prepare paths

In [10]:
# Update arguments
pool = 'sum'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../data/dna_binding/train_prose.fa 
 ../data/dna_binding/prose/train/dbp_train_dlm_sum.h5 
 ../data/dna_binding/prose/train/dbp_train_dlm_sum


<br>

Run the embedding script for: `prose - dbp - train - dlm - sum`.  
The script reads the FASTA file and creates the HDF5 file with the embeddings.

In [11]:
%%time
# Run embedding script
%run ../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE DLM model
# writing: ../data/dna_binding/prose/train/dbp_train_dlm_sum.h5
# embedding with pool=sum
# 14015 sequences processed...

CPU times: user 27min 10s, sys: 10min 32s, total: 37min 43s
Wall time: 39min 29s


                                                                                

Use the utility function `convert_h5_to_pt` to read the HDF5 file and create one `.pt` file per embedding in the `path_pt` folder.

In [12]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 9min 28s, sys: 7.07 s, total: 9min 36s
Wall time: 10min 27s


<br>

### ProSE MT model - prose_mt

- **Pooling Operation:  `avg`**

Update arguments and prepare paths

In [13]:
# Update arguments
model = 'prose_mt'
pool = 'avg'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../data/dna_binding/train_prose.fa 
 ../data/dna_binding/prose/train/dbp_train_mt_avg.h5 
 ../data/dna_binding/prose/train/dbp_train_mt_avg


<br>

Run the embedding script for: `prose - dbp - train - mt - avg`.  
The script reads the FASTA file and creates the HDF5 file with the embeddings.

In [14]:
%%time
# Run embedding script
%run ../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE MT model
# writing: ../data/dna_binding/prose/train/dbp_train_mt_avg.h5
# embedding with pool=avg
# 14015 sequences processed...

CPU times: user 26min 57s, sys: 11min 9s, total: 38min 6s
Wall time: 40min 6s


                                                                                

Use the utility function `convert_h5_to_pt` to read the HDF5 file and create one `.pt` file per embedding in the `path_pt` folder.

In [15]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 9min 15s, sys: 7.35 s, total: 9min 23s
Wall time: 10min 13s


<br>

- **Pooling Operation:  `max`**

Update arguments and prepare paths

In [16]:
# Update arguments
pool = 'max'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../data/dna_binding/train_prose.fa 
 ../data/dna_binding/prose/train/dbp_train_mt_max.h5 
 ../data/dna_binding/prose/train/dbp_train_mt_max


<br>

Run the embedding script for: `prose - dbp - train - mt - max`.  
The script reads the FASTA file and creates the HDF5 file with the embeddings.

In [17]:
%%time
# Run embedding script
%run ../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE MT model
# writing: ../data/dna_binding/prose/train/dbp_train_mt_max.h5
# embedding with pool=max
# 14015 sequences processed...

CPU times: user 27min 28s, sys: 9min 45s, total: 37min 14s
Wall time: 39min 23s


                                                                                

Use the utility function `convert_h5_to_pt` to read the HDF5 file and create one `.pt` file per embedding in the `path_pt` folder.

In [18]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 8min 36s, sys: 7.39 s, total: 8min 43s
Wall time: 9min 39s


<br>

- **Pooling Operation:  `sum`**

Update arguments and prepare paths

In [19]:
# Update arguments
pool = 'sum'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../data/dna_binding/train_prose.fa 
 ../data/dna_binding/prose/train/dbp_train_mt_sum.h5 
 ../data/dna_binding/prose/train/dbp_train_mt_sum


<br>

Run the embedding script for: `prose - dbp - train - mt - sum`.  
The script reads the FASTA file and creates the HDF5 file with the embeddings.

In [20]:
%%time
# Run embedding script
%run ../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE MT model
# writing: ../data/dna_binding/prose/train/dbp_train_mt_sum.h5
# embedding with pool=sum
                                                                                

CPU times: user 27min 37s, sys: 9min 56s, total: 37min 33s
Wall time: 39min 20s


Use the utility function `convert_h5_to_pt` to read the HDF5 file and create one `.pt` file per embedding in the `path_pt` folder.

In [21]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 8min 56s, sys: 7.92 s, total: 9min 4s
Wall time: 10min 3s


<br>

**Check the folders**

In [5]:
base = os.path.split(path_pt)[0]
!tree -nDhL 1 "{base}" --dirsfirst

../data/dna_binding/prose/train
├── [4.0K Sep  5 17:52]  dbp_train_dlm_avg
├── [4.0K Sep  5 19:49]  dbp_train_dlm_max
├── [4.0K Sep  5 20:53]  dbp_train_dlm_sum
├── [4.0K Sep  5 21:52]  dbp_train_mt_avg
├── [4.0K Sep  5 22:49]  dbp_train_mt_max
├── [4.0K Sep  6 05:28]  dbp_train_mt_sum
├── [335M Sep  5 17:40]  dbp_train_dlm_avg.h5
├── [335M Sep  5 19:16]  dbp_train_dlm_max.h5
├── [335M Sep  5 20:36]  dbp_train_dlm_sum.h5
├── [335M Sep  5 21:37]  dbp_train_mt_avg.h5
├── [335M Sep  5 22:36]  dbp_train_mt_max.h5
└── [335M Sep  5 23:39]  dbp_train_mt_sum.h5

6 directories, 6 files


Print the total size and number of pt files in each embedding folder

In [23]:
# Print the total size and number of pt files in each embedding folder
fu.emb_files_stats(path_pt)

dbp_train_dlm_avg consumes: 340.2MB in 14016 files
dbp_train_dlm_max consumes: 340.2MB in 14016 files
dbp_train_dlm_sum consumes: 340.2MB in 14016 files
dbp_train_mt_avg consumes: 340.2MB in 14016 files
dbp_train_mt_max consumes: 340.2MB in 14016 files
dbp_train_mt_sum consumes: 340.2MB in 14016 files
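
The implementation of `emb_files_stats` is not shown in this notebook. A minimal standard-library sketch of how such per-folder totals could be computed is given below; the helper name `folder_stats` is hypothetical, and the demonstration runs on a temporary folder rather than on the real embedding folders.

```python
import os
import tempfile


def folder_stats(folder):
    """Return (total size in bytes, number of regular files) directly inside folder."""
    total, count = 0, 0
    with os.scandir(folder) as entries:
        for entry in entries:
            if entry.is_file():
                total += entry.stat().st_size
                count += 1
    return total, count


# Demonstration on a temporary folder with two 1 KiB files
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.pt", "b.pt"):
        with open(os.path.join(tmp, name), "wb") as fh:
            fh.write(b"\0" * 1024)
    demo_total, demo_count = folder_stats(tmp)

print(f"demo folder consumes: {demo_total / 1e6:.4f}MB in {demo_count} files")
```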


<br>

## Test Dataset

### ProSE DLM model - prose_dlm

- **Pooling Operation:  `avg`**

Update arguments and prepare paths

In [6]:
# Update arguments
file_base = 'test'
model = 'prose_dlm'
pool = 'avg'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../data/dna_binding/test_prose.fa 
 ../data/dna_binding/prose/test/dbp_test_dlm_avg.h5 
 ../data/dna_binding/prose/test/dbp_test_dlm_avg


<br>

Run the embedding script for: `prose - dbp - test - dlm - avg`.  
The script reads the FASTA file and creates the HDF5 file with the embeddings.

In [25]:
%%time
# Run embedding script
%run ../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE DLM model
# writing: ../data/dna_binding/prose/test/dbp_test_dlm_avg.h5
# embedding with pool=avg
# 2271 sequences processed...

CPU times: user 4min 41s, sys: 1min 50s, total: 6min 32s
Wall time: 6min 51s


                                                                                

Use the utility function `convert_h5_to_pt` to read the HDF5 file and create one `.pt` file per embedding in the `path_pt` folder.

In [26]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 1min 24s, sys: 963 ms, total: 1min 25s
Wall time: 1min 34s


<br>

- **Pooling Operation:  `max`**

Update arguments and prepare paths

In [27]:
# Update arguments
pool = 'max'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../data/dna_binding/test_prose.fa 
 ../data/dna_binding/prose/test/dbp_test_dlm_max.h5 
 ../data/dna_binding/prose/test/dbp_test_dlm_max


<br>

Run the embedding script for: `prose - dbp - test - dlm - max`.  
The script reads the FASTA file and creates the HDF5 file with the embeddings.

In [28]:
%%time
# Run embedding script
%run ../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE DLM model
# writing: ../data/dna_binding/prose/test/dbp_test_dlm_max.h5
# embedding with pool=max
# 2271 sequences processed...

CPU times: user 4min 39s, sys: 1min 50s, total: 6min 29s
Wall time: 6min 45s


                                                                                

Use the utility function `convert_h5_to_pt` to read the HDF5 file and create one `.pt` file per embedding in the `path_pt` folder.

In [29]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 1min 26s, sys: 1.22 s, total: 1min 27s
Wall time: 1min 36s


<br>

- **Pooling Operation:  `sum`**

Update arguments and prepare paths

In [30]:
# Update arguments
pool = 'sum'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../data/dna_binding/test_prose.fa 
 ../data/dna_binding/prose/test/dbp_test_dlm_sum.h5 
 ../data/dna_binding/prose/test/dbp_test_dlm_sum


<br>

Run the embedding script for: `prose - dbp - test - dlm - sum`.  
The script reads the FASTA file and creates the HDF5 file with the embeddings.

In [31]:
%%time
# Run embedding script
%run ../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE DLM model
# writing: ../data/dna_binding/prose/test/dbp_test_dlm_sum.h5
# embedding with pool=sum
# 2271 sequences processed...

CPU times: user 4min 40s, sys: 1min 53s, total: 6min 33s
Wall time: 6min 49s


                                                                                

Use the utility function `convert_h5_to_pt` to read the HDF5 file and create one `.pt` file per embedding in the `path_pt` folder.

In [32]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 1min 25s, sys: 1 s, total: 1min 26s
Wall time: 1min 35s


<br>

### ProSE MT model - prose_mt

- **Pooling Operation:  `avg`**

Update arguments and prepare paths

In [33]:
# Update arguments
model = 'prose_mt'
pool = 'avg'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../data/dna_binding/test_prose.fa 
 ../data/dna_binding/prose/test/dbp_test_mt_avg.h5 
 ../data/dna_binding/prose/test/dbp_test_mt_avg


<br>

Run the embedding script for: `prose - dbp - test - mt - avg`.  
The script reads the FASTA file and creates the HDF5 file with the embeddings.

In [34]:
%%time
# Run embedding script
%run ../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE MT model
# writing: ../data/dna_binding/prose/test/dbp_test_mt_avg.h5
# embedding with pool=avg
# 2271 sequences processed...

CPU times: user 4min 40s, sys: 1min 48s, total: 6min 29s
Wall time: 6min 46s


                                                                                

Use the utility function `convert_h5_to_pt` to read the HDF5 file and create one `.pt` file per embedding in the `path_pt` folder.

In [35]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 1min 31s, sys: 851 ms, total: 1min 32s
Wall time: 1min 41s


<br>

- **Pooling Operation:  `max`**

Update arguments and prepare paths

In [36]:
# Update arguments
pool = 'max'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../data/dna_binding/test_prose.fa 
 ../data/dna_binding/prose/test/dbp_test_mt_max.h5 
 ../data/dna_binding/prose/test/dbp_test_mt_max


<br>

Run the embedding script for: `prose - dbp - test - mt - max`.  
The script reads the FASTA file and creates the HDF5 file with the embeddings.

In [37]:
%%time
# Run embedding script
%run ../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE MT model
# writing: ../data/dna_binding/prose/test/dbp_test_mt_max.h5
# embedding with pool=max
# 2271 sequences processed...

CPU times: user 4min 37s, sys: 1min 53s, total: 6min 31s
Wall time: 6min 50s


                                                                                

Use the utility function `convert_h5_to_pt` to read the HDF5 file and create one `.pt` file per embedding in the `path_pt` folder.

In [38]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 1min 28s, sys: 1.01 s, total: 1min 29s
Wall time: 1min 38s


<br>

- **Pooling Operation:  `sum`**

Update arguments and prepare paths

In [39]:
# Update arguments
pool = 'sum'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../data/dna_binding/test_prose.fa 
 ../data/dna_binding/prose/test/dbp_test_mt_sum.h5 
 ../data/dna_binding/prose/test/dbp_test_mt_sum


<br>

Run the embedding script for: `prose - dbp - test - mt - sum`.  
The script reads the FASTA file and creates the HDF5 file with the embeddings.

In [40]:
%%time
# Run embedding script
%run ../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE MT model
# writing: ../data/dna_binding/prose/test/dbp_test_mt_sum.h5
# embedding with pool=sum
# 2271 sequences processed...

CPU times: user 4min 38s, sys: 1min 57s, total: 6min 36s
Wall time: 6min 47s


                                                                                

Use the utility function `convert_h5_to_pt` to read the HDF5 file and create one `.pt` file per embedding in the `path_pt` folder.

In [41]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 1min 36s, sys: 1.17 s, total: 1min 37s
Wall time: 1min 47s


<br>

**Check the folders**

In [7]:
base = os.path.split(path_pt)[0]
!tree -nDhL 1 "{base}" --dirsfirst

../data/dna_binding/prose/test
├── [4.0K Sep  6 10:41]  dbp_test_dlm_avg
├── [4.0K Sep  6 10:55]  dbp_test_dlm_max
├── [4.0K Sep  6 11:10]  dbp_test_dlm_sum
├── [4.0K Sep  6 11:33]  dbp_test_mt_avg
├── [4.0K Sep  6 13:32]  dbp_test_mt_max
├── [4.0K Sep  6 13:59]  dbp_test_mt_sum
├── [ 54M Sep  6 10:39]  dbp_test_dlm_avg.h5
├── [ 54M Sep  6 10:52]  dbp_test_dlm_max.h5
├── [ 54M Sep  6 11:04]  dbp_test_dlm_sum.h5
├── [ 54M Sep  6 11:30]  dbp_test_mt_avg.h5
├── [ 54M Sep  6 11:56]  dbp_test_mt_max.h5
└── [ 54M Sep  6 13:51]  dbp_test_mt_sum.h5

6 directories, 6 files


Print the total size and number of pt files in each embedding folder

In [43]:
# Print the total size and number of pt files in each embedding folder
fu.emb_files_stats(path_pt)

dbp_test_dlm_avg consumes: 55.15MB in 2272 files
dbp_test_dlm_max consumes: 55.15MB in 2272 files
dbp_test_dlm_sum consumes: 55.15MB in 2272 files
dbp_test_mt_avg consumes: 55.15MB in 2272 files
dbp_test_mt_max consumes: 55.15MB in 2272 files
dbp_test_mt_sum consumes: 55.15MB in 2272 files
