# DBP - DNA-Binding Proteins

After cleaning the original CSV files, we converted them to FASTA format.
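
The conversion itself is not shown in this notebook. As a minimal sketch of what such a step could look like, assuming the cleaned CSV has an identifier column and a sequence column (the names `id` and `sequence`, and the function `csv_to_fasta`, are hypothetical):

```python
import csv
import io


def csv_to_fasta(csv_text, id_col="id", seq_col="sequence", width=60):
    """Convert CSV text to FASTA text. Column names are assumptions."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        seq = row[seq_col].strip().upper()
        if not seq:
            continue  # skip rows that carry no sequence
        # Wrap the sequence at a fixed line width
        lines = [seq[i:i + width] for i in range(0, len(seq), width)]
        records.append(">" + row[id_col].strip() + "\n" + "\n".join(lines))
    return "\n".join(records) + "\n"


example = "id,sequence\nP1,mkv\nP2,\nP3,ACDE\n"
print(csv_to_fasta(example, width=2))
```

The header line of each record becomes the key under which the embedding is later stored, so identifiers should be unique.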

These FASTA files are the input to the [embedding script](https://github.com/tbepler/prose/blob/main/embed_sequences.py) developed by [Tristan Bepler](mailto:tbepler@gmail.com).  
The script writes the embeddings to an HDF5 file, using the sequence headers as keys and the embedding vectors as values.

In this project we use pretrained models from [Facebook's Evolutionary Scale Modeling (ESM)](https://github.com/facebookresearch/esm) and [Protein Sequence Embeddings (ProSE)](https://github.com/tbepler/prose). Both accept FASTA-formatted files as input, but they write embeddings in different formats: ESM stores embeddings in PyTorch `.pt` files (generated by `torch.save`), one file per FASTA sequence, while ProSE writes the embeddings to an HDF5 file (`.h5`).

To store all embeddings in one consistent format, we follow the format used by ESM.  
As the last step, we therefore convert the HDF5 file generated by the ProSE embedding script into `.pt` files.
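
The conversion is done later by `fu.convert_h5_to_pt`, whose source is not shown here. The following is only a sketch of how such a conversion could work, assuming a flat HDF5 file with one dataset per header and an ESM-like record layout. The names `pt_file_name` and `convert_h5_to_pt_sketch`, the file-naming rule, and the use of the pooling name as the dictionary key are all hypothetical.

```python
from pathlib import Path


def pt_file_name(header):
    """Map a sequence header to a file name (hypothetical naming rule)."""
    return header.replace("/", "_").replace(" ", "_") + ".pt"


def convert_h5_to_pt_sketch(path_h5, out_dir, pool):
    """Write one .pt file per dataset found in the HDF5 file.

    Assumes every top-level key is a dataset holding one pooled vector.
    Returns the number of files written.
    """
    # Imported here so the helper above stays usable without these packages
    import h5py
    import torch

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    with h5py.File(path_h5, "r") as h5:
        for header in h5.keys():
            vector = torch.from_numpy(h5[header][()])  # read dataset as a tensor
            # Record layout modelled loosely on ESM output; an assumption
            record = {"label": header, "mean_representations": {pool: vector}}
            torch.save(record, out / pt_file_name(header))
            count += 1
    return count
```

A file written this way can be read back with `torch.load`, which returns the same dictionary.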


In [1]:
# Import dependencies
import os

Import file utilities

In [2]:
# Make the scripts folder importable
import sys
sys.path.append('../../scripts')

# Import the file utilities module
import file_utilities as fu

Initialize arguments

In [3]:
# Define arguments for the file_paths function
# The first two stay constant in this notebook
ptmodel = 'prose'
task = 'dbp'
# The last three change throughout the notebook
file_base = 'train'
model = 'prose_dlm'
pool = 'avg'  
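
The rest of the notebook steps through every combination of these three arguments by hand. As a compact overview, the same sequence of runs can be enumerated with `itertools.product`; this is only an illustration of the order, not code from the project:

```python
from itertools import product

file_bases = ["train", "test"]
models = ["prose_dlm", "prose_mt"]
pools = ["avg", "max", "sum"]

# The last factor varies fastest, which matches the order of the sections below
runs = list(product(file_bases, models, pools))
for file_base, model, pool in runs:
    print(f"prose - dbp - {file_base} - {model} - {pool}")
```

This gives 2 x 2 x 3 = 12 runs, one per embedding section in this notebook.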


<br>

## Train Dataset

### ProSE DLM model - prose_dlm

- **Pooling Operation:  `avg`**

Call the utility function `file_paths` to prepare the paths. The default root data folder is *../../data*.

In [4]:
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../../data/dna_binding/train_prose.fa 
 ../../data/dna_binding/prose/train/dbp_train_dlm_avg.h5 
 ../../data/dna_binding/prose/train/dbp_train_dlm_avg
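
The naming scheme can be read off the printed paths. A minimal sketch that reproduces them, assuming the task-to-folder mapping and the shortening of the model name are done roughly as below (the function `sketch_file_paths` and the dictionary `TASK_DIRS` are hypothetical and not the actual `fu.file_paths` implementation):

```python
# Task name -> data folder; inferred from the printed paths, an assumption
TASK_DIRS = {"dbp": "dna_binding"}


def sketch_file_paths(ptmodel, task, file_base, model, pool, root="../../data"):
    """Return (path_pt, path_h5, path_fa) following the pattern seen above."""
    task_dir = f"{root}/{TASK_DIRS[task]}"
    # 'prose_dlm' -> 'dlm': drop the pretrained-model prefix if present
    prefix = ptmodel + "_"
    short_model = model[len(prefix):] if model.startswith(prefix) else model
    path_fa = f"{task_dir}/{file_base}_{ptmodel}.fa"
    path_pt = f"{task_dir}/{ptmodel}/{file_base}/{task}_{file_base}_{short_model}_{pool}"
    path_h5 = path_pt + ".h5"
    return path_pt, path_h5, path_fa


path_pt, path_h5, path_fa = sketch_file_paths("prose", "dbp", "train", "prose_dlm", "avg")
print(path_fa, path_h5, path_pt, sep="\n")
```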


<br>

Run the embedding script for: `prose - dbp - train - dlm - avg`.  
The script reads the fasta file and creates the h5 file with embeddings.
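
The `--pool` option reduces the per-residue embeddings of a sequence to a single fixed-length vector. The toy sketch below illustrates the three operations used in this notebook in plain Python, assuming the reduction is taken over the sequence-length axis; `pool_embedding` is a hypothetical helper, not part of ProSE.

```python
def pool_embedding(residue_vectors, pool):
    """Reduce an L x D list of per-residue vectors to one D-dimensional vector."""
    # Transpose so that each tuple holds one embedding dimension across residues
    columns = list(zip(*residue_vectors))
    if pool == "avg":
        return [sum(col) / len(col) for col in columns]
    if pool == "max":
        return [max(col) for col in columns]
    if pool == "sum":
        return [sum(col) for col in columns]
    raise ValueError(f"unknown pooling operation: {pool!r}")


# Toy example: a 3-residue sequence with 2-dimensional embeddings
toy = [[1.0, 4.0], [2.0, 0.0], [3.0, 2.0]]
for op in ("avg", "max", "sum"):
    print(op, pool_embedding(toy, op))
```

Whatever the sequence length, the pooled vector has the same dimension, so every sequence contributes an equally sized entry to the output file.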

In [5]:
%%time
# Run embedding script
%run ../../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE DLM model
# writing: ../../data/dna_binding/prose/train/dbp_train_dlm_avg.h5
# embedding with pool=avg
# 14015 sequences processed...

CPU times: user 26min 44s, sys: 12min, total: 38min 44s
Wall time: 40min 21s

Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [6]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 8min 58s, sys: 7.83 s, total: 9min 6s
Wall time: 9min 54s


<br>

- **Pooling Operation:  `max`**

Update arguments and prepare paths

In [7]:
# Update arguments
pool = 'max'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../../data/dna_binding/train_prose.fa 
 ../../data/dna_binding/prose/train/dbp_train_dlm_max.h5 
 ../../data/dna_binding/prose/train/dbp_train_dlm_max


<br>

Run the embedding script for: `prose - dbp - train - dlm - max`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [8]:
%%time
# Run embedding script
%run ../../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE DLM model
# writing: ../../data/dna_binding/prose/train/dbp_train_dlm_max.h5
# embedding with pool=max
# 14015 sequences processed...

CPU times: user 27min 27s, sys: 10min 28s, total: 37min 55s
Wall time: 39min 3s

Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [9]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 8min 35s, sys: 7.23 s, total: 8min 42s
Wall time: 9min 33s


<br>

- **Pooling Operation:  `sum`**

Update arguments and prepare paths

In [10]:
# Update arguments
pool = 'sum'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../../data/dna_binding/train_prose.fa 
 ../../data/dna_binding/prose/train/dbp_train_dlm_sum.h5 
 ../../data/dna_binding/prose/train/dbp_train_dlm_sum


<br>

Run the embedding script for: `prose - dbp - train - dlm - sum`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [11]:
%%time
# Run embedding script
%run ../../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE DLM model
# writing: ../../data/dna_binding/prose/train/dbp_train_dlm_sum.h5
# embedding with pool=sum
# 14015 sequences processed...

CPU times: user 26min 50s, sys: 11min 46s, total: 38min 37s
Wall time: 39min 47s

Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [12]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 9min, sys: 7 s, total: 9min 7s
Wall time: 9min 56s


<br>

### ProSE MT model - prose_mt

- **Pooling Operation:  `avg`**

Update arguments and prepare paths

In [13]:
# Update arguments
model = 'prose_mt'
pool = 'avg'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../../data/dna_binding/train_prose.fa 
 ../../data/dna_binding/prose/train/dbp_train_mt_avg.h5 
 ../../data/dna_binding/prose/train/dbp_train_mt_avg


<br>

Run the embedding script for: `prose - dbp - train - mt - avg`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [14]:
%%time
# Run embedding script
%run ../../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE MT model
# writing: ../../data/dna_binding/prose/train/dbp_train_mt_avg.h5
# embedding with pool=avg
# 14015 sequences processed...

CPU times: user 27min 20s, sys: 10min 35s, total: 37min 56s
Wall time: 39min 11s

Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [15]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 8min 35s, sys: 7.55 s, total: 8min 42s
Wall time: 9min 33s


<br>

- **Pooling Operation:  `max`**

Update arguments and prepare paths

In [16]:
# Update arguments
pool = 'max'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../../data/dna_binding/train_prose.fa 
 ../../data/dna_binding/prose/train/dbp_train_mt_max.h5 
 ../../data/dna_binding/prose/train/dbp_train_mt_max


<br>

Run the embedding script for: `prose - dbp - train - mt - max`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [17]:
%%time
# Run embedding script
%run ../../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE MT model
# writing: ../../data/dna_binding/prose/train/dbp_train_mt_max.h5
# embedding with pool=max
# 14015 sequences processed...

CPU times: user 26min 56s, sys: 10min 54s, total: 37min 51s
Wall time: 39min 31s

Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [18]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 8min 26s, sys: 6.84 s, total: 8min 33s
Wall time: 9min 25s


<br>

- **Pooling Operation:  `sum`**

Update arguments and prepare paths

In [19]:
# Update arguments
pool = 'sum'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../../data/dna_binding/train_prose.fa 
 ../../data/dna_binding/prose/train/dbp_train_mt_sum.h5 
 ../../data/dna_binding/prose/train/dbp_train_mt_sum


<br>

Run the embedding script for: `prose - dbp - train - mt - sum`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [20]:
%%time
# Run embedding script
%run ../../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE MT model
# writing: ../../data/dna_binding/prose/train/dbp_train_mt_sum.h5
# embedding with pool=sum
# 14015 sequences processed...

CPU times: user 26min 45s, sys: 10min 57s, total: 37min 43s
Wall time: 39min 30s

Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [21]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 8min 26s, sys: 6.85 s, total: 8min 33s
Wall time: 9min 25s


<br>

**Check the folders**

In [22]:
base = os.path.split(path_pt)[0]
!tree -nDhL 1 "{base}" --dirsfirst

../../data/dna_binding/prose/train
├── [4.0K Oct  3 00:14]  dbp_train_dlm_avg
├── [4.0K Oct  3 01:02]  dbp_train_dlm_max
├── [4.0K Oct  3 01:52]  dbp_train_dlm_sum
├── [4.0K Oct  3 02:41]  dbp_train_mt_avg
├── [4.0K Oct  3 03:30]  dbp_train_mt_max
├── [4.0K Oct  3 04:19]  dbp_train_mt_sum
├── [335M Oct  3 00:04]  dbp_train_dlm_avg.h5
├── [335M Oct  3 00:53]  dbp_train_dlm_max.h5
├── [335M Oct  3 01:42]  dbp_train_dlm_sum.h5
├── [335M Oct  3 02:31]  dbp_train_mt_avg.h5
├── [335M Oct  3 03:20]  dbp_train_mt_max.h5
└── [335M Oct  3 04:09]  dbp_train_mt_sum.h5

6 directories, 6 files


Print the total size and number of pt files in each embedding folder

In [23]:
# Print the total size and number of pt files in each embedding folder
fu.emb_files_stats(path_pt)

dbp_train_dlm_avg consumes: 340.2MB in 14016 files
dbp_train_dlm_max consumes: 340.2MB in 14016 files
dbp_train_dlm_sum consumes: 340.2MB in 14016 files
dbp_train_mt_avg consumes: 340.2MB in 14016 files
dbp_train_mt_max consumes: 340.2MB in 14016 files
dbp_train_mt_sum consumes: 340.2MB in 14016 files
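
The source of `fu.emb_files_stats` is not shown in this notebook. A minimal sketch of how such a summary could be computed with the standard library, assuming it scans the sibling folders of `path_pt` and reports sizes in units of 10^6 bytes (the function `sketch_emb_files_stats` and the exact unit are assumptions):

```python
import os


def sketch_emb_files_stats(path_pt):
    """Return (folder name, size in MB, file count) for each embedding folder.

    Scans every directory that sits next to `path_pt`.
    """
    base = os.path.dirname(path_pt)
    stats = []
    for entry in sorted(os.scandir(base), key=lambda e: e.name):
        if not entry.is_dir():
            continue  # skip the .h5 files that sit next to the folders
        files = [f for f in os.scandir(entry.path) if f.is_file()]
        size_mb = sum(f.stat().st_size for f in files) / 1e6
        stats.append((entry.name, size_mb, len(files)))
    return stats


def print_emb_files_stats(path_pt):
    """Print one summary line per embedding folder."""
    for name, size_mb, count in sketch_emb_files_stats(path_pt):
        print(f"{name} consumes: {size_mb:.2f}MB in {count} files")
```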


<br>

## Test Dataset

### ProSE DLM model - prose_dlm

- **Pooling Operation:  `avg`**

Update arguments and prepare paths

In [24]:
# Update arguments
file_base = 'test'
model = 'prose_dlm'
pool = 'avg'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../../data/dna_binding/test_prose.fa 
 ../../data/dna_binding/prose/test/dbp_test_dlm_avg.h5 
 ../../data/dna_binding/prose/test/dbp_test_dlm_avg


<br>

Run the embedding script for: `prose - dbp - test - dlm - avg`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [25]:
%%time
# Run embedding script
%run ../../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE DLM model
# writing: ../../data/dna_binding/prose/test/dbp_test_dlm_avg.h5
# embedding with pool=avg
# 2271 sequences processed...

CPU times: user 4min 37s, sys: 1min 56s, total: 6min 34s
Wall time: 6min 48s

Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [26]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 1min 30s, sys: 1.38 s, total: 1min 31s
Wall time: 1min 39s


<br>

- **Pooling Operation:  `max`**

Update arguments and prepare paths

In [27]:
# Update arguments
pool = 'max'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../../data/dna_binding/test_prose.fa 
 ../../data/dna_binding/prose/test/dbp_test_dlm_max.h5 
 ../../data/dna_binding/prose/test/dbp_test_dlm_max


<br>

Run the embedding script for: `prose - dbp - test - dlm - max`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [28]:
%%time
# Run embedding script
%run ../../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE DLM model
# writing: ../../data/dna_binding/prose/test/dbp_test_dlm_max.h5
# embedding with pool=max
# 2271 sequences processed...

CPU times: user 4min 40s, sys: 1min 53s, total: 6min 34s
Wall time: 6min 51s

Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [29]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 1min 19s, sys: 973 ms, total: 1min 20s
Wall time: 1min 29s


<br>

- **Pooling Operation:  `sum`**

Update arguments and prepare paths

In [30]:
# Update arguments
pool = 'sum'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../../data/dna_binding/test_prose.fa 
 ../../data/dna_binding/prose/test/dbp_test_dlm_sum.h5 
 ../../data/dna_binding/prose/test/dbp_test_dlm_sum


<br>

Run the embedding script for: `prose - dbp - test - dlm - sum`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [31]:
%%time
# Run embedding script
%run ../../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE DLM model
# writing: ../../data/dna_binding/prose/test/dbp_test_dlm_sum.h5
# embedding with pool=sum
# 2271 sequences processed...

CPU times: user 4min 36s, sys: 1min 59s, total: 6min 36s
Wall time: 6min 52s

Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [32]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 1min 28s, sys: 1.3 s, total: 1min 29s
Wall time: 1min 37s


<br>

### ProSE MT model - prose_mt

- **Pooling Operation:  `avg`**

Update arguments and prepare paths

In [33]:
# Update arguments
model = 'prose_mt'
pool = 'avg'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../../data/dna_binding/test_prose.fa 
 ../../data/dna_binding/prose/test/dbp_test_mt_avg.h5 
 ../../data/dna_binding/prose/test/dbp_test_mt_avg


<br>

Run the embedding script for: `prose - dbp - test - mt - avg`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [34]:
%%time
# Run embedding script
%run ../../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE MT model
# writing: ../../data/dna_binding/prose/test/dbp_test_mt_avg.h5
# embedding with pool=avg
# 2271 sequences processed...

CPU times: user 4min 33s, sys: 2min 1s, total: 6min 34s
Wall time: 6min 53s

Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [35]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 1min 35s, sys: 1.15 s, total: 1min 36s
Wall time: 1min 45s


<br>

- **Pooling Operation:  `max`**

Update arguments and prepare paths

In [36]:
# Update arguments
pool = 'max'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../../data/dna_binding/test_prose.fa 
 ../../data/dna_binding/prose/test/dbp_test_mt_max.h5 
 ../../data/dna_binding/prose/test/dbp_test_mt_max


<br>

Run the embedding script for: `prose - dbp - test - mt - max`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [37]:
%%time
# Run embedding script
%run ../../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE MT model
# writing: ../../data/dna_binding/prose/test/dbp_test_mt_max.h5
# embedding with pool=max
# 2271 sequences processed...

CPU times: user 4min 20s, sys: 1min 49s, total: 6min 10s
Wall time: 7min 13s

Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [38]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 1min 40s, sys: 1.14 s, total: 1min 41s
Wall time: 1min 49s


<br>

- **Pooling Operation:  `sum`**

Update arguments and prepare paths

In [39]:
# Update arguments
pool = 'sum'
# Prepare paths
path_pt, path_h5, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_h5, '\n', path_pt)

 ../../data/dna_binding/test_prose.fa 
 ../../data/dna_binding/prose/test/dbp_test_mt_sum.h5 
 ../../data/dna_binding/prose/test/dbp_test_mt_sum


<br>

Run the embedding script for: `prose - dbp - test - mt - sum`.  
The script reads the fasta file and creates the h5 file with embeddings.

In [40]:
%%time
# Run embedding script
%run ../../prose/embed_sequences --model "{model}" --pool "{pool}" -o "{path_h5}" "{path_fa}"

# loading the pre-trained ProSE MT model
# writing: ../../data/dna_binding/prose/test/dbp_test_mt_sum.h5
# embedding with pool=sum
# 2271 sequences processed...

CPU times: user 4min 38s, sys: 1min 52s, total: 6min 30s
Wall time: 6min 50s

Use the utility function `convert_h5_to_pt` to read the h5 file and create one pt file for every embedding in the `path_pt` folder.

In [41]:
%%time
# Convert h5 file to pt files
fu.convert_h5_to_pt(path_h5, path_pt, pool)

CPU times: user 1min 32s, sys: 1 s, total: 1min 33s
Wall time: 1min 41s


<br>

**Check the folders**

In [42]:
base = os.path.split(path_pt)[0]
!tree -nDhL 1 "{base}" --dirsfirst

../../data/dna_binding/prose/test
├── [4.0K Oct  3 04:31]  dbp_test_dlm_avg
├── [4.0K Oct  3 04:39]  dbp_test_dlm_max
├── [4.0K Oct  3 04:47]  dbp_test_dlm_sum
├── [4.0K Oct  3 04:56]  dbp_test_mt_avg
├── [4.0K Oct  3 05:05]  dbp_test_mt_max
├── [4.0K Oct  3 05:14]  dbp_test_mt_sum
├── [ 54M Oct  3 04:29]  dbp_test_dlm_avg.h5
├── [ 54M Oct  3 04:37]  dbp_test_dlm_max.h5
├── [ 54M Oct  3 04:46]  dbp_test_dlm_sum.h5
├── [ 54M Oct  3 04:54]  dbp_test_mt_avg.h5
├── [ 54M Oct  3 05:03]  dbp_test_mt_max.h5
└── [ 54M Oct  3 05:12]  dbp_test_mt_sum.h5

6 directories, 6 files


Print the total size and number of pt files in each embedding folder

In [43]:
# Print the total size and number of pt files in each embedding folder
fu.emb_files_stats(path_pt)

dbp_test_dlm_avg consumes: 55.15MB in 2272 files
dbp_test_dlm_max consumes: 55.15MB in 2272 files
dbp_test_dlm_sum consumes: 55.15MB in 2272 files
dbp_test_mt_avg consumes: 55.15MB in 2272 files
dbp_test_mt_max consumes: 55.15MB in 2272 files
dbp_test_mt_sum consumes: 55.15MB in 2272 files
