# DBP - Reading Embeddings

In this notebook we use previously created `.pt` files to extract embeddings together with the corresponding sequence ids and our target variables (labels).
For every combination of pretrained model, pooling operation and task dataset, this will be the workflow:

1. set the arguments
2. run `file_paths` function
   - it will set all needed paths according to the arguments
3. run `read_embeddings` function
   - reads fasta headers from a fasta file
   - extracts sequence_id and our target variable ys from the header
   - names of `.pt` files are `header[1:]`, and the script loads the corresponding file
   - extracts the embedding vectors as an array Xs
   - Xs, ys, sequence_id can now be used in machine learning applications
4. run `check_with_df` function to check if all is OK
   - it uses sequence_id, Xs, ys to create Pandas DataFrame
   - displays first 2 and last 2 rows
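
As a rough illustration of the header step in `read_embeddings`, the sketch below collects FASTA header lines and derives the matching `.pt` file names from `header[1:]`. The helper names `fasta_headers` and `pt_file_for`, the toy records, the `emb_dir` folder and the `.pt` suffix being appended to `header[1:]` are all assumptions made for this example; how the sequence id and label are split out of the header depends on the header layout, which is not shown here.

```python
import io
import os


def fasta_headers(handle):
    """Yield FASTA header lines (those starting with '>') without line endings."""
    for line in handle:
        if line.startswith('>'):
            yield line.rstrip()


def pt_file_for(header, path_pt):
    """Build the path of the .pt file for one header; the file stem is header[1:]."""
    return os.path.join(path_pt, header[1:] + '.pt')


# Toy FASTA content (hypothetical records, not taken from the dataset)
fasta = io.StringIO(">seqA\nMKV\n>seqB\nGAT\n")
headers = list(fasta_headers(fasta))
pt_files = [pt_file_for(h, 'emb_dir') for h in headers]
print(headers)
print(pt_files)
```

In the real workflow each of these files would then be loaded and its embedding appended to `Xs`.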

First we read `prose` embeddings and then `esm` embeddings.
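
The ProSE part of the notebook repeats the same three calls for every dataset, model and pooling operation. A possible way to enumerate those combinations is sketched below; the argument values are the ones used in this notebook, while the loop itself is only an illustration and the calls to `file_utilities` are left as a comment so the block runs on its own.

```python
from itertools import product

file_bases = ['train', 'test']
models = ['prose_dlm', 'prose_mt']
pools = ['avg', 'max', 'sum']

# Every (dataset, model, pooling) combination, in the order used in this notebook
combos = list(product(file_bases, models, pools))

for file_base, model, pool in combos:
    # Here one would call fu.file_paths, fu.read_embeddings and fu.check_with_df
    print(file_base, model, pool)
```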

Import file utilities

In [1]:
# Import the script from a different folder
import sys
sys.path.append('../scripts')

import file_utilities as fu

## ProSE embeddings


Initialize arguments

In [2]:
# Define arguments for the file_paths function
task = 'dbp'
ptmodel = 'prose'
file_base = 'train'
model = 'prose_dlm'
emb_layer = 'layer'
pool = 'avg'  


<br>

## Train Dataset

### ProSE DLM model - prose_dlm

- **Pooling Operation:  `avg`**

Run the script `file_paths` to prepare paths. The default root data folder is *../data*.

In [3]:
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/dna_binding/train_prose.fa 
 ../data/dna_binding/prose/train/dbp_train_dlm_avg
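
The printed paths suggest a simple naming scheme. The sketch below reproduces it under stated assumptions: `sketch_paths`, `TASK_DIRS` and `MODEL_TAGS` are hypothetical names inferred from the outputs in this notebook, not the actual `file_paths` implementation, and the second return value that the notebook discards is left out.

```python
# Hypothetical lookup tables inferred from the paths printed in this notebook
TASK_DIRS = {'dbp': 'dna_binding'}
MODEL_TAGS = {
    'prose_dlm': 'dlm',
    'prose_mt': 'mt',
    'esm1v_t33_650M_UR90S_1': 'esm1v',
    'esm1b_t33_650M_UR50S': 'esm1b',
}


def sketch_paths(ptmodel, task, file_base, model, pool, root='../data'):
    """Return (path_pt, path_fa) following the naming seen in the printed paths."""
    task_dir = f"{root}/{TASK_DIRS[task]}"
    path_fa = f"{task_dir}/{file_base}_{ptmodel}.fa"
    path_pt = (f"{task_dir}/{ptmodel}/{file_base}/"
               f"{task}_{file_base}_{MODEL_TAGS[model]}_{pool}")
    return path_pt, path_fa


path_pt_demo, path_fa_demo = sketch_paths('prose', 'dbp', 'train', 'prose_dlm', 'avg')
print(path_fa_demo)
print(path_pt_demo)
```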


Extract embeddings with `read_embeddings` function.

In [4]:
Xs, ys, sequence_id = fu.read_embeddings(path_fa, path_pt, pool, emb_layer)

Shape of embeddings: 		(14016, 6165)
Length of target label list:	14016
Length of sequential ids list:	14016


Check our data with `check_with_df` function.

In [5]:
fu.check_with_df(Xs, ys, sequence_id)

Unnamed: 0,sequence,0,1,2,3,4,5,6,7,8,...,6156,6157,6158,6159,6160,6161,6162,6163,6164,label
0,Q6A8L0,0.113636,0.064394,0.037879,0.094697,0.003788,0.026515,0.094697,0.075758,0.007576,...,-0.128591,0.081230,0.080329,0.048674,-0.035709,0.092301,-0.037153,-0.038137,0.004536,1
1,Q7V7T9,0.093333,0.075556,0.008889,0.057778,0.017778,0.075556,0.080000,0.102222,0.022222,...,-0.247285,0.128575,0.095270,-0.024129,0.075140,0.057433,-0.124096,0.079725,0.031797,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14014,Q16AE8,0.085714,0.100000,0.042857,0.035714,0.003571,0.035714,0.035714,0.132143,0.025000,...,-0.010057,0.207347,0.158648,-0.002683,-0.148059,0.128500,-0.098671,0.082399,0.272792,0
14015,Q486Z0,0.099010,0.054455,0.049505,0.064356,0.012376,0.032178,0.074257,0.076733,0.032178,...,-0.172984,0.002041,-0.008729,-0.056369,-0.041159,0.142222,-0.169299,0.286869,-0.067152,0
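
A table like the one above can be built with a few lines of pandas. The following is a minimal sketch, assuming `Xs` is a 2-D array and `ys` and `sequence_id` are lists of the same length; `sketch_check_with_df` is a hypothetical stand-in, not the code of `check_with_df`, and the toy inputs are made up.

```python
import numpy as np
import pandas as pd


def sketch_check_with_df(Xs, ys, sequence_id):
    """Combine ids, embeddings and labels; return the first 2 and last 2 rows."""
    df = pd.DataFrame(Xs)                  # one column per embedding dimension
    df.insert(0, 'sequence', sequence_id)  # ids as the first column
    df['label'] = ys                       # target variable as the last column
    return pd.concat([df.head(2), df.tail(2)])


# Toy data: 5 sequences with 3-dimensional embeddings (made-up values)
toy_Xs = np.arange(15, dtype=float).reshape(5, 3)
toy_ys = [1, 1, 0, 0, 0]
toy_ids = ['a', 'b', 'c', 'd', 'e']

preview = sketch_check_with_df(toy_Xs, toy_ys, toy_ids)
print(preview)
```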


- **Pooling Operation:  `max`**

Update arguments and prepare paths

In [6]:
# Update arguments
pool = 'max'
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/dna_binding/train_prose.fa 
 ../data/dna_binding/prose/train/dbp_train_dlm_max


Extract embeddings with `read_embeddings` function.

In [7]:
Xs, ys, sequence_id = fu.read_embeddings(path_fa, path_pt, pool, emb_layer)

Shape of embeddings: 		(14016, 6165)
Length of target label list:	14016
Length of sequential ids list:	14016


Check our data with `check_with_df` function.

In [8]:
fu.check_with_df(Xs, ys, sequence_id)

Unnamed: 0,sequence,0,1,2,3,4,5,6,7,8,...,6156,6157,6158,6159,6160,6161,6162,6163,6164,label
0,Q6A8L0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.027604,0.705131,0.718681,0.907449,0.416926,0.783536,0.141743,0.023427,0.258694,1
1,Q7V7T9,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,-0.007948,0.929712,0.917170,0.851409,0.997472,0.724912,0.044931,0.509727,0.646402,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14014,Q16AE8,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.112960,0.939670,0.760756,0.584100,0.219241,0.818944,-0.001795,0.418227,0.949619,0
14015,Q486Z0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.383856,0.928906,0.741534,0.894473,0.982955,0.982904,0.005548,0.907785,0.550146,0


- **Pooling Operation:  `sum`**

Update arguments and prepare paths

In [9]:
# Update arguments
pool = 'sum'
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/dna_binding/train_prose.fa 
 ../data/dna_binding/prose/train/dbp_train_dlm_sum


Extract embeddings with `read_embeddings` function.

In [10]:
Xs, ys, sequence_id = fu.read_embeddings(path_fa, path_pt, pool, emb_layer)

Shape of embeddings: 		(14016, 6165)
Length of target label list:	14016
Length of sequential ids list:	14016


Check our data with `check_with_df` function.

In [11]:
fu.check_with_df(Xs, ys, sequence_id)

Unnamed: 0,sequence,0,1,2,3,4,5,6,7,8,...,6156,6157,6158,6159,6160,6161,6162,6163,6164,label
0,Q6A8L0,30.0,17.0,10.0,25.0,1.0,7.0,25.0,20.0,2.0,...,-33.948021,21.444836,21.206905,12.849918,-9.427154,24.367416,-9.808510,-10.068150,1.197400,1
1,Q7V7T9,21.0,17.0,2.0,13.0,4.0,17.0,18.0,23.0,5.0,...,-55.639015,28.929262,21.435713,-5.428989,16.906427,12.922314,-27.921585,17.938097,7.154354,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14014,Q16AE8,24.0,28.0,12.0,10.0,1.0,10.0,10.0,37.0,7.0,...,-2.815915,58.057266,44.421452,-0.751250,-41.456539,35.980129,-27.627777,23.071705,76.381744,0
14015,Q486Z0,40.0,22.0,20.0,26.0,5.0,13.0,30.0,31.0,13.0,...,-69.885620,0.824720,-3.526418,-22.773172,-16.628233,57.457756,-68.396904,115.894958,-27.129374,0
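
The three ProSE DLM tables above differ only in the pooling operation. The sketch below shows what `avg`, `max` and `sum` pooling do to a per-residue matrix; the toy matrix is made up, and it is an assumption that the stored embeddings were pooled along the sequence axis in this way.

```python
import numpy as np

# Toy per-residue matrix: 4 residues x 3 features (made-up values)
per_residue = np.array([
    [1.0, 0.0, 0.2],
    [0.0, 1.0, -0.4],
    [1.0, 0.0, 0.6],
    [0.0, 0.0, 0.0],
])

# Each pooling operation collapses the residue axis into one vector
pooled = {
    'avg': per_residue.mean(axis=0),
    'max': per_residue.max(axis=0),
    'sum': per_residue.sum(axis=0),
}
for name, vec in pooled.items():
    print(name, vec)

# Dividing a summed feature by its averaged counterpart gives the number of residues
length = pooled['sum'][0] / pooled['avg'][0]
print(length)
```

Under that assumption, dividing a `sum` value by the matching `avg` value recovers the sequence length, for example 30.0 / 0.113636 ≈ 264 for Q6A8L0 in the tables above.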


<br>

### ProSE MT model - prose_mt

- **Pooling Operation:  `avg`**

Update arguments and prepare paths

In [12]:
# Update arguments
model = 'prose_mt'
pool = 'avg'
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/dna_binding/train_prose.fa 
 ../data/dna_binding/prose/train/dbp_train_mt_avg


Extract embeddings with `read_embeddings` function.

In [13]:
Xs, ys, sequence_id = fu.read_embeddings(path_fa, path_pt, pool, emb_layer)

Shape of embeddings: 		(14016, 6165)
Length of target label list:	14016
Length of sequential ids list:	14016


Check our data with `check_with_df` function.

In [14]:
fu.check_with_df(Xs, ys, sequence_id)

Unnamed: 0,sequence,0,1,2,3,4,5,6,7,8,...,6156,6157,6158,6159,6160,6161,6162,6163,6164,label
0,Q6A8L0,0.113636,0.064394,0.037879,0.094697,0.003788,0.026515,0.094697,0.075758,0.007576,...,-0.176603,0.095661,0.036870,0.059194,-0.037333,0.149248,-0.051019,0.038219,-0.041694,1
1,Q7V7T9,0.093333,0.075556,0.008889,0.057778,0.017778,0.075556,0.080000,0.102222,0.022222,...,-0.379992,0.147494,0.075313,-0.018626,0.090156,0.111229,-0.194673,0.055714,0.022726,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14014,Q16AE8,0.085714,0.100000,0.042857,0.035714,0.003571,0.035714,0.035714,0.132143,0.025000,...,-0.016353,0.255953,0.160380,-0.014287,-0.114906,0.139985,-0.097824,0.085277,0.372806,0
14015,Q486Z0,0.099010,0.054455,0.049505,0.064356,0.012376,0.032178,0.074257,0.076733,0.032178,...,-0.158786,-0.056541,0.011900,-0.057320,-0.065921,0.156501,-0.131794,0.367778,-0.079480,0


- **Pooling Operation:  `max`**

Update arguments and prepare paths

In [15]:
# Update arguments
model = 'prose_mt'
pool = 'max'
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/dna_binding/train_prose.fa 
 ../data/dna_binding/prose/train/dbp_train_mt_max


Extract embeddings with `read_embeddings` function.

In [16]:
Xs, ys, sequence_id = fu.read_embeddings(path_fa, path_pt, pool, emb_layer)

Shape of embeddings: 		(14016, 6165)
Length of target label list:	14016
Length of sequential ids list:	14016


Check our data with `check_with_df` function.

In [17]:
fu.check_with_df(Xs, ys, sequence_id)

Unnamed: 0,sequence,0,1,2,3,4,5,6,7,8,...,6156,6157,6158,6159,6160,6161,6162,6163,6164,label
0,Q6A8L0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.031456,0.868353,0.564631,0.868619,0.543947,0.995774,0.086551,0.185510,0.422129,1
1,Q7V7T9,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,-0.007239,0.990007,0.933836,0.752532,0.985601,0.990459,0.052465,0.450199,0.633835,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14014,Q16AE8,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.157115,0.982936,0.908918,0.618211,0.044996,0.931821,-0.000162,0.249665,0.975980,0
14015,Q486Z0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.437450,0.934017,0.826290,0.800415,0.822493,0.994496,-0.000129,0.967463,0.499478,0


- **Pooling Operation:  `sum`**

Update arguments and prepare paths

In [18]:
# Update arguments
model = 'prose_mt'
pool = 'sum'
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/dna_binding/train_prose.fa 
 ../data/dna_binding/prose/train/dbp_train_mt_sum


Extract embeddings with `read_embeddings` function.

In [19]:
Xs, ys, sequence_id = fu.read_embeddings(path_fa, path_pt, pool, emb_layer)

Shape of embeddings: 		(14016, 6165)
Length of target label list:	14016
Length of sequential ids list:	14016


Check our data with `check_with_df` function.

In [20]:
fu.check_with_df(Xs, ys, sequence_id)

Unnamed: 0,sequence,0,1,2,3,4,5,6,7,8,...,6156,6157,6158,6159,6160,6161,6162,6163,6164,label
0,Q6A8L0,30.0,17.0,10.0,25.0,1.0,7.0,25.0,20.0,2.0,...,-46.623241,25.254379,9.733679,15.627245,-9.855886,39.401455,-13.469040,10.089860,-11.007279,1
1,Q7V7T9,21.0,17.0,2.0,13.0,4.0,17.0,18.0,23.0,5.0,...,-85.498138,33.186237,16.945395,-4.190884,20.285025,25.026552,-43.801403,12.535753,5.113260,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14014,Q16AE8,24.0,28.0,12.0,10.0,1.0,10.0,10.0,37.0,7.0,...,-4.578764,71.666748,44.906357,-4.000475,-32.173798,39.195778,-27.390800,23.877583,104.385681,0
14015,Q486Z0,40.0,22.0,20.0,26.0,5.0,13.0,30.0,31.0,13.0,...,-64.149536,-22.842548,4.807764,-23.157238,-26.632029,63.226501,-53.244881,148.582184,-32.109756,0


<br>

## Test Dataset

### ProSE DLM model - prose_dlm

- **Pooling Operation:  `avg`**

Update arguments and prepare paths

In [21]:
# Update arguments
file_base = 'test'
model = 'prose_dlm'
pool = 'avg'
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/dna_binding/test_prose.fa 
 ../data/dna_binding/prose/test/dbp_test_dlm_avg


Extract embeddings with `read_embeddings` function.

In [22]:
Xs, ys, sequence_id = fu.read_embeddings(path_fa, path_pt, pool, emb_layer)

Shape of embeddings: 		(2272, 6165)
Length of target label list:	2272
Length of sequential ids list:	2272


Check our data with `check_with_df` function.

In [23]:
fu.check_with_df(Xs, ys, sequence_id)

Unnamed: 0,sequence,0,1,2,3,4,5,6,7,8,...,6156,6157,6158,6159,6160,6161,6162,6163,6164,label
0,P27204|1,0.138614,0.267327,0.029703,0.000000,0.000000,0.000000,0.000000,0.019802,0.029703,...,-0.126832,0.010812,0.051377,-0.000109,-0.102151,0.111184,-0.080012,-0.138695,0.082615,1
1,P53528|1,0.151261,0.088235,0.019608,0.057423,0.012605,0.046218,0.063025,0.063025,0.018207,...,-0.210425,-0.009928,0.025247,0.011334,-0.234925,0.017712,-0.142075,0.116288,0.048743,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2270,P80484|2,0.210811,0.018919,0.032432,0.064865,0.002703,0.032432,0.027027,0.064865,0.013514,...,0.232649,-0.009447,0.084011,0.017176,-0.100683,0.123371,0.019268,-0.014178,-0.005585,0
2271,Q57837|2,0.038348,0.026549,0.041298,0.035398,0.002950,0.005900,0.058997,0.082596,0.014749,...,-0.130443,0.037440,-0.005931,0.045478,-0.048512,0.077510,-0.125126,0.125926,-0.023890,0


- **Pooling Operation:  `max`**

Update arguments and prepare paths

In [24]:
# Update arguments
pool = 'max'
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/dna_binding/test_prose.fa 
 ../data/dna_binding/prose/test/dbp_test_dlm_max


Extract embeddings with `read_embeddings` function.

In [25]:
Xs, ys, sequence_id = fu.read_embeddings(path_fa, path_pt, pool, emb_layer)

Shape of embeddings: 		(2272, 6165)
Length of target label list:	2272
Length of sequential ids list:	2272


Check our data with `check_with_df` function.

In [26]:
fu.check_with_df(Xs, ys, sequence_id)

Unnamed: 0,sequence,0,1,2,3,4,5,6,7,8,...,6156,6157,6158,6159,6160,6161,6162,6163,6164,label
0,P27204|1,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,...,-0.001267,0.097833,0.190686,0.047367,0.000588,0.494923,0.202199,-0.015823,0.281614,1
1,P53528|1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.017937,0.516417,0.817019,0.591379,0.696387,0.585353,0.010985,0.442446,0.918994,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2270,P80484|2,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.532443,0.071653,0.321972,0.253031,0.466342,0.534482,0.335638,0.040889,0.048345,0
2271,Q57837|2,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.002863,0.672572,0.515865,0.878341,0.796702,0.919690,-0.000859,0.423690,0.214172,0


- **Pooling Operation:  `sum`**

Update arguments and prepare paths

In [27]:
# Update arguments
pool = 'sum'
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/dna_binding/test_prose.fa 
 ../data/dna_binding/prose/test/dbp_test_dlm_sum


Extract embeddings with `read_embeddings` function.

In [28]:
Xs, ys, sequence_id = fu.read_embeddings(path_fa, path_pt, pool, emb_layer)

Shape of embeddings: 		(2272, 6165)
Length of target label list:	2272
Length of sequential ids list:	2272


Check our data with `check_with_df` function.

In [29]:
fu.check_with_df(Xs, ys, sequence_id)

Unnamed: 0,sequence,0,1,2,3,4,5,6,7,8,...,6156,6157,6158,6159,6160,6161,6162,6163,6164,label
0,P27204|1,14.0,27.0,3.0,0.0,0.0,0.0,0.0,2.0,3.0,...,-12.810034,1.092043,5.189063,-0.010999,-10.317245,11.229548,-8.081186,-14.008194,8.344089,1
1,P53528|1,108.0,63.0,14.0,41.0,9.0,33.0,45.0,45.0,13.0,...,-150.243561,-7.088599,18.026421,8.092163,-167.736679,12.646237,-101.441269,83.029282,34.802650,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2270,P80484|2,78.0,7.0,12.0,24.0,1.0,12.0,10.0,24.0,5.0,...,86.080215,-3.495316,31.084242,6.355218,-37.252892,45.647121,7.129293,-5.245878,-2.066378,0
2271,Q57837|2,13.0,9.0,14.0,12.0,1.0,2.0,20.0,28.0,5.0,...,-44.220062,12.692288,-2.010568,15.417030,-16.445723,26.275995,-42.417809,42.689011,-8.098829,0


<br>

### ProSE MT model - prose_mt

- **Pooling Operation:  `avg`**

Update arguments and prepare paths

In [30]:
# Update arguments
model = 'prose_mt'
pool = 'avg'
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/dna_binding/test_prose.fa 
 ../data/dna_binding/prose/test/dbp_test_mt_avg


Extract embeddings with `read_embeddings` function.

In [31]:
Xs, ys, sequence_id = fu.read_embeddings(path_fa, path_pt, pool, emb_layer)

Shape of embeddings: 		(2272, 6165)
Length of target label list:	2272
Length of sequential ids list:	2272


Check our data with `check_with_df` function.

In [32]:
fu.check_with_df(Xs, ys, sequence_id)

Unnamed: 0,sequence,0,1,2,3,4,5,6,7,8,...,6156,6157,6158,6159,6160,6161,6162,6163,6164,label
0,P27204|1,0.138614,0.267327,0.029703,0.000000,0.000000,0.000000,0.000000,0.019802,0.029703,...,-0.108032,-0.006805,0.025998,-0.000365,-0.071782,0.033783,-0.063707,-0.076529,0.051280,1
1,P53528|1,0.151261,0.088235,0.019608,0.057423,0.012605,0.046218,0.063025,0.063025,0.018207,...,-0.216519,0.009888,0.024019,0.013675,-0.240709,0.016526,-0.168821,0.141335,0.049230,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2270,P80484|2,0.210811,0.018919,0.032432,0.064865,0.002703,0.032432,0.027027,0.064865,0.013514,...,0.026451,0.018157,0.049419,0.024843,-0.185805,0.105743,-0.001065,0.092161,-0.024942,0
2271,Q57837|2,0.038348,0.026549,0.041298,0.035398,0.002950,0.005900,0.058997,0.082596,0.014749,...,-0.113229,0.029627,-0.005488,0.049189,-0.048718,0.071532,-0.106684,0.148737,-0.021660,0


- **Pooling Operation:  `max`**

Update arguments and prepare paths

In [33]:
# Update arguments
pool = 'max'
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/dna_binding/test_prose.fa 
 ../data/dna_binding/prose/test/dbp_test_mt_max


Extract embeddings with `read_embeddings` function.

In [34]:
Xs, ys, sequence_id = fu.read_embeddings(path_fa, path_pt, pool, emb_layer)

Shape of embeddings: 		(2272, 6165)
Length of target label list:	2272
Length of sequential ids list:	2272


Check our data with `check_with_df` function.

In [35]:
fu.check_with_df(Xs, ys, sequence_id)

Unnamed: 0,sequence,0,1,2,3,4,5,6,7,8,...,6156,6157,6158,6159,6160,6161,6162,6163,6164,label
0,P27204|1,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,...,-0.005655,0.066985,0.144704,0.034658,0.173915,0.220022,0.036432,0.020925,0.163018,1
1,P53528|1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,-0.001896,0.838819,0.522188,0.687059,0.367041,0.430013,0.009612,0.499845,0.908976,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2270,P80484|2,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,0.200721,0.175597,0.225573,0.249912,0.248831,0.482014,0.195493,0.207856,0.053781,0
2271,Q57837|2,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,-0.000725,0.575338,0.532389,0.882318,0.591743,0.912876,-0.004104,0.600571,0.382251,0


- **Pooling Operation:  `sum`**

Update arguments and prepare paths

In [36]:
# Update arguments
pool = 'sum'
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/dna_binding/test_prose.fa 
 ../data/dna_binding/prose/test/dbp_test_mt_sum


Extract embeddings with `read_embeddings` function.

In [37]:
Xs, ys, sequence_id = fu.read_embeddings(path_fa, path_pt, pool, emb_layer)

Shape of embeddings: 		(2272, 6165)
Length of target label list:	2272
Length of sequential ids list:	2272


Check our data with `check_with_df` function.

In [38]:
fu.check_with_df(Xs, ys, sequence_id)

Unnamed: 0,sequence,0,1,2,3,4,5,6,7,8,...,6156,6157,6158,6159,6160,6161,6162,6163,6164,label
0,P27204|1,14.0,27.0,3.0,0.0,0.0,0.0,0.0,2.0,3.0,...,-10.911218,-0.687335,2.625784,-0.036835,-7.249979,3.412109,-6.434397,-7.729387,5.179250,1
1,P53528|1,108.0,63.0,14.0,41.0,9.0,33.0,45.0,45.0,13.0,...,-154.594452,7.059755,17.149433,9.763611,-171.866165,11.799815,-120.538528,100.913208,35.149967,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2270,P80484|2,78.0,7.0,12.0,24.0,1.0,12.0,10.0,24.0,5.0,...,9.786963,6.718013,18.285032,9.191998,-68.748024,39.124813,-0.394081,34.099724,-9.228519,0
2271,Q57837|2,13.0,9.0,14.0,12.0,1.0,2.0,20.0,28.0,5.0,...,-38.384468,10.043718,-1.860400,16.674957,-16.515369,24.249302,-36.165977,50.421692,-7.342779,0


## ESM embeddings


## Train Dataset

### ESM ESM-1v model - esm1v_t33_650M_UR90S_1

- **Pooling Operation:  `mean`**

Update arguments and prepare paths

In [39]:
# Update arguments
ptmodel = 'esm'
file_base = 'train'
model = 'esm1v_t33_650M_UR90S_1'
emb_layer = 33
pool = 'mean'  
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/dna_binding/train_esm.fa 
 ../data/dna_binding/esm/train/dbp_train_esm1v_mean


Extract embeddings with `read_embeddings` function.

In [40]:
Xs, ys, sequence_id = fu.read_embeddings(path_fa, path_pt, pool, emb_layer)

Shape of embeddings: 		(13108, 1280)
Length of target label list:	13108
Length of sequential ids list:	13108


Check our data with `check_with_df` function.

In [41]:
fu.check_with_df(Xs, ys, sequence_id)

Unnamed: 0,sequence,0,1,2,3,4,5,6,7,8,...,1271,1272,1273,1274,1275,1276,1277,1278,1279,label
0,Q6A8L0,-0.236084,-0.080925,-0.112605,0.069161,0.160846,-0.079905,-0.074510,-0.134663,-0.008643,...,-0.090658,-0.155936,-0.094594,-0.004625,-0.134892,0.098090,-0.081409,-0.107882,-0.135179,1
1,Q7V7T9,-0.055984,0.031985,0.022922,-0.232816,-0.098906,0.037312,0.058930,-0.305301,0.236868,...,0.043834,0.015146,-0.230926,0.003251,-0.429846,0.046164,-0.067858,-0.084438,0.038270,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13106,Q16AE8,0.019666,-0.046728,-0.145186,-0.060228,0.222324,0.058785,0.259896,-0.034740,-0.013585,...,-0.389559,0.229647,0.045373,-0.075256,0.073156,0.224126,0.069066,-0.215409,0.207082,0
13107,Q486Z0,-0.309475,0.157435,-0.090308,-0.015247,-0.292232,0.121620,0.231596,-0.217210,0.125528,...,-0.189036,0.064136,-0.156838,0.128314,0.065698,0.212815,-0.189209,-0.154284,0.045231,0


<br>

### ESM ESM-1b model - esm1b_t33_650M_UR50S

Update arguments and prepare paths

In [42]:
# Update arguments
model = 'esm1b_t33_650M_UR50S' 
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/dna_binding/train_esm.fa 
 ../data/dna_binding/esm/train/dbp_train_esm1b_mean


Extract embeddings with `read_embeddings` function.

In [43]:
Xs, ys, sequence_id = fu.read_embeddings(path_fa, path_pt, pool, emb_layer)

Shape of embeddings: 		(13108, 1280)
Length of target label list:	13108
Length of sequential ids list:	13108


Check our data with `check_with_df` function.

In [44]:
fu.check_with_df(Xs, ys, sequence_id)

Unnamed: 0,sequence,0,1,2,3,4,5,6,7,8,...,1271,1272,1273,1274,1275,1276,1277,1278,1279,label
0,Q6A8L0,0.024735,0.121467,0.027258,0.053919,-0.044988,-0.049945,-0.059293,-0.069261,-0.151866,...,-0.048141,-0.056756,0.087139,-0.359019,-0.109614,-0.050734,-0.065588,-0.001964,0.149597,1
1,Q7V7T9,-0.017985,0.125560,0.137783,0.151576,0.192121,-0.133194,-0.079588,-0.032318,-0.195545,...,0.086358,-0.023590,0.001365,-0.429306,-0.043662,-0.033201,-0.015863,-0.048463,0.089530,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13106,Q16AE8,0.066102,0.224571,0.061397,0.305505,-0.083444,-0.043334,-0.096683,0.019443,-0.197743,...,-0.121072,-0.120786,-0.114808,0.070889,-0.047742,0.063377,-0.176054,-0.116016,0.058687,0
13107,Q486Z0,0.088875,0.188258,-0.013611,-0.009012,-0.118099,-0.345594,-0.334829,0.208053,-0.214457,...,-0.065021,-0.296802,0.123161,0.009821,-0.066100,0.087617,-0.160601,-0.007373,-0.084994,0


<br>

## Test Dataset

### ESM ESM-1v model - esm1v_t33_650M_UR90S_1

- **Pooling Operation:  `mean`**

Update arguments and prepare paths

In [45]:
# Update arguments
model = 'esm1v_t33_650M_UR90S_1'
file_base = 'test'
emb_layer = 33
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/dna_binding/test_esm.fa 
 ../data/dna_binding/esm/test/dbp_test_esm1v_mean


Extract embeddings with `read_embeddings` function.

In [46]:
Xs, ys, sequence_id = fu.read_embeddings(path_fa, path_pt, pool, emb_layer)

Shape of embeddings: 		(2081, 1280)
Length of target label list:	2081
Length of sequential ids list:	2081


Check our data with `check_with_df` function.

In [47]:
fu.check_with_df(Xs, ys, sequence_id)

Unnamed: 0,sequence,0,1,2,3,4,5,6,7,8,...,1271,1272,1273,1274,1275,1276,1277,1278,1279,label
0,P27204|1,-0.514593,-0.123215,-0.403862,-0.257147,-0.400282,0.334636,0.280655,-0.102672,-0.237462,...,0.070027,0.314713,0.100227,-0.433217,0.162725,0.089294,-0.208665,0.169917,0.079544,1
1,P53528|1,-0.597343,0.108853,0.030260,0.131975,0.152198,0.021401,-0.096014,-0.111102,0.040586,...,-0.046748,-0.152161,-0.323566,-0.042212,-0.029988,0.195072,-0.165746,0.098884,0.140598,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2079,P80484|2,-0.189874,0.021987,-0.073087,-0.159768,-0.224471,0.024664,0.418944,-0.160513,0.047089,...,-0.133678,-0.233500,0.214999,-0.088417,-0.044613,0.117578,-0.100098,0.147895,-0.203335,0
2080,Q57837|2,-0.326968,-0.164970,-0.318906,0.064471,-0.559337,-0.019767,0.170227,-0.229546,-0.078261,...,-0.364000,-0.172490,0.014528,-0.141528,-0.249329,0.111519,-0.011608,0.197438,0.138820,0


<br>

### ESM ESM-1b model - esm1b_t33_650M_UR50S

Update arguments and prepare paths

In [48]:
# Update arguments
model = 'esm1b_t33_650M_UR50S' 
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/dna_binding/test_esm.fa 
 ../data/dna_binding/esm/test/dbp_test_esm1b_mean


Extract embeddings with `read_embeddings` function.

In [49]:
Xs, ys, sequence_id = fu.read_embeddings(path_fa, path_pt, pool, emb_layer)

Shape of embeddings: 		(2081, 1280)
Length of target label list:	2081
Length of sequential ids list:	2081


Check our data with `check_with_df` function.

In [50]:
fu.check_with_df(Xs, ys, sequence_id)

Unnamed: 0,sequence,0,1,2,3,4,5,6,7,8,...,1271,1272,1273,1274,1275,1276,1277,1278,1279,label
0,P27204|1,-0.011398,-0.006860,0.128372,0.106459,-0.043602,-0.045671,-0.068560,0.131720,-0.151244,...,0.020886,-0.019969,-0.037539,-0.831627,0.038463,0.073157,0.015005,-0.057251,0.154478,1
1,P53528|1,-0.052130,0.202240,0.161669,-0.053947,0.117859,-0.262483,-0.075677,0.120203,-0.188200,...,-0.009060,-0.118130,-0.073568,0.046297,-0.012672,0.103821,-0.086478,-0.041500,-0.025464,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2079,P80484|2,0.052295,0.076697,-0.053086,0.043356,-0.033600,-0.125369,-0.006812,0.030662,-0.068741,...,0.068226,-0.017867,0.183126,-0.778656,0.018837,-0.117384,0.041702,0.065865,0.271809,0
2080,Q57837|2,0.067418,0.224127,-0.010041,0.057362,0.088127,-0.150553,-0.062061,0.013359,-0.026879,...,0.049471,-0.159092,0.032727,0.076621,0.118966,0.072636,0.024107,0.101532,0.061851,0
