# ACP - Reading Embeddings

In this notebook we use previously created `.pt` files to extract embeddings together with the corresponding sequence ids and target variables (labels).
For every combination of pretrained model, pooling operation, and task dataset, the workflow is:

1. set the arguments
2. run `file_paths` function
   - it will set all needed paths according to the arguments
3. run `read_embeddings` function
   - reads fasta headers from a fasta file
   - extracts sequence_id and our target variable ys from the header
   - names of `.pt` files are `header[1:]`, and the script loads the corresponding file
   - extracts the embedding vector as an array Xs
   - Xs, ys, sequence_id can be now used in machine learning applications
4. run `check_with_df` function to check if all is OK
   - it uses sequence_id, Xs, ys to create Pandas DataFrame
   - displays first 2 and last 2 rows

First we read `prose` embeddings and then `esm` embeddings.
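
The cells below repeat the same steps by hand for each combination. As a rough sketch, the combinations covered in this notebook could be enumerated as follows. The `COMBINATIONS` table and the `iter_runs` helper are illustrative names and are not part of `file_utilities`.

```python
from itertools import product

# Illustrative table of the settings used in this notebook:
# pretrained-model family -> (model names, pooling operations, embedding layer).
COMBINATIONS = {
    "prose": (["prose_dlm", "prose_mt"], ["avg", "max", "sum"], "layer"),
    "esm": (["esm1v_t33_650M_UR90S_1", "esm1b_t33_650M_UR50S"], ["mean"], 33),
}


def iter_runs(task="acp", file_bases=("train", "test")):
    """Yield one argument dict per (dataset, model, pooling) combination."""
    for ptmodel, (models, pools, emb_layer) in COMBINATIONS.items():
        for file_base, model, pool in product(file_bases, models, pools):
            yield {
                "task": task,
                "ptmodel": ptmodel,
                "file_base": file_base,
                "model": model,
                "pool": pool,
                "emb_layer": emb_layer,
            }


runs = list(iter_runs())
print(len(runs))  # 12 ProSE runs plus 4 ESM runs
```

Each dictionary holds the same arguments that the cells below set manually before calling `fu.file_paths` and `fu.read_embeddings`.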

Import file utilities

In [1]:
# Import the script from a different folder
import sys
sys.path.append('../scripts')

import file_utilities as fu

## ProSE embeddings


Initialize arguments

In [2]:
# Define arguments for the file_paths function
task = 'acp'
ptmodel = 'prose'
file_base = 'train'
model = 'prose_dlm'
emb_layer = 'layer'
pool = 'avg'  


<br>

## Train Dataset

### ProSE DLM model - prose_dlm

- **Pooling Operation:  `avg`**

Run the `file_paths` function to prepare the paths. The default root data folder is *../data*.

In [3]:
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/acp/train_data.fa 
 ../data/acp/prose/train/acp_train_dlm_avg
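
The printed paths suggest a simple folder layout. The sketch below reproduces that layout with `pathlib`; it is inferred only from the output shown in this notebook, so `sketch_paths` and `short_model_name` are hypothetical stand-ins and not the actual `fu.file_paths` implementation. The real function returns three values; the second one is ignored in this notebook and is left out here.

```python
from pathlib import PurePosixPath


def short_model_name(model):
    """Guess the short tag used in folder names (inferred from printed paths)."""
    if model.startswith("prose_"):
        return model[len("prose_"):]  # 'prose_dlm' -> 'dlm'
    return model.split("_")[0]  # 'esm1v_t33_650M_UR90S_1' -> 'esm1v'


def sketch_paths(ptmodel, task, file_base, model, pool, root="../data"):
    """Return (embedding folder, FASTA file) following the layout seen above."""
    base = PurePosixPath(root) / task
    path_fa = base / f"{file_base}_data.fa"
    folder = f"{task}_{file_base}_{short_model_name(model)}_{pool}"
    path_pt = base / ptmodel / file_base / folder
    return str(path_pt), str(path_fa)


path_pt_demo, path_fa_demo = sketch_paths("prose", "acp", "train", "prose_dlm", "avg")
print(path_fa_demo)
print(path_pt_demo)
```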


Extract embeddings with the `read_embeddings` function.

In [4]:
Xs, ys, sequence_id = fu.read_embeddings(path_fa, path_pt, pool, emb_layer)

Shape of embeddings: 		(1378, 6165)
Length of target label list:	1378
Length of sequential ids list:	1378
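
To make the header-handling step more concrete, here is a minimal sketch of how sequence ids and labels might be pulled from FASTA headers. The header format is not shown in this notebook, so the `>Protein_seq_0001|0` pattern, the `|` separator, and the `parse_fasta_headers` helper are all assumptions; the example sequences are made up.

```python
def parse_fasta_headers(lines, sep="|"):
    """Collect (file stem, sequence id, label) from FASTA header lines.

    Assumes headers look like '>Protein_seq_0001|0'. The separator and
    field order are hypothetical and must match the real FASTA files.
    """
    records = []
    for line in lines:
        line = line.strip()
        if not line.startswith(">"):
            continue  # skip sequence lines
        stem = line[1:]  # header[1:], used as the `.pt` file name
        seq_id, label = stem.rsplit(sep, 1)
        records.append((stem, seq_id, int(label)))
    return records


# Made-up FASTA content for illustration only.
example = [
    ">Protein_seq_0001|0",
    "ACDEFGHIKL",
    ">Protein_seq_0002|1",
    "MNPQRSTVWY",
]
for stem, seq_id, label in parse_fasta_headers(example):
    print(stem, seq_id, label)
```

In the real function, each file stem would then be used to load the matching `.pt` file from `path_pt`.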


Check our data with `check_with_df` function.

In [5]:
fu.check_with_df(Xs, ys, sequence_id)

Unnamed: 0,sequence,0,1,2,3,4,5,6,7,8,...,6156,6157,6158,6159,6160,6161,6162,6163,6164,label
0,Protein_seq_0001,0.000000,0.600000,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.000000,...,0.049427,-0.011677,0.035984,-0.001416,-0.109538,0.000317,0.083599,-0.048907,-0.457161,0
1,Protein_seq_0002,0.076923,0.038462,0.0,0.038462,0.000000,0.0,0.0,0.192308,0.038462,...,0.004739,-0.007626,0.041614,0.027862,-0.099025,0.005362,-0.076701,-0.053982,0.004785,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1376,Protein_seq_1377,0.153846,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.076923,0.000000,...,0.072595,-0.014353,0.021571,0.004904,-0.086838,0.006352,-0.084948,-0.045534,0.023370,1
1377,Protein_seq_1378,0.000000,0.333333,0.0,0.000000,0.083333,0.0,0.0,0.000000,0.000000,...,-0.055110,-0.023708,0.021072,-0.004619,-0.132936,0.014448,-0.118819,-0.037327,-0.033501,0
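
The DataFrame preview shows only four rows. As a complementary check over all rows, something like the following could be run on `Xs`, `ys`, and `sequence_id`. The `check_alignment` helper is a hypothetical addition, not part of `file_utilities`, and it assumes `Xs` is a sequence of equal-length rows.

```python
def check_alignment(Xs, ys, sequence_id):
    """Return (n_samples, n_features) after basic consistency checks."""
    n_samples = len(sequence_id)
    if not (len(Xs) == len(ys) == n_samples):
        raise ValueError("Xs, ys and sequence_id differ in length")
    if len(set(sequence_id)) != n_samples:
        raise ValueError("duplicate sequence ids found")
    widths = {len(row) for row in Xs}
    if len(widths) > 1:
        raise ValueError(f"rows have different lengths: {sorted(widths)}")
    n_features = widths.pop() if widths else 0
    return n_samples, n_features


# Tiny made-up example with two samples and three features.
demo_shape = check_alignment(
    [[0.0, 0.6, 0.0], [0.1, 0.0, 0.2]],
    [0, 1],
    ["Protein_seq_0001", "Protein_seq_0002"],
)
print(demo_shape)
```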


- **Pooling Operation:  `max`**

Update arguments and prepare paths

In [6]:
# Update arguments
pool = 'max'
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/acp/train_data.fa 
 ../data/acp/prose/train/acp_train_dlm_max


Extract embeddings with the `read_embeddings` function.

In [7]:
Xs, ys, sequence_id = fu.read_embeddings(path_fa, path_pt, pool, emb_layer)

Shape of embeddings: 		(1378, 6165)
Length of target label list:	1378
Length of sequential ids list:	1378


Check our data with `check_with_df` function.

In [8]:
fu.check_with_df(Xs, ys, sequence_id)

Unnamed: 0,sequence,0,1,2,3,4,5,6,7,8,...,6156,6157,6158,6159,6160,6161,6162,6163,6164,label
0,Protein_seq_0001,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.098071,-0.004856,0.205549,0.016927,-0.032756,0.004301,0.154252,-0.027492,-0.159978,0
1,Protein_seq_0002,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,...,0.105207,0.014237,0.149053,0.089912,0.072833,0.051643,0.004200,-0.014472,0.036598,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1376,Protein_seq_1377,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.119992,-0.005735,0.077855,0.024257,0.014907,0.014248,-0.009560,-0.012431,0.043095,1
1377,Protein_seq_1378,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,-0.021431,0.009526,0.109263,0.001237,0.081146,0.025377,-0.019791,-0.020703,0.062248,0


- **Pooling Operation:  `sum`**

Update arguments and prepare paths

In [9]:
# Update arguments
pool = 'sum'
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/acp/train_data.fa 
 ../data/acp/prose/train/acp_train_dlm_sum


Extract embeddings with the `read_embeddings` function.

In [10]:
Xs, ys, sequence_id = fu.read_embeddings(path_fa, path_pt, pool, emb_layer)

Shape of embeddings: 		(1378, 6165)
Length of target label list:	1378
Length of sequential ids list:	1378


Check our data with `check_with_df` function.

In [11]:
fu.check_with_df(Xs, ys, sequence_id)

Unnamed: 0,sequence,0,1,2,3,4,5,6,7,8,...,6156,6157,6158,6159,6160,6161,6162,6163,6164,label
0,Protein_seq_0001,0.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.494269,-0.116769,0.359840,-0.014159,-1.095376,0.003168,0.835994,-0.489074,-4.571615,0
1,Protein_seq_0002,2.0,1.0,0.0,1.0,0.0,0.0,0.0,5.0,1.0,...,0.123219,-0.198279,1.081959,0.724408,-2.574637,0.139403,-1.994219,-1.403535,0.124409,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1376,Protein_seq_1377,2.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.943738,-0.186590,0.280417,0.063755,-1.128891,0.082574,-1.104320,-0.591940,0.303815,1
1377,Protein_seq_1378,0.0,4.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,-0.661316,-0.284495,0.252859,-0.055431,-1.595233,0.173380,-1.425824,-0.447922,-0.402017,0
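
The three tables above differ only in how per-residue vectors are reduced to one vector per sequence. For example, column `1` of `Protein_seq_0001` is 0.6 with `avg`, 1.0 with `max`, and 6.0 with `sum`, which would be consistent with a sequence of 10 residues. The sketch below shows the three reductions on plain Python lists; `pool_embedding` is an illustrative helper, and whether the embedding scripts compute pooling exactly this way is an assumption.

```python
def pool_embedding(per_residue, pool):
    """Reduce a (length x dim) list of per-residue vectors to one vector."""
    columns = list(zip(*per_residue))  # transpose to dim x length
    if pool == "sum":
        return [sum(col) for col in columns]
    if pool == "max":
        return [max(col) for col in columns]
    if pool in ("avg", "mean"):
        return [sum(col) / len(col) for col in columns]
    raise ValueError(f"unknown pooling operation: {pool}")


# Toy peptide of length 4 with 2 features per residue (made-up numbers).
toy = [[1.0, 0.5], [0.0, -0.5], [1.0, 1.5], [0.0, 0.5]]
for name in ("avg", "max", "sum"):
    print(name, pool_embedding(toy, name))
```

Note that `sum` equals `avg` multiplied by the sequence length, so `sum`-pooled features scale with peptide length while `avg`-pooled features do not.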


<br>

### ProSE MT model - prose_mt

- **Pooling Operation:  `avg`**

Update arguments and prepare paths

In [12]:
# Update arguments
model = 'prose_mt'
pool = 'avg'
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/acp/train_data.fa 
 ../data/acp/prose/train/acp_train_mt_avg


Extract embeddings with the `read_embeddings` function.

In [13]:
Xs, ys, sequence_id = fu.read_embeddings(path_fa, path_pt, pool, emb_layer)

Shape of embeddings: 		(1378, 6165)
Length of target label list:	1378
Length of sequential ids list:	1378


Check our data with `check_with_df` function.

In [14]:
fu.check_with_df(Xs, ys, sequence_id)

Unnamed: 0,sequence,0,1,2,3,4,5,6,7,8,...,6156,6157,6158,6159,6160,6161,6162,6163,6164,label
0,Protein_seq_0001,0.000000,0.600000,0.0,0.000000,0.000000,0.0,0.0,0.000000,0.000000,...,0.088738,-0.001149,0.027113,0.001807,-0.056439,0.008605,0.155694,-0.168206,-0.189653,0
1,Protein_seq_0002,0.076923,0.038462,0.0,0.038462,0.000000,0.0,0.0,0.192308,0.038462,...,0.348926,-0.003707,0.154556,0.133863,-0.135781,-0.054469,-0.060656,0.043199,0.001528,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1376,Protein_seq_1377,0.153846,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.076923,0.000000,...,0.105621,-0.003129,0.086821,0.048828,-0.023954,-0.029000,-0.053837,-0.025096,0.023295,1
1377,Protein_seq_1378,0.000000,0.333333,0.0,0.000000,0.083333,0.0,0.0,0.000000,0.000000,...,-0.039725,-0.015628,0.008452,0.008466,0.049375,0.047414,-0.055997,-0.076221,-0.055401,0


- **Pooling Operation:  `max`**

Update arguments and prepare paths

In [15]:
# Update arguments
model = 'prose_mt'
pool = 'max'
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/acp/train_data.fa 
 ../data/acp/prose/train/acp_train_mt_max


Extract embeddings with the `read_embeddings` function.

In [16]:
Xs, ys, sequence_id = fu.read_embeddings(path_fa, path_pt, pool, emb_layer)

Shape of embeddings: 		(1378, 6165)
Length of target label list:	1378
Length of sequential ids list:	1378


Check our data with `check_with_df` function.

In [17]:
fu.check_with_df(Xs, ys, sequence_id)

Unnamed: 0,sequence,0,1,2,3,4,5,6,7,8,...,6156,6157,6158,6159,6160,6161,6162,6163,6164,label
0,Protein_seq_0001,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.173222,0.000546,0.143507,0.035961,-0.006350,0.064840,0.254236,-0.056681,0.024833,0
1,Protein_seq_0002,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,...,0.747496,0.004670,0.712976,0.431933,0.220262,0.334480,-0.004101,0.113751,0.039629,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1376,Protein_seq_1377,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.265788,0.009886,0.386292,0.137479,0.102106,0.057675,-0.018014,0.014908,0.165961,1
1377,Protein_seq_1378,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.024972,0.039072,0.057057,0.052728,0.337448,0.108013,0.043138,-0.048651,-0.005079,0


- **Pooling Operation:  `sum`**

Update arguments and prepare paths

In [18]:
# Update arguments
model = 'prose_mt'
pool = 'sum'
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/acp/train_data.fa 
 ../data/acp/prose/train/acp_train_mt_sum


Extract embeddings with the `read_embeddings` function.

In [19]:
Xs, ys, sequence_id = fu.read_embeddings(path_fa, path_pt, pool, emb_layer)

Shape of embeddings: 		(1378, 6165)
Length of target label list:	1378
Length of sequential ids list:	1378


Check our data with `check_with_df` function.

In [20]:
fu.check_with_df(Xs, ys, sequence_id)

Unnamed: 0,sequence,0,1,2,3,4,5,6,7,8,...,6156,6157,6158,6159,6160,6161,6162,6163,6164,label
0,Protein_seq_0001,0.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.887380,-0.011494,0.271127,0.018067,-0.564391,0.086048,1.556936,-1.682055,-1.896534,0
1,Protein_seq_0002,2.0,1.0,0.0,1.0,0.0,0.0,0.0,5.0,1.0,...,9.072073,-0.096385,4.018457,3.480438,-3.530303,-1.416185,-1.577045,1.123185,0.039732,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1376,Protein_seq_1377,2.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,1.373076,-0.040675,1.128679,0.634764,-0.311402,-0.376998,-0.699875,-0.326243,0.302838,1
1377,Protein_seq_1378,0.0,4.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,-0.476698,-0.187542,0.101420,0.101587,0.592505,0.568967,-0.671963,-0.914656,-0.664809,0


<br>

## Test Dataset

### ProSE DLM model - prose_dlm

- **Pooling Operation:  `avg`**

Update arguments and prepare paths

In [21]:
# Update arguments
file_base = 'test'
model = 'prose_dlm'
pool = 'avg'
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/acp/test_data.fa 
 ../data/acp/prose/test/acp_test_dlm_avg


Extract embeddings with the `read_embeddings` function.

In [22]:
Xs, ys, sequence_id = fu.read_embeddings(path_fa, path_pt, pool, emb_layer)

Shape of embeddings: 		(344, 6165)
Length of target label list:	344
Length of sequential ids list:	344


Check our data with `check_with_df` function.

In [23]:
fu.check_with_df(Xs, ys, sequence_id)

Unnamed: 0,sequence,0,1,2,3,4,5,6,7,8,...,6156,6157,6158,6159,6160,6161,6162,6163,6164,label
0,Protein_seq_0001,0.050000,0.000000,0.000000,0.000000,0.100000,0.0,0.000000,0.000000,0.000000,...,0.022126,-0.018916,0.013000,-0.000939,-0.154109,0.006128,0.074266,-0.067285,0.009709,0
1,Protein_seq_0002,0.045455,0.045455,0.113636,0.022727,0.136364,0.0,0.045455,0.113636,0.022727,...,-0.158891,-0.029196,0.099248,0.018526,-0.146355,0.057987,-0.036259,-0.045937,-0.048397,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
342,Protein_seq_0343,0.000000,0.029412,0.088235,0.058824,0.176471,0.0,0.029412,0.029412,0.000000,...,-0.074295,-0.027915,0.047587,0.009566,-0.164005,0.045475,-0.086575,0.001548,-0.209905,0
343,Protein_seq_0344,0.066667,0.100000,0.066667,0.033333,0.000000,0.0,0.000000,0.166667,0.033333,...,0.278144,-0.001120,0.041746,0.036685,-0.121415,0.016682,-0.090837,-0.068485,0.018224,1


- **Pooling Operation:  `max`**

Update arguments and prepare paths

In [24]:
# Update arguments
pool = 'max'
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/acp/test_data.fa 
 ../data/acp/prose/test/acp_test_dlm_max


Extract embeddings with the `read_embeddings` function.

In [25]:
Xs, ys, sequence_id = fu.read_embeddings(path_fa, path_pt, pool, emb_layer)

Shape of embeddings: 		(344, 6165)
Length of target label list:	344
Length of sequential ids list:	344


Check our data with `check_with_df` function.

In [26]:
fu.check_with_df(Xs, ys, sequence_id)

Unnamed: 0,sequence,0,1,2,3,4,5,6,7,8,...,6156,6157,6158,6159,6160,6161,6162,6163,6164,label
0,Protein_seq_0001,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.052047,0.002589,0.089642,0.002420,0.016091,0.020113,0.188227,0.008648,0.070660,0
1,Protein_seq_0002,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,...,-0.040933,0.070645,0.368851,0.175326,0.021163,0.419425,0.041212,-0.007430,0.071112,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
342,Protein_seq_0343,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,...,-0.017585,0.041106,0.150218,0.128818,-0.040178,0.112985,-0.016964,0.064841,0.026078,0
343,Protein_seq_0344,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,...,0.415007,0.048386,0.104834,0.157649,0.043800,0.064324,-0.020729,-0.006121,0.074632,1


- **Pooling Operation:  `sum`**

Update arguments and prepare paths

In [27]:
# Update arguments
pool = 'sum'
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/acp/test_data.fa 
 ../data/acp/prose/test/acp_test_dlm_sum


Extract embeddings with the `read_embeddings` function.

In [28]:
Xs, ys, sequence_id = fu.read_embeddings(path_fa, path_pt, pool, emb_layer)

Shape of embeddings: 		(344, 6165)
Length of target label list:	344
Length of sequential ids list:	344


Check our data with `check_with_df` function.

In [29]:
fu.check_with_df(Xs, ys, sequence_id)

Unnamed: 0,sequence,0,1,2,3,4,5,6,7,8,...,6156,6157,6158,6159,6160,6161,6162,6163,6164,label
0,Protein_seq_0001,1.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,...,0.442528,-0.378323,0.260004,-0.018781,-3.082188,0.122568,1.485326,-1.345693,0.194190,0
1,Protein_seq_0002,2.0,2.0,5.0,1.0,6.0,0.0,2.0,5.0,1.0,...,-6.991189,-1.284623,4.366925,0.815132,-6.439634,2.551432,-1.595399,-2.021245,-2.129475,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
342,Protein_seq_0343,0.0,1.0,3.0,2.0,6.0,0.0,1.0,1.0,0.0,...,-2.526030,-0.949109,1.617966,0.325234,-5.576172,1.546156,-2.943535,0.052647,-7.136753,0
343,Protein_seq_0344,2.0,3.0,2.0,1.0,0.0,0.0,0.0,5.0,1.0,...,8.344315,-0.033606,1.252369,1.100552,-3.642461,0.500466,-2.725115,-2.054560,0.546709,1


<br>

### ProSE MT model - prose_mt

- **Pooling Operation:  `avg`**

Update arguments and prepare paths

In [30]:
# Update arguments
model = 'prose_mt'
pool = 'avg'
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/acp/test_data.fa 
 ../data/acp/prose/test/acp_test_mt_avg


Extract embeddings with the `read_embeddings` function.

In [31]:
Xs, ys, sequence_id = fu.read_embeddings(path_fa, path_pt, pool, emb_layer)

Shape of embeddings: 		(344, 6165)
Length of target label list:	344
Length of sequential ids list:	344


Check our data with `check_with_df` function.

In [32]:
fu.check_with_df(Xs, ys, sequence_id)

Unnamed: 0,sequence,0,1,2,3,4,5,6,7,8,...,6156,6157,6158,6159,6160,6161,6162,6163,6164,label
0,Protein_seq_0001,0.050000,0.000000,0.000000,0.000000,0.100000,0.0,0.000000,0.000000,0.000000,...,0.008637,-0.018315,-0.000528,0.011046,0.043528,0.095396,0.122163,-0.279304,-0.041722,0
1,Protein_seq_0002,0.045455,0.045455,0.113636,0.022727,0.136364,0.0,0.045455,0.113636,0.022727,...,-0.050221,0.219214,0.284175,0.040288,-0.111377,0.255813,0.094091,-0.050291,0.624662,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
342,Protein_seq_0343,0.000000,0.029412,0.088235,0.058824,0.176471,0.0,0.029412,0.029412,0.000000,...,-0.007637,0.231036,0.262570,-0.060898,0.028585,0.204784,0.103657,0.060234,-0.554332,0
343,Protein_seq_0344,0.066667,0.100000,0.066667,0.033333,0.000000,0.0,0.000000,0.166667,0.033333,...,0.433051,0.012130,0.150403,0.192943,-0.092854,0.164331,-0.017945,-0.012524,-0.037857,1


- **Pooling Operation:  `max`**

Update arguments and prepare paths

In [33]:
# Update arguments
pool = 'max'
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/acp/test_data.fa 
 ../data/acp/prose/test/acp_test_mt_max


Extract embeddings with the `read_embeddings` function.

In [34]:
Xs, ys, sequence_id = fu.read_embeddings(path_fa, path_pt, pool, emb_layer)

Shape of embeddings: 		(344, 6165)
Length of target label list:	344
Length of sequential ids list:	344


Check our data with `check_with_df` function.

In [35]:
fu.check_with_df(Xs, ys, sequence_id)

Unnamed: 0,sequence,0,1,2,3,4,5,6,7,8,...,6156,6157,6158,6159,6160,6161,6162,6163,6164,label
0,Protein_seq_0001,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.110451,0.049564,0.088130,0.053263,0.444589,0.316590,0.345659,-0.037727,0.191679,0
1,Protein_seq_0002,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,...,0.360737,0.878113,0.999674,0.739617,0.690238,0.998892,0.640663,-0.002014,0.999404,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
342,Protein_seq_0343,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,...,0.474195,0.916303,0.975597,0.788543,0.689950,0.921763,0.774353,0.271772,0.159036,0
343,Protein_seq_0344,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,...,0.704793,0.217215,0.449781,0.718158,0.324645,0.849167,0.006671,0.002739,0.033046,1


- **Pooling Operation:  `sum`**

Update arguments and prepare paths

In [36]:
# Update arguments
pool = 'sum'
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/acp/test_data.fa 
 ../data/acp/prose/test/acp_test_mt_sum


Extract embeddings with the `read_embeddings` function.

In [37]:
Xs, ys, sequence_id = fu.read_embeddings(path_fa, path_pt, pool, emb_layer)

Shape of embeddings: 		(344, 6165)
Length of target label list:	344
Length of sequential ids list:	344


Check our data with `check_with_df` function.

In [38]:
fu.check_with_df(Xs, ys, sequence_id)

Unnamed: 0,sequence,0,1,2,3,4,5,6,7,8,...,6156,6157,6158,6159,6160,6161,6162,6163,6164,label
0,Protein_seq_0001,1.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,...,0.172748,-0.366306,-0.010552,0.220912,0.870551,1.907911,2.443265,-5.586086,-0.834441,0
1,Protein_seq_0002,2.0,2.0,5.0,1.0,6.0,0.0,2.0,5.0,1.0,...,-2.209705,9.645418,12.503687,1.772668,-4.900606,11.255764,4.140015,-2.212802,27.485130,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
342,Protein_seq_0343,0.0,1.0,3.0,2.0,6.0,0.0,1.0,1.0,0.0,...,-0.259670,7.855216,8.927393,-2.070531,0.971876,6.962669,3.524337,2.047955,-18.847296,0
343,Protein_seq_0344,2.0,3.0,2.0,1.0,0.0,0.0,0.0,5.0,1.0,...,12.991516,0.363910,4.512089,5.788303,-2.785632,4.929940,-0.538359,-0.375729,-1.135704,1


## ESM embeddings


## Train Dataset

### ESM ESM-1v model - esm1v_t33_650M_UR90S_1

#### Pooling Operation:  `mean`

Update arguments and prepare paths

In [41]:
# Update arguments
ptmodel = 'esm'
file_base = 'train'
model = 'esm1v_t33_650M_UR90S_1'
emb_layer = 33
pool = 'mean'  
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/acp/train_data.fa 
 ../data/acp/esm/train/acp_train_esm1v_mean


Extract embeddings with the `read_embeddings` function.

In [42]:
Xs, ys, sequence_id = fu.read_embeddings(path_fa, path_pt, pool, emb_layer)

Shape of embeddings: 		(1378, 1280)
Length of target label list:	1378
Length of sequential ids list:	1378
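
For ESM, `emb_layer = 33` and `pool = 'mean'` select one vector per sequence. Assuming the `.pt` files were written by the ESM `extract.py` script with mean pooling included (the notebook does not confirm this), each file holds a dictionary with a `mean_representations` entry keyed by layer number. The sketch below shows that lookup on a plain dictionary standing in for the loaded file; `select_esm_vector` is a hypothetical helper.

```python
def select_esm_vector(record, pool="mean", layer=33):
    """Pick the pooled vector for one layer from an ESM-style record dict.

    Assumed layout: {'label': ..., 'mean_representations': {layer: vector}}.
    """
    key = f"{pool}_representations"
    if key not in record:
        raise KeyError(f"record has no '{key}' entry")
    by_layer = record[key]
    if layer not in by_layer:
        raise KeyError(f"layer {layer} not stored; found {sorted(by_layer)}")
    return by_layer[layer]


# Stand-in for a loaded `.pt` file; real vectors would have 1280 entries.
fake_record = {
    "label": "Protein_seq_0001",
    "mean_representations": {33: [0.1, -0.2, 0.3]},
}
print(select_esm_vector(fake_record, pool="mean", layer=33))
```

The 1280 columns reported above match the hidden size of the 650M-parameter ESM-1v and ESM-1b models.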


Check our data with `check_with_df` function.

In [43]:
fu.check_with_df(Xs, ys, sequence_id)

Unnamed: 0,sequence,0,1,2,3,4,5,6,7,8,...,1271,1272,1273,1274,1275,1276,1277,1278,1279,label
0,Protein_seq_0001,-0.138126,-0.055445,-0.347316,-0.112959,-0.480035,0.491782,0.036463,-0.483421,0.331314,...,0.254845,0.156322,-0.107048,-0.020280,0.090806,-0.086318,-0.391517,0.093338,0.216391,0
1,Protein_seq_0002,0.075502,-0.189331,-0.183814,-0.048120,-0.571827,0.303221,0.303297,-0.116944,0.021927,...,-0.073445,0.039713,-0.011428,-0.048105,-0.040856,0.033303,-0.210972,0.296037,0.002197,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1376,Protein_seq_1377,-0.001074,-0.035816,-0.038935,-0.082159,-0.919749,0.248961,0.149405,-0.195072,0.078797,...,0.111134,-0.016999,-0.357851,-0.063481,-0.047024,-0.197977,-0.235286,0.000254,0.105499,1
1377,Protein_seq_1378,-0.272645,0.294362,-0.213407,0.048606,-0.403573,0.172892,-0.038970,-0.254383,0.015924,...,0.311809,0.027571,-0.398911,-0.062097,0.012328,-0.061342,-0.361377,-0.014396,0.162779,0


<br>

### ESM ESM-1b model - esm1b_t33_650M_UR50S

Update arguments and prepare paths

In [44]:
# Update arguments
model = 'esm1b_t33_650M_UR50S' 
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/acp/train_data.fa 
 ../data/acp/esm/train/acp_train_esm1b_mean


Extract embeddings with the `read_embeddings` function.

In [45]:
Xs, ys, sequence_id = fu.read_embeddings(path_fa, path_pt, pool, emb_layer)

Shape of embeddings: 		(1378, 1280)
Length of target label list:	1378
Length of sequential ids list:	1378


Check our data with `check_with_df` function.

In [46]:
fu.check_with_df(Xs, ys, sequence_id)

Unnamed: 0,sequence,0,1,2,3,4,5,6,7,8,...,1271,1272,1273,1274,1275,1276,1277,1278,1279,label
0,Protein_seq_0001,0.013383,0.263489,0.112976,0.121213,0.068783,0.349227,0.082764,-0.380642,0.089891,...,-0.440336,-0.045218,0.059868,-2.970160,-0.157830,-0.845686,-0.023482,0.066518,-0.478375,0
1,Protein_seq_0002,-0.069985,0.024390,0.022845,0.080284,-0.141245,-0.104143,0.039870,-0.117321,-0.080534,...,-0.126686,0.091434,-0.130513,-0.763611,-0.005721,0.087701,-0.019189,0.102544,-0.125280,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1376,Protein_seq_1377,0.025062,0.139695,0.169002,0.067507,-0.099769,-0.167287,-0.002163,-0.182491,-0.075108,...,-0.150237,-0.064731,-0.098473,-0.682649,0.122551,0.227881,-0.090101,0.066067,0.020852,1
1377,Protein_seq_1378,0.100933,-0.011118,0.240505,0.112940,-0.006494,-0.005311,-0.147712,0.164667,-0.015644,...,0.031900,-0.240745,0.021270,-1.207849,-0.080719,0.181342,0.011719,0.281298,-0.020788,0


<br>

## Test Dataset

### ESM ESM-1v model - esm1v_t33_650M_UR90S_1

#### Pooling Operation:  `mean`

Update arguments and prepare paths

In [47]:
# Update arguments
model = 'esm1v_t33_650M_UR90S_1'
file_base = 'test'
emb_layer = 33
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/acp/test_data.fa 
 ../data/acp/esm/test/acp_test_esm1v_mean


Extract embeddings with the `read_embeddings` function.


In [48]:
Xs, ys, sequence_id = fu.read_embeddings(path_fa, path_pt, pool, emb_layer)

Shape of embeddings: 		(344, 1280)
Length of target label list:	344
Length of sequential ids list:	344


Check our data with `check_with_df` function.

In [49]:
fu.check_with_df(Xs, ys, sequence_id)

Unnamed: 0,sequence,0,1,2,3,4,5,6,7,8,...,1271,1272,1273,1274,1275,1276,1277,1278,1279,label
0,Protein_seq_0001,-0.182235,0.361860,-0.017193,-0.152551,-1.438788,0.026033,0.178496,-0.079849,0.313568,...,0.338911,0.087524,-0.548240,-0.161019,-0.023369,-0.506091,-0.418309,-0.018172,0.284923,0
1,Protein_seq_0002,0.051948,-0.015856,-0.014406,-0.075514,0.156486,-0.287665,-0.086575,0.124207,-0.089903,...,-0.180648,0.164926,-0.133322,-0.102071,0.020742,-0.003653,-0.029525,0.034266,-0.137092,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
342,Protein_seq_0343,0.088324,-0.096162,-0.082784,-0.182584,0.000360,-0.294891,0.199458,0.270405,-0.170535,...,-0.235207,0.198515,-0.039839,-0.114767,-0.057207,-0.056086,-0.135266,0.119055,-0.215575,0
343,Protein_seq_0344,-0.024125,-0.107894,-0.113403,0.020654,-0.271500,0.226581,0.192231,-0.190837,-0.030578,...,-0.062776,0.051986,-0.111523,0.044257,-0.051940,0.033404,-0.161278,0.236552,-0.062306,1


<br>

### ESM ESM-1b model - esm1b_t33_650M_UR50S

Update arguments and prepare paths

In [50]:
# Update arguments
model = 'esm1b_t33_650M_UR50S' 
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/acp/test_data.fa 
 ../data/acp/esm/test/acp_test_esm1b_mean


Extract embeddings with the `read_embeddings` function.

In [51]:
Xs, ys, sequence_id = fu.read_embeddings(path_fa, path_pt, pool, emb_layer)

Shape of embeddings: 		(344, 1280)
Length of target label list:	344
Length of sequential ids list:	344


Check our data with `check_with_df` function.

In [52]:
fu.check_with_df(Xs, ys, sequence_id)

Unnamed: 0,sequence,0,1,2,3,4,5,6,7,8,...,1271,1272,1273,1274,1275,1276,1277,1278,1279,label
0,Protein_seq_0001,0.088257,0.108168,0.066006,0.129144,0.107783,-0.078416,0.010681,-0.019907,-0.035259,...,-0.105909,-0.174954,0.119478,-1.237831,0.052709,0.350217,-0.031515,0.203626,0.100336,0
1,Protein_seq_0002,0.045376,0.277047,-0.057178,-0.032632,-0.105120,-0.078605,-0.132627,0.012729,-0.118805,...,-0.107482,-0.013155,-0.041099,-0.685873,0.156594,-0.003488,0.155243,0.107222,0.136139,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
342,Protein_seq_0343,-0.036402,0.355699,-0.066467,0.041765,0.052619,-0.048664,-0.036612,-0.116422,-0.068359,...,-0.094321,-0.228979,-0.034547,-0.634425,0.142032,0.091536,0.046962,0.011019,0.115468,0
343,Protein_seq_0344,0.038379,0.068516,0.053050,0.119118,-0.184480,-0.086476,0.054324,-0.155769,-0.123679,...,-0.064922,-0.010431,-0.051974,-0.933202,-0.095085,0.154882,0.045132,0.169930,0.018835,1
