# AMP - Reading Embeddings

In this notebook we use previously created `.pt` files to extract embeddings together with the corresponding sequence ids and the target variable (label).
For every combination of pretrained model, pooling operation, and task dataset, the workflow is:

1. set the arguments
2. run the `file_paths` function
   - it sets all required paths according to the arguments
3. run the `read_embeddings` function
   - reads the headers from a FASTA file
   - extracts `sequence_id` and the target variable `ys` from each header
   - each `.pt` file is named `header[1:]` (the header without its leading `>`), and the function loads the corresponding file
   - extracts the embedding vector into an array `Xs`
   - `Xs`, `ys`, and `sequence_id` can now be used in machine learning applications
4. run the `check_with_df` function as a sanity check on the embeddings
   - it uses `sequence_id`, `Xs`, and `ys` to create a Pandas DataFrame
   - displays the first 2 and last 2 rows
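
The combinations covered in this notebook can be enumerated up front. The sketch below only builds the argument tuples, using the values that appear in the cells of this notebook. The list names and the `combos` variable are illustrative and not part of `file_utilities`; the commented lines mark where the `fu.file_paths` and `fu.read_embeddings` calls would go.

```python
from itertools import product

# Values taken from the cells in this notebook
task, file_base = 'amp', 'all_data'
prose_models = ['prose_dlm', 'prose_mt']
prose_pools = ['avg', 'max', 'sum']
esm_models = ['esm1v_t33_650M_UR90S_1', 'esm1b_t33_650M_UR50S']

# One (ptmodel, model, pool, emb_layer) tuple per run
combos = [('prose', m, p, 'layer') for m, p in product(prose_models, prose_pools)]
combos += [('esm', m, 'mean', 33) for m in esm_models]

for ptmodel, model, pool, emb_layer in combos:
    print(ptmodel, model, pool, emb_layer)
    # path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
    # Xs, ys, sequence_id = fu.read_embeddings(path_fa, path_pt, pool, emb_layer)
```

The notebook itself runs each combination in its own cell so that the output of every run stays visible.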

We first read the ProSE (`prose`) embeddings and then the ESM (`esm`) embeddings.

Import file utilities

In [1]:
# Add the scripts folder to the import path so file_utilities can be found
import sys
sys.path.append('../scripts')

import file_utilities as fu

## ProSE embeddings


Initialize arguments

In [7]:
# Define arguments for the file_paths function
task = 'amp'
ptmodel = 'prose'
file_base = 'all_data'
model = 'prose_dlm'
emb_layer = 'layer'
pool = 'avg'  


<br>

## all_data Dataset

### ProSE DLM model - prose_dlm

- **Pooling Operation:  `avg`**

Run the `file_paths` function to prepare the paths. The default root data folder is *../data*.

In [8]:
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/amp/all_data.fa 
 ../data/amp/prose/all_data/amp_all_dlm_avg
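
The source of `file_paths` is not shown here, so the following is only a guess at how the two printed paths could be assembled from the arguments. The `sketch_paths` helper and its tag rules are hypothetical; they are inferred from the folder names printed in this notebook. The real function also returns a third value, which is ignored in this notebook and omitted here.

```python
from pathlib import PurePosixPath

def sketch_paths(ptmodel, task, file_base, model, pool, root='../data'):
    """Hypothetical reconstruction of the two paths printed in this notebook."""
    task_dir = PurePosixPath(root) / task
    path_fa = task_dir / f'{file_base}.fa'
    # Short tags guessed from the printed folder names:
    # 'all_data' -> 'all', 'prose_dlm' -> 'dlm', 'esm1v_t33_650M_UR90S_1' -> 'esm1v'
    data_tag = file_base.split('_')[0]
    model_tag = model.split('_')[-1] if ptmodel == 'prose' else model.split('_')[0]
    path_pt = task_dir / ptmodel / file_base / f'{task}_{data_tag}_{model_tag}_{pool}'
    return path_pt, path_fa

path_pt_sketch, path_fa_sketch = sketch_paths('prose', 'amp', 'all_data', 'prose_dlm', 'avg')
print('', path_fa_sketch, '\n', path_pt_sketch)
```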


Extract embeddings with the `read_embeddings` function.

In [9]:
Xs, ys, sequence_id = fu.read_embeddings(path_fa, path_pt, pool, emb_layer)

Shape of embeddings: 		(4042, 6165)
Length of target label list:	4042
Length of sequential ids list:	4042
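
The header-handling part of `read_embeddings`, as described in the workflow at the top, can be sketched as follows. The header layout is not shown in this notebook, so the `>id|label` format, the `parse_fasta_headers` helper, and the `.pt` suffix handling are assumptions made for illustration. Loading the tensors themselves is left out.

```python
import io

def parse_fasta_headers(handle, sep='|'):
    """Return (sequence_id, ys, pt_names) from the header lines of a FASTA file.

    Assumes a hypothetical header layout '>id<sep>label'.
    """
    sequence_id, ys, pt_names = [], [], []
    for line in handle:
        if not line.startswith('>'):
            continue  # skip sequence lines
        header = line.strip()
        name = header[1:]  # drop the leading '>'; used as the .pt file name
        seq_id, label = name.rsplit(sep, 1)
        sequence_id.append(seq_id)
        ys.append(int(label))
        pt_names.append(f'{name}.pt')
    return sequence_id, ys, pt_names

# Toy FASTA text: the residues are placeholders, not the real sequences
fasta = io.StringIO('>AP02484|1\nGIGK\n>UniRef50_Q54H44|0\nMKLV\n')
ids, labels, files = parse_fasta_headers(fasta)
print(ids, labels, files)
```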


Check our data with the `check_with_df` function.

In [10]:
fu.check_with_df(Xs, ys, sequence_id)

Unnamed: 0,sequence,0,1,2,3,4,5,6,7,8,...,6156,6157,6158,6159,6160,6161,6162,6163,6164,label
0,AP02484,0.190476,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.190476,0.000000,...,0.125029,0.000287,0.031710,0.045591,-0.068033,0.031241,-0.075914,-0.014122,0.013585,1
1,AP02630,0.000000,0.023256,0.046512,0.023256,0.162791,0.046512,0.000000,0.093023,0.046512,...,-0.139132,-0.051193,0.108947,-0.014495,-0.107359,0.016725,0.050440,-0.053858,-0.121933,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4040,UniRef50_Q54H44,0.000000,0.000000,0.000000,0.066667,0.000000,0.066667,0.066667,0.000000,0.133333,...,-0.064352,-0.036118,0.038847,-0.005325,-0.162470,-0.013375,0.012879,0.061577,0.170036,0
4041,UniRef50_Q50L39,0.000000,0.000000,0.040000,0.040000,0.000000,0.080000,0.080000,0.000000,0.080000,...,0.165536,-0.026004,0.047379,-0.002521,-0.150257,0.028818,-0.092182,0.051275,0.055207,0
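
A preview like the one above can be built with a few lines of pandas. This is a minimal sketch of the idea, not the actual `check_with_df` implementation; the `preview_embeddings` helper and the toy inputs are hypothetical.

```python
import numpy as np
import pandas as pd

def preview_embeddings(Xs, ys, sequence_id, n=2):
    """Combine ids, embeddings and labels, then keep the first and last n rows."""
    df = pd.DataFrame(np.asarray(Xs))            # one column per embedding dimension
    df.insert(0, 'sequence', list(sequence_id))  # ids become the first column
    df['label'] = list(ys)                       # target variable as the last column
    return pd.concat([df.head(n), df.tail(n)])

# Toy data: 5 sequences with 3-dimensional embeddings
toy_Xs = np.arange(15, dtype=float).reshape(5, 3)
toy_ys = [1, 1, 0, 0, 0]
toy_ids = ['s0', 's1', 's2', 's3', 's4']
preview = preview_embeddings(toy_Xs, toy_ys, toy_ids)
print(preview)
```

Keeping the original row index in the preview makes it easy to confirm that the last row number matches the number of sequences minus one.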


<br> 

- **Pooling Operation:  `max`**

Update arguments and prepare paths

In [19]:
# Update arguments
pool = 'max'
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/amp/all_data.fa 
 ../data/amp/prose/all_data/amp_all_dlm_max


Extract embeddings with the `read_embeddings` function.

In [20]:
Xs, ys, sequence_id = fu.read_embeddings(path_fa, path_pt, pool, emb_layer)

Shape of embeddings: 		(4042, 6165)
Length of target label list:	4042
Length of sequential ids list:	4042


Check our data with the `check_with_df` function.

In [21]:
fu.check_with_df(Xs, ys, sequence_id)

Unnamed: 0,sequence,0,1,2,3,4,5,6,7,8,...,6156,6157,6158,6159,6160,6161,6162,6163,6164,label
0,AP02484,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.242918,0.013340,0.090424,0.114423,0.050711,0.068568,0.031580,0.012137,0.053268,1
1,AP02630,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,...,-0.019129,0.008187,0.199729,-0.001016,-0.019748,0.076221,0.104859,-0.019677,0.061204,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4040,UniRef50_Q54H44,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,...,-0.037907,0.014221,0.204487,0.017825,0.011032,0.018597,0.041912,0.149302,0.349932,0
4041,UniRef50_Q50L39,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,...,0.259932,0.004266,0.139227,0.094924,0.018846,0.065851,-0.008283,0.091745,0.124798,0


<br> 

- **Pooling Operation:  `sum`**

Update arguments and prepare paths

In [16]:
# Update arguments
pool = 'sum'
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/amp/all_data.fa 
 ../data/amp/prose/all_data/amp_all_dlm_sum


Extract embeddings with the `read_embeddings` function.

In [17]:
Xs, ys, sequence_id = fu.read_embeddings(path_fa, path_pt, pool, emb_layer)

Shape of embeddings: 		(4042, 6165)
Length of target label list:	4042
Length of sequential ids list:	4042


Check our data with the `check_with_df` function.

In [18]:
fu.check_with_df(Xs, ys, sequence_id)

Unnamed: 0,sequence,0,1,2,3,4,5,6,7,8,...,6156,6157,6158,6159,6160,6161,6162,6163,6164,label
0,AP02484,4.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,...,2.625609,0.006037,0.665917,0.957419,-1.428702,0.656070,-1.594204,-0.296553,0.285287,1
1,AP02630,0.0,1.0,2.0,1.0,7.0,2.0,0.0,4.0,2.0,...,-5.982681,-2.201300,4.684705,-0.623292,-4.616421,0.719193,2.168899,-2.315881,-5.243106,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4040,UniRef50_Q54H44,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,2.0,...,-0.965285,-0.541764,0.582708,-0.079879,-2.437053,-0.200628,0.193190,0.923662,2.550547,0
4041,UniRef50_Q50L39,0.0,0.0,1.0,1.0,0.0,2.0,2.0,0.0,2.0,...,4.138393,-0.650102,1.184463,-0.063016,-3.756428,0.720450,-2.304556,1.281865,1.380174,0
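
Comparing the first column for `AP02484` across the three tables above (0.190476 for `avg`, 1.0 for `max`, 4.0 for `sum`) is consistent with a per-residue 0/1 feature that is active at 4 of 21 positions. The sketch below reproduces that arithmetic on a toy per-residue matrix. The `pool_residues` helper and the toy data are illustrative, and the sequence length of 21 is inferred from 4 / 0.190476, not read from the FASTA file.

```python
def pool_residues(per_residue, op):
    """Collapse an L x D list of per-residue vectors into one D-vector."""
    columns = list(zip(*per_residue))  # transpose to D columns of length L
    if op == 'sum':
        return [sum(col) for col in columns]
    if op == 'max':
        return [max(col) for col in columns]
    if op == 'avg':
        return [sum(col) / len(col) for col in columns]
    raise ValueError(f'unknown pooling operation: {op}')

# Toy input: 21 positions, 2 features; feature 0 is active at 4 positions
toy = [[1.0, 0.0] if i < 4 else [0.0, 0.0] for i in range(21)]

for op in ('avg', 'max', 'sum'):
    print(op, [round(v, 6) for v in pool_residues(toy, op)])
```

Whatever the pooling operation, the result has one value per feature, which is why the embedding shape stays (4042, 6165) in all three runs.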


<br>

### ProSE MT model - prose_mt

- **Pooling Operation:  `avg`**

Update arguments and prepare paths

In [22]:
# Update arguments
model = 'prose_mt'
pool = 'avg'
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/amp/all_data.fa 
 ../data/amp/prose/all_data/amp_all_mt_avg


Extract embeddings with the `read_embeddings` function.

In [23]:
Xs, ys, sequence_id = fu.read_embeddings(path_fa, path_pt, pool, emb_layer)

Shape of embeddings: 		(4042, 6165)
Length of target label list:	4042
Length of sequential ids list:	4042


Check our data with the `check_with_df` function.

In [24]:
fu.check_with_df(Xs, ys, sequence_id)

Unnamed: 0,sequence,0,1,2,3,4,5,6,7,8,...,6156,6157,6158,6159,6160,6161,6162,6163,6164,label
0,AP02484,0.190476,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.190476,0.000000,...,0.182407,0.006507,0.025610,0.087097,-0.059270,0.188868,-0.076926,0.024240,-0.030708,1
1,AP02630,0.000000,0.023256,0.046512,0.023256,0.162791,0.046512,0.000000,0.093023,0.046512,...,-0.166235,-0.034336,0.100167,-0.004404,-0.064510,0.046750,-0.035658,-0.019140,-0.454555,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4040,UniRef50_Q54H44,0.000000,0.000000,0.000000,0.066667,0.000000,0.066667,0.066667,0.000000,0.133333,...,-0.109032,-0.063507,0.035962,-0.021260,-0.209445,-0.016908,0.084930,0.083419,0.250840,0
4041,UniRef50_Q50L39,0.000000,0.000000,0.040000,0.040000,0.000000,0.080000,0.080000,0.000000,0.080000,...,0.203517,-0.072818,0.054570,0.004134,-0.096326,0.040333,-0.099720,0.076378,0.122510,0


- **Pooling Operation:  `max`**

Update arguments and prepare paths

In [25]:
# Update arguments
pool = 'max'
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/amp/all_data.fa 
 ../data/amp/prose/all_data/amp_all_mt_max


Extract embeddings with the `read_embeddings` function.

In [26]:
Xs, ys, sequence_id = fu.read_embeddings(path_fa, path_pt, pool, emb_layer)

Shape of embeddings: 		(4042, 6165)
Length of target label list:	4042
Length of sequential ids list:	4042


Check our data with the `check_with_df` function.

In [27]:
fu.check_with_df(Xs, ys, sequence_id)

Unnamed: 0,sequence,0,1,2,3,4,5,6,7,8,...,6156,6157,6158,6159,6160,6161,6162,6163,6164,label
0,AP02484,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.396872,0.109051,0.278176,0.410893,0.123763,0.656348,0.031152,0.061782,0.015913,1
1,AP02630,0.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,...,-0.009511,0.202685,0.726645,0.072782,0.035308,0.227542,0.012614,0.052842,-0.027416,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4040,UniRef50_Q54H44,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,...,-0.046611,0.179802,0.161669,0.026548,0.063300,0.068668,0.165022,0.219370,0.546242,0
4041,UniRef50_Q50L39,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,...,0.419781,0.025045,0.210932,0.118358,0.336438,0.154400,-0.013796,0.157182,0.396947,0


- **Pooling Operation:  `sum`**

Update arguments and prepare paths

In [28]:
# Update arguments
pool = 'sum'
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/amp/all_data.fa 
 ../data/amp/prose/all_data/amp_all_mt_sum


Extract embeddings with the `read_embeddings` function.

In [29]:
Xs, ys, sequence_id = fu.read_embeddings(path_fa, path_pt, pool, emb_layer)

Shape of embeddings: 		(4042, 6165)
Length of target label list:	4042
Length of sequential ids list:	4042


Check our data with the `check_with_df` function.

In [30]:
fu.check_with_df(Xs, ys, sequence_id)

Unnamed: 0,sequence,0,1,2,3,4,5,6,7,8,...,6156,6157,6158,6159,6160,6161,6162,6163,6164,label
0,AP02484,4.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,...,3.830548,0.136643,0.537802,1.829039,-1.244662,3.966237,-1.615452,0.509042,-0.644860,1
1,AP02630,0.0,1.0,2.0,1.0,7.0,2.0,0.0,4.0,2.0,...,-7.148090,-1.476431,4.307183,-0.189360,-2.773945,2.010263,-1.533286,-0.823000,-19.545868,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4040,UniRef50_Q54H44,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,2.0,...,-1.635473,-0.952612,0.539435,-0.318907,-3.141680,-0.253616,1.273956,1.251285,3.762596,0
4041,UniRef50_Q50L39,0.0,0.0,1.0,1.0,0.0,2.0,2.0,0.0,2.0,...,5.087928,-1.820445,1.364251,0.103362,-2.408159,1.008336,-2.492997,1.909452,3.062747,0


## ESM embeddings


### ESM-1v model - esm1v_t33_650M_UR90S_1

- **Pooling Operation:  `mean`**

Update arguments and prepare paths

In [34]:
# Update arguments
ptmodel = 'esm'
model = 'esm1v_t33_650M_UR90S_1'
emb_layer = 33
pool = 'mean'  
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/amp/all_data.fa 
 ../data/amp/esm/all_data/amp_all_esm1v_mean


Extract embeddings with the `read_embeddings` function.

In [35]:
Xs, ys, sequence_id = fu.read_embeddings(path_fa, path_pt, pool, emb_layer)

Shape of embeddings: 		(4042, 1280)
Length of target label list:	4042
Length of sequential ids list:	4042


Check our data with the `check_with_df` function.

In [36]:
fu.check_with_df(Xs, ys, sequence_id)

Unnamed: 0,sequence,0,1,2,3,4,5,6,7,8,...,1271,1272,1273,1274,1275,1276,1277,1278,1279,label
0,AP02484,0.084339,-0.085008,-0.160018,-0.012769,-0.580508,0.268068,0.306808,-0.176000,0.059967,...,-0.183808,0.071751,-0.113516,-0.024507,-0.019659,0.000653,-0.217731,0.292975,-0.068536,1
1,AP02630,-0.210203,-0.046939,-0.005811,-0.270891,-0.189393,-0.238038,0.074845,0.102332,-0.082589,...,-0.124119,0.170866,0.100142,-0.311002,-0.018382,0.080332,-0.088743,0.118312,-0.096608,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4040,UniRef50_Q54H44,-0.194847,0.107679,-0.136121,-0.266417,-1.054611,0.052423,-0.163465,-0.056415,0.215633,...,0.346126,0.259780,-0.328390,-0.254156,0.014044,-0.250382,-0.235456,0.126883,0.005950,0
4041,UniRef50_Q50L39,-0.111752,0.028765,-0.154828,-0.061214,-0.533987,0.003171,0.089367,-0.115539,-0.045581,...,0.367201,0.056130,-0.168089,-0.124638,0.049767,-0.107238,0.013631,0.112440,0.005164,0


<br>

### ESM-1b model - esm1b_t33_650M_UR50S

Update the `model` argument (pooling stays `mean`) and prepare paths

In [37]:
# Update arguments
model = 'esm1b_t33_650M_UR50S' 
# Prepare paths
path_pt, _, path_fa = fu.file_paths(ptmodel, task, file_base, model, pool)
print('', path_fa, '\n', path_pt)

 ../data/amp/all_data.fa 
 ../data/amp/esm/all_data/amp_all_esm1b_mean


Extract embeddings with the `read_embeddings` function.

In [38]:
Xs, ys, sequence_id = fu.read_embeddings(path_fa, path_pt, pool, emb_layer)

Shape of embeddings: 		(4042, 1280)
Length of target label list:	4042
Length of sequential ids list:	4042


Check our data with the `check_with_df` function.

In [39]:
fu.check_with_df(Xs, ys, sequence_id)

Unnamed: 0,sequence,0,1,2,3,4,5,6,7,8,...,1271,1272,1273,1274,1275,1276,1277,1278,1279,label
0,AP02484,-0.082861,0.065592,0.070587,0.110821,-0.141855,-0.203780,0.052344,-0.161724,-0.092577,...,-0.084342,0.049048,-0.102408,-0.849814,0.030440,0.134961,0.004792,0.063991,0.029604,1
1,AP02630,-0.060575,0.146167,0.096243,0.054418,-0.027555,-0.068356,0.060887,0.159950,-0.125833,...,0.054181,-0.002241,-0.018633,-0.634573,-0.125386,0.059756,0.172722,0.095655,0.135376,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4040,UniRef50_Q54H44,-0.049084,0.136789,0.004329,-0.079083,0.215748,0.154155,0.046304,-0.014131,0.209034,...,-0.196847,-0.264772,-0.088322,-1.158075,-0.042665,0.077199,0.010362,-0.058461,-0.126261,0
4041,UniRef50_Q50L39,0.071070,0.176068,-0.007601,0.141909,-0.021336,-0.170452,-0.089157,0.042029,-0.075111,...,0.088693,-0.130390,0.119437,-1.621088,0.175110,0.131705,-0.089250,0.065559,0.045723,0
