# Embedding

> Classes and functions to create different types of embeddings from one-hot to pre-trained.

[Basics of Protein Structure](https://proteopedia.org/wiki/index.php/Basics_of_Protein_Structure)
- [Amino Acids](https://proteopedia.org/wiki/index.php/Amino_Acids)
- [Peptide](https://proteopedia.org/wiki/index.php/Peptide)
    - [Stephen Mills/Peptide tutorial 1](https://proteopedia.org/wiki/index.php/User:Stephen_Mills/Peptide_tutorial_1)
    - [Stephen Mills/Peptide tutorial 2](https://proteopedia.org/wiki/index.php/User:Stephen_Mills/Peptide_tutorial_2)

**Encoding**
- [What is multi-hot encoding](https://stats.stackexchange.com/questions/467633/what-exactly-is-multi-hot-encoding-and-how-is-it-different-from-one-hot)
- [Muti-hot encoding vs Label-Encoding](https://datascience.stackexchange.com/questions/37234/muti-hot-encoding-vs-label-encoding)
    - Also talks about Mean-Encoding (aka target encoding or likelihood encoding)
- [Auto-encoder vs embedding layer vs multi-hot encoding](https://datascience.stackexchange.com/questions/36564/auto-encoder-to-condense-pre-process-large-one-hot-input-vectors/37218#37218)

In [None]:
#export
import numpy as np
import pandas as pd

from peptide.preprocessing.data import *
from sklearn.preprocessing import MultiLabelBinarizer, OneHotEncoder

**Amino acid letters: the full uppercase alphabet, used as a superset of the one-letter codes in the literature above**

In [None]:

import string


amino_acids = list(string.ascii_uppercase)
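
For reference, the alphabet contains six letters beyond the 20 standard one-letter amino acid codes. The sketch below lists them with their usual meanings; the names `STANDARD_AMINO_ACIDS` and `EXTENDED_CODES` are illustrative and not part of the `peptide` package.

```python
import string

# The 20 standard amino acids in one-letter code
STANDARD_AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")

# Remaining letters: ambiguity codes and rare residues (illustrative mapping)
EXTENDED_CODES = {
    "B": "Asp or Asn",
    "J": "Leu or Ile",
    "O": "pyrrolysine",
    "U": "selenocysteine",
    "X": "unknown residue",
    "Z": "Glu or Gln",
}

# Letters of the alphabet that are not standard amino acid codes
extra_letters = sorted(set(string.ascii_uppercase) - set(STANDARD_AMINO_ACIDS))
print(extra_letters)
```

Using the whole alphabet keeps the encoder simple, at the cost of a few columns that may never be populated.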

Load data

In [None]:
merged_train_df, merged_test_df = get_all_data(merge=True)
acp_train_df, acp_test_df, amp_train_df, amp_test_df, dna_bind_train_df, dna_bind_test_df = get_all_data()

In [None]:
acp_train_df.head(5)

Unnamed: 0,sequence,label_acp
0,RRWWRRWRRW,0
1,GWKSVFRKAKKVGKTVGGLALDHYLG,0
2,ALWKTMLKKLGTMALHAGKAALGAAADTISQGTQ,1
3,GLFDVIKKVAAVIGGL,1
4,VAKLLAKLAKKVL,1


## Experiments

### Small Dataset
Get a small dataset for experiments

In [None]:
small_df = acp_train_df[:10].copy()

In [None]:
small_df

Unnamed: 0,sequence,label_acp
0,RRWWRRWRRW,0
1,GWKSVFRKAKKVGKTVGGLALDHYLG,0
2,ALWKTMLKKLGTMALHAGKAALGAAADTISQGTQ,1
3,GLFDVIKKVAAVIGGL,1
4,VAKLLAKLAKKVL,1
5,IIGHLIKTALGFLGL,0
6,FLPLLASLFSRLL,1
7,WFKKIPKFLHLAKKF,1
8,ATCDLLSKWNWNHTACAGHCIAKGFKGGYCNDKAVCVCRN,1
9,NIPQLTPTP,0


In [None]:
small_df, small_features, small_labels = extract_features_labels(small_df)

In [None]:
small_df

Unnamed: 0,sequence,label_acp,lenghts
0,"[R, R, W, W, R, R, W, R, R, W]",0,10
1,"[G, W, K, S, V, F, R, K, A, K, K, V, G, K, T, ...",0,26
2,"[A, L, W, K, T, M, L, K, K, L, G, T, M, A, L, ...",1,34
3,"[G, L, F, D, V, I, K, K, V, A, A, V, I, G, G, L]",1,16
4,"[V, A, K, L, L, A, K, L, A, K, K, V, L]",1,13
5,"[I, I, G, H, L, I, K, T, A, L, G, F, L, G, L]",0,15
6,"[F, L, P, L, L, A, S, L, F, S, R, L, L]",1,13
7,"[W, F, K, K, I, P, K, F, L, H, L, A, K, K, F]",1,15
8,"[A, T, C, D, L, L, S, K, W, N, W, N, H, T, A, ...",1,40
9,"[N, I, P, Q, L, T, P, T, P]",0,9


In [None]:
small_features

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,30,31,32,33,34,35,36,37,38,39
0,R,R,W,W,R,R,W,R,R,W,...,,,,,,,,,,
1,G,W,K,S,V,F,R,K,A,K,...,,,,,,,,,,
2,A,L,W,K,T,M,L,K,K,L,...,Q,G,T,Q,,,,,,
3,G,L,F,D,V,I,K,K,V,A,...,,,,,,,,,,
4,V,A,K,L,L,A,K,L,A,K,...,,,,,,,,,,
5,I,I,G,H,L,I,K,T,A,L,...,,,,,,,,,,
6,F,L,P,L,L,A,S,L,F,S,...,,,,,,,,,,
7,W,F,K,K,I,P,K,F,L,H,...,,,,,,,,,,
8,A,T,C,D,L,L,S,K,W,N,...,N,D,K,A,V,C,V,C,R,N
9,N,I,P,Q,L,T,P,T,P,,...,,,,,,,,,,


In [None]:
small_labels

Unnamed: 0,label_acp
0,0
1,0
2,1
3,1
4,1
5,0
6,1
7,1
8,1
9,0


### One-hot Encoding

Amino acids defined by simply listing the letters of the alphabet

In [None]:
ohe_example = OneHotEncoder()

In [None]:
ohe_example.fit(np.array(amino_acids).reshape(-1, 1))

OneHotEncoder()

In [None]:
ohe_example.categories_

[array(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
        'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'],
       dtype='<U1')]

One-hot encoding with sklearn example
- https://datascience.stackexchange.com/a/71831

Example using `pd.get_dummies()` 
- https://datascience.stackexchange.com/questions/71804/how-to-perform-one-hot-encoding-on-multiple-categorical-columns
- Easier to use than sklearn's `OneHotEncoder` in this case, as discussed in the linked answers
- Open question: how to transform test data with the same encoding, since `pd.get_dummies()` leaves no fitted encoder to reuse


**TODO**: decide whether padding is needed. Update: `pd.get_dummies()` simply ignores the `None` values, so padding may be unnecessary with that approach.
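
A minimal sketch of that behaviour, using two made-up sequences of different lengths (the names `toy_features` and `toy_ohe` are illustrative). The shorter row gets `None` in the last position, and `pd.get_dummies()` leaves every dummy column for that position at zero. `dtype=int` is passed only to make the output type explicit across pandas versions.

```python
import pandas as pd

# Ragged rows: pandas fills the missing position of the shorter row with None
toy_features = pd.DataFrame([list("RRW"), list("GW")])

# One dummy column per (position, residue) pair that occurs in the data
toy_ohe = pd.get_dummies(toy_features, dtype=int)

print(list(toy_ohe.columns))
print(toy_ohe)
```

Row 1 has no entry set for position 2, which is how the missing residue shows up in the encoding.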

In [None]:
# features_df.fillna('pad')

#### Using sklearn

Use `sparse=True` in the actual code; dense output is used here only so the result is easy to inspect

In [None]:
ohe = OneHotEncoder(sparse=False)
transformed_data = ohe.fit_transform(small_features)
transformed_df = pd.DataFrame(transformed_data, index=small_features.index)

In [None]:
transformed_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,167,168,169,170,171,172,173,174,175,176
0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0
1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0
3,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0
4,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,...,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0
5,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0
6,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0
8,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
9,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0


In [None]:
transformed_data

array([[0., 0., 0., ..., 1., 0., 1.],
       [0., 0., 1., ..., 1., 0., 1.],
       [1., 0., 0., ..., 1., 0., 1.],
       ...,
       [0., 0., 0., ..., 1., 0., 1.],
       [1., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 1., 0., 1.]])
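
The difference between sparse and dense output can be sketched on a toy array (three made-up four-residue peptides). The example relies on the encoder's default, which returns a SciPy sparse matrix, so it avoids the keyword entirely: the notebook spells it `sparse`, while scikit-learn 1.2 and later call it `sparse_output`.

```python
import numpy as np
from scipy import sparse as sp
from sklearn.preprocessing import OneHotEncoder

# One residue per column, one peptide per row (toy data)
positions = np.array([list("RRWW"), list("GWKS"), list("ALWK")])

encoder = OneHotEncoder()            # default output is a SciPy sparse matrix
encoded = encoder.fit_transform(positions)
dense = encoded.toarray()            # dense copy, convenient for inspection

print(sp.issparse(encoded))          # True
print(encoded.nnz, "stored values vs", dense.size, "dense cells")
```

Each row stores exactly one value per position, so the gap between stored values and dense cells widens as more residues appear at each position.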

#### Using Pandas `pd.get_dummies()`

In [None]:
features_ohe = pd.get_dummies(small_features, sparse=True)
features_ohe

Unnamed: 0,0_A,0_F,0_G,0_I,0_N,0_R,0_V,0_W,1_A,1_F,...,32_K,32_T,33_A,33_Q,34_V,35_C,36_V,37_C,38_R,39_N
0,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,1,0,1,0,0,0,0,0,0
3,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
8,1,0,0,0,0,0,0,0,0,0,...,1,0,1,0,1,1,1,1,1,1
9,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Using `sparse=True` reduced memory usage

In [None]:
features_ohe.memory_usage()

Index    128
0_A       10
0_F        5
0_G       10
0_I        5
        ... 
35_C       5
36_V       5
37_C       5
38_R       5
39_N       5
Length: 147, dtype: int64
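
To put a number on the saving, the sketch below compares dense and sparse `pd.get_dummies()` output on randomly generated sequences. The data, the seed and the names `dense_ohe` and `sparse_ohe` are assumptions made for illustration; `dtype=np.uint8` is fixed so both frames use one byte per stored value.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
letters = list("ACDEFGHIKLMNPQRSTVWY")

# 2000 random peptides of length 10, one residue per column
toy_features = pd.DataFrame(rng.choice(letters, size=(2000, 10)))

dense_ohe = pd.get_dummies(toy_features, dtype=np.uint8)
sparse_ohe = pd.get_dummies(toy_features, sparse=True, dtype=np.uint8)

# Total bytes used by the columns, excluding the index
dense_bytes = dense_ohe.memory_usage(index=False).sum()
sparse_bytes = sparse_ohe.memory_usage(index=False).sum()
print(f"dense: {dense_bytes} bytes, sparse: {sparse_bytes} bytes")
```

A sparse column also stores the position of each non-zero value, so the saving depends on how few entries in a column are set.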

In [None]:
pd.set_option('display.max_columns', 150)

In [None]:
small_df.loc[0:2]

Unnamed: 0,sequence,label_acp,lenghts
0,"[R, R, W, W, R, R, W, R, R, W]",0,10
1,"[G, W, K, S, V, F, R, K, A, K, K, V, G, K, T, ...",0,26
2,"[A, L, W, K, T, M, L, K, K, L, G, T, M, A, L, ...",1,34


In [None]:
features_ohe.loc[0:2]

Unnamed: 0,0_A,0_F,0_G,0_I,0_N,0_R,0_V,0_W,1_A,1_F,1_I,1_L,1_R,1_T,1_W,2_C,2_F,2_G,2_K,2_P,2_W,3_D,3_H,3_K,3_L,3_Q,3_S,3_W,4_I,4_L,4_R,4_T,4_V,5_A,5_F,5_I,5_L,5_M,5_P,5_R,5_T,6_K,6_L,6_P,6_R,6_S,6_W,7_F,7_K,7_L,7_R,7_T,8_A,8_F,8_K,8_L,8_P,8_R,8_V,8_W,9_A,9_H,9_K,9_L,9_N,9_S,9_W,10_A,10_G,10_K,10_L,10_R,10_W,11_A,11_F,11_L,11_N,11_T,11_V,12_G,12_H,12_I,12_K,12_L,12_M,13_A,13_G,13_K,13_T,14_A,14_F,14_G,14_L,14_T,15_C,15_H,15_L,15_V,16_A,16_G,17_G,18_H,18_K,18_L,19_A,19_C,20_A,20_I,20_L,21_A,21_D,21_L,22_G,22_H,22_K,23_A,23_G,23_Y,24_A,24_F,24_L,25_A,25_G,25_K,26_D,26_G,27_G,27_T,28_I,28_Y,29_C,29_S,30_N,30_Q,31_D,31_G,32_K,32_T,33_A,33_Q,34_V,35_C,36_V,37_C,38_R,39_N
0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,1,1,0,0,1,1,0,0,0,1,0,1,0,0,1,0,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,1,0,0,1,0,1,0,1,0,1,0,1,0,0,0,0,1,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,1,0,0,1,0,1,0,1,0,1,0,1,0,0,0,0,0,0


In [None]:
pd.reset_option('display.max_columns')

### Multi-hot Encoding

Based on the following ideas
- [How to one-hot-encode from a pandas column containing a list?](https://stackoverflow.com/questions/45312377/how-to-one-hot-encode-from-a-pandas-column-containing-a-list)
- [DataFrame with multiple values in each column. How to one-hot encode them under the main heading?](https://stackoverflow.com/questions/67108935/dataframe-with-multiple-values-in-each-column-how-to-one-hot-encode-them-under)
    - This turns into the one-hot encoding in the previous section

Comparing the learned list of amino acids (learned from the entire *merged* dataset) to the simple alphabetical list defined above

In [None]:
list(merged_train_df.sequence)[:5]

['RRWWRRWRRW',
 'GWKSVFRKAKKVGKTVGGLALDHYLG',
 'ALWKTMLKKLGTMALHAGKAALGAAADTISQGTQ',
 'GLFDVIKKVAAVIGGL',
 'VAKLLAKLAKKVL']

In [None]:
mlb = MultiLabelBinarizer()
mlb.fit(list(merged_train_df.sequence))

MultiLabelBinarizer()

In [None]:
mlb.classes_

array(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L', 'M', 'N',
       'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y'],
      dtype=object)

In [None]:
print(amino_acids)

['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']


`J` and `Z` are missing from the learned list
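
The same comparison can be done programmatically with a set difference. This is a small sketch on made-up sequences, so the letters it reports differ from the result on the merged dataset; `toy_sequences` and `toy_mlb` are illustrative names.

```python
import string
from sklearn.preprocessing import MultiLabelBinarizer

toy_sequences = ["RRWWRRW", "GWKSVF", "ALWKTM"]

# Each string is treated as an iterable of single-letter labels
toy_mlb = MultiLabelBinarizer().fit(toy_sequences)

# Alphabet letters that never occur in the fitted data
missing = sorted(set(string.ascii_uppercase) - set(toy_mlb.classes_))
print(missing)
```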

The following is based on this example:
- [How to one-hot-encode from a pandas column containing a list?](https://stackoverflow.com/questions/45312377/how-to-one-hot-encode-from-a-pandas-column-containing-a-list)

In [None]:
mlb = MultiLabelBinarizer(sparse_output = True)

In [None]:
features_mlb = pd.DataFrame.sparse.from_spmatrix(
    mlb.fit_transform(small_df['sequence']),
    index = small_df.index,
    columns = mlb.classes_
)

In [None]:
features_mlb

Unnamed: 0,A,C,D,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y
0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0
1,1,0,1,1,1,1,0,1,1,0,0,0,0,1,1,1,1,1,1
2,1,0,1,0,1,1,1,1,1,1,0,0,1,0,1,1,0,1,0
3,1,0,1,1,1,0,1,1,1,0,0,0,0,0,0,0,1,0,0
4,1,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,1,0,0
5,1,0,0,1,1,1,1,1,1,0,0,0,0,0,0,1,0,0,0
6,1,0,0,1,0,0,0,0,1,0,0,1,0,1,1,0,0,0,0
7,1,0,0,1,0,1,1,1,1,0,0,1,0,0,0,0,0,1,0
8,1,1,1,1,1,1,1,1,1,0,1,0,0,1,1,1,1,1,1
9,0,0,0,0,0,0,1,0,1,0,1,1,1,0,0,1,0,0,0


In [None]:
small_df.sequence

0                       [R, R, W, W, R, R, W, R, R, W]
1    [G, W, K, S, V, F, R, K, A, K, K, V, G, K, T, ...
2    [A, L, W, K, T, M, L, K, K, L, G, T, M, A, L, ...
3     [G, L, F, D, V, I, K, K, V, A, A, V, I, G, G, L]
4              [V, A, K, L, L, A, K, L, A, K, K, V, L]
5        [I, I, G, H, L, I, K, T, A, L, G, F, L, G, L]
6              [F, L, P, L, L, A, S, L, F, S, R, L, L]
7        [W, F, K, K, I, P, K, F, L, H, L, A, K, K, F]
8    [A, T, C, D, L, L, S, K, W, N, W, N, H, T, A, ...
9                          [N, I, P, Q, L, T, P, T, P]
Name: sequence, dtype: object

**Clearly multi-hot encoding is not an option: it records only which amino acids occur, so the order and count of residues in the sequence are lost**
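
A short sketch makes the loss concrete, using made-up peptides. `KLA` and `ALK` contain the same residues in a different order, and `RRWW` and `RW` differ only in counts, yet each pair receives an identical multi-hot vector. Position-wise one-hot encoding keeps the first pair apart.

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer, OneHotEncoder

peptides = ["KLA", "ALK", "RRWW", "RW"]

# Multi-hot: one column per residue type, set to 1 if it occurs anywhere
multi_hot = MultiLabelBinarizer().fit_transform(peptides)
print(np.array_equal(multi_hot[0], multi_hot[1]))  # True: order is lost
print(np.array_equal(multi_hot[2], multi_hot[3]))  # True: counts are lost

# Position-wise one-hot: one block of columns per sequence position
positions = np.array([list("KLA"), list("ALK")])
one_hot = OneHotEncoder().fit_transform(positions).toarray()
print(np.array_equal(one_hot[0], one_hot[1]))      # False: order is kept
```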

## One-hot Encode

In [None]:
#export

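def one_hot_encode(df, sparse=True):
    '''Create and return one-hot encoded features'''

    df = df.copy()
    # Split each sequence into a list of residues and record its length
    df['seq_list'] = df['sequence'].apply(list)
    df['lenghts'] = df['sequence'].apply(len)
    # One column per position; shorter sequences are padded with None
    features_df = pd.DataFrame(df['seq_list'].to_list())

    ohe = OneHotEncoder(sparse=sparse)
    transformed_data = ohe.fit_transform(features_df)
    # Build a sparse DataFrame when the encoder returns a scipy sparse matrix
    if sparse:
        transformed_df = pd.DataFrame.sparse.from_spmatrix(transformed_data, index=features_df.index)
    else:
        transformed_df = pd.DataFrame(transformed_data, index=features_df.index)

    # features_ohe = pd.get_dummies(features_df, sparse=True)
    return ohe, transformed_df, features_df, df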

In [None]:
ohe, transformed_df, features_df, df = one_hot_encode(acp_train_df, sparse=False)

In [None]:
ohe.transform(acp_test_df)

OneHotEncoder(sparse=False)
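
Reusing a fitted encoder on test data requires the test sequences to be laid out in the same per-position columns the encoder was fitted on. The sketch below shows one way to do that on toy data. It rests on several assumptions that are not in the notebook: a `PAD` token instead of `None`, truncation of test sequences longer than the training width, and `handle_unknown="ignore"` so unseen residues encode as all zeros. The helper `to_positions` is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

PAD = "-"  # hypothetical padding token, not used in the original notebook


def to_positions(sequences, width):
    """Split sequences into one residue per column, padded or truncated to `width`."""
    rows = [list(seq[:width]) + [PAD] * (width - len(seq)) for seq in sequences]
    return np.array(rows)


train_sequences = ["RRW", "GWKS"]
test_sequences = ["GRWSA", "AC"]

# The column layout is fixed by the longest training sequence
width = max(len(seq) for seq in train_sequences)

toy_encoder = OneHotEncoder(handle_unknown="ignore")
toy_encoder.fit(to_positions(train_sequences, width))

# Test data goes through the same layout, then the already fitted encoder
test_encoded = toy_encoder.transform(to_positions(test_sequences, width)).toarray()
print(test_encoded.shape)
print(test_encoded.sum(axis=1))
```

Residues that never appeared at a given position during fitting contribute no set entry, so a test row can have fewer ones than it has positions.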

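#### Old Pandas effort

The outputs below appear to come from an earlier version of `one_hot_encode` that used `pd.get_dummies()` and returned three values.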

In [None]:
features_ohe, features_df, df  = one_hot_encode(merged_train_df)

In [None]:
features_ohe

Unnamed: 0,0_A,0_C,0_D,0_E,0_F,0_G,0_H,0_I,0_K,0_L,...,4901_G,4902_A,4903_V,4904_N,4905_C,4906_R,4907_K,4908_W,4909_M,4910_N
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18796,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
18797,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
18798,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
18799,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
features_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4901,4902,4903,4904,4905,4906,4907,4908,4909,4910
0,R,R,W,W,R,R,W,R,R,W,...,,,,,,,,,,
1,G,W,K,S,V,F,R,K,A,K,...,,,,,,,,,,
2,A,L,W,K,T,M,L,K,K,L,...,,,,,,,,,,
3,G,L,F,D,V,I,K,K,V,A,...,,,,,,,,,,
4,V,A,K,L,L,A,K,L,A,K,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18796,M,E,E,K,K,E,K,E,I,L,...,,,,,,,,,,
18797,M,S,T,I,A,D,P,R,D,I,...,,,,,,,,,,
18798,M,S,L,E,S,F,D,K,D,I,...,,,,,,,,,,
18799,M,A,L,K,S,Y,K,P,T,T,...,,,,,,,,,,


In [None]:
df

Unnamed: 0,sequence,label_acp,label_amp,label_dna_bind,seq_list,lenghts
0,RRWWRRWRRW,0.0,0.0,0.0,"[R, R, W, W, R, R, W, R, R, W]",10
1,GWKSVFRKAKKVGKTVGGLALDHYLG,0.0,0.0,0.0,"[G, W, K, S, V, F, R, K, A, K, K, V, G, K, T, ...",26
2,ALWKTMLKKLGTMALHAGKAALGAAADTISQGTQ,1.0,0.0,0.0,"[A, L, W, K, T, M, L, K, K, L, G, T, M, A, L, ...",34
3,GLFDVIKKVAAVIGGL,1.0,0.0,0.0,"[G, L, F, D, V, I, K, K, V, A, A, V, I, G, G, L]",16
4,VAKLLAKLAKKVL,1.0,0.0,0.0,"[V, A, K, L, L, A, K, L, A, K, K, V, L]",13
...,...,...,...,...,...,...
18796,MEEKKEKEILDVSALTGKQKAAILLVSIGSEISSKVFKYLSQEEIE...,0.0,0.0,0.0,"[M, E, E, K, K, E, K, E, I, L, D, V, S, A, L, ...",344
18797,MSTIADPRDILLAPVISEKSYGLIEEGTYTFLVHPDSNKTQIKIAV...,0.0,0.0,0.0,"[M, S, T, I, A, D, P, R, D, I, L, L, A, P, V, ...",101
18798,MSLESFDKDIYSLVNKELERQCDHLEMIASENFTYPDVMEVMGSVL...,0.0,0.0,0.0,"[M, S, L, E, S, F, D, K, D, I, Y, S, L, V, N, ...",414
18799,MALKSYKPTTPGQRGLVLIDRSELWKGRPVKALTEGLSKHGGRNNT...,0.0,0.0,0.0,"[M, A, L, K, S, Y, K, P, T, T, P, G, Q, R, G, ...",280
