# Amino Acid Decode Prep

This notebook documents the steps taken to build a set of keys that map amino acid one-letter codes to a larger set of features for use by the CNN.

Data references:

Nelson, David L.; Cox, Michael M. (2000). Lehninger Principles of Biochemistry (3rd ed.). Worth Publishers. ISBN 978-1-57259-153-0.

Kyte J, Doolittle RF (May 1982). "A simple method for displaying the hydropathic character of a protein". Journal of Molecular Biology. 157 (1): 105–32. CiteSeerX 10.1.1.458.454. doi:10.1016/0022-2836(82)90515-0. PMID 7108955.

Meierhenrich, Uwe J. (2008). Amino acids and the asymmetry of life (1st ed.). Springer. ISBN 978-3-540-76885-2.

Biochemistry, Harpers (2015). Harper's Illustrated Biochemistry (30th ed.). Lange. ISBN 978-0-07-182534-4.

In [64]:
import pandas as pd
import numpy as np

In [70]:
data_r = pd.read_csv('AA_info.csv')
data_r.head()

Unnamed: 0,Name,One Letter,Three Letter,mass,PI,pka,pkb,side chain,hydrophobic,pka.1,polar,ph,small,tiny,aromatic or aliphatic,van der waal volume,Hydrophobicity
0,Alanine,A,Ala,89.09404,6.01,2.35,9.87,-CH3,Yes,30.0,No,,Yes,Yes,Aliphatic,67,0.33
1,Arginine,R,Arg,174.20274,10.76,1.82,8.99,-(CH2)3NH-C(NH)NH2,No,12.3,Yes,strongly basic,No,No,-,148,1.0
2,Asparagine,N,Asn,132.11904,5.41,2.14,8.72,-CH2CONH2,No,30.0,Yes,,Yes,No,-,96,0.43
3,Aspartic acid,D,Asp,133.10384,2.85,1.99,9.9,-CH2COOH,No,3.67,Yes,acidic,Yes,No,-,91,2.66
4,Cysteine,C,Cys,121.15404,5.05,1.92,10.7,-CH2SH,Yes,8.55,No,acidic,Yes,Yes,-,86,0.22


In [28]:
data_r.columns

Index(['Name', 'One Letter', 'Three Letter', 'mass', 'PI', 'pka', 'pkb',
       'side chain', 'hydrophobic', 'pka.1', 'polar', 'ph', 'small', 'tiny',
       'aromatic or aliphatic', 'van der waal volume', 'Hydrophobicity'],
      dtype='object')

In [71]:
data_r.index = data_r['One Letter']
data_r.drop(['O', 'U'], inplace=True)

Numeric columns are taken as-is and scaled with scikit-learn's `StandardScaler`. Note that for `pka.1`, some amino acids have no reported values. These are assigned a placeholder value of 30, consistent with the expectation that their side chains have significantly higher pKa values than those with reported values.

In [54]:
# .copy() avoids pandas' SettingWithCopyWarning when columns are added to `data` later
data = data_r[['mass', 'PI', 'pka', 'pkb','pka.1','van der waal volume', 'Hydrophobicity']].copy()
data.index = data_r['One Letter']

In [55]:
data

One Letter,mass,PI,pka,pkb,pka.1,van der waal volume,Hydrophobicity
A,89.09404,6.01,2.35,9.87,30.0,67,0.33
R,174.20274,10.76,1.82,8.99,12.3,148,1.0
N,132.11904,5.41,2.14,8.72,30.0,96,0.43
D,133.10384,2.85,1.99,9.9,3.67,91,2.66
C,121.15404,5.05,1.92,10.7,8.55,86,0.22
Q,146.14594,5.65,2.17,9.13,30.0,114,0.19
E,147.13074,3.15,2.1,9.47,4.25,109,1.67
G,75.06714,6.06,2.35,9.78,30.0,48,1.14
H,155.15634,7.6,1.8,9.33,6.54,118,1.34
I,131.17464,6.05,2.32,9.76,30.0,124,-0.81


In [30]:
from sklearn.preprocessing import StandardScaler

In [56]:
ss = StandardScaler()
data = pd.DataFrame(ss.fit_transform(data), columns=data.columns, index=data.index)
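As a rough illustration of what the scaling step computes, the sketch below reproduces the standard-score formula with NumPy alone. It assumes `StandardScaler`'s default settings (centering on the column mean and dividing by the population standard deviation, i.e. `ddof=0`). It uses only the five masses shown in the `head()` output, so the resulting values will not match the scaled table in this notebook, which was fit on the full set of amino acids.

```python
import numpy as np

# Masses of the first five amino acids from the head() output (A, R, N, D, C).
mass = np.array([89.09404, 174.20274, 132.11904, 133.10384, 121.15404])

# Standard score: subtract the mean, divide by the population standard
# deviation (ddof=0), which is what StandardScaler does by default.
z = (mass - mass.mean()) / mass.std(ddof=0)

print(z)
```

After scaling, each column has zero mean and unit variance, so features measured on very different scales (mass in daltons, pKa values, volumes) contribute on a comparable footing.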

In [40]:
data

One Letter,mass,PI,pka,pkb,pka.1,van der waal volume,Hydrophobicity
A,-1.5892,-0.009948,1.08575,0.709184,0.723605,-1.490636,-0.068747
R,1.239969,2.76956,-1.835303,-1.061261,-0.934901,1.370537,0.662374
N,-0.15897,-0.361043,-0.071648,-1.604466,0.723605,-0.466265,0.040375
D,-0.126234,-1.859052,-0.898361,0.76954,-1.74354,-0.642881,2.473808
C,-0.523467,-0.571701,-1.284161,2.379036,-1.286279,-0.819496,-0.188782
Q,0.30731,-0.220605,0.093694,-0.7796,0.723605,0.169551,-0.221519
E,0.340046,-1.683504,-0.292105,-0.095564,-1.689193,-0.007065,1.393495
G,-2.05548,0.01931,1.08575,0.528116,0.723605,-2.161775,0.815145
H,0.606832,0.920456,-1.945531,-0.377226,-1.474618,0.310843,1.03339
I,-0.190364,0.013459,0.920407,0.487878,0.723605,0.522782,-1.312744


Boolean-type (Yes/No) columns are converted to 1/0 indicator columns.

In [57]:
# hydrophobic and polar are simple Yes/No columns; keep the 'Yes' indicator as 1/0
for col in ['hydrophobic', 'polar']:
    data[col] = pd.get_dummies(data_r[col])['Yes'].astype(int)

In [59]:
all(data['hydrophobic'] != data['polar'])

True

In [60]:
# dropping polar as it is an inverse of hydrophobic
data.drop('polar', axis=1, inplace=True)
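The Yes/No conversion and the complement check can also be sketched without `get_dummies`. The example below is a minimal illustration on a toy frame (`toy` is a hypothetical stand-in for `data_r`, filled with the first five rows shown earlier); it compares against `'Yes'` directly and confirms that the two indicators are exact complements.

```python
import pandas as pd

# Toy stand-in for data_r, using the first five rows shown earlier.
toy = pd.DataFrame(
    {
        "hydrophobic": ["Yes", "No", "No", "No", "Yes"],
        "polar": ["No", "Yes", "Yes", "Yes", "No"],
    },
    index=list("ARNDC"),
)

# Comparing against 'Yes' gives booleans; astype(int) turns them into 1/0.
flags = (toy[["hydrophobic", "polar"]] == "Yes").astype(int)

# The two columns are exact complements when every row sums to 1.
is_complement = bool((flags.sum(axis=1) == 1).all())
print(flags)
print(is_complement)
```

When one indicator is the exact complement of another, keeping both adds no information, which is the reason for dropping `polar`.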

In [61]:
# Aromatic/Aliphatic are each yes or no, '-' is a 0 for both
data = pd.concat((data, pd.get_dummies(data_r['aromatic or aliphatic'])[['Aliphatic', 'Aromatic']]), axis=1)
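To make the handling of `'-'` concrete, here is a small sketch on a toy Series (`ring` is a hypothetical stand-in for the `aromatic or aliphatic` column, using the values for A, R, H and I). Selecting only the `Aliphatic` and `Aromatic` dummy columns leaves the `'-'` rows as 0 in both. The `reindex` call is an added safeguard, not part of the original cell: it keeps both columns present even if a category happens to be absent from the input.

```python
import pandas as pd

# Toy stand-in for the 'aromatic or aliphatic' column (A, R, H, I).
ring = pd.Series(["Aliphatic", "-", "Aromatic", "Aliphatic"], index=list("ARHI"))

# One-hot encode, then keep only the two meaningful categories.
# Rows labelled '-' end up as 0 in both remaining columns.
dummies = pd.get_dummies(ring, dtype=int).reindex(
    columns=["Aliphatic", "Aromatic"], fill_value=0
)
print(dummies)
```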

In [62]:
data

One Letter,mass,PI,pka,pkb,pka.1,van der waal volume,Hydrophobicity,hydrophobic,Aliphatic,Aromatic
A,-1.5892,-0.009948,1.08575,0.709184,0.723605,-1.490636,-0.068747,1,1,0
R,1.239969,2.76956,-1.835303,-1.061261,-0.934901,1.370537,0.662374,0,0,0
N,-0.15897,-0.361043,-0.071648,-1.604466,0.723605,-0.466265,0.040375,0,0,0
D,-0.126234,-1.859052,-0.898361,0.76954,-1.74354,-0.642881,2.473808,0,0,0
C,-0.523467,-0.571701,-1.284161,2.379036,-1.286279,-0.819496,-0.188782,1,0,0
Q,0.30731,-0.220605,0.093694,-0.7796,0.723605,0.169551,-0.221519,0,0,0
E,0.340046,-1.683504,-0.292105,-0.095564,-1.689193,-0.007065,1.393495,0,0,0
G,-2.05548,0.01931,1.08575,0.528116,0.723605,-2.161775,0.815145,1,0,0
H,0.606832,0.920456,-1.945531,-0.377226,-1.474618,0.310843,1.03339,0,0,1
I,-0.190364,0.013459,0.920407,0.487878,0.723605,0.522782,-1.312744,1,1,0


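The `ph` column can be converted to an ordinal scale, since 'acidic' and 'basic' represent two ends of a scale. Labels run from -2 (acidic) to 3 (strongly basic), and amino acids with no label are treated as neutral (0).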

In [76]:
ordinal = []
for i in data_r['ph']:
    if i == 'acidic':
        ordinal.append(-2)
    elif i == 'weak acidic':
        ordinal.append(-1)
    elif i == 'weak basic':
        ordinal.append(1)
    elif i == 'basic':
        ordinal.append(2)
    elif i == 'strongly basic':
        ordinal.append(3)
    else:
        ordinal.append(0)
data['ph'] = ordinal
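The same mapping can be expressed more compactly with a dictionary and `Series.map`. This is only an alternative sketch, not the code that produced the output in this notebook; `PH_SCALE` is a hypothetical name, and `ph` is a toy stand-in for `data_r['ph']` using the first five rows.

```python
import pandas as pd

# Label-to-score mapping copied from the if/elif chain above.
PH_SCALE = {
    "acidic": -2,
    "weak acidic": -1,
    "weak basic": 1,
    "basic": 2,
    "strongly basic": 3,
}

# Toy stand-in for data_r['ph'] (A, R, N, D, C); None marks a missing label.
ph = pd.Series([None, "strongly basic", None, "acidic", "acidic"], index=list("ARNDC"))

# Missing or unmapped labels become NaN after map(); fillna(0) makes them neutral.
ph_ordinal = ph.map(PH_SCALE).fillna(0).astype(int)
print(ph_ordinal)
```

Keeping the scale in one dictionary makes the chosen spacing between categories easy to inspect and adjust.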

In [77]:
pd.concat((data['ph'], data_r['ph']), axis=1)

One Letter,ph,ph
A,0,
R,3,strongly basic
N,0,
D,-2,acidic
C,-2,acidic
Q,0,
E,-2,acidic
G,0,
H,1,weak basic
I,0,


Finally, amino acids carry much more information than the columns above capture. Their one-letter codes are therefore one-hot encoded as dummy variables, giving the CNN the opportunity to 'learn' additional residue-specific information from them.

In [89]:
letters = pd.DataFrame(data=pd.get_dummies(data_r.index))
letters.index = data_r.index
letters

One Letter,A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y
A,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
R,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
N,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
D,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
C,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Q,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
E,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
G,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
H,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
I,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0


In [93]:
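# Sanity check: in each column, the only row holding a 1 should be the row
# indexed by that same letter. (Comparing .name would always match the column.)
for c in letters.columns:
    if letters[c][letters[c] == 1].index.tolist() != [c]:
        print(c)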
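The notebook stops before showing how the resulting tables would be used. As a hedged sketch only, the example below assumes the feature table and the one-hot table are concatenated into a single key and that a sequence is encoded by looking up each residue. The names `features`, `one_hot`, `key` and `encode_sequence` are hypothetical, and the tiny tables hold just two features for three residues.

```python
import pandas as pd

# Tiny stand-ins for the `data` and `letters` tables (three residues, two features).
features = pd.DataFrame(
    {"hydrophobic": [1, 0, 1], "ph": [0, 3, -2]}, index=list("ARC")
)
one_hot = pd.get_dummies(pd.Series(list("ARC"), index=list("ARC")), dtype=int)

# Hypothetical combined key: one row per one-letter code.
key = pd.concat((features, one_hot), axis=1)


def encode_sequence(seq, key):
    """Return a (len(seq), n_features) array by looking up each residue in key."""
    return key.loc[list(seq)].to_numpy()


encoded = encode_sequence("CAR", key)
print(encoded.shape)
```

Each row of `encoded` corresponds to one residue of the input sequence, in order, which is the shape a sequence model would typically consume.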