# Capstone 2 Pre-Processing

Having completed some preliminary EDA, I'm ready to pre-process the data in preparation for fitting models. This will entail examining and removing some very sparse categories, scaling the quantitative variables, and encoding the categorical variables.

In [30]:
# some easily anticipated imports
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import LabelBinarizer


#import warnings
#warnings.filterwarnings('ignore')

In [31]:
# these could prove useful later:
codon_list = ['UUU', 'UUC', 'UUA', 'UUG', 'CUU', 'CUC', 'CUA', 'CUG',
       'AUU', 'AUC', 'AUA', 'AUG', 'GUU', 'GUC', 'GUA', 'GUG', 'GCU', 'GCC',
       'GCA', 'GCG', 'CCU', 'CCC', 'CCA', 'CCG', 'UGG', 'GGU', 'GGC', 'GGA',
       'GGG', 'UCU', 'UCC', 'UCA', 'UCG', 'AGU', 'AGC', 'ACU', 'ACC', 'ACA',
       'ACG', 'UAU', 'UAC', 'CAA', 'CAG', 'AAU', 'AAC', 'UGU', 'UGC', 'CAU',
       'CAC', 'AAA', 'AAG', 'CGU', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG', 'GAU',
       'GAC', 'GAA', 'GAG', 'UAA', 'UAG', 'UGA']

amino_list = ['alanine', 'arginine',
       'asparagine', 'aspartic acid', 'cysteine', 'glutamine', 'glutamic acid',
       'glycine', 'histidine', 'isoleucine', 'leucine', 'lysine', 'methionine',
       'phenylalanine', 'proline', 'serine', 'threonine', 'tryptophan',
       'tyrosine', 'valine', 'start', 'stop']

# might strip 'start' and 'stop' from the above list but let's leave it for now

# this dictionary, from the preceding notebook, could also prove quite useful
amino_codons = {'alanine': ['GCU', 'GCC', 'GCA', 'GCG'], 
                'arginine': ['CGU', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'],
                'asparagine': ['AAU', 'AAC'],
                'aspartic acid': ['GAU', 'GAC'],
                'cysteine': ['UGU', 'UGC'], 
                'glutamine': ['CAA', 'CAG'],
                'glutamic acid': ['GAA', 'GAG'],
                'glycine': ['GGU', 'GGC', 'GGA', 'GGG'],
                'histidine': ['CAU', 'CAC'], 
                'isoleucine': ['AUU', 'AUC', 'AUA'], 
                'leucine': ['CUU', 'CUC', 'CUA', 'CUG', 'UUA', 'UUG'],
                'lysine': ['AAA', 'AAG'], 
                'methionine': ['AUG'], 
                'phenylalanine': ['UUU', 'UUC'], 
                'proline': ['CCU', 'CCC', 'CCA', 'CCG'],
                'serine': ['UCU', 'UCC', 'UCA', 'UCG', 'AGU', 'AGC'], 
                'threonine': ['ACU', 'ACC', 'ACA', 'ACG'], 
                'tryptophan': ['UGG'],
                'tyrosine': ['UAU', 'UAC'], 
                'valine': ['GUU', 'GUC', 'GUA', 'GUG'], 
                'start': ['AUG'], 
                'stop': ['UAA', 'UAG', 'UGA']}
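
One way the dictionary above might be put to use is to invert it into a codon-to-amino-acid lookup. The sketch below works on a shortened excerpt of the dictionary; the names `amino_codons_excerpt` and `codon_to_amino` are introduced here for illustration only, and skipping 'start' (so that 'AUG' maps only to 'methionine') is an assumption about what would be most convenient.

```python
# A shortened excerpt of amino_codons, just for illustration
amino_codons_excerpt = {
    'methionine': ['AUG'],
    'tryptophan': ['UGG'],
    'tyrosine': ['UAU', 'UAC'],
    'start': ['AUG'],
    'stop': ['UAA', 'UAG', 'UGA'],
}

# Invert to codon -> amino acid; 'start' is skipped so that
# 'AUG' maps only to 'methionine' (an assumption, not a requirement)
codon_to_amino = {
    codon: amino
    for amino, codons in amino_codons_excerpt.items()
    if amino != 'start'
    for codon in codons
}

print(codon_to_amino['AUG'])
print(len(codon_to_amino))
```

The same comprehension should work unchanged on the full dictionary.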

In [32]:
# the file prepared in the wrangling notebook
codon_usage = pd.read_csv('codon_usage2.csv')
codon_usage.shape

(13028, 92)

The variables that I'd like to model as outcomes are 'Kingdom' and 'DNAtype'; I'd like to take another look at their value counts.

In [33]:
codon_usage['Kingdom'].value_counts()

bacteria         2920
virus            2832
plant            2523
vertebrate       2077
invertebrate     1345
mammal            572
bacteriophage     220
rodent            215
primate           180
archaea           126
plasmid            18
Name: Kingdom, dtype: int64

In [34]:
codon_usage['DNAtype'].value_counts()

genomic                   9267
mitochondrial             2899
chloroplast                816
plastid                     31
kinetoplast                  5
cyanelle                     2
nucleomorph                  2
apicoplast                   2
secondary_endosymbiont       1
chromoplast                  1
Name: DNAtype, dtype: int64

We can see that among the 'DNAtype' values, almost all are 'genomic', 'mitochondrial', or 'chloroplast'. With 'Kingdom', there's a reasonable number of cases in every category except 'plasmid'. The observations in the sparse categories (the 18 'plasmid' rows and the seven rare 'DNAtype' values) should be dropped, so that we can produce models not unduly leveraged by them.

In [35]:
# tried and failed to do this with df.drop() but this seems to work:
codon_usage = codon_usage[codon_usage['Kingdom'] != 'plasmid']
codon_usage.shape

(13010, 92)

In [36]:
# filter out all of the sparse DNAtype values in one pass with isin()
drop_types = ['plastid', 'kinetoplast', 'nucleomorph', 'cyanelle', 'apicoplast', 'secondary_endosymbiont', 'chromoplast']
codon_usage = codon_usage[~codon_usage['DNAtype'].isin(drop_types)]

codon_usage.shape


(12966, 92)

This removed 44 rows from the dataframe, which is what was expected: the seven sparse 'DNAtype' values account for 31 + 5 + 2 + 2 + 2 + 1 + 1 = 44 observations.
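
Rather than listing the sparse values by hand, the categories to drop could also be chosen by a count threshold. This is only a sketch on toy data: the helper name `drop_sparse_categories` and the `min_count` parameter are hypothetical, not something used elsewhere in this project. One difference worth noting is that `isin` treats missing values as non-members, so this version would also drop rows where the column is NaN.

```python
import pandas as pd

def drop_sparse_categories(df, column, min_count):
    """Keep only rows whose value in `column` occurs at least `min_count` times."""
    counts = df[column].value_counts()
    keep = counts[counts >= min_count].index
    return df[df[column].isin(keep)]

# toy stand-in for the real dataframe
toy = pd.DataFrame({'DNAtype': ['genomic'] * 5 + ['mitochondrial'] * 3 + ['plastid']})

trimmed = drop_sparse_categories(toy, 'DNAtype', min_count=2)
print(trimmed['DNAtype'].value_counts())
```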

In [37]:
codon_usage[['Kingdom', 'DNAtype']].value_counts()

Kingdom        DNAtype      
bacteria       genomic          2918
virus          genomic          2832
vertebrate     mitochondrial    1613
plant          genomic          1523
invertebrate   genomic           922
plant          chloroplast       815
mammal         mitochondrial     470
vertebrate     genomic           464
invertebrate   mitochondrial     411
bacteriophage  genomic           220
rodent         mitochondrial     156
plant          mitochondrial     152
archaea        genomic           126
mammal         genomic           102
primate        mitochondrial      97
               genomic            83
rodent         genomic            59
invertebrate   chloroplast         1
dtype: int64

This didn't produce quite the results intended, but one thing stands out as unusual: there's a single observation with a 'Kingdom' of 'invertebrate' but a 'DNAtype' of 'chloroplast'. To the best of my knowledge, chloroplasts are exclusive to photosynthetic life, and an invertebrate shouldn't have any. What's going on here?

In [38]:
codon_usage.loc[(codon_usage['Kingdom'] == 'invertebrate') & (codon_usage['DNAtype'] == 'chloroplast')]

,Unnamed: 0,Kingdom,DNAtype,SpeciesID,Ncodons,SpeciesName,UUU,UUC,UUA,UUG,...,methionine,phenylalanine,proline,serine,threonine,tryptophan,tyrosine,valine,start,stop
9400,9400,invertebrate,chloroplast,5811,6158,chloroplast Toxoplasma gondii,0.08412,0.00438,0.11725,0.00244,...,0.01234,0.0885,0.01851,0.05927,0.04319,0.0026,0.06528,0.02517,0.01234,0.0086


A bit of searching on the subject of 'chloroplast Toxoplasma gondii' suggests that what's meant is the apicoplast present in T. gondii, which is believed to be a vestige of a chloroplast in the evolutionary history of several organisms. (See https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3166169/) At any rate, it is at least highly questionable that this is really chloroplast DNA, and the observation should simply be dropped.

In [39]:
codon_usage.drop(9400, inplace=True)
codon_usage.shape

(12965, 92)
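
Dropping by the index label 9400 works here, but it depends on that label staying the same if the upstream file changes. A possible alternative, sketched on toy data, is to reuse the boolean condition from the lookup above and keep everything that doesn't match it. The toy dataframe and the name `mask` are made up for this example.

```python
import pandas as pd

# toy stand-in with a non-contiguous index, as in the real data
toy = pd.DataFrame(
    {'Kingdom': ['plant', 'invertebrate', 'invertebrate'],
     'DNAtype': ['chloroplast', 'chloroplast', 'genomic']},
    index=[10, 9400, 9401],
)

# rows that are both 'invertebrate' and 'chloroplast'
mask = (toy['Kingdom'] == 'invertebrate') & (toy['DNAtype'] == 'chloroplast')

# keep everything else; no index label is hard-coded
toy = toy[~mask]
print(toy.index.tolist())
```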

The above results also show that there's still a column, 'Unnamed: 0', that duplicates the index. This can be trimmed as well.

In [40]:
codon_usage.drop('Unnamed: 0', axis=1, inplace=True)
codon_usage.head()

,Kingdom,DNAtype,SpeciesID,Ncodons,SpeciesName,UUU,UUC,UUA,UUG,CUU,...,methionine,phenylalanine,proline,serine,threonine,tryptophan,tyrosine,valine,start,stop
0,virus,genomic,100217,1995,Epizootic haematopoietic necrosis virus,0.01654,0.01203,0.0005,0.00351,0.01203,...,0.02506,0.02857,0.07268,0.06115,0.04561,0.01003,0.02757,0.09423,0.02506,0.00301
1,virus,genomic,100220,1474,Bohle iridovirus,0.02714,0.01357,0.00068,0.00678,0.00407,...,0.03324,0.04071,0.0502,0.06581,0.05768,0.01425,0.03392,0.08955,0.03324,0.00339
2,virus,genomic,100755,4862,Sweet potato leaf curl virus,0.01974,0.0218,0.01357,0.01543,0.00782,...,0.0218,0.04154,0.06232,0.08289,0.05491,0.01728,0.03887,0.0578,0.0218,0.00535
3,virus,genomic,100880,1915,Northern cereal mosaic virus,0.01775,0.02245,0.01619,0.00992,0.01567,...,0.02924,0.0402,0.0376,0.09191,0.06215,0.01201,0.03029,0.07885,0.02924,0.00418
4,virus,genomic,100887,22831,Soil-borne cereal mosaic virus,0.02816,0.01371,0.00767,0.03679,0.0138,...,0.02773,0.04187,0.02798,0.06893,0.06745,0.01205,0.03456,0.07196,0.02773,0.00175


In [41]:
codon_usage['Kingdom'].value_counts()

bacteria         2918
virus            2832
plant            2492
vertebrate       2077
invertebrate     1333
mammal            572
bacteriophage     220
rodent            215
primate           180
archaea           126
Name: Kingdom, dtype: int64

In [42]:
codon_usage['DNAtype'].value_counts()

genomic          9249
mitochondrial    2899
chloroplast       815
Name: DNAtype, dtype: int64

It looks like we've successfully trimmed the data of these sparse categories.  We should also check for missing values.

In [43]:
codon_usage.isna().sum()

Kingdom        0
DNAtype        2
SpeciesID      0
Ncodons        0
SpeciesName    0
              ..
tryptophan     0
tyrosine       0
valine         0
start          0
stop           0
Length: 91, dtype: int64

It looks like we still have two missing values for 'DNAtype' - let's drop those rows, too.

In [44]:
codon_usage.dropna(inplace=True)
codon_usage.shape

(12963, 91)

The next major task is to standardize the values.  In general, the variances among the codon and amino acid frequencies shouldn't be so dissimilar as to matter much, but it's good practice, and of course the 'Ncodons' variable has a completely different scale.

**EDIT:** I'm going to postpone using StandardScaler() until after train_test_split in the modeling notebook to avoid data leakage.

In [45]:
ss = StandardScaler()
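
To make the leakage concern concrete, here is a rough sketch of the order of operations planned for the modeling notebook: split first, fit the scaler on the training rows only, then apply the same fitted scaler to the test rows. The arrays are random toy data, and the split size and random seeds are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# toy stand-in for the quantitative columns and an outcome
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scaler = StandardScaler()
# means and standard deviations are learned from the training rows only
X_train_scaled = scaler.fit_transform(X_train)
# the test rows are transformed with those same training statistics
X_test_scaled = scaler.transform(X_test)
```

With this ordering, nothing about the test rows influences the scaling parameters.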

Perhaps this is bad practice but I'm making a copy of the df as it exists now.

In [46]:
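# .copy() makes an independent copy; plain assignment would only give the same dataframe a second name
cu = codon_usage.copy()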
#transform_cols = codon_list + amino_list

# almost forgot this one
#transform_cols.append('Ncodons')

#cu[transform_cols] = ss.fit_transform(cu[transform_cols])

In [47]:
cu[amino_list].mean()


alanine          0.074558
arginine         0.046398
asparagine       0.044485
aspartic acid    0.045349
cysteine         0.014894
glutamine        0.034859
glutamic acid    0.049967
glycine          0.065279
histidine        0.023725
isoleucine       0.071653
leucine          0.108339
lysine           0.049951
methionine       0.021138
phenylalanine    0.048164
proline          0.049957
serine           0.071232
threonine        0.062768
tryptophan       0.011608
tyrosine         0.034324
valine           0.062914
start            0.021138
stop             0.008439
dtype: float64

In [48]:
cu[amino_list].std()

alanine          0.024743
arginine         0.019469
asparagine       0.014036
aspartic acid    0.017574
cysteine         0.010467
glutamine        0.014234
glutamic acid    0.019885
glycine          0.017717
histidine        0.007686
isoleucine       0.032334
leucine          0.036654
lysine           0.021793
methionine       0.008160
phenylalanine    0.017163
proline          0.014653
serine           0.015628
threonine        0.019275
tryptophan       0.006574
tyrosine         0.010434
valine           0.015595
start            0.008160
stop             0.011069
dtype: float64

Since scaling has been postponed, these are still the raw frequencies: the means range from about 0.008 ('stop') to 0.108 ('leucine') and the standard deviations from about 0.007 to 0.037, rather than the zero means and unit standard deviations we'd see after converting to standard scores. Next, we need to replace the categorical variables 'Kingdom' and 'DNAtype' with dummy variables. There are a variety of ways to do this; I'm going to try pd.get_dummies().

In [49]:
# should I be using drop_first=True?
kingdoms = pd.get_dummies(cu['Kingdom'], prefix='K')
kingdoms.head()

,K_archaea,K_bacteria,K_bacteriophage,K_invertebrate,K_mammal,K_plant,K_primate,K_rodent,K_vertebrate,K_virus
0,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,1
3,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,1


In [50]:
dnas = pd.get_dummies(cu['DNAtype'], prefix='D')
dnas.head()

,D_chloroplast,D_genomic,D_mitochondrial
0,0,1,0
1,0,1,0
2,0,1,0
3,0,1,0
4,0,1,0
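
On the drop_first question: with drop_first=True, pd.get_dummies() omits the first level, so k categories yield k - 1 columns and the dropped level is implied when all the others are zero. That tends to matter for linear models with an intercept, where the full set of dummies is perfectly collinear; for many other models it makes little difference. A small sketch on a toy series (not the real data):

```python
import pandas as pd

toy = pd.Series(['genomic', 'chloroplast', 'mitochondrial', 'genomic'], name='DNAtype')

# one column per category
full = pd.get_dummies(toy, prefix='D')

# first category (alphabetically, 'chloroplast') is dropped
reduced = pd.get_dummies(toy, prefix='D', drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```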


Now these encoded variables need to be added to the dataframe, and the original categorical variables dropped.

**Revision: It proved necessary to retain the original 'Kingdom' and 'DNAtype' columns.**

In [51]:
cu = pd.concat([cu, kingdoms, dnas], axis=1)
#cu.drop(['Kingdom', 'DNAtype'], axis=1, inplace=True)


**Another revision:  I'm going to try LabelEncoder too.**

In [52]:
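# one encoder per column, so each fitted mapping stays available afterwards
le_kingdom = LabelEncoder()
le_dna = LabelEncoder()

cu['KingLabel'] = le_kingdom.fit_transform(cu['Kingdom'])
cu['DNALabel'] = le_dna.fit_transform(cu['DNAtype'])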

In [53]:
cu['KingLabel'].value_counts()

1    2918
9    2832
5    2490
8    2077
3    1333
4     572
2     220
7     215
6     180
0     126
Name: KingLabel, dtype: int64

In [54]:
cu['DNALabel'].value_counts()

1    9249
2    2899
0     815
Name: DNALabel, dtype: int64
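
The integer codes are easier to interpret with the mapping in hand. LabelEncoder sorts the classes, so the codes follow alphabetical order, which is consistent with the counts above (for example, 1 is 'genomic'). The sketch below uses a toy list rather than the real column; `le_dna` and `mapping` are names chosen for this example.

```python
from sklearn.preprocessing import LabelEncoder

dna_types = ['genomic', 'mitochondrial', 'chloroplast', 'genomic']

le_dna = LabelEncoder()
codes = le_dna.fit_transform(dna_types)

# classes_ is sorted; the position of each class is its integer code
mapping = {label: int(code) for code, label in enumerate(le_dna.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
print(le_dna.inverse_transform([1, 2, 0]))
```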

At this point pre-processing is completed; the data are ready for modeling.  All that remains is to save the new dataframe.

In [55]:
cu.to_csv('codon_usage3.csv')
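
One thing to keep in mind for the next notebook: to_csv() writes the index by default, and reading the file back without further options turns it into an 'Unnamed: 0' column like the one trimmed earlier. The sketch below shows the difference on a toy dataframe written to in-memory buffers; whether to pass index=False here or index_col=0 when reading is a judgment call.

```python
import io
import pandas as pd

toy = pd.DataFrame({'Kingdom': ['virus', 'plant']}, index=[0, 7])

# default: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```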