# Capstone 2 Data Wrangling

Having obtained data from http://archive.ics.uci.edu/ml/datasets/Codon+usage, I'll begin by importing some essential libraries and opening the csv.

In [1]:
import pandas as pd
codon_usage = pd.read_csv('codon_usage.csv')



In [2]:
print(codon_usage.head())
print(codon_usage.size)
print(codon_usage.shape)

  Kingdom  DNAtype  SpeciesID  Ncodons  \
0     vrl        0     100217     1995   
1     vrl        0     100220     1474   
2     vrl        0     100755     4862   
3     vrl        0     100880     1915   
4     vrl        0     100887    22831   

                               SpeciesName      UUU      UUC      UUA  \
0  Epizootic haematopoietic necrosis virus  0.01654  0.01203  0.00050   
1                         Bohle iridovirus  0.02714  0.01357  0.00068   
2             Sweet potato leaf curl virus  0.01974   0.0218  0.01357   
3             Northern cereal mosaic virus  0.01775  0.02245  0.01619   
4           Soil-borne cereal mosaic virus  0.02816  0.01371  0.00767   

       UUG      CUU  ...      CGG      AGA      AGG      GAU      GAC  \
0  0.00351  0.01203  ...  0.00451  0.01303  0.03559  0.01003  0.04612   
1  0.00678  0.00407  ...  0.00136  0.01696  0.03596  0.01221  0.04545   
2  0.01543  0.00782  ...  0.00596  0.01974  0.02489  0.03126  0.02036   
3  0.00992  0.01

This is more or less what was expected:  13028 observations of 69 variables.  With 4 distinct nucleotides and three nucleotides per codon, this gives us 4 ^ 3 = 64 variables representing the frequency of each codon.  The remaining five variables are descriptive:  kingdom, DNA type, species ID, number of codons, and species name.  
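
As a quick sanity check of that arithmetic, here is a small sketch using only the standard library. The nucleotide order in `NUCLEOTIDES` and the name `n_descriptive` are my own choices for illustration, not anything taken from the dataset.

```python
from itertools import product

# The four RNA nucleotides, written with U (uracil) as in the dataset's column names.
NUCLEOTIDES = "UCAG"

# Every ordered triple of nucleotides is one codon.
codons = ["".join(triple) for triple in product(NUCLEOTIDES, repeat=3)]

n_descriptive = 5  # Kingdom, DNAtype, SpeciesID, Ncodons, SpeciesName

print(len(codons))                  # number of codon columns
print(len(codons) + n_descriptive)  # total number of columns expected
```

The two printed numbers should agree with the 64 codon columns and the 69 columns reported by `.shape`.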

It will be important to discover what is meant by "DNA type" - check to see if there's a note or a data dictionary associated with this file.

It's also important to verify that the data types are what would be expected:  int for the species ID, DNA type, and number of codons; strings for the kingdom and species name; floating point numbers for the codon frequencies.

In [3]:
codon_usage.dtypes

Kingdom         object
DNAtype          int64
SpeciesID        int64
Ncodons          int64
SpeciesName     object
                ...   
GAA            float64
GAG            float64
UAA            float64
UAG            float64
UGA            float64
Length: 69, dtype: object

It's a bit of a surprise that the 'Kingdom' and 'SpeciesName' variables are objects instead of strings.  This might cause complications that need to be remedied.  Since we're interested in exploring relationships between codon usage frequencies and taxonomic levels such as kingdoms (it might be fruitful to merge with another data set to expand the taxonomic variables, if possible), let's see the unique values for 'Kingdom'.

In [4]:
codon_usage['Kingdom'].unique()

array(['vrl', 'arc', 'bct', 'phg', 'plm', 'pln', 'inv', 'vrt', 'mam',
       'rod', 'pri'], dtype=object)

Again, this is something of a surprise.  At a cursory glance, I might guess that 'vrl' refers to viruses - which seems supported by the .head() results.  I could speculate that 'arc' refers to archaea, 'bct' to bacteria, 'inv' to invertebrates and 'vrt' to vertebrates - but this needs to be explored further.  And it should be noted in passing that, if correct, some of these categories don't, in fact, refer to kingdoms, but indeed to other levels of the taxonomy of life.  However, looking back at the data source (in the link provided above) reveals what's meant by all of these values:

The 'Kingdom' is a 3-letter code corresponding to `xxx' in the CUTG database name: 'arc'(archaea), 'bct'(bacteria), 'phg'(bacteriophage), 'plm' (plasmid), 'pln' (plant), 'inv' (invertebrate), 'vrt' (vertebrate), 'mam' (mammal), 'rod' (rodent), 'pri' (primate), and 'vrl'(virus) sequence entries.

Of course, this presents the problem that bacteriophages are a subset of viruses, and that primates and rodents are subsets of mammals, which in turn are a subset of vertebrates.  Is it reasonable to treat these categories as set differences?  That is, to assume 'viruses' in this context refers to 'viruses that are not bacteriophages', and similarly that 'vertebrates' means 'vertebrates that are not mammals', and 'mammals' refers to 'mammals that are not rodents or primates'?  I don't know if it would be reasonable to bother the researcher who compiled and curated these data over this matter.

The description page also clarifies what's meant by DNA type:

The 'DNAtype' is denoted as an integer for the genomic composition in the species: 0-genomic, 1-mitochondrial, 2-chloroplast, 3-cyanelle, 4-plastid, 5-nucleomorph, 6-secondary_endosymbiont, 7-chromoplast, 8-leucoplast, 9-NA, 10-proplastid, 11-apicoplast, and 12-kinetoplast.

At this point, my underlying knowledge of the subject matter fails me.  I understand what's meant by genomic, mitochondrial, and chloroplast DNA (the last of which should be found only in plants, I assume) but the other categories are unknown to me.    I'm going to have to read up a little just so that I have a basic understanding of what they are.  It's also a little alarming that they have an 'NA' value for this variable; what exactly this means is mysterious.  

It's a bit odd that plasmid DNA is represented in the 'Kingdom' and not the 'DNAtype' variable. 

However, it also emerges as an interesting possibility to see if there are any significant associations between DNA type and codon frequency.

A bit of internet searching reveals the following basic facts, heretofore unknown to me:
1.  A **cyanelle** is an organelle analogous to the chloroplast, found in glaucophytes, a grouping of freshwater algae.  
2.  **Plastids** are a type of organelle found primarily in plants.  They include **chloroplasts**, **chromoplasts**, and **leucoplasts**, which exist as other values of the DNA type variable, which is somewhat confusing - to what DNA sources, exactly, do these records refer?
3.  A **nucleomorph** is a vestigial nucleus found between the membrane pairs in some plastids.  There are only two groups of organisms in which they are known to exist.
4.  **Secondary endosymbiosis** occurs when a eukaryotic cell engulfs and absorbs another eukaryotic cell which has already developed endosymbiosis with a prokaryotic cell.  The engulfed eukaryotic cell, together with the endosymbiont it already carries, is considered the **secondary endosymbiont**.
5.  **Chromoplasts** are a type of plastid used to synthesize and store carotenoid pigments, usually in flowers and fruit.
6.  **Leucoplasts** are another type of plastid, used for synthesis and storage - usually of macronutrients.  Distinguished from chloroplasts and chromoplasts by the absence of photosynthesis and pigment, as well as a more 'amoeboid' morphology.
7.  A **proplastid** is an undifferentiated plastid, formed generally in the meristem tissue of plants; all specialized plastids are derived from them.
8.  **Apicoplasts** are plastids found in most, but not all, parasites of the phylum Apicomplexa.  Probably a case of secondary endosymbiosis, they are believed to derive from photosynthetic plastids, and have several plant-like properties that make them good targets for drugs.  This is important, as the species carrying apicoplasts can be serious disease agents - the parasite causing malaria, for example.
9.  A **kinetoplast** is a circular complex of DNA found inside a large mitochondrion, carrying multiple copies of the mitochondrial genome.  They are known to exist only in a single phylum of flagellate protists. 

Having cleared up what the data labels mean - except for the troubling 'NA' in 'DNAtype' - it would be informative to take a look at the distribution of these values.

In [5]:
codon_usage['Kingdom'].value_counts()

bct    2920
vrl    2832
pln    2523
vrt    2077
inv    1345
mam     572
phg     220
rod     215
pri     180
arc     126
plm      18
Name: Kingdom, dtype: int64

It looks like we have a pretty substantial number of cases for each 'Kingdom' other than plasmids.  (Plasmids are DNA sequences found in bacteria that are not part of the bacterial chromosome and replicate independently; bacteria can exchange plasmids with each other in a process called conjugation.)  Considering that plasmid DNA is unique in comparison to the other categories for this variable, and there are a mere 18 observations, perhaps it would be appropriate to omit these cases?


In [6]:
codon_usage['DNAtype'].value_counts()

0     9267
1     2899
2      816
4       31
12       5
9        2
3        2
11       2
5        2
6        1
7        1
Name: DNAtype, dtype: int64

Here we see that almost all observations fall into categories 0, 1, and 2 - genomic, mitochondrial, and chloroplast DNA.  Considering that we have only a very small number of observations for the other DNA types, as well as some unanswered questions about how these categories relate to each other (some being subsets of others, etc.) - perhaps it would be reasonable to restrict our analysis to only these three categories?
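
If that restriction were adopted, it might look something like the sketch below.  This runs on a toy stand-in rather than the real dataframe, it uses the original integer and 3-letter codes, and the names `toy`, `keep_dna`, and `subset` are hypothetical.

```python
import pandas as pd

# Toy stand-in for codon_usage; the real frame has 13028 rows and 69 columns.
toy = pd.DataFrame({
    "Kingdom": ["bct", "plm", "pln", "vrt", "pln"],
    "DNAtype": [0, 0, 2, 1, 4],
})

keep_dna = [0, 1, 2]  # genomic, mitochondrial, chloroplast

# Boolean mask: keep the three well-populated DNA types and drop plasmid records.
mask = toy["DNAtype"].isin(keep_dna) & (toy["Kingdom"] != "plm")
subset = toy[mask].reset_index(drop=True)
print(subset)
```

Filtering into a new frame, rather than dropping rows in place, keeps the full data around in case the rare categories turn out to be interesting later.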

Of course, another important step is to check to see how many missing values we have.

In [7]:
codon_usage.isna().values.sum()

0

Astonishingly, there are **zero** missing values in this csv.  These data are obviously well curated.  However, it will be helpful to relabel some of the categorical variables, particularly 'DNAtype'.

In [8]:
taxons_dict = {'arc': 'archaea', 'bct': 'bacteria', 'phg': 'bacteriophage', 'plm': 'plasmid', 'pln': 'plant', 'inv': 'invertebrate', 'vrt': 'vertebrate', 'mam': 'mammal', 'rod': 'rodent', 'pri': 'primate', 'vrl': 'virus'}
DNA_dict = {0: 'genomic', 1: 'mitochondrial', 2: 'chloroplast', 3: 'cyanelle', 4: 'plastid', 5: 'nucleomorph', 6: 'secondary_endosymbiont', 7: 'chromoplast', 8: 'leucoplast', 9: 'NA', 10: 'proplastid', 11: 'apicoplast', 12: 'kinetoplast'}

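# Assign the result back; calling replace(..., inplace=True) on a selected column
# is not guaranteed to modify the parent dataframe in newer pandas versions.
codon_usage['Kingdom'] = codon_usage['Kingdom'].replace(taxons_dict)
codon_usage['DNAtype'] = codon_usage['DNAtype'].replace(DNA_dict)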
print(codon_usage[['Kingdom', 'DNAtype']].head())
print("Kingdoms:", codon_usage['Kingdom'].unique(), "\nDNA types:", codon_usage['DNAtype'].unique())

  Kingdom  DNAtype
0   virus  genomic
1   virus  genomic
2   virus  genomic
3   virus  genomic
4   virus  genomic
Kingdoms: ['virus' 'archaea' 'bacteria' 'bacteriophage' 'plasmid' 'plant'
 'invertebrate' 'vertebrate' 'mammal' 'rodent' 'primate'] 
DNA types: ['genomic' 'secondary_endosymbiont' 'plastid' 'chloroplast'
 'mitochondrial' 'cyanelle' 'chromoplast' 'NA' 'nucleomorph' 'apicoplast'
 'kinetoplast']


It looks like the abbreviations and integers have been replaced for these two variables throughout the dataframe; this may help avoid careless errors later.  It would also make me more comfortable to convert the columns with dtype 'object' to strings.

In [9]:
codon_usage['Kingdom'] = codon_usage['Kingdom'].astype('str')
codon_usage['DNAtype'] = codon_usage['DNAtype'].astype('str')
codon_usage['SpeciesName'] = codon_usage['SpeciesName'].astype('str')
codon_usage.dtypes

Kingdom         object
DNAtype         object
SpeciesID        int64
Ncodons          int64
SpeciesName     object
                ...   
GAA            float64
GAG            float64
UAA            float64
UAG            float64
UGA            float64
Length: 69, dtype: object

This somehow failed to convert the columns to strings as desired.  Searching the internet for solutions, I came across something akin to this:

In [35]:
'''codon_usage['Kingdom'] = codon_usage['Kingdom'].astype('|S')
codon_usage['DNAtype'] = codon_usage['DNAtype'].astype('|S')
codon_usage['SpeciesName'] = codon_usage['SpeciesName'].astype('|S')
codon_usage.dtypes'''

Kingdom           |S13
DNAtype           |S22
SpeciesID        int64
Ncodons          int64
SpeciesName       |S67
                ...   
GAA            float64
GAG            float64
UAA            float64
UAG            float64
UGA            float64
Length: 69, dtype: object

This is something new to me.  I'm unfamiliar with the "|Sn" datatype, but looking at the values n, my first guess is that these represent string datatypes in which the n refers to the maximum string length.  I need to find some resources to read more about this.  However, I'll take it for granted that these are now string datatypes.  It would also be prudent to inspect some values:


In [36]:
'''print(codon_usage['SpeciesName'][5150:5155])
print(codon_usage['DNAtype'][5150:5155])
print(codon_usage['Kingdom'][5150:5155])
'''


5150          b'Thermofilum pendens Hrk 5'
5151    b'Desulfitobacterium dehalogenans'
5152                b'Brucella melitensis'
5153         b'Thiobacillus denitrificans'
5154             b'Ideonella dechloratans'
Name: SpeciesName, dtype: bytes536
5150    b'genomic'
5151    b'genomic'
5152    b'genomic'
5153    b'genomic'
5154    b'genomic'
Name: DNAtype, dtype: bytes176
5150     b'archaea'
5151    b'bacteria'
5152    b'bacteria'
5153    b'bacteria'
5154    b'bacteria'
Name: Kingdom, dtype: bytes104


Oddly, all of the values are now preceded by a lower case 'b', and then surrounded by single quotes.  This, too, should be cleaned up - or some other solution to the type conversion problem needs to be found.  (Update:  some other solution needs to be found.)
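
For what it's worth, the 'S' family is NumPy's fixed-width byte-string dtype, and the number after it is the width in bytes (13 matches 'bacteriophage', 22 matches 'secondary_endosymbiont').  The b'...' prefix is simply how Python displays bytes objects.  Here is a minimal NumPy-only sketch of the round trip, using toy values rather than the real columns:

```python
import numpy as np

# Toy values standing in for the 'Kingdom' column.
names = np.array(["archaea", "bacteria"])

# 'S' asks NumPy for fixed-width byte strings; the width is that of the longest value.
as_bytes = names.astype("S")
print(as_bytes.dtype)  # a byte-string dtype sized to fit 'bacteria'
print(as_bytes[0])     # displayed with a leading b, like the output above

# Decoding turns the bytes back into ordinary text.
as_text = np.char.decode(as_bytes, "utf-8")
print(as_text[0])
```

So the conversion did not corrupt the values; it changed them from text to raw bytes, which is not what is wanted here.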

In [46]:
print(codon_usage['SpeciesName'][5150:5155])
print(codon_usage['DNAtype'][5150:5155])
print(codon_usage['Kingdom'][5150:5155])

5150          Thermofilum pendens Hrk 5
5151    Desulfitobacterium dehalogenans
5152                Brucella melitensis
5153         Thiobacillus denitrificans
5154             Ideonella dechloratans
Name: SpeciesName, dtype: object
5150    genomic
5151    genomic
5152    genomic
5153    genomic
5154    genomic
Name: DNAtype, dtype: object
5150     archaea
5151    bacteria
5152    bacteria
5153    bacteria
5154    bacteria
Name: Kingdom, dtype: object


Having commented out the "|S" conversion, the values seem to be reverted to normal.  Perhaps the fact that these data are "object" type isn't terribly important.

In [53]:
codon_usage["Kingdom"].dtype

dtype('O')

In [54]:
type(codon_usage["Kingdom"][512])

str

It appears that while the series themselves are objects, the individual values are strings.  Still, for those columns with many repeated values (that is, 'Kingdom' and 'DNAtype') it would be prudent to convert them to categories.
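
To see why categories are attractive for heavily repeated labels, here is a rough sketch on a made-up column.  The exact byte counts depend on the pandas version and platform, so only the comparison matters.

```python
import pandas as pd

# Toy column with a few labels repeated many times, like 'Kingdom'.
labels = pd.Series(["bacteria", "virus", "plant", "vertebrate"] * 2500)

# deep=True counts the memory held by the Python strings themselves.
as_text = labels.memory_usage(deep=True)
as_category = labels.astype("category").memory_usage(deep=True)

print("text:    ", as_text, "bytes")
print("category:", as_category, "bytes")
```

A categorical column stores each distinct label once plus a small integer code per row, which is where the saving comes from.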

In [10]:
codon_usage['Kingdom'] = codon_usage['Kingdom'].astype('category')
codon_usage['DNAtype'] = codon_usage['DNAtype'].astype('category')
codon_usage.dtypes

Kingdom        category
DNAtype        category
SpeciesID         int64
Ncodons           int64
SpeciesName      object
                 ...   
GAA             float64
GAG             float64
UAA             float64
UAG             float64
UGA             float64
Length: 69, dtype: object

That seems to have worked.  It will also be helpful, for some of the things that might be analyzed, to connect the codons to the amino acids they encode.  I'm going to attempt this by means of a dictionary.

In [11]:
# Because one of the things I'd like to explore in this dataset is codon bias, I'm going to need to get the totals for
# each amino acid. Here's a dictionary to link amino acids to codons, as taken from
# https://teaching.healthtech.dtu.dk/22110/index.php/Codon_list

# T's switched to U's, to conform with the original dataset

amino_codons = {'alanine': ['GCU', 'GCC', 'GCA', 'GCG'], 
                'arginine': ['CGU', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'],
                'asparagine': ['AAU', 'AAC'],
                'aspartic acid': ['GAU', 'GAC'],
                'cysteine': ['UGU', 'UGC'], 
                'glutamine': ['CAA', 'CAG'],
                'glutamic acid': ['GAA', 'GAG'],
                'glycine': ['GGU', 'GGC', 'GGA', 'GGG'],
                'histidine': ['CAU', 'CAC'], 
                'isoleucine': ['AUU', 'AUC', 'AUA'], 
                'leucine': ['CUU', 'CUC', 'CUA', 'CUG', 'UUA', 'UUG'],
                'lysine': ['AAA', 'AAG'], 
                'methionine': ['AUG'], 
                'phenylalanine': ['UUU', 'UUC'], 
                'proline': ['CCU', 'CCC', 'CCA', 'CCG'],
                'serine': ['UCU', 'UCC', 'UCA', 'UCG', 'AGU', 'AGC'], 
                'threonine': ['ACU', 'ACC', 'ACA', 'ACG'], 
                'tryptophan': ['UGG'],
                'tyrosine': ['UAU', 'UAC'], 
                'valine': ['GUU', 'GUC', 'GUA', 'GUG'], 
                'start': ['AUG'], 
                'stop': ['UAA', 'UAG', 'UGA']}

amino_codons.items()

dict_items([('alanine', ['GCU', 'GCC', 'GCA', 'GCG']), ('arginine', ['CGU', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG']), ('asparagine', ['AAU', 'AAC']), ('aspartic acid', ['GAU', 'GAC']), ('cysteine', ['UGU', 'UGC']), ('glutamine', ['CAA', 'CAG']), ('glutamic acid', ['GAA', 'GAG']), ('glycine', ['GGU', 'GGC', 'GGA', 'GGG']), ('histidine', ['CAU', 'CAC']), ('isoleucine', ['AUU', 'AUC', 'AUA']), ('leucine', ['CUU', 'CUC', 'CUA', 'CUG', 'UUA', 'UUG']), ('lysine', ['AAA', 'AAG']), ('methionine', ['AUG']), ('phenylalanine', ['UUU', 'UUC']), ('proline', ['CCU', 'CCC', 'CCA', 'CCG']), ('serine', ['UCU', 'UCC', 'UCA', 'UCG', 'AGU', 'AGC']), ('threonine', ['ACU', 'ACC', 'ACA', 'ACG']), ('tryptophan', ['UGG']), ('tyrosine', ['UAU', 'UAC']), ('valine', ['GUU', 'GUC', 'GUA', 'GUG']), ('start', ['AUG']), ('stop', ['UAA', 'UAG', 'UGA'])])

In [12]:
# Now for the tricky part:  creating a new column for the frequency of each amino acid, taking the sum of all of the 
# frequencies of the codons that correspond to that amino acid

for amino, codons in amino_codons.items():
    codon_usage[amino] = 0
    for codon in codons:
        codon_usage[amino] += codon_usage[codon]


TypeError: unsupported operand type(s) for +: 'int' and 'str'

This is unexpected and unfortunate.  I sandboxed this method and it worked beautifully.  What's the issue?  Why would any of these values present as strings?

In [13]:


for amino, codons in amino_codons.items():
    for codon in codons:
        print(codon+':', codon_usage[codon].dtype)


GCU: float64
GCC: float64
GCA: float64
GCG: float64
CGU: float64
CGC: float64
CGA: float64
CGG: float64
AGA: float64
AGG: float64
AAU: float64
AAC: float64
GAU: float64
GAC: float64
UGU: float64
UGC: float64
CAA: float64
CAG: float64
GAA: float64
GAG: float64
GGU: float64
GGC: float64
GGA: float64
GGG: float64
CAU: float64
CAC: float64
AUU: float64
AUC: float64
AUA: float64
CUU: float64
CUC: float64
CUA: float64
CUG: float64
UUA: float64
UUG: float64
AAA: float64
AAG: float64
AUG: float64
UUU: object
UUC: object
CCU: float64
CCC: float64
CCA: float64
CCG: float64
UCU: float64
UCC: float64
UCA: float64
UCG: float64
AGU: float64
AGC: float64
ACU: float64
ACC: float64
ACA: float64
ACG: float64
UGG: float64
UAU: float64
UAC: float64
GUU: float64
GUC: float64
GUA: float64
GUG: float64
AUG: float64
UAA: float64
UAG: float64
UGA: float64


Ok, so now codons 'UUU' and 'UUC' are appearing as objects?  Interestingly, these are the two codons that correspond to phenylalanine.

In [14]:
codon_usage[['UUU', 'UUC']].head()


       UUU      UUC
0  0.01654  0.01203
1  0.02714  0.01357
2  0.01974   0.0218
3  0.01775  0.02245
4  0.02816  0.01371


This looks normal enough!  Perhaps a simple type conversion will fix this.

In [15]:
codon_usage['UUU'] = codon_usage['UUU'].astype('float64')
codon_usage['UUC'] = codon_usage['UUC'].astype('float64')

ValueError: could not convert string to float: 'non-B hepatitis virus'

This is alarming!  Perhaps these data are not as clean as I had thought?  Presumably, this is occurring only in these two columns - otherwise, how could they be showing a float64 type?  I need to find a way to check for values that can't possibly be numeric.
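
One way to hunt for such values, sketched on a hypothetical three-row CSV (the species names and the stray text are invented): coerce the column to numbers and look at the rows where coercion fails.  It also shows why the earlier `isna()` check came back clean.

```python
import io
import pandas as pd

# Hypothetical CSV in which one 'UUU' cell holds text instead of a number.
raw = io.StringIO(
    "SpeciesName,UUU,UUC\n"
    "species a,0.01654,0.01203\n"
    "species b,some stray text,0.01357\n"
    "species c,0.01974,0.0218\n"
)
toy = pd.read_csv(raw)

print(toy.dtypes)              # 'UUU' is not float64 because of the stray text
print(toy.isna().sum().sum())  # isna() reports nothing: text is not "missing"

# Coerce to numbers; anything unparseable becomes NaN, which marks the bad rows.
parsed = pd.to_numeric(toy["UUU"], errors="coerce")
bad_rows = toy[parsed.isna() & toy["UUU"].notna()]
print(bad_rows)
```

Inspecting the offending rows in full, rather than just the one column, would show whether neighbouring cells in those rows are affected too.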

After several failed efforts, it looks like I can replace botched values with NaNs:

In [16]:
codon_usage['UUU'] = pd.to_numeric(codon_usage['UUU'], errors='coerce')
codon_usage['UUC'] = pd.to_numeric(codon_usage['UUC'], errors='coerce')

Previously, we found a total of zero NA values in the dataframe.  How many have we uncovered?

In [17]:
print("UUU:", codon_usage['UUU'].isna().values.sum(), "UUC:", codon_usage['UUC'].isna().values.sum())

UUU: 2 UUC: 1


That's a total of three missing values out of 13028.  If there were more, it might be worthwhile to apply an imputation - in fact, it could even be done exactly, as the relative frequencies of codons across each row should sum to 1.  But for three values I'm not convinced that's worth the time.  I'm going to fill them with zeroes.
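
For the record, the exact imputation mentioned above could be sketched as follows.  It runs on a toy frame with four made-up "codon" columns, and it rests on two assumptions that would need checking against the real rows: that only one codon is missing in the row, and that the rest of the row is intact.

```python
import numpy as np
import pandas as pd

# Toy frame with four "codon" columns standing in for the real 64.
codon_cols = ["UUU", "UUC", "UUA", "UUG"]
toy = pd.DataFrame(
    [[0.10, 0.20, 0.30, 0.40],
     [np.nan, 0.25, 0.25, 0.25],     # one gap: recoverable
     [np.nan, np.nan, 0.50, 0.25]],  # two gaps: not recoverable this way
    columns=codon_cols,
)

n_missing = toy[codon_cols].isna().sum(axis=1)
remainder = 1 - toy[codon_cols].sum(axis=1)  # sum() skips NaN by default

# Fill only rows with exactly one missing codon; leave the others as NaN.
filled = toy.copy()
for row in toy.index[n_missing == 1]:
    filled.loc[row, codon_cols] = toy.loc[row, codon_cols].fillna(remainder[row])
print(filled)
```

A row with two gaps only pins down the sum of the two missing frequencies, not each one, so this trick would not recover both.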

In [18]:
codon_usage['UUU'] = codon_usage['UUU'].fillna(0)
codon_usage['UUC'] = codon_usage['UUC'].fillna(0)

# and check:

print("UUU:", codon_usage['UUU'].isna().values.sum(), "UUC:", codon_usage['UUC'].isna().values.sum())

UUU: 0 UUC: 0


Now it should be simple to retype these variables, and apply the summation I attempted so many cells above.


In [19]:
codon_usage['UUU'] = codon_usage['UUU'].astype('float64')
codon_usage['UUC'] = codon_usage['UUC'].astype('float64')

In [20]:
for amino, codons in amino_codons.items():
    codon_usage[amino] = 0
    for codon in codons:
        codon_usage[amino] += codon_usage[codon]
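
The nested loop can also be written as one row-wise sum per amino acid.  This is only a sketch on a toy frame with made-up values; `toy` and `toy_map` are hypothetical names, with `toy_map` borrowing two entries from `amino_codons`.

```python
import pandas as pd

# Toy frame with three codon columns; values are made up for illustration.
toy = pd.DataFrame({
    "UUU": [0.02, 0.03],
    "UUC": [0.01, 0.02],
    "UGG": [0.01, 0.01],
})

# Two entries borrowed from the amino_codons mapping.
toy_map = {"phenylalanine": ["UUU", "UUC"], "tryptophan": ["UGG"]}

# One vectorized row-wise sum per amino acid instead of an inner loop.
for amino, codons in toy_map.items():
    toy[amino] = toy[codons].sum(axis=1)
print(toy)
```

One difference worth knowing: `sum(axis=1)` skips NaN values by default, whereas `+=` propagates them, so the two versions agree only when no codon value is missing.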

Let's check to make sure that worked as intended.  There should be an additional 22 (20 amino acids, start, and stop) columns in the dataframe.

In [21]:
codon_usage.shape

(13028, 91)

Well, 22 columns added to the 69 that already existed does indeed give us 91.  But some further examination is needed.

In [22]:
aminos = ['alanine', 'arginine', 'asparagine', 'aspartic acid', 'cysteine', 'glutamine', 'glutamic acid', 'glycine', 'histidine',
          'isoleucine', 'leucine', 'lysine', 'methionine', 'phenylalanine', 'proline', 'serine', 'threonine', 'tryptophan',
          'tyrosine', 'valine', 'start', 'stop']
codon_usage[aminos].head()

   alanine  arginine  asparagine  aspartic acid  cysteine  glutamine  glutamic acid  \
0  0.08673   0.06817     0.03008        0.05615   0.01404    0.03409        0.05564   
1  0.08548   0.06378     0.03799        0.05766   0.01899    0.02781         0.0597   
2  0.05594   0.07445     0.04751        0.05162   0.02859    0.03887         0.0471   
3  0.05587   0.04596     0.03656        0.05692    0.0188     0.0282        0.06475   
4  0.06601   0.05789     0.04534        0.06631   0.01787    0.03311        0.07038   

   glycine  histidine  isoleucine  ...  methionine  phenylalanine  proline  \
0  0.08722    0.02857     0.03308  ...     0.02506        0.02857  0.07268   
1  0.06716    0.02171     0.04545  ...     0.03324        0.04071   0.0502   
2  0.05779    0.03291     0.05203  ...      0.0218        0.04154  0.06232   
3  0.06423    0.01462     0.07154  ...     0.02924         0.0402   0.0376   
4  0.05658    0.02032     0.04893  ...     0.02773        0.04187  0.02798   

    serine  threonine  tryptophan  tyrosine   valine    start     stop  
0  0.06115    0.04561     0.01003   0.02757  0.09423  0.02506  0.00301  
1  0.06581    0.05768     0.01425   0.03392  0.08955  0.03324  0.00339  
2  0.08289    0.05491     0.01728   0.03887   0.0578   0.0218  0.00535  
3  0.09191    0.06215     0.01201   0.03029  0.07885  0.02924  0.00418  
4  0.06893    0.06745     0.01205   0.03456  0.07196  0.02773  0.00175  


In theory, the amino acid and stop columns should sum to one across each row.  (Summing all 22 new columns will overshoot, because the codon AUG is counted under both 'methionine' and 'start'; the 'start' column therefore has to be subtracted.)
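
The double counting can be confirmed from the mapping itself, without touching the data.  This sketch repeats the `amino_codons` dictionary so that it runs on its own; `counts` and `repeated` are names introduced here.  If the mapping is right, it should report 65 entries, 64 distinct codons, and AUG as the only repeat.

```python
from collections import Counter

# Same mapping as amino_codons above, repeated so the sketch is self-contained.
amino_codons = {
    'alanine': ['GCU', 'GCC', 'GCA', 'GCG'],
    'arginine': ['CGU', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'],
    'asparagine': ['AAU', 'AAC'],
    'aspartic acid': ['GAU', 'GAC'],
    'cysteine': ['UGU', 'UGC'],
    'glutamine': ['CAA', 'CAG'],
    'glutamic acid': ['GAA', 'GAG'],
    'glycine': ['GGU', 'GGC', 'GGA', 'GGG'],
    'histidine': ['CAU', 'CAC'],
    'isoleucine': ['AUU', 'AUC', 'AUA'],
    'leucine': ['CUU', 'CUC', 'CUA', 'CUG', 'UUA', 'UUG'],
    'lysine': ['AAA', 'AAG'],
    'methionine': ['AUG'],
    'phenylalanine': ['UUU', 'UUC'],
    'proline': ['CCU', 'CCC', 'CCA', 'CCG'],
    'serine': ['UCU', 'UCC', 'UCA', 'UCG', 'AGU', 'AGC'],
    'threonine': ['ACU', 'ACC', 'ACA', 'ACG'],
    'tryptophan': ['UGG'],
    'tyrosine': ['UAU', 'UAC'],
    'valine': ['GUU', 'GUC', 'GUA', 'GUG'],
    'start': ['AUG'],
    'stop': ['UAA', 'UAG', 'UGA'],
}

# Count how many times each codon appears anywhere in the mapping.
counts = Counter(codon for codons in amino_codons.values() for codon in codons)
repeated = [codon for codon, n in counts.items() if n > 1]

print(sum(counts.values()), len(counts), repeated)
```

The same check doubles as a guard against typos in the dictionary: a mistyped codon would show up as a count other than 64 distinct entries.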

In [23]:
codon_usage['aminosum'] = 0
for amino in aminos:
    codon_usage['aminosum'] += codon_usage[amino]
codon_usage['aminosum'] -= codon_usage['start']    
print(codon_usage['aminosum'].min(), codon_usage['aminosum'].max())

0.99758 1.00017


These are close enough to 1 to rule out any gross errors; the discrepancies are probably attributable to rounding.  Dropping the 'aminosum' column leaves us with a dataframe ready for subsequent analysis.

In [24]:
codon_usage.shape


(13028, 92)

In [25]:
codon_usage.drop('aminosum', axis=1, inplace=True)
codon_usage.shape

(13028, 91)

In [27]:
# save file for subsequent EDA
codon_usage.to_csv('codon_usage2.csv')