# Capstone 2 Data Wrangling

Having obtained data from http://archive.ics.uci.edu/ml/datasets/Codon+usage, I'll begin by importing pandas and opening the csv.

In [1]:
import pandas as pd
codon_usage = pd.read_csv('codon_usage.csv')



In [3]:
print(codon_usage.head())
print(codon_usage.size)
print(codon_usage.shape)

  Kingdom  DNAtype  SpeciesID  Ncodons  \
0     vrl        0     100217     1995   
1     vrl        0     100220     1474   
2     vrl        0     100755     4862   
3     vrl        0     100880     1915   
4     vrl        0     100887    22831   

                               SpeciesName      UUU      UUC      UUA  \
0  Epizootic haematopoietic necrosis virus  0.01654  0.01203  0.00050   
1                         Bohle iridovirus  0.02714  0.01357  0.00068   
2             Sweet potato leaf curl virus  0.01974   0.0218  0.01357   
3             Northern cereal mosaic virus  0.01775  0.02245  0.01619   
4           Soil-borne cereal mosaic virus  0.02816  0.01371  0.00767   

       UUG      CUU  ...      CGG      AGA      AGG      GAU      GAC  \
0  0.00351  0.01203  ...  0.00451  0.01303  0.03559  0.01003  0.04612   
1  0.00678  0.00407  ...  0.00136  0.01696  0.03596  0.01221  0.04545   
2  0.01543  0.00782  ...  0.00596  0.01974  0.02489  0.03126  0.02036   
3  0.00992  0.01

This is more or less what was expected:  13028 observations of 69 variables.  With 4 distinct nucleotides and three nucleotides per codon, this gives us 4 ^ 3 = 64 variables representing the frequency of each codon.  The remaining five variables are descriptive:  kingdom, DNA type, species ID, number of codons, and species name.  
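The count of 64 can be checked by enumerating the codons directly. This is a minimal sketch, assuming the codon columns are named with the RNA alphabet (U, C, A, G) as in the header above; `expected_codons` is a name introduced here for illustration.

```python
from itertools import product

# Every ordered triple of the four RNA nucleotides is one codon.
nucleotides = "UCAG"
expected_codons = ["".join(triple) for triple in product(nucleotides, repeat=3)]

print(len(expected_codons))   # 64
print(expected_codons[:4])    # ['UUU', 'UUC', 'UUA', 'UUG']
```

With the DataFrame loaded, `set(expected_codons) - set(codon_usage.columns)` would list any codon column that is missing.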

It will be important to discover what is meant by "DNA type" - check to see if there's a note or a data dictionary associated with this file.

It's also important to verify that the data types are what would be expected:  int for the species ID, DNA type, and number of codons; strings for the kingdom and species name; floating point numbers for the codon frequencies.

In [4]:
codon_usage.dtypes

Kingdom         object
DNAtype          int64
SpeciesID        int64
Ncodons          int64
SpeciesName     object
                ...   
GAA            float64
GAG            float64
UAA            float64
UAG            float64
UGA            float64
Length: 69, dtype: object
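
Because the `dtypes` display is abbreviated, it does not show every codon column. One way to check them all, sketched here on a small made-up frame rather than the real file (the `toy` data and the helper name `non_numeric_entries` are assumptions for illustration), is to coerce each column with `pd.to_numeric` and list any entries that fail to parse.

```python
import pandas as pd

def non_numeric_entries(df, columns):
    """Return {column: [values that cannot be parsed as numbers]}."""
    problems = {}
    for col in columns:
        coerced = pd.to_numeric(df[col], errors="coerce")
        # Entries that were present originally but became NaN after coercion.
        bad = df.loc[coerced.isna() & df[col].notna(), col]
        if not bad.empty:
            problems[col] = bad.tolist()
    return problems

# Hypothetical toy data: one column was read as text and holds a stray entry.
toy = pd.DataFrame({"UUU": [0.01654, 0.02714], "UUC": ["0.01203", "oops"]})
print(non_numeric_entries(toy, ["UUU", "UUC"]))
```

On the real data this might be called as `non_numeric_entries(codon_usage, codon_usage.columns[5:])`, assuming the first five columns are the descriptive ones.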

The 'Kingdom' and 'SpeciesName' variables are reported as object rather than as a string type.  In the pandas version used here, object is the usual dtype for columns of Python strings, so this is not necessarily a problem, though it is worth keeping in mind.  As we're interested in exploring relationships between codon usage frequencies and taxonomic levels such as kingdoms (it might be fruitful to merge with another data set to expand the taxonomic variables, if possible), let's see the unique values for 'Kingdom'.

In [5]:
codon_usage['Kingdom'].unique()

array(['vrl', 'arc', 'bct', 'phg', 'plm', 'pln', 'inv', 'vrt', 'mam',
       'rod', 'pri'], dtype=object)

Again, this is something of a surprise.  At a cursory glance, I might guess that 'vrl' refers to viruses - which seems supported by the .head() results.  I could speculate that 'arc' refers to archaea, 'bct' to bacteria, 'inv' to invertebrates and 'vrt' to vertebrates - but this needs to be explored further.  And it should be noted in passing that, if correct, some of these categories don't, in fact, refer to kingdoms, but indeed to other levels of the taxonomy of life.  However, looking back at the data source (in the link provided above) reveals what's meant by all of these values:

The 'Kingdom' is a 3-letter code corresponding to `xxx' in the CUTG database name: 'arc'(archaea), 'bct'(bacteria), 'phg'(bacteriophage), 'plm' (plasmid), 'pln' (plant), 'inv' (invertebrate), 'vrt' (vertebrate), 'mam' (mammal), 'rod' (rodent), 'pri' (primate), and 'vrl'(virus) sequence entries.

Of course, this presents the problem that bacteriophages are a subset of viruses, and that primates and rodents are subsets of mammals, which in turn are a subset of vertebrates.  Is it reasonable to read these categories as set differences?  That is, to assume 'viruses' in this context refers to 'viruses that are not bacteriophages', and similarly that 'vertebrates' means 'vertebrates that are not mammals', and 'mammals' refers to 'mammals that are not rodents or primates'?  I don't know if it would be reasonable to bother the researcher who compiled and curated these data over this matter.
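
The set-difference reading can be made concrete by recording, for each code, the broader code it would otherwise belong to. This is a sketch under that assumption; the `parent_code` mapping below is my interpretation of the nesting, not something the data source states.

```python
# Hypothetical nesting of the CUTG codes: each key is a subset of its value.
parent_code = {
    "pri": "mam",  # primates are mammals
    "rod": "mam",  # rodents are mammals
    "mam": "vrt",  # mammals are vertebrates
    "phg": "vrl",  # bacteriophages are viruses
}

def lineage(code):
    """Return the code followed by every broader code that contains it."""
    chain = [code]
    while chain[-1] in parent_code:
        chain.append(parent_code[chain[-1]])
    return chain

print(lineage("pri"))  # ['pri', 'mam', 'vrt']
print(lineage("bct"))  # ['bct']
```

Under this reading a record labelled 'vrt' is a vertebrate that is not a mammal, while `lineage` recovers the inclusive groups when they are needed, for example pooling 'pri', 'rod', 'mam' and 'vrt' to study all vertebrates.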

The description page also clarifies what's meant by DNA type:

The 'DNAtype' is denoted as an integer for the genomic composition in the species: 0-genomic, 1-mitochondrial, 2-chloroplast, 3-cyanelle, 4-plastid, 5-nucleomorph, 6-secondary_endosymbiont, 7-chromoplast, 8-leucoplast, 9-NA, 10-proplastid, 11-apicoplast, and 12-kinetoplast.

At this point, my underlying knowledge of the subject matter fails me.  I understand what's meant by genomic, mitochondrial, and chloroplast DNA (the last of which I'd expect only in plants and algae), but the other categories are unknown to me.  I'm going to have to read up a little just so that I have a basic understanding of what they are.  It's also a little alarming that there is an 'NA' value for this variable; what exactly it means is unclear.

It's a bit odd that plasmid DNA is represented in the 'Kingdom' and not the 'DNAtype' variable. 

However, it also emerges as an interesting possibility to see if there are any significant associations between DNA type and codon frequency.
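
A first look at that question could be a per-type average of each codon frequency. This is a sketch on made-up numbers (the `toy` frame and its values are invented for illustration; only the column names come from the data).

```python
import pandas as pd

toy = pd.DataFrame({
    "DNAtype": [0, 0, 1, 1],
    "UUU": [0.01, 0.03, 0.05, 0.07],
    "UUC": [0.02, 0.02, 0.04, 0.06],
})

# Mean frequency of each codon within each DNA type.
mean_by_type = toy.groupby("DNAtype")[["UUU", "UUC"]].mean()
print(mean_by_type)
```

Differences in these means are only descriptive; judging whether they are significant would take a formal test.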

A bit of internet searching reveals the following basic facts, heretofore unknown to me:
1.  A **cyanelle** is an organelle analogous to the chloroplast, found in glaucophytes, a grouping of freshwater algae.  
2.  **Plastids** are a type of organelle found primarily in plants.  They include **chloroplasts**, **chromoplasts**, and **leucoplasts**, which exist as other values of the DNA type variable, which is somewhat confusing - to what DNA sources, exactly, do these records refer?
3.  A **nucleomorph** is a vestigial nucleus found between the membrane pairs in some plastids.  There are only two groups of organisms in which they are known to exist.
4.  **Secondary endosymbiosis** occurs when a eukaryotic cell engulfs and retains another eukaryotic cell that already harbors a plastid acquired from a prokaryote through primary endosymbiosis.  The engulfed eukaryote is considered the **secondary endosymbiont**.
5.  **Chromoplasts** are a type of plastid used to synthesize and store carotenoid pigments, usually in flowers and fruit.
6.  **Leucoplasts** are another type of plastid, used for synthesis and storage - usually of macronutrients.  They are distinguished from chloroplasts and chromoplasts by the absence of photosynthesis and pigment, as well as a more 'amoeboid' morphology.
7.  A **proplastid** is an undifferentiated plastid, formed generally in the meristem tissue of plants; all specialized plastids are derived from them.
8.  **Apicoplasts** are plastids found in most, but not all, apicomplexan parasites (a phylum of parasitic protozoa).  Probably a case of secondary endosymbiosis, they are believed to derive from photosynthetic plastids, and have several plant-like properties that make them good targets for drugs.  This is important, as the species carrying apicoplasts can be serious disease agents - the parasite causing malaria, for example.
9.  A **kinetoplast** is a circular complex of DNA found inside a large mitochondrion, carrying multiple copies of the mitochondrial genome.  They are known to exist only in a single phylum of flagellate protists. 

Having cleared up what the data labels mean - except for the troubling 'NA' in 'DNAtype' - it would be informative to take a look at the distribution of these values.

In [6]:
codon_usage['Kingdom'].value_counts()

bct    2920
vrl    2832
pln    2523
vrt    2077
inv    1345
mam     572
phg     220
rod     215
pri     180
arc     126
plm      18
Name: Kingdom, dtype: int64

It looks like we have a pretty substantial number of cases for each 'Kingdom' other than plasmids.  (Plasmids are DNA sequences found in bacteria that are not part of the bacterial chromosome and replicate independently; bacteria can exchange plasmids with each other in a process called conjugation.)  Considering that plasmid DNA is unique in comparison to the other categories for this variable, and there are a mere 18 observations, perhaps it would be appropriate to omit these cases?


In [7]:
codon_usage['DNAtype'].value_counts()

0     9267
1     2899
2      816
4       31
12       5
9        2
3        2
11       2
5        2
6        1
7        1
Name: DNAtype, dtype: int64

Here we see that almost all observations fall into categories 0, 1, and 2 - genomic, mitochondrial, and chloroplast DNA.  Considering that we have only a very small number of observations for the other DNA types, as well as some unanswered questions about how these categories relate to each other (some being subsets of others, etc.) - perhaps it would be reasonable to restrict our analysis to only these three categories?
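
If the rare categories are dropped, boolean masks built with `isin` would do it. This is a sketch on a toy frame, assuming the original codes (integers for 'DNAtype', three-letter strings for 'Kingdom') are still in place; `keep_dna_types` and `filtered` are names introduced here.

```python
import pandas as pd

toy = pd.DataFrame({
    "Kingdom": ["bct", "plm", "pln", "vrl", "inv"],
    "DNAtype": [0, 0, 2, 0, 12],
})

keep_dna_types = [0, 1, 2]  # genomic, mitochondrial, chloroplast

# Keep the three common DNA types and drop plasmid records.
mask = toy["DNAtype"].isin(keep_dna_types) & (toy["Kingdom"] != "plm")
filtered = toy[mask].reset_index(drop=True)
print(filtered)
```

Filtering into a new frame keeps the full data available in case the rare categories turn out to matter later.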

Of course, another important step is to check to see how many missing values we have.

In [11]:
codon_usage.isna().values.sum()

0

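Astonishingly, there are **zero** missing values in this csv, which suggests these data are well curated.  Note, though, that `isna()` only detects true NaN entries; it would not flag a placeholder such as the 'NA' category in 'DNAtype'.  It will also be helpful to relabel some of the categorical variables, particularly 'DNAtype'.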

In [21]:
taxons_dict = {'arc': 'archaea', 'bct': 'bacteria', 'phg': 'bacteriophage', 'plm': 'plasmid', 'pln': 'plant', 'inv': 'invertebrate', 'vrt': 'vertebrate', 'mam': 'mammal', 'rod': 'rodent', 'pri': 'primate', 'vrl': 'virus'}
DNA_dict = {0: 'genomic', 1: 'mitochondrial', 2: 'chloroplast', 3: 'cyanelle', 4: 'plastid', 5: 'nucleomorph', 6: 'secondary_endosymbiont', 7: 'chromoplast', 8: 'leucoplast', 9: 'NA', 10: 'proplastid', 11: 'apicoplast', 12: 'kinetoplast'}

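# Assign the result back; inplace=True on a column selection may not update the DataFrame in newer pandas versions.
codon_usage['Kingdom'] = codon_usage['Kingdom'].replace(taxons_dict)
codon_usage['DNAtype'] = codon_usage['DNAtype'].replace(DNA_dict)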
print(codon_usage[['Kingdom', 'DNAtype']].head())
print("Kingdoms:", codon_usage['Kingdom'].unique(), "\nDNA types:", codon_usage['DNAtype'].unique())

  Kingdom  DNAtype
0   virus  genomic
1   virus  genomic
2   virus  genomic
3   virus  genomic
4   virus  genomic
Kingdoms: ['virus' 'archaea' 'bacteria' 'bacteriophage' 'plasmid' 'plant'
 'invertebrate' 'vertebrate' 'mammal' 'rodent' 'primate'] 
DNA types: ['genomic' 'secondary_endosymbiont' 'plastid' 'chloroplast'
 'mitochondrial' 'cyanelle' 'chromoplast' 'NA' 'nucleomorph' 'apicoplast'
 'kinetoplast']


It looks like the abbreviations and integers have been replaced for these two variables throughout the dataframe; this may help avoid careless errors later.  It would also make me more comfortable to convert the columns with dtype 'object' to strings.

In [28]:
codon_usage['Kingdom'] = codon_usage['Kingdom'].astype('str')
codon_usage['DNAtype'] = codon_usage['DNAtype'].astype('str')
codon_usage['SpeciesName'] = codon_usage['SpeciesName'].astype('str')
codon_usage.dtypes

Kingdom         object
DNAtype         object
SpeciesID        int64
Ncodons          int64
SpeciesName     object
                ...   
GAA            float64
GAG            float64
UAA            float64
UAG            float64
UGA            float64
Length: 69, dtype: object

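The dtypes still read 'object', but this is not a failed conversion: in this version of pandas, object is the dtype reported for columns of Python strings, so the values in these columns are already strings.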
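
Whether the values are strings can be checked directly, independent of the reported dtype, and pandas 1.0 and later provide a dedicated `string` dtype for cases where an explicit label is wanted. This is a minimal sketch on a toy Series (the `names` values are copied from the `.head()` output; the variable names are introduced here).

```python
import pandas as pd

names = pd.Series(["Bohle iridovirus", "Sweet potato leaf curl virus"])

# Check the values themselves rather than the column's dtype label.
print(all(isinstance(v, str) for v in names))  # True

# Request the dedicated string extension dtype (pandas >= 1.0).
as_string = names.astype("string")
print(as_string.dtype)
```

Applied to the real data, the same conversion would be `codon_usage['SpeciesName'].astype('string')`, assuming a pandas version that supports it.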