# Project: Predicting Kingdom and RNA Type Using Codon Frequency

##### by: Jasmine Marzouk

## Notebook 1: Data Cleaning

---

# Data dictionary:

- Kingdom: string: the kingdom classification of the species

    - pri: primate
    - rod: rodent
    - mam: mammalian
    - vrt: vertebrate
    - inv: invertebrate
    - pln: plant
    - bct: bacteria
    - vrl: virus
    - phg: bacteriophage
    - arc: archaea
    - plm: plasmid

    --

- DNAtype: int

    - 0: nuclear
    - 1: mitochondrion
    - 2: chloroplast
    - 3: cyanelle
    - 4: plastid
    - 5: nucleomorph
    - 6: secondary endosymbiont
    - 7: chromoplast
    - 8: leukoplast
    - 9: NA
    - 10: proplastid
    - 11: apicoplast
    - 12: kinetoplast
    
    --
    
- SpeciesID: int
    the species ID in the CUTG codon database.
- Ncodons: int
    the number of codons in the sequence.
- SpeciesName: string
    the species name.
- Codons: float
    the 64 codons, as frequencies (per thousand codons)


The dataset was found on the UCI repository for machine learning: https://archive.ics.uci.edu/ml/datasets/Codon+usage

---

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from scipy import stats


In [3]:
dfcodon = pd.read_csv('../data/codon_usage.csv')




In [4]:
# Looking at the shape of the dataframe
dfcodon.shape


(13028, 69)

In [5]:
dfcodon.head(10)


Unnamed: 0,Kingdom,DNAtype,SpeciesID,Ncodons,SpeciesName,UUU,UUC,UUA,UUG,CUU,...,CGG,AGA,AGG,GAU,GAC,GAA,GAG,UAA,UAG,UGA
0,vrl,0,100217,1995,Epizootic haematopoietic necrosis virus,0.01654,0.01203,0.0005,0.00351,0.01203,...,0.00451,0.01303,0.03559,0.01003,0.04612,0.01203,0.04361,0.00251,0.0005,0.0
1,vrl,0,100220,1474,Bohle iridovirus,0.02714,0.01357,0.00068,0.00678,0.00407,...,0.00136,0.01696,0.03596,0.01221,0.04545,0.0156,0.0441,0.00271,0.00068,0.0
2,vrl,0,100755,4862,Sweet potato leaf curl virus,0.01974,0.0218,0.01357,0.01543,0.00782,...,0.00596,0.01974,0.02489,0.03126,0.02036,0.02242,0.02468,0.00391,0.0,0.00144
3,vrl,0,100880,1915,Northern cereal mosaic virus,0.01775,0.02245,0.01619,0.00992,0.01567,...,0.00366,0.0141,0.01671,0.0376,0.01932,0.03029,0.03446,0.00261,0.00157,0.0
4,vrl,0,100887,22831,Soil-borne cereal mosaic virus,0.02816,0.01371,0.00767,0.03679,0.0138,...,0.00604,0.01494,0.01734,0.04148,0.02483,0.03359,0.03679,0.0,0.00044,0.00131
5,vrl,0,101029,5274,Human adenovirus type 7d,0.02579,0.02218,0.01479,0.01024,0.02294,...,0.00303,0.01593,0.00171,0.02427,0.02503,0.02825,0.0127,0.00133,0.00038,0.00209
6,vrl,0,101688,3042,Apple latent spherical virus,0.04635,0.01545,0.02005,0.024,0.02761,...,0.00329,0.01315,0.00822,0.04011,0.01183,0.02663,0.02663,0.00033,0.00033,0.0
7,vrl,0,101764,2801,Aconitum latent virus,0.02285,0.02678,0.01214,0.02321,0.01714,...,0.00678,0.0125,0.01107,0.03534,0.01571,0.03642,0.02785,0.00107,0.00036,0.00071
8,vrl,0,101947,2897,Pseudorabies virus Ea,0.01105,0.02106,0.00035,0.00104,0.00035,...,0.02658,0.00207,0.00311,0.00414,0.04556,0.00449,0.04867,0.00138,0.00035,0.00138
9,vrl,0,10249,61247,Vaccinia virus Copenhagen,0.03411,0.0143,0.02771,0.01869,0.01148,...,0.00167,0.0223,0.00411,0.04866,0.01559,0.03695,0.01412,0.0025,0.00077,0.00103


---

### Checking data types

In [6]:
dfcodon.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13028 entries, 0 to 13027
Data columns (total 69 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Kingdom      13028 non-null  object 
 1   DNAtype      13028 non-null  int64  
 2   SpeciesID    13028 non-null  int64  
 3   Ncodons      13028 non-null  int64  
 4   SpeciesName  13028 non-null  object 
 5   UUU          13028 non-null  object 
 6   UUC          13028 non-null  object 
 7   UUA          13028 non-null  float64
 8   UUG          13028 non-null  float64
 9   CUU          13028 non-null  float64
 10  CUC          13028 non-null  float64
 11  CUA          13028 non-null  float64
 12  CUG          13028 non-null  float64
 13  AUU          13028 non-null  float64
 14  AUC          13028 non-null  float64
 15  AUA          13028 non-null  float64
 16  AUG          13028 non-null  float64
 17  GUU          13028 non-null  float64
 18  GUC          13028 non-null  float64
 19  GUA 

Looking at the data types, there appear to be two inconsistencies: columns `UUU` and `UUC` are stored as `object`.
Since these are codon columns and their cells should contain floats, the data type should reflect this. I will try to convert the codon columns to float using the `astype()` method.

In [7]:
for col in dfcodon.columns[5:]:
    try:
        dfcodon[col] = dfcodon[col].astype(float)
    except Exception as e:
        print(e)


could not convert string to float: 'non-B hepatitis virus'
could not convert string to float: '-'


According to the error messages, at least two cells in these columns contain strings, which explains why the dtype was `object` instead of a float type.
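To illustrate why a single stray string is enough to change a column's dtype, here is a minimal sketch on a made-up frame. The name `toy` and its values are illustrative assumptions, not the real dataset.

```python
import pandas as pd

# Toy frame (hypothetical values): one stray string in an otherwise numeric column
toy = pd.DataFrame({
    "UUU": [0.01654, "non-B hepatitis virus", 0.01974],
    "UUC": [0.01203, 0.01357, 0.0218],
})

# Collect the columns that pandas could not store as a numeric dtype
non_numeric_cols = [col for col in toy.columns
                    if not pd.api.types.is_numeric_dtype(toy[col])]
print(non_numeric_cols)
```

Only the column holding the string is reported, which mirrors what `info()` showed for `UUU` and `UUC`.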

In [8]:
# creating a separate dataframe to further investigate the issue of the object dtype
dfjustcodon1 = dfcodon.drop(
    ['Kingdom', 'DNAtype', 'SpeciesID', 'Ncodons', 'SpeciesName'], axis=1).copy()


In [9]:
def char_finder(data_frame, series_name):
    '''
    Print every value in a column that cannot be converted to float,
    together with its row label.

    Function taken from the following link: https://towardsdatascience.com/data-cleaning-automatically-removing-bad-data-c4274c21e299

    '''
    print(series_name)
    # iterate over (row label, value) pairs so the reported row matches .loc
    for idx, value in data_frame[series_name].items():
        try:
            float(value)  # changed to float to not flag NaNs or decimals.
        except ValueError:
            print(value, "-> at row:" + str(idx))


In [10]:
for col in dfjustcodon1:
    char_finder(dfjustcodon1, col)


UUU
non-B hepatitis virus -> at row:486
12;I -> at row:5063
UUC
- -> at row:5063
UUA
UUG
CUU
CUC
CUA
CUG
AUU
AUC
AUA
AUG
GUU
GUC
GUA
GUG
GCU
GCC
GCA
GCG
CCU
CCC
CCA
CCG
UGG
GGU
GGC
GGA
GGG
UCU
UCC
UCA
UCG
AGU
AGC
ACU
ACC
ACA
ACG
UAU
UAC
CAA
CAG
AAU
AAC
UGU
UGC
CAU
CAC
AAA
AAG
CGU
CGC
CGA
CGG
AGA
AGG
GAU
GAC
GAA
GAG
UAA
UAG
UGA


The `char_finder` function has located the rows that contain the strings. I will look at these rows (486 and 5063) more closely to see what is going on.
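The same search can also be done without a row-by-row loop. The following is a sketch only, using `pd.to_numeric` with `errors="coerce"`; the helper name `find_non_numeric` and the toy values are assumptions made for illustration.

```python
import pandas as pd

# Toy frame (hypothetical): values read as strings, some of them not numbers
toy = pd.DataFrame({
    "UUU": ["0.01654", "non-B hepatitis virus", "12;I"],
    "UUC": ["0.01203", "0.04362", "-"],
})


def find_non_numeric(data_frame):
    """Return {column: {row label: offending value}} for non-numeric cells."""
    bad_cells = {}
    for col in data_frame.columns:
        # unparseable values become NaN instead of raising an error
        converted = pd.to_numeric(data_frame[col], errors="coerce")
        # a cell is flagged if it held a value that could not be converted
        mask = converted.isna() & data_frame[col].notna()
        if mask.any():
            bad_cells[col] = data_frame.loc[mask, col].to_dict()
    return bad_cells


bad = find_non_numeric(toy)
print(bad)
```

Returning a dictionary rather than printing makes it possible to reuse the row labels later, for example to inspect each flagged row with `.loc`.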

In [11]:
# row 486
dfcodon.loc[486].head(10)


Kingdom                          vrl
DNAtype                            0
SpeciesID                      12440
Ncodons                         1238
SpeciesName                    Non-A
UUU            non-B hepatitis virus
UUC                          0.04362
UUA                            0.021
UUG                          0.01292
CUU                          0.01292
Name: 486, dtype: object

In [12]:
# row 5063
dfcodon.loc[5063].head(10)


Kingdom                                                  bct
DNAtype                                                    0
SpeciesID                                             353569
Ncodons                                                 1698
SpeciesName    Salmonella enterica subsp. enterica serovar 4
UUU                                                     12;I
UUC                                                        -
UUA                                                   0.0212
UUG                                                  0.02356
CUU                                                  0.01178
Name: 5063, dtype: object

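So now I can see that the issue lies in the CSV file. Because the file is comma-separated, the comma in the name 'Non-A, non-B hepatitis virus', as it was recorded in the CUTG repository, has split the name across two fields and offset the entire row. The same has happened in row 5063, where the species name also contains commas (`Salmonella enterica subsp. enterica serovar 4,12;I,-`).
This issue will have to be dealt with in the CSV file by adding quotation marks around the names. The fixed CSV file will then be loaded into a new dataframe.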

This should then resolve the issue with the codon columns that were object type.
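The effect of an unquoted comma, and of wrapping the name in quotation marks, can be sketched with Python's built-in `csv` module. The two sample lines are shortened, hypothetical stand-ins for the real row 486.

```python
import csv
import io

# Shortened, hypothetical versions of the problematic row
unquoted_line = 'vrl,Non-A, non-B hepatitis virus,0.04362\n'
quoted_line = 'vrl,"Non-A, non-B hepatitis virus",0.04362\n'

# Without quotes the comma inside the name is treated as a field separator
unquoted_fields = next(csv.reader(io.StringIO(unquoted_line)))

# With quotes the name is kept together as one field
quoted_fields = next(csv.reader(io.StringIO(quoted_line)))

print(len(unquoted_fields), unquoted_fields)
print(len(quoted_fields), quoted_fields)
```

The unquoted line yields one field more than intended, which is what pushes every later value one column to the right.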

In [13]:
# reloading the fixed dataframe
dfcodon1 = pd.read_csv('../data/codon_usage_fixed.csv')


In [14]:
dfcodon1.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13028 entries, 0 to 13027
Data columns (total 69 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Kingdom      13028 non-null  object 
 1   DNAtype      13028 non-null  int64  
 2   SpeciesID    13028 non-null  int64  
 3   Ncodons      13028 non-null  int64  
 4   SpeciesName  13028 non-null  object 
 5   UUU          13028 non-null  float64
 6   UUC          13028 non-null  float64
 7   UUA          13028 non-null  float64
 8   UUG          13028 non-null  float64
 9   CUU          13028 non-null  float64
 10  CUC          13028 non-null  float64
 11  CUA          13028 non-null  float64
 12  CUG          13028 non-null  float64
 13  AUU          13028 non-null  float64
 14  AUC          13028 non-null  float64
 15  AUA          13028 non-null  float64
 16  AUG          13028 non-null  float64
 17  GUU          13028 non-null  float64
 18  GUC          13028 non-null  float64
 19  GUA 

`UUU` and `UUC` have now been fixed and are of dtype float.

---

### Checking for duplicates

In [15]:
dfcodon1.duplicated().sum()


0

There are no duplicated rows in the dataframe.
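Note that `duplicated()` only flags rows that match in every column. As a hedged sketch on made-up data, the `subset` argument can be used to check for repeats on chosen key columns. Treating `SpeciesID` plus `DNAtype` as a key is an assumption made for illustration, not something the notebook establishes.

```python
import pandas as pd

# Toy frame (hypothetical): same species and DNA type twice, different codon values
toy = pd.DataFrame({
    "SpeciesID": [100217, 100217, 100220],
    "DNAtype": [0, 0, 0],
    "UUU": [0.01654, 0.01700, 0.02714],
})

# Rows identical in every column
full_row_duplicates = int(toy.duplicated().sum())

# Rows that repeat an earlier (SpeciesID, DNAtype) combination
key_duplicates = int(toy.duplicated(subset=["SpeciesID", "DNAtype"]).sum())

print(full_row_duplicates, key_duplicates)
```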

---

### Checking for null values

In [16]:
pd.set_option('display.max_rows', 1000)


In [17]:
dfcodon1.isnull().sum().sort_values(ascending=False)


UGA            2
UAG            1
ACG            0
AAC            0
AAU            0
CAG            0
CAA            0
UAC            0
UAU            0
ACA            0
UGC            0
ACC            0
ACU            0
AGC            0
AGU            0
UCG            0
UGU            0
CAU            0
UCC            0
CAC            0
AAA            0
AAG            0
CGU            0
CGC            0
CGA            0
CGG            0
AGA            0
AGG            0
GAU            0
GAC            0
GAA            0
GAG            0
UAA            0
UCA            0
Kingdom        0
DNAtype        0
CUU            0
AUA            0
AUC            0
AUU            0
CUG            0
CUA            0
CUC            0
UUG            0
GGG            0
UUA            0
UUC            0
UUU            0
SpeciesName    0
Ncodons        0
SpeciesID      0
AUG            0
GUU            0
GUC            0
GUA            0
GUG            0
GCU            0
GCC            0
GCA           

There are three missing values: two in column `UGA` and one in column `UAG`.
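With 69 columns, the full listing is long. A minimal sketch of showing only the columns that actually contain nulls is given below, on a made-up frame whose values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with the same pattern of gaps
toy = pd.DataFrame({
    "UAA": [0.00162, 0.00118, 0.0025],
    "UAG": [0.0, np.nan, 0.00077],
    "UGA": [np.nan, np.nan, 0.00103],
})

# Count nulls per column, then keep only the columns with at least one
null_counts = toy.isnull().sum()
cols_with_nulls = null_counts[null_counts > 0].sort_values(ascending=False)
print(cols_with_nulls)
```

Filtering this way avoids having to raise `display.max_rows` just to read the result.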

In [18]:
dfcodon1[dfcodon1['UAG'].isna()]


Unnamed: 0,Kingdom,DNAtype,SpeciesID,Ncodons,SpeciesName,UUU,UUC,UUA,UUG,CUU,...,CGG,AGA,AGG,GAU,GAC,GAA,GAG,UAA,UAG,UGA
5063,bct,0,353569,1698,Salmonella enterica subsp. enterica serovar 4 ...,0.0212,0.02356,0.01178,0.01296,0.0106,...,0.00707,0.00118,0.0,0.02945,0.02356,0.04476,0.02473,0.00118,,


In [19]:
dfcodon1[dfcodon1['UGA'].isna()]


Unnamed: 0,Kingdom,DNAtype,SpeciesID,Ncodons,SpeciesName,UUU,UUC,UUA,UUG,CUU,...,CGG,AGA,AGG,GAU,GAC,GAA,GAG,UAA,UAG,UGA
486,vrl,0,12440,1238,Non-A non-B hepatitis virus,0.04362,0.021,0.01292,0.01292,0.03554,...,0.00323,0.00242,0.00162,0.04443,0.01696,0.02423,0.02262,0.00162,0.0,
5063,bct,0,353569,1698,Salmonella enterica subsp. enterica serovar 4 ...,0.0212,0.02356,0.01178,0.01296,0.0106,...,0.00707,0.00118,0.0,0.02945,0.02356,0.04476,0.02473,0.00118,,


These values are easy to replace: I will look them up in the CUTG database using the species name.

- For `UGA` missing value in row 486, corresponding to species name `Non-A non-B hepatitis virus` the value is 0.0024
- For `UAG` missing value in row 5063, corresponding to species name `Salmonella enterica subsp. enterica serovar 4,12;I,- [gbbct]: 2` the value is 0.0
- For `UGA` missing value in row 5063, corresponding to species name `Salmonella enterica subsp. enterica serovar 4,12;I,- [gbbct]: 2` the value is 0.0

I will fill in the missing data with these values.


In [20]:
# filling in the missing values with the values I found from CUTG:

dfcodon1.loc[486, 'UGA'] = 0.0024
dfcodon1.loc[5063, 'UAG'] = 0.0
dfcodon1.loc[5063, 'UGA'] = 0.0


In [21]:
# Checking the missing values have been replaced:
dfcodon1.isna().sum()


Kingdom        0
DNAtype        0
SpeciesID      0
Ncodons        0
SpeciesName    0
UUU            0
UUC            0
UUA            0
UUG            0
CUU            0
CUC            0
CUA            0
CUG            0
AUU            0
AUC            0
AUA            0
AUG            0
GUU            0
GUC            0
GUA            0
GUG            0
GCU            0
GCC            0
GCA            0
GCG            0
CCU            0
CCC            0
CCA            0
CCG            0
UGG            0
GGU            0
GGC            0
GGA            0
GGG            0
UCU            0
UCC            0
UCA            0
UCG            0
AGU            0
AGC            0
ACU            0
ACC            0
ACA            0
ACG            0
UAU            0
UAC            0
CAA            0
CAG            0
AAU            0
AAC            0
UGU            0
UGC            0
CAU            0
CAC            0
AAA            0
AAG            0
CGU            0
CGC            0
CGA           

---

### Renaming the Kingdom Classes:

In [22]:
# assign the result back so the renamed labels are stored in dfcodon1
dfcodon1['Kingdom'] = dfcodon1['Kingdom'].replace({'vrl': 'virus',
                                                   'arc': 'archaea',
                                                   'bct': 'bacteria',
                                                   'phg': 'phage',
                                                   'plm': 'plasmid',
                                                   'pln': 'plant',
                                                   'inv': 'invertebrate',
                                                   'vrt': 'vertebrate',
                                                   'mam': 'mammal',
                                                   'rod': 'rodent',
                                                   'pri': 'primate'
                                                   })

dfcodon1['Kingdom']


0          virus
1          virus
2          virus
3          virus
4          virus
          ...   
13023    primate
13024    primate
13025    primate
13026    primate
13027    primate
Name: Kingdom, Length: 13028, dtype: object

In [24]:
# only codes 0-2 are renamed; any other code keeps its integer value
dfcodon1['DNAtype'] = dfcodon1['DNAtype'].replace({0: 'nuclear',
                                                   1: 'mitochondrial',
                                                   2: 'chloroplast',
                                                   })

dfcodon1['DNAtype']


0              nuclear
1              nuclear
2              nuclear
3              nuclear
4              nuclear
             ...      
13023          nuclear
13024    mitochondrial
13025    mitochondrial
13026          nuclear
13027    mitochondrial
Name: DNAtype, Length: 13028, dtype: object
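The cell above renames only codes 0 to 2, so any other code stays an integer and the column ends up with mixed types. A possible alternative, sketched here on a made-up Series, is to map every code listed in the data dictionary. The labels are copied from that dictionary (so `mitochondrion` rather than `mitochondrial`), and the names `dna_type_names` and `toy_dnatype` are assumptions for illustration.

```python
import pandas as pd

# Labels copied from the data dictionary at the top of this notebook
dna_type_names = {
    0: "nuclear", 1: "mitochondrion", 2: "chloroplast", 3: "cyanelle",
    4: "plastid", 5: "nucleomorph", 6: "secondary endosymbiont",
    7: "chromoplast", 8: "leukoplast", 9: "NA", 10: "proplastid",
    11: "apicoplast", 12: "kinetoplast",
}

# Toy Series (hypothetical codes) standing in for dfcodon1['DNAtype']
toy_dnatype = pd.Series([0, 1, 2, 4, 12], name="DNAtype")

# map() returns NaN for any code missing from the dictionary,
# so counting NaNs reveals unmapped codes
labelled = toy_dnatype.map(dna_type_names)
unmapped = int(labelled.isna().sum())

print(labelled.tolist(), unmapped)
```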

# Saving Work

The dataframe has been cleaned and the missing values replaced. I will save it as a pickle file to use in the following EDA section.

In [25]:
import joblib
# Save data as pickle file to data folder
joblib.dump(dfcodon1, '../data/dfcodon1.pkl')


['../data/dfcodon1.pkl']
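A quick way to gain confidence in a saved file is a round trip: write, read back, and compare. The sketch below uses pandas' own `to_pickle` and `read_pickle` on a made-up frame in a temporary directory, rather than `joblib` and the real path. The file written above would be read back with `joblib.load`.

```python
import os
import tempfile

import pandas as pd

# Toy frame (hypothetical values) standing in for the cleaned dataframe
toy = pd.DataFrame({"Kingdom": ["virus", "primate"], "UUU": [0.01654, 0.02714]})

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "toy.pkl")
    toy.to_pickle(pkl_path)              # write the frame to disk
    reloaded = pd.read_pickle(pkl_path)  # read it back

# equals() compares values, dtypes and labels
round_trip_ok = reloaded.equals(toy)
print(round_trip_ok)
```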