<div style="color:white;
           display:fill;
           border-radius:0px;
           background-color:#004c6d;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:1px">

<p style="padding: 16px;
              color:white;overflow:hidden;margin:0;font-size:100%;text-align:center"> 
    <span style="font-size:30px;">
    <b>
Principal Component Bacterial Genetic Analysis
    </b>
        </span>
</p>
</div>

# <span style="color:#004c6d"> Summary </span>
Our task is to classify 10 bacteria species using the oligomer sampling method outlined in Wood et al. (2020). Instead of reading the full base sequence, this method of genomic analysis samples 10-mer snippets of DNA and records the nucleotide base counts in each snippet (an oligomer composition, e.g. AxTxGxCx). The frequencies of these compositions are then passed to machine learning models to rapidly identify bacteria species.

"Bacterial antibiotic resistance is becoming a significant health threat, and rapid
identification of antibiotic-resistant bacteria is essential to save lives and reduce the
spread of antibiotic resistance. This paper analyzes the ability of machine learning
algorithms (MLAs) to process data from a novel spectroscopic diagnostic device to
identify antibiotic-resistant genes and bacterial species by comparison to available
bacterial DNA sequences. Simulation results show that the algorithms attain from 92%
accuracy (for genes) up to 99% accuracy (for species). This novel approach identifies
genes and species by optically reading the percentage of A, C, G, T bases in 1000s
of short 10-base DNA oligomers instead of relying on conventional DNA sequencing
in which the sequence of bases in long oligomers provides genetic information.
The identification algorithms are robust in the presence of simulated random genetic
mutations and simulated random experimental errors. Thus, these algorithms can be
used to identify bacterial species, to reveal antibiotic resistance genes, and to perform
other genomic analyses. Some MLAs evaluated here are shown to be better than
others at accurate gene identification and avoidance of false negative identification of
antibiotic resistance." (Wood, 2020)
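
To make the idea of base-composition counting concrete, here is a minimal sketch. It assumes a sliding window over a short made-up sequence, whereas the paper describes sampling many short oligomers, so treat it as an illustration rather than the authors' procedure. The function names and `toy_sequence` are hypothetical.

```python
from collections import Counter

def composition_label(kmer):
    """Return the 'AxTxGxCx' label for one DNA snippet."""
    return "A{}T{}G{}C{}".format(
        kmer.count("A"), kmer.count("T"), kmer.count("G"), kmer.count("C"))

def composition_spectrum(sequence, k=10):
    """Relative frequency of each base-composition label over all k-mers."""
    # Assumption: a sliding window stands in for the random sampling of oligomers
    kmers = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(composition_label(kmer) for kmer in kmers)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

toy_sequence = "ATGCGATACGCTTGAGGCTA"  # made-up sequence, not real bacterial DNA
spectrum = composition_spectrum(toy_sequence)
for label, freq in sorted(spectrum.items()):
    print(label, round(freq, 3))
```

Each key of `spectrum` corresponds to one feature column of the kind used in this dataset, and the values sum to 1.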

The speed and accuracy of new genomic sequencing techniques provide cost-effective and scalable bacterial testing capabilities for scientists, with a broad range of uses in society. The use of machine learning in research supports the advancement of science and highlights how much knowledge can be acquired from less information by using technology.

**Key players** 

* Bacteria Species Samples
* Oligomer Genetic Sequence Frequency
<div style="color:white;
           display:fill;
           border-radius:0px;
           background-color:#004c6d;
           font-size:110%;
           font-family:Verdana;
           letter-spacing:1px">

# <span style="color:#004c6d"> Stage 1:  Ask </span>
#### "The scientist is not a person who gives the right answers, he is one who asks the right questions."- Claude Lévi-Strauss

### <span style="color:#004c6d"> 1.1 Analysis Task </span>  

Task: Machine learning tasks are usually split into three categories: supervised, unsupervised, and reinforcement learning. For this competition, our task is supervised learning, because each training sample comes with a known species label.

### <span style="color:#004c6d"> 1.2 Dataset used: </span>  

We are given a training and a testing data set in CSV format.

Training: The training data consists of 200,000 rows of bacterial samples, with 286 columns describing oligomer frequency. We have been provided the target bacteria species for each row.

Testing: The test data set consists of 100,000 bacteria samples with the same attributes as the training data. However, the test data does not list the bacteria species; we rely on machine learning models to predict the species from oligomer frequency. We will be running and tuning multiple machine learning models to create predictions for the Kaggle competition.

### <span style="color:#004c6d"> 1.3 Data Organization and verification: </span>

The data source used for this project is the tabular-playground-series-feb-2022 dataset on Kaggle. This dataset contains data for 300,000 bacteria samples across the training and test sets. Each row represents one sample, and the columns describe the different oligomer permutations (e.g. AxTxGxCx).

The two CSV documents, the training set (train.csv) and the test set (test.csv), provide a similar distribution of bacterial samples. The training set is available to train machine learning models, and the test set is used to test model accuracy.
The data is considered wide since each row is one subject: every bacterial sample has its oligomer frequencies listed as attributes in columns.
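
As a quick sanity check on the number of feature columns, the sketch below enumerates every way to split 10 bases among A, T, G and C. The variable name `labels` is only illustrative; the label format follows the column names in the table below.

```python
from itertools import product

# Every combination of counts (a, t, g, c) with a + t + g + c == 10
labels = [
    f"A{a}T{t}G{g}C{10 - a - t - g}"
    for a, t, g in product(range(11), repeat=3)
    if a + t + g <= 10
]

print(len(labels))   # number of possible AxTxGxCx columns
print(labels[0])     # first label in this ordering
print(labels[-1])    # last label in this ordering
```

The count comes out to 286, which is the number of oligomer feature columns the later PCA cells work with.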

Due to the size of the training set, I sorted and filtered the tables by creating Pivot Tables in Excel.
This allowed me to verify the attributes and observations.

| Variable | Description |
| --- | --- |   
| row_id | Bacterial Sample Number |
| A0T0G0C10 | Oligomer Frequency in Genome, a Permutation of AxTxGxCx |
| ...x286 | ... |
| A10T0G0C0 | Oligomer Frequency in Genome, a Permutation of AxTxGxCx |
| target | Bacteria Species |

**Task 1:** Choose data- For this analysis objective, I am going to work with the test and train datasets to conduct descriptive statistics and run machine learning models.

**Task 2:** Know your audience- I am performing analysis for the Kaggle competition and the Kaggle community. Kagglers have a working understanding of data analytics and science, so I need to tailor my results to be useful to them.

**Task 3:** Aims to address objective- To answer this analytics objective, I decided the analysis should identify genetic sequences with a strong classification effect on bacterial species, conduct descriptive statistics, and fit and tune multiple machine learning models. The oligomer sequences with the strongest effect were: A5T4G0C1, A5T5G0C0, A0T0G5C5, A0T0G7C3, A1T2G6C1, and A1T2G7C0.


# <span style="color:#004c6d"> Stage 2:  Prepare </span>     

#### “It is a capital mistake to theorize before one has data.” -Sherlock Holmes

**Task 1: Data source validation-** Verifying the metadata of the dataset confirmed it as open-source. The owner has dedicated the work to the public domain by waiving all of his or her rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law. We can copy, modify, and distribute the work without asking permission. 

With information about only 300,000 bacterial samples and no documented sampling methodology, the analysis could be limited by sampling bias. This dataset is not representative of an entire population, so our model should be considered an operational approach to bacterial species identification.

**Task 2: Data exploration-** I needed to prepare my workspace and load the packages that I would be using in my analysis. I uploaded the files to the Jupyter notebook hub and made sure they were in a usable format. I used the pandas describe() method on each dataset to find general trends and irregularities in the data. In terms of Python syntax, I did not have a problem with the original column names or data distribution, so I kept them the same.

I used Python and Jupyter notebooks for the competition because of the amount of data, the need to visualize and model it, and the ability to share my results through markdown.

In [None]:
# import data analysis libraries
import pandas as pd
import numpy as np
import random as rnd
from pandas import DataFrame
import datetime

# import data visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.colors as mpl
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from plotly.offline import init_notebook_mode

# import statistics packages
from scipy.stats import skew, norm
from scipy.stats import boxcox
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax
import statsmodels.api as sm 
from statsmodels.formula.api import ols

# import machine learning models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor, BaggingRegressor
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.kernel_ridge import KernelRidge
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.linear_model import ElasticNet, ElasticNetCV
from sklearn.svm import SVR
!pip install mlxtend
from mlxtend.regressor import StackingCVRegressor
!pip install lightgbm
from lightgbm import LGBMRegressor
!pip install xgboost
from xgboost import XGBRegressor
from sklearn.pipeline import make_pipeline
import lightgbm as lgb
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder, RobustScaler, scale
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, KFold
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, roc_curve, roc_auc_score, auc, classification_report, mean_squared_error
from sklearn.cluster import KMeans
from math import factorial

# Stats
from scipy import stats

pd.set_option('display.max_columns', None)

# Ignore useless warnings
import warnings
warnings.filterwarnings(action="ignore")
pd.options.display.max_seq_items = 8000
pd.options.display.max_rows = 8000

# Load data
train = pd.read_csv('../input/tabular-playground-series-feb-2022/train.csv', index_col=0)
test = pd.read_csv('../input/tabular-playground-series-feb-2022/test.csv', index_col=0)
sub = pd.read_csv('../input/tabular-playground-series-feb-2022/sample_submission.csv')

# <span style="color:#004c6d"> Stage 3:  Process </span>

#### "No data is clean, but most is useful" - Dean Abbott

**Task 1:** Clean and process data- With a single bacterial sample serving as a unit of replication and the oligomer features already listed, I didn't need to further create or manipulate the metrics for this dataset. The dataset had no missing values. I removed duplicate rows, identified the number of bacteria species present, and transformed the predictor variables (oligomer frequency) to integer format. Later in the modeling process, I will set the dataset features as factors and address asymmetry (skew) in the data.

### <span style="color:#004c6d"> 3.1 Find number of missing values and duplicates </span>

In [None]:
# Load data success?
print("Train dataset has {} rows with {} variables each.".format(*train.shape))
print("Test dataset has {} rows with {} variables each.".format(*test.shape))

In [None]:
#Count the number of missing and duplicate values
print('Train dataset has {} missing values and {} duplicate values'\
      .format(train.isna().sum().sum(), train.duplicated().sum()))
print('Test dataset has {} missing values and {} duplicate values'\
      .format(test.isna().sum().sum(), test.duplicated().sum()))

#Drop duplicate values
train_dropped=train.drop_duplicates() 
print('Dropping Duplicate values:\nNew Train Shape: {}'.format(train_dropped.shape))

### <span style="color:#004c6d"> 3.2 Bacterial Species Data Exploration </span>

In [None]:
# identify bacterial species
train_dropped.groupby('target').describe()

In [None]:
# Bacteria Species Dataset Composition (percentage of samples per species)
bact = (train_dropped['target'].value_counts(normalize=True)
        .mul(100)
        .rename('percent')
        .rename_axis('species')
        .reset_index())
bact

In the dataset, each of the 10 bacteria species makes up around 10% of the samples, so the species are represented at similar frequencies. There is no class imbalance from sampling bias that needs to be controlled.

### <span style="color:#004c6d"> 3.3 Converting Frequency Values to Integer Format </span>
The oligomer frequency values are stored as floating-point numbers, as described in the data source. The method for converting these values back to integers follows Wood et al. (2020). The authors list a data treatment process where integer frequency counts were divided by 1,000,000 and a specific constant, called the bias, was subtracted. I algebraically reversed this transformation to recover the integer format.
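
Written out, the transformation and its inverse look like this. The symbols are my own shorthand for what the notebook's code computes, with $w, x, y, z$ the counts of A, T, G and C in the column name. The bias term equals the multinomial probability of that composition when all four bases are equally likely.

```latex
\mathrm{bias}(w,x,y,z) = \frac{10!}{w!\,x!\,y!\,z!\;4^{10}}, \qquad w + x + y + z = 10

\mathrm{stored} = \frac{\mathrm{count}}{10^{6}} - \mathrm{bias}(w,x,y,z)
\quad\Longrightarrow\quad
\mathrm{count} = \mathrm{round}\!\left( \left(\mathrm{stored} + \mathrm{bias}(w,x,y,z)\right) \cdot 10^{6} \right)

% Worked example for the column A0T0G0C10:
\mathrm{bias}(0,0,0,10) = \frac{10!}{0!\,0!\,0!\,10!\;4^{10}} = \frac{1}{1\,048\,576} \approx 9.54 \times 10^{-7}
```

For A0T0G0C10, a stored value of about $-9.54 \times 10^{-7}$ therefore corresponds to an integer count of 0.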

In [None]:
train_dropped_integers = train_dropped.copy()

# Oligomer feature columns (everything except the id and the label)
elements = [e for e in train_dropped_integers.columns if e != 'row_id' and e != 'target']

# Convert the 10 bacteria names to the integers 0 .. 9
le = LabelEncoder()
train_dropped_integers['target_num'] = le.fit_transform(train_dropped_integers.target)

def bias(w, x, y, z):
    # Multinomial probability of a 10-mer with w A's, x T's, y G's and z C's
    # when all four bases are equally likely
    return factorial(10) / (factorial(w) * factorial(x) * factorial(y) * factorial(z) * 4**10)

def bias_of(s):
    # Parse a column name such as 'A3T4G3C0' into its four base counts
    w = int(s[1:s.index('T')])
    x = int(s[s.index('T')+1:s.index('G')])
    y = int(s[s.index('G')+1:s.index('C')])
    z = int(s[s.index('C')+1:])
    return bias(w, x, y, z)

train_integer = pd.DataFrame({col: ((train_dropped_integers[col] + bias_of(col)) * 1000000).round().astype(int)
                        for col in elements})
test_integer = pd.DataFrame({col: ((test[col] + bias_of(col)) * 1000000).round().astype(int)
                       for col in elements})
train_integer

# <span style="color:#004c6d"> Stage 4-5: Analyze and Figures   </span>
#### "If you torture the data long enough, it will confess." - Ronald Coase
### <span style="color:#004c6d"> 4.1 Exploratory correlation   </span> 

For this section of descriptive statistics, I am investigating which oligomer frequencies are most strongly correlated with one another, first across all samples and then within each bacteria species, to find features that may help tell species apart.

In [None]:
cor=train_integer.corr()    
cor.style.background_gradient(cmap='plasma')

Figure 1. Correlation of oligomer frequencies in bacterial samples (n = 5.7e7). Oligomers are listed on both the x and y axes, with more strongly correlated pairs indicated by brighter colors.

### <span style="color:#004c6d"> 4.2 Oligomer Correlation </span> 
Below are the 10 most strongly correlated oligomer pairs for each species.

In [None]:
for i in train_dropped.target.unique():
    # keep only the samples of one species and drop the non-numeric label column
    cor_df=train_dropped[train_dropped.target==i].drop(columns='target')
    cor_species=cor_df.corr()
    # flatten the matrix to pairs; drop_duplicates removes the mirrored (B, A) copy of each (A, B) pair
    c = cor_species.abs().unstack().drop_duplicates().reset_index()
    c = c.rename(columns={'level_0': 'Oligomer 1', 'level_1': 'Oligomer 2', 0: 'Correlation'})
    c = c.query('.75 <= Correlation < 1').sort_values(by = 'Correlation', ascending = False).reset_index(drop=True)
    display(c.iloc[:10,:].style.background_gradient(cmap='plasma').set_caption('Strongest oligomer correlation: {}'.format(i.replace('_', ' '))))
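
The cell above removes mirrored pairs with `drop_duplicates()`, which could in principle also drop two different pairs that happen to share the same correlation value. A possible alternative, sketched here on made-up toy data, keeps only the upper triangle of the correlation matrix. The helper name `top_correlated_pairs` and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, n=10):
    """Return the n most strongly correlated column pairs, each pair listed once."""
    corr = df.corr().abs()
    # True strictly above the diagonal: drops self-correlations and mirrored pairs
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna().sort_values(ascending=False)
    pairs = pairs.rename_axis(['Oligomer 1', 'Oligomer 2']).reset_index(name='Correlation')
    return pairs.head(n)

# Toy data: column 'b' is an exact multiple of column 'a'
toy = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                    'b': [2, 4, 6, 8, 10],
                    'c': [5, 3, 4, 1, 2]})
top_pairs = top_correlated_pairs(toy)
print(top_pairs)
```

On the toy frame this lists three pairs, with the perfectly correlated `a` and `b` at the top.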

Based on the descriptive statistics, different oligomer pairs are most strongly correlated in different bacterial species. Some oligomer pairs repeat across species (e.g. A0T0G5C5 and A0T0G7C3 in *Streptococcus pyogenes* and *Enterococcus hirae*), which suggests that the bacterial species fall into groups with a similar genetic structure.
This table summarizes the results:

| Rank | Species | Oligomer 1 | Oligomer 2 | Correlation |
| --- | --- | --- | --- | --- |
| 1 | *Streptococcus pneumoniae* | A0T0G8C2 | A1T0G0C9 | 0.995343 |
| 2 | *Campylobacter jejuni* | A0T0G4C6 | A0T0G6C4 | 0.994300 |
| 3 | *Staphylococcus aureus* | A0T0G3C7 | A0T1G1C8 | 0.991636 |
| 4 | *Streptococcus pyogenes* | A0T0G5C5 | A0T0G7C3 | 0.981113 |
| 5 | *Enterococcus hirae* | A0T0G5C5 | A0T0G7C3 | 0.980968 |
| 6 | *Klebsiella pneumoniae* | A5T4G0C1 | A5T5G0C0 | 0.956983 |
| 7 | *Escherichia fergusonii* | A5T4G0C1 | A5T5G0C0 | 0.916947 |
| 8 | *Salmonella enterica* | A5T4G0C1 | A5T5G0C0 | 0.916334 |
| 9 | *Escherichia coli* | A5T4G0C1 | A5T5G0C0 | 0.904267 |
| 10 | *Bacteroides fragilis* | A1T2G6C1 | A1T2G7C0 | 0.870281 |

### <span style="color:#004c6d"> 4.3 Overall Oligomer Correlation </span> 

In [None]:
#Strongest Overall Oligomer Correlation
# recompute the correlation over all samples so it is not confused with a per-species matrix
cor = train_integer.corr()
c = cor.abs().unstack().drop_duplicates().reset_index()
c = c.rename(columns={'level_0': 'Oligomer 1', 'level_1': 'Oligomer 2', 0: 'Correlation'})
c = c.query('.85 <= Correlation < 1').sort_values(by = 'Correlation', ascending = False).reset_index(drop=True)
c.style.background_gradient(cmap='plasma')

### <span style="color:#004c6d"> 5.1 Figures: Species Type vs. Oligomer Frequency </span> 

In [None]:
# A0T0G5C5 bacterial species composition
sns.set_context("notebook", font_scale=1.5)
sns.set_style("white")
sns.set_color_codes(palette='deep')
f, ax = plt.subplots(figsize=(8, 6))
sns.barplot(x = "A0T0G5C5", y = "target", palette= 'plasma', data = train_dropped)
plt.tick_params(axis='x', labelsize=15, rotation=-45)
plt.tick_params(axis='y', labelsize=15, rotation=0)
plt.xlabel("Oligomer Frequency")
plt.ylabel("Bacterial Species")

Figure 2. Mean frequency of oligomer 'A0T0G5C5' in bacterial samples (n = 5.7e7). Length of bars indicates the mean frequency of occurrence of the oligomer in each bacterial species, error bars show seaborn's default 95% confidence interval, and color indicates bacterial species.

In [None]:
# A5T3G0C2 bacterial species composition
sns.set_context("notebook", font_scale=1.5)
sns.set_style("white")
sns.set_color_codes(palette='deep')
f, ax = plt.subplots(figsize=(8, 6))
sns.barplot(x = "A5T3G0C2", y = "target", palette= 'plasma', data = train_dropped)
plt.tick_params(axis='x', labelsize=15, rotation=-45)
plt.tick_params(axis='y', labelsize=15, rotation=0)
plt.xlabel("Oligomer Frequency")
plt.ylabel("Bacterial Species")

Figure 3. Mean frequency of oligomer 'A5T3G0C2' in bacterial samples (n = 5.7e7). Length of bars indicates the mean frequency of occurrence of the oligomer in each bacterial species, error bars show seaborn's default 95% confidence interval, and color indicates bacterial species.

In [None]:
# A3T4G3C0 bacterial species composition
sns.set_context("notebook", font_scale=1.5)
sns.set_style("white")
sns.set_color_codes(palette='deep')
f, ax = plt.subplots(figsize=(8, 6))
sns.barplot(x = "A3T4G3C0", y = "target", palette= 'plasma', data = train_dropped)
plt.tick_params(axis='x', labelsize=15, rotation=-45)
plt.tick_params(axis='y', labelsize=15, rotation=0)
plt.xlabel("Oligomer Frequency")
plt.ylabel("Bacterial Species")

Figure 4. Mean frequency of oligomer 'A3T4G3C0' in bacterial samples (n = 5.7e7). Length of bars indicates the mean frequency of occurrence of the oligomer in each bacterial species, error bars show seaborn's default 95% confidence interval, and color indicates bacterial species.

In [None]:
# A5T4G0C1 bacterial species composition
sns.set_context("notebook", font_scale=1.5)
sns.set_style("white")
sns.set_color_codes(palette='deep')
f, ax = plt.subplots(figsize=(8, 6))
sns.barplot(x = "A5T4G0C1", y = "target", palette= 'plasma', data = train_dropped)
plt.tick_params(axis='x', labelsize=15, rotation=-45)
plt.tick_params(axis='y', labelsize=15, rotation=0)
plt.xlabel("Oligomer Frequency")
plt.ylabel("Bacterial Species")

Figure 5. Mean frequency of oligomer 'A5T4G0C1' in bacterial samples (n = 5.7e7). Length of bars indicates the mean frequency of occurrence of the oligomer in each bacterial species, error bars show seaborn's default 95% confidence interval, and color indicates bacterial species.

#  <span style="color:#004c6d">  Stage 6: Machine learning modeling  </span>
####  “All models are wrong, but some are useful” -George E. P. Box.
### <span style="color:#004c6d">  6.1 Set dataset features as factors  </span> 

In [None]:
# separate the target (bacteria species) from the features
train_labels = train_dropped['target'].reset_index(drop=True)
# drop the target column; axis=1 drops a column, not a row
train_features = train_dropped.drop(['target'], axis=1)
test_features = test

# combine train and test datasets
all_features = pd.concat([train_features, test_features]).reset_index(drop=True)
# describe
all_features.shape

# get numeric data
# set the numeric datatypes
numeric_dtypes = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
# save the column name (i) in the numeric list if its dtype is numeric
numeric = []
for i in all_features.columns:
    if all_features[i].dtype in numeric_dtypes:
        numeric.append(i)
        
# check the skew of all numerical features with skew function
# lower skew is better. If we get high scores, we can transform data to make the data distribution more normal
skew_features = all_features[numeric].apply(lambda x: skew(x)).sort_values(ascending=False)

# keep the features with a skew above 0.5 (a Series indexed by column name)
high_skew = skew_features[skew_features > 0.5]
skew_index = high_skew.index

# count how many variables have high skew
print("There are {} variables with a skew > 0.5 :".format(high_skew.shape[0]))

### <span style="color:#004c6d">  6.2 Skewed Variables  </span> 

In [None]:
# set dataframe for values with high skew
skewness = pd.DataFrame({'Skew' :high_skew})
# show the 30 most skewed features
skewness.head(30)

In [None]:
fig = plt.figure()
res = stats.probplot(all_features['A1T0G0C9'], plot=plt)
plt.show()

Figure 6. Ordered values compared with theoretical quantiles of oligomer 'A1T0G0C9'.

In [None]:
fig = plt.figure()
res = stats.probplot(all_features['A0T10G0C0'], plot=plt)
plt.show()

Figure 7. Ordered values compared with theoretical quantiles of oligomer 'A0T10G0C0'.

In [None]:
fig = plt.figure()
res = stats.probplot(all_features['A0T0G9C1'], plot=plt)
plt.show()

Figure 8. Ordered values compared with theoretical quantiles of oligomer 'A0T0G9C1'.

In [None]:
# features whose skew (computed on train and test combined) exceeds 0.75
skew_cols = pd.DataFrame(skew_features.loc[skew_features > 0.75]).rename(columns={0:'Skew before'})

# Box-Cox transformation; lambda is estimated per feature on the shifted (positive) values
t=train_dropped.copy()
for i in skew_cols.index.tolist():
    t[i] = boxcox1p(t[i], boxcox_normmax(t[i] + 1))
    
# compare the skew before and after the transformation
skew_df=pd.concat([skew_cols, t[skew_cols.index].skew()], axis=1).rename(columns={0:'After'})
skew_df.head(30)

In [None]:
fig = plt.figure()
res = stats.probplot(t['A1T0G0C9'], plot=plt)
plt.show()

Figure 9. Ordered values compared with theoretical quantiles of oligomer 'A1T0G0C9' after the Box-Cox transformation.

In [None]:
fig = plt.figure()
res = stats.probplot(t['A0T10G0C0'], plot=plt)
plt.show()

Figure 10. Ordered values compared with theoretical quantiles of oligomer 'A0T10G0C0' after the Box-Cox transformation.

In [None]:
fig = plt.figure()
res = stats.probplot(t['A0T0G9C1'], plot=plt)
plt.show()

Figure 11. Ordered values compared with theoretical quantiles of oligomer 'A0T0G9C1' after the Box-Cox transformation.
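
To show what the Box-Cox step does in isolation, here is a small sketch on synthetic right-skewed data. The exponential toy data and the variable names are assumptions for illustration only; the SciPy calls are the same ones used in the cell above.

```python
import numpy as np
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax, skew

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=2000)  # toy right-skewed, non-negative data

# Estimate the Box-Cox lambda on the shifted data (Box-Cox needs positive values)
lam = boxcox_normmax(x + 1)
# boxcox1p applies the Box-Cox transform to 1 + x
x_t = boxcox1p(x, lam)

skew_before, skew_after = skew(x), skew(x_t)
print("lambda:", lam)
print("skew before:", skew_before)
print("skew after:", skew_after)
```

A skew closer to zero after the transformation indicates a more symmetric distribution, which is what the probability plots above check visually.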

### <span style="color:#004c6d">  6.3 Standardize Data </span>

In [None]:
X=t.drop('target', axis=1)
X_scaled=pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)
X_scaled.head()

### <span style="color:#004c6d">  6.4 Principal Components Analysis </span>

Now that my data features have been set as factors, the data has been processed for missing and duplicate values, and skewed features have been transformed and standardized, I can build a model for my study objective. As a step toward classifying the 10 bacterial species in the test dataset, I will use principal component analysis (PCA). PCA creates new variables called Principal Components from linear combinations of the metrics (oligomer sequences) that capture the greatest amount of variance within the dataset. Each successive Principal Component captures the maximum remaining variance along a direction orthogonal to all previous components, creating a new set of uncorrelated features that collectively explain the variability within the data. PCA allows us to summarize information with a smaller set of “summary indices”, which are more easily visualized and analyzed.
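
The properties described above can be checked on a small synthetic example. This is a sketch on made-up data (four features built from two underlying signals), not on the bacterial dataset, and all variable names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
base = rng.normal(size=(500, 2))
# Toy data: four features built from two underlying signals plus small noise
X_toy = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.1 * rng.normal(size=500),
    base[:, 1],
    base[:, 1] + 0.1 * rng.normal(size=500),
])

X_std = StandardScaler().fit_transform(X_toy)
pca_toy = PCA().fit(X_std)
scores = pca_toy.transform(X_std)

ratios = pca_toy.explained_variance_ratio_
score_corr = np.corrcoef(scores, rowvar=False)

print("explained variance ratio:", np.round(ratios, 4))
print("largest off-diagonal score correlation:",
      np.abs(score_corr - np.eye(4)).max())
```

Because the four toy features carry only two independent signals, the first two components capture nearly all of the variance, and the component scores are uncorrelated with one another.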

In [None]:
# fit PCA with all 286 components
pca = PCA(n_components=286).fit(X_scaled)
# cumulative explained variance, as a percentage
pca_sum=pd.Series(np.cumsum(pca.explained_variance_ratio_)).mul(100)
# index the components starting from 1
pca_sum.index = np.arange(1, len(pca_sum)+1)
# explained variance of each individual component, as a percentage
ind=pd.Series(pca.explained_variance_ratio_).mul(100)
# index the components starting from 1
ind.index = np.arange(1, len(ind)+1)

# PCA graph
fig = go.Figure(
    layout=go.Layout(
        updatemenus=[dict(type="buttons", direction="left", x=0.15, y=1.2, showactive=False)],
        xaxis=dict(range=[1, 287],
                   autorange=False, tickwidth=2),
        yaxis=dict(range=[0, 100],
                   autorange=False)))

fig.add_trace(go.Scatter(x=ind.index[:1], y=ind[:1], line=dict(color='#5f007f', width=3), 
                         visible=True, fill='tozeroy', opacity=0.8,
                         hovertemplate = 'Variance Explained = %{y:.2f}%<br>Principal Component %{x:.0f}',
                         name='Individual'))
fig.add_trace(go.Scatter(x=pca_sum.index[:1], y=pca_sum[:1], line=dict(color='#ADD8E6', width=3), 
                         visible=True, fill='tonexty', opacity=0.7,
                         hovertemplate = 'Variance Explained = %{y:.1f}%<br>Number of Principal Components = %{x:.0f}',
                         name='Cumulative'))

fig.update(frames=[go.Frame(data=[
    go.Scatter(x=ind.index[:i], y=ind[:i]),
    go.Scatter(x=pca_sum.index[:i], y=pca_sum[:i])])
                   for i in range(1, 287)])

fig.update_yaxes(title = 'Variance Explained', showline=True, ticksuffix='%', range=[0,105])

fig.update_layout(title='Principal Components Explained Variance', 
                  xaxis_title="Number of Principal Components",
                  hovermode="x unified", width=700,
                  legend=dict(orientation="v", yanchor="bottom", y=1.08, xanchor="right", x=.99, title=""),
                  updatemenus=[dict(buttons=list(
                      [dict(label="Play", method="animate", 
                            args=[None, {"frame": {"duration":15, "redraw": False}},{"fromcurrent": True}]),
                       dict(label="Pause", method="animate", 
                            args=[{"frame": {"duration": 0, "redraw": False}},{"mode": "immediate"},
                                  {"transition": {"duration": 0}}])]))])
fig.show()

Figure 12. The variance captured by the calculated Principal Components, both cumulatively and individually, indicated by the blue and purple lines respectively. The cumulative explained variance reaches an inflection point after a small number of Principal Components.
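
One way to turn a curve like Figure 12 into a number is to ask how many components are needed to reach a chosen share of the variance. The helper below is a hypothetical sketch; the threshold is an arbitrary assumption, not a value used elsewhere in this notebook, and the ratios are made up.

```python
import numpy as np

def components_for_variance(explained_variance_ratio, threshold=0.95):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    # searchsorted gives the first position where cumulative >= threshold
    n = int(np.searchsorted(cumulative, threshold)) + 1
    # cap at the number of components in case the threshold is never reached
    return min(n, len(cumulative))

toy_ratios = np.array([0.5, 0.3, 0.15, 0.05])  # made-up ratios for illustration
n_needed = components_for_variance(toy_ratios, threshold=0.9)
print(n_needed)
```

With a fitted scikit-learn model, the same function could be called on `explained_variance_ratio_`.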

In [None]:
# Variable significance: important oligomer sequences
df=pd.DataFrame(abs(pca.components_.T), columns=['PC'+str(i+1) for i in range(286)], index=X_scaled.columns)
pca_ind=pd.Series(pca.explained_variance_ratio_)
var_pca=[]
for i,j in zip(df.columns.tolist(), pca_ind):
    # largest absolute loading in this component, weighted by the component's explained variance
    k=df[i].nlargest(1)*j
    var_pca.append(pd.DataFrame({'Principal Component':str(i[2:]),'Gene':k.index,'Var':k.iloc[0]}))
var_pca=pd.concat(var_pca).reset_index(drop=True)
plot_df=var_pca.iloc[:10,:]

pal = sns.color_palette("plasma", 14).as_hex()[1:11]
fig = px.bar(plot_df, x='Gene', y='Var', text='Var', color='Principal Component', 
             color_discrete_sequence=pal, opacity=0.7)
fig.update_traces(texttemplate='%{text:,.3f}', textposition='outside',
                  marker_line=dict(width=1, color='#28221D'))
fig.show()

Figure 13. Composition of oligomer sequences in the Principal Components of the bacterial data analysis. Height of bar indicates the relative contribution of the oligomer sequence to the Principal Component, and color indicates the Principal Component.

In [None]:
# PC 1 and 2
pca = PCA(n_components=10).fit_transform(X_scaled)
pca_df=pd.DataFrame(data=pca, columns=['PC'+str(i+1) for i in range(0,10)]).reset_index(drop=True)
species=train_dropped.target.reset_index(drop=True).str.replace('_', ' ') 
pca_df=pd.concat([species, pca_df], axis=1)
pca_df['map'] = pca_df['target'].map(pca_df['target'].value_counts())
pca_df = pca_df.sort_values(by='map', ascending=False).drop('map', axis=1)

pal = sns.color_palette("plasma", 12).as_hex()[:10]
fig = px.scatter(pca_df, x='PC1', y='PC2', color='target', color_discrete_sequence=pal, opacity=0.4)
fig.update_traces(marker_size=7,
                  hovertemplate="Principal Component 1 = %{x}<br>Principal Component 2 = %{y}")
fig.update_layout(title='Bacteria Species Projected onto Components 1 and 2', legend_title='', 
                  xaxis_title='Component 1 (variance explained = 31.9%)', 
                  yaxis_title='Component 2 (variance explained = 20.4%)',
                  width=700, height=600)
fig.show()

Figure 14. Principal components 1 and 2 of oligomer bacterial samples. Bacterial species are indicated by color.
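
The axis titles in the cell above contain hard-coded percentages. A possible alternative, sketched here on made-up data, builds the titles from the fitted model so they stay in sync if the preprocessing changes. The names `X_demo` and `pca_demo` are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy data: five features with decreasing spread
X_demo = rng.normal(size=(200, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])

pca_demo = PCA(n_components=2).fit(X_demo)
ratio_1, ratio_2 = pca_demo.explained_variance_ratio_

# Format the ratios as percentages with one decimal place
x_title = f"Component 1 (variance explained = {ratio_1:.1%})"
y_title = f"Component 2 (variance explained = {ratio_2:.1%})"
print(x_title)
print(y_title)
```

The resulting strings could then be passed as `xaxis_title` and `yaxis_title` to `fig.update_layout`.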

To interpret this figure, we can see the bacterial sample scores plotted on the first two Principal Component axes. The separation is greatest for *Klebsiella pneumoniae* and *Bacteroides fragilis*, with varying separation across the other species depending on commonly shared oligomer sequences. Principal Components 1 and 2 explain 31.9% and 20.4% of the variance in the data, respectively. Below are the projections onto the first three principal components for each species.

In [None]:
# PC 1-3
s=pca_df.target.unique()
rgb=[]
for i in pal:
    rgb.append('rgb' + str(mpl.to_rgb(i)))

fig = make_subplots(rows=5, cols=2,
                    specs=[[{'type': 'scatter3d'}, {'type': 'scatter3d'}],
                           [{'type': 'scatter3d'}, {'type': 'scatter3d'}], 
                           [{'type': 'scatter3d'}, {'type': 'scatter3d'}],
                           [{'type': 'scatter3d'}, {'type': 'scatter3d'}],
                           [{'type': 'scatter3d'}, {'type': 'scatter3d'}]],
                    horizontal_spacing = 0.1, vertical_spacing = 0.05,
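                    # titles listed explicitly, in the same order as the traces added below
                    subplot_titles=('Bacteroides fragilis', 'Campylobacter jejuni',
                                    'Klebsiella pneumoniae', 'Streptococcus pneumoniae',
                                    'Staphylococcus aureus', 'Streptococcus pyogenes',
                                    'Salmonella enterica', 'Enterococcus hirae',
                                    'Escherichia coli', 'Escherichia fergusonii'))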

# One 3D scatter per species, filling the 5x2 grid left to right, top to bottom
species_order=['Bacteroides fragilis', 'Campylobacter jejuni', 'Klebsiella pneumoniae',
               'Streptococcus pneumoniae', 'Staphylococcus aureus', 'Streptococcus pyogenes',
               'Salmonella enterica', 'Enterococcus hirae', 'Escherichia coli', 'Escherichia fergusonii']
opacities=[0.4, 0.3, 0.3, 0.3, 0.3, 0.35, 0.35, 0.35, 0.35, 0.35]
for k, (name, op) in enumerate(zip(species_order, opacities)):
    p=pca_df[pca_df.target==name]
    fig.add_trace(go.Scatter3d(x=p.PC1, y=p.PC2, z=p.PC3, mode='markers', showlegend=False,
                               marker=dict(size=3, color=pal[k], opacity=op, line_width=1, line_color=rgb[k])),
                  row=k//2+1, col=k%2+1)
fig.update_traces(hovertemplate='Component 1 = %{x}<br>Component 2 = %{y}<br>Component 3 = %{z}<extra></extra>')
# All ten 3D scenes share the same axis titles and cubic aspect ratio
scene=dict(aspectmode='cube', xaxis_title='PC 1', yaxis_title='PC 2', zaxis_title='PC 3')
fig.update_layout(title='Bacteria Species Projected onto the first 3 Principal Components',
                  height=2500, width=800,
                  **{'scene{}'.format(k): scene for k in range(1, 11)})
fig.show()

Figure 15. Principal Components 1, 2, and 3 of oligomer frequency in bacterial species, with one panel per species. Color indicates bacterial species, and the spread of the data gives a simplified summary of oligomer characteristics per species.

### <span style="color:#004c6d">  6.5 Model Prediction </span>

To predict the bacterial species, I will fit two models: one on a reduced feature set made up of the first 100 Principal Components, and one on the full set of features in the dataset. The first 100 components cumulatively explain over 85% of the variation in the data.
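
The choice of how many components to keep can be checked against the cumulative explained variance. The helper below is a minimal sketch, not part of the original notebook: `components_for_variance` is a hypothetical name, and the example ratios are made up for illustration rather than taken from this dataset.

```python
from itertools import accumulate

def components_for_variance(ratios, threshold=0.85):
    """Return the smallest number of leading components whose cumulative
    explained-variance ratio reaches `threshold`, or None if it never does."""
    for k, total in enumerate(accumulate(ratios), start=1):
        if total >= threshold:
            return k
    return None

# Hypothetical per-component ratios, for illustration only
example_ratios = [0.40, 0.25, 0.15, 0.10, 0.05, 0.05]
print(components_for_variance(example_ratios, threshold=0.85))
```

With a fitted scikit-learn `PCA` object, the same function could be given `pca.explained_variance_ratio_` as `ratios`.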

In [None]:
s = StandardScaler()
enc=LabelEncoder()
y=train_dropped.target
y=enc.fit_transform(y)
train_dropped.drop('target', axis=1, inplace=True)

X_train, X_val, y_train, y_val = train_test_split(train_dropped, y, test_size=0.2, shuffle=True, 
                                                  stratify=y, random_state=21)

X_train_scaled = s.fit_transform(X_train)
X_val_scaled = s.transform(X_val)
X_test_scaled = s.transform(test)

pca = PCA(n_components=100)
X_train_pca=pca.fit_transform(X_train_scaled)
X_val_pca=pca.transform(X_val_scaled)
X_test_pca=pca.transform(X_test_scaled)

print("Train Shape: {} {}".format(X_train_pca.shape, y_train.shape))
print("Validation Shape: {} {}".format(X_val_pca.shape, y_val.shape))
print("Test Shape: {}\n".format(X_test_pca.shape))

et_pca=ExtraTreesClassifier(n_estimators=500,
                            class_weight='balanced',
                            random_state=92).fit(X_train_pca, y_train)
print(et_pca)

y_preds=et_pca.predict(X_val_pca)
y_probs=et_pca.predict_proba(X_val_pca)
val_acc=accuracy_score(y_true=y_val, y_pred=y_preds)
val_auc=roc_auc_score(y_true=y_val, y_score=y_probs, average='weighted', multi_class='ovr')
c=classification_report(y_val, y_preds, target_names=enc.classes_, output_dict=True)
c=pd.DataFrame(c).T.iloc[:10,][['f1-score', 'precision', 'recall', 'support']]
val_f1=c['f1-score'].mean()

print('\nModel Accuracy = {:.2f}%\nF1-Score = {:.2f}%\nArea Under the Curve = {:.3f}\n'\
      .format(val_acc*100, val_f1*100, val_auc))

c[['f1-score', 'precision', 'recall']]=c[['f1-score', 'precision', 'recall']].mul(100)
c.sort_values('f1-score', ascending=False).style\
.background_gradient(cmap='flare_r', subset=['f1-score'])\
.format({'f1-score':'{:,.1f}%', 'precision':'{:,.1f}%', 'recall':'{:,.1f}%', "support": "{:,.0f}"})

In [None]:
fpr = {}
tpr = {}
roc_auc = {}
thresh = {}

# Keep species in label-encoder order so each name matches its column in y_probs
species=c.index.str.replace('_', ' ')
for i in range(len(species)):    
    fpr[i], tpr[i], thresh[i] = roc_curve(y_val, y_probs[:,i], pos_label=i)
    roc_auc[i] = auc(fpr[i], tpr[i])

fig = go.Figure()
for k,(name,color) in enumerate(zip(species, pal)):
    fig.add_trace(go.Scatter(x=fpr[k], y=tpr[k], line=dict(color=color, width=3), opacity=0.7,
                             hovertemplate = 'True positive rate = %{y:.3f}, False positive rate = %{x:.3f}',
                             name='{} AUC = {:.3f}'.format(name,roc_auc[k])))
fig.add_shape(type="line", xref="x", yref="y", x0=0, y0=0, x1=1, y1=1, 
              line=dict(color="Black", width=1, dash="dot"))
fig.update_layout(title="Multiclass ROC Curves<br>of Bacteria Species", hovermode="x unified", 
                  hoverlabel = dict(bgcolor="white",font_size=12), xaxis=dict(zeroline=False, hoverformat=".2f"),
                  xaxis_title='False Positive Rate (1 - Specificity)', yaxis_title='True Positive Rate (Sensitivity)',
                  legend=dict(y=.1, x=.98, xanchor="right",bordercolor="black", borderwidth=.5, font=dict(size=12)),
                  height=550, width=700)
fig.show()

Figure 16. ROC curve of PCA model of bacterial species analysis.
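
Each curve in Figure 16 is a one-vs-rest curve: one species is treated as the positive class and the other nine as negative. The sketch below shows that idea in plain Python on a toy example. It is an illustration only, not the scikit-learn implementation; `roc_points` and `trapezoid_auc` are hypothetical helper names, and the labels and scores are invented.

```python
def roc_points(y_true, scores, positive):
    """One-vs-rest ROC points: `positive` is the positive class, all others negative.
    Assumes at least one positive and one negative sample."""
    labels = [1 if y == positive else 0 for y in y_true]
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Sweep the decision threshold from the highest score to the lowest
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for rank, i in enumerate(order):
        tp += labels[i]
        fp += 1 - labels[i]
        # Emit a point only after the last sample in a group of tied scores
        is_last = rank == len(order) - 1
        if is_last or scores[order[rank + 1]] != scores[i]:
            fpr.append(fp / n_neg)
            tpr.append(tp / n_pos)
    return fpr, tpr

def trapezoid_auc(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return sum((fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
               for k in range(1, len(fpr)))

# Toy data: three classes, with predicted probabilities for class 1 only
toy_y = [0, 0, 1, 1, 2, 2]
toy_scores_class1 = [0.1, 0.4, 0.35, 0.8, 0.2, 0.3]
toy_fpr, toy_tpr = roc_points(toy_y, toy_scores_class1, positive=1)
toy_auc = trapezoid_auc(toy_fpr, toy_tpr)
print(toy_auc)  # 0.875 for this toy data
```

In the notebook, the equivalent inputs for class `i` are `y_val` and the column `y_probs[:, i]`.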

In [None]:
test_preds=et_pca.predict(X_test_pca)
target=enc.inverse_transform(test_preds)
sub_pca=pd.DataFrame({'row_id':range(int(2e5),int(3e5)), 'target':target})
# Percentage of test predictions per species (value_counts already sorts in descending order)
bact=sub_pca.target.value_counts(normalize=True).mul(100).rename('percent').rename_axis('species').reset_index()

In [None]:
sub_pca.to_csv("submission_pca.csv", index=False)

### <span style="color:#004c6d">  6.6 Model Application </span>

In [None]:
print("Train Shape: {} {}".format(X_train_scaled.shape, y_train.shape))
print("Validation Shape: {} {}".format(X_val_scaled.shape, y_val.shape))
print("Test Shape: {}\n".format(X_test_scaled.shape))

et_all=ExtraTreesClassifier(n_estimators=500,  
                            class_weight='balanced', 
                            random_state=21).fit(X_train_scaled, y_train)
print(et_all)

y_preds=et_all.predict(X_val_scaled)
y_probs=et_all.predict_proba(X_val_scaled)
val_acc=accuracy_score(y_true=y_val, y_pred=y_preds)
val_auc=roc_auc_score(y_true=y_val, y_score=y_probs, average='weighted', multi_class='ovr')
c=classification_report(y_val, y_preds, target_names=enc.classes_, output_dict=True)
c=pd.DataFrame(c).T.iloc[:10,][['f1-score', 'precision', 'recall', 'support']]
val_f1=c['f1-score'].mean()

print('Model Accuracy = {:.2f}%\nF1-Score = {:.2f}%\nArea Under the Curve = {:.3f}\n'\
      .format(val_acc*100, val_f1*100, val_auc))

c[['f1-score', 'precision', 'recall']]=c[['f1-score', 'precision', 'recall']].mul(100)
c.sort_values('f1-score', ascending=False).style\
.background_gradient(subset=['f1-score'])\
.format({'f1-score':'{:,.1f}%', 'precision':'{:,.1f}%', 'recall':'{:,.1f}%', "support": "{:,.0f}"})

Now submit to the competition!

In [None]:
test_preds=et_all.predict(X_test_scaled)
target=enc.inverse_transform(test_preds)
sub=pd.DataFrame({'row_id':range(int(2e5),int(3e5)), 'target':target})
# Percentage of test predictions per species (value_counts already sorts in descending order)
bact=sub.target.value_counts(normalize=True).mul(100).rename('percent').rename_axis('species').reset_index()

In [None]:
sub.to_csv("submission.csv", index=False)
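
A quick sanity check on the submission can catch shape or label mistakes before uploading. The function below is a minimal sketch, assuming the submission has 100,000 consecutive row ids starting at 200,000 (as in the cells above) and that every target should be one of the known species. `check_submission` is a hypothetical helper, not part of any library.

```python
def check_submission(rows, known_species, first_id=200_000, n_rows=100_000):
    """Return a list of problems found in (row_id, target) pairs; empty if none."""
    problems = []
    if len(rows) != n_rows:
        problems.append('expected {} rows, found {}'.format(n_rows, len(rows)))
    ids = [row_id for row_id, _ in rows]
    if ids != list(range(first_id, first_id + len(rows))):
        problems.append('row_id values are not consecutive from {}'.format(first_id))
    unknown = {target for _, target in rows} - set(known_species)
    if unknown:
        problems.append('unknown species: {}'.format(sorted(unknown)))
    return problems

# Toy example with two rows; the real call would use the full submission
toy_rows = [(200_000, 'Escherichia coli'), (200_001, 'Salmonella enterica')]
toy_species = ['Escherichia coli', 'Salmonella enterica']
print(check_submission(toy_rows, toy_species, n_rows=2))
```

For the notebook's data, the rows could be built with `list(zip(sub.row_id, sub.target))` and the known species taken from `enc.classes_`.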