# TPS-Feb22 | 📊 EDA + 📈 ExtraTrees

# <a id="Agenda">📝 Agenda</a>
>1. [📋 Context](#Context)
>2. [📚 Loading libraries and files](#Loading)
>3. [🔍 Exploratory Data Analysis](#EDA)
>4. [✅ Cross-validation Method](#Validation)
>5. [🏋️ Model Training & Inference](#TrainingInference)

___

# <a id="Context">📋 Context</a>

> For the [February 2022 Tabular Playground Series competition](https://www.kaggle.com/c/tabular-playground-series-feb-2022), your task is to classify 10 different bacteria species using data from a genomic analysis technique that has some data compression and data loss. In this technique, 10-mer snippets of DNA are sampled and analyzed to give the histogram of base count. In other words, the DNA segment $\text{ATATGGCCTT}$ becomes $\text{A}_{2} \text{T}_{4} \text{G}_{2} \text{C}_{2}$.

## 📐 Technique used

> Block optical sequencing (BOS) method using surface-enhanced Raman spectroscopy (SERS).

| Drawbacks        | Advantages                                          |
|------------------|-----------------------------------------------------|
| Data compression | Very short reads of DNA – sufficiently short length |
| Data loss        | Suitable for very fast analysis and identification  |


### What is block optical sequencing?

> Sequencing is the process of determining the nucleic acid sequence – the order of nucleotides in DNA. In this context, the block optical technique can identify relative $\text{A}$, $\text{T}$, $\text{G}$, and $\text{C}$ content in DNA ***k***-mers (length ***k*** subsequences).

For example, all the possible ***k***-mers of the $\text{GTAGAGCTGT}$ DNA sequence are shown below:

| *k*  | *k*-mers                                 |
|------|------------------------------------------|
| 1    | G, T, A, G, A, G, C, T, G, T             |
| 2    | GT, TA, AG, GA, AG, GC, CT, TG, GT       |
| 3    | GTA, TAG, AGA, GAG, AGC, GCT, CTG, TGT   |
| 4    | GTAG, TAGA, AGAG, GAGC, AGCT, GCTG, CTGT |
| 5    | GTAGA, TAGAG, AGAGC, GAGCT, AGCTG, GCTGT |
| 6    | GTAGAG, TAGAGC, AGAGCT, GAGCTG, AGCTGT   |
| 7    | GTAGAGC, TAGAGCT, AGAGCTG, GAGCTGT       |
| 8    | GTAGAGCT, TAGAGCTG, AGAGCTGT             |
| 9    | GTAGAGCTG, TAGAGCTGT                     |
| 10   | GTAGAGCTGT                               |
<br />
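
As a small illustration of the table above, the sketch below enumerates the *k*-mers of a sequence with a sliding window. The helper name `kmers` is ours for illustration; it does not come from the paper or the competition.

```python
def kmers(sequence, k):
    """Return every length-k substring of `sequence`, in order of position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


sequence = "GTAGAGCTGT"

# Print one row per k, mirroring the layout of the table.
for k in range(1, len(sequence) + 1):
    print(k, ", ".join(kmers(sequence, k)))
```

A sequence of length $n$ yields $n - k + 1$ overlapping *k*-mers, which is why each row of the table has one entry fewer than the row before it.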

Since this [paper](https://www.frontiersin.org/articles/10.3389/fmicb.2020.00257/full) uses 10-mer blocks, our 10-base $\text{ATATGGCCTT}$ sequence forms a single block and is not split further. However, as the introductory example shows, it does not survive the measurement intact: only its base counts are kept, as described below.

### What is surface-enhanced Raman spectroscopy?

#### Raman spectroscopy

> Raman spectroscopy is a chemical analysis method used to determine the structure of molecules in a sample. It is a noninvasive method, meaning that it does not destroy the sample. Raman spectroscopy, like infrared, makes it possible to get access to the vibrational levels of molecules.
> 
> Raman spectroscopy provides information about:
> * Chemical structure and identity
> * Phase and polymorphism
> * Intrinsic stress/strain
> * Contamination and impurity

#### Surface-enhanced Raman spectroscopy

> Raman spectroscopy is a low-sensitivity vibrational spectroscopy, which restricts the analysis of chemical species to high concentrations. However, when molecules are placed near rough metal surfaces or nanostructures, it is possible to significantly enhance their Raman signature. This is known as surface-enhanced Raman scattering.

📌 According to the [paper](https://www.frontiersin.org/articles/10.3389/fmicb.2020.00257/full):
> Because the Raman spectrum of each $\text{A}$, $\text{T}$, $\text{G}$, and $\text{C}$ base is known, the overall $\text{ATGC}$ content of a single ***k***-mer can be calculated by mathematical analysis of the ***k***-mer spectrum. Sequence information is lost, but the base content–called block optical content (BOC)–is preserved. For example, the 10 bp DNA segment $\text{ATATGGCCTT}$ would become a BOC datum of $\text{A}_{2} \text{T}_{4} \text{G}_{2} \text{C}_{2}$.
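
To make the loss of sequence information concrete, here is a minimal sketch that reduces a *k*-mer to its base counts. The function names `block_optical_content` and `boc_label` are hypothetical helpers, and the `A2T4G2C2` label format is only an assumption modelled on the notation used in the quote.

```python
from collections import Counter


def block_optical_content(kmer):
    """Count how often each base occurs; the order of the bases is discarded."""
    counts = Counter(kmer)
    return {base: counts.get(base, 0) for base in "ATGC"}


def boc_label(kmer):
    """Format the counts as a compact label, in the fixed base order A, T, G, C."""
    boc = block_optical_content(kmer)
    return "".join(f"{base}{boc[base]}" for base in "ATGC")


print(boc_label("ATATGGCCTT"))
# A different sequence with the same base counts gets the same label.
print(boc_label("TTTTAAGGCC"))
```

Both calls produce the same label, so the original order of the bases cannot be recovered from the BOC alone.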

## 🦠 Bacteria species (classes)

* [Streptococcus_pyogenes](https://en.wikipedia.org/wiki/Streptococcus_pyogenes)
* [Salmonella_enterica](https://ru.wikipedia.org/wiki/Salmonella_enterica)
* [Enterococcus_hirae](https://en.wikipedia.org/wiki/Enterococcus_hirae)
* [Escherichia_coli](https://en.wikipedia.org/wiki/Escherichia_coli)
* [Campylobacter_jejuni](https://en.wikipedia.org/wiki/Campylobacter_jejuni)
* [Streptococcus_pneumoniae](https://en.wikipedia.org/wiki/Streptococcus_pneumoniae)
* [Staphylococcus_aureus](https://en.wikipedia.org/wiki/Staphylococcus_aureus)
* [Escherichia_fergusonii](https://en.wikipedia.org/wiki/Escherichia_fergusonii)
* [Bacteroides_fragilis](https://en.wikipedia.org/wiki/Bacteroides_fragilis)
* [Klebsiella_pneumoniae](https://en.wikipedia.org/wiki/Klebsiella_pneumoniae)

## 🎯 Goal

Can you use this lossy information to accurately predict bacteria species?

⬆️ [Back to the top](#Agenda) ⬆️
___
# <a id="Loading">📚 Loading libraries and files</a>

In [None]:
%%capture

# Intel® Extension for Scikit-learn installation:
!pip install scikit-learn-intelex

import os
import warnings

import numpy as np  # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import seaborn as sns
import matplotlib.pyplot as plt

import plotly.express as px
import plotly.graph_objects as go

%matplotlib inline

from scipy.stats import mode
from tqdm import tqdm
from pathlib import Path

from sklearnex import patch_sklearn
patch_sklearn()

# Mute warnings
warnings.filterwarnings("ignore")

In [None]:
!tree ../input/

In [None]:
data_dir = Path('../input/tabular-playground-series-feb-2022')

df_train = pd.read_csv(data_dir / 'train.csv', index_col='row_id')
df_test  = pd.read_csv(data_dir / 'test.csv', index_col='row_id')

TARGET = df_train.columns.difference(df_test.columns)[0]

⬆️ [Back to the top](#Agenda) ⬆️
___
# <a id="EDA">🔍 Exploratory Data Analysis</a>

In [None]:
df_train.head(5)

In [None]:
print('Train set - dimensions: \t', df_train.shape)
print('Test set - dimensions: \t', df_test.shape)

### Missing values

In [None]:
print('Train set - missing values: \t', df_train.isnull().sum().sum())
print('Test set - missing values: \t', df_test.isnull().sum().sum())

### Target distribution

In [None]:
target_distrib = pd.DataFrame({
    'count': df_train[TARGET].value_counts(),
    'share': df_train[TARGET].value_counts() / df_train.shape[0] * 100
})

target_distrib.sort_index()

In [None]:
fig = go.Figure(data=[
    go.Pie(
        labels=target_distrib.index, 
        values=target_distrib['count'],
        hole=0.2
    )
])
fig.show()

### Features correlation

In [None]:
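plt.figure(figsize=(18, 14))
# Correlate numeric columns only: the string target makes `corr()` fail on recent pandas
correlation = df_train.select_dtypes('number').corr()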
sns.heatmap(
    correlation, 
    vmin=0
)
plt.show()

In [None]:
# Highly correlated feature pairs, excluding self-correlations
threshold = 0.8
correlation = df_train.select_dtypes('number').corr()

corr_pairs = (
    correlation[abs(correlation) > threshold][correlation != 1.0]
).unstack().dropna().to_dict()

unique_corr_pairs = pd.DataFrame(
    list(
        set([(tuple(sorted(key)), corr_pairs[key]) for key in corr_pairs])
    ), columns=['pair', 'corr']
)

unique_corr_pairs

### Numerical/Categorical features

In [None]:
data = pd.concat([df_train, df_test])

In [None]:
# Features with cardinality < 25 (the target is excluded explicitly)
nunique = data.drop(columns=TARGET).nunique()
feature_nb_sub25 = nunique[nunique < 25]
feature_nb_sub25

In [None]:
# Categorical features
cat_features = feature_nb_sub25.index.tolist()

# Numerical features (everything that is neither categorical nor the target)
num_features = data.columns.difference(cat_features + [TARGET])

print('\033[1;31;43m Categorical features: \033[0;0m \n', cat_features)

In [None]:
fig = go.Figure(data=[
    go.Pie(
        labels=[
            'Numerical features', 
            'Categorical features (cardinality < 25)'
        ], 
        values=[len(num_features), len(cat_features)], 
        pull=[0.1, 0.1],
        hole=0.2, 
        rotation=95
    )
    
])
fig.show()

### Dropping duplicated rows
📌 This part was updated and is largely inspired by these two notebooks:
> * [TPS - Feb 2022](https://www.kaggle.com/sfktrkl/tps-feb-2022/notebook) by [Şafak Türkeli](https://www.kaggle.com/sfktrkl)
> * [TPSFEB22-02 Postprocessing against the mutants 💀](https://www.kaggle.com/ambrosm/tpsfeb22-02-postprocessing-against-the-mutants) by [AmbrosM](https://www.kaggle.com/ambrosm)

In [None]:
print('Train data samples (w/ duplicates): \t', df_train.shape[0])

# Creating a new df without duplicates
df_train_dedup = pd.DataFrame(
    [list(tup) for tup in df_train.value_counts().index.values], 
    columns=df_train.columns
)

print('Train data samples (w/o duplicates): \t', df_train_dedup.shape[0])

As rightly [mentioned](https://www.kaggle.com/c/tabular-playground-series-feb-2022/discussion/305733#1678838) by [AmbrosM](https://www.kaggle.com/ambrosm), duplicates must not be dropped without compensation. One solution is to pass the number of occurrences of each row as the `sample_weight` parameter during training, to compensate for the information loss. We therefore add a column to the new dataset that will be used later.
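
The toy sketch below illustrates why the weights matter. It uses made-up rows and plain NumPy rather than the competition data: after deduplication the unweighted class share is distorted, while weighting each unique row by its number of occurrences recovers the original share.

```python
import numpy as np

# Made-up data: each row is (feature_1, feature_2, label); one row occurs three times.
rows = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 1],
    [2, 2, 1],
])

# Unique rows plus the number of times each one occurred (the future sample weights).
unique_rows, counts = np.unique(rows, axis=0, return_counts=True)
labels = unique_rows[:, -1]

share_original = np.mean(rows[:, -1] == 0)                 # share of class 0 before dedup
share_unweighted = np.mean(labels == 0)                    # after dedup, no compensation
share_weighted = np.average(labels == 0, weights=counts)   # after dedup, with weights

print(share_original, share_unweighted, share_weighted)
```

The same idea carries over to model fitting: an estimator that accepts `sample_weight` can treat a unique row with weight 3 much like three identical rows.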

In [None]:
# Values to be used during the training phase
df_train_dedup['sample_weight'] = df_train.value_counts().values
df_train_dedup[['sample_weight']]

### Target distribution after dropping duplicated rows

In [None]:
target_distrib['count_w_drop'] = df_train_dedup.target.value_counts()
target_distrib['share_w_drop'] = target_distrib['count_w_drop'] / df_train_dedup.shape[0] * 100

target_distrib.sort_index()

In [None]:
fig = go.Figure()

fig.add_trace(
    go.Bar( 
        x=target_distrib.index, 
        y=target_distrib['count'],
        name='Before drop',
        opacity=0.35
    )
)
fig.add_trace(
    go.Bar(
        x=target_distrib.index, 
        y=target_distrib['count_w_drop'],
        name='After drop'
    ) 
)

fig.update_layout(
    title_text='Classes distribution',
    xaxis_title_text='Classes',
    yaxis_title_text='Count',
    barmode='overlay',
    legend=dict(
        xanchor="right",
        yanchor="top",
        x=0.99,
        y=1.25
    )
)

⬆️ [Back to the top](#Agenda) ⬆️
___
# <a id="Validation">✅ Cross-validation method</a>

In [None]:
# Function modified from:
# https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html
# Inspired by https://www.kaggle.com/tomwarrens/timeseriessplit-how-to-use-it/notebook

from matplotlib.patches import Patch

def plot_cv_indices(cv, X, y, n_splits, date_col=None):
    """Create a sample plot for indices of a cross-validation object."""
    
    fig, ax = plt.subplots(1, 1, figsize = (16, 10))

    # Generate the training/testing visualizations for each CV split
    for ii, (tr, tt) in enumerate(cv.split(X=X, y=y)):
        # Fill in indices with the training/test groups
        indices = np.array([np.nan] * len(X))
        indices[tt] = 1
        indices[tr] = 0

        # Visualize the results
        ax.scatter(
            range(len(indices)),
            [ii + 0.5] * len(indices),
            c=indices,
            marker="_",
            lw=10,
            cmap=cmap_cv,
            vmin=-0.2,
            vmax=1.2,
            zorder=2
        )

    # Formatting
    yticklabels = list(range(n_splits))
    
    if date_col is not None:
        tick_locations  = ax.get_xticks()
        tick_dates = [" "] + date_col.iloc[list(tick_locations[1:-1])].astype(str).tolist() + [" "]

        tick_locations_str = [str(int(i)) for i in tick_locations]
        new_labels = ['\n\n'.join(x) for x in zip(list(tick_locations_str), tick_dates)]
        
        ax.set_xticks(tick_locations)
        ax.set_xticklabels(new_labels)
    
    # Custom visualization
    ax.set_facecolor('#fcfcfc')
    ax.grid(alpha=0.7, linewidth=1, zorder=0)
    
    ax.set_yticks(np.arange(n_splits) + .5)
    ax.set_yticklabels(yticklabels)
    ax.set_ylabel('CV iteration', fontsize=15, labelpad=10)
    ax.set_ylim([n_splits+0.2, -.2])
    ax.yaxis.set_tick_params(labelsize=12, pad=10, length=0)
    
    ax.set_xlabel('Sample index', fontsize=15, labelpad=10)
    ax.xaxis.set_tick_params(labelsize=12, pad=10, length=0)
    
    ax.legend(
        [
            Patch(color=cmap_cv(.8)), 
            Patch(color=cmap_cv(.02))
        ],
        [
            'Testing set', 
            'Training set'
        ],
        fontsize=12,
        loc=(1.02, .8)
    )
    
    ax.set_title(
        '{}'.format(type(cv).__name__),
        loc="left", 
        color="#000", 
        fontsize=20, 
        pad=5, 
        y=1, 
        zorder=3
    )
    
    return ax

In [None]:
# Features without target and sample weights
features = df_train.columns[df_train.columns != TARGET]

In [None]:
from sklearn.preprocessing import LabelEncoder

# Encoding target
le = LabelEncoder()

X = df_train_dedup[features]
# Keep y one-dimensional so scikit-learn does not have to flatten a column vector
y = pd.Series(le.fit_transform(df_train_dedup[TARGET]), name=TARGET)
sample_weight = df_train_dedup['sample_weight'] # sample weights

In [None]:
from sklearn.model_selection import StratifiedKFold

SEED = 42
N_SPLITS = 10

folds = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=SEED)

# Visualization
cmap_cv = plt.cm.winter
plot_cv_indices(folds, X, y, folds.n_splits);

⬆️ [Back to the top](#Agenda) ⬆️
___
# <a id="TrainingInference">🏋️ Model Training & Inference</a>

In [None]:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score

y_pred, y_prob, scores = [], [], []

for fold, (train_id, valid_id) in enumerate(tqdm(folds.split(X, y), total=N_SPLITS)):
    
    # Splitting (w/ sample weights)
    X_train, y_train, sample_weight_train = X.iloc[train_id], y.iloc[train_id], sample_weight.iloc[train_id]
    X_valid, y_valid, sample_weight_valid = X.iloc[valid_id], y.iloc[valid_id], sample_weight.iloc[valid_id]
    
    # Model with params
    params = {
        'n_estimators': 300,
    }
    
    model = ExtraTreesClassifier(
        **params,
        n_jobs=-1,
        random_state=SEED
    )

    # Training (w/ sample weights)
    model.fit(X_train, y_train, sample_weight=sample_weight_train)
    
    # Evaluation
    valid_pred = model.predict(X_valid)
    valid_score = accuracy_score(y_valid, valid_pred, sample_weight=sample_weight_valid)
    
    print(f'### \033[1;31;43m Fold: {fold} \033[0;0m')
    print(f'Accuracy score: {valid_score:.6f} \n')
    
    scores.append(valid_score)
    
    # Prediction for submission
    y_pred.append(model.predict(df_test))
    y_prob.append(model.predict_proba(df_test))

In [None]:
score = np.array(scores).mean()
print(f'Mean accuracy score: {score:.6f}')

### Feature importance

In [None]:
# Note: `model` is the estimator fitted on the last CV fold only
df_feature_imp = pd.DataFrame({
    'feature': X.columns, 
    'importance': model.feature_importances_
})

feature_imp_25 = df_feature_imp.sort_values(
    by='importance', ascending=False
).iloc[:25].reset_index(drop=True)

fig = go.Figure(
    go.Bar(
        x=feature_imp_25.importance,
        y=feature_imp_25.feature,
        orientation='h',
        marker=dict(color=feature_imp_25.importance)
    )
)

fig.update_layout(
    title_text='Feature importance',
    xaxis_title_text='Importance',
    yaxis_title_text='Features',
    height=1000,
    yaxis=dict(autorange='reversed')
)
fig.show()

### Ensembling

In [None]:
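# Majority vote across folds (np.ravel keeps this working whether or not
# the installed SciPy version keeps the reduced axis in the result)
y_pred = np.ravel(mode(np.array(y_pred), axis=0).mode)
y_pred = le.inverse_transform(y_pred)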
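
As a rough illustration of majority voting across folds, the NumPy-only sketch below works on made-up predictions. The helper `majority_vote` is hypothetical and is not the code used in this notebook; ties go to the smallest label.

```python
import numpy as np


def majority_vote(fold_preds, n_classes):
    """Return the most frequent label per sample; ties go to the smallest label."""
    fold_preds = np.asarray(fold_preds)  # shape: (n_folds, n_samples)
    # Count the votes per class for every sample (one column per sample).
    votes = np.apply_along_axis(np.bincount, 0, fold_preds, minlength=n_classes)
    return votes.argmax(axis=0)          # shape: (n_samples,)


# Made-up encoded predictions from 3 folds for 4 test samples.
toy_preds = [
    [0, 1, 2, 2],
    [0, 1, 1, 2],
    [1, 1, 2, 0],
]
voted = majority_vote(toy_preds, n_classes=3)
print(voted)
```

Each column holds the votes of all folds for one test sample, and the label with the most votes wins.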

### Post-processing

📌 This part was updated and is largely inspired by these two notebooks:
> * [TPSFEB22-02 Postprocessing against the mutants 💀](https://www.kaggle.com/ambrosm/tpsfeb22-02-postprocessing-against-the-mutants) by [AmbrosM](https://www.kaggle.com/ambrosm)
> * [TPS - Feb 2022](https://www.kaggle.com/sfktrkl/tps-feb-2022/notebook) by [Şafak Türkeli](https://www.kaggle.com/sfktrkl)

Classes distribution:

In [None]:
target_distrib['pred_count'] = pd.Series(y_pred, index=df_test.index).value_counts()
target_distrib['pred_share'] = target_distrib['pred_count'] / len(df_test) * 100

target_distrib.iloc[:, -2:].sort_index()

In [None]:
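def get_diff(bias):
    """Per-class gap (in percentage points) between train and predicted shares."""
    y_pred_tuned = np.argmax(y_prob + bias, axis=1)
    share_train = target_distrib['share_w_drop'].sort_index().values
    # bincount keeps a slot for every class, even one that is never predicted
    counts = np.bincount(y_pred_tuned, minlength=len(share_train))
    share_pred = counts / len(df_test) * 100
    diff = share_train - share_pred
    
    return diff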

def custom_bias(diff, bias):
    while abs(diff).max() > 0.1:
        for i in range(len(diff)):
            if diff[i] > 0.1:
                bias[i] += 0.001
                break
            if diff[i] < -0.1:
                bias[i] -= 0.001
                break

        diff = get_diff(bias)
    
    return bias
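
To show what the bias does, here is a toy sketch with made-up probabilities and a hypothetical helper `class_shares`. Adding a small constant to one class column flips only the borderline samples towards that class, which shifts the predicted class shares.

```python
import numpy as np

# Made-up averaged probabilities for 4 samples and 3 classes.
probs = np.array([
    [0.50, 0.45, 0.05],
    [0.48, 0.47, 0.05],
    [0.10, 0.80, 0.10],
    [0.40, 0.35, 0.25],
])


def class_shares(probs, bias):
    """Share (in %) of each class among the argmax predictions after adding the bias."""
    preds = np.argmax(probs + bias, axis=1)
    counts = np.bincount(preds, minlength=probs.shape[1])
    return counts / len(probs) * 100


before = class_shares(probs, np.zeros(3))
after = class_shares(probs, np.array([0.0, 0.04, 0.0]))
print(before, after)
```

Here only the second sample changes class, because its top two probabilities were within 0.04 of each other. The loop in `custom_bias` repeats such small adjustments until the predicted shares are close to the training shares.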

In [None]:
y_prob = sum(y_prob) / len(y_prob)
bias = np.zeros(df_train[TARGET].nunique())

diff = get_diff(bias)
print(f'\033[1;31;43m Difference: \033[0;0m \n{diff}')

In [None]:
bias = custom_bias(diff, bias)
print(f'\033[1;31;43m Bias to add: \033[0;0m \n{bias}')

In [None]:
y_prob += bias
y_pred_tuned = le.inverse_transform(np.argmax(y_prob, axis=1))

In [None]:
target_distrib['tuned_pred_count'] = pd.Series(y_pred_tuned, index=df_test.index).value_counts()
target_distrib['tuned_pred_share'] = target_distrib['tuned_pred_count'] / len(df_test) * 100

target_distrib.iloc[:, -4:].sort_index()

In [None]:
fig = go.Figure()

fig.add_trace(
    go.Bar( 
        x=target_distrib.index, 
        y=target_distrib['pred_count'],
        name='Predictions',
        opacity=0.35
    )
)
fig.add_trace(
    go.Bar(
        x=target_distrib.index, 
        y=target_distrib['tuned_pred_count'],
        name='Predictions (tuned)'
    ) 
)

fig.update_layout(
    title_text='Classes distribution',
    xaxis_title_text='Classes',
    yaxis_title_text='Count',
    barmode='group',
    bargap=0.2, 
    bargroupgap=0.1,
    legend=dict(
        xanchor="right",
        yanchor="top",
        x=0.99,
        y=1.25
    )
)

In [None]:
target_distrib.sort_index()

### Submission

In [None]:
submission = pd.read_csv(data_dir / 'sample_submission.csv')

submission[TARGET] = y_pred_tuned

display(submission.head(10), submission.tail(10))

In [None]:
submission.to_csv('submission.csv', index=False)