# **TPS Feb 2022 - Bacteria Species (keras_tuner-Beginner)**

---
# **Table of Contents**
<a id="toc"></a>
- [1. Introduction](#1)
- [2. Loading Libraries and Files](#2)
- [3. Quick Look at the Data](#3)
   - [3.1 Exploring Train Data](#3.1)
   - [3.2 Exploring Test Data](#3.2)
   - [3.3 Checking the Submission File](#3.3)
   - [3.4 Basic Preparation](#3.4)
- [4. Exploratory Data Analysis](#4)
   - [4.1 Target Distribution](#4.1)
   - [4.2 Feature Correlation](#4.2)
   - [4.3 Continuous and Categorical Data Distribution](#4.3)
   - [4.4 Checking Duplicated Rows](#4.4)
- [5. Feature Engineering](#5)
- [6. Modeling](#6)
   - [6.1 Modeling with Keras](#6.1)
   - [6.2 Finding the Best Hyperparameters](#6.2)
   - [6.3 Checking the Model](#6.3)
- [7. Submission](#7)
- [8. References](#8)

---
<a id="1"></a>
# **1. Introduction**
> For the [February 2022 Tabular Playground Series competition](https://www.kaggle.com/c/tabular-playground-series-feb-2022), your task is to classify 10 different bacteria species using data from a genomic analysis technique that has some data compression and data loss. In this technique, 10-mer snippets of DNA are sampled and analyzed to give the histogram of base count. In other words, the DNA segment $\text{ATATGGCCTT}$ becomes $\text{A}_{2} \text{T}_{4} \text{G}_{2} \text{C}_{2}$. The idea for this competition came from the following [paper](https://www.frontiersin.org/articles/10.3389/fmicb.2020.00257/full).
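
The compression step described in the quote can be reproduced in a few lines. This is a minimal sketch using only the standard library; the helper names `base_histogram` and `to_label` are made up for illustration and are not part of the competition code.

```python
from collections import Counter

def base_histogram(segment: str) -> dict:
    """Count how often each base (A, T, G, C) occurs in a DNA segment."""
    counts = Counter(segment)
    return {base: counts.get(base, 0) for base in "ATGC"}

def to_label(hist: dict) -> str:
    """Render a histogram in the A2T4G2C2 style used in the quote above."""
    return "".join(f"{base}{n}" for base, n in hist.items())

hist = base_histogram("ATATGGCCTT")
label = to_label(hist)
print(hist, label)
```

The order of the bases is lost: any 10-mer with the same counts, such as `TTATATGGCC`, maps to the same histogram. This is the data loss the quote refers to.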


In this notebook, I use keras_tuner's `BayesianOptimization` tuner to search for the best hyperparameters (see section 6.2). Please note that this notebook takes more than 5 hours to run on a GPU. I would appreciate any comments on how to reduce the run time.

---
<a id="2"></a>
# **2. Loading Libraries and Files**


In [None]:
%%capture

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import time
import warnings
warnings.filterwarnings('ignore')

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from tensorflow import keras
from tensorflow.keras import layers
from keras_tuner.tuners import BayesianOptimization

pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('float_format', '{:f}'.format)

In [None]:
!tree ../input/

In [None]:
train = pd.read_csv("../input/tabular-playground-series-feb-2022/train.csv")
test = pd.read_csv("../input/tabular-playground-series-feb-2022/test.csv")
submission = pd.read_csv("../input/tabular-playground-series-feb-2022/sample_submission.csv")

---
<a id="3"></a>
# **3. Quick Look at the Data**
<a id="3.1"></a>
## 3.1 Exploring Train Data

In [None]:
train.head()

In [None]:
print(f'Number of rows and columns in train data: {train.shape}')
print(f'Number of values in train data: {train.count().sum()}')
print(f'Number of missing values in train data: {sum(train.isna().sum())}')

In [None]:
train.info()

In [None]:
train.describe()

<a id="3.2"></a>
## 3.2 Exploring Test Data

In [None]:
test.head()

In [None]:
print(f'Number of rows and columns in test data: {test.shape}')
print(f'Number of values in test data: {test.count().sum()}')
print(f'Number of missing values in test data: {sum(test.isna().sum())}')

In [None]:
test.info()

In [None]:
test.describe()

<a id="3.3"></a>
## 3.3 Checking the Submission File

In [None]:
submission.head()

<a id="3.4"></a>
## 3.4 Basic Preparation

Keep `row_id` from the test data for the submission file.

In [None]:
row_id = test['row_id']

Convert the 10 bacteria names to the integers 0 to 9.

In [None]:
le = LabelEncoder()
train['target_num'] = le.fit_transform(train.target)
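
`LabelEncoder` numbers the classes in sorted order, and `inverse_transform` maps the integers back to names. The NumPy sketch below shows the same idea without scikit-learn. The label strings are invented placeholders, not the competition's species names.

```python
import numpy as np

# Hypothetical labels, for illustration only.
labels = np.array(["species_b", "species_a", "species_c", "species_a"])

# np.unique returns the sorted distinct labels and, for each input element,
# the position of its label in that sorted array.
classes, codes = np.unique(labels, return_inverse=True)

# Decoding is a plain lookup into the sorted class array.
decoded = classes[codes]
print(classes, codes, decoded)
```

Because the codes are positions in a sorted array, the mapping is reproducible as long as the set of class names does not change.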

Combine the train and test data into a single DataFrame.

In [None]:
df = pd.concat([train,test], ignore_index = True)

Collect the names of the feature columns only.

In [None]:
FEATURES = [col for col in df.columns if col not in ['row_id', 'target', 'target_num']]

---
<a id="4"></a>
# **4. Exploratory Data Analysis**
<a id="4.1"></a>
## 4.1 Target Distribution

In [None]:
plt.figure(figsize=(18,10))
plt.title("Bar of target")
sns.countplot(y='target', data=df)
plt.show()

species = df.groupby('target').size()
plt.figure(figsize=(18,10))
plt.title("Pie of target")
plt.pie(x=species,
       labels=species.index,
       counterclock=False, startangle=90,
       autopct='%1.1f%%', pctdistance=0.7)
plt.show()

<a id="4.2"></a>
## 4.2 Feature Correlation

In [None]:
correlation_FEATURES = FEATURES.copy()

In [None]:
correlation = df[correlation_FEATURES][df['target_num'].notnull()].corr()
plt.figure(figsize=(18, 14))
sns.heatmap(correlation)
plt.show()

https://www.kaggle.com/maxencefzr/tps-feb22-eda-extratrees  
Show the highly correlated feature pairs.

In [None]:
threshold = 0.8
correlation = df[correlation_FEATURES][df['target_num'].notnull()].corr()

corr_pairs = (
    correlation[abs(correlation) > threshold][correlation != 1.0]
).unstack().dropna().to_dict()

unique_corr_pairs = pd.DataFrame(
    list(
        set([(tuple(sorted(key)), corr_pairs[key]) for key in corr_pairs])
    ), columns=['pair', 'corr']
)

unique_corr_pairs
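
An alternative way to list each correlated pair exactly once is to mask the correlation matrix down to its strict upper triangle before filtering. This is a sketch on synthetic data, not a drop-in replacement for the cell above. The frame `demo` and its columns `a`, `b`, `c` are assumptions made up for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),  # nearly a copy of "a"
    "c": rng.normal(size=200),                 # independent noise
})

corr = demo.corr()

# Keep the strict upper triangle: the diagonal and mirrored pairs become NaN.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()  # Series indexed by (feature_1, feature_2)

threshold = 0.8
strong = pairs[pairs.abs() > threshold]
print(strong)
```

Masking removes the self-correlations on the diagonal without comparing floats to `1.0`, and it avoids the sort-and-deduplicate step.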

<a id="4.3"></a>
## 4.3 Continuous and Categorical Data Distribution

In [None]:
cate_features = [col for col in FEATURES if df[col].nunique() < 25]
cont_features = [col for col in FEATURES if df[col].nunique() >= 25]

In [None]:
print(f'Total number of features: {len(FEATURES)}')
print(f'Number of categorical (<25 Unique Values) features: {len(cate_features)}')
print(f'Number of continuous features: {len(cont_features)}')

<a id="4.4"></a>
## 4.4 Checking Duplicated Rows

In [None]:
print('No. of train data samples (w/ duplicates): \t', df['target_num'].notnull().sum())
print('No. of duplicates in train data samples: \t', df[FEATURES][df['target_num'].notnull()].duplicated().sum())

This shows that the train data contains many duplicated rows, but I keep them as they are this time.

---
<a id="5"></a>
# **5. Feature Engineering**

In [None]:
# https://www.kaggle.com/odins0n/tps-feb-22-eda-modelling#Feature-Engineering
df["mean"] = df[FEATURES].mean(axis=1)
df["std"] = df[FEATURES].std(axis=1)
df["min"] = df[FEATURES].min(axis=1)
df["max"] = df[FEATURES].max(axis=1)

---
<a id="6"></a>
# **6. Modeling**
<a id="6.1"></a>
## 6.1 Modeling with Keras

In [None]:
train_new = df[df['target'].notnull()]
test_new  = df[df['target'].isnull()]

In [None]:
X = train_new.copy()
y = X.pop('target_num')
X.drop(columns=['row_id', 'target'], inplace=True)

In [None]:
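scale = StandardScaler()
def scaling(df, fit=True):
    '''Standardize df; fit the shared scaler only when fit=True (training data).'''
    # With fit=False the statistics learned from an earlier call are reused.
    values = scale.fit_transform(df) if fit else scale.transform(df)
    return pd.DataFrame(values, columns=df.columns)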

In [None]:
X = scaling(X)
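
A common practice with standardization is to compute the mean and standard deviation on the training features only and to reuse those statistics for any later data. The NumPy sketch below shows what that means numerically. The arrays `train_demo` and `test_demo` are tiny invented examples, not the competition data.

```python
import numpy as np

train_demo = np.array([[1.0, 10.0],
                       [2.0, 20.0],
                       [3.0, 30.0]])
test_demo = np.array([[2.0, 40.0]])

# Statistics come from the training rows only.
mean = train_demo.mean(axis=0)
std = train_demo.std(axis=0)  # population std (ddof=0), as StandardScaler uses

train_scaled = (train_demo - mean) / std
test_scaled = (test_demo - mean) / std  # same mean and std as for training
print(train_scaled, test_scaled)
```

With scikit-learn this corresponds to calling `fit_transform` once on the training data and `transform` on everything else.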

In [None]:
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3,  random_state=0)

In [None]:
input_shape = [X_train.shape[1]]
print("Input shape: {}".format(input_shape))

<a id="6.2"></a>
## 6.2 Finding the Best Hyperparameters

The tuner searches over four hyperparameters:

 1. Number of hidden layers, from 2 to 10: `hp.Int('num_layers', 2, 10)`
 2. Units per hidden layer, from 32 to 512 in steps of 32: `hp.Int('units_' + str(i), min_value=32, max_value=512, step=32)`
 3. `BatchNormalization`, `Dropout`, or both after each hidden layer: `hp.Choice('batchnorm_and_dropout', ['batch', 'dropout', 'both'])`
 4. Optimizer, either `rmsprop` or `adam`: `hp.Choice(name="optimizer",values=["rmsprop","adam"])`
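
To get a feel for what one point in this search space looks like, the sketch below draws a configuration uniformly at random with the standard library. It is only a stand-in for illustration: it does not use keras_tuner, and Bayesian optimization does not sample uniformly. The function name `sample_config` is hypothetical.

```python
import random

def sample_config(rng: random.Random) -> dict:
    """Draw one random configuration from the four-part search space."""
    num_layers = rng.randint(2, 10)  # 2 to 10 inclusive
    # One width per hidden layer, chosen from 32, 64, ..., 512.
    units = [rng.randrange(32, 513, 32) for _ in range(num_layers)]
    return {
        "num_layers": num_layers,
        "units": units,
        "batchnorm_and_dropout": rng.choice(["batch", "dropout", "both"]),
        "optimizer": rng.choice(["rmsprop", "adam"]),
    }

config = sample_config(random.Random(0))
print(config)
```

Each layer width has 16 possible values, so the space grows quickly with depth, and exhaustive search is impractical.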

In [None]:
def build_model(hp):
    """Build a dense classifier with tunable depth, width, regularization and optimizer."""
    # One shared choice for all hidden layers: batch norm, dropout, or both.
    regularization = hp.Choice('batchnorm_and_dropout', ['batch', 'dropout', 'both'])

    model = keras.Sequential()
    for i in range(hp.Int('num_layers', 2, 10)):
        model.add(layers.Dense(units=hp.Int('units_' + str(i),
                                            min_value=32,
                                            max_value=512,
                                            step=32),
                               activation='relu'))
        if regularization in ('batch', 'both'):
            model.add(layers.BatchNormalization())
        if regularization in ('dropout', 'both'):
            model.add(layers.Dropout(0.2))
    # Ten output units, one per bacteria species.
    model.add(layers.Dense(10, activation="softmax"))

    optimizer = hp.Choice(name="optimizer", values=["rmsprop", "adam"])
    model.compile(
        optimizer=optimizer,
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"])
    return model

In [None]:
tuner = BayesianOptimization(
    build_model,
    objective="val_accuracy",
    max_trials=10,
    executions_per_trial=2,
    overwrite=True,
)
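
These tuner settings imply a sizable training budget, which helps explain the long run time mentioned in the introduction. The count below is a rough estimate. It assumes every fit runs the full 100 epochs that this notebook passes to `tuner.search` and `model.fit`.

```python
max_trials = 10            # hyperparameter configurations tried by the tuner
executions_per_trial = 2   # each configuration is trained this many times
epochs_per_fit = 100       # epochs used for each training run in this notebook

search_fits = max_trials * executions_per_trial
search_epochs = search_fits * epochs_per_fit
# The best configuration is retrained once more after the search.
total_epochs = search_epochs + epochs_per_fit

print(search_fits, search_epochs, total_epochs)
```

Lowering any of the three factors shrinks the budget proportionally, so they are the first settings to revisit when the run time is a concern.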

In [None]:
tuner.search(X_train, y_train, validation_data=(X_val, y_val), epochs=100)
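
One possible way to shorten the search is early stopping: a run ends once the validation loss has stopped improving for a set number of epochs. Keras provides this as the `keras.callbacks.EarlyStopping` callback, which can be passed through the `callbacks` argument. The pure-Python sketch below only illustrates the stopping rule. The function `stopping_epoch` and the loss values are made up for the example.

```python
def stopping_epoch(val_losses, patience=5):
    """Return the 1-based epoch at which training stops.

    Training stops after `patience` consecutive epochs without a new best
    validation loss; if that never happens, all epochs are used.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0   # new best: reset the counter
        else:
            wait += 1              # no improvement this epoch
            if wait >= patience:
                return epoch
    return len(val_losses)

losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.65, 0.63]
stopped = stopping_epoch(losses, patience=3)
print(stopped)
```

In this example the best loss occurs at epoch 3, and three epochs without improvement end the run at epoch 6 instead of epoch 7. On a 100-epoch schedule the saving can be much larger, although how much it helps here would need to be measured.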

In [None]:
best_hp = tuner.get_best_hyperparameters()[0]
model = build_model(best_hp)

In [None]:
history = model.fit(X_train,
                    y_train,
                    epochs=100,
                    batch_size=512,
                    validation_data=(X_val, y_val))

In [None]:
model.summary()

<a id="6.3"></a>
## 6.3 Checking the Model

In [None]:
history_dict = history.history
loss_values = history_dict["loss"]
val_loss_values = history_dict["val_loss"]
epochs = range(1,len(loss_values)+1)
plt.plot(epochs, loss_values, "bo",label="Training loss")
plt.plot(epochs, val_loss_values, "b",label="Validation loss")
plt.title("Training and validation loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()
plt.show()

In [None]:
plt.clf()
acc = history_dict["accuracy"]
val_acc = history_dict["val_accuracy"]
plt.plot(epochs, acc, "bo",label="Training acc")
plt.plot(epochs, val_acc, "b",label="Validation acc")
plt.title("Training and validation accuracy")
plt.xlabel("Epochs")
plt.ylabel("Accuracy")
plt.legend()
plt.show()

---
<a id="7"></a>
# **7. Submission**

In [None]:
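X_test = test_new.copy()
X_test.drop(columns=['row_id', 'target', 'target_num'], inplace=True)

In [None]:
# Reuse the scaler fitted on the training features. Re-fitting it on the test
# set would scale train and test data with different statistics.
X_test = pd.DataFrame(scale.transform(X_test), columns=X_test.columns)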

In [None]:
predictions = model.predict(X_test)

In [None]:
max_predictions = np.argmax(predictions, axis=1)

In [None]:
bacteria = le.inverse_transform(max_predictions)
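
The last two cells turn each row of class probabilities into a species name: take the index of the largest probability, then look up the name for that index. The NumPy sketch below uses an invented 3-class probability matrix. The class names are placeholders, not the competition's species.

```python
import numpy as np

classes = np.array(["species_a", "species_b", "species_c"])  # hypothetical names
probs = np.array([[0.1, 0.7, 0.2],
                  [0.8, 0.1, 0.1],
                  [0.2, 0.3, 0.5]])

# Index of the most probable class in each row.
pred_idx = probs.argmax(axis=1)
# Map the indices back to class names.
pred_names = classes[pred_idx]
print(pred_idx, pred_names)
```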

In [None]:
submission = pd.DataFrame({"row_id": row_id, "target": bacteria})
submission.to_csv("submission.csv", index=False)
print("Your submission was successfully saved!")

---
<a id="8"></a>
# **8. References**

Many thanks to the authors of the following notebooks.  
https://www.kaggle.com/ambrosm/tpsfeb22-01-eda-which-makes-sense  
https://www.kaggle.com/odins0n/tps-feb-22-eda-modelling  
https://www.kaggle.com/maxencefzr/tps-feb22-eda-extratrees  
