# Tabular Playground Series - Feb 2022
Created: 2022-02-08

* Base EDA
* Baseline Model
* Define the hypotheses
* Test the hypotheses

#### LB
* 2022-02-09: Baseline model (untuned LightGBM): 0.93223
* 2022-02-17: KNN Model: 0.97751


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import warnings
import lightgbm as lgb
from scipy import stats
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn import preprocessing
from sklearn.decomposition import PCA
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.metrics import SparseCategoricalAccuracy
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN, KMeans
from sklearn.neighbors import KNeighborsClassifier


warnings.filterwarnings(action='ignore')
tf.debugging.set_log_device_placement(True)

## EDA for the baseline submission

In [None]:
train = pd.read_csv("/kaggle/input/tabular-playground-series-feb-2022/train.csv")
test = pd.read_csv("/kaggle/input/tabular-playground-series-feb-2022/test.csv")
sample_submission = pd.read_csv("/kaggle/input/tabular-playground-series-feb-2022/sample_submission.csv")

In [None]:
for df in [train, test, sample_submission]:
    print(df.shape)

In [None]:
train.head()

In [None]:
sample_submission.head()

#### How many features are there?
-> 286 (excluding row_id and target)

In [None]:
train.shape

#### What is the distribution of targets?
-> The 10 classes are evenly distributed, at 10% each.

In [None]:
train["target"].value_counts()

In [None]:
train["target"].value_counts(normalize=True)

#### Are there features with missing values?
-> None.

In [None]:
train.isnull().sum().values

#### What are the data types of the features?
-> All 286 feature columns are float.

In [None]:
train.info()

In [None]:
train["row_id"].dtype

In [None]:
train["target"].dtype

#### Is the distribution of train and test data the same?
-> Not for every feature: for 139 of the 286 features, a two-sample t-test between train and test gives a p-value below 0.05.

In [None]:
# Collect features whose train/test means differ at the 0.05 level.
# (A t-test compares means, not variances, hence the list name.)
feature_cols = train.columns[1:-1]
different_mean = []
for idx, col in enumerate(feature_cols):
    p_value = stats.ttest_ind(train[col], test[col]).pvalue
    if p_value < 0.05:
        print(f"[{idx+1}]{col}'s p-value: {p_value:.3f}")
        different_mean.append(col)

print(len(different_mean))
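
The t-test above only compares means, so two samples with equal means but different spread would pass it. As a rough sketch of a complementary check (a swapped-in technique, not something this notebook ran), the two-sample Kolmogorov-Smirnov test compares the whole empirical distribution. The arrays `train_feature` and `test_feature` below are synthetic stand-ins, not competition data.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for one train/test feature (hypothetical data):
# both samples are symmetric around 0, but the second is three times as spread out.
quantiles = np.linspace(0.001, 0.999, 2000)
train_feature = stats.norm.ppf(quantiles)
test_feature = 3.0 * train_feature

# The t-test compares means only, so it cannot see the difference in spread.
t_p = stats.ttest_ind(train_feature, test_feature).pvalue

# The two-sample KS test compares the full empirical distributions.
ks_p = stats.ks_2samp(train_feature, test_feature).pvalue

print(f"t-test p-value: {t_p:.3f}")
print(f"KS test p-value: {ks_p:.3g}")
```

On the real data, the same two calls could be applied per column inside the loop above.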

## Baseline Model: Untuned LightGBM

In [None]:
%%time

# K-Fold Cross validation
kfold = KFold(n_splits=5, random_state=0, shuffle=True)


# Baseline Model
def get_model(model, x, y, idx_df):   
    train_idx, test_idx = idx_df
    train_x, test_x = x.iloc[train_idx], x.iloc[test_idx]
    train_y, test_y = y.iloc[train_idx], y.iloc[test_idx]

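    print("train start")
    # LightGBM 4.0 removed early_stopping_rounds/verbose from fit();
    # the callbacks below are the equivalent and also work on 3.3+.
    model.fit(
        train_x,
        train_y,
        eval_set=[(test_x, test_y)],
        eval_metric="multi_logloss",
        callbacks=[
            lgb.early_stopping(stopping_rounds=30),
            lgb.log_evaluation(period=1),
        ],
    )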
    
    pred = model.predict(test_x)
    print(f"accuracy: {accuracy_score(test_y, pred):.2f}")
    return model


# Plot validation logloss
def plot_eval(model, model_name=""):
    # evals_result_ is the public attribute holding the recorded validation metrics
    pd.DataFrame(model.evals_result_["valid_0"].values()).T.plot(
        title=f"{model_name} logloss line plot",
        xlabel="Rounds",
        ylabel="Logloss",
        grid=True,
        legend=False
    )

clf = lgb.LGBMClassifier(
    objective="multiclass",
    learning_rate=0.05,
    n_estimators=1000,
)

In [None]:
model_1 = get_model(clf, train[train.columns[1:-1]], train["target"], list(kfold.split(train[train.columns[1:-1]]))[0])
# model_2 = get_model(clf, train[train.columns[1:-1]], train["target"], list(kfold.split(train[train.columns[1:-1]]))[1])
# model_3 = get_model(clf, train[train.columns[1:-1]], train["target"], list(kfold.split(train[train.columns[1:-1]]))[2])
# model_4 = get_model(clf, train[train.columns[1:-1]], train["target"], list(kfold.split(train[train.columns[1:-1]]))[3])
# model_5 = get_model(clf, train[train.columns[1:-1]], train["target"], list(kfold.split(train[train.columns[1:-1]]))[4])
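
One caveat with the commented-out lines above: each call would refit the same `clf` object, so `model_1` through `model_5` would all point at the last fitted model. A minimal sketch of a per-fold loop that avoids this, assuming synthetic data and a placeholder `LogisticRegression` in place of the LightGBM classifier (both are illustrative choices, not the notebook's setup):

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Synthetic stand-in data (hypothetical, not the competition files).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=10, n_classes=3, random_state=0
)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
base_model = LogisticRegression(max_iter=1000)  # placeholder for any estimator

fold_models, fold_scores = [], []
for train_idx, valid_idx in kfold.split(X):
    model = clone(base_model)  # fresh, unfitted copy for each fold
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[valid_idx])
    fold_models.append(model)
    fold_scores.append(accuracy_score(y[valid_idx], pred))

print(f"mean accuracy over folds: {np.mean(fold_scores):.3f}")
```

`clone` copies the hyperparameters but not the fitted state, so every entry of `fold_models` is an independent model.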

In [None]:
plot_eval(model_1)

#### Feature importance
Confirmed that each feature affects the model to a different degree.

In [None]:
lgb.plot_importance(model_1);

## Hypotheses & Test
* ~~If I scale the features, classification is better.~~ -> Reject
* ~~If I handle outliers, classification is better.~~ -> Reject
* ~~If I apply PCA, classification is better.~~ -> Reject
* ~~DNN is better at classifying.~~ -> Reject
* ~~If I apply an ensemble, classification is better.~~ -> Reject
* If I use KNN, classification is better. [(Reference)](https://www.kaggle.com/leehomhuang/simple-k-neighbors) -> Accept
* If I use Random Forest, classification is better.

### H1: If I scale the features, classification is better. -> Reject
Baseline accuracy: 0.93  
Standard scaling accuracy: 0.93  
Min-max scaling accuracy: 0.92
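
One possible reading of this result, offered as an assumption rather than something the notebook verifies: tree-based models split on the ordering of feature values, so a monotonic rescaling tends to change little. The sketch below illustrates the idea with a plain `DecisionTreeClassifier` on synthetic data. LightGBM bins features into histograms, so small differences like the 0.92 above can still appear.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data (hypothetical): four integer-valued features
# placed on very different scales.
base = np.column_stack([rng.permutation(200) for _ in range(4)]).astype(float)
y = (base[:, 0] + base[:, 1] > 200).astype(int)
X_raw = base * np.array([1.0, 10.0, 100.0, 1000.0])
X_scaled = StandardScaler().fit_transform(X_raw)

# Same tree settings, fitted once on raw and once on standardized features.
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_raw, y)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_scaled, y)

# Compare the training-set predictions of the two trees.
same_predictions = np.array_equal(
    tree_raw.predict(X_raw), tree_scaled.predict(X_scaled)
)
print(f"identical predictions: {same_predictions}")
```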

In [None]:
standard_scaler = preprocessing.StandardScaler()
standard_x = standard_scaler.fit_transform(train[train.columns[1:-1]])

get_model(
    clf,
    pd.DataFrame(standard_x),
    train["target"],
    list(kfold.split(standard_x))[0]
)

In [None]:
min_max_scaler = preprocessing.MinMaxScaler()
min_max_x = min_max_scaler.fit_transform(train[train.columns[1:-1]])

get_model(
    clf,
    pd.DataFrame(min_max_x),
    train["target"],
    list(kfold.split(min_max_x))[0]
)

### H1: If I handle outliers, classification is better. -> Reject
Baseline accuracy: 0.93  
Outlier-adjusted accuracy: 0.92

Confirmed that some features contain outliers.

In [None]:
train[train.columns[1:-1]].plot(kind="box", figsize=(30,10))

Clip outliers to the IQR fences, `q1 - 1.5*iqr` and `q3 + 1.5*iqr`.

In [None]:
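def replace_outlier_iqr_range(df):
    # Work on a copy so the caller's DataFrame is left untouched
    df = df.copy()
    for col in df.columns:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1

        # Clip to the fences; avoids chained assignment via df[col].loc[...]
        df[col] = df[col].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
    return df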

df_iqr_range = replace_outlier_iqr_range(train[train.columns[1:-1]])

df_iqr_range.plot(kind="box", figsize=(30,10))

In [None]:
get_model(
    clf,
    df_iqr_range,
    train["target"],
    list(kfold.split(df_iqr_range))[0]
)

### H1: If I apply PCA, classification is better. -> Reject
Baseline accuracy: 0.93  
PCA accuracy: 0.88

In [None]:
standard_scaler = preprocessing.StandardScaler()
standard_x = standard_scaler.fit_transform(train[train.columns[1:-1]])

n = 10
pca = PCA(n_components=n)
principal_components = pca.fit_transform(standard_x)
pca_df = pd.DataFrame(data=principal_components, columns=list(range(1, n+1)))

In [None]:
print(f"{sum(pca.explained_variance_ratio_)*100:.1f}% of variance explained")
pca_df.head()
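
The cell above fixes `n = 10` and then checks how much variance is retained. A small sketch of the reverse approach, assuming synthetic data and a 95% target (both are illustrative choices, not values from this notebook): passing a float between 0 and 1 as `n_components` lets scikit-learn pick the number of components needed to reach that share of variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in (hypothetical): 30 observed features driven by
# 5 latent factors plus a little noise.
latent = rng.normal(size=(1000, 5))
X = latent @ rng.normal(size=(5, 30)) + 0.1 * rng.normal(size=(1000, 30))

X_std = StandardScaler().fit_transform(X)

# A float in (0, 1) asks PCA to keep enough components to reach that variance share.
pca = PCA(n_components=0.95, svd_solver="full")
components = pca.fit_transform(X_std)

cumulative = np.cumsum(pca.explained_variance_ratio_)
print(
    f"kept {pca.n_components_} components, "
    f"{cumulative[-1] * 100:.1f}% of variance explained"
)
```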

In [None]:
get_model(
    clf,
    pca_df,
    train["target"],
    list(kfold.split(pca_df))[0]
)

### H1: DNN is better at classifying. -> Reject
- Baseline LB Score: 0.93
- DNN LB Score: 0.87

In [None]:
# KFold
train_idx, test_idx = list(kfold.split(train[train.columns[1:-1]]))[0]
train_x, test_x = train[train.columns[1:-1]].iloc[train_idx], train[train.columns[1:-1]].iloc[test_idx]
train_y, test_y = pd.get_dummies(train["target"]).iloc[train_idx], pd.get_dummies(train["target"]).iloc[test_idx]

# DNN
dnn = Sequential()
dnn.add(Dense(128, input_dim=train_x.shape[1], activation="relu"))
dnn.add(Dense(64, activation="relu"))
dnn.add(Dense(10, activation="softmax"))

dnn.compile(
    loss="categorical_crossentropy",
    optimizer="adam",
#     optimizer="rmsprop",
    metrics=["accuracy"]
)

In [None]:
# with tf.device('/GPU:0'):
history = dnn.fit(
    train_x,
    train_y,
    epochs=30,
    batch_size=32,
    validation_data=(test_x, test_y)
)

In [None]:
loss = history.history["loss"]
val_loss = history.history["val_loss"]

epochs = range(1, len(loss) + 1)

plt.plot(epochs, loss, "bo", label="Training loss")
plt.plot(epochs, val_loss, "b", label="Validation loss")
plt.title("Training and validation loss")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()

plt.show()

In [None]:
# For DNN submit
# pred_proba = dnn.predict(test[test.columns[1:]])
# test_pred = train_y.columns[pred_proba.argmax(1)]
# sub = sample_submission.copy()
# sub["target"] = test_pred
# sub.to_csv("submission.csv",index=None)

### H1: If I apply ensemble, classification is better. -> Reject
baseline LB Score: 0.93  
Ensemble LB Score: 0.91

In [None]:
baseline = get_model(clf, train[train.columns[1:-1]], train["target"], list(kfold.split(train[train.columns[1:-1]]))[1])

# Ensemble
pred_proba = (
    baseline.predict_proba(test[test.columns[1:]])
    + dnn.predict(test[test.columns[1:]])
)/2
test_pred = baseline.classes_[pred_proba.argmax(1)]

# sub = sample_submission.copy()
# sub["target"] = test_pred
# sub.to_csv("submission.csv",index=None)
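
Averaging probabilities as above only works when both models order their classes the same way. The sketch below shows that check together with an explicit weight per model, using two scikit-learn classifiers on synthetic data as stand-ins for the LightGBM and DNN models. The names `model_a`, `model_b`, and `weights` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (hypothetical, not the competition files).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=10, n_classes=3, random_state=0
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model_a = LogisticRegression(max_iter=1000).fit(X_train, y_train)
model_b = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Column-wise averaging is only meaningful if the class order matches.
assert np.array_equal(model_a.classes_, model_b.classes_)

weights = np.array([0.5, 0.5])  # illustrative; could be tuned on validation data
proba = (
    weights[0] * model_a.predict_proba(X_valid)
    + weights[1] * model_b.predict_proba(X_valid)
)
ensemble_pred = model_a.classes_[proba.argmax(axis=1)]

print(f"model_a accuracy:  {accuracy_score(y_valid, model_a.predict(X_valid)):.3f}")
print(f"model_b accuracy:  {accuracy_score(y_valid, model_b.predict(X_valid)):.3f}")
print(f"ensemble accuracy: {accuracy_score(y_valid, ensemble_pred):.3f}")
```

Comparing the three validation scores side by side shows whether the blend helps before anything is submitted.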

### H1: If I use KNN, classification is better. [(Reference)](https://www.kaggle.com/leehomhuang/simple-k-neighbors) -> Accept

Baseline LB Score: 0.93  
KNN LB Score: 0.977

In [None]:
train_idx, test_idx = list(kfold.split(train[train.columns[1:-1]]))[0]
train_x, test_x = train[train.columns[1:-1]].iloc[train_idx], train[train.columns[1:-1]].iloc[test_idx]
train_y, test_y = train["target"].iloc[train_idx], train["target"].iloc[test_idx]

knn = KNeighborsClassifier(n_neighbors=1, p=2)
knn.fit(train_x, train_y)

pred = knn.predict(test_x)
print(f"accuracy: {accuracy_score(test_y, pred):.2f}")
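
The cell above uses `n_neighbors=1` and scores a single fold. A rough sketch of how different values of `k` could be compared with cross-validation, assuming synthetic data and an arbitrary candidate list (neither comes from this notebook):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (hypothetical, not the competition files).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=10, n_classes=3, random_state=0
)

cv_accuracy = {}
for k in (1, 3, 5, 11):  # illustrative candidates
    knn = KNeighborsClassifier(n_neighbors=k, p=2)  # p=2 is Euclidean distance
    scores = cross_val_score(knn, X, y, cv=5, scoring="accuracy")
    cv_accuracy[k] = scores.mean()
    print(f"k={k}: mean accuracy {cv_accuracy[k]:.3f}")

best_k = max(cv_accuracy, key=cv_accuracy.get)
print(f"best k by mean CV accuracy: {best_k}")
```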

In [None]:
# For KNN submit
test_pred = knn.predict(test[test.columns[1:]])
sub = sample_submission.copy()
sub["target"] = test_pred
sub.to_csv("submission.csv",index=None)

### H1: If I use Random Forest, classification is better.

## Predict the test data for submission


In [None]:
# Ensemble
# pred_proba = (
#     model_1.predict_proba(test[test.columns[1:]])
#     + model_2.predict_proba(test[test.columns[1:]])
#     + model_3.predict_proba(test[test.columns[1:]])
#     + model_4.predict_proba(test[test.columns[1:]])
#     + model_5.predict_proba(test[test.columns[1:]])
# )/5
# test_pred = model_1.classes_[pred_proba.argmax(1)]

# one model
test_pred = model_1.predict(test[test.columns[1:]])

sub = sample_submission.copy()
sub["target"] = test_pred
sub.to_csv("submission.csv",index=None)