This notebook was prepared by Adrian Hajdukiewicz.

# Kaggle ML Competition: Tabular Playground Series - May 2021<hr>

* Setup Imports and Variables
* Exploratory Data Analysis (EDA)
* Data cleaning
 * Second dataset - balanced sample
* Models
 * Linear regression
 * Overfitting / underfitting - KNN
 * KNN
 * KNN with auto tuning
 * XGBoost
 * Logistic Regression - automation
* Submissions/Final models
 * Logistic Regression - 4 models - unbalanced samples
 * Logistic Regression, balanced sample
 * Logistic Regression - whole dataset + logs of variables + normalized data
 * Logistic Regression - dummy variables for each feature 
* Next steps

<hr>

Models were compared using the F1 score (the harmonic mean of precision and recall) and the Kaggle score (log loss, computed by Kaggle after submission).
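
As a reminder of what the F1 metric measures, here is a minimal pure-Python sketch of macro-averaged F1 (the averaging used later in this notebook). The helper names `f1_per_class` and `macro_f1` and the toy labels are assumptions made for illustration; the notebook itself relies on `sklearn.metrics.f1_score`.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 for one class: harmonic mean of precision and recall."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_per_class(y_true, y_pred, label) for label in labels) / len(labels)


# toy labels, invented for illustration only
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 2, 2, 2, 3, 1]
print(round(macro_f1(y_true, y_pred), 4))
```

Because the macro average weights every class equally, a weak score on a rare class pulls the overall number down as much as a weak score on a frequent one.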

Types of models used in this notebook:<br>
1. Linear regression <font color="red">(bad model)</font><br>
2. Logistic regression<br>
3. KNN<br>
4. XGBoost<br>
<br>

# Setup Imports and Variables<hr>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import skew
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm
import seaborn as sns
from sklearn import preprocessing
from copy import deepcopy
import statistics as stat
from xgboost import XGBRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier


# own utility script
import roc_conf_matrix2 as conf_m

# set charts size
sns.set(rc={'figure.figsize':(9.36, 8.27)}) 

# iterations in Logistic regression - max
# this should be at least around 200-300
max_iterations = 800

# Paths
test_path = '../input/tabular-playground-series-may-2021/test.csv'
train_path = '../input/tabular-playground-series-may-2021/train.csv'
sample_submission_path = '../input/tabular-playground-series-may-2021/sample_submission.csv'

# Read data
test_data = pd.read_csv(test_path)
train_data = pd.read_csv(train_path)

# Exploratory Data Analysis (EDA)<hr>

In [None]:
train_data.drop(columns=['id']).describe().T.style.background_gradient(cmap='Greens', axis=0)

Based on the table above, I have to **normalize the data if** I assume the features are **continuous variables** (because their means, minimum and maximum values differ).

In [None]:
train_data.info()

In [None]:
# list of unique values for each column
for column in train_data:
    print(column, train_data[column].unique())

No NaN values (or other problematic cases) so **no imputation required**.

In [None]:
# 'target' is a string column, so restrict the calculation to numeric columns
train_data.skew(numeric_only=True)

**Due to positive skewness**, I will **create an additional dataset with logs of variables** (assuming they are continuous variables).
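
To illustrate why a log transform helps with positive skewness, here is a small sketch on an invented, count-like feature. The function `sample_skewness` and the toy values are assumptions for illustration, and `np.log1p` is used here only because the toy data contain zeros; the notebook's own transform (further down) is `np.log(x + 10)` on standardized data.

```python
import numpy as np


def sample_skewness(values):
    """Biased sample skewness: third central moment divided by variance ** 1.5."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5


# toy count-like feature with a long right tail (invented for illustration)
feature = np.array([0, 0, 0, 1, 1, 2, 3, 5, 12, 40])

skew_raw = sample_skewness(feature)
skew_log = sample_skewness(np.log1p(feature))  # log(1 + x) keeps zeros finite
print(skew_raw, skew_log)
```

Note that pandas' `skew()` applies a small-sample correction, so its values differ slightly from this biased estimator, but the direction of the effect is the same.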

In [None]:
sns.countplot(x=train_data['target'], palette='winter')
plt.show()

In [None]:
train_data.groupby('target').id.count()

The sample is imbalanced, which is why predictions favour class 2 and class 3.

# Data cleaning<hr>

Assumptions of my basic plan:<br><br>
1. First I will treat **every variable as continuous**.<br>
2. Alternatively, I will try to create **dummy variables for features with 5 or fewer unique values** (to treat them as categorical ones). <br>
3. Eventually I will create **a dataset with dummy variables for each feature**.

It turned out that the best solution was to treat the variables as categorical rather than continuous and to create dummy variables for each category (the steps that led to this conclusion are not included in the notebook).

In [None]:
x_temp = train_data.drop(columns=['target', 'id'])

# for dummy variables
x_temp_dum = pd.get_dummies(deepcopy(x_temp), columns=x_temp.columns, prefix=x_temp.columns)

scaler = preprocessing.StandardScaler().fit(x_temp)
x_temp = pd.DataFrame(scaler.transform(x_temp), columns = x_temp.columns)

#y = train_data['target'].replace(to_replace = 'Class_2', value = '')
y_temp = pd.to_numeric(train_data['target'].str.replace('Class_', ''))

x_train, x_test, y_train, y_test = train_test_split(x_temp, y_temp,
                                                    train_size=0.8, test_size=0.2)

# separate train/test split of the dummy-variable dataset
x_train_dum, x_test_dum, y_train_dum, y_test_dum = train_test_split(x_temp_dum, y_temp,
                                                                    train_size=0.8, test_size=0.2)

># Second dataset - balanced sample

In [None]:
# pass the number of observations to take from each class (the last rows of each class are used)
def get_unbalanced_data(no_observations):
    class1_data = train_data[train_data['target']=='Class_1']
    class2_data = train_data[train_data['target']=='Class_2']
    class3_data = train_data[train_data['target']=='Class_3']
    class4_data = train_data[train_data['target']=='Class_4']

    datasets_by_classes = [class1_data, class2_data, class3_data, class4_data]
    
    list_balanced_data = []
    for count, dataset in enumerate(datasets_by_classes):
        list_balanced_data.append(dataset.tail(no_observations[count]))

    balanced_dataset = pd.concat(list_balanced_data)
    x_balanced_temp = balanced_dataset.drop(columns=['id', 'target'])
    y_balanced_temp = pd.to_numeric(balanced_dataset['target'].str.replace('Class_', ''))
    return x_balanced_temp, y_balanced_temp

x_balanced, y_balanced = get_unbalanced_data([8000, 8000, 8000, 8000])

In [None]:
# Define the colormap
colors = sns.diverging_palette(150, 275, s=80, l=55, n=9, as_cmap=True)
 
# Create heatmap of correlation between features
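# 'target' is a string column, so it is dropped before computing correlations
sns.heatmap(train_data.drop(columns=['target']).corr(), center=0, cmap=colors, robust=True)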
plt.show()

There should not be any problems with collinearity.
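
Reading a heatmap by eye can miss a single strongly correlated pair, so a numeric check may be a useful complement. The sketch below uses a synthetic feature matrix (an assumption, not the competition data) with one deliberately collinear pair and reports the largest absolute off-diagonal correlation.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic stand-in for the feature matrix (rows = observations, columns = features)
features = rng.normal(size=(500, 4))
# make column 3 almost a copy of column 0 to create a collinear pair on purpose
features[:, 3] = features[:, 0] + 0.05 * rng.normal(size=500)

corr = np.corrcoef(features, rowvar=False)
upper = np.triu(np.abs(corr), k=1)  # keep each pair once and drop the diagonal
i, j = np.unravel_index(np.argmax(upper), upper.shape)
print(f"strongest pair: features {i} and {j}, |r| = {upper[i, j]:.3f}")
```

Applied to the real data, the same idea would start from the correlation matrix of the feature columns; a maximum well below 1 would support the conclusion drawn from the heatmap.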

In [None]:
# Show correlation with target
y_classes = pd.get_dummies(y_temp, prefix="class")
# drop the string 'target' column; the class dummies take its place
plot_corr_data = train_data.drop(columns=['target']).join(y_classes)

sns.heatmap(plot_corr_data.corr(), center=0, cmap=colors, robust=True)
plt.show()

The linear correlation between classes and features is low, which is why I will try to create dummies for each feature to build a better model.

# Models <hr>

># Linear regression

In [None]:
lin_reg = LinearRegression()

lin_reg.fit(x_train_dum, y_train_dum)
predictions_lin_reg = lin_reg.predict(x_test_dum)

conf_m.conf_m_report_w_labels(y_test_dum, predictions_lin_reg, 'Linear Regression 1')

As the model results show, linear regression is not a good model for classification because its predictions are unbounded and may fall outside the valid range of the target. A much better solution is a probit or logit model if we still want a specification that is linear in the variables (although estimation is of course no longer linear in that case).<br><br>

I will not run tests for heteroskedasticity (White test), functional form (RESET test), etc., because the model is unsuitable from the start. 
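
The boundedness problem can be seen on a tiny invented example with a binary (0/1) target. This is only a sketch: the data are made up, and the logistic curve below is not a fitted logit model, it simply passes the same fitted line through the logistic function to show that the output stays inside (0, 1).

```python
import numpy as np

# toy binary target, invented for illustration
x = np.arange(10, dtype=float)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)

# ordinary least squares line fitted directly to the 0/1 target
slope, intercept = np.polyfit(x, y, 1)
linear_pred = slope * x + intercept

# the logistic function maps any real number into the open interval (0, 1)
logistic_pred = 1.0 / (1.0 + np.exp(-(slope * x + intercept)))

print(linear_pred.min(), linear_pred.max())      # goes below 0 and above 1
print(logistic_pred.min(), logistic_pred.max())  # stays strictly between 0 and 1
```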


># Overfitting / underfitting - KNN

In general, the models perform similarly on the test and train data (these tests are not included in the notebook), so there is no need to tune them.<br><br>
**For KNN, however, we need to select a proper number of neighbors.**

In [None]:
KNN_test_F1 = {}
KNN_train_F1 = {}

for k in range(1, 14):
    KNN = KNeighborsClassifier(n_neighbors=k, algorithm='brute')
    KNN.fit(x_train_dum, y_train_dum)
    knn_y_train_pred = KNN.predict(x_train_dum)
    knn_y_test_pred = KNN.predict(x_test_dum)

    # training avg macro F1
    KNN_train_F1[k] = f1_score(y_train_dum, knn_y_train_pred, average='macro')

    # testing avg macro F1
    KNN_test_F1[k] = f1_score(y_test_dum, knn_y_test_pred, average='macro')

k_values = list(KNN_test_F1.keys())
plt.plot(k_values, list(KNN_test_F1.values()), label='Testing avg macro F1')
plt.plot(k_values, list(KNN_train_F1.values()), label='Training avg macro F1')
plt.legend()
plt.title('F1 (avg macro) vs. K Value')
plt.xlabel('K')
plt.ylabel('F1 (avg macro)')
plt.show()

As seen above, K equal to 4 is the best option: it minimizes the overfitting effect while maximizing macro-averaged F1 (I am trying to tune the model for better results in each class).

># KNN

In [None]:
KNN = KNeighborsClassifier(n_neighbors=4, algorithm='brute')
KNN.fit(x_train_dum, y_train_dum)
predictionsKNN1 = KNN.predict(x_test_dum)

conf_m.conf_m_report_w_labels(y_test_dum, predictionsKNN1, 'KNN; model 1')

># KNN with auto tuning

Using the original features (standardized, without dummy variables).

**Decrease the train dataset size, especially since the execution time of KNN does not grow linearly with the size of the dataset.**

In [None]:
x_train_CV, x_test_CV, y_train_CV, y_test_CV = train_test_split(x_temp, y_temp, 
                                                                train_size=0.4, test_size=0.6)

In [None]:
kn = KNeighborsClassifier()
params = {
    'n_neighbors' : list(range(1, 30, 6)), # max val among features is 66
    'weights': ['uniform', 'distance'],
    'algorithm': ['ball_tree', 'kd_tree', 'brute']
}
grid_kn = GridSearchCV(estimator = kn,
                        param_grid = params,
                        scoring = 'f1_macro', 
                        cv = 3, 
                        verbose = 1,
                        n_jobs = -1)
grid_kn.fit(x_train_CV, y_train_CV)

# best estimator
print(grid_kn.best_estimator_)
preds_knn_grid = grid_kn.predict(x_train_CV)

Train dataset - confusion matrix:

In [None]:
conf_m.conf_m_report_w_labels(y_train_CV, preds_knn_grid, 'KNN, auto tune, train data')

Test dataset - confusion matrix:

In [None]:
preds_knn_grid_test = grid_kn.predict(x_test_CV)
conf_m.conf_m_report_w_labels(y_test_CV, preds_knn_grid_test, 'KNN, auto tune, test data')

Comparing the results on the train and test datasets, we may suspect noticeable **overfitting**. The earlier test suggested that k=4 may minimize that effect. <br>The model therefore needs improvement, but KNN did not perform well in submissions, so I will not develop KNN models any further.

># XGBoost

In [None]:
XGB1 = XGBRegressor()
XGB1.fit(x_train_dum, y_train_dum)
predictionsXGB1 = XGB1.predict(x_test_dum)

conf_m.conf_m_report_no_labels(y_test_dum, predictionsXGB1, 'XGB; model 1')

># Logistic Regression - automation

Code to create models faster:

In [None]:
def get_model_lr(x, y, set_max_iter=max_iterations, set_class_weight=None, set_warm_start=False):
    lr_balanced1 = LogisticRegression(solver="newton-cg", C=1, multi_class="ovr", max_iter=set_max_iter, n_jobs=-1,
                                     warm_start=set_warm_start, class_weight=set_class_weight)
    lr_balanced1.fit(x,y)
    return lr_balanced1

# Submissions/Final models<hr>

KNN and XGB models were excluded because the probabilities they produced were not accurate enough.

># Logistic Regression - 4 models - unbalanced samples

Four models, one per class.<br>
Each one is fitted on a sample in which a specific class prevails.<br>
Each model is used to predict the probability of only that class (the one that dominates its sample).

In [None]:
# create LR model where class 1 prevails:    (sample: [class1: 8000, class2: 5000, class3: 5000, class4: 5000] )
x,y = get_unbalanced_data([8000, 5000, 5000, 5000])
lr_bal_1 = get_model_lr(x, y)

x,y = get_unbalanced_data([5000, 8000, 5000, 5000])
lr_bal_2 = get_model_lr(x, y)

x,y = get_unbalanced_data([5000, 5000, 8000, 5000])
lr_bal_3 = get_model_lr(x, y)

x,y = get_unbalanced_data([5000, 5000, 5000, 8000])
lr_bal_4 = get_model_lr(x, y)

In [None]:
models = [lr_bal_1, lr_bal_2, lr_bal_3, lr_bal_4]
final_submission_lr = test_data['id'].to_frame()
cols_submission = ['Class_1', 'Class_2', 'Class_3', 'Class_4']
test_data_to_pred = test_data.drop(columns=['id'])

for count, model in enumerate(models):
    # take only the column of the class this model was built for
    final_submission_lr[cols_submission[count]] = model.predict_proba(test_data_to_pred)[:, count]

In [None]:
final_submission_lr.to_csv('submission_lr_4.csv', index=False)

Kaggle public score: 1.37537
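
Because each column of this submission comes from a different model, the four probabilities in a row are not guaranteed to sum to 1. The sketch below shows one possible way to renormalize such rows; it is not part of the original pipeline, and the array `raw` holds hypothetical numbers invented for illustration.

```python
import numpy as np

# hypothetical per-class probabilities, one column per separately trained model
raw = np.array([
    [0.40, 0.50, 0.45, 0.10],
    [0.05, 0.70, 0.30, 0.15],
])

row_sums = raw.sum(axis=1, keepdims=True)
normalized = raw / row_sums  # every row now sums to 1

print(raw.sum(axis=1))
print(normalized.sum(axis=1))
```

Dividing by the row sum keeps the ranking of the classes within each row and only rescales the values.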

># Logistic Regression, balanced sample

Model based on balanced sample (8490 samples of each class - this is the max possible number of samples to have them balanced).
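
The largest possible balanced sample is limited by the rarest class. A minimal sketch of how that number can be derived, using a toy frame invented for illustration (the name `max_balanced_size` is an assumption, not a variable from this notebook):

```python
import pandas as pd

# toy data, invented for illustration
toy = pd.DataFrame({
    'target': ['Class_1'] * 2 + ['Class_2'] * 5 + ['Class_3'] * 3 + ['Class_4'] * 2,
    'feature_0': range(12),
})

per_class = toy['target'].value_counts()
max_balanced_size = int(per_class.min())  # limited by the rarest class

# take the last max_balanced_size rows of every class, as tail() does in get_unbalanced_data
balanced = toy.groupby('target').tail(max_balanced_size)
print(max_balanced_size, balanced['target'].value_counts().to_dict())
```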

In [None]:
# Get model
x,y = get_unbalanced_data([8490, 8490, 8490, 8490])

lr_balanced_8490 = get_model_lr(x, y)
test_data_to_pred = test_data.drop(columns=['id'])
cols_submission = ['Class_1', 'Class_2', 'Class_3', 'Class_4']

preds_lr_8490 = lr_balanced_8490.predict_proba(test_data_to_pred)

preds_lr_8490_df = pd.DataFrame(preds_lr_8490, columns=cols_submission)
submission_lr_balanced_8490 = test_data['id'].to_frame().join(preds_lr_8490_df)
submission_lr_balanced_8490.to_csv('submission_lr_balanced_8490.csv', index=False)

Kaggle public score: 1.37538
<br>So balancing the sample did not help.
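
For context on the scores above, it may help to know the log loss of a prediction that carries no information: assigning 0.25 to each of the four classes gives ln(4), roughly 1.386, whatever the true labels are. The sketch below computes it by hand; `multiclass_log_loss` is a hypothetical helper and does not reproduce any clipping or rescaling that Kaggle may apply.

```python
import math


def multiclass_log_loss(true_class_indices, probabilities):
    """Mean negative log of the probability assigned to the true class."""
    total = 0.0
    for true_idx, row in zip(true_class_indices, probabilities):
        total -= math.log(row[true_idx])
    return total / len(true_class_indices)


# any set of true labels gives the same score under a uniform prediction
true_idx = [0, 1, 1, 2, 3]
uniform = [[0.25, 0.25, 0.25, 0.25] for _ in true_idx]

baseline = multiclass_log_loss(true_idx, uniform)
print(round(baseline, 5))
```

A score close to this baseline suggests the submitted probabilities are only slightly more informative than a uniform guess.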

># Logistic Regression - whole dataset + logs of variables + normalized data

In [None]:
lr_whole_dataset_log = get_model_lr(np.log(x_temp+10), y_temp)

test_data_to_pred = test_data.drop(columns=['id'])
test_data_to_pred_normal = pd.DataFrame(scaler.transform(test_data_to_pred), 
                                        columns = test_data_to_pred.columns)
test_data_to_pred_log_norm = np.log(
    test_data_to_pred_normal +10 )

cols_submission = ['Class_1', 'Class_2', 'Class_3', 'Class_4']
preds_lr_log = lr_whole_dataset_log.predict_proba(test_data_to_pred_log_norm)
preds_lr_log_df = pd.DataFrame(preds_lr_log, columns=cols_submission)
submission_lr_log_df = test_data['id'].to_frame().join(preds_lr_log_df)

submission_lr_log_df.to_csv('submission_lr_log_df_normal.csv', index=False)

Kaggle public score: 1.10090

># Logistic Regression - dummy variables for each feature

Create dummy variables for each feature in whole training dataset and train a logistic regression model using it.

1. Prepare data for final prediction (to make a submission)<br>
(generate dummies, same as in the training set):

In [None]:
x_test_data_temp = test_data.drop(columns=['id'])

# dummy variables
x_test_data_dum = pd.get_dummies(deepcopy(x_test_data_temp), 
                                 columns=x_test_data_temp.columns, 
                                 prefix=x_test_data_temp.columns)

new_submission_data = test_data['id'].to_frame()

# for each column in the data used to train the models (the prediction dataset must have the same columns)
for col_submission_data in x_temp_dum.columns:
    # if the column appears in the submission data:
    if col_submission_data in list(x_test_data_dum.columns):
        new_submission_data[col_submission_data] = x_test_data_dum[col_submission_data].values
    else:
        new_submission_data[col_submission_data] = 0

new_submission_data_no_id = new_submission_data.drop(columns=['id'])
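
The column alignment done by the loop above can also be sketched with `DataFrame.reindex`, which keeps exactly the training columns, in training order, and fills the missing ones with 0. The toy frames and their names are assumptions made for illustration; this is an alternative sketch, not the code used for the submission.

```python
import pandas as pd

# toy train/test frames, invented for illustration
train_toy = pd.DataFrame({'feature_0': [0, 1, 2], 'feature_1': [5, 5, 7]})
test_toy = pd.DataFrame({'feature_0': [0, 3], 'feature_1': [7, 7]})

train_dum = pd.get_dummies(train_toy, columns=list(train_toy.columns))
test_dum = pd.get_dummies(test_toy, columns=list(test_toy.columns))

# keep exactly the training columns, in training order; absent ones become 0
test_aligned = test_dum.reindex(columns=train_dum.columns, fill_value=0)
print(list(test_aligned.columns))
```

As in the loop, a value that appears only in the test data (here `feature_0 == 3`) has no matching training column and is dropped.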

2. Train the model, make predictions, upload the results: <br>
(be aware that logistic regression with so many features takes a long time to train; luckily, making predictions is just simple arithmetic)

In [None]:
lr_dummies = get_model_lr(x_temp_dum, y_temp, 400)
preds_lr_dummies = lr_dummies.predict_proba(new_submission_data_no_id)

cols_submission = ['Class_1', 'Class_2', 'Class_3', 'Class_4']
preds_lr_dummies_df = pd.DataFrame(preds_lr_dummies, columns=cols_submission)
submission_lr_dummies_df = test_data['id'].to_frame().join(preds_lr_dummies_df)

submission_lr_dummies_df.to_csv('submission_lr_dummies_df.csv', index=False)

**Kaggle public score: 1.09789 <font color="green">(BEST MODEL)</font>**

# Next steps<hr>

I would expect better performance in predicting class 1 and class 4, but using a balanced sample did not help, so other approaches may be considered.
<br><br>

Another option is to delete outliers (e.g. using Least Trimmed Squares or quantile regression, though these models may not fit well) and verify whether it helps.<br><br>