# About this notebook

As the data for this challenge are provided without context, I find it difficult to guess which models may suit the problem at hand and which ones are unlikely to work well.

The goal here is to take a "brute force" approach to this question by systematically testing 8 classification models with minimal hyperparameter optimization.

In this notebook, we will:
1. Preprocess data: encoding, scaling, train/val split
2. Train and test 8 models
3. Choose the most promising one and submit the result

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
import pandas as pd
import os
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Preprocessing data

1. Data loading and overview
2. Categorical values encoding
3. Feature scaling
4. Train/val split

In [None]:
train = pd.read_csv("../input/tabular-playground-series-mar-2021/train.csv")
test = pd.read_csv("../input/tabular-playground-series-mar-2021/test.csv")

#Features
X = train.drop(["id", "target"], axis=1)
X_test = test.drop(["id"], axis=1)
X_all = pd.concat([X, X_test], axis=0)

#Label
y = train.target

In [None]:
#Data overview
from pandas_profiling.profile_report import ProfileReport
ProfileReport(X_all)

In [None]:
#Encoding categorical data

#List of categorical col
list_cat = [col for col in X.columns if col.startswith("cat")]


le = LabelEncoder()

for col in list_cat:
    X_all[col] = le.fit_transform(X_all[col])
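
A small sketch of what the encoding step does, on a hypothetical toy frame (the column values below are made up for illustration). `LabelEncoder` sorts the distinct values of a column and numbers them from 0; `OrdinalEncoder` follows the same convention but handles several columns in one call, which may be a convenient alternative to the loop.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical toy frame standing in for the "cat*" columns.
toy = pd.DataFrame({"cat0": ["B", "A", "C", "A"], "cat1": ["x", "y", "x", "x"]})

# LabelEncoder: one column at a time, classes are sorted before numbering.
le_codes = pd.DataFrame(
    {col: LabelEncoder().fit_transform(toy[col]) for col in toy.columns}
)

# OrdinalEncoder: all columns in one call, same sorted-category convention.
oe_codes = pd.DataFrame(
    OrdinalEncoder().fit_transform(toy), columns=toy.columns
).astype(int)

print(le_codes["cat0"].tolist())  # [1, 0, 2, 0]
print((le_codes.values == oe_codes.values).all())
```

Either way, the resulting integers impose an arbitrary order on the categories, which tree-based models tolerate better than linear or distance-based ones.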
   

In [None]:
#Feature scaling

scaler = StandardScaler().fit(X_all)
X_all = pd.DataFrame(columns = X_all.columns,
                            data = scaler.transform(X_all))

X_all.head()
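
The cell above fits the scaler on the train and test features together. A possible variant, sketched here on hypothetical random data, fits `StandardScaler` on the training rows only and reuses the learned mean and standard deviation for the test rows, so that no test statistics enter the preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the train and test features.
rng = np.random.default_rng(0)
cols = ["cont0", "cont1"]
toy_train = pd.DataFrame(rng.normal(5.0, 2.0, size=(200, 2)), columns=cols)
toy_test = pd.DataFrame(rng.normal(5.0, 2.0, size=(50, 2)), columns=cols)

# Learn mean and standard deviation from the training rows only.
scaler_toy = StandardScaler().fit(toy_train)

# Apply the same transformation to both sets, keeping column names and index.
toy_train_scaled = pd.DataFrame(
    scaler_toy.transform(toy_train), columns=cols, index=toy_train.index
)
toy_test_scaled = pd.DataFrame(
    scaler_toy.transform(toy_test), columns=cols, index=toy_test.index
)

print(toy_train_scaled.mean().round(6).tolist())
print(toy_test_scaled.shape)
```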

In [None]:
#Train, val, test split

X = X_all.iloc[:len(train), :]
X_test = X_all.iloc[len(train):, :]

#To save time we keep only a random subset of the training set
sample_size = 0.1
train_sample = pd.concat([X, y], axis = 1)
train_sample = train_sample.sample(frac = sample_size, random_state = 0)

# Train/val split
train_size = 0.8
train_set = train_sample.iloc[:int(len(train_sample) * train_size), :]
val_set = train_sample.iloc[int(len(train_sample) * train_size):, :]


Xtrain = train_set.drop(labels = ['target'], axis = 1)
ytrain = train_set.target

Xval = val_set.drop(labels = ['target'], axis = 1)
yval = val_set.target
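
The manual slicing above works because `sample` already shuffles the rows. As a sketch of an alternative, `train_test_split` from sklearn can do the same 80/20 split and, with `stratify`, keep the class ratio identical in both parts. The toy data and the `toy_*` names below are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the sampled training data.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["cont0", "cont1", "cont2"])
toy_y = pd.Series(rng.random(1000) < 0.25, name="target").astype(int)

# 80/20 split that preserves the proportion of each class in both parts.
toy_Xtrain, toy_Xval, toy_ytrain, toy_yval = train_test_split(
    toy_X, toy_y, train_size=0.8, stratify=toy_y, random_state=0
)

print(toy_Xtrain.shape, toy_Xval.shape)
print(toy_ytrain.mean(), toy_yval.mean())
```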

# Model testing

We test a selection of classification models from sklearn without any hyperparameter optimization.

In [None]:
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, HistGradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis


from sklearn.metrics import roc_auc_score

In [None]:
classifiers = [
    SVC(kernel="linear", C=0.025),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1, max_iter=1000),
    AdaBoostClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis(),
    HistGradientBoostingClassifier(max_leaf_nodes=100, validation_fraction=None)]

for model in classifiers:
    clf = model
    clf.fit(Xtrain, ytrain)
    
    #Print model name, train set performance and val set performance
    print("Model :", str(model).split('(')[0])
    train_score = roc_auc_score(ytrain, clf.predict(Xtrain))
    print('Training set evaluation :', train_score)
    val_score = roc_auc_score(yval, clf.predict(Xval))
    print('Validation set evaluation :', val_score)

    print('')
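
The loop above passes hard class predictions to `roc_auc_score`. The metric is usually computed from continuous scores instead, so that every decision threshold contributes to the curve. The sketch below shows one way to do this on hypothetical toy data; `continuous_scores` is a helper introduced here for illustration, not part of sklearn.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def continuous_scores(clf, X):
    """Return a continuous score for the positive class (hypothetical helper)."""
    if hasattr(clf, "predict_proba"):
        return clf.predict_proba(X)[:, 1]
    # Models such as SVC without probability=True only expose a decision function.
    return clf.decision_function(X)


# Hypothetical toy data standing in for Xtrain / Xval.
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_toy, y_toy, random_state=0)

results = {}
for toy_clf in (LogisticRegression(max_iter=1000), SVC(kernel="linear", C=0.025)):
    toy_clf.fit(X_tr, y_tr)
    auc_labels = roc_auc_score(y_va, toy_clf.predict(X_va))
    auc_scores = roc_auc_score(y_va, continuous_scores(toy_clf, X_va))
    results[type(toy_clf).__name__] = (auc_labels, auc_scores)
    print(type(toy_clf).__name__, "labels:", auc_labels, "scores:", auc_scores)
```

Swapping this helper into the loop would change the printed values, so the numbers discussed below would need to be re-read after such a change.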

## Quick interpretation :

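The area under the ROC curve lies between 0 and 1, with 1 being the best possible performance and 0.5 corresponding to random guessing. Note that the scores above are computed from hard class predictions (`clf.predict`), so each one reflects a single operating point of the classifier rather than its full ROC curve.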

The performance of most classifiers is close to 0.75, except for HistGradientBoostingClassifier, which is clearly overfitting. The AdaBoostClassifier, however, reaches a slightly better result.

Let's choose this model for our first submission and see how we perform...


# Submission

In [None]:
# Training on the whole data
clf = AdaBoostClassifier()
clf.fit(X, y)

ytest = clf.predict(X_test)

#Save
result = pd.DataFrame()

result["id"] = test.id
result["target"] = ytest.flatten()

result.to_csv("submission.csv", index=False)
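
The submission above contains hard 0/1 predictions. Assuming the leaderboard is scored with ROC AUC, as the use of `roc_auc_score` in this notebook suggests, submitting the predicted probability of the positive class may score better. A minimal sketch on hypothetical toy data (the `toy_*` names and the ids are made up):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Hypothetical toy data standing in for X, y and X_test.
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
X_fit, y_fit, X_new = X_toy[:200], y_toy[:200], X_toy[200:]

toy_clf = AdaBoostClassifier(random_state=0).fit(X_fit, y_fit)

# Column 1 of predict_proba holds the probability of the positive class.
toy_proba = toy_clf.predict_proba(X_new)[:, 1]

toy_result = pd.DataFrame({"id": range(len(X_new)), "target": toy_proba})
print(toy_result.head())
```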