### Introduction

The dataset used for this competition is synthetic, but it is based on a real dataset and was generated using a CTGAN. The original dataset deals with predicting the biological response of molecules given various chemical properties. Although the features are anonymized, they have properties relating to real-world features.

This notebook compares the performance of several baseline models, without parameter optimization, to give everyone a feel for the Tabular Playground Series (November 2021) dataset. It can be used as a starting point by anyone. If you like this notebook, don't forget to <b>upvote</b>.

#### Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")
from sklearn import metrics
from sklearn.metrics import confusion_matrix, roc_auc_score, auc, matthews_corrcoef, f1_score 
from sklearn.model_selection import train_test_split
from sklearn import svm
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
import pickle
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing

%matplotlib inline

#### Reading Data

In [None]:
train = pd.read_csv('../input/tabular-playground-series-nov-2021/train.csv')

In [None]:
train.head()

In [None]:
train.shape

In [None]:
sns.countplot(x="target", data=train);

In [None]:
train.info()

In [None]:
train.dtypes.to_dict()

The data doesn't contain any categorical variables, so no encoding is needed. Now let's check for null values.

In [None]:
[i for i in train.columns if train[i].isna().any()]

### Insights from data

- The complete data contains 600000 rows and 102 columns, of which 100 are feature columns (the other two are `id` and `target`).
- The target column is either 0 or 1, and the two classes contain approximately the same number of rows.
- There are no null values.
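The checks behind these insights can be sketched on a small stand-in frame. The `toy` DataFrame and its values are hypothetical and only mirror the layout of `train` (an `id` column, feature columns, and `target`); the same expressions should apply to the real data.

```python
import pandas as pd

# Hypothetical toy frame standing in for `train` (the real data has 600000 rows)
toy = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "f0": [0.1, 0.4, 0.3, 0.9],
    "f1": [1.2, 0.7, 2.5, 0.0],
    "target": [0, 1, 1, 0],
})

# Shape and number of feature columns (everything except id and target)
n_rows, n_cols = toy.shape
feature_cols = [c for c in toy.columns if c not in ("id", "target")]

# Share of each class in the target column
class_share = toy["target"].value_counts(normalize=True).sort_index()

# Total number of missing values across the whole frame
null_count = int(toy.isna().sum().sum())

print(n_rows, n_cols, len(feature_cols))
print(class_share.to_dict())
print(null_count)
```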

#### Reduce Memory usage

The training data has 600000 rows and 100 feature columns and takes around 466.9 MB of RAM. We can reduce memory use by storing the values in smaller data types, for example float32 instead of float64, at the cost of some precision.

Here is a quick comparison of integer data types and the range of values each can store:
- int8 can store integers from -128 to 127.
- int16 can store integers from -32768 to 32767.
- int64 can store integers from -9223372036854775808 to 9223372036854775807.
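These ranges can be read directly from NumPy instead of being memorized. The sketch below is illustrative only: it also prints `int32` and `uint8`, and it shows the size and rounding effect of going from float64 to float32, which is the conversion that matters for float-valued features. The sample value is arbitrary.

```python
import numpy as np

# Integer ranges as reported by NumPy
for dtype in (np.int8, np.int16, np.int32, np.int64, np.uint8):
    info = np.iinfo(dtype)
    print(f"{np.dtype(dtype).name:>6}: {info.min} to {info.max}")

# Floats trade precision for size: float32 uses half the bytes of float64
x64 = np.array([0.123456789012345], dtype=np.float64)
x32 = x64.astype(np.float32)
print(x64.nbytes, x32.nbytes)  # 8 4

# The price is a small rounding error on each value
rounding_error = abs(float(x32[0]) - float(x64[0]))
print(rounding_error)
```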

In [None]:
X = train.iloc[:,1:-1]
y = train.target

In [None]:
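def ReduceMem(data):
    feature_cols = data.columns.tolist()

    print("Memory usage before: ", data.memory_usage(deep=True).sum()/(1024**3), "GB")
    for col in feature_cols:
        if data[col].dtype == 'float64':
            # float32 halves the storage at the cost of some precision
            data[col] = data[col].astype('float32')
        elif pd.api.types.is_integer_dtype(data[col]) and data[col].between(0, 255).all():
            # uint8 only holds 0..255, so downcast integer columns only when every value fits
            data[col] = data[col].astype('uint8')

    print("Memory usage after: ", data.memory_usage(deep=True).sum()/(1024**3), "GB")

    return data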

In [None]:
X = ReduceMem(X)

### Normalization

In [None]:
scaler = preprocessing.StandardScaler()
scaled_features = scaler.fit_transform(X.values)
X = pd.DataFrame(scaled_features, index=X.index, columns=X.columns)
X.head()

### Data Split

The provided test set doesn't contain a target variable, so we'll hold out part of the training data as a test set to evaluate the models.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1, stratify=y)
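The `stratify=y` argument keeps the class proportions the same in both parts of the split. A minimal sketch on hypothetical, deliberately imbalanced labels (`X_demo` and `y_demo` are made up for illustration) makes the effect visible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 80 zeros and 20 ones, so 20% positives overall
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=1, stratify=y_demo
)

# Both parts keep the 20% positive rate of the full label vector
print(len(y_tr), y_tr.mean())
print(len(y_te), y_te.mean())
```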

In [None]:
X_train.shape

In [None]:
X_test.shape

## Classification

In [None]:
result = []

We are going to train several baseline models and compare them by ROC AUC on the held-out test split.

In [None]:
def print_roc_auc_score(model, test_data, label):
    # Predicted probability of the positive class
    y_pred = model.predict_proba(test_data)[:, 1]
    # Named auc_score to avoid shadowing sklearn.metrics.auc imported above
    auc_score = metrics.roc_auc_score(label, y_pred)
    fpr, tpr, _ = metrics.roc_curve(label, y_pred)
    plt.plot(fpr, tpr, label="ROC curve (AUC={:.3f})".format(auc_score))
    plt.plot([0, 1], [0, 1], 'k--')  # chance-level diagonal
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.title('ROC curve')
    plt.legend(loc=4)
    plt.show()
    print("AUC Score :", auc_score)

    return auc_score
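As a rough intuition for the score returned by this helper, ROC AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher predicted score, with ties counted as half. The sketch below checks this on four hypothetical predictions; the arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted scores
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Compare every positive score with every negative score; ties count as half
wins = pos[:, None] > neg[None, :]
ties = pos[:, None] == neg[None, :]
pairwise_auc = np.mean(wins + 0.5 * ties)

sk_auc = roc_auc_score(y_true, y_score)
print(pairwise_auc, sk_auc)  # 0.75 0.75
```

Three of the four pairs are ordered correctly here, which is why both values come out at 0.75.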

### Logistic Regression

In [None]:
lr = LogisticRegression(C=0.00003, solver='lbfgs',max_iter=1000)
lr.fit(X_train, y_train)
lr_auc = print_roc_auc_score(lr, X_test, y_test)
result.append(["Logistic regression all features",lr_auc])

### XGBoost

In [None]:
xgb = XGBClassifier(random_state=10, use_label_encoder=False)
xgb.fit(X_train, y_train, eval_metric='aucpr');
xgboost_auc = print_roc_auc_score(xgb, X_test, y_test)
result.append(["XGBoost all features",xgboost_auc])

### AdaBoost

In [None]:
adaBoost = AdaBoostClassifier()
adaBoost.fit(X_train, y_train);
adaBoost_auc = print_roc_auc_score(adaBoost, X_test, y_test)
result.append(["AdaBoost all features",adaBoost_auc])

### CatBoost

In [None]:
catBoost = CatBoostClassifier()
catBoost.fit(X_train, y_train);
catBoost_auc = print_roc_auc_score(catBoost, X_test, y_test)
result.append(["catBoost all features",catBoost_auc])

### LightGBM

In [None]:
lgbm = LGBMClassifier()
lgbm.fit(X_train, y_train);
lgbm_auc = print_roc_auc_score(lgbm, X_test, y_test)
result.append(["lightgbm all features",lgbm_auc])

### Summary 

In [None]:
df_results = pd.DataFrame(result)
df_results = df_results.rename(columns={0:"ModelName",1:"Auc Score"})
df_results

In [None]:
df_results.iloc[df_results["Auc Score"].argmax()]

The table above compares the AUC of each baseline on the held-out split, with every model trained on all 100 features, and the cell above picks out the best-scoring row. For the submission, we refit the logistic regression model below on the complete training data with all of its features.

## Submission

In [None]:
model = LogisticRegression(C=0.00003, solver='lbfgs',max_iter=1000)
model.fit(X, y)

In [None]:
X_test = pd.read_csv("../input/tabular-playground-series-nov-2021/test.csv", index_col='id')
# Reuse the scaler fitted on the training features so the test data is scaled with the same statistics
scaled_features = scaler.transform(X_test.values)
X_test = pd.DataFrame(scaled_features, index=X_test.index, columns=X_test.columns)
X_test.head()
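How new data is scaled depends on which statistics the scaler uses. A model trained on features standardized with the training mean and standard deviation generally expects new data to be scaled with those same statistics. The sketch below, on hypothetical one-column arrays, contrasts reusing a scaler fitted on training data with fitting a fresh scaler on the new data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: the new values sit far above the training range
train_demo = np.array([[0.0], [2.0], [4.0]])
new_demo = np.array([[10.0], [12.0]])

# Option 1: fit on training data, then apply the same statistics to new data
shared = StandardScaler().fit(train_demo)
new_shared = shared.transform(new_demo)

# Option 2: fit a fresh scaler on the new data alone
new_refit = StandardScaler().fit_transform(new_demo)

print(new_shared.ravel())  # stays far from 0, reflecting the shift from training
print(new_refit.ravel())   # re-centered around 0, so the shift is lost
```

With the shared scaler, the shift between the two datasets remains visible to the model; with a refit scaler it is erased.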

In [None]:
predict = model.predict_proba(X_test)[:, 1]
predictions = pd.DataFrame({"id":X_test.index, "target":predict})

In [None]:
predictions.to_csv('submission.csv', index=False)