# Introduction

The dataset used for this competition is synthetic, but it is based on a real dataset and was generated using a CTGAN. The original dataset deals with identifying spam emails from various features extracted from each email. Although the features are anonymized, they have properties relating to real-world features.

Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.
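
To make the metric concrete, here is a minimal sketch of what the area under the ROC curve measures: the fraction of (positive, negative) pairs in which the positive example gets the higher score. The function name `pairwise_auc` and the toy labels and scores are made up for illustration; in practice `roc_auc_score` from scikit-learn (used later in this notebook) is the function to call.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a pair.
    This is O(n_pos * n_neg), so it is only meant for small illustrations.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")

    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy labels and scores, made up for illustration only.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = pairwise_auc(y_true, y_score)
print(auc)
```

Because only the ordering of the scores matters, any rescaling that preserves that ordering leaves the AUC unchanged.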



# EDA Notebook

#### Link: https://www.kaggle.com/rigeltal/tps-11-first-look-eda

### 1) What are we going to do in this notebook?
* We are going to train many different models and see which ones perform well.
* We will look at the feature importances of each model.
* Then we are going to pick the best-performing models and experiment with them in the next notebook.

### 2) Please look at the comment section, where I will post my insights. I will be glad if you join the discussion there.

# Importing libraries

In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

import gc
from sklearn.preprocessing import StandardScaler, MinMaxScaler

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from statistics import mean

from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# Importing data

In [None]:
train = pd.read_csv('../input/tabular-playground-series-nov-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-nov-2021/test.csv')
sample_submission = pd.read_csv('../input/tabular-playground-series-nov-2021/sample_submission.csv')

# Describing the data

In [None]:
train.describe().style.background_gradient("copper_r")

# Checking for null values

In [None]:
print("Null values in train data", train.isnull().sum().sum())
print("Null values in test data", test.isnull().sum().sum())

# Normalization

In [None]:
cols = test.columns
cols

In [None]:
# Normalizing the features
scaler = StandardScaler()

train[cols] = scaler.fit_transform(train[cols])
test[cols] = scaler.transform(test[cols])
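
As a rough sketch of what the scaling step above does, the snippet below standardizes a single column by hand: the mean and standard deviation are learned from the training values only, and the same two numbers are then reused for the test values. The helper names and the toy values are hypothetical; `StandardScaler` does the equivalent for every column at once, using the population standard deviation.

```python
from statistics import mean, pstdev


def fit_standardizer(values):
    """Learn the mean and population standard deviation of one column."""
    mu = mean(values)
    sigma = pstdev(values)
    # Guard against constant columns so we never divide by zero.
    return mu, (sigma if sigma > 0 else 1.0)


def standardize(values, mu, sigma):
    """Apply z = (x - mu) / sigma with already-fitted statistics."""
    return [(v - mu) / sigma for v in values]


# Toy column values, made up for illustration only.
train_col = [2.0, 4.0, 6.0, 8.0]
test_col = [5.0, 11.0]

mu, sigma = fit_standardizer(train_col)          # fit on train only
train_scaled = standardize(train_col, mu, sigma)
test_scaled = standardize(test_col, mu, sigma)   # reuse train statistics

print(mu, sigma)
print(train_scaled)
print(test_scaled)
```

Fitting on train and only transforming test keeps both sets on the same scale without letting test statistics leak into preprocessing.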

# Model Trainer

In [None]:
def Trainer(model, model_name, train_data, test_data, fold):
    test_preds = np.zeros(test_data.shape[0])
    train_preds = np.zeros(train_data.shape[0])
    
    kf = StratifiedKFold(n_splits=fold,random_state=48,shuffle=True)
    
    train_auc=[]
    test_auc=[]
    
    n=0
    
    # Split the data passed in as train_data rather than the global `train`
    for train_index, test_index in kf.split(train_data[cols], train_data['target']):
        
        X_train, X_test = train_data[cols].iloc[train_index], train_data[cols].iloc[test_index]
        y_train, y_test = train_data['target'].iloc[train_index], train_data['target'].iloc[test_index]
        
        if model_name == 'catb':
            model.fit(X_train, y_train, eval_set=[(X_test, y_test)], silent=True)
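        elif model_name in ('lgbm', 'xgb'):
            # Passing early_stopping_rounds to fit() works with the library versions
            # available when this notebook was written; newer LightGBM and XGBoost
            # releases expect early stopping to be configured elsewhere.
            model.fit(X_train, y_train, eval_set=[(X_test,y_test)], early_stopping_rounds=100, eval_metric="auc", verbose=False)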
        else:
            model.fit(X_train, y_train)
        
        train_preds += model.predict_proba(train_data[cols])[:,1]/kf.n_splits
        test_preds += model.predict_proba(test_data[cols])[:,1]/kf.n_splits
        
        train_auc.append(roc_auc_score(y_train, model.predict_proba(X_train)[:, 1]))
        test_auc.append(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
        
        gc.collect()
        
        print(f"fold: {n+1}, train_auc: {train_auc[n]}, test_auc: {test_auc[n]}")
        n+=1
    print(f"train_avg = {mean(train_auc)}, test_avg = {mean(test_auc)}" )
    return train_preds, test_preds
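
The Trainer above relies on `StratifiedKFold` so that every validation fold has roughly the same share of positives as the full training set. The sketch below is a simplified illustration of that idea, not scikit-learn's actual algorithm: it shuffles the row indices of each class and deals them out to the folds round-robin. The function name `stratified_folds` and the toy labels are assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits, seed=48):
    """Assign every row index to one of n_splits validation folds, class by class."""
    rng = random.Random(seed)

    # Group row indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices out round-robin so each fold gets a near-equal share of the class.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return [sorted(fold) for fold in folds]


# Toy labels, made up for illustration: 15 negatives and 5 positives.
labels = [0] * 15 + [1] * 5
folds = stratified_folds(labels, n_splits=5)

for k, val_idx in enumerate(folds):
    positives = sum(labels[i] for i in val_idx)
    print(f"fold {k}: size={len(val_idx)}, positives={positives}")
```

With these toy labels each fold ends up with four rows, exactly one of them positive, which mirrors the 25% positive rate of the whole list.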

# Initialization

In [None]:
lgbm = LGBMClassifier()
xgb = XGBClassifier()
catb = CatBoostClassifier()
rad = RandomForestClassifier()
ada = AdaBoostClassifier()
dec = DecisionTreeClassifier()
lr = LogisticRegression()

# LogisticRegression

In [None]:
lr_train, lr_test = Trainer(lr, 'lr', train, test, 5)
del lr
gc.collect()

sample_submission['target'] = lr_test
sample_submission.to_csv('lr_test.csv', index=False)

# LGBMClassifier

In [None]:
lgbm_train, lgbm_test = Trainer(lgbm, 'lgbm', train, test, 5)
importances_df = pd.DataFrame(lgbm.feature_importances_, columns=['Feature_Importance'], index=cols).sort_values(by="Feature_Importance", ascending=False)

del lgbm
gc.collect()

sample_submission['target'] = lgbm_test
sample_submission.to_csv('lgbm_test.csv', index=False)

In [None]:
importances_df.T.style.background_gradient(cmap="copper_r")

# CatBoostClassifier

In [None]:
catb_train, catb_test = Trainer(catb, 'catb', train, test, 5)
importances_df = pd.DataFrame(catb.feature_importances_, columns=['Feature_Importance'], index=cols).sort_values(by="Feature_Importance", ascending=False)

del catb
gc.collect()

sample_submission['target'] = catb_test
sample_submission.to_csv('catb_test.csv', index=False)

In [None]:
importances_df.T.style.background_gradient(cmap="copper_r")

# XGBClassifier

In [None]:
xgb_train, xgb_test = Trainer(xgb, 'xgb', train, test, 5)
importances_df = pd.DataFrame(xgb.feature_importances_, columns=['Feature_Importance'], index=cols).sort_values(by="Feature_Importance", ascending=False)

del xgb
gc.collect()

sample_submission['target'] = xgb_test
sample_submission.to_csv('xgb_test.csv', index=False)

In [None]:
importances_df.T.style.background_gradient(cmap="copper_r")

# RandomForestClassifier

In [None]:
rad_train, rad_test = Trainer(rad, 'rad', train, test, 5)
importances_df = pd.DataFrame(rad.feature_importances_, columns=['Feature_Importance'], index=cols).sort_values(by="Feature_Importance", ascending=False)

del rad
gc.collect()

sample_submission['target'] = rad_test
sample_submission.to_csv('rad_test.csv', index=False)

In [None]:
importances_df.T.style.background_gradient(cmap="copper_r")

# AdaBoostClassifier

In [None]:
ada_train, ada_test = Trainer(ada, 'ada', train, test, 5)
importances_df = pd.DataFrame(ada.feature_importances_, columns=['Feature_Importance'], index=cols).sort_values(by="Feature_Importance", ascending=False)

del ada
gc.collect()

sample_submission['target'] = ada_test
sample_submission.to_csv('ada_test.csv', index=False)

In [None]:
importances_df.T.style.background_gradient(cmap="copper_r")

# DecisionTreeClassifier

In [None]:
dec_train, dec_test = Trainer(dec, 'dec', train, test, 5)
importances_df = pd.DataFrame(dec.feature_importances_, columns=['Feature_Importance'], index=cols).sort_values(by="Feature_Importance", ascending=False)

del dec
gc.collect()

sample_submission['target'] = dec_test
sample_submission.to_csv('dec_test.csv', index=False)

In [None]:
importances_df.T.style.background_gradient(cmap="copper_r")

# Weighted average

In [None]:
sample_submission['target'] = (lr_test*4 + lgbm_test*3 + catb_test*2 + xgb_test)/10
sample_submission.to_csv('average.csv', index=False)
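
The blend above divides by 10 because the weights 4, 3, 2 and 1 sum to 10. A slightly more general sketch, which normalizes by whatever the weights sum to, could look like the following. The function name `weighted_blend` and the probabilities are hypothetical and only serve to illustrate the arithmetic.

```python
def weighted_blend(predictions, weights):
    """Weighted average of several equally long lists of predicted probabilities."""
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per prediction list")
    n_rows = len(predictions[0])
    if any(len(p) != n_rows for p in predictions):
        raise ValueError("all prediction lists must have the same length")

    total = sum(weights)  # normalizing keeps the result inside [0, 1]
    return [
        sum(w * p[i] for p, w in zip(predictions, weights)) / total
        for i in range(n_rows)
    ]


# Made-up probabilities for three rows, for illustration only.
lr_p = [0.9, 0.2, 0.6]
lgbm_p = [0.8, 0.1, 0.5]
catb_p = [0.7, 0.3, 0.4]
xgb_p = [0.6, 0.4, 0.5]

blend = weighted_blend([lr_p, lgbm_p, catb_p, xgb_p], [4, 3, 2, 1])
print(blend)
```

Normalizing by the sum of the weights means the weights can be changed freely without having to update a hard-coded divisor.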

# Observation
* Logistic regression performs well on submission, which "might" mean that the dataset is quite simple. If so, a neural network could be worth trying, because a simpler dataset doesn't need a very large amount of data to train one.
* I can see f34, f55, f43 and f8 in the top 5 of almost every model's feature importances.
* CatBoostClassifier gives id a large feature importance, which is shocking.
* Why keep id at all? My idea in not dropping the id column was to see where it lands in the feature importances; the features ranked below id are even less useful than the id column. (I hope we can discuss this topic.)
* Random forest and decision tree overfit with default parameters, which in turn "may" confirm that the dataset is simple, given that logistic regression performs well.
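
The observation about the same features showing up near the top for most models was made by eye. One way to check it more systematically is to count how often each feature lands in a model's top k, as in the sketch below. The function name `top_k_counts`, the feature names and the importance values are all made up for illustration and are not results from this notebook.

```python
from collections import Counter


def top_k_counts(importances_by_model, k=5):
    """Count how many models place each feature among their k most important."""
    counts = Counter()
    for importances in importances_by_model.values():
        # Rank feature names by importance, largest first.
        ranked = sorted(importances, key=importances.get, reverse=True)
        counts.update(ranked[:k])
    return counts


# Made-up importances on different scales, for illustration only.
importances_by_model = {
    "model_1": {"feat_a": 120, "feat_b": 110, "feat_c": 90, "feat_d": 10},
    "model_2": {"feat_a": 0.30, "feat_c": 0.25, "feat_e": 0.20, "feat_b": 0.05},
    "model_3": {"feat_b": 0.4, "feat_a": 0.3, "feat_f": 0.2, "feat_c": 0.1},
}

counts = top_k_counts(importances_by_model, k=2)
print(counts.most_common())
```

Counting ranks rather than comparing raw values sidesteps the fact that each library reports importances on its own scale.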

# Next notebook
#### Please look in the comment section. I hope to complete it soon.

# Final note
#### Thank you!
If you like it, please upvote it. If you have a suggestion, please leave it in the comments. I am a beginner too and look forward to learning something new, so let me know how I can improve this.