# Tab Play September
- This notebook covers my code for the **Tabular Playground Series - September challenge**, which can be found [here](https://www.kaggle.com/c/tabular-playground-series-sep-2021).
- In this notebook, I have used an **Auto-Visualization Library** for visualizing the data, which can be found [here](https://github.com/AutoViML/AutoViz).
- I used **mean imputation** for every feature containing NULLs.
- After that, I computed the **PCC (Pearson Correlation Coefficient)** of every feature with the 'claim' variable and dropped the features with |PCC| <= 0.0025, on the assumption that they have almost no linear relationship with the target.
- I then standardized all the features with **StandardScaler**.
- I also tried **PCA (Principal Component Analysis)**, but it only lowered the score, so I didn't use it in the final submission.
- For training, I tried several models: **Gaussian Naive Bayes**, **Logistic Regression**, **Random Forest Classifier**, and **Light Gradient Boosted Machine (LGBM)**. LGBM gave the best score.

**I would love to improve my existing score and am open to any suggestions. Please do leave them in the comments section, and if you liked my work, an upvote would be awesome :)**

In [None]:
!pip install xlrd
!pip install autoviz

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from tqdm import tqdm
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.decomposition import PCA
from lightgbm import LGBMClassifier
from autoviz.AutoViz_Class import AutoViz_Class

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Importing the Dataset

In [None]:
df_train = pd.read_csv("../input/tabular-playground-series-sep-2021/train.csv")
df_test = pd.read_csv("../input/tabular-playground-series-sep-2021/test.csv")
df_sub = pd.read_csv("../input/tabular-playground-series-sep-2021/sample_solution.csv")

In [None]:
print(df_train.shape)
df_train.info(verbose=True, show_counts=True)

In [None]:
print(df_test.shape)
df_test.info(verbose=True, show_counts=True)

In [None]:
# Keeping a separator variable and the target variable
sep = df_train.shape[0]
Y = df_train["claim"]

# Dropping the IDs and the target variable
df_train.drop(["id", "claim"], axis=1, inplace=True)
df_test.drop(["id"], axis=1, inplace=True)

# Concatenating the datasets for pre-processing
df = pd.concat([df_train, df_test], axis=0)

print(df_train.shape, Y.shape, sep)

# Visualizing & Pre-processing the Dataset
- The `info()` output above shows that all the features are numerical.
- However, some features contain NULL values, so we will mean-impute those features.
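
To make the imputation step concrete, here is a minimal sketch on a toy frame. The column names `f1` and `f2` and all the values are made up for illustration. Unlike the notebook cell that follows, this sketch learns the column means from the training rows only and reuses them for the test rows; that is one possible variation, not what the submission used.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with hypothetical column names; NaN marks a missing value.
toy_train = pd.DataFrame({"f1": [1.0, np.nan, 3.0], "f2": [10.0, 20.0, np.nan]})
toy_test = pd.DataFrame({"f1": [np.nan, 5.0], "f2": [np.nan, 40.0]})

# Count the missing values per column before imputing.
print(toy_train.isna().sum())

# Learn the column means from the training rows only ...
toy_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
toy_train_filled = pd.DataFrame(
    toy_imputer.fit_transform(toy_train), columns=toy_train.columns
)

# ... and reuse those same means to fill the test rows.
toy_test_filled = pd.DataFrame(
    toy_imputer.transform(toy_test), columns=toy_test.columns
)

print(toy_train_filled)
print(toy_test_filled)
```

Passing `columns=` when rebuilding the frame keeps the feature names, which makes later output easier to read.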

In [None]:
AV = AutoViz_Class()
data = AV.AutoViz('../input/tabular-playground-series-sep-2021/train.csv')

In [None]:
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
df = imp_mean.fit_transform(df)
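# Keep the original feature names instead of falling back to integer column labels
df = pd.DataFrame(df, columns=df_train.columns)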

In [None]:
df.info(verbose=True, show_counts=True)

In [None]:
# Compute the PCC (Pearson Correlation Coefficient) of each feature with the
# target 'claim', so that features with almost no linear relationship to it can
# be dropped. Only the training rows are used, since the test set has no target.

# Getting the train set
df_train = df.iloc[ : sep, : ]
df_train = df_train.assign(claim = pd.Series(Y))
print(df_train.shape)

# Calculating the PCC
cor_mat = df_train.corr(method='pearson', min_periods=50)
print(cor_mat.shape)

# Collect the features whose abs(PCC) with 'claim' is less than or equal to 0.0025.
# They are dropped because they have almost no linear relationship with 'claim'.
red_fea = []
for i, pcc in enumerate(cor_mat['claim']):
    if abs(pcc) <= 0.0025:
        red_fea.append(cor_mat.index[i])
print(red_fea)
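
The same filter can probably be written without building the full feature-by-feature correlation matrix. The sketch below uses `DataFrame.corrwith`, which correlates each column with a single target series. The helper name `low_correlation_features`, the toy columns, and the 0.2 threshold are all hypothetical and chosen only so the toy example has something to drop.

```python
import numpy as np
import pandas as pd


def low_correlation_features(features, target, threshold):
    """Return the column labels whose |Pearson r| with target is <= threshold."""
    pcc = features.corrwith(target, method="pearson")
    # Columns with an undefined (NaN) correlation are not selected here.
    return pcc[pcc.abs() <= threshold].index.tolist()


# Synthetic data: one column tracks the target closely, the other is pure noise.
rng = np.random.default_rng(0)
n_rows = 1000
signal = rng.normal(size=n_rows)
toy_X = pd.DataFrame(
    {
        "related": signal + 0.1 * rng.normal(size=n_rows),
        "unrelated": rng.normal(size=n_rows),
    }
)
toy_y = pd.Series(signal)

toy_dropped = low_correlation_features(toy_X, toy_y, threshold=0.2)
print(toy_dropped)
```

On the real data this would be called with the imputed training features, `Y`, and the notebook's threshold of 0.0025.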

In [None]:
# Dropping all the Redundant features
df.drop(red_fea, axis=1, inplace=True)
print(df.shape)

In [None]:
# Splitting the df back into df_train and df_test
df_train = df.iloc[ :sep, : ]
df_test = df.iloc[sep: , : ]
print(df_train.shape, df_test.shape)

In [None]:
scaler = StandardScaler()
df_train = scaler.fit_transform(df_train)
df_test = scaler.transform(df_test)
print(df_train.shape, df_test.shape)
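
For reference, here is a small sketch of what the scaling cell does, on made-up numbers. `StandardScaler` subtracts the training mean of each column and divides by the training standard deviation (population form, `ddof=0`); the test rows are then transformed with those same training statistics, so they are not guaranteed to have mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy arrays with two columns on very different scales.
toy_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
toy_test = np.array([[2.0, 400.0]])

toy_scaler = StandardScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)  # learns mean and std from train
toy_test_scaled = toy_scaler.transform(toy_test)        # reuses the train statistics

print(toy_scaler.mean_)                 # per-column training means
print(toy_train_scaled.mean(axis=0))    # approximately 0 for each column
print(toy_train_scaled.std(axis=0))     # approximately 1 for each column
print(toy_test_scaled)
```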

In [None]:
# PCA lowered the score for every model tried, so it is not used.
# pca = PCA(n_components = 70, random_state = 42)
# df_train = pca.fit_transform(df_train)
# df_test = pca.transform(df_test)
# print("Explained Variance Ratio: ", np.sum(pca.explained_variance_ratio_))
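
The commented-out cell fixes `n_components = 70` by hand. One alternative is to pick the number of components from the cumulative explained variance ratio. This is a sketch on synthetic data; the helper name `components_for_variance` and the 0.99 target are my own assumptions, not something tuned for this competition.

```python
import numpy as np
from sklearn.decomposition import PCA


def components_for_variance(X, target_ratio):
    """Smallest number of components whose cumulative explained variance
    ratio reaches target_ratio (capped at the total number of components)."""
    pca = PCA().fit(X)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # First position where the cumulative ratio is >= target_ratio, as a count.
    k = np.searchsorted(cumulative, target_ratio) + 1
    return int(min(k, len(cumulative)))


# Synthetic data: 5 observed columns driven by 2 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
toy_X = latent @ mixing + 0.01 * rng.normal(size=(200, 5))

toy_k = components_for_variance(toy_X, 0.99)
print(toy_k)
```

If the chosen number turns out to be close to the original feature count, PCA compresses very little, which is at least consistent with it not helping here.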

# Training the Model

In [None]:
# Splitting the df_train into train & val sets
X_train, X_val, y_train, y_val = train_test_split(df_train, Y, test_size=0.1, random_state=42)
print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)
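
A single 10% hold-out gives one estimate of the score. As a possible cross-check, the sketch below runs stratified 5-fold cross-validation scored with ROC AUC. It uses synthetic data and `LogisticRegression` as a cheap stand-in model; the fold count and every `toy_` name are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data standing in for the real features.
toy_X, toy_y = make_classification(n_samples=500, n_features=10, random_state=42)

# Stratified folds keep the class ratio roughly equal in every fold.
toy_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# scoring="roc_auc" evaluates predicted scores, matching roc_auc_score above.
toy_scores = cross_val_score(
    LogisticRegression(max_iter=1000), toy_X, toy_y, cv=toy_cv, scoring="roc_auc"
)
print(toy_scores.mean(), toy_scores.std())
```

The spread across folds gives a rough idea of how much a score difference between two models can be trusted.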

In [None]:
# Gaussian Naive Bayes
# gnb = GaussianNB()
# gnb.fit(X_train, y_train)
# y_pred = gnb.predict_proba(X_val)[ : , 1]
# print(roc_auc_score(y_val, y_pred))

In [None]:
# Logistic Regression
# lr = LogisticRegression(C = 0.001)
# lr.fit(X_train, y_train)
# y_pred = lr.predict_proba(X_val)[ : , 1]
# print(roc_auc_score(y_val, y_pred))

In [None]:
# Random Forest Classifier
# rfc = RandomForestClassifier(n_estimators = 10, verbose = 1)
# rfc.fit(X_train, y_train)
# y_pred = rfc.predict_proba(X_val)[ : , 1]
# print(roc_auc_score(y_val, y_pred))

In [None]:
# Light Gradient Boosted Machine (LightGBM)
lgbm = LGBMClassifier(
    max_depth=3,
    num_leaves=7,
    n_estimators=10000,
    colsample_bytree=0.3,
    subsample=0.5,  # note: LightGBM applies row subsampling only when subsample_freq > 0
    random_state=41,
    reg_alpha=18,
    reg_lambda=17,
    learning_rate=0.095,
    device='gpu',  # needs a GPU-enabled LightGBM build and a GPU runtime
    objective='binary'
)
lgbm.fit(X_train, y_train)
y_pred = lgbm.predict_proba(X_val)[ : , 1]
print(roc_auc_score(y_val, y_pred))

# Submitting the Predictions

In [None]:
# Training the model on the entire df_train
model = LGBMClassifier(
    max_depth=3,
    num_leaves=7,
    n_estimators=10000,
    colsample_bytree=0.3,
    subsample=0.5,
    random_state=41,
    reg_alpha=18,
    reg_lambda=17,
    learning_rate=0.095,
    device='gpu',
    objective='binary'
)
model.fit(df_train, Y)

In [None]:
y_test = model.predict_proba(df_test)[ : , 1]
df_sub['claim'] = y_test
print(df_sub.shape)
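
Before writing the file, a few quick checks can catch a malformed submission. The helper below is a hypothetical sketch: the name `check_submission` and the specific checks (row count, no missing values, predictions inside [0, 1]) are my assumptions about what a well-formed file looks like, not rules taken from the competition page.

```python
import pandas as pd


def check_submission(sub, expected_rows, target_col="claim"):
    """Return True if the frame passes basic sanity checks, else raise ValueError."""
    if len(sub) != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, got {len(sub)}")
    if sub[target_col].isna().any():
        raise ValueError(f"column {target_col!r} contains missing values")
    if not sub[target_col].between(0.0, 1.0).all():
        raise ValueError(f"column {target_col!r} has values outside [0, 1]")
    return True


# Toy submission frame with made-up ids and predictions.
toy_sub = pd.DataFrame({"id": [0, 1, 2], "claim": [0.1, 0.8, 0.5]})
print(check_submission(toy_sub, expected_rows=3))
```

In this notebook the call would look like `check_submission(df_sub, expected_rows=len(df_test))`.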

In [None]:
df_sub.head()

In [None]:
df_sub.to_csv("submission.csv", index = False)