# Introduction

The dataset used for this competition is synthetic, but it is based on a real dataset and was generated using a CTGAN. The original dataset deals with identifying spam emails from various features extracted from each email. Although the features are anonymized, they have properties relating to real-world features.

Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.
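
To make the metric concrete, here is a minimal sketch of what the area under the ROC curve measures: the fraction of (positive, negative) pairs in which the positive example gets the higher predicted probability, with ties counted as half. The helper name `pairwise_auc` and the toy labels and scores are made up for illustration. On real data a library routine such as `sklearn.metrics.roc_auc_score` is the practical choice, since this double loop is quadratic in the number of rows.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    Illustrative helper only; far too slow for a large dataset.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(y_true, y_score)
print(auc)  # 0.75
```

Because only the ordering of the scores matters, any monotonic rescaling of the predicted probabilities leaves the score unchanged.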



# First thoughts
* This month's Tabular Playground dataset is once again quite large, so managing both CPU usage and RAM is going to be an important element of the project.
* It looks like another classification problem.
* There is no missing data, so imputing values will not be required.
* It looks like there are no categorical features.
* Data engineering and feature importance may be important.
* It's likely that model selection and hyperparameter tuning will be important.
* Stacking, blending and ensembles are likely to be important to get higher scores.
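
As a rough illustration of the blending idea from the list above, the sketch below averages the predicted probabilities of several models, optionally with weights. The function name `blend` and the two toy prediction vectors are hypothetical. This notebook does not train multiple models, so treat it as a sketch of the general technique and not as the approach used here.

```python
import numpy as np


def blend(predictions, weights=None):
    """Weighted average of probability vectors from several models.

    predictions: sequence of equal-length probability vectors, one per model.
    weights: optional per-model weights; defaults to a plain mean.
    """
    preds = np.asarray(predictions, dtype=float)  # shape (n_models, n_samples)
    if weights is None:
        weights = np.ones(preds.shape[0])
    weights = np.asarray(weights, dtype=float)
    return weights @ preds / weights.sum()


# Hypothetical predictions from two models on three rows.
model_a = [0.2, 0.8, 0.6]
model_b = [0.4, 0.6, 1.0]

blended = blend([model_a, model_b])            # plain mean
weighted = blend([model_a, model_b], [3, 1])   # trust model_a three times as much
print(blended, weighted)
```

Weights are usually chosen on a validation set, never on the test predictions themselves.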

# Importing Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Importing data

In [None]:
train = pd.read_csv('../input/tabular-playground-series-nov-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-nov-2021/test.csv')
sample_submission = pd.read_csv('../input/tabular-playground-series-nov-2021/sample_submission.csv')

#### Describing the data

In [None]:
train.describe().style.background_gradient("copper_r")

#### Dropping id column

In [None]:
train.drop('id', axis=1, inplace=True)
test.drop('id', axis=1, inplace=True)

# Features

In [None]:
features=[]
cat_features=[]
cont_features=[]

for feature in test.columns:
    features.append(feature)
    if test.dtypes[feature] == object or train.dtypes[feature] == 'int8':
        cat_features.append(feature)
    else:
        cont_features.append(feature)

plt.bar([1,2],[len(cat_features),len(cont_features)])
plt.xticks([1,2],('Categorical','continuous'))
plt.show()

The plot above shows that all features are continuous, so there are no categorical features this time.

# Info about data

#### Train and test shape

In [None]:
print("Shape of Train data:", train.shape)
print("Shape of Test data:", test.shape)

#### Missing data

In [None]:
# Percentage of missing cells = missing count / total number of cells * 100
print("Missing train data:", train.isnull().sum().sum(), f"({train.isnull().sum().sum() / train.size * 100}%)")
print("Missing test data:", test.isnull().sum().sum(), f"({test.isnull().sum().sum() / test.size * 100}%)")

#### Feature type

In [None]:
print("Categorical features:", len(cat_features))
print("Continuous features:", len(cont_features))

#### Memory used

In [None]:
# memory_usage() returns bytes; dividing by 1024**2 converts to MB
print("Memory used by train data (MB):", train.memory_usage().sum() / 1024**2)
print("Memory used by test data (MB):", test.memory_usage().sum() / 1024**2)
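
Since RAM is a concern with a dataset this size, one common way to shrink the footprint is to store `float64` columns as `float32`. The sketch below is an assumption about how that could be done, not a step this notebook performs. The helper name `downcast_floats` and the small `demo` frame are made up for illustration. `float32` keeps roughly seven significant digits, so check that this precision is acceptable before applying it to the real data.

```python
import numpy as np
import pandas as pd


def downcast_floats(df):
    """Return a copy of df with every float64 column stored as float32."""
    float_cols = df.select_dtypes(include="float64").columns
    return df.astype({col: np.float32 for col in float_cols})


# Small synthetic frame standing in for the competition data.
demo = pd.DataFrame({
    "f0": np.linspace(0.0, 1.0, 1000),
    "f1": np.linspace(-1.0, 1.0, 1000),
    "target": np.arange(1000) % 2,
})

small = downcast_floats(demo)
print("Before (bytes):", demo.memory_usage().sum())
print("After (bytes):", small.memory_usage().sum())
```

The same helper could be applied to `train` and `test` right after loading, which roughly halves the memory taken by the float columns.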

# Glance at train data

In [None]:
pd.set_option("display.max_columns", None)

In [None]:
train.head()

# Target distribution

In [None]:
pie, ax = plt.subplots(figsize=[18,8])
train.groupby('target').size().plot(kind='pie',autopct='%.1f',ax=ax,title='Target distribution')

#### All credits: https://www.kaggle.com/davidcoxon/first-look-at-october-data
# Distribution of data

In [None]:
print("Train: Red")
print("Test: Green")
nrows = 20
ncols = 5
i = 0

fig, ax = plt.subplots(nrows, ncols, figsize = (25, 25))

for row in range(nrows):
    for col in range(ncols):
        sns.histplot(data = train.iloc[:, i],color='r', ax = ax[row, col]).set(ylabel = '')
        sns.histplot(data = test.iloc[:, i],color='g', ax = ax[row, col]).set(ylabel = '')
        i += 1

# Boxplots of continuous features

In [None]:
train_outliers = ((train - train.min())/(train.max() - train.min()))

fig, ax = plt.subplots(7, 1, figsize = (25,25))

sns.boxplot(data = train_outliers.iloc[:, 0:15], ax = ax[0])
sns.boxplot(data = train_outliers.iloc[:, 15:30], ax = ax[1])
sns.boxplot(data = train_outliers.iloc[:, 30:45], ax = ax[2])
sns.boxplot(data = train_outliers.iloc[:, 45:60], ax = ax[3])
sns.boxplot(data = train_outliers.iloc[:, 60:75], ax = ax[4])
sns.boxplot(data = train_outliers.iloc[:, 75:90], ax = ax[5])
sns.boxplot(data = train_outliers.iloc[:, 90:101], ax = ax[6])
plt.show()

del train_outliers

# Feature correlation

In [None]:
corr=train.corr()

mask = np.triu(np.ones_like(corr, dtype = bool))
plt.figure(figsize = (15, 15))
plt.title('Correlation matrix for features of Training data')
sns.heatmap(corr,cmap='coolwarm', mask = mask, annot=False, linewidths = .5,square=True, cbar_kws={"shrink": .60})
plt.show()

# Feature correlation with target

In [None]:
corr[['target']].sort_values(by='target', ascending=False).T.style.background_gradient(cmap="copper_r")

In [None]:
corr_ = abs(corr[['target']].sort_values(by='target', ascending=False))
fig, axes = plt.subplots(1, 2, figsize=(18, 10))
fig.suptitle('Correlation to Target')

sns.heatmap(ax=axes[0], data=corr_.iloc[0:50,:], annot=False, cmap='tab20c', linewidth=0.5, xticklabels=corr_.iloc[0:50,:].columns, yticklabels=corr_.iloc[0:50,:].index)
sns.heatmap(ax=axes[1], data=corr_.iloc[50:,:], annot=False, cmap='tab20c', linewidth=0.5, xticklabels=corr_.iloc[50:,:].columns, yticklabels=corr_.iloc[50:,:].index)
plt.show()
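
Because features can correlate with the target in either direction, it can help to rank them by the absolute value of the correlation while keeping the sign visible. The sketch below shows one way to do that on a small synthetic frame. The column names `f_pos`, `f_neg` and `f_noise` are invented for the example and are not columns of the competition data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: one positively related, one negatively related
# and one unrelated feature.
rng = np.random.default_rng(0)
n = 2000
target = rng.integers(0, 2, size=n)
demo = pd.DataFrame({
    "f_pos": target + rng.normal(0, 1.0, n),
    "f_neg": -target + rng.normal(0, 0.5, n),
    "f_noise": rng.normal(0, 1.0, n),
    "target": target,
})

# Correlation of every feature with the target (the target itself is dropped).
target_corr = demo.corr()["target"].drop("target")

# Order by strength, but keep the signed values for reading the direction.
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)
print(ranked)
```

With the real data, the same two lines applied to `train.corr()` would put strongly negative features next to strongly positive ones instead of at the bottom of the list.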

# Feature importance of LGBM

In [None]:
from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier

X = train[features]
y = train['target']

X_train, X_valid, y_train, y_valid = train_test_split(X, y,train_size=0.8,test_size = 0.2,random_state = 0)

lgbm = LGBMClassifier()
lgbm.fit(X_train, y_train)

importances_df = pd.DataFrame(lgbm.feature_importances_, columns=['Feature_Importance'],index=X_train.columns)
importances_df.sort_values(by=['Feature_Importance'], ascending=False, inplace=True)

In [None]:
importances_df.T.style.background_gradient(cmap="copper_r")

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(18, 10))
fig.suptitle('LGBM feature importance')

sns.heatmap(ax=axes[0], data=importances_df.iloc[0:50,:], annot=False,cmap='tab20c', linewidth=0.5, xticklabels=importances_df.iloc[0:50,:].columns, yticklabels=importances_df.iloc[0:50,:].index)
sns.heatmap(ax=axes[1], data=importances_df.iloc[50:,:], annot=False,cmap='tab20c', linewidth=0.5, xticklabels=importances_df.iloc[50:100,:].columns, yticklabels=importances_df.iloc[50:100,:].index)
plt.show()

# Baseline lgbm submission

In [None]:
from sklearn.metrics import roc_auc_score

print(roc_auc_score(y_valid, lgbm.predict_proba(X_valid)[:, 1]))
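
The score above comes from a single 80/20 split, so it can shift with the random seed. A stratified k-fold estimate is usually steadier. The sketch below is an illustration under stated assumptions: it uses synthetic data and swaps in `LogisticRegression` as a lightweight stand-in so that it runs quickly, and `X_demo`, `y_demo` and `fold_scores` are illustrative names. In the notebook, the same loop could wrap `LGBMClassifier` with `X` and `y`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic binary problem: only the first column carries signal.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, valid_idx in skf.split(X_demo, y_demo):
    model = LogisticRegression()  # stand-in model, not the notebook's LGBM
    model.fit(X_demo[train_idx], y_demo[train_idx])
    proba = model.predict_proba(X_demo[valid_idx])[:, 1]
    fold_scores.append(roc_auc_score(y_demo[valid_idx], proba))

print("Mean AUC:", np.mean(fold_scores))
print("Std of AUC across folds:", np.std(fold_scores))
```

The spread across folds gives a rough sense of how much of a score difference between two models is just noise.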

In [None]:
test_preds = lgbm.predict_proba(test)[:, 1]
sample_submission['target'] = test_preds
sample_submission.to_csv('submission.csv', index=False)

# Observation
#### * The test dataset is approximately equal in size to the training dataset
#### * The training dataset is highly representative of the test dataset
#### * There is no missing data
#### * There are no binary features
#### * There is relatively low correlation between features
#### * There appears to be a relatively high correlation between f34 and the target value.
#### * Features show both positive and negative correlations to the target.
#### * Feature importance doesn't indicate f34 as an important feature.
#### * Feature importance indicates f91 as an important feature

# Next notebook
##### What's in it?
* Training many different models to see which one performs well.
* Feature importances of each model.

#### Link: https://www.kaggle.com/rigeltal/tps-11-starter

# Final note
#### Thank you!
##### If you like it, please upvote it. If you have a suggestion, please leave it in a comment. I am also a beginner looking forward to learning something new, so let me know how I can improve this.