# 1. Introduction

Kaggle competitions are incredibly fun and rewarding, but they can also be intimidating for people who are relatively new to data science. In the past, Kaggle has launched many Playground competitions that are more approachable than Featured competitions, and thus more beginner-friendly.

The goal of these competitions is to provide a fun, but less challenging, tabular dataset. These competitions will be great for people looking for something in between the Titanic Getting Started competition and a Featured competition.

The data is synthetically generated by a GAN that was trained on a real-world dataset used to identify spam emails via various extracted features from the email.

For this competition, we need to predict a binary target based on the 100 feature columns given in the data. All columns are continuous.

The dataset used for this competition is synthetic, but it is based on a real dataset and was generated using a CTGAN. As noted above, the original dataset deals with identifying spam emails. Although the features are anonymized, they have properties relating to real-world features.

The ground truth target is binary valued, but a prediction may be any number from 0.0 to 1.0, representing the probability that an email is spam.
Submissions are evaluated on **area under the ROC curve** between the predicted probability and the observed target.
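
To make the metric concrete, here is a minimal sketch that computes the area under the ROC curve from its ranking interpretation: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The function name `pairwise_auc` and the toy labels and scores are made up for illustration. In practice, `sklearn.metrics.roc_auc_score` is the usual tool; this version only shows what the number means.

```python
import numpy as np

def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Compare every positive score with every negative score; ties count as half.
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (pos.size * neg.size)

# Toy labels and predicted probabilities (illustrative only, not competition data).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(y_true, y_score)
print(toy_auc)  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75
```

Because only the ordering of the scores matters, any monotonic rescaling of the predictions leaves the AUC unchanged. The pairwise comparison uses memory proportional to positives times negatives, so it is suitable for small examples only.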

# 2. Summary

- There are no missing values in either the train or the test dataset.
- The train dataset consists of 600000 rows, and the test dataset consists of 540000 rows.
- All 100 features, `f0`~`f99`, are continuous.
- The value of the target is 0 or 1.
- The two target values are split almost half-and-half.

# 3. Preparations
This section prepares the packages and data used in the analysis. The packages loaded are mainly for data manipulation, data visualization, and modeling. Two datasets are used in the analysis: the train and test datasets. The train dataset is used to fit models, which then make predictions for the test dataset. The sample submission file shows participants the expected submission format.

In [None]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib import ticker
import seaborn as sns

In [None]:
# matplotlib setting
mpl.rcParams['figure.dpi'] = 200
mpl.rcParams['axes.spines.top'] = False
mpl.rcParams['axes.spines.right'] = False

# pandas setting
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# 4. Dataset Overview
The intent of this overview is to get a feel for the data and its structure in the train, test, and sample submission files. The overview of the train and test datasets includes a quick analysis of missing values and basic statistics, while the sample submission is loaded to see the expected submission format.

## Train dataset
As stated before, the train dataset is mainly used to train predictive models, since the target variable is available in this set. It is also used to explore the data itself, including the relationship between each predictor and the target variable.

**Observations:**
- The `target` column is the target variable, which is only available in the `train` dataset.
- There are `102` columns: `100` features, `1` target variable `target`, and `1` `id` column.
- The `train` dataset contains `600000` rows.
- The `test` dataset contains `540000` rows.
- The `train` and `test` datasets have no `NULL` values.

## 4.1 Loading the Dataset

In [None]:
train = pd.read_csv('../input/tabular-playground-series-nov-2021/train.csv')
test = pd.read_csv('../input/tabular-playground-series-nov-2021/test.csv')
sample_submission = pd.read_csv('../input/tabular-playground-series-nov-2021/sample_submission.csv')

## 4.2 Quick view
Below are the first 5 rows of the train dataset:

In [None]:
train.head()

## 4.3 Train and Test data shape

In [None]:
print(train.shape)
print(test.shape)

In [None]:
train.info()

In [None]:
test.info()

## 4.4 Missing values in train and test

In [None]:
print(f'Missing values in train : {train.isnull().sum().sum()}')
print(f'Missing values in test : {test.isnull().sum().sum()}')
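
The totals above only say whether any value is missing. If a dataset did contain gaps, a per-column breakdown would be the natural next step. The sketch below shows one way to build it; the helper name `missing_report` and the toy frame are hypothetical, not part of the competition data.

```python
import numpy as np
import pandas as pd

def missing_report(df):
    """Return only the columns that contain NaN, with count and percentage."""
    counts = df.isnull().sum()
    report = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": 100 * counts / len(df),
    })
    # Keep affected columns only, worst first.
    return report[report["n_missing"] > 0].sort_values("n_missing", ascending=False)

# Toy frame with one gap in f0 (illustrative only).
toy = pd.DataFrame({
    "f0": [0.1, np.nan, 0.3, 0.4],
    "f1": [1.0, 2.0, 3.0, 4.0],
})
toy_report = missing_report(toy)
print(toy_report)
```

For a frame without missing values, such as `train` according to the totals above, the returned report would be empty.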

## 4.5 Feature Statistics
Below are the basic statistics for each variable: `count`, `mean`, `standard deviation`, `minimum`, `1st quartile`, `median`, `3rd quartile` and `maximum`.

In [None]:
train.loc[:, 'f0':'f99'].describe().T.style.bar(subset=['mean'], color='#205ff2')\
                            .background_gradient(subset=['std'], cmap='Reds')\
                            .background_gradient(subset=['50%'], cmap='coolwarm')

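## 4.6 Distribution
This section shows the distribution of each feature in the train and test datasets. Because there are 100 features, the plots are arranged in a 10x10 grid and drawn from a random sample of 10,000 rows per dataset. With the default color cycle, the curve drawn first (train) is blue and the curve drawn second (test) is orange.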

**Observations:**
- The distribution of every feature is nearly identical in the train and test datasets.

In [None]:
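# Plot a random sample of 10,000 rows per dataset to keep the KDE plots fast.
# With no random_state given, DataFrame.sample draws from NumPy's global
# random state, so seeding it makes the samples reproducible.
np.random.seed(2110)
train_samples = train.sample(10000)
test_samples = test.sample(10000)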

In [None]:
fig, axes = plt.subplots(10,10,figsize=(14, 14))
axes = axes.flatten()

for idx, ax in enumerate(axes, 0):
    if (idx>99):
        ax.axis("off")
        continue
    sns.kdeplot(data=train_samples, x=f'f{idx}', 
                fill=True, 
                ax=ax)
    sns.kdeplot(data=test_samples, x=f'f{idx}', 
                fill=True, 
                ax=ax)
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_xlabel('')
    ax.set_ylabel('')
    ax.spines['left'].set_visible(False)

    ax.set_title(f'f{idx}', loc='right', weight='bold', fontsize=10)

fig.supxlabel('Feature Distribution', ha='center', fontweight='bold')

fig.tight_layout()
plt.show()
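
The visual comparison can be backed by a number. One common choice is the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical CDFs: values near 0 suggest similar distributions, values near 1 suggest very different ones. The sketch below implements it with NumPy on synthetic data. The function name `ks_statistic` and the generated arrays are assumptions for illustration. `scipy.stats.ks_2samp` offers the same statistic together with a p-value.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample KS statistic: max absolute difference between empirical CDFs."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    # The CDFs only change at observed points, so evaluating them there is enough.
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

# Synthetic stand-ins for one feature column (not the competition data).
rng = np.random.default_rng(0)
same_a = rng.normal(size=5000)
same_b = rng.normal(size=5000)           # same distribution as same_a
shifted = rng.normal(loc=1.0, size=5000)  # mean shifted by one standard deviation

ks_same = ks_statistic(same_a, same_b)
ks_shifted = ks_statistic(same_a, shifted)
print(f"same distribution:    {ks_same:.3f}")
print(f"shifted distribution: {ks_shifted:.3f}")
```

Applied to this notebook, the same function could be called as `ks_statistic(train_samples[col], test_samples[col])` for each feature column `col`, flagging any feature whose statistic stands out from the rest.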

## 4.7 Distribution of Target
The target variable takes the value `0` or `1`, indicating that an email is not spam or is spam, respectively. Let's look at the distribution of the `target` variable.

**Observations:**
- The numbers of non-spam and spam emails (`0` and `1`) are almost the same: `296,394` and `303,603`, respectively.
- As percentages, non-spam and spam emails (`0` and `1`) account for `49.399%` and `50.601%`, respectively.
- The two classes are therefore split roughly half and half.

In [None]:
train['target'].value_counts(normalize=True)
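
The observations quote both counts and percentages, while the cell above returns proportions only. A small helper can put both side by side. The sketch below uses a hypothetical function name, `class_balance`, and a toy series rather than the competition data.

```python
import pandas as pd

def class_balance(target):
    """Return the count and percentage of each class, ordered by class label."""
    counts = target.value_counts().sort_index()
    return pd.DataFrame({
        "count": counts,
        "percent": (100 * counts / counts.sum()).round(3),
    })

# Toy target column (illustrative only).
toy_target = pd.Series([0, 1, 1, 0, 1], name="target")
toy_balance = class_balance(toy_target)
print(toy_balance)
```

In this notebook the call would be `class_balance(train['target'])`.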