# Table of Contents
<a id="table-of-contents"></a>
- [1 Introduction](#1)
- [2 Preparations](#2)
- [3 Dataset Overview](#3)
- [4 Data Distributions](#4)
    - [4.1 F_1](#4.1)
    - [4.2 F_2](#4.2)
    - [4.3 F_3](#4.3)
    - [4.4 F_4](#4.4)
- [5 Missing Values](#5)
    - [5.1 Columns Missing Values](#5.1)
    - [5.2 Rows Missing Values](#5.2)
    - [5.3 F Categories Missing Values](#5.3)
- [6 Deep into 1 Missing Value](#6)
    - [6.1 F_1 Category](#6.1)
    - [6.2 F_3 Category](#6.2)
    - [6.3 F_4 Category](#6.3)

[back to top](#table-of-contents)
<a id="1"></a>
# 1 Introduction

Kaggle competitions are incredibly fun and rewarding, but they can also be intimidating for people who are relatively new to their data science journey. Kaggle has launched many Playground competitions that are more approachable than Featured competitions, and thus more beginner-friendly.

The goal of these competitions is to provide a fun, but less challenging, tabular dataset. These competitions will be great for people looking for something in between the Titanic Getting Started competition and a Featured competition.

The June edition of the 2022 Tabular Playground series is all about data imputation. For this challenge, we are given (simulated) manufacturing control data that contains missing values due to electronic errors. Our task is to predict the values of all missing data in this dataset. (Note, while there are continuous and categorical features, only the continuous features have missing values.)

[back to top](#table-of-contents)
<a id="2"></a>
# 2 Preparations
This section prepares the packages and data used in the analysis. The packages loaded are mainly for data manipulation, data visualization and modeling. Two files are read: `data.csv`, which holds the dataset with the missing values to predict, and `sample_submission.csv`, which shows participants the expected submission format for the competition.

In [None]:
# import packages
import os
import joblib
import numpy as np
import pandas as pd
import warnings

import matplotlib
import matplotlib.pyplot as plt
from matplotlib import ticker
import seaborn as sns

# setting up options
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
warnings.filterwarnings('ignore')

# import datasets
data = pd.read_csv('../input/tabular-playground-series-jun-2022/data.csv')
submission = pd.read_csv('../input/tabular-playground-series-jun-2022/sample_submission.csv')

[back to top](#table-of-contents)
<a id="3"></a>
# 3 Dataset Overview
The intent of the overview is to get a feel for the data and its structure.

**Observations:**
- There are `81 columns` with `1 million rows`.
- There are `1 million` missing values that should be predicted in this competition.
- It seems the data is divided into `4 big categories`: `F_1`, `F_2`, `F_3` and `F_4`.
- `F_1` and `F_4` each have `15` columns (`_0` through `_14`), while `F_2` and `F_3` each have `25` columns (`_0` through `_24`). Together with the ID column, this gives the `81 columns`.
- Columns `F_2_0` through `F_2_24` are `categorical` features, and we are not expecting any missing values in them.
- `Continuous` features can be seen in column starting with `F_1`, `F_3` and `F_4`.
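
The number of columns in each category can be checked with a short prefix count. The sketch below is only an illustration: `toy_columns` is a made-up list that follows the `F_<group>_<index>` naming pattern, and `count_by_group` is a hypothetical helper, not part of the original notebook. On the real data, `list(data.columns)` would be passed instead.

```python
from collections import Counter

# Toy column list following the F_<group>_<index> naming pattern (made up).
# On the real data, use list(data.columns) instead.
toy_columns = ["row_id", "F_1_0", "F_1_1", "F_2_0", "F_2_1", "F_2_2",
               "F_3_0", "F_4_0", "F_4_1"]


def count_by_group(columns):
    """Count columns per F group, ignoring names that do not start with 'F_'."""
    groups = (col[:3] for col in columns if col.startswith("F_"))
    return dict(Counter(groups))


group_sizes = count_by_group(toy_columns)
print(group_sizes)
```

The ID column is skipped because its name does not start with `F_`, so the counts cover feature columns only.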

Below are the first 5 rows of the dataset:

In [None]:
data.head()

In [None]:
print(f'Number of rows: {data.shape[0]};  Number of columns: {data.shape[1]}; No of missing values: {sum(data.isna().sum())}')

In [None]:
data.dtypes

[back to top](#table-of-contents)
<a id="4"></a>
# 4 Data Distributions
We are going to explore the distribution of each feature. This is useful for getting a sense of the range in which the missing values should fall when making predictions.

In [None]:
columns_name = list(data.columns)
columns_name = columns_name[1:len(columns_name)]
columns_name_F1 = [col for col in columns_name if col[:3] == "F_1"]
columns_name_F2 = [col for col in columns_name if col[:3] == "F_2"]
columns_name_F3 = [col for col in columns_name if col[:3] == "F_3"]
columns_name_F4 = [col for col in columns_name if col[:3] == "F_4"]

<a id="4.1"></a>
## 4.1 F_1

In [None]:
background_color = "#f6f5f5"

plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10, 10), facecolor='#f6f5f5')
gs = fig.add_gridspec(5, 3)
gs.update(wspace=0.3, hspace=0.3)

# create a 5x3 grid of axes and keep them in a list
axes = []
for row in range(0, 5):
    for col in range(0, 3):
        ax = fig.add_subplot(gs[row, col])
        ax.set_facecolor(background_color)
        for s in ["top","right"]:
            ax.spines[s].set_visible(False)
        axes.append(ax)

features = columns_name_F1

# one KDE plot per F_1 feature
for ax, col in zip(axes, features):
    sns.kdeplot(ax=ax, x=data[col], zorder=2, alpha=1, linewidth=1, color='#ff355d')
    ax.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    ax.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    ax.set_ylabel('')
    ax.set_xlabel(col, fontsize=4, fontweight='bold')
    ax.tick_params(labelsize=4, width=0.5)
    ax.xaxis.offsetText.set_fontsize(4)
    ax.yaxis.offsetText.set_fontsize(4)

plt.show()

<a id="4.2"></a>
## 4.2 F_2

In [None]:
plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10, 10), facecolor='#f6f5f5')
gs = fig.add_gridspec(5, 5)
gs.update(wspace=0.3, hspace=0.3)

background_color = "#f6f5f5"
sns.set_palette(['#ff355d']*25)

# create a 5x5 grid of axes and keep them in a list
axes = []
for row in range(0, 5):
    for col in range(0, 5):
        ax = fig.add_subplot(gs[row, col])
        ax.set_facecolor(background_color)
        for s in ["top","right"]:
            ax.spines[s].set_visible(False)
        axes.append(ax)

#ax0.text(-0.5, 92, 'Top 5 Values - Train Dataset', fontsize=10, fontweight='bold')
#ax0.text(-0.5, 85, 'feature_0 - feature_24', fontsize=6, fontweight='light')        

features = columns_name_F2

# one bar plot per F_2 feature, showing the share of rows (in %) for each value
for ax, col in zip(axes, features):
    temp_df = pd.DataFrame(data[col].value_counts())
    temp_df = temp_df.reset_index(drop=False)
    temp_df.columns = ['Number', 'Count']
    sns.barplot(ax=ax, x=temp_df['Number'], y=temp_df['Count']/len(data)*100, zorder=2, linewidth=0, alpha=1, saturation=1)
    ax.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    ax.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    ax.set_ylabel('')
    ax.set_xlabel(col, fontsize=4, fontweight='bold')
    ax.tick_params(labelsize=4, width=0.5, length=1.5)
    ax.yaxis.set_major_formatter(ticker.PercentFormatter())
plt.show()

<a id="4.3"></a>
## 4.3 F_3

In [None]:
background_color = "#f6f5f5"

plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10, 10), facecolor='#f6f5f5')
gs = fig.add_gridspec(5, 5)
gs.update(wspace=0.3, hspace=0.3)

# create a 5x5 grid of axes and keep them in a list
axes = []
for row in range(0, 5):
    for col in range(0, 5):
        ax = fig.add_subplot(gs[row, col])
        ax.set_facecolor(background_color)
        for s in ["top","right"]:
            ax.spines[s].set_visible(False)
        axes.append(ax)

features = columns_name_F3

# one KDE plot per F_3 feature
for ax, col in zip(axes, features):
    sns.kdeplot(ax=ax, x=data[col], zorder=2, alpha=1, linewidth=1, color='#ff355d')
    ax.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    ax.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    ax.set_ylabel('')
    ax.set_xlabel(col, fontsize=4, fontweight='bold')
    ax.tick_params(labelsize=4, width=0.5)
    ax.xaxis.offsetText.set_fontsize(4)
    ax.yaxis.offsetText.set_fontsize(4)

plt.show()

<a id="4.4"></a>
## 4.4 F_4

In [None]:
background_color = "#f6f5f5"

plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10, 10), facecolor='#f6f5f5')
gs = fig.add_gridspec(5, 3)
gs.update(wspace=0.3, hspace=0.3)

# create a 5x3 grid of axes and keep them in a list
axes = []
for row in range(0, 5):
    for col in range(0, 3):
        ax = fig.add_subplot(gs[row, col])
        ax.set_facecolor(background_color)
        for s in ["top","right"]:
            ax.spines[s].set_visible(False)
        axes.append(ax)

features = columns_name_F4

# one KDE plot per F_4 feature
for ax, col in zip(axes, features):
    sns.kdeplot(ax=ax, x=data[col], zorder=2, alpha=1, linewidth=1, color='#ff355d')
    ax.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    ax.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    ax.set_ylabel('')
    ax.set_xlabel(col, fontsize=4, fontweight='bold')
    ax.tick_params(labelsize=4, width=0.5)
    ax.xaxis.offsetText.set_fontsize(4)
    ax.yaxis.offsetText.set_fontsize(4)

plt.show()

[back to top](#table-of-contents)
<a id="5"></a>
# 5 Missing Values
We are going to take a closer look at the missing values in the dataset by `rows`, by `columns`, and broken down by `F categories`.

<a id="5.1"></a>
## 5.1 Columns Missing Values
This shows how many missing values there are in each column.

**Observations:**
- As mentioned in the competition description, there are no missing values in the categorical features. Our analysis is consistent with this, as can be seen in the `F_2` features.
- Each continuous `column` has around `18,000` missing values.

In [None]:
print(data.isnull().sum(axis=0))

<a id="5.2"></a>
## 5.2 Rows Missing Values
This shows how many missing values there are in each row.

**Observations:**
* `More than 35%` of rows have no missing values.
* `More than 35%` of rows have only `1 missing value`.


In [None]:
missing_row = data.isnull().sum(axis=1)
missing_row = missing_row.value_counts()
missing_row = pd.DataFrame(missing_row)
missing_row = missing_row.reset_index(drop=False)
missing_row.columns = ["Number", "Count"]

background_color = "#f6f5f5"

plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(3, 2), facecolor='#f6f5f5')
gs = fig.add_gridspec(1, 1)
gs.update(wspace=0.3, hspace=0.3)

ax0 = fig.add_subplot(gs[0, 0])
ax0.set_facecolor(background_color)

sns.barplot(ax=ax0,x=missing_row['Number'], y=missing_row['Count']/len(data)*100, zorder=2, linewidth=0, alpha=1, saturation=1)
ax0.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
ax0.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
ax0.set_ylabel('')
ax0.set_xlabel("Missing Value", fontsize=4, fontweight='bold')
ax0.tick_params(labelsize=4, width=0.5, length=1.5)
ax0.yaxis.set_major_formatter(ticker.PercentFormatter())
for s in ["top","right"]:
    ax0.spines[s].set_visible(False)



<a id="5.3"></a>
## 5.3 F Categories Missing Values
This shows how many missing values each row has within each F category. There are 2 ideas here:
* Predict the missing values separately for each `F category` using an `imputer`.
* Use a `Machine Learning algorithm` (XGBoost, LightGBM, CatBoost, etc.) to predict the values in rows that have only `1 missing value`, and use an `imputer` to fill rows that have `more than 1 missing value`.
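
The first idea can be sketched on a toy frame. This is only an illustration under stated assumptions: the data is synthetic, the column names are made up to follow the `F_x_y` pattern, `impute_by_group` and `mean_impute` are hypothetical helpers, and column-mean filling with plain pandas stands in for whichever `imputer` is eventually chosen.

```python
import numpy as np
import pandas as pd


def impute_by_group(df, groups, impute_fn):
    """Apply impute_fn separately to each group of columns; return a filled copy."""
    filled = df.copy()
    for cols in groups:
        filled[cols] = impute_fn(df[cols])
    return filled


def mean_impute(block):
    """Placeholder imputer: replace NaN with the mean of the same column."""
    return block.fillna(block.mean())


# Synthetic data with made-up column names.
toy = pd.DataFrame({
    "F_1_0": [1.0, np.nan, 3.0, 5.0],
    "F_1_1": [2.0, 4.0, np.nan, 6.0],
    "F_3_0": [np.nan, 1.0, 1.0, 4.0],
})
toy_groups = [["F_1_0", "F_1_1"], ["F_3_0"]]

toy_filled = impute_by_group(toy, toy_groups, mean_impute)
print(toy_filled)
```

Because the imputation function is passed in as an argument, an imputer that uses the other columns of the same category could be swapped in without changing the loop.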

**Observations**:
* In the `F_1` and `F_4` categories, `more than 20%` of rows have exactly `1 missing value`.
* In `F_3`, almost `30%` of rows have exactly `1 missing value`.

In [None]:
def missing_values(dataset, columns):
    new_dataset = dataset[columns].isnull().sum(axis=1)
    new_dataset = new_dataset.value_counts()
    new_dataset = pd.DataFrame(new_dataset)
    new_dataset = new_dataset.reset_index(drop=False)
    new_dataset.columns = ["Number", "Count"]
    return new_dataset

f_columns = [columns_name_F1, columns_name_F3, columns_name_F4]

background_color = "#f6f5f5"

plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(7, 2), facecolor='#f6f5f5')
gs = fig.add_gridspec(1, 3)
gs.update(wspace=0.3, hspace=0.3)

# create a 1x3 grid of axes and keep them in a list
axes = []
for col in range(0, 3):
    ax = fig.add_subplot(gs[0, col])
    ax.set_facecolor(background_color)
    for s in ["top","right"]:
        ax.spines[s].set_visible(False)
    axes.append(ax)

# one bar plot per continuous F category, labelled with the category it shows
for ax, cols, label in zip(axes, f_columns, ['F_1', 'F_3', 'F_4']):
    temp_df = missing_values(data, cols)
    sns.barplot(ax=ax, x=temp_df['Number'], y=temp_df['Count']/len(data) * 100, zorder=2, linewidth=0, alpha=1, saturation=1)
    ax.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
    ax.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
    ax.set_ylabel('')
    ax.set_xlabel(label, fontsize=4, fontweight='bold')
    ax.xaxis.offsetText.set_fontsize(4)
    ax.tick_params(labelsize=4, width=0.5, length=1.5)
    ax.yaxis.set_major_formatter(ticker.PercentFormatter())

plt.show()

<a id="6"></a>
# 6 Deep into 1 Missing Value
Imagine that, for each continuous column (F_1_0, F_3_0, etc.), the missing entries in rows that have exactly 1 missing value are the `target value` of a `test` dataset. We can predict those target values by creating a `train` and `test` dataset for each column and using a machine learning algorithm to make the predictions.

**Observations:**
* The `train` datasets for the `F_1`, `F_3` and `F_4` categories will be quite large.
* For `F_1` and `F_4` there will be more than `252,000` rows, of which around `18,000` can be treated as a test dataset.
* For `F_3` there will be more than `432,000` rows, of which around `18,000` can be treated as a test dataset.
* We expect `55` (15 + 15 + 25) models if we make an individual prediction for each continuous column.
* We can also add the `F_2` category as extra features in the `train` and `test` datasets, as there are no missing values in `F_2`.
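
A minimal sketch of the per-column idea on synthetic data follows. Several parts are assumptions for illustration only: the toy column names, the hypothetical helper `split_for_target`, the choice to train on rows with nothing missing in the category, and ordinary least squares via NumPy as a stand-in for a gradient boosting model such as LightGBM.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic category of three columns; F_1_2 is a noisy linear mix of the others.
n = 200
toy = pd.DataFrame({"F_1_0": rng.normal(size=n), "F_1_1": rng.normal(size=n)})
toy["F_1_2"] = 2.0 * toy["F_1_0"] - toy["F_1_1"] + rng.normal(scale=0.01, size=n)
truth = toy["F_1_2"].copy()
toy.loc[:19, "F_1_2"] = np.nan  # pretend the first 20 values are missing


def split_for_target(df, group_cols, target):
    """Train rows: nothing missing in the group. Test rows: only `target` missing."""
    n_missing = df[group_cols].isnull().sum(axis=1)
    train = df[n_missing == 0]
    test = df[(n_missing == 1) & df[target].isnull()]
    return train, test


group_cols = ["F_1_0", "F_1_1", "F_1_2"]
target = "F_1_2"
features = [c for c in group_cols if c != target]
train, test = split_for_target(toy, group_cols, target)

# Stand-in model: ordinary least squares with an intercept column.
X_train = np.column_stack([train[features].to_numpy(), np.ones(len(train))])
coef, *_ = np.linalg.lstsq(X_train, train[target].to_numpy(), rcond=None)
X_test = np.column_stack([test[features].to_numpy(), np.ones(len(test))])
pred = X_test @ coef

# Write the predictions back into the missing cells and score against the truth.
filled = toy.copy()
filled.loc[test.index, target] = pred
errors = filled.loc[test.index, target] - truth.loc[test.index]
rmse = float(np.sqrt(np.mean(errors ** 2)))
print(len(train), len(test), rmse)
```

On the real data, the same split would be repeated once per continuous column, and the `F_2` columns could be appended to `features`.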

Please check the great notebook [Top 3% solution: LGBM+Mean](https://www.kaggle.com/code/abdulravoofshaik/top-3-solution-lgbm-mean/notebook) by [ARShaik](https://www.kaggle.com/abdulravoofshaik), which applies this methodology.

<a id="6.1"></a>
## 6.1 F_1 Category

In [None]:
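# rows with exactly one missing value in the F_1 columns
new_dataset = data[columns_name_F1].copy()  # copy so the new column is not added to a view of data
new_dataset['missing'] = data[columns_name_F1].isnull().sum(axis=1)
new_dataset = new_dataset[new_dataset['missing'] == 1]
# number of missing values per F_1 column (counted over all rows)
summary_dataset = pd.DataFrame(data[columns_name_F1].isnull().sum(axis=0))
summary_dataset = summary_dataset.reset_index(drop=False)
summary_dataset.columns = ["Column", "Count"]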

background_color = "#f6f5f5"

plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10, 2), facecolor='#f6f5f5')
gs = fig.add_gridspec(1, 1)
gs.update(wspace=0.3, hspace=0.3)

ax0 = fig.add_subplot(gs[0, 0])
ax0.set_facecolor(background_color)

sns.barplot(ax=ax0,x=summary_dataset['Column'], y=summary_dataset['Count'], zorder=2, linewidth=0, alpha=1, saturation=1)
ax0.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
ax0.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
ax0.set_ylabel('')
ax0.set_xlabel("Column", fontsize=4, fontweight='bold')
ax0.tick_params(labelsize=4, width=0.5, length=1.5)
for s in ["top","right"]:
    ax0.spines[s].set_visible(False)

<a id="6.2"></a>
## 6.2 F_3 Category

In [None]:
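# rows with exactly one missing value in the F_3 columns
new_dataset = data[columns_name_F3].copy()  # copy so the new column is not added to a view of data
new_dataset['missing'] = data[columns_name_F3].isnull().sum(axis=1)
new_dataset = new_dataset[new_dataset['missing'] == 1]
# number of missing values per F_3 column (counted over all rows)
summary_dataset = pd.DataFrame(data[columns_name_F3].isnull().sum(axis=0))
summary_dataset = summary_dataset.reset_index(drop=False)
summary_dataset.columns = ["Column", "Count"]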

background_color = "#f6f5f5"

plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10, 2), facecolor='#f6f5f5')
gs = fig.add_gridspec(1, 1)
gs.update(wspace=0.3, hspace=0.3)

ax0 = fig.add_subplot(gs[0, 0])
ax0.set_facecolor(background_color)

sns.barplot(ax=ax0,x=summary_dataset['Column'], y=summary_dataset['Count'], zorder=2, linewidth=0, alpha=1, saturation=1)
ax0.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
ax0.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
ax0.set_ylabel('')
ax0.set_xlabel("Column", fontsize=4, fontweight='bold')
ax0.tick_params(labelsize=4, width=0.5, length=1.5)
for s in ["top","right"]:
    ax0.spines[s].set_visible(False)

<a id="6.3"></a>
## 6.3 F_4 Category

In [None]:
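# rows with exactly one missing value in the F_4 columns
new_dataset = data[columns_name_F4].copy()  # copy so the new column is not added to a view of data
new_dataset['missing'] = data[columns_name_F4].isnull().sum(axis=1)
new_dataset = new_dataset[new_dataset['missing'] == 1]
# number of missing values per F_4 column (counted over all rows)
summary_dataset = pd.DataFrame(data[columns_name_F4].isnull().sum(axis=0))
summary_dataset = summary_dataset.reset_index(drop=False)
summary_dataset.columns = ["Column", "Count"]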

background_color = "#f6f5f5"

plt.rcParams['figure.dpi'] = 600
fig = plt.figure(figsize=(10, 2), facecolor='#f6f5f5')
gs = fig.add_gridspec(1, 1)
gs.update(wspace=0.3, hspace=0.3)

ax0 = fig.add_subplot(gs[0, 0])
ax0.set_facecolor(background_color)

sns.barplot(ax=ax0,x=summary_dataset['Column'], y=summary_dataset['Count'], zorder=2, linewidth=0, alpha=1, saturation=1)
ax0.grid(which='major', axis='x', zorder=0, color='#EEEEEE', linewidth=0.4)
ax0.grid(which='major', axis='y', zorder=0, color='#EEEEEE', linewidth=0.4)
ax0.set_ylabel('')
ax0.set_xlabel("Column", fontsize=4, fontweight='bold')
ax0.tick_params(labelsize=4, width=0.5, length=1.5)
for s in ["top","right"]:
    ax0.spines[s].set_visible(False)

Thank you for reading. This is a beginner's work, and any critiques are very welcome. If you find any errors in the notebook, please let me know.