<font color="lightseagreen" size=+3.5><b> JUNE 2022 TPS: Missing Value Imputation Challenge</b></font>

---
---
<a id="1"></a>
<font color="lightseagreen" size=+2.0><b>1. Introduction</b></font>

In this month's Tabular Playground Series we are tasked with a data imputation challenge. According to the hosts, the dataset has similarities to the May 2022 Tabular Playground, except that there are no targets; instead, there are missing data values in the dataset, and our task is to predict what these values should be.

**Evaluation Metric** : Submissions are scored on the root mean squared error (RMSE).
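
For reference, here is a minimal sketch of the metric. The helper name `rmse` and the toy arrays are illustrative assumptions, not competition data; the sketch only assumes that the score compares the imputed values with the hidden true values cell by cell.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally shaped arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy example: hidden true values vs. imputed values (made-up numbers)
truth = [1.0, 2.0, 3.0]
imputed = [1.0, 2.0, 5.0]
score = rmse(truth, imputed)
print(score)
```

Because the errors are squared before averaging, a single badly imputed cell weighs more than several slightly wrong ones.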

---

<font color="lightseagreen" size=+2.0><b> Table of Contents</b></font>
    
* [1. Introduction](#1)
* [2. Data Overview](#2)
* [3. Missing Values](#3)
* [4. Imputing Missing Values](#4)
* [5. Reference](#5)

---
---

In [None]:
import os
import numpy as np
import pandas as pd 

from tqdm.notebook import tqdm
import lightgbm
import xgboost
import catboost

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer


import matplotlib.pyplot as plt
import seaborn as sns
import plotly.figure_factory as ff
import plotly.graph_objects as go
from matplotlib.ticker import FormatStrFormatter

import warnings
warnings.filterwarnings('ignore')

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
sub =  pd.read_csv('../input/tabular-playground-series-jun-2022/sample_submission.csv', index_col='row-col')
df = pd.read_csv('../input/tabular-playground-series-jun-2022/data.csv', index_col='row_id')

<a id="2"></a>
<font color="lightseagreen" size=+2.5><b>2. Data Overview</b></font>

- The dataset has `1 million rows` and `80 columns` excluding the `row_id` column.
- All columns have numerical values. Features starting with `F_2` are of **int type**. The rest are of **float type**.
- The max (31.23) and min (-26.28) values of the dataset occur in the same column, `F_4_11`.


In [None]:
df.info()

In [None]:
display(df.shape)
display(df.head())

In [None]:
df.describe().T.sort_values(by='mean' , ascending = False)\
.style.background_gradient(cmap='Greys')\
.bar(subset=["mean",], color='#6495ED')\
.bar(subset=["max"], color='#ff355d')

<a id="3"></a>
<font color="lightseagreen" size=+2.5><b>3. Missing Values</b></font>

- This competition is about filling in the missing values, but not all columns have missing values.
- Features starting with `F_1`, `F_3` and `F_4` do have missing values, but features starting with `F_2` (a total of 25 features) do not.
- In the columns that do have missing values, only around 1.8% of the values are missing.
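
The per-group figures above can be checked with a few lines of pandas. This is a minimal sketch on a toy frame: `toy` is a made-up stand-in for `df`, and the only assumption is that columns follow the `F_<group>_<index>` naming of the competition data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `df` (assumption: same F_<group>_<index> column naming)
toy = pd.DataFrame({
    "F_1_0": [0.1, np.nan, 0.3, 0.4],
    "F_2_0": [1, 2, 3, 4],
    "F_3_0": [np.nan, np.nan, 1.0, 2.0],
    "F_4_0": [5.0, 6.0, 7.0, 8.0],
})

# Share of missing cells per column, in percent
na_pct = toy.isna().mean().mul(100)

# Average the per-column percentages within each feature group (F_1, F_2, ...);
# the function is applied to each column name to derive its group label
na_pct_by_group = na_pct.groupby(lambda col: col.rsplit("_", 1)[0]).mean()
print(na_pct_by_group)
```

Replacing `toy` with `df` should give one percentage per feature group.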


In [None]:
def null_value_df(data):    
    null_values_df = []    
    for col in data.columns[:]:
        pct_na = np.round((100 * (data[col].isna().sum())/len(data)), 2)
        avg = data[col].mean()
        max_ = data[col].max()
        min_ = data[col].min()

        dict1 ={
            'Features' : col,
            'NA (count)': data[col].isna().sum(),
            'NA (%)': (pct_na),
            'avg' : avg,
            'min' : min_,
            'max' : max_
        }
        null_values_df.append(dict1)
    return pd.DataFrame(null_values_df, index=None)

DF1 = null_value_df(df)

fig = go.Figure(data=[go.Scatter(x=DF1['Features'],
                             y=DF1["NA (%)"],                              
                             name='train', mode='markers', marker_color='lightseagreen'),
                      ])
fig.add_vrect(
    x0=15, x1=39,
    annotation_text="No Missing Values Area", annotation_position="top",
    fillcolor="lightgray", opacity=0.5,
    layer="below", line_width=0,
)
fig.update_layout(title_text='<b> Percentage of missing values </b>',
                  font_family="San Serif",
                  template='simple_white',
                  width=850, height=400,
                  xaxis_title='Features', 
                  yaxis_title='Missing Values (%)',
                  title_font={'color':'black', 'size': 24, 'family': 'San-Serif'})
fig.update_yaxes(showgrid=False, showline=False, showticklabels=True)
fig.update_xaxes(showgrid=False, showline=True, showticklabels=True)
fig.show()

fig = go.Figure(data=[go.Scatter(x=DF1['Features'],
                             y=DF1["avg"],                              
                             name='Avg', mode='markers', marker_color='lightseagreen'),
                      go.Scatter(x=DF1['Features'],
                             y=DF1["min"],                              
                             name='Min', mode='markers', marker_color='salmon'),
                      go.Scatter(x=DF1['Features'],
                             y=DF1["max"],                              
                             name='Max', mode='markers', marker_color='gold'),
                      
                      ])
fig.add_vrect(
    x0=15, x1=39,
    annotation_text="No Missing Values Area", annotation_position="top",
    fillcolor="lightgray", opacity=0.5,
    layer="below", line_width=0,
)
fig.add_vrect(
    x0=65, x1=80,
    annotation_text="High variance region", annotation_position="top",
    fillcolor="lightgray", opacity=0.1,
    layer="below", line_width=0,
)

fig.update_layout(title_text='<b> Column-wise Avg/Max/Min </b>',
                  font_family="San Serif",
                  template='simple_white',
                  showlegend =True,
                  width=850, height=400,
                  xaxis_title='Features', 
                  yaxis_title='Column Stats (mean, min, max)',
                  title_font={'color':'black', 'size': 24, 'family': 'San-Serif'})
fig.update_yaxes(showgrid=False, showline=False, showticklabels=True)
fig.update_xaxes(showgrid=False, showline=True, showticklabels=True)
fig.show()


<font color="lightseagreen" size=+1.5><b>Observations</b></font>

- We see a kind of symmetry between the max and min values of the columns with missing values, which we do not see in the columns without missing values. The min value of every column without missing values is zero, whereas the other group (columns with missing values) contains negative values. An imputation strategy that accounts for this symmetry could be helpful, if anyone can find one.
- We also observe that the variance of the features starting with `F_4_x` is higher than that of the other missing-value features, `F_1_x` and `F_3_x`.
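
The symmetry and spread noted above can also be put into numbers. The sketch below runs on synthetic data: the toy columns and the `asymmetry` score are illustrative assumptions of mine, not part of the competition data or of any library.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy stand-in for `df`: two zero-centred float columns with different spread
# and one non-negative integer column (all synthetic)
toy = pd.DataFrame({
    "F_1_0": rng.normal(0.0, 1.0, 10_000),
    "F_4_0": rng.normal(0.0, 5.0, 10_000),
    "F_2_0": rng.poisson(2.0, 10_000),
})

col_min, col_max, col_std = toy.min(), toy.max(), toy.std()

# Asymmetry score: 0 when min and max mirror each other around zero,
# 1 when the minimum is zero and the maximum is positive
asymmetry = (col_max + col_min).abs() / (col_max - col_min)

summary = pd.DataFrame(
    {"min": col_min, "max": col_max, "std": col_std, "asymmetry": asymmetry}
)
print(summary)
```

Applied to `df`, a table like `summary` would let the two observations be read off per column instead of from the scatter plot.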


In [None]:
def density_plot(df, title):     
       
    fig, ax = plt.subplots(8, 10, figsize=(24, 16), sharey=False, facecolor='#dddddd')
    
    fig.subplots_adjust(top=0.90)
    i = 1
    for feature in df.columns:
        plt.subplot(8, 10, i)
        ax = sns.kdeplot(df[feature], fill=True,  color='#6495ED',  alpha=0.85, label='train')
        ax.yaxis.set_major_formatter(FormatStrFormatter('%.0f'))
        ax.xaxis.set_label_position('top')
        
        if feature.startswith('F_2') :           
            ax.set_facecolor('lightsalmon')
            
        ax.set_ylabel('')
        ax.set_yticks([])        
        ax.set_xticks([])
        i += 1

    plt.suptitle(title, fontsize=20)
    plt.show()

In [None]:
density_plot(df.sample(frac=0.05), title='Density Plot: All Features')

<font color="lightseagreen" size=+1.5><b>Observations</b></font>

- The distribution of values in columns with missing values is close to normal, whereas the columns with no missing values have right-skewed distributions. This could again be important in choosing an imputation strategy.
- The spiky distributions of the `F_2_x` features suggest that these features might be categorical. Let's check the number of unique values in each of those columns.
- Each `F_2_x` feature has between 11 and 18 unique values, which supports treating these as categorical features.

In [None]:
def df_unique(df):
    df_unique = []
    for feature in df.columns:
        if feature.startswith('F_2') : 

            dict1 ={
                'Features' : feature,
                'Unique': df[feature].nunique(),            
            }
            df_unique.append(dict1)
    return pd.DataFrame(df_unique, index=None)

DF1 = df_unique(df)

fig = go.Figure(data=[go.Bar(x=DF1['Features'],
                             y=DF1['Unique'],                              
                             name='train', marker_color='lightseagreen'),
                      ])
fig.update_layout(title_text='<b> Unique values of Categorical Features (F_2_*)</b>',
                  font_family="San Serif",
                  template='simple_white',
                  width=850, height=400,
                  xaxis_title='Features', 
                  yaxis_title='Number of unique values',
                  title_font={'color':'black', 'size': 24, 'family': 'San-Serif'})#.update_xaxes(categoryorder='total descending')

fig.update_yaxes(showgrid=False, showline=False, showticklabels=True)
fig.update_xaxes(showgrid=False, showline=True, showticklabels=True)
fig.show()

In [None]:
fig, ax = plt.subplots(5, 5, figsize=(24, 16), sharey=False, facecolor='#dddddd')    
fig.subplots_adjust(top=0.93)
i = 1
for feature in df.columns:
    if feature.startswith('F_2'):
        plt.subplot(5, 5, i)
        ax = sns.countplot(data=df.sample(frac=0.01), x=feature, color='#42ddd4')
        ax.set_facecolor('white')
        i += 1
plt.suptitle('Categorical Features Count Plot (F_2_x)', fontsize=24)
plt.show()

<font color="lightseagreen" size=+1.5><b>Creating New Features (stats)</b></font>

Let's create additional features from simple row-wise stats: the average, the max/min ratio, and the NA count.

In [None]:
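# Compute the row-wise stats from the original feature columns only,
# so that an engineered column never feeds into a later one
feature_cols = df.columns.tolist()
df['average'] = df[feature_cols].mean(axis=1)
df['abs_min_max_ratio'] = df[feature_cols].max(axis=1) / np.abs(df[feature_cols].min(axis=1))
df['na_count'] = df[feature_cols].isnull().sum(axis=1)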

<font color="lightseagreen" size=+1.5><b>Creating New Features (from F_2_x features)</b></font>

Here I will try creating new features by combining `F_2_x` features that have a similar number of unique categories. This may not make any sense at all, as my basis for selecting interactions is the shape of the distributions; nevertheless, I will do so and get feedback from the LB score.

In [None]:
# additional features tried in version 4 and 5
df['f1318']= df['F_2_1'] + df['F_2_3'] - df['F_2_18'] #feats with 15 cats
df['f5617']= df['F_2_5'] + df['F_2_6'] - df['F_2_17'] #features with 13 cats
df['f1024']= df['F_2_10'] - df['F_2_24'] #features with 18 cats
df['f47']= df['F_2_4'] - df['F_2_7'] #features with 17 cats
df['f2923']= df['F_2_2'] + df['F_2_9'] - df['F_2_23'] #features with 12 and 11 cats
df['f141516']= df['F_2_14'] + df['F_2_15'] - df['F_2_16'] #features with 14 cats
df['f81119']= df['F_2_8'] + df['F_2_11'] - df['F_2_19'] #features with 14 cats
df['f012132122']= df['F_2_0'] + df['F_2_12'] + df['F_2_13'] - df['F_2_21'] - df['F_2_22']#features with 16 cats

<!-- feats_2 = [feat for feat in df.columns if feat.startswith('f') | feat.startswith('F_2')]
DF_ = df[feats_2]

corr = DF_.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
f, ax = plt.subplots(figsize=(16, 12))
#cmap = sns.diverging_palette(230, 0, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap='coolwarm', vmax=1.0, vmin=-.1, center=0, annot=False,
            square=True, linewidths=.5, cbar_kws={"shrink": 0.75}); -->

In [None]:
fig, ax = plt.subplots(3, 3, figsize=(18, 10), sharey=False, facecolor='#dddddd')    
fig.subplots_adjust(top=0.90)
i = 1
for feature in df.columns:
    if feature.startswith('f'):
        plt.subplot(3, 3, i)
        ax = sns.countplot(data=df.sample(frac=0.1), x=feature, color='#42ddd4')
        ax.set_facecolor('white')
        
        ax.yaxis.set_major_formatter(FormatStrFormatter('%.0f'))
        ax.xaxis.set_label_position('top')
        
        ax.set_ylabel('')
        ax.set_yticks([])        
        ax.set_xticks([])
        i += 1
plt.suptitle('Newly Created Cat. Features (from F_2_x features)', fontsize=20)
plt.show()

<a id="4"></a>
<font color="lightseagreen" size=+2.5><b>4. Imputing Missing Values</b></font>

The challenge in this competition is finding the right imputation technique for the missing values. There are several out-of-the-box imputation algorithms available: `SimpleImputer`, `IterativeImputer`, `KNNImputer`, and a few others. We will see which one works better.

"For various reasons, many real world datasets contain missing values, often encoded as blanks, NaNs or other placeholders. Such datasets however are incompatible with scikit-learn estimators which assume that all values in an array are numerical, and that all have and hold meaning. A basic strategy to use incomplete datasets is to discard entire rows and/or columns containing missing values. However, this comes at the price of losing data which may be valuable (even though incomplete). A better strategy is to impute the missing values, i.e., to infer them from the known part of the data. 

"One type of imputation algorithm is `univariate`, which imputes values in the `i-th` feature dimension using only non-missing values in that feature dimension (e.g. `impute.SimpleImputer`). By contrast, `multivariate` imputation algorithms use the `entire set of available feature` dimensions to estimate the missing values (e.g. `impute.IterativeImputer`)" [[Scikit-learn](https://scikit-learn.org/stable/modules/impute.html#iterative-imputer)]
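
To make the univariate vs. multivariate distinction concrete, here is a small sketch on synthetic data. The toy matrix, the roughly 2% masking rate and the helper `masked_rmse` are my own assumptions for illustration. The columns are correlated by construction, so the gap between the two imputers will not carry over to the competition data as-is.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

rng = np.random.default_rng(0)

# Toy data (assumption): three columns driven by one shared signal
n = 2_000
base = rng.normal(size=n)
X_full = np.column_stack([
    base,
    2.0 * base + rng.normal(scale=0.1, size=n),
    -base + rng.normal(scale=0.1, size=n),
])

# Hide about 2% of the cells at random and remember where they were
mask = rng.random(X_full.shape) < 0.02
X_missing = X_full.copy()
X_missing[mask] = np.nan

def masked_rmse(imputer):
    """Impute X_missing and score only the cells that were hidden."""
    X_imputed = imputer.fit_transform(X_missing)
    return float(np.sqrt(np.mean((X_imputed[mask] - X_full[mask]) ** 2)))

# Univariate: each column is filled with its own mean
rmse_simple = masked_rmse(SimpleImputer(strategy="mean"))
# Multivariate: each column is regressed on the other columns
rmse_iterative = masked_rmse(IterativeImputer(max_iter=10, random_state=0))
print(rmse_simple, rmse_iterative)
```

The same masking idea can serve as a rough local check before submitting: hide some known values of `df`, impute them, and compare against the originals.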


<font color="lightseagreen" size=+1.5 ><b>Iterative Imputation</b></font>

In [None]:
from tqdm.notebook import tqdm

import lightgbm
import xgboost
import catboost

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

SEED = 622

In [None]:
# code from: https://www.kaggle.com/code/hiro5299834/tps-jun-2022-iterativeimputer-baseline

xgb = xgboost.XGBRegressor(
        n_estimators=500,
        random_state=SEED,
        tree_method='gpu_hist',
    )
imp = IterativeImputer(
    estimator=xgb,
    missing_values=np.nan,
    max_iter=7,
    initial_strategy='mean',
    imputation_order='ascending',
    verbose=2,
    random_state=SEED
)

df[:] = imp.fit_transform(df)

In [None]:
for i in tqdm(sub.index):
    row = int(i.split('-')[0])
    col = i.split('-')[1]
    sub.loc[i, 'value'] = df.loc[row, col]

sub.to_csv("submission_xgb_03.csv")
sub
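
The loop above performs one `.loc` lookup per submission row, which can be slow for a large submission file. A vectorized alternative is sketched below on toy stand-ins: `data` plays the role of the imputed `df` and `toy_sub` the role of `sub`. Both are made-up frames that only assume the `row-col` key format used above.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (assumption): `data` for the imputed `df`,
# `toy_sub` for `sub` with keys such as '0-F_1_0'
data = pd.DataFrame(
    {"F_1_0": [0.1, 0.2, 0.3], "F_3_1": [1.0, 2.0, 3.0]},
    index=pd.Index([0, 1, 2], name="row_id"),
)
toy_sub = pd.DataFrame(
    {"value": 0.0},
    index=pd.Index(["0-F_1_0", "2-F_3_1", "1-F_1_0"], name="row-col"),
)

# Split each 'row-col' key once into its row id and column name
keys = [key.split("-", 1) for key in toy_sub.index]
rows = [int(row) for row, _ in keys]
cols = [col for _, col in keys]

# Translate labels to integer positions (-1 would mean "label not found")
row_pos = data.index.get_indexer(rows)
col_pos = data.columns.get_indexer(cols)
assert (row_pos >= 0).all() and (col_pos >= 0).all()

# Read all requested cells in a single NumPy indexing step
toy_sub["value"] = data.to_numpy()[row_pos, col_pos]
print(toy_sub)
```

This assumes every column of the frame is numeric, so that `to_numpy()` returns a single float array.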

**Score history**:
- Original features, 7 iterations (LB = 0.92441)
- Original + additional features, 7 iterations (LB = 0.92401)
- Original + additional features, 10 iterations (LB = 0.92552)

<a id="5"></a>
<font color="lightseagreen" size=+2.5><b>5. Reference</b></font>

1. https://www.kaggle.com/competitions/tabular-playground-series-jun-2022/overview/evaluation

2. https://scikit-learn.org/stable/modules/impute.html#iterative-imputer

3. https://www.kaggle.com/code/hiro5299834/tps-jun-2022-iterativeimputer-baseline