<div style="padding:20px;color:white;margin:0;font-size:200%;text-align:center;display:fill;border-radius:5px;background-color:#38A6A5;overflow:hidden;font-weight:500">TPS June 2022</div>

# <b><span style='color:#444444'>1 |</span><span style='color:#38A6A5'> Competition Overview</span></b>

The June edition of the 2022 Tabular Playground series is all about data imputation. The dataset has similarities to the May 2022 Tabular Playground, except that there are no targets. Rather, there are missing data values in the dataset, and your task is to predict what these values should be.

In [None]:
# Reference : https://www.kaggle.com/code/parulpandey/a-guide-to-handling-missing-values-in-python/notebook
# Credits : PARUL PANDEY

In [None]:
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None) #Shows all columns

In [None]:
df = pd.read_csv("../input/tabular-playground-series-jun-2022/data.csv")
df.head()

In [None]:
df_describe = df.describe()
display(df_describe.style.format('{:,.3f}')
        .background_gradient(subset=(df_describe.index[1:],df_describe.columns[:]), cmap='GnBu'))

In [None]:
null_columns = dict(df.isnull().sum())
null_columns

# Handling Missing Values

<img src="https://i.imgur.com/68u0dD2.png"/>

# Deletion

#### Pairwise Deletions

Pairwise deletion is used when values are missing completely at random (MCAR). Only the missing values themselves are left out, so each calculation uses every value that is available for it. pandas operations such as mean and sum skip missing values by default.

In [None]:
df_1 = df.copy()
print(df_1['F_1_10'].mean())
print(df_1['F_3_5'].mean())
print(df_1['F_4_5'].mean())
print(df_1['F_4_4'].mean())
print(df_1['F_1_12'].mean())
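As a small illustration of the default skipping behaviour, here is a sketch on a made-up DataFrame (the `toy` frame and its column names are assumptions for illustration, not part of the competition data). It compares the default mean with `skipna=False` and counts the values each column actually contributes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the competition dataset
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [10.0, np.nan, 30.0, 40.0],
})

# Default: NaN is skipped, so the mean uses only 1, 2 and 4
mean_a = toy["a"].mean()

# With skipna=False a single NaN makes the result NaN
mean_a_strict = toy["a"].mean(skipna=False)

# Number of non-missing values available in each column
count_per_column = toy.count()

print(mean_a, mean_a_strict)
print(count_per_column)
```

Each column keeps all of its own observed values, which is why different statistics can end up being based on different subsets of rows.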

#### Listwise Deletion

In statistics, listwise deletion is a method for handling missing data. In this method, an entire record is excluded from analysis if any single value is missing.

In [None]:
print(df_1.F_3_13.isnull().sum())
df_1.dropna(subset=['F_3_13'],inplace=True)
print(df_1.F_3_13.isnull().sum())
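The cell above drops rows based on one column only. A minimal sketch of full listwise deletion, on a hypothetical toy frame (names are assumptions for illustration), would call `dropna()` without `subset` so that a row is removed if any of its values is missing:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the competition dataset
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [10.0, 20.0, np.nan, 40.0],
    "c": [5.0, 6.0, 7.0, 8.0],
})

# Listwise deletion: keep only rows with no missing value in any column
complete_rows = toy.dropna()

# For comparison: drop rows only when column "a" is missing
only_a_checked = toy.dropna(subset=["a"])

print(len(toy), len(complete_rows), len(only_a_checked))
```

With many columns that each have a few gaps, the fully listwise version can discard a large share of the rows, so it is worth checking the row count before and after.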

#### Dropping complete columns

If a column contains a lot of missing values, say more than 80%, and the feature is not significant, you might want to delete that feature. However, deleting data is generally not a good practice, because it discards information.
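One possible way to apply the 80% rule of thumb is sketched below. The toy frame, its column names and the `threshold` variable are assumptions for illustration; the cut-off should be chosen per problem.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the competition dataset
toy = pd.DataFrame({
    "mostly_missing": [1.0] + [np.nan] * 9,      # 90% missing
    "some_missing": [np.nan, 2.0] + [3.0] * 8,   # 10% missing
    "complete": np.arange(10, dtype=float),      # 0% missing
})

# Fraction of missing values per column
missing_fraction = toy.isnull().mean()

# Assumed cut-off: drop columns with more than 80% missing values
threshold = 0.8
columns_to_drop = missing_fraction[missing_fraction > threshold].index.tolist()

reduced = toy.drop(columns=columns_to_drop)
print(columns_to_drop)
print(reduced.columns.tolist())
```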

# Imputations

In [None]:
from sklearn.impute import SimpleImputer

In [None]:
%%time

df_2 = df.copy()
const = SimpleImputer(strategy='constant') # imputing using a constant value (fill_value defaults to 0 for numeric data)
df_2.iloc[:,:] = const.fit_transform(df_2)
df_2.isnull().sum()

In [None]:
%%time

df_3 = df.copy()
# Missing values are imputed with the most frequently occurring value in each column
const = SimpleImputer(strategy='most_frequent') # imputing using most_frequent
df_3.iloc[:,:] = const.fit_transform(df_3)
df_3.isnull().sum()

In [None]:
%%time

df_4 = df.copy()
const = SimpleImputer(strategy='mean') # imputing using mean
df_4.iloc[:,:] = const.fit_transform(df_4)
df_4.isnull().sum()

In [None]:
%%time

df_5 = df.copy()
const = SimpleImputer(strategy='median') # imputing using median
df_5.iloc[:,:] = const.fit_transform(df_5)
df_5.isnull().sum()
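To make the difference between the four strategies concrete, here is a small sketch on a single made-up column (the `toy` frame and the `filled` dictionary are assumptions for illustration). It records the value each strategy writes into the one missing cell.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy column with one missing value at position 3
toy = pd.DataFrame({"x": [1.0, 2.0, 2.0, np.nan, 10.0]})

filled = {}
for strategy in ["constant", "most_frequent", "mean", "median"]:
    imputer = SimpleImputer(strategy=strategy)
    result = imputer.fit_transform(toy)
    # Keep the value that replaced the missing cell
    filled[strategy] = float(result[3, 0])

print(filled)
```

The mean is pulled towards the outlier 10.0, while the median and the most frequent value are not, which is one reason the choice of strategy matters for skewed features.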

# Imputation Techniques for Time Series Problems

Basic Imputation Techniques
1. 'ffill' or 'pad' - Replace NULL values with the last valid value from a previous row
2. 'bfill' or 'backfill' - Replace NULL values with the next valid value from a following row
3. Linear interpolation method

<img src="https://miro.medium.com/max/1348/0*EigRTqT8ybMMUZZe"/>
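A minimal sketch of forward and backward filling on a made-up Series (the values are assumptions for illustration). It also shows a limitation: a forward fill cannot fill a gap at the very start, and a backward fill cannot fill a gap at the very end.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series with gaps at the start, in the middle and at the end
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

# Forward fill: each gap takes the last valid value before it
forward = s.ffill()

# Backward fill: each gap takes the next valid value after it
backward = s.bfill()

print(forward.tolist())   # the leading NaN has nothing before it and stays NaN
print(backward.tolist())  # the trailing NaN has nothing after it and stays NaN
```

Chaining the two, for example `s.ffill().bfill()`, is one common way to cover both edges.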

#### 1.ffill

In [None]:
df_ffill = df.copy()
df_ffill.ffill(inplace=True)  # fillna(method='ffill') is deprecated in recent pandas versions
df_ffill.isnull().sum()

In [None]:
# For imputing only specific columns
df_ffill[['F_1_2', 'F_4_14']] = df_ffill[['F_1_2', 'F_4_14']].ffill()

#### 2.bfill

In [None]:
df_bfill = df.copy()
df_bfill.bfill(inplace=True)  # fillna(method='bfill') is deprecated in recent pandas versions
df_bfill.isnull().sum()

In [None]:
# For imputing only specific columns
df_bfill[['F_1_2', 'F_4_14']] = df_bfill[['F_1_2', 'F_4_14']].bfill()

#### 3.Linear Interpolation method

Linear interpolation is an imputation technique that assumes a linear relationship between data points and utilises non-missing values from adjacent data points to compute a value for a missing data point.
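As a worked example, the sketch below interpolates a made-up Series (the values are assumptions for illustration). With the default `method="linear"`, pandas treats the rows as equally spaced, so the two missing values between 1.0 and 4.0 land on the straight line joining them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series: a leading gap and a two-value gap between 1.0 and 4.0
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])

# Default method is "linear"; limit_direction="both" also fills gaps at the edges
linear = s.interpolate(limit_direction="both")

# The interior gap is filled in equal steps: 1.0 -> 2.0 -> 3.0 -> 4.0
print(linear.tolist())
```

Because the index is ignored by the linear method, this only makes sense when neighbouring rows are genuinely related, as in an evenly sampled time series.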

In [None]:
interpolate = df.copy()
interpolate.isnull().sum()

In [None]:
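# Assign the result back: calling interpolate(inplace=True) on a selected column may not update the DataFrame
interpolate['F_1_0'] = interpolate['F_1_0'].interpolate(limit_direction="both")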
interpolate.isnull().sum()

## Check out this wonderful [Notebook](https://www.kaggle.com/code/parulpandey/a-guide-to-handling-missing-values-in-python/notebook) for Advanced Imputation Techniques