# Tabular Playground Series- June 2022 Part 1

Objective:- 
This notebook is an entry for the TPS June 2022 edition. The aim is to impute the missing values in the provided dataset using descriptive statistics and the other column entries in the table.

**In this part, we perform EDA and impute the missing feature values for groups 1 and 3. Features in group 4 are imputed in the next part, using models (with accelerators).**

In [None]:
!pip install swifter
from IPython.display import clear_output;
clear_output();

In [None]:
# Performing general package imports:-
import numpy as np;
import pandas as pd;
import swifter;
from termcolor import colored;
import matplotlib.pyplot as plt;
%matplotlib inline
import seaborn as sns;
from scipy.stats import skew, iqr;

from warnings import filterwarnings;
filterwarnings(action= 'ignore');
from gc import collect;
from tqdm.notebook import tqdm;

np.random.seed(10);

In [None]:
# Performing model specific imports:-
# Using the sklearnex.patch_sklearn to improve the speed of sklearn library:-
from sklearnex import patch_sklearn; patch_sklearn()
from sklearn.model_selection import KFold;
from sklearn.metrics import mean_squared_error;

## 1. Training data import

In [None]:
xytrain = pd.read_csv('../input/tabular-playground-series-jun-2022/data.csv', 
                      encoding = 'utf8', index_col= 'row_id');
sub_fl = pd.read_csv('../input/tabular-playground-series-jun-2022/sample_submission.csv', 
                     encoding= 'utf8');

print(colored(f"\nTraining data\n", color = 'blue', attrs= ['bold']));
display(xytrain.head(5).style.format(precision=2));

print(colored(f"\nSample submission data\n", color = 'blue', attrs= ['bold']));
display(sub_fl.head(5).style.format(precision= 2));

In [None]:
# Basic information check:-
print(colored(f"\nTraining data information\n", color = 'blue', attrs= ['bold']));
display(xytrain.info());

print(colored(f"\nSubmission file information\n", color = 'blue', attrs= ['bold']));
display(sub_fl.info());

## 2. Memory reduction

The dataframes needlessly assign float64 to most columns and int64 to all integer columns.
This can be managed by re-assigning datatypes based on min-max column values. Memory usage is thus reduced substantially.
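
A min-max range check only confirms that the values fit in the smaller type. It says nothing about how much precision is lost. The sketch below is a hedged illustration on synthetic values (not the competition data); `downcast_error` is a hypothetical helper that measures the largest round-trip error after casting.

```python
import numpy as np

def downcast_error(values, dtype) -> float:
    """Largest absolute difference between the values and their downcast copy."""
    values = np.asarray(values, dtype=np.float64)
    round_trip = values.astype(dtype).astype(np.float64)
    return float(np.max(np.abs(values - round_trip)))

# Assumption: standard-normal values stand in for a float feature column
rng = np.random.default_rng(10)
sample = rng.normal(loc=0.0, scale=1.0, size=10_000)

err16 = downcast_error(sample, np.float16)
err32 = downcast_error(sample, np.float32)
print(f"max error float16 = {err16:.2e}, float32 = {err32:.2e}")
```

Both types pass the range check for this sample, but float16 keeps only about three significant decimal digits. Whether that is acceptable depends on how the imputed values are scored.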


In [None]:
# Reducing dataframe memory usage :-
def ReduceMemory(df: pd.DataFrame):
    """
    This function reduces the associated dataframe's memory usage.
    It reassigns the data-types of columns according to their min-max values.
    It also displays the dataframe information after memory reduction.
    """;
    
    # Reducing float column memory usage:-
    for col in tqdm(df.iloc[0:2, 1:].select_dtypes('float').columns):
        col_min = np.amin(df[col].dropna());
        col_max = np.amax(df[col].dropna());
        
        if col_min >= np.finfo(np.float16).min and col_max <= np.finfo(np.float16).max: 
            df[col] = df[col].astype(np.float16)
        elif col_min >= np.finfo(np.float32).min and col_max <= np.finfo(np.float32).max : 
            df[col] = df[col].astype(np.float32)
        else: pass;

    # Reducing integer column memory usage:-
    for col in tqdm(df.iloc[0:2, 1:].select_dtypes('int').columns):
        col_min = df[col].min(); 
        col_max = df[col].max();
        
        if col_min >= np.iinfo(np.int8).min and col_max <= np.iinfo(np.int8).max:
            df[col] = df[col].astype(np.int8);
        elif col_min >= np.iinfo(np.int16).min and col_max <= np.iinfo(np.int16).max:
            df[col] = df[col].astype(np.int16);
        elif col_min >= np.iinfo(np.int32).min and col_max <= np.iinfo(np.int32).max:
            df[col] = df[col].astype(np.int32);
        else: pass;
        
    print(colored(f"\nDataframe information after memory reduction\n", 
                  color = 'blue', attrs= ['bold']));
    display(df.info()); 
    
    return df;

In [None]:
# Implementing memory reduction on the existing data-sets:-
xytrain = ReduceMemory(df= xytrain);
sub_fl = ReduceMemory(df= sub_fl);
collect();

**We have reduced the memory usage of the input tables from 651 MB to approximately 140 MB. This is an approximately 78% memory reduction without major information loss.**
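
The reduction figure follows from (651 - 140) / 651, which is roughly 78%. As a hedged sketch of how such a figure can be measured, the snippet below compares `DataFrame.memory_usage` before and after a downcast on a small synthetic frame. The frame, its column names and the `frame_bytes` helper are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(10)
# Synthetic stand-in: 1,000 rows of four float64 columns
toy = pd.DataFrame(rng.normal(size=(1_000, 4)),
                   columns=[f"col_{i}" for i in range(4)])

def frame_bytes(df: pd.DataFrame) -> int:
    """Total bytes used by the column data (index excluded)."""
    return int(df.memory_usage(index=False, deep=True).sum())

before = frame_bytes(toy)
after = frame_bytes(toy.astype(np.float32))
reduction_pct = 100.0 * (before - after) / before
print(f"{before:,} -> {after:,} bytes ({reduction_pct:.0f}% reduction)")
```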

## 3. Exploratory Data Analysis:-

The EDA consists of:- 
1. Data-type and distribution details
2. Null values per column in the data table
3. Correlation plots and inferences
4. Other association metrics (mutual information)- if needed
5. Data visualization as deemed necessary

In [None]:
# Model data description:-
print(colored(f"\nModel data description with null values per column\n", 
              color = 'blue', attrs= ['bold']));

xytrain_desc_sum = \
pd.concat((xytrain.describe(percentiles= np.arange(0.10,1.0,0.10)).transpose(),
           xytrain.isna().sum(axis=0),
           xytrain.nunique(axis=0),
           pd.DataFrame(data= skew(xytrain.dropna(), axis=0),
                        index= xytrain.columns, columns= ['skewness'])), 
                  axis=1).\
rename({0: 'nb_null_values',1: 'nb_unique_values'}, axis=1);

display(xytrain_desc_sum.style.format('{:,.2f}'));

In [None]:
# Grouping the table columns into 4 groups based on the F_{nb} values:-
# Creating input column list:-
xytrain_col = xytrain.columns;

# Populating the feature groups with dictionary comprehension:-
Ftre_Grp_Dict = {i: xytrain_col[xytrain_col.str[0:3] == f"F_{i}"] for i in range(1,5,1)};
print(colored(f"\nModel data feature groups\n", color = 'blue', attrs= ['bold']));
display(Ftre_Grp_Dict);

#### Calculation of nulls per column in the data-set:-

The number of nulls per column could be of use in model development, especially for columns in group 4 where a model is developed.

In [None]:
_ = pd.DataFrame(xytrain.isna().sum(axis=0), columns = ['Nb_Nulls']);
_['Null_Rate'] = _['Nb_Nulls']/ len(xytrain.index);

fig, ax = plt.subplots(2,1, figsize = (25,16), sharex= True);
_.loc[_['Nb_Nulls'] > 0, ['Nb_Nulls']].plot.bar(ax= ax[0], color = 'tab:blue');
ax[0].set_title(f"Null instances per column in the data\n", color = 'tab:blue', fontsize= 12);
ax[0].set_yticks(range(0,19001,1000));

_.loc[_['Nb_Nulls'] > 0, ['Null_Rate']].plot.bar(ax= ax[1], color = 'tab:blue');
ax[1].set_title(f"Null-rate per column in the data\n", color = 'tab:blue', fontsize= 12);
ax[1].set_yticks(np.arange(0.0, 0.022, 0.002));

plt.tight_layout();
plt.show();

del _;

#### Calculation of nulls per row in the dataset:-

The number of nulls per row could be of use in model development, especially for columns in group 4 where a model is developed. We aim to elicit the nulls per row and then the column names where nulls are present per row.
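
The two quantities described above can be sketched on a toy frame. This is a minimal illustration with made-up values and hypothetical variable names, not the notebook's own implementation: it counts nulls per row and joins the names of the null columns into one pattern string per row.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a few group 4 columns
toy = pd.DataFrame({'F_4_0': [1.0, np.nan, 3.0, np.nan],
                    'F_4_1': [np.nan, np.nan, 0.5, 2.0],
                    'F_4_2': [0.1, 0.2, 0.3, 0.4]})

null_mask = toy.isna()

# Number of nulls in each row
nulls_per_row = null_mask.sum(axis=1)

# Space-separated names of the null columns in each row ('' when none)
null_cols = null_mask.apply(lambda row: ' '.join(row.index[row.to_numpy()]), axis=1)

# Frequency of each null-column combination, ignoring complete rows
pattern_counts = null_cols[null_cols != ''].value_counts()
print(pattern_counts)
```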

In [None]:
# Analyzing the nulls per row only in feature group 4:-
def Locate_NullRows(df:pd.DataFrame, title: str):
    """
    This function locates the nulls across rows in the dataframe (or a subset) and plots it.
    Input:- 
    df (pd.DataFrame)- Analysis dataframe
    title (string)- Title for plots
    """;
    
    df['Nb_Nulls'] = df.isna().sum(axis=1);
    _ = pd.DataFrame(df.query("Nb_Nulls > 0").groupby(['Nb_Nulls']).size(),
                     columns = ['Nb_Nulls']);

    print(colored(f"\n{title}\n", color = 'blue', attrs= ['bold', 'dark']));
    display(_.style.format('{:,.0f}'));

    fig, ax= plt.subplots(1,1, figsize= (12,6));
    _.plot.bar(ax=ax);
    ax.set_title(f"\n{title}\n", color = 'tab:blue', fontsize= 12);
    ax.grid(visible= True, which = 'both', color = 'grey', linewidth= 0.75, linestyle= '--');
    ax.set_xlabel('Number of nulls', fontsize= 9);
    
    plt.tight_layout();
    plt.xticks(rotation= 0);
    plt.show();

    del _;
    # Note: this only rebinds the local name; the caller's dataframe keeps 'Nb_Nulls':-
    df = df.drop(['Nb_Nulls'], axis=1);

In [None]:
# Locating nulls across table components:-
Locate_NullRows(df= xytrain, title = 'Full table- nulls');
Locate_NullRows(df= xytrain.loc[:, Ftre_Grp_Dict.get(4)],title = 'Feature group4- nulls');
collect();

In [None]:
# Delving into group4 nulls to elicit model specifics:-
_ = xytrain.loc[:, list(Ftre_Grp_Dict.get(4))];
_['Nb_Nulls'] = _.isna().sum(axis=1);
_['Nb_Nulls'] = _['Nb_Nulls'].astype(np.int8);

def EnlistNullCol(row):
    "This function forms a list of all columns in a row that contain null values";
    null_col_comb = [];
    for i in row.items(): 
        if np.isnan(i[1], casting= 'safe'): null_col_comb.append(i[0]);
    return ' '.join(null_col_comb);

_['Null_Col'] = _.swifter.apply(EnlistNullCol, axis=1);

print(colored(f"\nNull columns in feature group4-\n", color= 'blue', attrs= ['dark', 'bold']));
display(_.loc[_.Null_Col != '', ['Null_Col']].value_counts());
collect();

In [None]:
# Plotting correlation plots for various sections in group4 with nulls between 1-5:-
for nb_nulls in tqdm(range(1,6,1)):
    fig, ax= plt.subplots(1,1, figsize= (20,10));
    _corr = _.loc[_.Nb_Nulls == nb_nulls].drop(['Nb_Nulls', 'Null_Col'], axis=1).corr();
    sns.heatmap(_corr, annot= True, fmt= '.0%', linewidth= 0.35, center= 0, cmap= 'icefire',
                linecolor= 'black', mask = np.triu(np.ones_like(_corr)),ax= ax);
    ax.set_title(f"\nCorrelation plot for nulls = {nb_nulls}\n", color= 'tab:blue');
    plt.tight_layout();
    plt.yticks(rotation= 0);
    plt.xticks(rotation= 90);
    plt.show();
    del _corr;
    collect();
    
del _;
collect();

#### Data distribution analysis for all features:-

In [None]:
# Analyzing the data distributions per column group with continuous columns:-
for ftre_grp_nb in tqdm([1,3,4]):
    fig = plt.figure(figsize= (30,30));

    nplots= len(Ftre_Grp_Dict[ftre_grp_nb]);
    ncols= 4;
    nrows= int(np.ceil(nplots/ ncols));
    
    print(colored(f"\nDistributions for feature group {ftre_grp_nb}\n", 
                  color= 'blue', attrs= ['bold']));
    for i , col in tqdm(enumerate(Ftre_Grp_Dict[ftre_grp_nb])):
        plt.subplot(nrows, ncols, i+1);
        sns.histplot(x=xytrain[col].values, bins=100, color = 'tab:blue');
    plt.show();

In [None]:
fig,ax= plt.subplots(1,1, figsize= (16,6));
xytrain_desc_sum.loc[(np.abs(xytrain_desc_sum.skewness) >= 0.10) 
                     & (xytrain_desc_sum.index.str[0:3] != 'F_2'), 'skewness'].plot.bar(ax=ax);
ax.set_title(f"Non-normally distributed columns in groups F_1, F_3 and F_4\n", 
             color= 'tab:blue', fontsize= 12);
ax.grid(visible= True, which= 'both', color= 'grey', linestyle = '--', linewidth= 0.75);
ax.set_yticks(np.arange(-1.0, 0.50, 0.1));
ax.tick_params(axis= 'y', labelsize= 8);

plt.tight_layout();
plt.show();

In [None]:
# Analyzing discrete column distributions in group 2:-
print(colored(f"Value-counts for feature group 2\n", color= 'blue', attrs= ['bold']));
fig, axes = plt.subplots(5,5, figsize= (31,31));

for i, col in tqdm(enumerate(Ftre_Grp_Dict.get(2))):
    xytrain[col].value_counts().sort_index(ascending= True).plot.bar(ax=axes[i//5, i%5]);
    axes[i//5, i%5].grid(visible= True, which= 'both', color= 'lightgrey', 
                         linewidth = 0.75, linestyle= '--');
    axes[i//5, i%5].set(xlabel= '', ylabel='');
plt.tight_layout();
plt.show();

In [None]:
# Plotting the positions of nulls in the data-set:-
for i in tqdm([1,3,4]):
    fig, ax= plt.subplots(1,1, figsize= (25,20));
    sns.heatmap(data= xytrain[Ftre_Grp_Dict[i]].isna(), 
                cmap= 'binary', cbar_kws={'label': 'Missing Data'},ax= ax);
    ax.set_title(f"\nMissing value heatmap for group {i}\n", fontsize= 12, color= 'tab:blue');
    ax.set(ylabel='');
    plt.xticks(rotation = 45, fontsize= 7);
    plt.show();

#### Feature line-plots creation to exhibit possible periodicity patterns:-

In [None]:
# Plotting line-plots for the features in the given group for potential time-series data/ periodicity detection:-
def Plot_Ftre(grp_nb: int, color: str= 'blue', sample_size: tuple= (100,1000,10000)):
    """
    This function plots the features specified in the group to elicit patterns/periodicity.
    This is used to check for potential time series data/ null value filling algorithm choice
    """;
    
    print(colored(f"Feature group {grp_nb} lineplots\n", color = 'green', attrs= ['dark', 'bold']));

    for chunk_size in tqdm(sample_size):
        print(colored(f"\nCurrent chunk size = {chunk_size}\n", color = 'red', attrs= ['dark', 'bold']));

        for ftre in tqdm(Ftre_Grp_Dict.get(grp_nb)):
            y= xytrain[ftre].dropna()[0:chunk_size];
            fig, ax= plt.subplots(1,1, figsize= (25, 7.5));
            sns.lineplot(y= y, x= y.index, color = color, linestyle= '-', linewidth = 1.0, ax= ax);
            ax.set_title(f"\nSample lineplot for {ftre}\n", color= color, fontsize= 12);
            ax.grid(visible= True, which= 'both', color= 'grey', linestyle= '--', linewidth= 0.50);
            ax.set(ylabel= '', xlabel= '');
            ax.set_xticks(range(0,chunk_size + 1, chunk_size // 10));
            ax.tick_params(axis= 'x', labelsize= 8);

            plt.xticks(rotation= 45);
            plt.tight_layout();
            plt.show();
            del y;
            collect();
        collect();
    collect();

In [None]:
Plot_Ftre(grp_nb=1, color= 'crimson');
collect();

In [None]:
Plot_Ftre(grp_nb=3, color = 'indigo');
collect();

In [None]:
Plot_Ftre(grp_nb=4, color = 'teal');
collect();

#### Correlation heatmaps for potential linear relations between all features and within a group:-

In [None]:
# Plotting correlation matrices across all feature groups:-
fig, ax= plt.subplots(1,1, figsize= (30,20));
# 'Nb_Nulls' is a helper column left on xytrain by Locate_NullRows; it is excluded here:-
sns.heatmap(xytrain.drop('Nb_Nulls', axis=1, errors= 'ignore').\
            corr(method = 'pearson'),linewidths= 0.50, linecolor="black",
            square= True, cmap= 'Blues', annot= False, ax= ax);   
ax.set_title(f"\nPearson correlation heatmap- full data\n", 
         fontsize= 12, color= 'tab:blue');  
plt.xticks(rotation= 45);
plt.yticks(rotation= 0);
plt.tight_layout();
plt.show();

In [None]:
# Plotting correlation matrices across feature groups:-
for method in tqdm(['pearson', 'spearman', 'kendall']):
    _ = xytrain[Ftre_Grp_Dict[4]].corr(method= method);
    fig, ax= plt.subplots(1,1, figsize= (42,16));
    sns.heatmap(_,fmt=".1%", mask= np.triu(np.ones_like(_)),linewidths= 0.75,
                linecolor="black", square= True,cmap= 'Spectral', annot= True, ax= ax);   
    ax.set_title(f"\n{method.capitalize()} correlation heatmap-group 4\n", 
                 fontsize= 12, color= 'black');  
    plt.xticks(rotation= 45);
    plt.yticks(rotation= 0);
    plt.show();
    del _;

collect();

### Key inferences from EDA:- 

1. F_2 columns are integer-encoded categorical features. They don't have any null values.
2. F_1, F_3 and F_4 features are float columns. We have reduced the memory size of these columns without any major information loss.
3. F_1 and F_3 columns are not correlated among themselves or with other table columns. Imputing their missing values with a descriptive statistic (mean, median, IQR, or a constant such as 0) could be a good idea here. We will try each option for every column and pick the best one. **These columns could be considered as 'missing completely at random'**
4. The F_4 columns exhibit some level of correlation among themselves. Developing a model for each F_4 column (e.g. F_4_0) as a function of the other F_4 columns could be an option. The rows where a column is missing form the test set for that column's regression model. Standard machine learning models like tree ensembles or neural networks could be used. **According to the literature on missing data, these columns could be considered as 'missing at random'**. In total, 703 combinations of null-valued columns exist in feature group 4.
5. The location of nulls throughout the table was checked and no specific pattern emerged. Nulls are scattered throughout the table, except for the columns in feature group 2.
6. *Columns in feature group F_2 behave strangely. They don't have nulls, hold integer-encoded categorical values and are of the integer type, whereas all other columns are floats. Do they hold an unravelled mystery? Am I searching incorrectly? Am I missing out on something? Thoughts? Comments? Why are these columns even included in the table if they offer a nondescript contribution to the model?*
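
The idea in point 4 can be sketched on synthetic data. This is a hedged illustration, not the model used in the next part: it fits an ordinary least-squares regression (via `np.linalg.lstsq`) on rows where the target column is observed and predicts the rows where it is missing. All arrays and coefficients are made up, and the predictors are assumed complete, which does not hold for the real F_4 columns.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 500

# Assumption: three fully observed predictors stand in for the other F_4 columns
X = rng.normal(size=(n, 3))
target = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.1, size=n)

# Hide about 10% of the target values to mimic missing entries
missing = rng.random(n) < 0.1
observed_target = np.where(missing, np.nan, target)

# Training rows: target observed. "Test" rows: target missing.
A_train = np.column_stack([X[~missing], np.ones((~missing).sum())])
coef, *_ = np.linalg.lstsq(A_train, observed_target[~missing], rcond=None)

A_test = np.column_stack([X[missing], np.ones(missing.sum())])
imputed = observed_target.copy()
imputed[missing] = A_test @ coef

# Compare against a constant (mean) fill on the hidden entries
rmse_model = np.sqrt(np.mean((imputed[missing] - target[missing]) ** 2))
rmse_mean = np.sqrt(np.mean((np.nanmean(observed_target) - target[missing]) ** 2))
print(f"RMSE model = {rmse_model:.3f}, RMSE mean fill = {rmse_mean:.3f}")
```

When the columns are correlated, as assumed here, the regression fill should beat the constant fill. For uncorrelated columns such as F_1 and F_3 there is nothing for a model to exploit.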

## 4. Imputation for feature groups 1 and 3:-

In this section, we impute the nulls in feature groups 1 and 3 using descriptive statistics and pick the best method for each column in these groups.
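
When candidates are compared by MSE, it helps to recall that for a constant fill value c, MSE(c) = Var(y) + (mean(y) - c)^2 on the data it is computed from. The mean therefore minimizes the in-sample MSE, and other constants are penalized by their squared distance from the mean. The sketch below checks this on synthetic skewed data; the distribution and the names `constant_mse`, `candidates` and `scores` are assumptions for illustration.

```python
import numpy as np

def constant_mse(y: np.ndarray, c: float) -> float:
    """MSE of predicting the constant c for every entry of y."""
    return float(np.mean((y - c) ** 2))

# Assumption: a skewed gamma sample stands in for a feature column
rng = np.random.default_rng(10)
y = rng.gamma(shape=2.0, scale=1.5, size=5_000)

q25, q75 = np.percentile(y, [25, 75])
candidates = {'mean': float(np.mean(y)),
              'median': float(np.median(y)),
              'iqr': float(q75 - q25),
              'zero': 0.0}

scores = {name: constant_mse(y, c) for name, c in candidates.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

On held-out folds the training mean is no longer guaranteed to win, but it is usually close to the best constant. The other candidates can only come out ahead when they happen to lie near the fold mean.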

In [None]:
# Creating output data-set with all relevant columns, options and relevant MSE:-
desc_stat_sum = np.zeros((0,6), dtype= np.float32);

# Creating groups 1,3 features:-
grp13_ftre = list(Ftre_Grp_Dict.get(1)) + list(Ftre_Grp_Dict.get(3));

In [None]:
for ftre in tqdm(grp13_ftre):
    print(colored(f"Current feature = {ftre}", color = 'blue', attrs= ['dark', 'bold']));
    y = xytrain[ftre].dropna();
    
    for fold_nb, (train_idx, dev_idx) in enumerate(
        KFold(n_splits=5, shuffle= True, random_state= 10).split(y)):

        print(colored(f"Current fold = {fold_nb}", color = 'red' , attrs= ['dark']));
        
        ytrain, ydev = y.values[train_idx], y.values[dev_idx];
        mse_mean = mean_squared_error(ydev, np.full(len(ydev), np.mean(ytrain)));
        mse_median = mean_squared_error(ydev, np.full(len(ydev), np.median(ytrain)));
        mse_iqr = mean_squared_error(ydev, np.full(len(ydev), iqr(ytrain)));
        mse_zero = mean_squared_error(ydev, np.zeros(len(ydev)));
        
        desc_stat_sum= np.vstack((desc_stat_sum,
                                 np.array([ftre, fold_nb, mse_mean, mse_median, mse_iqr, mse_zero])));
        del mse_mean, mse_median, mse_iqr, mse_zero;  

In [None]:
# Converting the array to a dataframe and displaying results with best imputation method:-
desc_stat_sum = pd.DataFrame(desc_stat_sum, 
                             columns = ['ftre_nm', 'fold_nb', 'mean', 'median', 'iqr', 'zero']);
desc_stat_sum[['mean', 'median', 'iqr', 'zero']] = \
desc_stat_sum[['mean', 'median', 'iqr', 'zero']].astype(np.float32);
desc_stat_sum['imp_mthd_lbl'] = desc_stat_sum[['mean', 'median', 'iqr', 'zero']].idxmin(axis=1);

imp_mthd_sum = \
desc_stat_sum[['ftre_nm', 'imp_mthd_lbl']].groupby('ftre_nm').agg({'imp_mthd_lbl': lambda x: x.mode().iloc[0]});

# plotting the best imputation by feature:-
fig, ax= plt.subplots(1,1, figsize= (5,5));
imp_mthd_sum['imp_mthd_lbl'].value_counts().plot.bar(ax= ax);
ax.set_title(f"Imputation best option per feature\n", fontsize= 12, color= 'tab:blue');
ax.grid(visible= True, which= 'both', color= 'grey', linestyle= '--', linewidth= 0.75);
plt.tight_layout();
plt.show();

#### Implementation steps:- 

1. Collate the dataset sample with null records from the reference column. This will become the index for the submission file
2. Impute the values of the null with the chosen strategy. This depends on the results from the previous step
3. Align the imputed data to the master submission table
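
The three steps above can be sketched on a toy frame. This is a hedged illustration that assumes mean imputation for every column and a long `row-col` / `value` layout like the one built in the cell below. The data and variable names are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-in for two feature columns indexed by row_id
toy = pd.DataFrame({'F_1_0': [0.2, np.nan, 0.4],
                    'F_3_0': [np.nan, 1.0, np.nan]},
                   index=pd.Index([0, 1, 2], name='row_id'))

# Assumption: the chosen strategy is the column mean for every column
fill_values = toy.mean()

parts = []
for col in toy.columns:
    # Step 1: collect the row ids where this column is null
    null_idx = toy.index[toy[col].isna()]
    # Step 2: pair each '<row_id>-<column>' key with the chosen fill value
    parts.append(pd.DataFrame({'row-col': null_idx.astype(str) + '-' + col,
                               'value': fill_values[col]}))

# Step 3: stack the per-column pieces into one submission-style table
toy_submission = pd.concat(parts, ignore_index=True)
print(toy_submission)
```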

In [None]:
# Creating output data-set to store the imputed values:-
Submission = pd.DataFrame(data= None, columns= sub_fl.columns);

# Imputing the respective columns with the chosen method:-
print(colored(f"Missing value imputation for groups 1,3\n", 
              color = 'blue', attrs= ['dark', 'bold']));

for ftre in tqdm(grp13_ftre):
    print(colored(f"Current feature = {ftre}", color = 'red', attrs= ['dark']));
    _ = xytrain[[ftre]].loc[xytrain[ftre].isna()];
    _['row-col'] = _.index.map(str) + '-' + ftre;
    # Looking up the chosen method and its fill value (any other label falls back to 0):-
    mthd = imp_mthd_sum.loc[ftre, 'imp_mthd_lbl'];
    ftre_val = xytrain[ftre].dropna();
    if mthd == 'mean': fill_val = np.mean(ftre_val);
    elif mthd == 'median': fill_val = np.median(ftre_val);
    elif mthd == 'iqr': fill_val = iqr(ftre_val);
    else: fill_val = 0.0;
    _['value']= fill_val;

    Submission = \
    pd.concat((Submission, _.drop([ftre], axis=1)), axis=0, ignore_index= True);
    
print(colored(f"\nTotal missing values imputed = {len(Submission):,.0f}\n", 
              color = 'blue', attrs= ['dark', 'bold']));

collect();

In [None]:
Submission.to_csv("Submission_Grp13.csv", index= False);

**The next notebook in this series will develop a model for the F_4_ group using the other F_4_ columns. We will explore a variety of ML models and a neural network for the same.**