In [None]:
!pip install scikit-learn-intelex --progress-bar off >> /tmp/pip_sklearnex.log

In [None]:
import os
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from catboost import CatBoostRegressor
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from tqdm.notebook import tqdm
from sklearnex import patch_sklearn, config_context
patch_sklearn()

from sklearn.neighbors import KNeighborsRegressor
from sklearn.impute import SimpleImputer

In [None]:
from IPython.display import display, HTML, Javascript

color_map = ['#FFFFFF','#FF5733']

prompt = color_map[-1]
main_color = color_map[0]
strong_main_color = color_map[1]
custom_colors = [strong_main_color, main_color]

css_file = '''
div #notebook {
background-color: white;
line-height: 20px;
}

#notebook-container {
%s
margin-top: 2em;
padding-top: 2em;
border-top: 4px solid %s;
-webkit-box-shadow: 0px 0px 8px 2px rgba(224, 212, 226, 0.5);
    box-shadow: 0px 0px 8px 2px rgba(224, 212, 226, 0.5);
}

div .input {
margin-bottom: 1em;
}

.rendered_html h1, .rendered_html h2, .rendered_html h3, .rendered_html h4, .rendered_html h5, .rendered_html h6 {
color: %s;
font-weight: 600;
}

div.input_area {
border: none;
    background-color: %s;
    border-top: 2px solid %s;
}

div.input_prompt {
color: %s;
}

div.output_prompt {
color: %s; 
}

div.cell.selected:before, div.cell.selected.jupyter-soft-selected:before {
background: %s;
}

div.cell.selected, div.cell.selected.jupyter-soft-selected {
    border-color: %s;
}

.edit_mode div.cell.selected:before {
background: %s;
}

.edit_mode div.cell.selected {
border-color: %s;

}
'''

def to_rgb(h): 
    return tuple(int(h[i:i+2], 16) for i in [0, 2, 4])

main_color_rgba = 'rgba(%s, %s, %s, 0.1)' % (to_rgb(main_color[1:]))
open('notebook.css', 'w').write(css_file % ('width: 95%;', main_color, main_color, main_color_rgba, 
                                            main_color,  main_color, prompt, main_color, main_color, 
                                            main_color, main_color))

def nb(): 
    return HTML("<style>" + open("notebook.css", "r").read() + "</style>")
nb()

![](https://i.imgur.com/wNuFkP1.png)

# <div style="padding: 30px;color:white;margin:10;font-size:80%;text-align:left;display:fill;border-radius:10px;background-color:#F1C40F;overflow:hidden;background-image: url(https://i.imgur.com/f1SNYqc.png)"><b><span style='color:white'>1 | Introduction </span></b> </div>

## **<span style='color:#370FA9'>NOTEBOOK AIM - DATA IMPUTATION</span>**

- As the **[dataset](https://www.kaggle.com/competitions/tabular-playground-series-jun-2022)** suggests, the data comes with missing values & we have to impute them, in one form or another
- There are a number of ways we can treat missing data in a dataset & there are plenty of notebooks showing **simple imputation methods**
- My favourite approach for data imputation is an **<span style='color:#CB61CE'>ensemble model approach</span>**, as it offers a wide range of variability and allows us to experiment with hyperparameters
- In this notebook, we'll take a look at a **<span style='color:#CB61CE'>model based approach to imputation</span>**, including **<span style='color:#CB61CE'>ensembling</span>** of different models
- We'll be using two very contrasting regressors: an **instance-based** nearest neighbour model (kNR) & a **gradient boosting** model (CatBoost)

## **<span style='color:#370FA9'>DATA IMPUTATION TYPES</span>**

- I will not go into too much depth; as can be seen in a **[discussion post](https://www.kaggle.com/competitions/tabular-playground-series-jun-2022/discussion/328568)**, there is no shortage of literature on this topic
- There are lots of notebooks utilising **[SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)** as well, which is quite a **quick method**, so it has its merits
- Focusing on Model based imputation methods here:

> - Model based approaches are slightly more costly than simpler methods such as **[SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)**
> - While some imputation methods are deemed appropriate for a specific type of data, e.g. normally distributed data, MCAR missingness, etc., these methods are criticized mostly for biasing our estimates and models
> - Some, therefore, believe that deletion methods are safer in some circumstances
> - Fortunately for us, newer categories of imputation methods address these weaknesses of the simple imputation and the deletion methods
> - These are **model-based** and **multiple imputation** methods (**ensembles**)
> - An example of a **<span style='color:#CB61CE'>model based imputer</span>** can be found in **[IterativeImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html)** & in this **[notebook](https://www.kaggle.com/code/reymaster/0-93002-iterative-imputer-baseline)**
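
To make the bias criticism above concrete, here is a minimal sketch (synthetic data, not the competition dataset) of how filling gaps with the column mean shrinks the spread of a feature. The series `x` and the 30% MCAR missing rate are assumptions chosen purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# synthetic feature with a known spread (illustrative data only)
x = pd.Series(rng.normal(loc=0.0, scale=1.0, size=10_000))

# remove roughly 30% of the values completely at random (MCAR)
mask = rng.random(len(x)) < 0.3
x_missing = x.mask(mask)

# simple imputation: fill every gap with the observed mean
x_mean_imputed = x_missing.fillna(x_missing.mean())

std_observed = x_missing.std()
std_imputed = x_mean_imputed.std()
print(f'std of observed values  : {std_observed:.3f}')
print(f'std after mean imputing : {std_imputed:.3f}')
```

Every imputed value sits exactly at the mean, so the imputed column is narrower than the observed one. This is the kind of distortion a model based approach tries to reduce.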

## **<span style='color:#370FA9'>GPU ACCELERATION</span>**
- One clear downside of a model based approach (even more an **<span style='color:#CB61CE'>ensemble</span>** approach) is the **high cost of imputation** compared to simpler methods
- However, there are some neat tools we can utilise in order to accelerate the entire process:
> - **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">Catboost</mark>** itself, allows models to be trained on GPUs, as shown in **[reference](https://catboost.ai/en/docs/features/training-on-gpu)**
> - For **<mark style="background-color:#393939;color:white;border-radius:5px;opacity:1.0">kNR</mark>**, it's a little trickier: the base library doesn't use GPU by default, however we can utilise the extension **<span style='color:#CB61CE'>sklearnex</span>**, as shown in **[reference](https://github.com/intel/scikit-learn-intelex)**
- The effectiveness of the GPU depends on **how much data we use** to train a model, as there is a cost of moving data to the GPU
- Smaller datasets don't benefit from GPU usage, however the **[Tabular Data Competition](https://www.kaggle.com/competitions/tabular-playground-series-jun-2022)** dataset contains a million rows
- So there should be a clear difference between using & not using a GPU to impute missing data when using the **<span style='color:#CB61CE'>model based approach for imputation</span>** 

# <div style="padding: 30px;color:white;margin:10;font-size:80%;text-align:left;display:fill;border-radius:10px;background-color:#F1C40F;overflow:hidden;background-image: url(https://i.imgur.com/f1SNYqc.png)"><b><span style='color:white'>2 | Imputation Class </span></b> </div>

## **<span style='color:#370FA9'>CLASS ATTRIBUTES & METHODS</span>**

**<mark style="background-color:#393939;color:white;border-radius:5px;opacity:1.0">ATTRIBUTES</mark>**
- <code>df</code> - Each **instance** requires an input dataframe (which we want to impute)
- <code>idf</code> - Created upon calling **<mark style="background-color:#393939;color:white;border-radius:5px;opacity:1.0">methods</mark>** (**impute_model**,**impute_simple**), otherwise is **None**
- <code>na_cols</code> - List of features which have missing/NaN data
- <code>coeffCB</code> - Ensemble Coefficient for the **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">Catboost</mark>** (gradient boosting regressor)
- <code>coeffkN</code> - Ensemble Coefficient for the **<mark style="background-color:#393939;color:white;border-radius:5px;opacity:1.0">kNR</mark>** (regressor)
- <code>knr_neigh</code> - Number of nearest neighbours in **<mark style="background-color:#393939;color:white;border-radius:5px;opacity:1.0">kNR</mark>**
- <code>gpu</code> - Utilise **GPU** during data imputation
- <code>cat_only</code> - Impute data using **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">Catboost</mark>** only (reduces the imputation time)

**<mark style="background-color:#393939;color:white;border-radius:5px;opacity:1.0">METHODS</mark>**
- <code>plot_na</code> - Visualise missing data before imputation (**option**='percentage'/'count',**show_id**='original'/'imputed')
- <code>stats</code> - Visualise the feature/column univariate distribution
- <code>impute_model</code> - **Ensemble Model Imputation Approach**
- <code>impute_simple</code> - Simple imputation, by **mean** or **median** value in a column, the **most frequent value** or some **constant value**
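
The ensemble idea behind <code>impute_model</code> can be sketched on a toy dataframe as follows. This is only an illustration under stated assumptions: scikit-learn's `GradientBoostingRegressor` stands in for CatBoost so the example has no extra dependency, and the dataframe `toy`, its columns and the 0.5/0.5 weights are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# toy data: 'y' depends on two fully observed features (illustrative only)
toy = pd.DataFrame({'a': rng.normal(size=500), 'b': rng.normal(size=500)})
toy['y'] = 2.0 * toy['a'] - toy['b'] + rng.normal(scale=0.1, size=500)
truth = toy['y'].copy()

# hide 20% of 'y' so that we have something to impute
toy.loc[toy.sample(frac=0.2, random_state=0).index, 'y'] = np.nan

features = ['a', 'b']
missing = toy['y'].isna()
train, test = toy[~missing], toy[missing]   # observed rows / rows to impute

# two contrasting regressors trained on the observed rows
knr = KNeighborsRegressor(n_neighbors=2).fit(train[features], train['y'])
gbr = GradientBoostingRegressor(random_state=0).fit(train[features], train['y'])

# weighted blend of both predictions (weights sum to 1)
coeff_knr, coeff_gbr = 0.5, 0.5
pred = coeff_knr * knr.predict(test[features]) + coeff_gbr * gbr.predict(test[features])

imputed = toy.copy()
imputed.loc[missing, 'y'] = pred

mae = np.abs(imputed.loc[missing, 'y'] - truth[missing]).mean()
print(f'imputed {missing.sum()} rows, mean absolute error {mae:.3f}')
```

Keeping the two weights summing to one keeps the blended prediction on the same scale as the target column.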

In [None]:
''' DATA IMPUTATION '''
# class is used for data imputation & visualisation

# [1] Input dataframe upon instantiation, set ldf
# [2] Visualise missing data % or count in each column using plot_na method
# [3] Impute imported dataframe using impute_model, impute_simple
# [4] Once imputed, data is stored in the idf attribute
# If the sklearnex library is loaded, scikit-learn-intelex extension will be used

class imputer:
    
    def __init__(self,ldf=None,gpu=False,cat_only=False):
        self.df = ldf            # imported dataframe 
        self.idf = None          # imputed data
        self.na_cols = None      # columns with missing data
        
        # Check if there is any missing data in the dataframe
        na_sum = ldf.isna().sum(axis=0) # series
        na_cols = list(na_sum[na_sum>0].index)
        if(len(na_cols) > 0):
            print(f'[note] {len(na_cols)} columns with missing data found')
            self.na_cols = na_cols
        
        # Model Imputation attributes
        
        # parameters we can set before calling impute_model
        self.knr_neigh = 2       # Number of Neighbours in kNN model
        self.coeffCB = 0.5       # Ensemble Coefficient for CB
        self.coeffkN = 0.5       # Ensemble Coefficient for knn
        self.gpu = gpu           # GPU option for Catboost
        self.cat_only = cat_only # CatBoost Imputation only (no kNR) 
        
        # adjust the ensemble coefficient if only one model
        if(self.cat_only):
            self.coeffCB = 1.0
        
        # Constant Imputer attributes
        
        self.si_const_val = 1.0  # impute NaN with constant value
        
    ''' Function to plot missing data '''

    # arguments:
    # option - display missing data as percentage ('percentage') or count ('count')
    # show_id - identifier to show which dataframe to show missing data

    def plot_na(self,option='percentage',show_id='original'):
        
        # choose which dataframe to use
        if(show_id == 'original'):
            ldf = self.df
        else:
            ldf = self.idf
    
        # Find missing data in columns    
        if(option == 'percentage'):
            leng = len(ldf)
            na_sum = (ldf.isnull().sum()/leng*100).sort_values(ascending=False)
        else:
            na_sum = ldf.isna().sum(axis=0)
    
        fig = px.bar(na_sum,color=na_sum.values)
    
        # Plot aesthetics
        fig.update_layout(height=300,template='plotly_white',
                          title='Missing Data in Columns',
                      margin=dict(l=50,r=80,t=80,b=40),
                      font=dict(family='sans-serif',size=12),
                      xaxis_title="Feature Names",
                      yaxis_title="Number of Missing Rows",
                      showlegend=False)
    
        if(option == 'percentage'):
            fig.update_layout(yaxis_title="% Missing Data")
        else:
            fig.update_layout(yaxis_title="Number of Missing Rows")
        
        fig.update_traces(width=.9)          
        fig.update_coloraxes(showscale=False) # remove colourbar
        fig.show()
        
    ''' Plot Univariate Statistics of Missing Data Columns '''
    # visualise univariate distributions (histograms) of columns with missing data
    
    # n_cols - (int) number of columns in subplot
    # height - (int) height of the figure
    # sample - (int) plot only the first `sample` columns with missing data
    # bins - (int) number of histogram bins
    # barmode - (str) plotly barmode used when both before/after histograms are shown

    def stats(self,n_cols=5,height=400,sample=None,bins=50,barmode='overlay'):

        # select only int or floats (assumed 64 bit)
        ldf = self.df.select_dtypes(include=['float64','int64'])
        if(self.idf is not None):
            ildf = self.idf.select_dtypes(include=['float64','int64']) 

        # If there are any columns with missing data (na_cols is None when there are none)
        if(self.na_cols):
            
            if(sample is not None):
                show_columns = self.na_cols[:sample]
            else:
                show_columns = self.na_cols

            n_rows = -(-len(show_columns) // n_cols)  # math.ceil in a fast way, without import
            row_pos, col_pos = 1, 0
            
            fig = make_subplots(rows=n_rows, cols=n_cols,subplot_titles=show_columns)

            for col in show_columns:
                if(self.idf is not None):                    
                    itrace = go.Histogram(x=ildf[col],nbinsx=bins,marker_color='#283747',name='imputed')
                    trace = go.Histogram(x=ldf[col],nbinsx=bins,marker_color='#C7C7C7',name='before')
                else:
                    trace = go.Histogram(x=ldf[col],nbinsx=bins,marker_color='#283747')

                if(col_pos == n_cols): 
                    row_pos += 1
                col_pos = col_pos + 1 if (col_pos < n_cols) else 1
                
                if(self.idf is not None):
                    fig.add_trace(itrace, row=row_pos, col=col_pos)
                    fig.add_trace(trace, row=row_pos, col=col_pos)
                else:
                    fig.add_trace(trace, row=row_pos, col=col_pos)

            fig.update_layout(template='plotly_white',
                              title=f'Univariate Feature Distribution',
                              font=dict(family='sans-serif',size=12))

            fig.update_traces(marker=dict(line=dict(width=0.5, color='white')),
                              opacity=0.75)
                
            # If we have already imputed & can make comparison
            if(self.idf is not None):
                fig.update_layout(barmode=barmode)
                
            fig.update(layout_showlegend=False)
            fig.update_layout(height=height);fig.show()
        
    ''' Model based imputation '''
    # ensemble based model imputation kNR & CatBoost
    # if method argument cols is not given, all columns with missing data are imputed
    # cols - list of strings (column names)
    
    def impute_model(self,cols=None):

        # separate dataframe into numerical/categorical
        
        # select numerical columns in df
        ldf = self.df.select_dtypes(include=[np.number])           
        #  select categorical columns in df
        ldf_putaside = self.df.select_dtypes(exclude=[np.number])  

        # define columns w/ and w/o missing data
        
        # list of features w/ missing data 
        cols_nan = ldf.columns[ldf.isna().any()].tolist()       
        # names of all columns w/o missing data (used as model features)
        cols_no_nan = ldf.columns.difference(cols_nan).values     

        if(cols is not None):
            cols_nan = cols
            df1 = ldf[cols_nan].describe()

        ''' Cycle through all columns '''
        # that have Nan Data and impute data using modeling
            
        fill_id = -1
        for col in tqdm(cols_nan):  
            
            fill_id+=1
            
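            # rows where this column is missing become our test set
            imp_test = ldf[ldf[col].isna()]
            
            # rows where this column is observed become our training set
            # (the features cols_no_nan never contain NaN; a plain dropna() would
            # also drop rows on the fill_id helper column added further below)
            imp_train = ldf.dropna(subset=[col])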
            
            # instantiate CatBoost regressor (gradient boosting)
            if(self.gpu):
                cbmodel = CatBoostRegressor(verbose=False,
                                            task_type="GPU")
            else:
                cbmodel = CatBoostRegressor(verbose=False)
                
            # instantiate kNR (instance-based nearest neighbour regressor)
            knmodel = KNeighborsRegressor(n_neighbors=self.knr_neigh)
            
            # fit models on no-nan data
            cbm = cbmodel.fit(imp_train[cols_no_nan], imp_train[col])
            cbP = cbm.predict(imp_test[cols_no_nan])
            
            if(self.cat_only):
                pred = self.coeffCB*cbP
                
            else:
                # target_offload only affects computations executed inside the
                # context, so fit & predict (not the constructor) are wrapped
                if(self.gpu):
                    with config_context(target_offload="gpu:0"):
                        knm = knmodel.fit(imp_train[cols_no_nan], imp_train[col])
                        knrP = knm.predict(imp_test[cols_no_nan])
                else:
                    knm = knmodel.fit(imp_train[cols_no_nan], imp_train[col])
                    knrP = knm.predict(imp_test[cols_no_nan])
                
                pred = self.coeffkN*knrP + self.coeffCB*cbP # Simple Model Ensemble

            # write predictions back & record which imputation step filled each row
            ldf.loc[self.df[col].isna(), col] = pred              
            ldf.loc[self.df[col].isna(),'fill_id'] = fill_id # position (in cols_nan) of the column imputed last for this row

        df2 = ldf[cols_nan].describe()

        self.idf =  pd.concat([ldf,ldf_putaside],axis=1)
        print(f'[note] imputation complete')
        
       
    ''' Simple Imputer Function '''
    # simple imputation using SimpleImputer

    def impute_simple(self,option='mean'):
        
        if(option == 'mean'):
            # by default simple imputer uses mean
            simple = SimpleImputer()

        elif(option == 'median'):
            # impute with column median
            simple = SimpleImputer(strategy='median')

        elif(option == 'most_freq'):
            # impute using most frequent value in each column
            simple = SimpleImputer(strategy='most_frequent')

        elif(option == 'constant'):
            # impute using a constant value (si_const_val)
            simple = SimpleImputer(strategy='constant',
                                   fill_value=self.si_const_val)
        else:
            raise ValueError(f'[error] unknown option: {option}')

        # fit_transform returns a numpy array, so rebuild the dataframe
        # (keeps plot_na & stats usable on the imputed data)
        self.idf = pd.DataFrame(simple.fit_transform(self.df),
                                columns=self.df.columns,
                                index=self.df.index)

In [None]:
''' Simple Test - Class Usage '''
# Testing our class

# Dataset Usage Examples
from sklearn import datasets

# convert sklearn dataset to pandas DataFrame
def sklearn_to_df(sklearn_dataset):
    df = pd.DataFrame(sklearn_dataset.data, columns=sklearn_dataset.feature_names)
    df['target'] = pd.Series(sklearn_dataset.target)
    return df

''' Load DataFrame w/o missing data '''
df_cali = sklearn_to_df(datasets.fetch_california_housing())
df_cali.head()

''' Remove random percentage of data in two columns '''
# Randomly remove some data in columns

def create_nan(ldf,column,p):
    ldf[column] = ldf[column].apply(lambda x: np.nan if np.random.rand() < p else x)
    return ldf
    
df_mod = create_nan(df_cali,'HouseAge',2/10)
df_mod = create_nan(df_mod,'AveBedrms',2/4)

# Instantiate Imputer Class (df_mod is df_cali with NaN values added)
cali_impute = imputer(ldf=df_mod)

# Model based imputation of two columns
cali_impute.impute_model(['HouseAge','AveBedrms'])

# Dataframe containing imputed df & imputer info column (fill_id)
cali_impute.idf.head()
# cali_impute.stats(sample=2,barmode='relative')

# <div style="padding: 30px;color:white;margin:10;font-size:80%;text-align:left;display:fill;border-radius:10px;background-color:#F1C40F;overflow:hidden;background-image: url(https://i.imgur.com/f1SNYqc.png)"><b><span style='color:white'>3 | Visualising Missing Data </span></b> </div>

## **<span style='color:#370FA9'>LOADING DATA</span>**
- Let's read the data & view some basic information about the dataset (its **<mark style="background-color:#393939;color:white;border-radius:5px;opacity:1.0">shape</mark>**)
- We'll create two subsets, one for **<span style='color:#CB61CE'>CPU imputation</span>** (2,000 entries) & one for **<span style='color:#CB61CE'>GPU imputation</span>** (100,000 entries), using the **<mark style="background-color:#393939;color:white;border-radius:5px;opacity:1.0">sample</mark>** method
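
As a side note, `sample` draws a different subset on every run unless a seed is given. A minimal sketch of a reproducible draw, assuming a hypothetical dataframe `frame` and an arbitrary seed of 42 (the notebook itself does not set one):

```python
import numpy as np
import pandas as pd

# hypothetical dataframe standing in for the competition data
frame = pd.DataFrame({'x': np.arange(1_000)})

# the same random_state gives the same rows on every call
subset_a = frame.sample(200, random_state=42)
subset_b = frame.sample(200, random_state=42)

print(subset_a.shape, subset_a.index.equals(subset_b.index))
```

Fixing the seed makes imputation timings and before/after plots comparable between runs.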

In [None]:
# Import data
data = pd.read_csv('/kaggle/input/tabular-playground-series-jun-2022/data.csv')
print(f'[note] data loaded {data.shape}')

# Create a small subset
data_small = data.sample(2000)
print(f'[note] selected subset {data_small.shape}')

# Create a large dataset (used for GPU imputation)
data_large = data.sample(100000)
print(f'[note] selected subset {data_large.shape}')

- We have two **<span style='color:#CB61CE'>dtypes</span>**, both of which are **<span style='color:#CB61CE'>numerical</span>**, so we can use a model approach (**regressor**) without having to modify any column data before imputation
- The 'int64' dtype suggests there are discretised numeric features

In [None]:
# Check dtype information
data_small.dtypes.unique()

- If we had **<span style='color:#CB61CE'>categorical features</span>**, **[@ijcrook](https://www.kaggle.com/ijcrook)** has uploaded some useful functions for data imputation using **<span style='color:#CB61CE'>model based imputation</span>** in functions **<mark style="background-color:#393939;color:white;border-radius:5px;opacity:1.0">knn_impute</mark>**, **<mark style="background-color:#393939;color:white;border-radius:5px;opacity:1.0">ML_impute</mark>**

## **<span style='color:#370FA9'>CHECKING FOR NAN VALUES</span>**
- Aside from imputation, class <code>imputer</code> contains methods for visualising missing data as well
- We can call method **<mark style="background-color:#393939;color:white;border-radius:5px;opacity:1.0">plot_na</mark>**, which by default will show the percentage of missing data in each column (**count** or **percentage** are available)

In [None]:
check_data = imputer(data_small) # create instance
check_data.plot_na()       # visualise missing data % in each column

# <div style="padding: 30px;color:white;margin:10;font-size:80%;text-align:left;display:fill;border-radius:10px;background-color:#F1C40F;overflow:hidden;background-image: url(https://i.imgur.com/f1SNYqc.png)"><b><span style='color:white'>4 | Data Imputation </span></b> </div>

## **[CPU] <span style='color:#370FA9'>ENSEMBLE IMPUTATION OF MISSING DATA</span>**
- We import the desired **dataframe** containing missing data by first creating an instance of class <code>imputer</code> & setting argument **ldf**
- We can then set parameters such as <code>coeffCB</code>, <code>coeffkN</code> ...; the full list of class attributes is summarised in **<mark style="background-color:#393939;color:white;border-radius:5px;opacity:1.0">Section 2</mark>**
- Then we fill in the missing data in the dataframe by calling the **<mark style="background-color:#393939;color:white;border-radius:5px;opacity:1.0">impute_model</mark>** method

In [None]:
# impute model on CPU
data_impute_small = imputer(ldf=data_small)
data_impute_small.impute_model()

## **[GPU] <span style='color:#370FA9'>ENSEMBLE IMPUTATION OF MISSING DATA</span>**
- For larger datasets, we can utilise **GPU** acceleration for **<mark style="background-color:#FFC300;color:white;border-radius:5px;opacity:1.0">Catboost</mark>** & **<mark style="background-color:#393939;color:white;border-radius:5px;opacity:1.0">kNR</mark>** (via the scikit-learn-intelex extension), to save some time training the model, setting <code>gpu</code> to **True**
- There is little benefit in using the **GPU** on small datasets, but for the entire dataset (**1M rows**) it should give some speedup for imputation

In [None]:
# impute missing data using gpu
data_impute_large = imputer(ldf=data_large,gpu=True)
data_impute_large.impute_model()

## **<span style='color:#370FA9'>VISUALISE MISSING DATA POST IMPUTATION</span>**
- To visualise the imputed dataframe, we can set <code>show_id</code> to **imputed** in the **<mark style="background-color:#393939;color:white;border-radius:5px;opacity:1.0">plot_na</mark>** method

In [None]:
# check if data has been imputed correctly 
data_impute_small.plot_na(show_id='imputed')

## **<span style='color:#370FA9'>CHECKING UNIVARIATE DISTRIBUTIONS</span>**
- After imputation, we probably want to **compare the data distribution** **<span style='color:#CB61CE'>before</span>** and **<span style='color:#CB61CE'>after</span>**, to make sure the data hasn't been altered significantly
- We can plot the features/columns that had missing data and compare the **<span style='color:#CB61CE'>univariate feature distributions</span>** using the **<mark style="background-color:#393939;color:white;border-radius:5px;opacity:1.0">stats</mark>** method
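
Alongside the histograms, a numeric comparison of summary statistics can flag a shifted distribution quickly. The sketch below is one possible check, not part of the class: `before` is a synthetic column with gaps, and a median fill stands in for whatever imputation was actually used.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# synthetic column with roughly 20% missing values (illustrative only)
values = pd.Series(rng.normal(size=5_000))
before = values.mask(rng.random(5_000) < 0.2)

# stand-in for an imputed column: here a simple median fill
after = before.fillna(before.median())

# side by side summary statistics & their difference
summary = pd.DataFrame({'before': before.describe(),
                        'after': after.describe()})
summary['shift'] = summary['after'] - summary['before']

print(summary.loc[['count', 'mean', 'std', '50%']])
```

A large value in the `shift` column for `mean` or `std` suggests the imputation has changed the shape of the feature and the histograms deserve a closer look.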

In [None]:
# compare univariate distributions before/after
data_impute_small.stats(n_cols=3,height=900,sample=9,barmode='relative')