### MissXGB

This notebook takes a well-known algorithm called [MissForest](https://academic.oup.com/bioinformatics/article/28/1/112/219101) and tweaks it ever so slightly.

The original Python implementation, *missingpy*, was used as a base. It relies on a random forest, which does not perform all that well on large datasets. In this notebook I therefore removed the RF model and swapped in an XGBoost regression model. It is not yet optimized, so feel free to tune the hyperparameters and see if you can get better results.


In [None]:
!pip install xgboost missingpy --quiet

In [None]:
# missingpy imports the old module path `sklearn.neighbors.base`;
# alias it to the current private module `sklearn.neighbors._base` as a workaround
import sys
import sklearn.neighbors._base
sys.modules['sklearn.neighbors.base'] = sklearn.neighbors._base
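
The workaround above relies on the fact that Python's import system consults `sys.modules` before searching for a module on disk. A minimal sketch of the same mechanism using only the standard library, where `legacy_json` is a hypothetical name invented for this illustration:

```python
import sys
import json

# Register the real `json` module under a second, made-up legacy name.
# `legacy_json` does not exist on disk; it is purely illustrative.
sys.modules["legacy_json"] = json

# The import statement finds the entry in sys.modules and returns it
# without searching the filesystem.
import legacy_json

print(legacy_json is json)
```

Any package that still imports the old name then receives the aliased module object, which is what allows missingpy to load against a newer scikit-learn.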

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from xgboost import XGBClassifier, XGBRegressor
from missingpy import MissForest
from sklearn.model_selection import train_test_split

# List every file under the read-only input directory (/kaggle/input)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


Load the data into a pandas DataFrame.

In [None]:
data_path = '/kaggle/input/tabular-playground-series-jun-2022/data.csv'

In [None]:
df = pd.read_csv(data_path, index_col=0)

### Tweaking the MissForest algo
The MissForest class is simply extended, with its internal workings left intact except for the *_miss_forest* method, which contained the RF model. That method was rebuilt around an XGBoost regressor; other than that, the code is as implemented in the Python package missingpy.
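
To make the overridden loop easier to follow, here is a toy sketch of the same idea in plain Python: mean-fill the gaps, then repeatedly refit a model per column on the observed rows and overwrite the missing entries, stopping once the relative change stops shrinking. It assumes a two-column dataset and uses a one-feature least-squares line as a stand-in for the XGBoost regressor; the helper names (`fit_line`, `impute`) are made up for this illustration and are not part of missingpy.

```python
NAN = float("nan")

def is_missing(v):
    return v != v  # NaN is the only float that is not equal to itself

def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b (stand-in for the regressor)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        return 0.0, mean_y
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return a, mean_y - a * mean_x

def impute(rows, max_iter=10):
    """rows: list of [x0, x1] floats, with NaN marking missing entries."""
    mask = [[is_missing(v) for v in row] for row in rows]
    ximp = [list(row) for row in rows]

    # Initial guess: fill each gap with the mean of the observed column values
    for c in range(2):
        observed = [row[c] for row, m in zip(rows, mask) if not m[c]]
        mean = sum(observed) / len(observed)
        for i, m in enumerate(mask):
            if m[c]:
                ximp[i][c] = mean

    gamma_old = float("inf")
    for _ in range(max_iter):
        previous = [list(row) for row in ximp]
        for c in range(2):
            other = 1 - c
            obs = [i for i, m in enumerate(mask) if not m[c]]
            mis = [i for i, m in enumerate(mask) if m[c]]
            if not mis:
                continue
            # Fit on rows where column c is observed, predict where it is missing
            a, b = fit_line([ximp[i][other] for i in obs],
                            [ximp[i][c] for i in obs])
            for i in mis:
                ximp[i][c] = a * ximp[i][other] + b

        # Relative squared change between successive imputations
        num = sum((new - old) ** 2
                  for r_new, r_old in zip(ximp, previous)
                  for new, old in zip(r_new, r_old))
        den = sum(new ** 2 for r_new in ximp for new in r_new)
        gamma_new = num / den
        if gamma_new >= gamma_old:
            return previous  # the last pass did not improve; keep the one before
        gamma_old = gamma_new
    return ximp  # simplification: the class below returns the previous pass here

rows = [[1.0, 2.0], [2.0, 4.0], [3.0, NAN], [NAN, 8.0], [5.0, 10.0]]
result = impute(rows)
print(result)
```

The class below follows the same structure, but loops over every column, orders the columns by missing count, and fits an XGBoost model with early stopping in place of the line fit.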

In [None]:
class MissXGB(MissForest):
    def __init__(self, *args, **kwargs):
        super().__init__(*args,**kwargs)
    
    def _miss_forest(self, Ximp, mask):
        """The missForest algorithm with XGBoost"""

        # Count missing per column
        col_missing_count = mask.sum(axis=0)

        # Get col and row indices for missing
        missing_rows, missing_cols = np.where(mask)

        if self.num_vars_ is not None:
            # Only keep indices for numerical vars
            # (np.isin replaces np.in1d, which is deprecated in recent NumPy)
            keep_idx_num = np.isin(missing_cols, self.num_vars_)
            missing_num_rows = missing_rows[keep_idx_num]
            missing_num_cols = missing_cols[keep_idx_num]

            # Make initial guess for missing values
            col_means = np.full(Ximp.shape[1], fill_value=np.nan)
            col_means[self.num_vars_] = self.statistics_.get('col_means')
            Ximp[missing_num_rows, missing_num_cols] = np.take(
                col_means, missing_num_cols)

            # Instantiate regression model (our substitution for the RF model).
            # MissForest's `criterion` setting is not used here: XGBRegressor
            # trains with its default squared-error objective.
            # Note: tree_method='gpu_hist' requires a GPU-enabled runtime.
            rf_regressor = XGBRegressor(n_jobs=-1,
                                        eta=0.1,
                                        max_depth=15,
                                        tree_method='gpu_hist',
                                        gpu_id=0)

        

        # 2. misscount_idx: sorted indices of cols in X based on missing count
        misscount_idx = np.argsort(col_missing_count)
        # Reverse order if decreasing is set to True
        if self.decreasing is True:
            misscount_idx = misscount_idx[::-1]

        # 3. While new_gammas < old_gammas & self.iter_count_ < max_iter loop:
        self.iter_count_ = 0
        gamma_new = 0
        gamma_old = np.inf
        col_index = np.arange(Ximp.shape[1])

        while (
                gamma_new < gamma_old) and \
                self.iter_count_ < self.max_iter:

            # 4. store previously imputed matrix
            Ximp_old = np.copy(Ximp)
            if self.iter_count_ != 0:
                gamma_old = gamma_new
            # 5. loop
            for s in misscount_idx:
                # Column indices other than the one being imputed
                s_prime = np.delete(col_index, s)

                # Get indices of rows where 's' is observed and missing
                obs_rows = np.where(~mask[:, s])[0]
                mis_rows = np.where(mask[:, s])[0]

                # If no missing, then skip
                if len(mis_rows) == 0:
                    continue

                # Get observed values of 's'
                yobs = Ximp[obs_rows, s]

                # Get 'X' for both observed and missing 's' column
                xobs = Ximp[np.ix_(obs_rows, s_prime)]
                xmis = Ximp[np.ix_(mis_rows, s_prime)]

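                # 6. Fit an XGBoost model on the observed rows, holding out a
                #    third of them as a validation set for early stopping
                X_train, X_val, y_train, y_val = train_test_split(
                    xobs, yobs, test_size=0.33, random_state=1337)
                # Recent XGBoost releases expect eval_metric and
                # early_stopping_rounds on the estimator rather than in fit()
                rf_regressor.set_params(eval_metric='rmse',
                                        early_stopping_rounds=10)
                rf_regressor.fit(X_train, y_train,
                                 eval_set=[(X_val, y_val)], verbose=0)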
                
                print(f'ColumnIdx:{s}\tRMSE:{rf_regressor.best_score:0.4f}')

                # 7. predict ymis(s) using xmis(x)
                ymis = rf_regressor.predict(xmis, iteration_range=(0, rf_regressor.best_iteration + 1))
                # 8. update imputed matrix using predicted matrix ymis(s)
                Ximp[mis_rows, s] = ymis

            # 9. Update gamma (stopping criterion)
            if self.num_vars_ is not None:
                gamma_new = np.sum((Ximp[:, self.num_vars_] - Ximp_old[:, self.num_vars_]) ** 2) / np.sum((Ximp[:, self.num_vars_]) ** 2)

            print("Iteration:", self.iter_count_)
            self.iter_count_ += 1

        return Ximp_old
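
The stopping criterion computed in step 9 can be written out as follows. The notation is introduced here for illustration: $\mathcal{N}$ stands for the numerical columns (`self.num_vars_`), $X^{\text{new}}$ for `Ximp` after the current pass, and $X^{\text{old}}$ for `Ximp_old`.

$$
\gamma = \frac{\sum_{i}\sum_{j \in \mathcal{N}} \left(X^{\text{new}}_{ij} - X^{\text{old}}_{ij}\right)^2}{\sum_{i}\sum_{j \in \mathcal{N}} \left(X^{\text{new}}_{ij}\right)^2}
$$

The loop keeps going while $\gamma$ decreases from one pass to the next and `max_iter` has not been reached. Once it exits, the method returns `Ximp_old`, the imputation from before the final pass, since a rising $\gamma$ means that final pass did not improve on it.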

In [None]:
xgb_imputer = MissXGB()

In [None]:
imputed_data = xgb_imputer.fit_transform(df)

### Submission

In [None]:
# Keep the original index so the .loc lookups below line up with df
df_imputed = pd.DataFrame(imputed_data, columns=df.columns, index=df.index)
records = []
for col in df.columns:
    # Row labels where this column was missing in the original data
    idxs = df[df[col].isnull()].index.tolist()
    missing = df_imputed[col].loc[idxs]
    new_records = [{'row-col': f'{row}-{col}', 'value': imputed_value}
                   for row, imputed_value in missing.items()]
    records.extend(new_records)
results_df = pd.DataFrame.from_records(records)
results_df.head()
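
For a frame with many columns, the per-column loop can also be expressed through the missing-value mask directly. The sketch below is one possible alternative, shown on a tiny made-up frame; the column names `F_1` and `F_2` and all values are hypothetical. Note that it emits rows in row-major order, whereas the loop above groups them by column.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for `df` (with gaps) and `df_imputed` (gaps filled)
df = pd.DataFrame({"F_1": [1.0, np.nan, 3.0], "F_2": [np.nan, 5.0, 6.0]})
df_imputed = pd.DataFrame({"F_1": [1.0, 2.0, 3.0], "F_2": [5.5, 5.0, 6.0]},
                          index=df.index)

# Positions (row, column) of every originally missing cell
row_pos, col_pos = np.where(df.isna().to_numpy())

results_df = pd.DataFrame({
    # Build the "<row label>-<column name>" identifier for each missing cell
    "row-col": [f"{df.index[r]}-{df.columns[c]}"
                for r, c in zip(row_pos, col_pos)],
    # Pick the imputed value at the same positions
    "value": df_imputed.to_numpy()[row_pos, col_pos],
})
print(results_df)
```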

In [None]:
import datetime 
results_df.to_csv(f'submission_{datetime.date.today()}.csv', index=False)