# Table of Contents
<a id="table-of-contents"></a>
- [1 - Introduction](#1)
- [2 - Libraries and Data import](#2)
    - [2.1 - Memory Reduction Functionality](#2.1)
    - [2.2 - Min Max Scaler Function](#2.2)
    - [2.3 - XGB Trainer Function](#2.3)
    - [2.4 - Data Import](#2.4)
- [3 - NA values in train and test](#3)
- [4 - Feature Engineering with NA Values](#4)
- [5 - Filling NA value strategies](#5)
    - [5.1 - No NA handling](#5.1)
    - [5.2 - NA values to Zeros](#5.2)
    - [5.3 - NA Values Imputed to mean](#5.3)
    - [5.4 - NA Values Imputed to median](#5.4)
    - [5.5 - MICE with a dropNa training set](#5.5)
    - [5.6 - MICE with a sampled training set](#5.6)
    - [5.7 - MICE with the whole training set](#5.7)
    - [5.8 - Mean, Median, Mode](#5.8)
- [6 - Results](#6)
- [7 - Comments](#7)
   

[back to top](#table-of-contents)
<a id="1"></a>
# 1 - Introduction
Hello everybody.
In this notebook I compare the performance of several NA-handling strategies.
I applied a simple, fast, barely tuned XGB Classifier in a 6-fold Stratified Cross-Validation to understand which strategy leads to better results.
Hope you like this notebook.
Feel free to comment, and please upvote if you like it.


[back to top](#table-of-contents)
<a id="2"></a>
# 2 - Libraries and Data import

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import random

from xgboost import XGBClassifier

from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import roc_auc_score

# Pandas setting to display more dataset rows and columns
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.float_format', lambda x: '%.5f' % x)

import warnings
warnings.simplefilter(action='ignore', category=UserWarning)

[back to top](#table-of-contents)
<a id="2.1"></a>
## 2.1 - Memory Reduction Functionality

Due to the size of the dataset I had to reduce memory usage to avoid crashing the notebook.


In [None]:
def reduce_memory_usage(df, verbose=True):
    numerics = ["int8", "int16", "int32", "int64", "float16", "float32", "float64"]
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == "int":
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if (
                    c_min > np.finfo(np.float16).min
                    and c_max < np.finfo(np.float16).max
                ):
                    df[col] = df[col].astype(np.float16)
                elif (
                    c_min > np.finfo(np.float32).min
                    and c_max < np.finfo(np.float32).max
                ):
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024 ** 2
    if verbose:
        print(
            "Mem. usage decreased to {:.2f} Mb ({:.1f}% reduction)".format(
                end_mem, 100 * (start_mem - end_mem) / start_mem
            )
        )
    return df
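
As a minimal sketch of what the downcast buys and what it costs, the toy column below (an assumption, not the competition data) is stored as float64, float32 and float16. The names `toy`, `mem64`, `mem32`, `mem16` and the error variables are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one float column with 100,000 values in [0, 1)
rng = np.random.default_rng(0)
toy = pd.DataFrame({"f1": rng.random(100_000)})

# Bytes used by the same values at three precisions
mem64 = toy["f1"].memory_usage(index=False)
mem32 = toy["f1"].astype(np.float32).memory_usage(index=False)
mem16 = toy["f1"].astype(np.float16).memory_usage(index=False)
print(mem64, mem32, mem16)

# Lower precision means the stored values are rounded
max_err32 = (toy["f1"] - toy["f1"].astype(np.float32).astype(np.float64)).abs().max()
max_err16 = (toy["f1"] - toy["f1"].astype(np.float16).astype(np.float64)).abs().max()
print(max_err32, max_err16)
```

Each halving of the width halves the memory, while the rounding error grows. Whether the float16 rounding matters for a given model is something to check rather than assume.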

[back to top](#table-of-contents)
<a id="2.2"></a>
## 2.2 - Min Max Scaler function

In [None]:
def min_max_scaler(train, test):
    x_Mm_scaler = MinMaxScaler()
    X = pd.DataFrame(x_Mm_scaler.fit_transform(train.drop("claim", axis=1)),
                     columns=train.drop("claim", axis=1).columns, index = train.index)
    X_test = pd.DataFrame(x_Mm_scaler.transform(test), columns=test.columns, index = test.index)
    return X, train.claim, X_test
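
A small sketch of how `MinMaxScaler` behaves when NA values are present, using made-up toy frames (`toy_train` and `toy_test` are assumptions). NA values are ignored while the minimum and maximum are learned, and they are still NA after the transform, which is why the scaling can run before any imputation.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy frames with one feature
toy_train = pd.DataFrame({"f1": [0.0, 5.0, np.nan, 10.0]})
toy_test = pd.DataFrame({"f1": [2.5, np.nan, 20.0]})

scaler = MinMaxScaler()
scaled_train = scaler.fit_transform(toy_train)  # NaN is skipped when learning min/max
scaled_test = scaler.transform(toy_test)        # NaN stays NaN; test values may leave [0, 1]

print(scaled_train.ravel())
print(scaled_test.ravel())
```

Note that the test set is scaled with the minimum and maximum learned on the train set, so a test value outside the train range ends up outside [0, 1].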

[back to top](#table-of-contents)
<a id="2.3"></a>
## 2.3 - XGB Trainer function


In [None]:
# xgb parameters obtained by optuna studies: 
xgb_params = {'n_estimators': 10000, 
              'learning_rate': 0.05378242228966539, 
              'subsample': 0.7502075656719964, 
              'colsample_bytree': 0.8665945134281674, 
              'max_depth': 6, 'booster': 'gbtree', 
              'tree_method': 'gpu_hist', 
              'reg_lambda': 99.32136303076183,
              'reg_alpha': 32.78057640555784,
              'random_state': 42,
              'n_jobs': 4}

def xgb_train(name, X, y ,X_test,splits = 6, xgb_params = xgb_params):
  
    
    predictions = pd.DataFrame()
    predictions["id"] = X_test.index
    
    skf = StratifiedKFold(n_splits=splits, shuffle=True, random_state=42)
    oof_preds = np.zeros((X.shape[0],))
    preds = 0
    model_fi = 0
    total_mean_rmse = 0
    fold_roc_auc_score = 0
    total_mean_roc_auc_score = 0

    for fold, (train_indicies, valid_indicies) in enumerate(skf.split(X,y)):

        # skf.split yields positional indices, so rows are selected with .iloc
        X_train, X_valid = X.iloc[train_indicies], X.iloc[valid_indicies]
        y_train, y_valid = y.iloc[train_indicies], y.iloc[valid_indicies]
        print(f"Training fold num. {fold+1} of {splits}")
        
        model = XGBClassifier(**xgb_params)
        model.fit(X_train, y_train,
                  eval_set=[(X_train, y_train), (X_valid, y_valid)],
                  eval_metric="rmse",
                  early_stopping_rounds=100,
                  verbose=False)
#         print("fitted")
        preds += (model.predict_proba(X_test))[:,1] / splits
#         print(preds.shape)
#         print("preds ok")
#         model_fi += model.feature_importances_
#         print("model_fi ok")
        oof_preds[valid_indicies] =  model.predict_proba(X_valid)[:,1]
        # print(oof_preds)
        oof_preds[oof_preds < 0] = 0
#         fold_rmse = np.sqrt(mean_squared_error(y_scaler.inverse_transform(np.array(y_valid).reshape(-1,1)), y_scaler.inverse_transform(np.array(oof_preds[valid_idx]).reshape(-1,1))))
#         fold_rmse = np.sqrt(mean_squared_error(y_valid, oof_preds[valid_indicies]))
        fold_roc_auc_score = roc_auc_score(y_valid, oof_preds[valid_indicies])

        print(f"\nScoring fold num. {fold+1} of {splits}: ROC AUC Score = {fold_roc_auc_score}")

#         print(f"Fold {fold} RMSE: {fold_rmse}")
#         total_mean_rmse += fold_rmse / splits
        total_mean_roc_auc_score += fold_roc_auc_score / splits
    print(f"\nOverall mean Roc AUC Score: {total_mean_roc_auc_score}")
     
    predictions["claim"] = preds
    csv_name = "subm_"+name+".csv"
    predictions.to_csv(csv_name, index=False, header=predictions.columns)
    
    
    return total_mean_roc_auc_score
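
One detail of the cross-validation loop is worth a toy sketch: `StratifiedKFold.split` returns row positions, not index labels. The frame below is hypothetical and deliberately uses an index that does not start at 0, and the "model" is only a stand-in that predicts the training-fold mean of the target.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Hypothetical frame whose index labels run from 100 to 111
X_toy = pd.DataFrame({"f1": np.arange(12, dtype=float)}, index=np.arange(100, 112))
y_toy = pd.Series([0, 1] * 6, index=X_toy.index)

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
oof = np.zeros(len(X_toy))
for train_idx, valid_idx in skf.split(X_toy, y_toy):
    # split() yields positions 0..n-1, so .iloc is the matching accessor
    X_tr, X_va = X_toy.iloc[train_idx], X_toy.iloc[valid_idx]
    y_tr, y_va = y_toy.iloc[train_idx], y_toy.iloc[valid_idx]
    # Stand-in "model": predict the mean target of the training fold
    oof[valid_idx] = y_tr.mean()

print(oof)
```

With a label-based accessor the same loop would fail on this frame, because the labels 0..11 do not exist in its index.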

[back to top](#table-of-contents)
<a id="2.4"></a>
## 2.4 - Data Import

Due to the size of the dataset I chose datatable to load the CSV files into DataFrames.


In [None]:
import datatable as dt  # pip install datatable
# Read the data
train = dt.fread("../input/tabular-playground-series-sep-2021/train.csv").to_pandas().set_index("id")
test = dt.fread("../input/tabular-playground-series-sep-2021/test.csv").to_pandas().set_index("id")
train = reduce_memory_usage(train, verbose=True)
test = reduce_memory_usage(test, verbose=True)

[back to top](#table-of-contents)
<a id="3"></a>
# 3 - NA values in train and test
Let's check how many NA values there are in the train and test DataFrames.

In [None]:
print("(train, test) na --> ",(train.isna().sum().sum(), test.isna().sum().sum()))

There are quite a lot of NA values, and all of them need to be handled.
We studied them in the EDA notebook: https://www.kaggle.com/sgiuri/sep21tp-eda-na-handle-xgbc
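
Beyond the grand total, it can help to see how the NA values are spread. The sketch below uses a made-up three-column frame (`toy` is an assumption) to compute the NA share per column and the share of rows with at least one NA.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({
    "f1": [1.0, np.nan, 3.0, 4.0],
    "f2": [np.nan, np.nan, 1.0, 2.0],
    "f3": [0.5, 0.1, 0.2, 0.3],
})

# Fraction of NA per column, most affected column first
na_share_per_column = toy.isna().mean().sort_values(ascending=False)
# Fraction of rows that contain at least one NA
rows_with_na = toy.isna().any(axis=1).mean()

print(na_share_per_column)
print(f"share of rows with at least one NA: {rows_with_na:.2f}")
```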


[back to top](#table-of-contents)
<a id="4"></a>

# 4 - Feature Engineering with NA Values
Once the NA values are imputed, some information is lost: we no longer know where the NA values were, or how many there were in each row. The following code preserves at least the per-row count.

In [None]:
print(train.shape, test.shape)
train["nNA"] = train.isna().sum(axis = 1)
test["nNA"] = test.isna().sum(axis = 1)
print(train.shape, test.shape)
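
The per-row count keeps "how many" but not "where". As a hedged sketch of one way to keep the positions too (not something this notebook does), `SimpleImputer` has an `add_indicator` option that appends one flag column per feature that had NA during fit. The toy frame and the `_was_na` column names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame: f1 and f2 contain NA, f3 does not
toy = pd.DataFrame({
    "f1": [1.0, np.nan, 3.0],
    "f2": [np.nan, 2.0, 4.0],
    "f3": [5.0, 6.0, 7.0],
})

imputer = SimpleImputer(strategy="mean", add_indicator=True)
filled = imputer.fit_transform(toy)

# Flag columns are appended only for features that had NA during fit
flag_names = [f"{c}_was_na" for c in toy.columns[imputer.indicator_.features_]]
filled = pd.DataFrame(filled, columns=list(toy.columns) + flag_names, index=toy.index)
print(filled)
```

On a dataset with many features this roughly doubles the column count, so the single `nNA` column is the cheaper choice.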

[back to top](#table-of-contents)
<a id="5"></a>

# 5 - Filling NA value strategies
Some ML algorithms don't support NA values in the input DataFrame.
I'd like to compare the effectiveness of several NA-handling methods using a barely tuned XGBC:

* [No NA handling (XGBC can handle a DF with NA)](#5.1)
* [Filling all NA with zeros](#5.2)
* [Filling all NA with the mean value of each feature](#5.3)
* [Filling all NA with the median value of each feature](#5.4)
* [MICE: an ML algorithm that estimates an appropriate feature value, fitted on the training rows without NA](#5.5)
* [MICE fitted on a random sample of the training data](#5.6)
* [MICE fitted on the whole training set](#5.7)
* [Mean, Median, Mode](#5.8)

In [None]:
# I create a dictionary to store training results
results = {}

[back to top](#table-of-contents)
<a id="5.1"></a>
## 5.1 - No NA handling

In [None]:
%%time

X, y, X_test = min_max_scaler(train, test)
print("(X, X_test) na --> ",(X.isna().sum().sum(), X_test.isna().sum().sum()))

results["no_NA_handling"] = xgb_train("no_NA_handling", X, y, X_test)

[back to top](#table-of-contents)
<a id="5.2"></a>
## 5.2 - NA Values Imputed to zeros

We will use sklearn's "SimpleImputer".
We will fit the SimpleImputer on the Train set only, then apply it to both the Train and Test sets to avoid leaking information from the Test set into the imputation.
https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
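
The fit-on-train, apply-to-both pattern can be sketched on toy data as follows. The frames are made up; the point is that the fill value stored in `statistics_` comes from the train set and is reused unchanged for the test set.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frames
toy_train = pd.DataFrame({"f1": [1.0, 2.0, np.nan, 3.0]})
toy_test = pd.DataFrame({"f1": [np.nan, 100.0]})

imputer = SimpleImputer(strategy="mean")
imputer.fit(toy_train)                     # the mean is learned on the train set only

train_filled = imputer.transform(toy_train)
test_filled = imputer.transform(toy_test)  # test NA gets the train mean, not the test mean

print(imputer.statistics_)
print(test_filled.ravel())
```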

In [None]:
from sklearn.impute import SimpleImputer

In [None]:
test.head()

In [None]:
%%time
X, y, X_test = min_max_scaler(train, test)
print("(train, test) na --> ",(X.isna().sum().sum(), X_test.isna().sum().sum()))

imputer = SimpleImputer(strategy="constant", fill_value = 0)
X = pd.DataFrame(imputer.fit_transform(X),
                 columns=X.columns,
                index = X.index)
# Impute the scaled X_test (not the raw `test` frame) so train and test stay on the same scale
X_test = pd.DataFrame(imputer.transform(X_test), 
                      columns=X_test.columns, 
                      index = X_test.index)

print("(train, test) na --> ",(X.isna().sum().sum(), X_test.isna().sum().sum()))
results["NA_to_0"] = xgb_train("NA_to_0", X, y, X_test)

[back to top](#table-of-contents)
<a id="5.3"></a>
## 5.3 - NA Values Imputed to mean

In [None]:
%%time
X, y, X_test = min_max_scaler(train, test)
print("(train, test) na --> ",(X.isna().sum().sum(), X_test.isna().sum().sum()))

imputer = SimpleImputer(strategy="mean")
X = pd.DataFrame(imputer.fit_transform(X),
                 columns=X.columns,
                index = X.index)
# Impute the scaled X_test (not the raw `test` frame) so train and test stay on the same scale
X_test = pd.DataFrame(imputer.transform(X_test), 
                      columns=X_test.columns, 
                      index = X_test.index)

print("(train, test) na --> ",(X.isna().sum().sum(), X_test.isna().sum().sum()))
results["NA_to_mean"] = xgb_train("NA_to_mean", X, y, X_test)


[back to top](#table-of-contents)
<a id="5.4"></a>
## 5.4 - NA Values Imputed to median

In [None]:
%%time
X, y, X_test = min_max_scaler(train, test)
print("(train, test) na --> ",(X.isna().sum().sum(), X_test.isna().sum().sum()))

imputer = SimpleImputer(strategy="median")
X = pd.DataFrame(imputer.fit_transform(X),
                 columns=X.columns, 
                 index = X.index)

# Impute the scaled X_test (not the raw `test` frame) so train and test stay on the same scale
X_test = pd.DataFrame(imputer.transform(X_test), 
                      columns=X_test.columns, 
                      index = X_test.index)

print("(train, test) na --> ",(X.isna().sum().sum(), X_test.isna().sum().sum()))
results["NA_to_median"] = xgb_train("NA_to_median", X, y, X_test)


[back to top](#table-of-contents)
<a id="5.5"></a>
## 5.5 - MICE with a dropNa training set

Applying multivariate feature imputation by chained equations (MICE), with the imputer fitted only on the training rows that contain no NA (a drop-NA training set).
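
As a rough sketch of why a multivariate imputer can beat a per-column statistic, the synthetic example below builds two strongly related columns, hides some values of the second one, and compares `IterativeImputer` with a plain mean fill. The data and all variable names are assumptions; the competition features need not be this strongly related.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer

# Synthetic data (hypothetical): b is almost a linear function of a
rng = np.random.default_rng(0)
a = rng.random(300)
b = 2.0 * a + rng.normal(scale=0.01, size=300)
full = np.column_stack([a, b])

# Hide 30 values of column b
holed = full.copy()
missing_rows = rng.choice(300, size=30, replace=False)
holed[missing_rows, 1] = np.nan

mice_filled = IterativeImputer(random_state=0).fit_transform(holed)
mean_filled = SimpleImputer(strategy="mean").fit_transform(holed)

# Mean absolute error on the hidden values only
mice_mae = np.abs(mice_filled[missing_rows, 1] - full[missing_rows, 1]).mean()
mean_mae = np.abs(mean_filled[missing_rows, 1] - full[missing_rows, 1]).mean()
print(f"MICE MAE: {mice_mae:.4f}  mean-fill MAE: {mean_mae:.4f}")
```

The gain depends entirely on how well the other features predict the missing one; with unrelated features the two fills would be close.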

In [None]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
#train_mice = train.copy(deep=True)

In [None]:
%%time

X, y, X_test = min_max_scaler(train, test)
mice_imputer = IterativeImputer()
mice_train_set = X.dropna()

print("The mice_train_set shape is: ", mice_train_set.shape)
print("mice_train_set na values =", mice_train_set.isna().sum().sum())

mice_imputer.fit(mice_train_set)
X = pd.DataFrame(mice_imputer.transform(X),
                 columns=X.columns,
                 index = X.index)
X_test = pd.DataFrame(mice_imputer.transform(X_test),
                 columns=X_test.columns,
                     index = X_test.index)

print("(train, test) na --> ",(X.isna().sum().sum(), X_test.isna().sum().sum()))

results["NA_to_MICE_naDrop"] = xgb_train("NA_to_MICE_naDrop", X, y, X_test)

[back to top](#table-of-contents)
<a id="5.6"></a>
## 5.6 - MICE with a sampled training set
In this test I use a random sample of the training set with the same number of rows as the previous one.

In [None]:
%%time

X, y, X_test = min_max_scaler(train, test)

mice_imputer = IterativeImputer()

mice_train_set = X.sample(X.dropna().shape[0])
print("The mice_train_set shape is: ", mice_train_set.shape)
print("mice_train_set na values =", mice_train_set.isna().sum().sum())

mice_imputer.fit(mice_train_set)
X = pd.DataFrame(mice_imputer.transform(X),
                 columns=X.columns,
                index = X.index)
X_test = pd.DataFrame(mice_imputer.transform(X_test),
                 columns=X_test.columns,
                     index = X_test.index)



print("(train, test) na --> ",(X.isna().sum().sum(), X_test.isna().sum().sum()))
results["NA_to_MICE_sampled"] = xgb_train("NA_to_MICE_sampled", X, y, X_test)

In [None]:
print("(train, test) na --> ",(X.isna().sum().sum(), X_test.isna().sum().sum()))

[back to top](#table-of-contents)
<a id="5.7"></a>
## 5.7 - MICE with the whole training set
In this test I'll use the whole training set to fit the MICE imputer.

In [None]:
%%time

X, y, X_test = min_max_scaler(train, test)

mice_imputer = IterativeImputer()

mice_train_set = X.copy(deep = True)
print("The mice_train_set shape is: ", mice_train_set.shape)
print("mice_train_set na values =", mice_train_set.isna().sum().sum())

mice_imputer.fit(mice_train_set)
X = pd.DataFrame(mice_imputer.transform(X),
                 columns=X.columns,
                index = X.index)
X_test = pd.DataFrame(mice_imputer.transform(X_test),
                 columns=X_test.columns
                     ,index = X_test.index)

print("(train, test) na --> ",(X.isna().sum().sum(), X_test.isna().sum().sum()))
results["NA_to_MICE"] = xgb_train("NA_to_MICE", X, y, X_test)


[back to top](#table-of-contents)
<a id="5.8"></a>
## 5.8 - Mean, Median, Mode

Idea taken from www.kaggle.com/dlaststark/tps-sep-single-xgboost-model, copied from: https://www.kaggle.com/realtimshady/single-simple-lightgbm

* Mean: normal distribution
* Median: unimodal and skewed
* Mode: all other cases
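
The rule above can be sketched on a toy frame: each column gets one statistic, computed on the train set and reused for the test set. The frames and the `strategy` assignment below are hypothetical; they only stand in for the per-feature dictionary used in the next cell.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames
toy_train = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, np.nan],   # treated as roughly symmetric -> Mean
    "f2": [1.0, 1.0, 2.0, 100.0],    # skewed -> Median
    "f3": [0.0, 0.0, 1.0, np.nan],   # treated as "other" -> Mode
})
toy_test = pd.DataFrame({"f1": [np.nan], "f2": [np.nan], "f3": [np.nan]})
strategy = {"f1": "Mean", "f2": "Median", "f3": "Mode"}  # hypothetical assignment

stat = {
    "Mean": lambda s: s.mean(),
    "Median": lambda s: s.median(),
    "Mode": lambda s: s.mode().iloc[0],
}
# One fill value per column, learned on the train set only
fill_values = {col: stat[kind](toy_train[col]) for col, kind in strategy.items()}

# fillna accepts a {column: value} mapping
train_filled = toy_train.fillna(fill_values)
test_filled = toy_test.fillna(fill_values)
print(fill_values)
```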

In [None]:
%%time

X, y, X_test = min_max_scaler(train, test)
from tqdm import tqdm
features = [x for x in X.columns.values if x[0]=="f"]

fill_value_dict = {
    'f1': 'Mean', 
    'f2': 'Median', 
    'f3': 'Median', 
    'f4': 'Median', 
    'f5': 'Mode', 
    'f6': 'Mean', 
    'f7': 'Median', 
    'f8': 'Median', 
    'f9': 'Median', 
    'f10': 'Median', 
    'f11': 'Mean', 
    'f12': 'Median', 
    'f13': 'Mean', 
    'f14': 'Median', 
    'f15': 'Mean', 
    'f16': 'Median', 
    'f17': 'Median', 
    'f18': 'Median', 
    'f19': 'Median', 
    'f20': 'Median', 
    'f21': 'Median', 
    'f22': 'Mean', 
    'f23': 'Mode', 
    'f24': 'Median', 
    'f25': 'Median', 
    'f26': 'Median', 
    'f27': 'Median', 
    'f28': 'Median', 
    'f29': 'Mode', 
    'f30': 'Median', 
    'f31': 'Median', 
    'f32': 'Median', 
    'f33': 'Median', 
    'f34': 'Mean', 
    'f35': 'Median', 
    'f36': 'Mean', 
    'f37': 'Median', 
    'f38': 'Median', 
    'f39': 'Median', 
    'f40': 'Mode', 
    'f41': 'Median', 
    'f42': 'Mode', 
    'f43': 'Mean', 
    'f44': 'Median', 
    'f45': 'Median', 
    'f46': 'Mean', 
    'f47': 'Mode', 
    'f48': 'Mean', 
    'f49': 'Mode', 
    'f50': 'Mode', 
    'f51': 'Median', 
    'f52': 'Median', 
    'f53': 'Median', 
    'f54': 'Mean', 
    'f55': 'Mean', 
    'f56': 'Mode', 
    'f57': 'Mean', 
    'f58': 'Median', 
    'f59': 'Median', 
    'f60': 'Median', 
    'f61': 'Median', 
    'f62': 'Median', 
    'f63': 'Median', 
    'f64': 'Median', 
    'f65': 'Mode', 
    'f66': 'Median', 
    'f67': 'Median', 
    'f68': 'Median', 
    'f69': 'Mean', 
    'f70': 'Mode', 
    'f71': 'Median', 
    'f72': 'Median', 
    'f73': 'Median', 
    'f74': 'Mode', 
    'f75': 'Mode', 
    'f76': 'Mean', 
    'f77': 'Mode', 
    'f78': 'Median', 
    'f79': 'Mean', 
    'f80': 'Median', 
    'f81': 'Mode', 
    'f82': 'Median', 
    'f83': 'Mode', 
    'f84': 'Median', 
    'f85': 'Median', 
    'f86': 'Median', 
    'f87': 'Median', 
    'f88': 'Median', 
    'f89': 'Median', 
    'f90': 'Mean', 
    'f91': 'Mode', 
    'f92': 'Median', 
    'f93': 'Median', 
    'f94': 'Median', 
    'f95': 'Median', 
    'f96': 'Median', 
    'f97': 'Mean', 
    'f98': 'Median', 
    'f99': 'Median', 
    'f100': 'Mode', 
    'f101': 'Median', 
    'f102': 'Median', 
    'f103': 'Median', 
    'f104': 'Median', 
    'f105': 'Median', 
    'f106': 'Median', 
    'f107': 'Median', 
    'f108': 'Median', 
    'f109': 'Mode', 
    'f110': 'Median', 
    'f111': 'Median', 
    'f112': 'Median', 
    'f113': 'Mean', 
    'f114': 'Median', 
    'f115': 'Median', 
    'f116': 'Mode', 
    'f117': 'Median', 
    'f118': 'Mean'
}


print("(train, test) na --> ",(X.isna().sum().sum(), X_test.isna().sum().sum()))

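for col in tqdm(features):
    # Statistics are computed on the train set only and reused for the test set
    if fill_value_dict.get(col)=='Mean':
        fill_value = X[col].mean()
    elif fill_value_dict.get(col)=='Median':
        fill_value = X[col].median()
    elif fill_value_dict.get(col)=='Mode':
        fill_value = X[col].mode().iloc[0]
    else:
        continue  # no strategy listed for this column: leave it untouched
    
    # Assign back instead of calling fillna(inplace=True) on a column selection
    X[col] = X[col].fillna(fill_value)
    X_test[col] = X_test[col].fillna(fill_value)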

print("(train, test) na --> ",(X.isna().sum().sum(), X_test.isna().sum().sum()))
results["NA_to_MMM"] = xgb_train("NA_to_MMM", X, y, X_test)

In [None]:
train.head()


[back to top](#table-of-contents)
<a id="6"></a>
# 6 - Results

In [None]:
results_df = pd.DataFrame(list(results.items()),columns = ['Strategy','ROC AUC'], index = [(51+x)/10 for x in (range(len(results))) ] )
# Wall times and Public Scores are copied by hand from a previous run
results_df["Wall Time"] = ["5min 5s", "4min 35s", "4min 43s", "4min 51s", "14min 46s", "35min 20s", "1h 24min 28s", "4min 10s" ]
results_df["Public Score"] = [0.81722, 0.78181, 0.78671, 0.78640, 0.81743, 0.81747, 0.81727, 0.81730]
print(results_df.sort_values(by = ['ROC AUC'], ascending=False))

[back to top](#table-of-contents)
<a id="7"></a>
# 7 - Comments
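The best strategy appears to be the MICE imputer fitted on a sampled training set.
This imputer is much slower, but it leads to better results.<br>
I can't yet explain the very low Public Scores of the Simple Imputer strategies. One thing to check is whether the imputer in sections 5.2 to 5.4 was applied to the scaled `X_test` or to the raw `test` frame. More studies soon.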


