# XGBoost is Back!!
## Imputation Using XGBoost

> - This notebook uses XGBoost to impute the missing values in the dataset.
> - It builds on the EDA done in <a href="https://www.kaggle.com/code/raviista/tpsjune22-art-of-eda">this notebook</a>.
> - Please upvote if you find this work useful.

---

# Import Libraries

In [None]:
import os
import warnings

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
from tqdm import tqdm
from xgboost import XGBRegressor

# Suppress warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

# List the input files available under the read-only /kaggle/input directory
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

---

# Load the data

In [None]:
data = pd.read_csv("../input/tabular-playground-series-jun-2022/data.csv")
# The sample submission is indexed by 'row-col' keys of the form '<row_id>-<column name>'
sub = pd.read_csv('../input/tabular-playground-series-jun-2022/sample_submission.csv', index_col='row-col')

In [None]:
data.head()

---

# Creating categorical features and handling rare classes in int64 features

In [None]:
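def categorize(data, col, threshold=2):
    """Add a `cat_<col>` column in which values rarer than `threshold` percent become 'Other'."""
    # Relative frequency of each distinct value (pd.value_counts is deprecated)
    series = data[col].value_counts(normalize=True)
    # Values whose share is below `threshold` percent count as rare
    mask = (series * 100).lt(threshold)
    data[f"cat_{col}"] = np.where(data[col].isin(series[mask].index), 'Other', data[col])
    return data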
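
The rare-class handling in `categorize` can be illustrated on a toy column. The snippet below is a minimal sketch, not the notebook's code: `bucket_rare` and the toy data are hypothetical, and the values are converted to strings first so that `'Other'` and the kept categories share one type.

```python
import pandas as pd


def bucket_rare(values: pd.Series, threshold: float = 2.0) -> pd.Series:
    """Replace values whose share is below `threshold` percent with 'Other'."""
    # Share of each distinct value, in percent
    share = values.value_counts(normalize=True) * 100
    rare = share[share.lt(threshold)].index
    # Keep entries that are not rare; replace the rare ones with 'Other'
    return values.astype(str).where(~values.isin(rare), "Other")


# Hypothetical toy column: 0 appears 60 times, 1 appears 39 times, 7 appears once
toy = pd.Series([0] * 60 + [1] * 39 + [7])
bucketed = bucket_rare(toy)
counts = bucketed.value_counts()
print(counts)
```

With the default threshold of 2 percent, the single `7` (1 percent of the rows) is the only value folded into `'Other'`, which keeps the later one-hot encoding from creating a near-empty column for it.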

In [None]:
int_features = [f for f in data.columns if data[f].dtype == 'int64' and f != 'row_id' and f.startswith("F_2")]

print("Preprocessing int64 feature starts !!")
for col in int_features:
    print(f"Feature -- {col}")
    data = categorize(data, col)
print("Done!!")


In [None]:
cat_cols = [col for col in data.columns if col.startswith("cat_F_2")]
print(cat_cols)
data_cat_cols = pd.get_dummies(data[cat_cols])
data_cat_cols.shape

## Concatenating the data

In [None]:
float_features = [f for f in data.columns if data[f].dtype == 'float64']

total_data = pd.concat([data["row_id"] , data_cat_cols,data[float_features]], axis = 1)
total_data.shape

In [None]:
total_data.head()

---

# Modeling

For modeling, this notebook uses XGBoost.

For every `float64` feature, the train, validation and test sets are built as follows:
> - Keep all rows where that feature is non-null.
> - Fill nulls in the remaining features with -999, the sentinel passed to XGBoost through its `missing` parameter, so that XGBoost's built-in handling of missing values is used.
> - A random 80% of these rows forms the training set.
> - The remaining 20% forms the validation set.
> - All rows where that feature is null form the test set.

This split is rebuilt for every `float64` feature, and an XGBoost model is trained to predict the null entries of that feature.
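
The split described above can be sketched on a small toy frame. This is an illustration under stated assumptions: the frame `toy`, the column names `target`, `a`, `b`, and the constant `SENTINEL` are hypothetical. Only the 80/20 sampling, the `random_state`, and the -999 marker mirror the notebook.

```python
import numpy as np
import pandas as pd

SENTINEL = -999  # same missing-value marker as in the notebook

# Hypothetical toy frame: 'target' plays the role of the column being imputed
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(10, 3)), columns=["target", "a", "b"])
toy.loc[[1, 4], "target"] = np.nan  # rows whose target must be imputed
toy.loc[[2, 4], "a"] = np.nan       # nulls in a predictor column

predictors = [c for c in toy.columns if c != "target"]

# Test set: rows where the target is null; everything else has a known target
test = toy[toy["target"].isna()]
known = toy[toy["target"].notna()]

# Random 80% of the known rows for training, the remaining 20% for validation
train = known.sample(frac=0.8, random_state=18)
val = known.drop(train.index)

# Nulls in the predictors are replaced by the sentinel in all three sets
train_x, train_y = train[predictors].fillna(SENTINEL), train["target"]
val_x, val_y = val[predictors].fillna(SENTINEL), val["target"]
test_x = test[predictors].fillna(SENTINEL)

print(len(train), len(val), len(test))
```

The three sets are disjoint and together cover every row, so each null entry of the target receives exactly one prediction.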

In [None]:
for column in float_features:
    print("-"*50)
    print(f"Modeling for the column - {column}")
    # Predictors: every column except the target and the row identifier
    featuresToUse = list(total_data.columns)
    featuresToUse.remove(column)
    featuresToUse.remove("row_id")

    # Rows where the target is null are the ones to impute (test set)
    test = total_data[total_data[column].isna()]
    print(f"Shape of test data - {test.shape}")
    total_data_left = total_data.loc[(~total_data.index.isin(test.index))]

    # Random 80/20 split of the non-null rows into train and validation
    train = total_data_left.sample(frac=0.8, random_state=18)
    val = total_data_left.loc[(~total_data_left.index.isin(train.index))]
    train_x = train[featuresToUse].fillna(-999)
    train_y = train[column]
    val_x = val[featuresToUse].fillna(-999)
    val_y = val[column]
    print(f"Number of records for training - {train_x.shape[0]}, for validation -  {val_x.shape[0]}")

    # eval_metric and early_stopping_rounds are set on the constructor (XGBoost >= 1.6);
    # passing them to fit() is deprecated. On XGBoost >= 2.0, tree_method="hist" with
    # device="cuda" replaces "gpu_hist".
    xgb = XGBRegressor(n_estimators=100, learning_rate=0.1, missing=-999.0, tree_method="gpu_hist",
                       eval_metric='rmse', early_stopping_rounds=10)
    xgb.fit(train_x, train_y, eval_set=[(train_x, train_y), (val_x, val_y)], verbose=False)

    # Use the same -999 sentinel for the rows to impute as for train and validation
    test_x = test[featuresToUse].fillna(-999)
    pred = xgb.predict(test_x)
    print(f"Null records in column - {column} before imputation - {data[column].isnull().sum()} ")
    # total_data shares its index with data, so predictions can be written back by index
    data.loc[test.index, column] = pred
    print(f"Null records in column - {column} after imputation - {data[column].isnull().sum()} ")


---

# Generating the submission data

In [None]:
for i in tqdm(sub.index):
    row = int(i.split('-')[0])
    col = i.split('-')[1]
    sub.loc[i, 'value'] = data.loc[row, col]

sub.to_csv('submission.csv')
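
The row-by-row loop above performs one `.loc` lookup per submission key, which can be slow on a large submission file. A vectorized lookup is one possible alternative. The sketch below uses small hypothetical stand-ins for `data` and `sub`, and assumes, as the loop does, that each key has the form `<row_id>-<column name>` and that row ids match the index labels of `data`.

```python
import pandas as pd

# Hypothetical stand-ins for the imputed `data` frame and the `sub` frame
data = pd.DataFrame({"F_1_0": [0.1, 0.2, 0.3], "F_3_0": [1.0, 2.0, 3.0]})
sub = pd.DataFrame(
    {"value": 0.0},
    index=pd.Index(["0-F_1_0", "2-F_3_0", "1-F_1_0"], name="row-col"),
)

# Split each key once, at the first '-', into a row label and a column name
keys = sub.index.to_series().str.split("-", n=1, expand=True)
row_labels = keys[0].astype(int).to_numpy()
col_labels = keys[1].to_numpy()

# Restrict to the columns that actually appear in the submission keys
needed = pd.Index(col_labels).unique()
values = data[needed].to_numpy()

# Translate labels to positions, then read all cells with one indexing call
row_pos = data.index.get_indexer(row_labels)
col_pos = needed.get_indexer(col_labels)
sub["value"] = values[row_pos, col_pos]

print(sub)
```

Restricting the lookup to the columns named in the keys avoids converting the string-valued `cat_` columns to a NumPy array.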

In [None]:
sub.head()