# Starter Code - sklearn IterativeImputer


#### In this notebook, scikit-learn's ```IterativeImputer``` with an XGBoost estimator is used to fill in the missing values. The imputed values requested by the sample submission are then written to the submission.csv file.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv("../input/tabular-playground-series-jun-2022/data.csv")
submission = pd.read_csv("../input/tabular-playground-series-jun-2022/sample_submission.csv")

In [None]:
df

# Imputation

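Scikit-learn's ```IterativeImputer``` is used. It models each feature that has missing values as a function of the other features and uses that model's predictions to fill in the gaps, cycling through the features over several rounds. More information can be found here: http://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html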

This estimator is still experimental, so ```enable_iterative_imputer``` must be imported explicitly from ```sklearn.experimental``` before ```IterativeImputer``` can be imported.

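The estimator is an XGBoost regressor with GPU acceleration. To run without a GPU, remove ```tree_method="gpu_hist"``` and ```predictor="gpu_predictor"```; XGBoost then falls back to its default CPU settings. Note that newer XGBoost releases (2.0 and later) deprecate these two arguments in favor of ```device="cuda"```.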
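As a minimal CPU-only sketch of the idea, the following fits ```IterativeImputer``` with its default estimator (not the XGBoost model used in this notebook) on a small synthetic array. The data, the 10% missing rate, and names such as ```toy_imputer``` are illustrative assumptions, not part of the competition setup.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the experimental API)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)

# Three strongly correlated synthetic features (hypothetical data).
x = rng.normal(size=200)
X = np.column_stack([
    x,
    2 * x + rng.normal(scale=0.1, size=200),
    -x + rng.normal(scale=0.1, size=200),
])

# Hide roughly 10% of the entries at random.
mask = rng.random(X.shape) < 0.1
X_missing = X.copy()
X_missing[mask] = np.nan

# Default estimator, a few rounds; each feature is predicted from the others.
toy_imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = toy_imputer.fit_transform(X_missing)

print("missing before:", int(np.isnan(X_missing).sum()))
print("missing after: ", int(np.isnan(X_filled).sum()))
```

Observed entries are passed through unchanged; only the entries that were ```NaN``` are replaced by model predictions.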

In [None]:
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (side-effect import that enables IterativeImputer)
from sklearn.impute import IterativeImputer
import xgboost

imputer = IterativeImputer(estimator=xgboost.XGBRegressor(n_estimators=500, tree_method='gpu_hist', predictor="gpu_predictor"),
                           verbose=2,
                           max_iter=20)

imputed_df = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)  # preserve the column names, used later

In [None]:
imputed_df

# Getting the Predicted Values

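The code below loops through the submission file to find the rows and columns for which a value is required. Each identifier has the form ```row-column```, so it is split to recover the row index and the column name. The corresponding value is then read from the imputed dataframe and written to the submission dataframe.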

In [None]:
# iterating over a NumPy array is much faster than using df.iterrows()
for i, row in enumerate(submission.values):
    row_col = row[0]
    # split once into the row index and the column name
    imputed_row, imputed_col = row_col.split("-", 1)
    # imputed_df has a default RangeIndex, so the row label equals the row position
    submission.at[i, "value"] = imputed_df.at[int(imputed_row), imputed_col]
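The loop above can also be written as a single vectorized lookup. The sketch below is one possible approach on toy data; the column name ```row-col```, the feature names, and the ```toy_``` dataframes are assumptions made for illustration. It also assumes the imputed dataframe is entirely numeric with a default integer index.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for imputed_df and submission.
toy_imputed = pd.DataFrame({"F_1_0": [0.1, 0.2, 0.3], "F_1_1": [1.0, 2.0, 3.0]})
toy_submission = pd.DataFrame(
    {"row-col": ["0-F_1_0", "2-F_1_1", "1-F_1_0"], "value": 0.0}
)

# Split every identifier once into a row index and a column name.
parts = toy_submission["row-col"].str.split("-", n=1, expand=True)
row_idx = parts[0].astype(int).to_numpy()

# Translate column names into positional indices (-1 would mean "not found").
col_idx = toy_imputed.columns.get_indexer(parts[1].to_numpy())
assert (col_idx >= 0).all(), "unknown column name in submission"

# Fancy indexing pulls all requested cells in one step.
toy_submission["value"] = toy_imputed.to_numpy()[row_idx, col_idx]
print(toy_submission)
```

On a large submission file this avoids a Python-level loop entirely, at the cost of holding the imputed values as one NumPy array.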

# Submission

In [None]:
submission.head()

In [None]:
submission.to_csv("submission.csv", index=False)
