# This notebook shows how to use SuloRegressor, a new regressor designed to produce a highly performant model
This notebook is derived from the following notebook:
https://www.kaggle.com/code/devsubhash/tps-june-eda-xgb-simpleimputer

Many thanks to Devashree Madhugiri for creating the original notebook. I have modified its model to use the new SuloRegressor.

# What is the SuloRegressor and why do you need it?

The image below shows the performance of SuloRegressor from the featurewiz library. Click it to visit the featurewiz repository:
<a href="https://github.com/AutoViML/featurewiz"><img src="https://i.ibb.co/sbk6S2B/Sulo-Regressor.png" alt="Sulo-Regressor" border="0"></a>

<p style="font-family: Arials; font-size: 20px;text-align: left; font-style: normal;line-height:1.3">For this dataset, you are given (simulated) manufacturing control data that contains missing values due to electronic errors. Your task is to predict the values of all missing data in this dataset.</p>

**Observations on this TPS dataset:**
- The dataset has `1,000,000` rows and `80` columns.
- The dataset contains `4` feature categories: `F_1`, `F_2`, `F_3` and `F_4`.
- `F_1` and `F_4` each contain `15` features, while `F_2` and `F_3` each contain `25` features.
- `25` features are of type `int` and `55` features are of type `float`.
- The task is to fill in the missing values, although not every column has missing values.
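
Observations like these can be checked directly with pandas. The sketch below is only an illustration on a toy frame: the column names follow the `F_<group>_<index>` pattern, the values are random, and placing the integer columns in `F_2` is an assumption made for the example. On the real data the same three checks would be run on the loaded DataFrame.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the competition data (hypothetical values, 5 rows only).
rng = np.random.default_rng(0)
group_sizes = {"F_1": 15, "F_2": 25, "F_3": 25, "F_4": 15}
toy = pd.DataFrame({
    # Assumption for illustration: the integer columns live in F_2.
    f"{group}_{i}": (rng.integers(0, 10, size=5) if group == "F_2" else rng.normal(size=5))
    for group, size in group_sizes.items()
    for i in range(size)
})

# Number of columns per group prefix ("F_1" ... "F_4").
counts = pd.Series(toy.columns).str[:3].value_counts().sort_index()

# Number of integer and floating-point columns.
n_int = toy.select_dtypes(include=[np.integer]).shape[1]
n_float = toy.select_dtypes(include=[np.floating]).shape[1]

# Columns that contain at least one missing value (none in this toy frame).
cols_with_nan = toy.columns[toy.isna().any()].tolist()

print(counts)
print("int columns:", n_int, "| float columns:", n_float)
print("columns with missing values:", cols_with_nan)
```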

### We will use SuloRegressor from the featurewiz library to train on the rows with known values and predict the missing ones. Click the logo below to see the library.
<a href="https://github.com/AutoViML/featurewiz"><img src="https://i.ibb.co/ZLdZMZg/featurewiz-logos.png" alt="featurewiz-logos" border="0"></a>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
import gc
warnings.filterwarnings('ignore')

from sklearn.impute import SimpleImputer
from tqdm.notebook import tqdm
from lightgbm import LGBMRegressor

## Import the data set

In [None]:
df_data = pd.read_csv('../input/tabular-playground-series-jun-2022/data.csv', index_col='row_id')
df_subm =  pd.read_csv('../input/tabular-playground-series-jun-2022/sample_submission.csv', index_col='row-col')

In [None]:
df_data_row_count, df_data_column_count = df_data.shape
print('Total number of rows:', df_data_row_count) 
print('Total number of columns:', df_data_column_count)
df_data.head()

## We will now divide the features into four groups

In [None]:
# row_id is already the index, so every column is a feature (no column to skip)
features = list(df_data.columns)
F1_features = [feat for feat in features if feat.startswith("F_1")]
F2_features = [feat for feat in features if feat.startswith("F_2")]
F3_features = [feat for feat in features if feat.startswith("F_3")]
F4_features = [feat for feat in features if feat.startswith("F_4")]

# Let us perform the feature pre-processing

In [None]:
gc.collect()

In [None]:
df_data1 = df_data.copy()

# Importing SuloRegressor from featurewiz library

In [None]:
# Install featurewiz first (--no-deps skips its dependencies)
!pip install featurewiz --ignore-installed --no-deps
!pip install xlrd --ignore-installed --no-deps

# Pin Pillow, since the version preinstalled on Kaggle does not work here
!pip install Pillow==9.0.0


In [None]:
from featurewiz import SuloRegressor

In [None]:
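from xgboost import XGBRegressor

# Base estimator wrapped by SuloRegressor below.
# Note: these GPU settings target XGBoost < 2.0; newer releases use
# tree_method='hist' together with device='cuda' instead.
xgb_model = XGBRegressor(n_estimators=550,
                         #learning_rate=0.009,
                         n_jobs=-1,
                         tree_method='gpu_hist',
                         gpu_id=0,
                         predictor="gpu_predictor")
xgb_model

In [None]:
spe = SuloRegressor(base_estimator=xgb_model, n_estimators=None, pipeline=True, imbalanced=False,
                    integers_only=False, log_transform=False, verbose=0)
spe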

In [None]:
# F_4 columns get model-based imputation; all other columns get mean imputation
F4_feat = [col for col in df_data1.columns if col.split('_')[1] == '4']
df_new = df_data1[F4_feat].copy()
F123_features = df_data1.drop(F4_feat, axis=1)

In [None]:
feat_imp = SimpleImputer(missing_values=np.nan, strategy='mean')
# Keep the original row_id index so the later concat aligns row by row
imp = pd.DataFrame(feat_imp.fit_transform(F123_features),
                   columns=F123_features.columns, index=F123_features.index)
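
To make the mean strategy concrete, here is a minimal sketch on a hypothetical three-row frame (the column names `a` and `b` and the index labels are made up for the example). Each missing entry is replaced by the mean of the observed values in its own column. Because `fit_transform` returns a plain NumPy array, passing `index=` when rebuilding the DataFrame keeps the original row labels, which matters if the result is later concatenated with another frame.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with a non-default index.
toy = pd.DataFrame(
    {"a": [1.0, np.nan, 3.0], "b": [np.nan, 4.0, 6.0]},
    index=[10, 11, 12],
)

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = pd.DataFrame(
    imputer.fit_transform(toy),
    columns=toy.columns,
    index=toy.index,  # preserve row labels; the array itself carries none
)

# Column "a" has observed mean 2.0 and column "b" has observed mean 5.0,
# so those are the values written into the gaps.
print(filled)
```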

In [None]:
from pandas.api.types import is_numeric_dtype, is_integer_dtype

In [None]:
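import copy

for column in df_new.columns:
    # Rows where this column is known are used for training; the rest are predicted
    df_train = df_new[df_new[column].notna()]
    df_test = df_new[df_new[column].isna()]
    if df_test.empty:
        continue  # nothing to impute in this column

    X = df_train.drop(column, axis=1)
    y = df_train[column]

    # Start from a fresh copy of the unfitted estimator for every column
    model = copy.deepcopy(spe)
    print('Using SuloRegressor with XGB Regressor')
    model.fit(X, y)

    # Scored on the training rows, so this is an optimistic estimate
    score = model.score(X, y)
    print('    model score for %s column = %0.03f' % (column, score))
    # Use .loc instead of chained indexing so the values are written into df_new itself
    df_new.loc[df_test.index, column] = model.predict(df_test.drop(column, axis=1))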
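
The loop above follows a general pattern: split the rows by whether the target column is known, fit on the known rows, predict the missing ones, and write the predictions back. The sketch below shows that pattern in isolation. It is not the notebook's model: `LinearRegression` is a simple stand-in for SuloRegressor, and the toy columns `x1`, `x2` and `target` are hypothetical. The target is an exact linear function of the features, so the imputed values can be compared with the truth.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data with an exact linear relationship.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x1": rng.normal(size=50), "x2": rng.normal(size=50)})
toy["target"] = 2.0 * toy["x1"] + toy["x2"]
truth = toy["target"].copy()

# Knock out a few target values to simulate missing data.
toy.loc[[3, 7, 20], "target"] = np.nan

# Split rows by whether the target is observed.
known = toy[toy["target"].notna()]
missing = toy[toy["target"].isna()]

# Stand-in estimator (assumption: any regressor with fit/predict would slot in here).
model = LinearRegression()
model.fit(known.drop(columns="target"), known["target"])

# Write predictions back with .loc so the original frame is modified in place.
toy.loc[missing.index, "target"] = model.predict(missing.drop(columns="target"))

print("remaining NaNs:", int(toy["target"].isna().sum()))
print("max abs error:", float((toy["target"] - truth).abs().max()))
```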

In [None]:
df = pd.concat([imp, df_new], axis = 1)

In [None]:
for i in tqdm(df_subm.index):
    row = int(i.split('-')[0])
    col = i.split('-')[1]
    df_subm.loc[i, 'value'] = df.loc[row, col]
df_subm.to_csv('submission_xgb.csv')
df_subm.head()
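
Looking up one cell per iteration with `.loc` can be slow when the submission has many rows. A possible alternative, sketched here on hypothetical toy frames that mimic the shapes of `df` and `df_subm`, is to split all `row-col` keys at once and fetch the values with a single NumPy fancy-indexing step. This assumes every key refers to an existing row and column, and that all columns share a numeric dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical toy stand-ins for the imputed frame and the submission frame.
df = pd.DataFrame({"F_1_0": [0.1, 0.2, 0.3], "F_4_1": [1.0, 2.0, 3.0]})
df_subm = pd.DataFrame(
    {"value": 0.0},
    index=pd.Index(["0-F_1_0", "2-F_4_1", "1-F_1_0"], name="row-col"),
)

# Split each "row-col" key once; n=1 splits only at the first hyphen.
keys = df_subm.index.to_series().str.split("-", n=1, expand=True)
rows = keys[0].astype(int).to_numpy()
cols = keys[1].to_numpy()

# Translate labels to integer positions (get_indexer returns -1 for unknown labels).
row_pos = df.index.get_indexer(rows)
col_pos = df.columns.get_indexer(cols)
assert (row_pos >= 0).all() and (col_pos >= 0).all(), "unknown row or column label"

# Fetch every requested cell in one vectorized lookup.
df_subm["value"] = df.to_numpy()[row_pos, col_pos]
print(df_subm)
```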