## <center>Tabular Playground Series - Aug 2021</center>
### <center>CatBoost regressor with tuned hyperparameters + Cross-validation</center>

Kaggle's Tabular Playground Series competition for August 2021.

#### Dataset:
The dataset used for this competition is synthetic, but it is based on a real dataset and was generated using CTGAN.
For this competition, you will be predicting a target loss based on a number of feature columns given in the data. The ground truth loss is integer valued, although predictions can be continuous.
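
The model in this notebook is trained and evaluated with RMSE, which works directly on continuous predictions against integer targets. A minimal sketch of that metric follows; the helper name `rmse` and the sample values are made up for illustration and are not taken from the competition data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Illustrative values only: integer targets, continuous predictions.
y_true_demo = [0, 3, 10, 7]
y_pred_demo = [0.5, 2.5, 9.0, 8.0]
rmse_demo = rmse(y_true_demo, y_pred_demo)
print(f"RMSE: {rmse_demo:.4f}")
```

Rounding predictions to integers is not required by this metric, so the notebook keeps the raw regressor output.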

## Import libraries

In [None]:
import os
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # plotting
import seaborn as sns # plotting
from sklearn.metrics import mean_squared_error # MSE metric
from sklearn.preprocessing import OrdinalEncoder # ordinal encoding categorical variables
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from catboost import CatBoostRegressor
from catboost import Pool
import shap

SEED = 91 # random seed

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)

In [None]:
PATH = '/kaggle/input/tabular-playground-series-aug-2021/' # you can use your own local path

print('Files in directory:')
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print('  '+os.path.join(dirname, filename))
print()

# 1. Load data and first look

In [None]:
try:
    df_train = pd.read_csv(PATH+'train.csv', index_col=0)
    df_test = pd.read_csv(PATH+'test.csv', index_col=0)
    submission = pd.read_csv(PATH+'sample_submission.csv', index_col=0)
    print('All of the data has been loaded successfully!')
except Exception as err:
    print(repr(err))
print()

In [None]:
full_length_data = len(df_train) + len(df_test)
print(f"Train rows: {len(df_train)} ({100*len(df_train)/full_length_data:.1f}%)")
print(f"Test rows:  {len(df_test)} ({100*len(df_test)/full_length_data:.1f}%)")

In [None]:
df_train.info()

In [None]:
df_test.info()

In [None]:
df_train.isna().sum().sum(), df_test.isna().sum().sum()

There are no missing values in either dataset.

# 2. Exploratory Data Analysis (EDA)

In [None]:
df_train.describe().T

In [None]:
df_test.describe().T

#### Target

In [None]:
TARGET='loss'
df_train[TARGET].value_counts()

#### Features

In [None]:
CAT_FEATURES = ['f1', 'f86', 'f55', 'f27']

In [None]:
df_train[CAT_FEATURES].nunique().sort_values()

In [None]:
df_test[CAT_FEATURES].nunique().sort_values()

# 3. Data preprocessing

In [None]:
X = df_train.drop(TARGET, axis=1)
y = df_train[TARGET].copy()

#### Split data into folds

In [None]:
N_FOLDS = 5

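# Note: this hold-out split is not used later in the notebook;
# the cross-validation loop reassigns X_train and y_train for each fold.
X_train, X_val, y_train, y_val = train_test_split(X, y,
                                                  test_size=0.20,
                                                  shuffle=True,
                                                  random_state=SEED)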

kf = KFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)
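
To make the fold logic concrete, here is a small hand-rolled sketch of a shuffled K-fold split on 10 row indices. It is a stand-in written for illustration, not scikit-learn's implementation, and `kfold_indices` is a hypothetical helper name. The point it shows is that every row lands in a validation fold exactly once.

```python
import numpy as np

def kfold_indices(n_samples, n_splits, seed):
    """Yield (train_idx, valid_idx) pairs for a shuffled K-fold split."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, n_splits)
    for i in range(n_splits):
        valid_idx = folds[i]
        # Training rows are all the folds except the current validation fold.
        train_idx = np.concatenate(
            [folds[j] for j in range(n_splits) if j != i]
        )
        yield train_idx, valid_idx

splits = list(kfold_indices(n_samples=10, n_splits=5, seed=91))
for fold_num, (train_idx, valid_idx) in enumerate(splits):
    print(f"Fold {fold_num}: train={len(train_idx)} rows, valid={len(valid_idx)} rows")
```

With 5 folds, each model trains on 80% of the rows and validates on the remaining 20%.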

# 4. Train model

#### Build CatBoost Regressor

In [None]:
catboost_params = {
    'loss_function': 'RMSE',
    'eval_metric': 'RMSE',
    'cat_features': CAT_FEATURES,
    'depth': 5,
    'od_type': "Iter",
    'learning_rate': 0.015,
    'iterations': 8000,
    'early_stopping_rounds': 70,
    'l2_leaf_reg': 2.0,
    'leaf_estimation_method': 'Newton',
    'min_child_samples': 20,
    'bagging_temperature': 40,
    'verbose': 2000,
    'thread_count': 4,
    'random_seed': SEED
}
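
The combination of `od_type: "Iter"` and `early_stopping_rounds: 70` asks for training to stop once the validation metric has not improved for 70 iterations. The sketch below illustrates that patience rule on a made-up list of validation losses. It is a simplified illustration under that assumption, not CatBoost's actual code, and `best_iteration_with_patience` is a hypothetical helper.

```python
def best_iteration_with_patience(valid_losses, patience):
    """Return (best_iter, stopped_at) for a simple patience-based stopping rule."""
    best_loss = float("inf")
    best_iter = -1
    stopped_at = -1
    for i, loss in enumerate(valid_losses):
        stopped_at = i
        if loss < best_loss:
            # New best validation loss: remember where it happened.
            best_loss, best_iter = loss, i
        elif i - best_iter >= patience:
            # No improvement for `patience` iterations: stop training.
            break
    return best_iter, stopped_at

# Illustrative losses only; the final 3.4 is never reached with patience=3.
demo_losses = [5.0, 4.0, 3.5, 3.6, 3.7, 3.8, 3.4]
best_iter, stopped_at = best_iteration_with_patience(demo_losses, patience=3)
print(f"best iteration: {best_iter}, stopped at: {stopped_at}")
```

A small patience can stop before a later improvement, as in this toy example, which is one reason to pair a low learning rate with a fairly generous value.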

Train one model per fold and accumulate the test predictions, feature importances, and RMSE across folds.

In [None]:
predictions_valid = np.zeros((df_train.shape[0],))
predictions = 0
model_fi = 0
mean_rmse = 0

for num, (train_idx, valid_idx) in enumerate(kf.split(X)):
    # split the train data into train and validation
    X_train = X.iloc[train_idx]
    X_valid = X.iloc[valid_idx]
    y_train = y.iloc[train_idx]
    y_valid = y.iloc[valid_idx]
    
    model = CatBoostRegressor(**catboost_params)
    model.fit(X_train, y_train,
             eval_set=(X_valid, y_valid))
    
    # Mean of the predictions
    predictions += model.predict(df_test) / N_FOLDS
    
    # Mean of feature importance
    model_fi += model.feature_importances_ / N_FOLDS 
    
    # Out of Fold predictions
    predictions_valid[valid_idx] = model.predict(X_valid)
    fold_rmse = np.sqrt(mean_squared_error(y_valid, predictions_valid[valid_idx]))
    print(f"Fold {num} | RMSE: {fold_rmse}\n")
    
    mean_rmse += fold_rmse / N_FOLDS
    
print(f"\nMean RMSE across folds: {mean_rmse}")
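
The value printed by the loop is the average of the per-fold RMSE values. That is close to, but not the same as, the RMSE computed once over all out-of-fold predictions. The sketch below shows the difference on made-up residuals for two folds; the numbers are illustrative only.

```python
import numpy as np

# Illustrative residuals (y_true - y_pred) for two validation folds.
residuals_by_fold = [np.array([1.0, -1.0]), np.array([3.0, -3.0])]

# Average of the per-fold RMSE values, as accumulated in the loop.
fold_rmses = [float(np.sqrt(np.mean(r ** 2))) for r in residuals_by_fold]
mean_of_fold_rmses = float(np.mean(fold_rmses))

# RMSE over all out-of-fold residuals pooled together.
pooled_rmse = float(np.sqrt(np.mean(np.concatenate(residuals_by_fold) ** 2)))

print(f"mean of fold RMSEs: {mean_of_fold_rmses:.4f}")
print(f"pooled OOF RMSE:    {pooled_rmse:.4f}")
```

In the notebook, the pooled figure could be obtained from `y` and `predictions_valid` after the loop finishes, since every row receives exactly one out-of-fold prediction.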

#### Feature importance

In [None]:
feature_importance_df = pd.DataFrame(model_fi, index=X.columns)
feature_importance_df.sort_values(by=0, ascending=False)

In [None]:
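# Wrap the data in a Pool so the categorical features are passed explicitly
train_data = Pool(data=X,
                  label=y,
                  cat_features=CAT_FEATURES
                 )

# Note: `model` here is the model trained on the last fold only
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(train_data)
shap.summary_plot(shap_values, X, feature_names=X.columns)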

# 5. Submit predictions

In [None]:
output = pd.DataFrame({'id': df_test.index,
                       TARGET: predictions})
output.to_csv('submission.csv', index=False)