# CatBoost Tabular Playground Series Prediction

## Summary
In this notebook, I use a CatBoost regressor on the Tabular Playground Series (August 2021) competition. I try a hyperparameter search and K-Fold cross-validation to see whether either has an impact on the test results.

## Import Packages

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

## Common Functions

In [None]:
def submit(model, test_features, test_ids, filename):
    loss_pred = model.predict(test_features)
    submission = pd.DataFrame({"id": test_ids, "loss": loss_pred.reshape(-1)})
    submission.to_csv(filename, index=False)

## Load datasets

In [None]:
train = pd.read_csv("/kaggle/input/tabular-playground-series-aug-2021/train.csv")
test = pd.read_csv("/kaggle/input/tabular-playground-series-aug-2021/test.csv")

## EDA

In [None]:
train.head()

In [None]:
train.shape

In [None]:
train.describe().transpose()

There isn't an obvious linear correlation between any single feature and the target value `loss`.

In [None]:
corr_score = train.corr()

In [None]:
corr_score["loss"].sort_values(ascending=False)
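
Sorting the signed scores places strong negative correlations at the bottom of the list. As a small sketch on synthetic data (the `toy` frame and its columns are made up for illustration, not taken from the competition data), ranking by absolute value makes both directions visible at once:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic frame (hypothetical columns): f0 is negatively related to loss, f1 is pure noise.
n = 500
f0 = rng.normal(size=n)
f1 = rng.normal(size=n)
loss = -2.0 * f0 + rng.normal(scale=0.5, size=n)
toy = pd.DataFrame({"f0": f0, "f1": f1, "loss": loss})

# Rank features by the magnitude of their Pearson correlation with the target.
abs_corr = toy.corr()["loss"].drop("loss").abs().sort_values(ascending=False)
print(abs_corr)
```

On the real data, the same expression with `train` in place of `toy` would presumably give the ranking for the competition features.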

## Data Preprocessing

### Drop id column

In [None]:
train.pop("id")
test_ids = test.pop("id")

In [None]:
train_mean = train.mean()
train_std = train.std()

In [None]:
train_targets_mean = train_mean.pop("loss")
train_targets_std = train_std.pop("loss")

### Train Validation Split

In [None]:
validation_split = 0.2

In [None]:
train_features, validation_features = train_test_split(train, test_size=validation_split)

In [None]:
train_targets = train_features.pop("loss")
validation_targets = validation_features.pop("loss")

### Data Scaling

In [None]:
should_scale = False
if should_scale:
    # Standardize every split with the statistics computed from train.csv.
    train_features = (train_features - train_mean) / train_std
    validation_features = (validation_features - train_mean) / train_std
    test_features = (test - train_mean) / train_std
    print(test_features.head())
    print(train_features.head())
    print(validation_features.head())
else:
    test_features = test
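
As a minimal sketch of the scaling step, the snippet below standardizes a small made-up feature matrix with the mean and standard deviation of the training rows only, then reuses those statistics for the validation rows. The `standardize` helper and the toy arrays are illustrative assumptions, not part of the notebook; the notebook itself computes `train_mean` and `train_std` on the full `train` frame before the split.

```python
import numpy as np

def standardize(features, mean, std):
    """Apply z-score scaling with precomputed statistics."""
    return (features - mean) / std

# Toy data (hypothetical): 4 training rows and 2 validation rows, 2 features each.
toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
toy_val = np.array([[2.5, 25.0], [5.0, 50.0]])

# Statistics come from the training rows only, so the validation rows stay unseen.
toy_mean = toy_train.mean(axis=0)
toy_std = toy_train.std(axis=0, ddof=1)  # ddof=1 matches the pandas default for std()

scaled_train = standardize(toy_train, toy_mean, toy_std)
scaled_val = standardize(toy_val, toy_mean, toy_std)

print(scaled_train.mean(axis=0))  # close to 0 for every feature
print(scaled_val)
```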

## Model Development
### Using CatBoost

In [None]:
import time

import catboost
from sklearn.metrics import mean_squared_error

begin = time.time()

# Candidate values for each hyperparameter.
parameters = {
    "depth": [6, 7, 8],
    "learning_rate": [0.08, 0.1],
    "iterations": [300, 350],
}

def train_catboost(hyperparameters, X_train, X_val, y_train, y_val):
    # Tune one hyperparameter at a time, keeping the best value found so far for the others.
    keys = hyperparameters.keys()
    best_index = {key: 0 for key in keys}
    best_cat = None
    best_score = float("inf")
    for key in keys:
        print("Find best parameter for %s" % key)
        best_parameter = None
        temp_best = float("inf")
        for key_index, item in enumerate(hyperparameters[key]):
            # Use the candidate for the parameter being tuned and the current best for the rest.
            iterations = hyperparameters["iterations"][best_index["iterations"]] if key != "iterations" else item
            learning_rate = hyperparameters["learning_rate"][best_index["learning_rate"]] if key != "learning_rate" else item
            depth = hyperparameters["depth"][best_index["depth"]] if key != "depth" else item
            print("Train with iterations: %d learning_rate: %.2f depth: %d" % (iterations, learning_rate, depth))
            cat = catboost.CatBoostRegressor(
                iterations=iterations,
                learning_rate=learning_rate,
                depth=depth,
            )
            cat.fit(X_train, y_train, verbose=False)
            y_pred = cat.predict(X_val)
            score = np.sqrt(mean_squared_error(y_val, y_pred))
            print("RMSE: %.2f" % score)
            # Best value for the parameter currently being tuned.
            if score < temp_best:
                temp_best = score
                best_index[key] = key_index
                best_parameter = item
            # Best model over every fit so far.
            if score < best_score:
                best_score = score
                best_cat = cat
        print("Best Parameter for %s: " % key, best_parameter)
    best_parameters = {
        "iterations": hyperparameters["iterations"][best_index["iterations"]],
        "learning_rate": hyperparameters["learning_rate"][best_index["learning_rate"]],
        "depth": hyperparameters["depth"][best_index["depth"]],
    }
    return best_cat, best_score, best_parameters

best_cat, best_score, best_parameters = train_catboost(
    parameters, train_features, validation_features, train_targets, validation_targets
)
print("Best CatBoost Model: ", best_cat)
print("Best RMSE: ", best_score)
elapsed = time.time() - begin
print("Elapsed time: ", elapsed)
submit(best_cat, test_features, test_ids, "submission.csv")
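
The search above tunes one hyperparameter at a time rather than trying every combination. A minimal sketch of that idea follows, with a made-up scoring function standing in for training CatBoost; `coordinate_search` and `toy_score` are hypothetical names used only for illustration.

```python
from itertools import product

def coordinate_search(grid, score_fn):
    """Tune one parameter at a time, holding the others at their current best value."""
    best = {name: values[0] for name, values in grid.items()}
    evaluations = 0
    for name, values in grid.items():
        best_score = float("inf")
        best_value = best[name]
        for value in values:
            candidate = dict(best, **{name: value})
            score = score_fn(**candidate)
            evaluations += 1
            if score < best_score:
                best_score, best_value = score, value
        best[name] = best_value
    return best, evaluations

def toy_score(depth, learning_rate, iterations):
    """Stand-in for a validation RMSE (lower is better); not a real model."""
    return abs(depth - 7) + abs(learning_rate - 0.1) * 10 + abs(iterations - 350) / 100

grid = {
    "depth": [6, 7, 8],
    "learning_rate": [0.08, 0.1],
    "iterations": [300, 350],
}

best, evaluations = coordinate_search(grid, toy_score)
full_grid_size = len(list(product(*grid.values())))
print(best, evaluations, full_grid_size)
```

With these candidate lists the sequential search needs 3 + 2 + 2 = 7 fits instead of the 12 a full grid would take, at the cost of possibly missing a combination whose values only work well together.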

I will apply the K-Fold algorithm to train the best model. The validation results look good (sometimes I get 7.82); however, when I submit the results, the scores are about 7.9, with only a small difference between folds.

In [None]:
from sklearn.model_selection import KFold

for fold, (train_indices, val_indices) in enumerate(KFold(n_splits=5, shuffle=True).split(train), start=1):
    print("Training with Fold %d" % fold)
    # Copy the slices so that pop() never acts on a view of `train`.
    X_train = train.iloc[train_indices].copy()
    X_val = train.iloc[val_indices].copy()
    y_train = X_train.pop("loss")
    y_val = X_val.pop("loss")
    if should_scale:
        X_train = (X_train - train_mean) / train_std
        X_val = (X_val - train_mean) / train_std
    # Retrain with the best hyperparameters found by the search.
    cat = catboost.CatBoostRegressor(
        iterations=best_parameters["iterations"],
        learning_rate=best_parameters["learning_rate"],
        depth=best_parameters["depth"],
    )
    cat.fit(X_train, y_train, verbose=False)
    y_pred = cat.predict(X_val)
    score = np.sqrt(mean_squared_error(y_val, y_pred))
    print("RMSE: %.2f" % score)
    # Write one submission file per fold.
    submit(cat, test_features, test_ids, "submission_fold%d.csv" % fold)
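
The loop above writes one submission per fold. A common alternative, which is not what this notebook does, is to average the test predictions of all fold models and to summarize the fold scores by their mean and standard deviation. The sketch below shows this on synthetic data, with an ordinary least-squares fit standing in for CatBoost; every name and number in it is an illustrative assumption.

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)

# Synthetic regression data (hypothetical): y = X @ w + noise.
X = rng.normal(size=(200, 3))
w = np.array([1.0, -2.0, 0.5])
y = X @ w + rng.normal(scale=0.1, size=200)
X_test = rng.normal(size=(50, 3))

fold_scores = []
test_predictions = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Least-squares fit as a stand-in for the CatBoost model.
    coef, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    val_pred = X[val_idx] @ coef
    fold_scores.append(np.sqrt(np.mean((y[val_idx] - val_pred) ** 2)))
    test_predictions.append(X_test @ coef)

# One prediction per test row, averaged over the five fold models.
averaged_prediction = np.mean(test_predictions, axis=0)
print("RMSE per fold:", np.round(fold_scores, 3))
print("Mean RMSE: %.3f, std: %.3f" % (np.mean(fold_scores), np.std(fold_scores)))
```

A small standard deviation across folds would be consistent with the observation that the per-fold submissions score almost the same.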

## Conclusion
The hyperparameter search doesn't affect the result much. The K-Fold algorithm has an obvious impact on the validation score, but no obvious impact on the test result.
