## <a>Introduction</a>

Welcome to this new competition series by Kaggle. It sits somewhere between the basic playground competitions and the more competitive featured ones.

In this competition, we are given a regression task: predicting a continuous target from the feature columns in the data. The columns cat0 - cat9 are categorical, and the columns cont0 - cont13 are continuous.

Let's get started.

## <a>Loading Packages and Data</a>

In [None]:
import numpy as np 
import pandas as pd 
import os, gc
import matplotlib.pyplot as plt
import seaborn as sns
import math
import lightgbm as lgb
import xgboost as xgb
import optuna

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, GridSearchCV, train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
import warnings
warnings.filterwarnings('ignore')

In [None]:
PATH = '../input/tabular-playground-series-feb-2021/'

train = pd.read_csv(PATH + 'train.csv')
test = pd.read_csv(PATH + 'test.csv')
sample = pd.read_csv(PATH + 'sample_submission.csv')

print(train.shape, test.shape)

Both train and test are medium-sized datasets. Let's take a look at the train set.


In [None]:
train.head(10)

In [None]:
test.head(10)

In [None]:
train.info()

In [None]:
test.info()

We've no missing values in the train and test sets. Let's move on to EDA.
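
The `info()` output shows non-null counts, but an explicit count per column is easier to scan. Below is a minimal sketch of that check; the `toy` frame and its values are hypothetical stand-ins for `train` and `test`, with two gaps planted so the counts are non-zero.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train/test, with two planted gaps.
toy = pd.DataFrame({
    "cat0": ["A", "B", None, "A"],
    "cont0": [0.1, np.nan, 0.3, 0.4],
    "target": [7.2, 6.9, 8.1, 7.5],
})

# Count missing entries per column; all zeros would mean nothing to impute.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Single number summarising the whole frame.
total_missing = int(missing_per_column.sum())
print("total missing:", total_missing)
```

On the real data the same two lines would be applied to `train` and `test` directly.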

## <a>EDA</a>

Let's first check the distribution of the target variable.


In [None]:
fig, ax = plt.subplots(1, 2, figsize=(16, 6))
# histplot(kde=True) replaces the deprecated distplot (needs seaborn >= 0.11)
sns.histplot(train['target'], kde=True, ax=ax[0])
sns.boxplot(x=train['target'], ax=ax[1])

Just like the target in TPS Jan 2021, the target variable here has a bimodal distribution, and outliers are present. We'll be using LightGBM, so there is no need for transformations for now.
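
The points that a boxplot draws beyond its whiskers follow the 1.5 × IQR rule by default, so the outliers can also be counted rather than eyeballed. This is a small sketch of that rule; `iqr_outlier_mask` and the `toy_target` values are hypothetical and not taken from the competition data.

```python
import numpy as np

def iqr_outlier_mask(values, whisker=1.5):
    """Flag values outside [Q1 - whisker*IQR, Q3 + whisker*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return (values < lower) | (values > upper)

# Hypothetical target values with one low and one high extreme.
toy_target = np.array([6.8, 7.0, 7.1, 7.3, 7.4, 7.6, 7.9, 8.1, 0.5, 10.9])

mask = iqr_outlier_mask(toy_target)
print("outliers flagged:", toy_target[mask])
print("share of outliers:", mask.mean())
```

On the real data, `iqr_outlier_mask(train['target'])` would give the corresponding mask.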

In [None]:
train.describe()

In [None]:
FEATURES = train.drop(['id', 'target'], axis=1).columns
FEATURES

In [None]:
fig, ax = plt.subplots(7, 2, figsize=(16, 40))

ax = ax.flatten()

cont_features = ['cont0', 'cont1', 'cont2', 'cont3', 'cont4', 'cont5', 'cont6',
       'cont7', 'cont8', 'cont9', 'cont10', 'cont11', 'cont12', 'cont13']

for k, i in enumerate(cont_features):
    # kdeplot replaces the deprecated distplot(hist=False)
    sns.kdeplot(train[i], ax=ax[k], label='train')
    sns.kdeplot(test[i], ax=ax[k], label='test')
    ax[k].legend()

All the features are multimodal, with a varying number of peaks. The feature distributions in the train and test sets are almost the same. Let's check the categorical features now.
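
To put a number on "almost the same", one option is the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical CDFs (0 means identical samples, 1 means no overlap). The sketch below computes it with NumPy only; `ks_statistic` and the simulated arrays are hypothetical, and this is one possible check rather than what the notebook itself does.

```python
import numpy as np

def ks_statistic(a, b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.concatenate([a, b])  # the gap can only change at observed points
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

# Simulated stand-ins: two samples from the same distribution, one shifted.
rng = np.random.default_rng(0)
train_like = rng.normal(0.5, 0.2, size=2000)
test_like = rng.normal(0.5, 0.2, size=2000)
shifted = rng.normal(0.8, 0.2, size=2000)

same_ks = ks_statistic(train_like, test_like)
shift_ks = ks_statistic(train_like, shifted)
print("same distribution:", same_ks)
print("shifted distribution:", shift_ks)
```

Applied per column as `ks_statistic(train[c], test[c])`, small values would support the visual impression from the density plots.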

In [None]:
fig, ax = plt.subplots(10, 2, figsize=(16, 50))
ax = ax.flatten()

cat_features = ['cat0', 'cat1', 'cat2', 'cat3', 'cat4', 'cat5', 'cat6', 'cat7', 'cat8',
       'cat9']

for k, i in enumerate(cat_features):
    sns.countplot(x=train[i], ax=ax[2*k])
    ax[2*k].set_title(f'{i} (train)')
    sns.countplot(x=test[i], ax=ax[(2*k)+1])
    ax[(2*k)+1].set_title(f'{i} (test)')
    

The countplots look the same for cat0, cat1, cat2, cat4, cat5 and cat6, and different for the rest. Let's look at the correlations now. We can LabelEncode the categorical variables before plotting the correlation matrix.
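
A `LabelEncoder` fitted on train alone raises a `ValueError` when `transform` meets a label it has not seen, so it can be worth confirming that test contains no extra categories before encoding. Here is a minimal sketch of that check; `unseen_categories` and the toy frames are hypothetical.

```python
import pandas as pd

# Hypothetical toy frames; "Z" appears in test only.
toy_train = pd.DataFrame({"cat0": ["A", "B", "A"], "cat1": ["X", "Y", "X"]})
toy_test = pd.DataFrame({"cat0": ["A", "B", "B"], "cat1": ["X", "Z", "Y"]})

def unseen_categories(train_df, test_df, columns):
    """Map each column to the test categories that never occur in train."""
    report = {}
    for col in columns:
        extra = set(test_df[col].unique()) - set(train_df[col].unique())
        if extra:
            report[col] = sorted(extra)
    return report

unseen = unseen_categories(toy_train, toy_test, ["cat0", "cat1"])
print(unseen)  # {'cat1': ['Z']}
```

An empty dictionary for `unseen_categories(train, test, cat_features)` would mean that fitting the encoder on train only is safe.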

In [None]:
for i in cat_features:
    le = LabelEncoder()
    le.fit(train[i])
    train[i] = le.transform(train[i])
    test[i] = le.transform(test[i])

train.head()

In [None]:
x = train.corr()
plt.figure(figsize=(20,20))
sns.heatmap(x, annot=True)

1. There is a cluster of correlated features from cont5 to cont12, but the highest correlation coefficient is 0.63, so there is no need to drop any features.
2. None of the features shows a notable linear correlation with the target.
3. This is very similar to the dataset in TPS Jan 2021.
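
With this many columns, the heatmap annotations are hard to read, so listing the strongest pairs directly can help confirm the observations above. The sketch below ranks the absolute pairwise correlations from the upper triangle of the matrix; the `toy` frame is simulated and only borrows the column names for illustration.

```python
import numpy as np
import pandas as pd

# Simulated stand-in: cont6 is built from cont5, cont7 is independent noise.
rng = np.random.default_rng(42)
base = rng.normal(size=500)
toy = pd.DataFrame({
    "cont5": base,
    "cont6": base + rng.normal(scale=0.5, size=500),
    "cont7": rng.normal(size=500),
})

corr = toy.corr().abs()

# Keep the strict upper triangle so each pair is listed exactly once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(ascending=False)
print(pairs.head())
```

Running the same steps on `train.corr()` gives a sorted list in which the largest entry can be compared against whatever threshold is used for dropping features.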

## <a>Model</a>

In [None]:
cv = KFold(n_splits=5, shuffle=True)
cv

In [None]:
X = train[FEATURES]
y = train.target
print(X.shape, y.shape)

In [None]:
model = lgb.LGBMRegressor()
model

NUM_BOOST_ROUNDS = 20000
EARLY_STOPPING_ROUNDS = 500
VERBOSE_EVAL = 0

oof_df = train[['id', 'target']].copy()
fold_ = 1

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, shuffle=True)

In [None]:
def objective(trial):
    train_set = lgb.Dataset(X_train, y_train)
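    # reference=train_set makes the validation data reuse the training bin boundaries
    val_set = lgb.Dataset(X_val, y_val, reference=train_set)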

    param = {
        "objective": "regression",
        "metric": "rmse",
        "verbosity": 1,
        "boosting_type": "gbdt",

        # LightGBM requires num_leaves > 1, so the search starts at 2
        "num_leaves": trial.suggest_int("num_leaves", 2, 256),
        "max_depth": trial.suggest_int("max_depth", 3, 31),
        "lambda_l1": trial.suggest_float("lambda_l1", 0.0, 10),
        "lambda_l2": trial.suggest_float("lambda_l2", 0.4, 10),
        "feature_fraction": trial.suggest_float("feature_fraction", 0.4, 0.9),
        "bagging_fraction": trial.suggest_float("bagging_fraction", 0.4, 0.9),
        "bagging_freq": trial.suggest_int("bagging_freq", 5, 15),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
    }

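    # Callback API (assumes LightGBM >= 3.3); newer releases no longer accept
    # early_stopping_rounds / verbose_eval as arguments to lgb.train
    model = lgb.train(param,
                      train_set,
                      num_boost_round=NUM_BOOST_ROUNDS,
                      valid_sets=[train_set, val_set],
                      callbacks=[lgb.early_stopping(EARLY_STOPPING_ROUNDS, verbose=False),
                                 lgb.log_evaluation(period=VERBOSE_EVAL)])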
    
    val_preds = model.predict(X_val, num_iteration=model.best_iteration)
    # scikit-learn's argument order is (y_true, y_pred)
    scc = math.sqrt(mean_squared_error(y_val, val_preds))
    return -1*scc
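
The objective returns the negative RMSE because the study is created with `direction="maximize"`. Returning the plain RMSE and using `direction="minimize"` would select the same trial. Below is a small, library-free sketch of why the two conventions agree; `rmse`, `model_a` and `model_b` are hypothetical names with made-up predictions.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up validation targets and predictions from two candidate settings.
y_true = [7.0, 8.0, 6.5, 7.5]
candidates = {
    "model_a": [7.1, 7.9, 6.4, 7.6],
    "model_b": [7.5, 7.0, 6.0, 8.0],
}
scores = {name: rmse(y_true, preds) for name, preds in candidates.items()}

# Minimising RMSE and maximising -RMSE pick the same candidate.
best_by_min = min(scores, key=scores.get)
best_by_max_neg = max(scores, key=lambda name: -scores[name])
print(scores, best_by_min, best_by_max_neg)
```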

In [None]:
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
trial = study.best_trial
trial.params['metric'] = 'rmse'

In [None]:
print(trial.params)

In [None]:
for train_idx, val_idx in cv.split(X, y):
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    # cv.split yields positional indices, so index y with .iloc like X
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

    train_set = lgb.Dataset(X_train, y_train)
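    # reference=train_set makes the validation data reuse the training bin boundaries
    val_set = lgb.Dataset(X_val, y_val, reference=train_set)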

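    # Callback API (assumes LightGBM >= 3.3); period=0 keeps per-round logging off
    model = lgb.train(trial.params,
                      train_set,
                      num_boost_round=NUM_BOOST_ROUNDS,
                      valid_sets=[train_set, val_set],
                      callbacks=[lgb.early_stopping(EARLY_STOPPING_ROUNDS),
                                 lgb.log_evaluation(period=0)]
                      )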

    val_preds = model.predict(X_val, num_iteration=model.best_iteration)
    test_preds = model.predict(
        test[FEATURES], num_iteration=model.best_iteration)

    oof_df.loc[oof_df.iloc[val_idx].index, 'oof'] = val_preds
    sample[f'fold{fold_}'] = test_preds

    # Fold MSE: y_val and val_preds are exactly the rows just written to oof_df
    score = mean_squared_error(y_val, val_preds)
    print(math.sqrt(score))
    fold_ += 1
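
The loop above does two things per fold: it writes out-of-fold (OOF) predictions for rows the model did not train on, and it stores one set of test predictions that is later averaged across folds. The sketch below shows the same bookkeeping with NumPy only. The "model" is a deliberately trivial stand-in that predicts the training-fold mean, and all data are simulated.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_folds = 20, 5, 5
y = rng.normal(7.5, 1.0, size=n_train)  # simulated target

# Shuffle the row positions and cut them into n_folds validation blocks.
indices = rng.permutation(n_train)
folds = np.array_split(indices, n_folds)

oof = np.full(n_train, np.nan)                 # one OOF prediction per train row
test_fold_preds = np.zeros((n_folds, n_test))  # one row of test predictions per fold

for k, val_idx in enumerate(folds):
    train_idx = np.setdiff1d(indices, val_idx)
    # Stand-in "model": predict the mean target of the training part of the fold.
    fold_mean = y[train_idx].mean()
    oof[val_idx] = fold_mean       # only rows held out in this fold are filled
    test_fold_preds[k] = fold_mean

# Every train row was held out exactly once, so oof has no gaps left.
oof_rmse = float(np.sqrt(np.mean((y - oof) ** 2)))
blended_test = test_fold_preds.mean(axis=0)    # average over folds
print("OOF RMSE:", oof_rmse)
print("blended test predictions:", blended_test)
```

The OOF score uses every training row exactly once as validation data, which is why it is usually a steadier estimate than the score from a single hold-out split.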

In [None]:
print(math.sqrt(mean_squared_error(oof_df.target, oof_df.oof)))
sample['target'] = sample.drop(['id', 'target'], axis=1).mean(axis=1)
sample[['id', 'target']].to_csv('submission.csv', index=False)