## <a>Introduction</a>

Welcome to the Tabular Playground Series, a new competition series by Kaggle. Its difficulty sits somewhere between the basic playground competitions and the competitive featured ones.

In this competition, we are given a regression task: predicting a continuous target from the feature columns in the data. All 14 feature columns, cont1 through cont14, are continuous.

Let's get started.

## <a>Loading Packages and Data</a>

In [None]:
import numpy as np 
import pandas as pd 
import os, gc
import matplotlib.pyplot as plt
import seaborn as sns
import math
import lightgbm as lgb
import xgboost as xgb
import optuna

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, GridSearchCV, train_test_split
from sklearn.linear_model import LinearRegression
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
PATH = '../input/tabular-playground-series-jan-2021/'

train = pd.read_csv(PATH + 'train.csv')
test = pd.read_csv(PATH + 'test.csv')
sample = pd.read_csv(PATH + 'sample_submission.csv')

print(train.shape, test.shape)

Both the train and test sets are medium-sized. Let's take a look at the first rows of each.


In [None]:
train.head(10)

In [None]:
test.head(10)

In [None]:
train.info()

In [None]:
test.info()

There are no missing values in either train or test. Also, all the feature columns are of float type.
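
The same two facts can also be checked programmatically rather than read off the `info()` output. Here is a minimal sketch on a toy frame; the frame `toy` and its values are made up for illustration, and one value is deliberately missing so the check has something to find.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train/test (hypothetical values).
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "cont1": [0.12, 0.55, 0.93],
    "cont2": [0.40, np.nan, 0.71],  # one deliberately missing value
})

missing_per_column = toy.isna().sum()            # number of NaNs in each column
feature_dtypes = toy.drop(columns="id").dtypes   # dtypes of the feature columns only

print(missing_per_column)
print(feature_dtypes)
```

On the real data, `train.isna().sum().sum()` and `test.isna().sum().sum()` should both be zero if the observation above holds.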

## <a>EDA</a>

Let's first check the distribution of target variable.


In [None]:
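fig, ax = plt.subplots(1, 2, figsize=(16, 6))
# histplot with a KDE overlay replaces the deprecated distplot
sns.histplot(train['target'], kde=True, ax=ax[0])
# Pass the data by keyword; recent seaborn versions expect x= or y=
sns.boxplot(x=train['target'], ax=ax[1])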

The target variable has a bimodal distribution, and outliers are present. Since we'll be using LightGBM, we won't apply any transformations.
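
To put a number on "outliers are present", one option is the 1.5 × IQR rule, which is the default whisker rule in a box plot. The sketch below runs it on synthetic bimodal data; the sample and its parameters are assumptions chosen for illustration, not the competition target.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic bimodal sample plus three extreme low values (illustrative only).
values = np.concatenate([
    rng.normal(7.0, 0.3, 500),
    rng.normal(8.5, 0.3, 500),
    np.array([0.0, 1.0, 2.0]),
])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # default box plot whisker limits
outlier_mask = (values < lower) | (values > upper)

print(f"{outlier_mask.sum()} of {values.size} points fall outside [{lower:.2f}, {upper:.2f}]")
```

Replacing `values` with `train['target'].to_numpy()` would give the corresponding count for the real target.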

In [None]:
train.describe()

In [None]:
# Passing the axis positionally no longer works in pandas 2.x; name the columns instead
FEATURES = train.drop(columns=['id', 'target']).columns
FEATURES

In [None]:
fig, ax = plt.subplots(7, 2, figsize=(16, 40))
ax = ax.flatten()

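for k, i in enumerate(FEATURES):
    # Overlay the train and test density estimates for each feature
    # (kdeplot replaces the deprecated distplot(hist=False))
    sns.kdeplot(train[i], ax=ax[k], label='train')
    sns.kdeplot(test[i], ax=ax[k], label='test')
    ax[k].legend()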

All the features are multimodal, with a varying number of peaks. The train and test distributions of each feature are almost identical. Let's check the correlations.

In [None]:
x = train.corr()
plt.figure(figsize=(12,12))
sns.heatmap(x, annot=True)

1. There is a cluster of correlated features from cont6 to cont13, but the highest correlation coefficient is 0.83, so there is no need to drop any features.

2. The features show almost no linear correlation with the target.
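
Reading the largest coefficient off a heatmap is error-prone, so it can help to rank the feature pairs directly. This is a minimal sketch on synthetic data; the columns `a`, `b` and `c` are hypothetical stand-ins for the cont features.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=1000)
toy = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.5, size=1000),  # strongly related to "a"
    "c": rng.normal(size=1000),                    # independent noise
})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(ascending=False)

print(pairs.head())
```

Applied to `train[FEATURES]`, the first entry of `pairs` would be the most strongly correlated feature pair.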

Let's check for outliers feature by feature.

In [None]:
fig, ax = plt.subplots(7, 2, figsize=(16, 40))
ax = ax.flatten()

for k, i in enumerate(FEATURES):
    # Pass the data by keyword; recent seaborn versions expect x= or y=
    sns.boxplot(x=train[i], ax=ax[k])

## <a>Feature Engineering</a>

In [None]:
train['train'] = 1
test['train'] = 0
target = train.target

In [None]:
# Stack train and test row-wise (axis must be given by keyword in pandas 2.x)
combined_df = pd.concat([train, test], axis=0)
combined_df = combined_df.sort_values(by='id', ascending=True)
combined_df

In [None]:
for i in FEATURES:
    # Lag features (previous rows) and lead features (following rows), in id order
    for lag in (1, 5, 10, -1, -5, -10):
        combined_df[f'{i}_lag_{lag}'] = combined_df[i].shift(lag)

    # Rolling statistics over trailing windows of 50, 20 and 10 rows.
    # Each statistic is stored under its own column name so none is overwritten.
    for window in (50, 20, 10):
        rolling = combined_df[i].rolling(window=window)
        combined_df[f'{i}_{window}_rl_max'] = rolling.max()
        combined_df[f'{i}_{window}_rl_min'] = rolling.min()
        combined_df[f'{i}_{window}_rl_std'] = rolling.std()
        combined_df[f'{i}_{window}_rl_mean'] = rolling.mean()
        combined_df[f'{i}_{window}_rl_median'] = rolling.median()
combined_df
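
To see what the `shift` and `rolling` calls above produce, here is a minimal sketch on a five-row series. The window of 3 is chosen only to keep the output small; the notebook itself uses windows of 10, 20 and 50.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], name="cont1")

demo = pd.DataFrame({
    "cont1": s,
    "lag_1": s.shift(1),                      # previous row's value; first row is NaN
    "lag_-1": s.shift(-1),                    # next row's value; last row is NaN
    "rl3_mean": s.rolling(window=3).mean(),   # NaN until 3 rows are available
    "rl3_max": s.rolling(window=3).max(),
})
print(demo)
```

The NaN values at the edges are left in place in this notebook, which works because LightGBM handles missing values natively.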

## <a>Model</a>

In [None]:
cv = KFold(n_splits=5, shuffle=True)
cv

In [None]:
X = combined_df[combined_df['train'] == 1].drop(['id', 'target', 'train'], axis=1)
y = combined_df[combined_df['train'] == 1].target
print(X.shape, y.shape)

In [None]:
NUM_BOOST_ROUNDS = 10000
EARLY_STOPPING_ROUNDS = 1000
VERBOSE_EVAL = 1

oof_df = train[['id', 'target']].copy()
fold_ = 1

# Hold out 10% of the training rows for scoring each Optuna trial
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1)

def objective(trial):
    train_set = lgb.Dataset(X_train, y_train)

    param = {
        "objective": "regression",
        "metric": "rmse",
        "verbosity": 1,
        "boosting_type": "gbdt",
        "learning_rate": trial.suggest_float("learning_rate", 0.001, 0.005),
        "num_leaves": trial.suggest_int("num_leaves", 128, 512),
        "max_depth": trial.suggest_int("max_depth", 3, 31),
        "feature_fraction": trial.suggest_float("feature_fraction", 0.7, 0.9),
        # Note: bagging_fraction only takes effect when bagging_freq > 0
        "bagging_fraction": trial.suggest_float("bagging_fraction", 0.7, 0.9),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
    }

    # No validation set or early stopping here, so LightGBM trains its
    # default number of boosting rounds (100) and predicts with all of them
    model = lgb.train(param, train_set)
    val_preds = model.predict(X_val)
    rmse = math.sqrt(mean_squared_error(y_val, val_preds))
    # Return the negative RMSE because the study below maximizes
    return -rmse

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=10)
trial = study.best_trial
print(trial.params)
trial.params['metric'] = 'rmse'
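
The objective returns the negative RMSE and the study maximizes it, which picks the same parameters as minimizing the RMSE directly. The sketch below shows that equivalence with plain random search standing in for Optuna's sampler; `validation_rmse` is a hypothetical placeholder for "train a model and score it on the hold-out set", not a real scoring function.

```python
import numpy as np

rng = np.random.default_rng(0)

def validation_rmse(learning_rate):
    # Hypothetical stand-in for training a model and scoring it on held-out data.
    return (learning_rate - 0.003) ** 2 * 1e4 + 0.70

# Ten candidates drawn from the same range the notebook uses for learning_rate.
candidates = rng.uniform(0.001, 0.005, size=10)
rmse = np.array([validation_rmse(lr) for lr in candidates])

best_by_min = candidates[np.argmin(rmse)]    # what direction="minimize" on RMSE would pick
best_by_max = candidates[np.argmax(-rmse)]   # what direction="maximize" on -RMSE picks
print(best_by_min, best_by_max)
```

Both selections agree, so the sign flip is purely a matter of convention.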

In [None]:
# Test-set features are the same for every fold, so build them once
X_test = combined_df[combined_df['train'] == 0].drop(['id', 'target', 'train'], axis=1)

for train_idx, val_idx in cv.split(X, y):
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    # cv.split yields positional indices, so use .iloc for y as well
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

    train_set = lgb.Dataset(X_train, y_train)
    val_set = lgb.Dataset(X_val, y_val)

    # Early stopping and logging are passed as callbacks (LightGBM >= 3.3);
    # the early_stopping_rounds / verbose_eval arguments were removed in 4.0
    model = lgb.train(trial.params,
                      train_set,
                      num_boost_round=NUM_BOOST_ROUNDS,
                      valid_sets=[train_set, val_set],
                      callbacks=[lgb.early_stopping(EARLY_STOPPING_ROUNDS),
                                 lgb.log_evaluation(VERBOSE_EVAL)])

    val_preds = model.predict(X_val, num_iteration=model.best_iteration)
    test_preds = model.predict(X_test, num_iteration=model.best_iteration)

    # Store out-of-fold predictions for the validation rows of this fold
    oof_df.loc[oof_df.index[val_idx], 'oof'] = val_preds
    sample[f'fold{fold_}'] = test_preds

    # Per-fold RMSE on the validation rows
    score = mean_squared_error(y_val, val_preds)
    print(math.sqrt(score))
    fold_ += 1

In [None]:
print(math.sqrt(mean_squared_error(oof_df.target, oof_df.oof)))
# Average the per-fold test predictions into the final target
sample['target'] = sample.drop(['id', 'target'], axis=1).mean(axis=1)
sample[['id', 'target']].to_csv('submission.csv', index = False)
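
The loop above does two things at once: it collects out-of-fold (OOF) predictions for every training row and averages the per-fold test predictions for the submission. Here is a minimal, self-contained sketch of that pattern on synthetic data. Ordinary least squares stands in for LightGBM, and all names and sizes are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_folds = 100, 20, 5

# Synthetic linear data (illustrative only).
X = rng.normal(size=(n_train, 3))
X_test = rng.normal(size=(n_test, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=n_train)

# Shuffle the row positions and cut them into n_folds validation sets.
folds = np.array_split(rng.permutation(n_train), n_folds)
oof = np.empty(n_train)
test_preds = np.zeros((n_folds, n_test))

for k, val_idx in enumerate(folds):
    train_idx = np.setdiff1d(np.arange(n_train), val_idx)
    # Least squares stands in for the LightGBM model here.
    w, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
    oof[val_idx] = X[val_idx] @ w     # each training row is predicted exactly once
    test_preds[k] = X_test @ w        # one set of test predictions per fold

oof_rmse = np.sqrt(np.mean((y - oof) ** 2))
final_test_pred = test_preds.mean(axis=0)   # average across folds, as in the submission
print(f"OOF RMSE: {oof_rmse:.4f}")
```

Because every row is scored by a model that never saw it during training, the OOF RMSE is a reasonable estimate of how the averaged model will do on unseen data.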