# Tabular Playground Series - Jan 2021

The Kaggle Tabular Playground for January offers us a black-box challenge: an array of readings labeled cont1-cont14 that translate to a target value. The goal is to use the training data, which supplies both the readings and the resulting target, to build a model that accurately predicts the target for a test dataset where only the readings are supplied.

### The Preliminaries: import the usual basic libraries

In [None]:
import numpy as np
import pandas as pd
from pathlib import Path

import os
#for dirname, _, filenames in os.walk('/kaggle/input'):
#    for filename in filenames:
#        print(os.path.join(dirname, filename))
        
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

input_path = Path('/kaggle/input/tabular-playground-series-jan-2021/')

## Read in the data files

In [None]:
train = pd.read_csv(input_path / 'train.csv', index_col='id')
display(train.head())

In [None]:
test = pd.read_csv(input_path / 'test.csv', index_col='id')
display(test.head())

## Let us look at the Data

In [None]:
train.describe()

In [None]:
test.describe()

No missing values to worry about, and the datasets look very similar. Let's plot the two datasets as boxplots.
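Before plotting, both observations can also be checked numerically. This is a minimal sketch on synthetic stand-in data (the frames `train_demo` and `test_demo` are made up for illustration and only assume the cont1-cont14 column layout); with the real data you would use `train` and `test` instead.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for the competition files (assumption: only the
# cont1..cont14 column layout is taken from the real data).
rng = np.random.default_rng(0)
cols = [f"cont{i}" for i in range(1, 15)]
train_demo = pd.DataFrame(rng.uniform(0, 1, size=(1000, 14)), columns=cols)
test_demo = pd.DataFrame(rng.uniform(0, 1, size=(800, 14)), columns=cols)

# 1) Total number of missing values in each frame; zero means nothing to impute.
missing_train = int(train_demo.isna().sum().sum())
missing_test = int(test_demo.isna().sum().sum())
print("missing values:", missing_train, missing_test)

# 2) Per-feature means and standard deviations side by side.
summary = pd.DataFrame({
    "train_mean": train_demo.mean(),
    "test_mean": test_demo.mean(),
    "train_std": train_demo.std(),
    "test_std": test_demo.std(),
})
summary["mean_gap"] = (summary["train_mean"] - summary["test_mean"]).abs()

# Features with the largest train/test gap are listed first.
print(summary.sort_values("mean_gap", ascending=False).head())
```

A small `mean_gap` for every feature is only a rough sign that the two sets are similarly distributed; the boxplots give the fuller picture.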

In [None]:
boxplot = train.boxplot(column=['cont1', 'cont2', 'cont3', 'cont4', 'cont5', 'cont6', 'cont7', 'cont8', 'cont9', 'cont10', 'cont11', 'cont12', 'cont13', 'cont14'],
                       figsize=(12,9))

In [None]:
boxplot = test.boxplot(column=['cont1', 'cont2', 'cont3', 'cont4', 'cont5', 'cont6', 'cont7', 'cont8', 'cont9', 'cont10', 'cont11', 'cont12', 'cont13', 'cont14'],
                       figsize=(12,9))

Even the distributions of outliers appear the same between the datasets. Just to round things off, let's look at the correlation matrix.

In [None]:
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(15, 12))

sns.heatmap(train.corr(), annot = True,fmt='.1g', vmin=-1, vmax=1, center= 0,cmap= 'coolwarm')

Not seeing any obvious clues for feature engineering, so let's break our training data into training and validation subsets and see what the models tell us.
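A numeric companion to the heatmap is to rank the features by their absolute correlation with the target. The sketch below runs on synthetic data; the frame `demo` and the relationship between `cont3` and `target` are invented purely so there is something to find, and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
cols = [f"cont{i}" for i in range(1, 15)]
demo = pd.DataFrame(rng.uniform(0, 1, size=(2000, 14)), columns=cols)

# Hypothetical target: driven by cont3 plus noise (an assumption for the demo).
demo["target"] = 8 + 1.0 * demo["cont3"] + rng.normal(0, 0.5, size=len(demo))

# Take the 'target' column of the correlation matrix, drop the self-correlation,
# and sort by absolute strength.
target_corr = (demo.corr()["target"]
               .drop("target")
               .abs()
               .sort_values(ascending=False))
print(target_corr.head())
```

On the real data you would call the same chain on `train.corr()` before `target` is popped off. A flat ranking with no feature standing out would match the "no obvious clues" reading of the heatmap.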

### Pull out the target, and make a validation split

In [None]:
target = train.pop('target')

In [None]:
train.head()

One thing I noted in the example notebook: it used a rather low share (60%) of the training data for training. I changed this to a perhaps more standard 80%.
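The split can also be made repeatable by fixing a seed, so that model scores are comparable from run to run. A minimal sketch on made-up data: `X_demo`, `y_demo` and the value `random_state=42` are illustrative assumptions, not something the example notebook or this one sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X_demo = pd.DataFrame(rng.uniform(0, 1, size=(500, 3)),
                      columns=["cont1", "cont2", "cont3"])
y_demo = pd.Series(rng.normal(8, 0.7, size=500), name="target")

# Same 80/20 proportion as in this notebook; random_state is an arbitrary seed.
X_a, X_b, y_a, y_b = train_test_split(X_demo, y_demo,
                                      train_size=0.80, random_state=42)

# Repeating the call with the same seed selects exactly the same rows.
X_a2, X_b2, y_a2, y_b2 = train_test_split(X_demo, y_demo,
                                          train_size=0.80, random_state=42)

print(len(X_a), len(X_b))            # 400 100
print(X_a.index.equals(X_a2.index))  # True
```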

In [None]:
X_train, X_test, y_train, y_test = train_test_split(train, target, train_size=0.80)

# Determination of Best Model 

We now do a preliminary run of some of the common regression models to see which one(s) perform well and are worth pursuing further.

In [None]:
from sklearn.dummy import DummyRegressor
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.linear_model import SGDRegressor
from sklearn.linear_model import BayesianRidge
from sklearn.linear_model import LassoLars
from sklearn.linear_model import ARDRegression
from sklearn.linear_model import PassiveAggressiveRegressor
from sklearn.linear_model import TheilSenRegressor
from sklearn.linear_model import LinearRegression
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

        

In [None]:
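def FitAndScoreModel(df,name, model,X_tr,y_tr,X_tst,y_tst):
    # Fit on the training split and predict the held-out split
    model.fit(X_tr,y_tr)
    Y_pred = model.predict(X_tst)
    # Root mean squared error; it is stored under the 'MSE' column used below
    score = np.sqrt(mean_squared_error(y_tst, Y_pred))
    # DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
    df = pd.concat([df, pd.DataFrame([{'Model': name, 'MSE': score}])], ignore_index=True)
    return df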

Create a blank DataFrame; as we run each model, we will record it and its score.
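One thing to keep in mind when reading the results: the value stored under the `MSE` column is the square root of the mean squared error, i.e. the RMSE, which is in the same units as the target. A small worked example with made-up numbers shows the difference:

```python
import numpy as np

# Made-up targets and predictions, purely for illustration.
y_true = np.array([8.0, 7.5, 9.0, 8.5])
y_pred = np.array([7.5, 8.0, 8.0, 8.5])

# Errors are 0.5, -0.5, 1.0, 0.0, so the squared errors are 0.25, 0.25, 1.0, 0.0.
mse = np.mean((y_true - y_pred) ** 2)   # 1.5 / 4 = 0.375
rmse = np.sqrt(mse)                     # about 0.612

print(f"MSE  = {mse:.3f}")
print(f"RMSE = {rmse:.3f}")
```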

In [None]:
dResults = pd.DataFrame(columns = ['Model', 'MSE'])

In [None]:
regressors = [
    DummyRegressor(strategy='median'),
   # SVR(),
    SGDRegressor(),
    BayesianRidge(),
    LassoLars(),
    ARDRegression(),
    PassiveAggressiveRegressor(),
    LinearRegression(),
    LGBMRegressor(),
    RandomForestRegressor(),
    XGBRegressor()]

for item in regressors:
    print(item)
    # Record the class name rather than the estimator object so the results table stays readable
    dResults = FitAndScoreModel(dResults, type(item).__name__, item, X_train, y_train, X_test, y_test)

# Score Model

In [None]:
dResults.sort_values(by='MSE', ascending=True,inplace=True)
dResults.set_index('MSE',inplace=True)
dResults.head(dResults.shape[0])

So LGBMRegressor, Random Forest and XGBRegressor are the top 3.

Note: As tuning takes a very long time, I am leaving my tuning code in for reference but commented out, and just using the parameters that it generated.

## Tuning LGBM Model

In [None]:
import optuna.integration.lightgbm as lgbTune

#dtrain = lgbTune.Dataset(X_train, label=y_train)
#dval = lgbTune.Dataset(X_test, label=y_test)
#params = {"objective": "regression",
#          "metric": "rmse",
#          'num_leaves':2 ** 8,
#          "verbosity": -1,
#          "boosting_type": "gbdt",
#          "n_estimators":20000, 
#          "early_stopping_round":400,
#          'n_jobs': -1,
#          'learning_rate': 0.005,
#          'max_depth': 8,
#          'tree_learner': 'serial',
#          'colsample_bytree': 0.8,
#          'subsample_freq': 1,
#          'subsample': 0.8,
#          'max_bin': 255}


#model = lgbTune.train(params, dtrain, valid_sets=[dval], verbose_eval=False)

In [None]:
#params = model.params
#params

In [None]:
params={'objective': 'regression',
 'metric': 'rmse',
 'num_leaves': 234,
 'verbosity': -1,
 'boosting_type': 'gbdt',
 'n_jobs': -1,
 'learning_rate': 0.005,
 'max_depth': 8,
 'tree_learner': 'serial',
 'max_bin': 255,
 'feature_pre_filter': False,
 'bagging_fraction': 0.4134640813947842,
 'bagging_freq': 1,
 'feature_fraction': 0.4,
 'lambda_l1': 9.511141306606756,
 'lambda_l2': 1.3196758411622028e-08,
 'min_child_samples': 20,
 'num_iterations': 20000,
 'early_stopping_round': 400}

We will now fit over 10 folds and arrive at LGBM predictions. 
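The out-of-fold (OOF) pattern behind this step can be sketched on a small scale first. The example below is not the LGBM run itself: it uses synthetic arrays (`X_demo`, `y_demo`, `X_new`) and a plain `LinearRegression` as a stand-in model, all of which are assumptions made so the sketch runs quickly and on its own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(3)
X_demo = rng.uniform(0, 1, size=(1000, 5))
coefs = np.array([1.0, -0.5, 0.0, 0.3, 0.0])           # made-up relationship
y_demo = 8 + X_demo @ coefs + rng.normal(0, 0.1, size=1000)
X_new = rng.uniform(0, 1, size=(200, 5))                # plays the role of the test set

folds_demo = KFold(n_splits=10, shuffle=True, random_state=42)
oof_demo = np.zeros(len(X_demo))    # one held-out prediction per training row
test_pred = np.zeros(len(X_new))    # running average over the fold models

for trn_idx, val_idx in folds_demo.split(X_demo):
    lin_model = LinearRegression().fit(X_demo[trn_idx], y_demo[trn_idx])
    # Each row is predicted only by the model that did not see it in training.
    oof_demo[val_idx] = lin_model.predict(X_demo[val_idx])
    # Every fold model contributes 1/n_splits of the final test prediction.
    test_pred += lin_model.predict(X_new) / folds_demo.n_splits

# The OOF predictions give one validation score over the whole training set.
oof_rmse = np.sqrt(np.mean((y_demo - oof_demo) ** 2))
print(f"OOF RMSE: {oof_rmse:.4f}")
```

Computing the same OOF RMSE from `oof` and `target` after the real loop gives a local estimate to compare against the leaderboard score.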

In [None]:
from lightgbm import LGBMRegressor, log_evaluation
from sklearn.model_selection import KFold

n_fold = 10
folds = KFold(n_splits=n_fold, shuffle=True, random_state=42)
train_columns = train.columns.values

oof = np.zeros(len(train))
LGBMpredictions = np.zeros(len(test))
feature_importance_df = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(folds.split(train, target.values)):
    
    strLog = "fold {}".format(fold_)
    print(strLog)
    
    X_tr, X_val = train.iloc[trn_idx], train.iloc[val_idx]
    y_tr, y_val = target.iloc[trn_idx], target.iloc[val_idx]

    # params already sets num_iterations (20000) and early_stopping_round (400),
    # so they are not repeated here
    model = LGBMRegressor(**params)

    # The verbose= and early_stopping_rounds= arguments of fit() were removed in
    # LightGBM 4.0; log_evaluation (LightGBM >= 3.3) prints the metric every 1000 rounds
    model.fit(X_tr, y_tr,
              eval_set=[(X_tr, y_tr), (X_val, y_val)], eval_metric='rmse',
              callbacks=[log_evaluation(period=1000)])
    
    
    oof[val_idx] = model.predict(X_val, num_iteration=model.best_iteration_)

    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = train_columns
    fold_importance_df["importance"] = model.feature_importances_[:len(train_columns)]
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)

    #predictions
    LGBMpredictions += model.predict(test, num_iteration=model.best_iteration_) / folds.n_splits


Let us look at the feature importance for LGBM
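The averaging step is easier to see on a tiny table. The numbers below are hypothetical importances for three features over two folds, laid out like `feature_importance_df`:

```python
import pandas as pd

# Hypothetical per-fold importances (same columns as feature_importance_df).
fi = pd.DataFrame({
    "Feature": ["cont1", "cont2", "cont3", "cont1", "cont2", "cont3"],
    "importance": [10, 30, 20, 14, 26, 20],
    "fold": [1, 1, 1, 2, 2, 2],
})

# Average each feature over the folds, then rank from most to least important.
mean_importance = (fi.groupby("Feature")["importance"]
                     .mean()
                     .sort_values(ascending=False))
print(mean_importance)
# cont2 -> 28.0, cont3 -> 20.0, cont1 -> 12.0
```

In the plotting cell, the `[:3014]` slice keeps every feature because there are only 14 of them, and `sns.barplot` does the per-feature averaging itself when it is handed the unaggregated per-fold rows.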

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

cols = (feature_importance_df[["Feature", "importance"]]
        .groupby("Feature")
        .mean()
        .sort_values(by="importance", ascending=False)[:3014].index)
best_features = feature_importance_df.loc[feature_importance_df.Feature.isin(cols)]

plt.figure()
sns.barplot(x="importance", y="Feature", data=best_features.sort_values(by="importance",ascending=False))
plt.title('LightGBM Features (averaged over folds)')
plt.tight_layout()

## Using XGBoost

Let's repeat the process for XGBoost. Again, tuning is commented out and the parameters that resulted from tuning are used.

In [None]:
#from sklearn.model_selection import GridSearchCV

#xgb = XGBRegressor()
#parameters = {'nthread':[4], 
#              'objective':['reg:squarederror'],
#              'learning_rate': [.01,.03, 0.05, .07], 
#              'max_depth': [5, 6, 7],
#              'min_child_weight':range(1,6,2),
#              'silent': [1],
#              'subsample': [0.7],
#              'colsample_bytree': [0.7],
#              'n_estimators': [500,1000,2000,4000]}

 
  

#xgb_grid = GridSearchCV(xgb,
#                        parameters,
#                        cv = 2,
#                        n_jobs = 5,
#                        verbose=True)

#xgb_grid.fit(train,
#         target)

#print(xgb_grid.best_score_)
#print(xgb_grid.best_params_)

In [None]:
#XGparams = xgb_grid.best_params_
#XGparams

In [None]:
XGparams={'colsample_bytree': 0.7,
 'learning_rate': 0.01,
 'max_depth': 7,
 'min_child_weight': 1,
 'n_estimators': 4000,
 'nthread': 4,
 'objective': 'reg:squarederror',
# 'silent': 1,
 'subsample': 0.7}

As before, we fit over 10 folds.

In [None]:
n_fold = 10
folds = KFold(n_splits=n_fold, shuffle=True, random_state=42)
train_columns = train.columns.values

oof = np.zeros(len(train))
XGpredictions = np.zeros(len(test))
feature_importance_df = pd.DataFrame()

for fold_, (trn_idx, val_idx) in enumerate(folds.split(train, target.values)):
    
    strLog = "fold {}".format(fold_)
    print(strLog)
    
    X_tr, X_val = train.iloc[trn_idx], train.iloc[val_idx]
    y_tr, y_val = target.iloc[trn_idx], target.iloc[val_idx]

    # Assumes xgboost >= 1.6, where early_stopping_rounds is a constructor argument
    # (passing it to fit() was deprecated and later removed)
    model = XGBRegressor(**XGparams, early_stopping_rounds=400)
   
    model.fit(X_tr, y_tr, 
              eval_set=[(X_tr, y_tr), (X_val, y_val)], verbose=1000)
    
    # best_iteration is zero-based, so the best model uses trees 0..best_iteration;
    # iteration_range replaces the removed ntree_limit argument
    best_range = (0, model.best_iteration + 1)
    oof[val_idx] = model.predict(X_val, iteration_range=best_range)

    fold_importance_df = pd.DataFrame()
    fold_importance_df["Feature"] = train_columns
    fold_importance_df["importance"] = model.feature_importances_[:len(train_columns)]
    fold_importance_df["fold"] = fold_ + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)

    #predictions
    XGpredictions += model.predict(test, iteration_range=best_range) / folds.n_splits
   


Again let us look at feature importance

In [None]:
cols = (feature_importance_df[["Feature", "importance"]]
        .groupby("Feature")  
        .mean()  
        .sort_values(by="importance", ascending=False)[:3014].index)
best_features = feature_importance_df.loc[feature_importance_df.Feature.isin(cols)]

plt.figure()
sns.barplot(x="importance", y="Feature", data=best_features.sort_values(by="importance",ascending=False)) 
plt.title('XGBoost Features (averaged over folds)')
plt.tight_layout()

## Submission

In [None]:
# Reading without index_col keeps 'id' as a regular column, which is what the submission file needs
submission = pd.read_csv(input_path / 'sample_submission.csv')

#### Create an LGBM Submission

In [None]:
LGBMsubmission=submission.copy()
LGBMsubmission['target'] = LGBMpredictions
LGBMsubmission.to_csv('submission_LGBM.csv', header=True, index=False)
LGBMsubmission.head()

Score  0.69780

#### Create an XGBoost Submission

In [None]:
XGBoostsubmission=submission.copy()
XGBoostsubmission['target'] = XGpredictions
XGBoostsubmission.to_csv('submission_XGBoost.csv', header=True, index=False)
XGBoostsubmission.head()

Score 0.69945

### Ensemble Solution of LGBM and XGBoost

In [None]:
EnsembledSubmission=submission.copy()
#EnsembledSubmission['target'] = (0.5*XGpredictions)+(0.5*LGBMpredictions)
EnsembledSubmission['target'] = (LGBMpredictions*0.72 + XGpredictions*0.28)
EnsembledSubmission.to_csv('ensembled_submission.csv', header=True, index=False)
EnsembledSubmission.head()

Score 0.69819

Disappointed that the ensemble did not improve on the LGBM score, especially since each model weighted the features differently. I will look into pulling the third-place algorithm, Random Forest, into the mix to see if that makes a change for the better.
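One option for choosing the blend weights without spending leaderboard submissions is to score candidate weights on out-of-fold predictions. The sketch below is an illustration on synthetic data only: `oof_a` and `oof_b` are hypothetical OOF predictions from two models whose errors are mostly shared, which is an assumption about how two boosted-tree models might behave, not a measurement from this notebook. In the notebook as written, `oof` is overwritten by the XGBoost loop, so each model's OOF array would have to be kept under its own name first.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5000
y = rng.normal(8, 0.7, size=n)

# Hypothetical OOF predictions: a large shared error plus a small model-specific one.
shared_err = rng.normal(0, 0.5, size=n)
oof_a = y + shared_err + rng.normal(0, 0.2, size=n)
oof_b = y + shared_err + rng.normal(0, 0.3, size=n)

def rmse(y_true, y_pred):
    """Root mean squared error as a plain float."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Try blend weights w for model A (and 1 - w for model B) on a 0.01 grid.
weights = np.linspace(0, 1, 101)
scores = [rmse(y, w * oof_a + (1 - w) * oof_b) for w in weights]
best_w = float(weights[int(np.argmin(scores))])

print(f"model A alone : {rmse(y, oof_a):.4f}")
print(f"model B alone : {rmse(y, oof_b):.4f}")
print(f"best blend    : {min(scores):.4f} at w = {best_w:.2f}")
```

Because the grid includes w = 0 and w = 1, the best blend on the OOF data can never score worse than either model alone. When most of the error is shared between the models, as simulated here, the gain from blending stays small, which is one possible explanation for the result above.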