# Tabular Playground Series Feb 2021 (Basic Linear Regression Model to LGBM Hyperparameter Tuning with Optuna)

* In this notebook I go through what I learned while participating in the Tabular Playground Series in February.
* The Tabular Playground Series competitions are a fun way for beginners to learn: you can join them after working through Titanic, or even start your Kaggle journey right here.
* Earlier I used to just join a competition and submit the predictions of a basic model, and I wasn't used to learning new concepts from discussions and notebooks. The best way to learn on Kaggle is to go through notebooks; that is how I learned so much during this competition and finished in the top 19%.
* I started with a basic Linear Regression model and then, while going through other people's work, understood the importance of hyperparameter tuning. We will go through all of this in this notebook.
* This notebook should be useful for beginners who want to try out each model and learn new things, such as training a model with KFold and why hyperparameter tuning matters.

In [None]:
# I don't usually import all the libraries at once, maybe because I can't recall all of those at once.
# I'll be importing the libraries when required

import pandas as pd  
import numpy as np   

In [None]:
data=pd.read_csv('../input/tabular-playground-series-feb-2021/train.csv') # train dataset
test=pd.read_csv('../input/tabular-playground-series-feb-2021/test.csv') # test dataset

# Data Preprocessing

In [None]:
data.head() 

As we can see, the first column, id, is of no use for prediction. The remaining columns are 10 categorical features (cat0 to cat9), the numeric features (cont0 to cont13), and the target.

In [None]:
train=data.copy()
train.shape

In [None]:
test.shape

In [None]:
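import matplotlib.pyplot as plt
import seaborn as sns

# np.object was removed in NumPy 1.24; the string 'object' selects the same columns
cat_cols=train.select_dtypes(include='object').columns
fig,ax=plt.subplots(len(cat_cols),1,figsize=(10,20))
for i,col in enumerate(cat_cols):
    # one bar chart of category counts per categorical feature, titled with the feature name
    train[col].value_counts().plot(kind='bar',ax=ax[i],title=col)
plt.tight_layout()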


For the categorical variables I tried only two types of encoding, LabelEncoding and OneHotEncoding, and in my results OneHotEncoding performed slightly better than LabelEncoding. I prefer LabelEncoding only when a categorical variable has few categories (2 to 4 at most). So I used LabelEncoding for the first three categorical features and OneHotEncoding for the others.
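
As a small illustration of the two encodings, here is a sketch on a toy frame. The column names `size` and `color` and their values are made up for this example and are not taken from the competition data. Each encoder is fitted on the training frame only and then reused on the test frame, so both frames share the same mapping.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Toy data: hypothetical columns, not the competition features
toy_train = pd.DataFrame({'size': ['A', 'B', 'A', 'B'], 'color': ['r', 'g', 'b', 'g']})
toy_test = pd.DataFrame({'size': ['B', 'A'], 'color': ['g', 'r']})

# Label encoding: one integer per category, learned from the training data only
le = LabelEncoder().fit(toy_train['size'])
train_size = le.transform(toy_train['size'])
test_size = le.transform(toy_test['size'])

# One-hot encoding: one 0/1 column per category, again fitted on train only
ohe = OneHotEncoder(handle_unknown='ignore').fit(toy_train[['color']])
train_color = ohe.transform(toy_train[['color']]).toarray()
test_color = ohe.transform(toy_test[['color']]).toarray()

print(dict(zip(le.classes_, le.transform(le.classes_))))  # category -> integer code
print(train_color.shape, test_color.shape)                 # same number of columns in both
```

Label encoding keeps a single column but imposes an arbitrary order on the categories, while one-hot encoding avoids that order at the cost of one extra column per category.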

In [None]:
from sklearn.preprocessing import LabelEncoder
# LabelEncoding for the first three categorical features
for col in cat_cols[0:3]:
    # fit on train only and reuse the mapping for test, so a category gets the same code in both
    le=LabelEncoder().fit(train[col])
    train[col]=le.transform(train[col])
    test[col]=le.transform(test[col])

In [None]:
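from sklearn.preprocessing import OneHotEncoder
ohe_cols=cat_cols[3:]
# handle_unknown='ignore' encodes a category unseen during fit as all zeros instead of raising an error
ohe=OneHotEncoder(handle_unknown='ignore')
ohe.fit(train[ohe_cols])
# name the new columns (e.g. 'cat6_G') so that all column names are strings; needs scikit-learn >= 1.0
ohe_names=ohe.get_feature_names_out(ohe_cols)
train=pd.concat([train.drop(columns=ohe_cols),pd.DataFrame(ohe.transform(train[ohe_cols]).toarray(),columns=ohe_names,index=train.index)],axis=1)
test=pd.concat([test.drop(columns=ohe_cols),pd.DataFrame(ohe.transform(test[ohe_cols]).toarray(),columns=ohe_names,index=test.index)],axis=1)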

In [None]:
train.head()

In [None]:
test.head()

We can also use pd.get_dummies() for one-hot encoding, which is somewhat easier than sklearn's OneHotEncoder. However, get_dummies transforms the dataset directly; there is no separate fit step followed by transform. So if a feature has different categories in the train and test sets, the two encoded datasets end up with different columns, which causes problems. Fitting OneHotEncoder on the train set and then transforming both train and test is the better choice in this case.
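
The column mismatch is easy to reproduce on a toy example. In the sketch below the values are made up; only the idea of a category G that is absent from the test set comes from this notebook. Reindexing the test frame to the train columns is one possible repair, shown here as an alternative to inserting the column by hand.

```python
import pandas as pd

# Toy frames: category 'G' occurs in train but not in test (values are made up)
toy_train = pd.DataFrame({'cat6': ['A', 'B', 'G', 'A']})
toy_test = pd.DataFrame({'cat6': ['B', 'A', 'A']})

train_dummies = pd.get_dummies(toy_train, columns=['cat6'])
test_dummies = pd.get_dummies(toy_test, columns=['cat6'])
print(list(train_dummies.columns))  # includes a cat6_G column
print(list(test_dummies.columns))   # has no cat6_G column

# One way to repair the mismatch: reindex test to the train columns,
# filling the columns that are absent in test with 0
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(list(test_aligned.columns))
```

Reindexing also drops any test-only columns, so the aligned test frame always has exactly the columns the model was trained on.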

In [None]:
# If we want to use get_dummies it can be done this way.

#train=pd.get_dummies(train,prefix_sep='_',columns=cat_cols[3:])
#test=pd.get_dummies(test,prefix_sep='_',columns=cat_cols[3:])

# Here test and train won't have the same columns, as category G of the cat6 feature is missing in the test set.
# We should insert the missing column separately.

# cat6_G=np.zeros(test.shape[0])
#test.insert(loc=35,column='cat6_G',value=cat6_G)


# Model Building

In [None]:
X=train.drop(columns=['target','id'])
Y=train.target

In [None]:
from sklearn.model_selection import train_test_split

# since we have so much data, I've held out only 1% (test_size=0.01) of the training data as test data
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,shuffle=True,test_size=0.01,random_state=100)

In [None]:
print(X_train.shape)
print(X_test.shape)

**First I'll use plain models without any hyperparameter tuning and without KFold. We will see how the performance changes once we add these.**

In [None]:
def get_predictions(model,X_train,Y_train,testset):
    model.fit(X_train,Y_train)
    return model.predict(testset)

In [None]:
from sklearn.metrics import mean_squared_error

def rmse(predictions,Y_test):
    return np.sqrt(mean_squared_error(predictions,Y_test))

In [None]:
from sklearn.linear_model import LinearRegression
rmse(get_predictions(LinearRegression(),X_train,Y_train,X_test),Y_test)

In [None]:
from sklearn.ensemble import RandomForestRegressor
rmse(get_predictions(RandomForestRegressor(),X_train,Y_train,X_test),Y_test)

In [None]:
from xgboost import XGBRegressor
rmse(get_predictions(XGBRegressor(tree_method='gpu_hist'),X_train,Y_train,X_test),Y_test)

In [None]:
from lightgbm import LGBMRegressor
rmse(get_predictions(LGBMRegressor(),X_train,Y_train,X_test),Y_test)

In [None]:
#!pip install catboost
from catboost import CatBoostRegressor

# CatBoost can use categorical features directly,
# so we split the original (unencoded) data into train and test sets.
# 'target' must be dropped from the features as well, otherwise the model sees the answer.

X_train1,X_test1,Y_train1,Y_test1=train_test_split(data.drop(columns=['id','target']),data.target,test_size=0.01,random_state=100)
rmse(get_predictions(CatBoostRegressor(cat_features=list(cat_cols),task_type='GPU'),X_train1,Y_train1,X_test1),Y_test1)

These are the models I worked with to get a lower RMSE by tuning their hyperparameters. We can't ignore LinearRegression, as it is the basic first thing everyone tries on regression problems. Below, hyperparameter tuning is shown for two of the boosting algorithms, XGBoost and LightGBM.

Now we will use KFold with XGBoost and LightGBM and see how the performance improves. Instead of fitting a single model once on the whole training set, we fit one model per fold, each on a different subset of the training data, and average their predictions on the test set.

In [None]:
from sklearn.model_selection import KFold

def get_predictions_with_kfold(model,X_train,Y_train,testset,nfolds):
    kf=KFold(n_splits=nfolds,shuffle=True,random_state=1)
    preds=np.zeros((testset.shape[0]))
    for fold,(train_idx,valid_idx) in enumerate(kf.split(X=X_train)):
        X1,Y1=X_train.iloc[train_idx],Y_train.iloc[train_idx]
        X2,Y2=X_train.iloc[valid_idx],Y_train.iloc[valid_idx]
        
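        # early stopping monitors RMSE on the held-out fold; recent XGBoost and LightGBM releases
        # expect these options in the constructor or as callbacks rather than in fit()
        model.fit(X1,Y1,eval_set=[(X2,Y2)],early_stopping_rounds=500,eval_metric='rmse')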
        preds+=model.predict(testset)/nfolds
    return preds

Compare the RMSE of a model trained once on the whole training set with the RMSE obtained using KFold. Averaging models trained on different folds helps when the dataset is large and noisy, because it reduces the variance of the predictions.
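
The idea can be sketched on synthetic data. Everything in this example is an assumption made for illustration: the data is randomly generated, LinearRegression stands in for the boosting models, and the names `X_demo`, `y_demo` and `X_new` are hypothetical. Besides averaging the test predictions, the loop also collects out-of-fold predictions, which give an RMSE estimate from rows each model never saw.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic regression data (assumption: stands in for the competition features)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
X_new = rng.normal(size=(10, 3))  # plays the role of the unlabeled test set

kf = KFold(n_splits=5, shuffle=True, random_state=1)
oof = np.zeros(len(X_demo))      # out-of-fold predictions for the training rows
avg_pred = np.zeros(len(X_new))  # fold-averaged predictions for the new rows

for train_idx, valid_idx in kf.split(X_demo):
    model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    # each training row is predicted by the one model that never saw it
    oof[valid_idx] = model.predict(X_demo[valid_idx])
    # every fold model contributes equally to the final prediction
    avg_pred += model.predict(X_new) / kf.get_n_splits()

oof_rmse = np.sqrt(np.mean((oof - y_demo) ** 2))
print(f'out-of-fold RMSE: {oof_rmse:.4f}')
```

The out-of-fold RMSE uses every training row exactly once as validation data, so it tends to be a steadier estimate than a single small hold-out split.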

In [None]:
nfolds=5
rmse(get_predictions_with_kfold(XGBRegressor(tree_method='gpu_hist'),X_train,Y_train,X_test,nfolds),Y_test)

In [None]:
rmse(get_predictions_with_kfold(LGBMRegressor(),X_train,Y_train,X_test,nfolds),Y_test)

Before this competition I used GridSearchCV and RandomizedSearchCV for hyperparameter tuning. I then learned about the hyperopt and optuna libraries, of which I found optuna a bit faster, so I will tune the hyperparameters with optuna here.

In [None]:
!pip install optuna

XGBoost hyperparameter tuning.

In [None]:
def objective(trial):
    param={
        'tree_method':'gpu_hist',
        'n_estimators':trial.suggest_int('n_estimators',50,500),
        'max_depth':trial.suggest_int('max_depth',3,20),
        'learning_rate':trial.suggest_float('learning_rate',0.001,0.1,log=True),
        'reg_lambda':trial.suggest_float('reg_lambda',0.0,10),
        'reg_alpha':trial.suggest_float('reg_alpha',0.0,10),
        'gamma': trial.suggest_float('gamma', 0.0, 10),
        'subsample': trial.suggest_categorical('subsample', [0.8, 0.9, 1.0]),
        'colsample_bytree': trial.suggest_categorical('colsample_bytree', [0.1, 0.2, 0.3, 0.4, 0.5])
    }
    modell=XGBRegressor(**param)
    modell.fit(X_train,Y_train)
    return np.sqrt(mean_squared_error(modell.predict(X_test),Y_test))

In [None]:
import optuna
study=optuna.create_study(direction='minimize')
study.optimize(objective,n_trials=50)

In [None]:
study.best_params

In [None]:
optuna.visualization.plot_param_importances(study)

In [None]:
optuna.visualization.plot_optimization_history(study)

In [None]:
optuna.visualization.plot_slice(study)

LightGBM hyperparameter tuning.

In [None]:
def objective1(trial):
    param={
      'n_estimators':trial.suggest_int('n_estimators',400,1500),
      'max_depth':trial.suggest_int('max_depth',10,40),
      'num_leaves':trial.suggest_int('num_leaves',60,150),
      'learning_rate':trial.suggest_float('learning_rate',0.001,0.2,log=True),
      'boosting_type':trial.suggest_categorical('boosting_type',['gbdt']),
      'reg_alpha': trial.suggest_float('reg_alpha', 1e-3, 10.0),
      'reg_lambda': trial.suggest_float('reg_lambda', 1e-3, 10.0),
      'colsample_bytree': trial.suggest_float('colsample_bytree', 0.2, 0.9),
      'min_child_samples': trial.suggest_int('min_child_samples', 40, 250),
      'subsample_freq': trial.suggest_int('subsample_freq', 1, 10),
      'subsample': trial.suggest_float('subsample', 0.3, 0.9),
      'max_bin': trial.suggest_int('max_bin', 128, 1024),
      'min_data_per_group': trial.suggest_int('min_data_per_group', 50, 200),
      'cat_smooth': trial.suggest_int('cat_smooth', 50, 100),
      'cat_l2': trial.suggest_int('cat_l2', 1, 20)
    }
    modell=LGBMRegressor(**param)
    modell.fit(X_train,Y_train)

    return np.sqrt(mean_squared_error(modell.predict(X_test),Y_test))

In [None]:
study1=optuna.create_study(direction='minimize')
study1.optimize(objective1,n_trials=50)

In [None]:
study1.best_params

In [None]:
optuna.visualization.plot_optimization_history(study1)

In [None]:
optuna.visualization.plot_slice(study1)

In [None]:
optuna.visualization.plot_param_importances(study1)

If we want, we can tune the hyperparameters even further. The value of nfolds in KFold should also be treated as a hyperparameter, since different values give different model predictions. I tried some values between 5 and 10.

Whichever model you feel performs best, write its predictions to a csv file and then submit it.

In [None]:
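def csv(model,nfolds):
    # train with KFold on the training split and predict on the competition test set (without the id column)
    preds=get_predictions_with_kfold(model,X_train,Y_train,test.drop(columns=['id']),nfolds)
    pd.DataFrame({
        'id':test.id,
        'target':preds
    }).to_csv('prediction.csv',index=False)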

In [None]:
csv(LGBMRegressor(**study1.best_params),10)