<a id="Table-Of-Contents"></a>
# Table Of Contents
* [Table Of Contents](#Table-Of-Contents)
* [Problem Statement](#Problem-Statement)
    - [Introduction](#Introduction)
    - [Goal](#Goal)
    - [Evaluation Metrics](#Evaluation-Metrics)
* [Importing Libraries](#Importing-Libs)
* [Descriptive Statistics](#Descriptive-Statistics)
* [Exploratory Data Analysis](#Exploratory-Data-Analysis)
    - [Categorical Features](#Categorical-Features)
    - [Continuous Features](#Continuous-Features)
    - [Target](#Target)
* [Data Preprocessing](#Data-Preprocessing)
    - [Train Test Split](#Train-Test-Split)
    - [Transforms and Pipelines](#Transforms-and-Pipelines)
* [Modelling](#Modelling)
    - [Lasso Regression](#Lasso-Regression)
    - [Random-forest Regression](#Random-Forest)
    - [LightGBM Regression](#Light-GBM-Regressor)
* [Submission](#Submission)

# Problem Statement

### Introduction
The dataset deals with predicting the amount of an insurance claim. Although the features are anonymized, they have properties relating to real-world features.

### Goal
For this competition, we predict a continuous target from the feature columns in the data. The columns cat0 - cat9 are categorical and the columns cont0 - cont13 are continuous. Since the target is continuous, this is a regression task.

### Evaluation Metrics
Submissions are scored on the root mean squared error. RMSE is defined as:
$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
 
where $\hat{y}_i$ is the predicted value, $y_i$ is the original value, and $n$ is the number of rows in the test data.
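
As a minimal sketch of the formula above, RMSE can be computed directly with NumPy. The helper name `rmse` and the toy arrays are illustrative assumptions, not part of the competition code.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values for illustration only
print(rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0]))
```

Note that `mean_squared_error` from scikit-learn returns the squared quantity by default, so its square root is what compares with the competition metric.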

# Importing Libs

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler, LabelBinarizer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
import lightgbm as lgb
from bayes_opt import BayesianOptimization
from lightgbm import LGBMRegressor 
from xgboost import XGBRegressor
from sklearn.model_selection import cross_val_score, cross_val_predict,KFold, GridSearchCV, train_test_split, RandomizedSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.base import BaseEstimator, TransformerMixin, clone
from datetime import datetime

# Descriptive Statistics

In [None]:
warnings.filterwarnings('ignore')
data_f = pd.read_csv("../input/tabular-playground-series-feb-2021/train.csv")
data = data_f.copy()
data = data.iloc[:,1:]

In [None]:
data_cat = data.iloc[:,:10]
data_cont = data.iloc[:,10:]

In [None]:
data.info()

In [None]:
data_cat.describe()

In [None]:
data_cont.describe()

In [None]:
print("Null values in continuous variables:{}\nNull values in categorical variables:{}".format(data_cont.isna().sum().sum(), data_cat.isna().sum().sum()))

# Exploratory Data Analysis

### Looking for trends in continuous features.

In [None]:
sns.pairplot(data_cont)

### No. of categories in each categorical variable.


In [None]:
sns.set(font_scale = 2)
fig,ax = plt.subplots(5,2, figsize=(20,30), sharex=True)
axes = ax.flatten()
object_bol = data_cat.dtypes == 'object'
for ax, catplot in zip(axes, data_cat.dtypes[object_bol].index):
    sns.countplot(y=catplot, data=data_cat, ax=ax, order=data_cat[catplot].value_counts().index)
    ax.xaxis.label.set_size(30)
    ax.yaxis.label.set_size(30)
    ax.tick_params(axis="x", labelsize=15)
    ax.tick_params(axis="y", labelsize=15)
plt.tight_layout()  
plt.show()

cat9 and cat8 have more distinct categories than the other categorical features.
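
The count plots can be cross-checked numerically. The sketch below counts distinct values per column with `DataFrame.nunique`; the helper name `category_counts` and the toy frame are assumptions for illustration, and in the notebook the same call would be applied to `data_cat`.

```python
import pandas as pd

def category_counts(frame):
    """Number of distinct values per column, largest first."""
    return frame.nunique().sort_values(ascending=False)

# Toy frame standing in for the categorical columns; values are made up
toy = pd.DataFrame({
    "cat0": ["A", "B", "A", "A"],
    "cat8": ["A", "B", "C", "D"],
    "cat9": ["A", "B", "C", "C"],
})
counts = category_counts(toy)
print(counts)
```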

### Distributions of the continuous variables.

In [None]:

fig,ax = plt.subplots(5,2, figsize=(12,12), sharex=False)
axes = ax.flatten()
object_bol = data_cont.dtypes == 'float'
# zip stops once the 10 axes are used, so only the first 10 float columns are drawn
for ax, catplot in zip(axes, data_cont.dtypes[object_bol].index):
    sns.kdeplot(x=data_cont[catplot], ax=ax, fill=True)  # fill replaces the deprecated shade argument
    ax.xaxis.label.set_size(20)
    ax.yaxis.label.set_size(20)
    ax.tick_params(axis="x", labelsize=15)
    ax.tick_params(axis="y", labelsize=15)
plt.tight_layout()  
plt.show()   

### Correlation

In [None]:
sns.set(font_scale = 1)
fig, ax = plt.subplots(figsize = (15,15))
sns.heatmap(data_cont.corr(),ax = ax,annot=True)

No continuous feature is highly correlated with the target variable.<br>
However, cont8, cont9 and cont12 are highly positively correlated with cont5.
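
Reading pairs off a large heatmap is error-prone, so the strongest correlations can also be listed programmatically. This is a minimal sketch: the helper `correlated_pairs`, the 0.5 threshold, and the synthetic columns are all assumptions for illustration. In the notebook it would be called on `data_cont`.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.5):
    """Return (col_a, col_b, r) for column pairs with |r| >= threshold."""
    corr = frame.corr()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    # Strongest relationships first
    return sorted(pairs, key=lambda p: -abs(p[2]))

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "b": base + 0.1 * rng.normal(size=200),  # strongly tied to "a"
    "c": rng.normal(size=200),               # independent noise
})
pairs = correlated_pairs(toy, threshold=0.5)
print(pairs)
```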

### Cumulative distribution function

In [None]:
warnings.filterwarnings('ignore')
fig,ax = plt.subplots(5,2, figsize=(20,20), sharex=False)
axes = ax.flatten()
object_bol = data_cont.dtypes == 'float'
for ax, catplot in zip(axes, data_cont.dtypes[object_bol].index):
    res = sns.ecdfplot(data=data_cont,x=catplot,ax=ax)
    ax.xaxis.label.set_size(20)
    ax.yaxis.label.set_size(20)
    ax.tick_params(axis="x", labelsize=15)
    ax.tick_params(axis="y", labelsize=15)
plt.tight_layout()  
plt.show()   

### Boxplots

In [None]:
warnings.filterwarnings('ignore')
fig,ax = plt.subplots(10,1, figsize=(25,30), sharex=False)
axes = ax.flatten()
object_bol = data_cont.dtypes == 'float'
for ax, catplot in zip(axes, data_cont.dtypes[object_bol].index):
    res = sns.boxplot(x=catplot,ax=ax,data=data_cont)
    ax.xaxis.label.set_size(30)
    ax.yaxis.label.set_size(30)
    ax.tick_params(axis="x", labelsize=20) 
plt.tight_layout()  
plt.show()   

cont0, cont2, cont6 and cont8 have a considerable number of outliers.
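
The points drawn beyond the boxplot whiskers follow the usual 1.5 × IQR rule, so the outliers can be counted rather than judged by eye. The sketch below assumes that rule; the helper name `iqr_outlier_counts` and the toy data are illustrative, and in the notebook the function would be applied to `data_cont`.

```python
import pandas as pd

def iqr_outlier_counts(frame, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    outside = (frame < q1 - k * iqr) | (frame > q3 + k * iqr)
    return outside.sum()

toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 100.0],  # one extreme value
    "y": [1.0, 2.0, 3.0, 4.0, 5.0],    # no extreme values
})
counts = iqr_outlier_counts(toy)
print(counts)
```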

# Data Preprocessing 

### Train Test Split

In [None]:
training_data = pd.read_csv("../input/tabular-playground-series-feb-2021/train.csv")
test_data = pd.read_csv("../input/tabular-playground-series-feb-2021/test.csv") 

In [None]:
#id column is unnecessary
training_data.drop(['id'],axis=1,inplace=True)
test_data.drop(['id'],axis=1,inplace=True)

In [None]:
categorical = list(filter(lambda x: 'cat' in x, training_data.columns))
# every non-categorical column, which includes the 'target' column at the end
continuous  = list(filter(lambda x: 'cat' not in x, training_data.columns))

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(training_data.iloc[:,:-1], training_data.iloc[:,-1], test_size = 0.2, random_state = 42)

In [None]:
print(X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)

### Transforms and Pipelines

In [None]:
cat_pipeline = ColumnTransformer([('encoder', OneHotEncoder(), [idx for idx,_ in enumerate(categorical)])], remainder='passthrough')
cont_pipeline = ColumnTransformer([('scaler', StandardScaler(), [idx+10 for idx,_ in enumerate(continuous[:-1])])], remainder='drop')

In [None]:
full_pipeline = FeatureUnion(transformer_list = [("Categorical_Pipeline",cat_pipeline),
                                                 ("Quantitative_Pipeline",cont_pipeline)])

x_train = cat_pipeline.fit_transform(X_train)
x_test = cat_pipeline.transform(X_test)
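
Note that only `cat_pipeline` is fitted above, so the continuous columns pass through unscaled and `full_pipeline` is never used. One possible alternative, sketched here under assumptions, is a single `ColumnTransformer` that one-hot encodes the categorical columns and scales the continuous ones in one step. The toy frames and the name `preprocess` are illustrative; `handle_unknown="ignore"` and `sparse_threshold=0` are choices made for this sketch, not settings taken from the notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frames; column names mimic the dataset but the values are made up
toy_train = pd.DataFrame({
    "cat0": ["A", "B", "A", "B"],
    "cont0": [0.1, 0.4, 0.3, 0.8],
    "cont1": [1.0, 2.0, 3.0, 4.0],
})
toy_test = pd.DataFrame({"cat0": ["C"], "cont0": [0.5], "cont1": [2.5]})

preprocess = ColumnTransformer(
    [
        # handle_unknown="ignore" encodes unseen categories as all zeros
        ("encoder", OneHotEncoder(handle_unknown="ignore"), ["cat0"]),
        ("scaler", StandardScaler(), ["cont0", "cont1"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

train_matrix = preprocess.fit_transform(toy_train)  # statistics learned on train only
test_matrix = preprocess.transform(toy_test)        # reused unchanged on test
print(train_matrix.shape, test_matrix.shape)
```

Each column is handled by exactly one transformer here, which avoids the duplicated continuous columns that a `FeatureUnion` of a passthrough transformer and a scaling transformer would produce.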

# Modelling

Let us train these models:
1. Lasso Regression
2. Random Forest
3. Light Gradient Boosted Machines


### **Lasso Regression**

In [None]:
# Function to get cross-validation metrics (MSE, because scoring is neg_mean_squared_error)
def get_score(model, x=x_train, y=Y_train, cv = 5, verbose = False, ack=False):
    scores = cross_val_score(model, x, y, scoring = 'neg_mean_squared_error',cv=cv,n_jobs=-1)
    # scores are negated MSE values, so flip the sign before summarising
    mean = np.mean(-scores)
    std = np.std(-scores)
    if verbose:
        print("Scores: {}\nMean: {:.4f}\nStd: {:.4f}".format(-scores,mean,std))
    if ack:
        return scores,mean,std

In [None]:
# Check performance with the default parameters
lasso_model = Lasso()
get_score(lasso_model,cv=10,verbose=True)

In [None]:
# Function to find the optimal alpha
def find_opt_params_lasso(alpha_range, x=x_train, y=Y_train, div_fac=1):
    alphas = []
    errors = []
    for alpha in alpha_range:
        # cross_val_score fits clones of the model, so no separate fit is needed
        lr_model = Lasso(alpha=alpha / div_fac)
        alphas.append(alpha / div_fac)
        # pass x and y through so that non-default data is actually used
        errors.append(get_score(lr_model, x=x, y=y, ack=True)[1])
    return alphas, errors

In [None]:
# Plot errors vs alphas 
alphas, errors = find_opt_params_lasso(range(1,100),div_fac = 1000)
plt.plot(alphas,errors)

The error is lowest at alpha = 0.0001 without standard scaling and at alpha = 0.001 with it.
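
Rather than reading the minimum off the plot, the best alpha can be picked from the two lists returned by `find_opt_params_lasso`. This is a small sketch; the helper name `best_alpha` and the numbers are made up for illustration and are not results from this notebook.

```python
import numpy as np

def best_alpha(alphas, errors):
    """Return the alpha with the smallest error, together with that error."""
    idx = int(np.argmin(errors))
    return alphas[idx], errors[idx]

# Made-up values for illustration only
toy_alphas = [0.001, 0.002, 0.003, 0.004]
toy_errors = [0.75, 0.74, 0.76, 0.78]
print(best_alpha(toy_alphas, toy_errors))
```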

In [None]:
#Lasso regression
lasso_model = Lasso(alpha=0.001)
lasso_model.fit(x_train, Y_train)
# get_score(lasso_model,cv=10,verbose=True)

In [None]:
# Evaluation on the held-out split (this is MSE; take the square root to compare with RMSE)
predictions = lasso_model.predict(x_test)
mean_squared_error(Y_test,predictions)

### **Random Forest**

In [None]:
#raw model
rf_model = RandomForestRegressor(n_jobs=-1)
get_score(rf_model,y=Y_train.ravel(),verbose=True)

In [None]:
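# 1.0 uses all features, which is what 'auto' meant for regressors; recent scikit-learn releases reject 'auto'
parameters = {'n_estimators':range(50,300,50),'max_features':(1.0,'sqrt','log2')}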
search_rf_params = RandomizedSearchCV(rf_model, parameters, n_jobs=-1, scoring='neg_mean_squared_error',cv=3)

In [None]:
search_rf_params.fit(x_train,Y_train.ravel())

In [None]:
search_rf_params.best_estimator_

In [None]:
rf_model = RandomForestRegressor(n_estimators= 250, max_features= 'sqrt',n_jobs=-1)
rf_model.fit(x_train, Y_train.ravel()) 

In [None]:
# Evaluate the tuned model on the held-out split (MSE)
predictions = rf_model.predict(x_test)
mean_squared_error(Y_test,predictions)

In [None]:
predictions_test = rf_model.predict(cat_pipeline.transform(test_data))

Adding more hyperparameters to the search space (for example tree depth or minimum leaf size) may improve performance.
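
A wider search space could look like the sketch below. The parameter names are real `RandomForestRegressor` arguments, but the candidate values, the synthetic data, and the name `wider_space` are assumptions chosen so the example runs quickly; they are not tuned for this dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical wider search space; the ranges are illustrative guesses
wider_space = {
    "n_estimators": [20, 50],
    "max_features": ["sqrt", "log2"],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 5, 20],
}

# Small synthetic regression problem so the sketch runs in seconds
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 6))
y_toy = X_toy[:, 0] * 2.0 + rng.normal(scale=0.1, size=120)

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    wider_space,
    n_iter=4,                          # number of sampled combinations
    scoring="neg_mean_squared_error",
    cv=3,
    random_state=0,
)
search.fit(X_toy, y_toy)
print(search.best_params_)
```

On the full training data, `n_iter` and the number of folds largely determine the runtime, so they are worth raising gradually.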

### Light GBM Regressor


In [None]:
# https://www.kaggle.com/josephchan524/tabularplaygroundregressor-using-lightgbm-feb2021
def search_best_param(X,y,cat_features):
    
    trainXY = lgb.Dataset(data=X, label=y,categorical_feature = cat_features,free_raw_data=False)
    
    # define the lightGBM cross validation
    def lightGBM_CV(max_depth, num_leaves, n_estimators, learning_rate, subsample, colsample_bytree, 
                    lambda_l1, lambda_l2, min_child_weight):

            params = {'boosting_type': 'gbdt', 'objective': 'regression', 'metric':'rmse', 'verbose': -1,
                      'early_stopping_round':100}

            params['max_depth'] = int(round(max_depth))
            params["num_leaves"] = int(round(num_leaves))
            params["n_estimators"] = int(round(n_estimators))
            params['learning_rate'] = learning_rate
            params['subsample'] = subsample
            params['colsample_bytree'] = colsample_bytree
            params['lambda_l1'] = max(lambda_l1, 0)
            params['lambda_l2'] = max(lambda_l2, 0)
            params['min_child_weight'] = min_child_weight

            score = lgb.cv(params, trainXY, nfold=5, seed=1, stratified=False, verbose_eval =False, metrics=['rmse'])

            # negate the best RMSE because BayesianOptimization maximizes its objective
            return -np.min(score['rmse-mean'])

    
    # use bayesian optimization to search for the best hyper-parameter combination
    # https://github.com/fmfn/BayesianOptimization/blob/master/bayes_opt/bayesian_optimization.py
    lightGBM_Bo = BayesianOptimization(lightGBM_CV, 
                                      {
                                          'max_depth': (5, 50),
                                          'num_leaves': (20, 100),
                                          'n_estimators': (50, 1000),
                                          'learning_rate': (0.01, 0.3),
                                          'subsample': (0.7, 0.8),
                                          'colsample_bytree' :(0.5, 0.99),
                                          'lambda_l1': (0, 5),
                                          'lambda_l2': (0, 3),
                                          'min_child_weight': (2, 50) 
                                      },
                                       random_state = 1,
                                       verbose = -1
                                      )
    np.random.seed(1)
    
    lightGBM_Bo.maximize(init_points=5, n_iter=25) # 5 random starts + 25 guided steps = 30 evaluations
    
    params_set = lightGBM_Bo.max['params']
    
    # get the params of the maximum target     
    max_target = -np.inf
    for i in lightGBM_Bo.res: # loop through all the optimization results
        if i['target'] > max_target:
            params_set = i['params']
            max_target = i['target']
    
    params_set.update({'verbose': -1})
    params_set.update({'metric': 'rmse'})
    params_set.update({'boosting_type': 'gbdt'})
    params_set.update({'objective': 'regression'})
    
    params_set['max_depth'] = int(round(params_set['max_depth']))
    params_set['num_leaves'] = int(round(params_set['num_leaves']))
    params_set['n_estimators'] = int(round(params_set['n_estimators']))
    params_set['seed'] = 1 #set seed
    
    return params_set

In [None]:
## https://stackoverflow.com/questions/24458645/label-encoding-across-multiple-columns-in-scikit-learn
warnings.filterwarnings('ignore')
class MultiColumnLabelEncoder(BaseEstimator):
    def __init__(self,columns = None):
        self.columns = columns # array of column names to encode

    def fit(self,X,y=None):
        return self # not relevant here

    def transform(self,X):
        '''
        Transforms columns of X specified in self.columns using
        LabelEncoder(). If no columns specified, transforms all
        columns in X.
        '''
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            for colname,col in output.items():  # iteritems() was removed in pandas 2.0
                output[colname] = LabelEncoder().fit_transform(col)
        return output

    def fit_transform(self,X,y=None):
        return self.fit(X,y).transform(X)
    
mult_enc_pipeline = ColumnTransformer([('encoder',MultiColumnLabelEncoder(), [idx for idx,_ in enumerate(categorical)])], remainder='passthrough')
# trans_x = MultiColumnLabelEncoder(columns = categorical).fit_transform(training_data)
xtrain = pd.DataFrame(mult_enc_pipeline.fit_transform(training_data.iloc[:,:-1]),columns=training_data.columns[:-1])
ytrain = pd.DataFrame(training_data.iloc[:,-1],columns=['target'])
best_params = search_best_param(xtrain,ytrain,categorical)
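
One caveat with the encoder above: `transform` refits a fresh `LabelEncoder` on whatever frame it receives, so the integer assigned to a category can differ between the training and test data if their category sets differ. A possible alternative, sketched here as an assumption rather than the notebook's method, is scikit-learn's `OrdinalEncoder`, which learns the mapping once in `fit` and reuses it. The toy frames and the choice of `-1` for unseen categories are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frames; the test frame contains a category ("C") never seen in training
toy_train = pd.DataFrame({"cat0": ["A", "B", "A"], "cat1": ["X", "Y", "Y"]})
toy_test = pd.DataFrame({"cat0": ["B", "C"], "cat1": ["Y", "X"]})

# Unseen categories are mapped to -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(toy_train)  # the category-to-integer mapping is fixed here

encoded_train = encoder.transform(toy_train)
encoded_test = encoder.transform(toy_test)  # same mapping as for the training data
print(encoded_test)
```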

In [None]:
def K_Fold_LightGBM(X_train, y_train , cat_features, params_set, num_folds = 5):
    num = 0
    models = []
    folds = KFold(n_splits=num_folds, shuffle=True, random_state=0)
    # train one model per fold (num_folds models in total)
    for n_fold, (train_idx, valid_idx) in enumerate (folds.split(X_train, y_train)):
        print(f"     model{num}")
        train_X, train_y = X_train.iloc[train_idx], y_train.iloc[train_idx]
        valid_X, valid_y = X_train.iloc[valid_idx], y_train.iloc[valid_idx]
        
        train_data=lgb.Dataset(train_X,label=train_y, categorical_feature = cat_features,free_raw_data=False)
        valid_data=lgb.Dataset(valid_X,label=valid_y, categorical_feature = cat_features,free_raw_data=False)
        
        CV_LGBM = lgb.train(params_set,
                 train_data,
                 num_boost_round = 2500,
                 valid_sets = valid_data,
                 early_stopping_rounds = 100,
                 verbose_eval = 50
                 )
        # increasing early_stopping_rounds can lead to overfitting
        models.append(CV_LGBM)
        
        print("Train set RMSE:", mean_squared_error(train_y,models[num].predict(train_X),squared = False))
        print("Validation set RMSE:", mean_squared_error(valid_y,models[num].predict(valid_X),squared = False))
        print("\n")
        num = num + 1
        
    return models

lgbm_models = K_Fold_LightGBM(xtrain,ytrain,categorical,best_params,5)

LightGBM generalizes better than the random forest on held-out data.

In [None]:
predictLGBM = lgbm_models[3].predict(mult_enc_pipeline.transform(test_data))
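
The cell above uses a single fold model. A common alternative is to average the predictions of all fold models, which tends to reduce variance. The sketch below assumes only that each model exposes a `predict` method; `average_predictions` and `ConstantModel` are hypothetical names, and the stand-in models replace the fitted boosters so the example runs on its own.

```python
import numpy as np

def average_predictions(models, features):
    """Average the predictions of several fitted models row by row."""
    stacked = np.column_stack([model.predict(features) for model in models])
    return stacked.mean(axis=1)

class ConstantModel:
    """Stand-in for a fitted booster; always predicts one value."""
    def __init__(self, value):
        self.value = value

    def predict(self, features):
        return np.full(len(features), self.value)

toy_models = [ConstantModel(1.0), ConstantModel(2.0), ConstantModel(3.0)]
averaged = average_predictions(toy_models, np.zeros((4, 2)))
print(averaged)
```

In the notebook this would be called with `lgbm_models` and the transformed test data in place of the toy inputs.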

# Submission

In [None]:
def make_submission_csv(predictions_test):
    submission_csv = pd.read_csv("../input/tabular-playground-series-feb-2021/sample_submission.csv")
    # overwrite the placeholder target column with the model predictions
    submission_csv['target'] = predictions_test
    submission_csv.to_csv('Result_{}.csv'.format(datetime.now().strftime("%d_%m_%Y_%H_%M")),index=False)

In [None]:
make_submission_csv(predictions_test)