## Introduction

### <span style="color:blue">Please up-vote if you find this notebook useful!</span>


This notebook does three things:

1. Trains two regression models, `XGBRegressor` (XGBoost) and `LGBMRegressor` (LightGBM).
2. Runs a hyper-parameter search (`RandomizedSearchCV` over a small parameter grid) for the LightGBM model.
3. Compares two ensembling methods, simple `averaging` and `weighted average`, to make the final submission.

Conclusion: The results below show that the weighted average performs better on the hold-out split than the simple average ensemble. Therefore, we submit the weighted-average predictions.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)



Let's import the data.

In [None]:
df_train = pd.read_csv("../input/tabular-playground-series-jan-2021/train.csv")
df_test = pd.read_csv("../input/tabular-playground-series-jan-2021/test.csv")

In [None]:
df_train.columns

In [None]:
print(f"Shape of train dataset: {df_train.shape}")
print(f"Shape of test dataset: {df_test.shape}")

Separate the features from the target, then hold out 10% of the training data for validation:

In [None]:
x = df_train.iloc[:, 1:15].values  # feature columns at positions 1-14 (column 0 is skipped)
print(x)
y = df_train.iloc[:, -1].values  # target is the last column

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=0)

We know this tabular data is clean and has no missing values, so we skip data exploration and quality checks. Now let's jump to two important `ensembling techniques`:

1. Averaging, and 
2. Weighted Average
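
As a minimal sketch of the difference between the two techniques, assume two hypothetical prediction arrays `pred_a` and `pred_b` (toy values, not results from this notebook). A simple average treats both models equally, while a weighted average uses weights that sum to 1:

```python
import numpy as np

# Toy predictions from two hypothetical models (illustrative values only).
pred_a = np.array([7.0, 8.0, 9.0])
pred_b = np.array([8.0, 8.0, 7.0])

# 1. Simple average: every model contributes equally.
simple_avg = (pred_a + pred_b) / 2

# 2. Weighted average: the weights sum to 1; here the second model dominates.
weights = [0.25, 0.75]
weighted_avg = np.average(np.vstack([pred_a, pred_b]), axis=0, weights=weights)

print(simple_avg)
print(weighted_avg)
```

The simple average is just the special case where every weight is `1 / n_models`.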

1. Averaging:

First we train two models, `XGBRegressor` and `LGBMRegressor`. Later we average their predictions and build the submission file for this challenge.

In [None]:
from xgboost import XGBRegressor
import lightgbm as ltb
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
from sklearn import model_selection

XGBRegressor training and prediction here:

In [None]:
# XGB
XGB = XGBRegressor(
    max_depth=3, learning_rate=0.1, n_estimators=1000, min_child_weight=3,
    reg_alpha=0.001, reg_lambda=0.000001, n_jobs=-1,
)
XGB.fit(X_train, y_train)

In [None]:
y_pred_xgb = XGB.predict(X_test)

In [None]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred_xgb))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred_xgb))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred_xgb)))
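
The same three metrics are printed again for each model and ensemble in this notebook. A small helper could bundle them; the sketch below uses NumPy only, and the name `regression_report` is a hypothetical choice rather than part of scikit-learn:

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Return MAE, MSE and RMSE as a dict (hypothetical helper, NumPy only)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mse = float(np.mean(errors ** 2))
    return {
        "mae": float(np.mean(np.abs(errors))),  # mean absolute error
        "mse": mse,                             # mean squared error
        "rmse": float(np.sqrt(mse)),            # root mean squared error
    }

# Toy values for illustration only.
report = regression_report([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

With such a helper, a call like `regression_report(y_test, y_pred_xgb)` would replace the three print statements.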

`LGBMRegressor` here:

In [None]:

lgbm = ltb.LGBMRegressor()

In [None]:
# Define a dictionary containing all the relevant parameters
param_grid = {
    "boosting_type": ['gbdt'],
    "num_leaves": [9, 19],  #[ 19, 31, 37, 47],
    "max_depth": [29], #[7, 15, 29, 37, 47, 53], 
    "learning_rate": [0.1, 0.15],
    "n_estimators": [1000], #[500, 1000, 2000], 
    "subsample_for_bin": [200000], #[20000, 200000, 2000000], 
    "objective": ["regression"],
    "min_child_weight": [0.01], #[0.001, 0.01], 
    "min_child_samples":[100, 200], #[20, 50, 100], 
    "subsample":[1.0], 
    "subsample_freq":[0], 
    "colsample_bytree":[1.0], 
    "reg_alpha":[0.0], 
    "reg_lambda":[0.0]
}

In [None]:
model = model_selection.RandomizedSearchCV(
    estimator=lgbm,
    param_distributions=param_grid,
    n_iter=100,
    scoring="neg_root_mean_squared_error",
    verbose=10,
    n_jobs=-1,
    cv=5
)
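
Note that `n_iter=100` is larger than the number of distinct combinations in the parameter dictionary. When every entry is a plain list, scikit-learn's randomized search should fall back to evaluating each combination once and emit a warning, so the search here is effectively exhaustive. The sketch below counts the combinations using only the standard library; the dictionary is a copy of the one defined earlier with the commented-out alternatives omitted:

```python
from math import prod

# Copy of the search space used in this notebook.
param_grid = {
    "boosting_type": ["gbdt"],
    "num_leaves": [9, 19],
    "max_depth": [29],
    "learning_rate": [0.1, 0.15],
    "n_estimators": [1000],
    "subsample_for_bin": [200000],
    "objective": ["regression"],
    "min_child_weight": [0.01],
    "min_child_samples": [100, 200],
    "subsample": [1.0],
    "subsample_freq": [0],
    "colsample_bytree": [1.0],
    "reg_alpha": [0.0],
    "reg_lambda": [0.0],
}

n_iter, cv = 100, 5

# Number of distinct settings is the product of the list lengths.
n_combinations = prod(len(values) for values in param_grid.values())

# At most n_combinations candidates can be tried, each fitted once per fold.
n_candidates = min(n_iter, n_combinations)
total_fits = n_candidates * cv

print(f"combinations: {n_combinations}, fits: {total_fits}")
```

Widening the lists (for example restoring the commented-out values) grows the product quickly, which is when sampling with `n_iter` starts to matter.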

In [None]:
# fit the model and extract best score
model.fit(X_train, y_train)

In [None]:
print(f"Best score (negative RMSE): {model.best_score_}")
print("Best parameters from RandomizedSearchCV:")
best_parameters = model.best_estimator_.get_params()
for param_name in sorted(param_grid.keys()):
    print(f"\t{param_name}: {best_parameters[param_name]}")

In [None]:
# Get best model
best_model = model.best_estimator_

In [None]:
y_pred_lgb = best_model.predict(X_test)

In [None]:

print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred_lgb))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred_lgb))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred_lgb)))

## Let's ensemble the hold-out predictions from XGB and LGB to see whether RMSE improves

1. Averaging

In [None]:
preds_ensemble_avg = (y_pred_xgb + y_pred_lgb)/2

In [None]:
print("Averaging Ensemble predictions KPI here:")
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, preds_ensemble_avg))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, preds_ensemble_avg))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, preds_ensemble_avg)))

2. Weighted Average

In [None]:
# Weighted blend: 10% XGB, 90% LGB (stored under its own name so the simple average is not overwritten)
preds_ensemble_weighted = y_pred_xgb * 0.1 + y_pred_lgb * 0.9

In [None]:
print("Weighted average Ensemble predictions KPI here:")
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, preds_ensemble_weighted))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, preds_ensemble_weighted))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, preds_ensemble_weighted)))

`Conclusion`: The results above show that the `weighted average` performs better than the `simple average` ensemble on the hold-out split. Therefore, we submit the weighted-average predictions.
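
The 0.1 / 0.9 weights are fixed by hand in this notebook. One possible way to pick them more systematically is to sweep a grid of weights on held-out predictions and keep the one with the lowest RMSE. The sketch below assumes this approach and uses synthetic data; `best_blend_weight` is a hypothetical helper, and none of the numbers are results from this notebook:

```python
import numpy as np

def best_blend_weight(y_true, pred_a, pred_b, weights=None):
    """Return (w, rmse) minimizing the RMSE of w * pred_a + (1 - w) * pred_b."""
    if weights is None:
        weights = np.linspace(0.0, 1.0, 11)  # 0.0, 0.1, ..., 1.0
    y_true, pred_a, pred_b = (
        np.asarray(v, dtype=float) for v in (y_true, pred_a, pred_b)
    )
    best_w, best_rmse = None, np.inf
    for w in weights:
        blend = w * pred_a + (1.0 - w) * pred_b
        rmse = np.sqrt(np.mean((y_true - blend) ** 2))
        if rmse < best_rmse:
            best_w, best_rmse = float(w), float(rmse)
    return best_w, best_rmse

# Synthetic stand-ins for hold-out targets and two models' predictions.
rng = np.random.default_rng(0)
y_true = rng.normal(size=200)
pred_a = y_true + rng.normal(scale=1.0, size=200)  # noisier model
pred_b = y_true + rng.normal(scale=0.3, size=200)  # more accurate model

w, rmse = best_blend_weight(y_true, pred_a, pred_b)
print(f"best weight for pred_a: {w:.1f}, RMSE: {rmse:.4f}")
```

In this notebook's terms the call would be `best_blend_weight(y_test, y_pred_xgb, y_pred_lgb)`. Weights tuned and scored on the same hold-out split give an optimistic estimate, so a separate validation split or cross-validation would be a safer basis for the choice.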

## Ensemble averaging & submission file preparation

In [None]:
# Make predictions using the XGB regressor model
preds_xgb = XGB.predict(df_test.iloc[:,1:].values)

In [None]:
# Make predictions using the LightGBM regressor model
preds_lgb = best_model.predict(df_test.iloc[:,1:].values)

In [None]:
sub=pd.read_csv("../input/tabular-playground-series-jan-2021/sample_submission.csv")

In [None]:
# Take the weighted average of the XGB and LGB predictions for the submission
# sub.target = (preds_xgb + preds_lgb) / 2  # simple-average alternative
sub.target = preds_xgb * 0.1 + preds_lgb * 0.9
sub.to_csv("submission.csv", index=False)