# Introduction

For this competition, we will be predicting a continuous target variable with 14 continuous independent variables.

Submissions are evaluated on the root-mean-squared error (RMSE), so the goal of this notebook is to obtain the lowest RMSE possible.
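As a quick reminder of what the metric measures, here is a minimal sketch of RMSE computed by hand with NumPy. The helper name `rmse` and the four values are made up for illustration and are not competition data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-squared error: square root of the mean squared difference."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up values for illustration only
y_true = [8.0, 7.5, 9.0, 6.5]
y_pred = [7.0, 7.5, 8.0, 7.5]

# Errors are 1, 0, 1, -1, so the mean squared error is 0.75
print(rmse(y_true, y_pred))
```

Because the errors are squared before averaging, a few large misses raise the score more than many small ones.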

Notebook Layout

* Exploratory Data Analysis
* Model Creation
* Model Ensembling (Voting)
* Predictions

If you have any feedback, please let me know!

## Let's get started!

In [None]:
import numpy as np
import pandas as pd

# Plots
import seaborn as sns
import matplotlib.pyplot as plt

# Models
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge, Lasso
import xgboost as xgb
from xgboost import XGBRegressor
import lightgbm as lgb
from lightgbm import LGBMRegressor
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from mlxtend.regressor import StackingCVRegressor

# Stats
from scipy.stats import skew, norm

# Misc
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import mean_squared_error
pd.set_option('display.max_columns', None)
from pathlib import Path
input_path = Path('/kaggle/input/tabular-playground-series-jan-2021/')

# Ignore useless warnings
import warnings
warnings.filterwarnings(action="ignore")
pd.options.display.max_seq_items = 8000
pd.options.display.max_rows = 8000

In [None]:
# Load the data as dataframes
train = pd.read_csv(input_path / 'train.csv', index_col='id')
test = pd.read_csv(input_path / 'test.csv', index_col='id')

# EDA

In [None]:
train.info()

The training data has 300,000 observations and 15 columns.

In [None]:
test.info()

* The test data has 200,000 observations and 14 columns.
* The 15th column is absent because it is what we are trying to predict.

In [None]:
print(train.isna().any().any())
print(test.isna().any().any())

Neither dataset has missing values.

In [None]:
train.describe()

In [None]:
test.describe()

The train and test data look very similar.  Let's first look at the target variable.

In [None]:
sns.set_style("white")
sns.set_color_codes(palette='deep')
fig=plt.figure(figsize=(6,6))
ax = sns.histplot(train['target'], color="b", kde=True, stat="density")  # distplot is deprecated in recent seaborn
ax.set(xlabel="Target", ylabel="Density", title="Target Variable Distribution")
plt.show()

The target variable is bimodal and has one outlier valued at 0.  Let's remove it.  

In [None]:
train.drop(train[train['target'] == 0].index, inplace = True)

Let's look at the remainder of the variables with boxplots.

In [None]:
train.boxplot(column = list(train.columns[0:14]), figsize= (15,10))

In [None]:
test.boxplot(column = list(test.columns[0:14]), figsize= (15,10))

* The boxplots confirm that the two datasets are similar.  The boxplots of each variable are roughly the same in both datasets.
* Cont5 and Cont7 have similar outliers in both datasets.
* Let's not remove any of them for now.  We should consider using Cook's distance to remove influential outliers.
* Let's plot the distribution of each variable.
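On the Cook's distance idea above: the notebook does not apply it, but a minimal sketch of how it could be computed for an ordinary least-squares fit looks like this. The function name `cooks_distance`, the synthetic data, and the `4 / n` cut-off (a common rule of thumb, not a strict test) are all assumptions for illustration.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance of every row for an OLS fit with an intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(X)), X])      # design matrix with intercept
    n, p = A.shape
    Q, _ = np.linalg.qr(A)                         # reduced QR decomposition
    leverage = np.sum(Q ** 2, axis=1)              # diagonal of the hat matrix
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # OLS coefficients
    resid = y - A @ beta
    mse = resid @ resid / (n - p)                  # residual variance estimate
    return resid ** 2 / (p * mse) * leverage / (1.0 - leverage) ** 2

# Synthetic data with one injected influential point (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(200, 2))
y_demo = 8 + X_demo @ np.array([0.5, -0.3]) + rng.normal(0, 0.1, 200)
X_demo[0] = [3.0, 3.0]   # far from the other rows in feature space
y_demo[0] = 0.0          # and far off the linear trend

d = cooks_distance(X_demo, y_demo)
flagged = np.where(d > 4 / len(d))[0]   # rule-of-thumb threshold
print(flagged[:10])
```

Unlike a boxplot, this flags rows by how much they move the fitted model, so a point can look extreme on one feature and still have little influence.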

In [None]:
# Create dataset with only numerical values
numericaldata = train.select_dtypes(exclude='object')

# Create plot space
fig = plt.figure(figsize=(10,15))

# Create subplot for each loop
for i in range(len(numericaldata.columns)):
    fig.add_subplot(6,4,i+1)
    sns.histplot(numericaldata.iloc[:,i], kde=True, stat="density")  # replaces the deprecated distplot
    plt.xlabel(numericaldata.columns[i])

# Display plots        
plt.tight_layout()
plt.show()

* Most of the variables are bimodal or multimodal.
* None of the variables seem skewed.  Let's check just in case.

In [None]:
# Loop through each feature
for column in train:
    
    # calculate skew for feature
    sk = round(train[column].skew(), 2)
    
    # Print if skew is significant
    if sk > 2 or sk < -2:
        print("Skew for", column, "is", sk)

* None of the variables were highly skewed.
* Let's explore the relationships between variables.

In [None]:
# Create plotting space
plt.subplots(figsize=(16,12))

# Calculate correlations ('id' is the index, so every column is a feature or the target)
corr = train.corr()

# Plot the correlations
sns.heatmap(corr, vmax=0.9, cmap="Blues", square=True, annot=True)

plt.yticks(rotation=0, fontsize = 15)
plt.xticks(rotation=0, fontsize = 15)
plt.tight_layout()

* Unfortunately, none of the predictor variables are highly correlated with the target variable.
* There is a cluster of variables that are correlated with each other.
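To read the correlated cluster off as a list rather than from the heatmap, one option is to extract the feature pairs above a chosen cut-off. This is a sketch on synthetic data. The helper `correlated_pairs`, the `0.5` threshold, and the column names `f1` to `f3` are hypothetical.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.5):
    """Return column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr()
    # Keep the strict upper triangle so each pair appears only once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs.abs() > threshold]   # also discards any masked (NaN) entries
    return pairs.sort_values(key=np.abs, ascending=False)

# Synthetic stand-in for the training features (illustration only)
rng = np.random.default_rng(0)
a = rng.normal(size=1000)
demo = pd.DataFrame({
    "f1": a,
    "f2": a + rng.normal(scale=0.1, size=1000),   # strongly tied to f1
    "f3": rng.normal(size=1000),                  # independent of both
})

pairs = correlated_pairs(demo, threshold=0.5)
print(pairs)
```

On the real data the same call would be `correlated_pairs(train.drop(columns='target'))`, assuming `train` is loaded as above.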

# Modeling

* We cannot test our models on the given test dataset, because it doesn't contain the target variable.
* Let's create a new train and test dataset out of the given train dataset.

In [None]:
# Create test and train data
X = train.drop(columns='target')  # 'id' is already the index, so only the target needs dropping
y = train.target

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = .2, random_state = 42)

# Linear Regression

In [None]:
# Create model
lm = LinearRegression()

# Fit the model
lm.fit(X_train, y_train)

# Make predictions 
y_pred = lm.predict(X_test)

# Calculate root mean squared error
print("RMSE: ", np.sqrt(mean_squared_error(y_test, y_pred)))

# Calculate the cross validated RMSE
scores = cross_val_score(lm, X_train, y_train, scoring = "neg_root_mean_squared_error", cv = 5)
print("RMSE with cross validation: ", np.mean(np.abs(scores)))

The cross validated RMSE for the linear regression is 0.7265.

The minimum and maximum of the target variable are 3.70 and 10.27, respectively.  Range: 10.27 - 3.70 = 6.57.

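The RMSE/Range ratio is 0.7265 / 6.57 = 0.111, so a typical error is about 11% of the target's range.

RMSE is approximately the standard deviation of the model's residuals (exactly so when the residuals average to zero), so we want it to be as low as possible.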
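The arithmetic above, and the link between RMSE and the spread of the residuals, can be checked with a short sketch. The RMSE and target range are the values reported in this notebook. The residuals are simulated, not taken from the model.

```python
import numpy as np

# Values reported above
rmse_cv = 0.7265
target_min, target_max = 3.70, 10.27
print(round(rmse_cv / (target_max - target_min), 3))   # RMSE as a share of the range

# Simulated residuals (illustration only), centred so their mean is zero
rng = np.random.default_rng(0)
resid = rng.normal(0.0, 0.7, 10_000)
resid -= resid.mean()
rmse = np.sqrt(np.mean(resid ** 2))
print(np.isclose(rmse, resid.std()))   # RMSE matches the standard deviation here

# A systematic offset (bias) raises RMSE but leaves the standard deviation unchanged
shifted = resid + 0.5
rmse_shifted = np.sqrt(np.mean(shifted ** 2))
print(rmse_shifted > shifted.std())
```

So RMSE equals the residual standard deviation only for an unbiased model. In general it combines spread and bias.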

This is a good start.  Let's see if we can obtain a better score with different models.

# Lasso

In [None]:
# Create the parameter grid
params = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100]}

# Create the base model
lasso = Lasso(random_state=42)

# Create exhaustive search over parameters
grid = GridSearchCV(estimator=lasso, param_grid=params,
                   scoring="neg_mean_squared_error", cv = 5, verbose = 0)

# Fit the grid search to the training data
grid.fit(X_train,y_train)

# Print the best parameters
print("Best parameters found: ", grid.best_params_)
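The grid search above is scored with `neg_mean_squared_error`, so its `best_score_` is a negated MSE rather than an RMSE. Here is a minimal sketch of the conversion on synthetic data. The names `X_demo`, `y_demo`, and `demo_grid` are hypothetical and the data are not from the competition.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
coefs = np.array([1.0, 0.5, 0.0, 0.0, -0.7])
y_demo = X_demo @ coefs + rng.normal(scale=0.5, size=300)

demo_grid = GridSearchCV(
    estimator=Lasso(random_state=42),
    param_grid={"alpha": [0.001, 0.01, 0.1, 1]},
    scoring="neg_mean_squared_error",
    cv=5,
)
demo_grid.fit(X_demo, y_demo)

# best_score_ is the mean negated MSE over the folds:
# flip the sign, then take the square root to get an RMSE-scale number
best_rmse = np.sqrt(-demo_grid.best_score_)
print(demo_grid.best_params_, best_rmse)
```

This value is the square root of the mean fold MSE, which can differ slightly from the mean of per-fold RMSEs that `cross_val_score` reports with `neg_root_mean_squared_error`.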

In [None]:
# Set best parameter to model
lasso.set_params(random_state = 42, alpha = 0.001)

# Fit the new model
lasso.fit(X_train,y_train)

# Make predictions 
y_pred = lasso.predict(X_test)

# Calculate root mean squared error
print("RMSE: ", np.sqrt(mean_squared_error(y_test, y_pred)))

# Calculate the cross validated RMSE
scores = cross_val_score(lasso, X_train, y_train, scoring = "neg_root_mean_squared_error", cv = 5)
print("RMSE with cross validation: ", np.mean(np.abs(scores)))

The cross validated RMSE for the lasso model is 0.7269.  This model performed roughly as well as the linear regression.

# Ridge Regression

In [None]:
# Create the parameter grid
params = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100]}

# Create the base model
ridge = Ridge(random_state=42)

# Create exhaustive search over parameters
grid = GridSearchCV(estimator=ridge, param_grid=params,
                   scoring="neg_mean_squared_error", cv = 5, verbose = 0)

# Fit the grid search to the training data
grid.fit(X_train,y_train)

# Print the best parameters
print("Best parameters found: ", grid.best_params_)

In [None]:
# Set best parameter to model
ridge.set_params(random_state = 42, alpha = 10)

# Fit the new model
ridge.fit(X_train,y_train)

# Make predictions 
y_pred = ridge.predict(X_test)

# Calculate root mean squared error
print("RMSE: ", np.sqrt(mean_squared_error(y_test, y_pred)))

# Calculate the cross validated RMSE
scores = cross_val_score(ridge, X_train, y_train, scoring = "neg_root_mean_squared_error", cv = 5)
print("RMSE with cross validation: ", np.mean(np.abs(scores)))

The cross validated RMSE for the ridge regression is 0.7265.  This model performed roughly as well as the linear and lasso regression models.

# Random Forest

In [None]:
# Create the base model
rf_model = RandomForestRegressor(random_state = 42, n_jobs = -1)

# Fit the model
rf_model.fit(X_train,y_train)

# Create predictions
y_pred = rf_model.predict(X_test)

# Calculate root mean squared error
print("RMSE: ", np.sqrt(mean_squared_error(y_test, y_pred)))

# Calculate the cross validated RMSE
scores = cross_val_score(rf_model, X_train, y_train, scoring = "neg_root_mean_squared_error", cv = 5)
print("RMSE with cross validation: ", np.mean(np.abs(scores)))

* The cross validated RMSE for the base random forest model is 0.7089.  This is our best model so far.
* The hidden code below shows the parameter grid used to tune the model.  The code is commented out to reduce the computational expense.

In [None]:
"""
# Create the parameter grid
params = {'max_depth': [5,10,15,20],
         'max_features': [1.0, 'sqrt']}  # 1.0 uses all features; newer scikit-learn releases no longer accept 'auto'

# Create exhaustive search over parameters
grid = GridSearchCV(estimator=rf_model, param_grid=params,
                   scoring="neg_mean_squared_error", cv = 3, verbose = 1)

# Fit the grid search to the training data
grid.fit(X_train,y_train)

# Print the best parameters
print("Best parameters found: ", grid.best_params_)

"""

In [None]:
# Set best parameters to model
rf_model.set_params(random_state = 42, max_depth=20, max_features = 'sqrt') 

# Fit the new model
rf_model.fit(X_train,y_train)

# Make predictions 
y_pred = rf_model.predict(X_test)

# Calculate root mean squared error
print("RMSE: ", np.sqrt(mean_squared_error(y_test, y_pred)))

# Calculate the cross validated RMSE
scores = cross_val_score(rf_model, X_train, y_train, scoring = "neg_root_mean_squared_error", cv = 5)
print("RMSE with cross validation: ", np.mean(np.abs(scores)))

The cross validated RMSE for the tuned random forest model is 0.7058.  This is now our best model!

# XGBoost

In [None]:
# Create the base model
xgb_model = xgb.XGBRegressor(random_state=42)

# Fit the model
xgb_model.fit(X_train,y_train)

# Make predictions
y_pred = xgb_model.predict(X_test)

# Calculate root mean squared error
print("RMSE: ", np.sqrt(mean_squared_error(y_test, y_pred)))

# Calculate the cross validated RMSE
scores = cross_val_score(xgb_model, X_train, y_train, scoring = "neg_root_mean_squared_error", cv = 5)
print("RMSE with cross validation: ", np.mean(np.abs(scores)))

* The cross validated RMSE for the base XGBoost model is 0.7050.  Looks like we keep improving!  This is now our best model.
* The hidden code below shows the parameter grid used to tune the model.  The code is commented out to reduce the computational expense.

In [None]:
"""
from sklearn.model_selection import GridSearchCV

params = {'colsample_bytree': [.1,.2,.3,.4,.5,.6,.7,.8,.9],
          'n_estimators': [100],
          'max_depth': [4,5,6],
          'learning_rate': [0.05, 0.10, 0.15, 0.20, 0.25, 0.30]}

xgb_model = xgb.XGBRegressor()

grid = GridSearchCV(estimator=xgb_model, param_grid=params,
                   scoring="neg_mean_squared_error", cv = 4, verbose = 1)

grid.fit(X_train,y_train)

print("Best parameters found: ", grid.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(grid.best_score_)))
"""

In [None]:
# Set best parameters to model
xgb_model.set_params(random_state=42, colsample_bytree=0.5, learning_rate= 0.2,max_depth=6, n_estimators=100) 
# Fit the new model
xgb_model.fit(X_train,y_train)

# Create predictions
y_pred = xgb_model.predict(X_test)

# Calculate root mean squared error
print("RMSE: ", np.sqrt(mean_squared_error(y_test, y_pred)))

# Calculate the cross validated RMSE
scores = cross_val_score(xgb_model, X_train, y_train, scoring = "neg_root_mean_squared_error", cv = 5, verbose= 0)
print("RMSE with cross validation: ", np.mean(np.abs(scores)))

The cross validated RMSE for the tuned XGBoost model is 0.7030.  We continue to lower the score!

# LightGBM

In [None]:
# Create the base model
lgbm_model = LGBMRegressor(random_state=42)

# Fit the model
lgbm_model.fit(X_train,y_train)

# Make predictions
y_pred = lgbm_model.predict(X_test)

# Calculate root mean squared error
print("RMSE: ", np.sqrt(mean_squared_error(y_test, y_pred)))

# Calculate the cross validated RMSE
scores = cross_val_score(lgbm_model, X_train, y_train, scoring = "neg_root_mean_squared_error", cv = 5, verbose= 0)
print("RMSE with cross validation: ", np.mean(np.abs(scores)))

* The cross validated RMSE for the base LGBM model is 0.7034.
* The hidden code below shows the parameter grid used to tune the model.  The code is commented out to reduce the computational expense.

In [None]:
"""
from sklearn.model_selection import GridSearchCV

params = {'feature_fraction': [.1,.2,.3,.4,.5,.6,.7,.8,.9],
         'max_depth': [2,4,6,8,10,12,14,16,18,20],
         'max_bin': [5,10,15,20],
         'n_estimators': [100,500,1000,2000,4000]}

grid = GridSearchCV(estimator=lgbm_model, param_grid=params,
                   scoring="neg_mean_squared_error", cv = 4, verbose = 0)

grid.fit(X_train,y_train)

print("Best parameters found: ", grid.best_params_)
print("Lowest RMSE found: ", np.sqrt(np.abs(grid.best_score_)))
"""

In [None]:
# Set the best parameters
lgbm_model.set_params(random_state=42, feature_fraction=.4, max_depth=6, max_bin = 20, n_estimators=500) 

# Fit the new model
lgbm_model.fit(X_train,y_train)

# Create predictions
y_pred = lgbm_model.predict(X_test)

# Calculate root mean squared error
print("RMSE: ", np.sqrt(mean_squared_error(y_test, y_pred)))

# Calculate the cross validated RMSE
scores = cross_val_score(lgbm_model, X_train, y_train, scoring = "neg_root_mean_squared_error", cv = 5, verbose= 0)
print("RMSE with cross validation: ", np.mean(np.abs(scores)))


* The cross validated RMSE for the tuned LGBM model is 0.7034, the same as the base model to four decimal places.

# Voting

After experimenting with different weights, it became clear that the voting model performs best without the random forest.
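The weight experiments mentioned above can be sketched as a simple scan: a voting regressor predicts a weighted average of its members' predictions, so candidate weights can be compared directly on hold-out predictions. The arrays below are synthetic stand-ins for two models' outputs, and `pred_a`, `pred_b`, and the 0.1 step are assumptions, not the notebook's actual experiment.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-squared error between two arrays."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Synthetic stand-ins for hold-out targets and two models' predictions
rng = np.random.default_rng(0)
y_true = rng.normal(8.0, 0.7, 5000)
pred_a = y_true + rng.normal(0.0, 0.5, 5000)   # hypothetical model A
pred_b = y_true + rng.normal(0.0, 0.6, 5000)   # hypothetical model B, independent errors

# Try weights w for model A and (1 - w) for model B in steps of 0.1
results = {}
for w in np.linspace(0.0, 1.0, 11):
    blend = w * pred_a + (1.0 - w) * pred_b
    results[round(float(w), 1)] = rmse(y_true, blend)

best_w = min(results, key=results.get)
print(best_w, results[best_w])
```

Weights chosen this way are tuned to the hold-out set, so the resulting score is somewhat optimistic. Cross-validating the final `VotingRegressor`, as done below, gives a fairer estimate.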

In [None]:
# Create the voting model 
clf_voting = VotingRegressor(
    
    estimators=[
        ('XGBoost',xgb_model),
        ('LGBoost',lgbm_model)],
    
    #Choose weights for each model
    weights = [.5,.5]
)

In [None]:
# Train the model
clf_voting.fit(X_train,y_train)

#Make predictions
y_pred = clf_voting.predict(X_test)

# Calculate root mean squared error
print("RMSE: ", np.sqrt(mean_squared_error(y_test, y_pred)))

# Calculate the cross validated RMSE
scores = cross_val_score(clf_voting, X_train, y_train, scoring = "neg_root_mean_squared_error", cv = 5, verbose= 0)
print("RMSE with cross validation: ", np.mean(np.abs(scores)))

* The cross validated RMSE for the voting regressor is 0.7014, our lowest score so far.

In [None]:
# Create submission
submission = pd.read_csv(input_path / 'sample_submission.csv', index_col='id')
submission['target'] = clf_voting.predict(test)
submission.to_csv('Jan_Tab_Playground.csv')