# <center>HousePricesRegressor </center>
<img src= "https://miro.medium.com/max/402/1*2foyXif7hwkO8wWB5T9KtQ.png" height="200" align="center"/>

<a id="Table-Of-Contents"></a>
# Table Of Contents
* [Table Of Contents](#Table-Of-Contents)
* [Introduction](#Introduction)
* [Importing Libraries](#Importing-Libraries)
* [Task Details](#Task-Details)
* [Feature Description](#Feature-Description)
* [Read in Data](#Read-in-Data)
    - [Training Data](#Training-Data)
    - [Test Data](#Test-Data)
* [Preprocessing Data](#Preprocessing-Data)
    - [Label Encoding](#Label-Encoding)
    - [Train-Test Split](#Train-Test-Split)
* [Initial Models](#Initial-Models)
* [LightGBM Regressor](#LightGBM-Regressor)
    - [Bayesian Optimization](#Bayesian-Optimization)
    - [Tuning LightGBM](#Tuning-LightGBM)
    - [Feature Importance](#Feature-Importance)
    - [Cross Validation](#Cross-Validation)
* [Prediction for Test Data](#Prediction-for-Test-Data)
* [Conclusion](#Conclusion)

<a id="Introduction"></a>
# Introduction 
This notebook walks through several machine learning techniques for regression. The main focus is tuning LightGBM's hyperparameters with Bayesian optimization. This is a beginner-level notebook, but I believe you will still find it useful to read. Please leave comments on where I can improve, and give it a like! Thank you!

<a id="Importing-Libraries"></a>
# Importing Libraries

In [None]:
#%% Imports

# Basic Imports 
import numpy as np
import pandas as pd

# Plotting 
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
%matplotlib inline

# Preprocessing
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import LabelEncoder

# Metrics 
from sklearn.metrics import mean_squared_error, mean_absolute_error

# ML Models
import lightgbm as lgb
from lightgbm import LGBMRegressor 
import xgboost as xg 
from sklearn.ensemble import RandomForestRegressor
from sklearn import svm

# Model Tuning 
from bayes_opt import BayesianOptimization

# Feature Importance 
import shap

# Ignore Warnings 
import warnings
warnings.filterwarnings('ignore')

<a id="Task-Details"></a>
# Task Details

## Goal
It is your job to predict the sales price for each house. For each Id in the test set, you must predict the value of the SalePrice variable. 

## Metric
Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
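
As a small illustration of the metric, the sketch below computes the RMSE between log prices in plain Python. The helper name `log_rmse` and the example prices are made up for this illustration; they are not part of the competition code.

```python
import math

def log_rmse(y_true, y_pred):
    """RMSE between the logs of the true and predicted prices (hypothetical helper)."""
    squared_errors = [(math.log(t) - math.log(p)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up prices: both predictions are 10% too high,
# one for a cheap house and one for an expensive house.
cheap = log_rmse([100_000], [110_000])
expensive = log_rmse([1_000_000], [1_100_000])

# The same relative error gives the same score at either price level.
print(cheap, expensive)
```

Because the score depends only on the ratio of predicted to observed price, a 10% miss counts the same for both houses.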

<a id="Feature-Description"></a>
# Feature Description 
Here's a brief version of what you'll find in the data description file.

## Target  
SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.  

## Features
MSSubClass: The building class  
MSZoning: The general zoning classification  
LotFrontage: Linear feet of street connected to property  
LotArea: Lot size in square feet  
Street: Type of road access  
Alley: Type of alley access  
LotShape: General shape of property  
LandContour: Flatness of the property  
Utilities: Type of utilities available  
LotConfig: Lot configuration  
LandSlope: Slope of property  
Neighborhood: Physical locations within Ames city limits  
Condition1: Proximity to main road or railroad  
Condition2: Proximity to main road or railroad (if a second is present)  
BldgType: Type of dwelling  
HouseStyle: Style of dwelling  
OverallQual: Overall material and finish quality  
OverallCond: Overall condition rating  
YearBuilt: Original construction date  
YearRemodAdd: Remodel date  
RoofStyle: Type of roof  
RoofMatl: Roof material  
Exterior1st: Exterior covering on house  
Exterior2nd: Exterior covering on house (if more than one material)  
MasVnrType: Masonry veneer type  
MasVnrArea: Masonry veneer area in square feet  
ExterQual: Exterior material quality  
ExterCond: Present condition of the material on the exterior  
Foundation: Type of foundation  
BsmtQual: Height of the basement  
BsmtCond: General condition of the basement  
BsmtExposure: Walkout or garden level basement walls  
BsmtFinType1: Quality of basement finished area  
BsmtFinSF1: Type 1 finished square feet  
BsmtFinType2: Quality of second finished area (if present)  
BsmtFinSF2: Type 2 finished square feet  
BsmtUnfSF: Unfinished square feet of basement area  
TotalBsmtSF: Total square feet of basement area  
Heating: Type of heating  
HeatingQC: Heating quality and condition  
CentralAir: Central air conditioning  
Electrical: Electrical system  
1stFlrSF: First Floor square feet  
2ndFlrSF: Second floor square feet  
LowQualFinSF: Low quality finished square feet (all floors)  
GrLivArea: Above grade (ground) living area square feet  
BsmtFullBath: Basement full bathrooms  
BsmtHalfBath: Basement half bathrooms  
FullBath: Full bathrooms above grade  
HalfBath: Half baths above grade  
Bedroom: Number of bedrooms above basement level  
Kitchen: Number of kitchens  
KitchenQual: Kitchen quality  
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)  
Functional: Home functionality rating  
Fireplaces: Number of fireplaces  
FireplaceQu: Fireplace quality  
GarageType: Garage location  
GarageYrBlt: Year garage was built  
GarageFinish: Interior finish of the garage  
GarageCars: Size of garage in car capacity  
GarageArea: Size of garage in square feet  
GarageQual: Garage quality  
GarageCond: Garage condition  
PavedDrive: Paved driveway  
WoodDeckSF: Wood deck area in square feet  
OpenPorchSF: Open porch area in square feet  
EnclosedPorch: Enclosed porch area in square feet  
3SsnPorch: Three season porch area in square feet  
ScreenPorch: Screen porch area in square feet  
PoolArea: Pool area in square feet  
PoolQC: Pool quality  
Fence: Fence quality  
MiscFeature: Miscellaneous feature not covered in other categories  
MiscVal: $Value of miscellaneous feature  
MoSold: Month Sold  
YrSold: Year Sold  
SaleType: Type of sale  
SaleCondition: Condition of sale  

<a id="Read-in-Data"></a>
# Read in Data

<a id="Training-Data"></a>
## Training Data

In [None]:
#%% Read train.csv
train_csv = pd.read_csv('../input/house-prices-advanced-regression-techniques/train.csv')

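# Initial glance at train.csv
# info() prints its report itself and returns None, so it is not wrapped in print();
# show_counts replaces the null_counts argument, which newer pandas versions no longer accept
train_csv.info(verbose = True, show_counts = True)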

In [None]:
# Check for missing or NaN data in train.csv
pd.set_option("display.max_rows", None, "display.max_columns", None)
print('Feature      Number of NaN')
print(train_csv.isnull().sum().sort_values(ascending=False).loc[lambda x : x!=0])

<a id="Test-Data"></a>
## Test Data

In [None]:
#%% Read test.csv
test_csv = pd.read_csv('../input/house-prices-advanced-regression-techniques/test.csv')

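# Initial glance at test.csv
# info() prints its report itself and returns None, so it is not wrapped in print();
# show_counts replaces the null_counts argument, which newer pandas versions no longer accept
test_csv.info(verbose = True, show_counts = True)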

In [None]:
# Check for missing or NaN data in test.csv
pd.set_option("display.max_rows", None, "display.max_columns", None)
print('Feature      Number of NaN')
print(test_csv.isnull().sum().sort_values(ascending=False).loc[lambda x : x!=0])

<a id="Preprocessing-Data"></a>
# Preprocessing Data
We see that both train.csv and test.csv have missing data. The features ['PoolQC', 'MiscFeature', 'Alley'] are dropped because more than 90% of their values are missing. We preprocess the dataset by removing these mostly empty columns and imputing the remaining missing values.

In [None]:
# Return the columns in which more than num_percent of the values are missing
# (num_percent is a fraction, so 0.9 means 90%)
def findMissing(df , num_percent = 0.9):
    len_df = len(df)
    amt_missing = len_df * num_percent
    missing_columns = (df.isnull().sum().sort_values(ascending=False).loc[lambda x : x > amt_missing]).index.to_list()
    missing_columns_count = [df[missing_column].isnull().sum() for missing_column in missing_columns]
    # format as a percentage with two decimals (avoids floating-point noise in the printed value)
    missing_columns_percent = [f"{df[missing_column].isnull().sum()/len_df:.2%}" for missing_column in missing_columns]
    missing_columns_type = [df[missing_column].dtypes for missing_column in missing_columns]
    print(missing_columns_count)
    print(missing_columns_percent)
    print(missing_columns_type)
    return missing_columns

print(findMissing(train_csv))
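
To see what the 90% threshold means in practice, here is a minimal sketch on a made-up frame; the column names and values are invented for illustration. `isnull().mean()` gives the fraction of missing values per column, which is equivalent to comparing the missing count against `len(df) * 0.9`.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, 20 rows each.
toy = pd.DataFrame({
    "sparse": [1.0] + [np.nan] * 19,             # 95% missing
    "some_missing": [np.nan] * 5 + [2.0] * 15,   # 25% missing
    "complete": list(range(20)),                 # 0% missing
})

# Fraction of missing values per column.
missing_fraction = toy.isnull().mean()

# Columns above the 90% threshold are candidates for dropping.
to_drop = missing_fraction[missing_fraction > 0.9].index.to_list()
print(missing_fraction)
print(to_drop)
```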

In [None]:
# columns to be dropped from train.csv and test.csv because more than 90% of their training values are missing
drop_columns = (train_csv.isnull().sum().sort_values(ascending=False).loc[lambda x : x > .90*len(train_csv)]).index.to_list()
drop_columns.append('Id')

# save the 'Id' for Train and Test 
train_Id = train_csv['Id'].to_list()
test_Id = test_csv['Id'].to_list()

print('Feature      Number of NaN')
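# use the length of test.csv itself for the threshold instead of the hard-coded training size
print(test_csv.isnull().sum().sort_values(ascending=False).loc[lambda x : x > .90*len(test_csv)])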

For features/columns with few missing values (less than 10% of each individual dataset), we can use median imputation for numerical values and mode imputation for categorical values.

In [None]:
# mode imputation on categorical features 
# median imputation on numeric features
train_clean = train_csv.drop(drop_columns, axis = 'columns', errors = 'ignore')
test_clean = test_csv.drop(drop_columns, axis = 'columns', errors = 'ignore')

train_10_percent_missing_features = train_clean.isnull().sum().sort_values(ascending=False).loc[lambda x : (x<.10*len(train_clean))  & (x != 0)].index.to_list()
train_10_percent_missing_features_cat = train_clean[train_10_percent_missing_features].select_dtypes('object').columns.to_list()
train_10_percent_missing_features_num = train_clean[train_10_percent_missing_features].select_dtypes('number').columns.to_list()

train_clean[train_10_percent_missing_features_cat] = train_clean[train_10_percent_missing_features_cat].fillna(train_clean[train_10_percent_missing_features_cat].mode().iloc[0])
# median() already returns one value per column; adding .iloc[0] would fill every column with the first column's median
train_clean[train_10_percent_missing_features_num] = train_clean[train_10_percent_missing_features_num].fillna(train_clean[train_10_percent_missing_features_num].median())


test_10_percent_missing_features = test_clean.isnull().sum().sort_values(ascending=False).loc[lambda x : (x<.10*len(test_clean))  & (x != 0)].index.to_list()
test_10_percent_missing_features_cat = test_clean[test_10_percent_missing_features].select_dtypes('object').columns.to_list()
test_10_percent_missing_features_num = test_clean[test_10_percent_missing_features].select_dtypes('number').columns.to_list()

test_clean[test_10_percent_missing_features_cat] = test_clean[test_10_percent_missing_features_cat].fillna(test_clean[test_10_percent_missing_features_cat].mode().iloc[0])
# median() already returns one value per column; adding .iloc[0] would fill every column with the first column's median
test_clean[test_10_percent_missing_features_num] = test_clean[test_10_percent_missing_features_num].fillna(test_clean[test_10_percent_missing_features_num].median())

# LotFrontage is a numeric feature with more than 10% missing values, so the step above skips it.
# We impute it with the median as well.

train_clean["LotFrontage"] = train_clean["LotFrontage"].fillna(train_clean["LotFrontage"].median())
test_clean["LotFrontage"] = test_clean["LotFrontage"].fillna(test_clean["LotFrontage"].median())

print("train_clean")
print('Feature      Number of NaN')
print(train_clean.isnull().sum().sort_values(ascending=False).loc[lambda x : x!=0])

print('\n')

print("test_clean")
print('Feature      Number of NaN')
print(test_clean.isnull().sum().sort_values(ascending=False).loc[lambda x : x!=0])
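
The imputation rule described above (median for numeric features, mode for categorical ones) can be illustrated on a toy frame. This is a minimal sketch with made-up column names and values, not the notebook's data.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    "area": [100.0, np.nan, 300.0, 200.0],
    "year": [1990.0, 2000.0, np.nan, 2010.0],
    "zone": ["RL", "RL", None, "RM"],
})

num_cols = toy.select_dtypes(include="number").columns.to_list()
cat_cols = toy.select_dtypes(exclude="number").columns.to_list()

filled = toy.copy()
# median() returns one median per column, so each column is filled with its own value.
filled[num_cols] = filled[num_cols].fillna(filled[num_cols].median())
# mode() returns a frame (ties produce several rows); row 0 holds the first mode per column.
filled[cat_cols] = filled[cat_cols].fillna(filled[cat_cols].mode().iloc[0])
print(filled)
```

Passing a Series to `fillna` on a DataFrame matches values to columns by name, which is why the per-column medians and modes end up in the right places.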

For the rest of the features, we can simply treat NaN as its own category. This is done in MultiColumnLabelEncoder.

In [None]:
# Separate train_clean into target and features 
y = train_clean['SalePrice']
X_train_clean = train_clean.drop('SalePrice',axis = 'columns')

# save the index for X_train_clean 
X_train_clean_index = X_train_clean.index.to_list()

# row bind the train features with the test features 
# this makes it easier to apply label encoding onto the entire dataset 
# (pd.concat is used because DataFrame.append is not available in newer pandas versions)
X_total = pd.concat([X_train_clean, test_clean], ignore_index = True)
X_total.info(verbose = True, show_counts = True)

# save the index for the test rows 
X_test_clean_index = np.setdiff1d(X_total.index.to_list() ,X_train_clean_index) 

<a id="Label-Encoding"></a>
# Label Encoding

In [None]:
#%% MultiColumnLabelEncoder
# Code snippet found on Stack Overflow 
# https://stackoverflow.com/questions/24458645/label-encoding-across-multiple-columns-in-scikit-learn
from sklearn.preprocessing import LabelEncoder


class MultiColumnLabelEncoder:
    def __init__(self,columns = None):
        self.columns = columns # array of column names to encode

    def fit(self,X,y=None):
        return self # not relevant here

    def transform(self,X):
        '''
        Transforms columns of X specified in self.columns using
        LabelEncoder(). If no columns specified, transforms all
        columns in X.
        '''
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                # convert float NaN --> string NaN
                output[col] = output[col].fillna('NaN')
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            for colname,col in output.items():
                output[colname] = LabelEncoder().fit_transform(col)
        return output

    def fit_transform(self,X,y=None):
        return self.fit(X,y).transform(X)

# store the categorical feature names as a list      
cat_features = X_total.select_dtypes(['object']).columns.to_list()

# use MultiColumnLabelEncoder to apply LabelEncoding on cat_features 
# NaN is kept as its own category, so the remaining missing values are not imputed
X_total_encoded = MultiColumnLabelEncoder(columns = cat_features).fit_transform(X_total)
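
To make the "NaN as its own category" idea concrete, here is a minimal sketch on a made-up column. It uses `pd.factorize(sort=True)` as a pandas-only stand-in for `LabelEncoder`; both assign integer codes in sorted order of the unique values. The series name and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up column with missing entries, for illustration only.
fence = pd.Series(["GdPrv", np.nan, "MnPrv", np.nan, "GdPrv"])

# Replace missing values with the string 'NaN' so they form their own category.
as_category = fence.fillna("NaN")

# sort=True assigns codes in sorted order of the unique values,
# which mirrors how LabelEncoder orders its classes.
codes, classes = pd.factorize(as_category, sort=True)
print(list(classes))
print(list(codes))
```

The missing entries receive an ordinary integer code, so a tree-based model can split on "missing" just like on any other level.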

In [None]:
##% Split X_total_encoded 
X_train_clean_encoded = X_total_encoded.iloc[X_train_clean_index, :]
X_test_clean_encoded = X_total_encoded.iloc[X_test_clean_index, :].reset_index(drop = True) 
X_train_clean_encoded.info()

In [None]:
##% Before and After LabelEncoding for train.csv 
display(X_train_clean.head())
display(X_train_clean_encoded.head())

In [None]:
##% Before and After LabelEncoding for test.csv 
display(test_clean.head())
display(X_test_clean_encoded.head())

<a id="Train-Test-Split"></a>
# Train-Test Split

In [None]:
#%% Create the train and validation sets using an 80-20 split
train_X, valid_X, train_y, valid_y = train_test_split(X_train_clean_encoded, y, test_size=0.2, shuffle = True, random_state=0)

train_X.info()


<a id="Initial-Models"></a>
# Initial Models
We can now apply different machine learning algorithms to see which model performs better on this dataset. The techniques applied in this section are listed below.
 
1. RandomForest
2. Support Vector Machine
3. XGBoost
4. LightGBM

In [None]:
##% evaluateRegressor
# from sklearn.metrics import mean_squared_error, mean_absolute_error
def evaluateRegressor(true,predicted,message = "Test set"):
    MSE = mean_squared_error(true,predicted)
    MAE = mean_absolute_error(true,predicted)
    # take the square root directly; the `squared` argument is not available in every scikit-learn release
    RMSE = np.sqrt(MSE)
    LogRMSE = np.sqrt(mean_squared_error(np.log(true),np.log(predicted)))
    print(message)
    print("MSE:", MSE)
    print("MAE:", MAE)
    print("RMSE:", RMSE)
    print("LogRMSE:", LogRMSE)

In [None]:
##% Plot True vs predicted values. Useful for continuous y 
def PlotPrediction(true,predicted, title = "Dataset: "):
    fig = plt.figure(figsize=(20,20))
    ax1 = fig.add_subplot(111)
    ax1.set_title(title + 'True vs Predicted')
    ax1.scatter(list(range(0,len(true))),true, s=10, c='r', marker="o", label='True')
    ax1.scatter(list(range(0,len(predicted))), predicted, s=10, c='b', marker="o", label='Predicted')
    plt.legend(loc='upper right');
    plt.show()

In [None]:
##% Initial Models
# from sklearn.ensemble import RandomForestRegressor
# from sklearn import svm
# import lightgbm as lgb
# import xgboost as xg 

RFReg = RandomForestRegressor(random_state = 0).fit(train_X, train_y)
SVM = svm.SVR().fit(train_X, train_y) 
XGReg = xg.XGBRegressor(objective ='reg:squarederror', seed = 0,verbosity=0).fit(train_X,train_y) 
LGBMReg = lgb.LGBMRegressor(random_state=0).fit(train_X,train_y)

In [None]:
##% Model Metrics
print("Random Forest Regressor") 
predicted_train_y = RFReg.predict(train_X)
evaluateRegressor(train_y,predicted_train_y,"    Training Set")
predicted_valid_y = RFReg.predict(valid_X)
evaluateRegressor(valid_y,predicted_valid_y,"    Test Set")
print("\n")
    
print("Support Vector Machine") 
predicted_train_y = SVM.predict(train_X)
evaluateRegressor(train_y,predicted_train_y,"    Training Set")
predicted_valid_y = SVM.predict(valid_X)
evaluateRegressor(valid_y,predicted_valid_y,"    Test Set")
print("\n")


print("XGBoost Regressor") 
predicted_train_y = XGReg.predict(train_X)
evaluateRegressor(train_y,predicted_train_y,"    Training Set")
predicted_valid_y = XGReg.predict(valid_X)
evaluateRegressor(valid_y,predicted_valid_y,"    Test Set")
print("\n")

print("LightGBM Regressor") 
predicted_train_y = LGBMReg.predict(train_X)
evaluateRegressor(train_y,predicted_train_y,"    Training Set")
predicted_valid_y = LGBMReg.predict(valid_X)
evaluateRegressor(valid_y,predicted_valid_y,"    Test Set")

These models perform poorly because no parameter tuning was done. LightGBM performs best on the test set (the 20% held out above), so I use Bayesian optimization to tune its hyperparameters and obtain a better model.

<a id="LightGBM-Regressor"></a>
# LightGBM Regressor

<a id="Bayesian-Optimization"></a>
## Bayesian Optimization

In [None]:
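#%% parameter tuning for lightgbm 
# Reuse cat_features, the categorical column names collected before label encoding.
# After encoding these columns hold integers, so selecting 'object' dtypes from
# X_train_clean_encoded would return an empty list and no feature would be
# treated as categorical.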

# Create the LightGBM data containers
# Make sure that cat_features are used
train_data=lgb.Dataset(train_X,label=train_y, categorical_feature = cat_features,free_raw_data=False)
valid_data=lgb.Dataset(valid_X,label=valid_y, categorical_feature = cat_features,free_raw_data=False)

In [None]:
# https://medium.com/analytics-vidhya/hyperparameters-optimization-for-lightgbm-catboost-and-xgboost-regressors-using-bayesian-6e7c495947a9
# from lightgbm import LGBMRegressor 
# from bayes_opt import BayesianOptimization
def search_best_param(X,y,cat_features):
    
    trainXY = lgb.Dataset(data=X, label=y,categorical_feature = cat_features,free_raw_data=False)
    # define the lightGBM cross validation
    def lightGBM_CV(max_depth, num_leaves, n_estimators, learning_rate, subsample, colsample_bytree, 
                lambda_l1, lambda_l2, min_child_weight):
    
        params = {'boosting_type': 'gbdt', 'objective': 'regression', 'metric':'rmse', 'verbose': -1,
                  'early_stopping_round':100}
        
        params['max_depth'] = int(round(max_depth))
        params["num_leaves"] = int(round(num_leaves))
        params["n_estimators"] = int(round(n_estimators))
        params['learning_rate'] = learning_rate
        params['subsample'] = subsample
        params['colsample_bytree'] = colsample_bytree
        params['lambda_l1'] = max(lambda_l1, 0)
        params['lambda_l2'] = max(lambda_l2, 0)
        params['min_child_weight'] = min_child_weight
    
        score = lgb.cv(params, trainXY, nfold=5, seed=1, stratified=False, verbose_eval =False, metrics=['rmse'])

        return -np.min(score['rmse-mean']) # the optimizer maximizes its target, so return the negative RMSE

    # use bayesian optimization to search for the best hyper-parameter combination
    lightGBM_Bo = BayesianOptimization(lightGBM_CV, 
                                       {
                                          'max_depth': (5, 50),
                                          'num_leaves': (20, 100),
                                          'n_estimators': (50, 1000),
                                          'learning_rate': (0.01, 0.3),
                                          'subsample': (0.7, 0.8),
                                          'colsample_bytree' :(0.5, 0.99),
                                          'lambda_l1': (0, 5),
                                          'lambda_l2': (0, 3),
                                          'min_child_weight': (2, 50) 
                                      },
                                       random_state = 1,
                                       verbose = 0
                                      )
    np.random.seed(1)
    
    lightGBM_Bo.maximize(init_points=5, n_iter=25) 
    
    params_set = lightGBM_Bo.max['params']
    
    # get the params of the maximum target (an explicit check; this matches lightGBM_Bo.max above)
    max_target = -np.inf
    for i in lightGBM_Bo.res: # loop through all evaluated results
        if i['target'] > max_target:
            params_set = i['params']
            max_target = i['target']
    
    params_set.update({'verbose': -1})
    params_set.update({'metric': 'rmse'})
    params_set.update({'boosting_type': 'gbdt'})
    params_set.update({'objective': 'regression'})
    
    params_set['max_depth'] = int(round(params_set['max_depth']))
    params_set['num_leaves'] = int(round(params_set['num_leaves']))
    params_set['n_estimators'] = int(round(params_set['n_estimators']))
    params_set['seed'] = 1 #set seed
    
    return params_set

best_params = search_best_param(train_X,train_y,cat_features)
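
Two details of the search above are easy to miss: the optimizer maximizes its target, so the RMSE is negated, and it proposes floats, so integer-valued parameters are rounded before use. The sketch below shows both, with plain random search standing in for Bayesian optimization and a made-up loss function standing in for `lgb.cv`. The names `toy_loss` and `objective` and the two-parameter bounds are assumptions for illustration only.

```python
import random

def toy_loss(max_depth, learning_rate):
    """Made-up loss surface with its minimum at max_depth=8, learning_rate=0.1."""
    return (max_depth - 8) ** 2 + 100 * (learning_rate - 0.1) ** 2

def objective(max_depth, learning_rate):
    # The optimizer proposes floats; tree depth must be an integer.
    depth = int(round(max_depth))
    # Return the negative loss so that "maximize" means "minimize the loss".
    return -toy_loss(depth, learning_rate)

bounds = {"max_depth": (5, 50), "learning_rate": (0.01, 0.3)}
rng = random.Random(1)

best_target, best_params = float("-inf"), None
for _ in range(200):
    # Random search (NOT Bayesian optimization): sample uniformly inside the bounds.
    candidate = {name: rng.uniform(low, high) for name, (low, high) in bounds.items()}
    target = objective(**candidate)
    if target > best_target:
        best_target, best_params = target, candidate

# Round the integer-valued parameter again before handing it to a model.
best_params["max_depth"] = int(round(best_params["max_depth"]))
print(best_params, -best_target)
```

Bayesian optimization differs in how it picks the next candidate (it fits a surrogate model to past results), but the negated objective and the rounding step work the same way.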

In [None]:
# Print best_params
for key, value in best_params.items():
    print(key, ' : ', value)

<a id="Tuning-LightGBM"></a>
## Tuning LightGBM

In [None]:
# Train lgbm_best using the best params found from Bayesian Optimization
lgbm_best = lgb.train(best_params,
                 train_data,
                 num_boost_round = 2500,
                 valid_sets = valid_data,
                 early_stopping_rounds = 200,
                 verbose_eval = 100
                 )


<a id="LightGBM-Model-Performance"></a>
## LightGBM Model Performance 

In [None]:
print("LightGBM Regressor Tuned") 
predicted_train_y = lgbm_best.predict(train_X)
evaluateRegressor(train_y,predicted_train_y,"    Training Set")
PlotPrediction(train_y,predicted_train_y,"Training Set: ")

In [None]:
predicted_valid_y = lgbm_best.predict(valid_X)
evaluateRegressor(valid_y,predicted_valid_y,"    Test Set")
PlotPrediction(valid_y,predicted_valid_y,"Test Set: ")

<a id="Feature-Importance"></a>
## Feature Importance 

In [None]:
##% Feature Importance 
# https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html
lgb.plot_importance(lgbm_best,figsize=(25,20))

In [None]:
##% Feature Importance using shap package 
# import shap
lgbm_best.params['objective'] = 'regression'
shap_values = shap.TreeExplainer(lgbm_best).shap_values(valid_X)
shap.summary_plot(shap_values, valid_X)

Both feature-importance plots show that **GrLivArea** and **LotArea** contribute a lot to predicting the price of a home. The shap package is preferred for measuring feature importance because it preserves consistency and accuracy. You can read more about the shap package in the links provided below:

[https://shap.readthedocs.io/en/latest/example_notebooks/tabular_examples/tree_based_models/Census%20income%20classification%20with%20LightGBM.html](https://shap.readthedocs.io/en/latest/example_notebooks/tabular_examples/tree_based_models/Census%20income%20classification%20with%20LightGBM.html)

[https://towardsdatascience.com/explain-your-model-with-the-shap-values-bc36aac4de3d](https://towardsdatascience.com/explain-your-model-with-the-shap-values-bc36aac4de3d)  

[https://towardsdatascience.com/interpretable-machine-learning-with-xgboost-9ec80d148d27](https://towardsdatascience.com/interpretable-machine-learning-with-xgboost-9ec80d148d27)

<a id="Cross-Validation"></a>
## Cross Validation 

In [None]:
# Cross Validation with LightGBM

def K_Fold_LightGBM(X_train, y_train , cat_features, num_folds = 3):
    num = 0
    models = []
    folds = KFold(n_splits=num_folds, shuffle=True, random_state=0)

    # train one model per fold (num_folds models in total)
    for n_fold, (train_idx, valid_idx) in enumerate (folds.split(X_train, y_train)):
        print(f"     model{num}")
        train_X, train_y = X_train.iloc[train_idx], y_train.iloc[train_idx]
        valid_X, valid_y = X_train.iloc[valid_idx], y_train.iloc[valid_idx]
        
        train_data=lgb.Dataset(train_X,label=train_y, categorical_feature = cat_features,free_raw_data=False)
        valid_data=lgb.Dataset(valid_X,label=valid_y, categorical_feature = cat_features,free_raw_data=False)
        
        params_set = search_best_param(train_X,train_y,cat_features)
        
        CV_LGBM = lgb.train(params_set,
                            train_data,
                            num_boost_round = 2500,
                            valid_sets = valid_data,
                            early_stopping_rounds = 200,
                            verbose_eval = 100
                           )
        
        # increasing early_stopping_rounds can lead to overfitting 
        models.append(CV_LGBM)
        
        # take the square root directly; the `squared` argument is not available in every scikit-learn release
        print("Train set logRMSE:", np.sqrt(mean_squared_error(np.log(train_y),np.log(models[num].predict(train_X)))))
        print(" Test set logRMSE:", np.sqrt(mean_squared_error(np.log(valid_y),np.log(models[num].predict(valid_X)))))
        print("\n")
        num = num + 1
        
    return models

lgbm_models = K_Fold_LightGBM(X_train_clean_encoded,y,cat_features,5)

In [None]:
# Predict y_preds by averaging the predictions of the models from cross validation 
def predict_cv(models_cv,X):
    y_preds = np.zeros(shape = X.shape[0])
    for model in models_cv:
        y_preds += model.predict(X)
        
    return y_preds/len(models_cv)

# evaluate the model using the entire dataset from train.csv
evaluateRegressor(y,predict_cv(lgbm_models,X_train_clean_encoded),"Train.csv ")
PlotPrediction(y,predict_cv(lgbm_models,X_train_clean_encoded),"Train.csv: ")
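
The averaging step can be checked in isolation with stand-in models. In this minimal sketch, `ConstantModel` is a hypothetical class that only mimics the `predict` method of a trained fold model, and `average_predictions` repeats the sum-then-divide logic on made-up inputs.

```python
import numpy as np

class ConstantModel:
    """Hypothetical stand-in for a trained fold model; predicts one fixed value."""
    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return np.full(len(X), self.value, dtype=float)

def average_predictions(models, X):
    # Sum each model's predictions, then divide by the number of models.
    total = np.zeros(len(X))
    for model in models:
        total += model.predict(X)
    return total / len(models)

fold_models = [ConstantModel(100.0), ConstantModel(200.0), ConstantModel(300.0)]
X_dummy = np.zeros((4, 2))  # four rows, two made-up features
averaged = average_predictions(fold_models, X_dummy)
print(averaged)  # every row gets the mean of the three models' predictions
```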

<a id="Prediction-for-Test-Data"></a>
# Prediction for Test Data

In [None]:
predictLGBM = lgbm_best.predict(X_test_clean_encoded)
submissionLGBM = pd.DataFrame({'Id':test_Id,'SalePrice':predictLGBM})
display(submissionLGBM.head())

predictLGBM_CV = predict_cv(lgbm_models,X_test_clean_encoded)
submissionLGBM_CV = pd.DataFrame({'Id':test_Id,'SalePrice':predictLGBM_CV})
display(submissionLGBM_CV.head())

In [None]:
##% Submit Predictions 
submissionLGBM.to_csv('submissionLGBM4.csv',index=False)
submissionLGBM_CV.to_csv('submissionLGBM_CV4.csv',index=False)

<a id="Conclusion"></a>
# Conclusion

**Conclusion**
* LightGBM is a great ML algorithm that handles categorical features and missing values 
* This is a great dataset to work on, and a lot of knowledge can be gained from working with it 
* Researching and reading other Kaggle notebooks is essential for becoming a better data scientist

**Challenges**
* This dataset had missing data so other imputation techniques could be used 
* Overfitting is an issue and might have occurred
* No Data Visualization due to lots of features 

**Closing Remarks**  
* Please comment and like the notebook if it is of use to you! Have a wonderful year! 

**Other Notebooks** 
* [https://www.kaggle.com/josephchan524/studentperformanceregressor-rmse-12-26-r2-0-26](https://www.kaggle.com/josephchan524/studentperformanceregressor-rmse-12-26-r2-0-26)
* [https://www.kaggle.com/josephchan524/bankchurnersclassifier-recall-97-accuracy-95](https://www.kaggle.com/josephchan524/bankchurnersclassifier-recall-97-accuracy-95)
* [https://www.kaggle.com/josephchan524/housepricesregressor-using-lightgbm](https://www.kaggle.com/josephchan524/housepricesregressor-using-lightgbm)
* [https://www.kaggle.com/josephchan524/tabularplaygroundregressor-using-lightgbm-feb2021](https://www.kaggle.com/josephchan524/tabularplaygroundregressor-using-lightgbm-feb2021)


2-12-2020
Joseph Chan 