# <center>TabularPlaygroundClassifier MAR2021</center>
<img src= "https://wallpaperaccess.com/full/1782494.jpg" height="200" align="center"/>


<a id="Table-Of-Contents"></a>
# Table Of Contents
* [Table Of Contents](#Table-Of-Contents)
* [Introduction](#Introduction)
* [Importing Libraries](#Importing-Libraries)
* [Task Details](#Task-Details)
* [Read in Data](#Read-in-Data)
    - [Train.csv](#Train.csv)
    - [Test.csv](#Test.csv)
    - [Notes](#Notes)
* [Data Visualization](#Data-Visualization)
    - [Categorical Features](#Categorical-Features)
    - [Continuous Features](#Continuous-Features)
    - [Target](#Target)
* [Preprocessing Data](#Preprocessing-Data)
    - [Label Encoding](#Label-Encoding)
    - [Train-Test Stratified Split](#Train-Test-Stratified-Split)
* [Random Forest Classifier](#Random-Forest-Classifier)
    - [Random Forest Bayesian Optimization](#Random-Forest-Bayesian-Optimization)
    - [Random Forest Cross Validation](#Random-Forest-Cross-Validation)
    - [Random Forest CV Model Performance](#Random-Forest-CV-Model-Peformance)
* [LightGBM Classifier](#LightGBM-Classifier)
    - [Bayesian Optimization](#Bayesian-Optimization)
    - [Tuning LightGBM](#Tuning-LightGBM)
    - [Feature Importance](#Feature-Importance)
    - [Cross Validation](#Cross-Validation)
    - [LightGBM CV Model Performance](#LightGBM-CV-Model-Peformance)
* [Prediction for Test.csv](#Prediction-for-Test.csv)
* [Conclusion](#Conclusion)

<a id="Importing-Libraries"></a>
# Importing Libraries

In [None]:
#%% Importing Libraries

# Basic Imports 
import numpy as np
import pandas as pd
from IPython.display import display, HTML

# Plotting 
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
%matplotlib inline

# Preprocessing
from sklearn.model_selection import train_test_split, StratifiedKFold,cross_val_score
from sklearn.preprocessing import LabelEncoder

# Metrics 
from sklearn.metrics import roc_auc_score

# ML Models
from sklearn.ensemble import RandomForestClassifier
import lightgbm as lgb

# Model Tuning 
from bayes_opt import BayesianOptimization

# Feature Importance 
import shap 

# Ignore Warnings 
import warnings
warnings.filterwarnings('ignore')

<a id="Introduction"></a>
# Introduction
This is my third competition notebook on Kaggle. I hope to learn more about working with tabular data, and I hope anyone who reads this learns something as well! This notebook works on a classification task. If you have any questions or comments, please leave them below!

<a id="Task-Details"></a>
# Task Details

## Goal
For this competition, you will be predicting a **binary target** based on a number of feature columns given in the data. The feature columns **cat0** - **cat18** are **categorical**, and the feature columns **cont0** - **cont10** are **continuous**.

## Metric
Submissions are evaluated on [area under the ROC curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) between the predicted probability and the observed target.
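
To make the metric concrete, here is a minimal sketch of what area under the ROC curve measures: the chance that a randomly chosen positive row gets a higher score than a randomly chosen negative row, with ties counted as half. The function name `pairwise_auc` and the toy labels and scores are illustrative assumptions, not competition data; the rest of this notebook uses `sklearn.metrics.roc_auc_score` instead.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs that are ranked correctly."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical toy example (not from train.csv)
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(toy_labels, toy_scores)
print(toy_auc)
```

Because only the ordering of the scores matters, the submitted probabilities do not need to be well calibrated to score well on this metric.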

<a id="Read-in-Data"></a>
# Read in Data

<a id="Train.csv"></a>
## Train.csv

In [None]:
#%% Read train.csv
train_csv = pd.read_csv('../input/tabular-playground-series-mar-2021/train.csv')

# Initial glance at train.csv
train_csv.info(verbose = True,show_counts=True) # info() prints on its own; wrapping it in print() only adds "None"

<a id="Test.csv"></a>
## Test.csv

In [None]:
#%% Read test.csv
test_csv = pd.read_csv('../input/tabular-playground-series-mar-2021/test.csv')

# Initial glance at test.csv
test_csv.info(verbose = True,show_counts=True)

<a id="Notes"></a>
## Notes

Train.csv and Test.csv have no missing values, so imputation is not needed. Since there aren't many features in this dataset, I can do a quick exploratory data analysis on the features and target.

<a id="Data-Visualization"></a>
# Data Visualization 

<a id="Categorical-Features"></a>
## Categorical Features

In [None]:
#%% PlotMultiplePie 
# Input: df = Pandas dataframe, categorical_features = list of features , dropna = boolean variable to use NaN or not
# Output: prints multiple px.pie() 

def PlotMultiplePie(df,categorical_features = None,dropna = False):
    # set a threshold of 30 unique values; more than that can lead to ugly pie charts 
    threshold = 30
    
    # if user did not set categorical_features 
    if categorical_features is None: 
        categorical_features = df.select_dtypes(['object','category']).columns.to_list()
        
    print("The Categorical Features are:",categorical_features)
    
    # loop through the list of categorical_features 
    for cat_feature in categorical_features: 
        num_unique = df[cat_feature].nunique(dropna = dropna)
        num_missing = df[cat_feature].isna().sum()
        # prints pie chart and info if unique values below threshold 
        if num_unique <= threshold:
            print('Pie Chart for: ', cat_feature)
            print('Number of Unique Values: ', num_unique)
            print('Number of Missing Values: ', num_missing)
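            # compute the counts once and pass them as arrays, so the plot does not
            # depend on how pandas names the value_counts() result
            counts = df[cat_feature].value_counts(dropna = dropna)
            fig = px.pie(values = counts.values, names = counts.index.astype(str),
                         title = cat_feature, template='ggplot2')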
            fig.show()
        else: 
            print('Pie Chart for ',cat_feature,' is unavailable due to the high number of Unique Values ')
            print('Number of Unique Values: ', num_unique)
            print('Number of Missing Values: ', num_missing)
            print('\n')

In [None]:
#%% Use PlotMultiplePie to see the distribution of the categorical variables 
PlotMultiplePie(train_csv.drop(["id"],axis = "columns"))

<a id="Continuous-Features"></a>
## Continuous Features

In [None]:
#%% Plot the continuous features in the dataset 
continous_features = train_csv.drop(["id","target"],axis = "columns").select_dtypes(['float64']).columns.to_list()

for cont_feature in continous_features: 
    plt.figure()
    plt.title(cont_feature)
    ax = sns.histplot(train_csv[cont_feature])

<a id="Target"></a>
## Target

In [None]:
PlotMultiplePie(train_csv,categorical_features = ["target"])

<a id="Preprocessing-Data"></a>
# Preprocessing Data
Because Train.csv and Test.csv have no missing data, imputation is not needed.  
Label encoding is still required, as this dataset has categorical features.

In [None]:
# save the 'id' for Train and Test 
train_csv_id = train_csv['id'].to_list()
test_csv_id = test_csv['id'].to_list()

# Separate train_csv into target and features 
y_train_csv = train_csv['target']
X_train_csv = train_csv.drop('target',axis = 'columns')

# Save the index for X_train_csv 
X_train_csv_index = X_train_csv.index.to_list()

# Row bind train.csv features with test.csv features 
# this makes it easier to apply label encoding onto the entire dataset 
X_train_test = pd.concat([X_train_csv, test_csv], ignore_index = True) # DataFrame.append was removed in pandas 2.0

# save the index for test.csv 
X_test_csv_index = np.setdiff1d(X_train_test.index.to_list() ,X_train_csv_index) 

# drop id from X_train_test
X_train_test = X_train_test.drop('id',axis = 'columns')

# X_train_test.info()

<a id="Label-Encoding"></a>
## Label Encoding

In [None]:
#%% MultiColumnLabelEncoder
# Code snippet found on Stack Overflow 
# https://stackoverflow.com/questions/24458645/label-encoding-across-multiple-columns-in-scikit-learn
# from sklearn.preprocessing import LabelEncoder

class MultiColumnLabelEncoder:
    def __init__(self,columns = None):
        self.columns = columns # array of column names to encode

    def fit(self,X,y=None):
        return self # not relevant here

    def transform(self,X):
        '''
        Transforms columns of X specified in self.columns using
        LabelEncoder(). If no columns specified, transforms all
        columns in X.
        '''
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                # convert float NaN --> string NaN
                output[col] = output[col].fillna('NaN')
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            for colname,col in output.items(): # iteritems() was removed in pandas 2.0
                output[colname] = LabelEncoder().fit_transform(col)
        return output

    def fit_transform(self,X,y=None):
        return self.fit(X,y).transform(X)

# store the categorical feature names as a list      
cat_features = X_train_test.select_dtypes(['object']).columns.to_list()

# use MultiColumnLabelEncoder to apply LabelEncoding on cat_features 
# uses NaN as a value , no imputation will be used for missing data
X_train_test_encoded = MultiColumnLabelEncoder(columns = cat_features).fit_transform(X_train_test)
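
The reason for stacking train and test before encoding can be shown with a small sketch. `LabelEncoder` assigns integer codes to the sorted distinct values it sees, so fitting on the combined rows gives every category one shared code, including a category that appears only in the test rows. The helpers `fit_label_codes` and `encode` and the toy columns below are hypothetical and written in plain Python for illustration; they are not the scikit-learn implementation.

```python
def fit_label_codes(values):
    """Map each distinct value to an integer, in sorted order."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


def encode(values, codes):
    """Replace every value with its integer code."""
    return [codes[value] for value in values]


# Hypothetical toy column: category "C" appears only in the test rows.
toy_train = ["A", "B", "A"]
toy_test = ["B", "C"]

# Fit on train + test together, then encode each part with the shared mapping.
codes = fit_label_codes(toy_train + toy_test)
train_encoded = encode(toy_train, codes)
test_encoded = encode(toy_test, codes)
print(codes, train_encoded, test_encoded)
```

Fitting the mapping on `toy_train` alone would leave "C" without a code, and encoding the test rows would then fail with a `KeyError`.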

In [None]:
#%% Split X_train_test_encoded back into the train.csv and test.csv rows 
X_train_csv_encoded = X_train_test_encoded.iloc[X_train_csv_index, :]
X_test_csv_encoded = X_train_test_encoded.iloc[X_test_csv_index, :].reset_index(drop = True) 

In [None]:
##% Before and After LabelEncoding for train.csv 
display(X_train_csv.head().drop("id",axis = 'columns'))
display(X_train_csv_encoded.head())

In [None]:
##% Before and After LabelEncoding for test.csv 
display(test_csv.head().drop("id",axis = 'columns'))
display(X_test_csv_encoded.head())

<a id="Train-Test-Stratified-Split"></a>
## Train-Test Stratified Split

In [None]:
#%% Train-test stratified split using an 80-20 split
X_train, X_test, y_train, y_test = train_test_split(X_train_csv_encoded, y_train_csv, test_size=0.2, shuffle = True, stratify = y_train_csv, random_state=0)

for df in [X_train, X_test, y_train, y_test]:
    df.reset_index(drop = True,inplace = True)
    
print(" Training Target")
print(y_train.value_counts())
print("\n")
print(" Test Target")
print(y_test.value_counts())
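
As a rough illustration of what a stratified split does, the sketch below splits each class separately so that both parts keep about the same class ratio. The function `stratified_split` and the toy target are assumptions made for this example; it is a simplified stand-in, not how `train_test_split` is implemented internally.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx) with roughly the same class ratio in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # shuffle within each class
        n_test = round(len(indices) * test_size)   # same share taken from every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical imbalanced target: 70 zeros and 30 ones.
toy_target = [0] * 70 + [1] * 30
train_idx, test_idx = stratified_split(toy_target)
test_positive_rate = sum(toy_target[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_positive_rate)
```

Without stratification, a random 20% sample of an imbalanced target could end up with a noticeably different positive rate, which makes the held-out ROC AUC noisier.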

In [None]:
display(X_train)
display(X_test)

<a id="Initial Models"></a>
# Initial Models
I applied different machine learning algorithms to test which model performs better on this dataset. The techniques applied in this section are listed below.
 
1. Random Forest Classifier 
2. LightGBM Classifier

In [None]:
#% Initial Models
RFC = RandomForestClassifier(n_estimators = 50, max_depth = 10, n_jobs = -1,random_state = 0).fit(X_train, y_train)
LGBMC = lgb.LGBMClassifier(num_leaves = 50,max_depth = 10,random_state=0).fit(X_train,y_train)

In [None]:
print("     Random Forest Classifier")
print("Training Dataset")
print("ROC_AUC_SCORE: ",roc_auc_score(y_train,RFC.predict_proba(X_train)[:,1]))
print("Test Dataset")
print("ROC_AUC_SCORE: ",roc_auc_score(y_test,RFC.predict_proba(X_test)[:,1]))

print("\n")

print("     LGBMClassifier")
print("Training Dataset")
print("ROC_AUC_SCORE: ",roc_auc_score(y_train,LGBMC.predict_proba(X_train)[:,1]))
print("Test Dataset")
print("ROC_AUC_SCORE: ",roc_auc_score(y_test,LGBMC.predict_proba(X_test)[:,1]))


<a id="Random-Forest-Classifier"></a>
# Random Forest Classifier

<a id="Random-Forest-Bayesian-Optimization"></a>
## Random Forest Bayesian Optimization

In [None]:
# https://github.com/fmfn/BayesianOptimization
# https://github.com/fmfn/BayesianOptimization/blob/master/bayes_opt/bayesian_optimization.py
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
# https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
# https://tech.ovoenergy.com/bayesian-optimisation/
# https://www.kdnuggets.com/2019/07/xgboost-random-forest-bayesian-optimisation.html
#crash
def search_best_param_rf(X, y):
    def rf_cv(X, y, **kwargs):
        estimator = RandomForestClassifier(**kwargs)
        cval = cross_val_score(
            estimator,
            X,
            y,
            scoring="roc_auc",
            cv=5,
            verbose=0,
            n_jobs=-1,
            error_score=0,
        )
        return cval.mean()

    def rf_crossval(n_estimators, max_depth, min_samples_split, min_samples_leaf):
        return rf_cv(
            X=X,
            y=y,
            n_estimators=int(n_estimators),
            max_depth=int(max(max_depth, 1)),
            min_samples_split=int(max(min_samples_split, 2)),
            min_samples_leaf=int(max(min_samples_leaf, 1)),
        )
    
    RFC_BO_params = {
        "n_estimators": (10, 100),
        "max_depth": (1, 100),
        "min_samples_split": (2, 10),
        "min_samples_leaf": (1, 5),
    }

    RFC_Bo = BayesianOptimization(rf_crossval, 
                                  RFC_BO_params, 
                                  random_state=0, 
                                  verbose=2
                                 )
    np.random.seed(1)
    
    RFC_Bo.maximize(init_points=2, n_iter=2)
    # n_iter: How many steps of bayesian optimization you want to perform. The more steps the more likely to find a good maximum you are.
    # init_points: How many steps of random exploration you want to perform. Random exploration can help by diversifying the exploration space.
    # more iterations more time spent searching 
    
    params_set = RFC_Bo.max['params']
    
    params_set['n_estimators'] = int(round(params_set['n_estimators']))
    params_set['max_depth'] = int(round(params_set['max_depth']))
    params_set['min_samples_split'] = int(round(params_set['min_samples_split']))
    params_set['min_samples_leaf'] = int(round(params_set['min_samples_leaf']))
    
    params_set.update({'n_jobs': -1})
    params_set.update({'random_state': 0})
    
    return params_set
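
One detail in the search function above is worth a sketch. The optimizer proposes real-valued points, while the forest needs integers. During cross validation the values are truncated with `int()`, but the best point is converted with `int(round())` at the end, so the returned settings can differ by one from the settings that were actually scored. A single conversion helper used in both places would avoid that. The helper `to_rf_params` and the proposed point below are hypothetical.

```python
def to_rf_params(n_estimators, max_depth, min_samples_split, min_samples_leaf):
    """Convert real-valued proposals into valid integer forest settings."""
    return {
        "n_estimators": max(int(round(n_estimators)), 1),
        "max_depth": max(int(round(max_depth)), 1),
        "min_samples_split": max(int(round(min_samples_split)), 2),
        "min_samples_leaf": max(int(round(min_samples_leaf)), 1),
    }


# Hypothetical point proposed by the optimizer.
proposal = {"n_estimators": 57.6, "max_depth": 0.4,
            "min_samples_split": 9.7, "min_samples_leaf": 1.2}

rf_params = to_rf_params(**proposal)
print(rf_params)

# Truncating and rounding disagree for the same proposal:
print(int(proposal["min_samples_split"]), rf_params["min_samples_split"])
```

Whether to round or truncate is a design choice; what matters is that the scoring step and the final parameter set use the same rule.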

<a id="Random-Forest-Cross-Validation"></a>
## Random Forest Cross Validation 

In [None]:
# Random Forest Cross Validation

def K_Fold_RandomForest(X_train,y_train, params_set = [], num_folds = 5):
    model_num = 0 # model number 
    models = [] # model list
    folds = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=0) # create folds

        # num_folds times ; default is 5
    for n_fold, (train_idx, valid_idx) in enumerate (folds.split(X_train, y_train)):
        
        print(f"     model{model_num}")
        
        train_X, train_y = X_train.iloc[train_idx], y_train.iloc[train_idx]
        valid_X, valid_y = X_train.iloc[valid_idx], y_train.iloc[valid_idx]
        
        if not params_set: # if params_set is empty
            # search on the first fold's training rows; the result is reused for the remaining folds
            params_set = search_best_param_rf(train_X,train_y) 
        
        # fit RFC based of param_set and current fold
        CV_RF = RandomForestClassifier(**params_set).fit(train_X, train_y)
        
        # append RF model to model list 
        models.append(CV_RF)
        
        # model metrics for current fold, scored on this fold's own rows
        # (the global X_test overlaps the training rows when the full train.csv is passed in)
        print("Training Fold")
        print("ROC_AUC_SCORE: ",roc_auc_score(train_y,CV_RF.predict_proba(train_X)[:,1]))
        print("Validation Fold")
        print("ROC_AUC_SCORE: ",roc_auc_score(valid_y,CV_RF.predict_proba(valid_X)[:,1]))
        print("\n")
        
        model_num = model_num + 1
        
    return models

In [None]:
best_params_rf_cv = search_best_param_rf(X_train_csv_encoded,y_train_csv)

In [None]:
# Print best_params_rf_cv
for key, value in best_params_rf_cv.items():
    print(key, ' : ', value)

In [None]:
rf_models = K_Fold_RandomForest(X_train_csv_encoded,y_train_csv,params_set = best_params_rf_cv,num_folds = 5)

<a id="Random-Forest-CV-Model-Peformance"></a>
## Random Forest CV Model Performance 

In [None]:
# Predict y_preds using models from RFC cross validation 
def predict_models_RFC(models_cv,X):
    y_preds = np.zeros(shape = X.shape[0])
    for model in models_cv:
        y_preds += model.predict_proba(X)[:,1]
        
    return y_preds/len(models_cv)
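
The averaging step can be checked by hand on a tiny case. The sketch below uses a hypothetical `ConstantModel` class in place of fitted classifiers and plain lists instead of NumPy arrays, so it is an illustration of the idea rather than a drop-in replacement for the function above.

```python
class ConstantModel:
    """Hypothetical stand-in for a fitted classifier."""

    def __init__(self, p_positive):
        self.p_positive = p_positive

    def predict_proba(self, rows):
        # One [P(class 0), P(class 1)] pair per input row.
        return [[1.0 - self.p_positive, self.p_positive] for _ in rows]


def average_positive_proba(models, rows):
    """Mean of the class-1 probability across all models, row by row."""
    totals = [0.0] * len(rows)
    for model in models:
        for i, (_, p1) in enumerate(model.predict_proba(rows)):
            totals[i] += p1
    return [total / len(models) for total in totals]


toy_models = [ConstantModel(0.2), ConstantModel(0.4), ConstantModel(0.9)]
toy_rows = [[0], [1]]
blended = average_positive_proba(toy_models, toy_rows)
print(blended)
```

Each fold model gets an equal weight here, which matches dividing the summed probabilities by the number of models.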

In [None]:
# RFC Cross Validation Model Performance
print("     RFC Cross Validation")
print("Total Dataset")
print("ROC_AUC_SCORE: ",roc_auc_score(y_train_csv,predict_models_RFC(rf_models,X_train_csv_encoded)))
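
Note that every row of the total dataset was used to train four of the five fold models, so the score above will tend to look better than the score on unseen data. A common alternative is to collect out-of-fold predictions, where each row is scored only by the model that did not train on it. The sketch below shows the bookkeeping with assumptions made for brevity: a toy model that predicts the mean target of its training rows, and a simple interleaved fold assignment instead of `StratifiedKFold`.

```python
def make_folds(n_rows, n_folds):
    """Assign row i to fold i % n_folds (a simple stand-in for StratifiedKFold)."""
    return [[i for i in range(n_rows) if i % n_folds == k] for k in range(n_folds)]


def fit_mean_model(targets):
    """Hypothetical toy model: always predicts the mean of its training targets."""
    mean = sum(targets) / len(targets)
    return lambda n_rows: [mean] * n_rows


def out_of_fold_predictions(targets, n_folds=5):
    """Predict every row with the one model that was not trained on it."""
    oof = [None] * len(targets)
    for valid_idx in make_folds(len(targets), n_folds):
        held_out = set(valid_idx)
        train_targets = [t for i, t in enumerate(targets) if i not in held_out]
        model = fit_mean_model(train_targets)
        for i, pred in zip(valid_idx, model(len(valid_idx))):
            oof[i] = pred
    return oof


toy_target = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0]
oof = out_of_fold_predictions(toy_target, n_folds=5)
print(oof)
```

With real models, passing the out-of-fold vector and the true target to `roc_auc_score` would give a validation estimate that is not inflated by rows the models have already seen.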

<a id="LightGBM-Classifier"></a>
# LightGBM Classifier

<a id="Bayesian-Optimization"></a>
## Bayesian Optimization

In [None]:
##% parameter tuning for lightgbm 
# store the categorical feature names as a list      
cat_features = X_train_test.select_dtypes(['object']).columns.to_list()
# print(cat_features)

# Create the LightGBM data containers
# Make sure that cat_features are used
train_lgbdata=lgb.Dataset(X_train,label=y_train, categorical_feature = cat_features,free_raw_data=False)
test_lgbdata=lgb.Dataset(X_test,label=y_test, categorical_feature = cat_features,free_raw_data=False)

In [None]:
# https://github.com/fmfn/BayesianOptimization
def search_best_param(X,y,cat_features):
    
    trainXY = lgb.Dataset(data=X, label=y,categorical_feature = cat_features,free_raw_data=False)
    # define the lightGBM cross validation
    def lightGBM_CV(max_depth, num_leaves, n_estimators, learning_rate, subsample, colsample_bytree, 
                lambda_l1, lambda_l2, min_child_weight):
    
        params = {'boosting_type': 'gbdt', 'objective': 'binary', 'metric':'auc', 'verbose': -1,
                  'early_stopping_round':100}
        
        params['max_depth'] = int(round(max_depth))
        params["num_leaves"] = int(round(num_leaves))
        params["n_estimators"] = int(round(n_estimators))
        params['learning_rate'] = learning_rate
        params['subsample'] = subsample
        params['colsample_bytree'] = colsample_bytree
        params['lambda_l1'] = max(lambda_l1, 0)
        params['lambda_l2'] = max(lambda_l2, 0)
        params['min_child_weight'] = min_child_weight
    
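        score = lgb.cv(params, trainXY, nfold=5, seed=1, stratified=True, metrics=['auc'])
        # the result key is 'auc-mean' in older LightGBM releases and 'valid auc-mean' in newer ones
        auc_key = 'valid auc-mean' if 'valid auc-mean' in score else 'auc-mean'
        return np.mean(score[auc_key]) # maximize auc-mean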

    # use bayesian optimization to search for the best hyper-parameter combination
    lightGBM_Bo = BayesianOptimization(lightGBM_CV, 
                                       {
                                          'max_depth': (5, 50),
                                          'num_leaves': (20, 100),
                                          'n_estimators': (50, 500),
                                          'learning_rate': (0.01, 0.3),
                                          'subsample': (0.7, 0.8),
                                          'colsample_bytree' :(0.5, 0.99),
                                          'lambda_l1': (0, 5),
                                          'lambda_l2': (0, 3),
                                          'min_child_weight': (2, 50) 
                                      },
                                       random_state = 1,
                                       verbose = 1
                                      )
    np.random.seed(1)
    
    lightGBM_Bo.maximize(init_points= 2, n_iter=2) # 2 + 2, 4 iterations 
    # n_iter: How many steps of bayesian optimization you want to perform. The more steps the more likely to find a good maximum you are.
    # init_points: How many steps of random exploration you want to perform. Random exploration can help by diversifying the exploration space.
    # more iterations more time spent searching 
    
    params_set = lightGBM_Bo.max['params']
    
    # get the params of the maximum target     
    max_target = -np.inf
    for i in lightGBM_Bo.res: # loop through all evaluated points (res holds results, not residuals)
        if i['target'] > max_target:
            params_set = i['params']
            max_target = i['target']
    
    params_set.update({'verbose': -1})
    params_set.update({'metric': 'auc'})
    params_set.update({'boosting_type': 'gbdt'})
    params_set.update({'objective': 'binary'})
    
    params_set['max_depth'] = int(round(params_set['max_depth']))
    params_set['num_leaves'] = int(round(params_set['num_leaves']))
    params_set['n_estimators'] = int(round(params_set['n_estimators']))
    params_set['seed'] = 1 #set seed
    
    return params_set

In [None]:
best_params = search_best_param(X_train,y_train,cat_features)

In [None]:
# Print best_params
for key, value in best_params.items():
    print(key, ' : ', value)

<a id="Tuning-LightGBM"></a>
## Tuning LightGBM

In [None]:
# Train lgbm_best using the best params found from Bayesian Optimization
lgbm_best = lgb.train(best_params,
                 train_lgbdata,
                 num_boost_round = 100,
                 valid_sets = test_lgbdata,
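                 # callback form of early stopping and logging; assumes LightGBM >= 3.3,
                 # since newer releases no longer accept early_stopping_rounds / verbose_eval here
                 callbacks = [lgb.early_stopping(100), lgb.log_evaluation(50)]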
                 )
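
Early stopping halts training once the validation metric has not improved for a set number of rounds (the patience). A patience of 100 can only trigger if training runs for more than 100 rounds, so with a 100-round budget it never stops early. The sketch below shows the stopping rule on a made-up AUC curve; `best_round_with_patience` and the scores are hypothetical and only mimic the general idea, not LightGBM's internals.

```python
def best_round_with_patience(valid_scores, patience):
    """Return (last_round_run, best_round) for a higher-is-better metric, 1-indexed."""
    best_score = float("-inf")
    best_round = 0
    for round_no, score in enumerate(valid_scores, start=1):
        if score > best_score:
            best_score, best_round = score, round_no
        elif round_no - best_round >= patience:
            return round_no, best_round   # stopped early
    return len(valid_scores), best_round  # ran every round


# Hypothetical validation AUC per boosting round.
toy_auc = [0.80, 0.85, 0.88, 0.87, 0.86, 0.86, 0.85]

short_patience = best_round_with_patience(toy_auc, patience=2)
long_patience = best_round_with_patience(toy_auc, patience=100)
print(short_patience)  # stops before the last round
print(long_patience)   # runs every round
```

A patience that is small relative to the total number of rounds is what lets the validation set decide how many trees to keep.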

<a id="LightGBM-Model-Peformance"></a>
## LightGBM Model Performance 

In [None]:
print("     LGBM Tuned")
print("Training Dataset")
print("ROC_AUC_SCORE: ",roc_auc_score(y_train,lgbm_best.predict(X_train)))
print("Test Dataset")
print("ROC_AUC_SCORE: ",roc_auc_score(y_test,lgbm_best.predict(X_test)))

<a id="Feature-Importance"></a>
## Feature Importance 

In [None]:
##% Feature Importance 
# https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html
lgb.plot_importance(lgbm_best,figsize=(25,20),max_num_features = 10)

In [None]:
##% Feature Importance using shap package 
# import shap
shap_values = shap.TreeExplainer(lgbm_best).shap_values(X_test[:2500])
shap.summary_plot(shap_values, X_test[:2500])

<a id="Cross-Validation"></a>
## Cross Validation 

In [None]:
# Cross Validation with LightGBM

def K_Fold_LightGBM(X_train, y_train , cat_features, num_folds = 5, params_set = []):
    num = 0 # model number
    models = [] # list of models 
    folds = StratifiedKFold(n_splits=num_folds, shuffle=True, random_state=0) # create folds 

        # num_folds times 
    for n_fold, (train_idx, valid_idx) in enumerate (folds.split(X_train, y_train)):
        
        print(f"     model{num}")
        train_X, train_y = X_train.iloc[train_idx], y_train.iloc[train_idx]
        valid_X, valid_y = X_train.iloc[valid_idx], y_train.iloc[valid_idx]
        
        train_data=lgb.Dataset(train_X,label=train_y, categorical_feature = cat_features,free_raw_data=False)
        valid_data=lgb.Dataset(valid_X,label=valid_y, categorical_feature = cat_features,free_raw_data=False)
        
        
        # params_set = search_best_param(train_X,train_y,cat_features) # find best param_set in each fold
        
        CV_LGBM = lgb.train(params_set,
                            train_data,
                            num_boost_round = 100,
                            valid_sets = valid_data,
                            # callback form of early stopping and logging; assumes LightGBM >= 3.3
                            callbacks = [lgb.early_stopping(100), lgb.log_evaluation(50)]
                           )
        # increasing the early stopping patience can lead to overfitting 
        
        # append LGBM model to models list 
        models.append(CV_LGBM)
        
        # model metrics for each fold, scored on this fold's own rows
        # (the global X_test overlaps the training rows when the full train.csv is passed in)
        print("Training Fold")
        print("ROC_AUC_SCORE: ",roc_auc_score(train_y,CV_LGBM.predict(train_X)))
        print("Validation Fold")
        print("ROC_AUC_SCORE: ",roc_auc_score(valid_y,CV_LGBM.predict(valid_X)))
        print("\n")
        
        num = num + 1
        
    return models

In [None]:
best_params_cv = search_best_param(X_train_csv_encoded,y_train_csv,cat_features)

In [None]:
lgbm_models = K_Fold_LightGBM(X_train_csv_encoded,y_train_csv,cat_features,5,params_set = best_params_cv)

<a id="LightGBM-CV-Model-Peformance"></a>
## LightGBM CV Model Performance 

In [None]:
# Predict y_preds using models from cross validation 
def predict_models_LGBM(models_cv,X):
    y_preds = np.zeros(shape = X.shape[0])
    for model in models_cv:
        y_preds += model.predict(X)
        
    return y_preds/len(models_cv)

In [None]:
# LightGBM Cross Validation Model Performance
print("     LGBM Cross Validation")
print("Total Dataset")
print("ROC_AUC_SCORE: ",roc_auc_score(y_train_csv,predict_models_LGBM(lgbm_models,X_train_csv_encoded)))

<a id="Prediction-for-Test.csv"></a>
# Prediction for Test.csv

In [None]:
# Prediction for Test.csv using LightGBM CV 
predictLGBM = predict_models_LGBM(lgbm_models,X_test_csv_encoded) 

submissionLGBM = pd.DataFrame({'id':test_csv_id,'target':predictLGBM})

display(submissionLGBM.head())

# Prediction for Test.csv using RFC CV 
predictRFC = predict_models_RFC(rf_models,X_test_csv_encoded) 

submissionRFC = pd.DataFrame({'id':test_csv_id,'target':predictRFC})

display(submissionRFC.head())

In [None]:
#% Submit Predictions 
submissionLGBM.to_csv('submissionCV_LGBM4.csv',index=False)
submissionRFC.to_csv('submissionCV_RFC4.csv',index=False)

<a id="Conclusion"></a>
# Conclusion


**Conclusion**
* LightGBM is a great ML algorithm that handles categorical features and missing values 
* Cross Validation is useful for detecting and limiting overfitting 
* Bayesian Optimization is helpful for finding hyperparameters once an initial model has been built 
* This is a great dataset to work on, and a lot of knowledge can be gained from working with it 
* Researching and reading other Kaggle notebooks is essential for becoming a better data scientist

**Challenges**
* Due to the size of the dataset my algorithms took a while to run 
* Overfitting might have occurred which reduces the model performance on the test set

**Closing Remarks**  
* Please comment and like the notebook if it is of use to you! Have a wonderful year! 


**Other Notebooks** 
* [https://www.kaggle.com/josephchan524/studentperformanceregressor-rmse-12-26-r2-0-26](https://www.kaggle.com/josephchan524/studentperformanceregressor-rmse-12-26-r2-0-26)
* [https://www.kaggle.com/josephchan524/bankchurnersclassifier-recall-97-accuracy-95](https://www.kaggle.com/josephchan524/bankchurnersclassifier-recall-97-accuracy-95)
* [https://www.kaggle.com/josephchan524/housepricesregressor-using-lightgbm](https://www.kaggle.com/josephchan524/housepricesregressor-using-lightgbm)
* [https://www.kaggle.com/josephchan524/tabularplaygroundregressor-using-lightgbm-feb2021](https://www.kaggle.com/josephchan524/tabularplaygroundregressor-using-lightgbm-feb2021)


3-12-2020
Joseph Chan 