##### Grading Feedback

# IST 718: Big Data Analytics

- Professor: Willard Williamson <wewillia@syr.edu>
- Faculty Assistant: Vidushi Mishra <vmishr01@syr.edu>
- Faculty Assistant: Pranav Kottoli Radhakrishna <pkottoli@syr.edu>
## General instructions:

- You are welcome to discuss the problems with your classmates but __you are not allowed to copy any part of your answers from your classmates.  Short code snippets are allowed from the internet.  Code from the class text books or class provided code can be copied in its entirety.__
- Do not modify cells marked as grading cells or marked as do not modify.
- Before submitting your work, remember to check for run time errors with the following procedure:
Runtime $\rightarrow$ Factory reset runtime followed by Runtime $\rightarrow$ Run All.  All runtime errors will result in a minimum penalty of half off.
- Google Colab is the official class runtime environment so you should test your code on Colab before submission.
- All plots shall include descriptive title and axis labels.  Plot legends shall be included where possible.  Unless stated otherwise, plots can be made using any Python plotting package.  It is understood that spark data structures must be converted to something like numpy or pandas prior to making plots.  All required mathematical operations, filtering, selection, etc., required by a homework question shall be performed in spark prior to converting to numpy or pandas.
- Grading feedback cells are there for graders to provide feedback to students.  Don't change or remove grading feedback cells.
- Don't add or remove files from your git repo.
- Do not change file names in your repo.  This also means don't change the title of the ipython notebook.
- You are free to add additional code cells around the cells marked `your code here`.
- We reserve the right to take points off for operations that are extremely inefficient or "heavy weight".  This is a big data class and extremely inefficient operations make a big difference when scaling up to large data sets.  For example, the spark dataframe collect() method is a very heavy weight operation and should not be used unless there is a real need for it.  An example where collect() might be needed is to get ready to make a plot after filtering a spark dataframe.
- import * is not allowed because it is considered a very bad coding practice and in some cases can result in a significant delay (which slows down the grading process) in loading imports.  For example, the statement `from sympy import *` is not allowed.  You must import the specific packages that you need. 
- The graders reserve the right to deduct points for subjective things we see with your code.  For example, if we ask you to create a pandas data frame to display values from an investigation and you hard code the values, we will take points off for that.  This is only one of many different things we could find in reviewing your code.  In general, write your code like you are submitting it for a code peer review in industry.  
- Level of effort is part of our subjective grading.  For example, in cases where we ask for a more open ended investigation, some students put in significant effort and some students do the minimum possible to meet requirements.  In these cases, we may take points off for students who did not put in much effort as compared to students who put in a lot of effort.  We feel that the students who did a better job deserve a better grade.  We reserve the right to invoke level of effort grading at any time.
- Only use spark, spark machine learning, spark data frames, RDD's, and map reduce to solve all problems unless instructed otherwise.
- Your notebook must run from start to finish without requiring manual input by the graders.  For example, do not mount your personal Google drive in your notebook as this will require graders to perform manual steps.  In short, your notebook should run from start to finish with no runtime errors and no need for graders to perform any manual steps.


## Note that this notebook is expected to run in the Google Colab environment.  All grading for this assignment will take place exclusively in Google Colab.

This homework proves that diamonds are forever.  In homework 3, we used linear regression to predict diamond prices and evaluated model performance using MSE as the scoring metric.  In this homework, we are going to use the same diamonds data set but this time use decision trees and deep learning to see if we can improve upon the linear regression performance from homework 3.
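
For reference, the MSE metric mentioned above can be sketched in plain Python. This is only an illustration: the function name `mean_squared_error` and the `labels` and `preds` values are made up for the example, and the notebook itself computes MSE in Spark with `RegressionEvaluator`.

```python
def mean_squared_error(labels, preds):
    """Average of the squared differences between true and predicted values."""
    if len(labels) != len(preds) or not labels:
        raise ValueError("inputs must be non-empty and have the same length")
    return sum((y - p) ** 2 for y, p in zip(labels, preds)) / len(labels)


# Hypothetical prices and predictions, for illustration only.
labels = [326.0, 327.0, 334.0]
preds = [330.0, 320.0, 334.0]

mse = mean_squared_error(labels, preds)
print(mse)
```

Because the errors are squared, MSE is expressed in squared price units and a few large misses dominate the score.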

# Diamonds Data
Just to prove that diamonds are forever, we are going to revisit the diamonds data set.  This homework assignment will use the diamonds dataset to explore random forest decision tree models.

The diamonds.csv data set contains 10 columns:
- carat: Carat weight of the diamond
- cut: Describes cut quality of the diamond. Quality in increasing order Fair, Good, Very Good, Premium, Ideal
- color: Color of the diamond, with D being the best and J the worst
- clarity: How obvious inclusions are within the diamond, in order from best to worst (FL = flawless, I3 = level 3 inclusions): FL, IF, VVS1, etc.  See this web site for an exhaustive ranking of [clarity](https://4cs.gia.edu/en-us/diamond-clarity/?gclid=Cj0KCQjwnqH7BRDdARIsACTSAduMoc2KQbXkO94BxCfBNC5X8YyjAYcFpWThKQMW46cQj_3p0pZ0o84aAuagEALw_wcB).  The web site has a nice sliding scale you can drag to see the relationship between clarity grades.
- depth: depth % - The height of a diamond, measured from the culet to the table, divided by its average girdle diameter
- table: table% -  The width of the diamond's table expressed as a percentage of its average diameter
- price: The price of the diamond
- x: Length (mm)
- y: Width (mm)
- z: Height (mm)
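
Since cut, color, and clarity are ordered categories, one way to prepare them for tree models is to map each label to its quality rank. The sketch below does this in plain Python; the helper `ordinal_index` is hypothetical, and the worst-to-best orderings (in particular the subset of clarity grades) are assumptions consistent with the column descriptions above.

```python
# Label orderings from worst to best (assumed for this illustration).
CUT_ORDER = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
COLOR_ORDER = ["J", "I", "H", "G", "F", "E", "D"]
CLARITY_ORDER = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]


def ordinal_index(order):
    """Map each label to its rank, starting at 0 for the worst grade."""
    return {label: rank for rank, label in enumerate(order)}


encoders = {
    "cut": ordinal_index(CUT_ORDER),
    "color": ordinal_index(COLOR_ORDER),
    "clarity": ordinal_index(CLARITY_ORDER),
}

# One hypothetical diamond described by its categorical columns.
row = {"cut": "Ideal", "color": "E", "clarity": "SI2"}
encoded = {column: encoders[column][value] for column, value in row.items()}
print(encoded)
```

With a rank encoding, a single threshold on the encoded value separates lower grades from higher grades.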

In [None]:
# Grading Cell
enable_grid_search = False

The following cell is used to read the diamonds data set into the colab environment.  Do not change or modify the following cell.

In [None]:
%%bash
# Do not change or modify this file
# Need to install pyspark
# if pyspark is already installed, will print a message indicating pyspark already installed
pip install pyspark

# Download the data files from github
# If the data file does not exist in the colab environment
if [[ ! -f ./diamonds.csv ]]; then 
   # download the data file from github and save it in this colab environment instance
   wget https://raw.githubusercontent.com/wewilli1/ist718_data/master/diamonds.csv  
fi

Collecting pyspark
  Downloading https://files.pythonhosted.org/packages/f0/26/198fc8c0b98580f617cb03cb298c6056587b8f0447e20fa40c5b634ced77/pyspark-3.0.1.tar.gz (204.2MB)
Collecting py4j==0.10.9
  Downloading https://files.pythonhosted.org/packages/9e/b6/6a4fb90cd235dc8e265a6a2067f2a2c99f0d91787f06aca4bcf7c23f3f80/py4j-0.10.9-py2.py3-none-any.whl (198kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py): started
  Building wheel for pyspark (setup.py): finished with status 'done'
  Created wheel for pyspark: filename=pyspark-3.0.1-py2.py3-none-any.whl size=204612243 sha256=376990ba05faf4b183fe17fe5c9518c28d76b0cbbb93037b21d94a6be106603f
  Stored in directory: /root/.cache/pip/wheels/5e/bd/07/031766ca628adec8435bb40f0bd83bb676ce65ff4007f8e73f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9 pyspark-3.0.1


--2020-11-24 03:37:53--  https://raw.githubusercontent.com/wewilli1/ist718_data/master/diamonds.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3192560 (3.0M) [text/plain]
Saving to: ‘diamonds.csv’


# Question 0 (0 pts)
Please provide the following data so we can easily correlate your notebook with the grade book:
- Your Name: Wanyue Xiao
- Your github user name: xwanyue0221
- Your SU email address: xwanyue@syr.edu

Your grade for grid search problems in this assignment will be determined in part on level of effort and your model performance results as compared to other students in the class.

# Question 1 (10 pts)
Read the diamonds.csv file into a spark data frame named `diamonds_df`.  Perform feature engineering as needed for training decision trees.  Name the new data frame diamonds_df_xformed.

In [None]:
# your code here
%matplotlib inline
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
spark = SparkSession \
  .builder \
  .master("local[*]")\
  .config("spark.memory.fraction", 0.8) \
  .config("spark.executor.memory", "12g") \
  .config("spark.driver.memory", "12g")\
  .config("spark.memory.offHeap.enabled",'true')\
  .config("spark.memory.offHeap.size","12g")\
  .getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)
import os

from pyspark.ml import Pipeline
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.feature import StringIndexer, VectorAssembler, StringIndexerModel
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator

# read the csv file
diamonds_df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("./diamonds.csv")
# drop the unnamed column
diamonds_df = diamonds_df.drop('_c0')
categoricalColumns = ['cut', 'color', 'clarity']
numericCols = ['carat', 'depth', 'table', 'x', 'y', 'z']

# explicit label orderings (worst to best) so that each index reflects the quality rank
cut_list = ['Fair', 'Good', 'Very Good', 'Premium', 'Ideal']
color_list = ['J','I','H','G','F','E','D']
clarity_list = ['I1','SI2','SI1','VS2','VS1','VVS2','VVS1','IF']
# convert each categorical column into an ordinal numeric column with a StringIndexerModel built from its ordered labels
pipe_1 = Pipeline(stages=[StringIndexerModel.from_labels(cut_list, inputCol="cut", outputCol="cut_idx")])
pipe_2 = Pipeline(stages=[StringIndexerModel.from_labels(color_list, inputCol="color", outputCol="color_idx")])
pipe_3 = Pipeline(stages=[StringIndexerModel.from_labels(clarity_list, inputCol="clarity", outputCol="clarity_idx")])
# combine and encapsulate all the transformation codes into one pipeline
feature_engineering_pipe = Pipeline(stages=[pipe_1, pipe_2, pipe_3])
result = feature_engineering_pipe.fit(diamonds_df).transform(diamonds_df)
# drop the original categorical columns and change the column names
result = result.drop(*categoricalColumns)
diamonds_df_xformed = result.toDF(*(c.replace('_idx', '') for c in result.columns))

# vector assembler
prepro_pipe = Pipeline(stages=[VectorAssembler(inputCols=diamonds_df_xformed.drop('price').columns, outputCol='features')])
diamonds_df_xformed = prepro_pipe.fit(diamonds_df_xformed).transform(diamonds_df_xformed)

# get train and test subset
train, test = diamonds_df_xformed.randomSplit([0.7, 0.3])

In [None]:
# Grading Cell - do not modify
display(diamonds_df_xformed.toPandas().head())

| | carat | depth | table | price | x | y | z | cut | color | clarity | features |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.23 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 | 4.0 | 5.0 | 1.0 | [0.23, 61.5, 55.0, 3.95, 3.98, 2.43, 4.0, 5.0,... |
| 1 | 0.21 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 | 3.0 | 5.0 | 2.0 | [0.21, 59.8, 61.0, 3.89, 3.84, 2.31, 3.0, 5.0,... |
| 2 | 0.23 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 | 1.0 | 5.0 | 4.0 | [0.23, 56.9, 65.0, 4.05, 4.07, 2.31, 1.0, 5.0,... |
| 3 | 0.29 | 62.4 | 58.0 | 334 | 4.2 | 4.23 | 2.63 | 3.0 | 1.0 | 3.0 | [0.29, 62.4, 58.0, 4.2, 4.23, 2.63, 3.0, 1.0, ... |
| 4 | 0.31 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 | 1.0 | 0.0 | 1.0 | [0.31, 63.3, 58.0, 4.34, 4.35, 2.75, 1.0, 0.0,... |


##### Grading Feedback Cell

The following questions will create a random forest regressor model, train the model using a grid search, and use the model for inference.  The goal is to see if we can improve upon the linear regression score from homework 3. You can find the spark documentation for the random forest regressor [here](https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-regression).

# Question 2 (20 pts)
Create and train your random forest regressor model using a grid search in the cell below.  You are free to use K-Fold Cross validation if you wish.  Your grid search must be entirely encapsulated in the `if enable_grid_search` if statement.  The `enable_grid_search` Boolean is defined in a grading cell above.  You will disable the grid search before you submit by setting enable_grid_search to false.  Setting enable_grid_search to false should not result in a runtime error.  You will not receive full credit if any part of your grid search is outside of the if statement or if runtime errors result from setting the `enable_grid_search` variable to false.
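
The cost of a grid search and the guard pattern described above can be sketched without Spark. The grid values below are hypothetical and only illustrate how the number of model fits grows; `best_params` is an illustrative name that is defined outside the guard so nothing later fails when the flag is false.

```python
from itertools import product

enable_grid_search = False  # plays the same role as the grading-cell flag

# Hypothetical grid; the values are illustrative, not a recommendation.
param_grid = {
    "maxDepth": [5, 10],
    "maxBins": [32, 64],
    "numTrees": [20, 50],
}
num_folds = 3

# Every combination of parameter values is one candidate model.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

# With K-fold cross validation, each candidate is trained once per fold.
total_fits = len(candidates) * num_folds

best_params = None  # defined outside the guard so later code never hits a NameError
if enable_grid_search:
    # The expensive search would run here and overwrite best_params.
    best_params = candidates[0]

print(len(candidates), total_fits)
```

Adding one more value to any parameter multiplies the number of fits, which is why the search is kept behind the flag during grading.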

In [None]:
# your code here

# initialize the random forest regressor (an estimator) and the MSE evaluator
rf = RandomForestRegressor(featuresCol = 'features', labelCol = 'price', minInstancesPerNode = 30)
evaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction", metricName="mse")

if enable_grid_search:
    # building parameter grid 
    paramGrid = (ParamGridBuilder()
                 .addGrid(rf.maxDepth, [15, 20, 30])
                 .addGrid(rf.maxBins, [30, 50])
                 .addGrid(rf.numTrees, [20, 30, 50]).build())
    
    cv = CrossValidator(estimator=rf, 
                        estimatorParamMaps=paramGrid, 
                        evaluator=evaluator, 
                        numFolds=3)
    
    cvModel = cv.fit(train)
    predictions = cvModel.bestModel.transform(test)
    
    print("mse", evaluator.evaluate(predictions))
    print('numTrees - ', cvModel.bestModel.getNumTrees)
    print('maxDepth - ', cvModel.bestModel.getOrDefault('maxDepth'))
    print('maxBins - ', cvModel.bestModel.getOrDefault('maxBins'))

mse 451312.24664500024
numTrees -  30
maxDepth -  30
maxBins -  50


##### Grading Feedback Cell

# Question 3 (20 pts)
Create a pipeline named `best_pipe` that hard codes the tuning parameters from the best model found by the grid search in question 2 above.  Train and test best_pipe.  Do not use k-fold cross validation in question 3.  Clearly print the resulting train and test MSE for best_pipe so it's easy for the graders to see your resulting MSEs.

In [None]:
# Your code here
# best_pipe: random forest pipeline with hard coded tuning parameters
best_pipe = Pipeline(stages=[RandomForestRegressor(featuresCol = 'features', labelCol = 'price', 
                                                   maxBins = 50, numTrees = 30, maxDepth = 20, minInstancesPerNode = 30)])
rfBestModel = best_pipe.fit(train)
rfBestPredictions = rfBestModel.transform(test)

print("MSE for training dataset: " + str(evaluator.evaluate(rfBestModel.transform(train))))
print("MSE for testing dataset: " + str(evaluator.evaluate(rfBestPredictions)))

MSE for training dataset: 435798.26959282585
MSE for testing dataset: 462100.0243813725


##### Grading Feedback Cell

# Question 4 (20 pts)
Use your best_pipe pipeline in question 3 for inference.  Create a pandas data frame named `rf_feature_importance` which contains 2 columns: `feature`, and `importance`.  Load the feature column with the feature name and the importance column with the feature importance score as determined by the random forest model. Sort the feature importances from high to low such that the most important feature is in the first row of the data frame.
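
The pairing and sorting step can be sketched in plain Python before moving it into pandas. The feature names and importance scores below are hypothetical; in the notebook they would come from the assembler input columns and the fitted model.

```python
# Hypothetical names and scores, listed in the same (assumed) order as the feature vector.
feature_names = ["carat", "depth", "table", "x"]
importances = [0.55, 0.05, 0.02, 0.38]

# Pair each name with its score, then sort from most to least important.
ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)
rows = [{"feature": name, "importance": score} for name, score in ranked]

for row in rows:
    print(row["feature"], row["importance"])
```

The same result in pandas is a two-column `DataFrame` followed by `sort_values('importance', ascending=False)`. The pairing is only valid if the name list follows the order in which the columns were assembled into the feature vector.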

In [None]:
# your code here
import pandas as pd

rf_model = rfBestModel.stages[-1]
importances = rf_model.featureImportances.toArray()
# the assembled 'features' column is not itself a predictor, so drop it along with the label
feature_list = diamonds_df_xformed.drop('price', 'features').columns

# create a pandas dataframe sorted from most to least important
rf_feature_importance = pd.DataFrame(list(zip(feature_list, importances)), columns =['feature', 'importance']).sort_values('importance', ascending = False)

In [None]:
# grading cell - do not modify
display(rf_feature_importance)

| | feature | importance |
|---|---|---|
| 0 | carat | 0.317879 |
| 4 | y | 0.265153 |
| 3 | x | 0.214129 |
| 5 | z | 0.117791 |
| 8 | clarity | 0.05151 |
| 7 | color | 0.026332 |
| 6 | cut | 0.002638 |
| 1 | depth | 0.002383 |
| 2 | table | 0.002185 |


##### Grading Feedback Cell

# Question 5 (20 pts)
Write code to print the decision logic for any of the trees in the forest from the best_pipe pipeline.  Copy the printed decision text to the tree printout markdown cell below and retain the same formatting and indentation as the code printout so it's easy for the graders to view the data.  You need to double click the "Your Decision Tree Print Out Here" markdown cell and paste your output inside the two sets of triple quotes. The triple quotes are jupyter markdown indicating you want to present code.  Essentially, replace the text inside the triple quotes with your tree printout.  Solutions that do not maintain readable formatting will not receive full credit.

Add comments to the markdown cell below describing how the root node is split:  Describe 2 things in the markdown cell.  1) What specific predictor variable is being split and what is the value that determines the left / right split.  2) We need you to paste the tree decision logic output from your run in the markdown cell because the top level split may change from run to run.  If the graders run your notebook, the top level split for the tree may be different than the top level split from when you made the run.  Describe why the top level predictor changes from run to run.
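
The tree printout refers to predictors by position (`feature 0`, `feature 4`, and so on), so identifying the split variable means mapping indices back to column names. A minimal sketch follows, assuming the feature vector was assembled in the column order shown in the `diamonds_df_xformed` output; the helper `name_features` is hypothetical.

```python
import re

# Assumed to match the order of the columns assembled into the feature vector.
feature_names = ["carat", "depth", "table", "x", "y", "z", "cut", "color", "clarity"]


def name_features(debug_text, names):
    """Replace every 'feature N' in the printout with the N-th column name."""
    return re.sub(r"feature (\d+)", lambda match: names[int(match.group(1))], debug_text)


# Two lines in the style of the tree printout.
sample = "If (feature 0 <= 0.975)\n Else (feature 4 > 5.005)"
readable = name_features(sample, feature_names)
print(readable)
```

If the assembler input order differs from the assumed list, the names will be wrong, so the list should be taken from the same place the features were assembled.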


In [None]:
# your code here
# a forest with default parameters (maxDepth = 5) keeps the printed tree short enough to read
rf_default = Pipeline(stages=[RandomForestRegressor(featuresCol = 'features', labelCol = 'price')]).fit(train).stages[-1]
print(rf_default.trees[0].toDebugString)

 ```
Your Decision Tree Print Out Here - 

DecisionTreeRegressionModel: uid=dtr_0fcbdc6c42b5, depth=5, numNodes=63, numFeatures=9
  If (feature 0 <= 0.975)
   If (feature 0 <= 0.605)
    If (feature 4 <= 5.005)
     If (feature 0 <= 0.355)
      If (feature 3 <= 4.305)
       Predict: 591.3299015219337
      Else (feature 3 > 4.305)
       Predict: 742.0420976229359
     Else (feature 0 > 0.355)
      If (feature 3 <= 4.755)
       Predict: 870.8565436241611
      Else (feature 3 > 4.755)
       Predict: 1031.3978240302743
    Else (feature 4 > 5.005)
     If (feature 8 in {0.0,1.0,2.0,3.0,4.0})
      If (feature 0 <= 0.525)
       Predict: 1464.4464864864865
      Else (feature 0 > 0.525)
       Predict: 1615.2197855750487
     Else (feature 8 not in {0.0,1.0,2.0,3.0,4.0})
      If (feature 7 in {0.0,1.0,2.0,3.0})
       Predict: 1925.928813559322
      Else (feature 7 not in {0.0,1.0,2.0,3.0})
       Predict: 2483.018390804598
   Else (feature 0 > 0.605)
    If (feature 4 <= 5.955)
     If (feature 7 in {0.0,1.0,2.0})
      If (feature 4 <= 5.665)
       Predict: 1955.8513931888544
      Else (feature 4 > 5.665)
       Predict: 2401.265097236438
     Else (feature 7 not in {0.0,1.0,2.0})
      If (feature 0 <= 0.715)
       Predict: 2637.2137698603756
      Else (feature 0 > 0.715)
       Predict: 2963.474683544304
    Else (feature 4 > 5.955)
     If (feature 4 <= 6.135)
      If (feature 3 <= 5.955)
       Predict: 3170.4883720930234
      Else (feature 3 > 5.955)
       Predict: 3482.8038379530917
     Else (feature 4 > 6.135)
      If (feature 6 in {0.0,1.0,3.0})
       Predict: 3802.2535211267605
      Else (feature 6 not in {0.0,1.0,3.0})
       Predict: 4182.627352572145
  Else (feature 0 > 0.975)
   If (feature 0 <= 1.505)
    If (feature 8 in {0.0,1.0,2.0})
     If (feature 8 in {0.0,1.0})
      If (feature 4 <= 6.985)
       Predict: 4384.502228163993
      Else (feature 4 > 6.985)
       Predict: 6683.794871794872
     Else (feature 8 not in {0.0,1.0})
      If (feature 5 <= 4.305)
       Predict: 5211.367283950617
      Else (feature 5 > 4.305)
       Predict: 8525.851449275362
    Else (feature 8 not in {0.0,1.0,2.0})
     If (feature 8 in {3.0,4.0})
      If (feature 4 <= 6.985)
       Predict: 6641.082848837209
      Else (feature 4 > 6.985)
       Predict: 10076.048327137547
     Else (feature 8 not in {3.0,4.0})
      If (feature 5 <= 4.0649999999999995)
       Predict: 9087.389432485323
      Else (feature 5 > 4.0649999999999995)
       Predict: 10829.599670510708
   Else (feature 0 > 1.505)
    If (feature 3 <= 7.835)
     If (feature 6 in {0.0,1.0,2.0,3.0})
      If (feature 8 in {0.0,1.0})
       Predict: 8534.397163120568
      Else (feature 8 not in {0.0,1.0})
       Predict: 11535.609347442682
     Else (feature 6 not in {0.0,1.0,2.0,3.0})
      If (feature 5 <= 4.8149999999999995)
       Predict: 11967.138436482084
      Else (feature 5 > 4.8149999999999995)
       Predict: 14996.153846153846
    Else (feature 3 > 7.835)
     If (feature 1 <= 63.849999999999994)
      If (feature 7 in {0.0})
       Predict: 14051.72340425532
      Else (feature 7 not in {0.0})
       Predict: 15275.223468507334
     Else (feature 1 > 63.849999999999994)
      If (feature 3 <= 8.295)
       Predict: 11624.786666666667
      Else (feature 3 > 8.295)
       Predict: 14696.733333333334

 ```

Your explanation here:
<br>
The first predictor (feature 0) is `carat`, and the value that determines the left / right split is 0.975: rows with carat <= 0.975 go to the left branch and rows with carat > 0.975 go to the right branch. A random forest is an ensemble model that builds a 'forest' of many decision 'trees', with the number of trees defined by the user. Each tree is grown on the training data set from a random sample of the rows, and at each split only a random subset of the features is considered. Because this is a regression problem, the final prediction returned by the model is the average of the predictions made by the individual trees. Since the row sampling and the feature sampling are both random, each tree is built from a different subset of observations and attributes every time the notebook is run. Therefore, the top level predictor and its split value can change from run to run.
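
The two sources of randomness described above can be sketched with Python's `random` module instead of Spark. The helper names `bootstrap_indices` and `feature_subset` are hypothetical, and choosing 3 of 9 features per split is an assumption made for illustration.

```python
import random


def bootstrap_indices(n_rows, rng):
    """Sample n_rows row indices with replacement, as bagging does for each tree."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def feature_subset(n_features, k, rng):
    """Pick k candidate features without replacement for one split."""
    return sorted(rng.sample(range(n_features), k))


def draw(rng):
    """One bootstrap sample of 10 rows plus one feature subset of size 3."""
    return bootstrap_indices(10, rng), feature_subset(9, 3, rng)


# Two generators with the same seed produce exactly the same draws.
first = draw(random.Random(42))
second = draw(random.Random(42))

# An unseeded generator is initialized from system entropy, so it changes between runs.
unseeded = draw(random.Random())

print(first == second)
print(unseeded)
```

In Spark, `RandomForestRegressor` has a `seed` parameter and `randomSplit` accepts an optional seed, so fixing both is one way to make a run repeatable.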

# Question 6 (5 pts)
Describe if the random forest model MSE score was better or worse than the MSE score from your best model in homework 3.  Include both scores in your description.

Your improvement explanation here:  
<br>
The test MSE of the random forest model (462100.0243813725) is much lower, and therefore better, than the MSE of the linear regression model from homework 3 (1452635.051274).
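
To put the two scores on a more readable scale, the sketch below computes the relative MSE reduction and the corresponding RMSE values from the numbers quoted above. The variable names are illustrative only.

```python
import math

lr_mse = 1452635.051274     # linear regression MSE quoted above (homework 3)
rf_mse = 462100.0243813725  # random forest test MSE quoted above

# Fraction of the linear regression MSE removed by the random forest.
reduction = 1 - rf_mse / lr_mse

# RMSE is in the same units as price, which makes the error easier to interpret.
lr_rmse = math.sqrt(lr_mse)
rf_rmse = math.sqrt(rf_mse)

print(f"MSE reduction: {reduction:.1%}")
print(f"RMSE: {lr_rmse:.0f} -> {rf_rmse:.0f}")
```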

##### Grading Feedback Cell

# Question 7 (5 pts)
Set the `enable_grid_search` Boolean variable to False in the grading cell at the top of this notebook.  Perform a __Runtime -> factory reset__, __Runtime -> Run all__ test to verify there are no runtime errors.  Leave the `enable_grid_search` variable set to False and turn in your assignment.  This is the kind of thing you should be doing before you turn in every assignment. Remember this for future classes and when you get a job in industry.  This question will be graded as all or nothing.  You either set the Boolean correctly or not.  Additional points will be deducted elsewhere for runtime errors.

# Extra Credit (10 pts)
This homework was intended to take less time to complete and be about half the effort of previous assignments.  This doesn't allow us to explore GBT or deep learning.  

For extra credit, train a GBT or Deep Learning model using a grid search.  Protect the grid search inside the if enable_grid_search statement in the first code cell below.  You are free to use K-Fold cross validation if you wish.  The spark documentation for GBM can be found [here](https://spark.apache.org/docs/latest/ml-classification-regression.html#gradient-boosted-tree-classifier).  The spark documentation for deep learning can be found [here](https://spark.apache.org/docs/latest/ml-classification-regression.html#multilayer-perceptron-classifier)

In the second code cell below, hard code the best model parameters as determined by the grid search in a new pipeline named `best_pipe_2`.  Train and test `best_pipe_2` and save your resulting test MSE in a variable.  Do not use K-Fold cross validation when training best_pipe_2.  

In the third code cell below, create a pandas data frame named `compare_1_df` which contains 2 columns: Model and MSE.  Populate the Model column with model names: LR, RF, GBT or DL.  Populate the MSE column with the linear regression, random forest, and gradient boosted tree or deep learning test MSE scores. The linear regression score is from homework 3. The random forest score is from the random forest model above.  The GBT or Deep Learning score is from this extra credit problem.  Sort compare_1_df such that the best score is in the first row of the data frame. 

To get full credit, your GBT or deep learning solution should produce a score as good or better than the random forest score above.  In addition, the same rules as above apply where all of your grid search code shall be protected by the enable_grid_search Boolean.  Code that produces a runtime error when enable_grid_search is set to False will get 0 credit.
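
Sorting so that the best score comes first only works as intended when the MSE values are numeric. The sketch below uses the test MSE values reported in this notebook and contrasts numeric sorting with sorting the same values as text; the names `scores`, `ranking`, and `text_order` are illustrative only.

```python
# Test MSE per model, using the values reported in this notebook.
scores = {"LR": 1452635.051274, "RF": 462100.0243813725, "GBT": 403312.8032579468}

# Lower MSE is better, so an ascending numeric sort puts the best model first.
ranking = sorted(scores.items(), key=lambda item: item[1])

# Sorting the same numbers as strings compares them character by character,
# which places the largest value first here because it starts with '1'.
text_order = sorted(str(value) for value in scores.values())

print([model for model, _ in ranking])
print(text_order[0])
```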

In [None]:
# Your GBT / Deep Learning grid search code here
from pyspark.ml.regression import GBTRegressor
gbt = GBTRegressor(labelCol = 'price', featuresCol = 'features')

if enable_grid_search:
    # generate the grid object, which iterates over different combinations of parameters
    gbt_paramGrid = (ParamGridBuilder()
                     .addGrid(gbt.maxDepth, [2, 4, 9, 12])
                     .build())
    
    # generate a 3-fold cross validation model
    gbt_cv = CrossValidator(estimator = gbt,
                            estimatorParamMaps = gbt_paramGrid,
                            evaluator = evaluator,
                            numFolds = 3)
    
    gbtModel = gbt_cv.fit(train)
    predictions = gbtModel.bestModel.transform(test)
    
    print("mse", evaluator.evaluate(predictions))
    print('maxDepth - ', gbtModel.bestModel.getOrDefault('maxDepth'))

mse 403312.8032579468
maxDepth -  9


In [None]:
# your hard coded parameter best model code here
best_pipe_2 = Pipeline(stages=[GBTRegressor(featuresCol = 'features', labelCol = 'price', maxDepth = 9)])
gbtBestModel = best_pipe_2.fit(train)
gbtBestPredictions = gbtBestModel.transform(test)

# save the resulting MSEs in variables
gbt_train_mse = evaluator.evaluate(gbtBestModel.transform(train))
gbt_test_mse = evaluator.evaluate(gbtBestPredictions)
print("MSE for training dataset: " + str(gbt_train_mse))
print("MSE for testing dataset: " + str(gbt_test_mse))

MSE for training dataset: 201754.8090042882
MSE for testing dataset: 403312.8032579468


In [None]:
# Create compare_1_df
# the third model is Spark's gradient boosted trees (GBTRegressor), so it is labeled GBT
model_names = ['LR', 'RF', 'GBT']
mse_values = ['1452635.051274', '462100.0243813725', '403312.8032579468']

compare_1_df = pd.DataFrame(list(zip(model_names, mse_values)), columns =['Model', 'MSE'])

In [None]:
# Grading cell do not modify
display(compare_1_df)

| | Model | MSE |
|---|---|---|
| 0 | LR | 1452635.051274 |
| 1 | RF | 462100.0243813725 |
| 2 | GBT | 403312.8032579468 |
