# Modern Data Science 
**(Module 06: Apache Spark Platform)**

---
- Materials in this module include resources collected from various open-source online repositories.
- You are free to use, change and distribute this package.

Prepared by and for 
**Student Members** |
2006-2018 [TULIP Lab](http://www.tulip.org.au), Australia

---


## Session G - Spark MLlib (3): Supervised Learning


Spark ships with several libraries, one of which is MLlib (the Machine Learning Library). MLlib makes it quick and easy to scale practical machine learning!

In this lab exercise, you will learn how to build a Linear Regression model, an SVM model, and a Logistic Regression model. You will also learn how to create classification and regression DecisionTree and RandomForest models, and how to tune the parameters of each to build better trees and ensembles of trees.

## Content



### Part 1 [Linear Regression](#lr)


### Part 2 [Support Vector Machine](#svm)


### Part 3 [Logistic Regression](#logit)


### Part 4 Decision Tree (Regression) 

4.1 [maxDepth Parameter](#md)

4.2 [maxBins Parameter](#mb)

4.3 [minInstancesPerNode Parameter](#mip)

4.4 [minInfoGain Parameter](#mig)


### Part 5 [Decision Tree (Classification)](#dtc)

### Part 6 [Random Forest (Classification)](#rfc)

6.1 [numTrees Parameter](#nt)

6.2 [featureSubsetStrategy Parameter](#fss)

### Part 7 [Random Forest (Regression)](#rfr)


---
<a id = "lr"></a>
## <span style="color:#0b486b">1. Linear Regression</span>

<img src = "http://www.biostathandbook.com/pix/regressionlollipop.gif" style="height: 200pt; width: 200pt;" align = 'center'>

<div align="justify"><font size="3">Linear regression fits a "line of best fit" to previous data in order to predict future values. Several model evaluation metrics can be applied to linear regression. 

In this lab, we will look at <b>Mean Squared Error (MSE)</b>.</font></div>

Import the following libraries: <br>
<ul>
    <li>LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel from pyspark.mllib.regression</li>
</ul>

In [None]:
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel

Now we need to create an <b>RDD of data</b> called <b>rdd_data</b>. We do this by using the SparkContext (sc) to read in the <b>brain_body_data.csv</b> dataset. Take a look at the dataset so you have a feel for how it's structured.

In [None]:
import wget

link_to_data = 'https://github.com/tuliplab/mds/raw/master/Jupyter/data/brain_body_data.csv'
DataSet = wget.download(link_to_data)

In [None]:
rdd_data = sc.textFile("brain_body_data.csv")

Now, run a <b>map function</b> on <b>rdd_data</b>, where the input is a <b>lambda function</b> that is as follows: <i>lambda line: line.split(",")</i>. This is so we can split the dataset by commas, since it's a comma-separated value file (CSV). Store this into a variable called <b>split_data</b>

In [None]:
split_data = rdd_data.map(lambda line: line.split(","))

Next, run the following function that will convert each line in our RDD into a LabeledPoint.

In [None]:
def labeledParse(line):
    return LabeledPoint(line[0], [line[1]])

Now, run a <b>map function</b> on <b>split_data</b>, passing in <b>labeledParse</b> as input. Store the output into a variable called <b>reg_data</b>.

In [None]:
reg_data = split_data.map(labeledParse)

Now, we will create a variable called <b>linReg_model</b>, which will contain the linear regression model. The model will be made by calling the <b>LinearRegressionWithSGD</b> class and using the <b>.train</b> function with it. The .train function will take in 3 inputs:
<ul>
    <li>1st: The training data (reg_data in this case)</li>
    <li>2nd: The number of iterations, or how many times the regression will run (use iterations=150)</li>
    <li>3rd: step used in SGD (use step=0.00001 in this case) </li>
</ul>

In [None]:
linReg_model = LinearRegressionWithSGD.train(reg_data, iterations=150, step=0.00001)
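
To get a feel for what the <b>iterations</b> and <b>step</b> inputs control, here is a minimal plain-Python sketch of gradient descent on a one-feature model with no intercept. It is an illustration only: the function name `fit_weight` and the toy data are made up for this example, and MLlib's SGD differs in its details (for instance, it can shrink the step size over time and sample mini-batches).

```python
# Plain-Python sketch of gradient descent for y ~ w * x (no intercept).
# Illustrative only; this is not MLlib's update rule.
def fit_weight(points, iterations, step):
    """points is a list of (label, feature) pairs; returns the learned weight."""
    w = 0.0
    n = float(len(points))
    for _ in range(iterations):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2.0 * (w * x - y) * x for y, x in points) / n
        w -= step * grad
    return w

toy_points = [(2.0, 1.0), (4.0, 2.0), (6.0, 3.0)]  # made-up data where y = 2x

w_few = fit_weight(toy_points, iterations=5, step=0.01)
w_many = fit_weight(toy_points, iterations=500, step=0.01)
print(w_few, w_many)
```

With only a few iterations the weight is still far from 2, while many iterations bring it very close. A step that is too large for the scale of the features can make the updates diverge, which is one reason the lab uses a very small step.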

Next, we will create a variable called <b>actualAndPreds</b>, which will contain the actual response along with the predicted response from the model. This will be done by using the <b>map</b> function on <b>reg_data</b>, and passing in:<br> <b>lambda p: (p.label, linReg_model.predict(p.features))</b> as the input.

In [None]:
actualAndPreds = reg_data.map(lambda p: (p.label, linReg_model.predict(p.features)))

We will calculate the Mean Squared Error (MSE) value for the prediction. Run the following code to calculate the MSE. <br> <br> 

The map function takes the actual value, subtracts the predicted value from it, then 
squares the result. This is done for each pair. <br> <br> 

Next, the reduce function sums all of the mapped values together. <br> <br>

Afterwards, the result is divided by the number of elements that are present in actualAndPreds.


In [None]:
# vp is an (actual, predicted) pair; indexing works in both Python 2 and Python 3
MSE = actualAndPreds.map(lambda vp: (vp[0] - vp[1])**2).reduce(lambda x, y: x + y) / actualAndPreds.count()
print("Mean Squared Error = " + str(MSE))
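
The same map, reduce, and divide steps can be tried without Spark. The sketch below uses a small made-up list of (actual, predicted) pairs in place of the RDD, so the numbers are for illustration only.

```python
from functools import reduce

# Made-up (actual, predicted) pairs standing in for the contents of actualAndPreds.
pairs = [(3.0, 2.5), (1.0, 1.5), (4.0, 4.0)]

# Map step: squared difference for each pair.
squared_errors = list(map(lambda vp: (vp[0] - vp[1]) ** 2, pairs))

# Reduce step: add all squared differences together.
total = reduce(lambda x, y: x + y, squared_errors)

# Final step: divide by the number of pairs.
mse = total / len(pairs)
print("Mean Squared Error = " + str(mse))
```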

---
<a id = "svm"></a>
## <span style="color:#0b486b">2. Support Vector Machine (SVM)</span>


<img src = "http://blogs.quovantis.com/wp-content/uploads/2015/09/Svm_max_sep_hyperplane_with_margin.png" style="height: 200pt; width: 200pt;" align = 'center'>


<div align="justify"><font size="3">Support Vector Machines can be used for both <b>classification and regression</b> analysis. In our case, we will be using it for classification. Linear SVM in Spark only supports <b>binary classification</b>.</font></div>

Import the following libraries: <br>
<ul>
    <li>SVMWithSGD, SVMModel from pyspark.mllib.classification</li>
    <li>LabeledPoint from pyspark.mllib.regression</li>
</ul>

In [None]:
from pyspark.mllib.classification import SVMWithSGD, SVMModel
from pyspark.mllib.regression import LabeledPoint

Now we need to create an <b>RDD of data</b> called <b>svm_data</b>. We do this by using the SparkContext (sc) to read in the <b>sample_svm_data.txt</b> dataset, a sample dataset that ships with Spark. It contains 322 rows of data. 

In [None]:
import wget

link_to_data = 'https://github.com/tuliplab/mds/raw/master/Jupyter/data/sample_svm_data.txt'
DataSet = wget.download(link_to_data)

In [None]:
svm_data = sc.textFile("sample_svm_data.txt")

This dataset isn't in the format we need, so we will use the following function to parse it. The function also creates LabeledPoints from the data, which are required to train the SVM model. The parsing required will differ depending on your dataset.

In [None]:
def labeledParse(line):
    values = [float(x) for x in line.split(' ')]
    return LabeledPoint(values[0], values[1:])

This will be applied to <b>svm_data</b> by using the <b>.map</b> function and passing in the <b>labeledParse function</b>, which applies labeledParse to every line of the dataset. Call the output <b>svm_parsed</b>.

In [None]:
svm_parsed = svm_data.map(labeledParse)

Now create an SVM model called <b>svm_model</b> using the <b>SVMWithSGD.train</b> function, which requires two inputs:
<ul>
    <li>1st: The dataset containing the LabeledPoints (<b>svm_parsed</b> in this case)</li>
    <li>2nd: The number of iterations the model will run (<b>120</b> in this case)</li>
</ul>

In [None]:
svm_model = SVMWithSGD.train(svm_parsed, iterations=120)

Next, we will create a variable called <b>svm_Labels_Predicts</b>, an RDD of tuples that each contain the label and the prediction. <br>
This will be done by using the <b>.map</b> function once again, but on the parsed data, <b>svm_parsed</b>. <br>
The input into svm_parsed.map() will be a lambda function: <b>lambda x: (x.label, svm_model.predict(x.features))</b>

In [None]:
svm_Labels_Predicts = svm_parsed.map(lambda x: (x.label, svm_model.predict(x.features)))
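
To see what a linear SVM's predict step does for one feature vector, here is a minimal plain-Python sketch of the decision rule: compute the margin (weights dotted with features, plus the intercept) and compare it with a threshold of 0. The function name `svm_predict` and the weights below are hypothetical values chosen for illustration, not values learned from the lab dataset.

```python
def svm_predict(weights, intercept, features, threshold=0.0):
    """Return 1 if the margin w.x + b exceeds the threshold, otherwise 0."""
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    return 1 if margin > threshold else 0

# Hypothetical model: two weights and no intercept.
example_weights = [0.5, -1.0]

print(svm_predict(example_weights, 0.0, [4.0, 1.0]))  # margin = 1.0
print(svm_predict(example_weights, 0.0, [1.0, 2.0]))  # margin = -1.5
```

Because there is a single threshold on a single margin, the rule can only separate two classes, which is why this model is limited to binary classification.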

Now, we will take a look at the training error, called <b>trainingError</b>, which tells us how often the model was wrong on the data it was trained on. It is computed by counting the number of incorrect predictions and dividing by the total number of predictions.<br>
We will run a <b>.filter</b> on the RDD we just created, <b>svm_Labels_Predicts</b>, <b>count</b> the output of that with <b>.count()</b>, then <b>divide</b> by the <b>number of elements in svm_parsed</b>. <br> <br>

This filter will take a lambda function as input: <b>lambda lp: lp[0] != lp[1]</b>, which compares the label (lp[0]) with the prediction (lp[1]) and keeps only the pairs where they differ. Indexing into the pair works in both Python 2 and Python 3.<br><br>

Make sure to add a <b>.count()</b> to the <b>filter</b>, then <b>divide</b> the whole thing by <b>float(svm_parsed.count())</b>


In [None]:
trainingError = svm_Labels_Predicts.filter(lambda lp: lp[0] != lp[1]).count() / float(svm_parsed.count())

Finally, print trainingError to see the fraction of examples that the model predicted incorrectly.

In [None]:
print(trainingError)
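
The error calculation can be checked by hand on a tiny example. The sketch below uses a made-up list of (label, prediction) pairs in place of the RDD, and also shows that accuracy is simply one minus the error.

```python
# Made-up (label, prediction) pairs standing in for svm_Labels_Predicts.
label_pred_pairs = [(1.0, 1), (0.0, 1), (1.0, 1), (0.0, 0), (1.0, 0)]

# Keep only the pairs where the prediction differs from the label.
wrong = [lp for lp in label_pred_pairs if lp[0] != lp[1]]

error_rate = len(wrong) / float(len(label_pred_pairs))
accuracy = 1.0 - error_rate
print(error_rate, accuracy)
```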

---
<a id = "logit"></a>
## <span style="color:#0b486b">3. Logistic Regression</span>


<img src = "http://cvxr.com/cvx/examples/cvxbook/Ch07_statistical_estim/html/logistics__01.png" style="height: 200pt; width: 200pt;" align = 'center'>

<div align="justify"><font size="3">Logistic Regression is a classifier, similar to SVM. Logistic Regression can be used for Binary Classification, which is pretty clear when looking at the diagram above. In the diagram, there are two distinct regions in which the data resides, which represent the two classes of a binary classification. <br> <br> In this lab, we will use the same dataset as the one used for SVM, so we can compare the accuracy of both models.</font></div>

Import the following libraries: <br>
<ul>
    <li>LogisticRegressionWithLBFGS, LogisticRegressionModel from pyspark.mllib.classification</li>
    <li>LabeledPoint from pyspark.mllib.regression</li>
</ul>

In [None]:
from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel
from pyspark.mllib.regression import LabeledPoint

Since we are still using the same dataset as in SVM, we will be using the same <b>svm_parsed</b> variable.

Create a variable called <b>logReg_model</b>, where we <b>train</b> a <b>LogisticRegressionWithLBFGS</b> model by passing in <b>svm_parsed</b>.

In [None]:
# Build the model
logReg_model = LogisticRegressionWithLBFGS.train(svm_parsed)

Next, create a variable called <b>logReg_Labels_Predicts</b> by <b>mapping</b> the <b>svm_parsed</b> data and passing in the <b>label</b>, along with the <b>logReg_model prediction</b>. This is similar to what we did in the SVM section of the lab.

In [None]:
logReg_Labels_Predicts = svm_parsed.map(lambda p: (p.label, logReg_model.predict(p.features)))
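
For comparison with the SVM rule, a binary logistic regression model passes the same kind of margin through the logistic (sigmoid) function to get a score between 0 and 1, then compares that score with a threshold (commonly 0.5). The sketch below is a plain-Python illustration; the name `logistic_predict` and the weights are hypothetical.

```python
import math

def logistic_predict(weights, intercept, features, threshold=0.5):
    """Return (predicted_class, score) for one feature vector."""
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    score = 1.0 / (1.0 + math.exp(-margin))  # logistic (sigmoid) function
    return (1 if score > threshold else 0), score

# Hypothetical model: two weights and no intercept.
logit_weights = [0.5, -1.0]

label_a, score_a = logistic_predict(logit_weights, 0.0, [4.0, 1.0])  # margin = 1.0
label_b, score_b = logistic_predict(logit_weights, 0.0, [1.0, 2.0])  # margin = -1.5
print(label_a, score_a)
print(label_b, score_b)
```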

Finally, we will find the training error, or the fraction of examples that the model predicted incorrectly. This will be done by applying the <b>filter</b> function on <b>logReg_Labels_Predicts</b>. We will pass in a lambda function that keeps only the pairs whose label and prediction differ, <b>(lambda lp: lp[0] != lp[1])</b>, then apply a <b>count()</b> on the filter. This will get the number of incorrect predictions. Now, we need to divide by the total number of predictions, or <b>float(svm_parsed.count())</b>. Store this as <b>trainingError2</b>. Refer to the SVM section if you need a hint.

In [None]:
trainingError2 = logReg_Labels_Predicts.filter(lambda lp: lp[0] != lp[1]).count() / float(svm_parsed.count())

Now print trainingError2 and trainingError (from the SVM section).

In [None]:
print(trainingError2)
print(trainingError)

It seems as though the training error for Logistic Regression is slightly lower than for SVM in this case!

---
## <span style="color:#0b486b">4. Decision Tree (Regression)</span>

Import the following libraries:
<ul>
    <li>DecisionTree, DecisionTreeModel from pyspark.mllib.tree</li>
    <li>MLUtils from pyspark.mllib.util</li>
    <li>time</li>
</ul>

In [None]:
from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils
import time

Next, we will load in the <b>poker.txt</b> LibSVM file, which is a dataset based on poker hands. First download the file into the working directory, then use <b>MLUtils.loadLibSVMFile</b>, passing in the Spark context (<b>sc</b>) and the path to the file, <b>'poker.txt'</b>. Store the result in a variable called <b>regDT_data</b>.

In [None]:
import wget

link_to_data = 'https://github.com/tuliplab/mds/raw/master/Jupyter/data/poker.txt'
DataSet = wget.download(link_to_data)

In [None]:
regDT_data = MLUtils.loadLibSVMFile(sc, 'poker.txt')

Next, we need to split the data into a training dataset (called <b>regDT_train</b>) and testing dataset (called <b>regDT_test</b>). This will be done by running the <b>.randomSplit</b> function on <b>regDT_data</b>. The input into .randomSplit will be <b>[0.7, 0.3]</b>. <br> <br>

This will give us a training dataset containing 70% of the data, and a testing dataset containing 30% of the data.

In [None]:
(regDT_train, regDT_test) = regDT_data.randomSplit([0.7, 0.3])
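
The split is random per element, so the two parts are only approximately 70% and 30% of the data. The plain-Python sketch below imitates that idea with a hypothetical helper, `random_split`, which is not Spark's implementation: each item draws a random number and lands in the split whose weight range contains it.

```python
import random

def random_split(items, weights, seed=None):
    """Assign each item to one split at random, in proportion to the weights."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative upper bounds, e.g. [0.7, 1.0] for weights [0.7, 0.3].
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for item in items:
        draw = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point leftovers.
            if draw < bound or i == len(bounds) - 1:
                splits[i].append(item)
                break
    return splits

train, test = random_split(range(1000), [0.7, 0.3], seed=42)
print(len(train), len(test))
```

Fixing the seed makes the split repeatable. The RDD method accepts a seed as well, for example `regDT_data.randomSplit([0.7, 0.3], seed=42)`, which helps when you want to compare tuning runs on identical data.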

Next, we need to create the Regression Decision Tree called <b>regDT_model</b>. To instantiate the regressor, use <b>DecisionTree.trainRegressor</b>. We will pass in the following parameters:
<ul>
    <li>1st: The input data. In our case, we will use <b>regDT_train</b></li>
    <li>2nd: The categorical features info. For our dataset, have <b>categoricalFeaturesInfo</b> equal <b>{}</b></li>
    <li>3rd: The type of impurity. Since we're dealing with <b>Regression</b>, we will set <b>impurity</b> to <b>'variance'</b></li>
    <li>4th: The maximum depth of the tree. For now, set <b>maxDepth</b> to <b>5</b>, which is the default value</li>
    <li>5th: The maximum number of bins. For now, set <b>maxBins</b> to <b>32</b>, which is the default value</li>
    <li>6th: The minimum instances required per node. For now, set <b>minInstancesPerNode</b> to <b>1</b>, which is the default value</li>
    <li>7th: The minimum required information gain per node. For now, set <b>minInfoGain</b> to <b>0.0</b>, which is the default value</li>
</ul> <br> <br>

We will also be timing how long it takes to create the model, so run <b>start = time.time()</b> before creating the model and <b>print(time.time()-start)</b> after the model has been created. <br>
<b>Note</b>: Timings differ from run to run and from computer to computer, so some statements throughout the lab may not line up exactly with the results you get, which is okay! There are many factors that can affect the time output.

In [None]:
start = time.time()
regDT_model = DecisionTree.trainRegressor(regDT_train, categoricalFeaturesInfo={},
                                    impurity='variance', maxDepth=5, maxBins=32,
                                    minInstancesPerNode=1, minInfoGain=0.0)
print (time.time()-start)
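
To make the <b>'variance'</b> impurity more concrete, here is a minimal plain-Python sketch of how a regression tree can score a candidate split: the gain is the variance of the parent's labels minus the size-weighted variance of the two children. The function names `variance` and `info_gain` and the toy labels are made up for this illustration and are not Spark's internal code.

```python
def variance(labels):
    """Population variance of a list of numeric labels."""
    n = float(len(labels))
    mean = sum(labels) / n
    return sum((y - mean) ** 2 for y in labels) / n

def info_gain(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = float(len(parent))
    weighted = (len(left) / n) * variance(left) + (len(right) / n) * variance(right)
    return variance(parent) - weighted

toy_labels = [0.0, 0.0, 1.0, 1.0]
good_split_gain = info_gain(toy_labels, [0.0, 0.0], [1.0, 1.0])  # children are pure
poor_split_gain = info_gain(toy_labels, [0.0, 1.0], [0.0, 1.0])  # children as mixed as the parent
print(good_split_gain, poor_split_gain)
```

A split that separates the labels cleanly has a large gain, while a split that leaves both children as mixed as the parent has a gain of zero.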

Next, we want to get the model's predictions on the test data, which we will call <b>regDT_pred</b>. We will run <b>.predict</b> on regDT_model, passing in the testing data, <b>regDT_test</b>, mapped with <b>.map</b> so that only the features are kept (<b>lambda x: x.features</b>).

In [None]:
regDT_pred = regDT_model.predict(regDT_test.map(lambda x: x.features))

Now create a variable called <b>regDT_label_pred</b> which uses a <b>.map</b> on <b>regDT_test</b>. Pass <b>lambda l: l.label</b> into the mapping function. Outside of the mapping function, add a <b>.zip(regDT_pred)</b>. This will pair each label with its prediction.

In [None]:
regDT_label_pred = regDT_test.map(lambda l: l.label).zip(regDT_pred)

Now we will calculate the Mean Squared Error for this prediction, which we will call <b>regDT_MSE</b>. This will equate to <b>regDT_label_pred.map(lambda vp: (vp[0] - vp[1])**2).sum() / float(regDT_test.count())</b>, which takes the difference between the actual value (vp[0]) and the predicted value (vp[1]), squares it, and sums the results over all pairs. The sum is then divided by the total number of values in the testing data.

In [None]:
regDT_MSE = regDT_label_pred.map(lambda vp: (vp[0] - vp[1])**2).sum() / float(regDT_test.count())

Next, print out the MSE prediction value (<b>str(regDT_MSE)</b>), as well as the learned regression tree model (<b>regDT_model.toDebugString()</b>), so you have an idea of what the tree looks like.

In [None]:
print('Test Mean Squared Error = ' + str(regDT_MSE))
print('Learned Regression Tree Model: ' + regDT_model.toDebugString())

Now that we've created the basic Regression Decision Tree, let's start tuning some parameters! To speed up the process and reduce the amount of code that appears in this notebook, I've made a function that incorporates all of the code above. This way, we can tune the parameters in a single line of code. <br> <br>

Read over the code, and it should be apparent what each of the inputs should be. But just to reiterate:
<ul>
    <li>1st: maxDepthValue is the value for maxDepth (Type:Int, Range: 0 to 30)</li>
    <li>2nd: maxBinsValue is the value for maxBins (Type: Int, Range: >= 2)</li>
    <li>3rd: minInstancesValue is the value for minInstancesPerNode (Type: Int, Range: >=1)</li>
    <li>4th: minInfoGainValue is the value for minInfoGain (Type: Float)</li>
    <ul>
        <li><b>NOTE</b>: The input for minInfoGain MUST contain a decimal (ex. -3.0, 0.1, etc.) or else you will get an error</li>
    </ul>
</ul>

In [None]:
def regDT_tuner(maxDepthValue, maxBinsValue, minInstancesValue, minInfoGainValue):
    start = time.time()
    regDT_model = DecisionTree.trainRegressor(regDT_train, categoricalFeaturesInfo={},
                                        impurity='variance', maxDepth=maxDepthValue, maxBins=maxBinsValue,
                                        minInstancesPerNode=minInstancesValue, minInfoGain=minInfoGainValue)
    print (time.time()-start)

    regDT_pred = regDT_model.predict(regDT_test.map(lambda x: x.features))
    regDT_label_pred = regDT_test.map(lambda l: l.label).zip(regDT_pred)
    regDT_MSE = regDT_label_pred.map(lambda vp: (vp[0] - vp[1])**2).sum() / float(regDT_test.count())

    print('Test Mean Squared Error = ' + str(regDT_MSE))
    print('Learned Regression Tree Model: ' + regDT_model.toDebugString())

Start off by re-creating the original tree. That requires the inputs: <b>(5, 32, 1, 0.0)</b> into <b>regDT_tuner</b>

In [None]:
regDT_tuner(5, 32, 1, 0.0)

Remember that when we are tuning a specific parameter, we keep the other parameters at their original values.

<a id = "md"></a>

### <span style="color:#0b486b">4.1 maxDepth Parameter</span>

Let's start by tuning the <b>maxDepth</b> parameter. Begin by setting it to a lower value, such as <b>1</b>

In [None]:
regDT_tuner(1, 32, 1, 0.0)

By decreasing the maxDepth parameter, you can see that the run-time decreased slightly and the tree became smaller. You may also see a slight increase in the error, which is to be expected since the tree is too shallow to make accurate predictions.

Now try increasing the value of <b>maxDepth</b> to a large number, such as <b>30</b>, which is the maximum value.

In [None]:
regDT_tuner(30, 32, 1, 0.0)

With a large value for maxDepth, you can see that the run-time increased greatly, along with the size of the tree. The MSE has increased greatly compared to the original, which is due to overfitting of the training data from having a deep tree.

<a id = "mb"></a>

### <span style="color:#0b486b">4.2 maxBins Parameter</span>

Now let's tune the <b>maxBins</b> variable. Start by decreasing the value to 2, to see what the lower end of this value does to the tree.

In [None]:
regDT_tuner(5, 2, 1, 0.0)

Comparing this to the original tree, we can see a small decrease in the training time, but not much of a difference in the MSE or the size of the tree.

Now let's take a look at the upper end, with a value of 15000

In [None]:
regDT_tuner(5, 15000, 1, 0.0)

With a very large maxBins value, we don't see much of a change in the overall time or in the MSE. The model still has the same depth and number of nodes, as expected.
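
One way to see why a very large maxBins can leave the tree unchanged: maxBins caps the number of candidate split thresholds per feature, and a feature with only a handful of distinct values cannot supply more thresholds than it has gaps between values. The sketch below is a simplified illustration under that assumption; `candidate_thresholds` is a hypothetical helper, not Spark's actual binning algorithm.

```python
def candidate_thresholds(values, max_bins):
    """Return at most max_bins - 1 candidate split thresholds for one feature."""
    distinct = sorted(set(values))
    # One possible threshold between each pair of neighbouring distinct values.
    midpoints = [(a + b) / 2.0 for a, b in zip(distinct, distinct[1:])]
    num_splits = max_bins - 1
    if len(midpoints) <= num_splits:
        return midpoints
    # Too many possibilities: keep num_splits of them, evenly spaced.
    step = len(midpoints) / float(num_splits + 1)
    return [midpoints[int(step * (i + 1))] for i in range(num_splits)]

feature_values = list(range(1, 14))  # a made-up feature with 13 distinct values

print(len(candidate_thresholds(feature_values, 32)))     # every gap is available
print(len(candidate_thresholds(feature_values, 15000)))  # no more gaps to add
print(candidate_thresholds(feature_values, 2))           # only one threshold allowed
```

In this sketch, raising maxBins beyond the number of distinct values adds nothing, while lowering it to 2 forces every split on that feature to use the same single threshold.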

<a id = "mip"></a>

### <span style="color:#0b486b">4.3 minInstancesPerNode Parameter</span>

Next we will look at tuning the <b>minInstancesPerNode</b> parameter. It starts off at the lowest value of 1, but let's see what happens if we keep increasing the value, starting with <b>100</b>.

In [None]:
regDT_tuner(5, 32, 100, 0.0)

With minInstancesPerNode set to 100, we don't see much of a change in time and MSE, but we can see that there are fewer nodes in the tree. Try now with a value of <b>1000</b>

In [None]:
regDT_tuner(5, 32, 1000, 0.0)

With a value of 1000, we may see more of a decrease in the time, but the MSE has also increased a little bit. As well, the number of nodes in the model has decreased once again. Let's take it one step further and try with a value of <b>8000</b>

In [None]:
regDT_tuner(5, 32, 8000, 0.0)

With a value of 8000, we may see that the run-time to build the model is starting to decrease a lot more, with only a small increase in MSE compared to when the value was set to 1000. The main difference we see is that the tree has become a lot smaller! This is to be expected since we are tuning a stopping parameter, which determines when the model finishes building.


<a id = "mig"></a>

### <span style="color:#0b486b">4.4 minInfoGain Parameter</span>


For the last parameter, we will look at minInfoGain, which was initially set to 0.0. Negative values have little effect on the tree, but the model is very sensitive to values greater than 0.0. Try setting the value to a low number, such as -100.0

In [None]:
regDT_tuner(5, 32, 1, -100.0)

Overall, we don't see much of a change at all to anything. Now try changing the value to 0.0003

In [None]:
regDT_tuner(5, 32, 1, 0.0003)

We can see that small values greater than zero can cause drastic changes in how the model looks. Here, we see a small decrease in the training time and a small increase in the MSE value, but the tree now contains only one node. The effect of this parameter on the tree is similar to that of minInstancesPerNode, since they are both stopping parameters.
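
The two stopping parameters from sections 4.3 and 4.4 can be pictured as a single check applied to every candidate split. The sketch below is a plain-Python illustration with a hypothetical function, `split_allowed`; it is meant to show the idea rather than Spark's exact code path.

```python
def split_allowed(left_count, right_count, gain,
                  min_instances_per_node=1, min_info_gain=0.0):
    """Return True if a candidate split passes both stopping rules."""
    # Rule 1: each child must receive enough training instances.
    if left_count < min_instances_per_node or right_count < min_instances_per_node:
        return False
    # Rule 2: the split must improve impurity by at least min_info_gain.
    return gain >= min_info_gain

# A made-up candidate split: 9000 rows go left, 500 go right, gain of 0.0002.
print(split_allowed(9000, 500, 0.0002))                               # defaults
print(split_allowed(9000, 500, 0.0002, min_instances_per_node=1000))  # right child too small
print(split_allowed(9000, 500, 0.0002, min_info_gain=0.0003))         # gain too small
print(split_allowed(9000, 500, 0.0002, min_info_gain=-100.0))         # same as the default
```

Since a gain computed from variance is never negative, any negative minInfoGain behaves like 0.0 in this sketch, which fits the observation that -100.0 changed almost nothing.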

---
<a id = "dtc"></a>
## <span style="color:#0b486b">5. Decision Tree (Classification)</span>

Now it's time for you to try it out for yourself! Build a Classification DecisionTree in a similar way that the Regression DecisionTree was built. Please note that you will be using the same dataset in this section (regDT_train, regDT_test), therefore you do not need to re-initialize that section.<br> <br> 

Try to only reference the above section when you are experiencing a lot of difficulty. This section is mainly for you to apply your learning.

For some help with the variables:
<ul>
    <li><b>numClasses</b>: The number of classes for this dataset is <b>10</b> (parameter doesn't require tuning)</li>
    <li><b>categoricalFeaturesInfo</b>: Has a value of <b>{}</b> (parameter doesn't require tuning)</li>
    <li><b>impurity</b>: There are two types of impurities you can use -- <b>'gini'</b> or <b>'entropy'</b> <i>(Default: 'gini')</i></li>
    <li><b>maxDepth</b>: Values range between <b>0 and 30</b> <i>(Default: 5)</i></li>
    <li><b>maxBins</b>: Value ranges between <b>2 and 2147483647</b> (largest value for 32-bits) <i>(Default: 32)</i></li>
    <li><b>minInstancesPerNode</b> ranges between <b>1 and 2147483647</b> <i>(Default: 1)</i></li>
    <li><b>minInfoGain</b>: Ensure it is a float (has a decimal in the value) <i>(Default: 0.0)</i></li>
</ul>
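
To help choose between the two impurity options, here is a minimal plain-Python sketch of how each one scores a set of class labels. The function names `gini` and `entropy` and the toy labels are made up for this illustration; entropy is computed with base-2 logarithms.

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = float(len(labels))
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Entropy in bits: minus the sum of p * log2(p) over the classes."""
    n = float(len(labels))
    return -sum((count / n) * math.log2(count / n) for count in Counter(labels).values())

mixed_labels = [0, 0, 1, 1]  # evenly mixed classes
pure_labels = [1, 1, 1, 1]   # a single class

print(gini(mixed_labels), entropy(mixed_labels))
print(gini(pure_labels), entropy(pure_labels))
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, so they often lead to similar trees.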

When displaying the <b>Test Error</b>, use the following formula and print statement instead of MSE: <br>
<b>classDT_error = classDT_label_pred.filter(lambda lp: lp[0] != lp[1]).count() / float(regDT_test.count())</b> <br>
<b>print('Test Error = ' + str(classDT_error))</b>


#### The Goal
Try to create a model that is better than the model with default values. Challenge yourself by trying to create the best model you can!


#### Note
We want a model that doesn't take too long to train and doesn't overfit. Remember that a very large model with high accuracy but a long run time may not be good, because the model may have overfit the data.

In [None]:
start = time.time()
classDT_model = DecisionTree.trainClassifier(regDT_train, numClasses = 10, 
                                     categoricalFeaturesInfo = {},
                                     impurity = 'gini', maxDepth = 9,
                                     maxBins = 25, minInstancesPerNode = 4,
                                     minInfoGain = -3.0)
print(time.time() - start)
# Evaluate model on test instances and compute test error
classDT_pred = classDT_model.predict(regDT_test.map(lambda x: x.features))
classDT_label_pred = regDT_test.map(lambda lp: lp.label).zip(classDT_pred)
classDT_error = classDT_label_pred.filter(lambda lp: lp[0] != lp[1]).count() / float(regDT_test.count())
print('Test Error = ' + str(classDT_error))
print('Learned classification tree model:' + classDT_model.toDebugString())

# 1.16329193115
# Test Error = 0.495765559887
# Learned classification tree model:DecisionTreeModel classifier of depth 5 with 63 nodes
# Impurity: entropy
# maxDepth: 5
# maxBins: 32
# minInstancesPerNode: 1
# minInfoGain: 0.0

# 1.16743922234
# Test Error = 0.453958865439
# Learned classification tree model:DecisionTreeModel classifier of depth 9 with 577 nodes
# Impurity: gini
# maxDepth: 9
# maxBins: 25
# minInstancesPerNode: 4
# minInfoGain: -3.0


---
<a id = "rfc"></a>
## <span style="color:#0b486b">6. Random Forest (Classification)</span>

Now that we've run through the DecisionTree model, let's work with RandomForests. The process will be similar to the DecisionTree section.

Import the following libraries:
<ul>
    <li>RandomForest, RandomForestModel from pyspark.mllib.tree</li>
    <li>MLUtils from pyspark.mllib.util</li>
    <li>time</li>
</ul>

In [None]:
from pyspark.mllib.tree import RandomForest, RandomForestModel
from pyspark.mllib.util import MLUtils
import time

Next, we will load in the <b>pendigits.txt</b> LibSVM file, which is a dataset based on Pen-Based Recognition of Handwritten Digits. First download the file into the working directory, then use <b>MLUtils.loadLibSVMFile</b>, passing in the Spark context (<b>sc</b>) and the path to the file, <b>'pendigits.txt'</b>. Store the result in a variable called <b>classRF_data</b>. <br> <br>

Note: You can also try out this section with the poker.txt dataset if you want to compare results from both sections!

In [None]:
import wget

link_to_data = 'https://github.com/tuliplab/mds/raw/master/Jupyter/data/pendigits.txt'
DataSet = wget.download(link_to_data)

In [None]:
!ls -l

In [None]:
classRF_data = MLUtils.loadLibSVMFile(sc, 'pendigits.txt')

Next, we need to split the data into a training dataset (called <b>classRF_train</b>) and testing dataset (called <b>classRF_test</b>). This will be done by running the <b>.randomSplit</b> function on <b>classRF_data</b>. The input into .randomSplit will be <b>[0.7, 0.3]</b>. <br> <br>

This will give us a training dataset containing 70% of the data, and a testing dataset containing 30% of the data.

In [None]:
(classRF_train, classRF_test) = classRF_data.randomSplit([0.7, 0.3])

Next, we need to create the Random Forest Classifier called <b>classRF_model</b>. To instantiate the classifier, use <b>RandomForest.trainClassifier</b>. We will pass in the following parameters:
<ul>
    <li>1st: The input data. In our case, we will use <b>classRF_train</b></li>
    <li>2nd: The number of classes. For this dataset, there will be 10 classes, so set <b>numClasses</b> equal to <b>10</b></li>
    <li>3rd: The categorical features info. For our dataset, have <b>categoricalFeaturesInfo</b> equal <b>{}</b></li>
    <li>4th: The number of trees. We will set <b>numTrees = 3</b></li>
    <li>5th: The feature Subset Strategy. There are various inputs for this parameter, but for the sake of this section we will set <b>featureSubsetStrategy</b> equal to <b>"auto"</b></li>
    <li>6th: The type of impurity. Since we're dealing with <b>Classification</b>, we will set <b>impurity</b> to <b>'gini'</b></li>
    <li>7th: The maximum depth of the tree. For now, set <b>maxDepth</b> to <b>4</b>, which is the default value for RandomForest</li>
    <li>8th: The maximum number of bins. For now, set <b>maxBins</b> to <b>32</b>, which is the default value</li>
    <li>9th: The seed to generate random data. For now, set <b>seed</b> to <b>None</b></li>
</ul> <br> <br>

We will also be timing how long it takes to create the model, so run <b>start = time.time()</b> before creating the model and <b>print(time.time()-start)</b> after the model has been created. <br>
<b>Note</b>: Timings differ from run to run and from computer to computer, so some statements throughout the lab may not line up exactly with the results you get, which is okay! There are many factors that can affect the time output.

In [None]:
start = time.time()
classRF_model = RandomForest.trainClassifier(classRF_train, numClasses = 10, categoricalFeaturesInfo={},
                                           featureSubsetStrategy="auto", numTrees=3,
                                           impurity='gini', maxDepth=4, maxBins=32, seed=None)
print (time.time()-start)

Next, we want to get the model's predictions on the test data, which we will call <b>classRF_pred</b>. We will run <b>.predict</b> on classRF_model, passing in the testing data, <b>classRF_test</b>, mapped with <b>.map</b> so that only the features are kept (<b>lambda x: x.features</b>).

In [None]:
classRF_pred = classRF_model.predict(classRF_test.map(lambda x: x.features))

Now create a variable called <b>classRF_label_pred</b> which uses a <b>.map</b> on <b>classRF_test</b>. Pass <b>lambda l: l.label</b> into the mapping function. Outside of the mapping function, add a <b>.zip(classRF_pred)</b>. This will pair each label with its prediction.

In [None]:
classRF_label_pred = classRF_test.map(lambda l: l.label).zip(classRF_pred)

Now we will calculate the Test Error for this prediction, which we will call <b>classRF_error</b>. This will equate to <b>classRF_label_pred.filter(lambda lp: lp[0] != lp[1]).count() / float(classRF_test.count())</b>, which counts the pairs whose label (lp[0]) and prediction (lp[1]) differ and divides that count by the total number of predictions.

In [None]:
classRF_error = classRF_label_pred.filter(lambda lp: lp[0] != lp[1]).count() / float(classRF_test.count())

Next, print out the test error value (<b>str(classRF_error)</b>), as well as the learned classification forest model (<b>classRF_model.toDebugString()</b>), so you have an idea of what the ensemble looks like.

In [None]:
print('Test Error = ' + str(classRF_error))
print('Learned classification tree model:' + classRF_model.toDebugString())

Now that we've created the basic Classification Random Forest, let's start tuning some parameters! This is similar to the previous section, but since most of the tuning parameters have been covered in the Decision Tree section, there will only be two parameters to tune in this section. <br> <br>

Read over the code and understand how to build the Classification Random Forest as a whole. For the inputs, we have:
<ul>
    <li>1st: numTreesValue is the value for numTrees (Type: Int, Range: > 0, Default: 3)</li>
    <li>2nd: featureSubsetStrategyValue is the value for featureSubsetStrategy (Default: "auto")</li>
    <ul>
        <li>Values include: "auto", "all", "sqrt", "log2", "onethird"</li>
    </ul>
</ul>

In [None]:
def classRF_tuner(numTreesValue, featureSubsetStrategyValue):
    start = time.time()
    classRF_model = RandomForest.trainClassifier(classRF_train, numClasses = 10, categoricalFeaturesInfo={},
                                           featureSubsetStrategy=featureSubsetStrategyValue, numTrees=numTreesValue,
                                           impurity='gini', maxDepth=4, maxBins=32, seed=None)
    print (time.time()-start)

    classRF_pred = classRF_model.predict(classRF_test.map(lambda x: x.features))
    classRF_label_pred = classRF_test.map(lambda l: l.label).zip(classRF_pred)
    classRF_error = classRF_label_pred.filter(lambda lp: lp[0] != lp[1]).count() / float(classRF_test.count())
    
    print('Test Error = ' + str(classRF_error))
    print('Learned classification tree model:' + classRF_model.toDebugString())

Start off by re-creating the original Random Forest. To do that, pass <b>3</b> and <b>"auto"</b> to <b>classRF_tuner</b>.

In [None]:
classRF_tuner(3, "auto")
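
If you would rather try several settings in one go instead of calling the tuner by hand each time, a small loop over all combinations is one option. The sketch below is an assumption-laden illustration: <b>sweep</b> and <b>fake_tuner</b> are hypothetical names introduced here, and the stand-in tuner only records its arguments so that the example runs without Spark.

```python
from itertools import product

def sweep(tuner, num_trees_values, strategies):
    """Call tuner(numTrees, featureSubsetStrategy) for every combination."""
    combos = list(product(num_trees_values, strategies))
    for num_trees, strategy in combos:
        print('--- numTrees=%d, featureSubsetStrategy=%s ---' % (num_trees, strategy))
        tuner(num_trees, strategy)
    return combos

# Stand-in tuner so the sketch runs without Spark (hypothetical helper).
# In the notebook you would pass classRF_tuner instead.
calls = []

def fake_tuner(num_trees, strategy):
    calls.append((num_trees, strategy))

combos = sweep(fake_tuner, [1, 3], ["auto", "all"])
```

Keep the grid small: every combination trains a full forest, so the total run time is the sum of the individual training times.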

<a id = "nt"></a>
### <span style="color:#0b486b">6.1 numTrees Parameter</span>


Let's start by tuning the <b>numTrees</b> parameter. Begin by setting it to a lower value, such as <b>1</b>

In [None]:
classRF_tuner(1, "auto")

By setting numTrees to a value of 1, we see a slightly higher test error. Note that with numTrees equal to 1, the classifier acts as a Decision Tree, since there is only one tree in the ensemble.

Now let's try setting numTrees to a larger value, such as 180.

In [None]:
classRF_tuner(180, "auto")

With a lot more trees in the ensemble, the test error has decreased a lot! But the training time has increased substantially as well. Remember that the training time increases roughly linearly with the number of trees.
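
One way to build intuition for why more trees can reduce the error is an idealized voting calculation. The sketch below assumes a binary problem, trees that err independently of each other, and an odd number of trees so that there are no ties. None of these assumptions hold exactly for the forest in this lab (it has 10 classes, and real trees are correlated), so treat the numbers as an upper bound on the benefit, not a prediction. The name <b>majority_vote_accuracy</b> is hypothetical.

```python
from math import comb

def majority_vote_accuracy(num_trees, tree_accuracy):
    """Probability that a strict majority of independent trees is correct
    (binary labels, odd num_trees so there are no ties)."""
    needed = num_trees // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(num_trees, k) * tree_accuracy ** k * (1 - tree_accuracy) ** (num_trees - k)
        for k in range(needed, num_trees + 1)
    )

# 181 is used instead of 180 only to keep the number of voters odd.
for n in (1, 3, 181):
    print(n, round(majority_vote_accuracy(n, 0.6), 4))
```

With a single tree the ensemble is exactly as accurate as that tree, and the accuracy of the vote climbs as trees are added, while the cost of training grows with every extra tree.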

<a id = "fss"></a>

### <span style="color:#0b486b">6.2 featureSubsetStrategy Parameter</span>

Remember that the featureSubsetStrategy parameter only changes the number of features considered as split candidates at each tree node. The default is <b>"auto"</b>, which resolves to "all" when numTrees is 1, and otherwise to "sqrt" for classification or "onethird" for regression. Since our baseline uses numTrees = 3 on a classification task, "sqrt" is selected. So let's start by changing it to <b>"all"</b>, which uses all of the features.

In [None]:
classRF_tuner(3, "all")

We can see a small increase in the model's build time, which is expected since all of the features are now considered. There is also a small increase in the test error. One possible reason is that some features are not "good" predictors, and including them hurts the model. Next, we will try <b>"sqrt"</b>.

In [None]:
classRF_tuner(3, "sqrt")

These values are very similar to the "auto" run, which is expected: with numTrees set to 3, "auto" resolves to "sqrt" for featureSubsetStrategy. Small run-to-run differences can still appear because seed=None. Let's try "onethird" now, which uses one third of the features.

In [None]:
classRF_tuner(3, "onethird")

We see that the run-time is similar to the default, but the test error has decreased a little. For this particular dataset, taking one third of the features may give about the same number of features as taking their square root. Let's try the last option, which is <b>"log2"</b>.

In [None]:
classRF_tuner(3, "log2")

When using <b>"log2"</b>, there is a decrease in run-time, along with testing error!
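
To make the strategies tried above concrete, the sketch below turns each strategy name into the number of candidate features per split. It reflects my understanding of how Spark resolves these names (including rounding up), so treat the exact rounding as an assumption and check the Spark documentation for your version. The helper names <b>resolve_auto</b> and <b>features_per_node</b> and the feature count of 64 are hypothetical and are not taken from the lab's dataset.

```python
import math

def resolve_auto(strategy, num_trees, is_classification=True):
    """Map "auto" to a concrete strategy; other names pass through unchanged."""
    if strategy != "auto":
        return strategy
    if num_trees == 1:
        return "all"
    return "sqrt" if is_classification else "onethird"

def features_per_node(strategy, num_features):
    """Number of features considered as split candidates at each node."""
    if strategy == "all":
        return num_features
    if strategy == "sqrt":
        return int(math.ceil(math.sqrt(num_features)))
    if strategy == "log2":
        return max(1, int(math.ceil(math.log2(num_features))))
    if strategy == "onethird":
        return int(math.ceil(num_features / 3.0))
    raise ValueError("unknown strategy: " + strategy)

num_features = 64  # hypothetical feature count, for illustration only
for s in ("auto", "all", "sqrt", "log2", "onethird"):
    resolved = resolve_auto(s, num_trees=3)
    print(s, resolved, features_per_node(resolved, num_features))
```

For a small feature count the strategies can coincide. With 9 features, for example, both "sqrt" and "onethird" give 3 candidates, which is one way the two runs above could end up looking alike.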

---
<a id = "rfr"></a>
## <span style="color:#0b486b">7. RandomForest (Regression)</span>

Now it's time for you to try it out for yourself! Build a Regression RandomForest in a similar way that the Classification RandomForest was built. Please note that you will be using the same dataset in this section (classRF_train, classRF_test), therefore you do not need to re-initialize that section.<br> <br> 

Try to only reference the above section when you are experiencing a lot of difficulty. This section is mainly for you to apply your learning.

For some help with the variables:
<ul>
    <li><b>categoricalFeaturesInfo</b>: Has a value of <b>{}</b> (parameter doesn't require tuning)</li>
    <li><b>featureSubsetStrategy</b>: Can change these values between <b>"auto"</b>, <b>"all"</b>, <b>"sqrt"</b>, <b>"log2"</b>, and <b>"onethird"</b></li>
    <li><b>numTrees</b>: Values range from <b>1</b> to infinity <i>(baseline in this lab: 3)</i></li>
    <ul>
        <li>Note: If the value is too large, the system can run out of memory and not run.</li>
    </ul>
    <li><b>impurity</b>: For Regression, the value must be set to <b>'variance'</b> <i>(Default: 'variance')</i></li>
    <li><b>maxDepth</b>: Values range between <b>0 and 30</b> <i>(Default: 4)</i></li>
    <li><b>maxBins</b>: Value ranges between <b>2 and 2147483647</b> (largest value for 32-bits) <i>(Default: 32)</i></li>
    <li><b>seed</b>: Can be set to any value, or to a value based on system time with <i>None</i> <i>(Default: None)</i></li>
</ul>

For regression, report the <b>Mean Squared Error</b> instead of the Test Error used for classification, using the following formula and print statement: <br>
<b>regRF_MSE = regRF_label_pred.map(lambda vp: (vp[0] - vp[1])**2).sum() / float(classRF_test.count())</b> <br>
<b>print('Test Mean Squared Error = ' + str(regRF_MSE))</b>
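
The same formula can be worked through by hand on a few pairs, without Spark. This is a minimal sketch with made-up (label, prediction) values; <b>mean_squared_error</b> is a hypothetical helper name, not an MLlib function.

```python
def mean_squared_error(label_pred):
    """Average of (label - prediction)^2 over all (label, prediction) pairs."""
    squared = [(label - pred) ** 2 for label, pred in label_pred]
    return sum(squared) / float(len(label_pred))

# Made-up pairs: squared errors are 0.25, 0.0 and 1.0, so the mean is 1.25 / 3.
label_pred = [(3.0, 2.5), (1.0, 1.0), (4.0, 5.0)]
mse = mean_squared_error(label_pred)
print('Test Mean Squared Error = ' + str(mse))
```

Because the errors are squared, one prediction that is far off contributes much more than several that are slightly off, which is worth keeping in mind when comparing models.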

#### The Goal
Try to create a model that is better than the model with default values.

#### Try to beat!
With some parameter tuning, I was able to get a run-time increase of the model by ~0.9 seconds and a Test error decrease of ~2.54. Try to get a value similar to this, or better.

#### Note
We want a model that doesn't take too long to train and doesn't overfit. Remember that a very large model with high accuracy but a long run time may not be good, because the model may have overfit the data.

In [None]:
start = time.time()
regRF_model = RandomForest.trainRegressor(classRF_train, categoricalFeaturesInfo={},
                                    numTrees=14, featureSubsetStrategy="onethird",
                                    impurity='variance', maxDepth=11, maxBins=24, seed=None)
print(time.time() - start)
# Evaluate model on test instances and compute test error
regRF_pred = regRF_model.predict(classRF_test.map(lambda x: x.features))
regRF_label_pred = classRF_test.map(lambda lp: lp.label).zip(regRF_pred)
# vp is a (label, prediction) pair; indexing it works in both Python 2 and Python 3
regRF_MSE = regRF_label_pred.map(lambda vp: (vp[0] - vp[1]) ** 2).sum()/\
                                   float(classRF_test.count())
print('Test Mean Squared Error = ' + str(regRF_MSE))
print('Learned regression forest model: ' + regRF_model.toDebugString())

# 0.541887044907
# Test Mean Squared Error = 2.63255831252
# Learned regression forest model: TreeEnsembleModel regressor with 3 trees
# numTrees: 3
# featureSubsetStrategy="auto"
# Impurity: variance
# maxDepth: 4
# maxBins: 32


# 1.41001796722
# Test Mean Squared Error = 0.088487863674
# Learned regression forest model: TreeEnsembleModel regressor with 14 trees
# numTrees: 14
# featureSubsetStrategy="onethird"
# Impurity: variance
# maxDepth: 11
# maxBins: 16