# Gradient Boosting and Random Forests

In this programming assignment, you will classify geographical locations by their predicted tree cover using Gradient Boosting and Random Forest classifiers. You are expected to fill in the functions that complete this task. All necessary helper code is included in this notebook. However, we advise you to review the slides, lecture material, and corresponding notebooks before attempting this programming assignment. Information about the dataset is available at the following links:

* **Dataset:** http://archive.ics.uci.edu/ml/datasets/Covertype 

* **Dataset description:** http://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.info

##  Notebook Setup

In [1]:
# To time the entire solution
import time
start_nb = time.time()

In [2]:
import os
import sys
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

from pyspark import SparkContext
sc=SparkContext()

In [3]:
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

from pyspark.mllib.tree import GradientBoostedTrees, GradientBoostedTreesModel
from pyspark.mllib.tree import RandomForest, RandomForestModel
from pyspark.mllib.util import MLUtils

import pickle
from os.path import exists, join

%config IPCompleter.greedy=True

In [4]:
#define a dictionary of cover types
CoverTypes={1.0: 'Spruce/Fir',
            2.0: 'Lodgepole Pine',
            3.0: 'Ponderosa Pine',
            4.0: 'Cottonwood/Willow',
            5.0: 'Aspen',
            6.0: 'Douglas-fir',
            7.0: 'Krummholz' }
print('Tree Cover Types:', CoverTypes)

Tree Cover Types: {1.0: 'Spruce/Fir', 2.0: 'Lodgepole Pine', 3.0: 'Ponderosa Pine', 4.0: 'Cottonwood/Willow', 5.0: 'Aspen', 6.0: 'Douglas-fir', 7.0: 'Krummholz'}


## Collecting Data

In [7]:
# Break up features that are made out of several binary features.
def get_columns(cols_txt):
    cols=[a.strip() for a in cols_txt.split(',')]
    colDict={a:[a] for a in cols}
    colDict['Soil_Type (40 binary columns)'] = ['ST_'+str(i) for i in range(40)]
    colDict['Wilderness_Area (4 binarycolumns)'] = ['WA_'+str(i) for i in range(4)]
    columns=[]
    for item in cols:
        columns = columns + colDict[item]
    return columns

In [8]:
# Define the feature names
cols_txt="""
Elevation, Aspect, Slope, Horizontal_Distance_To_Hydrology,
Vertical_Distance_To_Hydrology, Horizontal_Distance_To_Roadways,
Hillshade_9am, Hillshade_Noon, Hillshade_3pm,
Horizontal_Distance_To_Fire_Points, Wilderness_Area (4 binarycolumns), 
Soil_Type (40 binary columns), Cover_Type
"""
columns = get_columns(cols_txt)
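
The expansion performed by `get_columns` can be illustrated with a smaller, self-contained sketch. The helper name `expand_columns` and the shortened column list are assumptions made for illustration only; they are not part of the notebook.

```python
def expand_columns(names, groups):
    """Replace each grouped name by its member columns; keep other names as-is."""
    columns = []
    for name in names:
        # A name that is not a group expands to itself.
        columns.extend(groups.get(name, [name]))
    return columns

# Hypothetical, shortened column list: one plain feature, two groups, one label.
names = ['Elevation', 'Wilderness_Area', 'Soil_Type', 'Cover_Type']
groups = {
    'Wilderness_Area': ['WA_' + str(i) for i in range(4)],
    'Soil_Type': ['ST_' + str(i) for i in range(40)],
}

cols = expand_columns(names, groups)
print(len(cols))  # 1 + 4 + 40 + 1 = 46
```

With the full list of ten quantitative features, the same expansion gives 10 + 4 + 40 + 1 = 55 columns, which matches the 54 features plus one label in each record.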

In [9]:
# Read the file into an RDD
# When using sc.textFile you need to use an absolute path.
# If doing this on a real cluster, you need the file to be available on all nodes, ideally in HDFS.
path='covtype/covtype.data'
inputRDD=sc.textFile(join('./resource/asnlib/publicdata', path))

## Helper Functions
Here are some helper functions that you will have to fill in.

### label_RDD

#### Task:

Finish the `label_RDD` function. It takes an RDD as input and returns an RDD of labeled points.


Input: 

- `inputRDD`: RDD in which each element is a string of comma-separated values

Output: 

- RDD of [`LabeledPoint`](https://spark.apache.org/docs/2.2.1/api/python/pyspark.mllib.html#pyspark.mllib.regression.LabeledPoint) objects, where the label is the last value of each input string and the features are a `DenseVector` containing all the remaining values.

---

**<font color="magenta" size=2>Example Input</font>**
``` python
'2596,51,3,258,0,510,221,232,148,6279,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,5'
```

**<font color="blue" size=2>Example Output</font>**
``` python
LabeledPoint(5.0, [2596.0,51.0,3.0,258.0,0.0,510.0,221.0,232.0,148.0,6279.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0])
```
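
The parsing step can be tried without Spark. The following is a minimal plain-Python sketch, assuming a hypothetical helper `parse_line` and a simplified sample record in which all 44 binary columns are zero; in the notebook, the same logic would run inside `inputRDD.map(...)` and wrap the result in a `LabeledPoint`.

```python
def parse_line(line):
    """Split one comma-separated record into (label, features)."""
    values = [float(v) for v in line.split(',')]
    # The last field is the cover type; everything before it is a feature.
    return values[-1], values[:-1]

# Simplified sample: 10 quantitative features, 44 binary zeros, then the label.
sample = '2596,51,3,258,0,510,221,232,148,6279,' + ','.join(['0'] * 44) + ',5'

label, features = parse_line(sample)
print(label, len(features))  # 5.0 54
```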

In [10]:
## Insert your answer in this cell. DO NOT CHANGE THE NAME OF THE FUNCTION.
def label_RDD(inputRDD):
    ###
    ### YOUR CODE HERE
    ###
    # Parse each comma-separated line into floats.
    data = inputRDD.map(lambda line: [float(v) for v in line.split(',')])
    # The last value is the label; the remaining values are the features.
    data = data.map(lambda vals: LabeledPoint(vals[-1], Vectors.dense(vals[:-1])))

    return data

In [11]:
Data = label_RDD(inputRDD)
Data.cache()

PythonRDD[2] at RDD at PythonRDD.scala:48

In [12]:
assert Data.first().label == 5.0
assert Data.first().features == Vectors.dense([2596.0, 51.0, 3.0, 258.0, 0.0, 510.0, 221.0, 232.0, 148.0, 6279.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])

### count_examples

#### Task:

Finish the `count_examples` function. It takes an RDD as input and returns the number of examples belonging to each class.

Input: 

- `Data`: RDD obtained as the output of `label_RDD`

Output: 

- list of tuples (label, count)

**NOTE: The outputs need to be sorted in descending order by counts.**

---

**<font color="magenta" size=2>Example Input</font>**
``` python
[LabeledPoint(5.0, [2596.0,51.0,3.0,258.0,0.0,510.0,221.0,232.0,148.0,6279.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]),
 LabeledPoint(5.0, [2590.0,56.0,2.0,212.0,-6.0,390.0,220.0,235.0,151.0,6225.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]),
 LabeledPoint(2.0, [2804.0,139.0,9.0,268.0,65.0,3180.0,234.0,238.0,135.0,6121.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0])]
```

**<font color="blue" size=2>Example Output</font>**
``` python
[(5.0, 2), (2.0, 1)]
```
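
The required output format can be reproduced in plain Python for a quick sanity check. This is only a sketch with a hypothetical helper `count_labels` that operates on a list of labels; the Spark version would count with `map` and `reduceByKey` before sorting.

```python
from collections import Counter

def count_labels(labels):
    """Return (label, count) tuples sorted by descending count."""
    counts = Counter(labels)
    return sorted(counts.items(), key=lambda kv: -kv[1])

print(count_labels([5.0, 5.0, 2.0]))  # [(5.0, 2), (2.0, 1)]
```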

In [13]:
## Insert your answer in this cell. DO NOT CHANGE THE NAME OF THE FUNCTION.
def count_examples(Data):
    ###
    ### YOUR CODE HERE
    ###
    
    
    reduced_data = Data.map(lambda x: (x.label, 1)).reduceByKey(lambda x,y: x+y)
    count = sorted(reduced_data.collect(), key=lambda x: (-x[1]))
    
    return count

In [14]:
counts = count_examples(Data)

In [15]:
counts

[(2.0, 283301),
 (1.0, 211840),
 (3.0, 35754),
 (7.0, 20510),
 (6.0, 17367),
 (5.0, 9493),
 (4.0, 2747)]

In [16]:
counts3 = count_examples(sc.parallelize(Data.take(3)))

In [17]:
counts3

[(5.0, 2), (2.0, 1)]

In [18]:
assert type(counts3) == list, 'Incorrect return type'
assert type(counts3[0]) == tuple, 'Incorrect return type'
assert type(counts3[0][0]) == float, 'Incorrect return type'
assert type(counts3[0][1]) == int, 'Incorrect return type'

In [19]:
assert counts3[0][0] == 5.0, 'Incorrect return value'
assert counts3[0][1] == 2, 'Incorrect return value'

In [20]:
# Hidden Tests Here
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [21]:
total=Data.count()
print('total data size=',total)
print('              type (label):   percent of total')
print('---------------------------------------------------------')
print('\n'.join(['%20s (%3.1f):\t%4.2f'%(CoverTypes[a[0]],a[0],100.0*a[1]/float(total)) for a in counts]))

total data size= 581012
              type (label):   percent of total
---------------------------------------------------------
      Lodgepole Pine (2.0):	48.76
          Spruce/Fir (1.0):	36.46
      Ponderosa Pine (3.0):	6.15
           Krummholz (7.0):	3.53
         Douglas-fir (6.0):	2.99
               Aspen (5.0):	1.63
   Cottonwood/Willow (4.0):	0.47


### labels_to_binary (Making the problem binary)

The implementation of `GradientBoostedTrees` in MLlib supports only binary problems, but the `CovType` problem has 7 classes. To make the problem binary, we single out `Lodgepole Pine` (label = 2.0), the most frequent class. We therefore transform the dataset into a new dataset where the label is `1.0` if the class is `Lodgepole Pine` and `0.0` otherwise.

#### Task:

Finish the `labels_to_binary` function. It takes an RDD as input and returns an RDD with binary labels such that: 

```python
if label == 2:      # Since label 2 has the highest count value
    new_label = 1
    
else:
    new_label = 0
```

Input: 

- `Data`: Labelled RDD (Output from `label_RDD` function)

Output: 

- The same RDD with the label of every entry set to 0.0, except entries with label 2.0, whose label becomes 1.0

---

**<font color="magenta" size=2>Example Input</font>**
``` python
LabeledPoint(5.0, [2596.0,51.0,3.0,258.0,0.0,510.0,221.0,232.0,148.0,6279.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0])
```
**<font color="blue" size=2>Example Output</font>**
``` python
LabeledPoint(0.0, [2596.0,51.0,3.0,258.0,0.0,510.0,221.0,232.0,148.0,6279.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0])
```
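
A plain-Python sketch of the relabeling, together with the class balance it produces, is given below. The helper name `to_binary` is an assumption for illustration; the per-class counts are the ones printed by `count_examples` earlier in this notebook.

```python
def to_binary(label, positive=2.0):
    """Map the chosen positive class to 1.0 and every other class to 0.0."""
    return 1.0 if label == positive else 0.0

# Per-class counts reported earlier in the notebook for the full dataset.
class_counts = {2.0: 283301, 1.0: 211840, 3.0: 35754, 7.0: 20510,
                6.0: 17367, 5.0: 9493, 4.0: 2747}

total = sum(class_counts.values())
positives = sum(n for label, n in class_counts.items() if to_binary(label) == 1.0)
pos_fraction = positives / float(total)

print(total)                            # 581012
print(round(100.0 * pos_fraction, 2))   # 48.76
```

Because `Lodgepole Pine` accounts for just under half of the examples, the resulting binary problem is close to balanced.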

In [22]:
## Insert your answer in this cell. DO NOT CHANGE THE NAME OF THE FUNCTION.
def labels_to_binary(Data):
    ###
    ### YOUR CODE HERE
    ###
    # Lodgepole Pine (label 2.0) becomes the positive class (1.0);
    # every other class becomes the negative class (0.0).
    Data = Data.map(lambda lp: LabeledPoint(1.0 if lp.label == 2.0 else 0.0, lp.features))

    return Data

In [23]:
Data = labels_to_binary(Data)

In [24]:
Data.first()

LabeledPoint(0.0, [2596.0,51.0,3.0,258.0,0.0,510.0,221.0,232.0,148.0,6279.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0])

In [25]:
assert Data.first().label == 0.0

## Reducing data size
For this assignment, we will use only 10% of the original data.

In [26]:
trainingData = sc.parallelize(pickle.load(open(join('./resource/asnlib/publicdata', 'training10p.pkl'), 'rb')))
testData = sc.parallelize(pickle.load(open(join('./resource/asnlib/publicdata', 'test10p.pkl'), 'rb')))

In [27]:
print('Sizes: Data1=%d, trainingData=%d, testData=%d'%(trainingData.cache().count() + testData.cache().count(),trainingData.cache().count(),testData.cache().count()))

Sizes: Data1=58100, trainingData=40682, testData=17418


In [28]:
counts = count_examples(testData)

## Training classifiers

We will train classifiers using the Gradient Boosted Trees and Random Forest implementations in the `pyspark.mllib` package and evaluate their performance. 

You can follow the [example here](http://spark.apache.org/docs/2.2.1/mllib-ensembles.html#classification) from the mllib documentation if you don't know how to start.

### Gradient Boosted Trees

PySpark has a built-in implementation of Gradient Boosted Trees. See [`trainClassifier`](http://spark.apache.org/docs/2.2.1/api/python/pyspark.mllib.html#pyspark.mllib.tree.GradientBoostedTrees) for how to train it on a dataset and [`predict`](https://spark.apache.org/docs/2.2.1/api/python/pyspark.mllib.html#pyspark.mllib.tree.GradientBoostedTreesModel.predict) for how to predict the labels for a dataset.

#### Task:

Finish the `Classify_GB` function. It trains a GradientBoostedTrees classifier, whose trees have a maximum depth of `maxDepth`, on the training data for 10 iterations and returns the error on the test data.

Input: 

- `trainingData` (RDD): Training data
- `testData` (RDD): Test data
- `maxDepth` (int): Maximum depth of each tree

Output:

- error (float)


**Hint:**

- Use `categoricalFeaturesInfo={}` for `trainClassifier`.
- Use default parameters for `trainClassifier` unless specified otherwise. 
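
The error to return is the fraction of test examples whose predicted label differs from the true label. A minimal plain-Python sketch of that computation follows; `error_rate` is a hypothetical helper name, and in the notebook the labels and predictions would come from RDDs rather than lists.

```python
def error_rate(labels, predictions):
    """Fraction of positions where the prediction differs from the label."""
    if len(labels) != len(predictions):
        raise ValueError('labels and predictions must have the same length')
    wrong = sum(1 for y, p in zip(labels, predictions) if y != p)
    return wrong / float(len(labels))

# One of the four predictions is wrong.
print(error_rate([1.0, 0.0, 1.0, 0.0], [1.0, 1.0, 1.0, 0.0]))  # 0.25
```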

In [63]:
## Insert your answer in this cell. DO NOT CHANGE THE NAME OF THE FUNCTION.
def Classify_GB(trainingData, testData, maxDepth):
    ###
    ### YOUR CODE HERE
    ###

    model = GradientBoostedTrees.trainClassifier(trainingData, maxDepth=maxDepth, categoricalFeaturesInfo={}, numIterations=10)
    
    predictions = model.predict(testData.map(lambda fnc: fnc.features))
    
    predictions_with_labels = testData.map(lambda fnc: fnc.label).zip(predictions)
    
    test_error = predictions_with_labels.filter(lambda fnc: fnc[0] != fnc[1]).count() / float(testData.count())
    
    return test_error


In [64]:
visible_results=pickle.load(open(join('./resource/asnlib/publicdata', 'GradientBoostingResultsVisible.pkl'),'rb'))
assert Classify_GB(trainingData, testData, 1) <= visible_results['B_10p_1'] 

In [65]:
assert Classify_GB(trainingData, testData, 3) <= visible_results['B_10p_3']

In [66]:
#Hidden Tests here
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [67]:
#Hidden Tests here
###
### AUTOGRADER TEST - DO NOT REMOVE
###


### Random Forests

PySpark has a built-in implementation of Random Forests. See [`trainClassifier`](http://spark.apache.org/docs/2.2.1/api/python/pyspark.mllib.html#pyspark.mllib.tree.RandomForest) for how to train it on a dataset and [`predict`](https://spark.apache.org/docs/2.2.1/api/python/pyspark.mllib.html#pyspark.mllib.tree.RandomForestModel.predict) for how to predict the labels for a dataset.

#### Task:

Finish the `Classify_RF` function. It trains a RandomForest classifier that has 10 trees, each with a maximum depth of `maxDepth`, on the training data and returns the error on the test data.

Input:

- `trainingData` (RDD): Training data
- `testData` (RDD): Test data
- `maxDepth` (int): Maximum depth of each tree

Output: 

- error (float)


**Hint:**

- Don't forget to manually set `numClasses` for `trainClassifier`.
- Use `categoricalFeaturesInfo={}` for `trainClassifier`.
- Use default parameters for `trainClassifier` unless specified otherwise. 
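
Conceptually, a random forest classifier combines its trees by majority vote. The toy sketch below illustrates only that voting step: the three "trees" are hypothetical threshold rules on a single feature, chosen for illustration, and this is not how Spark trains or stores its trees.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label that receives the most votes."""
    return Counter(votes).most_common(1)[0][0]

# Three hypothetical stand-in "trees": simple threshold rules on feature 0.
trees = [
    lambda f: 1.0 if f[0] > 2500 else 0.0,
    lambda f: 1.0 if f[0] > 2700 else 0.0,
    lambda f: 1.0 if f[0] > 3000 else 0.0,
]

def forest_predict(features):
    """Collect one vote per tree and return the majority label."""
    return majority_vote([tree(features) for tree in trees])

print(forest_predict([2800.0]))  # votes 1, 1, 0 -> 1.0
print(forest_predict([2600.0]))  # votes 1, 0, 0 -> 0.0
```

An odd number of trees is used here so that a two-class vote cannot tie.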

In [68]:
## Insert your answer in this cell. DO NOT CHANGE THE NAME OF THE FUNCTION.
def Classify_RF(trainingData, testData, maxDepth):    
    ###
    ### YOUR CODE HERE
    ###
    
    modelRF = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     numTrees=10, featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=maxDepth, maxBins=32)
    predictions = modelRF.predict(testData.map(lambda x: x.features))
    labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
    testErr = labelsAndPredictions.filter(lambda lp: lp[0] != lp[1]).count() / float(testData.count())
    
    return testErr


In [69]:
visible_results_rf=pickle.load(open(join('./resource/asnlib/publicdata', 'RandomForestResultsVisible.pkl'),'rb'))
assert Classify_RF(trainingData, testData, 3) <= visible_results_rf['RF_10p_3'] 

In [70]:
assert Classify_RF(trainingData, testData, 6) <= visible_results_rf['RF_10p_6']

In [71]:
#Hidden Tests here
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [72]:
#Hidden Tests here
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [73]:
end_nb = time.time()
print("Total time taken: ", end_nb - start_nb)

Total time taken:  4381.301802396774
