#  CSE 255 Programming Assignment 7

##  Problem Statement

In this programming assignment, your task is to classify geographical locations according to their predicted tree cover using Gradient Boosting and Random Forest classifiers. You are expected to fill in functions that would complete this task. All of the necessary helper code is included in this notebook. However, we advise you to go over the lecture material, the EdX videos and the corresponding notebooks before you attempt this Programming Assignment. You can find information about the dataset to be used in the following links:

* **Dataset:** http://archive.ics.uci.edu/ml/datasets/Covertype 

* **Dataset description:** http://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.info

In [1]:
from pyspark import SparkContext
sc=SparkContext()

In [37]:
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

from pyspark.mllib.tree import GradientBoostedTrees, GradientBoostedTreesModel
from pyspark.mllib.tree import RandomForest, RandomForestModel
from pyspark.mllib.util import MLUtils

import os
from os.path import exists

In [3]:
#define a dictionary of cover types
CoverTypes={1.0: 'Spruce/Fir',
            2.0: 'Lodgepole Pine',
            3.0: 'Ponderosa Pine',
            4.0: 'Cottonwood/Willow',
            5.0: 'Aspen',
            6.0: 'Douglas-fir',
            7.0: 'Krummholz' }
print('Tree Cover Types:', CoverTypes)

Tree Cover Types: {1.0: 'Spruce/Fir', 2.0: 'Lodgepole Pine', 3.0: 'Ponderosa Pine', 4.0: 'Cottonwood/Willow', 5.0: 'Aspen', 6.0: 'Douglas-fir', 7.0: 'Krummholz'}


In [4]:
def get_data(file_path):
    %cd $file_path
    if not exists('covtype'):
        print("creating directory covtype")
        !mkdir covtype
    %cd covtype
    if not exists('covtype.data'):
        if not exists('covtype.data.gz'):
            print('downloading covtype.data.gz')
            !curl -O http://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.data.gz
        print('decompressing covtype.data.gz')
        !gunzip -f covtype.data.gz
    !ls -l

In [5]:
get_data("/home/jovyan/work/notebooks/HW2018/source/ps6/")  # Change this path as necessary


[Errno 2] No such file or directory: '/home/jovyan/work/notebooks/HW2018/source/ps6/'
/home/jovyan/work/HW/HW7
/home/jovyan/work/HW/HW7/covtype
total 85800
-rw-r--r-- 1 jovyan users     1264 May 20 22:42 BoostedTreesResults.pkl
drwxr-xr-x 4 jovyan users      128 May 21 04:26 covtype
-rw-r--r-- 1 jovyan users 75169317 May 17 18:35 covtype.data
-rw-r--r-- 1 jovyan users      293 May 20 22:47 GradientBoostingResults.pkl


In [6]:
# Break up features that are made out of several binary features.
def get_columns(cols_txt):
    cols=[a.strip() for a in cols_txt.split(',')]
    colDict={a:[a] for a in cols}
    colDict['Soil_Type (40 binary columns)'] = ['ST_'+str(i) for i in range(40)]
    colDict['Wilderness_Area (4 binarycolumns)'] = ['WA_'+str(i) for i in range(4)]
    columns=[]
    for item in cols:
        columns = columns + colDict[item]
    return columns
    #print(columns)

In [7]:
# Define the feature names
cols_txt="""
Elevation, Aspect, Slope, Horizontal_Distance_To_Hydrology,
Vertical_Distance_To_Hydrology, Horizontal_Distance_To_Roadways,
Hillshade_9am, Hillshade_Noon, Hillshade_3pm,
Horizontal_Distance_To_Fire_Points, Wilderness_Area (4 binarycolumns), 
Soil_Type (40 binary columns), Cover_Type
"""
columns = get_columns(cols_txt)

In [8]:
print(columns)

['Elevation', 'Aspect', 'Slope', 'Horizontal_Distance_To_Hydrology', 'Vertical_Distance_To_Hydrology', 'Horizontal_Distance_To_Roadways', 'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm', 'Horizontal_Distance_To_Fire_Points', 'WA_0', 'WA_1', 'WA_2', 'WA_3', 'ST_0', 'ST_1', 'ST_2', 'ST_3', 'ST_4', 'ST_5', 'ST_6', 'ST_7', 'ST_8', 'ST_9', 'ST_10', 'ST_11', 'ST_12', 'ST_13', 'ST_14', 'ST_15', 'ST_16', 'ST_17', 'ST_18', 'ST_19', 'ST_20', 'ST_21', 'ST_22', 'ST_23', 'ST_24', 'ST_25', 'ST_26', 'ST_27', 'ST_28', 'ST_29', 'ST_30', 'ST_31', 'ST_32', 'ST_33', 'ST_34', 'ST_35', 'ST_36', 'ST_37', 'ST_38', 'ST_39', 'Cover_Type']


In [9]:
# Have a look at the first line of the data file
!head -1 covtype.data

2596,51,3,258,0,510,221,232,148,6279,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,5


In [10]:
# Read the file into an RDD
# sc.textFile resolves a relative path against the current working directory; an absolute path is safer.
# If doing this on a real cluster, you need the file to be available on all nodes, ideally in HDFS.
path='covtype/covtype.data'
inputRDD=sc.textFile(path)

In [11]:
## Insert your answer in this cell. DO NOT CHANGE THE NAME OF THE FUNCTION.
def label_RDD(inputRDD):
    '''
    Transform the text RDD into an RDD of Labeled Points
    
    Input: inputRDD 
    type: RDD
    
    Returns: Data
    type: RDD 
    '''
    ### BEGIN SOLUTION
    
    Data=inputRDD.map(lambda line: [float(x.strip()) for x in line.split(',')])\
     .map(lambda V:LabeledPoint(V[-1],V[:-1]))
        
    ### END SOLUTION
    
    return Data

In [12]:
Data = label_RDD(inputRDD)

In [13]:
%config IPCompleter.greedy=True
Data.first()

LabeledPoint(5.0, [2596.0,51.0,3.0,258.0,0.0,510.0,221.0,232.0,148.0,6279.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0])
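The parsing step inside `label_RDD` can be tried without Spark. The following is a minimal plain-Python sketch of the same idea: split a record on commas, convert every field to a float, and treat the last value as the label. The helper name `parse_line` and the shortened record are hypothetical and used only for illustration.

```python
def parse_line(line):
    """Split one comma-separated record into (label, features)."""
    values = [float(x.strip()) for x in line.split(',')]
    # The last column is the label; everything before it is a feature.
    return values[-1], values[:-1]

# Hypothetical shortened record: three feature values followed by the label.
label, features = parse_line("2596, 51, 3, 5")
print(label, features)
```

In the Spark version the same function body runs inside `map`, and the pair is wrapped in a `LabeledPoint`.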

In [14]:
## Insert your answer in this cell. DO NOT CHANGE THE NAME OF THE FUNCTION.
def count_examples(Data):
    '''
    Count the number of examples of each type
    
    Input: Data
    type: RDD 
    
    Returns: counts
    type: list of tuples, where each tuple is (covertype(float), count(int))
    '''
    ### BEGIN SOLUTION
    
    counts=Data.map(lambda lp:(lp.label,1)).reduceByKey(lambda x,y:x+y).collect()
    
    ### END SOLUTION
    counts.sort(key=lambda x:x[1],reverse=True)
    return counts

In [15]:
counts = count_examples(Data)

In [16]:
assert type(counts) == list, 'Incorrect return type'
assert type(counts[0]) == tuple, 'Incorrect return type'
assert type(counts[0][0]) == float, 'Incorrect return type'
assert type(counts[0][1]) == int, 'Incorrect return type'
assert counts[0][0] == 2, 'Incorrect return value'
assert counts[0][1] == 283301, 'Incorrect return value'

In [17]:
total=Data.cache().count()
print('total data size=',total)
print('              type (label):   percent of total')
print('---------------------------------------------------------')
print('\n'.join(['%20s (%3.1f):\t%4.2f'%(CoverTypes[a[0]],a[0],100.0*a[1]/float(total)) for a in counts]))

total data size= 581012
              type (label):   percent of total
---------------------------------------------------------
      Lodgepole Pine (2.0):	48.76
          Spruce/Fir (1.0):	36.46
      Ponderosa Pine (3.0):	6.15
           Krummholz (7.0):	3.53
         Douglas-fir (6.0):	2.99
               Aspen (5.0):	1.63
   Cottonwood/Willow (4.0):	0.47
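The `map` / `reduceByKey` pattern used in `count_examples` is a distributed version of an ordinary frequency count. A minimal local sketch, assuming a small made-up list of labels (not taken from the dataset) and a hypothetical helper `count_labels`:

```python
from collections import Counter

def count_labels(labels):
    """Return (label, count) tuples sorted by descending count."""
    return sorted(Counter(labels).items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical toy labels, used only to illustrate the counting step.
toy_labels = [2.0, 1.0, 2.0, 3.0, 2.0, 1.0]
counts_toy = count_labels(toy_labels)
total_toy = len(toy_labels)
for label, n in counts_toy:
    print('%3.1f: %5.2f%%' % (label, 100.0 * n / total_toy))
```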


### Making the problem binary

The implementation of `GradientBoostedTrees` in MLlib supports only binary problems, whereas the `CovType` problem has 7 classes. To make the problem binary we single out `Lodgepole Pine` (label = 2.0): we transform the dataset into a new one in which the label is `1.0` if the class is `Lodgepole Pine` and `0.0` otherwise.
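The relabeling rule can be sketched in plain Python as follows. The helper `to_binary` and the toy records (label, feature list) are hypothetical; the feature values are made up for illustration.

```python
TARGET_LABEL = 2.0  # Lodgepole Pine

def to_binary(label, target=TARGET_LABEL):
    """Map the target class to 1.0 and every other class to 0.0."""
    return 1.0 if label == target else 0.0

# Hypothetical toy records: (original label, features).
toy = [(5.0, [2596.0, 51.0]), (2.0, [2590.0, 56.0]), (1.0, [2804.0, 139.0])]
binary = [(to_binary(lbl), feats) for lbl, feats in toy]
print(binary)
```

Only the label changes; the feature vectors are carried over untouched.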

In [18]:
## Insert your answer in this cell. DO NOT CHANGE THE NAME OF THE FUNCTION.
def labels_to_binary(Data):
    '''
    Transform the dataset to a new dataset 
    such that new label is 1 if current label is 2, else new label is 0
    
    Input: Data
    type: RDD
    
    Returns: Data
    type: RDD
    '''
    ### BEGIN SOLUTION
    
    Label=2.0  # Lodgepole Pine
    # Relabel the LabeledPoints passed in as Data rather than re-parsing the global inputRDD
    Data=Data.map(lambda lp: LabeledPoint(1.0*(lp.label==Label), lp.features))
    
    ### END SOLUTION
    
    return Data

In [19]:
Data = labels_to_binary(Data)
Data.first()

LabeledPoint(0.0, [2596.0,51.0,3.0,258.0,0.0,510.0,221.0,232.0,148.0,6279.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0])

### Reducing data size
In order to see the effects of overfitting more clearly, we reduce the size of the data by a factor of 10.
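As a rough local illustration of sampling without replacement followed by a 70/30 split, the sketch below keeps each record independently with a given probability and then assigns each kept record to one of two parts. This is an assumption about the general technique, not a description of Spark's internals; the helper names and the integer "records" are hypothetical.

```python
import random

def bernoulli_sample(items, fraction, rng):
    """Keep each item independently with probability `fraction`."""
    return [x for x in items if rng.random() < fraction]

def random_split(items, train_fraction, rng):
    """Assign each item to train with probability `train_fraction`, else to test."""
    train, test = [], []
    for x in items:
        (train if rng.random() < train_fraction else test).append(x)
    return train, test

rng = random.Random(0)          # fixed seed so the run is repeatable
records = list(range(10000))    # hypothetical stand-in for the data records
sample = bernoulli_sample(records, 0.1, rng)
train, test = random_split(sample, 0.7, rng)
print(len(sample), len(train), len(test))
```

Because every record is drawn independently, the resulting sizes are only approximately 10% and 70/30 of their inputs.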

In [20]:
Data1=Data.sample(False,0.1).cache()
(trainingData,testData)=Data1.randomSplit([0.7,0.3])
(trainingData_all, testData_all)=Data.randomSplit([0.7, 0.3])

import pickle
pickle.dump(trainingData.collect(), open('training10p.pkl', 'wb'))
pickle.dump(testData.collect(), open('test10p.pkl', 'wb'))
pickle.dump(trainingData_all.collect(), open('training_all.pkl', 'wb'))
pickle.dump(testData_all.collect(), open('test_all.pkl', 'wb'))

In [27]:
trainingData = sc.parallelize(pickle.load(open('training10p.pkl', 'rb')))
testData = sc.parallelize(pickle.load(open('test10p.pkl', 'rb')))
trainingData_all = sc.parallelize(pickle.load(open('training_all.pkl', 'rb')))
testData_all = sc.parallelize(pickle.load(open('test_all.pkl', 'rb')))

In [28]:
print('Sizes: Data1=%d, trainingData=%d, testData=%d'%(Data1.count(),trainingData.cache().count(),testData.cache().count()))

Sizes: Data1=57963, trainingData=40758, testData=17205


In [29]:
counts=testData.map(lambda lp:(lp.label,1)).reduceByKey(lambda x,y:x+y).collect()
counts.sort(key=lambda x:x[1],reverse=True)
counts

[(0.0, 8749), (1.0, 8456)]

### Gradient Boosted Trees

* Following [this example](http://spark.apache.org/docs/latest/mllib-ensembles.html#classification) from the mllib documentation

* [pyspark.mllib.tree documentation](http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#module-pyspark.mllib.tree)

#### Main classes and methods

* `GradientBoostedTrees` is the class that implements the learning algorithm.
   * Its main method is `trainClassifier(trainingData)`, which takes a training set as input and returns an instance of `GradientBoostedTreesModel`.
   * The main parameters of `trainClassifier` are:
      * **data** – Training dataset: RDD of LabeledPoint. Labels should take values {0, 1}.
      * **categoricalFeaturesInfo** – Map storing arity of categorical features. E.g., an entry (n -> k) indicates that feature n is categorical with k categories indexed from 0: {0, 1, ..., k-1}.
      * **loss** – Loss function used for minimization during gradient boosting. Supported: {“logLoss” (default), “leastSquaresError”, “leastAbsoluteError”}.
      * **numIterations** – Number of iterations of boosting. (default: 100)
      * **learningRate** – Learning rate for shrinking the contribution of each estimator. The learning rate should be in the interval (0, 1]. (default: 0.1)
      * **maxDepth** – Maximum depth of the tree. E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. (default: 3)
      * **maxBins** – Maximum number of bins used for splitting features. DecisionTree requires maxBins >= max categories. (default: 32)
      
      
* `GradientBoostedTreesModel` represents the output of the boosting process: a linear combination of classification trees. The methods supported by this class are:
   * `save(sc, path)` : save the model to the given path; `sc` is the SparkContext.
   * `load(sc, path)` : the counterpart to `save`; load a classifier from the given path.
   * `predict(X)` : predict on a single datapoint (the `.features` field of a `LabeledPoint`) or on an RDD of datapoints.
   * `toDebugString()` : print the classifier in a human readable format.
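To make "a linear combination of classification trees" concrete, here is a simplified plain-Python sketch. It assumes each tree returns a real-valued score, the scores are combined with per-tree weights, and the sign of the sum decides the class. The stumps, thresholds, weights, and the zero cut-off are all hypothetical choices for illustration and are not read from a trained MLlib model.

```python
def stump(feature_index, threshold, left_value, right_value):
    """Build a depth-1 tree: one comparison on a single feature."""
    def predict(x):
        return left_value if x[feature_index] <= threshold else right_value
    return predict

def ensemble_predict(trees, weights, x):
    """Weighted sum of tree outputs, thresholded at zero (assumed rule)."""
    score = sum(w * t(x) for t, w in zip(trees, weights))
    return 1.0 if score > 0 else 0.0

# Two hypothetical stumps on features 0 and 1, with hypothetical weights.
trees = [stump(0, 3000.0, 1.0, -1.0), stump(1, 10.0, -0.5, 0.5)]
weights = [1.0, 0.1]
print(ensemble_predict(trees, weights, [2596.0, 3.0]))
print(ensemble_predict(trees, weights, [3200.0, 20.0]))
```

A small weight on later trees mirrors the role of `learningRate`: each additional tree nudges the combined score rather than overriding it.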

In [45]:
## Insert your answer in this cell. DO NOT CHANGE THE NAME OF THE FUNCTION.
def Classify_GB(trainingData, testData, maxDepth):
    '''
    Train and test GradientBoostedTrees classifier using given training and test data
    Repeat the procedure with different number of training iterations
    
    Input: trainingData, testData
    type: RDD, RDD
    
    Returns: errors
    type: dict with key = number of iterations, value = train & test error
    
    example output:
    errors = {1:{'train':0.2, 'test':0.3}, 3:{'train':0.15, 'test':0.16}}
    '''
    
    ### BEGIN SOLUTION
    
    from time import time
    errors={}
    start=time()
    for i in [10,20]:  # numbers of boosting iterations to try
        model=GradientBoostedTrees.trainClassifier(trainingData,categoricalFeaturesInfo={},
                                                   numIterations=i,maxDepth=maxDepth)
        #print(model.toDebugString())
        errors[i]={}
        dataSets={'test':testData}
        for name in list(dataSets.keys()):  # Calculate the error on each listed data set
            data=dataSets[name]
            Predicted=model.predict(data.map(lambda x: x.features))
            LabelsAndPredictions=data.map(lambda lp: lp.label).zip(Predicted)
            # Fraction of examples whose predicted label differs from the true label
            Err = LabelsAndPredictions.filter(lambda v_p:v_p[0] != v_p[1]).count()/float(data.count())
            errors[i][name]=Err
        print(i,errors[i],int(time()-start),'seconds')
    print(errors)
            
    ### END SOLUTION
    
    return errors

In [46]:
#Train with 10% of the dataset
B_10p_1 = Classify_GB(trainingData, testData, 1)
B_10p_3 = Classify_GB(trainingData, testData, 3)
B_10p_6 = Classify_GB(trainingData, testData, 6)
B_10p_10 = Classify_GB(trainingData, testData, 10)
#Store Results
Results={ID:globals()[ID] for ID in ['B_10p_1','B_10p_3', 'B_10p_6', 'B_10p_10']}
import pickle
pickle.dump(Results, open('GradientBoostingResults.pkl', 'wb'))

10 {'test': 0.27829119442022665} 13 seconds
20 {'test': 0.27335077012496367} 23 seconds
{10: {'test': 0.27829119442022665}, 20: {'test': 0.27335077012496367}}
10 {'test': 0.25510026155187443} 17 seconds
20 {'test': 0.23940714908456845} 32 seconds
{10: {'test': 0.25510026155187443}, 20: {'test': 0.23940714908456845}}
10 {'test': 0.2122638767800058} 24 seconds
20 {'test': 0.202615518744551} 48 seconds
{10: {'test': 0.2122638767800058}, 20: {'test': 0.202615518744551}}
10 {'test': 0.17210113339145597} 44 seconds
20 {'test': 0.15948852077884335} 74 seconds
{10: {'test': 0.17210113339145597}, 20: {'test': 0.15948852077884335}}
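The error values above are misclassification rates: the fraction of examples whose predicted label differs from the true label. A minimal local sketch of that computation follows, using a hypothetical helper `error_rate` and made-up (label, prediction) pairs. For context it also computes the error of always predicting the majority class, using the class counts printed for `testData` earlier.

```python
def error_rate(labels_and_predictions):
    """Fraction of (label, prediction) pairs that disagree."""
    pairs = list(labels_and_predictions)
    wrong = sum(1 for label, pred in pairs if label != pred)
    return wrong / float(len(pairs))

# Hypothetical pairs: 2 of the 5 predictions are wrong.
pairs = [(1.0, 1.0), (0.0, 1.0), (0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
print(error_rate(pairs))

# Majority-class baseline from the testData counts [(0.0, 8749), (1.0, 8456)]:
# always predicting 0.0 misclassifies every example labeled 1.0.
baseline = 8456 / float(8749 + 8456)
print(round(baseline, 4))
```

A classifier is only useful here if its test error is clearly below this baseline of roughly 0.49.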


In [33]:
Results=pickle.load(open('GradientBoostingResults.pkl','rb'))

### Random Forests

* Following [this example](http://spark.apache.org/docs/latest/mllib-ensembles.html#classification) from the mllib documentation

* [pyspark.mllib.tree.RandomForest documentation](http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.tree.RandomForest)

**trainClassifier**`(data, numClasses, categoricalFeaturesInfo, numTrees, featureSubsetStrategy='auto', impurity='gini', maxDepth=4, maxBins=32, seed=None)`   
Method to train a random forest model for binary or multiclass classification.

**Parameters:**  
* *data* – Training dataset: RDD of LabeledPoint. Labels should take values {0, 1, ..., numClasses-1}.  
* *numClasses* – number of classes for classification.  
* *categoricalFeaturesInfo* – Map storing arity of categorical features. E.g., an entry (n -> k) indicates that feature n is categorical with k categories indexed from 0: {0, 1, ..., k-1}.  
* *numTrees* – Number of trees in the random forest.  
* *featureSubsetStrategy* – Number of features to consider for splits at each node. Supported: “auto” (default), “all”, “sqrt”, “log2”, “onethird”. If “auto” is set, this parameter is set based on numTrees: if numTrees == 1, set to “all”; if numTrees > 1 (forest) set to “sqrt”.
* *impurity* – Criterion used for information gain calculation. Supported values: “gini” (recommended) or “entropy”.  
* *maxDepth* – Maximum depth of the tree. E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. (default: 4)  
* *maxBins* – maximum number of bins used for splitting features (default: 32)
* *seed* – Random seed for bootstrapping and choosing feature subsets.  

**Returns:**	
RandomForestModel that can be used for prediction
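Two of the parameters above can be illustrated with small calculations. The sketch below resolves `featureSubsetStrategy` to a number of candidate features per split for the 54 features in this dataset, and computes the Gini impurity of a node from its class counts. The rounding (ceiling) is an assumption made for this sketch, and the helper names `features_per_split` and `gini` are hypothetical.

```python
import math

def features_per_split(strategy, num_features, num_trees):
    """Number of features considered at each node (rounding is assumed)."""
    if strategy == 'auto':
        # Per the parameter description: 'all' for one tree, 'sqrt' for a forest.
        strategy = 'all' if num_trees == 1 else 'sqrt'
    if strategy == 'all':
        return num_features
    if strategy == 'sqrt':
        return int(math.ceil(math.sqrt(num_features)))
    if strategy == 'log2':
        return max(1, int(math.ceil(math.log2(num_features))))
    if strategy == 'onethird':
        return int(math.ceil(num_features / 3.0))
    raise ValueError('unsupported strategy: %s' % strategy)

def gini(class_counts):
    """Gini impurity 1 - sum(p_k^2) for the class counts in one node."""
    total = float(sum(class_counts))
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

for s in ['auto', 'all', 'sqrt', 'log2', 'onethird']:
    print(s, features_per_split(s, 54, 10))
print(gini([50, 50]), gini([100, 0]))
```

A perfectly mixed binary node has impurity 0.5 and a pure node has impurity 0, so splits are chosen to move child nodes toward 0.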

In [48]:
## Insert your answer in this cell. DO NOT CHANGE THE NAME OF THE FUNCTION.
def Classify_RF(trainingData, testData, depth):
    '''
    Train and test RandomForest classifier using given training and test data
    Repeat the procedure with different tree depths
    
    Input: trainingData, testData
    type: RDD, RDD
    
    Returns: errors
    type: dict with key = tree depth, value = train & test error
    
    example output:
    errors = {1:{'train':0.2, 'test':0.3}, 3:{'train':0.15, 'test':0.16}}
    '''
    
    ### BEGIN SOLUTION
    
    from time import time
    errors={}
    start=time()
    numTrees=10  # forest size (assumed value; adjust as needed)
    model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                 numTrees=numTrees, featureSubsetStrategy="auto",
                                 impurity='gini', maxDepth=depth, maxBins=32)
    #print(model.toDebugString())
    errors[depth]={}
    dataSets={'test':testData}
    for name in list(dataSets.keys()):  # Calculate the error on each listed data set
        data=dataSets[name]
        Predicted=model.predict(data.map(lambda x: x.features))
        LabelsAndPredictions=data.map(lambda lp: lp.label).zip(Predicted)
        # Fraction of examples whose predicted label differs from the true label
        Err = LabelsAndPredictions.filter(lambda v_p:v_p[0] != v_p[1]).count()/float(data.count())
        errors[depth][name]=Err
    print(depth,errors[depth],int(time()-start),'seconds')
    print(errors)
    
    ### END SOLUTION
    
    return errors

In [49]:
#Train with 10% of the dataset
RF_10p_3 = Classify_RF(trainingData, testData, 3)
RF_10p_6 = Classify_RF(trainingData, testData, 6)
RF_10p_8 = Classify_RF(trainingData, testData, 8)
RF_10p_10 = Classify_RF(trainingData, testData, 10)
#Store Results
Results_RF={ID:globals()[ID] for ID in ['RF_10p_3','RF_10p_6', 'RF_10p_8', 'RF_10p_10']}
import pickle
pickle.dump(Results_RF, open('RandomForestResults.pkl', 'wb'))

3 {'test': 0.2967160709096193} 1 seconds
6 {'test': 0.2760244115082825} 1 seconds
10 {'test': 0.23109561174077303} 2 seconds
{3: {'test': 0.2967160709096193}, 6: {'test': 0.2760244115082825}, 10: {'test': 0.23109561174077303}}
3 {'test': 0.2864864864864865} 1 seconds
6 {'test': 0.2662016855565243} 1 seconds
10 {'test': 0.22063353676256903} 3 seconds
{3: {'test': 0.2864864864864865}, 6: {'test': 0.2662016855565243}, 10: {'test': 0.22063353676256903}}


In [None]:
import pickle
Results=pickle.load(open('BoostedTreesResults.pkl','rb'))

In [None]:
make_figure([Results['RF_10p_10'],Results['RF_10p_100']],['10Trees','100Trees'],Title='Random Forest using 10% of the dataset')

In [None]:
make_figure([Results['RF_all_10'],Results['RF_all_100']],['10Trees','100Trees'],Title='Random Forest using entire dataset')