### Reducing data size
To see the effects of overfitting more clearly, we reduce the size of the data by a factor of 10.

In [None]:
Data1=Data.sample(False,0.1).cache()
(trainingData,testData)=Data1.randomSplit([0.7,0.3])

print('Sizes: Data1=%d, trainingData=%d, testData=%d'%(Data1.count(),trainingData.cache().count(),testData.cache().count()))
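
As a rough local illustration of what the cell above does: `sample(False,0.1)` keeps each record independently with probability 0.1, and `randomSplit([0.7,0.3])` assigns each remaining record to one of the two parts at random, so the printed sizes are only approximately 10%, 70% and 30%. The sketch below mimics this in plain Python; the helper names `bernoulli_sample` and `random_split` are hypothetical and are not part of the Spark API.

```python
import random

def bernoulli_sample(records, fraction, rng):
    """Keep each record independently with probability `fraction`."""
    return [r for r in records if rng.random() < fraction]

def random_split(records, weights, rng):
    """Assign each record to one part, with probability proportional to its weight."""
    total = float(sum(weights))
    parts = [[] for _ in weights]
    for r in records:
        u = rng.random() * total
        cumulative = 0.0
        for i, w in enumerate(weights):
            cumulative += w
            if u < cumulative:
                parts[i].append(r)
                break
        else:
            # Guard against floating-point round-off in the cumulative sum
            parts[-1].append(r)
    return parts

rng = random.Random(0)            # fixed seed so the run is repeatable
data = list(range(100000))        # stand-in for the full dataset
data1 = bernoulli_sample(data, 0.1, rng)
train, test = random_split(data1, [0.7, 0.3], rng)
print(len(data1), len(train), len(test))
```

Because the sizes vary from run to run, passing a seed is the usual way to make such an experiment repeatable.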

In [None]:
counts=testData.map(lambda lp:(lp.label,1)).reduceByKey(lambda x,y:x+y).collect()
counts.sort(key=lambda x:x[1],reverse=True)
counts
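
The `map`/`reduceByKey` pattern in the cell above is a distributed count of examples per label. For intuition, here is a minimal local equivalent on a plain Python list; the `labels` values are made-up illustration data, not taken from the dataset.

```python
from collections import Counter

labels = [1.0, 2.0, 1.0, 3.0, 2.0, 1.0]  # hypothetical labels

# Local equivalent of map(lambda lp: (lp.label, 1)).reduceByKey(lambda x, y: x + y)
counts = list(Counter(labels).items())
# Most frequent label first, as in the cell above
counts.sort(key=lambda x: x[1], reverse=True)
print(counts)
```

A count like this shows how unevenly the classes are represented, which is a useful reference point when reading the error rates later on.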

### Random Forests

* Following [this example](http://spark.apache.org/docs/latest/mllib-ensembles.html#classification) from the MLlib documentation

* [pyspark.mllib.tree.RandomForest documentation](http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.tree.RandomForest)

**trainClassifier**`(data, numClasses, categoricalFeaturesInfo, numTrees, featureSubsetStrategy='auto', impurity='gini', maxDepth=4, maxBins=32, seed=None)`   
Method to train a random forest model for binary or multiclass classification.

**Parameters:**  
* *data* – Training dataset: RDD of LabeledPoint. Labels should take values {0, 1, ..., numClasses-1}.  
* *numClasses* – Number of classes for classification.  
* *categoricalFeaturesInfo* – Map storing arity of categorical features. E.g., an entry (n -> k) indicates that feature n is categorical with k categories indexed from 0: {0, 1, ..., k-1}.  
* *numTrees* – Number of trees in the random forest.  
* *featureSubsetStrategy* – Number of features to consider for splits at each node. Supported: “auto” (default), “all”, “sqrt”, “log2”, “onethird”. If “auto” is set, this parameter is set based on numTrees: if numTrees == 1, set to “all”; if numTrees > 1 (forest) set to “sqrt”.
* *impurity* – Criterion used for information gain calculation. Supported values: “gini” (recommended) or “entropy”.  
* *maxDepth* – Maximum depth of the tree. E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. (default: 4)  
* *maxBins* – Maximum number of bins used for splitting features. (default: 32)  
* *seed* – Random seed for bootstrapping and choosing feature subsets.  
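
To make the *featureSubsetStrategy* options concrete, the sketch below maps each option to a number of candidate features per node for a classifier. The function name `features_per_split` is hypothetical, the feature count of 54 is an arbitrary illustration value, and the rounding (ceiling) is an assumption about the implementation rather than something stated in the parameter list above.

```python
import math

def features_per_split(strategy, num_features, num_trees):
    """Number of features considered at each node (classification case)."""
    if strategy == "auto":
        # Per the parameter description: one tree uses all features, a forest uses sqrt
        strategy = "all" if num_trees == 1 else "sqrt"
    if strategy == "all":
        return num_features
    if strategy == "sqrt":
        return int(math.ceil(math.sqrt(num_features)))
    if strategy == "log2":
        return max(1, int(math.ceil(math.log2(num_features))))
    if strategy == "onethird":
        return int(math.ceil(num_features / 3.0))
    raise ValueError("unsupported strategy: %s" % strategy)

# Hypothetical setting: 54 features, 10 trees
for s in ["auto", "all", "sqrt", "log2", "onethird"]:
    print(s, features_per_split(s, 54, 10))
```

Considering fewer features per split makes the individual trees less similar to one another, which is the point of a random forest.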

**Returns:**  
RandomForestModel that can be used for prediction.
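
The *maxDepth* description above implies that a binary tree of depth d has at most 2^d leaves and 2^(d+1) - 1 nodes in total. The short sketch below tabulates this bound for depth 0 and for the depths explored later in this section, which helps explain why deep trees can fit the training data much more closely; `max_nodes` is a hypothetical helper name.

```python
def max_nodes(depth):
    """Upper bound for a binary tree of the given depth: (leaves, total nodes)."""
    leaves = 2 ** depth
    total = 2 ** (depth + 1) - 1
    return leaves, total

for depth in [0, 1, 3, 6, 10, 15, 20]:
    leaves, total = max_nodes(depth)
    print("depth %2d: at most %7d leaves, %7d nodes" % (depth, leaves, total))
```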

In [None]:
from time import time
errors={}
for depth in [1,3,6,10,15,20]:
    start=time()
    model = RandomForest.trainClassifier(## FILLIN ##)
    #print(model.toDebugString())
    errors[depth]={}
    dataSets={'train':trainingData,'test':testData}
    for name in dataSets.keys():  # Calculate errors on train and test sets
        ### FILLIN ###
    print(depth,errors[depth],int(time()-start),'seconds')
print(errors)
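
One possible way to approach the error calculation is to compare each label with the model's prediction and report the fraction that disagree. The sketch below shows that computation on a plain Python list so it can be run without a cluster; `error_rate` is a hypothetical helper, and the commented lines follow the pattern used in the MLlib documentation example linked above. Treat it as a sketch, not necessarily the intended solution.

```python
def error_rate(labels_and_predictions):
    """Fraction of (label, prediction) pairs where the prediction is wrong."""
    pairs = list(labels_and_predictions)
    if not pairs:
        raise ValueError("no examples to evaluate")
    wrong = sum(1 for label, pred in pairs if label != pred)
    return wrong / float(len(pairs))

# With Spark, the pairs would come from something along these lines (not run here):
#   predictions = model.predict(data.map(lambda lp: lp.features))
#   pairs = data.map(lambda lp: lp.label).zip(predictions)
#   err = pairs.filter(lambda lp: lp[0] != lp[1]).count() / float(data.count())

toy_pairs = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (2.0, 2.0)]  # made-up example
print(error_rate(toy_pairs))
```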

In [None]:
Results={\
    'RF10p_10':{1: {'test': 0.41132990702777616, 'train': 0.4119253191646373}, 3: {'test': 0.27504764104637064, 'train': 0.2733131626202248}, 6: {'test': 0.2566264364497315, 'train': 0.2549627333776105}, 10: {'test': 0.23167985216838943, 'train': 0.2252724276191179}, 15: {'test': 0.20182479644280188, 'train': 0.18323370968932182}, 20: {'test': 0.19091066581971472, 'train': 0.15942242884903943}},\
    'RF10p_100':{1: {'test': 0.3960270254663048, 'train': 0.39411605539566574}, 3: {'test': 0.2815730207310735, 'train': 0.2798071483039382}, 6: {'test': 0.26396027025466307, 'train': 0.26189949081248615}, 10: {'test': 0.23081365132528728, 'train': 0.22281258455710526}, 15: {'test': 0.1997459144193567, 'train': 0.18001131527808525}, 20: {'test': 0.17803314661892938, 'train': 0.14658204806533343}},\
    'RFall_10':{1: {'test': 0.4347141897208231, 'train': 0.4350283735527522}, 3: {'test': 0.29120314267828323, 'train': 0.29215788917479535}, 6: {'test': 0.2565544254216944, 'train': 0.25831115305498004}, 10: {'test': 0.23009550939300133, 'train': 0.23181032851388447}, 15: {'test': 0.1900252126419288, 'train': 0.18882605500709032}, 20: {'test': 0.160103147847162, 'train': 0.154824487027302}},\
    'RFall_100':{1: {'test': 0.37008023248468, 'train': 0.37146866620954405}, 10: {'test': 0.21840810020732948, 'train': 0.2205764168958424}, 3: {'test': 0.289640992654449, 'train': 0.29019177031799515}, 6: {'test': 0.2532176270251954, 'train': 0.25488519094700574}, 15: {'test': 0.19286809595736248, 'train': 0.1926280373464277}}
}
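
A simple way to read overfitting off these numbers is the gap between test and train error at each depth. The sketch below computes that gap for the two extreme depths of the `'RF10p_10'` entry, with the values copied from the cell above; `generalization_gap` is a hypothetical helper name.

```python
def generalization_gap(errors_by_depth):
    """Map each depth to (test error - train error)."""
    return {depth: e['test'] - e['train']
            for depth, e in sorted(errors_by_depth.items())}

# Depths 1 and 20 of Results['RF10p_10'], copied from the cell above
rf10p_10 = {
    1: {'test': 0.41132990702777616, 'train': 0.4119253191646373},
    20: {'test': 0.19091066581971472, 'train': 0.15942242884903943},
}

gaps = generalization_gap(rf10p_10)
for depth, gap in gaps.items():
    print("depth %2d: test - train = %+.4f" % (depth, gap))
```

For this entry the gap is close to zero at depth 1 and roughly 0.03 at depth 20, so the deeper forest fits the training set noticeably better than the test set.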


In [None]:
import pickle
with open('RandomForestResults.pkl','wb') as f:  # pickle needs binary mode
    pickle.dump(Results,f)
!ls -lrt

In [None]:
import pickle
with open('RandomForestResults.pkl','rb') as f:  # same file name as written by pickle.dump above
    Results=pickle.load(f)
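
Pickle files are binary, so under Python 3 they need to be opened in `'wb'` and `'rb'` mode. A minimal round-trip sketch, using a temporary directory and a toy dictionary (both are illustration choices, not part of the notebook):

```python
import os
import pickle
import tempfile

toy_results = {'example': {1: {'test': 0.4, 'train': 0.4}}}  # made-up values

path = os.path.join(tempfile.mkdtemp(), 'toy_results.pkl')

# Write in binary mode; the with-block closes the file even on error
with open(path, 'wb') as f:
    pickle.dump(toy_results, f)

# Read back in binary mode
with open(path, 'rb') as f:
    loaded = pickle.load(f)

print(loaded == toy_results)
```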

In [None]:
%pylab inline
from plot_utils import *

In [None]:
make_figure([Results['RF10p_10'],Results['RF10p_100']],['10 trees','100 trees'],Title='random forests using 10% of the data')

In [None]:
make_figure([Results['RFall_10'],Results['RFall_100']],['10 trees','100 trees'],Title='random forests using all of the data')