In [1]:
import numpy as np

In [2]:
from prep_terrain_data import makeTerrainData

In [3]:
from adaboost_tester import *

In [4]:
from bokeh.io import output_notebook
from bokeh.plotting import figure, show
output_notebook()

Visualise the Data
==================================
In the plot below we can see that although the middle section of the boundary appears mostly linear, the edges deviate from a straight line. The two classes also overlap, so the boundary between them is fuzzy rather than clear-cut.


In [5]:
#==========================================================================
#                                                       SCATTERPLOT OF DATA
#==========================================================================
X_train,Y_train,X_test,Y_test = makeTerrainData(n_points=1000)
X_train = np.array(X_train)

p = figure(plot_width=500, plot_height=500, 
          title="Terrain Data", x_axis_label='x1', y_axis_label='x2')
# Build the colours as a list so this also works where map() returns an iterator
p.circle(X_train[:,0], X_train[:,1], size=10,
         color=[COLORS[int(y)] for y in Y_train], alpha=0.3)
p.logo = None
p.toolbar_location = None
#output_file("myPlot.html")
show(p)


Adaboost Classifier
============================================
We will use an AdaBoost classifier with a decision tree of depth 1 as its weak classifier. We will try multiple values for several key parameters of the AdaBoost algorithm and generate plots to visualise how those changes affect the accuracy and training time of the resulting model.

Two kinds of accuracy are plotted together: the in-sample accuracy (the accuracy of the model on the training data) and the out-of-sample accuracy (the accuracy on data the model was not trained on).
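
As a rough illustration of the two accuracy measures, the sketch below boosts depth-1 trees with scikit-learn and scores the model on both its training data and a held-out portion. It assumes scikit-learn is the library behind the helper functions, which this notebook does not show. The synthetic two-feature dataset is only a stand-in for makeTerrainData, and all variable names are illustrative.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the terrain data (an assumption, not makeTerrainData):
# two features in [0, 1] and a noisy, roughly diagonal class boundary.
rng = np.random.RandomState(387)
X = rng.rand(1000, 2)
noise = rng.normal(scale=0.1, size=1000)
y = (X[:, 0] + X[:, 1] + noise > 1.0).astype(int)

# Hold out the last 25% of the points for the out-of-sample estimate.
split = 750
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

# Depth-1 tree (a "decision stump") as the weak classifier.
stump = DecisionTreeClassifier(max_depth=1)
model = AdaBoostClassifier(stump, n_estimators=50, learning_rate=1.0,
                           random_state=387)
model.fit(X_train, y_train)

# score() returns mean accuracy on the data it is given.
in_sample_accuracy = model.score(X_train, y_train)
out_of_sample_accuracy = model.score(X_test, y_test)
print("in-sample:     %.3f" % in_sample_accuracy)
print("out-of-sample: %.3f" % out_of_sample_accuracy)
```

The in-sample figure is usually the more optimistic of the two, which is why both are tracked throughout the rest of the notebook.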

Default Values
-------------------
We will start by setting up some default values for the parameters to be tested. 

In [5]:
data_generator = makeTerrainData
default_n_points=1000      # Size of dataset generated. 
default_max_depth=1         # Max depth of weak decision tree classifier.
default_n_estimators=50     # Number of classifiers to use for boosting.
default_learning_rate=1.0   # Learning rate for the boosted classifiers
seed = 387                  # Setting random seed for reproducible results
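
The experiments that follow all share one pattern: hold the defaults fixed, vary a single parameter, and record accuracy and elapsed time for each value. The source of loop_adaboost_with_simple_tree is not shown here, so the following is only a guess at that pattern. The names sweep_one_parameter, best_value and fit_and_score are hypothetical and are not the helper's actual interface.

```python
import time


def sweep_one_parameter(fit_and_score, name, values, defaults):
    """Vary one parameter over `values`, keeping the others at `defaults`.

    `fit_and_score(**params)` is a hypothetical callable that trains a model
    and returns (in_sample_accuracy, out_of_sample_accuracy).
    """
    results = {"in_sample": [], "out_of_sample": [], "elapsed": []}
    for value in values:
        params = dict(defaults)   # copy so the defaults are never mutated
        params[name] = value
        start = time.time()
        acc_in, acc_out = fit_and_score(**params)
        # The timer wraps the whole call, so it includes scoring as well as fitting.
        results["elapsed"].append(time.time() - start)
        results["in_sample"].append(acc_in)
        results["out_of_sample"].append(acc_out)
    return results


def best_value(values, results):
    """Return (value, accuracy) for the highest out-of-sample accuracy."""
    scores = results["out_of_sample"]
    best_index = scores.index(max(scores))
    return values[best_index], scores[best_index]
```

Selecting the best value by out-of-sample accuracy matches the "Best Results" printouts below, although how the real helper picks its winner is an assumption.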

Sample Size
----------------------
We will now look at the effect of sample size. The plots and printout below show that the in-sample and out-of-sample accuracy estimates converge as the amount of data grows. This is because a larger training set contains more points that are representative of the variety that truly exists in the data.

Out-of-sample accuracy rises up to a point and then flattens out, fluctuating by small amounts around 0.962. Beyond that point, collecting more data brings little further benefit.

We also see that, for this learning algorithm, training time scales linearly with the number of training examples.
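
The linear-scaling claim is read off the plot, but it could also be checked numerically by fitting a straight line to the (sample size, training time) pairs and looking at the coefficient of determination. The helper below is a minimal sketch of that idea using NumPy only. The name linear_fit_r2 is hypothetical, and the layout of the timing data inside the results dictionary is not known from this notebook.

```python
import numpy as np


def linear_fit_r2(x, y):
    """Least-squares line y ~ slope * x + intercept, plus R^2 of the fit.

    Assumes y is not constant (otherwise R^2 is undefined).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)          # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)     # total variation
    return slope, intercept, 1.0 - ss_res / ss_tot
```

Given the list of sample sizes and a matching list of measured training times, an R^2 close to 1 would support the linear reading, while a clearly lower value would suggest curvature worth investigating.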

In [12]:
# list(...) keeps the concatenation valid where range() returns a lazy object
sample_sizes = list(range(1000, 30000+1, 1000)) + \
               list(range(35000, 60000+1, 5000)) + \
               list(range(70000, 100000+1, 10000)) + \
               [150000, 200000]
sample_size_results = loop_adaboost_with_simple_tree(data_generator,
                               sample_size=sample_sizes,
                               max_depth=default_max_depth,
                               n_estimators=default_n_estimators,
                               learning_rate=default_learning_rate,
                               random_state=seed
                               )


---------------------------------------- 
Best Results 
----------------------------------------
sample_size: 14000 
Out of sample accuracy: 0.964857142857 
----------------------------------------


In [13]:
scaled_sample_sizes = np.array(sample_sizes) / 1000.0
parameter_plots(scaled_sample_sizes, results_dict=sample_size_results, 
                x_label="Sample Size (in thousands)", 
                title_accuracy="Sample Size vs Accuracy", 
                title_time="Sample Size vs Training Time")

Max Depth
-------------------
We go back to using the default sample size of 1000, and try out various values of the depth of the decision tree weak classifier. 

We can see that making the 'weak' classifier stronger does not do much to improve the boosted algorithm. In fact, accuracy gets worse when the depth is increased above 7. The best accuracy, 0.928, was achieved with a depth of 4.

Although increasing the depth lowers the accuracy, it does reduce the training time considerably.

In [7]:
max_depths = list(range(1, 20+1))
max_depth_results = loop_adaboost_with_simple_tree(data_generator,
                               sample_size=default_n_points,
                               max_depth=max_depths,
                               n_estimators=default_n_estimators,
                               learning_rate=default_learning_rate,
                               random_state=seed
                               )


---------------------------------------- 
Best Results 
----------------------------------------
max_depth: 4 
Out of sample accuracy: 0.928 
----------------------------------------


In [8]:
parameter_plots(max_depths, results_dict=max_depth_results, 
                x_label="Max Depth", 
                title_accuracy="Max Depth vs Accuracy", 
                title_time="Max Depth vs Training Time", 
                legend_pos="right_center")

Number of Estimators
-------------------
We now look at the number of estimators used by the Adaboost algorithm.

We see from the plot that the in-sample and out-of-sample accuracy diverge slightly as we increase the number of estimators. This tells us that the models start to overfit slightly as we make them more complex.

The time taken to train the model also increases linearly with an increased number of estimators. 
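
The gap between the two accuracy curves can also be traced from a single fitted model rather than by retraining for each number of estimators. The sketch below assumes scikit-learn and uses the staged_score method of AdaBoostClassifier, which yields the accuracy after each boosting round. This is a different technique from the retraining loop used in this notebook, and the synthetic data is only a stand-in for makeTerrainData.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption, not makeTerrainData).
rng = np.random.RandomState(387)
X = rng.rand(1000, 2)
noise = rng.normal(scale=0.1, size=1000)
y = (X[:, 0] + X[:, 1] + noise > 1.0).astype(int)
X_train, y_train = X[:750], y[:750]
X_test, y_test = X[750:], y[750:]

model = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                           n_estimators=100, random_state=387)
model.fit(X_train, y_train)

# staged_score yields the accuracy after each boosting round, so one fit
# gives the whole curve (boosting may stop early, so the curve can be
# shorter than n_estimators).
train_curve = np.array(list(model.staged_score(X_train, y_train)))
test_curve = np.array(list(model.staged_score(X_test, y_test)))

gap = train_curve - test_curve               # positive where in-sample is higher
best_round = int(np.argmax(test_curve)) + 1  # rounds are counted from 1
print("best number of estimators: %d" % best_round)
print("final in-sample minus out-of-sample gap: %.3f" % gap[-1])
```

A gap that widens with the number of rounds is the overfitting signature described above. The training-time plot still needs separate fits, since a single fit only reports the total time.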

In [9]:
n_estimators = list(range(1, 20+1)) + list(range(25, 100+1, 5)) + list(range(125, 300, 25))
n_estimators_results = loop_adaboost_with_simple_tree(data_generator,
                               sample_size=default_n_points,
                               max_depth=default_max_depth,
                               n_estimators=n_estimators,
                               learning_rate=default_learning_rate,
                               random_state=seed
                               )


---------------------------------------- 
Best Results 
----------------------------------------
n_estimators: 13 
Out of sample accuracy: 0.928 
----------------------------------------


In [10]:
parameter_plots(n_estimators, results_dict=n_estimators_results, 
                x_label="Number of Estimators", 
                title_accuracy="Num Estimators vs Accuracy", 
                title_time="Num Estimators vs Training Time", 
                legend_pos="bottom_right")

Learning Rate
-------------------
Finally, we vary the learning rate while keeping the other parameters at their default values.

In [18]:
learning_rates = list(np.arange(0.001, 0.1+0.001, 0.001)) + \
                 list(np.arange(0.1, 1.0+0.1, 0.01)) + \
                 list(np.arange(1.5, 10.0+0.5, 0.5))
    
learning_rates_results = loop_adaboost_with_simple_tree(data_generator,
                               sample_size=default_n_points,
                               max_depth=default_max_depth,
                               n_estimators=default_n_estimators,
                               learning_rate=learning_rates,
                               random_state=seed
                               )


---------------------------------------- 
Best Results 
----------------------------------------
learning_rate: 0.22 
Out of sample accuracy: 0.928 
----------------------------------------


In [19]:
parameter_plots(learning_rates, results_dict=learning_rates_results, 
                x_label="Learning Rate", 
                title_accuracy="Learning Rate vs Accuracy", 
                title_time="Learning Rate vs Training Time", 
                legend_pos="bottom_right")