# Under and Overfitting by Example
## 48 U.S. Contiguous States

This notebook shows one way of generating decision surfaces over latitude and longitude coordinates. We construct a contrived but hopefully helpful example of underfitting and overfitting from data provided by the U.S. Census Bureau: latitude and longitude coordinates for each U.S. state. We only have border data for each state, so tree-based models are expected to work reasonably well here. But we can also expect a few things from fitting the data with tree-based models: the models will be less certain around borders, and the classifications will extend beyond each state's actual borders into the sea. By fitting overfit, underfit, and "just right" models, we can interpret overfitting and underfitting through a visual example many people already know.

Note that the "best" model here represents only a marginal amount of time dedicated to the modeling process. A true endeavour into machine learning for a specific goal will typically require a much more in-depth understanding of the data, feature generation, and machine learning steps such as cross-validation and other metrics and approaches (ROC, AUC, F-score, etc.).

# Imports

In [1]:
import json
import matplotlib as m
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(0) # set a seed to enable reproducible results
import pandas as pd
# next two lines enable plotly offline
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.plotly as py
from plotly import tools
import seaborn as sns
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import classification_report, f1_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder

# Needed Functions

These functions plot our decision surface and export the decision surface data for visualization in d3 or another library of your choice.

In [2]:
def plot_decision_surface(X_train, X_test, y_train, y_test, sklearn_classifier, label_encoder, labels=None):
    
    fig = tools.make_subplots(rows=1, cols=1)
    
    if labels is not None:
        classes = labels
    else:
        classes = list(set(y_test.values))
    n_classes = len(classes)
       
    x_min, x_max = X_test.iloc[:, 0].min() - 1, X_test.iloc[:, 0].max() + 1
    y_min, y_max = X_test.iloc[:, 1].min() - 1, X_test.iloc[:, 1].max() + 1
    x_ = np.arange(x_min, x_max, 0.75)
    y_ = np.arange(y_min, y_max, 0.75)
    xx, yy = np.meshgrid(x_, y_)

    # use the classifier passed in as an argument rather than the global `clf`
    Z = sklearn_classifier.predict(np.c_[xx.ravel(), yy.ravel()])
    Z_name = label_encoder.inverse_transform(Z)
    Z = Z.reshape(xx.shape)
    Z_name = Z_name.reshape(xx.shape)
    
    cs = go.Heatmap(x=x_, y=y_, z=Z, 
                    text=Z_name,
                    colorscale='Portland',
                    showscale=False)
    fig.append_trace(cs, 1, 1)
    fig['layout'].update(height=700, hovermode='closest',
                         title="Decision Surface of Model")
    iplot(fig)

In [3]:
def build_decision_surface(X_train, X_test, y_train, y_test, sklearn_classifier, label_encoder):
    
    x_min, x_max = X_test.iloc[:, 0].min() - 1, X_test.iloc[:, 0].max() + 1
    y_min, y_max = X_test.iloc[:, 1].min() - 1, X_test.iloc[:, 1].max() + 1

    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.75),
                         np.arange(y_min, y_max, 0.75))

    r = np.c_[xx.ravel(), yy.ravel()]
    Z = sklearn_classifier.predict(r)
    probas = sklearn_classifier.predict_proba(r)
    
    P = np.max(probas, axis=1)
    
    Z = Z.reshape(xx.shape)
    P = P.reshape(xx.shape)
    
    rows = []
    for i in range(xx.shape[0]):
        for j in range(xx[i].shape[0]):
            # inverse_transform expects an array-like, so wrap the scalar and unwrap the result
            rows.append([Z[i][j], label_encoder.inverse_transform([Z[i][j]])[0], xx[i][j], yy[i][j], P[i][j]])
                
    df = pd.DataFrame(rows, columns=['state_categorical', 'place', 'lon', 'lat', 'prob'])
         
    return df

In [4]:
def build_decision_matrix(X_train, X_test, y_train, y_test, sklearn_classifier, label_encoder):
    
    x_min, x_max = X_test.iloc[:, 0].min() - 1, X_test.iloc[:, 0].max() + 1
    y_min, y_max = X_test.iloc[:, 1].min() - 1, X_test.iloc[:, 1].max() + 1

    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.75),
                         np.arange(y_min, y_max, 0.75))

    r = np.c_[xx.ravel(), yy.ravel()]
    Z = sklearn_classifier.predict(r)
    probas = sklearn_classifier.predict_proba(r)
    
    class_natural_language = label_encoder.inverse_transform(sklearn_classifier.classes_)
    class_categoricals = sklearn_classifier.classes_
    
    rows = []
    for i in range(probas.shape[0]):
        rows.append([class_categoricals, class_natural_language, r[i][0], r[i][1], [round(p, 4) for p in probas[i]]])
    
    df = pd.DataFrame(rows, columns=['state_categorical', 'place', 'lon', 'lat', 'prob'])
         
    return df
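
The data frames returned by these helpers still need to be written to disk before d3 can read them. A minimal sketch of one way to do that, assuming a records-oriented JSON file suits the front end; the file name `surface.json` and the toy frame below are hypothetical stand-ins, not output from this notebook:

```python
import json
import os
import tempfile

import pandas as pd

# Toy stand-in for the frame returned by build_decision_surface (values are made up).
surface = pd.DataFrame({
    'state_categorical': [3, 3],
    'place': ['California', 'California'],
    'lon': [-125.6, -124.9],
    'lat': [23.5, 23.5],
    'prob': [0.42, 0.41],
})

# orient='records' writes a list of {column: value} objects, one per grid cell.
out_path = os.path.join(tempfile.gettempdir(), 'surface.json')  # hypothetical file name
surface.to_json(out_path, orient='records')

# Read the file back to confirm the structure a d3 script would see.
with open(out_path) as f:
    records = json.load(f)
print(records[0]['place'], len(records))
```

Each record maps directly onto one grid cell of the decision surface, which tends to be convenient for d3 data joins.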

# Process Data

We process a KML file provided by the U.S. Census Bureau to create a dataset of latitude and longitude coordinates and associated state as a label, for use in supervised learning. We'll split the data into training and testing. 
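
`process_kml_files.py` is not reproduced in this notebook. As a rough sketch of the kind of parsing involved, and not the script's actual implementation, the following pulls coordinate tuples out of a small inline KML fragment using only the standard library. KML stores each tuple as `longitude,latitude[,altitude]`. The namespace URI, the placemark name, the coordinates, and the helper `parse_placemarks` are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed namespace; real files may declare a different KML version.
KML_URI = 'http://www.opengis.net/kml/2.2'
KML_NS = {'kml': KML_URI}

# Made-up fragment; real Census Bureau files are far larger.
kml_text = """<kml xmlns="http://www.opengis.net/kml/2.2">
  <Document>
    <Placemark>
      <name>Example State</name>
      <Polygon><outerBoundaryIs><LinearRing>
        <coordinates>-100.0,40.0,0 -99.0,40.0,0 -99.0,41.0,0 -100.0,40.0,0</coordinates>
      </LinearRing></outerBoundaryIs></Polygon>
    </Placemark>
  </Document>
</kml>"""


def parse_placemarks(text):
    """Return {placemark name: [(lon, lat), ...]} for every Placemark in the KML text."""
    root = ET.fromstring(text)
    result = {}
    for placemark in root.iter('{%s}Placemark' % KML_URI):
        name = placemark.find('kml:name', KML_NS).text
        coords = []
        for node in placemark.iter('{%s}coordinates' % KML_URI):
            # Tuples are whitespace-separated; fields within a tuple are comma-separated.
            for tup in node.text.split():
                lon, lat = tup.split(',')[:2]
                coords.append((float(lon), float(lat)))
        result[name] = coords
    return result


placemarks = parse_placemarks(kml_text)
print(placemarks)
```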

In [5]:
%run -i 'process_kml_files.py'

100%|██████████| 2/2 [00:00<00:00,  6.32it/s]


# Read Data

Read in the data from file `geo.json`, which was created using `process_kml_files.py`.

In [6]:
with open('../data/geo.json', 'r') as f:
    data = json.load(f)

In [7]:
rows = []
for k in data.keys():    
    for c in data[k]['coordinates']:
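        # state, c[0], c[1]
        # note: judging by the values printed below, c[0] is the longitude and c[1]
        # the latitude, so the 'latitude'/'longitude' column names used later are swapped
        rows.append([k, c[0], c[1]]) 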

In [8]:
df = pd.DataFrame(rows, columns=['state', 'latitude', 'longitude'])
# drop rows (keep visualization to 48 contiguous states)
df = df[df['state'] != 'Alaska']
df = df[df['state'] != 'Hawaii']
df = df[df['state'] != 'Puerto Rico']
df = df.sample(frac=1) # shuffle the data to view
df.head(10)

Unnamed: 0,state,latitude,longitude
9909,Maine,-68.95189,44.218719
2091,Ohio,-81.466038,41.649148
9037,Delaware,-75.415062,39.801919
6087,Minnesota,-92.776496,45.790014
6652,Massachusetts,-71.186104,42.790689
2353,California,-121.888491,36.30281
7687,Virginia,-75.672877,37.483696
11799,Idaho,-115.29211,47.209861
3272,Tennessee,-89.729517,35.847632
13562,Florida,-80.710607,25.15253


Create a non-string categorical to be used by the models in `scikit-learn`. 

In [9]:
le = LabelEncoder()
df['state_categorical'] = le.fit_transform(df['state'])
df.head()

Unnamed: 0,state,latitude,longitude,state_categorical
9909,Maine,-68.95189,44.218719,17
2091,Ohio,-81.466038,41.649148,33
9037,Delaware,-75.415062,39.801919,6
6087,Minnesota,-92.776496,45.790014,21
6652,Massachusetts,-71.186104,42.790689,19
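
For readers unfamiliar with `LabelEncoder`, a small sketch on a toy list of state names (separate from the notebook's `le` and `df`) shows the mapping it builds and how `inverse_transform` recovers the original strings:

```python
from sklearn.preprocessing import LabelEncoder

states = ['Maine', 'Ohio', 'Delaware', 'Ohio']
le_demo = LabelEncoder()
codes = le_demo.fit_transform(states)

# Classes are stored in sorted order, so the integer codes follow
# the alphabetical order of the labels.
print(le_demo.classes_.tolist())                  # ['Delaware', 'Maine', 'Ohio']
print(codes.tolist())                             # [1, 2, 0, 2]
print(le_demo.inverse_transform(codes).tolist())  # ['Maine', 'Ohio', 'Delaware', 'Ohio']
```

The codes are arbitrary identifiers. They are fine as classification targets, but their numeric order carries no meaning.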


In [10]:
df.shape

(11192, 4)

In [11]:
df.groupby('state')['state_categorical'].count().mean() # mean number of observations per state

228.40816326530611

# Fit the Models

In [12]:
# set independent and dependent variables
X = df[['latitude', 'longitude']]
y = df['state_categorical']

In [13]:
# create training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

## Underfitting

Use the training data for training and the test data for testing, but constrain the model through its parameters. 

In [14]:
clf = RandomForestClassifier(min_samples_leaf=50, max_depth=3).fit(X_train, y_train)

In [15]:
plot_decision_surface(X_train, X_test, y_train, y_test, clf, le, labels=list(set(df['state'].values)))

[Plot: "Decision Surface of Model" heatmap for the underfit random forest]



The underfit model doesn't look like much and fails to capture enough information from our training set to generalize well on our test set. 
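
One way to put a number on this is to compare training and test accuracy. The sketch below uses synthetic 2-D blobs rather than the state data, so the figures it prints say nothing about the notebook's model. It only illustrates the usual signature of underfitting: a heavily constrained forest scores poorly on both splits, while an unconstrained one does not. The dataset parameters are arbitrary.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 30 classes scattered over a 2-D plane.
X_demo, y_demo = make_blobs(n_samples=3000, centers=30, cluster_std=1.0,
                            center_box=(-50, 50), random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Same constraints as the underfit model above, versus library defaults for depth.
constrained = RandomForestClassifier(n_estimators=20, max_depth=3, min_samples_leaf=50,
                                     random_state=0).fit(Xtr, ytr)
flexible = RandomForestClassifier(n_estimators=20, random_state=0).fit(Xtr, ytr)

results = {}
for name, model in [('constrained', constrained), ('flexible', flexible)]:
    results[name] = (model.score(Xtr, ytr), model.score(Xte, yte))
    print(name, 'train=%.3f test=%.3f' % results[name])
```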

In [17]:
underfit_surface = build_decision_surface(X_train, X_test, y_train, y_test, clf, le)


The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.



In [18]:
underfit_surface.shape

(2880, 5)

In [19]:
underfit_matrix = build_decision_matrix(X_train, X_test, y_train, y_test, clf, le)


The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.



In [20]:
underfit_surface['label'] = 'underfitting'
underfit_surface.head()

Unnamed: 0,state_categorical,place,lon,lat,prob,label
0,3,California,-125.625512,23.498716,0.419552,underfitting
1,3,California,-124.875512,23.498716,0.419552,underfitting
2,3,California,-124.125512,23.498716,0.419552,underfitting
3,3,California,-123.375512,23.498716,0.412875,underfitting
4,3,California,-122.625512,23.498716,0.412875,underfitting


In [21]:
underfit_matrix['label'] = 'underfitting'
underfit_matrix

Unnamed: 0,state_categorical,place,lon,lat,prob,label
0,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[Alabama, Arizona, Arkansas, California, Color...",-125.625512,23.498716,"[0.0, 0.0814, 0.0, 0.4196, 0.0, 0.0, 0.0, 0.0,...",underfitting
1,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[Alabama, Arizona, Arkansas, California, Color...",-124.875512,23.498716,"[0.0, 0.0814, 0.0, 0.4196, 0.0, 0.0, 0.0, 0.0,...",underfitting
2,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[Alabama, Arizona, Arkansas, California, Color...",-124.125512,23.498716,"[0.0, 0.0814, 0.0, 0.4196, 0.0, 0.0, 0.0, 0.0,...",underfitting
3,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[Alabama, Arizona, Arkansas, California, Color...",-123.375512,23.498716,"[0.0, 0.0814, 0.0, 0.4129, 0.0, 0.0, 0.0, 0.0,...",underfitting
4,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[Alabama, Arizona, Arkansas, California, Color...",-122.625512,23.498716,"[0.0, 0.0814, 0.0, 0.4129, 0.0, 0.0, 0.0, 0.0,...",underfitting
5,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[Alabama, Arizona, Arkansas, California, Color...",-121.875512,23.498716,"[0.0, 0.0814, 0.0, 0.4419, 0.0, 0.0, 0.0, 0.0,...",underfitting
6,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[Alabama, Arizona, Arkansas, California, Color...",-121.125512,23.498716,"[0.0, 0.0814, 0.0, 0.4419, 0.0, 0.0, 0.0, 0.0,...",underfitting
7,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[Alabama, Arizona, Arkansas, California, Color...",-120.375512,23.498716,"[0.0, 0.0814, 0.0, 0.4419, 0.0, 0.0, 0.0, 0.0,...",underfitting
8,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[Alabama, Arizona, Arkansas, California, Color...",-119.625512,23.498716,"[0.0, 0.0814, 0.0, 0.4419, 0.0, 0.0, 0.0, 0.0,...",underfitting
9,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[Alabama, Arizona, Arkansas, California, Color...",-118.875512,23.498716,"[0.0, 0.0814, 0.0, 0.4419, 0.0, 0.0, 0.0, 0.0,...",underfitting


## Overfitting

Use all of the data for both training and testing, and keep the default parameters, which are prone to overfitting. 

In [22]:
clf = RandomForestClassifier().fit(X, y)

In [23]:
plot_decision_surface(X, X, y, y, clf, le, labels=list(set(df['state'].values)))

[Plot: "Decision Surface of Model" heatmap for the overfit random forest]



So, this looks more like the lower 48 U.S. states than the underfit model! We do see some problems, however. While the general shape is correct for a decent number of states, the inner borders are very irregular and tend to overlap one another. This is an example of an overfit model: one that follows its training data too closely, noise included. 
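
Because this model is scored on the same rows it was trained on, its apparent accuracy is optimistic. A small sketch on synthetic, deliberately noisy data (not the state borders) shows the gap that typically opens between training accuracy and held-out accuracy for an unconstrained forest. The dataset and its noise level are assumptions chosen to make the effect visible.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Two informative features plus 30% randomly flipped labels,
# so there is noise available for the model to memorize.
X_demo, y_demo = make_classification(n_samples=1000, n_features=2, n_informative=2,
                                     n_redundant=0, n_clusters_per_class=1,
                                     flip_y=0.3, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Default depth settings let every tree grow until its leaves are pure.
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(Xtr, ytr)

train_acc = model.score(Xtr, ytr)  # evaluated on rows the model has already seen
test_acc = model.score(Xte, yte)   # evaluated on held-out rows
print('train=%.3f test=%.3f gap=%.3f' % (train_acc, test_acc, train_acc - test_acc))
```

A large gap between the two numbers is a common warning sign of overfitting; evaluating only on the training rows hides it.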

In [24]:
overfit_surface = build_decision_surface(X_train, X_test, y_train, y_test, clf, le)


The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.



In [25]:
overfit_surface['label'] = 'overfitting'
overfit_surface.head()

Unnamed: 0,state_categorical,place,lon,lat,prob,label
0,3,California,-125.625512,23.498716,0.6,overfitting
1,3,California,-124.875512,23.498716,0.6,overfitting
2,3,California,-124.125512,23.498716,0.7,overfitting
3,3,California,-123.375512,23.498716,0.7,overfitting
4,3,California,-122.625512,23.498716,0.6,overfitting


In [26]:
overfit_matrix = build_decision_matrix(X_train, X_test, y_train, y_test, clf, le)


The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.



In [27]:
overfit_matrix['label'] = 'overfitting'
overfit_matrix.head()

Unnamed: 0,state_categorical,place,lon,lat,prob,label
0,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[Alabama, Arizona, Arkansas, California, Color...",-125.625512,23.498716,"[0.0, 0.0, 0.0, 0.6, 0.0, 0.0, 0.0, 0.0, 0.0, ...",overfitting
1,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[Alabama, Arizona, Arkansas, California, Color...",-124.875512,23.498716,"[0.0, 0.0, 0.0, 0.6, 0.0, 0.0, 0.0, 0.0, 0.0, ...",overfitting
2,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[Alabama, Arizona, Arkansas, California, Color...",-124.125512,23.498716,"[0.0, 0.0, 0.0, 0.7, 0.0, 0.0, 0.0, 0.0, 0.0, ...",overfitting
3,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[Alabama, Arizona, Arkansas, California, Color...",-123.375512,23.498716,"[0.0, 0.0, 0.0, 0.7, 0.0, 0.0, 0.0, 0.0, 0.0, ...",overfitting
4,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[Alabama, Arizona, Arkansas, California, Color...",-122.625512,23.498716,"[0.0, 0.0, 0.0, 0.6, 0.0, 0.0, 0.0, 0.0, 0.0, ...",overfitting


## Striving for a Good Result 

Use the training data for training and the test data for testing. Use a grid search with cross-validation to determine the best set of parameter values, scoring first on macro-averaged precision and then on macro-averaged recall. 

In [28]:
# Set the parameters by cross-validation
tuned_parameters = [
                        {'min_samples_leaf': [1, 3, 5, 10, 15, 20, 25, 30, 50, 75],
                        'max_depth': [3, 5, 10, 25, 50, 100, 200, 250]}
                   ]

# metrics to use in evaluation
scores = ['precision', 'recall']

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    clf = GridSearchCV(RandomForestClassifier(), tuned_parameters, cv=5,
                       scoring='%s_macro' % score)
    clf.fit(X_train, y_train)

    print("Best parameters set found on development set:")
    print()
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))
    print()



# Tuning hyper-parameters for precision




Precision is ill-defined and being set to 0.0 in labels with no predicted samples.





Best parameters set found on development set:

{'max_depth': 200, 'min_samples_leaf': 30}

Grid scores on development set:

0.181 (+/-0.037) for {'max_depth': 3, 'min_samples_leaf': 1}
0.151 (+/-0.020) for {'max_depth': 3, 'min_samples_leaf': 3}
0.140 (+/-0.049) for {'max_depth': 3, 'min_samples_leaf': 5}
0.153 (+/-0.061) for {'max_depth': 3, 'min_samples_leaf': 10}
0.161 (+/-0.008) for {'max_depth': 3, 'min_samples_leaf': 15}
0.155 (+/-0.032) for {'max_depth': 3, 'min_samples_leaf': 20}
0.162 (+/-0.029) for {'max_depth': 3, 'min_samples_leaf': 25}
0.156 (+/-0.030) for {'max_depth': 3, 'min_samples_leaf': 30}
0.170 (+/-0.040) for {'max_depth': 3, 'min_samples_leaf': 50}
0.142 (+/-0.053) for {'max_depth': 3, 'min_samples_leaf': 75}
0.344 (+/-0.049) for {'max_depth': 5, 'min_samples_leaf': 1}
0.341 (+/-0.039) for {'max_depth': 5, 'min_samples_leaf': 3}
0.341 (+/-0.053) for {'max_depth': 5, 'min_samples_leaf': 5}
0.338 (+/-0.035) for {'max_depth': 5, 'min_samples_leaf': 10}
0.349 (+/-0.07


Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
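The warning above comes from how precision is defined: for a given label it is TP / (TP + FP), so when a model never predicts that label the denominator is zero, and (per the warning text) the score is set to 0.0. With dozens of state labels, some states can easily go unpredicted. The following is a minimal pure-Python sketch of that situation, not scikit-learn's implementation; the function name and the toy labels are made up for illustration.

```python
def per_label_precision(y_true, y_pred, labels):
    """Return {label: precision}, using None where precision is undefined."""
    result = {}
    for label in labels:
        # True labels of every sample the model assigned to `label`.
        predicted = [t for t, p in zip(y_true, y_pred) if p == label]
        if not predicted:
            # TP + FP == 0: the label was never predicted, so precision is undefined.
            result[label] = None
        else:
            true_positives = sum(1 for t in predicted if t == label)
            result[label] = true_positives / len(predicted)
    return result


# Toy data (hypothetical): "Iowa" appears in the truth but is never predicted.
y_true = ["Ohio", "Ohio", "Utah", "Iowa"]
y_pred = ["Ohio", "Utah", "Utah", "Ohio"]
precision = per_label_precision(y_true, y_pred, labels=["Iowa", "Ohio", "Utah"])
print(precision)
```

Any label reported as `None` here is one that would be scored as 0.0 in the averaged metric, which pulls the overall score down.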



Our modeling performance here is not great. The input data consists only of border coordinates (as opposed to interior state coordinates, which we could probably generate from the known border coordinates and labels), and we have no information other than the coordinates to help us predict the label. As it turns out, this performance is enough to show the difference between a well-fit model and one that is underfit or overfit. But don't kid yourself: the performance metrics seen here are not indicative of a model that performs well enough to be used for inference.
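As a rough illustration of how interior coordinates could be generated from border coordinates, here is a minimal sketch using a ray-casting point-in-polygon test and rejection sampling. This step is not part of the notebook. It assumes each state's border is a single closed ring of (lon, lat) pairs, and the function names and toy rectangle are hypothetical.

```python
import random


def point_in_polygon(lon, lat, border):
    """Ray-casting test: True if (lon, lat) lies inside the closed ring `border`."""
    inside = False
    n = len(border)
    for i in range(n):
        x1, y1 = border[i]
        x2, y2 = border[(i + 1) % n]
        # Only edges that straddle the point's latitude can cross the ray.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            # Count crossings to the east of the point; an odd count means inside.
            if lon < x_cross:
                inside = not inside
    return inside


def sample_interior(border, n_points, seed=0):
    """Rejection-sample points (uniform in lon/lat space) inside `border`."""
    rng = random.Random(seed)
    lons = [p[0] for p in border]
    lats = [p[1] for p in border]
    points = []
    while len(points) < n_points:
        lon = rng.uniform(min(lons), max(lons))
        lat = rng.uniform(min(lats), max(lats))
        if point_in_polygon(lon, lat, border):
            points.append((lon, lat))
    return points


# Hypothetical rectangular "state" (illustrative coordinates only).
toy_border = [(-109.0, 37.0), (-102.0, 37.0), (-102.0, 41.0), (-109.0, 41.0)]
interior = sample_interior(toy_border, n_points=5)
print(interior)
```

Each sampled point would carry the same label as the border it came from. States made of several disjoint pieces would need one ring per piece, which this sketch does not handle.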

In [29]:
clf = RandomForestClassifier(min_samples_leaf=30, max_depth=200).fit(X_train, y_train)

In [30]:
plot_decision_surface(X_train, X_test, y_train, y_test, clf, le, labels=list(set(df['state'].values)))

This is the format of your plot grid:
[ (1,1) x1,y1 ]




The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.



Much better! In general, this is an improvement over the underfit and overfit examples: the state borders away from the coastlines are clear, and we can make out some states in our data. This is the best-scoring random forest from the grid search above, but a gradient boosted decision tree may do better.

In [35]:
between_surface = build_decision_surface(X_train, X_test, y_train, y_test, clf, le)


The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.



In [36]:
between_surface['label'] = 'in between'
between_surface.head()

Unnamed: 0,state_categorical,place,lon,lat,prob,label
0,3,California,-125.625512,23.498716,0.428793,in between
1,3,California,-124.875512,23.498716,0.428793,in between
2,3,California,-124.125512,23.498716,0.428793,in between
3,3,California,-123.375512,23.498716,0.428793,in between
4,3,California,-122.625512,23.498716,0.428793,in between


In [37]:
between_matrix = build_decision_matrix(X_train, X_test, y_train, y_test, clf, le)


The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.



In [38]:
between_matrix['label'] = 'in between'
between_matrix.head()

Unnamed: 0,state_categorical,place,lon,lat,prob,label
0,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[Alabama, Arizona, Arkansas, California, Color...",-125.625512,23.498716,"[0.0, 0.0302, 0.0, 0.4288, 0.0, 0.0, 0.0, 0.0,...",in between
1,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[Alabama, Arizona, Arkansas, California, Color...",-124.875512,23.498716,"[0.0, 0.0302, 0.0, 0.4288, 0.0, 0.0, 0.0, 0.0,...",in between
2,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[Alabama, Arizona, Arkansas, California, Color...",-124.125512,23.498716,"[0.0, 0.0302, 0.0, 0.4288, 0.0, 0.0, 0.0, 0.0,...",in between
3,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[Alabama, Arizona, Arkansas, California, Color...",-123.375512,23.498716,"[0.0, 0.0302, 0.0, 0.4288, 0.0, 0.0, 0.0, 0.0,...",in between
4,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[Alabama, Arizona, Arkansas, California, Color...",-122.625512,23.498716,"[0.0, 0.0302, 0.0, 0.4288, 0.0, 0.0, 0.0, 0.0,...",in between


## The Best Result (Gradient Boosted Tree)

Here, we'll use the same parameters identified in the above section to fit a gradient boosted tree. There are likely better parameters for this type of model, but tuning isn't really needed for this example, since we expect the gradient boosted tree to perform better than the random forest.

In [39]:
clf = GradientBoostingClassifier(min_samples_leaf=30, max_depth=200).fit(X_train, y_train)

In [40]:
plot_decision_surface(X_train, X_test, y_train, y_test, clf, le, labels=list(set(df['state'].values)))

This is the format of your plot grid:
[ (1,1) x1,y1 ]




The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.



Aha!! WOW. What a difference. Compare this to the overfit and underfit model plots and you'll soon realize the meaning of a "well" fit model (again, "well" is in quotes because we skipped many of the usual model-training steps here, but you get the idea).

In [41]:
gbd_surface = build_decision_surface(X_train, X_test, y_train, y_test, clf, le)


The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.



In [42]:
gbd_surface['label'] = 'gradient boosted classifier'
gbd_surface.head()

Unnamed: 0,state_categorical,place,lon,lat,prob,label
0,3,California,-125.625512,23.498716,0.999988,gradient boosted classifier
1,3,California,-124.875512,23.498716,0.999988,gradient boosted classifier
2,3,California,-124.125512,23.498716,0.99999,gradient boosted classifier
3,3,California,-123.375512,23.498716,0.999984,gradient boosted classifier
4,3,California,-122.625512,23.498716,0.999988,gradient boosted classifier


In [43]:
gbd_matrix = build_decision_matrix(X_train, X_test, y_train, y_test, clf, le)


The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.



In [44]:
gbd_matrix['label'] = 'gradient boosted classifier'
gbd_matrix.head()

Unnamed: 0,state_categorical,place,lon,lat,prob,label
0,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[Alabama, Arizona, Arkansas, California, Color...",-125.625512,23.498716,"[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",gradient boosted classifier
1,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[Alabama, Arizona, Arkansas, California, Color...",-124.875512,23.498716,"[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",gradient boosted classifier
2,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[Alabama, Arizona, Arkansas, California, Color...",-124.125512,23.498716,"[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",gradient boosted classifier
3,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[Alabama, Arizona, Arkansas, California, Color...",-123.375512,23.498716,"[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",gradient boosted classifier
4,"[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,...","[Alabama, Arizona, Arkansas, California, Color...",-122.625512,23.498716,"[0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",gradient boosted classifier


# Export

While the plots above start to give a sense of how each model fits our geographic surface, an interactive visualization would allow for better understanding. We can export the decision surface information to be displayed externally, such as in the d3 example in this repository.

In [47]:
df_to_viz = pd.concat([underfit_surface, overfit_surface, between_surface, gbd_surface])
df_to_viz.tail()
df_to_viz.to_csv('../visualization/predicted-states-4.csv')

In [48]:
df_to_viz = pd.concat([underfit_matrix, overfit_matrix, between_matrix, gbd_matrix])
df_to_viz.tail()
df_to_viz.to_csv('../visualization/predicted-states-allprobas-4.csv')