# **Image Recognition**: Neural Nets

Source:  [https://github.com/d-insight/code-bank.git](https://github.com/d-insight/code-bank.git)  
License: [MIT License](https://opensource.org/licenses/MIT). See open source [license](LICENSE) in the Code Bank repository. 

-------------

## Overview

In this demo we perform a common image classification task using the MNIST dataset. The dataset consists of images of hand-written digits, and the goal is to predict the digit shown in each image. Every image is gray-scale and measures 28 x 28 pixels. The input data contains the intensity of each pixel, listed row by row starting from the top-left corner (784 pixels in total). The `label` field holds the digit associated with each image. 
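
As a small illustration of the row-by-row pixel layout described above, the sketch below maps a flat pixel index to its (row, column) position and back. It uses a synthetic array rather than the actual dataset, and the helper names `flat_to_row_col` and `row_col_to_flat` are hypothetical, introduced only for this example.

```python
import numpy as np

IMG_SIDE = 28  # each image is 28 x 28 pixels

def flat_to_row_col(idx, side=IMG_SIDE):
    """Map a flat, row-major pixel index to (row, column)."""
    return divmod(idx, side)

def row_col_to_flat(row, col, side=IMG_SIDE):
    """Map (row, column) back to the flat, row-major pixel index."""
    return row * side + col

# Synthetic stand-in for one flattened image: values 0..783
flat_image = np.arange(IMG_SIDE * IMG_SIDE)

# NumPy reshapes in row-major order by default, matching the layout above
image_2d = flat_image.reshape(IMG_SIDE, IMG_SIDE)

row, col = flat_to_row_col(30)
print(row, col)            # 1 2
print(image_2d[row, col])  # 30
```

Flat index 30 therefore lands in the second row, third column, which is why a plain `reshape(28, 28)` is enough to recover the image later in the notebook.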

The state-of-the-art model achieves an error rate of only 0.23% (Ciresan et al., CVPR 2012). In this demo we reach an error rate of approximately 1.2%. 

<img src="https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2019/02/Plot-of-a-Subset-of-Images-from-the-MNIST-Dataset.png" width="700" height="500" align="center"/>


Image source: https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2019/02/Plot-of-a-Subset-of-Images-from-the-MNIST-Dataset.png

For an interactive hand-written image classification demo, visit: https://mnist-demo.herokuapp.com/ 

Dataset source: http://yann.lecun.com/exdb/mnist/

-------------

## **Part 0**: Setup

In [None]:
# Put all import statements at the top of your notebook

# Standard imports
import pandas as pd
import numpy as np
import itertools

# Data science packages
from sklearn.model_selection import learning_curve, validation_curve, StratifiedShuffleSplit, train_test_split, StratifiedKFold
from sklearn.metrics         import confusion_matrix
from sklearn.svm             import SVC
from sklearn.ensemble        import RandomForestClassifier
from sklearn.neighbors       import KNeighborsClassifier
from sklearn.dummy           import DummyClassifier
from sklearn.linear_model    import LogisticRegression
from xgboost                 import XGBClassifier

# Neural networks
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPool2D
from tensorflow.keras.utils  import model_to_dot, to_categorical

# Visualization packages
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import SVG

import warnings
warnings.simplefilter('ignore')

%matplotlib inline


In [None]:
# Set constants 
SCORE    = 'accuracy'
N_CORES  = -1  # use all cores available
SEED     =  0
N_SPLITS =  3

In [None]:
def plot_validation_curve(train_scores, cv_scores, x_data, scale='lin', title='', y_label='', x_label=''):
    """
    Plot validation and learning curves 
    
    Parameter: 
        train_scores : first element of what validation_curve() object from sklearn returns
        cv_scores : second element of what validation_curve() object from sklearn returns
        x_data (list) : tuning parameter range to plot on x axis 
        scale (str) : 'lin' or 'log' for linear or logarithmic scale
        title (str) : plot title 
        y_label (str) : y label 
        x_label (str) : x label 
    
    Returns: 
        None
        
    """  
    
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    cv_scores_mean = np.mean(cv_scores, axis=1)
    cv_scores_std = np.std(cv_scores, axis=1)
    
    plt.title(title)
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.ylim(0.0, 1.1)
    lw = 2
    
    plt.fill_between(x_data, train_scores_mean - train_scores_std,train_scores_mean + train_scores_std, alpha=0.2, color="r", lw=lw)
    plt.fill_between(x_data, cv_scores_mean - cv_scores_std, cv_scores_mean + cv_scores_std, alpha=0.2, color="g", lw=lw)
    
    if (scale == 'lin'):
        plt.plot(x_data, train_scores_mean, 'o-', color="r", label="Training score")
        plt.plot(x_data, cv_scores_mean, 'o-', color="g",label="Cross-validation score")
    elif (scale == 'log'):
        plt.semilogx(x_data, train_scores_mean, 'o-', color="r", label="Training score")
        plt.semilogx(x_data, cv_scores_mean, 'o-', color="g",label="Cross-validation score")
    plt.grid()
    plt.legend(loc="best")
    plt.show()
    
def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion Matrix', cmap=plt.cm.Reds):
    """
    Print and plot the confusion matrix
    
    Parameter: 
        cm : confusion_matrix() object from sklearn
        classes (list) : class labels used for the axis ticks
        normalize (bool) : if True, divide each row by its sum so entries are fractions of the actual class
        title (str) : plot title 
        cmap : matplotlib color scheme 
        
    Returns: 
        None
        
    """  
    
    # Normalize before plotting so that the colors and the printed values agree
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    
    thresh = cm.max() / 2.
    
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, round (cm[i, j],2), horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('Actual label')
    plt.xlabel('Predicted label')
    
def plot_learning_curve(train_scores, cv_scores, x_data, scale='lin', title='', y_label='', x_label=''):
    """
    Plot learning curve for different data sizes 
    
    Parameter: 
        train_scores : second element of what learning_curve() object from sklearn returns (training scores)
        cv_scores : third element of what learning_curve() object from sklearn returns (CV scores)
        x_data (list) : first element of what learning_curve() object from sklearn returns (train sizes)
        scale (str) : 'lin' or 'log' for linear or logarithmic scale
        title (str) : plot title 
        y_label (str) : y label 
        x_label (str) : x label 
    
    Returns: 
        None
        
    """  
    
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    cv_scores_mean = np.mean(cv_scores, axis=1)
    cv_scores_std = np.std(cv_scores, axis=1)
    
    plt.title(title)
    plt.xlabel(x_label)
    plt.ylabel(y_label)
    plt.ylim(0.0, 1.1)
    lw = 2
    
    plt.fill_between(x_data, train_scores_mean - train_scores_std,train_scores_mean + train_scores_std, alpha=0.2, color="r", lw=lw)
    plt.fill_between(x_data, cv_scores_mean - cv_scores_std, cv_scores_mean + cv_scores_std, alpha=0.2, color="g", lw=lw)
    
    if (scale == 'lin'):
        plt.plot(x_data, train_scores_mean, 'o-', color="r", label="Training score")
        plt.plot(x_data, cv_scores_mean, 'o-', color="g",label="Cross-validation score")
    elif (scale == 'log'):
        plt.semilogx(x_data, train_scores_mean, 'o-', color="r", label="Training score")
        plt.semilogx(x_data, cv_scores_mean, 'o-', color="g",label="Cross-validation score")
    plt.grid()
    plt.legend(loc="best")
    plt.show()

## **Part 1**: Data Preprocessing and EDA

In [None]:
# Load data into a dataframe

data = pd.read_csv('image_data.csv')
data.head()

In [None]:
# Dimensions of data

data.shape

In [None]:
# investigate basic statistics of data

data.describe()

In [None]:
# Drop missing value, if any

data.dropna(inplace=True)
data.shape

In [None]:
# Separate features and target

target = data['label'].values.ravel()
features = data.iloc[:, 1:].values

In [None]:
# Normalize features to be between 0 and 1

features = features / 255.0

In [None]:
# Check distribution of labels

sns.countplot(x=target)

In [None]:
# Visualize values of feature and its visual representation for an arbitrary data point

# Define a subplot with two rows and one column
fig, ax = plt.subplots(2, 1, figsize=(12,6))
dp_id = 100

# Plot features representation in 1D
ax[0].plot(features[dp_id])
ax[0].set_title('Unravelled image in format 784x1')
print()

# Plot features representation in 2D
ax[1].imshow(features[dp_id].reshape(28,28), cmap='gray')
ax[1].set_title('Corresponding image in original format 28x28')

# Show the plot
plt.show()

## **Part 2**: Define Cross-Validation Schema and Dummy Classifier Baseline

In [None]:
# Divide data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=SEED, stratify=target)

In [None]:
# Define a baseline using a dummy classifier
# DummyClassifier builds a basic classifier from simple strategies such as 'stratified'
# The stratified strategy predicts labels at random, following the class distribution of the training set

dummy_clf = DummyClassifier(strategy='stratified', random_state = SEED)
dummy_clf.fit(X_train, y_train)
baseline_score = dummy_clf.score(X_test, y_test)
print('Baseline accuracy: {}'.format(round(baseline_score, 4)))
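
To get a feel for the accuracy to expect from this baseline: a stratified dummy classifier draws each prediction at random from the training class distribution, independently of the input. If class k has proportion p_k in both the training and test sets, the expected accuracy is the sum of p_k squared over all classes. The sketch below works this out for hypothetical class proportions (not measured from the dataset); the function name is an assumption made for illustration.

```python
import numpy as np

def expected_stratified_accuracy(class_proportions):
    """Expected accuracy of random guessing that follows the class distribution."""
    p = np.asarray(class_proportions, dtype=float)
    # A prediction is correct when the true class and the guessed class coincide,
    # which happens with probability p_k * p_k for each class k
    return float(np.sum(p ** 2))

# Ten perfectly balanced classes (hypothetical)
balanced = np.full(10, 0.1)
print(round(expected_stratified_accuracy(balanced), 4))  # 0.1

# Two heavily imbalanced classes (hypothetical)
print(round(expected_stratified_accuracy([0.9, 0.1]), 4))  # 0.82
```

With ten roughly balanced digit classes, a baseline accuracy near 10% is therefore what one would expect, and any useful model should clear it by a wide margin.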

## **Part 3**: Prediction using Logistic Regression with Ridge regularization

Note: For parameter `C` in `LogisticRegression`, smaller values specify stronger regularization.
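
The effect of `C` can be sketched on a small synthetic problem: with the default L2 penalty, smaller values of `C` shrink the fitted coefficients more strongly. The data below comes from `make_classification` and is an assumption made purely for illustration; it is not the MNIST data, and the specific values of `C` are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, used only for illustration
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    # The default penalty of LogisticRegression is L2 (ridge)
    clf = LogisticRegression(C=C, max_iter=5000)
    clf.fit(X, y)
    # Euclidean norm of the coefficient vector as a measure of shrinkage
    coef_norms[C] = float(np.linalg.norm(clf.coef_))

for C, norm in coef_norms.items():
    print('C = {:<6} ||coef|| = {:.3f}'.format(C, norm))
```

The printed norms should grow as `C` grows, which is the sense in which smaller `C` means stronger regularization.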

We build the prediction model in three steps:

1. Tune the model hyperparameter(s) using cross-validation on the training set.
2. Fit the model with the tuned hyperparameter(s) on the training set.
3. Evaluate the performance of the model on the test set.

In [None]:
# Define model
model = LogisticRegression(penalty='l2', n_jobs = N_CORES, multi_class='multinomial', solver='saga', random_state=SEED)

# Define cross-validation schema for tuning
cv_schema = StratifiedKFold(n_splits = N_SPLITS, shuffle=True, random_state = SEED)

In [None]:
# Tune model against a single hyper parameter C using validation_curve function

tuning_param = 'C'
tuning_param_range = np.logspace(-5, 5, 5)

train_scores_val, cv_scores_val = validation_curve(
    model, X_train, y_train, param_name = tuning_param, param_range = tuning_param_range,
    cv = cv_schema, scoring = SCORE, n_jobs = N_CORES)


In [None]:
# Plot validation curve
plot_validation_curve(train_scores_val, cv_scores_val, tuning_param_range, scale='log',
                      title='Validation curve: logistic regression', y_label='accuracy', x_label='C')

In [None]:
# Obtain the best value of the hyper parameter

best_param_val = tuning_param_range[np.argmax(np.mean(cv_scores_val, axis=1))]
print('Best C: {}'.format(best_param_val))
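
The selection rule used above can be traced on a toy example. `validation_curve` returns score arrays with one row per candidate value and one column per cross-validation fold, so averaging over `axis=1` gives one mean score per candidate. The scores below are made up for illustration and do not come from the actual run.

```python
import numpy as np

# Hypothetical candidate values and CV scores (rows: candidates, columns: folds)
param_range = np.array([0.01, 1.0, 100.0])
cv_scores = np.array([
    [0.80, 0.82, 0.81],  # scores for 0.01
    [0.91, 0.90, 0.92],  # scores for 1.0
    [0.89, 0.93, 0.88],  # scores for 100.0
])

mean_per_param = cv_scores.mean(axis=1)    # average over folds
best_idx = int(np.argmax(mean_per_param))  # index of the highest mean score
best_value = param_range[best_idx]

print(best_value)  # 1.0
```

Note that `np.argmax` returns the first index in case of ties, so when several candidates share the top score the earliest one in the range is chosen.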

In [None]:
%%time
# Train model with best hyper parameter and assess its performance on the test data
lr_clf = LogisticRegression(C = best_param_val, n_jobs = N_CORES, multi_class='multinomial', solver='saga', random_state=SEED)
lr_clf.fit(X_train,y_train)
lr_score = lr_clf.score(X_test, y_test)
print('LR with Ridge accuracy: {}\n'.format(round(lr_score, 4)))

## **Part 4**: Prediction using KNN Classifier

In [None]:
# Set parameters, model and cv schema
model     = KNeighborsClassifier(n_jobs = N_CORES)
cv_schema = StratifiedKFold(n_splits = N_SPLITS, shuffle=True, random_state = SEED)

In [None]:
%%time
# Tune model against a single hyper parameter

tuning_param = 'n_neighbors'
tuning_param_range = []
for i in np.linspace(2.0, 10.0, 2):
    tuning_param_range.append(int(i))

train_scores_val, cv_scores_val = validation_curve(
    model, X_train, y_train, param_name = tuning_param, param_range = tuning_param_range,
    cv = cv_schema, scoring = SCORE, n_jobs = N_CORES)

In [None]:
# Plot validation curve
plot_validation_curve(train_scores_val, cv_scores_val, tuning_param_range, scale='lin', 
                      title='Validation curve: KNN', y_label='accuracy', x_label='n_neighbors')

In [None]:
# Obtain the best value of the hyper parameter

best_param_val = tuning_param_range[np.argmax(np.mean(cv_scores_val, axis=1))]
print('Best n_neighbors: {}'.format(best_param_val))

In [None]:
# Train model with best hyper parameter and assess its performance on test data

knn_clf = KNeighborsClassifier(n_neighbors = best_param_val, n_jobs = N_CORES)
knn_clf.fit(X_train,y_train)
knn_score = knn_clf.score(X_test, y_test)
print('KNN accuracy: {}'.format(round(knn_score, 4)))

## **Part 5**: Prediction using Random Forest

In [None]:
# Set parameters, model and cv schema
model     = RandomForestClassifier(n_jobs = N_CORES, random_state = SEED)
cv_schema = StratifiedKFold(n_splits = N_SPLITS, shuffle=True, random_state = SEED)

In [None]:
%%time
# Tune model against a single hyper parameter

tuning_param = 'n_estimators'
tuning_param_range = []
for i in np.linspace(10.0, 150.0, 5):
    tuning_param_range.append(int(i))

train_scores_val, cv_scores_val = validation_curve(
    model, X_train, y_train, param_name = tuning_param, param_range = tuning_param_range,
    cv = cv_schema, scoring = SCORE, n_jobs = N_CORES)

In [None]:
# Plot validation curve

plot_validation_curve(train_scores_val, cv_scores_val, tuning_param_range, scale='lin', 
                      title='Validation curve: random forest', y_label='accuracy', x_label='n_estimators')

In [None]:
# Obtain the best value of the hyper parameter

best_param_val = tuning_param_range[np.argmax(np.mean(cv_scores_val, axis=1))]
print('Best n_estimators: {}'.format(best_param_val))

In [None]:
# Train model with best hyper parameter and assess its performance on test data

rf_clf = RandomForestClassifier(n_estimators = best_param_val, n_jobs = N_CORES, random_state = SEED)
rf_clf.fit(X_train,y_train)
rf_score = rf_clf.score(X_test,y_test)
print('RF accuracy: {}'.format(round(rf_score, 4)))

## **Part 6**: Prediction using Gradient Boosted Trees

In [None]:
# The XGBoost library has a more performant implementation of gradient boosted trees
model     = XGBClassifier(n_jobs = N_CORES, random_state = SEED)
cv_schema = StratifiedKFold(n_splits = N_SPLITS, shuffle=True, random_state = SEED)

In [None]:
%%time
# Tune model against a single hyper parameter
tuning_param = 'n_estimators'
tuning_param_range = []
for i in np.linspace(30.0, 80.0, 10):
    tuning_param_range.append(int(i))

train_scores_val, cv_scores_val = validation_curve(
    model, X_train, y_train, param_name = tuning_param, param_range = tuning_param_range,
    cv = cv_schema, scoring = SCORE, n_jobs = N_CORES)

In [None]:
# Plot validation curve
plot_validation_curve(train_scores_val, cv_scores_val, tuning_param_range, scale='lin', 
                      title='Validation curve: gradient boosted trees', y_label='accuracy', x_label='n_estimators')

In [None]:
# Obtain the best value of the hyper parameter
best_param_val = tuning_param_range[np.argmax(np.mean(cv_scores_val, axis=1))]
print('Best n_estimators: {}'.format(best_param_val))

In [None]:
# Train model with best hyper parameter and assess its performance on test data

gb_clf = XGBClassifier(n_estimators = best_param_val, n_jobs = N_CORES, random_state=SEED)
gb_clf.fit(X_train, y_train)
gb_score = gb_clf.score(X_test,y_test)
print('Gradient boosted accuracy: {}'.format(round(gb_score, 4)))

## **Part 7**: Prediction using SVC

In [None]:
# Set parameters, model and cv schema
model     = SVC(random_state = SEED)
cv_schema = StratifiedKFold(n_splits = N_SPLITS, shuffle=True, random_state = SEED)

In [None]:
%%time
# Tune model against a single hyper parameter

tuning_param = 'C'
tuning_param_range = np.logspace(-5, 5, 5)

train_scores_val, cv_scores_val = validation_curve(
    model, X_train, y_train, param_name = tuning_param, param_range = tuning_param_range,
    cv = cv_schema, scoring = SCORE, n_jobs = N_CORES)

In [None]:
# Plot validation curve

plot_validation_curve(train_scores_val, cv_scores_val, tuning_param_range, scale='log', 
                      title='Validation curve: support vector classifier', y_label='accuracy', x_label='C')

In [None]:
# Obtain the best value of the hyper parameter

best_param_val = tuning_param_range[np.argmax(np.mean(cv_scores_val, axis=1))]
print('Best C: {}'.format(best_param_val))

In [None]:
# Train model with best hyper parameter and assess its performance on test data

svc_clf = SVC(C = best_param_val)
svc_clf.fit(X_train, y_train)
svc_score = svc_clf.score(X_test,y_test)
print('SVC accuracy: {}'.format(round(svc_score, 4)))

## **Part 8**: Prediction using Feed Forward Neural Network

In [None]:
# Encode target labels to one hot vectors (ex : 3 -> [0,0,0,1,0,0,0,0,0,0])

y_train_ohe = to_categorical(y_train, num_classes = 10)
y_test_ohe = to_categorical(y_test, num_classes = 10)
y_train_ohe[0]
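
One-hot encoding can also be reproduced with plain NumPy, which makes the mapping explicit. This is a minimal sketch with a hypothetical helper `one_hot`; it is not how `to_categorical` is implemented, only an equivalent illustration for integer labels.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return one row per label with a 1 at the label's index and 0 elsewhere."""
    labels = np.asarray(labels, dtype=int)
    # Row k of the identity matrix is the one-hot vector for class k
    return np.eye(num_classes, dtype='float32')[labels]

encoded = one_hot([3, 0, 9], num_classes=10)
print(encoded[0])  # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]

# argmax along each row recovers the original integer labels
decoded = np.argmax(encoded, axis=1)
print(decoded)     # [3 0 9]
```

The `argmax` step is the same operation used later in the notebook to turn one-hot targets and predicted probabilities back into class labels.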

In [None]:
# Set training parameters

epochs      = 10  # Number of iterations over full training set
batch_size  = 200 # Number of observations to fit in every batch 
num_pixels  = 28 * 28
num_classes = 10

In [None]:
# Create and compile a simple feed forward neural network

def ffnn_model():
    """
    Set up a feed-forward neural network with two dense layers
    
    Parameter: 
        None
    
    Returns: 
        model : Keras Sequential() model 
        
    """  
    
    # Create a neural network with two dense layers
    model = Sequential()
    model.add(Dense(num_pixels, input_dim=num_pixels, kernel_initializer='normal', activation='relu'))
    model.add(Dense(num_classes, kernel_initializer='normal', activation='softmax'))
    
    # Compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    
    return model

model = ffnn_model()

In [None]:
# Visualize keras model
SVG(model_to_dot(model, show_shapes=True, dpi=65).create(prog='dot', format='svg'))

In [None]:
# Fit the model to data

model.fit(X_train, y_train_ohe, validation_data = (X_test, y_test_ohe), 
          epochs = epochs, batch_size = batch_size)

In [None]:
# Get model summary

model.summary()
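
The parameter counts reported by `model.summary()` can be checked by hand. A dense layer has one weight per input-unit pair plus one bias per unit. The sketch below applies that rule to the two dense layers defined in `ffnn_model()`; the helper `dense_params` is a hypothetical name used only here.

```python
def dense_params(n_inputs, n_units):
    """Number of trainable parameters in a dense layer: weights plus biases."""
    return n_inputs * n_units + n_units

num_pixels = 28 * 28  # 784 inputs, and 784 units in the hidden layer
num_classes = 10      # one output unit per digit

hidden = dense_params(num_pixels, num_pixels)
output = dense_params(num_pixels, num_classes)

print(hidden, output, hidden + output)  # 615440 7850 623290
```

Almost all parameters sit in the hidden layer, which is a direct consequence of connecting every pixel to every hidden unit.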

In [None]:
# Evaluate trained model on test set

ff_loss, ff_score = model.evaluate(X_test, y_test_ohe)
print()
print('Loss (categorical cross-entropy):'.ljust(35) + str(round(ff_loss, 4)))
print('Accuracy:'.ljust(35) + str(round(ff_score, 4)))

In [None]:
# Get predicted values (class with the highest predicted probability)
predicted_classes = np.argmax(model.predict(X_test), axis=1)

# Get index of correctly and incorrectly classified observations
target_val_orig = np.argmax(y_test_ohe, 1)

# Get index list of all correctly predicted values
correct_indices = np.nonzero(np.equal(predicted_classes, target_val_orig))[0]

# Get index list of all incorrectly predicted values
incorrect_indices = np.nonzero(np.not_equal(predicted_classes, target_val_orig))[0]

# Print number of correctly and incorrectly classified observations
print ('Correctly predicted:'.ljust(30) + str(np.size(correct_indices)))
print ('Incorrectly predicted:'.ljust(30) + str(np.size(incorrect_indices)))

In [None]:
# See a sample of incorrectly classified samples
    
plt.figure(figsize=[20,8])
for i, incorrect in enumerate(incorrect_indices[:6]):
    plt.subplot(1,6,i+1)
    plt.imshow(X_test[incorrect].reshape(28,28), cmap='gray', interpolation='none')
    plt.title("Predicted {}, Class {}".format(predicted_classes[incorrect], target_val_orig[incorrect]))

## **Part 9**: Prediction using Convolutional Neural Network

In [None]:
# Reshape image in 3 dimensions (height = 28px, width = 28px, channel = 1)

X_train_reshaped = X_train.reshape(-1,28,28,1)
X_test_reshaped = X_test.reshape(-1,28,28,1)

In [None]:
# Access a single pixel value in reshaped arrays

X_train_reshaped[2800][27][27][0]

In [None]:
# Encode target labels to one hot vectors

y_train_ohe = to_categorical(y_train, num_classes = 10)
y_test_ohe  = to_categorical(y_test, num_classes = 10)

In [None]:
# Set training parameters

epochs     = 12  # number of iterations over full training set
batch_size = 128 # number of observations to fit in every batch 

In [None]:
# Set the CNN architecture (architecture taken from:
# https://machinelearningmastery.com/handwritten-digit-recognition-using-convolutional-neural-networks-python-keras/)

def cnn_model():
    """
    Set up a convolutional neural network (architecture from: https://keras.io/examples/mnist_cnn/)
    
    Parameter: 
        None
    
    Returns: 
        model : Keras Sequential() model 
        
    """ 
    
    model = Sequential()
    
    model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(28,28,1)))
    model.add(Conv2D(64, (3, 3), activation='relu'))
    model.add(MaxPool2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(10, activation='softmax'))
    
    # compile the model
    model.compile(optimizer = 'adam', loss = "categorical_crossentropy", metrics=["accuracy"])
    
    return (model)

model = cnn_model()

In [None]:
SVG(model_to_dot(model, show_shapes=True, dpi=65).create(prog='dot', format='svg'))

In [None]:
# Fit the model to data

model.fit(X_train_reshaped, y_train_ohe, validation_data=(X_test_reshaped, y_test_ohe), epochs=epochs, batch_size=batch_size)

In [None]:
# Get model summary

model.summary()
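
The shapes and parameter counts in the summary above can be reproduced with a few lines of arithmetic. The sketch assumes what the layer definitions in `cnn_model()` imply under Keras defaults: 3x3 kernels with stride 1 and no padding, followed by 2x2 max pooling. The helper names are hypothetical.

```python
def conv2d_params(kernel, in_channels, filters):
    """Weights (kernel x kernel x in_channels per filter) plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters

def conv2d_output_side(side, kernel):
    """Output side length for a stride-1 convolution without padding."""
    return side - kernel + 1

side = 28
conv1 = conv2d_params(3, 1, 32)      # 320 parameters
side = conv2d_output_side(side, 3)   # 26 x 26 feature maps
conv2 = conv2d_params(3, 32, 64)     # 18496 parameters
side = conv2d_output_side(side, 3)   # 24 x 24 feature maps
side = side // 2                     # 12 x 12 after 2x2 max pooling
flat = side * side * 64              # 9216 values after Flatten
dense1 = flat * 128 + 128            # 1179776 parameters
dense2 = 128 * 10 + 10               # 1290 parameters

total = conv1 + conv2 + dense1 + dense2
print(total)  # 1199882
```

Dropout and pooling layers add no trainable parameters, so the first dense layer after `Flatten` dominates the total.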

In [None]:
# Evaluate trained model on test set

cnn_loss, cnn_score = model.evaluate(X_test_reshaped, y_test_ohe)
print()
print('Loss (categorical cross-entropy):'.ljust(35) + str(round(cnn_loss, 4)))
print('Accuracy:'.ljust(35) + str(round(cnn_score, 4)))

In [None]:
# Get predicted values (class with the highest predicted probability)
predicted_classes = np.argmax(model.predict(X_test_reshaped), axis=1)

# Get index of correctly and incorrectly classified observations
target_val_orig = np.argmax(y_test_ohe, 1)

# Get index list of all correctly predicted values
correct_indices = np.nonzero(np.equal(predicted_classes, target_val_orig))[0]

# Get index list of all incorrectly predicted values
incorrect_indices = np.nonzero(np.not_equal(predicted_classes, target_val_orig))[0]

# Print number of correctly and incorrectly classified observations
print ('Correctly predicted:'.ljust(30) + str(np.size(correct_indices)))
print ('Incorrectly predicted:'.ljust(30) + str(np.size(incorrect_indices)))

In [None]:
# See a sample of incorrectly classified samples
    
plt.figure(figsize=[20,8])
for i, incorrect in enumerate(incorrect_indices[:6]):
    plt.subplot(1,6,i+1)
    plt.imshow(X_test[incorrect].reshape(28,28), cmap='gray', interpolation='none')
    plt.title("Predicted {}, Class {}".format(predicted_classes[incorrect], target_val_orig[incorrect]))

In [None]:
# Look at confusion matrix
y_pred_ohe = model.predict(X_test_reshaped)
y_pred_classes = np.argmax(y_pred_ohe, axis = 1) 
y_true = np.argmax(y_test_ohe, axis = 1) 
confusion_mtx = confusion_matrix(y_true, y_pred_classes) 
plot_confusion_matrix(confusion_mtx, classes = range(10)) 
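
To read a confusion matrix such as the one plotted above, it helps to recall that rows are actual classes and columns are predicted classes. The sketch below uses a made-up 3-class matrix (not results from this notebook) to show how row normalization yields per-class recall and how overall accuracy follows from the diagonal.

```python
import numpy as np

# Hypothetical confusion matrix (rows: actual class, columns: predicted class)
cm = np.array([
    [8, 1, 1],
    [0, 9, 1],
    [2, 0, 8],
])

# Divide each row by its total so entries become fractions of the actual class
row_normalized = cm / cm.sum(axis=1, keepdims=True)

# Diagonal of the row-normalized matrix: share of each class predicted correctly
recall = np.diag(row_normalized)
print(recall)  # [0.8 0.9 0.8]

# Overall accuracy: correctly classified observations over all observations
accuracy = np.trace(cm) / cm.sum()
print(round(accuracy, 4))  # 0.8333
```

Off-diagonal entries show which classes get mistaken for which, for example two observations of the third class being predicted as the first.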

## **SUMMARY OF ACCURACY SCORES**

In [None]:
width    = 35
models   = ['Baseline',     'LR + Ridge', 'KNN',     'Random Forest', 'Boosted Trees', 'SVC',     'NN',     'CNN']
results  = [baseline_score, lr_score,     knn_score, rf_score,        gb_score,        svc_score, ff_score, cnn_score]

print('', '=' * width, '\n', 'Summary of Accuracy Scores'.center(width), '\n', '=' * width)  
for i in range(len(models)):
    print(models[i].center(width-8), '{0:.4f}'.format(round(results[i], 4)))

## **Part 10**: Investment into More Data

In [None]:
# The learning_curve function in sklearn.model_selection assesses the performance of 
# a model with different training set sizes, which helps justify possible investments
# in gathering more data

# Here we train random forest classifier with increasing amount of training data
rf_clf      = RandomForestClassifier(n_estimators = 50, n_jobs = N_CORES)
cv_schema   = StratifiedShuffleSplit(n_splits = N_SPLITS, test_size = 0.33, random_state = SEED)
train_sizes = np.linspace(.1, 1.0, 5)

train_sizes, train_scores_learn, cv_scores_learn = learning_curve(
        rf_clf, features, target, cv = cv_schema, n_jobs = N_CORES, train_sizes = train_sizes, scoring = SCORE)

In [None]:
# Plot learning curve

plot_learning_curve(train_scores_learn, cv_scores_learn, train_sizes, scale='lin', 
                      title='Learning curve: random forest', y_label='accuracy', x_label='train set size')

## **Part 11**: Discussion

* How can one justify an investment in gathering additional data?
* How much more significant is an accuracy of 99.99% than one of 99.90%?
* What about 99.90% compared to 99.00%?
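
One possible starting point for the last two questions is to compare error counts rather than accuracies. The sketch below is only an arithmetic aid under the assumption of 10,000 predictions, a number chosen for illustration; the helper `errors_per` is hypothetical. It uses `Fraction` to avoid floating-point rounding.

```python
from fractions import Fraction

def errors_per(n_predictions, accuracy_percent):
    """Expected number of errors among n_predictions at a given accuracy in percent."""
    error_rate = 1 - Fraction(accuracy_percent) / 100
    return error_rate * n_predictions

n = 10000  # assumed number of predictions, for illustration only
for acc in ['99.00', '99.90', '99.99']:
    print(acc, errors_per(n, acc))
# 99.00 100
# 99.90 10
# 99.99 1

# Each step divides the number of errors by ten
print(errors_per(n, '99.90') / errors_per(n, '99.99'))  # 10
print(errors_per(n, '99.00') / errors_per(n, '99.90'))  # 10
```

Seen this way, both improvements cut the errors by the same factor, even though the accuracy gains differ in absolute size. Whether that reduction is worth the cost depends on what an individual error costs in the application.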