# Learning Curves

TODO
Suppose we have a predictor (a classifier, for this example).
After we have fit our classifier, how do we know we are done fitting?
How do we know whether putting more time or data into our classifier would give better results?
There are a lot of rigorous mathematical answers to these questions,
but a simpler, visual way to help answer them is to use learning curves.

TODO
A [learning curve](https://en.wikipedia.org/wiki/Learning_curve) is a graph that shows a model's performance as either the amount of training data or the training time (represented by the number of iterations of some algorithm) increases.
You can actually put whatever parameter you want on the x-axis, as long as it represents some sort of ordered progression.
Below is an image of a typical learning curve for a hypothetical classifier.

TODO
IMAGE

As we can see in the image above, a typical learning curve starts with a steep increase in performance.
The classifier from this image starts out guessing class labels at random, and just a few pieces of data are enough to start pointing it in the right direction.
Eventually, the rate of improvement slows down.
This means that the model has seen enough data to do well, and each additional piece of data is less useful than the last.
Finally, we hit the plateau, where more data helps only slightly or not at all.
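
The three phases described above (steep rise, slowdown, plateau) can also be read off as numbers instead of a picture.
The following is a minimal sketch, not the notebook's own method: it assumes a small synthetic dataset and a shallow decision tree (both chosen only for illustration), and uses `sklearn.model_selection.learning_curve` to score the classifier at several training set sizes.
The difference between consecutive scores is a rough measure of how much each extra batch of data helped.

```python
import numpy
import sklearn.datasets
import sklearn.model_selection
import sklearn.tree

# Hypothetical synthetic data, used only for illustration.
features, labels = sklearn.datasets.make_classification(
    n_samples = 500, n_features = 10, n_informative = 5, n_redundant = 0,
    random_state = 4,
)

# Any classifier would work here; a shallow tree is an arbitrary choice.
classifier = sklearn.tree.DecisionTreeClassifier(max_depth = 5, random_state = 4)

# Score the classifier on five increasing fractions of the training data,
# using 5-fold cross-validation at each size.
train_sizes, train_scores, validation_scores = sklearn.model_selection.learning_curve(
    classifier, features, labels,
    train_sizes = numpy.linspace(0.1, 1.0, 5),
    cv = 5,
    scoring = 'accuracy',
    shuffle = True,
    random_state = 4,
)

# Average over the folds to get one point per training size.
mean_validation = validation_scores.mean(axis = 1)

# The change between consecutive points approximates the slope of the curve.
gains = numpy.diff(mean_validation)

for size, score in zip(train_sizes, mean_validation):
    print("%4d examples -> validation accuracy %.3f" % (size, score))

print("Gain between consecutive sizes:", numpy.round(gains, 3))
```

If the later gains are close to zero, the curve has likely reached its plateau for this model and dataset.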

In [None]:
# TODO: TEST
import matplotlib.pyplot
import pandas
import sklearn.datasets
import sklearn.inspection
import sklearn.model_selection
import sklearn.svm
import sklearn.tree

# TEST

# n_samples = 1000 -- Make 1000 data points.
# n_classes = 2 -- Use two class labels (binary classification).
# n_features = 20 -- Generate twenty feature columns.
# n_informative = 5 -- Make five of the features useful (and not just random).
# n_redundant = 15 -- Make the other fifteen features redundant (they repeat information from the informative features).
# n_clusters_per_class = 5 -- Give each class five clusters, which makes the data harder to separate.
# random_state = 4 -- The seed for the random number generator.
#                     The exact number doesn't matter, the same seed will generate the same data.
features, labels = sklearn.datasets.make_classification(
    n_samples = 1000,
    n_classes = 2,
    n_features = 20,
    n_informative = 5,
    n_redundant = 15,
    n_clusters_per_class = 5,
    # n_features = 10,
    # n_redundant = 0, n_informative = 2,
    random_state = 4,
    # n_clusters_per_class = 1
)

# Turn the features into a frame, the labels can stay as a list.
# features = pandas.DataFrame(features, columns = ['A', 'B'])
features = pandas.DataFrame(features)

''' TEST
# Split the data into train and test data.
# Note that we are not being rigorous and making sure the splits have the same label breakdown.

train_features = all_features[:100]
train_labels = all_labels[:100]

test_features = all_features[100:]
test_labels = all_labels[100:]

print(train_features[0:10])
print('---')
print(train_labels[0:10])
'''

print(features[0:10])
print('---')
print(labels[0:10])

In [None]:
# TEST

# Make a decision tree classifier.
# A logistic regression alternative is kept commented out below
# (we will discuss its `penalty` argument later).

'''
classifier = sklearn.linear_model.LogisticRegression(random_state = 4,
                                                     # penalty = None,
                                                     max_iter = 1000000,
                                                     solver = 'liblinear',
                                                     C = 100000,
                                                     )
'''

classifier = sklearn.tree.DecisionTreeClassifier(random_state = 4,
                                                 max_depth = 10,
                                                 # scikit-learn requires an integer min_samples_split to be at least 2.
                                                 min_samples_split = 2,
                                                 )

# figure, axis = matplotlib.pyplot.subplots(1, 1, figsize = (6, 6))
# axis.set_title("Learning Curve for Decision Tree")

# This function will split the data into train/test for us.
sklearn.model_selection.LearningCurveDisplay.from_estimator(
    classifier, features, labels,
    # cv = sklearn.model_selection.ShuffleSplit(n_splits = 50, test_size = 0.5, random_state = 0),
    # train_sizes = [0.01, 0.02, 0.03],
    score_type = "both",
    score_name = "Accuracy",
    random_state = 4,
    
    # ax = axis,
)

Decision trees are notorious for overfitting:
they can easily become too complex.

Instead of putting the number of training examples on the x-axis,
we are going to use the max tree depth.
This is a stand-in for how complex the model is allowed to be.
A deeper tree is generally more complex than a shallower one.
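
To make the link between depth and complexity concrete, here is a small sketch that counts the leaves of trees fit with different depth limits.
The dataset and the list of depth limits are assumptions chosen for illustration; `get_n_leaves()` and `get_depth()` are methods of a fitted `DecisionTreeClassifier`.

```python
import sklearn.datasets
import sklearn.tree

# Hypothetical synthetic data, used only for illustration.
features, labels = sklearn.datasets.make_classification(
    n_samples = 500, n_features = 10, n_informative = 5, n_redundant = 0,
    random_state = 4,
)

# Arbitrary depth limits to compare.
depth_limits = [1, 2, 4, 8, 16]
leaf_counts = []
actual_depths = []

for depth_limit in depth_limits:
    tree = sklearn.tree.DecisionTreeClassifier(max_depth = depth_limit, random_state = 4)
    tree.fit(features, labels)

    # The number of leaves is one simple measure of how complex the tree is.
    leaf_counts.append(tree.get_n_leaves())

    # The tree may stop before the limit if it runs out of useful splits.
    actual_depths.append(tree.get_depth())

    print("max_depth = %2d -> actual depth: %2d, leaves: %3d" % (
        depth_limit, actual_depths[-1], leaf_counts[-1]))
```

A tree limited to depth `d` can have at most `2 ** d` leaves, so raising the limit raises the ceiling on how finely the tree can carve up the data.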

In [None]:
# evaluate decision tree performance on train and test sets with different tree depths
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from matplotlib import pyplot
# create dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=5, n_redundant=15, random_state=4, n_clusters_per_class = 5)
# split into train test sets (seeded so the split is reproducible)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=4)
# define lists to collect scores
train_scores, test_scores = list(), list()
# define the tree depths to evaluate
values = [i for i in range(1, 41)]
# evaluate a decision tree for each depth
for i in values:
	# configure the model
	model = DecisionTreeClassifier(max_depth=i, random_state=4)
	# fit model on the training dataset
	model.fit(X_train, y_train)
	# evaluate on the train dataset
	train_yhat = model.predict(X_train)
	train_acc = accuracy_score(y_train, train_yhat)
	train_scores.append(train_acc)
	# evaluate on the test dataset
	test_yhat = model.predict(X_test)
	test_acc = accuracy_score(y_test, test_yhat)
	test_scores.append(test_acc)
	# summarize progress
	print('>%d, train: %.3f, test: %.3f' % (i, train_acc, test_acc))
# plot of train and test scores vs tree depth
pyplot.plot(values, train_scores, '-o', label='Train')
pyplot.plot(values, test_scores, '-o', label='Test')
pyplot.legend()
pyplot.show()

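This is the same type of pattern we saw before:
a sharp increase, a transition, and a plateau.
Except this time, the test accuracy also gets slightly worse as the depth keeps increasing, while the train accuracy keeps climbing.
That widening gap between train and test accuracy is a sign of overfitting.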
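
The gap between the train and test curves can be measured directly.
The sketch below is one possible way to do it, not the only one: it assumes a smaller dataset than the cell above (to keep it quick) and uses `sklearn.model_selection.validation_curve`, which sweeps a single parameter with cross-validation instead of a single train/test split.
The depth range of 1 to 20 is an arbitrary choice.

```python
import numpy
import sklearn.datasets
import sklearn.model_selection
import sklearn.tree

# Hypothetical synthetic data, used only for illustration.
features, labels = sklearn.datasets.make_classification(
    n_samples = 1000, n_features = 20, n_informative = 5, n_redundant = 15,
    n_clusters_per_class = 5, random_state = 4,
)

depths = list(range(1, 21))

# Fit and score a tree for every depth, using 5-fold cross-validation.
train_scores, validation_scores = sklearn.model_selection.validation_curve(
    sklearn.tree.DecisionTreeClassifier(random_state = 4),
    features, labels,
    param_name = 'max_depth',
    param_range = depths,
    cv = 5,
    scoring = 'accuracy',
)

# Average over the folds to get one value per depth.
mean_train = train_scores.mean(axis = 1)
mean_validation = validation_scores.mean(axis = 1)

# A large gap means the tree fits the training data much better than unseen data.
gaps = mean_train - mean_validation

# The depth with the highest validation accuracy.
best_index = int(numpy.argmax(mean_validation))

print("Best depth: %d (validation accuracy %.3f)" % (
    depths[best_index], mean_validation[best_index]))
print("Gap at best depth: %.3f, gap at depth %d: %.3f" % (
    gaps[best_index], depths[-1], gaps[-1]))
```

Picking the depth where validation accuracy peaks, instead of the deepest tree, is one simple way to act on what the curve shows.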

In [None]:
import matplotlib.pyplot
import pandas
import sklearn.datasets
import sklearn.inspection
import sklearn.svm

FIGURE_SIZE = 5
FIGURE_RESOLUTION = 500

def visualize_decision_boundary(classifier,
                                train_features, train_labels,
                                test_features, test_labels,
                                title = None):
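    """
    Visualize the decision boundary of a trained binary classifier
    using the FIRST TWO columns of the passed in features.
    Because the boundary is drawn on a two-dimensional grid,
    the classifier must have been fit on exactly two features,
    and the passed in features should contain only those two columns.

    The train data will be plotted in a lighter shade than the background with a white outline.
    The test data will be plotted in a darker shade than the background with a black outline.
    """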

    figure, axis = matplotlib.pyplot.subplots(1, 1, figsize = (FIGURE_SIZE, FIGURE_SIZE))
                                    
    matplotlib.pyplot.suptitle(title)

    # Score the classifier.
    train_accuracy = classifier.score(train_features, train_labels)
    test_accuracy = classifier.score(test_features, test_labels)
    matplotlib.pyplot.title("Train Accuracy: %3.2f, Test Accuracy: %3.2f" % (train_accuracy,
                                                                             test_accuracy))

    all_features = pandas.concat([train_features, test_features])
    
    # TEST
    print(all_features.columns)
    print(train_features.columns)
                                    
    # Draw the decision boundary.
    sklearn.inspection.DecisionBoundaryDisplay.from_estimator(
        classifier, all_features[[all_features.columns[0], all_features.columns[1]]],
        response_method = "predict", ax = axis,
        xlabel = all_features.columns[0], ylabel = all_features.columns[1],
        grid_resolution = FIGURE_RESOLUTION,
        cmap = 'RdBu', alpha = 0.50,
    )

    # Display the train data points.
    axis.scatter(
        train_features[train_features.columns[0]], train_features[train_features.columns[1]],
        c = train_labels,
        cmap = 'RdBu', alpha = 0.25, edgecolor = 'w',
    )

    # Display the test data points.
    axis.scatter(
        test_features[test_features.columns[0]], test_features[test_features.columns[1]],
        c = test_labels,
        cmap = 'RdBu', alpha = 0.75, edgecolor = 'k',
    )

In [None]:
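# TEST

# The boundary plot needs a classifier fit on exactly two features, so refit on the first two columns.
columns = ['feature_0', 'feature_1']
train_frame = pandas.DataFrame(X_train[:, :2], columns = columns)
test_frame = pandas.DataFrame(X_test[:, :2], columns = columns)
boundary_model = DecisionTreeClassifier(max_depth = 10, random_state = 4).fit(train_frame, y_train)
visualize_decision_boundary(boundary_model, train_frame, y_train, test_frame, y_test)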

https://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html