# Learning Curves

Let's say that we are creating some classifier.
After we have fit our classifier, how do we know we are done fitting?
How do we know that we cannot put more time or data into our classifier to get better results?
There are a lot of rigorous mathematical answers to these questions,
but a simpler, more visual way to help answer them is to use learning curves.

A [learning curve](https://en.wikipedia.org/wiki/Learning_curve) is a graph that shows a model's performance plotted against some parameter about how the model was created/trained.
You can actually put whatever parameter you want on the x-axis, as long as it represents some sort of ordered progression like
amount of training data, number of iterations, runtime, or model complexity.
Together, the two axes tell a story about how the model is progressing along that parameter
(e.g., how the model improves as more training data is added).
Below is an image of a typical learning curve for a hypothetical classifier that plots accuracy vs fit/training time.

<center><img src="basic-learning-curve.png" width="500px"/></center>
<center style='font-size: small'>Image from <a href='https://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html'>scikit-learn</a>.</center>

As we can see in the image above, a typical learning curve will start with a steep increase in performance.
This is because models will generally start from a randomly initialized state and will therefore just guess randomly.
Just a few pieces of labeled data will help the classifier do much better than random.
Eventually, we will see the improvement rate of the model slow down.
This indicates that the model has seen enough data to do well and more data is becoming less useful.
Finally, we will hit the plateau where more data either will not help at all or only slightly help.
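
The same shape can show up when the x-axis is the number of training passes instead of the amount of data.
The block below is only a rough sketch of that idea: it assumes a `SGDClassifier` (a model not used elsewhere in this example) trained one pass at a time with `partial_fit()`, and an 800/200 split of toy data chosen purely for illustration.

```python
import numpy
import sklearn.datasets
import sklearn.linear_model

# Toy data (illustrative only): 800 points for training, 200 for testing.
features, labels = sklearn.datasets.make_classification(n_samples = 1000, random_state = 4)
train_features, train_labels = features[:800], labels[:800]
test_features, test_labels = features[800:], labels[800:]

# A linear classifier that can be trained one pass (epoch) at a time.
classifier = sklearn.linear_model.SGDClassifier(random_state = 4)
classes = numpy.unique(train_labels)

# Record the test accuracy after each pass over the training data.
epoch_scores = []
for epoch in range(1, 21):
    classifier.partial_fit(train_features, train_labels, classes = classes)
    epoch_scores.append(classifier.score(test_features, test_labels))

for epoch, score in enumerate(epoch_scores, start = 1):
    print(f"pass {epoch:2d}: test accuracy {score:.3f}")
```

Plotting `epoch_scores` against the pass number would give a learning curve with training iterations on the x-axis.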

[Looking at the learning curve a model produces](https://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html)
can tell us a lot about how it was trained and what we can expect from it going forward.
For this example, we will look at some learning curves,
discuss the types of patterns we want to see from a learning curve,
and see what to do when our learning curves don't look right.

Let's start by creating some test data
(see the [decision boundary example](https://github.com/ucsc-cse-40/cse40-examples/blob/main/decision-boundaries/decision-boundaries.ipynb) for a discussion on creating toy data).

In [None]:
import matplotlib.pyplot
import numpy
import pandas
import sklearn.datasets
import sklearn.linear_model
import sklearn.model_selection
import sklearn.tree

# Make 1000 sample data points.
all_features, all_labels = sklearn.datasets.make_classification(
    n_samples = 1000,
    random_state = 4,
)

all_features = pandas.DataFrame(all_features)

train_features = all_features[:100]
train_labels = all_labels[:100]

test_features = all_features[100:]
test_labels = all_labels[100:]

print(all_features[0:10])
print('---')
print(all_labels[0:10])

Now that we have some data, let's just create a simple classifier and look at its learning curve.
To plot learning curves,
we will be using [sklearn.model_selection.LearningCurveDisplay.from_estimator()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LearningCurveDisplay.html#sklearn.model_selection.LearningCurveDisplay.from_estimator).

In [None]:
# Make a standard logistic regression classifier.
# We make sure to set the seed so the results are always consistent.
classifier = sklearn.linear_model.LogisticRegression(random_state = 4)

# Use scikit-learn to show us our learning curve.
# This function will split the data into train/test for us (using cross-validation),
# so we will pass in the full data.
# Setting `train_sizes` allows us to get more points on our graph than is standard.
# We explicitly ask for only the test score, since the default `score_type` differs between scikit-learn versions.
sklearn.model_selection.LearningCurveDisplay.from_estimator(
    classifier, all_features, all_labels,
    train_sizes = numpy.linspace(0.05, 1.0, 20),
    score_type = "test",
    score_name = "Accuracy",
    random_state = 4,
)

Here we can see a pretty standard learning curve for our logistic regression classifier.
There is a sharp increase in the beginning, a transition to an area of lower improvement, and a plateau.
There are some small blips up and down in the plateau, but overall the accuracy stays about the same.
Note that the shaded areas in this graph represent the standard deviation of the accuracy across the cross-validation splits,
so you can think of them as a rough indication of how much the score varies from split to split.
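
If you want the numbers behind the line and the shaded band, one option is `sklearn.model_selection.learning_curve()`, which returns the raw scores without plotting them.
The sketch below assumes the same toy data and scikit-learn's default 5-fold cross-validation; the variable names are only for illustration.

```python
import numpy
import sklearn.datasets
import sklearn.linear_model
import sklearn.model_selection

# Same toy data as in the rest of this example.
features, labels = sklearn.datasets.make_classification(n_samples = 1000, random_state = 4)

classifier = sklearn.linear_model.LogisticRegression(random_state = 4)

# The score arrays have one row per training size and one column per cross-validation split.
train_sizes, train_scores, test_scores = sklearn.model_selection.learning_curve(
    classifier, features, labels,
    train_sizes = numpy.linspace(0.05, 1.0, 20),
)

# The plotted line is the mean across splits,
# and the shaded band is one standard deviation around that mean.
test_means = test_scores.mean(axis = 1)
test_stds = test_scores.std(axis = 1)

for size, mean, std in zip(train_sizes, test_means, test_stds):
    print(f"{int(size):4d} training points: {mean:.3f} +/- {std:.3f}")
```

A wide band at small training sizes means the score depends heavily on which points happened to land in the training set.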

This graph is interesting, but when debugging and analyzing classifiers we want to add more to our graph.
We are currently displaying the score the classifier gets on the test data,
but it is also useful to see the score that the classifier gets on the training data.

In [None]:
classifier = sklearn.linear_model.LogisticRegression(random_state = 4)

sklearn.model_selection.LearningCurveDisplay.from_estimator(
    classifier, all_features, all_labels,
    train_sizes = numpy.linspace(0.05, 1.0, 20),
    score_type = "both",
    score_name = "Accuracy",
    random_state = 4,
)

Now we can see the score on the test data (in orange) and the score on the training data (in blue) as more labeled data is provided to the classifier.
The first question you may be asking is: "Why does the training score go down as we add more data? Is that bad?"

Great question!
This behavior is actually not only expected, but desired.
Remember that when a model is fitting/learning it only has access to training data, and not test data.
So getting a perfect score on training data could mean that the classifier is perfect and will work perfectly on all data (including the test data),
or it could mean that the classifier learned about all the quirks in the training data that may not apply to the test data.
We call the latter phenomenon [overfitting](https://en.wikipedia.org/wiki/Overfitting).

As an example of overfitting, think about when you are studying for a test and the professor provides a practice test.
A good student will use the practice test to learn about the types of questions and scope of material that will be on the actual test
(assuming the practice test is representative of the actual test).
This student will (hopefully) perform well when trying the practice test again (like when a classifier is fitting/learning)
as well as when taking the actual test (like when a classifier is scoring on the test data).
But an overfitting student may learn just the answers to all the questions on the practice test.
The overfitting student may get a perfect score when trying the practice test again,
but they will surely fail the actual test.

To see overfitting in a classification setting, look at the data in the image below.
We have two classes, red and blue.
We can see that they are roughly split into two groups.
There are a few points in either class that are clearly outliers, but the general pattern of the data is straightforward.

<center><img src="overfitting-data.png" width='400px' style="background-color: white"/></center>

Below, we see a perfectly acceptable decision boundary that generally partitions the data into the two different classes.
Of course it will misclassify the outliers, but it respects the general pattern we see in the data.
We should expect the classifier that produced this decision boundary to perform well on the test data.

<center><img src="overfitting-good.png" width='400px' style="background-color: white"/></center>

Below we see an overfit decision boundary.
This decision boundary will get a perfect score on the provided training data,
but clearly does not respect the general pattern we see in the data.
We should expect the classifier that produced this decision boundary to perform poorly on the test data.

<center><img src="overfitting-bad.png" width='400px' style="background-color: white"/></center>

To see overfitting in practice,
we will use a different classifier that is notorious for overfitting, [Decision Trees](https://en.wikipedia.org/wiki/Decision_tree).
At a high level, decision trees work by creating a tree (the data structure) where each non-leaf node splits the data according to an attribute and each leaf node is assigned a class label.
If you want more information on decision trees,
scikit-learn's [decision tree documentation](https://scikit-learn.org/stable/modules/tree.html) is a good resource.

In [None]:
# Create a basic decision tree classifier.
classifier = sklearn.tree.DecisionTreeClassifier(random_state = 4)

sklearn.model_selection.LearningCurveDisplay.from_estimator(
    classifier, all_features, all_labels,
    train_sizes = numpy.linspace(0.05, 1.0, 20),
    score_type = "both",
    score_name = "Accuracy",
    random_state = 4,
)

Ouch...
The learning curves show that the decision tree classifier always has 100% accuracy on the training data,
but ends up around 87.5% accuracy on the test data (where the logistic regression classifier was getting 90% accuracy).
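
The gap between the two curves can also be checked as plain numbers.
The block below is a minimal sketch that assumes a single 80/20 split made with `train_test_split()` (so the exact scores will differ from the cross-validated curves above) and prints the training accuracy, the test accuracy, and the difference between them for both classifiers.

```python
import sklearn.datasets
import sklearn.linear_model
import sklearn.model_selection
import sklearn.tree

# Same toy data, but with one fixed train/test split (an assumption made for this sketch).
features, labels = sklearn.datasets.make_classification(n_samples = 1000, random_state = 4)
train_features, test_features, train_labels, test_labels = sklearn.model_selection.train_test_split(
    features, labels, test_size = 0.2, random_state = 4)

classifiers = {
    'logistic regression': sklearn.linear_model.LogisticRegression(random_state = 4),
    'decision tree': sklearn.tree.DecisionTreeClassifier(random_state = 4),
}

# Map each classifier name to (training accuracy, test accuracy).
results = {}
for name, classifier in classifiers.items():
    classifier.fit(train_features, train_labels)
    train_accuracy = classifier.score(train_features, train_labels)
    test_accuracy = classifier.score(test_features, test_labels)
    results[name] = (train_accuracy, test_accuracy)

    # A large gap between training and test accuracy is a warning sign for overfitting.
    gap = train_accuracy - test_accuracy
    print(f"{name}: train = {train_accuracy:.3f}, test = {test_accuracy:.3f}, gap = {gap:.3f}")
```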

Clearly our decision tree classifier is overfitting,
but how do we fix it?
To start, scikit-learn provides some [tips for dealing with overfitting decision trees](https://scikit-learn.org/stable/modules/tree.html#tips-on-practical-use).
One piece of advice they give is to make the decision tree simpler using the `max_depth` argument (a lower value will force the creation of simpler decision trees).
Remember the example with the red and blue dots: more complex models (more squiggly lines) are more likely to overfit,
whereas simple models (straighter lines) tend to generalize better and not overfit.
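
To get a feel for what `max_depth` actually limits, we can compare the size of an unrestricted tree with one that is capped.
This is only a sketch on the same toy data; `get_depth()` and `get_n_leaves()` are methods of a fitted scikit-learn decision tree, and the cap of 2 is an arbitrary value picked for illustration.

```python
import sklearn.datasets
import sklearn.tree

features, labels = sklearn.datasets.make_classification(n_samples = 1000, random_state = 4)

# An unrestricted tree keeps splitting until every training point is classified correctly.
full_tree = sklearn.tree.DecisionTreeClassifier(random_state = 4)
full_tree.fit(features, labels)

# A capped tree must stop after at most two levels of splits (illustrative value).
shallow_tree = sklearn.tree.DecisionTreeClassifier(random_state = 4, max_depth = 2)
shallow_tree.fit(features, labels)

print(f"unrestricted: depth = {full_tree.get_depth()}, leaves = {full_tree.get_n_leaves()}")
print(f"max_depth = 2: depth = {shallow_tree.get_depth()}, leaves = {shallow_tree.get_n_leaves()}")
```

Every extra leaf is another small region the tree can carve out to fit individual training points, which is where the "squiggly" decision boundaries come from.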

But what value of `max_depth` should we use?
When will the decision tree become simple enough to not overfit?
To answer these questions, we can also use learning curves!
But this time instead of plotting number of data points vs score, we will plot the complexity of the model (`max_depth`) vs score.

In [None]:
# Build learning curves with the model complexity (max_depth) on the x-axis.

# Go from 1 to 10.
max_depths = list(range(1, 11))

# Collect the scores for plotting.
train_scores = []
test_scores = []

for max_depth in max_depths:
    # Make a decision tree classifier with the given complexity (max_depth).
    classifier = sklearn.tree.DecisionTreeClassifier(random_state = 4, max_depth = max_depth)

    # Fit and score the classifier.
    classifier.fit(train_features, train_labels)
    train_scores.append(classifier.score(train_features, train_labels))
    test_scores.append(classifier.score(test_features, test_labels))
    
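matplotlib.pyplot.plot(max_depths, train_scores, '-o', label = 'Training metric')
matplotlib.pyplot.plot(max_depths, test_scores, '-o', label = 'Testing metric')
# Label the axes so the plot can be read on its own.
matplotlib.pyplot.xlabel('max_depth')
matplotlib.pyplot.ylabel('Accuracy')
matplotlib.pyplot.legend()
matplotlib.pyplot.show()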

According to these learning curves, the test accuracy starts to decrease when `max_depth` is 3.
We also see the training accuracy sharply increase at the same time.
So, it is reasonable to conclude that this is where our model starts to overfit.
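
scikit-learn also provides `sklearn.model_selection.validation_curve()`, which runs a similar sweep using cross-validation instead of a single train/test split.
The block below is a minimal sketch assuming the same toy data and the default 5-fold cross-validation; because the data is split differently, the depth it favors may not match the manual loop above.

```python
import numpy
import sklearn.datasets
import sklearn.model_selection
import sklearn.tree

features, labels = sklearn.datasets.make_classification(n_samples = 1000, random_state = 4)

# Go from 1 to 10, as in the manual loop.
max_depths = list(range(1, 11))

# Each returned array has one row per max_depth value and one column per cross-validation split.
train_scores, test_scores = sklearn.model_selection.validation_curve(
    sklearn.tree.DecisionTreeClassifier(random_state = 4),
    features, labels,
    param_name = "max_depth",
    param_range = max_depths,
)

# Average over the splits to get one score per depth.
train_means = train_scores.mean(axis = 1)
test_means = test_scores.mean(axis = 1)

for depth, train_mean, test_mean in zip(max_depths, train_means, test_means):
    print(f"max_depth = {depth:2d}: train = {train_mean:.3f}, test = {test_mean:.3f}")

# The depth with the highest mean test score is a reasonable starting point.
best_depth = max_depths[int(numpy.argmax(test_means))]
print(f"Best mean test accuracy at max_depth = {best_depth}")
```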

Let's try the decision tree again, but this time with a `max_depth` of 2.

In [None]:
classifier = sklearn.tree.DecisionTreeClassifier(random_state = 4, max_depth = 2)

sklearn.model_selection.LearningCurveDisplay.from_estimator(
    classifier, all_features, all_labels,
    train_sizes = numpy.linspace(0.05, 1.0, 20),
    score_type = "both",
    score_name = "Accuracy",
    random_state = 4,
)

Fantastic!
Now we see the learning curve behavior that we are looking for.
We see a pattern that indicates that our decision tree classifier is no longer overfitting.
The classifier ends up scoring about 90% accuracy (actually, 89.50% accuracy) on the test data,
which is better than when it was more complex and overfitting.