Commit

overfitting faq
rasbt committed Dec 19, 2015
1 parent b2e5fff commit 80c6e98
Showing 4 changed files with 27 additions and 0 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -125,6 +125,7 @@ Bonus Notebooks (not in the book)

##### Model evaluation

- [What is overfitting?](./faq/overfitting.md)
- [Is it always better to have the largest possible number of folds when performing cross validation?](./faq/number-of-kfolds.md)
- [When training an SVM classifier, is it better to have a large or small number of support vectors?](./faq/num-support-vectors.md)
- [How do I evaluate a model?](./faq/evaluate-a-model.md)
2 changes: 2 additions & 0 deletions faq/README.md
@@ -63,12 +63,14 @@ Sebastian

##### Model evaluation

- [What is overfitting?](./overfitting.md)
- [Is it always better to have the largest possible number of folds when performing cross validation?](./number-of-kfolds.md)
- [When training an SVM classifier, is it better to have a large or small number of support vectors?](./num-support-vectors.md)
- [How do I evaluate a model?](./evaluate-a-model.md)
- [What is the best validation metric for multi-class classification?](./multiclass-metric.md)
- [What factors should I consider when choosing a predictive model technique?](./choosing-technique.md)


##### Logistic Regression

- [What is the probabilistic interpretation of regularized logistic regression?](./probablistic-logistic-regression.md)
24 changes: 24 additions & 0 deletions faq/overfitting.md
@@ -0,0 +1,24 @@
# What is overfitting?

Let’s assume we have a hypothesis or model *m* that we fit to our training data. In machine learning, the training performance -- for example, the accuracy -- is what we measure and optimize during training. Let’s call this training accuracy ACC<sub>train</sub>(*m*).

Now, what we really care about in machine learning is building a model that generalizes well to unseen data -- that is, a model that achieves a high accuracy on the whole distribution of the data; let’s call this accuracy ACC<sub>population</sub>(*m*). (Typically, we use cross-validation techniques and a separate, independent test set to estimate this generalization performance.)
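For example, a minimal sketch of this procedure -- assuming scikit-learn, and using the Iris dataset and an SVM purely for illustration -- could look like the following:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# hold out an independent test set for the final performance estimate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

model = SVC(kernel='rbf', C=1.0)

# 5-fold cross-validation on the training set as an estimate of
# the generalization accuracy, ACC_population(m)
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print('CV accuracy: %.3f +/- %.3f' % (cv_scores.mean(), cv_scores.std()))

# final check on the independent test set
model.fit(X_train, y_train)
print('Test accuracy: %.3f' % model.score(X_test, y_test))
```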

Now, overfitting occurs if there’s an alternative model *m'* from the algorithm's hypothesis space that has a lower training accuracy but a better generalization performance than *m* -- in this case, we say that *m* overfits the training data.
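In the notation from above, *m* overfits if there exists an *m'* such that ACC<sub>train</sub>(*m'*) < ACC<sub>train</sub>(*m*) while ACC<sub>population</sub>(*m'*) > ACC<sub>population</sub>(*m*).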

### Learning Curves

As a rule of thumb, a model is more likely to overfit if it is too complex given a fixed number of training samples. The figure below shows the training and validation accuracies of an SVM model on a certain dataset. Here, I plotted the accuracy as a function of the inverse regularization parameter C -- the larger the value of C, the smaller the penalty against model complexity, and thus the more complex the model is allowed to become.

![](./overfitting/learning_curve_1.png)

We observe a larger difference between training and validation accuracy for increasing values of C (more complex models). Based on the plot, we can say that the models with C < 10<sup>-1</sup> underfit the training data, whereas the models with C > 10<sup>-1</sup> overfit it.
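A plot like the one above could be produced with scikit-learn's `validation_curve` function; note that the dataset, the parameter range, and the feature-scaling pipeline below are assumptions for illustration, not the exact setup behind the figure:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
param_range = np.logspace(-3, 2, 6)

# compute training and validation accuracies for each value of C
# via 5-fold cross-validation
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
train_scores, valid_scores = validation_curve(
    pipe, X, y, param_name='svc__C', param_range=param_range, cv=5)

plt.plot(param_range, train_scores.mean(axis=1),
         marker='o', label='training accuracy')
plt.plot(param_range, valid_scores.mean(axis=1),
         marker='s', label='validation accuracy')
plt.xscale('log')
plt.xlabel('Parameter C')
plt.ylabel('Accuracy')
plt.legend(loc='lower right')
plt.show()
```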

### Remedies

Remedies against overfitting include

1. Choosing a simpler model by adding bias and/or reducing the number of parameters
2. Adding regularization penalties (see the short sketch after this list)
3. Reducing the dimensionality of the feature space
4. Collecting more training data
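As a small illustration of remedy 2, the hedged sketch below compares a weakly regularized SVM (large C) to a strongly regularized one (small C); the dataset and parameter values are assumptions chosen for illustration:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# the gap between training and cross-validated accuracy indicates
# how strongly the model overfits the training data
for C in (100.0, 0.1):  # weak vs. strong regularization
    pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf', C=C))
    scores = cross_validate(pipe, X, y, cv=5, return_train_score=True)
    print('C=%5.1f  train acc: %.3f  CV acc: %.3f' % (
        C, scores['train_score'].mean(), scores['test_score'].mean()))
```

Typically, increasing the regularization strength (lowering C) shrinks the gap between the training and the cross-validated accuracy.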
Binary file added faq/overfitting/learning_curve_1.png
