diff --git a/README.md b/README.md
index fde0c339..1f2e4012 100644
--- a/README.md
+++ b/README.md
@@ -125,6 +125,7 @@ Bonus Notebooks (not in the book)
 
 ##### Model evaluation
 
+- [What is overfitting?](./faq/overfitting.md)
 - [Is it always better to have the largest possible number of folds when performing cross validation?](./faq/number-of-kfolds.md)
 - [When training an SVM classifier, is it better to have a large or small number of support vectors?](./faq/num-support-vectors.md)
 - [How do I evaluate a model?](./faq/evaluate-a-model.md)
diff --git a/faq/README.md b/faq/README.md
index 2c116c4f..0f2d723f 100644
--- a/faq/README.md
+++ b/faq/README.md
@@ -63,12 +63,14 @@ Sebastian
 
 ##### Model evaluation
 
+- [What is overfitting?](./overfitting.md)
 - [Is it always better to have the largest possible number of folds when performing cross validation?](./number-of-kfolds.md)
 - [When training an SVM classifier, is it better to have a large or small number of support vectors?](./num-support-vectors.md)
 - [How do I evaluate a model?](./evaluate-a-model.md)
 - [What is the best validation metric for multi-class classification?](./multiclass-metric.md)
 - [What factors should I consider when choosing a predictive model technique?](./choosing-technique.md)
+
 
 ##### Logistic Regression
 
 - [What is the probabilistic interpretation of regularized logistic regression?](./probablistic-logistic-regression.md)
diff --git a/faq/overfitting.md b/faq/overfitting.md
new file mode 100644
index 00000000..19aef43e
--- /dev/null
+++ b/faq/overfitting.md
@@ -0,0 +1,24 @@
+# What is overfitting?
+
+Let’s assume we have a hypothesis or model *m* that we fit to our training data. In machine learning, the training performance -- for example, the accuracy -- is what we measure and optimize during training. Let’s call this training accuracy ACC_train(*m*).
+
+What we really care about in machine learning, however, is building a model that generalizes well to unseen data, that is, a model that achieves high accuracy on the whole distribution of the data; let’s call this ACC_population(*m*). (Typically, we use cross-validation techniques and a separate, independent test set to estimate this generalization performance.)
+
+Overfitting occurs if there is an alternative model *m'* from the algorithm's hypothesis space whose training accuracy is higher but whose generalization performance is worse than that of model *m* -- in this case, we say that *m* overfits the training data.
+
+### Learning Curves
+
+As a rule of thumb, a model is more likely to overfit if it is too complex given a fixed number of training samples. The figure below shows the training and validation accuracies of an SVM model on a certain dataset. Here, I plotted the accuracy as a function of the inverse regularization parameter C -- the larger the value of C, the weaker the penalty against complexity, and hence the more complex the model is allowed to become.
+
+![](./overfitting/learning_curve_1.png)
+
+We observe a larger difference between training and validation accuracy for increasing values of C (more complex models). Based on the plot, we can say that the models with C < 10^-1 underfit the training data, whereas the models with C > 10^-1 overfit the training data.
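+
+Below is a minimal sketch of how such a validation curve could be produced with scikit-learn's `validation_curve`; the synthetic dataset, the RBF kernel, and the parameter range are illustrative assumptions, not the exact setup used for the figure above.
+
+```python
+import numpy as np
+import matplotlib.pyplot as plt
+from sklearn.datasets import make_classification
+from sklearn.model_selection import validation_curve
+from sklearn.svm import SVC
+
+# Hypothetical dataset standing in for the dataset used in the figure
+X, y = make_classification(n_samples=500, n_features=20,
+                           n_informative=5, random_state=1)
+
+# Larger C = weaker regularization = a potentially more complex model
+param_range = np.logspace(-3, 3, 7)
+train_scores, valid_scores = validation_curve(
+    SVC(kernel="rbf"), X, y,
+    param_name="C", param_range=param_range,
+    cv=10, scoring="accuracy")
+
+# Average the accuracies over the cross-validation folds and plot them
+plt.semilogx(param_range, train_scores.mean(axis=1), label="training accuracy")
+plt.semilogx(param_range, valid_scores.mean(axis=1), label="validation accuracy")
+plt.xlabel("C (inverse regularization strength)")
+plt.ylabel("accuracy")
+plt.legend()
+plt.show()
+```
+
+A growing gap between the two curves at large values of C is the overfitting symptom described above; shrinking C (stronger regularization) is one instance of the remedies listed below.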
+
+### Remedies
+
+Remedies against overfitting include:
+
+1. Choosing a simpler model by adding bias and/or reducing the number of parameters, for example, by
+    - adding regularization penalties
+    - reducing the dimensionality of the feature space
+2. Collecting more training data
\ No newline at end of file
diff --git a/faq/overfitting/learning_curve_1.png b/faq/overfitting/learning_curve_1.png
new file mode 100644
index 00000000..15444be5
Binary files /dev/null and b/faq/overfitting/learning_curve_1.png differ