# Learning Curves

### Learning Curves help diagnose Bias and Variance

![model complexity](http://scott.fortmann-roe.com/docs/docs/BiasVariance/biasvariance.png)

A learning curve shows the validation and training score of an estimator for varying numbers of training samples. It is a tool to find out how much we benefit from adding more training data and whether the estimator suffers more from a variance error or a bias error.   

If the training score is much greater than the validation score at the maximum number of training samples, the estimator suffers mainly from variance, and adding more training samples will most likely improve generalization.

Bias is the error introduced by the simplifying assumptions a model makes about the target function. Variance is the amount by which the estimate of the target function changes when different training data is used.

* Examples of low-variance/high-bias algorithms: Linear Regression, Linear Discriminant Analysis and Logistic Regression.

* Examples of high-variance/low-bias algorithms: Decision Trees (especially if not pruned), k-Nearest Neighbors and Support Vector Machines.
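
The definition of variance can be checked empirically: fit the same model on many different training sets and measure how much its predictions move at fixed test points. The sketch below is only an illustration under assumed settings (a noisy `sin(x)` target, 50 training points, 200 freshly drawn training sets); the helper `prediction_variance` is hypothetical and not part of scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Fixed points at which predictions from different fits are compared.
x_test = np.linspace(0, 6, 50).reshape(-1, 1)


def prediction_variance(make_model, n_repeats=200, n_train=50, noise=0.3):
    """Mean variance of predictions at x_test across freshly drawn training sets."""
    preds = np.empty((n_repeats, len(x_test)))
    for i in range(n_repeats):
        # Draw a new training set from the same (assumed) data-generating process.
        x = rng.uniform(0, 6, size=(n_train, 1))
        y = np.sin(x).ravel() + rng.normal(0, noise, size=n_train)
        preds[i] = make_model().fit(x, y).predict(x_test)
    # Variance over repetitions at each test point, averaged over test points.
    return preds.var(axis=0).mean()


linear_var = prediction_variance(LinearRegression)
tree_var = prediction_variance(DecisionTreeRegressor)  # unpruned by default
print(f"linear regression: {linear_var:.4f}")
print(f"unpruned tree:     {tree_var:.4f}")
```

Under these assumptions the unpruned tree's predictions vary far more between training sets than the linear model's, which is what "high variance" means here.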


If both the validation score and the training score converge to a value that is too low with increasing size of the training set, we will not benefit much from more training data.  We will probably have to use an estimator or a parametrization of the current estimator that can learn more complex concepts (i.e. has a lower bias).       

**High Bias** => Increase model complexity: add features      
**High Variance** => Reduce model complexity    
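
The two rules above can be turned into a rough check on the scores at the largest training size. This is a minimal sketch, not a standard procedure: the helper `diagnose` and its thresholds `target` and `max_gap` are hypothetical values chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB


def diagnose(train_scores, val_scores, target, max_gap=0.05):
    """Label the largest-training-size scores with a rough rule of thumb.

    `target` (the score considered good enough) and `max_gap` are
    illustrative thresholds chosen for this sketch, not standard values.
    """
    train_final = np.mean(train_scores[-1])  # mean over CV folds
    val_final = np.mean(val_scores[-1])
    if train_final - val_final > max_gap:
        return "high variance"  # large gap between training and validation
    if val_final < target:
        return "high bias"      # both curves converge below the target
    return "ok"


X, y = load_digits(return_X_y=True)
# learning_curve returns score arrays of shape (n_train_sizes, n_cv_folds).
sizes, train_scores, val_scores = learning_curve(
    GaussianNB(), X, y, cv=5, scoring="accuracy",
    train_sizes=np.linspace(0.1, 1.0, 5))

print(diagnose(train_scores, val_scores, target=0.85))
```

A model can show both problems at once; this sketch reports the gap first because extra data helps in that case.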


### Reduce Variance Without Increasing Bias       
* Averaging reduces variance. For the mean of $N$ independent, identically distributed samples:

$Var(\bar{X}) = \frac{Var(X)}{N}$

* bagging
* boosting
* feature elimination
    * regularization
    * low-variance filter
    * PCA
    * RFE
* ensembling
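
The averaging formula can be checked numerically. The sketch below assumes independent, normally distributed draws; the sample size `N`, the number of trials, and the standard deviation are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 25              # number of independent draws averaged together
n_trials = 100_000  # number of repeated experiments

# Each row is one trial of N i.i.d. draws with Var(X) = 4 (standard deviation 2).
samples = rng.normal(loc=0.0, scale=2.0, size=(n_trials, N))

var_x = samples.var()                  # empirical Var(X)
var_mean = samples.mean(axis=1).var()  # empirical variance of the N-sample mean

print(f"Var(X)     = {var_x:.3f}")
print(f"Var(X) / N = {var_x / N:.4f}")
print(f"Var(mean)  = {var_mean:.4f}")
```

The last two printed values should agree closely. The relation relies on independence: averaging correlated predictors, as in bagging, reduces variance by less than a factor of $N$.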
    


In [None]:
import numpy as np

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, scoring=None, obj_line=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the training and cross-validation learning curves.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.

    cv : int, cross-validation generator or an iterable, optional
        Determines the cross-validation splitting strategy.
        Possible inputs for cv are:
          - None, to use the default cross-validation (5-fold in
            scikit-learn 0.22 and later, 3-fold in earlier versions),
          - integer, to specify the number of folds.
          - An object to be used as a cross-validation generator.
          - An iterable yielding train/test splits.

        For integer/None inputs, if ``y`` is binary or multiclass,
        :class:`StratifiedKFold` is used. If the estimator is not a classifier
        or if ``y`` is neither binary nor multiclass, :class:`KFold` is used.

        Refer to the :ref:`User Guide <cross_validation>` for the various
        cross-validators that can be used here.

    scoring : string, callable or None, optional, default: None
              A string (see the model evaluation documentation)
              or a scorer callable object / function with signature scorer(estimator, X, y).
              The accepted strings are listed here:
              http://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
              For example, log loss is specified as 'neg_log_loss'.

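    obj_line : numeric or None (default: None)
        If given, draw a horizontal reference line at this score.

    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).

    train_sizes : array-like, optional
        Relative or absolute numbers of training examples used to
        generate the learning curve (default: np.linspace(.1, 1.0, 5)).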


    Citation
    --------
        http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html

    Usage
    -----
        plot_learning_curve(estimator = best_estimator,
                            title     = best_estimator_title,
                            X         = X_train,
                            y         = y_train,
                            ylim      = (-1.1, 0.1), # neg_log_loss is negative
                            cv        = StratifiedCV, # CV generator
                            scoring   = scoring,     # e.g., 'neg_log_loss'
                            obj_line  = obj_line,    # horizontal line
                            n_jobs    = n_jobs)      # how many CPUs

        plt.show()
    """
    from sklearn.model_selection import learning_curve
    from matplotlib import pyplot as plt

    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, scoring=scoring, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    if obj_line is not None:  # also draw the line when obj_line == 0
        plt.axhline(y=obj_line, color='blue')

    plt.legend(loc="best")
    return plt


In [None]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.naive_bayes import GaussianNB
from sklearn.svm         import SVC

from sklearn.datasets        import load_digits
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import StratifiedKFold

digits = load_digits()
X, y = digits.data, digits.target

estimator = GaussianNB()
plot_learning_curve(estimator = estimator,
                    title     = "Learning Curves (Naive Bayes)",
                    X         = X,
                    y         = y,
                    ylim      = (0.5, 1.1),
                    cv        = StratifiedKFold(),
                    scoring   = 'accuracy',     
                    obj_line  = 0.85,    
                    n_jobs    = -1)  
plt.show()


estimator = SVC(gamma=0.001)
plot_learning_curve(estimator = estimator,
                    title     = r"Learning Curves (SVM, RBF kernel, $\gamma=0.001$)",
                    X         = X,
                    y         = y,
                    ylim      = (0.8, 1.1),
                    cv        = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0),
                    scoring   = 'accuracy',     
                    obj_line  = 0.99,    
                    n_jobs    = -1)
plt.show()

![andrew ng](http://www.ultravioletanalytics.com/wp-content/uploads/2014/12/bias_variance_chart1.jpg)   

### c/o Andrew Ng