# Machine Learning with scikit-learn

## Avoiding Overfitting

The reason for train/test splits of data is always, at heart, a desire to avoid overfitting.  It is straightforward in supervised learning problems to fit a model against all the available data, but doing so leaves nothing with which to check how well the model generalizes.  Since, by definition, we do not yet have future observations, we want a proxy for "the novel data we might see in the future."

The best proxy we can come up with is simply a portion of the original data that did not participate in the fitting of the model.  This relies on the assumption that our sample data is similar to observations we will obtain in the future.  That assumption may not hold, but there is really nothing better we might choose as such a proxy.

Using `train_test_split()` to divide the data between a training and testing set is a very reasonable approach.  By default, this utility function shuffles the data before splitting it; in general this will minimize effects related to order of collection or collation of the dataset.  However, especially on moderately sized datasets of hundreds or thousands of samples (but not really of tens of thousands, or millions), the particular accident of a randomized split can still lead to artifacts.
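
To make the point about accidental splits concrete, here is a minimal sketch that repeats the default split with a few different seeds and counts how many samples of each class end up in the test set.  It uses the full, untruncated Iris data, and the seeds 0 through 4 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Count how many samples of each class land in the test set for a few seeds.
test_counts = {}
for seed in range(5):
    _, _, _, y_test = train_test_split(X, y, random_state=seed)
    test_counts[seed] = np.bincount(y_test, minlength=3)
    print("seed %d: test-set class counts %s" % (seed, test_counts[seed]))
```

The total test size is the same for every seed, but the class mix inside it generally is not, which is one source of the artifacts mentioned above.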

In [None]:
import pandas as pd
import numpy as np

### Understanding splitting

`train_test_split()` performs just one split of a data array, while the splitter classes in `sklearn.model_selection` produce an iterator over multiple distinct splits.

We will use the Iris dataset to illustrate these differences.  This dataset contains 150 observations of 3 different species of Iris, each sample containing 4 features.  It is a widely used example, and responds well to many classifiers.

In [None]:
from sklearn.model_selection import KFold, train_test_split
from sklearn import datasets

In order to show some different behavior of splitting techniques, we will modify the Iris data to drop some of it.  In particular, we truncate the last 25 observations.  We do this because the samples in the dataset are grouped by their class: first all the Iris setosa, then all the Iris versicolor, and finally all the Iris virginica samples.  The truncation therefore removes half of the virginica samples and creates an imbalance among the classes of observations.  Most datasets you will encounter will have varying numbers of samples in different classes.

In [None]:
iris = datasets.load_iris()
iris.data = iris.data[:-25]
iris.target = iris.target[:-25]
print(iris.target_names)

The basic utility function—as we have seen in prior lessons—simply divides the data into two arrays.  We keep the same number of columns as in the original, but put some of the rows in the first array and the rest in the second.

In [None]:
[arr.shape for arr in train_test_split(iris.data)]
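
As a small sketch of how the split sizes are chosen, the following uses a placeholder array of zeros with the same shape as the truncated Iris features (an assumption made only to keep the example self-contained) and compares the default split with an explicit `test_size`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in array with the same shape as the truncated Iris features.
X = np.zeros((125, 4))

# Default: a quarter of the rows are held out, rounded up to a whole sample.
default_shapes = [arr.shape for arr in train_test_split(X)]

# Explicit 20% hold-out.
explicit_shapes = [arr.shape for arr in train_test_split(X, test_size=0.2)]

print(default_shapes)
print(explicit_shapes)
```

In both cases the column count is untouched; only the rows are divided.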

If more than one array is passed to the function, it will split each one in turn; the split will be consistent in choosing the corresponding rows from each array (which must, therefore, all have a consistent number of rows).

Almost always, you use this behavior to split feature and target arrays simultaneously.  However, the API of `train_test_split()` does nothing to enforce that usage; it accepts any number of arrays, as long as they all have the same number of rows.

In [None]:
import numpy as np
for arr in train_test_split(iris.data, iris.target, np.ones((125,3))):
    print(arr.shape)
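
One way to check that corresponding rows really do stay together is to pass an extra array of row positions along with the data.  This is only a sketch on toy arrays; `row_ids` is a hypothetical helper array introduced for the illustration, and `random_state=0` is an arbitrary seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # toy features: row i is [2i, 2i+1]
y = np.arange(10) * 10             # toy target: row i is 10*i
row_ids = np.arange(10)            # original row positions

X_tr, X_te, y_tr, y_te, id_tr, id_te = train_test_split(
    X, y, row_ids, random_state=0)

# Indexing the originals by the returned row ids reproduces the split
# arrays exactly, so the same rows were chosen from every input.
print(np.array_equal(X[id_te], X_te), np.array_equal(y[id_te], y_te))
```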

### Multiple splitting

Each splitter class applies one splitting strategy repeatedly: its `.split()` method yields an iterator over multiple train/test splits of the same data.

#### KFold

One of the simplest such techniques is `KFold`.  This simply divides the data into multiple "folds."  By default, `KFold` does not shuffle the data first; therefore, if the dataset is meaningfully ordered in some manner already, the folds may have importantly different characteristics.  

The potential differences among the folds can be good or bad, depending on your purpose.  Either way, be aware of it.  If you hope your model will generalize to sample collections with a characteristic not present in the training data, there is an advantage to not shuffling.  However, it equally means that a particular loop through fitting may not have the opportunity to fit to data with that characteristic.
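
The effect of shuffling can be seen directly in the index positions.  The sketch below uses a placeholder array with the shape of the truncated Iris features; `shuffle=True` and `random_state` are real `KFold` parameters, while the seed value 0 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.zeros((125, 4))  # stand-in with the truncated Iris shape

# First test fold without shuffling, then with shuffling.
_, test_ordered = next(KFold(n_splits=3).split(X))
_, test_shuffled = next(
    KFold(n_splits=3, shuffle=True, random_state=0).split(X))

print("Unshuffled first test fold:", test_ordered[:8], "...")
print("Shuffled first test fold:  ", test_shuffled[:8], "...")
```

Without shuffling, the first test fold is simply the leading block of rows; with shuffling, the fold has the same size but draws its rows from across the whole array.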

<img src="img/KFold.png" width="66%"/>

Below we loop through a three-way split of the (truncated) Iris data.  The lengths of the training versus testing data are slightly different between iterations simply because 125 is not divisible by 3.

In [None]:
for n, (train, test) in enumerate(KFold(n_splits=3).split(iris.data)):
    print("Iteration: %d; Train shape: %s; Test shape: %s" % (
                       n, train.shape, test.shape))
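
The arithmetic behind those sizes can be sketched in plain Python.  This is not scikit-learn's code; `kfold_test_sizes` is a hypothetical helper that assumes the remainder is handed to the earliest folds, which is how the `KFold` documentation describes its fold sizes.

```python
def kfold_test_sizes(n_samples, n_splits):
    """Sizes of the test folds when the remainder goes to the first folds."""
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]

test_sizes = kfold_test_sizes(125, 3)
train_sizes = [125 - size for size in test_sizes]
print("Test sizes: ", test_sizes)    # [42, 42, 41]
print("Train sizes:", train_sizes)   # [83, 83, 84]
```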

One thing that might be surprising at first is that the shapes of the training arrays are not, e.g., `(83, 4)`.  What we iterate over are collections of index positions into the underlying NumPy arrays, not the rows themselves.  So, for example, in the first iteration, the test data is the first third of the rows in the data.

In [None]:
train, test = next(KFold(n_splits=3).split(iris.data))
print(test)
pd.DataFrame(iris.data[test], columns=iris.feature_names).tail()

#### StratifiedKFold

This cross-validation object is a variation of `KFold` that returns stratified folds. The folds are made by preserving the percentage of samples for each class.  Because this split is sensitive to the classes of the target, its `.split()` method requires a `y` argument.

In [None]:
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=3).split(iris.data, iris.target)
for n, (train, test) in enumerate(skf):
    print("Iteration: %d; Train shape: %s; Test shape: %s" % (
                       n, train.shape, test.shape))

Notice that the index positions generated for the first split are not successive from the head.  Rather, about 17 come from each of the first and second groups of 50 samples, and 8 or 9 more from the last 25 samples (the exact allocation can differ between scikit-learn versions).  Other folds are similar, with rounding producing slightly different counts.

In [None]:
skf = StratifiedKFold(n_splits=3).split(iris.data, iris.target)
train, test = next(skf)
print(test)
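
Rather than reading the class balance off the raw index positions, we can count classes per test fold.  The sketch below rebuilds the truncated Iris data so that it is self-contained, and compares plain `KFold` with `StratifiedKFold`; the `fold_counts` dictionary is just a name chosen for this illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import KFold, StratifiedKFold

# Rebuild the truncated Iris data used in this lesson.
iris = datasets.load_iris()
X, y = iris.data[:-25], iris.target[:-25]

fold_counts = {}
for name, splitter in [("KFold", KFold(n_splits=3)),
                       ("StratifiedKFold", StratifiedKFold(n_splits=3))]:
    # Per-class sample counts in each test fold.
    fold_counts[name] = [np.bincount(y[test], minlength=3)
                         for _, test in splitter.split(X, y)]
    print(name, [counts.tolist() for counts in fold_counts[name]])
```

Each list holds the counts for setosa, versicolor, and virginica in one test fold.  The unshuffled `KFold` folds are dominated by one or two species, while the stratified folds keep roughly the 2:2:1 ratio of the truncated data.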

#### LeaveOneOut

This splitting technique utilizes the maximum possible size for each training set while still creating a nominal testing set of a single sample.  This can be useful to train models as completely as possible while still allowing validation.  Of course, this iterates over a number of splits equal to the number of samples, so it is also one of the most computationally expensive splitting strategies.

In [None]:
from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut().split(iris.data)

In [None]:
all_folds = []
for n, (train, test) in enumerate(loo):
    all_folds.append((n, train, test))

n, train, test = all_folds[0]
print("Iteration: %d; Train shape: %s; Test shape: %s; Test index: %s" % (
    n, train.shape, test.shape, test))
print("...")
n, train, test = all_folds[-1]
print("Iteration: %d; Train shape: %s; Test shape: %s; Test index: %s" % (
    n, train.shape, test.shape, test))
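
The idea is simple enough to sketch in plain Python.  The generator below is an illustration only, not scikit-learn's implementation, and `leave_one_out` is a hypothetical name; it shows why the number of splits equals the number of samples.

```python
def leave_one_out(n_samples):
    """Yield (train, test) index lists, holding out one sample at a time."""
    for held_out in range(n_samples):
        train = [i for i in range(n_samples) if i != held_out]
        yield train, [held_out]

splits = list(leave_one_out(125))
print("Number of splits:", len(splits))
print("Train/test sizes in each split:", len(splits[0][0]), len(splits[0][1]))
```

With 125 samples this means fitting the model 125 times, each time on 124 samples.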

#### GroupKFold

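`GroupKFold` is a `KFold` variant with non-overlapping groups.  The same group will not appear in two different folds, so the number of distinct groups has to be at least equal to the number of folds.  Below we use the species itself as the group, so each test fold contains a single species.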

The folds are approximately balanced in the sense that the number of distinct groups is approximately the same in each fold.

In [None]:
from sklearn.model_selection import GroupKFold

gkf = GroupKFold(n_splits=3).split(iris.data, groups=iris.target)
for n, (train, test) in enumerate(gkf):
    print("Iteration: %d; Train shape: %s; Test shape: %s" % (
                       n, train.shape, test.shape))

In [None]:
# Verify that the final test group really is a homogeneous class
print("Index positions:", test)
print("Species:", [iris.target_names[n] for n in iris.target[test]])
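
Using the class label as the group is a deliberately extreme case.  A more typical use, sketched here on made-up data, is to keep repeated measurements of one subject together.  The `subject` array is a hypothetical grouping (6 subjects with 2 measurements each) and the features are placeholders.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.zeros((12, 2))                  # placeholder features
# Hypothetical grouping: 6 subjects, 2 measurements each.
subject = np.repeat(np.arange(6), 2)

overlaps = []
for train, test in GroupKFold(n_splits=3).split(X, groups=subject):
    # Subjects that appear on both sides of this split (should be none).
    shared = np.intersect1d(subject[train], subject[test])
    overlaps.append(shared.size)
    print("Test subjects:", np.unique(subject[test]).tolist(),
          "| subjects shared with training:", shared.size)
```

No subject ever appears on both sides of a split, so a model cannot score well merely by recognizing a subject it has already seen.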

## Cross validation

The splitters discussed in this lesson are only a few of those in scikit-learn.  A variety of others build on the general idea contained in those discussed.  Consult the [documentation](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection) for details on each.

The point of all these splitters is almost universally to be used in conjunction with cross validation.  The function `cross_val_score()` performs repeated training and scoring relative to multiple train/test splits.

In [None]:
# As we mentioned, the Iris dataset is quite easy to fit
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
clf = SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
scores                                              
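
To see what `cross_val_score()` is doing, here is a hand-written loop that should produce the same numbers.  It is a sketch rather than the library's actual implementation: it assumes an explicit `StratifiedKFold(n_splits=5)` splitter and uses `sklearn.base.clone` to get an unfitted copy of the classifier for each split.

```python
import numpy as np
from sklearn import datasets
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

iris = datasets.load_iris()
X, y = iris.data[:-25], iris.target[:-25]   # truncated data from this lesson

clf = SVC(kernel='linear', C=1)
cv = StratifiedKFold(n_splits=5)

manual_scores = []
for train, test in cv.split(X, y):
    model = clone(clf)                 # fresh, unfitted copy for each split
    model.fit(X[train], y[train])      # train on the training indices only
    manual_scores.append(model.score(X[test], y[test]))

print(np.array(manual_scores))
print(cross_val_score(clf, X, y, cv=cv))
```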

If an integer argument is given for `cv`, as above, and the estimator is a classifier, it performs a Stratified KFold with that number of folds.  But we can also pass a splitter object like those discussed above, or any iterable of train/test index pairs.

The "score" given for each iteration is that produced by the `.score()` method of the estimator being used.  You can manually specify a different `scoring=my_scorer` parameter to `cross_val_score` if you want to use a different metric.
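
As a brief sketch of choosing a different metric, the example below compares the estimator's default score (accuracy, for a classifier) with the built-in `"f1_macro"` scorer, passed by name through the `scoring` parameter.  The choice of macro-averaged F1 is only an illustration; the data is the truncated Iris set rebuilt for self-containment.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

iris = datasets.load_iris()
X, y = iris.data[:-25], iris.target[:-25]   # truncated data from this lesson

clf = SVC(kernel='linear', C=1)

# Default: uses clf.score(), which is accuracy for classifiers.
accuracy = cross_val_score(clf, X, y, cv=5)

# Named scorer: macro-averaged F1 across the three classes.
f1_macro = cross_val_score(clf, X, y, cv=5, scoring="f1_macro")

print("Accuracy:", accuracy)
print("Macro F1:", f1_macro)
```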

The following is an artificially bad splitting strategy.  We train exclusively on two species on each iteration, then try to predict the excluded species.

In [None]:
cross_val_score(clf, iris.data, iris.target, 
                cv=GroupKFold(n_splits=3), groups=iris.target)

A less bad split for this particular dataset and classifier would be a basic KFold.  The data has an unequal number of samples from each class (by construction) and is ordered by class.  So this gets enough overlap between the classes seen in training and testing to do well on some splits, but does poorly on others.

In general, Stratified KFold is pretty robust, and hence is the default strategy used by `cross_val_score` for classifiers.

In [None]:
cross_val_score(clf, iris.data, iris.target, cv=KFold(n_splits=3))
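
Because the weak folds above come from the class ordering of the rows, one simple remedy worth trying is to shuffle before folding.  The sketch below compares the two on the truncated Iris data; `random_state=0` is an arbitrary seed, and the exact scores will depend on it.

```python
from sklearn import datasets
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVC

iris = datasets.load_iris()
X, y = iris.data[:-25], iris.target[:-25]   # truncated data from this lesson

clf = SVC(kernel='linear', C=1)

# Same number of folds, with and without shuffling the rows first.
plain = cross_val_score(clf, X, y, cv=KFold(n_splits=3))
shuffled = cross_val_score(
    clf, X, y, cv=KFold(n_splits=3, shuffle=True, random_state=0))

print("Unshuffled KFold:", plain)
print("Shuffled KFold:  ", shuffled)
```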

In [None]:
loo_cv = cross_val_score(clf, iris.data, iris.target, cv=LeaveOneOut())
print("Mean leave-one-out cross validation:", np.mean(loo_cv))
print("All scores:\n", loo_cv)