# Machine Learning

Machine learning is a much broader subject than we'll have time to cover in any great detail here. If you're interested in the topic and want to learn more, there are a lot of freely available resources online.

## Online Resources

- [Wikipedia](https://en.wikipedia.org/wiki/Machine_learning) and its [Machine Learning Portal](https://en.wikipedia.org/wiki/Portal:Machine_learning) give a pretty good overview of the various methods that are used.
- The [scikit-learn project website](http://scikit-learn.org) has several [tutorials](http://scikit-learn.org/stable/tutorial/index.html) that make for a good practical introduction. The [user guide](http://scikit-learn.org/stable/user_guide.html) also has good information on each of the models available.
- The [Coursera machine learning course](https://www.coursera.org/learn/machine-learning) from Andrew Ng/Stanford University is very highly regarded. (The material can be viewed for free if you sign up and enroll.) It uses [Octave](https://www.gnu.org/software/octave/) (like a free version of Matlab) rather than Python.
- There is a curated list of free resources (books, courses, libraries etc) at https://github.com/josephmisiti/awesome-machine-learning.

## Books

- [Hands-On Machine Learning with Scikit-Learn and TensorFlow](http://shop.oreilly.com/product/0636920052289.do) gives a good practical overview of what's involved in applying machine learning to various problems. Jupyter notebooks with worked examples for each chapter can be freely downloaded from [https://github.com/ageron/handson-ml](https://github.com/ageron/handson-ml).

## What we'll cover here

In this section, I just want to give you an overview of the kinds of things machine learning can do, and where you might find out more about solving particular types of problems.

We'll also go through a worked example with one of the Python libraries, where we compare several models for a task.

## What is machine learning?

Machine learning is an approach where the computer learns how to solve a problem without being explicitly programmed to do so. It has some things in common with the ideas we covered in the fitting section: linear regression is a common machine learning algorithm for example, but usually it means an algorithm which learns from and makes predictions about data. In practice this can involve working with a very large number of dimensions and models that have a very large number of internal parameters.

In machine learning we often aim to train a model such that we can use it to make predictions about some unseen data, rather than try to really understand the model that happens to work. This can make it seem somewhat unscientific at first, since it's common to try to ensure your data best represents the general case before fitting your model, e.g. by discarding outliers, as they can negatively impact the fit produced by many models.

## Types of problems where machine learning can be used

- **Classification**: based on some data associated with an item, put it into one (or possibly more) of some set of predefined categories.
    - Is an email spam or legitimate?
    - Handwriting recognition - use an image of a written letter to find which letter of the alphabet it corresponds to.
    - [Is a picture of a hotdog, or not of a hotdog?](https://play.google.com/store/apps/details?id=com.seefoodtechnologies.nothotdog)
    - Is a material a metal or an insulator - based on its structure and composition.
- **Regression**: estimate the relationship between variables such that given some data associated with an item we can estimate the value of some property of that item. We now want some continuous output rather than discrete as we had in the case of classification.
    - What is the likely future price of stock - based on whatever you think might be relevant.
    - What is the bandgap of a material - based on its structure and composition.
- **Clustering**: divide some set of items into groups. This is like classification, but we don't know what the categories are before we start.
    - In social networks - to recognize communities within groups of users.
    - Image analysis - to divide up an image into different regions for border detection and object recognition.
    - To find groups with structural similarity given a large number of chemical compounds.
- **Dimensionality Reduction**: reducing the number of variables needed to describe a set of items given data about them, by finding the most important variables or combining the effect of several variables into one. This is often used as a preliminary step before using one of the methods above. If several variables are strongly correlated, you don't need to use all of them. Or if some variable seems completely uncorrelated with the result, it may be best to disregard it.
    - Say you had some dataset of materials, which included the volume and lattice lengths: you might find looking at the volume alone was sufficient as it correlates strongly with the lattice vector lengths, or perhaps you might use the volume per atom.
    - This is important in general, as a subsequently used model will often perform better if it can focus on only the most relevant features.
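
As a rough illustration of dimensionality reduction, here is a minimal sketch using principal component analysis (PCA). PCA is just one technique picked for illustration, and the dataset is scikit-learn's built-in handwritten digits set, where each item is described by 64 pixel values. The 95% variance threshold is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()

# Keep just enough principal components to explain 95% of the variance.
# With svd_solver="full", a float n_components between 0 and 1 is
# interpreted as the fraction of variance to retain.
pca = PCA(n_components=0.95, svd_solver="full")
reduced = pca.fit_transform(digits.data)

print("Original number of features:", digits.data.shape[1])
print("Reduced number of features:", reduced.shape[1])
print("Variance retained:", np.sum(pca.explained_variance_ratio_))
```

The reduced array could then be passed to a classifier or clustering method in place of the original data.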

## Types of machine learning task

Machine learning tasks are usually divided into two categories:
1. **Supervised learning** - you give the computer some example data and what the desired outcome would be for that data. Usually this is called the "training set". It uses this to build some rules about how the input data should map to the output. There are a few special cases also:
    - Semi-supervised learning: the training set is incomplete; perhaps only a small portion is properly labeled (getting properly labeled data can be difficult/expensive). For example a training set of images of handwritten digits might only have a fraction of them labeled with what the actual digit should be.
    - Active learning: the training set is limited, but the computer can decide which objects it wants to ask for labels for. These may then be presented to the user (such as CAPTCHAs asking website users to identify road signs and vehicles).
    - Reinforcement learning: training data is given as a response to the computer's actions when it is in use. It finds out if it got it right or wrong: maybe it wins or loses a game, or someone tells it that an email should have been marked as spam.

2. **Unsupervised learning** - you don't tell the computer anything about the input data and let it discover some structure and/or relationships itself.
    - This is used in fraud detection - to identify transactions or patterns that are unusual.
    - It can be used to find groups of documents that discuss similar topics.
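
To make the unsupervised case a little more concrete, here is a minimal sketch that clusters scikit-learn's built-in digits dataset without ever showing the algorithm the labels. The choice of k-means, and of 10 clusters, are assumptions made for illustration only. We only look at the true labels afterwards, to see what ended up in each group.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

digits = load_digits()

# Ask for 10 groups, using only the pixel data (no labels).
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(digits.data)

# Afterwards, peek at the true labels to see what each cluster contains.
for cluster in range(10):
    members = digits.target[cluster_ids == cluster]
    most_common = np.bincount(members, minlength=10).argmax()
    print("cluster %i: %i items, most common digit %i"
          % (cluster, len(members), most_common))
```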

## Machine learning methods

There are many methods used in machine learning tasks. Here are some of them:

- [**Linear Regression**](https://en.wikipedia.org/wiki/Linear_regression): You find how some dependent variable $y$ depends on the various components of some independent variables $\mathbf x$, assuming some linear relationship between them. This can be used both for regression and classification.
- [**Logistic Regression**](https://en.wikipedia.org/wiki/Logistic_regression): This is similar to linear regression, but uses the [logistic function](https://en.wikipedia.org/wiki/Logistic_function) instead; this has the form of the Fermi-Dirac distribution. This can be used to better define a boundary between categories than a linear regression, as it varies most rapidly in the region of the boundary.
- [**k-Nearest Neighbours (k-NN)**](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm): You determine the value or classification of an item by looking at the output for some number of its nearest neighbours in a training set.
- [**Support Vector Machine**](https://en.wikipedia.org/wiki/Support_vector_machine): This finds hyperplanes or more advanced boundaries that separate the training data into categories, so that any new item can be assigned to one of these categories.
- [**Naive Bayes classifier**](https://en.wikipedia.org/wiki/Naive_Bayes_classifier): This classifies objects based on combining probabilities it generates associated with each piece of information you know about something to find which category it most likely falls into.
- [**Artificial Neural Network**](https://en.wikipedia.org/wiki/Artificial_neural_network): These are computing systems inspired by biological neural networks. They are made up of connected units which transform their inputs in various ways, and transmit signals to each other. There are typically some input and output "neurons", and several "hidden layers" which connect between them and each other. These typically require much more training than the other approaches mentioned here.
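
As a quick check of the relationship between the logistic function and the Fermi-Dirac distribution mentioned above, here is a small sketch. The function names and the default values `mu=0` and `kT=1` are just choices made for this illustration.

```python
import numpy as np

def logistic(x):
    """Standard logistic function, 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def fermi_dirac(energy, mu=0.0, kT=1.0):
    """Fermi-Dirac occupation, 1 / (exp((E - mu) / kT) + 1)."""
    return 1.0 / (np.exp((energy - mu) / kT) + 1.0)

x = np.linspace(-6, 6, 13)
print(logistic(x))

# The two functions are mirror images of each other about x = 0.
print(np.allclose(logistic(x), fermi_dirac(-x)))
```

The function changes most rapidly around `x = 0`, which is what makes it useful for marking a boundary between two categories.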

## Deep Learning

This is a very advanced set of machine learning techniques generally based on using large, deep (many layered) neural networks trained with large amounts of data. What makes this approach so popular for difficult problems is that while other learning algorithms tend to reach a plateau, where training them with more data won't make much difference to how well that algorithm performs, deep learning tends to keep improving as you give it more data.

Deep learning is usually composed of neural nets with many hidden connected layers, where the output from one is passed as the input to the next. Each layer further abstracts the data and extracts important features. This approach has become more feasible in recent years due to the suitability of GPUs for training these algorithms, along with the ready availability of very large datasets. 

Applications include:
- Speech recognition
- Image recognition - and recognizing people in images
- Self driving vehicles
- Playing (and winning) Go
- Finding new ways to get you to buy things

## Machine Learning applied to Materials

In parallel with the creation of a number of materials databases in recent years, there have also been a number of initiatives to apply machine learning to the prediction of materials properties. Examples include [aflow](http://www.aflowlib.org) and [nomad](http://metainfo.nomad-coe.eu).

## Machine Learning and Deep Learning in Python

There are a large number of Python libraries for machine learning, with many tuned for particular methods or type of problem. A good listing is available at https://github.com/josephmisiti/awesome-machine-learning#python

The more commonly used general purpose machine learning and deep learning libraries are:
- [scikit-learn](http://scikit-learn.org) - [an open source](https://github.com/scikit-learn/scikit-learn) library largely written in Python and Cython, that has been designed to easily work with SciPy and NumPy. It does not offer GPU support, so would struggle to scale to the size of problems analysed with deep learning techniques. It is generally regarded as one of the easiest libraries to pick up and start using.
- [TensorFlow](https://www.tensorflow.org/) - [an open source](https://github.com/tensorflow/tensorflow) library developed by Google Brain, and used for both research and production at Google. This can be used with several other languages such as C++ and Java as well as Python. This can be used for deep learning. It now also includes a scikit type interface "scikit flow" allowing easier testing of various models.
- [MXNet](http://mxnet.io/) - [an open source](https://github.com/apache/incubator-mxnet) deep learning framework. This is used by Amazon internally. It can also be used with several other languages, such as C++, R and Matlab.

All these, and many others you might find, can be installed for Python using `pip`. If you want to have these use a GPU, you should check the project's page, as you may need to install it in a different way.

## A worked example with scikit-learn

As an example, let's look at how we might train an algorithm to identify handwritten digits.

Fortunately, scikit-learn comes with several [toy databases](http://scikit-learn.org/stable/datasets/index.html#toy-datasets) that can be used to practice different machine learning techniques. One of these is the [digits dataset](http://scikit-learn.org/stable/auto_examples/datasets/plot_digits_last_image.html#the-digit-dataset) which is from [here](http://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits) originally.

The [datasets](http://scikit-learn.org/stable/datasets/index.html) page gives information on how to load in datasets from many other sources.

Similarly to the matrix diagonalization case, getting the data into the _right_ format is generally a much harder step in the process than applying some pre-built learning algorithms.

Finding which digit a given image corresponds to will be a classification problem, and as we have some set of labeled data we'll use to train it, we'll be using a supervised learning approach.

### Installing scikit-learn

The first thing you'll need to do is make sure you have scikit-learn installed. You can do this in several ways, with some examples given at its [documentation page](http://scikit-learn.org/stable/install.html). If you're using your own Ubuntu install or virtual machine, `sudo pip3 install --upgrade scikit-learn` in a terminal should work. Make sure `python3 -c "import sklearn, numpy, scipy, matplotlib"` in a terminal doesn't give any errors before proceeding.

### Import the scikit-learn modules

We'll be using several sklearn modules, which we need to import first:

In [None]:
import numpy as np
import sklearn.datasets, sklearn.metrics
import sklearn.svm, sklearn.model_selection
import sklearn.linear_model, sklearn.neighbors
import sklearn.neural_network
import matplotlib.pyplot as plt

### Load the data

Now we can load the dataset. This is straightforward in this case as we're using a built-in dataset. Often you'd need to connect to an online source to do this, or use some separately downloaded set.

In [None]:
# Load the digits dataset. This is built in to scikit-learn.
digits = sklearn.datasets.load_digits()

In [None]:
# Let's see what we have here
help(digits)

In [None]:
# So it seems we get a dictionary-like object. We can list its keys with
digits.keys()

In [None]:
# We can output the description associated with this dataset as follows:
print(digits.DESCR)

In [None]:
# Let's look at the last element of the various fields
print("data[-1]=", digits.data[-1])
print("target[-1] =", digits.target[-1])
print("images[-1]=", digits.images[-1])

# This is a list of possible results.
print("target_names=", digits.target_names)

So it looks like `data[i] = images[i].flatten()`, i.e. these have the same information, but `data` is a length 64 1D array, whereas `images` is an 8x8 2D array.
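
Rather than comparing by eye, we could check this for every image at once. The following is a small self-contained sketch (it reloads the dataset so that it can be run on its own):

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# Flatten every 8x8 image into a length-64 vector and compare with data.
flattened = digits.images.reshape(len(digits.images), -1)
print(flattened.shape, digits.data.shape)
print(np.array_equal(flattened, digits.data))
```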

In [None]:
# Let's see what these look like:
# Plot the first digit image.
print(digits.target[0])
plt.imshow(digits.images[0], cmap=plt.cm.gray_r)
plt.show()
# Plot the last digit image.
print(digits.target[-1])
plt.imshow(digits.images[-1], cmap=plt.cm.gray_r)
plt.show()

So we can see each image is represented as an 8x8 array of pixels, with each pixel having a number, that seems to go between 0 and 16, representing its intensity. And even when we plot them, some of them are not completely obvious.

In [None]:
# Let's look at a few more of these.
for i, image in enumerate(digits.images[:10]):
    plt.subplot(1, 10, i + 1)
    plt.axis('off')
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title(digits.target[i])
plt.show()

n_digits = len(digits.data)
print("%i images in total" % n_digits)

We took a bit of a shortcut in using a built-in data set. If you want to use some other data, the important thing is to have some vector of values associated with each entry (corresponding to 8x8 pixels here), along with a corresponding label.

### Splitting up the data to a training and a testing set

Before we apply a machine learning algorithm to our data, we should split it into a training and a test set. A commonly used split is 80% training and 20% testing, but there's no hard rule here.

We could just select the first 80% of the data:
```
data_train = digits.data[:int(0.8*n_digits)]
target_train = digits.target[:int(0.8*n_digits)]
data_test = digits.data[int(0.8*n_digits):]
target_test = digits.target[int(0.8*n_digits):]
```

Or we could write a function to select our training set at random from our data. One important point if we do is that it's better to be able to reproduce the same split if you want to test a new model. A good way to do that is to set the random seed manually.

This approach can be done with the [sklearn.model_selection.train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function. 

There are also more advanced methods available within sklearn such as [sklearn.model_selection.StratifiedShuffleSplit](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html) to split up your data while ensuring you retain whatever statistical distribution you have between categories in your input data.
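
As a sketch of what a stratified split looks like, `train_test_split` also accepts a `stratify` argument, which is a convenient shortcut for the same idea. The example below is self-contained and its variable names (`X_train`, `y_test` and so on) are chosen just for this illustration. It compares the fraction of each digit in the full dataset with the fraction in the test set.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()

# stratify=digits.target keeps the proportion of each digit roughly
# the same in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target,
    test_size=0.2, random_state=42, stratify=digits.target)

overall = np.bincount(digits.target) / len(digits.target)
in_test = np.bincount(y_test) / len(y_test)
print("Fraction of each digit overall:", np.round(overall, 3))
print("Fraction of each digit in test:", np.round(in_test, 3))
```

This matters most when some categories are rare, where a purely random split could leave very few examples of a category in one of the two sets.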

Let's create a training and test set now with `train_test_split`:

In [None]:
# Randomly split our data into a training and test set.
# We apply the same split to our targets. And fix the
# random seed so we can reproduce the same split.
data_train, data_test, target_train, target_test = (
    sklearn.model_selection.train_test_split(digits.data,
                                             digits.target,
                                             test_size=0.2,
                                             random_state=42)
)

In [None]:
print("Number of training data:", len(data_train), len(target_train))
print("Number of test data:", len(data_test), len(target_test))

Now our training data is a set of vectors with a desired output for each vector, and we want to find some function that will map each vector to the desired output.

### Picking a model

Now we're ready to choose a model to use. There's no good rule for choosing a method, so it generally comes down to trying various approaches and seeing how they do. Of course we also need to be mindful that we could negatively impact how our model generalizes by picking a model that gives the best result when we test against our test set. We'll come back to this point later.

Let's try a [support vector machine](http://scikit-learn.org/stable/modules/svm.html) classifier first. This approach tries to find hyperplanes or more advanced boundaries between each classification based on the training data. A number of classifiers are available in [sklearn.svm](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.svm).

Let's begin with the [sklearn.svm.LinearSVC](http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html) classifier. This has many options to tune the model, but let's see how we do with the defaults to begin with.

In [None]:
# Create our machine learning classifier.
# This is a linear support vector classifier, accepting most defaults.
# I have only set the random_state parameter here so that this model
# will reproduce the same results for our (fixed) training set
# each time it's run.
clf_linearsvc = sklearn.svm.LinearSVC(random_state=0)

### Fitting the training data

Once we have initialized the model, we need to fit it to our training data. The scikit-learn models all come with a `fit()` method that you can use for this purpose. Let's do this now.

In [None]:
clf_linearsvc.fit(data_train, target_train)

And that's it, you've now trained a machine learning model!

### Testing the model

Now we need to find out how we did by using the model to predict classifications for our test set, and comparing them to the target values.

Let's try a couple of values.

In [None]:
# Make predictions on our first five test elements
clf_linearsvc.predict(data_test[0:5])

In [None]:
target_test[0:5]

This looks like we've done pretty well: the predictions for the first five elements match what they are labeled as.

Let's see how the predictions compare to the images for a number of the test elements:

In [None]:
# Here we use zip to pair each of the first 20 test images with its prediction.
data_and_predictions = list(zip(data_test[0:20], clf_linearsvc.predict(data_test[0:20])))
for i, (data, prediction) in enumerate(data_and_predictions):
    plt.subplot(1, 20, i+1)
    plt.axis('off')
    plt.imshow(data.reshape(8, 8), cmap=plt.cm.gray_r)
    plt.title(prediction)
plt.show()

This all looks pretty good. We can see what looks like a 9 was classified as a 1 however.

We can use the [score()](http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC.score) method to get the average accuracy of our model over our test set.

In [None]:
clf_linearsvc.score(data_test, target_test)

### Analysing the Model

Although we have a general score, it would be good to have some more detailed statistics about how well the model does. This tends to be more useful in cases where the model doesn't perform well, as you can use it to try to see where issues are arising. Scikit-learn has the [sklearn.metrics](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics) module which comes with several ways to generate metrics to quantify how well your model is doing. Many methods of evaluating a model are given in the [scikit-learn documentation](http://scikit-learn.org/stable/modules/model_evaluation.html).

Let's use
- [sklearn.metrics.classification_report](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html): this generates a text report for the performance of a classifier
- [sklearn.metrics.confusion_matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html): this returns a confusion matrix for a classifier, which tells you which classification was assigned to each item vs which it was supposed to be. A good classification model will have mostly zeros or small values on non-diagonal entries: these are the classifications that are incorrect.

In [None]:
# First we'll get an array of all the predicted values for the test set
predicted = clf_linearsvc.predict(data_test)

# Then we pass the expected and predicted results to the function
# to generate the report.
report = sklearn.metrics.classification_report(target_test, predicted)
print("Classification report for classifier %s:\n%s\n" % (clf_linearsvc, report))

This is pretty good. The first two columns list the [precision and recall](http://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html). 

For a given target:
- The **precision** gives the number correctly assigned to that target divided by the total number (correct and incorrect) assigned to that target.
- The **recall** is  the number correctly assigned to that target divided by the number that should have been assigned to that target.
    - For example, if our classifier was labeling every digit image as 0, its precision would be very low, but its recall would be 1.0.
    - How these numbers differ from 1.0 can show you which categories are causing issues for your classifier.
- The **f1-score** is the harmonic mean of the precision and recall $2\frac{P\times R}{P+R}$.
- Ideally these three numbers will be close to one for each target if your classifier is performing well.
- The **support** is the number of each label in the expected targets. The same set of values would be returned by `numpy.bincount(target_test)` in our case.
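
To make these definitions concrete, here is a small sketch that works through the "labels everything as 0" example by hand. The labels are made up for illustration.

```python
import numpy as np

# Made-up true labels, and a lazy classifier that always answers 0.
y_true = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2, 3])
y_pred = np.zeros_like(y_true)

target = 0
# Items correctly assigned to the target.
true_positives = np.sum((y_pred == target) & (y_true == target))
# Divide by everything assigned to the target (correct or not).
precision = true_positives / np.sum(y_pred == target)
# Divide by everything that should have been assigned to the target.
recall = true_positives / np.sum(y_true == target)
f1 = 2 * precision * recall / (precision + recall)

print("precision:", precision)
print("recall:", recall)
print("f1-score:", f1)
```

Only 3 of the 10 items labeled as 0 really are 0s, so the precision is low, but none of the true 0s were missed, so the recall is perfect.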

Now let's try the [confusion matrix](http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html):

In [None]:
cm = sklearn.metrics.confusion_matrix(target_test, predicted)
print("Confusion matrix:\n%s" % cm)

Here, as output, each row corresponds to each of our expected categories, and each column gives the category that the classifier assigned the digit to. So a classifier that works perfectly will only have non-zero elements along the diagonal. This lets us see whether some pair of classifications is causing issues for our classifier. In this case, the worst we have is three items that should have been 9s that were instead classified as 8s.

We could also use this to calculate the precision and recall:
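- the precision is the element on the diagonal divided by the column total (everything the classifier assigned to that category).
- the recall is the element on the diagonal divided by the row total (everything that should have been assigned to that category).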
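
Here is a minimal sketch of that calculation, using made-up labels for a three-class problem. It relies on the scikit-learn convention that rows of the confusion matrix are the true labels and columns are the predicted labels, so column totals count what was assigned to each category and row totals count what truly belongs to it. The result is compared against `precision_score` and `recall_score` from `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for a three-class problem.
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 1, 2, 2, 0, 0])

cm = confusion_matrix(y_true, y_pred)

# Rows are true labels and columns are predicted labels, so:
precision = np.diag(cm) / cm.sum(axis=0)  # diagonal / column totals
recall = np.diag(cm) / cm.sum(axis=1)     # diagonal / row totals

print(cm)
print("precision:", precision)
print("recall:", recall)

# Compare with the values scikit-learn computes for each class.
print(np.allclose(precision, precision_score(y_true, y_pred, average=None)))
print(np.allclose(recall, recall_score(y_true, y_pred, average=None)))
```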

This can also be plotted as follows:

In [None]:
plt.matshow(cm)
plt.title('Confusion matrix')
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()

### Saving your trained model

Often you'll want to be able to save your trained model, and reload it later. Our dataset is small here, and we're using fairly simple models, so training is quite fast. But often training can take a number of hours, so you'll need to be able to save the result. This is covered in the [model persistence](http://scikit-learn.org/stable/modules/model_persistence.html) section of the scikit-learn documentation.

#### Pickle
We can save the model to file with the built-in Python `pickle` library, as e.g.
```
import pickle

# Using "with" makes sure the file is closed once we're done with it.
with open("clf_digits.p", "wb") as f:
    pickle.dump(clf_linearsvc, f)
```
And then when we want to load it we could do
```
with open("clf_digits.p", "rb") as f:
    clf_linearsvc = pickle.load(f)
```
#### joblib

The [joblib](https://joblib.readthedocs.io) library, which is installed along with scikit-learn, offers its own method to save to file. This can be more efficient for some models which have large amounts of internal data associated with them. Older versions of scikit-learn bundled it as `sklearn.externals.joblib`, but that has since been removed, so we import `joblib` directly.

```
import joblib
joblib.dump(clf_linearsvc, 'clf_digits.pkl')
```
And then when we want to load the file later
```
clf_linearsvc = joblib.load('clf_digits.pkl')
```
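
Whichever method you use, it's worth checking that a reloaded model behaves the same as the original. Here is a self-contained sketch of such a round trip using `pickle`. The k-NN model, the temporary directory and the file name `model_roundtrip.p` are all just stand-ins chosen for this illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
model = KNeighborsClassifier().fit(digits.data, digits.target)

# Save to a temporary directory so we don't leave files lying around.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "model_roundtrip.p")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The reloaded model should give exactly the same predictions.
same = np.array_equal(model.predict(digits.data[:50]),
                      restored.predict(digits.data[:50]))
print("Predictions identical after reload:", same)
```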

### Other Models

Let's try some different models for the same training and test set and see how they do.

We can try a **logistic regression** using [sklearn.linear_model.LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). This fits a logistic function to the data, and can be used to find a boundary between different classifications. We can use this in the same way as before:

In [None]:
# Again we initialize our model, and accept all defaults since
# we don't yet know any better.
clf_logistic = sklearn.linear_model.LogisticRegression()

# And now we train our model with our training data
clf_logistic.fit(data_train, target_train)

And now we can analyse how this has done in the same way as before.

In [None]:
print("Logistic regression model score:\n ",
      clf_logistic.score(data_test, target_test))

predicted = clf_logistic.predict(data_test)
report = sklearn.metrics.classification_report(target_test, predicted)
print("Classification report for classifier %s:\n%s\n" % (clf_logistic, report))

cm = sklearn.metrics.confusion_matrix(target_test, predicted)
print("Confusion matrix:\n%s" % cm)

Again, we've done very well using the model defaults. We can see from the confusion matrix that we have four 9s this time that are mistakenly classified as 8s, but some of the other categories are a little more accurate and we do similarly well overall.

Let's try a **k-NN model** now, using [sklearn.neighbors.KNeighborsClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html). When we make a prediction using this model, it finds some number of the nearest vectors, and returns a classification based on the classification of these neighbours.

Again we initialize the model (using defaults, since this has done all right so far), and fit it to the training data.

In [None]:
# Again we initialize our model, and accept all defaults since
# we don't yet know any better.
clf_kNN = sklearn.neighbors.KNeighborsClassifier()

# And now we train our model with our training data
clf_kNN.fit(data_train, target_train)

And again we can analyse how well we do using the test data. We'll generate the score, classification report, and confusion matrix as previously.

In [None]:
print("k-NN classifier model score:\n ",
      clf_kNN.score(data_test, target_test))

predicted = clf_kNN.predict(data_test)
report = sklearn.metrics.classification_report(target_test, predicted)
print("Classification report for classifier %s:\n%s\n" % (clf_kNN, report))

cm = sklearn.metrics.confusion_matrix(target_test, predicted)
print("Confusion matrix:\n%s" % cm)

Again, we've done very well using the model defaults. We can see from the confusion matrix that only a handful of digits were incorrectly classified.

Finally let's look at a **neural network**. We'll use what's known as a [multi-layer perceptron](https://en.wikipedia.org/wiki/Multilayer_perceptron) classifier, from [sklearn.neural_network.MLPClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html). This is built from layers of a particular type of artificial neuron known as a [perceptron](https://en.wikipedia.org/wiki/Perceptron). A perceptron accepts several inputs, which may come from other perceptrons, and if the weighted inputs exceed some threshold it outputs 1, but otherwise outputs 0. The training process adjusts the weights connecting the perceptrons based on the training data.

In [None]:
# Initialize our model. Again fixing random state for
# reproducible behaviour in this notebook.
clf_mlp = sklearn.neural_network.MLPClassifier(random_state=0)

# Train our model with our training data
clf_mlp.fit(data_train, target_train)

print("MLP classifier model score:\n ",
      clf_mlp.score(data_test, target_test))

predicted = clf_mlp.predict(data_test)
report = sklearn.metrics.classification_report(target_test, predicted)
print("Classification report for classifier %s:\n%s\n" % (clf_mlp, report))

cm = sklearn.metrics.confusion_matrix(target_test, predicted)
print("Confusion matrix:\n%s" % cm)

Again this was pretty good overall. You may have noticed this took a little longer to fit the training data than the other models. This is one of the drawbacks of artificial neural networks.

While we may be tempted to assume the k-NN model is the best in general, this could just be the case for our particular split of training and test data, as the difference between them was quite small. By picking a model, or parameters used in a given model, based on how well it does on the test set, we are fitting to the test set, and we can't assume that a model picked this way will perform best in general.

### Cross Validation

A better way to compare models would be to use what's known as **cross validation**. In this approach, we don't use the test set to compare models. Instead the training set is split into a smaller training set and a _validation_ set. The various models are trained against the smaller training set and evaluated using the validation set. We can do this split a number of different ways and get a score for each model from each different split.

In scikit-learn we can do this automatically using [sklearn.model_selection.cross_val_score](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html). This splits the training set into a number of distinct subsets known as folds. It will then assign each fold as the validation set in turn, and train using the rest of the folds. It will then output an array of scores from using each fold as the validation set. There is more information on this approach at the [scikit-learn documentation](http://scikit-learn.org/stable/modules/cross_validation.html).
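
To see what this is doing for us, here is a sketch of the same procedure written out by hand. It assumes that for a classifier and an integer `cv`, `cross_val_score` uses stratified folds without shuffling, which is my understanding of the default behaviour. The k-NN model and the choice of 5 folds are just for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
X, y = digits.data, digits.target
model = KNeighborsClassifier()

manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5).split(X, y):
    fold_model = clone(model)  # fresh, untrained copy for each fold
    fold_model.fit(X[train_idx], y[train_idx])
    # Score on the fold that was held out from training.
    manual_scores.append(fold_model.score(X[val_idx], y[val_idx]))

auto_scores = cross_val_score(model, X, y, cv=5)
print("manual:", np.round(manual_scores, 4))
print("auto:  ", np.round(auto_scores, 4))
```

Each item is used for validation exactly once, and the model being scored has never seen the fold it is evaluated on.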

Let's do this for the four models we've tried so far.

In [None]:
# We can choose to use 10 folds using cv=10 in each case here.

linearsvc_scores = sklearn.model_selection.cross_val_score(
    clf_linearsvc, data_train, target_train, cv=10)
print(linearsvc_scores)

logistic_scores = sklearn.model_selection.cross_val_score(
    clf_logistic, data_train, target_train, cv=10)
print(logistic_scores)

kNN_scores = sklearn.model_selection.cross_val_score(
    clf_kNN, data_train, target_train, cv=10)
print(kNN_scores)

mlp_scores = sklearn.model_selection.cross_val_score(
    clf_mlp, data_train, target_train, cv=10)
print(mlp_scores)

In [None]:
# Let's print the average score for each case.
print("linearSVC average score:", np.average(linearsvc_scores))
print("logistic average score:", np.average(logistic_scores))
print("kNN average score:", np.average(kNN_scores))
print("mlp average score:", np.average(mlp_scores))

From this we have a much more robust evaluation of the performance of the various models in this case. And the default k-NN does actually seem to be the best.
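
When the averages are close, it can also help to look at how much the scores vary between folds. The following is a small sketch, assuming the digits toy dataset and two stand-in models (a default k-NN and a logistic regression with a raised `max_iter`, which is my choice here and not necessarily what was used above), that reports the average together with the standard deviation.

```python
import numpy as np
import sklearn.datasets
import sklearn.linear_model
import sklearn.model_selection
import sklearn.neighbors

# Assumption: the digits toy dataset stands in for the training data.
data, target = sklearn.datasets.load_digits(return_X_y=True)

# Stand-in models; max_iter is raised only to help the solver converge.
models = {
    "kNN": sklearn.neighbors.KNeighborsClassifier(),
    "logistic": sklearn.linear_model.LogisticRegression(max_iter=2000),
}

summary = {}
for name, model in models.items():
    scores = sklearn.model_selection.cross_val_score(
        model, data, target, cv=5)
    # Keep both the average and the spread of the per-fold scores.
    summary[name] = (np.average(scores), np.std(scores))
    print(f"{name}: {summary[name][0]:.3f} +/- {summary[name][1]:.3f}")
```

If the difference between two averages is smaller than the spread across folds, it is probably not safe to call one model better than the other.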

### Hyperparameters

We've used the default parameters for each of the models so far and gotten pretty good results. However, this won't always be the case. Most models have one or more settings, known as **hyperparameters**, which are chosen before training rather than learned from the data, and which tune various aspects of their behaviour. For example, in the k-NN model, we can choose how many nearest neighbours we look at, and we could also choose to weight the neighbours in different ways, such as by the inverse of their distance to the point.
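
As a rough illustration of the k-NN example, the sketch below varies the `n_neighbors` and `weights` arguments of `KNeighborsClassifier` and records the average cross validation score for each combination. It assumes the digits toy dataset as a stand-in for our training data, and the particular values tried are arbitrary.

```python
import sklearn.datasets
import sklearn.model_selection
import sklearn.neighbors

# Assumption: the digits toy dataset stands in for the training data.
data, target = sklearn.datasets.load_digits(return_X_y=True)

results = {}
for n_neighbors in (1, 5, 15):
    # "uniform" counts every neighbour equally; "distance" weights
    # each neighbour by the inverse of its distance to the point.
    for weights in ("uniform", "distance"):
        clf = sklearn.neighbors.KNeighborsClassifier(
            n_neighbors=n_neighbors, weights=weights)
        scores = sklearn.model_selection.cross_val_score(
            clf, data, target, cv=5)
        results[(n_neighbors, weights)] = scores.mean()

for (n_neighbors, weights), score in results.items():
    print(f"n_neighbors={n_neighbors}, weights={weights}: {score:.3f}")
```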

Let's see how this works by looking at another support vector machine classifier: [sklearn.svm.SVC](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html). This defaults to a more complex boundary between classifications. It has a hyperparameter `gamma` which tunes how sensitive the generated boundary is to the variation in the points: a higher value will fit around the training points more closely but may overfit the training set.
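
One way to see this trade-off is to compare the score on the data the model was trained on with the score on held-out data for a few values of `gamma`. The sketch below uses `sklearn.model_selection.validation_curve` for this, assuming the digits toy dataset as a stand-in for our training data; the three `gamma` values are arbitrary picks for illustration. A large gap between the two scores is a sign of overfitting.

```python
import sklearn.datasets
import sklearn.model_selection
import sklearn.svm

# Assumption: the digits toy dataset stands in for the training data.
data, target = sklearn.datasets.load_digits(return_X_y=True)

# Arbitrary illustrative values, from small to large.
gammas = [0.0001, 0.001, 0.1]

# Each returned array has one row per gamma value and one column per fold.
train_scores, val_scores = sklearn.model_selection.validation_curve(
    sklearn.svm.SVC(), data, target,
    param_name="gamma", param_range=gammas, cv=5)

train_means = train_scores.mean(axis=1)
val_means = val_scores.mean(axis=1)
for gamma, train, val in zip(gammas, train_means, val_means):
    print(f"gamma={gamma}: train={train:.3f}, validation={val:.3f}")
```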

Let's try with the defaults to begin with:

In [None]:
# Initialize our model. Again we'll set the random_state
# for better reproducibility in this notebook.
clf_svc = sklearn.svm.SVC(random_state=0)

# And now let's do cross validation as before to see how this scores on average.
svc_scores = sklearn.model_selection.cross_val_score(
    clf_svc, data_train, target_train, cv=10)
print("Scores: ", svc_scores)
print("Average: ", np.average(svc_scores))

This is pretty bad compared to our earlier models. It's correct less than half the time.

If we want to persist with this model, we need to try to find better values of the **hyperparameters**. We could manually vary them until we improve our cross validation scores. This would work, but could be time consuming if we have several parameters we want to vary. It would be better to have some way to do this search automatically.

There are a few ways we can approach this using tools provided by scikit-learn, and there is also some advice on this in [their documentation](http://scikit-learn.org/stable/modules/grid_search.html).

One option is the [sklearn.model_selection.GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) class. We can pass this a dictionary of parameters with lists of values and it will automatically test all combinations. When trying to find good hyperparameters, it's even more important to use cross validation, and `GridSearchCV` will do this automatically. Let's try this first.

We have no idea what a good value of `gamma` is, so we'll try powers of 10 with exponents from -4 to 4. Again we have a `cv` parameter that generates a set of folds for cross validation.

In [None]:
# Define a dict of the parameters to vary and the values they should take.
# We'll just vary a single parameter here.
param_grid = {'gamma': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000]}

# Initialize the grid search with our svc model.
# Again we'll use a 10-fold cross validator.
clf_svc_gs = sklearn.model_selection.GridSearchCV(clf_svc, param_grid, cv=10)
# Here we passed clf_svc as the model, we could have passed the
# sklearn.svm.SVC model directly, and set 'random_state': [0] in the
# param_grid above to also set this parameter.

# And fit it to our training data. This step may take a minute or so.
clf_svc_gs.fit(data_train, target_train)

In [None]:
# We can find out which parameters it found to be best as follows:
clf_svc_gs.best_params_

By default, once it finds the best value from the cross validation search, it will use this to fit the full training set; it's best to have the final model fit with all the training data rather than only the part used within each cross validation split. So the `clf_svc_gs` object can be used like any of our fitted models before. Let's see how it scores now:
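
Before scoring, it can be useful to look at what the search recorded. The sketch below is a self-contained illustration, assuming the digits toy dataset, a split made with `train_test_split`, and a shortened grid of `gamma` values to keep it quick. It prints the average validation score for each value from `cv_results_`, and shows that scoring the search object is the same as scoring the refitted `best_estimator_`.

```python
import sklearn.datasets
import sklearn.model_selection
import sklearn.svm

# Assumption: the digits toy dataset and this split stand in for our data.
data, target = sklearn.datasets.load_digits(return_X_y=True)
data_train, data_test, target_train, target_test = (
    sklearn.model_selection.train_test_split(data, target, random_state=0))

# A shortened grid, just for illustration.
param_grid = {'gamma': [0.0001, 0.001, 0.01]}
search = sklearn.model_selection.GridSearchCV(
    sklearn.svm.SVC(random_state=0), param_grid, cv=5)
search.fit(data_train, target_train)

# Average validation score for each value tried.
for params, mean_score in zip(search.cv_results_['params'],
                              search.cv_results_['mean_test_score']):
    print(params, round(mean_score, 3))

# best_estimator_ is the model refitted on the full training set.
search_score = search.score(data_test, target_test)
best_score = search.best_estimator_.score(data_test, target_test)
print(search.best_params_, search_score, best_score)
```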

In [None]:
print("SVC classifier model score:\n ",
      clf_svc_gs.score(data_test, target_test))

predicted = clf_svc_gs.predict(data_test)
report = sklearn.metrics.classification_report(target_test, predicted)
print("Classification report for classifier %s:\n%s\n" % (clf_svc_gs, report))

cm = sklearn.metrics.confusion_matrix(target_test, predicted)
print("Confusion matrix:\n%s" % cm)

This is about as good as the k-NN model, so you can see the value in trying to tune the model via the hyperparameters, rather than disregarding it if it doesn't provide a good result with the defaults.

If you have many hyperparameters you wish to optimize, another method you can use, rather than defining a particular grid of values, is to try many random values. For example, [sklearn.model_selection.RandomizedSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) can try a predefined number of random combinations of values.
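
A minimal sketch of a randomized search is given below. It assumes the digits toy dataset as a stand-in for our training data, and samples both `gamma` and the regularization parameter `C` of `SVC` (which we haven't varied above) from log-uniform distributions using `scipy.stats.loguniform`. The ranges and the number of iterations are arbitrary choices for illustration.

```python
import scipy.stats
import sklearn.datasets
import sklearn.model_selection
import sklearn.svm

# Assumption: the digits toy dataset stands in for the training data.
data, target = sklearn.datasets.load_digits(return_X_y=True)

# Distributions to sample from, rather than fixed lists of values.
# The ranges here are arbitrary illustrative choices.
param_distributions = {
    'gamma': scipy.stats.loguniform(1e-4, 1e1),
    'C': scipy.stats.loguniform(1e-1, 1e2),
}

# n_iter sets how many random combinations are tried.
random_search = sklearn.model_selection.RandomizedSearchCV(
    sklearn.svm.SVC(), param_distributions,
    n_iter=8, cv=3, random_state=0)
random_search.fit(data, target)

print(random_search.best_params_)
print(random_search.best_score_)
```

The cost of the search is set by `n_iter` rather than by the size of a grid, so adding another hyperparameter doesn't multiply the number of fits.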

### Concluding Remarks

If you plan to use machine learning in one of your projects, I suggest you read the documentation associated with whatever library you use. There may be common pitfalls associated with some of these models that can adversely affect performance but can be easily avoided. While several of the models we used achieved very high accuracy, none of them were perfect; you need to keep in mind that machine learning won't always give you the right answer, just the answer it thinks is best based on what it has learned so far. Normally this isn't a big deal, but if your self-driving car fails to stop at 1% of red lights, you should spend more time working on your model before trusting it to work on public roads.

### Exercise

Load one of the other [toy datasets](http://scikit-learn.org/stable/datasets/index.html#toy-datasets) available from scikit-learn and try to train a model on it. The [iris dataset](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html) is another classic classification problem, with 3 categories, and may be a good place to start.
