# Machine Learning 1: Background and techniques

## IMPORTANT:
To get the material in this week's session to work you will need to do the following if you are using a cluster machine (and you may need to do something similar if you are using your own computer, depending on your version of anaconda):
1. Open a new cmd.exe window
2. Execute the command: `conda install opencv`
3. Wait......
4. Restart your kernel.

These instructions add a computer vision module to the python install. You should now be ready to work through the notebooks.

## Introduction
In the lecture I introduced some of the key ideas of machine learning. In today's practical class we explore these ideas further. I want to show you how easy it is to use python to explore the power of machine learning with minimal need for new code. Later today we will use python to implement a method from a recent materials science publication that uses machine learning to automatically classify microstructure.

In [None]:
# Import some modules that we will require
import numpy as np  
import scipy as sp
import sklearn # This is a module of machine learning tools

import matplotlib.pyplot as plt
%matplotlib inline

### Resources
As always, the best place to find help with coding is the web, via some judicious use of a search engine. There is also, of course, the python documentation. A good book is "Introduction to Machine Learning with Python", by Andreas C. Mueller and Sarah Guido. Many of the examples used in this notebook are adapted from this book.

![Cover of the book Introduction to Machine Learning with Python](Images/Book.jpg "Title")

## Unsupervised learning
We'll begin by considering some examples of *unsupervised learning*. What is unsupervised learning? From Wikipedia: *"Unsupervised machine learning is the machine learning task of inferring a function to describe hidden structure from "unlabeled" data (a classification or categorization is not included in the observations)"*

So, as an example, imagine that we had a dataset containing measurements of two material properties, let's call them *Property A* and *Property B*. We might suspect that these data correspond to measurements on samples of several different classes of materials, but we don't know which sample comes from which class or what a typical range of values for each property would be for each class. *Unsupervised learning* addresses this sort of problem.

Another example might be one of *feature extraction*. Say we have a bunch of images of material microstructures. We might be interested in searching through these images to determine what features regularly occur in the microstructures. We don't need to know anything about where the microstructures come from. Instead, we are just interested in learning the patterns in the data that we have.

The first part of this notebook looks at using unsupervised learning in the form of cluster analysis.

### Cluster analysis: k-means clustering by hand
We begin by implementing a simple method by hand.

#### The problem
Let's assume that we have a bunch of data for a pair of properties for a number of samples. We won't worry for now about what these properties are, but they could, in principle, be any kind of data, either continuous variables or category data. The methods we will use are also not restricted to two dimensions: we could deal with many more properties at the same time.

As an example, we could imagine that we have made a number of micro-hardness indents across the microstructure of some material and recorded the hardness and pop-in load at a series of points. We are now interested in discovering if these measurements fall into any kind of pattern. Perhaps certain ranges of hardness and pop-in load tend to occur together. Perhaps these groups of points correspond to different phases in the material, but the first thing we need to do is look to see if there is any such pattern in the data.

Let's start by taking a look at the data:

In [None]:
nclusters = 3
data = np.loadtxt('Data/clusterdata.txt', delimiter=',')
plt.scatter(data[:,0],data[:,1],marker='o', color='r')
plt.xlabel('Property A')
plt.ylabel('Property B')
plt.gca().set_aspect('equal') # gca() returns the axes we just plotted on

There is clearly some structure to this data: our samples seem to fall into three distinct clusters. What we want to do now is have the computer automatically group the data into clusters. Broadly speaking this means finding groups of points such that the variation in the two properties *within* the group is minimised but the differences *between* the groups are maximised. To illustrate this process, we will now implement a method called *k-means clustering*. First of all, we will do this "by hand" so that we can understand what is going on. Later on we will use a built-in method of the scikit-learn module to demonstrate the power of this module.
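One common way to put a number on the "variation within the group" is the sum of squared distances from each point to the centre of its own cluster. The sketch below illustrates this on a few made-up points; the helper name `within_cluster_ss` is my own invention for this illustration and is not part of any library.

```python
import numpy as np

def within_cluster_ss(points, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    total = 0.0
    for label in np.unique(labels):
        members = points[labels == label]
        centre = members.mean(axis=0)
        total += np.sum((members - centre) ** 2)
    return total

# Made-up data: two tight groups of two points each
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
good_labels = np.array([0, 0, 1, 1])  # groups nearby points together
bad_labels = np.array([0, 1, 0, 1])   # mixes the two groups

print(within_cluster_ss(points, good_labels))  # 1.0
print(within_cluster_ss(points, bad_labels))   # 50.0
```

The sensible grouping gives a much smaller value than the mixed-up one, which is the sense in which a good clustering keeps the within-group variation small.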

#### The method
First of all we need to decide how many clusters we will ask the computer to define. In this case, the data clearly suggest that three clusters would be appropriate. Next we pick three data points at random from the dataset to serve as initial guesses for our cluster centres. Let's do that now and plot these cluster centres (as black triangles):

In [None]:
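# Pick nclusters distinct data points at random (no repeats, so no two centres coincide)
centres = data[np.random.choice(np.size(data,0), nclusters, replace=False),:]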
plt.scatter(data[:,0],data[:,1],marker='o', color='r')
plt.scatter(centres[:,0],centres[:,1],marker='^', color='k', s=200)
plt.xlabel('Property A')
plt.ylabel('Property B')
plt.gca().set_aspect('equal')

Now, there's a good chance that this random selection has not resulted in a particularly good choice for the cluster centres (Note: if, by chance, you did get a good choice of cluster centres, then I suggest you rerun the above cell until you get a *poor* choice. The next few cells will make more sense this way!). Couldn't we do better by choosing the cluster centres by hand? In this case, yes we could, but remember that we are aiming for an *unsupervised* learning method, in which we intervene as little as possible in the clustering process. Also, if the data set had many more dimensions (or more noise) it might not be easy to see initially where the cluster centres ought to lie.

The next stage of the process is to classify all the points into three clusters based on which of the initial cluster centres is the closest. To do this we will define a python function which takes the raw data and a list of cluster centres as input and populates an array of flags, one per data point, to indicate which cluster the point belongs to:

In [None]:
def classify(data,centres):
    cluster_id = np.zeros(np.size(data,0), dtype=int) # Create an array of zeros to hold the cluster identities
    nclusters = np.size(centres,0)
    for i in range(np.size(data,0)):
        mind = np.inf # Smallest distance found so far for this point
        for s in range(nclusters):
            d = np.linalg.norm(data[i,:]-centres[s,:]) # Euclidean distance to centre s
            if d < mind:
                mind = d
                cluster_id[i] = s
    return cluster_id
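For larger data sets the double loop can be slow. As a rough sketch of an alternative, the same nearest-centre assignment can be written with NumPy broadcasting. The name `classify_vectorised` and the sample points are made up for this illustration; like the loop version, ties go to the lowest-numbered centre.

```python
import numpy as np

def classify_vectorised(data, centres):
    # diffs has shape (npoints, nclusters, ndims)
    diffs = data[:, np.newaxis, :] - centres[np.newaxis, :, :]
    distances = np.linalg.norm(diffs, axis=2)  # distance of every point to every centre
    return np.argmin(distances, axis=1)        # index of the nearest centre

# Made-up points and centres
points = np.array([[0.0, 0.1], [0.2, 0.0], [4.0, 4.1], [3.9, 4.0]])
centres = np.array([[0.0, 0.0], [4.0, 4.0]])
print(classify_vectorised(points, centres))  # [0 0 1 1]
```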

Let's now use this function and take a look at the result:

In [None]:
cluster_id = classify(data,centres) # Run the clustering function

color = ['r','g','b']
for s in range(nclusters):
    plt.scatter(data[(cluster_id == s),0],data[(cluster_id == s),1],marker='o', color=color[s])
plt.scatter(centres[:,0],centres[:,1],marker='^', color='k', s=200)
plt.xlabel('Property A')
plt.ylabel('Property B')
plt.gca().set_aspect('equal')

Now, because our initial choice of cluster centres was not ideal, the points will not be perfectly classified. We now move to the next stage of the algorithm, which first involves finding the centres of our newly defined (imperfect) clusters. This is simply the average value of the coordinates (the properties) of the points in each cluster. Let's implement a python function to do this:

In [None]:
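def getcentres(data,cluster_id,centres):
    # Overwrite centres (in place) with the mean position of the points in each cluster
    nclusters = np.size(centres,0)
    count = np.zeros(nclusters, dtype=int)
    centres[:,:] = 0
    for i in range(np.size(data,0)):
        count[cluster_id[i]] = count[cluster_id[i]] + 1 # Number of points in this cluster
        centres[cluster_id[i]] = centres[cluster_id[i]] + data[i,:] # Running sum of coordinates
    # Divide the sums by the counts to get the means (a cluster with no points gives NaN)
    centres[:,0] = centres[:,0]/count[:]
    centres[:,1] = centres[:,1]/count[:]
    return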

Now we will use this function to get the new centres of the clusters and plot them on top of the data:

In [None]:
getcentres(data,cluster_id,centres)
for s in range(nclusters):
    plt.scatter(data[(cluster_id == s),0],data[(cluster_id == s),1],marker='o', color=color[s])
plt.scatter(centres[:,0],centres[:,1],marker='^', color='k', s=200)
plt.xlabel('Property A')
plt.ylabel('Property B')
plt.gca().set_aspect('equal')

These probably look like much better values for the cluster centres. Let's now repeat the process of classifying the points according to their distance from the new cluster centres and then finding the new cluster centres based on this classification. Because we put our code into functions, this is particularly easy to do:

In [None]:
cluster_id = classify(data,centres)
getcentres(data,cluster_id,centres)
for s in range(nclusters):
    plt.scatter(data[(cluster_id == s),0],data[(cluster_id == s),1],marker='o', color=color[s])
plt.scatter(centres[:,0],centres[:,1],marker='^', color='k', s=200)
plt.xlabel('Property A')
plt.ylabel('Property B')
plt.gca().set_aspect('equal')

You should now, after only two iterations of the clustering algorithm, have a reasonable solution. Just in case the clustering is still not perfect, we can now iterate through the cluster assignment and centre determination process a few more times and check the result:

In [None]:
ntries = 10
for i in range(ntries):
    cluster_id = classify(data,centres)
    getcentres(data,cluster_id,centres)
for s in range(nclusters):
    plt.scatter(data[(cluster_id == s),0],data[(cluster_id == s),1],marker='o', color=color[s])
plt.scatter(centres[:,0],centres[:,1],marker='^', color='k', s=200)
plt.xlabel('Property A')
plt.ylabel('Property B')
plt.gca().set_aspect('equal')

This should now look pretty good, whatever the initial choice of cluster centres.  <span style="color: red">You should note</span> that it is possible that the initial choice of cluster centres is such that even after 10 iterations the algorithm performs very poorly. If this is the case for you, simply go back up to the cell where the initial cluster centres are randomly chosen and restart the clustering process.
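Rather than guessing a fixed number of iterations, one option is to stop as soon as the cluster assignments no longer change. The sketch below shows this idea on a small made-up data set. The function `kmeans_by_hand` is a hypothetical helper written for this illustration, and letting an empty cluster keep its previous centre is my own assumption about how to handle that case.

```python
import numpy as np

def kmeans_by_hand(data, centres, max_iter=100):
    """Alternate assignment and update steps until the labels stop changing."""
    centres = np.array(centres, dtype=float)  # work on a copy of the starting centres
    labels = None
    for n_iter in range(1, max_iter + 1):
        # Assignment step: index of the nearest centre for every point
        distances = np.linalg.norm(data[:, None, :] - centres[None, :, :], axis=2)
        new_labels = np.argmin(distances, axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # converged: no point changed cluster
        labels = new_labels
        # Update step: move each centre to the mean of its members
        for s in range(len(centres)):
            members = data[labels == s]
            if len(members) > 0:  # an empty cluster keeps its previous centre
                centres[s] = members.mean(axis=0)
    return labels, centres, n_iter

# Made-up data: two square groups of four points each
data = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
                 [8.0, 8.0], [8.0, 9.0], [9.0, 8.0], [9.0, 9.0]])
start = data[[0, 3]]  # deliberately poor start: both centres in the first group

labels, centres, n_iter = kmeans_by_hand(data, start)
print(labels)   # [0 0 0 0 1 1 1 1]
print(centres)  # centres at (0.5, 0.5) and (8.5, 8.5)
print(n_iter)   # 3
```

Even with both starting centres in the same group, this small example settles after three passes: the third pass only confirms that nothing has changed.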

#### <span style="color: red"> Task 1:</span> Try out some different initial choices for the cluster centres
Experiment by repeatedly running the cell that picks the initial centres at random until you get a really bad choice (e.g. all the centres within one cluster) and then rerun the remaining cells to see how quickly the result converges to the obviously correct answer.

### Using scikit-learn: k-means clustering the easy way

k-means clustering uses a simple and intuitive algorithm (which is why we chose it as an example), but some machine learning algorithms are much more complex and would require a lot of coding. This is where python and its huge community of users and contributors come into their own. Scikit-learn is a module that implements many of the most useful machine learning algorithms. As a first example of how to use it, we will repeat the process of k-means clustering of our example data. All we do is import the `KMeans` class from the module, create a `KMeans` object and then call the fitting method on the object (we'll reload the data just to keep things clean):

In [None]:
from sklearn.cluster import KMeans
nclusters = 3
data = np.loadtxt('Data/clusterdata.txt', delimiter=',')
kmeans = KMeans(n_clusters = nclusters)
kmeans.fit(data)

After running the `.fit()` method, the `kmeans` object contains the results of the clustering process in the member variables `.labels_` and `.cluster_centers_` (note the US spelling of "centers"). We can access these variables and view the results of the clustering.

In [None]:
color = ['r','g','b']
for s in range(nclusters):
    plt.scatter(data[(kmeans.labels_ == s),0],data[(kmeans.labels_ == s),1],marker='o', color=color[s])
plt.scatter(kmeans.cluster_centers_[:,0],kmeans.cluster_centers_[:,1],marker='^', color='k', s=200)
plt.xlabel('Property A')
plt.ylabel('Property B')
plt.gca().set_aspect('equal')

The results are equivalent to the ones we obtained above "by hand" and I think you will agree that this process is pretty painless!
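A fitted `KMeans` object can do a little more than report labels and centres. As a small sketch on made-up points (not the notebook's data file), the code below uses `.predict()` to assign new points to the existing clusters and reads the `.inertia_` attribute, the within-cluster sum of squared distances. Setting `n_init` and `random_state` explicitly is my own choice here, made only to keep the example reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: two well-separated groups of three points
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Cluster numbering is arbitrary, so compare labels rather than relying on 0 or 1
new_points = np.array([[0.5, 0.5], [8.5, 8.5]])
new_labels = model.predict(new_points)
print(new_labels[0] == model.labels_[0])  # True: same cluster as the first group
print(new_labels[1] == model.labels_[3])  # True: same cluster as the second group
print(model.inertia_)                     # within-cluster sum of squared distances
```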

### A trickier clustering example - using alternative algorithms

Above we used a carefully chosen data set that was amenable to clustering by the k-means method. Not all data will be so well behaved. To demonstrate this, let's take a look at some more data:

In [None]:
data = np.loadtxt('Data/moonclusterdata.txt', delimiter=',')
plt.scatter(data[:,0],data[:,1],marker='o', color='r')
plt.xlabel('Property A')
plt.ylabel('Property B')

This is an example of a classic data set used to test machine learning algorithms. The non-linearity in the dividing line between the two clusters makes this rather tricky to work with.
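If you want to experiment with your own version of this kind of data, scikit-learn can generate two interleaved half-moons with `make_moons`. This is only a sketch: I am assuming the data in the file has this general shape, and the sample count and noise level below are illustrative guesses rather than the values used to produce it.

```python
import numpy as np
from sklearn.datasets import make_moons

# n_samples and noise are illustrative guesses, not the settings behind the notebook's file
moons, moon_classes = make_moons(n_samples=200, noise=0.05, random_state=0)
print(moons.shape)              # (200, 2)
print(np.unique(moon_classes))  # [0 1]
```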

#### <span style="color: red"> Task 2:</span> Try out the k-means method (using scikit-learn or our own `classify` and `getcentres` functions) to classify the data into two clusters.
Plot the results of the clustering to see how well the algorithm has performed.

You should find that the k-means clustering struggles to deal with the curved clusters, as below:

![Image of expected classification of moon data](Images/MoonKmeans.jpg "Title")

#### An alternative clustering algorithm: DBSCAN
Fortunately scikit-learn includes many different algorithms to try out and they are all easily accessible. To show how easy it is to use scikit-learn, I have approached this problem myself from a position of complete ignorance about how to proceed and using only information obtained through an internet search engine to guide me. The below cell contains the code that I came up with. It uses the unsupervised learning algorithm DBSCAN to correctly cluster this trickier data set. I timed myself and managed to get this working from scratch in less than ten minutes. I think this demonstrates the power of python. The process that I went through was as follows:

1. Google the phrase: scikit learn clustering unsupervised.
2. Click on the link to the scikit-learn documentation page on unsupervised learning (this was the first link in my case).
3. Click on the link "2.3. Clustering".
4. Look at the overview figure for something that looks like it would work on my data: DBSCAN looks like a good bet.
5. Have a quick read of the documentation and the example code: it looks like I will need to tune two parameters: epsilon and min_samples. I will also need a way to extract the number of clusters for plotting purposes.
6. Try out some code.
7. Tune the values of epsilon and min_samples to get the clustering correct.

It's as simple as that. Try out the resulting code below:

In [None]:
from sklearn.cluster import DBSCAN
data = np.loadtxt('Data/moonclusterdata.txt', delimiter=',')
db = DBSCAN(eps=0.4, min_samples=2).fit(data)
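labels = db.labels_
unique_labels = sorted(set(labels)) # DBSCAN gives noise points the label -1
nclusters = len(unique_labels) - (1 if -1 in unique_labels else 0) # Number of clusters, not counting noise
colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
for label, c in zip(unique_labels, colors):
    plt.scatter(data[(labels == label),0],data[(labels == label),1],marker='o', color=c)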
plt.xlabel('Property A')
plt.ylabel('Property B')

#### <span style="color: red"> Task 3:</span>  Try out some alternative values of epsilon and min_samples and observe the results.

You will find that the DBSCAN algorithm needs careful tuning, but is at least capable of coping with the curved boundary between clusters.
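To get a feel for this tuning, it can help to count how many clusters and noise points DBSCAN finds for a few settings. The sketch below does this on synthetic moon-shaped data from `make_moons` (an assumption on my part; it is not the notebook's data file), and the values of `eps` and `min_samples` are arbitrary choices for illustration. DBSCAN gives noise points the label -1, so they are counted separately.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Synthetic stand-in for the moon data; the settings are illustrative
points, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

results = {}
for eps in (0.05, 0.2, 0.5):
    labels = DBSCAN(eps=eps, min_samples=5).fit(points).labels_
    n_found = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks noise, not a cluster
    n_noise = int(np.sum(labels == -1))
    results[eps] = (n_found, n_noise)
    print(f"eps={eps}: {n_found} clusters, {n_noise} noise points")
```

In general, a very small `eps` tends to break the data into many fragments or noise, while a very large one tends to merge everything into a single cluster.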

## Supervised learning

In contrast to the unsupervised learning above, *"Supervised learning is the machine learning task of inferring a function from labeled training data"* (Wikipedia). So in contrast to unsupervised learning we now have a case in which we know a little more about our dataset. Rather than just looking for patterns within the data we are looking specifically for patterns which correlate with some known property of the data. Let's explore some examples.

### Learning preassigned classes: a support vector machine 
As an example of supervised learning we will take another look at our moon-shaped data, but this time we will assume that the data represent samples drawn from two different types (perhaps two types of materials) and that we know to which type each point belongs. The below code loads in the data, along with a list of the type for each point, and plots the result:

In [None]:
trainingdata = np.loadtxt('Data/moonclusterdata.txt', delimiter=',')
trainingclasses = np.loadtxt('Data/moonclusterclasses.txt', delimiter=',')

nclusters = 2
color = ['r','g','b']
for s in range(nclusters):
    plt.scatter(trainingdata[(trainingclasses == s),0],trainingdata[(trainingclasses == s),1],marker='o', color=color[s])
plt.xlabel('Property A')
plt.ylabel('Property B')

We are going to use this data, which consists of pairs of input properties along with known classes, to train a kind of classifier known as a *support vector machine* (SVM). This data is referred to as *training data*. 

To train an SVM on our data all we need to do is run the following code:

In [None]:
from sklearn.svm import SVC
svm = SVC(kernel='rbf', C=100, gamma=1.0).fit(trainingdata,trainingclasses)

That's all there is to it! (although note that I did need to tune the values of `C` and `gamma`: I'll come back to that in a minute). You might have noticed that in the above example I am creating the `svm` object and calling the fitting method in a single line: this is a very compact syntax. 

Now that we have trained our SVM, we can use it to *predict* the classes of some further data. In this case, I have prepared another set of data for which we know the class of each point. This is a set of *test data*. The data was not used in training the SVM, so we can see how well the SVM does predicting the classes of these new samples and visualise the result:

In [None]:
testdata = np.loadtxt('Data/moonclustertestdata.txt', delimiter=',')
predictedclasses = svm.predict(testdata)

for s in range(nclusters):
    plt.scatter(testdata[(predictedclasses == s),0],testdata[(predictedclasses == s),1],marker='o', color=color[s])
plt.xlabel('Property A')
plt.ylabel('Property B')

This looks like a reasonable job, but let's check by comparing the predictions with the actual classes of the test data (which I saved at the time of generating the data):

In [None]:
testclasses = np.loadtxt('Data/moonclustertestclasses.txt', delimiter=',')
print('Percentage correctly classified: ' + str(np.sum(testclasses==predictedclasses)/np.size(testclasses)*100) + '%')
print(testclasses==predictedclasses)

We can see that the 4th point has been classified incorrectly. Let's use a plot to see which point this is:

In [None]:
for s in range(nclusters):
    plt.scatter(testdata[(predictedclasses == s),0],testdata[(predictedclasses == s),1],marker='o', color=color[s])
plt.scatter(testdata[(predictedclasses != testclasses),0],testdata[(predictedclasses != testclasses),1], marker='^', color='k', s=200)
plt.xlabel('Property A')
plt.ylabel('Property B')

The incorrectly classified point is indicated by the triangular marker. To work out what has gone wrong we can visualise the *decision boundary* that is used to determine the classes of different points in our two-dimensional property space. The code below does this. There is no need to worry too much about the details (creating the contour is a bit fiddly!).

In [None]:
X1, X2 = np.meshgrid(np.linspace(-1.5, 2.5, 100), np.linspace(-1.0, 1.5, 100))
X_grid = np.c_[X1.ravel(), X2.ravel()]
decision_values = svm.decision_function(X_grid)
fill_levels = [decision_values.min()] + [0] + [
            decision_values.max()]

for s in range(nclusters):
    plt.scatter(testdata[(testclasses == s),0],testdata[(testclasses == s),1],marker='o', color=color[s])
plt.xlabel('Property A')
plt.ylabel('Property B')
ax = plt.gca()
ax.contour(X1, X2, decision_values.reshape(X1.shape), levels=[0])

In the above plot I have coloured the points according to the correct classification. You will see that the problematic point falls on the wrong side of the decision boundary. This is always a possibility and highlights the importance of using a test set of data to measure the performance of a trained classifier. Another reason for using a test set is the issue of *over-fitting* which we will return to shortly when we look at neural networks.
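If you only have a single labelled data set, you can hold back part of it to act as the test set. The sketch below shows one way to do this with scikit-learn's `train_test_split`, using synthetic moon-shaped data from `make_moons` as a stand-in (an assumption; it is not the notebook's data). The `.score()` method reports the fraction of points classified correctly.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data; the noise level and split fraction are illustrative
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = SVC(kernel='rbf', C=100, gamma=1.0).fit(X_train, y_train)
print('Training accuracy:', clf.score(X_train, y_train))
print('Test accuracy:', clf.score(X_test, y_test))
```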

#### Tuning the SVM
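I mentioned above that I needed to tune the parameters of the SVM in order to achieve good results. The SVM takes two parameters, `C` and `gamma`, which control the shape and smoothness of the decision boundary. Roughly speaking, `C` sets how heavily misclassified training points are penalised (a larger `C` tolerates fewer errors), while `gamma` sets how far the influence of each training point reaches (a larger `gamma` means a shorter reach and a more intricate boundary).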
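Tuning by hand is a good way to build intuition, but the search can also be automated. As a sketch, scikit-learn's `GridSearchCV` tries every combination of the values you list and scores each one by cross-validation. The data here is synthetic (from `make_moons`) and the grid values are arbitrary choices of mine, not recommendations.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data; the grid values below are illustrative only
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)
param_grid = {'C': [1, 10, 100], 'gamma': [0.1, 1.0, 10.0]}

search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5).fit(X, y)
print(search.best_params_)  # the combination with the best cross-validated accuracy
print(search.best_score_)   # its mean accuracy over the 5 folds
```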

#### <span style="color: red"> Task 4:</span>   Explore the effect on the decision boundary of changing the parameters of the SVM. I have included the necessary code below.

In [None]:
# CHANGE THE VALUES IN THE TWO LINES BELOW:
C = 100
gamma = 1.0

svm = SVC(kernel='rbf', C=C, gamma=gamma).fit(trainingdata,trainingclasses)

X1, X2 = np.meshgrid(np.linspace(-1.5, 2.5, 100), np.linspace(-1.0, 1.5, 100))
X_grid = np.c_[X1.ravel(), X2.ravel()]
decision_values = svm.decision_function(X_grid)
fill_levels = [decision_values.min()] + [0] + [
            decision_values.max()]

for s in range(nclusters):
    plt.scatter(trainingdata[(trainingclasses == s),0],trainingdata[(trainingclasses == s),1],marker='o', color=color[s])
ax = plt.gca()
ax.contour(X1, X2, decision_values.reshape(X1.shape), levels=[0])

#### Over-fitting 

You should see that as `gamma` is made larger the shape of the decision boundary follows the data more and more closely. This is not necessarily a good thing. Can you think why? Below I train the SVM on the training data using a very large value of gamma, but plot the decision boundary along with the test data and the predicted classes. This is a case of "over-fitting" of the training data to the point where new cases are frequently mis-classified:


In [None]:
C = 100
gamma = 1000.0

svm = SVC(kernel='rbf', C=C, gamma=gamma).fit(trainingdata,trainingclasses)

X1, X2 = np.meshgrid(np.linspace(-1.5, 2.5, 100), np.linspace(-1.0, 1.5, 100))
X_grid = np.c_[X1.ravel(), X2.ravel()]
decision_values = svm.decision_function(X_grid)
fill_levels = [decision_values.min()] + [0] + [
            decision_values.max()]

predictedclasses = svm.predict(testdata)

for s in range(nclusters):
    plt.scatter(testdata[(predictedclasses == s),0],testdata[(predictedclasses == s),1],marker='o', color=color[s])
plt.xlabel('Property A')
plt.ylabel('Property B')
ax = plt.gca()
ax.contour(X1, X2, decision_values.reshape(X1.shape), levels=[0])
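A simple numerical check for over-fitting is to compare the accuracy on the training data with the accuracy on the test data. The sketch below does this for a moderate and a very large value of `gamma`, again on synthetic data from `make_moons` (an assumption; your numbers for the notebook's data will differ). With a very large `gamma` one would expect the training accuracy to be close to perfect while the test accuracy lags behind.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data, with a fair amount of noise so that the classes overlap
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

scores = {}
for gamma in (1.0, 1000.0):
    clf = SVC(kernel='rbf', C=100, gamma=gamma).fit(X_train, y_train)
    scores[gamma] = (clf.score(X_train, y_train), clf.score(X_test, y_test))
    print(f"gamma={gamma}: train={scores[gamma][0]:.2f}, test={scores[gamma][1]:.2f}")
```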

## An application of machine learning: classifying microstructure

We will now take what we have learned and apply it to a real research problem. The next part of the material for this session is in the second Machine Learning notebook.