# Machine Learning

## The Scikit Learn module

The [scikit learn](https://scikit-learn.org/stable/) module is an open source module dedicated to implementing machine learning methods in Python. It is the tool we will use for the next few weeks to explore these methods and apply them to some data.

There are particular sections of the module dedicated to supervised and unsupervised machine learning tasks (recall that supervised tasks are those where we have labelled training examples, and unsupervised tasks are those where we do not). Here we are going to focus on supervised classification methods: given training data classified into two classes (denoted 1 and 0), we want to use that data to construct a way to classify new data points.

Firstly we are going to need to import some modules and functions to get started (we'll particularly take sub-modules connected to nearest neighbours and decision trees):

In [None]:
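import pandas as pd
import sklearn as skl
import sklearn.metrics  # makes sure skl.metrics is loaded
import sklearn.model_selection  # makes sure skl.model_selection is loaded
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.neighbors as nei
import sklearn.tree as tree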

## First KNN example

The first example we will work with is a synthetic dataset. It has just two features, which means we can plot the data points within the feature space easily and get a handle on what's going on.

Our dataset comes split in two parts `synth.tr.csv` and `synth.te.csv`, the first one containing our training data and the second our test data. Both therefore have labels, but recall we don't use the labels in the test set to build our model, just to compare with the predictions of our classifier to see how well we do.

First let's load in the training dataset and have a look at it. Here is a scatter plot.

In [None]:
class_data = pd.read_csv('synth.tr.csv')
print(class_data.head())

sns.scatterplot(data=class_data, x='xs', y='ys', hue='yc')
plt.title('Scatter plot - coloured by class')
plt.show()

Next we want to make a K nearest neighbour classifier based on this training data. To do that we create a classifier model, specifying K and the distance metric we will use (here K=3 and the ordinary Euclidean distance), and then fit it to the training data.

We can predict a value of the class label for a made up data point - here we try the point (0,1) which looks like it should definitely be a 1 since it is in the middle of lots of orange points in the plot above.

In [None]:
knn_class = nei.KNeighborsClassifier(n_neighbors=3, metric='euclidean')
knn_class.fit(class_data[['xs','ys']],class_data['yc'])

print(knn_class.predict([[0,1]]))

Don't worry about the warning we get from the last line. It just tells us that the classifier was fitted on a dataframe with column names, whereas the new point was passed in as a plain list without names, so it is warning us of the mismatch.
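If we would rather avoid the warning, one option is to wrap the new point in a dataframe with the same column names as the training features. Here is a minimal sketch of that idea. The toy data and the names `toy_train`, `toy_knn` and `new_point` are made up for illustration and are not part of the dataset above.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Toy training data (made up), using the same column names as above
toy_train = pd.DataFrame({
    'xs': [0.0, 0.1, 0.2, 1.0, 1.1, 0.9],
    'ys': [0.0, 0.2, 0.1, 1.0, 0.9, 1.1],
    'yc': [0, 0, 0, 1, 1, 1],
})

toy_knn = KNeighborsClassifier(n_neighbors=3, metric='euclidean')
toy_knn.fit(toy_train[['xs', 'ys']], toy_train['yc'])

# Wrap the new point in a dataframe whose columns match the training features
new_point = pd.DataFrame({'xs': [1.0], 'ys': [0.8]})
toy_pred = toy_knn.predict(new_point)
print(toy_pred)
```

The prediction itself is the same whichever way the point is passed in. Only the column names travelling with it are different.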

Note that we have used two methods associated with our classifier (reached using the . notation). The first is the `fit()` method, which we pass our training data to. Its arguments are the data points (here simply two columns of our dataframe) followed by the associated labels (here contained in another column of our dataframe). The second method, `predict()`, applies our classifier to new data: we give it a data point and it returns the class that the classifier assigns.

So far so good then. We have a classifier, and we can test it out on a data point and it gives a classification. The next step is to look at our test data and see if we can begin to quantify how well our classifier has done.

First we can load in and have a look at the test data:

In [None]:
test_data = pd.read_csv('synth.te.csv')

sns.scatterplot(data=test_data, x="xs", y="ys", color='green')
plt.title("Scatter plot of test data")
plt.show()

These are all of the data points in the test dataset. We now classify them using our classifier above. For each point here it will find the nearest three points in our training dataset and let them vote on the class. With two classes and an odd number of neighbours (here 3) the vote cannot end in a draw.
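To make the voting idea concrete, here is a small from-scratch sketch in plain Python. It is only an illustration of the principle, not how scikit learn implements it internally, and the function `knn_vote` and the toy points are made up for this example.

```python
import math
from collections import Counter

def knn_vote(train_points, train_labels, query, k=3):
    """Classify `query` by a majority vote of its k nearest training points."""
    # Order the training points by Euclidean distance to the query point
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    # Collect the labels of the k closest points and let them vote
    nearest_labels = [train_labels[i] for i in order[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up training data: two points of class 0 and three of class 1
toy_points = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1), (1.2, 0.8)]
toy_labels = [0, 0, 1, 1, 1]

print(knn_vote(toy_points, toy_labels, (1.0, 0.9)))
print(knn_vote(toy_points, toy_labels, (0.0, 0.1)))
```

For the second query point the three nearest neighbours are two points of class 0 and one of class 1, so class 0 wins the vote by 2 to 1.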

We can use the `predict()` method again here, this time passing in all of the data points within our test dataset. The next line creates a new column of the dataframe which we call 'class'.

In [None]:
test_data['class'] = knn_class.predict(test_data[['xs','ys']])

We can look at the results of this classification. Here is the test dataset coloured according to the predicted class of each point:

In [None]:
sns.scatterplot(data=test_data, x="xs", y="ys",hue='class')
plt.title("Scatter plot - test data coloured by class")
plt.show()

We could also overlay on the plot all of the training data. Here we do that using triangular markers for the training data.

You can look to see the closest 3 triangles to each circle - they are the ones which voted on that classification.

In [None]:
sns.scatterplot(data=class_data, x="xs", y="ys", hue="yc", marker='^', legend=False)
sns.scatterplot(data=test_data, x="xs", y="ys",hue='class')
plt.title("Scatter plot - test data coloured by class")
plt.show()

## Assessing the classifier

Looking at our test dataset we have two columns now which contain the real class of our data points 'yc', and the predicted class 'class'.

In [None]:
print(test_data.head())

Note that we can't see much from just the top of the dataframe. But if we look through the dataframe we can find locations where the values in these two columns don't match:

In [None]:
print(test_data[test_data['yc']!=test_data['class']])

Note we have made both types of error here: places where we predicted a 1 but the real answer was a 0 (a False Positive), and places where we predicted a 0 but the real answer was a 1 (a False Negative).
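As a quick illustration of the four possible outcomes, the sketch below counts them by hand for a short list of labels. The labels are made up for the example and have nothing to do with the test data above.

```python
# Made-up true and predicted labels, for illustration only
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # True Positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # False Positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # False Negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # True Negatives

print('TP:', tp, 'FP:', fp, 'FN:', fn, 'TN:', tn)
```

Every data point falls into exactly one of the four groups, so the counts always add up to the number of points.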

We can summarise all of the True/False Positives/Negatives in a confusion matrix. We can also look at the balanced accuracy and F1 score (look back at the slides to see the definitions of each of these).

In [None]:
print(skl.metrics.confusion_matrix(y_true=test_data['yc'],y_pred=test_data['class']))
print(skl.metrics.balanced_accuracy_score(y_true=test_data['yc'],y_pred=test_data['class']))
print(skl.metrics.f1_score(y_true=test_data['yc'],y_pred=test_data['class']))

We can see we do a pretty good job: the majority of points are correctly classified.
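As a reminder of where the two scores come from, here is a short sketch that computes them directly from the four counts in a confusion matrix. It uses the standard definitions, which should agree with the ones in the slides. The function name `scores_from_counts` and the counts are made up for illustration. For labels 0 and 1, scikit learn lays its confusion matrix out as `[[TN, FP], [FN, TP]]`.

```python
def scores_from_counts(tn, fp, fn, tp):
    """Return (balanced accuracy, F1 score) from confusion matrix counts."""
    recall = tp / (tp + fn)        # fraction of real 1s that we found
    specificity = tn / (tn + fp)   # fraction of real 0s that we found
    precision = tp / (tp + fp)     # fraction of predicted 1s that were right
    balanced_accuracy = (recall + specificity) / 2
    f1 = 2 * precision * recall / (precision + recall)
    return balanced_accuracy, f1

# Made-up counts, for illustration only
bal_acc, f1 = scores_from_counts(tn=45, fp=5, fn=10, tp=40)
print(round(bal_acc, 3), round(f1, 3))
```

Note that the F1 score ignores the True Negatives entirely, whereas balanced accuracy gives both classes equal weight.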

Let's compare to another model to see if we can do better - let's try KNN for the case K=5. Here we fit that model, predict the classes on the test dataset and compute our scores again just as before:

In [None]:
knn_class = nei.KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_class.fit(class_data[['xs','ys']],class_data['yc'])
test_data['class'] = knn_class.predict(test_data[['xs','ys']])

print(skl.metrics.confusion_matrix(y_true=test_data['yc'],y_pred=test_data['class']))
print(skl.metrics.balanced_accuracy_score(y_true=test_data['yc'],y_pred=test_data['class']))
print(skl.metrics.f1_score(y_true=test_data['yc'],y_pred=test_data['class']))

So by all of these metrics the K=5 classifier appears to be a bit better than the K=3. 
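The same comparison can be repeated for several values of K in a loop. Below is a sketch of that pattern. So that it runs on its own, it uses synthetic data from `make_blobs` as a stand-in for the `synth` files, so the numbers it prints say nothing about our dataset.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import balanced_accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-feature, two-class data (a stand-in for the synth files)
X, y = make_blobs(n_samples=300, centers=2, cluster_std=2.5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)

scores = {}
for k in [1, 3, 5, 7, 9]:
    # Fit a fresh classifier for each K and score it on the test portion
    model = KNeighborsClassifier(n_neighbors=k, metric='euclidean')
    model.fit(X_tr, y_tr)
    y_hat = model.predict(X_te)
    scores[k] = (balanced_accuracy_score(y_te, y_hat), f1_score(y_te, y_hat))

for k, (bal_acc_k, f1_k) in scores.items():
    print(f'K={k}: balanced accuracy={bal_acc_k:.3f}, F1={f1_k:.3f}')
```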

## Second example

The second data example is derived from the paper:
Smith, J. W., Everhart, J. E., Dickson, W. C., Knowler, W. C. and Johannes, R. S. (1988) Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications in Medical Care (Washington, 1988), ed. R. A. Greenes, pp. 261–265. Los Alamitos, CA: IEEE Computer Society Press.

This data contains measurements of various health indicators for use as a predictor for whether the study participants would go on to develop diabetes. The final column gives the true outcome for what happened to each participant.

We can load in the data. Note there are more columns in this data - so the feature space is now 8-dimensional, which means we cannot simply visualise it all now. We will just look at a cross-section of the feature space and show the sixth column against the second column to get a small sense of how the feature space looks.

In [None]:
diabetes_data = pd.read_csv('pima-indians-diabetes.csv')
print(diabetes_data.head())
print(diabetes_data.keys())
sns.scatterplot(data=diabetes_data, x='glu', y='bmi', hue='class')
plt.title("Scatter plot - coloured by class")
plt.show()

We can see some separation of the different classes, but of course we are missing the other directions in the feature space in our plot, so we can't tell whether those which merge together in the middle in our plot might separate out more if we look at other planes of the data.

In this data set all of the labelled data is clumped in together in one file. So we need to split it ourselves into a training set and a test set. We can use a function in scikit learn to do this for us. 

Note the form of the following line. The function outputs the data points for the train and test portions (usually called X), and separately the classes for the train and test data points (usually called y). As input we give the columns of data which make up our feature space, and the column with our labels. We also need to specify what fraction of the data we want to be used as a test set (usually we want to retain most of the data for training). Here we specify roughly a third of the data to be our test set. The split is made at random, so the results below will change a little each time the cell is run.

In [None]:
X_train, X_test, y_train, y_test = skl.model_selection.train_test_split(
    diabetes_data[['npreg', 'glu', 'bp', 'skin', 'ins', 'bmi', 'ped', 'age']],
    diabetes_data['class'],
    test_size=0.33,
)
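Because `train_test_split` shuffles the data at random, two runs will generally give two different splits. If a repeatable split is wanted, the function accepts a `random_state` argument. The sketch below shows this on a tiny made-up array rather than the diabetes data. All of the variable names and the seed value 42 are arbitrary choices for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Twelve made-up data points with two features, and alternating class labels
X_toy = np.arange(24).reshape(12, 2)
y_toy = np.array([0, 1] * 6)

# random_state fixes the shuffle, so both calls produce the same split
split_a = train_test_split(X_toy, y_toy, test_size=0.33, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.33, random_state=42)

X_tr_a, X_te_a, y_tr_a, y_te_a = split_a
X_tr_b, X_te_b, y_tr_b, y_te_b = split_b

print(X_tr_a.shape, X_te_a.shape)
print(np.array_equal(X_te_a, X_te_b))
```

Fixing the seed is handy when comparing classifiers, since every model then sees exactly the same training and test points.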

Next we can use the `fit()` method to fit our KNN classifier for K=3. Then we use the `predict()` method on our test data to obtain predicted classes for each data point in the test set.

In [None]:
knn_class = nei.KNeighborsClassifier(n_neighbors=3, metric='euclidean')
knn_class.fit(X_train,y_train)
y_predict=knn_class.predict(X_test)

Now that we have both the real classes (contained in `y_test`) and the predicted classes (contained in `y_predict`), we can use these to obtain our various metrics to determine how well our classifier performed:

In [None]:
print('Confusion matrix: ',skl.metrics.confusion_matrix(y_test,y_predict))
print('Precision: ',skl.metrics.precision_score(y_test,y_predict))
print('Recall: ',skl.metrics.recall_score(y_test,y_predict))
print('Balanced accuracy: ',skl.metrics.balanced_accuracy_score(y_test,y_predict))
print('F1 score: ',skl.metrics.f1_score(y_true=y_test,y_pred=y_predict))

We can also look at the difference between the true and predicted classes in our slice of the feature space. Here is what they look like - you can see at least a few points which are different across the two plots:

In [None]:
sns.scatterplot(data=X_test, x='glu', y='bmi', hue=y_test)
plt.title("Scatter plot - coloured by class - true class")
plt.show()
sns.scatterplot(data=X_test, x='glu', y='bmi', hue=y_predict)
plt.title("Scatter plot - coloured by class - predicted class")
plt.show()

## Decision Trees

We can also fit a decision tree classifier to the same data. The nice thing about the scikit learn module is that the `fit()` and `predict()` methods work in a reasonably standard way so fitting a new classifier model is pretty straightforward and follows a similar pattern.
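One consequence of this shared interface is that different classifiers can be handled by exactly the same code. The sketch below loops over two models and scores each in the same way. It runs on synthetic data from `make_classification` (a stand-in with 8 features, not the diabetes data), and the names used are made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic 8-feature, two-class data (a stand-in, not the diabetes data)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0)

models = {
    'knn (K=3)': KNeighborsClassifier(n_neighbors=3, metric='euclidean'),
    'decision tree': DecisionTreeClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    # The same two calls work for both kinds of classifier
    model.fit(Xd_train, yd_train)
    results[name] = balanced_accuracy_score(yd_test, model.predict(Xd_test))
    print(name, round(results[name], 3))
```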

To do it we make a new decision tree classifier, use the `fit()` method with our training data, and use the `predict()` method with our test data - all in pretty similar fashion to the way we did this for the KNN case above:

In [None]:
dt_class = tree.DecisionTreeClassifier()
dt_class.fit(X_train,y_train)
y_predict = dt_class.predict(X_test)

Now that we have the true and predicted classes again, we can go ahead and find all of the metrics which detail how well the classifier has worked:

In [None]:
print('Confusion matrix: ',skl.metrics.confusion_matrix(y_test,y_predict))
print('Precision: ',skl.metrics.precision_score(y_test,y_predict))
print('Recall: ',skl.metrics.recall_score(y_test,y_predict))
print('Balanced accuracy: ',skl.metrics.balanced_accuracy_score(y_test,y_predict))
print('F1 score: ',skl.metrics.f1_score(y_true=y_test,y_pred=y_predict))

Comparing with the KNN classifier above, it looks like the decision tree does generally worse on this data, although the exact numbers depend on the random train/test split.

Once again we can plot our cross section of the data showing the real and predicted classes - again we can see some differences as we would expect:

In [None]:
sns.scatterplot(data=X_test, x='glu', y='bmi', hue=y_test)
plt.title("Scatter plot - coloured by class - true class")
plt.show()
sns.scatterplot(data=X_test, x='glu', y='bmi', hue=y_predict)
plt.title("Scatter plot - coloured by class - predicted class")
plt.show()

## Exercises

Here is a new dataset that contains 4 columns in the relevant dataframe. The first 3 columns represent the features, the fourth is the real class of each data point.

In [None]:
operation_data = pd.read_csv('haberman.csv')

print(operation_data.head())
print(operation_data.keys())

1) Split the data up into a training and test dataset, use 50% of the data in each set.

2) Fit a decision tree classifier to the data and output the balanced accuracy and F1 score.

3) Fit the KNN classifier with K=3 to the data and output the balanced accuracy and F1 score.

4) Which method did best? What's your justification for that conclusion?