# A Simple Binary Classification Example

*Supervised* machine learning techniques involve training a model to operate on a set of *features* and predict a *label* using a dataset that includes some already-known label values. The training process *fits* the features to the known labels to define a general function that can then be applied to new features to predict labels that are not yet known. You can think of this function like this, in which ***y*** represents the label we want to predict and ***x*** represents the features the model uses to predict it.

$$y = f(x)$$

*Classification* is a form of supervised machine learning in which you train a model to use the features (the ***x*** in our function) to predict a label (***y***) that represents one of a number of possible classes. The simplest form of classification is *binary* classification, in which the label is 0 or 1, representing one of two classes; for example, "True" or "False"; "Internal" or "External"; "Profitable" or "Non-Profitable"; and so on. We'll focus on binary classification in this example, but many of the same principles apply to *multiclass* classification in which there are more than two possible classes.

## Preparing Data for Model Training
Let's start by looking at some data. Run the following cell to load a CSV file into a **pandas** dataframe:

In [None]:
import pandas as pd

# load the training dataset
diabetes = pd.read_csv('data/diabetes.csv')
diabetes

This data consists of diagnostic information about some patients who have been tested for diabetes. Scroll to the right if necessary, and note that the final column in the dataset (**Diabetic**) contains the value ***0*** for patients who tested negative for diabetes, and ***1*** for patients who tested positive. This is the label that we will train our model to predict; most of the other columns (**Pregnancies**, **PlasmaGlucose**, **DiastolicBloodPressure**, and so on) are the features we will use to predict the **Diabetic** label. Let's separate those - we'll call the features ***X*** and the label ***Y***:

In [None]:
# Separate features and labels
X, Y = diabetes[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']].values, diabetes['Diabetic'].values

Our dataset includes known values for the label, so we can use this to train a classifier so that it finds a statistical relationship between the features and the label value; but how will we know if our model is any good? How do we know it will predict correctly when we use it with new data that it wasn't trained with? Well, we can take advantage of the fact we have a large dataset with known label values, use only some of it to train the model, and hold back some to test the trained model - enabling us to compare the predicted labels with the already known labels in the test set.

In Python, the **scikit-learn** package contains a large number of functions we can use to build a machine learning model - including a **train_test_split** function that ensures we get a statistically random split of training and test data. We'll use that to split the data into 65% for training and hold back 35% for testing.

In [None]:
from sklearn.model_selection import train_test_split

# Split data 65%-35% into training set and test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.35, random_state=0)

# shape[0] is the number of rows (cases); size would count every individual value
print('Training Set: %d rows, Test Set: %d rows\n' % (X_train.shape[0], X_test.shape[0]))

print("Sample of features and labels:")
# Take a look at the first 10 training features and corresponding labels
for n in range(10):
    print(X_train[n], Y_train[n])

In the output from the previous cell, you can see the number of training and test cases, and a sample of the first ten sets of features (in \[square brackets\]) with the corresponding known label.
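
When one class is much rarer than the other, a purely random split can leave the training and test sets with slightly different class ratios. As an optional variation (not something the cells above do), **train_test_split** accepts a `stratify` argument that keeps the ratios aligned. The sketch below uses made-up data, and the names `X_demo` and `y_demo` are placeholders rather than part of the diabetes dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration):
# 1000 rows, 8 features, and 10% positive labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 8))
y_demo = np.zeros(1000, dtype=int)
y_demo[:100] = 1

# stratify=y_demo asks for the same class ratio in both subsets
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.35, random_state=0, stratify=y_demo
)

# shape[0] counts rows (cases), which is what we want to report
print('Rows in training set:', Xd_train.shape[0])
print('Rows in test set:', Xd_test.shape[0])
print('Positive rate (train):', yd_train.mean())
print('Positive rate (test):', yd_test.mean())
```

Both positive rates should come out close to the overall rate of 0.1.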

## Training the Model
OK, now we're ready to train our model by fitting the training features (**X_train**) to the training labels (**Y_train**). There are various algorithms we can use to train the model. In this example, we'll use *Logistic Regression*, which is a well-established algorithm for classification. In addition to the training features and labels, we'll need to set a *regularization* parameter. This is used to counteract any bias in the sample, and help the model generalize well by avoiding *overfitting* the model to the training data.

In [None]:
# Train the model
from sklearn.linear_model import LogisticRegression

# Set regularization rate
reg = 0.01

# train a logistic regression model on the training set
clf1 = LogisticRegression(C=1/reg).fit(X_train, Y_train)
print (clf1)
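
Note that the cell above passes `C=1/reg`. In scikit-learn, `C` is the *inverse* of the regularization strength, so a small regularization rate means a large `C` and a lightly constrained model. As a rough sketch of the effect, using synthetic data from `make_classification` rather than the diabetes data, we can compare the size of the learned coefficients for a few rates (the specific rates here are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only to illustrate the effect of regularization
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

coef_norms = {}
for reg in (0.01, 1.0, 100.0):
    # C is the inverse of the regularization rate
    model = LogisticRegression(C=1/reg, max_iter=1000).fit(X_demo, y_demo)
    coef_norms[reg] = np.linalg.norm(model.coef_)
    print('reg = %6.2f  C = %6.2f  coefficient norm = %.4f' % (reg, 1/reg, coef_norms[reg]))
```

Stronger regularization (a larger rate, so a smaller `C`) pulls the coefficients toward zero, which is how it discourages the model from fitting the training data too closely.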

## Testing and Evaluating the Model
Now that we've trained the model using the training data, we can use the test data we held back to evaluate how well it predicts. Again, **scikit-learn** can help us do this. Let's start by using the model to predict labels for our test set, and compare the predicted labels to the known labels:

In [None]:
predictions = clf1.predict(X_test)
print('Predicted labels: ', predictions)
print('Actual labels: ', Y_test)

The arrays of labels are too long to be displayed in the notebook output, so we can only compare a few values. Even if we printed out all of the predicted and actual labels, there are too many of them to make this a sensible way to evaluate the model. Fortunately, **scikit-learn** has a few more tricks up its sleeve, and it provides some metrics that we can use to evaluate the model.

The most obvious thing you might want to do is to check the *accuracy* of the predictions - in simple terms, what proportion of the labels did the model predict correctly?

In [None]:
from sklearn import metrics
from sklearn.metrics import accuracy_score

print('Accuracy: ', accuracy_score(Y_test, predictions))

The accuracy is returned as a decimal value - a value of 1.0 would mean that the model got 100% of the predictions right; while an accuracy of 0.0 is, well, pretty useless!

Accuracy seems like a sensible metric to evaluate (and to a certain extent it is), but you need to be careful about drawing too many conclusions from the accuracy of a classifier. Remember that it's simply a measure of how many cases were predicted correctly. Suppose only 3% of the population is diabetic. You could create a classifier that always just predicts 0, and it would be 97% accurate - but not terribly helpful in identifying patients with diabetes!
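
To make that concrete, here is a small sketch with made-up labels: 1000 hypothetical patients, 30 of whom (3%) are diabetic, and a "classifier" that always predicts 0.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical population: 1000 patients, 3% diabetic
y_true = np.zeros(1000, dtype=int)
y_true[:30] = 1

# A useless "classifier" that always predicts 0 (not diabetic)
y_pred = np.zeros(1000, dtype=int)

baseline_accuracy = accuracy_score(y_true, y_pred)
baseline_recall = recall_score(y_true, y_pred)

print('Accuracy:', baseline_accuracy)
print('Recall:', baseline_recall)
```

The accuracy looks impressive, but the recall shows that not a single diabetic patient was identified.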

Fortunately, there are some other metrics that reveal a little more about how our model is performing.

In [None]:
from sklearn.metrics import classification_report

print(classification_report(Y_test, predictions))

The classification report includes the following metrics:
* *Precision*: The proportion of *positive* (1) predictions that were in fact positive.
* *Recall*: The proportion of actual positive cases that the classifier correctly identified.
* *F1-Score*: The harmonic mean of precision and recall, which is only high when both of them are high.
* *Support*: The number of cases of each class in the test set.

The precision and recall metrics are derived from four core metrics:
* *True Positives*: The predicted label and the actual label are both 1.
* *False Positives*: The predicted label is 1, but the actual label is 0.
* *False Negatives*: The predicted label is 0, but the actual label is 1.
* *True Negatives*: The predicted label and the actual label are both 0.

These metrics are generally shown together as a *confusion matrix*, which takes the following form:

<table>
    <tr>
        <td>TN</td><td>FP</td>
    </tr>
    <tr>
        <td>FN</td><td>TP</td>
    </tr>
</table>

In Python, you can use the **sklearn.metrics.confusion_matrix** function to find these values for a trained classifier:

In [None]:
from sklearn.metrics import confusion_matrix

# Print the confusion matrix
tn, fp, fn, tp = confusion_matrix(Y_test, predictions).ravel()
print (tn,fp,'\n',fn,tp,)
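
The four counts are all we need to work out precision, recall, and F1 by hand. The sketch below uses hypothetical counts (not results from the diabetes model) and a helper function, `precision_recall_f1`, that is defined here purely for illustration. It then rebuilds matching label arrays to check the results against scikit-learn's own functions:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts.

    Assumes tp > 0, so none of the denominators is zero.
    """
    precision = tp / (tp + fp)   # of the predicted positives, how many were right?
    recall = tp / (tp + fn)      # of the actual positives, how many were found?
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical counts, chosen only for illustration
tn, fp, fn, tp = 50, 10, 5, 35
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print('Precision: %.4f, Recall: %.4f, F1: %.4f' % (precision, recall, f1))

# Rebuild label arrays that produce exactly these counts, and compare
y_true = [0] * tn + [0] * fp + [1] * fn + [1] * tp
y_pred = [0] * tn + [1] * fp + [0] * fn + [1] * tp
print('scikit-learn: %.4f, %.4f, %.4f' % (
    precision_score(y_true, y_pred),
    recall_score(y_true, y_pred),
    f1_score(y_true, y_pred)))
```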

Up until now, we've considered the predictions from the model as being either a 1 or a 0. Actually, things are a little more complex than that. Statistical machine learning algorithms, like logistic regression, are based on *probability*; so what actually gets predicted by a binary classifier is the probability that the label is true (**P(y)**) and the probability that the label is false (1 - **P(y)**). A threshold value of 0.5 is used to decide whether the predicted label is a 1 (*P(y) > 0.5*) or a 0 (*P(y) <= 0.5*). You can use the **predict_proba** method to see the probability pairs for each case:

In [None]:
Y_scores = clf1.predict_proba(X_test)
print(Y_scores)
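
The second column of that output is the probability of class 1, so we can apply a threshold of our own instead of relying on the default. The sketch below is an illustration on synthetic data (again from `make_classification`, and scored on the same data the model was trained on, purely to keep the example short); the thresholds 0.3, 0.5, and 0.7 are arbitrary choices:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Synthetic data and model, used only for illustration
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Column 1 holds the predicted probability of the positive class
positive_scores = model.predict_proba(X_demo)[:, 1]

positives_by_threshold = {}
for threshold in (0.3, 0.5, 0.7):
    # Label a case 1 only if its probability exceeds the threshold
    labels = (positive_scores > threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_demo, labels, labels=[0, 1]).ravel()
    positives_by_threshold[threshold] = int(tp + fp)
    print('threshold %.1f -> TN=%d FP=%d FN=%d TP=%d' % (threshold, tn, fp, fn, tp))
```

Raising the threshold can only reduce (or leave unchanged) the number of cases labelled 1, so it tends to trade false positives for false negatives.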

The decision to score a prediction as a 1 or a 0 depends on the threshold to which the predicted probabilities are compared. If we were to change the threshold, it would affect the predictions; and therefore change the metrics in the confusion matrix. A common way to evaluate a classifier is to examine the *true positive rate* (which is another name for recall) and the *false positive rate* for a range of possible thresholds. These rates are then plotted against each other, with one point for each threshold, to form a chart known as a *receiver operating characteristic (ROC) chart*, like this:

In [None]:
# Evaluate the model
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt
%matplotlib inline

# calculate ROC curve
fpr, tpr, thresholds = roc_curve(Y_test, Y_scores[:,1])

# plot ROC curve
fig = plt.figure(figsize=(6, 4), dpi=75)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()



The ROC chart shows the curve of the true and false positive rates for different threshold values between 0 and 1. A perfect classifier would have a curve that goes straight up the left side and straight across the top. The diagonal line across the chart represents the probability of predicting correctly with a 50/50 random prediction; so you obviously want the curve to be higher than that (or your model is no better than simply guessing!).

The area under the curve (AUC) is a value between 0 and 1 that quantifies the overall performance of the model. The closer to 1 this value is, the better the model.

In [None]:
from sklearn.metrics import roc_auc_score

auc = roc_auc_score(Y_test,Y_scores[:,1])
print('AUC: ' + str(auc))

In this case, the ROC curve and its AUC indicate that the model performs better than a random guess.
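
The AUC really is just the area under the plotted curve. As a sketch on synthetic data (not the diabetes model), we can integrate the points returned by **roc_curve** with the **sklearn.metrics.auc** function, which applies the trapezoidal rule, and compare the result with **roc_auc_score**:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic data and model, used only for illustration
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.35, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(Xd_train, yd_train)
scores = model.predict_proba(Xd_test)[:, 1]

# Area computed from the curve points (trapezoidal rule)
fpr, tpr, thresholds = roc_curve(yd_test, scores)
area_from_curve = auc(fpr, tpr)

# Area computed directly from the labels and scores
area_direct = roc_auc_score(yd_test, scores)

print('AUC from curve points:', area_from_curve)
print('AUC from roc_auc_score:', area_direct)
```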

## Persisting and Using a Model
Now that we have a reasonably useful trained model, we can save it for use later:

In [None]:
# Save the trained model
import pickle

print("Exporting the model to data/model.pkl")
# The with block closes the file even if pickling fails
with open('data/model.pkl', 'wb') as f:
    pickle.dump(clf1, f)

Then, when we have some new observations for which the label is unknown, we can load the model and use it to predict values for the unknown label:

In [None]:
# load model from pickle file
print("Importing the model from data/model.pkl")
with open('data/model.pkl', 'rb') as f2:
    clf2 = pickle.load(f2)

# predict on a new sample
X_new = [[2,180,74,24,21,23.9091702,1.488172308,22]]
print ('New sample: {}'.format(X_new))

pred = clf2.predict(X_new)
print('Predicted class is {}'.format(pred))
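
A quick way to convince yourself that a saved model behaves the same as the original is to round-trip it and compare predictions. The sketch below does this in memory with `pickle.dumps` and `pickle.loads` on a model trained on synthetic data, so it doesn't touch the file saved above. Bear in mind that unpickling can execute arbitrary code, so only load pickle files from sources you trust.

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data and model, used only for illustration
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Serialize to bytes and restore, without writing a file
payload = pickle.dumps(model)
restored = pickle.loads(payload)

# The restored model should make exactly the same predictions
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print('Restored model matches original:', same_predictions)
```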

## Learning More

Check out the scikit-learn documentation at http://scikit-learn.org/stable/documentation.html.