# Solutions to Practical 1.2 - Iris Classification

Here we continue working with basic classifiers, this time on a famous external dataset: the Iris dataset, introduced by Fisher.

In [1]:
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

Here we import the Iris dataset. The features $X$ form a 4-column matrix of continuous values: sepal length, sepal width, petal length and petal width. The labels $y$ are integers (0, 1 or 2), each representing the species associated with that row of measurements. You can look up the finer details, along with a full view of the dataset, here: https://en.wikipedia.org/wiki/Iris_flower_data_set.

In [2]:
iris = datasets.load_iris()
X = iris.data
y = iris.target
print("First 5 rows of features: {}".format(X[0:5]))
print("First 5 rows of labels: {}".format(y[0:5]))

First 5 rows of features: [[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]]
First 5 rows of labels: [0 0 0 0 0]
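To see which measurement each column holds and which species each integer label stands for, the object returned by load_iris() also carries the names. The following is a minimal sketch; the variable `first_species` is introduced here purely for illustration.

```python
from sklearn import datasets

iris = datasets.load_iris()

# Column names, in the same order as the columns of iris.data
print(iris.feature_names)

# Species names; integer label i in iris.target corresponds to target_names[i]
print(iris.target_names)

# Map the first five integer labels back to species names
first_species = [str(iris.target_names[label]) for label in iris.target[:5]]
print(first_species)
```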


And the size of the dataset:

In [3]:
print("Feature size: {}".format(X.shape))
print("Label size: {}".format(y.shape))

Feature size: (150, 4)
Label size: (150,)


## Example 1

In this example we select a few random indices out of the 150, remove those rows from the dataset and use them as our test samples. The rest of the data is given to the classifier to train on.

In [4]:
nIndices = 4
test_idx = np.random.randint(0,150,nIndices)
# drop duplicate indices; this can leave fewer than nIndices test samples
test_idx = np.unique(test_idx)

train_X = np.delete(iris.data, test_idx, axis=0)
train_y = np.delete(iris.target, test_idx, axis=0)

Now we build the test features and labels from the held-out indices.

In [5]:
test_X = iris.data[test_idx]
test_y = iris.target[test_idx]
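Because randint can draw the same index twice, the approach above may end up with fewer test samples than requested. One possible alternative, sketched below, is to sample without replacement and split with a boolean mask. The seed value 0 and the names `n_test` and `is_test` are assumptions made for this illustration, not part of the practical.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
n_samples = iris.data.shape[0]
n_test = 4  # assumed number of held-out samples

# Seeded generator so the split is reproducible (seed 0 is arbitrary)
rng = np.random.default_rng(0)

# Sampling without replacement guarantees n_test distinct indices
test_idx = rng.choice(n_samples, size=n_test, replace=False)

# Boolean mask: True for test rows, False for training rows
is_test = np.zeros(n_samples, dtype=bool)
is_test[test_idx] = True

train_X, train_y = iris.data[~is_test], iris.target[~is_test]
test_X, test_y = iris.data[is_test], iris.target[is_test]

print(train_X.shape, test_X.shape)  # (146, 4) (4, 4)
```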

Now that we have our training and testing sets, let's create and fit our classifier using the training data.

In [6]:
clf = tree.DecisionTreeClassifier()
clf.fit(train_X, train_y)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

Now let's test our classifier on the test data.

In [7]:
print(test_y)
print(clf.predict(test_X))

[0 0 1 2]
[0 0 1 2]


For all the test samples, the classifier predicts the correct species. With only four test samples, however, this is weak evidence of how well it generalises.

## Example 2

In the above example, we generated some random integers to act as the indices selecting which rows to use as test samples, leaving the majority as training data for the classifier. This practice is so common in machine learning that scikit-learn provides a utility function called 'train_test_split', which takes $X$ and $y$ and returns X_train, X_test, y_train and y_test, shuffling the rows randomly before splitting. Split the iris dataset using the train_test_split function.

The package you will need is from sklearn.model_selection.

Look up the function in the scikit-learn documentation and use it to generate your training and testing data. Try a 50/50 split, 75/25 and 90/10 (training/testing, respectively) and see which one has the highest accuracy.

You can test which one has the highest accuracy using a quantitative scoring function called 'accuracy_score', which has already been imported. Give it the y_test data and the output of your classifier's predict() method. It returns the fraction of test samples predicted correctly, where 1.0 means every test sample was predicted correctly.
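To make the scoring concrete, here is a tiny hand-made example. The labels `y_true` and `y_pred` are invented for illustration; three of the four predictions match, so the score is 3/4. By default accuracy_score reports this as a fraction between 0 and 1.

```python
from sklearn.metrics import accuracy_score

# Invented labels for illustration: one of the four predictions is wrong
y_true = [0, 1, 2, 2]
y_pred = [0, 1, 1, 2]

score = accuracy_score(y_true, y_pred)
print(score)  # 0.75
```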

In [8]:
classifier = tree.DecisionTreeClassifier()

# 50/50
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.5)
classifier.fit(X_train, y_train)
accuracy = accuracy_score(y_test, classifier.predict(X_test))
print("Decision Tree Accuracy with 50/50: {}".format(accuracy))

#75/25
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
classifier.fit(X_train, y_train)
accuracy = accuracy_score(y_test, classifier.predict(X_test))
print("Decision Tree Accuracy with 75/25: {}".format(accuracy))

#90/10
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1)
classifier.fit(X_train, y_train)
accuracy = accuracy_score(y_test, classifier.predict(X_test))
print("Decision Tree Accuracy with 90/10: {}".format(accuracy))

Decision Tree Accuracy with 50/50: 0.92
Decision Tree Accuracy with 75/25: 0.8947368421052632
Decision Tree Accuracy with 90/10: 0.8666666666666667


We would generally expect the accuracy to go up the more training samples we have, but in this particular run it did the opposite. Run it a few times to see how much the values change between runs.
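Since every call to train_test_split draws a new random split, a single run is a noisy estimate. One way to see the underlying trend, sketched here under assumptions, is to repeat each split many times and report the mean and spread. The repeat count of 50 and the use of the loop counter as `random_state` are arbitrary choices for this illustration.

```python
import numpy as np
from sklearn import datasets, tree
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

n_repeats = 50  # assumption: arbitrary number of repetitions
mean_accuracy = {}

for test_size in (0.5, 0.25, 0.1):
    scores = []
    for seed in range(n_repeats):
        # A different seed gives a different random split each time
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=seed)
        clf = tree.DecisionTreeClassifier(random_state=seed)
        clf.fit(X_train, y_train)
        scores.append(accuracy_score(y_test, clf.predict(X_test)))
    mean_accuracy[test_size] = np.mean(scores)
    print("test_size={}: mean={:.3f}, std={:.3f}".format(
        test_size, np.mean(scores), np.std(scores)))
```

The standard deviation column gives a feel for how far a single run can land from the average.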

## Example 3

Using the same dataset, use a different classifier than the Decision Tree. We recommend using
the K Nearest Neighbours classifier.

The import package you want for this is 'sklearn.neighbors'.

Again, check out the documentation for this classifier on the website and implement it; the code is remarkably similar to the previous example. 

Once you have it working, compare the results to the Decision Tree classifier; is there much of a difference? Which is better? What are the advantages and disadvantages of each of the algorithms?

In [9]:
classifier2 = KNeighborsClassifier()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.5)
classifier2.fit(X_train, y_train)
accuracy = accuracy_score(y_test, classifier2.predict(X_test))
print("K-Nearest Neighbours Accuracy with 50/50: {}".format(accuracy))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
classifier2.fit(X_train, y_train)
accuracy = accuracy_score(y_test, classifier2.predict(X_test))
print("K-Nearest Neighbours Accuracy with 75/25: {}".format(accuracy))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1)
classifier2.fit(X_train, y_train)
accuracy = accuracy_score(y_test, classifier2.predict(X_test))
print("K-Nearest Neighbours Accuracy with 90/10: {}".format(accuracy))

K-Nearest Neighbours Accuracy with 50/50: 0.96
K-Nearest Neighbours Accuracy with 75/25: 1.0
K-Nearest Neighbours Accuracy with 90/10: 1.0


The output changes every time you run it, and sometimes the splits with less training data score higher than 90/10. Figure out why!
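For a steadier comparison between the two classifiers, one option that goes beyond this practical is k-fold cross-validation, available as cross_val_score in sklearn.model_selection. The sketch below assumes 5 folds and default hyperparameters; neither choice comes from the practical itself.

```python
from sklearn import datasets, tree
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target

models = {
    "Decision Tree": tree.DecisionTreeClassifier(random_state=0),
    "K-Nearest Neighbours": KNeighborsClassifier(),
}

cv_means = {}
for name, model in models.items():
    # Each of the 5 folds is used once as the test set; the rest is training data
    scores = cross_val_score(model, X, y, cv=5)
    cv_means[name] = scores.mean()
    print("{}: mean accuracy {:.3f} over 5 folds".format(name, scores.mean()))
```

Every sample is used for testing exactly once, so the averaged score depends less on one lucky or unlucky split.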

## Example 4 - Optional

Using what you have learnt from this dataset, apply it to another dataset.

Go to http://scikit-learn.org/stable/datasets/ and select the load_digits() dataset, as it is also a classification problem. Select the diabetes dataset if you really want a challenge (note that it is a regression problem, so a classifier does not apply to it directly)!

Like the iris dataset, these sets can also be retrieved through the sklearn.datasets module with a single function call.

In [10]:
digits = datasets.load_digits()
X = digits.data
y = digits.target

print(X)
print(y)

[[  0.   0.   5. ...,   0.   0.   0.]
 [  0.   0.   0. ...,  10.   0.   0.]
 [  0.   0.   0. ...,  16.   9.   0.]
 ..., 
 [  0.   0.   1. ...,   6.   0.   0.]
 [  0.   0.   2. ...,  12.   0.   0.]
 [  0.   0.  10. ...,  12.   1.   0.]]
[0 1 2 ..., 8 9 8]


We learn from the website that the data is represented as 64 columns, where each column is a number in the range [0, 16] describing one pixel of an 8x8 grid that draws one of the 10 digits (0 to 9). The dataset contains 1797 characters in total. This is the beginning of image classification, which we will touch on in a later practical.
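The relationship between the 64 columns and the 8x8 grid can be checked directly: reshaping one row of digits.data should reproduce the corresponding entry of digits.images. A minimal sketch follows; the variable names `first_image`, `same` and `classes` are introduced for illustration.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# Each row of digits.data is an 8x8 image flattened into 64 values
first_image = digits.data[0].reshape(8, 8)

# The same image is also stored unflattened in digits.images
same = np.array_equal(first_image, digits.images[0])

# The distinct labels present in the dataset
classes = np.unique(digits.target)

print(first_image.shape, same, classes)
```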

In [11]:
from matplotlib import pyplot as plt
plt.imshow(digits.images[-1], cmap=plt.cm.gray_r, interpolation='nearest')

<matplotlib.image.AxesImage at 0x10c7bc9b0>

In [12]:
X.shape

(1797, 64)

In [16]:
print(X.max(), X.min())

16.0 0.0


In [14]:
for i in range(1, 10):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = i/10)
    classifier = tree.DecisionTreeClassifier()
    classifier.fit(X_train, y_train)
    predictions = accuracy_score(y_test, classifier.predict(X_test))
    print("Test size {}, Prediction {}".format((i/10),predictions))

Test size 0.1, Prediction 0.8333333333333334
Test size 0.2, Prediction 0.8555555555555555
Test size 0.3, Prediction 0.8407407407407408
Test size 0.4, Prediction 0.8386648122392212
Test size 0.5, Prediction 0.8175750834260289
Test size 0.6, Prediction 0.7831325301204819
Test size 0.7, Prediction 0.7813990461049285
Test size 0.8, Prediction 0.8087621696801113
Test size 0.9, Prediction 0.7200247218788628


In [15]:
for i in range(1, 10):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = i/10)
    classifier = KNeighborsClassifier()
    classifier.fit(X_train, y_train)
    predictions = accuracy_score(y_test, classifier.predict(X_test))
    print("Test size {}, Prediction {}".format((i/10),predictions))

Test size 0.1, Prediction 0.9944444444444445
Test size 0.2, Prediction 0.9916666666666667
Test size 0.3, Prediction 0.9814814814814815
Test size 0.4, Prediction 0.9888734353268428
Test size 0.5, Prediction 0.9810901001112347
Test size 0.6, Prediction 0.969416126042632
Test size 0.7, Prediction 0.9689984101748808
Test size 0.8, Prediction 0.958970792767733
Test size 0.9, Prediction 0.9276885043263288


As you can see, the K Nearest Neighbours classifier is considerably more accurate than the Decision Tree on this dataset at every test size. We could also use more advanced classifiers such as Random Forests, but we won't touch on these in this seminar.