# Objectives


* To explore an existing dataset
> This week, we'll use the Iris dataset. You can read more about it here: https://scikit-learn.org/stable/datasets/toy_dataset.html.

* To apply the k-nearest neighbour (kNN) and Bagging algorithms from the Week 2 lecture to the classification of Iris plants based on petal and sepal sizes.

# Section 1 - Load the Iris dataset

In [None]:
from sklearn import datasets

iris_data, iris_labels = datasets.load_iris(return_X_y=True, as_frame=False)

print("The dimensions of the Iris feature matrix", iris_data.shape)

# Section 2 - Explore the Iris dataset

* Read about the Iris dataset here: https://scikit-learn.org/stable/datasets/toy_dataset.html
* What type of labels does it have (real continuous or categorical)? What kind of machine learning task is this type of label suited to, i.e. classification or regression?
* What is the feature dimensionality of the dataset, i.e. the number of features?
* How many data instances are there? What is the distribution of instances across classes?
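The questions above can be approached programmatically. The following is a minimal sketch, assuming the dataset is loaded as a scikit-learn `Bunch` (rather than with `return_X_y=True` as in Section 1) so that the feature and class names are available; the variable names `n_instances`, `n_features`, `classes` and `class_counts` are illustrative only.

```python
import numpy
from sklearn import datasets

# Load the dataset as a Bunch to keep feature and class names
iris = datasets.load_iris()
iris_data, iris_labels = iris.data, iris.target

# Rows are data instances, columns are features
n_instances, n_features = iris_data.shape

# Count how many instances belong to each class label
classes, class_counts = numpy.unique(iris_labels, return_counts=True)

print("Label dtype:", iris_labels.dtype)
print("Number of instances:", n_instances)
print("Number of features:", n_features)
print("Feature names:", iris.feature_names)
for c, count in zip(classes, class_counts):
    print("Class %d (%s): %d instances" % (c, iris.target_names[c], count))
```

The label dtype and the number of distinct values in `classes` are a useful starting point for deciding whether the task is classification or regression.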



---


* Select one of the features. How well does it help to differentiate between the iris classes (Hint: use a search engine to read about the Iris Setosa, Iris Versicolour, and Iris Virginica plants)?
* What factors do you think limited the number of data instances per class?
* How do you think the data was collected? What implication would this have for real world deployment of a model for automatic detection of iris classes based on this dataset?
* How do you think it was labelled? What kind of challenge might this pose for collection of more training data (and labels) for automatic detection of iris classes?

# Section 3 - Split into training and test sets

In [None]:
import numpy

# Randomly split the data into 50:50 training:test sets
random_seed = 1
rng = numpy.random.default_rng(random_seed)
rand_inds = numpy.arange(iris_labels.shape[0])
rng.shuffle(rand_inds)
split_point = int(0.5*iris_labels.shape[0])

training_data = iris_data[rand_inds[0:split_point], :]
training_labels = iris_labels[rand_inds[0:split_point]]
test_data = iris_data[rand_inds[split_point:iris_labels.shape[0]], :]
test_labels = iris_labels[rand_inds[split_point:iris_labels.shape[0]]]

print("Size of the training data:", training_data.shape)
print("Size of the test data:", test_data.shape)

# Section 4 - Train and evaluate a kNN model

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

model_kNN = KNeighborsClassifier(n_neighbors=5)
model_kNN.fit(training_data, training_labels)
test_predictions_kNN = model_kNN.predict(test_data)

print("\n What proportion of the kNN test predictions were correct? %.2f " % accuracy_score(test_labels, test_predictions_kNN))

# Section 5 - Visually explore the data and predictions

* Use data visualization to explore how separable the three iris classes are.
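One possible starting point is a pair of scatter plots, one point per plant and one colour per class. This is only a sketch: it assumes `matplotlib` is installed, and the two feature pairs in `feature_pairs` are an arbitrary choice for illustration rather than a recommended view of the data.

```python
import matplotlib.pyplot as plt
from sklearn import datasets

# Load the dataset as a Bunch to keep feature and class names
iris = datasets.load_iris()
iris_data, iris_labels = iris.data, iris.target

# Illustrative choice of feature pairs (column indices into iris_data):
# the two sepal measurements and the two petal measurements
feature_pairs = [(0, 1), (2, 3)]

fig, axes = plt.subplots(1, len(feature_pairs), figsize=(10, 4))
for ax, (i, j) in zip(axes, feature_pairs):
    # Plot each class separately so that it gets its own colour and legend entry
    for class_index, class_name in enumerate(iris.target_names):
        mask = iris_labels == class_index
        ax.scatter(iris_data[mask, i], iris_data[mask, j], label=class_name, alpha=0.7)
    ax.set_xlabel(iris.feature_names[i])
    ax.set_ylabel(iris.feature_names[j])
    ax.legend()
fig.tight_layout()
```

In a notebook the figure is normally displayed at the end of the cell; in a plain script, call `plt.show()` afterwards. The same approach could be extended to plot only the test instances and mark those that the model misclassified.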


# Section 6 - Explore the effect of the kNN hyperparameters

* Try different values of k, i.e. the number of nearest neighbours, e.g. k = 1, 2, 5, 10, 20. What effect does k have on the test accuracy?
* Try a different distance metric. See https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html.
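Both experiments can be run in a single loop. The sketch below assumes the same 50:50 split and random seed as Section 3 and recreates it so that the cell is self-contained; the two metrics tried (`"euclidean"` and `"manhattan"`) and the dictionary `knn_results` are illustrative choices.

```python
import numpy
from sklearn import datasets
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

iris_data, iris_labels = datasets.load_iris(return_X_y=True)

# Recreate the 50:50 split from Section 3 (same seed)
rng = numpy.random.default_rng(1)
rand_inds = numpy.arange(iris_labels.shape[0])
rng.shuffle(rand_inds)
split_point = int(0.5 * iris_labels.shape[0])
train_inds, test_inds = rand_inds[:split_point], rand_inds[split_point:]

# Test accuracy for every (metric, k) combination
knn_results = {}
for metric in ["euclidean", "manhattan"]:
    for k in [1, 2, 5, 10, 20]:
        model = KNeighborsClassifier(n_neighbors=k, metric=metric)
        model.fit(iris_data[train_inds], iris_labels[train_inds])
        predictions = model.predict(iris_data[test_inds])
        accuracy = accuracy_score(iris_labels[test_inds], predictions)
        knn_results[(metric, k)] = accuracy
        print("metric=%s, k=%d: test accuracy %.2f" % (metric, k, accuracy))
```

Because the test set is small, small differences in accuracy between settings may correspond to only one or two instances.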

# Section 7 - Train and evaluate a Bagging model

In [None]:
from sklearn.ensemble import BaggingClassifier
import math

# Set the max number of features drawn to train each base estimator (a decision tree by default)
max_feats = int(math.sqrt(training_data.shape[1]))

model_B = BaggingClassifier(n_estimators=10, max_features=max_feats, random_state=random_seed)
model_B.fit(training_data, training_labels)
test_predictions_B = model_B.predict(test_data)


print("\n What proportion of the Bagging test predictions were correct? %.2f " % accuracy_score(test_labels, test_predictions_B))

# Section 8 - Explore the effect of the Bagging hyperparameters

* Try different numbers of base classifiers, i.e. trees, e.g. n = 1, 10, 100, 1000. What effect does the number of trees have on the test accuracy?
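A loop over `n_estimators` is one way to run this experiment. The sketch below assumes the same 50:50 split, random seed and `max_features` setting as the earlier sections; the dictionary `bagging_results` is an illustrative name.

```python
import math

import numpy
from sklearn import datasets
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

iris_data, iris_labels = datasets.load_iris(return_X_y=True)

# Recreate the 50:50 split from Section 3 (same seed)
random_seed = 1
rng = numpy.random.default_rng(random_seed)
rand_inds = numpy.arange(iris_labels.shape[0])
rng.shuffle(rand_inds)
split_point = int(0.5 * iris_labels.shape[0])
train_inds, test_inds = rand_inds[:split_point], rand_inds[split_point:]

# Same feature subsampling as in Section 7
max_feats = int(math.sqrt(iris_data.shape[1]))

# Test accuracy for each ensemble size
bagging_results = {}
for n_trees in [1, 10, 100, 1000]:
    model = BaggingClassifier(n_estimators=n_trees, max_features=max_feats,
                              random_state=random_seed)
    model.fit(iris_data[train_inds], iris_labels[train_inds])
    predictions = model.predict(iris_data[test_inds])
    bagging_results[n_trees] = accuracy_score(iris_labels[test_inds], predictions)
    print("n_estimators=%d: test accuracy %.2f" % (n_trees, bagging_results[n_trees]))
```

Training time grows with the number of trees, so the largest setting may take noticeably longer than the others.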

# Section 9 - Explore split into training and test sets

* How was the Iris dataset split into training and test sets? See Section 3.
* Randomly split the dataset into training and test sets such that the ratio of instances is 80:20.
* What is the effect on the performance of the Bagging algorithm?
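The split from Section 3 can be generalised by making the training fraction a parameter. The following is a sketch only: `split_indices` is a hypothetical helper written for this example, not a scikit-learn function, and it reuses the shuffling approach and seed from Section 3.

```python
import math

import numpy
from sklearn import datasets
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

iris_data, iris_labels = datasets.load_iris(return_X_y=True)


def split_indices(n_instances, train_fraction, seed):
    """Hypothetical helper: shuffle the indices and cut them at train_fraction."""
    rng = numpy.random.default_rng(seed)
    inds = numpy.arange(n_instances)
    rng.shuffle(inds)
    split_point = int(round(train_fraction * n_instances))
    return inds[:split_point], inds[split_point:]


random_seed = 1
max_feats = int(math.sqrt(iris_data.shape[1]))

# Compare the original 50:50 split with an 80:20 split
split_results = {}
for train_fraction in [0.5, 0.8]:
    train_inds, test_inds = split_indices(iris_labels.shape[0], train_fraction, random_seed)
    model = BaggingClassifier(n_estimators=10, max_features=max_feats,
                              random_state=random_seed)
    model.fit(iris_data[train_inds], iris_labels[train_inds])
    predictions = model.predict(iris_data[test_inds])
    split_results[train_fraction] = accuracy_score(iris_labels[test_inds], predictions)
    print("train fraction %.1f (%d train, %d test): test accuracy %.2f"
          % (train_fraction, len(train_inds), len(test_inds), split_results[train_fraction]))
```

scikit-learn also provides `sklearn.model_selection.train_test_split`, which performs a similar random split and can optionally stratify by class. Note that the 80:20 test set is smaller, so each misclassified instance changes the accuracy by a larger amount.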