#### The University of Melbourne, School of Computing and Information Systems
# COMP30027 Machine Learning, 2021 Semester 1

## Week 5 - Practical Workshop

Today, we will be using `scikit-learn` again to learn more about Baselines and Decision Trees.
### Exercise 1. 
Load the `Iris` dataset as follows. (Note that there are some small differences between this dataset and the one we were looking at last week; most notably, the errors that we needed to fix are not present.)

In [48]:
import numpy as np
from sklearn import datasets
from collections import Counter
import matplotlib.pyplot as plt
iris = datasets.load_iris()

- **(a)** Identify the contents of the complex data type `iris`; for example, `iris.DESCR` contains a long description of the dataset, which you can `print()`.

- **(b)** The common terminology in `scikit-learn` is that the array defining the attribute values is called `X` and the array defining the “ground truth” labels is called `y`; create these variables for the Iris data.

- **(c)** Confirm that `X` is a 2-dimensional array, with a row for each instance and a column for each attribute. (Hint: read about the `shape` property in `numpy`.)

### Exercise 2.
Let’s build a 0-R classifier (“majority class classifier”). In `scikit-learn`, this is a `DummyClassifier`.

**Note:** `scikit-learn` uses this terminology to remind you not to use these sorts of classifiers when trying to solve real problems. However, they are easy **baseline classifiers** and are quite useful.

In [52]:
from sklearn.dummy import DummyClassifier
zero_r = DummyClassifier(strategy='most_frequent')
zero_r.fit(X, y)

DummyClassifier(constant=None, random_state=None, strategy='most_frequent')

- **(a)** Confirm that this is a typical 0-R classifier by checking its predictions on the training data: `zero_r.predict(X)` — which class has it chosen?

In [None]:
ybar = ...
print(ybar)
label_counter = Counter(y)
label_counter.most_common()

Everything is a number, as far as scikit-learn is concerned - so it's a little difficult to know which class label this is. On the other hand, each of the classes is equally likely, so the method appears to have chosen the "0th" class arbitrarily.
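
As a minimal sketch of what the 0-R strategy amounts to, the class counts can be inspected by hand and compared with the `DummyClassifier` predictions. The names `majority_label`, `majority_count` and `predicted_labels` are illustrative only and are not part of the workshop code; `iris.target_names` is used to map a numeric label back to a species name.

```python
from collections import Counter

from sklearn import datasets
from sklearn.dummy import DummyClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Count how often each numeric class label occurs in the training labels.
label_counter = Counter(y)
majority_label, majority_count = label_counter.most_common(1)[0]

# Fit 0-R and collect the distinct labels it predicts on the training data.
zero_r = DummyClassifier(strategy='most_frequent')
zero_r.fit(X, y)
predicted_labels = set(zero_r.predict(X))

# target_names translates a numeric label into a readable class name.
print(majority_label, majority_count, iris.target_names[majority_label])
print(predicted_labels)
```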

- **(b)** The default evaluation metric associated with a `DummyClassifier` is accuracy, which you can observe using `score()`, for example: `zero_r.score(X, y)`. This strategy (building a model, and then evaluating it on the data that we used to build the model) gives us something called “training accuracy”, and is generally frowned upon in the *Machine Learning* community. Why do you suppose this is? (We’ll examine some better techniques later.)

In [None]:
print(zero_r.score(...))

- **(c)** Contrast the `0-R` classifier with the “weighted random classifier” (`strategy='stratified'`), which makes random predictions according to the distribution of classes in the training data. Check its predictions, and evaluate its training accuracy. Does it have a higher accuracy, on average, than `0-R`, or a lower accuracy? (You should run `score()` at least 10 times.)
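
One way to reason about this comparison before running it: if class k makes up a proportion p_k of the labels, a majority-class predictor is correct with probability max p_k, while a predictor that guesses class k with probability p_k is correct with probability equal to the sum of p_k squared. The sketch below computes both quantities. The helper names and the imbalanced `toy_labels` array are hypothetical, chosen only for illustration.

```python
import numpy as np

def class_proportions(labels):
    """Return the proportion of each distinct label in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def expected_zero_r_accuracy(labels):
    # A majority-class predictor is right exactly when the true class is the majority.
    return class_proportions(labels).max()

def expected_stratified_accuracy(labels):
    # Guess class k with probability p_k; the guess is right with probability p_k.
    p = class_proportions(labels)
    return np.sum(p ** 2)

# Hypothetical imbalanced labels: 70% class 0, 20% class 1, 10% class 2.
toy_labels = np.array([0] * 70 + [1] * 20 + [2] * 10)
print(expected_zero_r_accuracy(toy_labels))
print(expected_stratified_accuracy(toy_labels))
```

Calling the same two functions on the Iris labels `y` may help you interpret the averages you observe over repeated runs.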

In [None]:
stratified_clf = DummyClassifier(strategy='stratified')
stratified_clf.fit(...)
accuracies = []
num_runs = 10
for i in range(num_runs):
    acc = stratified_clf.score(...)
    accuracies.append(acc)
print(accuracies)
print('Average accuracy over {} runs is: {}.'.format(num_runs, np.mean(accuracies)))

### Exercise 3.
Let’s consider a couple of other classifiers: a `Decision Tree`, and `1-R` (which is really just a limited `DecisionTreeClassifier` in `scikit-learn`).

**NOTE:** The `scikit-learn` implementation of `1-R` is slightly different to the lecture version, because it doesn’t count errors; rather, it uses the Gini coefficient or the Information Gain to determine the best attribute.
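
To make the two criteria concrete, here is a small sketch of how each impurity measure can be computed for a set of labels, and how a candidate split can be scored by the size-weighted impurity of its two children. This is not the `scikit-learn` implementation; the function names and the four-instance example are hypothetical. The quantity computed by `gini` is usually called the Gini impurity (referred to above as the Gini coefficient).

```python
import numpy as np

def class_proportions(labels):
    """Return the proportion of each distinct label in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2."""
    p = class_proportions(labels)
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Entropy in bits: -sum_k p_k * log2(p_k)."""
    p = class_proportions(labels)
    return -np.sum(p * np.log2(p))

def weighted_impurity(feature, labels, threshold, impurity):
    """Size-weighted impurity of the two children after splitting at `threshold`."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    n = len(labels)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)

# Tiny hypothetical example: one attribute, two classes.
feature = np.array([1.0, 2.0, 3.0, 4.0])
labels = np.array([0, 0, 1, 1])
print(gini(labels), entropy(labels))
print(weighted_impurity(feature, labels, 2.5, gini))
print(weighted_impurity(feature, labels, 1.5, gini))
```

A split is preferred when it gives a lower weighted impurity; with entropy as the impurity measure, the reduction relative to the parent is the Information Gain.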



In [None]:
from sklearn.tree import DecisionTreeClassifier
one_r = DecisionTreeClassifier(max_depth=1)
one_r.fit(X, y)
dt = DecisionTreeClassifier(max_depth=None)
dt.fit(X, y)

- **(a)** Find the training accuracy of the two classifiers.

In [None]:
one_r_acc = ...
dt_acc = ...
print("1-R accuracy: {}; DT accuracy: {}".format(one_r_acc, dt_acc))

- **(b)** The `feature_importances_` attribute is enough to identify the single attribute that the 1-R classifier splits on. Which attribute is being used to classify the data?

In [None]:
importances = one_r.feature_importances_
max_index = ...
best_feature_name = ...
print(best_feature_name)

- **(c)** *(Harder)* Check the predicted labels for each instance to discern the values for this attribute that each class maps to.

In [None]:
ybar = one_r.predict(X)
best_feature = X[:, max_index]
plt.scatter(...)
plt.xlabel(best_feature_name)
plt.ylabel('predicted class')
plt.show()
#print(ybar)

- **(d)** The default splitting criterion for these Decision Trees is the **Gini coefficient**. Read up on the difference between this and the **Information Gain** — do you expect the behaviour of this model to change by using the alternative splitting criterion? Try it, and confirm your expectations.

In [None]:
one_r = DecisionTreeClassifier(max_depth=1, criterion="entropy")
one_r.fit(X, y)
dt = DecisionTreeClassifier(max_depth=None, criterion="entropy")
dt.fit(X, y)

one_r_acc = ...
dt_acc = ...
print("Information Gain/entropy: 1-R accuracy: {} DT accuracy: {}".format(one_r_acc, dt_acc))

importances = one_r.feature_importances_
max_index = np.argmax(importances)
best_feature_name = iris.feature_names[max_index]
print("1-R attribute: ",best_feature_name)

ybar = one_r.predict(X)
best_feature = X[:, max_index]
plt.scatter(best_feature, ybar, c=ybar)
plt.xlabel(best_feature_name)
plt.ylabel('predicted class')
plt.show()

### Exercise 4
A better mechanism for evaluating a classifier is based on randomly partitioning the data into a training set and test set (the “holdout” method). There is an in-built utility for this in `scikit-learn`, but it can be in one of two places:

In [None]:
from sklearn.model_selection import train_test_split # Newer versions
#from sklearn.cross_validation import train_test_split # Older versions
X_train, X_test, y_train, y_test = train_test_split(X, y)
print('X_train: {} X_test: {} y_train: {} y_test: {}'.format(X_train.shape, X_test.shape, y_train.shape, y_test.shape))

- **(a)** Train the three classifiers (0-R, 1-R, Decision Tree) on the training data, rather than the full data set. `score()` is too specific to be used in most situations; another way to find the training accuracy is by comparing the predictions to the ground truth labels as follows:

```python
>>> from sklearn.metrics import accuracy_score
>>> accuracy_score(y_train, zero_r.predict(X_train))
```

- Calculate the accuracy of the classifiers on the training portion of the holdout split. How does it compare to the training accuracies you calculated before? Why is this?

In [None]:
from sklearn.metrics import accuracy_score

zero_r.fit(X_train, y_train)
one_r.fit(X_train, y_train)
dt.fit(X_train, y_train)

zr_acc = ...
or_acc = ...
dt_acc = ...
print('Train accuracies: 0-R: {} 1-R: {} DT: {}'.format(zr_acc, or_acc, dt_acc))

- **(b)** Instead of calculating the accuracy with respect to the training set, train your classifiers on the training data (using `fit()`) and then evaluate them (by calculating accuracy) according to their predictions on the test data. How different are the training accuracies and test accuracies? Hypothesise what could be causing these differences.

In [None]:
zr_acc = ...
or_acc = ...
dt_acc = ...
print('Test accuracies: 0-R: {} 1-R: {} DT: {}'.format(zr_acc, or_acc, dt_acc))

- **(c)** By default, `train_test_split` uses 75% of the data as training, and 25% as test. This can be changed by passing an argument, for example, `test_size=0.5` means that we use 50% as training and 50% as test. Try some different values (perhaps multiple times) to see if you can observe the trade-off inherent in the model using this evaluation strategy.

- **Note** The default behaviour of `train_test_split` is that the remainder of the data is used as training; this too can be altered, if you wish.
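
The effect of these arguments on the partition sizes can be checked directly, as in the sketch below. The `random_state=0` argument is an assumption added here only to make the split reproducible; it is not used elsewhere in this workshop. The third call sets both `train_size` and `test_size`, which leaves some instances in neither partition.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Default: 25% of the instances are held out for testing, the rest are for training.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(X_train.shape, X_test.shape)

# Half of the instances for training, half for testing.
X_train_half, X_test_half, _, _ = train_test_split(
    X, y, test_size=0.5, random_state=0)
print(X_train_half.shape, X_test_half.shape)

# Setting both sizes: 50% training, 20% test, the remaining 30% is unused.
X_train_small, X_test_small, _, _ = train_test_split(
    X, y, train_size=0.5, test_size=0.2, random_state=0)
print(X_train_small.shape, X_test_small.shape)
```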

In [None]:
for test_size in [0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99]:
    print('Running experiments with test set size: {}'.format(test_size))
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size)
    print('X_train: {} X_test: {} y_train: {} y_test: {}'.format(X_train.shape, X_test.shape, y_train.shape, y_test.shape))
    
    zero_r.fit(X_train, y_train)
    one_r.fit(X_train, y_train)
    dt.fit(X_train, y_train)

    zr_acc = ...
    or_acc = ...
    dt_acc = ...
    print('Train accuracies: 0-R: {} 1-R: {} DT: {}'.format(zr_acc, or_acc, dt_acc))

    zr_acc = ...
    or_acc = ...
    dt_acc = ...
    print('Test accuracies: 0-R: {} 1-R: {} DT: {}'.format(zr_acc, or_acc, dt_acc))
    print()

### Exercise 5.
*(Stratified)* `M–fold` cross-validation is so popular, `scikit-learn` has a utility that applies it directly.
For example, 10–fold cross-validation of the `0-R` classifier proceeds as follows:

```python
>>> from sklearn.model_selection import cross_val_score # Newer versions
>>> # from sklearn.cross_validation import cross_val_score # Older versions
>>> cross_val_score(zero_r, X, y, cv=10)
```
**Note:** There are also simpler utilities like `StratifiedKFold()`, which only generate the partitions; you can then use these to train and test the model yourself, if you wish.
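
A minimal sketch of the manual route is shown below: `StratifiedKFold` yields index arrays for each fold, and the model is trained and evaluated once per fold. The variable names are illustrative. The comparison with `cross_val_score` assumes that passing an integer `cv` with a classifier uses unshuffled stratified folds, which is the documented default as far as I am aware.

```python
import numpy as np
from sklearn import datasets
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

iris = datasets.load_iris()
X, y = iris.data, iris.target
zero_r = DummyClassifier(strategy='most_frequent')

# Generate 10 stratified partitions and evaluate the classifier on each in turn.
skf = StratifiedKFold(n_splits=10)
fold_scores = []
for train_idx, test_idx in skf.split(X, y):
    zero_r.fit(X[train_idx], y[train_idx])          # train on 9 folds
    predictions = zero_r.predict(X[test_idx])       # predict the held-out fold
    fold_scores.append(accuracy_score(y[test_idx], predictions))

print(np.array(fold_scores))
print(cross_val_score(zero_r, X, y, cv=10))
```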

- **(a)** This method returns an array of the calculated evaluation metric (by default, accuracy) across the folds. Write a wrapper function which averages these values, so as to come up with a single score for the classifier.

In [None]:
from sklearn.model_selection import cross_val_score
print(cross_val_score(zero_r, X, y, cv=10))

def avg_score(clf, X, y, cv=10):
    scores = ...
    return np.mean(scores)   

- **(b)** How does the estimate of the accuracy of the various classifiers using cross-validation compare to the training accuracies and holdout accuracies you calculated above?

In [None]:
for clf in [zero_r, one_r, dt]:
    avg = avg_score(...)
    print(clf)
    print('Average CV accuracy', avg)
    print()