#### The University of Melbourne, School of Computing and Information Systems
# COMP30027 Machine Learning, 2021 

## Week 4 - workshop

This week, we will be using scikit-learn to classify some data, and to evaluate some classifiers.

In [11]:
import numpy as np
from sklearn import datasets
from collections import Counter
import matplotlib.pyplot as plt

### Exercise 1.
Please load the Car Evaluation dataset from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data).

The common terminology in scikit-learn is that the array defining the attribute values is called `X` and the array defining the gold-standard (“ground truth”) labels is called `y`; create these variables for the car data.
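
As a minimal sketch of this convention, the snippet below parses three comma-separated rows into an attribute list and a label list. The rows and the names `sample_lines`, `X_demo` and `y_demo` are illustrative assumptions (made-up stand-ins in the `car.data` layout of six attribute values followed by the class label), so the example runs without the file.

```python
# Made-up rows in the car.data layout: six attribute values, then the class label.
sample_lines = [
    "vhigh,vhigh,2,2,small,low,unacc",
    "low,med,4,4,big,high,vgood",
    "med,low,5more,more,med,med,acc",
]

X_demo = []  # attribute values, one list per instance
y_demo = []  # gold-standard labels, one per instance
for line in sample_lines:
    atts = line.strip().split(",")
    X_demo.append(atts[:-1])  # every field except the last
    y_demo.append(atts[-1])   # the last field is the class label

print(len(X_demo), "instances,", len(X_demo[0]), "attributes each")
print("labels:", y_demo)
```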

- **(a)** Load the data into a suitable format for scikit-learn:


In [12]:
X = []
y = []
with open('car.data', mode='r') as fin:
    for line in fin:
        atts = line.strip().split(",")
        X.append(...) #all atts, excluding the class
        y.append(...)

- **(b)** How many instances are there in this collection? How many attributes, and of what type(s)? What is the class we’re trying to predict, and how many values does it take?

In [None]:
from collections import Counter
print('There are', ..., 'instances')
print('There are', ..., "attributes, for example:", ...)
print('There are', ..., "class labels:", ...)
#use Counter to count the number of labels
label_counter = Counter(y)
print("Label frequencies: %s" %str(label_counter.most_common()))
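
To illustrate what `Counter` reports, here is a small sketch on a made-up label list (`toy_labels` is a hypothetical name, not the car data). The number of keys gives the number of distinct class values, and `most_common()` lists them from most to least frequent.

```python
from collections import Counter

toy_labels = ["unacc", "acc", "unacc", "good", "unacc", "acc"]  # made-up labels
toy_counter = Counter(toy_labels)

print(len(toy_counter), "distinct labels")
print(toy_counter.most_common())  # (label, count) pairs, most frequent first
```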

### Exercise 2.
Unfortunately, scikit-learn isn’t set up to deal with our attributes in this format.

- **(a)** Write some functions that transform our **categorical** attributes into **numerical** attributes, by (perhaps arbitrarily) assigning each categorical value to an integer, for example:

```python
def convert_class(raw):
    if raw=="unacc": return 0
    elif raw=="acc": return 1
    elif raw=="good": return 2
    elif raw=="vgood": return 3
```


In [None]:
# We could check this from the "car.names" file that accompanies car.data in the UCI repository
# Here's one (somewhat inefficient) way of reading this from the data itself
feature_1_values = set([X[i][0] for i in range(len(X))])
feature_2_values = set([X[i][1] for i in range(len(X))])
feature_3_values = set([X[i][2] for i in range(len(X))])
feature_4_values = set([X[i][3] for i in range(len(X))])
feature_5_values = set([X[i][4] for i in range(len(X))])
feature_6_values = set([X[i][5] for i in range(len(X))])
print("feature 1: %s" %str(feature_1_values))
print("feature 2: %s" %str(feature_2_values))
print("feature 3: %s" %str(feature_3_values))
print("feature 4: %s" %str(feature_4_values))
print("feature 5: %s" %str(feature_5_values))
print("feature 6: %s" %str(feature_6_values))

In [None]:
import numpy as np

def convert_feature_1and2and6(raw):
    if raw == "low": return 0
    elif raw == "med": return 1
    elif raw == "high": return 2
    elif raw == "vhigh": return 3
    # In general, we might want to catch unexpected values, too
def convert_feature_3(raw):
    if raw == "2": return 0
    elif raw == "3": return 1
    elif raw == "4": return 2
    elif raw == "5more": return 3
def convert_feature_4(raw):
    if raw == "2": return 0
    elif raw == "4": return 1
    elif raw == "more": return 2
def convert_feature_5(raw):
    if raw == "small": return 0
    elif raw == "med": return 1
    elif raw == "big": return 2
def convert_class(raw):
    if raw == "unacc": return 0
    elif raw == "acc": return 1
    elif raw == "good": return 2
    elif raw == "vgood": return 3

X_ordinal = []
for x in X:
    f1, f2, f3, f4, f5, f6 = ...
    f1 = ...
    f2 = ...
    f3 = ...
    f4 = ...
    f5 = ...
    f6 = ...
    x = [f1, f2, f3, f4, f5, f6]
    X_ordinal.append(x)
    
#convert to int array to make sure everything is converted.
X_ordinal = np.array(X_ordinal, dtype='int')


#convert ys
y_numeric = []
for this_y in y:
    this_y = convert_class(this_y)
    y_numeric.append(this_y)

y_num = np.array(y_numeric, dtype='int')


print('X shape: {}, y shape: {}'.format(X_ordinal.shape, y_num.shape))
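
The comment above about catching unexpected values could be handled with a dictionary lookup rather than an `if`/`elif` chain. The following is only a sketch: `LEVELS` and `convert_with_lookup` are hypothetical names, and the integer codes follow the same arbitrary ordering as `convert_feature_1and2and6`.

```python
# Hypothetical lookup table using the same codes as convert_feature_1and2and6.
LEVELS = {"low": 0, "med": 1, "high": 2, "vhigh": 3}

def convert_with_lookup(raw, mapping):
    """Map a categorical value to its integer code, failing loudly on unknown values."""
    try:
        return mapping[raw]
    except KeyError:
        raise ValueError("unexpected value: %r" % raw)

print(convert_with_lookup("med", LEVELS))
```

An `if`/`elif` chain silently returns `None` for a value it does not recognise; raising an error instead makes a typo or a stray value in the data visible immediately.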

- **(b)** Load the dataset again, this time as integers. Observe that we can actually build a model using this data.

In [None]:
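# `clf` must be created first, e.g. clf = GaussianNB() from sklearn.naive_bayes (an assumption; any scikit-learn classifier works)
clf.fit(...)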
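
The notebook does not say which classifier `clf` should be, so the sketch below assumes `GaussianNB` purely for illustration and fits it to a tiny made-up integer array (`X_int`, `y_int` and `demo_clf` are hypothetical names). The point is only that `fit` accepts integer-coded attributes.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Made-up integer-coded data: two well-separated groups.
X_int = np.array([[0, 0], [0, 1], [3, 3], [3, 2]])
y_int = np.array([0, 0, 1, 1])

demo_clf = GaussianNB()      # assumption: any scikit-learn classifier could be used here
demo_clf.fit(X_int, y_int)   # succeeds because every attribute is numeric

print(demo_clf.predict(np.array([[0, 0], [3, 3]])))
```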

- **(c)** Split the data into training and test sets

In [None]:
from sklearn.model_selection import train_test_split # Newer versions
#from sklearn.cross_validation import train_test_split # Older versions
X_train, X_test, y_train, y_test = train_test_split(...)
print('X_train: {} X_test: {}'.format(X_train.shape, X_test.shape))
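
As a sketch of how `train_test_split` behaves, the example below splits ten made-up instances. The toy arrays and the choice of `test_size=0.33` with `random_state=0` are assumptions for illustration. The split shuffles the rows, but each row of `X` stays paired with its label in `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)  # 10 made-up instances, 2 attributes
y_toy = np.array([0, 1] * 5)          # label of row i is i % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=0
)
print('train: {} test: {}'.format(X_tr.shape, X_te.shape))
```

Fixing `random_state` makes the split reproducible; changing it gives a different random partition of the same data.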

### Exercise 3.
Read up on different implementations of the Naive Bayes classifier in `sklearn.naive_bayes`. Which one do you think is most suitable for the dataset we have?

- **(a)** Compare the accuracies of all three different kinds of Naive Bayes classifier. Does this accord with your expectations?

In [None]:
import sklearn.naive_bayes as nb
##print(dir(nb))
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

gnb_accs = []
mnb_accs = []
bnb_accs = []
gnb = GaussianNB()
mnb = MultinomialNB()
bnb = BernoulliNB()

for i in range(3):
    X_train, X_test, y_train, y_test = train_test_split(X_ordinal, y_num, test_size=0.33, random_state=i)
    gnb.fit(X_train, y_train)
    acc = gnb.score(...)
    print("GNB score %f " %acc)
    gnb_accs.append(acc)
    
    mnb.fit(X_train, y_train)
    acc = mnb.score(...)
    print("MNB score %f " %acc)
    mnb_accs.append(acc)
    
    bnb.fit(X_train, y_train)
    acc = bnb.score(...)
    print("BNB score %f " %acc)
    bnb_accs.append(acc)
    
print('Avg GNB score: {}'.format(np.mean(gnb_accs)))
print('Avg MNB score: {}'.format(np.mean(mnb_accs)))
print('Avg BNB score: {}'.format(np.mean(bnb_accs)))
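
For a classifier, `score` returns the mean accuracy on the data it is given. The sketch below checks this by hand on made-up data (the arrays and the name `model` are illustrative assumptions): the fraction of predictions that match the labels should equal the value returned by `score`.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Made-up data, used only to compare the two ways of computing accuracy.
X_toy = np.array([[0, 0], [0, 1], [1, 0], [3, 3], [3, 2], [2, 3]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

model = GaussianNB().fit(X_toy, y_toy)
manual_acc = np.mean(model.predict(X_toy) == y_toy)  # fraction of correct predictions
print(manual_acc, model.score(X_toy, y_toy))
```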

    

- **(b)** By default, this implementation of Naive Bayes uses Laplace smoothing. Turn this off, and see what happens — what is the significance of the reported accuracy?

In [None]:
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

mnb_accs = []
bnb_accs = []

# Gaussian NB doesn't use smoothing; all of the probabilities for the Gaussian are already non-zero
# You can try this for yourself, but scikit-learn will flatly refuse to do it

#mnb = MultinomialNB(alpha=0)

#bnb = BernoulliNB(alpha=0)

mnb = MultinomialNB(alpha=...)
bnb = BernoulliNB(alpha=...)

for i in range(3):
    X_train, X_test, y_train, y_test = train_test_split(X_ordinal, y_num, test_size=0.33, random_state=i)
    
    mnb.fit(X_train, y_train)
    acc = mnb.score(X_test, y_test)
    print("MNB score %f " %acc)
    mnb_accs.append(acc)
    
    bnb.fit(X_train, y_train)
    acc = bnb.score(X_test, y_test)
    print("BNB score %f " %acc)
    bnb_accs.append(acc)
    
print('Avg MNB score: {}'.format(np.mean(mnb_accs)))
print('Avg BNB score: {}'.format(np.mean(bnb_accs)))
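
To see what the smoothing parameter does, here is a hand-rolled sketch of the textbook add-alpha estimate for a single categorical attribute within one class. The function `smoothed_probs` and the counts are made up for illustration. scikit-learn applies the same additive idea to its own count statistics, so this is not a copy of its implementation.

```python
def smoothed_probs(counts, alpha):
    """Estimate P(value | class) from raw counts with add-alpha smoothing."""
    total = sum(counts.values()) + alpha * len(counts)
    return {value: (n + alpha) / total for value, n in counts.items()}

# Made-up counts of one attribute within one class; "big" was never observed.
counts = {"small": 6, "med": 4, "big": 0}

print(smoothed_probs(counts, alpha=1))  # every value gets a non-zero probability
print(smoothed_probs(counts, alpha=0))  # "big" gets probability exactly 0
```

Because Naive Bayes multiplies the per-attribute probabilities, a single zero estimate drives the whole product for that class to zero, whatever the other attributes say.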

- **(c)** What happens if you increase the smoothing parameter instead? Calculate the accuracy for a range of values from 5 to 500. For the very large values, examine the predicted classes for the test instances — what is happening?

In [None]:
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

mnb_accs = []
bnb_accs = []
# Let's not mess around, and go straight to a large value:
mnb = MultinomialNB(alpha=...)
bnb = BernoulliNB(alpha=...)

for i in range(1):
    X_train, X_test, y_train, y_test = train_test_split(X_ordinal, y_num, test_size=0.33, random_state=i)
    
    mnb.fit(X_train, y_train)
    acc = mnb.score(X_test, y_test)
    print("MNB score %f " %acc)
    mnb_accs.append(acc)
    
    bnb.fit(X_train, y_train)
    acc = bnb.score(X_test, y_test)
    print("BNB score %f " %acc)
    bnb_accs.append(acc)
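
One way to examine the predicted classes is to tally them with `Counter`. The sketch below does this on made-up, imbalanced data with a deliberately extreme `alpha=1e6`. Both the data and that value are assumptions for illustration, well outside the 5 to 500 range asked for above. With smoothing this heavy, the per-class likelihoods become nearly identical, so the class priors dominate the prediction.

```python
from collections import Counter

import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X_toy = rng.integers(0, 4, size=(100, 6))  # made-up ordinal codes in 0..3
y_toy = np.array([0] * 70 + [1] * 30)      # class 0 is the majority class

big_alpha = MultinomialNB(alpha=1e6).fit(X_toy, y_toy)
print(Counter(big_alpha.predict(X_toy)))   # tally of predicted classes
```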
    

### Exercise 4.
The transformation of the data in Exercise 2 implicitly creates ordinal attributes. At first glance, such a strategy does seem reasonable in light of the given values (such as *small, med, big*).
A different strategy would be to *binarise* the attributes: to replace a categorical attribute having `m` values with `m` binary attributes. One way of doing this in scikit-learn is using the **OneHotEncoder**:

```python
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
ohe.fit(X)
X_trans = ohe.transform(X).toarray()
```

Note that this transformation should be done before we split the data into training and test sets. (Why?)
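
One practical consequence of the ordering can be sketched as follows, assuming a made-up split in which the value `"high"` happens to occur only in the test rows (`train_rows`, `test_rows` and `enc` are hypothetical names). With its default settings, an encoder fitted on the training rows alone rejects the unseen value at transform time.

```python
from sklearn.preprocessing import OneHotEncoder

train_rows = [["low"], ["med"]]  # made-up training split: "high" never appears
test_rows = [["high"]]           # made-up test split

enc = OneHotEncoder()            # default handle_unknown='error'
enc.fit(train_rows)
try:
    enc.transform(test_rows)
    unknown_rejected = False
except ValueError as err:
    unknown_rejected = True
    print("transform failed:", err)
```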

- **(a)** Check the shape of `X_trans` — how many attributes do we have now? Does this correspond to your expectations?

In [None]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
ohe.fit(...)
X_trans = ohe.transform(...).toarray()

print(X_trans.shape)
print('X[0]:', X[0])
print('X_trans[0]:', X_trans[0])
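
As a sketch of what to expect from the shape, the example below encodes a made-up table with two categorical columns (`toy`, `toy_enc` and `toy_trans` are hypothetical names). Each column contributes one binary attribute per distinct value, so a column with 3 values and a column with 2 values yield 3 + 2 = 5 attributes.

```python
from sklearn.preprocessing import OneHotEncoder

# Made-up rows: the first column has 3 distinct values, the second has 2.
toy = [["low", "2"], ["high", "4"], ["med", "2"]]

toy_enc = OneHotEncoder().fit(toy)
toy_trans = toy_enc.transform(toy).toarray()

print(toy_enc.categories_)  # distinct values found in each column
print(toy_trans.shape)      # rows x total number of binary attributes
print(toy_trans[0])         # exactly one 1 per original column
```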

- **(b)** Split the dataset of one-hot attributes into **train** and **test** sets. Compare the accuracies of the three Naive Bayes models using ordinal attributes with the three models using one-hot attributes: are you surprised? What can we infer?



In [None]:
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

gnb_accs = []
mnb_accs = []
bnb_accs = []
gnb = GaussianNB()
mnb = MultinomialNB()
bnb = BernoulliNB()

for i in range(3):
    X_train, X_test, y_train, y_test = train_test_split(X_trans, y_num, test_size=0.33, random_state=i)
    gnb.fit(...)
    acc = gnb.score(...)
    print("GNB score %f " %acc)
    gnb_accs.append(acc)
    
    mnb.fit(...)
    acc = mnb.score(...)
    print("MNB score %f " %acc)
    mnb_accs.append(acc)
    
    bnb.fit(...)
    acc = bnb.score(...)
    print("BNB score %f " %acc)
    bnb_accs.append(acc)
    
print('Avg GNB score: {}'.format(np.mean(gnb_accs)))
print('Avg MNB score: {}'.format(np.mean(mnb_accs)))
print('Avg BNB score: {}'.format(np.mean(bnb_accs)))