# Assignment 2 - Part C: Trying alternative classifiers

This is a skeleton for trying alternative classifiers on the basketball dataset.

In [1]:
import csv

As in Practicum 6, we define a data-loading function that returns the attribute values and class labels for both the training set and the test set.

In [2]:
ATTRS = ["LOCATION", "W", "FINAL_MARGIN", "SHOT_NUMBER", "PERIOD", "GAME_CLOCK", "SHOT_CLOCK", "DRIBBLES", "TOUCH_TIME",
         "SHOT_DIST", "PTS_TYPE", "CLOSE_DEF_DIST", "SHOT_RESULT"]
ATTRS_WO_CLASS = 12

def load_data(filename):
    train_x = []
    train_y = []
    test_x = []
    test_y = []
    with open(filename, 'rt') as csvfile:
        csvreader = csv.reader(csvfile, delimiter=',')
        count = 0  # number of well-formed rows seen so far
        for row in csvreader:
            if len(row) == ATTRS_WO_CLASS + 1:
                count += 1
                instance = row[:ATTRS_WO_CLASS]  # first ATTRS_WO_CLASS values are attributes
                label = row[ATTRS_WO_CLASS]  # (ATTRS_WO_CLASS + 1)th value is the class label
                if count % 3 == 0:  # every third row is a test instance
                    test_x.append(instance)
                    test_y.append(label)
                else:  # train instance
                    train_x.append(instance)
                    train_y.append(label)
                    
    return train_x, train_y, test_x, test_y
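
The split rule in `load_data` can be hard to picture from the loop alone. The following is a minimal sketch on made-up rows (the names `rows`, `train` and `test` are placeholders, not part of the assignment) showing which rows end up in each set:

```python
# Hypothetical stand-in rows; the real file has 13 comma-separated values per row.
rows = [f"row{n}" for n in range(1, 10)]

train, test = [], []
for count, row in enumerate(rows, start=1):
    # Same rule as load_data: every third well-formed row goes to the test set.
    if count % 3 == 0:
        test.append(row)
    else:
        train.append(row)

print(train)
print(test)
```

The split is deterministic: roughly two thirds of the rows are used for training and one third for testing.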

And then we can use it to load the data.

In [3]:
train_x, train_y, test_x, test_y = load_data("data/basketball.train.csv")

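Next, we define an evaluator that compares the predictions with the true labels and prints the accuracy and the error rate: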

In [4]:
def evaluate(predictions, true_labels):
    correct = 0
    incorrect = 0
    for i in range(len(predictions)):
        if predictions[i] == true_labels[i]:
            correct += 1
        else:
            incorrect += 1

    print("\tAccuracy:   ", correct / len(predictions))
    print("\tError rate: ", incorrect / len(predictions))
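
As a cross-check, the same accuracy can be obtained with scikit-learn's `accuracy_score`. This is only a sketch on made-up labels (the label strings below are assumptions, not necessarily the values used in the dataset):

```python
from sklearn.metrics import accuracy_score

# Made-up labels and predictions, for illustration only.
true_labels = ["made", "missed", "made", "missed"]
predictions = ["made", "made", "made", "missed"]

accuracy = accuracy_score(true_labels, predictions)  # fraction of matching positions
error_rate = 1 - accuracy
print(accuracy, error_rate)
```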

Scikit-learn requires all attribute values to be numeric. That is, we need to binarize the non-numeric attribute values so that each record becomes a vector containing only numbers. The `DictVectorizer` class provided by scikit-learn makes this easy.

In [5]:
from sklearn.feature_extraction import DictVectorizer
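
To illustrate what binarizing means here, the following sketch runs `DictVectorizer` on two made-up records (the values are invented for the example). String-valued attributes become one 0/1 column per distinct value, while numeric attributes are passed through unchanged:

```python
from sklearn.feature_extraction import DictVectorizer

# Made-up records: "LOCATION" is a string, "SHOT_DIST" is already numeric.
records = [
    {"LOCATION": "H", "SHOT_DIST": 7.5},
    {"LOCATION": "A", "SHOT_DIST": 22.0},
]

vectorizer = DictVectorizer()
matrix = vectorizer.fit_transform(records).toarray()

print(vectorizer.feature_names_)  # column names, sorted alphabetically by default
print(matrix)                     # one row per record, one column per feature
```

Here `LOCATION` expands into the columns `LOCATION=A` and `LOCATION=H`, and `SHOT_DIST` stays a single numeric column.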

Note that `train_x` and `test_x` are each a list of lists.

We just need to convert each of them into a list of dictionaries (as in previous practica, where each record was a dictionary).

In [6]:
dicts_train_x = []
for x in train_x:
    d = {}
    for i, attr in enumerate(ATTRS):
        if i < len(ATTRS) - 1: # we removed class from train_x elems
            val = x[i]
            # save as floats the values for the already-numeric attributes from dataset, keep the rest as the strings they are
            if i not in [0, 1, 4, 10]:  # indices for "LOCATION", "W", "PERIOD", "PTS_TYPE" attributes
                val = float(val)
            d[attr] = val
    dicts_train_x.append(d)

Finally, the `fit_transform` method of the vectorizer binarizes the non-numeric attributes in the list of dictionaries and returns the feature matrix we need, with one row per record.

In [7]:
vectorizer_train = DictVectorizer()
vec_train_x = vectorizer_train.fit_transform(dicts_train_x).toarray()

We build the dictionaries for `test_x` in the same way and then vectorize them.

In [8]:
dicts_test_x = []
for x in test_x:
    d = {}
    for i, attr in enumerate(ATTRS):
        if i < len(ATTRS) - 1: # we removed class from test_x elems
            val = x[i]
            # save as floats the values for the already-numeric attributes from dataset, keep the rest as the strings they are
            if i not in [0, 1, 4, 10]:  # indices for "LOCATION", "W", "PERIOD", "PTS_TYPE" attributes
                val = float(val)
            d[attr] = val
    dicts_test_x.append(d)

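# Reuse the vectorizer fitted on the training data (transform, not fit_transform),
# so that the test matrix has the same columns, in the same order, as the training matrix.
vec_test_x = vectorizer_train.transform(dicts_test_x).toarray()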
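
A classifier expects the test matrix to have the same columns as the training matrix. The sketch below, on made-up records, shows how calling `transform` on an already fitted vectorizer differs from fitting a new vectorizer on the test data when the test set happens to lack one of the categories:

```python
from sklearn.feature_extraction import DictVectorizer

# Made-up records: the value "5" appears only in the training data.
train = [{"PERIOD": "1"}, {"PERIOD": "2"}, {"PERIOD": "5"}]
test = [{"PERIOD": "1"}, {"PERIOD": "2"}]

vectorizer = DictVectorizer()
vec_train = vectorizer.fit_transform(train).toarray()
vec_test = vectorizer.transform(test).toarray()              # reuses the training columns
refit_test = DictVectorizer().fit_transform(test).toarray()  # learns its own columns

print(vec_train.shape, vec_test.shape, refit_test.shape)
```

With `transform`, the test matrix keeps the three training columns. The refitted vectorizer produces only two, which no longer line up with what a model trained on `vec_train` expects.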

With `evaluate` defined above, we are ready to train and apply the models, as in Task 3 of Practicum 6. Here, however, we use the vectors just obtained as the input sets. For example, for the Naive Bayes classifier:

In [9]:
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

#clf = GaussianNB()
#clf.fit(vec_train_x, train_y)
#predictions = clf.predict(vec_test_x)
#evaluate(predictions, test_y)

In [11]:
classifiers = {"Decision Tree": DecisionTreeClassifier(),
               "Nearest Neighbors": KNeighborsClassifier(n_neighbors=3),
               "Naive Bayes (Gaussian)": GaussianNB(),
               "Random Forests": RandomForestClassifier(n_estimators=10, max_features=2)  # n_estimators: number of trees in the forest; max_features: number of features considered at each split
               }
for name, clf in classifiers.items():
    print(name)
    clf.fit(vec_train_x, train_y)
    predictions = clf.predict(vec_test_x)
    evaluate(predictions, test_y)

Decision Tree
	Accuracy:    0.5350537113318795
	Error rate:  0.4649462886681205
Nearest Neighbors
	Accuracy:    0.5359825539132542
	Error rate:  0.46401744608674583
Naive Bayes (Gaussian)
	Accuracy:    0.5679266618205314
	Error rate:  0.4320733381794685
Random Forests
	Accuracy:    0.5589613116872627
	Error rate:  0.44103868831273724
