# Module 8 - Programming Assignment

## Directions

1. Change the name of this file to be your JHED id as in `jsmith299.ipynb`. Because sure you use your JHED ID (it's made out of your name and not your student id which is just letters and numbers).
2. Make sure the notebook you submit is cleanly and fully executed. I do not grade unexecuted notebooks.
3. Submit your notebook back in Blackboard where you downloaded this file.

*Provide the output **exactly** as requested*

In [1]:
from copy import deepcopy
import random
import statistics
import scipy
import math
import pprint
from typing import List, Dict, Tuple, Callable
import collections

## Decision Trees

For this assignment you will be implementing and evaluating a Decision Tree using the ID3 Algorithm (**no** pruning or normalized information gain). Use the provided pseudocode. The data is located at (copy link):

http://archive.ics.uci.edu/ml/datasets/Mushroom

**Just in case** the UCI repository is down, which happens from time to time, I have included the data and name files on Blackboard.

<div style="background: lemonchiffon; margin:20px; padding: 20px;">
    <strong>Important</strong>
    <p>
        No Pandas. The only acceptable libraries in this class are those contained in the `environment.yml`. No OOP, either. You can used Dicts, NamedTuples, etc. as your abstract data type (ADT) for the the tree and nodes.
    </p>
</div>

One of the things we did not talk about in the lectures was how to deal with missing values. There are two aspects of the problem here. What do we do with missing values in the training data? What do we do with missing values when doing classifcation?

For the first problem, C4.5 handled missing values in an interesting way. Suppose we have identifed some attribute *B* with values {b1, b2, b3} as the best current attribute. Furthermore, assume there are 5 observations with B=?, that is, we don't know the attribute value. In C4.5, those 5 observations would be added to *all* of the subsets created by B=b1, B=b2, B=b3 with decreased weights. Note that the observations with missing values are not part of the information gain calculation.

This doesn't quite help us if we have missing values when we use the model. What happens if we have missing values during classification? One approach is to prepare for this advance. When you train the tree, you need to add an implicit attribute value "?" at every split. For example, if the attribute was "size" then the domain would be ["small", "medium", "large", "?"]. The "?" value gets all the data (because ? is now a wildcard). However, there is an issue with this approach. "?" becomes the worst possible attribut value because it has no classification value. What to do? There are several options:

1. Never recurse on "?" if you do not also recurse on at least one *real* attribute value.
2. Limit the depth of the tree.

There are good reasons, in general, to limit the depth of a decision tree because they tend to overfit.
Otherwise, the algorithm *will* exhaust all the attributes trying to fulfill one of the base cases.

You must implement the following functions:

`train` takes training_data and returns the Decision Tree as a data structure. There are many options including namedtuples and just plain old nested dictionaries. **No OOP**.

```
def train(training_data, depth_limit=None):
   # returns the Decision Tree.
```

The `depth_limit` value defaults to None. (What technique would we use to determine the best parameter value for `depth_limit` hint: Module 3!)

`classify` takes a tree produced from the function above and applies it to labeled data (like the test set) or unlabeled data (like some new data).

```
def classify(tree, observations, labeled=True):
    # returns a list of classifications
```

`evaluate` takes a data set with labels (like the training set or test set) and the classification result and calculates the classification error rate:

$$error\_rate=\frac{errors}{n}$$

Do not use anything else as evaluation metric or the submission will be deemed incomplete, ie, an "F". (Hint: accuracy rate is not the error rate!).

`cross_validate` takes the data and uses 10 fold cross validation (from Module 3!) to `train`, `classify`, and `evaluate`. **Remember to shuffle your data before you create your folds**. I leave the exact signature of `cross_validate` to you but you should write it so that you can use it with *any* `classify` function of the same form (using higher order functions and partial application).

Following Module 3's discussion, `cross_validate` should print out the fold number and the evaluation metric (error rate) for each fold and then the average value (and the variance). What you are looking for here is a consistent evaluation metric cross the folds. You should print the error rates in terms of percents (ie, multiply the error rate by 100 and add "%" to the end).

```
def pretty_print_tree(tree):
    # pretty prints the tree
```

This should be a text representation of a decision tree trained on the entire data set (no train/test).

To summarize...

Apply the Decision Tree algorithm to the Mushroom data set using 10 fold cross validation and the error rate as the evaluation metric. When you are done, apply the Decision Tree algorithm to the entire data set and print out the resulting tree.

**Note** Because this assignment has a natural recursive implementation, you should consider using `deepcopy` at the appropriate places.

-----

In [2]:
# COPIED FROM MODULE 3
def parse_data(file_name: str) -> List[List]:
    data = []
    file = open(file_name, "r")
    for line in file:
        datum = [str(value) for value in line.rstrip().split(",")]
        data.append(datum)
    random.shuffle(data)
    return data

In [3]:
data = parse_data("agaricus-lepiota.data")
print(data[0])
print(len(data))

['p', 'x', 'y', 'y', 'f', 'f', 'f', 'c', 'b', 'g', 'e', 'b', 'k', 'k', 'n', 'p', 'p', 'w', 'o', 'l', 'h', 'y', 'p']
8124


In [4]:
# COPIED FROM MODULE 3
def create_folds(xs: List, n: int) -> List[List[List]]:
    k, m = divmod(len(xs), n)
    # be careful of generators...
    return list(xs[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))

In [5]:
folds = create_folds(data, 10)
print(len(folds))

10


In [6]:
# COPIED FROM MODULE 3
def create_train_test(folds: List[List[List]], index: int) -> Tuple[List[List], List[List]]:
    training = []
    test = []
    for i, fold in enumerate(folds):
        if i == index:
            test = fold
        else:
            training = training + fold
    return training, test

In [7]:
training_example, test = create_train_test(folds, 0)
print(len(training_example))
print(len(test))

7311
813


## ***Above is directly from Module 3 and 8**

## ***Below is new**

<a id="evaluate"></a>
## evaluate

This calculates the error rate of the classification.

Variables
* **data_set** List[List]: list of data_set list
* **classification_data** List[List]: list of classification data list

**returns** float: the error rate of the classification data from the true data

In [8]:
def evaluate(data_set, classification_data):
    y = [x[0] for x in data_set]
    yh = [x[0] for x in classification_data]
    return (sum([1 for i in range(len(y)) if not y[i] == yh[i]]))/len(y)

In [9]:
evaluate_test1a = ['e','e','e']
evaluate_test1b = ['e','e','e']
evaluate_test2a = ['e','e','e']
evaluate_test2b = ['e','e','p']

assert evaluate(evaluate_test1a, evaluate_test1b) == 0
assert evaluate(evaluate_test2a, evaluate_test2b) == 1/3

<a id="is_homogeneous"></a>
## is_homogeneous

This returns a bool for if every label for each record in the data_set is the same.

Variables
* **data_set** List[List]: list of data_set list

**returns** bool: true/false for if the data set is all assigned the same label (assumed to be in index 0)

In [10]:
def is_homogeneous(data_set):
    labels = [x[0] for x in data_set]
    unique_labels = list(set(labels))
    return len(unique_labels) < 2

In [11]:
ih_test1 = [[1], [1], [1]]
ih_test2 = [[1], [2], [1]]

assert is_homogeneous(ih_test1) == True
assert is_homogeneous(ih_test2) == False

<a id="get_majority_label"></a>
## get_majority_label

This returns the value of the most common label in the dataset.

Variables
* **data_set** List[List]: list of data_set list

**returns** object: will return the most common element, when multiple share highest frequency it will pick the first found as a tie breaker.

In [12]:
def get_majority_label(data_set):
    labels = [x[0] for x in data_set]
    counter = collections.Counter(labels)
    return counter.most_common(1)[0][0]

In [13]:
gm_test1 = [[1], [1], [1]]
gm_test2 = [[1], [2], [2]]
gm_test3 = [[3], [2], [1]]
gm_test4 = [[1], [2], [3]]

assert get_majority_label(gm_test1) == 1
assert get_majority_label(gm_test2) == 2
assert get_majority_label(gm_test3) == 3
assert get_majority_label(gm_test4) == 1

<a id="pick_best_attribute"></a>
## pick_best_attribute

This goes through the entire data set using the entropy equation to find what element would lower the entropy, then selects that element.

Variables
* **data_set** List[List]: list of data_set list
* **attributes** List: list of indexes that represent the attributes left that can be considered

**returns** int: will return the attribute (dataset index) that will cause the entropy to lower the most.

In [14]:
def pick_best_attribute(data_set, attributes):
    lowest_entropy = 9999
    entropy_attribute= -1
    for attribute in attributes:
        domain = list(set([x[attribute] for x in data_set]))
        domain_len = len(domain)
        entropy = 0
        for value in domain:
            subset_len = len([x for x in data_set if x[attribute] == value])
            entropy -= (subset_len/domain_len)*math.log((subset_len/domain_len), 10)
        if entropy < lowest_entropy:
            lowest_entropy = entropy
            entropy_attribute = attribute
    return entropy_attribute

<a id="get_domain_for_attribute"></a>
## get_domain_for_attribute

Gets the distinct elements that are included in the domain for a given attribute

Variables
* **data_set** List[List]: list of data_set list
* **attribute** int: index/attribute to find values of

**returns** List: list of distinct values for the given attribute

In [15]:
def get_domain_for_attribute(data_set, attribute):
    return list(set([x[attribute] for x in data_set]))

<a id="id3"></a>
## id3

Recursively create the decision learning tree for a given data_set and attributes predefined for the data_set.

Variables
* **data_set** List[List]: list of data_set list
* **attributes** List: list of indexes that represent the attributes left that can be considered
* **default** str: default string value of the label to use in case there is no decision to make

**returns** object: this is a collection which represents the entire decision learning tree.

In [16]:
def id3(data_set, attributes, default): 
    if len(data_set) == 0:
        return default
    if is_homogeneous(data_set):
        return data_set[0][0]
    if attributes is None or len(attributes) == 0:
        return get_majority_label(data_set)
    best_attribute = pick_best_attribute(data_set, attributes)
    default_label = get_majority_label(data_set)
    attribute_domain = get_domain_for_attribute(data_set, best_attribute)
    # are there non-? values in the attribute domain?
    recurse_on_questions = len([x for x in attribute_domain if not x == '?']) > 0
    children = {}
    for value in attribute_domain:
        subset = [x for x in data_set if x[best_attribute] == value]
        modified_attributes = deepcopy(attributes)
        modified_attributes.remove(best_attribute)
        if value == '?' and not recurse_on_questions:
            children[value] = default_label
            continue
        child = id3(subset, modified_attributes, default_label)
        children[value] = child
    return {best_attribute: children}

In [17]:
def classify(tree, observations, labeled=True):
    current_value = next(iter(tree))
    current_tree = tree[current_value]
    found = False
    while not found:
        next_value = observations[current_value]
        if not next_value in current_tree:
            return "p" # the default: we don't want a false positive for being edible or you could die...
        current_tree = current_tree[next_value]
        if isinstance(current_tree, str):
            return current_tree
        current_value = next(iter(current_tree))
    return "p"

In [18]:
def train(training_data, depth_limit=None):
    return id3(training_data, [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22], 'p')

In [19]:
def cross_validate(evaluation_folds, classify_func, depth_limit=None):
    errors = []
    for i in range(len(folds)):
        train_set, test = create_train_test(evaluation_folds, i)
        tree = train(train_set)
        test_results = []
        for test_row in test:
            result = classify(tree, test_row)
            data_copy = deepcopy(test_row)
            data_copy[0] = result
            test_results.append(data_copy)
        error = evaluate(test, test_results)
        print('\r\nFold ' + str(i) + ', Error Rate: ' + str(error*100) + '%')
        errors.append(error)
    avg_error = sum(errors)/len(errors)
    print('\r\nAverage error across folds: ' + str(round(avg_error*100, 6)) + '%')
    print('Errors Standard Deviation across folds: ' + str(round(statistics.stdev(errors, avg_error), 6)))

In [20]:
# typically the result does not show very well cause the length causes a word wrap, 
#   it's all there though and if you copy/paste to something like notepad++ it looks good
def pretty_print_tree(tree):
    pprint.pprint(tree)

In [21]:
cross_validate(folds, classify)
print("\r\n")
result = id3(data, [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22], 'p')
pretty_print_tree(result)


Fold 0, Error Rate: 51.53751537515375%

Fold 1, Error Rate: 52.521525215252154%

Fold 2, Error Rate: 49.815498154981555%

Fold 3, Error Rate: 54.48954489544895%

Fold 4, Error Rate: 52.0935960591133%

Fold 5, Error Rate: 54.80295566502463%

Fold 6, Error Rate: 52.463054187192114%

Fold 7, Error Rate: 50.61576354679803%

Fold 8, Error Rate: 50.24630541871922%

Fold 9, Error Rate: 49.38423645320197%

Average error across folds: 51.796999%
Errors Standard Deviation across folds: 0.018546


{16: {'p': {6: {'a': {4: {'f': {7: {'c': {8: {'b': {10: {'e': {2: {'s': 'e',
                                                                   'y': 'p'}}}}}}}}}},
                'f': {17: {'w': {7: {'c': {8: {'b': {4: {'f': {10: {'e': {12: {'k': 'p',
                                                                               'y': 'e'}}}},
                                                         't': {12: {'f': 'p',
                                                                    's': {19: {'e':

## Before You Submit...

1. Did you provide output exactly as requested?
2. Did you re-execute the entire notebook? ("Restart Kernel and Rull All Cells...")
3. If you did not complete the assignment or had difficulty please explain what gave you the most difficulty in the Markdown cell below.
4. Did you change the name of the file to `jhed_id.ipynb`?

Do not submit any other files.