# Module 9 - Programming Assignment

## Directions

1. Change the name of this file to be your JHED id as in `jsmith299.ipynb`. Be sure you use your JHED ID (it's made out of your name and not your student id which is just letters and numbers).
2. Make sure the notebook you submit is cleanly and fully executed. I do not grade unexecuted notebooks.
3. Submit your notebook back in Blackboard where you downloaded this file.

*Provide the output **exactly** as requested*

## Naive Bayes Classifier

For this assignment you will be implementing and evaluating a Naive Bayes Classifier with the same data from last week:

http://archive.ics.uci.edu/ml/datasets/Mushroom

(You should have downloaded it).

<div style="background: lemonchiffon; margin:20px; padding: 20px;">
    <strong>Important</strong>
    <p>
        No Pandas. The only acceptable libraries in this class are those contained in the `environment.yml`. No OOP, either. You can use Dicts, NamedTuples, etc. as your abstract data type (ADT) for the classifier.
    </p>
</div>


You'll first need to calculate all of the necessary probabilities using a `train` function. A flag will control whether or not you use "+1 Smoothing". You'll then need a `classify` function that takes your probabilities and a List of instances (possibly a list of 1) and returns a List of Tuples. Each Tuple has the best class in the first position and, in the second position, a Dict with a key for every possible class label and the associated *normalized* probability. For example, if we give the `classify` function a list of 2 observations, we would get the following back:

```
[("e", {"e": 0.98, "p": 0.02}), ("p", {"e": 0.34, "p": 0.66})]
```

When calculating the error rate of your classifier, you should pick the class label with the highest probability; you can write a simple function that takes the Dict and returns that class label.

As a reminder, the Naive Bayes Classifier generates the *unnormalized* probabilities from the numerator of Bayes Rule:

$$P(C|A) \propto P(A|C)P(C)$$

where C is the class and A are the attributes (data). Since the normalizer of Bayes Rule is the *sum* of all possible numerators and you have to calculate them all, the normalizer is just the sum of the probabilities.
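As a small illustration of the normalization step, the sketch below divides a pair of made-up numerators by their sum. The function name `normalize` and the numbers are hypothetical and are not part of the required interface.

```python
from typing import Dict


def normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Divide each unnormalized score by the sum of all scores."""
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


# Hypothetical numerators P(A|C)P(C) for the two mushroom classes.
unnormalized = {"e": 0.003, "p": 0.001}
normalized = normalize(unnormalized)
print(normalized)  # "e" is roughly 0.75 and "p" roughly 0.25
```

After normalization the values sum to 1, so they can be read as the posterior probability of each class.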

You will have the same basic functions as the last module's assignment and some of them can be reused or at least repurposed.

`train` takes training_data and returns a Naive Bayes Classifier (NBC) as a data structure. There are many options including namedtuples and just plain old nested dictionaries. **No OOP**.

```
def train(training_data, smoothing=True):
   # returns the Naive Bayes Classifier (NBC).
```

The `smoothing` value defaults to True. You should handle both cases.
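To see why the flag matters, here is a minimal sketch of one conditional probability estimated with and without smoothing. It assumes that "+1 Smoothing" adds 1 to both the count and the class total, which is the formula used by `calculate_probabilities` later in this notebook; other formulations of additive smoothing use a different denominator. The helper name `conditional_probability` and the counts are hypothetical.

```python
def conditional_probability(count: int, class_total: int, smoothing: bool = True) -> float:
    """Estimate P(value | class) from counts, optionally with +1 smoothing."""
    # Assumption: +1 is added to both numerator and denominator.
    k = 1 if smoothing else 0
    return (count + k) / (class_total + k)


# A feature value never seen with this class among 10 training rows.
print(conditional_probability(0, 10, smoothing=False))  # 0.0
print(conditional_probability(0, 10, smoothing=True))   # 1/11, about 0.0909
```

Without smoothing, a single unseen value drives the whole product for that class to zero; with smoothing it only makes the class less likely.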

`classify` takes an NBC produced from the function above and applies it to labeled data (like the test set) or unlabeled data (like some new data). (This is not the same `classify` as the pseudocode, which classifies only one instance at a time; it can call it though).

```
def classify(nbc, observations, labeled=True):
    # returns a list of tuples, the argmax and the raw data as per the pseudocode.
```

`evaluate` takes a data set with labels (like the training set or test set) and the classification result and calculates the classification error rate:

$$error\_rate=\frac{errors}{n}$$

Do not use anything else as the evaluation metric or the submission will be deemed incomplete, i.e., an "F". (Hint: accuracy rate is not the error rate!).
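The error rate formula can be sketched on a toy example as follows. The function name `error_rate` and the label lists are made up for illustration.

```python
from typing import List


def error_rate(actual: List[str], predicted: List[str]) -> float:
    """Fraction of positions where the predicted label differs from the actual one."""
    errors = sum(1 for a, p in zip(actual, predicted) if a != p)
    return errors / len(actual)


actual = ["e", "p", "e", "p"]
predicted = ["e", "e", "e", "p"]
rate = error_rate(actual, predicted)
print(f"{rate * 100:.2f}%")  # 25.00%
```

One of the four predictions is wrong, so the error rate is 1/4; the accuracy would be 3/4, which is not what is being asked for.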

`cross_validate` takes the data and uses 10 fold cross validation (from Module 3!) to `train`, `classify`, and `evaluate`. **Remember to shuffle your data before you create your folds**. I leave the exact signature of `cross_validate` to you but you should write it so that you can use it with *any* `classify` function of the same form (using higher order functions and partial application). If you did so last time, you can reuse it for this assignment.
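One possible shape for such a function is sketched below. It is only an illustration under stated assumptions: the name `cross_validate_generic`, its parameter order, and the toy majority-class stand-ins are all hypothetical, and the real `train`, `classify`, and `evaluate` would be passed in their place. `functools.partial` fixes the `smoothing` argument before the training function is handed over.

```python
import random
from functools import partial
from typing import Callable, List


def make_folds(rows: List[List[str]], k: int) -> List[List[List[str]]]:
    """Shuffle a copy of the rows and deal them into k folds."""
    shuffled = rows[:]
    random.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def cross_validate_generic(rows, k, train_fn: Callable, classify_fn: Callable,
                           evaluate_fn: Callable) -> List[float]:
    """Return one evaluation metric per fold for any train/classify pair."""
    folds = make_folds(rows, k)
    metrics = []
    for i, test_fold in enumerate(folds):
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        model = train_fn(training)
        predictions = classify_fn(model, test_fold)
        metrics.append(evaluate_fn(test_fold, predictions))
    return metrics


# Toy stand-ins so the sketch runs on its own: a majority-class "model".
def toy_train(rows, smoothing=True):
    labels = [row[0] for row in rows]
    return max(set(labels), key=labels.count)


def toy_classify(model, rows):
    return [(model, {}) for _ in rows]


def toy_evaluate(rows, predictions):
    wrong = sum(1 for row, (label, _) in zip(rows, predictions) if row[0] != label)
    return wrong / len(rows)


rows = [["e", "x"]] * 12 + [["p", "y"]] * 8
metrics = cross_validate_generic(
    rows, 5, partial(toy_train, smoothing=False), toy_classify, toy_evaluate
)
print(metrics)
```

Because the folds are shuffled, the printed metrics differ from run to run.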

Following Module 3's discussion, `cross_validate` should print out the fold number and the evaluation metric (error rate) for each fold and then the average value (and the variance). What you are looking for here is a consistent evaluation metric across the folds. You should print the error rates in terms of percents (i.e., multiply the error rate by 100 and add "%" to the end).
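A minimal sketch of that reporting step is shown below, assuming the per-fold error rates have already been collected in a list. The names `summarize` and `report` and the three sample rates are hypothetical; `statistics.variance` computes the sample variance.

```python
import statistics
from typing import List, Tuple


def summarize(error_rates: List[float]) -> Tuple[float, float]:
    """Return the mean and sample variance of the per-fold error rates."""
    return statistics.mean(error_rates), statistics.variance(error_rates)


def report(error_rates: List[float]) -> None:
    """Print each fold's error rate as a percent, then the mean and variance."""
    for fold_number, rate in enumerate(error_rates, start=1):
        print(f"Fold {fold_number}: {rate * 100:.2f}%")
    mean, variance = summarize(error_rates)
    print(f"Average: {mean * 100:.2f}%")
    print(f"Variance: {variance:.6f}")


sample_rates = [0.05, 0.04, 0.06]  # made-up values for illustration
report(sample_rates)
```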

To summarize...

Apply the Naive Bayes Classifier algorithm to the Mushroom data set using 10 fold cross validation and the error rate as the evaluation metric. You will do this *twice*. Once with smoothing=True and once with smoothing=False. You should follow up with a brief explanation for the similarities or differences in the results.

In [1]:
from copy import deepcopy
import random
import statistics
import scipy
import math
import pprint
from typing import List, Dict, Tuple, Callable
import collections

In [2]:
# COPIED FROM MODULE 3
def parse_data(file_name: str) -> List[List]:
    data = []
    # use a context manager so the file is closed after reading
    with open(file_name, "r") as file:
        for line in file:
            datum = [str(value) for value in line.rstrip().split(",")]
            data.append(datum)
    random.shuffle(data)
    return data

In [3]:
data = parse_data("agaricus-lepiota-1.data")
print(data[0])
print(len(data))

['e', 'x', 'y', 'n', 't', 'n', 'f', 'c', 'b', 'p', 't', 'b', 's', 's', 'p', 'p', 'p', 'w', 'o', 'p', 'n', 'y', 'd']
8124


In [4]:
# COPIED FROM MODULE 3
def create_folds(xs: List, n: int) -> List[List[List]]:
    k, m = divmod(len(xs), n)
    # be careful of generators...
    return list(xs[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))

In [5]:
folds = create_folds(data, 10)
print(len(folds))

10


In [6]:
# COPIED FROM MODULE 3
def create_train_test(folds: List[List[List]], index: int) -> Tuple[List[List], List[List]]:
    training = []
    test = []
    for i, fold in enumerate(folds):
        if i == index:
            test = fold
        else:
            training = training + fold
    return training, test

In [7]:
training_example, test = create_train_test(folds, 0)
print(len(training_example))
print(len(test))

7311
813


In [8]:
# COPIED FROM MODULE 8
# checking percentage of records that don't match
def evaluate(data_set, classification_data):
    y = [x[0] for x in data_set]
    yh = [x[0] for x in classification_data]
    return (sum([1 for i in range(len(y)) if not y[i] == yh[i]]))/len(y)

## **Above is directly from Modules 3 and 8**

## **Below is new**

<a id="calculate_probabilities"></a>
## calculate_probabilities

Computes the prior probability of each class label and the conditional probability of each feature value given each class label for a given data_set.

Variables
* **data_set** List[List]: list of data rows, each with the class label at index 0
* **smoothing** bool: this defaults to True and decides whether +1 smoothing is applied when calculating the conditional probabilities

**returns** dictionary of probabilities
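The returned dictionary is flat: class priors are stored under the bare label, and each conditional probability is stored under a key built from the label, the feature index, and the feature value. The sketch below shows that key layout; the helper name `make_key` and the probability values are hypothetical.

```python
def make_key(label: str, feature_index: int, feature_value: str) -> str:
    """Build the flat-dictionary key for P(feature_value | label)."""
    return label + str(feature_index) + feature_value


# Hypothetical probabilities, for illustration only.
probabilities = {
    "e": 0.5,                     # prior P(e)
    "p": 0.5,                     # prior P(p)
    make_key("e", 1, "x"): 0.4,   # P(feature 1 = x | e)
    make_key("p", 1, "x"): 0.45,  # P(feature 1 = x | p)
}
print(sorted(probabilities))  # ['e', 'e1x', 'p', 'p1x']
```

This layout is unambiguous here because every feature value in the Mushroom data is a single character.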

In [9]:
def calculate_probabilities(data_set, smoothing = True):
    probabilities = {}
    feature_indexes = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22]
    smoother = 0
    if smoothing:
        smoother = 1
    label_index = 0
    labels = [x[label_index] for x in data_set]
    unique_labels = list(set(labels))
    for label in unique_labels:
        count = len([x for x in labels if x == label])
        probabilities[label] = count / len(labels)
    for feature_index in feature_indexes:
        feature_values = [x[feature_index] for x in data_set]
        unique_feature_values = list(set(feature_values))
        for feature_value in unique_feature_values:
            for label in unique_labels:
                working_data = [x for x in data_set if x[0] == label]
                count = len([x for x in working_data if x[feature_index] == feature_value])
                key = label + str(feature_index) + feature_value
                probabilities[key] = (count + smoother) / (len(working_data) + smoother)
    return probabilities

In [10]:
# i printed out all probabilities and this looks very correct

# commenting it out for now though cause it takes up a lot of room
#calculate_probabilities(data)

<a id="get_probability"></a>
## get_probability

Calculates the unnormalized probability of the data row for the given class label, assuming all features are independent given the class.

Variables
* **row** List: data row representing a mushroom
* **label** string: this is the class label we would like to look for
* **probabilities** dictionary: this is the dictionary of probabilities

**returns** unnormalized probability as a float between 0.0 and 1.0

In [11]:
def get_probability(row, label, probabilities):
    probability = probabilities[label]
    for x in range(1, len(row)):
        key = label + str(x) + row[x]
        probability *= probabilities[key]
    return probability

In [12]:
test_probabilities = {"e": .5,  "p": .5, "e1a": .5, "e1b": .25, "e1c": .25, "p1b": .5, "e2a": .75, "p2a": .1}

assert get_probability(["e", "b", "a"], "e", test_probabilities) == 0.09375
assert get_probability(["p", "b", "a"], "p", test_probabilities) == 0.025

<a id="normalize_results"></a>
## normalize_results

Normalizes the unnormalized class probabilities so that they sum to 1.

Variables
* **results** dictionary: maps each class label to its unnormalized probability

**returns** dictionary mapping each class label to its normalized probability, rounded to four decimal places

In [13]:
def normalize_results(results):
    total = sum(results.values())
    if total == 0:
        # Without smoothing every class can score 0; fall back to equal
        # probabilities instead of dividing by zero.
        return {label: round(1 / len(results), 4) for label in results}
    return {label: round(value / total, 4) for label, value in results.items()}

In [14]:
test_results_for_normalize1 = {"e": .2, "p": .6}
test_results_for_normalize2 = {"e": .1, "p": 0.05}

assert normalize_results(test_results_for_normalize1) == {'e': 0.25, 'p': 0.75}
assert normalize_results(test_results_for_normalize2) == {'e': 0.6667, 'p': 0.3333}

<a id="find_best"></a>
## find_best

Finds the class label in the given dictionary that has the highest probability within it.

Variables
* **results** dictionary: maps each class label to its probability

**returns** string (class label)

In [15]:
def find_best(results):
    # Return the label whose probability is largest (first one wins on ties).
    return max(results, key=results.get)

In [16]:
test_results_for_find_best1 = {'e': 0.25, 'p': 0.75}
test_results_for_find_best2 = {'e': 0.9, 'p': 0.1}

assert find_best(test_results_for_find_best1) == "p"
assert find_best(test_results_for_find_best2) == "e"

<a id="nbc"></a>
## nbc

Finds the best class label for the provided row of data.

Variables
* **probabilities** dictionary: this is the dictionary of probabilities
* **row** List: data row representing a mushroom

**returns** tuple of the best class label (string) and the dictionary of normalized probabilities for every class label

In [17]:
def nbc(probabilities, row):
    results = {}
    for label in ['e', 'p']:
        results[label] = get_probability(row, label, probabilities)
    results = normalize_results(results)
    best = find_best(results)
    return (best, results)

In [18]:
def train(training_data, smoothing=True):
    return calculate_probabilities(training_data, smoothing)

In [19]:
def classify(probabilities, instances):
    results = []
    for row in instances:
        results.append(nbc(probabilities, row))
    return results

In [20]:
def cross_validate(evaluation_folds, smoothing=True):
    errors = []
    # iterate over the folds passed in, not the global `folds`
    for i in range(len(evaluation_folds)):
        train_set, test = create_train_test(evaluation_folds, i)
        probabilities = train(train_set, smoothing)
        # classify accepts a list of rows, so the whole test fold goes in at once
        test_results = classify(probabilities, test)
        error = evaluate(test, test_results)
        e_rounded = round(error*100,6)
        print('\r\nFold ' + str(i) + ', Error Rate: ' + str(e_rounded) + '%')
        errors.append(error)
    avg_error = sum(errors)/len(errors)
    print('\r\nAverage error across folds: ' + str(round(avg_error*100, 6)) + '%')
    print('Errors Standard Deviation across folds: ' + str(round(statistics.stdev(errors, avg_error), 6)))

In [21]:
cross_validate(folds, smoothing=True)


Fold 0, Error Rate: 5.289053%

Fold 1, Error Rate: 3.690037%

Fold 2, Error Rate: 5.166052%

Fold 3, Error Rate: 3.690037%

Fold 4, Error Rate: 4.064039%

Fold 5, Error Rate: 5.541872%

Fold 6, Error Rate: 4.679803%

Fold 7, Error Rate: 4.433498%

Fold 8, Error Rate: 3.940887%

Fold 9, Error Rate: 5.418719%

Average error across folds: 4.5914%
Errors Standard Deviation across folds: 0.007275


## Before You Submit...

1. Did you provide output exactly as requested?
2. Did you re-execute the entire notebook? ("Restart Kernel and Run All Cells...")
3. If you did not complete the assignment or had difficulty please explain what gave you the most difficulty in the Markdown cell below.
4. Did you change the name of the file to `jhed_id.ipynb`?

Do not submit any other files.