# Module 9 - Programming Assignment

## Directions

1. Change the name of this file to be your JHED ID as in `jsmith299.ipynb`. Be sure to use your JHED ID (it is made out of your name, unlike your student ID, which is just letters and numbers).
2. Make sure the notebook you submit is cleanly and fully executed. I do not grade unexecuted notebooks.
3. Submit your notebook back in Blackboard where you downloaded this file.

*Provide the output **exactly** as requested*

## Naive Bayes Classifier

For this assignment you will be implementing and evaluating a Naive Bayes Classifier with the same data from last week:

http://archive.ics.uci.edu/ml/datasets/Mushroom

(You should have downloaded it).

<div style="background: lemonchiffon; margin:20px; padding: 20px;">
    <strong>Important</strong>
    <p>
        No Pandas. The only acceptable libraries in this class are those contained in the `environment.yml`. No OOP, either. You can use Dicts, NamedTuples, Data Classes, etc. as your abstract data type (ADT).
    </p>
</div>


You'll first need to calculate all of the necessary probabilities using a `train` function. A flag will control whether or not you use "+1 Smoothing". You'll then need a `classify` function that takes your probabilities and a List of instances (possibly a list of one) and returns a List of Tuples. Each Tuple has the best class in the first position and, in the second, a Dict with a key for every possible class label and the associated *normalized* probability. For example, if we give the `classify` function a list of 2 observations, we would get the following back:

```
[("e", {"e": 0.98, "p": 0.02}), ("p", {"e": 0.34, "p": 0.66})]
```

When calculating the error rate of your classifier, you should pick the class label with the highest probability; you can write a simple function that takes the Dict and returns that class label.
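
As an illustration of the helper described above, here is a minimal sketch. The name `best_label` is hypothetical and not part of the assignment; it only assumes the Dict maps each class label to its normalized probability.

```python
def best_label(probabilities: dict) -> str:
    """Return the class label whose probability is highest."""
    return max(probabilities, key=probabilities.get)


# The example result from above: one (best class, distribution) tuple per observation.
results = [("e", {"e": 0.98, "p": 0.02}), ("p", {"e": 0.34, "p": 0.66})]
labels = [best_label(distribution) for _, distribution in results]
print(labels)  # ['e', 'p']
```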

As a reminder, the Naive Bayes Classifier generates the *unnormalized* probabilities from the numerator of Bayes Rule:

$$P(C|A) \propto P(A|C)P(C)$$

where C is the class and A are the attributes (data). Since the normalizer of Bayes Rule is the *sum* of all possible numerators and you have to calculate them all, the normalizer is just the sum of the probabilities.
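
The normalization step can be sketched as follows. The function name `normalize` and the numbers are illustrative assumptions; the sketch also assumes at least one unnormalized score is greater than zero.

```python
def normalize(scores: dict) -> dict:
    """Divide each unnormalized score P(A|C)P(C) by the sum of all of them."""
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}


# Hypothetical numerators of Bayes Rule for two classes.
unnormalized = {"e": 0.375, "p": 0.125}
normalized = normalize(unnormalized)
print(normalized)  # {'e': 0.75, 'p': 0.25}
```

After normalization the values sum to 1, so they can be read as the posterior probability of each class.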

You will have the same basic functions as the last module's assignment and some of them can be reused or at least repurposed.

`train` takes training_data and returns a Naive Bayes Classifier (NBC) as a data structure. There are many options including namedtuples and just plain old nested dictionaries. **No OOP**.

```
def train(training_data, smoothing=True):
   # returns the "classifier" (however you decided to represent the probability tables).
```

The `smoothing` value defaults to True. You should handle both cases.
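
To make the two cases concrete, here is a small sketch that assumes the common Laplace form of +1 smoothing: add 1 to the count and add the number of distinct values of the feature to the denominator. Check this against the course's definition before relying on it. The function name and the counts are hypothetical.

```python
def conditional_probability(count: int, class_count: int, n_values: int,
                            smoothing: bool = True) -> float:
    """Estimate P(feature = value | class) from counts.

    count       -- times the value appears with the class
    class_count -- number of training instances of the class
    n_values    -- number of distinct values the feature can take (assumed known)
    """
    if smoothing:
        return (count + 1) / (class_count + n_values)
    return count / class_count


# A value never seen with a class of 10 instances, for a feature with 4 values.
unsmoothed = conditional_probability(0, 10, 4, smoothing=False)
smoothed = conditional_probability(0, 10, 4, smoothing=True)
print(unsmoothed)  # 0.0
print(smoothed)    # 1/14, roughly 0.0714
```

Without smoothing, a single zero wipes out the whole product for that class; with smoothing, every value keeps a small nonzero probability and the estimates for one feature still sum to 1.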

`classify` takes a NBC produced from the function above and applies it to labeled data (like the test set) or unlabeled data (like some new data). (This is not the same `classify` as the pseudocode which classifies only one instance at a time; it can call it though).

```
def classify(nbc, observations, labeled=True):
    # returns a list of tuples, the argmax and the raw data as per the pseudocode.
```

`evaluate` takes a data set with labels (like the training set or test set) and the classification result and calculates the classification error rate:

$$error\_rate=\frac{errors}{n}$$

Do not use anything else as the evaluation metric or the submission will be deemed incomplete, i.e., an "F". (Hint: accuracy rate is not the error rate!)

`cross_validate` takes the data and uses 5x2 cross validation (from Module 2!) to `train`, `classify`, and `evaluate`. **Remember to shuffle your data before you create your folds**. I leave the exact signature of `cross_validate` to you but you should write it so that you can use it with *any* `classify` function of the same form (using higher order functions and partial application). If you did so last time, you can reuse it for this assignment.

Following Module 2's materials, `cross_validate` should print out the fold number and the evaluation metric (error rate) for each fold and then the average value (and the variance). What you are looking for here is a consistent evaluation metric across the folds. You should print the error rates in terms of percents (i.e., multiply the error rate by 100 and add "%" to the end).
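
One possible shape for such a function is sketched below. It is not the required signature: the parameter names, the `seed` argument, and the majority-class stand-in learner are all assumptions made so that the example runs on its own. It passes `train`, `classify`, and `evaluate` in as functions and uses `functools.partial` to fix a keyword argument.

```python
import random
import statistics
from functools import partial


def cross_validate(data, train_fn, classify_fn, evaluate_fn, repetitions=5, seed=None):
    """Run `repetitions` x 2 cross validation and return the list of error rates."""
    rng = random.Random(seed)
    rows = list(data)  # copy so the caller's list is not reordered
    error_rates = []
    for _ in range(repetitions):
        rng.shuffle(rows)  # reshuffle before every split into halves
        middle = len(rows) // 2
        halves = [rows[:middle], rows[middle:]]
        for split in range(2):
            test_set, train_set = halves[split], halves[1 - split]
            model = train_fn(train_set)
            predictions = classify_fn(model, test_set)
            rate = evaluate_fn(test_set, predictions)
            error_rates.append(rate)
            print(f"Fold {len(error_rates)}: {rate * 100:.2f}%")
    mean = statistics.mean(error_rates)
    variance = statistics.variance(error_rates)  # sample variance of the fractions
    print(f"Mean: {mean * 100:.2f}%  Variance: {variance:.6f}")
    return error_rates


# Hypothetical stand-in learner: always predicts the most common training label.
def train_majority(rows, tie_breaker="e"):
    labels = [row[-1] for row in rows]
    return max(sorted(set(labels)),
               key=lambda label: (labels.count(label), label == tie_breaker))


def classify_majority(model, rows):
    return [model for _ in rows]


def error_rate(rows, predictions):
    errors = sum(1 for row, predicted in zip(rows, predictions) if row[-1] != predicted)
    return errors / len(rows)


toy_data = [["x", "e"]] * 6 + [["y", "p"]] * 4
# partial fixes a keyword argument so the learner keeps the train_fn(rows) shape.
train_fn = partial(train_majority, tie_breaker="p")
rates = cross_validate(toy_data, train_fn, classify_majority, error_rate, seed=0)
```

With a real classifier, the same idea would apply to the smoothing flag, for example `partial(train, smoothing=True)`, so the cross validation code never needs to know which options the learner takes.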

To summarize...

Apply the Naive Bayes Classifier algorithm to the Mushroom data set using 5x2 cross validation and the error rate as the evaluation metric. You will do this *twice*. Once with smoothing=True and once with smoothing=False. You should follow up with a brief hypothesis/explanation for the similarities or differences in the results. You may also compare the results to the Decision Tree and why you think they're different (if they are).

### Provided Functions

You do not need to document these.

You can use this function to read the data file.

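In [1]:
import random

def parse_data(file_name: str) -> list[list]:
    data = []
    # Use a context manager so the file is closed after reading.
    with open(file_name, "r") as file:
        for line in file:
            datum = line.rstrip().split(",")
            data.append(datum)
    random.shuffle(data)  # shuffle once so rows are not in file order
    return data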

You can use this function to create 10 folds for 5x2 cross validation.

In [2]:
def create_folds(xs: list, n: int) -> list[list[list]]:
    k, m = divmod(len(xs), n)
    # be careful of generators...
    return list(xs[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))
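
A short usage sketch shows how the slicing distributes leftover items. The function is repeated here only so the example runs on its own; the sample list and the `(train, test)` pairing are illustrative assumptions.

```python
def create_folds(xs: list, n: int) -> list[list]:
    # Same slicing as the provided function: the first `m` folds get one extra item.
    k, m = divmod(len(xs), n)
    return [xs[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]


items = list(range(10))
folds = create_folds(items, 3)
fold_sizes = [len(fold) for fold in folds]
print(fold_sizes)  # [4, 3, 3]

# For one repetition of 5x2 cross validation, split into two folds and
# let each fold take one turn as the test set.
first, second = create_folds(items, 2)
pairs = [(first, second), (second, first)]  # (train, test) pairs
```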

Put your code after this line:

-----

### train
`train` takes in training data and returns a Naive Bayes classifier, represented by a dictionary. Optionally, +1 smoothing can be toggled to help account for zero probabilities in the model. **Used by**: [cross_validate](#cross_validate)

- **training_data** (list[list]): The training data where the last column is the class label
- **smoothing** (bool): Whether to apply +1 smoothing

**Returns** (dict): A dictionary representing the Naive Bayes Classifier

In [9]:
from collections import defaultdict

def train(training_data, smoothing=True):  # smoothing defaults to True, as the directions require
    class_counts = defaultdict(int)
    feature_counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    total_instances = len(training_data)

    for instance in training_data:
        class_label = instance[-1]
        class_counts[class_label] += 1
        for feature_index, feature_value in enumerate(instance[:-1]):
            feature_counts[feature_index][feature_value][class_label] += 1

    classifier = {
        "class_probs": {},
        "feature_probs": defaultdict(lambda: defaultdict(dict)),
        "classes": set(class_counts.keys())
    }

    for class_label, count in class_counts.items():
        classifier["class_probs"][class_label] = count / total_instances

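    for feature_index, feature_values in feature_counts.items():
        for feature_value, class_dict in feature_values.items():
            # Only (value, class) pairs observed in training get an entry;
            # classify substitutes a tiny probability for any missing pair.
            for class_label, count in class_dict.items():
                if smoothing:
                    # +1 smoothing: add 1 to the count and the number of
                    # distinct values seen for this feature to the denominator.
                    numerator = count + 1
                    denominator = class_counts[class_label] + len(feature_values)
                else:
                    numerator = count
                    denominator = class_counts[class_label]
                classifier["feature_probs"][feature_index][feature_value][class_label] = numerator / denominator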

    return classifier

In [24]:
training_data = [
        [1, 0, 'A'],
        [1, 1, 'A'],
        [0, 1, 'B'],
        [0, 0, 'B']
    ]
classifier = train(training_data, smoothing=False)  # Test classifier without smoothing

assert classifier["class_probs"]['A'] == 2 / 4, "Class probability for 'A' is incorrect"
assert classifier["class_probs"]['B'] == 2 / 4, "Class probability for 'B' is incorrect"

assert classifier["feature_probs"][0][1]['A'] == 2 / 2, "Feature probability for feature 0, value 1, class 'A' is incorrect"
assert classifier["feature_probs"][0][0]['B'] == 2 / 2, "Feature probability for feature 0, value 0, class 'B' is incorrect"
assert classifier["feature_probs"][1][0]['A'] == 1 / 2, "Feature probability for feature 1, value 0, class 'A' is incorrect"
assert classifier["feature_probs"][1][1]['B'] == 1 / 2, "Feature probability for feature 1, value 1, class 'B' is incorrect"


In [25]:
classifier = train(training_data, smoothing=True) # Test classifier with smoothing

assert classifier["class_probs"]['A'] == 2 / 4, "Class probability for 'A' is incorrect with smoothing"
assert classifier["class_probs"]['B'] == 2 / 4, "Class probability for 'B' is incorrect with smoothing"

assert classifier["feature_probs"][0][1]['A'] == (2 + 1) / (2 + 2), "Feature probability for feature 0, value 1, class 'A' is incorrect with smoothing"
assert classifier["feature_probs"][0][0]['B'] == (2 + 1) / (2 + 2), "Feature probability for feature 0, value 0, class 'B' is incorrect with smoothing"
assert classifier["feature_probs"][1][0]['A'] == (1 + 1) / (2 + 2), "Feature probability for feature 1, value 0, class 'A' is incorrect with smoothing"
assert classifier["feature_probs"][1][1]['B'] == (1 + 1) / (2 + 2), "Feature probability for feature 1, value 1, class 'B' is incorrect with smoothing"

### classify
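`classify` applies the trained model to each instance, summing log probabilities to avoid numeric underflow, and returns the most probable class for each instance. **Used by**: [cross_validate](#cross_validate)

- **classifier** (dict): The trained Naive Bayes Classifier model
- **test_data** (list[list]): The data to classify; when `labeled` is True, the last column is the class label and is ignored
- **labeled** (bool): Whether the last column of each instance is a class label

**Returns** (list): A list of predicted class labels

In [10]:
import math

def classify(classifier, test_data, labeled=True):
    predictions = []

    for instance in test_data:
        # Drop the trailing class label only when the data actually has one.
        features = instance[:-1] if labeled else instance
        posteriors = {}

        for class_label in classifier["classes"]:
            # Work in log space: start from the log prior, then add each log likelihood.
            posterior = math.log(classifier["class_probs"].get(class_label, 1e-10))  # Use a small value if missing

            for feature_index, feature_value in enumerate(features):
                # .get avoids inserting new keys into the defaultdict for unseen feature indices.
                feature_probs = classifier["feature_probs"].get(feature_index, {})
                # A (value, class) pair never seen in training falls back to a tiny probability.
                prob = feature_probs.get(feature_value, {}).get(class_label, 1e-10)
                posterior += math.log(prob)

            posteriors[class_label] = posterior

        # The argmax of the log posteriors is the predicted class.
        predicted_class = max(posteriors, key=posteriors.get)
        predictions.append(predicted_class)

    return predictions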

In [29]:
training_data = [
    [1, 0, 'A'],
    [1, 1, 'A'],
    [0, 1, 'B'],
    [0, 0, 'B']
]

classifier = train(training_data, smoothing=False)
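# classify drops the last column of each row, so every test row carries a label.
test_data = [
    [1, 0, 'A'],
    [1, 1, 'A'],
    [0, 1, 'B'],
    [0, 0, 'B'],
    [1, 1, 'A'],
]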

expected_predictions = ['A', 'A', 'B', 'B', 'A']
predictions = classify(classifier, test_data)
assert predictions == expected_predictions, f"Test failed: {predictions} != {expected_predictions}"
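
The directions ask for each result to include *normalized* probabilities, while the scores computed above are log values. One way to bridge the two, sketched here under the assumption that each class has a finite log score, is to subtract the largest score before exponentiating (the log-sum-exp trick) and then divide by the sum. The helper name `normalize_log_scores` and the sample numbers are hypothetical.

```python
import math


def normalize_log_scores(log_scores: dict) -> dict:
    """Turn per-class log scores into probabilities that sum to 1."""
    largest = max(log_scores.values())
    # Subtracting the largest score keeps exp() from underflowing to 0 for every class.
    shifted = {label: math.exp(score - largest) for label, score in log_scores.items()}
    total = sum(shifted.values())
    return {label: value / total for label, value in shifted.items()}


# Hypothetical log scores for one observation.
log_scores = {"e": math.log(0.375), "p": math.log(0.125)}
probabilities = normalize_log_scores(log_scores)
best = max(probabilities, key=probabilities.get)
result = (best, probabilities)  # same shape as the tuples in the directions

# Scores this small would all become 0.0 if exponentiated directly.
tiny = normalize_log_scores({"e": -1000.0, "p": -1001.0})
```

Because only the differences between log scores matter, the shift does not change the resulting probabilities.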

### evaluate
`evaluate` calculates the classification error rate: the fraction of instances whose predicted label differs from the actual label. **Used by**: [cross_validate](#cross_validate)

- **test_data** (list[list]): The test data where the last column is the class label
- **predictions** (list): The predicted class labels

**Returns** (float): Error rate

In [11]:
def evaluate(test_data, predictions):
    actual_labels = [instance[-1] for instance in test_data]
    errors = sum(1 for actual, predicted in zip(actual_labels, predictions) if actual != predicted)
    return errors / len(test_data)

In [31]:
test_data = [
    [1, 0, 'A'],
    [0, 1, 'B'],
    [1, 1, 'A']
]
predictions = ['A', 'B', 'A']
error_rate = evaluate(test_data, predictions)
assert error_rate == 0.0, f"Expected error rate 0.0, got {error_rate}"

predictions = ['B', 'A', 'B']
error_rate = evaluate(test_data, predictions)
assert error_rate == 1.0, f"Expected error rate 1.0, got {error_rate}"

predictions = ['A', 'A', 'A']
error_rate = evaluate(test_data, predictions)
assert error_rate == 1 / 3, f"Expected error rate {1 / 3}, got {error_rate}"

### cross_validate
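`cross_validate` evaluates the performance of the model using 5x2 cross-validation: the data is shuffled and split in half five times, and each half serves once as the training set and once as the test set. It prints the error rate for each fold and split, followed by the average error rate.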

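- **data** (list): The dataset to use for cross-validation. Rows containing `?` (missing values) are dropped. `train`, `classify`, and `evaluate` read the class label from the last element of each row, whereas the raw `agaricus-lepiota.data` file stores the class (`e`/`p`) in the first column.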
- **smoothing** (bool): Boolean flag to toggle +1 smoothing

In [32]:
import random

def cross_validate(data, smoothing=False):
    clean_data = [row for row in data if '?' not in row]

    total_error = 0

    print("Fold\tSplit\tError Rate")
    for fold in range(5):
        random.shuffle(clean_data)
        folds = create_folds(clean_data, 2)

        for split in range(2):
            train_set = folds[1 - split]
            test_set = folds[split]

            nbc = train(train_set, smoothing)
            preds = classify(nbc, test_set)
            error = evaluate(test_set, preds)
            total_error += error
            error_rate = error * 100
            print(f"{fold+1}\t{split+1}\t{error_rate:.4f}%")

    average_error = total_error / 10
    average_error_rate = average_error * 100
    print(f"Average Error Rate: {average_error_rate:.4f}%")

In [34]:
# Load the dataset
data_file = "agaricus-lepiota.data"
data = parse_data(data_file)

print("5x2 Cross-Validation with Smoothing=True")
cross_validate(data, True)

print("\n5x2 Cross-Validation with Smoothing=False")
cross_validate(data, False)

5x2 Cross-Validation with Smoothing=True
Fold	Split	Error Rate
1	1	28.3841%
1	2	28.2424%
2	1	28.7739%
2	2	27.5337%
3	1	28.4196%
3	2	28.0652%
4	1	29.1283%
4	2	27.8172%
5	1	28.5259%
5	2	27.9943%
Average Error Rate: 28.2884%

5x2 Cross-Validation with Smoothing=False
Fold	Split	Error Rate
1	1	28.1715%
1	2	28.4904%
2	1	29.2346%
2	2	27.2147%
3	1	27.3565%
3	2	29.2346%
4	1	27.8880%
4	2	28.8448%
5	1	27.4628%
5	2	29.1991%
Average Error Rate: 28.3097%


The Naive Bayes classifier achieved an average error rate of about 28%. This suggests that the strong independence assumption made by NB does not hold well for this dataset. Applying +1 smoothing changed the average error rate by only about 0.02 percentage points (28.29% versus 28.31%), which is smaller than the variation between folds, indicating that the model was not significantly affected by zero probabilities. We can assume that this dataset is limited more by the independence assumption than by sparsity.

In comparison to the Decision Tree model, which had a near-zero error rate, NB was unable to model the more complex feature relationships. While NB is simpler and faster, it can underperform when features are strongly dependent on one another, whereas a decision tree can adapt to fit such patterns.

## Before You Submit...

1. Did you provide output exactly as requested?
2. Did you re-execute the entire notebook? ("Restart Kernel and Run All Cells...")
3. If you did not complete the assignment or had difficulty please explain what gave you the most difficulty in the Markdown cell below.
4. Did you change the name of the file to `jhed_id.ipynb`?

Do not submit any other files.