## Naive Bayes Classifier

For this assignment you will be implementing and evaluating a Naive Bayes Classifier with the same data from last week:

http://archive.ics.uci.edu/ml/datasets/Mushroom

(You should have downloaded it).

<div style="background: lemonchiffon; margin:20px; padding: 20px;">
    <strong>Important</strong>
    <p>
        No Pandas. The only acceptable libraries in this class are those contained in the `environment.yml`. No OOP, either. You can use Dicts, NamedTuples, etc. as your abstract data type (ADT) for the classifier.
    </p>
</div>


You'll first need to calculate all of the necessary probabilities using a `train` function. A flag will control whether or not you use "+1 Smoothing". You'll then need a `classify` function that takes your probabilities and a List of instances (possibly a list of 1) and returns a List of Tuples. Each Tuple has the best class in the first position and, in the second position, a dict with a key for every possible class label and the associated *normalized* probability. For example, if we give the `classify` function a list of 2 observations, we would get the following back:

```
[("e", {"e": 0.98, "p": 0.02}), ("p", {"e": 0.34, "p": 0.66})]
```

When calculating the error rate of your classifier, you should pick the class label with the highest probability; you can write a simple function that takes the Dict and returns that class label.
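
A minimal sketch of such a helper, assuming the probabilities arrive as a plain dict keyed by class label. The name `best_label` is hypothetical and not part of the assignment:

```python
def best_label(probabilities: dict) -> str:
    """Return the class label with the highest probability."""
    # max() walks the keys and ranks them by their associated probability.
    return max(probabilities, key=probabilities.get)


print(best_label({"e": 0.98, "p": 0.02}))  # e
print(best_label({"e": 0.34, "p": 0.66}))  # p
```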

As a reminder, the Naive Bayes Classifier generates the *unnormalized* probabilities from the numerator of Bayes Rule:

$$P(C|A) \propto P(A|C)P(C)$$

where C is the class and A are the attributes (data). Since the normalizer of Bayes Rule is the *sum* of all possible numerators and you have to calculate them all, the normalizer is just the sum of the probabilities.
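
Written out, the normalization described above looks like this, where $c'$ ranges over every class label:

$$P(C=c|A) = \frac{P(A|C=c)P(C=c)}{\sum_{c'} P(A|C=c')P(C=c')}$$

As an illustration with made-up numerators, if the two classes scored 0.03 and 0.01, the normalizer would be 0.04 and the normalized probabilities would be 0.75 and 0.25.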

You will have the same basic functions as the last module's assignment and some of them can be reused or at least repurposed.

`train` takes training_data and returns a Naive Bayes Classifier (NBC) as a data structure. There are many options including namedtuples and just plain old nested dictionaries. **No OOP**.

```
def train(training_data, smoothing=True):
   # returns the Naive Bayes Classifier.
```

The `smoothing` value defaults to True. You should handle both cases.

`classify` takes a NBC produced from the function above and applies it to labeled data (like the test set) or unlabeled data (like some new data). (This is not the same `classify` as the pseudocode which classifies only one instance at a time; it can call it though).

```
def classify(nbc, observations, labeled=True):
    # returns a list of tuples, the argmax and the raw data as per the pseudocode.
```

`evaluate` takes a data set with labels (like the training set or test set) and the classification result and calculates the classification error rate:

$$error\_rate=\frac{errors}{n}$$

Do not use anything else as the evaluation metric or the submission will be deemed incomplete, i.e., an "F". (Hint: accuracy rate is not the error rate!).

`cross_validate` takes the data and uses 10 fold cross validation (from Module 3!) to `train`, `classify`, and `evaluate`. **Remember to shuffle your data before you create your folds**. I leave the exact signature of `cross_validate` to you but you should write it so that you can use it with *any* `classify` function of the same form (using higher order functions and partial application). If you did so last time, you can reuse it for this assignment.
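
One way to read the "higher order functions and partial application" hint is sketched below. Everything here is an assumption for illustration: the signature, the `seed` parameter, and the toy majority-class model are hypothetical stand-ins, not the required interface.

```python
import random
from functools import partial


def cross_validate(data, k, train_fn, classify_fn, evaluate_fn, seed=0):
    """Return one evaluation metric per fold; the model functions are passed in."""
    rows = list(data)
    random.Random(seed).shuffle(rows)  # shuffle a copy before building folds
    folds = [rows[i::k] for i in range(k)]  # round-robin split into k folds
    metrics = []
    for i, test_fold in enumerate(folds):
        # Train on every fold except the held-out one.
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        model = train_fn(training)
        predictions = classify_fn(model, test_fold)
        metrics.append(evaluate_fn(test_fold, predictions))
    return metrics


# Hypothetical toy model: always predict the most common training label.
def train_majority(rows, default="e"):
    labels = [row[0] for row in rows]
    return max(set(labels), key=labels.count) if labels else default


def classify_majority(model, rows):
    return [(model, {model: 1.0}) for _ in rows]


def error_rate(rows, predictions):
    errors = sum(1 for row, (label, _) in zip(rows, predictions) if row[0] != label)
    return errors / len(rows)


toy = [["e", "x"]] * 8 + [["p", "y"]] * 2
# partial() fixes a keyword argument so cross_validate only sees train_fn(rows).
metrics = cross_validate(toy, 5, partial(train_majority, default="e"),
                         classify_majority, error_rate)
for fold, metric in enumerate(metrics, start=1):
    print(f"Fold {fold}: {metric:.2%}")
```

The same pattern would let a flag such as `smoothing` be fixed with `partial` before the training function is handed to the cross-validation loop.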

Following Module 3's discussion, `cross_validate` should print out the fold number and the evaluation metric (error rate) for each fold and then the average value (and the variance). What you are looking for here is a consistent evaluation metric across the folds. You should print the error rates in terms of percents (i.e., multiply the error rate by 100 and add "%" to the end).

To summarize...

Apply the Naive Bayes Classifier algorithm to the Mushroom data set using 10 fold cross validation and the error rate as the evaluation metric. You will do this *twice*. Once with smoothing=True and once with smoothing=False. You should follow up with a brief explanation for the similarities or differences in the results.

In [1]:
import numpy as np
import random
from statistics import mean

## load
`load` takes in a file name, reads the comma-separated rows, shuffles them, and returns them as a numpy array. **Used by**: None

* **file_name** str: the file name

**returns** np.ndarray: the shuffled rows of the file as a numpy array

In [2]:
def load(file_name: str) -> np.ndarray:
    data = []
    # The context manager closes the file even if parsing fails.
    with open(file_name, "r") as file:
        for line in file:
            data.append(line.rstrip().split(","))
    # Shuffle the rows once at load time.
    random.shuffle(data)
    return np.asarray(data)

In [3]:
print(load('agaricus-lepiota.data'))

[['e' 'x' 's' ... 'n' 'c' 'l']
 ['p' 'f' 'y' ... 'h' 'y' 'p']
 ['e' 'f' 's' ... 'u' 'v' 'd']
 ...
 ['p' 'k' 'y' ... 'w' 'v' 'p']
 ['e' 'x' 'y' ... 'n' 'y' 'd']
 ['p' 'f' 's' ... 'h' 'v' 'u']]


## get_domain
`get_domain` examines the data at a column index and returns all the unique values within that column. **Used by**: [cross_validation](#cross_validation), [train](#train)
* **data** list[list]: the data
* **col** int: the index of the column

**returns** list: returns a list containing the unique values in that col

In [4]:
def get_domain(data:list[list], col:int)->list:
    return np.unique(data[:, col])

In [5]:
test1 = [['s', 's', 'a'], 
        ['s', 's', 's'], 
        ['s', 'a', 'b']]
test = np.asarray(test1)

assert len(get_domain(test, 0)) == 1
assert len(get_domain(test, 1)) == 2
assert len(get_domain(test, 2)) == 3

## get_count
`get_count` counts the rows in which column `col` equals `value`. If `y_col` and `y_value` are both given, it instead counts the rows in which column `col` equals `value` and column `y_col` equals `y_value`. **Used by**: [get_probability](#get_probability)

* **data** list[list]: the data 
* **col** int: the index of the column
* **value** str: the character of the string in question that is being counted
* **y_col** int: the index of the class - y column 
* **y_value** str: the character of the class from the y_column that is in question 

**returns** int: returns the count 

In [6]:
def get_count(data: list[list], col:int, value:str, y_col = None, y_value = None)->int:
    value_count = (data[:, col] == value)
    if y_col is None or y_value is None:
        return value_count.sum()
    y_count = (data[:, y_col] == y_value)
    val_y_count = np.logical_and(value_count, y_count)
    return val_y_count.sum()

In [7]:
test1 = [['s', 's', 'a'], 
        ['s', 's', 's'], 
        ['s', 'a', 'b']]
test = np.asarray(test1)

assert get_count(test, 0, 's') == 3
assert get_count(test, 1, 's') == 2
assert get_count(test, 1, 's', y_col = 0, y_value = 's') == 2

## get_col_count
`get_col_count` returns the number of rows (instances) in a column. **Used by:** [get_probability](#get_probability)

* **data** list[list]: the data in question
* **col** int: the index of the column

**returns** int: returns the count of how many instances are in that column

In [8]:
def get_col_count(data:list[list], col:int)-> int:
    return len(data[:, col])

In [9]:
test1 = [['s', 's', 'a'], 
        ['s', 's', 's'], 
        ['s', 'a', 'b']]
test = np.asarray(test1)

assert get_col_count(test, 0) == 3

test2 = np.asarray([['s', 's', 'a'], 
                    ['s', 's', 's']])

assert get_col_count(test2, 0) == 2

test3 = np.asarray([['s', 's', 's']])

assert get_col_count(test3, 0) == 1

## get_probability 
`get_probability` returns the probability of a value based on how many times it appears within the column. If `y_col` and `y_value` are both given, it returns the conditional probability of the value given that the instance also has class `y_value`. 
Smoothing is on by default; if `smoothing` is False, +1 smoothing is not applied. **Used by:** [train](#train)
* **data** list[list]: the data in question
* **col** int: the index of the column
* **value** str: the character of the string in question that is being counted
* **y_col** int: the index of the class - y column 
* **y_value** str: the character of the class from the y_column that is in question 
* **smoothing** bool: if True, performs +1 smoothing; if False, no smoothing is applied

**returns** float: returns the probability

In [10]:
def get_probability(data: list[list], col: int, value:str, y_col = None, y_value = None, smoothing = True)-> float:
    if y_col is None or y_value is None:
        # Marginal probability P(value); no smoothing is applied here.
        return get_count(data, col, value) / get_col_count(data, col)
    # Conditional probability P(value | y_value).
    value_count = get_count(data, col, value, y_col, y_value)
    y_count = get_count(data, y_col, y_value)
    if not smoothing:
        return value_count / y_count
    # +1 smoothing as used in this notebook: 1 is added to both the count and
    # the class total. Textbook Laplace smoothing would instead add the number
    # of distinct attribute values to the denominator.
    return (value_count + 1) / (y_count + 1)

In [11]:
test1 = np.asarray([['s', 's', 'a'], 
                    ['s', 's', 'a'], 
                    ['s', 'a', 'a'], 
                    ['s', 'a', 'a']])
assert get_probability(test1, 1, 'a') == 0.5

assert get_probability(test1, 0, 's') == 1

assert get_probability(test1, 1, 'a', y_col = 0, y_value = 'a') == 1

assert get_probability(test1, 1, 'a', y_col = 0, y_value = 's', smoothing = False) == 0.5
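
For comparison with the `(count + 1) / (total + 1)` form used above, here is a sketch of the textbook add-one (Laplace) estimate, which adds the number of distinct attribute values to the denominator so that the estimates for one attribute still sum to 1 within a class. The function name and the counts are hypothetical; this is not the notebook's implementation.

```python
def laplace_estimate(value_count: int, class_count: int, domain_size: int) -> float:
    """Add-one (Laplace) estimate of P(A = v | C = c)."""
    return (value_count + 1) / (class_count + domain_size)


# Hypothetical counts: an attribute with 3 possible values, 4 rows in the class.
counts = {"a": 2, "b": 2, "c": 0}
estimates = {value: laplace_estimate(n, 4, len(counts)) for value, n in counts.items()}

print(estimates)                # the unseen value "c" gets a small non-zero share
print(sum(estimates.values()))  # approximately 1
```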

## train

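`train` builds the Naive Bayes classifier as a list of dictionaries. The first dictionary holds the class priors; each following dictionary holds, for one attribute, the conditional probability of every value given each class.

* **data** list[list]: the data in question
* **smoothing** bool: if True, performs +1 smoothing; if False, it does not

**returns** list[dict]: returns all the possible probabilities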

In [12]:
def train(data:list[list], smoothing=True) -> list[dict]:
    nbc = []
    domains = [get_domain(data, i) for i in range(len(data[0]))]
    y_values = domains[0]
    nbc.append({value: get_probability(data, 0, value) for value in domains[0]})
    for i in range(1, len(data[0])):
        nbc.append({value: {y_value : get_probability(data, i, value, 0, y_value, smoothing=smoothing) 
                                      for y_value in y_values} for value in domains[i]})
    return nbc

In [13]:
test1 = np.asarray([['s', 's', 'a'], 
                    ['s', 's', 'a'], 
                    ['s', 'a', 'a'], 
                    ['s', 'a', 'a']])
assert train(test1) == [{'s': 1.0}, {'a': {'s': 0.6}, 's': {'s': 0.6}}, {'a': {'s': 1.0}}]

test2 = np.asarray([['s', 's', 'a'], 
                    ['e', 's', 'a'], 
                    ['s', 'a', 'a'], 
                    ['e', 'a', 'a']])
assert train(test2) == [{'e': 0.5, 's': 0.5},
                                        {'a': {'e': 0.6666666666666666, 's': 0.6666666666666666},
                                         's': {'e': 0.6666666666666666, 's': 0.6666666666666666}},
                                        {'a': {'e': 1.0, 's': 1.0}}]

assert train(test2, smoothing = False) == [{'e': 0.5, 's': 0.5},
                                                           {'a': {'e': 0.5, 's': 0.5}, 's': {'e': 0.5, 's': 0.5}},
                                                           {'a': {'e': 1.0, 's': 1.0}}]

## get_prob_of
`get_prob_of` calculates the unnormalized probability (the numerator of Bayes Rule) of a class label for an instance **Used by:** [normalize](#normalize),  [best](#best), [classify](#classify)

$$P(C) \cdot \prod_{i} P(A_i = v_i | C)$$

where $C$ is the class label, $A_i$ is the $i$-th attribute, and $v_i$ is the value of that attribute in the instance.
 
* **nbc** list[dict]: naive bayes data structure
* **instance** list: instance of attributes
* **label** str: label in question 

**returns** float: the probability

In [14]:
def get_prob_of(nbc: list[dict], instance:list, label:str)-> float:
    # np.prod replaces np.product, which was deprecated and later removed from NumPy.
    likelihoods = [nbc[i+1][instance[i]][label] for i in range(len(instance))]
    return nbc[0][label] * np.prod(likelihoods)
    

In [15]:
test1 = np.asarray([['s', 's', 'a'], 
                    ['s', 's', 'a'], 
                    ['s', 'a', 'a'], 
                    ['s', 'a', 'a']])

assert get_prob_of(train(test1), test1[1][1:], label="s") == 0.6
assert get_prob_of(train(test1, smoothing=False), test1[1][1:], label="s") == 0.5

test2 = np.asarray([['s', 's', 'a'], 
                    ['e', 's', 'a'], 
                    ['s', 'a', 'a'], 
                    ['e', 'a', 'a']])
assert get_prob_of(train(test2, smoothing=False), test2[1][1:], label="e") == 0.25
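
A product of many small probabilities can underflow to zero in floating point. With the 22 attributes of the mushroom data this is unlikely to matter, but a common alternative is to sum logarithms instead. The sketch below assumes the prior and likelihoods have already been looked up; `log_score` is a hypothetical helper, not part of this notebook.

```python
import math


def log_score(prior: float, likelihoods: list) -> float:
    """Return log(prior * product(likelihoods)), or -inf if any factor is zero."""
    factors = [prior, *likelihoods]
    if any(factor == 0 for factor in factors):
        return float("-inf")  # log(0) is undefined; treat it as "impossible"
    return sum(math.log(factor) for factor in factors)


# Same factors as a small hand-worked case: 0.5 * 0.5 * 1.0
score = log_score(0.5, [0.5, 1.0])
print(math.exp(score))

# One hundred factors of 1e-5 underflow as a plain product but not as a log sum.
tiny = [1e-5] * 100
print(math.prod(tiny))
print(log_score(1.0, tiny))
```

Because the logarithm is monotonic, the class with the largest log score is also the class with the largest product.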

## normalize 

The `normalize` function normalizes the results so that the probabilities add up to 1 **Used by:** [best](#best), [classify](#classify)

* **nbc** dict: modified dictionary structure of the nbc
* **labels** list[str]: list of all the possible labels  

**returns** dict: normalized probabilities given the class

In [16]:
def normalize(nbc: dict, labels:list[str])->dict:
    total = np.sum(np.array([nbc[label] for label in labels]))
    result = {label : nbc[label]/total for label in labels}
    return result

In [17]:
test1 = np.asarray([['s', 's', 'a'], 
                    ['s', 's', 'a'], 
                    ['e', 'a', 'a'], 
                    ['s', 'a', 'a']])
test_labels = ['e', 's']
probs = train(test1)

results1 = {label : get_prob_of(probs, test1[2][1:], label)  for label in test_labels}
results2 = {label : get_prob_of(probs, test1[1][1:], label)  for label in test_labels}
results3 = {label : get_prob_of(probs, test1[3][1:], label)  for label in test_labels}

assert normalize(results1, test_labels) == {'e': 0.4, 's': 0.6}
assert normalize(results2, test_labels) == {'e': 0.18181818181818182, 's': 0.8181818181818182}
assert normalize(results3, test_labels) == {'e': 0.4, 's': 0.6}
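
Without smoothing, every class can end up with a score of zero, in which case the total is zero and the division above is undefined. One possible guard is sketched below; the uniform fallback is an assumption chosen for illustration, and `safe_normalize` is a hypothetical name.

```python
def safe_normalize(scores: dict) -> dict:
    """Normalize scores to sum to 1, falling back to uniform if all are zero."""
    total = sum(scores.values())
    if total == 0:
        # Assumption: with no evidence for any class, treat them as equally likely.
        return {label: 1 / len(scores) for label in scores}
    return {label: value / total for label, value in scores.items()}


print(safe_normalize({"e": 0.375, "p": 0.125}))  # {'e': 0.75, 'p': 0.25}
print(safe_normalize({"e": 0.0, "p": 0.0}))      # {'e': 0.5, 'p': 0.5}
```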

## best
`best` takes the dictionary of probabilities, sorts it from highest to lowest probability, and returns the label with the highest probability **Used by:** [classify](#classify)

* **nbc** dict: modified dictionary structure of the nbc
* **labels** list[str]: list of all the possible labels  

**returns** str: returns the best choice within the dictionary

In [18]:
def best(nbc:dict, labels:list[str])->str:
    return sorted([(nbc[label], label) for label in labels], reverse = True)[0][1]

In [19]:
test1 = np.asarray([['s', 's', 'a'], 
                    ['s', 's', 'a'], 
                    ['e', 'a', 'a'], 
                    ['s', 'a', 'a']])
test_labels = ['e', 's']
probs = train(test1)

results1 = {label : get_prob_of(probs, test1[2][1:], label)  for label in test_labels}
results2 = {label : get_prob_of(probs, test1[1][1:], label)  for label in test_labels}
results3 = {label : get_prob_of(probs, test1[3][1:], label)  for label in test_labels}

assert best(normalize(results1, test_labels), test_labels) == 's'
assert best(normalize(results2, test_labels), test_labels) == 's'
assert best(normalize(results3, test_labels), test_labels) == 's'

## classify
`classify` is used to classify an instance from the testing data 
* **nbc** list[dict]: naive bayes data structure
* **instance** list[str]: instance of attributes
* **labels** list[str]: list of possible labels

**returns**  tuple : best choice and the dictionary of the probabilities

In [20]:
def classify(nbc, instance, labels)-> tuple:
    res = {label: get_prob_of(nbc, instance, label) for label in labels}
    res = normalize(res, labels)
    best_choice = best(res, labels)
    return (best_choice, res)

In [21]:
test1 = np.asarray([['s', 's', 'a'], 
                    ['s', 's', 'a'], 
                    ['e', 'a', 'a'], 
                    ['s', 'a', 'a']])
test_labels = ['e', 's']
test_nbc = train(test1)


assert classify(test_nbc, test1[2][1:], test_labels) == ('s', {'e': 0.4, 's': 0.6})

assert classify(test_nbc, test1[1][1:], test_labels) == ('s', {'e': 0.18181818181818182, 's': 0.8181818181818182})

assert classify(test_nbc, test1[3][1:], test_labels) == ('s', {'e': 0.4, 's': 0.6})

# Model Evaluation

## get_folds
`get_folds` is used to split the data into k folds

* **data** list[list]: the data in question
* **k** int: the number of folds

**returns**  list[list]: nested list with k number of folds

In [22]:
def get_folds(data:list[list], k:int)->list[list]:
    return np.array_split(data, k)

In [23]:
test1 = np.asarray([['s', 's', 'a'], 
                    ['s', 's', 'a'], 
                    ['e', 'a', 'a'], 
                    ['s', 'a', 'a']])

assert len(get_folds(test1, 3)) == 3
assert len(get_folds(test1, 4)) == 4
assert len(get_folds(test1, 1)) == 1
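
When the number of rows is not a multiple of k, `np.array_split` makes the leading folds one row longer than the rest. A small sketch of that behavior, using placeholder arrays instead of the real data (8124 is the row count of the UCI Mushroom data set):

```python
import numpy as np

# 10 rows into 3 folds: the first fold gets the extra row.
small = np.arange(10).reshape(10, 1)
small_sizes = [len(fold) for fold in np.array_split(small, 3)]
print(small_sizes)  # [4, 3, 3]

# 8124 rows into 10 folds: four folds of 813 rows, then six folds of 812.
large = np.arange(8124).reshape(8124, 1)
large_sizes = [len(fold) for fold in np.array_split(large, 10)]
print(large_sizes)
```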

## separate 
`separate` is used to separate the x and y values
* **data** list[list]: the data in question

**returns** list, list: list of x and list of y values

In [24]:
def separate(data):
    x = data[:,1:]
    y = data[:, 0]
    return x, y

In [25]:
test1 = np.asarray([['s', 's', 'a'], 
                    ['s', 's', 'a'], 
                    ['e', 'a', 'a'], 
                    ['s', 'a', 'a']])
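x_out, y_out = separate(test1)
x = [['s', 'a'],
     ['s', 'a'],
     ['a', 'a'],
     ['a', 'a']]
y = ['s', 's', 'e', 's']
# Compare each returned array explicitly; `== x or y` is always truthy.
assert np.array_equal(x_out, x)
assert np.array_equal(y_out, y)

test2 = np.asarray([['s', 's', 'a'], 
                    ['s', 's', 'a']])
x2 = [['s', 'a'],
      ['s', 'a']]
y2 = ['s', 's']
x2_out, y2_out = separate(test2)
assert np.array_equal(x2_out, x2)
assert np.array_equal(y2_out, y2)

test3 = np.asarray([['s', 's', 'a']])
x3 = [['s', 'a']]
y3 = ['s']
x3_out, y3_out = separate(test3)
assert np.array_equal(x3_out, x3)
assert np.array_equal(y3_out, y3)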

## cross_validation
`cross_validation` is used to execute all of the above and classify the testing data in each fold for k number of folds

* **data** list[list]: the data in question
* **k** int: the number of folds
* **smoothing** bool: +1 smoothing if True, if False then no smoothing

**returns** list[tuple]: one tuple per fold holding the actual y values and the predictions for that fold

In [26]:
def cross_validation(data:list[list], k: int, smoothing=True)-> list:
    outputs = []
    np.random.shuffle(data)  # shuffles the rows in place before folding
    folds = get_folds(data, k)
    labels = get_domain(data, 0)
    for i in range(k):
        test_data = folds[i]
        # Train on every fold except the held-out one. Concatenating avoids
        # building a ragged array when the folds have unequal lengths.
        training_data = np.concatenate([fold for j, fold in enumerate(folds) if j != i])
        nbc = train(training_data, smoothing=smoothing)
        test_x, y = separate(test_data)
        y_pred = [classify(nbc, instance, labels) for instance in test_x]
        outputs.append((y, y_pred))
    return outputs

In [27]:
test1 = np.asarray([['s', 's', 'a'], 
                    ['s', 's', 'b'], 
                    ['e', 'a', 'a'], 
                    ['s', 'a', 'a'], 
                    ['e', 's', 'b'], 
                    ['e', 'a', 'a'],
                    ['e', 's', 'a'], 
                    ['s', 'a', 'b'], 
                    ['e', 's', 'a'], 
                    ['e', 'a', 'a']])

output = cross_validation(test1, 2)

## evaluate
`evaluate` computes the error rate by counting the number of times y_pred does not equal y_actual and dividing by the total number of predictions. It returns the error rate of each fold, the overall error rate, and the variance of the error rate across the folds.

$$error\_rate=\frac{errors}{n}$$

In [28]:
def evaluate(output)->tuple:
    fold_errors = []
    total_error_count, total_size = 0, 0
    for y, y_pred in output:
        # Each prediction is a (best_label, probabilities) tuple.
        error_count = sum(1 for actual, predicted in zip(y, y_pred) if actual != predicted[0])
        total_error_count += error_count
        total_size += len(y)
        fold_errors.append(error_count/len(y))
    total_error = total_error_count/total_size
    # Sample variance of the per-fold error rates (needs at least two folds).
    fold_mean = mean(fold_errors)
    variance = sum((fold - fold_mean)**2 for fold in fold_errors) / (len(fold_errors)-1)
    return fold_errors, total_error, variance

In [29]:
a,b,c = evaluate(output)
print("fold_error:", a)
print("total_error:", b)
print("variance:", c)

fold_error: [0.6, 0.6]
total_error: 0.6
variance: 0.0
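
The per-fold error rate, its average, and its sample variance can also be computed with the standard library alone. The sketch below uses made-up labels, and `error_rate` is a hypothetical helper that compares plain labels and not the `(label, probabilities)` tuples used above.

```python
from statistics import mean, variance


def error_rate(actual: list, predicted: list) -> float:
    """Fraction of predictions that differ from the actual labels."""
    errors = sum(1 for a, p in zip(actual, predicted) if a != p)
    return errors / len(actual)


# Two made-up folds of four predictions each.
fold_rates = [
    error_rate(["e", "p", "e", "p"], ["e", "e", "e", "p"]),  # one error
    error_rate(["p", "p", "e", "e"], ["p", "p", "e", "e"]),  # no errors
]

for fold, rate in enumerate(fold_rates, start=1):
    print(f"Fold {fold}: {rate:.2%}")  # the % format multiplies by 100 for us
print(f"Mean: {mean(fold_rates):.2%}")
print(f"Variance: {variance(fold_rates)}")  # sample variance, divides by n - 1
```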


## pprint
`pprint` is a helper function to pretty print the outputs of running the cross_validation testing 

In [30]:
def pprint(output):
    fold_error, total_error, variance = evaluate(output)
    for i in range(len(fold_error)):
        print(f"Fold: {i+1}")
        print(f"  error rate_(fold = {i+1}) = {fold_error[i]*100} %\n")
    print(f"Total error rate = {total_error*100} %")
    print(f"Variance = {variance}")

In [31]:
pprint(output)

Fold: 1
  error rate_(fold = 1) = 60.0 %

Fold: 2
  error rate_(fold = 2) = 60.0 %

Total error rate = 60.0 %
Variance = 0.0


# 10 Fold Cross Validation on Mushroom Data with +1 Smoothing

In [32]:
mushroom_data = load('agaricus-lepiota.data')

mushroom_output_smoothing = cross_validation(mushroom_data, 10)

pprint(mushroom_output_smoothing)


Fold: 1
  error rate_(fold = 1) = 4.305043050430505 %

Fold: 2
  error rate_(fold = 2) = 4.182041820418204 %

Fold: 3
  error rate_(fold = 3) = 3.6900369003690034 %

Fold: 4
  error rate_(fold = 4) = 4.428044280442804 %

Fold: 5
  error rate_(fold = 5) = 3.8177339901477834 %

Fold: 6
  error rate_(fold = 6) = 3.9408866995073892 %

Fold: 7
  error rate_(fold = 7) = 4.926108374384237 %

Fold: 8
  error rate_(fold = 8) = 4.310344827586207 %

Fold: 9
  error rate_(fold = 9) = 6.157635467980295 %

Fold: 10
  error rate_(fold = 10) = 5.41871921182266 %

Total error rate = 4.517479074347612 %
Variance = 5.9829302837869466e-05


# 10 Fold Cross Validation on Mushroom Data with no smoothing

In [33]:
mushroom_output_no_smoothing = cross_validation(mushroom_data, 10, smoothing=False)

pprint(mushroom_output_no_smoothing)

Fold: 1
  error rate_(fold = 1) = 0.4920049200492005 %

Fold: 2
  error rate_(fold = 2) = 0.24600246002460024 %

Fold: 3
  error rate_(fold = 3) = 0.24600246002460024 %

Fold: 4
  error rate_(fold = 4) = 0.6150061500615006 %

Fold: 5
  error rate_(fold = 5) = 0.0 %

Fold: 6
  error rate_(fold = 6) = 0.7389162561576355 %

Fold: 7
  error rate_(fold = 7) = 0.12315270935960591 %

Fold: 8
  error rate_(fold = 8) = 0.49261083743842365 %

Fold: 9
  error rate_(fold = 9) = 0.49261083743842365 %

Fold: 10
  error rate_(fold = 10) = 0.0 %

Total error rate = 0.34465780403741997 %
Variance = 6.668084280250639e-06


### Final Thoughts:

After running 10-fold cross validation with and without +1 smoothing, the error rate without smoothing (about 0.34%) is clearly lower than the error rate with smoothing (about 4.52%). Laplace smoothing is a technique that adds 1 to the counts observed in the training data before the probabilities are computed, so that no conditional probability is ever zero. Although the error rate is higher with smoothing, the per-fold error rates are more uniform relative to their average (3.7% to 6.2%) than those without smoothing (0% to 0.74%), even though the absolute variance across folds is smaller without smoothing (about 6.7e-06 versus 6.0e-05). Smoothing ensures that the posterior probability is never zero and helps when a query contains an attribute value that was not seen with a class in the training data. On this data set, the price of reserving probability for such unseen observations is a higher error rate. Note also that the smoothing implemented here adds 1 to both the numerator and the denominator, which differs from the textbook Laplace estimate, so part of the gap may come from that choice.
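
The zero-probability effect described above can be seen with a few made-up counts. The sketch uses the same `(count + 1) / (total + 1)` form as this notebook; the numbers and the helper name are hypothetical.

```python
def posterior_numerator(prior: float, likelihoods: list) -> float:
    """Multiply the class prior by each attribute likelihood."""
    result = prior
    for likelihood in likelihoods:
        result *= likelihood
    return result


# Made-up class with 4 training rows: the first attribute value never
# appeared with this class (count 0), the second appeared 3 times.
unsmoothed = posterior_numerator(0.5, [0 / 4, 3 / 4])
smoothed = posterior_numerator(0.5, [(0 + 1) / (4 + 1), (3 + 1) / (4 + 1)])

print(unsmoothed)  # 0.0: a single zero wipes out the whole product
print(smoothed)    # small but non-zero, so the class stays in contention
```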