# Introduction to Machine Learning - Practicum 5 - Naive Bayes Classifier

**Topics covered**: Label Encoding, Naive Bayes Classifier Algorithm

**Deliverables**:
- Complete the tasks as detailed in this document.
- You are not allowed to use any Machine Learning APIs for this practicum (NumPy and Pandas are allowed).

**Objectives**:  
Naive Bayes is a very simple classification algorithm that makes some strong assumptions about the independence of each input variable. Nevertheless, it has been shown to be effective in a large number of problem domains. In this chapter you will discover the Naive Bayes algorithm for categorical data. After reading this chapter you will know:
- How to work with categorical data for Naive Bayes.
- How to prepare the class and conditional probabilities for a Naive Bayes model.
- How to use a learned Naive Bayes model to make predictions.

---

## 1. Dataset
Run the following cell to load the dataset and import the relevant libraries.

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('data.csv')
print(df.head())
print(df.shape)  # (number of rows, number of columns)

  Weather      Car   Class
0   sunny  working  go-out
1   rainy   broken  go-out
2   sunny  working  go-out
3   sunny  working  go-out
4   sunny  working  go-out
(10, 3)


The dataset describes two categorical input variables and a class variable that has two outputs:

| Weather | Car     | Class     |
|---------|---------|-----------|
| sunny   | working | go-out    |
| rainy   | broken  | go-out    |
| sunny   | working | go-out    |
| sunny   | working | go-out    |
| sunny   | working | go-out    |
| rainy   | broken  | stay-home |
| rainy   | broken  | stay-home |
| sunny   | working | stay-home |
| sunny   | broken  | stay-home |
| rainy   | broken  | stay-home |
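
If `data.csv` is not at hand, the same ten rows can be typed in directly from the table above. The following is a minimal sketch; the name `df_inline` is an assumption made for illustration and is not part of the practicum.

```python
import pandas as pd

# The ten rows shown in the table above, typed in by hand.
rows = [
    ("sunny", "working", "go-out"),
    ("rainy", "broken", "go-out"),
    ("sunny", "working", "go-out"),
    ("sunny", "working", "go-out"),
    ("sunny", "working", "go-out"),
    ("rainy", "broken", "stay-home"),
    ("rainy", "broken", "stay-home"),
    ("sunny", "working", "stay-home"),
    ("sunny", "broken", "stay-home"),
    ("rainy", "broken", "stay-home"),
]

# Hypothetical name: df_inline stands in for the DataFrame loaded from data.csv.
df_inline = pd.DataFrame(rows, columns=["Weather", "Car", "Class"])
print(df_inline.shape)
```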

For the dataset above to be useful, we need to convert the categorical variables to numerical data.

Since each input has only two values and the output class variable has two values, we should convert each variable to binary based on the following representations:
- Weather: sunny = 1, rainy = 0
- Car: working = 1, broken = 0
- Class: go-out = 1, stay-home = 0

This is also known as label encoding.
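
One possible way to apply such a representation is an explicit dictionary per column together with `Series.map`. The sketch below runs on a two-row sample; `ENCODING`, `encode_with_map`, and `sample` are hypothetical names chosen for illustration, and this is only one of several reasonable approaches.

```python
import pandas as pd

# Explicit category-to-integer mappings, copied from the list above.
ENCODING = {
    "Weather": {"sunny": 1, "rainy": 0},
    "Car": {"working": 1, "broken": 0},
    "Class": {"go-out": 1, "stay-home": 0},
}


def encode_with_map(frame):
    """Hypothetical helper: return a copy of frame with each column mapped to 0/1."""
    encoded = frame.copy()
    for column, mapping in ENCODING.items():
        encoded[column] = encoded[column].map(mapping)
    return encoded


# A two-row sample covering every category once.
sample = pd.DataFrame(
    {
        "Weather": ["sunny", "rainy"],
        "Car": ["working", "broken"],
        "Class": ["go-out", "stay-home"],
    }
)
encoded_sample = encode_with_map(sample)
print(encoded_sample)
```

Spelling the mapping out keeps the choice of which category becomes 1 visible in the code, and any value missing from the dictionary shows up as `NaN` instead of being silently encoded.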

Write a function called `label_encoder` that takes in the dataset `df` as input and returns the encoded dataset `edf` as output.

In [2]:
def label_encoder(df):
    # One-hot encode each column, then keep only the indicator column
    # for the value that should map to 1.
    # dtype=int yields 0/1 integers (recent pandas versions default to booleans).
    df_weather = pd.get_dummies(df["Weather"], dtype=int)
    df_weather = df_weather.drop(['rainy'], axis=1)
    df_weather = df_weather.rename(columns={"sunny": "Weather"})

    df_car = pd.get_dummies(df["Car"], dtype=int)
    df_car = df_car.drop(['broken'], axis=1)
    df_car = df_car.rename(columns={"working": "Car"})

    df_class = pd.get_dummies(df["Class"], dtype=int)
    df_class = df_class.drop(['stay-home'], axis=1)
    df_class = df_class.rename(columns={"go-out": "Class"})

    # Combine the three encoded columns into a single DataFrame.
    df_encoded = pd.concat((df_weather, df_car, df_class), axis=1)
    return df_encoded

Run the following cell. Your dataset should be restated as follows:

| Weather 	| Car 	| Class |
|:---------:|:-----:|:-----:|
| 1 		| 1 	| 1     |
| 0 		| 0 	| 1     |
| 1 		| 1 	| 1     |
| 1 		| 1 	| 1     |
| 1 		| 1 	| 1     |
| 0 		| 0 	| 0     |
| 0 		| 0 	| 0     |
| 1 		| 1 	| 0     |
| 1 		| 0 	| 0     |
| 0 		| 0 	| 0     |

In [3]:
edf = label_encoder(df)
print(edf.head())

   Weather  Car  Class
0        1    1      1
1        0    0      1
2        1    1      1
3        1    1      1
4        1    1      1


---

## 2. Learn a Naive Bayes Model
There are two types of quantities that need to be calculated from the dataset for the Naive Bayes model:
- Class Probabilities.
- Conditional Probabilities.

Let's start with the class probabilities.

### 2.1 Calculate the Class Probabilities
The dataset is a two class problem. We can calculate the class probabilities for classes 0 and 1 as follows:

$$
P(class = 1) = \frac{count(class = 1)}{count(class = 0) + count(class = 1)} \\
$$

$$
P(class = 0) = \frac{count(class = 0)}{count(class = 0) + count(class = 1)} \\
$$
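
As a quick check of the two formulas, the counts can be taken directly from a plain list of encoded class labels. This is a minimal sketch that assumes the labels are typed in by hand in dataset order; `labels` and `class_probabilities` are illustrative names.

```python
from collections import Counter

# Encoded class labels of the ten rows (go-out = 1, stay-home = 0).
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

counts = Counter(labels)
total = sum(counts.values())

# P(class = c) = count(class = c) / (count(class = 0) + count(class = 1))
class_probabilities = {c: counts[c] / total for c in counts}
print(class_probabilities)  # {1: 0.5, 0: 0.5}
```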

Write a function `calc_class_prob` that takes in the encoded dataset `edf` as input and returns the probability of each class `class_prob` as a dictionary.

In [4]:
def calc_class_prob(edf):
    total_counts = edf['Class'].count()   # number of rows with a class label
    class_prob = {}
    # unique() lists the class values in order of first appearance (1, then 0, for this dataset)
    for class_value in edf['Class'].unique():
        class_count = (edf['Class'] == class_value).sum()   # count(class = class_value)
        class_prob[int(class_value)] = float(class_count / total_counts)
    return class_prob

Run the cell below to store the class probabilities in the variable `cls_prob`.

In [5]:
cls_prob = calc_class_prob(edf)
print(cls_prob)

{1: 0.5, 0: 0.5}


### 2.2 Calculate the Conditional Probabilities
The conditional probabilities are the probability of each input value given each class value. The conditional probabilities for the dataset can be calculated as follows:

#### Weather Input Variable

$$
\begin{align}
&P(weather = sunny|class = \text{go-out}) = \frac{count(weather = sunny \wedge class = \text{go-out})}{count(class = \text{go-out})}\\
&P(weather = rainy|class = \text{go-out}) = \frac{count(weather = rainy \wedge class = \text{go-out})}{count(class = \text{go-out})}\\
&P(weather = sunny|class = \text{stay-home}) = \frac{count(weather = sunny \wedge class = \text{stay-home})}{count(class = \text{stay-home})}\\
&P(weather = rainy|class = \text{stay-home}) = \frac{count(weather = rainy \wedge class = \text{stay-home})}{count(class = \text{stay-home})}\\
\end{align}
$$

#### Car Input Variable

$$
\begin{align}
&P(car = working|class = \text{go-out}) = \frac{count(car = working \wedge class = \text{go-out})}{count(class = \text{go-out})}\\
&P(car = broken|class = \text{go-out}) = \frac{count(car = broken \wedge class = \text{go-out})}{count(class = \text{go-out})}\\
&P(car = working|class = \text{stay-home}) = \frac{count(car = working \wedge class = \text{stay-home})}{count(class = \text{stay-home})}\\
&P(car = broken|class = \text{stay-home}) = \frac{count(car = broken \wedge class = \text{stay-home})}{count(class = \text{stay-home})}
\end{align}
$$

Remember that the $\wedge$ symbol is just a shorthand for conjunction (AND).
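
The counting behind these formulas can be sketched without pandas. The example below assumes the encoded (weather, class) pairs are typed in by hand in dataset order; `pairs` and `conditional_probability` are hypothetical names used only for illustration.

```python
# Encoded (weather, class) pairs for the ten rows:
# sunny = 1, rainy = 0; go-out = 1, stay-home = 0.
pairs = [(1, 1), (0, 1), (1, 1), (1, 1), (1, 1),
         (0, 0), (0, 0), (1, 0), (1, 0), (0, 0)]


def conditional_probability(pairs, value, cls):
    """Hypothetical helper: count(input = value AND class = cls) / count(class = cls)."""
    in_class = [v for v, c in pairs if c == cls]  # input values of the rows in class cls
    return in_class.count(value) / len(in_class)


p_sunny_given_go_out = conditional_probability(pairs, 1, 1)
p_rainy_given_go_out = conditional_probability(pairs, 0, 1)
print(p_sunny_given_go_out, p_rainy_given_go_out)  # 0.8 0.2
```

Filtering on the class value first, rather than relying on the position of the rows, keeps the count correct even if the dataset is shuffled.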

Write a conditional probability function `calc_cond_prob` that takes the encoded dataset `edf`, the input variable name as a string `i_name`, an input variable value `i_val`, and a class value `c` as parameters, and returns the conditional probability `cond_prob`.

In [6]:
def calc_cond_prob(edf, i_name, i_val, c):
    # Keep only the rows that belong to class c, wherever they sit in the dataset.
    class_rows = edf[edf['Class'] == c]
    class_value_counts = len(class_rows)                      # count(class = c)

    # Count the rows in that class where the input variable equals i_val.
    counts_of_ival = (class_rows[i_name] == i_val).sum()      # count(input = i_val AND class = c)

    cond_prob = float(counts_of_ival / class_value_counts)
    return cond_prob

Then run the following cell. You should get:

#### Weather Input Variable

$$
\begin{align}
&P(weather = sunny|class = \text{go-out}) = 0.8\\
&P(weather = rainy|class = \text{go-out}) = 0.2\\
&P(weather = sunny|class = \text{stay-home}) = 0.4\\
&P(weather = rainy|class = \text{stay-home}) = 0.6
\end{align} 
$$

#### Car Input Variable

$$
\begin{align}
&P(car = working|class = \text{go-out}) = 0.8\\
&P(car = broken|class = \text{go-out}) = 0.2\\
&P(car = working|class = \text{stay-home}) = 0.2\\
&P(car = broken|class = \text{stay-home}) = 0.8\\
\end{align}
$$

In [7]:
print("Weather Input Variable")
print(calc_cond_prob(edf,'Weather',1,1))
print(calc_cond_prob(edf,'Weather',0,1))
print(calc_cond_prob(edf,'Weather',1,0))
print(calc_cond_prob(edf,'Weather',0,0))

print("\nCar Input Variable")
print(calc_cond_prob(edf,'Car',1,1))
print(calc_cond_prob(edf,'Car',0,1))
print(calc_cond_prob(edf,'Car',1,0))
print(calc_cond_prob(edf,'Car',0,0))

Weather Input Variable
0.8
0.2
0.4
0.6

Car Input Variable
0.8
0.2
0.2
0.8


We now have everything we need to make predictions using the Naive Bayes model.

## 3. Make Predictions with Naive Bayes
We can make predictions using Bayes' Theorem, defined and explained in the previous chapter.

$$
P(h|d) = \frac{P(d|h) \times P(h)}{P(d)}
$$

In fact, we don't need the full probability to predict the most likely class for a new data instance. The denominator $P(d)$ is the same for every class, so we only need the numerator; the class that gives the largest response is the predicted output.

$$
MAP(h) = max(P(d|h) \times P(h))
$$
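
To see why the denominator can be dropped, the sketch below compares the raw numerators with their normalised versions for the instance (sunny, working), using probability values from this practicum. It assumes that $P(d)$ is taken as the sum of the numerators over both classes, which is what the Naive Bayes model itself implies; the variable names are illustrative.

```python
# Numerators P(d|h) * P(h) for the instance (sunny, working).
responses = {
    "go-out": 0.8 * 0.8 * 0.5,
    "stay-home": 0.4 * 0.2 * 0.5,
}

# Assumption: P(d) is the sum of the numerators over all classes.
evidence = sum(responses.values())

# Dividing by P(d) rescales every class by the same factor.
posteriors = {h: r / evidence for h, r in responses.items()}

best_by_response = max(responses, key=responses.get)
best_by_posterior = max(posteriors, key=posteriors.get)
print(best_by_response, best_by_posterior)
```

Both lookups select the same class, because dividing every response by the same positive number does not change which one is largest.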

Let's take the first record from our dataset and use our learned model to predict which class it belongs to.

First instance: 
$$
\begin{align}
&weather = sunny,\ &&car = working \\
&or \\
&weather = 1,\ &&car = 1
\end{align}
$$

We plug the probabilities from our model in for both classes and calculate the response. Starting with the response for the output $\text{go-out}$, we multiply the conditional probabilities together and then multiply the result by the probability of any instance belonging to the class.

$$
\begin{align}
\text{go-out}\ =\ &P(weather = sunny|class = \text{go-out})\ \times \\
&P(car = working|class = \text{go-out})\ \times \\
&P(class = \text{go-out})
\end{align}
$$

or

$$
\begin{align}
&\text{go-out} = 0.8 \times 0.8 \times 0.5 \\
&\text{go-out} = 0.32
\end{align}
$$


We can perform the same calculation for the stay-home case:

$$
\begin{align}
\text{stay-home} = &P(weather = sunny|class = \text{stay-home}) \times \\
&P(car = working|class = \text{stay-home}) \times \\
&P(class = \text{stay-home})
\end{align}
$$

or

$$
\begin{align}
&\text{stay-home} = 0.4 \times 0.2 \times 0.5 \\
&\text{stay-home} = 0.04
\end{align}
$$

We can see that 0.32 is greater than 0.04, therefore we predict go-out for this instance, which is correct. We can repeat this operation for the entire dataset, as follows:

|Weather|Car|Class|go-out?|stay-home?|Prediction|
|-------|---|-----|-------|----------|----------|
|sunny|working|go-out|0.32|0.04|go-out          |
|rainy|broken|go-out|0.02|0.24|stay-home        |
|sunny|working|go-out|0.32|0.04|go-out          |
|sunny|working|go-out|0.32|0.04|go-out          |
|sunny|working|go-out|0.32|0.04|go-out          |
|rainy|broken|stay-home|0.02|0.24|stay-home     |
|rainy|broken|stay-home|0.02|0.24|stay-home     |
|sunny|working|stay-home|0.32|0.04|go-out       |
|sunny|broken|stay-home|0.08|0.16|stay-home     |
|rainy|broken|stay-home|0.02|0.24|stay-home     |
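
The responses in the table can be reproduced from the probabilities computed in section 2. The sketch below assumes those probabilities are stored by hand in nested dictionaries; `class_prob`, `cond_prob`, `response`, and `predict` are hypothetical names for illustration, not the functions the practicum asks for.

```python
# Probabilities from sections 2.1 and 2.2, keyed by class (1 = go-out, 0 = stay-home).
class_prob = {1: 0.5, 0: 0.5}

# Layout: cond_prob[variable][class][value]
cond_prob = {
    "Weather": {1: {1: 0.8, 0: 0.2}, 0: {1: 0.4, 0: 0.6}},
    "Car": {1: {1: 0.8, 0: 0.2}, 0: {1: 0.2, 0: 0.8}},
}


def response(weather, car, cls):
    """Hypothetical helper: P(weather | cls) * P(car | cls) * P(cls)."""
    return (cond_prob["Weather"][cls][weather]
            * cond_prob["Car"][cls][car]
            * class_prob[cls])


def predict(weather, car):
    """Return the class (1 or 0) with the larger response."""
    return max((1, 0), key=lambda cls: response(weather, car, cls))


# The three distinct input combinations that occur in the dataset.
for weather, car in [(1, 1), (0, 0), (1, 0)]:
    go_out = round(response(weather, car, 1), 2)
    stay_home = round(response(weather, car, 0), 2)
    print(weather, car, go_out, stay_home, predict(weather, car))
```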

Tying everything together, create a function `calc_prediction` that takes in the encoded dataset `edf` and returns the updated dataset `pdf`, containing both the responses of each class and the predicted class, similar to the table above. Your dataset should still be made up of numerical values.

For your reference:
- Weather: sunny = 1, rainy = 0
- Car: working = 1, broken = 0
- Class: go-out = 1, stay-home = 0

In [8]:
def calc_response(edf, weather, car, c):
    """
    Naive Bayes response of class c for a single instance:
    P(Weather = weather | c) * P(Car = car | c) * P(c).
    calc_response is used in the calc_prediction function below.
    """
    class_prob = calc_class_prob(edf)[c]   # P(class = c), taken from the data instead of a hard-coded 0.5
    return (calc_cond_prob(edf, 'Weather', weather, c)
            * calc_cond_prob(edf, 'Car', car, c)
            * class_prob)

In [9]:
def calc_prediction(edf):
    go_out = []
    stay_home = []

    ## --------------- score every row against both classes -----------------
    # Only the input columns are used here; the true label in 'Class' is never read.
    for index, row in edf.iterrows():
        go_out.append(round(calc_response(edf, row['Weather'], row['Car'], 1), 2))
        stay_home.append(round(calc_response(edf, row['Weather'], row['Car'], 0), 2))

    ## ------------------- add 3 new columns to a copy of the edf dataframe -----------------
    updated_edf = edf.copy(deep=True)
    updated_edf['go-out?'] = go_out
    updated_edf['stay-home?'] = stay_home
    # Predict go-out (1) when its response is larger, otherwise stay-home (0).
    updated_edf['Prediction'] = np.where(updated_edf["go-out?"] > updated_edf["stay-home?"], 1, 0)
    return updated_edf

Run the cell below to update the dataset using the `calc_prediction` function you have created.

In [10]:
pdf = calc_prediction(edf)
print(pdf.head())

   Weather  Car  Class  go-out?  stay-home?  Prediction
0        1    1      1     0.32        0.04           1
1        0    0      1     0.02        0.24           0
2        1    1      1     0.32        0.04           1
3        1    1      1     0.32        0.04           1
4        1    1      1     0.32        0.04           1


Write a function `calc_accuracy` that takes in the updated dataset `pdf`, compares the predicted class values with the actual class values, and returns the accuracy as a percentage `accuracy`.

In [11]:
def calc_accuracy(pdf):
    # True where the prediction matches the actual class, False otherwise.
    matches = pdf['Class'] == pdf['Prediction']
    # Percentage of matching rows; also works when every prediction is correct (or none is).
    accuracy = float(100 * matches.sum() / len(pdf))
    return accuracy

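If we tally up the predictions against the actual class values, we should get an accuracy of 80%, which is good given that the dataset contains conflicting examples (rows with identical inputs but different class labels) that no classifier based on these two inputs can get right simultaneously. Run the cell below.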

In [12]:
print(calc_accuracy(pdf))

80.0
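
The same tally can be checked by hand with plain lists. This sketch assumes the actual and predicted classes are read off the prediction table in section 3; `actual`, `predicted`, and `misclassified` are illustrative names.

```python
# Actual and predicted classes for the ten rows (go-out = 1, stay-home = 0).
actual = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 0, 1, 1, 1, 0, 0, 1, 0, 0]

# Count the rows where prediction and label agree.
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = 100 * correct / len(actual)

# Row positions (0-based) where the prediction disagrees with the label.
misclassified = [i for i, (a, p) in enumerate(zip(actual, predicted)) if a != p]
print(accuracy, misclassified)  # 80.0 [1, 7]
```

The two misclassified rows are the conflicting examples: row 1 (rainy, broken, go-out) and row 7 (sunny, working, stay-home) carry the opposite label to the other rows with the same inputs.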


## 4. Summary

In this chapter you discovered exactly how to implement Naive Bayes from scratch. You learned:
- How to work with categorical data with Naive Bayes.
- How to calculate class probabilities from training data.
- How to calculate conditional probabilities from training data.
- How to use a learned Naive Bayes model to make predictions on new data.

You now know how to implement Naive Bayes from scratch for categorical data.