
# **HW7: Naive Bayes Classifier**

### **Steven Yoo**

**Attention: This is an individual assignment.**


For this week's homework we are going to explore one new classification technique:

  - Naive Bayes

We are reusing the version of the Melbourne housing data set from HW5, to predict the housing type as one of three possible categories:

  - 'h' house
  - 'u' duplex
  - 't' townhouse

In addition to building our own Naive Bayes classifier, we are going to compare the performance of our classifier to the [Gaussian Naive Bayes Classifier](https://scikit-learn.org/stable/modules/naive_bayes.html#gaussian-naive-bayes) available in the scikit-learn library.


In [2]:
# These are the libraries you will use for this assignment
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
import calendar
from sklearn.naive_bayes import GaussianNB # The only thing in scikit-learn you can use in this assignment

# Starting off loading a training set and setting a variable for the target column, "Type"
df_melb = pd.read_csv('https://gist.githubusercontent.com/yanyanzheng96/81b236aecee57f6cf65e60afd865d2bb/raw/56ddb53aa90c26ab1bdbfd0b8d8229c8d08ce45a/melb_data_train.csv')
target_col = 'Type'

## **Q1 - Fix a column of data to be numeric**
If we inspect our dataframe, `df_melb`, using the `dtypes` attribute, we see that the column "Date" is an object.  However, we think this column might contain useful information, so we want to convert it to [seconds since epoch](https://en.wikipedia.org/wiki/Unix_time). Use only the existing imported libraries to create a new column "unixtime". Be careful: the date strings in the file might have some non-uniform formatting that you have to fix first.  Print out the min and max epoch time to check your work.  Drop the original "Date" column.

In [3]:
# normalize_date accepts a date string as shown in the df_melb 'Date' column
# and returns the date in a standardized format (day/month/four-digit year)

def normalize_date(d):
  (day,month,year) = d.split('/')
  if len(year) == 2:
      year = "20" + year
  return( day + "/" + month + "/" + year)

In [4]:
df_melb['Date'] = df_melb['Date'].apply( normalize_date )
df_melb['unixtime'] = df_melb['Date'].apply(lambda x: calendar.timegm(time.strptime(x,"%d/%m/%Y")))
df_melb = df_melb.drop(columns="Date")

print("The min unixtime is {:d} and the max unixtime is {:d}".format(df_melb['unixtime'].min(),df_melb['unixtime'].max()))

The min unixtime is 1454544000 and the max unixtime is 1506124800
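
As a sanity check, the printed epoch values can be converted back to calendar dates using only the standard library. The sketch below assumes the two values printed above; the helper name `epoch_to_date` is hypothetical and not part of the assignment.

```python
import calendar
import time

def epoch_to_date(seconds):
    """Convert seconds since the Unix epoch (UTC) to a 'dd/mm/YYYY' string."""
    return time.strftime("%d/%m/%Y", time.gmtime(seconds))

# Values copied from the printed min/max output
print(epoch_to_date(1454544000))  # 04/02/2016
print(epoch_to_date(1506124800))  # 23/09/2017

# Round trip: parsing the date string again gives back the same epoch value
assert calendar.timegm(time.strptime("04/02/2016", "%d/%m/%Y")) == 1454544000
```

Both dates are plausible sale dates for this data set, which suggests the two-digit years were expanded correctly.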


In [5]:
# make sure unixtime column was added and date column was dropped correctly
df_melb.head()

Unnamed: 0,Rooms,Type,Price,Distance,Postcode,Bathroom,Car,Landsize,BuildingArea,YearBuilt,unixtime
0,2,h,399000,8.7,3032,1,1.0,904,53.0,1985.0,1462579200
1,3,h,1241000,13.9,3165,1,1.0,643,,,1472342400
2,2,u,550000,3.0,3067,1,1.0,1521,,,1499472000
3,3,u,691000,8.4,3072,1,1.0,170,,,1498262400
4,2,u,657500,4.6,3122,1,1.0,728,73.0,1965.0,1479513600


## **Q2 Calculating the prior probabilities**
Calculate the prior probabilities for each possible "Type" in `df_melb` and populate a dictionary, `dict_priors`, where the key is the possible "Type" values and the value is the prior probabilities. Show the dictionary. Do not hardcode the possible values of "Type".  Don't forget about [value counts](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html).

In [6]:
dict_priors = (df_melb['Type'].value_counts() / len(df_melb)).to_dict()

# print the priors
dict_priors

{'h': 0.452, 'u': 0.418, 't': 0.13}
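
An equivalent way to get the priors is to let `value_counts` do the division through its `normalize` parameter. The sketch below uses a small made-up series (`toy_types`) rather than `df_melb`, and also checks that the priors sum to 1.

```python
import pandas as pd

# Made-up labels, for illustration only
toy_types = pd.Series(['h', 'h', 'u', 't'], name='Type')

# normalize=True returns relative frequencies instead of raw counts
toy_priors = toy_types.value_counts(normalize=True).to_dict()
print(toy_priors)

# Priors are a probability distribution, so they should sum to 1
assert abs(sum(toy_priors.values()) - 1.0) < 1e-12
```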

## **Q3 Create a model for the distribution of all of the numeric attributes**
For each class and each attribute, calculate the sample mean and sample standard deviation.  You should store the model in a nested dictionary, `dict_nb_model`, such that `dict_nb_model['h']['Rooms']` is a tuple containing the mean and standard deviation for the target Type 'h' and the attribute 'Rooms'.  Show the model using the `display` function. You should ignore entries that are `NaN` in the mean and [standard deviation](https://pandas.pydata.org/docs/reference/api/pandas.Series.std.html) calculation.

In [7]:
df_melb.head()

Unnamed: 0,Rooms,Type,Price,Distance,Postcode,Bathroom,Car,Landsize,BuildingArea,YearBuilt,unixtime
0,2,h,399000,8.7,3032,1,1.0,904,53.0,1985.0,1462579200
1,3,h,1241000,13.9,3165,1,1.0,643,,,1472342400
2,2,u,550000,3.0,3067,1,1.0,1521,,,1499472000
3,3,u,691000,8.4,3072,1,1.0,170,,,1498262400
4,2,u,657500,4.6,3122,1,1.0,728,73.0,1965.0,1479513600


In [8]:
dict_nb_model = dict()
for target in dict_priors.keys():
    this_df = df_melb[ df_melb[target_col] == target ]

    dict_nb_model[target] = dict()
    for col in df_melb.columns:
        if col != target_col:
            dict_nb_model[target][col] = (this_df[col].mean(skipna=True), this_df[col].std(skipna=True))
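
The same per-class statistics can also be obtained with a single `groupby`, which may be easier to cross-check against the nested loop. This is a minimal sketch on a made-up frame (`toy`); the names `stats` and `toy_model` are illustrative only.

```python
import pandas as pd

# Made-up data with one missing value
toy = pd.DataFrame({
    'Type':  ['h', 'h', 'h', 'u', 'u'],
    'Rooms': [3, 4, 5, 2, 2],
    'Car':   [2.0, None, 1.0, 1.0, 1.0],
})

# Per-class mean and sample standard deviation; NaN entries are skipped
stats = toy.groupby('Type').agg(['mean', 'std'])

# Rebuild the nested {class: {attribute: (mean, std)}} layout
toy_model = {
    target: {col: (stats.loc[target, (col, 'mean')], stats.loc[target, (col, 'std')])
             for col in ['Rooms', 'Car']}
    for target in stats.index
}
print(toy_model['h']['Rooms'])
```

Note that class 'u' has identical `Rooms` values in this toy frame, so its standard deviation is 0; a Gaussian density is undefined in that case, which is worth keeping in mind for small classes.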

In [9]:
display(dict_nb_model)

{'h': {'Rooms': (3.269911504424779, 0.725826420112775),
  'Price': (1189022.3451327435, 586296.5794417894),
  'Distance': (12.086725663716816, 7.397501132737295),
  'Postcode': (3103.8982300884954, 98.35750345419703),
  'Bathroom': (1.5619469026548674, 0.6720871086493074),
  'Car': (1.7777777777777777, 0.932759177140425),
  'Landsize': (932.9646017699115, 3830.7934157687173),
  'BuildingArea': (156.2433962264151, 54.62662837301433),
  'YearBuilt': (1954.900826446281, 32.4618763471547),
  'unixtime': (1485717578.761062, 13838562.05060146)},
 'u': {'Rooms': (2.0430622009569377, 0.5908453859944255),
  'Price': (634207.1770334928, 217947.32866736987),
  'Distance': (8.760287081339714, 5.609778714430756),
  'Postcode': (3120.4545454545455, 87.18475679946476),
  'Bathroom': (1.1818181818181819, 0.4222815154866222),
  'Car': (1.1483253588516746, 0.47231993860297056),
  'Landsize': (436.23444976076553, 1394.3403794653257),
  'BuildingArea': (83.85585585585585, 45.95943801516662),
  'YearBuilt'

## **Q4 Write a function that calculates the probability of a Gaussian**

Given the mean ($\mu$), standard deviation ($\sigma$), and an observed point, `x`, return the probability density.  
Use the formula $p(x) = \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}$ ([wiki](https://en.wikipedia.org/wiki/Normal_distribution)).  You should use [numpy's exp](https://numpy.org/doc/stable/reference/generated/numpy.exp.html) function in your solution.

In [10]:
def get_p(mu, sigma, x):
    gaussian_prob = (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp((-0.5 * ((x - mu) / sigma) ** 2))

    return gaussian_prob

In [11]:
# Test it
p = get_p(0, 2, 0.5)
p

0.19333405840142462
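
Strictly speaking, the value returned by `get_p` is a probability density, not a probability, so it can exceed 1 for narrow distributions. The sketch below repeats the `get_p` definition so it runs on its own, then checks numerically that the density still integrates to roughly 1. The grid and step size are arbitrary choices for illustration.

```python
import numpy as np

def get_p(mu, sigma, x):
    return (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mu) / sigma) ** 2)

# A narrow Gaussian has a peak density above 1
peak = get_p(0, 0.1, 0)
print(peak > 1)

# Riemann sum over +/- 10 standard deviations approximates the total area
xs = np.linspace(-1.0, 1.0, 20001)
dx = xs[1] - xs[0]
area = np.sum(get_p(0, 0.1, xs)) * dx
print(area)
```

Because Naive Bayes only compares scores across classes, using densities in place of probabilities does not change which class wins.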


## **Q5 Write the Naive Bayes classifier function**
The Naive Bayes classifier function, `nb_class`, should take as parameters the prior probability dictionary, `dict_priors`, the dictionary containing all of the Gaussian distribution information for each attribute, `dict_nb_model`, and a single observation row (a series generated from iterrows) of the test dataframe. It should return a single target classification. For this problem, all of our attributes are numeric and modeled as Gaussians, so we don't worry about categorical data. Make sure to skip attributes that do not have a value in the observation.  Do not hardcode the possible classification types.



In [12]:
def nb_class( dict_priors, dict_nb_model, observation):
    dict_score = dict()
    for target in dict_priors.keys():

        # Initialize the dictionary with the prior probability
        dict_score[target] = dict_priors[target]

        for attribute in dict_nb_model[target]:
            if not np.isnan(observation[attribute]):
                cond_prob = get_p( dict_nb_model[target][attribute][0], dict_nb_model[target][attribute][1], observation[attribute])
                dict_score[target] *= cond_prob

    max_class = max(dict_score, key=dict_score.get)
    return max_class
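
Multiplying many small densities can underflow to 0 when there are many attributes. A common alternative is to add log-densities instead. The sketch below is a hypothetical variant (`nb_class_log` and `log_gaussian` are names introduced here, not part of the assignment) that should pick the same class as `nb_class` whenever the product does not underflow.

```python
import numpy as np
import pandas as pd

def log_gaussian(mu, sigma, x):
    """Log of the Gaussian density, computed without calling exp."""
    return -np.log(sigma * np.sqrt(2 * np.pi)) - 0.5 * ((x - mu) / sigma) ** 2

def nb_class_log(dict_priors, dict_nb_model, observation):
    """Hypothetical log-space variant of nb_class."""
    dict_score = {}
    for target, prior in dict_priors.items():
        score = np.log(prior)
        for attribute, (mu, sigma) in dict_nb_model[target].items():
            value = observation[attribute]
            if not pd.isna(value):  # skip missing attributes
                score += log_gaussian(mu, sigma, value)
        dict_score[target] = score
    return max(dict_score, key=dict_score.get)

# Made-up two-class model with two attributes
toy_priors = {'a': 0.5, 'b': 0.5}
toy_model = {'a': {'x': (0.0, 1.0), 'y': (0.0, 1.0)},
             'b': {'x': (5.0, 1.0), 'y': (5.0, 1.0)}}
obs = pd.Series({'x': 4.5, 'y': np.nan})  # 'y' is missing and gets skipped
print(nb_class_log(toy_priors, toy_model, obs))  # b
```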

## **Q6 Calculate the accuracy using Naive Bayes classifier function on the test set**
Load the test set from file, convert date to unix time and drop the date column, classify each row using your `nb_class`, and then show the accuracy on the test set.

In [13]:
df_test = pd.read_csv('https://gist.githubusercontent.com/yanyanzheng96/c3d53303cebbd986b166591d19254bac/raw/94eb3b2d500d5f7bbc0441a8419cd855349d5d8e/melb_data_test.csv')
df_test['Date'] = df_test['Date'].apply( normalize_date )
df_test['unixtime'] = df_test['Date'].apply(lambda x: calendar.timegm(time.strptime(x,"%d/%m/%Y")))
df_test = df_test.drop(columns="Date")

In [14]:
predictions = []

for (indx,row) in df_test.iterrows():
    this_obs = row.drop(index='Type')
    this_pred = nb_class(dict_priors, dict_nb_model,this_obs)
    predictions.append(this_pred)

In [15]:
acc = (predictions == df_test['Type']).sum()/len(predictions)

In [16]:
print('Accuracy is {:.2f}%'.format(acc*100))

Accuracy is 57.00%
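
A single accuracy number hides which classes are being confused. One way to look further is a confusion table built with `pd.crosstab`. The sketch below uses made-up labels (`toy_true`, `toy_pred`), not the actual predictions from this notebook.

```python
import pandas as pd

# Made-up labels, for illustration only
toy_true = pd.Series(['h', 'h', 'u', 't', 'u'], name='actual')
toy_pred = pd.Series(['h', 'u', 'u', 'h', 'u'], name='predicted')

# Fraction of matching labels
toy_acc = (toy_pred == toy_true).mean()
print('Accuracy is {:.2f}%'.format(toy_acc * 100))  # Accuracy is 60.00%

# Rows are actual classes, columns are predicted classes
confusion = pd.crosstab(toy_true, toy_pred)
print(confusion)
```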


## **Q7 Use `scikit-learn` to do the same thing**

Now that we understand the inner workings of the Naive Bayes algorithm, let's compare our results to [scikit-learn's Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html) implementation. Use the [GaussianNB](https://scikit-learn.org/stable/modules/naive_bayes.html#gaussian-naive-bayes) to train using the `df_melb` dataframe and test using the `df_test` dataframe. Remember to split `df_melb` into a `df_X` with the numerical attributes, and a `s_y` with the target column. On the `df_melb` frame you will have to fill the empty attributes via imputation since the scikit-learn library cannot handle missing values.  Use the same method you used in the last homework (filling the training data with the mean of the non-NaN values).

In [17]:
df_test.head()

Unnamed: 0,Rooms,Type,Price,Distance,Postcode,Bathroom,Car,Landsize,BuildingArea,YearBuilt,unixtime
0,3,h,1116000,17.9,3192,1,2.0,610,,,1498867200
1,3,h,2030000,11.2,3186,2,2.0,366,,,1472342400
2,3,h,1480000,10.7,3187,2,2.0,697,143.0,1925.0,1478476800
3,3,u,1203500,12.3,3166,2,2.0,311,127.0,2000.0,1495843200
4,3,h,540000,14.7,3030,2,2.0,353,135.0,2011.0,1504396800


In [18]:
# Imputation: learn the column means from the training data
dict_imputation = dict()
for col in df_melb.columns:
    if col != target_col:
        dict_imputation[col] = df_melb[col].mean(skipna=True)
        # Assign the result back; fillna(inplace=True) on a selected column
        # may not modify the frame in newer pandas versions
        df_melb[col] = df_melb[col].fillna(dict_imputation[col])

# Imputation: apply the training means to the test data
for col in df_test.columns:
    if col != target_col:
        df_test[col] = df_test[col].fillna(dict_imputation[col])

s_test = df_test[target_col]
df_test = df_test.drop(columns=[target_col])

df_X = df_melb.drop( columns = [target_col])
s_y = df_melb[target_col]
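
The key point of the imputation step is that the fill values are learned from the training data only and then reused on the test data. A compact way to express this, sketched here on made-up frames (`toy_train`, `toy_test`), is to pass the Series of training means to `DataFrame.fillna`, which matches values to columns by label.

```python
import numpy as np
import pandas as pd

# Made-up frames, for illustration only
toy_train = pd.DataFrame({'Car': [1.0, 3.0, np.nan],
                          'YearBuilt': [1950.0, np.nan, 2000.0]})
toy_test = pd.DataFrame({'Car': [np.nan, 2.0],
                         'YearBuilt': [np.nan, 1980.0]})

# Column means of the training data; NaN entries are skipped by default
train_means = toy_train.mean()

toy_train_filled = toy_train.fillna(train_means)
toy_test_filled = toy_test.fillna(train_means)  # reuse the TRAINING means
print(toy_test_filled)
```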

In [19]:
gnb = GaussianNB()
y_pred = gnb.fit(df_X, s_y).predict(df_test)

In [20]:
acc = (y_pred == s_test).sum()/len(y_pred)
print('Accuracy is {:.2f}%'.format(acc*100))

Accuracy is 37.00%


## **Q8 Do you think imputation hurt or helped the classifier?**

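Imputation appears to have hurt the classifier. Accuracy fell from 57.00% with our `nb_class`, which skips missing attributes, to 37.00% with `GaussianNB` on the imputed data, a drop of 20 percentage points. The two runs also differ in implementation, so imputation may not be the only cause of the gap.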
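
One way to see why imputation can change predictions: skipping a missing attribute leaves every class score untouched, while an imputed value multiplies each class score by a different density. The sketch below uses the 'BuildingArea' means and standard deviations for 'h' and 'u', rounded from the Q3 output. The fill value `imputed_value` is a hypothetical number chosen for illustration, not the actual training mean.

```python
import numpy as np

def get_p(mu, sigma, x):
    return (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mu) / sigma) ** 2)

# (mean, std) for 'BuildingArea', rounded from the Q3 model output
building_area = {'h': (156.24, 54.63), 'u': (83.86, 45.96)}
imputed_value = 120.0  # hypothetical fill value, NOT the actual training mean

# Density each class assigns to the imputed value
factors = {t: get_p(mu, sigma, imputed_value) for t, (mu, sigma) in building_area.items()}

# A ratio different from 1 means the imputed value shifts the class scores
ratio = factors['u'] / factors['h']
print(factors, ratio)
```

With these assumed numbers the imputed value slightly favours 'u' over 'h', even though the observation carried no information about building area.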