# Naïve Bayes Classification

In this lab, we'll explore some of our familiar datasets in the context of the naïve Bayes classifier. The Naïve Bayes classifier is popular in machine learning due to its interpretability, as well as its good performance on relatively small datasets.

Naïve Bayes maximizes the posterior probability of the target variable, $y$, being of a particular class $C$, given a set of observations, $X$.

<p>
<center>
$\hat{y} = \mathrm{argmax}_y \left(P(y=C | X = {x_1, x_2 \ldots, x_n})\right)$
</center>
</p>

Applying Bayes' Rule, it is possible to show that maximizing this quantity is equivalent to:

<p>
<center>
$\hat{y} = \mathrm{argmax}_y \left(P(y=C) \cdot \Pi_{i=1}^n p(x_i|y))\right)$
</center>
</p>

Performing this computation requires assuming a particular probability distribution over the features.  Common assumptions about statistical distributions include:
* **Gaussian (Normal) Distribution:** Appropriate for continuous valued features
* **Bernoulli Distribution:** Appropriate for categorical features
* **Multinomial Distribution:** Appropriate for counts or range features.


In this lab, we will explore some basic probabilities using the Chicago Crime Dataset, implement a version of a Naïve Bayes classifier by hand, and then apply some of the libraries in `sklearn` to perform some more sophisticated classifications.

In [1]:
# Import Pandas and Numpy
import numpy as np
import pandas as pd

# Import Plotting Libraries
import matplotlib.pyplot as plt
%matplotlib inline

## Import the Chicago Crime Data

Let's revisit the crime data that we've been exploring in some of our previous labs, and clean it up as before.

### Load the Dataset

In [2]:
df = pd.read_csv("../../data/chicago-crimes-2019.csv.gz", compression='gzip')
df.head(1)

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,11922110,JC547456,12/15/2019 03:40:00 AM,039XX W NORTH AVE,1020,ARSON,BY FIRE,VEHICLE NON-COMMERCIAL,False,False,...,26.0,23.0,9,1149951.0,1910348.0,2019,04/27/2020 03:48:23 PM,41.909907,-87.724578,"(41.909907002, -87.724577987)"


### Clean the Features

Let's drop the records that have NaN values, making sure there aren't too many of them.

In [3]:
print("Found {} NaN community area records.".format(df['Community Area'].isna().sum()))
df.dropna(inplace=True)

Found 4 NaN community area records.


### Transform the Features

Let's turn arest into a 0/1 binary variable

Let's also re-create the hour of the day column, since that's a feature we like to use.

In [4]:
df['Hour'] = pd.to_datetime(df['Date']).dt.hour
df['Community Area'] = df['Community Area'].astype(int)
df['Hour'] = df['Hour'].astype(int)
df['Arrest'] = df['Arrest'].astype(int)

Let's suppose that, as in the Support Vector Machine lab, we want to build a binary classifier that predicts arrests based on hour of the day and community area.

In this case, the target variable $y$ is the `Arrest` column, and our features are the `Community Area` and `Hour` Columns. Let's create a dataframe to work just with those columns within this section of the notebook.

In [5]:
df_backup = df.copy()
df = df.loc[:,['Hour', 'Community Area', 'Arrest']]

## Bayes Classification by Hand

To understand the intuition behind the Naïve Bayes Classifier, let's first compute all of our posterior probabilities by hand.

### Prior Probabilities for Each Class

We first need to compute our prior probabilities for the target variable `Arrest`. Below, we've computed the complement directly just as a sanity check on our data; we could have used $1-p$ for the complement, but this helps us see that everything is accounted for.

In [6]:
arrests = df[df['Arrest']==1]['Arrest'].count()
no_arrests = df[df['Arrest']==0]['Arrest'].count()
total = df['Arrest'].count()

# Probability of Arrest
p_y = [arrests / total,
        no_arrests / total]

print("P(y=0) = {:10.4f}\nP(y=1) = {:10.4f}".format(p_y[0],p_y[1]))

P(y=0) =     0.2139
P(y=1) =     0.7861


### Feature Likelihood for Each Class

Now we compute the feature likelihood for each class. A typical way to do this is with parameter estimation, by assuming a distribution of the features (e.g., gaussian, multinomial). Here we'll start with something much simpler: We'll assume the likelihood $P(x1,x2|y)$ is simply the values in the dataset itself. 

In other words, we'll just say that the likelihood of the probability for a given community area and hour, given arrest or no arrest, is simply the number of observations of a (community area, hour) tuple in the event of arrest (or no arrest), divided by the total number of arrest (or no arrest) events.

There are $24 \times 77 \times 2$ such values to compute: One for each neighborhood area (77) and hour of the day (24), for each class of outcomes (arrest, no arrest):

In [7]:
# We could have used ranges for these, but best to derive directly from the data and not make any assumptions.
ca = np.sort(df['Community Area'].unique())
hr = np.sort(df['Hour'].unique())

arrest = [no_arrests,arrests]
likelihood = [[ [0 for col in range(2)] for col in range(24)] for row in range(78)]

`likelihood[c][h][a]` contains $P(\mathrm{community\; area, hour} | \mathrm{arrest})$

In [8]:
# Take a subset of the dataframe for y=1 since we'll need this a lot
for c in ca:
    for h in hr:
        for a in (0,1):
            likelihood[c][h][a] = df[(df['Community Area']==c) & \
                                     (df['Hour']==h) & \
                                     (df['Arrest']==a)].count()[0] / arrest[a]

Sanity check that conditional probabilities sum to 1.

In [9]:
s = 0
for c in ca:
    for h in hr:
        s = s + likelihood[c][h][0]
s

0.9999999999999976

### Predictions

Given an incident in a given community area and hour, can we predict whether an arrest was made? This is probably not a very good classifier, given the limited number of features we're using, but we'll demonstrate for the sake of example.

Essentially, given a community area and hour, we need to find the value of $y$ (i.e., arrest) that is most likely, given the observed $X$ (hour, community area).

This can be performed by picking the value of `likelihood[c][h][a]` $\cdot$ `p_y[i]` (for i=0 or 1) (the prior probability of arrest).

#### Hyde Park, 10 a.m.

Hyde Park is area 41.  If the incident is at 1000h, will an arrest take place?

In [10]:
likelihood[41][10][0] * p_y[0] < likelihood[41][10][1] * p_y[1]

False

In [11]:
s = 0
h = 10
n = 41
for a in (0,1):
    s = s + likelihood[n][h][a]

likelihood[n][h][1]/s * p_y[1]

0.12880841153745723

#### Austin, 10 a.m.

Austin is area 25. If the incident is at 1000h, will an arrest take place?

In [12]:
likelihood[25][10][0] * p_y[0] < likelihood[25][10][1] * p_y[1]

True

## Naïve Bayes with Scikit-Learn

Now we will perform the same computation with Python's `sklearn` library. We'll use the `ComplementNB` class, which is an adaptation of the multinomial Naïve Bayes classifier that deals better with imbalanced datasets. The technique is described in more detail in this [paper](https://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf).

In [13]:
# Import the Naïve Bayes Classifiers. (We'll only use Multinomial for now.)
from sklearn.naive_bayes import ComplementNB
nb = ComplementNB() 

features = df.loc[:,['Community Area', 'Hour']].values
target = df['Arrest'].values

#### Hyde Park, 10 a.m.

In [14]:
nb.fit(features,target)
nb.predict([[41,10]])[0]

0

#### Austin, 10 a.m.

In [15]:
nb.predict([[25,10]])[0]

1