Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = "Pasit Tiwawongrut"
ID = "st122442"

---

# Lab 06: Generative Classifiers: Naive Bayes

As discussed in class, a naive Bayes classifier works as follows.

We are given a feature space $\mathcal{X}$ that could be discrete, continuous, or a mix of discrete and continuous features.

We are also given a discrete set $\mathcal{Y} = \{ y_1, \ldots, y_K \}$ of exhaustive, mutually exclusive classes thought to be the provenance of the dataset elements $\mathbf{x} \in \mathcal{X}$.

What does it mean to say that the features come from the classes? Specifically, we mean that the observation $\mathbf{x}^{(i)}$ is a random vector statistically dependent on a random variable $y^{(i)}$.

This means that $\mathbf{x}^{(i)} \sim p(\mathbf{x} \mid y^{(i)})$, where $y^{(i)} \in \mathcal{Y}$ and $y^{(i)} \sim p(y)$. $p(y)$, the *prior*, is assumed to be a multinomial distribution over the possible classes $\mathcal{Y}$, but the class conditional distribution $p(\mathbf{x} \mid y)$ can be an arbitrarily complicated joint distribution over the feature space that is different for each $y \in \mathcal{Y}$.

The random process just described, in which a $y$ is first sampled from a multinomial distribution over $\mathcal{Y}$ then an $\mathbf{x}$ is sampled from an arbitrary joint distribution over $\mathcal{X}$ that is conditioned on $y$, is a *generative model* for the provenance of our dataset. It may not be a fully accurate model for how nature gave us our dataset, but we nevertheless assume that it is.
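
The two-step sampling process can be sketched in a few lines. This is only an illustration: the class names, the priors, and the Gaussian parameters below are made up, and a single continuous feature stands in for the full feature vector.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical parameters: class priors and per-class Gaussian (mean, std).
priors = {"y1": 0.7, "y2": 0.3}
gaussians = {"y1": (0.0, 1.0), "y2": (5.0, 1.0)}

def sample_pair():
    # Step 1: draw the class from the prior p(y).
    y = random.choices(list(priors), weights=list(priors.values()))[0]
    # Step 2: draw the feature from the class conditional p(x | y).
    mu, sigma = gaussians[y]
    x = random.gauss(mu, sigma)
    return x, y

samples = [sample_pair() for _ in range(1000)]
frac_y1 = sum(1 for _, y in samples if y == "y1") / len(samples)
print("fraction of y1:", frac_y1)
```

With enough samples, the fraction of each class should approach its prior.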

With all those preliminaries, now, given a new sample $\mathbf{x}$ assumed to have been generated by the same generative process, we estimate, for each $y \in \mathcal{Y}$, the *posterior* $p(y \mid \mathbf{x})$ using the following strategy:
$$\begin{eqnarray}
p(y \mid \mathbf{x} ; \theta) & = & \frac{p(\mathbf{x} \mid y ; \theta) p(y ; \theta)}{p(\mathbf{x} ; \theta)} \\
& \propto & p(\mathbf{x} \mid y ; \theta) p(y ; \theta) \\
& = & p(y ; \theta) \prod_j p(x_j \mid y, x_1, \ldots, x_{j-1} ; \theta) \\
& \approx & p(y ; \theta) \prod_j p(x_j \mid y ; \theta).
\end{eqnarray}$$

The critical assumption here (besides the story of the generative random process assumed to be the origin of our dataset) is the *naive Bayes assumption* that the approximation

$$ p(x_j \mid y, x_1, \ldots, x_{j-1} ; \theta) \approx p(x_j \mid y ; \theta)$$

is close enough to reality to be useful. Note that if the features are truly *conditionally independent of each other given the class*, then the naive Bayes classifier is an exact probabilistic classifier.
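
To make the factorization concrete, here is a minimal sketch with two classes and two binary features. The probability tables and the names `prior`, `cond`, and `posterior` are invented for illustration and are not part of the lab code.

```python
# Hypothetical parameters for a two-class problem with two binary features.
prior = {"a": 0.6, "b": 0.4}
# cond[y][j] is the assumed probability that feature j equals 1 given class y.
cond = {"a": [0.9, 0.2], "b": [0.3, 0.7]}

def posterior(x):
    """Return p(y | x) for every class under the naive Bayes assumption."""
    scores = {}
    for y in prior:
        score = prior[y]
        for j, xj in enumerate(x):
            pj = cond[y][j]
            # p(x_j | y) for a Bernoulli feature
            score *= pj if xj == 1 else 1.0 - pj
        scores[y] = score
    # Dividing by the sum plays the role of p(x) in Bayes' rule.
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}

post = posterior([1, 0])
print(post)
```

Because the normalizer $p(\mathbf{x})$ is the same for every class, it can be skipped when only the most probable class is needed.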

So now we know that the parameters of a naive Bayes classifier will always include the parameters $\phi_1, \ldots, \phi_K$ of the multinomial distribution over $\mathcal{Y}$ plus the individual conditional feature distributions $p(x_j \mid y)$. If $x_j$ is discrete, we can represent this conditional distribution using a simple table of probabilities, and if $x_j$ is continuous, we represent the conditional distribution using the parameters of some continuous distribution such as a univariate Gaussian, univariate exponential, etc.
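
As a rough sketch of what these parameters look like in practice, the snippet below estimates them from a handful of made-up records with one discrete feature (a color) and one continuous feature. The record values and the layout of `params` are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev

# Toy records: (discrete feature, continuous feature, class). Values are made up.
records = [
    ("red", 1.0, "pos"), ("red", 1.4, "pos"), ("blue", 0.8, "pos"),
    ("blue", 3.1, "neg"), ("blue", 2.9, "neg"), ("red", 3.3, "neg"),
    ("blue", 3.0, "neg"),
]

by_class = defaultdict(list)
for color, value, label in records:
    by_class[label].append((color, value))

params = {}
for label, rows in by_class.items():
    colors = Counter(c for c, _ in rows)
    values = [v for _, v in rows]
    params[label] = {
        # phi_y: fraction of records in this class
        "prior": len(rows) / len(records),
        # table of p(color | y) for the discrete feature
        "color_table": {c: n / len(rows) for c, n in colors.items()},
        # Gaussian parameters of p(value | y) for the continuous feature
        "mean": mean(values),
        "std": pstdev(values),
    }

print(params)
```

`pstdev` is the population standard deviation, which matches the default behavior of `np.std`.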

In today's lab, we will use naive Bayes to perform diabetes diagnosis and text classification.

## Example 1: Diabetes classification

In this example we predict whether a patient with specific diagnostic measurements has diabetes or not. The target classes $\mathcal{Y} = \{ y_1, y_2 \}$ correspond respectively to "no diabetes" and "diabetes." As the features are continuous, we will model their conditional probabilities $p(x_j \mid y ; \theta)$ as univariate Gaussians with means $\mu_{j,y}$ and standard deviations $\sigma_{j,y}$.
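
For reference, the standard univariate Gaussian density, written with the symbols defined above, is

$$ p(x_j \mid y ; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma_{j,y}} \exp\left( -\frac{(x_j - \mu_{j,y})^2}{2\sigma_{j,y}^2} \right). $$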

The data are originally from the U.S. National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) and are available from [Kaggle](https://www.kaggle.com/uciml/pima-indians-diabetes-database).

In [1]:
import csv
import math
import random
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

### Data manipulation

First we have some functions to read the dataset, split it into train and test, and partition it according to target class ($y$).

In [2]:
# Load data from CSV file
def loadCsv(filename):
    data_raw = pd.read_csv(filename)
    headers = data_raw.columns
    dataset = data_raw.values
    return dataset, headers

# Split dataset into test and train with given ratio
def splitDataset(test_size,*arrays,**kwargs):
    return train_test_split(*arrays,test_size=test_size,**kwargs)

# Separate training data according to target class.
# Return a dictionary in which the keys are the possible target values
# and the values are the lists of data records for each class.

def data_split_byClass(dataset):
    Xy = {}
    for i in range(len(dataset)):
        datapair = dataset[i]
        # datapair[-1] (the last column) is the target class for this record.
        # Check if we already have this value as a key in the return array
        if (datapair[-1] not in Xy):
            # Add class as key
            Xy[datapair[-1]] = []
        # Append this record to array of records for this class key
        Xy[datapair[-1]].append(datapair)
    return Xy

### Model training

Next we have some functions used for training the model. Parameters include the conditional means and standard deviations for each feature as well as the parameters of the multinomial distribution (more specifically the Bernoulli distribution since this is a binary classification problem) over $\mathcal{Y}$.

In [3]:
# Calculate Gaussian parameters mu and sigma for each attribute over a dataset

def get_gaussian_parameters(X, y):
    parameters = {}
    unique_y = np.unique(y)
    for uy in unique_y:
        mean = np.mean(X[y==uy], axis=0)
        std = np.std(X[y==uy], axis=0)
        py = y[y==uy].size / y.size
        parameters[uy] = { 'prior': py, 'mean': mean, 'std': std }
    return parameters, unique_y

def calculateProbability(x, mu, sigma):
    sigma = np.diag(sigma**2)
    x = x.reshape(-1,1)
    mu = mu.reshape(-1,1)
    exponent = np.exp(-1/2*(x-mu).T@np.linalg.inv(sigma)@(x-mu))
    return ((1/(np.sqrt(((2*np.pi)**x.size)*np.linalg.det(sigma))))*exponent)[0,0]
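
`calculateProbability()` evaluates a multivariate Gaussian whose covariance matrix is diagonal, which should give the same value as multiplying the univariate Gaussians of the naive Bayes factorization. The pure-Python sketch below checks this on made-up numbers. The helper names `univariate_gaussian` and `diagonal_gaussian` are hypothetical and are not used elsewhere in the lab.

```python
import math

def univariate_gaussian(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def diagonal_gaussian(x, mu, sigma):
    """Multivariate Gaussian density with covariance diag(sigma_1^2, ..., sigma_n^2)."""
    n = len(x)
    det = math.prod(s ** 2 for s in sigma)  # determinant of a diagonal matrix
    quad = sum((xi - mi) ** 2 / s ** 2 for xi, mi, s in zip(x, mu, sigma))
    return math.exp(-0.5 * quad) / math.sqrt((2 * math.pi) ** n * det)

# Made-up values for three features.
x = [1.0, 2.5, -0.3]
mu = [0.5, 2.0, 0.0]
sigma = [1.0, 0.5, 2.0]

product = math.prod(univariate_gaussian(xi, mi, s) for xi, mi, s in zip(x, mu, sigma))
joint = diagonal_gaussian(x, mu, sigma)
print(product, joint)
```

The two printed values should agree up to floating-point rounding.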

### Model testing

Next are some functions for testing the model on a test set and computing its accuracy. Note that `predict_one()` allows us to calculate $p(y \mid \mathbf{x} ; \theta)$ with or without the prior, i.e., as either

$$ p(y \mid \mathbf{x} ; \theta) \propto p(\mathbf{x} \mid y ; \theta),$$

which corresponds to the assumption that the priors $p(y)$ are equal, i.e., $p(y) = \frac{1}{K}$ for all $y$, or

$$ p(y \mid \mathbf{x} ; \theta) \propto p(\mathbf{x} \mid y ; \theta) p(y ; \theta),$$

which correctly includes the prior.

In [4]:
# Calculate class conditional probabilities for given input data vector

def predict_one(x, parameters, unique_y, prior=True):
    probabilities = []
    for key in parameters.keys():
        probabilities.append(calculateProbability(x, parameters[key]['mean'], parameters[key]['std']) * (parameters[key]['prior']**(float(prior))))
    probabilities = np.array(probabilities)
    return unique_y[np.argmax(probabilities)]

def getPredictions(X, parameters, unique_y,prior=True):
    predictions = []
    for i in range(X.shape[0]):
        predictions.append(predict_one(X[i],parameters,unique_y,prior))
    return np.array(predictions)

# Get accuracy for test set

def getAccuracy(y, y_pred):
    correct = len(y[y==y_pred])
    return correct/y.size
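
Multiplying many small densities can underflow when there are many features. A common remedy is to sum log densities instead, which leaves the arg max unchanged. The sketch below is one way to do this, not part of the lab code: `predict_one_log`, `log_gaussian`, and the toy parameter values are assumptions, although the dictionary layout mirrors the one returned by `get_gaussian_parameters()`.

```python
import math

def log_gaussian(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def predict_one_log(x, parameters, use_prior=True):
    """Return the class with the largest log posterior score."""
    best_class, best_score = None, -math.inf
    for label, p in parameters.items():
        # log p(y), or 0 when the prior is assumed uniform
        score = math.log(p["prior"]) if use_prior else 0.0
        # sum over features of log p(x_j | y)
        score += sum(log_gaussian(xj, m, s) for xj, m, s in zip(x, p["mean"], p["std"]))
        if score > best_score:
            best_class, best_score = label, score
    return best_class

# Made-up parameters for two classes and two features.
toy_parameters = {
    0.0: {"prior": 0.65, "mean": [110.0, 30.0], "std": [25.0, 7.0]},
    1.0: {"prior": 0.35, "mean": [140.0, 35.0], "std": [30.0, 7.0]},
}
print(predict_one_log([100.0, 28.0], toy_parameters))
print(predict_one_log([170.0, 40.0], toy_parameters))
```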

### Experiment

Here we load the diabetes dataset, split it into training and test data, train a Gaussian NB model, and test the model on the test set.

In [5]:
# Load dataset

filename = 'diabetes.csv'
dataset, headers = loadCsv(filename)
#print(headers)
#print(np.array(dataset)[0:5,:])

# Split into training and test

X_train,X_test,y_train,y_test = splitDataset(0.4,dataset[:,:-1],dataset[:,-1])
print("Total =",len(dataset),"Train =", len(X_train),"Test =",len(X_test))

# Train model

parameters, unique_y = get_gaussian_parameters(X_train,y_train)

# Test model with the prior

prediction = getPredictions(X_test,parameters,unique_y)
print("Accuracy with Prior =",getAccuracy(y_test,prediction))

# Test model without the prior

prediction = getPredictions(X_test,parameters,unique_y,prior = False)
print("Accuracy without Prior =",getAccuracy(y_test,prediction))

Total = 768 Train = 460 Test = 308
Accuracy with Prior = 0.7305194805194806
Accuracy without Prior = 0.7272727272727273


### Exercise: In-lab / take-home work (20 points)

Find out what proportion of the records in your dataset are positive vs. negative. Can we conclude that $p(y=1) = p(y=0)$? If not, we should use the version of the model in which we use the priors $p(y=1)$ and $p(y=0)$. Explain whether and how this improves the result.


In [None]:
# YOUR CODE HERE
raise NotImplementedError()

**Explain whether you can conclude that $p(y=1) = p(y=0)$. If not, add the priors $p(y=1)$ and $p(y=0)$ to your NB model and explain how doing so improves the result.**


## Example 2: Text classification

This example has been adapted from a post by Jaya Aiyappan, available at
[Analytics Vidhya](https://medium.com/analytics-vidhya/naive-bayes-classifier-for-text-classification-556fabaf252b#:~:text=The%20Naive%20Bayes%20classifier%20is,time%20and%20less%20training%20data).

We will generate a small dataset of sentences that are classified as either "statements" or "questions."

We will assume that the occurrence and placement of words within a sentence are independent of each other, so the sentence "this is my book" will have the same features as the sentence "is this my book." We will also ignore case when comparing words.
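
This order-independence can be illustrated with a small bag-of-words sketch. The helper `bag_of_words` is hypothetical and uses `collections.Counter` rather than the functions defined later in the lab.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count lowercase, space-separated words, ignoring their order."""
    return Counter(sentence.lower().split())

a = bag_of_words("this is my book")
b = bag_of_words("Is this my book")
print(a == b)
```

The two sentences produce identical word counts, so a model built on these features cannot tell them apart.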

In [None]:
# Generate text data for two classes, "statement" and "question"

text_train = [['This is my novel book', 'statement'],
              ['this book has more than one author', 'statement'],
              ['is this my book', 'question'],
              ['They are novels', 'statement'],
              ['have you read this book', 'question'],
              ['who is the novels author', 'question'],
              ['what are the characters', 'question'],
              ['This is how I bought the book', 'statement'],
              ['I like fictional characters', 'statement'],
              ['what is your favorite book', 'question']]

text_test = [['this is the book', 'statement'], 
             ['who are the novels characters', 'question'], 
             ['is this the author', 'question'],
            ['I like apples']]

# Load training and test data into pandas data frames

training_data = pd.DataFrame(text_train, columns= ['sentence', 'class'])
print(training_data)
print('\n------------------------------------------\n')
testing_data = pd.DataFrame(text_test, columns= ['sentence', 'class'])
print(testing_data)


In [None]:
# Partition training data by class

stmt_docs = [train['sentence'] for index,train in training_data.iterrows() if train['class'] == 'statement']
question_docs = [train['sentence'] for index,train in training_data.iterrows() if train['class'] == 'question']
all_docs = [train['sentence'] for index,train in training_data.iterrows()]

# Get word frequencies for each sentence and class

def get_words(text):
    # Initialize word list
    words = []
    # Loop through each sentence in input array
    for text_row in text:
        # Check the number of words. Assume each word is separated by a blank space
        # so that the number of words is the number of blank spaces + 1
        number_of_spaces = text_row.count(' ')
        # Loop through the sentence and get words between blank spaces.
        for i in range(number_of_spaces):
            words.append([text_row[:text_row.index(' ')].lower()])
            text_row = text_row[text_row.index(' ')+1:]
        # The remaining text is the last word; lowercase it like the others
        words.append([text_row.lower()])
    return np.unique(words)

# Get frequency of each word in each document

def get_doc_word_frequency(words, text):  
    word_freq_table = np.zeros((len(text),len(words)), dtype=int)
    i = 0
    for text_row in text:
        # Insert extra space between each pair of words to prevent
        # partial match of words
        text_row_temp = ''
        for idx, val in enumerate(text_row):
            if val == ' ':
                text_row_temp = text_row_temp + '  '
            else:
                text_row_temp = text_row_temp + val.lower()
        text_row = ' ' + text_row_temp + ' '
        j = 0
        for word in words: 
            word = ' ' + word + ' '
            freq = text_row.count(word)
            word_freq_table[i,j] = freq
            j = j + 1
        i = i + 1
    
    return word_freq_table
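
For comparison, a similar term-document table can be built more compactly by splitting each sentence into whole words and counting them. This is a sketch under the same assumption that words are separated by single spaces. `doc_word_counts` is a hypothetical helper and returns plain lists rather than a NumPy array.

```python
from collections import Counter

def doc_word_counts(vocabulary, docs):
    """Return one row of word counts per document, with columns ordered like vocabulary."""
    table = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        # Counter returns 0 for words that are absent from the document.
        table.append([counts[word] for word in vocabulary])
    return table

docs = ["This is my novel book", "They are novels"]
vocabulary = sorted({w for d in docs for w in d.lower().split()})
print(vocabulary)
print(doc_word_counts(vocabulary, docs))
```

Counting whole tokens avoids partial matches such as "novel" inside "novels" without having to pad the text with extra spaces.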

In [None]:
# Get word frequencies for statement documents

word_list_s = get_words(stmt_docs)
word_freq_table_s = get_doc_word_frequency(word_list_s, stmt_docs)
tdm_s = pd.DataFrame(word_freq_table_s, columns=word_list_s)
print(tdm_s)

In [None]:
# Get word frequencies over all statement documents

freq_list_s = word_freq_table_s.sum(axis=0) 
freq_s = dict(zip(word_list_s,freq_list_s))
print(freq_s)

In [None]:
# Get word frequencies for question documents

word_list_q = get_words(question_docs)
word_freq_table_q = get_doc_word_frequency(word_list_q, question_docs)
tdm_q = pd.DataFrame(word_freq_table_q, columns=word_list_q)
print(tdm_q)

In [None]:
# Get word frequencies over all question documents

freq_list_q = word_freq_table_q.sum(axis=0) 
freq_q = dict(zip(word_list_q,freq_list_q))
print(freq_q)
print(freq_list_s)
print(freq_list_q)

In [None]:
# Get word probabilities for statement class
a = 1
prob_s = []
for count in freq_list_s:
    #print(word, count)
    prob_s.append((count+a)/(sum(freq_list_s)+len(freq_list_s)*a))
prob_s.append(a/(sum(freq_list_s)+len(freq_list_s)*a))
    
# Get word probabilities for question class

prob_q = []
for count in freq_list_q:
    prob_q.append((count+a)/(sum(freq_list_q)+len(freq_list_q)*a))
prob_q.append(a/(sum(freq_list_q)+len(freq_list_q)*a))   
    
    
print('Probability of words for "statement" class \n')
print(dict(zip(word_list_s, prob_s)))
print('------------------------------------------- \n')
print('Probability of words for "question" class \n')
print(dict(zip(word_list_q, prob_q)))
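
The smoothing arithmetic used in the cell above can be isolated in a small function. The sketch below follows the same formula, adding `a` to every count and `a` times the vocabulary size to the denominator. The function name `smoothed_probabilities` and the toy counts are assumptions for illustration.

```python
def smoothed_probabilities(counts, a=1):
    """Add-a smoothing over a dict of word counts for one class.

    Returns the per-word probabilities and the probability assigned
    to any word that was never seen with this class.
    """
    denominator = sum(counts.values()) + a * len(counts)
    probs = {word: (n + a) / denominator for word, n in counts.items()}
    unseen = a / denominator
    return probs, unseen

# Made-up counts for illustration.
toy_counts = {"book": 3, "this": 2, "is": 1}
probs, unseen = smoothed_probabilities(toy_counts)
print(probs, unseen)
```

Here the probabilities of the seen words already sum to 1, so the extra value for unseen words is a floor that keeps the product nonzero rather than part of a normalized distribution.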

In [None]:
# Calculate prior for one class

def prior(className):    
    denominator = len(stmt_docs) + len(question_docs)
    
    if className == 'statement':
        numerator =  len(stmt_docs)
    else:
        numerator =  len(question_docs)
        
    return np.divide(numerator,denominator)
    
# Calculate class conditional probability for a sentence
    
def classCondProb(sentence, className):
    words = get_words(sentence)
    prob = 1
    for word in words:
        if className == 'statement':
            idx = np.where(word_list_s == word)
            prob = prob * prob_s[np.array(idx)[0,0]]
        else:
            idx = np.where(word_list_q == word)
            prob = prob * prob_q[np.array(idx)[0,0]]   
    
    return prob

# Predict class of a sentence

def predict(sentence):
    prob_statement = classCondProb(sentence, 'statement') * prior('statement')
    prob_question = classCondProb(sentence, 'question') * prior('question')
    if  prob_statement > prob_question:
        return 'statement'
    else:
        return 'question'

### In-lab exercise: Laplace smoothing

Run the code below and figure out why it fails.

When a word does not appear with a specific class in the training data, its class-conditional probability is 0, and we are unable to
get a reasonable probability for that class.

Research Laplace smoothing, and modify the code above to implement it (adding 1 to the frequency of every word, so that a word with frequency 0 is treated as having a frequency of 1).
Run the modified code on the test set.

In [None]:
test_docs = list([test['sentence'] for index,test in testing_data.iterrows()])
print('Getting prediction for %s"' % test_docs[0])
predict(test_docs[0])


### Exercise 1.1 (10 points)

Explain why it failed and how to solve the problem.

Explanation here! (Double click to explain)

### Exercise 1.2 (20 points)

Modify the code to make it work using Laplace smoothing. Include the functions `prior()`, `classCondProb()`, and `predict()`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Test function: Do not remove
test_docs = list([test['sentence'] for index,test in testing_data.iterrows()])

for sentence in test_docs:
    print('Getting prediction for %s"' % sentence)
    print(predict(sentence))
    
print("success!")
# End Test function

**Expected result**:\
Getting prediction for this is the book"\
question\
Getting prediction for who are the novels characters"\
question\
Getting prediction for is this the author"\
question\
Getting prediction for I like apples"\
statement\
success!

### Take home exercise

Find a more substantial text classification dataset, clean up the documents, and build your NB classifier. Write a brief report on your in-lab and take home exercises and results here.