Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [4]:
NAME = "Sunny Kumar Tuladhar"
ID = "st122336"

---

# Lab 06: Generative Classifiers: Naive Bayes

As discussed in class, a naive Bayes classifier works as follows.

We are given a feature space $\mathcal{X}$ that could be discrete, continuous, or a mix of discrete and continuous features.

We are also given a discrete set $\mathcal{Y} = \{ y_1, \ldots, y_K \}$ of exhaustive, mutually exclusive classes thought to be the provenance of the dataset elements $\mathbf{x} \in \mathcal{X}$.

What does it mean to say that the features come from the classes? Specifically, we mean that the observation $\mathbf{x}^{(i)}$ is a random vector statistically dependent on a random variable $y^{(i)}$.

This means that $\mathbf{x}^{(i)} \sim p(\mathbf{x} \mid y^{(i)})$, where $y^{(i)} \in \mathcal{Y}$ and $y^{(i)} \sim p(y)$. $p(y)$, the *prior*, is assumed to be a multinomial distribution over the possible classes $\mathcal{Y}$, but the class conditional distribution $p(\mathbf{x} \mid y)$ can be an arbitrarily complicated joint distribution over the feature space that is different for each $y \in \mathcal{Y}$.

The random process just described, in which a $y$ is first sampled from a multinomial distribution over $\mathcal{Y}$ then an $\mathbf{x}$ is sampled from an arbitrary joint distribution over $\mathcal{X}$ that is conditioned on $y$, is a *generative model* for the provenance of our dataset. It may not be a fully accurate model for how nature gave us our dataset, but we nevertheless assume that it is.
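The two-step generative process just described can be sketched in a few lines. This is only an illustration: the prior `phi`, the class-conditional means and standard deviations, and the choice of Gaussian class conditionals are all hypothetical values picked for the example, not parameters from the lab.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters, chosen only for illustration
phi = np.array([0.65, 0.35])           # prior p(y) over K = 2 classes
means = np.array([[0.0, 0.0],          # class-conditional means, one row per class
                  [2.0, 1.0]])
stds = np.array([[1.0, 1.0],           # class-conditional standard deviations
                 [0.5, 2.0]])

def sample(n):
    # Step 1: draw each label y from the prior p(y)
    y = rng.choice(len(phi), size=n, p=phi)
    # Step 2: draw x from p(x | y); here an axis-aligned Gaussian per class
    x = rng.normal(loc=means[y], scale=stds[y])
    return x, y

X, y = sample(1000)
print(X.shape, np.bincount(y) / y.size)
```

The empirical class frequencies printed at the end should be close to `phi`, which is what fitting the prior from data relies on.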

With all those preliminaries, now, given a new sample $\mathbf{x}$ assumed to have been generated by the same generative process, we estimate, for each $y \in \mathcal{Y}$, the *posterior* $p(y \mid \mathbf{x})$ using the following strategy:
$$\begin{eqnarray}
p(y \mid \mathbf{x} ; \theta) & = & \frac{p(\mathbf{x} \mid y ; \theta) p(y ; \theta)}{p(\mathbf{x} ; \theta)} \\
& \propto & p(\mathbf{x} \mid y ; \theta) p(y ; \theta) \\
& = & p(y ; \theta) \prod_j p(x_j \mid y, x_1, \ldots, x_{j-1} ; \theta) \\
& \approx & p(y ; \theta) \prod_j p(x_j \mid y ; \theta).
\end{eqnarray}$$

The critical assumption here (besides the story of the generative random process assumed to be the origin of our dataset) is the *naive Bayes assumption* that the approximation

$$ p(x_j \mid y, x_1, \ldots, x_{j-1} ; \theta) \approx p(x_j \mid y ; \theta)$$

is close enough to reality to be useful. Note that if the features are truly *conditionally independent of each other given the class*, then the naive Bayes classifier is an exact probabilistic classifier.

So now we know that the parameters of a naive Bayes classifier will always include the parameters $\phi_1, \ldots, \phi_K$ of the multinomial distribution over $\mathcal{Y}$ plus the individual conditional feature distributions $p(x_j \mid y)$. If $x_j$ is discrete, we can represent this conditional distribution using a simple table of probabilities, and if $x_j$ is continuous, we represent the conditional distribution using the parameters of some continuous distribution such as a univariate Gaussian, univariate exponential, etc.
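As a small sketch of how these parameters combine, the example below mixes one discrete feature (stored as a probability table) with one continuous feature (stored as a Gaussian mean and standard deviation). All numbers, and the names `phi`, `table`, `gauss`, and `posterior`, are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

# Hypothetical parameters for K = 2 classes, for illustration only
phi = {0: 0.6, 1: 0.4}                       # prior p(y)
table = {0: {'a': 0.7, 'b': 0.3},            # p(x_1 | y) for a discrete feature
         1: {'a': 0.2, 'b': 0.8}}
gauss = {0: (0.0, 1.0), 1: (2.0, 1.0)}       # (mean, std) of p(x_2 | y)

def normal_pdf(x, mu, sigma):
    # Univariate Gaussian density
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def posterior(x1, x2):
    # Unnormalized posterior: p(y) * p(x_1 | y) * p(x_2 | y)
    scores = {y: phi[y] * table[y][x1] * normal_pdf(x2, *gauss[y]) for y in phi}
    total = sum(scores.values())             # p(x), the normalizing constant
    return {y: s / total for y, s in scores.items()}

post = posterior('b', 1.0)
print(post)
```

At `x2 = 1.0` the two Gaussians have the same density, so the posterior is decided entirely by the prior and the discrete table.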

In today's lab, we will use naive Bayes to perform diabetes diagnosis and text classification.

## Example 1: Diabetes classification

In this example we predict whether a patient with specific diagnostic measurements has diabetes or not. The target classes $\mathcal{Y} = \{ y_1, y_2 \}$ correspond respectively to "no diabetes" and "diabetes." As the features are continuous, we will model their conditional probabilities $p(x_j \mid y ; \theta)$ as univariate Gaussians with means $\mu_{j,y}$ and standard deviations $\sigma_{j,y}$.

The data are originally from the U.S. National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) and are available from [Kaggle](https://www.kaggle.com/uciml/pima-indians-diabetes-database).

In [5]:
import csv
import math
import random
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

### Data manipulation

First we have some functions to read the dataset, split it into train and test, and partition it according to target class ($y$).

In [6]:
# Load data from CSV file
def loadCsv(filename):
    data_raw = pd.read_csv(filename)
    headers = data_raw.columns
    dataset = data_raw.values
    return dataset, headers

# Split dataset into test and train with given ratio
def splitDataset(test_size,*arrays,**kwargs):
    return train_test_split(*arrays,test_size=test_size,**kwargs)

# Separate training data according to target class
# Return key value pairs array in which keys are possible target variable values
# and values are the data records.

def data_split_byClass(dataset):
    Xy = {}
    for i in range(len(dataset)):
        datapair = dataset[i]
        # datapair[-1] (the last column) is the target class for this record.
        # Check if we already have this value as a key in the return array
        if (datapair[-1] not in Xy):
            # Add class as key
            Xy[datapair[-1]] = []
        # Append this record to array of records for this class key
        Xy[datapair[-1]].append(datapair)
    return Xy

### Model training

Next we have some functions used for training the model. Parameters include the conditional means and standard deviations for each feature as well as the parameters of the multinomial distribution (more specifically the Bernoulli distribution since this is a binary classification problem) over $\mathcal{Y}$.

In [7]:
# Calculate Gaussian parameters mu and sigma for each attribute over a dataset

def get_gaussian_parameters(X, y):
    parameters = {}
    unique_y = np.unique(y)
    for uy in unique_y:
        mean = np.mean(X[y==uy], axis=0)
        std = np.std(X[y==uy], axis=0)
        py = y[y==uy].size / y.size
        parameters[uy] = { 'prior': py, 'mean': mean, 'std': std }
    return parameters, unique_y

def calculateProbability(x, mu, sigma):
    sigma = np.diag(sigma**2)
    x = x.reshape(-1,1)
    mu = mu.reshape(-1,1)
    exponent = np.exp(-1/2*(x-mu).T@np.linalg.inv(sigma)@(x-mu))
    return ((1/(np.sqrt(((2*np.pi)**x.size)*np.linalg.det(sigma))))*exponent)[0,0]
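`calculateProbability()` evaluates a multivariate Gaussian with a diagonal covariance matrix. Under the naive Bayes assumption this should equal the product of the univariate Gaussian densities of the individual features. The sketch below checks that numerically and also shows a log-space variant, which is a common way to avoid underflow when there are many features. The function names and the test vector are our own, introduced only for this check.

```python
import numpy as np

def gaussian_product(x, mu, sigma):
    # Product over features of univariate normal densities N(x_j; mu_j, sigma_j^2)
    dens = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return np.prod(dens)

def gaussian_log_likelihood(x, mu, sigma):
    # Same quantity in log space; a sum stays finite where a product can underflow
    return np.sum(-0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi))

def multivariate_diag(x, mu, sigma):
    # Matrix form with covariance diag(sigma^2), mirroring calculateProbability
    cov = np.diag(sigma ** 2)
    d = (x - mu).reshape(-1, 1)
    norm = np.sqrt((2 * np.pi) ** x.size * np.linalg.det(cov))
    return (np.exp(-0.5 * d.T @ np.linalg.inv(cov) @ d) / norm)[0, 0]

# Made-up feature vector and parameters
x = np.array([1.0, 2.0, 0.5])
mu = np.array([0.0, 2.5, 1.0])
sigma = np.array([1.0, 0.5, 2.0])

p = gaussian_product(x, mu, sigma)
print(np.isclose(p, multivariate_diag(x, mu, sigma)))
print(np.isclose(np.log(p), gaussian_log_likelihood(x, mu, sigma)))
```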

### Model testing

Next are some functions for testing the model on a test set and computing its accuracy. Note that `predict_one()` allows us to calculate $p(y \mid \mathbf{x} ; \theta)$ with or without the prior, i.e., as either

$$ p(y \mid \mathbf{x} ; \theta) \propto p(\mathbf{x} \mid y ; \theta),$$

which corresponds to the assumption that the priors $p(y)$ are equal, i.e., $p(y) = \frac{1}{K}$ for all $y$, or

$$ p(y \mid \mathbf{x} ; \theta) \propto p(\mathbf{x} \mid y ; \theta) p(y ; \theta),$$

which correctly includes the prior.
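To see how the prior can change a decision, consider the sketch below. It uses the same exponent trick as `predict_one()`: raising the prior to the power `float(prior)` keeps it when the flag is `True` and replaces it with 1 when the flag is `False`. The likelihood values are invented for illustration; the priors are rounded class proportions of the kind we will estimate from the training data.

```python
def score(likelihood, prior, use_prior=True):
    # float(True) == 1.0 keeps the prior; float(False) == 0.0 turns it into 1
    return likelihood * prior ** float(use_prior)

# Hypothetical likelihoods p(x | y) for one sample, and priors p(y)
likelihood = {0: 0.020, 1: 0.030}
prior = {0: 0.65, 1: 0.35}

with_prior = max(prior, key=lambda y: score(likelihood[y], prior[y], True))
without_prior = max(prior, key=lambda y: score(likelihood[y], prior[y], False))
print(with_prior, without_prior)
```

Here the likelihood alone favors class 1, but once the prior is included the more common class 0 wins. Samples near the decision boundary are the ones whose predictions can differ between the two versions of the model.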

In [8]:
# Calculate class conditional probabilities for given input data vector

def predict_one(x, parameters, unique_y, prior=True):
    probabilities = []
    for key in parameters.keys():
        probabilities.append(calculateProbability
                             (x, parameters[key]['mean'],
                              parameters[key]['std']) * (parameters[key]['prior']**(float(prior))))
    probabilities = np.array(probabilities)
    return unique_y[np.argmax(probabilities)]

def getPredictions(X, parameters, unique_y,prior=True):
    predictions = []
    for i in range(X.shape[0]):
        predictions.append(predict_one(X[i],parameters,unique_y,prior))
    return np.array(predictions)

# Get accuracy for test set

def getAccuracy(y, y_pred):
    correct = len(y[y==y_pred])
    return correct/y.size

### Experiment

Here we load the diabetes dataset, split it into training and test data, train a Gaussian NB model, and test the model on the test set.

In [9]:
# Load dataset

filename = 'diabetes.csv'
dataset, headers = loadCsv(filename)
#print(headers)
#print(np.array(dataset)[0:5,:])

# Split into training and test

X_train,X_test,y_train,y_test = splitDataset(0.4,dataset[:,:-1],dataset[:,-1])
print("Total =",len(dataset),"Train =", len(X_train),"Test =",len(X_test))

# Train model

parameters, unique_y = get_gaussian_parameters(X_train,y_train)

# Test model with the prior

prediction = getPredictions(X_test,parameters,unique_y)
print("Accuracy with Prior =",getAccuracy(y_test,prediction))

# Test model without the prior

prediction = getPredictions(X_test,parameters,unique_y,prior = False)
print("Accuracy without Prior =",getAccuracy(y_test,prediction))

Total = 768 Train = 460 Test = 308
Accuracy with Prior = 0.737012987012987
Accuracy without Prior = 0.7305194805194806


###  Exercise In lab / take home work (20 points)

Find out what proportion of the records in your dataset are positive vs. negative.  Can we conclude that $p(y=1) = p(y=0)$? If not, we should use the version of the model in which we use the priors $p(y=1)$ and $p(y=0)$. Explain whether/how it improves the result.


In [10]:
ratio0 = len(X_train[y_train==0])/len(y_train)
ratio1 = len(X_train[y_train==1])/len(y_train)
print('p(y = 0) is',np.round(ratio0,2))
print('p(y = 1) is',np.round(ratio1,2))

p(y = 0) is 0.65
p(y = 1) is 0.35


In [11]:
# YOUR CODE HERE

# Train model

parameters, unique_y = get_gaussian_parameters(X_train,y_train)

prediction = getPredictions(X_test,parameters,unique_y,prior = True)
Acc_Pr = getAccuracy(y_test,prediction)
print("Accuracy with Prior =",np.round(Acc_Pr,4) * 100 )

# Test model
prediction = getPredictions(X_test,parameters,unique_y,prior = False)
Acc_noPr = getAccuracy(y_test,prediction)
print("Accuracy without Prior =",np.round(Acc_noPr,4) * 100)

print('difference',np.round((Acc_Pr - Acc_noPr) * 100,4) )

#raise NotImplementedError()

Accuracy with Prior = 73.7
Accuracy without Prior = 73.05
difference 0.6494


We can see that the accuracy increased slightly (by about 0.65 percentage points) when the prior is used.




**Explain whether you can conclude that $p(y=1) = p(y=0)$? If not, add
the priors $p(y=1)$ and $p(y=0)$ to your NB model and explain how it improves the result.**


We cannot conclude $p(y=1) = p(y=0)$ because the class proportions in the training set are very different: $p(y=0)$ is 0.65 and $p(y=1)$ is 0.35, so class 0 is almost twice as common as class 1.
Adding the priors improves the result slightly (73.70% vs. 73.05% accuracy on the test set) because the prediction then also takes the class distribution of the training data into account.

## Example 2: Text classification

This example has been adapted from a post by Jaya Aiyappan, available at
[Analytics Vidhya](https://medium.com/analytics-vidhya/naive-bayes-classifier-for-text-classification-556fabaf252b#:~:text=The%20Naive%20Bayes%20classifier%20is,time%20and%20less%20training%20data).

We will generate a small dataset of sentences that are classified as either "statements" or "questions."

We will assume that the occurrence and placement of words within a sentence are independent of each other, so the sentence "this is my book" will have the same features as the sentence "is this my book." We will treat words without case sensitivity.
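The bag-of-words assumption can be illustrated with a quick sketch. It uses `collections.Counter` rather than the helper functions defined later in this lab, and `bag_of_words` is a name we introduce only for this example.

```python
from collections import Counter

def bag_of_words(sentence):
    # Lowercase, split on whitespace, and count each word; order is discarded
    return Counter(sentence.lower().split())

a = bag_of_words('This is my book')
b = bag_of_words('is this my book')
print(a == b)
```

Both sentences map to the same word counts, so a classifier built on these features cannot tell them apart by word order alone.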

In [12]:
# Generate text data for two classes, "statement" and "question"

text_train = [['This is my novel book', 'statement'],
              ['this book has more than one author', 'statement'],
              ['is this my book', 'question'],
              ['They are novels', 'statement'],
              ['have you read this book', 'question'],
              ['who is the novels author', 'question'],
              ['what are the characters', 'question'],
              ['This is how I bought the book', 'statement'],
              ['I like fictional characters', 'statement'],
              ['what is your favorite book', 'question']]

text_test = [['this is the book', 'statement'], 
             ['who are the novels characters', 'question'], 
             ['is this the author', 'question'],
            ['I like apples']]

# Load training and test data into pandas data frames

training_data = pd.DataFrame(text_train, columns= ['sentence', 'class'])
print(training_data)
print('\n------------------------------------------\n')
testing_data = pd.DataFrame(text_test, columns= ['sentence', 'class'])
print(testing_data)


                             sentence      class
0               This is my novel book  statement
1  this book has more than one author  statement
2                     is this my book   question
3                     They are novels  statement
4             have you read this book   question
5            who is the novels author   question
6             what are the characters   question
7       This is how I bought the book  statement
8         I like fictional characters  statement
9          what is your favorite book   question

------------------------------------------

                        sentence      class
0               this is the book  statement
1  who are the novels characters   question
2             is this the author   question
3                  I like apples       None


In [13]:
training_data

| | sentence | class |
|---|---|---|
| 0 | This is my novel book | statement |
| 1 | this book has more than one author | statement |
| 2 | is this my book | question |
| 3 | They are novels | statement |
| 4 | have you read this book | question |
| 5 | who is the novels author | question |
| 6 | what are the characters | question |
| 7 | This is how I bought the book | statement |
| 8 | I like fictional characters | statement |
| 9 | what is your favorite book | question |


In [14]:
stmt_docs = [train['sentence'] for index,train in training_data.iterrows() 
             if train['class'] == 'statement']

In [15]:
# Partition training data by class

stmt_docs = [train['sentence'] for index,train in training_data.iterrows() 
             if train['class'] == 'statement']
question_docs = [train['sentence'] for index,train in training_data.iterrows()
                 if train['class'] == 'question']
all_docs = [train['sentence'] for index,train in training_data.iterrows()]

# Get word frequencies for each sentence and class

def get_words(text):
    # Initialize word list
    words = []
    # Loop through each sentence in input array
    for text_row in text:
        # Check the number of words. Assume each word is separated by a blank space
        # so that the number of words is the number of blank spaces + 1
        number_of_spaces = text_row.count(' ')
        # Loop through the sentence and peel off one lowercased word per blank space
        for _ in range(number_of_spaces):
            words.append([text_row[:text_row.index(' ')].lower()])
            text_row = text_row[text_row.index(' ')+1:]
        # Whatever remains is the last word (appended as-is)
        words.append([text_row])
    return np.unique(words)

# Get frequency of each word in each document

def get_doc_word_frequency(words, text):  
    word_freq_table = np.zeros((len(text),len(words)), dtype=int)
    i = 0
    for text_row in text:
        # Insert extra space between each pair of words to prevent
        # partial match of words
        text_row_temp = ''
        for idx, val in enumerate(text_row):
            if val == ' ':
                 text_row_temp = text_row_temp + '  '
            else:
                  text_row_temp = text_row_temp + val.lower()
        text_row = ' ' + text_row_temp + ' '
        j = 0
        for word in words: 
            word = ' ' + word + ' '
            freq = text_row.count(word)
            word_freq_table[i,j] = freq
            j = j + 1
        i = i + 1
    
    return word_freq_table
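The padding used in `get_doc_word_frequency()` matters because `str.count` only counts non-overlapping matches. If two adjacent words share the single blank space between them, the second match is missed. The sketch below isolates that behavior; `count_word` is a hypothetical helper written for this illustration and is not part of the lab code.

```python
def count_word(word, sentence, pad=True):
    # str.count only counts non-overlapping matches, so adjacent words
    # must not share the blank space that delimits them
    sep = '  ' if pad else ' '
    padded = ' ' + sep.join(sentence.lower().split(' ')) + ' '
    return padded.count(' ' + word + ' ')

print(count_word('is', 'this is it'))          # "this" must not count as "is"
print(count_word('is', 'is is', pad=False))    # single blank shared by both words
print(count_word('is', 'is is', pad=True))     # doubled blank separates the matches
```

Surrounding the word with blanks prevents partial matches such as "is" inside "this", and doubling the blanks between words lets repeated neighbors each be counted.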

In [16]:
# Get word frequencies for statement documents
word_list_s = get_words(stmt_docs)
print(word_list_s)
word_freq_table_s = get_doc_word_frequency(word_list_s, stmt_docs)
tdm_s = pd.DataFrame(word_freq_table_s, columns=word_list_s)
tdm_s

['are' 'author' 'book' 'bought' 'characters' 'fictional' 'has' 'how' 'i'
 'is' 'like' 'more' 'my' 'novel' 'novels' 'one' 'than' 'the' 'they' 'this']


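| | are | author | book | bought | characters | fictional | has | how | i | is | like | more | my | novel | novels | one | than | the | they | this |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 4 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |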


In [17]:
# Get word frequencies over all statement documents

freq_list_s = word_freq_table_s.sum(axis=0) 
freq_s = dict(zip(word_list_s,freq_list_s))
print(freq_s)

{'are': 1, 'author': 1, 'book': 3, 'bought': 1, 'characters': 1, 'fictional': 1, 'has': 1, 'how': 1, 'i': 2, 'is': 2, 'like': 1, 'more': 1, 'my': 1, 'novel': 1, 'novels': 1, 'one': 1, 'than': 1, 'the': 1, 'they': 1, 'this': 3}


In [18]:
# Get word frequencies for question documents

word_list_q = get_words(question_docs)
word_freq_table_q = get_doc_word_frequency(word_list_q, question_docs)
tdm_q = pd.DataFrame(word_freq_table_q, columns=word_list_q)
tdm_q

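| | are | author | book | characters | favorite | have | is | my | novels | read | the | this | what | who | you | your |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 3 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 4 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |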


In [19]:
# Get word frequencies over all question documents

freq_list_q = word_freq_table_q.sum(axis=0) 
freq_q = dict(zip(word_list_q,freq_list_q))
print(freq_q)

{'are': 1, 'author': 1, 'book': 3, 'characters': 1, 'favorite': 1, 'have': 1, 'is': 3, 'my': 1, 'novels': 1, 'read': 1, 'the': 2, 'this': 2, 'what': 2, 'who': 1, 'you': 1, 'your': 1}


In [20]:
# Get word probabilities for statement class
a = 1
prob_s = []
for count in freq_list_s:
    #print(word, count)
    prob_s.append((count+a)/(sum(freq_list_s)+len(freq_list_s)*a))
prob_s.append(a/(sum(freq_list_s)+len(freq_list_s)*a))
    
# Get word probabilities for question class

prob_q = []
for count in freq_list_q:
    prob_q.append((count+a)/(sum(freq_list_q)+len(freq_list_q)*a))
prob_q.append(a/(sum(freq_list_q)+len(freq_list_q)*a))   
    
    
print('Probability of words for "statement" class \n')
print(dict(zip(word_list_s, prob_s)))
print('------------------------------------------- \n')
print('Probability of words for "question" class \n')
print(dict(zip(word_list_q, prob_q)))

Probability of words for "statement" class 

{'are': 0.043478260869565216, 'author': 0.043478260869565216, 'book': 0.08695652173913043, 'bought': 0.043478260869565216, 'characters': 0.043478260869565216, 'fictional': 0.043478260869565216, 'has': 0.043478260869565216, 'how': 0.043478260869565216, 'i': 0.06521739130434782, 'is': 0.06521739130434782, 'like': 0.043478260869565216, 'more': 0.043478260869565216, 'my': 0.043478260869565216, 'novel': 0.043478260869565216, 'novels': 0.043478260869565216, 'one': 0.043478260869565216, 'than': 0.043478260869565216, 'the': 0.043478260869565216, 'they': 0.043478260869565216, 'this': 0.08695652173913043}
------------------------------------------- 

Probability of words for "question" class 

{'are': 0.05128205128205128, 'author': 0.05128205128205128, 'book': 0.10256410256410256, 'characters': 0.05128205128205128, 'favorite': 0.05128205128205128, 'have': 0.05128205128205128, 'is': 0.10256410256410256, 'my': 0.05128205128205128, 'novels': 0.0512820512

In [21]:
# Calculate prior for one class

def prior(className):    
    denominator = len(stmt_docs) + len(question_docs)
    
    if className == 'statement':
        numerator =  len(stmt_docs)
    else:
        numerator =  len(question_docs)
        
    return np.divide(numerator,denominator)
    
# Calculate class conditional probability for a sentence
    
def classCondProb(sentence, className):
    words = get_words(sentence)
    prob = 1
    for word in words:
            if className == 'statement':
                idx = np.array(np.where(word_list_s == word))              
                prob = (prob * prob_s[idx[0,0]]) 
              
            else:
                idx = np.array(np.where(word_list_q == word))
                prob = (prob * prob_q[idx[0,0]]) 
    
    return prob

# Predict class of a sentence

def predict(sentence):
    prob_statement = classCondProb(sentence, 'statement') * prior('statement')
    prob_question = classCondProb(sentence, 'question') * prior('question')
    if  prob_statement > prob_question:
        return 'statement'
    else:
        return 'question'

### In-lab exercise: Laplace smoothing

Run the code below and figure out why it fails.

When a word does not appear with a specific class in the training data, its class-conditional probability is 0, and we are unable to
get a reasonable probability for that class.

Research Laplace smoothing, and modify the code above to implement Laplace smoothing (setting the frequency of all words with frequency 0 to a frequency of 1).
Run the modified code on the test set.
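As a worked example of add-one smoothing, the sketch below recomputes two probabilities for the "statement" class from the word counts printed earlier. It follows the convention used in this notebook, where the vocabulary size is the number of distinct words seen in that class; other formulations use a vocabulary shared across classes. The function `smoothed_prob` is a name we introduce here for illustration.

```python
# Word counts for the "statement" class, copied from the freq_s output above
freq_s = {'are': 1, 'author': 1, 'book': 3, 'bought': 1, 'characters': 1,
          'fictional': 1, 'has': 1, 'how': 1, 'i': 2, 'is': 2, 'like': 1,
          'more': 1, 'my': 1, 'novel': 1, 'novels': 1, 'one': 1, 'than': 1,
          'the': 1, 'they': 1, 'this': 3}

def smoothed_prob(word, freq, a=1):
    # (count + a) / (N + a * V); an unseen word simply has count 0
    total = sum(freq.values())       # N: total word occurrences in the class
    vocab = len(freq)                # V: number of distinct words in the class
    return (freq.get(word, 0) + a) / (total + a * vocab)

print(smoothed_prob('book', freq_s))   # seen three times in statements
print(smoothed_prob('who', freq_s))    # never seen in a statement
```

With 26 word occurrences and 20 distinct words, "book" gets (3 + 1) / 46 and the unseen word "who" gets 1 / 46, which is small but no longer zero.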

In [22]:
test_docs = list([test['sentence'] for index,test in testing_data.iterrows()])
doc = test_docs[1]
print('Getting prediction for %s"' % doc)
predict(doc)


Getting prediction for who are the novels characters"


IndexError: index 0 is out of bounds for axis 1 with size 0

### Exercise 1.1 (10 points)

Explain why it failed and explain how to solve the problem.


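It fails because the word being looked up does not appear in the vocabulary of the class, so this line

`idx = np.array(np.where(word_list_s == word))`

returns an empty index array of shape `(1, 0)`. The next line then indexes it with `idx[0,0]`, which raises the error "index 0 is out of bounds for axis 1 with size 0".

To fix this problem we can use Laplace smoothing.
Laplace smoothing adds a constant (here 1) to the count of every word in each class, including words that appear in the other class or in the test sentence but never in this class.
Because every count is increased by the same amount, the relative probabilities of the seen words change only a little, and an unseen word always receives a small non-zero probability instead of 0. In the code, a lookup that finds no match must fall back to this unseen-word probability rather than indexing the empty array.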
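The failure can be reproduced in isolation with a short sketch. The three-word `vocab` array stands in for `word_list_s`, and `lookup` is a hypothetical helper showing one way to fall back to a smoothed probability when a word is missing.

```python
import numpy as np

vocab = np.array(['are', 'author', 'book'])   # stand-in for word_list_s

hit = np.array(np.where(vocab == 'book'))     # word is present
miss = np.array(np.where(vocab == 'who'))     # word is absent
print(hit.shape, miss.shape)

try:
    miss[0, 0]                                # same indexing as in classCondProb
except IndexError as err:
    print('IndexError:', err)

def lookup(word, vocab, probs, unseen_prob):
    # Return the word's probability, or the smoothed fallback if it is absent
    idx = np.flatnonzero(vocab == word)
    return probs[idx[0]] if idx.size else unseen_prob
```

Checking `idx.size` before indexing is equivalent in spirit to testing the second dimension of the index array's shape.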



### Exercise 1.2 (20 points)

Modify the code to make it work using Laplace smoothing. Include the functions `prior()`, `classCondProb()`, and `predict()`.

In [24]:
# YOUR CODE HERE

a = 1
prob_s = []
for count in freq_list_s:
    #print(word, count)
    prob_s.append((count+a)/(sum(freq_list_s)+len(freq_list_s)*a))
    
prob_s.append(a/(sum(freq_list_s)+len(freq_list_s)*a))
    
# Get word probabilities for question class

prob_q = []
for count in freq_list_q:
    prob_q.append((count+a)/(sum(freq_list_q)+len(freq_list_q)*a))
    
prob_q.append(a/(sum(freq_list_q)+len(freq_list_q)*a))   


def prior(className):    
    denominator = len(stmt_docs) + len(question_docs)
    
    if className == 'statement':
        numerator =  len(stmt_docs)
    else:
        numerator =  len(question_docs)
        
    return np.divide(numerator,denominator)
    
# Calculate class conditional probability for a sentence
    
def classCondProb(sentence, className):
    words = get_words(sentence)
    prob = 1
    for word in words:
            if className == 'statement':
                idx = np.where(word_list_s == word)
                
                if np.array(idx).shape[1] != 0: 
                    prob = (prob * prob_s[np.array(idx)[0,0]]) 
                else:
                    prob = (prob * prob_s[-1])
                    
            else: #if className is question
                idx = np.where(word_list_q == word)
                
                if np.array(idx).shape[1] != 0: 
                    prob = (prob * prob_q[np.array(idx)[0,0]]) 
                else:
                    prob = (prob * prob_q[-1])
                
    
    return prob

# Predict class of a sentence

def predict(sentence):
    prob_statement = classCondProb(sentence, 'statement') * prior('statement')
    prob_question = classCondProb(sentence, 'question') * prior('question')
    if  prob_statement > prob_question:
        return 'statement'
    else:
        return 'question'
    


#raise NotImplementedError()

In [29]:
# Test function: Do not remove
test_docs = list([test['sentence'] for index,test in testing_data.iterrows()])

for sentence in test_docs:
    print('Getting prediction for %s"' % sentence)
    print(predict(sentence))
    
print("success!")
# End Test function

Getting prediction for this is the book"
question
Getting prediction for who are the novels characters"
question
Getting prediction for is this the author"
question
Getting prediction for I like apples"
question
success!


**Expected result**:\
Getting prediction for this is the book"\
question\
Getting prediction for who are the novels characters"\
question\
Getting prediction for is this the author"\
question\
Getting prediction for I like apples"\
statement\
success!

### Take home exercise

Find a more substantial text classification dataset, clean up the documents, and build your NB classifier. Write a brief report on your in-lab and take home exercises and results here.

### TAKE HOME

The dataset is from [Kaggle](https://www.kaggle.com/shahrukhkhan/questions-vs-statementsclassificationdataset?select=val.csv). We use only the validation set, and only part of it, because the full dataset is too large.


In [30]:
import re
filename = 'val.csv'
dataset, headers = loadCsv(filename)
dataset = dataset[:1000,1:] #we will only use 1000 sentences

#the dataset is the classification of sentences into statements or questions
# 0 is a statement and 1 is a question as we can see below
dataset[:5]

array([['How many did snow fall in Palermo between 1940and the 2000s?',
        1],
       ['When did the Egyptian and Syrian armies launch a surprise attack against Israeli forces',
        1],
       [' Still, the political ramifications of removing Hussein would have broadened the scope of the conflict greatly, and many coalition nations refused to participate in such an action, believing it would create a power vacuum and destabilize the region.',
        0],
       ['What did Simpson help construct with Bell?', 1],
       ['How much did 1941 earn in the US', 1]], dtype=object)

In [31]:
#removing all special characters and numbers and then converting all to lower case
for i in range(dataset.shape[0]):
    dataset[i,0] = re.sub(r"[^a-z]"," ",dataset[i,0].lower())    
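To see what this cleaning step does, the sketch below applies the same substitution to the first sentence of the dataset. Because each removed character becomes its own blank, runs of blanks appear, and splitting on single blanks then produces empty tokens. The final whitespace-collapsing step is an optional extra we add for illustration; it is not part of the pipeline used in this notebook.

```python
import re

raw = 'How many did snow fall in Palermo between 1940and the 2000s?'

# Same substitution as in the cell: every character outside a-z becomes one blank
cleaned = re.sub(r"[^a-z]", " ", raw.lower())
print(repr(cleaned))

# Splitting on single blanks yields empty tokens wherever blanks repeat
tokens = cleaned.split(' ')
print(tokens.count(''))

# Hypothetical extra step: collapse runs of whitespace before tokenizing
collapsed = ' '.join(cleaned.split())
print(collapsed)
```

These empty tokens are why an empty-string entry shows up in the word lists and is removed from the frequency dictionaries later on.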

In [39]:
dataset[:5,-1]

array([1, 1, 0, 1, 1], dtype=object)

In [40]:
# Split into training and test

X_train,X_test,y_train,y_test = splitDataset(0.4,dataset[:,:-1],
                                             dataset[:,-1])
print("Total =",len(dataset),
      "\nTrain =", len(X_train),
      "\nTest =",len(X_test))

Total = 1000 
Train = 600 
Test = 400


In [41]:
s_docs = (X_train[y_train == 0]).reshape(-1).tolist()
q_docs = (X_train[y_train == 1]).reshape(-1).tolist()

In [42]:
word_list_s = get_words(s_docs)
word_list_s
word_freq_table_s = get_doc_word_frequency(word_list_s, s_docs)
tdm_s = pd.DataFrame(word_freq_table_s, columns=word_list_s)
tdm_s

freq_list_s = word_freq_table_s.sum(axis=0) 
freq_s = dict(zip(word_list_s,freq_list_s))
freq_s.pop('') #ignoring spaces
list(freq_s.items())[:20]

[('a', 131),
 ('aaron', 1),
 ('abandoned', 1),
 ('abbreviations', 1),
 ('abortive', 1),
 ('about', 8),
 ('absolutely', 1),
 ('accepted', 1),
 ('accident', 1),
 ('accidental', 1),
 ('accompanied', 2),
 ('according', 1),
 ('account', 1),
 ('accounting', 1),
 ('accusative', 1),
 ('accused', 1),
 ('accuser', 1),
 ('achieve', 1),
 ('acid', 1),
 ('acquisition', 1)]

In [43]:
word_list_q = get_words(q_docs)
word_freq_table_q = get_doc_word_frequency(word_list_q, q_docs)
tdm_q = pd.DataFrame(word_freq_table_q, columns=word_list_q)
tdm_q

freq_list_q = word_freq_table_q.sum(axis=0) 
freq_q = dict(zip(word_list_q,freq_list_q))

freq_q.pop('') #removing spaces
list(freq_q.items())[:20]

[('a', 148),
 ('abbess', 1),
 ('ability', 1),
 ('able', 2),
 ('abolish', 1),
 ('abound', 1),
 ('about', 9),
 ('above', 1),
 ('absorbs', 1),
 ('academic', 2),
 ('accept', 2),
 ('accepted', 1),
 ('accessories', 1),
 ('accord', 1),
 ('according', 3),
 ('accounting', 2),
 ('accuse', 1),
 ('accused', 1),
 ('achaemenid', 1),
 ('achieve', 1)]

In [48]:
# YOUR CODE HERE
a = 1
prob_s = []
for count in freq_list_s:
    #print(word, count)
    prob_s.append((count+a)/(sum(freq_list_s)+len(freq_list_s)*a))
    
prob_s.append(a/(sum(freq_list_s)+len(freq_list_s)*a))
    
# Get word probabilities for question class

prob_q = []
for count in freq_list_q:
    prob_q.append((count+a)/(sum(freq_list_q)+len(freq_list_q)*a))
    
prob_q.append(a/(sum(freq_list_q)+len(freq_list_q)*a))   


def prior(className):    
    denominator = len(s_docs) + len(q_docs)
    
    if className == 'statement':
        numerator =  len(s_docs)
    else:
        numerator =  len(q_docs)
        
    return np.divide(numerator,denominator)
    
# Calculate class conditional probability for a sentence
    
def classCondProb(sentence, className):
    words = get_words(sentence)
    prob = 1
    for word in words:
            if className == 'statement':
                idx = np.where(word_list_s == word)
                
                if np.array(idx).shape[1] != 0: 
                    prob = (prob * prob_s[np.array(idx)[0,0]]) 
                else:
                    prob = (prob * prob_s[-1])
                    
            else: #if className is question
                idx = np.where(word_list_q == word)
                
                if np.array(idx).shape[1] != 0: 
                    prob = (prob * prob_q[np.array(idx)[0,0]]) 
                else:
                    prob = (prob * prob_q[-1])
                
    
    return prob

# Predict class of a sentence

def predict(sentence):
    prob_statement = classCondProb(sentence, 'statement') * prior('statement')
    prob_question = classCondProb(sentence, 'question') * prior('question')
    if  prob_statement > prob_question:
        return 0 #statement
    else:
        return 1 #question

In [49]:
#Getting the predicted values of the test set
yhat_test = []
for i in X_test:
    yhat_test.append(predict(i))

print ('The accuracy of our model is ',getAccuracy(y_test,yhat_test))

The accuracy of our model is  0.755


We applied the NB classifier to our new dataset and obtained an accuracy of 0.755 when classifying sentences as statements or questions. Using the prior seems important when the two classes have different probabilities in the dataset.

This was a very simple classifier in which we only used lowercased words and removed all numbers and special characters. It also discards word order and only takes word frequency into account. That is why it performs only moderately better than the prior for questions, as shown below.


In [50]:
print('Prior for questions', prior('question'))
print('Accuracy of our model', getAccuracy(y_test,yhat_test))

Prior for questions 0.6283333333333333
Accuracy of our model 0.755


So if our model predicted "question" for every sentence, it would still have an accuracy of about 0.63, assuming the test set has the same class balance as the training set.
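The training prior is only a proxy for this baseline. A more direct check is to measure the accuracy of always predicting the majority training class on the test labels themselves. The sketch below shows one way to do that; `majority_baseline_accuracy` is a hypothetical helper, and the label vectors are tiny made-up examples rather than the data used above.

```python
import numpy as np

def majority_baseline_accuracy(y_train, y_test):
    # Predict the most frequent training label for every test sample
    labels, counts = np.unique(y_train, return_counts=True)
    majority = labels[np.argmax(counts)]
    return np.mean(np.asarray(y_test) == majority)

# Tiny made-up label vectors, for illustration only (1 = question, 0 = statement)
y_train_demo = np.array([1, 1, 1, 0, 0])
y_test_demo = np.array([1, 0, 1, 1])
print(majority_baseline_accuracy(y_train_demo, y_test_demo))
```

Comparing the classifier's accuracy against this number shows how much of its performance comes from the features rather than from class imbalance.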