## Is It An Action or A Comedy Film?

This notebook is a step-by-step walk-through of how to train a simple naive Bayes classifier to recognize the genre of a film from its review.

Before we start, here are the three terms in the equation for the probability of a class given a review.
1. Prior: the fraction of training reviews that belong to a given class, i.e. if 2 out of 5 reviews are action films, 0.4 will be the prior probability of action.
2. Likelihood or P(features|class): the probability of seeing the review's features given the class, e.g. P((fly, fun, kick, hit)|action).
3. Evidence: the probability of seeing the review's features regardless of class. It is the denominator of the equation and is estimated from all the data points (here, reviews) we have.

Note that in this exercise we can ignore the denominator of the naive Bayes equation. We are only comparing P(action | review) and P(comedy | review), and both share the same denominator, so it cancels out and simplifies our work.
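
As a minimal sketch of why the denominator can be dropped, the toy numbers below compare the two classes using only prior times likelihood, and then again after dividing by the evidence. All values are hypothetical and are not taken from the data set.

```python
import numpy as np

# Hypothetical toy numbers: 2 of 5 training reviews are action, 3 are comedy.
n_docs_per_class = np.array([2, 3])
prior = n_docs_per_class / n_docs_per_class.sum()   # P(action), P(comedy)

# Hypothetical likelihoods P(review | class) for a single review.
likelihood = np.array([0.008, 0.002])

# Unnormalized scores: prior * likelihood, with no division by the evidence.
scores = prior * likelihood

# Dividing both scores by the same evidence P(review) rescales them
# but cannot change which one is larger.
evidence = scores.sum()
posterior = scores / evidence

classes = ['action', 'comedy']
predicted = classes[int(np.argmax(scores))]
```

Both `scores` and `posterior` pick the same class, which is why the rest of the notebook never computes the evidence.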

## Building and Storing Feature Vectors

Create attributes to store the **features** in an appropriate data structure of your choice.

Here `numpy` is used to create the matrices that hold the **feature vectors**.
In the past, I have primarily used dictionaries for storing data. However, `numpy` supports vectorized operations on whole arrays and is very powerful, so it is used here.

See this [numpy tutorial](https://cs231n.github.io/python-numpy-tutorial/) for more information about how to use `numpy`.
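
To illustrate the difference, the sketch below stores the same word counts in a dictionary and in a `numpy` array, then turns them into relative frequencies. The words and counts are made up for illustration.

```python
import numpy as np

# Hypothetical word counts for one class.
counts_dict = {'fly': 3, 'fun': 1, 'kick': 4, 'hit': 2}

# With a dictionary, each value is transformed in an explicit loop.
total = sum(counts_dict.values())
probs_dict = {word: count / total for word, count in counts_dict.items()}

# With numpy, the counts live in one array and the division
# is applied to every element at once.
words = list(counts_dict)
counts_array = np.array([counts_dict[w] for w in words])
probs_array = counts_array / counts_array.sum()
```

The array version needs a separate list (`words`) to remember which position belongs to which word, which is the role `feature_dict` plays in the class below.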



In [18]:
import os
import numpy as np
from collections import defaultdict
import nltk
# nltk.download()
# nltk.download('punkt')

class NaiveBayes():

    def __init__(self):
        # be sure to use the right class_dict for each data set
        self.class_dict = {0: 'action', 1: 'comedy'}
        self.feature_dict = {}
        self.prior = np.zeros(2)
        self.likelihood = None
        self.loglikelihood = None
        self.logprior = np.zeros(2)
        
        # be sure to store all the needed data 
        self.dict_action = None     
        self.dict_comedy = None   
        self.dict_voc = None

Before we can do anything, we first have to load the data into our workspace.
The following code walks the training directory, reads every file it finds, tokenizes the text,
and counts the documents and words in each class.

In [19]:
    def preprocessing(self, train_set):
        N_doc = 0               # total number of documents
        N_class = np.zeros(2)   # number of documents per class
        doc_action = []
        doc_comedy = []
        doc_all = []
        for root, dirs, files in os.walk(train_set):
            for name in files:
                N_doc += 1
                with open(os.path.join(root, name)) as f:
                    text = nltk.word_tokenize(f.read())
                # the class label is the name of the folder holding the file
                if os.path.basename(root) == 'action':
                    N_class[0] += 1
                    doc_action.extend(text)
                else:
                    N_class[1] += 1
                    doc_comedy.extend(text)
                doc_all.extend(text)
        self.dict_action = nltk.FreqDist(doc_action)      # bigdoc[action]
        self.dict_comedy = nltk.FreqDist(doc_comedy)      # bigdoc[comedy]
        self.dict_voc = nltk.FreqDist(doc_all)            # vocabulary

        # give every unique word in the vocabulary a feature index
        self.feature_dict = dict(enumerate(self.dict_voc))

        # prior = share of training documents in each class
        self.prior = N_class / N_doc
        self.logprior = np.log(self.prior)


## Training Our NB Classifier
Remember we have defined three attributes, `dict_action`, `dict_comedy`, and `dict_voc`, in the previous cell. As we iterated through the data folder and loaded the corresponding text, we counted the words in each category and in the data overall, using NLTK's `FreqDist` to compute their **frequency distributions** directly.

Now we can start to create our **feature vectors**. We first create an array to store all of our features, which are the unique words in our training files. Alternatively, you could do your own feature selection.
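
A minimal sketch of that step follows. It uses `collections.Counter` as a stand-in for `nltk.FreqDist` (which subclasses `Counter`) and `str.split` as a stand-in for `nltk.word_tokenize`, so it runs without NLTK data. The two toy documents and the helper `to_feature_vector` are hypothetical.

```python
from collections import Counter
import numpy as np

# Hypothetical toy documents; str.split stands in for nltk.word_tokenize.
docs = ['fun fun kick', 'kick hit fly kick']
tokens = [doc.split() for doc in docs]

# Frequency distribution over all tokens (the vocabulary).
vocab_counts = Counter(word for doc in tokens for word in doc)

# Assign each unique word a feature index, in first-seen order.
feature_dict = dict(enumerate(vocab_counts))

def to_feature_vector(words):
    """Return the count of every feature word in one tokenized document."""
    counts = Counter(words)   # missing words count as 0
    return np.array([counts[feature_dict[i]] for i in range(len(feature_dict))])

vector = to_feature_vector(tokens[1])
```

Here `feature_dict` maps positions to words, and `vector` holds one count per position for the second document.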


Now we are done with data preprocessing. Let's head into training our model.

Recall the equation: to score a given class with naive Bayes, we need its **prior probability** and the **likelihood** of each feature appearing in the class.
We work in log space to keep the calculation numerically stable, since multiplying many very small probabilities can underflow to zero.
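
To see why log space helps, the sketch below multiplies many small probabilities directly and then sums their logs instead. The probabilities are hypothetical.

```python
import numpy as np

# Hypothetical likelihoods: 1000 words, each with probability 1e-5.
probs = np.full(1000, 1e-5)

# Multiplying them directly underflows to 0.0 in double precision.
product = np.prod(probs)

# Summing the logs carries the same information as a finite negative number.
log_sum = np.sum(np.log(probs))
```

Once the product has collapsed to zero, two classes can no longer be told apart, whereas their log sums can still be compared.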

We'll also use **Laplace smoothing**, also called **add-one smoothing**, in which each word is counted one extra time. This keeps words that never occur in a class from getting zero probability, which would otherwise wreck our likelihood calculation.

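MLE estimate of a word $w_{i}$ given a class $c$:
$$ P_{MLE}(w_{i} | c) = \frac{\text{count}(w_{i}, c)}{\sum_{w \in V} \text{count}(w, c)}$$

Add-1 (Laplace) estimate, where $|V|$ is the vocabulary size:
$$ P_{Add-1}(w_{i} | c) = \frac{\text{count}(w_{i}, c) + 1}{\sum_{w \in V} \text{count}(w, c) + |V|}$$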
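
As a small worked example of add-one smoothing applied to word counts within a single class, the sketch below compares the unsmoothed and smoothed estimates. The vocabulary and counts are hypothetical.

```python
import numpy as np

# Hypothetical counts of four vocabulary words in the action class.
vocab = ['fly', 'fun', 'kick', 'hit']
count_wc = np.array([3, 0, 4, 1])   # 'fun' never appears in action reviews
V = len(vocab)

# MLE estimate: the unseen word gets probability 0, and log(0) is -inf.
p_mle = count_wc / count_wc.sum()

# Add-one estimate: every count is raised by 1 and the denominator by V.
p_add1 = (count_wc + 1) / (count_wc.sum() + V)
```

With these counts the smoothed estimates are 4/12, 1/12, 5/12 and 2/12: they still sum to 1, and none of them is zero.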

After calculating the prior, we can go on to calculate the likelihood of every feature.

In [23]:
    def train(self):
        n_features = len(self.feature_dict)
        self.likelihood = np.zeros((len(self.class_dict), n_features))
        # word counts per class, in the same order as class_dict
        class_counts = [self.dict_action, self.dict_comedy]
        for i, counts in enumerate(class_counts):
            # count(w, c) for every feature word; FreqDist returns 0 for unseen words
            count_wc = np.array([counts[self.feature_dict[j]] for j in range(n_features)])
            # add-one smoothing: add 1 to each count and |V| to the denominator
            self.likelihood[i] = (count_wc + 1) / (count_wc.sum() + n_features)

        self.loglikelihood = np.log(self.likelihood)
        return self.loglikelihood

## Predict and Classify
It's time to predict and classify the given file with our newly trained model.

We will use `numpy`'s `.dot` method to take the **dot** product of the log-likelihood matrix and the **feature vector** we have just built, which gives one score per class. Adding the log prior to each score lets us compare the two classes.

The class with the higher score is the more likely one, so it becomes our prediction.
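
The scoring step can be sketched on its own with a tiny model. The probabilities and the feature vector below are hypothetical; only the shapes (one row per class, one column per feature word) mirror the classifier in this notebook.

```python
import numpy as np

# Hypothetical model with two classes (rows) and three feature words (columns).
class_dict = {0: 'action', 1: 'comedy'}
loglikelihood = np.log(np.array([[0.5, 0.3, 0.2],
                                 [0.2, 0.3, 0.5]]))
logprior = np.log(np.array([0.4, 0.6]))

# Count of each feature word in the document being classified.
feature_vector = np.array([3, 1, 0])

# One score per class: log P(c) + sum over words of count(w) * log P(w | c).
scores = np.dot(loglikelihood, feature_vector) + logprior
predicted = class_dict[int(np.argmax(scores))]
```

The first word is much more likely under action and appears three times, so action wins even though comedy has the larger prior.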

In [25]:
    def test(self, dev_set):
        results = defaultdict(list)   # file name -> [gold label, predicted label]
        for root, dirs, files in os.walk(dev_set):
            for name in files:
                if name == '.DS_Store':   # skip macOS metadata files
                    continue
                with open(os.path.join(root, name)) as f:
                    text = nltk.word_tokenize(f.read())
                # gold label: the name of the folder holding the file
                results[name].append(os.path.basename(root))

                # feature vector: count of each feature word in this document
                dict_text = nltk.FreqDist(text)
                feature_vector = np.array(
                    [dict_text[self.feature_dict[i]] for i in range(len(self.feature_dict))])

                # log P(c) + sum over words of count(w) * log P(w|c), for each class
                compare = np.dot(self.loglikelihood, feature_vector) + self.logprior
                if compare[0] != compare[1]:   # ties are left unclassified
                    results[name].append(self.class_dict[int(np.argmax(compare))])
        return results