# Implementing Naive Bayes

We're going to walk through an implementation of a Naive Bayes classifier, and in doing so get comfortable with some basic aspects of machine learning on collections of text.

In the first week of this course we explored basic input and output using Python, and also reviewed some string operations that we can use to "tokenize" text (divide it into words or word-like tokens) and "normalize" it (for instance, by rendering everything lowercase). We could continue using those basic Python functions to convert the texts we use into numbers.

But since normalizing and tokenizing text is a very common operation, there are Python libraries that take care of it for us. Using them will simplify our code. Standard libraries also reduce the distortion that can creep into a model when (say) the model is trained using one tokenizing process and then applied to data produced with a different process.

### Using CountVectorizer to turn texts into a pandas dataframe

One library we'll use a lot is [scikit-learn](https://scikit-learn.org/stable/about.html). In import statements its name is abbreviated ```sklearn```.


In [1]:
# !pip install scikit-learn   # only uncomment and run if needed
                              # in other words, if you get an error when you attempt
                              # to import sklearn below

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
import pandas as pd
import numpy as np
import glob, math
from pathlib import Path

In [3]:
sample_texts = ["It was pathetic. The worst part was the boxing scenes.",
         "No plot twists or great scenes.",
         "and satire, and great plot twists",
         "Great scenes; great film."]
text_titles = ['reviewA', 'reviewB', 'reviewC', 'reviewD']

count_vectorizer = CountVectorizer()
count_vectors = count_vectorizer.fit_transform(sample_texts)

vector_frame = pd.DataFrame(count_vectors.toarray(), index = text_titles, 
                            columns = count_vectorizer.get_feature_names_out())
vector_frame

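| | and | boxing | film | great | it | no | or | part | pathetic | plot | satire | scenes | the | twists | was | worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| reviewA | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 2 | 0 | 2 | 1 |
| reviewB | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| reviewC | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
| reviewD | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |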


We'll call this a term-doc matrix; typically words ("features") are columns and documents are rows.

A "vector" is essentially a list of numbers. Next week we'll talk about the geometrical interpretation that makes it possible to interpret a list of numbers as a line in space.

### Relative frequencies (normalized by doc length)

Suppose we wanted to have the relative frequency of each word as a percentage of its document. That representation has a lot of advantages, since it factors out document length and provides, essentially, a unigram probability model.

Before proceeding, pause for a moment and think about what we would need to do mathematically to generate relative frequencies. Then you'll understand the following code.

In [4]:
rowsums = vector_frame.sum(axis = 'columns')   # change to 'rows' and see what happens
rowsums

reviewA    10
reviewB     6
reviewC     6
reviewD     4
dtype: int64

For future reference, ```axis = 'columns'``` is often abbreviated ```axis = 1``` and ```axis = 'rows'``` is 0.
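A small sketch of that equivalence, using a two-document toy frame invented for illustration:

```python
import pandas as pd

# Hypothetical toy counts: two documents, two words.
toy = pd.DataFrame({'great': [1, 2], 'plot': [1, 0]}, index=['docA', 'docB'])

by_name = toy.sum(axis='columns')   # adds across the columns: one total per document
by_number = toy.sum(axis=1)         # the numeric spelling of the same thing
per_word = toy.sum(axis=0)          # adds down the rows: one total per word

print(by_name)
print(per_word)
```

The axis you name is the one that gets collapsed, which is why summing along `'columns'` leaves one number per row.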

In [5]:
vector_frame.divide(rowsums, axis = 'rows')  # change to 'columns' and see what happens :(

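| | and | boxing | film | great | it | no | or | part | pathetic | plot | satire | scenes | the | twists | was | worst |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| reviewA | 0.0 | 0.1 | 0.0 | 0.0 | 0.1 | 0.0 | 0.0 | 0.1 | 0.1 | 0.0 | 0.0 | 0.1 | 0.2 | 0.0 | 0.2 | 0.1 |
| reviewB | 0.0 | 0.0 | 0.0 | 0.166667 | 0.0 | 0.166667 | 0.166667 | 0.0 | 0.0 | 0.166667 | 0.0 | 0.166667 | 0.0 | 0.166667 | 0.0 | 0.0 |
| reviewC | 0.333333 | 0.0 | 0.0 | 0.166667 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.166667 | 0.166667 | 0.0 | 0.0 | 0.166667 | 0.0 | 0.0 |
| reviewD | 0.0 | 0.0 | 0.25 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.25 | 0.0 | 0.0 | 0.0 | 0.0 |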


### Load folders of files into a dataframe

We're using [a dataset of movie reviews developed by Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan.](https://www.cs.cornell.edu/people/pabo/movie-review-data/)

Our first task is to load them into a dataframe.

It's worth stepping through this to make sure you understand what's happening.

In [6]:
negative_dir = '../../data/review_polarity/txt_sentoken/neg'
positive_dir = '../../data/review_polarity/txt_sentoken/pos'
neg_paths = glob.glob(f'{negative_dir}/*.txt')
pos_paths = glob.glob(f'{positive_dir}/*.txt')

all_paths = neg_paths + pos_paths      # notice the order

all_classes = [0] * 1000 + [1] * 1000  # notice the same order
                                       # if it's not clear what's in that list, inspect
                                       # by using len() and saying all_classes[-10 : ]

We now have a list of 2000 paths to files, paired with a list of class labels that are either zero (negative sentiment) or one (positive sentiment).
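The pairing above depends on the two lists being built in the same order and on there being exactly 1000 files in each folder. As a hedged alternative, here is a sketch that derives each label from the name of the folder a file sits in. The example paths are hypothetical and only mimic the `neg`/`pos` folder layout.

```python
from pathlib import Path

# Hypothetical paths, invented for illustration; only the folder names matter here.
example_paths = ['data/neg/cv000_a.txt', 'data/neg/cv001_b.txt', 'data/pos/cv000_c.txt']

# Label is 1 when the file's parent folder is 'pos', otherwise 0.
labels = [1 if Path(p).parent.name == 'pos' else 0 for p in example_paths]

print(list(zip(example_paths, labels)))
```

Deriving labels this way keeps each label attached to its own path, whatever order the paths arrive in.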

In [7]:
count_vectorizer = CountVectorizer(input = 'filename',   # notice that we're now setting this up
                                  max_features = 5000)   # to automatically read a list of paths
                                                       # and also only taking the top 5000 words
    
word_counts = count_vectorizer.fit_transform(all_paths)  # that line does all the work!

titles = [Path(text).stem for text in all_paths]
count_df = pd.DataFrame(word_counts.toarray(), index = titles, 
                      columns = count_vectorizer.get_feature_names_out())

count_df = count_df.assign(class_label = all_classes)  # adding a column for class_label
count_df.head()

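| | 000 | 10 | 100 | 11 | 12 | 13 | 13th | 14 | 15 | 16 | ... | younger | your | yourself | youth | zane | zany | zero | zeta | zone | class_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cv676_22202 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| cv839_22807 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| cv155_7845 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| cv465_23401 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| cv398_17047 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |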


We now have a term-doc matrix for the top 5000 words in 2000 movie reviews, along with a column that indicates whether each review was negative or positive.

### Generating train and test sets

Now let's divide this into train and test sets. By convention we're going to call the matrix of feature values *X* (it gets a capital because the names of matrices are by convention capital letters). The vector of class labels, we'll call *y.* We're about to learn a function that predicts *y* from *X.* To be super-fancy we can refer to our predictions as $\hat{y}$.

In [8]:
neg_counts = count_df.loc[count_df['class_label'] == 0, ]
pos_counts = count_df.loc[count_df['class_label'] == 1, ]

train_X = pd.concat([neg_counts.iloc[0:800, : ], pos_counts.iloc[0:800, : ]], axis = 'rows')
train_y = train_X['class_label']
train_X = train_X.drop('class_label', axis = 'columns')  # we don't want this as a feature. Why not?
train_X.shape

(1600, 5000)

Notice that we just took the first 800 rows of the negative and positive dataframes as our training set. (That's four-fifths of the data.) We're trusting that the data is already well-randomized, so that, for example, all the reviews of action movies aren't at the end.
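If we did not want to trust the existing order, one common alternative (not the approach used in this notebook) is to let scikit-learn shuffle and split for us. This is a sketch on hypothetical toy data; the names `toy_X` and `toy_y` and the random counts are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 documents, 3 word-count features, 5 per class.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame(rng.integers(0, 5, size=(10, 3)), columns=['great', 'plot', 'worst'])
toy_y = pd.Series([0] * 5 + [1] * 5, name='class_label')

# Shuffle, then hold out one-fifth; stratify keeps the class balance in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=0)

print(X_tr.shape, X_te.shape)
print(y_te.value_counts())
```

Fixing `random_state` makes the shuffle repeatable, so the same rows land in the test set each time the cell is run.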

#### a quick thought experiment

I wrote a line of code that drops the class_label from train_X.

    train_X = train_X.drop('class_label', axis = 'columns')
    
What would happen if I forgot to do that, and we trained a naive Bayes model to predict $\hat{y}$ on train_X with that extra column? What would you expect our accuracy to be?

Now write some code that generates a test set. We'll need both a matrix of feature values and a vector of class labels.

In [9]:
# Lines that create a test_X and a test_y


When you're done, ```test_X.shape``` should be (400, 5000).

### Applying Naive Bayes can be simple, if we just want results

If our goal is simply to get predictions, that's easy. Scikit-learn has [several forms of Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html) built in. We'll use [Multinomial Naive Bayes.](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB)

In [10]:
bayes = MultinomialNB(alpha = 1)
bayes.fit(train_X, train_y)

MultinomialNB(alpha=1)

In [11]:
yhat = bayes.predict(test_X)
yhat[0:10]

NameError: name 'test_X' is not defined

In [None]:
sum(yhat == test_y) / len(test_y)
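
The expression above is just the fraction of predictions that match the true labels. Here is a small sketch on made-up arrays, showing that it agrees with scikit-learn's `accuracy_score`. The arrays `toy_true` and `toy_pred` are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions for five documents.
toy_true = np.array([0, 0, 1, 1, 1])
toy_pred = np.array([0, 1, 1, 1, 0])

manual = (toy_pred == toy_true).mean()        # True counts as 1, False as 0
library = accuracy_score(toy_true, toy_pred)  # same calculation, from scikit-learn

print(manual, library)
```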

### But let's implement naive Bayes ourselves, to understand it. First, training.

Let's review the pseudo-code from Jurafsky and Martin:

![pseudocode for the train function](pseudocode_train.png)

Because the probabilities we're dealing with are very, very small, and get smaller as you multiply them, naive Bayes is conventionally implemented by adding logarithms instead of multiplying probabilities. To confirm that this is (or should be!) the same thing, check it out:

In [12]:
logsum = math.log(0.1) + math.log(0.1)

product = 0.1 * 0.1

print(product)
print(math.exp(logsum))

0.010000000000000002
0.010000000000000004


Floating-point math isn't perfect, but that's basically the same thing.
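
To see why logarithms matter in practice, here is a sketch with made-up numbers: one hundred probabilities of 0.00001 each. Their product is too small for a Python float to represent, so it collapses to zero, while the sum of logs stays perfectly usable. (`math.prod` needs Python 3.8 or later.)

```python
import math

# Hypothetical word probabilities: one hundred words, each with probability 1e-5.
probs = [1e-5] * 100

product = math.prod(probs)                 # the true value, 1e-500, underflows to 0.0
logsum = sum(math.log(p) for p in probs)   # a finite negative number

print(product)
print(logsum)
```

Once every class has a product of exactly zero, there is nothing left to compare; the log sums can still be ranked.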

Okay, let's generate the class priors, expressed as logarithms.

The code for this is actually extremely simple. *Notice that we do not have to construct a for-loop at all, because pandas takes care of it for us!* We just count the number of instances in each class, and divide by the total number of instances.

In [13]:
def train_bayes(X, y):
    '''
    This function only performs the first part of the training:
    it generates a class prior for each class, which takes the form
    of a pandas Series
    '''
    
    priors = y.value_counts() / y.shape[0]
    
    logpriors = np.log(priors)
    
    return logpriors
        

In [14]:
# Use the function we just defined to generate a class prior, and print
# it out to see what it looks like.

# Then play around with the math here, step by step,
# to understand why that works. Test .value_counts() and shape,
# and then divide, etc

# To confirm that we're getting the right value, look at this:

math.log(800/ 1600)

# the difference between math.log() and np.log() is that
# the numpy version automatically *broadcasts* the function
# across a vector

-0.6931471805599453
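
As a sketch of what broadcasting means here, the toy label vector below (invented for illustration, and deliberately unbalanced) is turned into priors and then into log priors. `np.log()` is applied to every element of the Series at once, and each result matches what `math.log()` gives for that single number.

```python
import math
import numpy as np
import pandas as pd

# Hypothetical labels: three documents of class 0, one of class 1.
toy_y = pd.Series([0, 0, 0, 1])

priors = toy_y.value_counts() / toy_y.shape[0]   # Series indexed by class: 0 -> 0.75, 1 -> 0.25
logpriors = np.log(priors)                       # applied elementwise, index is preserved

print(logpriors)
print(math.log(0.75), math.log(0.25))            # the same two values, one number at a time
```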

The next part is almost equally simple. Again, no for-loop! Remember how we summed up the rows to normalize frequencies? Now we can sum columns to find the total number of times a word appears in a class. Pandas does that on the whole dataframe at once and saves us from writing a loop, which would tend to be slower as well as more verbose. Here is what we need to do, step by step:

```For each class c:```

    1) Create a dataframe by selecting rows with class == c. We can use train_y to do that. This creates what Jurafsky and Martin call ```bigdoc```.

    2) Sum the *columns* of that class frame to create a vector of wordcounts for the class. Now we have ```count(w, c)```. We'll call that a classvector.

    3) Add one to all the elements of the classvector (Laplace smoothing).
    
    4) Divide the smoothed classvector by its own sum, producing a vector of smoothed likelihoods.

    5) Take the np.log() of the smoothed likelihoods.

    6) Package the two vectors in a single data frame or dictionary. Voila! You have your loglikelihoods.
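
The steps above can be sketched on a tiny made-up corpus before you try them on the real data. This is one possible arrangement under stated assumptions, not the only way to write it; the names `toy_X`, `toy_y` and `loglike_df` and the counts themselves are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy counts: 4 documents, 3 words, two documents per class.
toy_X = pd.DataFrame({'great': [2, 1, 0, 0],
                      'plot':  [1, 1, 1, 0],
                      'worst': [0, 0, 1, 2]},
                     index=['d1', 'd2', 'd3', 'd4'])
toy_y = pd.Series([1, 1, 0, 0], index=toy_X.index)

loglikelihoods = {}
for c in np.unique(toy_y):
    class_frame = toy_X.loc[toy_y == c, :]     # step 1: rows belonging to class c
    classvector = class_frame.sum(axis=0)      # step 2: total count of each word in the class
    smoothed = classvector + 1                 # step 3: Laplace smoothing
    likelihoods = smoothed / smoothed.sum()    # step 4: normalize so the vector sums to 1
    loglikelihoods[c] = np.log(likelihoods)    # step 5: take logs

loglike_df = pd.DataFrame(loglikelihoods)      # step 6: words as rows, classes as columns
print(np.exp(loglike_df))                      # undo the log to inspect the probabilities
```

Even though "worst" never appears in class 1, smoothing gives it a small nonzero probability, which keeps its logarithm finite.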
   
Let's work through that step by step. First, turn train_X into two classvectors. There are ways to do even the two classes without a loop, but let's not get too fancy with it right now; it's okay to loop across the classes and store the resulting vectors in a dictionary.

So I'll give you an initial framework:


In [15]:
# WORKSPACE for TRAINING

for c in [0, 1]:        # instead of [0,1] you can say np.unique(train_y) if you prefer
    
    # lines that generate the classvector go here
    # this will take us through step (2) above
    pass

In [16]:
# Now look at the classvectors you generated to see what's actually there.

#### Steps 3, 4 and 5

These become easy when you realize that you can broadcast addition and division across a vector (a Series or list of numbers in pandas). There's no need to write a loop. For instance, see what happens when you take out the +1 in the cell below.

In [17]:
vector_frame.loc['reviewA', : ] + 1     # take out the + 1 to see the original word counts

and         1
boxing      2
film        1
great       1
it          2
no          1
or          1
part        2
pathetic    2
plot        1
satire      1
scenes      2
the         3
twists      1
was         3
worst       2
Name: reviewA, dtype: int64

Go back to the WORKSPACE for TRAINING and add lines that smooth the classvector (add one to it), normalize it by the sum of the smoothed classvector (that is, the total number of word occurrences in the class, plus one for each word in the vocabulary), and then take the logarithm of each element.

Then let's check that code. Once we know it's working, we can write it up as a function.

In [18]:
# HERE, put the complete function train_bayes().
# It should return log priors and log likelihoods for both classes.
# For now, it doesn't need to return vocabulary, because we know our test and train sets
# have the same vocabulary.

### Now write a function that uses our Naive Bayes model to make predictions about test_X

Here's a reminder of the function we need:

![pseudocode for the test function](pseudocode_test.png)
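
Before writing the full function, it may help to see the core arithmetic on made-up numbers. The pseudocode adds one log likelihood for every occurrence of a word in the test document; multiplying each word's count by its log likelihood and summing gives the same total, and that is a dot product. The parameters below (`logpriors`, `loglike_df`) and the test frame are hypothetical values invented for illustration, not output from the real model.

```python
import numpy as np
import pandas as pd

# Hypothetical model parameters for 2 classes and a 3-word vocabulary.
logpriors = pd.Series(np.log([0.5, 0.5]), index=[0, 1])
loglike_df = pd.DataFrame(np.log([[1/7, 4/8],
                                  [2/7, 3/8],
                                  [4/7, 1/8]]),
                          index=['great', 'plot', 'worst'], columns=[0, 1])

# Hypothetical test documents, as word counts over the same vocabulary.
toy_test = pd.DataFrame({'great': [2, 0], 'plot': [1, 0], 'worst': [0, 3]},
                        index=['t1', 't2'])

# Each score = log prior + sum over words of (count * log likelihood).
scores = toy_test.dot(loglike_df) + logpriors   # rows are documents, columns are classes
predictions = scores.idxmax(axis=1)             # the class with the highest score per document

print(scores)
print(predictions)
```

This sketch assumes the test columns are in the same order as the rows of the log likelihood frame, which holds when train and test share a vocabulary.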

In [19]:
# The test_bayes() function goes here.