# Naive Bayes - using NLTK for move review sentiment analysis

The NLTK module is a massive tool kit, aimed at helping you with the entire Natural Language Processing (NLP) methodology. NLTK will aid you with everything from splitting sentences from paragraphs, splitting up words, recognizing the part of speech of those words, highlighting the main subjects for just the start. 
<br><br>In this notebook I will be combining the NTLK functionally with NaiveBayesClassifier that is built into the module. We aim to create a corpus (body) of positive and negative words that we can use to predict if a movie review is positive or negative. 

Import the packages we need for this analysis. The movie review data from NLTK is 1000 positive and 1000 negative movie reviews that have already been classified. This is the main point of "Training Data". 

In [1]:
import nltk
import random
from nltk.corpus import movie_reviews

The next bit of code extracts the categories from the training set and creates a document collection. Then we need to randomly shuffle the data so it is not in order for the testing.

In [3]:

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

So we have all the reviews as a document collection.<br>
<br>
So now what we need to do is create a new list of all the words in the movie reviews into a large list. Also convert them all to lower case.

In [4]:
all_words = []

for w in movie_reviews.words():
    all_words.append(w.lower())

So new we can use the NLTK function FreqDist to build an array that will show the most common words in the dataset. Then using  the NLTK feature of most_common we can see the most common words in the reviews.

In [5]:
all_words = nltk.FreqDist(all_words)

In [6]:
print(all_words.most_common(10))

[(u',', 77717), (u'the', 76529), (u'.', 65876), (u'a', 38106), (u'and', 35576), (u'of', 34123), (u'to', 31937), (u"'", 30585), (u'is', 25195), (u'in', 21822)]


Note: NTLK thinks that punctuation is word. We can also ask NLTK what is count for a certian word - 'bad' for example. We get 1395

In [7]:
print(all_words['bad'])

1395


So now what we want to is to concentrate to the top group of word so we can base the training on this subset. There is little point training for a word that is seldom used. So we will take to the top 2500 words in the list.

In [8]:
word_features = list(all_words.keys())[:2500]

Now we need a function that will cycle throught the text to great a group of these top 3,000 words in our positive and negative documents, marking their presence as either positive or negative:

In [9]:
def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features

Next, we can print one feature set like:

In [11]:
print((find_features(movie_reviews.words('neg/cv000_29416.txt'))))



Then we can do this for all of our documents, saving the feature existence booleans and their respective positive or negative categories by doing. This will allow use to have a dictonary that will filter the top 2500 words using a boolean true or false.

In [15]:
featuresets = [(find_features(rev), category) for (rev, category) in documents]

The next step is to go ahead and train an algorithm, then test it. <br> This is called supervised machine learning, because we're showing the machine data, and telling it "hey, this data is positive," or "this data is negative." Then, after that training is done, we show the machine some new data and ask the computer, based on what we taught the computer before, what the computer thinks the category of the new data is.

Now we need to separate our data into training and testing sets. We need to use the NLTK Naive Bayes classifier.<br> We need to go ahead and split up the data into a training set and a testing set. 70 / 30 split 

In [11]:
# set that we'll train our classifier with
training_set = featuresets[:1750]

# set that we'll test against.
testing_set = featuresets[1750:]

Next, we can define, and train our classifier like:

In [13]:
classifier = nltk.NaiveBayesClassifier.train(training_set)

First we just simply are invoking the Naive Bayes classifier, then we go ahead and use .train() to train it all in one line.

Once this is done we need to test it, using the built in NTLK accruarcy function.

In [14]:
print("Classifier accuracy percent:",(nltk.classify.accuracy(classifier, testing_set))*100)

('Classifier accuracy percent:', 74.0)


So what is going on here is that we are testing the classfications created by the classifier when we show the data without the correct answers, but we allready know the answers from the training set. Here we get 74% which is pretty decent. 
So, in testing, we show the computer the data without giving it the correct answer - and then rate how it gets on...

Next, we can take it a step further to see what the most valuable words are when it comes to positive or negative reviews:

In [15]:
classifier.show_most_informative_features(30)

Most Informative Features
               insulting = True              neg : pos    =     10.8 : 1.0
                    sans = True              neg : pos    =      8.5 : 1.0
                 wasting = True              neg : pos    =      8.5 : 1.0
            refreshingly = True              pos : neg    =      8.2 : 1.0
              mediocrity = True              neg : pos    =      7.1 : 1.0
            effortlessly = True              pos : neg    =      6.9 : 1.0
               dismissed = True              pos : neg    =      6.9 : 1.0
                   wires = True              neg : pos    =      6.5 : 1.0
                  fabric = True              pos : neg    =      6.2 : 1.0
             overwhelmed = True              pos : neg    =      6.2 : 1.0
               uplifting = True              pos : neg    =      6.0 : 1.0
                  wasted = True              neg : pos    =      5.9 : 1.0
             bruckheimer = True              neg : pos    =      5.8 : 1.0

What this tells you is the ratio of occurences in negative to positive, or visa versa, for every word. So here, we can see that the term "insulting" appears 10.6 more times as often in negative reviews as it does in positive reviews. Ludicrous, 10.1.