### Naive Bayes is a popular algorithm for classifying text. Although it is fairly simple, it often performs as well as much more complicated solutions.

In [1]:
from __future__ import division

In [2]:
# Here's a running history for the past week.
# For each day, it contains whether or not the person ran, and whether or not they were tired.
days = [["ran", "was tired"], ["ran", "was not tired"], ["didn't run", "was tired"], ["ran", "was tired"], ["didn't run", "was not tired"], ["ran", "was not tired"], ["ran", "was tired"]]

# Let's say we want to calculate the probability that someone was tired given that they ran, using Bayes' theorem.
# Here A is "was tired" and B is "ran".  This is P(A).
prob_tired = len([d for d in days if d[1] == "was tired"])/(len(days))
# This is P(B).
prob_ran = len([d for d in days if d[0] == "ran"])/(len(days))
# This is P(B|A).
prob_ran_given_tired = len([d for d in days if d[0] == "ran" and d[1] == "was tired"])/(len([d for d in days if d[1] == "was tired"]))

# Now we can calculate P(A|B).
prob_tired_given_ran = (prob_ran_given_tired * prob_tired)/prob_ran

print("Probability of being tired given that you ran: {0}".format(prob_tired_given_ran))

Probability of being tired given that you ran: 0.6
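
As a quick sanity check, the same number can be reproduced with exact fractions and compared against a direct count. This is only a sketch: the counts are read off the seven-day list above, and the variable names are made up for this example.

```python
from fractions import Fraction

# Counts read off the seven-day history above.
total_days = 7
tired_days = 4            # days where the person was tired
ran_days = 5              # days where the person ran
ran_and_tired_days = 3    # days where both happened

p_tired = Fraction(tired_days, total_days)                     # P(A)
p_ran = Fraction(ran_days, total_days)                         # P(B)
p_ran_given_tired = Fraction(ran_and_tired_days, tired_days)   # P(B|A)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_tired_given_ran = p_ran_given_tired * p_tired / p_ran

# Direct count for comparison: tired days among the days with a run.
direct = Fraction(ran_and_tired_days, ran_days)

print(p_tired_given_ran, direct)
```

Both values come out as 3/5, which matches the 0.6 printed above.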


### Each review is one long string rather than a set of separate features. The easiest way to generate features from text is to split it into words, so that each word in a review becomes a feature we can work with. To do this, we'll split the reviews on whitespace.
1. We'll then count how many times each word occurs in the negative reviews.
2. We'll do the same for the positive reviews.
3. These counts will eventually let us compute the probability of a new review belonging to each class.
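
To make the counting step concrete before running it on the real data, here is a minimal sketch on two made-up reviews. The strings and the helper name `toy_word_counts` are invented for illustration and are not part of `movie_data.csv`.

```python
import re
from collections import Counter

# Two made-up reviews, purely for illustration.
toy_negative = "Bad plot and bad acting"
toy_positive = "Great acting and a great ending"

def toy_word_counts(text):
    # Lowercase, split on runs of whitespace, and count each resulting token.
    return Counter(re.split(r"\s+", text.lower().strip()))

toy_negative_counts = toy_word_counts(toy_negative)
toy_positive_counts = toy_word_counts(toy_positive)

print(toy_negative_counts["bad"])
print(toy_positive_counts["great"])
print(toy_positive_counts["bad"])   # a Counter returns 0 for unseen words
```

Each class ends up with its own word-to-count mapping, which is the shape of data the rest of the walkthrough relies on.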

In [3]:
from collections import Counter
import csv
import re

# Read in the training data.
with open("movie_data.csv", 'r') as file:
    reviews = list(csv.reader(file))

def get_text(reviews, score):
    # Join together the text in the reviews for a particular tone.
    # We lowercase to avoid "Not" and "not" being seen as different words, for example.
    return " ".join([r[0].lower() for r in reviews if r[1] == str(score)])

def count_text(text):
    # Split text into words based on whitespace.  Simple but effective.
    words = re.split(r"\s+", text)
    # Count up the occurrences of each word.
    # Counter is a dict subclass for counting hashable items (sometimes called a bag or multiset):
    # elements are stored as dictionary keys and their counts as dictionary values.
    return Counter(words)

negative_text = get_text(reviews, 0)
positive_text = get_text(reviews, 1)
# Generate word counts for negative tone.
negative_counts = count_text(negative_text)  # Maps each word (key) to its count (value).
# Generate word counts for positive tone.
positive_counts = count_text(positive_text)

print("Negative text sample: {0}".format(negative_text[:100]))
print("Positive text sample: {0}".format(positive_text[:100]))

Negative text sample: this 22 minute short, short of a precursor to the later much better "rock and rule", features two fo
Positive text sample: ladies and gentlemen: the show begins with this documentary film. it's structured in three chapters,


In [4]:
count = 0
for i in positive_counts:
    print(i)
    count += 1
    if count == 10:
        break
print(positive_counts.get('considered.', 0))

considered,
considered.
"unsubtle"
while..<br
8mm.
8mm,
considered"
shakespearean-child
considered?
woods
9


In [5]:
count = 0
for i in negative_counts:
    print(i)
    count += 1
    if count == 10:
        break
print(negative_counts.get('considered.', 0))

considered,
considered.
'daring',
emerges,
thrice-cursed
nunnery
canada"
8mm.
8mm,
sunflowers!
1


### Now that we have the word counts, we just have to convert them to probabilities and multiply them out to get the predicted classification.
1. Let's say we wanted to find the probability that the review **didn't like it** expresses a negative sentiment.
2. We would find the total number of times the word **didn't** occurred in the negative reviews, and **divide it by the total number of words in the negative reviews to get the probability of x given y**, where x is the word and y is the class.
3. We would then **do the same** for **like** and **it**. We would multiply all three probabilities, and then multiply by the **probability of any document expressing a negative sentiment** to get a score proportional to the probability that the sentence expresses negative sentiment.
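
The three steps above can be sketched on a tiny example. The word counts and the class probability below are invented for illustration (they are not taken from the movie reviews), and no smoothing is applied yet.

```python
from collections import Counter

# Hypothetical word counts for the negative class (made up for illustration).
toy_negative_counts = Counter({"didn't": 3, "like": 2, "it": 5, "boring": 10})
toy_total_negative_words = sum(toy_negative_counts.values())  # 20 words in total
toy_prob_negative = 0.5  # assumed share of negative reviews

score = toy_prob_negative
for word in ["didn't", "like", "it"]:
    # P(word | negative) = occurrences of the word / all words in negative reviews
    score *= toy_negative_counts[word] / toy_total_negative_words

print(score)
```

With these numbers the score is 0.5 × 3/20 × 2/20 × 5/20, or about 0.0019. Repeating the calculation with positive counts and comparing the two scores gives the classification.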

In [6]:
import re
from collections import Counter

def get_y_count(score):
    # Compute the count of each classification occurring in the data.
    return len([r for r in reviews if r[1] == str(score)])

# We need these counts to use for smoothing when computing the prediction.
positive_review_count = get_y_count(1)
negative_review_count = get_y_count(0)

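# These are the class probabilities (we saw them in the formula as P(y)).
# Note: reviews still contains the header row, so len(reviews) is one more than the number of reviews and each value comes out slightly below 0.5.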
prob_positive = positive_review_count / len(reviews)
prob_negative = negative_review_count / len(reviews)

In [7]:
print(prob_positive)
print(prob_negative)

0.4999900002
0.4999900002


In [8]:
reviews[0]

['review', 'sentiment']

In [9]:
def make_class_prediction(text, counts, class_prob, class_count):
    prediction = 1
    text_counts = Counter(re.split(r"\s+", text))
    # Total number of words seen for this class; computed once instead of on every loop iteration.
    total_words = sum(counts.values())
    for word in text_counts:
        # For every word in the text, we get the number of times that word occurred in the reviews for a given class,
        # add 1 to smooth the value, and divide by the total number of words in the (negative|positive) reviews
        # (plus class_count to also smooth the denominator).
        # Smoothing ensures that we don't multiply the prediction by 0 if the word didn't exist in the training data.

        # print(counts.get(word, 0))  # some words occur often, which diminishes the prediction value

        # text_counts holds how often each word occurs in the text, so a repeated word is weighted by its count.
        prediction *= text_counts.get(word) * ((counts.get(word, 0) + 1) / (total_words + class_count))
    # Now we multiply by the probability of the class existing in the documents.
    return prediction * class_prob

# As you can see, we can now generate probabilities for which class a given review is part of.
# The probabilities themselves aren't very useful -- we make our classification decision based on which value is greater.
print("Review: {0}".format(reviews[1][0]))
print("Negative prediction: {0}".format(make_class_prediction(reviews[1][0], negative_counts, prob_negative, negative_review_count)))
print("Positive prediction: {0}".format(make_class_prediction(reviews[1][0], positive_counts, prob_positive, positive_review_count)))

Review: This 22 minute short, short of a precursor to the later much better "Rock and Rule", features two folk singer mice who are going nowhere. The female mouse, Jan, signs a deal with the devil to become a hit rock star. So it's up to Daniel Mouse to save her soul. Made in the late '70's this has all the trappings of said decade (crap music, crap clothing and hair style, awful folk tunes) This cartoon is featured on the Second disc of the 2-Disk Collector's Edition of "Rock and Rule", it also comes with a Making of that runs almost as long as the show itself.<br /><br />My Grade: D+
Negative prediction: 0.0
Positive prediction: 0.0
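
Both predictions print as 0.0 because multiplying a long run of very small numbers underflows a floating-point value. A common workaround is to add logarithms instead of multiplying. The sketch below shows the idea on made-up counts: `log_class_score` and the toy data are hypothetical names for this illustration, and it uses the textbook form in which a word's log probability is multiplied by its count in the text, so its values are not directly comparable to the outputs in this notebook.

```python
import math
import re
from collections import Counter

def log_class_score(text, counts, class_prob, class_count):
    # Same add-one smoothing as make_class_prediction, but summing logs
    # so that long texts do not underflow to 0.0.
    total_words = sum(counts.values())
    text_counts = Counter(re.split(r"\s+", text.strip()))
    score = math.log(class_prob)
    for word, n in text_counts.items():
        word_prob = (counts.get(word, 0) + 1) / (total_words + class_count)
        score += n * math.log(word_prob)
    return score

# Made-up counts, for illustration only.
toy_negative_counts = Counter({"bad": 8, "boring": 5, "good": 1})
toy_positive_counts = Counter({"good": 9, "fun": 4, "bad": 1})

long_text = " ".join(["bad"] * 2000)   # long enough to underflow a plain product
neg = log_class_score(long_text, toy_negative_counts, 0.5, 10)
pos = log_class_score(long_text, toy_positive_counts, 0.5, 10)
print(neg > pos)
```

The scores are large negative numbers rather than probabilities, but the comparison between classes still works: the class with the greater log score wins.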


In [11]:
text = 'This 22 minute short, short of a precursor to the later much better "Rock and Rule", features two folk singer mice who are going nowhere'
print("Negative prediction: {0}".format(make_class_prediction(text, negative_counts, prob_negative, negative_review_count)))
print("Positive prediction: {0}".format(make_class_prediction(text, positive_counts, prob_positive, positive_review_count)))

Negative prediction: 2.10446027557e-95
Positive prediction: 4.00031344778e-95


### There are a lot of extensions that we could make to this algorithm to make it perform better. 
1. We could look at n-grams instead of unigrams. 
2. We could remove punctuation and other non-alphanumeric characters. 
3. We could remove stopwords. 
4. We could also perform stemming or lemmatization.
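
As an illustration of the second idea, the word lists printed earlier contain tokens such as `considered,`, `considered.` and `while..<br`, which are all counted separately. A rough tokenizer like the one below would merge them. `clean_tokens` is a hypothetical helper written for this example, not something used elsewhere in the notebook.

```python
import re
from collections import Counter

def clean_tokens(text):
    # Lowercase, drop HTML tags such as <br />, then keep runs of letters,
    # digits, and apostrophes as tokens (everything else acts as a separator).
    text = re.sub(r"<[^>]+>", " ", text.lower())
    return re.findall(r"[a-z0-9']+", text)

sample = 'It was considered, "considered". Considered?<br /><br />8mm, 8mm.'
print(Counter(clean_tokens(sample)))
```

Here the three spellings of "considered" collapse into a single key with a count of 3. The pattern keeps apostrophes so that words like "didn't" survive, which also means quoted words such as `'daring'` would keep their quotes; a real cleanup step would need to handle that.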

### Let's remove the stopwords. An easier way to use naive Bayes is the implementation in scikit-learn, whose `CountVectorizer` can drop English stopwords for us.

In [12]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics

# Generate counts from text using a vectorizer.  There are other vectorizers available, and lots of options you can set.
# This performs our step of computing word counts.
vectorizer = CountVectorizer(stop_words='english')

train_features = vectorizer.fit_transform([r[0] for r in reviews[1:20000]])
test_features = vectorizer.transform([r[0] for r in reviews[20000:]])

# Fit a naive bayes model to the training data.
# This will train the model using the word counts we compute, and the existing classifications in the training set.
nb = MultinomialNB()
nb.fit(train_features, [int(r[1]) for r in reviews[1:20000]])

# Now we can use the model to predict classifications for our test features.
predictions = nb.predict(test_features)

In [13]:
len(predictions)

30001

In [14]:
len(reviews[20000:])

30001

In [15]:
print("Number of correct labeled points out of a total %d points : %d"
      % (len(reviews[20000:]), ([int(r[1]) for r in reviews[20000:]] == predictions).sum()))

Number of correct labeled points out of a total 30001 points : 25503
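
Expressed as a fraction, the counts above correspond to an accuracy of roughly 85%. The snippet below simply redoes that division with the two numbers copied from the output.

```python
# Numbers copied from the output above.
correct = 25503
total = 30001

accuracy = correct / total
print("Accuracy: {0:.3f}".format(accuracy))
```

The `metrics` module imported earlier offers `metrics.accuracy_score(y_true, y_pred)`, which should give the same figure when passed the true labels and `predictions`.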
