We'll be cleaning, parsing, and counting the words in a text file, so we'll want to use regular expressions and the CountVectorizer class from scikit-learn.  To deal with the tab-delimited file and numerical operations on what can become large arrays and matrices, we'll also import pandas and numpy.

In [3]:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy as np
import re

We'll break down a series of reviews into a list of their N most frequent terms, also called a "bag of words".  We'll start with the trivial case of two records and two categories, a good movie review and a bad movie review.

Each sample file has three columns: a record ID, a "category", and some unstructured text.  The category, marked 0 or 1, indicates whether a review was negative (0) or positive (1).

Note that we're loading this into a pandas dataframe.  There's only one record in the samplePositive.tsv file, but we'll benefit from pandas operations when we generalize this to include a large number of files in a training set.

In [8]:
good_review = pd.read_csv('data/samplePositive.tsv', header=0, delimiter="\t", quoting=3)
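
A quick aside on the quoting=3 argument in the call above: it has the same integer value as csv.QUOTE_NONE in Python's standard library, which means quotation marks are treated as ordinary characters rather than as field delimiters.  The sketch below shows that behavior with the standard csv module instead of pandas.  The sample record and the column names "id" and "category" are assumptions made up for illustration; only the "text" column name comes from the code in this notebook.

```python
import csv
import io

# Hypothetical one-record file with the three columns described above.
# The "id" and "category" header names are assumed for illustration.
sample_tsv = 'id\tcategory\ttext\n1\t1\t"Wow what a good movie."\n'

# csv.QUOTE_NONE has the integer value 3, the same value passed as quoting=3.
reader = csv.reader(io.StringIO(sample_tsv), delimiter="\t", quoting=csv.QUOTE_NONE)
header = next(reader)
record = next(reader)

print(header)     # the column names
print(record[2])  # the review text, with its quotation marks left in place
```

Because the quotation marks are not stripped, they stay in the text field as part of the review.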

Here's the text for a fictitious positive movie review.  

In [10]:
good_review['text'][0]

'"Wow what a good movie.  Absolutely excellent, so good.  I loved it.  Music, dancing, action, intrigue, really good.  I\'d highly recommend this excellent film."'

And here's a clearly negative movie review.  

In [16]:
bad_review = pd.read_csv('data/sampleNegative.tsv', header=0, delimiter="\t", quoting=3)

In [17]:
bad_review['text'][0]

'"What a terrible, terrible film.  Truly bad.  In fact, I might call it awful.  I might have to call it awful twice.  The music score was dreary, the acting was contrived, the plot was not believable or convincing.  Avoid. Terrible."  '

We now have two categories, "good" and "bad", and two reviews, one for each category. 

With this minimal set of two reviews, one representing each category, we can now create a "bag of words", a list of the most frequent terms that show up across all the movie reviews.

To generate this "bag of words", we'll first generate a new list that holds all the reviews in the set (in this case, one positive review, one negative review).  

In [20]:
all_reviews = []
all_reviews.append(good_review['text'][0])
all_reviews.append(bad_review['text'][0])

In [22]:
print(all_reviews)

['"Wow what a good movie.  Absolutely excellent, so good.  I loved it.  Music, dancing, action, intrigue, really good.  I\'d highly recommend this excellent film."', '"What a terrible, terrible film.  Truly bad.  In fact, I might call it awful.  I might have to call it awful twice.  The music score was dreary, the acting was contrived, the plot was not believable or convincing.  Avoid. Terrible."  ']


We are now ready to break this list into its most common terms.  

Python's scikit-learn library has a class, CountVectorizer, for this task.  It accepts a list of strings (in our case, movie reviews) and learns a vocabulary of the N most common terms.  If no number is supplied, CountVectorizer will simply keep all the words that appear in the reviews, which is fine for now since our data set is very small.

In [24]:
vectorizer = CountVectorizer(analyzer = "word")
bag_of_words = vectorizer.fit(all_reviews)

And voila, we now have a "bag of words" for our movie reviews!

In [25]:
print(vectorizer.get_feature_names())

['absolutely', 'acting', 'action', 'avoid', 'awful', 'bad', 'believable', 'call', 'contrived', 'convincing', 'dancing', 'dreary', 'excellent', 'fact', 'film', 'good', 'have', 'highly', 'in', 'intrigue', 'it', 'loved', 'might', 'movie', 'music', 'not', 'or', 'plot', 'really', 'recommend', 'score', 'so', 'terrible', 'the', 'this', 'to', 'truly', 'twice', 'was', 'what', 'wow']


Note that the vectorizer provides feature names in alphabetical order.  To get the numerical position of each term, you can use the vocabulary_ attribute.

In [26]:
print(vectorizer.vocabulary_)

{'wow': 40, 'what': 39, 'good': 15, 'movie': 23, 'absolutely': 0, 'excellent': 12, 'so': 31, 'loved': 21, 'it': 20, 'music': 24, 'dancing': 10, 'action': 2, 'intrigue': 19, 'really': 28, 'highly': 17, 'recommend': 29, 'this': 34, 'film': 14, 'terrible': 32, 'truly': 36, 'bad': 5, 'in': 18, 'fact': 13, 'might': 22, 'call': 7, 'awful': 4, 'have': 16, 'to': 35, 'twice': 37, 'the': 33, 'score': 30, 'was': 38, 'dreary': 11, 'acting': 1, 'contrived': 8, 'plot': 27, 'not': 25, 'believable': 6, 'or': 26, 'convincing': 9, 'avoid': 3}
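
To make the idea less of a black box, here is a minimal from-scratch sketch of how such a vocabulary could be built with only the standard library.  It is not scikit-learn's implementation.  It assumes the tokenization rule is "lowercase, then keep runs of two or more word characters", which to my understanding mirrors CountVectorizer's default token pattern, and the helper name tokenize is made up for this example.

```python
import re

reviews = [
    "Wow what a good movie.  Absolutely excellent, so good.  I loved it.  "
    "Music, dancing, action, intrigue, really good.  "
    "I'd highly recommend this excellent film.",
    "What a terrible, terrible film.  Truly bad.  In fact, I might call it awful.  "
    "I might have to call it awful twice.  The music score was dreary, "
    "the acting was contrived, the plot was not believable or convincing.  "
    "Avoid. Terrible.",
]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters,
    # so single letters such as "a" and "I" are dropped.
    return re.findall(r"\b\w\w+\b", text.lower())

# The bag of words: every distinct term, in alphabetical order.
terms = sorted({token for review in reviews for token in tokenize(review)})

# Map each term to its position in the sorted list.
vocabulary = {term: index for index, term in enumerate(terms)}

print(len(terms))
print(vocabulary["good"], vocabulary["wow"])
```

Under these assumptions the sketch produces the same 41 terms and the same positions as the output above, for example 15 for "good" and 40 for "wow".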


Now that we have a bag of words, we can calculate the number of times these words appear in each review.  The resulting data structure is often called a "word vector", an index of the frequency at which each word appears.  

Note that we can also supply "stop words".  These are frequently occurring terms that often carry little meaning and can clutter up an algorithm.  (In real-world applications, you may discover that terms you thought were devoid of meaning actually make a difference in context!)

The scikit-learn library can remove stop words from a bag of words.  Here, we use the stop_words parameter to remove the common English stop words ("a, all, and, also...").  We'll also limit the bag to the ten most frequent words through the max_features parameter.

In [40]:
vectorizer = CountVectorizer(analyzer = "word", stop_words = 'english', max_features = 10)
bag_of_words = vectorizer.fit(all_reviews)
print(vectorizer.get_feature_names())

['awful', 'excellent', 'film', 'good', 'music', 'really', 'recommend', 'score', 'terrible', 'truly']
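
The two parameters can be pictured as a filter followed by a ranking.  The sketch below does this by hand with collections.Counter.  The STOP_WORDS set is a hypothetical, heavily abbreviated stand-in for scikit-learn's much longer English list, and the tokenize helper is made up for this example.

```python
import re
from collections import Counter

reviews = [
    "Wow what a good movie.  Absolutely excellent, so good.  I loved it.  "
    "Music, dancing, action, intrigue, really good.  "
    "I'd highly recommend this excellent film.",
    "What a terrible, terrible film.  Truly bad.  In fact, I might call it awful.  "
    "I might have to call it awful twice.  The music score was dreary, "
    "the acting was contrived, the plot was not believable or convincing.  "
    "Avoid. Terrible.",
]

# Hypothetical, abbreviated stop list (an assumption for this sketch only).
STOP_WORDS = {"call", "have", "in", "it", "might", "not", "or",
              "so", "the", "this", "to", "was", "what"}

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

# Count every remaining term across all reviews after dropping stop words.
counts = Counter(
    token
    for review in reviews
    for token in tokenize(review)
    if token not in STOP_WORDS
)

# Keep the six most frequent terms and list them alphabetically.
top_terms = sorted(term for term, _ in counts.most_common(6))
print(top_terms)
```

Only six terms appear more than once after filtering, so the sketch stops at six.  Every other term occurs exactly once, so which of them fill the remaining slots of a ten-term bag depends on how ties are broken, and a hand-rolled version may not pick the same four as the output above.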


Word Vectors

Now that we have a list of the most common words in all reviews, we count the frequency with which these words appear in each review.  Each review can be decomposed into a word count vector, a list of how often each word in the most frequent terms appears in a particular review.

In [42]:
good_review_vector = vectorizer.transform([good_review['text'][0]])

The result is an array with a word count corresponding to each term in the bag of words.  As you can see below, in our "good" review, the word "awful" (first index) doesn't appear, whereas "excellent" (second index) shows up twice.  

In [43]:
good_review_vector.toarray()

array([[0, 2, 1, 3, 1, 1, 1, 0, 0, 0]])

Similarly, we can take a look at the word count for a bad review.

In [46]:
bad_review_vector = vectorizer.transform([bad_review['text'][0]])

In this case, awful shows up twice, and excellent is missing.  Note that some of the terms, such as film or music, show up once in both the good and bad review.  

In [45]:
bad_review_vector.toarray()

array([[2, 0, 1, 0, 1, 0, 0, 1, 3, 1]])

It can be helpful to look at the feature names and the two vectors side by side.

In [35]:
all_review_vector = vectorizer.transform(all_reviews)
print(vectorizer.get_feature_names())
print(all_review_vector.toarray())

['awful', 'excellent', 'film', 'good', 'music', 'really', 'recommend', 'score', 'terrible', 'truly']
[[0 2 1 3 1 1 1 0 0 0]
 [2 0 1 0 1 0 0 1 3 1]]
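
If you'd like to see what the transform step amounts to, the sketch below counts each vocabulary term by hand.  It's a standard-library approximation rather than scikit-learn's code, and the count_vector helper is a name invented for this example.

```python
import re
from collections import Counter

# The ten-term bag of words printed above.
vocabulary = ['awful', 'excellent', 'film', 'good', 'music',
              'really', 'recommend', 'score', 'terrible', 'truly']

good_text = ("Wow what a good movie.  Absolutely excellent, so good.  I loved it.  "
             "Music, dancing, action, intrigue, really good.  "
             "I'd highly recommend this excellent film.")
bad_text = ("What a terrible, terrible film.  Truly bad.  In fact, I might call it awful.  "
            "I might have to call it awful twice.  The music score was dreary, "
            "the acting was contrived, the plot was not believable or convincing.  "
            "Avoid. Terrible.")

def count_vector(text, vocabulary):
    # Count every token once, then read off the counts in vocabulary order.
    counts = Counter(re.findall(r"\b\w\w+\b", text.lower()))
    return [counts[term] for term in vocabulary]

good_vector = count_vector(good_text, vocabulary)
bad_vector = count_vector(bad_text, vocabulary)

# Print term, count in the good review, count in the bad review.
for term, good_count, bad_count in zip(vocabulary, good_vector, bad_vector):
    print(f"{term:<10} {good_count} {bad_count}")
```

Words outside the vocabulary are simply ignored, which is why each vector has exactly ten entries no matter how long the review is.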


Machine Learning

So far, this has all been data carpentry.  We find the most common words in a set of documents, and we create a word vector to represent each individual document in the set.

Now that we have this data, we can use it to train a computer to recognize positive and negative reviews based on patterns it finds in the word vectors associated with positive and negative reviews.  That is the essence of supervised machine learning.  We have a set of records assigned to a pre-defined set of categories, and we use it to train a computer to find a way to categorize records into those categories.  

Although we haven't discussed creating a method to categorize reviews into "positive" and "negative", you may already be thinking of some strategies based on what you've seen here.  By creating a list, or "bag of words" of common terms, and decomposing each review into a word count vector, we can create a signature for each review.  We can then associate these vectors with each category and come up with some kind of rule for matching a word count with a positive or negative review.  

Before continuing, you might try designing and implementing an approach yourself.  You have a positive and a negative review, and a single word vector for each.  How could you use this to program a computer to predict whether a new review is positive or negative?
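
If you'd like a starting point, here is one deliberately naive possibility, and certainly not the only one.  It scores a new review by taking the dot product of its word vector with the positive and negative vectors from above and picks the larger score.  The function names and the "undecided" label are made up for this sketch, and a rule built from two reviews will not generalize well.

```python
import re
from collections import Counter

vocabulary = ['awful', 'excellent', 'film', 'good', 'music',
              'really', 'recommend', 'score', 'terrible', 'truly']

# Word vectors for the sample positive and negative reviews shown earlier.
positive_vector = [0, 2, 1, 3, 1, 1, 1, 0, 0, 0]
negative_vector = [2, 0, 1, 0, 1, 0, 0, 1, 3, 1]

def count_vector(text):
    # Count how often each vocabulary term appears in the text.
    counts = Counter(re.findall(r"\b\w\w+\b", text.lower()))
    return [counts[term] for term in vocabulary]

def dot(a, b):
    # Sum of element-wise products of two equal-length vectors.
    return sum(x * y for x, y in zip(a, b))

def classify(text):
    # Compare the new review's overlap with each category's vector.
    vector = count_vector(text)
    positive_score = dot(vector, positive_vector)
    negative_score = dot(vector, negative_vector)
    if positive_score == negative_score:
        return "undecided"
    return "positive" if positive_score > negative_score else "negative"

print(classify("A really good, excellent film."))
print(classify("Truly awful, terrible music score."))
```

Shared terms such as "film" and "music" contribute equally to both scores, so the decision rests on the terms that appear in only one of the two vectors.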

Real Data

We'll use a genuine, real world data set of positive and negative movie reviews, available at http://www.cs.cornell.edu/home/llee/papers/sentiment.home.html

Keep in mind, even this has been pre-processed a bit for us.  Data carpentry is a huge part of data science, and getting data into a form where you can use it plays a huge role in any application of machine learning.  Finding, cleaning, munging, formatting, and preparing data isn't just a technical task.  You almost always have to make some decisions about what to keep, what to discard, and how to prepare it, and these decisions often introduce patterns and assumptions that influence the outcome when you apply an ML technique.

Even with well-managed and prepared data, as we have with the sentiment data from Cornell, I had to do some formatting to get the data set into a tab-delimited format that can be easily imported into pandas.  The scripts to do this are located in this repository in rawdata/review_polarity/text_sentiment/createfile.py (another script to randomize the order of the reviews is available in shufflefile.py).  These scripts aren't especially elaborate, but they do go to show that you'll most likely have to do some processing and filtering even under the very best circumstances (and most of your raw data won't be anywhere near as well presented as the sentiment data download here).  Also, this data set is small, only 500 positive and negative movie reviews.  Cleaning and parsing a very large data set is an entirely different challenge!

With that said, let's create a bag of words and word vector for the sentiment data set.  

Bag of Words

Although our example above was limited to two reviews and two categories, we'll use the same approach at a larger scale.

The file trainReviews.tsv is a tab-delimited file with 500 reviews, split into positive and negative reviews.  The columns correspond to our sample files above: the first column is an identifier, the second is a category (0 for negative, 1 for positive), and the third contains the text of the review.  We'll use this data, along with the CountVectorizer class from scikit-learn, to build a bag of words.  We cap it at 200 terms here so it stays easy to inspect; the next example raises the cap to 5,000.

First, let's load the tab-delimited file into a pandas dataframe, as before.

In [47]:
train = pd.read_csv('data/trainReviews.tsv', header=0, delimiter="\t", quoting=3)

Next, we create a list of reviews.  Note that we're using a regular expression to replace every character that is not alphanumeric with a space, so we don't have to deal with punctuation or tags.  This keeps it simple, but keep in mind that those tags and other characters may actually have meaning sometimes.  You may lose context or other information when you do this.

In [48]:
train_records = []
for i in range( 0, len(train["text"])):
    text = train["text"][i]
    text = re.sub("[^a-zA-Z0-9]"," ", text)
    train_records.append(text.lower())
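
To see what this cleaning step does to a single string, here is a small worked example.  The input string is invented for illustration and is not taken from the data set.

```python
import re

# Hypothetical snippet containing an apostrophe, an HTML tag, and punctuation.
raw = "It's <br /> GREAT!!"

# Replace every non-alphanumeric character with a space, then lowercase.
cleaned = re.sub("[^a-zA-Z0-9]", " ", raw).lower()

# Splitting on whitespace shows the tokens that survive.
print(cleaned.split())
```

The string keeps its length because each removed character becomes a space.  Notice that the contraction is split into "it" and "s" and the tag leaves behind a stray "br", which is the kind of lost context mentioned above.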


We can now use the list of text to find a bag of words for all reviews, positive and negative.  We'll limit the bag of words to 200 terms so we can inspect it more easily.  For a real application, we'd probably want to set this value to a much higher number.

In [50]:
vectorizer = CountVectorizer(analyzer = "word",
                             tokenizer = None,
                             preprocessor = None,
                             stop_words = 'english',
                             max_features = 200)

train_data_features = vectorizer.fit_transform(train_records)

Now that we've created a bag of words, we can list the 200 most frequent terms.

In [51]:
print(vectorizer.get_feature_names())

['10', 'acting', 'action', 'actor', 'actors', 'actually', 'audience', 'away', 'bad', 'based', 'begins', 'best', 'better', 'big', 'bit', 'black', 'called', 'cast', 'character', 'characters', 'city', 'come', 'comedy', 'comes', 'comic', 'completely', 'couple', 'course', 'david', 'day', 'dead', 'death', 'despite', 'dialogue', 'did', 'didn', 'different', 'director', 'does', 'doesn', 'don', 'effects', 'end', 'ending', 'especially', 'evil', 'example', 'fact', 'family', 'far', 'father', 'feel', 'film', 'films', 'final', 'friend', 'fun', 'funny', 'game', 'gets', 'getting', 'given', 'gives', 'goes', 'going', 'good', 'got', 'great', 'group', 'guy', 'half', 'hand', 'hard', 'having', 'head', 'help', 'high', 'hollywood', 'home', 'horror', 'hour', 'human', 'humor', 'idea', 'instead', 'interesting', 'isn', 'jackie', 'james', 'job', 'john', 'just', 'kind', 'know', 'left', 'let', 'life', 'like', 'line', 'little', 'll', 'long', 'look', 'looking', 'looks', 'lost', 'lot', 'love', 'main', 'make', 'makes', '

Now that we have the most common 200 terms, we can decompose each review in the set into a word vector, counting the frequency of each term in a review. 

In [55]:
full_bad_review = pd.read_csv('data/negativeReview.csv', header=0, delimiter="\t", quoting=3)

In [54]:
full_bad_review["text"][0]

We can decompose this review into a word count vector, showing the frequency of each term in the top 200 terms for the full set of reviews.

In [50]:
bad_review_data_features = vectorizer.transform([full_bad_review["text"][0]])

In [51]:
bad_review_data_features.toarray()[0]

array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 2, 0, 2, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 2, 2, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0])

This is a little hard to track, so let's go ahead and print the term and frequency side by side for this review.

In [52]:
bad_review_wordcount = bad_review_data_features.toarray()[0]
# Fetch the feature names once and pair each term with its count.
for term, count in zip(vectorizer.get_feature_names(), bad_review_wordcount):
    print(term, count)

10 0
acting 0
action 0
actor 0
actors 1
actually 0
audience 1
away 0
bad 1
based 0
begins 0
best 1
better 0
big 0
bit 0
black 0
called 0
cast 0
character 2
characters 2
city 0
come 2
comedy 0
comes 0
comic 0
completely 0
couple 0
course 0
david 0
day 1
dead 0
death 0
despite 0
dialogue 1
did 0
didn 1
different 0
director 1
does 1
doesn 0
don 1
effects 0
end 0
ending 0
especially 0
evil 0
example 0
fact 0
family 0
far 0
father 0
feel 0
film 8
films 0
final 0
friend 0
fun 0
funny 0
game 0
gets 0
getting 0
given 0
gives 0
goes 1
going 1
good 1
got 0
great 0
group 0
guy 0
half 0
hand 0
hard 0
having 0
head 0
help 0
high 0
hollywood 0
home 0
horror 0
hour 0
human 0
humor 0
idea 0
instead 1
interesting 0
isn 0
jackie 0
james 0
job 0
john 0
just 0
kind 2
know 0
left 0
let 0
life 1
like 0
line 0
little 0
ll 0
long 0
look 0
looking 0
looks 0
lost 1
lot 0
love 0
main 0
make 0
makes 0
making 1
man 0
men 1
michael 0
mind 0
minutes 0
moments 0
money 0
mother 0
movie 4
movies 0
mr 0
music 0
new 0
ni

You may notice there aren't many matches here; most of the counts are zero.  This isn't uncommon.  In addition, many of the terms, like "film", are too general to be associated with a positive or negative review.  To use this in a real training scenario, we'd want to increase the number of words in the bag considerably.  In the next example, where we apply machine learning techniques to predict whether a review is positive or negative, we'll go up to 5,000 terms.  You can do that by changing a single parameter, max_features, in the CountVectorizer call.  Feel free to do that now before moving on to the next section.
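
Since most entries are zero, it can be easier to look only at the terms that actually occur.  The sketch below takes a handful of (term, count) pairs copied from the listing above, keeps the nonzero ones, and sorts them by frequency.  The variable names are made up for this example, and it uses only ten of the 200 pairs.

```python
# A few (term, count) pairs copied from the printed listing above.
term_counts = [
    ("10", 0), ("acting", 0), ("actors", 1), ("audience", 1), ("bad", 1),
    ("character", 2), ("film", 8), ("kind", 2), ("movie", 4), ("music", 0),
]

# Keep only the terms that appear in the review, most frequent first.
nonzero = sorted(
    (pair for pair in term_counts if pair[1] > 0),
    key=lambda pair: -pair[1],
)

for term, count in nonzero:
    print(term, count)

# Share of this small sample that is zero.
zero_share = sum(1 for _, count in term_counts if count == 0) / len(term_counts)
print(zero_share)
```

Python's sort is stable, so terms with equal counts stay in alphabetical order.  The same filter works on the full 200-term vector if you pair the feature names with the counts first.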