# UBC Scientific Software Seminar

## November 25, 2016

Today's Agenda:

* Natural Language Processing with [nltk](http://www.nltk.org/)
  * Movie review corpus: Exploring the `movie_reviews` nltk object
  * `nltk` stopwords
  * Using regular expression module `re` and `string` module to remove punctuation
  * Feature selection: Find top 2000 most frequent words excluding stopwords and punctuation
  * Naive Bayes movie review classifier

Last time, we built a classifier to determine whether a movie review is positive or negative. Today, our goal is to do the same with a slightly different method: we remove stopwords and punctuation first so that the selected features carry more signal.

Let's import [nltk](http://www.nltk.org/) and [sklearn](http://scikit-learn.org/stable/) and check the versions we will be using.

In [1]:
import nltk

In [2]:
nltk.__version__

'3.2.4'

In [3]:
import sklearn

In [4]:
sklearn.__version__

'0.19.1'

### Movie review corpus: Exploring the `movie_reviews` nltk object

Let's download the movie reviews corpus:

In [5]:
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/Fall/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


True

In [6]:
from nltk.corpus import movie_reviews

The `movie_reviews` object is a *strange* nltk object. Let's take a look.

#### movie_reviews.fileids

The `fileids` method returns a list of strings which correspond to the file names of all the reviews.

In [7]:
fileids = movie_reviews.fileids()

In [8]:
fileids[0]

'neg/cv000_29416.txt'

In [9]:
len(fileids)

2000

We see that we have 2000 movie reviews.

#### movie_reviews.categories

Each review in `movie_reviews` is labelled with a category, which we access with the `categories` method. It takes a file id as input and returns a list of labels:

In [10]:
print(movie_reviews.categories(fileids[0]))
print(movie_reviews.categories(fileids[999]))
print(movie_reviews.categories(fileids[1000]))
print(movie_reviews.categories(fileids[1999]))

['neg']
['neg']
['pos']
['pos']


These four checks at the boundaries indicate that the first 1000 reviews are negative and the last 1000 reviews are positive.
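One way to check the class balance without indexing individual positions is to count the label prefix of each file id. The sketch below uses a short hypothetical list, `sample_ids`, in place of the full `fileids` list (only the first id comes from the corpus output above; the others are made up for illustration), and it assumes every id has the form `label/filename`.

```python
from collections import Counter

# Hypothetical stand-in for movie_reviews.fileids(); swap in the real list to count all reviews.
sample_ids = [
    'neg/cv000_29416.txt',  # taken from the corpus output above
    'neg/example_a.txt',    # made-up id
    'pos/example_b.txt',    # made-up id
    'pos/example_c.txt',    # made-up id
    'pos/example_d.txt',    # made-up id
]

# The label is the part of the id before the first '/'.
label_counts = Counter(fileid.split('/')[0] for fileid in sample_ids)
print(label_counts)
```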

#### movie_reviews.raw

Each review is stored as a string, which we access with the `raw` method. Like `categories`, it takes a file id as input:

In [11]:
review_1000 = movie_reviews.raw(fileids[1000])
review_1000[:200]

"films adapted from comic books have had plenty of success , whether they're about superheroes ( batman , superman , spawn ) , or geared toward kids ( casper ) or the arthouse crowd ( ghost world ) , b"

### `nltk` stopwords

Stopwords are common words that we would like to exclude from our analysis to focus on more meaningful words.

#### nltk.corpus.stopwords

Let's download the `stopwords` object:

In [12]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/Fall/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [13]:
from nltk.corpus import stopwords

In [22]:
stop = stopwords.words('english')

In [23]:
print(stop)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [24]:
len(stop)

179

Notice that all the words are lowercase and that the list contains no standalone punctuation marks (although contractions such as "you're" keep their apostrophe). We'll have to transform all the words in the movie reviews to lowercase and remove punctuation before we start our analysis.

### Using regular expression module `re` and `string` module to remove punctuation

The Python standard library has the [string](https://docs.python.org/3/library/string.html) module for working with strings. It includes the constant `punctuation`, a string containing the ASCII punctuation characters.

In [25]:
from string import punctuation

In [26]:
punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

We also have the regular expression module `re` to search for characters in a string. We won't get into [regular expressions](https://en.wikipedia.org/wiki/Regular_expression) here but we'll use the function `re.sub` to remove punctuation.

In [27]:
import re

For example, we can search for vowels in a string and replace them with `X`:

In [28]:
re.sub('[aeiouAEIOU]','X','Pen pineapple apple pen')

'PXn pXnXXpplX XpplX pXn'

Or we can replace all punctuation with `7`s. Notice that the regular expression is a string that starts with `[` and ends with `]`, which makes it a character class matching any single character listed between the brackets:

In [29]:
re.sub('[' + punctuation + ']','7',"What?! What?! I don't know what.")

'What77 What77 I don7t know what7'
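Pasting `punctuation` directly between `[` and `]` works for the examples here, but the string contains characters that are special inside a character class. In particular, the sequence `\]` is read as an escaped `]`, so the backslash itself does not end up in the set. A minimal sketch of a more defensive pattern, using `re.escape` from the standard library (the names `punct_pattern` and `with_backslash` are ours, not from the seminar):

```python
import re
from string import punctuation

# Escape every special character so each punctuation mark is matched literally.
punct_pattern = re.compile('[' + re.escape(punctuation) + ']')

text = "What?! What?! I don't know what."
print(punct_pattern.sub('7', text))

# A backslash survives the unescaped pattern but is removed by the escaped one.
with_backslash = 'path\\to'
unescaped = re.sub('[' + punctuation + ']', ' ', with_backslash)
escaped = punct_pattern.sub(' ', with_backslash)
print(repr(unescaped), repr(escaped))
```

For the lowercase, backslash-free movie reviews the two patterns should behave the same, so this matters mainly when the cleaning step is reused on other text.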

Now we are able to take a string (such as a raw movie review), remove its punctuation, convert it to lowercase and drop the stopwords:

In [30]:
no_punc_review_1000 = re.sub('[' + punctuation + ']',' ',review_1000)
clean_review_1000 = [word.lower() for word in no_punc_review_1000.split() if word.lower() not in stop]

In [31]:
print(clean_review_1000[:20])

['films', 'adapted', 'comic', 'books', 'plenty', 'success', 'whether', 'superheroes', 'batman', 'superman', 'spawn', 'geared', 'toward', 'kids', 'casper', 'arthouse', 'crowd', 'ghost', 'world', 'never']


Let's make this into a function we can use later:

In [32]:
def clean_review(review):
    # Replace each punctuation character with a space
    no_punc = re.sub('[' + punctuation + ']',' ',review)
    # Lowercase every word and drop the stopwords
    words = [word.lower() for word in no_punc.split() if word.lower() not in stop]
    return words

In [33]:
clean_review("Worst! Movie! Ever! But not as bad as 'Titanic'; I really didn't like that movie... But I saw it twice?!")

['worst',
 'movie',
 'ever',
 'bad',
 'titanic',
 'really',
 'like',
 'movie',
 'saw',
 'twice']

### Feature selection: Find top 2000 most frequent words excluding stopwords and punctuation

The `movie_reviews` object has a `words` method which returns the list of all tokens (words and punctuation marks) appearing in all movie reviews. We can use this to find the 2000 most common words.

In [34]:
len(movie_reviews.words())

1583820

In [35]:
movie_words = [word.lower() for word in movie_reviews.words()
               if word.lower() not in stop and word.lower() not in punctuation]

We can use the `Counter` class from the `collections` module to count the number of occurrences of each word and then pick the 2000 most common:

In [36]:
from collections import Counter

In [37]:
top_2000 = [item[0] for item in Counter(movie_words).most_common(2000)]

Notice that the `Counter` object is like a `dict` where the keys are the unique elements in the list and the values are the counts.

In [38]:
Counter(['a','a','b','c','c','c'])

Counter({'a': 2, 'b': 1, 'c': 3})
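The `most_common` method returns `(element, count)` pairs sorted by decreasing count, which is why the list comprehension above keeps only `item[0]`. A small sketch on the same toy list (the variable names are illustrative):

```python
from collections import Counter

toy_counts = Counter(['a', 'a', 'b', 'c', 'c', 'c'])

# The two most frequent elements, each paired with its count.
top_pairs = toy_counts.most_common(2)
print(top_pairs)  # [('c', 3), ('a', 2)]

# Keep only the elements, dropping the counts.
top_items = [item[0] for item in top_pairs]
print(top_items)  # ['c', 'a']
```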

In [39]:
print(top_2000[:20])

['film', 'one', 'movie', 'like', 'even', 'good', 'time', 'story', 'would', 'much', 'character', 'also', 'get', 'two', 'well', 'characters', 'first', '--', 'see', 'way']
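The token `--` appears in this list because `word.lower() not in punctuation` is a substring test on a string: a single character such as `,` is found in `punctuation`, but `--` is not a substring of it and therefore passes the filter. A sketch of a stricter check that drops any token made up entirely of punctuation characters (the helper name `is_punctuation` and the sample tokens are illustrative, not part of the seminar code):

```python
from string import punctuation

punct_chars = set(punctuation)

def is_punctuation(token):
    """Return True if the token is non-empty and every character is punctuation."""
    return len(token) > 0 and all(ch in punct_chars for ch in token)

tokens = ['film', '--', ',', 'well', '...', "don't"]
kept = [tok for tok in tokens if not is_punctuation(tok)]
print(kept)  # ['film', 'well', "don't"]
```

Since `word2vector` below replaces every non-letter with a space before matching, a feature such as `--` can never be found in a review, so it occupies one of the 2000 slots without contributing anything.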


In [40]:
import numpy as np

We are now ready to make a function which takes a fileid and returns a binary vector: entry `i` is 1 if the `i`-th of the 2000 most common words appears in the movie review and 0 otherwise. (The vector records presence, not the number of occurrences.)

In [41]:
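def word2vector(fileid):
    # Binary feature vector: entry i is 1 if top_2000[i] occurs in the review, else 0
    vec = np.zeros(2000)
    # Replace every non-letter with a space, then collect the words in a set for fast lookup
    review = set(re.sub("[^a-zA-Z]"," ", movie_reviews.raw(fileid)).split())
    for i in range(0,2000):
        if top_2000[i] in review:
            vec[i] = 1
    return vec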

In [42]:
movie_reviews.raw(fileids[0])[:100]

'plot : two teen couples go to a church party , drink and then drive . \nthey get into an accident . \n'

In [43]:
word2vector(fileids[0])

array([ 1.,  1.,  1., ...,  0.,  0.,  0.])
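To see what the entries mean, here is the same construction on a toy vocabulary, using plain lists instead of NumPy. The names `toy_vocab` and `toy_review` and their contents are made up for illustration.

```python
import re

toy_vocab = ['film', 'good', 'bad', 'great', 'plot']
toy_review = 'plot : good idea , bad film . bad ending .'

# Same tokenization as word2vector: non-letters become spaces, then split.
words = set(re.sub('[^a-zA-Z]', ' ', toy_review).split())

# 1 if the vocabulary word occurs at least once, 0 otherwise; repeats are not counted.
toy_vec = [1 if word in words else 0 for word in toy_vocab]
print(toy_vec)  # [1, 1, 1, 0, 1]
```

Note that `bad` occurs twice in the toy review but still contributes a single 1.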

The function takes a fileid and returns a feature vector of length 2000. Now let's apply this to the dataset to create an array where each row is the feature vector of length 2000 for that review.

In [44]:
n_files = len(fileids)
X = np.zeros((n_files,2000))
for i in range(0,n_files):
    X[i,:] = word2vector(fileids[i])

Create a vector of labels where 0 is for a negative review and 1 for a positive review.

In [45]:
y = [0 if movie_reviews.categories(fileid) == ['neg'] else 1 for fileid in fileids]

### Naive Bayes movie review classifier

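With our dataset in standard `sklearn` format, we can feed it into a Naive Bayes classifier. We hold out 20% of the reviews for testing; because the split is random, the score will vary slightly between runs unless `random_state` is set in `train_test_split`.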

In [46]:
from sklearn.model_selection import train_test_split

In [47]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

In [48]:
from sklearn.naive_bayes import BernoulliNB

In [49]:
clf = BernoulliNB()

In [50]:
clf.fit(X_train,y_train)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [51]:
clf.score(X_test,y_test)

0.8175
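For intuition about what the classifier computes, here is a minimal pure-Python sketch of Bernoulli Naive Bayes with Laplace smoothing (`alpha=1`, matching the default shown above). It is an illustration on a made-up four-review dataset, not scikit-learn's implementation; all names (`fit_bernoulli_nb`, `predict_bernoulli_nb`, `toy_X`, `toy_y`) are hypothetical.

```python
import math

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate the log prior and smoothed per-feature probabilities for each class."""
    n_features = len(X[0])
    model = {}
    for label in sorted(set(y)):
        rows = [x for x, target in zip(X, y) if target == label]
        # P(feature i present | class), smoothed so no probability is exactly 0 or 1.
        probs = [(sum(row[i] for row in rows) + alpha) / (len(rows) + 2 * alpha)
                 for i in range(n_features)]
        model[label] = (math.log(len(rows) / len(X)), probs)
    return model

def predict_bernoulli_nb(model, x):
    """Return the class with the largest log posterior for the binary vector x."""
    scores = {}
    for label, (log_prior, probs) in model.items():
        score = log_prior
        for present, p in zip(x, probs):
            # Absent words contribute too, through log(1 - p).
            score += math.log(p) if present else math.log(1 - p)
        scores[label] = score
    return max(scores, key=scores.get)

# Columns: 'good', 'bad', 'plot'. Labels: 0 = negative, 1 = positive.
toy_X = [[0, 1, 1], [0, 1, 0], [1, 0, 1], [1, 0, 0]]
toy_y = [0, 0, 1, 1]

model = fit_bernoulli_nb(toy_X, toy_y)
print(predict_bernoulli_nb(model, [1, 0, 1]))  # 1
print(predict_bernoulli_nb(model, [0, 1, 0]))  # 0
```

The Bernoulli model fits the binary presence/absence features built above: a word that is missing from a review counts as evidence, just as a word that is present does.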