# Movie Review with NLTK (Bag of Words Feature)

**Top-N words feature**
+ The top-N words feature is also a bag-of-words feature.
+ However, it uses only the 2000 most frequent words in the feature set.
+ The positive and negative reviews are combined into a single list, the list is shuffled, and then it is split into train and test sets.
+ This approach can result in an uneven distribution of positive and negative reviews across the train and test sets.
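
The top-N code itself is not shown in this notebook, so the following is only a minimal sketch of the idea on a hypothetical toy corpus: count word frequencies across all documents, keep the N most frequent words, and record for each document whether it contains each of them. The helper names `top_n_words` and `top_n_features` and the `contains(...)` key format are illustrative assumptions.

```python
from collections import Counter


def top_n_words(documents, n):
    """Return the n most frequent words across all tokenized documents."""
    counts = Counter(word.lower() for doc in documents for word in doc)
    return [word for word, _ in counts.most_common(n)]


def top_n_features(document, vocabulary):
    """Map every vocabulary word to True/False depending on its presence in the document."""
    present = {word.lower() for word in document}
    return {"contains(%s)" % word: word in present for word in vocabulary}


# Toy corpus (hypothetical): 'good' occurs 3 times, 'film' twice, everything else once.
toy_docs = [
    ["good", "film", "good", "cast"],
    ["bad", "film", "plot"],
    ["good", "story"],
]

vocabulary = top_n_words(toy_docs, 2)
features = top_n_features(toy_docs[1], vocabulary)
print(vocabulary)
print(features)
```

Unlike the bag-of-words dictionary used later in this notebook, every document here gets the same fixed set of keys, with `False` for words it does not contain.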

**Bag-of-words feature**
+ Uses all the useful words of each review when creating the feature set.
+ Takes a fixed number of positive and negative reviews for the train and test sets.
+ This results in an equal distribution of positive and negative reviews across the train and test sets.
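
To make the difference between the two splitting strategies concrete, here is a small sketch that uses placeholder labels instead of real reviews. The helper names `shuffled_split`, `per_class_split`, and `count_pos` are assumptions introduced for illustration only.

```python
import random


def shuffled_split(pos_items, neg_items, test_size, rng):
    """Combine both classes, shuffle, and cut: the class balance of the test set is left to chance."""
    combined = pos_items + neg_items  # new list, so the inputs are not modified
    rng.shuffle(combined)
    return combined[test_size:], combined[:test_size]


def per_class_split(pos_items, neg_items, test_per_class):
    """Hold out a fixed number of items per class: the test set is balanced by construction."""
    train = pos_items[test_per_class:] + neg_items[test_per_class:]
    test = pos_items[:test_per_class] + neg_items[:test_per_class]
    return train, test


def count_pos(items):
    """Count the items labelled 'pos'."""
    return sum(1 for _, label in items if label == "pos")


# Placeholder data standing in for the 1000 positive and 1000 negative reviews.
pos_items = [("review", "pos")] * 1000
neg_items = [("review", "neg")] * 1000

rng = random.Random(0)  # fixed seed only so that the sketch is repeatable
_, mixed_test = shuffled_split(pos_items, neg_items, 400, rng)
_, balanced_test = per_class_split(pos_items, neg_items, 200)

print("shuffled split, positives in test set: ", count_pos(mixed_test))
print("per-class split, positives in test set:", count_pos(balanced_test))
```

The per-class split always yields exactly 200 positive test items, while the shuffled split only does so by coincidence.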

## Import movie reviews data

In [16]:
from nltk.corpus import movie_reviews 

In [17]:
pos_reviews = []

for fileid in movie_reviews.fileids('pos'):
    words = movie_reviews.words(fileid)
    pos_reviews.append(words)
    
print(pos_reviews[0])

['films', 'adapted', 'from', 'comic', 'books', 'have', ...]


In [18]:
neg_reviews = []

for fileid in movie_reviews.fileids('neg'):
    words = movie_reviews.words(fileid)
    neg_reviews.append(words)
    
print(neg_reviews[0])

['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...]


In [19]:
print(pos_reviews[0][:20])

['films', 'adapted', 'from', 'comic', 'books', 'have', 'had', 'plenty', 'of', 'success', ',', 'whether', 'they', "'", 're', 'about', 'superheroes', '(', 'batman', ',']


In [20]:
print(neg_reviews[0][:20])

['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', ',', 'drink', 'and', 'then', 'drive', '.', 'they', 'get', 'into', 'an']


## Feature Extraction

We use the bag-of-words feature. First, we clean the word list by lowercasing each word and removing stop words and punctuation. Then, we create a dictionary that maps each cleaned word to True.

In [21]:
from nltk.corpus import stopwords 
import string

In [22]:
stopwords_english = stopwords.words('english')

In [23]:
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [24]:
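def bag_of_words(words):
    words_clean = []

    for word in words:
        word = word.lower()
        # keep the word only if it is neither a stop word nor punctuation
        if word not in stopwords_english and word not in string.punctuation:
            words_clean.append(word)

    # map each remaining word to True; duplicate words collapse into one key
    words_dictionary = {word: True for word in words_clean}

    return words_dictionary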

In [25]:
# Example:
# building a dict removes duplicate words from the word list
# note the output: the stop word 'the' is also removed
print(bag_of_words(['the', 'the', 'good', 'bad', 'the', 'good']))

{'good': True, 'bad': True}
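
One detail worth knowing: `string.punctuation` is a single string, so `word not in string.punctuation` is a substring test rather than a per-token punctuation test. Multi-character tokens such as `...` or `--` are not substrings of it and are therefore kept as features. The variant below is a sketch, not the method used in this notebook. It uses a tiny hypothetical stop word set in place of NLTK's list so that it runs without any corpus download, and the names `is_punctuation` and `bag_of_words_strict` are assumptions.

```python
import string

PUNCTUATION = set(string.punctuation)
# Hypothetical stand-in for stopwords.words('english'); a set also gives fast lookups.
STOPWORDS = {"the", "a", "an", "and", "of", "was"}


def is_punctuation(token):
    """True when the token is non-empty and consists only of punctuation characters."""
    return bool(token) and all(ch in PUNCTUATION for ch in token)


def bag_of_words_strict(words):
    """Variant of bag_of_words that also drops multi-character punctuation tokens."""
    return {
        word.lower(): True
        for word in words
        if word.lower() not in STOPWORDS and not is_punctuation(word)
    }


tokens = ["The", "plot", "was", "thin", "...", "--", "!"]
kept_by_substring_test = [t for t in tokens if t not in string.punctuation]
print(kept_by_substring_test)       # tokens that survive the original punctuation check
print(bag_of_words_strict(tokens))  # tokens that survive the stricter check
```

Switching to the stricter filter would change the feature sets, so the accuracy and probabilities reported in this notebook would no longer apply exactly.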


## Create Feature Set

We use the bag-of-words feature and tag each review with its respective category as positive or negative.

In [26]:
pos_reviews_set = []

for words in pos_reviews:
    pos_reviews_set.append((bag_of_words(words), 'pos'))
    
neg_reviews_set = []

for words in neg_reviews:
    neg_reviews_set.append((bag_of_words(words), 'neg'))

In [27]:
# print the first item of pos_reviews_set (feature dictionary and label)
print(pos_reviews_set[0])

({'films': True, 'adapted': True, 'comic': True, 'books': True, 'plenty': True, 'success': True, 'whether': True, 'superheroes': True, 'batman': True, 'superman': True, 'spawn': True, 'geared': True, 'toward': True, 'kids': True, 'casper': True, 'arthouse': True, 'crowd': True, 'ghost': True, 'world': True, 'never': True, 'really': True, 'book': True, 'like': True, 'hell': True, 'starters': True, 'created': True, 'alan': True, 'moore': True, 'eddie': True, 'campbell': True, 'brought': True, 'medium': True, 'whole': True, 'new': True, 'level': True, 'mid': True, '80s': True, '12': True, 'part': True, 'series': True, 'called': True, 'watchmen': True, 'say': True, 'thoroughly': True, 'researched': True, 'subject': True, 'jack': True, 'ripper': True, 'would': True, 'saying': True, 'michael': True, 'jackson': True, 'starting': True, 'look': True, 'little': True, 'odd': True, 'graphic': True, 'novel': True, '500': True, 'pages': True, 'long': True, 'includes': True, 'nearly': True, '30': Tru

In [28]:
# print the first item of neg_reviews_set (feature dictionary and label)
print(neg_reviews_set[0])

({'plot': True, 'two': True, 'teen': True, 'couples': True, 'go': True, 'church': True, 'party': True, 'drink': True, 'drive': True, 'get': True, 'accident': True, 'one': True, 'guys': True, 'dies': True, 'girlfriend': True, 'continues': True, 'see': True, 'life': True, 'nightmares': True, 'deal': True, 'watch': True, 'movie': True, 'sorta': True, 'find': True, 'critique': True, 'mind': True, 'fuck': True, 'generation': True, 'touches': True, 'cool': True, 'idea': True, 'presents': True, 'bad': True, 'package': True, 'makes': True, 'review': True, 'even': True, 'harder': True, 'write': True, 'since': True, 'generally': True, 'applaud': True, 'films': True, 'attempt': True, 'break': True, 'mold': True, 'mess': True, 'head': True, 'lost': True, 'highway': True, 'memento': True, 'good': True, 'ways': True, 'making': True, 'types': True, 'folks': True, 'snag': True, 'correctly': True, 'seem': True, 'taken': True, 'pretty': True, 'neat': True, 'concept': True, 'executed': True, 'terribly': 

## Create Train and Test Set

There are 1000 positive and 1000 negative reviews. We take 20% of the positive reviews and 20% of the negative reviews as the test set. The remaining positive and negative reviews form the training set.

+ Note the difference between the pos_reviews and pos_reviews_set lists defined above.
+ pos_reviews contains only the word list of each review.
+ pos_reviews_set contains a (feature dictionary, label) pair for each review.
+ pos_reviews_set and neg_reviews_set are used to create the train and test sets as shown below.

In [29]:
print("Length of Pos Reviews: ", len(pos_reviews_set))
print("Length of Neg Reviews: ", len(neg_reviews_set))
 
# randomize pos_reviews_set and neg_reviews_set
# doing so gives a different accuracy result every time we run the program
from random import shuffle

shuffle(pos_reviews_set)
shuffle(neg_reviews_set)

train_set = pos_reviews_set[200:] + neg_reviews_set[200:]
test_set = pos_reviews_set[:200] + neg_reviews_set[:200]
 
print("Length of data testing", len(test_set)) 
print("Length of data training",len(train_set))

Length of Pos Reviews:  1000
Length of Neg Reviews:  1000
Length of data testing 400
Length of data training 1600
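
Because the lists are shuffled without a seed, the split (and therefore the accuracy) changes from run to run. If repeatable numbers are wanted, one option is a dedicated seeded generator. The following is a sketch with placeholder data. The seed value 42 and the helper name `split_reviews` are arbitrary assumptions, not part of the original notebook.

```python
import random


def split_reviews(pos_set, neg_set, test_fraction=0.2, seed=42):
    """Shuffle each class with a seeded generator, then hold out the same fraction of each."""
    rng = random.Random(seed)
    # sample(..., len(...)) returns shuffled copies, so the input lists stay untouched
    pos_shuffled = rng.sample(pos_set, len(pos_set))
    neg_shuffled = rng.sample(neg_set, len(neg_set))
    n_pos_test = round(len(pos_shuffled) * test_fraction)
    n_neg_test = round(len(neg_shuffled) * test_fraction)
    train = pos_shuffled[n_pos_test:] + neg_shuffled[n_neg_test:]
    test = pos_shuffled[:n_pos_test] + neg_shuffled[:n_neg_test]
    return train, test


# Placeholder feature sets standing in for pos_reviews_set and neg_reviews_set.
pos_set = [({"word%d" % i: True}, "pos") for i in range(1000)]
neg_set = [({"word%d" % i: True}, "neg") for i in range(1000)]

train_set, test_set = split_reviews(pos_set, neg_set)
print("Length of data testing", len(test_set))
print("Length of data training", len(train_set))
```

Calling `split_reviews` again with the same seed returns the same split, which makes accuracy figures comparable between runs.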


## Training Classifier and Calculating Accuracy

We train a Naive Bayes classifier on the training set and then measure its classification accuracy on the test set.

In [30]:
from nltk import classify
from nltk import NaiveBayesClassifier
 
classifier = NaiveBayesClassifier.train(train_set)
 
accuracy = classify.accuracy(classifier, test_set)
print("The accuracy of classification: ", accuracy)
 
# show_most_informative_features prints the table itself and returns None,
# so it is called directly instead of being wrapped in print()
classifier.show_most_informative_features(10)

The accuracy of classification:  0.7175
Most Informative Features
                   sucks = True              neg : pos    =     14.3 : 1.0
                   lousy = True              neg : pos    =     13.0 : 1.0
               ludicrous = True              neg : pos    =     11.8 : 1.0
               stupidity = True              neg : pos    =     11.4 : 1.0
            breathtaking = True              pos : neg    =     10.6 : 1.0
               strongest = True              pos : neg    =     10.3 : 1.0
             outstanding = True              pos : neg    =     10.1 : 1.0
               atrocious = True              neg : pos    =      9.7 : 1.0
                  hatred = True              pos : neg    =      9.7 : 1.0
               illogical = True              neg : pos    =      9.0 : 1.0
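
Each row of the table is a likelihood ratio: how much more likely the feature is in reviews of one class than in the other. The sketch below shows the arithmetic with hypothetical counts that are not taken from the corpus. It assumes add-0.5 smoothing over two feature values (present or absent); this is believed to resemble NLTK's default estimator but is stated here as an assumption, and the helper names are illustrative.

```python
def smoothed_prob(count, total, gamma=0.5, bins=2):
    """Smoothed estimate of P(feature present | label); gamma and bins are assumptions."""
    return (count + gamma) / (total + gamma * bins)


def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Ratio of P(feature | label a) to P(feature | label b)."""
    return smoothed_prob(count_a, total_a) / smoothed_prob(count_b, total_b)


# Hypothetical counts: a word seen in 28 of 800 negative
# and in 2 of 800 positive training reviews.
ratio = likelihood_ratio(28, 800, 2, 800)
print("neg : pos = %.1f : 1.0" % ratio)
```

A large ratio can rest on very few occurrences in the rarer class, which is why some entries in the table may look surprising.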


### Test with custom review 1

In [45]:
from nltk.tokenize import word_tokenize
 
custom_review = "Might as well watch the cut scenes from a video game, the cgi was poorly implemented. The villain absolutely pointless. Superman's Resurrection wasted, why not give him a few movies to find a villain worthy and build suspense? Each of the characters in twisted to ape the Avengers in their roles, especially Flash/Spidey but with non of the depth. Truly a pointless exercise and a wasted opportunity, spare yourself the disappointment. How it is an 8+ on here discredits IMDb. Such a shame. :("
custom_review_tokens = word_tokenize(custom_review)

### Classify custom review 1

In [46]:
custom_review_set = bag_of_words(custom_review_tokens)
print("Classified as: ", classifier.classify(custom_review_set))

# probability distribution over the two labels
prob_result = classifier.prob_classify(custom_review_set)

print("Probability distribution: ", prob_result)
print("Classification category: ", prob_result.max())
print("Negative probability : ", prob_result.prob("neg"))
print("Positive probability : ", prob_result.prob("pos"))

Classified as:  neg
Probability distribution:  <ProbDist with 2 samples>
Classification category:  neg
Negative probability :  0.9998317014009249
Positive probability :  0.0001682985990849785
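
The probabilities are very close to 0 and 1 because Naive Bayes multiplies one likelihood per feature, so many small pieces of evidence compound. Below is a minimal sketch of that calculation, using made-up per-word likelihoods rather than the trained model's values. The function name `posterior` and all numbers are hypothetical.

```python
import math


def posterior(log_priors, log_likelihoods, features):
    """Add the log prior and per-feature log-likelihoods, then normalize to probabilities."""
    scores = {}
    for label, log_prior in log_priors.items():
        scores[label] = log_prior + sum(
            log_likelihoods[label][f] for f in features if f in log_likelihoods[label]
        )
    # Log-sum-exp normalization keeps the arithmetic stable for long reviews.
    top = max(scores.values())
    norm = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return {label: math.exp(s - norm) for label, s in scores.items()}


# Made-up values for illustration only.
log_priors = {"pos": math.log(0.5), "neg": math.log(0.5)}
log_likelihoods = {
    "pos": {"pointless": math.log(0.01), "wasted": math.log(0.02), "shame": math.log(0.02)},
    "neg": {"pointless": math.log(0.06), "wasted": math.log(0.08), "shame": math.log(0.04)},
}

probs = posterior(log_priors, log_likelihoods, ["pointless", "wasted", "shame"])
print(probs)
```

With only three words the negative class already gets about 98% of the probability mass in this toy setup; a real review contributes many more features, which pushes the result further toward 0 or 1.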


### Test with custom review 2

In [47]:
custom_review = "I have never seen such an amazing film since I saw The Shawshank Redemption. Shawshank encompasses friendships, hardships, hopes, and dreams. And what is so great about the movie is that it moves you, it gives you hope. Even though the circumstances between the characters and the viewers are quite different, you don't feel that far removed from what the characters are going through."
custom_review_tokens = word_tokenize(custom_review)

### Classify custom review 2

In [48]:
custom_review_set = bag_of_words(custom_review_tokens)
print("Classified as: ", classifier.classify(custom_review_set))

# probability distribution over the two labels
prob_result = classifier.prob_classify(custom_review_set)

print("Probability distribution: ", prob_result)
print("Classification category: ", prob_result.max())
print("Negative probability : ", prob_result.prob("neg"))
print("Positive probability : ", prob_result.prob("pos"))

Classified as:  pos
Probability distribution:  <ProbDist with 2 samples>
Classification category:  pos
Negative probability :  0.001673700190107909
Positive probability :  0.9983262998098886
