# Movie Review with NLTK (Bi-gram Features)

N-grams are a common concept in text processing and analysis. An n-gram is a contiguous sequence of n words from a text. Common sizes are the unigram, bigram, and trigram.

+ Unigram = an item made of a single word, i.e. an n-gram of size 1. For example, "good".
+ Bigram = an item made of two words, i.e. an n-gram of size 2. For example, "very good".
+ Trigram = an item made of three words, i.e. an n-gram of size 3. For example, "not so good".

In the earlier bag-of-words model, we used only unigram features. In the example below, we use both unigram and bigram features, i.e. we work with both single words and pairs of adjacent words.
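To make the three n-gram sizes concrete, here is a minimal pure-Python sketch of the sliding window behind them. The helper name `simple_ngrams` is hypothetical and is only a stand-in for illustration; the notebook itself uses `nltk.ngrams`.

```python
def simple_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple (hypothetical helper)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the movie was not so good".split()

unigrams = simple_ngrams(tokens, 1)  # one tuple per word
bigrams = simple_ngrams(tokens, 2)   # pairs of adjacent words
trigrams = simple_ngrams(tokens, 3)  # triples of adjacent words

print(bigrams)
print(trigrams)
```

A text with fewer than n tokens yields an empty list, so very short inputs produce no n-gram features at that size.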

## Feature Extraction

In this case, both unigrams and bigrams are used as features.

We define two functions:

+ `bag_of_words`: extracts unigram features from the movie review words
+ `bag_of_ngrams`: extracts n-gram features (bigrams by default, `n=2`) from the movie review words

We then define another function:

+ `bag_of_all_words`: combines the unigram and bigram features

In [3]:
from nltk import ngrams
from nltk.corpus import stopwords 
import string
 
stopwords_english = stopwords.words('english')

# clean words, remove stopwords and punctuation
def clean_words(words, stopwords_english):
    words_clean = []
    
    for word in words:
        word = word.lower()
    
        if word not in stopwords_english and word not in string.punctuation:
            words_clean.append(word)    
    
    return words_clean


# feature extractor function for unigram
def bag_of_words(words):    
    
    words_dictionary = dict([word, True] for word in words)    
    
    return words_dictionary


# feature extractor function for n-grams (bigrams by default)
def bag_of_ngrams(words, n=2):
    # each n-gram is a tuple of n consecutive words; map every tuple to True
    words_dictionary = dict([ngram, True] for ngram in ngrams(words, n))
    
    return words_dictionary

#### Example of `bag_of_ngrams` function

In [4]:
from nltk.tokenize import word_tokenize

text = "It was a very good movie."
words = word_tokenize(text.lower())

# Test tokenization
print(words)

['it', 'was', 'a', 'very', 'good', 'movie', '.']


In [5]:
print(bag_of_ngrams(words))

{('it', 'was'): True, ('was', 'a'): True, ('a', 'very'): True, ('very', 'good'): True, ('good', 'movie'): True, ('movie', '.'): True}
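Because each bigram becomes a dictionary key with the value `True`, the extractor records whether a bigram occurs, not how often. The sketch below illustrates that on a made-up token list; it uses `zip` as a stand-in for `nltk.ngrams`, and the `Counter` alternative is shown only for comparison and is not part of the notebook.

```python
from collections import Counter

tokens = ["good", "movie", "really", "good", "movie"]

# Adjacent pairs, equivalent to bigrams for this illustration
pairs = list(zip(tokens, tokens[1:]))

# Presence features: same shape as the bag_of_ngrams output, duplicates collapse
presence = {pair: True for pair in pairs}

# Frequency alternative (an assumption for comparison, not used in the notebook)
counts = Counter(pairs)

print(len(pairs), len(presence))
print(counts[("good", "movie")])
```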


#### Example of `clean_words` function

In [7]:
words_clean = clean_words(words, stopwords_english)

print(words_clean)

['good', 'movie']


### Use `important_words` for Bi-grams

In [8]:
# Words to keep for bigrams even though they appear in the stopword list
important_words = ['above', 'below', 'off', 'over', 'under', 'more', 'most', 'such', 'no', 
                   'nor', 'not', 'only', 'so', 'than', 'too', 'very', 'just', 'but']

stopwords_english_for_bigrams = set(stopwords_english) - set(important_words)

words_clean_for_bigrams = clean_words(words, stopwords_english_for_bigrams)
print(words_clean_for_bigrams)

['very', 'good', 'movie']
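The reason for keeping these words is easiest to see with a negation. The sketch below uses a tiny, made-up stopword set (an assumption for illustration, not NLTK's real list) and hypothetical helper names to compare the bigrams produced with the full list and with the reduced list.

```python
# Toy stopword list, assumed for illustration only (not NLTK's real list)
toy_stopwords = {"the", "was", "a", "it", "not", "very"}
toy_important = {"not", "very"}


def drop_stopwords(tokens, stopword_set):
    """Keep only the tokens that are not in stopword_set."""
    return [t for t in tokens if t not in stopword_set]


def adjacent_pairs(tokens):
    """Return bigrams as tuples of neighbouring tokens."""
    return list(zip(tokens, tokens[1:]))


tokens = ["the", "movie", "was", "not", "good"]

strict = adjacent_pairs(drop_stopwords(tokens, toy_stopwords))
relaxed = adjacent_pairs(drop_stopwords(tokens, toy_stopwords - toy_important))

print(strict)   # [('movie', 'good')]
print(relaxed)  # [('movie', 'not'), ('not', 'good')]
```

With the full list the negation disappears and the review looks positive; with the reduced list the bigram `('not', 'good')` survives as a feature.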


#### Example of unigram features

In [9]:
# Unigrams use the general stopword list;
# bigrams use the reduced stopword list defined above
unigram_features = bag_of_words(words_clean)

print(unigram_features)

{'good': True, 'movie': True}


#### Example of bigram features

In [10]:
bigram_features = bag_of_ngrams(words_clean_for_bigrams)
print(bigram_features)

{('very', 'good'): True, ('good', 'movie'): True}


#### Example of combining unigrams and bigrams

In [11]:
# combine both unigram and bigram features
all_features = unigram_features.copy()
all_features.update(bigram_features)

print(all_features)

{'good': True, 'movie': True, ('very', 'good'): True, ('good', 'movie'): True}


### Main Function

Let's define a new function that extracts both unigram and bigram features.

In [13]:
def bag_of_all_words(words, n=2):
    words_clean = clean_words(words, stopwords_english)
    words_clean_for_bigrams = clean_words(words, stopwords_english_for_bigrams)
 
    unigram_features = bag_of_words(words_clean)
    bigram_features = bag_of_ngrams(words_clean_for_bigrams, n)
 
    all_features = unigram_features.copy()
    all_features.update(bigram_features)
 
    return all_features
 
print(bag_of_all_words(words))

{'good': True, 'movie': True, ('very', 'good'): True, ('good', 'movie'): True}


## Working with NLTK movie reviews dataset

In [14]:
from nltk.corpus import movie_reviews 
 
pos_reviews = []

for fileid in movie_reviews.fileids('pos'):
    words = movie_reviews.words(fileid)
    pos_reviews.append(words)
    
neg_reviews = []

for fileid in movie_reviews.fileids('neg'):
    words = movie_reviews.words(fileid)
    neg_reviews.append(words)

## Create Feature Set

In [15]:
# positive reviews feature set
pos_reviews_set = []

for words in pos_reviews:
    pos_reviews_set.append((bag_of_all_words(words), 'pos'))
    
# negative reviews feature set
neg_reviews_set = []

for words in neg_reviews:
    neg_reviews_set.append((bag_of_all_words(words), 'neg'))

In [16]:
print(pos_reviews_set[0])

({'films': True, 'adapted': True, 'comic': True, 'books': True, 'plenty': True, 'success': True, 'whether': True, 'superheroes': True, 'batman': True, 'superman': True, 'spawn': True, 'geared': True, 'toward': True, 'kids': True, 'casper': True, 'arthouse': True, 'crowd': True, 'ghost': True, 'world': True, 'never': True, 'really': True, 'book': True, 'like': True, 'hell': True, 'starters': True, 'created': True, 'alan': True, 'moore': True, 'eddie': True, 'campbell': True, 'brought': True, 'medium': True, 'whole': True, 'new': True, 'level': True, 'mid': True, '80s': True, '12': True, 'part': True, 'series': True, 'called': True, 'watchmen': True, 'say': True, 'thoroughly': True, 'researched': True, 'subject': True, 'jack': True, 'ripper': True, 'would': True, 'saying': True, 'michael': True, 'jackson': True, 'starting': True, 'look': True, 'little': True, 'odd': True, 'graphic': True, 'novel': True, '500': True, 'pages': True, 'long': True, 'includes': True, 'nearly': True, '30': Tru

In [17]:
print(neg_reviews_set[0])

({'plot': True, 'two': True, 'teen': True, 'couples': True, 'go': True, 'church': True, 'party': True, 'drink': True, 'drive': True, 'get': True, 'accident': True, 'one': True, 'guys': True, 'dies': True, 'girlfriend': True, 'continues': True, 'see': True, 'life': True, 'nightmares': True, 'deal': True, 'watch': True, 'movie': True, 'sorta': True, 'find': True, 'critique': True, 'mind': True, 'fuck': True, 'generation': True, 'touches': True, 'cool': True, 'idea': True, 'presents': True, 'bad': True, 'package': True, 'makes': True, 'review': True, 'even': True, 'harder': True, 'write': True, 'since': True, 'generally': True, 'applaud': True, 'films': True, 'attempt': True, 'break': True, 'mold': True, 'mess': True, 'head': True, 'lost': True, 'highway': True, 'memento': True, 'good': True, 'ways': True, 'making': True, 'types': True, 'folks': True, 'snag': True, 'correctly': True, 'seem': True, 'taken': True, 'pretty': True, 'neat': True, 'concept': True, 'executed': True, 'terribly': 

## Create Train and Test Set

There are 1000 positive and 1000 negative review feature sets. We take 20% (i.e. 200) of the positive reviews and 20% (i.e. 200) of the negative reviews as the test set. The remaining positive and negative reviews form the training set.

In [18]:
print(len(pos_reviews_set), len(neg_reviews_set)) # Output: 1000 1000
 
# randomize pos_reviews_set and neg_reviews_set
# doing so will give a different accuracy result every time we run the program
from random import shuffle 

shuffle(pos_reviews_set)
shuffle(neg_reviews_set)
 
test_set = pos_reviews_set[:200] + neg_reviews_set[:200]
train_set = pos_reviews_set[200:] + neg_reviews_set[200:]
 
print("Data testing: ", len(test_set))
print("Data training: ",len(train_set))

1000 1000
Data testing:  400
Data training:  1600
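If a repeatable split is preferred, one option is to shuffle with a seeded random generator. The sketch below is an assumption layered on the notebook's approach, not what the notebook does: the helper `split_by_class`, its parameters, and the placeholder data are all hypothetical.

```python
import random


def split_by_class(pos_items, neg_items, test_fraction=0.2, seed=42):
    """Shuffle each class with a fixed seed, then cut off the test portion."""
    rng = random.Random(seed)  # local generator, leaves the global state alone
    pos, neg = list(pos_items), list(neg_items)
    rng.shuffle(pos)
    rng.shuffle(neg)

    n_pos = int(len(pos) * test_fraction)
    n_neg = int(len(neg) * test_fraction)

    test = pos[:n_pos] + neg[:n_neg]
    train = pos[n_pos:] + neg[n_neg:]
    return train, test


# Placeholder data standing in for pos_reviews_set / neg_reviews_set
pos_demo = [({"id": i}, "pos") for i in range(1000)]
neg_demo = [({"id": i}, "neg") for i in range(1000)]

train_demo, test_demo = split_by_class(pos_demo, neg_demo)
print(len(train_demo), len(test_demo))
```

Taking the same fraction from each class keeps the test set balanced, as in the original cell, while the fixed seed makes the reported accuracy repeatable.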


## Training Classifier and Calculating Accuracy

We train a Naive Bayes classifier on the training set and measure its classification accuracy on the test set.

In [19]:
from nltk import classify
from nltk import NaiveBayesClassifier
 
classifier = NaiveBayesClassifier.train(train_set)
 
accuracy = classify.accuracy(classifier, test_set)
print(accuracy) # e.g. 0.7825 here; varies between runs because of the shuffle
 
# show_most_informative_features prints the table itself and returns None,
# so it is called directly instead of being wrapped in print()
classifier.show_most_informative_features(10)

0.7825
Most Informative Features
                 idiotic = True              neg : pos    =     21.0 : 1.0
             magnificent = True              pos : neg    =     19.0 : 1.0
                   sucks = True              neg : pos    =     13.7 : 1.0
               ludicrous = True              neg : pos    =     12.6 : 1.0
                chilling = True              pos : neg    =     12.3 : 1.0
                  seagal = True              neg : pos    =     12.3 : 1.0
        ('one', 'worst') = True              neg : pos    =     12.2 : 1.0
    ('steven', 'seagal') = True              neg : pos    =     11.7 : 1.0
       ('quite', 'well') = True              pos : neg    =     11.7 : 1.0
     ('saving', 'grace') = True              neg : pos    =     11.7 : 1.0


**Note:**

+ Training on the combined feature set (unigram + bigram) noticeably improves the accuracy of the classifier.
+ Accuracy was 73% when using only unigram features.
+ Accuracy rises to roughly 78-80% with the combined (unigram + bigram) features; the exact value changes between runs because the data is shuffled.
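The accuracy figures above are the fraction of test reviews whose predicted label matches the true label. A minimal sketch of that calculation follows; `simple_accuracy` and `KeywordClassifier` are hypothetical names, and the keyword rule is only a stand-in so the example runs without a trained Naive Bayes model.

```python
def simple_accuracy(classifier, labeled_featuresets):
    """Fraction of (features, label) pairs that the classifier labels correctly."""
    correct = sum(
        1
        for features, label in labeled_featuresets
        if classifier.classify(features) == label
    )
    return correct / len(labeled_featuresets)


class KeywordClassifier:
    """Stand-in classifier for the demo, not a Naive Bayes model."""

    def classify(self, features):
        return "neg" if "idiotic" in features else "pos"


demo_test_set = [
    ({"idiotic": True, "plot": True}, "neg"),
    ({"magnificent": True}, "pos"),
    ({"boring": True}, "neg"),   # misclassified by the keyword rule
    ({"chilling": True}, "pos"),
]

print(simple_accuracy(KeywordClassifier(), demo_test_set))  # 0.75
```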

## Testing Classifier with Custom Review

We pass custom review text to the trained classifier and check its output. The classifier labels both reviews below correctly: the first as negative and the second as positive.

### Test with custom review 1

In [20]:
from nltk.tokenize import word_tokenize

custom_review = "Might as well watch the cut scenes from a video game, the cgi was poorly implemented. The villain absolutely pointless. Superman's Resurrection wasted, why not give him a few movies to find a villain worthy and build suspense? Each of the characters in twisted to ape the Avengers in their roles, especially Flash/Spidey but with non of the depth. Truly a pointless exercise and a wasted opportunity, spare yourself the disappointment. How it is an 8+ on here discredits IMDb. Such a shame. :("
custom_review_tokens = word_tokenize(custom_review)

### Classify custom review 1

In [21]:
custom_review_set = bag_of_all_words(custom_review_tokens)
print("Classified as: ", classifier.classify(custom_review_set))
 
# probability result
prob_result = classifier.prob_classify(custom_review_set)

print("Probability distribution: ", prob_result)
print("Classification category: ", prob_result.max())
print("Negative probability : ", prob_result.prob("neg"))
print("Positive probability : ", prob_result.prob("pos"))

Classified as:  neg
Probability distribution:  <ProbDist with 2 samples>
Classification category:  neg
Negative probability :  0.9997604582337178
Positive probability :  0.00023954176628309255


### Test with custom review 2

In [22]:
custom_review = "I have never seen such an amazing film since I saw The Shawshank Redemption. Shawshank encompasses friendships, hardships, hopes, and dreams. And what is so great about the movie is that it moves you, it gives you hope. Even though the circumstances between the characters and the viewers are quite different, you don't feel that far removed from what the characters are going through."
custom_review_tokens = word_tokenize(custom_review)

### Classify custom review 2

In [23]:
custom_review_set = bag_of_all_words(custom_review_tokens)
print("Classified as: ", classifier.classify(custom_review_set))
 
# probability result
prob_result = classifier.prob_classify(custom_review_set)

print("Probability distribution: ", prob_result)
print("Classification category: ", prob_result.max())
print("Negative probability : ", prob_result.prob("neg"))
print("Positive probability : ", prob_result.prob("pos"))

Classified as:  pos
Probability distribution:  <ProbDist with 2 samples>
Classification category:  pos
Negative probability :  6.385517857581553e-09
Positive probability :  0.9999999936144807
