Name: Sathish Kumar Rajendiran
Chapter :  Sentiment Classifications in the NLTK
Date: 11/22/2020
Week: 8

In [1]:
#import libraries

# standard library
import os
import sys
from datetime import datetime
import time

# csv, xls, pandas & json
import pandas as pd
import json
import csv
import xlrd

# Language Processing
import nltk
from nltk import FreqDist
# import wordnet and shorten its name to wn
from nltk.corpus import wordnet as wn

# import sentence_polarity corpus 
from nltk.corpus import sentence_polarity
import random

# web requests
from urllib import request


##  Regular Expression to match non-alphabetic characters
import re

os.getcwd()

'/Users/sathishrajendiran/ist664-nlp/Async'

In [2]:
ls -ltr

total 1344
-rw-r--r--@ 1 sathishrajendiran  staff  662534 Mar  7  2011 subjclueslen1-HLTEMNLP05.tff
-rw-r--r--@ 1 sathishrajendiran  staff    1999 Nov 22 23:42 subjectivity.py
-rw-r--r--  1 sathishrajendiran  staff   16534 Nov 23 00:09 Week8_Lab8.6.ipynb


# Movie Review Corpus Sentences with BOW (Corpus)

## Sentiment/Opinion Classification (using the Movie Review corpus sentences)

In today’s lab, we will look at two ways to add features that are sometimes used in various sentiment or opinion classification problems.  In addition to providing a corpus of the 2000 positive and negative movie review documents, Pang and Lee had a subset of the sentences of the corpus annotated for sentiment in each sentence.  We will illustrate the process of sentiment classification on this corpus of sentences with positive or negative sentiment labels.

We start by loading the sentence_polarity corpus and creating a list of documents where each document represents a single sentence with the words and its label. 


In [54]:
sentences = sentence_polarity.sents()
print(len(sentences))

10662


In [45]:
print(type(sentences))

<class 'nltk.corpus.reader.util.ConcatenatedCorpusView'>


In [46]:
print(sentence_polarity.categories())

['neg', 'pos']


In [48]:
# sentences are already tokenized, show the first four sentences
for sent in sentences[:10]:
    print(sent)

['simplistic', ',', 'silly', 'and', 'tedious', '.']
["it's", 'so', 'laddish', 'and', 'juvenile', ',', 'only', 'teenage', 'boys', 'could', 'possibly', 'find', 'it', 'funny', '.']
['exploitative', 'and', 'largely', 'devoid', 'of', 'the', 'depth', 'or', 'sophistication', 'that', 'would', 'make', 'watching', 'such', 'a', 'graphic', 'treatment', 'of', 'the', 'crimes', 'bearable', '.']
['[garbus]', 'discards', 'the', 'potential', 'for', 'pathological', 'study', ',', 'exhuming', 'instead', ',', 'the', 'skewed', 'melodrama', 'of', 'the', 'circumstantial', 'situation', '.']
['a', 'visually', 'flashy', 'but', 'narratively', 'opaque', 'and', 'emotionally', 'vapid', 'exercise', 'in', 'style', 'and', 'mystification', '.']
['the', 'story', 'is', 'also', 'as', 'unoriginal', 'as', 'they', 'come', ',', 'already', 'having', 'been', 'recycled', 'more', 'times', 'than', "i'd", 'care', 'to', 'count', '.']
['about', 'the', 'only', 'thing', 'to', 'give', 'the', 'movie', 'points', 'for', 'is', 'bravado', '--'

In [49]:
pos_sents = sentence_polarity.sents(categories='pos')
print(len(pos_sents))

5331


In [50]:
neg_sents = sentence_polarity.sents(categories='neg')
print(len(neg_sents))

5331


The movie review sentences are not labeled individually, but can be retrieved by category.  Look at the sentences by category to see how many positive and negative sentences there are.

In [51]:
documents = [(sent, cat) for cat in sentence_polarity.categories() 
    for sent in sentence_polarity.sents(categories=cat)]
print(documents[0])
print(documents[-1])

(['simplistic', ',', 'silly', 'and', 'tedious', '.'], 'neg')
(['provides', 'a', 'porthole', 'into', 'that', 'noble', ',', 'trembling', 'incoherence', 'that', 'defines', 'us', 'all', '.'], 'pos')


Since the documents are in order by label, we mix them up for later separation into training and test sets.

In [52]:
random.shuffle(documents)

We need to define the set of words that will be used for features.  This is essentially all the words in the entire document collection, except that we will limit it to the 2000 most frequent words.  Note that we lowercase the words, but do not do stemming or remove stopwords.

In [53]:
all_words_list = [word for (sent,cat) in documents for word in sent]
all_words = nltk.FreqDist(all_words_list)
word_items = all_words.most_common(2000)
word_features = [word for (word, freq) in word_items]
# look at the first 50 words in the most frequent list of words
print(word_features[:50])

['.', 'the', ',', 'a', 'and', 'of', 'to', 'is', 'in', 'that', 'it', 'as', 'but', 'with', 'film', 'this', 'for', 'its', 'an', 'movie', "it's", 'be', 'on', 'you', 'not', 'by', 'about', 'one', 'more', 'like', 'has', 'are', 'at', 'from', 'than', '"', 'all', '--', 'his', 'have', 'so', 'if', 'or', 'story', 'i', 'too', 'just', 'who', 'into', 'what']


Now we can define the features for each document, using just the words, sometimes called the BOW or unigram features.  The feature label will be ‘V_keyword’ for each keyword (aka word) in the word_features set, and the value of the feature will be Boolean, according to whether the word is contained in that document.

In [12]:
def document_features(document, word_features):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['V_{}'.format(word)] = (word in document_words)
    return features

Define the feature sets for the documents. 

In [13]:
featuresets = [(document_features(d,word_features), c) for (d,c) in documents]

In [67]:
print(len(featuresets)) # total number of words in the featuresets
print(featuresets[:10]) # top 5 features
# print(featuresets[5:]) # bottom 5 features)

10662
[({'V_.': True, 'V_the': True, 'V_,': True, 'V_a': True, 'V_and': True, 'V_of': True, 'V_to': True, 'V_is': True, 'V_in': True, 'V_that': False, 'V_it': False, 'V_as': True, 'V_but': False, 'V_with': False, 'V_film': False, 'V_this': False, 'V_for': False, 'V_its': False, 'V_an': False, 'V_movie': False, "V_it's": False, 'V_be': False, 'V_on': False, 'V_you': False, 'V_not': False, 'V_by': False, 'V_about': False, 'V_more': False, 'V_one': False, 'V_like': False, 'V_has': False, 'V_are': False, 'V_at': False, 'V_from': False, 'V_than': False, 'V_"': False, 'V_all': False, 'V_--': False, 'V_his': False, 'V_have': False, 'V_so': False, 'V_if': False, 'V_or': False, 'V_story': False, 'V_i': False, 'V_too': False, 'V_just': False, 'V_who': False, 'V_into': False, 'V_what': False, 'V_most': False, 'V_out': False, 'V_no': False, 'V_much': False, 'V_even': False, 'V_good': False, 'V_up': False, 'V_will': False, 'V_comedy': False, 'V_time': False, 'V_can': False, 'V_some': True, 'V_chara

We create the training and test sets, train a Naïve Bayes classifier, and look at the accuracy, and this time we’ll do a 90/10 split of our approximately 10,000 documents.

In [70]:
train_set, test_set = featuresets[1000:], featuresets[:1000]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print (nltk.classify.accuracy(classifier, test_set))

0.74


The function show_most_informative_features shows the top ranked features according to the ratio of one label to the other one.  For example, if there are 20 times as many positive documents containing this word as negative ones, then the ratio will be reported as     20.00: 1.00   pos:neg.

In [15]:
classifier.show_most_informative_features(30)

Most Informative Features
            V_engrossing = True              pos : neg    =     20.9 : 1.0
               V_routine = True              neg : pos    =     16.4 : 1.0
               V_generic = True              neg : pos    =     15.1 : 1.0
            V_unexpected = True              pos : neg    =     14.9 : 1.0
             V_inventive = True              pos : neg    =     14.2 : 1.0
                  V_flat = True              neg : pos    =     13.1 : 1.0
              V_supposed = True              neg : pos    =     13.1 : 1.0
            V_refreshing = True              pos : neg    =     12.9 : 1.0
          V_refreshingly = True              pos : neg    =     12.9 : 1.0
                V_boring = True              neg : pos    =     12.8 : 1.0
                  V_warm = True              pos : neg    =     12.1 : 1.0
             V_wonderful = True              pos : neg    =     12.1 : 1.0
                  V_dull = True              neg : pos    =     11.6 : 1.0

# Sentiment Lexicon:  Subjectivity Count features

In [71]:
# this function returns a dictionary where you can look up words and get back 
#     the four items of subjectivity information described above
def readSubjectivity(path):
    flexicon = open(path, 'r')
    # initialize an empty dictionary
    sldict = { }
    for line in flexicon:
        fields = line.split()   # default is to split on whitespace
        # split each field on the '=' and keep the second part as the value
        strength = fields[0].split("=")[1]
        word = fields[2].split("=")[1]
        posTag = fields[3].split("=")[1]
        stemmed = fields[4].split("=")[1]
        polarity = fields[5].split("=")[1]
        if (stemmed == 'y'):
            isStemmed = True
        else:
            isStemmed = False
        # put a dictionary entry with the word as the keyword
        #     and a list of the other values
        sldict[word] = [strength, posTag, isStemmed, polarity]
    return sldict

Now run the function that reads the file.  It creates a Subjectivity Lexicon that is represented here as a dictionary, where each word is mapped to a list containing the strength, POStag, whether it is stemmed and the polarity

In [72]:
SLpath = "/Users/sathishrajendiran/ist664-nlp/Async/subjclueslen1-HLTEMNLP05.tff"
SL = readSubjectivity(SLpath )

Now the variable SL (for Subjectivity Lexicon) is a dictionary where you can look up words and find the strength, POS tag, whether it is stemmed and polarity.  We can try out some words.

In [73]:
SL['absolute']

['strongsubj', 'adj', False, 'neutral']

In [74]:
SL['shabby']

['strongsubj', 'adj', False, 'negative']

In [75]:
strength, posTag, isStemmed, polarity = SL['absolute']

Now we create a feature extraction function that has all the word features as before, but also has two features ‘positivecount’ and ‘negativecount’.  These features contains counts of all the positive and negative subjectivity words, where each weakly subjective word is counted once and each strongly subjective word is counted twice.  Note that this is only one of the ways in which people count up the presence of positive, negative and neutral words in a document.

In [21]:
def SL_features(document, SL):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
        # count variables for the 4 classes of subjectivity
        weakPos = 0
        strongPos = 0
        weakNeg = 0
        strongNeg = 0
        for word in document_words:
            if word in SL:
                strength, posTag, isStemmed, polarity = SL[word]
                if strength == 'weaksubj' and polarity == 'positive':
                    weakPos += 1
                if strength == 'strongsubj' and polarity == 'positive':
                    strongPos += 1
                if strength == 'weaksubj' and polarity == 'negative':
                    weakNeg += 1
                if strength == 'strongsubj' and polarity == 'negative':
                    strongNeg += 1
                features['positivecount'] = weakPos + (2 * strongPos)
                features['negativecount'] = weakNeg + (2 * strongNeg)      
    return features


Now we create feature sets as before, but using this feature extraction function.

In [22]:
SL_featuresets = [(SL_features(d, SL), c) for (d,c) in documents]

In [76]:
SL_featuresets[0][0]['positivecount']

9

In [77]:
SL_featuresets[0][0]['negativecount']

3

In [78]:
SL_featuresets[0][1]

'pos'

In [80]:
len(SL_featuresets)

10662

We create the training and test sets, train a Naïve Bayes classifier, and look at the accuracy, and this time we’ll do a 90/10 split of our approximately 10,000 documents.

In [79]:
train_set, test_set = SL_featuresets[1000:], SL_featuresets[:1000]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.753


In my random training, test split, these particular sentiment features did improve the classification on this dataset.  But also note that there are several different ways to represent features for a sentiment lexicon, e.g. instead of counting the sentiment words, we could get one overall score by subtracting the number of negative words from positive words, or other ways to score the sentiment words.  Also note that there are many different sentiment lexicons to try.

# Adding Negation Features

Negation of opinions is an important part of opinion classification.  Here we try a simple strategy.  We look for negation words "not", "never" and "no" and negation that appears in contractions of the form "doesn’t".

One strategy with negation words is to negate the word following the negation word, while other strategies negate all words up to the next punctuation or use syntax to find the scope of the negation.


Here is one list of negation words, including some adverbs called “approximate negators”:
no, not, never, none, rather, hardly, scarcely, rarely, seldom, neither, nor,
couldn't, wasn't, didn't, wouldn't, shouldn't, weren't, don't, doesn't, haven't, hasn't, won't, hadn't

The form of some of the words is a verb followed by n’t.  Now in the Movie Review Corpus itself, the tokenization has these words all split into 3 words, e.g. “couldn”, “’”, and “t”.  (and I have a NOT_features definition for this case).  But in this sentence_polarity corpus, the tokenization keeps these forms of negation as one word ending in “n’t”.

In [28]:
for sent in list(sentences)[:50]:
     for word in sent:
        if (word.endswith("n't")):
            print(sent)


['there', 'is', 'a', 'difference', 'between', 'movies', 'with', 'the', 'courage', 'to', 'go', 'over', 'the', 'top', 'and', 'movies', 'that', "don't", 'care', 'about', 'being', 'stupid']
['a', 'farce', 'of', 'a', 'parody', 'of', 'a', 'comedy', 'of', 'a', 'premise', ',', 'it', "isn't", 'a', 'comparison', 'to', 'reality', 'so', 'much', 'as', 'it', 'is', 'a', 'commentary', 'about', 'our', 'knowledge', 'of', 'films', '.']
['i', "didn't", 'laugh', '.', 'i', "didn't", 'smile', '.', 'i', 'survived', '.']
['i', "didn't", 'laugh', '.', 'i', "didn't", 'smile', '.', 'i', 'survived', '.']
['most', 'of', 'the', 'problems', 'with', 'the', 'film', "don't", 'derive', 'from', 'the', 'screenplay', ',', 'but', 'rather', 'the', 'mediocre', 'performances', 'by', 'most', 'of', 'the', 'actors', 'involved']
['the', 'lack', 'of', 'naturalness', 'makes', 'everything', 'seem', 'self-consciously', 'poetic', 'and', 'forced', '.', '.', '.', "it's", 'a', 'pity', 'that', "[nelson's]", 'achievement', "doesn't", 'match'

In [30]:
negationwords = ['no', 'not', 'never', 'none', 'nowhere', 'nothing', 'noone'
                 , 'rather', 'hardly', 'scarcely', 'rarely', 'seldom', 'neither', 'nor']

Start the feature set with all 2000 word features and 2000 Not word features set to false.  If a negation occurs, add the following word as a Not word feature (if it’s in the top 2000 feature words), and otherwise add it as a regular feature word.

In [39]:
def NOT_features(document, word_features, negationwords):
    features = {}
    for word in word_features:
#         features['contains({})'.format(word)] = False
#         features['contains(NOT{})'.format(word)] = False
        features['V_{}'.format(word)] = False
        features['V_NOT{}'.format(word)] = False
    # go through document words in order
    for i in range(0, len(document)):
        word = document[i]
        if ((i + 1) < len(document)) and ((word in negationwords) or (word.endswith("n't"))):
            i += 1
#             features['contains(NOT{})'.format(document[i])] = (document[i] in word_features)
#         else:
#             features['contains({})'.format(word)] = (word in word_features)
            features['V_{}'.format(document[i])] = (document[i] in word_features)
        else:
            features['V_NOT{}'.format(word)] = (word in word_features)
    return features


Create feature sets as before, using the NOT_features extraction funtion, train the classifier and test the accuracy.

In [32]:
NOT_featuresets = [(NOT_features(d, word_features, negationwords), c) for (d, c) in documents]
NOT_featuresets[0][0]['contains(NOTlike)']
NOT_featuresets[0][0]['contains(always)']


False

In [33]:
train_set, test_set = NOT_featuresets[200:], NOT_featuresets[:200]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print (nltk.classify.accuracy(classifier, test_set))


0.75


In [34]:
classifier.show_most_informative_features(30)

Most Informative Features
    contains(engrossing) = True              pos : neg    =     21.7 : 1.0
      contains(mediocre) = True              neg : pos    =     17.0 : 1.0
       contains(generic) = True              neg : pos    =     16.3 : 1.0
       contains(routine) = True              neg : pos    =     16.3 : 1.0
     contains(inventive) = True              pos : neg    =     15.7 : 1.0
    contains(refreshing) = True              pos : neg    =     14.3 : 1.0
          contains(flat) = True              neg : pos    =     14.2 : 1.0
        contains(boring) = True              neg : pos    =     14.2 : 1.0
            contains(90) = True              neg : pos    =     13.7 : 1.0
  contains(refreshingly) = True              pos : neg    =     13.0 : 1.0
          contains(warm) = True              pos : neg    =     13.0 : 1.0
          contains(dull) = True              neg : pos    =     12.6 : 1.0
     contains(wonderful) = True              pos : neg    =     12.6 : 1.0

In [41]:
NOT_featuresets = [(NOT_features(d, word_features, negationwords), c) for (d, c) in documents]
print(NOT_featuresets[0][0]['V_NOTcare'])
print(NOT_featuresets[0][0]['V_always'])

False
False


In [81]:
len(NOT_featuresets)

10662

We create the training and test sets, train a Naïve Bayes classifier, and look at the accuracy, and this time we’ll do a 90/10 split of our approximately 10,000 documents.

In [82]:
train_set, test_set = NOT_featuresets[1000:], NOT_featuresets[:1000]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print (nltk.classify.accuracy(classifier, test_set))

0.773


In [43]:
classifier.show_most_informative_features(30)

Most Informative Features
         V_NOTengrossing = True              pos : neg    =     21.7 : 1.0
           V_NOTmediocre = True              neg : pos    =     17.0 : 1.0
            V_NOTgeneric = True              neg : pos    =     16.3 : 1.0
            V_NOTroutine = True              neg : pos    =     16.3 : 1.0
          V_NOTinventive = True              pos : neg    =     15.7 : 1.0
         V_NOTrefreshing = True              pos : neg    =     14.3 : 1.0
               V_NOTflat = True              neg : pos    =     14.2 : 1.0
             V_NOTboring = True              neg : pos    =     14.2 : 1.0
                 V_NOT90 = True              neg : pos    =     13.7 : 1.0
               V_NOTwarm = True              pos : neg    =     13.0 : 1.0
       V_NOTrefreshingly = True              pos : neg    =     13.0 : 1.0
               V_NOTdull = True              neg : pos    =     12.6 : 1.0
          V_NOTwonderful = True              pos : neg    =     12.6 : 1.0

In my random split, using the negation features did improve the classification.

Other features

There are other types of possible features.  For example, sometimes people use bigrams in addition to just words/unigrams or use the counts of POS tags, which we will look at next week.  Also, there are many other forms of negation features.

For some problems, the word features can be pruned with a stop word list, but care should be taken that the list doesn’t remove any negation or useful function words.  A very small stop word list is probably better than a large one.


# using a stopword list to prune the word features

We’ll start with the NLTK stop word list, but we’ll remove some of the negation words, or parts of words, that our negation filter uses. This list is still pretty large.

In [85]:
stopwords = nltk.corpus.stopwords.words('english')
len(stopwords)

179

In [86]:
stopwords # review the stopwords

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [88]:
newstopwords = [word for word in stopwords if word not in negationwords]
len(newstopwords)

176

In [89]:
newstopwords

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

Now take the new stop words out of the collection of all words, and then take the top 2000 to be the word features.

In [90]:
new_all_words_list = [word for word in all_words_list if word not in newstopwords]

Now continue to get new word features of length 2000 after the stopwords are removed:

In [91]:
new_all_words = nltk.FreqDist(new_all_words_list)

In [92]:
new_word_items = new_all_words.most_common(2000)

#### Build new word feature

In [93]:
new_word_features = [word for (word,count) in new_word_items]

In [94]:
print(new_word_features[:30])

['.', ',', 'film', 'movie', 'not', 'one', 'like', '"', '--', 'story', 'no', 'much', 'even', 'good', 'comedy', 'time', 'characters', 'little', 'way', 'funny', 'make', 'enough', 'never', 'makes', 'may', 'us', 'work', 'best', 'bad', 'director']


# Adding to BOW / Document features

In [95]:
def document_features_NEW(document, new_word_features):
    document_words = set(document)
    features = {}
    for word in new_word_features:
        features['V_{}'.format(word)] = (word in document_words)
    return features

In [99]:
featuresets_NEW = [(document_features_NEW(d,new_word_features), c) for (d,c) in documents]
print(len(featuresets_NEW)) # total number of words in the featuresets

10662


In [100]:
featuresets_NEW[:10]

[({'V_.': True,
   'V_,': True,
   'V_film': False,
   'V_movie': False,
   'V_not': False,
   'V_one': False,
   'V_like': True,
   'V_"': False,
   'V_--': False,
   'V_story': False,
   'V_no': False,
   'V_much': False,
   'V_even': False,
   'V_good': False,
   'V_comedy': False,
   'V_time': False,
   'V_characters': False,
   'V_little': False,
   'V_way': False,
   'V_funny': False,
   'V_make': False,
   'V_enough': False,
   'V_never': False,
   'V_makes': False,
   'V_may': False,
   'V_us': False,
   'V_work': False,
   'V_best': False,
   'V_bad': False,
   'V_director': False,
   'V_)': False,
   'V_?': False,
   'V_(': False,
   'V_love': True,
   'V_would': False,
   'V_life': False,
   'V_:': False,
   'V_movies': False,
   "V_there's": False,
   'V_new': False,
   'V_well': False,
   'V_could': False,
   'V_something': False,
   'V_really': False,
   'V_made': False,
   'V_performances': False,
   "V_that's": False,
   'V_drama': False,
   'V_many': False,
   'V_films

#### We create the training and test sets, train a Naïve Bayes classifier, and look at the accuracy, and this time we’ll do a 90/10 split of our approximately 10,000 documents.

In [101]:
train_set, test_set = featuresets_NEW[1000:], featuresets_NEW[:1000]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print (nltk.classify.accuracy(classifier, test_set))

0.737


#### The function show_most_informative_features shows the top ranked features according to the ratio of one label to the other one.  For example, if there are 20 times as many positive documents containing this word as negative ones, then the ratio will be reported as     20.00: 1.00   pos:neg.

In [102]:
classifier.show_most_informative_features(30)

Most Informative Features
            V_engrossing = True              pos : neg    =     20.8 : 1.0
                  V_warm = True              pos : neg    =     19.4 : 1.0
              V_mediocre = True              neg : pos    =     16.5 : 1.0
               V_routine = True              neg : pos    =     15.2 : 1.0
               V_generic = True              neg : pos    =     15.2 : 1.0
             V_inventive = True              pos : neg    =     14.2 : 1.0
            V_refreshing = True              pos : neg    =     13.5 : 1.0
                V_boring = True              neg : pos    =     13.4 : 1.0
                    V_90 = True              neg : pos    =     13.2 : 1.0
             V_wonderful = True              pos : neg    =     12.5 : 1.0
                  V_flat = True              neg : pos    =     12.3 : 1.0
          V_refreshingly = True              pos : neg    =     12.2 : 1.0
                V_stupid = True              neg : pos    =     11.5 : 1.0

# Adding to Sentiment Lexicon:  Subjectivity Count features

In [103]:
def SL_features_NEW(document, SL):
    document_words = set(document)
    features = {}
    for word in new_word_features:
        features['contains(%s)' % word] = (word in document_words)
        # count variables for the 4 classes of subjectivity
        weakPos = 0
        strongPos = 0
        weakNeg = 0
        strongNeg = 0
        for word in document_words:
            if word in SL:
                strength, posTag, isStemmed, polarity = SL[word]
                if strength == 'weaksubj' and polarity == 'positive':
                    weakPos += 1
                if strength == 'strongsubj' and polarity == 'positive':
                    strongPos += 1
                if strength == 'weaksubj' and polarity == 'negative':
                    weakNeg += 1
                if strength == 'strongsubj' and polarity == 'negative':
                    strongNeg += 1
                features['positivecount'] = weakPos + (2 * strongPos)
                features['negativecount'] = weakNeg + (2 * strongNeg)      
    return features

In [104]:
SL_featuresets_NEW = [(SL_features_NEW(d, SL), c) for (d,c) in documents]

In [106]:
len(SL_featuresets_NEW)

10662

#### We create the training and test sets, train a Naïve Bayes classifier, and look at the accuracy, and this time we’ll do a 90/10 split of our approximately 10,000 documents.

In [105]:
train_set, test_set = SL_featuresets_NEW[1000:], SL_featuresets_NEW[:1000]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.747


#### The function show_most_informative_features shows the top ranked features according to the ratio of one label to the other one.  For example, if there are 20 times as many positive documents containing this word as negative ones, then the ratio will be reported as     20.00: 1.00   pos:neg.

In [None]:
classifier.show_most_informative_features(30)

# Adding to Negation Features

In [107]:
def NOT_features_NEW(document, new_word_features, negationwords):
    features = {}
    for word in new_word_features:
        features['V_{}'.format(word)] = False
        features['V_NOT{}'.format(word)] = False
    # go through document words in order
    for i in range(0, len(document)):
        word = document[i]
        if ((i + 1) < len(document)) and ((word in negationwords) or (word.endswith("n't"))):
            i += 1
            features['V_{}'.format(document[i])] = (document[i] in new_word_features)
        else:
            features['V_NOT{}'.format(word)] = (word in new_word_features)
    return features



In [108]:
NOT_featuresets_NEW = [(NOT_features_NEW(d, new_word_features, negationwords), c) for (d, c) in documents]
print(NOT_featuresets_NEW[0][0]['V_NOTcare'])
print(NOT_featuresets_NEW[0][0]['V_always'])

False
False


#### We create the training and test sets, train a Naïve Bayes classifier, and look at the accuracy, and this time we’ll do a 90/10 split of our approximately 10,000 documents.

In [109]:
train_set, test_set = NOT_featuresets_NEW[1000:], NOT_featuresets_NEW[:1000]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print (nltk.classify.accuracy(classifier, test_set))

0.786


#### The function show_most_informative_features shows the top ranked features according to the ratio of one label to the other one.  For example, if there are 20 times as many positive documents containing this word as negative ones, then the ratio will be reported as     20.00: 1.00   pos:neg.

In [110]:
classifier.show_most_informative_features(30)

Most Informative Features
         V_NOTengrossing = True              pos : neg    =     20.8 : 1.0
               V_NOTwarm = True              pos : neg    =     19.4 : 1.0
           V_NOTmediocre = True              neg : pos    =     16.5 : 1.0
            V_NOTgeneric = True              neg : pos    =     15.2 : 1.0
            V_NOTroutine = True              neg : pos    =     15.2 : 1.0
          V_NOTinventive = True              pos : neg    =     14.2 : 1.0
         V_NOTrefreshing = True              pos : neg    =     13.5 : 1.0
             V_NOTboring = True              neg : pos    =     13.4 : 1.0
                 V_NOT90 = True              neg : pos    =     13.2 : 1.0
          V_NOTwonderful = True              pos : neg    =     12.5 : 1.0
               V_NOTflat = True              neg : pos    =     12.3 : 1.0
       V_NOTrefreshingly = True              pos : neg    =     12.2 : 1.0
             V_NOTstupid = True              neg : pos    =     11.5 : 1.0