**IST664 - Lab for Week 5**

In this lab session, we will learn how to do sentiment analysis using a traditional machine learning approach. In future labs, we will practice using a convolutional neural network (CNN) and an LSTM to perform sentiment analysis.

**Part 1 - SentiWordNet, a sentiment lexicon in NLTK**

Let’s look at SentiWordNet, a sentiment lexicon in NLTK.  The NLTK corpus resources are listed at http://www.nltk.org/howto/corpus.html, and the documentation for SentiWordNet is on this HowTo page:  http://www.nltk.org/howto/sentiwordnet.html

In this sentiment lexicon, each word sense (synset) is judged to be partly positive, partly negative and partly objective. Three scores say how much of each, and the three scores sum to 1.
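
Because the three scores sum to 1, the objective score is fixed once the positive and negative scores are known. A minimal sketch of that relationship, using the scores printed for `breakdown.n.03` and `good.a.01` in this lab (the helper name `objective_score` is made up for illustration and is not part of NLTK):

```python
def objective_score(pos_score, neg_score):
    """Return the objective share implied by the positive and negative scores."""
    return 1.0 - (pos_score + neg_score)

# PosScore and NegScore for breakdown.n.03 as printed in this lab
print(objective_score(0.0, 0.25))   # 0.75
# PosScore and NegScore for good.a.01 as printed in this lab
print(objective_score(0.75, 0.0))   # 0.25
```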

In [1]:
import nltk
nltk.download('sentiwordnet')
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.corpus import wordnet as wn
from nltk.corpus import sentiwordnet as swn
# each word judged to be made up of positive, negative and objective meaning

# sentiwordnet has the same synsets as wordnet, use wn functions
print(list(swn.senti_synsets('breakdown')))
print(wn.synsets('breakdown'))

[nltk_data] Downloading package sentiwordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/sentiwordnet.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


[SentiSynset('dislocation.n.02'), SentiSynset('breakdown.n.02'), SentiSynset('breakdown.n.03'), SentiSynset('breakdown.n.04')]
[Synset('dislocation.n.02'), Synset('breakdown.n.02'), Synset('breakdown.n.03'), Synset('breakdown.n.04')]


In [2]:
# the print function gives the positive and negative scores
breakdown3 = swn.senti_synset('breakdown.n.03')
print (breakdown3)

<breakdown.n.03: PosScore=0.0 NegScore=0.25>


In [3]:
# there are also separate functions for all the scores
print(breakdown3.pos_score())
print(breakdown3.neg_score())
print(breakdown3.obj_score())

0.0
0.25
0.75


In [4]:
# some more exploration of sentiment scores of words
dogswn1 = swn.senti_synset('dog.n.01')
print(dogswn1)
print(dogswn1.obj_score())

<dog.n.01: PosScore=0.0 NegScore=0.0>
1.0


In [5]:
goodswn1 = swn.senti_synset('good.a.01')
print(goodswn1)
print(goodswn1.pos_score())
print(goodswn1.neg_score())
print(goodswn1.obj_score())

<good.a.01: PosScore=0.75 NegScore=0.0>
0.75
0.0
0.25


In [6]:
# 5.1. Use SentiWordNet to get the senti_synset of the sense of a word that you pick.
# Show the positive, negative and objective sentiment scores for that word, if any.

print(list(swn.senti_synsets('interest')))

interestswn1 = swn.senti_synset('interest.n.01')

print("Object",interestswn1.obj_score())
print("Positive", interestswn1.pos_score())
print("Negative", interestswn1.neg_score())


[SentiSynset('interest.n.01'), SentiSynset('sake.n.01'), SentiSynset('interest.n.03'), SentiSynset('interest.n.04'), SentiSynset('interest.n.05'), SentiSynset('interest.n.06'), SentiSynset('pastime.n.01'), SentiSynset('interest.v.01'), SentiSynset('concern.v.02'), SentiSynset('matter_to.v.01')]
Object 1.0
Positive 0.0
Negative 0.0


**Part 2 - Sentiment Classification – Words as Features**

Now let’s look at two ways to add features that are sometimes used in various sentiment or opinion classification problems.  We will illustrate the process on a corpus of sentences from the Movie Review corpus, where each sentence has been labeled with the ‘positive’ or ‘negative’ tag.

We start by loading the sentence_polarity corpus and creating a list of documents where each document represents a single sentence with the words and its label.

In [7]:
## Task 1: Get to know the sentence_polarity corpus

# movie review sentences
import nltk
nltk.download('sentence_polarity')
from nltk.corpus import sentence_polarity
import random

# get the sentence corpus and look at some sentences
sentences = sentence_polarity.sents()
print(len(sentences))
print(sentence_polarity.categories())

# sentences are already tokenized, print the first four sentences
for sent in sentences[:4]:
    print(sent)

# look at the sentences by category to see how many positive and negative
pos_sents = sentence_polarity.sents(categories='pos')
print(len(pos_sents))
neg_sents = sentence_polarity.sents(categories='neg')
print(len(neg_sents))

[nltk_data] Downloading package sentence_polarity to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping corpora/sentence_polarity.zip.


10662
['neg', 'pos']
['simplistic', ',', 'silly', 'and', 'tedious', '.']
["it's", 'so', 'laddish', 'and', 'juvenile', ',', 'only', 'teenage', 'boys', 'could', 'possibly', 'find', 'it', 'funny', '.']
['exploitative', 'and', 'largely', 'devoid', 'of', 'the', 'depth', 'or', 'sophistication', 'that', 'would', 'make', 'watching', 'such', 'a', 'graphic', 'treatment', 'of', 'the', 'crimes', 'bearable', '.']
['[garbus]', 'discards', 'the', 'potential', 'for', 'pathological', 'study', ',', 'exhuming', 'instead', ',', 'the', 'skewed', 'melodrama', 'of', 'the', 'circumstantial', 'situation', '.']
5331
5331


In [8]:
# Task 2: setup the movie reviews sentences for classification
# The movie review sentences are not labeled individually, but can be retrieved by category.
# We first create the list of documents where each document(sentence) is paired with its label.
documents = [(sent, cat) for cat in sentence_polarity.categories()
	for sent in sentence_polarity.sents(categories=cat)]

# In this list, each item is a pair (sent,cat) where sent is a list of words
# from a movie review sentence and cat is its label, either ‘pos’ or ‘neg’.

# look at the first and last documents - consists of all the words in the review
# followed by the category
print(documents[0])
print(documents[-1])

(['simplistic', ',', 'silly', 'and', 'tedious', '.'], 'neg')
(['provides', 'a', 'porthole', 'into', 'that', 'noble', ',', 'trembling', 'incoherence', 'that', 'defines', 'us', 'all', '.'], 'pos')


In [9]:
# Since the documents are in order by label, we mix them up for later separation into training and test sets.
random.shuffle(documents)

# We need to define the set of words that will be used for features.
# This is essentially all the words in the entire document collection,
# except that we will limit it to the 2000 most frequent words.
# Note that the corpus text is already lowercase, and we do not do stemming or remove stopwords.

all_words_list = [word for (sent,cat) in documents for word in sent]
all_words = nltk.FreqDist(all_words_list)

# get the 2000 most frequently appearing keywords in the corpus
word_items = all_words.most_common(2000)
word_features = [word for (word,count) in word_items]
print(word_features[:50])
#print(word_features)

['.', 'the', ',', 'a', 'and', 'of', 'to', 'is', 'in', 'that', 'it', 'as', 'but', 'with', 'film', 'this', 'for', 'its', 'an', 'movie', "it's", 'be', 'on', 'you', 'not', 'by', 'about', 'more', 'one', 'like', 'has', 'are', 'at', 'from', 'than', '"', 'all', '--', 'his', 'have', 'so', 'if', 'or', 'story', 'i', 'too', 'just', 'who', 'into', 'what']


In [10]:
# Task 3: sentiment classification
# Now we can define the features for each document, using just the words,
# sometimes called the BOW or unigram features.  The feature label will be ‘V_keyword’
# for each keyword (aka word) in the word_features list, and the value of the feature will be Boolean,
# according to whether the word is contained in that document.
def document_features(document, word_features):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['V_{}'.format(word)] = (word in document_words)
    return features

# get features sets for a document, including keyword features and category feature
featuresets = [(document_features(d, word_features), c) for (d, c) in documents]

# the feature sets are 2000 words long so you may want to look at one
featuresets[0]


({'V_.': True,
  'V_the': True,
  'V_,': True,
  'V_a': False,
  'V_and': True,
  'V_of': False,
  'V_to': False,
  'V_is': True,
  'V_in': False,
  'V_that': False,
  'V_it': True,
  'V_as': False,
  'V_but': False,
  'V_with': False,
  'V_film': True,
  'V_this': False,
  'V_for': False,
  'V_its': False,
  'V_an': False,
  'V_movie': False,
  "V_it's": False,
  'V_be': False,
  'V_on': False,
  'V_you': False,
  'V_not': False,
  'V_by': False,
  'V_about': False,
  'V_more': True,
  'V_one': False,
  'V_like': False,
  'V_has': False,
  'V_are': False,
  'V_at': False,
  'V_from': False,
  'V_than': True,
  'V_"': False,
  'V_all': False,
  'V_--': False,
  'V_his': False,
  'V_have': False,
  'V_so': False,
  'V_if': False,
  'V_or': False,
  'V_story': False,
  'V_i': False,
  'V_too': False,
  'V_just': False,
  'V_who': False,
  'V_into': False,
  'V_what': False,
  'V_most': False,
  'V_out': False,
  'V_no': False,
  'V_much': False,
  'V_even': False,
  'V_good': False,
  'V

In [11]:
# training using a Naive Bayes classifier, training set is approximately 90% of data
train_set, test_set = featuresets[1000:], featuresets[:1000]
classifier = nltk.NaiveBayesClassifier.train(train_set)

# evaluate the accuracy of the classifier
nltk.classify.accuracy(classifier, test_set)

# the accuracy result may vary since we randomized the documents

0.723
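
Since the accuracy depends on the random shuffle, it can help to fix the shuffle order while comparing feature sets. The following is a minimal sketch, assuming a toy list of labeled documents; the helper `shuffled_copy` and the seed value are hypothetical and not part of the lab code:

```python
import random

# toy stand-in for the (sentence, label) pairs built in this lab
docs = [(['good'], 'pos'), (['bad'], 'neg'), (['fine'], 'pos'), (['dull'], 'neg')]

def shuffled_copy(items, seed):
    """Return a shuffled copy of items; the same seed gives the same order."""
    rng = random.Random(seed)   # private generator, leaves the global one alone
    result = list(items)
    rng.shuffle(result)
    return result

first = shuffled_copy(docs, seed=664)
second = shuffled_copy(docs, seed=664)
print(first == second)   # True
```

With a fixed order, differences in accuracy between feature sets come from the features rather than from a different train/test split.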

In [12]:
# The function show_most_informative_features shows the top ranked features according
# to the ratio of one label to the other one.  For example, if there are 20 times as
# many positive documents containing this word as negative ones,
# then the ratio will be reported as 20.00: 1.00 pos:neg.

classifier.show_most_informative_features(30)

Most Informative Features
            V_engrossing = True              pos : neg    =     19.6 : 1.0
                  V_warm = True              pos : neg    =     19.6 : 1.0
              V_provides = True              pos : neg    =     16.9 : 1.0
               V_generic = True              neg : pos    =     15.1 : 1.0
              V_mediocre = True              neg : pos    =     15.1 : 1.0
               V_routine = True              neg : pos    =     15.1 : 1.0
              V_supposed = True              neg : pos    =     14.4 : 1.0
            V_unexpected = True              pos : neg    =     14.3 : 1.0
                V_boring = True              neg : pos    =     13.9 : 1.0
             V_inventive = True              pos : neg    =     13.6 : 1.0
                  V_loud = True              neg : pos    =     13.1 : 1.0
            V_refreshing = True              pos : neg    =     12.9 : 1.0
                  V_flat = True              neg : pos    =     12.7 : 1.0
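
The ratios above compare the probability of a feature under one label with its probability under the other. The sketch below works through that arithmetic on invented counts. It assumes add-0.5 smoothing over the two feature values (True and False), which we believe is how NLTK's default estimator behaves; treat that as an assumption rather than a description of NLTK's internals. The function name `smoothed_ratio` is hypothetical.

```python
def smoothed_ratio(count_a, total_a, count_b, total_b, gamma=0.5, bins=2):
    """Ratio of P(feature=True | label a) to P(feature=True | label b)."""
    p_a = (count_a + gamma) / (total_a + bins * gamma)
    p_b = (count_b + gamma) / (total_b + bins * gamma)
    return p_a / p_b

# invented counts: the word appears in 20 of 4800 positive sentences
# and in 1 of 4800 negative sentences
ratio = smoothed_ratio(20, 4800, 1, 4800)
print(round(ratio, 1))   # 13.7
```

Under this assumption, a word seen 20 times as often in positive sentences is reported with a ratio somewhat below 20, because smoothing matters most for the rarer count.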

**Part 3 - Sentiment Classification - Subjectivity Count features**

Let’s look at another way of adding features for sentiment classification. We will first read in the subjectivity words from the subjectivity lexicon file created by Janyce Wiebe and her group at the University of Pittsburgh in the MPQA project.  Although these words are often used as features themselves or in conjunction with other information, we will create two features that involve counting the positive and negative subjectivity words present in each document.

In [13]:
# Upload the subjectivity lexicon file to Colab and point SLpath at it.
# If you run Python locally from the directory that holds the file, skip the upload and just use the file name as follows
from google.colab import files
uploaded = files.upload()
SLpath = "subjclueslen1-HLTEMNLP05.tff"

# this function returns a dictionary where you can look up a word and get back
# its four items of subjectivity information: strength, POS tag, stemmed flag and polarity
def readSubjectivity(path):
    # initialize an empty dictionary
    sldict = { }
    # the with statement closes the file once all lines are read
    with open(path, 'r') as flexicon:
        for line in flexicon:
            fields = line.split()   # default is to split on whitespace
            # split each field on the '=' and keep the second part as the value
            strength = fields[0].split("=")[1]
            word = fields[2].split("=")[1]
            posTag = fields[3].split("=")[1]
            stemmed = fields[4].split("=")[1]
            polarity = fields[5].split("=")[1]
            isStemmed = (stemmed == 'y')
            # put a dictionary entry with the word as the keyword
            #     and a list of the other values
            sldict[word] = [strength, posTag, isStemmed, polarity]
    return sldict
SL = readSubjectivity(SLpath)

Saving subjclueslen1-HLTEMNLP05.tff to subjclueslen1-HLTEMNLP05.tff
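
To see why `readSubjectivity` picks fields 0, 2, 3, 4 and 5, it helps to look at one line of the file. The sample line below is an assumption about the lexicon's `key=value` layout, written from memory of the MPQA format; check it against your own copy of the file.

```python
# assumed sample line in the lexicon's key=value format
sample = "type=weaksubj len=1 word1=abandoned pos1=adj stemmed1=n priorpolarity=negative"

fields = sample.split()   # six whitespace-separated key=value fields
entry = dict(field.split("=", 1) for field in fields)

# fields[0] is the strength, fields[2] the word, fields[5] the polarity
print(entry["type"], entry["word1"], entry["priorpolarity"])
# weaksubj abandoned negative
```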


In [14]:
print(SL)



In [15]:
# how many words are in the dictionary
print(len(SL.keys()))

# look at words in the dictionary
print(SL['absolute'])
print(SL['shabby'])

6885
['strongsubj', 'adj', False, 'neutral']
['strongsubj', 'adj', False, 'negative']


In [16]:
# note what happens if the word is not there
print(SL['dog'])

KeyError: ignored
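
One way to avoid the KeyError is to look words up with the dictionary's `get` method, which returns None for a missing key. A minimal sketch on a two-entry stand-in for `SL` (the entries are copied from the output above; the helper `lookup_polarity` is hypothetical):

```python
# two entries copied from the lab output; the real SL has thousands
SL_sample = {
    'absolute': ['strongsubj', 'adj', False, 'neutral'],
    'shabby': ['strongsubj', 'adj', False, 'negative'],
}

def lookup_polarity(word, lexicon):
    """Return the polarity of word, or None if the word is not in the lexicon."""
    entry = lexicon.get(word)
    if entry is None:
        return None
    strength, posTag, isStemmed, polarity = entry
    return polarity

print(lookup_polarity('shabby', SL_sample))   # negative
print(lookup_polarity('dog', SL_sample))      # None
```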

In [17]:
# use multiple assignment to get the 4 items
strength, posTag, isStemmed, polarity = SL['absolute']
print(polarity)

neutral


In [18]:
# Now we create a feature extraction function that has all the word features as before,
# but also has two features ‘positivecount’ and ‘negativecount’.  These features contain
# counts of all the positive and negative subjectivity words, where each weakly subjective
# word is counted once and each strongly subjective word is counted twice.
# Note that this is only one of the ways in which people count up the presence of positive, negative and neutral words in a document.

def SL_features(document, word_features, SL):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['V_{}'.format(word)] = (word in document_words)
    # count variables for the 4 classes of subjectivity
    weakPos = 0
    strongPos = 0
    weakNeg = 0
    strongNeg = 0
    for word in document_words:
        if word in SL:
            strength, posTag, isStemmed, polarity = SL[word]
            if strength == 'weaksubj' and polarity == 'positive':
                weakPos += 1
            if strength == 'strongsubj' and polarity == 'positive':
                strongPos += 1
            if strength == 'weaksubj' and polarity == 'negative':
                weakNeg += 1
            if strength == 'strongsubj' and polarity == 'negative':
                strongNeg += 1
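    # set the two count features once, after the loop, so they exist
    # even when the document has no words from the lexicon
    features['positivecount'] = weakPos + (2 * strongPos)
    features['negativecount'] = weakNeg + (2 * strongNeg)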
    return features

SL_featuresets = [(SL_features(d, word_features, SL), c) for (d, c) in documents]

In [19]:
# show just the two sentiment lexicon features in document 0
print(SL_featuresets[0][0]['positivecount'])
print(SL_featuresets[0][0]['negativecount'])

2
3


In [20]:
# this gives the label of document 0
SL_featuresets[0][1]

'neg'

In [21]:
# retrain the classifier using these features
train_set, test_set = SL_featuresets[1000:], SL_featuresets[:1000]
classifier = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)

0.739

Note that there are several different ways to represent features for a sentiment lexicon, e.g. instead of counting the sentiment words, we could get one overall score by subtracting the number of negative words from positive words, or other ways to score the sentiment words.  Also, note that there are many different sentiment lexicons to try.
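
As an illustration of the single overall score mentioned above, here is a minimal sketch that subtracts the number of negative words from the number of positive words. The three-word lexicon and the function name `net_sentiment` are made up for the example; the strength values in the toy entries are not taken from the real lexicon.

```python
# toy lexicon in the same [strength, posTag, isStemmed, polarity] layout as SL
toy_lexicon = {
    'warm': ['weaksubj', 'adj', False, 'positive'],
    'refreshing': ['strongsubj', 'adj', False, 'positive'],
    'boring': ['strongsubj', 'adj', False, 'negative'],
}

def net_sentiment(document, lexicon):
    """Positive-word count minus negative-word count over the tokens of a sentence."""
    score = 0
    for word in document:
        if word in lexicon:
            polarity = lexicon[word][3]
            if polarity == 'positive':
                score += 1
            elif polarity == 'negative':
                score -= 1
    return score

print(net_sentiment(['a', 'warm', 'and', 'refreshing', 'film'], toy_lexicon))   # 2
print(net_sentiment(['warm', 'but', 'boring'], toy_lexicon))                     # 0
```

A score like this could be added to the feature dictionary as one extra feature in place of, or next to, the two counts.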

**Part 4 - Sentiment Analysis - Negation**

Negation of opinions is an important part of opinion classification.  Here we try a simple strategy.  We look for negation words "not", "never" and "no" and negation that appears in contractions of the form "doesn", "'", "t".

For example, my first document has the following words:
'if', 'you', 'don', "'", 't', 'like', 'this', 'film', ',', 'then', 'you', 'have', 'a', 'problem', 'with', 'the', 'genre', 'itself',

One strategy with negation words is to negate the word following the negation word, while other strategies negate all words up to the next punctuation or use syntax to find the scope of the negation.
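
The second strategy, negating every word up to the next punctuation mark, can be sketched as follows. This is only an illustration: the function name `mark_negation_scope`, the punctuation set, and the `NOT` prefix on tokens are choices made for this example.

```python
def mark_negation_scope(tokens, negation_words, punctuation=('.', ',', ';', ':', '!', '?')):
    """Prefix NOT to every token after a negation word, up to the next punctuation mark."""
    marked = []
    in_scope = False
    for tok in tokens:
        if tok in punctuation:
            in_scope = False          # punctuation closes the negation scope
            marked.append(tok)
        elif tok in negation_words or tok.endswith("n't"):
            in_scope = True           # a negation word opens the scope
            marked.append(tok)
        elif in_scope:
            marked.append('NOT' + tok)
        else:
            marked.append(tok)
    return marked

tokens = ['i', "didn't", 'laugh', 'or', 'smile', ',', 'but', 'i', 'survived', '.']
print(mark_negation_scope(tokens, ['no', 'not', 'never']))
# ['i', "didn't", 'NOTlaugh', 'NOTor', 'NOTsmile', ',', 'but', 'i', 'survived', '.']
```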

We follow the first strategy here: we go through the document words in order, adding the word features, but if a word follows a negation word, we change its feature to the negated word.

Here is one list of negation words, including some adverbs called “approximate negators”:
no, not, never, none, rather, hardly, scarcely, rarely, seldom, neither, nor,
couldn't, wasn't, didn't, wouldn't, shouldn't, weren't, don't, doesn't, haven't, hasn't, won't, hadn't

The form of some of these words is a verb followed by n’t.  In the Movie Review Corpus itself, the tokenization splits each of these words into three tokens, e.g. “couldn”, “’”, and “t” (I have a separate NOT_features definition for that case).  In this sentence_polarity corpus, however, the tokenization keeps these forms of negation as one word ending in “n’t”.
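
For a corpus that splits contractions into three tokens, one possible preprocessing step is to rejoin them before extracting features. The sketch below is not the separate NOT_features definition mentioned above; `merge_contractions` is a hypothetical helper, and it assumes the apostrophe token is the plain `'` character.

```python
def merge_contractions(tokens):
    """Rejoin tokens split as stem, apostrophe, 't' into a single n't token."""
    merged = []
    i = 0
    while i < len(tokens):
        if (i + 2 < len(tokens) and tokens[i].endswith('n')
                and tokens[i + 1] == "'" and tokens[i + 2] == 't'):
            merged.append(tokens[i] + "'t")
            i += 3   # skip the apostrophe and the 't'
        else:
            merged.append(tokens[i])
            i += 1
    return merged

print(merge_contractions(['if', 'you', 'don', "'", 't', 'like', 'this', 'film']))
# ['if', 'you', "don't", 'like', 'this', 'film']
```

After merging, the `endswith("n't")` test used in this lab applies to both corpora.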

In [22]:
for sent in list(sentences)[:50]:
    # print each sentence once, even if it has several n't words
    if any(word.endswith("n't") for word in sent):
        print(sent)


['there', 'is', 'a', 'difference', 'between', 'movies', 'with', 'the', 'courage', 'to', 'go', 'over', 'the', 'top', 'and', 'movies', 'that', "don't", 'care', 'about', 'being', 'stupid']
['a', 'farce', 'of', 'a', 'parody', 'of', 'a', 'comedy', 'of', 'a', 'premise', ',', 'it', "isn't", 'a', 'comparison', 'to', 'reality', 'so', 'much', 'as', 'it', 'is', 'a', 'commentary', 'about', 'our', 'knowledge', 'of', 'films', '.']
['i', "didn't", 'laugh', '.', 'i', "didn't", 'smile', '.', 'i', 'survived', '.']
['most', 'of', 'the', 'problems', 'with', 'the', 'film', "don't", 'derive', 'from', 'the', 'screenplay', ',', 'but', 'rather', 'the', 'mediocre', 'performances', 'by', 'most', 'of', 'the', 'actors', 'involved']
['the', 'lack', 'of', 'naturalness', 'makes', 'everything', 'seem', 'self-consciously', 'poetic', 'and', 'forced', '.', '.', '.', "it's", 'a', 'pity', 'that', "[nelson's]", 'achievement', "doesn't", 'match'

In [23]:
# this list of negation words includes some "approximate negators" like hardly and rarely
negationwords = ['no', 'not', 'never', 'none', 'nowhere', 'nothing', 'noone', 'rather', 'hardly', 'scarcely', 'rarely', 'seldom', 'neither', 'nor']


In [24]:
# One strategy with negation words is to negate the word following the negation word
#   other strategies negate all words up to the next punctuation
# Strategy is to go through the document words in order adding the word features,
#   but if the word follows a negation word, change the feature to the negated word
# Start the feature set with all 2000 word features and 2000 Not word features set to false
def NOT_features(document, word_features, negationwords):
    features = {}
    for word in word_features:
        features['V_{}'.format(word)] = False
        features['V_NOT{}'.format(word)] = False
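    # go through document words in order; a while loop is used because
    # changing i inside "for i in range(...)" does not skip the next word
    i = 0
    while i < len(document):
        word = document[i]
        if ((i + 1) < len(document)) and ((word in negationwords) or (word.endswith("n't"))):
            # move to the negated word and record it as a NOT feature only
            i += 1
            features['V_NOT{}'.format(document[i])] = (document[i] in word_features)
        else:
            features['V_{}'.format(word)] = (word in word_features)
        i += 1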
    return features
# define the feature sets
NOT_featuresets = [(NOT_features(d, word_features, negationwords), c) for (d, c) in documents]
# show the values of a couple of example features
print(NOT_featuresets[0][0]['V_NOTcare'])
print(NOT_featuresets[0][0]['V_always'])

False
False


In [25]:
train_set, test_set = NOT_featuresets[1000:], NOT_featuresets[:1000]
classifier = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)

0.766

In [26]:
classifier.show_most_informative_features(30)

Most Informative Features
            V_engrossing = True              pos : neg    =     19.6 : 1.0
                  V_warm = True              pos : neg    =     19.6 : 1.0
              V_provides = True              pos : neg    =     16.9 : 1.0
               V_generic = True              neg : pos    =     15.1 : 1.0
              V_mediocre = True              neg : pos    =     15.1 : 1.0
               V_routine = True              neg : pos    =     15.1 : 1.0
              V_supposed = True              neg : pos    =     14.4 : 1.0
            V_unexpected = True              pos : neg    =     14.3 : 1.0
                V_boring = True              neg : pos    =     13.9 : 1.0
             V_inventive = True              pos : neg    =     13.6 : 1.0
                  V_loud = True              neg : pos    =     13.1 : 1.0
            V_refreshing = True              pos : neg    =     12.9 : 1.0
                  V_flat = True              neg : pos    =     12.7 : 1.0

In [27]:
nltk.download('punkt')
text = 'I had a fantastic day at the office'
texttokens = nltk.word_tokenize(text)
inputfeatureset = NOT_features(texttokens, word_features, negationwords)
print(inputfeatureset)
classifier.classify(inputfeatureset)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


{'V_.': False, 'V_NOT.': False, 'V_the': True, 'V_NOTthe': False, 'V_,': False, 'V_NOT,': False, 'V_a': True, 'V_NOTa': False, 'V_and': False, 'V_NOTand': False, 'V_of': False, 'V_NOTof': False, 'V_to': False, 'V_NOTto': False, 'V_is': False, 'V_NOTis': False, 'V_in': False, 'V_NOTin': False, 'V_that': False, 'V_NOTthat': False, 'V_it': False, 'V_NOTit': False, 'V_as': False, 'V_NOTas': False, 'V_but': False, 'V_NOTbut': False, 'V_with': False, 'V_NOTwith': False, 'V_film': False, 'V_NOTfilm': False, 'V_this': False, 'V_NOTthis': False, 'V_for': False, 'V_NOTfor': False, 'V_its': False, 'V_NOTits': False, 'V_an': False, 'V_NOTan': False, 'V_movie': False, 'V_NOTmovie': False, "V_it's": False, "V_NOTit's": False, 'V_be': False, 'V_NOTbe': False, 'V_on': False, 'V_NOTon': False, 'V_you': False, 'V_NOTyou': False, 'V_not': False, 'V_NOTnot': False, 'V_by': False, 'V_NOTby': False, 'V_about': False, 'V_NOTabout': False, 'V_more': False, 'V_NOTmore': False, 'V_one': False, 'V_NOTone': False

'pos'

**Part 5 - Sentiment Analysis - Stopwords**

Let’s try using a stopword list to prune the word features.  We’ll start with the NLTK stop word list, but we’ll remove some of the negation words, or parts of words, that our negation filter uses.  This list is still pretty large.

In [28]:
nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')
print(len(stopwords))
print(stopwords)


179
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [29]:
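# add the contraction stems from the stopword list (e.g. 'didn', 'wasn') to the negation words,
# so that the next cell keeps them out of the stopword list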
negationwords.extend(['ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn'])

In [30]:
newstopwords = [word for word in stopwords if word not in negationwords]
print(len(newstopwords))
print(newstopwords)

157
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's

In [31]:
# remove stop words from the all words list
new_all_words_list = [word for (sent,cat) in documents for word in sent if word not in newstopwords]

In [32]:
# continue to define a new all words dictionary, get the 2000 most common as new_word_features
new_all_words = nltk.FreqDist(new_all_words_list)
new_word_items = new_all_words.most_common(2000)
new_word_features = [word for (word,count) in new_word_items]
print(new_word_features[:30])

['.', ',', 'film', 'movie', 'not', 'one', 'like', '"', '--', 'story', 'no', 'much', 'even', 'good', 'comedy', 'time', 'characters', 'little', 'way', 'funny', 'make', 'enough', 'never', 'makes', 'may', 'us', 'work', 'best', 'bad', 'director']


In [None]:
# 5.2. now re-run one of the feature set definitions with the new_word_features instead of word_features

In [34]:
SL_featuresets = [(SL_features(d, new_word_features, SL), c) for (d, c) in documents]

train_set, test_set = SL_featuresets[1000:], SL_featuresets[:1000]
classifier = nltk.NaiveBayesClassifier.train(train_set)
nltk.classify.accuracy(classifier, test_set)

0.735