### NLP Lab Session Week 8
### Constructing Feature Sets for Sentiment Classification in the NLTK
### Part 1:  Movie Review Corpus Sentences with BOW

#### Getting Started

For this lab session download the following files and put them in your class folder for copy/pasting examples.  

LabWk8.sentimentfeatures.sents.txt

Subjectivity.py

subjclueslen1-HLTEMNLP05.tff.zip

Unzip the subjclues file and remember the location.  Start your jupyter notebook session.


In [1]:
import nltk

#### Sentiment/Opinion Classification (using the Movie Review corpus sentences)

In today’s lab, we will look at two ways to add features that are sometimes used in various sentiment or opinion classification problems.  In addition to providing a corpus of the 2000 positive and negative movie review documents, Pang and Lee had a subset of the sentences of the corpus annotated for sentiment in each sentence.  We will illustrate the process of sentiment classification on this corpus of sentences with positive or negative sentiment labels.

We start by loading the sentence_polarity corpus and creating a list of documents where each document represents a single sentence with the words and its label. 



In [2]:
from nltk.corpus import sentence_polarity
import random


In [3]:
# Look at sentences from the entire list of sentences.
sentences = sentence_polarity.sents()
print(len(sentences))
print(type(sentences))
print(sentence_polarity.categories())
# sentences are already tokenized, show the first four sentences
for sent in sentences[:4]:
    print(sent)


10662
<class 'nltk.corpus.reader.util.ConcatenatedCorpusView'>
['neg', 'pos']
['simplistic', ',', 'silly', 'and', 'tedious', '.']
["it's", 'so', 'laddish', 'and', 'juvenile', ',', 'only', 'teenage', 'boys', 'could', 'possibly', 'find', 'it', 'funny', '.']
['exploitative', 'and', 'largely', 'devoid', 'of', 'the', 'depth', 'or', 'sophistication', 'that', 'would', 'make', 'watching', 'such', 'a', 'graphic', 'treatment', 'of', 'the', 'crimes', 'bearable', '.']
['[garbus]', 'discards', 'the', 'potential', 'for', 'pathological', 'study', ',', 'exhuming', 'instead', ',', 'the', 'skewed', 'melodrama', 'of', 'the', 'circumstantial', 'situation', '.']


The movie review sentences are not labeled individually, but can be retrieved by category.  Look at the sentences by category to see how many positive and negative sentences there are.

In [4]:
pos_sents = sentence_polarity.sents(categories='pos')
print(len(pos_sents))
neg_sents = sentence_polarity.sents(categories='neg')
print(len(neg_sents))


5331
5331


In [5]:
#We create the list of documents where each document(sentence) is paired with its label.

documents = [(sent, cat) for cat in sentence_polarity.categories() 
	for sent in sentence_polarity.sents(categories=cat)]


In [6]:
#In this list, each item is a pair (sent,cat) where sent is a list of words from a movie review sentence and cat is its label, either ‘pos’ or ‘neg’.
print(documents[0])
print(documents[-1])


(['simplistic', ',', 'silly', 'and', 'tedious', '.'], 'neg')
(['provides', 'a', 'porthole', 'into', 'that', 'noble', ',', 'trembling', 'incoherence', 'that', 'defines', 'us', 'all', '.'], 'pos')


In [7]:
# Since the documents are in order by label, we mix them up for later separation into training and test sets.

random.shuffle(documents)


We need to define the set of words that will be used for features.  This is essentially all the words in the entire document collection, except that we will limit it to the 2000 most frequent words.  Note that we lowercase the words, but do not do stemming or remove stopwords.

In [8]:
all_words_list = [word for (sent,cat) in documents for word in sent]
all_words = nltk.FreqDist(all_words_list)
word_items = all_words.most_common(2000)
word_features = [word for (word, freq) in word_items]
# look at the first 50 words in the most frequent list of words
print(word_features[:50])


['.', 'the', ',', 'a', 'and', 'of', 'to', 'is', 'in', 'that', 'it', 'as', 'but', 'with', 'film', 'this', 'for', 'its', 'an', 'movie', "it's", 'be', 'on', 'you', 'not', 'by', 'about', 'one', 'more', 'like', 'has', 'are', 'at', 'from', 'than', '"', 'all', '--', 'his', 'have', 'so', 'if', 'or', 'story', 'i', 'too', 'just', 'who', 'into', 'what']


Now we can define the features for each document, using just the words, sometimes called the BOW or unigram features.  The feature label will be ‘V_keyword’ for each keyword (aka word) in the word_features set, and the value of the feature will be Boolean, according to whether the word is contained in that document.

In [9]:
def document_features(document, word_features):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['V_{}'.format(word)] = (word in document_words)
    return features


In [10]:
documents[0]

(['not',
  'only',
  'a',
  'reminder',
  'of',
  'how',
  'they',
  'used',
  'to',
  'make',
  'movies',
  ',',
  'but',
  'also',
  'how',
  'they',
  'sometimes',
  'still',
  'can',
  'be',
  'made',
  '.'],
 'pos')

In [11]:
# Define the feature sets for the documents. 
featuresets = [(document_features(d,word_features), c) for (d,c) in documents]


In [12]:
# We create the training and test sets, train a Naïve Bayes classifier, and look at the accuracy, 
# and this time we’ll do a 90/10 split of our approximately 10,000 documents.

train_set, test_set = featuresets[1000:], featuresets[:1000]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))


0.759


In [13]:

# The function show_most_informative_features shows the top ranked features according to the ratio of one
# label to the other one.  For example, if there are 20 times as many positive documents containing this word as negative ones,
# then the ratio will be reported as     20.00: 1.00   pos:neg.

classifier.show_most_informative_features(30)


Most Informative Features
            V_engrossing = True              pos : neg    =     20.6 : 1.0
             V_wonderful = True              pos : neg    =     16.6 : 1.0
               V_generic = True              neg : pos    =     16.1 : 1.0
             V_inventive = True              pos : neg    =     15.9 : 1.0
              V_mediocre = True              neg : pos    =     15.4 : 1.0
            V_refreshing = True              pos : neg    =     13.9 : 1.0
                V_boring = True              neg : pos    =     13.1 : 1.0
               V_routine = True              neg : pos    =     12.8 : 1.0
                    V_90 = True              neg : pos    =     12.8 : 1.0
                  V_flat = True              neg : pos    =     12.4 : 1.0
                  V_warm = True              pos : neg    =     12.4 : 1.0
                  V_dull = True              neg : pos    =     12.1 : 1.0
                 V_stale = True              neg : pos    =     11.5 : 1.0

### Part 2:  Adding Features from a Sentiment Lexicon

### Continuing our session with the movie review sentences

### Sentiment Lexicon:  Subjectivity Count features


We will first read in the subjectivity words from the subjectivity lexicon file created by Janyce Wiebe and her group at the University of Pittsburgh in the MPQA project.  Although these words are often used as features themselves or in conjunction with other information, we will create two features that involve counting the positive and negative subjectivity words present in each document.

Copy and paste the definition of the readSubjectivity function from the Subjectivity.txt file.  We’ll look at the function to see how it reads the file into a dictionary.

Create a path variable to where you stored the subjectivity lexicon file.  Here is an example from my mac, making sure the path name goes on one line:


In [14]:
# Module Subjectivity reads the subjectivity lexicon file from Wiebe et al
#    at http://www.cs.pitt.edu/mpqa/ (part of the Multiple Perspective QA project)
#
# This file has the format that each line is formatted as in this example for the word "abandoned"
#     type=weaksubj len=1 word1=abandoned pos1=adj stemmed1=n priorpolarity=negative
# In our data, the pos tag is ignored, so this program just takes the last one read
#     (typically the noun over the adjective)
#
# The data structure that is created is a dictionary where
#    each word is mapped to a list of 4 things:  
#        strength, which will be either 'strongsubj' or 'weaksubj'
#        posTag, either 'adj', 'verb', 'noun', 'adverb', 'anypos'
#        isStemmed, either true or false
#        polarity, either 'positive', 'negative', or 'neutral'

import nltk

# pass the absolute path of the lexicon file to this program
# example call:
SLpath = "C:\\Users\\rkrishnan\\Documents\\01 Personal\\MS\\IST664\\Week8\\subjclueslen1-hltemnlp05\\subjclueslen1-HLTEMNLP05.tff"


# this function returns a dictionary where you can look up words and get back 
#     the four items of subjectivity information described above
def readSubjectivity(path):
    flexicon = open(path, 'r')
    # initialize an empty dictionary
    sldict = { }
    for line in flexicon:
        fields = line.split()   # default is to split on whitespace
        # split each field on the '=' and keep the second part as the value
        strength = fields[0].split("=")[1]
        word = fields[2].split("=")[1]
        posTag = fields[3].split("=")[1]
        stemmed = fields[4].split("=")[1]
        polarity = fields[5].split("=")[1]
        if (stemmed == 'y'):
            isStemmed = True
        else:
            isStemmed = False
        # put a dictionary entry with the word as the keyword
        #     and a list of the other values
        sldict[word] = [strength, posTag, isStemmed, polarity]
    return sldict



In [15]:
# Now run the function that reads the file.  It creates a Subjectivity Lexicon that is represented here as a dictionary, 
# where each word is mapped to a list containing the strength, POStag, whether it is stemmed and the polarity.  
# (See more details in the Subjectivity.py file.)
SL = readSubjectivity(SLpath)


In [16]:
# Now the variable SL (for Subjectivity Lexicon) is a dictionary where you can look up words and find the strength, POS tag,
# whether it is stemmed and polarity.  We can try out some words.
SL['absolute']
SL['shabby']
# Or we can use the Python multiple assignment to get the 4 items:
strength, posTag, isStemmed, polarity = SL['absolute']


Now we create a feature extraction function that has all the word features as before, but also has two features ‘positivecount’ and ‘negativecount’.  These features contains counts of all the positive and negative subjectivity words, where each weakly subjective word is counted once and each strongly subjective word is counted twice.  Note that this is only one of the ways in which people count up the presence of positive, negative and neutral words in a document.

In [17]:
def SL_features(document, SL, word_features):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
        # count variables for the 4 classes of subjectivity
        weakPos = 0
        strongPos = 0
        weakNeg = 0
        strongNeg = 0
        for word in document_words:
            if word in SL:
                strength, posTag, isStemmed, polarity = SL[word]
                if strength == 'weaksubj' and polarity == 'positive':
                    weakPos += 1
                if strength == 'strongsubj' and polarity == 'positive':
                    strongPos += 1
                if strength == 'weaksubj' and polarity == 'negative':
                    weakNeg += 1
                if strength == 'strongsubj' and polarity == 'negative':
                    strongNeg += 1
                features['positivecount'] = weakPos + (2 * strongPos)
                features['negativecount'] = weakNeg + (2 * strongNeg)      
    return features


In [18]:
# Now we create feature sets as before, but using this feature extraction function.

SL_featuresets = [(SL_features(d, SL,word_features), c) for (d,c) in documents]


In [19]:
# features in document 0
print(SL_featuresets[0][0]['positivecount'])

print(SL_featuresets[0][0]['negativecount'])

print(SL_featuresets[0][1])

train_set, test_set = SL_featuresets[1000:], SL_featuresets[:1000]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))


0
0
pos
0.773


In my random training, test split, these particular sentiment features did improve the classification on this dataset.  But also note that there are several different ways to represent features for a sentiment lexicon, e.g. instead of counting the sentiment words, we could get one overall score by subtracting the number of negative words from positive words, or other ways to score the sentiment words.  Also note that there are many different sentiment lexicons to try.

### Part 3:  Adding Negation Features 

### Continuing our session with the movie review sentences

### Negation features

Negation of opinions is an important part of opinion classification.  Here we try a simple strategy.  We look for negation words "not", "never" and "no" and negation that appears in contractions of the form "doesn’t".

One strategy with negation words is to negate the word following the negation word, while other strategies negate all words up to the next punctuation or use syntax to find the scope of the negation.

We follow the first strategy here, and we go through the document words in order adding the word features, but if the word follows a negation words, change the feature to negated word.

Here is one list of negation words, including some adverbs called “approximate negators”:
no, not, never, none, rather, hardly, scarcely, rarely, seldom, neither, nor,
couldn't, wasn't, didn't, wouldn't, shouldn't, weren't, don't, doesn't, haven't, hasn't, won't, hadn't

The form of some of the words is a verb followed by n’t.  Now in the Movie Review Corpus itself, the tokenization has these words all split into 3 words, e.g. “couldn”, “’”, and “t”.  (and I have a NOT_features definition for this case).  But in this sentence_polarity corpus, the tokenization keeps these forms of negation as one word ending in “n’t”.


In [20]:
for sent in list(sentences)[:50]:
    for word in sent:
        if (word.endswith("n't")):
            print(sent)

negationwords = ['no', 'not', 'never', 'none', 'nowhere', 'nothing', 'noone', 'rather', 'hardly', 'scarcely', \
                 'rarely', 'seldom', 'neither', 'nor']


['there', 'is', 'a', 'difference', 'between', 'movies', 'with', 'the', 'courage', 'to', 'go', 'over', 'the', 'top', 'and', 'movies', 'that', "don't", 'care', 'about', 'being', 'stupid']
['a', 'farce', 'of', 'a', 'parody', 'of', 'a', 'comedy', 'of', 'a', 'premise', ',', 'it', "isn't", 'a', 'comparison', 'to', 'reality', 'so', 'much', 'as', 'it', 'is', 'a', 'commentary', 'about', 'our', 'knowledge', 'of', 'films', '.']
['i', "didn't", 'laugh', '.', 'i', "didn't", 'smile', '.', 'i', 'survived', '.']
['i', "didn't", 'laugh', '.', 'i', "didn't", 'smile', '.', 'i', 'survived', '.']
['most', 'of', 'the', 'problems', 'with', 'the', 'film', "don't", 'derive', 'from', 'the', 'screenplay', ',', 'but', 'rather', 'the', 'mediocre', 'performances', 'by', 'most', 'of', 'the', 'actors', 'involved']
['the', 'lack', 'of', 'naturalness', 'makes', 'everything', 'seem', 'self-consciously', 'poetic', 'and', 'forced', '.', '.', '.', "it's", 'a', 'pity', 'that', "[nelson's]", 'achievement', "doesn't", 'match'

Start the feature set with all 2000 word features and 2000 Not word features set to false.  If a negation occurs, add the following word as a Not word feature (if it’s in the top 2000 feature words), and otherwise add it as a regular feature word.

In [21]:
# One strategy with negation words is to negate the word following the negation word
#   other strategies negate all words up to the next punctuation
# Strategy is to go through the document words in order adding the word features,
#   but if the word follows a negation words, change the feature to negated word
# Start the feature set with all 2000 word features and 2000 Not word features set to false
def NOT_features(document, word_features, negationwords):
    features = {}
    for word in word_features:
        features['V_{}'.format(word)] = False
        features['V_NOT{}'.format(word)] = False
    # go through document words in order
    for i in range(0, len(document)):
        word = document[i]
        if ((i + 1) < len(document)) and ((word in negationwords) or (word.endswith("n't"))):
            i += 1
            features['V_NOT{}'.format(document[i])] = (document[i] in word_features)
        else:
            features['V_{}'.format(word)] = (word in word_features)
    return features


In [22]:

# Create feature sets as before, using the NOT_features extraction funtion, train the classifier and test the accuracy.
NOT_featuresets = [(NOT_features(d, word_features, negationwords), c) for (d, c) in documents]
print(NOT_featuresets[0][0]['V_NOTcare'])
print(NOT_featuresets[0][0]['V_always'])

train_set, test_set = NOT_featuresets[1000:], NOT_featuresets[:1000]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

classifier.show_most_informative_features(30)


False
False
0.796
Most Informative Features
            V_engrossing = True              pos : neg    =     20.6 : 1.0
             V_wonderful = True              pos : neg    =     16.6 : 1.0
               V_generic = True              neg : pos    =     16.1 : 1.0
             V_inventive = True              pos : neg    =     15.9 : 1.0
              V_mediocre = True              neg : pos    =     15.4 : 1.0
            V_refreshing = True              pos : neg    =     13.9 : 1.0
                V_boring = True              neg : pos    =     13.1 : 1.0
                    V_90 = True              neg : pos    =     12.8 : 1.0
               V_routine = True              neg : pos    =     12.8 : 1.0
                  V_flat = True              neg : pos    =     12.4 : 1.0
                  V_warm = True              pos : neg    =     12.4 : 1.0
                  V_dull = True              neg : pos    =     12.1 : 1.0
             V_NOTenough = True              neg : pos  

In my random split, using the negation features did improve the classification.


Other features

There are other types of possible features.  For example, sometimes people use bigrams in addition to just words/unigrams or use the counts of POS tags, which we will look at next week.  Also, there are many other forms of negation features.

For some problems, the word features can be pruned with a stop word list, but care should be taken that the list doesn’t remove any negation or useful function words.  A very small stop word list is probably better than a large one.


In [23]:
### Bonus python text for the Question, define a stop word list ###

stopwords = nltk.corpus.stopwords.words('english')
print(len(stopwords))
print(stopwords)

# remove some negation words 
negationwords.extend(['ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn'])

newstopwords = [word for word in stopwords if word not in negationwords]
print(len(newstopwords))
print(newstopwords)

# remove stop words from the all words list
new_all_words_list = [word for (sent,cat) in documents for word in sent if word not in newstopwords]

# continue to define a new all words dictionary, get the 2000 most common as new_word_features
new_all_words = nltk.FreqDist(new_all_words_list)
new_word_items = new_all_words.most_common(2000)

new_word_features = [word for (word,count) in new_word_items]
print(new_word_features[:30])

# now re-run one of the feature set definitions with the new_word_features instead of word_features


179
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than

In [24]:
print(len(featuresets[0][0]))
print(len(SL_featuresets[0][0]))
print(len(NOT_featuresets[0][0]))

2000
2002
4001


In [25]:
# Define the feature sets for the documents with the new word features
featuresets = [(document_features(d,new_word_features), c) for (d,c) in documents]
print(len(featuresets))
train_set, test_set = featuresets[1000:], featuresets[:1000]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

# Now we create feature sets as before, but using this feature extraction function with the new word features
SL_featuresets = [(SL_features(d, SL,new_word_features), c) for (d,c) in documents]
print(len(SL_featuresets))
train_set, test_set = SL_featuresets[1000:], SL_featuresets[:1000]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

# Create feature sets as before, using the NOT_features extraction funtion, train the classifier and test the accuracy  with the new word features
NOT_featuresets = [(NOT_features(d, new_word_features, negationwords), c) for (d, c) in documents]
print(len(NOT_featuresets))
train_set, test_set = NOT_featuresets[1000:], NOT_featuresets[:1000]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

classifier.show_most_informative_features(30)


10662
0.755
10662
0.771
10662
0.796
Most Informative Features
            V_engrossing = True              pos : neg    =     20.6 : 1.0
             V_wonderful = True              pos : neg    =     16.6 : 1.0
               V_generic = True              neg : pos    =     16.1 : 1.0
             V_inventive = True              pos : neg    =     15.9 : 1.0
              V_mediocre = True              neg : pos    =     15.4 : 1.0
            V_refreshing = True              pos : neg    =     13.9 : 1.0
                V_boring = True              neg : pos    =     13.1 : 1.0
                    V_90 = True              neg : pos    =     12.8 : 1.0
               V_routine = True              neg : pos    =     12.8 : 1.0
                  V_flat = True              neg : pos    =     12.4 : 1.0
                  V_warm = True              pos : neg    =     12.4 : 1.0
                  V_dull = True              neg : pos    =     12.1 : 1.0
             V_NOTenough = True       