In [1]:
# Larger window and fontsize
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
display(HTML("<style>.rendered_html { font-size: 18px; }</style>"))

# Introduction

Sentiment Analysis refers to the use of text analysis and natural language processing to identify and extract subjective information in textual contents.

In this practice we will focus on the analysis of the sentiment of a collection of tweets, applying some of the ideas that we have explored in class


## Dataset

This corpus of tweets, developed by Stanford’s Natural Language processing research group
The training set is collected by querying Twitter API for happy emoticons like ":)" and sad emoticons like ":(" and labelling them positive or negative. The emoticons were then stripped and Re-Tweets and duplicates removed.

The data is a CSV with emoticons removed. Data file format has 6 fields:

    0 - the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive) **Note**: For the dataset there is only negative and positive tweets
    1 - the id of the tweet (2087)
    2 - the date of the tweet (Sat May 16 23:58:44 UTC 2009)
    3 - the query (lyx). If there is no query, then this value is NO_QUERY.
    4 - the user that tweeted (robotickilldozr)
    5 - the text of the tweet (Lyx is cool)

It also contains around 500 tweets manually collected and labelled for testing purposes.

We randomly sample and use 5000 tweets from this dataset.

The following code creates a folder to store the dataset and download it

In [None]:
! mkdir stanford_dataset
! wget --directory-prefix=stanford_dataset/ https://github.com/crwong/cs224u-project/raw/master/data/sentiment/training.1600000.processed.noemoticon.csv

### Load dataset

In [2]:
import os.path
import csv
import pandas as pd
import nltk

def loadDataset(in_file):
    my_path = os.getcwd()
    path = os.path.join(my_path, in_file)
    column_names = ['sentiment','ID', 'Date', 'Query', 'user_id', 'tweet']
    tweets = pd.read_csv(path, delimiter=',', quotechar='"', header= None, names= column_names, encoding="ISO-8859-1")

    print('Readed ', len(tweets), "tweets")
    return tweets

We load the data and check the number of positive and negative tweets

In [3]:
raw_training_data = loadDataset("datasets/stanford_dataset/training.1600000.processed.noemoticon.csv")
raw_training_data.groupby('sentiment')['ID'].nunique()

Readed  1600000 tweets


sentiment
0    800000
4    800000
Name: ID, dtype: int64

Dataset contains more than a million tweets, for our practice we will only use a sample of 5000 tweets

In [4]:
# Sample 5000 tweets from the dataset
training_data = raw_training_data.sample(n=5000)
training_data.head()

Unnamed: 0,sentiment,ID,Date,Query,user_id,tweet
418520,0,2061788339,Sat Jun 06 21:37:01 PDT 2009,NO_QUERY,Halfbeanchacha,@crazyycamille my dads being a jerk and won't ...
447784,0,2068796410,Sun Jun 07 14:42:55 PDT 2009,NO_QUERY,VoniaPerna,"@ mmm that does sound good, but i'm at work H..."
326165,0,2008025357,Tue Jun 02 13:30:34 PDT 2009,NO_QUERY,audy86,Oooo god it's humid ! I hate humidity!
481308,0,2179593331,Mon Jun 15 09:19:00 PDT 2009,NO_QUERY,sexyexecutive,Low toner error at 5:19pm Bloody Nigel's alre...
271561,0,1990040643,Mon Jun 01 03:40:53 PDT 2009,NO_QUERY,Stephany13329,long week ahead... actually long summer ahead....


Let's check that the distribution of positive and negative tweets remains

In [5]:
training_data.groupby('sentiment')['ID'].nunique()

sentiment
0    2516
4    2484
Name: ID, dtype: int64

To facilitate the interpretation of the results we are going to recode the target variable

In [6]:
def recode_sentiment(series):
    if series == 4:
        return 'positive'
    else:
        return 'negative'
    
training_data['sentiment'] = training_data['sentiment'].apply(recode_sentiment)
training_data.head()

Unnamed: 0,sentiment,ID,Date,Query,user_id,tweet
418520,negative,2061788339,Sat Jun 06 21:37:01 PDT 2009,NO_QUERY,Halfbeanchacha,@crazyycamille my dads being a jerk and won't ...
447784,negative,2068796410,Sun Jun 07 14:42:55 PDT 2009,NO_QUERY,VoniaPerna,"@ mmm that does sound good, but i'm at work H..."
326165,negative,2008025357,Tue Jun 02 13:30:34 PDT 2009,NO_QUERY,audy86,Oooo god it's humid ! I hate humidity!
481308,negative,2179593331,Mon Jun 15 09:19:00 PDT 2009,NO_QUERY,sexyexecutive,Low toner error at 5:19pm Bloody Nigel's alre...
271561,negative,1990040643,Mon Jun 01 03:40:53 PDT 2009,NO_QUERY,Stephany13329,long week ahead... actually long summer ahead....


## Tweet Preprocessing

At this step, we will preprocess the text in the tweets, tokenize and stem it. We will have to take care of specific markups (e.g., hashtags) related to Twitter, as well as of other aspects related to the sentiment analysis, like, for instance, emoticons.

In the following, I give you an example of processing. I will use regular expressions to detect hashtags and change the detected hashtag by an indicator of the same.

### Hashtags

A hashtag is a word or an un-spaced phrase prefixed with the hash symbol (#). These are used to both naming subjects and phrases that are currently in trending topics. For example, #iPad, #news

    Regular Expression: #(\w+)

    Replace Expression: HASH_hashtag


In [7]:
import re
hash_regex = re.compile(r"#(\w+)")

def hash_repl(match):
    """
    Detect hashtags and create a new feature: _HASH_+text of the hashtag
    """
    return '__HASH_'+match.group(1).upper()

To use this function, we will make use of the `re.sub` function. This function takes a regular expression (`hash_regex` in our case) and a replacing function (our `hash_repl` function) to change every appearance of the regular expression to the output of the replacing function.

In [8]:
# Test the created function
re.sub( hash_regex, hash_repl, 'happy midsummer everyone! My little brother has a bd today and here are few relatives having a dinner.. not so sure is it very nice and #hashtag')

'happy midsummer everyone! My little brother has a bd today and here are few relatives having a dinner.. not so sure is it very nice and __HASH_HASHTAG'

# Exercise 1: More pre-processing

Following the previous example, create more regex and functions to detect some other twitter-related aspects (e.g., user names, URLs, emoticons, punctuations, repetitions, stemming, ...)

You may find interesting ideas in this regard in the following links:
 - Christopher Potts sentiment tokenizer: http://sentiment.christopherpotts.net/code-data/happyfuntokenizing.py
 - Brendan O’Connor twitter tokenizer: https://github.com/brendano/tweetmotif

In [9]:
# Your code here

In order to facilitate the application of these pre-processing steps, we will create a function to enclose all of them.

In [10]:
# Wrapper function that encloses all the processing procedures
def processAll(text):
    
    text = re.sub( hash_regex, hash_repl, text )
    
    # All your pre-processing steps here
    
    if not isinstance(text,list):
        text = text.split() # To avoid format errors
    return text

We create a new column in our dataframe with the processed text by applying our `processAll` function to all the text column.

In [11]:
training_data['processed_tweet'] = training_data.tweet.apply(processAll)
training_data.head()

Unnamed: 0,sentiment,ID,Date,Query,user_id,tweet,processed_tweet
418520,negative,2061788339,Sat Jun 06 21:37:01 PDT 2009,NO_QUERY,Halfbeanchacha,@crazyycamille my dads being a jerk and won't ...,"[@crazyycamille, my, dads, being, a, jerk, and..."
447784,negative,2068796410,Sun Jun 07 14:42:55 PDT 2009,NO_QUERY,VoniaPerna,"@ mmm that does sound good, but i'm at work H...","[@, mmm, that, does, sound, good,, but, i'm, a..."
326165,negative,2008025357,Tue Jun 02 13:30:34 PDT 2009,NO_QUERY,audy86,Oooo god it's humid ! I hate humidity!,"[Oooo, god, it's, humid, !, I, hate, humidity!]"
481308,negative,2179593331,Mon Jun 15 09:19:00 PDT 2009,NO_QUERY,sexyexecutive,Low toner error at 5:19pm Bloody Nigel's alre...,"[Low, toner, error, at, 5:19pm, Bloody, Nigel'..."
271561,negative,1990040643,Mon Jun 01 03:40:53 PDT 2009,NO_QUERY,Stephany13329,long week ahead... actually long summer ahead....,"[long, week, ahead..., actually, long, summer,..."


# Feature Creation

A wide variety of features can be used to build a classifier for tweets. The most widely used and basic feature set is word n-grams. However, there's a lot of domain specific information present in tweets that can also be used for classifying them.

## Unigrams

Unigrams are the simplest features that can be used for text classification. A Tweet can be represented by a multiset of words present in it. We, however, have used the presence of unigrams in a tweet as a feature set. Presence of a word is more important than how many times it is repeated

In [12]:
text = ["Example", "of", "tweet", "represented", "as", "unigrams"]

unigrams_fd = nltk.FreqDist()
unigrams_fd.update(text)
unigrams_fd

FreqDist({'Example': 1, 'of': 1, 'tweet': 1, 'represented': 1, 'as': 1, 'unigrams': 1})

## N-grams

N-gram refers to an n-long sequence of words. Probabilistic Language Models based on Unigrams, Bigrams and Trigrams can be successfully used to predict the next word given a current context of words. In the domain of sentiment analysis, the performance of N-grams is unclear.

As the order of the n-grams increases, they tend to be more and more sparse. Let's then try bi-gram and tri-grams

In [13]:
# Bigrams
words_bi  = [ ','.join(map(str,bg)) for bg in nltk.bigrams(text) ]
bi_grams_fd = nltk.FreqDist()
bi_grams_fd.update( words_bi )
bi_grams_fd

FreqDist({'Example,of': 1, 'of,tweet': 1, 'tweet,represented': 1, 'represented,as': 1, 'as,unigrams': 1})

In [14]:
# Trigrams
words_tri  = [ ','.join(map(str,tg)) for tg in nltk.trigrams(text) ]
tri_grams_fd = nltk.FreqDist()
tri_grams_fd.update( words_tri )
tri_grams_fd

FreqDist({'Example,of,tweet': 1, 'of,tweet,represented': 1, 'tweet,represented,as': 1, 'represented,as,unigrams': 1})

We compute the bigrams and trigrams models for the processed text in the whole dataset

In [15]:
# Wrapper function that encloses all the n-grams procedures

def get_word_features(words):
    bag = {}
    words_uni = [ 'has(%s)'% ug for ug in words ]
    words_bi  = [ 'has(%s)'% ','.join(map(str,bg)) for bg in nltk.bigrams(words) ]
    words_tri = [ 'has(%s)'% ','.join(map(str,tg)) for tg in nltk.trigrams(words) ]
    
    for f in words_uni+words_bi+words_tri:
        bag[f] = 1

    return bag


## Negations

The need negation detection in sentiment analysis can be illustrated by the difference in the meaning of the phrases, "This is good" vs. "This is not good" However, the negations occurring in natural language are seldom so simple. Handling the negation consists of two tasks – Detection of explicit negation cues and the scope of negation of these words.

**Scope of Negation**

Words immediately preceding and following the negation cues are the most negative and the words that come farther away do not lie in the scope of negation of such cues. We define left and right negativity of a word as the chances that meaning of that word is actually the opposite. Left negativity depends on the closest negation cue on the left and similarly for Right negativity.

In [16]:
negtn_regex = re.compile( r"""(?:
    ^(?:never|no|nothing|nowhere|noone|none|not|
        havent|hasnt|hadnt|cant|couldnt|shouldnt|
        wont|wouldnt|dont|doesnt|didnt|isnt|arent|aint
    )$
)
|
n't
""", re.X)

def get_negation_features(words):
    INF = 0.0
    negtn = [ bool(negtn_regex.search(w)) for w in words ]

    left = [0.0] * len(words)
    prev = 0.0
    for i in range(0,len(words)):
        if( negtn[i] ):
            prev = 1.0
        left[i] = prev
        prev = max( 0.0, prev-0.1)

    right = [0.0] * len(words)
    prev = 0.0
    for i in reversed(range(0,len(words))):
        if( negtn[i] ):
            prev = 1.0
        right[i] = prev
        prev = max( 0.0, prev-0.1)

    return dict( zip(
                    ['neg_l('+w+')' for w in  words] + ['neg_r('+w+')' for w in  words],
                    left + right ) )

In [17]:
# Test
text = ["This","text", "does", "not", "have", "a", "negation"]
get_negation_features(text)

{'neg_l(This)': 0.0,
 'neg_l(text)': 0.0,
 'neg_l(does)': 0.0,
 'neg_l(not)': 1.0,
 'neg_l(have)': 0.9,
 'neg_l(a)': 0.8,
 'neg_l(negation)': 0.7000000000000001,
 'neg_r(This)': 0.7000000000000001,
 'neg_r(text)': 0.8,
 'neg_r(does)': 0.9,
 'neg_r(not)': 1.0,
 'neg_r(have)': 0.0,
 'neg_r(a)': 0.0,
 'neg_r(negation)': 0.0}

## POS Tagging

With POS Tagging we can get the category of each word. Some of these categories are more interesting in order to infer the sentiment of given tweet. For example, adjectives are expected to carry most sentiment information than adverbs. In a similar way, some particular names can carry a positive or negative implication for particular domains.


In [18]:
def get_pos_features(words):
    tags = {}
    tagged_words = [ 'has(%s)'% w+'_'+tag for w,tag in nltk.pos_tag(words) if len(words) > 0]
    
    for tw in tagged_words:
        tags[tw] = 1

    return tags

As in the previous step, let's create an function to apply all these creation steps.

In [19]:
# Wrapper function for the extraction of features
def extract_features(text):
    features = {}
    
    words = processAll(text)
    
    word_features = get_word_features(words)
    features.update( word_features )

    negation_features = get_negation_features(words)
    features.update( negation_features )
    
    pos_features = get_pos_features(words)
    features.update( pos_features )

    return features

In [20]:
training_data['processed_tweet_features'] = training_data.tweet.apply(extract_features)
training_data[['tweet','processed_tweet_features']].head()

Unnamed: 0,tweet,processed_tweet_features
418520,@crazyycamille my dads being a jerk and won't ...,"{'has(@crazyycamille)': 1, 'has(my)': 1, 'has(..."
447784,"@ mmm that does sound good, but i'm at work H...","{'has(@)': 1, 'has(mmm)': 1, 'has(that)': 1, '..."
326165,Oooo god it's humid ! I hate humidity!,"{'has(Oooo)': 1, 'has(god)': 1, 'has(it's)': 1..."
481308,Low toner error at 5:19pm Bloody Nigel's alre...,"{'has(Low)': 1, 'has(toner)': 1, 'has(error)':..."
271561,long week ahead... actually long summer ahead....,"{'has(long)': 1, 'has(week)': 1, 'has(ahead......"


# Classification

Let's now use the processed tweet features to create a classification model.

## Training-test Splitting

To evaluate our approaches, we are going to split our data into train and validation. We will use the train to create the models and the validation to validate their performance. Once we have selected the best model (according to the accuracy on the validation set) we can use this model to predict our test set.

In this way, test set will remain as unseen data for all the process: we are not going to make any decision based on the test error. Therefore, we can assume that the results on the test set will be the same that we will obtain when new unseen data appears in the future

In [21]:
training_size = 4000
train_tweets = [(tweet, sentiment) for tweet, sentiment in training_data[['tweet', 'sentiment']].values[:training_size]]
validation_tweets  = [(tweet, sentiment) for tweet, sentiment in training_data[['tweet', 'sentiment']].values[training_size:]]

## Preparing the data for the classifier

We have previously defined a feature extraction process, which we have wrapped into the `extract_features` function.

By making use of the `nltk.classify.apply_features` function provided by NLTK, we will process the tweets and create the features that will be used for the classifier 

In [22]:
# Apply the data processing and cleaning extraction methodologies
v_train = nltk.classify.apply_features(extract_features,train_tweets)
v_validation  = nltk.classify.apply_features(extract_features,validation_tweets)

Let's see the resultant object

In [23]:
print("For the tweet = ", training_data.tweet.values[0])
print(" ")
print("The following features has been created:")
print(" ")
print(v_train[0][0])

For the tweet =  @crazyycamille my dads being a jerk and won't buy me snowballs. That made me miss you and your snowballs. 
 
The following features has been created:
 
{'has(@crazyycamille)': 1, 'has(my)': 1, 'has(dads)': 1, 'has(being)': 1, 'has(a)': 1, 'has(jerk)': 1, 'has(and)': 1, "has(won't)": 1, 'has(buy)': 1, 'has(me)': 1, 'has(snowballs.)': 1, 'has(That)': 1, 'has(made)': 1, 'has(miss)': 1, 'has(you)': 1, 'has(your)': 1, 'has(@crazyycamille,my)': 1, 'has(my,dads)': 1, 'has(dads,being)': 1, 'has(being,a)': 1, 'has(a,jerk)': 1, 'has(jerk,and)': 1, "has(and,won't)": 1, "has(won't,buy)": 1, 'has(buy,me)': 1, 'has(me,snowballs.)': 1, 'has(snowballs.,That)': 1, 'has(That,made)': 1, 'has(made,me)': 1, 'has(me,miss)': 1, 'has(miss,you)': 1, 'has(you,and)': 1, 'has(and,your)': 1, 'has(your,snowballs.)': 1, 'has(@crazyycamille,my,dads)': 1, 'has(my,dads,being)': 1, 'has(dads,being,a)': 1, 'has(being,a,jerk)': 1, 'has(a,jerk,and)': 1, "has(jerk,and,won't)": 1, "has(and,won't,buy)": 1, "h

### Naive Bayes
We will start with a simple Naïve Bayes Classifier. For a given tweet, if we need to find the label for it, we find the probabilities of all the labels, given that feature and then select the label with maximum probability.

NLTK has its own implementation of Naive Bayes `nltk.classify.NaiveBayesClassifier`. If you prefer, you can use the Naive Bayes implementation in `sklearn`

In [24]:
nb_classifier = nltk.classify.NaiveBayesClassifier
nb_class = nb_classifier.train(v_train)

#### Evaluation

Let's evaluate the accuracy of our model in our validation data

In [25]:
print("Accuracy of the model = ", nltk.classify.accuracy(nb_class, v_validation))

Accuracy of the model =  0.669


71.4 % of accuracy seems pretty good for the task.

We can have a more detailed idea of the performance by taking a look to the confusion matrix

In [26]:
# build confusion matrix over validation set
test_truth   = [s for (t,s) in v_validation]
test_predict = [nb_class.classify(t) for (t,s) in v_validation]

print('Confusion Matrix')
print()
print(nltk.ConfusionMatrix( test_truth, test_predict ))

Confusion Matrix

         |   n   p |
         |   e   o |
         |   g   s |
         |   a   i |
         |   t   t |
         |   i   i |
         |   v   v |
         |   e   e |
---------+---------+
negative |<365>112 |
positive | 219<304>|
---------+---------+
(row = reference; col = test)



#### Most Representative Features

The NLTK classifier object allows us to see the most representative features

In [27]:
nb_class.show_most_informative_features(25)

Most Informative Features
             has(sad)_JJ = 1              negati : positi =     24.0 : 1.0
             neg_r(hate) = 0.0            negati : positi =     14.4 : 1.0
                has(sad) = 1              negati : positi =     14.4 : 1.0
             neg_l(hate) = 0.0            negati : positi =     14.0 : 1.0
              neg_l(sad) = 0.0            negati : positi =     13.1 : 1.0
              neg_r(sad) = 0.0            negati : positi =     12.5 : 1.0
             has(I,hate) = 1              negati : positi =     12.5 : 1.0
             has(wish,I) = 1              negati : positi =     11.9 : 1.0
             neg_r(away) = 0.0            negati : positi =     11.9 : 1.0
            has(wish)_JJ = 1              negati : positi =     11.9 : 1.0
               has(away) = 1              negati : positi =     11.9 : 1.0
             neg_l(away) = 0.0            negati : positi =     11.2 : 1.0
               has(lost) = 1              negati : positi =     11.2 : 1.0

### Baseline

We have applied a thorough process to create features for our tweets. However, is it justified? Have we actually created a better representation of our data? To know that, we are going to create a baseline model that uses only the text in the tweets (with no features added).

To that end we define a new extraction function that only extract the terms from the tweets

In [28]:
baseline_train_tweets = [(tweet.split(" "), sentiment) for tweet, sentiment in training_data[['tweet', 'sentiment']].values[:training_size]]
baseline_validation_tweets  = [(tweet.split(" "), sentiment) for tweet, sentiment in training_data[['tweet', 'sentiment']].values[training_size:]]

# Wrapper function for the extraction of features
def extract_baseline_features(words):
    
    bag = {}
    words_uni = [ 'has(%s)'% ug for ug in words ]
    
    for f in words_uni:
        bag[f] = 1

    return bag

v_baseline_train = nltk.classify.apply_features(extract_baseline_features, baseline_train_tweets)
v_baseline_validation = nltk.classify.apply_features(extract_baseline_features, baseline_validation_tweets)

We fit a new naive based classifier over this baseline representation and evaluate it

In [29]:
baseline_nb_classifier = nltk.classify.NaiveBayesClassifier
baseline_nb_class = nb_classifier.train(v_baseline_train)

In [30]:
print("Accuracy of the baseline model = ", nltk.classify.accuracy(baseline_nb_class, v_baseline_validation))

Accuracy of the baseline model =  0.649


In [31]:
# build confusion matrix over validation set
test_truth   = [s for (t,s) in v_baseline_validation]
test_predict = [nb_class.classify(t) for (t,s) in v_baseline_validation]

print('Confusion Matrix')
print()
print(nltk.ConfusionMatrix( test_truth, test_predict ))

Confusion Matrix

         |   n   p |
         |   e   o |
         |   g   s |
         |   a   i |
         |   t   t |
         |   i   i |
         |   v   v |
         |   e   e |
---------+---------+
negative |<368>109 |
positive | 243<280>|
---------+---------+
(row = reference; col = test)



As we can see, performance is significantly lower than that of the model using all the features we have created.

In [32]:
# Most Representative Features
baseline_nb_class.show_most_informative_features(25)

Most Informative Features
                has(sad) = 1              negati : positi =     14.4 : 1.0
               has(away) = 1              negati : positi =     11.9 : 1.0
               has(lost) = 1              negati : positi =     11.2 : 1.0
               has(hate) = 1              negati : positi =     11.1 : 1.0
                has(New) = 1              positi : negati =      8.7 : 1.0
          has(followers) = 1              positi : negati =      8.7 : 1.0
              has(sucks) = 1              negati : positi =      8.0 : 1.0
             has(You're) = 1              positi : negati =      8.0 : 1.0
            has(finally) = 1              positi : negati =      8.0 : 1.0
               has(able) = 1              negati : positi =      7.4 : 1.0
               has(wish) = 1              negati : positi =      7.3 : 1.0
               has(wont) = 1              negati : positi =      6.7 : 1.0
               has(open) = 1              negati : positi =      6.7 : 1.0

## Exercise 2: MaxEnt Classifier

Let's try a more sophisticated classifier to see if we can boost the classification performance. In particular we will apply a Maximum Entropy Classifier. This classifier works by finding a probability distribution that maximizes the likelihood of testable data.

To create a MaxEnt model, make use of the `nltk.classify.MaxentClassifier` function and follow the Naive Bayes example.

# SentiWordNet

In the theoretical session we presented some sentiment resources that could be used to enrich our dataset with external information.

In particular, SentiWordNet provides a sentiment annotation for the WordNet synsets. We can add this sentiment annotation as new features to our dataset. 

In the following, we define a fuction that based on the words in the tweets and their POS tagging, find the sentiment annotation for the word_POS_TAG in SentiWordNet. We then add these values as new features in our dataset and use them to train a new MaxEnt Classifier

In [8]:
# Download the Wordnet Corpus
nltk.download('wordnet')

# Download the Senti Wordnet Corpus
nltk.download('sentiwordnet')

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn
from nltk.corpus import sentiwordnet as swn
from nltk import sent_tokenize, word_tokenize, pos_tag
 
lemmatizer = WordNetLemmatizer()
 
def penn_to_wn(tag):
    """
    Convert between the PennTreebank tags to simple Wordnet tags
    """
    if tag.startswith('J'):
        return wn.ADJ
    elif tag.startswith('N'):
        return wn.NOUN
    elif tag.startswith('R'):
        return wn.ADV
    elif tag.startswith('V'):
        return wn.VERB
    return None
 

def swn_polarity(text):
    sentiment = 0.0
    tokens_count = 0
  
    tagged_sentence = pos_tag(word_tokenize(text))
    sentiment = {}
    for word, tag in tagged_sentence:
        
        wn_tag = penn_to_wn(tag)
        if wn_tag not in (wn.NOUN, wn.ADJ, wn.ADV):
            sentiment["sent("+word+")"] = 0.0
            continue
        
        lemma = lemmatizer.lemmatize(word, pos=wn_tag)
        if not lemma:
            sentiment["sent("+word+")"] = 0.0
            continue

        synsets = wn.synsets(lemma, pos=wn_tag)
        if not synsets:
            sentiment["sent("+word+")"] = 0.0
            continue

        # Take the first sense, the most common
        synset = synsets[0]
        swn_synset = swn.senti_synset(synset.name())

        sentiment["sent("+word+")"] = swn_synset.pos_score() - swn_synset.neg_score()
        
    return sentiment

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\madcastea\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package sentiwordnet to
[nltk_data]     C:\Users\madcastea\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\sentiwordnet.zip.


In [9]:
text = "This is a text with good and very good words and bad and stupid words"
swn_polarity(text)

{'sent(This)': 0.0,
 'sent(is)': 0.0,
 'sent(a)': 0.0,
 'sent(text)': 0.0,
 'sent(with)': 0.0,
 'sent(good)': 0.75,
 'sent(and)': 0.0,
 'sent(very)': 0.0,
 'sent(words)': 0.0,
 'sent(bad)': -0.625,
 'sent(stupid)': -0.75}

This annotation provides a sentiment score (based on the SentiWordNet sentiment score) for each term in the tweets (-1 negative, 1 positive, 0 neutral)

In [10]:
# Wrapper function for the extraction of features + sentiment features
def extract_features_with_sentiment(text):
    features = {}
    
    words = processAll(text)
    
    sentiment_features = swn_polarity(text)
    features.update(sentiment_features)
    
    word_features = get_word_features(words)
    features.update( word_features )

    negation_features = get_negation_features(words)
    features.update( negation_features )

    return features

# Exercise 3: Enhanced classifier

We are going to test if the sentiment lexicon improves our MaxEnt classifier. To that end you have to make use of the `extract_features_with_sentiment` function to create the features (by using the `nltk.classify.apply_features` function) to feed the classifier. **(take a look to the Naive Bayes example)**.