In [1]:
# Larger window and fontsize
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
display(HTML("<style>.rendered_html { font-size: 18px; }</style>"))

# Introduction

Sentiment Analysis refers to the use of text analysis and natural language processing to identify and extract subjective information in textual contents.

In this practice we will focus on the analysis of the sentiment of a collection of tweets, applying some of the ideas that we have explored in class


## Dataset

This corpus of tweets, developed by Sanford’s Natural Language processing research group.

The training set is collected by querying Twitter API for happy emoticons like ":)" and sad emoticons like ":(" and labelling them positive or negative. The emoticons were then stripped and Re-Tweets and duplicates removed.

The data is a CSV with emoticons removed. Data file format has 6 fields:

    0 - the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive) **Note**: For the dataset there is only negative and positive tweets
    1 - the id of the tweet (2087)
    2 - the date of the tweet (Sat May 16 23:58:44 UTC 2009)
    3 - the query (lyx). If there is no query, then this value is NO_QUERY.
    4 - the user that tweeted (robotickilldozr)
    5 - the text of the tweet (Lyx is cool)

It also contains around 500 tweets manually collected and labelled for testing purposes.

We randomly sample and use 5000 tweets from this dataset.

### Load dataset

In [3]:
import os.path
import csv
import pandas as pd
import nltk

def loadDataset(in_file):
    my_path = os.getcwd()
    path = os.path.join(my_path, in_file)
    column_names = ['sentiment','ID', 'Date', 'Query', 'user_id', 'tweet']
    tweets = pd.read_csv(path, delimiter=',', quotechar='"', header= None, names= column_names, encoding="ISO-8859-1")

    print('Readed ', len(tweets), "tweets")
    return tweets

We load the data and check the number of positive and negative tweets

In [4]:
raw_training_data = loadDataset("datasets/stanford_dataset/training.1600000.processed.noemoticon.csv")
raw_training_data.groupby('sentiment')['ID'].nunique()

Readed  1600000 tweets


sentiment
0    800000
4    800000
Name: ID, dtype: int64

Dataset contains more than a million tweets, for our practice we will only use a sample of 5000 tweets

In [5]:
# Sample 5000 tweets from the dataset
training_data = raw_training_data.sample(n=5000)
training_data.head()

Unnamed: 0,sentiment,ID,Date,Query,user_id,tweet
939035,4,1793592465,Thu May 14 03:16:59 PDT 2009,NO_QUERY,LilyFabia,day off
167371,0,1961652023,Fri May 29 09:50:07 PDT 2009,NO_QUERY,joshuamilane,@indysawhney seems like u r just pushing your ...
461470,0,2174457913,Sun Jun 14 22:35:37 PDT 2009,NO_QUERY,LoclBandsRAwsom,Air fresheners that are supposed to smell like...
1190203,4,1983675734,Sun May 31 13:36:54 PDT 2009,NO_QUERY,Andematros,"@lenanj http://twitpic.com/64re1 - Lena, the p..."
1476530,4,2066117107,Sun Jun 07 09:55:16 PDT 2009,NO_QUERY,Ravenpeach,@electricrocks AMEN AMEN AMEN!


Let's check that the distribution of positive and negative tweets remains

In [6]:
training_data.groupby('sentiment')['ID'].nunique()

sentiment
0    2512
4    2488
Name: ID, dtype: int64

To facilitate the interpretation of the results we are going to recode the target variable

In [7]:
def recode_sentiment(series):
    if series == 4:
        return 'positive'
    else:
        return 'negative'
    
training_data['sentiment'] = training_data['sentiment'].apply(recode_sentiment)
training_data.head()

Unnamed: 0,sentiment,ID,Date,Query,user_id,tweet
939035,positive,1793592465,Thu May 14 03:16:59 PDT 2009,NO_QUERY,LilyFabia,day off
167371,negative,1961652023,Fri May 29 09:50:07 PDT 2009,NO_QUERY,joshuamilane,@indysawhney seems like u r just pushing your ...
461470,negative,2174457913,Sun Jun 14 22:35:37 PDT 2009,NO_QUERY,LoclBandsRAwsom,Air fresheners that are supposed to smell like...
1190203,positive,1983675734,Sun May 31 13:36:54 PDT 2009,NO_QUERY,Andematros,"@lenanj http://twitpic.com/64re1 - Lena, the p..."
1476530,positive,2066117107,Sun Jun 07 09:55:16 PDT 2009,NO_QUERY,Ravenpeach,@electricrocks AMEN AMEN AMEN!


## Tweet Preprocessing

At this step, we will preprocess the text in the tweets, tokenize and stem it. We will have to take care of specific markups (e.g., hashtags) related to Twitter, as well as of aspects related to the sentiment analysis, like, for instance, emoticons.

You may find interesting ideas in this regard in the following links:
 - Christopher Potts sentiment tokenizer: http://sentiment.christopherpotts.net/code-data/happyfuntokenizing.py
 - Brendan O’Connor twitter tokenizer: https://github.com/brendano/tweetmotif

### Hashtags

A hashtag is a word or an un-spaced phrase prefixed with the hash symbol (#). These are used to both naming subjects and phrases that are currently in trending topics. For example, #iPad, #news

    Regular Expression: #(\w+)

    Replace Expression: HASH_hashtag


In [8]:
import re
hash_regex = re.compile(r"#(\w+)")
def hash_repl(match):
	return '__HASH_'+match.group(1).upper()

In [9]:
# Test
re.sub( hash_regex, hash_repl, 'happy midsummer everyone! My little brother has a bd today and here are few relatives having a dinner.. not so sure is it very nice and #hashtag')

'happy midsummer everyone! My little brother has a bd today and here are few relatives having a dinner.. not so sure is it very nice and __HASH_HASHTAG'

### User Names
Every Twitter user has a unique username. Any thing directed towards that user can be indicated be writing their username preceded by ‘@’. Thus, these are like proper nouns. For example, @Apple

    Regular Expression: @(\w+)

    Replace Expression: user_username

In [10]:
user_regex = re.compile(r"@(\w+)")
def user_repl(match):
	return '__user_'+match.group(1).upper()

In [11]:
# Test
re.sub( user_regex, user_repl, 'This is a @username')

'This is a __user_USERNAME'

### URLs
Users often share hyperlinks in their tweets. Twitter shortens them using its in-house URL shortening service, like http://t.co/FCWXoUd8 - such links also enables Twitter to alert users if the link leads out of its domain. From the point of view of text classification, a particular URL is not important. However, presence of a URL can be an important feature. Regular expression for detecting a URL is fairly complex because of different types of URLs that can be there, but because of Twitter’s shortening service, we can use a relatively simple regular expression.

    Regular Expression: (http|https|ftp)://[a-zA-Z0-9\\./]+

    Replace Expression: URL

In [12]:
url_regex = re.compile(r"(http|https|ftp)://[a-zA-Z0-9\./]+")
def url_repl(match):
	return '__URL_'

In [13]:
# Test
re.sub( url_regex, url_repl, 'This is a http://url.es')

'This is a __URL_'

### Emoticons

Use of emoticons is very prevalent throughout the web, more so on micro-blogging sites. We identify the following emoticons and replace them with a single word.

In [14]:
# Emoticons
emoticons = \
	[	('__EMOT_SMILEY',	[':-)', ':)', '(:', '(-:', ] )	,\
		('__EMOT_LAUGH',		[':-D', ':D', 'X-D', 'XD', 'xD', ] )	,\
		('__EMOT_LOVE',		['<3', ':\*', ] )	,\
		('__EMOT_WINK',		[';-)', ';)', ';-D', ';D', '(;', '(-;', ] )	,\
		('__EMOT_FROWN',		[':-(', ':(', '(:', '(-:', ] )	,\
		('__EMOT_CRY',		[':,(', ':\'(', ':"(', ':(('] )	,\
	]
    
def escape_paren(arr):
	return [text.replace(')', '[)}\]]').replace('(', '[({\[]') for text in arr]

def regex_union(arr):
	return '(' + '|'.join( arr ) + ')'

emoticons_regex = [ (repl, re.compile(regex_union(escape_paren(regx))) ) for (repl, regx) in emoticons ]

In [15]:
# Test
text = "This is a text with one emoticon :) and another :("
for (repl, regx) in emoticons_regex :
    text = re.sub(regx, ' '+repl+' ', text)
    
print(text)

This is a text with one emoticon  __EMOT_SMILEY  and another  __EMOT_FROWN 


### Punctuations

Although not all Punctuations are important from the point of view of classification but some of these, like question mark, exclamation mark can also provide information about the sentiments of the text. We replace every word boundary by a list of relevant punctuations present at that point. 

In [16]:
# Spliting by word boundaries
word_bound_regex = re.compile(r"\W+")

# Punctuations
punctuations = \
	[	#('',		['.', ] )	,\
		#('',		[',', ] )	,\
		#('',		['\'', '\"', ] )	,\
		('__PUNC_EXCL',		['!', '¡', ] )	,\
		('__PUNC_QUES',		['?', '¿', ] )	,\
		('__PUNC_ELLP',		['...', '…', ] )	,\
	]

#For punctuation replacement
def punctuations_repl(match):
	text = match.group(0)
	repl = []
	for (key, parr) in punctuations :
		for punc in parr :
			if punc in text:
				repl.append(key)
	if( len(repl)>0 ) :
		return ' '+' '.join(repl)+' '
	else :
		return ' '

In [17]:
# Test
re.sub( word_bound_regex , punctuations_repl, "This a text with an exclamation!!")

'This a text with an exclamation __PUNC_EXCL '

### Repetitions
People often use repeating characters while using colloquial language, like "I’m in a hurrryyyyy", "We won, yaaayyyyy!" As our final pre-processing step, we replace characters repeating more than twice as two characters.

    Regular Expression: (.)\1{1,}

    Replace Expression: \1\1

In [18]:
# Repeating words like hurrrryyyyyy
rpt_regex = re.compile(r"(.)\1{1,}", re.IGNORECASE);
def rpt_repl(match):
	return match.group(1)+match.group(1)

In [19]:
# Test
re.sub( rpt_regex, rpt_repl, "Reppppeated characters in wordsssssssss" )

'Reppeated characters in wordss'

### Stemming

We will now stemmize the words in the tweets by applying the Porter Stemmer seen in class. This stemmer was very widely used and became and remains the de facto standard algorithm used for English stemming. It offers excellent trade-off between speed, readability, and accuracy.

NLTK has its own implementation of the stemmer

In [20]:
stemmer = nltk.stem.PorterStemmer()

In [21]:
# Test
text = "Textual representation containing words to apply the porter stemmer"
text = [word if(word[0:2]=='__') else word.lower() for word in text.split() if len(word) >= 3]
text = [stemmer.stem(w) for w in text]                
text = " ".join(text)
text

'textual represent contain word appli the porter stemmer'

In [22]:
# Wrapper function that encloses all the processing procedures
def processAll(text):
    
    text = re.sub( hash_regex, hash_repl, text )
    text = re.sub( user_regex, user_repl, text)
    text = re.sub( url_regex, ' __URL ', text )
    
    for (repl, regx) in emoticons_regex :
        text = re.sub(regx, ' '+repl+' ', text)
    
    text = text.replace('\'','')
    
    text = re.sub( word_bound_regex , punctuations_repl, text )
    text = re.sub( rpt_regex, rpt_repl, text )
    
        
    text = [word if(word[0:2]=='__') else word.lower() for word in text.split() if len(word) >= 3]
    text = [stemmer.stem(w) for w in text]                
    
    return text

We create a new column in our dataframe with the processed text

In [23]:
training_data['processed_tweet'] = training_data.tweet.apply(processAll)
training_data.head()

Unnamed: 0,sentiment,ID,Date,Query,user_id,tweet,processed_tweet
939035,positive,1793592465,Thu May 14 03:16:59 PDT 2009,NO_QUERY,LilyFabia,day off,"[day, off]"
167371,negative,1961652023,Fri May 29 09:50:07 PDT 2009,NO_QUERY,joshuamilane,@indysawhney seems like u r just pushing your ...,"[__user_indysawhney, seem, like, just, push, y..."
461470,negative,2174457913,Sun Jun 14 22:35:37 PDT 2009,NO_QUERY,LoclBandsRAwsom,Air fresheners that are supposed to smell like...,"[air, freshen, that, are, suppos, smell, like,..."
1190203,positive,1983675734,Sun May 31 13:36:54 PDT 2009,NO_QUERY,Andematros,"@lenanj http://twitpic.com/64re1 - Lena, the p...","[__user_lenanj, __url, lena, the, pictur, you,..."
1476530,positive,2066117107,Sun Jun 07 09:55:16 PDT 2009,NO_QUERY,Ravenpeach,@electricrocks AMEN AMEN AMEN!,"[__user_electricrock, amen, amen, amen, __punc..."


# Feature Creation

A wide variety of features can be used to build a classifier for tweets. The most widely used and basic feature set is word n-grams. However, there's a lot of domain specific information present in tweets that can also be used for classifying them.

## Unigrams

Unigrams are the simplest features that can be used for text classification. A Tweet can be represented by a multiset of words present in it. We, however, have used the presence of unigrams in a tweet as a feature set. Presence of a word is more important than how many times it is repeated

In [24]:
text = ["Example", "of", "tweet", "represented", "as", "unigrams"]

unigrams_fd = nltk.FreqDist()
unigrams_fd.update(text)
unigrams_fd

FreqDist({'Example': 1, 'of': 1, 'tweet': 1, 'represented': 1, 'as': 1, 'unigrams': 1})

# N-grams

N-gram refers to an n-long sequence of words. Probabilistic Language Models based on Unigrams, Bigrams and Trigrams can be successfully used to predict the next word given a current context of words. In the domain of sentiment analysis, the performance of N-grams is unclear.

As the order of the n-grams increases, they tend to be more and more sparse. Let's then try bi-gram and tri-grams

In [25]:
# Bigrams
words_bi  = [ ','.join(map(str,bg)) for bg in nltk.bigrams(text) ]
bi_grams_fd = nltk.FreqDist()
bi_grams_fd.update( words_bi )
bi_grams_fd

FreqDist({'Example,of': 1, 'of,tweet': 1, 'tweet,represented': 1, 'represented,as': 1, 'as,unigrams': 1})

In [26]:
# Trigrams
words_tri  = [ ','.join(map(str,tg)) for tg in nltk.trigrams(text) ]
tri_grams_fd = nltk.FreqDist()
tri_grams_fd.update( words_tri )
tri_grams_fd

FreqDist({'Example,of,tweet': 1, 'of,tweet,represented': 1, 'tweet,represented,as': 1, 'represented,as,unigrams': 1})

We compute the bigrams and trigrams models for the processed text in the whole dataset

In [27]:
# Wrapper function that encloses all the n-grams procedures

def get_word_features(words):
    bag = {}
    words_uni = [ 'has(%s)'% ug for ug in words ]
    words_bi  = [ 'has(%s)'% ','.join(map(str,bg)) for bg in nltk.bigrams(words) ]
    words_tri = [ 'has(%s)'% ','.join(map(str,tg)) for tg in nltk.trigrams(words) ]
    
    for f in words_uni+words_bi+words_tri:
        bag[f] = 1

    return bag


## Negations

The need negation detection in sentiment analysis can be illustrated by the difference in the meaning of the phrases, "This is good" vs. "This is not good" However, the negations occurring in natural language are seldom so simple. Handling the negation consists of two tasks – Detection of explicit negation cues and the scope of negation of these words.

**Scope of Negation**

Words immediately preceding and following the negation cues are the most negative and the words that come farther away do not lie in the scope of negation of such cues. We define left and right negativity of a word as the chances that meaning of that word is actually the opposite. Left negativity depends on the closest negation cue on the left and similarly for Right negativity.

In [28]:
negtn_regex = re.compile( r"""(?:
    ^(?:never|no|nothing|nowhere|noone|none|not|
        havent|hasnt|hadnt|cant|couldnt|shouldnt|
        wont|wouldnt|dont|doesnt|didnt|isnt|arent|aint
    )$
)
|
n't
""", re.X)

def get_negation_features(words):
    INF = 0.0
    negtn = [ bool(negtn_regex.search(w)) for w in words ]

    left = [0.0] * len(words)
    prev = 0.0
    for i in range(0,len(words)):
        if( negtn[i] ):
            prev = 1.0
        left[i] = prev
        prev = max( 0.0, prev-0.1)

    right = [0.0] * len(words)
    prev = 0.0
    for i in reversed(range(0,len(words))):
        if( negtn[i] ):
            prev = 1.0
        right[i] = prev
        prev = max( 0.0, prev-0.1)

    return dict( zip(
                    ['neg_l('+w+')' for w in  words] + ['neg_r('+w+')' for w in  words],
                    left + right ) )

In [29]:
# Test
text = ["This","text", "does", "not", "have", "a", "negation"]
get_negation_features(text)

{'neg_l(This)': 0.0,
 'neg_l(text)': 0.0,
 'neg_l(does)': 0.0,
 'neg_l(not)': 1.0,
 'neg_l(have)': 0.9,
 'neg_l(a)': 0.8,
 'neg_l(negation)': 0.7000000000000001,
 'neg_r(This)': 0.7000000000000001,
 'neg_r(text)': 0.8,
 'neg_r(does)': 0.9,
 'neg_r(not)': 1.0,
 'neg_r(have)': 0.0,
 'neg_r(a)': 0.0,
 'neg_r(negation)': 0.0}

## POS Tagging

With POS Tagging we can get the category of each word. Some of these categories are more interesting in order to infer the sentiment of given tweet. For example, adjectives are expected to carry most sentiment information than adverbs. In a similar way, some particular names can carry a positive or negative implication for particular domains.



In [30]:
def get_pos_features(words):
    tags = {}
    tagged_words = [ 'has(%s)'% w+'_'+tag for w,tag in nltk.pos_tag(words)]
    
    for tw in tagged_words:
        tags[tw] = 1

    return tags

As in the previous step, let's create an function to apply all these creation steps.

In [31]:
# Wrapper function for the extraction of features
def extract_features(text):
    features = {}
    
    words = processAll(text)

    word_features = get_word_features(words)
    features.update( word_features )

    negation_features = get_negation_features(words)
    features.update( negation_features )
    
#     pos_features = get_pos_features(words)
#     features.update( pos_features )

    return features

In [32]:
training_data['processed_tweet_features'] = training_data.tweet.apply(extract_features)
training_data[['tweet','processed_tweet_features']].head()

Unnamed: 0,tweet,processed_tweet_features
939035,day off,"{'has(day)': 1, 'has(off)': 1, 'has(day,off)':..."
167371,@indysawhney seems like u r just pushing your ...,"{'has(__user_indysawhney)': 1, 'has(seem)': 1,..."
461470,Air fresheners that are supposed to smell like...,"{'has(air)': 1, 'has(freshen)': 1, 'has(that)'..."
1190203,"@lenanj http://twitpic.com/64re1 - Lena, the p...","{'has(__user_lenanj)': 1, 'has(__url)': 1, 'ha..."
1476530,@electricrocks AMEN AMEN AMEN!,"{'has(__user_electricrock)': 1, 'has(amen)': 1..."


# Classification

Let's now use the processed tweet features to create a classification model.

## Training-test Splitting

To evaluate our approaches, we are going to split our data into train and validation. We will use the train to create the models and the validation to validate their performance. Once we have selected the best model (according to the accuracy on the validation set) we can use this model to predict our test set.

In this way, test set will remain as unseen data for all the process: we are not going to make any decision based on the test error. Therefore, we can assume that the results on the test set will be the same that we will obtain when new unseen data appears in the future

In [33]:
training_size = 4000
train_tweets = [(tweet, sentiment) for tweet, sentiment in training_data[['tweet', 'sentiment']].values[:training_size]]
validation_tweets  = [(tweet, sentiment) for tweet, sentiment in training_data[['tweet', 'sentiment']].values[training_size:]]

## Preparing the data for the classifier

We have previously defined a feature extraction process, which we have wrapped into the `extract_features` function. By making use of the `nltk.classify.apply_features` function provided by NLTK, we will process the tweets and create the features that will be used for the classifier 

In [34]:
# Apply the data processing and cleaning extraction methodologies
v_train = nltk.classify.apply_features(extract_features,train_tweets)
v_validation  = nltk.classify.apply_features(extract_features,validation_tweets)

Let's see the resultant object

In [35]:
print("For the tweet = ", training_data.tweet.values[0])
print(" ")
print("The following features has been created:")
print(" ")
print(v_train[0][0])

For the tweet =  day off 
 
The following features has been created:
 
{'has(day)': 1, 'has(off)': 1, 'has(day,off)': 1, 'neg_l(day)': 0.0, 'neg_l(off)': 0.0, 'neg_r(day)': 0.0, 'neg_r(off)': 0.0}


### Naive Bayes
We will start with a simple Naïve Bayes Classifier. For a given tweet, if we need to find the label for it, we find the probabilities of all the labels, given that feature and then select the label with maximum probability.

NLTK has its own implementation of Naive Bayes `nltk.classify.NaiveBayesClassifier`. If you prefer, you can use the Naive Bayes implementation in `sklearn`

In [36]:
nb_classifier = nltk.classify.NaiveBayesClassifier
nb_class = nb_classifier.train(v_train)

#### Evaluation

Let's evaluate the accuracy of our model in our validation data

In [37]:
print("Accuracy of the model = ", nltk.classify.accuracy(nb_class, v_validation))

Accuracy of the model =  0.711


73 % of accuracy seems pretty good for the task.

We can have a more detailed idea of the performance by taking a look to the confusion matrix

In [38]:
# build confusion matrix over validation set
test_truth   = [s for (t,s) in v_validation]
test_predict = [nb_class.classify(t) for (t,s) in v_validation]

print('Confusion Matrix')
print()
print(nltk.ConfusionMatrix( test_truth, test_predict ))

Confusion Matrix

         |   n   p |
         |   e   o |
         |   g   s |
         |   a   i |
         |   t   t |
         |   i   i |
         |   v   v |
         |   e   e |
---------+---------+
negative |<351>149 |
positive | 140<360>|
---------+---------+
(row = reference; col = test)



#### Most Representative Features

The NLTK classifier object allows us to see the most representative features

In [39]:
nb_class.show_most_informative_features(25)

Most Informative Features
              neg_l(sad) = 0.0            negati : positi =     22.3 : 1.0
                has(sad) = 1              negati : positi =     17.6 : 1.0
              neg_r(sad) = 0.0            negati : positi =     14.5 : 1.0
          has(thank,you) = 1              positi : negati =     13.2 : 1.0
             neg_l(hurt) = 0.0            negati : positi =     12.4 : 1.0
             neg_r(hurt) = 0.0            negati : positi =     12.1 : 1.0
          has(dont,have) = 1              negati : positi =     11.5 : 1.0
            neg_r(still) = 0.9            negati : positi =     10.2 : 1.0
                has(die) = 1              negati : positi =     10.2 : 1.0
              neg_r(dad) = 0.0            negati : positi =      9.6 : 1.0
             neg_l(find) = 0.9            negati : positi =      9.6 : 1.0
              neg_l(die) = 0.0            negati : positi =      9.6 : 1.0
               has(wont) = 1              negati : positi =      9.5 : 1.0

### Baseline

We have applied a thorough process to create features for our tweets. However, is it justified? Have we actually created a better representation of our data? To know that, we are going to create a baseline model that uses only the text in the tweets (with no features added).

To that end we define a new extraction function that only extract the terms from the tweets

In [40]:
baseline_train_tweets = [(tweet.split(" "), sentiment) for tweet, sentiment in training_data[['tweet', 'sentiment']].values[:training_size]]
baseline_validation_tweets  = [(tweet.split(" "), sentiment) for tweet, sentiment in training_data[['tweet', 'sentiment']].values[training_size:]]

# Wrapper function for the extraction of features
def extract_baseline_features(words):
    
    bag = {}
    words_uni = [ 'has(%s)'% ug for ug in words ]
    
    for f in words_uni:
        bag[f] = 1

    return bag

v_baseline_train = nltk.classify.apply_features(extract_baseline_features, baseline_train_tweets)
v_baseline_validation = nltk.classify.apply_features(extract_baseline_features, baseline_validation_tweets)

We fit a new naive based classifier over this baseline representation and evaluate it

In [41]:
baseline_nb_classifier = nltk.classify.NaiveBayesClassifier
baseline_nb_class = nb_classifier.train(v_baseline_train)

In [42]:
print("Accuracy of the baseline model = ", nltk.classify.accuracy(baseline_nb_class, v_baseline_validation))

Accuracy of the baseline model =  0.693


In [43]:
# build confusion matrix over validation set
test_truth   = [s for (t,s) in v_baseline_validation]
test_predict = [nb_class.classify(t) for (t,s) in v_baseline_validation]

print('Confusion Matrix')
print()
print(nltk.ConfusionMatrix( test_truth, test_predict ))

Confusion Matrix

         |   n   p |
         |   e   o |
         |   g   s |
         |   a   i |
         |   t   t |
         |   i   i |
         |   v   v |
         |   e   e |
---------+---------+
negative |<318>182 |
positive | 194<306>|
---------+---------+
(row = reference; col = test)



As we can see, performance is significantly lower than that of the model using all the features we have created.

In [44]:
# Most Representative Features
baseline_nb_class.show_most_informative_features(25)

Most Informative Features
                has(sad) = 1              negati : positi =     17.2 : 1.0
              has(sucks) = 1              negati : positi =     10.2 : 1.0
            has(awesome) = 1              positi : negati =      9.9 : 1.0
              has(won't) = 1              negati : positi =      9.3 : 1.0
                has(Had) = 1              positi : negati =      8.4 : 1.0
              has(broke) = 1              negati : positi =      8.2 : 1.0
                has(dad) = 1              negati : positi =      8.2 : 1.0
              has(home.) = 1              negati : positi =      8.2 : 1.0
               has(They) = 1              negati : positi =      7.6 : 1.0
              has(bored) = 1              negati : positi =      7.6 : 1.0
               has(poor) = 1              negati : positi =      7.6 : 1.0
             has(wasn't) = 1              negati : positi =      7.6 : 1.0
              has(sorry) = 1              negati : positi =      7.6 : 1.0

## MaxEnt Classifier

Let's try a more sophisticated classifier to see if we can boost the classification performance. In particular we will apply a Maximum Entropy Classifier. 


This classifier works by finding a probability distribution that maximizes the likelihood of testable data. This probability function is parameterized by weight vector. The optimal value of which can be found out using the method of Lagrange multipliers.

In [46]:
max_ent_classifier = nltk.classify.MaxentClassifier
max_ent_class = max_ent_classifier.train(v_train, algorithm='GIS', max_iter=5)

  ==> Training (5 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.497
             2          -0.69303        0.991
             3          -0.69291        0.991
             4          -0.69279        0.991
         Final          -0.69267        0.991


In [47]:
print("Accuracy of the model = ", nltk.classify.accuracy(max_ent_class, v_validation))

Accuracy of the model =  0.722


In [48]:
# build confusion matrix over validation set
test_truth   = [s for (t,s) in v_validation]
test_predict = [max_ent_class.classify(t) for (t,s) in v_validation]

print('Confusion Matrix')
print()
print(nltk.ConfusionMatrix( test_truth, test_predict ))

Confusion Matrix

         |   n   p |
         |   e   o |
         |   g   s |
         |   a   i |
         |   t   t |
         |   i   i |
         |   v   v |
         |   e   e |
---------+---------+
negative |<354>146 |
positive | 132<368>|
---------+---------+
(row = reference; col = test)



The performance is similar than the one of Naive Bayes. If we review the most informative terms, we can see that both algorithms focus on similar features to perform the final classification; hence the similar performance

In [49]:
max_ent_class.show_most_informative_features(25)

  -0.000 neg_l(sad)==0.0 and label is 'positive'
  -0.000 has(sad)==1 and label is 'positive'
  -0.000 neg_r(sad)==0.0 and label is 'positive'
  -0.000 has(dont,have)==1 and label is 'positive'
  -0.000 has(thank,you)==1 and label is 'negative'
  -0.000 neg_l(hurt)==0.0 and label is 'positive'
  -0.000 neg_r(hurt)==0.0 and label is 'positive'
  -0.000 has(die)==1 and label is 'positive'
  -0.000 neg_r(still)==0.9 and label is 'positive'
  -0.000 neg_l(die)==0.0 and label is 'positive'
  -0.000 neg_l(find)==0.9 and label is 'positive'
  -0.000 neg_r(dad)==0.0 and label is 'positive'
  -0.000 has(wasnt)==1 and label is 'positive'
  -0.000 neg_l(wasnt)==0.0 and label is 'positive'
  -0.000 neg_l(work)==0.9 and label is 'positive'
  -0.000 neg_l(boo)==0.0 and label is 'positive'
  -0.000 has(broke)==1 and label is 'positive'
  -0.000 neg_l(wanna)==0.9 and label is 'positive'
  -0.000 neg_r(wasnt)==0.0 and label is 'positive'
  -0.000 has(internet)==1 and label is 'positive'
  -0.000 has(yo

# SentiWordNet

In the theoretical session we presented some sentiment resources that could be used to enrich our dataset with external information.

In particular, SentiWordNet provides a sentiment annotation for the WordNet synsets. We can add this sentiment annotation as new features to our dataset. 

In the following, we define a fuction that based on the words in the tweets and their POS tagging, find the sentiment annotation for the word_POS_TAG in SentiWordNet. We then add these values as new features in our dataset and use them to train a new MaxEnt Classifier

In [50]:
# Download the Wordnet Corpus
nltk.download('wordnet')

# Download the Senti Wordnet Corpus
nltk.download('sentiwordnet')

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn
from nltk.corpus import sentiwordnet as swn
from nltk import sent_tokenize, word_tokenize, pos_tag
 
lemmatizer = WordNetLemmatizer()
 
def penn_to_wn(tag):
    """
    Convert between the PennTreebank tags to simple Wordnet tags
    """
    if tag.startswith('J'):
        return wn.ADJ
    elif tag.startswith('N'):
        return wn.NOUN
    elif tag.startswith('R'):
        return wn.ADV
    elif tag.startswith('V'):
        return wn.VERB
    return None
 

def swn_polarity(text):
    sentiment = 0.0
    tokens_count = 0
  
    tagged_sentence = pos_tag(word_tokenize(text))
    sentiment = {}
    for word, tag in tagged_sentence:
        
        wn_tag = penn_to_wn(tag)
        if wn_tag not in (wn.NOUN, wn.ADJ, wn.ADV):
            sentiment["sent("+word+")"] = 0.0
            continue
        
        lemma = lemmatizer.lemmatize(word, pos=wn_tag)
        if not lemma:
            sentiment["sent("+word+")"] = 0.0
            continue

        synsets = wn.synsets(lemma, pos=wn_tag)
        if not synsets:
            sentiment["sent("+word+")"] = 0.0
            continue

        # Take the first sense, the most common
        synset = synsets[0]
        swn_synset = swn.senti_synset(synset.name())

        sentiment["sent("+word+")"] = swn_synset.pos_score() - swn_synset.neg_score()
        
    return sentiment

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\madcastea\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package sentiwordnet to
[nltk_data]     C:\Users\madcastea\AppData\Roaming\nltk_data...
[nltk_data]   Package sentiwordnet is already up-to-date!


In [51]:
text = "This is a text with good and very good words and bad and stupid words"
swn_polarity(text)

{'sent(This)': 0.0,
 'sent(is)': 0.0,
 'sent(a)': 0.0,
 'sent(text)': 0.0,
 'sent(with)': 0.0,
 'sent(good)': 0.75,
 'sent(and)': 0.0,
 'sent(very)': 0.0,
 'sent(words)': 0.0,
 'sent(bad)': -0.625,
 'sent(stupid)': -0.75}

This annotation provides a sentiment score (based on the SentiWordNet sentiment score) for each term in the tweets (-1 negative, 1 positive, 0 neutral)

In [52]:
# Wrapper function for the extraction of features + sentiment features
def extract_features_with_sentiment(text):
    features = {}
    
    words = processAll(text)
    
    sentiment_features = swn_polarity(text)
    features.update(sentiment_features)
    
    word_features = get_word_features(words)
    features.update( word_features )

    negation_features = get_negation_features(words)
    features.update( negation_features )
        
    pos_features = get_pos_features(words)
    features.update( pos_features )

    return features

In [53]:
# Apply the data processing and cleaning extraction methodologies
v_train_sentiment = nltk.classify.apply_features(extract_features_with_sentiment,train_tweets)
v_validation_sentiment  = nltk.classify.apply_features(extract_features_with_sentiment,validation_tweets)

In [None]:
# Train a new classfier with the sentiment features
max_ent_classifier = nltk.classify.MaxentClassifier
max_ent_class = max_ent_classifier.train(v_train_sentiment, algorithm='GIS', max_iter=5)

In [None]:
print("Accuracy of the model = ", nltk.classify.accuracy(max_ent_class, v_validation_sentiment))

In [None]:
# build confusion matrix over validation set
test_truth   = [s for (t,s) in v_validation_sentiment]
test_predict = [max_ent_class.classify(t) for (t,s) in v_validation_sentiment]

print('Confusion Matrix')
print()
print(nltk.ConfusionMatrix( test_truth, test_predict ))

As we can see, we have improve the accuracy of our classifier by including the sentiment information of the terms.

If we take a look to the most informative features, we can find some sentiment-related features among them. For instance, `sad` has a negative implication, codified by the feature: `sent(sad)==-0.625`, which is highly informative.

In [None]:
max_ent_class.show_most_informative_features(25)