# Sentiment Analysis - Classifier Tutorial

This repo walks through a tutorial on building your own sentiment classifier from Twitter data.

Tutorial followed: http://www.laurentluce.com/posts/twitter-sentiment-analysis-using-python-and-nltk/

In [18]:
import nltk
# VADER is used later for comparison; it needs the lexicon from nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

## Create test data

Create a list of tweets with positive sentiment, each paired with its label.

In [19]:
pos_tweets = [('I love this car', 'positive'),
              ('This view is amazing', 'positive'),
              ('I feel great this morning', 'positive'),
              ('I am so excited about the concert', 'positive'),
              ('He is my best friend', 'positive')]

Create a list of tweets with negative sentiment, each paired with its label.

In [20]:
neg_tweets = [('I do not like this car', 'negative'),
              ('This view is horrible', 'negative'),
              ('I feel tired this morning', 'negative'),
              ('I am not looking forward to the concert', 'negative'),
              ('He is my enemy', 'negative')]

We combine both lists into a single list of tuples, each containing two elements. The first element is a list of the tweet's words and the second element is the sentiment label. We drop words shorter than 3 characters and lowercase everything.

In [21]:
tweets = []
for (words, sentiment) in pos_tweets + neg_tweets:
    # Lowercase each token and keep only tokens of 3 or more characters
    words_filtered = [e.lower() for e in words.split() if len(e) >= 3]
    tweets.append((words_filtered, sentiment))

tweets

[(['love', 'this', 'car'], 'positive'),
 (['this', 'view', 'amazing'], 'positive'),
 (['feel', 'great', 'this', 'morning'], 'positive'),
 (['excited', 'about', 'the', 'concert'], 'positive'),
 (['best', 'friend'], 'positive'),
 (['not', 'like', 'this', 'car'], 'negative'),
 (['this', 'view', 'horrible'], 'negative'),
 (['feel', 'tired', 'this', 'morning'], 'negative'),
 (['not', 'looking', 'forward', 'the', 'concert'], 'negative'),
 (['enemy'], 'negative')]

In [22]:
# Test tweets, already tokenized and filtered the same way as the training tweets
test_tweets = [
    (['feel', 'happy', 'this', 'morning'], 'positive'),
    (['larry', 'friend'], 'positive'),
    (['not', 'like', 'that', 'man'], 'negative'),
    (['house', 'not', 'great'], 'negative'),
    (['your', 'song', 'annoying'], 'negative')]

## Classifier

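The list of word features needs to be extracted from the tweets. It is a list of every distinct word that appears in them. The source tutorial, written for NLTK 2, describes this list as ordered by frequency of appearance; in NLTK 3, `keys()` does not guarantee that order, and `FreqDist.most_common()` gives a frequency-sorted list if one is needed. The order does not affect the classifier. We use the following function to get the list, plus a helper that collects the words from all tweets.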

In [23]:
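def get_word_features(wordlist):
    # FreqDist counts how often each word occurs
    wordlist = nltk.FreqDist(wordlist)
    # The keys of the distribution are the distinct words
    word_features = list(wordlist.keys())
    return word_features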

In [24]:
def get_words_in_tweets(tweets):
    all_words = []
    for (words, sentiment) in tweets:
        all_words.extend(words)
    return all_words

In [25]:
word_features = get_word_features(get_words_in_tweets(tweets))
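When a frequency-ordered vocabulary is wanted, `most_common()` can provide it explicitly. The sketch below uses `collections.Counter` as a stand-in, on the assumption that NLTK 3's `FreqDist` subclasses `Counter` and exposes the same method. The sample word list and the names `sample_words`, `top_two` and `ordered_vocab` are made up for illustration.

```python
from collections import Counter

# Hypothetical sample: the words of three short tweets, flattened into one list
sample_words = ['love', 'this', 'car',
                'this', 'view', 'amazing',
                'not', 'like', 'this', 'car']

counts = Counter(sample_words)

# most_common(n) returns (word, count) pairs sorted by descending count
top_two = counts.most_common(2)
print(top_two)

# A frequency-ordered vocabulary keeps only the words, most frequent first
ordered_vocab = [word for word, _ in counts.most_common()]
print(ordered_vocab)
```

Because the classifier only looks features up by name, using this ordering instead of plain `keys()` should not change its predictions.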

To create a classifier, we need to decide which features are relevant. To do that, we first need a `feature extractor`. The one we are going to use returns a dictionary indicating which words from the word features list are contained in the input. Here, the input is a tweet. Each key has the form `contains(word)` and each value is `True` or `False`.

In [26]:
def extract_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features
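To make the shape of the output concrete, here is a standalone sketch of the same idea. It takes the vocabulary as an explicit argument instead of reading the global `word_features`; the function name `extract_features_demo` and the three-word vocabulary are hypothetical and chosen only to keep the result small.

```python
# Hypothetical three-word vocabulary standing in for word_features
vocab = ['love', 'car', 'horrible']

def extract_features_demo(document, vocabulary):
    """Map each vocabulary word to whether it occurs in the document."""
    document_words = set(document)
    return {'contains(%s)' % word: (word in document_words)
            for word in vocabulary}

# 'this' is not in the vocabulary, so it produces no feature at all
features = extract_features_demo(['love', 'this', 'car'], vocab)
print(features)
```

Words outside the vocabulary are silently ignored, which is why a tweet made only of unseen words yields a dictionary of all `False` values.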

With our feature extractor in place, we can build the training set using `nltk.classify.apply_features`. We pass it the feature extractor along with the tweets list defined above, and it applies the extractor to each tweet while keeping the label.

In [28]:
training_set = nltk.classify.apply_features(extract_features, tweets)

The variable `training_set` contains the labeled feature sets. It is a list of tuples, with each tuple containing the feature dictionary and the sentiment string for one tweet. The sentiment string is also called the ‘label’.

In [41]:
training_set[6]

({'contains(about)': False,
  'contains(amazing)': False,
  'contains(best)': False,
  'contains(car)': False,
  'contains(concert)': False,
  'contains(enemy)': False,
  'contains(excited)': False,
  'contains(feel)': False,
  'contains(forward)': False,
  'contains(friend)': False,
  'contains(great)': False,
  'contains(horrible)': True,
  'contains(like)': False,
  'contains(looking)': False,
  'contains(love)': False,
  'contains(morning)': False,
  'contains(not)': False,
  'contains(the)': False,
  'contains(this)': True,
  'contains(tired)': False,
  'contains(view)': True},
 'negative')

In [30]:
classifier = nltk.NaiveBayesClassifier.train(training_set)

#### Naive Bayes Algorithm

The Naive Bayes classifier combines the prior probability of each label, which is the frequency of that label in the training set, with the contribution from each feature. In our case, the prior is the same for ‘positive’ and ‘negative’ (5 of 10 tweets each). The word ‘amazing’ appears in 1 of 5 positive tweets and in none of the negative tweets. This means that when this word is seen as part of the input, the likelihood of the ‘positive’ label is multiplied by about 0.2 (NLTK smooths the counts, so the exact factor differs slightly), while the likelihood of the ‘negative’ label is multiplied by a smaller factor.
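The arithmetic can be checked by hand. The sketch below recomputes the prior and the ‘amazing’ likelihood from the training data, first as a raw frequency and then with add-0.5 smoothing. The smoothing step assumes that `NaiveBayesClassifier.train` uses expected likelihood estimation by default, with one bin per feature value (`True` and `False`); the variable names are illustrative only.

```python
# Labels of the 10 training tweets, and whether each tweet contains 'amazing'
labels = ['positive'] * 5 + ['negative'] * 5
has_amazing = [False, True, False, False, False] + [False] * 5

# Prior: relative frequency of each label in the training set
prior = {label: labels.count(label) / len(labels) for label in set(labels)}

# Raw estimate of P(contains(amazing)=True | positive): 1 of 5 tweets
n_pos = labels.count('positive')
n_pos_amazing = sum(1 for label, has in zip(labels, has_amazing)
                    if label == 'positive' and has)
raw_likelihood = n_pos_amazing / n_pos

# Assumed smoothing: add gamma = 0.5 to each of the 2 bins (True, False)
gamma, bins = 0.5, 2
smoothed_likelihood = (n_pos_amazing + gamma) / (n_pos + gamma * bins)

print(prior['positive'], raw_likelihood, smoothed_likelihood)
```

Smoothing also keeps the ‘negative’ likelihood above zero, so a single word that never appeared in negative tweets cannot rule that label out on its own.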

In [31]:
# Print the most informative features
classifier.show_most_informative_features(10)

Most Informative Features
           contains(not) = False          positi : negati =      1.6 : 1.0
          contains(love) = False          negati : positi =      1.2 : 1.0
        contains(friend) = False          negati : positi =      1.2 : 1.0
       contains(forward) = False          positi : negati =      1.2 : 1.0
          contains(like) = False          positi : negati =      1.2 : 1.0
       contains(looking) = False          positi : negati =      1.2 : 1.0
         contains(great) = False          negati : positi =      1.2 : 1.0
         contains(about) = False          negati : positi =      1.2 : 1.0
         contains(tired) = False          positi : negati =      1.2 : 1.0
      contains(horrible) = False          positi : negati =      1.2 : 1.0


## Classify

Now that our classifier is trained, we can classify a tweet and see which sentiment it outputs. In the source tutorial the tweet is ‘Larry is my friend’, which is classified as positive because of the word ‘friend’, associated with the positive tweet ‘He is my best friend’. The tweet below adds the word ‘not’, which appears only in negative training tweets, and the classifier returns negative instead.

In [43]:
tweet = 'Larry is not my friend'
classifier.classify(extract_features(tweet.split()))

'negative'
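The `test_tweets` list defined earlier can be used to score the classifier on tweets it was not trained on. The following is a minimal, self-contained sketch using `nltk.classify.accuracy`; it repeats the training data so it runs on its own, and the names `train_tweets`, `test_set` and `accuracy` are introduced here for illustration.

```python
import nltk

# Training and test tweets copied from the cells above (already tokenized)
train_tweets = [
    (['love', 'this', 'car'], 'positive'),
    (['this', 'view', 'amazing'], 'positive'),
    (['feel', 'great', 'this', 'morning'], 'positive'),
    (['excited', 'about', 'the', 'concert'], 'positive'),
    (['best', 'friend'], 'positive'),
    (['not', 'like', 'this', 'car'], 'negative'),
    (['this', 'view', 'horrible'], 'negative'),
    (['feel', 'tired', 'this', 'morning'], 'negative'),
    (['not', 'looking', 'forward', 'the', 'concert'], 'negative'),
    (['enemy'], 'negative')]
test_tweets = [
    (['feel', 'happy', 'this', 'morning'], 'positive'),
    (['larry', 'friend'], 'positive'),
    (['not', 'like', 'that', 'man'], 'negative'),
    (['house', 'not', 'great'], 'negative'),
    (['your', 'song', 'annoying'], 'negative')]

# Vocabulary: every distinct word in the training tweets
all_words = [word for words, _ in train_tweets for word in words]
word_features = list(nltk.FreqDist(all_words).keys())

def extract_features(document):
    document_words = set(document)
    return {'contains(%s)' % word: (word in document_words)
            for word in word_features}

training_set = nltk.classify.apply_features(extract_features, train_tweets)
test_set = nltk.classify.apply_features(extract_features, test_tweets)

classifier = nltk.NaiveBayesClassifier.train(training_set)

# Fraction of test tweets whose predicted label matches the given label
accuracy = nltk.classify.accuracy(classifier, test_set)
print(accuracy)
```

With only ten training tweets and five test tweets, the resulting number is a sanity check and not a reliable estimate of real-world performance.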

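For comparison, the same tweet can be scored with NLTK's pretrained VADER analyzer, which needs no training data. The execution counts suggest this cell ran before the classification cell above, so the scores shown may reflect an earlier value of `tweet`.

In [33]: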
sid = SentimentIntensityAnalyzer()
ss = sid.polarity_scores(tweet)
print(ss)

{'neg': 0.0, 'neu': 0.484, 'pos': 0.516, 'compound': 0.4939}


In [34]:
# Working within nltk:
# Feature dictionary recorded in an earlier run, for a tweet without 'not'
# (the source tutorial uses 'Larry is my friend'):
#     print(extract_features(tweet.split()))
# {'contains(not)': False,
#  'contains(view)': False,
#  'contains(best)': False,
#  'contains(excited)': False,
#  'contains(morning)': False,
#  'contains(about)': False,
#  'contains(horrible)': False,
#  'contains(like)': False,
#  'contains(this)': False,
#  'contains(friend)': True,       -- why that tweet was classified as positive
#  'contains(concert)': False,
#  'contains(feel)': False,
#  'contains(love)': False,
#  'contains(looking)': False,
#  'contains(tired)': False,
#  'contains(forward)': False,
#  'contains(car)': False,
#  'contains(the)': False,
#  'contains(amazing)': False,
#  'contains(enemy)': False,
#  'contains(great)': False}

`def classify(self, featureset):`

-    Discard any feature names that were never seen during training.
-    Start with the log probability of each label (the prior).
-    Then add in the log probability of each feature value given the label.
-    Generate a probability distribution from the resulting `logprob` dictionary.
-    Return the label with the greatest probability in that distribution.
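The steps above can be sketched in plain Python. This is an illustration of the procedure, not NLTK's actual implementation: the function name `classify_sketch` is hypothetical, and the probabilities in `label_prob` and `feature_prob` are hand-picked, not taken from the trained model.

```python
import math

# Hypothetical, hand-picked probabilities (not read from the trained classifier)
label_prob = {'positive': 0.5, 'negative': 0.5}
feature_prob = {
    ('positive', 'contains(friend)'): {True: 0.25, False: 0.75},
    ('negative', 'contains(friend)'): {True: 0.1, False: 0.9},
    ('positive', 'contains(not)'): {True: 0.1, False: 0.9},
    ('negative', 'contains(not)'): {True: 0.4, False: 0.6},
}

def classify_sketch(featureset):
    labels = list(label_prob)
    # 1. Discard feature names that were never seen during training
    known = {name: value for name, value in featureset.items()
             if any((label, name) in feature_prob for label in labels)}
    # 2. Start with the log probability of each label (the prior)
    logprob = {label: math.log(label_prob[label]) for label in labels}
    # 3. Add the log probability of each feature value given the label
    for label in labels:
        for name, value in known.items():
            logprob[label] += math.log(feature_prob[label, name][value])
    # 4. Normalize the scores into a probability distribution
    total = sum(math.exp(lp) for lp in logprob.values())
    dist = {label: math.exp(lp) / total for label, lp in logprob.items()}
    # 5. Return the label with the greatest probability
    return max(dist, key=dist.get), dist

label, dist = classify_sketch({'contains(friend)': True,
                               'contains(not)': True,
                               'contains(larry)': True})  # unseen, discarded
print(label, dist)
```

Working in log space turns the product of many small probabilities into a sum, which avoids numeric underflow when the feature dictionary is large.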