## Introduction

This tutorial introduces some basic techniques for performing sentiment analysis. We will concentrate on the tools provided by the [Natural Language Toolkit (NLTK)](https://www.nltk.org/), a Python package for natural language tasks. Sentiment analysis is the process of extracting subjective information from a piece of text, particularly whether the writer's attitude toward the subject is positive or negative. This is useful in a variety of areas. It can help businesses understand how consumers are reacting to new initiatives and adjust their marketing strategy accordingly. It can also give policymakers insight into which groups or demographics to target in their campaigns. These are just a couple of examples of the practical applications of sentiment analysis.
 
### Tutorial content

In this tutorial, we will first show how to perform sentiment analysis using VADER, a tool contained in NLTK, with an example of applying this to text from social media. Then, we will introduce ways to train a new classifier with NLTK.

We will cover the following topics in this tutorial:
- [Installing the libraries](#Installing-the-libraries)
- [VADER sentiment analysis tools](#VADER-sentiment-analysis-tools)
- [Example application: sentiment analysis on tweets](#Example-application:-sentiment-analysis-on-tweets)
- [Training a Naive Bayes Classifier with NLTK](#Training-a-Naive-Bayes-Classifier-with-NLTK)
- [Saving and reloading classifiers](#Saving-and-reloading-classifiers)
- [Incorporating classifiers from scikit-learn](#Incorporating-classifiers-from-scikit-learn)
- [Summary and references](#Summary-and-references)

## Installing the libraries

We will use `nltk`, which is already included in Anaconda. There is an extra lexicon and a corpus that need to be downloaded separately:

  ```python
  >>> import nltk
  >>> nltk.download('vader_lexicon')
  >>> nltk.download('movie_reviews')
  ```
 
Another library we will be using is `tweepy`, to collect tweets from Twitter. You can install this using `pip`:

    $ pip install tweepy

## VADER sentiment analysis tools

Now that we've installed the libraries, let's get started with VADER. VADER (Valence Aware Dictionary and sEntiment Reasoner) is a module that contains tools for sentiment analysis, including a sentiment analyzer that is ready to be applied to any string of text.

```python
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
```

On my machine, the imports raise a warning that the `twython` library has not been installed. Do not worry about this, since we will not be using any functions that require that library.

We start by creating a `SentimentIntensityAnalyzer` object, which will be used to analyze strings of text.

```python
analyzer = SentimentIntensityAnalyzer()
```

`SentimentIntensityAnalyzer` has a lexicon of words that are commonly used in subjective text. Often, these are words that convey feeling or emotion. Each word has been assigned a score representing how positive or negative it is. This lexicon can be accessed with the `make_lex_dict` method.

```python
lexicon_dict = analyzer.make_lex_dict()
print(lexicon_dict['great'], lexicon_dict['okay'], lexicon_dict['horrible'])
```

```
3.1 0.9 -2.5
```


Here, we can see that the word 'great' has a positive score, and 'horrible' has a negative score. This means that they have a positive and negative sentiment, respectively. The word 'okay' also has a positive score, but the value is smaller than that of 'great', showing that 'great' has a stronger positive sentiment than 'okay'.

The key method of `SentimentIntensityAnalyzer` is `polarity_scores`. It is called on a string of text and returns a dictionary with four float values under the keys 'compound', 'neg', 'neu', and 'pos'. The 'pos', 'neg', and 'neu' values give the proportions of the text that fall into the positive, negative, and neutral categories, respectively. The analyzer uses the lexicon to take into account the scores of the individual words in the text. The 'compound' value is not a sum of the other three values: it is computed by summing the lexicon scores of the words in the text, adjusted by VADER's rules for things like punctuation and emphasis, and normalizing the result to lie between -1 and 1. Values closer to -1 represent more negative sentiment, and values closer to 1 represent more positive sentiment. A value close to 0 means that the text is neutral overall: either it carries little subjective information, or its positive and negative words balance out. In this way, the compound score can identify whether the overall sentiment of a piece of text is positive, negative, or neutral, and how strong that sentiment is.
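To make the normalization step concrete, here is a minimal sketch of how an unbounded sum of word scores can be squashed into the range between -1 and 1. The formula and the constant `alpha = 15` follow my reading of the VADER source code, so treat them as an assumption rather than a documented guarantee. The function name `normalize_score` and the input values are illustrative only.

```python
import math

def normalize_score(raw_sum, alpha=15):
    """Squash an unbounded sum of word scores into the open interval (-1, 1).

    alpha=15 is an assumed constant; larger values flatten the curve.
    """
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)

# A single mildly positive word (hypothetical score 1.5) gives a modest value,
# while a large sum approaches, but never reaches, 1.
print(round(normalize_score(1.5), 4))
print(round(normalize_score(30.0), 4))
print(normalize_score(0.0))
```

Because the curve saturates, adding more positive words keeps increasing the compound score, but by smaller and smaller amounts.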

Let's see some simple examples using `polarity_scores`:

```python
analyzer.polarity_scores('I like chocolate')
```

```
{'compound': 0.3612, 'neg': 0.0, 'neu': 0.286, 'pos': 0.714}
```

As we can see, the call to `polarity_scores` returns a dictionary containing all four values. The 'compound' value for the string "I like chocolate" is 0.3612, which represents a positive sentiment.

```python
analyzer.polarity_scores('I like chocolate!')
```

```
{'compound': 0.4199, 'neg': 0.0, 'neu': 0.264, 'pos': 0.736}
```

VADER also takes into account punctuation marks that may affect the sentiment. When we add an exclamation mark to the same sentence, the intensity of the sentiment goes up. So, since the sentence was already positive, it becomes more positive.

```python
analyzer.polarity_scores('I like chocolate! :D')
```

```
{'compound': 0.7798, 'neg': 0.0, 'neu': 0.128, 'pos': 0.872}
```

An interesting property of VADER is that it is particularly suited for social media use. For example, the `SentimentIntensityAnalyzer` knows how to take emoticons into account, as seen here. With a happy emoticon at the end, the compound score of the sentence went up to 0.7798, showing a highly positive sentiment.

## Example application: sentiment analysis on tweets

We will now show an application of analyzing the sentiment of tweets containing a particular word or phrase. The first step is to load `tweepy`, the Python module for retrieving tweets, and set up the authentication. You will also need to register an application on Twitter and get the keys. I have placed my consumer key, consumer secret, access token, and access token secret, all separated by newlines, in a file called "api_keys.txt".

```python
# Load and set up tweepy API
import tweepy
with open('api_keys.txt', 'r') as f:
    creds = f.read().split('\n')
    consumer_key = creds[0]
    consumer_secret = creds[1]
    access_token = creds[2]
    access_token_secret = creds[3]

# Set up the authentication
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
```

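`get_tweets_by_query` is our main function for retrieving tweets that contain the given query string. By default, the Twitter search endpoint returns at most 15 tweets per request, and because retweets of the same text are collapsed into one entry, the function may return fewer. Note that `api.search` is the method name in `tweepy` 3.x; in `tweepy` 4.0 and later it is called `api.search_tweets`.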

```python
import re

# Use the api object to retrieve recent tweets containing the query string
# Return a list containing the distinct matching tweets
def get_tweets_by_query(query):
    tweets_text = set()
    tweets = api.search(q = query, lang = 'en')
    for tweet in tweets:
        match = re.match(r"RT @\w*: ", tweet.text)
        text = tweet.text
        if (match):
            text = text[match.end():]
        tweets_text.add(text)
    return list(tweets_text)
```
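The retweet handling can be tried without any Twitter credentials. The sketch below isolates the same regular expression in a helper (the name `strip_retweet_prefix` and the sample strings are made up for illustration) and shows why a set is used: two retweets of the same text collapse into one entry once the prefix is removed.

```python
import re

# Matches a leading retweet marker such as "RT @some_user: "
RETWEET_PREFIX = re.compile(r"RT @\w*: ")

def strip_retweet_prefix(text):
    """Return the text without a leading 'RT @user: ' marker, if present."""
    match = RETWEET_PREFIX.match(text)
    return text[match.end():] if match else text

# Hypothetical tweet texts
samples = [
    "RT @alice: I love chocolate",
    "RT @bob: I love chocolate",
    "Chocolate is fine, I guess",
]

# A set keeps one copy of each distinct text after the prefix is removed
distinct = {strip_retweet_prefix(s) for s in samples}
print(sorted(distinct))
```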


Now that we have a function to retrieve tweets, we can use VADER to return the fractions of positive, negative, and neutral tweets in the retrieved tweets.

```python
# Use VADER to classify the given text as positive, negative, or neutral
def analyze_text(text):
    sentiments = analyzer.polarity_scores(text)
    if (sentiments['compound'] > 0):
        return "positive"
    elif (sentiments['compound'] < 0):
        return "negative"
    else:
        return "neutral"

# Return the fractions of positive, negative, and neutral tweets, in that order
def analyze_tweets(tweets):
    (num_positive, num_negative, num_neutral) = (0, 0, 0)
    for tweet in tweets:
        sentiment = analyze_text(tweet)
        if (sentiment == "positive"):
            num_positive += 1
        elif (sentiment == "negative"):
            num_negative += 1
        else:
            num_neutral += 1
    total = len(tweets)
    if (total == 0):
        # No tweets were retrieved, so avoid dividing by zero
        return (0.0, 0.0, 0.0)
    return ((num_positive/total), (num_negative/total), (num_neutral/total))
```
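`analyze_text` treats any nonzero compound score as positive or negative, so even a barely positive tweet counts as positive. A common alternative is to leave a small neutral band around zero. The sketch below assumes a cutoff of 0.05, which is the value suggested in the VADER project's documentation as far as I am aware. The function name `label_from_compound` and its `threshold` parameter are illustrative and not part of NLTK.

```python
def label_from_compound(compound, threshold=0.05):
    """Label a compound score, treating values within +/- threshold as neutral."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Hypothetical compound scores
for score in (0.42, 0.01, -0.3):
    print(score, label_from_compound(score))
```

The rest of this tutorial keeps the simpler zero cutoff used in `analyze_text`.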

We can now use these functions to understand the sentiment towards a particular word or phrase, like "chocolate". We start by retrieving the tweets:

```python
chocolate_tweets = get_tweets_by_query('chocolate')
```

Then, we use `analyze_tweets` to see the sentiment of these tweets.

```python
analyze_tweets(chocolate_tweets)
```

```
(0.5, 0.21428571428571427, 0.2857142857142857)
```

The above output shows us the fraction of positive, negative, and neutral tweets in the collection `chocolate_tweets`. 

Or, more practically, a company like Pantene might want to gauge how people feel about its brand.

```python
pantene_tweets = get_tweets_by_query('pantene')
analyze_tweets(pantene_tweets)
```

```
(0.6153846153846154, 0.0, 0.38461538461538464)
```

If a large, random sample of tweets is collected, these fractions of positive vs. negative tweets can help marketers understand what the general sentiment of people on social media is toward a brand.

## Training a Naive Bayes Classifier with NLTK

Though VADER's sentiment intensity analyzer is an easy tool to use without much preparation, it does not always work well on text that is not from social media. We can also train a new classifier ourselves using NLTK. This often gives better results, since the training data can be tailored to the task at hand.

In this part of the tutorial, we will introduce the ways new classifiers can be trained and tested.
The steps are:

1. Take a labeled dataset. For sentiment analysis, the labels are usually positive and negative
2. Separate the data into train and test sets
3. Train the classifier on the training data
4. Calculate the accuracy on the test set

NLTK contains many corpora that can be used for classification tasks. For sentiment analysis in particular, one useful corpus is `movie_reviews`. We will use this dataset for our example. 

A classifier included in NLTK that is commonly used for sentiment analysis tasks is `NaiveBayesClassifier`. We first import the corpus and classifier:

```python
from nltk.corpus import movie_reviews
from nltk.classify import NaiveBayesClassifier
```

Each NLTK corpus is a collection of labelled files containing text. The `fileids` method retrieves a list of identifiers for all the files.

```python
file_ids = movie_reviews.fileids()
print(file_ids[0])
print(len(file_ids))
```

```
neg/cv000_29416.txt
2000
```


We see that there are 2,000 files in the movie_reviews corpus. The `categories` method is used to find out what all the possible labels for files in this corpus are.

```python
movie_reviews.categories()
```

```
['neg', 'pos']
```

Since this corpus is suited for sentiment analysis, each file is labeled with either 'neg' or 'pos', which mean negative or positive sentiment. We can get all file ids that match a certain category by passing the category into the `fileids` call. For example, we can access the negative and positive files separately:

```python
neg_files = movie_reviews.fileids('neg')
pos_files = movie_reviews.fileids('pos')
print(len(neg_files))
print(len(pos_files))
```

```
1000
1000
```


Now we know that there are an equal number of positive and negative files in this corpus. There are also methods to access the actual contents of the corpus. `words` and `sents` retrieve a list of words and of sentences in the corpus. As we can see below, a sentence is represented by a list of words.

```python
print(movie_reviews.words())
print(movie_reviews.sents())
```

```
['plot', ':', 'two', 'teen', 'couples', 'go', 'to', ...]
[['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', ',', 'drink', 'and', 'then', 'drive', '.'], ['they', 'get', 'into', 'an', 'accident', '.'], ...]
```


Now that we are more familiar with the corpus, we return to our analysis task. We first split the corpus into a train and test set, each containing an equal number of negative and positive files. For this example, we will use 1,400 files for the train set and 600 files for the test set. In a real-world application, these numbers should be tweaked to find the best ratio.

```python
train_neg_files = neg_files[:700]
train_pos_files = pos_files[:700]
test_neg_files = neg_files[700:]
test_pos_files = pos_files[700:]
(train_set, test_set) = ((train_neg_files + train_pos_files), (test_neg_files + test_pos_files))
```
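The same slicing can be wrapped in a small helper so that the split size is easy to change. This is only a sketch: `balanced_split` and its `train_per_class` parameter are names invented here, and the file ids are placeholders rather than real corpus entries.

```python
def balanced_split(neg_items, pos_items, train_per_class):
    """Put the first train_per_class items of each class in train, the rest in test."""
    train = neg_items[:train_per_class] + pos_items[:train_per_class]
    test = neg_items[train_per_class:] + pos_items[train_per_class:]
    return train, test

# Placeholder file ids standing in for a corpus with 10 files per class
neg_ids = ["neg/%d.txt" % i for i in range(10)]
pos_ids = ["pos/%d.txt" % i for i in range(10)]

train_ids, test_ids = balanced_split(neg_ids, pos_ids, train_per_class=7)
print(len(train_ids), len(test_ids))
```

Like the code above, this takes the files in their existing order. If the order of files in a corpus might be related to their content, shuffling each class with a fixed seed before splitting is a reasonable safeguard.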

Now, we want to train a classifier on the train set. One quirk is that `NaiveBayesClassifier` expects each training example as a tuple of the form (`file_dict`, `category`), where `file_dict` is a dictionary containing all the words in one file, each mapped to True, and `category` is the file label. Currently, `train_set` and `test_set` are just lists of file ids. So, we need to convert both the train and test set into the correct form.

```python
# Convert a dataset into features that can be supplied to NaiveBayesClassifier
def create_features(dataset):
    result = []
    for file_id in dataset:
        file_dict = {}
        for word in movie_reviews.words(fileids = [file_id]):
            file_dict[word] = True
        result.append((file_dict, (movie_reviews.categories(fileids = [file_id])[0])))
    return result

train_features = create_features(train_set)
test_features = create_features(test_set)
```
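To see what one of these feature tuples looks like without loading the corpus, here is a small stand-alone sketch. The helper name `bag_of_words` and the two tokenized "reviews" are hypothetical. Note that repeated words collapse into a single key, so the features record only whether a word is present, not how often it occurs.

```python
def bag_of_words(words):
    """Map each distinct word to True, recording presence only."""
    return {word: True for word in words}

# Hypothetical tokenized reviews with their labels
labeled_reviews = [
    (["a", "flawless", "and", "riveting", "film"], "pos"),
    (["a", "ludicrous", "and", "idiotic", "plot", "a", "mess"], "neg"),
]

features = [(bag_of_words(words), label) for words, label in labeled_reviews]
print(features[1])
```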

Now, we can simply feed `train_features` to the classifier.

```python
nb_classifier = NaiveBayesClassifier.train(train_features)
```

`nb_classifier` has now been trained on the train set. The final step is to test it. To do this, we will use the function `nltk.classify.util.accuracy`, which takes in a classifier and a labelled test set, runs the classifier on the observations, and then computes the fraction of predictions that match the labels.

```python
print(nltk.classify.util.accuracy(nb_classifier, test_features))
```

```
0.72
```
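Conceptually, accuracy is just the share of test examples whose predicted label equals the true label. The sketch below spells that out in plain Python. It is not NLTK's implementation: the `accuracy` function here takes an ordinary function instead of a classifier object, and `toy_classify` and the test examples are invented for illustration.

```python
def accuracy(classify, labeled_features):
    """Fraction of (features, label) pairs for which classify(features) == label."""
    correct = sum(1 for features, label in labeled_features
                  if classify(features) == label)
    return correct / len(labeled_features)

# Hypothetical stand-in classifier: predicts 'pos' whenever "great" is present
def toy_classify(features):
    return "pos" if features.get("great") else "neg"

# Hypothetical labelled test examples in the (word_dict, label) format
toy_test_set = [
    ({"great": True, "film": True}, "pos"),
    ({"dull": True, "film": True}, "neg"),
    ({"great": True, "waste": True}, "neg"),
    ({"moving": True}, "pos"),
]

print(accuracy(toy_classify, toy_test_set))
```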


The accuracy on the test set is 72%. We can also find out which words in the corpus were most influential in classifying the files. The method `show_most_informative_features` does this.

```python
nb_classifier.show_most_informative_features()
```

```
Most Informative Features
             outstanding = True              pos : neg    =     17.8 : 1.0
              vulnerable = True              pos : neg    =     12.3 : 1.0
               insulting = True              neg : pos    =     11.7 : 1.0
             uninvolving = True              neg : pos    =     11.7 : 1.0
                  avoids = True              pos : neg    =     11.0 : 1.0
               ludicrous = True              neg : pos    =     11.0 : 1.0
                 idiotic = True              neg : pos    =      9.8 : 1.0
                flawless = True              pos : neg    =      9.8 : 1.0
                riveting = True              pos : neg    =      9.7 : 1.0
               affecting = True              pos : neg    =      9.7 : 1.0
```


As we would expect, the informative words usually carry a strong subjective meaning. For example, "outstanding" is a strong positive word, while "idiotic" is strongly negative. It makes sense that these kinds of words are the most influential.
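Each ratio in this table compares how likely a feature is under one label versus the other. As a rough sketch, assuming the probabilities are estimated from document counts with add-0.5 smoothing (my understanding of NLTK's default estimator, not something verified here), a ratio of this kind can be computed as follows. The function names and the counts are hypothetical and chosen only to illustrate the arithmetic.

```python
def smoothed_probability(count, total, bins=2, gamma=0.5):
    """Estimate P(feature present | label) with additive smoothing (assumed)."""
    return (count + gamma) / (total + bins * gamma)

def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Ratio of the larger smoothed probability to the smaller one."""
    p_a = smoothed_probability(count_a, total_a)
    p_b = smoothed_probability(count_b, total_b)
    return max(p_a, p_b) / min(p_a, p_b)

# Hypothetical counts: a word seen in 18 of 700 positive files
# and in 1 of 700 negative files
ratio = likelihood_ratio(18, 700, 1, 700)
print(round(ratio, 1))
```

Under this assumption, words that are rare overall can still rank highly, because the ratio depends on how lopsided the counts are rather than on how frequent the word is.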

## Saving and reloading classifiers

Training a classifier can be a long and tedious process. Because the quality of the training data is so important, much of the effort usually goes into preparing it. Additionally, the same classifier can often be reused across different programs. In these cases, it would be costly to retrain it over and over.

In Python, there are ways to avoid this hassle and simply save a classifier and reload it later. We will take advantage of the `pickle` module to save our classifier. `pickle` is used to serialize and deserialize objects. Generally, objects are first "pickled", or serialized, into a byte stream, and later "unpickled", or deserialized, back into the original structure. 

As an example, we will pickle our classifier. We do this by opening a file at the path where the classifier should be saved and passing both the classifier and the open file to the `dump` function.

```python
import pickle
pickle_path = 'classifier.pkl'
with open(pickle_path, 'wb') as classifier_pickle:
    pickle.dump(nb_classifier, classifier_pickle)
```

To unpickle the classifier, we use the `load` function. This takes an open file object for the path where the object was saved and converts its contents back into the original structure.

```python
# Open the file in a with block so that it is closed after loading
with open(pickle_path, 'rb') as classifier_unpickle:
    opened_classifier = pickle.load(classifier_unpickle)
opened_classifier.show_most_informative_features()
```

```
Most Informative Features
             outstanding = True              pos : neg    =     17.8 : 1.0
              vulnerable = True              pos : neg    =     12.3 : 1.0
               insulting = True              neg : pos    =     11.7 : 1.0
             uninvolving = True              neg : pos    =     11.7 : 1.0
                  avoids = True              pos : neg    =     11.0 : 1.0
               ludicrous = True              neg : pos    =     11.0 : 1.0
                 idiotic = True              neg : pos    =      9.8 : 1.0
                flawless = True              pos : neg    =      9.8 : 1.0
                riveting = True              pos : neg    =      9.7 : 1.0
               affecting = True              pos : neg    =      9.7 : 1.0
```


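As we can see, the classifier retains the same information after saving and reloading. Once reloaded, it can be used on a new classification task as is. Note that NLTK's `NaiveBayesClassifier` is built in a single call to `train`, so adding more training data means training a new classifier on the combined data.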
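The save-and-reload round trip can also be tried on its own, without training anything. In the sketch below a plain dictionary with made-up values stands in for a trained classifier, and a temporary directory is used so that no file is left behind.

```python
import os
import pickle
import tempfile

# A plain dictionary with made-up values stands in for a trained classifier
model = {"labels": ["neg", "pos"], "note": "placeholder model"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    # Serialize the object to disk
    with open(path, "wb") as f:
        pickle.dump(model, f)
    # Deserialize it back into a new object
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model)
```

One caution from the Python documentation applies to any use of `pickle`: loading a pickle file can execute arbitrary code, so only unpickle files that come from a source you trust.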

## Incorporating classifiers from scikit-learn

Though 72% accuracy is reasonable, it isn't as high as we would like for certain tasks. The `scikit-learn` library offers a greater variety of classifiers that can be useful. `SklearnClassifier` is a wrapper built into NLTK that lets these classifiers be used through NLTK's classifier interface. Some of the common `scikit-learn` classifiers are `MultinomialNB`, `BernoulliNB`, `LogisticRegression`, `LinearSVC`, and `SGDClassifier`. Let us wrap these five classifiers using `SklearnClassifier`.

```python
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier

multi_nb = SklearnClassifier(MultinomialNB())
bernoulli_nb = SklearnClassifier(BernoulliNB())
logistic_reg = SklearnClassifier(LogisticRegression())
linear_svc = SklearnClassifier(LinearSVC())
sgd_classifier = SklearnClassifier(SGDClassifier())
```

We can train these classifiers on the same training data as before, and check their accuracies.

```python
classifiers = {'MultinomialNB' : multi_nb, 'BernoulliNB': bernoulli_nb, 'LogisticRegression' : logistic_reg,
               'Linear_SVC': linear_svc, 'SGDClassifier': sgd_classifier}
for name, classifier in classifiers.items():
    classifier.train(train_features)
    print("%s accuracy: %f" % (name, nltk.classify.util.accuracy(classifier, test_features)))
```

```
MultinomialNB accuracy: 0.821667
BernoulliNB accuracy: 0.796667
LogisticRegression accuracy: 0.883333
Linear_SVC accuracy: 0.866667
SGDClassifier accuracy: 0.836667
```


For this task, these classifiers all have better accuracy than the `NaiveBayesClassifier`, and the logistic regression classifier has the highest accuracy. In the real world, sentiment analysis is a difficult problem, and classifiers used for this task are never perfect. Often, a single classifier is chosen for a given task, or several classifiers are combined in a voting scheme that takes each one's prediction into account.
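A simple voting scheme can be sketched in a few lines. The version below assumes that each classifier has already produced a label for the same piece of text, and it returns the most common label together with the share of votes that label received. The function name `majority_vote` and the example votes are hypothetical.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most common label and the share of votes it received."""
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(labels)

# Hypothetical predictions from five classifiers for one review
votes = ["pos", "neg", "pos", "pos", "neg"]
print(majority_vote(votes))
```

Using an odd number of classifiers avoids ties when there are only two labels. The vote share can also serve as a rough confidence measure, for example to set aside texts on which the classifiers disagree.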

## Summary and references

This tutorial gave an overview of the sentiment analysis tools provided in the NLTK library. Some useful resources for further exploration are:

1. [General sentiment analysis overview](https://callminer.com/blog/sentiment-analysis-examples-best-practices/)
2. [Sentiment analysis training and testing](https://blog.griddynamics.com/creating-training-and-test-data-sets-and-preparing-the-data-for-twitter-stream-sentiment-analysis-of-social-movie-reviews/)
3. [VADER documentation](http://www.nltk.org/_modules/nltk/sentiment/vader.html)
4. [VADER algorithmic details](http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf)
5. [Examples using NLTK's NaiveBayesClassifier](http://www.nltk.org/book/ch06.html)
6. [NLTK SklearnClassifier documentation](http://www.nltk.org/_modules/nltk/classify/scikitlearn.html)
