# Perform Sentiment Analysis in Python 3 Using the Natural Language Toolkit (NLTK)

# Step 1 — Installing NLTK and Downloading the Data
First, install the NLTK package with the pip package manager:

$ pip install nltk==3.3

>> import nltk

Download the sample tweets from the NLTK package:

>> nltk.download('twitter_samples')

In [1]:
from nltk.corpus import twitter_samples
from nltk.corpus import stopwords
from nltk.tag import pos_tag
from nltk.stem.wordnet import WordNetLemmatizer
import re,string
from nltk import FreqDist
import random

from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk import classify
from nltk import NaiveBayesClassifier

# Step 2 — Tokenizing the Data
Tokenization is the act of breaking up a string into pieces such as words, keywords, phrases, symbols, and other elements, which are called tokens.

from nltk.corpus import twitter_samples

This will import three datasets from NLTK that contain various tweets to train and test the model:

- negative_tweets.json: 5000 tweets with negative sentiments
- positive_tweets.json: 5000 tweets with positive sentiments
- tweets.20150430-223406.json: 20000 tweets with no sentiments

In [2]:
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')
text = twitter_samples.strings('tweets.20150430-223406.json')

The strings() method of twitter_samples returns all of the tweets within a dataset as a list of strings. Assigning each tweet collection to its own variable makes processing and testing easier.

Download the punkt resource to use NLTK’s tokenizers:

>> nltk.download('punkt')

NLTK provides a default tokenizer for tweets with the .tokenized() method.

In [3]:
tweets_tokens = twitter_samples.tokenized('positive_tweets.json')
# word_tokenized = word_tokenize(tweets_tokens)
print(tweets_tokens[0])
# tweets_tagged = pos_tag_sents(tweets_tokens)

['#FollowFriday', '@France_Inte', '@PKuchly57', '@Milipol_Paris', 'for', 'being', 'top', 'engaged', 'members', 'in', 'my', 'community', 'this', 'week', ':)']


Here, the .tokenized() method returns special characters such as @ and _. These characters will be removed through regular expressions later.
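To make the idea of tokenization concrete, here is a minimal sketch that uses only the standard library. The pattern `TOKEN_PATTERN` is a hypothetical, much simplified stand-in for a tweet-aware tokenizer and is not what NLTK uses internally; it only illustrates why hashtags, mentions, and emoticons need to be kept together as single tokens.

```python
import re

# Hypothetical, simplified pattern (an assumption, not NLTK's tokenizer):
#   1. hashtags and mentions    2. a few simple emoticons
#   3. words, optionally with an apostrophe part    4. any other single symbol
TOKEN_PATTERN = re.compile(r"[#@]\w+|[:;]-?[)(dDpP]|\w+(?:'\w+)?|[^\w\s]")

def toy_tokenize(text):
    """Split text into tokens, keeping handles and emoticons in one piece."""
    return TOKEN_PATTERN.findall(text)

sample = "Thanks, @bob! It's great :)"

# A plain whitespace split glues punctuation to the neighbouring word.
whitespace_tokens = sample.split()
regex_tokens = toy_tokenize(sample)

print(whitespace_tokens)  # ['Thanks,', '@bob!', "It's", 'great', ':)']
print(regex_tokens)       # ['Thanks', ',', '@bob', '!', "It's", 'great', ':)']
```

The whitespace split leaves `'Thanks,'` and `'@bob!'` as single tokens, whereas the pattern-based version separates punctuation while keeping `@bob` and `:)` whole.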

# Step 3 — Normalizing the Data
Normalization in NLP is the process of converting a word to its canonical form.
Normalization helps group together words with the same meaning but different forms.


» *wordnet* is a lexical database for the English language that helps the script determine the base word. 

» You need the *averaged_perceptron_tagger* resource to determine the context of a word in a sentence.

Common part-of-speech tags and their meanings:

NNP: Noun, proper, singular

NN: Noun, common, singular or mass

IN: Preposition or conjunction, subordinating

VBG: Verb, gerund or present participle

VBN: Verb, past participle

>>nltk.download('wordnet')

>>nltk.download('averaged_perceptron_tagger')

from nltk.tag import pos_tag

In [4]:
print(pos_tag(tweets_tokens[0]))

# Here is the output of the pos_tag function.

[('#FollowFriday', 'JJ'), ('@France_Inte', 'NNP'), ('@PKuchly57', 'NNP'), ('@Milipol_Paris', 'NNP'), ('for', 'IN'), ('being', 'VBG'), ('top', 'JJ'), ('engaged', 'VBN'), ('members', 'NNS'), ('in', 'IN'), ('my', 'PRP$'), ('community', 'NN'), ('this', 'DT'), ('week', 'NN'), (':)', 'NN')]


In [5]:
# This function first gets the part-of-speech tag of each token in the tweet.
# If the tag starts with NN, the token is treated as a noun.
# Similarly, if the tag starts with VB, the token is treated as a verb; anything else is treated as an adjective.
def lemmatize_sentence(tokens):
    lemmatizer = WordNetLemmatizer()
    lemmatized_sentence = []
    for word,tag in pos_tag(tokens):
        if tag.startswith('NN'):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else: 
            pos = 'a'
        # lemmatize(word, pos) returns the base form of the word for the given part of speech
        lemmatized_sentence.append(lemmatizer.lemmatize(word,pos))
    return lemmatized_sentence
print(lemmatize_sentence(tweets_tokens[0]))

#  lemmatization does morphological analysis of the words.

['#FollowFriday', '@France_Inte', '@PKuchly57', '@Milipol_Paris', 'for', 'be', 'top', 'engage', 'member', 'in', 'my', 'community', 'this', 'week', ':)']
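The tag handling inside lemmatize_sentence can be read as a small mapping from Penn Treebank tags to the one-letter part-of-speech codes passed to the lemmatizer. The sketch below isolates just that mapping; the helper name `penn_to_wordnet_pos` is hypothetical and does not come from NLTK, and the tags are typed by hand so that no tagger resource is needed.

```python
def penn_to_wordnet_pos(tag):
    """Map a Penn Treebank tag to the one-letter code used in this tutorial."""
    if tag.startswith("NN"):
        return "n"  # NN, NNS, NNP, NNPS are all nouns
    if tag.startswith("VB"):
        return "v"  # VB, VBD, VBG, VBN, VBP, VBZ are all verbs
    return "a"      # everything else is treated as an adjective, as in the cell above

# Hand-written (word, tag) pairs taken from the pos_tag output shown earlier.
tagged = [("being", "VBG"), ("engaged", "VBN"), ("members", "NNS"), ("top", "JJ"), ("for", "IN")]
pos_codes = [(word, penn_to_wordnet_pos(tag)) for word, tag in tagged]
print(pos_codes)
# [('being', 'v'), ('engaged', 'v'), ('members', 'n'), ('top', 'a'), ('for', 'a')]
```

Note that the fallback sends prepositions such as "for" through the adjective branch as well, which matches the behaviour of the original function.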


# Step 4 — Removing Noise from the Data
Noise is any part of the text that does not add meaning or information to data.
For instance, the most common words in a language are called stop words. Some examples of stop words are “is”, “the”, and “a”.
They are generally irrelevant unless a specific use case warrants their inclusion.

*We'll use regular expressions in Python to search for and remove these items*:
- <u>Hyperlinks</u>: To remove hyperlinks, first search for a substring that matches a URL starting with http:// or https://, followed by letters, numbers, or special characters. Once a pattern is matched, the .sub() method replaces it with an empty string.
- <u>Twitter handles in replies</u>: Users write ’@’ to mention or answer another user. We remove @ mentions (an @ followed by letters, numbers, or _) using the *re* library.
- <u>Punctuation and special characters</u>: Finally, remove punctuation using the *string* library.

In addition, stop words are removed using NLTK's built-in set of *stop words*, which needs to be downloaded separately.
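The three removal steps described above can be tried in isolation with the standard library alone. This is a sketch under stated assumptions: `URL_PATTERN` is deliberately simpler than the pattern used in the tutorial's function, `DEMO_STOP_WORDS` is a tiny hand-written stand-in for NLTK's stop word list, and `strip_noise` is a hypothetical helper that skips lemmatization entirely.

```python
import re
import string

URL_PATTERN = re.compile(r"https?://\S+")        # simplified URL matcher (assumption)
MENTION_PATTERN = re.compile(r"@[A-Za-z0-9_]+")  # an @ followed by letters, digits, or _
DEMO_STOP_WORDS = {"is", "the", "a"}             # stand-in for NLTK's stop word list

def strip_noise(tokens, stop_words=DEMO_STOP_WORDS):
    """Drop URLs, @ mentions, punctuation, and stop words; lowercase the rest."""
    cleaned = []
    for token in tokens:
        token = URL_PATTERN.sub("", token)
        token = MENTION_PATTERN.sub("", token)
        # An empty string means the whole token was a URL or a mention.
        if not token or token in string.punctuation or token.lower() in stop_words:
            continue
        cleaned.append(token.lower())
    return cleaned

demo_tokens = ["@groovinshawn", "The", "charger", "is", "here", "!",
               "https://example.com/x", ":)"]
cleaned_demo = strip_noise(demo_tokens)
print(cleaned_demo)  # ['charger', 'here', ':)']
```

The check `token in string.punctuation` is a substring test against the punctuation string, so single symbols such as `!` are dropped while a two-character emoticon such as `:)` survives. That is why emoticons remain available as features later on.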


In [6]:
def remove_noise(tweets_tokens, stop_words = ()):
    cleaned_tokens = []
    for token, tag in pos_tag(tweets_tokens):
        # Raw strings keep the backslashes in the patterns from being read as string escapes
        token = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', token)
        token = re.sub(r'(@[A-Za-z0-9_]+)', '', token)
        
        if tag.startswith("NN"):
            pos = 'n'
        elif tag.startswith("VB"):
            pos = 'v'
        else:
            pos = 'a'
        lemmatizer = WordNetLemmatizer()
        token = lemmatizer.lemmatize(token, pos)
        
        if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token.lower())
    return cleaned_tokens

In [7]:
# The .words() method returns the list of stop words for English

stop_words = stopwords.words('english')

# print(remove_noise(tweets_tokens[0], stop_words))

Tweets also use ’#’ to describe their subjects and ’RT’ to retweet (republish) another tweet.

Notice that the function removes all @ mentions and stop words, and converts the words to lowercase.

Let's use the remove_noise() function to clean the positive and negative tweets:

In [8]:
positive_tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
negative_tweet_tokens = twitter_samples.tokenized('negative_tweets.json')

positive_cleaned_tokens_list = []
negative_cleaned_tokens_list = []

for tokens in positive_tweet_tokens:
    positive_cleaned_tokens_list.append(remove_noise(tokens, stop_words))
for tokens in negative_tweet_tokens:
    negative_cleaned_tokens_list.append(remove_noise(tokens, stop_words))

In [9]:
print(positive_tweet_tokens[50])
print(positive_cleaned_tokens_list[50])

['@groovinshawn', 'they', 'are', 'rechargeable', 'and', 'it', 'normally', 'comes', 'with', 'a', 'charger', 'when', 'u', 'buy', 'it', ':)']
['rechargeable', 'normally', 'come', 'charger', 'u', 'buy', ':)']


# Step 5 — Determining Word Density
The most basic form of analysis on textual data is to compute word frequencies.
A single tweet is too small an entity to reveal the distribution of words, so the frequency analysis is done on all positive tweets.

The following snippet defines a generator function, named *get_all_words*, that takes a list of tokenized tweets as an argument and yields every word from all of the tweets, one at a time.

In [10]:
def get_all_words(cleaned_tokens_list):
    for tokens in cleaned_tokens_list:
        for token in tokens:
            yield token
            
all_pos_words = get_all_words(positive_cleaned_tokens_list)


In [11]:
freq_dist_pos = FreqDist(all_pos_words)
print(freq_dist_pos.most_common(20))

# The .most_common() method lists the words which occur most frequently in the data.

[(':)', 3691), (':-)', 701), (':d', 658), ('thanks', 388), ('follow', 357), ('love', 333), ('...', 290), ('good', 283), ('get', 263), ('thank', 253), ('u', 245), ('day', 242), ('like', 229), ('see', 195), ('happy', 192), ("i'm", 183), ('great', 175), ('hi', 173), ('go', 167), ('back', 163)]


From this data, we can see that emoticon entities form some of the most common parts of positive tweets. 
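The same flatten-then-count pattern can be sketched with the standard library. Here `collections.Counter` stands in for FreqDist (FreqDist builds on Counter, so `.most_common()` behaves the same way), and the three toy tweets are invented for illustration.

```python
from collections import Counter
from itertools import chain

# Invented stand-in for positive_cleaned_tokens_list.
toy_cleaned_tweets = [["love", ":)"], ["thanks", "love", ":)"], [":)"]]

# chain.from_iterable lazily yields every token of every tweet,
# just like the get_all_words generator.
all_words = chain.from_iterable(toy_cleaned_tweets)

word_counts = Counter(all_words)
top_two = word_counts.most_common(2)
print(top_two)  # [(':)', 3), ('love', 2)]

# The lazy iterator is exhausted after one pass, so a second count
# would need a fresh iterator.
leftover = list(all_words)
print(leftover)  # []
```

Because get_all_words is also a generator, all_pos_words can likewise be consumed only once; call get_all_words again if the words are needed a second time.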

To summarize, we extracted the tweets from NLTK, then tokenized, normalized, and cleaned them for use in the model. Finally, we looked at the frequencies of tokens in the data and checked the top twenty tokens.

# Step 6 — Preparing Data for the Model
We will create a training dataset to train a model. This is a supervised machine learning process, which requires us to associate each tweet in the dataset with a “sentiment” for training.
Sentiment analysis can be used to categorize text into a variety of sentiments. Here, for simplicity, our model will use only the “positive” and “negative” sentiments.

A *model* is a description of a system using rules and equations. It may be as simple as an equation that predicts the weight of a person given their height. The sentiment analysis model you will build associates tweets with a positive or a negative sentiment. You will need to split your dataset into two parts: the first part is used to build the model, and the second part tests the performance of the model.

# Converting Tokens to a Dictionary
You will use the Naive Bayes classifier in NLTK to perform the modeling exercise.

In [12]:
def get_tweets_for_model(cleaned_tokens_list):
    for tweet_tokens in cleaned_tokens_list:
        yield dict([token, True] for token in tweet_tokens)
    
positive_tokens_for_model = get_tweets_for_model(positive_cleaned_tokens_list)
negative_tokens_for_model = get_tweets_for_model(negative_cleaned_tokens_list)

# for value in positive_tokens_for_model:
#     print(value)
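To see the shape of the data that get_tweets_for_model produces, the conversion can be applied to a single cleaned tweet. The helper name `tokens_to_features` is hypothetical; it performs the same token-to-True mapping with a dict comprehension.

```python
def tokens_to_features(tweet_tokens):
    """Turn a list of tokens into a {token: True} feature dictionary."""
    return {token: True for token in tweet_tokens}

# "come" appears twice on purpose.
features = tokens_to_features(["rechargeable", "come", "charger", "come"])
print(features)
# {'rechargeable': True, 'come': True, 'charger': True}
```

Repeated tokens collapse into a single key, so this representation records only whether a word is present in a tweet, not how many times it occurs.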

# Splitting the Dataset for Training and Testing the Model
Next, we need to prepare the data for training the NaiveBayesClassifier class. Add the following code to the file to prepare the data:

In [13]:
positive_dataset = [(tweet_dict, "Positive")
                    for tweet_dict in positive_tokens_for_model]
negative_dataset = [(tweet_dict, "Negative")
                   for tweet_dict in negative_tokens_for_model]
dataset = positive_dataset + negative_dataset

random.shuffle(dataset)

This code attaches a Positive or Negative label to each tweet. It then creates a dataset by joining the positive and negative tweets.

When training the model, you should provide a sample of your data that does not contain any bias. To avoid bias, you’ve added code to randomly arrange the data using the .shuffle() method of random.

In [14]:
train_data = dataset[:7000]
test_data = dataset[7000:]

# Preview a few labelled examples instead of printing all 7000
print(train_data[:5])



Finally, the code splits the shuffled data into a ratio of 70:30 for training and testing, respectively. Since the number of tweets is 10000, you can use the first 7000 tweets from the shuffled dataset for *training* the model and the final 3000 for *testing* the model.
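The shuffle-and-slice step can be wrapped in a small function so that the ratio is not tied to the numbers 7000 and 10000. This is a sketch: `shuffle_and_split`, its `train_fraction` and `seed` parameters, and the toy dataset are all hypothetical and are not part of the tutorial's code.

```python
import random

def shuffle_and_split(dataset, train_fraction=0.7, seed=None):
    """Return (train, test) after shuffling a copy of the dataset."""
    rng = random.Random(seed)   # a fixed seed makes the shuffle repeatable
    shuffled = list(dataset)    # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Ten invented labelled examples in the (features, label) shape used above.
toy_dataset = [({"token%d" % i: True}, "Positive" if i % 2 == 0 else "Negative")
               for i in range(10)]

train_part, test_part = shuffle_and_split(toy_dataset, seed=0)
print(len(train_part), len(test_part))  # 7 3
```

Passing a seed is optional; it is useful when the reported accuracy needs to be reproducible between runs, since a different shuffle gives a slightly different split.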

In this step, you converted the cleaned tokens to a dictionary form, randomly shuffled the dataset, and split it into training and testing data.

# Step 7 — Building and Testing the Model

Finally, you can use the NaiveBayesClassifier class to build the model. Use the .train() method to train the model and the .accuracy() method to test the model on the testing data.

>>from nltk import classify

>>from nltk import NaiveBayesClassifier

In [15]:
classifier = NaiveBayesClassifier.train(train_data)
print('Accuracy is:', classify.accuracy(classifier, test_data))

# show_most_informative_features() prints the table itself and returns None,
# so it is called directly rather than wrapped in print()
classifier.show_most_informative_features(10)

Accuracy is: 0.998
Most Informative Features
                      :( = True           Negati : Positi =   2030.7 : 1.0
                      :) = True           Positi : Negati =   1010.1 : 1.0
                     sad = True           Negati : Positi =     36.8 : 1.0
                follower = True           Positi : Negati =     23.2 : 1.0
                     bam = True           Positi : Negati =     23.0 : 1.0
                  arrive = True           Positi : Negati =     14.8 : 1.0
               community = True           Positi : Negati =     14.7 : 1.0
                     via = True           Positi : Negati =     14.3 : 1.0
                    blog = True           Positi : Negati =     14.1 : 1.0
                followed = True           Negati : Positi =     13.8 : 1.0


Accuracy is defined as the percentage of tweets in the testing dataset for which the model was correctly able to predict the sentiment.
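The definition of accuracy above can be written out in a few lines. In this sketch, `accuracy` is a hand-rolled function rather than NLTK's classify.accuracy, and `emoticon_rule` is a deliberately crude, hypothetical stand-in for a trained classifier; the four test examples are invented.

```python
def accuracy(classify_fn, labelled_examples):
    """Fraction of (features, label) pairs for which classify_fn returns the label."""
    correct = sum(1 for features, label in labelled_examples
                  if classify_fn(features) == label)
    return correct / len(labelled_examples)

def emoticon_rule(features):
    """Toy classifier: negative only when a sad emoticon is present."""
    return "Negative" if ":(" in features else "Positive"

toy_test_data = [
    ({":)": True, "love": True}, "Positive"),
    ({":(": True}, "Negative"),
    ({"sad": True}, "Negative"),   # no emoticon, so the toy rule gets this wrong
    ({"great": True}, "Positive"),
]

toy_accuracy = accuracy(emoticon_rule, toy_test_data)
print(toy_accuracy)  # 0.75
```

Three of the four examples are labelled correctly, giving 0.75. The result is a fraction between 0 and 1, which is also how the value 0.998 in the output above should be read.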

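In the table that shows the most informative features, every row in the output shows the ratio of occurrence of a token in positive and negative tagged tweets in the training dataset. For example, the first row indicates that the token :( occurs roughly 2030.7 times more often in negative-tagged tweets than in positive-tagged ones.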
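As a rough illustration of how such a ratio can arise, the sketch below compares how often a token occurs under each label. The counts are invented, the helper names are hypothetical, and the add-0.5 smoothing is an assumption made here to avoid dividing by zero; NLTK's internal estimates may differ in detail.

```python
def smoothed_rate(tweets_with_token, total_tweets, gamma=0.5, bins=2):
    """Smoothed estimate of how often a token appears in tweets with one label.

    gamma and bins implement simple additive smoothing (an assumption):
    each of the two outcomes (token present, token absent) gets gamma extra counts.
    """
    return (tweets_with_token + gamma) / (total_tweets + gamma * bins)

def occurrence_ratio(count_a, total_a, count_b, total_b):
    """How many times more often the token occurs under label A than label B."""
    return smoothed_rate(count_a, total_a) / smoothed_rate(count_b, total_b)

# Invented counts: "sad" appears in 40 of 100 negative tweets
# and in 1 of 100 positive tweets.
ratio = occurrence_ratio(40, 100, 1, 100)
print(f"Negati : Positi = {ratio:.1f} : 1.0")  # Negati : Positi = 27.0 : 1.0
```

A token that appears almost exclusively under one label, such as an emoticon, produces a very large ratio, which is why :( and :) sit at the top of the table.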

Next, you can check how the model performs on random tweets from Twitter.

In [17]:
custom_tweet = "I ordered just once from TerribleCo, they screwed up, never used the App again."
# ...
# custom_tweet = 'Congrats #SportStar on your 7th best goal from last season winning goal of the year :) #Baller #Topbin #oneofmanyworldies'
# ...
# custom_tweet = 'Thank you for sending my baggage to CityX and flying me to CityY at the same time. Brilliant service. #thanksGenericAirline'
custom_tokens = remove_noise(word_tokenize(custom_tweet))
print(classifier.classify(dict([token, True] for token in custom_tokens)))


# This code will allow you to test custom tweets by updating the string associated with the custom_tweet variable.

# Here is the output for the custom text in the example:

# Output:

Negative


The model classified the third example as positive. This is because the training data wasn’t comprehensive enough to classify sarcastic tweets as negative.

If we want our model to detect sarcasm, we need to provide a sufficient amount of training data to train it accordingly.

Our completed code still has artifacts left over, so the next step will guide us through aligning the code to Python’s best practices.