# Sentiment Analysis using the NLTK library for Python

In this tutorial we will explore the Python library NLTK and how to use it for sentiment analysis.
We will start with the basics of NLTK and, once we have a feel for it, move on to sentiment analysis.

## NLTK Library

How to install the NLTK library in Python:

For Python 2.x:
pip install nltk

On a Mac, use sudo pip install nltk

For Python 3.x:
pip3 install nltk

On a Mac, use sudo pip3 install nltk




In [1]:
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

An interface will open.
Click on "all" and then click "download". This downloads all packages and may take a while.
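
If downloading everything is more than you need, a lighter option is to fetch only the data packages this tutorial relies on. The sketch below assumes the package identifiers `punkt`, `stopwords`, and `movie_reviews`; the helper name `download_tutorial_data` is made up for illustration, and newer NLTK releases may ask for additional tokenizer data such as `punkt_tab`.

```python
# Data packages assumed to cover the tokenizers, the stop-word list,
# and the movie review corpus used in this tutorial.
TUTORIAL_PACKAGES = ("punkt", "stopwords", "movie_reviews")


def download_tutorial_data(packages=TUTORIAL_PACKAGES):
    """Download each named NLTK data package and record whether it succeeded."""
    import nltk  # imported here so the package list can be inspected without NLTK

    results = {}
    for name in packages:
        # nltk.download returns True when the download (or update check) succeeds
        results[name] = nltk.download(name, quiet=True)
    return results
```

Calling `download_tutorial_data()` once before running the remaining cells should be enough; the returned dictionary shows which packages were fetched successfully.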

## Tokenize sentence
Text can be split into sentences using the NLTK function sent_tokenize().

In [1]:
from nltk.tokenize import sent_tokenize
text = "Hi my name is Uzair. I studied from IBA. And live in Pakistan"
print(sent_tokenize(text))

['Hi my name is Uzair.', 'I studied from IBA.', 'And live in Pakistan']


## Tokenize words
Alternatively, text can be tokenized into words using the NLTK function word_tokenize().

In [2]:
from nltk.tokenize import word_tokenize
text = "Hi my name is Uzair. I studied from IBA. And live in Pakistan"
print(word_tokenize(text))

['Hi', 'my', 'name', 'is', 'Uzair', '.', 'I', 'studied', 'from', 'IBA', '.', 'And', 'live', 'in', 'Pakistan']


If you want to keep the sentences or words, you can store them in lists.

In [3]:
sentences = sent_tokenize(text)
words = word_tokenize(text)
print(sentences)
print(words)

['Hi my name is Uzair.', 'I studied from IBA.', 'And live in Pakistan']
['Hi', 'my', 'name', 'is', 'Uzair', '.', 'I', 'studied', 'from', 'IBA', '.', 'And', 'live', 'in', 'Pakistan']


## Stop Words

A text may contain words like ‘am’, ‘who’, and ‘where’. These are called stop words, and we can remove them from the text. There is no universal list of stop words in NLP; however, the NLTK library provides one.

In [12]:
from nltk.corpus import stopwords
set(stopwords.words('english'))
# set() builds an unordered collection with no duplicate elements.
# For example: x = [1, 1, 2, 2, 2, 2, 2, 3, 3]
# set(x) contains only the distinct values 1, 2 and 3

{u'a',
 u'about',
 u'above',
 u'after',
 u'again',
 u'against',
 u'ain',
 u'all',
 u'am',
 u'an',
 u'and',
 u'any',
 u'are',
 u'aren',
 u"aren't",
 u'as',
 u'at',
 u'be',
 u'because',
 u'been',
 u'before',
 u'being',
 u'below',
 u'between',
 u'both',
 u'but',
 u'by',
 u'can',
 u'couldn',
 u"couldn't",
 u'd',
 u'did',
 u'didn',
 u"didn't",
 u'do',
 u'does',
 u'doesn',
 u"doesn't",
 u'doing',
 u'don',
 u"don't",
 u'down',
 u'during',
 u'each',
 u'few',
 u'for',
 u'from',
 u'further',
 u'had',
 u'hadn',
 u"hadn't",
 u'has',
 u'hasn',
 u"hasn't",
 u'have',
 u'haven',
 u"haven't",
 u'having',
 u'he',
 u'her',
 u'here',
 u'hers',
 u'herself',
 u'him',
 u'himself',
 u'his',
 u'how',
 u'i',
 u'if',
 u'in',
 u'into',
 u'is',
 u'isn',
 u"isn't",
 u'it',
 u"it's",
 u'its',
 u'itself',
 u'just',
 u'll',
 u'm',
 u'ma',
 u'me',
 u'mightn',
 u"mightn't",
 u'more',
 u'most',
 u'mustn',
 u"mustn't",
 u'my',
 u'myself',
 u'needn',
 u"needn't",
 u'no',
 u'nor',
 u'not',
 u'now',
 u'o',
 u'of',
 u'off',
 

In the example below, we remove stop words from a sample sentence.

In [6]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

sample = "Stopwords code which contain a sample sentence, showing off the stop words filtration."

stop_words_array = set(stopwords.words('english'))

words = word_tokenize(sample)
filtered_sentence =  []
for w in words:
    if w not in stop_words_array:
        filtered_sentence.append(w)

print(words)
print('After removing stopwords')
print(filtered_sentence)

['Stopwords', 'code', 'which', 'contain', 'a', 'sample', 'sentence', ',', 'showing', 'off', 'the', 'stop', 'words', 'filtration', '.']
After removing stopwords
['Stopwords', 'code', 'contain', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']
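
Note that the NLTK stop-word list is all lowercase, so a capitalized token such as "The" would slip through the filter above, and punctuation tokens are kept as well. A minimal sketch of a case-insensitive filter follows. The function name `remove_stopwords` and the tiny stand-in stop-word set are assumptions for illustration; in practice you would pass `set(stopwords.words('english'))`.

```python
import string

# Individual punctuation characters such as ',' and '.'
PUNCTUATION_MARKS = set(string.punctuation)


def remove_stopwords(tokens, stop_words):
    """Drop stop words (ignoring case) and single punctuation tokens."""
    kept = []
    for token in tokens:
        if token.lower() in stop_words:
            continue  # stop word, regardless of capitalization
        if token in PUNCTUATION_MARKS:
            continue  # a lone punctuation mark
        kept.append(token)
    return kept


# Stand-in stop-word set for illustration only.
sample_stop_words = {"the", "a", "which", "off"}
tokens = ["The", "code", "which", "contain", "a", "sample", "sentence", ",", "."]
print(remove_stopwords(tokens, sample_stop_words))
# ['code', 'contain', 'sample', 'sentence']
```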


In [7]:
from nltk.corpus import stopwords
from nltk.corpus import movie_reviews
import string

# Treat stop words and punctuation marks as useless; a set keeps lookups fast
useless_words = set(stopwords.words('english') + list(string.punctuation))
# Keep only the useful words from the movie review corpus
filtered_words = [word for word in movie_reviews.words() if word not in useless_words]
# Number of remaining words, in millions
len(filtered_words)/1e6

0.710579

Let's print the ten most common words among the filtered words.

In [12]:
from collections import Counter
word_counter = Counter(filtered_words)
most_common_words = word_counter.most_common()[:10]
most_common_words

[('film', 9517),
 ('one', 5852),
 ('movie', 5771),
 ('like', 3690),
 ('even', 2565),
 ('good', 2411),
 ('time', 2411),
 ('story', 2169),
 ('would', 2109),
 ('much', 2049)]

## Sentiment Analysis Example

We will use a classification approach to classify our results, which involves training and testing.

The training step needs data. The classifier learns from this training data and then makes predictions.

We start by defining two classes: positive and negative.
We will define the vocabulary of these classes:




In [4]:
positive_vocab = [ 'awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)' ]
negative_vocab = [ 'bad', 'terrible','useless', 'hate', ':(' ]

In [5]:
def word_features(words):
    return dict([(word, True) for word in words])
 
positive_features = [(word_features(pos), 'pos') for pos in positive_vocab]
negative_features = [(word_features(neg), 'neg') for neg in negative_vocab]



We combine positive_features and negative_features into the training set and then train a NaiveBayesClassifier on it.



In [6]:
from nltk.classify import NaiveBayesClassifier

training_set = negative_features + positive_features

classifier = NaiveBayesClassifier.train(training_set)

In [4]:
# The example below classifies each word of a sentence according to the training set.

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
 
def word_feats(words):
    return dict([(word, True) for word in words])
 
positive_vocab = [ 'awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)' ]
negative_vocab = [ 'bad', 'terrible','useless', 'hate', ':(' ]
neutral_vocab = [ 'movie','the','sound','was','is','actors','did','know','words','not' ]
 
positive_features = [(word_feats(pos), 'pos') for pos in positive_vocab]
negative_features = [(word_feats(neg), 'neg') for neg in negative_vocab]
 
training_set = negative_features + positive_features 
 
classifier = NaiveBayesClassifier.train(training_set) 
 
# Predict
neg = 0
pos = 0
sentence = "Awesome movie, but bad music"
sentence = sentence.lower()
words = sentence.split(' ')
for word in words:
    classResult = classifier.classify( word_feats(word))
    if classResult == 'neg':
        neg = neg + 1
    if classResult == 'pos':
        pos = pos + 1
 
print('Positive: ' + str(float(pos)/len(words)))
print('Negative: ' + str(float(neg)/len(words)))

Positive: 0.6
Negative: 0.4


This example gives some idea of how to use training data and how well the classifier predicts on new data. The better your training data, the more accurate your results will be. In this example the training data is very small.
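
One detail worth checking in the example above is what `word_feats` actually receives. A Python string is itself iterable, so passing a single word yields one feature per character rather than one feature for the whole word. The sketch below reuses the `word_feats` helper to show the difference. Wrapping the word in a list is one possible adjustment, not necessarily what the original example intended, and it would change the scores printed above.

```python
def word_feats(words):
    # Build a feature dict with one True entry per item in `words`
    return dict([(word, True) for word in words])


# A bare string is iterated character by character
print(word_feats('bad'))    # {'b': True, 'a': True, 'd': True}

# Wrapping the word in a list keeps it as a single feature
print(word_feats(['bad']))  # {'bad': True}
```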

## Sentiment Analysis on Movie Corpus

We will use bag-of-words in the example below. As the name suggests, a bag-of-words is simply a collection of words that disregards order and grammar, with each distinct word recorded once.

Classification with machine learning is a common technique for building sentiment models. We will build a sentiment classifier using the movie review corpus.
Classification requires labeled data. This is where we take advantage of bag-of-words and the curated negative and positive reviews we downloaded.
We will implement a bag-of-words function and attach a positive or negative label to each review's bag-of-words.
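
To make the idea concrete, here is a small sketch of a bag-of-words built from a made-up review. The token list, the function name `bag_of_words`, and the `ignored` set are illustrative assumptions; the dictionary shape mirrors the representation used in the cell that follows.

```python
def bag_of_words(tokens, ignored=frozenset()):
    """Map each distinct token to 1, skipping anything in `ignored`."""
    return {token: 1 for token in tokens if token not in ignored}


review = ["great", "film", ",", "great", "cast", "and", "a", "great", "story"]
bag = bag_of_words(review, ignored={",", "and", "a"})
print(bag)  # {'great': 1, 'film': 1, 'cast': 1, 'story': 1}
```

The word "great" occurs three times in the review but appears once in the bag, and the original word order plays no role in later lookups.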

In [58]:
from nltk.corpus import movie_reviews

def build_bag_of_words_features(words):
    return {
        word:1 for word in words if not word in useless_words}

positive_reviews = movie_reviews.fileids('pos')
negative_reviews = movie_reviews.fileids('neg')

negative_features = [ (build_bag_of_words_features(movie_reviews.words(fileids = [f])), 'neg')
                   for f in negative_reviews]
positive_features = [ (build_bag_of_words_features(movie_reviews.words(fileids = [f])), 'pos')
                   for f in positive_reviews]

print(len(negative_features))
print(len(positive_features))

1000
1000


We will use the Naive Bayes classifier for this task. It is a very simple classifier with a probabilistic approach to classification: the relationship between the input features and the class labels is expressed as probabilities. Given the input features, the probability of each class is estimated, and the class with the highest probability determines the label for the sample.
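
The probabilistic idea can be sketched in a few lines of plain Python. This is a simplified stand-in written for illustration, not NLTK's implementation: the function names `train` and `log_scores`, the toy documents, and the add-one smoothing are all assumptions made here.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count documents per label and, per label, documents containing each word."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    for words, label in labeled_docs:
        label_counts[label] += 1
        for word in set(words):
            word_counts[label][word] += 1
    return label_counts, word_counts


def log_scores(words, label_counts, word_counts):
    """Return log P(label) plus the sum of log P(word | label) for each label."""
    total_docs = sum(label_counts.values())
    known_words = set().union(*word_counts.values())
    scores = {}
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)  # class prior
        for word in set(words) & known_words:  # words never seen in training are skipped
            # Add-one smoothing so an unseen (word, label) pair never gives log(0)
            likelihood = (word_counts[label][word] + 1) / (n_docs + 2)
            score += math.log(likelihood)
        scores[label] = score
    return scores


training_docs = [
    (["great", "film"], "pos"),
    (["great", "acting"], "pos"),
    (["awful", "film"], "neg"),
    (["awful", "plot"], "neg"),
]
label_counts, word_counts = train(training_docs)
scores = log_scores(["great", "story"], label_counts, word_counts)
prediction = max(scores, key=scores.get)
print(prediction)  # pos
```

Working in log space avoids multiplying many small probabilities together, and the label with the largest score is the predicted class.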

In [43]:
from nltk.classify import NaiveBayesClassifier

One of the simplest supervised machine learning classifiers is the Naive Bayes classifier. We will train it on 80% of the data so that it learns which words are generally associated with positive or negative reviews.

Remember, we had 1,000 records each of positive and negative features. We can use 80% of the data to train the Naive Bayes classifier. Taking the first 800 rows of each feature list gives us 80%, so we store that number, 800, in a variable called split.

In [44]:
split = 800
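
Rather than hard-coding 800, the split point can also be derived from the size of the data. A minimal sketch, using placeholder lists in place of the real feature lists and a hypothetical `train_fraction` name:

```python
train_fraction = 0.8  # share of each class used for training

# Placeholder data standing in for the 1,000 positive and 1,000 negative feature sets
positive_examples = [("pos_review_%d" % i, "pos") for i in range(1000)]
negative_examples = [("neg_review_%d" % i, "neg") for i in range(1000)]

split = int(len(positive_examples) * train_fraction)

train_set = positive_examples[:split] + negative_examples[:split]
test_set = positive_examples[split:] + negative_examples[split:]

print(split, len(train_set), len(test_set))  # 800 1600 400
```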

We use the Naive Bayes classifier to build a model that we will call the sentiment classifier.

In [45]:
sentiment_classifier = NaiveBayesClassifier.train(positive_features[:split] + negative_features[:split] )

We first check the accuracy on the training data itself: the first 800 positive features and the first 800 negative features.
Remember that they carry the labels pos and neg.

In [46]:
nltk.classify.util.accuracy(sentiment_classifier, positive_features[:split] + negative_features[:split] )

0.980625

The accuracy is about 98%, which looks good, but this is the same data the model was trained on.
How will the model behave on the remaining 20%?

In [47]:

nltk.classify.util.accuracy(sentiment_classifier, positive_features[split:] + negative_features[split:] )

0.7175

The accuracy on the held-out data is around 72%.
The estimated accuracy for a human is about 80%,
so an accuracy of around 70% is pretty good
for such a simple model.
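
The accuracy reported above is simply the fraction of examples whose predicted label matches the true label. A minimal sketch with made-up labels follows; the `accuracy` function here is our own illustration, not the NLTK helper.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the prediction equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return correct / len(true_labels)


true_labels = ["pos", "pos", "neg", "neg", "pos"]
predicted = ["pos", "neg", "neg", "neg", "pos"]
print(accuracy(true_labels, predicted))  # 0.8
```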


Remember, we had a large vocabulary and the sentiment classifier used all the words, but which of those words contributed most to this fairly high accuracy?
The classifier we built has a method, show_most_informative_features(), that tells us.
We can run it to see which words, or features, in those reviews were most informative. Each row of the output shows a feature and how much more likely it is under one label than under the other.

In [48]:
sentiment_classifier.show_most_informative_features()

Most Informative Features
             outstanding = 1                 pos : neg    =     13.9 : 1.0
               insulting = 1                 neg : pos    =     13.7 : 1.0
              vulnerable = 1                 pos : neg    =     13.0 : 1.0
               ludicrous = 1                 neg : pos    =     12.6 : 1.0
             uninvolving = 1                 neg : pos    =     12.3 : 1.0
              astounding = 1                 pos : neg    =     11.7 : 1.0
                  avoids = 1                 pos : neg    =     11.7 : 1.0
             fascination = 1                 pos : neg    =     11.0 : 1.0
               affecting = 1                 pos : neg    =     10.3 : 1.0
                  symbol = 1                 pos : neg    =     10.3 : 1.0


In this tutorial we started with the basics of the NLTK library and moved on to how we can use it for sentiment analysis. There are many training datasets available online, for example this [University dataset](https://github.com/nltk/nltk/wiki/Sentiment-Analysis).
A good dataset will increase the accuracy of your classifier, and more data generally gives better results.

Thanks for reading. I hope you found this tutorial helpful. Feel free to reach me on Twitter [@UzairAdamjee](https://twitter.com/UzairAdamjee).
