Daniel Rocha Ruiz, MSc in Data Science and Business Analytics

# Sentiment Analysis in Python

Tutorial extracted from Packt
- https://hub.packtpub.com/how-to-perform-sentiment-analysis-using-python-tutorial/
- Sentiment analysis sounds sophisticated, but this tutorial is fairly simple. The sentiments are defined a priori: you are given a dataset of reviews, each labeled with its sentiment (positive or negative). Your only task is to build a classifier, i.e. a function that reads a review and assigns it one of the possible sentiments. The classifier is trained on training data, evaluated on held-out test data, and then used to classify new data.

In [1]:
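# 1) Import Packages
# The movie_reviews corpus must be available locally; if it is not, run nltk.download('movie_reviews') once.

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews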

In [2]:
# 2) Extract Features

def extract_features(word_list):
    # Bag-of-words features: every word that occurs maps to True
    return {word: True for word in word_list}
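
To make the feature representation concrete, here is a minimal sketch of what the function returns. The token list is a hypothetical example, not taken from the corpus. Note that repeated words collapse into a single key, so the classifier only sees whether a word is present, not how often.

```python
def extract_features(word_list):
    # Same idea as the tutorial's function: each word present becomes a True-valued feature
    return {word: True for word in word_list}

# Hypothetical token list (assumption, for illustration only)
sample_tokens = ["a", "great", "great", "movie"]
sample_features = extract_features(sample_tokens)

print(sample_features)  # {'a': True, 'great': True, 'movie': True}
```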

In [4]:
# 3) Load the File IDs of the Positive and Negative Reviews

positive_fileids = movie_reviews.fileids('pos')
negative_fileids = movie_reviews.fileids('neg')

In [4]:
# 4) Build Labeled Feature Sets for the Positive and Negative Reviews

features_positive = [(extract_features(movie_reviews.words(fileids=[f])), 'Positive')
                     for f in positive_fileids]
features_negative = [(extract_features(movie_reviews.words(fileids=[f])), 'Negative')
                     for f in negative_fileids]

# Print a sample movie review
print(movie_reviews.raw(fileids=[positive_fileids[42]]))

will hunting ( matt damon ) is a natural genius . 
for a movie character , that's usually a death sentence . 
it's a trait associated with what my brother calls " too good for this world " movies , like phenomenon or powder . 
forgive me for spoiling the ending , but will doesn't die . 
this is no formula movie . 
in fact , it's quite fresh and original . 
it's a character study more than anything , and that's not surprising , considering it was written by two actors : damon and co-star ben affleck . 
will works whatever kind of job he can get . 
first he's a janitor , then he works construction . 
off-screen he speed reads books on any academic subject that interests him . 
on-screen he hangs out with his friends , picking fights in robust , romanticized-hemingway fashion . 
lambeau ( stellan skarsgard from breaking the waves ) , a math professor , learns that the janitor ( will ) is a genius with a special talent for advanced mathematics . 
having confirmed he's not a fluke or a sava

In [5]:
# 5) Determine the Thresholds for the Train-Test Split

threshold_factor = 0.75  # fraction of each class used for training
threshold_positive = int(threshold_factor * len(features_positive))
threshold_negative = int(threshold_factor * len(features_negative))

In [6]:
# 6) Split into Training and Test Sets

features_train = features_positive[:threshold_positive] + features_negative[:threshold_negative]
features_test = features_positive[threshold_positive:] + features_negative[threshold_negative:]

print("Number of training datapoints:", len(features_train))
print("Number of test datapoints:", len(features_test))

Number of training datapoints: 1500
Number of test datapoints: 500
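
The arithmetic behind these counts can be checked without NLTK. The sketch below assumes 1000 reviews per class, which is consistent with the 1500/500 totals printed above; the helper `split_by_class` and the placeholder items are hypothetical. Because the cut is made per class, both the training and the test set stay balanced between positive and negative reviews.

```python
def split_by_class(items, factor):
    # The first `factor` share of the items goes to training, the rest to testing
    cut = int(factor * len(items))
    return items[:cut], items[cut:]

# Hypothetical stand-ins for the labeled feature sets (assumed: 1000 per class)
positives = [("pos_%d" % i, "Positive") for i in range(1000)]
negatives = [("neg_%d" % i, "Negative") for i in range(1000)]

train_pos, test_pos = split_by_class(positives, 0.75)
train_neg, test_neg = split_by_class(negatives, 0.75)

train = train_pos + train_neg
test = test_pos + test_neg

print(len(train), len(test))  # 1500 500
```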


In [7]:
# 7) Train a Classifier (here we use a Naive Bayes classifier)

classifier = NaiveBayesClassifier.train(features_train)

# Accuracy is the fraction of test reviews whose predicted label matches the true label
print("\nAccuracy of the classifier:", nltk.classify.util.accuracy(classifier, features_test))


Accuracy of the classifier: 0.728
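
An accuracy of 0.728 on 500 test reviews means that 364 of them were labeled correctly. As a minimal sketch of the calculation, the function below computes the same kind of ratio for a toy predictor. The names `accuracy`, `toy_predict` and `toy_test` are hypothetical and are not part of NLTK.

```python
def accuracy(predict, labeled_examples):
    # Fraction of examples whose predicted label equals the true label
    correct = sum(1 for features, label in labeled_examples
                  if predict(features) == label)
    return correct / len(labeled_examples)

def toy_predict(features):
    # Hypothetical rule: a review is positive if and only if it contains "great"
    return "Positive" if "great" in features else "Negative"

# Hypothetical labeled examples in the same (features, label) format as the tutorial
toy_test = [
    ({"great": True}, "Positive"),                # predicted Positive, correct
    ({"dull": True}, "Negative"),                 # predicted Negative, correct
    ({"great": True, "dull": True}, "Negative"),  # predicted Positive, wrong
    ({"fine": True}, "Positive"),                 # predicted Negative, wrong
]

toy_accuracy = accuracy(toy_predict, toy_test)
print(toy_accuracy)  # 0.5
```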


In [8]:
# 8) Print the 10 Most Informative Words

print("\nTop 10 most informative words:\n")

classifier.show_most_informative_features(10)


Top 10 most informative words:

Most Informative Features
             magnificent = True           Positi : Negati =     15.0 : 1.0
             outstanding = True           Positi : Negati =     13.6 : 1.0
               insulting = True           Negati : Positi =     13.0 : 1.0
              vulnerable = True           Positi : Negati =     12.3 : 1.0
               ludicrous = True           Negati : Positi =     11.8 : 1.0
                  avoids = True           Positi : Negati =     11.7 : 1.0
             uninvolving = True           Negati : Positi =     11.7 : 1.0
              astounding = True           Positi : Negati =     10.3 : 1.0
             fascination = True           Positi : Negati =     10.3 : 1.0
                 idiotic = True           Negati : Positi =      9.8 : 1.0
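
Each row is a likelihood ratio: "magnificent = True, 15.0 : 1.0" says that the word is about 15 times more likely to appear in a positive training review than in a negative one. The sketch below shows how such a ratio could arise from raw counts. It assumes add-0.5 smoothing over two feature values (present or absent), and the counts are invented for illustration; they are not read from the corpus and do not reproduce any row of the table.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    # Lidstone-style estimate: (count + gamma) / (total + bins * gamma)
    return (count + gamma) / (total + bins * gamma)

def likelihood_ratio(count_a, total_a, count_b, total_b):
    # How much more likely the word is under label A than under label B
    return smoothed_prob(count_a, total_a) / smoothed_prob(count_b, total_b)

# Hypothetical counts: the word occurs in 10 of 750 reviews with label A
# and in 2 of 750 reviews with label B
ratio = likelihood_ratio(10, 750, 2, 750)
print(round(ratio, 1))  # 4.2
```

The smoothing term keeps the ratio finite even when a word never occurs under one of the labels.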


In [9]:
# 9) Create New Data

input_reviews = ["It is an amazing movie", 
                 "This is a dull movie. I would never recommend it to anyone.",
                 "The cinematography is pretty great in this movie", 
                 "The direction was terrible and the story was all over the place"]

In [10]:
# 10) Predict on the New Data

print("\nPredictions:")

for review in input_reviews:
    print("\nReview:", review)
    # Note: split() keeps capitalization and attached punctuation ("movie."),
    # whereas the corpus tokens are lowercase with punctuation separated
    probdist = classifier.prob_classify(extract_features(review.split()))
    pred_sentiment = probdist.max()

    print("Predicted sentiment:", pred_sentiment)
    print("Probability:", round(probdist.prob(pred_sentiment), 2))


Predictions:

Review: It is an amazing movie
Predicted sentiment: Positive
Probability: 0.61

Review: This is a dull movie. I would never recommend it to anyone.
Predicted sentiment: Negative
Probability: 0.76

Review: The cinematography is pretty great in this movie
Predicted sentiment: Positive
Probability: 0.7

Review: The direction was terrible and the story was all over the place
Predicted sentiment: Negative
Probability: 0.64
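
One caveat about the predictions above: the reviews are tokenized with `str.split()`, which leaves tokens such as "This" and "movie." that cannot match the lowercase corpus words seen in training. A minimal sketch of a normalizing tokenizer follows. The helper `simple_tokenize` is hypothetical and is not the tokenizer used by the corpus, so swapping it in would likely change the probabilities shown above.

```python
import re

def simple_tokenize(text):
    # Lowercase the text, then keep runs of letters and apostrophes;
    # punctuation such as "." is dropped
    return re.findall(r"[a-z']+", text.lower())

review = "This is a dull movie. I would never recommend it to anyone."

print(review.split()[:5])           # ['This', 'is', 'a', 'dull', 'movie.']
tokens = simple_tokenize(review)
print(tokens[:5])                   # ['this', 'is', 'a', 'dull', 'movie']
```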
