Basically, sentiment analysis is a sentiment classification task that classifies text into 'positive', 'negative' or 'neutral'.

Input: document d; set of classes C={c1,c2,...,cn}

Output: a predicted class c

**why sentiment analysis:**
* helps to predict customer behavior for a particular product
* help to test the adaptability of a product
* automates the task of customer preference reports

**why Naive Bayes:** 
* reduced number of parameters
* linear time complexity as opposed to exponential time complexity

**Case Study: sentiment analysis of movie reviews**

**Agenda**:
* preparation for the data
* define feature extractor
* train Naive Bayes classifier
* disadvantages of Naive Bayes

**Preparation for the data**

data: use an off-line movie review corpus that is covered in NLTKbook

In [1]:
import nltk
from nltk.corpus import movie_reviews
import random

In [4]:
documents = [(list(movie_reviews.words(fileid)), category)
            for category in movie_reviews.categories()
            for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

**Define feature extractor**

In [7]:
#lowercase all words
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())

#extract 2000 most frequent words in overall corpus
word_features = list(all_words)[:2000]

#check whether word is present in a given doc, save result in a dictionary
def document_features(document):
    document_words = set(document)
    features = {}
    
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

**Train Naive Bayes classifier to predict sentiments**

In [9]:
featuresets = [(document_features(d), c) for (d,c) in documents]

In [11]:
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

test the classifier

In [12]:
print(nltk.classify.accuracy(classifier, test_set))

0.77


show the most important features as interpreted by Naive Bayes

In [14]:
classifier.show_most_informative_features(10)

Most Informative Features
 contains(unimaginative) = True              neg : pos    =      8.4 : 1.0
        contains(turkey) = True              neg : pos    =      8.2 : 1.0
    contains(schumacher) = True              neg : pos    =      7.0 : 1.0
     contains(atrocious) = True              neg : pos    =      6.6 : 1.0
        contains(suvari) = True              neg : pos    =      6.4 : 1.0
          contains(mena) = True              neg : pos    =      6.4 : 1.0
       contains(singers) = True              pos : neg    =      6.3 : 1.0
        contains(wasted) = True              neg : pos    =      5.8 : 1.0
       contains(bronson) = True              neg : pos    =      5.7 : 1.0
  contains(surveillance) = True              neg : pos    =      5.7 : 1.0


In a document, a review that contains "unimaginative" is almost 8 times more likely to be negative than positive, while a review that mentions "singers" is about 6 times more likely to be positive than negative.

**Disadvantages of Naive Bayes**

The main limitation is the assumption of conditional independent predictors. In real life, it is almost impossible to get a set of predictors that are entirely independent.

reference: https://www.datacamp.com/community/tutorials/simplifying-sentiment-analysis-python