# Sentiment analysis with NLTK Naive Bayes (Bernoulli?)

## Naive Bayes with NLTK

A Naive Bayes classifier estimates the probability that an input text belongs to each of a set of classes, e.g. predicting whether a review is positive or negative.

It is ‘Naive’ because it assumes that the words in a text are independent of one another given the class (even though, in natural human language, word order and co-occurrence convey contextual information). Despite this assumption, Naive Bayes can reach good accuracy even when trained on only a small training set.

- Baines, O., Naive Bayes: Machine Learning and Text Classification Application of Bayes’ Theorem. https://journals.le.ac.uk/ojs1/index.php/lumj/article/download/3484/3110
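The independence assumption can be written out explicitly. The block below is the standard Naive Bayes decision rule, not a formula quoted from the reference above; the symbols $c$ (a class such as positive or negative) and $w_1, \dots, w_n$ (the word features of a text) are notation introduced here for illustration.

$$
P(c \mid w_1, \dots, w_n) \;\propto\; P(c) \prod_{i=1}^{n} P(w_i \mid c),
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
$$

The product over individual words is where the 'naive' assumption enters: each $P(w_i \mid c)$ is estimated on its own, ignoring every other word in the text.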

### imports

In [3]:
import nltk
import pandas as pd
import sys
sys.path.append("..") # Adds higher directory to python modules path.
from NLPmoviereviews.data import load_data_sent
from NLPmoviereviews.utilities import preprocessing

### 1. Load data

In [4]:
# load data
X_train, y_train, X_test, y_test = load_data_sent(percentage_of_sentences=10)


### 2. Prepare text

In [5]:
# remove custom stop-words
def rm_custom_stops(sentence):
    '''
    Custom stop word remover
    Parameters:
        sentence (str): a string of words
    Returns:
        list_of_words (list): cleaned sentence as a list of words
    '''
    words = sentence.split()
    stop_words = {'br', 'movie', 'film'}
    
    return [w for w in words if w not in stop_words]
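
A quick way to see what the stop-word remover does is to run it on a short string. The sketch below repeats the function so that it runs on its own; the example sentence is made up and assumed to be already cleaned (lowercase, no punctuation).

```python
def rm_custom_stops(sentence):
    """Split a cleaned sentence on whitespace and drop the custom stop words."""
    stop_words = {'br', 'movie', 'film'}
    return [w for w in sentence.split() if w not in stop_words]


# Hypothetical, already-cleaned input (not taken from the dataset)
example = "great film br br the movie has a great cast"
result = rm_custom_stops(example)
print(result)
# ['great', 'the', 'has', 'a', 'great', 'cast']
```

Matching is exact and case-sensitive, so tokens such as `Film` or `films` would pass through unchanged. That is presumably why the notebook applies this step only after `preprocessing`.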

In [6]:
# perform preprocessing (cleaning) & transform to dataframe
def process_df(X, y):
    '''
    Transform texts and labels into dataframe of 
    cleaned texts (as list of words) and human readable target labels
    
    Parameters:
        X (list): list of strings (reviews)
        y (list): list of target labels (0/1)
    Returns:
        df (dataframe): dataframe of processed reviews (as list of words)
                        and corresponding sentiment label (positive/negative)
    '''
    # create dataframe from data
    d = {'text': X, 'sentiment': y}
    df = pd.DataFrame(d)
    
    # make sentiment human-readable
    df['sentiment'] = df.sentiment.map(lambda x: 'positive' if x==1 else 'negative')

    # clean and split text into list of words
    df['text'] = df.text.apply(preprocessing)
    df['text'] = df.text.apply(rm_custom_stops)

    # return the processed dataframe
    return df

In [7]:
# process data
train_df = process_df(X_train, y_train)
test_df = process_df(X_test, y_test)

In [8]:
# inspect dataframe
train_df.head()

| | text | sentiment |
|---|---|---|
| 0 | [absolutely, terrible, dont, lure, christopher... | negative |
| 1 | [know, fall, asleep, usually, due, combination... | negative |
| 2 | [mann, photograph, alberta, rocky, mountain, s... | negative |
| 3 | [kind, snowy, sunday, afternoon, rest, world, ... | positive |
| 4 | [others, mention, woman, go, nude, mostly, abs... | positive |


### 3. Create list of most common words

In [9]:
# get frequency distribution of words in corpus & select 2000 most common words
def most_common(df, n=2000):
    '''
    Get n most common words from data frame of text reviews
    
    Parameters:
        df (dataframe): dataframe with column of processed text reviews
        n (int): number of most common words to get
    Returns:
        most_common_words (list): list of n most common words
    '''
    # create list of all words in the train data
    complete_corpus = df.text.sum()
    
    # Construct a frequency dict of all words in the overall corpus 
    all_words = nltk.FreqDist(w.lower() for w in complete_corpus)

    # select the n most frequent words (incl. frequency)
    most_common_words = all_words.most_common(n)
    
    return [item[0] for item in most_common_words]
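
`df.text.sum()` builds the corpus by concatenating lists one after another, which may become slow on larger corpora. A possible alternative is sketched below. It uses `collections.Counter` as a stand-in for `nltk.FreqDist` (in NLTK 3, `FreqDist` subclasses `Counter`, so `most_common` should behave the same way). The helper name `most_common_words` and the toy corpus are illustrative, not part of the notebook.

```python
from collections import Counter
from itertools import chain


def most_common_words(token_lists, n=2000):
    """Return the n most frequent words across an iterable of token lists."""
    # chain.from_iterable streams the tokens without building one big list
    counts = Counter(w.lower() for w in chain.from_iterable(token_lists))
    return [word for word, _ in counts.most_common(n)]


# Toy corpus, invented for illustration
reviews = [
    ['good', 'story', 'good', 'cast'],
    ['bad', 'story'],
    ['good', 'ending'],
]
top_two = most_common_words(reviews, n=2)
print(top_two)
# ['good', 'story']
```

With the notebook's dataframe, the call would look like `most_common_words(train_df.text)`.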

In [13]:
# get 2000 most common words
most_common_2000 = most_common(train_df)

# inspect first 10 most common words
most_common_2000[0:10]

['one',
 'make',
 'like',
 'see',
 'get',
 'time',
 'good',
 'go',
 'watch',
 'character']

### 4. Create nltk featuresets from train/test

For the NLTK Naive Bayes classifier, each review must be turned into a featureset: a dict that records, for every word in `most_common_words`, whether that word occurs in the tokenized review. These True/False entries constitute the review's features.

In [14]:
# for a given text, create a featureset (dict of features - {'word': True/False})
def review_features(review, most_common_words):
    '''
    Feature extractor that checks whether each of the most
    common words is present in a given review
    
    Parameters:
        review (list): text reviews as list of words
        most_common_words (list): list of n most common words
    Returns:
        features (dict): dict of most common words & corresponding True/False
    '''
    review_words = set(review)
    features = {}
    for word in most_common_words:
        features['contains(%s)' % word] = (word in review_words)
    return features
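
To make the shape of a featureset concrete, here is a small sketch of the extractor applied to a toy vocabulary. The function is repeated (in condensed form) so the block runs on its own; the three-word vocabulary and the review are invented and stand in for `most_common_2000` and a real review.

```python
def review_features(review, most_common_words):
    """Map each vocabulary word to True/False depending on its presence in the review."""
    review_words = set(review)
    return {f'contains({word})': (word in review_words) for word in most_common_words}


vocabulary = ['good', 'bad', 'story']       # toy stand-in for most_common_2000
review = ['good', 'story', 'good', 'cast']  # toy tokenized review

features = review_features(review, vocabulary)
print(features)
# {'contains(good)': True, 'contains(bad)': False, 'contains(story)': True}
```

Two things are visible here: a repeated word (`good`) counts only once, and a word outside the vocabulary (`cast`) is ignored. The features record presence only, not frequency.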

In [15]:
# create featureset for each text in a given dataframe
def make_set(df, most_common_words):
    '''
    Generates nltk featuresets for each movie review in dataframe.
    Feature sets are composed of a dict describing whether each of the most 
    common words is present in the text review or not

    Parameters:
        df (dataframe): processed dataframe of text reviews
        most_common_words (list): list of most common words
    Returns:
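        feature_set (list): list of (features, label) tuples, where features is a
                            dict of most common words & corresponding True/False
    '''
    # iterate over the columns directly, so the result does not depend on the dataframe index
    return [(review_features(text, most_common_words), label)
            for text, label in zip(df.text, df.sentiment)]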

In [16]:
# make data into featuresets (for nltk naive bayes classifier)
train_set = make_set(train_df, most_common_2000)
test_set = make_set(test_df, most_common_2000)

In [17]:
# inspect first train featureset
train_set[0]

({'contains(one)': False,
  'contains(make)': True,
  'contains(like)': True,
  'contains(see)': False,
  'contains(get)': False,
  'contains(time)': False,
  'contains(good)': True,
  'contains(go)': False,
  'contains(watch)': False,
  'contains(character)': False,
  'contains(even)': True,
  'contains(think)': False,
  'contains(story)': False,
  'contains(would)': False,
  'contains(really)': False,
  'contains(show)': False,
  'contains(scene)': True,
  'contains(well)': False,
  'contains(look)': False,
  'contains(end)': False,
  'contains(say)': False,
  'contains(much)': False,
  'contains(give)': False,
  'contains(great)': True,
  'contains(first)': False,
  'contains(bad)': False,
  'contains(act)': True,
  'contains(way)': False,
  'contains(know)': False,
  'contains(also)': False,
  'contains(people)': False,
  'contains(dont)': True,
  'contains(take)': False,
  'contains(thing)': False,
  'contains(play)': False,
  'contains(find)': False,
  'contains(love)': True,
  ...

### 5. Train & evaluate model (naive bayes classifier)

In [18]:
# Train a naive bayes classifier with train set by nltk
classifier = nltk.NaiveBayesClassifier.train(train_set)
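
The training call expects a list of `(featureset, label)` pairs. A minimal sketch on invented toy data shows the expected input and the resulting classifier. The names `toy_train` and `toy_classifier` are illustrative only, and the four examples are far too few for a meaningful model.

```python
import nltk

# Toy featuresets (invented): (feature dict, label) pairs, shaped like the output of make_set
toy_train = [
    ({'contains(good)': True,  'contains(bad)': False}, 'positive'),
    ({'contains(good)': True,  'contains(bad)': False}, 'positive'),
    ({'contains(good)': False, 'contains(bad)': True},  'negative'),
    ({'contains(good)': False, 'contains(bad)': True},  'negative'),
]

toy_classifier = nltk.NaiveBayesClassifier.train(toy_train)

# The labels are learned from the training pairs
print(sorted(toy_classifier.labels()))
# ['negative', 'positive']

prediction = toy_classifier.classify({'contains(good)': True, 'contains(bad)': False})
print(prediction)
# positive
```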

In [19]:
# Get the accuracy of the naive bayes classifier with test set
accuracy = nltk.classify.accuracy(classifier, test_set)
accuracy

0.8332
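
The score above is the fraction of test reviews whose predicted label matches the true label. The sketch below spells out that calculation on made-up labels; the helper `accuracy` is written here for illustration and is not NLTK's implementation.

```python
def accuracy(gold_labels, predicted_labels):
    """Fraction of positions where the predicted label equals the gold label."""
    matches = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return matches / len(gold_labels)


# Invented labels for illustration
gold = ['positive', 'negative', 'negative', 'positive']
predicted = ['positive', 'negative', 'positive', 'positive']

score = accuracy(gold, predicted)
print(score)
# 0.75
```

With the notebook's objects, the predictions could be obtained with `classifier.classify_many([feats for feats, _ in test_set])` and the gold labels with `[label for _, label in test_set]`.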

In [20]:
# Debug info: show top n most informative features
classifier.show_most_informative_features(10)

Most Informative Features
     contains(underrate) = True           positi : negati =     16.7 : 1.0
    contains(ridiculous) = True           negati : positi =     16.4 : 1.0
       contains(unfunny) = True           negati : positi =     13.3 : 1.0
          contains(hong) = True           positi : negati =      9.9 : 1.0
        contains(unfold) = True           positi : negati =      9.6 : 1.0
   contains(wonderfully) = True           positi : negati =      8.6 : 1.0
         contains(appal) = True           negati : positi =      8.4 : 1.0
          contains(lame) = True           negati : positi =      8.0 : 1.0
          contains(dumb) = True           negati : positi =      7.9 : 1.0
         contains(awful) = True           negati : positi =      7.8 : 1.0


### 6. Make prediction

In [21]:
# predict on new review (from mubi.com)
new_review = ("Surprisingly effective and moving, The Balcony Movie takes the Front Up "
              "concept of talking to strangers, but here attaches it to a fixed perspective "
              "in order to create a strong sense of the stream of life passing us by. "
              "It's possible to not only witness the subtle changing of seasons "
              "but also the gradual opening of trust and confidence in Lozinski's "
              "repeating characters. A Pandemic movie, pre-pandemic. 3.5 stars")

In [22]:
# perform preprocessing (cleaning & featureset transformation)
processed_review = rm_custom_stops(preprocessing(new_review))
processed_review = review_features(processed_review, most_common_2000)

In [23]:
# predict label
classifier.classify(processed_review)

'positive'
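
`classify` returns only the winning label. To see how confident the model is, NLTK classifiers also offer `prob_classify`, which returns a probability distribution over the labels. The sketch below uses an invented toy classifier so that it runs on its own; with the notebook's objects the call would be `classifier.prob_classify(processed_review)`.

```python
import nltk

# Toy training data (invented), shaped like the notebook's featuresets
toy_train = [
    ({'contains(good)': True,  'contains(bad)': False}, 'positive'),
    ({'contains(good)': True,  'contains(bad)': False}, 'positive'),
    ({'contains(good)': False, 'contains(bad)': True},  'negative'),
    ({'contains(good)': False, 'contains(bad)': True},  'negative'),
]
toy_classifier = nltk.NaiveBayesClassifier.train(toy_train)

# Probability distribution over labels for one featureset
dist = toy_classifier.prob_classify({'contains(good)': True, 'contains(bad)': False})

for label in dist.samples():
    print(f"{label}: {dist.prob(label):.3f}")

# The most probable label is the one classify() would return
print(dist.max())
```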

In [24]:
# print P(feature = value | label) for the 50 most informative features, adapted from:
# https://stackoverflow.com/questions/20773200/python-nltk-naive-bayes-probabilities
# note: _feature_probdist is a private attribute of the nltk classifier
for label in classifier.labels():
    print(f'\n\n{label}:')
    for fname, fval in classifier.most_informative_features(50):
        prob = classifier._feature_probdist[label, fname].prob(fval)
        print(f"   {fname}({fval}): {100 * prob:.2f}%")



negative:
   contains(underrate)(True): 0.12%
   contains(ridiculous)(True): 5.96%
   contains(unfunny)(True): 1.62%
   contains(hong)(True): 0.12%
   contains(unfold)(True): 0.20%
   contains(wonderfully)(True): 0.28%
   contains(appal)(True): 1.70%
   contains(lame)(True): 5.49%
   contains(dumb)(True): 3.52%
   contains(awful)(True): 10.39%
   contains(waste)(True): 12.68%
   contains(laughable)(True): 2.65%
   contains(worst)(True): 14.89%
   contains(rubbish)(True): 2.01%
   contains(ninja)(True): 0.83%
   contains(gon)(True): 1.94%
   contains(pointless)(True): 3.04%
   contains(redeem)(True): 3.04%
   contains(magnificent)(True): 0.28%
   contains(giallo)(True): 0.12%
   contains(asleep)(True): 1.78%
   contains(subtle)(True): 0.43%
   contains(horrible)(True): 6.04%
   contains(kong)(True): 0.20%
   contains(crap)(True): 5.96%
   contains(garbage)(True): 2.57%
   contains(horribly)(True): 2.09%
   contains(wooden)(True): 2.09%
   contains(gritty)(True): 0.28%
   contains(were