## Naive Bayes Movie Sentiment Analysis

Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem with the "naive" assumption that
every pair of features is conditionally independent given the class label. The classifier is trained on one portion of the labelled data (the training set) and evaluated on a held-out portion (the test set).
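
To make the independence assumption concrete, here is a minimal sketch that scores a review under each label by combining a class prior with one likelihood term per word. It is a word-count (multinomial) variant with add-one smoothing, chosen for brevity; the toy corpus and the names `toy_train`, `train_toy_nb` and `classify_toy_nb` are illustrative assumptions and are not used elsewhere in this notebook.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not taken from the rt-polarity data).
toy_train = [
    ("a great fun movie".split(), "positive"),
    ("great acting".split(), "positive"),
    ("a dull boring movie".split(), "negative"),
    ("boring plot".split(), "negative"),
]

def train_toy_nb(data):
    """Count documents per label and word occurrences per label."""
    label_counts = Counter(label for _, label in data)
    word_counts = {label: Counter() for label in label_counts}
    for tokens, label in data:
        word_counts[label].update(tokens)
    vocab = {w for tokens, _ in data for w in tokens}
    return label_counts, word_counts, vocab

def classify_toy_nb(tokens, model):
    """Return the label with the highest log prior + summed log likelihoods."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)  # log P(label)
        total_words = sum(word_counts[label].values())
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                # add-one smoothed log P(word | label); each word contributes
                # independently, which is the "naive" assumption
                score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

toy_model = train_toy_nb(toy_train)
print(classify_toy_nb("a boring movie".split(), toy_model))  # negative
```

Working in log space avoids numeric underflow when many small probabilities are multiplied together.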

In [1]:
posReview = 'C:/Users/roush/Downloads/Python Sentiment Analysis/rt-polaritydata/rt-polaritydata/rt-polarity.pos'

In [2]:
with open(posReview, 'r') as f:
    posReviews = f.readlines() 

In [3]:
negReview = 'C:/Users/roush/Downloads/Python Sentiment Analysis/rt-polaritydata/rt-polaritydata/rt-polarity.neg'

In [4]:
with open(negReview, 'r') as f:
    negReviews = f.readlines()

Splitting the corpus into training and test data

In [5]:
pos_train_data = posReviews[:2500]
neg_train_data = negReviews[:2500]

In [6]:
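# Note: the slices start at 2501, so the review at index 2500 of each file
# ends up in neither the training set nor the test set.
pos_test_data = posReviews[2501:]
neg_test_data = negReviews[2501:]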

Creating a vocabulary of all the words in the training data

In [7]:
pos_list= []
for line in pos_train_data:
    for word in line.split():
        pos_list.append(word)

In [8]:
neg_list = [word for line in neg_train_data for word in line.split()]

In [9]:
all_words=[]
for sublist in [pos_list, neg_list]:
    for words in sublist:
        all_words.append(words)

Eliminate duplicates

In [10]:
vocabulary = set(all_words)
len(vocabulary)

14102

Setting up the training data. Each review is first tagged with its label, then converted to a (tokens, label) tuple.

In [11]:
tagged_pos_data = []
for review in pos_train_data:
    tagged_pos_data.append({'review':review.split(), 'label':'positive'})

In [12]:
tagged_neg_data = [{'review':review.split(), 'label':'negative'} for review in neg_train_data]

In [13]:
full_tagged = []
for sublist in [tagged_pos_data, tagged_neg_data]:
    for review in sublist:
        full_tagged.append(review)

In [14]:
for review in full_tagged[0:2]:
    print (review)

{'review': ['the', 'rock', 'is', 'destined', 'to', 'be', 'the', '21st', "century's", 'new', '"', 'conan', '"', 'and', 'that', "he's", 'going', 'to', 'make', 'a', 'splash', 'even', 'greater', 'than', 'arnold', 'schwarzenegger', ',', 'jean-claud', 'van', 'damme', 'or', 'steven', 'segal', '.'], 'label': 'positive'}
{'review': ['the', 'gorgeously', 'elaborate', 'continuation', 'of', '"', 'the', 'lord', 'of', 'the', 'rings', '"', 'trilogy', 'is', 'so', 'huge', 'that', 'a', 'column', 'of', 'words', 'cannot', 'adequately', 'describe', 'co-writer/director', 'peter', "jackson's", 'expanded', 'vision', 'of', 'j', '.', 'r', '.', 'r', '.', "tolkien's", 'middle-earth', '.'], 'label': 'positive'}


In [15]:
training_data =[]
for review in full_tagged:
    training_data.append((review['review'],review['label']))

In [16]:
training_data[0]

(['the',
  'rock',
  'is',
  'destined',
  'to',
  'be',
  'the',
  '21st',
  "century's",
  'new',
  '"',
  'conan',
  '"',
  'and',
  'that',
  "he's",
  'going',
  'to',
  'make',
  'a',
  'splash',
  'even',
  'greater',
  'than',
  'arnold',
  'schwarzenegger',
  ',',
  'jean-claud',
  'van',
  'damme',
  'or',
  'steven',
  'segal',
  '.'],
 'positive')

In [17]:
import nltk

In [18]:
def feature_extraction(review):
    review_words= set(review)
    features = {}
    for word in vocabulary:
        features[word] = (word in review_words)   # True if the word occurs in the review, else False
    return features
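
As a small illustration of what this produces, the sketch below applies the same idea to a three-word vocabulary. The names `toy_vocabulary` and `toy_feature_extraction` are hypothetical and take the vocabulary as an argument so the example runs on its own.

```python
# Hypothetical three-word vocabulary, used only for illustration.
toy_vocabulary = {"good", "bad", "movie"}

def toy_feature_extraction(review_tokens, vocab):
    """Map every vocabulary word to True/False depending on presence in the review."""
    review_words = set(review_tokens)
    return {word: (word in review_words) for word in vocab}

toy_features = toy_feature_extraction("a good movie".split(), toy_vocabulary)
print(sorted(toy_features.items()))
# [('bad', False), ('good', True), ('movie', True)]
```

Every review is mapped to a dictionary with one entry per vocabulary word, so with the full vocabulary each feature vector has as many keys as there are distinct training words. Words outside the vocabulary (here, 'a') are dropped.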

Convert the training data into a list of feature vectors. Each element of training_features is a tuple whose first element is the feature vector and whose second element is the label.

In [19]:
training_features = nltk.classify.apply_features(feature_extraction, training_data)

In [20]:
trainedNBClassifier = nltk.NaiveBayesClassifier.train(training_features)

In [21]:
def naiveBayesSentiment(review):
    words = review.split()   # whitespace tokenization, same as for the training data
    feature_vector = feature_extraction(words)
    return trainedNBClassifier.classify(feature_vector)

In [22]:
naiveBayesSentiment('You are the Worst')

'negative'
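
The training tokens printed earlier are all lowercase, so capitalized words such as 'You' and 'Worst' would not match any vocabulary entry if the rest of the corpus is lowercase as well (an assumption based on those samples). A minimal sketch of normalizing the input before feature extraction follows; `normalize_review` is a hypothetical helper, not part of the notebook.

```python
def normalize_review(text):
    """Lowercase and whitespace-tokenize a raw review string."""
    return text.lower().split()

print(normalize_review('You are the Worst'))
# ['you', 'are', 'the', 'worst']
```

The resulting token list could be passed to feature_extraction in place of the plain `review.split()` result.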

In [23]:
def reviewSentiments(naiveBayesSentiment):
    test_pos = [naiveBayesSentiment(review) for review in pos_test_data]
    test_neg = [naiveBayesSentiment(review) for review in neg_test_data]
    label = {'positive': 1, 'negative':-1}
    numeric_pos = [label[x] for x in test_pos]
    numeric_neg = [label[x] for x in test_neg]
    return {'positive': numeric_pos, 'negative': numeric_neg}

In [24]:
def runDiagnostics(reviewResult):
    posResult = reviewResult['positive']
    negResult = reviewResult['negative']
    truePositive = sum(x>0 for x in posResult)
    trueNegative = sum(x<0 for x in negResult)
    pctTruepos = float(truePositive)/len(posResult)
    pctTrueneg = float(trueNegative)/len(negResult)
    totalTrue = truePositive + trueNegative
    total = len(posResult) + len(negResult)
    print ('Accuracy of Positive reviews=' + '%.2f'% (pctTruepos *100) + '%' )
    print ('Accuracy of Negative reviews=' + '%.2f'% (pctTrueneg *100) + '%' )
    print ('Overall Accuracy=' + '%.2f' % (totalTrue *100/total) + '%' )

In [25]:
runDiagnostics(reviewSentiments(naiveBayesSentiment))

Accuracy of Positive reviews=73.39%
Accuracy of Negative reviews=77.07%
Overall Accuracy=75.23%
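
To see how these three figures are derived, the sketch below repeats the same arithmetic on a small hand-made result. The function `toy_accuracy` and the data in `toy_result` are illustrative assumptions, not output from the classifier above.

```python
def toy_accuracy(review_result):
    """Return (positive accuracy, negative accuracy, overall accuracy) as fractions."""
    pos = review_result['positive']   # predictions for truly positive reviews
    neg = review_result['negative']   # predictions for truly negative reviews
    true_pos = sum(x > 0 for x in pos)   # positives predicted as +1
    true_neg = sum(x < 0 for x in neg)   # negatives predicted as -1
    overall = (true_pos + true_neg) / (len(pos) + len(neg))
    return true_pos / len(pos), true_neg / len(neg), overall

# Hypothetical predictions: 3 of 4 positives and 4 of 5 negatives are correct.
toy_result = {'positive': [1, 1, -1, 1], 'negative': [-1, 1, -1, -1, -1]}
pos_acc, neg_acc, overall_acc = toy_accuracy(toy_result)
print('%.2f%% %.2f%% %.2f%%' % (pos_acc * 100, neg_acc * 100, overall_acc * 100))
# 75.00% 80.00% 77.78%
```

The overall figure is weighted by the number of reviews in each class, so it equals the plain average of the two per-class accuracies only when both test sets have the same size.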
