#### For text classification, we need a way to "describe" each piece of data, which is labeled as either positive or negative for machine learning training purposes.

#### These descriptions are called "features" in machine learning. For our project, we simply treat each word within a positive or negative review as a "feature" of that review.

#### We can then train a classifier by showing it the features (the words) of positive and negative reviews, and let it work out the most meaningful differences between the two by looking for words that are common in negative reviews and words that are common in positive reviews.
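
#### As a minimal sketch of the idea, the snippet below turns a review into word-presence features. The function name `presence_features`, the toy vocabulary, and the sample review are all hypothetical and chosen only for illustration.

```python
def presence_features(review_words, vocabulary):
    """Map each vocabulary word to True/False depending on whether it occurs in the review."""
    present = set(w.lower() for w in review_words)
    return {word: (word in present) for word in vocabulary}

# Hypothetical toy vocabulary and review, for illustration only
vocabulary = ["great", "boring", "plot", "wasted"]
review = "A great cast wasted on a boring script".split()

features = presence_features(review, vocabulary)
print(features)
# {'great': True, 'boring': True, 'plot': False, 'wasted': True}
```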

### Naive Bayes Algorithm
#### For basic text analysis, the Naive Bayes classifier is often the algorithm of choice. Part of the reason is that text data is often very large, and Naive Bayes is simple enough to be used at scale with minimal processing requirements.
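
#### To make the idea concrete, here is a small from-scratch sketch of Naive Bayes over boolean word features. It is not NLTK's implementation, which may differ in details such as smoothing; the helper names `train_nb` and `classify_nb`, the add-one smoothing, and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_features):
    """labeled_features: list of (feature_dict, label) pairs with boolean feature values."""
    label_counts = Counter(label for _, label in labeled_features)
    # true_counts[label][feature] = number of documents with that label where the feature is True
    true_counts = defaultdict(Counter)
    feature_names = set()
    for features, label in labeled_features:
        for name, value in features.items():
            feature_names.add(name)
            if value:
                true_counts[label][name] += 1
    return label_counts, true_counts, sorted(feature_names)

def classify_nb(model, features):
    """Return the label with the highest log prior plus summed log likelihoods."""
    label_counts, true_counts, feature_names = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior of the label
        for name in feature_names:
            # add-one smoothing over the two outcomes True/False (an assumption of this sketch)
            p_true = (true_counts[label][name] + 1) / (n + 2)
            p = p_true if features.get(name, False) else 1 - p_true
            score += math.log(p)  # "naive" step: features treated as independent
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data
toy_training = [
    ({"great": True, "boring": False}, "pos"),
    ({"great": True, "boring": False}, "pos"),
    ({"great": False, "boring": True}, "neg"),
    ({"great": False, "boring": True}, "neg"),
]
model = train_nb(toy_training)
prediction = classify_nb(model, {"great": True, "boring": False})
print(prediction)  # pos
```

#### Working in log space avoids multiplying thousands of small probabilities together, which matters once the feature set grows to a few thousand words.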

In [1]:
import nltk
import random
from nltk.corpus import movie_reviews

documents = []
for category in movie_reviews.categories():
    for fileid in movie_reviews.fileids(category):
        documents.append((list(movie_reviews.words(fileid)), category))

# the corpus lists the 1000 negative reviews first and the 1000 positive reviews next, so shuffle before splitting
random.shuffle(documents)    
#print(documents[2])        

all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())
    
all_words = nltk.FreqDist(all_words)  # converting list of words to frequency distribution dictionary
#print(all_words.most_common(15))
#print(all_words["stupid"])

# taking only top 3000 words as features
word_features = [w[0] for w in all_words.most_common(3000)]
#word_features = list(all_words.keys())[:3000]

# function to find words which are present in a document
def find_features(document):
    words= set(document)   # this is all unique words in the document
    features = {}
    for w in word_features:
        features[w] = (w in words)  # boolean of whether w is present in the document or not
        
    return features 

#print(find_features(movie_reviews.words('neg/cv000_29416.txt')))
# featureset is a list of (feature dict, category) tuples, one per document
featureset = [(find_features(rev), category) for (rev, category) in documents]   # rev is review words
#featureset[:2]
print(len(featureset))
training_set = featureset[:1900]
testing_set = featureset[1900:]

classifier = nltk.NaiveBayesClassifier.train(training_set)
print("Naive bayes Accuracy is : ", (nltk.classify.accuracy(classifier, testing_set))*100)
classifier.show_most_informative_features(15)

2000
Naive bayes Accuracy is :  81.0
Most Informative Features
             outstanding = True              pos : neg    =     13.5 : 1.0
                   jolie = True              neg : pos    =      8.4 : 1.0
                   mulan = True              pos : neg    =      8.2 : 1.0
                  finest = True              pos : neg    =      7.6 : 1.0
              schumacher = True              neg : pos    =      7.5 : 1.0
                  seagal = True              neg : pos    =      7.5 : 1.0
             wonderfully = True              pos : neg    =      7.4 : 1.0
                 idiotic = True              neg : pos    =      7.1 : 1.0
                   damon = True              pos : neg    =      6.0 : 1.0
                   inept = True              neg : pos    =      6.0 : 1.0
                   flynt = True              pos : neg    =      5.6 : 1.0
                    lame = True              neg : pos    =      5.4 : 1.0
                  wasted = True      

### Saving the classifier using pickle and loading it up again to predict on new values
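
#### The save-and-load round trip can be sketched in isolation as follows. The dictionary stands in for a trained classifier, and the file name `model.pickle` and temporary directory are hypothetical; any picklable Python object is handled the same way.

```python
import os
import pickle
import tempfile

# Stand-in for a trained classifier; any picklable Python object behaves the same way
model = {"labels": ["neg", "pos"], "note": "placeholder for a trained classifier"}

path = os.path.join(tempfile.mkdtemp(), "model.pickle")  # hypothetical file name

# "wb" opens the file for writing bytes; the with-block closes it automatically
with open(path, "wb") as f:
    pickle.dump(model, f)

# "rb" opens the file for reading bytes
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)  # True
```

#### Only unpickle files you trust, since loading a pickle can execute arbitrary code.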

In [2]:
import nltk
import random
from nltk.corpus import movie_reviews
import pickle

documents = []
for category in movie_reviews.categories():
    for fileid in movie_reviews.fileids(category):
        documents.append((list(movie_reviews.words(fileid)), category))

# the corpus lists the 1000 negative reviews first and the 1000 positive reviews next, so shuffle before splitting
random.shuffle(documents)    
#print(documents[2])        

all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())
    
all_words = nltk.FreqDist(all_words)  # converting list of words to frequency distribution dictionary
#print(all_words.most_common(15))
#print(all_words["stupid"])

# taking only top 3000 words as features
word_features = [w[0] for w in all_words.most_common(3000)]
#word_features = list(all_words.keys())[:3000]
# function to find words which are present in a document
def find_features(document):
    words= set(document)   # this is all unique words in the document
    features = {}
    for w in word_features:
        features[w] = (w in words)  # boolean of whether w is present in the document
        
    return features 

#print(find_features(movie_reviews.words('neg/cv000_29416.txt')))
featureset = [(find_features(rev), category) for (rev, category) in documents]
#featureset[:2]
print(len(featureset))
training_set = featureset[:1900]
testing_set = featureset[1900:]

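#classifier = nltk.NaiveBayesClassifier.train(training_set)

# load the previously saved classifier; naivebayes.pickle must already exist from an earlier run
# note: it was trained on a different shuffle, so testing_set may contain reviews it has
# already seen, and the accuracy printed below is likely optimistic
with open("naivebayes.pickle", 'rb') as classifier_f:
    classifier = pickle.load(classifier_f)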


print("Naive bayes Accuracy is : ", (nltk.classify.accuracy(classifier, testing_set))*100)
classifier.show_most_informative_features(15)

#uncomment below for saving the classifier
'''
# save_classifier is the file opened in write byte mode and file name is naivebayes.pickle
save_classifier = open("naivebayes.pickle", "wb")
pickle.dump(classifier, save_classifier)   # .dump() will save classifier in opened file
save_classifier.close()
'''

2000
Naive bayes Accuracy is :  87.0
Most Informative Features
             outstanding = True              pos : neg    =     11.2 : 1.0
                   mulan = True              pos : neg    =      8.4 : 1.0
                   damon = True              pos : neg    =      7.9 : 1.0
                  seagal = True              neg : pos    =      7.8 : 1.0
                  finest = True              pos : neg    =      7.6 : 1.0
                 idiotic = True              neg : pos    =      7.4 : 1.0
             wonderfully = True              pos : neg    =      7.3 : 1.0
              schumacher = True              neg : pos    =      7.0 : 1.0
                   inept = True              neg : pos    =      6.1 : 1.0
                   jolie = True              neg : pos    =      5.8 : 1.0
                 flubber = True              neg : pos    =      5.6 : 1.0
                    lame = True              neg : pos    =      5.6 : 1.0
                   waste = True      
