# Conducting Sentiment Analysis:

Input: Text of a company's Management Discussion and Analysis (MD&A) section from Form 10-Q or 10-K. 
Output: Sentiment classified as either Positive or Negative, along with a sentiment confidence score expressed as a percentage.

The code is written in Python and uses the following libraries:

* pandas
* nltk
* scikit-learn (sklearn)
* random
* statistics

In addition to the above, it uses the following files:

    * lemur-stopwords.txt: File containing stop words
    * clasfuncdef.py:      Contains some of the custom functions and classes
    * Negative Terms.csv:  Training file of negative terms denoting Negative Sentiment
    * Positive Terms.csv:  Training file of positive terms denoting Positive Sentiment
    * TestNegative.txt:    Sample 10-K input file denoting Negative Sentiment
    * TestPositive.txt:    Sample 10-K input file denoting Positive Sentiment
    
The program evaluates the accuracy of each of the following classifiers and then combines several of them into a voting classifier that evaluates the 10-K input.

    * Naive Bayes Classifier
    * MultinomialNB Classifier
    * BernoulliNB Classifier
    * Logistic Regression Classifier
    * SGD Classifier
    * LinearSVC Classifier


In [83]:
import pandas as pd
import random
import clasfuncdef as cfd
import nltk
from statistics import mode
from nltk.tokenize import word_tokenize
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB,BernoulliNB
from sklearn.linear_model import LogisticRegression,SGDClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC

In [84]:
print("Importing Positive Terms..")
positive_terms = open("Positive Terms.csv","r").read()
print("Import Positive Terms complete !!")

print("")
print("Importing Negative Terms..")
negative_terms = open("Negative Terms.csv","r").read()
print("Import Negative Terms complete !!")
print("")

fname = 'TestPositive.txt'
# fname = 'TestNegative.txt'

# print("Importing Comments whose sentiment shall be analysed..")
# print("")
# print("Enter TestPositive.txt to analyse a Positive sample")
# print("Enter TestNegative.txt to analyse a Negative sample")
# print("")

# fname = input("Enter file name to analyze the sentiment: ")

try:
    src_comments = open(fname,"r").read()
except Exception as e:
    print('Unable to read the file !!')
    exit()

Importing Positive Terms..
Import Positive Terms complete !!

Importing Negative Terms..
Import Negative Terms complete !!



Build the vocabulary and collect sample Positive and Negative terms:

This step uses the find_features function defined in the file clasfuncdef.py.
This function:
* Uses a lemmatizer (WordNetLemmatizer) instead of a stemmer, on the assumption that performance is not a concern for this project.
* Uses the lemur-stopwords.txt file to remove stop words.
* Builds the feature set used by the classifiers.

Below is a snapshot of the function.
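def find_features(features_list, term):

    # Lemmatize the input, then split it into word tokens.
    lemmatizer = WordNetLemmatizer()
    lemmatized = lemmatizer.lemmatize(term)
    words = word_tokenize(lemmatized)
    # Drop tokens that appear in the contents of the stop-word file.
    stop_words = open("lemur-stopwords.txt","r").read()
    filtered_sentence = [w for w in words if w not in stop_words]

    # Mark each vocabulary term as present (True) or absent (False).
    features = {}
    for w in features_list:
        features[w] = (w in filtered_sentence)

    return features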
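
The snapshot passes the whole text to the lemmatizer in one call and tests stop words against the raw contents of the file, which is a substring test. As a variation, the sketch below tokenizes first, lemmatizes each token, and matches stop words exactly against a set. It is only an illustration under assumptions: the name find_features_sketch, the regex tokenizer, the lemmatize parameter and the sample stop words are hypothetical and are not part of clasfuncdef.py.

```python
import re

def find_features_sketch(features_list, text, stop_words, lemmatize=lambda w: w):
    """Map each vocabulary term to True/False depending on whether it occurs in text."""
    # Lowercase and split into word tokens (a simple stand-in for word_tokenize).
    tokens = re.findall(r"[a-z']+", text.lower())
    # Exact, set-based stop-word test; lemmatize each remaining token separately.
    kept = {lemmatize(t) for t in tokens if t not in stop_words}
    return {term: (term in kept) for term in features_list}

# Hypothetical sample data, not taken from lemur-stopwords.txt.
sample_stop_words = {"the", "in", "was", "a", "cannot"}
sample_features = find_features_sketch(
    ["growth", "loss", "can"],
    "The growth in revenue was a loss we can absorb.",
    sample_stop_words,
)
print(sample_features)
```

With a set, "can" is kept even though "cannot" is a stop word; a substring test against the raw stop-word text would drop it. To use WordNet, one could pass lemmatize=WordNetLemmatizer().lemmatize, assuming the WordNet data is installed.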

In [85]:
print('Building FeatureSet..')
terms_collection = []
vocabulary = []

for r in positive_terms.split('\n'):
    terms_collection.append( (r, "positive") )
    vocabulary.append(r.lower())

for r in negative_terms.split('\n'):
    terms_collection.append( (r, "negative") )
    vocabulary.append(r.lower())

vocabulary = nltk.FreqDist(vocabulary)
features_list = list(vocabulary.keys())
featuresets = [(cfd.find_features(features_list, trm), category) for (trm, category) in terms_collection]
random.shuffle(featuresets)

print('FeatureSet determination complete !!')

Building FeatureSet..
FeatureSet determination complete !!
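
Splitting the raw file contents on '\n' yields an empty string for any blank or trailing line, and that empty string then enters the vocabulary as a term. A minimal sketch of a cleaning step, assuming one term per line; load_terms is a hypothetical helper and the sample text is made up:

```python
def load_terms(raw_text):
    """Return lowercased, non-empty terms, one per input line."""
    terms = []
    for line in raw_text.splitlines():
        term = line.strip().lower()
        if term:  # skip blank lines, e.g. the one after a trailing newline
            terms.append(term)
    return terms

# Made-up sample standing in for the contents of a terms file.
sample = "Growth\nProfit\n\nRecord Revenue\n"
cleaned = load_terms(sample)
print(cleaned)
```

The cleaned list could then replace positive_terms.split('\n') and negative_terms.split('\n') in the loops above.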


Split the data into training and test sets for the classifier evaluation. Each classifier is trained on the training set with its train method and then tested for accuracy on the test set.

In [86]:
set_training = featuresets[:50]
set_testing =  featuresets[50:]
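
The split above uses a fixed index of 50, so the share of data used for training depends on how many terms the two files contain. A proportional split is one alternative. The sketch below is an assumption-laden illustration: split_featuresets, the 0.8 fraction and the seed are hypothetical choices, not values used in this project.

```python
import random

def split_featuresets(featuresets, train_fraction=0.8, seed=0):
    """Shuffle a copy of the data and split it by fraction."""
    shuffled = list(featuresets)            # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # seeded shuffle for a reproducible split
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Dummy (features, label) pairs standing in for the real featuresets.
dummy = [({"f": i % 2 == 0}, "positive" if i % 2 == 0 else "negative")
         for i in range(10)]
train, test = split_featuresets(dummy)
print(len(train), len(test))
```

Fixing the seed makes the reported accuracies repeatable from run to run, which the unseeded random.shuffle call does not guarantee.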

Evaluate Naive Bayes Classifier:

In [87]:
NB_classifier = nltk.NaiveBayesClassifier.train(set_training)
print("Naive Bayes classifier accuracy %age:", (nltk.classify.accuracy(NB_classifier, set_testing))*100)
print('')

Naive Bayes classifier accuracy %age: 69.64285714285714
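
The accuracy figure is the fraction of test items whose predicted label equals the gold label, multiplied by 100. The sketch below reproduces that arithmetic on toy data; accuracy_sketch, toy_classify and the four test items are hypothetical and only illustrate the calculation.

```python
def accuracy_sketch(classify, labelled_featuresets):
    """Fraction of (features, label) pairs for which classify(features) == label."""
    correct = sum(1 for features, label in labelled_featuresets
                  if classify(features) == label)
    return correct / len(labelled_featuresets)

def toy_classify(features):
    # Toy rule: predict "positive" whenever the "growth" feature is present.
    return "positive" if features.get("growth") else "negative"

toy_test_set = [
    ({"growth": True}, "positive"),   # predicted positive, correct
    ({"growth": False}, "negative"),  # predicted negative, correct
    ({"growth": True}, "negative"),   # predicted positive, wrong
    ({"growth": False}, "negative"),  # predicted negative, correct
]
toy_accuracy = accuracy_sketch(toy_classify, toy_test_set) * 100
print(toy_accuracy)
```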



Evaluate MultinomialNB Classifier:

In [88]:
MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(set_training)
print("MultinomialNB classifier accuracy %age:", (nltk.classify.accuracy(MNB_classifier, set_testing))*100)
print('')

MultinomialNB classifier accuracy %age: 69.64285714285714



Evaluate BernoulliNB Classifier:

In [89]:
BernoulliNB_classifier = SklearnClassifier(BernoulliNB())
BernoulliNB_classifier.train(set_training)
print("BernoulliNB classifier accuracy %age:", (nltk.classify.accuracy(BernoulliNB_classifier, set_testing))*100)
print('')

BernoulliNB classifier accuracy %age: 64.28571428571429



Evaluate Logistic Regression Classifier:

In [90]:
LogisticRegression_classifier = SklearnClassifier(LogisticRegression())
LogisticRegression_classifier.train(set_training)
print("LogisticRegression classifier accuracy %age:", (nltk.classify.accuracy(LogisticRegression_classifier, set_testing))*100)
print('')

LogisticRegression classifier accuracy %age: 69.64285714285714



Evaluate SGD Classifier:

In [91]:
SGDClassifier_classifier = SklearnClassifier(SGDClassifier(max_iter=5, tol=None))
SGDClassifier_classifier.train(set_training)
print("SGD classifier accuracy %age:", (nltk.classify.accuracy(SGDClassifier_classifier, set_testing))*100)
print('')

SGD classifier accuracy %age: 69.64285714285714



Evaluate LinearSVC Classifier:

In [92]:
LinearSVC_classifier = SklearnClassifier(LinearSVC())
LinearSVC_classifier.train(set_training)
print("LinearSVC classifier accuracy %age:", (nltk.classify.accuracy(LinearSVC_classifier, set_testing))*100)
print('')

LinearSVC classifier accuracy %age: 69.64285714285714
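
The accuracies printed in the cells above can be ranked side by side. The sketch below copies the reported figures into a dictionary by hand; the variable names are hypothetical, and in a live notebook the values would come from the nltk.classify.accuracy calls instead.

```python
# Accuracy percentages as printed by the evaluation cells above.
reported = {
    "Naive Bayes": 69.64285714285714,
    "MultinomialNB": 69.64285714285714,
    "BernoulliNB": 64.28571428571429,
    "LogisticRegression": 69.64285714285714,
    "SGD": 69.64285714285714,
    "LinearSVC": 69.64285714285714,
}

# Sort from highest to lowest accuracy.
ranked = sorted(reported.items(), key=lambda item: item[1], reverse=True)
best_score = ranked[0][1]
# Several classifiers may share the top score, so collect all of them.
top = [name for name, score in ranked if score == best_score]

for name, score in ranked:
    print(f"{name:20s} {score:.2f}")
```

Because five of the six classifiers share the top score on this test set, accuracy alone does not single out one classifier.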



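Combine the classifiers into a single voting classifier:
selectClassifier is defined in the file clasfuncdef.py. It asks each classifier for a label and returns the label with the most votes; the confidence is the share of classifiers that voted for that label.
Below is a snapshot of the code:

Before Python 3.8, statistics.mode raises a StatisticsError whenever there is no unique most common value, which here happens when the classifiers' votes are tied. To overcome this issue, I have defined the custom find_max_mode function.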

from collections import Counter

def find_max_mode(list1):
    # Count the votes and collect every label tied for the highest count.
    counts = Counter(list1)
    top_count = max(counts.values())
    tied = [label for label, count in counts.items() if count == top_count]
    # A unique mode is returned as-is; on a tie the largest label wins.
    return max(tied)

class selectClassifier(ClassifierI):
    def __init__(self, *classifiers):
        self._classifiers = classifiers

    def classify(self, features):
        votes = []
        for x in self._classifiers:
            v = x.classify(features)
            votes.append(v)
        return find_max_mode(votes)

    def confidence(self, features):
        votes = []
        for x in self._classifiers:
            v = x.classify(features)
            votes.append(v)

        choice_votes = votes.count(find_max_mode(votes))
        conf = choice_votes / len(votes)
        return conf
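
The voting and confidence arithmetic can be traced by hand on a small example. The sketch below mirrors the logic of classify and confidence on plain lists of labels; the vote function and the sample votes are hypothetical and do not call the real classifiers.

```python
from collections import Counter

def vote(votes):
    """Return (winning label, share of votes for that label)."""
    counts = Counter(votes)
    top_count = max(counts.values())
    # Every label tied for the highest count; max() breaks ties by label order.
    tied = [label for label, count in counts.items() if count == top_count]
    winner = max(tied)
    return winner, counts[winner] / len(votes)

# Three of four classifiers agree.
clear_result = vote(["positive", "positive", "negative", "positive"])
# Two against two: the tie is resolved in favour of "positive".
tied_result = vote(["positive", "negative", "positive", "negative"])
print(clear_result, tied_result)
```

With four classifiers, as in the next cell, a 2-2 split is possible; it resolves to "positive" with a confidence of 0.5 because "positive" sorts after "negative". An odd number of classifiers would avoid ties between two labels.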

In [93]:
selected_classifier = cfd.selectClassifier(
                                  LinearSVC_classifier,
                                  MNB_classifier,
                                  BernoulliNB_classifier,
                                  LogisticRegression_classifier)

print("Selected Classifier accuracy %age:", (nltk.classify.accuracy(selected_classifier, set_testing))*100)
print('')

Selected Classifier accuracy %age: 69.64285714285714



Now the classifier is ready to evaluate the 10-K file. We re-use the find_features function defined above on the 10-K file to get its feature set.

In [94]:
features_src_comments = cfd.find_features(features_list,src_comments)
print('')




Evaluate the Sentiment (Positive/Negative) and the confidence score (in percentage)

In [95]:
print('Sentiment: ', selected_classifier.classify(features_src_comments).upper())
print('Sentiment Confidence: ', selected_classifier.confidence(features_src_comments)*100, ' %age')

Sentiment:  POSITIVE
Sentiment Confidence:  100.0  %age
