# Text Classification Using Naive Bayes

#### Souvik Roy | 18210083

I am going to show a step-by-step demonstration of classifying text using Naive Bayes.
Naive Bayes is a simple text classification algorithm that uses basic probability laws and works quite well in practice!
Why is text classification required, you may ask?
- To detect fake news and spam emails
- To classify webpages by topic, etc.

The algorithm is built on Bayes' theorem:
<img src="bt.png" style="width: 400px;">


The formula above states that the probability that our hypothesis is correct, given the evidence, equals the probability of observing that evidence given our hypothesis, multiplied by the prior probability of the hypothesis, and divided by the overall probability of observing that evidence.
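
For reference, here is the same statement in symbols. The letters $H$ (hypothesis) and $E$ (evidence) are my own labels for this sketch, and the figure may use different ones:

```latex
P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}
```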

<img src="nbassume.png" style="width: 400px;">

This classifier is called "naive" because of the assumption it makes: that all features are independent of one another given the label. That assumption is rarely reasonable, since almost all real-world problems contain features with varying degrees of dependence on one another. However, making this assumption simplifies the computation, and that simplicity is why Naive Bayes serves as an effective baseline for supervised learning.
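
To make the independence assumption concrete, here is a minimal sketch of how it lets us multiply per-word probabilities. The probabilities, the two words, and the helper name `naive_posterior` are all made up for illustration and are not estimated from any corpus:

```python
from math import prod

# Toy, made-up probabilities purely for illustration (not from any corpus).
prior = {"pos": 0.5, "neg": 0.5}
likelihood = {
    "pos": {"great": 0.30, "boring": 0.05},
    "neg": {"great": 0.10, "boring": 0.25},
}

def naive_posterior(words):
    """Return P(label | words) under the naive independence assumption."""
    # Unnormalised score: P(label) times the product of P(word | label).
    scores = {
        label: prior[label] * prod(likelihood[label][w] for w in words)
        for label in prior
    }
    total = sum(scores.values())  # plays the role of P(evidence)
    return {label: score / total for label, score in scores.items()}

posterior = naive_posterior(["great", "boring"])
print(posterior)
```

Because the words are treated as independent given the label, each label needs only one probability per word rather than a joint probability for every combination of words.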

Now, I will build a simple binary classifier and calculate its accuracy at classifying movie reviews as positive or negative, based on features that it will learn from the training data. We'll import the movie_reviews corpus from nltk and use it as our dataset. It contains 2000 movie reviews, each labelled with one of two categories: positive or negative.

In [47]:
#Importing the essential libraries

import nltk
import random
from nltk.corpus import movie_reviews,stopwords
from nltk.probability import FreqDist
import string
from nltk import NaiveBayesClassifier
from nltk.tokenize import word_tokenize

# Total reviews
print ("\nInformation about the movie_reviews corpus:\nTotal Number of reviews:", len(movie_reviews.fileids()))
 
# Review categories
print ("Labels:",movie_reviews.categories()) 
 
# Total positive reviews
print ("Number of positive reviews:",len(movie_reviews.fileids('pos')))
 
# Total negative reviews
print ("Number of negative reviews:",len(movie_reviews.fileids('neg')))


Information about the movie_reviews corpus:
Total Number of reviews: 2000
Labels: ['neg', 'pos']
Number of positive reviews: 1000
Number of negative reviews: 1000


In [48]:
# Build a list of (word list, category) tuples, one per review; the category is 'pos' or 'neg'
documents = [(list(movie_reviews.words(fileid)), category)
            for category in movie_reviews.categories()
            for fileid in movie_reviews.fileids(category)]
# Number of documents
print("Size of document:",len(documents))

# Printing the second tuple (index 1)
print(documents[1])

# Shuffling the documents so that they are no longer ordered by category
random.shuffle(documents)

Size of document: 2000
(['the', 'happy', 'bastard', "'", 's', 'quick', 'movie', 'review', 'damn', 'that', 'y2k', 'bug', '.', 'it', "'", 's', 'got', 'a', 'head', 'start', 'in', 'this', 'movie', 'starring', 'jamie', 'lee', 'curtis', 'and', 'another', 'baldwin', 'brother', '(', 'william', 'this', 'time', ')', 'in', 'a', 'story', 'regarding', 'a', 'crew', 'of', 'a', 'tugboat', 'that', 'comes', 'across', 'a', 'deserted', 'russian', 'tech', 'ship', 'that', 'has', 'a', 'strangeness', 'to', 'it', 'when', 'they', 'kick', 'the', 'power', 'back', 'on', '.', 'little', 'do', 'they', 'know', 'the', 'power', 'within', '.', '.', '.', 'going', 'for', 'the', 'gore', 'and', 'bringing', 'on', 'a', 'few', 'action', 'sequences', 'here', 'and', 'there', ',', 'virus', 'still', 'feels', 'very', 'empty', ',', 'like', 'a', 'movie', 'going', 'for', 'all', 'flash', 'and', 'no', 'substance', '.', 'we', 'don', "'", 't', 'know', 'why', 'the', 'crew', 'was', 'really', 'out', 'in', 'the', 'middle', 'of', 'nowhere', ','

Now we will extract features that help us classify text into the two categories: neg or pos.
Before that, we need to remove stopwords and punctuation. As the output above shows, tokens such as "of", "is", "the" and "'" need to be removed.
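
Here is a small sketch of that filtering step on a handful of tokens from the review printed above. The stopword set `toy_stopwords` and the helper `clean_tokens` are hypothetical stand-ins of mine, so the snippet runs without downloading any NLTK data. The notebook itself uses `stopwords.words('english')`:

```python
import string

# A tiny hand-picked stopword set, used only so this sketch runs without
# NLTK data; the notebook itself uses stopwords.words('english').
toy_stopwords = {"of", "is", "the", "a", "and", "it", "s"}

def clean_tokens(tokens):
    """Lowercase tokens and drop stopwords and single punctuation marks."""
    cleaned = []
    for tok in tokens:
        tok = tok.lower()
        # `tok in string.punctuation` is a substring test on the punctuation string.
        if tok in toy_stopwords or tok in string.punctuation:
            continue
        cleaned.append(tok)
    return cleaned

sample = ["the", "happy", "bastard", "'", "s", "quick", "movie", "review", "."]
print(clean_tokens(sample))
```

One thing to keep in mind: because the punctuation check is a substring test, a multi-character token such as `--` is not removed, which is consistent with `contains(--)` showing up among the features printed later in this notebook.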

In [49]:
# A set gives fast membership tests
stop = set(stopwords.words('english'))

# Lowercase every word in the corpus, dropping stopwords and punctuation
all_words = (w.lower() for w in movie_reviews.words() if w.lower() not in stop and w.lower() not in string.punctuation)

# The frequency distribution counts the number of occurrences of each word across the whole corpus
all_words = nltk.FreqDist(all_words)
print(all_words)

# Printing the 15 most frequently occurring words
print (all_words.most_common(15))

<FreqDist with 39586 samples and 710578 outcomes>
[('film', 9517), ('one', 5852), ('movie', 5771), ('like', 3690), ('even', 2565), ('good', 2411), ('time', 2411), ('story', 2169), ('would', 2109), ('much', 2049), ('character', 2020), ('also', 1967), ('get', 1949), ('two', 1911), ('well', 1906)]


In [50]:
# Taking the 2000 most frequently occurring words
most_common_words = all_words.most_common(2000)

# Each element of most_common_words is a (word, count) tuple;
# keep only the word, i.e. the first element of each tuple
word_features = [item[0] for item in most_common_words]

# Creating the feature set used to train the classifier
def document_features(document):
    # "set" function will remove repeated/duplicate tokens in the given list
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features
feature_set = [(document_features(doc), category) for (doc, category) in documents]
print (feature_set[0])

({'contains(film)': True, 'contains(one)': True, 'contains(movie)': True, 'contains(like)': True, 'contains(even)': True, 'contains(good)': True, 'contains(time)': True, 'contains(story)': True, 'contains(would)': True, 'contains(much)': True, 'contains(character)': True, 'contains(also)': False, 'contains(get)': False, 'contains(two)': True, 'contains(well)': False, 'contains(characters)': False, 'contains(first)': False, 'contains(--)': False, 'contains(see)': False, 'contains(way)': False, 'contains(make)': False, 'contains(life)': False, 'contains(really)': False, 'contains(films)': True, 'contains(plot)': True, 'contains(little)': False, 'contains(people)': True, 'contains(could)': False, 'contains(scene)': True, 'contains(man)': True, 'contains(bad)': True, 'contains(never)': False, 'contains(best)': False, 'contains(new)': True, 'contains(scenes)': False, 'contains(many)': False, 'contains(director)': True, 'contains(know)': True, 'contains(movies)': False, 'contains(action)': F

From the feature set above, we now create a separate training set and a separate testing set. The training set is used to train the classifier, and the testing set is used to check how accurately it classifies text it has not seen.
Since there are 2000 elements in the feature set, we use the first 400 elements as the testing set and the remaining 1600 as the training set, i.e. 80 percent for training and 20 percent for testing.
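
The split itself is just list slicing. Below is a minimal sketch of it, where the helper `split_train_test` and the stand-in list `toy_feature_set` are names I made up for illustration. Taking the test items from the front is only safe because the documents were shuffled earlier:

```python
def split_train_test(items, test_fraction=0.2):
    """Split a list into (train, test), taking the test items from the front."""
    n_test = round(len(items) * test_fraction)
    return items[n_test:], items[:n_test]

# Stand-in for the 2000 (features, label) pairs built above.
toy_feature_set = list(range(2000))
train_part, test_part = split_train_test(toy_feature_set)
print(len(train_part), len(test_part))
```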

In [51]:
print (len(feature_set))
test_set = feature_set[:400]
train_set = feature_set[400:]
print (len(train_set))
print (len(test_set))
 
classifier = NaiveBayesClassifier.train(train_set)
 
# `classify` was never imported on its own, so call accuracy through the nltk package
accuracy = nltk.classify.accuracy(classifier, test_set)
print ("Accuracy of the classifier is:",accuracy*100,"%")

2000
1600
400
Accuracy of the classifier is: 79.5 %


<img src="res.png" style="width: 500px;">
We can classify new sentences with our trained model. In the formula above, argmax denotes the value that yields the highest score. In this case we have two classes (values), namely neg and pos. P(V(j)) is the prior probability of the positive or negative class in the corpus, and P(W|V(j)) is the probability of each word in the new sentence given the positive or negative class.
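
As a rough sketch of that argmax rule, the snippet below scores each class with made-up values for P(V(j)) and P(W|V(j)) and picks the larger one. The numbers and the helper name `classify_argmax` are assumptions of mine. It is also a simplification, because the NLTK classifier additionally takes into account the features that are False (words absent from the sentence). Summing logarithms instead of multiplying probabilities is a common way to avoid underflow when there are many words:

```python
import math

# Toy, made-up numbers for illustration only.
class_prior = {"pos": 0.5, "neg": 0.5}  # P(V(j))
word_given_class = {                     # P(W | V(j))
    "pos": {"hated": 0.01, "poor": 0.02, "acting": 0.05},
    "neg": {"hated": 0.04, "poor": 0.06, "acting": 0.05},
}

def classify_argmax(words):
    """Pick the class maximising log P(V(j)) + sum of log P(W | V(j))."""
    def log_score(label):
        return math.log(class_prior[label]) + sum(
            math.log(word_given_class[label][w]) for w in words
        )
    # max(..., key=...) is the argmax over the class labels.
    return max(class_prior, key=log_score)

predicted = classify_argmax(["hated", "poor", "acting"])
print(predicted)
```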

In [52]:
sample_sent = "I hated the poor acting."
tokens = word_tokenize(sample_sent)
sample_sent_set = document_features(tokens)
print ("The sentence '",sample_sent,"' is:",classifier.classify(sample_sent_set))
 
# Probability result
prob_result = classifier.prob_classify(sample_sent_set)
print (prob_result)
print ("Maximum Probability:",prob_result.max()) # Output: neg
print ("Probability of the sentence being Negative:",prob_result.prob("neg"))
print ("Probability of the sentence being Positive:",prob_result.prob("pos"))

The sentence ' I hated the poor acting. ' is: neg
<ProbDist with 2 samples>
Maximum Probability: neg
Probability of the sentence being Negative: 0.9999969987008751
Probability of the sentence being Positive: 3.001299111534239e-06


In [53]:
from IPython.display import display, HTML

print("\033[1m"+"References:")
display(HTML("""<a href="https://www.nltk.org/book/ch06.html">NLTK: Chapter 6. Learning to Classify Text</a>"""))
display(HTML("""<a href="https://www.youtube.com/watch?v=EGKeC2S44Rs&t=126s">YouTube video by Francisco Iacobelli</a>"""))
display(HTML("""<a href="https://pythonmachinelearning.pro/text-classification-tutorial-with-naive-bayes/">Text Classification Tutorial</a>"""))

References:
