# Naive Bayes in Text Classification

### Introduction

In Assignment 3, we got a first look at text classification by analyzing tweets from representatives of different parties. To build the model, however, the assignment simply used the SVM from sklearn, a popular and widely used library. To deepen that understanding, this tutorial shows you how to build your own text classifier, in this case a Naive Bayes model. We will go through the whole process: processing the data, extracting features, building the model, and testing it on some test documents. In the last and most interesting part, we will take a speech transcript found online and see whether our model classifies it correctly. OK, let's go.

In [1]:
import re
import math

### Preprocessing

In this tutorial, we are not going to use any external library; the two standard-library modules imported above are all we need. External libraries make many data science tasks very easy, but only by implementing the model without their help can we gain a much deeper understanding of the model and the tools.

In [2]:
def get_list_articles_from_file(file_path):
    # Argument: String, the path of a text file with one article per line
    # Return: List, the articles as strings with the trailing newline removed
    # The with-statement closes the file even if reading fails.
    with open(file_path) as f:
        return [line.rstrip('\n') for line in f]

list_train_articles = get_list_articles_from_file("train.txt")
print(list_train_articles[0])

RED	THANK YOU . THIS IS QUITE AN INSTITUTION . IT'S GOOD TO BE BACK HERE WITH YOU . IT'S GOOD TO BE BACK IN MICHIGAN . YOU KNOW , SOMEHOW EVERYTHING JUST SEEMS RIGHT HERE . IN THE WINTER , OF COURSE , THE SKIES ARE CLOUDY ALL DAY . MOST OF THE CARS YOU SEE ON THE ROADS ARE MADE HERE IN THE GOOD OLD U-S-OF A . PEOPLE KNOW THAT POP IS NOT A RELATIVE , IT'S A SOFT DRINK , AND THEY KNOW THAT UNKNOWNWORD IS THE BEST GINGER UNKNOWNWORD IN THE WORLD . AND OF COURSE , FOR ME , I HAVE A LOT OF MEMORIES HERE . THIS IS WHERE BOTH ANN AND I WERE BORN . IT'S WHERE I MET HER . WE WERE IN OUR SENIOR YEAR WHEN WE WENT TO A PARTY TOGETHER . I WAS IN SENIOR YEAR , SHE WAS A SOPHOMORE . SHE CAME WITH SOMEONE ELSE . I NOTICED HER AT AGE 16 . SHE WAS VERY INTERESTING . I WENT TO THE GUY WHO BROUGHT HER THERE AND SAID , 'LOOK , I LIVE CLOSER TO ANN THAN YOU DO , CAN I GIVE HER A RIDE HOME ?' WE'VE BEEN GOING STEADY EVER SINCE . SO WE KNOW EACH OTHER REAL WELL . I SAID TO HER AFTER WE MADE THE DECISION TO GE

Here is our training data set. Let's take a look at the first entry. Each line starts with 'RED' or 'BLUE', the party label, followed by '\t' and the whole speech text. In language technology, there are several preprocessing steps to consider. The first is removing punctuation, because punctuation tokens are often noisy features that hurt the performance of the model. Others include removing stop words and stemming, but we will not deal with them in this tutorial. So we are going to write a function to remove punctuation. It keeps only the tokens that start with an uppercase letter, and it also drops the placeholder token 'UNKNOWNWORD'.

In [3]:
def remove_punctuation_from_word_list(list_words):
    # Argument: List, the list of words in one article
    # Return: List, the list of words without punctuation in one article
    pattern = re.compile(r'^[A-Z]')
    res = []
    for word in list_words:
        if pattern.match(word):
            if word != 'UNKNOWNWORD':
                res.append(word)
    return res

test_input = "THIS IS A PUNCTUATION REMOVE ! TEST ."
print(remove_punctuation_from_word_list(test_input.split()))

['THIS', 'IS', 'A', 'PUNCTUATION', 'REMOVE', 'TEST']
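
To make the filtering rule more concrete, here is a small sketch that applies the same pattern to a handful of tokens of the kind that appear in the first training article. The token list is picked by hand for illustration; the point is that anything not starting with an uppercase letter is dropped, which includes numbers and words with a leading quote.

```python
import re

# Same pattern as in remove_punctuation_from_word_list: keep tokens whose
# first character is an uppercase ASCII letter.
pattern = re.compile(r'^[A-Z]')

# Hand-picked sample tokens (an illustrative list, not read from train.txt)
sample_tokens = ["IT'S", "U-S-OF", "16", "?'", "'LOOK", ".", "UNKNOWNWORD", "ANN"]

kept = [t for t in sample_tokens if pattern.match(t) and t != 'UNKNOWNWORD']
dropped = [t for t in sample_tokens if t not in kept]

print(kept)     # ["IT'S", 'U-S-OF', 'ANN']
print(dropped)  # ['16', "?'", "'LOOK", '.', 'UNKNOWNWORD']
```

Note that 'LOOK is lost because of its leading quote, so the function removes slightly more than pure punctuation.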



As we saw in the first entry of the training data, each line starts with 'RED' or 'BLUE', the party label, followed by '\t' and the whole speech text. We need to extract the label and the punctuation-free word list from each entry. So we are going to write a function that takes one entry as a string and returns a tuple containing the list of all words with punctuation removed and the label.

In [4]:
def get_word_list_and_party_from_article(article):
    # Argument: String, the content of one train article
    # Return: Tuple, first: the list of all words from the article, second: the true party of the article
    party = article.split('\t')[0]
    list_words = article.split('\t')[1].split(' ')
    list_words = remove_punctuation_from_word_list(list_words)
    return list_words, party

test_input = list_train_articles[0]
print(get_word_list_and_party_from_article(test_input))

(['THANK', 'YOU', 'THIS', 'IS', 'QUITE', 'AN', 'INSTITUTION', "IT'S", 'GOOD', 'TO', 'BE', 'BACK', 'HERE', 'WITH', 'YOU', "IT'S", 'GOOD', 'TO', 'BE', 'BACK', 'IN', 'MICHIGAN', 'YOU', 'KNOW', 'SOMEHOW', 'EVERYTHING', 'JUST', 'SEEMS', 'RIGHT', 'HERE', 'IN', 'THE', 'WINTER', 'OF', 'COURSE', 'THE', 'SKIES', 'ARE', 'CLOUDY', 'ALL', 'DAY', 'MOST', 'OF', 'THE', 'CARS', 'YOU', 'SEE', 'ON', 'THE', 'ROADS', 'ARE', 'MADE', 'HERE', 'IN', 'THE', 'GOOD', 'OLD', 'U-S-OF', 'A', 'PEOPLE', 'KNOW', 'THAT', 'POP', 'IS', 'NOT', 'A', 'RELATIVE', "IT'S", 'A', 'SOFT', 'DRINK', 'AND', 'THEY', 'KNOW', 'THAT', 'IS', 'THE', 'BEST', 'GINGER', 'IN', 'THE', 'WORLD', 'AND', 'OF', 'COURSE', 'FOR', 'ME', 'I', 'HAVE', 'A', 'LOT', 'OF', 'MEMORIES', 'HERE', 'THIS', 'IS', 'WHERE', 'BOTH', 'ANN', 'AND', 'I', 'WERE', 'BORN', "IT'S", 'WHERE', 'I', 'MET', 'HER', 'WE', 'WERE', 'IN', 'OUR', 'SENIOR', 'YEAR', 'WHEN', 'WE', 'WENT', 'TO', 'A', 'PARTY', 'TOGETHER', 'I', 'WAS', 'IN', 'SENIOR', 'YEAR', 'SHE', 'WAS', 'A', 'SOPHOMORE', '

You will see a tuple. The first element is a list of words, such as 'THANK', 'YOU' and so on. The second element is a string, here 'RED', meaning the Republican Party.

### Train Model

In this tutorial, we are going to implement a Naive Bayes model. The Naive Bayes classifier is a simple probabilistic classifier in machine learning, based on applying the famous Bayes' Theorem with a naive independence assumption between all features.

Bayes' Theorem is
$$P(A\mid B) = \frac{P(B\mid A)P(A)}{P(B)}$$
In the Naive Bayes model, we assume that all features are conditionally independent given the label. In that case, we have the formula below:
$$P(X_1, X_2, \cdots, X_n|Y)=P(X_1|Y)P(X_2|Y)\cdots P(X_n|Y)$$
In our text classification case, the label Y is the party and the features are the numbers of occurrences of each word. So to predict the party from the speech text, we use the following formulas:
$$P(party|text)=\frac{P(text|party)P(party)}{P(text)}$$

$$P(text)=P(text|party=RED)P(party=RED)+P(text|party=BLUE)P(party=BLUE)$$

$$P(text|party)=\prod_{word \in text}P(word|party)$$
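
The three formulas above can be followed step by step on a toy example. The priors and word probabilities below are invented purely for illustration (they are not estimated from train.txt); the sketch only shows how the product, the evidence term and the posterior fit together.

```python
# Toy numbers, invented for illustration only (not estimated from train.txt)
prior = {'RED': 0.5, 'BLUE': 0.5}
word_prob = {
    'RED':  {'TAX': 0.02, 'FREEDOM': 0.03},
    'BLUE': {'TAX': 0.01, 'FREEDOM': 0.01},
}
text = ['TAX', 'FREEDOM']

# P(text | party) * P(party), using the naive independence assumption
joint = {}
for party in prior:
    likelihood = 1.0
    for word in text:
        likelihood *= word_prob[party][word]
    joint[party] = likelihood * prior[party]

# P(text) is the sum of the joint terms over both parties
evidence = sum(joint.values())
posterior = {party: joint[party] / evidence for party in joint}
print(posterior)
```

With these made-up numbers the posterior is about 0.857 for RED and 0.143 for BLUE, and the two values sum to 1 because of the division by P(text).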

With all the preparation done, we can build our Naive Bayes model. We need to record the number of RED and BLUE articles, the word counts for RED and for BLUE, and a vocabulary set of all the words seen, which we need for smoothing.

In [5]:
dict_red = {}
dict_blue = {}
red = 0
blue = 0
vocabulary = set()

for article in list_train_articles:
    words_list, party = get_word_list_and_party_from_article(article)
    if (party == 'RED'):
        red += 1
        for word in words_list:
            if word in dict_red:
                dict_red[word] += 1
            else:
                dict_red[word] = 1
            vocabulary.add(word)
    else:
        blue += 1
        for word in words_list:
            if word in dict_blue:
                dict_blue[word] += 1
            else:
                dict_blue[word] = 1
            vocabulary.add(word)

prob_blue = blue * 1.0 / (blue + red)
prob_red = red * 1.0 / (blue + red)
vocabularity_blue = sum(dict_blue.values())
vocabularity_red = sum(dict_red.values())
print('vocabulary size: ', len(vocabulary))
print('probability of blue: ', prob_blue)
print('probability of red: ', prob_red)
print('the sum of the number of appearance of all words in BLUE: ', vocabularity_blue)
print('the sum of the number of appearance of all words in RED: ', vocabularity_red)
print('the number of appearance of SHOULD in BLUE: ', dict_blue['SHOULD'])
print('the number of appearance of THANK in RED: ', dict_red['THANK'])

vocabulary size:  6818
probability of blue:  0.5
probability of red:  0.5
the sum of the number of appearance of all words in BLUE:  59582
the sum of the number of appearance of all words in RED:  50621
the number of appearance of SHOULD in BLUE:  77
the number of appearance of THANK in RED:  49


We have now counted the occurrences of each word for the Naive Bayes model. If everything is correct, your output will match the cell output above: a vocabulary of 6818 words, a prior of 0.5 for each party, 59582 word occurrences in BLUE and 50621 in RED, with SHOULD appearing 77 times in BLUE and THANK 49 times in RED.
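
As an aside, the same bookkeeping can be written more compactly with collections.Counter from the standard library. The sketch below is an alternative, not the code used in this tutorial, and it runs on a made-up three-article training set in the (words, party) shape returned by get_word_list_and_party_from_article.

```python
from collections import Counter

# Hypothetical mini training set (invented for this sketch)
toy_articles = [
    (['THANK', 'YOU', 'THANK'], 'RED'),
    (['WE', 'SHOULD', 'THANK'], 'BLUE'),
    (['YOU', 'SHOULD'], 'BLUE'),
]

word_counts = {'RED': Counter(), 'BLUE': Counter()}
article_counts = Counter()
toy_vocabulary = set()

for words, party in toy_articles:
    article_counts[party] += 1
    word_counts[party].update(words)   # adds 1 for every occurrence
    toy_vocabulary.update(words)

total_articles = sum(article_counts.values())
priors = {party: article_counts[party] / total_articles for party in article_counts}

print(word_counts['RED']['THANK'])    # 2
print(word_counts['BLUE']['SHOULD'])  # 2
print(len(toy_vocabulary))            # 4
print(priors)
```

A Counter returns 0 for a word it has never seen, which removes the need for the if/else branches when counting.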

Now that we have all the parameters the Naive Bayes model needs, we are going to write the predict function to complete it. As the formula above shows, we have to multiply together the probabilities P(word|party). Each of them is very small, some on the order of 1e-6, so their product quickly becomes too small for floating-point numbers to represent. We therefore take the log of each probability and add the logs together, which is numerically stable. The smooth argument is the additive smoothing constant: it is added to every word count so that a word never seen for a party does not get probability zero. (The test data set contains the word "UNKNOWNWORD"; we simply skip it in prediction.)
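
The precision problem is easy to reproduce. The sketch below multiplies a small probability by itself many times and compares the result with the sum of logs; the value 1e-6 comes from the text above, while the length of 100 words is an arbitrary choice for the demonstration.

```python
import math

p = 1e-6        # a small per-word probability, as mentioned above
n_words = 100   # hypothetical number of words in a speech

# Multiplying directly: the product falls below the smallest positive float
product = 1.0
for _ in range(n_words):
    product *= p
print(product)   # 0.0

# Adding logs instead keeps a perfectly usable number
log_sum = n_words * math.log(p)
print(log_sum)   # about -1381.55
```

Once both parties' products have collapsed to 0.0 they can no longer be compared, whereas the log sums still can.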

In [6]:
def naive_bayes_predict(test_words_list, prob_blue, prob_red, dict_blue, dict_red, vocabulary, smooth):
    new_vocabulary = vocabulary.copy()
    vocabularity_blue = sum(dict_blue.values())
    vocabularity_red = sum(dict_red.values())
    indicator_blue = math.log(prob_blue)
    indicator_red = math.log(prob_red)
    for word in test_words_list:
        if word != 'UNKNOWNWORD':
            new_vocabulary.add(word)
    for word in test_words_list:
        if word == 'UNKNOWNWORD':
            continue
        if word in dict_blue:
            indicator_blue += math.log((dict_blue[word] + smooth)/(len(new_vocabulary) * smooth + vocabularity_blue))
        else:
            indicator_blue += math.log(smooth / (len(new_vocabulary) * smooth + vocabularity_blue))
        if word in dict_red:
            indicator_red += math.log((dict_red[word] + smooth)/(len(new_vocabulary) * smooth + vocabularity_red))
        else:
            indicator_red += math.log(smooth / (len(new_vocabulary) * smooth + vocabularity_red))
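    # The indicators are log scores, so dividing them does not give probabilities.
    # Exponentiate after subtracting the larger one (log-sum-exp trick) and normalize.
    max_indicator = max(indicator_blue, indicator_red)
    exp_blue = math.exp(indicator_blue - max_indicator)
    exp_red = math.exp(indicator_red - max_indicator)
    predict_prob_blue = exp_blue / (exp_blue + exp_red)
    predict_prob_red = exp_red / (exp_blue + exp_red)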
    if indicator_blue > indicator_red:
        return 'BLUE', predict_prob_blue, predict_prob_red
    else:
        return 'RED', predict_prob_blue, predict_prob_red
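
The smoothed estimate used inside the function above is (count + smooth) / (total count + smooth * vocabulary size). Here is a minimal sketch of that single step in isolation; the helper name smoothed_log_prob and the toy counts are made up for illustration and are not part of the tutorial code.

```python
import math

def smoothed_log_prob(word, counts, total_count, vocab_size, smooth):
    # log of (count(word) + smooth) / (total_count + smooth * vocab_size);
    # counts.get(word, 0) covers both the seen and the unseen case.
    return math.log((counts.get(word, 0) + smooth) / (total_count + smooth * vocab_size))

# Toy counts for one party (invented for this sketch)
toy_counts = {'THANK': 3, 'YOU': 1}
toy_total = sum(toy_counts.values())   # 4 word occurrences in total
toy_vocab_size = 3                     # THANK, YOU, plus TAX, which this party never used

seen = math.exp(smoothed_log_prob('THANK', toy_counts, toy_total, toy_vocab_size, 1.0))
other = math.exp(smoothed_log_prob('YOU', toy_counts, toy_total, toy_vocab_size, 1.0))
unseen = math.exp(smoothed_log_prob('TAX', toy_counts, toy_total, toy_vocab_size, 1.0))
print(seen, other, unseen)   # roughly 0.571, 0.286 and 0.143
```

The unseen word gets a small but non-zero probability, and the three values still sum to 1 over the vocabulary.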
    

### Test Model

In [7]:
list_test_articles = get_list_articles_from_file("test.txt")
test_red_predict_red = 0
test_blue_predict_blue = 0
test_red_predict_blue = 0
test_blue_predict_red = 0

# index = 1
for article in list_test_articles:
    words_list, party = get_word_list_and_party_from_article(article)
    prediction, predict_blue, predict_red = naive_bayes_predict(words_list, prob_blue, prob_red, dict_blue, dict_red, vocabulary, 0.1)
    if party == 'BLUE' and prediction == 'BLUE':
        test_blue_predict_blue += 1
    elif party == 'BLUE' and prediction == 'RED':
        test_blue_predict_red += 1
    elif party == 'RED' and prediction == 'BLUE':
        test_red_predict_blue += 1
    else:
        test_red_predict_red += 1
    # print('Test Case ', index, ': the probability of blue: ', predict_blue, ' the probability of red: ', predict_red)
    
print('Confusion Matrix:')
print(r'Actual\Predict     BLUE      RED')  # raw string: \P is not a valid escape sequence
print('BLUE               ', test_blue_predict_blue, '      ', test_blue_predict_red)
print('RED                 ', test_red_predict_blue, '      ', test_red_predict_red)


Confusion Matrix:
Actual\Predict     BLUE      RED
BLUE                12        0
RED                  0        6
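
A confusion matrix is usually summarized with a few numbers. As a rough sketch, the counts printed above can be turned into accuracy, and into precision and recall for one class. Treating RED as the positive class is an arbitrary choice made here for illustration.

```python
# Counts copied from the confusion matrix above
test_blue_predict_blue, test_blue_predict_red = 12, 0
test_red_predict_blue, test_red_predict_red = 0, 6

total = (test_blue_predict_blue + test_blue_predict_red
         + test_red_predict_blue + test_red_predict_red)
accuracy = (test_blue_predict_blue + test_red_predict_red) / total

# Precision and recall, treating RED as the positive class
predicted_red = test_blue_predict_red + test_red_predict_red
actual_red = test_red_predict_blue + test_red_predict_red
precision_red = test_red_predict_red / predicted_red if predicted_red else 0.0
recall_red = test_red_predict_red / actual_red if actual_red else 0.0

print(accuracy, precision_red, recall_red)   # 1.0 1.0 1.0
```

With only 18 test articles, a perfect score is encouraging but not strong evidence of how the model behaves on new speeches.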


If everything is correct, the confusion matrix will show that the model classifies every article in the test set correctly. The result should be:
![Image of Confusion](https://i.imgur.com/hJXrFAm.jpg)

### Extension 

It seems that our model performs very well, but we should not be satisfied with this alone. You can look for more test data and see whether our model classifies it correctly. Here I chose a transcript of a speech by Trump from the Internet as my test data, to see whether our model correctly classifies it as RED. The transcript is taken from http://www.cnn.com/2017/02/28/politics/donald-trump-speech-transcript-full-text/index.html

In [8]:
trump_article = get_list_articles_from_file("trump.txt")[0]
trump_article = trump_article.upper()
words_list, party = get_word_list_and_party_from_article(trump_article)
prediction, predict_blue, predict_red = naive_bayes_predict(words_list, prob_blue, prob_red, dict_blue, dict_red, vocabulary, 0.1)
print('The prediction of this article is: ', prediction)

The prediction of this article is:  RED


As the result shows, our model successfully classified it as RED. If you find this interesting, you can search for more speeches by party representatives, feed them to our model, and see whether it predicts them correctly.
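
One thing to watch when feeding in your own text: in train.txt punctuation is already separated from the words by spaces ("YOU . THIS"), while a transcript copied from a web page usually is not ("you. This"). If the raw text is only uppercased and split on spaces, a token like 'YOU.' passes the punctuation filter but never matches the vocabulary. The sketch below shows one possible way to tokenize raw text; the function name tokenize_raw_text and the regular expression are assumptions made for this example, not part of the tutorial code.

```python
import re

def tokenize_raw_text(raw_text):
    # Uppercase the text, then keep runs of letters that may be joined by
    # an apostrophe or a hyphen, so IT'S and U-S-OF stay whole while
    # surrounding punctuation is dropped.
    return re.findall(r"[A-Z]+(?:['-][A-Z]+)*", raw_text.upper())

raw = "Thank you. It's good to be back!"
print(raw.upper().split())
# ['THANK', 'YOU.', "IT'S", 'GOOD', 'TO', 'BE', 'BACK!']
print(tokenize_raw_text(raw))
# ['THANK', 'YOU', "IT'S", 'GOOD', 'TO', 'BE', 'BACK']
```

The resulting word list can be passed directly to naive_bayes_predict as its first argument.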

### Conclusion

Data science is interesting and beautiful. It is like a vast sea, and what we have learned is just a small island. Our Naive Bayes model is very simple, and there is still a lot left to explore. In Natural Language Processing, removing stop words, building n-gram models and stemming words (depends -> depend) are well-known techniques that could improve the performance of the model, but we did not implement them in this tutorial. If you are interested, you can add them yourself and see whether they help. Also, there are many interesting courses in Language Technology Institute waiting for you to register.
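
As a starting point for two of these extensions, here is a minimal sketch of stop-word removal and bigram features on a list of words. The stop-word list is a tiny made-up sample (real lists are much longer), and the helper names are hypothetical; whether these features actually improve the classifier would have to be checked on the test set.

```python
# A tiny illustrative stop-word list (an assumption; real lists are much longer)
STOP_WORDS = {'THE', 'A', 'AN', 'IS', 'TO', 'OF', 'AND', 'IN'}

def remove_stop_words(words):
    # Keep only the words that are not in the stop-word list
    return [w for w in words if w not in STOP_WORDS]

def make_bigrams(words):
    # Join each pair of neighbouring words into one feature, e.g. 'GOOD_OLD'
    return [first + '_' + second for first, second in zip(words, words[1:])]

words = ['THE', 'GOOD', 'OLD', 'U-S-OF', 'A']
print(remove_stop_words(words))   # ['GOOD', 'OLD', 'U-S-OF']
print(make_bigrams(words))        # ['THE_GOOD', 'GOOD_OLD', 'OLD_U-S-OF', 'U-S-OF_A']
```

Both helpers return plain lists of strings, so their output can be counted in dict_red and dict_blue in the same way as single words.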

You could also build this model easily with an external library, but I believe that when learning data science it is very important to implement it once without any external help. Doing so will greatly help your understanding and your growth into an excellent data scientist.