# Text classification with Naive Bayes

---
**Author**: Marko Bajec

**Last update**: 11.2.2019

**Description**: In this example we show how to use the Naive Bayes algorithm to classify a document into one of several predefined categories.  

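**Required libraries** (install with pip3):
* <code>nltk</code>
* <code>texttable</code>

The modules <code>re</code> and <code>collections</code> (which provides <code>Counter</code>) are part of the Python standard library and need no installation. <code>nltk.word_tokenize</code> also relies on NLTK's Punkt tokenizer data, which has to be downloaded once with <code>nltk.download</code> (depending on the NLTK version, the resource is named <code>punkt</code> or <code>punkt_tab</code>).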

---
## Naive Bayes
**Naive Bayes** is a family of probabilistic algorithms based on **Bayes' Theorem**, which describes the probability of an event given prior knowledge of conditions that might be related to that event.
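
For text classification, the theorem can be written for a category $c$ and a document $d$ (the same symbols used later in this example) as:

$$\Pr(c|d) = \frac{\Pr(d|c)\,\Pr(c)}{\Pr(d)}$$

Since $\Pr(d)$ is the same for every category, comparing categories only requires the numerator $\Pr(d|c)\,\Pr(c)$. The "naive" part is the assumption that words occur independently of each other given the category, so $\Pr(d|c)$ becomes a product of per-word probabilities.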

Let's say we have a *taxonomy* with the following two *categories*:
* c<sub>1</sub>: <code>sports</code>,
* c<sub>2</sub>: <code>politics</code>.

In the *training dataset* we have five tagged documents:
1. *A great game.* : <code>sports</code>,
2. *The election was over.* : <code>politics</code>,
3. *Very clean match.* : <code>sports</code>,
4. *A clean but forgettable game.* : <code>sports</code>, and
5. *It was a close election.* : <code>politics</code>.

**Question**: Is the following text about <code>sports</code> or <code>politics</code>?
<span style="color:blue">*A very close game.*</span>

## Python implementation
Let's first get some **statistics** based on the training dataset. We will need to know:
* the share of documents in the training dataset that belong to each category,
* the number of distinct words in the dataset,
* the frequency of each word in the documents of each category, and
* the probability that a word $w$ appears in a document $d$ of category $c$.

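We will also need to apply **Laplace smoothing** to avoid zero probabilities: a word that never occurs in a category would otherwise zero out the whole product for that category.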
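
A minimal sketch of add-one (Laplace) smoothing, using exact fractions so the numbers are easy to check by hand. The function name `laplace_smooth` and its parameters are illustrative and do not appear in the notebook code; the counts are taken from the training dataset above (11 words in `sports` documents, 14 distinct words overall).

```python
from fractions import Fraction

def laplace_smooth(count, total_words, vocab_size):
    """Add-one estimate of Pr(word | category)."""
    return Fraction(count + 1, total_words + vocab_size)

# "close" never occurs in a sports document, so the raw estimate is 0/11 = 0
unsmoothed = Fraction(0, 11)
# with smoothing the estimate becomes (0 + 1) / (11 + 14)
smoothed = laplace_smooth(0, 11, 14)
print(unsmoothed, smoothed)
```

Adding the vocabulary size to the denominator keeps the smoothed estimates for one category summing to 1 over all distinct words.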


In [None]:
import nltk
import re
from collections import Counter
from texttable import Texttable

# categories
c1 = "sports"
c2 = "politics"

# training documents
training_dataset = [["A great game.", c1], ["The election was over.", c2], ["Very clean match.", c1], 
                    ["A clean but forgettable game.", c1], ["It was a close election.", c2]]

# prior probability of each category (share of documents in the training dataset)
Pc1 = 3/5
Pc2 = 2/5

# document to be classified
sentance = "A very close game"

# class to keep:
# - distinct words (word),
# - their frequencies for each category (fc1, fc2) 
# - their probability of occurrence in each category (prwordc1, prwordc2)
# - their probabilities using Laplace smoothing (lapc1, lapc2)
class kword:
    def __init__(self, word, fc1, fc2, prwordc1, prwordc2, lapc1, lapc2):
        self.word = word
        self.fc1 = fc1
        self.fc2 = fc2
        self.prwordc1 = prwordc1
        self.prwordc2 = prwordc2
        self.lapc1 = lapc1
        self.lapc2 = lapc2

# tokenize - separately for categories c1 and c2
text, text_c1, text_c2 = [], [], []
for doc, category in training_dataset:
    tokens = nltk.word_tokenize(doc)  # tokenize each document only once
    text.extend(tokens)
    if category == c1:
        text_c1.extend(tokens)
    else:
        text_c2.extend(tokens)

# change words to lowercase
text = [x.lower() for x in text]
text_c1 = [x.lower() for x in text_c1]
text_c2 = [x.lower() for x in text_c2]

# remove punctuation
nonPunct = re.compile('.*[A-Za-z0-9].*')  # must contain a letter or digit
text_filtered = [w for w in text if nonPunct.match(w)]
text_c1_filtered = [w for w in text_c1 if nonPunct.match(w)]
text_c2_filtered = [w for w in text_c2 if nonPunct.match(w)]

text_counts = Counter(text_filtered)
text_c1_counts = Counter(text_c1_filtered)
text_c2_counts = Counter(text_c2_filtered)

# get word counts
dist_words = set(text_counts)
num_dist_words = len(dist_words)
num_words_c1 = len(text_c1_filtered)
num_words_c2 = len(text_c2_filtered)

# create result list
# for each word, check how many times it appears in c1 and c2, calculate probability of a word in a category 
# and smooth the results using Laplace 
results = []
for word in dist_words:
    fc1 = text_c1_counts[word]
    fc2 = text_c2_counts[word]
    results.append(kword(word, fc1, fc2, fc1/num_words_c1, fc2/num_words_c2, 
                         (fc1+1)/(num_words_c1+num_dist_words), (fc2+1)/(num_words_c2+num_dist_words)))
    
# print results

print("Number of documents in the training dataset: %s" % len(training_dataset))
print("Distribution of docs per category: c1 = %s, c2 = %s" % (Pc1, Pc2))
print("Number of distinct words: %s" % num_dist_words)

table = Texttable(0)  # 0 means the table width is unlimited
table.set_cols_align(["l", "r", "r", "r", "r", "r", "r"])
table.set_cols_valign(["m", "m", "m", "m", "m", "m", "m"])
#table.add_rows([["Word", "fc1", "fc2", "Pr(word|c1)", "Pr(word|c2)", "Lap(Pr(word|c1))", "Lap(Pr(word|c2))"],
#               ["Test", 32, 2.019, 0, 0, 0, 0]])
table.header(["Word", "fc1", "fc2", "Pr(word|c1)", "Pr(word|c2)", "Lap(Pr(word|c1))", "Lap(Pr(word|c2))"])
for row in results:
    table.add_row([row.word, row.fc1, row.fc2, row.prwordc1, row.prwordc2, row.lapc1, row.lapc2])
    
print(table.draw() + "\n")
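
The priors `Pc1` and `Pc2` are hard-coded above. As a sketch, they could also be derived from the labels. The names `labels`, `label_counts`, and `priors` are illustrative, and the label list is copied from the training dataset so that the block runs on its own.

```python
from collections import Counter

# labels of the five training documents, in order
labels = ["sports", "politics", "sports", "sports", "politics"]

label_counts = Counter(labels)
# share of documents per category, i.e. the prior Pr(c)
priors = {label: n / len(labels) for label, n in label_counts.items()}
print(priors)
```

Deriving the priors this way keeps them in sync if documents are added to or removed from the training dataset.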
   

### Which category does the observed document most likely belong to?  
Let's now calculate the probability that the document "*A very close game*" belongs to each of the categories, <code>sports</code> or <code>politics</code>. To do this we use the following formula:

<span style="color:darkred">
$$\LARGE c_j = \underset{c_k \in C}{\operatorname{argmax}} \Pr(c_k)\prod_{i=1}^{|V|} \Pr(w_i|c_k)^{f_{w_i}}$$
</span>

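Let's say the document $D$ that we would like to classify consists of the words $w_1, w_2, ..., w_n$. The score of a category $c$ for $D$ is the prior probability of the category, $Pr(c)$, multiplied by the product of the probabilities $Pr(w_i|c)$ of the words that occur in $D$. In the formula the product runs over the whole vocabulary $V$, and the exponent $f_{w_i}$ is the number of times the word $w_i$ occurs in $D$, so words that do not occur in $D$ contribute a factor of 1. The category with the highest score is chosen.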
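
Worked through by hand for the example document, using the Laplace-smoothed estimates from the table above (11 words in `sports` documents, 9 words in `politics` documents, 14 distinct words), the two scores are:

$$\Pr(\text{sports})\prod_{i}\Pr(w_i|\text{sports}) = \frac{3}{5}\cdot\frac{3}{25}\cdot\frac{2}{25}\cdot\frac{1}{25}\cdot\frac{3}{25} \approx 2.76 \times 10^{-5}$$

$$\Pr(\text{politics})\prod_{i}\Pr(w_i|\text{politics}) = \frac{2}{5}\cdot\frac{2}{23}\cdot\frac{1}{23}\cdot\frac{2}{23}\cdot\frac{1}{23} \approx 5.72 \times 10^{-6}$$

The four word factors correspond, in order, to *a*, *very*, *close*, and *game*. The `sports` score is the larger of the two, so the document is assigned to `sports`.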

In [None]:
# create word list for the observed document
target_doc = nltk.word_tokenize(sentance)
target_doc = [x.lower() for x in target_doc]

# calculate category probability
Pr_wC1 = 1
Pr_wC2 = 1

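for i in results:
    # raise each smoothed probability to the number of times the word occurs in the document
    n = target_doc.count(i.word)
    Pr_wC1 = Pr_wC1 * i.lapc1 ** n
    Pr_wC2 = Pr_wC2 * i.lapc2 ** n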

# these are unnormalized scores, proportional to the category probabilities
print("Score of category c1 is %.8f" % (Pc1 * Pr_wC1))
print("Score of category c2 is %.8f" % (Pc2 * Pr_wC2))
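
For longer documents, the product of many small probabilities can underflow to zero in floating point. A common workaround, not used in the notebook above, is to sum logarithms and compare the sums instead. The sketch below hard-codes the smoothed values computed earlier so that it runs on its own; `smoothed`, `priors`, `words`, and `log_score` are illustrative names.

```python
import math

# Laplace-smoothed Pr(word | category) for the words of "A very close game"
smoothed = {
    "sports":   {"a": 3 / 25, "very": 2 / 25, "close": 1 / 25, "game": 3 / 25},
    "politics": {"a": 2 / 23, "very": 1 / 23, "close": 2 / 23, "game": 1 / 23},
}
priors = {"sports": 3 / 5, "politics": 2 / 5}
words = ["a", "very", "close", "game"]

def log_score(category):
    """log Pr(c) plus the sum of log Pr(w | c) over the words of the document."""
    return math.log(priors[category]) + sum(
        math.log(smoothed[category][w]) for w in words
    )

scores = {c: log_score(c) for c in priors}
predicted = max(scores, key=scores.get)
print(predicted)
```

Because the logarithm is monotonic, the category with the highest log score is the same one that maximizes the original product.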
