Bayes Classification <br/>
$$P(Y | x) = \frac{P(x|Y)P(Y)}{P(x)}$$
The Naive Bayes assumption is that the features $x_1, \dots, x_d$ are conditionally independent given the label, so the likelihood factorizes <br/>
$$P(Y | x) = \frac{P(Y)\prod_{i=1}^d P(x_i | Y)}{P(x)}$$

# Naive Bayes Classification for Text Classification

Given a text $S$ of the form $S = w_1, w_2, w_3, \dots, w_d$, where the $w_i$ are its words, we want to find a label for the text. We do this with Naive Bayes classification:
$$P(Y | S) = \frac{P(Y)\prod_{i=1}^d P(w_i | Y)}{P(S)}$$
For this we need to estimate $P(Y)$ and $P(w_i| Y)$ from the training data.
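
To make $S = w_1, \dots, w_d$ concrete, here is a minimal sketch of one way to split a raw string into lowercase word tokens. The `tokenize` helper and the sample sentence are illustrative assumptions; the preprocessing applied to the DBPedia files in this notebook additionally strips the URI prefix of each line.

```python
import re

def tokenize(text):
    """Hypothetical helper: lowercase the text and keep runs of letters only."""
    return re.findall(r'[a-z]+', text.lower())

# Illustrative sentence, shortened from the first training article.
tokens = tokenize("Allan Dwan was a pioneering Canadian-born American director.")
print(tokens)
```

Each element of `tokens` plays the role of one $w_i$; digits and punctuation are discarded, so a hyphenated word such as "Canadian-born" becomes two tokens.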

# Parameter Estimation 

$$ P(Y= y_k) = \frac{count(Y= y_k)}{count( Y= ANY)} $$
<br/>
$$ P( X=x_j | Y=y_k) = \frac{Count(X= x_j\ \&\ Y = y_k)}{  Count(X= ANY\ \&\ Y = y_k) } $$

So we need to estimate the following counts from the training data <br/>
$count(Y= y_k)$ <br/> $count( Y= ANY)$ <br/> $Count(X= x_j\ \&\ Y = y_k)$ <br/> 
$Count(X= ANY\ \&\ Y = y_k)  $ 
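
As a rough illustration of these four counts, the sketch below builds them for a tiny in-memory corpus. The corpus `toy_docs` and its labels are made up for this example, and the key naming (`'Y=...'`, `'^X=...'`) is only chosen to mirror the counting code used later in this notebook.

```python
from collections import Counter

# Hypothetical toy corpus of (label, text) pairs; not taken from the DBPedia files.
toy_docs = [
    ("sports", "the team won the match"),
    ("sports", "the match was close"),
    ("politics", "the vote was close"),
]

counts = Counter()
for label, text in toy_docs:
    counts["Y=ANY"] += 1                       # count(Y = ANY)
    counts["Y=" + label] += 1                  # count(Y = y_k)
    for word in text.split():
        counts["Y=" + label + "^X=" + word] += 1   # Count(X = x_j & Y = y_k)
        counts["Y=" + label + "^X=ANY"] += 1       # Count(X = ANY & Y = y_k)

# Maximum-likelihood estimates from the counts.
p_sports = counts["Y=sports"] / counts["Y=ANY"]
p_the_given_sports = counts["Y=sports^X=the"] / counts["Y=sports^X=ANY"]
print(p_sports, p_the_given_sports)
```

Here "the" occurs 3 times among the 9 word tokens labelled `sports`, so the estimate of $P(X = \text{the} \mid Y = \text{sports})$ is $3/9$.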

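For a text of $d$ words $x_1, \dots, x_d$, taking the log turns the product into a sum. The term $\log P(x)$ is the same for every class, so it can be ignored when comparing classes. <br/>

$$ \log P(Y = y_k | x) = \log \big( P(Y=y_k) \big) + \sum_{i=1}^d \log \big(P(X = x_i | Y= y_k ) \big) - \log P(x)$$
<br/>
$$\log P(Y = y_k | x) + \log P(x) = \log \big( \frac{count(Y= y_k)}{count( Y= ANY)}   \big)  +  \sum_{i=1}^d \log \big( \frac{Count(X= x_i\ \&\ Y = y_k)}{  Count(X= ANY\ \&\ Y = y_k) } \big) $$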
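
Working with sums of logs is also numerically safer than multiplying many small probabilities. The following sketch shows the difference; the per-word probability `p_word` and the text length `n_words` are arbitrary values chosen for illustration.

```python
import math

# Hypothetical values: every word has likelihood 1e-4 and the text has 100 words.
p_word = 1e-4
n_words = 100

product = 1.0
log_sum = 0.0
for _ in range(n_words):
    product *= p_word            # direct product of probabilities
    log_sum += math.log(p_word)  # sum of log probabilities

print(product)   # underflows to 0.0 in double precision
print(log_sum)   # stays finite
```

The direct product would be $10^{-400}$, which is below the smallest representable double, so every class would score exactly zero; the log sum remains usable for comparing classes.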

In [1]:
import re
from collections import  defaultdict



# Dataset

In [3]:
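# Print the first few (labels, text) pairs of the training file.
with open('DBPedia.verysmall/verysmall_train.txt', 'r') as f:
    i = 0
    for article in f:
        try:
            labels, line = article.split('\t')
        except ValueError:
            # Skip lines that are not of the form "labels<TAB>text".
            continue
        print(labels, line)
        i = i + 1
        if i > 10:
            break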

American_film_directors  <http://dbpedia.org/resource/Allan_Dwan> <http://dbpedia.org/ontology/abstract> "Allan Dwan (3 April 1885 \u2013 28 December 1981) was a pioneering Canadian-born American motion picture director, producer and screenwriter."@en .

Articles_containing_video_clips  <http://dbpedia.org/resource/Animation> <http://dbpedia.org/ontology/abstract> "Animation is the process of creating motion and shape change illusion by means of the rapid display of a sequence of static images that minimally differ from each other. The illusion\u2014as in motion pictures in general\u2014is thought to rely on the phi phenomenon. Animators are artists who specialize in the creation of animation.Animations can be recorded on either analogue media, such as a flip book, motion picture film, video tape, or on digital media, including formats such as animated GIF, Flash animation or digital video. To display animation, a digital camera, computer, or projector are used along with new technolog

In [7]:
cls  = set()
vocab= set()

C = defaultdict(lambda: 0)
f =  open('DBPedia.verysmall/verysmall_train.txt', 'r') 

In [8]:



for article in f:
    try:
        labels, line = article.split('\t')
    except ValueError:
        # Skip lines that are not of the form "labels<TAB>text".
        continue
    C['Y=ANY'] = C['Y=ANY'] + 1
    # An article can carry several comma-separated labels.
    labels = [label.replace(" ", "") for label in labels.split(',')]
    for label in labels:
        C['Y='+label] = C['Y='+label] + 1
        cls.add(label)
    # Strip the <...> URI prefix, keep letters only, and drop the final token
    # (the "en" language tag in the lines shown above).
    line = re.sub(r'[^a-zA-Z]', " ", re.sub(r'<.*>', "", line)).split()[:-1]

    for word in line:
        word = word.lower()
        vocab.add(word)
        # Every word is counted once for each label of the article.
        for label in labels:
            C['Y='+label + '^X=' + word] = C['Y='+label + '^X=' + word] + 1
            C['Y='+label + '^X=ANY'] = C['Y='+label + '^X=ANY'] + 1

f.close()


C['Nsize'] = len(vocab)
C['Nclass'] = len(cls)

In [9]:
import math

# Uniform smoothing priors: 1/(vocabulary size) for words, 1/(number of classes) for labels.
q_x = 1. / C['Nsize']
q_y = 1. / C['Nclass']

In [11]:
C

defaultdict(<function __main__.<lambda>>,
            {'Y=ANY': 10497,
             'Y=American_film_directors': 311,
             'Y=American_film_directors^X=allan': 4,
             'Y=American_film_directors^X=ANY': 44499,
             'Y=American_film_directors^X=dwan': 1,
             'Y=American_film_directors^X=april': 37,
             'Y=American_film_directors^X=u': 494,
             'Y=American_film_directors^X=december': 37,
             'Y=American_film_directors^X=was': 485,
             'Y=American_film_directors^X=a': 802,
             'Y=American_film_directors^X=pioneering': 5,
             'Y=American_film_directors^X=canadian': 8,
             'Y=American_film_directors^X=born': 238,
             'Y=American_film_directors^X=american': 366,
             'Y=American_film_directors^X=motion': 29,
             'Y=American_film_directors^X=picture': 51,
             'Y=American_film_directors^X=director': 411,
             'Y=American_film_directors^X=producer': 199,
   

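Posterior <br/>
With smoothing of strength $m$, and up to an additive term that is the same for every class, the log posterior is
$$\log P(Y = y_k | x)  = \log \big( \frac{count(Y= y_k) + m q_y}{count( Y= ANY)+ m}   \big)  +  \sum_{i=1}^d \log \big( \frac{Count(X= x_i\ \&\ Y = y_k)+ m q_x}{  Count(X= ANY\ \&\ Y = y_k) + m } \big) $$
<br/>
$$ y_{pred} = argmax_{y_k }\ \ \log P(Y = y_k | x)  $$
$ q_y = \frac{1}{No\ of\ classes}$ <br/>
$q_x = \frac{1}{No\ of\ distinct\ words}$ <br/>
The code below uses $m = 1$. <br/>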
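
The effect of the smoothing term is easiest to see on a word that never occurs with a class. The sketch below is an illustration only: `smoothed_log_prob` is a hypothetical helper, the counts 411 and 44499 are taken from the count table printed above, and the vocabulary size of 50000 is an assumed value because the real one is not shown in the output.

```python
import math

def smoothed_log_prob(count, total, q, m=1.0):
    """Hypothetical helper: log((count + m*q) / (total + m))."""
    return math.log((count + m * q) / (total + m))

assumed_vocab_size = 50000          # assumption, not read from the data
q_x = 1.0 / assumed_vocab_size

total_words = 44499                 # 'Y=American_film_directors^X=ANY'
seen = smoothed_log_prob(411, total_words, q_x)    # count for 'director'
unseen = smoothed_log_prob(0, total_words, q_x)    # a word never seen with this class

print(seen, unseen)
```

Without smoothing, the unseen word would contribute $\log 0$ and rule the class out entirely; with smoothing it contributes a large but finite penalty.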

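Metric Evaluation <br/>
A prediction counts as correct if it matches any of the labels of the test article, and accuracy is the fraction of test articles predicted correctly.
$$ Accuracy = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}}  1_{ y_{pred, i}\ \in\ Labels_i }$$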
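
As a small sketch of this metric for articles that carry several labels, the function below counts a prediction as a hit when it appears in the article's label set. The `accuracy` helper and the toy predictions are assumptions made for illustration.

```python
def accuracy(predictions, gold_label_sets):
    """Hypothetical helper: fraction of predictions found in the matching gold label set."""
    hits = sum(1 for pred, gold in zip(predictions, gold_label_sets) if pred in gold)
    return hits / len(predictions)

# Toy data: three articles, the second one is misclassified.
toy_predictions = ["A", "B", "C"]
toy_gold = [{"A", "D"}, {"C"}, {"C"}]

toy_acc = accuracy(toy_predictions, toy_gold)
print(toy_acc)
```

Two of the three predictions fall inside their label sets, so the toy accuracy is $2/3$.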

In [16]:


test_size = 0
acc = 0
with open('DBPedia.verysmall/verysmall_test.txt', 'r') as f:
    for article in f:
        try:
            labels, line = article.split('\t')
        except ValueError:
            # Skip lines that are not of the form "labels<TAB>text".
            continue
        test_size = test_size + 1

        labels = [label.replace(" ", "") for label in labels.split(',')]

        # Same preprocessing as for the training data.
        line = re.sub(r'[^a-zA-Z]', " ", re.sub(r'<.*>', "", line)).split()[:-1]
        likli = []
        for label in cls:
            prob = 0
            for word in line:
                word = word.lower()
                # Smoothed log likelihood of the word (m = 1).
                # C.get is used so that lookups do not insert new keys into the defaultdict.
                prob = prob + math.log((C.get('Y='+label + '^X=' + word, 0) + q_x) / (C.get('Y='+label + '^X=ANY', 0) + 1))
            # Smoothed log prior of the class (m = 1).
            prob = prob + math.log((C.get('Y='+label, 0) + q_y) / (C['Y=ANY'] + 1))
            likli.append((label, prob))
        # Predict the label with the highest log posterior.
        predict = max(likli, key=lambda x: x[1])[0]
        acc = acc + int(predict in labels)

print("Accuracy = ", acc * 1. / test_size)


Accuracy =  0.9919839679358717
