<a href="https://colab.research.google.com/github/Collinsngenokip/basic-ml-course/blob/Week5/05_Naive_Bayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this assignment, we will implement Bernoulli Naive Bayes and Multinomial Naive Bayes and apply them to text classification. We will experiment on the 20 newsgroups text dataset, which comprises around 18000 newsgroup posts on 20 topics, split into two subsets: one for training (or development) and one for testing (or performance evaluation).

First, we load the dataset from sklearn.

In [53]:
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')

You can access the text of each post through the data attribute. The label names and the corresponding numeric labels are stored in target_names and target.

Let's take a look at our data.

In [54]:
len(newsgroups_train.data)

11314

In [55]:
newsgroups_train.target, newsgroups_train.target_names

(array([7, 4, 4, ..., 3, 1, 8]),
 ['alt.atheism',
  'comp.graphics',
  'comp.os.ms-windows.misc',
  'comp.sys.ibm.pc.hardware',
  'comp.sys.mac.hardware',
  'comp.windows.x',
  'misc.forsale',
  'rec.autos',
  'rec.motorcycles',
  'rec.sport.baseball',
  'rec.sport.hockey',
  'sci.crypt',
  'sci.electronics',
  'sci.med',
  'sci.space',
  'soc.religion.christian',
  'talk.politics.guns',
  'talk.politics.mideast',
  'talk.politics.misc',
  'talk.religion.misc'])

In [56]:
newsgroups_train.data[0]

"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"

When applying machine learning to a problem, designing a better algorithm is not the only way to improve results. We can also work on the data itself, for example through preprocessing and feature selection. In some cases, this is even more effective than optimizing the model. In this dataset, the text contains a lot of redundant information, such as punctuation and header lines. Removing it from the data can help performance. Here I define a function that removes all punctuation from the text.

In [57]:
def remove_tokens(token_list, text):
    for token in token_list:
        text = text.replace(token, '')
    return text

In [58]:
from string import punctuation
preprocessed_text = [remove_tokens(punctuation, text) for text in newsgroups_train.data]
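
The same cleanup can also be done in a single pass per document with str.translate, which avoids scanning the text once for every punctuation character. This is a minimal sketch. The helper name `remove_punctuation` and the sample string are assumptions for illustration and are not used elsewhere in this notebook.

```python
from string import punctuation

# Translation table that maps every punctuation character to None (deletes it).
_PUNCT_TABLE = str.maketrans('', '', punctuation)

def remove_punctuation(text):
    """Return text with all ASCII punctuation characters removed."""
    return text.translate(_PUNCT_TABLE)

sample = "WHAT car is this!? (where's my thing)"
cleaned = remove_punctuation(sample)
print(cleaned)
```

Note that deleting punctuation outright glues together tokens such as e-mail or wam.umd.edu. Replacing punctuation with a space instead is another option worth trying.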

In [59]:
newsgroups_train.data[0]

"From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"

**Assignment 1**: First, we have to transform the text into numeric features. Build a matrix that counts the occurrences of each word in each document (0.5pt). To keep computation fast, we keep only the 30000 most frequent words. *Hint*: Use sklearn.feature_extraction.text.CountVectorizer with the max_features argument.

In [60]:
from sklearn.feature_extraction.text import CountVectorizer
num_word = 30000
vectorizer = CountVectorizer(max_features=num_word)
vectorizer.fit(preprocessed_text)

train_data = vectorizer.transform(preprocessed_text).toarray()
print(train_data.shape)

(11314, 30000)
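
To make clear what the count matrix contains, here is a tiny pure-Python sketch of the same idea on three made-up documents: keep the `max_features` most frequent words, then count each kept word per document. It only illustrates the concept. CountVectorizer's actual tokenization and tie-breaking differ, and the `count_matrix` helper and the documents are assumptions for illustration.

```python
from collections import Counter

def count_matrix(docs, max_features):
    """Build a document-by-word count matrix over the most frequent words."""
    tokenized = [doc.lower().split() for doc in docs]
    totals = Counter(word for tokens in tokenized for word in tokens)
    # Keep the max_features most frequent words, then sort them alphabetically.
    vocab = sorted(word for word, _ in totals.most_common(max_features))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocab])
    return vocab, rows

docs = ["the car is fast", "the car the engine", "hockey game"]
vocab, rows = count_matrix(docs, max_features=2)
print(vocab)  # one column per kept word
print(rows)   # one row per document
```

The third document contains none of the kept words, so its row is all zeros. The same happens in the real matrix for posts made up only of rare words.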


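Recall that Naive Bayes predicts the label $\hat{y}$ that satisfies

$$\hat{y} = \arg\max_{y} P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

where $x_i$ is the feature for word $i$ and $n$ is the number of words.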


**Assignment 2**: We derive the prior probabilities $P(y)$ from the data by computing class frequencies. Count the number of documents in each class and store it in the class_freq variable, then divide by the total number of documents to get the prior probabilities in the prior_prob variable (1pt).

In [61]:
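import numpy as np
# class_freq[k] is the number of training documents with label classes[k]
classes, class_freq = np.unique(newsgroups_train.target, return_counts=True)
prior_prob = class_freq / np.sum(class_freq)
assert np.isclose(np.sum(prior_prob), 1.0)  # sanity check: priors sum to 1
print(class_freq)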

[480 584 591 590 578 593 585 594 598 597 600 595 591 594 593 599 546 564
 465 377]
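
As a quick check of the idea on a toy label vector (the labels below are made up), np.unique with return_counts=True gives the class counts, and dividing by their sum gives priors that add up to 1:

```python
import numpy as np

toy_target = np.array([0, 2, 1, 2, 2, 0])  # made-up labels for six documents
toy_classes, toy_freq = np.unique(toy_target, return_counts=True)
toy_prior = toy_freq / toy_freq.sum()

print(toy_freq)   # documents per class
print(toy_prior)  # estimated P(y)
```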


**Assignment 3**: In this step, we implement Bernoulli Naive Bayes. Here the conditional probability $P(i \mid y)$ is the probability that a document with label $y$ contains word $i$.

To estimate it, we need the number of documents that contain word $i$ and have label $y$, for every pair $(i, y)$. Compute these values and store them in the word_label_frequency variable. It should be a numpy array for fast computation in the next step (0.5pt).

*Hint*: The train_data features are the numbers of occurrences of words in documents. You can convert them to binary features that indicate whether a word appears in a document.

In [62]:
num_classes = len(newsgroups_train.target_names)
# word_label_frequency[y, i] = number of documents with label y that contain word i
word_label_frequency = np.zeros((num_classes, train_data.shape[1]))
for label, doc in zip(newsgroups_train.target, train_data):
    # clip(max=1) turns raw counts into 0/1 presence indicators
    word_label_frequency[label] += doc.clip(max=1)
print(word_label_frequency)

[[ 0.  0.  0. ...  0.  0.  0.]
 [ 5.  0.  0. ... 10.  2.  1.]
 [ 2.  0.  1. ...  0.  0.  0.]
 ...
 [ 0.  4.  1. ...  0.  0.  0.]
 [ 1.  0.  0. ...  0.  0.  0.]
 [ 0.  0.  0. ...  0.  0.  0.]]
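
The per-document loop can also be written as one matrix product: if `onehot[d, y]` is 1 when document `d` has label `y`, then `onehot.T @ binary` sums the binary word indicators of each class. This is a minimal sketch on made-up data. All names in it are illustrative and not part of the assignment.

```python
import numpy as np

# Toy data: 4 documents, 3 words, 2 classes (made up for illustration).
toy_counts = np.array([[2, 0, 1],
                       [0, 3, 0],
                       [1, 1, 0],
                       [0, 0, 4]])
toy_target = np.array([0, 1, 0, 1])
num_classes = 2

binary = (toy_counts > 0).astype(float)   # 1 if the word appears in the document
onehot = np.eye(num_classes)[toy_target]  # shape (documents, classes)
doc_freq = onehot.T @ binary              # shape (classes, words)
print(doc_freq)
```

On the full dataset, the dense binary matrix is large, so whether this is faster than the loop in practice depends on available memory.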


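**Assignment 4**: The conditional probability is computed by dividing the number of documents that contain word $i$ and have label $y$ by the number of documents with label $y$. However, if no training document contains word $i$ and has label $y$, the probability will be zero, which is undesirable: a single such word would drive the whole product for that class to zero.

To handle this problem, we can apply Laplace smoothing, so that the conditional probability is computed as follows:

$$P(i \mid y) = \frac{N_{yi} + \alpha}{N_y + \alpha n}$$

where $N_{yi}$ is the number of documents with label $y$ that contain word $i$, $N_y$ is the number of documents with label $y$, and $n$ is the number of words in the vocabulary.

Your task here is to implement this formula with default alpha=0.1 and fill in all the probability values in the cond_prob variable. It should be a numpy array for fast computation in the next step (1pt).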

In [63]:
alpha = 0.1  # smoothing strength; the assignment text specifies a default of 0.1

In [64]:
cond_prob = np.array([(word_label_frequency[i] + alpha)/(class_freq[i] + num_word * alpha) for i in range(len(class_freq))])
print(cond_prob.shape)

(20, 30000)
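
To see the effect of smoothing, the sketch below applies the same formula as the cell above to made-up counts, using broadcasting instead of a list comprehension. A word that never occurs with a class gets a small but nonzero probability. The counts and the `toy_` names are assumptions for illustration.

```python
import numpy as np

toy_alpha = 0.1
toy_doc_freq = np.array([[2.0, 1.0, 0.0],   # class 0: documents containing each word
                         [0.0, 1.0, 1.0]])  # class 1
toy_class_freq = np.array([2, 2])           # documents per class
n_words = toy_doc_freq.shape[1]

# (N_yi + alpha) / (N_y + alpha * n), with class_freq broadcast over the word axis
toy_cond_prob = (toy_doc_freq + toy_alpha) / (toy_class_freq[:, None] + toy_alpha * n_words)
print(toy_cond_prob)
```

Because every entry stays strictly between 0 and 1, both log(p) and log(1 - p) are finite, which the prediction step relies on.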


**Assignment 5**: For test data, the conditional probability follows a Bernoulli distribution and is computed as

$$P(x_i \mid y) = P(i \mid y)\, x_i + (1 - P(i \mid y))(1 - x_i)$$

where $x_i \in \{0, 1\}$ indicates whether word $i$ appears in the document. We then multiply by the prior probability and select the class with the highest value as the predicted label. For numerical stability, you should compute with log probabilities (2pt).

*Hint*: Remember to convert the test data features to binary features, as for the training data.
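
The reason for working with logs is floating-point underflow: a product of thousands of probabilities smaller than 1 quickly rounds to exactly 0.0, after which all classes look the same. Summing logs keeps the scores finite. A small demonstration with made-up numbers:

```python
import math

probs = [0.01] * 1000  # 1000 made-up per-word probabilities

product = 1.0
for p in probs:
    product *= p       # underflows to 0.0 long before the loop ends

log_sum = sum(math.log(p) for p in probs)  # finite: about 1000 * log(0.01)

print(product)
print(log_sum)
```

Since the logarithm is monotonic, the class with the largest log score is also the class with the largest probability, so the predicted label is unchanged.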

In [65]:
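log_prior = np.log(prior_prob)
log_p, log_not_p = np.log(cond_prob), np.log(1 - cond_prob)  # word present / absent

def find_label(data):
    # Binarize one sparse test row, then score every class in log space.
    x = (data.toarray().ravel() > 0).astype(float)
    # log P(y) + sum_i [x_i log P(i|y) + (1 - x_i) log(1 - P(i|y))]
    scores = log_prior + log_p @ x + log_not_p @ (1 - x)
    return int(np.argmax(scores))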

Now we can obtain the predicted labels and the accuracy score of the model on the test data.

In [66]:
preprocessed_test_text = [remove_tokens(punctuation, text) for text in newsgroups_test.data]
test_data = vectorizer.transform(preprocessed_test_text)

In [67]:
pred = []
from tqdm import tqdm
for text in tqdm(test_data):
    pred.append(find_label(text))


In [None]:
from sklearn import metrics
metrics.accuracy_score(pred, newsgroups_test.target)
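
Calling find_label once per document can be slow. Under the same Bernoulli model, the scores of all documents and all classes can be computed with two matrix products. This is a sketch on made-up data. `predict_bernoulli` and the toy arrays are hypothetical names introduced here, and the function assumes a dense count matrix.

```python
import numpy as np

def predict_bernoulli(X_counts, cond_prob, prior_prob):
    """Predict a class index for each row of a dense document-by-word count matrix."""
    X_bin = (X_counts > 0).astype(float)
    log_p = np.log(cond_prob)          # log P(i | y)
    log_not_p = np.log(1 - cond_prob)  # log(1 - P(i | y))
    # scores[d, y] = log P(y) + sum_i [x_di log P(i|y) + (1 - x_di) log(1 - P(i|y))]
    scores = np.log(prior_prob) + X_bin @ log_p.T + (1 - X_bin) @ log_not_p.T
    return np.argmax(scores, axis=1)

toy_cond_prob = np.array([[0.9, 0.1],   # class 0 mostly contains word 0
                          [0.2, 0.8]])  # class 1 mostly contains word 1
toy_prior = np.array([0.5, 0.5])
toy_X = np.array([[3, 0],
                  [0, 1]])
toy_pred = predict_bernoulli(toy_X, toy_cond_prob, toy_prior)
print(toy_pred)
```

For the full test set, the documents could be processed in batches of a few hundred rows to keep the dense intermediate matrices small.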

**Assignment 6**: Next, we implement Multinomial Naive Bayes. In this model, the conditional probability follows a multinomial distribution. We estimate the conditional probabilities from the data as the frequencies of words across all documents of a given class.

First, count the number of occurrences of each word $i$ in all documents with class $y$ (0.5pt).

In [None]:
word_label_freq = ...
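
One possible way to obtain these counts is to accumulate the raw count rows by label with np.add.at. This is a sketch on made-up data, not the reference solution, and the `toy_` names are assumptions for illustration.

```python
import numpy as np

# Toy data: 4 documents, 3 words, 2 classes (made up for illustration).
toy_counts = np.array([[2, 0, 1],
                       [0, 3, 0],
                       [1, 1, 0],
                       [0, 0, 4]])
toy_target = np.array([0, 1, 0, 1])
num_classes = 2

toy_word_label_freq = np.zeros((num_classes, toy_counts.shape[1]))
# Unbuffered in-place add: row toy_target[d] accumulates the counts of document d.
np.add.at(toy_word_label_freq, toy_target, toy_counts)
print(toy_word_label_freq)
```

Unlike the Bernoulli case, the counts are not clipped to 1, so a word repeated within a document contributes every occurrence.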

**Assignment 7**: Count the total number of words in all documents of class $y$, which is needed to compute the probabilities in the next step (0.5pt).

*Hint*: You can sum the numbers of occurrences of all words in class $y$ obtained in the previous step.

In [None]:
num_word_in_classes = ...

**Assignment 8**: Now we can compute the conditional probability for every pair $(i, y)$. As with Bernoulli Naive Bayes, we add Laplace smoothing to avoid zero probabilities:

$$P(i \mid y) = \frac{N_{yi} + \alpha}{N_y + \alpha n}$$

where $N_{yi}$ is the number of occurrences of word $i$ in all documents with class $y$, $N_y$ is the total number of words in all documents of class $y$, and $n$ is the number of words in the vocabulary.

For numerical stability, you should compute the log of the probability (1pt).



In [None]:
log_cond_prob = ...
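
Here is a sketch of the smoothed log-probability on made-up counts, assuming the estimate (N_yi + alpha) / (N_y + alpha * n) with `toy_alpha = 0.1`. The row totals play the role of num_word_in_classes. Each row of the resulting probabilities sums to 1, which is a useful sanity check. The `toy_` names are assumptions for illustration.

```python
import numpy as np

toy_alpha = 0.1
toy_word_label_freq = np.array([[3.0, 1.0, 1.0],   # class 0: occurrences of each word
                                [0.0, 3.0, 4.0]])  # class 1
n_words = toy_word_label_freq.shape[1]

toy_num_word_in_classes = toy_word_label_freq.sum(axis=1)  # total words per class
# log of (N_yi + alpha) / (N_y + alpha * n), written as a difference of logs
toy_log_cond_prob = (np.log(toy_word_label_freq + toy_alpha)
                     - np.log(toy_num_word_in_classes[:, None] + toy_alpha * n_words))
print(np.exp(toy_log_cond_prob).sum(axis=1))  # row sums of the probabilities
```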

**Assignment 9**: After obtaining all the necessary probabilities, we can predict the label of new test data. Implement the find_label function, which computes the product of the prior and conditional probabilities and selects the label with the highest value. Finally, get predictions for all test data and compute the accuracy score (3pt).

In [None]:
def find_label(data):
    """
      Your code here
    """

In [None]:
pred = []
for text in tqdm(test_data):
    pred.append(find_label(text))
metrics.accuracy_score(pred, newsgroups_test.target)
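
For reference, the multinomial decision rule in log space is log P(y) plus the count-weighted sum of log P(i | y). The multinomial coefficient does not depend on the class, so it can be dropped. This is a sketch on made-up parameters. `predict_multinomial` is a hypothetical helper that assumes a dense count matrix, and it is not the expected solution.

```python
import numpy as np

def predict_multinomial(X_counts, log_cond_prob, prior_prob):
    """Predict a class index for each row of a dense document-by-word count matrix."""
    # scores[d, y] = log P(y) + sum_i x_di * log P(i | y)
    scores = np.log(prior_prob) + X_counts @ log_cond_prob.T
    return np.argmax(scores, axis=1)

toy_log_cond_prob = np.log(np.array([[0.7, 0.2, 0.1],    # class 0 favors word 0
                                     [0.1, 0.3, 0.6]]))  # class 1 favors word 2
toy_prior = np.array([0.5, 0.5])
toy_X = np.array([[4, 1, 0],
                  [0, 1, 5]])
toy_pred = predict_multinomial(toy_X, toy_log_cond_prob, toy_prior)
print(toy_pred)
```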

(Optional) Try to improve the performance of the Naive Bayes model on this dataset. You can try anything, e.g. changing hyperparameters or preprocessing the data differently.
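
As one example of further preprocessing, the header block of a post (From, Subject, Organization, ...) typically ends at the first blank line, so it can be dropped with a split. Whether this helps accuracy has to be checked on the test set. The `strip_header` helper and the sample post are hypothetical and introduced only for illustration.

```python
def strip_header(text):
    """Drop everything up to and including the first blank line, if there is one."""
    header, separator, body = text.partition('\n\n')
    # Without a blank line there is no recognizable header, so keep the text as is.
    return body if separator else text

post = ("From: someone@example.com\nSubject: WHAT car is this!?\nLines: 15\n\n"
        " I was wondering about this car.")
body = strip_header(post)
print(body)
```

sklearn's fetch_20newsgroups also accepts a remove argument, e.g. remove=('headers', 'footers', 'quotes'), which serves the same purpose at load time.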