# Naive Bayes - 20news
In this exercise we use the naive Bayes method for text classification. 
In the [accompanying notebook](./HW1_2IIG0_24_25.ipynb), the `20newsgroups` dataset is loaded for four classes of newsgroups: `rec.autos`, `rec.motorcycles`, `rec.sport.baseball`, `rec.sport.hockey`. 
The text documents are transformed to a bag-of-words representation, given by a data matrix $D \in \{0,1\}^{n \times d}$ in which each row represents a document and each column a word. $D$ is an indicator matrix: $D_{ij} = 1$ if word $j$ occurs in document $i$, and $D_{ij} = 0$ otherwise.
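
To make the representation concrete, here is a minimal sketch of how such an indicator matrix could be built by hand for a toy corpus. The helper `binary_bag_of_words` and the toy documents are illustrative assumptions, not part of the notebook, and the sketch skips the stop-word removal and `min_df` filtering that the notebook's vectorizer applies.

```python
import re
import numpy as np

def binary_bag_of_words(docs):
    """Return (vocabulary, D) where D[i, j] = 1 iff word j occurs in document i."""
    # Lowercase and keep runs of letters only (no digits or underscores)
    tokenized = [set(re.findall(r"[^\W\d_]+", doc.lower())) for doc in docs]
    vocabulary = sorted(set().union(*tokenized))
    index = {word: j for j, word in enumerate(vocabulary)}
    D = np.zeros((len(docs), len(vocabulary)), dtype=int)
    for i, words in enumerate(tokenized):
        for word in words:
            D[i, index[word]] = 1  # presence only, repeated words do not add up
    return vocabulary, D

# Hypothetical toy corpus
toy_docs = ["The engine is loud", "Loud fans love hockey", "hockey hockey hockey"]
vocabulary, D_toy = binary_bag_of_words(toy_docs)
```

Note that the third toy document repeats a word three times, yet its row contains a single 1: the matrix records presence, not frequency.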

In [1]:
import numpy as np

In [2]:
from sklearn.datasets import fetch_20newsgroups
categories = ['rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)

In [3]:
y_train = train.target
y_train

array([0, 3, 0, ..., 3, 1, 2], dtype=int64)

In [4]:
train.target_names

['rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey']

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(stop_words="english", min_df=5, token_pattern=r"[^\W\d_]+", binary=True)
D = vectorizer.fit_transform(train.data)
D_test = vectorizer.transform(test.data)

In [6]:
vectorizer.get_feature_names_out()

array(['aa', 'aaa', 'aamir', ..., 'zubov', 'zx', 'zz'], dtype=object)

In [7]:
np.where(vectorizer.get_feature_names_out() == 'naive')[0]

array([4299], dtype=int64)

## Exercise 5a
Compute the class prior probabilities $p(y)$:
* $p(y = 0)$
* $p(y = 1)$
* $p(y = 2)$
* $p(y = 3)$

In [8]:
# Compute the class prior probabilities
n = y_train.shape[0]
p_y = np.zeros(4)
for i in range(4):
    p_y[i] = np.sum(y_train == i) / n
p_y

array([0.2486396 , 0.25031394, 0.24989535, 0.25115111])
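
The loop above counts the documents of each class and divides by the total. As a minimal sketch of the same computation without an explicit loop, one could use `np.bincount`. The helper name `class_priors` and the toy labels are assumptions made for illustration only.

```python
import numpy as np

def class_priors(y, n_classes):
    """Estimate p(y = c) as the fraction of training labels equal to c."""
    counts = np.bincount(y, minlength=n_classes)  # documents per class
    return counts / counts.sum()

# Hypothetical toy labels: classes 0..3 occur 2, 2, 1 and 3 times
y_toy = np.array([0, 3, 0, 1, 2, 3, 3, 1])
priors_toy = class_priors(y_toy, 4)  # values: 0.25, 0.25, 0.125, 0.375
```

Applied to the notebook's labels, `class_priors(y_train, 4)` should give the same values as `p_y`.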

## Exercise 5b
What are the log-probabilities of the word 'naive' given each class? Use Laplace smoothing with $\alpha = 10^{-5}$. Note that in machine learning, $\log$ denotes the natural logarithm (base $e$) by default.
Let $x_{naive}$ denote the random variable for the feature word 'naive', and compute the following log-probabilities:
* $\log p(x_{naive}=1 \mid y=0)$
* $\log p(x_{naive}=1 \mid y=1)$
* $\log p(x_{naive}=1 \mid y=2)$
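
For reference, the smoothed estimate that the solution cell for this exercise evaluates can be written as follows, for word $j$ and class $c$, with $D$, $d$, and $\alpha$ as defined above and $y_i$ the label of document $i$:

$$
\hat{p}(x_j = 1 \mid y = c) = \frac{\sum_{i:\, y_i = c} D_{ij} + \alpha}{\sum_{i:\, y_i = c} \sum_{k=1}^{d} D_{ik} + \alpha d}
$$

The numerator counts the class-$c$ documents that contain word $j$. The denominator sums all word indicators in class $c$, so the estimates for one class sum to 1 over the vocabulary. A Bernoulli-style estimate would instead divide by $n_c + 2\alpha$, where $n_c$ is the number of documents in class $c$. Which normalization is intended depends on the course's definition. The values reported in this notebook follow the formula shown here.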

In [9]:
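# Compute the smoothed log-probabilities of every word given each class
alpha = 1e-5
D = vectorizer.fit_transform(train.data)
D = D.toarray()  # dense 0/1 indicator matrix
n, d = D.shape
log_p_x_y = np.zeros((4, d))
for i in range(4):
    # number of class-i documents that contain each word
    word_counts = np.sum(D[y_train == i, :], axis=0)
    # normalise by the total number of word indicators in class i, smoothed with alpha per word
    log_p_x_y[i] = np.log((word_counts + alpha) / (np.sum(word_counts) + alpha * d))
# print column 4299 (the index of the word 'naive' found in cell [7])
log_p_x_y[:, 4299]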

array([-10.78571652, -10.07003614, -10.7793644 ,  -9.82897749])
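
The per-class loop can also be expressed with a single matrix product. The following is a minimal sketch on a toy matrix, assuming the same normalization as the cell above (class word count plus $\alpha$, divided by the class total plus $\alpha d$). The function name `smoothed_log_likelihoods` and the toy data are illustrative assumptions.

```python
import numpy as np

def smoothed_log_likelihoods(D, y, n_classes, alpha):
    """log((word count in class + alpha) / (total count in class + alpha * d))."""
    d = D.shape[1]
    one_hot = np.eye(n_classes)[y]                 # (n, n_classes) class indicators
    word_counts = one_hot.T @ D                    # (n_classes, d) documents per class containing each word
    totals = word_counts.sum(axis=1, keepdims=True)
    return np.log((word_counts + alpha) / (totals + alpha * d))

# Hypothetical toy data: 4 documents, 3 words, 2 classes
D_toy = np.array([[1, 1, 0],
                  [1, 0, 0],
                  [0, 1, 1],
                  [0, 0, 1]])
y_toy = np.array([0, 0, 1, 1])
log_p = smoothed_log_likelihoods(D_toy, y_toy, n_classes=2, alpha=1e-5)
```

In the toy data, word 2 never occurs in class 0, so its unsmoothed probability would be 0 and its logarithm undefined. The small $\alpha$ keeps `log_p[0, 2]` finite while leaving the other estimates almost unchanged.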

## Exercise 5c
What is the posterior probability that a document belongs to each of the classes `rec.autos`, `rec.motorcycles`, `rec.sport.baseball`, and `rec.sport.hockey`, given that the word 'autos', 'motorcycles', 'baseball', or 'hockey', respectively, appears in the document?
Use Bayes' theorem to compute the posterior probability for each of the following:
* $p(y=0 \mid x_{autos}=1)$
* $p(y=1 \mid x_{motorcycles}=1)$
* $p(y=2 \mid x_{baseball}=1)$
* $p(y=3 \mid x_{hockey}=1)$

In [10]:
# Find the indices of the words 'autos', 'motorcycles', 'baseball', and 'hockey'
words = ['autos', 'motorcycles', 'baseball', 'hockey']
indices = np.zeros(4, dtype=int)
for i in range(4):
    indices[i] = np.where(vectorizer.get_feature_names_out() == words[i])[0][0]
indices

array([ 402, 4219,  484, 2956])

In [11]:
# Compute the posterior probabilities with Bayes' theorem
# p(y=0|x_autos=1) = p(x_autos=1|y=0) * p(y=0) / p(x_autos=1)
# p(x_autos=1) = sum over all classes y of p(x_autos=1|y) * p(y)
p_x_auto_1 = np.sum(np.exp(log_p_x_y[:, 402]) * p_y)
p_y_0_x_auto_1 = np.exp(log_p_x_y[0, 402]) * p_y[0] / p_x_auto_1
p_y_0_x_auto_1

0.9999993053247772

In [12]:
# p(y=1|x_motorcycles=1) = p(x_motorcycles=1|y=1) * p(y=1) / p(x_motorcycles=1)
p_x_motorcycles_1 = np.sum(np.exp(log_p_x_y[:, 4219]) * p_y)
p_y_1_x_motorcycles_1 = np.exp(log_p_x_y[1, 4219]) * p_y[1] / p_x_motorcycles_1
p_y_1_x_motorcycles_1

0.9026014911751483

In [13]:
# p(y=2|x_baseball=1) = p(x_baseball=1|y=2) * p(y=2) / p(x_baseball=1)
p_x_baseball_1 = np.sum(np.exp(log_p_x_y[:, 484]) * p_y)
p_y_2_x_baseball_1 = np.exp(log_p_x_y[2, 484]) * p_y[2] / p_x_baseball_1
p_y_2_x_baseball_1

0.8596891709633955

In [14]:
# p(y=3|x_hockey=1) = p(x_hockey=1|y=3) * p(y=3) / p(x_hockey=1)
p_x_hockey_1 = np.sum(np.exp(log_p_x_y[:, 2956]) * p_y)
p_y_3_x_hockey_1 = np.exp(log_p_x_y[3, 2956]) * p_y[3] / p_x_hockey_1
p_y_3_x_hockey_1

0.9811299872500109
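
The four cells above repeat the same Bayes computation with different indices. As a minimal sketch, the computation could be wrapped in one function that returns the posterior over all classes for a given word. The function name `posterior_given_word` and the toy numbers are assumptions for illustration. Working in log space and subtracting the maximum before exponentiating is an added stability measure, not something the notebook does.

```python
import numpy as np

def posterior_given_word(log_p_x_y, p_y, j):
    """Return p(y = c | x_j = 1) for every class c via Bayes' theorem."""
    log_joint = log_p_x_y[:, j] + np.log(p_y)  # log p(x_j=1 | y=c) + log p(y=c)
    log_joint -= log_joint.max()               # rescale so the largest term is exp(0) = 1
    joint = np.exp(log_joint)
    return joint / joint.sum()                 # divide by the evidence p(x_j = 1)

# Hypothetical toy model: 2 classes, 2 words, uniform prior
toy_log_p = np.log(np.array([[0.6, 0.1],
                             [0.2, 0.3]]))
toy_prior = np.array([0.5, 0.5])
post = posterior_given_word(toy_log_p, toy_prior, j=0)  # values: 0.75, 0.25
```

With the notebook's arrays, `posterior_given_word(log_p_x_y, p_y, indices[0])[0]` should reproduce the value computed for $p(y=0 \mid x_{autos}=1)$, and likewise for the other three words with `indices[1]`, `indices[2]`, and `indices[3]`.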