<a href="https://colab.research.google.com/github/yexf308/MAT592/blob/main/Module1/NaiveBayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%pylab inline 
import numpy.linalg as LA

$\def\m#1{\mathbf{#1}}$
$\def\mm#1{\boldsymbol{#1}}$
$\def\mb#1{\mathbb{#1}}$
$\def\c#1{\mathcal{#1}}$

Refer to PML and [StatQuest](https://statquest.gumroad.com/l/wvtmc). 

# Naive Bayes Classifier (NBC)

Naive Bayes assumption: the features are **conditionally independent** given the class label, i.e.,
the class conditional density factorizes over the features, 
$$ p(\m{x}|y=c,\mm{\theta}) = \Pi_{d=1}^D p(x_d|y=c, \theta_{dc})
$$
where $\theta_{dc}$ are the parameters for the class conditional density for class $c$ and feature $d$. 

### Features: 
- "naive": the assumption is very naive! We do not expect the **features** to be conditionally independent given the **class label**. The assumption rarely holds in practice, but the classifier often still performs well. 

- "simple": the model is simple. Note $y\in \{1,\dots, C\}$. It has only $O(CD)$ parameters for $C$ classes and $D$ features, compared with $O(CD^2)$ in QDA, so it is less prone to overfitting.

- Connection to QDA: later! Naive Bayes is much broader than QDA, since the class conditional density is not necessarily Gaussian. 
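To make the parameter counts concrete, here is a small sketch that counts the class-conditional parameters of a Gaussian naive Bayes model (one mean and one variance per feature and class) and of QDA (one mean vector and one full symmetric covariance matrix per class). The function names are hypothetical, and the class priors are left out of both counts.

```python
def gaussian_nb_param_count(C, D):
    # One mean and one variance for every (class, feature) pair: O(CD).
    return 2 * C * D

def qda_param_count(C, D):
    # Per class: D means plus D*(D+1)/2 free entries of a symmetric covariance: O(CD^2).
    return C * (D + D * (D + 1) // 2)

for D in (10, 100):
    print(D, gaussian_nb_param_count(3, D), qda_param_count(3, D))
```

The gap grows linearly in $D$: the QDA count is larger by a factor of roughly $D/4$ for large $D$.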

The posterior distribution over the class labels is 
$$
p(y=c|\m{x}, \theta) = \frac{p(y=c|\mm{\pi}) p(\m{x}|y=c,\mm{\theta}) }{\sum_{c'}p(y=c'|\mm{\pi}) p(\m{x}|y=c',\mm{\theta})} = \frac{p(y=c|\mm\pi)  \Pi_{d=1}^D p(x_d|y=c, \theta_{dc})}{\sum_{c'}p(y=c'|\mm\pi) \Pi_{d=1}^D p(x_d|y=c', \theta_{dc'})}
$$
where $\pi_c = p(y=c|\mm\pi)$ is the prior probability of class $c$ and $\theta=\{\mm\pi, \{\theta_{dc}\}\}$ are the parameters. 


## Examples

- $x_d\in\{0,1\}$: Binary feature. 
   - Model: **multivariate Bernoulli naive Bayes**.
   - Bernoulli distribution: $ p(\m{x}|y=c,\mm{\theta}) = \Pi_{d=1}^D \text{Ber}(x_d|\theta_{dc})$, where $\theta_{dc} = p(x_d = 1 |y=c)$. Note $1-\theta_{dc} = p(x_d = 0 |y=c)$.

- $x_d\in\{1, \dots, K\}$: Categorical feature.
   - Model:  **multivariate Categorical naive Bayes**.
   - Categorical distribution: $ p(\m{x}|y=c,\mm{\theta}) = \Pi_{d=1}^D \text{Cat}(x_d|\mm{\theta}_{dc})$ where $\theta_{dck}=p(x_d=k | y=c)$. Note $\sum_{k=1}^K \theta_{dck}=1$. 


- $x_d\in \mb{R}$: Real-valued feature.
   - Model: **Gaussian naive Bayes**, i.e., GDA with diagonal covariance.
   - Univariate Gaussian distribution: $ p(\m{x}|y=c,\mm{\theta})=\Pi_{d=1}^D \c{N}(x_d| \mu_{dc}, \sigma_{dc}^2)$, where $\mu_{dc}$ is the mean of feature $d$ when the class label is $c$ and $\sigma_{dc}^2$ is its variance. 
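For the binary-feature case, the posterior over class labels can be evaluated in log space, which avoids underflow when $D$ is large. The sketch below assumes the parameters are already known and stores them as `theta[c, d]` (an array layout chosen here for illustration), with every entry strictly between 0 and 1; the function name is hypothetical.

```python
import numpy as np

def bernoulli_nb_posterior(x, log_prior, theta):
    """Posterior p(y=c | x) for binary features.

    x: (D,) array of 0/1 values
    log_prior: (C,) array of log pi_c
    theta: (C, D) array with theta[c, d] = p(x_d = 1 | y = c)
    """
    # Sum of per-feature log-likelihoods for each class (the product over d).
    log_lik = (x * np.log(theta) + (1 - x) * np.log(1 - theta)).sum(axis=1)
    log_joint = log_prior + log_lik
    # Subtract the maximum before exponentiating (log-sum-exp trick).
    log_joint = log_joint - log_joint.max()
    post = np.exp(log_joint)
    return post / post.sum()

pi = np.array([0.6, 0.4])
theta = np.array([[0.9, 0.2, 0.5],
                  [0.1, 0.7, 0.5]])
x = np.array([1, 0, 1])
print(bernoulli_nb_posterior(x, np.log(pi), theta))
```

The third feature has the same probability under both classes, so it does not change the posterior.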




## Model fitting 
Fit a naive Bayes classifier using MLE. Note $\c{D}=\{\m{x}^{(i)}, y^{(i)}\}_{i=1}^N$. 
\begin{align}
p(\c{D}|\mm{\theta}) &= \Pi_{i=1}^N \text{Cat}(y^{(i)}|\mm{\pi}) \Pi_{d=1}^D p(x_d^{(i)}|y^{(i)},\mm{\theta}_d) \\
&= \Pi_{i=1}^N \text{Cat}(y^{(i)}|\mm{\pi}) \Pi_{d=1}^D \Pi_{c=1}^C p(x_d^{(i)}|\mm\theta_{dc})^{\mb{1}_{(y^{(i)}=c)}}  
\end{align}
Then the log-likelihood is given by 
\begin{align}
\log p(\c{D}|\mm{\theta}) = \left[\sum_{i=1}^N \sum_{c=1}^C \mb{1}_{(y^{(i)}=c)}\log \pi_c \right] +\sum_{c=1}^C \sum_{d=1}^D \left[\sum_{i: y^{(i)}=c}\log p(x_d^{(i)}|\mm{\theta}_{dc})\right]
\end{align}

This decomposes into one term for $\mm{\pi}$ and $CD$ terms, one for each $\mm\theta_{dc}$:
\begin{align}
\log p(\c{D}|\mm{\theta}) = \log p(\c{D}_y|\mm{\pi}) + \sum_c\sum_d \log p(\c{D}_{dc}|\mm{\theta}_{dc})
\end{align}

where $\c{D}_y = \{y^{(i)}: i =1:N\}$ are all the labels and $\c{D}_{dc} = \{x_d^{(i)}: y^{(i)}=c\}$ are all the values of feature $d$ for examples from class $c$. 



- The MLE for $\mm{\pi}$ is the vector of empirical class frequencies, $\hat{\pi}_c = \frac{N_c}{N}$, where $N_c$ is the number of examples in class $c$. 

- The MLE for $\mm{\theta}_{dc}$ depends on the class conditional density chosen for the features, $p(\m{x}|y=c, \mm{\theta})$.

  - For categorical features, the MLE is 
    \begin{align}
    \hat{\theta}_{dck} = \frac{N_{dck}}{\sum_{k'=1}^K N_{dck'}}= \frac{N_{dck}}{N_c}
    \end{align}
    where $N_{dck} = \sum_{i=1}^N \mb{1}_{(x^{(i)}_d=k, y^{(i)}=c)}$ is the number of times that feature $d$ takes value $k$ in examples of class $c$. 

  - For binary features, the categorical distribution reduces to the Bernoulli, and the MLE is 
  \begin{align}
   \hat{\theta}_{dc} = \frac{N_{dc}}{N_c}
  \end{align}
  which is the empirical fraction of examples of class $c$ in which feature $d$ is on ($N_{dc}$ counts these examples). 

  - For real-valued features, we use the Gaussian distribution. As in QDA, the MLEs are the per-class sample mean and variance of each feature, 
  \begin{align}
  &\hat\mu_{dc} = \frac{1}{N_c} \sum_{i: y^{(i)}=c} x_d^{(i)} \\
  &\hat\sigma^2_{dc} = \frac{1}{N_c} \sum_{i: y^{(i)}=c}(x^{(i)}_d - \hat\mu_{dc})^2
  \end{align}
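The count-based estimates for $\hat{\pi}_c$ and $\hat{\theta}_{dck}$ can be sketched as follows. This is a minimal illustration with a hypothetical function name; it assumes labels and feature values are zero-indexed integers (so $k \in \{0,\dots,K-1\}$ rather than $\{1,\dots,K\}$) and that every class appears at least once in the data.

```python
import numpy as np

def fit_categorical_nb(X, y, C, K):
    """MLE for a categorical naive Bayes model by counting.

    X: (N, D) integer array with entries in {0, ..., K-1}
    y: (N,) integer array with entries in {0, ..., C-1}
    Returns pi of shape (C,) and theta of shape (C, D, K).
    """
    N, D = X.shape
    N_c = np.bincount(y, minlength=C)        # class counts N_c
    pi = N_c / N                             # pi_c = N_c / N
    counts = np.zeros((C, D, K))
    for c in range(C):
        Xc = X[y == c]
        for d in range(D):
            # N_dck: how often feature d takes value k within class c
            counts[c, d] = np.bincount(Xc[:, d], minlength=K)
    theta = counts / N_c[:, None, None]      # theta_dck = N_dck / N_c
    return pi, theta

X = np.array([[0, 1], [0, 2], [1, 1], [2, 0], [2, 0]])
y = np.array([0, 0, 0, 1, 1])
pi, theta = fit_categorical_nb(X, y, C=2, K=3)
print(pi)
print(theta[0, 0])
```

Note that the plain MLE assigns probability zero to any value that never occurs within a class; no smoothing is applied in this sketch.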

So fitting an NBC is simple and efficient: it only requires counting and averaging, with no iterative optimization. 
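For real-valued features, fitting and prediction fit in a few lines of NumPy. The sketch below uses hypothetical function names and synthetic data generated only for illustration; it assumes zero-indexed integer labels and that every class has at least one example with nonzero variance in each feature.

```python
import numpy as np

def fit_gaussian_nb(X, y, C):
    """Return priors (C,), means (C, D) and variances (C, D) by MLE."""
    pi = np.array([(y == c).mean() for c in range(C)])
    mu = np.array([X[y == c].mean(axis=0) for c in range(C)])
    # np.var uses the 1/N_c normalization by default, matching the MLE.
    var = np.array([X[y == c].var(axis=0) for c in range(C)])
    return pi, mu, var

def predict_gaussian_nb(X, pi, mu, var):
    """Most probable class for each row of X."""
    # diff has shape (N, C, D): every example against every class mean.
    diff = X[:, None, :] - mu[None, :, :]
    # Sum over features d of log N(x_d | mu_dc, var_dc), for every class c.
    log_lik = -0.5 * (np.log(2 * np.pi * var)[None] + diff**2 / var[None]).sum(axis=2)
    return np.argmax(np.log(pi)[None, :] + log_lik, axis=1)

# Synthetic two-class data, used only to exercise the functions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(4.0, 1.0, size=(50, 2))])
y = np.repeat([0, 1], 50)
pi, mu, var = fit_gaussian_nb(X, y, C=2)
print(pi, mu.round(2))
print("training accuracy:", (predict_gaussian_nb(X, pi, mu, var) == y).mean())
```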





# Example 1: Spam email 

## Connection to logistic regression
Assume all features are discrete and have $K$ states, i.e., $x_d\in\{1,2,\dots, K\}$. 

Define $x_{dk}=\mb{1}_{(x_d=k)}$, then $\m{x}_d$ is the one-hot vector of feature $d$. Then the class conditional density can be written as follows: 
\begin{align}
p(\m{x}|y=c, \mm{\theta})=\Pi_{d=1}^D \text{Cat}(x_d |y=c, \mm{\theta}) = \Pi_{d=1}^D\Pi_{k=1}^K \theta_{dck}^{x_{dk}}
\end{align}
Hence the posterior over classes is given by
\begin{align}
p(y=c | \m{x}, \mm \theta) &= \frac{\pi_c\Pi_d\Pi_k  \theta_{dck}^{x_{dk}} }{\sum_{c'} \pi_{c'}\Pi_d\Pi_k  \theta_{dc'k}^{x_{dk}}} \\
& = \frac{\exp(\log \pi_c +\sum_d \sum_k x_{dk}\log \theta_{dck})}{\sum_{c'}\exp(\log \pi_{c'} +\sum_d \sum_k x_{dk}\log \theta_{dc'k})} \\
&=\frac{\exp(\beta_c^\top \m{x}+\gamma_c)}{\sum_{c'}\exp(\beta_{c'}^\top \m{x}+\gamma_{c'})}
\end{align}
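with suitably defined $\beta_c$ and $\gamma_c$: here $\m{x}$ denotes the concatenation of the one-hot vectors $\m{x}_d$, $\beta_c$ stacks the entries $\log \theta_{dck}$ in the same order, and $\gamma_c = \log \pi_c$. This has exactly the same form as multinomial logistic regression with softmax output. 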
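The identity can be checked numerically. The sketch below draws random valid parameters, computes the naive Bayes posterior directly, and compares it with the softmax form, taking $\beta_c$ to be the stacked values $\log\theta_{dck}$ and $\gamma_c=\log\pi_c$, with $\m{x}$ the concatenation of the one-hot vectors. Values are zero-indexed here, and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
C, D, K = 3, 4, 5

# Random but valid parameters: pi sums to 1, each theta[c, d] sums to 1 over k.
pi = rng.dirichlet(np.ones(C))
theta = rng.dirichlet(np.ones(K), size=(C, D))   # shape (C, D, K)

x = rng.integers(0, K, size=D)                   # one example, values in {0, ..., K-1}
x_onehot = np.eye(K)[x].ravel()                  # concatenated one-hot vectors, length D*K

# Direct naive Bayes posterior: pi_c * prod_d theta[c, d, x_d], normalized.
joint = pi * np.prod(theta[:, np.arange(D), x], axis=1)
posterior_nb = joint / joint.sum()

# Softmax form: beta_c stacks log theta_dck, gamma_c = log pi_c.
beta = np.log(theta).reshape(C, D * K)
gamma = np.log(pi)
logits = beta @ x_onehot + gamma
posterior_softmax = np.exp(logits - logits.max())
posterior_softmax /= posterior_softmax.sum()

print(np.allclose(posterior_nb, posterior_softmax))
```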

**Note:** the result holds for arbitrary feature distributions in the exponential family.

**Difference:**
- In Naive Bayes, we maximize the joint likelihood $p(\c{D}|\mm\theta)= \Pi_{i=1}^N p(\m{x}^{(i)}, y^{(i)}|\mm\theta)$. 

- In logistic regression, we maximize the conditional likelihood $\Pi_{i=1}^N p(y^{(i)}|\m{x}^{(i)}, \mm\theta)$.

So the two methods generally give different parameter estimates, even though the resulting classifiers have the same functional form. 
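The difference can be seen on a small synthetic problem. The sketch below is restricted to two classes and binary features, where the naive Bayes posterior reduces to a sigmoid of a linear function. The data are made up so that two features are exact copies, which violates the independence assumption. Logistic regression is fitted by plain gradient ascent with a fixed step size, started from the naive Bayes solution; this optimizer and all names are choices made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500

# Synthetic binary data: features 0 and 1 are identical copies.
y = rng.integers(0, 2, size=N)
base = (rng.random(N) < np.where(y == 1, 0.8, 0.3)).astype(float)
other = (rng.random(N) < np.where(y == 1, 0.6, 0.4)).astype(float)
X = np.column_stack([base, base, other])

def cond_log_lik(w, b):
    # Mean conditional log-likelihood log p(y | x) of a binary logistic model.
    z = X @ w + b
    return np.mean(y * z - np.logaddexp(0.0, z))

# Naive Bayes: count-based MLE, then rewrite the posterior as sigmoid(w^T x + b).
theta = np.array([X[y == c].mean(axis=0) for c in (0, 1)])   # theta[c, d]
pi1 = y.mean()
w_nb = np.log(theta[1] / theta[0]) - np.log((1 - theta[1]) / (1 - theta[0]))
b_nb = np.log(pi1 / (1 - pi1)) + np.sum(np.log((1 - theta[1]) / (1 - theta[0])))

# Logistic regression: gradient ascent on the conditional likelihood.
w, b = w_nb.copy(), b_nb
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w += 0.5 * (X.T @ (y - p)) / N
    b += 0.5 * np.mean(y - p)

print("NB weights:", w_nb.round(3), "cond. log-lik:", cond_log_lik(w_nb, b_nb))
print("LR weights:", w.round(3), "cond. log-lik:", cond_log_lik(w, b))
```

Naive Bayes counts the evidence from the duplicated feature twice, so its weights differ from the logistic regression weights and its conditional likelihood is lower.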



# Generative classifier vs Discriminative classifier

<img src="https://github.com/yexf308/MAT592/blob/main/image/GvsD.png?raw=true" width="400" />


### Generative classifier
Def: A model of the form $p(\m{x}, y) = p(y)p(\m{x}|y)$. It can be used to generate examples $\m{x}$ for each class $y$. 
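As an illustration of the generative view, the sketch below samples from a Bernoulli naive Bayes model by first drawing a label from the prior and then drawing each feature from its class conditional distribution. The function name and the parameter values are hypothetical.

```python
import numpy as np

def sample_bernoulli_nb(n, pi, theta, rng):
    """Draw n pairs (x, y) from p(y) p(x | y) with binary features.

    pi: (C,) class prior; theta: (C, D) with theta[c, d] = p(x_d = 1 | y = c).
    """
    y = rng.choice(len(pi), size=n, p=pi)                 # y ~ Cat(pi)
    u = rng.random((n, theta.shape[1]))
    X = (u < theta[y]).astype(int)                        # x_d ~ Ber(theta_dc), independently over d
    return X, y

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.5])
theta = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
X, y = sample_bernoulli_nb(5, pi, theta, rng)
print(X)
print(y)
```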

Examples: 
 - LDA and QDA.
 - Naive Bayes.
 - Gaussian mixture and other mixture models
 - Variational autoencoder
 - Generative adversarial network

Advantages:

- **Easy to fit.** In LDA and naive Bayes, we fit the classifier by counting and averaging. We do not need to numerically solve a convex or nonconvex optimization problem, which can be time-consuming, as we do for a discriminative classifier. 

- **Can handle missing input features**: discussed in detail below. A discriminative classifier assumes the feature vector $\m{x}$ is always fully available to be conditioned on. 

- **Can handle unlabeled training data**: in semi-supervised learning, we combine labeled data $\{\m{x}^{(i)}, y^{(i)}\}$ with unlabeled data $\{\m{x}^{(i)}\}$. This is harder and less natural to do with a discriminative classifier. 

- **Can fit classes separately:** We estimate the parameters of each class conditional density independently, so we do not have to retrain the model when we add more classes. In a discriminative classifier, the whole model has to be retrained. 



### Discriminative classifier
Def: A model of the form $p(y|\m{x}) $. It can only be used to discriminate between different classes.

Examples:

- KNN
- Logistic regression
- SVM
- Decision tree and random forest

Advantages: 

- **Better predictive accuracy:** Discriminative classifiers are often much more accurate than generative classifiers. The reason is that the conditional distribution $p(y|\m{x})$ is often much simpler (and therefore easier to learn) than the joint distribution $p(y, \m{x})$. In particular, discriminative models do not need to “waste effort” modeling the distribution of the input features.

- **Can handle feature preprocessing:** A big advantage of discriminative methods is that they allow us to preprocess the input in arbitrary ways. For example, we can perform a polynomial expansion of the input features. It is often hard to define a generative model on such pre-processed data, since the new features can be correlated in complex ways which are hard to model.

- **Well-calibrated probabilities:**  Some generative classifiers, like NBC, make strong independence assumptions which are often not valid. Discriminative models, such as logistic regression, are often better calibrated in terms of their probability estimates. 



## Handling missing features
In a generative classifier, we can handle this situation by marginalizing out the missing values.

Suppose we are missing the value of $x_1$, the first feature of $\m{x}$. We can then compute 
\begin{align}
p(y=c|x_{2:D}, \mm{\theta}) &\propto  p(y=c|\mm\pi) p(x_{2:D}|y=c, \mm\theta) \\
&=  p(y=c|\mm\pi) \sum_{x_1}p(x_1,x_{2:D}|y=c, \mm\theta) 
\end{align}

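In Gaussian discriminant analysis, $x_1$ can be marginalized out in closed form, since the marginal of a multivariate Gaussian is again Gaussian: we drop the entry for $x_1$ from the mean and the corresponding row and column from the covariance. More on this in applied statistics. 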

In NBC, we can simply ignore the likelihood term for $x_1$, since it sums to one:
\begin{align}
 \sum_{x_1}p(x_1,x_{2:D}|y=c, \mm\theta)  = \left[\sum_{x_1} p(x_1| y=c, \mm\theta_{1c})\right] \Pi_{d=2}^Dp(x_d|y=c, \mm\theta_{dc})  =\Pi_{d=2}^Dp(x_d|y=c, \mm\theta_{dc}) 
\end{align}
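In code, this amounts to skipping the missing features when summing the log-likelihood terms. The sketch below does this for binary features, using `NaN` to mark a missing value; the function name, the `NaN` convention, and the parameter values are assumptions made for illustration.

```python
import numpy as np

def bernoulli_nb_posterior_missing(x, pi, theta):
    """Posterior p(y=c | observed features); NaN entries of x are treated as missing.

    x: (D,) float array of 0/1 values or NaN
    pi: (C,) class prior; theta: (C, D) with theta[c, d] = p(x_d = 1 | y = c)
    """
    observed = ~np.isnan(x)
    xo, th = x[observed], theta[:, observed]
    # Only observed features contribute; each missing feature's term sums to one.
    log_lik = (xo * np.log(th) + (1 - xo) * np.log(1 - th)).sum(axis=1)
    log_joint = np.log(pi) + log_lik
    post = np.exp(log_joint - log_joint.max())
    return post / post.sum()

pi = np.array([0.6, 0.4])
theta = np.array([[0.9, 0.2],
                  [0.1, 0.7]])
print(bernoulli_nb_posterior_missing(np.array([np.nan, 1.0]), pi, theta))
print(bernoulli_nb_posterior_missing(np.array([np.nan, np.nan]), pi, theta))
```

When every feature is missing, the posterior falls back to the prior $\mm\pi$.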



