## Lighthouse Labs
### W05D04 Naive Bayes
Instructor: Socorro Dominguez  
February 04, 2020

In [1]:
# And import the libraries
import IPython
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from IPython.display import HTML

%pylab inline
# pip install git+https://github.com/mgelbart/plot-classifier.git
from plot_classifier import plot_classifier
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# train test split and cross validation
from sklearn.model_selection import (
    GridSearchCV,
    cross_val_score,
    cross_validate,
    train_test_split,
)

from sklearn.pipeline import Pipeline, make_pipeline

from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

%matplotlib inline
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB

pd.set_option("display.max_colwidth", 200)

Populating the interactive namespace from numpy and matplotlib


**Second Half Agenda:**
* Conditional Probability Review
    * Bayes theorem
    
    
* Naive Bayes
    * Multinomial
    * Gaussian

## Review: Conditional probabilities

- Conditional probabilities are a way of using information we have about random variables.
- For example, for a fair 6-sided die, suppose we already know the roll is odd because someone told us. What are the 6 _conditional_ probabilities of each outcome?
  - They are: ...
- We write conditional probabilities with a vertical bar, `|`. The information we're conditioning on goes after the bar.
  - E.g., $P(X=i \mid \text{X is odd})$ is a **conditional probability**
  - The set of these values forms the **conditional distribution**
    - Conditional distributions are still probability distributions - they must sum up to 1.

- So, what's the pattern here?
  - The conditioning _eliminates some possible outcomes_.
  - For the remaining outcomes, we _renormalized_ the distribution.
  - That is, we took the proportion of the allowed outcomes that satisfy the event description.

### Conditional probabilities - formalizing things

The key equation with conditional probabilities is

$$P(A\mid B)=\frac{P(A \cap B)}{P(B)}$$

The "renormalizing" trick is a consequence of this. 

Consider: what's the probability of rolling a 6 given that the roll is not 1?

- Let $A$ be the event that the roll is a 6
- Let $B$ be the event that the roll is not a 1

$$P(A\mid B)=\frac{P(A \cap B)}{P(B)}=\frac{P(A)}{P(B)}=\frac{1/6}{5/6}=\frac{1}{5}$$

In this case, we had the simplification that $P(A\cap B)=P(A)$. This is often not the case.
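The eliminate-and-renormalize idea can be checked by brute-force counting. The sketch below assumes a fair six-sided die; the helper `conditional_probability` is a hypothetical name introduced only for this illustration.

```python
from fractions import Fraction

# Sample space of a fair six-sided die: every outcome is equally likely.
outcomes = [1, 2, 3, 4, 5, 6]


def conditional_probability(event_a, event_b):
    """Return P(A | B) = P(A and B) / P(B) by counting equally likely outcomes."""
    in_b = [o for o in outcomes if event_b(o)]      # outcomes allowed by B
    in_a_and_b = [o for o in in_b if event_a(o)]    # of those, outcomes also in A
    return Fraction(len(in_a_and_b), len(in_b))


# A: the roll is a 6, B: the roll is not a 1.
p_six_given_not_one = conditional_probability(lambda o: o == 6, lambda o: o != 1)
print(p_six_given_not_one)  # 1/5
```

Because all outcomes are equally likely, dividing counts is equivalent to dividing $P(A \cap B)$ by $P(B)$.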

### Bayes' Theorem

**Brain teaser:** A heritable disease occurs randomly in 10% of the population. If someone has the disease, it is passed on to their children with probability 50%. A mother has 1 healthy child. Given this, what's the conditional probability that the mother has the disease? 

- Is the answer 10%? Less? More? How do we quantify it?
  - Let $M$ be the event that the mother has the disease.
  - Let $C$ be the event that the child has the disease.
  - We want $P(M\mid \textrm{not } C)$. We have $P(M)=0.1$ and $P(\textrm{not }C\mid M)=0.5$.

Solution:

$$P(M \mid \textrm{not } C) = \frac{P(\textrm{not } C \mid M)P(M)}{P(\textrm{not } C)}$$

So we still need $P(\textrm{not } C)$. This could happen in 2 ways ("law of total probability")

$$P(\textrm{not } C)=P(\textrm{not } C \mid M)P(M) + P(\textrm{not } C \mid \textrm{not } M)P(\textrm{not } M)$$

We know $P(\textrm{not } M)=1-P(M)=0.9$.   
We assume $P( C \mid \textrm{not } M)=0.1$ because the child can randomly get the disease like anyone else,   
so then $P(\textrm{not } C \mid \textrm{not } M)=1-P( C \mid \textrm{not } M)=0.9$. 

Finally, then, we're left with:

$$P(M \mid \textrm{not } C) = \frac{0.5 \times 0.1}{0.5\times 0.1 + 0.9 \times 0.9} \approx 0.058$$
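The same arithmetic can be written out step by step. This is a minimal sketch using only the numbers stated in the brain teaser; the variable names are illustrative.

```python
# Quantities stated in the brain teaser.
p_m = 0.1                # P(M): the mother has the disease
p_not_c_given_m = 0.5    # P(not C | M): the disease is not passed on
p_c_given_not_m = 0.1    # P(C | not M): assumed equal to the population rate

# Complements.
p_not_m = 1 - p_m
p_not_c_given_not_m = 1 - p_c_given_not_m

# Law of total probability gives the denominator P(not C).
p_not_c = p_not_c_given_m * p_m + p_not_c_given_not_m * p_not_m

# Bayes' theorem.
p_m_given_not_c = p_not_c_given_m * p_m / p_not_c
print(round(p_m_given_not_c, 3))  # 0.058
```

Seeing a healthy child lowers the probability that the mother has the disease from 10% to roughly 5.8%.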


- We can get what we need using **Bayes' Theorem**.
- We've seen above that, for events $A$ and $B$, $P(A,B)=P(A\mid B)P(B)$. 
- We can also write this as $P(A,B)=P(B\mid A)P(A)$. 
- Since these are equal, we get the famous Bayes' theorem:

$$P(A\mid B)=\frac{P(B\mid A)P(A)}{P(B)}$$

If curious, you should also review:
- Law of Total Probability  $P(X=x)=\sum_y P(X=x\mid Y=y)P(Y=y)$
- Conditional Expectations

## Naive Bayes

- For years, the best spam filtering methods used naive Bayes.
- It is our first probabilistic classifier, where we think of learning as a problem of statistical inference.

- Classification technique based on Bayes' Theorem **with an assumption of independence among predictors**, hence "naive". 
    - The presence of a particular feature in a class is unrelated to the presence of any other feature.

E.g. You receive a spam mail that contains the words "Money", "URGENT!", "Prize!". Even if these features depend on each other or others, all of these properties independently contribute to the probability that this email is SPAM.

- Naive Bayes is easy to build and useful for very large data sets. 

- Naive Bayes can be competitive with much more sophisticated classification methods and works well with text data.

## Naive Bayes Classifier


Before understanding the theory, let's try `scikit-learn`'s implementation of Naive Bayes on Kaggle's [SMS Spam Collection Dataset](https://www.kaggle.com/uciml/sms-spam-collection-dataset).

#### We will use `CountVectorizer` to get bag-of-words (BOW) representation

- We use `CountVectorizer` to convert text data into feature vectors where
    - each feature is a unique word in the text  
    - each feature value represents the frequency or presence/absence of the word in the given message         
    
<img src='./images/bag-of-words.png' width="800">

[Source](https://web.stanford.edu/~jurafsky/slp3/4.pdf)       
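To make the bag-of-words idea concrete, here is a simplified standard-library stand-in for what `CountVectorizer` does. It is a sketch, not the library's implementation: it only mimics the default behaviour of lowercasing and keeping tokens of two or more word characters, and the two messages are hypothetical.

```python
import re
from collections import Counter

# Two hypothetical messages, used only for illustration.
messages = ["URGENT! Free prize, free entry!", "See you at the block party"]


def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


# Vocabulary: the sorted set of unique tokens across all messages.
vocabulary = sorted({token for message in messages for token in tokenize(message)})

# One row per message, one column per vocabulary word; values are word counts.
bow_rows = []
for message in messages:
    counts = Counter(tokenize(message))
    bow_rows.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in bow_rows:
    print(row)
```

Word order is discarded: each message is reduced to a vector of counts over the shared vocabulary.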

In [2]:
sms_df = pd.read_csv("data/spam.csv", encoding="latin-1")
sms_df = sms_df.drop(["Unnamed: 2", "Unnamed: 3", "Unnamed: 4"], axis=1)
sms_df = sms_df.rename(columns={"v1": "target", "v2": "sms"})

In [3]:
train_df, test_df = train_test_split(sms_df, test_size=0.2, random_state=123)
X_train, y_train = train_df["sms"], train_df["target"]
X_test, y_test = test_df["sms"], test_df["target"]

In [4]:
train_df.head()

| | target | sms |
|---|---|---|
| 385 | ham | It took Mr owl 3 licks |
| 4003 | ham | Well there's a pattern emerging of my friends telling me to drive up and come smoke with them and then telling me that I'm a weed fiend/make them smoke too much/impede their doing other things so ... |
| 1283 | ham | Yes i thought so. Thanks. |
| 2327 | spam | URGENT! Your mobile number *************** WON a å£2000 Bonus Caller prize on 10/06/03! This is the 2nd attempt to reach you! Call 09066368753 ASAP! Box 97N7QP, 150ppm |
| 1103 | ham | Aiyah sorry lor... I watch tv watch until i forgot 2 check my phone. |


In [5]:
from sklearn.naive_bayes import MultinomialNB

pipe_nb = make_pipeline(CountVectorizer(), MultinomialNB())
pipe_nb.fit(X_train, y_train)
print("Training Acc.: ", round(pipe_nb.score(X_train,y_train),4))
print("Valid Acc.: ", round(pipe_nb.score(X_test,y_test), 4))

Training Acc.:  0.9933
Valid Acc.:  0.9865


### Naive Bayes `predict`

- Given a new message, we want to predict whether it's spam or non spam (ham).
- Example: Predict whether the following message is spam or non spam (ham). 
> "URGENT! Free!!"

In [7]:
deploy_test = ["URGENT! Free!!", "I like Socorro's classes!"]
pipe_nb.predict(deploy_test)

array(['spam', 'ham'], dtype='<U4')

### Probabilistic classifiers: `predict` by hand 

- What is it doing under the hood? 
- Let's look at an example with a toy dataset. 

In [8]:
X = [
    "URGENT!! As a valued network customer you have been selected to receive a £900 prize reward!",
    "Lol you are always so convincing.",
    "Block 2 has interesting courses.",
    "URGENT! You have won a 1 week FREE membership in our £100000 prize Jackpot!",
    "Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free!",
    "Block 2 has been interesting so far.",
]
y = ["spam", "non spam", "non spam", "spam", "spam", "non spam"]

In [9]:
pipe_nb_toy = make_pipeline(CountVectorizer(max_features = 4, stop_words='english'), MultinomialNB())
pipe_nb_toy.fit(X, y);

In [10]:
data = pipe_nb_toy['countvectorizer'].transform(X)
# get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names()
train_bow_df = pd.DataFrame(data.toarray(), columns=pipe_nb_toy['countvectorizer'].get_feature_names_out(), index=X)
train_bow_df['target'] = y

In [11]:
train_bow_df

| | block | free | prize | urgent | target |
|---|---|---|---|---|---|
| URGENT!! As a valued network customer you have been selected to receive a £900 prize reward! | 0 | 0 | 1 | 1 | spam |
| Lol you are always so convincing. | 0 | 0 | 0 | 0 | non spam |
| Block 2 has interesting courses. | 1 | 0 | 0 | 0 | non spam |
| URGENT! You have won a 1 week FREE membership in our £100000 prize Jackpot! | 0 | 1 | 1 | 1 | spam |
| Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! | 0 | 1 | 0 | 0 | spam |
| Block 2 has been interesting so far. | 1 | 0 | 0 | 0 | non spam |


Suppose we are given the text messages in `deploy_test` and want to predict their targets. How do we do it using naive Bayes?

First, let's get a numeric representation of our text messages. 

In [12]:
deploy_test = ["URGENT! Free!!", "I like Week 5 block better."]
data = pipe_nb_toy['countvectorizer'].transform(deploy_test).toarray()
# get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names()
bow_df = pd.DataFrame(data, columns=pipe_nb_toy['countvectorizer'].get_feature_names_out(), index=deploy_test)



In [13]:
bow_df

| | block | free | prize | urgent |
|---|---|---|---|---|
| URGENT! Free!! | 0 | 1 | 0 | 1 |
| I like Week 5 block better. | 1 | 0 | 0 | 0 |


### Naive Bayes prediction idea

Suppose we want to predict whether the following message is "spam" or "non spam".
> "URGENT! Free!!"

Representation of the message: `[0, 1, 0, 1]`

To predict the correct class, naive Bayes calculates the following probability scores. 

- $P(\text{spam} \mid \text{block} = 0, \text{free} = 1, \text{prize} = 0, \text{urgent} = 1)$ 
- $P(\text{non spam} \mid  \text{block} = 0, \text{free} = 1, \text{prize} = 0, \text{urgent} = 1)$
- It then picks the label with the higher probability score. 

### Applying Bayes' theorem 

Naive Bayes uses Bayes' theorem to calculate these probabilities:

$$P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B)}$$

$$P(\text{spam} \mid \text{message}) = \frac{P(\text{block} = 0, \text{free} = 1, \text{prize} = 0, \text{urgent} = 1 \mid \text{spam}) \times P(\text{spam})}{P(\text{block} = 0, \text{free} = 1, \text{prize} = 0, \text{urgent} = 1)}$$

$$P(\text{non spam} \mid \text{message}) = \frac{P(\text{block} = 0, \text{free} = 1, \text{prize} = 0, \text{urgent} = 1 \mid \text{non spam}) \times P(\text{non spam})}{P(\text{block} = 0, \text{free} = 1, \text{prize} = 0, \text{urgent} = 1)}$$

- $P(\text{message})$: marginal probability that a message has the given set of words 
    - Hard to calculate but can be ignored in our scenario as it occurs in the denominator for both $P(\text{spam} \mid \text{message})$ and $P(\text{non spam} \mid \text{message})$.
    - So we ignore the denominator in both cases. 


### Let's focus on $P(\text{spam} \mid \text{message})$

- After ignoring the denominator: 
$$P(\text{spam} \mid \text{message}) \propto P(\text{block} = 0, \text{free} = 1, \text{prize} = 0, \text{urgent} = 1 \mid \text{spam}) \times P(\text{spam})$$

- To calculate $P(\text{spam} \mid \text{message})$, we need:  
    - $P(\text{spam})$: marginal probability that a message is spam
    - $P(\text{message}\mid\text{spam})$: conditional probability that message has words $w_1, w_2, \dots, w_d$, given that it is spam.
        - Hard to calculate because it would require huge numbers of parameters and impossibly large training sets. But we need it. 
        - with $d$ binary features, how many possible "text messages" are there?
        - we cannot possibly have access to all the data

### Naive Bayes' approximation to calculate $P(\text{message}|\text{spam})$

- A common assumption is the **naive Bayes** assumption, which states that **features are independent, conditioned on the target**. 
    - Example: In our spam classification example, **once you know that a message is spam**, the probability that the word "urgent" appears is independent of whether "free" also appeared. 
    
- We can write this mathematically as 

$$\begin{equation}
\begin{split}
& P(\text{block} = 0, \text{free} = 1, \text{prize} = 0, \text{urgent} = 1 \mid \text{spam}) \\
&\approx P(\text{block} = 0 \mid \text{spam}) \times P(\text{free} = 1 \mid \text{spam}) \times P(\text{prize} = 0 \mid \text{spam}) \times P(\text{urgent} = 1 \mid \text{spam})
\end{split}
\end{equation}$$


### Naive Bayes' approximation

- In general, 
$$P(\text{message} \mid \text{spam}) = P(w_1, w_2, \dots, w_d \mid \text{spam}) \approx \prod_{i=1}^{d}P(w_i \mid \text{spam})$$

$$P(\text{message} \mid \text{non spam}) = P(w_1, w_2, \dots, w_d \mid \text{non spam}) \approx \prod_{i=1}^{d}P(w_i \mid \text{non spam})$$
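To see why this factorization matters, compare how many probabilities have to be estimated per class with and without it. The sketch below assumes $d$ binary (present/absent) features; the function names are hypothetical.

```python
def n_possible_messages(d):
    """Number of distinct messages that d binary features can describe."""
    return 2 ** d


def n_joint_parameters(d):
    """Free parameters per class for the full joint P(w_1, ..., w_d | class):
    one probability per possible message, minus one because they sum to 1."""
    return 2 ** d - 1


def n_naive_parameters(d):
    """With the naive Bayes factorization, one P(w_i = 1 | class) per feature."""
    return d


for d in [4, 10, 20]:
    print(d, n_joint_parameters(d), n_naive_parameters(d))
```

The joint model grows exponentially in $d$, while the naive Bayes model grows linearly, which is what makes estimation from a finite training set feasible.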


### Going back to estimating $P(\text{spam} \mid \text{message})$

With naive Bayes' assumption, to calculate $P(\text{spam} \mid \text{block} = 0, \text{free} = 1, \text{prize} = 0, \text{urgent} = 1)$, we need the following:  
1. Prior probability: $P(\text{spam})$ 
2. Conditional probabilities: 
    1. $P(\text{block} = 0 \mid \text{spam})$
    2. $P(\text{free} = 1 \mid \text{spam})$
    3. $P(\text{prize} = 0 \mid \text{spam})$
    4. $P(\text{urgent} = 1 \mid \text{spam})$

We use our training data to calculate these probabilities. 

In [14]:
train_bow_df

| | block | free | prize | urgent | target |
|---|---|---|---|---|---|
| URGENT!! As a valued network customer you have been selected to receive a £900 prize reward! | 0 | 0 | 1 | 1 | spam |
| Lol you are always so convincing. | 0 | 0 | 0 | 0 | non spam |
| Block 2 has interesting courses. | 1 | 0 | 0 | 0 | non spam |
| URGENT! You have won a 1 week FREE membership in our £100000 prize Jackpot! | 0 | 1 | 1 | 1 | spam |
| Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! | 0 | 1 | 0 | 0 | spam |
| Block 2 has been interesting so far. | 1 | 0 | 0 | 0 | non spam |


- Prior probability
    - $P(\text{spam}) = 3/6$
    
- Conditional probabilities
    - What is $P(\text{block} = 0 \mid \text{spam})$? 
        - Given target is spam, how often "block" = 0? $3/3$
    - $P(\text{free} = 1 \mid \text{spam}) = 2/3$ 
    - $P(\text{prize} = 0 \mid \text{spam}) = 1/3$
    - $P(\text{urgent} = 1 \mid \text{spam}) = 2/3$
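These fractions can be reproduced by counting rows of the toy table. The sketch below hard-codes the table so that it runs on its own; `prior` and `conditional` are hypothetical helper names, not part of `scikit-learn`.

```python
from fractions import Fraction

# Toy bag-of-words table: (block, free, prize, urgent) and the target.
features = ["block", "free", "prize", "urgent"]
rows = [
    ((0, 0, 1, 1), "spam"),
    ((0, 0, 0, 0), "non spam"),
    ((1, 0, 0, 0), "non spam"),
    ((0, 1, 1, 1), "spam"),
    ((0, 1, 0, 0), "spam"),
    ((1, 0, 0, 0), "non spam"),
]


def prior(label):
    """Fraction of training messages with the given label."""
    return Fraction(sum(1 for _, y in rows if y == label), len(rows))


def conditional(feature, value, label):
    """Fraction of messages with this label whose feature equals value."""
    j = features.index(feature)
    in_class = [x for x, y in rows if y == label]
    return Fraction(sum(1 for x in in_class if x[j] == value), len(in_class))


print(prior("spam"))                     # 1/2
print(conditional("block", 0, "spam"))   # 1
print(conditional("free", 1, "spam"))    # 2/3
print(conditional("prize", 0, "spam"))   # 1/3
print(conditional("urgent", 1, "spam"))  # 2/3
```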

### Estimating $P(\text{spam} \mid \text{message})$

$$\begin{equation}
\begin{split}
P(\text{spam} \mid \text{message}) &\propto P(\text{block} = 0, \text{free} = 1, \text{prize} = 0, \text{urgent} = 1 \mid \text{spam}) \times P(\text{spam})\\
&\propto P(\text{block} = 0 \mid \text{spam}) \times P(\text{free} = 1 \mid \text{spam}) \\
& \times P(\text{prize} = 0 \mid \text{spam}) \times P(\text{urgent} = 1 \mid \text{spam}) \times P(\text{spam})\\
&\propto 3/3 \times 2/3 \times 1/3 \times 2/3 \times 3/6\\
\end{split}
\end{equation}$$


In [None]:
spam_prior = 3/6
block0_spam = 3/3
free1_spam = 2/3
prize0_spam = 1/3
urgent1_spam = 2/3
spam_prior * block0_spam * free1_spam * prize0_spam * urgent1_spam

### Let's estimate $P(\text{non spam} \mid \text{message})$

With naive Bayes' assumption, to calculate $P(\text{non spam} \mid \text{block} = 0, \text{free} = 1, \text{prize} = 0, \text{urgent} = 1)$, we need the following:  
1. Prior probability: $P(\text{non spam})$ 
2. Conditional probabilities: 
    1. $P(\text{block} = 0 \mid \text{non spam})$
    2. $P(\text{free} = 1 \mid \text{non spam})$
    3. $P(\text{prize} = 0 \mid \text{non spam})$
    4. $P(\text{urgent} = 1 \mid \text{non spam})$

Again we use the data to calculate these probabilities. 

In [None]:
train_bow_df

- Prior probability 
    - $P(\text{non spam}) = 3/6$

- Conditional probabilities 
    - What is $P(\text{block} = 0 \mid \text{non spam})$? 
        - Given target is non spam, how often "block" = 0? $1/3$
    - $P(\text{free} = 1 \mid \text{non spam}) = 0/3$ 
    - $P(\text{prize} = 0 \mid \text{non spam}) = 3/3$
    - $P(\text{urgent} = 1 \mid \text{non spam}) = 0/3$

### Estimating $P(\text{non spam} \mid \text{message})$

$$\begin{equation}
\begin{split}
P(\text{non spam} \mid \text{message}) &\propto P(\text{block} = 0, \text{free} = 1, \text{prize} = 0, \text{urgent} = 1 \mid \text{non spam}) \times P(\text{non spam})\\
&\propto P(\text{block} = 0 \mid \text{non spam}) \times P(\text{free} = 1 \mid \text{non spam}) \\
& \times P(\text{prize} = 0 \mid \text{non spam}) \times P(\text{urgent} = 1 \mid \text{non spam}) \times P(\text{non spam})\\
&\propto 1/3 \times 0/3 \times 3/3 \times 0/3 \times 3/6\\
\end{split}
\end{equation}$$


In [None]:
# Values read off the non spam rows of train_bow_df
non_spam_prior = 3/6
block0_non_spam = 1/3
free1_non_spam = 0/3
prize0_non_spam = 3/3
urgent1_non_spam = 0/3
non_spam_prior * block0_non_spam * free1_non_spam * prize0_non_spam * urgent1_non_spam

### Naive Bayes prediction

Since $P(\text{spam} \mid \text{message})$ is proportional to a larger number (about 0.074) than $P(\text{non spam} \mid \text{message})$ (0), we predict spam! 
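The comparison can be sketched in a few lines. The code below uses the unsmoothed fractions from the hand calculation and also shows that dividing by the sum of the scores (the model's estimate of the ignored denominator) rescales them without changing which label wins.

```python
from fractions import Fraction

# Unnormalized scores: per-feature conditionals multiplied by the class prior.
score_spam = (
    Fraction(3, 3) * Fraction(2, 3) * Fraction(1, 3) * Fraction(2, 3) * Fraction(3, 6)
)
score_non_spam = (
    Fraction(1, 3) * Fraction(0, 3) * Fraction(3, 3) * Fraction(0, 3) * Fraction(3, 6)
)
scores = {"spam": score_spam, "non spam": score_non_spam}

# The prediction is the label with the largest score.
prediction = max(scores, key=scores.get)

# Normalizing makes the scores sum to 1 but keeps their order.
total = sum(scores.values())
posteriors = {label: score / total for label, score in scores.items()}

print(prediction)
print(round(float(score_spam), 3))
print(posteriors)
```

Without smoothing, the zero conditionals force the normalized non spam probability to exactly 0.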

## 2. `predict_proba`

### What is our toy pipeline's prediction? 

In [None]:
deploy_test = ["URGENT! Free!!"]
pipe_nb_toy.predict(deploy_test)

### Naive Bayes classifier `predict_proba`
- So far we have been looking at binary predictions, but more granular information is often useful. 
- Naive Bayes classifier gives you probability estimates for each class and we can get this information using `predict_proba` method of the classifier.  

In [None]:
pipe_nb_toy.predict_proba(deploy_test)

In [None]:
pipe_nb_toy.classes_

Above: The classifier is "76% confident" that the class is spam! 

### Predicting probabilities

- We have a new and useful method, `predict_proba`.
- `predict` returns the class with the highest probability.
- `predict_proba` gives us the actual probability scores. 
- Looking at the probabilities can help us understand the model.
- We can find the spam messages where our classifier is most confident and least confident. 

### Naive Bayes classifier `fit`

- Calculate prior probabilities and conditional probabilities for each feature given each class. 

Note that when we estimated probabilities in our toy example (e.g., $P(\text{word} \mid \text{spam})$), we happened to have each feature value as either 0 or 1, i.e., just the existence of a word in the document's bag of words. We computed each conditional probability as the fraction of spam messages in which the feature took the given value.

If we want to work with frequencies instead of existence, we first concatenate all documents with that class (e.g., spam class) into one big "class c" text. Then we use the frequency of the word (e.g., _urgent_ below) in this concatenated document to give a (maximum likelihood) estimate of the probability:

$$P(\text{urgent} \mid \text{spam}) = \frac{Count(\text{urgent}, \text{spam})}{\sum_{w \in vocabulary} Count(w, \text{spam})}$$ 

$$P(\text{urgent} \mid \text{spam}) = \frac{\text{how often } \textit{urgent} \text{ occurs with spam}}{\text{total number of tokens (all occurrences of all words) in spam}}$$
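As a rough illustration of this count-based estimate, the sketch below concatenates the tokens that the toy vectorizer keeps (vocabulary: block, free, prize, urgent) from the three spam messages and divides the count of "urgent" by the total number of tokens. The token list is transcribed by hand from the toy table.

```python
from collections import Counter

# Vocabulary tokens from the three toy spam messages, concatenated into one list.
spam_tokens = ["prize", "urgent", "free", "prize", "urgent", "free"]

counts = Counter(spam_tokens)
total_tokens = sum(counts.values())

# Maximum likelihood estimate: count of the word over all tokens in the class.
p_urgent_given_spam = counts["urgent"] / total_tokens
print(p_urgent_given_spam)

# A word never seen with the class gets probability 0 under this estimate.
p_block_given_spam = counts["block"] / total_tokens
print(p_block_given_spam)
```

The zero estimate for "block" is the problem that smoothing is meant to address.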


- Recall that when we worked through a toy example by hand, we estimated
    - $P(\text{non spam} \mid \text{message}) \propto 0$
    - $P(\text{spam} \mid \text{message}) \propto 0.074$
- Why don't `predict_proba` scores match with the probability scores we calculated before? 
- The scores we computed are not normalized. Remember that we ignored the denominator.
- These ones are normalized so that they sum to 1.
- The model is using something called "smoothing" to avoid the problem of zero probabilities. 

In [None]:
pipe_nb_toy.predict_proba(deploy_test)

## 3. Laplace smoothing



In [None]:
train_bow_df

- Remember when we calculated $P(\text{non spam} \mid \text{message})$, some of our conditional probabilities were zero. 
    - $P(\text{free} = 1 \mid \text{non spam}) = 0/3$ 
    - $P(\text{urgent} = 1 \mid \text{non spam}) = 0/3$

- Naive Bayes naively multiplies all the feature likelihoods together, and if any of the terms is zero, it's going to void all other evidence and the probability of the class is going to be zero. 
- Sounds worrisome! 
- We have limited data and if we do not see a feature occurring with a class, it doesn't mean it would never occur with that class. 

### The simplest solution: Laplace smoothing

- The simplest way to avoid zero probabilities is to add one to all the counts.
- All the counts that used to be zero will now have a count of 1, the counts of 1 will be 2, and so on. 
- In `scikit-learn` we control it using hyperparameter `alpha` (by default `alpha=1.0`). 


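Conceptually, our previous bag-of-words representation becomes the following. (This is only an illustration: the smoothing constant is added to the per-class word counts, not to each message's row.)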

In [None]:
data = pipe_nb_toy['countvectorizer'].transform(X)
# get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names()
train_bow_df = pd.DataFrame(data.toarray() + 1, columns=pipe_nb_toy['countvectorizer'].get_feature_names_out(), index=X)
train_bow_df['target'] = y
train_bow_df

### Adjusting the counts 

Note that the calculations change with the updated counts: 

$$P(\text{word} \mid \text{spam}) = \frac{Count(\text{word}, \text{spam}) + 1}{\sum_{w \in vocabulary} Count(w, \text{spam}) + |vocabulary|}$$
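Applying this smoothed formula by hand to the toy data gives a normalized spam probability of about 0.76 for "URGENT! Free!!", which agrees with the "76% confident" figure quoted earlier. The sketch below is a plain-Python illustration under the multinomial model with `alpha = 1`; it is not how `scikit-learn` is implemented internally, and the per-class counts are transcribed by hand from the toy table.

```python
vocabulary = ["block", "free", "prize", "urgent"]

# Word counts summed over all training messages of each class (toy dataset).
class_counts = {
    "spam": {"block": 0, "free": 2, "prize": 2, "urgent": 2},
    "non spam": {"block": 2, "free": 0, "prize": 0, "urgent": 0},
}
priors = {"spam": 3 / 6, "non spam": 3 / 6}
alpha = 1.0  # add-one (Laplace) smoothing


def smoothed_prob(word, label):
    """(count + alpha) / (total count in class + alpha * vocabulary size)."""
    counts = class_counts[label]
    return (counts[word] + alpha) / (sum(counts.values()) + alpha * len(vocabulary))


# "URGENT! Free!!" contains "urgent" once and "free" once.
message_words = ["urgent", "free"]

scores = {}
for label in class_counts:
    score = priors[label]
    for word in message_words:
        score *= smoothed_prob(word, label)
    scores[label] = score

# Normalize so that the two class probabilities sum to 1.
total = sum(scores.values())
posteriors = {label: score / total for label, score in scores.items()}
print({label: round(p, 3) for label, p in posteriors.items()})
```

With smoothing, the non spam class no longer receives a probability of exactly zero.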

### `alpha` hyperparameter and the fundamental tradeoff 

- High alpha $\rightarrow$ underfitting
    - means we are adding large counts to everything and so we are diluting the data
- Low alpha $\rightarrow$ overfitting
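The effect of `alpha` can be seen directly on the smoothed word probabilities. The sketch below uses the non spam word counts from the toy dataset and a hypothetical helper `smoothed_distribution`; the three `alpha` values are arbitrary choices for illustration.

```python
# Raw word counts for the non spam class in the toy dataset.
counts = {"block": 2, "free": 0, "prize": 0, "urgent": 0}


def smoothed_distribution(word_counts, alpha):
    """Apply additive smoothing with strength alpha and return P(word | class)."""
    denominator = sum(word_counts.values()) + alpha * len(word_counts)
    return {word: (count + alpha) / denominator for word, count in word_counts.items()}


for alpha in [0.01, 1.0, 100.0]:
    probs = smoothed_distribution(counts, alpha)
    print(alpha, {word: round(p, 3) for word, p in probs.items()})
```

With a very small `alpha` the estimates follow the training counts almost exactly, while a very large `alpha` pushes every word towards the same probability (here 1/4), so the data barely matters.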

### Gaussian Naive Bayes

- Other datasets have continuous-valued features.
- But so far, we've only seen how to use Naive Bayes for discrete features.
- We can either discretize our continuous features into discrete bins (with counts), or...
- Use _Gaussian_ naive Bayes (read more [here](https://machinelearningmastery.com/naive-bayes-for-machine-learning/) and [here](https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Gaussian_naive_Bayes))
- Now:
    - Assume each feature is normally distributed 
    - Calculate the mean ($\mu_k$) and standard deviation ($\sigma_k$) for each feature for each class
    - Use the following equation to calculate the conditional probability of observing feature value $v$ in class $C_k$

<img src='./images/gaus_nb.png' width="400">


Source: [Wikipedia](https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Gaussian_naive_Bayes)
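The steps above can be sketched with the standard library. The feature values below are made up for illustration, and using the sample standard deviation is an assumption of this sketch (implementations differ in how they estimate the variance).

```python
import math
import statistics

# Hypothetical values of one continuous feature for the examples of class C_k.
feature_values_class_k = [4.9, 5.1, 5.0, 5.4, 4.6]

mu_k = statistics.mean(feature_values_class_k)
sigma_k = statistics.stdev(feature_values_class_k)  # sample standard deviation


def gaussian_likelihood(v, mu, sigma):
    """Normal density at v with mean mu and standard deviation sigma."""
    coefficient = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    exponent = -((v - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)


# A value near the class mean is far more likely than one far away from it.
print(gaussian_likelihood(5.0, mu_k, sigma_k))
print(gaussian_likelihood(7.0, mu_k, sigma_k))
```

These per-feature densities take the place of the count-based conditional probabilities and are multiplied together with the class prior in the same way.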

- Gaussian naive Bayes assumes normality
    - Are our features normal?
    - Not really but in practice we transform our data to try and make it more normal
    - Scikit-learn provides the `PowerTransformer()` for this process
    - From the [docs](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html#sklearn.preprocessing.PowerTransformer): "*...Power transforms are a family of parametric, monotonic transformations that are applied to make data more Gaussian-like.*"    

### General comments on naive Bayes

- Surprising accuracy 
- A fast and robust way to learn the corresponding parameters
- Scales great; learning a naive Bayes classifier is just a matter of counting how many times each attribute co-occurs with each class
- Can be easily used for multi-class classification. 
- It's closely related to linear classifiers we'll see in the next lecture. 
    - When we take the logarithms, the products turn into summations. 
- Can provide an informative set of features from which to predict the class (next class)
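The remark about logarithms can be illustrated with the spam-class numbers from the toy hand calculation. The sketch below shows that the sum of log-probabilities gives the same score as the product, and that a long product of small probabilities underflows to zero while the log-sum stays finite. The list `many` is an artificial example.

```python
import math

# Prior and per-feature conditionals for the spam class (toy hand calculation).
prior = 3 / 6
conditionals = [3 / 3, 2 / 3, 1 / 3, 2 / 3]

# Product form of the naive Bayes score.
product_score = prior
for p in conditionals:
    product_score *= p

# Log form: the same score written as a sum of log-probabilities.
log_score = math.log(prior) + sum(math.log(p) for p in conditionals)
print(product_score, math.exp(log_score))

# With many features the product underflows, but the log-sum does not.
many = [1e-5] * 100
tiny_product = 1.0
for p in many:
    tiny_product *= p
tiny_log_sum = sum(math.log(p) for p in many)
print(tiny_product, tiny_log_sum)
```

Because the log score is a weighted sum of feature terms plus a constant, the decision rule has the same form as a linear classifier.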

### General comments on naive Bayes

- Assumes that spammers generate e-mails by picking words at random. It means that sentences have no syntax and content. Is that a fair assumption? 
    - oversimplification 
    - sometimes the best theories are the most oversimplified, provided their predictions are accurate, because they explain the most with the least. 

- Although naive Bayes is known as a decent classifier, it is known to be a **bad estimator**, so the probability outputs from `predict_proba` are not to be taken too seriously.
