# Summary

## Pros

- Can be used for multi-class problems
- Handles both discrete and continuous data
- Model size does not grow with the number of training examples
- Online learning is possible

## Cons

- Poor estimator of class probabilities (they are often badly calibrated)
- Naïve assumption rarely holds
- Can be simplistic
- Can be unstable with small amounts of data

## Can any distribution be used?

In theory, yes. Any distribution can be plugged into the Naïve Bayes algorithm, and each feature could even follow its own custom distribution. However, this flexibility is rarely used in practice, as Naïve Bayes usually serves as a baseline rather than a fine-tuned model.
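
As a rough illustration of mixing distributions, here is a minimal sketch in which the first feature is modelled as Gaussian and the second as Bernoulli. The class names (`Gaussian`, `Bernoulli`, `MixedNaiveBayes`), the toy data and the Laplace smoothing are assumptions made for this example; they are not taken from any particular library.

```python
import math
from collections import defaultdict


class Gaussian:
    """Continuous feature modelled by a normal distribution."""

    def fit(self, values):
        n = len(values)
        self.mean = sum(values) / n
        # A small floor on the variance avoids division by zero.
        self.var = max(sum((v - self.mean) ** 2 for v in values) / n, 1e-9)
        return self

    def log_prob(self, x):
        return -0.5 * (math.log(2 * math.pi * self.var)
                       + (x - self.mean) ** 2 / self.var)


class Bernoulli:
    """Binary feature, estimated with Laplace smoothing."""

    def fit(self, values):
        self.p = (sum(values) + 1) / (len(values) + 2)
        return self

    def log_prob(self, x):
        return math.log(self.p if x else 1 - self.p)


class MixedNaiveBayes:
    """Naive Bayes where feature j uses feature_distributions[j]."""

    def __init__(self, feature_distributions):
        self.feature_distributions = feature_distributions

    def fit(self, X, y):
        rows_by_class = defaultdict(list)
        for row, label in zip(X, y):
            rows_by_class[label].append(row)
        self.log_priors = {c: math.log(len(rows) / len(X))
                           for c, rows in rows_by_class.items()}
        # One fitted distribution per (class, feature) pair.
        self.models = {
            c: [dist().fit([row[j] for row in rows])
                for j, dist in enumerate(self.feature_distributions)]
            for c, rows in rows_by_class.items()
        }
        return self

    def predict(self, row):
        def score(c):
            return self.log_priors[c] + sum(
                m.log_prob(x) for m, x in zip(self.models[c], row))
        return max(self.models, key=score)


# Toy data: feature 0 is continuous, feature 1 is binary.
X = [[1.0, 1], [1.2, 1], [0.9, 0], [3.0, 0], [3.2, 0], [2.9, 1]]
y = ["a", "a", "a", "b", "b", "b"]

model = MixedNaiveBayes([Gaussian, Bernoulli]).fit(X, y)
prediction_low = model.predict([1.1, 1])
prediction_high = model.predict([3.1, 0])
```

Any object exposing `fit` and `log_prob` could be swapped in for a feature, which is all the algorithm needs from a distribution.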

## Why is multinomial different from Bernoulli with binary features?

A common misconception is that Bernoulli and Multinomial Naïve Bayes behave identically when the features are binary. This is not true! The main difference is that the Bernoulli distribution explicitly models the absence of features: a feature that is absent contributes a factor of (1 − p) to the likelihood. The Multinomial distribution is only affected by the features that are present.

> The answer to [this post](https://datascience.stackexchange.com/questions/27624/difference-between-bernoulli-and-multinomial-naive-bayes) explains this phenomenon quite nicely.
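
The difference can be seen on a tiny, made-up dataset. The sketch below fits both likelihoods (with Laplace smoothing, an assumption of this example) on the same binary training data and scores a document that contains only the first feature. The classes have equal priors, so comparing likelihoods is enough; the multinomial coefficient is identical for both classes and is omitted.

```python
import math

# Hypothetical binary training data: rows are documents, columns are features.
train = {
    "A": [[1, 1, 1], [1, 1, 1], [1, 1, 1], [1, 1, 1]],
    "B": [[0, 1, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0]],
}
doc = [1, 0, 0]  # only feature 0 is present


def bernoulli_log_likelihood(doc, rows):
    """Every feature contributes: p if present, (1 - p) if absent."""
    n = len(rows)
    total = 0.0
    for j, x in enumerate(doc):
        p = (sum(r[j] for r in rows) + 1) / (n + 2)
        total += math.log(p) if x else math.log(1 - p)
    return total


def multinomial_log_likelihood(doc, rows):
    """Only present features contribute (x = 0 zeroes out the term)."""
    counts = [sum(r[j] for r in rows) for j in range(len(doc))]
    total_count = sum(counts)
    return sum(
        x * math.log((counts[j] + 1) / (total_count + len(doc)))
        for j, x in enumerate(doc)
    )


bernoulli_choice = max(
    train, key=lambda c: bernoulli_log_likelihood(doc, train[c]))
multinomial_choice = max(
    train, key=lambda c: multinomial_log_likelihood(doc, train[c]))
```

On this data the two models disagree: the Bernoulli model picks class `B`, because the absence of features 1 and 2 is very unlikely under class `A`, while the Multinomial model ignores those absences and picks class `A`.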

## How to deal with large amounts of data?

Because the parameters of most probability distributions (counts, means, variances) can be updated online, Naïve Bayes classifiers are well suited to large amounts of data. This is also why they are often used as baselines for large problems: they provide a quick estimate of how “difficult” a problem is, which can inform design choices later on.
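
To make the online update concrete, here is a minimal sketch of a Gaussian Naïve Bayes that processes one example at a time, using Welford's running mean and variance update. The class `OnlineGaussianNB`, its method names and the toy stream are assumptions made for illustration.

```python
import math
from collections import defaultdict


class OnlineGaussianNB:
    """Gaussian Naive Bayes updated one example at a time."""

    def __init__(self, n_features):
        self.count = defaultdict(int)
        self.mean = defaultdict(lambda: [0.0] * n_features)
        # Running sum of squared deviations from the mean (Welford).
        self.m2 = defaultdict(lambda: [0.0] * n_features)

    def learn_one(self, x, label):
        self.count[label] += 1
        n = self.count[label]
        mean, m2 = self.mean[label], self.m2[label]
        for j, value in enumerate(x):
            delta = value - mean[j]
            mean[j] += delta / n
            m2[j] += delta * (value - mean[j])

    def predict_one(self, x):
        total = sum(self.count.values())

        def score(label):
            n = self.count[label]
            s = math.log(n / total)  # log prior
            for j, value in enumerate(x):
                # Floor the variance to avoid division by zero.
                var = max(self.m2[label][j] / n, 1e-9)
                s += -0.5 * (math.log(2 * math.pi * var)
                             + (value - self.mean[label][j]) ** 2 / var)
            return s

        return max(self.count, key=score)


# The data is consumed as a stream; nothing is kept in memory
# except the per-class counts, means and squared deviations.
stream = [
    ([1.0, 2.0], "a"), ([1.2, 1.8], "a"), ([3.0, 0.5], "b"),
    ([3.1, 0.4], "b"), ([0.9, 2.1], "a"), ([2.9, 0.6], "b"),
]
model = OnlineGaussianNB(n_features=2)
for x, label in stream:
    model.learn_one(x, label)

prediction = model.predict_one([1.1, 1.9])
```

The state kept per class has a fixed size, so memory use stays the same no matter how many examples go through `learn_one`.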

## What happens when the Naïve assumption is violated?

The Naïve assumption is violated very often, but that does not always hurt the classifier. The assumption breaks when features depend on each other. Naïve Bayes still treats each feature as an independent piece of evidence, so information shared by dependent features is counted more than once. If those features happen to be good predictors, the extra weight can even improve the classification; if they are misleading, it makes things worse. It is as if, during a vote, certain people got to vote more often than others: if they vote for the right option, all is well, but if they vote for the wrong one, the outcome suffers.
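
The extreme case of dependence is a feature that is copied several times. The sketch below uses made-up probabilities (0.8 and 0.4, with equal priors) and a hypothetical helper `posterior_spam` to show how Naïve Bayes becomes more confident with every copy, even though the copies carry no new information.

```python
def posterior_spam(n_copies, p_given_spam=0.8, p_given_ham=0.4,
                   prior_spam=0.5):
    """Posterior of 'spam' when one observed binary feature is
    duplicated n_copies times and treated as independent evidence."""
    spam = prior_spam * p_given_spam ** n_copies
    ham = (1 - prior_spam) * p_given_ham ** n_copies
    return spam / (spam + ham)


single = posterior_spam(1)   # the feature counted once: 2/3
tripled = posterior_spam(3)  # the same feature counted three times: 8/9
```

The correct posterior is 2/3 in both cases. With three copies the predicted class is unchanged, but the estimated probability is inflated to 8/9, which is one reason the class probabilities produced by Naïve Bayes are often unreliable.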

## Tutorials

- [Machine Learning Mastery](https://machinelearningmastery.com/naive-bayes-tutorial-for-machine-learning/)
- [DatumBox](http://blog.datumbox.com/machine-learning-tutorial-the-naive-bayes-text-classifier/)
- [Analytics Vidhya](https://www.analyticsvidhya.com/blog/2017/09/naive-bayes-explained/)
- [Python Machine Learning](https://pythonmachinelearning.pro/text-classification-tutorial-with-naive-bayes/)

## Implementations

- [sklearn](http://scikit-learn.org/stable/modules/naive_bayes.html)
- [ML From scratch](https://github.com/eriklindernoren/ML-From-Scratch/blob/master/mlfromscratch/supervised_learning/naive_bayes.py)

## Videos

- [Victor Lavrenko](https://www.youtube.com/watch?v=os-NaA0ldGs&list=PLBv09BD7ez_6CxkuiFTbL3jsn2Qd1IU7B)
- [Andrew Ng](https://www.youtube.com/watch?v=z5UQyCESW64)
- [Udacity](https://www.youtube.com/watch?v=M59h7CFUwPU&t=80s)
- [Luis Serrano](https://www.youtube.com/watch?v=kqSzLo9fenk)
- [Edureka](https://www.youtube.com/watch?v=vz_xuxYS2PM)

## Other

- [Naive Bayes Classifier (wiki)](https://en.wikipedia.org/wiki/Naive_Bayes_classifier)
- [Bayes' Theorem (wiki)](https://en.wikipedia.org/wiki/Bayes%27_theorem)