# Naive Bayes

Two artists: 

* A - Abba
* B - Beatles

We give the model a song and ask it to predict which artist wrote the song. What are we looking for in terms of probabilities?

$p(A | lyrics) = \frac{p(lyrics | A) \cdot p(A)}{p(lyrics)}$

where $lyrics$ is the text of the song and $A$ is the event that the song is by Abba.

## 1. Prior

**What if we know nothing about the song?**

Then all we have are the priors $p(A)$ and $p(B)$, which we estimate from the class frequencies in the training data.
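A minimal sketch of estimating the priors from class frequencies. The label list is an assumption for illustration: 100 songs per artist, the same counts used in the worked example further down.

```python
from collections import Counter

# Assumed training labels: one entry per song in the training set.
labels = ["abba"] * 100 + ["beatles"] * 100

counts = Counter(labels)
total = sum(counts.values())

# Prior = share of training songs that belong to each artist.
priors = {artist: n / total for artist, n in counts.items()}
print(priors)
```

With balanced classes both priors are 0.5, so the prior alone cannot separate the two artists.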

## 2. With information we get a conditional probability

We want to know the **probability that a song containing the word "love" is by Abba**:

$p(A|love)$

This is different from **the probability that an Abba song contains the word "love"**:

$p(love|A)$

In [4]:
abba_songs = 100
abba_songs_with_love = 50
beatles_songs = 100
beatles_songs_with_love = 100

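# p(love|A): share of Abba songs that contain "love"
p_love_A = abba_songs_with_love / abba_songs
# p(A|love): share of songs containing "love" that are by Abba
p_A_love = abba_songs_with_love / (abba_songs_with_love + beatles_songs_with_love)
# p(A): prior, share of all songs that are by Abba
p_A = abba_songs / (abba_songs + beatles_songs)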

In [2]:
p_love_A

0.5

In [3]:
p_A_love

0.3333333333333333

## 3. Bayes' Theorem

Bayes' theorem is a statistical tool for converting one conditional probability into its inverse, here $p(love|A)$ into $p(A|love)$:

$p(A|love) = \frac{p(love|A) \cdot p(A)}{p(love)}$

$p(B|love) = \frac{p(love|B) \cdot p(B)}{p(love)}$

We assume that $p(love|A)$ and $p(A)$ are known.

The *marginal probability* $p(love)$ is usually ignored: it is the same for both artists, so it does not change which of the two posteriors is larger.
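As a quick check, Bayes' theorem can be evaluated with the song counts from the previous section. The sketch below computes the marginal explicitly with the law of total probability, then shows that dropping it leaves the ranking of the two artists unchanged.

```python
# Counts from the example above.
abba_songs = 100
abba_songs_with_love = 50
beatles_songs = 100
beatles_songs_with_love = 100

total_songs = abba_songs + beatles_songs
p_A = abba_songs / total_songs
p_B = beatles_songs / total_songs
p_love_A = abba_songs_with_love / abba_songs
p_love_B = beatles_songs_with_love / beatles_songs

# Marginal probability of "love": p(love|A) p(A) + p(love|B) p(B)
p_love = p_love_A * p_A + p_love_B * p_B

# Posteriors via Bayes' theorem.
p_A_love = p_love_A * p_A / p_love
p_B_love = p_love_B * p_B / p_love
print(p_A_love, p_B_love)

# Unnormalized scores (marginal dropped) rank the artists the same way.
score_A = p_love_A * p_A
score_B = p_love_B * p_B
print(score_A < score_B, p_A_love < p_B_love)
```

The posterior for Abba comes out at one third, which matches the value obtained by counting directly (50 of the 150 songs containing "love").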

To estimate $p(love|A)$ we can use TF-IDF weights instead of raw counts.

## 4. What if we have multiple words?

$p(A|lyrics) = \frac{p(lyrics|A) \cdot p(A)}{p(lyrics)}$

What is $p(lyrics|A)$ ?

In [5]:
words = ['love', 'you', 'yeah']
data = [[0.5, 0.25, 0],
        [1.0, 0.33, 0.5]]
import pandas as pd
feature_matrix = pd.DataFrame(data, columns=words, index=['abba', 'beatles'])
feature_matrix

|         | love | you  | yeah |
|---------|------|------|------|
| abba    | 0.5  | 0.25 | 0.0  |
| beatles | 1.0  | 0.33 | 0.5  |


Naive Bayes assumption:

$p(lyrics|A) = p(w_1|A) \cdot p(w_2|A) \cdot p(w_3|A)$

We assume that songs are written by stringing individual words together at random. In other words, we assume that all words are independent events given the artist.

In [6]:
song = "love you"

p_song_abba = 0.5 * 0.25  # p(love|A) * p(you|A)
p_song_abba

0.125

In [7]:
song = "love yeah"

p_song_beatles = 1 * 0.5  # p(love|B) * p(yeah|B)
p_song_beatles

0.5

In [8]:
# same song ("love yeah") for Abba: p(love|A) * p(yeah|A), and p(yeah|A) is 0
p_song_abba = 0.5 * 0
p_song_abba

0.0
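The hand calculations above can be generalised to any song made of words from the feature matrix. This is a minimal sketch: `word_probs` copies the values of the feature matrix into a plain dictionary, and `likelihood` is a hypothetical helper name, not part of any library.

```python
from math import prod

# Per-word probabilities copied from the feature matrix above.
word_probs = {
    "abba": {"love": 0.5, "you": 0.25, "yeah": 0.0},
    "beatles": {"love": 1.0, "you": 0.33, "yeah": 0.5},
}

def likelihood(song, artist):
    """p(lyrics | artist) under the Naive Bayes independence assumption."""
    # Multiply the per-word probabilities of every word in the song.
    return prod(word_probs[artist][word] for word in song.split())

print(likelihood("love you", "abba"))      # 0.125
print(likelihood("love yeah", "beatles"))  # 0.5
print(likelihood("love yeah", "abba"))     # 0.0
```

A word that is missing from the dictionary raises a `KeyError` in this sketch; a real model needs a rule for unseen words.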

Now we have the conditional probability of the lyrics given each artist. Let's multiply it by the prior probability of that artist.

In [11]:
p_song_abba * p_A

0.0

In [12]:
p_song_beatles * (1 - p_A)  # p(B) = 1 - p(A), since there are only two artists

0.25

The model predicts that the song is by the Beatles!
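Putting the pieces together, the whole prediction can be sketched as a small function that scores each artist with prior times likelihood and picks the largest score. The names `predict`, `word_probs` and `priors` are illustrative assumptions; the numbers come from the feature matrix and the balanced song counts used above.

```python
from math import prod

# Per-word probabilities from the feature matrix and priors from the song counts.
word_probs = {
    "abba": {"love": 0.5, "you": 0.25, "yeah": 0.0},
    "beatles": {"love": 1.0, "you": 0.33, "yeah": 0.5},
}
priors = {"abba": 0.5, "beatles": 0.5}

def predict(song):
    """Return the most probable artist and the normalized posteriors."""
    words = song.split()
    # Unnormalized score: p(artist) * product of p(word | artist).
    scores = {
        artist: priors[artist] * prod(word_probs[artist][w] for w in words)
        for artist in priors
    }
    total = sum(scores.values())
    # Dividing by the sum of the scores restores the ignored marginal p(lyrics).
    posteriors = {a: s / total for a, s in scores.items()} if total > 0 else scores
    return max(scores, key=scores.get), posteriors

artist, posteriors = predict("love yeah")
print(artist, posteriors)
```

For "love yeah" the Abba score is exactly zero, so the normalized posterior puts all the probability on the Beatles.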

## What to do about zero probabilities?

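A single word with probability 0 forces the whole product to 0, no matter what the other words say. To avoid this we use a **smoothing term** $k$:

* we assume that every word occurs at least $k$ times for every artist
* so that every word probability is always > 0
* it is as if each artist had appended $k$ copies of every word in the vocabulary to their songs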

If we increase the smoothing term, we *dilute* the information from the songs: the word probabilities move closer to a uniform distribution.

**The smoothing term is a regularization hyperparameter!**
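The effect of the smoothing term can be sketched with additive (Laplace) smoothing, one common form in which $k$ is added to every word count before normalizing. The word counts below are made up for illustration, and `smoothed_probs` is a hypothetical helper.

```python
def smoothed_probs(counts, k=1.0):
    """Additive smoothing: (count + k) / (total + k * vocabulary size)."""
    vocab_size = len(counts)
    total = sum(counts.values())
    return {w: (c + k) / (total + k * vocab_size) for w, c in counts.items()}

# Made-up word counts for one artist; "yeah" never occurs.
abba_counts = {"love": 50, "you": 25, "yeah": 0}

for k in (0, 1, 100):
    print(k, smoothed_probs(abba_counts, k))
```

With `k=0` the word "yeah" keeps probability 0. With `k=1` it becomes small but positive, and with `k=100` the three probabilities are much closer together, which is the dilution described above.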

## Import from sklearn

In [13]:
from sklearn.naive_bayes import MultinomialNB

In [14]:
nb = MultinomialNB(alpha=1)  # alpha is the smoothing term
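A minimal end-to-end sketch of how the classifier might be used. The four "songs" are a made-up toy corpus, not real lyrics, and `CountVectorizer` is one possible way to turn text into word counts.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up toy corpus for illustration only.
songs = [
    "love you love",
    "dancing queen love",
    "yeah yeah love you",
    "love yeah yeah yeah",
]
artists = ["abba", "abba", "beatles", "beatles"]

# Turn each song into a vector of word counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(songs)

# alpha=1 adds one pseudo-count per word, so unseen words do not zero out a class.
nb = MultinomialNB(alpha=1)
nb.fit(X, artists)

new_song = vectorizer.transform(["love yeah"])
prediction = nb.predict(new_song)
probabilities = nb.predict_proba(new_song)
print(prediction, nb.classes_, probabilities)
```

Because of the smoothing, Abba gets a small but non-zero probability for "love yeah" even though "yeah" never appears in the toy Abba songs. To work with TF-IDF weights instead of counts, `TfidfVectorizer` from the same module could be swapped in for `CountVectorizer`.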