# Naive Bayes by hand

In [1]:
import numpy as np
import pandas as pd

Imagine you have 4 apples with these attributes:

In [2]:
apples_docs = [
    "red round",
    "red round",
    "green sour round",
    "green round",
]

and 3 bananas with these attributes:

In [3]:
bananas_docs = [
    "yellow skinny",
    "yellow skinny",
    "green skinny"
]

Split each description into a list of words, giving a list of lists:

In [4]:
apples = [a.split() for a in apples_docs]
bananas = [b.split() for b in bananas_docs]

In [5]:
apples, bananas

([['red', 'round'],
  ['red', 'round'],
  ['green', 'sour', 'round'],
  ['green', 'round']],
 [['yellow', 'skinny'], ['yellow', 'skinny'], ['green', 'skinny']])

**Q.** What is the sorted set of all attributes (assign to vocabulary variable $V$)?

(Let's ignore the unknown word issue in our vectors and in our computations.)

In [16]:
flat_list = [item for sublist in apples+bananas for item in sublist]
V = sorted(list(set(flat_list)))
V

['green', 'red', 'round', 'skinny', 'sour', 'yellow']

<details>
<summary>Solution</summary>
['green', 'red', 'round', 'skinny', 'sour', 'yellow']
    
You can compute like this:
    
```
Va = set(np.concatenate(apples))
Vb = set(np.concatenate(bananas))
V = sorted(Va.union(Vb))
```
</details>

**Q**. What is the word vector for the "red round" apple?

Each column value is 1 if the word is mentioned, otherwise 0. Assume the sorted column order.

In [17]:
[1 if w in ('red', 'round') else 0 for w in V]

[0, 1, 1, 0, 0, 0]

<details>
<summary>Solution</summary>
    The row vector is <tt>[0, 1, 1, 0, 0, 0]</tt> for "red round"
</details>

**Q**. What is the word vector for the "green sour round" apple?

Each column value is 1 if the word is mentioned, otherwise 0. Assume the sorted column order.

In [8]:
[1 if i in ['green', 'sour', 'round'] else 0 for i in V]

[1, 0, 1, 0, 1, 0]

<details>
<summary>Solution</summary>
    The row vector is <tt>[1, 0, 1, 0, 1, 0]</tt> for "green sour round"
</details>
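
The two answers above follow the same recipe, so it can be wrapped in a small function. This is a minimal sketch; `doc_to_vector` is a hypothetical helper name that does not appear elsewhere in this notebook, and it silently ignores words outside the vocabulary, in line with ignoring the unknown-word issue.

```python
# Sorted vocabulary from the exercise above.
V = ['green', 'red', 'round', 'skinny', 'sour', 'yellow']

def doc_to_vector(doc, vocab):
    """Return a binary vector: 1 if the vocab word occurs in doc, else 0.

    Hypothetical helper for illustration; words not in `vocab` are ignored.
    """
    words = set(doc.split())
    return [1 if w in words else 0 for w in vocab]

red_round = doc_to_vector("red round", V)
green_sour_round = doc_to_vector("green sour round", V)
print(red_round)
print(green_sour_round)
```

Both results should match the row vectors given in the solutions above.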

Let's now look at all of the fruit vectors together with the `fruit` target column (0 for apple, 1 for banana):

In [9]:
data = np.zeros((7,len(V)))
for i,row in enumerate(apples+bananas):
    for w in row:
        data[i,V.index(w)] = 1
df = pd.DataFrame(data,columns=V,dtype=int)
df['fruit'] = [0,0,0,0,1,1,1]
df

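|   | green | red | round | skinny | sour | yellow | fruit |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 3 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |
| 5 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |
| 6 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |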


**Q.** What is a good estimate of $P(apple)$=`P_apple` and $P(banana)$=`P_banana`?

(Define those variables)

In [10]:
P_apple, P_banana = len(apples)/df.shape[0], len(bananas)/df.shape[0]
P_apple, P_banana

(0.5714285714285714, 0.42857142857142855)

<details>
<summary>Solution</summary>
<pre>
P_apple = 4/7
P_banana = 3/7
</pre>
</details>
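
The same priors can also be read off the target column directly, which avoids hard-coding the class sizes. A small sketch, assuming the `fruit` encoding used above (0 = apple, 1 = banana); the `priors` name is introduced here only for illustration.

```python
import pandas as pd

# Target column as built above: 0 = apple, 1 = banana.
fruit = pd.Series([0, 0, 0, 0, 1, 1, 1], name="fruit")

# Fraction of records in each class, indexed by class label.
priors = fruit.value_counts(normalize=True)
P_apple, P_banana = priors.loc[0], priors.loc[1]
print(P_apple, P_banana)
```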

**Q.** What are good estimates of $P(red|apple)$ and $P(red|banana)$?

The natural estimate is the ratio of the number of red apples to the total number of apples. With binary vectors, it feels wrong to do what we did for document classification, where we'd count how many times, say, "red" appears in apple rows and divide by the total number of words in apple descriptions. Because attributes are never repeated within a description, the numerators of the two estimates agree; only the denominators differ (4 apples versus 9 words in the apple descriptions).

In [22]:
n_apples, n_bananas = (df['fruit']==0).sum(), (df['fruit']==1).sum()
p_redapple = df.query("red == 1 and fruit == 0").shape[0] / n_apples
p_redbanan = df.query("red == 1 and fruit == 1").shape[0] / n_bananas
p_redapple, p_redbanan

(0.5, 0.0)

<details>
<summary>Solution</summary>
$P(red|apple) = 2/4$ because 2 of the 4 apples are red, and $P(red|banana) = 0/3$ because none of the 3 bananas are red.
</details>
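
To make the two candidate estimates concrete, here is a sketch that computes both for "red" on the apple descriptions. The variable names `p_red_by_docs` and `p_red_by_words` are made up for this illustration.

```python
apples = [["red", "round"], ["red", "round"],
          ["green", "sour", "round"], ["green", "round"]]

# Occurrences of "red" across all apple descriptions (no description repeats a word).
red_count = sum(doc.count("red") for doc in apples)

# Estimate used in this notebook: fraction of apples described as red.
p_red_by_docs = red_count / len(apples)

# Document-classification style: fraction of all apple words that are "red".
total_words = sum(len(doc) for doc in apples)
p_red_by_words = red_count / total_words

print(p_red_by_docs, p_red_by_words)
```

The first value is 2/4 and the second is 2/9: the count of "red" is shared, and only the denominator changes.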

**Q.** What are good estimates of $P(green|apple)$ and $P(green|banana)$?

In [23]:
P_grea = df.query("green == 1 and fruit == 0").shape[0] / (df['fruit']==0).sum()
P_greb = df.query("green == 1 and fruit == 1").shape[0] / (df['fruit']==1).sum()
P_grea, P_greb

(0.5, 0.3333333333333333)

<details>
<summary>Solution</summary>
2/4 apples are green and 1/3 bananas are green.
</details>

## Laplace smoothing of $P(w|c)$

**Q.** If $P(skinny|apple)=0$, what is our smoothed estimate?

In [26]:
(df.query("skinny == 1 and fruit == 0").shape[0]+1)/((df.fruit== 0).sum()+len(V))

0.1

<details>
<summary>Solution</summary>
$P(skinny|apple) = (count(skinny,apple)+1)/(count(apple)+|V|) = (0+1)/(4+6) = .1$
</details>
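
The smoothing formula in the solution can be written as a tiny function, which makes it easy to check other words by hand. A minimal sketch; `laplace_estimate` is a hypothetical name, and the denominator follows this notebook's convention of adding $|V|$.

```python
def laplace_estimate(word_count, class_count, vocab_size):
    """Smoothed P(w|c) = (count(w,c) + 1) / (count(c) + |V|)."""
    return (word_count + 1) / (class_count + vocab_size)

# 4 apples and 6 vocabulary words.
p_skinny_apple = laplace_estimate(0, 4, 6)  # "skinny" never describes an apple
p_round_apple = laplace_estimate(4, 4, 6)   # "round" describes all 4 apples
print(p_skinny_apple, p_round_apple)
```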

**Now, do that using vector operations to get smoothed `P_w_apple` from the apple records**

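Recall that `df[df.fruit==0]` gets you just the apple records.  Your $P(w|apple)$ results should be as follows (the `fruit` row is a by-product of summing the target column along with the word columns and can be ignored):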

```
green     0.3
red       0.3
round     0.5
skinny    0.1
sour      0.2
yellow    0.1
fruit     0.1
```

In [45]:
P_w_apple = (df[df.fruit==0].sum()+1)/((df.fruit==0).sum()+len(V))
P_w_apple

green     0.3
red       0.3
round     0.5
skinny    0.1
sour      0.2
yellow    0.1
fruit     0.1
dtype: float64

<details>
<summary>Solution</summary>
    <pre>
w_counts_apple = df[df.fruit==0].sum(axis=0)     
P_w_apple = (w_counts_apple+1) / (len(apples)+len(V))
P_w_apple
</pre>
</details>

**Do the same thing to get `P_w_banana` from the banana records**

You should get:

```
green     0.222222
red       0.111111
round     0.111111
skinny    0.444444
sour      0.111111
yellow    0.333333
fruit     0.444444
```

In [46]:
P_w_banana= (df[df.fruit==1].sum()+1)/(len(V)+(df.fruit==1).sum())
P_w_banana

green     0.222222
red       0.111111
round     0.111111
skinny    0.444444
sour      0.111111
yellow    0.333333
fruit     0.444444
dtype: float64

<details>
<summary>Solution</summary>
<pre>
w_counts_banana = df[df.fruit==1].sum(axis=0)
P_w_banana = (w_counts_banana+1) / (len(bananas)+len(V))
P_w_banana
</pre>
</details>
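
Both expected outputs carry a `fruit` entry because the target column is summed along with the word columns. One possible way to avoid that, sketched below, is to select only the vocabulary columns before counting. `smoothed_word_probs` is a hypothetical helper, and the data frame is rebuilt here so the snippet runs on its own.

```python
import pandas as pd

V = ['green', 'red', 'round', 'skinny', 'sour', 'yellow']
rows = [
    [0, 1, 1, 0, 0, 0],  # red round
    [0, 1, 1, 0, 0, 0],  # red round
    [1, 0, 1, 0, 1, 0],  # green sour round
    [1, 0, 1, 0, 0, 0],  # green round
    [0, 0, 0, 1, 0, 1],  # yellow skinny
    [0, 0, 0, 1, 0, 1],  # yellow skinny
    [1, 0, 0, 1, 0, 0],  # green skinny
]
df = pd.DataFrame(rows, columns=V)
df['fruit'] = [0, 0, 0, 0, 1, 1, 1]

def smoothed_word_probs(df, label):
    """Laplace-smoothed P(w|class) over the vocabulary columns only."""
    records = df.loc[df.fruit == label, V]  # class rows, word columns
    return (records.sum(axis=0) + 1) / (len(records) + len(V))

P_w_apple = smoothed_word_probs(df, 0)
P_w_banana = smoothed_word_probs(df, 1)
print(P_w_apple)
print(P_w_banana)
```

The word entries are unchanged from the expected outputs above; only the `fruit` row disappears.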

**Q.** Given `P_w_apple`, what is `P_apple_redround`, the "probability" that "red round" is an apple?

We haven't normalized the scores (per our friend Bayes), so they aren't technically probabilities. Just compute the score we'd use for classification per the lecture. Hint: `P_w_apple['skinny']` gives the estimate of $P(skinny|apple)$.

In [53]:
P_w_apple.red*P_w_apple['round']*P_apple

0.0857142857142857

<details>
<summary>Solution</summary>
    The answer is 0.0857142857142857 via:<br>
    <tt>P_apple_redround = P_apple * P_w_apple['red']*P_w_apple['round']</tt>
</details>

**Q.** Given `P_w_banana`, what is `P_banana_redround`, the "probability" that "red round" is a banana?

In [54]:
P_banana*P_w_banana['red']*P_w_banana['round']

0.005291005291005291

<details>
<summary>Solution</summary>
    The answer is 0.005291005291005291 via:<br>
    <tt>P_banana_redround = P_banana * P_w_banana['red']*P_w_banana['round']</tt>
</details>
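
Since the two scores are unnormalized, they do not sum to 1. If actual probabilities are wanted, dividing each score by the sum over both classes is enough. A short sketch using the two values computed above; the `posterior_*` names are introduced here for illustration.

```python
# Unnormalized scores for "red round", taken from the two answers above.
score_apple = 0.0857142857142857
score_banana = 0.005291005291005291

# Dividing by the total over classes yields values that add up to 1.
total = score_apple + score_banana
posterior_apple = score_apple / total
posterior_banana = score_banana / total
print(f"{posterior_apple:.4f} {posterior_banana:.4f}")
```

Normalizing does not change which class wins, which is why the classification step can skip it.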

Here's how to easily compute the probability of each word in V given class apple and class banana:

In [55]:
[P_w_apple[w] for w in V]

[0.3, 0.3, 0.5, 0.1, 0.2, 0.1]

In [56]:
[P_w_banana[w] for w in V]

[0.2222222222222222,
 0.1111111111111111,
 0.1111111111111111,
 0.4444444444444444,
 0.1111111111111111,
 0.3333333333333333]

**Now, define a function for computing the likelihood of the document at index $d \in [0,6]$**

$$
c^* = \underset{c}{\operatorname{argmax}} ~ P(c) \prod_{w \in V} P(w | c)^{n_w(d)}
$$

You have these pieces: `P_apple`, `P_w_apple`, and $n_w(d)$ is just the value in `df[w][d]`.

In [63]:
df

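|   | green | red | round | skinny | sour | yellow | fruit |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 3 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |
| 5 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |
| 6 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |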


In [78]:
np.power([P_w_apple[w] for w in V],df.iloc[:,:-1])

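|   | green | red | round | skinny | sour | yellow |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.3 | 0.5 | 1.0 | 1.0 | 1.0 |
| 1 | 1.0 | 0.3 | 0.5 | 1.0 | 1.0 | 1.0 |
| 2 | 0.3 | 1.0 | 0.5 | 1.0 | 0.2 | 1.0 |
| 3 | 0.3 | 1.0 | 0.5 | 1.0 | 1.0 | 1.0 |
| 4 | 1.0 | 1.0 | 1.0 | 0.1 | 1.0 | 0.1 |
| 5 | 1.0 | 1.0 | 1.0 | 0.1 | 1.0 | 0.1 |
| 6 | 0.3 | 1.0 | 1.0 | 0.1 | 1.0 | 1.0 |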


In [66]:
np.prod(np.power([P_w_apple[w] for w in V],df.iloc[:,:-1]),axis= 1)

0    0.15
1    0.15
2    0.03
3    0.15
4    0.01
5    0.01
6    0.03
dtype: float64

In [69]:
def likelihood_apple(d:int):
    return P_apple*np.prod(np.power([P_w_apple[w] for w in V],df.iloc[:,:-1]),axis= 1)[d]
def likelihood_banana(d:int):
    return P_banana*np.prod(np.power([P_w_banana[w] for w in V],df.iloc[:,:-1]),axis=1)[d]

In [74]:
likelihood_apple(0)

0.0857142857142857

<details>
<summary>Solution</summary>
<pre>
def likelihood_apple(d:int):
    return P_apple * np.prod([P_w_apple[w]**df[w][d] for w in V])
def likelihood_banana(d:int):
    return P_banana * np.prod([P_w_banana[w]**df[w][d] for w in V])
</pre>
</details>
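
The apple and banana functions differ only in the prior and the word probabilities they use, so one generic function can cover both classes. The sketch below is plain Python and takes the word-count vector directly rather than a row index; `score` is a hypothetical name, and the probabilities are copied from the smoothed estimates above.

```python
from math import prod

V = ['green', 'red', 'round', 'skinny', 'sour', 'yellow']
P_apple, P_banana = 4/7, 3/7
P_w_apple = dict(zip(V, [0.3, 0.3, 0.5, 0.1, 0.2, 0.1]))
P_w_banana = dict(zip(V, [2/9, 1/9, 1/9, 4/9, 1/9, 3/9]))

def score(word_counts, prior, P_w):
    """P(c) * prod over w in V of P(w|c)^n_w(d) for one word-count vector."""
    return prior * prod(P_w[w] ** n for w, n in zip(V, word_counts))

green_skinny = [1, 0, 0, 1, 0, 0]  # row 6 of df
a = score(green_skinny, P_apple, P_w_apple)
b = score(green_skinny, P_banana, P_w_banana)
print(f"{a:.6f} {b:.6f} => {'apple' if a > b else 'banana'}")
```

Words with a count of 0 contribute a factor of 1, so only the words present in the description affect the score.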


**Run the following loop to make predictions for each document**

Output should be:

```
red round        : 0.085714, 0.005291 => apple
red round        : 0.085714, 0.005291 => apple
green sour round : 0.017143, 0.001176 => apple
green round      : 0.085714, 0.010582 => apple
yellow skinny    : 0.005714, 0.063492 => banana
yellow skinny    : 0.005714, 0.063492 => banana
green skinny     : 0.017143, 0.042328 => banana
```

In [72]:
docs = apples_docs+bananas_docs
for d in range(len(df)):
    a = likelihood_apple(d)
    b = likelihood_banana(d)
    print(f"{docs[d]:17s}: {a:4f}, {b:4f} => {'apple' if a>b else 'banana'}")

red round        : 0.085714, 0.005291 => apple
red round        : 0.085714, 0.005291 => apple
green sour round : 0.017143, 0.001176 => apple
green round      : 0.085714, 0.010582 => apple
yellow skinny    : 0.005714, 0.063492 => banana
yellow skinny    : 0.005714, 0.063492 => banana
green skinny     : 0.017143, 0.042328 => banana
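
As a cross-check on the loop above, all seven predictions can be computed at once with a matrix product. This sketch works in log space, which is an assumption on my part rather than something this notebook does; taking logs turns the product into a sum and leaves the argmax unchanged. The array names are made up for illustration.

```python
import numpy as np

# Binary word vectors in vocabulary order: green, red, round, skinny, sour, yellow.
X = np.array([
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0, 1],
    [1, 0, 0, 1, 0, 0],
])

# Smoothed P(w|c) from above: row 0 = apple, row 1 = banana.
log_P_w = np.log(np.array([
    [0.3, 0.3, 0.5, 0.1, 0.2, 0.1],
    [2/9, 1/9, 1/9, 4/9, 1/9, 3/9],
]))
log_prior = np.log(np.array([4/7, 3/7]))

# log P(c) + sum_w n_w(d) * log P(w|c) for every document and class.
log_scores = X @ log_P_w.T + log_prior  # shape (7, 2)
pred = log_scores.argmax(axis=1)        # 0 = apple, 1 = banana
print(pred)
```

Exponentiating `log_scores` recovers the same scores printed by the loop.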
