# Naive Bayes
## Bayes' Theorem & its Building Blocks

In this module, you will learn about another supervised classification technique: **Naive Bayes**. Naive Bayes is a **probabilistic classifier** which **returns the probability of a test point belonging to a class** rather than only the label of the test point.

### Conditional Probability

Before you get into Bayes’ theorem, let's understand conditional probability and try to understand its intuition.

 ![91.png](attachment:3cce11fb-e35d-451a-93f1-539c2fc30a77.png)

 ![92.png](attachment:8b165111-3b18-42a9-9d1c-5ec0f8abcebe.png)

 

## Bayes' Theorem

In this section, you will use **Bayes’ theorem** to **calculate conditional probabilities**. The example is one that almost all of us can relate to: cricket.



While watching cricket matches on TV, you may have seen statistics similar to this: “India wins 70% matches when Tendulkar scores a century.” Sounds like conditional probability? 

This is a classic example of how conditional probability can be used to estimate the chances of an event taking place, given certain other events that have happened.



Suppose that India plays 100 matches, out of which it wins 60 and loses 40. Sachin Tendulkar plays all of these 100 matches, scores a century in 12 of them, and doesn't score a century in the remaining 88.



To make things interesting, you also have this additional information: out of the 60 games that India wins, Sachin scores a century in 10, and out of the 40 games that India loses, Sachin scores a century only in two.



Let us look at the two-way contingency matrix for the above case:

![93.png](attachment:51bedac4-5407-4654-96e7-6c0f6c5751a6.png)

Now, can you answer this question: **what is the probability that India wins, given that Sachin has scored a century?**

![94.png](attachment:468aa987-656d-4e33-a4e8-48223325dd33.png)
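The answer can also be read directly off the counts. The following is a minimal sketch that uses only the numbers stated in the example (the variable names are illustrative): conditioning on "Sachin scores a century" means looking only at the 12 matches with a century and asking what fraction of them India won.

```python
from fractions import Fraction

# Counts from the cricket example above (100 matches in total).
wins_with_century = 10    # India wins and Sachin scores a century
losses_with_century = 2   # India loses and Sachin scores a century

# Conditioning on "century" restricts attention to the matches with a century.
matches_with_century = wins_with_century + losses_with_century

p_win_given_century = Fraction(wins_with_century, matches_with_century)
print(p_win_given_century)  # fraction of century matches that India won
```

The result is 10/12, i.e. 5/6 or roughly 83%, which is well above India's overall win rate of 60%.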


**Bayes' Theorem is a simple way to update what you know about something after getting new information.**

**Here’s how to think about it:**

- Imagine you have an initial belief about the chance of something happening (for example, a patient being sick).
- You get some new evidence (like a test result).
- Bayes’ Theorem helps you combine your initial belief and the new evidence to get an updated probability (how likely it is that the patient is sick after seeing the test result).

**In formula terms:**
$$
P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}
$$
- P(A) : your initial guess (prior probability).
- P(B|A) : how likely the new information is, assuming your guess was correct.
- P(B) : overall likelihood of the new information.
- P(A|B) : the updated probability after seeing the new info (posterior probability).

In simple words: **It’s a way to update your beliefs based on new evidence**.
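As a quick check of the formula, the sketch below plugs the cricket numbers from earlier in this section into Bayes' theorem, with A = "India wins" and B = "Sachin scores a century". The helper name `bayes_posterior` is an illustrative choice, not part of any library.

```python
from fractions import Fraction

def bayes_posterior(likelihood, prior, evidence):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence

p_win = Fraction(60, 100)               # P(A): India wins 60 of 100 matches
p_century = Fraction(12, 100)           # P(B): a century in 12 of 100 matches
p_century_given_win = Fraction(10, 60)  # P(B|A): a century in 10 of the 60 wins

p_win_given_century = bayes_posterior(p_century_given_win, p_win, p_century)
print(p_win_given_century)
```

The prior belief of 60% is updated to 5/6 (about 83%) once the century is observed.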


### Naive Bayes for Categorical Data

You will understand the Naive Bayes classifier through a mushroom example, where the aim is to classify a new mushroom as edible or poisonous. You will also see how the algorithm uses probability to make this classification. This session covers the following topics:

- Bayes' theorem
- Naïve Bayes on categorical data

### Naive Bayes - With One Feature

**Naïve Bayes** is a **probabilistic classifier** that returns the probability of a test point belonging to a class, **using Bayes’ theorem**. 

Please find the Mushroom Dataset [here](https://ml-course2-upgrad.s3.amazonaws.com/Naive+Bayes/Naive+Bayes+For+Categorical+Data/Mushroom+Subset.xlsx) 

![95.png](attachment:a80b7502-ab3d-40ad-bab7-4492cc7be3ad.png)

You will now implement Naïve Bayes on this mushroom dataset and try to classify a new test point into either of the two classes – edible or poisonous. 

As the name suggests, the Naive Bayes algorithm uses Bayes’ theorem to classify new test points. 

But, how exactly does it do this? Let’s find out.

![96.png](attachment:ba6b8dfb-59be-420d-b307-a089dc6a6766.png)

- The **effect of the denominator P(x) is not incorporated** while calculating probabilities as it is the same for both the classes and hence, can be ignored without affecting the final outcome.

- The class assigned to the new test point is the class for which the posterior probability (equivalently, the numerator $P(x \mid C_i)\,P(C_i)$) is greater.

![97.png](attachment:f88f1bac-da16-4c24-bd91-1920e9744eaf.png)


## MCQ 

**Comprehension - Part 1**

**Comprehension: Naive Bayes With One Feature**

Table 1: Mushroom Dataset with One Feature

| S.No | Type of mushroom | Cap shape |
|------|------------------|-----------|
| 1.   | Poisonous        | Convex    |
| 2.   | Edible           | Convex    |
| 3.   | Poisonous        | Convex    |
| 4.   | Edible           | Convex    |
| 5.   | Edible           | Convex    |
| 6.   | Poisonous        | Convex    |
| 7.   | Edible           | Bell      |
| 8.   | Edible           | Bell      |
| 9.   | Edible           | Convex    |
| 10.  | Poisonous        | Convex    |
| 11.  | Edible           | Flat      |
| 12.  | Edible           | Bell      |

Consider the table shown above. There are two types of mushrooms, edible and poisonous, which is the target (dependent) variable. They have various kinds of cap-shapes. Out of the total 12 mushrooms, eight are edible and four poisonous.

You want to train Naive Bayes using this data so that it can predict whether a given (new) mushroom is edible or poisonous. The task is to classify a mushroom as edible/poisonous.


#### Q1. What is the feature in this task?

- [ ] Type of Mushroom
- [ ] Cap-Shape

**Comprehension - Part 2**

Say you represent the two class labels as $C_k$, where k = 1 represents edible, and k = 2 represents poisonous. The task is to predict the probability of a mushroom belonging to C1 or C2 using the feature ‘cap-shape’. You can represent the class either as C1, C2 or C = edible/poisonous.

The feature ‘cap-shape’ is represented by X, i.e. X can take the values CONVEX, FLAT, BELL, etc. In the following questions, you will break down each term of the Bayes Theorem and understand them individually.


#### Q2. The probability of a CONVEX mushroom being edible, P(C = edible | X = CONVEX) is given by:

- [x] P( X = CONVEX | C = edible) . P(C = edible) / P(X = CONVEX)
- [ ] P( X = CONVEX | C = edible) . P(X = CONVEX) / P(C = edible)
- [ ] P(C = edible | X = CONVEX ) . P(X = CONVEX) / P(C = edible)
- [ ] None of the above

#### Q3. The value of P(C = edible) is simply the number of edible mushrooms in the dataset divided by the total observations. What is the value of P( C = edible)?

- [x] 8/12
- [ ] 7/14
- [ ] 8/14
- [ ] 7/12

**Comprehension - Part 3**

So you noticed that P(C = edible) is 8/12 = 66.66%. This means that approx. 66.66% of all mushrooms are edible. Note that P(C = edible) appears in the numerator of the Bayes expression and this value is directly proportional to the chances of a mushroom being edible. Let’s understand the other two terms in the Bayes expression. 

#### Q4. Now let’s say you picked a new mushroom whose cap-shape is CONVEX. What are the chances of this happening, i.e. what is the value of P(X = CONVEX)?

- [ ] 2/12
- [x] 8/12
- [ ] 4/12
- [ ] Can not be calculated


#### Q5. What is the probability of the mushroom being CONVEX given it is edible, i.e. P(X = CONVEX | C = edible)? This is the fraction of CONVEX mushrooms out of all the edible ones.

- [ ] 8/12
- [ ] 3/8
- [ ] 4/12
- [x] 4/8



#### Q6. In the previous questions, you have calculated that P(C = edible) is 8/12, P(X = CONVEX) is 8/12 and  P(X = CONVEX | C = edible) is 4/8. What is the probability that the CONVEX mushroom is edible, P(C = edible | X = CONVEX)?

- [x] 4/8
- [ ] 8/12
- [ ] 4/12
- [ ] 8/24


#### Q7. In the previous question, you found the probability of the CONVEX mushroom being edible. What is the probability of the CONVEX mushroom being poisonous, P(C = poisonous | X = CONVEX)?

- [ ] 8/12
- [ ] 4/12
- [x] 4/8
- [ ] 5/8

#### Q8. What are the chances of a random mushroom being poisonous, i.e. P(C = poisonous)?

- [ ] 8/12
- [ ] 4/8
- [x] 4/12
- [ ] 3/4

#### Q9. What are the chances of a mushroom being CONVEX given it is poisonous, i.e. P(X = CONVEX | C = poisonous)?

- [ ] 4/12
- [ ] 8/12
- [x] 1
- [ ] 6/12

**Comprehension - Part 4**


Let’s analyse the results of this problem:

The probabilities of a CONVEX mushroom being edible and poisonous are both 50%. The probability of a CONVEX mushroom being edible, `P(C = edible | X = CONVEX)`, is:

    P( X = CONVEX | C = edible) . P(C = edible) / P(X = CONVEX)
    
    = (4/8).(8/12) / (8/12)
    
    = 50%


Similarly, the probability of the mushroom being poisonous, `P(C = poisonous| X = CONVEX)` is
 
    = P( X = CONVEX | C = poisonous) . P(C = poisonous) / P(X = CONVEX)
    
    = (4/4).(4/12) / (8/12)
    
    = 50%


Note that the denominator is common in both calculations, i.e. P(X = CONVEX) = 8/12, and thus you do not need to calculate it. You can simply compare the numerators and conclude the classes based on that:

- **Edible:** P( X = CONVEX | C = edible) . P(C = edible) =  (4/8).(8/12) = 4/12 = 33.33%

- **Poisonous:** P( X = CONVEX | C = poisonous) . P(C = poisonous) =  (4/4).(4/12) = 4/12 = 33.33%

Since both numerators are 4/12, you cannot classify the CONVEX mushroom as edible or poisonous (if you consider 50% as the threshold probability for classification). The fundamental concept is that you only need to compare the numerators for the two classes and assign the class based on that.

Let’s now break down the Bayes theorem. The 50% probability that the CONVEX mushroom is edible (or poisonous) is a result of three probabilities. P(edible | CONVEX) is:


- Proportional to P(edible), which tells us how abundant edible mushrooms are; if P(edible) is high, then P(edible | CONVEX) will be high simply because edible mushrooms are abundant!
    - P(edible) is 66.66% and P(poisonous) is 33.33%
    - This tips the balance towards edible, since edible mushrooms are in abundance
- Proportional to P(CONVEX | edible), which explains how likely you are to find a CONVEX mushroom if you separately consider all the edible ones;
    - P(CONVEX | edible) is 50% and P(CONVEX | poisonous) is 100%
    - This tips the balance towards poisonous, since all poisonous mushrooms are CONVEX
- Inversely proportional to P(CONVEX); this term cancels out while comparing the two classes

Thus, the numerators are equal because the two probabilities in each product balance each other out:

**Edible:** P(edible) × P(CONVEX | edible) = 66.66% × 50% = 33.33%

**Poisonous:** P(poisonous) × P(CONVEX | poisonous) = 33.33% × 100% = 33.33%
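The counting in this comprehension can be reproduced in a few lines. The sketch below is an illustration rather than a reference implementation: it transcribes the twelve rows of Table 1 by hand, estimates the prior and the likelihood by counting, and compares the numerators for a CONVEX mushroom. The names `rows` and `numerator` are my own.

```python
from collections import Counter
from fractions import Fraction

# (type, cap shape) pairs transcribed from Table 1.
rows = [
    ("poisonous", "convex"), ("edible", "convex"), ("poisonous", "convex"),
    ("edible", "convex"), ("edible", "convex"), ("poisonous", "convex"),
    ("edible", "bell"), ("edible", "bell"), ("edible", "convex"),
    ("poisonous", "convex"), ("edible", "flat"), ("edible", "bell"),
]

class_counts = Counter(label for label, _ in rows)  # mushrooms per class
pair_counts = Counter(rows)                         # mushrooms per (class, shape)

def numerator(label, shape):
    """P(X = shape | C = label) * P(C = label), estimated by counting."""
    prior = Fraction(class_counts[label], len(rows))
    likelihood = Fraction(pair_counts[(label, shape)], class_counts[label])
    return prior * likelihood

numerators = {label: numerator(label, "convex") for label in class_counts}
evidence = sum(numerators.values())  # adds up to P(X = convex)
posteriors = {label: value / evidence for label, value in numerators.items()}

print(numerators)
print(posteriors)
```

Both numerators come out as 1/3 (that is, 4/12), their sum recovers P(X = CONVEX) = 8/12, and each posterior is 1/2, matching the tie described above.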

## Multivariate
### Conditional Independence in Naive Bayes

In the previous segment, you understood the basic idea behind Naive Bayes and how it is applied to categorical data with one feature and one target variable. In that case, the calculations are very simple, as the probabilities can be obtained just by counting. In this segment, you will **understand how Naive Bayes works when there is more than one feature in the data set**.

![98.png](attachment:5c25178f-0394-4f8b-bf35-d21c6ea757c2.png)

![99.png](attachment:4ebea705-fa36-4622-9bb6-dc5dac25bc64.png)


**Naïve Bayes** assumes that the **variables are conditionally independent** given the class, i.e. `P(X = convex, smooth | C = edible)` can be written as `P(X = smooth | C = edible) × P(X = convex | C = edible)`. The terms `P(X = smooth | C = edible)` and `P(X = convex | C = edible)` are simply calculated by counting the data points.

The algorithm is called **“Naïve”** because, in most real-world situations, the variables are not conditionally independent given the class label. Even so, the algorithm often works well in practice.

Let us say you are trying to compute `P(A and B | C)`. If, once C is known, learning B tells you nothing more about A, i.e. `P(A | B, C) = P(A | C)` (and likewise `P(B | A, C) = P(B | C)`), then A and B are conditionally independent given C. In that case `P(A and B | C) = P(A | C) × P(B | C)`.



Despite this assumption, Naive Bayes has proven to work very well in some cases, such as text classification. You'll study an example of classifying emails into spam/ham in the next session.
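To see what the assumption means on real counts, the sketch below compares the joint conditional probability obtained by direct counting with the product of the per-feature probabilities. It is only an illustration: the rows are transcribed by hand from the two-feature mushroom table used in the questions that follow, and `cond_prob` is a hypothetical helper name.

```python
from fractions import Fraction

# (type, cap shape, cap surface) rows from the two-feature mushroom table.
rows = [
    ("poisonous", "convex", "scaly"), ("edible", "convex", "scaly"),
    ("poisonous", "convex", "smooth"), ("edible", "convex", "smooth"),
    ("edible", "convex", "fibrous"), ("poisonous", "convex", "scaly"),
    ("edible", "bell", "scaly"), ("edible", "bell", "scaly"),
    ("edible", "convex", "scaly"), ("poisonous", "convex", "scaly"),
    ("edible", "flat", "scaly"), ("edible", "bell", "smooth"),
]

def cond_prob(event, label):
    """P(event | C = label): share of the rows of that class satisfying event."""
    in_class = [r for r in rows if r[0] == label]
    return Fraction(sum(1 for r in in_class if event(r)), len(in_class))

results = {}
for label in ("edible", "poisonous"):
    for surface in ("smooth", "scaly"):
        # Joint probability counted directly from the rows.
        joint = cond_prob(lambda r: r[1] == "convex" and r[2] == surface, label)
        # Product of the two single-feature probabilities (the naive assumption).
        product = (cond_prob(lambda r: r[1] == "convex", label)
                   * cond_prob(lambda r: r[2] == surface, label))
        results[(label, surface)] = (joint, product)
        print(label, surface, "joint =", joint, "product =", product)
```

For (convex, smooth) the product matches the direct count in both classes. For (convex, scaly) in the edible class it does not: the direct count gives 2/8 while the product gives 5/16. On this small sample, then, the factorisation is an approximation that Naive Bayes accepts in exchange for simple counting.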

## MCQ

**Comprehension - Naive Bayes with Multiple Features**

Table 2: Mushroom Dataset

| S.No | Type of Mushroom | Cap.shape | Cap.surface |
|------|------------------|-----------|-------------|
| 1.   | Poisonous        | Convex    | Scaly       |
| 2.   | Edible           | Convex    | Scaly       |
| 3.   | Poisonous        | Convex    | Smooth      |
| 4.   | Edible           | Convex    | Smooth      |
| 5.   | Edible           | Convex    | Fibrous     |
| 6.   | Poisonous        | Convex    | Scaly       |
| 7.   | Edible           | Bell      | Scaly       |
| 8.   | Edible           | Bell      | Scaly       |
| 9.   | Edible           | Convex    | Scaly       |
| 10.  | Poisonous        | Convex    | Scaly       |
| 11.  | Edible           | Flat      | Scaly       |
| 12.  | Edible           | Bell      | Smooth      |

      
Refer to the table above for the questions that follow. The Type of Mushroom and Cap.shape columns are the same as before. The Cap.surface column is the second, newly added feature. The task is to **predict the Type.of.mushroom given its two features**.



In the multivariate case, the feature X is written as X = (cap.shape, cap.surface). Suppose you take a mushroom with cap.shape = CONVEX and cap.surface = SCALY. The probability of it being edible is expressed as:

    P(C = edible | X = CONVEX, SCALY)

    = P(X = CONVEX, SCALY | C = edible) × P(C = edible) / P(X = CONVEX, SCALY)

You can similarly write the expression for `P(C = poisonous | X = CONVEX, SCALY)`, compare it with `P(C = edible | X = CONVEX, SCALY)` and conclude the result. Recall that you do not need to calculate the denominator because it is the same for both the edible and the poisonous class.


**Useful numbers:**

Number of edible mushrooms = 8

Number of poisonous mushrooms = 4

#### Q1. Say you take a new mushroom which is (CONVEX, SMOOTH). What is the numerator of P(C = edible | X = CONVEX, SMOOTH)?

- [X] P(edible) x P(CONVEX | edible) x P(SMOOTH| edible)
- [ ] P(CONVEX) x P(CONVEX | edible) x P(SMOOTH| edible)
- [ ] P(SMOOTH) x P(CONVEX | edible) x P(SMOOTH| edible)
- [ ] None of these

#### Q2. What is P(CONVEX | edible)?

- [ ] 8/12
- [ ] 4/12
- [x] 4/8
- [ ] 3/12

#### Q3. What is P(SMOOTH | edible)?

- [x] 2/8
- [ ] 2/12
- [ ] 8/12
- [ ] 4/8

#### Q4.  What is P(CONVEX | poisonous)?

- [ ] 8/12
- [ ] 4/8
- [x] 1
- [ ] 4/12

Ans: Out of 4 poisonous mushrooms, all 4 are CONVEX.

#### Q5.  What is P(SMOOTH| poisonous)?

- [ ] 1
- [x] 1/4
- [ ] 4/8
- [ ] 1/12
      
#### Q6. In the previous questions, you have calculated that:

P(CONVEX | edible) = 4/8

P(SMOOTH| edible) = 2/8

P(CONVEX | poisonous) = 1 and

P(SMOOTH| poisonous) = 1/4


If all mushrooms above 50% probability of being edible are classified as edible, is the CONVEX, SMOOTH mushroom edible?

- [ ] Yes

- [ ] No

- [X] Cannot be decided, it is a tie

Answer:

P(edible | CONVEX, SMOOTH) = P(edible) · P(CONVEX | edible) · P(SMOOTH | edible) / d = (8/12)(4/8)(2/8)/d = 1/(12d)

P(poisonous | CONVEX, SMOOTH) = P(poisonous) · P(CONVEX | poisonous) · P(SMOOTH | poisonous) / d = (4/12)(1)(1/4)/d = 1/(12d)

Here d = P(CONVEX, SMOOTH) is the common denominator.

Since both numerators are equal to 1/12, both posteriors are 50% and this mushroom cannot be classified with a 50% threshold. However, with a higher threshold such as 60% (which is reasonable, since you would not want to be responsible for people eating poisonous mushrooms), it will be classified as poisonous. Why? Because with a 60% threshold, you need P(edible | CONVEX, SMOOTH) to be at least 60%, and here it is only 50%.
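The arithmetic in this answer can be checked with a short sketch. It takes the probabilities quoted in Q6 as given and, as an assumption of this example, normalises by the sum of the two Naive Bayes numerators in place of the common denominator d.

```python
from fractions import Fraction

# Probabilities quoted in the questions above.
prior = {"edible": Fraction(8, 12), "poisonous": Fraction(4, 12)}
p_convex = {"edible": Fraction(4, 8), "poisonous": Fraction(1)}
p_smooth = {"edible": Fraction(2, 8), "poisonous": Fraction(1, 4)}

# Naive Bayes numerator: prior times the product of per-feature likelihoods.
numerators = {c: prior[c] * p_convex[c] * p_smooth[c] for c in prior}

# Assumption: the sum of the numerators stands in for the denominator d.
total = sum(numerators.values())
p_edible = numerators["edible"] / total

# With a 60% threshold, the mushroom counts as edible only if P(edible | X) >= 0.6.
is_edible_at_60 = p_edible >= Fraction(60, 100)

print(numerators)
print(p_edible, is_edible_at_60)
```

Both numerators are 1/12, so the normalised probability of edible is 1/2, which falls short of the 60% threshold.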

## Understanding Naive Bayes

You saw how conditional independence lets you calculate the class probability in cases where you have more than one feature. Now, in this segment, you will deal with the original five-variable problem, where the new test point has the following features:

- Cap Shape = Convex
- Cap Surface = Smooth
- Cap Colour = White
- Bruises = Yes
- Odour = None

Here the objective is to classify it into the edible or the poisonous class. Let's see how that can be done.

![100.png](attachment:3f4d9c24-e9ce-49c2-9ef4-184b1e957e6e.png)

### Terms in the Bayes theorem 

Again, Bayes theorem is defined as: 

![](https://d35ev2v1xsdze0.cloudfront.net/b28ba1da-383c-4807-ac0b-f561d48d1742-1742810311399.png)


- **$P(C_i \mid X)$** is called the **posterior probability**. It is compared across the classes, and the test point is assigned the class whose posterior probability is greater.

- **$P(C_i)$** is known as the prior probability. It is the **probability of an event occurring before the collection of new data**. The prior plays an important role in Naïve Bayes classification, as it strongly influences the class of the new test point.

- **$P(X \mid C_i)$** represents the **likelihood function**. It **tells the likelihood of a data point occurring in a category**. The conditional independence assumption is leveraged while computing the likelihood.

- **P(x)** - The effect of the denominator P(x) is not incorporated while calculating probabilities as it is the same for both the classes and hence, can be ignored without affecting the final outcome.

### Prior, Posterior and Likelihood

Let’s understand the terminology of Bayes theorem.

You have been using 3 terms: 
- P(Class = edible / poisonous)
- P(X | Class) 
- P(Class | X) 

**Bayesian classification** is based on the principle that ‘you **combine your prior knowledge or beliefs about a population** with the **case specific information** to get the actual (posterior) probability’.

- **P(Class = edible)** or **P(Class = poisonous)** is called the **prior probability**: This captures your ‘prior beliefs’ before you collect case-specific information. If 90% of mushrooms are edible, then the prior probability of edible is 0.90. The prior gets multiplied with the likelihood to give the numerator of the posterior. In many cases, the prior has a tremendous effect on the classification. **If the prior is neutral (50% are edible), then the likelihood may largely decide the outcome**.
- P(X | Class) is the **likelihood**: After agreeing upon the prior, you collect new, case-specific data (like plucking mushrooms randomly from a farm and observing the cap shapes). **Likelihood updates your prior beliefs with the new information**. If you find a CONVEX mushroom, then you’d want to know how likely you were to find a convex one if you had only plucked edible mushrooms.

  If P(CONVEX | edible) is high, say 80%, implying that there was an 80% chance of getting a convex mushroom if you only took from edible mushrooms, this will be reflected in increased chances of the mushroom being edible.

  If the likelihoods are similar for all classes, then the prior probability may largely decide the outcome. If the prior is very strong, then the likelihood often barely affects the result.
- P(Class = edible | X) is the **posterior probability**: It is the outcome which **combines prior beliefs and case-specific information**. It is a balanced outcome of the prior and the likelihood.

  If Zimbabwe takes 3 Australian wickets in the first over in a World Cup match, would you predict Australia to lose? Probably not, because the prior odds are far too strong in favour of Australia, which has very rarely lost to Zimbabwe in a World Cup. The likelihood, though it may be high, gets balanced by the prior odds (Australia’s prior odds may even be 99%!) to give you the correct posterior.
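The interplay between prior and likelihood can be made concrete with a small sketch. It keeps the likelihoods of a CONVEX cap fixed at the values from the one-feature mushroom table, P(CONVEX | edible) = 4/8 and P(CONVEX | poisonous) = 4/4, and varies only the prior. The function name `posterior_edible` and the choice of priors to try are illustrative assumptions.

```python
def posterior_edible(prior_edible, lik_edible, lik_poisonous):
    """P(edible | X) for a two-class problem via Bayes' theorem."""
    num_edible = lik_edible * prior_edible
    num_poisonous = lik_poisonous * (1 - prior_edible)
    return num_edible / (num_edible + num_poisonous)

# Likelihoods of observing a CONVEX cap in each class (one-feature table).
lik_edible, lik_poisonous = 0.5, 1.0

# Try a neutral prior, the dataset's prior (8/12) and a strong prior (90%).
for prior in (0.5, 8 / 12, 0.9):
    post = posterior_edible(prior, lik_edible, lik_poisonous)
    print(f"prior = {prior:.2f} -> posterior = {post:.3f}")
```

With a neutral prior the likelihood decides the outcome (the posterior for edible is about 0.33). The dataset's prior of 8/12 exactly offsets the likelihood (0.50), and a 90% prior outweighs it (about 0.82).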


## MCQ

**Comprehension - Naive Bayes with Multiple Features:**

Please use the table data to answer the questions below:

Table 3: Mushroom Dataset

| S.No | Type of Mushroom | Cap shape | Cap surface |
|------|------------------|-----------|-------------|
| 1.   | **Poisonous**    | Convex    | Scaly       |
| 2.   | Edible           | Convex    | Scaly       |
| 3.   | **Poisonous**    | Convex    | Smooth      |
| 4.   | Edible           | Convex    | Smooth      |
| 5.   | Edible           | Convex    | Fibrous     |
| 6.   | **Poisonous**    | Convex    | Scaly       |
| 7.   | Edible           | Bell      | Scaly       |
| 8.   | Edible           | Bell      | Scaly       |
| 9.   | Edible           | Convex    | Scaly       |
| 10.  | **Poisonous**    | Convex    | Scaly       |
| 11.  | Edible           | Flat      | Scaly       |
| 12.  | Edible           | Bell      | Smooth      |

#### Q1. In the table above, the prior probability is higher for a mushroom being:

- [ ] Edible
- [ ] Poisonous

#### Q2. Say you consider a (CONVEX, SCALY) mushroom. The likelihood is higher for it being:

- [ ] Edible
- [ ] Poisonous

#### Q3. The values of P(X|Class). P(Class) where X = (CONVEX, SCALY) for both classes (edible and poisonous) are respectively:

- [ ] Edible = 20.8 %; Poisonous = 25.0 %
- [ ] Edible = 20.8 %; Poisonous = 20.0 %
- [ ] Edible = 20 %; Poisonous = 25.0 %
- [ ] Edible = 28.5 %; Poisonous = 25.8 %

#### Q4. For the (CONVEX, SCALY) mushroom:

- [ ] The prior and posterior both are in favor of edible class
- [ ] The prior and posterior both are in favor of poisonous class
- [ ] The prior is in favor of edible; posterior in favor of poisonous
- [ ] The prior is in favor of poisonous; posterior in favor of edible

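Ans: The priors are 8/12 (edible) and 4/12 (poisonous). The numerators P(X | Class) · P(Class), which are proportional to the posteriors, are 20.8% (edible) and 25% (poisonous), so the posterior favours poisonous.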
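The figures in this answer can be reproduced by counting over Table 3. The sketch below is one possible way to do it, with the rows transcribed by hand and a hypothetical helper `naive_bayes_numerator` that multiplies the prior by one counted likelihood per feature. Normalising by the sum of the two numerators is an assumption of this example.

```python
from fractions import Fraction

# (type, cap shape, cap surface) rows transcribed from Table 3.
rows = [
    ("poisonous", "convex", "scaly"), ("edible", "convex", "scaly"),
    ("poisonous", "convex", "smooth"), ("edible", "convex", "smooth"),
    ("edible", "convex", "fibrous"), ("poisonous", "convex", "scaly"),
    ("edible", "bell", "scaly"), ("edible", "bell", "scaly"),
    ("edible", "convex", "scaly"), ("poisonous", "convex", "scaly"),
    ("edible", "flat", "scaly"), ("edible", "bell", "smooth"),
]

def naive_bayes_numerator(label, features):
    """P(C) times the product of P(x_j | C), each estimated by counting."""
    in_class = [r for r in rows if r[0] == label]
    score = Fraction(len(in_class), len(rows))           # prior P(C)
    for j, value in enumerate(features, start=1):
        matches = sum(1 for r in in_class if r[j] == value)
        score *= Fraction(matches, len(in_class))        # likelihood of feature j
    return score

x = ("convex", "scaly")
numerators = {c: naive_bayes_numerator(c, x) for c in ("edible", "poisonous")}
total = sum(numerators.values())
posteriors = {c: n / total for c, n in numerators.items()}

for c in numerators:
    print(c, numerators[c], posteriors[c])
```

The numerators are 5/24 (about 20.8%) for edible and 1/4 (25%) for poisonous. After normalising, the posteriors are 5/11 and 6/11, so the posterior favours poisonous even though the prior favours edible.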


## Graded Question I:

<div class="MuiBox-root css-qgb4hw"><div class="MuiBox-root css-1xzog2f" id="switch-player-content"></div><div class="MuiBox-root css-j7qwjs"><div class="MuiBox-root css-lrle2m-container"><div class="text_component"><p style="text-align: justify;"><strong><span style="font-size: 14px;">Graded Questions - Part 1</span></strong><span style="font-size: 
      14px;"><br><br></span></p><p style="text-align: justify;"><span style="font-size: 
      14px;">In this segment, you will solve questions on 'Naive Bayes' and <strong>will be graded on your answers.</strong></span></p><p style="text-align: justify;"><span style="font-size: 
      14px;"><br></span></p><p style="text-align: justify;"><span style="font-size: 
      14px;"><strong>Comprehension - Spam and Ham E-Mails</strong></span></p><p dir="ltr" style="text-align: justify;"><span style="font-size: 
      14px;">Bayesian classifiers are often used for document classification. The words in the documents are used as features for classification. For example, if you want to classify emails as spam or ham (genuine mail), you can use the ‘frequency of words in the text of an email’ as the features. The grammar is disregarded, which means that <em>unimportant&nbsp;</em>words like it, there, the, and etc. &nbsp;are ignored.</span></p><p style="text-align: justify;"><span style="font-size: 
      14px;"><br></span></p><p dir="ltr" style="text-align: justify;"><span style="font-size: 
      14px;">For example, if the main text of an email is:</span></p><p dir="ltr" style="text-align: justify;"><span style="font-size: 
      14px;">“Best offers on weight loss fitness bands! <strong>Buy</strong> this weekend to get a <strong>free</strong> protein supplement too!!<strong>&nbsp;Limited</strong> stock, <strong>buy</strong> now and get <strong>free</strong> stuff! Hurry up! For more <strong>free</strong> offers, subscribe on the link below.” &nbsp;</span></p><p dir="ltr" style="text-align: justify;"><span style="font-size: 
      14px;"><br></span></p><p dir="ltr" style="text-align: justify;"><span style="font-size: 
      14px;">Then the <strong>frequently&nbsp;</strong> occurring words i.e. the<strong>&nbsp;most important keywords (features)</strong> can be counted and stored in a table as shown below. The email above (obviously spam) is shown in the first row of the table. <strong>Freq 1</strong> is the <strong>most frequent word; Freq 2&nbsp;</strong>is the <strong>second most frequent word etc</strong>. Also, note that the <strong>order of features is important.</strong> If the features are (free, report, buy, click), in that order, then ‘free’ is ‘Freq 1’, ‘report’ is Freq 2 and so on. Which means that &nbsp;(report, free, buy, click) is a different observation from (free, report, buy, click).</span></p><p dir="ltr" style="text-align: justify;"><span style="font-size: 
      14px;"><br></span></p><p dir="ltr" style="text-align: justify;"><span style="font-size: 
      14px;">The data set with features and class labels is shown below:</span></p><span style="font-size: 
      14px;">Table 5: Most Occurring Words</span><table align="center" border="1" cellpadding="1" cellspacing="1"><tbody><tr><td><div style="text-align: justify;"><span style="font-size: 
      14px;"><strong>S.No</strong></span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;"><strong>Class</strong></span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;"><strong>Freq 1</strong></span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;"><strong>Freq 2</strong></span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;"><strong>Freq 3</strong></span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;"><strong>Freq 4</strong></span></div></td></tr><tr><td><div style="text-align: justify;"><span style="font-size: 
      14px;">1.</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;"><strong>Spam</strong></span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">free</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">buy</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">limited</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">hurry</span></div></td></tr><tr><td><div style="text-align: justify;"><span style="font-size: 
      14px;">2.</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">Ham</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">reply</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">data</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">report</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">presentation</span></div></td></tr><tr><td><div style="text-align: justify;"><span style="font-size: 
      14px;">3.</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">Ham</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">report</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">presentation</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">file</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">end of day</span></div></td></tr><tr><td><div style="text-align: justify;"><span style="font-size: 
      14px;">4.</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;"><strong>Spam</strong></span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">limited</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">file</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">buy</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">click</span></div></td></tr><tr><td><div style="text-align: justify;"><span style="font-size: 
      14px;">5.</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">Ham</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">meeting</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">timelines</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">limited</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">documents</span></div></td></tr><tr><td><div style="text-align: justify;"><span style="font-size: 
      14px;">6.</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;"><strong>Spam</strong></span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">hurry</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">data</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">buy</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">stock</span></div></td></tr><tr><td><div style="text-align: justify;"><span style="font-size: 
      14px;">7.</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;"><strong>Spam</strong></span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">limited</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">sex</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">click</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">viagra</span></div></td></tr><tr><td><div style="text-align: justify;"><span style="font-size: 
      14px;">8.</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">Ham</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">presentation</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">end of day</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">data</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">report</span></div></td></tr><tr><td><div style="text-align: justify;"><span style="font-size: 
      14px;">9.</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">Ham</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">reply</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">data</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">presentation</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">click</span></div></td></tr><tr><td><div style="text-align: justify;"><span style="font-size: 
      14px;">10.</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;"><strong>Spam</strong></span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">free</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">reply</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">weekend</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">click</span></div></td></tr><tr><td><div style="text-align: justify;"><span style="font-size: 
      14px;">11.</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;"><strong>Spam</strong></span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">limited</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">click</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">free</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">hurry</span></div></td></tr><tr><td><div style="text-align: justify;"><span style="font-size: 
      14px;">12.</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">Ham</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">meeting</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">end of day</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">weekend</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">data</span></div></td></tr><tr><td><div style="text-align: justify;"><span style="font-size: 
      14px;">13.</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;"><strong>Spam</strong></span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">hurry</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">weekend</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">stock</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">offer</span></div></td></tr><tr><td><div style="text-align: justify;"><span style="font-size: 
      14px;">14.</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">Ham</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">report&nbsp;</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">presentation</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">file</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">end of day</span></div></td></tr><tr><td><div style="text-align: justify;"><span style="font-size: 
      14px;">15.</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">Ham</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">free</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">timelines</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">reply</span></div></td><td><div style="text-align: justify;"><span style="font-size: 
      14px;">offer</span></div></td></tr></tbody></table><p style="text-align: justify;"><span style="font-size: 
      14px;"><br></span></p><p dir="ltr" style="text-align: justify;"><span style="font-size: 
      14px;">Let’s assume a simplified scenario where spammers use only the following important words in their emails:</span></p><p dir="ltr" style="text-align: justify;"><span style="font-size: 
      14px;"><strong>Spam Keywords:</strong> buy, free, hurry, weekend, stock, offer, viagra, sex, limited, click<br><br></span></p><p dir="ltr" style="text-align: justify;"><span style="font-size: 
      14px;">Also, assume that you are building a model for an organisation where the only important words in genuine (ham) emails are as follows:</span></p><p dir="ltr" style="text-align: justify;"><span style="font-size: 
      14px;"><strong>Ham Keywords:</strong> reply, data, report, presentation, file, end of day, meeting, timelines, delay, documents</span></p><p dir="ltr" style="text-align: justify;"><span style="font-size: 
      14px;"><br></span></p><p dir="ltr" style="text-align: justify;"><span style="font-size: 
      14px;"><strong>Note:&nbsp;</strong>Wherever you come across the word independent/independence in this module, conditional independence is implied as discussed in the previous segment "<em>Conditional Independence in Naive Bayes".</em></span></p><p dir="ltr" style="text-align: justify;"><span style="font-size: 
      14px;"><br></span></p><p dir="ltr" style="text-align: justify;"><span style="font-size: 
      14px;"><strong>NOTE:&nbsp;</strong></span></p><ol dir="ltr"><li style="text-align: justify;"><span style="font-size: 
      14px;"><strong>Use the table above to answer the following questions.</strong></span></li><li style="text-align: justify;"><strong><span style="font-size: 14px;">To solve the questions, you need to be very careful with the order of the features. For example, if the given feature list is (free, data, weekend, click), then free is Freq 1, data is Freq 2 and so on. Hence P(Freq 1 = free | spam) will be 2/7 and P(Freq 4 = click | spam) will be 2/7.</span></strong></li></ol></div></div><div class="MuiBox-root css-0"></div></div></div>

#### Q1. What is the prior probability of a mail being spam, P(class = spam)?

- [x] 7/15
- [ ] 8/15
- [ ] 7/8
- [ ] None of these
#### Q2. What does Naive Bayes assume while classifying spam or ham mails?

- [ ] That frequency of keywords like hurry, free, offer etc. are dependent on each other
- [ ] That frequency of keywords like hurry, free, offer etc. are conditionally independent of each other
- [ ] That P(spam) and P(ham) are independent of each other
- [ ] That P(spam) and P(X | spam) are independent of each other
#### Q3. Consider an email with the vector of features X = (free, data, weekend, click). What is the likelihood, P(X | spam)?
- [ ] 2/ 50625
- [ ] 2/2401
- [ ] 4/ 50625
- [ ] 4/ 2401
Ans:

Number of spam emails
- Spam emails are rows 1, 4, 6, 7, 10, 11, 13 → **7 spam emails**

Because the order of the features matters, count each keyword only in its own position among the spam rows:

- **Freq 1 = free**: spam rows 1, 10 = **2**
- **Freq 2 = data**: spam row 6 = **1**
- **Freq 3 = weekend**: spam row 10 = **1**
- **Freq 4 = click**: spam rows 4, 10 = **2**

$$
P(X|\text{spam}) = P(\text{Freq1}=\text{free}|\text{spam}) \times P(\text{Freq2}=\text{data}|\text{spam}) \times P(\text{Freq3}=\text{weekend}|\text{spam}) \times P(\text{Freq4}=\text{click}|\text{spam})
$$
$$
P(\text{Freq1}=\text{free}|\text{spam}) = \frac{2}{7}
$$
$$
P(\text{Freq2}=\text{data}|\text{spam}) = \frac{1}{7}
$$
$$
P(\text{Freq3}=\text{weekend}|\text{spam}) = \frac{1}{7}
$$
$$
P(\text{Freq4}=\text{click}|\text{spam}) = \frac{2}{7}
$$
$$
P(X | \text{spam}) = \frac{2}{7} \times \frac{1}{7} \times \frac{1}{7} \times \frac{2}{7}
= \frac{2 \times 1 \times 1 \times 2}{7^4}
= \frac{4}{2401}
$$

So the correct option is **4/2401**.

#### Q4. Consider an email with the vector of features X = (free, data, weekend, click). What is the likelihood, P(X | ham)?

- [ ] 4/ 4096
- [ ] 2/ 4096
- [ ] 2/ 50625
- [ ] 4/50625

Ans:
Compute $P(X \mid \text{Ham})$ step by step using **Naive Bayes with position-specific features**, as required by the note on feature order.

---

**Dataset recap**

We have **15 emails** total:

* 7 are **Spam**
* 8 are **Ham**

Each email has 4 features: **Freq1, Freq2, Freq3, Freq4**.

We’re asked about:

$$
X = (\text{free}, \text{data}, \text{weekend}, \text{click})
$$

---

**Step 1:** Count occurrences in Ham (8 rows)

Let’s check each feature **by position**.

1. **P(Freq1 = free | Ham)**: among the Ham rows {2, 3, 5, 8, 9, 12, 14, 15}, only row 15 has Freq1 = free, so the probability is $1/8$.
2. **P(Freq2 = data | Ham)**: rows 2 and 9 have Freq2 = data (rows 8 and 12 have "end of day" instead), so the probability is $2/8 = 1/4$.
3. **P(Freq3 = weekend | Ham)**: only row 12 has Freq3 = weekend, so the probability is $1/8$.
4. **P(Freq4 = click | Ham)**: only row 9 has Freq4 = click, so the probability is $1/8$.

---

**Step 2:** Multiply probabilities

$$
P(X \mid Ham) = (1/8) \times (1/4) \times (1/8) \times (1/8)
$$

$$
= \frac{1}{2048}
$$

---

**Step 3:** Match to given options

$$
\frac{1}{2048} = \frac{2}{4096}
$$


---
#### Q5. What is the value of P(X|Class) · P(Class) for class = spam and X = (free, data, weekend, click)?

- [ ] (4/50625)(8/15)
- [ ] (4/2401)(7/15)
- [ ] (2/50625)(8/15)
- [ ] (4/50625)(7/15)

Ans:


We want:

$$
P(X \mid \text{Spam}) \cdot P(\text{Spam})
$$

where $X = (\text{free}, \text{data}, \text{weekend}, \text{click})$.

---

**Step 1:** Prior

$$
P(\text{Spam}) = \frac{\#Spam}{\#Total} = \frac{7}{15}
$$

---

**Step 2:** Likelihood $P(X \mid \text{Spam})$

Using position-specific counts over the 7 Spam emails:

* $P(\text{Freq1}=\text{free} \mid \text{Spam}) = 2/7$
* $P(\text{Freq2}=\text{data} \mid \text{Spam}) = 1/7$
* $P(\text{Freq3}=\text{weekend} \mid \text{Spam}) = 1/7$
* $P(\text{Freq4}=\text{click} \mid \text{Spam}) = 2/7$

Multiply:

$$
P(X \mid \text{Spam}) = \frac{2}{7} \cdot \frac{1}{7} \cdot \frac{1}{7} \cdot \frac{2}{7} = \frac{4}{2401}
$$

---

**Step 3:** Multiply with prior

$$
P(X \mid \text{Spam}) \cdot P(\text{Spam}) = \frac{4}{2401} \cdot \frac{7}{15}
$$

#### Q6. What is the unnormalized posterior for class = ham (i.e. without dividing by the denominator) for the feature vector X = (free, data, weekend, click)?

- [ ] (4/50625)(8/15)
- [ ] (4/4096)(7/15)
- [ ] (2/4096)(8/15)
- [ ] (2/50625)(8/15)

Ans:
We need the **unnormalized posterior**

$$
P(X \mid Ham) \cdot P(Ham)
$$

---

**Step 1:** Prior

$$
P(Ham) = \frac{\#Ham}{\#Total} = \frac{8}{15}
$$

---

**Step 2:** Likelihood $P(X \mid Ham)$ (position-specific, 8 Ham emails)

* $P(\text{Freq1} = free \mid Ham) = 1/8$ (row 15)
* $P(\text{Freq2} = data \mid Ham) = 2/8 = 1/4$ (rows 2, 9)
* $P(\text{Freq3} = weekend \mid Ham) = 1/8$ (row 12)
* $P(\text{Freq4} = click \mid Ham) = 1/8$ (row 9)

$$
P(X \mid Ham) = \frac{1}{8} \cdot \frac{1}{4} \cdot \frac{1}{8} \cdot \frac{1}{8} 
= \frac{1}{2048} 
= \frac{2}{4096}
$$

---

**Step 3:** Multiply with prior

$$
P(X \mid Ham) \cdot P(Ham) = \frac{2}{4096} \cdot \frac{8}{15}
$$

---

✅ Correct answer = **(2/4096)(8/15)**

---

#### Q7. Which class should the point X = (free, data, weekend, click) be classified into?
- [ ] Spam
- [ ] Ham
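
The calculations in Q1 and Q3 to Q6 can be reproduced with a short script, and comparing the two products answers Q7. This is a minimal sketch that assumes position-specific counting with no smoothing, as described in the note above; the names `EMAILS`, `prior` and `likelihood` are illustrative and not part of any course code.

```python
from fractions import Fraction

# Table 5 from the comprehension: (class, Freq 1, Freq 2, Freq 3, Freq 4)
EMAILS = [
    ("spam", "free", "buy", "limited", "hurry"),
    ("ham", "reply", "data", "report", "presentation"),
    ("ham", "report", "presentation", "file", "end of day"),
    ("spam", "limited", "file", "buy", "click"),
    ("ham", "meeting", "timelines", "limited", "documents"),
    ("spam", "hurry", "data", "buy", "stock"),
    ("spam", "limited", "sex", "click", "viagra"),
    ("ham", "presentation", "end of day", "data", "report"),
    ("ham", "reply", "data", "presentation", "click"),
    ("spam", "free", "reply", "weekend", "click"),
    ("spam", "limited", "click", "free", "hurry"),
    ("ham", "meeting", "end of day", "weekend", "data"),
    ("spam", "hurry", "weekend", "stock", "offer"),
    ("ham", "report", "presentation", "file", "end of day"),
    ("ham", "free", "timelines", "reply", "offer"),
]


def prior(label):
    """P(class): share of training emails carrying this label."""
    n_label = sum(1 for row in EMAILS if row[0] == label)
    return Fraction(n_label, len(EMAILS))


def likelihood(x, label):
    """P(X | class) with position-specific features and no smoothing."""
    rows = [row[1:] for row in EMAILS if row[0] == label]
    result = Fraction(1)
    for position, word in enumerate(x):
        # Count only rows where the word sits in the same Freq position
        matches = sum(1 for row in rows if row[position] == word)
        result *= Fraction(matches, len(rows))
    return result


x = ("free", "data", "weekend", "click")
scores = {label: likelihood(x, label) * prior(label) for label in ("spam", "ham")}
predicted = max(scores, key=scores.get)

for label in ("spam", "ham"):
    print(label, prior(label), likelihood(x, label), float(scores[label]))
print("predicted class:", predicted)
```

Under these assumptions the spam product (4/2401)(7/15) is larger than the ham product (2/4096)(8/15), so the point is assigned to spam. Note that `Fraction` prints 2/4096 in its reduced form, 1/2048.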

## Graded Question II:

You have worked on a small data set in the previous questions. Let’s say that you run your predictions on a larger data set with 1000 test points and get the following table of predictions (confusion matrix).



![](https://d35ev2v1xsdze0.cloudfront.net/49beb7ee-fd28-4efc-a940-6008fdc37d50-m1pgxixd.png)


The positive class is spam. To evaluate the model, you will need to check the accuracy, sensitivity (rate of true positives) and specificity (rate of true negatives).

#### Q1. What is the accuracy of the model?

- [ ] 940/1000
- [ ] 440/480
- [ ] 500/520
- [ ] 60/1000

Ans: 


From the table:

* **True Positives (TP)** = 440 (spam correctly predicted as spam)
* **True Negatives (TN)** = 500 (ham correctly predicted as ham)
* **False Positives (FP)** = 20 (ham predicted as spam)
* **False Negatives (FN)** = 40 (spam predicted as ham)

---

**Formula for Accuracy**

$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
$$

---

**Substitution**

$$
\text{Accuracy} = \frac{440 + 500}{440 + 500 + 20 + 40}
$$

$$
= \frac{940}{1000} = 0.94
$$

---

✅ **Accuracy = 94%**



#### Q2. What is the sensitivity of the model?

- [ ] 440 / 480
- [ ] 440 / 460
- [ ] 500 / 520
- [ ] 500 / 540

Ans: 

**Sensitivity** = **Recall** = **True Positive Rate (TPR)**

$$
\text{Sensitivity} = \frac{TP}{TP + FN}
$$

---

From the confusion matrix:

* $TP = 440$ (spam predicted as spam)
* $FN = 40$ (spam predicted as ham)

$$
\text{Sensitivity} = \frac{440}{440 + 40} = \frac{440}{480} = 0.9167
$$

---

✅ **Sensitivity = 91.67%**



#### Q3. What is the specificity of the model ?
- [ ] 440 / 480
- [ ] 440 / 460
- [ ] 500 / 520
- [ ] 500 / 540

Ans:

**Specificity** = **True Negative Rate (TNR)**

$$
\text{Specificity} = \frac{TN}{TN + FP}
$$

---

From the confusion matrix:

* $TN = 500$ (ham predicted as ham)
* $FP = 20$ (ham predicted as spam)

$$
\text{Specificity} = \frac{500}{500 + 20} = \frac{500}{520} = 0.9615
$$

---

✅ **Specificity = 96.15%**


#### Q4. Given that you do not want to misclassify any genuine emails, which metric should be as high as possible?

- [ ] Accuracy
- [ ] Sensitivity
- [ ] Specificity

Ans:

If the goal is **not to misclassify genuine emails (ham)**, that means:

* We want to **correctly identify ham when it is actually ham**.
* In the confusion matrix, that corresponds to **True Negatives (TN)**.
* The metric that measures this is **Specificity (True Negative Rate)**:

$$
\text{Specificity} = \frac{TN}{TN + FP}
$$

So the metric that should be as high as possible is:

✅ **Specificity**
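
The three metrics above can be checked in a few lines. This sketch simply plugs in the counts quoted in the answers (TP = 440, TN = 500, FP = 20, FN = 40, with spam as the positive class); the variable names are illustrative.

```python
# Counts read from the confusion matrix (positive class = spam)
tp, fn = 440, 40   # actual spam: predicted spam / predicted ham
tn, fp = 500, 20   # actual ham: predicted ham / predicted spam

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate

print(f"accuracy    = {accuracy:.4f}")
print(f"sensitivity = {sensitivity:.4f}")
print(f"specificity = {specificity:.4f}")
```

A higher specificity means fewer genuine emails end up in the spam folder, which is why it is the metric to maximise in Q4.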

# Naive Bayes for Text Classification

You will learn about Naive Bayes for text classification and understand how the classifier works in the background. You will also learn the Python implementation of Naive Bayes for a text classification problem.

**Naive Bayes** is commonly **used for text classification** in applications such as **predicting spam emails**, classifying text (e.g. news) into categories such as politics, sports, lifestyle etc. In general, Naive Bayes has proven to perform well in text classification applications.

## Document Classifier Pre Processing Steps

### Part 1
Build a Naive Bayes document classifier, i.e. a model for classifying text into categories. Let’s first understand the problem statement and some necessary data preprocessing steps before building the actual classifier.

You have now seen the data you will be working with and how to convert the given documents into a dictionary/vocabulary by removing ‘stop words’, which do not help in document classification.

![101.png](attachment:a85fd17f-6f40-4122-a811-c810e665e2c9.png)

### Part 2

You will now look at how to **use the vocabulary/dictionary** to **represent** the given **test documents** by counting the occurrences of various words. This is how documents are represented in Multinomial Naive Bayes.

![102.png](attachment:0bb390cb-9ef1-4f35-b055-a5e76d8b7d4d.png)

### Part 3

**Notations** and **document representation** in an array format.

![103.png](attachment:97d9bb3d-8138-4cd7-9285-c53e66268c6b.png)

### Worked out Example - Part 1

You will now learn how to classify new documents into classes ‘cinema’  and ‘education’. Using a worked out example, you will understand how the Naive Bayes classifier assigns class labels to documents.

![104.png](attachment:1e2443c6-9010-4cdb-aaf2-222a50099a0d.png)

### Worked out Example - Part 2

![105.png](attachment:eabfd86d-0fc2-49ea-ae19-031ff7cbe81e.png)

### Worked out Example - Part 3

Now, let's understand the actual process of classification using Multinomial Naive Bayes Classifier.

![106.png](attachment:a18209a8-6f06-4274-8f22-235410d8447c.png)


## MCQ
#### Q1. Suppose you have the following dictionary based on some training document narrating stories about love or action.

<table border="1"><colgroup><col><col><col><col><col><col><col><col></colgroup><tbody><tr><td><p dir="ltr">W1</p></td><td><p dir="ltr">W2</p></td><td><p dir="ltr">W3</p></td><td><p dir="ltr">W4</p></td><td><p dir="ltr">W5</p></td><td><p dir="ltr">W6</p></td><td><p dir="ltr">W7</p></td><td><p dir="ltr">W8</p></td></tr><tr><td><p dir="ltr">bike</p></td><td><p dir="ltr">couple</p></td><td><p dir="ltr">fast</p></td><td><p dir="ltr">furious</p></td><td><p dir="ltr">tears</p></td><td><p dir="ltr">love</p></td><td><p dir="ltr">shoot</p></td><td><p dir="ltr">songs</p></td></tr></tbody></table>

What will be the feature vector of the document “A fast moving bike entered into the complex and shoot the couple.”?

- [ ] 1,1,1,0,0,0,1,0
- [ ] 0,1,0,1,0,0,0,0,0,0,1,0,1
- [ ] 1,1,1,1,0,1,0,1
- [ ] The document contains words that are not present in the given dictionary, hence we can't make a feature vector for this document.
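
Building the feature vector for Q1 can be sketched as follows. The code assumes a simple lowercase, letters-only tokeniser and the count-based (Multinomial) representation described earlier; `to_feature_vector` is an illustrative helper, not a library function.

```python
import re

# Dictionary from the question, in the order W1..W8
VOCABULARY = ["bike", "couple", "fast", "furious", "tears", "love", "shoot", "songs"]


def to_feature_vector(document, vocabulary=VOCABULARY):
    """Count how often each dictionary word occurs; other words are ignored."""
    tokens = re.findall(r"[a-z]+", document.lower())
    return [tokens.count(word) for word in vocabulary]


doc = "A fast moving bike entered into the complex and shoot the couple."
print(to_feature_vector(doc))  # [1, 1, 1, 0, 0, 0, 1, 0]
```

Words such as "moving" and "complex" are not in the dictionary, so they are dropped and the vector keeps the dictionary's length of 8.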
      
#### Q2. Assume the following likelihoods i.e P(word|class) for each word being part of a positive or negative review of a hotel. 
<table border="1" cellspacing="1" cellpadding="1"><thead><tr><th scope="col">&nbsp;</th><th scope="col">Pos</th><th scope="col">Neg</th></tr></thead><tbody><tr><td>i</td><td>0.09</td><td>0.16</td></tr><tr><td>loved</td><td>0.30</td><td>0.06</td></tr><tr><td>the</td><td>0.06</td><td>0.05</td></tr><tr><td>food</td><td>0.04</td><td>0.35</td></tr><tr><td>and</td><td>0.08</td><td>0.07</td></tr><tr><td>cleanliness</td><td>0.40</td><td>0.03</td></tr></tbody></table>

What class will Naive Bayes assign to the sentence “I loved the food and cleanliness.” if the priors of the two classes are equal (which is equivalent to ignoring the prior)?

- [ ] Pos
- [ ] Neg
- [ ] Can't be determined

Ans: 
We apply the **Naive Bayes classifier** to decide whether the review sentence

> *“I loved the food and cleanliness.”*

is **Positive** (Pos) or **Negative** (Neg).

---

**Step 1:** Understand Naive Bayes

Naive Bayes says:

$$
P(\text{Class} \mid \text{Words}) \propto P(\text{Words} \mid \text{Class}) \cdot P(\text{Class})
$$

* If priors $P(\text{Pos})$ and $P(\text{Neg})$ are **equal**, we just compare:

  $$
  P(\text{Words} \mid \text{Pos}) \quad vs \quad P(\text{Words} \mid \text{Neg})
  $$
* “Naive” means we assume each word is **independent given the class**.
  So:

$$
P(\text{Words} \mid \text{Class}) = \prod_{\text{word} \in \text{sentence}} P(\text{word} \mid \text{Class})
$$

---

**Step 2:** Extract the probabilities from the table

We are given the conditional probabilities for each word:

| Word         | P(word \| Pos) | P(word \| Neg) |
|--------------|----------------|----------------|
| i            | 0.09           | 0.16           |
| loved        | 0.30           | 0.06           |
| the          | 0.06           | 0.05           |
| food         | 0.04           | 0.35           |
| and          | 0.08           | 0.07           |
| cleanliness  | 0.40           | 0.03           |

Sentence: **i loved the food and cleanliness**
So we multiply the probabilities of these words under each class.

---

**Step 3:** Compute likelihood for Pos

$$
P(\text{sentence} \mid Pos) = 0.09 \times 0.30 \times 0.06 \times 0.04 \times 0.08 \times 0.40
$$

Step by step:

* $0.09 \cdot 0.30 = 0.027$
* $0.027 \cdot 0.06 = 0.00162$
* $0.00162 \cdot 0.04 = 0.0000648$
* $0.0000648 \cdot 0.08 = 0.000005184$
* $0.000005184 \cdot 0.40 = 0.0000020736$

So:

$$
P(\text{sentence} \mid Pos) \approx 2.07 \times 10^{-6}
$$

---

**Step 4:** Compute likelihood for Neg

$$
P(\text{sentence} \mid Neg) = 0.16 \times 0.06 \times 0.05 \times 0.35 \times 0.07 \times 0.03
$$

Step by step:

* $0.16 \cdot 0.06 = 0.0096$
* $0.0096 \cdot 0.05 = 0.00048$
* $0.00048 \cdot 0.35 = 0.000168$
* $0.000168 \cdot 0.07 = 0.00001176$
* $0.00001176 \cdot 0.03 = 0.0000003528$

So:

$$
P(\text{sentence} \mid Neg) \approx 3.53 \times 10^{-7}
$$

---

**Step 5:** Compare

* Positive likelihood: $2.07 \times 10^{-6}$
* Negative likelihood: $3.53 \times 10^{-7}$

Clearly:

$$
P(\text{sentence} \mid Pos) > P(\text{sentence} \mid Neg)
$$

So Naive Bayes classifies the review as:

✅ **Positive (Pos)**

---

👉 Notice how:

* Words like *“loved”* (0.30 vs 0.06) and *“cleanliness”* (0.40 vs 0.03) **strongly boost the Pos class**.
* Even though *“food”* favors Neg (0.35 vs 0.04), the effect of *“loved”* and *“cleanliness”* dominates, making the final decision Positive.

---

#### Q3. Assume the following likelihoods i.e P(word|class) for each word being part of a positive or negative review of a hotel. 
<table border="1" cellspacing="1" cellpadding="1"><thead><tr><th scope="col">&nbsp;</th><th scope="col">Pos</th><th scope="col">Neg</th></tr></thead><tbody><tr><td>i</td><td>0.09</td><td>0.16</td></tr><tr><td>loved</td><td>0.30</td><td>0.06</td></tr><tr><td>the</td><td>0.06</td><td>0.05</td></tr><tr><td>food</td><td>0.04</td><td>0.35</td></tr><tr><td>and</td><td>0.08</td><td>0.07</td></tr><tr><td>cleanliness</td><td>0.40</td><td>0.03</td></tr></tbody></table>

What class will Naive Bayes assign to the sentence “I loved the food and cleanliness.” if the prior probabilities of the positive and negative classes are 0.1 and 0.9 respectively?

- [ ] Pos
- [ ] Neg
- [ ] Can't be determined

Ans:

We compare the unnormalized posteriors
$\;P(\text{words}\mid\text{Class})\cdot P(\text{Class})$.

From earlier:

* $P(\text{words}\mid Pos)=2.0736\times10^{-6}$
* $P(\text{words}\mid Neg)=3.528\times10^{-7}$

Multiply by priors $P(Pos)=0.1,\;P(Neg)=0.9$:

* Unnormalized Pos = $2.0736\times10^{-6}\times0.1 = 2.0736\times10^{-7}$
* Unnormalized Neg = $3.528\times10^{-7}\times0.9 = 3.1752\times10^{-7}$

Since $3.1752\times10^{-7} > 2.0736\times10^{-7}$, Naive Bayes assigns the sentence to:

✅ **Neg**

(For completeness, normalized posteriors:
$P(Pos\mid\text{words}) \approx 2.0736/5.2488 \approx 0.395$ and
$P(Neg\mid\text{words}) \approx 0.605$.)
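
The effect of the prior in Q2 and Q3 can be reproduced with the sketch below. It uses only the likelihood table given in the questions; the function name `classify` and the dictionary layout are illustrative choices.

```python
import math

# P(word | class) from the table in Q2 and Q3
LIKELIHOODS = {
    "i":           {"Pos": 0.09, "Neg": 0.16},
    "loved":       {"Pos": 0.30, "Neg": 0.06},
    "the":         {"Pos": 0.06, "Neg": 0.05},
    "food":        {"Pos": 0.04, "Neg": 0.35},
    "and":         {"Pos": 0.08, "Neg": 0.07},
    "cleanliness": {"Pos": 0.40, "Neg": 0.03},
}


def classify(words, priors):
    """Return the predicted class and the normalized posteriors."""
    # Unnormalized posterior: P(class) * product of P(word | class)
    scores = {
        cls: prior * math.prod(LIKELIHOODS[w][cls] for w in words)
        for cls, prior in priors.items()
    }
    total = sum(scores.values())
    posteriors = {cls: score / total for cls, score in scores.items()}
    return max(posteriors, key=posteriors.get), posteriors


words = "i loved the food and cleanliness".split()

label_equal, post_equal = classify(words, {"Pos": 0.5, "Neg": 0.5})
label_skewed, post_skewed = classify(words, {"Pos": 0.1, "Neg": 0.9})

print(label_equal, post_equal)
print(label_skewed, post_skewed)
```

With equal priors the likelihoods alone decide and the sentence is labelled Pos; with priors of 0.1 and 0.9 the same likelihoods are outweighed and the label flips to Neg.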



## Laplace Smoothing
### Part 1

![107.png](attachment:70f2a45b-3e44-45cc-808d-caf09c59d95d.png)

### Part 2

![108.png](attachment:8c0654b4-a839-4553-86c1-b2fa49e6929e.png)

You saw how Laplace smoothing helped solve the zero probability problem.



Please note that if a test sentence contains **words** that are **not part of the dictionary**, they will **not be considered** as part of the feature vector, since the feature vector only covers words that are in the dictionary. These new words are ignored completely.
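
The zero-probability problem and its fix can be illustrated with a small sketch. It assumes the usual add-one form of Laplace smoothing, P(word | class) = (count + 1) / (total words in class + dictionary size); the toy vocabulary and tokens are hypothetical and not taken from the course example.

```python
from collections import Counter


def smoothed_likelihoods(class_tokens, vocabulary, alpha=1):
    """P(word | class) with add-alpha smoothing (Laplace when alpha = 1)."""
    # Tokens outside the dictionary are ignored, as noted above
    counts = Counter(t for t in class_tokens if t in vocabulary)
    denominator = sum(counts.values()) + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / denominator for w in vocabulary}


# Hypothetical toy data for one class
vocabulary = ["great", "movie", "exam", "teacher"]
cinema_tokens = ["great", "movie", "movie", "great", "movie"]

unsmoothed = {w: cinema_tokens.count(w) / len(cinema_tokens) for w in vocabulary}
smoothed = smoothed_likelihoods(cinema_tokens, vocabulary)

print(unsmoothed["exam"])  # 0.0, which would zero out the whole product
print(smoothed["exam"])    # (0 + 1) / (5 + 4) = 1/9
```

After smoothing, every dictionary word has a non-zero probability and the probabilities for the class still sum to 1.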

# Bernoulli Naive Bayes

The Bernoulli Naive Bayes classifier differs in the way we build the bag of **words representation**, which in this case is **just 0 or 1**. Simply put, Bernoulli Naive Bayes is **concerned only** with whether the **word is present or not in a document**, whereas Multinomial Naive Bayes also counts the number of occurrences of each word.

![109.png](attachment:3dd26a2b-7f59-4e44-ab81-15ddf3393a26.png)
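
The difference between the two representations can be seen on a single document. The vocabulary and sentence below are hypothetical, and the helper names are illustrative; the tokeniser is a simple lowercase, letters-only split.

```python
import re


def count_vector(document, vocabulary):
    """Multinomial view: number of occurrences of each dictionary word."""
    tokens = re.findall(r"[a-z]+", document.lower())
    return [tokens.count(word) for word in vocabulary]


def binary_vector(document, vocabulary):
    """Bernoulli view: 1 if the dictionary word is present, else 0."""
    return [1 if count > 0 else 0 for count in count_vector(document, vocabulary)]


# Hypothetical vocabulary and document
vocabulary = ["free", "offer", "meeting", "report"]
doc = "Free offer! Free gifts with every offer, claim your free offer now."

print(count_vector(doc, vocabulary))   # [3, 3, 0, 0]
print(binary_vector(doc, vocabulary))  # [1, 1, 0, 0]
```

Both vectors have the same length as the dictionary; only the Multinomial one records that "free" and "offer" each appear three times.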



# Multinomial vs Bernoulli


The table below compares the **Multinomial** and **Bernoulli** models:

| Aspect                     | **Multinomial Naive Bayes**                                | **Bernoulli Naive Bayes**                              |
|-----------------------------|-----------------------------------------------------------|--------------------------------------------------------|
| **Feature Representation** | Uses **word frequencies** (counts of how many times a feature/word appears) | Uses **binary indicators** (1 = word present, 0 = absent) |
| **When Frequency Matters** | Yes, higher counts influence prediction                    | No, only presence/absence matters                     |
| **Best Suited For**        | Longer documents where word **counts matter** (news, articles, product reviews, topic modeling) | Short text, keyword-driven data where presence of certain signals matters (spam detection, sentiment where key words are strong indicators) |
| **Example Use Case**       | - Classifying a news article into *sports, politics, tech* based on word usage frequencies | - Classifying an email as *spam/not spam* based on whether trigger words like “free” or “lottery” appear |
| **Advantages**             | Captures richer information (frequency of words)           | Simpler and better for very short texts or sparse data |
| **Disadvantages**          | May overweight very frequent but generic words unless normalized | Ignores frequency, so loses information in longer texts |
| **Input Required**         | Count vector (or TF-IDF works well)                       | Binary occurrence vector                              |

🔑 **Rule of thumb**:  
- Use **Multinomial NB** when document length > a few words and frequencies matter (topic classification, reviews).  
- Use **Bernoulli NB** if text is short or binary signals are enough (tweet classification, spam filters).  
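
As a rough illustration of how the two models are used side by side, the sketch below fits scikit-learn's `MultinomialNB` on word counts and `BernoulliNB` on binary indicators. The four-message corpus and its labels are invented for this example, so the predictions say nothing about real data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Tiny hypothetical corpus, for illustration only
docs = [
    "free lottery win free prize",
    "win a free prize now",
    "meeting agenda for monday",
    "project meeting notes",
]
labels = ["spam", "spam", "ham", "ham"]

# Multinomial NB works on word counts
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(docs)
mnb = MultinomialNB().fit(X_counts, labels)

# Bernoulli NB works on presence/absence indicators
binary_vec = CountVectorizer(binary=True)
X_binary = binary_vec.fit_transform(docs)
bnb = BernoulliNB().fit(X_binary, labels)

new_doc = ["free prize meeting"]
mnb_pred = mnb.predict(count_vec.transform(new_doc))
bnb_pred = bnb.predict(binary_vec.transform(new_doc))
mnb_proba = mnb.predict_proba(count_vec.transform(new_doc))

print("Multinomial:", mnb_pred, mnb_proba)
print("Bernoulli:  ", bnb_pred)
```

Both estimators apply Laplace smoothing by default (`alpha=1.0`), so words missing from one class do not zero out the likelihood.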



# Python Lab - Education Or Cinema ?

In this segment, you will learn how to implement both Multinomial and Bernoulli Naive Bayes classifiers in Python.



Please find the dataset of test data [here](https://ml-course2-upgrad.s3.amazonaws.com/Naive+Bayes/Naive+Bayes+for+Text+Classification/example_test.csv), train data [here](https://ml-course2-upgrad.s3.amazonaws.com/Naive+Bayes/Naive+Bayes+for+Text+Classification/example_train.csv) and the code file [here](https://github.com/ContentUpgrad/Naive-Bayes/tree/main/Naive%20Bayes%20for%20text%20classification).

## Part 1

[notebooks_Naive_Bayes_Multinomial_Bernoulli_+Demo (1).ipynb](http://localhost:8888/lab/tree/MLC76/lxp%20content/3.Machine%20Learning%20I/12.Naive-Bayes-main/Naive%20Bayes%20for%20text%20classification/notebooks_Naive_Bayes_Multinomial_Bernoulli_%2BDemo%20(1).ipynb)


# Python Lab - SMS Spam Ham Classifier : Bernoulli


[notebooks_SMS+Classifier+_+Bernoulli+NB (1).ipynb](http://localhost:8888/lab/tree/MLC76/lxp%20content/3.Machine%20Learning%20I/12.Naive-Bayes-main/Naive%20Bayes%20for%20text%20classification/notebooks_SMS%2BClassifier%2B_%2BBernoulli%2BNB%20(1).ipynb)

# MCQ

**Comprehension - Naive Bayes for Text Classification**

Consider the set of documents tabulated below. There are 5 training documents. Answer the following questions based on this table. Documents are indexed starting from 0 to match Python's indexing style.

<table border="1"><colgroup><col><col><col></colgroup><tbody><tr><td><p dir="ltr" style="text-align: justify;">Doc.No.</p></td><td><p dir="ltr" style="text-align: justify;">Document</p></td><td><p dir="ltr" style="text-align: justify;">Class</p></td></tr><tr><td><p dir="ltr" style="text-align: justify;">0</p></td><td><p dir="ltr" style="text-align: justify;">Coffee Tea &nbsp;Soup Coffee Coffee</p></td><td><p dir="ltr" style="text-align: justify;">Hot</p></td></tr><tr><td><p dir="ltr" style="text-align: justify;">1</p></td><td><p dir="ltr" style="text-align: justify;">Coffee is hot and so is Soup &nbsp;and Tea</p></td><td><p dir="ltr" style="text-align: justify;">Hot</p></td></tr><tr><td><p dir="ltr" style="text-align: justify;">2</p></td><td><p dir="ltr" style="text-align: justify;">Espresso is a hot Coffee &nbsp;and not a Tea</p></td><td><p dir="ltr" style="text-align: justify;">Hot</p></td></tr><tr><td><p dir="ltr" style="text-align: justify;">3</p></td><td><p dir="ltr" style="text-align: justify;">Coffee is neither Tea nor Soup</p></td><td><p dir="ltr" style="text-align: justify;">Hot</p></td></tr><tr><td><p dir="ltr" style="text-align: justify;">4</p></td><td><p dir="ltr" style="text-align: justify;">Sprite Pepsi &nbsp;Cold Coffee and cold Tea</p></td><td><p dir="ltr" style="text-align: justify;">Cold</p></td></tr></tbody></table>

In [1]:
import pandas as pd
import sklearn

data = {
    "Doc.No.": [0, 1, 2, 3, 4],
    "Document": [
        "Coffee Tea  Soup Coffee Coffee",
        "Coffee is hot and so is Soup  and Tea",
        "Espresso is a hot Coffee  and not a Tea",
        "Coffee is neither Tea nor Soup",
        "Sprite Pepsi  Cold Coffee and cold Tea"
    ],
    "Class": [
        "Hot",
        "Hot",
        "Hot",
        "Hot",
        "Cold"
    ]
}

df = pd.DataFrame(data)

#### Q1. How many words will be there in the dictionary vector without stop words?

- [ ] 8
- [ ] 7
- [ ] 6
- [ ] 9

In [2]:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(stop_words='english')
vec.fit(df['Document'])
vocab_list = vec.vocabulary_
print("Vocabulary size:", len(vocab_list))

print("Vocabulary list:", vocab_list)

Vocabulary size: 8
Vocabulary list: {'coffee': 0, 'tea': 7, 'soup': 5, 'hot': 3, 'espresso': 2, 'sprite': 6, 'pepsi': 4, 'cold': 1}


#### Q2. What will the feature vector look like after transforming the following document?

“Coffee is neither Tea nor Soup”

The words in the dictionary are ordered as shown below:

![image.png](attachment:191930f4-0962-419e-8aaf-26ca229ca2c8.png)

In [3]:
# Transform the specific document
doc = ["Coffee is neither Tea nor Soup"]
feature_vector = vec.transform(doc)

# Convert the feature vector to an array and print it
print("Feature vector for document:", feature_vector.toarray())

Feature vector for document: [[1 0 0 0 0 1 0 1]]


#### Q3. What will the feature vector look like after transforming the following document?

“I hate cold Coffee but love Tea and hot Coffee”

The words in the dictionary are ordered as shown below:

![image.png](attachment:d2e01164-8c73-493a-8426-d2c45ea2d100.png)
 

In [4]:
doc = ['I hate cold Coffee but love Tea and hot Coffee']
feature_vector = vec.transform(doc)

# Convert the feature vector to an array and print it
print("Feature vector for document:", feature_vector.toarray())

Feature vector for document: [[2 1 0 1 0 0 0 1]]


#### Q4. We have been asked to classify a new document whose content is not yet disclosed. What will its class most likely be?

- [ ] Hot
- [ ] Cold
- [ ] Can't be predicted.


In [5]:
# Count documents per class
class_counts = df['Class'].value_counts()
total_docs = len(df)

# Calculate prior probabilities
priors = class_counts / total_docs
print("Priors:\n", priors)

Priors:
 Class
Hot     0.8
Cold    0.2
Name: count, dtype: float64


#### Q5. What is the probability of word “Coffee” appearing in a document which has been classified as "Hot" if we are planning to do a Multinomial Naive Bayes Classification?

<table border="1"><tbody><tr><td><p dir="ltr">Doc.No.</p></td><td><p dir="ltr">Document</p></td><td><p dir="ltr">Class</p></td></tr><tr><td><p dir="ltr">0</p></td><td><p dir="ltr">Coffee Tea &nbsp;Soup Coffee Coffee</p></td><td><p dir="ltr">Hot</p></td></tr><tr><td><p dir="ltr">1</p></td><td><p dir="ltr">Coffee is hot and so is Soup &nbsp;and Tea</p></td><td><p dir="ltr">Hot</p></td></tr><tr><td><p dir="ltr">2</p></td><td><p dir="ltr">Espresso is a Hot Coffee &nbsp;and not a Tea</p></td><td><p dir="ltr">Hot</p></td></tr><tr><td><p dir="ltr">3</p></td><td><p dir="ltr">Coffee is neither Tea nor Soup</p></td><td><p dir="ltr">Hot</p></td></tr><tr><td><p dir="ltr">4</p></td><td><p dir="ltr">Sprite Pepsi &nbsp;Cold Coffee and cold Tea</p></td><td><p dir="ltr">Cold</p></td></tr></tbody></table>


- [ ] 7/22
- [ ] 6/22
- [ ] 7/16
- [x] 6/16

Ans: 

The word Coffee appears 6 times across the documents of class Hot (d0: 3, d1: 1, d2: 1, d3: 1), and after removing stop words there are 16 words altogether in the Hot documents (d0: 5, d1: 4, d2: 4, d3: 3). Hence the probability of the word Coffee in class Hot is 6/16.
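
The counting can be checked with a short script. This is a sketch in plain Python: the `STOP_WORDS` set is a hand-picked assumption that covers only these five documents and stands in for scikit-learn's built-in English stop-word list.

```python
from fractions import Fraction

docs = [
    ("Coffee Tea  Soup Coffee Coffee", "Hot"),
    ("Coffee is hot and so is Soup  and Tea", "Hot"),
    ("Espresso is a hot Coffee  and not a Tea", "Hot"),
    ("Coffee is neither Tea nor Soup", "Hot"),
    ("Sprite Pepsi  Cold Coffee and cold Tea", "Cold"),
]

# Assumption: a minimal stop-word set sufficient for these documents
STOP_WORDS = {"is", "and", "so", "a", "not", "neither", "nor"}

def tokens(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

# Pool all remaining words from the Hot documents
hot_tokens = [w for text, label in docs if label == "Hot" for w in tokens(text)]

coffee_count = hot_tokens.count("coffee")
total_hot = len(hot_tokens)
p_coffee_given_hot = Fraction(coffee_count, total_hot)

print(coffee_count, total_hot, p_coffee_given_hot)   # 6 16 3/8
```

`Fraction` reduces 6/16 to its lowest terms, 3/8, which is the same value.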

#### Q6. What is Binarization of a feature vector?

- [ ] Converting each entry of word count in a feature vector to binary number
- [x] Converting all non-zero word counts of a feature vector to 1 and leaving zero counts as they are
- [ ] Making a feature vector of True and False depending upon the presence and absence of any word in the document under consideration.
- [ ] None of these


#### Q7. What is the binarized feature vector for the document “I hate cold Coffee but love Tea and hot Coffee”?

![image.png](attachment:a7755f0f-0465-4a3f-b0d3-57c6759bc81b.png)


Ans: 
The multinomial feature vector for the given document against our original dictionary is:

<table border="1"><colgroup><col><col><col><col><col><col><col><col></colgroup><tbody><tr><td><p dir="ltr">coffee</p></td><td><p dir="ltr">cold</p></td><td><p dir="ltr">espresso</p></td><td><p dir="ltr">hot</p></td><td><p dir="ltr">pepsi</p></td><td><p dir="ltr">soup</p></td><td><p dir="ltr">sprite</p></td><td><p dir="ltr">tea</p></td></tr><tr><td><p dir="ltr">2</p></td><td><p dir="ltr">1</p></td><td><p dir="ltr">0</p></td><td><p dir="ltr">1</p></td><td><p dir="ltr">0</p></td><td><p dir="ltr">0</p></td><td><p dir="ltr">0</p></td><td><p dir="ltr">1</p></td></tr></tbody></table>
To obtain the binarized feature vector used by Bernoulli Naive Bayes, we convert all non-zero entries to 1. The binarized feature vector represents only the presence or absence of a word in the document. The third option is correct.


#### Q8. What is the correct expression for the likelihood of document  “Coffee and Tea” for the “Hot” class if we are planning to do a Multinomial Naive Bayes Classification?

- [ ] P(Coffee | Hot) * P(and | Hot ) * P(Tea| Hot)
- [x] P(Coffee | Hot) *  P(Tea| Hot)
- [ ] P(Hot | Coffee) * P(Hot | and) * P(Hot| Tea)
- [ ] None of the above

#### Q9. What is the value of Likelihood of document  “Coffee and Tea” for the “Cold” class if we are planning to do a Multinomial Naive Bayes Classification?

- [ ] 1/22 * 1/22
- [ ] 1/6  * 0 * 1/6
- [x] 1/6  * 1/6
- [ ] None of the above

Ans:

The word Coffee appears once across the documents of class Cold (d4: 1), and after removing stop words there are 6 words altogether in the Cold documents (d4: 6). Hence the probability of the word Coffee in class Cold is 1/6. Similarly, P(Tea | Cold) = 1/6. Therefore the likelihood of the document is 1/6 * 1/6.
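
The multinomial likelihood of “Coffee and Tea” can be sketched as a product over dictionary words. The probabilities for Coffee come from the answers above; P(Tea | Hot) = 4/16 follows from the same counting, since Tea appears once in each of the four Hot documents. The `likelihood` helper is a hypothetical name used only for this illustration.

```python
from fractions import Fraction
from math import prod

# Unsmoothed word probabilities for the two words that matter here
p_word = {
    "Hot":  {"coffee": Fraction(6, 16), "tea": Fraction(4, 16)},
    "Cold": {"coffee": Fraction(1, 6),  "tea": Fraction(1, 6)},
}

def likelihood(doc, cls):
    """Multiply P(word | cls) over the words found in the table; skip the rest."""
    words = [w for w in doc.lower().split() if w in p_word[cls]]
    return prod(p_word[cls][w] for w in words)

hot_likelihood = likelihood("Coffee and Tea", "Hot")
cold_likelihood = likelihood("Coffee and Tea", "Cold")

print(hot_likelihood)    # 3/32
print(cold_likelihood)   # 1/36
```

The stop word “and” is skipped because it is not in the dictionary, which is why it does not appear in the correct expression for Q8.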

# MCQ

**Naive Bayes for Text Classification - Part 2**

Consider a naive Bayes model based on the above training documents with the classes Hot and Cold and the following conditional probability table of the words given a class:
<table border="1"><colgroup><col><col><col></colgroup><thead><tr><th scope="row"><p dir="ltr" style="text-align: justify;">Word</p></th><th scope="col"><p dir="ltr" style="text-align: justify;">P(word | Hot)</p></th><th scope="col"><p dir="ltr" style="text-align: justify;">P(word | Cold)</p></th></tr></thead><tbody><tr><th scope="row"><p dir="ltr" style="text-align: justify;">Coffee</p></th><td><p dir="ltr" style="text-align: justify;">6/16</p></td><td><p dir="ltr" style="text-align: justify;">1/6</p></td></tr><tr><th scope="row"><p dir="ltr" style="text-align: justify;">cold</p></th><td><p dir="ltr" style="text-align: justify;">0</p></td><td><p dir="ltr" style="text-align: justify;">2/6</p></td></tr><tr><th scope="row"><p dir="ltr" style="text-align: justify;">espresso</p></th><td><p dir="ltr" style="text-align: justify;">1/16</p></td><td><p dir="ltr" style="text-align: justify;">0</p></td></tr><tr><th scope="row"><p dir="ltr" style="text-align: justify;">hot</p></th><td><p dir="ltr" style="text-align: justify;">P</p></td><td><p dir="ltr" style="text-align: justify;">0</p></td></tr><tr><th scope="row"><p dir="ltr" style="text-align: justify;">pepsi</p></th><td><p dir="ltr" style="text-align: justify;">0</p></td><td><p dir="ltr" style="text-align: justify;">Q</p></td></tr><tr><th scope="row"><p dir="ltr" style="text-align: justify;">soup</p></th><td><p dir="ltr" style="text-align: justify;">3/16</p></td><td><p dir="ltr" style="text-align: justify;">0</p></td></tr><tr><th scope="row"><p dir="ltr" style="text-align: justify;">sprite</p></th><td><p dir="ltr" style="text-align: justify;">R</p></td><td><p dir="ltr" style="text-align: justify;">1/6</p></td></tr><tr><th scope="row"><p dir="ltr" style="text-align: justify;">tea</p></th><td><p dir="ltr" style="text-align: justify;">4/16</p></td><td><p dir="ltr" style="text-align: justify;">1/6</p></td></tr></tbody></table>

Based on the above table answer the following :


#### Q1. A few conditional probabilities have been left blank and marked as P, Q and R. What is a possible combination of P, Q and R?

Apart from the above table, also keep in mind the original table of training documents:

<table border="1"><tbody><tr><td><p dir="ltr">Doc.No.</p></td><td><p dir="ltr">Document</p></td><td><p dir="ltr">Class</p></td></tr><tr><td><p dir="ltr">0</p></td><td><p dir="ltr">Coffee Tea &nbsp;Soup Coffee Coffee</p></td><td><p dir="ltr">Hot</p></td></tr><tr><td><p dir="ltr">1</p></td><td><p dir="ltr">Coffee is hot and so is Soup &nbsp;and Tea</p></td><td><p dir="ltr">Hot</p></td></tr><tr><td><p dir="ltr">2</p></td><td><p dir="ltr">Espresso is a Hot Coffee &nbsp;and not a Tea</p></td><td><p dir="ltr">Hot</p></td></tr><tr><td><p dir="ltr">3</p></td><td><p dir="ltr">Coffee is neither Tea nor Soup</p></td><td><p dir="ltr">Hot</p></td></tr><tr><td><p dir="ltr">4</p></td><td><p dir="ltr">Sprite Pepsi &nbsp;Cold Coffee and cold Tea</p></td><td><p dir="ltr">Cold</p></td></tr></tbody></table>

- [ ] P= 1/16 , Q = 2/6 and R = 1/16

- [ ] P= 2/16 , Q= 2/6 and R =0

- [x] P= 2/16 , Q= 1/6  and R= 0

- [ ] It needs more information to calculate these values

#### Q2. What is the value of P(“ I love tea and coffee”|Hot)? Use the table given above to calculate.

- [x] 4/16 * 6/16

- [ ] 6/22 * 4/22

- [ ] 6/16* 0 * 4/16

- [ ] Can not be calculated

# MCQ

**Naive Bayes for Text Classification - Part 3**


When calculating the likelihood of a test document for a given class, some words may be part of the dictionary but never appear in the training documents of that class. For example, the word pepsi does not appear in any document of the Hot class. The probability of that word for that class then becomes zero (P(pepsi | Hot) = 0), which makes the whole likelihood term zero. This is called the zero-probability problem.


To counter this problem, 1 is added to the count of every dictionary word for that class. This increases the total word count for that class by the length of the dictionary (8 here). This technique is called Laplace smoothing. After applying Laplace smoothing, the table shown earlier looks as follows:

<table border="1"><colgroup><col><col><col><col><col></colgroup><thead><tr><th scope="row"><p dir="ltr">Word</p></th><th scope="col"><p dir="ltr">Word Count|Hot</p><p dir="ltr">Actual +1</p></th><th scope="col"><p dir="ltr">P(word | Hot)</p></th><th scope="col"><p dir="ltr">Word Count|Cold</p><p dir="ltr">Actual +1</p></th><th scope="col"><p dir="ltr">P(word|Cold)</p></th></tr></thead><tbody><tr><th scope="row"><p dir="ltr">Coffee</p></th><td><p dir="ltr">6 +1</p></td><td><p dir="ltr">(6+1)/(16+8)</p></td><td><p dir="ltr">1+1</p></td><td><p dir="ltr">(1+1)/(6+8)</p></td></tr><tr><th scope="row"><p dir="ltr">cold</p></th><td><p dir="ltr">0+1</p></td><td><p dir="ltr">(0+1)/(16+8)</p></td><td><p dir="ltr">2+1</p></td><td><p dir="ltr">(2+1)/(6+8)</p></td></tr><tr><th scope="row"><p dir="ltr">espresso</p></th><td><p dir="ltr">1+1</p></td><td><p dir="ltr">(1+1)/(16+8)</p></td><td><p dir="ltr">0+1</p></td><td><p dir="ltr">(0+1)/(6+8)</p></td></tr><tr><th scope="row"><p dir="ltr">hot</p></th><td><p dir="ltr">2+1</p></td><td><p dir="ltr">(2+1)/(16+8)</p></td><td><p dir="ltr">0+1</p></td><td><p dir="ltr">(0+1)/(6+8)</p></td></tr><tr><th scope="row"><p dir="ltr">pepsi</p></th><td><p dir="ltr">0+1</p></td><td><p dir="ltr">(0+1)/(16+8)</p></td><td><p dir="ltr">1+1</p></td><td><p dir="ltr">(1+1)/(6+8)</p></td></tr><tr><th scope="row"><p dir="ltr">soup</p></th><td><p dir="ltr">3+1</p></td><td><p dir="ltr">(3+1)/(16+8)</p></td><td><p dir="ltr">0+1</p></td><td><p dir="ltr">(0+1)/(6+8)</p></td></tr><tr><th scope="row"><p dir="ltr">sprite</p></th><td><p dir="ltr">0+1</p></td><td><p dir="ltr">(0+1)/(16+8)</p></td><td><p dir="ltr">1+1</p></td><td><p dir="ltr">(1+1)/(6+8)</p></td></tr><tr><th scope="row"><p dir="ltr">tea</p></th><td><p dir="ltr">4+1</p></td><td><p dir="ltr">(4+1)/(16+8)</p></td><td><p dir="ltr">1+1</p></td><td><p dir="ltr">(1+1)/(6+8)</p></td></tr></tbody></table>
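
The smoothed table can be reproduced programmatically. In the sketch below the raw counts are read off the table above; the `laplace` helper and its `alpha` parameter are illustrative names, with `alpha=1` corresponding to the add-one smoothing described here.

```python
from fractions import Fraction

vocab = ["coffee", "cold", "espresso", "hot", "pepsi", "soup", "sprite", "tea"]

# Raw word counts per class, read off the table above
counts = {
    "Hot":  {"coffee": 6, "cold": 0, "espresso": 1, "hot": 2,
             "pepsi": 0, "soup": 3, "sprite": 0, "tea": 4},
    "Cold": {"coffee": 1, "cold": 2, "espresso": 0, "hot": 0,
             "pepsi": 1, "soup": 0, "sprite": 1, "tea": 1},
}

def laplace(cls, alpha=1):
    """Return smoothed P(word | cls) = (count + alpha) / (total + alpha * |V|)."""
    total = sum(counts[cls].values())
    denom = total + alpha * len(vocab)
    return {w: Fraction(counts[cls][w] + alpha, denom) for w in vocab}

smoothed_hot = laplace("Hot")
smoothed_cold = laplace("Cold")

print(smoothed_hot["pepsi"])        # 1/24
print(smoothed_cold["cold"])        # 3/14
print(sum(smoothed_hot.values()))   # 1
```

No word has zero probability after smoothing, and the probabilities within each class still sum to 1.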

#### Q1. What is the value of P(“ I love cold coffee”|Hot)?
- [x] 1/24 * 7/24
- [ ] 1/16 * 6/16
- [ ] 0* 1/24 * 7/24

#### Q2. What is the most likely class for the document “cold tea”  based on the likelihood terms only (i.e. assume equal priors for both the classes)
- [ ] Hot
- [x] Cold
- [ ] Can't be determined
#### Q3. Compute the most likely class for the document “cold tea”  based on likelihood and prior of the classes. Assume a naive Bayes classifier and use Laplace smoothing for the likelihoods. Its class should be
- [x] Hot
- [ ] Cold
- [ ] Can't be determined
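
Ans:

With the Laplace-smoothed likelihoods and the priors P(Hot) = 4/5 and P(Cold) = 1/5:

- P(cold tea | Hot) * P(Hot) = (1/24 * 5/24) * 4/5 = 1/144 ≈ 0.0069
- P(cold tea | Cold) * P(Cold) = (3/14 * 2/14) * 1/5 = 3/490 ≈ 0.0061

Since P(cold tea | Hot) * P(Hot) > P(cold tea | Cold) * P(Cold), the class is Hot.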
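
Q2 and Q3 differ only in whether the prior is included, which the sketch below makes explicit. The smoothed probabilities are copied from the Laplace table, and the priors come from the training set (4 Hot documents, 1 Cold document); the variable names are illustrative.

```python
from fractions import Fraction

# Laplace-smoothed probabilities for the two words in "cold tea"
p_word = {
    "Hot":  {"cold": Fraction(1, 24), "tea": Fraction(5, 24)},
    "Cold": {"cold": Fraction(3, 14), "tea": Fraction(2, 14)},
}
# Priors from the training set: 4 Hot documents, 1 Cold document
prior = {"Hot": Fraction(4, 5), "Cold": Fraction(1, 5)}

likelihood = {c: p["cold"] * p["tea"] for c, p in p_word.items()}
score = {c: likelihood[c] * prior[c] for c in likelihood}

by_likelihood = max(likelihood, key=likelihood.get)
by_posterior = max(score, key=score.get)

print(likelihood["Hot"], likelihood["Cold"])   # 5/576 3/98
print(score["Hot"], score["Cold"])             # 1/144 3/490
print(by_likelihood)                           # Cold
print(by_posterior)                            # Hot
```

The strong prior for Hot is enough to overturn the likelihood-only decision.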

#### Q4. Bag A contains 3 Red and 4 Green balls, and bag B contains 4 Red and 6 Green balls. One bag is selected at random and a ball is drawn from it. If the ball drawn is Green, find the probability that the bag chosen was A.

- [ ] 4/7

- [x] 20/41

- [ ] 10/17

- [ ] 4/17

---
Ans: 

Using Bayes:

* $P(A)=P(B)=\tfrac12$
* $P(G|A)=\tfrac{4}{7}$, $P(G|B)=\tfrac{6}{10}=\tfrac{3}{5}$
* $P(G)=\tfrac12\cdot\tfrac{4}{7}+\tfrac12\cdot\tfrac{3}{5}=\tfrac{41}{70}$

So,

$$
P(A\mid G)=\frac{P(G\mid A)P(A)}{P(G)}
=\frac{\tfrac{4}{7}\cdot \tfrac12}{\tfrac{41}{70}}
=\frac{20}{41}.
$$
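
The same calculation can be written as a small reusable function. This is a sketch using exact fractions; `posterior` is a hypothetical helper name, and it works for any finite set of hypotheses with known priors and likelihoods.

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Apply Bayes' theorem over a finite set of hypotheses."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(joint.values())          # total probability of the observation
    return {h: joint[h] / evidence for h in joint}

priors = {"A": Fraction(1, 2), "B": Fraction(1, 2)}
p_green = {"A": Fraction(4, 7), "B": Fraction(6, 10)}

result = posterior(priors, p_green)
print(result["A"])   # 20/41
print(result["B"])   # 21/41
```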


#### Q5. Bag A contains 6 Green and 4 Blue balls, bag B contains 4 Green and 6 Blue balls, and bag C contains 5 Green and 5 Blue balls. A bag is selected at random and a ball is drawn from it. If the ball drawn is Green, find the probability that it was drawn from bag A.

- [ ] 6/10

- [ ] 6/30

- [x] 2/5

- [ ] 15/30
---
Ans:



Bayes’ Theorem with equal prior for each bag:

* $P(A)=P(B)=P(C)=\tfrac13$
* $P(G|A)=\tfrac{6}{10}=\tfrac35$, $P(G|B)=\tfrac{4}{10}=\tfrac25$, $P(G|C)=\tfrac{5}{10}=\tfrac12$

Total probability of drawing Green:

$$
P(G)=\tfrac13\!\left(\tfrac35+\tfrac25+\tfrac12\right)=\tfrac13\!\left(1+\tfrac12\right)=\tfrac12.
$$

Thus,

$$
P(A|G)=\frac{P(G|A)P(A)}{P(G)}
=\frac{\left(\tfrac35\right)\left(\tfrac13\right)}{\tfrac12}
=\frac{1/5}{1/2}=\frac{2}{5}.
$$


# MCQ

**Naive Bayes Practice Questions - 1**

<table border="1"><colgroup><col><col><col><col><col><col><col></colgroup><thead><tr><th scope="row"><p dir="ltr" style="text-align: justify;">Courses</p></th><th scope="col"><p dir="ltr" style="text-align: justify;">Data Science</p><p dir="ltr" style="text-align: justify;">(DS)</p></th><th scope="col"><p dir="ltr" style="text-align: justify;">Machine Learning</p><p dir="ltr" style="text-align: justify;">(ML)</p></th><th scope="col"><p dir="ltr" style="text-align: justify;">Deep</p><p dir="ltr" style="text-align: justify;">Learning</p><p dir="ltr" style="text-align: justify;">(DL)</p></th><th scope="col"><p dir="ltr" style="text-align: justify;">Big</p><p dir="ltr" style="text-align: justify;">Data</p><p dir="ltr" style="text-align: justify;">(BD)</p></th><th scope="col"><p dir="ltr" style="text-align: justify;">Artificial</p><p dir="ltr" style="text-align: justify;">Intelligence</p><p dir="ltr" style="text-align: justify;">(AI)</p></th><th scope="col"><p dir="ltr" style="text-align: justify;">Total</p></th></tr></thead><tbody><tr><th scope="row"><p dir="ltr" style="text-align: justify;">Male</p></th><td><p dir="ltr" style="text-align: justify;">80</p></td><td><p dir="ltr" style="text-align: justify;">60</p></td><td><p dir="ltr" style="text-align: justify;">40</p></td><td><p dir="ltr" style="text-align: justify;">50</p></td><td><p dir="ltr" style="text-align: justify;">30</p></td><td><p dir="ltr" style="text-align: justify;">260</p></td></tr><tr><th scope="row"><p dir="ltr" style="text-align: justify;">Female</p></th><td><p dir="ltr" style="text-align: justify;">70</p></td><td><p dir="ltr" style="text-align: justify;">40</p></td><td><p dir="ltr" style="text-align: justify;">50</p></td><td><p dir="ltr" style="text-align: justify;">70</p></td><td><p dir="ltr" style="text-align: justify;">10</p></td><td><p dir="ltr" style="text-align: justify;">240</p></td></tr><tr><th scope="row"><p dir="ltr" style="text-align: justify;">Total</p></th><td><p dir="ltr" 
style="text-align: justify;">150</p></td><td><p dir="ltr" style="text-align: justify;">100</p></td><td><p dir="ltr" style="text-align: justify;">90</p><p style="text-align: justify;"><br></p></td><td><p dir="ltr" style="text-align: justify;">120</p></td><td><p dir="ltr" style="text-align: justify;">40</p></td><td><p dir="ltr" style="text-align: justify;">500</p></td></tr></tbody></table>


The contingency table above shows the number of females and males who joined specific programs. For example, it shows that 70 females joined the Data Science program and that 40 males joined the Deep Learning program. Also, notice that the table shows the totals in the last row and column, respectively.



You have a total of 240 females and 260 males in the dataset. This is a total of 500. Finally, the table also shows the totals for specific programs. For example, 150 people joined the Data Science program, 100 joined Machine Learning, and so on. Note that each student is enrolled in just one course.

#### Q1. Given this contingency table, determine whether being Male and having joined the Big Data course are independent.

- [ ] Yes, they are independent
- [x] No, they are not
- [ ] May or May Not be
- [ ] Answering this will need more information.
---
Ans: 
To prove that two variables (say A and B) are independent, we must show that

P( A AND B) = P(A | B) * P(B) = P(A) * P(B)

First, from the Contingency Table:

P(A AND B) is P( Male AND BD Student) = 50/500

 

P(A) = P(Male) = 260/500

P(B) = P(BD Student) = 120/500

P(A | B) = P(Male GIVEN BD Student) = 50/120



OK- now we have everything we need to check for independence:

P( A AND B) = P(A | B) * P(B) = P(A) * P(B)

P(A | B) * P(B)= P(Male | BD Student) * P(BD Student)

                     = 50/120 * 120/500 = 50/500

P(A) * P(B) = P(Male) * P(BD Student) = 260/500 * 120/500 = 0.1248

As we can see, P(Male | BD Student) * P(BD Student) = 50/500 = 0.1 is not equal to P(Male) * P(BD Student) = 0.1248.

So no, these events are NOT independent.
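
The independence check reduces to comparing two numbers, as the sketch below shows. The counts are taken from the contingency table; the variable names are illustrative.

```python
from fractions import Fraction

# Counts from the contingency table above
total = 500
n_male = 260
n_bd = 120
n_male_and_bd = 50

p_joint = Fraction(n_male_and_bd, total)                      # P(Male and BD)
p_product = Fraction(n_male, total) * Fraction(n_bd, total)   # P(Male) * P(BD)

independent = (p_joint == p_product)

print(p_joint, p_product)   # 1/10 78/625
print(independent)          # False
```

Here 78/625 = 0.1248, which differs from 1/10, so the two events are not independent.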

#### Q2. Given this contingency table, answer the following.

The probability of a student being a Female student and a DL Student is greater than the probability of a Female student being a DL student.

- [ ] True

- [x] False

- [ ] They have equal probability

- [ ] Can't  be answered

---
Ans: 

Let us first calculate the probability of a student being both Female and a DL student, i.e. P(F and DL). From the table, 50 students are both Female and DL students out of a **total of 500** students, hence P(F and DL) = 50/500 = 1/10.

The second probability is the “probability of a Female student being a DL student”, i.e. P(DL | F): the probability of being a DL student given that the student is Female. There are **240 Females**, of whom 50 are DL students. Therefore P(DL | F) = 50/240 = 5/24.

Since 1/10 < 5/24, the statement is false.

#### Q3. Suppose A and B are two independent events, with P(not A) = 0.2 and P(B) = 0.3. Consider the following unfilled contingency table and find the value of P(A and B).

<table border="1" cellspacing="1" cellpadding="1" style="width: 500px;"><tbody><tr><td style="text-align: center;">&nbsp;</td><td style="text-align: center;"><strong>A</strong></td><td style="text-align: center;"><strong>Not A</strong></td><td style="text-align: center;">&nbsp;</td></tr><tr><td style="text-align: center;"><strong>B</strong></td><td style="text-align: center;">&nbsp;</td><td style="text-align: center;">&nbsp;</td><td style="text-align: center;">&nbsp;</td></tr><tr><td style="text-align: center;"><strong>Not B</strong></td><td style="text-align: center;">&nbsp;</td><td style="text-align: center;">&nbsp;</td><td style="text-align: center;">&nbsp;</td></tr><tr><td style="text-align: center;">Total</td><td style="text-align: center;">&nbsp;</td><td style="text-align: center;">&nbsp;</td><td style="text-align: center;"><strong>50</strong></td></tr></tbody></table>

- [ ] 12/50

- [ ] 3/50

- [ ] 7/50

- [ ] Can't be determined
---
Ans

We’re given:

* $P(\text{not A})=0.2 \implies P(A)=0.8$
* $P(B)=0.3$
* $A$ and $B$ are **independent events**

So,

$$
P(A \cap B) = P(A) \cdot P(B) = 0.8 \times 0.3 = 0.24
$$

Since the total is **50**,

$$
0.24 \times 50 = 12
$$

Thus,

$$
P(A \cap B) = \frac{12}{50}
$$

✅ Correct Answer: **12/50**


#### Q4. Suppose A and B are two independent events, with P(not A) = 0.2 and P(B) = 0.3. Consider the following unfilled contingency table and find the value of P(A and not B).

<table border="1" cellspacing="1" cellpadding="1" style="width: 500px;"><tbody><tr><td style="text-align: center;">&nbsp;</td><td style="text-align: center;"><strong>A</strong></td><td style="text-align: center;"><strong>Not A</strong></td><td style="text-align: center;">&nbsp;</td></tr><tr><td style="text-align: center;"><strong>B</strong></td><td style="text-align: center;">&nbsp;</td><td style="text-align: center;">&nbsp;</td><td style="text-align: center;">&nbsp;</td></tr><tr><td style="text-align: center;"><strong>Not B</strong></td><td style="text-align: center;">&nbsp;</td><td style="text-align: center;">&nbsp;</td><td style="text-align: center;">&nbsp;</td></tr><tr><td style="text-align: center;">Total</td><td style="text-align: center;">&nbsp;</td><td style="text-align: center;">&nbsp;</td><td style="text-align: center;"><strong>50</strong></td></tr></tbody></table>

- [ ] 12/50

- [ ] 15/50

- [x] 28/50

- [ ] Insufficient Data

#### Q4. Suppose A and B are two independent events, with P(not A) = 0.2 and P(B) = 0.3. Consider the following unfilled contingency table and answer the question.

<table border="1" cellspacing="1" cellpadding="1" style="width: 500px;"><tbody><tr><td style="text-align: center;">&nbsp;</td><td style="text-align: center;"><strong>A</strong></td><td style="text-align: center;"><strong>Not A</strong></td><td style="text-align: center;">&nbsp;</td></tr><tr><td style="text-align: center;"><strong>B</strong></td><td style="text-align: center;">&nbsp;</td><td style="text-align: center;">&nbsp;</td><td style="text-align: center;">&nbsp;</td></tr><tr><td style="text-align: center;"><strong>Not B</strong></td><td style="text-align: center;">&nbsp;</td><td style="text-align: center;">&nbsp;</td><td style="text-align: center;">&nbsp;</td></tr><tr><td style="text-align: center;">Total</td><td style="text-align: center;">&nbsp;</td><td style="text-align: center;">&nbsp;</td><td style="text-align: center;"><strong>50</strong></td></tr></tbody></table>

What is the probability of B happening given that A has not happened? In other words, what is the value of P(B|not A)?

- [ ] 12/50

- [x] 3/10

- [ ] 10/15

- [ ] Insufficient Data


---

Ans

Given that P(not A) = 0.2, let the number of “not A” outcomes be x. Then x/50 = 0.2, so n(not A) = x = 10 and n(A) = 50 − 10 = 40.

Since P(B)=0.3 we have n(B)=0.3×50=15 and n(not B)=50−15=35. From this we can partially complete the table:

<table border="1" cellspacing="1" cellpadding="1" style="width: 500px;"><tbody><tr><td style="text-align: center;">&nbsp;</td><td style="text-align: center;"><strong>A</strong></td><td style="text-align: center;"><strong>Not A</strong></td><td style="text-align: center;">&nbsp;</td></tr><tr><td style="text-align: center;"><strong>B</strong></td><td style="text-align: center;">&nbsp;</td><td style="text-align: center;">&nbsp;</td><td style="text-align: center;">15</td></tr><tr><td style="text-align: center;"><strong>Not B</strong></td><td style="text-align: center;">&nbsp;</td><td style="text-align: center;">&nbsp;</td><td style="text-align: center;">35</td></tr><tr><td style="text-align: center;">&nbsp;</td><td style="text-align: center;">40</td><td style="text-align: center;">10</td><td style="text-align: center;"><strong>50</strong></td></tr></tbody></table>

Next, we use the fact that A and B are independent. From the definition of independence

P((not A) and B)=P(not A)×P(B)=0.2×0.3=0.06 

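Therefore n((not A) and B) = 0.06 × 50 = 3, and P(B | not A) = n((not A) and B) / n(not A) = 3/10. This equals P(B), as expected for independent events.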

We find the rest of the values in the table by making sure that each row and column sums to its total.

<table border="1" cellspacing="1" cellpadding="1" style="width: 500px;"><tbody><tr><td style="text-align: center;">&nbsp;</td><td style="text-align: center;"><strong>A</strong></td><td style="text-align: center;"><strong>Not A</strong></td><td style="text-align: center;">&nbsp;</td></tr><tr><td style="text-align: center;"><strong>B</strong></td><td style="text-align: center;">12</td><td style="text-align: center;">3</td><td style="text-align: center;">15</td></tr><tr><td style="text-align: center;"><strong>Not B</strong></td><td style="text-align: center;">28</td><td style="text-align: center;">7</td><td style="text-align: center;">35</td></tr><tr><td style="text-align: center;">&nbsp;</td><td style="text-align: center;">40</td><td style="text-align: center;">10</td><td style="text-align: center;"><strong>50</strong></td></tr></tbody></table>
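
The completed table can also be generated directly from the two marginals. The sketch below assumes independence, so each cell is the product of its row and column probabilities times the total; all names are illustrative.

```python
from fractions import Fraction

total = 50
p_not_a = Fraction(1, 5)    # P(not A) = 0.2
p_b = Fraction(3, 10)       # P(B) = 0.3

col_probs = {"A": 1 - p_not_a, "Not A": p_not_a}
row_probs = {"B": p_b, "Not B": 1 - p_b}

# Under independence, each joint probability is the product of the marginals
table = {
    (row, col): int(row_probs[row] * col_probs[col] * total)
    for row in row_probs
    for col in col_probs
}
for cell, n in table.items():
    print(cell, n)

# Conditional probability read off the "Not A" column
n_not_a = table[("B", "Not A")] + table[("Not B", "Not A")]
p_b_given_not_a = Fraction(table[("B", "Not A")], n_not_a)
print(p_b_given_not_a)   # 3/10
```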


#### Q5. Consider the following equation in a Naive Bayes classification problem.

$$
P(x\mid c) = P(x_1\mid c)\cdot P(x_2\mid c)\cdots P(x_d\mid c) = \prod_{i=1}^{d} P(x_i\mid c)
$$


Here x is a feature vector whose attributes are x1, x2, ..., xd, and c is a specific class. Which of the following is/are true w.r.t. the above information?


1. The above equation is true only if x1, x2, ..., xd are conditionally independent given the class.
2. P(x∣c) simply means: “How likely is it to observe this particular pattern x, given that it belongs to class c?”
3. In the context of a classification problem, P(x|c) is also termed the likelihood.
4. P(x|c) is also termed the posterior probability.

- [ ] Option 1 and 4
- [ ] Option 2 and 3
- [ ] Option 1 and 2
- [x] Option 1,2 and 3
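
The factorisation in this question, where P(x|c) is the product of the per-feature terms P(x_i|c), can be sketched numerically. The three likelihood values below are hypothetical, and the log-space version is a common practical variant rather than something the question requires.

```python
import math

# Hypothetical per-feature likelihoods P(x_i | c) for one class c
feature_likelihoods = [0.5, 0.2, 0.1]

# Direct product, valid under the conditional-independence assumption
p_x_given_c = math.prod(feature_likelihoods)

# Equivalent log-space form, which avoids underflow when d is large
log_p = sum(math.log(p) for p in feature_likelihoods)

print(p_x_given_c)
print(math.isclose(math.exp(log_p), p_x_given_c))   # True
```

Summing logs gives the same ranking of classes as multiplying probabilities, because the logarithm is monotonic.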

#### Q6. Bayes theorem is defined as  

![](https://latex.codecogs.com/gif.latex?%5Cdpi%7B150%7D%20P%28C_%7Bi%7D/X%29%20%3D%20%5Cfrac%7BP%28X/C_%7Bi%7D%29%20P%28C_%7Bi%7D%29%7D%7BP%28X%29%7D)

The likelihood , in the context of a classification problem, can be interpreted as

- [ ] What is the probability that a particular object belongs to class C given its observed feature values

- [ ] What is the probability of a class C in the sample being considered

- [x] What is the probability of observing a given feature vector knowing that it belongs to a class C

- [ ] None of the above

#### Q6. Bayes theorem is defined as  

![](https://latex.codecogs.com/gif.latex?%5Cdpi%7B150%7D%20P%28C_%7Bi%7D/X%29%20%3D%20%5Cfrac%7BP%28X/C_%7Bi%7D%29%20P%28C_%7Bi%7D%29%7D%7BP%28X%29%7D)

The prior probability, in the context of a classification problem, can be interpreted as

- [ ] What is the probability that a particular object belongs to class C given its observed feature values

- [x] What is the probability of a class C in the sample being considered

- [ ] What is the probability of observing a given feature vector knowing that it belongs to a class C

- [ ] None of the above

#### Q7. Bayes theorem is defined as  ![](https://latex.codecogs.com/gif.latex?%5Cdpi%7B150%7D%20P%28C_%7Bi%7D/X%29%20%3D%20%5Cfrac%7BP%28X/C_%7Bi%7D%29%20P%28C_%7Bi%7D%29%7D%7BP%28X%29%7D)

The posterior probability, in the context of a classification problem, can be interpreted as

- [x] What is the probability that a particular object belongs to class C given its observed feature values

- [ ] What is the probability of a class C in the sample being considered

- [ ] What is the probability of observing a given feature vector knowing that it belongs to a class C

- [ ] None of the above
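
To connect the three terms (prior, likelihood, posterior) to numbers, here is a small sketch that reuses the counts from the cricket example earlier in this module, taking the class C to be "India wins" and the observation X to be "Sachin scores a century". The variable names are illustrative choices, not part of the original material.

```python
from fractions import Fraction

# Counts from the cricket example earlier in this module.
total_matches = 100
wins = 60
centuries = 12
centuries_in_wins = 10

prior = Fraction(wins, total_matches)           # P(C): India wins
likelihood = Fraction(centuries_in_wins, wins)  # P(X|C): century, given a win
evidence = Fraction(centuries, total_matches)   # P(X): century

# Bayes' theorem: P(C|X) = P(X|C) * P(C) / P(X)
posterior = likelihood * prior / evidence
print("P(win | century) =", posterior)
```

The posterior comes out to 5/6, which matches reading the answer directly off the contingency table (10 wins out of the 12 matches with a century).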

#### Q8. Which of the following is/are true w.r.t. the assumptions made in Naive Bayes classification?

1. All the rows of a dataset are i.i.d. (independent and identically distributed), i.e., all data points are independent of each other and are drawn from the same distribution.

2. All features are conditionally independent given the class.

3. Even if the samples are not independent and identically distributed (i.i.d.), they can be classified using Naive Bayes classification.

4. Conditional independence of the features of a feature vector is not required.

- [ ] Option 1 and 4

- [ ] Option 2 and 3

- [x] Option 1 and 2

- [ ] Option 1,3 and 4


# Graded Questions

In this segment, you will use the [IMDB](http://www.imdb.com/) movie reviews dataset to classify reviews as 'positive' or 'negative'. We have divided the data into training and test sets. The training set contains 800 positive and 800 negative movie reviews, whereas the test set contains 200 positive and 200 negative movie reviews.



This was one of the first widely available sentiment analysis datasets, compiled by Pang and Lee. The data was first collected in 2002; however, the text is similar to the movie reviews you find on IMDB today. The dataset is in CSV format. It has two categories: Pos (reviews that express a positive or favourable sentiment) and Neg (reviews that express a negative or unfavourable sentiment). For this exercise, we will assume that all reviews are either positive or negative; there are no neutral reviews.





You will need to build a Multinomial Naive Bayes classification model in Python for solving the questions.

Please find the imdb_train dataset [here](https://ml-course2-upgrad.s3.amazonaws.com/Naive+Bayes/Naive+Bayes+for+Text+Classification/movie_review_train.csv) and imdb_test dataset [here](https://ml-course2-upgrad.s3.amazonaws.com/Naive+Bayes/Naive+Bayes+for+Text+Classification/movie_review_test.csv).

**Note:** Tag negative (Neg) as 0 and positive (Pos) as 1.

Please answer the following questions based on the model you build using the above datasets:

#### Q1. What is the size of vocabulary after removing the stop words? Note that the vocabulary size depends only on the training set.

- [ ] 39467

- [ ] 35858

- [ ] 53732

- [ ] 21739

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Load the training and test sets.
train_data = pd.read_csv('13. Graded Question/movie_review_train.csv')
test_data = pd.read_csv('13. Graded Question/movie_review_test.csv')

# Build the vocabulary from the training text only, dropping English stop words.
vect = CountVectorizer(stop_words='english')
vocab_list = vect.fit(train_data['text'])  # fit() returns the fitted vectorizer itself
print(" size of vocabulary",len(vocab_list.vocabulary_))
```

```
 size of vocabulary 35858
```


#### Q2. Suppose we don't want to consider rare words that appear in fewer than 3% of the documents, or extremely common words that appear in more than 80% of the documents.

Use CountVectorizer(stop_words='english', min_df=.03, max_df=.8) to create a new vocabulary from the training set. What is the size of the new vocabulary?

- [ ] 1666

- [ ] 1563

- [ ] 1643

- [ ] 1356

```python
# Keep only terms that appear in at least 3% and at most 80% of the training documents.
vect = CountVectorizer(stop_words='english', min_df=0.03, max_df=0.8)
vocab_list = vect.fit(train_data['text'])
print(" size of vocabulary",len(vocab_list.vocabulary_))
```

```
 size of vocabulary 1643
```
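
To see what the `min_df` and `max_df` thresholds do, here is a toy sketch on a made-up four-document corpus (the documents and the thresholds 0.5 and 0.8 are illustrative assumptions, chosen so the effect is easy to check by hand). With float values, `min_df` drops terms whose document frequency is below that fraction of the documents, and `max_df` drops terms whose document frequency is above it.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini corpus. Document frequencies:
# great = 4/4, movie = 3/4, plot = 2/4, cast = 1/4, poor = 1/4
docs = [
    "great movie great cast",
    "great movie poor plot",
    "great plot",
    "great movie",
]

# No filtering: every token of two or more characters enters the vocabulary.
full = CountVectorizer().fit(docs)

# Drop terms found in fewer than 50% of the documents (cast, poor)
# and terms found in more than 80% of the documents (great).
filtered = CountVectorizer(min_df=0.5, max_df=0.8).fit(docs)

print("Full vocabulary:    ", sorted(full.vocabulary_))
print("Filtered vocabulary:", sorted(filtered.vocabulary_))
```

Only `movie` and `plot` survive the filter, which is the same mechanism that shrinks the review vocabulary in Q2.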


#### Q3. Suppose we build the vocabulary from the training data using CountVectorizer(stop_words='english', min_df=.03, max_df=.8) and then transform the test data using the transform() method of the same fitted vectorizer. How many nonzero entries are there in the sparse matrix (corresponding to the test data)?

Note: Test data is provided in a separate CSV file.

- [ ] 51663

- [ ] 21786

- [ ] 56983

```python
# Learn the vocabulary on the training text and build its document-term matrix.
vocab_list_train = vect.fit_transform(train_data['text'])
# Reuse the fitted vocabulary for the test text; words unseen in training are ignored.
vocab_list_test = vect.transform(test_data['text'])

# Count the nonzero entries of the sparse test matrix.
nonzeros = vocab_list_test.nnz
print("Number of nonzero entries:", nonzeros)
print("Feature names:", vect.get_feature_names_out())
print("Test vector:", vocab_list_test.toarray())
```

```
Number of nonzero entries: 51663
Feature names: ['000' '10' '100' ... 'york' 'young' 'younger']
Test vector: [[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 1 0 ... 0 0 0]
 [0 1 0 ... 0 1 0]
 [0 0 0 ... 0 2 0]]
```
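
The fit-on-train, transform-on-test pattern can be checked by hand on a toy example. The documents below are made up for illustration, and `toy_vect` is a hypothetical name chosen so it does not clash with the notebook's own `vect`. The point of the sketch is that `transform()` only counts words already in the fitted vocabulary, and that `nnz` counts the stored nonzero cells of the sparse result.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training and test documents.
train_docs = ["good movie", "bad movie", "good plot"]
test_docs = ["good good movie", "terrible acting", "bad plot twist"]

toy_vect = CountVectorizer()
toy_vect.fit(train_docs)  # learned vocabulary: bad, good, movie, plot

# Words never seen during fit ("terrible", "acting", "twist") are ignored.
X_test = toy_vect.transform(test_docs)

print("Shape:", X_test.shape)
print("Nonzero entries:", X_test.nnz)
print(X_test.toarray())
```

The second test document contains no known word, so its row is all zeros and contributes nothing to the nonzero count.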


#### Q4. Train a Bernoulli Naive Bayes model on the training set and predict the classes of the test set. Each movie review in the test set has been labelled as 'Pos' or 'Neg'. What is the accuracy of the model?

Note - Dictionary should be prepared using CountVectorizer(stop_words='english', min_df=.03, max_df=.8)

- [ ] 0.65

- [ ] 0.79

- [ ] 0.77

- [ ] 0.67

```python
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score

# Instantiate the Bernoulli Naive Bayes classifier.
bnb = BernoulliNB()

# Train on the document-term matrix of the training set.
bnb.fit(vocab_list_train, train_data['class'])

# Predict the test classes and compare them with the true labels.
y_pred = bnb.predict(vocab_list_test)
accuracy = accuracy_score(test_data['class'], y_pred)

print("Accuracy:", accuracy)
```

```
Accuracy: 0.79
```
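
The solution above passes raw word counts to `BernoulliNB`, even though a Bernoulli model works with presence/absence features. This works because `BernoulliNB` has a `binarize` parameter (default `0.0`) that maps every value greater than the threshold to 1 before fitting and predicting. The sketch below, on a made-up count matrix with made-up labels, checks that fitting on counts and fitting on explicitly binarized features give the same probabilities.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical word-count matrix: 4 reviews x 3 vocabulary terms.
X_counts = np.array([[2, 0, 1],
                     [0, 3, 0],
                     [1, 0, 0],
                     [0, 1, 4]])
y = np.array([1, 0, 1, 0])  # made-up labels: 1 = positive, 0 = negative

# Model fitted on raw counts (binarized internally with binarize=0.0).
model_counts = BernoulliNB().fit(X_counts, y)

# Model fitted on explicit presence/absence features.
X_binary = (X_counts > 0).astype(int)
model_binary = BernoulliNB().fit(X_binary, y)

X_new = np.array([[3, 0, 0],
                  [0, 2, 2]])
same = np.allclose(
    model_counts.predict_proba(X_new),
    model_binary.predict_proba((X_new > 0).astype(int)),
)
print("Identical probabilities:", same)
```

In other words, for the Bernoulli model only whether a word occurs in a review matters, not how often it occurs.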


#### Q5. The confusion matrix tabulates the counts of True Negatives (TN), False Positives (FP), False Negatives (FN) and True Positives (TP) as follows:

 	
<table border="1"><colgroup><col><col><col></colgroup><thead><tr><th scope="row">&nbsp;</th><th scope="col"><p dir="ltr">Predicted Negative</p><p dir="ltr">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;⇓</p></th><th scope="col"><p dir="ltr">Predicted Positive</p><p dir="ltr">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;⇓</p></th></tr></thead><tbody><tr><th scope="row"><p dir="ltr">Actual Negative ⇒</p></th><td><p dir="ltr">TN</p></td><td><p dir="ltr">FP</p></td></tr><tr><th scope="row"><p dir="ltr">Actual Positive &nbsp;&nbsp;⇒</p></th><td><p dir="ltr">FN</p></td><td><p dir="ltr">TP</p></td></tr></tbody></table>

Run metrics.confusion_matrix(actual class of test data, predicted class of test data). How many reviews are actually negative but have been classified as positive by the model?

**Note :**

1. The dictionary should be prepared using CountVectorizer(stop_words='english', min_df=.03, max_df=.8).

2. Remember that we have tagged negative as 0 and positive as 1. If needed, look up the documentation of confusion_matrix to understand which cells correspond to positives/negatives.

3. The confusion_matrix documentation states that C[i, j] is the number of observations known to be in class i and predicted to be in class j. Here, C[0, 1] is therefore the number of reviews that are actually 0 (negative) but predicted 1 (positive).

- [ ] 177
- [ ] 61
- [ ] 139
- [ ] 23

```python
from sklearn.metrics import confusion_matrix

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(test_data['class'], y_pred)

# cm[0,0] = TN (actually 0, predicted 0)
# cm[0,1] = FP (actually 0, predicted 1)
# cm[1,0] = FN (actually 1, predicted 0)
# cm[1,1] = TP (actually 1, predicted 1)

# Reviews that are actually negative but classified as positive are the false positives.
print("Confusion Matrix:\n", cm)
false_positives = cm[0,1]
print("Actually Negative but predicted Positive (FP):", false_positives)
```

```
Confusion Matrix:
 [[177  23]
 [ 61 139]]
Actually Negative but predicted Positive (FP): 23
```
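
As a cross-check, the accuracy from Q4 can be recomputed from the four cells of this confusion matrix. The sketch below hard-codes the matrix printed above; the names `cm_values`, `sensitivity` and `specificity` are illustrative additions and not part of the original exercise.

```python
# Confusion matrix reported above: rows = actual class, columns = predicted class.
cm_values = [[177, 23],
             [61, 139]]

(tn, fp), (fn, tp) = cm_values
total = tn + fp + fn + tp

accuracy = (tn + tp) / total   # fraction of all reviews classified correctly
sensitivity = tp / (tp + fn)   # fraction of positive reviews found
specificity = tn / (tn + fp)   # fraction of negative reviews found

print("Accuracy:   ", accuracy)
print("Sensitivity:", sensitivity)
print("Specificity:", specificity)
```

The accuracy works out to (177 + 139) / 400 = 0.79, in agreement with the value reported in Q4.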


# Python Coding Questions


Instructions

You are provided with a list of tuples, each with three elements: name, age and salary. Sort the tuples by one of the three elements. The element to sort by (name/age/salary) is chosen by the second input, which is an integer (1 : name, 2 : age, 3 : salary). The ordering must be ascending.



Example:

   Input 1 : [('Ram', 23 , 3000) , ('Mohan' , 22 , 4000 ) , ( 'Suresh' , 19 , 8000)]

             2

   Output 1:[ ( 'Suresh' , 19 , 8000) , ('Mohan' , 22 , 4000 ) ,('Ram', 23 , 3000)]





   Input 2 : [('Ram', 23 , 3000) , ('Mohan' , 22 , 4000 ) , ( 'Suresh' , 19 , 8000), ('Sita' , 28,2500)]

             3

   Output 2:[ ('Sita' , 28,2500), ('Ram', 23 , 3000), ('Mohan' , 22 , 4000 ) , ( 'Suresh' , 19 , 8000)]


#### Q1. Sort the list of tuples in ascending order by the element selected by the second input (1 : name, 2 : age, 3 : salary).




            

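```python
import ast

# Parse the input string into a list of (name, age, salary) tuples.
input_str = "[('Ram', 23 , 3000) , ('Mohan' , 22 , 4000 ) , ( 'Suresh' , 19 , 8000)]"
input_list = ast.literal_eval(input_str)

# 1 : name, 2 : age, 3 : salary
sorting_choice = 2

# Sort ascending on the chosen element (tuple indices are 0-based).
print(sorted(input_list, key=lambda x: x[sorting_choice - 1]))
```

```
[('Suresh', 19, 8000), ('Mohan', 22, 4000), ('Ram', 23, 3000)]
```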
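
The same idea can be wrapped in a reusable function. This is one possible sketch, not the only valid answer: `sort_records` is a hypothetical name, `operator.itemgetter` replaces the lambda, and the check on `choice` is an added assumption about how invalid input should be handled.

```python
from operator import itemgetter


def sort_records(records, choice):
    """Sort (name, age, salary) tuples in ascending order by the chosen field.

    choice: 1 = name, 2 = age, 3 = salary.
    """
    if choice not in (1, 2, 3):
        raise ValueError("choice must be 1 (name), 2 (age) or 3 (salary)")
    # The choice is 1-based, while tuple indices are 0-based.
    return sorted(records, key=itemgetter(choice - 1))


records = [('Ram', 23, 3000), ('Mohan', 22, 4000),
           ('Suresh', 19, 8000), ('Sita', 28, 2500)]

by_salary = sort_records(records, 3)
print(by_salary)
```

Called with the second example input and choice 3, this reproduces the expected Output 2 from the instructions. `sorted` returns a new list, so the original `records` list is left unchanged.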

