## Naive Bayes Classifier
#### (For binary or multi-class classification)

##### NBC is based on Bayes' Theorem

$$ \large P(A|B) = \frac{P(B|A)\times P(A)}{P(B)} $$

where **P(A|B)** is the probability of A given B, and **P(B|A)** is the probability of B given A
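
The theorem follows from writing the joint probability of A and B in two ways. A short derivation, assuming $P(B) > 0$:

$$ \large P(A|B)\times P(B) = P(A \cap B) = P(B|A)\times P(A) \quad\Longrightarrow\quad P(A|B) = \frac{P(B|A)\times P(A)}{P(B)} $$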

#### 1. What are the basic assumptions?
- Features are independent of each other given the class


#### 2. What are the advantages when working with Naive Bayes Classifier?
- Works very well with a large number of features (for example, in Natural Language Processing, text pre-processing can leave us with thousands of features, one per word in the dictionary).
- Works well with large datasets and is fast, because training and prediction reduce to simple probability calculations.
- Training is quick: the probabilities are estimated directly from the data, so there is no lengthy iterative optimization.
- It also performs well with categorical features and with sparse matrices, where most entries are zero, for example after Bag of Words text pre-processing.

#### 3. Disadvantages
- Correlated features affect performance
---

Suppose there are 5,000 features after some NLP pre-processing techniques, and many of them are highly correlated. This hurts performance because Naive Bayes multiplies the per-feature probabilities as if each feature were independent evidence. Highly correlated features are therefore effectively counted twice in the model, which overinflates their importance.
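
The double-counting effect can be seen in a small sketch. The function name `posterior` and all the numbers below are hypothetical, chosen only to illustrate the point: entering the same evidence twice, as a perfectly correlated copy of a feature would, pushes the posterior further towards one class.

```python
from math import prod

def posterior(prior_pos, likes_pos, likes_neg):
    """Posterior of the positive class under the naive independence assumption."""
    prior_neg = 1.0 - prior_pos
    score_pos = prior_pos * prod(likes_pos)
    score_neg = prior_neg * prod(likes_neg)
    return score_pos / (score_pos + score_neg)

# Hypothetical numbers: one word with P(word | spam) = 0.8 and P(word | ham) = 0.2.
once = posterior(0.5, [0.8], [0.2])

# The same evidence entered twice, as if a perfectly correlated copy were a new feature.
twice = posterior(0.5, [0.8, 0.8], [0.2, 0.2])

print(once, twice)  # the second value is closer to 1 than the first
```

No new information was added in the second call, yet the model becomes more confident.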

#### 4. Is feature scaling required?
- No. Naive Bayes works with probabilities, so no feature scaling is required.

#### 5. Impact of Missing Values
- Naive Bayes can handle missing values
---

The algorithm handles attributes separately at both model construction time and prediction time. If a data instance has a missing value for an attribute, that value is ignored while the model is built and ignored when the probability is calculated for a class value.
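
One way to picture the prediction-time behaviour is to skip the factor for any attribute whose value is missing. The sketch below assumes missing values are encoded as `None`; the function name `class_score` and the probability table are hypothetical.

```python
def class_score(prior, cond_prob, instance):
    """Unnormalised score P(y) * prod P(x_i | y), skipping missing attributes."""
    score = prior
    for attribute, value in instance.items():
        if value is None:  # missing value: this attribute contributes no factor
            continue
        score *= cond_prob[attribute][value]
    return score

# Hypothetical conditional probabilities P(value | y) for a single class y.
cond_prob = {
    "colour": {"red": 0.7, "blue": 0.3},
    "size": {"small": 0.4, "large": 0.6},
}

complete = class_score(0.5, cond_prob, {"colour": "red", "size": "large"})
partial = class_score(0.5, cond_prob, {"colour": "red", "size": None})
print(complete, partial)
```

Because each attribute enters as its own factor, dropping one leaves the remaining factors usable as they are.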

#### 6. Impact of Outliers
- It is usually robust to outliers.

#### Problem statements that can be solved using Naive Bayes
- Sentiment analysis
- Spam classification
- Twitter sentiment analysis
- Document categorization

#### Bayes' Formula as used in the Naive Bayes Classifier

$$ \large P(A|B) = \frac{P(B|A)\times P(A)}{P(B)} $$

where **P(A|B)** is the probability of A given B, that is, the probability of A when B has already been observed

Suppose we have a dataset with independent features
x = {$ x_{1}, x_{2}, x_{3}, \dots, x_{n}$}
and a dependent feature (the class label)
y = {y}

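Applying Bayes' formula with A = y and B = x, and using the independence assumption to split $P(x_{1}, x_{2}, \dots, x_{n}|y)$ into a product of per-feature terms, we can rewrite the formula like this:

$$\large P(y|x_{1}, x_{2}, x_{3}, \dots, x_{n}) = \frac{P(x_{1}|y) \times P(x_{2}|y) \times \dots \times P(x_{n}|y)\times P(y)}{P(x_{1})\times P(x_{2})\times P(x_{3}) \times \dots \times P(x_{n})}$$

This gives the probability of y when x is given.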

We can write the formula in a shorter form by moving P(y) to the front and using product notation:

$$\large P(y|x_{1}, x_{2}, x_{3}, \dots, x_{n}) = \frac{P(y)\times \prod_{i=1}^n P(x_{i}|y)}{P(x_{1})\times P(x_{2})\times P(x_{3}) \times \dots \times P(x_{n})}$$

The denominator does not depend on y, so for a given record it is the same for every class and we can treat it as a constant:

$$\large P(x_{1})\times P(x_{2})\times P(x_{3}) ... P(x_{n})$$

The posterior is therefore directly proportional to the numerator:

$$\large P(y|x_{1}, x_{2}, x_{3}, \dots, x_{n}) \propto P(y)\times \prod_{i=1}^n P(x_{i}|y)$$

Since it is directly proportional, to find the predicted class for a particular x we take the argmax of this quantity over the possible values of y:

$\large \hat{y} = \arg\max_{y} \left( P(y)\times \prod_{i=1}^n P(x_{i}|y) \right)$
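
The argmax rule can be sketched in a few lines of Python. This is a minimal illustration rather than a full implementation: the function name `predict`, the dictionary layout, and the toy probabilities are all assumptions made for the example.

```python
from math import prod

def predict(priors, likelihoods, x):
    """Return the class y that maximises P(y) * prod_i P(x_i | y).

    priors:      {class: P(y)}
    likelihoods: {class: [{value: P(x_i = value | y)} for each feature i]}
    x:           list of observed feature values, one per feature
    """
    def score(y):
        return priors[y] * prod(table[value] for table, value in zip(likelihoods[y], x))
    return max(priors, key=score)

# Hypothetical two-class problem with two binary features.
priors = {"a": 0.6, "b": 0.4}
likelihoods = {
    "a": [{0: 0.9, 1: 0.1}, {0: 0.5, 1: 0.5}],
    "b": [{0: 0.2, 1: 0.8}, {0: 0.3, 1: 0.7}],
}

print(predict(priors, likelihoods, [1, 1]))  # b
print(predict(priors, likelihoods, [0, 0]))  # a
```

Note that the denominator never appears: it is the same for every class, so it cannot change which class wins.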

### Example problem statement

#### Outlook
| Outlook  | Yes  | No   | P(value \| yes) | P(value \| no) |
|----------|------|------|-----------------|----------------|
| sunny    |  2   |  3   | 2/9             | 3/5            |
| overcast |  4   |  0   | 4/9             | 0/5            |
| rainy    |  3   |  2   | 3/9             | 2/5            |
| total    |  9   |  5   | 100%            | 100%           |


#### Temperature
| Temperature | Yes  | No   | P(value \| yes) | P(value \| no) |
|-------------|------|------|-----------------|----------------|
| hot         |  2   |  2   | 2/9             | 2/5            |
| mild        |  4   |  2   | 4/9             | 2/5            |
| cool        |  3   |  1   | 3/9             | 1/5            |
| total       |  9   |  5   | 100%            | 100%           |

#### Play 
| Play   | Count | P(play) |
|--------|-------|---------|
| yes    |  9    | 9/14    |
| no     |  5    | 5/14    |
| total  |  14   | 100%    |


From these three tables we can calculate the probability of whether a person will play tennis or not.
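
The probability columns in the tables are simply the counts divided by the column totals. A minimal sketch of that step for the Outlook table, using exact fractions; the helper name `conditional_probabilities` is an assumption made for this example.

```python
from fractions import Fraction

# Counts copied from the Outlook table: value -> (count with play = yes, count with play = no).
outlook_counts = {"sunny": (2, 3), "overcast": (4, 0), "rainy": (3, 2)}

def conditional_probabilities(counts):
    """Divide each count by its column total to get P(value | yes) and P(value | no)."""
    total_yes = sum(yes for yes, _ in counts.values())
    total_no = sum(no for _, no in counts.values())
    return {
        value: (Fraction(yes, total_yes), Fraction(no, total_no))
        for value, (yes, no) in counts.items()
    }

outlook_probs = conditional_probabilities(outlook_counts)
p_yes = Fraction(9, 9 + 5)  # prior P(yes) from the Play table

print(outlook_probs["sunny"], p_yes)
```

`Fraction` reduces automatically, so 3/9 is stored as 1/3; the values are the same as in the table.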

Suppose we have the scenario today = (sunny, hot), where sunny is the $x_{1}$ feature and hot is the $x_{2}$ feature.

Let's determine whether the person will play tennis or not by applying the Naive Bayes formula.

What is the probability of YES given today?

$\large P(yes|today) = \frac{P(sunny|yes)\times P(hot|yes) \times P(yes)}{P(today)}$

Since P(today) is the same for both classes, we can skip it and we get:

$P(yes|today) \propto 2/9 \times 2/9 \times 9/14 \approx 0.0317$

What is the probability of NO given today?

$\large P(no|today) = \frac{P(sunny|no)\times P(hot|no) \times P(no)}{P(today)} \propto 3/5 \times 2/5 \times 5/14 \approx 0.08571$
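
The two products can be checked with exact fractions. A small sketch, where the dictionary names are chosen for this example and the values are read off the tables:

```python
from fractions import Fraction

# Conditional probabilities and priors read off the tables above.
p_sunny = {"yes": Fraction(2, 9), "no": Fraction(3, 5)}
p_hot = {"yes": Fraction(2, 9), "no": Fraction(2, 5)}
prior = {"yes": Fraction(9, 14), "no": Fraction(5, 14)}

# Unnormalised score for each class: P(sunny | y) * P(hot | y) * P(y).
scores = {y: p_sunny[y] * p_hot[y] * prior[y] for y in prior}

for y, s in scores.items():
    print(y, s, round(float(s), 5))
```

The exact scores are 2/63 for yes and 3/35 for no, which match the decimal values above.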


How do we determine whether the output is yes or no?

We take the two values, 0.0317 and 0.08571, and normalize them so that they sum to 1:

$\large P(yes|today) = \frac{0.0317}{0.0317+0.08571} \approx 0.27$

$\large P(no|today) = 1 - 0.27 = 0.73$

#### 0.73 is greater than 0.27, so in this scenario, today = (sunny, hot), the output will be NO: the person will not go to play tennis
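
The normalization and the final decision can be sketched as follows. The helper name `normalise` is an assumption for this example; the inputs are the two products from the worked example, kept as exact fractions.

```python
from fractions import Fraction

def normalise(scores):
    """Scale unnormalised class scores so that they sum to 1."""
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

# The two products from the worked example: 2/9 * 2/9 * 9/14 and 3/5 * 2/5 * 5/14.
scores = {
    "yes": Fraction(2, 9) * Fraction(2, 9) * Fraction(9, 14),
    "no": Fraction(3, 5) * Fraction(2, 5) * Fraction(5, 14),
}

posterior = normalise(scores)
prediction = max(posterior, key=posterior.get)

print({label: round(float(p), 2) for label, p in posterior.items()}, prediction)
```

The exact posteriors are 10/37 for yes and 27/37 for no. Normalizing does not change which class is largest, so the argmax could also be taken directly on the unnormalized scores.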