##### Bayes’s Theorem

According to Wikipedia, in probability theory and statistics, **Bayes’s theorem** (alternatively *Bayes’s law* or *Bayes’s rule*) describes the probability of an event based on prior knowledge of conditions that might be related to the event.
Mathematically, it can be written as:

$P(A | B)=\frac { P(B | A) * P(A)} { P(B)}$

Where A and B are events and P(B) ≠ 0:
* P(A|B) is a conditional probability: the probability of event A occurring given that B is true.
* P(B|A) is also a conditional probability: the probability of event B occurring given that A is true.
* P(A) and P(B) are the probabilities of observing A and B, respectively, without any given conditions; they are known as the marginal probabilities.
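
For readers who want to see where the theorem comes from, here is a short derivation from the definition of conditional probability. It additionally assumes P(A) ≠ 0, and $P(A \cap B)$ (the probability that both A and B occur) is notation introduced here:

$P(A | B)=\frac { P(A \cap B)} { P(B)}, \quad P(B | A)=\frac { P(A \cap B)} { P(A)}$

Both expressions contain the same joint probability, so

$P(A | B) * P(B)=P(A \cap B)=P(B | A) * P(A)$

and dividing both sides by P(B) gives the theorem.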


Let’s understand it with the help of an example:

**The problem statement:**

There are two machines that manufacture bulbs. Machine 1 produces 30 bulbs per hour and Machine 2 produces 20 bulbs per hour. Out of all bulbs produced, 1% turn out to be defective. Out of all the defective bulbs, each machine accounts for 50%. What is the probability that a bulb produced by Machine 2 is defective?

We can write the information given above in mathematical terms as:

The probability that a bulb was made by Machine 1, P(M1)=30/50=0.6

The probability that a bulb was made by Machine 2, P(M2)=20/50=0.4

The probability that a bulb is defective, P(Defective)=1%=0.01

The probability that a defective bulb came out of Machine 1, P(M1 | Defective)=50%=0.5

The probability that a defective bulb came out of Machine 2, P(M2 | Defective)=50%=0.5

Now we need to calculate the probability that a bulb produced by Machine 2 is defective, i.e., P(Defective | M2).
Using Bayes’s theorem above, it can be written as:

$P(Defective | M2)=\frac { P(M2 | Defective) * P(Defective)} { P(M2)}$

Substituting the values, we get:

$P(Defective | M2)=\frac {0.5*0.01}{0.4}= 0.0125$

So 1.25% of the bulbs produced by Machine 2 are defective.
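
The same substitution can be checked in a few lines of Python. This is only a sketch: `bayes` is a helper name made up here, not a library function, and its arguments simply mirror the terms of the formula.

```python
def bayes(p_b_given_a, p_a, p_b):
    """Return P(A | B) = P(B | A) * P(A) / P(B)."""
    if p_b == 0:
        raise ValueError("P(B) must be non-zero")
    return p_b_given_a * p_a / p_b

# Values from the bulb example, with A = Defective and B = M2:
# P(M2 | Defective) = 0.5, P(Defective) = 0.01, P(M2) = 0.4
p_defective_given_m2 = bayes(p_b_given_a=0.5, p_a=0.01, p_b=0.4)
print(round(p_defective_given_m2, 4))  # 0.0125
```

Only the arguments need to change to reuse the helper for a different pair of events.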

Your task is to calculate the probability that a bulb produced by Machine 1 is defective.

As a second example, suppose you are planning a picnic today, but the morning is cloudy.

- Oh no! 50% of all rainy days start off cloudy!
- But cloudy mornings are common (about 40% of days start cloudy).
- And this is usually a dry month (only 3 of 30 days tend to be rainy, or 10%).

What is the chance of rain during the day?

We will use Rain to mean rain during the day, and Cloud to mean cloudy morning.

The chance of Rain given Cloud is written P(Rain|Cloud)

So let's put that in the formula:

$P(Rain | Cloud)=\frac { P(Rain) * P(Cloud | Rain)} { P(Cloud)}$

- P(Rain) is Probability of Rain = 10%
- P(Cloud|Rain) is Probability of Cloud, given that Rain happens = 50%
- P(Cloud) is Probability of Cloud = 40%

$P(Rain | Cloud)=\frac {0.1*0.5}{0.4}= 0.125$

Or a 12.5% chance of rain. Not too bad, let's have a picnic!
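
One way to sanity-check this result is to think in whole days instead of probabilities. The sketch below assumes an imaginary run of 1000 days (an arbitrary total chosen so that every count is a whole number) and counts how many cloudy mornings end in rain.

```python
# Imagine 1000 days with the same weather pattern (1000 is an arbitrary illustrative total).
days = 1000
rainy = days * 10 // 100              # 10% of days are rainy -> 100
cloudy = days * 40 // 100             # 40% of mornings are cloudy -> 400
rainy_and_cloudy = rainy * 50 // 100  # 50% of rainy days start cloudy -> 50

# Among the cloudy mornings, the fraction that turn out rainy is P(Rain | Cloud).
p_rain_given_cloud = rainy_and_cloudy / cloudy
print(p_rain_given_cloud)  # 0.125
```

Counting days this way gives the same 12.5% as the formula, because Bayes’s theorem is just this ratio written in terms of probabilities.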

We’ll extend this same reasoning to understand the Naïve Bayes algorithm.


<img src=1.PNG width=600>


<img src=2.PNG width=600>


<img src=3.PNG width=600>


<img src=4.PNG width=600>


<img src=5.PNG width=600>
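
The same Bayes update, combined with the "naive" assumption that words are independent of each other given the class, can be sketched in a few lines. The prior and the per-word probabilities below are invented for illustration (they are not estimated from `spam.csv`), and `posterior` is a hypothetical helper name.

```python
# Toy naive Bayes posterior for a short message.
# All numbers are made up for illustration only.
prior = {"ham": 0.8, "spam": 0.2}
word_given_class = {
    "ham": {"free": 0.05, "jackpot": 0.01},
    "spam": {"free": 0.40, "jackpot": 0.20},
}

def posterior(words):
    """Return P(class | words) under the naive independence assumption."""
    scores = {}
    for label in prior:
        # P(class) * product of P(word | class) over the words in the message
        score = prior[label]
        for word in words:
            score *= word_given_class[label][word]
        scores[label] = score
    total = sum(scores.values())  # P(words), the normalising constant
    return {label: score / total for label, score in scores.items()}

result = posterior(["free", "jackpot"])
print(result)
```

With these made-up numbers the spam score (0.2 × 0.40 × 0.20 = 0.016) is much larger than the ham score (0.8 × 0.05 × 0.01 = 0.0004), so the message is classified as spam even though ham has the higher prior.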

In [1]:
## importing the library
import pandas as pd

In [2]:
## reading the dataset
data=pd.read_csv('spam.csv',encoding='latin')

In [3]:
data

Unnamed: 0,type,email
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will Ì_ b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [3]:
## Creating X and y
X=data.email
y=data.type

In [5]:
X

0       Go until jurong point, crazy.. Available only ...
1                           Ok lar... Joking wif u oni...
2       Free entry in 2 a wkly comp to win FA Cup fina...
3       U dun say so early hor... U c already then say...
4       Nah I don't think he goes to usf, he lives aro...
                              ...                        
5567    This is the 2nd time we have tried 2 contact u...
5568                Will Ì_ b going to esplanade fr home?
5569    Pity, * was in mood for that. So...any other s...
5570    The guy did some bitching but I acted like i'd...
5571                           Rofl. Its true to its name
Name: email, Length: 5572, dtype: object

In [4]:
## text preprocessing and feature vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
tf=TfidfVectorizer() ## object creation
X=tf.fit_transform(X) ## fitting and transforming the data


In [5]:
X.shape

(5572, 8672)

In [6]:
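## getting the feature names
## (get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out())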
tf.get_feature_names()[2000:2010]

['chez',
 'chg',
 'chgs',
 'chic',
 'chick',
 'chicken',
 'chickened',
 'chief',
 'chik',
 'chikku']

In [9]:
## number of features created
len(tf.get_feature_names())

8672

In [10]:
X

<5572x8672 sparse matrix of type '<class 'numpy.float64'>'
	with 73916 stored elements in Compressed Sparse Row format>

In [11]:
## getting the feature vectors as a dense array
X=X.toarray()

In [13]:
X[2000:2010]

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [12]:
## Creating training and testing sets
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=6)

In [13]:
## Model creation
from sklearn.naive_bayes import BernoulliNB
nb=BernoulliNB(alpha=0.01) ## model object creation
nb.fit(X_train,y_train) ## fitting the model
y_hat=nb.predict(X_test) ## getting the prediction
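
The `alpha=0.01` argument is the additive (Laplace/Lidstone) smoothing strength. The sketch below shows why smoothing matters for a single present/absent word feature. The counts are hypothetical, and `smoothed_probability` is a helper name made up here to illustrate the idea, not part of scikit-learn.

```python
def smoothed_probability(docs_with_word, docs_in_class, alpha):
    """Additive smoothing for a binary (present/absent) feature."""
    # The 2 * alpha in the denominator covers the two outcomes: present and absent.
    return (docs_with_word + alpha) / (docs_in_class + 2 * alpha)

# Hypothetical counts: a word that never appears in 600 spam training messages.
unsmoothed = 0 / 600
smoothed = smoothed_probability(docs_with_word=0, docs_in_class=600, alpha=0.01)

print(unsmoothed)  # 0.0, which would zero out the whole product of probabilities
print(smoothed)    # small but non-zero
```

A small `alpha` keeps the estimates close to the raw frequencies while still preventing a single unseen word from forcing a class probability to zero.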

In [14]:
## model evaluation
from sklearn.metrics import classification_report,confusion_matrix

In [15]:
print(classification_report(y_test,y_hat))

              precision    recall  f1-score   support

         ham       0.99      1.00      0.99      1192
        spam       0.98      0.94      0.96       201

    accuracy                           0.99      1393
   macro avg       0.98      0.97      0.98      1393
weighted avg       0.99      0.99      0.99      1393



In [16]:
## confusion matrix
pd.crosstab(y_test,y_hat)

type,ham,spam
ham,1188,4
spam,12,189
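
The figures in the classification report can be reproduced by hand from these four counts. The sketch below does so for the spam class; rows of the crosstab are the actual labels and columns are the predictions, and the variable names are chosen here for readability.

```python
# Counts read from the crosstab above (rows = actual, columns = predicted).
ham_as_ham, ham_as_spam = 1188, 4
spam_as_ham, spam_as_spam = 12, 189

total = ham_as_ham + ham_as_spam + spam_as_ham + spam_as_spam
accuracy = (ham_as_ham + spam_as_spam) / total

# Precision: of the messages predicted as spam, how many really are spam.
spam_precision = spam_as_spam / (spam_as_spam + ham_as_spam)
# Recall: of the messages that really are spam, how many were caught.
spam_recall = spam_as_spam / (spam_as_spam + spam_as_ham)
# F1: harmonic mean of precision and recall.
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

print(round(accuracy, 2), round(spam_precision, 2), round(spam_recall, 2), round(spam_f1, 2))
# 0.99 0.98 0.94 0.96
```

These match the `spam` row and the accuracy line of the report, and they show that the main weakness is the 12 spam messages that slipped through as ham.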


In [17]:
message1=pd.Series(['This is to inform you that you applicationn has been rejected'])

In [18]:
message1

0    This is to inform you that you applicationn ha...
dtype: object

In [19]:
## converting the text to a TF-IDF feature vector using the already fitted vectorizer
message1_transform=tf.transform(message1)

In [20]:
message1_transform

<1x8672 sparse matrix of type '<class 'numpy.float64'>'
	with 9 stored elements in Compressed Sparse Row format>

In [21]:
tf.get_feature_names()

['00',
 '000',
 '000pes',
 '008704050406',
 '0089',
 '0121',
 '01223585236',
 '01223585334',
 '0125698789',
 '02',
 '0207',
 '02072069400',
 '02073162414',
 '02085076972',
 '021',
 '03',
 '04',
 '0430',
 '05',
 '050703',
 '0578',
 '06',
 '07',
 '07008009200',
 '07046744435',
 '07090201529',
 '07090298926',
 '07099833605',
 '07123456789',
 '0721072',
 '07732584351',
 '07734396839',
 '07742676969',
 '07753741225',
 '0776xxxxxxx',
 '07781482378',
 '07786200117',
 '077xxx',
 '078',
 '07801543489',
 '07808',
 '07808247860',
 '07808726822',
 '07815296484',
 '07821230901',
 '078498',
 '07880867867',
 '0789xxxxxxx',
 '07946746291',
 '0796xxxxxx',
 '07973788240',
 '07xxxxxxxxx',
 '08',
 '0800',
 '08000407165',
 '08000776320',
 '08000839402',
 '08000930705',
 '08000938767',
 '08001950382',
 '08002888812',
 '08002986030',
 '08002986906',
 '08002988890',
 '08006344447',
 '0808',
 '08081263000',
 '08081560665',
 '0825',
 '083',
 '0844',
 '08448350055',
 '08448714184',
 '0845',
 '08450542832',
 ...

In [22]:
## predicting from model
nb.predict(message1_transform)

array(['ham'], dtype='<U4')

In [23]:
message2=pd.Series(['free lucky draw jackpot'])


In [24]:
message2_transform=tf.transform(message2)

In [25]:
nb.predict(message2_transform)

array(['spam'], dtype='<U4')

In [None]:
## TF (term frequency) = number of times the word appears in the sentence / total number of words in the sentence

In [None]:
## IDF (inverse document frequency) = log(number of sentences / number of sentences containing the word)
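
The two formulas above can be tried out by hand on a tiny corpus. The following is a minimal sketch in plain Python: the three sentences are invented for illustration, the helper names `tf`, `idf` and `tf_idf` are chosen here, and `idf` assumes the word occurs in at least one sentence.

```python
import math

# Three toy "sentences" (invented for illustration, not taken from spam.csv).
sentences = [
    "free entry free prize",
    "call me later",
    "free call now",
]
tokenized = [sentence.split() for sentence in sentences]

def tf(word, tokens):
    # occurrences of the word in the sentence / total words in the sentence
    return tokens.count(word) / len(tokens)

def idf(word, all_tokens):
    # log(number of sentences / number of sentences containing the word)
    containing = sum(1 for tokens in all_tokens if word in tokens)
    return math.log(len(all_tokens) / containing)

def tf_idf(word, tokens, all_tokens):
    return tf(word, tokens) * idf(word, all_tokens)

print(tf("free", tokenized[0]))                             # 0.5
print(round(idf("free", tokenized), 4))                     # 0.4055
print(round(tf_idf("prize", tokenized[0], tokenized), 4))   # 0.2747
```

"prize" scores higher than "free" in the first sentence even though it appears less often there, because it is rarer across the corpus. Note that `TfidfVectorizer` with its default settings smooths the IDF term and L2-normalises each row, so its values will not match this textbook version exactly.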