# Bayes’ Theorem

 
Bayes’ Theorem is a simple mathematical formula used for calculating conditional probabilities.

Conditional probability is a measure of the probability of an event occurring given that another event has (by assumption, presumption, assertion, or evidence) occurred.

The formula is:

![formula](https://www.kdnuggets.com/wp-content/uploads/bayes-nagesh-1.jpg)
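For reference, here is the standard statement of Bayes’ theorem written out as math, in case the image does not load. The symbols A and B are the same two events used in the surrounding text.

$$
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
$$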


The formula tells us how often A happens given that B happens, written P(A|B) and called the posterior probability, when we know three things: how often B happens given that A happens, written P(B|A); how likely A is on its own, written P(A); and how likely B is on its own, written P(B).
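As a minimal sketch of the arithmetic, the helper below plugs three probabilities into the formula. The function name `bayes_posterior` and the input numbers are hypothetical and chosen only for illustration.

```python
def bayes_posterior(p_b_given_a, p_a, p_b):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    if p_b <= 0:
        raise ValueError("P(B) must be positive")
    return p_b_given_a * p_a / p_b


# Hypothetical inputs: P(B|A) = 0.9, P(A) = 0.2, P(B) = 0.3
posterior = bayes_posterior(0.9, 0.2, 0.3)
print(round(posterior, 3))
```

With these inputs the posterior is 0.9 × 0.2 / 0.3 = 0.6: observing B raises the probability of A from 0.2 to 0.6.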

 


# Example


![table](https://www.kdnuggets.com/wp-content/uploads/bayes-nagesh-2.png)

The posterior probability P(y|X) can be calculated in three steps. First, create a frequency table for each attribute against the target. Then convert the frequency tables into likelihood tables. Finally, use the Naïve Bayes equation to calculate the posterior probability for each class. The class with the highest posterior probability is the predicted outcome. Below are the frequency and likelihood tables for all three predictors.
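The three steps can be sketched in plain Python for a single attribute. The records below are a hypothetical toy sample, not the data shown in the figures, and the helper names `likelihood` and `prior` are made up for this illustration.

```python
from collections import Counter

# Hypothetical toy records of (colour, stolen?); NOT the table from the figures
records = [
    ("Red", "Yes"), ("Red", "No"), ("Yellow", "Yes"),
    ("Yellow", "No"), ("Red", "Yes"), ("Yellow", "No"),
]

# Step 1: frequency table, i.e. how often each (value, class) pair occurs
pair_counts = Counter(records)
class_counts = Counter(label for _, label in records)


def likelihood(value, label):
    """Step 2: P(value | label), read off the frequency table."""
    return pair_counts[(value, label)] / class_counts[label]


def prior(label):
    """P(label), the share of records belonging to the class."""
    return class_counts[label] / len(records)


# Step 3: unnormalised posterior P(label) * P(value | label) for each class
scores = {label: prior(label) * likelihood("Red", label) for label in class_counts}
prediction = max(scores, key=scores.get)
print(scores, prediction)
```

With several attributes, the Naïve Bayes assumption of conditional independence means the per-attribute likelihoods are simply multiplied together before comparing the classes.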

![tables](https://www.kdnuggets.com/wp-content/uploads/bayes-nagesh-8.png)

Frequency and Likelihood tables of ‘Type’
![TypeTable](https://www.kdnuggets.com/wp-content/uploads/bayes-nagesh-9.png)

Frequency and Likelihood tables of ‘Origin’


![typetable](https://www.kdnuggets.com/wp-content/uploads/bayes-nagesh-10.png)

![predictors](https://www.kdnuggets.com/wp-content/uploads/bayes-nagesh-11.png)

As per the equations discussed above, we can calculate the posterior probability P(Yes | X) as:

![](https://www.kdnuggets.com/wp-content/uploads/bayes-nagesh-12.png)

and P(No | X) as:

![](https://www.kdnuggets.com/wp-content/uploads/bayes-nagesh-13.png)

Since 0.144 > 0.048, the class ‘No’ has the higher posterior probability. Given the features Red, SUV and Domestic, our example is therefore classified as ‘No’: the car is not stolen.
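The comparison can be checked in a couple of lines. This sketch assumes 0.048 is the score for ‘Yes’ and 0.144 the score for ‘No’, as the conclusion implies, and that both are unnormalised scores (the shared denominator P(X) is often dropped because it does not change which class wins). Dividing by their sum turns them into probabilities that add up to 1.

```python
# Scores quoted in the text; the class assignment is inferred from the conclusion
score_yes = 0.048
score_no = 0.144

# Normalise so the two posteriors sum to 1
total = score_yes + score_no
p_yes = score_yes / total
p_no = score_no / total

prediction = "No" if score_no > score_yes else "Yes"
print(prediction, round(p_yes, 2), round(p_no, 2))
```

Under these assumptions the normalised posteriors are 0.25 for ‘Yes’ and 0.75 for ‘No’.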

In [23]:
import pandas as pd
import plotly.express as pe
from statistics import mean

# model
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import StratifiedKFold
from hyperopt import hp, tpe, fmin, Trials, STATUS_OK, space_eval
from hyperopt.early_stop import no_progress_loss

# metrics
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, f1_score

In [24]:
path = r"C:\Users\harsh\Desktop\NPCI-Python-ML\datasets\Loan_Status_Classification.csv"

df = pd.read_csv(path)

df

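|  | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 0 | 0 | 6608 | 0 | 137 | 180 | 1 | 1 | 1 |
| 1 | 0 | 1 | 2 | 0 | 0 | 4226 | 1040 | 110 | 360 | 1 | 1 | 1 |
| 2 | 1 | 1 | 0 | 1 | 0 | 3167 | 2283 | 154 | 360 | 1 | 2 | 1 |
| 3 | 0 | 0 | 0 | 1 | 1 | 6950 | 0 | 175 | 180 | 1 | 2 | 1 |
| 4 | 0 | 1 | 0 | 1 | 0 | 3993 | 3274 | 207 | 360 | 1 | 2 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 293 | 1 | 0 | 0 | 1 | 0 | 3846 | 0 | 111 | 360 | 1 | 2 | 1 |
| 294 | 0 | 0 | 0 | 1 | 0 | 2435 | 0 | 75 | 360 | 1 | 1 | 0 |
| 295 | 0 | 0 | 2 | 1 | 0 | 4923 | 0 | 166 | 360 | 0 | 2 | 1 |
| 296 | 0 | 1 | 3 | 0 | 0 | 2071 | 754 | 94 | 480 | 1 | 2 | 1 |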


### Step 2: Data exploration

In [25]:
df.shape

(298, 12)

In [26]:
df.index

RangeIndex(start=0, stop=298, step=1)

In [27]:
df.columns

Index(['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed',
       'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
       'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status'],
      dtype='object')

In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 298 entries, 0 to 297
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Gender             298 non-null    int64
 1   Married            298 non-null    int64
 2   Dependents         298 non-null    int64
 3   Education          298 non-null    int64
 4   Self_Employed      298 non-null    int64
 5   ApplicantIncome    298 non-null    int64
 6   CoapplicantIncome  298 non-null    int64
 7   LoanAmount         298 non-null    int64
 8   Loan_Amount_Term   298 non-null    int64
 9   Credit_History     298 non-null    int64
 10  Property_Area      298 non-null    int64
 11  Loan_Status        298 non-null    int64
dtypes: int64(12)
memory usage: 28.1 KB


In [29]:
df.isna().sum()

Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64

In [30]:
df.nunique()

Gender                 2
Married                2
Dependents             4
Education              2
Self_Employed          2
ApplicantIncome      257
CoapplicantIncome    150
LoanAmount           145
Loan_Amount_Term       9
Credit_History         2
Property_Area          3
Loan_Status            2
dtype: int64

In [31]:
df["Loan_Status"].value_counts()

1    150
0    148
Name: Loan_Status, dtype: int64
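The two classes are nearly balanced. As a rough reference point for any accuracy figure on this dataset, a classifier that always predicts the majority class would be right about half the time. The sketch below just redoes that arithmetic from the counts printed above; the variable names are made up for this illustration.

```python
# Class counts copied from the value_counts() output above
counts = {1: 150, 0: 148}

# Accuracy of always predicting the most frequent class
majority_baseline = max(counts.values()) / sum(counts.values())
print(round(majority_baseline, 3))
```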

In [32]:
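# Remove underscores from the column names
df.columns = [col.replace("_", "") for col in df.columns]
print(df.columns)

# Capitalise the "A" in CoapplicantIncome for consistency with ApplicantIncome
mapping = {"CoapplicantIncome": "CoApplicantIncome"}
df.rename(columns=mapping, inplace=True)

df.columns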

Index(['Gender', 'Married', 'Dependents', 'Education', 'SelfEmployed',
       'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'LoanAmountTerm',
       'CreditHistory', 'PropertyArea', 'LoanStatus'],
      dtype='object')


Index(['Gender', 'Married', 'Dependents', 'Education', 'SelfEmployed',
       'ApplicantIncome', 'CoApplicantIncome', 'LoanAmount', 'LoanAmountTerm',
       'CreditHistory', 'PropertyArea', 'LoanStatus'],
      dtype='object')

### Separate out categorical & real features



In [33]:
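real_value_features = ["ApplicantIncome", "CoApplicantIncome", "LoanAmount"]
# Every remaining column except the target is treated as categorical
categorical_features = [col for col in df.columns if col not in real_value_features and col != "LoanStatus"]
# Categorical columns with more than two distinct values still need to be binarised
non_binary_categorical_features = [col for col in categorical_features if df[col].nunique() > 2]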




In [35]:
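# Collapse each non-binary categorical column into two groups (0/1) using a fixed threshold
df["Dependents"] = df["Dependents"].apply(lambda x: int(x > 2))  # 1 if more than 2 dependents

df["PropertyArea"] = df["PropertyArea"].apply(lambda x: int(x > 1))  # 1 if the area code is above 1

df["LoanAmountTerm"] = df["LoanAmountTerm"].apply(lambda x: int(x > 180))  # 1 if the term exceeds 180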


df

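|  | Gender | Married | Dependents | Education | SelfEmployed | ApplicantIncome | CoApplicantIncome | LoanAmount | LoanAmountTerm | CreditHistory | PropertyArea | LoanStatus |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 | 6608 | 0 | 137 | 0 | 1 | 0 | 1 |
| 1 | 0 | 1 | 0 | 0 | 0 | 4226 | 1040 | 110 | 1 | 1 | 0 | 1 |
| 2 | 1 | 1 | 0 | 1 | 0 | 3167 | 2283 | 154 | 1 | 1 | 1 | 1 |
| 3 | 0 | 0 | 0 | 1 | 1 | 6950 | 0 | 175 | 0 | 1 | 1 | 1 |
| 4 | 0 | 1 | 0 | 1 | 0 | 3993 | 3274 | 207 | 1 | 1 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 293 | 1 | 0 | 0 | 1 | 0 | 3846 | 0 | 111 | 1 | 1 | 1 | 1 |
| 294 | 0 | 0 | 0 | 1 | 0 | 2435 | 0 | 75 | 1 | 1 | 0 | 0 |
| 295 | 0 | 0 | 0 | 1 | 0 | 4923 | 0 | 166 | 1 | 0 | 1 | 1 |
| 296 | 0 | 1 | 1 | 0 | 0 | 2071 | 754 | 94 | 1 | 1 | 1 | 1 |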


In [36]:

# Split each real-valued feature at its median (the 50% quantile):
# values at or below the median get label 0, values above it get label 1
df["ApplicantIncome"] = pd.qcut(df["ApplicantIncome"], q=[0, 0.5, 1.0], labels=[0, 1])
df["CoApplicantIncome"] = pd.qcut(df["CoApplicantIncome"], q=[0, 0.5, 1.0], labels=[0, 1])
df["LoanAmount"] = pd.qcut(df["LoanAmount"], q=[0, 0.5, 1.0], labels=[0, 1])
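To see what a median split with `pd.qcut` does, here is a small sketch on a hypothetical four-value series (the numbers are not taken from the loan dataset). The quantile list `[0, 0.5, 1.0]` asks for two bins separated at the median.

```python
import pandas as pd

# Hypothetical values, not taken from the loan dataset
incomes = pd.Series([10, 20, 30, 40])

# Two quantile bins split at the median (25 here):
# label 0 at or below the median, label 1 above it
binned = pd.qcut(incomes, q=[0, 0.5, 1.0], labels=[0, 1])
print(binned.tolist())
```

The result has a categorical dtype. One caveat worth keeping in mind: `pd.qcut` raises a `ValueError` when the computed bin edges are not unique, which can happen if more than half of a column shares its minimum value.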


In [37]:
df.nunique()

Gender               2
Married              2
Dependents           2
Education            2
SelfEmployed         2
ApplicantIncome      2
CoApplicantIncome    2
LoanAmount           2
LoanAmountTerm       2
CreditHistory        2
PropertyArea         2
LoanStatus           2
dtype: int64
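Every column now takes exactly two values, which suits `BernoulliNB`: that estimator models each feature as a binary variable (by default it thresholds inputs at 0 through its `binarize` parameter). The sketch below fits it on a tiny hypothetical dataset, unrelated to the loan data, in which only the first feature carries information about the class.

```python
from sklearn.naive_bayes import BernoulliNB

# Hypothetical binary data: the first feature separates the classes, the second does not
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [1, 1, 0, 0]

model = BernoulliNB()  # default alpha=1.0 applies Laplace smoothing
model.fit(X, y)

# Predictions follow the first feature
pred = model.predict([[1, 0], [0, 1]])
print(pred.tolist())
```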

### Step 3: Feature & target selection

In [38]:
features = real_value_features + categorical_features
target = "LoanStatus"

In [45]:
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
lst_accu_stratified = []

# StratifiedKFold.split takes the features and the target and, for each fold,
# yields a pair of index arrays: one for training and one for testing.
# These positional indices are used to select the rows for each fold.
for train_index, test_index in skf.split(df[features], df[target]):
    # Select the training and testing features by position with iloc
    x_train_fold, x_test_fold = df[features].iloc[train_index, :], df[features].iloc[test_index, :]

    # Select the training and testing labels by position as well
    y_train_fold, y_test_fold = df[target].iloc[train_index], df[target].iloc[test_index]

    # Create and train the model
    model = BernoulliNB()
    model.fit(x_train_fold, y_train_fold)

    # Store the accuracy on the held-out fold
    lst_accu_stratified.append(model.score(x_test_fold, y_test_fold))

# Report the mean accuracy across all folds
print(mean(lst_accu_stratified))

0.6810344827586207
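What "stratified" means in the loop above can be seen on a small synthetic example. The labels below are hypothetical (six of class 0 and four of class 1), and the features are placeholders because only the labels drive the split. Each test fold keeps the same 3:2 class ratio as the full set.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: six samples of class 0 and four of class 1
y = np.array([0] * 6 + [1] * 4)
X = np.zeros((10, 1))  # placeholder features; the split depends only on y

skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=1)
folds = list(skf.split(X, y))

for train_idx, test_idx in folds:
    # Count how many samples of each class landed in the test fold
    print(np.bincount(y[test_idx]))
```

Preserving the class ratio in every fold keeps the per-fold accuracies comparable, which matters most when the classes are imbalanced.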
