Conditional Probability

P(queen/diamond) = 1/13, because exactly one of the 13 diamonds in a deck is a queen.

P(A/B) = Probability of event A given that event B has already occurred.

Bayes' Theorem (Thomas Bayes)

P(A/B) = [P(B/A)*P(A)] / P(B)

P(queen/diamond) = [P(diamond/queen) * P(queen)] / P(diamond)
                 = (1/4 *1/13) / (1/4)
                 = 1/13
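
The card calculation can be checked with exact fractions. This is a minimal sketch using Python's `fractions` module; the variable names are illustrative and do not come from the notebook.

```python
from fractions import Fraction

# Standard 52-card deck: 4 queens, 13 diamonds, 1 queen of diamonds.
p_queen = Fraction(4, 52)                # P(queen) = 1/13
p_diamond = Fraction(13, 52)             # P(diamond) = 1/4
p_diamond_given_queen = Fraction(1, 4)   # one of the four queens is a diamond

# Bayes' theorem: P(queen/diamond) = P(diamond/queen) * P(queen) / P(diamond)
p_queen_given_diamond = p_diamond_given_queen * p_queen / p_diamond

print(p_queen_given_diamond)  # 1/13
```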

Naive Bayes Classifier Algorithm

Why Naive? We make the naive assumption that features such as sex, class, age, cabin, and fare are independent of each other once the outcome is known,
even though this may not be true in reality.
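
To make the independence assumption concrete, the sketch below forms a score for each class by multiplying the class prior by one likelihood per feature, then normalises the scores so they sum to 1. Every probability here is a made-up illustrative value, not an estimate from the Titanic data, and the names `priors`, `likelihoods`, and `naive_bayes_posterior` are hypothetical.

```python
from math import prod

# Hypothetical priors and per-feature likelihoods (illustrative values only).
priors = {"survived": 0.4, "died": 0.6}
likelihoods = {
    "survived": {"female": 0.7, "first_class": 0.4},
    "died": {"female": 0.15, "first_class": 0.15},
}

def naive_bayes_posterior(observed):
    # Naive assumption: the likelihood of all features given the class
    # is the product of the individual per-feature likelihoods.
    scores = {
        c: priors[c] * prod(likelihoods[c][f] for f in observed)
        for c in priors
    }
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

posterior = naive_bayes_posterior(["female", "first_class"])
print(posterior)
```

If the features are in fact correlated, the product counts the shared evidence more than once, which is why the resulting probabilities tend to be overconfident.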

Naive Bayes is used in spam email detection, handwriting recognition, weather prediction, facial recognition, and news article categorization.

In [6]:
import pandas as pd
df = pd.read_csv('../Files/Naive_Bayes/titanic.csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [7]:
# Drop the columns that will not be used as features
df.drop(['PassengerId','Name','SibSp','Parch','Ticket','Cabin','Embarked'],axis='columns',inplace=True)
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Fare
0,0,3,male,22.0,7.25
1,1,1,female,38.0,71.2833
2,1,3,female,26.0,7.925
3,1,1,female,35.0,53.1
4,0,3,male,35.0,8.05


In [8]:
# Separate the target (Survived) from the input features
target = df.Survived
inputs = df.drop('Survived',axis='columns')

In [9]:
# Using dummies method as Sex is a categorical column
dummies = pd.get_dummies(inputs.Sex).astype(int)
dummies.head()

Unnamed: 0,female,male
0,0,1
1,1,0
2,1,0
3,1,0
4,0,1
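
As a rough illustration of what the dummy encoding produces, the sketch below builds the same 0/1 columns in plain Python for the first five `Sex` values shown earlier. The helper `one_hot` is hypothetical and is not how pandas implements `get_dummies`.

```python
def one_hot(values):
    # One 0/1 entry per distinct category, with categories in sorted order
    categories = sorted(set(values))
    return [{c: int(v == c) for c in categories} for v in values]

sex = ["male", "female", "female", "female", "male"]  # first five rows of the Sex column
rows = one_hot(sex)
for row in rows:
    print(row)
```

Because `female` and `male` always sum to 1, one of the two columns is redundant. Passing `drop_first=True` to `pd.get_dummies` is one way to keep a single column, which also avoids feeding Naive Bayes the same evidence twice.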


In [10]:
inputs = pd.concat([inputs,dummies],axis='columns').drop('Sex',axis='columns')
inputs.head()

Unnamed: 0,Pclass,Age,Fare,female,male
0,3,22.0,7.25,0,1
1,1,38.0,71.2833,1,0
2,3,26.0,7.925,1,0
3,1,35.0,53.1,1,0
4,3,35.0,8.05,0,1


In [11]:
# now we check for any null values
inputs.columns[inputs.isna().any()]

Index(['Age'], dtype='object')

In [13]:
inputs.Age[:10]

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
5     NaN
6    54.0
7     2.0
8    27.0
9    14.0
Name: Age, dtype: float64

In [17]:
# Fill the missing ages with the mean age of the column
inputs.Age = inputs.Age.fillna(inputs.Age.mean())
inputs.head(6)

Unnamed: 0,Pclass,Age,Fare,female,male
0,3,22.0,7.25,0,1
1,1,38.0,71.2833,1,0
2,3,26.0,7.925,1,0
3,1,35.0,53.1,1,0
4,3,35.0,8.05,0,1
5,3,29.699118,8.4583,0,1
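
The mean imputation step can be sketched in plain Python on the first ten ages printed earlier. The mean computed here differs from the 29.699118 in the table because the notebook averages over the whole column, while this sketch only sees ten values.

```python
import math

# First ten values of inputs.Age; math.nan stands in for the missing entry
ages = [22.0, 38.0, 26.0, 35.0, 35.0, math.nan, 54.0, 2.0, 27.0, 14.0]

# The mean is taken over the known values only
known = [a for a in ages if not math.isnan(a)]
mean_age = sum(known) / len(known)

# Replace each missing value with that mean
filled = [mean_age if math.isnan(a) else a for a in ages]
print(round(mean_age, 2), filled[5] == mean_age)
```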


In [18]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(inputs,target,test_size=0.2)

In [19]:
print(len(X_train),len(X_test),len(inputs))

712 179 891
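
The split sizes can be reproduced with a simplified stand-in for `train_test_split`. The function `simple_split` is hypothetical; rounding the test size up is an assumption chosen because it matches the 712/179 split printed above.

```python
import math
import random

def simple_split(n_rows, test_size, seed=0):
    # Shuffle the row positions, then cut off the test portion
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_size)   # 891 * 0.2 = 178.2, rounded up to 179
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = simple_split(891, 0.2)
print(len(train_idx), len(test_idx))  # 712 179
```

The notebook does not fix the shuffle, so the rows in each part, and therefore the score, change from run to run. Passing `random_state=` to `train_test_split` makes the split repeatable.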


In [20]:
# We use the Gaussian Naive Bayes model, which assumes each feature follows a normal (Gaussian, bell-curve) distribution within each class
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
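
A minimal sketch of the idea behind the Gaussian variant: for a continuous feature such as age, estimate a mean and variance per class, then score a new value with the normal density. The ages below are hypothetical illustrative values, and `gaussian_pdf` is a helper written for this sketch, not part of scikit-learn.

```python
import math
from statistics import mean, pvariance

def gaussian_pdf(x, mu, var):
    # Density of a normal distribution with mean mu and variance var
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical ages per class (illustrative values, not the Titanic data)
ages_by_class = {0: [22.0, 35.0, 54.0, 40.0], 1: [38.0, 26.0, 2.0, 14.0]}

# One (mean, variance) pair per class
params = {c: (mean(v), pvariance(v)) for c, v in ages_by_class.items()}

# Likelihood of observing age 30 under each class
likelihood = {c: gaussian_pdf(30.0, mu, var) for c, (mu, var) in params.items()}
print(likelihood)
```

These per-feature likelihoods are what get multiplied together with the class prior. scikit-learn's `GaussianNB` additionally adds a small smoothing term to the variances, controlled by `var_smoothing`.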

In [21]:
model.fit(X_train,y_train)

In [22]:
model.score(X_test,y_test)

0.7932960893854749

In [23]:
X_test[:10]

Unnamed: 0,Pclass,Age,Fare,female,male
503,3,37.0,9.5875,1,0
449,1,52.0,30.5,0,1
525,3,40.5,7.75,0,1
694,1,60.0,26.55,0,1
446,2,13.0,19.5,1,0
487,1,58.0,29.7,0,1
152,3,55.5,8.05,0,1
617,3,26.0,16.1,1,0
748,1,19.0,53.1,0,1
354,3,29.699118,7.225,0,1


In [24]:
y_test[:10]

503    0
449    1
525    0
694    0
446    1
487    0
152    0
617    0
748    0
354    0
Name: Survived, dtype: int64

In [25]:
model.predict(X_test[:10])

array([1, 0, 0, 0, 1, 0, 0, 1, 0, 0], dtype=int64)
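
For a classifier, `model.score` reports accuracy: the fraction of predictions that match the true labels. The sketch below applies that definition to the ten labels and ten predictions printed above; it covers only this small sample, so the result differs from the score on the full test set.

```python
y_true = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]   # y_test[:10] as printed above
y_pred = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0]   # model.predict(X_test[:10]) as printed above

# Count the positions where the prediction equals the true label
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(correct, accuracy)  # 7 0.7
```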

In [27]:
model.predict_proba(X_test[:10])

array([[0.05386788, 0.94613212],
       [0.89121738, 0.10878262],
       [0.99028614, 0.00971386],
       [0.86074948, 0.13925052],
       [0.01737489, 0.98262511],
       [0.86789557, 0.13210443],
       [0.98573102, 0.01426898],
       [0.05267593, 0.94732407],
       [0.86121931, 0.13878069],
       [0.99062798, 0.00937202]])
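
Each row of `predict_proba` holds the probability of class 0 (did not survive) and class 1 (survived), and the predicted label is the class with the larger probability. The sketch below recovers the labels from the probabilities printed above, using only plain Python lists.

```python
proba = [
    [0.05386788, 0.94613212],
    [0.89121738, 0.10878262],
    [0.99028614, 0.00971386],
    [0.86074948, 0.13925052],
    [0.01737489, 0.98262511],
    [0.86789557, 0.13210443],
    [0.98573102, 0.01426898],
    [0.05267593, 0.94732407],
    [0.86121931, 0.13878069],
    [0.99062798, 0.00937202],
]
classes = [0, 1]  # column order of the probabilities

# Pick the class whose probability is largest in each row
predicted = [classes[row.index(max(row))] for row in proba]
print(predicted)  # [1, 0, 0, 0, 1, 0, 0, 1, 0, 0]
```

The result matches the output of `model.predict(X_test[:10])`.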