# Assignment 9: Bayesian Analysis

### Conditional Probability 

Q.1 Explain in a few words what Naive Bayes is. Why is it considered naive?

In [None]:
# Naive Bayes is a classification algorithm based on Bayes' theorem: it combines a prior belief about each class with the conditional probability of the features given that class.
# It is called naive because it assumes that all the predictors are conditionally independent of one another given the class.
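
As a rough illustration of the independence assumption, the sketch below computes a posterior by hand for two binary features. Every probability in it is a made-up number chosen for illustration, not an estimate from `shingles.csv`, and the dictionary names are hypothetical.

```python
# Hypothetical numbers for illustration only; not estimated from shingles.csv.
prior = {"rash": 0.5, "no_rash": 0.5}        # P(class)
p_chills = {"rash": 0.8, "no_rash": 0.4}     # P(Chills = yes | class)
p_blisters = {"rash": 0.6, "no_rash": 0.2}   # P(Blisters = yes | class)

# Naive assumption: the features are conditionally independent given the class,
# so the joint likelihood is the product of the per-feature likelihoods.
unnormalized = {c: prior[c] * p_chills[c] * p_blisters[c] for c in prior}

# Bayes' theorem: divide by the evidence P(Chills = yes, Blisters = yes).
evidence = sum(unnormalized.values())
posterior = {c: unnormalized[c] / evidence for c in unnormalized}
print(posterior)
```

If the features were in fact strongly dependent on each other, multiplying their likelihoods would double-count evidence, which is the price paid for the simplification.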


In [24]:
import pandas as pd

data = pd.read_csv('shingles.csv')
print(data.info())
data.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1841 entries, 0 to 1840
Data columns (total 6 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Rash                     1841 non-null   object
 1   SwollenLymphNode         1841 non-null   object
 2   Chills                   1841 non-null   object
 3   PolymeraseChainReaction  1841 non-null   object
 4   VZVAntibodyTest          1841 non-null   object
 5   Blisters                 1841 non-null   object
dtypes: object(6)
memory usage: 86.4+ KB
None


Unnamed: 0,Rash,SwollenLymphNode,Chills,PolymeraseChainReaction,VZVAntibodyTest,Blisters
0,no,no,no,no,pos,no
1,yes,no,no,no,neg,no
2,no,no,no,no,neg,no
3,no,no,no,no,neg,no
4,no,no,no,no,neg,no


Q.2. Does this data contain any missing values?

In [25]:
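# Count missing values across all columns instead of hard-coding the answer
n_missing = data.isna().sum().sum()
print("No missing values" if n_missing == 0 else f"{n_missing} missing values")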

No missing values
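
A caveat worth sketching: `isna()` only detects true missing values, so a malformed label such as the `"po"` handled in Q.3 passes unnoticed. The toy frame below is invented for illustration (it is not taken from `shingles.csv`) and shows one way to list the distinct labels per column.

```python
import pandas as pd

# Toy frame standing in for shingles.csv; the values are invented for illustration.
toy = pd.DataFrame({
    "Rash": ["yes", "no", "no", "yes"],
    "VZVAntibodyTest": ["pos", "neg", "po", "neg"],  # "po" is a malformed label
})

# isna() only counts true missing values (NaN/None), so it reports zero here.
total_missing = int(toy.isna().sum().sum())
print("missing:", total_missing)

# Listing the distinct labels per column exposes the stray "po".
labels = {col: sorted(toy[col].unique()) for col in toy.columns}
print(labels)
```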


Q.3. Split the data into 70/30 train test

In [28]:
from sklearn.model_selection import train_test_split

# Convert text values to numeric for the models.
# "po" appears in the data; I am interpreting it as a typo for "pos".
mapping = {"yes": 1, "no": 0, "pos": 1, "po": 1, "neg": 0}
# astype(int) makes the dtype explicit and raises if any label was left unmapped
data = data.replace(mapping).astype(int)

y = data["Rash"]
X = data.drop("Rash", axis=1, inplace=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


Q.4. Train a Gaussian Naive Bayes model, a Multinomial Naive Bayes and a Bernoulli Naive Bayes on the dataset to predict Rash. Compute the accuracy for each. Explain your results. 

In [156]:
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

BernNB = BernoulliNB()
BernNB.fit(X_train, y_train)
y_pred = BernNB.predict(X_test)
print("Bernoulli score:", accuracy_score(y_test, y_pred))

GausNB = GaussianNB()
GausNB.fit(X_train, y_train)
y_pred = GausNB.predict(X_test)
print("Gaussian score:", accuracy_score(y_test, y_pred))

MultiNB = MultinomialNB()
MultiNB.fit(X_train, y_train)
y_pred = MultiNB.predict(X_test)
print("Multinomial score:", accuracy_score(y_test, y_pred))

# Explanation: The accuracy scores for these three models are quite low, at around 51-53%. I think further hyperparameter (alpha)
# tuning is needed to raise them. Note that alpha applies only to the Bernoulli and Multinomial models; GaussianNB has no alpha.


Bernoulli score: 0.5135623869801085
Gaussian score: 0.5117540687160941
Multinomial score: 0.5334538878842676
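
To judge whether accuracies like these are meaningfully low, one option is to compare against a trivial baseline that always predicts the majority class. The sketch below uses synthetic data whose features carry no signal (an assumption made purely for illustration; it says nothing about the class balance of the real dataset) and scikit-learn's `DummyClassifier`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

# Synthetic binary data (assumption): the features carry no signal about the target.
rng = np.random.default_rng(42)
X_demo = rng.integers(0, 2, size=(1000, 5))
y_demo = rng.integers(0, 2, size=1000)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

# Baseline that always predicts the most frequent class in the training labels.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_acc = accuracy_score(y_te, baseline.predict(X_te))

# A real model is only useful to the extent that it beats this baseline.
model_acc = accuracy_score(y_te, BernoulliNB().fit(X_tr, y_tr).predict(X_te))
print("Baseline accuracy:", baseline_acc)
print("BernoulliNB accuracy:", model_acc)
```

Running the same comparison on the actual `X_train`/`y_train` split would show how far the three Naive Bayes scores sit above the majority-class rate.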


Q.5. Utilizing Pipeline and GridSearchCV, use 5 different alpha values to train a Bernoulli Naive Bayes and Multinomial Naive Bayes on the dataset. Print out the accuracy for each, and explain your results.

In [154]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
import numpy as np

# Normally would've had a preprocessing component, perhaps with LabelBinarizer, but it was already performed above.
pipeline_multi = Pipeline(steps=[
    ('MultiNB', MultinomialNB())
])

pipeline_bern = Pipeline(steps=[
    ('BernNB', BernoulliNB()),
])

# Alphas to try out
alphas = np.array([0.01, 0.1, 1, 10, 100])

# Bernoulli grid search best params
bern_grid = GridSearchCV(estimator=pipeline_bern, param_grid=dict(BernNB__alpha=alphas), cv=10)
bern_grid.fit(X_train, y_train)
y_pred_bern = bern_grid.predict(X_test)
print("Bernoulli best alpha:", bern_grid.best_params_)
print("Bernoulli training score:", bern_grid.best_score_)
print("Bernoulli accuracy score for best model against Test Data:", accuracy_score(y_test, y_pred_bern))
print("\n")

# Multinomial grid search best params
multi_grid = GridSearchCV(estimator=pipeline_multi, param_grid=dict(MultiNB__alpha=alphas), cv=10)
multi_grid.fit(X_train, y_train)
y_pred_multi = multi_grid.predict(X_test)
print("Multinomial best alpha:", multi_grid.best_params_)
print("Multinomial training score:", multi_grid.best_score_)
print("Multinomial accuracy score with best model against Test Data:", accuracy_score(y_test, y_pred_multi))

# Tuning alpha gave a slightly higher test accuracy for both models:
# from 51.4% to 53.7% for Bernoulli, using an alpha of 100
# from 53.3% to 53.5% for Multinomial, using an alpha of 100
# Before hyperparameter tuning, Multinomial was the best model; after tuning alpha, Bernoulli is marginally ahead on the test data.
# The 'training score' printed above is best_score_, the mean cross-validated accuracy of the best alpha, not an accuracy on the test data.
# I increased cv from the default of 5 to 10 after experimenting with different numbers of folds; 10 seemed to give better results.

Bernoulli best alpha: {'BernNB__alpha': 100.0}
Bernoulli training score: 0.5776465600775194
Bernoulli accuracy score for best model against Test Data: 0.5370705244122965


Multinomial best alpha: {'MultiNB__alpha': 100.0}
Multinomial training score: 0.5853682170542636
Multinomial accuracy score with best model against Test Data: 0.5352622061482821
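
Q.5 asks for the accuracy at each alpha, and the fitted grid search keeps those values in `cv_results_`. The sketch below shows one way to read them out; it runs on synthetic stand-in data (an assumption, not `shingles.csv`) with `cv=5`, so the numbers it prints are not comparable to the results above.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the binary-encoded data (assumption, not shingles.csv).
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(300, 5))
y_demo = rng.integers(0, 2, size=300)

alphas = [0.01, 0.1, 1, 10, 100]
grid = GridSearchCV(
    estimator=Pipeline(steps=[("BernNB", BernoulliNB())]),
    param_grid={"BernNB__alpha": alphas},
    cv=5,
    scoring="accuracy",
)
grid.fit(X_demo, y_demo)

# cv_results_ holds one mean cross-validated score per candidate alpha.
per_alpha = dict(zip(alphas, grid.cv_results_["mean_test_score"]))
for alpha, score in per_alpha.items():
    print(f"alpha={alpha}: mean CV accuracy={score:.3f}")

# best_score_ is the largest of those means, not an accuracy on held-out test data.
print("best_score_:", grid.best_score_)
```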


## Inference in Bayesian networks

Q.6. Create a new text cell in your Notebook: Complete a 50-100 word summary (or short description of your thinking in applying this week's learning to the solution) of your experience in this assignment. Include:

- What was your incoming experience with this model, if any?
- What steps you took and what obstacles you encountered.
- How you link this exercise to real-world machine learning problem-solving. (What steps were missing? What else do you need to learn?)

This summary allows your instructor to know how you are doing and allot points for your effort in thinking and planning, and making connections to real-world work.


In [2]:
# Enter summary here
# No incoming experience. I'm not entirely sure why we had to use pipelines or what the expectation was for how to use them.
# All of the guides online showed some preprocessing, but since I had already done that above, it would've been redundant (I
# mentioned this in the comments, along with how I would've approached it otherwise). An interesting thing I noticed while playing
# with this data was that GridSearchCV reports a best score based on the training data, not an accuracy score against the
# test data. I nearly presented 'best_score_' as my accuracy until I realized this. It makes sense why this happens,
# but the attribute name was misleading. The documentation says, "Mean cross-validated score of the best_estimator".
# I liked experimenting with various parameters to discover how the data could yield different results, which is very
# real world in my opinion. I think what was missing from this assignment was further clarification on what
# we needed to do with the 'Pipeline' class.
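
On the role of `Pipeline` raised in the summary: one possible use, sketched here with invented toy data rather than rows from `shingles.csv`, is to put the encoding step inside the pipeline so that the raw string labels can be passed straight to `fit` and `predict`. The choice of `OneHotEncoder` is an assumption for illustration, not necessarily what the assignment intended.

```python
import pandas as pd
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy raw data with string labels (invented values, not rows from shingles.csv).
X_raw = pd.DataFrame({
    "Chills": ["yes", "no", "no", "yes", "no", "yes"],
    "VZVAntibodyTest": ["pos", "neg", "neg", "pos", "neg", "pos"],
})
y_raw = pd.Series(["yes", "no", "no", "yes", "no", "yes"], name="Rash")

# The encoder is fitted inside the pipeline, so during cross-validation it only
# sees the training folds, and the same encoding is reused at predict time.
pipe = Pipeline(steps=[
    ("encode", OneHotEncoder(handle_unknown="ignore")),
    ("BernNB", BernoulliNB()),
])
pipe.fit(X_raw, y_raw)

new_patient = pd.DataFrame({"Chills": ["yes"], "VZVAntibodyTest": ["pos"]})
preds = pipe.predict(new_patient)
print(preds)
```

A pipeline built this way can be handed to `GridSearchCV` unchanged, with the alpha grid addressed as `BernNB__alpha` just as in Q.5.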