# Module 5 - Bayesian analysis - Exercises

## Analysis of a Heart Disease Dataset

In this exercise, your task is to understand which factors distinguish patients with heart disease from patients without it.

The goal is to find trends in the heart data that help predict cardiovascular events or give clear indications of heart health, using both a Naive Bayes classifier and a Bayesian network.




In [2]:
# Numerical Data Manipulation libraries
import pandas as pd
import numpy as np
import statistics as stat

# Figure Plotting libraries
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import seaborn as sns
sns.set()

# Naive Bayes libraries
import sklearn
from sklearn.naive_bayes import BernoulliNB      # Naive Bayes Classifier based on a Bernoulli Distribution
from sklearn.naive_bayes import GaussianNB       # Naive Bayes Classifier based on a Gaussian Distribution
from sklearn.naive_bayes import MultinomialNB    # Naive Bayes Classifier based on a Multinomial Distribution

# Machine Learning libraries
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

# Text Analysis libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Bayesian Networks libraries
import pyAgrum as bn_graphs
import pyAgrum as gum
import pyAgrum.lib.notebook as gnb
from pyAgrum.lib.bn2roc import showROC


In [16]:
# some useful auxiliary functions

# this is a very helpful function to discretise continuous data
# the Bayesian networks we learn here work with discrete variables only; they cannot handle continuous values
# given a dataset that has continuous variables, we need to discretise them first
def discretize_dataframe( df, num_bins, class_var ):
    r=np.array(range(num_bins+1))/(1.0*num_bins)
  
    # quantile-based bins are built using pandas.qcut
    # the class column is not binned; it is copied (rounded to 2 decimals)
    l=[]
    
    for col in df.columns.values:
        
        if col!=class_var:
            l.append( pd.DataFrame( pd.qcut( df[col],r, duplicates='drop',precision=2),columns=[col]))
        else:
            l.append( pd.DataFrame( np.round(df[col],2),columns=[col]))
    
    treated=pd.concat(l, join='outer', axis=1)
    return treated
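To see what the quantile binning inside `discretize_dataframe` does to a single column, here is a minimal sketch. The values and the `toy` frame are made up purely for illustration (they are not taken from `heart.csv`); the `pd.qcut` call mirrors the one in the helper above.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only (not taken from heart.csv)
toy = pd.DataFrame({"chol": [180, 200, 210, 230, 250, 260, 300, 354]})

num_bins = 4
# Quantile edges 0.0, 0.25, 0.5, 0.75, 1.0, built the same way as `r` in the helper
edges = np.array(range(num_bins + 1)) / (1.0 * num_bins)

# Each value is replaced by the quantile interval it falls into
binned = pd.qcut(toy["chol"], edges, duplicates="drop", precision=2)

# With 8 distinct values and 4 quantile bins, every bin receives 2 values
print(binned.value_counts().sort_index())
```

Because the bins are quantile-based, they hold roughly equal numbers of records rather than spanning equal value ranges; `duplicates='drop'` merges bins whose edges coincide, so a column with many repeated values can end up with fewer than `num_bins` intervals.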


### The dataset

Feature information (column names in the dataset are given in backticks):

- `age`: age
- `sex`: sex
- `cp`: chest pain type (4 values)
- `trestbps`: resting blood pressure
- `chol`: serum cholesterol in mg/dl
- `fbs`: fasting blood sugar > 120 mg/dl
- `restecg`: resting electrocardiographic results (values 0,1,2)
- `thalach`: maximum heart rate achieved
- `exang`: exercise induced angina
- `oldpeak`: ST depression induced by exercise relative to rest
- `slope`: the slope of the peak exercise ST segment
- `ca`: number of major vessels (0-3) colored by fluoroscopy
- `thal`: 3 = normal; 6 = fixed defect; 7 = reversible defect
- `target`: 0 = no heart disease; 1 = heart disease

In [12]:
# path to dataset
path = "data/heart.csv"

# variable we want to predict
class_var = "target"

# load dataset
data = pd.read_csv(path)
data

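|     | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|-----|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|
| 0   | 63  | 1   | 3  | 145      | 233  | 1   | 0       | 150     | 0     | 2.3     | 0     | 0  | 1    | 1      |
| 1   | 37  | 1   | 2  | 130      | 250  | 0   | 1       | 187     | 0     | 3.5     | 0     | 0  | 2    | 1      |
| 2   | 41  | 0   | 1  | 130      | 204  | 0   | 0       | 172     | 0     | 1.4     | 2     | 0  | 2    | 1      |
| 3   | 56  | 1   | 1  | 120      | 236  | 0   | 1       | 178     | 0     | 0.8     | 2     | 0  | 2    | 1      |
| 4   | 57  | 0   | 0  | 120      | 354  | 0   | 1       | 163     | 1     | 0.6     | 2     | 0  | 2    | 1      |
| ... | ... | ... | ... | ...     | ...  | ... | ...     | ...     | ...   | ...     | ...   | ... | ... | ...    |
| 298 | 57  | 0   | 0  | 140      | 241  | 0   | 1       | 123     | 1     | 0.2     | 1     | 0  | 3    | 0      |
| 299 | 45  | 1   | 3  | 110      | 264  | 0   | 1       | 132     | 0     | 1.2     | 1     | 0  | 3    | 0      |
| 300 | 68  | 1   | 0  | 144      | 193  | 1   | 1       | 141     | 0     | 3.4     | 1     | 2  | 3    | 0      |
| 301 | 57  | 1   | 0  | 130      | 131  | 0   | 1       | 115     | 1     | 1.2     | 1     | 1  | 3    | 0      |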


In [None]:
# how many features and how many datapoints does this dataset have?
num_rows = 
num_columns = 
print("There are a total of %d data records with %d features\n" %(num_rows,num_columns))

# what are the features in this dataset?
feature_names = 
print(feature_names)

# what are the class labels (the distinct values of the target variable) in this dataset?
labels = 
print(labels)
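One possible way to fill in the cell above is sketched here. So that the snippet runs on its own, it uses a small stand-in frame called `sample` (a hypothetical name) built from three of the rows displayed earlier; in the notebook, the same calls would be applied to `data`. Counting the target column as a label rather than a feature is an assumption about what the exercise intends.

```python
import pandas as pd

# Stand-in for `data`: three rows copied from the table displayed above
columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]
rows = [
    [63, 1, 3, 145, 233, 1, 0, 150, 0, 2.3, 0, 0, 1, 1],
    [37, 1, 2, 130, 250, 0, 1, 187, 0, 3.5, 0, 0, 2, 1],
    [57, 0, 0, 140, 241, 0, 1, 123, 1, 0.2, 1, 0, 3, 0],
]
sample = pd.DataFrame(rows, columns=columns)
class_var = "target"

# .shape returns (number of rows, number of columns)
num_rows, num_columns = sample.shape

# every column except the class variable is treated as a feature (assumption)
feature_names = [col for col in sample.columns if col != class_var]

# distinct values of the class variable
labels = sorted(sample[class_var].unique().tolist())

print("There are a total of %d data records with %d features\n" % (num_rows, len(feature_names)))
print(feature_names)
print(labels)
```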

In [5]:
# check the distribution of the target classes (0 = no heart disease, 1 = heart disease) in the dataset
data.groupby("target").count()

| target | age | sex | cp  | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca  | thal |
|--------|-----|-----|-----|----------|------|-----|---------|---------|-------|---------|-------|-----|------|
| 0      | 138 | 138 | 138 | 138      | 138  | 138 | 138     | 138     | 138   | 138     | 138   | 138 | 138  |
| 1      | 165 | 165 | 165 | 165      | 165  | 165 | 165     | 165     | 165   | 165     | 165   | 165 | 165  |


### Predicting heart disease using Naive Bayes

In this exercise, you are expected to apply a Naive Bayes classifier to predict whether or not a person has heart disease.


In [None]:
# STEP 1: create the training set and the test set
X_train, X_test, y_train, y_test = 

# STEP 2: specify the learning algorithm
model = 

# STEP 3: fit the model to the training data


# STEP 4: make predictions on test set
y_prediction = 

# STEP 5: Measure the accuracy of the model as a percentage
# (note: accuracy_score returns a fraction between 0 and 1, so multiply by 100)
accuracy = 

print( 'The overall accuracy of the model is %.2f%%' %(accuracy))
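The five steps above can be sketched as follows. This is only one possible approach, under several assumptions: the data is synthetic (generated with NumPy so the snippet runs on its own; the column names `feature_a` and `feature_b` are hypothetical), `GaussianNB` is chosen as one of the three imported classifiers, and the 25% test split and the random seeds are arbitrary. In the notebook, `X` would instead be `data.drop(columns=[class_var])` and `y` would be `data[class_var]`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Synthetic stand-in data (NOT the heart dataset): two numeric features
# whose means shift with a binary class label
rng = np.random.default_rng(0)
n = 200
label = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "feature_a": rng.normal(loc=150 - 20 * label, scale=15),
    "feature_b": rng.normal(loc=1.0 + label, scale=0.8),
    "target": label,
})

X = toy.drop(columns=["target"])
y = toy["target"]

# STEP 1: create the training set and the test set (stratified on the class)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# STEP 2: specify the learning algorithm
model = GaussianNB()

# STEP 3: fit the model to the training data
model.fit(X_train, y_train)

# STEP 4: make predictions on the test set
y_prediction = model.predict(X_test)

# STEP 5: measure the accuracy; accuracy_score returns a fraction,
# so it is multiplied by 100 to match the percentage format below
accuracy = accuracy_score(y_test, y_prediction) * 100

print('The overall accuracy of the model is %.2f%%' % accuracy)
```

`GaussianNB` models each feature as normally distributed within each class, which suits continuous columns better than categorical ones. Since the heart dataset mixes both kinds, comparing it against `BernoulliNB` or `MultinomialNB` on suitably encoded features may be worthwhile.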

### Analyse the data using a Bayesian network

In [None]:
# STEP 1: separate the dataset into continuous variables and discrete variables

continuous_var_df = 

discrete_var_df = 

# STEP 2: discretise the continuous variable data using 4 bins
discretized_df = 

# STEP 3: join the discrete_var_df with the discretized_df into a single dataframe
processed_df = 

# STEP 4: Learn a Bayesian network from the processed_df


# STEP 5: Analyse the BN by observing different variables and checking their impact on the target variable
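Steps 1 to 3 can be sketched with pandas alone. The snippet below is a hedged illustration, not the reference solution: it uses a stand-in frame `sample` (a hypothetical name) built from the rows displayed earlier, and it assumes that `age`, `trestbps`, `chol`, `thalach` and `oldpeak` are the continuous columns. The binning is written inline with `pd.qcut`; in the notebook, calling `discretize_dataframe(continuous_var_df, 4, class_var)` should have the same effect, since the class column is not among the continuous ones.

```python
import pandas as pd

# Stand-in for `data`: the rows displayed in the table above
columns = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"]
rows = [
    [63, 1, 3, 145, 233, 1, 0, 150, 0, 2.3, 0, 0, 1, 1],
    [37, 1, 2, 130, 250, 0, 1, 187, 0, 3.5, 0, 0, 2, 1],
    [41, 0, 1, 130, 204, 0, 0, 172, 0, 1.4, 2, 0, 2, 1],
    [56, 1, 1, 120, 236, 0, 1, 178, 0, 0.8, 2, 0, 2, 1],
    [57, 0, 0, 120, 354, 0, 1, 163, 1, 0.6, 2, 0, 2, 1],
    [57, 0, 0, 140, 241, 0, 1, 123, 1, 0.2, 1, 0, 3, 0],
    [45, 1, 3, 110, 264, 0, 1, 132, 0, 1.2, 1, 0, 3, 0],
    [68, 1, 0, 144, 193, 1, 1, 141, 0, 3.4, 1, 2, 3, 0],
    [57, 1, 0, 130, 131, 0, 1, 115, 1, 1.2, 1, 1, 3, 0],
]
sample = pd.DataFrame(rows, columns=columns)

# Assumption: these five columns are the continuous ones
continuous_cols = ["age", "trestbps", "chol", "thalach", "oldpeak"]

# STEP 1: separate continuous and discrete variables
continuous_var_df = sample[continuous_cols]
discrete_var_df = sample.drop(columns=continuous_cols)

# STEP 2: discretise each continuous column into (at most) 4 quantile bins
num_bins = 4
discretized_df = pd.DataFrame({
    col: pd.qcut(continuous_var_df[col], num_bins, duplicates="drop", precision=2)
    for col in continuous_cols
})

# STEP 3: join both parts column-wise; the row index keeps them aligned
processed_df = pd.concat([discretized_df, discrete_var_df], axis=1)
print(processed_df.head())
```

With `duplicates="drop"`, a column whose quantile edges coincide ends up with fewer than four bins, so it is worth checking `processed_df.nunique()` before learning the network.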



