#### Classification | MVP

# Predicting Heart Disease<a id='top'></a> 


## **Analysis Goal**  
[Research question](#1)

## **Process**

Classification metric:
* AUC = determining the 'most at risk' (say, the top 100) by ordering patients by predicted likelihood
* F1/recall = providing a concrete label (either at risk or not at risk)


[Dataset](#2)

## **Preliminary Visualization**
[Visualization](#3)

## **Preliminary Conclusions**
[Conclusion](#4)


In [None]:
import pandas as pd
import numpy as np
# import imblearn.over_sampling
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


from sklearn.compose import make_column_transformer
from sklearn.ensemble import AdaBoostRegressor, RandomForestClassifier
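from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
                             f1_score, precision_score, recall_score,
                             roc_auc_score, roc_curve)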
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler, OrdinalEncoder
from sklearn.svm import SVC, SVR
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import shuffle
# from xgboost import XGBClassifier

# import plotly.express as px
# import plotly.graph_objects as go
# from plotly.subplots import make_subplots

## 1. Research Question<a id='1'></a> 

* **RQ:** Could a model predict the probability of a patient having heart disease based on the risk factors in electronic health records?
* **Data source:** [Personal Key Indicators of Heart Disease](https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease)
* **Error metric:** Recall
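
The difference between the two metric families can be sketched on a handful of made-up predictions. The labels and probabilities below are hypothetical, not taken from the dataset; the point is only that AUC scores the ranking, while recall depends on where the threshold is set.

```python
import numpy as np
from sklearn.metrics import recall_score, roc_auc_score

# Hypothetical toy data: 1 = heart disease, 0 = no heart disease
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90])

# AUC scores the ranking of the probabilities; no threshold is involved
auc = roc_auc_score(y_true, y_prob)

# Recall scores hard labels, so it changes with the chosen threshold
recall_at_50 = recall_score(y_true, (y_prob >= 0.5).astype(int))
recall_at_30 = recall_score(y_true, (y_prob >= 0.3).astype(int))

# "Most at risk" list: indices of the 3 highest predicted probabilities
top_3 = np.argsort(y_prob)[::-1][:3]

print(f'AUC: {auc:.3f}, recall@0.5: {recall_at_50:.3f}, recall@0.3: {recall_at_30:.3f}')
print('Top 3 most at risk:', top_3)
```

Lowering the threshold raises recall at the cost of more false positives, which is the trade-off to weigh when recall is the chosen metric.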


[back to top](#top)

## 2. Dataset: [Personal Key Indicators of Heart Disease](https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease)<a id='2'></a>  


In [None]:
df = pd.read_csv('heart_2020_cleaned.csv')

In [None]:
df.head()

In [None]:
df.info()

In [None]:
# find nulls
df.isnull().sum()

In [None]:
# summary statistics on numeric columns
df.describe()

### Rename columns

In [None]:
# rename column to not refer to people by their disease
df.rename(columns = {'Diabetic':'Diabetes'}, inplace = True)

# rename column for readability
df.rename(columns = {'AlcoholDrinking':'Alcohol', 
                     'Smoking': 'Tobacco',
                     'AgeCategory':'Age', 
                     'PhysicalHealth': 'Health_Physical', 
                     'MentalHealth':'Health_Mental',
                     'GenHealth': 'Health_General',
                     'DiffWalking': 'Walking',
                     'SleepTime': 'Sleep',
                     'PhysicalActivity':'Activity',
                     'KidneyDisease': 'Kidney',
                     'SkinCancer': 'Skin'}, inplace = True)


In [None]:
# list unique values by column to see what needs to be coded with numbers/dummy variables

for col in df:
    print(col, df[col].unique())

### 1 | Map values

In [None]:
df_num = df.copy()

In [None]:
# code Y/N to 1/0
#     HeartDisease
#     Tobacco
#     Alcohol 
#     Stroke
#     Walking 
#     Diabetes -> adjust in next cell for 'borderline diabetes' 'Yes (during pregnancy)'
#     Activity
#     Asthma 
#     Kidney 
#     Skin

df_num = df_num.replace({'Yes': 1, 'No': 0}) 

In [None]:
# code categories to nums Diabetes, Sex, Race (alpha), Health_General (poor 1, excellent 5)

df_num = df_num.replace({'Yes (during pregnancy)': 2,           #Diabetes
                 'No, borderline diabetes': 3,  
                 'Female': 1,                                   #Sex 
                 'Male': 2,                             
                 'American Indian/Alaskan Native': 1,           #Race  
                 'Asian':2,                     
                 'Black':3,                      
                 'Hispanic':4,                   
                 'Other': 5,                     
                 'White': 6,                     
                 'Poor': 1,                                     #Health_General
                 'Fair': 2,                     
                 'Good': 3,                     
                 'Very good': 4,                
                 'Excellent': 5})               


In [None]:
# code Age to lowest age of category

df_num = df_num.replace({'18-24':18,
             '25-29':25, 
             '30-34':30, 
             '35-39':35, 
             '40-44':40, 
             '45-49':45, 
             '50-54':50,
             '55-59':55,
             '60-64':60,
             '65-69':65,
             '70-74':70,
             '75-79':75,    
             '80 or older':80})
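
A possible alternative to frame-wide `replace` is to scope each mapping to its own column, so that a value appearing in more than one column cannot be recoded by accident. The sketch below uses a hypothetical two-row frame with the renamed columns; `health_map` and `sex_map` are illustrative names, and the codes match the ones used above.

```python
import pandas as pd

# Hypothetical two-row frame with the renamed columns used above
toy = pd.DataFrame({'Age': ['18-24', '80 or older'],
                    'Health_General': ['Poor', 'Very good'],
                    'Sex': ['Female', 'Male']})

health_map = {'Poor': 1, 'Fair': 2, 'Good': 3, 'Very good': 4, 'Excellent': 5}
sex_map = {'Female': 1, 'Male': 2}

toy_num = toy.assign(
    # lowest age of the category: the leading two digits of the label
    Age=toy['Age'].str[:2].astype(int),
    Health_General=toy['Health_General'].map(health_map),
    Sex=toy['Sex'].map(sex_map),
)

print(toy_num)
```

With `Series.map`, any value missing from the dictionary becomes `NaN`, which makes an incomplete mapping easy to spot with `isnull().sum()`.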

In [None]:
# list unique values by column to verify cols coded with numbers

for col in df_num :
    print(col, df_num[col].unique())

In [None]:
df_num.head(10)

### 2 | Dummy variables

In [None]:
df_dmy = df.copy()

In [None]:
# dummy variables for the categorical columns (numeric columns are commented out and left as-is)
df_dmy = pd.get_dummies(data=df_dmy, 
                        columns=['Sex',                         #Demographics
                                 'Age',
                                 'Race',           
                                 'Activity',                    #Health behaviors
#                                  'Sleep', 
                                 'Alcohol', 
                                 'Tobacco',
#                                  'Health_Physical',           #Health
#                                  'Health_Mental', 
                                 'Health_General', 
                                 'Walking',
#                                  'BMI',
                                 'Asthma',                      #Chronic disease                     
                                 'Diabetes',
                                 'Kidney',
                                 'Skin',
                                 'Stroke'],
                        drop_first=True)
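
As a small illustration of what `pd.get_dummies` does to the column names, here is a sketch on a hypothetical three-row frame. With `drop_first=True` the first category of each encoded column (in sorted order) is dropped and becomes the reference level, and the original column is replaced by prefixed dummy columns.

```python
import pandas as pd

# Hypothetical toy frame; values are made up for illustration
toy = pd.DataFrame({'Sex': ['Female', 'Male', 'Female'],
                    'Race': ['Asian', 'White', 'Black'],
                    'BMI': [22.1, 30.4, 27.8]})

toy_dmy = pd.get_dummies(toy, columns=['Sex', 'Race'], drop_first=True)

# 'Sex' and 'Race' no longer exist; 'Female' and 'Asian' are the reference levels
print(toy_dmy.columns.tolist())
```

Because the original names disappear, any later column selection has to use the new dummy column names or select by exclusion.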


In [None]:
df_dmy.head(10)  

### 3 | X, y sets for mapped `y_num` `X_num` & dummy `y_dmy` `X_dmy`

In [None]:
# separate target from select features using mapped variables

y_num = df_num['HeartDisease'] 

X_num = df_num.loc[:, ['Sex',            #Demographics
               'Age',
               'Race',           
               'Activity',               #Health behaviors
               'Sleep', 
               'Alcohol', 
               'Tobacco',
               'Health_Physical',        #Health
               'Health_Mental', 
               'Health_General', 
               'Walking',
               'BMI',
               'Asthma',                 #Chronic disease
               'Diabetes',
               'Kidney',
               'Skin',
               'Stroke']]


# separate target from features using dummy variables
# get_dummies replaced the categorical columns with prefixed dummy columns,
# so keep every column except the target instead of selecting by the original names
# (y_dmy is still coded 'Yes'/'No' because df_dmy was copied from df)
y_dmy = df_dmy['HeartDisease']

X_dmy = df_dmy.drop(columns=['HeartDisease'])


In [None]:
# split test data set using mapped variables
X_train_n, X_test_n, y_train_n, y_test_n = train_test_split(X_num, 
                                                    y_num, 
                                                    test_size=0.2, 
                                                    random_state=42)

# split test data set
X_train_d, X_test_d, y_train_d, y_test_d = train_test_split(X_dmy, 
                                                    y_dmy, 
                                                    test_size=0.2, 
                                                    random_state=42)
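
Since the target is imbalanced, one option worth considering is the `stratify` argument of `train_test_split`, which keeps the positive rate the same in both splits. This is a sketch on a hypothetical target with a 10% positive rate, not a change to the splits above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical target with a 10% positive rate (assumption for illustration)
y_toy = np.array([1] * 10 + [0] * 90)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2,
                                          random_state=42, stratify=y_toy)

# positive rate in each split
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, the rate in a small test set can drift away from the rate in the full data purely by chance.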

In [None]:
# baseline rate of target using the mean of training data
# (y_train_n is coded 1/0, y_train_d is coded 'Yes'/'No')

print(f'Baseline probability of heart disease (num): {np.mean(y_train_n == 1):.2%}')
print(f"Baseline probability of heart disease (dmy): {np.mean(y_train_d == 'Yes'):.2%}")


In [None]:
# sns.pairplot(pd.concat([X_train, y_train], axis=1), hue='HeartDisease');


In [None]:
# pickle df, df_num, df_dmy

heart_disease_df = df 
heart_disease_df.to_pickle('heart_disease_df.pkl')

heart_disease_df_num = df_num
heart_disease_df_num.to_pickle('heart_disease_df_num.pkl')

heart_disease_df_dmy = df_dmy
heart_disease_df_dmy.to_pickle('heart_disease_df_dmy.pkl')


In [None]:
# csv df, df_num, df_dmy

heart_disease_df.to_csv(r'/Users/sandraparedes/Documents/GitHub/metis_dsml/heart_disease_df.csv', index=False)
heart_disease_df_num.to_csv(r'/Users/sandraparedes/Documents/GitHub/metis_dsml/heart_disease_df_num.csv', index=False)
heart_disease_df_dmy.to_csv(r'/Users/sandraparedes/Documents/GitHub/metis_dsml/heart_disease_df_dmy.csv', index=False)



https://github.com/laramillernm/Metis-Classification-Project/blob/main/TelcoChurnFinal.ipynb

In [None]:
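# model with all features
# use the mapped (numeric) split for the models below
X_train, X_test, y_train, y_test = X_train_n, X_test_n, y_train_n, y_test_n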

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

y_pred_lr = logreg.predict(X_test)
y_prob_pred_test = logreg.predict_proba(X_test)

print(f1_score(y_test, y_pred_lr, average="macro"))


# classification report 

classify_logreg = classification_report(y_test, y_pred_lr)
print(classify_logreg)


In [None]:
# scale X_train and X_test

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# fit decision tree to X_train, y_train

classifier = DecisionTreeClassifier(criterion = 'gini', random_state = 0)
classifier.fit(X_train, y_train)


In [None]:
# predict on X_test

y_pred_dt = classifier.predict(X_test)
print(f1_score(y_test, y_pred_dt, average="macro"))


# classification report 

classify_dt = classification_report(y_test, y_pred_dt)
print(classify_dt)


https://github.com/hyewonjng/Metis-Vaccination/blob/main/codes/2_classification_models.ipynb

In [None]:
# split X and y twice for train, validate and test sets (60/20/20)

# target and features from the mapped data
y = df_num['HeartDisease']
X = df_num.drop(columns = ['HeartDisease'])

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = .2, random_state = 42, stratify= y)

X_train, X_validate, y_train, y_validate = train_test_split(X_train, y_train, 
                                                            test_size = .25, random_state = 42)


In [None]:
# BernoulliNB() 

# scale X_train 
std_scale = StandardScaler()
X_train_scaled = std_scale.fit_transform(X_train)

# fit and score naive bayes Bernoulli model on X_train_scaled, y_train
nb = BernoulliNB()
nb.fit(X_train_scaled, y_train)
print('Train accuracy:', nb.score(X_train_scaled, y_train))

# validate naive bayes Bernoulli model
# reuse the scaler fit on the training data; refitting it on the validation data leaks information
X_validate_scaled = std_scale.transform(X_validate)

# score the model fit on the training data on X_validate_scaled, y_validate
print('Validation accuracy:', nb.score(X_validate_scaled, y_validate))
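
One way to keep the scaling statistics tied to the training data is to wrap the scaler and the model in a scikit-learn pipeline. The sketch below uses synthetic data from `make_classification` as a stand-in for the heart disease features; the names `X_toy`, `y_toy` and `nb_pipe` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart disease features (assumption, not the real data)
X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X_toy, y_toy, test_size=0.25,
                                            random_state=42, stratify=y_toy)

# fit() learns the scaling statistics and the model from the training rows only
nb_pipe = make_pipeline(StandardScaler(), BernoulliNB())
nb_pipe.fit(X_tr, y_tr)

# score() reuses the training statistics to transform the validation rows
val_accuracy = nb_pipe.score(X_val, y_val)
print('Validation accuracy:', val_accuracy)
```

The same pipeline object can be passed to `GridSearchCV`, which then repeats the fit-on-train-only scaling inside every fold.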


In [None]:
# BernoulliNB() 
# predict on X_validate_scaled and score y_validate, y_predict with all metrics

y_predict = nb.predict(X_validate_scaled) 

print("Accuracy:", accuracy_score(y_validate, y_predict))
print("Precision:", precision_score(y_validate, y_predict))
print("Recall:", recall_score(y_validate, y_predict))
print("F1:", f1_score(y_validate, y_predict))

In [None]:
#LogisticRegression()
# scale X_train

std_scale = StandardScaler()
X_train_scaled = std_scale.fit_transform(X_train)

logit = LogisticRegression(C=1000) # high C removes regularization
logit.fit(X_train_scaled, y_train)

y_predict = logit.predict(X_train_scaled) 
logit.score(X_train_scaled, y_train)


In [None]:
#LogisticRegression()
# scale X_validate with the scaler fit on X_train

X_validate_scaled = std_scale.transform(X_validate)

# score the model fit on the training data; refitting on the validation data would overstate performance
logit.score(X_validate_scaled, y_validate)


In [None]:
#LogisticRegression()
# predict on X_validate_scaled and score y_validate, y_predict with all metrics

y_predict = logit.predict(X_validate_scaled) 

print("Accuracy:", accuracy_score(y_validate, y_predict))
print("Precision:", precision_score(y_validate, y_predict))
print("Recall:", recall_score(y_validate, y_predict))
print("f1:", f1_score(y_validate, y_predict))


In [None]:
# predicted probability of the positive class on the validation set
y_prob_validate = logit.predict_proba(X_validate_scaled)[:,1]

fpr, tpr, thresholds = roc_curve(y_validate, y_prob_validate)

plt.plot(fpr, tpr,lw=2)
plt.plot([0,1],[0,1],c='violet',ls='--')
plt.xlim([-0.05,1.05])
plt.ylim([-0.05,1.05])


plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve');
print("ROC AUC score = ", roc_auc_score(y_validate, y_prob_validate))
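
Because recall is the chosen metric, the arrays returned by `roc_curve` can also be used to pick a decision threshold that reaches a target recall. This is a minimal sketch on hypothetical labels and probabilities; `target_recall` is an illustrative value, not one set anywhere in this notebook.

```python
import numpy as np
from sklearn.metrics import recall_score, roc_curve

# Hypothetical toy data: 1 = heart disease, 0 = no heart disease
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90])
target_recall = 0.9

fpr, tpr, thresholds = roc_curve(y_true, y_prob)

# tpr is recall; thresholds are in decreasing order, so the first index that
# meets the target is the highest threshold that does
idx = np.argmax(tpr >= target_recall)
threshold = thresholds[idx]

y_label = (y_prob >= threshold).astype(int)
recall = recall_score(y_true, y_label)
print(f'threshold: {threshold:.2f}, recall: {recall:.3f}, false positive rate: {fpr[idx]:.3f}')
```

The threshold should be chosen on the validation set and then applied unchanged to the test set.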


In [None]:
# Class imbalance
# requires the imbalanced-learn package
from imblearn.over_sampling import SMOTE

# target class counts for the sampling_strategy argument of SMOTE
n_pos = np.sum(y_train == 1)
n_neg = np.sum(y_train == 0)
ratio = {1 : n_pos * 3, 0 : n_neg} # oversample the positive class to 3x its size

smote = SMOTE(sampling_strategy = ratio, random_state = 42)

X_train_smote, y_train_smote = smote.fit_resample(X_train_scaled, y_train)

nb_smote = BernoulliNB() 
nb_smote.fit(X_train_smote, y_train_smote)

print('Bernoulli NB on SMOTE Train Data; Validation Recall: %.3f, Validation AUC: %.3f' % \
      (recall_score(y_validate, nb_smote.predict(X_validate_scaled)), 
       roc_auc_score(y_validate, nb_smote.predict_proba(X_validate_scaled)[:,1])))
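
A different technique from SMOTE, and one that needs no extra package, is class weighting: `LogisticRegression(class_weight='balanced')` reweights the loss instead of resampling the rows. The sketch below compares it with an unweighted model on synthetic imbalanced data; it is an illustration of the alternative, not a result for the heart disease data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data, roughly 10% positives (assumption, not the real data)
X_toy, y_toy = make_classification(n_samples=2000, n_features=8,
                                   weights=[0.9, 0.1], random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X_toy, y_toy, test_size=0.25,
                                            random_state=42, stratify=y_toy)

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
# 'balanced' weights each class inversely to its frequency in y_tr
weighted = LogisticRegression(class_weight='balanced', max_iter=1000).fit(X_tr, y_tr)

recall_plain = recall_score(y_val, plain.predict(X_val))
recall_weighted = recall_score(y_val, weighted.predict(X_val))
print(f'recall plain: {recall_plain:.3f}, weighted: {recall_weighted:.3f}')
```

Either approach should be applied to the training data only, so the validation and test sets keep the real class balance.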


In [None]:
# Feature importance

importance = logit.coef_[0]

# summarize feature importance
for i,v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i,v))
    
# plot feature importance
plt.bar([x for x in range(len(importance))], importance)
plt.show()
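
The loop above reports coefficients by position. A possible refinement is to label them with feature names and sort by absolute size, which is easier to read when there are many features. The sketch uses synthetic data and hypothetical feature names (`f_a` to `f_d`); coefficients are only comparable across features when the inputs are on the same scale, hence the `StandardScaler`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with hypothetical feature names
feature_names = ['f_a', 'f_b', 'f_c', 'f_d']
X_toy, y_toy = make_classification(n_samples=300, n_features=4, n_informative=2,
                                   n_redundant=0, random_state=42)
X_toy = StandardScaler().fit_transform(X_toy)

toy_logit = LogisticRegression().fit(X_toy, y_toy)

# label each coefficient, then order by absolute size
coefs = pd.Series(toy_logit.coef_[0], index=feature_names)
ranked = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
print(ranked)
```

For the notebook's own model, the index would be the column names of the feature frame used for fitting.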

In [None]:
def make_confusion_matrix(model, X, y, threshold=0.5):
    
    # Predict class 1 if probability of being in class 1 is at least the threshold
    # (model.predict(X) does this automatically with a threshold of 0.5)
    
    y_predict = (model.predict_proba(X)[:, 1] >= threshold).astype(int)
    confusion = confusion_matrix(y, y_predict)
    plt.figure(dpi=80)
    sns.heatmap(confusion, cmap=plt.cm.BuGn, annot=True, square=True, fmt='d',
           xticklabels=['no heart disease', 'heart disease'],
           yticklabels=['no heart disease', 'heart disease']);
    plt.xlabel('prediction')
    plt.ylabel('actual')

# rf = random forest model, fit here because no cell above defines it
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

make_confusion_matrix(rf, X_test, y_test)


In [None]:
y_pred = rf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("f1:", f1_score(y_test, y_pred))


https://github.com/emichaelbernardo/titanic/blob/main/Classification.ipynb

In [None]:
# Look at survival rate by Sex, Age and Pclass

age = pd.cut(df_passengers['Age'], [0, 12, 17, 64, 80])
df_passengers.pivot_table('Survived', ['Sex', age], 'Pclass')

In [None]:
# Look at survival rate by Sex, Age and Embarked

df_passengers.pivot_table('Survived', ['Sex', age], 'Embarked')

In [None]:
# visualize data

cols = ['AgeGroup', 'Sex', 'Pclass', 'SibSp', 'Parch', 'Embarked']

n_rows = 2
n_cols = 3

# The subplot grid and the figure size of each graph
# This returns a Figure (fig) and an Axes Object (axs)
fig, axs = plt.subplots(n_rows, n_cols, figsize=(n_cols*3.2,n_rows*3.2))

for r in range(0,n_rows):
    for c in range(0,n_cols):  
        
        i = r*n_cols+ c # index to go through the number of columns       
        ax = axs[r][c]  # Show where to position each subplot
        sns.countplot(data=df_passengers, x=cols[i], hue="Survived", ax=ax)
        ax.set_title(f'Survival by {cols[i]}' )
        ax.legend(title="Survived", loc='upper right') 
        
plt.tight_layout()  

In [None]:
# Plot the survival rate of each class
sns.barplot(x='Pclass', y='Survived', data=df_passengers)

In [None]:
#Plot the survival rate of each Sex
sns.barplot(x='Sex', y='Survived', data=df_passengers)

In [None]:
# Look at survival probability by AgeGroup and Sex
sns.barplot(x = 'AgeGroup', y ='Survived', hue='Sex', data = df_passengers)
plt.ylabel('Survival Probability')
plt.title('Survival Probability by AgeGroup')

In [None]:
# Look at survival probability by Embarked and Sex
sns.barplot(x = 'Embarked', y ='Survived', hue='Sex', data = df_passengers)
plt.ylabel('Survival Probability')
plt.title('Survival Probability by Embarked')

In [None]:
# View distribution of passengers
# sns.factorplot was renamed to sns.catplot and removed from later seaborn releases
sns.catplot(y = 'Age', x = 'Sex', hue = 'Pclass', kind = 'box', data = df_passengers).set(title='Distribution by Age, Sex and Pclass')
sns.catplot(y = 'Age', x = 'Parch', hue='Sex', kind = 'box', data = df_passengers).set(title='Distribution by Age and Parch')
sns.catplot(y = 'Age', x = 'SibSp', kind = 'box', data = df_passengers).set(title='Distribution by Age and SibSp')
sns.catplot(y = 'Age', x = 'Embarked', kind = 'box', data = df_passengers).set(title='Distribution by Age and Embarked')


In [None]:
# functions to score models

def accuracy(actuals, preds):
    return np.mean(actuals == preds)

def precision(actuals, preds):
    tp = np.sum((actuals == 1) & (preds == 1))
    fp = np.sum((actuals == 0) & (preds == 1))
    return tp / (tp + fp)

def recall(actuals, preds):
    tp = np.sum((actuals == 1) & (preds == 1))
    fn = np.sum((actuals == 1) & (preds == 0))
    return tp / (tp + fn)

def F1(actuals, preds):
    p, r = precision(actuals, preds), recall(actuals, preds)
    return 2*p*r / (p + r)
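
A quick way to check hand-written scoring functions like these is to compare them with scikit-learn on a small example. The sketch below repeats the same counting logic in a hypothetical helper, `confusion_counts`, so that it runs on its own; the labels are made up. Note that the inputs need to be NumPy arrays (or pandas Series), since `==` on plain Python lists returns a single boolean rather than an element-wise comparison.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def confusion_counts(actuals, preds):
    # count true positives, false positives and false negatives for class 1
    tp = np.sum((actuals == 1) & (preds == 1))
    fp = np.sum((actuals == 0) & (preds == 1))
    fn = np.sum((actuals == 1) & (preds == 0))
    return tp, fp, fn

# Hypothetical labels; NumPy arrays so that == and & work element-wise
actuals = np.array([1, 0, 1, 1, 0, 0, 1, 0])
preds = np.array([1, 0, 0, 1, 1, 1, 1, 0])

tp, fp, fn = confusion_counts(actuals, preds)
p, r = tp / (tp + fp), tp / (tp + fn)
manual = {'accuracy': np.mean(actuals == preds),
          'precision': p,
          'recall': r,
          'f1': 2 * p * r / (p + r)}

reference = {'accuracy': accuracy_score(actuals, preds),
             'precision': precision_score(actuals, preds),
             'recall': recall_score(actuals, preds),
             'f1': f1_score(actuals, preds)}

for name in manual:
    print(name, round(float(manual[name]), 4), round(float(reference[name]), 4))
```

The manual versions divide by zero when a model predicts no positives (`tp + fp == 0`) or when there are no actual positives, a case the scikit-learn functions handle through their `zero_division` argument.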


[back to top](#top)

## 3. Preliminary Visualization<a id='3'></a> 


[back to top](#top)

## 4. Preliminary Conclusions<a id='4'></a> 


[back to top](#top)