#### Classification | MVP

# Predicting Heart Disease<a id='top'></a> 


## **Analysis Goal**  
[Research question](#1)

## **Process**

Classification metric:
* AUC = determining the 'most at risk' (say, the top 100) by ordering patients by likelihood
* F1/recall = providing a concrete label (either at risk or not at risk)
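
The two use cases can be told apart with a small sketch. The labels and probabilities below are made-up illustrative values, not model output from this notebook, and the cutoff `k` and `threshold` are arbitrary assumptions.

```python
import numpy as np

# Hypothetical true labels and predicted probabilities for 8 patients
# (illustrative values only, not taken from the dataset).
y_true = np.array([1, 0, 1, 0, 0, 1, 0, 0])
y_prob = np.array([0.90, 0.80, 0.70, 0.40, 0.35, 0.30, 0.20, 0.10])

# Ranking use case: pick the k patients with the highest predicted risk.
k = 3
top_k = np.argsort(-y_prob)[:k]
hits_in_top_k = int(y_true[top_k].sum())

# Labeling use case: turn probabilities into at-risk / not-at-risk labels.
threshold = 0.5
y_pred = (y_prob >= threshold).astype(int)
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))
recall = tp / (tp + fn)

print(f"true cases among top {k}: {hits_in_top_k}")
print(f"recall at threshold {threshold}: {recall:.2f}")
```

The ranking view only depends on the order of the probabilities, which is what AUC measures; the labeling view depends on where the threshold sits, which is what recall and F1 measure.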


[Dataset](#2)

## **Preliminary Visualization**
[Visualization](#3)

## **Preliminary Conclusions**
[Conclusion](#4)


In [1]:
import pandas as pd
import numpy as np
# import imblearn.over_sampling
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


from sklearn.compose import make_column_transformer
from sklearn.ensemble import AdaBoostRegressor, RandomForestClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn import metrics  # later cells call metrics.accuracy_score, etc.
from sklearn.metrics import *
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler, OrdinalEncoder
from sklearn.svm import SVC, SVR
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import shuffle
# from xgboost import XGBClassifier

# import plotly.express as px
# import plotly.graph_objects as go
# from plotly.subplots import make_subplots

## 1. Research Question<a id='1'></a> 

* **RQ:** Could a model predict the probability of a patient having heart disease based on the risk factors in electronic health records?
* **Data source:** [Personal Key Indicators of Heart Disease](https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease)
* **Error metric:** Recall
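
A short worked example of why recall, rather than accuracy, is a sensible metric here. It uses two figures from the `df.describe()` output further down (319,795 rows and a `HeartDisease` mean of 0.085595 after 1/0 coding); the all-negative "model" is a hypothetical baseline, not something fitted in this notebook.

```python
# Figures from the df.describe() output in this notebook.
n_rows = 319795
positive_rate = 0.085595

n_pos = round(n_rows * positive_rate)  # approximate number of positive cases
n_neg = n_rows - n_pos

# Hypothetical baseline: label every patient as "no heart disease".
tp, fn = 0, n_pos
tn, fp = n_neg, 0

accuracy = (tp + tn) / n_rows
recall = tp / (tp + fn)

print(f"accuracy: {accuracy:.3f}, recall: {recall:.3f}")
```

Because fewer than one in ten respondents has heart disease, this baseline looks accurate while finding none of the patients the model is meant to find.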


[back to top](#top)

## 2. Dataset: [Personal Key Indicators of Heart Disease](https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease)<a id='2'></a>  


In [2]:
df = pd.read_csv('heart_2020_cleaned.csv')

In [None]:
df.head()

In [None]:
df.info()

In [14]:
# find nulls
df.isnull().sum()

HeartDisease        0
BMI                 0
Smoking             0
AlcoholDrinking     0
Stroke              0
PhysicalHealth      0
MentalHealth        0
DiffWalking         0
Sex                 0
AgeCategory         0
Race                0
Diabetes            0
PhysicalActivity    0
GenHealth           0
SleepTime           0
Asthma              0
KidneyDisease       0
SkinCancer          0
dtype: int64

In [15]:
# summary statistics on numeric columns
df.describe()

statistic,HeartDisease,BMI,Smoking,AlcoholDrinking,Stroke,PhysicalHealth,MentalHealth,DiffWalking,Sex,AgeCategory,Race,Diabetes,PhysicalActivity,GenHealth,SleepTime,Asthma,KidneyDisease,SkinCancer
count,319795.0,319795.0,319795.0,319795.0,319795.0,319795.0,319795.0,319795.0,319795.0,319795.0,319795.0,319795.0,319795.0,319795.0,319795.0,319795.0,319795.0,319795.0
mean,0.085595,28.325399,0.412477,0.068097,0.03774,3.37171,3.898366,0.13887,1.475273,52.440945,5.396742,0.207205,0.775362,3.595028,7.097075,0.134061,0.036833,0.093244
std,0.279766,6.3561,0.492281,0.251912,0.190567,7.95085,7.955235,0.345812,0.499389,18.069747,1.212208,0.554528,0.417344,1.042918,1.436007,0.340718,0.188352,0.290775
min,0.0,12.02,0.0,0.0,0.0,0.0,0.0,0.0,1.0,18.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0
25%,0.0,24.03,0.0,0.0,0.0,0.0,0.0,0.0,1.0,40.0,6.0,0.0,1.0,3.0,6.0,0.0,0.0,0.0
50%,0.0,27.34,0.0,0.0,0.0,0.0,0.0,0.0,1.0,55.0,6.0,0.0,1.0,4.0,7.0,0.0,0.0,0.0
75%,0.0,31.42,1.0,0.0,0.0,2.0,3.0,0.0,2.0,65.0,6.0,0.0,1.0,4.0,8.0,0.0,0.0,0.0
max,1.0,94.85,1.0,1.0,1.0,30.0,30.0,1.0,2.0,80.0,6.0,3.0,1.0,5.0,24.0,1.0,1.0,1.0


In [3]:
# rename column to not refer to people by their disease
df.rename(columns = {'Diabetic':'Diabetes'}, inplace = True)


In [13]:
# list unique values by column to see what needs to be coded with numbers

for col in df:
    print(col, df[col].unique())

In [5]:
# code Y/N to 1/0
#     HeartDisease
#     Smoking
#     AlcoholDrinking
#     Stroke
#     DiffWalking
#     Diabetes -> adjust in next cell for 'No, borderline diabetes' and 'Yes (during pregnancy)'
#     PhysicalActivity
#     Asthma
#     KidneyDisease
#     SkinCancer

df = df.replace({'Yes': 1, 'No': 0}) 

In [10]:
# code categories to nums Diabetes, Sex, Race (alpha), GenHealth (poor 1, excellent 5)

df = df.replace({'Yes (during pregnancy)': 2,           #Diabetes
                 'No, borderline diabetes': 3,  
                 'Female': 1,                           #Sex 
                 'Male': 2,                             
                 'American Indian/Alaskan Native': 1,  #Race  
                 'Asian':2,                     
                 'Black':3,                      
                 'Hispanic':4,                   
                 'Other': 5,                     
                 'White': 6,                     
                 'Poor': 1,                             #GenHealth
                 'Fair': 2,                     
                 'Good': 3,                     
                 'Very good': 4,                
                 'Excellent': 5})               


In [6]:
# code AgeCategory to lowest age of category

df = df.replace({'18-24':18,
             '25-29':25, 
             '30-34':30, 
             '35-39':35, 
             '40-44':40, 
             '45-49':45, 
             '50-54':50,
             '55-59':55,
             '60-64':60,
             '65-69':65,
             '70-74':70,
             '75-79':75,    
             '80 or older':80})

In [11]:
# list unique values by column again to confirm everything is now coded with numbers

for col in df:
    print(col, df[col].unique())

HeartDisease [0 1]
BMI [16.6  20.34 26.58 ... 62.42 51.46 46.56]
Smoking [1 0]
AlcoholDrinking [0 1]
Stroke [0 1]
PhysicalHealth [ 3.  0. 20. 28.  6. 15.  5. 30.  7.  1.  2. 21.  4. 10. 14. 18.  8. 25.
 16. 29. 27. 17. 24. 12. 23. 26. 22. 19.  9. 13. 11.]
MentalHealth [30.  0.  2.  5. 15.  8.  4.  3. 10. 14. 20.  1.  7. 24.  9. 28. 16. 12.
  6. 25. 17. 18. 21. 29. 22. 13. 23. 27. 26. 11. 19.]
DiffWalking [0 1]
Sex [1 2]
AgeCategory [55 80 65 75 40 70 60 50 45 18 35 30 25]
Race [6 3 2 1 5 4]
Diabetes [1 0 3 2]
PhysicalActivity [1 0]
GenHealth [4 2 3 1 5]
SleepTime [ 5.  7.  8.  6. 12.  4.  9. 10. 15.  3.  2.  1. 16. 18. 14. 20. 11. 13.
 17. 24. 19. 21. 22. 23.]
Asthma [1 0]
KidneyDisease [0 1]
SkinCancer [1 0]


Ordered features such as age and general health are coded with a linear mapping to integers.

Compare the performance of this linear-mapped approach with a dummy-variable approach.

The logistic_exercise notebook may help with understanding the relationship between the linear nature of features and their predictive power.
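
A minimal sketch of the two encodings being compared, on a toy frame. The rows are made up; only the `AgeCategory` labels and the lowest-age mapping come from this notebook, and `pd.get_dummies` is one possible way to build the dummy variables.

```python
import pandas as pd

# Toy frame with a few AgeCategory labels from the dataset (rows are made up).
toy = pd.DataFrame({"AgeCategory": ["18-24", "55-59", "80 or older", "55-59"]})

# Linear-mapped approach: one numeric column, so the model treats age as a
# single ordered quantity (each band is mapped to its lowest age).
age_map = {"18-24": 18, "55-59": 55, "80 or older": 80}
linear = toy["AgeCategory"].map(age_map).to_frame("AgeCategory")

# Dummy-variable approach: one indicator column per band (minus a baseline),
# so the model can fit a separate effect for each band.
dummies = pd.get_dummies(toy, columns=["AgeCategory"], drop_first=True)

print(linear.shape, dummies.shape)
```

The linear mapping keeps the feature count small but assumes risk changes steadily with age; the dummy version drops that assumption at the cost of more columns.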



In [16]:
# separate target from select features
y = df['HeartDisease']

X = df.loc[:, ['BMI', 'Smoking', 'AlcoholDrinking', 'Stroke',
       'PhysicalHealth', 'MentalHealth', 'DiffWalking', 'Sex', 'AgeCategory',
       'Race', 'Diabetes', 'PhysicalActivity', 'GenHealth', 'SleepTime',
       'Asthma', 'KidneyDisease', 'SkinCancer']]


In [17]:
y.info()

<class 'pandas.core.series.Series'>
RangeIndex: 319795 entries, 0 to 319794
Series name: HeartDisease
Non-Null Count   Dtype
--------------   -----
319795 non-null  int64
dtypes: int64(1)
memory usage: 2.4 MB


In [18]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 319795 entries, 0 to 319794
Data columns (total 17 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   BMI               319795 non-null  float64
 1   Smoking           319795 non-null  int64  
 2   AlcoholDrinking   319795 non-null  int64  
 3   Stroke            319795 non-null  int64  
 4   PhysicalHealth    319795 non-null  float64
 5   MentalHealth      319795 non-null  float64
 6   DiffWalking       319795 non-null  int64  
 7   Sex               319795 non-null  int64  
 8   AgeCategory       319795 non-null  int64  
 9   Race              319795 non-null  int64  
 10  Diabetes          319795 non-null  int64  
 11  PhysicalActivity  319795 non-null  int64  
 12  GenHealth         319795 non-null  int64  
 13  SleepTime         319795 non-null  float64
 14  Asthma            319795 non-null  int64  
 15  KidneyDisease     319795 non-null  int64  
 16  SkinCancer        31

In [19]:
# split test data set
X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    y, 
                                                    test_size=0.2, 
                                                    random_state=42)

In [None]:
sns.pairplot(pd.concat([X_train, y_train], axis=1), hue='HeartDisease');


In [None]:
# pickle df
clean_df = df 
clean_df.to_pickle('clean_df.pkl')


In [None]:
# csv df
clean_df.to_csv(r'/Users/sandraparedes/Dropbox/Mac/Downloads/clean_df.csv', index=False)


https://github.com/laramillernm/Metis-Classification-Project/blob/main/TelcoChurnFinal.ipynb

In [None]:
# change "total charges" to numeric
df['Total_Charges'] = pd.to_numeric(df['Total_Charges'], errors='coerce')

# change zip code to string
df['Zip_Code'] = df['Zip_Code'].astype('str')


In [None]:
# create dummy variables
base = pd.get_dummies(data=dftelco_base, columns=['Gender', 'Senior_Citizen', 'Partner', 'Dependents', 'Phone_Service', 'Multiple_Lines', 'Internet_Service',
       'Online_Security', 'Online_Backup', 'Device_Protection', 'Tech_Support',
       'Streaming_TV', 'Streaming_Movies', 'Contract', 'Paperless_Billing',
       'Payment_Method'], drop_first=True)



In [None]:
# base model with all features

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

y_pred_lr = logreg.predict(X_test)
y_prob_pred_test = logreg.predict_proba(X_test)

print(f1_score(y_test, y_pred_lr, average="macro"))

In [None]:
classify_logreg = classification_report(y_test, y_pred_lr)
print(classify_logreg)


In [None]:
# feature scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# fit decision tree classification to the training set
classifier = DecisionTreeClassifier(criterion = 'gini', random_state = 0)
classifier.fit(X_train, y_train)


In [None]:
# predicting test set
y_pred_dt = classifier.predict(X_test)
print(f1_score(y_test, y_pred_dt, average="macro"))

classify_dt = classification_report(y_test, y_pred_dt)
print(classify_dt)

https://github.com/hyewonjng/Metis-Vaccination/blob/main/codes/2_classification_models.ipynb

In [None]:
# split X and y twice for train, test, validate sets
y = df_2.h1n1_vaccine
X = df_2.drop(labels = ['h1n1_vaccine','respondent_id','seasonal_vaccine'], axis = 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, random_state = 42, stratify= y)
X_train, X_validate, y_train, y_validate = train_test_split(X_train, y_train, test_size = .25, random_state = 42)

In [None]:
# scale train data
std_scale = StandardScaler()
X_train_scaled = std_scale.fit_transform(X_train)

# fit and score naive bayes Bernoulli model on train scaled set
nb = BernoulliNB()
nb.fit(X_train_scaled, y_train)
nb.score(X_train_scaled, y_train)

# scale the validate set with the scaler fitted on train (do not refit it)
X_validate_scaled = std_scale.transform(X_validate)

# score the train-fitted naive bayes Bernoulli model on validate scaled set
nb.score(X_validate_scaled, y_validate)

In [None]:
# predict on validate scaled set and score with all metrics

y_predict = nb.predict(X_validate_scaled) 
print("Accuracy:",metrics.accuracy_score(y_validate, y_predict))
print("Precision:",metrics.precision_score(y_validate, y_predict))
print("Recall:",metrics.recall_score(y_validate, y_predict))
print("F1:",metrics.f1_score(y_validate, y_predict))

In [None]:
# Treat ordinal variables so that they are encoded in the correct order.

education_lvl = [['< 12 Years','12 Years','Some College','College Graduate']]
age_lvl = [['18 - 34 Years','35 - 44 Years','45 - 54 Years','55 - 64 Years','65+ Years']]
income_lvl = [['Below Poverty','<= $75,000, Above Poverty','> $75,000']]

transformer = make_column_transformer(
    (OrdinalEncoder(categories=education_lvl), ['education']),
    (OrdinalEncoder(categories=age_lvl), ['age_group']),
    (OrdinalEncoder(categories=income_lvl), ['income_poverty'])
)

transformer.fit_transform(df_2)

In [None]:
# creating dummy variables for categorical features
df_2 = pd.get_dummies(df_2, columns =['race','sex','income_poverty',
                                    'marital_status','rent_or_own','employment_status','census_msa','hhs_geo_region',
                                     'employment_industry','employment_occupation'], drop_first = True)

In [None]:
# creating target and feature variables
y = df_2.h1n1_vaccine
X = df_2.drop(labels = ['h1n1_vaccine','respondent_id','seasonal_vaccine'], axis = 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, random_state = 42, stratify = y)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size = .25, random_state = 42)

In [None]:
std_scale = StandardScaler()
X_train_scaled = std_scale.fit_transform(X_train)

logit = LogisticRegression(C=1000) # setting C very high essentially removes regularization
logit.fit(X_train_scaled, y_train)

y_predict = logit.predict(X_train_scaled) 
logit.score(X_train_scaled, y_train)

In [None]:
# scale the validation set with the scaler fitted on train (do not refit it)
X_val_scaled = std_scale.transform(X_val)

# score the train-fitted logistic regression on the validation set
logit.score(X_val_scaled, y_val)

In [None]:
y_pred = logit.predict(X_val_scaled) 

print("Accuracy:",metrics.accuracy_score(y_val, y_pred))
print("Precision:",metrics.precision_score(y_val, y_pred))
print("Recall:",metrics.recall_score(y_val, y_pred))
print("f1:",metrics.f1_score(y_val, y_pred))


In [None]:
fpr, tpr, thresholds = roc_curve(y_val, logit.predict_proba(X_val_scaled)[:,1])

plt.plot(fpr, tpr,lw=2)
plt.plot([0,1],[0,1],c='violet',ls='--')
plt.xlim([-0.05,1.05])
plt.ylim([-0.05,1.05])


plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve');
print("ROC AUC score = ", roc_auc_score(y_val, logit.predict_proba(X_val_scaled)[:,1]))

In [None]:
# Treating Class Imbalance
import imblearn.over_sampling

# set up the sampling_strategy argument for SMOTE: triple the positive class
n_pos = np.sum(y_train == 1)
n_neg = np.sum(y_train == 0)
ratio = {1 : n_pos * 3, 0 : n_neg} 

smote = imblearn.over_sampling.SMOTE(sampling_strategy = ratio, random_state = 42)

X_train_smote, y_train_smote = smote.fit_resample(X_train_scaled, y_train)

nb_smote = BernoulliNB() 
nb_smote.fit(X_train_smote, y_train_smote)

print('Naive Bayes on SMOTE Train Data; Validation Recall: %.3f, Validation AUC: %.3f' % \
      (recall_score(y_val, nb_smote.predict(X_val_scaled)), 
       roc_auc_score(y_val, nb_smote.predict_proba(X_val_scaled)[:,1])))



In [None]:
# Feature importance

importance = logit.coef_[0]

# summarize feature importance
for i, v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i, v))

# plot feature importance
plt.bar(range(len(importance)), importance)
plt.show()

In [None]:
def make_confusion_matrix(model, threshold=0.5):
    # Predict class 1 if probability of being in class 1 is greater than threshold
    # (model.predict(X_test) does this automatically with a threshold of 0.5)
    y_predict = (model.predict_proba(X_test_scaled)[:, 1] >= threshold)
    fraud_confusion = confusion_matrix(y_test, y_predict)
    plt.figure(dpi=80)
    sns.heatmap(fraud_confusion, cmap=plt.cm.BuGn, annot=True, square=True, fmt='d',
           xticklabels=['non-vaccinated', 'vaccinated'],
           yticklabels=['non-vaccinated', 'vaccinated']);
    plt.xlabel('prediction')
    plt.ylabel('actual')

make_confusion_matrix(rf) #rf = random forest model


In [None]:
y_pred=rf.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred))
print("Recall:",metrics.recall_score(y_test, y_pred))
print("f1:",metrics.f1_score(y_test, y_pred))


https://github.com/emichaelbernardo/titanic/blob/main/Classification.ipynb

In [None]:
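# simplify Age to 4 groups: Child (0-12), Teen (13-17), Adult (18-64), Senior (65+)

def encodeAge(age):
    # conditions are checked in order; the first one that matches wins
    conditions = [age < 13,
                  age < 18,
                  age < 65,
                  age < 100 ]

    values = ['Child','Teen','Adult','Senior']
    # missing or out-of-range ages fall back to 'Adult'
    return np.select(conditions, values, default='Adult') 

# np.select is vectorized, so pass the whole column at once
df_passengers['AgeGroup'] = encodeAge(df_passengers.Age)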

In [None]:
# Look at survival rate by Sex, Age and Pclass

age = pd.cut(df_passengers['Age'], [0, 12, 17, 64, 80])
df_passengers.pivot_table('Survived', ['Sex', age], 'Pclass')

In [None]:
# Look at survival rate by Sex, Age and Embarked

df_passengers.pivot_table('Survived', ['Sex', age], 'Embarked')

In [None]:
# Start Visualizing the Data

cols = ['AgeGroup', 'Sex', 'Pclass', 'SibSp', 'Parch', 'Embarked']

n_rows = 2
n_cols = 3

# The subplot grid and the figure size of each graph
# This returns a Figure (fig) and an Axes Object (axs)
fig, axs = plt.subplots(n_rows, n_cols, figsize=(n_cols*3.2,n_rows*3.2))

for r in range(0,n_rows):
    for c in range(0,n_cols):  
        
        i = r*n_cols+ c # index to go through the number of columns       
        ax = axs[r][c]  # Show where to position each subplot
        sns.countplot(x=cols[i], hue="Survived", data=df_passengers, ax=ax)
        ax.set_title(f'Survival by {cols[i]}' )
        ax.legend(title="Survived", loc='upper right') 
        
plt.tight_layout()  

In [None]:
# Plot the survival rate of each class.
sns.barplot(x='Pclass', y='Survived', data=df_passengers)

In [None]:
#Plot the survival rate of each Sex.
sns.barplot(x='Sex', y='Survived', data=df_passengers)

In [None]:
# Look at survival probability by AgeGroup and Sex
sns.barplot(x = 'AgeGroup', y ='Survived', hue='Sex', data = df_passengers)
plt.ylabel('Survival Probability')
plt.title('Survival Probability by AgeGroup')

In [None]:
# Look at survival probability by Embarked and Sex
sns.barplot(x = 'Embarked', y ='Survived', hue='Sex', data = df_passengers)
plt.ylabel('Survival Probability')
plt.title('Survival Probability by Embarked')

In [None]:
# View distribution of Passengers
# sns.factorplot was renamed to sns.catplot in newer seaborn releases
sns.catplot(y = 'Age', x = 'Sex', hue = 'Pclass', kind = 'box', data = df_passengers).set(title='Distribution by Age, Sex and Pclass')
sns.catplot(y = 'Age', x = 'Parch', hue='Sex', kind = 'box', data = df_passengers).set(title='Distribution by Age and Parch')
sns.catplot(y = 'Age', x = 'SibSp', kind = 'box', data = df_passengers).set(title='Distribution by Age and SibSp')
sns.catplot(y = 'Age', x = 'Embarked', kind = 'box', data = df_passengers).set(title='Distribution by Age and Embarked')


In [None]:
## Numerically encode categorical features

def encode_categorical_features(df):
    labelencoder = LabelEncoder()
    
    # Numerically encode Sex
    df.Sex = labelencoder.fit_transform(df.Sex.values)
    
    # Numerically encode Embarked
    df.Embarked = labelencoder.fit_transform(df.Embarked.values)
    
    # Numerically encode AgeGroup
    df.AgeGroup = labelencoder.fit_transform(df.AgeGroup.values)

# Encode categorical features
encode_categorical_features(df_modeling)
df_modeling



In [None]:
# Create functions to score models

def accuracy(actuals, preds):
    return np.mean(actuals == preds)

def precision(actuals, preds):
    tp = np.sum((actuals == 1) & (preds == 1))
    fp = np.sum((actuals == 0) & (preds == 1))
    return tp / (tp + fp)

def recall(actuals, preds):
    tp = np.sum((actuals == 1) & (preds == 1))
    fn = np.sum((actuals == 1) & (preds == 0))
    return tp / (tp + fn)

def F1(actuals, preds):
    p, r = precision(actuals, preds), recall(actuals, preds)
    return 2*p*r / (p + r)


[back to top](#top)

## 3. Preliminary Visualization<a id='3'></a> 


[back to top](#top)

## 4. Preliminary Conclusions<a id='4'></a> 


[back to top](#top)