#### Classification | MVP

# Predicting Heart Disease<a id='top'></a> 


## **Analysis Goal**  
[Research question](#1)

## **Process**
Data source: n = 319,795 records, 17 features (quantitative and qualitative)

Classification metric: recall, with the model providing a concrete label (either at risk or not at risk)


[Dataset](#2)

## **Preliminary Visualization**
[Visualization](#3)

## **Preliminary Conclusions**
[Conclusion](#4)


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


from sklearn.compose import make_column_transformer
from sklearn.ensemble import AdaBoostRegressor, RandomForestClassifier
from sklearn.linear_model import LinearRegression ,LogisticRegression
from sklearn import metrics  # later cells call metrics.accuracy_score, metrics.recall_score, etc.
from sklearn.metrics import *
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler, OrdinalEncoder
from sklearn.svm import SVC ,SVR
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import shuffle

from xgboost import XGBClassifier
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

## 1. Research Question<a id='1'></a> 

* **RQ:** Could a model predict the probability of a patient having heart disease based on the risk factors in electronic health records?
* **Data source:** [Personal Key Indicators of Heart Disease](https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease)
* **Error metric:** Recall
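
A small illustration of why recall, rather than accuracy, suits a rare positive class. This is only a sketch on made-up labels (17 positives out of 200, not the dataset), and the helper name `recall_and_accuracy` is hypothetical.

```python
def recall_and_accuracy(actuals, preds):
    """Return (recall, accuracy) for binary labels coded 0/1."""
    pairs = list(zip(actuals, preds))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    correct = sum(1 for a, p in pairs if a == p)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return recall, correct / len(pairs)


# Illustrative labels only: 17 positives out of 200 (8.5%)
actuals = [1] * 17 + [0] * 183

# A "model" that never flags anyone as at risk
always_negative = [0] * 200

rec, acc = recall_and_accuracy(actuals, always_negative)
print(f"recall={rec:.3f}, accuracy={acc:.3f}")
```

The always-negative predictor looks accurate because most patients are negative, yet it finds none of the at-risk patients, which is the failure recall is meant to expose.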


[back to top](#top)

## 2. Dataset: [Personal Key Indicators of Heart Disease](https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease)<a id='2'></a>  


In [2]:
df = pd.read_csv('heart_2020_cleaned.csv')

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.columns

In [None]:
# find nulls
df.isnull().sum()

In [None]:
# summary statistics on numeric columns
df.describe()

### Rename columns

In [4]:
# rename column for readability
df.rename(columns = {'HeartDisease': 'y_heart_disease',
                     'Smoking': 'behavior_tobacco',
                     'AlcoholDrinking':'behavior_alcohol', 
                     'SleepTime': 'behavior_sleep',
                     'PhysicalActivity':'behavior_activity',
                     'AgeCategory':'demo_age', 
                     'Sex': 'demo_gender',
                     'Race': 'demo_race',
                     'Stroke': 'disease_stroke',
                     'Diabetic':'disease_diabetes',
                     'KidneyDisease': 'disease_kidney',
                     'Asthma': 'disease_asthma',
                     'SkinCancer': 'disease_skin',                     
                     'GenHealth': 'health_general',
                     'BMI': 'health_bmi',
                     'MentalHealth':'health_mental',
                     'PhysicalHealth': 'health_physical', 
                     'DiffWalking': 'health_mobility'}, inplace = True)

df = df.sort_index(axis=1)


In [5]:
# list unique values by column to see what needs to be coded with numbers/dummy variables

for col in df:
    print(col, df[col].unique())

behavior_activity ['Yes' 'No']
behavior_alcohol ['No' 'Yes']
behavior_sleep [ 5.  7.  8.  6. 12.  4.  9. 10. 15.  3.  2.  1. 16. 18. 14. 20. 11. 13.
 17. 24. 19. 21. 22. 23.]
behavior_tobacco ['Yes' 'No']
demo_age ['55-59' '80 or older' '65-69' '75-79' '40-44' '70-74' '60-64' '50-54'
 '45-49' '18-24' '35-39' '30-34' '25-29']
demo_gender ['Female' 'Male']
demo_race ['White' 'Black' 'Asian' 'American Indian/Alaskan Native' 'Other'
 'Hispanic']
disease_asthma ['Yes' 'No']
disease_diabetes ['Yes' 'No' 'No, borderline diabetes' 'Yes (during pregnancy)']
disease_kidney ['No' 'Yes']
disease_skin ['Yes' 'No']
disease_stroke ['No' 'Yes']
health_bmi [16.6  20.34 26.58 ... 62.42 51.46 46.56]
health_general ['Very good' 'Fair' 'Good' 'Poor' 'Excellent']
health_mental [30.  0.  2.  5. 15.  8.  4.  3. 10. 14. 20.  1.  7. 24.  9. 28. 16. 12.
  6. 25. 17. 18. 21. 29. 22. 13. 23. 27. 26. 11. 19.]
health_mobility ['No' 'Yes']
health_physical [ 3.  0. 20. 28.  6. 15.  5. 30.  7.  1.  2. 21.  4. 10. 14. 1

In [24]:
#save cleaned df 

heart_disease_df = df 
heart_disease_df.to_pickle('heart_disease_df.pkl')
heart_disease_df.to_csv(r'/Users/sandraparedes/Documents/GitHub/metis_dsml/04_classification/classification_project/heart_disease_df.csv', index=False)



### 1 | Map values

In [25]:
df_map = heart_disease_df.copy()

In [26]:
# map Y/N to 1/0:
    # behavior_activity ['Yes' 'No']
    # behavior_alcohol ['No' 'Yes']
    # behavior_tobacco ['Yes' 'No']
    # disease_asthma ['Yes' 'No']
    # disease_diabetes ['Yes' 'No' 'No, borderline diabetes' 'Yes (during pregnancy)'] #adjust in next cell
    # disease_kidney ['No' 'Yes']
    # disease_skin ['Yes' 'No']
    # disease_stroke ['No' 'Yes']
    # health_mobility ['No' 'Yes']
    # y_heart_disease ['No' 'Yes']


df_map = df_map.replace({'Yes': 1, 'No': 0}) 

In [27]:
# map: 
    # disease_diabetes ['Yes' 'No' 'No, borderline diabetes' 'Yes (during pregnancy)'] (the two remaining labels)
    # demo_gender ['Female' 'Male']
    # demo_race ['White' 'Black' 'Asian' 'American Indian/Alaskan Native' 'Other' 'Hispanic'] (alpha order)
    # health_general ['Very good' 'Fair' 'Good' 'Poor'=1 'Excellent'=5]

df_map = df_map.replace({'Yes (during pregnancy)': 2,           #Diabetes
                 'No, borderline diabetes': 3,  
                 'Female': 1,                                   #Sex 
                 'Male': 2,                             
                 'American Indian/Alaskan Native': 1,           #Race  
                 'Asian':2,                     
                 'Black':3,                      
                 'Hispanic':4,                   
                 'Other': 5,                     
                 'White': 6,                     
                 'Poor': 1,                                     #Health_General
                 'Fair': 2,                     
                 'Good': 3,                     
                 'Very good': 4,                
                 'Excellent': 5})               
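
The flat dictionaries passed to `df_map.replace` apply to every column at once, which works here because no label is shared between columns with different meanings. A column-scoped variant makes that assumption explicit. The sketch below is one possible approach on a toy frame (illustrative rows; `toy` and `column_maps` are hypothetical names), reusing the same codes as the mapping cells.

```python
import pandas as pd

# Toy frame using the notebook's renamed columns; rows are illustrative
toy = pd.DataFrame({
    "disease_diabetes": ["Yes", "No", "No, borderline diabetes", "Yes (during pregnancy)"],
    "health_general": ["Poor", "Fair", "Very good", "Excellent"],
    "demo_gender": ["Female", "Male", "Male", "Female"],
})

# One mapping per column, mirroring the codes used in this notebook
column_maps = {
    "disease_diabetes": {"No": 0, "Yes": 1,
                         "Yes (during pregnancy)": 2,
                         "No, borderline diabetes": 3},
    "health_general": {"Poor": 1, "Fair": 2, "Good": 3,
                       "Very good": 4, "Excellent": 5},
    "demo_gender": {"Female": 1, "Male": 2},
}

mapped = toy.copy()
for col, mapping in column_maps.items():
    mapped[col] = toy[col].map(mapping)
    # Series.map leaves NaN for any label missing from the mapping,
    # so an incomplete mapping is caught immediately
    if mapped[col].isna().any():
        raise ValueError(f"unmapped value in {col}")

print(mapped)
```

Because each mapping touches only its own column, a label added to one feature later cannot silently recode another.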


In [28]:
# map age to lowest in bin: 
    # demo_age ['55-59' '80 or older' '65-69' '75-79' '40-44' '70-74' '60-64' '50-54' '45-49' '18-24' '35-39' '30-34' '25-29']

df_map = df_map.replace({'18-24':18,
             '25-29':25, 
             '30-34':30, 
             '35-39':35, 
             '40-44':40, 
             '45-49':45, 
             '50-54':50,
             '55-59':55,
             '60-64':60,
             '65-69':65,
             '70-74':70,
             '75-79':75,    
             '80 or older':80})


In [29]:
# list unique values by column to verify mapped properly

for col in df_map :
    print(col, df_map[col].unique())

behavior_activity [1 0]
behavior_alcohol [0 1]
behavior_sleep [ 5.  7.  8.  6. 12.  4.  9. 10. 15.  3.  2.  1. 16. 18. 14. 20. 11. 13.
 17. 24. 19. 21. 22. 23.]
behavior_tobacco [1 0]
demo_age [55 80 65 75 40 70 60 50 45 18 35 30 25]
demo_gender [1 2]
demo_race [6 3 2 1 5 4]
disease_asthma [1 0]
disease_diabetes [1 0 3 2]
disease_kidney [0 1]
disease_skin [1 0]
disease_stroke [0 1]
health_bmi [16.6  20.34 26.58 ... 62.42 51.46 46.56]
health_general [4 2 3 1 5]
health_mental [30.  0.  2.  5. 15.  8.  4.  3. 10. 14. 20.  1.  7. 24.  9. 28. 16. 12.
  6. 25. 17. 18. 21. 29. 22. 13. 23. 27. 26. 11. 19.]
health_mobility [0 1]
health_physical [ 3.  0. 20. 28.  6. 15.  5. 30.  7.  1.  2. 21.  4. 10. 14. 18.  8. 25.
 16. 29. 27. 17. 24. 12. 23. 26. 22. 19.  9. 13. 11.]
y_heart_disease [0 1]


In [30]:
df_map.head(10)

Unnamed: 0,behavior_activity,behavior_alcohol,behavior_sleep,behavior_tobacco,demo_age,demo_gender,demo_race,disease_asthma,disease_diabetes,disease_kidney,disease_skin,disease_stroke,health_bmi,health_general,health_mental,health_mobility,health_physical,y_heart_disease
0,1,0,5.0,1,55,1,6,1,1,0,1,0,16.6,4,30.0,0,3.0,0
1,1,0,7.0,0,80,1,6,0,0,0,0,1,20.34,4,0.0,0,0.0,0
2,1,0,8.0,1,65,2,6,1,1,0,0,0,26.58,2,30.0,0,20.0,0
3,0,0,6.0,0,75,1,6,0,0,0,1,0,24.21,3,0.0,0,0.0,0
4,1,0,8.0,0,40,1,6,0,0,0,0,0,23.71,4,0.0,1,28.0,0
5,0,0,12.0,1,75,1,3,0,0,0,0,0,28.87,2,0.0,1,6.0,1
6,1,0,4.0,0,70,1,6,1,0,0,1,0,21.63,2,0.0,0,15.0,0
7,0,0,9.0,1,80,1,6,1,1,0,0,0,31.64,3,0.0,1,5.0,0
8,0,0,5.0,0,80,1,6,0,3,1,0,0,26.45,2,0.0,0,0.0,0
9,1,0,10.0,0,65,2,6,0,0,0,0,0,40.69,3,0.0,1,0.0,0


In [31]:
# save and pickle df_map

heart_disease_df_map = df_map
heart_disease_df_map.to_pickle('heart_disease_df_map.pkl')
heart_disease_df_map.to_csv(r'/Users/sandraparedes/Documents/GitHub/metis_dsml/04_classification/classification_project/heart_disease_df_map.csv', index=False)


### 2 | Dummy variables

In [32]:
df_dmy = heart_disease_df.copy()

In [33]:
# dummy variables for non-numerical columns (numerical columns are commented out and left as is)
df_dmy = pd.get_dummies(data=df_dmy, 
                        columns=['behavior_activity', 
                                 'behavior_alcohol', 
#                                  'behavior_sleep',
                                 'behavior_tobacco', 
                                 'demo_age', 
                                 'demo_gender', 
                                 'demo_race',
                                 'disease_asthma', 
                                 'disease_diabetes', 
                                 'disease_kidney', 
                                 'disease_skin',
                                 'disease_stroke', 
#                                  'health_bmi', 
                                 'health_general', 
#                                  'health_mental',
                                 'health_mobility', 
#                                  'health_physical', 
                                 'y_heart_disease'],
                        drop_first=True)

# numerical features for reference: 
    # behavior_sleep [ 5.  7.  8.  6. 12.  4.  9. 10. 15.  3.  2.  1. 16. 18. 14. 20. 11. 13.
    # health_bmi [16.6  20.34 26.58 ... 62.42 51.46 46.56]
    # health_mental [30.  0.  2.  5. 15.  8.  4.  3. 10. 14. 20.  1.  7. 24.  9. 28. 16. 12. 6. 25. 17. 18. 21. 29. 22. 13. 23. 27. 26. 11. 19.]
    # health_physical [ 3.  0. 20. 28.  6. 15.  5. 30.  7.  1.  2. 21.  4. 10. 14. 18.  8. 25. 16. 29. 27. 17. 24. 12. 23. 26. 22. 19.  9. 13. 11.] 
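
With `drop_first=True`, `pd.get_dummies` drops the first level of each column in sorted order, and that level becomes the implicit reference category. A minimal sketch on a toy frame (illustrative rows; `dtype=int` is passed only so the output is 0/1 regardless of pandas version):

```python
import pandas as pd

toy = pd.DataFrame({
    "health_general": ["Poor", "Fair", "Good", "Very good", "Excellent"],
    "behavior_sleep": [5.0, 7.0, 8.0, 6.0, 12.0],
})

# Levels sort as Excellent, Fair, Good, Poor, Very good;
# drop_first removes 'Excellent', which becomes the reference level
dummies = pd.get_dummies(toy, columns=["health_general"],
                         drop_first=True, dtype=int)

print(dummies.columns.tolist())
print(dummies)
```

A row whose general health is 'Excellent' is then encoded as all zeros across the `health_general_*` columns, which matches the column list produced for `df_dmy`.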


In [34]:
df_dmy.head(10)  

Unnamed: 0,behavior_sleep,health_bmi,health_mental,health_physical,behavior_activity_Yes,behavior_alcohol_Yes,behavior_tobacco_Yes,demo_age_25-29,demo_age_30-34,demo_age_35-39,...,disease_diabetes_Yes (during pregnancy),disease_kidney_Yes,disease_skin_Yes,disease_stroke_Yes,health_general_Fair,health_general_Good,health_general_Poor,health_general_Very good,health_mobility_Yes,y_heart_disease_Yes
0,5.0,16.6,30.0,3.0,1,0,1,0,0,0,...,0,0,1,0,0,0,0,1,0,0
1,7.0,20.34,0.0,0.0,1,0,0,0,0,0,...,0,0,0,1,0,0,0,1,0,0
2,8.0,26.58,30.0,20.0,1,0,1,0,0,0,...,0,0,0,0,1,0,0,0,0,0
3,6.0,24.21,0.0,0.0,0,0,0,0,0,0,...,0,0,1,0,0,1,0,0,0,0
4,8.0,23.71,0.0,28.0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,1,0
5,12.0,28.87,0.0,6.0,0,0,1,0,0,0,...,0,0,0,0,1,0,0,0,1,1
6,4.0,21.63,0.0,15.0,1,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
7,9.0,31.64,0.0,5.0,0,0,1,0,0,0,...,0,0,0,0,0,1,0,0,1,0
8,5.0,26.45,0.0,0.0,0,0,0,0,0,0,...,0,1,0,0,1,0,0,0,0,0
9,10.0,40.69,0.0,0.0,1,0,0,0,0,0,...,0,0,0,0,0,1,0,0,1,0


In [35]:
df_dmy.rename(columns = {'disease_diabetes_Yes (during pregnancy)': 'disease_diabetes_Yes_pregnancy',
                         'disease_diabetes_No, borderline diabetes': 'disease_diabetes_No_borderline'}, 
              inplace = True)



In [36]:
df_dmy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 319795 entries, 0 to 319794
Data columns (total 38 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   behavior_sleep                  319795 non-null  float64
 1   health_bmi                      319795 non-null  float64
 2   health_mental                   319795 non-null  float64
 3   health_physical                 319795 non-null  float64
 4   behavior_activity_Yes           319795 non-null  uint8  
 5   behavior_alcohol_Yes            319795 non-null  uint8  
 6   behavior_tobacco_Yes            319795 non-null  uint8  
 7   demo_age_25-29                  319795 non-null  uint8  
 8   demo_age_30-34                  319795 non-null  uint8  
 9   demo_age_35-39                  319795 non-null  uint8  
 10  demo_age_40-44                  319795 non-null  uint8  
 11  demo_age_45-49                  319795 non-null  uint8  
 12  demo_age_50-54  

In [40]:
df_dmy.columns

Index(['behavior_sleep', 'health_bmi', 'health_mental', 'health_physical',
       'behavior_activity_Yes', 'behavior_alcohol_Yes', 'behavior_tobacco_Yes',
       'demo_age_25-29', 'demo_age_30-34', 'demo_age_35-39', 'demo_age_40-44',
       'demo_age_45-49', 'demo_age_50-54', 'demo_age_55-59', 'demo_age_60-64',
       'demo_age_65-69', 'demo_age_70-74', 'demo_age_75-79',
       'demo_age_80 or older', 'demo_gender_Male', 'demo_race_Asian',
       'demo_race_Black', 'demo_race_Hispanic', 'demo_race_Other',
       'demo_race_White', 'disease_asthma_Yes',
       'disease_diabetes_No_borderline', 'disease_diabetes_Yes',
       'disease_diabetes_Yes_pregnancy', 'disease_kidney_Yes',
       'disease_skin_Yes', 'disease_stroke_Yes', 'health_general_Fair',
       'health_general_Good', 'health_general_Poor',
       'health_general_Very good', 'health_mobility_Yes',
       'y_heart_disease_Yes'],
      dtype='object')

In [37]:
# save and pickle df_dmy

heart_disease_df_dmy = df_dmy
heart_disease_df_dmy.to_pickle('heart_disease_df_dmy.pkl')
heart_disease_df_dmy.to_csv(r'/Users/sandraparedes/Documents/GitHub/metis_dsml/04_classification/classification_project/heart_disease_df_dmy.csv', index=False)


### 3 | X, y sets for mapped `y_map` `X_map` & dummy `y_dmy` `X_dmy`

In [42]:
# separate target from select features using mapped variables

y_map = df_map['y_heart_disease'] 

X_map = df_map.loc[:, ['behavior_activity', 
                       'behavior_alcohol', 
                       'behavior_sleep',
                       'behavior_tobacco', 
                       'demo_age', 
                       'demo_gender', 
                       'demo_race',
                       'disease_asthma', 
                       'disease_diabetes', 
                       'disease_kidney', 
                       'disease_skin', 
                       'disease_stroke', 
                       'health_bmi', 
                       'health_general', 
                       'health_mental',
                       'health_mobility', 
                       'health_physical']]


# separate target from select features using dummy variables

y_dmy = df_dmy['y_heart_disease_Yes']

X_dmy = df_dmy.loc[:, ['behavior_sleep', 
                       'health_bmi', 
                       'health_mental', 
                       'health_physical',
                       'behavior_activity_Yes', 
                       'behavior_alcohol_Yes', 
                       'behavior_tobacco_Yes',
                       'demo_age_25-29', 
                       'demo_age_30-34', 
                       'demo_age_35-39', 
                       'demo_age_40-44',
                       'demo_age_45-49', 
                       'demo_age_50-54', 
                       'demo_age_55-59', 
                       'demo_age_60-64',
                       'demo_age_65-69', 
                       'demo_age_70-74', 
                       'demo_age_75-79',
                       'demo_age_80 or older', 
                       'demo_gender_Male', 
                       'demo_race_Asian',
                       'demo_race_Black', 
                       'demo_race_Hispanic', 
                       'demo_race_Other',
                       'demo_race_White', 
                       'disease_asthma_Yes',
                       'disease_diabetes_No_borderline', 
                       'disease_diabetes_Yes',
                       'disease_diabetes_Yes_pregnancy', 
                       'disease_kidney_Yes',
                       'disease_skin_Yes', 
                       'disease_stroke_Yes', 
                       'health_general_Fair',
                       'health_general_Good', 
                       'health_general_Poor',
                       'health_general_Very good', 
                       'health_mobility_Yes']]


In [43]:
# train/test split using the mapped df
X_train_m, X_test_m, y_train_m, y_test_m = train_test_split(X_map, 
                                                    y_map, 
                                                    test_size=0.2, 
                                                    random_state=42)

# train/test split using the dummy df
X_train_d, X_test_d, y_train_d, y_test_d = train_test_split(X_dmy, 
                                                    y_dmy, 
                                                    test_size=0.2, 
                                                    random_state=42)
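
The splits in this notebook do not pass `stratify`, so the share of positives in train and test can drift slightly from the overall rate. One option, sketched here on synthetic labels (85 positives in 1,000 rows; all `*_toy` names are hypothetical), is to stratify on the target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(1000, 3))          # synthetic features
y_toy = np.array([1] * 85 + [0] * 915)      # 8.5% positive class

# stratify=y_toy keeps the class proportions the same in both parts
X_tr_toy, X_te_toy, y_tr_toy, y_te_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("train positive rate:", y_tr_toy.mean())
print("test positive rate: ", y_te_toy.mean())
```

With a target this imbalanced, stratifying keeps the test-set recall estimate from depending on how many positives happened to land in the test split.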

In [50]:
# baseline rate of target using the mean of training data

print('Baseline probability of heart disease on y_train_map:', (round(np.mean(y_train_m), 4)*100),'%')
print('Baseline probability of heart disease on y_train_dummy:', (round(np.mean(y_train_d), 4)*100),'%')


Baseline probability of heart disease on y_train_map: 8.51 %
Baseline probability of heart disease on y_train_dummy: 8.51 %


In [49]:
# plot features of mapped df


In [None]:
# plot features of dummy df


## Baseline

In [56]:
# logistic regression – scaled map train

scaler_map = StandardScaler()
X_train_sc_m = scaler_map.fit_transform(X_train_m)
X_test_sc_m = scaler_map.transform(X_test_m)

logreg_map = LogisticRegression()
logreg_map.fit(X_train_sc_m, y_train_m)

print('Logistic Regression (map train):', round(f1_score(y_test_m, logreg_map.predict(X_test_sc_m)), 4))

Logistic Regression (map train): 0.1664


In [58]:
# logistic regression – scaled dummy train

scaler_dmy = StandardScaler()
X_train_sc_d = scaler_dmy.fit_transform(X_train_d)
X_test_sc_d = scaler_dmy.transform(X_test_d)

logreg_dmy = LogisticRegression()
logreg_dmy.fit(X_train_sc_d, y_train_d)

print('Logistic Regression (dummy train):', round(f1_score(y_test_d, logreg_dmy.predict(X_test_sc_d)), 4))

Logistic Regression (dummy train): 0.169
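
scikit-learn metric functions take `(y_true, y_pred)` in that order. Binary F1 happens to be unchanged if the two are swapped, but recall, the metric chosen for this project, is not: swapping the arguments returns precision instead. A quick check on hand-made labels (illustrative values only):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# 4 actual positives; the predictions find 1 of them and raise 2 false alarms
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 0, 0, 0, 0]

rec = recall_score(y_true, y_pred)       # TP / (TP + FN) = 1 / 4
prec = precision_score(y_true, y_pred)   # TP / (TP + FP) = 1 / 3
f1 = f1_score(y_true, y_pred)            # 2TP / (2TP + FP + FN) = 2 / 7

# Swapped argument order: F1 is unchanged, recall turns into precision
rec_swapped = recall_score(y_pred, y_true)
f1_swapped = f1_score(y_pred, y_true)

print(f"recall={rec:.3f} precision={prec:.3f} f1={f1:.3f}")
print(f"swapped: recall={rec_swapped:.3f} f1={f1_swapped:.3f}")
```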


Reference notebook: https://github.com/laramillernm/Metis-Classification-Project/blob/main/TelcoChurnFinal.ipynb

In [None]:
# model with all features

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

y_pred_lr = logreg.predict(X_test)
y_prob_pred_test = logreg.predict_proba(X_test)

print(f1_score(y_test, y_pred_lr, average="macro"))


# classification report 

classify_logreg = classification_report(y_test, y_pred_lr)
print(classify_logreg)


In [None]:
# scale X_train and X_test

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# fit decision tree to X_train, y_train

classifier = DecisionTreeClassifier(criterion = 'gini', random_state = 0)
classifier.fit(X_train, y_train)


In [None]:
# predict on X_test

y_pred_dt = classifier.predict(X_test)
print(f1_score(y_test, y_pred_dt, average="macro"))


# classification report 

classify_dt = classification_report(y_test, y_pred_dt)
print(classify_dt)


Reference notebook: https://github.com/hyewonjng/Metis-Vaccination/blob/main/codes/2_classification_models.ipynb

In [None]:
# split X and y twice for Xy_train, Xy_test, Xy_validate sets

y = df.series
X = df.drop(labels = ['column_name', 'column_name'], axis = 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = .2, random_state = 42, stratify= y)

X_train, X_validate, y_train, y_validate = train_test_split(X_train, y_train, 
                                                            test_size = .25, random_state = 42)
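
The two-step split holds out 20% for test, then takes 25% of the remaining 80% for validation, which gives 60/20/20 overall. A sketch on synthetic data (1,000 rows, 10% positives; all `*_toy` names are hypothetical) that also stratifies the second split, which is an addition to the cell it follows:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(2000).reshape(1000, 2)     # synthetic features
y_toy = np.array([1] * 100 + [0] * 900)      # 10% positive class

# Step 1: hold out 20% as the test set
X_rest, X_test_toy, y_rest, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Step 2: 25% of the remaining 80% becomes the validation set (20% overall)
X_train_toy, X_val_toy, y_train_toy, y_val_toy = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest
)

print(len(y_train_toy), len(y_val_toy), len(y_test_toy))
```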


In [None]:
# BernoulliNB() 

# scale X_train 
std_scale = StandardScaler()
X_train_scaled = std_scale.fit_transform(X_train)

# fit and score naive bayes Bernoulli model on X_train_scaled, y_train
nb = BernoulliNB()
nb.fit(X_train_scaled, y_train)
nb.score(X_train_scaled, y_train)

# validate naive bayes Bernoulli model
# reuse the scaler fitted on X_train; do not refit it on the validation data
X_validate_scaled = std_scale.transform(X_validate)

# score the model fitted on X_train_scaled against X_validate_scaled, y_validate
nb.score(X_validate_scaled, y_validate)
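
A common pattern for validation is to learn the scaling statistics from the training rows only and reuse them on held-out rows, so no information from the validation set leaks into preprocessing. A minimal sketch on synthetic arrays (the `*_toy` names are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train_toy = rng.normal(loc=10.0, scale=2.0, size=(500, 2))
X_val_toy = rng.normal(loc=10.0, scale=2.0, size=(100, 2))

scaler = StandardScaler()
X_train_scaled_toy = scaler.fit_transform(X_train_toy)  # learns mean/std from train only
X_val_scaled_toy = scaler.transform(X_val_toy)          # reuses the train statistics

# Train columns are centred exactly; validation columns only approximately
print(X_train_scaled_toy.mean(axis=0))
print(X_val_scaled_toy.mean(axis=0))
```

The same rule applies to the model: it is fitted once on the training data and then scored on the validation data, never refitted on it.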


In [None]:
# BernoulliNB() 
# predict on X_validate_scaled and score y_validate, y_predict with all metrics

y_predict = nb.predict(X_validate_scaled) 

print("Accuracy:",metrics.accuracy_score(y_validate, y_predict))
print("Precision:",metrics.precision_score(y_validate, y_predict))
print("Recall:",metrics.recall_score(y_validate, y_predict))
print("F1:",metrics.f1_score(y_validate, y_predict))

In [None]:
#LogisticRegression()
# scale X_train

std_scale = StandardScaler()
X_train_scaled = std_scale.fit_transform(X_train)

logit = LogisticRegression(C=1000) # high C removes regularization
logit.fit(X_train_scaled, y_train)

y_predict = logit.predict(X_train_scaled) 
logit.score(X_train_scaled, y_train)


In [None]:
#LogisticRegression()
# scale X_validate with the scaler fitted on X_train

X_val_scaled = std_scale.transform(X_validate)

# score the model fitted on the training data against the validation set
logit.score(X_val_scaled, y_validate)


In [None]:
#LogisticRegression()
# predict on X_validate_scaled and score y_validate, y_predict with all metrics

y_predict = logit.predict(X_validate_scaled) 

print("Accuracy:",metrics.accuracy_score(y_validate, y_predict))
print("Precision:",metrics.precision_score(y_validate, y_predict))
print("Recall:",metrics.recall_score(y_validate, y_predict))
print("f1:",metrics.f1_score(y_validate, y_predict))


In [None]:
fpr, tpr, thresholds = roc_curve(y_validate, logit.predict_proba(X_val_scaled)[:,1])

plt.plot(fpr, tpr,lw=2)
plt.plot([0,1],[0,1],c='violet',ls='--')
plt.xlim([-0.05,1.05])
plt.ylim([-0.05,1.05])


plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve');
print("ROC AUC score = ", roc_auc_score(y_validate, logit.predict_proba(X_val_scaled)[:,1]))
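
Each point on the ROC curve corresponds to a decision threshold, so a recall target can be turned into a threshold choice. The sketch below is one simple way to do that with NumPy alone; `threshold_for_recall` is a hypothetical helper and the scores are made up.

```python
import numpy as np


def threshold_for_recall(y_true, scores, target_recall):
    """Return the highest threshold whose recall meets the target, with that recall."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    n_pos = np.sum(y_true == 1)
    if n_pos == 0:
        raise ValueError("y_true contains no positive labels")
    # Try candidate thresholds from strictest to most lenient
    for t in np.sort(np.unique(scores))[::-1]:
        preds = scores >= t
        rec = np.sum(preds & (y_true == 1)) / n_pos
        if rec >= target_recall:
            return float(t), float(rec)
    raise ValueError("target_recall must be at most 1.0")


# Illustrative labels and predicted probabilities
y_toy = [0, 0, 1, 0, 1, 1, 0, 1]
p_toy = [0.05, 0.2, 0.3, 0.4, 0.55, 0.6, 0.1, 0.9]

t, rec = threshold_for_recall(y_toy, p_toy, target_recall=0.75)
print(f"threshold={t}, recall={rec}")
```

Lowering the threshold raises recall but also the false positive rate, so the chosen value should be picked on validation data and only then applied to the test set.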


In [None]:
# Class imbalance
from imblearn.over_sampling import SMOTE

# sampling_strategy for SMOTE: triple the positive class, keep the negative class as is
n_pos = np.sum(y_train == 1)
n_neg = np.sum(y_train == 0)
ratio = {1 : n_pos * 3, 0 : n_neg} 

smote = SMOTE(sampling_strategy = ratio, random_state = 42)

X_train_smote, y_train_smote = smote.fit_resample(X_train_scaled, y_train)

nb_smote = BernoulliNB() 
nb_smote.fit(X_train_smote, y_train_smote)

print('Bernoulli NB on SMOTE Train Data; Validation Recall: %.3f, Validation AUC: %.3f' % \
      (recall_score(y_validate, nb_smote.predict(X_validate_scaled)), 
       roc_auc_score(y_validate, nb_smote.predict_proba(X_validate_scaled)[:,1])))
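
Class weighting is a possible alternative to oversampling that needs no extra package. It is not part of the referenced notebook; the sketch below uses synthetic data from `make_classification`, and `balanced_class_weights` is a hypothetical helper that reproduces the formula scikit-learn documents for `class_weight="balanced"`, `n_samples / (n_classes * np.bincount(y))`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split


def balanced_class_weights(y):
    """Weights per class: n_samples / (n_classes * count of each class)."""
    counts = np.bincount(y)
    return len(y) / (len(counts) * counts)


# Synthetic, imbalanced data (about 8.5% positives); not the heart disease data
X_syn, y_syn = make_classification(n_samples=4000, n_features=8,
                                   weights=[0.915], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_syn, y_syn, test_size=0.25,
                                          random_state=0, stratify=y_syn)

print("class weights:", balanced_class_weights(y_tr))

plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

recall_plain = recall_score(y_va, plain.predict(X_va))
recall_weighted = recall_score(y_va, weighted.predict(X_va))
print(f"validation recall: plain={recall_plain:.3f}, weighted={recall_weighted:.3f}")
```

Weighting typically raises recall at the cost of precision, so both should be checked on the validation set before choosing between it and SMOTE.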


In [None]:
# Feature importance

importance = logit.coef_[0]

# summarize feature importance
for i,v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i,v))
    
# plot feature importance
plt.bar([x for x in range(len(importance))], importance)
plt.show()

In [None]:
def make_confusion_matrix(model, threshold=0.5):
    
    # Predict class 1 if probability of being in class 1 is greater than threshold
    # (model.predict(X_test) does this automatically with a threshold of 0.5)
    
    y_predict = (model.predict_proba(X_test_scaled)[:, 1] >= threshold)
    fraud_confusion = confusion_matrix(y_test, y_predict)
    plt.figure(dpi=80)
    sns.heatmap(fraud_confusion, cmap=plt.cm.BuGn, annot=True, square=True, fmt='d',
           xticklabels=['non-vaccinated', 'vaccinated'],
           yticklabels=['non-vaccinated', 'vaccinated']);
    plt.xlabel('prediction')
    plt.ylabel('actual')

make_confusion_matrix(rf) #rf = random forest model


In [None]:
y_pred = rf.predict(X_test)

print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred))
print("Recall:",metrics.recall_score(y_test, y_pred))
print("f1:",metrics.f1_score(y_test, y_pred))


Reference notebook: https://github.com/emichaelbernardo/titanic/blob/main/Classification.ipynb

In [None]:
# Look at survival rate by Sex, Age and Pclass

age = pd.cut(df_passengers['Age'], [0, 12, 17, 64, 80])
df_passengers.pivot_table('Survived', ['Sex', age], 'Pclass')

In [None]:
# Look at survival rate by Sex, Age and Embarked

df_passengers.pivot_table('Survived', ['Sex', age], 'Embarked')

In [None]:
# visualize data

cols = ['AgeGroup', 'Sex', 'Pclass', 'SibSp', 'Parch', 'Embarked']

n_rows = 2
n_cols = 3

# The subplot grid and the figure size of each graph
# This returns a Figure (fig) and an Axes Object (axs)
fig, axs = plt.subplots(n_rows, n_cols, figsize=(n_cols*3.2,n_rows*3.2))

for r in range(0,n_rows):
    for c in range(0,n_cols):  
        
        i = r*n_cols+ c # index to go through the number of columns       
        ax = axs[r][c]  # Show where to position each subplot
        sns.countplot(data=df_passengers, x=cols[i], hue="Survived", ax=ax)
        ax.set_title(f'Survival by {cols[i]}' )
        ax.legend(title="Survived", loc='upper right') 
        
plt.tight_layout()  

In [None]:
# Plot the survival rate of each class
sns.barplot(x='Pclass', y='Survived', data=df_passengers)

In [None]:
#Plot the survival rate of each Sex
sns.barplot(x='Sex', y='Survived', data=df_passengers)

In [None]:
# Look at survival probability by AgeGroup and Sex
sns.barplot(x = 'AgeGroup', y ='Survived', hue='Sex', data = df_passengers)
plt.ylabel('Survival Probability')
plt.title('Survival Probability by AgeGroup')

In [None]:
# Look at survival probability by Embarked and Sex
sns.barplot(x = 'Embarked', y ='Survived', hue='Sex', data = df_passengers)
plt.ylabel('Survival Probability')
plt.title('Survival Probability by Embarked')

In [None]:
# View distribution of passengers
# sns.factorplot was renamed to sns.catplot in seaborn 0.9 and later removed
sns.catplot(y = 'Age', x = 'Sex', hue = 'Pclass', kind = 'box', data = df_passengers).set(title='Distribution by Age, Sex and Pclass')
sns.catplot(y = 'Age', x = 'Parch', hue='Sex', kind = 'box', data = df_passengers).set(title='Distribution by Age and Parch')
sns.catplot(y = 'Age', x = 'SibSp', kind = 'box', data = df_passengers).set(title='Distribution by Age and SibSp')
sns.catplot(y = 'Age', x = 'Embarked', kind = 'box', data = df_passengers).set(title='Distribution by Age and Embarked')


In [None]:
# functions to score models

def accuracy(actuals, preds):
    return np.mean(actuals == preds)

def precision(actuals, preds):
    tp = np.sum((actuals == 1) & (preds == 1))
    fp = np.sum((actuals == 0) & (preds == 1))
    return tp / (tp + fp)

def recall(actuals, preds):
    tp = np.sum((actuals == 1) & (preds == 1))
    fn = np.sum((actuals == 1) & (preds == 0))
    return tp / (tp + fn)

def F1(actuals, preds):
    p, r = precision(actuals, preds), recall(actuals, preds)
    return 2*p*r / (p + r)


[back to top](#top)

## 3. Preliminary Visualization<a id='3'></a> 


[back to top](#top)

## 4. Preliminary Conclusions<a id='4'></a> 


[back to top](#top)