---

# Explore classification methods on the stroke dataset

In this notebook we explore and compare several classification methods on the `healthcare-dataset-stroke-data.csv` dataset. The methods are:

1. Random Forest
2. Decision Tree
3. KNN
4. Logistic Regression

In [12]:
import itertools

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pylab as pl
import seaborn as sns

from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, jaccard_score, log_loss)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv('/Users/celiajiang/Desktop/rice/INDE577/healthcare-dataset-stroke-data.csv')
# Encode the categorical text columns as integer codes
df['gender'] = df['gender'].map({'Male': 1, 'Female': 0, 'Other': 2})
df['ever_married'] = df['ever_married'].map({'Yes': 1, 'No': 0})
df['work_type'] = df['work_type'].map({'Private': 0, 'Self-employed': 1, 'children': 2, 'Govt_job': 3, 'Never_worked': 4})
df['Residence_type'] = df['Residence_type'].map({'Urban': 1, 'Rural': 0})
df['smoking_status'] = df['smoking_status'].map({'never smoked': 0, 'Unknown': 1, 'formerly smoked': 2, 'smokes': 3})

# Feature columns used by every model; 'stroke' is the target
features = ['gender', 'age', 'hypertension', 'heart_disease', 'ever_married', 'work_type', 'Residence_type', 'smoking_status', 'avg_glucose_level', 'bmi']

# Keep only the modelling columns and drop rows with any missing value
df_num = df[features + ['stroke']].dropna()
X, y = df_num[features], df_num['stroke']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


### Data Processing
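In the data processing step above, the text values of "gender", "ever_married", "work_type", "Residence_type" and "smoking_status" are mapped to integer codes, and rows with missing values are dropped. The data is then split into training (80%) and test (20%) sets.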
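
One detail of the `map` plus `dropna` combination is worth illustrating: a category that is absent from the mapping dictionary silently becomes `NaN` and is then removed by `dropna()`. The sketch below uses a small made-up column (not the stroke data) in which "Other" is deliberately left out of the dictionary.

```python
import pandas as pd

# Toy column, not the real dataset: "Other" is deliberately missing from the mapping.
toy = pd.DataFrame({"gender": ["Male", "Female", "Other", "Female"]})
codes = toy["gender"].map({"Male": 1, "Female": 0})

# Values without a dictionary entry become NaN instead of raising an error.
n_unmapped = int(codes.isna().sum())
print("unmapped values:", n_unmapped)

# dropna() then discards those rows along with any other incomplete rows.
n_rows_kept = len(toy.assign(gender=codes).dropna())
print("rows kept:", n_rows_kept)
```

Comparing `len(df)` before and after `dropna()` is a quick way to see how many rows a preprocessing step like this removes.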

### Random Forest
In this part, we fit a random forest classifier on all variables and obtain a test accuracy of 94.603%.

In [24]:
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# predict with the trained model
y_pred = rf.predict(X_test)
print(f'{accuracy_score(y_test, y_pred) = } \n')

accuracy_score(y_test, y_pred) = 0.9460285132382892 
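
An accuracy figure is easier to interpret next to the accuracy of always predicting the most frequent class. The notebook does not report the class balance of `stroke`, so the sketch below uses made-up labels (95 negatives, 5 positives) purely for illustration; `DummyClassifier` is scikit-learn's built-in baseline estimator.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic labels, not the stroke data: 95 negatives and 5 positives.
y_toy = np.array([0] * 95 + [1] * 5)
X_toy = np.zeros((len(y_toy), 1))  # the dummy model ignores the features

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_acc = accuracy_score(y_toy, baseline.predict(X_toy))
print(f"majority-class baseline accuracy: {baseline_acc:.2f}")
```

On the real data, the same baseline could be fit on `X_train, y_train` and scored on `X_test, y_test` for a direct comparison with the models in this notebook.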



### Decision Tree
In this part, we fit a decision tree classifier with the entropy criterion and `max_depth` set to 4, and obtain a test accuracy of 94.501%.

In [25]:
strokeTree = DecisionTreeClassifier(criterion="entropy", max_depth = 4)
strokeTree.fit(X_train,y_train)
trainTree = strokeTree.predict(X_train)
predTree = strokeTree.predict(X_test)

# Accuracy on the training and test sets
Decision_Tree_Train = metrics.accuracy_score(y_train, trainTree)
Decision_Tree_Test = metrics.accuracy_score(y_test, predTree)
print("Decision Tree Accuracy: ", Decision_Tree_Test)

Decision Tree Accuracy:  0.945010183299389


### KNN
In this part, we fit k-nearest neighbors (KNN) classifiers with k = 2 and k = 4 and obtain test accuracies of 94.297% and 94.094%, respectively.

In [26]:
k = 2
#Train Model and Predict  
neigh = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
yhat = neigh.predict(X_test)
KNN_Train=metrics.accuracy_score(y_train, neigh.predict(X_train))
KNN_Test=metrics.accuracy_score(y_test, yhat)
print("KNN Accuracy: ",KNN_Test )

KNN Accuracy:  0.9429735234215886


In [10]:
k = 4
#Train Model and Predict  
neigh = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
yhat = neigh.predict(X_test)
KNN_Train=metrics.accuracy_score(y_train, neigh.predict(X_train))
KNN_Test=metrics.accuracy_score(y_test, yhat)
print("KNN Accuracy: ",KNN_Test )

KNN Accuracy:  0.9409368635437881
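
The two KNN cells differ only in `k`, so the comparison can also be written as a loop. This is a minimal sketch on synthetic data from `make_classification`; the names `X_toy`, `Xtr`, `Xte`, `ytr`, `yte` and the list of `k` values are illustrative assumptions, and the notebook's own `X_train`, `X_test`, `y_train`, `y_test` could be used in their place.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data, not the stroke dataset.
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# Fit one model per k and record its test accuracy.
k_values = [2, 4, 6, 8]
knn_scores = {}
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k).fit(Xtr, ytr)
    knn_scores[k] = accuracy_score(yte, model.predict(Xte))

for k, score in knn_scores.items():
    print(f"k={k}: test accuracy {score:.3f}")
```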


### Logistic Regression
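In this part, we fit a logistic regression classifier. The score computed in the cell below is the Jaccard score for class 0 (`jaccard_score` with `pos_label=0`) rather than the plain accuracy; its value on the test set is 94.501%.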

In [18]:
LR = LogisticRegression(C=0.01, solver='liblinear').fit(X_train,y_train)
ytrain = LR.predict(X_train)
yhat = LR.predict(X_test)
ytrain_prob=LR.predict_proba(X_train)
yhat_prob = LR.predict_proba(X_test)
# Jaccard score for class 0 (not accuracy) on the training and test sets
Logistic_Regression_Train = jaccard_score(y_train, ytrain, pos_label=0)
Logistic_Regression_Test = jaccard_score(y_test, yhat, pos_label=0)
print("Logistic Regression Jaccard score (class 0): ", Logistic_Regression_Test)

Logistic Regression Jaccard score (class 0):  0.945010183299389
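
Because `jaccard_score` with `pos_label=0` and `accuracy_score` measure different things, the two can agree or disagree depending on the predictions. The toy labels below are made up for illustration: when every prediction is 0 the two scores coincide, and as soon as some predictions are 1 they can differ.

```python
import numpy as np
from sklearn.metrics import accuracy_score, jaccard_score

# Made-up labels: eight samples of class 0 and two of class 1.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# Case A: the model predicts 0 everywhere.
pred_a = np.zeros_like(y_true)
acc_a = accuracy_score(y_true, pred_a)
jac_a = jaccard_score(y_true, pred_a, pos_label=0)

# Case B: the model also finds one of the two positives.
pred_b = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0])
acc_b = accuracy_score(y_true, pred_b)
jac_b = jaccard_score(y_true, pred_b, pos_label=0)

print(f"all-zero predictions: accuracy={acc_a:.3f}, jaccard(class 0)={jac_a:.3f}")
print(f"mixed predictions:    accuracy={acc_b:.3f}, jaccard(class 0)={jac_b:.3f}")
```

To put logistic regression on the same footing as the other models, `metrics.accuracy_score(y_test, yhat)` can be reported alongside the Jaccard score.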


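### Voting Classifier
In this part, we combine the four classifiers above in a hard-voting ensemble, which predicts the class chosen by the majority of its members, and obtain a test accuracy of 94.603%.

In [28]:
from sklearn.ensemble import VotingClassifier

# Hard voting: each member casts one vote and the majority class wins.
# fit() trains a fresh clone of every listed estimator on the training data.
final_model = VotingClassifier(estimators=[('lr', LR), ('dt', strokeTree), ('knn', neigh), ('rf', rf)], voting='hard')
final_model.fit(X_train, y_train)
pred_final = final_model.predict(X_test)
metrics.accuracy_score(y_test, pred_final)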

0.9460285132382892
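
`VotingClassifier.fit` trains its own copies of the estimators it is given, so the objects passed in are left as they were. The sketch below shows this on synthetic data with three unfitted members; the data and variable names are illustrative assumptions, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the stroke dataset.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# Three unfitted members (an odd number avoids tied votes on a binary problem).
lr = LogisticRegression(C=0.01, solver="liblinear")
tree = DecisionTreeClassifier(criterion="entropy", max_depth=4, random_state=0)
knn = KNeighborsClassifier(n_neighbors=4)

vote = VotingClassifier(
    estimators=[("lr", lr), ("dt", tree), ("knn", knn)], voting="hard"
)
vote.fit(X_toy, y_toy)
toy_pred = vote.predict(X_toy)

# The fitted copies live in vote.estimators_; the original lr is still unfitted.
print("original lr fitted:", hasattr(lr, "coef_"))
print("fitted members in ensemble:", len(vote.estimators_))
```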