# Stroke Prediction 2.1: Model and evaluation

## Read in data

In [3]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
pd.set_option('display.max_columns', None)# display all the columns

In [4]:
train_data = pd.read_csv("clean_train.csv")
test_data = pd.read_csv("clean_test.csv")

In [5]:
train_data.head()

Unnamed: 0,age,hypertension,heart_disease,ever_married,avg_glucose_level,bmi,smoking_status,Female,Male,Other,Govt_job,Never_worked,Private,Self-employed,children,stroke
0,3.0,0.0,0.0,0.0,95.12,18.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,58.0,1.0,0.0,1.0,87.96,39.2,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,8.0,0.0,0.0,0.0,110.89,17.6,0.326663,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,70.0,0.0,0.0,1.0,69.04,35.9,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,14.0,0.0,0.0,0.0,161.28,19.1,0.681701,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


In [10]:
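# Features are every column except the last; `stroke` (the last column) is the target.
# Note: this slice also keeps the `Unnamed: 0` index column as a feature.
X = train_data.iloc[:, :-1].values
y = train_data.iloc[:, -1].values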

In [11]:
# Standardize each feature to zero mean and unit variance
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)
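
As a rough illustration of what the scaling step does, here is a minimal pure-Python sketch of the z-score transform applied to one column. It assumes the population standard deviation (the convention `StandardScaler` uses); the helper name `standardize` is hypothetical, and the sample values are the `age` entries from the preview above.

```python
import math

def standardize(column):
    """Return z-scores: (x - mean) / population standard deviation."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / n)
    if std == 0:
        # Constant column: every centred value is already 0
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]

ages = [3.0, 58.0, 8.0, 70.0, 14.0]  # `age` values from train_data.head()
scaled = standardize(ages)
print([round(z, 3) for z in scaled])
```

After the transform the column has mean 0 and variance 1, so features measured on very different scales (age, glucose level, 0/1 flags) contribute comparably to scale-sensitive models such as SVM and logistic regression.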

## Ensemble learning

Ensemble methods combine multiple learning algorithms to obtain better predictive performance than any of the constituent learning algorithms could achieve alone.

### Common ensemble techniques

Stacking<br><br>
Stacking (sometimes called stacked generalization) involves training a learning algorithm to combine the predictions of several other learning algorithms. First, all of the other algorithms are trained using the available data, then a combiner algorithm is trained to make a final prediction using all the predictions of the other algorithms as additional inputs.<br><br>
Blending<br><br>
Blending follows the same approach as stacking but uses only a holdout (validation) set from the train set to make predictions. In other words, unlike stacking, the predictions are made on the holdout set only. The holdout set and the predictions are used to build a model which is run on the test set.<br><br>
Bagging<br><br>
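Bagging (bootstrap aggregating) trains multiple models of the same type (for instance, decision trees) on bootstrap samples of the training set, drawn with replacement, and combines their results to get a more generalized, lower-variance prediction.<br><br>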
Boosting<br><br>
Boosting involves incrementally building an ensemble by training each new model instance to emphasize the training instances that previous models misclassified. In some cases, boosting has been shown to yield better accuracy than bagging, but it also tends to be more likely to overfit the training data.
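
To make the bagging idea concrete, the following is a minimal sketch in plain Python, not the scikit-learn implementation. It assumes a one-dimensional feature, a decision stump as the base model, and made-up toy data; the names `fit_stump`, `bagging_fit`, and `bagging_predict` are hypothetical.

```python
import random
from collections import Counter

def fit_stump(points):
    """Pick the threshold with the fewest training errors.
    The stump predicts 1 when x > threshold, else 0."""
    best_threshold, best_errors = None, None
    for threshold in sorted({x for x, _ in points}):
        errors = sum((x > threshold) != bool(label) for x, label in points)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def bagging_fit(points, n_estimators, rng):
    """Train one stump per bootstrap sample (drawn with replacement)."""
    stumps = []
    for _ in range(n_estimators):
        sample = [rng.choice(points) for _ in points]
        stumps.append(fit_stump(sample))
    return stumps

def bagging_predict(stumps, x):
    """Combine the stumps by majority vote."""
    votes = Counter(int(x > threshold) for threshold in stumps)
    return votes.most_common(1)[0][0]

# Toy data (made up): the label is 1 for the larger x values
data = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
stumps = bagging_fit(data, n_estimators=11, rng=random.Random(0))
print(bagging_predict(stumps, 1.5), bagging_predict(stumps, 8.5))
```

Each stump sees a slightly different resampled dataset, so individual thresholds vary, while the vote across stumps is more stable than any single one. An odd number of estimators avoids ties in the vote.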

### Algorithms based on Bagging and Boosting

Bagging algorithms:<br><br>
Bagging meta-estimator: Bagging meta-estimator is an ensembling algorithm that can be used for both classification (BaggingClassifier) and regression (BaggingRegressor) problems. It follows the typical bagging technique to make predictions. By default, each bootstrap subset of the dataset includes all features.<br><br>
Random forest: It is an extension of the bagging estimator algorithm. The base estimators in random forest are decision trees. Unlike the bagging meta-estimator, random forest randomly selects a subset of features from which the best split is chosen at each node of the decision tree.<br><br>
Boosting algorithms:<br><br>
AdaBoost: Adaptive boosting, or AdaBoost, is one of the simplest boosting algorithms. Usually, decision trees are used for modelling. Multiple sequential models are created, each correcting the errors of the previous one. AdaBoost assigns higher weights to the observations that are incorrectly predicted, and the subsequent model works to predict these values correctly.<br><br>
GBM: Gradient Boosting, or GBM, is another ensemble machine learning algorithm that works for both regression and classification problems. GBM uses the boosting technique, combining a number of weak learners to form a strong learner. Regression trees are used as the base learner, and each subsequent tree in the series is built on the errors calculated by the previous tree.<br><br>
XGBoost: XGBoost (eXtreme Gradient Boosting) is an advanced implementation of the gradient boosting algorithm. It has proved highly effective and is extensively used in machine learning competitions and hackathons. XGBoost has high predictive power and is often considerably faster than older gradient boosting implementations. It also includes several forms of regularization, which reduce overfitting and improve overall performance; hence it is also known as a 'regularized boosting' technique.<br><br>
LightGBM: LightGBM is a gradient boosting framework that uses tree-based algorithms and is designed for very large datasets, where it typically takes less time to run than the other algorithms. It grows trees leaf-wise, while most other algorithms grow them level-wise.<br><br>
CatBoost: CatBoost can handle categorical variables automatically and does not require the extensive data preprocessing that other machine learning algorithms do.
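
The reweighting step that AdaBoost relies on can be sketched as follows. This is the classic binary formulation with labels in {-1, +1}, written in plain Python for illustration; it is not claimed to match scikit-learn's `AdaBoostClassifier` line for line, and the helper name `adaboost_reweight` and the toy labels are hypothetical.

```python
import math

def adaboost_reweight(weights, y_true, y_pred):
    """One AdaBoost round for labels in {-1, +1}.

    Returns the learner's vote weight (alpha) and the normalised sample
    weights for the next round. Assumes 0 < weighted error < 1.
    """
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    alpha = 0.5 * math.log((1 - err) / err)
    # t * p is +1 for a correct prediction and -1 for a mistake, so
    # mistakes are scaled up by exp(alpha) and hits down by exp(-alpha).
    new_w = [w * math.exp(-alpha * t * p)
             for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]

y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, -1, 1]  # the weak learner misses the second sample
weights = [1 / len(y_true)] * len(y_true)

alpha, weights = adaboost_reweight(weights, y_true, y_pred)
print(round(alpha, 4), [round(w, 3) for w in weights])
```

After the update the misclassified sample carries half of the total weight, which is what pushes the next weak learner to focus on it.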

## Parameter sweeping and model evaluation

Bagging meta-estimator

In [12]:
from sklearn.ensemble import BaggingClassifier
from sklearn import tree
from sklearn.model_selection import cross_val_score

print ("Bagging meta-estimator accuracy: ", np.mean(cross_val_score(BaggingClassifier(), X, y,cv=10)))

Bagging meta-estimator accuracy:  0.9813594411046613


In [13]:
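n_esti = [5, 10, 15, 20, 25, 30]
# Integers are read as absolute counts: max_samples=1 trains each base estimator
# on a single row, and max_features=1 on a single column.
max_samp = [1, 2, 3, 4, 5]
max_feat = [1, 2, 3, 4, 5]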

In [14]:
bag_res = []
for ne in n_esti:
    for ms in max_samp:
        for mf in max_feat:
            score = np.mean(cross_val_score(BaggingClassifier(n_estimators = ne, max_samples = ms, max_features = mf), X, y,cv=10))
            bag_res.append([score, ne, ms, mf])

In [15]:
# Find result with highest accuracy
def find_best(lst):
    return max(lst, key=lambda x: x[0])

In [17]:
print ('After parameter sweeping, we find:')
print ('Highest accuracy is:', find_best(bag_res)[0], 'with n_estimators ', find_best(bag_res)[1], ', max_samples ', find_best(bag_res)[2], ' and max_features ',find_best(bag_res)[3] )

After parameter sweeping, we find:
Highest accuracy is: 0.981958540696738 with n_estimators  5 , max_samples  1  and max_features  1
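
The nested-loop sweep used here recurs for every model in this notebook. One way to factor it out is sketched below; the helper `sweep` and the stand-in `toy_score` function are hypothetical and exist only so the example runs without scikit-learn. In the notebook the scoring function would wrap `np.mean(cross_val_score(...))`. scikit-learn also provides `GridSearchCV` for the same purpose.

```python
from itertools import product

def sweep(param_grid, score_fn):
    """Score every parameter combination; return (best_score, best_params)."""
    names = list(param_grid)
    results = []
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((score_fn(**params), params))
    # Same selection rule as find_best: the entry with the highest score
    return max(results, key=lambda r: r[0])

# Stand-in scoring function (made up), peaking at n_estimators=15, max_features=3
def toy_score(n_estimators, max_features):
    return -(n_estimators - 15) ** 2 - (max_features - 3) ** 2

best_score, best_params = sweep(
    {"n_estimators": [5, 10, 15, 20], "max_features": [1, 2, 3, 4, 5]},
    toy_score,
)
print(best_score, best_params)
```

Keeping the parameters in a dictionary avoids the positional indexing (`[1]`, `[2]`, `[3]`) needed when results are stored as plain lists.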


Random forest

In [18]:
from sklearn.ensemble import RandomForestClassifier
print ("Random forest accuracy: ",np.mean(cross_val_score(RandomForestClassifier(), X, y,cv=10)))

Random forest accuracy:  0.9814285761443695


In [19]:
min_samples_leaf=[1, 4, 7, 11, 14, 17] 
n_estimators=[1, 6, 11, 16, 21,26, 31, 36, 41, 46] 


In [20]:
rf_res = []
for ms in min_samples_leaf:
    for ne in n_estimators:
        score = np.mean(cross_val_score(RandomForestClassifier(n_estimators = ne,min_samples_leaf = ms), X, y, cv=10))
        rf_res.append([score,ms,ne])

In [21]:
print ('After parameter sweeping, we find:')
print ('Highest accuracy is:', find_best(rf_res)[0], 'min_samples_leaf ', find_best(rf_res)[1], 'and n_estimators', find_best(rf_res)[2])

After parameter sweeping, we find:
Highest accuracy is: 0.9819815768635198 min_samples_leaf  4 and n_estimators 16


AdaBoost

In [22]:
from sklearn.ensemble import AdaBoostClassifier
print ("AdaBoost accuracy: ",np.mean(cross_val_score(AdaBoostClassifier(), X, y,cv=10)))

AdaBoost accuracy:  0.9819124683631744


In [23]:
n_esti = [30, 40, 50, 60, 70]
lr = [1, 0.5, 0.1]

In [27]:
ada_res = []
for ne in n_esti:
    for l in lr:
        score = np.mean(cross_val_score(AdaBoostClassifier(n_estimators = ne, learning_rate = l), X, y,cv=10))
        ada_res.append([score, ne, l])

In [28]:
print ('After parameter sweeping, we find:')
print ('Highest accuracy is:', find_best(ada_res)[0], 'n_estimators ', find_best(ada_res)[1], 'and learning_rate', find_best(ada_res)[2])

After parameter sweeping, we find:
Highest accuracy is: 0.981958540696738 n_estimators  30 and learning_rate 1


Gradient Boosting

In [29]:
from sklearn.ensemble import GradientBoostingClassifier
print ("Gradient Boosting accuracy: ",np.mean(cross_val_score(GradientBoostingClassifier(), X, y,cv=10)))

Gradient Boosting accuracy:  0.9815668196844234


In [30]:
n_esti = [90, 100, 110]
lr = [1, 0.5, 0.1]

In [31]:
gb_res = []
for ne in n_esti:
    for l in lr:
        score = np.mean(cross_val_score(GradientBoostingClassifier(n_estimators = ne, learning_rate = l), X, y,cv=10))
        gb_res.append([score, ne, l])

In [35]:
print ('After parameter sweeping, we find:')
print ('Highest accuracy is:', find_best(gb_res)[0], 'n_estimators ', find_best(gb_res)[1], 'and learning_rate', find_best(gb_res)[2])

After parameter sweeping, we find:
Highest accuracy is: 0.9816359441108332 n_estimators  90 and learning_rate 0.1


SVM

In [40]:
from sklearn.svm import SVC
print ("SVM accuracy: ",np.mean(cross_val_score(SVC(), X, y,cv=10)))

SVM accuracy:  0.981958540696738


In [41]:
kernels = ['linear', 'poly', 'rbf', 'sigmoid']


In [45]:
svm_res = []
for k in kernels:
    score = np.mean(cross_val_score(SVC(kernel=k), X, y,cv=10))
    svm_res.append([score,k])

In [46]:
print ('After parameter sweeping, we find:')
print ('Highest accuracy is:', find_best(svm_res)[0], 'kernel ', find_best(svm_res)[1])

After parameter sweeping, we find:
Highest accuracy is: 0.981958540696738 kernel  linear


Logistic Regression

In [47]:
from sklearn.linear_model import LogisticRegression

print ("Logistic Regression accuracy: ",np.mean(cross_val_score(LogisticRegression(), X, y,cv=10)))

Logistic Regression accuracy:  0.981958540696738


In [63]:
penalty = ['l1', 'l2']
C = [1, 5, 10]

In [64]:
lg_res = []
for p in penalty:
    for c in C:
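        # liblinear supports both 'l1' and 'l2'; the default solver in newer
        # scikit-learn releases ('lbfgs') raises an error for 'l1'.
        score = np.mean(cross_val_score(LogisticRegression(C = c, penalty=p, solver='liblinear'), X, y, cv=10))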
        lg_res.append([score, p, c])

In [65]:
print ('After parameter sweeping, we find:')
print ('Highest accuracy is:', find_best(lg_res)[0], 'with penalty ', find_best(lg_res)[1], 'and C ', find_best(lg_res)[2])

After parameter sweeping, we find:
Highest accuracy is: 0.981958540696738 with penalty  l1 and C  1


## AUC score & Classification report

Bagging meta-estimator

In [54]:
print("Bagging meta-estimator auc score:",np.mean(cross_val_score(BaggingClassifier(n_estimators = 5, max_samples = 1,max_features = 1), X, y, scoring = 'roc_auc',cv=10)))

Bagging meta-estimator auc score: 0.5
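
An AUC of exactly 0.5 is what a model with no ranking ability produces. A likely explanation here is that `max_samples = 1` trains every base estimator on a single row, so each one returns the same answer for every input. The sketch below computes AUC from its pairwise definition in plain Python; the function name `roc_auc` and the scores are made up for illustration.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.
    Tied pairs count as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 0, 1, 0, 1]
constant = [0.02] * len(y_true)  # the same score for every patient
informative = [0.1, 0.2, 0.15, 0.8, 0.3, 0.25]

print(roc_auc(y_true, constant))     # 0.5: every pair is a tie
print(roc_auc(y_true, informative))  # 0.875: 7 of 8 pairs ranked correctly
```

Because AUC depends only on how the scores rank positives against negatives, it can separate models that accuracy alone cannot distinguish.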


In [81]:
from sklearn.metrics import classification_report, accuracy_score, make_scorer
originalclass = []
predictedclass = []
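def classification_report_with_accuracy_score(y_true, y_pred):
    # Collect each fold's labels and predictions so that one report can be
    # printed over all folds; reset both lists before scoring another model.
    originalclass.extend(y_true)
    predictedclass.extend(y_pred)
    return accuracy_score(y_true, y_pred) # return accuracy score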

In [82]:
 
nested_score = cross_val_score(BaggingClassifier(n_estimators = 5, max_samples = 1,max_features = 1), X, y, cv=10,\
               scoring=make_scorer(classification_report_with_accuracy_score))
print(classification_report(originalclass, predictedclass)) 

              precision    recall  f1-score   support

         0.0       0.98      1.00      0.99     42617
         1.0       0.00      0.00      0.00       783

   micro avg       0.98      0.98      0.98     43400
   macro avg       0.49      0.50      0.50     43400
weighted avg       0.96      0.98      0.97     43400
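
The report shows why accuracy is misleading on this dataset: a classifier that never predicts a stroke already reaches about 98% accuracy. A small sketch using only the class supports printed above (42617 negatives, 783 positives); the variable names are illustrative.

```python
n_neg, n_pos = 42617, 783  # class supports taken from the report above

y_true = [0] * n_neg + [1] * n_pos
y_pred = [0] * len(y_true)  # baseline: always predict "no stroke"

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
correct = sum(t == p for t, p in zip(y_true, y_pred))

accuracy = correct / len(y_true)
recall_pos = tp / (tp + fn)
print(f"accuracy={accuracy:.6f}, stroke recall={recall_pos:.2f}")
```

The baseline accuracy is 42617 / 43400, which agrees with the best accuracies found in the parameter sweeps to about four decimal places, while recall for the stroke class is zero.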



Random Forest

In [51]:
print("Random Forest auc score:",np.mean(cross_val_score(RandomForestClassifier(n_estimators = 16,min_samples_leaf = 4), X, y, scoring = 'roc_auc',cv=10)))

Random Forest auc score: 0.8079906216569255


In [84]:
originalclass = []
predictedclass = []
nested_score = cross_val_score(RandomForestClassifier(n_estimators = 16,min_samples_leaf = 4), X, y, cv=10,\
               scoring=make_scorer(classification_report_with_accuracy_score))
print(classification_report(originalclass, predictedclass)) 

              precision    recall  f1-score   support

         0.0       0.98      1.00      0.99     42617
         1.0       1.00      0.00      0.00       783

   micro avg       0.98      0.98      0.98     43400
   macro avg       0.99      0.50      0.50     43400
weighted avg       0.98      0.98      0.97     43400



AdaBoost

In [53]:
print("AdaBoost auc score:",np.mean(cross_val_score(AdaBoostClassifier(n_estimators =30, learning_rate = 1), X, y, scoring = 'roc_auc',cv=10)))

AdaBoost auc score: 0.8465747519482829


In [85]:
originalclass = []
predictedclass = []
nested_score = cross_val_score(AdaBoostClassifier(n_estimators =30, learning_rate = 1), X, y, cv=10,\
               scoring=make_scorer(classification_report_with_accuracy_score))
print(classification_report(originalclass, predictedclass)) 

              precision    recall  f1-score   support

         0.0       0.98      1.00      0.99     42617
         1.0       0.00      0.00      0.00       783

   micro avg       0.98      0.98      0.98     43400
   macro avg       0.49      0.50      0.50     43400
weighted avg       0.96      0.98      0.97     43400



Gradient Boosting

In [55]:
print("Gradient Boosting auc score:",np.mean(cross_val_score(GradientBoostingClassifier(n_estimators =90,learning_rate = 0.1), X, y, scoring = 'roc_auc',cv=10)))

Gradient Boosting auc score: 0.8495456612215481


In [86]:
originalclass = []
predictedclass = []
nested_score = cross_val_score(GradientBoostingClassifier(n_estimators =90,learning_rate = 0.1), X, y, cv=10,\
               scoring=make_scorer(classification_report_with_accuracy_score))
print(classification_report(originalclass, predictedclass)) 

              precision    recall  f1-score   support

         0.0       0.98      1.00      0.99     42617
         1.0       0.00      0.00      0.00       783

   micro avg       0.98      0.98      0.98     43400
   macro avg       0.49      0.50      0.50     43400
weighted avg       0.96      0.98      0.97     43400



SVM

In [56]:
print("SVM auc score:",np.mean(cross_val_score(SVC(kernel = 'linear'), X, y, scoring = 'roc_auc',cv=10)))

SVM auc score: 0.5403611184241444


In [87]:
originalclass = []
predictedclass = []
nested_score = cross_val_score(SVC(kernel = 'linear'), X, y, cv=10,\
               scoring=make_scorer(classification_report_with_accuracy_score))
print(classification_report(originalclass, predictedclass)) 

              precision    recall  f1-score   support

         0.0       0.98      1.00      0.99     42617
         1.0       0.00      0.00      0.00       783

   micro avg       0.98      0.98      0.98     43400
   macro avg       0.49      0.50      0.50     43400
weighted avg       0.96      0.98      0.97     43400



Logistic Regression

In [66]:
print("Logistic Regression auc score:",np.mean(cross_val_score(LogisticRegression(C = 1, penalty='l1', solver='liblinear'), X, y, scoring = 'roc_auc',cv=10)))

Logistic Regression auc score: 0.8509406001768003


In [88]:
originalclass = []
predictedclass = []
nested_score = cross_val_score(LogisticRegression(C = 1, penalty='l1', solver='liblinear'), X, y, cv=10,\
               scoring=make_scorer(classification_report_with_accuracy_score))
print(classification_report(originalclass, predictedclass)) 

              precision    recall  f1-score   support

         0.0       0.98      1.00      0.99     42617
         1.0       0.00      0.00      0.00       783

   micro avg       0.98      0.98      0.98     43400
   macro avg       0.49      0.50      0.50     43400
weighted avg       0.96      0.98      0.97     43400



## Ensemble techniques

Common ways include:<br><br>
1. Max Voting<br><br>
The max voting method is generally used for classification problems. Here, multiple models are used to make predictions for each data point, and the prediction of each model is considered a ‘vote’. The prediction made by the majority of the models is used as the final prediction.<br><br>
2. Averaging<br><br>
In this method, we take an average of the predictions from all the models and use it to make the final prediction.<br><br>
3. Weighted Averaging<br><br>
This is an extension of the averaging method. The models are assigned different weights that define the importance of each model for the prediction.<br><br>
In this project, I use scikit-learn's VotingClassifier with voting='soft', which averages the class probabilities predicted by the models (the averaging method) rather than counting hard votes. Soft voting also provides the probabilities needed to compute the AUC score.
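
The three combination rules above can be sketched in a few lines of plain Python. The probabilities, the weights, and the helper names `max_vote` and `average` are made up for illustration and are not taken from the models in this notebook.

```python
from collections import Counter

def max_vote(labels):
    """Hard voting: the most common predicted label wins."""
    return Counter(labels).most_common(1)[0][0]

def average(probs, weights=None):
    """Plain or weighted average of positive-class probabilities."""
    if weights is None:
        weights = [1.0] * len(probs)
    return sum(w * p for w, p in zip(weights, probs)) / sum(weights)

# Made-up stroke probabilities from three hypothetical models for one patient
probs = [0.45, 0.40, 0.90]
labels = [int(p >= 0.5) for p in probs]  # each model's hard prediction

print(max_vote(labels))                   # 0: two of three models vote "no stroke"
print(average(probs))                     # about 0.583
print(average(probs, weights=[1, 1, 3]))  # about 0.71
```

Note that the rules can disagree: the majority votes 0, while the averaged probability is above 0.5 because one model is very confident. This is the practical difference between `voting='hard'` and `voting='soft'` in `VotingClassifier`.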

In [67]:
from sklearn.ensemble import VotingClassifier

In [68]:
model1 = LogisticRegression(C = 1, penalty='l1', solver='liblinear')
model2 = GradientBoostingClassifier(n_estimators = 90,learning_rate =0.1 )
model3 = AdaBoostClassifier(n_estimators =30 ,learning_rate =1)
model = VotingClassifier(estimators=[('lr', model1), ('gbm', model2), ('adab', model3)], voting='soft')



In [70]:
print ("Ensemble classifier accuracy: ",np.mean(cross_val_score(model, X, y,cv=10)))


Ensemble classifier accuracy:  0.9819354939117646


In [69]:
print("Ensemble classifier AUC score:")
print(np.mean(cross_val_score(model, X, y, scoring = 'roc_auc',cv=10)))


Ensemble classifier AUC score:
0.8510716543330983


In [89]:
originalclass = []
predictedclass = []
nested_score = cross_val_score(model, X, y, cv=10,\
               scoring=make_scorer(classification_report_with_accuracy_score))
print(classification_report(originalclass, predictedclass)) 

              precision    recall  f1-score   support

         0.0       0.98      1.00      0.99     42617
         1.0       0.00      0.00      0.00       783

   micro avg       0.98      0.98      0.98     43400
   macro avg       0.49      0.50      0.50     43400
weighted avg       0.96      0.98      0.97     43400
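
An AUC of about 0.85 alongside zero recall for the stroke class is not a contradiction. AUC measures how well the predicted probabilities rank patients, whereas the report uses hard predictions, which for these classifiers typically correspond to a 0.5 probability cutoff. With so few positives, predicted probabilities may rarely reach 0.5. The sketch below uses made-up probabilities and a hypothetical helper `recall_at` to show how recall depends on the cutoff.

```python
def recall_at(y_true, probs, threshold):
    """Positive-class recall when predicting 1 if prob >= threshold."""
    tp = sum(t == 1 and p >= threshold for t, p in zip(y_true, probs))
    positives = sum(t == 1 for t in y_true)
    return tp / positives

# Made-up stroke probabilities: positives tend to score higher,
# but none of them reaches 0.5
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
probs = [0.01, 0.02, 0.02, 0.05, 0.10, 0.20, 0.15, 0.30]

for threshold in (0.5, 0.12):
    print(threshold, recall_at(y_true, probs, threshold))
```

With the 0.5 cutoff no positive is found; with a lower cutoff both are, at the cost of one false positive (the negative scored 0.20). Whether such a trade-off is acceptable depends on the application.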

