# Intel® Extension for Scikit-learn Logistic Regression for stroke prediction dataset

In [1]:
from timeit import default_timer as timer
from sklearn import metrics
from sklearn.model_selection import train_test_split
import warnings
import pandas as pd
from IPython.display import HTML
warnings.filterwarnings('ignore')

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


### About Data

##### Context
According to the World Health Organization (WHO), stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths.
This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about the patient.

##### Attribute Information
1) id: unique identifier
2) gender: "Male", "Female" or "Other"
3) age: age of the patient
4) hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
5) heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
6) ever_married: "No" or "Yes"
7) work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
8) Residence_type: "Rural" or "Urban"
9) avg_glucose_level: average glucose level in blood
10) bmi: body mass index
11) smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*
12) stroke: 1 if the patient had a stroke or 0 if not

*Note: "Unknown" in smoking_status means that the information is unavailable for this patient.

##### Dataset Link
https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset

### Load the data

In [2]:
st = pd.read_csv('healthcare-dataset-stroke-data.csv')

In [3]:
st.head(5)

| | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |


In [4]:
st.columns

Index(['id', 'gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
       'work_type', 'Residence_type', 'avg_glucose_level', 'bmi',
       'smoking_status', 'stroke'],
      dtype='object')

In [5]:
st.shape

(5110, 12)

In [6]:
# to know the unique values in the column
st["smoking_status"].unique()

array(['formerly smoked', 'never smoked', 'smokes', 'Unknown'],
      dtype=object)

In [7]:
# One-hot encode the categorical columns; drop_first=True omits the first
# category of each column, which then serves as the baseline
Gender = pd.get_dummies(st["gender"], drop_first=True)
Ever_married = pd.get_dummies(st["ever_married"], drop_first=True)
residence = pd.get_dummies(st["Residence_type"], drop_first=True)
smoking = pd.get_dummies(st["smoking_status"], drop_first=True)
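The effect of `drop_first=True` is easiest to see on a tiny example. The sketch below uses a made-up `toy` frame (not the stroke data) and also shows the optional `prefix` argument of `pd.get_dummies`, which the notebook does not use; it is included only because bare column names such as `Yes` or `Other` no longer say which original column they came from.

```python
import pandas as pd

# Illustrative data only; this is not the stroke dataset.
toy = pd.DataFrame({"gender": ["Male", "Female", "Other", "Female"]})

# Categories are sorted, so "Female" comes first and is dropped as the baseline.
dummies = pd.get_dummies(toy["gender"], drop_first=True).astype(int)
print(list(dummies.columns))   # ['Male', 'Other']

# Optional variant (an addition, not what the notebook does):
# prefix the dummy columns with the name of the source column.
prefixed = pd.get_dummies(toy["gender"], prefix="gender", drop_first=True).astype(int)
print(list(prefixed.columns))  # ['gender_Male', 'gender_Other']
```

A row with zeros in every dummy column therefore belongs to the dropped baseline category.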

In [8]:
# Drop the identifier and the work_type column, which are not used as features
st.drop(["id", "work_type"], axis=1, inplace=True)

In [9]:
st = pd.concat([st,Gender,Ever_married,residence,smoking],axis = 1)

In [10]:
# Drop rows with missing values (for example, the missing bmi in row 1)
st.dropna(inplace=True)

In [11]:
st.shape

(4909, 17)

In [12]:
st.columns

Index(['gender', 'age', 'hypertension', 'heart_disease', 'ever_married',
       'Residence_type', 'avg_glucose_level', 'bmi', 'smoking_status',
       'stroke', 'Male', 'Other', 'Yes', 'Urban', 'formerly smoked',
       'never smoked', 'smokes'],
      dtype='object')

In [13]:
# Convert the dummy columns to integer 0/1 values
dummy_columns = ['Male', 'Other', 'Yes', 'Urban',
                 'formerly smoked', 'never smoked', 'smokes']
st[dummy_columns] = st[dummy_columns].astype(int)

In [14]:
# Features (numeric and dummy columns) and the target
x = st[['age', 'hypertension', 'heart_disease', 'avg_glucose_level', 'bmi',
        'Male', 'Other', 'Yes', 'Urban',
        'formerly smoked', 'never smoked', 'smokes']]
y = st['stroke']

In [15]:
st

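| | gender | age | hypertension | heart_disease | ever_married | Residence_type | avg_glucose_level | bmi | smoking_status | stroke | Male | Other | Yes | Urban | formerly smoked | never smoked | smokes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Male | 67.0 | 0 | 1 | Yes | Urban | 228.69 | 36.6 | formerly smoked | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 |
| 2 | Male | 80.0 | 0 | 1 | Yes | Rural | 105.92 | 32.5 | never smoked | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 3 | Female | 49.0 | 0 | 0 | Yes | Urban | 171.23 | 34.4 | smokes | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
| 4 | Female | 79.0 | 1 | 0 | Yes | Rural | 174.12 | 24.0 | never smoked | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 5 | Male | 81.0 | 0 | 0 | Yes | Urban | 186.21 | 29.0 | formerly smoked | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5104 | Female | 13.0 | 0 | 0 | No | Rural | 103.08 | 18.6 | Unknown | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5106 | Female | 81.0 | 0 | 0 | Yes | Urban | 125.20 | 40.0 | never smoked | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |
| 5107 | Female | 35.0 | 0 | 0 | Yes | Rural | 82.99 | 30.6 | never smoked | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 5108 | Male | 51.0 | 0 | 0 | Yes | Rural | 166.29 | 25.6 | formerly smoked | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |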


Split the data into train and test sets

In [16]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=43)
x_train.shape, x_test.shape, y_train.shape,y_test.shape

((4418, 12), (491, 12), (4418,), (491,))
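The split sizes above follow from simple arithmetic. As far as we understand `train_test_split`, a fractional `test_size` is rounded up to a whole number of rows and the remainder goes to the training set. The sketch below reproduces the numbers without calling scikit-learn; the variable names are ours.

```python
import math

n_samples = 4909   # rows left after dropna
test_size = 0.1    # fraction passed to train_test_split

# Assumed rounding rule: round the test share up, give the rest to training.
n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test

print(n_train, n_test)  # 4418 491
```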

### Patch original Scikit-learn with Intel® Extension for Scikit-learn
Intel® Extension for Scikit-learn (previously known as daal4py) contains drop-in replacement functionality for the stock Scikit-learn package. You can take advantage of the performance optimizations of Intel® Extension for Scikit-learn by adding just two lines of code before the usual Scikit-learn imports:

In [17]:
from sklearnex import patch_sklearn
patch_sklearn()

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


Intel® Extension for Scikit-learn patching affects the performance of specific Scikit-learn functionality. Refer to the [list of supported algorithms and parameters](https://intel.github.io/scikit-learn-intelex/algorithms.html) for details. When unsupported parameters are used, the package falls back to the original Scikit-learn. If the patching does not cover your scenarios, [submit an issue on GitHub](https://github.com/intel/scikit-learn-intelex/issues).
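To see whether a particular call was accelerated or fell back to the original Scikit-learn, it can help to raise the log level of the extension's logger before fitting. The sketch below uses only the standard `logging` module. The logger name `sklearnex` is taken from our reading of the extension's verbose-mode documentation, so treat it as an assumption and check it against the version you have installed.

```python
import logging

# Attach a default handler to the root logger so that messages are printed.
logging.basicConfig()

# Assumption: the extension reports through a logger named "sklearnex".
sklearnex_logger = logging.getLogger("sklearnex")
sklearnex_logger.setLevel(logging.INFO)

# With this in place, fitting a patched estimator is expected to log
# whether the accelerated or the original implementation was used.
```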

Train the Logistic Regression algorithm with Intel® Extension for Scikit-learn on the stroke dataset

In [18]:
from sklearn.linear_model import LogisticRegression

params = {
    'C': 0.1,
    'solver': 'liblinear',
    'multi_class': 'ovr',
    'n_jobs': -1,  # note: Scikit-learn ignores n_jobs when solver is 'liblinear'
}
start = timer()
classifier = LogisticRegression(**params).fit(x_train, y_train)
train_patched = timer() - start
f"Intel® extension for Scikit-learn time: {train_patched:.2f} s"

'Intel® extension for Scikit-learn time: 0.03 s'

Predict probability and get a result of the Logistic Regression algorithm with Intel® Extension for Scikit-learn

In [19]:
y_predict = classifier.predict_proba(x_test)
log_loss_opt = metrics.log_loss(y_test, y_predict)
f"Intel® extension for Scikit-learn Log Loss: {log_loss_opt}"

'Intel® extension for Scikit-learn Log Loss: 0.1288627505251347'
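For reference, the binary log loss is the mean negative log-likelihood of the true labels under the predicted probabilities: $-\frac{1}{N}\sum_i \left[ y_i \log p_i + (1 - y_i)\log(1 - p_i) \right]$, where $p_i$ is the predicted probability of class 1 (the second column of `predict_proba`). The sketch below computes it by hand on three made-up predictions; the helper name `binary_log_loss` and the clipping constant `eps` are our own choices, not part of the notebook.

```python
import math

def binary_log_loss(y_true, p_positive, eps=1e-15):
    """Mean negative log-likelihood for 0/1 labels and P(class = 1)."""
    total = 0.0
    for y, p in zip(y_true, p_positive):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# Made-up labels and probabilities, for illustration only.
y_true = [1, 0, 0]
p_positive = [0.8, 0.1, 0.3]

loss = binary_log_loss(y_true, p_positive)
print(f"{loss:.4f}")  # 0.2284
```

Lower values are better, and the metric is a pure number without a unit.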

### Train the same algorithm with original Scikit-learn
To cancel the optimizations, we use *unpatch_sklearn* and re-import the LogisticRegression class.

In [20]:
from sklearnex import unpatch_sklearn
unpatch_sklearn()

Training of the Logistic Regression algorithm with original Scikit-learn library for stroke dataset

In [21]:
from sklearn.linear_model import LogisticRegression

start = timer()
classifier = LogisticRegression(**params).fit(x_train, y_train)
train_unpatched = timer() - start
f"Original Scikit-learn time: {train_unpatched:.2f} s"

'Original Scikit-learn time: 0.02 s'

Predict probability and get a result of the Logistic Regression algorithm with original Scikit-learn

In [22]:
y_predict = classifier.predict_proba(x_test)
log_loss_original = metrics.log_loss(y_test, y_predict)
f"Original Scikit-learn Log Loss: {log_loss_original}"

'Original Scikit-learn Log Loss: 0.1288627505251347'

In [23]:
HTML(f"<h3>Compare Log Loss metric of patched Scikit-learn and original</h3>"
     f"Log Loss metric of patched Scikit-learn: {log_loss_opt} <br>"
     f"Log Loss metric of unpatched Scikit-learn: {log_loss_original} <br>"
     f"Metrics ratio: {log_loss_opt/log_loss_original} <br>"
     f"<h3>With Scikit-learn-intelex patching you can:</h3>"
     f"<ul>"
     f"<li>Use your Scikit-learn code for training and prediction with minimal changes (a couple of lines of code);</li>"
     f"<li>Execute training and prediction of Scikit-learn models faster;</li>"
     f"<li>Get similar quality;</li>"
     f"<li>Get a speedup of <strong>{(train_unpatched/train_patched):.1f}</strong> times.</li>"
     f"</ul>")
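Training times of a few hundredths of a second are close to the timer's noise level, so a ratio of two single measurements can swing from run to run. One way to get a steadier comparison is to repeat each fit several times and compare medians. The helper below is a minimal sketch; the name `median_fit_time` and the default of five repeats are our own choices, and the workload is a stand-in rather than a model fit.

```python
from statistics import median
from timeit import default_timer as timer

def median_fit_time(fit, repeats=5):
    """Call fit() `repeats` times and return the median duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = timer()
        fit()
        durations.append(timer() - start)
    return median(durations)

# Stand-in workload so the sketch runs on its own.
elapsed = median_fit_time(lambda: sum(i * i for i in range(100_000)))
print(f"median time: {elapsed:.4f} s")
```

In this notebook the callable could be `lambda: LogisticRegression(**params).fit(x_train, y_train)`, measured once while patched and once after unpatching.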