
#Full DL Solution

© 2024, Zaka AI, Inc. All Rights Reserved.

---

###**Case Study:** Stroke Prediction

**Objective:** The goal of this project is to walk you through a case study where you can apply the deep learning concepts you learned during the week. By the end of this project, you will have developed a solution that predicts whether a person will have a stroke.


**Dataset Explanation:** We will be using the stroke dataset. Its features are:


* **id:** unique identifier
* **gender:** "Male", "Female" or "Other"
* **age:** age of the patient
* **hypertension:** 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
* **heart_disease:** 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
* **ever_married:** "No" or "Yes"
* **work_type:** "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
* **Residence_type:** "Rural" or "Urban"
* **avg_glucose_level:** average glucose level in blood
* **bmi:** body mass index
* **smoking_status:** "formerly smoked", "never smoked", "smokes" or "Unknown"
* **stroke:** 1 if the patient had a stroke or 0 if not

#Importing Libraries

We start by importing the libraries NumPy and pandas.

In [3]:
#Test Your Zaka
import numpy as np
import pandas as pd

#Loading the Dataset

We load the dataset from a CSV file and look at its first rows.

In [4]:
#Test Your Zaka
df=pd.read_csv('healthcare-dataset-stroke-data.csv')
df.head()

,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


#Exploratory Data Analysis

Now we start the exploratory data analysis.

###Shape of the data

First, check the shape of the data: how many examples and features do we have?

In [5]:
#Test Your Zaka
df.shape
print(f"Number of examples: {df.shape[0]}")
print(f"Number of features: {df.shape[1]}")

Number of examples: 5110
Number of features: 12


###Types of different Columns

Check the type of each feature and whether any column contains null values.

In [6]:
#Test Your Zaka
df.dtypes

column,dtype
id,int64
gender,object
age,float64
hypertension,int64
heart_disease,int64
ever_married,object
work_type,object
Residence_type,object
avg_glucose_level,float64
bmi,float64


In [7]:
df.isnull().sum()

column,null_count
id,0
gender,0
age,0
hypertension,0
heart_disease,0
ever_married,0
work_type,0
Residence_type,0
avg_glucose_level,0
bmi,201


###Dealing with categorical variables

Now we will walk through the categorical variables that we have to see the categories and the counts of each of them.

In [8]:
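# Raw counts (not displayed: a cell only shows its last expression)
df.gender.value_counts()
# Preview the counts after merging 'Other' into 'Female' (df itself is not modified here)
df.gender.replace({'Other':'Female'}).value_counts()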

gender,count
Female,2995
Male,2115


In [9]:
df.ever_married.value_counts()

ever_married,count
Yes,3353
No,1757


In [10]:
df.work_type.value_counts()

work_type,count
Private,2925
Self-employed,819
children,687
Govt_job,657
Never_worked,22


In [11]:
df.Residence_type.value_counts()

Residence_type,count
Urban,2596
Rural,2514


In [12]:
df.smoking_status.value_counts()

smoking_status,count
never smoked,1892
Unknown,1544
formerly smoked,885
smokes,789
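
The per-column calls above can be collapsed into one loop. The following is a minimal sketch, assuming a small hypothetical DataFrame `toy` that only mimics a few columns of the stroke dataset; on the real data you would iterate over `df` instead.

```python
import pandas as pd

# Hypothetical toy frame that mimics a few columns of the stroke dataset.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Other"],
    "ever_married": ["Yes", "No", "Yes", "Yes"],
    "age": [67.0, 61.0, 80.0, 49.0],
})

# Treat every non-numeric column as categorical and count its categories.
categorical_cols = toy.select_dtypes(exclude="number").columns
counts = {col: toy[col].value_counts() for col in categorical_cols}

for col, series in counts.items():
    print(f"--- {col} ---")
    print(series.to_string())
```

Keeping the counts in a dictionary makes it easy to spot rare categories (such as "Other" here) before deciding how to encode them.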



#Preprocessing

Prepare the data so that it is ready to train a DL model.

In [14]:
#Test Your Zaka
# Drop the identifier column, which carries no predictive information
df = df.drop('id', axis=1)
# Merge 'Other' into 'Female'; assign back instead of using a chained inplace call
df['gender'] = df['gender'].replace({'Other': 'Female'})
# One-hot encode the categorical columns
df = pd.get_dummies(df, columns=['gender','ever_married','work_type','Residence_type','smoking_status'])
# Fill missing BMI values with the column mean
df['bmi'] = df['bmi'].fillna(df['bmi'].mean())
df.dtypes

column,dtype
age,float64
hypertension,int64
heart_disease,int64
avg_glucose_level,float64
bmi,float64
stroke,int64
gender_Female,bool
gender_Male,bool
ever_married_No,bool
ever_married_Yes,bool


In [15]:
#Test Your Zaka
Y = df['stroke']
X = df.drop('stroke', axis=1)

In [16]:
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

# X holds the features and Y holds the labels

# Shuffle the data
X_shuffled, y_shuffled = shuffle(X, Y, random_state=42)

# Split into training (70%) and test (30%) sets
X_train, X_test, y_train, y_test = train_test_split(
    X_shuffled, y_shuffled, test_size=0.3, random_state=42
)
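
With a target this imbalanced, a plain random split can leave very few positive cases in one of the sets. A minimal sketch, assuming hypothetical toy arrays `X_toy` and `y_toy`, shows how the `stratify` argument of `train_test_split` keeps the class ratio equal in both splits. Note that `train_test_split` shuffles by default, so a separate `shuffle` call is optional.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 95 negatives and 5 positives.
y_toy = np.array([0] * 95 + [1] * 5)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy preserves the 95:5 class ratio in the train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

print("positives in train:", int(y_tr.sum()))
print("positives in test:", int(y_te.sum()))
```

This is an optional variation, not the split used in this notebook; applying it to the real data would change the reported numbers.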


#Building the DL Model

Now it's time to build the actual model. Propose a DL architecture suitable for this problem and print its summary.

In [17]:
#Test Your Zaka
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def create_baseline():
    # create model
    model = Sequential()
    model.add(Dense(20, input_dim=20, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))

    return model

model = create_baseline()
model.summary()



###Compiling the model

Now we need to compile the model.

In [18]:
#Test Your Zaka
model.compile(loss='binary_crossentropy', optimizer='adam',metrics=['accuracy'])

###Fitting the model

We split the dataset into training (70%) and test (30%) sets, fit the model on the training data, and validate on the test data.

In [19]:
from sklearn.metrics import confusion_matrix, classification_report

model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_test, y_test))

loss, accuracy = model.evaluate(X_test, y_test)
print('Test Accuracy:', accuracy)


Epoch 1/10
112/112 ━━━━━━━━━━━━━━━━━━━━ 2s 4ms/step - accuracy: 0.7524 - loss: 0.4657 - val_accuracy: 0.9498 - val_loss: 0.2461
Epoch 2/10
112/112 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - accuracy: 0.9549 - loss: 0.2375 - val_accuracy: 0.9498 - val_loss: 0.2321
Epoch 3/10
112/112 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.9553 - loss: 0.2208 - val_accuracy: 0.9498 - val_loss: 0.2211
Epoch 4/10
112/112 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9523 - loss: 0.2090 - val_accuracy: 0.9498 - val_loss: 0.2010
Epoch 5/10
112/112 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9516 - loss: 0.2004 - val_accuracy: 0.9498 - val_loss: 0.1907
Epoch 6/10
112/112 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9527 - loss: 0.1799 - val_accuracy: 0.9498 - val_loss: 0.1854
Epoch 7/10
112/112

In [20]:
y_pred = (model.predict(X_test) > 0.5).astype(int)

print('Classification Report:')
print(classification_report(y_test, y_pred))

48/48 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
Classification Report:
              precision    recall  f1-score   support

           0       0.95      1.00      0.97      1456
           1       0.00      0.00      0.00        77

    accuracy                           0.95      1533
   macro avg       0.47      0.50      0.49      1533
weighted avg       0.90      0.95      0.93      1533





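What can you deduce from the results you obtained?

**The dataset is severely imbalanced. The model appears to predict "no stroke" for every test example, so the 0.95 accuracy merely matches the share of the majority class (1456 of 1533), while precision and recall for the stroke class are both 0.**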
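
The classification report above is what an all-negative predictor would produce. A minimal sketch, assuming the model outputs class 0 for every test example (an assumption consistent with the zero precision and recall for class 1), rebuilds the confusion matrix from the support counts in the report:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Support values taken from the report: 1456 negatives and 77 positives.
y_true = np.array([0] * 1456 + [1] * 77)
# Assumption: the model predicts class 0 for every example.
y_hat = np.zeros_like(y_true)

cm = confusion_matrix(y_true, y_hat)
majority_accuracy = (y_true == y_hat).mean()
stroke_recall = recall_score(y_true, y_hat, zero_division=0)

print(cm)
print(f"accuracy: {majority_accuracy:.4f}")
print(f"stroke recall: {stroke_recall:.2f}")
```

The accuracy printed here equals the `val_accuracy` of 0.9498 seen in the training log, which is why accuracy alone is a poor metric for this problem.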

#Improving DL Models

**TIP: When tuning your model to obtain a better performance, make sure you use a validation set**

###Data Improvement

After having studied your data in previous parts, enhance the performance of your model with one data improvement using **SMOTE**.

In [21]:
#Test Your Zaka
from imblearn.over_sampling import SMOTE

smote = SMOTE(sampling_strategy='minority',random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, Y)
y_resampled.value_counts()

stroke,count
1,4861
0,4861
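
To see where the extra minority rows come from, here is a minimal NumPy sketch of the interpolation idea behind SMOTE. It is a simplified illustration, not the `imblearn` implementation; the function `smote_like_samples` and the array `X_min` are hypothetical names introduced for this example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_samples(X_min, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbours because each point is its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))                # random minority row
        neighbour = idx[base, rng.integers(1, k + 1)]  # one of its k neighbours
        gap = rng.random()                             # position along the segment
        synthetic[i] = X_min[base] + gap * (X_min[neighbour] - X_min[base])
    return synthetic

# Hypothetical minority-class rows: (age, avg_glucose_level).
X_min = np.array([[67.0, 228.7], [80.0, 105.9], [79.0, 174.1], [61.0, 202.2]])
new_rows = smote_like_samples(X_min, n_new=5)
print(new_rows.round(2))
```

Because each synthetic row lies on a segment between two real rows, resampled data contains fractional values that never occur in the original records. Interpolating one-hot columns can also yield inconsistent combinations; `imblearn` offers `SMOTENC` for data with categorical features.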


In [22]:
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

# X_resampled holds the features and y_resampled holds the labels

# Shuffle the data
X_shuffled, y_shuffled = shuffle(X_resampled, y_resampled, random_state=42)

# Split into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(
    X_shuffled, y_shuffled, test_size=0.2, random_state=42
)
X_train.head()

,age,hypertension,heart_disease,avg_glucose_level,bmi,gender_Female,gender_Male,ever_married_No,ever_married_Yes,work_type_Govt_job,work_type_Never_worked,work_type_Private,work_type_Self-employed,work_type_children,Residence_type_Rural,Residence_type_Urban,smoking_status_Unknown,smoking_status_formerly smoked,smoking_status_never smoked,smoking_status_smokes
175,72.0,1,0,185.49,37.1,False,True,False,True,False,False,False,True,False,True,False,False,False,True,False
5075,70.0,0,0,102.5,37.8,False,True,False,True,False,False,True,False,False,False,True,True,False,False,False
160,76.0,0,0,57.92,28.893237,True,False,False,True,False,False,True,False,False,False,True,False,True,False,False
8524,81.571528,0,1,207.875812,32.285764,True,True,True,True,False,False,True,False,False,True,True,True,False,False,True
2178,34.0,0,0,90.15,27.9,True,False,True,False,False,False,True,False,False,True,False,False,True,False,False


In [23]:

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1, random_state=42)

model = create_baseline()
model.compile(loss='binary_crossentropy', optimizer='adam',metrics=['accuracy'])
model.fit(X_train, y_train, epochs=20, batch_size=64, validation_data=(X_val, y_val))

loss, accuracy = model.evaluate(X_test, y_test)
print('Test Accuracy:', accuracy)


Epoch 1/20
110/110 ━━━━━━━━━━━━━━━━━━━━ 2s 4ms/step - accuracy: 0.6306 - loss: 0.6285 - val_accuracy: 0.7404 - val_loss: 0.5417
Epoch 2/20
110/110 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.7477 - loss: 0.5292 - val_accuracy: 0.7468 - val_loss: 0.4925
Epoch 3/20
110/110 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7543 - loss: 0.5010 - val_accuracy: 0.7635 - val_loss: 0.4851
Epoch 4/20
110/110 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7750 - loss: 0.4692 - val_accuracy: 0.7892 - val_loss: 0.4460
Epoch 5/20
110/110 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.7814 - loss: 0.4619 - val_accuracy: 0.7995 - val_loss: 0.4254
Epoch 6/20
110/110 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - accuracy: 0.8044 - loss: 0.4295 - val_accuracy: 0.8316 - val_loss: 0.4044
Epoch 7/20
110/110 ━━━━━━━

In [24]:
y_pred = (model.predict(X_test) > 0.5).astype(int)

print('Classification Report:')
print(classification_report(y_test, y_pred))

61/61 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
Classification Report:
              precision    recall  f1-score   support

           0       0.94      0.90      0.92       977
           1       0.90      0.94      0.92       968

    accuracy                           0.92      1945
   macro avg       0.92      0.92      0.92      1945
weighted avg       0.92      0.92      0.92      1945



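Comment on the performance you obtained

**Performance improved substantially: accuracy is 0.92, and recall for the stroke class rose from 0.00 to 0.94. Because SMOTE was applied before the train/test split, the test set also contains synthetic samples, so these figures are likely optimistic.**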

###Model Design

Propose one model design method to improve the performance of your model even more.

In [25]:
#Test Your Zaka
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization, Dropout
from tensorflow.keras.callbacks import EarlyStopping

def create_baseline():
    model = Sequential()
    model.add(Dense(20, input_dim=20,  activation='relu'))
    model.add(BatchNormalization())
    model.add(Dropout(0.1))
    model.add(Dense(1,  activation='sigmoid'))

    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model



# Create the model
model = create_baseline()

# Define Early Stopping
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

# Fit the model with early stopping
history = model.fit(X_train, y_train, epochs=20, batch_size=32, validation_data=(X_val, y_val), callbacks=[early_stopping])

# Evaluate the model on the test data
loss, accuracy = model.evaluate(X_test, y_test)
print('Test Accuracy:', accuracy)

y_pred = (model.predict(X_test) > 0.5).astype(int)

print('Classification Report:')
print(classification_report(y_test, y_pred))

Epoch 1/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 3s 3ms/step - accuracy: 0.6523 - loss: 0.6139 - val_accuracy: 0.7648 - val_loss: 0.4764
Epoch 2/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.7752 - loss: 0.4771 - val_accuracy: 0.7301 - val_loss: 0.5121
Epoch 3/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.8285 - loss: 0.3957 - val_accuracy: 0.6864 - val_loss: 0.5086
Epoch 4/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9218 - loss: 0.2230 - val_accuracy: 0.9126 - val_loss: 0.2480
Epoch 5/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9325 - loss: 0.1876 - val_accuracy: 0.9293 - val_loss: 0.1925
Epoch 6/20
219/219 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9374 - loss: 0.1766 - val_accuracy: 0.9447 - val_loss: 0.1621
Epoch 7/20
219/219 ━━━━━━━

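Comment on the performance of your model

**Performance is strong: validation accuracy reaches 0.9447 by epoch 6, compared with 0.8316 at the same epoch for the previous model.**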
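
The `EarlyStopping` callback used above stops training once `val_loss` has failed to improve for `patience` consecutive epochs. The following pure-Python sketch illustrates that rule in simplified form; it is not the Keras implementation, and the function name `early_stopping_epoch` and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 1-indexed, for a list of validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: remember this epoch and reset the patience counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Training ran to the end without exhausting the patience budget.
    return len(val_losses), best_epoch

# Hypothetical validation losses, not taken from the run above.
losses = [0.48, 0.51, 0.50, 0.25, 0.19, 0.16, 0.17, 0.18, 0.17, 0.15]
stop, best = early_stopping_epoch(losses, patience=3)
print(stop, best)  # 9 6
```

With `restore_best_weights=True`, the model would be rolled back to the weights from the best epoch (epoch 6 in this toy sequence) rather than keeping those from the epoch where training stopped.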

In [27]:
model.save('model.h5')

