# Stroke Prediction Dataset
Exploratory Data Analysis on Stroke Prediction (Python)
***

<img src="https://cdn.pixabay.com/photo/2018/02/20/17/33/brain-3168269_1280.png" width='70%' margin='0, auto'>

# 1. INTRODUCTION

## 1.1 Context

According to the World Health Organization (WHO), stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths.
This dataset is used to predict whether a patient is likely to have a stroke, based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about one patient.

## 1.2 Data Description

* **`id`**: unique identifier
* **`gender`**: "Male", "Female" or "Other"
* **`age`**: age of the patient
* **`hypertension`**: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
* **`heart_disease`**: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
* **`ever_married`**: "No" or "Yes"
* **`work_type`**: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
* **`Residence_type`**: "Rural" or "Urban"
* **`avg_glucose_level`**: average glucose level in blood
* **`bmi`**: body mass index
* **`smoking_status`**: "formerly smoked", "never smoked", "smokes" or "Unknown"
* **`stroke`**: 1 if the patient had a stroke or 0 if not

# 2. IMPORTING LIBRARIES

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
sns.set(style="darkgrid")
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
from scipy import stats


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# 3. READING AND VIEWING DATA

In [None]:
df_stroke = pd.read_csv("/kaggle/input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv")
df_stroke.head()

In [None]:
df_stroke.info()

In [None]:
df_stroke.describe()

# 4. DATA WRANGLING

In [None]:
df_stroke.columns = df_stroke.columns.str.lower()
df_stroke.columns.to_list()

## 4.1 Duplicated Data

In [None]:
df_stroke.duplicated().sum()

## 4.2 Missing Data

### Identifying Missing Data

In [None]:
missing_values=df_stroke.isnull().sum()
missing_values

In [None]:
# Total cells in the dataset
total_cells = np.prod(df_stroke.shape)  # np.product was removed in NumPy 2.0
print('Total cells in this dataset:',total_cells)

In [None]:
# Calculating the percentage of missing values:
total_missing = missing_values.sum()
percent_missing = (total_missing/total_cells) * 100
print("Total missing values: {}  =  {:.2f} %".format(total_missing, percent_missing))

In [None]:
# Calculating the percentage of missing values in each column is often more meaningful to me
print('PERCENTAGE OF MISSING VALUES IN EACH COLUMN:\n')
for col in df_stroke.columns:
    missing = np.mean(df_stroke[col].isnull())
    print('{}:  {:.2f}%'.format(col, missing*100))
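The per-column loop above can also be written as a single vectorised expression. This is a minimal sketch on a small made-up frame (`toy` is invented for illustration and does not contain rows from the stroke data):

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, np.nan, 32.5, np.nan],
    "smoking_status": ["formerly smoked", "never smoked", "Unknown", "smokes"],
})

# isnull() yields booleans; their column-wise mean is the fraction of missing values
pct_missing = toy.isnull().mean().mul(100).round(2)
print(pct_missing.sort_values(ascending=False))
```

Applied to `df_stroke`, the same expression should give one percentage per column without an explicit loop.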

### Dealing with Missing Data

In [None]:
# Showing rows where values for bmi are missing
missing_bmi=df_stroke[pd.isnull(df_stroke.bmi)]
missing_bmi.head(5)

The column `smoking_status` uses the status 'Unknown' instead of missing values. For consistency, I replace the missing values in the column `bmi` with 'Unknown' as well. Note that this turns `bmi` into a column of mixed type (object), so pandas no longer treats it as numeric.

In [None]:
df_stroke["bmi"] = df_stroke["bmi"].fillna("Unknown")
df_stroke.head()

In [None]:
df_stroke.isnull().sum()
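An alternative that keeps `bmi` numeric is to fill the gaps with the column median and record in a separate flag which rows were originally missing. This is not what the notebook does; it is only a sketch on made-up values, and the column name `bmi_missing` is a hypothetical choice:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({"bmi": [36.6, np.nan, 32.5, 24.0, np.nan]})

# Remember which rows were missing before filling them
toy["bmi_missing"] = toy["bmi"].isnull().astype(int)

# Fill with the median of the observed values so the column stays float64
toy["bmi"] = toy["bmi"].fillna(toy["bmi"].median())

print(toy.dtypes)
print(toy)
```

With this approach `bmi` would still be usable in `describe()` and in numeric correlations, at the cost of introducing imputed values.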

# 5. EXPLORATORY DATA ANALYSIS (EDA)

## 5.1 Correlation

In [None]:
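# Restrict to numeric columns; recent pandas versions raise an error on text columns
df_stroke.select_dtypes(include='number').corr()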

In [None]:
# Visualisation of the correlation table (numeric columns only)
correlation = df_stroke.select_dtypes(include='number').corr()
plt.figure(figsize=(11,9))
sns.heatmap(correlation, linecolor='white',linewidths=0.1, annot=True)
plt.title('Correlation Matrix', size=17)
plt.xlabel('Stroke Data')
plt.ylabel('Stroke Data')
plt.show()

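As there are no strong correlations within the numeric data, I will convert the other columns into numeric codes so that they can be included in the correlation. `factorize()` assigns each distinct value an integer code in order of first appearance, so the resulting coefficients are only a rough indication, especially for columns with more than two categories. Note that the code below applies this encoding to every column, including the ones that are already numeric.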

In [None]:
df_stroke.apply(lambda x: x.factorize()[0]).corr(method='pearson')

In [None]:
# Visualisation of the correlation table
correlation_matrix = df_stroke.apply(lambda x: x.factorize()[0]).corr(method='pearson')
plt.figure(figsize=(11,9))
sns.heatmap(correlation_matrix, annot = True)
plt.title("Correlation Matrix")
plt.xlabel("Stroke Data")
plt.ylabel("Stroke Data")
plt.show()
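Because `factorize()` replaces values with codes in order of first appearance, applying it to columns such as `age` discards their numeric ordering. A possible variant, sketched here on made-up data, encodes only the non-numeric columns and leaves numeric ones untouched (the frame `toy` and the name `encoded` are hypothetical):

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "age": [67, 30, 80, 45, 52, 71],
    "ever_married": ["Yes", "No", "Yes", "No", "Yes", "Yes"],
    "stroke": [1, 0, 1, 0, 0, 1],
})

encoded = toy.copy()
for col in encoded.select_dtypes(exclude="number").columns:
    # Integer codes in order of first appearance: here "Yes" -> 0, "No" -> 1
    encoded[col] = pd.factorize(encoded[col])[0]

corr = encoded.corr(method="pearson")
print(corr.round(2))
```

For columns with more than two unordered categories (for example `work_type`), the integer codes are arbitrary, so the Pearson coefficient should still be read with caution.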

In [None]:
# Showing the highest correlations in descending order
correlation_mat = correlation_matrix
corr_pairs = correlation_mat.unstack()
sorted_pairs = corr_pairs.sort_values(kind="quicksort", ascending=False).where(corr_pairs < 1.0)
strong_pairs = sorted_pairs[abs(sorted_pairs) > 0.1]
print(strong_pairs)
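Unstacking the full matrix lists every pair twice, once as (a, b) and once as (b, a). One way to keep each pair only once is to mask everything except the upper triangle before stacking. A minimal sketch on a made-up 3x3 matrix:

```python
import numpy as np
import pandas as pd

# Toy correlation matrix, invented for illustration only
corr = pd.DataFrame(
    [[1.0, 0.6, -0.2],
     [0.6, 1.0, 0.05],
     [-0.2, 0.05, 1.0]],
    index=["a", "b", "c"], columns=["a", "b", "c"],
)

# True only strictly above the diagonal, so self-pairs and mirrored pairs are masked
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()  # dropna() removes the masked entries

# Keep pairs with |r| > 0.1, strongest first regardless of sign
strong = pairs[pairs.abs() > 0.1].sort_values(key=np.abs, ascending=False)
print(strong)
```

Sorting by absolute value also brings strong negative correlations to the top, which a plain descending sort would place last.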

In [None]:
# Creating new columns with numeric values for gender and marital status
# gender_number: 1 = Female, 0 = Male or Other; married_number: 1 = Yes, 0 = No
df_stroke['gender_number'] = np.where(df_stroke['gender'] == 'Female', 1, 0)
df_stroke['married_number'] = np.where(df_stroke['ever_married'] == 'Yes', 1, 0)
df_stroke.head()

## 5.2 Analysing Patterns using Visualisation

***
**GENERAL INFORMATION ABOUT PATIENTS**
***

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(20,7))

# Plot the original 'gender' labels so the legend entries cannot be mixed up
# (hue_order restricts both panels to these two categories, in a fixed order)
sns.countplot(x='ever_married', hue='gender', hue_order=['Male','Female'], data=df_stroke, ax=ax[0])
ax[0].set_title('Patients Married', size=17, pad=17)
ax[0].set_xlabel('Ever Married', size=13, labelpad=11)
ax[0].set_ylabel('Count', size=13)
sns.move_legend(ax[0], 'upper right', title='Gender:')  # requires seaborn >= 0.11.2

sns.countplot(x='work_type', hue='gender', hue_order=['Male','Female'], data=df_stroke, ax=ax[1])
ax[1].set_title('Work Type of all Patients', size=17, pad=17)
ax[1].set_xlabel('Work Type', size=13, labelpad=11)
ax[1].set_ylabel('Count', size=13)
sns.move_legend(ax[1], 'upper right', title='Gender:')

plt.show()

In [None]:
heart_diseases = df_stroke.loc[(df_stroke.heart_disease == 1)]
no_heart_disease = df_stroke.loc[(df_stroke.heart_disease == 0)]

fig, ax = plt.subplots(1, 2, figsize=(20,7))

# Plot the original 'gender' labels so the legend entries cannot be mixed up
sns.histplot(x='age', hue='gender', hue_order=['Male','Female'], data=heart_diseases, ax=ax[0], bins=10)
ax[0].set_title('Patients with Heart Disease by Gender', size=17, pad=17)
ax[0].set_xlabel('Age', size=13, labelpad=11)
ax[0].set_ylabel('Count', size=13)
sns.move_legend(ax[0], 'upper left', title='Gender:')  # requires seaborn >= 0.11.2

sns.histplot(x='age', hue='gender', hue_order=['Male','Female'], data=no_heart_disease, ax=ax[1], bins=10)
ax[1].set_title('Patients without Heart Disease by Gender', size=17, pad=17)
ax[1].set_xlabel('Age', size=13, labelpad=11)
ax[1].set_ylabel('Count', size=13)
sns.move_legend(ax[1], 'upper left', title='Gender:')

# Apply the same x-range to both panels (plt.xlim only affects the last one)
for a in ax:
    a.set_xlim(0, 85)
plt.show()

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(20,7))

# Left panel: all patients; right panel: patients without heart disease
sns.histplot(x='age', hue='gender', hue_order=['Male','Female'], data=df_stroke, ax=ax[0])
ax[0].set_title('All Patients by Gender', size=17, pad=17)
ax[0].set_xlabel('Age', size=13, labelpad=11)
ax[0].set_ylabel('Count', size=13)
sns.move_legend(ax[0], 'upper left', title='Gender:')  # requires seaborn >= 0.11.2

sns.histplot(x='age', hue='gender', hue_order=['Male','Female'], data=no_heart_disease, ax=ax[1], bins=10)
ax[1].set_title('Patients without Heart Disease by Gender', size=17, pad=17)
ax[1].set_xlabel('Age', size=13, labelpad=11)
ax[1].set_ylabel('Count', size=13)
sns.move_legend(ax[1], 'upper left', title='Gender:')

# Apply the same x-range to both panels (plt.xlim only affects the last one)
for a in ax:
    a.set_xlim(0, 85)
plt.show()

In [None]:
hypertension = df_stroke.loc[(df_stroke.hypertension == 1)]
no_hypertension = df_stroke.loc[(df_stroke.hypertension == 0)]

fig, ax = plt.subplots(1, 2, figsize=(20,7))

# Plot the original 'gender' labels so the legend entries cannot be mixed up
sns.histplot(x='age', hue='gender', hue_order=['Male','Female'], data=hypertension, ax=ax[0], bins=10)
ax[0].set_title('Patients with Hypertension by Gender', size=17, pad=17)
ax[0].set_xlabel('Age', size=13, labelpad=11)
ax[0].set_ylabel('Count', size=13)
sns.move_legend(ax[0], 'upper left', title='Gender:')  # requires seaborn >= 0.11.2

sns.histplot(x='age', hue='gender', hue_order=['Male','Female'], data=no_hypertension, ax=ax[1], bins=10)
ax[1].set_title('Patients without Hypertension by Gender', size=17, pad=17)
ax[1].set_xlabel('Age', size=13, labelpad=11)
ax[1].set_ylabel('Count', size=13)
sns.move_legend(ax[1], 'upper left', title='Gender:')

# Apply the same x-range to both panels (plt.xlim only affects the last one)
for a in ax:
    a.set_xlim(0, 85)
plt.show()

***
**AGE**
***

In [None]:
print(df_stroke.gender.value_counts().sum())
print(df_stroke.gender.value_counts())

In [None]:
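# Calculating the number and percentage of female and male patients
# Counts are looked up by label: value_counts() sorts by frequency,
# so positional indexing would not reliably match a given gender
gender_counts = df_stroke.gender.value_counts()
male_patients = gender_counts.get('Male', 0)
female_patients = gender_counts.get('Female', 0)
other = gender_counts.get('Other', 0)
all_patients = male_patients + female_patients + other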

pct_female_patients = female_patients * 100 / all_patients
pct_male_patients = male_patients * 100 / all_patients
pct_other = other * 100 / all_patients

print('Female Patients:\t{} ---> {:.2F}%\nMale Patients:\t\t{} ---> {:.2F}%\nOther:\t\t\t{} ---> {:.2F}%\n\nTOTAL:\t\t\t{}'.format(female_patients,pct_female_patients,male_patients,pct_male_patients,other,pct_other,all_patients))
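The counts and percentages can also be obtained in one step with `value_counts(normalize=True)`. A minimal sketch on a made-up series (the values are invented and do not reflect the stroke data):

```python
import pandas as pd

# Toy data, invented for illustration only
gender = pd.Series(
    ["Female", "Male", "Female", "Other", "Female", "Male"], name="gender"
)

counts = gender.value_counts()                                   # absolute counts per label
shares = gender.value_counts(normalize=True).mul(100).round(2)   # percentages per label

# Both results are indexed by label, so they align when combined
summary = pd.DataFrame({"count": counts, "percent": shares})
print(summary)
```

Because both series are indexed by the gender label, no assumption about their order is needed.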

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(20,7))

sns.boxplot(x="age", data=df_stroke, ax=ax[0])
ax[0].set_title('Age of all Patients', size=17, pad=17)
ax[0].set_xlabel('Age', size=13, labelpad=11)

# Plot the original 'gender' labels so the legend entries cannot be mixed up
sns.histplot(x='age', hue='gender', hue_order=['Male','Female'], data=df_stroke, ax=ax[1], kde=True)
ax[1].set_title('Age Distribution of all Patients', size=17, pad=17)
ax[1].set_xlabel('Age', size=13, labelpad=11)
ax[1].set_ylabel('Count', size=13)
sns.move_legend(ax[1], 'upper left', title='Gender:')  # requires seaborn >= 0.11.2

plt.show()

In [None]:
# Counting female and male patients who had a stroke

# Females:
female_stroke = df_stroke.loc[(df_stroke.stroke == 1) & (df_stroke.gender == 'Female')]
number_female_stroke = len(female_stroke)  # number of rows; unlike value_counts(), unaffected by NaN values
# Males:
male_stroke = df_stroke.loc[(df_stroke.stroke == 1) & (df_stroke.gender == 'Male')]
number_male_stroke = len(male_stroke)
# Total:
total_strokes = number_male_stroke + number_female_stroke

print('NUMBER OF STROKES:\nFemales:\t{}\nMales:\t\t{}\nTOTAL:\t\t{}'.format(number_female_stroke, number_male_stroke,total_strokes))
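Raw stroke counts are hard to compare when the groups differ in size, so a stroke rate per gender may be more informative. The following is a sketch on made-up rows; the column names `strokes`, `patients` and `stroke_rate_pct` are hypothetical:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "gender": ["Female", "Female", "Female", "Female", "Male", "Male", "Male", "Male"],
    "stroke": [1, 0, 0, 0, 1, 1, 0, 0],
})

# Sum of the 0/1 flag = number of strokes; count = number of patients in the group
by_gender = toy.groupby("gender")["stroke"].agg(strokes="sum", patients="count")
by_gender["stroke_rate_pct"] = (by_gender["strokes"] / by_gender["patients"] * 100).round(2)
print(by_gender)
```

The same `groupby` on `df_stroke` should give the share of patients with a stroke within each gender.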

In [None]:
fig, ax= plt.subplots(1, 2, figsize=(20,7))

sns.histplot(x='age', data=female_stroke, ax=ax[0], bins=12, kde=True)
ax[0].set_title('Female Patients suffering a Stroke', size=17, pad=17)
ax[0].set_xlabel('Age', size=13, labelpad=11)
ax[0].set_ylabel('Count', size=13)


sns.histplot(x='age', data=male_stroke, ax=ax[1], bins=6, kde=True)
ax[1].set_title('Male Patients suffering a Stroke', size=17, pad=17)
ax[1].set_xlabel('Age', size=13, labelpad=11)
ax[1].set_ylabel('Count', size=13)

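# Apply the same limits to both panels (plt.ylim / plt.xlim only affect the last one)
for a in ax:
    a.set_ylim(0, 67)
    a.set_xlim(0, 85)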
plt.show()

In [None]:
fig, ax= plt.subplots(1, 2, figsize=(20,7))

sns.countplot(x='heart_disease', hue='gender', data=df_stroke, ax=ax[0])
ax[0].set_title('Heart Diseases compared to Gender', size=15, pad=13)
ax[0].set_xlabel('Heart Disease')
ax[0].set_ylabel('Count')
ax[0].legend(loc='upper right',title='Gender:')


sns.boxplot(x='heart_disease',y='age', hue='gender', data=df_stroke, ax=ax[1])
ax[1].set_title('Heart Diseases and Age compared to Gender', size=15, pad=13)
ax[1].set_xlabel('Heart Diseases')
ax[1].set_ylabel('Age')
ax[1].legend(loc='lower right',title='Gender')

plt.show()

In [None]:
fig, ax= plt.subplots(1, 2, figsize=(20,7))

sns.countplot(x='gender', hue='stroke', data=df_stroke, ax=ax[0])
ax[0].set_title('Strokes compared to Gender', size=15, pad=13)
ax[0].set_xlabel('Gender')
ax[0].set_ylabel('Count')
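# Reuse the handles seaborn created so that colours and labels stay matched
handles, _ = ax[0].get_legend_handles_labels()
ax[0].legend(handles, ['No Stroke','Stroke'], loc='upper right')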


sns.boxplot(x='stroke',y='age', hue='gender', data=df_stroke, ax=ax[1])
ax[1].set_title('Strokes and Age compared to Gender', size=15, pad=13)
ax[1].set_xlabel('Strokes')
ax[1].set_ylabel('Age')
ax[1].legend(loc='lower right',title='Gender')

plt.show()

To be continued...