### **Context**

According to the World Health Organization (WHO), stroke is the second leading cause of death globally, responsible for approximately 11% of total deaths.
This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about one patient.

### **Attribute Information**

1. id: unique identifier
2. gender: "Male", "Female" or "Other"
3. age: age of the patient
4. hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
5. heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
6. ever_married: "No" or "Yes"
7. work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
8. Residence_type: "Rural" or "Urban"
9. avg_glucose_level: average glucose level in blood
10. bmi: body mass index
11. smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*
12. stroke: 1 if the patient had a stroke or 0 if not

*Note: "Unknown" in smoking_status means that the information is unavailable for this patient.

### **Acknowledgements**

(Confidential Source) - Use only for educational purposes
If you use this dataset in your research, please credit the author.

[Link for the dataset](https://www.kaggle.com/fedesoriano/stroke-prediction-dataset?select=healthcare-dataset-stroke-data.csv)

In [None]:
import numpy as np 
import pandas as pd 
from matplotlib import pyplot as plt
import seaborn as sns
# The 'seaborn' style name was deprecated in Matplotlib 3.6; apply seaborn's default theme directly
sns.set_theme()
%matplotlib inline

In [None]:
df = pd.read_csv('../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv')
df.shape

In [None]:
df.info()

In [None]:
df.isna().sum()
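
The raw NaN counts are easier to judge as percentages, and the attribute note says that "Unknown" in smoking_status also marks unavailable information, which `isna()` does not count. The following is a minimal sketch of both checks on a small made-up frame; the `toy` data and the names `missing` and `n_unknown` are assumptions for illustration, not values from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the stroke dataset (values are made up for illustration).
toy = pd.DataFrame({
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, np.nan, 32.5, np.nan],
    "smoking_status": ["formerly smoked", "never smoked", "Unknown", "smokes"],
})

# Count and percentage of NaN values per column.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": toy.isna().mean() * 100,
})

# "Unknown" is an ordinary string, so isna() ignores it; tally it separately.
n_unknown = (toy["smoking_status"] == "Unknown").sum()

print(missing)
print("Rows with Unknown smoking_status:", n_unknown)
```

On the real data, the same two expressions can be run on `df` in place of `toy`.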

In [None]:
df.head()

In [None]:
df.drop('id', inplace=True, axis=1)

In [None]:
df.head()

In [None]:
df['stroke'].value_counts().plot(kind='bar')
plt.show()

# **Data is imbalanced**
**Let's do EDA First**
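
To put a number on the imbalance seen in the bar chart, the class shares and the majority-to-minority ratio can be computed directly. This is only a sketch on a made-up target column; the toy values and the name `imbalance_ratio` are assumptions, and the real ratio should be read from `df['stroke']`.

```python
import pandas as pd

# Toy target column (made-up values, not the real class counts).
stroke = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 0, 1], name="stroke")

# Absolute counts and relative shares of each class.
counts = stroke.value_counts()
shares = stroke.value_counts(normalize=True)

# How many majority-class rows exist per minority-class row.
imbalance_ratio = counts.max() / counts.min()

print(shares)
print("Majority-to-minority ratio:", imbalance_ratio)
```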

# **smoking status and heart disease**

In [None]:
sns.countplot(data=df[['smoking_status']],x='smoking_status')
plt.show()

In [None]:
sns.countplot(data=df[['heart_disease']],x='heart_disease')
plt.show()

In [None]:
sns.countplot(data=df[['smoking_status','heart_disease']],x='smoking_status',hue='heart_disease')
plt.show()

# **Hypertension and stroke**

In [None]:
sns.countplot(data=df[['hypertension']],x='hypertension')
plt.show()

In [None]:
sns.countplot(data=df[['hypertension','stroke']],x='hypertension',hue='stroke')
plt.show()
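
Because the classes are imbalanced, raw counts in a count plot can hide how the stroke rate differs between groups. A possible complement is the mean of the 0/1 `stroke` column per group, which equals the share of stroke cases in that group. The sketch below uses made-up rows; `toy` and `stroke_rate` are hypothetical names.

```python
import pandas as pd

# Toy rows (made up): four patients without and four with hypertension.
toy = pd.DataFrame({
    "hypertension": [0, 0, 0, 0, 1, 1, 1, 1],
    "stroke":       [0, 0, 0, 1, 0, 0, 1, 1],
})

# The mean of a 0/1 column is the fraction of 1s, i.e. the stroke rate per group.
stroke_rate = toy.groupby("hypertension")["stroke"].mean()

print(stroke_rate)
```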

In [None]:
sns.histplot(data=df[['age','stroke']],x='age',element='poly',hue='stroke')
plt.show()
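
The histogram shows counts per age, but a stroke rate per age band can be easier to compare. One way to sketch this is with `pd.cut`; the cut points and labels below are arbitrary assumptions chosen for illustration, and the rows are made up.

```python
import pandas as pd

# Toy rows (made up for illustration).
toy = pd.DataFrame({
    "age":    [5, 25, 35, 45, 62, 70, 78, 81],
    "stroke": [0, 0, 0, 0, 0, 1, 0, 1],
})

# Hypothetical age bands; intervals are right-closed: (0, 40], (40, 60], (60, 100].
bins = [0, 40, 60, 100]
labels = ["<=40", "40-60", ">60"]
toy["age_group"] = pd.cut(toy["age"], bins=bins, labels=labels)

# Share of stroke cases within each age band.
rate_by_age = toy.groupby("age_group", observed=False)["stroke"].mean()

print(rate_by_age)
```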

In [None]:
sns.countplot(data=df[['gender']],x='gender')
plt.yscale('log')
plt.show()

In [None]:
sns.countplot(data=df[['gender','stroke']],x='stroke', hue='gender')
plt.yscale('symlog')
plt.show()

In [None]:
df[['age','avg_glucose_level','bmi']].describe()

# **Age distribution by gender**

In [None]:
males = df[['age','gender','stroke']][df['gender'] == 'Male']
males.shape

In [None]:
females = df[['age','gender','stroke']][df['gender'] == 'Female']
females.shape

In [None]:
others = df[['age','gender']][df['gender'] == 'Other']
others.shape

In [None]:
others

### **Dropping the "Other" gender row**

In [None]:
# Drop by condition rather than by the hard-coded row label 3116
df.drop(df[df['gender'] == 'Other'].index, inplace=True)

In [None]:
plt.title('Age and Stroke in males')
sns.histplot(males,x='age',hue='stroke', element='poly')
plt.show()

In [None]:
plt.title('Age and Stroke in females')
sns.histplot(females,x='age',hue='stroke', element='poly')
plt.show()

# **Average glucose level and BMI**

In [None]:
sns.histplot(data=df[['avg_glucose_level','stroke']],x='avg_glucose_level', hue='stroke',element='poly')
plt.show()
sns.histplot(data=df[['bmi','stroke']],x='bmi', hue='stroke',element='poly')
plt.show()

In [None]:
df['stroke'].value_counts()

# **Filling BMI NaN values based on gender**

In [None]:
df['bmi'][df['gender'] == 'Female'].isna().sum()

In [None]:
df['bmi'][df['gender'] == 'Male'].isna().sum()

In [None]:
male_bmi = df['bmi'][df['gender'] == 'Male']
female_bmi = df['bmi'][df['gender'] == 'Female']
male_bmi_mean = male_bmi.mean()
female_bmi_mean = female_bmi.mean()

In [None]:
male_bmi.isna().sum(),female_bmi.isna().sum()

In [None]:
# Reassign instead of filling in place on a Series sliced out of df
male_bmi = male_bmi.fillna(male_bmi_mean)
female_bmi = female_bmi.fillna(female_bmi_mean)

In [None]:
new_bmi = pd.concat([male_bmi,female_bmi])

In [None]:
df['bmi'] = new_bmi

In [None]:
df.isna().sum()
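
The split, fill, and concatenate steps above can also be written in one pass with `groupby(...).transform('mean')`, which returns each row's group mean aligned to the original index. This is an alternative sketch, not the method used in the cells above; the `toy` frame and its values are made up.

```python
import numpy as np
import pandas as pd

# Toy rows (made up): one missing BMI per gender.
toy = pd.DataFrame({
    "gender": ["Male", "Male", "Male", "Female", "Female", "Female"],
    "bmi":    [30.0, 20.0, np.nan, 22.0, np.nan, 26.0],
})

# Per-row mean BMI of that row's gender; NaNs are skipped when averaging.
group_mean = toy.groupby("gender")["bmi"].transform("mean")

# Fill only the missing entries with the matching gender mean.
toy["bmi"] = toy["bmi"].fillna(group_mean)

print(toy)
```

Because the result stays aligned to the original index, no concatenation step is needed and rows from any additional gender category would be handled the same way.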

In [None]:
sns.histplot(data=df[['bmi','stroke']],x='bmi', hue='stroke',element='poly')
plt.show()

# **Work Type and Residence type**

In [None]:
sns.countplot(data=df[['work_type']],x='work_type')
plt.show()

In [None]:
sns.countplot(data=df[['Residence_type']],x='Residence_type')
plt.show()

In [None]:
sns.countplot(data=df[['work_type','Residence_type']],x='work_type',hue='Residence_type')
plt.show()

In [None]:
sns.countplot(data=df[['work_type','Residence_type']],x='Residence_type',hue='work_type')
plt.show()

In [None]:
sns.countplot(data=df[['work_type','stroke']],x='work_type',hue='stroke')
plt.yscale('log')
plt.show()

In [None]:
sns.countplot(data=df[['work_type','heart_disease']],x='work_type',hue='heart_disease')
plt.yscale('log')
plt.show()
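
On a log scale it is hard to read off what fraction of each work type had a stroke. A row-normalized cross-tabulation gives those fractions as a table. The sketch below assumes made-up rows; `toy` and `share` are hypothetical names.

```python
import pandas as pd

# Toy rows (made up for illustration).
toy = pd.DataFrame({
    "work_type": ["Private", "Private", "Private", "Private",
                  "Self-employed", "Self-employed", "children", "children"],
    "stroke":    [0, 0, 0, 1, 0, 1, 0, 0],
})

# normalize="index" makes each row sum to 1, so column 1 is the stroke share per work type.
share = pd.crosstab(toy["work_type"], toy["stroke"], normalize="index")

print(share)
```

The same call with `toy["heart_disease"]` as the second argument would give the heart disease share per work type, assuming that column is present.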

In [None]:
sns.countplot(data=df[['ever_married']],x='ever_married')
plt.show()

In [None]:
# pairplot creates its own figure, so size it with height (inches per facet) instead of plt.figure
sns.pairplot(data=df[['bmi','avg_glucose_level','age','stroke']],hue='stroke',kind='hist',height=3.3)
plt.show()