In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="darkgrid")
import warnings
warnings.filterwarnings('ignore')

In [None]:
df = pd.read_csv('../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv')

In [None]:
df.head()

In [None]:
df.drop('id', axis=1, inplace=True)

In [None]:
df.tail()

In [None]:
df.describe()

**Observations**
1. BMI contains missing values.
1. The average age is 43.
1. The average bmi is 28.
1. The minimum age is questionable.
1. Average glucose level is 106.
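
The first and fourth observations can be checked directly rather than read off `describe()`. The sketch below uses a small invented frame (`toy`, not the real dataset) and assumes only that the columns are named `age` and `bmi`, as in the cells above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the stroke data; all values are invented for illustration.
toy = pd.DataFrame({
    "age": [0.08, 1.32, 43.0, 67.0, 80.0],
    "bmi": [np.nan, 18.0, 28.1, np.nan, 31.5],
})

# Count missing values per column.
missing_per_column = toy.isnull().sum()

# Rows with an age below one year; fractional ages may simply be infants.
infants = toy[toy["age"] < 1]

print(missing_per_column)
print(infants)
```

Running the same two lines on `df` would show whether the small ages are plausible infant records or data-entry errors.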

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.isnull().sum()

In [None]:
# Heatmap for null values
sns.heatmap(df.isnull(), cmap = 'viridis')

In [None]:
# handling null values with mean
df['bmi'] = df['bmi'].fillna(df['bmi'].mean())

In [None]:
df.info()

In [None]:
df['gender'].value_counts()

In [None]:
# Drop the rows where gender is 'Other'
df = df[df['gender'] != 'Other']

In [None]:
# Number of unique values per column
df.nunique()

In [None]:
# Correlation heatmap
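# Restrict to numeric columns; recent pandas versions raise on string columns
corr = df.corr(numeric_only=True)
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, annot=True)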

**Observations**
1. There is no strong correlation between the features.
1. The highest correlation is between body mass index (`bmi`) and `age`.
1. The weakest correlation is between `heart_disease` and `hypertension`.
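
One way to read the strongest pair off the matrix instead of the heatmap is sketched here. The frame `toy` and its values are invented, so the result says nothing about the real data; only the technique carries over.

```python
import numpy as np
import pandas as pd

# Toy stand-in; values are invented for illustration only.
toy = pd.DataFrame({
    "age": [20, 35, 50, 65, 80],
    "bmi": [21.0, 24.0, 27.0, 29.0, 33.0],
    "avg_glucose_level": [90.0, 130.0, 85.0, 140.0, 100.0],
    "gender": ["Male", "Female", "Female", "Male", "Female"],
})

# Correlate numeric columns only, so the string column is ignored.
corr = toy.select_dtypes(include="number").corr()

# Keep the strict upper triangle so each pair appears once, without the diagonal.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Pair of columns with the largest absolute correlation.
strongest_pair = pairs.abs().idxmax()
print(strongest_pair, round(pairs.loc[strongest_pair], 2))
```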

In [None]:
# pairplot with gender as hue
sns.pairplot(df, hue='gender')

**Questions to ask:**

1. At which ages do people have more strokes?
1. Do males or females have more hypertension?
1. Do males or females have more heart disease?
1. Which job type has more hypertension?
1. Is a person with heart disease more likely to get a stroke?
1. Is marriage associated with stroke?
1. Which job type has more strokes?
1. How does BMI vary with gender?
1. How is glucose level distributed?
1. Are people who smoke more likely to get a stroke?

In [None]:
# 1
sns.histplot(x='age', hue='stroke', data=df, element="poly")

**Observation**

This plot shows that ages in this dataset range from 0 to about 80, and that people around 40 are the most common. Strokes are rare among people aged 0 to 40.
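
Because the histogram shows counts, a per-age-band stroke rate can make the comparison easier to state. This is a minimal sketch on invented data; the band edges in `bins` are an arbitrary choice, not something taken from the dataset.

```python
import pandas as pd

# Toy stand-in; values are invented for illustration only.
toy = pd.DataFrame({
    "age":    [5, 18, 25, 38, 45, 52, 61, 68, 74, 80],
    "stroke": [0,  0,  0,  0,  0,  1,  0,  1,  1,  1],
})

# Hypothetical 20-year bands; include_lowest keeps age 0 in the first band.
bins = [0, 20, 40, 60, 80]
toy["age_band"] = pd.cut(toy["age"], bins=bins, include_lowest=True)

# The mean of a 0/1 column is the share of rows with stroke == 1.
stroke_rate_by_band = toy.groupby("age_band", observed=False)["stroke"].mean()
print(stroke_rate_by_band)
```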

In [None]:
# 2
sns.barplot(x='gender', y='hypertension', data=df.query('age > 20'))

**Observation**

Hypertension is more common in males than in females. It can lead to severe health complications and increase the risk of heart disease, stroke, and sometimes death.

In [None]:
# 3
sns.barplot(x='gender', y='heart_disease', data=df)

**Observation**

Heart disease is the leading cause of death in the United States, causing about 1 in 4 deaths. The term "heart disease" refers to several types of heart conditions. This graph shows that males have a noticeably higher rate of heart disease than females.

In [None]:
# 4
sns.barplot(x='hypertension', y='work_type', data=df.query('age > 20'))

**Observation**

Hypertension is most common among self-employed patients, and in this dataset nobody who never worked has hypertension. Hypertension can lead to severe health complications and increase the risk of heart disease, stroke, and sometimes death.

In [None]:
# 5
sns.barplot(y='stroke', x='heart_disease', data=df)

**Observation**

The stroke rate is higher among people with heart disease, which fits with heart disease increasing the risk of stroke.
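
The bar heights in `sns.barplot` are group means by default, so the same comparison can be written as a `groupby`. The sketch below uses an invented frame; `risk_ratio` is a name introduced here for illustration and is only a crude ratio of two rates, not an adjusted estimate.

```python
import pandas as pd

# Toy stand-in; values are invented for illustration only.
toy = pd.DataFrame({
    "heart_disease": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    "stroke":        [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1],
})

# Share of stroke cases within each heart_disease group.
rates = toy.groupby("heart_disease")["stroke"].mean()

# How many times higher the rate is with heart disease than without.
risk_ratio = rates.loc[1] / rates.loc[0]
print(rates)
print(risk_ratio)
```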

In [None]:
# 6
sns.lineplot(x='age', y='stroke', hue='ever_married', data=df)

**Observation**

People between ages 40 and 80 suffer more strokes than those between 0 and 40. The orange line suggests that people who are still unmarried after their 40s are more likely to suffer a stroke than married people.

In [None]:
# 7
sns.barplot(x='work_type', y='stroke', data=df)

**Observation**

Stroke is most common among self-employed people, male or female, and those who never worked show no strokes in this dataset. After the self-employed, private and government employees both show strokes.
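
A rate of zero in a small group is weak evidence, so it may help to show group sizes next to the rates. This is a sketch on invented data; the `work_type` labels are assumed to look like those in the dataset.

```python
import pandas as pd

# Toy stand-in; values and group sizes are invented for illustration only.
toy = pd.DataFrame({
    "work_type": ["Private"] * 5 + ["Self-employed"] * 3 + ["Never_worked"] * 2,
    "stroke":    [0, 0, 0, 1, 0,   1, 0, 1,   0, 0],
})

# Stroke rate and number of rows per job type, side by side.
summary = toy.groupby("work_type")["stroke"].agg(stroke_rate="mean", n="size")
print(summary)
```

If `n` is very small for a job type, its bar in the plot above should be read with caution.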

In [None]:
# 8
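male = df.query('gender == "Male"')
female = df.query('gender == "Female"')
# distplot is deprecated in seaborn; histplot with kde=True replaces it
sns.histplot(x=male['bmi'], bins=10, kde=True, stat='density', label='Male')
sns.histplot(x=female['bmi'], bins=10, kde=True, stat='density', label='Female')
plt.title('BMI distribution by gender')

plt.legend()
plt.show()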

**Observation**

BMI mostly lies between 15 and 45. The normal BMI range is usually given as roughly 18.5 to 25, so values between 25 and 30 are slightly overweight, values below 18.5 are underweight, and values above 30 are obese.
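
The weight categories can be attached to each row with `pd.cut`. The sketch below assumes the commonly quoted cutoffs of 18.5, 25 and 30 (an assumption, slightly different from the rounded figures in the text) and runs on invented BMI values.

```python
import pandas as pd

# Invented BMI values chosen to sit on and around the cutoffs.
bmi = pd.Series([16.0, 18.5, 22.0, 24.9, 25.0, 29.9, 30.0, 41.0], name="bmi")

# right=False makes each bin include its lower edge: [18.5, 25) is "normal".
category = pd.cut(
    bmi,
    bins=[0, 18.5, 25, 30, float("inf")],
    right=False,
    labels=["underweight", "normal", "overweight", "obese"],
)

print(category.value_counts(sort=False))
```

Applied to `df['bmi']`, the resulting column could be used as `hue` or `x` in the plots above.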

In [None]:
# 9
sns.histplot(x='avg_glucose_level', hue='work_type', data=df, multiple="dodge", kde=True, bins=10)

**Observation**

Job type has no major effect on glucose level. The blue line shows that most people in this dataset work in private jobs. Beyond that, the distributions look similar across job types.

In [None]:
# 10
sns.histplot(x='smoking_status', hue='stroke', data=df, multiple='dodge')

**Observation**

Smoking status shows no clear effect on stroke in this plot.
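
Raw counts can hide differences when the groups have very different sizes, so a row-normalised crosstab is a useful cross-check. This is a sketch on invented data; the category labels are assumed to resemble those in `smoking_status`.

```python
import pandas as pd

# Toy stand-in; values are invented for illustration only.
toy = pd.DataFrame({
    "smoking_status": ["never smoked"] * 6 + ["smokes"] * 3 + ["formerly smoked"] * 3,
    "stroke":         [0, 0, 0, 0, 0, 1,   0, 0, 1,   0, 1, 1],
})

# Raw counts: this is what a dodged histogram shows.
counts = pd.crosstab(toy["smoking_status"], toy["stroke"])

# Row-normalised: share of stroke / no stroke within each smoking group.
rates = pd.crosstab(toy["smoking_status"], toy["stroke"], normalize="index")

print(counts)
print(rates)
```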
