**Insights:**
1. Stroke has its highest correlation with age.
2. People with heart disease are more likely to have a stroke.
3. People with hypertension are more likely to have a stroke.
4. Most stroke cases occur at an average glucose level below 150.
5. People above the age of 50 are much more likely to suffer a stroke.

# **Importing Libraries**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
import seaborn as sns
import plotly.express as px
import plotly.figure_factory as ff
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
import warnings
warnings.filterwarnings("ignore")

# **Reading Data**

In [None]:
data=pd.read_csv('../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv')
data

# **Dataset Description :**

1) id: unique identifier

2) gender: "Male", "Female" or "Other"

3) age: age of the patient

4) hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension

5) heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease

6) ever_married: "No" or "Yes"

7) work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"

8) Residence_type: "Rural" or "Urban"

9) avg_glucose_level: average glucose level in blood

10) bmi: body mass index

11) smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"

12) stroke: 1 if the patient had a stroke or 0 if not

# **Data Analysis And Visualization**

In [None]:
stroke=Counter(data['stroke'])
classes=[]
count=[]   # list to store the number of labels in each class
for i in stroke.keys():
    classes.append(i)
    count.append(stroke[i])
colors = ["#E13F29", "#D69A80"]

plt.pie(
    count,
    labels = classes,
    shadow = True,
    colors = colors,
    startangle=80,
    autopct='%1.1f%%'
)
plt.axis('equal')
plt.tight_layout()
plt.title("Percentage of strokes present in the dataset", fontsize=15)
plt.show()

Only 4.9% of the people in the dataset had a stroke, so the data is imbalanced and needs to be balanced before it is passed to a machine learning algorithm.
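
The imbalance can also be expressed as numbers rather than read off the pie chart. The following is a minimal sketch on made-up labels (the `labels` series is a toy stand-in, not the real `data['stroke']` column) that computes the share of each class and the majority-to-minority ratio.

```python
import pandas as pd

# Toy labels, made up for illustration: 19 negatives and 1 positive.
labels = pd.Series([0] * 19 + [1], name="stroke")

# Fraction of rows that fall into each class.
share = labels.value_counts(normalize=True)

# How many majority-class rows exist per minority-class row.
imbalance_ratio = share.max() / share.min()

print(share)
print(f"majority/minority ratio: {imbalance_ratio:.1f}")
```

Running the same two lines on `data['stroke']` should give the class shares that the pie chart displays.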

# **1. Distribution of each attribute and its distribution w.r.t. stroke**

In [None]:
plt.figure(figsize=(10,5))
plt.xlim(-10,125)
plt.xlabel('Age')
plt.ylabel('Density')
sns.kdeplot(data['age'], fill=True)  # `fill` replaces the deprecated `shade` argument
plt.title('Distribution of age', fontsize=15)
plt.show()

In [None]:
stroke = data[data['stroke']==1]['age'].fillna(0.0).astype(float)
stroke_no = data[data['stroke']==0]['age'].fillna(0.0).astype(float)
fig = ff.create_distplot([stroke, stroke_no], ['Stroke','No Stroke'], bin_size=0.65, curve_type='normal'
                        ,colors =  ['#221F1F','#E50914'])
fig.update_layout(
    title="Stroke distribution over age",
    xaxis_title="Age",
)
fig.show()

People older than 50 are far more likely to suffer a stroke than people younger than 50.
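
The age effect can be checked numerically by grouping people into age bands and taking the mean of the stroke flag in each band. This is a rough sketch on a made-up frame (`toy` and its values are illustrative only); the cut-off of 50 follows the observation above.

```python
import pandas as pd

# Made-up rows for illustration; the real notebook would use `data` instead.
toy = pd.DataFrame({
    "age":    [12, 25, 34, 48, 50, 55, 63, 71, 78, 82],
    "stroke": [0, 0, 0, 0, 0, 0, 1, 0, 1, 1],
})

# Two bands: (0, 50] and (50, 120]. The upper edge of each bin is inclusive.
toy["age_band"] = pd.cut(toy["age"], bins=[0, 50, 120], labels=["<=50", ">50"])

# The mean of a 0/1 column is the fraction of ones, i.e. the stroke rate.
rate_by_band = toy.groupby("age_band", observed=True)["stroke"].mean()
print(rate_by_band)
```

Finer bins (for example one per decade) would show where the rate starts to climb, at the cost of fewer stroke cases per bin.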

In [None]:
data['gender'].value_counts().plot(kind='bar')
print(data['gender'].value_counts())
plt.title('Distribution of gender', fontsize=15)

In [None]:
sns.violinplot(x = 'gender', y = 'stroke', data = data)
plt.title('Stroke distribution over gender', fontsize=15)
print("Fraction of each gender that suffered a stroke")
print(data.groupby('gender').stroke.apply(lambda x: (x == 1).mean()))

5.1% of males suffered a stroke versus 4.7% of females, so males in this dataset are slightly more likely to suffer a stroke than females.

In [None]:
data['hypertension'].value_counts().plot(kind='bar')
print(data['hypertension'].value_counts())
plt.title('Distribution of hypertension', fontsize=15)

In [None]:
sns.catplot(x="hypertension", hue="stroke",palette="pastel", kind="count",data=data)
plt.title('Stroke distribution over hypertension', fontsize=15)
print("Fraction of people with and without hypertension who suffered a stroke")
print(data.groupby('hypertension').stroke.apply(lambda x: (x == 1).mean()))

13.2% of people with hypertension also suffered a stroke, compared with 3.9% of people without hypertension. A person with hypertension is therefore more likely to suffer a stroke.

In [None]:
fig = plt.figure(figsize=(7,7))
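# histplot replaces the deprecated distplot; the label lets plt.legend() show an entry
sns.histplot(data['avg_glucose_level'], color="red", kde=True, stat="density", label="avg_glucose_level")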
plt.title('Distribution of Avg. Glucose Level', fontsize=15)
plt.legend()

In [None]:
stroke = data[data['stroke']==1]['avg_glucose_level'].fillna(0.0).astype(float)
stroke_no = data[data['stroke']==0]['avg_glucose_level'].fillna(0.0).astype(float)
fig = ff.create_distplot([stroke, stroke_no], ['Stroke','No Stroke'], bin_size=0.65, curve_type='normal'
                        ,colors =  ['#221F1F','green'])
fig.update_layout(
    title="Stroke distribution over avg glucose level",
    xaxis_title="Avg glucose level",
)
fig.show()

Most stroke cases occur in people with an average glucose level below 150.

In [None]:
plt.figure(figsize=(10,5))
plt.xlim(-10,125)
plt.xlabel('bmi')
plt.ylabel('Density')
sns.kdeplot(data['bmi'], fill=True)  # `fill` replaces the deprecated `shade` argument
plt.title('Distribution of bmi', fontsize=15)
plt.show()

In [None]:
plt.figure(figsize=(12,10))

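# histplot replaces the deprecated distplot; missing bmi values are ignored when plotting
sns.histplot(data[data['stroke'] == 0]["bmi"], color='green', kde=True, stat="density", label='No Stroke')
sns.histplot(data[data['stroke'] == 1]["bmi"], color='red', kde=True, stat="density", label='Stroke')

plt.title('Stroke distribution over BMI', fontsize=15)
plt.legend()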
plt.xlim([-10,100])
plt.show()

People with a higher BMI appear to suffer from stroke more often.

In [None]:
sns.catplot(x="heart_disease", y="age", order=[0, 1], data=data)
plt.title("Distribution of heart disease w.r.t age")
sns.catplot(x="stroke", y="age", order=[0, 1], data=data)
plt.title("Distribution of stroke w.r.t age")

People above the age of 50 are much more likely to have heart disease as well as a stroke.

In [None]:
sns.catplot(x="heart_disease", y="age", hue="stroke",palette="pastel", kind="bar", data=data)
print(data.groupby('heart_disease').stroke.apply(lambda x: (x == 1).mean()))

17% of people with heart disease also had a stroke, whereas 4% of people without heart disease had a stroke.
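
To compare how strongly each factor is associated with stroke, the group rates quoted in this section can be turned into simple rate ratios. The sketch below only redoes arithmetic on the rounded percentages from the text; `rate_ratio` is a hypothetical helper, and the ratios are unadjusted, so they ignore confounders such as age.

```python
def rate_ratio(rate_with, rate_without):
    """Stroke rate in the group with the factor divided by the rate in the group without it."""
    return rate_with / rate_without

# Rounded rates quoted in the text of this section.
hypertension_ratio = rate_ratio(0.132, 0.039)   # hypertension vs. no hypertension
heart_disease_ratio = rate_ratio(0.17, 0.04)    # heart disease vs. no heart disease
gender_ratio = rate_ratio(0.051, 0.047)         # male vs. female

print(f"hypertension:  {hypertension_ratio:.2f}x")
print(f"heart disease: {heart_disease_ratio:.2f}x")
print(f"male/female:   {gender_ratio:.2f}x")
```

On these figures the stroke rate is roughly 3.4 times higher with hypertension and about 4.3 times higher with heart disease, while the gap between genders is under 10%.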

# **2. Distribution of each categorical attribute w.r.t. continuous attributes, with hue = stroke**

In [None]:
sns.catplot(x="heart_disease", y="bmi",hue='stroke' ,kind="box", data=data)
sns.catplot(x="heart_disease", y="age",hue='stroke' ,kind="box", data=data)
sns.catplot(x="heart_disease", y="avg_glucose_level",hue='stroke' ,kind="box", data=data)

In [None]:
plt.figure(figsize=(10,12))
plt.subplot(3,1,1)
sns.violinplot(x = 'hypertension', y="bmi",hue='stroke', data = data)
plt.subplot(3,1,2)
sns.violinplot(x = 'hypertension', y = 'age',hue='stroke', data = data)
plt.subplot(3,1,3)
sns.violinplot(x = 'hypertension', y = 'avg_glucose_level',hue='stroke', data = data)
plt.show()

In [None]:
sns.catplot(x="smoking_status", y="bmi",hue='stroke' ,aspect=1.7,kind="swarm" ,data=data)
sns.catplot(x="smoking_status", y="age",hue='stroke' ,aspect=1.7,kind="swarm" ,data=data)
sns.catplot(x="smoking_status", y="avg_glucose_level",hue='stroke' ,aspect=1.7,kind="swarm" ,data=data)

In [None]:
sns.catplot(x="ever_married", y="bmi", hue='stroke',jitter=False, data=data)
sns.catplot(x="ever_married", y="age", hue='stroke',jitter=False, data=data)
sns.catplot(x="ever_married", y="avg_glucose_level", hue='stroke',jitter=False, data=data)

In [None]:
sns.catplot(x="Residence_type", y="bmi", hue='stroke',data=data)
sns.catplot(x="Residence_type", y="age", hue='stroke', data=data)
sns.catplot(x="Residence_type", y="avg_glucose_level", hue='stroke', data=data)

In [None]:
sns.catplot(x="work_type", y="bmi", hue='stroke', kind="boxen",data=data)
sns.catplot(x="work_type", y="age", hue='stroke', kind="boxen",data=data)
sns.catplot(x="work_type", y="avg_glucose_level", hue='stroke', kind="boxen",data=data)

# **3. Distribution of each continuous attribute w.r.t. other continuous attributes, with hue = stroke**

In [None]:
plt.figure(figsize=(10,12))
plt.subplot(3,1,1)
sns.scatterplot(data=data, x="age", y="bmi", hue='stroke')
plt.subplot(3,1,2)
sns.scatterplot(data=data, x="age", y="avg_glucose_level", hue='stroke')
plt.subplot(3,1,3)
sns.scatterplot(data=data, x="bmi", y="avg_glucose_level", hue='stroke')

# **4. Correlation among the attributes**

In [None]:
plt.figure(figsize=(15,8))
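# Restrict to numeric columns: corr() raises on text columns in recent pandas versions
sns.heatmap(data.select_dtypes(include='number').corr(), cmap="Blues");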

In [None]:
data.select_dtypes(include='number').corr()

From the above heatmap and table, we can see that age is the attribute most correlated with stroke, followed by heart disease and average glucose level.
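
Reading a ranking off a full correlation matrix is error-prone, so it can help to pull out only the target column and sort it. The following is a small sketch on a made-up frame (`toy` and its `noise` column are illustrative assumptions); the same two lines should work on the numeric columns of `data`.

```python
import pandas as pd

# Made-up data: `age` rises with stroke, `noise` is unrelated to it.
toy = pd.DataFrame({
    "age":    [20, 30, 40, 50, 70, 80],
    "noise":  [1, 2, 1, 2, 1, 2],
    "stroke": [0, 0, 0, 0, 1, 1],
})

# Pearson correlation of every feature with the target, target itself removed.
corr_with_target = toy.corr()["stroke"].drop("stroke")

# Sort by absolute value so strong negative correlations are not buried.
ranked = corr_with_target.abs().sort_values(ascending=False)
print(ranked)
```

Note that Pearson correlation only captures linear association, so a low value does not rule out a non-linear relationship with stroke.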

# **Data Preprocessing**

In [None]:
data.head(5)

Since the data contains categorical features, I will convert them to numerical values by encoding them.
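
One way to restrict the encoding to text columns, so that numeric columns such as age and bmi keep their values and missing bmi entries stay missing, is sketched below. It uses pandas category codes as a stand-in for `LabelEncoder` (both assign integer codes to the sorted unique values); the `toy` frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows with one text column, one numeric column and one missing value.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "age":    [67.0, 61.0, 80.0, 49.0],
    "bmi":    [36.6, np.nan, 32.5, 34.4],
})

# Select every column that is not numeric, i.e. the text columns.
text_cols = toy.select_dtypes(exclude="number").columns

encoded = toy.copy()
for col in text_cols:
    # Category codes follow the sorted order of the unique values.
    encoded[col] = encoded[col].astype("category").cat.codes

print(encoded)
```

Integer codes imply an ordering between categories, which is harmless for tree-based models but can mislead linear ones; one-hot encoding is a common alternative for columns such as work_type.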

In [None]:
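encode = LabelEncoder()
# Encode only the non-numeric columns; encoding every column would also relabel age, bmi and NaN
cat_cols = data.select_dtypes(exclude='number').columns
data[cat_cols] = data[cat_cols].apply(encode.fit_transform)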
data.head()

In [None]:
data.info()

Since the bmi feature contains NaN values, I will replace them with the mean of the bmi feature.

In [None]:
data['bmi'] = data['bmi'].fillna(data['bmi'].mean())  # assign back instead of chained inplace
data.info()

Let us drop the id column and store the stroke labels in a separate variable.

In [None]:
data = data.drop('id', axis=1)
y=data['stroke']
data = data.drop('stroke', axis=1)
data.head()

Since the classes are imbalanced, I will first split the data into train and test sets and then balance the classes of the training set by oversampling the minority class with SMOTE.
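
To make the idea of balancing concrete, here is a sketch of plain random oversampling on a made-up training frame. This is a simpler technique than SMOTE and not what the notebook uses: it duplicates existing minority rows, whereas SMOTE synthesizes new ones by interpolating between neighbours. The `train` frame and its values are illustrative assumptions.

```python
import pandas as pd

# Made-up training split: 8 negatives and 2 positives.
train = pd.DataFrame({
    "age":    [25, 31, 38, 42, 47, 53, 58, 61, 72, 79],
    "stroke": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

majority = train[train["stroke"] == 0]
minority = train[train["stroke"] == 1]

# Draw minority rows with replacement until both classes have the same size.
extra = minority.sample(n=len(majority) - len(minority), replace=True, random_state=42)
balanced = pd.concat([train, extra], ignore_index=True)

counts = balanced["stroke"].value_counts()
print(counts)
```

Whichever method is used, only the training split is resampled; the test split keeps its original class ratio so that evaluation reflects real-world proportions.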

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data, y, test_size=0.2, random_state=42)
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_train, y_train)
plt.figure(figsize=(12,6))
plt.subplot(1,2,1)
y_train.value_counts().plot(kind='bar')
plt.title("Original distribution of training data")
plt.subplot(1,2,2)
y_res.value_counts().plot(kind='bar')
plt.title("Distribution of training data after balancing the classes")

Now our data is ready to be passed to any machine learning algorithm.