According to the World Health Organization (WHO), stroke is the second leading cause of death globally, responsible for approximately 11% of total deaths.
This dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row in the data provides relevant information about one patient.

### Attribute Information
1) id: unique identifier

2) gender: "Male", "Female" or "Other"

3) age: age of the patient

4) hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension

5) heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease

6) ever_married: "No" or "Yes"

7) work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"

8) Residence_type: "Rural" or "Urban"

9) avg_glucose_level: average glucose level in blood

10) bmi: body mass index

11) smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*

12) stroke: 1 if the patient had a stroke or 0 if not

*Note: "Unknown" in smoking_status means that the information is unavailable for this patient

Task: Predict Stroke
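
The attribute list above can double as a quick sanity check on the categorical columns. The sketch below is only an illustration: the `ALLOWED` dictionary, the `unexpected_values` helper, and the toy rows are hypothetical names and data introduced here, not part of the dataset or the original notebook.

```python
import pandas as pd

# Allowed values as documented in the attribute list (hypothetical helper structure)
ALLOWED = {
    "gender": {"Male", "Female", "Other"},
    "ever_married": {"No", "Yes"},
    "work_type": {"children", "Govt_job", "Never_worked", "Private", "Self-employed"},
    "Residence_type": {"Rural", "Urban"},
    "smoking_status": {"formerly smoked", "never smoked", "smokes", "Unknown"},
}

def unexpected_values(df):
    """Return {column: set of values that are not in the documented list}."""
    problems = {}
    for col, allowed in ALLOWED.items():
        extra = set(df[col].dropna().unique()) - allowed
        if extra:
            problems[col] = extra
    return problems

# Made-up rows; the second one has a deliberately wrong work_type
toy = pd.DataFrame({
    "gender": ["Male", "Female"],
    "ever_married": ["Yes", "No"],
    "work_type": ["Private", "private"],
    "Residence_type": ["Urban", "Rural"],
    "smoking_status": ["smokes", "Unknown"],
})

problems = unexpected_values(toy)
print(problems)
```

On the real data, the same helper could be called on the loaded DataFrame; an empty dictionary would mean every categorical value matches the documentation.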

## Import the required Libraries


In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

## Importing the dataset

In [1]:
Strokedata = pd.read_csv('../input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv')

In [1]:
Strokedata = Strokedata.drop('id', axis=1)

Let's see the first five rows of the dataset.


In [1]:
Strokedata.head()

Let's see the last five rows of the data.

In [1]:
Strokedata.tail()

Let's see the size of the data.

In [1]:
Strokedata.shape

### The data contains 5110 rows and 11 columns (12 in the raw file, before the id column was dropped)


Let's check the data type of each feature.

In [1]:
Strokedata.info()

## EDA: Exploratory Data Analysis

In [1]:
Strokedata.describe()

## Some Analysis
* bmi has some missing values
* There is a fairly large difference between the mean of avg_glucose_level and its median (the 50% row), which suggests a skewed distribution
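
Both observations can be put into numbers. The sketch below uses a small made-up sample (not the real stroke data) to show how one might count missing values and compare mean, median, and skewness; the variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy sample, not the real stroke data
toy = pd.DataFrame({
    "avg_glucose_level": [70.0, 80.0, 85.0, 90.0, 95.0, 230.0],
    "bmi": [22.5, np.nan, 28.1, 31.0, np.nan, 26.4],
})

# Number of missing values per column
missing_counts = toy.isnull().sum()

# A mean well above the median points to a right-skewed distribution
glucose_mean = toy["avg_glucose_level"].mean()
glucose_median = toy["avg_glucose_level"].median()
glucose_skew = toy["avg_glucose_level"].skew()

print(missing_counts)
print(f"mean={glucose_mean:.2f}, median={glucose_median:.2f}, skew={glucose_skew:.2f}")
```

A single large value (230 in the toy sample) is enough to pull the mean away from the median, which is the pattern `describe()` hints at.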

In [1]:
Strokedata.head()

### EDA for Numerical data

In [1]:
numerical_data = Strokedata.select_dtypes("number")
numerical_data.columns

In [1]:
# Cast age to integer; any fractional ages are truncated toward zero
Strokedata['age'] = Strokedata['age'].astype(int)
Strokedata['age'].head()

In [1]:
# Let's plot the age distribution, split by stroke outcome
plt.figure(figsize=(15, 10))
sns.histplot(x='age', hue='stroke', data=Strokedata)
plt.title("Age Distribution w.r.t. Stroke")
plt.show()

In [1]:
# Now plot avg_glucose_level, split by stroke outcome
plt.figure(figsize=(15, 10))
sns.histplot(x='avg_glucose_level', hue='stroke', data=Strokedata, kde=True)
plt.title("Average Glucose Level w.r.t. Stroke")
plt.show()

In [1]:
# Same plot for bmi
plt.figure(figsize=(15, 10))
sns.histplot(x='bmi', hue='stroke', data=Strokedata, kde=True)
plt.title("BMI w.r.t. Stroke")
plt.show()

In [1]:
sns.pairplot(Strokedata)
plt.show()

### EDA for Categorical feature

In [1]:
categorical_data = Strokedata.select_dtypes(exclude = 'number')
categorical_data.columns

In [1]:
for feature in categorical_data:
    print(Strokedata[feature].unique())

In [1]:
for feature in categorical_data:
  plt.figure(figsize = (15,10))
  sns.countplot(x = Strokedata[feature])
  plt.show()
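
The count plots show how many patients fall into each category, but not how the categories relate to the target. A possible follow-up, sketched here on made-up rows (the values below are invented and say nothing about the real data), is to compute the share of stroke cases per category with `groupby`:

```python
import pandas as pd

# Hypothetical toy rows, invented for illustration
toy = pd.DataFrame({
    "smoking_status": ["smokes", "smokes", "never smoked",
                       "never smoked", "never smoked", "Unknown"],
    "stroke": [1, 0, 0, 0, 1, 0],
})

# Because stroke is coded 0/1, the group mean is the fraction of stroke cases
stroke_rate = toy.groupby("smoking_status")["stroke"].mean()
print(stroke_rate)
```

On the notebook's DataFrame, the same pattern could be applied to each column in `categorical_data`.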

## Feature Engineering

In [1]:
data = Strokedata.copy()

In [1]:
data.head()

In [1]:
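# Treat "Unknown" as missing; assign back instead of using inplace on a column selection
data['smoking_status'] = data['smoking_status'].replace('Unknown', np.nan)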

In [1]:
data.isnull().sum()

In [1]:
# percentage of data missing
data.isnull().mean().round(4) * 100

In [1]:
# Fill bmi with its mean and smoking_status with its most frequent value
data['bmi'] = data['bmi'].fillna(data['bmi'].mean())
data['smoking_status'] = data['smoking_status'].fillna(data['smoking_status'].mode()[0])
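
One caveat: filling before the train/test split means the mean and mode are computed over all rows, including those later held out for testing. A minimal sketch of an alternative, assuming scikit-learn's `SimpleImputer` and two hypothetical toy frames standing in for a split, learns the fill value from the training rows only:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frames standing in for a train/test split
train = pd.DataFrame({"bmi": [20.0, 30.0, np.nan, 25.0]})
test = pd.DataFrame({"bmi": [np.nan, 40.0]})

# Learn the mean on the training rows only, then reuse it for the test rows
imputer = SimpleImputer(strategy="mean")
train_filled = imputer.fit_transform(train[["bmi"]])
test_filled = imputer.transform(test[["bmi"]])

print(train_filled.ravel())
print(test_filled.ravel())
```

For a categorical column such as smoking_status, `strategy="most_frequent"` plays the role of the mode. Whether the difference matters for this dataset is not something the notebook tests.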

In [1]:
data.isnull().mean().round(4) * 100

Count plot of smoking_status after filling the missing values

In [1]:
sns.countplot(x = data['smoking_status'])

In [1]:
plt.figure(figsize=(10,7))
sns.histplot(x ='bmi',hue = 'stroke',data = data,kde =True)
plt.show()

In [1]:
from sklearn.preprocessing import LabelEncoder


In [1]:
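# Label-encode only the categorical columns; numeric columns keep their original values
en_data = data.copy()
for col in en_data.select_dtypes(exclude='number').columns:
    en_data[col] = LabelEncoder().fit_transform(en_data[col])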


In [1]:
en_data.head()
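
To read the encoded table, it helps to know which integer stands for which label. `LabelEncoder` assigns codes in sorted order of the labels and exposes them through `classes_`. A small illustration on a made-up list of values:

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample values for a single categorical column
values = ["Male", "Female", "Other", "Female"]

encoder = LabelEncoder()
codes = encoder.fit_transform(values)

# classes_ holds the labels in sorted order; the position is the code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}

print(codes)    # [1 0 2 0]
print(mapping)  # {'Female': 0, 'Male': 1, 'Other': 2}
```

The integer order is alphabetical, not meaningful, so models that treat inputs as magnitudes may be better served by one-hot encoding (for example `pd.get_dummies`).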

## Feature Selection

In [1]:
y = en_data['stroke']
X = en_data.drop('stroke',axis =1)

### Train-Test Split

In [1]:
from sklearn.model_selection import train_test_split

In [1]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)
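
The `stratify=y` argument keeps the share of each class roughly the same in both parts of the split, which matters when one class is rare. A quick check on hypothetical toy labels (10% positives, invented for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 30 positives out of 300
y_toy = np.array([1] * 30 + [0] * 270)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42, stratify=y_toy
)

# With stratification, both fractions should stay close to the overall 10%
train_pos_rate = y_tr.mean()
test_pos_rate = y_te.mean()
print(f"train positives: {train_pos_rate:.3f}, test positives: {test_pos_rate:.3f}")
```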

## Feature Scaling

In [1]:
from sklearn.preprocessing import StandardScaler

In [1]:
scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
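
The scaler is fitted on the training data only and then reused on the test data, so the test rows are standardized with the training mean and standard deviation. A small sketch on made-up arrays shows the effect:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrices: two columns on very different scales
X_train_toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test_toy = np.array([[2.0, 400.0]])

toy_scaler = StandardScaler()
X_train_scaled = toy_scaler.fit_transform(X_train_toy)  # learns mean and std here
X_test_scaled = toy_scaler.transform(X_test_toy)        # reuses the training statistics

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled)
```

Tree-based models such as random forests split on thresholds and are generally insensitive to this kind of rescaling, so the step matters more if a distance-based or gradient-based model is tried later.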

## Training Model using RandomForestClassifier

In [1]:
from sklearn.ensemble import RandomForestClassifier

model_rf = RandomForestClassifier(n_estimators=100,random_state=0)
model_rf.fit(X_train, y_train)
y_pred = model_rf.predict(X_test)

In [1]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print("The Training Score of RandomForestClassifier is: {:.3f}%".format(model_rf.score(X_train, y_train)*100))
print("\n**********************************************************************\n")
print("The Confusion Matrix for RandomForestClassifier is: \n{}\n".format(confusion_matrix(y_test, y_pred)))
print("\n**********************************************************************\n")
print("The Classification report: \n{}\n".format(classification_report(y_test, y_pred)))
print("\n**********************************************************************\n") 
print("The Accuracy Score of RandomForestClassifier is: {:.3f}%".format(accuracy_score(y_test, y_pred)*100))
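
When the positive class is rare, the accuracy score alone can look good even if the model misses every positive case, which is why the confusion matrix and classification report are worth reading alongside it. A toy illustration with invented labels, where a "model" always predicts the majority class:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Invented labels: 95 negatives and 5 positives
y_true = np.array([0] * 95 + [1] * 5)

# A trivial predictor that always outputs the majority class
y_pred_majority = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred_majority)
rec = recall_score(y_true, y_pred_majority)
cm = confusion_matrix(y_true, y_pred_majority)

print(f"accuracy={acc:.2f}, recall={rec:.2f}")  # accuracy=0.95, recall=0.00
print(cm)
```

In the matrix, the bottom-left entry counts positives predicted as negative; for a stroke predictor, that is probably the number to watch most closely.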