<a href="https://colab.research.google.com/github/shivaniii24/Stroke-prediction/blob/main/Stroke_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ***Stroke prediction***
According to the World Health Organization (WHO), stroke is
the second leading cause of death globally, responsible for approximately 11% of
total deaths. This dataset is used to predict whether a patient is likely to have a
stroke based on input parameters such as gender, age, various diseases, and
smoking status. Each row in the data provides relevant information about one
patient.

### Build a model that can be used to predict stroke.


Import our libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, accuracy_score

Read in our dataset

In [None]:
df= pd.read_csv("/content/healthcare-dataset-stroke-data.csv")

In [None]:
df

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1
...,...,...,...,...,...,...,...,...,...,...,...,...
5105,18234,Female,80.0,1,0,Yes,Private,Urban,83.75,,never smoked,0
5106,44873,Female,81.0,0,0,Yes,Self-employed,Urban,125.20,40.0,never smoked,0
5107,19723,Female,35.0,0,0,Yes,Self-employed,Rural,82.99,30.6,never smoked,0
5108,37544,Male,51.0,0,0,Yes,Private,Rural,166.29,25.6,formerly smoked,0


In [None]:
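# Restrict to numeric columns; recent pandas versions raise an error on string columns
df.select_dtypes(include="number").corr()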

Unnamed: 0,id,age,hypertension,heart_disease,avg_glucose_level,bmi,stroke
id,1.0,0.003538,0.00355,-0.001296,0.001092,0.003084,0.006388
age,0.003538,1.0,0.276398,0.263796,0.238171,0.333398,0.245257
hypertension,0.00355,0.276398,1.0,0.108306,0.174474,0.167811,0.127904
heart_disease,-0.001296,0.263796,0.108306,1.0,0.161857,0.041357,0.134914
avg_glucose_level,0.001092,0.238171,0.174474,0.161857,1.0,0.175502,0.131945
bmi,0.003084,0.333398,0.167811,0.041357,0.175502,1.0,0.042374
stroke,0.006388,0.245257,0.127904,0.134914,0.131945,0.042374,1.0
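To read the correlation table more easily, the `stroke` row can be sorted so the features most linearly related to the target come first. This is a minimal sketch that hard-codes the values from the table above; the names `corr_with_stroke` and `ranked` are illustrative and not part of the notebook.

```python
import pandas as pd

# Correlations with `stroke`, copied from the last row of the table above.
corr_with_stroke = pd.Series({
    "id": 0.006388,
    "age": 0.245257,
    "hypertension": 0.127904,
    "heart_disease": 0.134914,
    "avg_glucose_level": 0.131945,
    "bmi": 0.042374,
})

# Sort from strongest to weakest positive correlation.
ranked = corr_with_stroke.sort_values(ascending=False)
print(ranked)
```

On the real frame, the same ranking could be obtained with `df.select_dtypes(include="number").corr()["stroke"].drop("stroke").sort_values(ascending=False)`. Age comes first, while `id` and `bmi` are close to zero.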


Dropping some categorical columns that we will not use for predicting stroke.

In [None]:
df_stroke= df.drop(columns=["ever_married","work_type","Residence_type","smoking_status"])

Now that our data is somewhat cleaner, let's take a look at it.

In [None]:
df_stroke.describe()

Unnamed: 0,id,age,hypertension,heart_disease,avg_glucose_level,bmi,stroke
count,5110.0,5110.0,5110.0,5110.0,5110.0,4909.0,5110.0
mean,36517.829354,43.226614,0.097456,0.054012,106.147677,28.893237,0.048728
std,21161.721625,22.612647,0.296607,0.226063,45.28356,7.854067,0.21532
min,67.0,0.08,0.0,0.0,55.12,10.3,0.0
25%,17741.25,25.0,0.0,0.0,77.245,23.5,0.0
50%,36932.0,45.0,0.0,0.0,91.885,28.1,0.0
75%,54682.0,61.0,0.0,0.0,114.09,33.1,0.0
max,72940.0,82.0,1.0,1.0,271.74,97.6,1.0


The column "bmi" has 201 null values (only 4909 of the 5110 rows have a value), so we will replace them with the column mean.

In [None]:
mean_value=df_stroke['bmi'].mean()

In [None]:
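# Assign the result back: fillna(inplace=True) on a selected column triggers a
# warning in recent pandas versions and may not modify df_stroke
df_stroke['bmi'] = df_stroke['bmi'].fillna(mean_value)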
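A quick way to confirm that the imputation worked is to count missing values before and after. The sketch below uses a small hypothetical frame named `toy` in place of `df_stroke`; the BMI values are only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_stroke with two missing BMI values.
toy = pd.DataFrame({"bmi": [36.6, np.nan, 32.5, 34.4, np.nan]})

missing_before = int(toy["bmi"].isna().sum())

# Series.mean() skips NaN by default, so this is the mean of the known values.
mean_bmi = toy["bmi"].mean()
toy["bmi"] = toy["bmi"].fillna(mean_bmi)

missing_after = int(toy["bmi"].isna().sum())
print(missing_before, missing_after)
```

Computing the mean on the full dataset is the simplest choice. A stricter variant would compute it on the training rows only, after the train/test split.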

In [None]:
df_stroke

Unnamed: 0,id,gender,age,hypertension,heart_disease,avg_glucose_level,bmi,stroke
0,9046,Male,67.0,0,1,228.69,36.600000,1
1,51676,Female,61.0,0,0,202.21,28.893237,1
2,31112,Male,80.0,0,1,105.92,32.500000,1
3,60182,Female,49.0,0,0,171.23,34.400000,1
4,1665,Female,79.0,1,0,174.12,24.000000,1
...,...,...,...,...,...,...,...,...
5105,18234,Female,80.0,1,0,83.75,28.893237,0
5106,44873,Female,81.0,0,0,125.20,40.000000,0
5107,19723,Female,35.0,0,0,82.99,30.600000,0
5108,37544,Male,51.0,0,0,166.29,25.600000,0


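Splitting the data into x (the input features) and y (the output label). Note that the slice `iloc[:, 2:11]` runs through the last column, so `x` also contains the `stroke` target, as the output below shows. This leaks the label into the features. The slice also starts after `gender` (column 1), so gender is not used.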

In [None]:
x=df_stroke.iloc[:,2:11]
y=df_stroke['stroke'].values

In [None]:
x

Unnamed: 0,age,hypertension,heart_disease,avg_glucose_level,bmi,stroke
0,67.0,0,1,228.69,36.600000,1
1,61.0,0,0,202.21,28.893237,1
2,80.0,0,1,105.92,32.500000,1
3,49.0,0,0,171.23,34.400000,1
4,79.0,1,0,174.12,24.000000,1
...,...,...,...,...,...,...
5105,80.0,1,0,83.75,28.893237,0
5106,81.0,0,0,125.20,40.000000,0
5107,35.0,0,0,82.99,30.600000,0
5108,51.0,0,0,166.29,25.600000,0


In [None]:
y

array([1, 1, 1, ..., 0, 0, 0])
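Selecting the feature columns by name keeps the target out of `x`. The following is a minimal sketch on a hypothetical three-row frame `toy` with the same columns as `df_stroke`; `feature_cols`, `x_toy`, and `y_toy` are illustrative names, and this is a suggested variation, not the split the notebook used.

```python
import pandas as pd

# Three rows shaped like df_stroke (values taken from the rows displayed earlier).
toy = pd.DataFrame({
    "id": [9046, 51676, 18234],
    "gender": ["Male", "Female", "Female"],
    "age": [67.0, 61.0, 80.0],
    "hypertension": [0, 0, 1],
    "heart_disease": [1, 0, 0],
    "avg_glucose_level": [228.69, 202.21, 83.75],
    "bmi": [36.6, 28.893237, 28.893237],
    "stroke": [1, 1, 0],
})

# Positional slice as in the notebook: it runs to the end and picks up `stroke`.
leaky = toy.iloc[:, 2:11]

# Name-based selection: only the intended numeric features, no target.
feature_cols = ["age", "hypertension", "heart_disease", "avg_glucose_level", "bmi"]
x_toy = toy[feature_cols]
y_toy = toy["stroke"].values

print(list(leaky.columns))
print(list(x_toy.columns))
```

With the target removed from the features, the model has to learn from the patient attributes alone, so the reported accuracy would be expected to differ from the figure shown later.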

Splitting the data into train and test

In [None]:
x_train,x_test,y_train,y_test = train_test_split(x,y, test_size = 0.2, random_state = 0)
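Only about 4.9% of the rows are stroke cases (the mean of `stroke` in the summary above), so a plain random split can leave the test set with an unrepresentative share of positives. One option is the `stratify` argument of `train_test_split`. The sketch below uses synthetic labels (`x_demo`, `y_demo` are made-up stand-ins) with a similar imbalance.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 50 positives out of 1000 rows (5%).
y_demo = np.array([1] * 50 + [0] * 950)
x_demo = np.arange(1000).reshape(-1, 1)

# stratify=y_demo keeps the class ratio the same in both parts of the split.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print(y_tr.mean(), y_te.mean())
```

Both parts keep the 5% positive rate of the full label vector.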

An initial SVM model with a linear kernel

In [None]:
clf = SVC(kernel="linear", random_state=0)

Fit the model

In [None]:
clf.fit(x_train, y_train)

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=0, shrinking=True, tol=0.001,
    verbose=False)
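SVMs are sensitive to feature scale, and here `age` and `avg_glucose_level` span very different ranges. A possible variation, not what the notebook does, is to standardize the features in a pipeline and weight the classes to offset the imbalance. The sketch below runs on synthetic data; `X`, `y`, and `model` are hypothetical names, and the label rule is invented for the demo.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic features on different scales, loosely mimicking age and glucose.
n = 200
age = rng.uniform(0, 82, n)
glucose = rng.uniform(55, 272, n)
X = np.column_stack([age, glucose])

# Invented label rule for the demo only: positive when age exceeds 60.
y = (age > 60).astype(int)

# Scale the features, then fit a linear SVM with balanced class weights.
model = make_pipeline(
    StandardScaler(),
    SVC(kernel="linear", class_weight="balanced", random_state=0),
)
model.fit(X, y)

train_acc = model.score(X, y)
print(train_acc)
```

Putting the scaler inside the pipeline means it is fitted on the training data only when the pipeline is later used with a train/test split.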

Now that the model is fitted, we can predict labels for the test set.

In [None]:
pred_y = clf.predict(x_test)

In [None]:
pred_y

array([1, 0, 0, ..., 0, 1, 0])

In [None]:
y_test

array([1, 0, 0, ..., 0, 1, 0])

The confusion matrix is a summary of prediction results for a classification problem. It shows the number of correct and incorrect predictions, broken down by class. In scikit-learn, rows are the true classes and columns are the predicted classes.

In [None]:
confusion_matrix(y_test,pred_y)

array([[968,   0],
       [  0,  54]])
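The four cells of the matrix above can be unpacked into true negatives, false positives, false negatives, and true positives, from which precision and recall follow. This is a small worked example using the numbers printed above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the output above (rows: true class, columns: predicted class).
cm = np.array([[968, 0],
               [0, 54]])

# For a binary problem, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = cm.ravel()

recall = tp / (tp + fn)      # share of actual stroke cases that were found
precision = tp / (tp + fp)   # share of predicted stroke cases that were correct

print(tn, fp, fn, tp, recall, precision)
```

For an imbalanced target like this one, recall on the stroke class is usually more informative than overall accuracy.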

With the predictions in hand, let's check the accuracy of our model.

In [None]:
accuracy = accuracy_score(y_test, pred_y)

In [None]:
accuracy

1.0

In [None]:
print("accuracy = ", accuracy * 100, "%")

accuracy =  100.0 %
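Accuracy is easier to interpret next to a baseline. The test set holds 968 non-stroke and 54 stroke cases (from the confusion matrix above), so a classifier that always predicts "no stroke" already scores about 94.7%. The sketch below computes this; `y_true` and `always_zero` are illustrative names.

```python
import numpy as np

# Test labels reconstructed from the confusion matrix: 968 negatives, 54 positives.
y_true = np.array([0] * 968 + [1] * 54)

# A trivial baseline that predicts "no stroke" for every patient.
always_zero = np.zeros_like(y_true)

baseline_acc = (y_true == always_zero).mean()
baseline_recall = ((always_zero == 1) & (y_true == 1)).sum() / (y_true == 1).sum()

print(round(baseline_acc * 100, 1), baseline_recall)
```

This baseline finds none of the stroke cases, which is why accuracy alone says little on this dataset.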


# ***Conclusion***

1) BMI and age were positively correlated (r ≈ 0.33), though the association was not strong.

2) Older patients were more likely to suffer a stroke than younger patients (r ≈ 0.25 between age and stroke).
Higher BMI was not associated with a higher stroke risk (r ≈ 0.04).

3) Diabetes is a risk factor for stroke, and prediabetic patients also have an increased risk of stroke.

4) A higher proportion of patients with hypertension or heart disease experienced a stroke (r ≈ 0.13 with stroke for each).

5) Gender and place of residence made no apparent difference to the likelihood of experiencing a stroke.

6) The work type variable was strongly associated with age.

7) The marital status variable was strongly associated with age.

8) The linear SVM reached 100% accuracy on the test set. However, the feature matrix `x` contained the `stroke` column itself, so this score reflects label leakage and not genuine predictive performance.