# Stroke Prediction Classification


### **Introduction:**

Stroke (cerebrovascular accident) is the fifth leading cause of death and a leading cause of disability in the United States. A stroke occurs when the brain's supply of oxygen and nutrients is cut off, either because a blood vessel is blocked or because it bursts. Brain cells then begin to die, which can cause weakness or paralysis, often on one side of the body, as well as sensory impairment.

This classification project asks whether a person is likely to experience a stroke, based on eight core health and demographic factors.

The “Stroke Prediction Dataset”, acquired from kaggle.com, contains 12 columns: factors that may affect a person’s likelihood of experiencing a stroke, an indicator of whether the patient experienced a stroke, and other patient information.

In [19]:
library(repr)
library(tidyverse)
library(tidymodels)
options(repr.matrix.max.rows = 6)

In [32]:
url <- "https://raw.githubusercontent.com/romansinkus/DS_Group_Project/main/healthcare-dataset-stroke-data.csv"
untidy_stroke_data <- read_csv(url)
# Drop the identifier, standardize the column name, parse bmi as a number,
# and convert the categorical columns to factors.
stroke_data <- untidy_stroke_data %>%
    select(-id) %>%
    rename(residence_type = Residence_type) %>%
    mutate(bmi = as.numeric(bmi)) %>%  # non-numeric entries become NA
    mutate(across(c(stroke, heart_disease, hypertension, gender, ever_married,
                    work_type, residence_type, smoking_status), as_factor)) %>%
    # Keep plausible BMI values only; rows where bmi is NA are dropped as well.
    filter(bmi > 10 & bmi < 50)

stroke_data

Parsed with column specification:
cols(
  id = col_double(),
  gender = col_character(),
  age = col_double(),
  hypertension = col_double(),
  heart_disease = col_double(),
  ever_married = col_character(),
  work_type = col_character(),
  Residence_type = col_character(),
  avg_glucose_level = col_double(),
  bmi = col_character(),
  smoking_status = col_character(),
  stroke = col_double()
)

“Problem with `mutate()` input `bmi`.
ℹ NAs introduced by coercion
ℹ Input `bmi` is `as.numeric(bmi)`.”
“NAs introduced by coercion”


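| gender | age | hypertension | heart_disease | ever_married | work_type | residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|
| `<fct>` | `<dbl>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<dbl>` | `<dbl>` | `<fct>` | `<fct>` |
| Male | 67 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| Male | 80 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| Female | 49 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| Female | 35 | 0 | 0 | Yes | Self-employed | Rural | 82.99 | 30.6 | never smoked | 0 |
| Male | 51 | 0 | 0 | Yes | Private | Rural | 166.29 | 25.6 | formerly smoked | 0 |
| Female | 44 | 0 | 0 | Yes | Govt_job | Urban | 85.28 | 26.2 | Unknown | 0 |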
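
The coercion warning above appears because `bmi` is read as text and at least one entry is not a number. The following is a minimal sketch of making that conversion explicit on a small made-up tibble. It assumes the missing entries are stored as the text `"N/A"`; the warning itself does not say which text caused it, so treat that marker as an assumption.

```r
library(dplyr)
library(tibble)

# Hypothetical toy rows; "N/A" as the missing-value marker is an assumption.
toy <- tibble(bmi = c("36.6", "N/A", "28.1"))

# Turn the text marker into NA first, then parse the remaining text as numbers.
toy_clean <- toy %>%
    mutate(bmi = as.numeric(na_if(bmi, "N/A")))

# An NA fails the range check, so filter() drops that row without a message.
toy_kept <- toy_clean %>%
    filter(bmi > 10 & bmi < 50)

toy_kept
```

If the marker really is `"N/A"`, passing `na = c("", "NA", "N/A")` to `read_csv()` should let `bmi` be parsed as a number at read time. Either way, note that the range filter also removes every row with a missing `bmi`.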


In [33]:
stroke_split <- initial_split(stroke_data, prop = 0.75, strata = stroke)
stroke_train <- training(stroke_split)
stroke_test <- testing(stroke_split)
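
The split above is random, so the rows that land in the training set (and any summary computed from them) will differ between runs unless a seed is set first. Here is a small sketch on simulated data, where the seed value and the toy columns are purely illustrative, showing a seeded stratified split and a check that both pieces keep roughly the same class balance:

```r
library(dplyr)
library(tibble)
library(rsample)

set.seed(2021)  # arbitrary seed, chosen only for this illustration

# Hypothetical toy data: 1000 rows, 20% of them in the positive class.
toy <- tibble(
    x = rnorm(1000),
    stroke = factor(rep(c("0", "1"), times = c(800, 200)))
)

toy_split <- initial_split(toy, prop = 0.75, strata = stroke)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# With stratification, the class proportions should be similar in both pieces.
toy_train %>% count(stroke) %>% mutate(prop = n / sum(n))
toy_test %>% count(stroke) %>% mutate(prop = n / sum(n))
```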

In [34]:
avg_values <- stroke_train %>% 
    summarize(age_avg = mean(age, na.rm = TRUE),
            avg_glucose_level_avg = mean(avg_glucose_level, na.rm = TRUE),
            bmi_avg = mean(bmi, na.rm = TRUE))

avg_values

| age_avg | avg_glucose_level_avg | bmi_avg |
|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` |
| 42.61487 | 104.7845 | 28.53235 |


In [40]:
# Count the training observations that have hypertension.
categorical_values <- stroke_train %>%
    filter(hypertension == 1) %>%
    nrow()

categorical_values


### **Methods:**

We will use variables that could plausibly have a direct or indirect effect on strokes; they are a mix of categorical, discrete, and continuous variables. Specifically, we will use each individual's gender, age, hypertension, heart disease, residence type, average glucose level, body mass index, and smoking status to predict whether they are likely to have a stroke.

The data was already tidy: each row holds a single observation, each column a single variable, and each cell a single value. We removed the redundant id column, and we exclude marital status (ever_married) and work type (work_type) from the analysis because they are less directly related to stroke than the other factors.
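
The Methods above do not name a classifier. As one possible illustration only, the sketch below fits a logistic regression with tidymodels on simulated stand-in data and evaluates it on held-out rows. The choice of model, the seed, the simulated values, and the two predictors used are all assumptions made for the example, not the project's stated method.

```r
library(tidymodels)

set.seed(1)  # arbitrary seed for the simulated data

# Simulated stand-in data; column names mirror the dataset, values are made up.
n <- 500
toy <- tibble(
    age = runif(n, 20, 80),
    avg_glucose_level = runif(n, 60, 250)
) %>%
    mutate(stroke = factor(rbinom(n, 1, plogis(-6 + 0.08 * age)), levels = c(0, 1)))

toy_split <- initial_split(toy, prop = 0.75, strata = stroke)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Logistic regression is an assumed choice, not the project's stated method.
model_spec <- logistic_reg() %>%
    set_engine("glm") %>%
    set_mode("classification")

toy_fit <- workflow() %>%
    add_formula(stroke ~ age + avg_glucose_level) %>%
    add_model(model_spec) %>%
    fit(data = toy_train)

# Attach the predicted class to the held-out rows.
toy_predictions <- predict(toy_fit, new_data = toy_test) %>%
    bind_cols(toy_test)

# Accuracy-style metrics and a confusion matrix on the held-out rows.
toy_predictions %>% metrics(truth = stroke, estimate = .pred_class)
toy_predictions %>% conf_mat(truth = stroke, estimate = .pred_class)
```

If strokes turn out to be rare in the data, accuracy alone can look good for a model that never predicts a stroke, so the confusion matrix is worth reading alongside it.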

Describe at least one way that you will visualize the results:


### **Expected Outcomes and Significance:**

We expect to find patterns that help determine whether a patient is more likely to suffer a stroke. That likelihood depends on several of the factors in question, which relate to a patient’s health, living habits, or demographics. For example, we expect smoking, having heart disease, or living in a polluted area to be associated with a higher chance of a stroke.

What impact could such findings have?

Such a prediction could make assessing stroke risk easier and more efficient. It could help many patients seek help at an early stage and prevent serious outcomes from occurring.

**Future Questions:**