# Let's Predict Strokes
The goal of this workbook is to take my random background in data analytics, data science, and other assorted techniques and try to come up with the most successful (or **a successful**) algorithm to predict strokes.

In [None]:
#Kaggle introductory material
# ----------------------------------------------------------------------------------------------------------------------
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter
import matplotlib.ticker as ticker
%matplotlib inline

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
# ----------------------------------------------------------------------------------------------------------------------

In [None]:
strokedata = pd.read_csv("/kaggle/input/stroke-prediction-dataset/healthcare-dataset-stroke-data.csv")

The stroke dataset contains 5110 data points, each described by 12 attributes.
## All Attributes
* id - ID of patient/datapoint
* gender - mostly boolean, plus one `Other` value
* age
* hypertension - boolean
* heart_disease - boolean
* ever_married - boolean
* work_type
* Residence_type - boolean
* avg_glucose_level
* bmi
* smoking_status
* stroke - boolean

To get to know the data, I'll be using the 'describe' method to get an overview of our quantitative attributes and then 'value_counts' to get an idea of our boolean data and qualitative attributes.

The table below lists all the attributes, the ones the 'describe' method picks up, and the ones I passed to 'value_counts' myself.

| All Attributes   | 'describe' data  |'value_counts' data|
| -----------      | -----------      |-----------        |
| id               | id               |gender             |
| gender           | age              |hypertension       |
| age              | hypertension     |heart_disease      |
| hypertension     | heart_disease    |ever_married       |
| heart_disease    | avg_glucose_level|work_type          |
| ever_married     | bmi              |Residence_type     |
| work_type        | stroke           |smoking_status     |
| Residence_type   |                  |stroke             |
| avg_glucose_level|                  |                   |
| bmi              |                  |                   |
| smoking_status   |                  |                   |
| stroke           |                  |                   |
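As a small sketch of why the two columns of the table differ: `describe` only summarizes numeric columns by default, while `value_counts` tallies the distinct values of whichever column it is given. The `toy` frame below is made up for illustration and is not the stroke data.

```python
import pandas as pd

# Made-up rows that only mimic the shape of the stroke data.
toy = pd.DataFrame({
    "age": [67.0, 61.0, 80.0],
    "hypertension": [0, 0, 1],
    "gender": ["Male", "Female", "Female"],
})

# describe() summarizes numeric columns only by default,
# so the text column "gender" is left out.
described = list(toy.describe().columns)
print(described)

# value_counts() tallies each distinct value in one column.
counts = toy["gender"].value_counts()
print(counts)
```

This is why 0/1 columns such as hypertension show up in both lists: they are stored as numbers, but they are really categories.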

In [None]:
strokedata.describe()

### In addition to the describe and value_counts data, we should probably check for any NaN data.

In [None]:
display(strokedata.isna().sum())
display(201/5110)

And apparently, all our NaN data is in BMI: 201 of the 5110 rows, or 3.9% of our data, have a missing BMI. To preserve the accuracy of our model, I'm extremely hesitant to keep these rows by filling in an average BMI, especially knowing that BMI could potentially be a strong predictor. I think we nix these rows and don't look back.
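A minimal sketch of the same check on a made-up frame (the `toy` values are invented, not taken from the dataset). It assumes we only care about gaps in `bmi`, which is why `dropna` is given a `subset`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the stroke data (values are made up for illustration).
toy = pd.DataFrame({
    "age": [67, 61, 80, 49, 79],
    "bmi": [36.6, np.nan, 32.5, 34.4, np.nan],
    "stroke": [1, 1, 1, 0, 0],
})

# isna().mean() gives the fraction of missing values per column directly,
# so there is no need to divide the count by the row total by hand.
missing_fraction = toy.isna().mean()
print(missing_fraction)

# Drop only the rows where bmi is missing.
toy_clean = toy.dropna(subset=["bmi"])
print(len(toy), "->", len(toy_clean), "rows")
```

Since BMI is the only column with gaps in this dataset, a plain `dropna()` ends up removing the same rows.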



In [None]:
strokedata = strokedata.dropna()

In [None]:
vcattributes = ['gender','hypertension','heart_disease','ever_married','work_type','Residence_type','smoking_status','stroke']
# want to get rid of ID, age, avg_glucose_level, bmi

for vcattribute in vcattributes:
    print(strokedata[[vcattribute]].value_counts(),'\n','-------------------------')

Now that we've gotten a good look at what the data is, what values it has, and what type each attribute is, let's look at the best way to predict our data.

How to decide what algorithm to use... good question. Somebody who has created predictive algorithms may have keen insight into this issue, but I have little to no experience in this. In the absence of somebody with experience recommending a path, a quick Google search reveals this paper from India written by Ritabrata Maiti (https://arxiv.org/ftp/arxiv/papers/1802/1802.07756.pdf). On page four there is a diagram by scikit-learn which details a good starting place for beginners on which algorithms to choose. For additional reading about different algorithms by a professional, I will likely go back and refer to the paper by Mr. Maiti.
![Machine Learning Map by scikit-learn](https://scikit-learn.org/stable/_static/ml_map.png)


## To start, I'm going to focus on K-Nearest Neighbors and then SVC using the boolean and numeric data that we have.
The goal: to predict stroke based on whether the patient has hypertension or heart disease, has ever been married, and where they live.
*This goal was largely conceived based on my skill level. I'm hoping that minimal data preprocessing and using only boolean and numeric data will be easier to work with.
* id - ID of patient/datapoint
* hypertension - boolean
* heart_disease - boolean
* ever_married - boolean
* Residence_type - boolean
* stroke - boolean



* First step: convert ever_married to 1 and 0 (1 = Yes, 0 = No).
* Second step: convert Residence_type to 1 and 0 (1 = Urban, 0 = Rural).

In [None]:
strokedata['ever_married']= strokedata['ever_married'].map(dict(Yes=1, No=0))
strokedata[['ever_married']].value_counts() # double check make sure it worked.

In [None]:
strokedata['Residence_type']= strokedata['Residence_type'].map(dict(Urban=1, Rural=0))
strokedata[['Residence_type']].value_counts() # double check make sure it worked.
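One thing worth knowing about `map` with a dictionary: any value that is not a key in the dictionary silently becomes NaN instead of raising an error. The sketch below shows a quick way to catch that. The `"Unknown"` value is hypothetical and is only there to trigger the behavior.

```python
import pandas as pd

# Toy column with one unexpected value ("Unknown" is hypothetical).
married = pd.Series(["Yes", "No", "Yes", "Unknown"])

mapped = married.map({"Yes": 1, "No": 0})

# Values missing from the dictionary come back as NaN, so listing the
# original values at those positions shows what was not converted.
unmapped = married[mapped.isna()]
print(unmapped.tolist())
```

An empty list here means every value was converted, which is a slightly stricter check than eyeballing `value_counts`.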

In [None]:
booleandata = strokedata[['id','hypertension','heart_disease','ever_married','Residence_type','stroke']]

boolandnumeric = strokedata[['age','hypertension','heart_disease','avg_glucose_level','ever_married','Residence_type','bmi']]
stroke = strokedata['stroke']

In [None]:
boolandnumeric  # missing gender, work_type, id, and smoking_status; arguably important, but that involves learning more

## Preprocessing pipeline includes StandardScaler, fit, and transform.
*process adapted from a great article found here:* https://towardsdatascience.com/how-to-find-the-optimal-value-of-k-in-knn-35d936e554eb

* StandardScaler
    * Standardize features by removing the mean and scaling to unit variance. The standard score of a sample x is calculated as: 
    
      z = (x - u) / s 
      
      where u is the mean of the training samples or zero if with_mean=False, and s is the standard deviation of the training samples or one if with_std=False.
* fit
    * Compute the mean and std to be used for later scaling.
* transform
    * Perform standardization by centering and scaling
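
To convince myself of the formula, here is a small sketch that runs `StandardScaler` on made-up numbers and repeats the calculation by hand. The `X_toy` values are invented. The manual version assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: column 0 is age-like, column 1 is a 0/1 flag.
X_toy = np.array([[20.0, 0.0], [40.0, 1.0], [60.0, 1.0], [80.0, 0.0]])

scaler = StandardScaler().fit(X_toy)  # fit: learn the per-column mean and std
Z = scaler.transform(X_toy)           # transform: apply z = (x - u) / s

# Same calculation by hand; np.std defaults to the population std (ddof=0).
u = X_toy.mean(axis=0)
s = X_toy.std(axis=0)
Z_manual = (X_toy - u) / s

print(np.allclose(Z, Z_manual))
```

A common variation is to call `fit` on the training split only and then `transform` both splits, so the test rows do not influence the mean and standard deviation. The cell below fits on all rows before splitting, which is simpler but not quite the same thing.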


In [None]:
from sklearn import preprocessing
X = preprocessing.StandardScaler().fit(boolandnumeric).transform(boolandnumeric.astype(float))

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, stroke, test_size=0.2, random_state=4)

from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
# Train a model for each k and keep track of the most accurate one
best = 0
bestk = None
for k in range(1, 20):
    neigh = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    Pred_y = neigh.predict(X_test)
    accuracy = metrics.accuracy_score(y_test, Pred_y)  # compute once, reuse below
    if best < accuracy:
        best = accuracy
        bestk = k
    print("Accuracy of model at K= ", k, " is", accuracy)

print("Most accurate model at K= ", bestk, " is", best)

## Based on the above code, our most accurate model is at k = 6 so we will use that going forward.

In [None]:
# The original code was off by one: the list index is one less than the i value (index = k - 1).
# The minimum error is at k=6.
error_rate = []
for i in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))

plt.figure(figsize=(10,6))
plt.plot(range(1,40),error_rate,color='blue', linestyle='dashed', 
         marker='o',markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
print("Minimum error:", min(error_rate), "at K =", error_rate.index(min(error_rate)) + 1)  # +1 converts the list index back to k

## The error in the original code can be plainly seen below.
The minimum can be found at index 5, but the i value above started at 1 (for the sake of the algorithm, k cannot = 0), so index 5 corresponds to k = 6.

Where index = 0, k = 1, and the error rate is 0.07331975560081466

Where index = 1, k = 2, and the error rate is 0.04276985743380855

Where index = 2, k = 3, and the error rate is 0.045824847250509164

Where index = 3, k = 4, and the error rate is 0.04073319755600815

Where index = 4, k = 5, and the error rate is 0.04175152749490835

Where index = 5, k = 6, and the error rate is 0.03767820773930754 ------**This is the minimum value for the error at k = 6**

Where index = 6, k = 7, and the error rate is 0.04175152749490835
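
One way to sidestep the index arithmetic altogether is to pair each k with its error before taking the minimum. This is just a sketch that uses the seven error rates listed above. The names `ks`, `best_k`, and `best_error` are mine.

```python
# Error rates copied from the list above, for k = 1 through 7.
error_rate = [
    0.07331975560081466, 0.04276985743380855, 0.045824847250509164,
    0.04073319755600815, 0.04175152749490835, 0.03767820773930754,
    0.04175152749490835,
]
ks = list(range(1, len(error_rate) + 1))

# Pair each k with its error so no +1 / -1 bookkeeping is needed.
best_k, best_error = min(zip(ks, error_rate), key=lambda pair: pair[1])
print(best_k, best_error)  # 6 0.03767820773930754
```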

In [None]:
error_rate[0:7]

In [None]:
k = 6
neigh = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
Pred_y = neigh.predict(X_test)

So... turns out, yay, we have a model that is largely accurate based on the parameters that were entered and the data that we have. If I find another dataset with these parameters I'll test it out. There's still an error rate of about 3.8%, but that isn't too shabby. However, let's try the SVC algorithm. I also kind of want to see if I can figure out a system to "weight" how much each attribute affects the accuracy of the model: which is the strongest predictor, as it were.
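
Before reading too much into the accuracy, it may be worth comparing it against a baseline that always predicts the most common class, because a rare outcome can make plain accuracy look better than it is. The sketch below uses invented labels (95 negatives and 5 positives, not the real stroke counts) and scikit-learn's `DummyClassifier`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels: 95 negatives and 5 positives (not the real stroke counts).
y_true = np.array([0] * 95 + [1] * 5)
X_dummy = np.zeros((len(y_true), 1))  # the baseline ignores the features

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_pred = baseline.predict(X_dummy)

baseline_accuracy = accuracy_score(y_true, y_pred)
matrix = confusion_matrix(y_true, y_pred)
print("accuracy:", baseline_accuracy)
print(matrix)
```

In this notebook, the equivalent check would be something like `metrics.confusion_matrix(y_test, Pred_y)`, which shows how many actual strokes in the test set the k = 6 model catches.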

One answer to this question: https://stackoverflow.com/questions/35815992/how-to-find-out-weights-of-attributes-in-k-nearest-neighbors-algorithm/35816344 indicated that the inverse of the distance to a given point could tell me how much the algorithm weights that point.
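
As a rough sketch of that idea, `KNeighborsClassifier` accepts `weights="distance"`, and `kneighbors` returns the distances to the nearest training points. The data below is made up. Note that this weights each neighboring *point*, not each attribute, so it does not by itself say which attribute is the strongest predictor.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up 1-D data: two classes sitting at opposite ends of a line.
X_toy = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# weights="distance" makes closer neighbors count more (weight = 1 / distance).
knn = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_toy, y_toy)

query = np.array([[3.0]])
distances, indices = knn.kneighbors(query)  # the 3 nearest training points
neighbor_weights = 1.0 / distances[0]       # inverse-distance weight per neighbor

print(indices[0], neighbor_weights, knn.predict(query))
```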

# to be continued