# Attribute Information

1) id: unique identifier

2) gender: "Male", "Female" or "Other"

3) age: age of the patient

4) hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension

5) heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease

6) ever_married: "No" or "Yes"

7) work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"

8) Residence_type: "Rural" or "Urban"

9) avg_glucose_level: average glucose level in blood

10) bmi: body mass index

11) smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*

12) stroke: 1 if the patient had a stroke or 0 if not

*Note: "Unknown" in smoking_status means that the information is unavailable for this patient.



## Import Libraries

In [1]:
import pandas as pd 
import numpy as np 
import seaborn as sns 
import matplotlib
import matplotlib.pyplot as plt 


## Read Data

In [2]:
df = pd.read_csv("/Users/sahil/Documents/Project/Data3-3/5k_stroke-data.csv")

### View Information

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


### View Shape

In [5]:
df.shape


(5110, 12)

### View Sample

In [6]:
df.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


### Summary of Numerical Data

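`df.describe()` returns a statistical summary of each numerical column, which includes:

- Count of non-null values
- Mean
- Standard deviation
- Minimum and maximum values
- 25th, 50th (median) and 75th percentiles

The mode is not part of this summary.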

In [7]:
df.describe()

Unnamed: 0,id,age,hypertension,heart_disease,avg_glucose_level,bmi,stroke
count,5110.0,5110.0,5110.0,5110.0,5110.0,4909.0,5110.0
mean,36517.829354,43.226614,0.097456,0.054012,106.147677,28.893237,0.048728
std,21161.721625,22.612647,0.296607,0.226063,45.28356,7.854067,0.21532
min,67.0,0.08,0.0,0.0,55.12,10.3,0.0
25%,17741.25,25.0,0.0,0.0,77.245,23.5,0.0
50%,36932.0,45.0,0.0,0.0,91.885,28.1,0.0
75%,54682.0,61.0,0.0,0.0,114.09,33.1,0.0
max,72940.0,82.0,1.0,1.0,271.74,97.6,1.0
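
The table above has no row for the mode, and the median appears only as the `50%` row. A minimal sketch of how those two values could be read off separately, using a toy column whose numbers are illustrative and not taken from the dataset:

```python
import pandas as pd

# Toy data standing in for one numeric column (values are illustrative only)
toy = pd.DataFrame({"bmi": [22.0, 28.7, 28.7, 31.5, 40.2]})

summary = toy["bmi"].describe()   # count, mean, std, min, 25%, 50%, 75%, max
median = toy["bmi"].median()      # same value as the 50% row of describe()
mode = toy["bmi"].mode()[0]       # most frequent value; not reported by describe()

print(summary)
print("median:", median, "mode:", mode)
```

`mode()` returns a Series because several values can tie for most frequent; `[0]` picks the smallest of them.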


In [8]:
categorical_cols = [col for col in df.columns.tolist() if col not in df.describe().columns.tolist()]
print(categorical_cols)

['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']


In [9]:
df.gender.value_counts()

Female    2994
Male      2115
Other        1
Name: gender, dtype: int64

In [10]:
# Removing 'Other' since there is just one entry
df = df[df['gender'] != 'Other']
df.gender.value_counts()

Female    2994
Male      2115
Name: gender, dtype: int64

In [11]:
# Dropping id and Residence_type: they are not meaningful predictors, but Vertex AI was giving them too much importance
df = df.drop(columns=['id', 'Residence_type'])

## Data Pre-processing for Data1

- Label Encoding of categorical variables
- Handling Missing/Null Values
- Handling Outliers
- Oversampling of Stroke positive (stroke=1) entries


##### Label Encoding of all 4 categorical variables

In [12]:
# Label encoding of categorical variables in a new dataset, df_copy

df_copy = df.copy(deep=True)

from sklearn.preprocessing import LabelEncoder

# Use a lowercase instance name so the LabelEncoder class is not shadowed
label_encoder = LabelEncoder()
for col in ['gender', 'ever_married', 'work_type', 'smoking_status']:
    df_copy[col] = label_encoder.fit_transform(df_copy[col])
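
Once the columns are encoded, the integer codes are hard to read without the original labels. A small sketch of how the label-to-code mapping could be recovered from a fitted encoder; the list `values` is a hypothetical stand-in that reuses the smoking_status categories listed at the top:

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative values using the smoking_status categories from the attribute list
values = ["formerly smoked", "never smoked", "smokes", "Unknown", "never smoked"]

encoder = LabelEncoder()
codes = encoder.fit_transform(values)

# classes_ holds the sorted labels; a label's position is its integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
```

Note that the codes follow the sort order of the labels, so they carry no meaningful ranking for nominal variables such as work_type.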

In [13]:
df_copy.head(10)

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,avg_glucose_level,bmi,smoking_status,stroke
0,1,67.0,0,1,1,2,228.69,36.6,1,1
1,0,61.0,0,0,1,3,202.21,,2,1
2,1,80.0,0,1,1,2,105.92,32.5,2,1
3,0,49.0,0,0,1,2,171.23,34.4,3,1
4,0,79.0,1,0,1,3,174.12,24.0,2,1
5,1,81.0,0,0,1,2,186.21,29.0,1,1
6,1,74.0,1,1,1,2,70.09,27.4,2,1
7,0,69.0,0,0,0,2,94.39,22.8,2,1
8,0,59.0,0,0,1,2,76.15,,0,1
9,0,78.0,0,0,1,2,58.57,24.2,0,1


##### Handling Missing/Null Values

Find missing values using `isnull()`.


- Dropping rows is ruled out because it may discard too much data and hurt the model
- Imputation options:
    - Mean/Median/Mode
    - End of distribution
    - Arbitrary value
    - Add a variable to denote NA
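
The imputation options listed above can be sketched side by side on a toy column. The numbers are illustrative, the `-999` placeholder is an arbitrary choice, and "end of distribution" is taken here to mean mean + 3 standard deviations, which is one common convention and an assumption on our part:

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (illustrative numbers, not from the dataset)
s = pd.Series([20.0, 25.0, 25.0, 30.0, np.nan], name="bmi")

imputed = pd.DataFrame({
    "bmi": s,
    "bmi_is_NA": s.isnull().astype(int),               # indicator: 1 = was missing
    "bmi_mean": s.fillna(s.mean()),                    # mean imputation
    "bmi_median": s.fillna(s.median()),                # median imputation
    "bmi_mode": s.fillna(s.mode()[0]),                 # mode imputation
    "bmi_end_dist": s.fillna(s.mean() + 3 * s.std()),  # end of distribution (assumed: mean + 3*std)
    "bmi_arbitrary": s.fillna(-999),                   # arbitrary placeholder value
})

# Show how the missing row was filled under each strategy
print(imputed.tail(1).T)
```

Keeping the `_is_NA` indicator alongside an imputed column lets a model distinguish observed values from filled-in ones.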

In [14]:
missing = pd.concat([df_copy.isnull().sum(),df_copy.isnull().mean()],axis=1)
missing = missing.rename(index=str,columns={0:'total missing',1:'proportion'})
missing


Unnamed: 0,total missing,proportion
gender,0,0.0
age,0,0.0
hypertension,0,0.0
heart_disease,0,0.0
ever_married,0,0.0
work_type,0,0.0
avg_glucose_level,0,0.0
bmi,201,0.039342
smoking_status,0,0.0
stroke,0,0.0


In [15]:
# For each column with missing values, create `<col>_is_NA`: 0 = not missing, 1 = missing for that observation

NA_col = missing[missing['total missing'] >0].index.tolist()

for i in NA_col:
    if df_copy[i].isnull().sum()>0:
        df_copy[i+'_is_NA'] = np.where(df_copy[i].isnull(),1,0)

df_copy.head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,avg_glucose_level,bmi,smoking_status,stroke,bmi_is_NA
0,1,67.0,0,1,1,2,228.69,36.6,1,1,0
1,0,61.0,0,0,1,3,202.21,,2,1,1
2,1,80.0,0,1,1,2,105.92,32.5,2,1,0
3,0,49.0,0,0,1,2,171.23,34.4,3,1,0
4,0,79.0,1,0,1,3,174.12,24.0,2,1,0


In [16]:
# Impute missing values with the column mode, stored in `<col>_impute_mode`
NA_col = missing[missing['total missing'] > 0].index.tolist()

for i in NA_col:
    if df_copy[i].isnull().sum() > 0:
        df_copy[i + '_impute_mode'] = df_copy[i].fillna(df_copy[i].mode()[0])

df_copy.head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,avg_glucose_level,bmi,smoking_status,stroke,bmi_is_NA,bmi_impute_mode
0,1,67.0,0,1,1,2,228.69,36.6,1,1,0,36.6
1,0,61.0,0,0,1,3,202.21,,2,1,1,28.7
2,1,80.0,0,1,1,2,105.92,32.5,2,1,0,32.5
3,0,49.0,0,0,1,2,171.23,34.4,3,1,0,34.4
4,0,79.0,1,0,1,3,174.12,24.0,2,1,0,24.0


In [17]:
columns = df_copy.columns.tolist()
print(columns)

['gender', 'age', 'hypertension', 'heart_disease', 'ever_married', 'work_type', 'avg_glucose_level', 'bmi', 'smoking_status', 'stroke', 'bmi_is_NA', 'bmi_impute_mode']


In [18]:
df_copy = df_copy[['gender', 'age', 'hypertension', 'heart_disease', 'ever_married', 'work_type', 'avg_glucose_level', 'bmi_impute_mode', 'smoking_status', 'stroke']]
df_copy.head()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,avg_glucose_level,bmi_impute_mode,smoking_status,stroke
0,1,67.0,0,1,1,2,228.69,36.6,1,1
1,0,61.0,0,0,1,3,202.21,28.7,2,1
2,1,80.0,0,1,1,2,105.92,32.5,2,1
3,0,49.0,0,0,1,2,171.23,34.4,3,1
4,0,79.0,1,0,1,3,174.12,24.0,2,1


##### Handling Outliers

We'll detect outliers by the following method:

- Detect by arbitrary boundary (upper and lower fences chosen per column)

We'll use the following method to deal with outliers:

- Replace outliers with the mode of the column, so no rows are discarded


In [19]:
df_copy.shape

(5109, 10)

In [20]:
# Outlier detection based on arbitrary boundaries passed to the function.
# Note: value_counts()[1] raises an exception when no row is flagged;
# the calling cells rely on this to report 'no outliers'.
def outlier_detect_arbitrary(data, col, upper_fence, lower_fence):
    para = (upper_fence, lower_fence)
    tmp = pd.concat([data[col] > upper_fence, data[col] < lower_fence], axis=1)
    outlier_index = tmp.any(axis=1)
    print('Num of outlier detected:', outlier_index.value_counts()[1])
    print('Proportion of outlier detected', outlier_index.value_counts()[1] / len(outlier_index))
    return outlier_index, para


In [21]:
#Impute outlier with mean/median/most frequent values of that variable.

def impute_outlier_with_avg(data,col,outlier_index,strategy='mean'):
    data_copy = data.copy(deep=True)
    if strategy=='mean':
        data_copy.loc[outlier_index,col] = data_copy[col].mean()
    elif strategy=='median':
        data_copy.loc[outlier_index,col] = data_copy[col].median()
    elif strategy=='mode':
        data_copy.loc[outlier_index,col] = data_copy[col].mode()[0]   
        
    return data_copy

In [22]:
# Detecting outliers in column 'age' and replacing them with the mode

print('Column = age')
upper_fence = 100  # decided by business
lower_fence = 0  # decided by business
try:
    index, para = outlier_detect_arbitrary(data=df_copy, col='age', upper_fence=upper_fence, lower_fence=lower_fence)
    print('Upper bound:', para[0], '\nLower bound:', para[1], '\n')
    print('Outliers treated with mode = ', index.value_counts()[1])
    df_copy = impute_outlier_with_avg(data=df_copy, col='age', outlier_index=index, strategy='mode')
except (KeyError, IndexError):
    # value_counts()[1] fails when no row is flagged as an outlier
    print('no outliers\n')

Column = age
no outliers



In [23]:
# Detecting outliers in column 'avg_glucose_level' and replacing them with the mode

print('Column = avg_glucose_level')
upper_fence = 300  # decided by business
lower_fence = 50  # decided by business
try:
    index, para = outlier_detect_arbitrary(data=df_copy, col='avg_glucose_level', upper_fence=upper_fence, lower_fence=lower_fence)
    print('Upper bound:', para[0], '\nLower bound:', para[1], '\n')
    print('Outliers treated with mode = ', index.value_counts()[1])
    df_copy = impute_outlier_with_avg(data=df_copy, col='avg_glucose_level', outlier_index=index, strategy='mode')
except (KeyError, IndexError):
    # value_counts()[1] fails when no row is flagged as an outlier
    print('no outliers\n')

Column = avg_glucose_level
no outliers



In [24]:
# Detecting outliers in column 'bmi_impute_mode' and replacing them with the mode

print('Column = bmi_impute_mode')
upper_fence = 60  # decided by business
lower_fence = 5  # decided by business
try:
    index, para = outlier_detect_arbitrary(data=df_copy, col='bmi_impute_mode', upper_fence=upper_fence, lower_fence=lower_fence)
    print('Upper bound:', para[0], '\nLower bound:', para[1], '\n')
    print('Outliers treated with mode = ', index.value_counts()[1])
    df_copy = impute_outlier_with_avg(data=df_copy, col='bmi_impute_mode', outlier_index=index, strategy='mode')
except (KeyError, IndexError):
    # value_counts()[1] fails when no row is flagged as an outlier
    print('no outliers\n')

Column = bmi_impute_mode
Num of outlier detected: 13
Proportion of outlier detected 0.002544529262086514
Upper bound: 60 
Lower bound: 5 

Outliers treated with mode =  13
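
The cells above signal "no outliers" through an exception. As an alternative sketch, not the notebook's own method, a boolean mask can report a count of zero directly. The toy values are illustrative, apart from 97.6, which is the bmi maximum shown by `df.describe()`:

```python
import pandas as pd

# Toy BMI values; 97.6 is the maximum reported by df.describe() earlier
toy = pd.DataFrame({"bmi": [24.0, 28.7, 28.7, 32.5, 97.6]})

lower_fence, upper_fence = 5, 60  # same fences the notebook uses for bmi

# True where the value falls outside the inclusive range [lower_fence, upper_fence]
is_outlier = ~toy["bmi"].between(lower_fence, upper_fence)

# Replace flagged values with the column mode (computed before replacement)
mode_value = toy["bmi"].mode()[0]
toy.loc[is_outlier, "bmi"] = mode_value

print(int(is_outlier.sum()), "outlier(s) replaced with", mode_value)
```

Because `is_outlier.sum()` is simply 0 when nothing is flagged, no `try`/`except` is needed around the detection step.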


In [25]:
df_copy.shape

(5109, 10)

##### Oversampling of underrepresented phenomenon (Stroke = 1)

In [26]:
# Split data into dependent and independent variables
X = df_copy.iloc[:,:-1].values
Y = df_copy.iloc[:, -1].values
print(X,Y)

[[  1.    67.     0.   ... 228.69  36.6    1.  ]
 [  0.    61.     0.   ... 202.21  28.7    2.  ]
 [  1.    80.     0.   ... 105.92  32.5    2.  ]
 ...
 [  0.    35.     0.   ...  82.99  30.6    2.  ]
 [  1.    51.     0.   ... 166.29  25.6    1.  ]
 [  0.    44.     0.   ...  85.28  26.2    0.  ]] [1 1 1 ... 0 0 0]


In [27]:
# import the Counter class from the collections module to count stroke-positive and stroke-negative entries
from collections import Counter
counter = Counter(Y)
print(counter)

Counter({0: 4860, 1: 249})


In [28]:
# Oversampling using SMOTE
from imblearn.over_sampling import SMOTE
oversample = SMOTE()
X, Y = oversample.fit_resample(X, Y)

In [29]:
# checking stroke positive and negative values again to see if SMOTE worked
counter = Counter(Y)
print(counter)

Counter({1: 4860, 0: 4860})


In [30]:
# rebuilding the DataFrame from the oversampled feature matrix X and target Y
df_copy = pd.concat([pd.DataFrame(X), pd.DataFrame(Y)], axis=1)

In [31]:
# checking if new oversampled data has been added
df_copy.shape

(9720, 10)

In [32]:
df_copy.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,0.1
0,1.0,67.0,0.0,1.0,1.0,2.0,228.69,36.6,1.0,1
1,0.0,61.0,0.0,0.0,1.0,3.0,202.21,28.7,2.0,1
2,1.0,80.0,0.0,1.0,1.0,2.0,105.92,32.5,2.0,1
3,0.0,49.0,0.0,0.0,1.0,2.0,171.23,34.4,3.0,1
4,0.0,79.0,1.0,0.0,1.0,3.0,174.12,24.0,2.0,1


In [34]:
# renaming the columns

df_copy.columns = ['gender', 'age', 'hypertension', 'heart_disease', 'ever_married', 'work_type', 'avg_glucose_level','bmi_impute_mode', 'smoking_status', 'stroke']
df_copy.tail()

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,avg_glucose_level,bmi_impute_mode,smoking_status,stroke
9715,0.514519,59.970963,0.0,0.485481,1.0,1.543556,119.20972,36.131126,3.0,1
9716,1.0,60.271177,1.0,0.355931,1.0,2.0,257.293542,34.127122,2.355931,1
9717,1.0,57.553086,0.0,0.0,0.111729,2.0,93.106019,31.675987,0.335186,1
9718,0.834129,76.165871,0.834129,0.0,1.0,2.502386,193.698221,27.729834,1.165871,1
9719,0.0,77.761535,0.0,1.0,1.0,2.0,229.816084,34.19655,2.0,1
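
The synthetic rows in the tail above contain fractional values in label-encoded columns (for example gender = 0.514519). This follows from how SMOTE builds new samples: it interpolates between a minority-class row and one of its nearest neighbours. A minimal NumPy sketch of that single step, with toy rows and a hypothetical rounding step that is not part of the notebook:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two minority-class rows: column 0 is a label-encoded gender, column 1 is age (toy values)
x_i = np.array([0.0, 59.0])   # gender=0, age=59
x_nn = np.array([1.0, 61.0])  # a nearest neighbour with gender=1, age=61

# SMOTE's core step: pick a random point on the segment between x_i and x_nn
lam = rng.uniform(0, 1)
synthetic = x_i + lam * (x_nn - x_i)
print(synthetic)  # the gender value falls between the two integer codes

# One possible post-processing step (an assumption, not done in the notebook):
# round the encoded column back to the nearest valid integer code
synthetic_rounded = synthetic.copy()
synthetic_rounded[0] = np.rint(synthetic_rounded[0])
print(synthetic_rounded)
```

imbalanced-learn also provides `SMOTENC`, a variant that takes a `categorical_features` argument and avoids interpolating categorical columns; whether it suits this dataset would need to be checked.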


In [35]:
df_copy.to_csv("Data3-3_treated.csv")