# Attribute Information

1) id: unique identifier

2) gender: "Male", "Female" or "Other"

3) age: age of the patient

4) hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension

5) heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease

6) ever_married: "No" or "Yes"

7) work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"

8) Residence_type: "Rural" or "Urban"

9) avg_glucose_level: average glucose level in blood

10) bmi: body mass index

11) smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"*

12) stroke: 1 if the patient had a stroke or 0 if not

*Note: "Unknown" in smoking_status means that the information is unavailable for this patient.



## Import Libraries

In [1]:
import pandas as pd 
import numpy as np 
import seaborn as sns 
import matplotlib
import matplotlib.pyplot as plt 


## Read Data

In [2]:
df = pd.read_csv("/Users/sahil/Documents/Project/Data1/5k_stroke-data.csv")

### View Information

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


### View Shape and Sample

In [4]:
df.shape  # (rows, columns); not echoed because it is not the last expression in the cell
df.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


### Summary of Numerical Data

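`describe()` returns a statistical summary of each numerical column, which includes:

- Count of non-null values
- Mean
- Standard deviation
- Minimum value
- 25th, 50th (median) and 75th percentiles
- Maximum value

The minimum, the three quartiles and the maximum together form the five-number summary. The mode is not part of this output.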

In [5]:
df.describe()

Unnamed: 0,id,age,hypertension,heart_disease,avg_glucose_level,bmi,stroke
count,5110.0,5110.0,5110.0,5110.0,5110.0,4909.0,5110.0
mean,36517.829354,43.226614,0.097456,0.054012,106.147677,28.893237,0.048728
std,21161.721625,22.612647,0.296607,0.226063,45.28356,7.854067,0.21532
min,67.0,0.08,0.0,0.0,55.12,10.3,0.0
25%,17741.25,25.0,0.0,0.0,77.245,23.5,0.0
50%,36932.0,45.0,0.0,0.0,91.885,28.1,0.0
75%,54682.0,61.0,0.0,0.0,114.09,33.1,0.0
max,72940.0,82.0,1.0,1.0,271.74,97.6,1.0
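
On numerical columns, `describe()` reports the count, mean, standard deviation, minimum, quartiles and maximum, but not the mode. The sketch below shows one way to get the five-number summary and the mode separately. The series `toy` is made up for illustration and is not taken from the stroke dataset.

```python
import pandas as pd

# Toy values for illustration only; not drawn from the stroke dataset.
toy = pd.Series([1, 2, 2, 3, 4, 5, 9], name="toy")

# Five-number summary: minimum, lower quartile, median, upper quartile, maximum.
five_number = toy.quantile([0.0, 0.25, 0.5, 0.75, 1.0])
print(five_number)

# describe() adds count, mean and standard deviation, but not the mode.
print(toy.describe())

# The mode has to be requested explicitly; it returns a Series because ties are possible.
print(toy.mode())
```

The same calls should work on any numerical column of `df`, for example `df['age'].mode()`.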


In [6]:
# Columns that describe() skips are the non-numeric (categorical) ones
numeric_cols = df.describe().columns
categorical_cols = [col for col in df.columns if col not in numeric_cols]
print(categorical_cols)

['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']


In [7]:
df.gender.value_counts()

Female    2994
Male      2115
Other        1
Name: gender, dtype: int64

In [8]:
# Removing 'Other' since there is just one entry
df = df[df['gender'] != 'Other']
df.gender.value_counts()

Female    2994
Male      2115
Name: gender, dtype: int64

## Data Pre-processing for Data1

- Label Encoding of categorical variables
- Handling Missing/Null Values
- Oversampling of Stroke positive (stroke=1) entries


##### Label Encoding of all 5 categorical variables

In [9]:
# Label encoding of categorical variables in a new dataset, df_copy

df_copy = df.copy(deep=True)

from sklearn.preprocessing import LabelEncoder

# Use a distinct instance name so the LabelEncoder class is not shadowed
label_encoder = LabelEncoder()

# Refit the encoder on each categorical column; codes follow sorted label order
for col in ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']:
    df_copy[col] = label_encoder.fit_transform(df_copy[col])

In [10]:
df_copy.head(10)

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,1,67.0,0,1,1,2,1,228.69,36.6,1,1
1,51676,0,61.0,0,0,1,3,0,202.21,,2,1
2,31112,1,80.0,0,1,1,2,0,105.92,32.5,2,1
3,60182,0,49.0,0,0,1,2,1,171.23,34.4,3,1
4,1665,0,79.0,1,0,1,3,0,174.12,24.0,2,1
5,56669,1,81.0,0,0,1,2,1,186.21,29.0,1,1
6,53882,1,74.0,1,1,1,2,0,70.09,27.4,2,1
7,10434,0,69.0,0,0,0,2,1,94.39,22.8,2,1
8,27419,0,59.0,0,0,1,2,0,76.15,,0,1
9,60491,0,78.0,0,0,1,2,1,58.57,24.2,0,1
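
The integer codes in the table above are easier to read with the label-to-code mapping at hand. `LabelEncoder` assigns codes in sorted order of the labels, so the mapping can be reproduced without scikit-learn. The sketch below uses a pandas categorical as a stand-in (this is not what the notebook itself ran) on the `smoking_status` labels listed in the attribute description.

```python
import pandas as pd

# Category labels as listed in the attribute description of this notebook.
smoking = pd.Series(["formerly smoked", "never smoked", "smokes", "Unknown"])

# A pandas categorical sorts its unique labels; the position of a label
# in that sorted list is its integer code.
cat = smoking.astype("category")
mapping = {label: code for code, label in enumerate(cat.cat.categories)}
print(mapping)

# Codes for the original rows, in their original order.
print(cat.cat.codes.tolist())
```

Uppercase letters sort before lowercase ones, which is why "Unknown" receives code 0. With a fitted `LabelEncoder`, the same information is available from its `classes_` attribute.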


##### Handling Missing/Null Values

Find missing values using `isnull()`.


- Dropping rows with missing values is avoided, since it
    - may discard too much data and hurt the model
- Imputation options:
    - Mean/Median/Mode
    - End of distribution
    - Arbitrary value
    - Add a variable to denote NA
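
The imputation options listed above can be compared side by side on a small example. This is a minimal sketch with made-up values. The end-of-distribution rule (mean plus three standard deviations) and the sentinel `-1` are assumptions chosen for illustration, not choices made in this notebook.

```python
import numpy as np
import pandas as pd

# Toy BMI-like values with gaps; illustrative only, not taken from the dataset.
bmi = pd.Series([22.0, np.nan, 30.0, 26.0, np.nan, 42.0])

imputed = pd.DataFrame({
    "original": bmi,
    # Indicator variable: 1 where the value was missing, 0 otherwise.
    "is_NA": bmi.isnull().astype(int),
    # Central-tendency imputation.
    "mean": bmi.fillna(bmi.mean()),
    "median": bmi.fillna(bmi.median()),
    # End of distribution: here taken as mean + 3 standard deviations (an assumed convention).
    "end_of_dist": bmi.fillna(bmi.mean() + 3 * bmi.std()),
    # Arbitrary value: -1 is a hypothetical sentinel outside the valid BMI range.
    "arbitrary": bmi.fillna(-1),
})
print(imputed)
```

Keeping the indicator column next to the imputed one lets a model distinguish observed values from filled-in ones.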

In [12]:
missing = pd.concat([df_copy.isnull().sum(),df_copy.isnull().mean()],axis=1)
missing = missing.rename(index=str,columns={0:'total missing',1:'proportion'})
missing


Unnamed: 0,total missing,proportion
id,0,0.0
gender,0,0.0
age,0,0.0
hypertension,0,0.0
heart_disease,0,0.0
ever_married,0,0.0
work_type,0,0.0
Residence_type,0,0.0
avg_glucose_level,0,0.0
bmi,201,0.039342


In [13]:
# For each column with missing values, create `<col>_is_NA`: 1 if the value is missing, 0 otherwise

NA_col = missing[missing['total missing'] > 0].index.tolist()

for i in NA_col:
    df_copy[i + '_is_NA'] = np.where(df_copy[i].isnull(), 1, 0)

df_copy.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke,bmi_is_NA
0,9046,1,67.0,0,1,1,2,1,228.69,36.6,1,1,0
1,51676,0,61.0,0,0,1,3,0,202.21,,2,1,1
2,31112,1,80.0,0,1,1,2,0,105.92,32.5,2,1,0
3,60182,0,49.0,0,0,1,2,1,171.23,34.4,3,1,0
4,1665,0,79.0,1,0,1,3,0,174.12,24.0,2,1,0


In [14]:
# Impute missing values with the column mean, stored in a new `<col>_impute_mean` column
NA_col = missing[missing['total missing'] > 0].index.tolist()

for i in NA_col:
    df_copy[i + '_impute_mean'] = df_copy[i].fillna(df_copy[i].mean())

df_copy.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke,bmi_is_NA,bmi_impute_mean
0,9046,1,67.0,0,1,1,2,1,228.69,36.6,1,1,0,36.6
1,51676,0,61.0,0,0,1,3,0,202.21,,2,1,1,28.89456
2,31112,1,80.0,0,1,1,2,0,105.92,32.5,2,1,0,32.5
3,60182,0,49.0,0,0,1,2,1,171.23,34.4,3,1,0,34.4
4,1665,0,79.0,1,0,1,3,0,174.12,24.0,2,1,0,24.0


In [15]:
columns = df_copy.columns.tolist()
print(columns)

['id', 'gender', 'age', 'hypertension', 'heart_disease', 'ever_married', 'work_type', 'Residence_type', 'avg_glucose_level', 'bmi', 'smoking_status', 'stroke', 'bmi_is_NA', 'bmi_impute_mean']


In [16]:
df_copy = df_copy[['id', 'gender', 'age', 'hypertension', 'heart_disease', 'ever_married', 'work_type', 'Residence_type', 'avg_glucose_level', 'bmi_impute_mean', 'smoking_status', 'stroke']]
df_copy.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi_impute_mean,smoking_status,stroke
0,9046,1,67.0,0,1,1,2,1,228.69,36.6,1,1
1,51676,0,61.0,0,0,1,3,0,202.21,28.89456,2,1
2,31112,1,80.0,0,1,1,2,0,105.92,32.5,2,1
3,60182,0,49.0,0,0,1,2,1,171.23,34.4,3,1
4,1665,0,79.0,1,0,1,3,0,174.12,24.0,2,1


##### Oversampling of underrepresented phenomenon (Stroke = 1)

In [17]:
# Split data into features (X, every column except the last) and target (Y, the stroke column)
# Note: X still contains the id column at this point
X = df_copy.iloc[:,:-1].values
Y = df_copy.iloc[:, -1].values
print(X,Y)

[[9.04600000e+03 1.00000000e+00 6.70000000e+01 ... 2.28690000e+02
  3.66000000e+01 1.00000000e+00]
 [5.16760000e+04 0.00000000e+00 6.10000000e+01 ... 2.02210000e+02
  2.88945599e+01 2.00000000e+00]
 [3.11120000e+04 1.00000000e+00 8.00000000e+01 ... 1.05920000e+02
  3.25000000e+01 2.00000000e+00]
 ...
 [1.97230000e+04 0.00000000e+00 3.50000000e+01 ... 8.29900000e+01
  3.06000000e+01 2.00000000e+00]
 [3.75440000e+04 1.00000000e+00 5.10000000e+01 ... 1.66290000e+02
  2.56000000e+01 1.00000000e+00]
 [4.46790000e+04 0.00000000e+00 4.40000000e+01 ... 8.52800000e+01
  2.62000000e+01 0.00000000e+00]] [1 1 1 ... 0 0 0]


In [18]:
# Import the Counter class from the collections module to count stroke-positive and stroke-negative entries
from collections import Counter
counter = Counter(Y)
print(counter)

Counter({0: 4860, 1: 249})
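
The counts above can be turned into a class share and an imbalance ratio, which makes the case for oversampling explicit. A minimal sketch using only the numbers printed by the cell:

```python
from collections import Counter

# Class counts as printed by the notebook cell above.
counter = Counter({0: 4860, 1: 249})

total = sum(counter.values())
minority_share = counter[1] / total          # fraction of stroke = 1 entries
imbalance_ratio = counter[0] / counter[1]    # negatives per positive

print(f"stroke=1 share: {minority_share:.2%}")
print(f"negatives per positive: {imbalance_ratio:.1f}")
```

With fewer than 5% positive entries, a model that always predicts `stroke = 0` would already be right about 95% of the time, so accuracy alone says little on the unbalanced data.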


In [19]:
# Oversampling using SMOTE
from imblearn.over_sampling import SMOTE
oversample = SMOTE()
X, Y = oversample.fit_resample(X, Y)

In [20]:
# checking stroke positive and negative values again to see if SMOTE worked
counter = Counter(Y)
print(counter)

Counter({1: 4860, 0: 4860})


In [21]:
# Recombining the resampled features (X) and target (Y) into a single DataFrame
df_copy = pd.concat([pd.DataFrame(X), pd.DataFrame(Y)], axis=1)

In [22]:
# checking if new oversampled data has been added
df_copy.shape

(9720, 12)

In [23]:
df_copy.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,0.1
0,9046.0,1.0,67.0,0.0,1.0,1.0,2.0,1.0,228.69,36.6,1.0,1
1,51676.0,0.0,61.0,0.0,0.0,1.0,3.0,0.0,202.21,28.89456,2.0,1
2,31112.0,1.0,80.0,0.0,1.0,1.0,2.0,0.0,105.92,32.5,2.0,1
3,60182.0,0.0,49.0,0.0,0.0,1.0,2.0,1.0,171.23,34.4,3.0,1
4,1665.0,0.0,79.0,1.0,0.0,1.0,3.0,0.0,174.12,24.0,2.0,1
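
The integer column labels in the output above appear because the DataFrame is built from bare NumPy arrays, which carry no column names. One possible alternative is to pass the names when the DataFrame is constructed. In this sketch, `feature_names`, `X`, `Y` and `resampled` are small hypothetical stand-ins, not the notebook's actual resampled arrays.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the resampled arrays: two features and a target.
feature_names = ["age", "avg_glucose_level"]
X = np.array([[67.0, 228.69], [61.0, 202.21]])
Y = np.array([1, 1])

# Naming the columns at construction avoids integer labels and a later rename.
resampled = pd.DataFrame(X, columns=feature_names)
resampled["stroke"] = Y
print(resampled.columns.tolist())
```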


In [24]:
# renaming the columns

df_copy.columns = ['id', 'gender', 'age', 'hypertension', 'heart_disease', 'ever_married', 'work_type', 'Residence_type', 'avg_glucose_level', 'bmi_impute_mean', 'smoking_status', 'stroke']
df_copy.tail()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi_impute_mean,smoking_status,stroke
9715,14502.39165,1.0,72.638375,1.0,0.0,1.0,2.893604,0.106396,173.768363,36.226975,1.893604,1
9716,40533.923737,0.0,60.985301,0.0,0.0,1.0,2.630115,0.0,66.388418,29.266457,1.73977,1
9717,12562.03044,0.816022,70.023758,0.0,0.0,1.0,2.816022,1.0,81.483002,26.396132,3.0,1
9718,10575.354827,0.831143,64.31143,0.0,0.0,1.0,2.0,0.168857,94.81644,24.442054,1.0,1
9719,37904.15772,0.0,71.674873,0.444992,0.0,1.0,2.444992,0.444992,86.553375,27.310017,0.555008,1
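
The fractional values in label-encoded columns such as `gender` and `work_type` (and in `id`) follow from how SMOTE builds synthetic rows: each new row is a point on the line segment between a minority-class row and one of its nearest minority-class neighbours. The sketch below illustrates only that interpolation step with two hypothetical rows. It skips the nearest-neighbour search and is not the imbalanced-learn implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical minority-class rows: [gender_code, age].
a = np.array([0.0, 60.0])
b = np.array([1.0, 80.0])

# SMOTE-style synthetic point: a random position on the segment between a and b.
gap = rng.uniform(0.0, 1.0)
synthetic = a + gap * (b - a)
print(synthetic)

# Rounding is one possible (lossy) way to map the code back to a valid category.
print(np.rint(synthetic[0]))
```

Interpolating integer codes treats categories as if they were ordered numbers. For data that mixes categorical and continuous features, imbalanced-learn also provides `SMOTENC`, which takes the categorical column indices through its `categorical_features` parameter. Whether it suits this dataset is left open here.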


In [25]:
df_copy.to_csv("Data2_treated.csv")  # also writes the row index as an unnamed first column