https://towardsdatascience.com/step-by-step-exploratory-data-analysis-on-stroke-dataset-840aefea8739

In [2]:
import pandas as pd
import matplotlib.pyplot as plot
import seaborn as sns
import numpy as np

In [3]:
df = pd.read_csv("data/stroke.csv", encoding="utf-8")
df2 = pd.read_csv('data/train_strokes.csv', encoding='utf-8')
df.head()

Unnamed: 0,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB


In [5]:
df['stroke'].value_counts()

0    4861
1     249
Name: stroke, dtype: int64

This dataset is clearly imbalanced: only 249 patients have had a stroke, while 4,861 have not.
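
To put a number on the imbalance, the class counts can be turned into shares. The sketch below rebuilds the label column from the counts printed above (4861 and 249) instead of reading the CSV, so it runs on its own; on the real data the same call would be made on df['stroke'].

```python
import pandas as pd

# Stand-in for df['stroke'], rebuilt from the value_counts() output above
stroke = pd.Series([0] * 4861 + [1] * 249, name="stroke")

# normalize=True returns class shares instead of raw counts
shares = stroke.value_counts(normalize=True)
positive_share = shares.loc[1]

print(f"stroke=1 share: {positive_share:.2%}")
```

With under 5% positives, plain accuracy would be a misleading metric for any classifier trained on this data.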

# Data preprocessing

## 1. ID attribute
This column only identifies patients and carries no predictive meaning, so we drop it.

In [6]:
df.drop(columns=['id'], inplace=True)

## 2. BMI attribute
40 patients with no recorded BMI have had a stroke, a sizeable share of the 249 stroke patients above. Rather than dropping these rows, we fill the NaN values with the mean BMI.

In [7]:
# Parenthesize the comparison: & binds more tightly than ==
df[df['bmi'].isna() & (df['stroke'] == 1)]

Unnamed: 0,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
1,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
8,Female,59.0,0,0,Yes,Private,Rural,76.15,,Unknown,1
13,Male,78.0,0,1,Yes,Private,Urban,219.84,,Unknown,1
19,Male,57.0,0,1,No,Govt_job,Urban,217.08,,Unknown,1
27,Male,58.0,0,0,Yes,Private,Rural,189.84,,Unknown,1
29,Male,59.0,0,0,Yes,Private,Rural,211.78,,formerly smoked,1
43,Female,63.0,0,0,Yes,Private,Urban,90.9,,formerly smoked,1
46,Female,75.0,0,1,No,Self-employed,Urban,109.78,,Unknown,1
50,Female,76.0,0,0,No,Private,Urban,89.96,,Unknown,1
51,Male,78.0,1,0,Yes,Private,Urban,75.32,,formerly smoked,1
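
The reasoning above (40 of 249 stroke patients, roughly 16%, have no BMI) can be cross-checked by tabulating missingness against the label. This is a minimal sketch on a toy frame; the values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "bmi":    [36.6, np.nan, 32.5, np.nan, 24.0, np.nan],
    "stroke": [1,    1,      1,    0,      0,    1],
})

# Rows: is bmi missing (False/True); columns: stroke label (0/1)
missing_by_class = pd.crosstab(toy["bmi"].isna(), toy["stroke"])

# Each comparison is parenthesized because & binds more tightly than ==
n_missing_stroke = int((toy["bmi"].isna() & (toy["stroke"] == 1)).sum())
share_of_stroke = n_missing_stroke / int((toy["stroke"] == 1).sum())

print(missing_by_class)
print(f"share of stroke patients without bmi: {share_of_stroke:.1%}")
```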


In [16]:
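# Assign back instead of using inplace=True on a column selection,
# which is deprecated in recent pandas versions
df['bmi'] = df['bmi'].fillna(np.round(df['bmi'].mean(), 1))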
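
As a self-contained illustration of the mean imputation step, here is a sketch on a toy column (the values are illustrative only). It assumes the mean of the observed values is an acceptable fill; the median is a common alternative when the distribution is skewed.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value
toy = pd.DataFrame({"bmi": [36.6, np.nan, 32.5, 34.4, 24.0]})

# mean() skips NaN by default, so only observed values contribute
fill_value = round(toy["bmi"].mean(), 1)

# Assign the filled column back to the frame
toy["bmi"] = toy["bmi"].fillna(fill_value)

print(toy)
```

Note that filling every gap with one constant shrinks the variance of the column, which is worth keeping in mind when interpreting later plots.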

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   gender             5110 non-null   object 
 1   age                5110 non-null   float64
 2   hypertension       5110 non-null   int64  
 3   heart_disease      5110 non-null   int64  
 4   ever_married       5110 non-null   object 
 5   work_type          5110 non-null   object 
 6   Residence_type     5110 non-null   object 
 7   avg_glucose_level  5110 non-null   float64
 8   bmi                5110 non-null   float64
 9   smoking_status     5110 non-null   object 
 10  stroke             5110 non-null   int64  
dtypes: float64(3), int64(3), object(5)
memory usage: 439.3+ KB


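## 3. Smoking Status attribute
The referenced article reports that 13,292 records, about 30.6% of its dataset, had missing values in the smoking status feature (Fig. 1 in the article). Because that is a large proportion, the article created a new category named “not known” for those records rather than dropping them. In the file used here the same role is played by the existing “Unknown” category, which covers 1,544 of 5,110 records (about 30.2%), so no further handling is needed.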
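
The idea of keeping records with a missing smoking status under their own category can be sketched as follows. The toy values are illustrative, and the label "Unknown" is chosen to match the category name that appears in this file.

```python
import numpy as np
import pandas as pd

# Toy column with two missing entries
toy = pd.DataFrame({"smoking_status": ["smokes", np.nan, "never smoked", np.nan]})

# Give missing entries an explicit category instead of dropping the rows
toy["smoking_status"] = toy["smoking_status"].fillna("Unknown")

counts = toy["smoking_status"].value_counts()
print(counts)
```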

In [12]:
print(df['smoking_status'].value_counts())

never smoked       1892
Unknown            1544
formerly smoked     885
smokes              789
Name: smoking_status, dtype: int64


## 4. Gender attribute

In [19]:
print(df['gender'].value_counts())

Female    2994
Male      2115
Other        1
Name: gender, dtype: int64


In [20]:
# Take an explicit copy so that adding columns later does not raise SettingWithCopyWarning
df = df[df['gender'] != 'Other'].copy()
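
A boolean filter returns a new object that pandas may still treat as derived from the original frame, and assigning columns to it can trigger SettingWithCopyWarning in pandas versions without copy-on-write. A minimal sketch of the usual remedy, on made-up toy data:

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Other"],
    "age": [61.0, 80.0, 26.0],
})

# .copy() makes the filtered result an independent DataFrame
filtered = toy[toy["gender"] != "Other"].copy()

# Adding a column now affects only the filtered frame
age_range = filtered["age"].max() - filtered["age"].min()
filtered["age_norm"] = (filtered["age"] - filtered["age"].min()) / age_range

print(filtered)
```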

## 5. Normalize numerical attributes
This dataset has three numerical attributes: age, average glucose level, and BMI. Let’s min-max normalize them to the range [0, 1] so that they carry comparable weight when building a classifier.

In [21]:
# Create a new column for normalized age
df['age_norm']=(df['age']-df['age'].min())/(df['age'].max()-df['age'].min())

# Create a new column for normalized avg glucose level
df['avg_glucose_level_norm']=(df['avg_glucose_level']-df['avg_glucose_level'].min())/(df['avg_glucose_level'].max()-df['avg_glucose_level'].min())

# Create a new column for normalized bmi
df['bmi_norm']=(df['bmi']-df['bmi'].min())/(df['bmi'].max()-df['bmi'].min())
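
The three assignments above all apply the same min-max formula, (x - min) / (max - min). As a sketch, the formula can be factored into a helper; the name min_max_normalize and the toy values are hypothetical and not part of the original notebook. The helper also guards against a constant column, which would otherwise divide by zero.

```python
import pandas as pd


def min_max_normalize(values: pd.Series) -> pd.Series:
    """Rescale a numeric Series to [0, 1]; a constant Series maps to all zeros."""
    lo, hi = values.min(), values.max()
    span = hi - lo
    if span == 0:
        return pd.Series(0.0, index=values.index, name=values.name)
    return (values - lo) / span


# Toy data, made up for illustration only
toy = pd.DataFrame({"age": [20.0, 40.0, 80.0], "bmi": [18.0, 24.0, 30.0]})

for col in ["age", "bmi"]:
    toy[f"{col}_norm"] = min_max_normalize(toy[col])

print(toy)
```

If the data is later split into training and test sets, the minimum and maximum would normally be taken from the training portion only.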


In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5109 entries, 0 to 5109
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   gender                  5109 non-null   object 
 1   age                     5109 non-null   float64
 2   hypertension            5109 non-null   int64  
 3   heart_disease           5109 non-null   int64  
 4   ever_married            5109 non-null   object 
 5   work_type               5109 non-null   object 
 6   Residence_type          5109 non-null   object 
 7   avg_glucose_level       5109 non-null   float64
 8   bmi                     5109 non-null   float64
 9   smoking_status          5109 non-null   object 
 10  stroke                  5109 non-null   int64  
 11  age_norm                5109 non-null   float64
 12  avg_glucose_level_norm  5109 non-null   float64
 13  bmi_norm                5109 non-null   float64
dtypes: float64(6), int64(3), object(5)
memor