# Data Pre-processing

In this notebook, we walk through the data pre-processing phase, which must be completed before modeling. This phase typically involves handling missing values, creating dummy variables for categorical data, and normalizing or standardizing the numerical values.
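
As a rough illustration of these three steps, here is a minimal sketch on a small hand-made frame. The frame `toy`, its four rows, and the use of the population standard deviation (`ddof=0`) are assumptions made for the example, not part of this notebook's actual pipeline.

```python
import pandas as pd

# Hypothetical toy frame; the column names mirror the dataset, the rows are illustrative.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female"],
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, float("nan"), 32.5, 34.4],
})

# 1. Missing values: drop every row that contains a null.
clean = toy.dropna(how="any")

# 2. Dummy variables: one 0/1 column per category level.
encoded = pd.get_dummies(clean, columns=["gender"], dtype=int)

# 3. Scaling the numerical columns.
nums = encoded[["age", "bmi"]]
standardized = (nums - nums.mean()) / nums.std(ddof=0)        # zero mean, unit variance
normalized = (nums - nums.min()) / (nums.max() - nums.min())  # rescaled to [0, 1]

print(encoded)
print(standardized)
print(normalized)
```

Standardization and min-max normalization are alternatives; which one fits better depends on the model used later.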

## Reading the Data

In [1]:
import pandas as pd

data = pd.read_csv("project_4_data.csv")
data.head(5)

,id,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
0,9046,Male,67.0,0,1,Yes,Private,Urban,228.69,36.6,formerly smoked,1
1,51676,Female,61.0,0,0,Yes,Self-employed,Rural,202.21,,never smoked,1
2,31112,Male,80.0,0,1,Yes,Private,Rural,105.92,32.5,never smoked,1
3,60182,Female,49.0,0,0,Yes,Private,Urban,171.23,34.4,smokes,1
4,1665,Female,79.0,1,0,Yes,Self-employed,Rural,174.12,24.0,never smoked,1


In [2]:
data.describe()

,id,age,hypertension,heart_disease,avg_glucose_level,bmi,stroke
count,5110.0,5110.0,5110.0,5110.0,5110.0,4909.0,5110.0
mean,36517.829354,43.226614,0.097456,0.054012,106.147677,28.893237,0.048728
std,21161.721625,22.612647,0.296607,0.226063,45.28356,7.854067,0.21532
min,67.0,0.08,0.0,0.0,55.12,10.3,0.0
25%,17741.25,25.0,0.0,0.0,77.245,23.5,0.0
50%,36932.0,45.0,0.0,0.0,91.885,28.1,0.0
75%,54682.0,61.0,0.0,0.0,114.09,33.1,0.0
max,72940.0,82.0,1.0,1.0,271.74,97.6,1.0


In [4]:
data.isnull().sum()

id                     0
gender                 0
age                    0
hypertension           0
heart_disease          0
ever_married           0
work_type              0
Residence_type         0
avg_glucose_level      0
bmi                  201
smoking_status         0
stroke                 0
dtype: int64

In [5]:
# bmi is the only column with nulls; drop the 201 rows where it is missing
data = data.dropna(how='any')

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4909 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 4909 non-null   int64  
 1   gender             4909 non-null   object 
 2   age                4909 non-null   float64
 3   hypertension       4909 non-null   int64  
 4   heart_disease      4909 non-null   int64  
 5   ever_married       4909 non-null   object 
 6   work_type          4909 non-null   object 
 7   Residence_type     4909 non-null   object 
 8   avg_glucose_level  4909 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     4909 non-null   object 
 11  stroke             4909 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 498.6+ KB


In [7]:
data.id.nunique()

4909

In [8]:
# Every id is unique, so the column is only a row identifier; drop it and renumber the index
data = data.drop(["id"], axis=1).reset_index(drop=True)

In [9]:
# Remove the single row whose gender is 'Other'
data = data.drop(data[data['gender'] == 'Other'].index)

In [15]:
# Preview all rows in random order; the result is not assigned, so data itself is unchanged
data.sample(frac=1)

,gender,age,hypertension,heart_disease,ever_married,work_type,Residence_type,avg_glucose_level,bmi,smoking_status,stroke
2057,Male,28.0,0,0,Yes,Private,Rural,169.49,27.2,Unknown,0
3282,Female,59.0,0,0,Yes,Self-employed,Urban,90.06,28.9,smokes,0
1135,Female,79.0,0,1,Yes,Private,Urban,68.40,22.1,formerly smoked,0
2595,Female,17.0,0,0,No,Never_worked,Rural,88.57,31.1,never smoked,0
4277,Female,10.0,0,0,No,children,Rural,83.03,18.5,Unknown,0
...,...,...,...,...,...,...,...,...,...,...,...
190,Female,80.0,1,0,No,Private,Urban,66.03,35.4,never smoked,1
1105,Male,62.0,1,0,Yes,Private,Urban,211.49,41.1,Unknown,0
3091,Female,25.0,0,0,No,Private,Rural,111.65,35.2,formerly smoked,0
604,Female,54.0,0,0,Yes,Self-employed,Urban,92.39,22.1,never smoked,0


In [11]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4908 entries, 0 to 4908
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   gender             4908 non-null   object 
 1   age                4908 non-null   float64
 2   hypertension       4908 non-null   int64  
 3   heart_disease      4908 non-null   int64  
 4   ever_married       4908 non-null   object 
 5   work_type          4908 non-null   object 
 6   Residence_type     4908 non-null   object 
 7   avg_glucose_level  4908 non-null   float64
 8   bmi                4908 non-null   float64
 9   smoking_status     4908 non-null   object 
 10  stroke             4908 non-null   int64  
dtypes: float64(3), int64(3), object(5)
memory usage: 460.1+ KB


In [10]:
# Count the rows in each class of the target column
class_counts = data.stroke.value_counts()
num_ones = class_counts[1]
num_zero = class_counts[0]

In [11]:
print(f"Number of ones in stroke column : {num_ones}")
print(f"Number of zero in stroke column : {num_zero}")

Number of ones in stroke column : 209
Number of zero in stroke column : 4699


As we can see, the data is imbalanced. Only 209 of the 4908 patients had a stroke, which means only 4.26% of the data is in class **"1"** and 95.74% is in class **"0"**.
When we move on to modeling, we must account for this class imbalance; otherwise a model could reach about 95.74% accuracy simply by always predicting class **"0"**.
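
To make the class shares concrete, the sketch below rebuilds a `stroke` column from the two counts reported above and computes the class proportions plus one possible set of class weights. The weighting formula `n_samples / (n_classes * class_count)` is the heuristic behind scikit-learn's `class_weight="balanced"` option; using it here is an assumption for illustration, not a choice made in this notebook.

```python
import pandas as pd

# Rebuild the target column from the reported counts: 209 ones and 4699 zeros.
stroke = pd.Series([1] * 209 + [0] * 4699, name="stroke")

counts = stroke.value_counts()                 # rows per class
shares = stroke.value_counts(normalize=True)   # fraction of rows per class

# "Balanced" heuristic: rarer classes receive proportionally larger weights.
weights = len(stroke) / (counts.size * counts)

print(shares.round(4).to_dict())
print(weights.round(2).to_dict())
```

With these counts, class **"1"** is weighted about 22 times more heavily than class **"0"**, which is the ratio 4699 / 209.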

## Exporting the Processed Data

In [12]:
# index=False leaves the row index out of the file, so only the 11 data columns are written
data.to_csv("processed_data.csv", index=False)