## Data Cleaning & Pre-processing
![data-cleaning-in-python](https://daxg39y63pxwu.cloudfront.net/images/blog/data-cleaning-in-python/data-cleaning-in-python.png)

The first step of an analytics project is to clean the datasets and pre-process them so they are suitable for analytical models and visualization.


In [1]:
# import basic libraries
import pandas as pd
import numpy as np

### 1. Import dataset into the notebook

In [2]:
# heart_pki_original_df -> personal_key_indicators_dataframe
heart_pki_original_df = pd.read_csv('datasets/heart_pki_2020_original.csv')
heart_pki_original_df.head()

Unnamed: 0,HeartDisease,BMI,Smoking,AlcoholDrinking,Stroke,PhysicalHealth,MentalHealth,DiffWalking,Sex,AgeCategory,Race,Diabetic,PhysicalActivity,GenHealth,SleepTime,Asthma,KidneyDisease,SkinCancer
0,No,16.6,Yes,No,No,3.0,30.0,No,Female,55-59,White,Yes,Yes,Very good,5.0,Yes,No,Yes
1,No,20.34,No,No,Yes,0.0,0.0,No,Female,80 or older,White,No,Yes,Very good,7.0,No,No,No
2,No,26.58,Yes,No,No,20.0,30.0,No,Male,65-69,White,Yes,Yes,Fair,8.0,Yes,No,No
3,No,24.21,No,No,No,0.0,0.0,No,Female,75-79,White,No,No,Good,6.0,No,No,Yes
4,No,23.71,No,No,No,28.0,0.0,Yes,Female,40-44,White,No,Yes,Very good,8.0,No,No,No


In [3]:
# heart_attack_df
heart_attack_original_df = pd.read_csv('datasets/heart_attack_original.csv')
heart_attack_original_df.head()

Unnamed: 0,age,sex,cp,trtbps,chol,fbs,restecg,thalachh,exng,oldpeak,slp,caa,thall,output
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [4]:
# o2Saturation_original_df
o2Saturation_original_df = pd.read_csv('datasets/o2Saturation_original.csv').rename(columns={'98.6': 'o2_saturation'})
o2Saturation_original_df.head()

Unnamed: 0,o2_saturation
0,98.6
1,98.6
2,98.6
3,98.1
4,97.5


### 2. Data Cleaning & Pre-processing (heart_pki_original_df)

In [5]:
heart_pki_original_df = pd.read_csv('datasets/heart_pki_2020_original.csv')
heart_pki_original_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 319795 entries, 0 to 319794
Data columns (total 18 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   HeartDisease      319795 non-null  object 
 1   BMI               319795 non-null  float64
 2   Smoking           319795 non-null  object 
 3   AlcoholDrinking   319795 non-null  object 
 4   Stroke            319795 non-null  object 
 5   PhysicalHealth    319795 non-null  float64
 6   MentalHealth      319795 non-null  float64
 7   DiffWalking       319795 non-null  object 
 8   Sex               319795 non-null  object 
 9   AgeCategory       319795 non-null  object 
 10  Race              319795 non-null  object 
 11  Diabetic          319795 non-null  object 
 12  PhysicalActivity  319795 non-null  object 
 13  GenHealth         319795 non-null  object 
 14  SleepTime         319795 non-null  float64
 15  Asthma            319795 non-null  object 
 16  KidneyDisease     31

#### **2.1 Removing rows with NA values (if they exist)**

In [6]:
print('Number of NA entries: ', heart_pki_original_df.isna().sum().sum())

Number of NA entries:  0


- there are no NA entries, so there are no rows with NA to be dropped :)
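
Had the check reported missing values, the affected rows could be dropped with `dropna`. The following is a minimal sketch on a small made-up frame (`toy_df` and its values are illustrative and not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a dataset with gaps (values are made up)
toy_df = pd.DataFrame({
    'BMI': [16.6, np.nan, 26.58],
    'SleepTime': [5.0, 7.0, np.nan],
})

print('Number of NA entries:', toy_df.isna().sum().sum())

# Drop every row that has at least one NA value
toy_clean_df = toy_df.dropna(axis=0, how='any').reset_index(drop=True)
print('Rows remaining:', len(toy_clean_df))
```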

#### **2.2 Cleaning up continuous variables**

In [7]:
heart_pki_original_df.describe()

Unnamed: 0,BMI,PhysicalHealth,MentalHealth,SleepTime
count,319795.0,319795.0,319795.0,319795.0
mean,28.325399,3.37171,3.898366,7.097075
std,6.3561,7.95085,7.955235,1.436007
min,12.02,0.0,0.0,1.0
25%,24.03,0.0,0.0,6.0
50%,27.34,0.0,0.0,7.0
75%,31.42,2.0,3.0,8.0
max,94.85,30.0,30.0,24.0


- **`PhysicalHealth`** and **`MentalHealth`** contain '0's.
- Those '0's are valid because of the nature of the variable **(no. of days)**
  - **`PhysicalHealth`**: Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good?
  - **`MentalHealth`**: Thinking about your mental health, for how many days during the past 30 days was your mental health not good?


Conclusion:
- no need to do anything to those '0' values
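
Since both variables count days out of the past 30, a quick range check can confirm that every value lies between 0 and 30. This is only a sketch on made-up values (`toy_df` is hypothetical); the same expression could be pointed at the real columns:

```python
import pandas as pd

# Toy values standing in for the two day-count columns (made up)
toy_df = pd.DataFrame({
    'PhysicalHealth': [3.0, 0.0, 20.0, 30.0],
    'MentalHealth': [30.0, 0.0, 0.0, 5.0],
})

# A day count over the past 30 days must lie in [0, 30] (bounds inclusive)
in_range = toy_df[['PhysicalHealth', 'MentalHealth']].apply(
    lambda col: col.between(0, 30)
)
print('All values within 0-30:', bool(in_range.all().all()))
```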

**2.2 b) Removing outliers - `SleepTime`**

In [8]:
max(heart_pki_original_df['SleepTime'])

24.0

- the maximum `SleepTime` is 24 hours, which is not plausible
- nobody sleeps for the entire day, every day
- so we remove the rows whose `SleepTime` is an outlier

In [9]:
# 3-sigma range (µ - 3σ, µ + 3σ), used here as the outlier cut-off
sleep_mean = np.mean(heart_pki_original_df['SleepTime'])
sleep_std = np.std(heart_pki_original_df['SleepTime'])
conf_interval = (sleep_mean - 3 * sleep_std, sleep_mean + 3 * sleep_std)
print('Confidence Interval:', conf_interval)

Confidence Interval: (2.7890602411750294, 11.405089135769575)


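- `SleepTime` values outside the interval printed above (roughly 2.79 to 11.41 hours), i.e. more than 3 standard deviations from the mean, are treated as outliers; for normally distributed data only about 0.3% of values fall outside such a range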
- remove rows that contain outlier `SleepTime`

In [10]:
numofRowsBefore = len(heart_pki_original_df)
print('Number of rows before removing outliers:', numofRowsBefore)

(lower, upper) = conf_interval
heart_pki_original_df.drop(heart_pki_original_df[heart_pki_original_df['SleepTime'] < lower].index, inplace=True)
heart_pki_original_df.drop(heart_pki_original_df[heart_pki_original_df['SleepTime'] > upper].index, inplace=True)

numofRowsAfter = len(heart_pki_original_df)
print('Number of rows after removing outliers:', numofRowsAfter)
print('Number of rows dropped:', numofRowsBefore-numofRowsAfter)

Number of rows before removing outliers: 319795
Number of rows after removing outliers: 315252
Number of rows dropped: 4543
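
The two `drop` calls above can also be expressed as a single boolean mask. The sketch below wraps the 3-sigma bounds in a helper and filters a made-up column; `three_sigma_bounds` and `toy_df` are hypothetical names, and the population standard deviation (`ddof=0`, the `np.std` default) is assumed, as in the notebook:

```python
import numpy as np
import pandas as pd


def three_sigma_bounds(values):
    """Return (mean - 3*std, mean + 3*std) using the population std (ddof=0)."""
    arr = np.asarray(values, dtype=float)
    mean = np.mean(arr)
    std = np.std(arr)  # np.std defaults to ddof=0
    return mean - 3 * std, mean + 3 * std


# Toy sleep times: twenty 7-hour entries plus one implausible 24-hour entry
toy_df = pd.DataFrame({'SleepTime': [7.0] * 20 + [24.0]})

lower, upper = three_sigma_bounds(toy_df['SleepTime'])

# Keep only the rows inside the range; this builds a new frame
keep = toy_df['SleepTime'].between(lower, upper)
toy_filtered_df = toy_df[keep].reset_index(drop=True)

print('Rows dropped:', len(toy_df) - len(toy_filtered_df))
```

Filtering into a new frame leaves the original untouched, which makes it easier to re-run the cell without dropping rows twice.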


#### **2.3 Cleaning up categorical variables**

In [11]:
for feature in heart_pki_original_df.columns:
    # skip numeric columns; only object (string) columns are categorical here
    if heart_pki_original_df[feature].dtype != 'object':
        continue
    print(heart_pki_original_df[feature].value_counts(), end='\n\n')

No     288651
Yes     26601
Name: HeartDisease, dtype: int64

No     185635
Yes    129617
Name: Smoking, dtype: int64

No     293753
Yes     21499
Name: AlcoholDrinking, dtype: int64

No     303658
Yes     11594
Name: Stroke, dtype: int64

No     272545
Yes     42707
Name: DiffWalking, dtype: int64

Female    165384
Male      149868
Name: Sex, dtype: int64

65-69          33690
60-64          33186
70-74          30612
55-59          29332
50-54          25042
80 or older    23531
45-49          21528
75-79          21060
18-24          20798
40-44          20768
35-39          20334
30-34          18584
25-29          16787
Name: AgeCategory, dtype: int64

White                             242309
Hispanic                           27022
Black                              22151
Other                              10725
Asian                               7981
American Indian/Alaskan Native      5064
Name: Race, dtype: int64

No                         266353
Yes                         

- all the categorical variables are consistent in their values
- there are no missing values

Therefore, no data cleaning is needed for the categorical variables. However, they must be one-hot encoded into numeric form so that analytical models can operate on them.
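
One-hot encoding turns each category of a variable into its own 0/1 column. As a small illustration, the sketch below uses pandas `get_dummies` on made-up rows; this is a swapped-in shortcut for demonstration, not the scikit-learn `OneHotEncoder` that this notebook uses, and `toy_df` is hypothetical:

```python
import pandas as pd

# Toy frame with one binary and one multi-level categorical column (made up)
toy_df = pd.DataFrame({
    'Smoking': ['Yes', 'No', 'Yes'],
    'GenHealth': ['Very good', 'Fair', 'Good'],
})

# One indicator column per (variable, category) pair
toy_ohe_df = pd.get_dummies(toy_df, columns=['Smoking', 'GenHealth'], dtype=float)
print(list(toy_ohe_df.columns))
```

Each row ends up with exactly one `1.0` per original variable, so binary variables such as `Smoking` produce two columns that mirror each other.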

#### **2.4 Encoding nominal (unordered) categorical variables using `OneHotEncoding` for predictors & `Integer Encoding` for response**
The `heart_pki_original_df` dataset contains 13 categorical predictor variables:
- Smoking
- AlcoholDrinking
- Stroke
- DiffWalking
- Sex
- AgeCategory
- Race
- Diabetic
- PhysicalActivity
- GenHealth
- Asthma
- KidneyDisease
- SkinCancer

And 1 categorical response variable:
- HeartDisease

**2.4 a) OneHotEncoding**

In [12]:
# Import the OneHotEncoder from sklearn
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()

# OneHotEncoding of categorical predictors
cat_variables = [
    'Smoking', 'AlcoholDrinking', 'Stroke', 'DiffWalking',
    'Sex', 'AgeCategory', 'Race', 'Diabetic', 'PhysicalActivity',
    'GenHealth', 'Asthma', 'KidneyDisease', 'SkinCancer'
]

heart_pki_cat = heart_pki_original_df[cat_variables]

ohe.fit(heart_pki_cat)
heart_pki_cat_ohe = pd.DataFrame(
    ohe.transform(heart_pki_cat).toarray(),
    columns=ohe.get_feature_names_out(heart_pki_cat.columns))

# Check the encoded variables
heart_pki_cat_ohe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 315252 entries, 0 to 315251
Data columns (total 46 columns):
 #   Column                               Non-Null Count   Dtype  
---  ------                               --------------   -----  
 0   Smoking_No                           315252 non-null  float64
 1   Smoking_Yes                          315252 non-null  float64
 2   AlcoholDrinking_No                   315252 non-null  float64
 3   AlcoholDrinking_Yes                  315252 non-null  float64
 4   Stroke_No                            315252 non-null  float64
 5   Stroke_Yes                           315252 non-null  float64
 6   DiffWalking_No                       315252 non-null  float64
 7   DiffWalking_Yes                      315252 non-null  float64
 8   Sex_Female                           315252 non-null  float64
 9   Sex_Male                             315252 non-null  float64
 10  AgeCategory_18-24                    315252 non-null  float64
 11  AgeCategory_2

In [13]:
# preview the encoded dataframe
heart_pki_cat_ohe

Unnamed: 0,Smoking_No,Smoking_Yes,AlcoholDrinking_No,AlcoholDrinking_Yes,Stroke_No,Stroke_Yes,DiffWalking_No,DiffWalking_Yes,Sex_Female,Sex_Male,...,GenHealth_Fair,GenHealth_Good,GenHealth_Poor,GenHealth_Very good,Asthma_No,Asthma_Yes,KidneyDisease_No,KidneyDisease_Yes,SkinCancer_No,SkinCancer_Yes
0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0
1,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0
2,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0
3,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,1.0,0.0,...,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
315247,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
315248,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0
315249,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0
315250,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0


**2.4 b) Combine encoded dataframe with continuous variables**

In [14]:
# every column that was not one-hot encoded: the response plus the continuous variables
num_variable = [col for col in heart_pki_original_df.columns if col not in cat_variables]
num_variable

['HeartDisease', 'BMI', 'PhysicalHealth', 'MentalHealth', 'SleepTime']

In [15]:
# Combining Numeric features with the OHE Categorical features
heart_pki_num = heart_pki_original_df[num_variable]
# reset both indexes so rows pair up by position (outlier rows were dropped earlier)
heart_pki_ohe_df = pd.concat(
    [heart_pki_num.reset_index(drop=True), heart_pki_cat_ohe.reset_index(drop=True)],
    sort=False,
    axis=1)

# Check the final dataframe
heart_pki_ohe_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 315252 entries, 0 to 315251
Data columns (total 51 columns):
 #   Column                               Non-Null Count   Dtype  
---  ------                               --------------   -----  
 0   HeartDisease                         315252 non-null  object 
 1   BMI                                  315252 non-null  float64
 2   PhysicalHealth                       315252 non-null  float64
 3   MentalHealth                         315252 non-null  float64
 4   SleepTime                            315252 non-null  float64
 5   Smoking_No                           315252 non-null  float64
 6   Smoking_Yes                          315252 non-null  float64
 7   AlcoholDrinking_No                   315252 non-null  float64
 8   AlcoholDrinking_Yes                  315252 non-null  float64
 9   Stroke_No                            315252 non-null  float64
 10  Stroke_Yes                           315252 non-null  float64
 11  DiffWalking_N

In [16]:
# ALL of the columns
for col in heart_pki_ohe_df.columns:
    print(col, end=',  ')

HeartDisease,  BMI,  PhysicalHealth,  MentalHealth,  SleepTime,  Smoking_No,  Smoking_Yes,  AlcoholDrinking_No,  AlcoholDrinking_Yes,  Stroke_No,  Stroke_Yes,  DiffWalking_No,  DiffWalking_Yes,  Sex_Female,  Sex_Male,  AgeCategory_18-24,  AgeCategory_25-29,  AgeCategory_30-34,  AgeCategory_35-39,  AgeCategory_40-44,  AgeCategory_45-49,  AgeCategory_50-54,  AgeCategory_55-59,  AgeCategory_60-64,  AgeCategory_65-69,  AgeCategory_70-74,  AgeCategory_75-79,  AgeCategory_80 or older,  Race_American Indian/Alaskan Native,  Race_Asian,  Race_Black,  Race_Hispanic,  Race_Other,  Race_White,  Diabetic_No,  Diabetic_No, borderline diabetes,  Diabetic_Yes,  Diabetic_Yes (during pregnancy),  PhysicalActivity_No,  PhysicalActivity_Yes,  GenHealth_Excellent,  GenHealth_Fair,  GenHealth_Good,  GenHealth_Poor,  GenHealth_Very good,  Asthma_No,  Asthma_Yes,  KidneyDisease_No,  KidneyDisease_Yes,  SkinCancer_No,  SkinCancer_Yes,  
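
The `reset_index(drop=True)` calls in the concatenation matter because `pd.concat` with `axis=1` aligns rows by index label, and the index of the filtered dataframe has gaps where outlier rows were removed. A minimal sketch of the difference, on made-up frames (`left_df` and `right_df` are hypothetical):

```python
import pandas as pd

# Toy frame whose index has a gap, as happens after rows are dropped
left_df = pd.DataFrame({'BMI': [16.6, 20.34, 26.58]}, index=[0, 1, 3])

# Freshly built frame with a default 0..n-1 index, like the encoder output
right_df = pd.DataFrame({'Smoking_Yes': [1.0, 0.0, 1.0]})

# Without resetting, concat aligns on index labels and introduces NaN
misaligned_df = pd.concat([left_df, right_df], axis=1)

# Resetting both indexes pairs the rows by position instead
aligned_df = pd.concat(
    [left_df.reset_index(drop=True), right_df.reset_index(drop=True)],
    axis=1,
)

print('Rows without reset:', len(misaligned_df))
print('Rows with reset:', len(aligned_df))
```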

In [17]:
heart_pki_ohe_df

Unnamed: 0,HeartDisease,BMI,PhysicalHealth,MentalHealth,SleepTime,Smoking_No,Smoking_Yes,AlcoholDrinking_No,AlcoholDrinking_Yes,Stroke_No,...,GenHealth_Fair,GenHealth_Good,GenHealth_Poor,GenHealth_Very good,Asthma_No,Asthma_Yes,KidneyDisease_No,KidneyDisease_Yes,SkinCancer_No,SkinCancer_Yes
0,No,16.60,3.0,30.0,5.0,0.0,1.0,1.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0
1,No,20.34,0.0,0.0,7.0,1.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0
2,No,26.58,20.0,30.0,8.0,0.0,1.0,1.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0
3,No,24.21,0.0,0.0,6.0,1.0,0.0,1.0,0.0,1.0,...,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,No,23.71,28.0,0.0,8.0,1.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
315247,No,22.22,0.0,0.0,8.0,1.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
315248,Yes,27.41,7.0,0.0,6.0,0.0,1.0,1.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0
315249,No,29.84,0.0,0.0,5.0,0.0,1.0,1.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0
315250,No,24.24,0.0,0.0,6.0,1.0,0.0,1.0,0.0,1.0,...,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0


**2.4 c) Integer Encoding**

For `HeartDisease`:
- Yes: 1
- No: 0

In [18]:
heart_pki_ohe_df['HeartDisease'].head()

mapping = {
    "Yes": 1,
    "No": 0
}

for val in mapping:
    rows = heart_pki_ohe_df['HeartDisease'] == val
    heart_pki_ohe_df.loc[rows, 'HeartDisease'] = mapping[val]

heart_pki_ohe_df['HeartDisease'].unique()

array([0, 1], dtype=object)
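
Note that the output above still reports `dtype=object`, because the values were replaced one at a time inside a string column. If an integer dtype is preferred, `Series.map` followed by `astype` is one option. A sketch on a made-up column (`toy_response` is hypothetical):

```python
import pandas as pd

# Toy response column (values are made up)
toy_response = pd.Series(['No', 'Yes', 'No', 'No'], name='HeartDisease')

mapping = {'Yes': 1, 'No': 0}

# map() looks each value up in the dictionary; astype makes the dtype explicit
toy_encoded = toy_response.map(mapping).astype('int64')

print(toy_encoded.tolist())
print(toy_encoded.dtype)
```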

In [19]:
heart_pki_ohe_df

Unnamed: 0,HeartDisease,BMI,PhysicalHealth,MentalHealth,SleepTime,Smoking_No,Smoking_Yes,AlcoholDrinking_No,AlcoholDrinking_Yes,Stroke_No,...,GenHealth_Fair,GenHealth_Good,GenHealth_Poor,GenHealth_Very good,Asthma_No,Asthma_Yes,KidneyDisease_No,KidneyDisease_Yes,SkinCancer_No,SkinCancer_Yes
0,0,16.60,3.0,30.0,5.0,0.0,1.0,1.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0
1,0,20.34,0.0,0.0,7.0,1.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0
2,0,26.58,20.0,30.0,8.0,0.0,1.0,1.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0
3,0,24.21,0.0,0.0,6.0,1.0,0.0,1.0,0.0,1.0,...,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,0,23.71,28.0,0.0,8.0,1.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
315247,0,22.22,0.0,0.0,8.0,1.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
315248,1,27.41,7.0,0.0,6.0,0.0,1.0,1.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0
315249,0,29.84,0.0,0.0,5.0,0.0,1.0,1.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0
315250,0,24.24,0.0,0.0,6.0,1.0,0.0,1.0,0.0,1.0,...,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0


**2.4 d) Export encoded `heart_pki_ohe_df` dataframe as csv**

In [20]:
heart_pki_ohe_df.to_csv('datasets/heart_pki_2020_encoded.csv', index=False)

#### **2.5 Export cleaned dataset**

In [21]:
# the only change from the original csv is the removal of the rows with outlier SleepTime values
heart_pki_original_df.to_csv('datasets/heart_pki_2020_cleaned.csv', index=False)

### 3. Data Cleaning and Pre-processing (heart_attack_original_df)

In [22]:
heart_attack_original_df = pd.read_csv('datasets/heart_attack_original.csv')
heart_attack_original_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trtbps    303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalachh  303 non-null    int64  
 8   exng      303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slp       303 non-null    int64  
 11  caa       303 non-null    int64  
 12  thall     303 non-null    int64  
 13  output    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


**About this dataset**
- `age` : Age of the patient
- `sex` : Sex of the patient
- `exng`: exercise induced angina (1 = yes; 0 = no)
- `caa`: number of major vessels (0-3)
- `cp` : chest pain type
  - Value 1: typical angina
  - Value 2: atypical angina
  - Value 3: non-anginal pain
  - Value 4: asymptomatic
- `trtbps` : resting blood pressure (in mm Hg)
- `chol` : cholesterol in mg/dl fetched via BMI sensor
- `fbs` : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
- `restecg` : resting electrocardiographic results
  - Value 0: normal
  - Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
  - Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
- `thalachh` : maximum heart rate achieved
- `thall` : Thal rate
- `output` : 0 = less chance of heart attack; 1 = more chance of heart attack

#### **3.0 Combining o2 dataset into heart_attack_original_df**

In [23]:
# o2Saturation_original_df
o2Saturation_original_df = pd.read_csv('datasets/o2Saturation_original.csv').rename(columns={'98.6': 'o2_saturation'})

# select the column explicitly; rows are matched by index label
heart_attack_original_df['o2_saturation'] = o2Saturation_original_df['o2_saturation']

heart_attack_original_df.head()

Unnamed: 0,age,sex,cp,trtbps,chol,fbs,restecg,thalachh,exng,oldpeak,slp,caa,thall,output,o2_saturation
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1,98.6
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1,98.6
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1,98.6
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1,98.1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1,97.5
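
The column assignment above matches rows by index label, so it only gives a meaningful result if both files list the same patients in the same order, which this notebook assumes rather than verifies. A small sketch of how the alignment behaves when the two frames differ in length (`patients_df` and `o2_df` are made-up stand-ins):

```python
import pandas as pd

# Toy frames of different lengths (values are made up)
patients_df = pd.DataFrame({'age': [63, 37, 41]})
o2_df = pd.DataFrame({'o2_saturation': [98.6, 98.6, 98.1, 97.5, 97.5]})

# Assigning a Series aligns on index labels: extra rows in o2_df are ignored
patients_df['o2_saturation'] = o2_df['o2_saturation']

# A simple guard against silently missing readings after alignment
assert patients_df['o2_saturation'].notna().all()
print(patients_df)
```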


#### **3.1 Removing irrelevant variables (columns)**

In [24]:
heart_attack_cleaned_df = heart_attack_original_df.drop(['oldpeak', 'slp', 'thall'], axis=1, inplace=False)
heart_attack_cleaned_df.head()

Unnamed: 0,age,sex,cp,trtbps,chol,fbs,restecg,thalachh,exng,caa,output,o2_saturation
0,63,1,3,145,233,1,0,150,0,0,1,98.6
1,37,1,2,130,250,0,1,187,0,0,1,98.6
2,41,0,1,130,204,0,0,172,0,0,1,98.6
3,56,1,1,120,236,0,1,178,0,0,1,98.1
4,57,0,0,120,354,0,1,163,1,0,1,97.5


#### **3.2 Removing rows with NA values (if they exist)**

In [25]:
print('Number of NA entries: ', heart_attack_cleaned_df.isna().sum().sum())

Number of NA entries:  0


- there are no NA entries, so there are no rows with NA to be dropped😃

#### **3.3 Removing outliers**

In [26]:
# thalachh - maximum heart rate
print('minimum thalachh:', min(heart_attack_cleaned_df['thalachh']))
print('maximum thalachh:', max(heart_attack_cleaned_df['thalachh']))

minimum thalachh: 71
maximum thalachh: 202


- a `thalachh` (maximum heart rate achieved) of 71 is implausibly low

In [27]:
# 3-sigma range (µ - 3σ, µ + 3σ), used here as the outlier cut-off
thalachh_mean = np.mean(heart_attack_cleaned_df['thalachh'])
thalachh_std = np.std(heart_attack_cleaned_df['thalachh'])
conf_interval2 = (thalachh_mean - 3 * thalachh_std, thalachh_mean + 3 * thalachh_std)
print('Confidence Interval:', conf_interval2)

Confidence Interval: (81.04486694080096, 218.24886243213638)


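- `thalachh` values outside the interval printed above (roughly 81.04 to 218.25), i.e. more than 3 standard deviations from the mean, are treated as outliers; for normally distributed data only about 0.3% of values fall outside such a range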
- remove rows that contain outlier `thalachh`

In [28]:
numofRowsBefore2 = len(heart_attack_cleaned_df)
print('Number of rows before removing outliers:', numofRowsBefore2)

(lower2, upper2) = conf_interval2
heart_attack_cleaned_df.drop(heart_attack_cleaned_df[heart_attack_cleaned_df['thalachh'] < lower2].index, inplace=True)
heart_attack_cleaned_df.drop(heart_attack_cleaned_df[heart_attack_cleaned_df['thalachh'] > upper2].index, inplace=True)

numofRowsAfter2 = len(heart_attack_cleaned_df)
print('Number of rows after removing outliers:', numofRowsAfter2)
print('Number of rows dropped:', numofRowsBefore2-numofRowsAfter2)

Number of rows before removing outliers: 303
Number of rows after removing outliers: 302
Number of rows dropped: 1


#### **3.4 Cleaning up continuous variables**
The `heart_attack_cleaned_df` dataset contains 5 continuous variables:
- age
- trtbps
- chol
- thalachh
- o2_saturation

In [29]:
heart_attack_cleaned_df[['age', 'trtbps', 'chol', 'thalachh', 'o2_saturation']].describe()

Unnamed: 0,age,trtbps,chol,thalachh,o2_saturation
count,302.0,302.0,302.0,302.0,302.0
mean,54.324503,131.662252,246.294702,149.907285,97.484106
std,9.067887,17.554429,51.914022,22.489378,0.342667
min,29.0,94.0,126.0,88.0,96.5
25%,47.25,120.0,211.0,134.5,97.5
50%,55.0,130.0,240.5,153.0,97.5
75%,61.0,140.0,274.75,166.0,97.5
max,77.0,200.0,564.0,202.0,98.6


- none of `age`, `trtbps`, `chol`, `thalachh`, `o2_saturation` contain a '0' value (the minimum of each is above 0)
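
Reading the `min` row of `describe()` is enough here, but zeros can also be counted directly. A sketch on made-up rows (`toy_df` is hypothetical; the same expression could be applied to the five continuous columns):

```python
import pandas as pd

# Toy rows standing in for the continuous columns (values are made up)
toy_df = pd.DataFrame({
    'age': [63, 37, 41],
    'chol': [233, 250, 204],
    'o2_saturation': [98.6, 98.6, 98.1],
})

# Count zeros per column; a zero here would suggest a missing reading
zero_counts = (toy_df == 0).sum()
print(zero_counts)
```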

#### **3.5 Cleaning up categorical variables**
Note: Categorical variables in `heart_attack_cleaned_df` are already integer encoded

The `heart_attack_cleaned_df` dataset contains 6 categorical predictor variables (`thall` was dropped in 3.1):
- sex
- caa
- cp
- exng
- fbs
- restecg

In [30]:
cat_variables2 = [
    'sex', 'caa', 'cp', 'exng', 'fbs', 'restecg'
]
for feature in cat_variables2:
  print(heart_attack_cleaned_df[feature].value_counts(), end='\n\n')

1    206
0     96
Name: sex, dtype: int64

0    174
1     65
2     38
3     20
4      5
Name: caa, dtype: int64

0    142
2     87
1     50
3     23
Name: cp, dtype: int64

0    203
1     99
Name: exng, dtype: int64

0    257
1     45
Name: fbs, dtype: int64

1    151
0    147
2      4
Name: restecg, dtype: int64



- all the categorical variables are consistent in their values
- there are no missing values

Therefore, no data cleaning is needed for categorical variables. However, we can map the integer codes to meaningful strings for data analysis.

#### **3.6 Converting categorical variable values to meaningful strings**
The `heart_attack_cleaned_df` dataset contains the following categorical variables whose values have a meaningful text label:
- `sex` : Sex of the patient (1 = male, 0 = female)
- `exng`: exercise induced angina (1 = yes; 0 = no)
- `cp` : chest pain type
  - Value 1: typical angina
  - Value 2: atypical angina
  - Value 3: non-anginal pain
  - Value 4: asymptomatic
- `fbs` : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
- `restecg` : resting electrocardiographic results
  - Value 0: normal
  - Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
  - Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
- `output` : 0 = less chance of heart attack; 1 = more chance of heart attack

In [31]:
# make a copy of the cleaned dataframe
heart_attack_cleaned_text_df = heart_attack_cleaned_df.copy()

**3.6 a) converting categorical predictors**

In [32]:
mapping = {
    'sex': {
        0: 'Female',
        1: 'Male'
    },
    'exng': {
        0: 'No',
        1: 'Yes'
    },
    'cp':{
        0: 'No Chest Pain',
        1: 'typical angina',
        2: 'atypical angina',
        3: 'non-anginal pain',
        4: 'asymptomatic'
    },
    'fbs': {
        0: 'False',
        1: 'True'
    },
    'restecg': {
        0: 'normal',
        1: 'having ST-T wave abnormality',
        2: 'showing probable or definite left ventricular hypertrophy Estes\' criteria'
    }
}

for mapping_type in mapping:
    # every code present in the data has an entry in the mapping,
    # so map() replaces all values and returns a string (object) column
    heart_attack_cleaned_text_df[mapping_type] = (
        heart_attack_cleaned_text_df[mapping_type].map(mapping[mapping_type])
    )

    print(mapping_type, ':', heart_attack_cleaned_text_df[mapping_type].unique())


sex : ['Male' 'Female']
exng : ['No' 'Yes']
cp : ['non-anginal pain' 'atypical angina' 'typical angina' 'No Chest Pain']
fbs : ['True' 'False']
restecg : ['normal' 'having ST-T wave abnormality'
 "showing probable or definite left ventricular hypertrophy Estes' criteria"]


In [33]:
heart_attack_cleaned_text_df.head()

Unnamed: 0,age,sex,cp,trtbps,chol,fbs,restecg,thalachh,exng,caa,output,o2_saturation
0,63,Male,non-anginal pain,145,233,True,normal,150,No,0,1,98.6
1,37,Male,atypical angina,130,250,False,having ST-T wave abnormality,187,No,0,1,98.6
2,41,Female,typical angina,130,204,False,normal,172,No,0,1,98.6
3,56,Male,typical angina,120,236,False,having ST-T wave abnormality,178,No,0,1,98.1
4,57,Female,No Chest Pain,120,354,False,having ST-T wave abnormality,163,Yes,0,1,97.5


**3.6 b) converting categorical target**
- `output`

In [34]:
output_mapping = {
    0: 'less chance',
    1: 'more chance'
}
# map() replaces both codes in one step
heart_attack_cleaned_text_df['output'] = heart_attack_cleaned_text_df['output'].map(output_mapping)

print('output:', heart_attack_cleaned_text_df['output'].unique())

output: ['more chance' 'less chance']


#### **3.7 Renaming columns to more meaningful names**

In [35]:
heart_attack_cleaned_df.rename(
    {
        'exng': 'exercise_induced_angina',
        'cp': 'chest_pain',
        'caa': 'num_of_major_vessels',
        'trtbps': 'resting_blood_pressure',
        'fbs': 'fasting_blood_sugar',
        'restecg': 'rest_ecg',
        'thalachh': 'max_heart_rate',
        'thall': 'thal_rate',
        'output': 'heart_attack_chance'
    }, 
    axis='columns',
    inplace=True
)
heart_attack_cleaned_df.head()

Unnamed: 0,age,sex,chest_pain,resting_blood_pressure,chol,fasting_blood_sugar,rest_ecg,max_heart_rate,exercise_induced_angina,num_of_major_vessels,heart_attack_chance,o2_saturation
0,63,1,3,145,233,1,0,150,0,0,1,98.6
1,37,1,2,130,250,0,1,187,0,0,1,98.6
2,41,0,1,130,204,0,0,172,0,0,1,98.6
3,56,1,1,120,236,0,1,178,0,0,1,98.1
4,57,0,0,120,354,0,1,163,1,0,1,97.5


In [36]:
heart_attack_cleaned_text_df.rename(
    {
        'exng': 'exercise_induced_angina',
        'cp': 'chest_pain',
        'caa': 'num_of_major_vessels',
        'trtbps': 'resting_blood_pressure',
        'fbs': 'fasting_blood_sugar',
        'restecg': 'rest_ecg',
        'thalachh': 'max_heart_rate',
        'thall': 'thal_rate',
        'output': 'heart_attack_chance'
    }, 
    axis='columns',
    inplace=True
)
heart_attack_cleaned_text_df.head()

Unnamed: 0,age,sex,chest_pain,resting_blood_pressure,chol,fasting_blood_sugar,rest_ecg,max_heart_rate,exercise_induced_angina,num_of_major_vessels,heart_attack_chance,o2_saturation
0,63,Male,non-anginal pain,145,233,True,normal,150,No,0,more chance,98.6
1,37,Male,atypical angina,130,250,False,having ST-T wave abnormality,187,No,0,more chance,98.6
2,41,Female,typical angina,130,204,False,normal,172,No,0,more chance,98.6
3,56,Male,typical angina,120,236,False,having ST-T wave abnormality,178,No,0,more chance,98.1
4,57,Female,No Chest Pain,120,354,False,having ST-T wave abnormality,163,Yes,0,more chance,97.5
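
The same rename mapping is written out twice above. One way to avoid the duplication is to define the dictionary once and apply it to both dataframes. The sketch below uses a subset of the mapping on made-up frames (`column_names`, `coded_df` and `text_df` are hypothetical names):

```python
import pandas as pd

# Subset of the rename mapping used above
column_names = {
    'cp': 'chest_pain',
    'thalachh': 'max_heart_rate',
    'thall': 'thal_rate',  # absent from the toy frames; rename() skips it by default
}

# Toy stand-ins for the integer-coded and text-coded frames (values are made up)
coded_df = pd.DataFrame({'cp': [3, 2], 'thalachh': [150, 187]})
text_df = pd.DataFrame({'cp': ['non-anginal pain', 'atypical angina'],
                        'thalachh': [150, 187]})

# rename() returns a new frame, so no inplace=True is needed
coded_df = coded_df.rename(columns=column_names)
text_df = text_df.rename(columns=column_names)

print(list(coded_df.columns))
```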


#### **3.8 Export cleaned heart_attack datasets**

In [37]:
# cleaned dataset with renamed columns
heart_attack_cleaned_df.to_csv('datasets/heart_attack_cleaned.csv', index=False)

# cleaned dataset with renamed columns and categorical values
heart_attack_cleaned_text_df.to_csv('datasets/heart_attack_cleaned_text.csv', index=False)
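
Passing `index=False` keeps the row index out of the file; without it, reading the CSV back yields an extra `Unnamed: 0` column. A quick round-trip check can confirm that an export reloads as expected. This sketch writes to an in-memory buffer rather than to `datasets/`, and `toy_df` is made up:

```python
import io

import pandas as pd

# Toy frame standing in for a cleaned dataset (values are made up)
toy_df = pd.DataFrame({'age': [63, 37], 'max_heart_rate': [150, 187]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
buffer.seek(0)

# Reading it back should reproduce the same columns and values
reloaded_df = pd.read_csv(buffer)
print(reloaded_df.equals(toy_df))
```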


---

#### Datasets created from this notebook:

    .
    ├── heart_pki_2020_original.csv       # original dataset
    │   ├── heart_pki_2020_cleaned.csv        # for EDA and visualization
    │   └── heart_pki_2020_encoded.csv        # for analytical models (OneHotEncoding done)
    │
    ├── o2Saturation_original.csv         # original dataset
    └── heart_attack_original.csv         # original dataset
        ├── heart_attack_cleaned.csv          # for analytical model (default integer encoding)
        └── heart_attack_cleaned_text.csv     # for EDA and visualization (meaningful values)

 