# Heart Disease Prediction

## Data Exploration
This notebook covers the initial exploration of the dataset to understand its structure, data types, and potential issues. The goal is to run a general analysis, unify the representation of missing values, and ensure that every column has a consistent and appropriate data type. The cleaned dataset is then stored in Parquet format for further processing.

`Simón Correa Marín`

### **1. Import Libraries**

In [1]:
# base libraries for data science
from pathlib import Path

import numpy as np
import pandas as pd
import pyarrow as pa

### **2. Load Data**

In [2]:
# data directory path
DATA_DIR = Path.cwd().resolve().parents[0] / "data"

# hd -> heart disease
hd_df = pd.read_csv(DATA_DIR / "01_raw/heartdisease_raw.csv")

### **3. Data Description**

In [3]:
hd_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6848 entries, 0 to 6847
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   rest_ecg    6403 non-null   object 
 1   ca          6482 non-null   object 
 2   thal        6557 non-null   object 
 3   max_hr      6456 non-null   object 
 4   exang       6499 non-null   object 
 5   old_peak    6496 non-null   object 
 6   chol        6649 non-null   object 
 7   rest_bp     6661 non-null   object 
 8   chest_pain  6654 non-null   object 
 9   disease     6602 non-null   object 
 10  sex         6700 non-null   object 
 11  fbs         6624 non-null   float64
 12  slope       6495 non-null   object 
 13  age         6769 non-null   object 
dtypes: float64(1), object(13)
memory usage: 749.1+ KB


In [4]:
hd_df.sample(10)

Unnamed: 0,rest_ecg,ca,thal,max_hr,exang,old_peak,chol,rest_bp,chest_pain,disease,sex,fbs,slope,age
5321,normal,0.0,normal,181.0,0,1.2,303.0,115.0,asymptomatic,0,Male,0.0,2,43
3274,,3.0,reversable,,1,0.0,,,,1,,,2,45
1576,left ventricular hypertrophy,1.0,normal,153.0,0,0.0,290.0,112.0,asymptomatic,1,Male,0.0,1,44
4877,left ventricular hypertrophy,2.0,reversable,170.0,0,1.2,293.0,140.0,asymptomatic,1,Male,0.0,2,60
2129,left ventricular hypertrophy,0.0,normal,160.0,0,0.0,234.0,138.0,asymptomatic,0,Female,0.0,1,53
3139,normal,0.0,normal,170.0,0,0.0,215.0,120.0,nonanginal,0,Female,0.0,1,37
1936,left ventricular hypertrophy,0.0,normal,144.0,1,1.8,211.0,110.0,typical,0,Male,0.0,2,64
5140,left ventricular hypertrophy,1.0,normal,132.0,1,0.1,212.0,112.0,asymptomatic,1,Male,0.0,1,66
4214,normal,1.0,reversable,151.0,0,1.0,277.0,118.0,nonanginal,0,Male,0.0,1,68
1663,,0.0,normal,,0,0.0,,,,0,,,1,46


### **4. Null Values**

In [5]:
# Dataset length
len(hd_df)

6848

In [6]:
# Percentage of missing values for each column
missing_values = hd_df.isnull().sum() * 100 / len(hd_df)

for column, percentage in missing_values.items():
    print(f"{column}: {percentage:.3f}%")

rest_ecg: 6.498%
ca: 5.345%
thal: 4.249%
max_hr: 5.724%
exang: 5.096%
old_peak: 5.140%
chol: 2.906%
rest_bp: 2.731%
chest_pain: 2.833%
disease: 3.592%
sex: 2.161%
fbs: 3.271%
slope: 5.155%
age: 1.154%


In [7]:
# Check whether missing values are also encoded with other tokens
mv = ["?", " ", "", "nan", "N/A", "na", "NA", "NAN", "None", "none", "NONE", "null", "NULL", "Null"]
for col in hd_df.columns:
    print(col, hd_df[col].isin(mv).sum())

rest_ecg 0
ca 0
thal 0
max_hr 0
exang 0
old_peak 0
chol 0
rest_bp 0
chest_pain 0
disease 0
sex 0
fbs 0
slope 0
age 0


There are no other representations of missing values in the heart disease dataset.
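The check above matches exact strings only, so a padded or differently cased token such as `" NA "` or `"n/a"` would not be counted. A minimal sketch of a more tolerant check follows. The helper `count_missing_tokens` and the token set are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical token set (lowercase); extend it to fit the data at hand.
MISSING_TOKENS = {"?", "", "nan", "n/a", "na", "none", "null"}


def count_missing_tokens(df: pd.DataFrame) -> pd.Series:
    """Count string cells that look like a missing-value token.

    Cells are stripped and lowercased before comparison.
    Real NaN values are not counted here.
    """
    counts = {}
    for col in df.columns:
        # Only string cells can hold a textual token; everything else maps to None.
        as_text = df[col].map(lambda x: x.strip().lower() if isinstance(x, str) else None)
        counts[col] = int(as_text.isin(MISSING_TOKENS).sum())
    return pd.Series(counts)


# Toy data for illustration only.
demo = pd.DataFrame({"a": ["1", " NA ", None, "?"], "b": [1.0, 2.0, None, 4.0]})
result = count_missing_tokens(demo)
print(result)
```

Applied to `hd_df`, this would be expected to confirm the zero counts reported above unless some tokens differ only by case or surrounding whitespace.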

**I chose not to delete any column because the percentage of NaN values is low (below 7% in every column).**

### **5. Data Types**

#### **Categorical Values**
- **Ordinal**
    - **ca** → Number of major blood vessels (0-3) colored by fluoroscopy.
    - **slope** → Slope of the peak exercise ST segment (1-3)

- **Nominal**
    - **chest_pain** → Type of chest pain
        - Asymptomatic
        - Non-anginal
        - Atypical angina
        - Typical angina
    - **rest_ecg** → Resting electrocardiogram results
        - Normal
        - ST-T abnormality
        - Left ventricular hypertrophy
    - **thal** → Thalassemia test result
        - Normal
        - Fixed defect
        - Reversible defect
    - **sex** → Patient’s gender
        - Male
        - Female

- **Boolean**
    - **exang** → Exercise-induced angina
        - 0 = False
        - 1 = True
    - **fbs** → Fasting blood sugar > 120 mg/dL
        - 0 = False 
        - 1 = True
    - **disease (target)** → Presence of heart disease
        - 0 = No disease (False)
        - 1 = Disease (True)

#### **Numerical Values**
- **Discrete**
    - **age** → Patient’s age in years.
    - **chol** → Serum cholesterol level (mg/dL).
    - **rest_bp** → Resting blood pressure (mmHg)
    - **max_hr** → Maximum heart rate achieved during a stress test.

- **Continuous**
    - **old_peak** → ST depression induced by exercise relative to rest.
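
Since `ca` and `slope` are listed as ordinal, their order can be made explicit in pandas. The sketch below assumes an ordered `CategoricalDtype` is wanted (the notebook itself stores these columns as unordered categories) and uses a small made-up Series rather than the real data.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Illustrative values only; the levels follow the 1-3 range listed above.
slope = pd.Series([1.0, 3.0, 2.0, None, 1.0])

slope_dtype = CategoricalDtype(categories=[1.0, 2.0, 3.0], ordered=True)
slope_ordered = slope.astype(slope_dtype)

# Ordered categories support min/max and comparisons against a level.
print(slope_ordered.min(), slope_ordered.max())
print((slope_ordered > 1.0).tolist())
```

With an unordered category, the same comparison would raise a `TypeError`, so the ordered dtype is only worth the extra step if later analysis relies on the ranking.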

#### **Convert data types**

In [8]:
# Unique values for each column
for col in hd_df.columns:
    print(col, hd_df[col].unique())

rest_ecg ['normal' 'left ventricular hypertrophy ' nan 'ST-T wave abnormality'
 '5653' '36653' '3563' '435647']
ca ['1.0' '0.0' '3.0' '2.0' nan 'afd']
thal ['normal' nan 'fixed' 'reversable' '87654' '56']
max_hr ['158' '163' '152' '115' nan '168' '190' '140' '182' '165' '125' '174'
 '117' '142' '166' '143' '194' '147' '126' '112' '139' '162' '88' '161'
 '123' '195' '164' '159' '169' '109' '122' '175' '187' '171' '99' '130'
 '127' '157' '167' '186' '145' '141' '173' '132' '136' '151' '118' '114'
 '138' '172' '155' '146' '111' '97' '170' '179' '154' '177' '90' '160'
 '108' '133' '180' '137' '156' '150' '131' '202' '144' '105' '96' '103'
 '153' '181' '121' '185' '188' '120' '184' '178' '192' '116' '113' '148'
 '128' '106' '95' '129' '149' '71' '134' '124' 'adfs']
exang ['0' '1' nan 'adfs' 'f']
old_peak ['0.8' '0.6' '0.0' '4.4' '1.0' '3.4' '1.6' '1.2' '0.5' '1.9' '2.9' '2.0'
 '1.4' '0.2' '0.1' nan '0.4' '3.6' '3.8' '2.2' '4.2' '3.5' '0.9' '1.8'
 '3.0' '2.8' '2.6' '5.6' '0.3' '4.0' '2.3' '2

Some columns contain unexpected values (for example, `'afd'` in `ca` or `'5653'` in `rest_ecg`). The following cells handle them.
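
One way to list the offending tokens in a column that should be numeric is to coerce it with `pd.to_numeric` and keep the entries that were present in the raw data but became NaN. This is a minimal sketch; the helper `invalid_numeric_tokens` is hypothetical and the sample Series is made up.

```python
import pandas as pd


def invalid_numeric_tokens(series: pd.Series) -> list:
    """Return the distinct non-null values that cannot be parsed as numbers."""
    coerced = pd.to_numeric(series, errors="coerce")
    # Present in the raw data but NaN after coercion: not a valid number.
    bad_mask = series.notna() & coerced.isna()
    return sorted(series[bad_mask].unique().tolist(), key=str)


# Toy data for illustration only.
demo = pd.Series(["158", "163", None, "adfs", "f", "adfs"])
tokens = invalid_numeric_tokens(demo)
print(tokens)
```

Running this per column before coercing gives a record of what was discarded, which is lost once `errors="coerce"` has been applied in place.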

In [9]:
categorical_cols = ["chest_pain", "slope", "ca", "rest_ecg", "thal", "sex"]
boolean_cols = ["exang", "fbs", "disease"]
disc_numerical_cols = ["age", "max_hr", "chol", "rest_bp"]
cont_numerical_cols = ["old_peak"]

In [None]:
# Cleaning numerical and boolean columns that have string values

# disease, ca, slope, old_peak and exang contain non-numeric strings, so coerce them to numeric (invalid entries become NaN)
hd_df[["disease", "ca", "slope", "exang", "old_peak"]] = hd_df[
    ["disease", "ca", "slope", "exang", "old_peak"]
].apply(pd.to_numeric, errors="coerce")

# Convert all discrete numerical columns to numbers
hd_df[disc_numerical_cols] = hd_df[disc_numerical_cols].apply(pd.to_numeric, errors="coerce")

# Replace any remaining non-numeric entries with NaN
for col in disc_numerical_cols + cont_numerical_cols + boolean_cols:
    hd_df[col] = hd_df[col].apply(lambda x: x if isinstance(x, (int, float)) else np.nan)

In [11]:
# Cleaning categorical columns that have numeric values
# Take numeric values away
for col in categorical_cols:
    if col not in ["ca", "slope"]:  # ca and slope are ordinal columns with numeric levels
        hd_df[col] = hd_df[col].apply(
            lambda x: x if isinstance(x, str) and not x.isnumeric() else np.nan
        )

In [12]:
# Categorical
hd_df[categorical_cols] = hd_df[categorical_cols].astype("category")

# Boolean (note: astype("bool") casts NaN to True, so missing values are not preserved here)
hd_df[boolean_cols] = hd_df[boolean_cols].astype("bool")

# Discrete numerical (ideally int64, but NaN values require float for now; to be converted in a later step)
hd_df[disc_numerical_cols] = hd_df[disc_numerical_cols].astype("float")

# Continuous numerical
hd_df[cont_numerical_cols] = hd_df[cont_numerical_cols].astype("float")
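
Casting a float column that contains NaN with `astype("bool")` turns each NaN into `True`, which is why the `info()` output further down reports 6848 non-null values for the boolean columns. If keeping the gaps is preferred, pandas' nullable extension dtypes are one option. The sketch below uses a toy frame, not the notebook's data, and is not the approach the notebook takes.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
demo = pd.DataFrame({"exang": [0.0, 1.0, np.nan], "age": [43.0, np.nan, 60.0]})

# Plain NumPy bool: NaN is truthy, so the missing value becomes True.
plain_bool = demo["exang"].astype("bool")
print(plain_bool.tolist())

# Nullable dtypes keep the gap as <NA>.
exang_nullable = demo["exang"].astype("boolean")
age_nullable = demo["age"].astype("Int64")

print(exang_nullable.isna().sum(), age_nullable.isna().sum())
print(age_nullable.dtype)
```

The `"Int64"` cast only succeeds when every non-missing value is a whole number, which should hold for columns such as `age` after coercion but is worth verifying first.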

In [13]:
# Unique values for each categorical column
for col in hd_df.select_dtypes(include="category").columns:
    print(col, hd_df[col].unique())

rest_ecg ['normal', 'left ventricular hypertrophy ', NaN, 'ST-T wave abnormality']
Categories (3, object): ['ST-T wave abnormality', 'left ventricular hypertrophy ', 'normal']
ca [1.0, 0.0, 3.0, 2.0, NaN]
Categories (4, float64): [0.0, 1.0, 2.0, 3.0]
thal ['normal', NaN, 'fixed', 'reversable']
Categories (3, object): ['fixed', 'normal', 'reversable']
chest_pain ['nontypical', 'asymptomatic', 'nonanginal', NaN, 'typical']
Categories (4, object): ['asymptomatic', 'nonanginal', 'nontypical', 'typical']
sex ['Male', 'Female', NaN]
Categories (2, object): ['Female', 'Male']
slope [1.0, 2.0, 3.0, NaN]
Categories (3, float64): [1.0, 2.0, 3.0]
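
The `rest_ecg` output above shows a category with a trailing space (`'left ventricular hypertrophy '`). A minimal sketch of trimming such padding before the category cast, using a made-up Series that mirrors those values:

```python
import pandas as pd

# Toy data mirroring the rest_ecg levels shown above.
rest_ecg = pd.Series(
    ["normal", "left ventricular hypertrophy ", None, "ST-T wave abnormality"]
)

# .str.strip() trims the padding and leaves missing values untouched.
rest_ecg_clean = rest_ecg.str.strip().astype("category")

print(rest_ecg_clean.cat.categories.tolist())
```

Stripping before the cast avoids ending up with two distinct categories that differ only by whitespace.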


In [None]:
# Unique values for each boolean column
for col in hd_df.select_dtypes(include="boolean").columns:
    print(col, hd_df[col].unique())

exang [False  True]
disease [False  True]
fbs [False  True]


In [15]:
# Numerical columns overview
hd_df.describe()

Unnamed: 0,max_hr,old_peak,chol,rest_bp,age
count,6453.0,6493.0,6643.0,6655.0,6763.0
mean,149.805207,1.027768,246.340659,131.696018,54.434866
std,22.708598,1.166625,50.071028,17.55022,9.003089
min,71.0,0.0,126.0,94.0,29.0
25%,134.0,0.0,212.0,120.0,48.0
50%,153.0,0.6,241.0,130.0,56.0
75%,166.0,1.6,275.0,140.0,61.0
max,202.0,6.2,564.0,200.0,77.0


In [16]:
hd_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6848 entries, 0 to 6847
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   rest_ecg    6392 non-null   category
 1   ca          6479 non-null   category
 2   thal        6552 non-null   category
 3   max_hr      6453 non-null   float64 
 4   exang       6848 non-null   bool    
 5   old_peak    6493 non-null   float64 
 6   chol        6643 non-null   float64 
 7   rest_bp     6655 non-null   float64 
 8   chest_pain  6648 non-null   category
 9   disease     6848 non-null   bool    
 10  sex         6692 non-null   category
 11  fbs         6848 non-null   bool    
 12  slope       6492 non-null   category
 13  age         6763 non-null   float64 
dtypes: bool(3), category(6), float64(5)
memory usage: 328.7 KB


In [17]:
hd_df.sample(3)

Unnamed: 0,rest_ecg,ca,thal,max_hr,exang,old_peak,chol,rest_bp,chest_pain,disease,sex,fbs,slope,age
1993,normal,0.0,normal,154.0,True,1.4,244.0,150.0,asymptomatic,True,Female,False,2.0,62.0
154,left ventricular hypertrophy,2.0,normal,125.0,True,0.9,299.0,100.0,asymptomatic,True,Male,False,2.0,67.0
2575,normal,0.0,normal,96.0,False,0.0,178.0,120.0,nonanginal,False,Female,True,1.0,60.0


### **6. Save dataframe with data types**

In [18]:
schema = pa.Table.from_pandas(hd_df).schema

In [19]:
(DATA_DIR / "02_intermediate").mkdir(parents=True, exist_ok=True)

# Save DataFrame in parquet format
hd_df.to_parquet(DATA_DIR / "02_intermediate/hd_type_fixed.parquet", index=False, schema=schema)

## **Analysis of Results**
- There are more categorical columns than numerical ones in this dataset.
- No columns were removed because the share of NaN values is low.
- Unexpected values were replaced with NaN so that the data type conversions would succeed.
- The final dataframe was saved in Parquet format.