# Heart Disease Prediction

## Data Exploration
This notebook covers the initial exploration of the dataset: its structure, data types, and potential issues. The goal is to run a general analysis, unify the representation of missing values, and give every column a consistent and appropriate data type. The cleaned dataset will then be stored in Parquet format for further processing.

`Simón Correa Marín`

### **1. Import Libraries**

In [20]:
# base libraries for data science
from pathlib import Path
import numpy as np
import pandas as pd
import pyarrow as pa

### **2. Load Data**

In [21]:
# data directory path
DATA_DIR = Path.cwd().resolve().parents[0] / "data"

# hd -> heart disease
hd_df = pd.read_csv(DATA_DIR / "01_raw/heartdisease_raw.csv")

### **3. Data Description**

In [22]:
hd_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6848 entries, 0 to 6847
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   rest_ecg    6403 non-null   object 
 1   ca          6482 non-null   object 
 2   thal        6557 non-null   object 
 3   max_hr      6456 non-null   object 
 4   exang       6499 non-null   object 
 5   old_peak    6496 non-null   object 
 6   chol        6649 non-null   object 
 7   rest_bp     6661 non-null   object 
 8   chest_pain  6654 non-null   object 
 9   disease     6602 non-null   object 
 10  sex         6700 non-null   object 
 11  fbs         6624 non-null   float64
 12  slope       6495 non-null   object 
 13  age         6769 non-null   object 
dtypes: float64(1), object(13)
memory usage: 749.1+ KB


In [23]:
hd_df.sample(10)

| | rest_ecg | ca | thal | max_hr | exang | old_peak | chol | rest_bp | chest_pain | disease | sex | fbs | slope | age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6310 | normal | 0.0 | reversable | 131 | 1 | 1.8 | 309 | 125 | nonanginal | 1 | Male | 0.0 | 2 | 64 |
| 3211 | normal | 0.0 | normal | 170 | 0 | 0.0 | 220 | 120 | nontypical | 0 | Male | 0.0 | 1 | 44 |
| 6532 | normal | 0.0 | normal | 137 | 1 | 1.0 | 243 | 150 | nonanginal | 0 | Male | 1.0 | 2 | 61 |
| 6383 | normal | 0.0 | normal | 162 | 0 | 1.1 | 244 | 120 | nontypical | 0 | Female | 0.0 | 1 | 50 |
| 1826 | normal | 1.0 | fixed | 134 | 0 | 2.2 | 218 | 126 | nonanginal | 1 | Male | 1.0 | 2 | 59 |
| 771 | normal | 1.0 | reversable | 143 | 1 | 3.0 | 335 | 110 | asymptomatic | 1 | Male | 0.0 | 2 | 57 |
| 2432 | ST-T wave abnormality | 3.0 | fixed | 140 | 0 | 4.4 | 318 | 114 | asymptomatic | 1 | Male | 0.0 | 3 | 58 |
| 2393 | normal | 0.0 | normal | 165 | 0 | 0.2 | 213 | 122 | nonanginal | 0 | Female | 0.0 | 2 | 43 |
| 2707 | left ventricular hypertrophy | 3.0 | reversable | 154 | 0 | 4.0 | 407 | 150 | asymptomatic | 1 | Female | 0.0 | 2 | 63 |
| 3204 | left ventricular hypertrophy | 3.0 | normal | 109 | 0 | 2.4 | 322 | 130 | asymptomatic | 1 | Male | 0.0 | 2 | 70 |


### **4. Null Values**

In [24]:
# Dataset length
len(hd_df)

6848

In [26]:
# Percentage of missing values for each column
missing_values = hd_df.isnull().sum() * 100 / len(hd_df)

for column, percentage in missing_values.items():
    print(f"{column}: {percentage:.3f}%")

rest_ecg: 6.498%
ca: 5.345%
thal: 4.249%
max_hr: 5.724%
exang: 5.096%
old_peak: 5.140%
chol: 2.906%
rest_bp: 2.731%
chest_pain: 2.833%
disease: 3.592%
sex: 2.161%
fbs: 3.271%
slope: 5.155%
age: 1.154%


In [27]:
# Check whether missing values also appear under other representations
mv = ["?", " ", "", "nan", "N/A", "na", "NA", "NAN", "None", "none", "NONE", "null", "NULL", "Null"]
for col in hd_df.columns:
    print(col, hd_df[col].isin(mv).sum())

rest_ecg 0
ca 0
thal 0
max_hr 0
exang 0
old_peak 0
chol 0
rest_bp 0
chest_pain 0
disease 0
sex 0
fbs 0
slope 0
age 0


There are no other representations of missing values in the heart disease dataset.
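The membership test above compares tokens exactly, so a cell such as `' NA '` or `'n/a'` would not be counted. A minimal sketch of a stricter variant follows, assuming it is acceptable to strip whitespace and ignore case before comparing. The helper name `count_missing_tokens`, the token set, and the toy frame are hypothetical and not part of the notebook.

```python
import pandas as pd

# Tokens that commonly stand in for a missing value (lowercase, stripped).
# This set is an assumption; extend it to match the data source.
MISSING_TOKENS = {"?", "", "nan", "n/a", "na", "none", "null"}


def count_missing_tokens(df: pd.DataFrame, tokens=MISSING_TOKENS) -> pd.Series:
    """Count, per column, the string cells that match a missing-value token
    after stripping whitespace and lowercasing. Real NaN cells are not counted."""
    counts = {}
    for col in df.columns:
        # Non-string cells (floats, real NaN/None) are mapped to None and never match.
        normalized = df[col].map(
            lambda v: v.strip().lower() if isinstance(v, str) else None
        )
        counts[col] = int(normalized.isin(tokens).sum())
    return pd.Series(counts)


# Toy data, for illustration only.
toy = pd.DataFrame({
    "chol": ["233", " NA ", None, "?"],
    "sex": ["Male", "Female", "n/a", "Male"],
})
result = count_missing_tokens(toy)
print(result)
```

On the real data this would be called as `count_missing_tokens(hd_df)`. Any non-zero count would point to cells that should be converted to NaN before computing the missing-value percentages.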

**I choose not to drop any column because the percentage of missing values is low (under 7% in every column).**

### **5. Data Types**

#### **Categorical Values**
- **Ordinal**
    - **ca** → Number of major blood vessels (0-3) colored by fluoroscopy.
    - **slope** → Slope of the peak exercise ST segment (1-3)

- **Nominal**
    - **chest_pain** → Type of chest pain
        - Asymptomatic
        - Non-anginal
        - Atypical angina
        - Typical angina
    - **rest_ecg** → Resting electrocardiogram results
        - Normal
        - ST-T abnormality
        - Left ventricular hypertrophy
    - **thal** → Thalassemia test result
        - Normal
        - Fixed defect
        - Reversible defect
    - **sex** → Patient’s gender
        - Male
        - Female

- **Boolean**
    - **exang** → Exercise-induced angina
        - 0 = False
        - 1 = True
    - **fbs** → Fasting blood sugar > 120 mg/dL
        - 0 = False 
        - 1 = True
    - **disease (target)** → Presence of heart disease
        - 0 = No disease (False)
        - 1 = Disease (True)

#### **Numerical Values**
- **Discrete**
    - **age** → Patient’s age in years.
    - **chol** → Serum cholesterol level (mg/dL).
    - **rest_bp** → Resting blood pressure (mmHg)
    - **max_hr** → Maximum heart rate achieved during a stress test.

- **Continuous**
    - **old_peak** → ST depression induced by exercise relative to rest.
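The classification above maps naturally onto pandas dtypes. The sketch below shows one possible mapping and is not the notebook's actual conversion code. The `DTYPE_PLAN` name, the choice of the nullable `Int64` and `boolean` dtypes, and the toy rows are assumptions, and the sketch presumes that invalid entries have already been replaced with NaN.

```python
import numpy as np
import pandas as pd

# One possible dtype per column, following the classification above.
DTYPE_PLAN = {
    # Ordinal: ordered categoricals using the documented ranges.
    "ca": pd.CategoricalDtype(categories=[0, 1, 2, 3], ordered=True),
    "slope": pd.CategoricalDtype(categories=[1, 2, 3], ordered=True),
    # Nominal: unordered categoricals.
    "chest_pain": "category",
    "rest_ecg": "category",
    "thal": "category",
    "sex": "category",
    # Boolean: the nullable boolean dtype keeps missing values as <NA>.
    "exang": "boolean",
    "fbs": "boolean",
    "disease": "boolean",
    # Discrete: nullable integers, so missing values do not force floats.
    "age": "Int64",
    "chol": "Int64",
    "rest_bp": "Int64",
    "max_hr": "Int64",
    # Continuous.
    "old_peak": "float64",
}

# Toy rows with already-clean values, for illustration only.
toy = pd.DataFrame({
    "ca": [0, 3, 1],
    "slope": [1, 2, 3],
    "chest_pain": ["asymptomatic", "nonanginal", "nontypical"],
    "rest_ecg": ["normal", "normal", "ST-T wave abnormality"],
    "thal": ["normal", "fixed", "reversable"],
    "sex": ["Male", "Female", "Male"],
    "exang": [0.0, 1.0, np.nan],
    "fbs": [0.0, 1.0, 0.0],
    "disease": [1, 0, 1],
    "age": [63.0, 41.0, np.nan],
    "chol": [233, 204, 286],
    "rest_bp": [145, 130, 120],
    "max_hr": [150, 172, 108],
    "old_peak": [2.3, 1.4, 0.0],
})

typed = toy.astype(DTYPE_PLAN)
print(typed.dtypes)
```

Nullable dtypes are used here so that integer and boolean columns can hold missing values without falling back to `float64` or `object`.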

#### **Convert data types**

In [28]:
# Unique values for each column
for col in hd_df.columns:
    print(col, hd_df[col].unique())

rest_ecg ['normal' 'left ventricular hypertrophy ' nan 'ST-T wave abnormality'
 '5653' '36653' '3563' '435647']
ca ['1.0' '0.0' '3.0' '2.0' nan 'afd']
thal ['normal' nan 'fixed' 'reversable' '87654' '56']
max_hr ['158' '163' '152' '115' nan '168' '190' '140' '182' '165' '125' '174'
 '117' '142' '166' '143' '194' '147' '126' '112' '139' '162' '88' '161'
 '123' '195' '164' '159' '169' '109' '122' '175' '187' '171' '99' '130'
 '127' '157' '167' '186' '145' '141' '173' '132' '136' '151' '118' '114'
 '138' '172' '155' '146' '111' '97' '170' '179' '154' '177' '90' '160'
 '108' '133' '180' '137' '156' '150' '131' '202' '144' '105' '96' '103'
 '153' '181' '121' '185' '188' '120' '184' '178' '192' '116' '113' '148'
 '128' '106' '95' '129' '149' '71' '134' '124' 'adfs']
exang ['0' '1' nan 'adfs' 'f']
old_peak ['0.8' '0.6' '0.0' '4.4' '1.0' '3.4' '1.6' '1.2' '0.5' '1.9' '2.9' '2.0'
 '1.4' '0.2' '0.1' nan '0.4' '3.6' '3.8' '2.2' '4.2' '3.5' '0.9' '1.8'
 '3.0' '2.8' '2.6' '5.6' '0.3' '4.0' '2.3' '2

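Several columns contain invalid values, such as `'afd'` in `ca`, `'adfs'` in `max_hr` and `exang`, and numeric strings like `'5653'` in `rest_ecg`. The `rest_ecg` label `'left ventricular hypertrophy '` also carries trailing whitespace. We'll handle them.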
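One way the invalid values seen in the output above could be handled is offered here as a sketch, not as the notebook's actual approach. It coerces numeric columns with `pd.to_numeric(..., errors="coerce")` and keeps only known labels in categorical columns, so every invalid entry becomes NaN. The `clean_invalid` helper, the column lists, and the toy rows are hypothetical. The valid labels are taken only from the unique values printed above for `rest_ecg` and `thal`.

```python
import pandas as pd

# Valid labels, taken from the unique values printed above.
VALID_LABELS = {
    "rest_ecg": {"normal", "left ventricular hypertrophy", "ST-T wave abnormality"},
    "thal": {"normal", "fixed", "reversable"},
}
# Columns expected to hold numbers (a subset, for illustration).
NUMERIC_COLUMNS = ["ca", "max_hr", "exang", "old_peak"]


def clean_invalid(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which unparseable numbers and unknown labels are NaN."""
    out = df.copy()
    for col in NUMERIC_COLUMNS:
        if col in out.columns:
            # Strings such as 'afd' or 'adfs' cannot be parsed and become NaN.
            out[col] = pd.to_numeric(out[col], errors="coerce")
    for col, labels in VALID_LABELS.items():
        if col in out.columns:
            # Strip whitespace first so 'left ventricular hypertrophy ' is kept.
            stripped = out[col].str.strip()
            out[col] = stripped.where(stripped.isin(labels))
    return out


# Toy rows that reproduce the kinds of invalid values seen above.
toy = pd.DataFrame({
    "rest_ecg": ["normal", "left ventricular hypertrophy ", "5653", None],
    "thal": ["fixed", "87654", "reversable", "normal"],
    "ca": ["1.0", "afd", "0.0", None],
    "max_hr": ["158", "adfs", "163", "152"],
    "exang": ["0", "f", "1", "adfs"],
    "old_peak": ["0.8", "0.6", None, "4.4"],
})
cleaned = clean_invalid(toy)
print(cleaned)
```

Turning invalid entries into NaN, rather than dropping the rows, keeps the row count intact and lets the missing-value handling deal with them uniformly.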