# **Heart Disease Prediction**  

## **Feature Engineering**  
This notebook performs feature engineering to prepare the dataset for model training. The goal is to clean and transform the data so that the resulting features are as informative as possible for predicting the `disease` target.  

### **Feature Engineering Process**  
- **Data Cleaning:** Handling missing values, removing duplicates, and addressing outliers if necessary.  
- **Feature Selection:** Eliminating redundant or uninformative attributes.  
- **Feature Engineering:** Transforming variables, discretizing continuous features, creating new meaningful attributes, and applying mathematical transformations where appropriate.  
- **Feature Scaling:** Standardizing or normalizing numerical features.  
- **Encoding:** Converting categorical variables into numerical representations suitable for modeling.  

All transformations will be implemented using **scikit-learn pipelines** to ensure a structured and reproducible workflow. 
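
As a rough illustration of the scaling and discretization steps listed above, the sketch below wraps each transformation in a `Pipeline` so that imputation and the transformation are fitted together. The `ages` array is made up, and the choice of `StandardScaler`, `KBinsDiscretizer`, three uniform bins, and median imputation is an assumption for illustration, not necessarily what this notebook applies.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler

# Made-up ages (one missing) standing in for a continuous feature.
ages = np.array([[29.0], [41.0], [np.nan], [56.0], [63.0], [77.0]])

# Scaling: fill gaps with the median, then standardize to mean 0 and std 1.
scale_pipe = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
    ]
)
scaled = scale_pipe.fit_transform(ages)

# Discretization: fill gaps, then cut the range into 3 equal-width bins.
bin_pipe = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="median")),
        ("bins", KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")),
    ]
)
binned = bin_pipe.fit_transform(ages)

print(scaled.ravel())
print(binned.ravel())
```

Because each pipeline learns its statistics in `fit`, calling `transform` on new data reuses the median, mean, and bin edges learned from the fitting data.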

`Simón Correa Marín`


### **1. Import Libraries and Configurations**

In [1]:
# base libraries for data science

from pathlib import Path
import pandas as pd
import sklearn as sk
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

### **2. Load Data**

In [2]:
DATA_DIR = Path.cwd().resolve().parents[0] / "data"

hd_df = pd.read_parquet(
    DATA_DIR / "02_intermediate/hd_type_fixed.parquet", engine="pyarrow"
)

In [3]:
# print library version for reproducibility

print("Pandas version: ", pd.__version__)
print("sklearn version: ", sk.__version__)

Pandas version:  2.2.3
sklearn version:  1.6.1


### **3. Data Preparation**

In [4]:
hd_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6848 entries, 0 to 6847
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   rest_ecg    6392 non-null   category
 1   ca          6479 non-null   float64 
 2   thal        6552 non-null   category
 3   max_hr      6453 non-null   float64 
 4   exang       6848 non-null   bool    
 5   old_peak    6763 non-null   float64 
 6   chol        6643 non-null   float64 
 7   rest_bp     6655 non-null   float64 
 8   chest_pain  6648 non-null   category
 9   disease     6848 non-null   bool    
 10  sex         6692 non-null   category
 11  fbs         6848 non-null   bool    
 12  slope       6492 non-null   float64 
 13  age         6763 non-null   float64 
dtypes: bool(3), category(4), float64(7)
memory usage: 422.0 KB


#### **Missing Values**
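
The `hd_df.info()` output above shows that 11 of the 14 columns contain missing values; only `exang`, `disease`, and `fbs` are complete. A minimal sketch of how the missing counts could be tallied is given below. `toy_df` and its values (including the category labels) are made up for illustration and stand in for `hd_df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for hd_df: column names follow the dataset, values are made up.
toy_df = pd.DataFrame(
    {
        "age": [63.0, np.nan, 41.0, 56.0],
        "chol": [233.0, 286.0, np.nan, np.nan],
        "sex": pd.Categorical(["Male", None, "Female", "Male"]),
        "disease": [False, True, False, True],
    }
)

# Count and percentage of missing values per column, largest first.
missing_summary = (
    toy_df.isna()
    .sum()
    .to_frame("n_missing")
    .assign(pct_missing=lambda d: 100 * d["n_missing"] / len(toy_df))
    .sort_values("n_missing", ascending=False)
)
print(missing_summary)
```

Running the same expression on `hd_df` would give the per-column counts needed to decide between imputing and dropping.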

#### **Duplicated Data**
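
A minimal sketch of how exact duplicate rows could be detected and removed with pandas. The frame below is made up; whether duplicates exist in `hd_df`, and whether dropping them is appropriate, is not stated in this notebook and is assumed here only for illustration.

```python
import pandas as pd

# Toy frame with one exact duplicate row (values are made up).
toy_df = pd.DataFrame(
    {
        "age": [63.0, 63.0, 41.0],
        "chol": [233.0, 233.0, 204.0],
        "disease": [False, False, True],
    }
)

# duplicated() flags every repeat of an earlier row (keep="first" by default).
n_duplicates = int(toy_df.duplicated().sum())
print("Duplicated rows:", n_duplicates)

# Drop the repeats and rebuild a clean RangeIndex.
deduped_df = toy_df.drop_duplicates().reset_index(drop=True)
print("Shape after dropping duplicates:", deduped_df.shape)
```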

### **4. Feature Engineering**

#### **Train/Test Split**
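
A minimal sketch of a stratified split using `train_test_split`, which is imported in section 1. The toy data, `test_size=0.2`, and `random_state=42` are assumptions for illustration; `disease` is used as the target because it is the label column in the dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for hd_df with a balanced boolean target (values are made up).
toy_df = pd.DataFrame(
    {
        "age": [63.0, 67.0, 41.0, 56.0, 57.0, 44.0, 52.0, 54.0, 48.0, 49.0],
        "chol": [233.0, 286.0, 204.0, 236.0, 354.0, 263.0, 199.0, 239.0, 275.0, 266.0],
        "disease": [False, True, False, False, True, False, True, True, False, True],
    }
)

# Separate the features from the target.
X = toy_df.drop(columns="disease")
y = toy_df["disease"]

# stratify=y keeps the class proportions the same in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```

Splitting before fitting any imputer or encoder keeps statistics learned from the training rows from leaking into the test rows.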

#### **Preprocessing Pipeline**

### **5. Conclusions and Results**