## Heart Failure Dataset
Cardiovascular diseases (CVDs) are the **leading cause of death globally**, responsible for approximately **17.9 million deaths annually**, or about **31% of all global deaths**. Most of these deaths are caused by heart attacks and strokes, and about a third occur in people under 70.

Early detection is critical, especially for individuals at high cardiovascular risk (e.g., those with **hypertension, diabetes, or hyperlipidemia**). This dataset is designed to help build machine learning models to **predict heart disease** based on key health indicators.

---

### Features

The dataset contains **11 input features** and **1 target** (`HeartDisease`):

| Feature          | Description                                                  |
| ---------------- | ------------------------------------------------------------ |
| `Age`            | Age of the patient (years)                                   |
| `Sex`            | Sex of the patient (`M`: Male, `F`: Female)                  |
| `ChestPainType`  | Chest pain type (`TA`, `ATA`, `NAP`, `ASY`)                  |
| `RestingBP`      | Resting blood pressure (mm Hg)                               |
| `Cholesterol`    | Serum cholesterol (mg/dl)                                    |
| `FastingBS`      | Fasting blood sugar (`1` if >120 mg/dl, else `0`)            |
| `RestingECG`     | Resting ECG results (`Normal`, `ST`, `LVH`)                  |
| `MaxHR`          | Maximum heart rate achieved                                  |
| `ExerciseAngina` | Exercise-induced angina (`Y`, `N`)                           |
| `Oldpeak`        | ST depression induced by exercise                            |
| `ST_Slope`       | Slope of the peak exercise ST segment (`Up`, `Flat`, `Down`) |
| `HeartDisease`   | Target: `1` = heart disease, `0` = no disease                |

---


The original heart disease datasets on which this dataset is based are publicly available from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/).

---

### Citation

* **Source:** [Kaggle - Heart Failure Prediction](https://www.kaggle.com/fedesoriano/heart-failure-prediction)
* **Creator:** fedesoriano (September 2021)

---

#### Acknowledgements

**Institutions & Contributors:**

* Hungarian Institute of Cardiology, Budapest: *Andras Janosi, M.D.*
* University Hospital, Zurich: *William Steinbrunn, M.D.*
* University Hospital, Basel: *Matthias Pfisterer, M.D.*
* V.A. Medical Center, Long Beach & Cleveland Clinic Foundation: *Robert Detrano, M.D., Ph.D.*
* **Donor:** David W. Aha ([aha@ics.uci.edu](mailto:aha@ics.uci.edu))



**IMPORTING DEPENDENCIES**

In [14]:
import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

### Data Collection and Pre-Processing

In [2]:
file_path = "C:/Users/USER/Desktop/Datasets/heart.csv"

In [3]:
df = pd.read_csv(file_path)
df.head()

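|     | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
| --- | --- | --- | ------------- | --------- | ----------- | --------- | ---------- | ----- | -------------- | ------- | -------- | ------------ |
| 0   | 40  | M   | ATA           | 140       | 289         | 0         | Normal     | 172   | N              | 0.0     | Up       | 0            |
| 1   | 49  | F   | NAP           | 160       | 180         | 0         | Normal     | 156   | N              | 1.0     | Flat     | 1            |
| 2   | 37  | M   | ATA           | 130       | 283         | 0         | ST         | 98    | N              | 0.0     | Up       | 0            |
| 3   | 48  | F   | ASY           | 138       | 214         | 0         | Normal     | 108   | Y              | 1.5     | Flat     | 1            |
| 4   | 54  | M   | NAP           | 150       | 195         | 0         | Normal     | 122   | N              | 0.0     | Up       | 0            |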


In [4]:
df.shape

(918, 12)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB
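
`df.info()` reports no nulls, but a non-null count does not rule out placeholder values such as `0` in a column where zero is physiologically implausible. The sketch below is one hedged way to check for that. The helper name `count_zeros` and the toy values are hypothetical, and whether the real `heart.csv` contains such zeros is not shown in this notebook.

```python
import pandas as pd


def count_zeros(frame, cols):
    """Count exact zeros per column (hypothetical helper, not part of the notebook)."""
    return (frame[cols] == 0).sum()


# Toy data with made-up values; with the real data, pass `df` instead.
toy = pd.DataFrame({
    "Cholesterol": [289, 0, 283, 0],
    "RestingBP": [140, 160, 0, 138],
})
zeros = count_zeros(toy, ["Cholesterol", "RestingBP"])
print(zeros.to_dict())  # {'Cholesterol': 2, 'RestingBP': 1}
```

If the real data did contain such zeros, they would also distort the mean and standard deviation used in the Z-score step that follows.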


**OUTLIER REMOVAL**

In [6]:
# columns to check for outliers
cols_to_check = ["Oldpeak", "MaxHR", "Cholesterol", "Age", "RestingBP"]

# calculate mean and standard deviation
mean = df[cols_to_check].mean()
std = df[cols_to_check].std()

# calculate Z-scores manually
z_scores = (df[cols_to_check] - mean) / std

# keep rows where all Z-scores are between -3 and 3
mask = (np.abs(z_scores) < 3).all(axis=1)
df_cleaned = df[mask]

# Check before and after shapes
print(f"Original shape: {df.shape}")
print(f"Shape after outlier removal: {df_cleaned.shape}")

Original shape: (918, 12)
Shape after outlier removal: (899, 12)
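
The Z-score filter above can be wrapped in a small reusable function. This is a minimal sketch that follows the same logic as the cell (sample standard deviation, a strict `|z| < 3` test on every listed column). The function name `zscore_filter` and the toy readings are assumptions made for illustration.

```python
import pandas as pd


def zscore_filter(frame, cols, threshold=3.0):
    """Keep rows whose absolute Z-score is below `threshold` in every column of `cols`."""
    # pandas .std() uses the sample standard deviation (ddof=1), as in the cell above
    z = (frame[cols] - frame[cols].mean()) / frame[cols].std()
    keep = (z.abs() < threshold).all(axis=1)
    return frame.loc[keep].copy()  # copy so later edits do not touch the original frame


# Toy data (hypothetical values): 20 typical readings plus one extreme reading.
toy = pd.DataFrame({"RestingBP": [120 + i % 5 for i in range(20)] + [400]})
filtered = zscore_filter(toy, ["RestingBP"])
print(len(toy), "->", len(filtered))  # 21 -> 20
```

With the notebook's data, the equivalent call would be `zscore_filter(df, cols_to_check)`.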


**FEATURE ENCODING**

In [7]:
df_encoded = pd.get_dummies(
    df_cleaned,
    columns=['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope'],
    drop_first=True,
    dtype='int'
)

df_encoded.head()


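|     | Age | RestingBP | Cholesterol | FastingBS | MaxHR | Oldpeak | HeartDisease | Sex_M | ChestPainType_ATA | ChestPainType_NAP | ChestPainType_TA | RestingECG_Normal | RestingECG_ST | ExerciseAngina_Y | ST_Slope_Flat | ST_Slope_Up |
| --- | --- | --------- | ----------- | --------- | ----- | ------- | ------------ | ----- | ----------------- | ----------------- | ---------------- | ----------------- | ------------- | ---------------- | ------------- | ----------- |
| 0   | 40  | 140       | 289         | 0         | 172   | 0.0     | 0            | 1     | 1                 | 0                 | 0                | 1                 | 0             | 0                | 0             | 1           |
| 1   | 49  | 160       | 180         | 0         | 156   | 1.0     | 1            | 0     | 0                 | 1                 | 0                | 1                 | 0             | 0                | 1             | 0           |
| 2   | 37  | 130       | 283         | 0         | 98    | 0.0     | 0            | 1     | 1                 | 0                 | 0                | 0                 | 1             | 0                | 0             | 1           |
| 3   | 48  | 138       | 214         | 0         | 108   | 1.5     | 1            | 0     | 0                 | 0                 | 0                | 1                 | 0             | 1                | 1             | 0           |
| 4   | 54  | 150       | 195         | 0         | 122   | 0.0     | 0            | 1     | 0                 | 1                 | 0                | 1                 | 0             | 0                | 0             | 1           |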
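
To see what `drop_first=True` does, it helps to encode a single column in isolation. The toy frame below is an assumption made for illustration. The first category in sorted order (`Down`) is dropped and becomes the implicit baseline, which is why the encoded data has `ST_Slope_Flat` and `ST_Slope_Up` but no `ST_Slope_Down`.

```python
import pandas as pd

# Toy column with the three ST_Slope categories (hypothetical rows)
toy = pd.DataFrame({"ST_Slope": ["Up", "Flat", "Down", "Up"]})

encoded = pd.get_dummies(toy, columns=["ST_Slope"], drop_first=True, dtype="int")
print(encoded.columns.tolist())  # ['ST_Slope_Flat', 'ST_Slope_Up']

# The "Down" row is represented by zeros in both remaining columns
print(encoded.iloc[2].tolist())  # [0, 0]
```

Dropping one level per feature avoids perfectly collinear dummy columns. That matters most for linear models and is largely harmless for trees and kernel SVMs.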


**SEPARATE FEATURES AND TARGET**

In [8]:
X = df_encoded.drop(['HeartDisease'], axis=1)
y = df_encoded['HeartDisease']

In [9]:
print(X)

     Age  RestingBP  Cholesterol  FastingBS  ...  RestingECG_ST  ExerciseAngina_Y  ST_Slope_Flat  ST_Slope_Up
0     40        140          289          0  ...              0                 0              0            1
1     49        160          180          0  ...              0                 0              1            0
2     37        130          283          0  ...              1                 0              0            1
3     48        138          214          0  ...              0                 1              1            0
4     54        150          195          0  ...              0                 0              0            1
..   ...        ...          ...        ...  ...            ...               ...            ...          ...
913   45        110          264          0  ...              0                 0              1            0
914   68        144          193          1  ...              0                 0              1            0
915   57  

In [10]:
print(y)

0      0
1      1
2      0
3      1
4      0
      ..
913    1
914    1
915    1
916    1
917    0
Name: HeartDisease, Length: 899, dtype: int64


**SCALING**

In [11]:

# Columns to scale
cols_to_scale = ['MaxHR', 'Cholesterol', 'RestingBP', 'Oldpeak', 'Age']

# Initialize the scaler
scaler = StandardScaler()

# Fit and transform only the selected columns
X[cols_to_scale] = scaler.fit_transform(X[cols_to_scale])


In [12]:
print(X)

          Age  RestingBP  Cholesterol  FastingBS  ...  RestingECG_ST  ExerciseAngina_Y  ST_Slope_Flat  ST_Slope_Up
0   -1.428154   0.465900     0.849636          0  ...              0                 0              0            1
1   -0.475855   1.634714    -0.168122          0  ...              0                 0              1            0
2   -1.745588  -0.118507     0.793612          0  ...              1                 0              0            1
3   -0.581666   0.349019     0.149344          0  ...              0                 1              1            0
4    0.053200   1.050307    -0.028064          0  ...              0                 0              0            1
..        ...        ...          ...        ...  ...            ...               ...            ...          ...
913 -0.899099  -1.287320     0.616205          0  ...              0                 0              1            0
914  1.534554   0.699663    -0.046738          1  ...              0            
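
Note that the scaling cell fits `StandardScaler` on all rows before the train/test split, so statistics from the test rows (and from each validation fold) influence the scaling. One common way to avoid this is to put the scaler inside a pipeline so it is refit on the training portion of every fold. The following is a minimal sketch under that assumption. It uses synthetic data and hypothetical variable names (`X_demo`, `y_demo`), not the notebook's actual setup.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the encoded features (made-up distributions)
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    "Age": rng.normal(54, 9, n),
    "MaxHR": rng.normal(137, 25, n),
    "FastingBS": rng.integers(0, 2, n),
})
y_demo = (X_demo["MaxHR"] < 137).astype(int)  # toy target, for illustration only

# Scale only the continuous columns; pass binary columns through unchanged
preprocess = ColumnTransformer(
    [("scale", StandardScaler(), ["Age", "MaxHR"])],
    remainder="passthrough",
)
model = Pipeline([("preprocess", preprocess), ("svc", SVC())])

# The scaler is fit on the training part of each fold only
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
print("Mean accuracy:", scores.mean())
```

With the notebook's data, the same pipeline would take the unscaled `X` and the list `cols_to_scale` in place of the toy column names.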

**SEPARATE TRAINING AND TEST DATA**

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=13)
print(X.shape, X_train.shape, X_test.shape)

(899, 15) (764, 15) (135, 15)
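
The split above is purely random. If keeping the class ratio identical in both partitions matters, `train_test_split` accepts a `stratify` argument. The sketch below shows it on a toy target with a 40/60 class balance; the arrays are assumptions for illustration, and the notebook itself does not stratify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 40 negatives and 60 positives (hypothetical)
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 40 + [1] * 60)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.15,
    stratify=y_demo,   # preserve the 40/60 ratio in both partitions
    random_state=13,
)
print(X_tr.shape, X_te.shape)    # (85, 1) (15, 1)
print(y_tr.mean(), y_te.mean())  # both 0.6
```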


### Model Selection and Training

**Cross-Validation (Support Vector Machines)**

In [15]:
# Perform 5-fold cross-validation
cv_scores = cross_val_score(SVC(), X_train, y_train, cv=5, scoring='accuracy')

# Print individual fold scores and mean accuracy
print("Cross-validation scores:", cv_scores)
print("Mean Accuracy:", cv_scores.mean())

Cross-validation scores: [0.85620915 0.88888889 0.83660131 0.85620915 0.89473684]
Mean Accuracy: 0.8665290677674579


**Bagging (Support Vector Machines)**

In [18]:
# Base model
base_model = SVC()

# Bagging wrapper
bagging_model = BaggingClassifier(
    estimator=base_model,
    n_estimators=10,          # number of SVC models in the ensemble
    max_samples=0.8,          # fraction of the training data used for each base model
    max_features=0.8,         # use 80% features
    bootstrap=True,           # sample with replacement
    n_jobs=-1,                # use all processors
    random_state=42
)

# Cross-validation
cv_scores = cross_val_score(bagging_model, X_train, y_train, cv=5, scoring='accuracy')

# Results
print("Bagging with SVM - CV Scores:", cv_scores)
print("Mean Accuracy:", cv_scores.mean())


Bagging with SVM - CV Scores: [0.8627451  0.88235294 0.83006536 0.85620915 0.88815789]
Mean Accuracy: 0.8639060887512902


**Cross-Validation (Decision Tree)**

In [20]:
# Perform 5-fold cross-validation
cv_scores = cross_val_score(DecisionTreeClassifier(), X_train, y_train, cv=5, scoring='accuracy')

# Print individual fold scores and mean accuracy
print("Cross-validation scores:", cv_scores)
print("Mean Accuracy:", cv_scores.mean())

Cross-validation scores: [0.83006536 0.79738562 0.77777778 0.84313725 0.75657895]
Mean Accuracy: 0.8009889920880633


**Bagging (Decision Tree)**

In [21]:
# Base model
base_model = DecisionTreeClassifier()

# Bagging wrapper
bagging_model = BaggingClassifier(
    estimator=base_model,
    n_estimators=10,          # number of DecisionTreeClassifier models in the ensemble
    max_samples=0.8,          # fraction of the training data used for each base model
    max_features=0.8,         # use 80% features
    bootstrap=True,           # sample with replacement
    n_jobs=-1,                # use all processors
    random_state=42
)

# Cross-validation
cv_scores = cross_val_score(bagging_model, X_train, y_train, cv=5, scoring='accuracy')

# Results
print("Bagging with Decision Tree - CV Scores:", cv_scores)
print("Mean Accuracy:", cv_scores.mean())


Bagging with Decision Tree - CV Scores: [0.8496732  0.84313725 0.79084967 0.8496732  0.84868421]
Mean Accuracy: 0.8364035087719298
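
The four cross-validation cells above repeat the same pattern, so they can be collapsed into one loop over named candidates. This is a hedged sketch on synthetic data from `make_classification`. The helper `bagged` and the names `X_demo`, `y_demo`, and `candidates` are hypothetical, and the scores it prints are not the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=300, n_features=15, random_state=0)


def bagged(base):
    """Wrap a base estimator with the same bagging settings used in the notebook."""
    # The base estimator is passed positionally so the call does not depend on
    # the keyword name (`estimator` in recent scikit-learn releases).
    return BaggingClassifier(
        base,
        n_estimators=10,
        max_samples=0.8,
        max_features=0.8,
        bootstrap=True,
        random_state=42,
    )


candidates = {
    "SVM": SVC(),
    "SVM (bagged)": bagged(SVC()),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Decision Tree (bagged)": bagged(DecisionTreeClassifier(random_state=0)),
}

# Mean 5-fold CV accuracy per candidate
results = {
    name: cross_val_score(est, X_demo, y_demo, cv=5, scoring="accuracy").mean()
    for name, est in candidates.items()
}
for name, mean_acc in results.items():
    print(f"{name:<24} {mean_acc:.3f}")
```

Fixing `random_state` on the decision tree makes reruns reproducible; the notebook's tree cells leave it unset, so their scores can vary slightly between runs.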


| Model         | CV Accuracy (No Bagging) | CV Accuracy (With Bagging) |
| ------------- | ------------------------ | -------------------------- |
| Decision Tree | 80.1%                    | **83.6%**                  |
| SVM           | **86.7%**                | 86.4%                      |


| Property                 | Decision Tree                     | SVM                                              |
| ------------------------ | --------------------------------- | ------------------------------------------------ |
| Tendency to Overfit      | ✅ Yes (High Variance)             | ❌ Less so (Lower Variance)                       |
| Suitable for Bagging     | ✅ Yes                             | ⚠️ Limited Benefit                               |
| Interpretability         | ✅ Easy to Interpret               | ❌ Hard to Interpret                              |
| Sensitive to Noise       | ✅ Yes                             | ❌ Less Sensitive                                 |
| Performance with Bagging | ✅ Improves (80.1% → 83.6% here)   | ❌ No Gain Here (86.7% → 86.4%), May Slow Training |