# AIG 100 _ Machine Learning _ Project 3

|   AIG 100 | Project 3          |
|------|------------------|
| Name | Student Number |
| Siavash Tadayonnia | 123782252 |

----------------------------------------------------



# 🔵 Introduction / Problem Definition

### 🟩 **Introduction**

Early detection of heart failure is important because it can help prevent serious complications, reduce healthcare costs, and improve patients’ quality of life.

In this project, we use a large healthcare dataset containing 253,680 patient records. The dataset includes clinical and demographic features such as age category, high blood pressure and high cholesterol indicators, BMI, and other indicators related to cardiovascular health.

### 🟩 **Problem Definition**

The goal of this project is to build a machine learning model that can predict the risk of heart failure based on a patient’s medical characteristics.

This problem can be framed as a binary classification task, where:

Input (Features): Clinical measurements and patient information

Output (Target):

- 0 → No heart failure / low risk

- 1 → Heart failure present / high risk

Such a model could potentially support medical professionals by providing early risk assessments.

### 🟩 **Why This Dataset?**

This dataset was selected based on the following criteria:

- Real-world relevance: Heart failure prediction is a meaningful and impactful healthcare problem.

- Dataset size: Around 253,000 records, which is suitable for training advanced machine learning models.

- Structured format: Clean, tabular dataset with clear documentation.

- Feature richness: 21 input features, which gives the models a broad set of health indicators to learn from.

- Applicability: Supports advanced models such as Random Forest, Gradient Boosting, and MLP neural networks.

### 🟩 **Project Objective**

In this project, I aim to:

- Explore and preprocess the dataset

- Implement advanced machine learning models

- Evaluate model performance using appropriate metrics

- Interpret the results and identify important features

- Discuss the potential real-world implications of using such a model

# 🔵 Dataset Description

### 🟩 Link to Dataset: [Kaggle Link](https://www.kaggle.com/datasets/alexteboul/heart-disease-health-indicators-dataset)

### 🟩 Dataset Description
- The dataset has 253,680 patient records.
- Each record includes several clinical and demographic features that are commonly used to assess cardiovascular health.
- The dataset contains 21 input features along with one target variable. Most variables are binary (0/1), indicating the presence or absence of a condition, while others provide ordinal or continuous health measurements such as BMI, general health rating, and number of unhealthy days.

### 🟩 Features Overview

- HeartDiseaseorAttack:
Target variable. 1 = Coronary heart disease or myocardial infarction; 0 = No history.

- HighBP:
1 = High blood pressure; 0 = Normal blood pressure.

- HighChol:
1 = High cholesterol; 0 = Normal cholesterol.

- CholCheck:
1 = Cholesterol checked in the past 5 years; 0 = Not checked.

- BMI:
Body Mass Index (weight/height²), numeric.

- Smoker:
1 = Smoked at least 100 cigarettes in lifetime; 0 = Otherwise.

- Stroke:
1 = History of stroke; 0 = No stroke.

- Diabetes:
0 = No diabetes; 1 = Pre-diabetes or borderline diabetes; 2 = Diabetes (the column ranges from 0 to 2 in the statistical summary).

- PhysActivity:
1 = Physical activity in the past 30 days; 0 = No activity.

- Fruits:
1 = Consumes fruit at least once per day; 0 = Does not.

- Veggies:
1 = Consumes vegetables at least once per day; 0 = Does not.

- HvyAlcoholConsump:
1 = Heavy alcohol consumption; 0 = Not a heavy drinker.

- AnyHealthcare:
1 = Has health care coverage; 0 = No coverage.

- NoDocbcCost:
1 = Could not see a doctor due to cost in the past year; 0 = No cost-related barrier.

- GenHlth:
General health rating on a scale from 1 (excellent) to 5 (poor).

- MentHlth:
Number of days (0–30) in the past month where mental health was “not good.”

- PhysHlth:
Number of days (0–30) in the past month where physical health was “not good.”

- DiffWalk:
1 = Difficulty walking or climbing stairs; 0 = No difficulty.

- Sex:
1 = Male; 0 = Female.

- Age:
Age category represented as ordinal values (ranges from 1 to 13).

- Education:
Education level from 1 = Never attended school to 6 = College graduate.

- Income:
Income category from 1 = Lowest bracket to 8 = Highest bracket.


### 🟩 Target Variable

HeartDiseaseorAttack:
Binary classification target

1 → Patient has had heart disease or a heart attack

0 → No history of heart disease

# 🔵 Data Preprocessing

### 🟩 Import necessary libraries

In [11]:
# Importing required libraries for the project

import pandas as pd
import numpy as np


# For reading the data
import kagglehub
import os

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Models (will decide the main model later)
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# Evaluation metrics
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix



### 🟩 Loading the Dataset



In [12]:

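# kagglehub and os were already imported in the previous cell

# Download the latest version of the dataset and get its local folder
path = kagglehub.dataset_download("alexteboul/heart-disease-health-indicators-dataset")

# Build the path to the CSV file inside the downloaded folder
csv_path = os.path.join(path, "heart_disease_health_indicators_BRFSS2015.csv")

df = pd.read_csv(csv_path)

# Show the data type of each column, then preview the first rows
print(df.dtypes)

df.head()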


HeartDiseaseorAttack    float64
HighBP                  float64
HighChol                float64
CholCheck               float64
BMI                     float64
Smoker                  float64
Stroke                  float64
Diabetes                float64
PhysActivity            float64
Fruits                  float64
Veggies                 float64
HvyAlcoholConsump       float64
AnyHealthcare           float64
NoDocbcCost             float64
GenHlth                 float64
MentHlth                float64
PhysHlth                float64
DiffWalk                float64
Sex                     float64
Age                     float64
Education               float64
Income                  float64
dtype: object


Unnamed: 0,HeartDiseaseorAttack,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,Diabetes,PhysActivity,Fruits,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
0,0.0,1.0,1.0,1.0,40.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,5.0,18.0,15.0,1.0,0.0,9.0,4.0,3.0
1,0.0,0.0,0.0,0.0,25.0,1.0,0.0,0.0,1.0,0.0,...,0.0,1.0,3.0,0.0,0.0,0.0,0.0,7.0,6.0,1.0
2,0.0,1.0,1.0,1.0,28.0,0.0,0.0,0.0,0.0,1.0,...,1.0,1.0,5.0,30.0,30.0,1.0,0.0,9.0,4.0,8.0
3,0.0,1.0,0.0,1.0,27.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,11.0,3.0,6.0
4,0.0,1.0,1.0,1.0,24.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,3.0,0.0,0.0,0.0,11.0,5.0,4.0


### 🟩 Inspecting Data


In [13]:

# The first 5 rows of the dataset
print("First 5 rows of the dataset:")
display(df.head())
print ('\n','=' *40 , '\n')


# shape of the dataset
print("\nDataset shape (rows, columns):", df.shape)
print ('\n','=' *40 , '\n')

# data types and non-null counts
print("\nDataset info:")
df.info()
print ('\n','=' *40 , '\n')


# Statistical summary of numerical features
print("\nStatistical summary of numerical features:")
display(df.describe())
print ('\n','=' *40 , '\n')


# missing values
print("\nMissing values per column:")
print(df.isnull().sum())
print ('\n','=' *40 , '\n')

# Check the distribution of the target variable
print("\nTarget variable distribution:")
print(df["HeartDiseaseorAttack"].value_counts())
print("\nPercentage of classes:")
print(df["HeartDiseaseorAttack"].value_counts(normalize=True))


First 5 rows of the dataset:


Unnamed: 0,HeartDiseaseorAttack,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,Diabetes,PhysActivity,Fruits,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
0,0.0,1.0,1.0,1.0,40.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,5.0,18.0,15.0,1.0,0.0,9.0,4.0,3.0
1,0.0,0.0,0.0,0.0,25.0,1.0,0.0,0.0,1.0,0.0,...,0.0,1.0,3.0,0.0,0.0,0.0,0.0,7.0,6.0,1.0
2,0.0,1.0,1.0,1.0,28.0,0.0,0.0,0.0,0.0,1.0,...,1.0,1.0,5.0,30.0,30.0,1.0,0.0,9.0,4.0,8.0
3,0.0,1.0,0.0,1.0,27.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,11.0,3.0,6.0
4,0.0,1.0,1.0,1.0,24.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,3.0,0.0,0.0,0.0,11.0,5.0,4.0





Dataset shape (rows, columns): (253680, 22)



Dataset info:
<class 'pandas.DataFrame'>
RangeIndex: 253680 entries, 0 to 253679
Data columns (total 22 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   HeartDiseaseorAttack  253680 non-null  float64
 1   HighBP                253680 non-null  float64
 2   HighChol              253680 non-null  float64
 3   CholCheck             253680 non-null  float64
 4   BMI                   253680 non-null  float64
 5   Smoker                253680 non-null  float64
 6   Stroke                253680 non-null  float64
 7   Diabetes              253680 non-null  float64
 8   PhysActivity          253680 non-null  float64
 9   Fruits                253680 non-null  float64
 10  Veggies               253680 non-null  float64
 11  HvyAlcoholConsump     253680 non-null  float64
 12  AnyHealthcare         253680 non-null  float64
 13  NoDocbcCost           253680 non-null  float64
 14 

Unnamed: 0,HeartDiseaseorAttack,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,Diabetes,PhysActivity,Fruits,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
count,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,...,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0,253680.0
mean,0.094186,0.429001,0.424121,0.96267,28.382364,0.443169,0.040571,0.296921,0.756544,0.634256,...,0.951053,0.084177,2.511392,3.184772,4.242081,0.168224,0.440342,8.032119,5.050434,6.053875
std,0.292087,0.494934,0.49421,0.189571,6.608694,0.496761,0.197294,0.69816,0.429169,0.481639,...,0.215759,0.277654,1.068477,7.412847,8.717951,0.374066,0.496429,3.05422,0.985774,2.071148
min,0.0,0.0,0.0,0.0,12.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0
25%,0.0,0.0,0.0,1.0,24.0,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,6.0,4.0,5.0
50%,0.0,0.0,0.0,1.0,27.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,8.0,5.0,7.0
75%,0.0,1.0,1.0,1.0,31.0,1.0,0.0,0.0,1.0,1.0,...,1.0,0.0,3.0,2.0,3.0,0.0,1.0,10.0,6.0,8.0
max,1.0,1.0,1.0,1.0,98.0,1.0,1.0,2.0,1.0,1.0,...,1.0,1.0,5.0,30.0,30.0,1.0,1.0,13.0,6.0,8.0





Missing values per column:
HeartDiseaseorAttack    0
HighBP                  0
HighChol                0
CholCheck               0
BMI                     0
Smoker                  0
Stroke                  0
Diabetes                0
PhysActivity            0
Fruits                  0
Veggies                 0
HvyAlcoholConsump       0
AnyHealthcare           0
NoDocbcCost             0
GenHlth                 0
MentHlth                0
PhysHlth                0
DiffWalk                0
Sex                     0
Age                     0
Education               0
Income                  0
dtype: int64



Target variable distribution:
HeartDiseaseorAttack
0.0    229787
1.0     23893
Name: count, dtype: int64

Percentage of classes:
HeartDiseaseorAttack
0.0    0.905814
1.0    0.094186
Name: proportion, dtype: float64


### ⚡ Key Takeaways of Data Inspection

- 253,680 rows and 22 columns

- All features are numerical values (float64)

- There are no missing values

- Target variable distribution:

  * 0 → 90.58% (No heart disease or attack)

  * 1 → 9.42% (Heart disease or heart attack)

- The dataset is imbalanced, which is important to consider during model training.

Overall, the dataset is clean, large, and suitable for training advanced machine learning models. The main consideration going forward is handling the ***class imbalance*** in the target variable.
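
To make the effect of the imbalance concrete, here is a small worked calculation. It is only a sketch that uses the class counts printed above; the variable names are illustrative and not part of the notebook. It shows the accuracy of a trivial "model" that always predicts the majority class.

```python
# Class counts taken from the target distribution printed above
counts = {0: 229_787, 1: 23_893}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of always predicting the majority class
baseline_accuracy = counts[majority_class] / total

# Such a predictor never flags a positive case, so its recall for class 1 is zero
baseline_positive_recall = 0.0 if majority_class == 0 else 1.0

print(f"Total records:       {total}")
print(f"Baseline accuracy:   {baseline_accuracy:.4f}")
print(f"Recall for class 1:  {baseline_positive_recall:.2f}")
```

A model therefore needs to be judged against this baseline of roughly 0.906 accuracy, and on how many positive cases it actually finds, rather than on accuracy alone.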

In [15]:
# Save csv for future processes

from pathlib import Path


PROJECT_ROOT = Path("..")

RAW_DIR = PROJECT_ROOT / "data" / "raw"
RAW_DIR.mkdir(parents=True, exist_ok=True)

out_path = RAW_DIR / "heart.csv"

df.to_csv(out_path, index=False)

print("Saved to:", out_path.resolve())


Saved to: C:\Users\sia\Desktop\capstone\Assignment\Heart-Disease-ML-Deployment\data\raw\heart.csv


# 🔵 Model(s) and Method

### 🟩 Train / Test Split

In [4]:


X = df.drop("HeartDiseaseorAttack", axis=1)
y = df["HeartDiseaseorAttack"].astype(int)


X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    stratify=y  # keep the same class percentage in train and test
)
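
The `stratify=y` argument keeps the class ratio the same in the training and test sets. The sketch below illustrates the idea in plain Python using the class counts from the data inspection. It is not scikit-learn's implementation: `stratified_split` and its `seed` parameter are hypothetical names introduced here for illustration. The fixed seed is also an addition; the call above sets no `random_state`, so repeated runs of the notebook can produce slightly different splits and metrics.

```python
import random
from collections import Counter


def stratified_split(labels, test_size=0.2, seed=0):
    """Split indices so each class keeps the same share in train and test."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable

    # Group row indices by class label
    indices_by_class = {}
    for index, label in enumerate(labels):
        indices_by_class.setdefault(label, []).append(index)

    train_idx, test_idx = [], []
    for class_indices in indices_by_class.values():
        rng.shuffle(class_indices)
        n_test = round(len(class_indices) * test_size)
        test_idx.extend(class_indices[:n_test])
        train_idx.extend(class_indices[n_test:])
    return train_idx, test_idx


# Class counts from the data inspection step
labels = [0] * 229_787 + [1] * 23_893

train_idx, test_idx = stratified_split(labels, test_size=0.2, seed=0)
train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)

print("Train:", dict(train_counts))
print("Test: ", dict(test_counts))
print("Positive share in test:", round(test_counts[1] / len(test_idx), 4))
```

With these counts the per-class sizes agree with the training distribution and the test-set support values reported later in the notebook.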



### 🟩 Scaling

In [5]:

scaler = StandardScaler()


scaler.fit(X_train)


X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)


X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)


X_train_scaled.head()


Unnamed: 0,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,Diabetes,PhysActivity,Fruits,Veggies,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
0,-0.867237,1.163608,0.197394,0.697512,1.121256,-0.205264,2.439953,-1.761447,0.759064,0.481231,...,0.226807,-0.303655,2.32857,3.614343,2.955268,2.224242,-0.887617,-0.010732,-1.06553,-0.991431
1,-0.867237,-0.859396,0.197394,1.605293,-0.891857,-0.205264,-0.424948,-1.761447,0.759064,0.481231,...,0.226807,-0.303655,-1.415547,-0.429514,-0.486395,-0.449591,-0.887617,-0.338206,-1.06553,-0.991431
2,1.153087,-0.859396,0.197394,-0.664158,1.121256,-0.205264,-0.424948,0.567715,-1.317412,-2.078004,...,0.226807,-0.303655,-0.479518,-0.429514,-0.486395,-0.449591,1.126612,0.971691,-0.051543,0.45723
3,1.153087,1.163608,0.197394,0.092325,-0.891857,4.871774,-0.424948,0.567715,0.759064,-2.078004,...,0.226807,-0.303655,1.392541,-0.429514,-0.486395,-0.449591,1.126612,0.971691,-0.051543,0.940117
4,-0.867237,1.163608,0.197394,-0.512862,-0.891857,-0.205264,-0.424948,-1.761447,0.759064,0.481231,...,-4.409027,3.293207,-1.415547,0.244462,-0.371673,-0.449591,1.126612,-0.010732,0.962445,0.45723
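
For reference, the transformation applied by `StandardScaler` can be sketched in a few lines of plain Python: each value is shifted by the training mean and divided by the training (population) standard deviation. The helper names and the tiny BMI sample below are illustrative assumptions, not values from the actual split. The point is that the statistics come from the training data only and are then reused for the test data.

```python
import statistics


def fit_standardizer(train_values):
    """Learn the mean and population standard deviation from training data."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)  # population std (ddof = 0)
    return mean, std


def standardize(values, mean, std):
    """Apply z = (x - mean) / std using previously learned statistics."""
    return [(value - mean) / std for value in values]


# Illustrative BMI values only (not the real training split)
train_bmi = [40.0, 25.0, 28.0, 27.0]
test_bmi = [24.0]

mean, std = fit_standardizer(train_bmi)
train_scaled = standardize(train_bmi, mean, std)
test_scaled = standardize(test_bmi, mean, std)  # reuses the training statistics

print("Training mean and std:", mean, round(std, 4))
print("Scaled training values:", [round(z, 3) for z in train_scaled])
print("Scaled test value:", [round(z, 3) for z in test_scaled])
```

Fitting on the training set alone, as the notebook does, keeps information about the test set out of the preprocessing step.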


### 🟩 Handling Class Imbalance

The target variable in this dataset is highly imbalanced: only about 9% of the samples belong to the positive class (HeartDiseaseorAttack = 1).  
If I train a model directly on this data, it would likely predict the majority class most of the time, resulting in high accuracy but poor detection of high-risk individuals.

To address this imbalance, I use **class weighting**, which increases the importance of the minority class during training. With `"balanced"` class weights, the model automatically weights each class in inverse proportion to its frequency in the training data.

This approach is simple, does not change the dataset itself, and is supported by models such as Random Forest. scikit-learn's `MLPClassifier` does not accept class weights, so the MLP is handled separately by oversampling.  
Where it is available, class weighting is a good fit for this project because it avoids artificially altering the data: oversampling duplicates minority samples, which can encourage overfitting to them, and undersampling discards majority samples.
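
As a worked example of what `"balanced"` means, scikit-learn documents the weight of each class as `n_samples / (n_classes * class_count)`. The sketch below applies that formula to the training-set class counts reported later in this notebook; `balanced_class_weights` is a hypothetical helper written for illustration, not a library function.

```python
def balanced_class_weights(class_counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in class_counts.items()
    }


# Training-set class counts as printed in the oversampling cell of this notebook
train_counts = {0: 183_830, 1: 19_114}

weights = balanced_class_weights(train_counts)
for label, weight in weights.items():
    print(f"class {label}: weight = {weight:.3f}")

# After weighting, both classes carry the same total weight
weighted_totals = {label: weights[label] * train_counts[label] for label in train_counts}
print(weighted_totals)
```

Each positive sample ends up counting roughly ten times as much as a negative one, which is what gives the minority class equal total influence during training.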


### 🟩 Models
I implement two models, `Random Forest` and `MLP`, following the project instructions that allow the use of ensemble methods and neural networks. I keep the comparison simple and focused to avoid unnecessary complexity.

#### 💛 Random Forest Classifier


In [6]:

rf_model = RandomForestClassifier(
    n_estimators=200,          # number of trees
    max_depth=None,           # allow full depth
    class_weight='balanced',  # handle class imbalance
    n_jobs=-1                 # use all CPU cores for speed
)


rf_model.fit(X_train_scaled, y_train)


rf_pred = rf_model.predict(X_test_scaled)

print("Random Forest model training completed!")


Random Forest model training completed!


#### 💛 MLP (Neural Network) Classifier

I configure the MLP with two hidden layers and ReLU activation. This keeps the architecture relatively simple while still allowing the model to capture complex patterns in the data. I also enable early stopping to prevent overfitting and to stop training when the validation performance no longer improves.

⚠ **Important Consideration:**

Since `MLPClassifier` in scikit-learn does not support a `class_weight` parameter directly, I handle the class imbalance by `oversampling` the training data.


In [7]:
from sklearn.utils import resample, shuffle


scaler = StandardScaler()
scaler.fit(X_train)

# Keep the original index for alignment with y_train
X_train_scaled = pd.DataFrame(
    scaler.transform(X_train),
    index=X_train.index,
    columns=X_train.columns
)

X_test_scaled = pd.DataFrame(
    scaler.transform(X_test),
    index=X_test.index,
    columns=X_test.columns
)

X_train_scaled.head()


# Separate majority and minority classes in the training set
X_train_majority = X_train_scaled[y_train == 0]
X_train_minority = X_train_scaled[y_train == 1]

y_train_majority = y_train[y_train == 0]
y_train_minority = y_train[y_train == 1]

print("Original class distribution in y_train:")
print(y_train.value_counts())

# Upsample the minority class to match the majority size
X_minority_upsampled, y_minority_upsampled = resample(
    X_train_minority,
    y_train_minority,
    replace=True,                      # sample with replacement
    n_samples=len(y_train_majority),   # match majority class size
)

# Combine majority class with upsampled minority class
X_train_balanced = pd.concat([X_train_majority, X_minority_upsampled])
y_train_balanced = pd.concat([y_train_majority, y_minority_upsampled])

# Shuffle the balanced training set
X_train_balanced, y_train_balanced = shuffle(
    X_train_balanced,
    y_train_balanced,
)

print("\nBalanced class distribution in y_train_balanced:")
print(y_train_balanced.value_counts())


Original class distribution in y_train:
HeartDiseaseorAttack
0    183830
1     19114
Name: count, dtype: int64

Balanced class distribution in y_train_balanced:
HeartDiseaseorAttack
1    183830
0    183830
Name: count, dtype: int64


In [8]:
from sklearn.neural_network import MLPClassifier

mlp_model = MLPClassifier(
    hidden_layer_sizes=(64, 32),
    activation='relu',
    solver='adam',
    max_iter=300,
    early_stopping=True,
    validation_fraction=0.1,
    verbose=False
)

mlp_model.fit(X_train_balanced, y_train_balanced)

mlp_pred = mlp_model.predict(X_test_scaled)

print("MLP model training completed on balanced training data!")





MLP model training completed on balanced training data!


# 🔵 Evaluation

Metrics:

- **Accuracy**  
- **Confusion Matrix**  
- **Precision, Recall, and F1-score**  

Since the dataset is highly imbalanced, accuracy alone is not sufficient.  
Therefore, I pay special attention to the recall and F1-score of the positive class.

### 🟩 Random Forest Evaluation


In [9]:


print ("""

 ============================
   Evaluation for RandomForest
 ============================

""")

rf_accuracy = accuracy_score(y_test, rf_pred)
print("Accuracy:", rf_accuracy)
print ('\n','=' *40)

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, rf_pred))
print ('\n','=' *40)

print("\nClassification Report:")
print(classification_report(y_test, rf_pred))




   Evaluation for RandomForest


Accuracy: 0.9005637023021129


Confusion Matrix:
[[45208   749]
 [ 4296   483]]


Classification Report:
              precision    recall  f1-score   support

           0       0.91      0.98      0.95     45957
           1       0.39      0.10      0.16      4779

    accuracy                           0.90     50736
   macro avg       0.65      0.54      0.55     50736
weighted avg       0.86      0.90      0.87     50736
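
The class 1 figures in the report above can be re-derived directly from the confusion matrix, which makes the definitions of the metrics explicit. This is a minimal sketch; `binary_metrics` is an illustrative helper, and the four counts are copied from the Random Forest confusion matrix printed above (rows are true classes, columns are predicted classes).

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute accuracy plus precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)   # of the predicted positives, how many are real
    recall = tp / (tp + fn)      # of the real positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts from the Random Forest confusion matrix: [[TN, FP], [FN, TP]]
rf_metrics = binary_metrics(tn=45208, fp=749, fn=4296, tp=483)

for name, value in rf_metrics.items():
    print(f"{name:>9}: {value:.2f}")
```

Rounded to two decimals, these values match the accuracy and the class 1 row of the classification report.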



### 🟩 MLP Evaluation

In [10]:


print ("""

 ============================
       Evaluation for MLP
 ============================

""")

mlp_accuracy = accuracy_score(y_test, mlp_pred)
print("Accuracy:", mlp_accuracy)
print ('\n','=' *40)

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, mlp_pred))
print ('\n','=' *40)

print("\nClassification Report:")
print(classification_report(y_test, mlp_pred))




       Evaluation for MLP


Accuracy: 0.7261313465783664


Confusion Matrix:
[[33082 12875]
 [ 1020  3759]]


Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.72      0.83     45957
           1       0.23      0.79      0.35      4779

    accuracy                           0.73     50736
   macro avg       0.60      0.75      0.59     50736
weighted avg       0.90      0.73      0.78     50736



# 🔵 Results & Discussion

I evaluated both models using accuracy, confusion matrices, and precision/recall/F1-score. Because the dataset is highly imbalanced, I focus especially on how well each model identifies the positive class (heart disease or previous heart attack).

---

### 🟩 Random Forest Performance

- **Accuracy:** 0.901  
- **Strength:** Very high recall for class 0 (0.98)  
- **Weakness:** Very low recall for class 1 (0.10)  

- **Confusion Matrix:**  
  - True Negatives: 45,208  
  - False Positives: 749  
  - False Negatives: 4,296  
  - True Positives: 483  

The Random Forest model performs well overall and correctly classifies most of the negative cases.  
However, it struggles to detect positive cases. The recall for class 1 is only **0.10**, meaning it correctly identifies only about 10 percent of patients with heart disease. This is expected because Random Forest tends to favour the majority class in highly imbalanced datasets, even with class weighting enabled.

---

### 🟩 MLP (Neural Network) Performance

- **Accuracy:** 0.726  
- **Strength:** High recall for class 1 (0.79)  
- **Weakness:** Higher false-positive rate  

- **Confusion Matrix:**  
  - True Negatives: 33,082  
  - False Positives: 12,875  
  - False Negatives: 1,020  
  - True Positives: 3,759  

The MLP model shows a very different behaviour. Because the training data was balanced before fitting, the model becomes much more sensitive to the positive class. It achieves a recall of **0.79** for class 1, which means it identifies about 79 percent of high-risk patients.  
This comes at the cost of a lower overall accuracy, as the model produces more false positives: precision for class 1 is only 0.23.

---

### 🟩 Comparison Between Models

The two models show contrasting strengths:

- **Random Forest**:
  - High accuracy (0.901)
  - Excellent at identifying healthy individuals (class 0)
  - Very weak at identifying heart-disease patients (class 1 recall of 0.10)

- **MLP**:
  - Lower accuracy (0.726)
  - High recall for class 1 (0.79)
  - Much better for identifying patients at risk

This comparison highlights the importance of looking beyond accuracy in imbalanced datasets. In medical contexts, missing a positive case (false negative) is often more serious than generating a false positive. Because of this, recall for the positive class is a more meaningful metric than accuracy alone.
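
One way to express "recall matters more than precision" as a single number is the F-beta score with beta = 2, which weights recall more heavily than precision. This metric is not used elsewhere in the notebook; the sketch below is an illustrative addition that applies the standard formula to the confusion-matrix counts printed in the Evaluation section. `f_beta` is a helper written for this example.

```python
def f_beta(tp, fp, fn, beta=2.0):
    """F-beta for the positive class; beta > 1 weights recall above precision."""
    beta_sq = beta ** 2
    return (1 + beta_sq) * tp / ((1 + beta_sq) * tp + beta_sq * fn + fp)


# Positive-class counts from the confusion matrices printed in the Evaluation section
rf_counts = {"tp": 483, "fp": 749, "fn": 4296}
mlp_counts = {"tp": 3759, "fp": 12875, "fn": 1020}

f2_rf = f_beta(**rf_counts)
f2_mlp = f_beta(**mlp_counts)

print(f"Random Forest F2: {f2_rf:.3f}")
print(f"MLP F2:           {f2_mlp:.3f}")
```

On these counts the MLP scores roughly 0.53 against roughly 0.12 for the Random Forest, which is consistent with the recall-focused comparison above.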

---

### 🟩 Final Conclusion

Both models provide useful insights. However, the Problem Definition states that the goal is to *"build a machine learning model that can predict the risk of heart failure based on a patient’s medical characteristics."* Meeting that goal depends on detecting positive cases, so I choose the **`MLP`** model, which has much higher recall for the positive class.



# 🔵 Real World Implications

The results of this project have several important real-world implications, especially in the context of public health and early detection of heart disease. Heart disease remains one of the leading causes of death globally, and identifying high-risk individuals early can significantly improve health outcomes.

The models developed in this project can support healthcare systems in different ways. For example, the Random Forest model, with its high accuracy and strong performance on the negative class, could be used as a first-stage screening tool to efficiently process large populations and identify low-risk individuals. Risk screening is also relevant for physically demanding roles such as emergency response work or rescue operations, where it is important to identify whether an individual may have a high risk of heart disease. These roles can involve extreme physical stress, long hours, and exposure to dangerous environments.

On the other hand, the MLP model showed much higher recall for the positive class, which means it is better at identifying individuals who may actually have heart disease. Although it produces more false positives, this behaviour is often acceptable in medical contexts, where failing to detect a real positive case can be far more harmful than incorrectly flagging someone for additional tests. In practice, this model could be used as a second-stage risk-detection system or as part of a digital health platform that alerts users when their lifestyle or health indicators suggest elevated risk.

Overall, the models demonstrate how machine learning can help support early detection, reduce the burden on healthcare resources, and potentially improve patient outcomes. While these models should not replace medical professionals, they can serve as valuable decision-support tools that enhance awareness, improve screening efficiency, and provide an additional layer of insight in real-world healthcare environments.


# 🔵 Reflection



#### 🟩 a) Challenges and How I Addressed Them

During this project, one of the main challenges I faced was dealing with the highly imbalanced dataset. The positive class (individuals with heart disease or a previous heart attack) represented less than ten percent of the total population, which makes many models perform poorly on detecting high-risk cases. My first attempts using standard methods such as class weighting did not fully solve the issue, especially for models like MLP that do not directly support class weights. I addressed this by oversampling the minority class in the training set, which significantly improved the model’s sensitivity to positive cases.

---

#### 🟩 b) Insights Gained from Applying the Methods

First, I learned how model performance can change dramatically depending on the class distribution of the dataset. Even a model with almost 90 percent accuracy can be ineffective if it fails to identify high-risk individuals. This made me appreciate the importance of evaluating models using metrics that match the real-world goal of the application.

I also gained a better understanding of how ensemble methods and neural networks behave differently. Random Forest was strong in overall accuracy and stability, while the MLP model, after balancing the training data, became much more effective at detecting positive cases. This showed me that different models can serve different purposes, and that choosing a model should depend on the specific requirements of the real-world problem.

Finally, this project improved my understanding of the overall machine learning workflow: from data exploration and preprocessing, to handling imbalance, to model training and evaluation. Each step influenced the next one, and I realized how important it is to approach the project in a structured and iterative way. Overall, the experience helped me develop a clearer and more practical view of how machine learning methods can support real-world decision-making.


# 🔵 References

🔗 [Dataset (Kaggle)](https://www.kaggle.com/datasets/alexteboul/heart-disease-health-indicators-dataset)

🔗 [Random Forest Classifier Tutorial (DataCamp)](https://www.datacamp.com/tutorial/random-forests-classifier-python)

🔗 [Random Forest Classifier Tutorial (GeeksforGeeks)](https://www.geeksforgeeks.org/dsa/random-forest-classifier-using-scikit-learn/)

🔗 [MLP Tutorial (GeeksforGeeks)](https://www.geeksforgeeks.org/machine-learning/classification-using-sklearn-multi-layer-perceptron/)

🔗 [MLP Tutorial (freeCodeCamp)](https://www.freecodecamp.org/news/build-a-multilayer-perceptron-with-examples-and-python-code/)

🔗 [Handling Imbalanced Datasets (GeeksforGeeks)](https://www.geeksforgeeks.org/machine-learning/handling-imbalanced-data-for-classification/)

🔗 Stack Overflow forums


