## Predicting Heart Disease Risk: A Model Comparison Study

### Susan Murithi - 158864
### Salima Ali - 169964


- Goal: Use machine learning models to predict whether a patient has heart disease based on medical features.

- Task Type: Binary Classification

- Target Variable: target (1 = Disease, 0 = No Disease)

- Dataset: UCI Heart Disease Dataset from Kaggle

#### Problem Statement:
We want to build a predictive model that can determine the likelihood of a person having heart disease based on their health attributes. Early prediction can support timely medical intervention, which is crucial in preventing life-threatening complications.

In [1]:
import pandas as pd
df = pd.read_csv("heart.csv")
df.head()
df.info()
df.describe()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  target    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 |
| mean | 54.366337 | 0.683168 | 0.966997 | 131.623762 | 246.264026 | 0.148515 | 0.528053 | 149.646865 | 0.326733 | 1.039604 | 1.39934 | 0.729373 | 2.313531 | 0.544554 |
| std | 9.082101 | 0.466011 | 1.032052 | 17.538143 | 51.830751 | 0.356198 | 0.52586 | 22.905161 | 0.469794 | 1.161075 | 0.616226 | 1.022606 | 0.612277 | 0.498835 |
| min | 29.0 | 0.0 | 0.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 47.5 | 0.0 | 0.0 | 120.0 | 211.0 | 0.0 | 0.0 | 133.5 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 |
| 50% | 55.0 | 1.0 | 1.0 | 130.0 | 240.0 | 0.0 | 1.0 | 153.0 | 0.0 | 0.8 | 1.0 | 0.0 | 2.0 | 1.0 |
| 75% | 61.0 | 1.0 | 2.0 | 140.0 | 274.5 | 0.0 | 1.0 | 166.0 | 1.0 | 1.6 | 2.0 | 1.0 | 3.0 | 1.0 |
| max | 77.0 | 1.0 | 3.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 2.0 | 4.0 | 3.0 | 1.0 |


#### 3. Preprocessing
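- Encoding: the categorical features (sex, cp, thal, slope, etc.) are already stored as integer codes in `heart.csv` (every column is int64 or float64 in the `df.info()` output), so the code below applies no additional label encoding.

- Feature scaling: all features are standardized, which matters for k-NN and ANN. The decision tree is trained on the same scaled features, which does not change its splits.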

In [2]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X = df.drop("target", axis=1)
y = df["target"]

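# Standardize every feature to zero mean and unit variance
# (the scaler is fitted on all rows, before the split)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Hold out 50% of the rows for testing (152 of 303)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.5, random_state=42)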
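
The cell above fits the scaler on all 303 rows before splitting, so the test rows influence the means and standard deviations used for scaling. A minimal sketch of the alternative, estimating those statistics from the training rows only, is shown below in plain Python. The helper names (`fit_standardizer`, `standardize`) and the toy values are hypothetical and serve only as an illustration.

```python
from statistics import mean, pstdev

def fit_standardizer(rows):
    """Return per-column (mean, std) estimated from the given rows only."""
    columns = list(zip(*rows))
    # pstdev is the population standard deviation, the convention StandardScaler uses;
    # a constant column (std 0) is left unscaled by substituting 1.0
    return [(mean(col), pstdev(col) or 1.0) for col in columns]

def standardize(rows, stats):
    """Scale rows with previously fitted statistics."""
    return [[(value - mu) / sigma for value, (mu, sigma) in zip(row, stats)]
            for row in rows]

# Toy rows of (age, chol); illustrative values, not taken from heart.csv
train_rows = [[50, 200], [60, 240], [40, 220], [70, 260]]
test_rows = [[55, 230], [65, 300]]

stats = fit_standardizer(train_rows)         # statistics come from training rows only
train_scaled = standardize(train_rows, stats)
test_scaled = standardize(test_rows, stats)  # test rows reuse the training statistics

print(stats)
print(test_scaled)
```

With scikit-learn, the same idea corresponds to splitting first and then calling `scaler.fit_transform(X_train)` followed by `scaler.transform(X_test)`. Results obtained that way may differ slightly from the ones reported in this notebook.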


#### 4. Train & Evaluate Models

- K-Nearest Neighbors

In [3]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
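
To make explicit what `n_neighbors=5` means, here is a minimal sketch of the k-NN decision rule in plain Python: compute the Euclidean distance from a query point to every training point, take the k closest, and return the majority label. The function `knn_predict` and the toy points are hypothetical. `KNeighborsClassifier` uses an optimized neighbor search, but with its default settings it applies the same uniform majority vote over Euclidean distance.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=5):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query
    order = sorted(range(len(train_points)), key=lambda i: dist(train_points[i], query))
    nearest_labels = [train_labels[i] for i in order[:k]]
    # The most common label among the k neighbors wins
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy, already-scaled 2-D points (hypothetical values, not from heart.csv)
points = [(-1.0, -1.2), (-0.8, -0.9), (-1.1, -0.7),
          (0.9, 1.0), (1.2, 0.8), (1.0, 1.3), (0.1, 0.2)]
labels = [0, 0, 0, 1, 1, 1, 1]

pred_near_positive = knn_predict(points, labels, query=(0.8, 0.9), k=5)
pred_near_negative = knn_predict(points, labels, query=(-0.9, -1.0), k=5)
print(pred_near_positive, pred_near_negative)
```

An odd `k` avoids tied votes in a binary problem such as this one.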


- Decision Tree

In [4]:
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
y_pred_tree = tree.predict(X_test)


- Artificial Neural Network

In [5]:
from sklearn.neural_network import MLPClassifier

ann = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=42)
ann.fit(X_train, y_train)
y_pred_ann = ann.predict(X_test)




#### 5. Evaluate Accuracy + Confusion Matrix

In [6]:
from sklearn.metrics import confusion_matrix, accuracy_score

def evaluate(y_true, y_pred, model_name):
    """Print accuracy plus the true-positive and true-negative counts for one model."""
    # Rows of cm are true labels, columns are predicted labels
    cm = confusion_matrix(y_true, y_pred)
    accuracy = accuracy_score(y_true, y_pred)
    TP = cm[1][1]
    TN = cm[0][0]
    print(f"{model_name} Accuracy: {accuracy:.2f}")
    print(f"True Positives: {TP}, True Negatives: {TN}, Total Test Cases: {len(y_true)}")
    print(f"(TP + TN) / Total = {(TP + TN)/len(y_true):.2f}\n")

evaluate(y_test, y_pred_knn, "KNN")
evaluate(y_test, y_pred_tree, "Decision Tree")
evaluate(y_test, y_pred_ann, "ANN")


KNN Accuracy: 0.84
True Positives: 68, True Negatives: 59, Total Test Cases: 152
(TP + TN) / Total = 0.84

Decision Tree Accuracy: 0.74
True Positives: 58, True Negatives: 55, Total Test Cases: 152
(TP + TN) / Total = 0.74

ANN Accuracy: 0.80
True Positives: 65, True Negatives: 57, Total Test Cases: 152
(TP + TN) / Total = 0.80
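
Accuracy treats both kinds of error alike, whereas in a screening setting a missed case (false negative) is often the more serious one. As a sketch of how the `evaluate` output could be extended, the plain-Python helper below counts all four confusion-matrix cells and derives sensitivity and specificity. The function name `screening_metrics` and the toy labels are hypothetical; the labels are not the study's test set, and the helper assumes both classes are present.

```python
def screening_metrics(y_true, y_pred):
    """Count confusion-matrix cells for binary labels (1 = disease) and derive rates."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # share of diseased patients that were caught
        "specificity": tn / (tn + fp),  # share of healthy patients correctly cleared
        "false_negatives": fn,
    }

# Toy labels for illustration only
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred_toy = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]

toy_metrics = screening_metrics(y_true_toy, y_pred_toy)
print(toy_metrics)
```

In scikit-learn, the four cells can be read in one step with `tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()` for binary labels.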



In [7]:
# Build a comparison table
import pandas as pd

# Values copied from the evaluation output above (accuracy rounded to whole percent)
results = {
    "Model": ["KNN", "Decision Tree", "ANN"],
    "Accuracy (%)": [84.0, 74.0, 80.0],
    "True Positives (TP)": [68, 58, 65],
    "True Negatives (TN)": [59, 55, 57],
    "Total Test Cases": [152, 152, 152]
}

df_results = pd.DataFrame(results)
df_results


| | Model | Accuracy (%) | True Positives (TP) | True Negatives (TN) | Total Test Cases |
|---|---|---|---|---|---|
| 0 | KNN | 84.0 | 68 | 59 | 152 |
| 1 | Decision Tree | 74.0 | 58 | 55 | 152 |
| 2 | ANN | 80.0 | 65 | 57 | 152 |


#### Analysis and Interpretation
- KNN achieved the highest accuracy at 84% (127 of 152 test cases correct), with the most true positives (68) and true negatives (59). Standardizing the features likely helped, since distance-based models such as KNN are sensitive to feature scale.

- ANN (Artificial Neural Network) followed with 80% accuracy (122 of 152). It was trained with a single untuned configuration (two hidden layers of 32 and 16 units, up to 500 iterations), so the gap to KNN may reflect limited tuning rather than a limit of the model itself.

- Decision Tree lagged behind with 74% accuracy (113 of 152). The tree was grown without a depth limit or pruning, so overfitting to the 151 training rows is a likely cause.
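
The effect of scaling on a distance-based model can be illustrated with two standard deviations from the `df.describe()` output (`chol` ≈ 51.83, `oldpeak` ≈ 1.16). The sketch below compares how much of the squared Euclidean distance between two patients comes from `chol` before and after dividing each difference by its standard deviation. The two patients are invented for illustration.

```python
from math import hypot

# Standard deviations from the describe() output. describe() reports the sample
# standard deviation; StandardScaler uses the population one, which differs
# only slightly at n = 303.
STD_CHOL = 51.830751
STD_OLDPEAK = 1.161075

# Two hypothetical patients: (chol, oldpeak)
patient_a = (240.0, 0.5)
patient_b = (280.0, 3.5)

d_chol = patient_b[0] - patient_a[0]
d_oldpeak = patient_b[1] - patient_a[1]

# Unscaled: the feature with the larger numeric range dominates the distance
raw_distance = hypot(d_chol, d_oldpeak)
raw_chol_share = d_chol ** 2 / raw_distance ** 2

# Scaled: each difference is expressed in standard deviations
s_chol = d_chol / STD_CHOL
s_oldpeak = d_oldpeak / STD_OLDPEAK
scaled_distance = hypot(s_chol, s_oldpeak)
scaled_chol_share = s_chol ** 2 / scaled_distance ** 2

print(f"chol share of squared distance, unscaled: {raw_chol_share:.3f}")
print(f"chol share of squared distance, scaled:   {scaled_chol_share:.3f}")
```

Without scaling, `chol` accounts for nearly all of the distance even though the `oldpeak` difference is much larger relative to that feature's spread.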

#### Conclusion
- Ranking (based on accuracy): KNN > ANN > Decision Tree

- KNN was the best-performing model on this train-test split, suggesting that proximity-based classification works well on the standardized features.

- ANN was second, 4 percentage points behind, and might outperform KNN with further tuning.

- The Decision Tree, while interpretable, trailed KNN by about 10 percentage points and may need pruning or an ensemble method such as boosting to do better.
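
Because each accuracy is estimated from 152 test cases, the ranking carries sampling uncertainty. A rough sketch using the normal approximation to the binomial, with the correct-prediction counts (TP + TN) taken from the results above, gives approximate 95% intervals. It treats the models independently and ignores that they share one test set, so it is a rough guide rather than a formal comparison. The helper name `accuracy_interval` is hypothetical.

```python
from math import sqrt

def accuracy_interval(correct, total, z=1.96):
    """Approximate 95% interval for accuracy via the normal approximation to the binomial."""
    p = correct / total
    half_width = z * sqrt(p * (1 - p) / total)
    return p, p - half_width, p + half_width

# Correct predictions = TP + TN from the results table
correct_counts = {"KNN": 68 + 59, "Decision Tree": 58 + 55, "ANN": 65 + 57}
TOTAL_TEST_CASES = 152

intervals = {name: accuracy_interval(count, TOTAL_TEST_CASES)
             for name, count in correct_counts.items()}

for name, (p, low, high) in intervals.items():
    print(f"{name}: accuracy {p:.3f}, approx. 95% interval [{low:.3f}, {high:.3f}]")
```

The KNN and ANN intervals overlap substantially, so the gap between them could plausibly come from the particular split. Repeating the comparison over several splits, or with cross-validation, would give a firmer ranking.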

#### Real-Life Applicability
- Doctors and Nurses: A model like this could help them quickly flag patients who may have heart disease during routine hospital visits.

- Health Apps: It could be built into mobile apps so that people in remote or rural areas can check their heart health without first travelling to see a doctor.

- Healthy Living Advice: When the model flags someone as at risk, they can be given advice on eating better, exercising, or reducing stress.

- Insurance Companies: It could help insurers estimate a person's health risk before offering health cover or pricing a plan.