# 🩺 Heart Disease Prediction using Machine Learning

### 📘 Overview
This project aims to predict **whether a patient is at high risk of a heart attack** ('positive' or 'negative') based on clinical data.

In this project, we:
- Visualize the dataset to find insights
- Clean and prepare the data (including removing anomalies)
- Train **Random Forest**, **Naive Bayes**, and **KNN** models
- Compare their **Accuracy** and **F1 Scores**
- Test predictions on a **custom patient record**

In [None]:
# Step 1️⃣: Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score
import warnings

# Set plot style
sns.set(style="whitegrid")

## 📂 Step 2: Load and Clean the Dataset
- Load the dataset
- Remove anomalous rows where 'Heart rate' or 'Troponin' is 1000.

In [None]:
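# Load dataset
data = pd.read_csv("/kaggle/input/heart-attack-prediction/Medicaldataset.csv")
# Remove anomalous rows where 'Heart rate' or 'Troponin' is 1000
data = data[(data['Heart rate'] != 1000) & (data['Troponin'] != 1000)].reset_index(drop=True)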

## 📊 Step 3: Visualize the Data (EDA)
Before modeling, let's explore the data.

In [None]:
# --- 3a: Target Variable Distribution ---
plt.figure(figsize=(6, 4))
sns.countplot(x='Result', data=data, palette=['#4CAF50', '#F44336'])
plt.title('Distribution of Target Variable (Result)')
plt.xlabel('Patient Result')
plt.ylabel('Count')
plt.savefig("target_distribution.png")

# --- 3b: Key Feature Distributions (Histograms) ---
key_features = ['Age', 'Heart rate', 'Blood sugar', 'Troponin']
data[key_features].hist(figsize=(12, 8), bins=30, color='royalblue', edgecolor='black')
plt.suptitle('Distributions of Key Numerical Features', y=1.02, size=16)
plt.tight_layout()
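# bbox_inches="tight" keeps the suptitle (placed at y=1.02) inside the saved image
plt.savefig("feature_histograms.png", bbox_inches="tight")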

# --- 3c: Correlation Heatmap ---
# We need to encode 'Result' first to include it in the heatmap
df_for_corr = data.copy()
if 'Result' in df_for_corr.columns and df_for_corr['Result'].dtype == 'object':
    df_for_corr['Result'] = LabelEncoder().fit_transform(df_for_corr['Result'])

plt.figure(figsize=(10, 7))
corr_matrix = df_for_corr.corr()
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', linewidths=0.5)
plt.title('Feature Correlation Heatmap')
plt.savefig("correlation_heatmap.png")
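
A heatmap can be hard to scan when only the target column matters. As a minimal sketch, the correlations with the encoded `Result` column can also be ranked directly. The toy frame below uses made-up values, not the Kaggle dataset; only the column names are borrowed from this notebook.

```python
import pandas as pd

# Toy stand-in for the encoded dataframe (values are invented for illustration)
toy = pd.DataFrame({
    "Age":      [40, 55, 63, 35, 70, 48],
    "Troponin": [0.01, 0.90, 1.50, 0.02, 2.10, 0.03],
    "Result":   [0, 1, 1, 0, 1, 0],
})

# Correlation of every feature with the target, strongest (by magnitude) first
target_corr = (
    toy.corr(numeric_only=True)["Result"]
    .drop("Result")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

On the real data, the same expression applied to `df_for_corr` would list each feature's correlation with `Result` in one column.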

## 🧹 Step 4: Data Cleaning and Encoding
Convert the categorical `Result` column into numeric values.

In [None]:
# Create a copy to work with
df = data.copy()

# Encode the target variable 'Result'
le = LabelEncoder()
if 'Result' in df.columns and df['Result'].dtype == 'object':
    df['Result'] = le.fit_transform(df['Result'])
    # 0 = 'negative', 1 = 'positive'
    label_map = {i: class_name for i, class_name in enumerate(le.classes_)}
else:
    label_map = {0: '0', 1: '1'} # Default mapping
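
The `0 = 'negative', 1 = 'positive'` comment relies on `LabelEncoder` sorting its classes. A small standalone sketch with hypothetical labels shows that ordering and how `inverse_transform` recovers the original strings:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, deliberately starting with 'positive'
labels = ["positive", "negative", "negative", "positive"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(labels)

# classes_ is sorted, so 'negative' -> 0 and 'positive' -> 1
print(le_demo.classes_.tolist())   # ['negative', 'positive']
print(codes.tolist())              # [1, 0, 0, 1]

# inverse_transform maps codes back to the original strings
decoded = le_demo.inverse_transform([0, 1]).tolist()
print(decoded)                     # ['negative', 'positive']
```

The codes depend on sorted order, not on which label appears first in the column.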

## ✂️ Step 5: Split the Data
- **Features (X)**: All columns except 'Result'
- **Target (y)**: The 'Result' column
- Split with an **80/20** train-test ratio.

In [None]:
X = df.drop("Result", axis=1)
y = df["Result"]

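# Fix the seed for reproducible results and stratify to keep the class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)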
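
The `random_state` and `stratify` arguments of `train_test_split` control reproducibility and class balance of an 80/20 split. The sketch below uses synthetic labels (an assumed 70/30 class mix, not the real dataset) to show that stratifying preserves the positive rate in both splits:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 70 negatives (0) and 30 positives (1)
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, both splits keep the 30% positive rate
print(len(y_tr), y_tr.mean())
print(len(y_te), y_te.mean())
```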

## 🤖 Step 6: Initialize and Train Models
1. **Random Forest Classifier**
2. **Naive Bayes (GaussianNB)**
3. **K-Nearest Neighbors (KNN)**

In [None]:
# Initialize models
rf_model = RandomForestClassifier(n_estimators=20, class_weight='balanced', random_state=42)
nb_model = GaussianNB()
knn_model = KNeighborsClassifier()

# Train models
rf_model.fit(X_train, y_train)
nb_model.fit(X_train, y_train)
knn_model.fit(X_train, y_train)
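
KNN measures distances between raw feature values, so columns with large ranges can dominate columns with small ones. One common remedy, shown here only as a sketch on synthetic data and not something this notebook does, is to standardize the features inside a pipeline before the classifier:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: feature 0 is informative but small in scale,
# feature 1 is pure noise on a much larger scale
n = 200
y_demo = rng.integers(0, 2, size=n)
informative = y_demo + rng.normal(0, 0.3, size=n)
noise = rng.normal(0, 1000, size=n)
X_demo = np.column_stack([informative, noise])

# The scaler is fitted on the training rows only, then reused at predict time
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier())
scaled_knn.fit(X_demo[:150], y_demo[:150])

test_score = scaled_knn.score(X_demo[150:], y_demo[150:])
print(f"Held-out accuracy with scaling: {test_score:.3f}")
```

Whether scaling helps on the heart-attack data would need to be checked by fitting both variants and comparing their scores.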

## 🧮 Step 7: Evaluate the Models
Compare models using **Accuracy** and **F1 Score**.

In [None]:
# Get predictions
y_pred_rf = rf_model.predict(X_test)
y_pred_nb = nb_model.predict(X_test)
y_pred_knn = knn_model.predict(X_test)

# Get Accuracy
rf_acc = accuracy_score(y_test, y_pred_rf)
nb_acc = accuracy_score(y_test, y_pred_nb)
knn_acc = accuracy_score(y_test, y_pred_knn)

# Get F1 Score
rf_f1 = f1_score(y_test, y_pred_rf, average='binary')
nb_f1 = f1_score(y_test, y_pred_nb, average='binary')
knn_f1 = f1_score(y_test, y_pred_knn, average='binary')

print("\n🔹 Model Performance")
print("---------------------------------")
print(f"Random Forest : Accuracy={rf_acc:.4f} | F1 Score={rf_f1:.4f}")
print(f"Naive Bayes   : Accuracy={nb_acc:.4f} | F1 Score={nb_f1:.4f}")
print(f"KNN           : Accuracy={knn_acc:.4f} | F1 Score={knn_f1:.4f}")
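
To make the two metrics concrete, here is a worked example in plain Python on ten hypothetical predictions. It follows the standard definitions that `accuracy_score` and `f1_score(..., average='binary')` implement, with `1` treated as the positive class:

```python
# Hypothetical labels for ten patients (1 = positive, 0 = negative)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

accuracy = sum(t == p for t, p in pairs) / len(pairs)
precision = tp / (tp + fp)   # of predicted positives, how many were right
recall = tp / (tp + fn)      # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.80 precision=0.75 recall=0.75 f1=0.75
```

True negatives never enter the F1 formula, so a model can have high accuracy and a noticeably lower F1 when positives are rare.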

## 📊 Step 8: Visualize Model Comparison

In [None]:
scores = {
    "Random Forest": rf_acc,
    "Naive Bayes": nb_acc,
    "KNN": knn_acc
}

colors = ['#4CAF50', '#FFC107', '#2196F3'] # Green, Amber, Blue

plt.figure(figsize=(8, 5))
bars = plt.bar(scores.keys(), scores.values(), color=colors)

# Add labels on top of bars
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2.0, yval + 0.01, f'{yval:.4f}', ha='center', va='bottom')

plt.title("Model Accuracy Score Comparison")
plt.ylabel("Accuracy Score")
plt.ylim([0, 1.1]) # Set y-axis from 0 to 1.1 to make space for labels

# Save the plot
plt.savefig("heart_attack_model_accuracy.png")

## 🧪 Step 9: Predict on a Custom Patient Case
Feature format:
`[Age, Gender, Heart rate, Systolic blood pressure, Diastolic blood pressure, Blood sugar, CK-MB, Troponin]`

In [None]:
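# Create a custom patient record
# Example: 55yo male, 80bpm, 140/90, 120 sugar, 3.0 CK-MB, 0.05 Troponin
custom_patient = [[55, 1, 80, 140, 90, 120, 3.0, 0.05]]

# Wrap the record in a DataFrame with the training column names so the
# models receive the same feature names they were fitted with
custom_patient_df = pd.DataFrame(custom_patient, columns=X.columns)

# Predict for custom patient
rf_result_code = rf_model.predict(custom_patient_df)[0]
nb_result_code = nb_model.predict(custom_patient_df)[0]
knn_result_code = knn_model.predict(custom_patient_df)[0]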

# Map numeric codes back to original string labels ('negative', 'positive')
rf_result_label = label_map.get(rf_result_code, 'unknown')
nb_result_label = label_map.get(nb_result_code, 'unknown')
knn_result_label = label_map.get(knn_result_code, 'unknown')

print("\n🔹 Custom Test Patient Prediction")
print("\n🔹 55yo male, 80bpm, 140/90, 120 sugar, 3.0 CK-MB, 0.05 Troponin")
print(f"   Patient Data: {custom_patient[0]}")
print("---------------------------------")
print(f"Random Forest : {rf_result_code} ({rf_result_label})")
print(f"Naive Bayes   : {nb_result_code} ({nb_result_label})")
print(f"KNN           : {knn_result_code} ({knn_result_label})")
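
A hard 0/1 label hides how confident a model is. As a hedged sketch on synthetic data (not the notebook's fitted models), `predict_proba` returns one probability per class, in the order given by `classes_`:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic two-feature data; the label depends on the first feature only
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=20, random_state=42)
clf.fit(X_demo, y_demo)

# One row per record, one column per class in clf.classes_ order
proba = clf.predict_proba([[1.5, 0.0]])
print(dict(zip(clf.classes_.tolist(), proba[0].round(3).tolist())))
```

The same call on `rf_model`, `nb_model`, or `knn_model` would give a per-class probability for the custom patient alongside the predicted label.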

## 🏁 Step 10: Conclusion
- **Random Forest** was the top-performing model.
- The **F1 Score** for Random Forest is high. Because it is computed with `average='binary'`, it reflects precision and recall on the 'positive' class, while **Accuracy** covers both 'positive' and 'negative' cases.
- **Data visualization** revealed that the `Troponin` and `CK-MB` levels have a strong positive correlation with a positive heart attack result.
- **Data cleaning** was performed to remove anomalous values for 'Heart rate' and 'Troponin'.