<a href="https://colab.research.google.com/github/touhid0503/Heart_Disease_Diagnosis_Support_System/blob/main/HDDSS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---

# 🛠 Revised Methodology

## Heart Disease Diagnosis Support System Using Machine Learning

---

## Step 1: Problem Identification & Objective Definition

Heart disease remains one of the leading causes of death worldwide. Early identification of patients at risk can significantly reduce mortality through timely medical intervention.

**Objective:**
To develop a **machine learning-based heart disease diagnosis support system** that predicts whether a patient has heart disease using clinical and physiological attributes.

**System Output:**

* Binary classification:

  * **1 → Heart Disease Present**
  * **0 → No Heart Disease**
* Probability score indicating disease risk

> This system is designed to assist medical professionals and does not replace clinical diagnosis.

---

## Step 2: Dataset Description & Understanding

The dataset used in this project is collected from **Kaggle**, originally derived from the **UCI Heart Disease dataset**.

### Dataset Overview:

* **Total records:** 303 patients (302 after one duplicate row is removed)
* **Total features:** 13 input features + 1 target variable
* **Target variable:** `target`

  * 1 = Heart disease present
  * 0 = No heart disease

### Feature Description:

| Category             | Features                                                               |
| -------------------- | ---------------------------------------------------------------------- |
| Demographic          | Age, Sex                                                               |
| Clinical             | Resting Blood Pressure, Cholesterol, Fasting Blood Sugar               |
| ECG & Heart Function | Resting ECG, Maximum Heart Rate (thalach)                              |
| Exercise Related     | Exercise Induced Angina, ST Depression (oldpeak), Slope                |
| Medical Indicators   | Chest Pain Type (cp), Number of Major Vessels (ca), Thalassemia (thal) |

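All features are stored as numbers: continuous measurements (e.g., `chol`, `oldpeak`) and integer-coded categories (e.g., `cp`, `thal`). The dataset can therefore be passed to machine learning models without text parsing.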

---

## Step 3: Data Preprocessing

Since healthcare data must be handled carefully, the following preprocessing steps are applied:

### 3.1 Missing Value Analysis

* The dataset contains **no missing values** (every column has 303 non-null entries)
* One duplicate row is found and dropped before modeling, leaving 302 records

### 3.2 Feature Encoding

* Categorical variables (e.g., `sex`, `cp`, `thal`, `slope`) are **already numerically encoded**
* No additional encoding is required

### 3.3 Feature Scaling

* Numerical features are scaled using **Standardization (StandardScaler)**
* This puts all features on a comparable scale, which matters for models that are sensitive to feature magnitude, such as SVM
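
As a minimal sketch of what standardization does, the snippet below computes z-scores by hand with NumPy, estimating the mean and standard deviation on the training values only and reusing them on the test values. The training array holds the first five `trestbps` readings shown by `df.head()`; the test values are made up for illustration and are not the notebook's actual split.

```python
import numpy as np

# First five trestbps readings from df.head(); the test values are hypothetical.
train = np.array([145.0, 130.0, 130.0, 120.0, 120.0])
test = np.array([140.0, 110.0])

# Fit: estimate the statistics on the training data only.
mean = train.mean()
std = train.std()  # population std (ddof=0), the convention StandardScaler follows

# Transform: apply the same training statistics to both sets.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.round(3))
print(test_scaled.round(3))
```

Reusing the training statistics on the test set keeps information about the test rows out of the fitted scaler.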

### 3.4 Train–Test Split

* Dataset split into:

  * **80% training data**
  * **20% testing data**
* Stratified sampling is used to preserve class balance

---

## Step 4: Exploratory Data Analysis (EDA)

EDA is conducted to understand data distribution and feature relationships.

### EDA Techniques:

* Distribution analysis of numerical features
* Correlation heatmap
* Comparison of feature values between diseased and non-diseased patients

Key influencing factors identified include:

* Age
* Chest pain type
* Maximum heart rate
* Exercise-induced angina
* ST depression (`oldpeak`)

---

## Step 5: Class Distribution Analysis

* The dataset shows a **roughly balanced class distribution** (about 54% of the 303 original records are labeled 1)
* No heavy class imbalance is observed
* Stratified sampling is still used so that the training and test sets keep the same class proportions
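
One way to check the class balance is sketched below with pandas. The series `y` is a stand-in, not the real column: it contains 165 ones and 138 zeros, counts inferred from the `target` mean of 0.544554 that `df.describe()` reports for the original 303 rows. In the notebook itself, `df['target']` would take its place.

```python
import pandas as pd

# Stand-in target column: 165 positives and 138 negatives, inferred from the
# mean of 0.544554 reported by df.describe() (before duplicate removal).
y = pd.Series([1] * 165 + [0] * 138, name="target")

counts = y.value_counts()                     # absolute class counts
proportions = y.value_counts(normalize=True)  # class shares, summing to 1

print(counts)
print(proportions.round(3))
```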

---

## Step 6: Model Selection

Multiple machine learning models are used to ensure robust comparison.

### Models Implemented:

* **Logistic Regression** (baseline and interpretable)
* **Random Forest Classifier** (ensemble model)
* **Support Vector Machine (SVM)**

These models are selected for their effectiveness in medical classification tasks.
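
The three candidates could be set up along the following lines. This is a sketch under stated assumptions, not the notebook's own configuration: the data comes from `make_classification` as a synthetic stand-in with the same shape as the cleaned dataset (302 rows, 13 features), and the hyperparameters (`max_iter`, `n_estimators`, the 5-fold setting) are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): same shape as the cleaned dataset.
X, y = make_classification(n_samples=302, n_features=13, random_state=42)

# Scaling sits inside the pipelines so it is refit on each training fold.
# Random Forest is tree-based and is used here without scaling.
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "SVM": make_pipeline(StandardScaler(), SVC(random_state=42)),
}

# Mean 5-fold cross-validated recall for each candidate.
cv_recall = {
    name: cross_val_score(model, X, y, cv=5, scoring="recall").mean()
    for name, model in models.items()
}

for name, score in cv_recall.items():
    print(f"{name}: mean CV recall = {score:.3f}")
```

The scores printed on synthetic data say nothing about heart disease; the point is only the structure of the comparison.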

---

## Step 7: Model Training & Hyperparameter Tuning

* Models are trained using the training dataset
* Hyperparameter tuning is performed using:

  * Grid search with cross-validation
* Overfitting is monitored by comparing training and test performance
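
A grid search with cross-validation might look like the sketch below. It assumes synthetic stand-in data from `make_classification`, an SVM pipeline, and a small hypothetical parameter grid; none of these values are taken from the notebook. Recall is used as the scoring metric, so `search.score` also returns recall rather than accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 302 rows, 13 features.
X, y = make_classification(n_samples=302, n_features=13, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

# Hypothetical grid; the values are illustrative, not tuned recommendations.
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__kernel": ["linear", "rbf"],
}

# 5-fold cross-validation on the training set, optimizing recall.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)

# With scoring="recall", score() reports recall of the refit best model.
train_recall = search.score(X_train, y_train)
test_recall = search.score(X_test, y_test)

print("Best parameters:", search.best_params_)
print(f"Train recall: {train_recall:.3f}, test recall: {test_recall:.3f}")
```

A large gap between the training and test scores would be one sign of overfitting.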

---

## Step 8: Model Evaluation

Models are evaluated using the test dataset.

### Evaluation Metrics:

* Accuracy
* Precision
* **Recall (primary metric)**
* F1-Score
* ROC-AUC
* Confusion Matrix

> **Recall is prioritized** to minimize false negatives, which is critical in heart disease detection.
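
To make the relationship between these metrics concrete, the sketch below derives accuracy, precision, recall, and F1 from the four confusion-matrix counts. The helper name `classification_metrics` and the counts passed to it are hypothetical; they describe an imaginary 61-patient test set and are not results from this notebook.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Derive common metrics from the four confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of truly diseased patients that the model catches.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts (NOT results from this notebook).
metrics = classification_metrics(tp=30, fp=5, fn=3, tn=23)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Each false negative (`fn`) is a diseased patient the model misses, and it lowers recall directly, which is why recall is the primary metric here. In practice, `sklearn.metrics` offers `confusion_matrix`, `recall_score`, and `roc_auc_score` for the same purpose.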

---

## Step 9: Model Comparison & Final Model Selection

The best model is selected based on:

* High recall
* Balanced precision and F1-score
* Stable ROC-AUC performance
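
The criteria above can be turned into an explicit rule in several ways; the sketch below shows one of them. It assumes a hypothetical minimum precision of 0.75 and ranks the remaining models by recall, then F1, then ROC-AUC. The model names and scores are placeholders chosen to exercise the rule, not results from this notebook.

```python
# Placeholder scores for illustration only; NOT results from this notebook.
scores = {
    "model_a": {"recall": 0.95, "precision": 0.60, "f1": 0.74, "roc_auc": 0.85},
    "model_b": {"recall": 0.90, "precision": 0.82, "f1": 0.86, "roc_auc": 0.91},
    "model_c": {"recall": 0.90, "precision": 0.80, "f1": 0.85, "roc_auc": 0.92},
}


def select_best(scores: dict, min_precision: float = 0.75) -> str:
    """Pick the model with the highest recall among those meeting a precision floor.

    Ties on recall are broken by F1, then by ROC-AUC. The precision floor is a
    hypothetical safeguard against models that flag almost everyone as diseased.
    """
    eligible = {n: s for n, s in scores.items() if s["precision"] >= min_precision}
    if not eligible:  # fall back to all models if none meets the floor
        eligible = scores
    return max(
        eligible,
        key=lambda n: (eligible[n]["recall"], eligible[n]["f1"], eligible[n]["roc_auc"]),
    )


best = select_best(scores)
print("Selected:", best)
```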

---

## Step 10: Model Explainability

To ensure transparency and clinical trust:

* Feature importance is analyzed (Random Forest)
* Key predictors influencing heart disease are identified
* Model decisions are interpreted in a clinical context

Explainability is crucial for healthcare-related ML systems.
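
A minimal sketch of the Random Forest feature-importance step follows. It assumes synthetic stand-in data from `make_classification`; the dataset's column names are attached only to show the shape of the output, so the ranking printed here carries no clinical meaning.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

feature_names = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal",
]

# Synthetic stand-in data (assumption); real columns would come from df.
X_arr, y = make_classification(n_samples=302, n_features=13, random_state=42)
X = pd.DataFrame(X_arr, columns=feature_names)

rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X, y)

# Impurity-based importances: non-negative and normalized to sum to 1.
importances = pd.Series(rf.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)

print(importances.head(5))
```

Impurity-based importances describe the model as a whole; they do not explain an individual patient's prediction.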

---

## Step 11: System Output Design

The final system provides:

* Heart disease prediction (Yes / No)
* Risk probability score
* Important contributing features for the prediction
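
The three outputs listed above could be produced by a helper like the one sketched here. Everything in it is an assumption made for illustration: the function name `predict_risk`, the 0.5 threshold, the synthetic training data, and the choice of a scaled Logistic Regression whose per-feature contribution is approximated as coefficient times standardized value.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

feature_names = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal",
]

# Synthetic stand-in data (assumption), used only to obtain a fitted model.
X, y = make_classification(n_samples=302, n_features=13, random_state=42)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)


def predict_risk(model, features, threshold=0.5, top_k=3):
    """Return a Yes/No prediction, the risk probability, and top contributors."""
    x = np.asarray(features, dtype=float).reshape(1, -1)
    proba = float(model.predict_proba(x)[0, 1])  # probability of class 1

    # Contribution of each feature to the log-odds: coefficient * scaled value.
    scaler = model.named_steps["standardscaler"]
    clf = model.named_steps["logisticregression"]
    contributions = clf.coef_[0] * scaler.transform(x)[0]
    top = np.argsort(np.abs(contributions))[::-1][:top_k]

    return {
        "prediction": "Yes" if proba >= threshold else "No",
        "risk_probability": proba,
        "top_features": [feature_names[i] for i in top],
    }


result = predict_risk(model, X[0])
print(result)
```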

---

## Step 12: Ethical Considerations & Limitations

* The system is a **decision support tool**, not a replacement for doctors
* Limited dataset size may affect generalization
* Predictions depend on data quality and feature availability
* Patient data privacy and ethical AI principles are respected

---

## Step 13: Conclusion & Future Scope

This project demonstrates how machine learning can assist in early heart disease detection using clinical data.

### Future Enhancements:

* Larger real-world clinical datasets
* Deep learning models
* Real-time health monitoring integration
* Web or mobile-based deployment for hospitals

---


# Import

In [29]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset

In [30]:
import kagglehub
from kagglehub import KaggleDatasetAdapter

file_path = "heart.csv"

df = kagglehub.dataset_load(
  KaggleDatasetAdapter.PANDAS,
  "arezaei81/heartcsv",
  file_path,
)

print("First 5 records:", df.head())

Using Colab cache for faster access to the 'heartcsv' dataset.
First 5 records:    age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  \
0   63    1   3       145   233    1        0      150      0      2.3      0   
1   37    1   2       130   250    0        1      187      0      3.5      0   
2   41    0   1       130   204    0        0      172      0      1.4      2   
3   56    1   1       120   236    0        1      178      0      0.8      2   
4   57    0   0       120   354    0        1      163      1      0.6      2   

   ca  thal  target  
0   0     1       1  
1   0     2       1  
2   0     2       1  
3   0     2       1  
4   0     2       1  


# Explore the dataset

In [31]:
df.shape

(303, 14)

In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  target    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


In [33]:
df.describe()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
count,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0
mean,54.366337,0.683168,0.966997,131.623762,246.264026,0.148515,0.528053,149.646865,0.326733,1.039604,1.39934,0.729373,2.313531,0.544554
std,9.082101,0.466011,1.032052,17.538143,51.830751,0.356198,0.52586,22.905161,0.469794,1.161075,0.616226,1.022606,0.612277,0.498835
min,29.0,0.0,0.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,47.5,0.0,0.0,120.0,211.0,0.0,0.0,133.5,0.0,0.0,1.0,0.0,2.0,0.0
50%,55.0,1.0,1.0,130.0,240.0,0.0,1.0,153.0,0.0,0.8,1.0,0.0,2.0,1.0
75%,61.0,1.0,2.0,140.0,274.5,0.0,1.0,166.0,1.0,1.6,2.0,1.0,3.0,1.0
max,77.0,1.0,3.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,2.0,4.0,3.0,1.0


In [34]:
df.ndim

2

In [35]:
df.dtypes

Unnamed: 0,0
age,int64
sex,int64
cp,int64
trestbps,int64
chol,int64
fbs,int64
restecg,int64
thalach,int64
exang,int64
oldpeak,float64


In [36]:
type(df)

In [37]:
df.values

array([[63.,  1.,  3., ...,  0.,  1.,  1.],
       [37.,  1.,  2., ...,  0.,  2.,  1.],
       [41.,  0.,  1., ...,  0.,  2.,  1.],
       ...,
       [68.,  1.,  0., ...,  2.,  3.,  0.],
       [57.,  1.,  0., ...,  1.,  3.,  0.],
       [57.,  0.,  1., ...,  1.,  2.,  0.]])

In [38]:
df.index

RangeIndex(start=0, stop=303, step=1)

In [39]:
df.head(5)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [40]:
df.tail(5)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0
302,57,0,1,130,236,0,0,174,0,0.0,1,1,2,0


In [41]:
df.sample(5)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
71,51,1,2,94,227,0,1,154,1,0.0,2,1,3,1
233,64,1,0,120,246,0,0,96,1,2.2,0,1,2,0
148,44,1,2,120,226,0,1,169,0,0.0,2,0,2,1
113,43,1,0,110,211,0,1,161,0,0.0,2,0,3,1
246,56,0,0,134,409,0,0,150,1,1.9,1,2,3,0


# Null value check


In [42]:
df.isnull().sum()

Unnamed: 0,0
age,0
sex,0
cp,0
trestbps,0
chol,0
fbs,0
restecg,0
thalach,0
exang,0
oldpeak,0


# Checking duplicates

In [43]:
df.duplicated().sum()

np.int64(1)

In [44]:
df.drop_duplicates(inplace=True)

In [45]:
df.duplicated().sum()

np.int64(0)

In [46]:
df.shape

(302, 14)

# Train-Test Split + Feature Scaling

In [47]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

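# Separate the 13 input features from the target label
X = df.drop('target', axis=1)
y = df['target']

# Stratified 80/20 split keeps the class ratio the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Every column is int64 or float64, so this selects all 13 features,
# including the integer-coded categorical ones (e.g. cp, thal)
numerical_cols = X.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Fit the scaler on the training set only, then reuse it on the test set
scaler = StandardScaler()
X_train[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])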

print("Data split into training and testing sets:")
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

Data split into training and testing sets:
X_train shape: (241, 13)
X_test shape: (61, 13)
y_train shape: (241,)
y_test shape: (61,)


In [48]:
df.head(5)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [49]:
df.shape

(302, 14)

# Exploratory Data Analysis (EDA)