# Heart Disease Prediction

## Disclaimer

AI (ChatGPT) was user to help write this project, for both code and documentation. Any code generated has been studied, and any documentation has been verified.

## Dataset

"This dataset contains 1,888 records merged from five publicly available heart disease datasets. It includes 14 features that are crucial for predicting heart attack and stroke risks, covering both medical and demographic factors. Below is a detailed description of each feature." Source: Kaggle.

Dataset Source: https://www.kaggle.com/datasets/mfarhaannazirkhan/heart-dataset?resource=download&select=raw_merged_heart_dataset.csv  
The dataset is taken from Kaggle, but luckily there was a raw, uncleaned version of the dataset available. This is the dataset that is being used in this project. The dataset contains numerous columns, fourteen to be exact. They are as follows:

- **age**: Age of the patient (Numeric).
- **sex**: Gender of the patient. Values: 
  - 1 = male
  - 0 = female
- **cp**: Chest pain type. Values:
  - 0 = Typical angina
  - 1 = Atypical angina
  - 2 = Non-anginal pain
  - 3 = Asymptomatic
- **trestbps**: Resting Blood Pressure (in mm Hg) (Numeric).
- **chol**: Serum Cholesterol level (in mg/dl) (Numeric).
- **fbs**: Fasting blood sugar > 120 mg/dl. Values:
  - 1 = true
  - 0 = false
- **restecg**: Resting electrocardiographic results. Values:
  - 0 = Normal
  - 1 = ST-T wave abnormality
  - 2 = Left ventricular hypertrophy
- **thalach**: Maximum heart rate achieved (Numeric).
- **exang**: Exercise-induced angina. Values:
  - 1 = yes
  - 0 = no
- **oldpeak**: ST depression induced by exercise relative to rest (Numeric).
- **slope**: Slope of the peak exercise ST segment. Values:
  - 0 = Upsloping
  - 1 = Flat
  - 2 = Downsloping
- **ca**: Number of major vessels (0-3) colored by fluoroscopy. Values:
  - 0, 1, 2, 3
- **thal**: Thalassemia types. Values:
  - 1 = Normal
  - 2 = Fixed defect
  - 3 = Reversible defect
- **target**: Outcome variable (heart attack risk). Values:
  - 1 = more chance of heart attack
  - 0 = less chance of heart attack


Here are some definitions to help understand the dataset:

- **Angina**: Chest pain or discomfort due to reduced blood flow to the heart muscle.

- **Electrocardiographic**: Related to electrocardiography, a test that records the electrical activity of the heart.

- **Thalassemia**: A group of inherited blood disorders characterized by abnormal hemoglobin production, leading to anemia.

- **Fluoroscopy**: A medical imaging technique that provides real-time X-ray images, often used to observe the movement of internal organs or guide procedures.

- **ST Segment**: A portion of the electrocardiogram (ECG) tracing that represents the interval between ventricular depolarization and repolarization; abnormalities can indicate heart issues.

- **Ventricular Hypertrophy**: Thickening of the walls of the heart's ventricles, which can result from high blood pressure or other heart conditions.

- **Ischemia**: A condition characterized by reduced blood flow to a part of the body, often leading to oxygen deprivation in tissues.





In [38]:
import pandas as pd

# Loading the dataset
df = pd.read_csv('../datasets/raw_merged_heart_dataset.csv')

# Various statistics of the dataset
description = df.describe()
shape = df.shape
columns = df.columns
null_values = df.isnull().sum()

# # Displaying the statistics
print(f'----- Description -----\n{description}\n')
print(f'----- Shape -----\n{shape}\n')
print(f'----- Columns -----\n{columns}\n')
print(f'----- Null Values -----\n{null_values}\n')


----- Description -----
               age          sex           cp      oldpeak       target
count  2181.000000  2181.000000  2181.000000  2181.000000  2181.000000
mean     53.477762     0.693260     1.507565     0.990509     0.496103
std       9.194787     0.461246     1.371587     1.141851     0.500099
min      28.000000     0.000000     0.000000     0.000000     0.000000
25%      46.000000     0.000000     0.000000     0.000000     0.000000
50%      54.000000     1.000000     2.000000     0.600000     0.000000
75%      60.000000     1.000000     2.000000     1.600000     1.000000
max      77.000000     1.000000     4.000000     6.200000     1.000000

----- Shape -----
(2181, 14)

----- Columns -----
Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalachh',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
      dtype='object')

----- Null Values -----
age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
tha

## Data Cleaning
 
In order to make sure the model is accurate we have to work with clean data. This includes, but is not limited to:

1. Removing NaN, null or empty rows
2. Removing outliers using Z-score

In [39]:
rows_before = df.shape[0]
# Removing the rows with missing values
df = df.dropna()

# Removing rows where any column has a '?' value instead of a number
df = df[(df != '?').all(axis=1)]

rows_after = df.shape[0]
rows_dropped = rows_before - rows_after
print(f'Number of rows dropped: {rows_dropped}')

Number of rows dropped: 293


## Finding Outliers

Next I want to see if there is any major outliers in the dataset. This can be done by calculating the Z-score for each element, and if it is beyond a certain threshold it should be removed. The Z-score tells us the element's deviation from the mean.

In [40]:
from scipy.stats import zscore

# Calculate Z-scores for each numeric column
numeric_cols = df.select_dtypes(include=['float64', 'int64']).columns
z_scores = df[numeric_cols].apply(zscore)

# Filter the dataframe to remove rows where any Z-score is beyond |3|
filtered_df = df[(z_scores.abs() <= 3).all(axis=1)]

removed_counts = df.shape[0] - filtered_df.shape[0]
print(f'Number of rows removed: {removed_counts}')

Number of rows removed: 13


In [42]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Split the dataset into features and target variable
X = filtered_df.drop('target', axis=1)
y = filtered_df['target']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=9123091)

# Initialize the Decision Tree Classifier
clf = DecisionTreeClassifier(random_state=798219)

# Train the model
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print(f'Classification Report:\n{report}')

Accuracy: 0.9413333333333334
Classification Report:
              precision    recall  f1-score   support

           0       0.94      0.93      0.94       171
           1       0.94      0.95      0.95       204

    accuracy                           0.94       375
   macro avg       0.94      0.94      0.94       375
weighted avg       0.94      0.94      0.94       375



## Predictions

I have created categorical mappings to help with the output of a prediction.

In [None]:
# Define mappings for categorical variables
categorical_mappings = {
    'sex': {
        0: 'Female',
        1: 'Male'
    },
    'cp': {
        0: 'Typical angina',
        1: 'Atypical angina',
        2: 'Non-anginal pain',
        3: 'Asymptomatic'
    },
    'fbs': {
        0: 'False (≤120 mg/dl)',
        1: 'True (>120 mg/dl)'
    },
    'restecg': {
        0: 'Normal',
        1: 'ST-T wave abnormality',
        2: 'Left ventricular hypertrophy'
    },
    'exang': {
        0: 'No',
        1: 'Yes'
    },
    'slope': {
        0: 'Upsloping',
        1: 'Flat',
        2: 'Downsloping'
    },
    'ca': {
        0: '0 vessels',
        1: '1 vessel',
        2: '2 vessels',
        3: '3 vessels'
    },
    'thal': {
        1: 'Normal',
        2: 'Fixed defect',
        3: 'Reversible defect'
    },
    'target': {
        0: 'Lower risk',
        1: 'Higher risk'
    }
}

# Function to interpret numerical values into ranges
def get_range_interpretation(value, feature):
    ranges = {
        'age': lambda x: f"{x} years old",
        'trestbps': lambda x: f"{x} mm Hg",
        'chol': lambda x: f"{x} mg/dl",
        'thalach': lambda x: f"{x} bpm",
        'oldpeak': lambda x: f"{x:.1f} mm"
    }
    return ranges.get(feature, lambda x: str(x))(value)

# Function to interpret a prediction
def interpret_prediction(row_values):
    interpretation = {}
    
    for feature, value in row_values.items():
        if feature in categorical_mappings:
            interpretation[feature] = categorical_mappings[feature].get(value, f"Unknown-{value}")
        else:
            interpretation[feature] = get_range_interpretation(value, feature)
            
    return interpretation

def get_patient_data():
    print("\nEnter patient information (press Enter to use example values):")
    
    try:
        # Dictionary of example values
        examples = {
            'age': 55,
            'sex': 1,
            'cp': 1,
            'trestbps': 140,
            'chol': 230,
            'fbs': 0,
            'restecg': 0,
            'thalach': 150,
            'exang': 0,
            'oldpeak': 2.3,
            'slope': 0,
            'ca': 0,
            'thal': 2
        }
        
        data = {}
        
        # Get age (29-77 typical range)
        age_input = input(f"Age (29-77) [example: {examples['age']}]: ").strip()
        data['age'] = float(age_input) if age_input else examples['age']
        if not 0 <= data['age'] <= 120:
            raise ValueError("Age must be between 0 and 120")

        # Get sex
        sex_input = int(input(f"Sex (0=female, 1=male) [example: {examples['sex']}]: ").strip())
        data['sex'] = int(sex_input) if sex_input else examples['sex']
        if data['sex'] not in [0, 1]:
            raise ValueError("Sex must be 0 or 1")

        # Get chest pain type
        cp_input = input(f"Chest pain type\n0=Typical angina\n1=Atypical angina\n2=Non-anginal pain\n3=Asymptomatic\n[example: {examples['cp']}]: ").strip()
        data['cp'] = int(cp_input) if cp_input else examples['cp']
        if data['cp'] not in [0, 1, 2, 3]:
            raise ValueError("Chest pain type must be between 0 and 3")

        # Get resting blood pressure
        bp_input = input(f"Resting blood pressure in mm Hg (90-200) [example: {examples['trestbps']}]: ").strip()
        data['trestbps'] = float(bp_input) if bp_input else examples['trestbps']
        if not 60 <= data['trestbps'] <= 250:
            raise ValueError("Blood pressure must be between 60 and 250")

        # Get cholesterol
        chol_input = input(f"Serum cholesterol in mg/dl (100-600) [example: {examples['chol']}]: ").strip()
        data['chol'] = float(chol_input) if chol_input else examples['chol']
        if not 100 <= data['chol'] <= 600:
            raise ValueError("Cholesterol must be between 100 and 600")

        # Get fasting blood sugar
        fbs_input = input(f"Fasting blood sugar >120 mg/dl (1=true, 0=false) [example: {examples['fbs']}]: ").strip()
        data['fbs'] = int(fbs_input) if fbs_input else examples['fbs']
        if data['fbs'] not in [0, 1]:
            raise ValueError("Fasting blood sugar must be 0 or 1")

        # Get resting ECG
        ecg_input = input(f"Resting ECG results\n0=Normal\n1=ST-T wave abnormality\n2=Left ventricular hypertrophy\n[example: {examples['restecg']}]: ").strip()
        data['restecg'] = int(ecg_input) if ecg_input else examples['restecg']
        if data['restecg'] not in [0, 1, 2]:
            raise ValueError("Resting ECG must be 0, 1, or 2")

        # Get maximum heart rate
        hr_input = input(f"Maximum heart rate achieved (60-220) [example: {examples['thalach']}]: ").strip()
        data['thalach'] = float(hr_input) if hr_input else examples['thalach']
        if not 60 <= data['thalach'] <= 220:
            raise ValueError("Heart rate must be between 60 and 220")

        # Get exercise induced angina
        ang_input = input(f"Exercise induced angina (1=yes, 0=no) [example: {examples['exang']}]: ").strip()
        data['exang'] = int(ang_input) if ang_input else examples['exang']
        if data['exang'] not in [0, 1]:
            raise ValueError("Exercise induced angina must be 0 or 1")

        # Get ST depression
        peak_input = input(f"ST depression induced by exercise (0.0-6.2) [example: {examples['oldpeak']}]: ").strip()
        data['oldpeak'] = float(peak_input) if peak_input else examples['oldpeak']
        if not 0 <= data['oldpeak'] <= 7:
            raise ValueError("ST depression must be between 0 and 7")

        # Get slope
        slope_input = input(f"Slope of peak exercise ST segment\n0=Upsloping\n1=Flat\n2=Downsloping\n[example: {examples['slope']}]: ").strip()
        data['slope'] = int(slope_input) if slope_input else examples['slope']
        if data['slope'] not in [0, 1, 2]:
            raise ValueError("Slope must be 0, 1, or 2")

        # Get number of vessels
        ca_input = input(f"Number of major vessels (0-3) [example: {examples['ca']}]: ").strip()
        data['ca'] = int(ca_input) if ca_input else examples['ca']
        if data['ca'] not in [0, 1, 2, 3]:
            raise ValueError("Number of vessels must be between 0 and 3")

        # Get thalassemia type
        thal_input = input(f"Thalassemia type\n1=Normal\n2=Fixed defect\n3=Reversible defect\n[example: {examples['thal']}]: ").strip()
        data['thal'] = int(thal_input) if thal_input else examples['thal']
        if data['thal'] not in [1, 2, 3]:
            raise ValueError("Thalassemia type must be 1, 2, or 3")

        return pd.DataFrame([data])
    
    except ValueError as e:
        print(f"\nError: {e}")
        return None

def predict_heart_disease():
    while True:
        patient_data = get_patient_data()
        if patient_data is None:
            continue
            
        # Make prediction
        prediction = clf.predict(patient_data)[0]
        probability = clf.predict_proba(patient_data)[0]
        
        print("\nPrediction Results:")
        print("-" * 20)
        if prediction == 1:
            print(f"Higher Risk of Heart Disease")
            print(f"Confidence: {probability[1]:.2%}")
        else:
            print(f"Lower Risk of Heart Disease")
            print(f"Confidence: {probability[0]:.2%}")
            
        # Show feature importance for this prediction
        feature_importance = pd.DataFrame({
            'feature': feature_names,
            'value': patient_data.iloc[0],
            'importance': clf.feature_importances_
        }).sort_values('importance', ascending=False)
        
        print("\nTop influential factors:")
        for _, row in feature_importance.head(3).iterrows():
            print(f"{row['feature']}: {row['value']} (importance: {row['importance']:.3f})")
        
        if input("\nPredict another? (y/n): ").lower() != 'y':
            break

# Run predictions
predict_heart_disease()


Enter patient information (press Enter to use example values):

Error: Sex must be 0 or 1

Enter patient information (press Enter to use example values):

Error: Sex must be 0 or 1

Enter patient information (press Enter to use example values):


NameError: name 'pd' is not defined