In [6]:
import os
import pandas as pd
import matplotlib.pyplot as plt

## DSCI 522 – Milestone 1 - Group 25 
### Project Name: Heart Disease Prediction Model

### Team Members:
#### Johnson Chuang | Eduardo Sanches | Azadeh Ramesh | Jose Davila





#### Data analysis and workflow project for DSCI 522 (Data Science Workflows), a course in the Master of Data Science program at the University of British Columbia.


#### **GitHub link:** https://github.com/stoyq/heart-disease-predictor

### Summary

Heart disease is one of the leading causes of death globally, and early detection is critical for prevention and treatment. In this project, we use the UCI Heart Disease dataset to build a machine-learning model that predicts whether a patient is likely to have heart disease based on clinical and physiological attributes. We load the dataset directly from the web, clean and wrangle the data, perform exploratory data analysis (EDA), and train a classification model (Decision Tree) to identify important predictors of heart disease. Our results highlight key risk indicators that align with well-known medical knowledge, demonstrating how machine learning can support early screening and clinical decision-making.

### Introduction 
The objective of this project is to develop a predictive model that determines whether a patient is at risk of heart disease using a set of clinical measurements. Heart disease diagnoses often rely on many interacting factors such as chest pain symptoms, blood pressure, cholesterol levels, and exercise response. Machine-learning models can help uncover patterns in these variables and support early identification of high-risk patients.

Our research question is:

“Given a patient’s clinical and physiological attributes, can we accurately predict whether they have heart disease?”

To answer this question, we use the publicly available Heart Disease dataset from the UCI Machine Learning Repository. This dataset contains multiple medically relevant variables, making it suitable for a classification model such as a Decision Tree.

### Dataset Description

We use the Heart Disease dataset from the UCI Machine Learning Repository, a widely used benchmark dataset for medical prediction tasks. The dataset includes the following 14 attributes:
- Age
- Sex
- Chest Pain Type (cp)
- Resting Blood Pressure (trestbps)
- Cholesterol (chol)
- Fasting Blood Sugar (fbs)
- Resting ECG results (restecg)
- Maximum heart rate achieved (thalach)
- Exercise induced angina (exang)
- ST depression (oldpeak)
- Slope of ST segment (slope)
- Number of major vessels (ca)
- Thalassemia result (thal)
- num (target: 0 = no heart disease, 1 to 4 = heart disease present; the methodology below treats this as binary, 0 vs. 1)

These variables include both continuous and categorical measurements commonly used in clinical diagnostics.

### Methodology
We build a machine-learning classification model using the UCI Heart Disease dataset:

1. Load data from the original source on the web: https://archive.ics.uci.edu/dataset/45/heart+disease

2. Wrangle and clean the data

- Replace missing values
- Assign meaningful column names
- Convert categorical variables to numeric where needed
- Ensure that the target variable is binary (0 = no heart disease, 1 = heart disease)

3. Perform exploratory data analysis (EDA)
- Summary statistics for continuous variables
- Count plots for categorical variables
- Histograms and boxplots to understand feature distributions

4. Create visualizations relevant to the classification task
- Pairplots to explore relationships between key features
- Distribution of target classes
- Feature correlation matrix

5. Build a classification model
- A Decision Tree Classifier is trained to predict heart disease.
- We split the dataset into training and testing subsets and evaluate model accuracy.

6. Visualize the model results
- Plot of the trained Decision Tree
- Feature importance bar chart
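
The wrangling in step 2 could look roughly like the sketch below. It is not the notebook's actual cleaning code: the toy rows are made up, imputing with the column mode is an assumption chosen for illustration, and `raw` and `clean` are hypothetical names. It assumes missing entries arrive as the string `?`, as they do in the raw Cleveland file.

```python
import pandas as pd

# Hypothetical toy rows; "?" stands in for a missing entry
raw = pd.DataFrame({
    "ca": ["0.0", "3.0", "?", "0.0"],
    "thal": ["3.0", "3.0", "7.0", "?"],
    "num": [0, 2, 1, 0],
})

clean = raw.copy()
for col in ["ca", "thal"]:
    # Anything that cannot be parsed as a number (here "?") becomes NaN
    clean[col] = pd.to_numeric(raw[col], errors="coerce")
    # Assumption for this sketch: fill gaps with the most frequent value
    clean[col] = clean[col].fillna(clean[col].mode().iloc[0])

# Collapse the 0-4 severity score into a binary target (0 = no disease, 1 = disease)
clean["target"] = (raw["num"] > 0).astype(int)

print(clean)
```

Mode imputation suits low-cardinality columns such as `ca` and `thal`; dropping the few incomplete rows would be an equally reasonable choice.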

### Importing the Dataset

A note on the download process: the following code downloads the zip file from UCI's website, unpacks it, and extracts the file of interest (the processed Cleveland data). The data is then minimally processed by adding the correct column names, and finally written out as a CSV to the data/processed folder.

In our actual analysis, we fetch the same data directly with UCI's own `ucimlrepo` library. We include this section to show how to download the data without that library.

In [7]:
import os
import requests
import zipfile
from io import BytesIO

import warnings
warnings.filterwarnings("ignore")

# This is the URL to the data. There are many files in the zip file
# In particular we will retrieve the cleveland data
url = "https://archive.ics.uci.edu/static/public/45/heart+disease.zip"

# Make sure the proper data folders exist
os.makedirs("../data/raw", exist_ok=True)
os.makedirs("../data/processed", exist_ok=True)

# Download the zip file into memory and stop early on an HTTP error
response = requests.get(url, timeout=60)
response.raise_for_status()

# Open the zip from memory
with zipfile.ZipFile(BytesIO(response.content)) as z:
    # We only want the Cleveland data
    z.extract("processed.cleveland.data", "../data/raw")

print("Download complete! File saved to data/raw/processed.cleveland.data")

Download complete! File saved to data/raw/processed.cleveland.data


In [8]:
import pandas as pd

cols = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target"
]

# The raw file has no header row and marks missing values with "?"
df = pd.read_csv(
    "../data/raw/processed.cleveland.data", header=None, names=cols, na_values="?"
)

In [9]:
df

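| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63.0 | 1.0 | 1.0 | 145.0 | 233.0 | 1.0 | 2.0 | 150.0 | 0.0 | 2.3 | 3.0 | 0.0 | 6.0 | 0 |
| 1 | 67.0 | 1.0 | 4.0 | 160.0 | 286.0 | 0.0 | 2.0 | 108.0 | 1.0 | 1.5 | 2.0 | 3.0 | 3.0 | 2 |
| 2 | 67.0 | 1.0 | 4.0 | 120.0 | 229.0 | 0.0 | 2.0 | 129.0 | 1.0 | 2.6 | 2.0 | 2.0 | 7.0 | 1 |
| 3 | 37.0 | 1.0 | 3.0 | 130.0 | 250.0 | 0.0 | 0.0 | 187.0 | 0.0 | 3.5 | 3.0 | 0.0 | 3.0 | 0 |
| 4 | 41.0 | 0.0 | 2.0 | 130.0 | 204.0 | 0.0 | 2.0 | 172.0 | 0.0 | 1.4 | 1.0 | 0.0 | 3.0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 298 | 45.0 | 1.0 | 1.0 | 110.0 | 264.0 | 0.0 | 0.0 | 132.0 | 0.0 | 1.2 | 2.0 | 0.0 | 7.0 | 1 |
| 299 | 68.0 | 1.0 | 4.0 | 144.0 | 193.0 | 1.0 | 0.0 | 141.0 | 0.0 | 3.4 | 2.0 | 2.0 | 7.0 | 2 |
| 300 | 57.0 | 1.0 | 4.0 | 130.0 | 131.0 | 0.0 | 0.0 | 115.0 | 1.0 | 1.2 | 2.0 | 1.0 | 7.0 | 3 |
| 301 | 57.0 | 0.0 | 2.0 | 130.0 | 236.0 | 0.0 | 2.0 | 174.0 | 0.0 | 0.0 | 2.0 | 1.0 | 3.0 | 1 |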


In [10]:
# Write out processed data
df.to_csv("../data/processed/cleveland_clean.csv", index=False)
print("Write complete! File saved to data/processed/cleveland_clean.csv")

Write complete! File saved to data/processed/cleveland_clean.csv


### Importing the Dataset Using `ucimlrepo`

As noted in the previous section, we fetch the same Cleveland data here, this time through the `ucimlrepo` library.

In [11]:
# Requires the ucimlrepo package (install with: pip install ucimlrepo)
from ucimlrepo import fetch_ucirepo
  
# fetch dataset 
heart_disease = fetch_ucirepo(id=45) 
  
# data (as pandas dataframes) 
X = heart_disease.data.features 
y = heart_disease.data.targets 
  
# Debug metadata (uncomment to see)
#print(heart_disease.metadata) 
  
# Debug variable information (uncomment to see)
#print(heart_disease.variables) 


ModuleNotFoundError: No module named 'ucimlrepo'

### Exploratory Data Analysis (EDA)

In [15]:
X.head(5)

NameError: name 'X' is not defined

In [None]:
#Merge X and y in df for EDA
df = X.copy()
df["target"] = y

def plot_overlap(feature):
    plt.figure(figsize=(6,4))
    # target > 0 means disease is present, so classes 1-4 are plotted together
    plt.hist(df[df.target == 0][feature], bins=20, alpha=0.6, label="No Disease")
    plt.hist(df[df.target > 0][feature], bins=20, alpha=0.6, label="Disease")
    plt.title(f"Distribution of {feature} by Heart Disease Status")
    plt.xlabel(feature)
    plt.ylabel("Count")
    plt.legend()
    plt.show()

plot_overlap("age")
plot_overlap("chol")
plot_overlap("trestbps")
plot_overlap("thalach")
plot_overlap("oldpeak")
plot_overlap("sex")

### Column Transformations

In [None]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder


numerical = ["age", "trestbps", "chol", "thalach", "oldpeak"]
categorical = ["cp", "restecg", "slope", "ca", "thal"]
binary = ["sex", "fbs", "exang"]

preprocessor = make_column_transformer(
    (StandardScaler() , numerical),
    (OneHotEncoder(handle_unknown = "ignore"), categorical),
    ("passthrough", binary)
)


### Create the Pipeline

In [None]:

from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# `preprocessor` is the column transformer defined in the Column Transformations cell above

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size =0.2, random_state = 123
)

svc_pipe = make_pipeline( preprocessor, SVC())

### Crossvalidation

In [None]:
from sklearn.model_selection import cross_validate

# 5-fold cross-validation on the training split; report mean and std per metric
cross_val_results = {}
cv_scores = cross_validate(svc_pipe, X_train, y_train, cv=5, return_train_score=True)
cross_val_results["SVC"] = pd.DataFrame(cv_scores).agg(["mean", "std"]).round(3).T


cross_val_results['SVC']

### Fit the Model

In [None]:
svc_pipe.fit(X_train, y_train)

### Predict (X_test) and Compare with Actuals (y_test)

In [None]:

comparison = pd.DataFrame()
comparison["Predictions"] = svc_pipe.predict(X_test)
comparison["Actual"] = y_test.values 

comparison

### Discussion
The SVC model identified meaningful patterns for predicting heart disease, with a mean cross-validation score of 0.61 and a mean training score of 0.78. The gap between the two scores suggests some overfitting.

The EDA shows that several features, including age, sex, and chol, are distributed differently for patients with and without disease, which gives the model signal to separate the two groups. Because heart disease is complex, a better predictor may need additional features.

### Results and Conclusion

Our analysis shows that several clinical features differ noticeably between patients with and without heart disease. As seen in the EDA histograms, patients with heart disease tend to have higher resting blood pressure (trestbps), higher ST-depression values (oldpeak), and lower maximum heart rate achieved (thalach) compared to individuals without disease. After preprocessing the dataset using scaling for numerical variables and one-hot encoding for categorical variables, we trained a Support Vector Classifier (SVC) model. Cross-validation results indicate an average test accuracy of 0.61, with a higher training accuracy of 0.78, suggesting some overfitting. When evaluating predictions on the unseen test set, the model correctly identified many cases but also showed several misclassifications, especially where the model predicted “0” (no disease) but the true label was “1” or “2.” Overall, while the model captures meaningful patterns in the dataset, its moderate predictive performance suggests that further tuning, alternative models, or feature engineering may be needed to improve accuracy and reduce classification bias.
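
One way to make the misclassification pattern described above concrete is to cross-tabulate true against predicted labels. The sketch below uses made-up labels, not the notebook's real predictions; only the `Predictions` and `Actual` column names follow the `comparison` DataFrame built earlier.

```python
import pandas as pd

# Hypothetical labels for illustration; not the notebook's real predictions
comparison = pd.DataFrame({
    "Predictions": [0, 0, 0, 0, 1, 1],
    "Actual":      [0, 0, 1, 2, 1, 0],
})

# Rows are true labels, columns are predicted labels
table = pd.crosstab(comparison["Actual"], comparison["Predictions"])
print(table)

# Diseased patients (label > 0) that the model called healthy (label 0)
missed = ((comparison["Actual"] > 0) & (comparison["Predictions"] == 0)).sum()
print(f"Missed disease cases: {missed}")
```

In a screening setting the "missed" count is usually the number to watch, since it represents patients with disease who would be sent home.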

### References
- UCI Machine Learning Repository. Heart Disease Dataset: https://archive.ics.uci.edu/dataset/45/heart+disease
- Detrano, R., Jánosi, A., Steinbrunn, W., Pfisterer, M., Schmid, J., Sandhu, S., Guppy, K., Lee, S., & Froelicher, V. (1989). International application of a new probability algorithm for the diagnosis of coronary artery disease. American Journal of Cardiology.

### Data Validation Checks

### Step 1. Data Types Check

In [None]:
# Data Validation -> Step 1: Data Types Check

expected_types = {
    "age": "int64",
    "sex": "int64",
    "cp": "int64",
    "trestbps": "int64",
    "chol": "int64",
    "fbs": "int64",
    "restecg": "int64",
    "thalach": "int64",
    "exang": "int64",
    "oldpeak": "float64",
    "slope": "int64",
    "ca": "float64",
    "thal": "int64",
    "target": "int64"
}

type_errors = []

for col, expected in expected_types.items():
    if str(df[col].dtype) != expected:
        type_errors.append(f"Column '{col}' has incorrect type: {df[col].dtype} (expected {expected})")

if type_errors:
    print("Data Type Errors Found:")
    for e in type_errors:
        print("-", e)
else:
    print("All data types are correct.")

### Step 2. Missing Values Check

In [1]:
# Data Validation -> Step 2: Missing Values Check

missing = df.isna().sum()

print("Missing values per column:\n")
print(missing)

if missing.sum() == 0:
    print("\n No missing values detected.")
else:
    print("\n Missing values found. Please investigate before modeling.")

NameError: name 'df' is not defined

### Step 3. Duplicate Rows Check

In [3]:
# Data Validation -> Step 3: Duplicate Rows Check

duplicate_count = df.duplicated().sum()

print(f"Number of duplicate rows: {duplicate_count}")

if duplicate_count == 0:
    print("No duplicate rows detected.")
else:
    print("Duplicate rows found. Please review and remove them before modeling.")

### Step 4. Category Levels Check

In [None]:
# Data Validation Check -> Step 4: Category Levels

# Expected allowed values for categorical fields
expected_categories = {
    "sex": [0, 1],
    "cp": [1, 2, 3, 4],
    "fbs": [0, 1],
    "restecg": [0, 1, 2],
    "exang": [0, 1],
    "slope": [1, 2, 3],
    "ca": [0, 1, 2, 3, 4],   
    "thal": [3, 6, 7],       
    "target": [0, 1]         
}

category_errors = []

for col, allowed in expected_categories.items():
    if col in df.columns:
        # Missing values are reported by Step 2, so skip them here
        invalid_values = set(df[col].dropna().unique()) - set(allowed)
        if invalid_values:
            category_errors.append(f"Column '{col}' contains invalid values: {invalid_values}")

if category_errors:
    print("Category Level Errors Found:")
    for e in category_errors:
        print("-", e)
else:
    print("All categorical columns contain valid allowed values.")

### Step 5. Logical Ranges Check

In [4]:
# Data Validation -> Step 5: Logical Ranges Check

range_errors = []

# Define expected valid ranges for each column
valid_ranges = {
    "age": (1, 120),              # Age in years
    "trestbps": (50, 300),        # Resting blood pressure
    "chol": (50, 700),            # Serum cholesterol
    "thalach": (50, 250),         # Max heart rate
    "oldpeak": (0.0, 10.0),       # ST depression induced by exercise
}

for col, (min_val, max_val) in valid_ranges.items():
    if col in df.columns:
        invalid_low = df[df[col] < min_val]
        invalid_high = df[df[col] > max_val]
        
        if not invalid_low.empty:
            range_errors.append(f"Column '{col}' has values below {min_val}. Examples: {invalid_low[col].tolist()[:5]}")
        if not invalid_high.empty:
            range_errors.append(f"Column '{col}' has values above {max_val}. Examples: {invalid_high[col].tolist()[:5]}")

# Cross-column checks
# thalach (beats per minute) and trestbps (mm Hg) measure different quantities,
# so they are not compared; the lower bound on thalach is enforced by valid_ranges.

if (df["oldpeak"] < 0).any():
    range_errors.append("oldpeak contains negative values, which are invalid.")

# Print results
if range_errors:
    print("Logical Range Errors Found:")
    for err in range_errors:
        print("-", err)
else:
    print("All numeric columns fall within expected logical / medical ranges.")

NameError: name 'df' is not defined

### Step 6: Train/Test Leakage Check

In [5]:
# Data Validation - > Step 6: Train/Test Leakage Check

from sklearn.model_selection import train_test_split

# Split the data (note: the modeling split above uses random_state=123 without stratify)
X = df.drop(columns=["target"])
y = df["target"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2024, stratify=y
)

leakage_errors = []

# Check 1: No overlapping rows

train_indices = set(X_train.index)
test_indices = set(X_test.index)

if train_indices & test_indices:
    leakage_errors.append("Train and test sets have overlapping indices!")

# Check 2: Every category seen in TEST also appears in TRAIN
# (missing values are skipped here; Step 2 reports them)

categorical_cols = ["sex", "cp", "fbs", "restecg", "exang", "slope", "ca", "thal"]

for col in categorical_cols:
    test_extra = set(X_test[col].dropna().unique()) - set(X_train[col].dropna().unique())
    if test_extra:
        leakage_errors.append(
            f"Column '{col}' has categories in TEST not present in TRAIN (train/test mismatch): {test_extra}"
        )

# Check 3: No target leakage into features

if "target" in X_train.columns:
    leakage_errors.append("Target column found in training features! Ensure drop(columns=['target']) is applied.")

# Print results
if leakage_errors:
    print("Data Leakage Detected:")
    for err in leakage_errors:
        print("-", err)
else:
    print("No data leakage detected by these checks.")


NameError: name 'df' is not defined
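
Check 1 above compares index labels, which `train_test_split` keeps disjoint by construction. A complementary check, sketched here on made-up rows, compares row content instead, so it can catch a patient record that appears in both splits under different index labels. The `train`, `test`, and `shared_rows` names are hypothetical; in the notebook the inputs would be `X_train` and `X_test`.

```python
import pandas as pd

# Hypothetical toy splits; in the notebook these would be X_train and X_test
train = pd.DataFrame({"age": [63, 67, 37], "chol": [233, 286, 250]})
test = pd.DataFrame({"age": [67, 41], "chol": [286, 204]})

# An inner merge on all shared columns keeps only rows present in both splits
shared_rows = pd.merge(train.drop_duplicates(), test.drop_duplicates(), how="inner")

print(f"Rows appearing in both splits: {len(shared_rows)}")
```

Identical rows are not always an error in a small clinical dataset, since two patients can share the same measurements, so a non-zero count is a prompt to inspect rather than proof of leakage.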