# Implementation of Heart Disease Classification

This notebook walks through the complete workflow of a heart disease classification project: data preparation, visualization, preprocessing, and modeling.

---

## Table of Contents

1. [Environment Setup](#Environment-Setup)
2. [Data Ingestion](#Data-Ingestion)



---
## Environment Setup

In [1]:
# This cell imports the core libraries used in this project.
# - pandas: for data loading and manipulation
# - numpy: for numeric processing
# - matplotlib.pyplot and seaborn: for plotting
# - scikit-learn: for preprocessing, model building, and evaluation metrics

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, RobustScaler

from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import VotingClassifier

from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             confusion_matrix, classification_report, roc_curve, auc,
                             RocCurveDisplay, roc_auc_score)


---
## Data Ingestion

In [2]:
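try:
    # Preferred path: fetch the dataset through the ucimlrepo package.
    import ucimlrepo as uci
    heart_disease = uci.fetch_ucirepo(id=45)

    X = heart_disease.data.features
    y = heart_disease.data.targets

    df = heart_disease.data.original
except Exception:
    # Fallback if ucimlrepo is missing or the fetch fails: read the raw file directly.
    # Replace the URL with a local path if the dataset has already been downloaded.
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"

    df = pd.read_csv(
        url,
        header=None,
    )
    # The raw file marks missing values with "?"
    df = df.replace('?', np.nan)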

# Column names based on the UCI Heart Disease (Cleveland) documentation
column_names = [
    "age",       # 0 
    "sex",       # 1 - (0 = female, 1 = male)
    "cp",        # 2 - chest pain type (1 = typical angina, 2 = atypical angina, 3 = non-anginal pain, 4 = asymptomatic)
    "trestbps",  # 3 - resting blood pressure (mm Hg)
    "chol",      # 4 - serum cholesterol (mg/dl)
    "fbs",       # 5 - fasting blood sugar > 120 mg/dl (0 = false; 1 = true)
    "restecg",   # 6 - resting ECG (0 = normal, 
                 #                  1 = having ST-T wave abnormality, 
                 #                  2 = showing probable or definite left ventricular hypertrophy by Estes' criteria)
    "thalach",   # 7 - max heart rate
    "exang",     # 8 - exercise induced angina (0 = no; 1 = yes)
    "oldpeak",   # 9 - ST depression
    "slope",     # 10 - slope of ST segment (1 = upsloping, 2 = flat, 3 = downsloping)
    "ca",        # 11 - number of major vessels
    "thal",      # 12 - thalassemia (3 = normal, 6 = fixed defect, 7 = reversible defect)
    "target"     # 13 - diagnosis (0 = no disease, 1–4 = disease)
]

df = df.rename(columns=dict(zip(df.columns, column_names)))
df.head()


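|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|
| 0 | 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0.0 | 6.0 | 0 |
| 1 | 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3.0 | 3.0 | 2 |
| 2 | 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2.0 | 7.0 | 1 |
| 3 | 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0.0 | 3.0 | 0 |
| 4 | 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0.0 | 3.0 | 0 |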
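
In the fallback branch, any column that contained a `?` is read as text, so it stays non-numeric even after `?` is replaced with `NaN`. The sketch below shows one way to coerce such columns and count the missing values. It runs on a small made-up `sample` frame rather than the real data, and the choice of `ca` and `thal` as the affected columns is an assumption to verify against the loaded `df`.

```python
import pandas as pd

# Hypothetical three-row sample in the raw file's shape; "?" marks a missing value.
sample = pd.DataFrame({
    "age": [63, 67, 52],
    "ca": ["0.0", "3.0", "?"],
    "thal": ["6.0", "?", "7.0"],
    "target": [0, 2, 1],
})

# Coerce the text columns to numbers; anything unparseable ("?") becomes NaN.
for col in ["ca", "thal"]:
    sample[col] = pd.to_numeric(sample[col], errors="coerce")

# Count missing values per column and show only the columns that have any.
missing_counts = sample.isna().sum()
print(missing_counts[missing_counts > 0])
print(sample.dtypes)
```

On the real frame, the same loop could be applied to `df` before any scaling or encoding step, since those steps expect numeric input.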


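The `target` column is documented as 0 for no disease and 1–4 for disease. If the goal is a binary classifier, one common option is to collapse it to 0/1. The sketch below illustrates that mapping on a made-up series; whether this notebook's later steps use a binary or multi-class target is an assumption here.

```python
import pandas as pd

# Hypothetical diagnosis values covering the documented 0-4 range.
target = pd.Series([0, 2, 1, 0, 4, 3], name="target")

# 0 stays 0 (no disease); any value from 1 to 4 becomes 1 (disease).
target_binary = (target > 0).astype(int)

print(target_binary.tolist())  # [0, 1, 1, 0, 1, 1]
```

Applied to the loaded data, the equivalent would be `df["target"] = (df["target"] > 0).astype(int)`.
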
---