# Project 1: Classification - Predicting Heart Disease

## Dataset

Heart Disease UCI (available from UCI Machine Learning Repository)

## Analysis Goals:

1. Data Preprocessing
    - Handle missing values using imputation techniques like mean or median.
    - Encode categorical variables using one-hot encoding.
    - Scale numerical features using StandardScaler.

2. Feature Engineering
    - Create new features such as the ratio of maximum heart rate achieved to age.
    - Use SelectKBest to select relevant features.

3. Model Building
    - Algorithms
        - Logistic Regression
        - Random Forest
        - Decision Tree
        - KNN
        - Gradient Boosting
        - Support Vector Machines
        - Gaussian Naive Bayes
        - Majority Vote
    - Train/Test Split: 80/20 split.
    - Model evaluation
        - Accuracy
        - Precision
        - Recall
        - F1-score
        - ROC-AUC

4. Hyperparameter Tuning
    - GridSearchCV or RandomizedSearchCV to find optimal parameters.
    - Cross-validation with k-fold (e.g., 5-fold).

5. Model Evaluation
    - ROC curves and AUC-ROC.
    - Confusion matrices for each model.
    - Compare models using metrics like
        - Accuracy
        - Precision
        - Recall
        - F1-score
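
The feature engineering step in goal 2 can be sketched as follows. This is a minimal illustration on synthetic data, not on the heart disease data itself. The column name `MaxHR_per_Age`, the helper names, and the choice of `f_classif` as the scoring function are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif


def add_hr_age_ratio(df):
    """Return a copy of df with a MaxHR-to-Age ratio column (hypothetical name)."""
    out = df.copy()
    out["MaxHR_per_Age"] = out["MaxHR"] / out["Age"]
    return out


def select_top_k(X, y, k):
    """Return the names of the k columns with the highest ANOVA F-score."""
    selector = SelectKBest(score_func=f_classif, k=k)
    selector.fit(X, y)
    return list(X.columns[selector.get_support()])


# Synthetic stand-in data: the target depends on MaxHR only, by construction
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Age": rng.integers(30, 80, size=n),
    "MaxHR": rng.integers(80, 200, size=n),
    "Noise": rng.normal(size=n),
})
toy_y = (toy["MaxHR"] < 140).astype(int)

toy = add_hr_age_ratio(toy)
selected = select_top_k(toy, toy_y, k=2)
print(selected)
```

To avoid leaking test information, `SelectKBest` would normally be fit on the training split only, for example as a step inside a `Pipeline`.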

## Analysis

### Load Data

| Variable Name   | Role    | Description                                            |
|-----------------|---------|--------------------------------------------------------|
| Age             | Feature | Age (years)                                            |
| Sex             | Feature | Sex (M, F)                                             |
| ChestPainType   | Feature | Chest pain type                                        |
| RestingBP       | Feature | Resting blood pressure (mm Hg)                         |
| Cholesterol     | Feature | Serum cholesterol (mg/dl)                              |
| FastingBS       | Feature | Fasting blood sugar (1 if > 120 mg/dl, else 0)         |
| RestingECG      | Feature | Resting electrocardiogram results                      |
| MaxHR           | Feature | Maximum heart rate achieved                            |
| ExerciseAngina  | Feature | Exercise-induced angina (Y, N)                         |
| Oldpeak         | Feature | ST depression induced by exercise relative to rest     |
| ST_Slope        | Feature | Slope of the peak exercise ST segment                  |
| HeartDisease    | Target  | Heart disease (1 = present, 0 = absent)                |


In [16]:
import pandas as pd

df = pd.read_csv('data/heart.csv')
df.head()

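|   | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|-----|-----|---------------|-----------|-------------|-----------|------------|-------|----------------|---------|----------|--------------|
| 0 | 40  | M   | ATA           | 140       | 289         | 0         | Normal     | 172   | N              | 0.0     | Up       | 0            |
| 1 | 49  | F   | NAP           | 160       | 180         | 0         | Normal     | 156   | N              | 1.0     | Flat     | 1            |
| 2 | 37  | M   | ATA           | 130       | 283         | 0         | ST         | 98    | N              | 0.0     | Up       | 0            |
| 3 | 48  | F   | ASY           | 138       | 214         | 0         | Normal     | 108   | Y              | 1.5     | Flat     | 1            |
| 4 | 54  | M   | NAP           | 150       | 195         | 0         | Normal     | 122   | N              | 0.0     | Up       | 0            |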


### Data Preprocessing

#### Missing Values and Encoding

In [17]:
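# Check for missing values. The counts printed below are all zero,
# so dropna() is a no-op here and is kept only as a safeguard.
print(df.isnull().sum())
df.dropna(inplace=True)

# One-hot encode the categorical columns; drop_first=True drops one level
# per variable so the dummy columns are not redundant
df = pd.get_dummies(df, drop_first=True)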

df.head()

Age               0
Sex               0
ChestPainType     0
RestingBP         0
Cholesterol       0
FastingBS         0
RestingECG        0
MaxHR             0
ExerciseAngina    0
Oldpeak           0
ST_Slope          0
HeartDisease      0
dtype: int64


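|   | Age | RestingBP | Cholesterol | FastingBS | MaxHR | Oldpeak | HeartDisease | Sex_M | ChestPainType_ATA | ChestPainType_NAP | ChestPainType_TA | RestingECG_Normal | RestingECG_ST | ExerciseAngina_Y | ST_Slope_Flat | ST_Slope_Up |
|---|-----|-----------|-------------|-----------|-------|---------|--------------|-------|-------------------|-------------------|------------------|-------------------|---------------|------------------|---------------|-------------|
| 0 | 40  | 140       | 289         | 0         | 172   | 0.0     | 0            | 1     | 1                 | 0                 | 0                | 1                 | 0             | 0                | 0             | 1           |
| 1 | 49  | 160       | 180         | 0         | 156   | 1.0     | 1            | 0     | 0                 | 1                 | 0                | 1                 | 0             | 0                | 1             | 0           |
| 2 | 37  | 130       | 283         | 0         | 98    | 0.0     | 0            | 1     | 1                 | 0                 | 0                | 0                 | 1             | 0                | 0             | 1           |
| 3 | 48  | 138       | 214         | 0         | 108   | 1.5     | 1            | 0     | 0                 | 0                 | 0                | 1                 | 0             | 1                | 1             | 0           |
| 4 | 54  | 150       | 195         | 0         | 122   | 0.0     | 0            | 1     | 0                 | 1                 | 0                | 1                 | 0             | 0                | 0             | 1           |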
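
The effect of `drop_first=True` is easiest to see on a tiny frame. The sketch below uses made-up rows, with column names borrowed from the dataset. `dtype=int` is passed only so the dummies print as 0/1 regardless of pandas version.

```python
import pandas as pd

# Made-up rows; only the column names come from the dataset
toy = pd.DataFrame({
    "Sex": ["M", "F", "M"],
    "ST_Slope": ["Up", "Flat", "Down"],
})

# Categories are sorted alphabetically and the first one of each column is
# dropped: "F" for Sex and "Down" for ST_Slope
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded.columns.tolist())
# ['Sex_M', 'ST_Slope_Flat', 'ST_Slope_Up']
```

A row with all dummies of a variable equal to 0 therefore belongs to the dropped level, which is why no `Sex_F` column appears in the encoded frame.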


In [19]:
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Separate the dataframe into features X and target y
y = df['HeartDisease']
X = df.drop(['HeartDisease'], axis=1)

# Scaling is not applied here: the model pipelines below include a
# StandardScaler, so the scaler is fit on training data only and X stays unscaled.

# Split data into training and test sets (80/20, stratified on the target)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)

print(len(X_train),len(y_train))
print(len(X_test),len(y_test))

734 734
184 184
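
When scaling is done by hand rather than inside a pipeline, the scaler should learn its mean and standard deviation from the training split only. A minimal sketch on synthetic data (all variable names here are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features with a non-zero mean and a balanced binary target
rng = np.random.default_rng(1)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_toy = np.array([0, 1] * 50)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy
)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # statistics come from training rows only
X_te_scaled = scaler.transform(X_te)      # test rows reuse the training mean and std

print(X_tr_scaled.mean(axis=0).round(6))  # approximately zero per column
print(X_te_scaled.mean(axis=0).round(3))  # close to zero, but not exactly
```

Fitting the scaler on all rows before splitting would let test-set statistics influence the training features.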


### Model Building

In [52]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
import xgboost as xgb

from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import Pipeline




# Logistic Regression
lr_clf = LogisticRegression(max_iter=10000)
lr_pipe = Pipeline([['sc', StandardScaler()], ['clf', lr_clf]])
# Decision Tree
dt_clf = DecisionTreeClassifier(criterion='entropy')
# K-Nearest Neighbors
knn_clf = KNeighborsClassifier()
knn_pipe = Pipeline([['sc', StandardScaler()], ['clf', knn_clf]])
# Random Forest
rf_clf = RandomForestClassifier(n_jobs=-1)
# Support Vector Machine
svc_clf = SVC(probability=True)
svc_pipe = Pipeline([['sc', StandardScaler()], ['clf', svc_clf]])
# Naive Bayes
gnb_clf = GaussianNB()
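# Gradient Boosting (XGBoost); n_estimators, learning_rate and max_depth are
# overridden by the grid search below. use_label_encoder is deprecated in
# recent xgboost releases and may trigger a warning there.
xgb_clf = xgb.XGBClassifier(n_estimators=1000, learning_rate=0.01, max_depth=4, random_state=1, use_label_encoder=False)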

classifiers = {
    'Logistic Regression':lr_pipe,
    'Decision Tree':dt_clf,
    'KNN':knn_pipe,
    'Random Forest':rf_clf,
    'SVM':svc_pipe,
    'Gaussian Naive Bayes':gnb_clf,
    'XGBoost':xgb_clf,
}

param_grids = {
    'Logistic Regression': {'clf__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000], 'clf__penalty': ['l2']},
    'Decision Tree': {'max_depth': [3, 4, 5, 6]},
    'KNN': {'clf__n_neighbors': [3, 5, 7], 'clf__p': [1, 2]},
    'Random Forest': {'n_estimators': [100, 500, 1000]},
    'SVM': {'clf__C': [0.1, 1, 10], 'clf__kernel': ['linear', 'rbf', 'poly'], 'clf__gamma': ['scale', 'auto']},
    'Gaussian Naive Bayes': {'priors': [None, [0.5, 0.5]], 'var_smoothing': [1e-9, 1e-8, 1e-7]},
    'XGBoost': {'n_estimators': [100, 500, 1000], 'learning_rate': [0.01, 0.1, 0.2], 'max_depth': [1, 2, 3, 4, 5]},
}
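
The `clf__` prefix in the grids for the pipeline-based models follows scikit-learn's `<step name>__<parameter>` convention for nested parameters. A small sketch, using a throwaway pipeline named `demo_pipe` (an illustrative name, not part of the notebook):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipe = Pipeline([("sc", StandardScaler()), ("clf", LogisticRegression())])

# Every parameter of the 'clf' step is exposed as 'clf__<name>'
clf_params = sorted(name for name in demo_pipe.get_params() if name.startswith("clf__"))
print(clf_params[:3])

# GridSearchCV sets parameters through the same mechanism
demo_pipe.set_params(clf__C=10)
print(demo_pipe.named_steps["clf"].C)  # 10
```

Models that are not wrapped in a pipeline (Decision Tree, Random Forest, Gaussian Naive Bayes, XGBoost) take plain parameter names such as `max_depth`.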

In [53]:
from sklearn.model_selection import GridSearchCV
# Perform grid search for each classifier
best_estimators = {}
for clf_label, clf in classifiers.items():
    print(f"Grid search for {clf_label}...")
    grid_search = GridSearchCV(clf, param_grids[clf_label], cv=5, scoring='accuracy')
    grid_search.fit(X_train, y_train)
    best_estimators[clf_label] = grid_search.best_estimator_
    print(f"Best parameters: {grid_search.best_params_}")
    print(f"Best cross-validation accuracy: {grid_search.best_score_:.2f}\n")

Grid search for Logistic Regression...
Best parameters: {'clf__C': 10, 'clf__penalty': 'l2'}
Best cross-validation accuracy: 0.87

Grid search for Decision Tree...
Best parameters: {'max_depth': 5}
Best cross-validation accuracy: 0.84

Grid search for KNN...
Best parameters: {'clf__n_neighbors': 5, 'clf__p': 1}
Best cross-validation accuracy: 0.87

Grid search for Random Forest...
Best parameters: {'n_estimators': 500}
Best cross-validation accuracy: 0.88

Grid search for SVM...
Best parameters: {'clf__C': 1, 'clf__gamma': 'scale', 'clf__kernel': 'rbf'}
Best cross-validation accuracy: 0.87

Grid search for Gaussian Naive Bayes...
Best parameters: {'priors': None, 'var_smoothing': 1e-07}
Best cross-validation accuracy: 0.86

Grid search for XGBoost...
Best parameters: {'learning_rate': 0.2, 'max_depth': 1, 'n_estimators': 100}
Best cross-validation accuracy: 0.89
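
The analysis goals also mention `RandomizedSearchCV`, which samples a fixed number of parameter settings instead of trying every combination. A hedged sketch on synthetic data follows. The log-uniform range for `C`, `n_iter=10`, and the toy dataset are assumptions for illustration, not settings used in this notebook.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for X_train, y_train
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=1)

pipe = Pipeline([("sc", StandardScaler()), ("clf", LogisticRegression(max_iter=1000))])

# Sample C from a continuous log-uniform distribution instead of a fixed list
param_distributions = {"clf__C": loguniform(1e-3, 1e3)}

search = RandomizedSearchCV(
    pipe,
    param_distributions,
    n_iter=10,        # number of sampled settings
    cv=5,
    scoring="accuracy",
    random_state=1,
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(f"Best cross-validation accuracy: {search.best_score_:.2f}")
```

The cost is `n_iter` times the number of folds, independent of how many parameters are searched, which helps with larger grids such as the XGBoost one.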



In [54]:
from sklearn.metrics import accuracy_score
# Evaluate best estimators on test set
print("Evaluation on test set:")
for clf_label, clf in best_estimators.items():
    y_pred = clf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"{clf_label}: Accuracy = {accuracy:.2f}")

Evaluation on test set:
Logistic Regression: Accuracy = 0.85
Decision Tree: Accuracy = 0.82
KNN: Accuracy = 0.88
Random Forest: Accuracy = 0.86
SVM: Accuracy = 0.86
Gaussian Naive Bayes: Accuracy = 0.84
XGBoost: Accuracy = 0.85
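
Accuracy is only one of the metrics listed in the analysis goals. The sketch below shows one way the remaining metrics and a confusion matrix could be computed for a fitted classifier. The helper name `evaluate` and the synthetic data are assumptions; the classifier needs a `predict_proba` method, which all models defined above provide (the SVM through `probability=True`).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split


def evaluate(clf, X_eval, y_eval):
    """Return common binary classification metrics for a fitted classifier."""
    y_pred = clf.predict(X_eval)
    y_score = clf.predict_proba(X_eval)[:, 1]  # probability of the positive class
    return {
        "accuracy": accuracy_score(y_eval, y_pred),
        "precision": precision_score(y_eval, y_pred),
        "recall": recall_score(y_eval, y_pred),
        "f1": f1_score(y_eval, y_pred),
        "roc_auc": roc_auc_score(y_eval, y_score),
        "confusion_matrix": confusion_matrix(y_eval, y_pred),
    }


# Synthetic data standing in for the train/test split
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
report = evaluate(model, X_te, y_te)
for name, value in report.items():
    print(name, value, sep=": ")
```

In the notebook, the same helper could be called inside the loop over `best_estimators` with `X_test` and `y_test`.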


In [None]:
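# Voting ensemble. voting='soft' averages the predicted class probabilities;
# voting='hard' would take a strict majority vote over the predicted labels.
# Random Forest and XGBoost are not part of this ensemble.
estimators = [('lr',lr_pipe),('dt',dt_clf),('knn',knn_pipe),('svc',svc_pipe),('gnb',gnb_clf)]
mv_clf = VotingClassifier(estimators=estimators, voting='soft')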
param_grids = {
    'lr__clf__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    'lr__clf__penalty': ['l2'],
    'dt__max_depth': [3, 4, 5, 6],
    'knn__clf__n_neighbors': [3, 5, 7],
    'knn__clf__p': [1, 2],
    'svc__clf__C': [0.1, 1, 10],
    'svc__clf__gamma': [0.1, 1, 10],
    'gnb__priors':[None, [0.5, 0.5]],
    'gnb__var_smoothing':[1e-9, 1e-8, 1e-7],
}

# Perform grid search
grid_search = GridSearchCV(mv_clf, param_grids, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Get best parameters and best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_
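
A joint grid over all ensemble members grows multiplicatively, so it is worth counting the candidates before running the search. The sketch below copies the grid from the cell above into a variable named `voting_grid` (an illustrative name) and counts the combinations with `ParameterGrid`.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the voting classifier search
voting_grid = {
    'lr__clf__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
    'lr__clf__penalty': ['l2'],
    'dt__max_depth': [3, 4, 5, 6],
    'knn__clf__n_neighbors': [3, 5, 7],
    'knn__clf__p': [1, 2],
    'svc__clf__C': [0.1, 1, 10],
    'svc__clf__gamma': [0.1, 1, 10],
    'gnb__priors': [None, [0.5, 0.5]],
    'gnb__var_smoothing': [1e-9, 1e-8, 1e-7],
}

n_candidates = len(ParameterGrid(voting_grid))
n_fits = n_candidates * 5  # cv=5 folds per candidate
print(n_candidates, n_fits)  # 9072 45360
```

Each of these fits trains all five base models, so reusing the per-model best parameters found earlier, or switching to a randomized search, would be much cheaper.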

In [57]:
print("Best Parameters:", best_params)
print(f"Best Score: {best_score:.2f}")

Best Parameters: {'dt__max_depth': 6, 'gnb__priors': None, 'gnb__var_smoothing': 1e-09, 'knn__clf__n_neighbors': 3, 'knn__clf__p': 2, 'lr__clf__C': 0.001, 'lr__clf__penalty': 'l2', 'svc__clf__C': 1, 'svc__clf__gamma': 1}
Best Score: 0.88
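
The difference between hard (majority) and soft voting can be sketched on synthetic data. The three base models, their settings, and the helper `make_estimators` are chosen for brevity and are not the tuned ensemble above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=1)


def make_estimators():
    """Return a fresh list of (name, estimator) pairs for an ensemble."""
    return [
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier(max_depth=3, random_state=1)),
        ("gnb", GaussianNB()),
    ]


# Hard voting: each model casts one vote for a class label
hard_vote = VotingClassifier(estimators=make_estimators(), voting="hard").fit(X_toy, y_toy)
# Soft voting: class probabilities are averaged, then the argmax is taken
soft_vote = VotingClassifier(estimators=make_estimators(), voting="soft").fit(X_toy, y_toy)

hard_pred = hard_vote.predict(X_toy[:5])
soft_proba = soft_vote.predict_proba(X_toy[:5])

# With equal weights, the soft-voting probabilities are the plain mean
# of the base models' probabilities
manual_mean = np.mean(
    [est.predict_proba(X_toy[:5]) for est in soft_vote.estimators_], axis=0
)
print(hard_pred)
print(np.allclose(soft_proba, manual_mean))  # True
```

Soft voting requires every base model to implement `predict_proba`, which is why the SVM above is created with `probability=True`.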
