# Machine Learning Practicum: Heart Disease Dataset

In this practicum, we will explore the Heart Disease dataset using unsupervised learning methods and then apply several supervised learning algorithms. 

This practicum covers the following topics:

* Principal Component Analysis
* t-SNE
* KMeans Clustering
* Evaluation of ML Models
* Ensemble Learning
* [BONUS] Neural Networks

It will consist of 5 tasks:

| Task ID  | Description                                      | Points |
|----------|--------------------------------------------------|--------|
| 00       | Load and Prepare Dataset                                     |       |
| 01       | Visualizations                              |       |
| &nbsp;&nbsp;&nbsp;&nbsp;01-1     | &nbsp;&nbsp;&nbsp;&nbsp;-  Standardize the dataset                  |       |
| &nbsp;&nbsp;&nbsp;&nbsp;01-2     | &nbsp;&nbsp;&nbsp;&nbsp;-  Visualize with PCA                  |       |
| &nbsp;&nbsp;&nbsp;&nbsp;01-3     | &nbsp;&nbsp;&nbsp;&nbsp;-  Visualize with t-SNE                  |       |
| &nbsp;&nbsp;&nbsp;&nbsp;01-4     | &nbsp;&nbsp;&nbsp;&nbsp;-  Cluster with KMeans                  |       |
| 02       | Evaluating Supervised Classifiers                 |       |
| &nbsp;&nbsp;&nbsp;&nbsp;02-1     | &nbsp;&nbsp;&nbsp;&nbsp;-  Create SVM and Param grid                  |       |
| &nbsp;&nbsp;&nbsp;&nbsp;02-2     | &nbsp;&nbsp;&nbsp;&nbsp;-  Perform grid search                  |       |
| &nbsp;&nbsp;&nbsp;&nbsp;02-3     | &nbsp;&nbsp;&nbsp;&nbsp;-  Calculate Evaluation metrics                  |       |
| &nbsp;&nbsp;&nbsp;&nbsp;02-4     | &nbsp;&nbsp;&nbsp;&nbsp;-  Visualize with Confusion Matrix                  |       |
| &nbsp;&nbsp;&nbsp;&nbsp;02-5     | &nbsp;&nbsp;&nbsp;&nbsp;-  Synthetically Balance Dataset                  |       |
| &nbsp;&nbsp;&nbsp;&nbsp;02-6     | &nbsp;&nbsp;&nbsp;&nbsp;-  Compare Classifiers on Balanced and Unbalanced Dataset                  |       |
| 03       | Ensemble Learning                     |        |
| &nbsp;&nbsp;&nbsp;&nbsp;03-1     | &nbsp;&nbsp;&nbsp;&nbsp;-  Simple voting ensemble                  |       |
| &nbsp;&nbsp;&nbsp;&nbsp;03-2     | &nbsp;&nbsp;&nbsp;&nbsp;- Stacking ensemble                  |       |
| 04       | [BONUS] Simple Neural Network                     |        |
| &nbsp;&nbsp;&nbsp;&nbsp;04-1     | &nbsp;&nbsp;&nbsp;&nbsp;- Network Architecture                  |       |
| &nbsp;&nbsp;&nbsp;&nbsp;04-2     | &nbsp;&nbsp;&nbsp;&nbsp;- Network Parameters                  |       |
| &nbsp;&nbsp;&nbsp;&nbsp;04-3     | &nbsp;&nbsp;&nbsp;&nbsp;- Test Network on Balanced and Unbalanced Data                  |       |

### *Story Progression*
Per Dogtor Golden, we need to make sure that we get at least an 80% accuracy on the dataset so we can save more pets at the pet hospital!


## Task 00: Load and Prepare Data
### Task 00: Description
#### Load data

For this task, we need to load in the dataset and fix any missing values. The Heart Disease dataset contains 303 samples with 13 features plus a target column (14 columns in total). The task is to classify each patient's heart disease risk level, given as a class label from 0 to 4. Let's load it in and take a quick look at the structure of the dataset.

### Task 00: Code

In [None]:
import os  # needed for os.path.exists in the Colab setup below
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore", category=RuntimeWarning)
np.seterr(divide='ignore', over='ignore', invalid='ignore')

try:
    import google.colab
    REPO_URL = "https://github.com/nd-cse-30124-fa25/cse-30124-homeworks.git"
    REPO_NAME = "cse-30124-homeworks"
    HW_FOLDER = "practicum" 

    # Clone repo if not already present
    if not os.path.exists(REPO_NAME):
        !git clone {REPO_URL}

    # cd into the homework folder
    %cd {REPO_NAME}/{HW_FOLDER}

except ImportError:
    pass

# url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data'
col_names = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target']
# Load Heart Disease dataset
# df = pd.read_csv(url, names=col_names, na_values='?')
df = pd.read_csv('./heart.csv', names=col_names, na_values='?')
df.head()

# Assuming df is your DataFrame and 'target' is your target column
# Number of samples
num_samples = df.shape[0]

# Number of features (excluding the target column)
num_features = df.shape[1] - 1  # Subtract 1 for the target column

# Number of unique target values
num_unique_targets = df['target'].nunique()

# Unique target values and their counts
unique_target_values = df['target'].value_counts()

print(f"Number of samples: {num_samples}")
print(f"Number of features: {num_features}")
print(f"Number of unique target values: {num_unique_targets}")
print("\nUnique target values and their counts:")
print(unique_target_values)

# Check for missing values
df.isnull().sum()

# Fill missing values with the median for simplicity
df.fillna(df.median(), inplace=True)

## Task 01: Visualizations
### Task 01-1: Description
#### Standardize Data

To visualize our data we'll need to reduce its dimensionality, since it has 13 features out of the box. One method we know for doing that is Principal Component Analysis (PCA)!
Because PCA is sensitive to the scale of each feature, the first thing we'll need to do is standardize the data.
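
As a rough illustration of what standardizing does, here is a minimal sketch on a small made-up matrix (`X_demo` is hypothetical toy data, not the practicum dataset). `StandardScaler` subtracts each column's mean and divides by its standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 5 samples, 2 features on very different scales
X_demo = np.array([[63.0, 233.0],
                   [37.0, 250.0],
                   [41.0, 204.0],
                   [56.0, 236.0],
                   [57.0, 354.0]])

scaler = StandardScaler()
X_demo_scaled = scaler.fit_transform(X_demo)

# After scaling, every column has mean ~0 and standard deviation ~1
print(X_demo_scaled.mean(axis=0).round(6))
print(X_demo_scaled.std(axis=0).round(6))
```

The same `fit_transform` call should work on the feature columns of `df` once the target column is dropped.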

### Task 01-1: Code

In [None]:
from sklearn.preprocessing import StandardScaler

# TODO: Scale the dataset

### Task 01-2: Description
#### Visualize Data with PCA

Now that our data is standardized, we can run PCA on it to reduce it to two dimensions and then display the result using `pyplot`.
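
A minimal sketch of the reduction step, assuming random stand-in data with 13 columns (`X_demo` is hypothetical; swap in the scaled practicum features). It only shows the shape of the call, not the expected plot.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical stand-in for the practicum features: 100 samples, 13 columns
X_demo = rng.normal(size=(100, 13))
X_demo_scaled = StandardScaler().fit_transform(X_demo)

pca = PCA(n_components=2)
X_demo_pca = pca.fit_transform(X_demo_scaled)

print(X_demo_pca.shape)
# Fraction of the total variance captured by each of the two components
print(pca.explained_variance_ratio_)
```

Checking `explained_variance_ratio_` is a quick way to see how much information a 2D plot is throwing away.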

### Task 01-2: Code

In [None]:
from sklearn.decomposition import PCA

# TODO: Use PCA to reduce to 2D for visualization

# Plot PCA result
import matplotlib.pyplot as plt
plt.scatter(X_pca[:, 0], X_pca[:, 1])
plt.title('PCA of Heart Disease Dataset')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.show()

### Task 01-3: Description
#### Visualize Data with t-SNE

Unfortunately, PCA is a linear projection that does nothing to preserve the local neighborhood structure of our data points, so it's often not the best option for visualizing high-dimensional data. A better option is a technique called t-SNE, which tries to keep points that are close together in the original space close together in the low-dimensional plot.
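
Here is a small sketch of the t-SNE call on made-up data (two hypothetical blobs in 13 dimensions, not the practicum dataset). The `perplexity` value is an assumption chosen for this tiny example and would likely need tuning on real data.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Hypothetical data: two well-separated blobs in 13 dimensions
blob_a = rng.normal(loc=0.0, size=(30, 13))
blob_b = rng.normal(loc=8.0, size=(30, 13))
X_demo = np.vstack([blob_a, blob_b])

# perplexity must be smaller than the number of samples
tsne = TSNE(n_components=2, perplexity=10, random_state=0)
X_demo_tsne = tsne.fit_transform(X_demo)

print(X_demo_tsne.shape)
```

Note that t-SNE has no separate `transform` method, so the embedding is always computed with `fit_transform` on the full set of points you want to plot.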

### Task 01-3: Code

In [None]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# TODO: Apply t-SNE to the scaled dataset

# Plot the t-SNE result
plt.figure(figsize=(10, 6))
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=df['target'])  # color by label so the colorbar is meaningful
plt.colorbar(label='Target')
plt.title('t-SNE Visualization of Heart Disease Dataset')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.show()

Well, things are certainly more spread out than in PCA, but I don't really see any clear clusters. However, we can try clustering with KMeans and see if we can find any, no need to just guess visually!

### Task 01-4: Description
#### Cluster with KMeans

Luckily, we know there should be 5 distinct groups in our data somewhere. Let's see how KMeans would split our data into 5 different groups, and then, because we actually have the labels, let's overlay those to try and get a sense of how separable our data actually is!
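
A minimal sketch of the clustering call, assuming three made-up 2D blobs (`X_demo` and `centers` are hypothetical; the practicum itself asks for 5 clusters on the real features).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical data: three tight blobs in 2D, 20 points each
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_demo = np.vstack([rng.normal(loc=c, scale=0.5, size=(20, 2)) for c in centers])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
clusters_demo = kmeans.fit_predict(X_demo)

# One cluster ID per sample; the IDs are arbitrary and need not match class labels
print(np.bincount(clusters_demo))
```

Because cluster IDs are arbitrary, cluster `0` does not have to correspond to target `0`, which is worth keeping in mind when overlaying the true labels on the plot.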

### Task 01-4: Code

In [None]:
from sklearn.cluster import KMeans

# TODO: Apply K-Means

# Plot K-Means result
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=clusters_kmeans)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=df['target'], marker='x')
plt.title('K-Means Clustering')
plt.show()

There are definitely some clear groupings here, but that's to be expected with KMeans, since it always splits the data into the number of clusters we ask for, whether or not real clusters exist.

## Task 02: Evaluating Supervised Classifiers
### Task 02-1: Description
#### Initialize Hyperparameter grid

Since we have ground-truth labels, we can use supervised learning techniques to try and get that `80%` accuracy we need! Let's start with an SVM. We should probably try several different combinations of hyperparameters.
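
One possible shape for the grid, sketched on synthetic data from `make_classification` (the data, the names ending in `_demo`, and the specific parameter values are all assumptions for illustration, not recommended settings). A list of dicts keeps kernel-specific parameters such as `degree` and `gamma` from being tried with kernels that ignore them.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Hypothetical synthetic data standing in for the practicum features
X_demo, y_demo = make_classification(
    n_samples=200, n_features=13, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# Each dict is its own sub-grid, so parameters only pair with relevant kernels
param_grid_demo = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": ["scale", 0.01]},
    {"kernel": ["poly"], "C": [0.1, 1], "degree": [2, 3]},
]

grid_demo = GridSearchCV(SVC(), param_grid_demo, cv=5, scoring="accuracy")
grid_demo.fit(X_tr, y_tr)

print(grid_demo.best_params_)
print(round(grid_demo.best_score_, 2))
```

With this layout, `cv_results_` still contains `param_kernel`, `param_C`, `param_degree`, and `param_gamma` columns, with missing entries wherever a parameter does not apply.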

### Task 02-1: Code

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
import seaborn as sns
import matplotlib.pyplot as plt

# Define the SVM model
svm = SVC()

# TODO: Define the parameter grid to search

# TODO: Initialize GridSearchCV with cross-validation

### Task 02-2: Description
#### Test SVM with grid search

Let's try all of the parameter combinations we defined above with a grid search to figure out which one is best!

### Task 02-2: Code

In [None]:
# TODO: Fit the grid search to the training data

### Task 02-3: Description
#### Evaluate results of grid search

Let's print out the metrics for our different combinations and see which set performs best!

### Task 02-3: Code

In [None]:
# Extract results into a DataFrame
results = pd.DataFrame(grid_search.cv_results_)

# Select relevant columns and rename 'mean_test_score' to 'accuracy'
results = results[['mean_test_score', 'param_kernel', 'param_C', 'param_degree', 'param_gamma']]
results.rename(columns={'mean_test_score': 'accuracy'}, inplace=True)

# Fill NaN values for non-applicable parameters with 'None' for clarity
results.fillna('None', inplace=True)

# Reorder columns to make 'accuracy' the first column
results = results[['accuracy', 'param_kernel', 'param_C', 'param_degree', 'param_gamma']]

# Sort by accuracy
results = results.sort_values(by='accuracy', ascending=False)

# Print the table with all columns
pd.set_option('display.max_columns', None)  # Ensure all columns are displayed
print(results)

# Print the best parameters and best score
print("\nBest parameters found: ", grid_search.best_params_)
print("Best cross-validation accuracy: {:.2f}".format(grid_search.best_score_))

# Use the best estimator to make predictions
best_svm = grid_search.best_estimator_
y_pred_best_svm = best_svm.predict(X_test)

# TODO: Display classification report for the best SVM model

### Task 02-4: Description
#### Visualize Classification Results with Confusion Matrix

Let's use a confusion matrix to see what kinds of errors our classifier is making.
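
To make the layout concrete, here is a tiny sketch with made-up labels (`y_true_demo` and `y_pred_demo` are hypothetical). In scikit-learn's `confusion_matrix`, rows are the true classes and columns are the predicted classes.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for 8 samples across 3 classes
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_pred_demo = np.array([0, 0, 0, 1, 0, 1, 0, 2])

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
print(cm_demo)
# [[3 1 0]
#  [1 1 0]
#  [1 0 1]]

# Correct predictions sit on the diagonal
print(np.trace(cm_demo) / cm_demo.sum())  # 0.625
```

Here the first column collects most of the mistakes: samples from classes 1 and 2 are being pulled toward the majority class 0, which is the kind of pattern an imbalanced dataset tends to produce.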

### Task 02-4: Code

In [None]:
# TODO: Confusion matrix for the best SVM model

sns.heatmap(cm_best_svm, annot=True, fmt='d')
plt.title('Best SVM Confusion Matrix')
plt.show()

### Task 02-5: Description
#### Balance Dataset

The confusion matrix makes it clear that our dataset is really imbalanced, and this probably affects the accuracy of our classifier. Luckily, we can use a special technique to help balance our dataset.
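
The cell below imports `SMOTE` from `imblearn`. To give a feel for the core idea behind it, here is a simplified NumPy sketch (the function `smote_like_oversample` and the toy points are hypothetical, and this is not imblearn's implementation): each synthetic point is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbors.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    # Pick a base sample and one of its k nearest neighbors for each new point
    base_idx = rng.integers(0, len(X_min), size=n_new)
    nn_idx = neighbors[base_idx, rng.integers(0, k, size=n_new)]
    gaps = rng.random((n_new, 1))  # interpolation weight in [0, 1)
    return X_min[base_idx] + gaps * (X_min[nn_idx] - X_min[base_idx])

# Hypothetical minority-class samples in 2D
X_minority = np.array([[1.0, 1.0], [1.5, 1.2], [2.0, 0.8], [1.2, 1.9], [1.8, 1.6]])
X_synth = smote_like_oversample(X_minority, n_new=10)
print(X_synth.shape)
```

Since every synthetic point is an interpolation between two real minority samples, the new points stay inside the region the minority class already occupies rather than being exact copies.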

### Task 02-5: Code

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from imblearn.over_sampling import SMOTE

# EDA: Check class distribution
sns.countplot(x='target', data=df)
plt.title('Class Distribution')
plt.show()

X = df.drop(columns='target')
y = df['target']

# TODO: Standardize features

# TODO: Apply SMOTE

# Check new class distribution
sns.countplot(x=y_resampled)
plt.title('Class Distribution After SMOTE')
plt.show()

### Task 02-6: Description
#### Compare Classifiers on Balanced and Unbalanced Dataset

Now that we have a balanced dataset, let's see what difference it made for a collection of different supervised classifiers! In addition to the RBF SVM, let's try Logistic Regression, KNN, Decision Trees, and Naive Bayes.

### Task 02-6: Code

In [None]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

# Define a function to evaluate different models with additional strategies
def evaluate_models(X, y):
    # TODO: List of models to evaluate
    
    # TODO: Feature selection

    # TODO: Dimensionality reduction

    # TODO: Evaluate each model
    for name, model in models.items():
        
        # TODO: Perform cross-validation
        
        print(f"{name} Accuracy: {scores.mean():.2f} (+/- {scores.std() * 2:.2f})")

# Evaluate models on the scaled dataset
print('Imbalanced Results:')
evaluate_models(X_scaled, df['target'])

print('-'*80)

# Evaluate models on the resampled dataset
print('Balanced Results:')
evaluate_models(X_resampled, y_resampled)



## TODO: Ensemble Classifiers

We could use an ensemble of classifiers!

Let's try an ensemble of Logistic Regression, Random Forest, and Gradient Boosting.

### *Story Progression*
While balancing the classes didn't help every method equally, it greatly improved the SVM's and KNN's performance.

Unfortunately though, none of these classifiers were able to get to the 80% accuracy that the Dogtors wanted and so maybe we can't save any pets after all :(

## Task 03: Ensemble Learning
### Task 03-1: Description
#### Simple Voting Ensemble Classifier 

Let's try combining several ML classifiers to see if we can improve our results!
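
A minimal sketch of a voting ensemble over Logistic Regression, Random Forest, and Gradient Boosting, evaluated on synthetic data (the data, the `_demo` names, and the estimator settings are assumptions for illustration only).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical synthetic data standing in for the practicum features
X_demo, y_demo = make_classification(
    n_samples=200, n_features=13, n_informative=5, random_state=0
)

voting_demo = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",  # average predicted probabilities; "hard" uses a majority vote
)

scores_demo = cross_val_score(voting_demo, X_demo, y_demo, cv=5, scoring="accuracy")
print(f"{scores_demo.mean():.2f} (+/- {scores_demo.std() * 2:.2f})")
```

Soft voting requires every base model to support `predict_proba`, which all three of these do; hard voting only needs class predictions.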

### Task 03-1: Code

In [None]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Define a function to evaluate different models with additional strategies
def evaluate_models(X, y):    
    # TODO: Feature selection

    # TODO: Dimensionality reduction

    # TODO: Ensemble method: Voting Classifier
    
    # TODO: Create a pipeline for the ensemble

    # TODO: Perform cross-validation for the ensemble

    print(f"Ensemble (Voting) Accuracy: {ensemble_scores.mean():.2f} (+/- {ensemble_scores.std() * 2:.2f})")

# Evaluate models on the scaled dataset
print('Imbalanced Results:')
evaluate_models(X_scaled, df['target'])

print('-'*80)

# Evaluate models on the resampled dataset
print('Balanced Results:')
evaluate_models(X_resampled, y_resampled)



### Task 03-2: Description
#### Stacking Ensemble Classifier 

We tried voting, but let's try a stacking classifier too. Instead of simply voting, stacking trains a final model on the predictions of the base models.
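
Here is a small sketch of a stacking ensemble on synthetic data (the choice of base models, the Logistic Regression meta-learner, and the `_demo` names are assumptions, not the practicum's required setup).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical synthetic data standing in for the practicum features
X_demo, y_demo = make_classification(
    n_samples=200, n_features=13, n_informative=5, random_state=0
)

stack_demo = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    # The final estimator learns from the base models' out-of-fold predictions
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)

scores_demo = cross_val_score(stack_demo, X_demo, y_demo, cv=3, scoring="accuracy")
print(f"{scores_demo.mean():.2f} (+/- {scores_demo.std() * 2:.2f})")
```

The inner `cv=5` controls how the base-model predictions are generated for the meta-learner, while the outer `cross_val_score` measures the whole stack.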

### Task 03-2: Code

In [None]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Define a function to evaluate different models with additional strategies
def evaluate_models(X, y):    
    # TODO: Feature selection

    # TODO: Dimensionality reduction

    # TODO: Ensemble method: Stacking Classifier
    
    # TODO: Create a pipeline for the ensemble

    # TODO: Perform cross-validation for the ensemble

    print(f"Ensemble (Stacking) Accuracy: {ensemble_scores.mean():.2f} (+/- {ensemble_scores.std() * 2:.2f})")

# Evaluate models on the scaled dataset
print('Imbalanced Results:')
evaluate_models(X_scaled, df['target'])

print('-'*80)

# Evaluate models on the resampled dataset
print('Balanced Results:')
evaluate_models(X_resampled, y_resampled)

### *Story Progression*
#### OH MY GOODNESS WE DID IT
Would you look at that, we were able to get above 80% accuracy by simply combining different basic ML models together!

But wait, the Dogtors said that given the current state of the market they'd really love to use AI (obviously this ML stuff isn't *real* AI) to solve this problem so they can bring in more funding. It turns out there's another technique we could use!

## Task 04: Simple Neural Network
### Task 04: Description
#### Simple FFN

Let's see if we can create a basic feed-forward neural network to help the Dogtors get extra funding while still saving the pets!
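
As one possible starting point, here is a sketch of a one-hidden-layer network in PyTorch (the class name `TinyFFN`, the layer sizes, and the random batch are all hypothetical). The final layer returns raw logits because `nn.CrossEntropyLoss` applies the softmax internally.

```python
import torch
import torch.nn as nn

class TinyFFN(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, num_classes),  # raw logits, no softmax here
        )

    def forward(self, x):
        return self.net(x)

torch.manual_seed(0)
demo_model = TinyFFN(input_size=13, hidden_size=16, num_classes=5)

# Hypothetical batch: 4 samples with 13 features each
demo_batch = torch.randn(4, 13)
demo_logits = demo_model(demo_batch)
print(demo_logits.shape)

# CrossEntropyLoss expects integer class labels of dtype long
demo_labels = torch.tensor([0, 1, 2, 4])
demo_loss = nn.CrossEntropyLoss()(demo_logits, demo_labels)
print(demo_loss.item() > 0)
```

The output has one row per sample and one column per class, so `torch.max(outputs, 1)` picks the predicted class for each row.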

### Task 04: Code

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def make_loader(X, y, batch_size=32, shuffle=False):
    X_t = torch.tensor(X, dtype=torch.float32)
    y_t = torch.tensor(y.values, dtype=torch.long)
    return DataLoader(TensorDataset(X_t, y_t), batch_size=batch_size, shuffle=shuffle)

X_train_u, X_test_u, y_train_u, y_test_u = train_test_split(
    X_scaled, df['target'], test_size=0.3, random_state=42
)

train_loader_u = make_loader(X_train_u, y_train_u, shuffle=True)
test_loader_u  = make_loader(X_test_u, y_test_u)

X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X_resampled, y_resampled, test_size=0.3, random_state=42
)

train_loader_b = make_loader(X_train_b, y_train_b, shuffle=True)
test_loader_b  = make_loader(X_test_b, y_test_b)

# TODO: Define a simple neural network
class SimpleNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(SimpleNN, self).__init__()

    def forward(self, x):
        pass

# Initialize the model, loss function, and optimizer
input_size = X_train_u.shape[1]
hidden_size = 128
num_classes = len(df['target'].unique())

# TODO: Initialize model, criterion, optimizer, and epochs

def train_and_evaluate(model, train_loader, test_loader, criterion, optimizer, epochs=100):
    results = []
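    # NOTE: the same model and optimizer are reused for every run, so training carries over between runs
    for run in range(10):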
        # Training loop
        num_epochs = epochs
        for epoch in range(num_epochs):
            model.train()
            for X_batch, y_batch in train_loader:
                # Forward pass
                outputs = model(X_batch)
                loss = criterion(outputs, y_batch)
                
                # Backward pass and optimization
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

        # Evaluate the model
        model.eval()
        with torch.no_grad():
            correct = 0
            total = 0
            for X_batch, y_batch in test_loader:
                outputs = model(X_batch)
                _, predicted = torch.max(outputs.data, 1)
                total += y_batch.size(0)
                correct += (predicted == y_batch).sum().item()

            print(f'Accuracy of the model on the test set: {100 * correct / total:.2f}%')
            results.append(100 * correct / total)

    print(f'Average accuracy over {len(results)} runs: {np.mean(results):.2f}%')

print("Unbalanced Dataset Results:")
train_and_evaluate(model, train_loader_u, test_loader_u, criterion, optimizer)

print('-'*80)

print("Balanced Dataset Results:")
train_and_evaluate(model, train_loader_b, test_loader_b, criterion, optimizer)

## Wrap-Up
In this practicum, we applied both unsupervised (PCA, t-SNE, and KMeans clustering) and supervised (SVM and other classifiers, ensembles, and a simple neural network) methods to the Heart Disease dataset. By first exploring the data using unsupervised techniques, we were able to gain insights that informed our supervised task.