### Introduction

In this homework, you will build and evaluate classification models using the **Heart Failure Prediction Dataset**. Your goal is to train and optimize different classifiers—Decision Tree, K-Nearest Neighbors (KNN), and Naive Bayes—by exploring various hyperparameter settings for each algorithm. You will also apply a train-test split to evaluate the models on unseen data and use k-fold cross-validation to select the best hyperparameter configurations.

By the end of this assignment, you will have gained hands-on experience in training classifiers, tuning hyperparameters, and performing model selection, which are critical steps in many data mining and machine learning workflows. All tasks should be completed using Python in a Jupyter Notebook. For questions requiring textual explanations, please use Markdown to format your responses within text blocks.

### Heart Failure Prediction Dataset

#### Data Set Columns:

1. **Age:** The age of the patient in years.
2. **Sex:** The gender of the patient (`M` for male, `F` for female).
3. **ChestPainType:** The type of chest pain experienced by the patient (e.g., `TA` for typical angina, `ATA` for atypical angina, `NAP` for non-anginal pain, `ASY` for asymptomatic).
4. **RestingBP:** The patient’s resting blood pressure (in mm Hg on admission to the hospital).
5. **Cholesterol:** The serum cholesterol level (in mg/dL).
6. **FastingBS:** The patient's fasting blood sugar (`1` if fasting blood sugar > 120 mg/dL, otherwise `0`).
7. **RestingECG:** Results of the patient's resting electrocardiogram (e.g., `Normal`, `ST`, `LVH`).
8. **MaxHR:** The patient’s maximum heart rate achieved during an exercise test.
9. **ExerciseAngina:** Whether the patient experienced exercise-induced angina (`Y` for yes, `N` for no).
10. **Oldpeak:** ST depression induced by exercise relative to rest.
11. **ST_Slope:** The slope of the peak exercise ST segment (`Up`, `Flat`, `Down`).
12. **HeartDisease:** The target variable indicating whether the patient has heart disease (`1` for heart disease, `0` for no heart disease).

### Question 1: Load and Preprocess the Dataset

1. **Load the Dataset:**
   - Use the `pandas.read_csv()` function to load the Heart Failure Prediction Dataset from the provided CSV file.
   
   Refer to this link for more details on how to use `read_csv`: [pandas.read_csv() Documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).

2. **Encode Categorical Variables:**
   - Convert categorical columns (`Sex`, `ChestPainType`, `RestingECG`, `ExerciseAngina`, and `ST_Slope`) to numerical values.
     - You may choose to use **one-hot encoding** or **ordinal encoding** based on your preference.
     - **Hint:** One-hot encoding creates one binary column per category and imposes no order, while ordinal encoding maps categories to integers and therefore implies an order, which may be appropriate for certain columns (like `ST_Slope`).
   

3. **Prepare K-Fold Cross-Validation:**
   - Create a **KFold** object using the `KFold` class from `scikit-learn` with `n_splits=10` for 10-fold cross-validation. This object will later be used in the cross-validation process for tuning hyperparameters.
   
   Refer to this link for more details on `KFold`: [scikit-learn KFold Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html).


4. **Train-Test Split:**
   - Split the dataset into training and testing sets using `train_test_split` for final model evaluation after tuning the hyperparameters.

   Refer to this link for more details on `train_test_split`: [scikit-learn train_test_split Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).
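
The two encoding options from step 2 can be compared on a small hand-made frame. This is only a minimal sketch: the `toy` frame and its values are made up for illustration, and the explicit category order chosen for `ST_Slope` (`Down`, `Flat`, `Up`) is an assumption, not something the dataset prescribes.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data that mimics two of the categorical columns
toy = pd.DataFrame({
    "ChestPainType": ["ATA", "NAP", "ASY", "TA"],
    "ST_Slope": ["Up", "Flat", "Down", "Flat"],
})

# One-hot encoding: one 0/1 column per category, no order implied
one_hot = pd.get_dummies(toy[["ChestPainType"]], dtype=int)

# Ordinal encoding: integer codes that follow an explicitly stated order
# (the order Down < Flat < Up is an assumption made for this sketch)
slope_encoder = OrdinalEncoder(categories=[["Down", "Flat", "Up"]])
slope_codes = slope_encoder.fit_transform(toy[["ST_Slope"]]).ravel()

print(one_hot.columns.tolist())
print(slope_codes)
```

One-hot encoding widens the table by one column per category, whereas ordinal encoding keeps a single column but asks the model to treat the codes as ordered.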



In [127]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import CategoricalNB
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import KBinsDiscretizer
import numpy as np

In [128]:
# Load dataset
df = pd.read_csv('/heart-2.csv')

In [129]:
# Display the first few rows of the dataset
df.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


In [130]:
# Check for missing values
df.isnull().sum()

Unnamed: 0,0
Age,0
Sex,0
ChestPainType,0
RestingBP,0
Cholesterol,0
FastingBS,0
RestingECG,0
MaxHR,0
ExerciseAngina,0
Oldpeak,0


In [131]:
# View the dataset again (no transformation has been applied yet)
df.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


In [132]:
# Identify categorical columns
categorical_columns = ['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope']

In [133]:
# LabelEncoder for each categorical column to convert them into integer codes
label_encoders = {}
for col in categorical_columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le

In [134]:
# Handle negative values: clamp every numeric column to a minimum of 0
numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns
df[numeric_cols] = df[numeric_cols].clip(lower=0)

In [135]:
# Prepare Features and Target
X = df.drop('HeartDisease', axis=1)
y = df['HeartDisease']

In [136]:
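# Cast any remaining non-integer column (such as the float-valued Oldpeak) to int;
# note that this truncates the fractional part (e.g. 1.5 becomes 1)
for col in X.columns: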
    if X[col].dtype not in ['int64', 'category']:
        X[col] = X[col].astype(int)

In [137]:
# Apply binning: discretize every feature into 3 equal-width ordinal bins (codes 0, 1, 2)
binner = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
X_binned = binner.fit_transform(X)
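
To make the effect of `strategy='uniform'` concrete, here is a minimal sketch on made-up numbers. The `demo` array is hypothetical: its first column stands in for a continuous feature such as age, and its second for a categorical column that has already been encoded to four integer codes.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical data: column 0 is continuous, column 1 holds four integer codes
demo = np.array([
    [30, 0],
    [40, 1],
    [50, 2],
    [60, 3],
    [70, 0],
    [90, 1],
])

binner_demo = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
demo_binned = binner_demo.fit_transform(demo)

# Equal-width edges run from each column's minimum to its maximum
print(binner_demo.bin_edges_[0])
print(binner_demo.bin_edges_[1])
print(demo_binned)
```

In this sketch the four codes in the second column end up in only three bins, so codes 2 and 3 become indistinguishable. The same kind of merging can be expected for any encoded column with more than three levels when it is passed through a 3-bin discretizer.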

In [138]:
# Set up K-Fold Cross-Validation
kfold = KFold(n_splits=10, shuffle=True, random_state=42)
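
What a `KFold` object actually produces can be seen by iterating over its splits. The sketch below uses a hypothetical 10-row array and 5 folds (instead of 10) purely to keep the printout short.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(20).reshape(10, 2)  # 10 hypothetical samples, 2 features
kfold_demo = KFold(n_splits=5, shuffle=True, random_state=42)

fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(kfold_demo.split(X_demo), start=1):
    # Each sample lands in the validation part of exactly one fold
    fold_sizes.append((len(train_idx), len(val_idx)))
    print(f"Fold {fold}: train={train_idx.tolist()} validate={val_idx.tolist()}")
```

Passing the same object as `cv=` to `GridSearchCV` means every hyperparameter candidate is scored on identical folds, which keeps the comparison fair.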

In [139]:
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X_binned, y, test_size=0.2, random_state=42)
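
In the cells above, the discretizer is fitted on the full dataset before the split, so its bin edges are computed with the test rows included. One possible alternative, sketched here on synthetic data rather than the heart dataset, is to place the discretizer inside a `Pipeline` so that it is re-fitted on the training portion of each fold. The names `X_syn`, `y_syn`, `pipe`, and `search` are hypothetical, and this is not what the notebook itself does.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the heart dataset)
X_syn, y_syn = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

# The binner is a pipeline step, so it only ever sees training rows when fitted
pipe = Pipeline([
    ("bin", KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

search = GridSearchCV(
    pipe,
    {"tree__max_depth": [2, 3, 4, 5]},  # step name + double underscore + parameter
    cv=KFold(n_splits=10, shuffle=True, random_state=42),
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(round(search.score(X_te, y_te), 4))
```

Whether this changes the scores noticeably depends on the data; with equal-width bins, the edges only shift if the test rows contain a column minimum or maximum.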

### Question 2: Decision Tree Classifier with Hyperparameter Tuning

1. **Train a Decision Tree Classifier:**
   - Use the `DecisionTreeClassifier` from `scikit-learn` to train a model on the preprocessed dataset.
   - Try the following hyperparameters:
     - `max_depth`: [2, 3, 4, 5]

2. **Perform 10-Fold Cross-Validation:**
   - Use 10-fold cross-validation with the `KFold` object you prepared earlier to evaluate each hyperparameter combination.
   - For each `max_depth`, report the cross-validated accuracy.
   - Report your best model.
   
   Refer to these links for details:
   - Grid Search CV: [sklearn.model_selection.GridSearchCV Documentation](https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.GridSearchCV.html)

3. **Select the Best Model:**
   - Based on the cross-validation results, select the hyperparameter combination that yields the best performance.
   - Re-train the Decision Tree classifier on the entire training set using the best hyperparameters.

4. **Predict on the Test Set:**
   - Use your trained best Decision Tree model to make predictions on the test set.
   - Report the **accuracy** on the test set using `accuracy_score` from `scikit-learn`.


In [140]:
# Define the model and hyperparameters
model = DecisionTreeClassifier(random_state=42)
param_grid = {'max_depth': [2, 3, 4, 5]}

In [141]:
# Set up GridSearchCV
grid_search = GridSearchCV(estimator=model, param_grid=param_grid,
                           cv=kfold, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

In [142]:
# Report cross-validated accuracy for each max_depth
print("Cross-Validated Accuracy for Each Hyperparameter:")
for mean_score, params in zip(grid_search.cv_results_['mean_test_score'], grid_search.cv_results_['params']):
    print(f"Max Depth: {params['max_depth']}, Cross-Validated Accuracy: {mean_score:.4f}")

Cross-Validated Accuracy for Each Hyperparameter:
Max Depth: 2, Cross-Validated Accuracy: 0.8186
Max Depth: 3, Cross-Validated Accuracy: 0.8541
Max Depth: 4, Cross-Validated Accuracy: 0.8486
Max Depth: 5, Cross-Validated Accuracy: 0.8433


In [143]:
# Best parameters and accuracy
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print(f'Best Hyperparameters: {best_params}')
print(f'Best Cross-Validated Accuracy: {best_score}')


Best Hyperparameters: {'max_depth': 3}
Best Cross-Validated Accuracy: 0.8541095890410959


In [144]:
# Retrain the model on the entire training set with the best parameters
best_model = DecisionTreeClassifier(max_depth=best_params['max_depth'], random_state=42)
best_model.fit(X_train, y_train)
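
With the default `refit=True`, `GridSearchCV` already re-trains the best configuration on all the data passed to `fit`, so the manual re-training above should be equivalent to using `grid_search.best_estimator_`. The following sketch checks this on synthetic data; `X_syn`, `y_syn`, `search`, and `manual` are hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the heart dataset)
X_syn, y_syn = make_classification(n_samples=200, n_features=6, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    {"max_depth": [2, 3, 4, 5]},
    cv=KFold(n_splits=10, shuffle=True, random_state=42),
    scoring="accuracy",
)  # refit=True is the default
search.fit(X_syn, y_syn)

# Manual re-training with the winning depth and the same random_state
manual = DecisionTreeClassifier(
    max_depth=search.best_params_["max_depth"], random_state=42
).fit(X_syn, y_syn)

same_predictions = np.array_equal(
    search.best_estimator_.predict(X_syn), manual.predict(X_syn)
)
print(same_predictions)
```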


In [145]:
# Make predictions and evaluate
y_pred = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)

print(f'Test Set Accuracy: {test_accuracy}')

Test Set Accuracy: 0.8369565217391305


### Question 3: K-Nearest Neighbors (KNN) Classifier with Hyperparameter Tuning

1. **Train a K-Nearest Neighbors (KNN) Classifier:**
   - Use the `KNeighborsClassifier` from `scikit-learn` to train a model on the preprocessed dataset.
   - Try the following hyperparameters:
     - `n_neighbors`: [2, 3, 5, 10]

2. **Perform 10-Fold Cross-Validation:**
   - Use 10-fold cross-validation with the `KFold` object you prepared earlier to evaluate each hyperparameter combination.
   - For each `n_neighbors`, report the cross-validated accuracy.
   - Report your best model.
   
   Refer to these links for details:
   - Grid Search CV: [sklearn.model_selection.GridSearchCV Documentation](https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.GridSearchCV.html)

3. **Select the Best Model:**
   - Based on the cross-validation results, select the hyperparameter combination that yields the best performance.
   - Re-train the KNN classifier on the entire training set using the best hyperparameters.

4. **Predict on the Test Set:**
   - Use your trained best KNN model to make predictions on the test set.
   - Report the **accuracy** on the test set using `accuracy_score` from `scikit-learn`.


In [146]:
# Define the Model and Hyperparameters
model = KNeighborsClassifier()
param_grid = {'n_neighbors': [2, 3, 5, 10]}

In [147]:
# Set Up and Fit Grid Search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid,
                           cv=kfold, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train, y_train)

In [148]:
# Report cross-validated accuracy for each n_neighbors
print("Cross-Validated Accuracy for Each Hyperparameter:")
for mean_score, params in zip(grid_search.cv_results_['mean_test_score'], grid_search.cv_results_['params']):
    print(f"N Neighbors: {params['n_neighbors']}, Cross-Validated Accuracy: {mean_score:.4f}")

Cross-Validated Accuracy for Each Hyperparameter:
N Neighbors: 2, Cross-Validated Accuracy: 0.7928
N Neighbors: 3, Cross-Validated Accuracy: 0.8310
N Neighbors: 5, Cross-Validated Accuracy: 0.8406
N Neighbors: 10, Cross-Validated Accuracy: 0.8338
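
One factor that may contribute to the lower score at `n_neighbors=2` is vote ties: with an even number of neighbors and two classes, the vote can split evenly, and the classifier then has to fall back on a fixed tie-breaking rule. This is a conjecture, not something verified on this dataset. The sketch below illustrates the tie on two made-up training points.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical one-dimensional training set with one point per class
X_tie = np.array([[0.0], [1.0]])
y_tie = np.array([1, 0])
query = np.array([[0.4]])  # closer to the class-1 point

k1 = KNeighborsClassifier(n_neighbors=1).fit(X_tie, y_tie)
k2 = KNeighborsClassifier(n_neighbors=2).fit(X_tie, y_tie)

pred_k1 = int(k1.predict(query)[0])    # decided by the single nearest point
pred_k2 = int(k2.predict(query)[0])    # both points vote, one per class
proba_k2 = k2.predict_proba(query)[0]  # the vote split seen by k=2

print(pred_k1, pred_k2, proba_k2)
```

With `n_neighbors=2` the vote is split evenly, so the prediction no longer reflects which point is closer. Odd values of `n_neighbors` avoid this situation in binary problems.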


In [149]:
# Retrieve Best Hyperparameters and Score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print(f'Best Hyperparameters: {best_params}')
print(f'Best Cross-Validated Accuracy: {best_score:.4f}')

Best Hyperparameters: {'n_neighbors': 5}
Best Cross-Validated Accuracy: 0.8406


In [150]:
# Retrain the Model on the Entire Training Set with the Best Parameters
best_model = KNeighborsClassifier(n_neighbors=best_params['n_neighbors'])
best_model.fit(X_train, y_train)

In [151]:
# Make Predictions and Evaluate on the Test Set
y_pred = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)

print(f'Test Set Accuracy: {test_accuracy:.4f}')

Test Set Accuracy: 0.8478


### Authentication: Write Down Your Information

In the following code block, print your Student ID, Name, and Homework number in the specified format:

```python
# Replace the placeholders with your actual information
info = [('yourid', 'yourname', 'homework_number')]
for id, name, homework in info:
    print(f'ID: {id}\nName: {name}\nHomework: {homework}')
```


In [152]:
info = [('1002162937', 'Swathi Manjunatha', '004')]
for id, name, homework in info:
    print(f'ID: {id}\nName: {name}\nHomework: {homework}')

ID: 1002162937
Name: Swathi Manjunatha
Homework: 004


### Question 4: Naive Bayes Classifier with Hyperparameter Tuning

1. **Train a CategoricalNB Classifier:**
   - Use the `CategoricalNB` from `scikit-learn` to train a model on the preprocessed dataset.
   - Try the following hyperparameters:
     - `alpha`: [0.1, 0.5, 1.0]

2. **Perform 10-Fold Cross-Validation:**
   - Use 10-fold cross-validation with the `KFold` object you prepared earlier to evaluate each hyperparameter combination.
   - For each `alpha`, report the cross-validated accuracy.
   - Report your best model.
   
   Refer to these links for details:
   - Grid Search CV: [sklearn.model_selection.GridSearchCV Documentation](https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.GridSearchCV.html)

3. **Select the Best Model:**
   - Based on the cross-validation results, select the hyperparameter combination that yields the best performance.
   - Re-train the CategoricalNB classifier on the entire training set using the best hyperparameters.

4. **Predict on the Test Set:**
   - Use your trained best CategoricalNB model to make predictions on the test set.
   - Report the **accuracy** on the test set using `accuracy_score` from `scikit-learn`.

In [153]:
# Define Model
model = CategoricalNB()

In [154]:
# Define the parameter grid for alpha
param_grid = {'alpha': [0.1, 0.5, 1.0]}

In [155]:
# Set up GridSearchCV for CategoricalNB
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=kfold, scoring='accuracy')

In [156]:
# Fit the model using GridSearchCV
grid_search.fit(X_train, y_train)

In [157]:
# Get cross-validated results for each alpha
results = grid_search.cv_results_

In [158]:
# Report cross-validated accuracy for each alpha
for mean_score, params in zip(results['mean_test_score'], results['params']):
    print(f"Alpha: {params['alpha']}, Cross-Validated Accuracy: {mean_score:.4f}")

Alpha: 0.1, Cross-Validated Accuracy: 0.8475
Alpha: 0.5, Cross-Validated Accuracy: 0.8475
Alpha: 1.0, Cross-Validated Accuracy: 0.8475
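
Identical accuracies for all three `alpha` values are plausible: the smoothing term changes the estimated probabilities, but accuracy only changes if the most probable class flips for some sample. The sketch below shows this on a made-up single-feature dataset; `X_nb`, `y_nb`, and `query_nb` are hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Hypothetical data: one feature with categories 0, 1, 2 and two balanced classes
X_nb = np.array([[0], [0], [0], [1], [1], [2], [2], [2]])
y_nb = np.array([0, 0, 0, 0, 1, 1, 1, 1])
query_nb = np.array([[2]])  # category 2 never occurs together with class 0

probs, preds = {}, {}
for alpha in (0.1, 0.5, 1.0):
    nb = CategoricalNB(alpha=alpha).fit(X_nb, y_nb)
    probs[alpha] = float(nb.predict_proba(query_nb)[0, 1])  # P(class 1 | x = 2)
    preds[alpha] = int(nb.predict(query_nb)[0])

print(probs)
print(preds)
```

Here the posterior for class 1 works out to (3 + alpha) / (3 + 2 * alpha), so it shrinks as `alpha` grows, yet the predicted label stays the same for every value tried.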


In [159]:
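# Select the best model (with the default refit=True, best_estimator_ is
# already re-trained on the entire training set using the best alpha)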
best_alpha = grid_search.best_params_['alpha']
best_model = grid_search.best_estimator_

In [160]:
# Report the best alpha and cross-validated accuracy
print(f"Best Alpha: {best_alpha}")
print(f"Best Cross-Validated Accuracy: {grid_search.best_score_:.4f}")

Best Alpha: 0.1
Best Cross-Validated Accuracy: 0.8475


In [161]:
# Predict on the Test Set using the best model
y_pred = best_model.predict(X_test)

In [162]:
# Report the accuracy on the test set
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Test Set Accuracy: {test_accuracy:.4f}")

Test Set Accuracy: 0.8533
