<font size="+3"><b>Pipelines and Hyperparameter Tuning</b></font>

<font color='Blue'>
In this assignment, you will put together everything you have learned so far. You will need to find your own dataset, do all the appropriate preprocessing, test different supervised learning models, and evaluate the results. You will also be asked to describe the process by which you came up with the code. More details for each step can be found below. Please cite any websites or AI tools that you used to help you with this assignment.
</font>

## **0. Dataset**

This data is a subset of the **Heart Disease Dataset**, which contains information about patients with possible coronary artery disease. The data has **14 attributes** and **294 instances**. The attributes include demographic, clinical, and laboratory features, such as age, sex, chest pain type, blood pressure, cholesterol, and electrocardiogram results. The last attribute is the **diagnosis of heart disease**, which is a categorical variable with values from 0 (no presence) to 4 (high presence). The data can be used for **classification** tasks, such as predicting the presence or absence of heart disease based on the other attributes.

In [None]:
import pandas as pd

# Define the data source link
_link = 'https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.hungarian.data'

# Read the CSV file into a Pandas DataFrame, considering '?' as missing values
df = pd.read_csv(_link, na_values='?',
                 names=['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs',
                        'restecg', 'thalach', 'exang', 'oldpeak', 'slope',
                        'ca', 'thal', 'num'])

# Display the DataFrame
display(df)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
0,28,1,2,130.0,132.0,0.0,2.0,185.0,0.0,0.0,,,,0
1,29,1,2,120.0,243.0,0.0,0.0,160.0,0.0,0.0,,,,0
2,29,1,2,140.0,,0.0,0.0,170.0,0.0,0.0,,,,0
3,30,0,1,170.0,237.0,0.0,1.0,170.0,0.0,0.0,,,6.0,0
4,31,0,2,100.0,219.0,0.0,1.0,150.0,0.0,0.0,,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
289,52,1,4,160.0,331.0,0.0,0.0,94.0,1.0,2.5,,,,1
290,54,0,3,130.0,294.0,0.0,1.0,100.0,1.0,0.0,2.0,,,1
291,56,1,4,155.0,342.0,1.0,0.0,150.0,1.0,3.0,2.0,,,1
292,58,0,2,180.0,393.0,0.0,0.0,110.0,1.0,1.0,2.0,,7.0,1


# **1. Preprocessing Tasks**

- **1.1** Find out which columns have more than 60% of their values missing and drop them from the data frame. Explain why this is a reasonable way to handle these columns. **(2 Points)**

- **1.2** For the remaining columns that have some missing values, choose an appropriate imputation method to fill them in. You can use the `SimpleImputer` class from `sklearn.impute` or any other method you prefer. Explain why you chose this method and how it affects the data. **(2 Points)**

- **1.3** Assign the `num` column to the variable `y` and the rest of the columns to the variable `X`. The `num` column indicates the presence or absence of heart disease based on the angiographic disease status of the patients. Create a `ColumnTransformer` object that applies different preprocessing steps to different subsets of features. Use `StandardScaler` for the numerical features, `OneHotEncoder` for the categorical features, and `passthrough` for the binary features. List the names of the features that belong to each group and explain why they need different transformations. You will use this `ColumnTransformer` in a pipeline in the next question. **(4 Points)**

<font color='Green'><b>Answer:</b></font>

- **1.1** Dropping these columns is reasonable because when more than 60% of an attribute's values are missing, most of what the model would see for that attribute is filled in rather than observed. Conclusions about the attribute's impact would then reflect the imputation method more than the data, and the few observed values are too sparse for a model to learn a reliable pattern from. Imputing more than 60% of a column is therefore more likely to introduce bias than to improve the model.

In [None]:
# 1.1

# Check which columns have more than 60% of their values missing
# (isnull().mean() is the fraction of missing values per column, so the row count is not hard-coded)
missing_values = df.isnull().mean()
columns_to_drop = missing_values[missing_values > 0.6].index.tolist()
print(columns_to_drop)

# Drop columns from the data frame
df_new = df.drop(columns = columns_to_drop)
df_new.head()

['slope', 'ca', 'thal']


Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,num
0,28,1,2,130.0,132.0,0.0,2.0,185.0,0.0,0.0,0
1,29,1,2,120.0,243.0,0.0,0.0,160.0,0.0,0.0,0
2,29,1,2,140.0,,0.0,0.0,170.0,0.0,0.0,0
3,30,0,1,170.0,237.0,0.0,1.0,170.0,0.0,0.0,0
4,31,0,2,100.0,219.0,0.0,1.0,150.0,0.0,0.0,0


<font color='Green'><b>Answer:</b></font>

**1.2**
- For the columns with numerical data, I imputed the missing values with the median because, compared to the mean, it is more robust when the continuous data is skewed or contains outliers. This lets every row be kept without shifting the center of each column, although it reduces the column's variance because all filled entries share one value.
- For the columns with categorical data, I used the `most_frequent` strategy. It fills gaps with the most common value in each column, so only categories that already exist are used and no new ones are created, which keeps distortion of the original distribution small. Mean and median are only meaningful for numerical data, so the mode was the most appropriate choice.
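
The cell for 1.2 imputes the whole data frame once, before any cross-validation. A possible alternative, shown here only as a minimal sketch, is to place the imputers inside the pipeline so that the median and mode are learned from the training folds only. The toy values and the names `toy`, `toy_target`, `imputing_step`, and `toy_pipeline` are made up for illustration and are not part of the assignment data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Toy data (hypothetical values) with gaps in one numerical and one categorical column
toy = pd.DataFrame({
    'chol': [200.0, np.nan, 240.0, 310.0, 180.0, np.nan, 260.0, 225.0, 290.0, 205.0],
    'restecg': [0.0, 1.0, np.nan, 0.0, 0.0, 1.0, 0.0, np.nan, 1.0, 0.0],
})
toy_target = pd.Series([0, 0, 1, 1, 0, 0, 1, 0, 1, 0])

# Median for the numerical column, most frequent value for the categorical column
imputing_step = ColumnTransformer(transformers=[
    ('num_impute', SimpleImputer(strategy='median'), ['chol']),
    ('cat_impute', SimpleImputer(strategy='most_frequent'), ['restecg']),
])

# Inside a pipeline, the imputers are re-fit on the training part of every fold
toy_pipeline = Pipeline(steps=[
    ('impute', imputing_step),
    ('classifier', LogisticRegression(max_iter=1000)),
])
toy_scores = cross_val_score(toy_pipeline, toy, toy_target, cv=2)

# Fitting the imputers on the full toy frame shows the filled-in values
imputed = imputing_step.fit_transform(toy)
print(imputed[1, 0], imputed[2, 1])
```

With only ten toy rows the scores themselves are not meaningful; the point is where the imputation statistics are computed.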

In [None]:
# 1.2

# Referenced Lab 5 to find details about SimpleImputer
# Also used documentation: https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
from sklearn.impute import SimpleImputer

# Imputation for numerical columns
num_cols = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
imputer_num = SimpleImputer(strategy = 'median')
df_new[num_cols] = imputer_num.fit_transform(df_new[num_cols])

# Imputation for categorical columns
cat_cols = ['sex', 'cp', 'fbs', 'restecg', 'exang', 'num']
imputer_cat = SimpleImputer(strategy = 'most_frequent')
df_new[cat_cols] = imputer_cat.fit_transform(df_new[cat_cols])

# Making sure all missing values were imputed
df_new.isnull().any()

age         False
sex         False
cp          False
trestbps    False
chol        False
fbs         False
restecg     False
thalach     False
exang       False
oldpeak     False
num         False
dtype: bool

<font color='Green'><b>Answer:</b></font>

- **1.3** The features that belong to each group are listed in the variables below: `numerical_features`, `categorical_features`, and `binary_features`. They need different transformers because each type has to be prepared for modeling differently. The numerical features are transformed with `StandardScaler` so that features measured in different units and ranges end up on a comparable scale, which matters for scale-sensitive models such as Logistic Regression and SVC. The categorical features are one-hot encoded because their integer codes are unordered labels; giving each category its own 0/1 column prevents the model from treating the codes as magnitudes. The binary features are passed through because they are already in a 0/1 numerical format that the algorithms can use directly.

In [None]:
# 1.3

# Referenced Lab 5 and course notes to find details about ColumnTransformer
# Also used documentation: https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Assign the num column to the variable y and the rest of the columns to the variable X
X = df_new.drop('num', axis = 1)
y = df_new['num']

# Show features in each group
numerical_features = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
categorical_features = ['cp', 'restecg']
binary_features = ['sex', 'fbs', 'exang']

# Create a ColumnTransformer object that applies different preprocessing steps to different subsets of features.
# Use StandardScaler for the numerical features, OneHotEncoder for the categorical features, and passthrough for the binary features.
preprocessor = ColumnTransformer(
    transformers = [
        ('numerical', StandardScaler(), numerical_features),
        ('categorical', OneHotEncoder(handle_unknown = 'ignore'), categorical_features),
        # Binary features are listed explicitly instead of relying on remainder = 'passthrough'
        ('binary', 'passthrough', binary_features)]
)

# **2. Pipeline and Modeling**

- **2.1** Create **three** `Pipeline` objects that take the column transformer from the previous question as the first step and add one or more models as the subsequent steps. You can use any models from `sklearn` or other libraries that are suitable for binary classification. For each pipeline, explain **why** you selected the model(s) and what are their **strengths and weaknesses** for this data set. **(3 Points)**

- **2.2** Use `GridSearchCV` to perform a grid search over the hyperparameters of each pipeline and find the best combination that maximizes the cross-validation score. Report the best parameters and the best score for each pipeline. Then, update the hyperparameters of each pipeline using the best parameters from the grid search. **(6 Points)**

- **2.3** Form a stacking classifier that uses the three pipelines from the previous question as the base estimators and a meta-model as the `final_estimator`. You can choose any model for the meta-model that is suitable for binary classification. Explain **why** you chose the meta-model and how it combines the predictions of the base estimators. Then, use `StratifiedKFold` to perform a cross-validation on the stacking classifier and present the accuracy scores and F1 scores for each fold. Report the mean and the standard deviation of each score in the format of `mean ± std`. For example, `0.85 ± 0.05`. **(5 Points)**

- **2.4** Interpret the final results of the stacking classifier and compare its performance with the individual models. Explain how the stacking classifier has improved or deteriorated the prediction accuracy and F1 score, and what the possible reasons for that are. **(3 Points)**

<font color='Green'><b>Answer:</b></font>

**2.1** <br>
Pipeline #1:
- I selected the Logistic Regression model because it is a straightforward model for binary classification, and it works very well if the relationship between the features and the target variable is approximately linear. It is efficient and interpretable, and it could be effective for this potentially linear data set.
- As a linear model for classification, its strengths include that it is fast to train and predict, it is relatively easy to understand, it scales to very large datasets, and it is good for understanding feature importance.
- Its weaknesses are that linear models tend to be outperformed in lower-dimensional spaces (this data set has only 10 features left after preprocessing and 294 instances) and on data with complex, non-linear decision boundaries; if the features in the data set do not have linear relationships with the target variable, other models could outperform it.

Pipeline #2:
- I selected the SVC model because it is a powerful model that is suitable for binary classification and allows for complex decision boundaries. It is suitable for the size of the current data set.
- Its strengths include that it performs well on a variety of datasets, it allows for complex decision boundaries, and it works well on both low-dimensional and high-dimensional data (which is good for the size of the current data set).
- Its weaknesses include that it doesn't scale very well with the number of samples, it can take a very long time to train, it requires careful preprocessing and tuning of parameters, and it can be hard to inspect.

Pipeline #3:
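- I selected the Random Forest Classifier model because it is a powerful tree-based ensemble method; averaging many randomized trees makes it much less prone to overfitting than a single decision tree, and it gives the comparison a non-linear model alongside the other two.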
- Its strengths include that it works very well without heavy tuning of parameters, it doesn't require scaling of the data, and it's good for handling complex, non-linear relationships.
- Its weaknesses include that it can be time-consuming to train, requires more memory, and doesn't perform as well on high-dimensional, sparse data.

In [None]:
# 2.1

# Referenced Lab 6 and course notes to find details about Pipeline
# Also used documentation: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# First Pipeline object with Logistic Regression model
lr_pipeline = Pipeline(steps = [
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(random_state = 0))
])

# Second Pipeline object with SVC model
svm_pipeline = Pipeline(steps = [
    ('preprocessor', preprocessor),
    ('classifier', SVC(probability = True, random_state = 0))
])

# Third Pipeline object with Random Forest Classifier model
rf_pipeline = Pipeline(steps = [
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state = 0))
])

<font color='Green'><b>Answer:</b></font>

**2.2** <br>
The grid searches select parameters by accuracy (`refit = 'accuracy'`), so each F1 score below is the mean cross-validated F1 at those parameters, not necessarily the highest F1 in the grid.

Logistic Regression
- Best Parameters: {'classifier__C': 0.1, 'classifier__max_iter': 100}
- F1 Score (at the best-accuracy parameters): 0.7202
- Best Accuracy: 0.8197

SVM
- Best Parameters: {'classifier__C': 0.1, 'classifier__kernel': 'linear'}
- F1 Score (at the best-accuracy parameters): 0.7346
- Best Accuracy: 0.8231

Random Forest
- Best Parameters: {'classifier__max_depth': 5, 'classifier__n_estimators': 200}
- F1 Score (at the best-accuracy parameters): 0.7133
- Best Accuracy: 0.7886
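
Because the grid searches in this question use `refit = 'accuracy'`, `best_index_` points at the row of `cv_results_` with the highest mean accuracy. The sketch below shows how the row with the highest mean F1 could be looked up separately. It assumes synthetic data from `make_classification` in place of the heart disease data, and the names `X_demo`, `y_demo`, and `demo_search` are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the heart disease data)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={'C': [0.01, 0.1, 1, 10]},
    cv=5,
    scoring={'accuracy': 'accuracy', 'f1_score': 'f1'},
    refit='accuracy',   # best_index_ and best_params_ follow this metric only
)
demo_search.fit(X_demo, y_demo)

results = demo_search.cv_results_
acc_best = demo_search.best_index_                        # row with the highest mean accuracy
f1_best = int(np.argmax(results['mean_test_f1_score']))   # row with the highest mean F1

print("Selected by accuracy:", results['params'][acc_best],
      "| F1 at that row:", round(results['mean_test_f1_score'][acc_best], 3))
print("Selected by F1:      ", results['params'][f1_best],
      "| F1 at that row:", round(results['mean_test_f1_score'][f1_best], 3))
```

The two rows can coincide; when they differ, the F1 at the accuracy-selected row is lower than the best F1 in the grid.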

In [None]:
# 2.2

# Referenced Lab 6 and course notes to find details about GridSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, accuracy_score, f1_score

# Set up the parameter grids
lr_param_grid = {
    'classifier__C': [0.1, 1, 10],
    'classifier__max_iter': [100, 200, 300]
}

svm_param_grid = {
    'classifier__C': [0.1, 1, 10],
    'classifier__kernel': ['linear', 'rbf']
}

rf_param_grid = {
    'classifier__n_estimators': [100, 200, 300],
    'classifier__max_depth': [5, 10, 15]
}

# Define the scoring metrics to use
scoring = {
    'accuracy': make_scorer(accuracy_score),         # Scoring based on accuracy_score
    'f1_score': make_scorer(f1_score)                # Scoring based on F1_score
}

# Initialize the GridSearchCV instances (with several metrics, refit names the one used to select the best parameters)
lr_grid_search = GridSearchCV(lr_pipeline, lr_param_grid, cv = 5, scoring = scoring, refit = 'accuracy')
svm_grid_search = GridSearchCV(svm_pipeline, svm_param_grid, cv = 5, scoring = scoring, refit = 'accuracy')
rf_grid_search = GridSearchCV(rf_pipeline, rf_param_grid, cv = 5, scoring = scoring, refit = 'accuracy')

# Fit the models
lr_grid_search.fit(X, y)
svm_grid_search.fit(X, y)
rf_grid_search.fit(X, y)

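# Get the best parameters for each model, plus the mean CV accuracy and F1 at best_index_
# (best_index_ is chosen by accuracy because refit = 'accuracy', so the F1 value is not necessarily the grid's maximum)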
best_params_lr = lr_grid_search.best_params_
best_score_lr_f1 = lr_grid_search.cv_results_['mean_test_f1_score'][lr_grid_search.best_index_]
best_score_lr_acc = lr_grid_search.cv_results_['mean_test_accuracy'][lr_grid_search.best_index_]

best_params_svm = svm_grid_search.best_params_
best_score_svm_f1 = svm_grid_search.cv_results_['mean_test_f1_score'][svm_grid_search.best_index_]
best_score_svm_acc = svm_grid_search.cv_results_['mean_test_accuracy'][svm_grid_search.best_index_]

best_params_rf = rf_grid_search.best_params_
best_score_rf_f1 = rf_grid_search.cv_results_['mean_test_f1_score'][rf_grid_search.best_index_]
best_score_rf_acc = rf_grid_search.cv_results_['mean_test_accuracy'][rf_grid_search.best_index_]

# Updated the hyperparameters of each pipeline using the best parameters from the grid search
lr_pipeline.set_params(**lr_grid_search.best_params_)
svm_pipeline.set_params(**svm_grid_search.best_params_)
rf_pipeline.set_params(**rf_grid_search.best_params_)

# Reported the results
print("Logistic Regression", "\nBest Parameters:", best_params_lr, "\nBest F1 Score:", best_score_lr_f1, "\nBest Accuracy:", best_score_lr_acc)
print("\nSVM", "\nBest Parameters:", best_params_svm, "\nBest F1 Score:", best_score_svm_f1, "\nBest Accuracy:", best_score_svm_acc)
print("\nRandom Forest", "\nBest Parameters:", best_params_rf, "\nBest F1 Score:", best_score_rf_f1, "\nBest Accuracy:", best_score_rf_acc)

Logistic Regression 
Best Parameters: {'classifier__C': 0.1, 'classifier__max_iter': 100} 
Best F1 Score: 0.7201635703043407 
Best Accuracy: 0.8196960841613091

SVM 
Best Parameters: {'classifier__C': 0.1, 'classifier__kernel': 'linear'} 
Best F1 Score: 0.7346270559734795 
Best Accuracy: 0.8231443600233781

Random Forest 
Best Parameters: {'classifier__max_depth': 5, 'classifier__n_estimators': 200} 
Best F1 Score: 0.7133180836192655 
Best Accuracy: 0.7886031560490941


<font color='Green'><b>Answer:</b></font>

**2.3**
- I chose Logistic Regression for my meta-model because it is a straightforward, efficient, and interpretable model for binary classification. The meta-model in a stacking classifier decides how much weight to give to each base model's prediction. `StackingClassifier` builds one input feature per base estimator from its cross-validated predictions (here, the predicted probability of the positive class), and Logistic Regression fits a weight for each of these features to predict the final outcome as a probability.
- My final results are:
  - Accuracy: 0.82 ± 0.03
  - F1 Score: 0.73 ± 0.07
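
How the meta-model sees the base estimators can be inspected directly. The following is a minimal sketch on synthetic data from `make_classification` (an assumption, not the heart disease data), with plain estimators instead of the tuned pipelines; the names `X_demo`, `y_demo`, and `demo_stack` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Synthetic stand-in data (not the heart disease data)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_stack = StackingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000)),
        ('svm', SVC(probability=True, random_state=0)),
        ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(),
    cv=5,   # during fit, the meta-model is trained on out-of-fold predictions
)
demo_stack.fit(X_demo, y_demo)

# transform() returns the inputs of the meta-model: for a binary problem,
# one column per base estimator holding its predicted positive-class probability
meta_features = demo_stack.transform(X_demo)
print(meta_features.shape)

# The fitted meta-model has one coefficient per base estimator
for (name, _), weight in zip(demo_stack.estimators, demo_stack.final_estimator_.coef_[0]):
    print(f"{name}: {weight:.2f}")
```

A larger coefficient means the meta-model leans more on that base estimator's probability when forming the final prediction.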

In [None]:
# 2.3

# Referenced documentation to find details about StackingClassifier: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html
from sklearn.ensemble import StackingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import make_scorer, accuracy_score, f1_score

# Define the base estimators
base_estimators = [
    ('lr', lr_pipeline),
    ('svm', svm_pipeline),
    ('rf', rf_pipeline)
]

# I chose Logistic Regression as the meta-model
meta_model = LogisticRegression(random_state = 0)

# Create the stacking classifier
stacking_classifier = StackingClassifier(
    estimators = base_estimators,
    final_estimator = meta_model,
    n_jobs = -1
)

# Referenced documentation to find details about StratifiedKFold: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html
# Defined StratifiedKFold for cross-validation
cv = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 0)

# Calculate and present the accuracy and F1 scores for each fold
accuracy_scores = cross_val_score(stacking_classifier, X, y, cv = cv, scoring = 'accuracy')
f1_scores = cross_val_score(stacking_classifier, X, y, cv = cv, scoring = make_scorer(f1_score))

print("Accuracy scores for each fold:")
for i, score in enumerate(accuracy_scores, 1):
    print(f"Fold {i}: {score}")

print("\nF1 scores for each fold:")
for i, score in enumerate(f1_scores, 1):
    print(f"Fold {i}: {score}")

# Report the mean and the standard deviation of each score
mean_accuracy = accuracy_scores.mean()
std_accuracy = accuracy_scores.std()
mean_f1 = f1_scores.mean()
std_f1 = f1_scores.std()

# Printed the results in the format required
print(f'\nAccuracy: {mean_accuracy:.2f} ± {std_accuracy:.2f}')
print(f'F1 Score: {mean_f1:.2f} ± {std_f1:.2f}')

Accuracy scores for each fold:
Fold 1: 0.8135593220338984
Fold 2: 0.8813559322033898
Fold 3: 0.7966101694915254
Fold 4: 0.8135593220338984
Fold 5: 0.8103448275862069

F1 scores for each fold:
Fold 1: 0.7441860465116279
Fold 2: 0.8292682926829269
Fold 3: 0.6249999999999999
Fold 4: 0.717948717948718
Fold 5: 0.7317073170731706

Accuracy: 0.82 ± 0.03
F1 Score: 0.73 ± 0.07


---

<font color='Green'><b>Answer:</b></font>

**2.4**
- The stacking classifier has a mean accuracy of 0.82 ± 0.03. This is higher than Random Forest (0.789) and essentially equal to Logistic Regression (0.820) and the SVM (0.823). The differences from Logistic Regression and the SVM are far smaller than the fold-to-fold standard deviation of 0.03, so stacking matched the best individual models rather than clearly improving on them. The comparison is also only approximate, because the individual scores come from the grid search's unshuffled 5-fold split while the stacking scores come from a shuffled `StratifiedKFold`.
- In terms of F1 score, the stacking classifier averaged 0.73 ± 0.07, which is higher than Random Forest (0.713) and Logistic Regression (0.720) and about the same as the SVM (0.735). These gaps are again well within the standard deviation of 0.07, and the per-fold F1 ranges from 0.62 to 0.83, so the stacking classifier maintains the precision-recall balance of the best individual model without clearly improving it. One likely reason is that the base models are not diverse enough to complement each other: the tuned SVM uses a linear kernel, so two of the three base estimators are regularized linear models with similar decision boundaries. With only 294 instances, the meta-model also has little data from which to learn how to weight them.
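
A like-for-like comparison would score every model on the same folds. The sketch below does this with `cross_validate`, which also computes accuracy and F1 in a single pass. It assumes synthetic data from `make_classification` and untuned estimators, and the names `X_demo`, `y_demo`, `candidates`, and `shared_cv` are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in data (not the heart disease data)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

base = [
    ('lr', LogisticRegression(max_iter=1000)),
    ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
]
candidates = dict(base)
candidates['stacking'] = StackingClassifier(estimators=base, final_estimator=LogisticRegression())

# One splitter with a fixed seed, so every model is evaluated on identical folds
shared_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

summary = {}
for name, model in candidates.items():
    scores = cross_validate(model, X_demo, y_demo, cv=shared_cv, scoring=['accuracy', 'f1'])
    summary[name] = scores
    print(f"{name}: accuracy {scores['test_accuracy'].mean():.2f} ± {scores['test_accuracy'].std():.2f}, "
          f"F1 {scores['test_f1'].mean():.2f} ± {scores['test_f1'].std():.2f}")
```

Since the folds are shared, differences between rows reflect the models rather than the split.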

**Bonus Question**: The stacking classifier has achieved a high accuracy and F1 score, but there may still be room for improvement. Suggest **two** possible ways to improve the modeling using the stacking classifier, and explain **how** and **why** they could improve the performance. **(2 points)**

<font color='Green'><b>Answer: Two ways that the stacking classifier could be improved are strengthening and diversifying the base models, and tuning the meta-model.
- The stacking classifier depends on the performance of its base models, so tuning each of them over a wider hyperparameter grid can improve the predictions that the meta-model receives. Adding base models of different types would also give a more diverse set of predictions for the meta-model to weigh, which helps because stacking gains the most when the base models make different errors.
- The meta-model learns how to best combine the predictions of the base models, so tuning its hyperparameters, or comparing it against more sophisticated models to find the best fit, can allow it to interpret the base models' predictions more accurately and improve the overall performance.</b></font>
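
The second suggestion could be tried with a grid search over the stacking classifier itself. This is a minimal sketch under stated assumptions: synthetic data from `make_classification`, two untuned base estimators, and a small illustrative grid. The names `X_demo`, `y_demo`, `demo_stack`, `meta_grid`, and `meta_search` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the heart disease data)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_stack = StackingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=1000)),
                ('rf', RandomForestClassifier(n_estimators=25, random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
)

meta_grid = {
    'final_estimator__C': [0.1, 1, 10],   # regularization strength of the meta-model
    'passthrough': [False, True],         # whether the meta-model also sees the original features
}
meta_search = GridSearchCV(demo_stack, meta_grid, cv=3, scoring='f1')
meta_search.fit(X_demo, y_demo)
print(meta_search.best_params_, round(meta_search.best_score_, 3))
```

The `final_estimator__` prefix reaches the meta-model's parameters in the same way that `classifier__` reaches the model inside each pipeline.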

































