<font size="+3"><b>Assignment 4: Pipelines and Hyperparameter Tuning</b></font>

***
* **Full Name** = Sarah Qin     
* **UCID** = 10156892
***

<font color='Blue'>
In this assignment, you will be putting together everything you have learned so far. You will need to find your own dataset, do all the appropriate preprocessing, test different supervised learning models, and evaluate the results. You will also be asked to describe the process by which you came up with the code. More details for each step can be found below. Please cite any websites or AI tools that you used to help you with this assignment.
</font>

<font color='Red'>
For this assignment, in addition to your .ipynb file, please also attach a PDF file. To generate this PDF file, you can use the print function (located under the "File" within Jupyter Notebook). Name this file ENGG444_Assignment##__yourUCID.pdf (this name is similar to your main .ipynb file). We will evaluate your assignment based on the two files and you need to provide both.
</font>


|         **Question**         | **Point(s)** |
|:----------------------------:|:------------:|
|  **1. Preprocessing Tasks**  |              |
|              1.1             |       2      |
|              1.2             |       2      |
|              1.3             |       4      |
| **2. Pipeline and Modeling** |              |
|              2.1             |       3      |
|              2.2             |       6      |
|              2.3             |       5      |
|              2.4             |       3      |
|     **3. Bonus Question**    |     **2**    |
|           **Total**          |    **25**    |

## **0. Dataset**

This data is a subset of the **Heart Disease Dataset**, which contains information about patients with possible coronary artery disease. The data has **14 attributes** and **294 instances**. The attributes include demographic, clinical, and laboratory features, such as age, sex, chest pain type, blood pressure, cholesterol, and electrocardiogram results. The last attribute is the **diagnosis of heart disease**, which is a categorical variable with values from 0 (no presence) to 4 (high presence). The data can be used for **classification** tasks, such as predicting the presence or absence of heart disease based on the other attributes.

In [2]:
import pandas as pd

# Define the data source link
_link = 'https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.hungarian.data'

# Read the CSV file into a Pandas DataFrame, considering '?' as missing values
df = pd.read_csv(_link, na_values='?',
                 names=['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs',
                        'restecg', 'thalach', 'exang', 'oldpeak', 'slope',
                        'ca', 'thal', 'num'])

# Display the DataFrame
display(df)

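|     | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | num |
|:---:|:---:|:---:|:--:|:--------:|:----:|:---:|:-------:|:-------:|:-----:|:-------:|:-----:|:--:|:----:|:---:|
| 0 | 28 | 1 | 2 | 130.0 | 132.0 | 0.0 | 2.0 | 185.0 | 0.0 | 0.0 | NaN | NaN | NaN | 0 |
| 1 | 29 | 1 | 2 | 120.0 | 243.0 | 0.0 | 0.0 | 160.0 | 0.0 | 0.0 | NaN | NaN | NaN | 0 |
| 2 | 29 | 1 | 2 | 140.0 | NaN | 0.0 | 0.0 | 170.0 | 0.0 | 0.0 | NaN | NaN | NaN | 0 |
| 3 | 30 | 0 | 1 | 170.0 | 237.0 | 0.0 | 1.0 | 170.0 | 0.0 | 0.0 | NaN | NaN | 6.0 | 0 |
| 4 | 31 | 0 | 2 | 100.0 | 219.0 | 0.0 | 1.0 | 150.0 | 0.0 | 0.0 | NaN | NaN | NaN | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 289 | 52 | 1 | 4 | 160.0 | 331.0 | 0.0 | 0.0 | 94.0 | 1.0 | 2.5 | NaN | NaN | NaN | 1 |
| 290 | 54 | 0 | 3 | 130.0 | 294.0 | 0.0 | 1.0 | 100.0 | 1.0 | 0.0 | 2.0 | NaN | NaN | 1 |
| 291 | 56 | 1 | 4 | 155.0 | 342.0 | 1.0 | 0.0 | 150.0 | 1.0 | 3.0 | 2.0 | NaN | NaN | 1 |
| 292 | 58 | 0 | 2 | 180.0 | 393.0 | 0.0 | 0.0 | 110.0 | 1.0 | 1.0 | 2.0 | NaN | 7.0 | 1 |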


# **1. Preprocessing Tasks**

- **1.1** Find out which columns have more than 60% of their values missing and drop them from the data frame. Explain why this is a reasonable way to handle these columns. **(2 Points)**

- **1.2** For the remaining columns that have some missing values, choose an appropriate imputation method to fill them in. You can use the `SimpleImputer` class from `sklearn.impute` or any other method you prefer. Explain why you chose this method and how it affects the data. **(2 Points)**

- **1.3** Assign the `num` column to the variable `y` and the rest of the columns to the variable `X`. The `num` column indicates the presence or absence of heart disease based on the angiographic disease status of the patients. Create a `ColumnTransformer` object that applies different preprocessing steps to different subsets of features. Use `StandardScaler` for the numerical features, `OneHotEncoder` for the categorical features, and `passthrough` for the binary features. List the names of the features that belong to each group and explain why they need different transformations. You will use this `ColumnTransformer` in a pipeline in the next question. **(4 Points)**

<font color='Green'><b>Answer:</b></font>

- **1.1** - It is reasonable to drop these columns because most of their values are missing. With so little observed data, they are unlikely to carry useful information, and imputing them would mean filling most of each column with estimated values. Dropping them lets us focus on features that contain more useful data.

In [24]:
# 1.1
# Add necessary code here.

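# Fraction of missing values in each column
missing_frac = df.isna().mean()

# Drop the columns where more than 60% of the values are missing
df = df.drop(columns=missing_frac[missing_frac > 0.6].index)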



<font color='Green'><b>Answer:</b></font>

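- **1.2** - I filled the remaining missing values with the column mean because this keeps every row and preserves the mean of each column, although it slightly reduces the variance. Because `restecg` is categorical, rows whose imputed `restecg` value is not a valid category are dropped afterwards.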

In [26]:
# 1.2
from sklearn.impute import SimpleImputer

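# Replace each remaining missing value with the mean of its column
imputer = SimpleImputer(strategy='mean')
df = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# Check that no missing values remain
df.isna().sum()

# Mean imputation gives the categorical 'restecg' column a fractional value
# wherever it was missing; keep only rows with a valid category code
df = df[df['restecg'].isin([0, 1, 2])]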
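
The effect of mean imputation on a categorical column can be illustrated on a small, made-up frame. This is only a sketch: the `toy` data and the choice of `most_frequent` for the category column are assumptions for illustration, not part of the assignment's data or the method used above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame: 'chol' is continuous, 'restecg' is a category code.
toy = pd.DataFrame({
    "chol": [200.0, np.nan, 240.0, 220.0],
    "restecg": [0.0, 0.0, np.nan, 1.0],
})

# Mean imputation on every column: the missing category code becomes a fraction.
mean_filled = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(toy), columns=toy.columns
)

# Per-type imputation: mean for the continuous column,
# most frequent value for the category column.
mixed = toy.copy()
mixed[["chol"]] = SimpleImputer(strategy="mean").fit_transform(toy[["chol"]])
mixed[["restecg"]] = SimpleImputer(strategy="most_frequent").fit_transform(
    toy[["restecg"]]
)

print(mean_filled)
print(mixed)
```

With per-type imputation, no row has to be filtered out afterwards, at the cost of handling the column groups separately.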

<font color='Green'><b>Answer:</b></font>

- **1.3** 
    - Numerical - age, trestbps, chol, thalach, oldpeak
    - Categorical - cp, restecg
    - Binary - sex, fbs, exang

They need different transformations because they represent different types of data. Numerical features carry magnitude, so they are standardized to put them on a common scale. Categorical and binary features are codes assigned to categories rather than quantities. Categorical features have more than two categories, so they are one-hot encoded to avoid implying an order between the codes. Binary features have only two categories, already encoded as 0 and 1, so they are passed through unchanged. These transformations make the data easier for some algorithms to handle.

In [5]:
# 1.3
y = df['num']
X = df.drop('num', axis=1)

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

scaler = StandardScaler()
ohe = OneHotEncoder(sparse_output=False)

preprocessor = ColumnTransformer(
    transformers=[
        ('numerical', scaler, ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']),
        ('categorical', ohe, ['cp', 'restecg']),
        ('binary', 'passthrough', ['sex', 'fbs', 'exang'])
    ]
)

# **2. Pipeline and Modeling**

- **2.1** Create **three** `Pipeline` objects that take the column transformer from the previous question as the first step and add one or more models as the subsequent steps. You can use any models from `sklearn` or other libraries that are suitable for binary classification. For each pipeline, explain **why** you selected the model(s) and what are their **strengths and weaknesses** for this data set. **(3 Points)**

- **2.2** Use `GridSearchCV` to perform a grid search over the hyperparameters of each pipeline and find the best combination that maximizes the cross-validation score. Report the best parameters and the best score for each pipeline. Then, update the hyperparameters of each pipeline using the best parameters from the grid search. **(6 Points)**

- **2.3** Form a stacking classifier that uses the three pipelines from the previous question as the base estimators and a meta-model as the `final_estimator`. You can choose any model for the meta-model that is suitable for binary classification. Explain **why** you chose the meta-model and how it combines the predictions of the base estimators. Then, use `StratifiedKFold` to perform a cross-validation on the stacking classifier and present the accuracy scores and F1 scores for each fold. Report the mean and the standard deviation of each score in the format of `mean ± std`. For example, `0.85 ± 0.05`. Interpret the results and compare them with the baseline scores from the previous assignment. **(5 Points)**

- **2.4** Interpret the final results of the stacking classifier and compare its performance with the individual models. Explain how the stacking classifier has improved or deteriorated the prediction accuracy and F1 score, and what the possible reasons for that are. **(3 Points)**

<font color='Green'><b>Answer:</b></font>

- **2.1** 
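    - I chose logistic regression, a random forest classifier, and SVC because all three are suitable for binary classification.
    - Logistic regression is a simple, interpretable linear baseline, but it cannot capture non-linear relationships on its own.
    - Random forest can model non-linear relationships and feature interactions, but it is harder to interpret and can overfit a data set this small.
    - SVC works well on small, scaled data sets and supports non-linear kernels, but it is sensitive to the choice of `C` and kernel.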

In [6]:
# 2.1
# Add necessary code here.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

from sklearn.pipeline import make_pipeline

pipe_lr = make_pipeline(preprocessor, LogisticRegression(max_iter=2000)) 
pipe_rf = make_pipeline(preprocessor, RandomForestClassifier())
pipe_svm = make_pipeline(preprocessor, SVC())



<font color='Green'><b>Answer:</b></font>

- **2.2** The best parameters for logistic regression are C=100 and solver=liblinear, with a best accuracy score of 0.85 and a best F1 score of 0.78. The best parameters for random forest are max_depth=3 and n_estimators=100, with a best accuracy score of 0.86 and a best F1 score of 0.79. The best parameters for SVC are C=10 and kernel=linear, with a best accuracy score of 0.87 and a best F1 score of 0.80.

In [7]:
# 2.2
# Add necessary code here.
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import f1_score, make_scorer, accuracy_score
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)

param_grid_lr = {
    'logisticregression__C': [0.001, 0.01, 0.1, 1, 10, 100],
    'logisticregression__solver': ['liblinear', 'saga']
}

param_grid_rf = {
    'randomforestclassifier__max_depth': [1, 3, 5, 10],
    'randomforestclassifier__n_estimators': [10, 50, 100]  
}

param_grid_svm = {
    'svc__C': [0.001, 0.01, 0.1, 1, 10, 100],
    'svc__kernel': ['linear', 'rbf', 'poly', 'sigmoid']
}

param_grids = {'lr': param_grid_lr, 'rf': param_grid_rf, 'svm': param_grid_svm}

pipelines = {'lr':pipe_lr, 'rf': pipe_rf,  'svm': pipe_svm} 

scoring = {
    'accuracy': make_scorer(accuracy_score),         # Scoring based on accuracy_score
    'f1_score': make_scorer(f1_score)                # Scoring based on F1_score
}

for model_name, param_grid in param_grids.items():
    grid_search = GridSearchCV(pipelines[model_name], param_grid, cv=5,
                               scoring=scoring, refit='f1_score', n_jobs=-1)
    grid_search.fit(X_train, y_train)
    best_param = grid_search.best_params_  # combination with the best mean F1 (the refit metric)
    cv_results = grid_search.cv_results_
    print('Model: ', model_name)
    print('Best params: ', best_param)
    # Each maximum is taken over the whole grid, so the best accuracy may come
    # from a different parameter combination than the best F1
    print("Best accuracy scores:", max(cv_results['mean_test_accuracy']))
    print("Best F1 scores:", max(cv_results['mean_test_f1_score']))


Model:  lr
Best params:  {'logisticregression__C': 100, 'logisticregression__solver': 'liblinear'}
Best accuracy scores: 0.8461609620721553
Best F1 scores: 0.7755416549239751
Model:  rf
Best params:  {'randomforestclassifier__max_depth': 3, 'randomforestclassifier__n_estimators': 100}
Best accuracy scores: 0.8591119333950046
Best F1 scores: 0.7852128427128428
Model:  svm
Best params:  {'svc__C': 10, 'svc__kernel': 'linear'}
Best accuracy scores: 0.8716928769657724
Best F1 scores: 0.7984139784946237
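
Task 2.2 also asks for the pipelines to be updated with the best parameters. One way to do that without retyping the values is `set_params(**best_params_)`. The sketch below uses synthetic data from `make_classification` and a hypothetical `demo_pipe`, so the numbers it prints are unrelated to the results above. It also reads both metrics at `best_index_`, so that accuracy and F1 refer to the same parameter combination.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the heart disease data).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
demo_grid = {"logisticregression__C": [0.01, 1, 100]}

search = GridSearchCV(
    demo_pipe,
    demo_grid,
    cv=5,
    scoring={"accuracy": "accuracy", "f1": "f1"},
    refit="f1",  # best_params_ and best_index_ follow the F1 metric
)
search.fit(X_demo, y_demo)

# Copy the winning hyperparameters back onto the original pipeline object.
demo_pipe.set_params(**search.best_params_)

# Read both metrics for the same (best-F1) parameter combination.
best_row = search.best_index_
print("Best params:", search.best_params_)
print("Accuracy at best params:", search.cv_results_["mean_test_accuracy"][best_row])
print("F1 at best params:", search.cv_results_["mean_test_f1"][best_row])
```

`search.best_estimator_` is an alternative: it is a copy of the pipeline that has already been refit on the full training data with the best parameters.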


<font color='Green'><b>Answer:</b></font>

- **2.3** - I chose SVC as the meta-model because it performed the best in 2.2. The stacking classifier trains the meta-model on the cross-validated predictions of the three base models, so the meta-model learns how much weight to give each base model when combining them into the final prediction. The mean accuracy is 0.83 ± 0.08 and the mean F1 score is 0.74 ± 0.12. Compared to the baseline scores of the last assignment, 

In [18]:
# 2.3
# Add necessary code here.

pipe_lr = make_pipeline(preprocessor, LogisticRegression(C=100, solver='liblinear', max_iter=2000))
pipe_rf = make_pipeline(preprocessor, RandomForestClassifier(max_depth=3, n_estimators=100))
pipe_svm = make_pipeline(preprocessor, SVC(C=10, kernel='linear'))

from sklearn.ensemble import StackingClassifier
from sklearn.model_selection import StratifiedKFold
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

estimators = [('lr', pipe_lr), ('rf', pipe_rf), ('svm', pipe_svm)]

stacking = StackingClassifier(estimators=estimators, final_estimator=SVC(C=10, kernel='linear'))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)


results = {'accuracy':[], 'f1_score': []}


for train_index, test_index in skf.split(X, y):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    stacking.fit(X_train, y_train)
    y_pred = stacking.predict(X_test)
    results['accuracy'].append(accuracy_score(y_test, y_pred))
    results['f1_score'].append(f1_score(y_test, y_pred))

index = ['Fold 1', 'Fold 2', 'Fold 3', 'Fold 4', 'Fold 5']

results = pd.DataFrame(results, index=index)

print(results)
# Double quotes outside so the f-strings also parse on Python versions before 3.12
print(f"Mean accuracy: {results['accuracy'].mean():.2f} ± {results['accuracy'].std():.2f}")
print(f"Mean F1 score: {results['f1_score'].mean():.2f} ± {results['f1_score'].std():.2f}")
    
    

        accuracy  f1_score
Fold 1  0.830508  0.750000
Fold 2  0.830508  0.736842
Fold 3  0.711864  0.564103
Fold 4  0.827586  0.772727
Fold 5  0.931034  0.900000
Mean accuracy: 0.83 ± 0.08
Mean F1 score: 0.74 ± 0.12
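
The per-fold loop above can also be written with `cross_validate`, which collects several metrics in one call. This is a sketch on synthetic data with a plain logistic regression standing in for the stacking classifier, so the printed scores are not comparable to the table above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in data and model (assumptions for illustration only).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_out = cross_validate(
    LogisticRegression(max_iter=2000),
    X_demo,
    y_demo,
    cv=skf_demo,
    scoring=["accuracy", "f1"],
)

for metric in ("accuracy", "f1"):
    fold_scores = cv_out[f"test_{metric}"]
    # ddof=1 gives the sample standard deviation, matching the pandas default.
    print(f"{metric}: {fold_scores.mean():.2f} ± {fold_scores.std(ddof=1):.2f}")
```

Any estimator, including a `StackingClassifier`, can be passed in place of the logistic regression.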


<font color='Green'><b>Answer:</b></font>
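- **2.4** - The stacking classifier performed reasonably well, with a mean accuracy of 0.83 ± 0.08 and a mean F1 score of 0.74 ± 0.12; its best fold reached an accuracy of 0.93 and an F1 score of 0.90. On average, however, it did not outperform the individual models, whose best cross-validation scores in 2.2 were 0.85, 0.86, and 0.87 for accuracy and 0.78, 0.79, and 0.80 for F1. The gap is smaller than one standard deviation of the fold scores, and the two sets of scores come from different splits, so the comparison is only approximate. Stacking can improve the scores because the meta-model learns how to weigh the predictions of the base models, but it is likely that the three base models make similar errors on this small data set, which leaves little for the meta-model to correct.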
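
The idea that the meta-model weighs the predictions of the base models can be sketched by hand. The code below approximates the mechanism on synthetic data: each base model produces out-of-fold probabilities, and a meta-model is fit on those columns. It is a simplified illustration, not the internals of `StackingClassifier`, and it uses logistic regression as the meta-model (an assumption, chosen so the learned weights are easy to read) rather than the SVC used above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in data (assumption: not the heart disease data).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

base_models = {
    "lr": LogisticRegression(max_iter=2000),
    "rf": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Out-of-fold probability of the positive class from each base model.
# Using out-of-fold predictions keeps the meta-model from seeing
# predictions made on data the base model was trained on.
meta_features = np.column_stack([
    cross_val_predict(model, X_demo, y_demo, cv=5, method="predict_proba")[:, 1]
    for model in base_models.values()
])

# The meta-model is trained on one column per base model.
meta_model = LogisticRegression().fit(meta_features, y_demo)

print(meta_features.shape)
print(dict(zip(base_models, meta_model.coef_[0])))
```

If the columns of `meta_features` are highly correlated, the meta-model has little extra information to work with, which is one reason stacking may not beat its best base model.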

**Bonus Question**: The stacking classifier has achieved a high accuracy and F1 score, but there may still be room for improvement. Suggest **two** possible ways to improve the modeling using the stacking classifier, and explain **how** and **why** they could improve the performance. **(2 points)**

<font color='Green'><b>Answer:</b></font>

- Tune the hyperparameters of the base models and the meta-model, because the stacking classifier relies on both to form its predictions.
- Add more diverse base models, because base models that make different kinds of errors give the meta-model more to learn from and help the stacking classifier generalize better.
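
The first suggestion could be tried by running a grid search over the stacking classifier itself. The sketch below assumes synthetic data and a reduced, hypothetical `demo_stack` with two base models; the parameter names follow the pattern `<estimator name>__<pipeline step>__<parameter>` for base models and `final_estimator__<parameter>` for the meta-model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the heart disease data).
X_demo, y_demo = make_classification(n_samples=150, n_features=8, random_state=0)

demo_stack = StackingClassifier(
    estimators=[
        ("lr", make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))),
        ("rf", RandomForestClassifier(n_estimators=25, random_state=0)),
    ],
    final_estimator=SVC(kernel="linear"),
    cv=3,
)

# Small illustrative grid: one base-model parameter and one meta-model parameter.
stack_grid = {
    "lr__logisticregression__C": [0.1, 10],
    "final_estimator__C": [0.1, 10],
}

stack_search = GridSearchCV(demo_stack, stack_grid, cv=3, scoring="f1")
stack_search.fit(X_demo, y_demo)

print(stack_search.best_params_)
print(round(stack_search.best_score_, 2))
```

Tuning the whole stack jointly is slower than tuning each model separately, because every grid point refits all base models inside the inner cross-validation, so a small grid is a sensible starting point.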