<font size="+3"><b>Assignment 4: Pipelines and Hyperparameter Tuning</b></font>

***
* **Full Name** = Sarah Qin     
* **UCID** = 10156892
***

<font color='Blue'>
In this assignment, you will put together everything you have learned so far. You will need to find your own dataset, do all the appropriate preprocessing, test different supervised learning models, and evaluate the results. You will also be asked to describe the process by which you came up with the code. More details for each step can be found below. Please cite any websites or AI tools that you used to help you with this assignment.
</font>

<font color='Red'>
For this assignment, in addition to your .ipynb file, please also attach a PDF file. To generate this PDF file, you can use the print function (located under the "File" menu in Jupyter Notebook). Name this file ENGG444_Assignment##__yourUCID.pdf (this name is similar to that of your main .ipynb file). We will evaluate your assignment based on the two files, so you need to provide both.
</font>


|         **Question**         | **Point(s)** |
|:----------------------------:|:------------:|
|  **1. Preprocessing Tasks**  |              |
|              1.1             |       2      |
|              1.2             |       2      |
|              1.3             |       4      |
| **2. Pipeline and Modeling** |              |
|              2.1             |       3      |
|              2.2             |       6      |
|              2.3             |       5      |
|              2.4             |       3      |
|     **3. Bonus Question**    |     **2**    |
|           **Total**          |    **25**    |

## **0. Dataset**

This data is a subset of the **Heart Disease Dataset**, which contains information about patients with possible coronary artery disease. The data has **14 attributes** and **294 instances**. The attributes include demographic, clinical, and laboratory features, such as age, sex, chest pain type, blood pressure, cholesterol, and electrocardiogram results. The last attribute is the **diagnosis of heart disease**, which is a categorical variable with values from 0 (no presence) to 4 (high presence). The data can be used for **classification** tasks, such as predicting the presence or absence of heart disease based on the other attributes.

In [35]:
import pandas as pd

# Define the data source link
_link = 'https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.hungarian.data'

# Read the CSV file into a Pandas DataFrame, considering '?' as missing values
df = pd.read_csv(_link, na_values='?',
                 names=['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs',
                        'restecg', 'thalach', 'exang', 'oldpeak', 'slope',
                        'ca', 'thal', 'num'])

# Display the DataFrame
display(df)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num
0,28,1,2,130.0,132.0,0.0,2.0,185.0,0.0,0.0,,,,0
1,29,1,2,120.0,243.0,0.0,0.0,160.0,0.0,0.0,,,,0
2,29,1,2,140.0,,0.0,0.0,170.0,0.0,0.0,,,,0
3,30,0,1,170.0,237.0,0.0,1.0,170.0,0.0,0.0,,,6.0,0
4,31,0,2,100.0,219.0,0.0,1.0,150.0,0.0,0.0,,,,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
289,52,1,4,160.0,331.0,0.0,0.0,94.0,1.0,2.5,,,,1
290,54,0,3,130.0,294.0,0.0,1.0,100.0,1.0,0.0,2.0,,,1
291,56,1,4,155.0,342.0,1.0,0.0,150.0,1.0,3.0,2.0,,,1
292,58,0,2,180.0,393.0,0.0,0.0,110.0,1.0,1.0,2.0,,7.0,1


# **1. Preprocessing Tasks**

- **1.1** Find out which columns have more than 60% of their values missing and drop them from the data frame. Explain why this is a reasonable way to handle these columns. **(2 Points)**

- **1.2** For the remaining columns that have some missing values, choose an appropriate imputation method to fill them in. You can use the `SimpleImputer` class from `sklearn.impute` or any other method you prefer. Explain why you chose this method and how it affects the data. **(2 Points)**

- **1.3** Assign the `num` column to the variable `y` and the rest of the columns to the variable `X`. The `num` column indicates the presence or absence of heart disease based on the angiographic disease status of the patients. Create a `ColumnTransformer` object that applies different preprocessing steps to different subsets of features. Use `StandardScaler` for the numerical features, `OneHotEncoder` for the categorical features, and `passthrough` for the binary features. List the names of the features that belong to each group and explain why they need different transformations. You will use this `ColumnTransformer` in a pipeline in the next question. **(4 Points)**

<font color='Green'><b>Answer:</b></font>

- **1.1** 

In [36]:
# 1.1
# Missing values per column before dropping the sparse columns
print(df.isna().sum())

# Drop every column in which more than 60% of the values are missing
for col in df.columns:
    if (df[col].isna().sum()) / len(df) > 0.6:
        df.drop(col, axis=1, inplace=True)

# Missing values per column after dropping the sparse columns
df.isna().sum()


age           0
sex           0
cp            0
trestbps      1
chol         23
fbs           8
restecg       1
thalach       1
exang         1
oldpeak       0
slope       190
ca          291
thal        266
num           0
dtype: int64


age          0
sex          0
cp           0
trestbps     1
chol        23
fbs          8
restecg      1
thalach      1
exang        1
oldpeak      0
num          0
dtype: int64
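
The same threshold rule can also be written without an explicit loop. The block below is a minimal sketch on a small toy frame with made-up values (not the heart-disease data); the helper name `drop_sparse_columns` is hypothetical and only used for illustration.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(frame: pd.DataFrame, threshold: float = 0.6) -> pd.DataFrame:
    """Return a copy of `frame` without columns whose missing fraction exceeds `threshold`."""
    missing_fraction = frame.isna().mean()  # per-column share of NaN values
    to_drop = missing_fraction[missing_fraction > threshold].index
    return frame.drop(columns=to_drop)


# Toy frame with hypothetical values, used only to illustrate the rule
toy = pd.DataFrame({
    "age": [28, 29, 30, 31, 32],
    "chol": [132.0, np.nan, 237.0, 219.0, 243.0],  # 20% missing, so it is kept
    "ca": [np.nan, np.nan, np.nan, np.nan, 2.0],   # 80% missing, so it is dropped
})

cleaned = drop_sparse_columns(toy)
print(list(cleaned.columns))
```

Returning a new frame instead of dropping in place keeps the original data available, which makes the cell safe to re-run.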

<font color='Green'><b>Answer:</b></font>

- **1.2** .....................

In [38]:
# 1.2
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')

df = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

df.isna().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
num         0
dtype: int64
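
Mean imputation treats every column as continuous, so a coded column such as `restecg` can receive a value that is not a valid category (the `restecg_0.21843003412969283` column in the 1.3 output is an example). One possible alternative, sketched below under the assumption that `trestbps` and `chol` are continuous while `restecg` and `fbs` are discrete codes, is to impute each group with a different strategy. The toy values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Assumed grouping of columns (an illustration, not the only reasonable choice)
numeric_cols = ["trestbps", "chol"]
discrete_cols = ["restecg", "fbs"]

# Toy frame with hypothetical values and a few missing entries
toy = pd.DataFrame({
    "trestbps": [130.0, 120.0, np.nan, 170.0],
    "chol": [132.0, 243.0, np.nan, 237.0],
    "restecg": [2.0, 0.0, 0.0, np.nan],
    "fbs": [0.0, np.nan, 0.0, 1.0],
})

imputer = ColumnTransformer(
    transformers=[
        # Median is robust to outliers in continuous measurements
        ("numeric", SimpleImputer(strategy="median"), numeric_cols),
        # The most frequent value always yields an existing category code
        ("discrete", SimpleImputer(strategy="most_frequent"), discrete_cols),
    ]
)

# The output column order follows the order of the transformers above
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=numeric_cols + discrete_cols)
print(imputed)
```

With this split, the one-hot encoder in a later step only ever sees category codes that occur in the raw data.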

<font color='Green'><b>Answer:</b></font>

- **1.3** .....................

In [39]:
# 1.3
y = df['num']
X = df.drop('num', axis=1)

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

scaler = StandardScaler()
ohe = OneHotEncoder(sparse_output=False)

preprocessor = ColumnTransformer(
    transformers=[
        ('numerical', scaler, ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']),
        ('categorical', ohe, ['cp',  'restecg'])
    ],
    remainder='passthrough'
)

transformed_data = preprocessor.fit_transform(X)
transformed_X = pd.DataFrame(transformed_data, columns=preprocessor.get_feature_names_out())

transformed_X

Unnamed: 0,numerical__age,numerical__trestbps,numerical__chol,numerical__thalach,numerical__oldpeak,categorical__cp_1.0,categorical__cp_2.0,categorical__cp_3.0,categorical__cp_4.0,categorical__restecg_0.0,categorical__restecg_0.21843003412969283,categorical__restecg_1.0,categorical__restecg_2.0,remainder__sex,remainder__fbs,remainder__exang
0,-2.542347,-0.147076,-1.833027,1.951150,-0.646074,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0
1,-2.414117,-0.716341,-0.121052,0.887744,-0.646074,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
2,-2.414117,0.422189,0.000000,1.313106,-0.646074,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
3,-2.285888,2.129984,-0.213591,1.313106,-0.646074,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,-2.157658,-1.854871,-0.491209,0.462382,-0.646074,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
289,0.535162,1.560719,1.236189,-1.919647,2.109958,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0
290,0.791621,-0.147076,0.665531,-1.664429,-0.646074,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
291,1.048080,1.276086,1.405845,0.462382,2.661164,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0
292,1.304539,2.699249,2.192428,-1.239067,0.456339,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0


# **2. Pipeline and Modeling**

- **2.1** Create **three** `Pipeline` objects that take the column transformer from the previous question as the first step and add one or more models as the subsequent steps. You can use any models from `sklearn` or other libraries that are suitable for binary classification. For each pipeline, explain **why** you selected the model(s) and what their **strengths and weaknesses** are for this data set. **(3 Points)**

- **2.2** Use `GridSearchCV` to perform a grid search over the hyperparameters of each pipeline and find the best combination that maximizes the cross-validation score. Report the best parameters and the best score for each pipeline. Then, update the hyperparameters of each pipeline using the best parameters from the grid search. **(6 Points)**

- **2.3** Form a stacking classifier that uses the three pipelines from the previous question as the base estimators and a meta-model as the `final_estimator`. You can choose any model for the meta-model that is suitable for binary classification. Explain **why** you chose the meta-model and how it combines the predictions of the base estimators. Then, use `StratifiedKFold` to perform a cross-validation on the stacking classifier and present the accuracy scores and F1 scores for each fold. Report the mean and the standard deviation of each score in the format of `mean ± std`. For example, `0.85 ± 0.05`. Interpret the results and compare them with the baseline scores from the previous assignment. **(5 Points)**

- **2.4** Interpret the final results of the stacking classifier and compare its performance with the individual models. Explain how the stacking classifier has improved or deteriorated the prediction accuracy and F1 score, and what the possible reasons for that are. **(3 Points)**

<font color='Green'><b>Answer:</b></font>

- **2.1** .....................

In [44]:
# 2.1
# Add necessary code here.
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

from sklearn.pipeline import make_pipeline

pipe1 = make_pipeline(LogisticRegression()) 
pipe2 = make_pipeline(DecisionTreeClassifier())
pipe3 = make_pipeline(SVC())
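
Task 2.1 asks for the column transformer to be the first step of each pipeline. Placing it inside the pipeline also means the scaler and encoder are refit on the training folds only during cross-validation, which avoids leaking statistics from the validation folds. The block below is a minimal sketch of that arrangement on synthetic data; `X_toy`, `y_toy`, and `toy_preprocessor` are hypothetical names, and the column grouping is an assumption made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the raw (untransformed) feature frame
rng = np.random.default_rng(0)
n = 120
X_toy = pd.DataFrame({
    "age": rng.integers(28, 70, size=n),
    "chol": rng.normal(240, 40, size=n),
    "cp": rng.integers(1, 5, size=n),
    "sex": rng.integers(0, 2, size=n),
})
y_toy = (X_toy["age"] + rng.normal(0, 5, size=n) > 50).astype(int)

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("numerical", StandardScaler(), ["age", "chol"]),
        ("categorical", OneHotEncoder(handle_unknown="ignore"), ["cp"]),
    ],
    remainder="passthrough",  # binary columns such as `sex` pass through unchanged
)

# The preprocessor is the first pipeline step, so it is fit per training fold
pipe = make_pipeline(toy_preprocessor, LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X_toy, y_toy, cv=5)
print([name for name, _ in pipe.steps])
print(scores.mean())
```

A pipeline built this way is fit on the raw feature frame rather than on an already transformed copy.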



<font color='Green'><b>Answer:</b></font>

- **2.2** .....................

In [54]:
# 2.2
# Add necessary code here.
from sklearn.model_selection import GridSearchCV, cross_val_score

# Hyperparameter grid for each pipeline (step names are generated by make_pipeline)
param_grids = {
    'Logistic Regression': {'logisticregression__C': [0.1, 1, 10, 100]},
    'DTC': {'decisiontreeclassifier__max_depth': [3, 5, 7, 9]},
    'SVC': {'svc__C': [0.1, 1, 10, 100]},
}

pipelines = {
    'Logistic Regression': pipe1,
    'DTC': pipe2,
    'SVC': pipe3,
}

for model_name, param_grid in param_grids.items():
    print(model_name)
    # Cross-validated accuracy with the default hyperparameters
    scores = cross_val_score(pipelines[model_name], transformed_X, y, cv=5)
    print(f'{model_name}: {scores.mean()}')

    # Grid search over the hyperparameters of this pipeline
    grid = GridSearchCV(pipelines[model_name], param_grid, cv=5)
    grid.fit(transformed_X, y)
    print(f'Best parameters: {grid.best_params_}')
    print()

# Rebuild each pipeline with the best parameters reported by the grid search
pipe1 = make_pipeline(LogisticRegression(C=0.1))
pipe2 = make_pipeline(DecisionTreeClassifier(max_depth=3))
pipe3 = make_pipeline(SVC(C=1))

Logistic Regression
Logistic Regression: 0.8195791934541203
Best parameters: {'logisticregression__C': 0.1}

DTC
DTC: 0.6486265341905317
Best parameters: {'decisiontreeclassifier__max_depth': 3}

SVC
SVC: 0.7991817650496784
Best parameters: {'svc__C': 1}
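
Task 2.2 also asks for the best score of each search and for the pipelines to be updated with the best parameters. A fitted `GridSearchCV` exposes both through `best_score_` and `best_estimator_`, so the values do not need to be copied by hand. The block below is a minimal sketch on synthetic data from `make_classification` (an assumption for illustration; the names `X_demo`, `y_demo`, and `tuned_pipe` are hypothetical).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for the real features
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_grid = {"logisticregression__C": [0.1, 1, 10, 100]}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
grid.fit(X_demo, y_demo)

print(f"Best parameters: {grid.best_params_}")
print(f"Best CV score: {grid.best_score_:.3f}")

# With the default refit=True, best_estimator_ is the whole pipeline
# refit on all the data using the best parameter combination
tuned_pipe = grid.best_estimator_
```

Reusing `best_estimator_` keeps the tuned pipeline in sync with the search results if the grid is changed later.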



<font color='Green'><b>Answer:</b></font>

- **2.3** .....................

In [None]:
# 2.3
# Add necessary code here.

from sklearn.ensemble import StackingClassifier
from sklearn.model_selection import StratifiedKFold

# StackingClassifier expects a list of (name, estimator) tuples
estimators = [('lr', pipe1), ('dtc', pipe2), ('svc', pipe3)]

stacking = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
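
The cross-validation and the `mean ± std` report requested in 2.3 could be produced along the following lines. This is a minimal sketch on synthetic data from `make_classification`, not the assignment's result: the base estimators are plain models with the hyperparameters found in 2.2, passed as `(name, estimator)` pairs, and `X_demo` and `y_demo` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the real features
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

estimators = [
    ("lr", LogisticRegression(C=0.1, max_iter=1000)),
    ("dtc", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ("svc", SVC(C=1)),
]
stacking = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# 'f1' assumes a binary target with positive label 1
results = cross_validate(stacking, X_demo, y_demo, cv=skf, scoring=["accuracy", "f1"])

for metric in ("accuracy", "f1"):
    fold_scores = results[f"test_{metric}"]
    print(f"{metric} per fold: {np.round(fold_scores, 3)}")
    print(f"{metric}: {fold_scores.mean():.2f} ± {fold_scores.std():.2f}")
```

`cross_validate` is used here instead of `cross_val_score` because it evaluates several metrics in a single pass over the folds.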



<font color='Green'><b>Answer:</b></font>

- **2.4** .....................

**Bonus Question**: The stacking classifier has achieved a high accuracy and F1 score, but there may still be room for improvement. Suggest **two** possible ways to improve the modeling using the stacking classifier, and explain **how** and **why** they could improve the performance. **(2 points)**

<font color='Green'><b>Answer:</b></font>