# Notebook A: Ensemble Models

This notebook walks through using ensemble models to predict heart disease. Load the [Cleveland heart disease dataset](https://archive.ics.uci.edu/dataset/45/heart+disease), preprocess the data, perform a train-test split, and use a grid search to find the best parameters for a bagging classifier and an AdaBoost classifier.


### Setup imports

In [1]:
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Load data

In [4]:
cleveland_df = pd.read_csv('cleveland_heart_disease.csv')

In [5]:
assert not cleveland_df.empty, "DataFrame is empty"
assert cleveland_df.shape == (303, 14), "DataFrame has incorrect shape"

### Preprocess data
Remove the rows that contain question marks.

In [8]:
# Keep only the rows in which no cell contains a literal "?"
cleveland_df = cleveland_df[~cleveland_df.apply(lambda row: row.astype(str).str.contains(r"\?").any(), axis=1)].copy()


In [9]:
assert cleveland_df.shape == (297, 14), "DataFrame has incorrect shape"
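
An alternative to filtering after loading is to tell pandas that `?` marks a missing value and then drop incomplete rows. The sketch below assumes a small hypothetical CSV sample (`sample_csv`), not the real file, and shows that the affected columns are then parsed as numbers rather than strings.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for the real CSV; "?" marks a missing value.
sample_csv = io.StringIO(
    "age,ca,thal,class\n"
    "63,0.0,6.0,0\n"
    "67,?,3.0,2\n"
    "41,0.0,?,0\n"
)

# Treat "?" as NaN while parsing, then drop every row that has a missing value.
sample_df = pd.read_csv(sample_csv, na_values="?").dropna()

print(sample_df.shape)   # only the fully populated row remains
print(sample_df.dtypes)  # ca and thal are numeric instead of object
```

With the string-matching approach used above, columns that contained `?` may keep the `object` dtype even after the rows are removed, so checking `dtypes` afterwards is a cheap sanity step.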

In [13]:
print(cleveland_df.columns)
print(cleveland_df.head())

Index(['age', 'gender', 'cp', 'trestbps', 'chol', 'fps', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'class'],
      dtype='object')
   age  gender  cp  trestbps  chol  fps  restecg  thalach  exang  oldpeak  \
0   63       1   1       145   233    1        2      150      0      2.3   
1   67       1   4       160   286    0        2      108      1      1.5   
2   67       1   4       120   229    0        2      129      1      2.6   
3   37       1   3       130   250    0        0      187      0      3.5   
4   41       0   2       130   204    0        2      172      0      1.4   

   slope ca thal  class  
0      3  0    6      0  
1      2  3    3      2  
2      2  2    7      1  
3      3  0    3      0  
4      1  0    3      0  


### Train-Test Split
Use 80% of the data for training, and set the random state to 42.

In [26]:
# Collapse the target to binary: 0 = no disease, 1 = any disease (original values 1-4)
cleveland_df["class"] = cleveland_df["class"].apply(lambda x: 1 if x > 0 else 0)
X = cleveland_df.drop(columns=["class"])
y = cleveland_df["class"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [16]:
# Ensure that the split was successful
assert X_train.shape[0] > 0 and X_test.shape[0] > 0, "Something went wrong in train-test split."
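
The binarization and split can be checked in isolation. The following is a minimal sketch on hypothetical random data (`demo_df` is a stand-in with the same row count as the cleaned dataset, not the real data); it uses a vectorized comparison in place of `apply` and prints the resulting split sizes.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in: 297 rows with a five-level target (0-4).
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "feature": rng.normal(size=297),
    "class": rng.integers(0, 5, size=297),
})

# Vectorized equivalent of apply(lambda x: 1 if x > 0 else 0).
demo_df["class"] = (demo_df["class"] > 0).astype(int)

X_demo = demo_df.drop(columns=["class"])
y_demo = demo_df["class"]

# test_size=0.2 leaves 80% of the rows for training.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape[0], X_te.shape[0])
```

Printing `y_demo.value_counts()` before and after the binarization step is a quick way to confirm whether the target has two levels or five.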

### Bagging Model Training
First, define a bagging classifier that uses a `KNeighborsClassifier` as the base estimator and has `random_state=42`.

Then, use a grid search with five-fold cross-validation to find the best parameters for the bagging classifier.

Use accuracy for scoring the parameter combinations, and use this parameter grid:

```python
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_samples': [0.5, 0.75, 1.0], 
    'max_features': [0.5, 0.75, 1.0] 
}
```
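
To get a feel for the cost of this search, the grid can be expanded with `ParameterGrid`. This is a small sketch; `n_folds` is an illustrative name for the five-fold setting described above.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_samples': [0.5, 0.75, 1.0],
    'max_features': [0.5, 0.75, 1.0]
}

n_folds = 5  # five-fold cross-validation
n_candidates = len(ParameterGrid(param_grid))  # 3 * 3 * 3 combinations
n_fits = n_candidates * n_folds                # one fit per candidate per fold

print(n_candidates, n_fits)  # 27 135
```

The grid search also refits the best candidate once on the full training set by default, so the total is one fit more than `n_fits`.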



In [17]:
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

In [27]:
# define the base estimator
knn = KNeighborsClassifier()

# define the BaggingClassifier with KNN as the base estimator
bagging_clf = BaggingClassifier(estimator=knn, random_state=42)

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_samples': [0.5, 0.75, 1.0], 
    'max_features': [0.5, 0.75, 1.0]
}

bagging_grid_search = GridSearchCV(
    estimator=bagging_clf,
    param_grid=param_grid,
    scoring='accuracy',
    cv=5, # 5-fold cross-validation
    n_jobs=-1
)

bagging_grid_search.fit(X_train, y_train)

print(bagging_grid_search.best_params_, bagging_grid_search.best_score_)

{'max_features': 0.5, 'max_samples': 1.0, 'n_estimators': 50} 0.7590425531914894


In [None]:
assert bagging_grid_search.best_params_ == {'max_features': 0.5, 'max_samples': 0.75, 'n_estimators': 50}, "Incorrect best parameters"
assert bagging_grid_search.best_score_ > 0.560 and bagging_grid_search.best_score_ < 0.561, "Incorrect best score"

AssertionError: Incorrect best score

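### It seems like my model is too good?
The grid search reports a best score of about 0.759, well above the expected range of 0.560 to 0.561, and the best `max_samples` is 1.0 rather than the expected 0.75. One possible cause is that the target was collapsed to two classes before the split, whereas the expected values may have been computed on the original five-level `class` column.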

### AdaBoost Model Training
First, define an AdaBoost classifier that uses a decision tree classifier as the base estimator (the default behavior) and has `random_state=42`.

Then, use a grid search with five-fold cross-validation to find the best parameters for the AdaBoost classifier.

Use this parameter grid:

```python
param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 1.0]
}
```

The fit may emit warnings; these can be ignored.
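
If the warnings clutter the output, they can be silenced for a single block of code with the standard `warnings` module. The sketch below assumes the warnings are of category `FutureWarning`; `noisy_fit` is a hypothetical stand-in for the fit call, not part of scikit-learn.

```python
import warnings


def noisy_fit():
    """Hypothetical stand-in for a fit call that emits a deprecation-style warning."""
    warnings.warn("this default will change in a future release", FutureWarning)
    return "fitted"


# The filter applies only inside the with-block; earlier filters are restored on exit.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=FutureWarning)
    result = noisy_fit()

print(result)
```

When `n_jobs=-1` is used, the fits run in worker processes, so a filter set in the notebook process may not carry over to them. Running with `n_jobs=1` or setting the `PYTHONWARNINGS` environment variable are two possible workarounds.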


In [29]:
from sklearn.ensemble import AdaBoostClassifier

In [32]:
adaboost_clf = AdaBoostClassifier(random_state=42)

param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 1.0]
}

ada_grid_search = GridSearchCV(
    estimator=adaboost_clf,
    param_grid=param_grid,
    scoring='accuracy',
    cv=5,
    n_jobs=-1
)

ada_grid_search.fit(X_train, y_train)

print(ada_grid_search.best_params_, ada_grid_search.best_score_)



{'learning_rate': 0.1, 'n_estimators': 100} 0.8185283687943261
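
Beyond the single best combination, `cv_results_` holds the mean and spread of the score for every candidate, which helps judge how close the runners-up are. This is a minimal sketch on synthetic data with a reduced grid; `X_syn`, `y_syn`, `search`, and `results` are illustrative names, and the printed scores say nothing about the heart disease data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data.
X_syn, y_syn = make_classification(n_samples=200, n_features=13, random_state=42)

search = GridSearchCV(
    estimator=AdaBoostClassifier(random_state=42),
    param_grid={"n_estimators": [10, 20], "learning_rate": [0.1, 1.0]},  # reduced grid
    scoring="accuracy",
    cv=5,
)
search.fit(X_syn, y_syn)

# One row per parameter combination, best-ranked first.
columns = [
    "param_n_estimators",
    "param_learning_rate",
    "mean_test_score",
    "std_test_score",
    "rank_test_score",
]
results = pd.DataFrame(search.cv_results_)[columns].sort_values("rank_test_score")
print(results.to_string(index=False))
```

If several candidates score within one standard deviation of each other, small changes in preprocessing or library version can change which one is reported as best.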


In [33]:
assert ada_grid_search.best_params_ == {'learning_rate': 0.01, 'n_estimators': 100}, "Incorrect best parameters"
assert ada_grid_search.best_score_ > 0.579 and ada_grid_search.best_score_ < 0.580, "Incorrect best score"

AssertionError: Incorrect best parameters

This is the same kind of failure as in the previous test: the grid search result does not match the expected values.
