In [None]:
%load_ext nb_black

# ❤️ Heart Disease 🤒

The data today is from the [Framingham Heart Study](https://www.framinghamheartstudy.org/). Below is an excerpt from [its Wikipedia page](https://en.wikipedia.org/wiki/Framingham_Heart_Study):

> The Framingham Heart Study is a long-term, ongoing cardiovascular cohort study of residents of the city of Framingham, Massachusetts. The study began in 1948 with 5,209 adult subjects from Framingham, and is now on its fourth generation of participants. Prior to the study almost nothing was known about the epidemiology of hypertensive or arteriosclerotic cardiovascular disease. Much of the now-common knowledge concerning heart disease, such as the effects of diet, exercise, and common medications such as aspirin, is based on this longitudinal study.

### Warm-up 🥵

Warm-up warm-ups
* Describe what boosting is.
* How do random forests avoid overfitting?

Actual warm-up
* How do we use residuals in gradient boosted trees?
* How do we avoid overfitting in gradient boosted trees?

## Data Import and EDA

In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Pretty much, in practice:
# * if you want to use GradientBoostingClassifier
#     * use XGBClassifier instead
# * if you want to use GradientBoostingRegressor
#     * use XGBRegressor instead
from xgboost import XGBClassifier

import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline


data_url = "https://docs.google.com/spreadsheets/d/1Tx7KJ7iW8IkiU-aERYFXsKvDsbFJbr80POW_2DyuYGQ/export?format=csv"
heart = pd.read_csv(data_url)

Do basic EDA to get familiar with this heart data.

What would be the impact of dropping NAs?
* Calculate the percentage of rows that would remain after dropping NAs
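
As a minimal sketch of the calculation above, the snippet below uses a small made-up frame (`toy`, a hypothetical stand-in for `heart`) to compare the row count before and after `dropna`. The per-column missing share is an extra that can help show which columns cost the most rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `heart`; the values are made up.
toy = pd.DataFrame(
    {
        "age": [39, 46, 48, 61, 46],
        "glucose": [77.0, np.nan, 70.0, 103.0, np.nan],
        "education": [4.0, 2.0, np.nan, 3.0, 3.0],
    }
)

# Share of rows that survive dropping any row with a missing value
pct_remaining = len(toy.dropna()) / len(toy) * 100

# Per-column share of missing values
pct_missing_by_col = toy.isna().mean() * 100

print(f"{pct_remaining:.1f}% of rows remain")
print(pct_missing_by_col)
```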

Do we have balanced classes?  If our model gets 85% accuracy, should we consider that good?
* Calculate percentages of each class using `value_counts` and the `normalize` argument
* Show a bar plot of the counts of each class
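
One way to put an accuracy number in context is to compare it with the accuracy of always predicting the most common class. The sketch below does this on a made-up target (`target` is a hypothetical stand-in for `heart["TenYearCHD"]`, not the real class split).

```python
import pandas as pd

# Hypothetical toy target; the 80/20 split is made up for illustration.
target = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], name="TenYearCHD")

# Fraction of rows in each class
class_share = target.value_counts(normalize=True)

# Accuracy of a "model" that always predicts the most common class
baseline_accuracy = class_share.max()

print(class_share)
print(f"Majority-class baseline accuracy: {baseline_accuracy:.0%}")
```

Any model accuracy is only informative relative to this baseline.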

Let's visualize our data with respect to our target variable, `'TenYearCHD'`. We actually have a lot of categorical variables here that are already encoded as numbers for us. We might consider re-encoding education, but it's already encoded as ordinal, so let's keep it as is and come back to it if we think it will help.

However, it might make more sense to visualize these as categorical rather than continuous.

In [None]:
cat_cols = ["education"]

bin_cols = [
    "male",
    "currentSmoker",
    "BPMeds",
    "prevalentStroke",
    "prevalentHyp",
    "diabetes",
]

# Grouping binary with other categorical for viz
all_cat_cols = cat_cols + bin_cols

num_cols = [
    "age",
    "cigsPerDay",
    "totChol",
    "sysBP",
    "diaBP",
    "BMI",
    "heartRate",
    "glucose",
]

* What's an appropriate chart type to plot our categorical variables with our categorical target variable?
* Write a `for` loop to iterate over the categorical column names (in `all_cat_cols`)
* Show a plot of `'TenYearCHD'` with each of the categorical variables.
* Use the target as the main variable and color by the categorical column

* What's an appropriate chart type to plot our continuous variables with our categorical target variable?

## Model Prep

* Perform a train test split

In [None]:
X = heart.drop(columns=["TenYearCHD"])
y = heart["TenYearCHD"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

* Define a `ColumnTransformer` to scale the numeric columns
   * Leave the remaining columns untouched
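
A minimal sketch of this pattern on a made-up two-column frame is shown below. The transformer name `"scale"` and the variable names are hypothetical; `remainder="passthrough"` is the `ColumnTransformer` option that leaves unlisted columns unchanged.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame; the column names mirror two of the notebook's columns.
toy = pd.DataFrame({"age": [40.0, 50.0, 60.0], "male": [1, 0, 1]})

toy_preprocessing = ColumnTransformer(
    [("scale", StandardScaler(), ["age"])],  # scale only the listed columns
    remainder="passthrough",  # keep every other column unchanged
)

# Output columns: the scaled columns first, then the passed-through ones
transformed = toy_preprocessing.fit_transform(toy)
print(transformed)
```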

In [None]:
# fmt: off
preprocessing = ColumnTransformer([
    ____
], ____)
# fmt: on

* Define a `Pipeline` with:
    * the `ColumnTransformer` preprocessing as the first step
    * an `XGBClassifier` as the second step

In [None]:
# fmt: off
pipeline = Pipeline([
    ____,
    ____
])
# fmt: on

* Fit the pipeline to the training data with the default params


* What is the overall accuracy?
* Are we overfitting?
* Is this a good accuracy?

In [None]:
pipeline.fit(X_train, y_train)

train_score = pipeline.score(X_train, y_train)
test_score = pipeline.score(X_test, y_test)

print(f"Train score: {train_score}")
print(f"Test score: {test_score}")

* How are we making mistakes?
  * Show a `confusion_matrix` and a `classification_report`
* In the context of the problem, what kind of mistake is the worst to make?
   * Mistake 1: Tell someone they're at risk when they're not
   * Mistake 2: Tell someone they're not at risk when they are
* Based on that, what number from a `classification_report` are we most interested in?
   * Do we want to maximize or minimize this value?
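
To connect the two mistakes to the numbers in the reports, here is a small sketch on made-up labels (`y_true` and `y_hat` are hypothetical, with 1 meaning "at risk"). If Mistake 2 is judged the costlier one, recall for the positive class is the figure that tracks it, since it falls as false negatives rise.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels and predictions; 1 means "at risk".
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

# Recall for the positive class: share of truly at-risk people we caught
positive_recall = recall_score(y_true, y_hat, pos_label=1)

print(f"False positives (Mistake 1): {fp}")
print(f"False negatives (Mistake 2): {fn}")
print(f"Recall for class 1: {positive_recall:.2f}")
```

Note that accuracy here is 60% even though only one of the four at-risk cases is caught.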

In [None]:
y_pred = pipeline.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

We can try a grid search to see if we get better performance with better parameters.

<img src='https://i.stack.imgur.com/9GgQK.jpg' width='70%'>

Translation of main parameters of interest:
* Name in table - `xgb_parameter_name`

---

* \# of Trees - `n_estimators`
* Learning Rate - `learning_rate`
* Row Sampling - `subsample`
* Column Sampling - `colsample_bytree`
* Max Tree Depth - `max_depth`

---

* Set up a grid search using this pictured slide as guidance
* What were the best params according to this search?
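
When the estimator sits inside a `Pipeline`, grid keys take the form `<step name>__<parameter name>`. The sketch below illustrates only that naming pattern. To stay dependency-light it swaps in scikit-learn's `GradientBoostingClassifier` for `XGBClassifier` (the two do not share all parameter names), and the data, step names (`"scale"`, `"gb"`), and grid values are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only so the sketch runs on its own
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# "scale" and "gb" are hypothetical step names; the grid keys must reuse them
demo_pipeline = Pipeline(
    [
        ("scale", StandardScaler()),
        ("gb", GradientBoostingClassifier(n_estimators=20, random_state=0)),
    ]
)

# Keys follow the pattern "<step name>__<parameter name>"
demo_params = {
    "gb__subsample": [0.5, 1.0],
    "gb__max_depth": [2, 3],
}

demo_cv = GridSearchCV(demo_pipeline, demo_params, cv=2)
demo_cv.fit(X_demo, y_demo)

print(demo_cv.best_params_)
```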

In [None]:
# Adjusted colsample_bytree/max_depth to have smaller grid
params = {
    "____": [0.5, 0.75, 1.0],
    "____": [0.5, 0.75, 1.0],
    "____": [5, 7, 10],
}

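# Not part of the grid: these only take effect if they are passed to the
# XGBClassifier step as n_estimators and learning_rate
n_trees = 100
learning_rate = 2 / n_trees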

In [None]:
pipeline_cv = GridSearchCV(pipeline, params, verbose=1, cv=2)
pipeline_cv.fit(X_train, y_train)

pipeline_cv.best_params_

* How does this affect our performance?
* Would we want to deploy this model to predict heart disease?

In [None]:
train_score = pipeline_cv.score(X_train, y_train)
test_score = pipeline_cv.score(X_test, y_test)

print(f"Train score: {train_score}")
print(f"Test score: {test_score}\n")

y_pred = pipeline_cv.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

We're having a lot of trouble with this class imbalance problem. Our model is heavily biased towards predicting the negative class because, most of the time, that prediction is correct.

There are strategies for dealing with class imbalance, and some common, easy-to-apply ones are listed here: https://elitedatascience.com/imbalanced-classes.

Let's look into a sampling approach to balance the classes in our training set.

* Separate the training data into 2 dataframes:
    * One with the majority class
    * One with the minority class

* How many rows does each have?

* Use sampling to give both classes the same number of rows
    * 'Up sample' with replacement for the minority class
    * 'Down sample' without replacement for the majority class
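
The resampling steps above can be sketched with `sklearn.utils.resample` on a made-up training frame. The frame `train`, the target size `n_per_class`, and the seed are hypothetical choices for illustration. Only the training data is resampled, so the test set keeps its original class balance.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical toy training frame: 8 majority rows, 2 minority rows
train = pd.DataFrame({"x": range(10), "TenYearCHD": [0] * 8 + [1] * 2})

majority = train[train["TenYearCHD"] == 0]
minority = train[train["TenYearCHD"] == 1]

n_per_class = 5  # assumed target size, between the two class counts

# Down-sample the majority class without replacement
majority_down = resample(
    majority, replace=False, n_samples=n_per_class, random_state=42
)
# Up-sample the minority class with replacement
minority_up = resample(
    minority, replace=True, n_samples=n_per_class, random_state=42
)

balanced = pd.concat([majority_down, minority_up])
print(balanced["TenYearCHD"].value_counts())
```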

In [None]:
n = ____

* Redefine `X_train` and `y_train` with your resampled data

* Refit the same GridSearchCV object but with this new training data
* Print out the best parameters

In [None]:
params = {
    "xgb__subsample": [0.5, 0.75, 1.0],
    # XGBClassifier has no max_features; column sampling is colsample_bytree
    "xgb__colsample_bytree": [0.5, 0.75, 1.0],
    "xgb__max_depth": [3, 4, 5],
}

n_trees = 100
learning_rate = 2 / n_trees

In [None]:
pipeline_cv = GridSearchCV(pipeline, params, verbose=1, cv=2)
pipeline_cv.fit(X_train, y_train)

pipeline_cv.best_params_

* Is the performance better? worse? different at all?

In [None]:
train_score = pipeline_cv.score(X_train, y_train)
test_score = pipeline_cv.score(X_test, y_test)

print(f"Train score: {train_score}")
print(f"Test score: {test_score}\n")

y_pred = pipeline_cv.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))