In [1]:
%reload_ext nb_black
import matplotlib.pyplot as plt

plt.style.use(["dark_background"])
%matplotlib ipympl

<IPython.core.display.Javascript object>

In [2]:
# import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.metrics import confusion_matrix, classification_report

# !pip install category_encoders
from category_encoders import LeaveOneOutEncoder

<IPython.core.display.Javascript object>

# 🎄🌳🌴🌱🌲

☝️That's a pretty random forest

We're going to revisit the mammographic mass data set.  Details below.

Dataset from UCI can be found [here](http://archive.ics.uci.edu/ml/datasets/mammographic+mass).

1. BI-RADS assessment: 1 to 5 (ordinal)
2. Age: patient's age in years (integer)
3. Shape: mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal)
4. Margin: mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal)
5. Density: mass density high=1 iso=2 low=3 fat-containing=4 (ordinal)
6. Severity: benign=0 or malignant=1 (binary)

## Data prep time!

In [3]:
data_url = "https://docs.google.com/spreadsheets/d/1d4TGnU2PYppNiRJIby7NQB2hfvWb8I8eyWWi2og_Zf4/export?format=csv"
columns = ["BI-RADS", "Age", "Shape", "Margin", "Density", "Severity"]

<IPython.core.display.Javascript object>

In [4]:
df = pd.read_csv(data_url, names=columns)
df.head(5)

Unnamed: 0,BI-RADS,Age,Shape,Margin,Density,Severity
0,5,67,3,5,3,1
1,4,43,1,1,?,1
2,5,58,4,5,3,1
3,4,28,1,1,3,0
4,5,74,1,5,?,1


<IPython.core.display.Javascript object>

This data encoded NaNs as `?`.  Convert the `?`s to NA and the columns to numeric.

In [5]:
df.dtypes

BI-RADS     object
Age         object
Shape       object
Margin      object
Density     object
Severity     int64
dtype: object

<IPython.core.display.Javascript object>

> all columns except for severity are objects not numeric vars

In [6]:
df = df.apply(pd.to_numeric, errors="coerce", axis="columns")

<IPython.core.display.Javascript object>

In [7]:
df.dtypes

BI-RADS     float64
Age         float64
Shape       float64
Margin      float64
Density     float64
Severity    float64
dtype: object

<IPython.core.display.Javascript object>

In [8]:
# change object vars to numeric cols w/ pd.to_numeric
# for col in df:
#     df[col] = pd.to_numeric(df[col], errors="coerce")

<IPython.core.display.Javascript object>

Drop NAs

In [9]:
df = df.dropna()

<IPython.core.display.Javascript object>

Okie doke, from the description we had some 'nominal' (aka categorical columns).  We want to encode these.  The nominal columns are: `['Shape', 'Margin']`.

We're going to switch things up and use `category_encoders.LeaveOneOutEncoder` instead of `sklearn.preprocessing.OneHotEncoder`.  More on this encoder can be seen in the `leave_one_out_encoding.ipynb` notebook in this folder.

In [10]:
from sklearn.compose import ColumnTransformer

<IPython.core.display.Javascript object>

In [11]:
cat_cols = ["Shape", "Margin"]
drop_cats = [1, 1]
num_cols = ["Age", "Density"]

<IPython.core.display.Javascript object>

In [12]:
# using leave_one_out_encoding on cat_cols
preprocessing = ColumnTransformer(
    [
        ("encode_cats", LeaveOneOutEncoder(), cat_cols),
        #         ('encode_cats', OneHotEncoder(drop=drop_cats), cat_cols),
    ],
    remainder="passthrough",
)

<IPython.core.display.Javascript object>

Last bit of data prep is to separate out into our `X` and `y` components and `train_test_split()`.  We're predicting the `'Severity'` variable.

In [13]:
# train test split on vars
X = df.drop(columns=["Severity", "BI-RADS"])
y = df["Severity"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

<IPython.core.display.Javascript object>

In [14]:
# fit preprocessing to the X_train data
preprocessing.fit(X_train, y_train)

# transform train and test data
X_train = pd.DataFrame(
    preprocessing.transform(X_train), index=X_train.index, columns=X_train.columns
)
X_test = pd.DataFrame(
    preprocessing.transform(X_test), index=X_test.index, columns=X_test.columns
)

# X_train.head()

<IPython.core.display.Javascript object>

## Random Forest background
### Concept 1: Bootstrapping ☠️

Fancier name than method.  Bootstrapping is repeatedly sampling with replacement.

In [17]:
# X_train.sample?

<IPython.core.display.Javascript object>

Sample 3 rows from `X_train`.

In [24]:
X_sample = X_train.sample(3, replace=True)
X_sample

Unnamed: 0,Age,Shape,Margin,Density
876,3.0,1.0,41.0,3.0
836,2.0,1.0,42.0,3.0
234,4.0,5.0,64.0,3.0


<IPython.core.display.Javascript object>

Select the same 3 rows from `y_train`

In [25]:
y_sample = y_train[X_sample.index]
y_sample

876    0.0
836    0.0
234    1.0
Name: Severity, dtype: float64

<IPython.core.display.Javascript object>

Let's write a function to do this for us.

In [26]:
def xy_sample(X, y, size, random_state=None):
    X_sample = X.sample(3, replace=True, random_state=random_state)
    y_sample = y[X_sample.index]

    return X_sample, y_sample

<IPython.core.display.Javascript object>

In [28]:
# X_sample, y_sample = xy_sample(X_train, y_train, 3, 28)
# display(X_sample)
# display(y_sample)

Unnamed: 0,Age,Shape,Margin,Density
344,1.0,1.0,62.0,3.0
154,4.0,4.0,34.0,3.0
850,2.0,1.0,37.0,2.0


344    0.0
154    0.0
850    0.0
Name: Severity, dtype: float64

<IPython.core.display.Javascript object>

So all we want to do is repeat that a few times.

In [29]:
n_samples = 5
sample_size = 3

bootstrap_samples = []
# Fill in the for loop for us to iterate and make samples
# The number of samples we want to make is stored in n_samples
for i in range(n_samples):
    # Perform the sampling like we just did
    # Use the sample_size variable
    X_sample, y_sample = xy_sample(X_train, y_train, sample_size)

    # Store in a dictionary to have nice X y labels
    train_sample = {"X": X_sample, "y": y_sample}

    # Store all our samples together in a list
    bootstrap_samples.append(train_sample)


bootstrap_samples

[{'X':      Age  Shape  Margin  Density
  253  4.0    4.0    70.0      3.0
  18   1.0    1.0    54.0      3.0
  698  4.0    4.0    46.0      2.0,
  'y': 253    1.0
  18     1.0
  698    0.0
  Name: Severity, dtype: float64},
 {'X':      Age  Shape  Margin  Density
  959  4.0    5.0    66.0      3.0
  173  4.0    4.0    44.0      3.0
  630  4.0    4.0    63.0      3.0,
  'y': 959    1.0
  173    0.0
  630    0.0
  Name: Severity, dtype: float64},
 {'X':      Age  Shape  Margin  Density
  695  2.0    1.0    73.0      3.0
  31   4.0    4.0    54.0      3.0
  781  1.0    1.0    23.0      3.0,
  'y': 695    0.0
  31     1.0
  781    0.0
  Name: Severity, dtype: float64},
 {'X':      Age  Shape  Margin  Density
  864  4.0    4.0    55.0      3.0
  527  2.0    4.0    61.0      3.0
  782  4.0    5.0    56.0      3.0,
  'y': 864    1.0
  527    1.0
  782    1.0
  Name: Severity, dtype: float64},
 {'X':      Age  Shape  Margin  Density
  588  4.0    4.0    59.0      3.0
  846  2.0    1.0    59.0

<IPython.core.display.Javascript object>

Boom 💥we're bonified bootstrappers.

### Concept 2: Bagging 💰

Kind of some overlap with concept 1....

<font color='red'>B</font><font color='blue'>AGGING</font> = <font color='red'>B</font>ootstrap <font color='blue'>AGG</font>regat<font color='blue'>ING</font>

* Step 1: Build a bunch of models on bootstrap samples
* Step 2: Aggregate the predictions of each model
* Step 3: dQw4w9WgXcQ
* Step 4: Profit

In [None]:
# Create a sample of size 10 like we've been doing


# Fit a decision tree to this sample
tree_1 = DecisionTreeClassifier()
tree_1.fit(X_sample, y_sample)

Second verse, same as the first.

In [None]:
# Create a sample of size 10 like we've been doing


# Fit a decision tree to this sample
tree_2 = DecisionTreeClassifier()
tree_2.fit(X_sample, y_sample)

Again!

In [None]:
# Create a sample of size 10 like we've been doing


# Fit a decision tree to this sample
tree_3 = DecisionTreeClassifier()
tree_3.fit(X_sample, y_sample)

In [None]:
pred_1 = tree_1.predict(X_test)
pred_2 = tree_2.predict(X_test)
pred_3 = tree_3.predict(X_test)
pred_df = pd.DataFrame({'pred_1': pred_1, 'pred_2': pred_2, 'pred_3': pred_3})
pred_df

Who do we believe??  Let's be fair and just rulers, we'll take all our trees' votes into consideration like a true democracy.

In [None]:
pred_df['avg_vote'] = ____
pred_df

Convert the `'avg_vote'` column to a binary label.  Use 0.5 as a cutoff

What Percentage of the predictions are correct?

We just fit 3 pretty naive models.  I say naive because they each only saw 10 records, but there's strength in numbers! This is the idea behind bagging, each model sees a different side of the data so they have different 'experiences' and 'perspectives' on whats right and wrong.  By considering all of the 'opinions' equally we avoid overfitting and we're able to get higher accuracy (in general) than using a single model.

Here comes the downside...

When we did just 1 decision tree, we were able to plot a nice diagram of how it made its decisions.  In our example we just made 3 trees, we could plot each one, but trying to view all these decisions would be a lot.  So we just lost the nice intrepretability that came with a single tree.  In practice, we'll typically have more than 3 trees and this becomes harder and harder to explain (we'll see a way to deal with this).

### Concept 3: Random subspace 🌒

Our `X` component is sometimes referred to as our 'feature space'.  A 'subspace' is a subset of a 'space'.  So this fancy term just means that we'll be taking a sample of our columns.  We do this without replacement.

In [None]:
# X_train.sample?

Well that wasn't too bad, but how does it fit into a random forest?  A random forest will only look at a few of the columns for each decision (i.e. a random subspace).  By doing this, we further protect against overfitting.  It's assuming that we want to learn patterns from every one of our features, if we happened to have a really powerful feature, we might end up only learning from it.  But with a random subspace, that powerful feature won't always be there as a crutch and so we're forced to learn from our other columns too.

So we just defined all the concepts of a random forest. Let's use one.

## Random Forests in action

Fit a random forest classifier to the data.

In [None]:
model = _____
model.fit(X_train, y_train)

Print out the accuracy of the predictor on the training and test data.

In [None]:
train_score = model.____
test_score = model.____

print(f'train_score: ____')
print(f'test_score: ____')

Let's see more than just accuracy, how can we see a view of our true-positives, false-positives, etc.?

Based on this output, do we have higher precision or recall?  What `sklearn` function could we use to prove this?

### Importance for intepretability

The 'importances' are stored in the `feature_importances_` attribute of our model.  What does the trailing underscore mean?

Store the importances in a dataframe with a column for each features name.

Order the dataframe from most to least important.

So shape is the most important feature in determining if these mammographic masses are benign or malignant.  What does that mean?  Remember that each feature is only chosen if it's the best split available, and that the way this is chosen is based on the 'information gain'.  We have a lot of trees, and we aggregate these measures of information gain across all the trees to get importance.  So the more important a feature, the more useful it was in separating our 2 classes across all of our forest.

### Importance for feature selection

Our forest's feature importances are letting us know what is best to identify the classes.  Why not use these to indicate which features we should use in a model.

You might do this in the case that you'd like to use linear regression but you want to subset down to useful features.  This is sort of like a more manual LASSO, but linearity isn't considered in the selection.

We still need to consider multicollinearity, highly correlated features can lead to unreliable importance numbers.  (i.e. maybe the model finds temperature is really important, but it used a celsius column half the time and a farenheit column the other times, this would lower the importance of those 2 columns)

* Use `sklearn`'s `SelectFromModel` to select the best 3 features for predicting the target.

### Tuning Random Forest

We have the same hyperparameters as we did for decision trees.  In addition, we have a parameter for how many trees should be in our forest.

* Use `sklearn`'s `GridSearchCV` to choose the best combination of hyperparameters and fit a model 
* Evaluate the model's performance.