In [1]:
%load_ext nb_black


In [2]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report

# !pip install category_encoders
from category_encoders import LeaveOneOutEncoder

# !pip3 install scikit-optimize
from skopt import BayesSearchCV


In [3]:
import sklearn
import skopt

print(f"sklearn version: {sklearn.__version__}")
print(f"skopt version: {skopt.__version__}")

sklearn version: 0.22
skopt version: 0.7.4



# 🎄🌳🌴🌱🌲

☝️That's a pretty random forest

We're going to revisit the mammographic mass data set.  Details below.

Dataset from UCI can be found [here](http://archive.ics.uci.edu/ml/datasets/mammographic+mass).

1. BI-RADS assessment: 1 to 5 (ordinal)
2. Age: patient's age in years (integer)
3. Shape: mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal)
4. Margin: mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal)
5. Density: mass density high=1 iso=2 low=3 fat-containing=4 (ordinal)
6. Severity: benign=0 or malignant=1 (binary)

## Data prep time!

In [4]:
data_url = "https://docs.google.com/spreadsheets/d/1d4TGnU2PYppNiRJIby7NQB2hfvWb8I8eyWWi2og_Zf4/export?format=csv"
columns = ["BI-RADS", "Age", "Shape", "Margin", "Density", "Severity"]


In [5]:
breast_cancer = pd.read_csv(data_url, names=columns)


Do some things to get to know your data.

In [6]:
breast_cancer.head()

Unnamed: 0,BI-RADS,Age,Shape,Margin,Density,Severity
0,5,67,3,5,3,1
1,4,43,1,1,?,1
2,5,58,4,5,3,1
3,4,28,1,1,3,0
4,5,74,1,5,?,1



In [7]:
breast_cancer.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 961 entries, 0 to 960
Data columns (total 6 columns):
BI-RADS     961 non-null object
Age         961 non-null object
Shape       961 non-null object
Margin      961 non-null object
Density     961 non-null object
Severity    961 non-null int64
dtypes: int64(1), object(5)
memory usage: 45.2+ KB



In [8]:
breast_cancer.describe()

Unnamed: 0,Severity
count,961.0
mean,0.463059
std,0.498893
min,0.0
25%,0.0
50%,0.0
75%,1.0
max,1.0



We see in the `head()` output some `?` in the `Density` column.  This might be why every feature column is `object` rather than numeric.  How can we check whether `?` is the only cause of the `object` dtype?  We want to make sure we don't drop anything useful.
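One way to answer that question is to try a numeric conversion and look at whatever fails.  The sketch below uses a tiny made-up frame (the `raw` data and the `non_numeric_values` helper are hypothetical, not part of this notebook); the same helper could be pointed at `breast_cancer` before the cleanup cell.

```python
import pandas as pd

# Toy frame standing in for the raw data (values are made up for illustration)
raw = pd.DataFrame(
    {
        "Age": ["67", "43", "?", "28"],
        "Density": ["3", "?", "3", "n/a"],
        "Severity": [1, 1, 1, 0],
    }
)


def non_numeric_values(df):
    """Return {column: set of values that can't be parsed as numbers}."""
    bad = {}
    for col in df.columns:
        as_num = pd.to_numeric(df[col], errors="coerce")
        # Values that were present originally but became NaN after coercion
        mask = as_num.isna() & df[col].notna()
        if mask.any():
            bad[col] = set(df.loc[mask, col])
    return bad


bad_values = non_numeric_values(raw)
print(bad_values)
```

If the only offender that comes back is `?`, replacing it with `np.nan` is safe.  Anything else (like the `n/a` in the toy data) would deserve a closer look first.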

In [9]:
breast_cancer = breast_cancer.replace("?", np.nan)
breast_cancer = breast_cancer.apply(pd.to_numeric, axis=1)
breast_cancer = breast_cancer.dropna()
breast_cancer.head(4)

Unnamed: 0,BI-RADS,Age,Shape,Margin,Density,Severity
0,5.0,67.0,3.0,5.0,3.0,1.0
2,5.0,58.0,4.0,5.0,3.0,1.0
3,4.0,28.0,1.0,1.0,3.0,0.0
8,5.0,57.0,1.0,5.0,3.0,1.0



In [10]:
breast_cancer.describe()

Unnamed: 0,BI-RADS,Age,Shape,Margin,Density,Severity
count,830.0,830.0,830.0,830.0,830.0,830.0
mean,4.393976,55.781928,2.781928,2.813253,2.915663,0.485542
std,1.888371,14.671782,1.242361,1.567175,0.350936,0.500092
min,0.0,18.0,1.0,1.0,1.0,0.0
25%,4.0,46.0,2.0,1.0,3.0,0.0
50%,4.0,57.0,3.0,3.0,3.0,0.0
75%,5.0,66.0,4.0,4.0,3.0,1.0
max,55.0,96.0,4.0,5.0,4.0,1.0



In [11]:
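# 55 looks like a typo for 5
breast_cancer.loc[breast_cancer["BI-RADS"] == 55, "BI-RADS"] = 5.0
# The data description says BI-RADS runs from 1 to 5, so drop rows coded 0 or 6
breast_cancer = breast_cancer[~breast_cancer["BI-RADS"].isin([6.0, 0.0])]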
breast_cancer["BI-RADS"].value_counts()

4.0    468
5.0    317
3.0     24
2.0      7
Name: BI-RADS, dtype: int64


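Last bit of data prep is to separate out our `X` and `y` components and `train_test_split()`.  We're predicting the `'Severity'` variable.  We also leave `'BI-RADS'` out of `X`: it is itself a radiologist's assessment of the mass rather than a measured attribute.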

In [12]:
X = breast_cancer.drop(columns=["Severity", "BI-RADS"])
y = breast_cancer["Severity"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


Okie doke, from the description we had some 'nominal' (aka categorical) columns.  We want to encode these.  The data description was a long way up, so I'll just let you know the nominal columns are: `['Shape', 'Margin']`.

We're going to switch things up and use `category_encoders.LeaveOneOutEncoder`.

In [13]:
encoder = LeaveOneOutEncoder(cols=["Shape", "Margin"])

encoder.fit(X_train, y_train)

X_train = encoder.transform(X_train)
X_test = encoder.transform(X_test)
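To get a feel for what leave-one-out encoding does, here is a minimal pandas sketch of the idea on made-up data.  It is not `category_encoders`' actual implementation, just the arithmetic: each category is replaced by the mean of the target, and the leave-one-out version excludes the row's own target from that mean.  The `toy` frame and its column values are assumptions for illustration.

```python
import pandas as pd

# Toy data (made up): one nominal feature and a binary target
toy = pd.DataFrame(
    {
        "Shape": [1, 1, 1, 4, 4],
        "Severity": [0, 0, 1, 1, 1],
    }
)

grp = toy.groupby("Shape")["Severity"]
cat_sum = grp.transform("sum")      # target total within the row's category
cat_count = grp.transform("count")  # number of rows in the row's category

# Plain target encoding: mean of the target within each category
toy["target_mean"] = cat_sum / cat_count

# Leave-one-out: the same mean, but excluding the row's own target value.
# A category with a single row would give 0/0 here; the toy data avoids that.
toy["loo_mean"] = (cat_sum - toy["Severity"]) / (cat_count - 1)

print(toy)
```

In the cell above, `transform()` is called without `y`, and the repeated encoded values in the later outputs (e.g. `0.765886` for several rows) suggest every row of a category gets the same per-category mean in that case.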


## Random Forest background
### Concept 1: Bootstrapping ☠️

The name is fancier than the method.  Bootstrapping is repeatedly sampling with replacement.

In [14]:
# X_train.sample?


Sample 3 rows from `X_train`.

In [15]:
X_sample = X_train.sample(n=3, replace=True)
X_sample

Unnamed: 0,Age,Shape,Margin,Density
365,39.0,0.171975,0.604938,3.0
788,67.0,0.765886,0.695,3.0
31,54.0,0.765886,0.695,3.0



Select the same 3 rows from `y_train`

In [16]:
y_sample = y_train.loc[X_sample.index]
y_sample

365    0.0
788    1.0
31     1.0
Name: Severity, dtype: float64


Let's write a function to do this for us.

In [17]:
def xy_sample(X, y, n, random_state=None):
    X_sample = X.sample(n=n, replace=True, random_state=random_state)
    y_sample = y.loc[X_sample.index]

    return X_sample, y_sample


So all we want to do is repeat that a few times.

In [18]:
n_samples = 5
sample_size = 3

bootstrap_samples = []
# Fill in the for loop for us to iterate and make samples
# The number of samples we want to make is stored in n_samples
for _ in range(n_samples):
    # Perform the sampling like we just did
    # Use the sample_size variable
    X_sample, y_sample = xy_sample(X_train, y_train, n=sample_size)

    # Store in a dictionary to have nice X y labels
    train_sample = {"X": X_sample, "y": y_sample}

    # Store all our samples together in a list
    bootstrap_samples.append(train_sample)


bootstrap_samples

[{'X':       Age     Shape    Margin  Density
  320  71.0  0.765886  0.714286      3.0
  749  56.0  0.765886  0.695000      3.0
  24   59.0  0.141791  0.695000      3.0, 'y': 320    1.0
  749    0.0
  24     1.0
  Name: Severity, dtype: float64}, {'X':       Age     Shape    Margin  Density
  328  72.0  0.765886  0.695000      3.0
  129  40.0  0.765886  0.806122      3.0
  304  54.0  0.532258  0.103175      3.0, 'y': 328    1.0
  129    1.0
  304    0.0
  Name: Severity, dtype: float64}, {'X':       Age     Shape    Margin  Density
  244  76.0  0.765886  0.695000      3.0
  647  64.0  0.141791  0.695000      3.0
  125  59.0  0.765886  0.604938      2.0, 'y': 244    1.0
  647    0.0
  125    0.0
  Name: Severity, dtype: float64}, {'X':       Age     Shape    Margin  Density
  913  57.0  0.765886  0.695000      3.0
  213  43.0  0.141791  0.103175      3.0
  801  83.0  0.765886  0.695000      2.0, 'y': 913    0.0
  213    1.0
  801    1.0
  Name: Severity, dtype: float64}, {'X':       Age


Boom 💥 we're bona fide bootstrappers.
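One consequence of sampling with replacement is worth seeing in numbers: a bootstrap sample the same size as the original data repeats some rows and misses others.  The sketch below uses plain row ids rather than `X_train`, and the size `n = 1000` is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
row_ids = np.arange(n)

# One bootstrap sample: n draws with replacement from n rows
boot = rng.choice(row_ids, size=n, replace=True)

# Fraction of the original rows that made it into the sample at least once
unique_frac = np.unique(boot).size / n

# Chance a given row is picked at least once in n draws;
# this approaches 1 - 1/e (about 0.632) as n grows
expected_frac = 1 - (1 - 1 / n) ** n

print(f"unique rows in sample: {unique_frac:.3f}")
print(f"expected fraction:     {expected_frac:.3f}")
```

So each bootstrap sample sees roughly two thirds of the rows, and a different two thirds each time.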

### Concept 2: Bagging 💰

Kind of some overlap with concept 1....

<font color='red'>B</font><font color='blue'>AGGING</font> = <font color='red'>B</font>ootstrap <font color='blue'>AGG</font>regat<font color='blue'>ING</font>

* Step 1: Build a bunch of models on bootstrap samples
* Step 2: Aggregate the predictions of each model
* Step 3: dQw4w9WgXcQ
* Step 4: Profit

In [19]:
# Create a sample of size 10 like we've been doing
X_sample, y_sample = xy_sample(X_train, y_train, n=10, random_state=42)

# Fit a decision tree to this sample
tree_1 = DecisionTreeClassifier()
tree_1.fit(X_sample, y_sample)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')


Second verse, same as the first.

In [20]:
# Create a sample of size 10 like we've been doing
X_sample, y_sample = xy_sample(X_train, y_train, n=10, random_state=8675309)

# Fit a decision tree to this sample
tree_2 = DecisionTreeClassifier()
tree_2.fit(X_sample, y_sample)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')


Again!

In [21]:
# Create a sample of size 10 like we've been doing
X_sample, y_sample = xy_sample(X_train, y_train, n=10, random_state=1337)

# Fit a decision tree to this sample
tree_3 = DecisionTreeClassifier()
tree_3.fit(X_sample, y_sample)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')


In [22]:
pred_1 = tree_1.predict(X_test)
pred_2 = tree_2.predict(X_test)
pred_3 = tree_3.predict(X_test)
pred_df = pd.DataFrame({"pred_1": pred_1, "pred_2": pred_2, "pred_3": pred_3})
pred_df.head()

Unnamed: 0,pred_1,pred_2,pred_3
0,1.0,1.0,1.0
1,1.0,1.0,1.0
2,1.0,0.0,1.0
3,0.0,0.0,0.0
4,1.0,1.0,1.0



Who do we believe??  Let's be fair and just rulers, we'll take all our trees' votes into consideration like a true democracy.

In [23]:
pred_df["avg_vote"] = pred_df.mean(axis=1)
pred_df.head()

Unnamed: 0,pred_1,pred_2,pred_3,avg_vote
0,1.0,1.0,1.0,1.0
1,1.0,1.0,1.0,1.0
2,1.0,0.0,1.0,0.666667
3,0.0,0.0,0.0,0.0
4,1.0,1.0,1.0,1.0



Convert the `'avg_vote'` column to a binary label.  Use 0.5 as a cutoff

In [24]:
pred_df["final_pred"] = (pred_df["avg_vote"] > 0.5).astype(int)
pred_df.head()

Unnamed: 0,pred_1,pred_2,pred_3,avg_vote,final_pred
0,1.0,1.0,1.0,1.0,1
1,1.0,1.0,1.0,1.0,1
2,1.0,0.0,1.0,0.666667,1
3,0.0,0.0,0.0,0.0,0
4,1.0,1.0,1.0,1.0,1



What percentage of the predictions are correct?

In [25]:
pred_df["actual"] = y_test.reset_index(drop=True)
pred_df["is_correct"] = pred_df["actual"] == pred_df["final_pred"]

pred_df["is_correct"].mean()

0.7073170731707317


We just fit 3 pretty naive models.  I say naive because they each only saw 10 records, but there's strength in numbers! This is the idea behind bagging: each model sees a different side of the data, so they have different 'experiences' and 'perspectives' on what's right and wrong.  By weighing all of the 'opinions' equally we reduce overfitting, and we generally get higher accuracy than we would from a single model.
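The three hand-built trees generalize to any number of trees with a loop.  Here is a minimal sketch of bagging by hand, assuming synthetic stand-in data from `make_classification` (not the mammographic data) and made-up settings for `n_trees` and `sample_size`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the mammographic data set)
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25      # odd, so a majority vote can never tie
sample_size = 50  # rows per bootstrap sample

votes = []
for _ in range(n_trees):
    # Step 1: bootstrap sample (row positions drawn with replacement) + fit a tree
    idx = rng.integers(0, len(X_tr), size=sample_size)
    tree = DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx])
    votes.append(tree.predict(X_te))

# Step 2: aggregate. Average the 0/1 votes and take the majority
votes = np.array(votes)  # shape (n_trees, n_test_rows)
avg_vote = votes.mean(axis=0)
bagged_pred = (avg_vote > 0.5).astype(int)

single_acc = (votes == y_te).mean(axis=1)  # accuracy of each tree on its own
bagged_acc = (bagged_pred == y_te).mean()

print(f"mean single-tree accuracy: {single_acc.mean():.3f}")
print(f"bagged accuracy:           {bagged_acc:.3f}")
```

Comparing the two printed numbers is a quick way to see how much the vote helps on a given data set.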

Here comes the downside...

When we did just 1 decision tree, we were able to plot a nice diagram of how it made its decisions.  In our example we only made 3 trees; we could plot each one, but trying to follow all of those decisions at once would be a lot.  So we just lost the nice interpretability that came with a single tree.  In practice, we'll typically have far more than 3 trees, and this becomes harder and harder to explain (we'll see a way to combat this).

### Concept 3: Random feature subspace 🌒

Our `X` component is sometimes referred to as our 'feature space'.  A 'subspace' is a subset of a 'space'.  So this fancy term just means that we'll be taking a sample of our columns.  We do this without replacement.

In [26]:
X_train.sample(frac=0.6, axis=1).head()

Unnamed: 0,Margin,Age
562,0.695,53.0
428,0.806122,58.0
382,0.806122,52.0
822,0.103175,48.0
135,0.806122,46.0



Well that wasn't too bad, but how does it fit into a random forest?  A random forest only looks at a few of the columns for each decision (i.e. a random subspace).  This further protects against overfitting.  The assumption is that we want to learn patterns from every one of our features; if we happened to have one really powerful feature, we might end up learning only from it.  With a random subspace, that powerful feature won't always be there as a crutch, so we're forced to learn from our other columns too.
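The "for each decision" part is the key difference from the column sample above: the columns are redrawn at every split, not once per tree.  A rough sketch of that draw follows.  The `max_features = 0.6` value is an arbitrary choice here, and the loop only mimics the sampling step, not the actual split search inside `sklearn`.

```python
import numpy as np

feature_names = np.array(["Age", "Shape", "Margin", "Density"])

# With a fractional max_features, sklearn's docs give the per-split count as
# max(1, int(max_features * n_features))
max_features = 0.6
n_considered = max(1, int(max_features * len(feature_names)))

rng = np.random.default_rng(42)
candidate_sets = []
for split_number in range(3):
    # Each split draws its own candidate columns, without replacement
    candidates = rng.choice(feature_names, size=n_considered, replace=False)
    candidate_sets.append(candidates.tolist())
    print(f"split {split_number}: pick the best split among {candidates.tolist()}")
```

With 4 features and `max_features=0.6`, each split gets to choose between 2 columns, so even a dominant feature sits out a good share of the decisions.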

So we just defined all the concepts of a random forest. Let's use one.

## Random Forests in action

Fit a random forest classifier to the data.

In [27]:
grid = {
    # Note, these aren't really best practice params (they might work well)
    # Just made up
    # Decision tree hyperparams
    "max_depth": [5, 10, 50, 100],  # too high -> overfit
    "max_leaf_nodes": [5, 10, 50, 100],  # too high -> overfit
    "min_samples_leaf": [5, 10, 50, 100],  # too low -> overfit
    # Forest hyperparams
    "n_estimators": [10, 50, 100],  # too low -> underfit (rel robust to overfit)
    "max_features": [0.4, 0.6, 0.8],  # too high -> overfit
    "max_samples": [0.4, 0.6, 0.8],  # too high -> overfit
}


In [28]:
# fmt: off
model = BayesSearchCV(
    RandomForestClassifier(), 
    grid, 
    # Controls how many hyperparam combinations to fit
    # So instead of exhaustive combinations, this will fit
    # 10 separate combinations (smartly chosen combinations)
    # 10 was chosen to make this run faster; in practice, with a grid
    # this large, you would want a bigger number
    n_iter=10
)

model.fit(
    # There were warnings about converting int to float
    # Bypassing warnings by converting everything to float rn
    X_train.values.astype(np.float64),
    y_train.values.astype(np.float64)
)
# fmt: on

model.best_params_



OrderedDict([('max_depth', 5),
             ('max_features', 0.6),
             ('max_leaf_nodes', 100),
             ('max_samples', 0.8),
             ('min_samples_leaf', 10),
             ('n_estimators', 100)])
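To put `n_iter=10` in perspective, we can count how many combinations an exhaustive search over this grid would have to fit.  The snippet below just redoes the arithmetic on a copy of the grid from above.

```python
from math import prod

# Same grid as above, copied so this snippet runs on its own
grid = {
    "max_depth": [5, 10, 50, 100],
    "max_leaf_nodes": [5, 10, 50, 100],
    "min_samples_leaf": [5, 10, 50, 100],
    "n_estimators": [10, 50, 100],
    "max_features": [0.4, 0.6, 0.8],
    "max_samples": [0.4, 0.6, 0.8],
}

# Exhaustive search fits every combination: 4 * 4 * 4 * 3 * 3 * 3
n_combinations = prod(len(values) for values in grid.values())
n_iter = 10

print(f"exhaustive combinations: {n_combinations}")
print(f"share tried with n_iter={n_iter}: {n_iter / n_combinations:.2%}")
```

That's 1728 combinations, each of which would also be refit once per cross-validation fold, which is why only sampling a handful makes the cell run so much faster.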


Print out the accuracy of the predictor on the training and test data.

In [29]:
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)

print(f"train_score: {train_score}")
print(f"test_score: {test_score}")

train_score: 0.8297546012269938
test_score: 0.7804878048780488



Let's see more than just accuracy, how can we see a view of our true-positives, false-positives, etc.?

In [30]:
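y_pred_prob = model.predict_proba(X_test)[:, 1]
# Cutoff of 0.2 instead of the usual 0.5: flag more cases as malignant,
# accepting more false positives in exchange for fewer false negatives
y_pred = (y_pred_prob > 0.2).astype(int)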

confusion_matrix(y_test, y_pred)

array([[44, 35],
       [ 8, 77]])


Based on this output, do we have higher precision or recall?  What `sklearn` function could we use to prove this?

In [31]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         0.0       0.85      0.56      0.67        79
         1.0       0.69      0.91      0.78        85

    accuracy                           0.74       164
   macro avg       0.77      0.73      0.73       164
weighted avg       0.76      0.74      0.73       164
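The numbers in the report can be recovered by hand from the confusion matrix, which is a good sanity check on which cell is which.  This sketch hard-codes the matrix printed above (rows are the true class, columns the predicted class) and computes the metrics for the malignant class.

```python
import numpy as np

# Confusion matrix from the output above: rows = actual, columns = predicted
cm = np.array([[44, 35],
               [8, 77]])

# For binary labels, sklearn's layout flattens to tn, fp, fn, tp
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)       # of the predicted malignant, how many were?
recall = tp / (tp + fn)          # of the truly malignant, how many did we catch?
accuracy = (tp + tn) / cm.sum()  # share of all predictions that were right

print(f"precision: {precision:.2f}")
print(f"recall:    {recall:.2f}")
print(f"accuracy:  {accuracy:.2f}")
```

These match the `1.0` row of the report: recall is well above precision, which is what a low probability cutoff buys.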




### Importance for interpretability

The 'importances' are stored in the `feature_importances_` attribute of our model.  What does the trailing underscore mean?

In [32]:
importances = model.best_estimator_.feature_importances_
importances

array([0.19691261, 0.35702399, 0.44383363, 0.00222976])


Store the importances in a dataframe alongside a column of the feature names.

In [33]:
importance_df = pd.DataFrame({"feat": X_train.columns, "importance": importances})


Order the dataframe from most to least important.

In [34]:
importance_df = importance_df.sort_values("importance", ascending=False)
importance_df

Unnamed: 0,feat,importance
2,Margin,0.443834
1,Shape,0.357024
0,Age,0.196913
3,Density,0.00223



So margin is the most important feature in determining if these mammographic masses are benign or malignant, with shape close behind.  What does that mean?  Remember that a feature is only chosen for a split if it gives the best split available, and 'best' is judged by how much the split reduces impurity (the 'information gain' idea; these trees use the Gini criterion by default).  We have a lot of trees, and we aggregate these measures across all of the trees to get importance.  So the more important a feature, the more useful it was in separating our 2 classes across the whole forest.
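The "aggregate across all the trees" step can be checked directly, since each fitted tree in a forest carries its own `feature_importances_`.  The sketch below assumes synthetic data from `make_classification` and an arbitrary small forest; it averages the per-tree importances by hand and compares the result with what the forest reports.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the mammographic data set)
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# One row of importances per tree in the forest
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

# Average across trees, then rescale so the importances sum to 1
manual = per_tree.mean(axis=0)
manual = manual / manual.sum()

print("manual:", np.round(manual, 4))
print("forest:", np.round(forest.feature_importances_, 4))
```

Looking at the spread of `per_tree` down each column is also a cheap way to see how much the trees disagree about a feature.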