# Part 1 - Categorically Speaking

**Notice**: This notebook is a modification of [cats.ipynb and targetencode.ipynb](https://mlbook.explained.ai/notebooks/index.html) by Terence Parr and Jeremy Howard, which were used by permission of the author.

## Recap

To train a model we need:

 - all the data to be numeric;
 - no missing data/values.
 
And what we have done so far is:
 - ignored non-numeric data;
 - built and evaluated a random forest model, which had:
     - a poor avg $R^2\,$ and *mean absolute error* on the validation data;
     - high variance in $R^2\,$ and *mean absolute error* on the validation data;
 - explored our data for anomalies in the context of our objective to predict apartment rental prices for a typical apartment in New York City;
 - cleaned our data to remove the anomalies we discovered;
 - built and evaluated a random forest model using the cleaned data.

### Reestablish Baseline

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
import numpy as np
import pandas as pd

### Without denoising

In [24]:
rent = pd.read_csv('rent.csv')

numfeatures = ['bathrooms', 'bedrooms', 'longitude', 'latitude']

X = rent[numfeatures]
y = rent['price']

rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, oob_score=True)
rf.fit(X, y)

oob_noisy = rf.oob_score_
print(f"Out-of-bag R^2 for baseline model is: {oob_noisy}")

Out-of-bag R^2 for baseline model is: 0.051345739046082084


In [38]:
from sklearn.model_selection import train_test_split

for i in range(20):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, oob_score=True)
    rf.fit(X, y)

    val_score = rf.score(X_test, y_test)
    print(f"Validation R^2 for noisy model is: {val_score}")

Validation R^2 for noisy model is: 0.03195870870926465
Validation R^2 for noisy model is: 0.8625596483548396
Validation R^2 for noisy model is: 0.9382727814644269
Validation R^2 for noisy model is: 0.8019069969787642
Validation R^2 for noisy model is: 0.8635663461946572
Validation R^2 for noisy model is: 0.8269381208400364
Validation R^2 for noisy model is: 0.8844483752761421
Validation R^2 for noisy model is: 0.9856243585797346
Validation R^2 for noisy model is: -4.061873479328193
Validation R^2 for noisy model is: 0.8881302455281123
Validation R^2 for noisy model is: 0.9040709538550382
Validation R^2 for noisy model is: 0.9906608792025959
Validation R^2 for noisy model is: 0.8338491534500002
Validation R^2 for noisy model is: 0.8074717585113296
Validation R^2 for noisy model is: 0.9748838361235869
Validation R^2 for noisy model is: 0.9780104337007731
Validation R^2 for noisy model is: 0.801748941649217
Validation R^2 for noisy model is: 0.8176079052364837
Validation R^2 for noisy mod

### With denoising

In [39]:
rent = pd.read_csv('rent.csv')

rent_clean = rent[(rent['price'] > 1000) & (rent['price'] < 10000)]
rent_clean = rent_clean[(rent_clean['longitude'] !=0) | (rent_clean['latitude']!=0)]
rent_clean = rent_clean[(rent_clean['latitude']>40.55) &
                        (rent_clean['latitude']<40.94) &
                        (rent_clean['longitude']>-74.1) &
                        (rent_clean['longitude']<-73.67)]

In [46]:
numfeatures = ['bathrooms', 'bedrooms', 'longitude', 'latitude']

X = rent_clean[numfeatures]
y = rent_clean['price']

for i in range(20):
    rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, oob_score=True)
    rf.fit(X, y)

    oob_baseline = rf.oob_score_
    print(f"Out-of-bag R^2 for baseline model is: {oob_baseline}")

Out-of-bag R^2 for baseline model is: 0.8680051027195075
Out-of-bag R^2 for baseline model is: 0.8678009108868311
Out-of-bag R^2 for baseline model is: 0.8669448115889948
Out-of-bag R^2 for baseline model is: 0.8672657944975786
Out-of-bag R^2 for baseline model is: 0.8687651894828148
Out-of-bag R^2 for baseline model is: 0.8674891364203413
Out-of-bag R^2 for baseline model is: 0.8681466598575913
Out-of-bag R^2 for baseline model is: 0.8675027359911786
Out-of-bag R^2 for baseline model is: 0.8677235575230293
Out-of-bag R^2 for baseline model is: 0.8679393696474594
Out-of-bag R^2 for baseline model is: 0.8681246211811058
Out-of-bag R^2 for baseline model is: 0.867564689027841
Out-of-bag R^2 for baseline model is: 0.8674369195510829
Out-of-bag R^2 for baseline model is: 0.8671392646294537
Out-of-bag R^2 for baseline model is: 0.8677955391958313
Out-of-bag R^2 for baseline model is: 0.8683588218462751
Out-of-bag R^2 for baseline model is: 0.8680161399084121
Out-of-bag R^2 for baseline mode

### $R^2$ Reminder

Recall the formula:

$$
R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}
$$

This tells us that:
- $R^2 = 1$ means our model is perfect; 
- $R^2 \approx 0$ means our model does no better than just predicting the average;
- $R^2 \lt\lt 0$ means our model does worse than predicting the average.

Also, as $R^2 \rightarrow 1$ it gets harder and harder to improve model performance.

### Other Indicators

We are not only interested in $R^2$. We would also like to know: 
- how much work the random forest model has to do to capture the relationship between the features and the target; 
- the typical tree depth, as this will impact the speed of predictions for new data;
- how important different features are for a given model.

To help with this we will use the `rfpimp` package. (Note: you will see some *FutureWarning* messages when using this package but these can be ignored as they are just warnings that some parts of the *sklearn* code are changing in the future.)

In [5]:
from rfpimp_MC import *

Since we will be evaluating many models this way, we will use some functions to help keep our code clean:
- one to evaluate our model and report the OOB score, the number of nodes across all trees in the forest, and the median tree depth in the forest; and, 
- one to show the feature importances for a given model.

In [6]:
def evaluate(X, y):
    rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, oob_score=True)
    rf.fit(X, y)
    oob = rf.oob_score_
    n = rfnnodes(rf)
    h = np.median(rfmaxdepths(rf))
    print(f"OOB R^2 is {oob:.5f} using {n:,d} tree nodes with {h} median tree depth")
    return rf, oob

In [7]:
def showimp(rf, X, y):
    features = list(X.columns)
    features.remove('latitude')
    features.remove('longitude')
    features += [['latitude','longitude']]

    I = importances(rf, X, y, features=features)
    plot_importances(I, color='#4575b4')

Let's try both of these out on our baseline model that uses only the cleaned numeric data.

In [None]:
evaluate(X, y)

In [None]:
showimp(rf, X, y)

### Feature Importance

Many times, a model's ability to generalize (predict) well is not all we are hoping for; we would also like to understand what the model is doing, which is referred to as a model's interpretability. Random Forests have this as a built in feature, however the implementation in *sklearn* suffers from bias when:
- the scales of the features vary; and/or, 
- there are many categories for a feature.

A better way to assess feature importance, in any model, is to use:
- permutation importance; or 
- dropped feature importance.

##### Permutation Importance

We can calculate the feature importances using a permutation method, which consists of the following steps:
- use all features and establish a baseline value for $R^2$;
- select one feature and randomly permute its values leaving all other features unchanged;
- calculate the new value for $R^2$ with this one feature permuted;
- calculate the change in $R^2$ from the baseline; and, 
- repeat for the other features.

Let's see how this works.

In [10]:
def perm_importances(X, y):
    rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, oob_score=True, random_state=999)
    rf.fit(X, y)
    r2 = rf.oob_score_
    print(f"Baseline R^2 with no columns permuted: {r2:.5f}\n")
    for col in X.columns:
        X_col = X.copy()
        X_col[col] = X_col[col].sample(frac=1).values
        rf.fit(X_col, y)
        r2_col = rf.oob_score_
        print(f"Permuting column {col}: new R^2 is {r2_col:.5f} and difference from baseline is {r2 - r2_col:.5f}")


In [None]:
perm_importances(X, y)

##### Dropped Column Importance

We can also calculate the importance of the features using a dropped column, which consists of the following steps:
- use all features and establish a baseline value for $R^2$;
- select one feature and remove it from the data;
- calculate the new value for $R^2$ with this one feature removed;
- calculate the change in $R^2$ from the baseline; and, 
- repeat for the other features.

Let's see how this works.

In [12]:
def drop_importances(X, y):
    rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, oob_score=True, random_state=999)
    rf.fit(X, y)
    r2 = rf.oob_score_
    print(f"Baseline R^2 with no columns dropped: {r2:.5f}\n")
    for col in X.columns:
        X_col = X.copy()
        X_col = X_col.drop(col, axis=1) 
        rf.fit(X_col, y)
        r2_col = rf.oob_score_
        print(f"Dropping column {col}: new R^2 is {r2_col:.5f} and difference from baseline is {r2 - r2_col:.5f}")

In [None]:
drop_importances(X, y)

##### Be Careful With Correlation

In [None]:
X_dup = X.copy()
X_dup['bedrooms_dup'] = X_dup['bedrooms']
X_dup.head()

In [None]:
drop_importances(X_dup, y)

##### Breaking Correlation

In [None]:
noise = np.random.normal(0, 2, X_dup.shape[0])

X_dup['bedrooms_dup'] = X_dup['bedrooms'] + noise
X_dup.head()

In [None]:
drop_importances(X_dup, y)