# Part 4 - Categorically Speaking

**Notice**: This notebook is a modification of [cats.ipynb and targetencode.ipynb](https://mlbook.explained.ai/notebooks/index.html) by Terence Parr and Jeremy Howard, which were used by permission of the author.

### Reestablish Baseline

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
import numpy as np
import pandas as pd
from rfpimp_MC import *
import category_encoders as ce

In [None]:
def evaluate(X, y):
    rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, oob_score=True)
    rf.fit(X, y)
    oob = rf.oob_score_
    n = rfnnodes(rf)
    h = np.median(rfmaxdepths(rf))
    print(f"OOB R^2 is {oob:.5f} using {n:,d} tree nodes with {h} median tree depth")
    return rf, oob

In [None]:
def showimp(rf, X, y):
    features = list(X.columns)
    features.remove('latitude')
    features.remove('longitude')
    features += [['latitude','longitude']]

    I = importances(rf, X, y, features=features)
    plot_importances(I, color='#4575b4')

In [None]:
rent = pd.read_csv('rent.csv')

rent_clean = rent[(rent['price'] > 1000) & (rent['price'] < 10000)]
rent_clean = rent_clean[(rent_clean['longitude'] !=0) | (rent_clean['latitude']!=0)]
rent_clean = rent_clean[(rent_clean['latitude']>40.55) &
                        (rent_clean['latitude']<40.94) &
                        (rent_clean['longitude']>-74.1) &
                        (rent_clean['longitude']<-73.67)].copy()  # copy so columns added later do not trigger SettingWithCopyWarning

In [None]:
numfeatures = ['bathrooms', 'bedrooms', 'longitude', 'latitude']

X = rent_clean[numfeatures]
y = rent_clean['price']

In [None]:
rf, oob = evaluate(X, y)

In [None]:
showimp(rf, X, y)

### Feature Engineering Continued

#### Combining Existing Features

We can also combine existing features to create new ones. As an example, let's create a feature that combines the number of bedrooms with the number of bathrooms. We are interested in the ratio of bedrooms to bathrooms, as this may be an important predictor as well. The reasoning is that more bedrooms presumably means more people, and more people sharing a bathroom means a longer wait to get ready for work or school. So the ratio of bedrooms to bathrooms might be a good feature to have.

In [None]:
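# The +1 in the denominator presumably guards against division by zero for listings with 0 bathrooms
rent_clean['beds_to_baths'] = rent_clean['bedrooms'] / (rent_clean['bathrooms'] + 1)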
rent_clean.head(3).T

Now we'll see if this has any impact on our model's performance. 

In [None]:
X = rent_clean[numfeatures + ['beds_to_baths']]
y = rent_clean['price']

In [None]:
rf, oob = evaluate(X, y)

In [None]:
showimp(rf, X, y)

Maybe total number of bedrooms plus bathrooms is what is important, as opposed to having them both listed separately. 

In [None]:
rent_clean['total_rooms'] = rent_clean['bathrooms'] + rent_clean['bedrooms']
rent_clean.head(3).T

In [None]:
X = rent_clean[['longitude', 'latitude', 'total_rooms']]
y = rent_clean['price']

In [None]:
rf, oob = evaluate(X, y)

In [None]:
showimp(rf, X, y)

Or, maybe not! 

We don't have a lot of raw numeric features to play around with here, but if you did, you would want to explore other combinations and measure their impact on how well the model generalizes. 

We do have another numeric variable to consider, which is the target, `price`. And yes, combining the target with other features is as dangerous as it sounds. Let's explore that next. 

##### Target Encoding

Creating a new feature from information about the target is referred to as *target encoding*. This technique can sometimes be used to encode categorical variables. The simplest approach is to replace each unique category value with the mean of the target for that value. Let's see how this would work for `building_id`.

In [None]:
rent_clean[['building_id', 'price']].groupby('building_id').mean()

The code above groups the rows by unique `building_id` value and then calculates the mean price of all the apartments associated with each ID. We would then use these mean prices in place of `building_id`. (We will ignore the odd-looking `building_id = 0`, although we would normally investigate it.)
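
To make the substitution concrete, here is a minimal sketch of mean target encoding done by hand. The toy data, the `building_id_enc` column name, and the choice to fall back to the global mean for unseen IDs are all assumptions made for illustration; they are not part of the original notebook.

```python
import pandas as pd

# Toy data standing in for rent_clean; the values are made up for illustration.
train = pd.DataFrame({
    'building_id': ['a', 'a', 'b', 'b', 'b', 'c'],
    'price':       [2000, 3000, 4000, 5000, 6000, 2500],
})
val = pd.DataFrame({'building_id': ['a', 'b', 'z']})  # 'z' never appears in train

# Mean price per building, computed from the training rows only.
building_means = train.groupby('building_id')['price'].mean()
global_mean = train['price'].mean()

# Replace each ID with its mean price; unseen IDs fall back to the global mean.
train['building_id_enc'] = train['building_id'].map(building_means)
val['building_id_enc'] = val['building_id'].map(building_means).fillna(global_mean)

print(val)
```

Computing the means from the training rows only keeps validation prices out of the encoded feature, which matters once we start scoring on a validation set.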

**Warning**: Using target information in our features is prone to overfitting, so it is usually best to use a library like `category_encoders` to do this, which is what we will do now. To be careful, we will also use a validation set instead of the out-of-bag score.
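
One reason plain category means overfit is that a category with only one or two rows simply memorizes those rows' prices. A common remedy is to blend each category mean with the global mean, weighted by how many rows the category has. The sketch below illustrates that idea only; the function name and the smoothing weight `m` are hypothetical, and this is not necessarily the formula that `category_encoders` applies internally.

```python
import pandas as pd

def smoothed_target_encode(categories, target, m=5.0):
    """Return a Series mapping each category to a smoothed target mean.

    m is a hypothetical smoothing weight: the number of "virtual" rows at the
    global mean that are blended into every category.
    """
    global_mean = target.mean()
    stats = target.groupby(categories).agg(['mean', 'count'])
    return (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)

# Made-up data: building 'a' has ten listings, building 'b' has a single expensive one.
df = pd.DataFrame({
    'building_id': ['a'] * 10 + ['b'],
    'price': [3000] * 10 + [9000],
})
encoding = smoothed_target_encode(df['building_id'], df['price'], m=5.0)
print(encoding)
```

With this toy data, the single listing in building `b` is pulled well below its raw mean of 9000 toward the global mean, while building `a`, with ten listings, moves only slightly.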

In [None]:
X = rent_clean[numfeatures + ['building_id']]
y = rent_clean['price']
X.head(3)

Now let's get a baseline for the original 4 numeric features but using a validation set instead of the out-of-bag score. 

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20)

X_train_orig = X_train[numfeatures]
y_train_orig = y_train.copy()
X_val_orig = X_val[numfeatures]
y_val_orig = y_val.copy()

rf = RandomForestRegressor(n_estimators=100, n_jobs=-1)
rf.fit(X_train_orig, y_train_orig)

val_score_orig = rf.score(X_val_orig, y_val_orig)

print(f"{val_score_orig:.4f} score {rfnnodes(rf):,d} tree nodes and {np.median(rfmaxdepths(rf))} median tree height")

In [None]:
showimp(rf, X_train_orig, y_train_orig)

Now let's get a validation score for a model that uses the target encoded `building_id`. 

In [None]:
encoder = ce.TargetEncoder(cols=['building_id'])

encoder.fit(X_train, y_train)

X_train_enc = encoder.transform(X_train, y_train)
y_train_enc = y_train.copy()
X_val_enc = encoder.transform(X_val)
y_val_enc = y_val.copy()

In [None]:
X_train_enc.head(3)

In [None]:
X_val_enc.head(3)

In [None]:
rf = RandomForestRegressor(n_estimators=100, n_jobs=-1)
rf.fit(X_train_enc, y_train_enc)

val_enc_score = rf.score(X_val_enc, y_val_enc)

print(f"{val_enc_score:.4f} score {rfnnodes(rf):,d} tree nodes and {np.median(rfmaxdepths(rf))} median tree height")

In [None]:
showimp(rf, X_train_enc, y_train_enc)

The feature importances rank the target-encoded `building_id` feature as the most important by a substantial margin. This is suspicious and suggests that the model is overfitting, that is, attaching too much importance to this feature relative to the other features that we know are important, most likely as a result of using the target information.

It may not be useful here, but it is good to know this technique and how to properly apply it. 

##### Exercise

Try *target encoding* the `manager_id` and `display_address` features and comparing the results to the baseline. 

#### Adding Features Using Other Data

After you have explored all you can do with the original raw data, it may be worthwhile to include information from other sources. For apartments in New York City, as in most cities, adding a *neighbourhood* feature may have predictive power. While we have `latitude` and `longitude`, different neighbourhoods may command different prices for apartments with otherwise similar features because some areas of the city are trendier than others and, as a result, people may be willing to pay more to live there. (See [Chapter 6](https://mlbook.explained.ai/catvars.html#sec:6.6) for more details.)

We can first use something like maps.google.com (or a similar service) to estimate the latitude and longitude of some neighbourhoods in New York City and store that information in a dictionary.

In [None]:
hoods = {
    "hells" : [40.7622, -73.9924],
    "astoria" : [40.7796684, -73.9215888],
    "Evillage" : [40.723163774, -73.984829394],
    "Wvillage" : [40.73578, -74.00357],
    "LowerEast" : [40.715033, -73.9842724],
    "UpperEast" : [40.768163594, -73.959329496],
    "ParkSlope" : [40.672404, -73.977063],
    "Prospect Park" : [40.93704, -74.17431],
    "Crown Heights" : [40.657830702, -73.940162906],
    "financial" : [40.703830518, -74.005666644],
    "brooklynheights" : [40.7022621909, -73.9871760513],
    "gowanus" : [40.673, -73.997]
}

We now want to create a new feature for each neighbourhood that holds the appropriately named *Manhattan distance* from each apartment to that neighbourhood. We use Manhattan distance because it is arguably closer to the walking distance from the apartment to the neighbourhood in question.


<img src="manhattan_dist.jpg" width=600 align="center">
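
As a small worked example, the sketch below computes the Manhattan distance from two made-up apartment locations to the "hells" entry in the dictionary and compares it with the straight-line (Euclidean) distance. The helper name `manhattan_distance` and the apartment coordinates are assumptions for illustration. Distances are in raw degrees, and since a degree of longitude covers less ground than a degree of latitude at New York's latitude, this is only a rough proxy for walking distance.

```python
import numpy as np

def manhattan_distance(lat, lon, hood_lat, hood_lon):
    """Sum of absolute differences in latitude and longitude, in degrees."""
    return np.abs(lat - hood_lat) + np.abs(lon - hood_lon)

# Two made-up apartment locations; the first sits exactly on the neighbourhood point.
lats = np.array([40.7622, 40.7522])
lons = np.array([-73.9924, -73.9824])
hells = (40.7622, -73.9924)  # same values as the "hells" entry in hoods

dist = manhattan_distance(lats, lons, *hells)
euclid = np.sqrt((lats - hells[0])**2 + (lons - hells[1])**2)

print(dist)    # Manhattan distance in degrees
print(euclid)  # straight-line distance in degrees, never larger than Manhattan
```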

In [None]:
for hood, loc in hoods.items():
    rent_clean[hood] = np.abs(rent_clean['latitude'] - loc[0]) + np.abs(rent_clean['longitude'] - loc[1])

In [None]:
rent_clean[numfeatures + list(hoods.keys())].head()

Let's check to see if this helps out our model at all. 

In [None]:
X = rent_clean[numfeatures + list(hoods.keys())]
y = rent_clean['price']

In [None]:
rf, oob = evaluate(X, y)

In [None]:
showimp(rf, X, y)

We could probably remove the original `latitude` and `longitude` columns, as there is now enough *location* information contained within the neighbourhood data, but we'll leave them all in here.

### Exercises

- Try combining all the features we have created in **Part 1**, **Part 2**, **Part 3**, and **Part 4** of our categorical variables notebooks. Build and evaluate a model against the baseline, being sure to look at the feature importances. 
- Explore the data contained in *adult.csv* using some of the techniques you now know about:
    - Note that this is a classification problem; and 
    - You should do label encoding on the target as your first step. The easiest way to do this is to use the [Pandas `.map()` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html). 
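
As a hint for the label encoding step, here is a minimal sketch of `.map()` applied to a toy Series. The column name `income` and the label strings are assumptions; check the actual target column and its values in *adult.csv* before reusing the dictionary.

```python
import pandas as pd

# Hypothetical target values; the real column name and labels may differ.
income = pd.Series(['<=50K', '>50K', '<=50K', '>50K'], name='income')

# Map each label string to an integer class.
label_map = {'<=50K': 0, '>50K': 1}
y = income.map(label_map)

# Values missing from the dictionary become NaN, so verify nothing was missed.
assert y.notna().all()
print(y.tolist())
```

Labels with stray whitespace or different spelling would silently map to `NaN`, which is why the check after the mapping is worth keeping.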