# Part 3 - Categorically Speaking

**Notice**: This notebook is a modification of [cats.ipynb and targetencode.ipynb](https://mlbook.explained.ai/notebooks/index.html) by Terence Parr and Jeremy Howard, which were used by permission of the author.

### Reestablish Baseline

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
import numpy as np
import pandas as pd
from rfpimp_MC import *
import category_encoders as ce

In [None]:
def evaluate(X, y):
    rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, oob_score=True)
    rf.fit(X, y)
    oob = rf.oob_score_
    n = rfnnodes(rf)
    h = np.median(rfmaxdepths(rf))
    print(f"OOB R^2 is {oob:.5f} using {n:,d} tree nodes with {h} median tree depth")
    return rf, oob

In [None]:
def showimp(rf, X, y):
    features = list(X.columns)
    features.remove('latitude')
    features.remove('longitude')
    features += [['latitude','longitude']]

    I = importances(rf, X, y, features=features)
    plot_importances(I, color='#4575b4')

In [None]:
rent = pd.read_csv('rent.csv')

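# Keep only listings with a plausible monthly price
rent_clean = rent[(rent['price'] > 1000) & (rent['price'] < 10000)]
# Drop rows where both coordinates are zero
rent_clean = rent_clean[(rent_clean['longitude'] != 0) | (rent_clean['latitude'] != 0)]
# Restrict to a latitude/longitude bounding box; .copy() gives an independent
# DataFrame so that adding columns later does not trigger SettingWithCopyWarning
rent_clean = rent_clean[(rent_clean['latitude'] > 40.55) &
                        (rent_clean['latitude'] < 40.94) &
                        (rent_clean['longitude'] > -74.1) &
                        (rent_clean['longitude'] < -73.67)].copy()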

In [None]:
numfeatures = ['bathrooms', 'bedrooms', 'longitude', 'latitude']

X = rent_clean[numfeatures]
y = rent_clean['price']

In [None]:
rf, oob = evaluate(X, y)

In [None]:
showimp(rf, X, y)

### Feature Engineering

What we will embark on now can be broadly categorized as *feature engineering*, which can be defined as: 

> the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data. (Jason Brownlee)

The goal is to discover (through domain knowledge, experimentation, etc.) which input features make it easier for the model to best predict the corresponding outputs. We won't cover all aspects of this topic, but will cover a few approaches that should prove useful in your practical work. 

#### Extracting Features from Strings

Our apartment rental data has a few features that are made up of *strings* (text):
- `description` contains free-form text that presumably highlights all the desirable qualities of the apartment;
- `features` contains lists of specific amenities each apartment offers; and, 
- `photos` contains strings representing the filenames of pictures of the apartment. 

It would seem reasonable to think that some of the information contained in these columns would be connected to the price, since, after all, they are listed to show how much you are getting for the amount you have to spend. 

Let's consider each in turn, noting that processing free-form text may require a little more work and creativity than other columns. 

In [None]:
rent_text = rent_clean[['description', 'features', 'photos']].copy()
rent_text.head()

The first thing we should do is check to see if there are any missing values.

In [None]:
rent_text.isnull().sum()

So we only need to deal with missing values in the `description` column.

#### `Description`

If we look at the first couple of descriptions, we see words like *spacious* and *renovated*. These may have predictive power because:
- the number of bedrooms and bathrooms doesn't tell us how big the apartment is overall: one 1 bed/1 bath apartment could have twice the square footage of another, and so may be worth more money; and,
- a renovated apartment has presumably been updated, so it may look nicer and have fewer problems than an apartment that has not been renovated.

It's worth seeing if these can provide us with any performance improvement. 

First, let's replace the missing values with an empty string '' and convert the description to lower case text. 

In [None]:
rent_clean['description'] = rent_clean['description'].fillna('')
rent_clean['description'] = rent_clean['description'].str.lower() 
rent_clean.head(3).T

Note that string matching in Python is case-sensitive by default, so *Spacious* and *spacious* would be treated as different words. Converting the text to lower case first removes this problem.
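
To see why the lower-casing step matters, here is a minimal sketch on a few made-up descriptions (the toy `desc` Series is an assumption for illustration and is not taken from `rent.csv`). It compares a substring match before and after lower-casing.

```python
import pandas as pd

# Toy descriptions, made up for illustration (not from rent.csv)
desc = pd.Series(["Spacious one-bedroom", "a spacious studio", None])

# Fill the missing value first, as we did for the real column
desc = desc.fillna('')

# Matching is case-sensitive by default, so "Spacious" is missed
raw_match = desc.str.contains("spacious")

# Lower-casing first lets both spellings match
lower_match = desc.str.lower().str.contains("spacious")

print(raw_match.tolist())    # [False, True, False]
print(lower_match.tolist())  # [True, True, False]
```

An alternative is to leave the text as it is and pass `case=False` to `str.contains`. Lower-casing the column once has the advantage that every later match can ignore case without extra arguments.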

Now let's create two new columns based on these two concepts of *spacious* and *renovated*. 

In [None]:
rent_clean['renov'] = rent_clean['description'].str.contains("renov")
rent_clean['large_apt'] = rent_clean['description'].str.contains("spacious")
rent_clean.head(3).T

It's probably a good idea to check how many apartments have these new features we have created, because if almost all (or none) of them have this feature it may not be much help in making predictions. 

In [None]:
rent_clean['renov'].sum()

In [None]:
rent_clean['large_apt'].sum()

That seems like a decent number, so let's use them to build a model. Remember that `False` and `True` get treated as 0 and 1, respectively, so even though they look like words, the computer is going to treat them as numbers.
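
As a small illustration of `True`/`False` behaving like 1/0, the sketch below uses a hypothetical boolean column (`flags` is made up, not one of the real columns). Summing it counts the `True` values, and taking the mean gives the fraction of rows that have the feature, which can be easier to judge than a raw count.

```python
import pandas as pd

# Hypothetical boolean feature column, made up for illustration
flags = pd.Series([True, False, True, True])

count = flags.sum()          # each True counts as 1
share = flags.mean()         # fraction of rows that are True
as_int = flags.astype(int)   # explicit 0/1 version of the same column

print(count)            # 3
print(share)            # 0.75
print(as_int.tolist())  # [1, 0, 1, 1]
```

On the real data, something like `rent_clean['renov'].mean()` would give the share of apartments flagged as renovated.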

In [None]:
X = rent_clean[numfeatures + ['renov', 'large_apt']]
y = rent_clean['price']

In [None]:
rf, oob = evaluate(X, y)

In [None]:
showimp(rf, X, y)

While these new features don't seem to have much predictive power, it is usually a good idea to try out different ideas like this based on our understanding of what may or may not be important to someone looking for an apartment. 

##### Exercise

In a manner similar to what we did to create the new features `renov` and `large_apt`, try creating other features for: 
- apartments that have a balcony (see if you can do it in a way that captures descriptions that say either *balcony* or *balconies*); and
- apartments that allow pets, which in this case means either *cats* or *dogs*. (Hint: you will need to combine the two conditions with a [logical operator](https://www.w3schools.com/python/gloss_python_logical_operators.asp).)

#### `Features`

The column `features`, which is not to be confused with the general machine learning term, refers to specific amenities that an apartment offers. Let's take a look at some entries to see what kind of information is available and whether or not it may help our model improve its performance. 

In [None]:
rent_clean['features'][:10]

We can see that some of the features listed could reasonably be related to the price of an apartment. For example, one would expect that having a *doorman* would be associated with higher-priced apartments. 

In [None]:
rent_clean['features'] = rent_clean['features'].fillna('')
rent_clean['features'] = rent_clean['features'].str.lower() 

for feature in ['doorman', 'parking', 'garage', 'laundry']:
    rent_clean[feature] = rent_clean['features'].str.contains(feature)
    
rent_clean[['doorman', 'parking', 'garage', 'laundry']].head(5)

As before, let's check how many apartments have each of these new features, because a feature that almost all (or almost none) of the apartments have may not be much help in making predictions.

In [None]:
rent_clean[['doorman', 'parking', 'garage', 'laundry']].sum()

Seems like enough apartments have these particular features, so we can build and evaluate our model. 

In [None]:
X = rent_clean[numfeatures + ['doorman', 'parking', 'garage', 'laundry']]
y = rent_clean['price']
X.head(3).T

In [None]:
rf, oob = evaluate(X, y)

In [None]:
showimp(rf, X, y)

##### Exercise

In a manner similar to what we did using `features`, try creating a few more columns based on other apartment characteristics that you can find. (Hint: Try using `rent_clean['features'][0:10]` and replace the start and end rows to explore what's listed in the `features` column. See if you can find anything that might be related to price.)

##### More Counting

Counting is something that humans do well, so let's see if we can make it work for us here. Maybe higher-priced apartments give the manager more to talk about, so they would have longer descriptions, longer lists of amenities, and more pictures than lower-priced apartments. We can see if this has any impact on our model by doing some straightforward counting on the strings in these columns. The basic approach is that for each apartment (row) we:
- split the `description` on whitespace to create a list of words and then count how many elements are in the list;
- split the list of amenities in `features` on each comma (",") to create a list and then count how many elements are in the list; and,
- split the list of filenames in `photos` on each comma (",") to create a list and then count how many elements are in the list.
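
The split-and-count idea can be sketched on plain Python strings first. The two toy strings below are made up for illustration, and the comma-separated layout is an assumption about how `features` and `photos` are stored. The sketch also shows one thing to watch for: an empty string split on a comma still yields one (empty) element, so a listing with no amenities would be counted as having one.

```python
# Toy strings, made up for illustration (not from rent.csv)
description = "spacious renovated one bedroom"
features = "doorman,elevator,laundry"

# split() with no argument splits on any run of whitespace
n_words = len(description.split())

# split(",") splits on each comma
n_features = len(features.split(","))

# Caveat: the two calls behave differently on an empty string
n_empty_words = len("".split())        # [] has 0 elements
n_empty_features = len("".split(","))  # [''] has 1 element

print(n_words, n_features)              # 4 3
print(n_empty_words, n_empty_features)  # 0 1
```

For a random forest this off-by-one on empty entries is unlikely to matter much, since the ordering of the counts is mostly preserved, but it is worth knowing if you read the counts literally.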

Using the `description` as an example, here is what the original data looks like:

In [None]:
rent_clean["description"]

And here is what it looks like after we split each description on the whitespace between the words:

In [None]:
rent_clean["description"].apply(lambda x: x.split())

And here is what it looks like after we split and count:

In [None]:
rent_clean["description"].apply(lambda x: len(x.split()))

Let's do this procedure on all 3 columns and add the new columns to our data for training. 

In [None]:
rent_clean['num_desc_words'] = rent_clean["description"].apply(lambda x: len(x.split()))
rent_clean['num_features'] = rent_clean["features"].apply(lambda x: len(x.split(",")))
rent_clean['num_photos'] = rent_clean["photos"].apply(lambda x: len(x.split(",")))
rent_clean[['num_desc_words', 'num_features', 'num_photos']].head()

Now let's use these columns to build a model: 

In [None]:
X = rent_clean[numfeatures + ['num_desc_words', 'num_features', 'num_photos']]
y = rent_clean['price']

In [None]:
rf, oob = evaluate(X, y)

In [None]:
showimp(rf, X, y)

### What We Have So Far

Let's put all the features we've worked on so far together and build a model so we can see how it all fits together. 

In [None]:
X = rent_clean[numfeatures + 
               ['renov', 'large_apt'] + 
               ['doorman', 'parking', 'garage', 'laundry'] +
               ['num_desc_words', 'num_features', 'num_photos']]
y = rent_clean['price']

In [None]:
rf, oob = evaluate(X, y)

In [None]:
showimp(rf, X, y)