## Part 2 - Dealing with Missing Data


**Notice: This notebook is a modification of [sniff.ipynb](https://mlbook.explained.ai/notebooks/index.html) by Terence Parr and Jeremy Howard, which was used by permission of the author.**

In [None]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from rfpimp_MC import * 

In [None]:
def evaluate(X, y, n_estimators=50):
    rf = RandomForestRegressor(n_estimators=n_estimators, n_jobs=-1, oob_score=True)
    rf.fit(X, y)
    oob = rf.oob_score_
    n = rfnnodes(rf)
    h = np.median(rfmaxdepths(rf))
    print(f"OOB R^2 is {oob:.5f} using {n:,d} tree nodes with {h} median tree depth")
    return rf, oob

In [None]:
def showimp(rf, X, y):
    features = list(X.columns)
    I = importances(rf, X, y, features=features)
    plot_importances(I, color='#4575b4')

In [None]:
from pandas.api.types import is_string_dtype, is_object_dtype

def df_normalize_strings(df):
    for col in df.columns:
        if is_string_dtype(df[col]) or is_object_dtype(df[col]):
            df[col] = df[col].str.lower()
            df[col] = df[col].fillna(np.nan)
            df[col] = df[col].replace('none or unspecified', np.nan)
            df[col] = df[col].replace('none', np.nan)
            df[col] = df[col].replace('#name?', np.nan)
            df[col] = df[col].replace('', np.nan)

In [None]:
def extract_sizes(df, colname):
    df[colname] = df[colname].str.extract(r'(\d+\.\d+|\d+)', expand=True)
    df[colname] = df[colname].replace('', np.nan)
    df[colname] = pd.to_numeric(df[colname])

In [None]:
# modified version of: https://stackoverflow.com/questions/26986655/changing-height-feet-and-inches-to-an-integer-in-python-pandas

def parse_length(length):
    # Convert a string such as 11' 0" (feet and inches) to total inches
    if not pd.isnull(length):
        split_length = length.split("' ")
        feet = float(split_length[0])
        inches = float(split_length[1].replace("\"", ""))
        return (12*feet) + inches
    else:
        return np.nan

### Recap

It is a good idea to recap what we did to the data last time:

- dropped the `SalesID` and `MachineID` features;
- converted `auctioneerID` to 'string' data type so we can treat it as a categorical feature;
- decided to leave `Blade_Width` as 'string' and treat as a categorical feature instead of converting to numeric;
- extracted numeric features from the original `Undercarriage_Pad_Width` and `Tire_Size` strings;
- converted `Stick_Length` to a numeric feature from the original string representation;
- normalized the representation of missing values to `np.nan`.
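
As a reminder of what the two string-parsing steps do, here is a small sketch on made-up values. The sample strings and the helper name `feet_inches_to_inches` are assumptions chosen for illustration, not rows or functions from the dataset or the earlier notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, for illustration only
toy = pd.DataFrame({
    'Tire_Size': ['23.5"', '14"', np.nan],
    'Stick_Length': ["11' 0\"", "9' 6\"", np.nan],
})

# Same regex as extract_sizes: grab the first decimal or integer number
toy['Tire_Size'] = pd.to_numeric(
    toy['Tire_Size'].str.extract(r'(\d+\.\d+|\d+)', expand=False)
)

def feet_inches_to_inches(length):
    # Same logic as parse_length: feet and inches -> total inches
    if pd.isnull(length):
        return np.nan
    feet, inches = length.split("' ")
    return 12 * float(feet) + float(inches.replace('"', ''))

toy['Stick_Length'] = toy['Stick_Length'].apply(feet_inches_to_inches)
print(toy)
```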

### Next Steps

Our next steps will be to carry out the following:
- convert all 'string' features to ordered categorical features;
- label encode all these features using the value of 0 to represent missing data;
- fix some remaining problems with numeric columns; and 
- replace missing numeric data by:
    - adding a new feature to say whether or not that value was missing; 
    - replacing missing values in the original feature with the median of all values for that feature.

### Reset Data

In this notebook we pick up where we left off in **Part 1**, so we first reload the data and repeat the processing steps from that notebook.

In [None]:
df_raw = pd.read_feather("bulldozer-train.feather")
df = df_raw.copy()
df = df.iloc[-100000:]

In [None]:
df.drop(['SalesID', 'MachineID'], axis=1, inplace=True)
df['auctioneerID'] = df['auctioneerID'].astype(str)
df_normalize_strings(df)
extract_sizes(df, 'Tire_Size')
extract_sizes(df, 'Undercarriage_Pad_Width')
df['Stick_Length'] = df['Stick_Length'].apply(lambda x: parse_length(x))

In [None]:
def sniff_modified(df):
    with pd.option_context("display.max_colwidth", 20):
        info = pd.DataFrame()
        info['data type'] = df.dtypes
        info['percent missing'] = df.isnull().sum()*100/len(df)
        info['No. unique'] = df.apply(lambda x: len(x.unique()))
        info['unique values'] = df.apply(lambda x: x.unique())
        return info.sort_values('data type')

In [None]:
sniff_modified(df)

### Handling Categorical Data

For this part we are going to use some built-in Pandas functionality, as opposed to the `category_encoders` package we used last time. To see how this works, we'll carry out the procedure on a toy dataframe first.

In [None]:
hyd = pd.DataFrame({'Hydraulics_Flow': df['Hydraulics_Flow'].unique()})
hyd

In [None]:
hyd.info()

Now we convert the feature `Hydraulics_Flow`, which is a string feature, to a categorical feature.

In [None]:
hyd['Hydraulics_Flow'] = hyd['Hydraulics_Flow'].astype('category').cat.as_ordered()
hyd.info()

Now we label encode the feature. 

In [None]:
hyd['default cat code'] = hyd['Hydraulics_Flow'].cat.codes
hyd

And now we add 1 so that all missing values (`np.nan`) will be coded as 0. 

In [None]:
hyd['our cat code'] = hyd['Hydraulics_Flow'].cat.codes + 1
hyd

In [None]:
hyd.info()
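
The reason adding 1 works is that pandas assigns the code -1 to missing values in a categorical, so the shift maps missing values to 0 and the real categories to 1, 2, and so on. A minimal sketch with made-up values (the labels here are illustrative and not necessarily those in the dataset):

```python
import numpy as np
import pandas as pd

s = pd.Series(['standard', 'high flow', np.nan, 'standard'])
cat = s.astype('category').cat.as_ordered()

print(list(cat.cat.categories))     # categories are sorted lexically
default_codes = cat.cat.codes       # missing values get the code -1
shifted_codes = default_codes + 1   # missing -> 0, categories -> 1, 2, ...
print(default_codes.tolist())
print(shifted_codes.tolist())
```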

In practice we would do these two steps and replace the original feature with the encoded values. 

In [None]:
hyd = pd.DataFrame({'Hydraulics_Flow': df['Hydraulics_Flow'].unique()})
hyd['Hydraulics_Flow'] = hyd['Hydraulics_Flow'].astype('category').cat.as_ordered()
hyd['Hydraulics_Flow'] = hyd['Hydraulics_Flow'].cat.codes + 1
hyd

In [None]:
hyd.info()

Since we have many string features that we would like to convert in this way, we will use functions to make applying this procedure to many features more efficient.

In [None]:
from pandas.api.types import is_object_dtype, is_string_dtype

def df_string_to_cat(df):
    for col in df.columns:
        if is_object_dtype(df[col]) or is_string_dtype(df[col]):
            df[col] = df[col].astype('category').cat.as_ordered()

def df_cat_to_catcode(df):
    for col in df.columns:
        if isinstance(df[col].dtype, pd.CategoricalDtype):
            df[col] = df[col].cat.codes + 1

Now we can convert all string features to categorical features and encode them with the following two lines of code. Note that we have also dealt with all the missing values, as they are encoded with the value of 0 for every feature. 

In [None]:
df_string_to_cat(df)
df_cat_to_catcode(df)

In [None]:
sniff_modified(df)

At this point we have dealt with all of the categorical features and can now move on to dealing with missing values in the numeric features. 

### Notice

> **The unreasonable effectiveness of label encoding categorical variables**
*You might be wondering why it's “legal” to convert all of those unordered (nominal) categorical variables to ordered integers. We know for sure that assuming an order between categories is wrong. The short answer is that RF models can still partition such converted categorical features in a way that is predictive, possibly at the cost of a more complex tree model. This is definitely not true for many models, such as linear regression models (which require so-called “dummy” boolean columns, one for each unique categorical value* - [that is, one hot encoding]). *In practice, we've found label encoding categorical variables surprisingly effective, even when it seems more advanced methods would work better.* (Jeremy Howard and Terence Parr, end of Section 7.5.1 of *Mechanics of Machine Learning*)
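
To make the contrast in the quote concrete, the sketch below encodes one made-up nominal feature both ways: as integer labels (the approach used in this notebook) and as one "dummy" column per category via `pd.get_dummies`. The sample values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Enclosure': ['erops', 'orops', np.nan, 'erops']})

# Label encoding: a single integer column, with 0 reserved for missing
labels = toy['Enclosure'].astype('category').cat.codes + 1

# One-hot encoding: one indicator column per observed category;
# a missing value becomes a row with no indicator set
dummies = pd.get_dummies(toy['Enclosure'], prefix='Enclosure')

print(labels.tolist())
print(dummies)
```

Label encoding keeps the feature in one column no matter how many categories it has, whereas one-hot encoding adds a column per category.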

### Handling Missing Values for Numeric Data

Now that the categorical features have been encoded and their missing values taken care of, we need to address the missing values in the remaining numeric features: `Tire_Size`, `Undercarriage_Pad_Width`, `YearMade`, `Stick_Length`, and `MachineHoursCurrentMeter`. 

The recipe we are going to use here consists of two steps: 
- create a boolean column that is `True` where the original value is missing and `False` otherwise; and,
- fill in the missing values with the median value for that feature. 

To see how it will work in practice, let's try out our recipe on a toy dataset: 

In [None]:
df_toy = pd.DataFrame(data={'YearMade':[1995,2001,np.nan]})
df_toy

In step 1, we add a new boolean column to keep track of where the missing data was. 

In [None]:
df_toy['YearMade_na'] = df_toy['YearMade'].isnull()
df_toy

And in step 2, we replace the missing value with the median value for that feature. 

In [None]:
median_value = df_toy['YearMade'].median()
df_toy['YearMade'] = df_toy['YearMade'].fillna(median_value)
df_toy

We will use a function to apply both of these steps to any given feature in our data. 

In [None]:
def fix_missing_num(df, colname):
    df[colname+'_na'] = pd.isnull(df[colname])
    df[colname] = df[colname].fillna(df[colname].median())

Before we start replacing, let's assess the remaining numeric features that have missing values to see if there are any other problems. 

In [None]:
df['Tire_Size'].unique()

In [None]:
df['Undercarriage_Pad_Width'].unique()

In [None]:
df['Stick_Length'].unique()

In [None]:
np.sort(df['MachineHoursCurrentMeter'].unique())

In [None]:
np.sort(df['YearMade'].unique())

It seems that `Tire_Size`, `Undercarriage_Pad_Width`, and `Stick_Length` are good to go so let's start with them. 

In [None]:
fix_missing_num(df, 'Tire_Size')
fix_missing_num(df, 'Undercarriage_Pad_Width')
fix_missing_num(df, 'Stick_Length')

And check that the missing numbers are now gone. 

In [None]:
df['Tire_Size'].unique()

In [None]:
df['Undercarriage_Pad_Width'].unique()

In [None]:
df['Stick_Length'].unique()

`YearMade` and `MachineHoursCurrentMeter` have some potential issues that we'll need to explore: 
- `YearMade` problem 1: it's doubtful that any bulldozers were made in the year 1000 so we will need to treat these as missing values. In fact some of the other entries seem suspicious so we will use a cutoff year of 1950; any year before 1950 we will consider as missing. So for this feature, we will need to:
    - replace all values below 1950 with `np.nan`; and then,
    - replace those missing values with the median value;
- `YearMade` problem 2: some of the bulldozers have a sale date that comes before the year they were made. For these, we will:
    - replace the `YearMade` value with the year of the sale date;
- `MachineHoursCurrentMeter`: some are listed as having been used for 0 hours; while this may indicate that they are new, the age of the bulldozers suggests that this is probably a missing value (or the owner did not want to put in the true value). For this we will need to:
    - replace the value of 0 with `np.nan`; and then,
    - replace the missing values with the median value.
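
The sequence of fixes listed above can be sketched on a tiny made-up frame before touching the real data. All values below are assumptions for illustration, and `fill_with_indicator` is a hypothetical stand-in that performs the same two steps as `fix_missing_num`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'YearMade': [1000, 1998, 2004, 2009],
    'MachineHoursCurrentMeter': [0.0, 3500.0, 0.0, 1200.0],
    'saledate': pd.to_datetime(['2010-05-01', '2011-03-15',
                                '2011-07-20', '2008-11-02']),
})

def fill_with_indicator(df, colname):
    # Hypothetical helper with the same two steps as fix_missing_num
    df[colname + '_na'] = df[colname].isnull()
    df[colname] = df[colname].fillna(df[colname].median())

# Implausible years and zero meter readings are treated as missing
toy['YearMade'] = toy['YearMade'].where(toy['YearMade'] >= 1950)
toy['MachineHoursCurrentMeter'] = toy['MachineHoursCurrentMeter'].replace(0, np.nan)
fill_with_indicator(toy, 'YearMade')
fill_with_indicator(toy, 'MachineHoursCurrentMeter')

# A machine cannot be sold before it was made: clamp to the sale year
sale_year = toy['saledate'].dt.year
toy['YearMade'] = toy['YearMade'].where(toy['YearMade'] <= sale_year, sale_year)
print(toy)
```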

Let's fix `YearMade` first. 

In [None]:
df.loc[df['YearMade']<1950, 'YearMade'] = np.nan
fix_missing_num(df, 'YearMade')

In [None]:
df.loc[df.eval("saledate.dt.year < YearMade"), 'YearMade'] = df['saledate'].dt.year

Now let's fix `MachineHoursCurrentMeter`. 

In [None]:
df.loc[df.eval("MachineHoursCurrentMeter==0"), 'MachineHoursCurrentMeter'] = np.nan
fix_missing_num(df, 'MachineHoursCurrentMeter')

And check that it worked. 

In [None]:
np.sort(df['MachineHoursCurrentMeter'].unique())

In [None]:
np.sort(df['YearMade'].unique())

And check the final cleaned data set.

In [None]:
sniff_modified(df)

We now have everything cleaned up except for `saledate`, which we'll tackle in the next notebook.

### Recall Our Baseline Model

In [None]:
basefeatures = ['SalesID', 'MachineID', 'ModelID',
                'datasource', 'YearMade',
                'auctioneerID', 'MachineHoursCurrentMeter']

In [None]:
df_baseline = df_raw.copy() 
df_baseline = df_baseline.iloc[-100000:]

In [None]:
X_baseline = df_baseline[basefeatures]
y_baseline = df_baseline['SalePrice']

X_baseline = X_baseline.fillna(0)

In [None]:
%%time
rf_baseline, oob_baseline_initial = evaluate(X_baseline, y_baseline, n_estimators=50)

In [None]:
showimp(rf_baseline, X_baseline, y_baseline)

### Train a New Model

Now let's use our cleaned up data to train a new model and see if we have improved the performance compared to the baseline model. 

In [None]:
X = df.drop(['SalePrice', 'saledate'], axis=1)
y = df['SalePrice']

rf, oob_all = evaluate(X, y, n_estimators=50)

We see that with the cleaned up features we get a nice increase in our OOB $R^2$ score. To see where else we may get an improvement, let's look at the feature importances. 

In [None]:
showimp(rf, X, y)

From the plot we can see what we should try next. There is not much left to do with `YearMade`, but the other important features, such as `ProductSize`, `fiProductClassDesc`, `Enclosure`, `Hydraulics_Flow`, and `fiSecondaryDesc`, deserve a closer look. Along with `saledate`, that is what we will do next.

Before we finish, we will save our cleaned data so we don't have to repeat the cleaning process as we explore how to further improve our model's performance. 

### Save the Cleaned Data

In [None]:
df = df.reset_index(drop=True)
df.to_feather("bulldozer-train-clean.feather")