## Part 1 - Dealing with Missing Data


**Notice: This notebook is a modification of [sniff.ipynb](https://mlbook.explained.ai/notebooks/index.html) by Terence Parr and Jeremy Howard, which was used by permission of the author.**

In [None]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from rfpimp_MC import * 

We will use slightly modified versions of the `evaluate` and `showimp` functions from the last few notebooks. 

In [None]:
def evaluate(X, y, n_estimators=50):
    rf = RandomForestRegressor(n_estimators=n_estimators, n_jobs=-1, oob_score=True)
    rf.fit(X, y)
    oob = rf.oob_score_
    n = rfnnodes(rf)
    h = np.median(rfmaxdepths(rf))
    print(f"OOB R^2 is {oob:.5f} using {n:,d} tree nodes with {h} median tree depth")
    return rf, oob

In [None]:
def showimp(rf, X, y):
    features = list(X.columns)
    I = importances(rf, X, y, features=features)
    plot_importances(I, color='#4575b4')

### Read in the Data

To prep the data for loading, please refer to [Section 7.1](https://mlbook.explained.ai/bulldozer-intro.html#sec:7.1) of *The Mechanics of Machine Learning*. Once we have the data in the proper format, we are going to read it in and make a copy. Working on a copy is a good idea because the original data stays available without having to reload it.

In [None]:
df_raw = pd.read_feather("bulldozer-train.feather")
df = df_raw.copy()

Now let's see how much data we are dealing with:

In [None]:
df.shape

And get an idea of what it looks like:

In [None]:
df.head().T

Let's get a bit more information on the data so we can start planning what we need to do. To do this, we will use the following function:

In [None]:
def sniff_modified(df):
    with pd.option_context("display.max_colwidth", 20):
        info = pd.DataFrame()
        info['data type'] = df.dtypes
        info['percent missing'] = df.isnull().sum()*100/len(df)
        info['No. unique'] = df.apply(lambda x: len(x.unique()))
        info['unique values'] = df.apply(lambda x: x.unique())
        return info.sort_values('data type')

In [None]:
sniff_modified(df)

### Quickly get a Baseline Model

As we did before, we need to specify our target (`SalePrice`) and then focus in on the numeric data and create and evaluate a baseline model. That means we will consider only these features for now: `SalesID`, `MachineID`, `ModelID`, `datasource`, `YearMade`, `auctioneerID`, and `MachineHoursCurrentMeter`. 

However, as seen above, the last two contain missing values, so we will have to deal with that in order to create a model. 

Remember that we need two things to train a model:
- all numeric data
- no missing values
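
The two requirements above can be checked before training. The following is a minimal sketch on a toy dataframe; the helper name `training_blockers` and the toy values are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

def training_blockers(X):
    """List the columns that violate either of the two requirements."""
    non_numeric = [c for c in X.columns if not is_numeric_dtype(X[c])]
    has_missing = [c for c in X.columns if X[c].isnull().any()]
    return {"non_numeric": non_numeric, "has_missing": has_missing}

# Toy frame: a clean numeric column, a numeric column with a gap, and a string column
toy = pd.DataFrame({
    "YearMade": [1998, 2004, 2001],
    "MachineHoursCurrentMeter": [1200.0, np.nan, 540.0],
    "Drive_System": ["Four Wheel Drive", None, "Two Wheel Drive"],
})

report = training_blockers(toy)
print(report)
```

An empty list under both keys would mean the dataframe meets both requirements.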

In [None]:
basefeatures = ['SalesID', 'MachineID', 'ModelID',
                'datasource', 'YearMade',
                'auctioneerID', 'MachineHoursCurrentMeter']

In [None]:
X = df[basefeatures]
y = df['SalePrice']

X = X.fillna(0)

In [None]:
%%time
rf, oob_baseline_initial = evaluate(X, y, n_estimators=100)

To speed up training, let's cut the number of trees from 100 to 50 and see how the training time and the OOB score change:

In [None]:
%%time
rf, oob_baseline_initial = evaluate(X, y, n_estimators=50)

To speed up the training time even further, we will only work with a portion of the data: 100,000 samples. If the data had no time sensitivity, we would take a random sample. Because our data is time-sensitive, we will take the last 100,000 samples instead: prices change over time (due to inflation, for example), so more recent data should be better at predicting near-future sale prices.

In [None]:
# .copy() makes the subset independent, so the in-place edits below do not act on a slice
df = df.iloc[-100000:].copy()
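
Taking the last rows only gives the most recent sales if the rows are in chronological order. The sketch below contrasts the two sampling strategies on a toy dataframe. The date column name `saledate` and all values are assumptions for illustration; sort by whatever date column the prepared data actually has.

```python
import pandas as pd

# Toy data: ten sales on consecutive days (hypothetical values)
toy = pd.DataFrame({
    "saledate": pd.date_range("2011-01-01", periods=10, freq="D"),
    "SalePrice": range(10),
})

n = 3

# Time-sensitive data: sort by date first, then keep the n most recent rows
recent = toy.sort_values("saledate").iloc[-n:]

# Data with no time dependence: a random sample would do
random_rows = toy.sample(n=n, random_state=0)

print(recent)
print(random_rows)
```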

In [None]:
X = df[basefeatures]
y = df['SalePrice']

X = X.fillna(0)

In [None]:
%%time
rf, oob_baseline_initial = evaluate(X, y, n_estimators=50)

In [None]:
showimp(rf, X, y)

### Cleaning up the Data

To try to improve our model's performance (which may or may not be possible), we will clean up our data with the following procedure:
- drop features that have no predictive value or have known problems that can't be fixed;
- convert features that are really categorical from their current numeric data type to an object data type;
- normalize the representation of missing data;
- clean up strings that are actually numeric;
- extract features;
- encode categorical features;
- deal with missing data.

#### Removing Features

Let's start with features we can get rid of: 

- `SalesID` can be deleted as it is a unique identifier, that is, each row in the data has a unique value for `SalesID` so our model will not be able to use this to help it generalize; 
- `MachineID` should be deleted as it can be shown to have inconsistencies and errors, as in the same `MachineID` showing up as being manufactured in many different years (see link in [Section 7.4](https://mlbook.explained.ai/bulldozer-intro.html#sec:7.4) for details).  
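
Both claims can be checked directly. Here is a minimal sketch on a toy dataframe; the column names match our data, but the values are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "SalesID":   [1, 2, 3, 4, 5],
    "MachineID": [10, 10, 11, 12, 12],
    "YearMade":  [1998, 2004, 2001, 2006, 2006],
})

# A column with one distinct value per row cannot help the model generalize
salesid_is_unique = toy["SalesID"].is_unique

# Count distinct manufacturing years per machine; more than one signals an inconsistency
years_per_machine = toy.groupby("MachineID")["YearMade"].nunique()
inconsistent = years_per_machine[years_per_machine > 1]

print(salesid_is_unique)
print(inconsistent)
```

In this toy example only machine 10 is flagged, because it appears with two different values of `YearMade`.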

In [None]:
df.drop(['SalesID', 'MachineID'], axis=1, inplace=True)
# df = df.drop(['SalesID', 'MachineID'], axis=1)

df.columns

#### Convert DataType

Sometimes what are really categorical variables show up in our data as numeric, so we need to figure out how to handle these situations. Of our original numeric data, it seems that we have a few that are categorical:

- `ModelID` is a nominal categorical feature but it is already encoded as a number and has no missing values so we will leave it as it is;
- `datasource` is a nominal categorical feature but it is already encoded as a number and has no missing values so we will leave it as it is;
- `auctioneerID` is a nominal categorical feature and since it has missing values we will convert this to an *object* data type and deal with it when we handle categorical encoding and missing values.

So, for now we will simply convert `auctioneerID` from numeric to string data type. 

In [None]:
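# astype(str) turns missing values into the literal string 'nan', so convert those back to np.nan
df['auctioneerID'] = df['auctioneerID'].astype(str).replace('nan', np.nan)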

#### What does *missing* mean?

The concept of missing is not usually straightforward so you will have to do some digging into the data to see what you find. 

In [None]:
missing = pd.DataFrame({'colour':['Unspecified', 'red', None, '', 'None', 'yellow'], 'width':[ 12, -1, '', 14, 999, np.nan]})
missing

In [None]:
missing.isnull()
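
`isnull()` only flags `None` and `np.nan`. The sketch below counts the other stand-ins for missing in the same toy dataframe. The two sentinel lists are assumptions chosen for this example; in real data they have to come from inspecting the values.

```python
import numpy as np
import pandas as pd

missing = pd.DataFrame({
    'colour': ['Unspecified', 'red', None, '', 'None', 'yellow'],
    'width': [12, -1, '', 14, 999, np.nan],
})

# What isnull() detects: one entry per column
detected = missing.isnull().sum()

# Hypothetical sentinel values that also mean "missing" in this toy example
string_sentinels = ['unspecified', 'none', '']
width_sentinels = [-1, 999, '']

# Compare case-insensitively for the strings, element by element for the mixed column
hidden_colour = int(missing['colour'].dropna().str.lower().isin(string_sentinels).sum())
hidden_width = int(missing['width'].map(lambda v: v in width_sentinels).sum())

print(detected.to_dict(), hidden_colour, hidden_width)
```

In each column, `isnull()` finds one missing entry while three more are hidden behind sentinel values.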

#### How Missing is Represented in Our Data

Let's see how missing values are showing up in our data:

In [None]:
df['Drive_System'].unique()

In [None]:
df['Backhoe_Mounting'].unique()

In [None]:
df['fiModelSeries'].unique()

#### Normalize the Representation of Missing Values

Missing data will be much easier to handle if we collapse all the different ways this dataset signals a missing value into a single representation: `np.nan`. To do this we will use the following function, which:
- converts all strings (text) to lower case;
- fills actual missing data with `np.nan`, which has the effect of converting `None` to `np.nan`;
- converts all the other representations ('none', 'none or unspecified', '#name?', and '') to `np.nan`.

In [None]:
from pandas.api.types import is_string_dtype, is_object_dtype

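def df_normalize_strings(df):
    """Normalize missing-value markers in the string columns of df, in place."""
    for col in df.columns:
        if is_string_dtype(df[col]) or is_object_dtype(df[col]):
            df[col] = df[col].str.lower()      # 'None' and 'none' become the same string
            df[col] = df[col].fillna(np.nan)   # None -> np.nan
            # Map every textual stand-in for "missing" to np.nan
            df[col] = df[col].replace('none or unspecified', np.nan)
            df[col] = df[col].replace('none', np.nan)
            df[col] = df[col].replace('#name?', np.nan)
            df[col] = df[col].replace('', np.nan)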

In [None]:
df_normalize_strings(df)

In [None]:
df['Drive_System'].unique()

In [None]:
df['Backhoe_Mounting'].unique()

In [None]:
df['fiModelSeries'].unique()

#### Numeric Features Hiding as Strings

Some of the features that are being stored as strings are actually numeric: `Tire_Size`, `Undercarriage_Pad_Width`, `Blade_Width`, and `Stick_Length`. The first two are easier so let's look at them first.

In [None]:
df['Tire_Size'].unique()

In [None]:
df['Undercarriage_Pad_Width'].unique()

In [None]:
m = '36 inch'.split()
int(m[0])

For these two we are going to: 
- extract numbers using a regular expression;
- replace any resulting missing value with `np.nan` (just in case);
- convert the column to numeric data type.

To see how regular expressions are going to work for us, let's use a toy dataframe. 

In [None]:
regexp = pd.DataFrame({'Tire_Size':['12', 'some text 14 some text', '13"', '12.5"']})
regexp

In [None]:
regexp.info()

Now let's use a regular expression to extract the types of numbers we expect to see in the `Tire_Size` feature. 


In [None]:
regexp['Tire_Size'] = regexp['Tire_Size'].str.extract(r'(\d+\.\d+|\d+)') 
regexp

In the `extract(r'(\d+\.\d+|\d+)')` call above we have the following basic elements:

- `\d+` to match any sequence of one or more digits, e.g., '12', '123', '1234';
- `\.` to match a literal decimal point '.';
- `|` as the OR operator; and
- the surrounding `( )` to mark the capture group whose contents `extract` returns.

These basic elements are used in the following way: 
- `\d+\.\d+` to extract any sequence of one or more digits followed by a decimal point followed by another sequence of one or more digits; OR 
- `\d+` to extract any sequence of one or more digits when there is no decimal point. 
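
The order of the two alternatives matters, because the regular expression engine tries them from left to right and stops at the first one that matches. The sketch below uses Python's built-in `re` module instead of pandas to compare the pattern above with a version in which the alternatives are swapped. The names `decimal_first` and `integer_first` are made up for this illustration.

```python
import re

decimal_first = re.compile(r'(\d+\.\d+|\d+)')   # the pattern used above
integer_first = re.compile(r'(\d+|\d+\.\d+)')   # same alternatives, swapped

samples = ['12', 'some text 14 some text', '13"', '12.5"']

good = [decimal_first.search(s).group(1) for s in samples]
bad = [integer_first.search(s).group(1) for s in samples]

print(good)  # ['12', '14', '13', '12.5']
print(bad)   # ['12', '14', '13', '12']
```

With the integer alternative first, `'12.5"'` is cut off at `'12'`, which is why the decimal alternative is listed first.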

We should also notice that extracting the numbers does not mean that the column has been converted to a numeric data type. So far, we have just cleaned up the strings that represent numbers so that all non-numeric characters (like " as the shorthand notation for inches) have been removed.

In [None]:
regexp.info()

We need to explicitly convert the feature to a numeric data type:

In [None]:
regexp['Tire_Size'] = pd.to_numeric(regexp['Tire_Size'])
regexp.info()

We will now create a function to do this for any feature:

In [None]:
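def extract_sizes(df, colname):
    # Keep only the first number found in each string; rows with no number become NaN.
    # expand=False returns a Series, which is what a single column assignment expects.
    df[colname] = df[colname].str.extract(r'(\d+\.\d+|\d+)', expand=False)
    df[colname] = df[colname].replace('', np.nan)  # just in case an empty string slips through
    df[colname] = pd.to_numeric(df[colname])       # strings -> numeric data type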

We can now apply this function to `Tire_Size` and `Undercarriage_Pad_Width`. 

In [None]:
df['Tire_Size'].unique()

In [None]:
extract_sizes(df, 'Tire_Size')

In [None]:
df['Tire_Size'].unique()

In [None]:
extract_sizes(df, 'Undercarriage_Pad_Width')

Dealing with `Blade_Width` is a bit more complicated because of the `"<12'"` value:

In [None]:
df['Blade_Width'].unique()

There are a couple of ways to approach this:
- convert it into numeric form; or 
- consider this to be a categorical variable given the small number of unique values. 

This feature has missing values so we have to consider that as well. If we convert it to numeric we will end up replacing the missing values with a median value. And if we treat it as categorical, then the missing values will form their own category. 
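
The two options for the missing values can be sketched on a toy series. This is only an illustration under assumptions: the values are made up, and shifting the category codes by one so that missing entries get code 0 is one common convention, not something this notebook has settled on.

```python
import numpy as np
import pandas as pd

# Numeric route: fill the gaps with the median of the observed values
width = pd.Series([12.0, 14.0, np.nan, 13.0, np.nan, 16.0])
numeric_filled = width.fillna(width.median())

# Categorical route: pandas gives missing entries the category code -1,
# so adding 1 moves them to their own code 0
labels = pd.Series(["12'", "14'", np.nan, "13'", np.nan, "16'"])
codes = labels.astype('category').cat.codes + 1

print(numeric_filled.tolist())
print(codes.tolist())
```

In the numeric version the two gaps become 13.5, the median of the four observed widths. In the categorical version they share the code 0, separate from every observed width.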

We will treat this as a categorical variable but the next section shows how you could go about converting it to numeric if you chose to do that. 

##### Aside: convert `Blade_Width` to numeric

To demonstrate this we will use a toy dataset that consists of all the unique values found in our data.

In [None]:
blade = pd.DataFrame({'width':[np.nan, "12'", "14'", "13'", "16'", "<12'"]})
blade

Since there aren't that many unique values, we can create a mapping that directly converts all the values to their corresponding number. Here, I am grouping all the `"<12'"` under the number 11. This may not be ideal as that will impact the median value when we replace the missing values. 

In [None]:
# Values with no entry in the mapping, including the existing np.nan, come out as NaN
blade['width'] = blade['width'].map({"12'": 12, "13'": 13, "14'": 14, "16'": 16, "<12'": 11})
blade

In [None]:
blade.info()

In [None]:
df['Stick_Length'].unique()

`Stick_Length` is similar to `Blade_Width` and we could handle it in the same way. However, in this case I am going to convert it to numeric using `apply()` and a new function. Aside from the missing values, the entries for this feature all have the same structure: `'10\' 6"'`. The outer quotation marks (' ') tell us this is a string, the escaped `\'` in the middle is a literal single quote that is shorthand for `feet`, and the `"` is shorthand for `inches`. The feet and inches parts are separated by a space, which is what the function below splits on.

The steps we'll need to take are: 
- extract the number for feet and the number for inches; 
- multiply the number of feet by 12 to convert to inches; and 
- add it to the number of inches. 

This will convert the `Stick_Length` feature to a numeric column where the unit of length is the inch. 

To see how this is going to work we will create a toy dataframe using the unique values for `Stick_Length`. 

In [None]:
stick = pd.DataFrame({'length': df['Stick_Length'].unique()})
                      
stick.head()

In [None]:
stick.info()

In [None]:
# modified version of: https://stackoverflow.com/questions/26986655/changing-height-feet-and-inches-to-an-integer-in-python-pandas

def parse_length(length):
    # Missing values pass through as NaN
    if pd.isnull(length):
        return np.nan
    # Split on the apostrophe-plus-space that separates feet from inches
    split_length = length.split("' ")
    feet = float(split_length[0])
    inches = float(split_length[1].replace("\"", ""))  # drop the trailing inch mark
    return (12*feet) + inches

stick['length'] = stick["length"].apply(parse_length)

In [None]:
stick.head()

In [None]:
stick.info()
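
`parse_length` relies on every entry having exactly one space between the feet and inches parts. If that cannot be guaranteed, a regular expression is more forgiving. The following is a sketch of an alternative, not the method used in this notebook; the names `LENGTH_RE` and `parse_length_regex` are made up for illustration, and it uses only the Python standard library.

```python
import math
import re

# feet, an apostrophe, optional whitespace, inches, a double quote
LENGTH_RE = re.compile(r"""^\s*(\d+(?:\.\d+)?)'\s*(\d+(?:\.\d+)?)"\s*$""")

def parse_length_regex(length):
    """Convert a feet-and-inches string to inches; anything unparseable becomes NaN."""
    if not isinstance(length, str):
        return math.nan          # covers None and np.nan
    m = LENGTH_RE.match(length)
    if m is None:
        return math.nan          # text that is not a feet-and-inches length
    feet, inches = float(m.group(1)), float(m.group(2))
    return 12 * feet + inches

print(parse_length_regex("10' 6\""))   # 126.0
print(parse_length_regex("10'6\""))    # 126.0
print(parse_length_regex(None))        # nan
```

The trade-off is that malformed entries silently become NaN instead of raising an error, so it is worth checking how many values end up missing after the conversion.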

### Summary

It is a good time to recall everything that we have done so far. We have:

- dropped the `SalesID` and `MachineID` features;
- converted `auctioneerID` to 'string' data type so we can treat it as a categorical feature;
- decided to leave `Blade_Width` as 'string' and treat as a categorical feature instead of converting to numeric;
- extracted numeric features from the original `Undercarriage_Pad_Width` and `Tire_Size` strings;
- converted `Stick_Length` to a numeric feature from the original string representation;
- normalized the representation of missing values to `np.nan`.

Since this process gets messy and, at times, difficult to keep track of, let's reproduce everything we've done so we can see it all in one place. 

In [None]:
df = df_raw.copy()
df = df.iloc[-100000:].copy()  # independent copy, so the in-place edits below do not act on a slice

In [None]:
df.drop(['SalesID', 'MachineID'], axis=1, inplace=True)
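# astype(str) turns missing values into the literal string 'nan', so convert those back to np.nan
df['auctioneerID'] = df['auctioneerID'].astype(str).replace('nan', np.nan)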
df_normalize_strings(df)
extract_sizes(df, 'Tire_Size')
extract_sizes(df, 'Undercarriage_Pad_Width')
df['Stick_Length'] = df['Stick_Length'].apply(lambda x: parse_length(x))

In [None]:
sniff_modified(df)