## Chapter 9

Fastbook walkthrough.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
from fastai.tabular.all import Path

In [None]:
path = Path('../input/bluebook-for-bulldozers')

### The data

It is good practice to pass `low_memory=False`. `low_memory` is enabled by default, which makes pandas read the file in chunks and can leave a column with mixed inferred types, causing problems later.

In [None]:
df = pd.read_csv(path/'TrainAndValid.csv', low_memory=False)

In [None]:
df.columns

Ordinal columns are columns that contain strings (or similar) whose values have a natural ordering, like `ProductSize` here.

In [None]:
df['ProductSize'].unique()

In [None]:
# we can tell pandas about a suitable ordering of these levels

sizes = 'Large', 'Large / Medium', 'Medium', 'Small', 'Mini', 'Compact'

In [None]:
df['ProductSize'] = df['ProductSize'].astype('category')
print(df['ProductSize'])
# set_categories returns a new Series; the inplace argument was removed in pandas 2.0
df['ProductSize'] = df['ProductSize'].cat.set_categories(sizes, ordered=True)
print(df['ProductSize'])

In [None]:
df['ProductSize']

The most important column is the one we want to predict: the dependent variable.

In [None]:
dep_var = 'SalePrice'

Kaggle tells us which metric to use: the root mean squared log error (RMSLE) between the actual and predicted auction prices. We therefore take the log of the prices, so that the RMSE of the logged values gives us the RMSLE.

In [None]:
df[dep_var] = np.log(df[dep_var])
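
As a small check of why the target is logged once up front, the sketch below compares the two routes on made-up prices. It assumes the plain-log form of RMSLE described above (no `+1` offset); the helper names `rmse` and `rmsle` and the prices are illustrative only.

```python
import numpy as np

def rmse(pred, target):
    """Root mean squared error between two arrays."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))

def rmsle(pred, target):
    """RMSLE in the plain-log form (assumption: no +1 offset)."""
    return rmse(np.log(pred), np.log(target))

# Hypothetical prices in dollars, not taken from the dataset
actual = np.array([10_000.0, 25_000.0, 60_000.0])
predicted = np.array([12_000.0, 20_000.0, 66_000.0])

# Logging once and then using plain RMSE gives the same number as RMSLE
log_actual, log_predicted = np.log(actual), np.log(predicted)
print(rmsle(predicted, actual), rmse(log_predicted, log_actual))
```

This is why the rest of the notebook can use an ordinary RMSE function on the logged `SalePrice`.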

### Decision Tree ensembles

A decision tree asks a series of yes/no questions about the data. After each question, the data at that part of the tree is split between a yes branch and a no branch.

### Rudimentary decision tree from scratch

This sequence of questions is now a procedure for taking any data item, whether an item from the training set or a new one, and assigning that item to a group. Namely, after asking and answering the questions, we can say the item belongs to the same group as all the other training data items that yielded the same set of answers to the questions. But what good is this? The goal of our model is to predict values for items, not to assign them into groups from the training dataset. The value is that we can now assign a prediction value for each of these groups—for regression, we take the target mean of the items in the group.

Let's consider how we find the right questions to ask. Of course, we wouldn't want to have to create all these questions ourselves—that's what computers are for! The basic steps to train a decision tree can be written down very easily:

1. Loop through each column of the dataset in turn.
1. For each column, loop through each possible level of that column in turn.
1. Try splitting the data into two groups, based on whether they are greater than or less than that value (or if it is a categorical variable, based on whether they are equal to or not equal to that level of that categorical variable).
1. Find the average sale price for each of those two groups, and see how close that is to the actual sale price of each of the items of equipment in that group. That is, treat this as a very simple "model" where our predictions are simply the average sale price of the item's group.
1. After looping through all of the columns and all the possible levels for each, pick the split point that gave the best predictions using that simple model.
1. We now have two different groups for our data, based on this selected split. Treat each of these as separate datasets, and find the best split for each by going back to step 1 for each group.
1. Continue this process recursively, until you have reached some stopping criterion for each group—for instance, stop splitting a group further when it has only 20 items in it.
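
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration for numeric columns only (categorical equality splits are left out), on synthetic data; `best_split`, `grow`, and `predict_one` are hypothetical helper names, and the squared error of the group means stands in for "how close the average is to the actual values".

```python
import numpy as np

def best_split(X, y):
    """Return (column, level, score) of the split with the lowest
    total squared error, or None if no split is possible."""
    best = None
    for col in range(X.shape[1]):                 # step 1: every column
        for level in np.unique(X[:, col])[1:]:    # step 2: every level
            left = X[:, col] < level              # step 3: two groups
            # step 4: each group's prediction is its own mean
            score = (((y[left] - y[left].mean()) ** 2).sum()
                     + ((y[~left] - y[~left].mean()) ** 2).sum())
            if best is None or score < best[2]:   # step 5: keep the best
                best = (col, level, score)
    return best

def grow(X, y, min_items=20):
    """Steps 6-7: split recursively until a group has min_items or fewer rows."""
    split = None if len(y) <= min_items else best_split(X, y)
    if split is None:
        return {"value": y.mean()}                # leaf: predict the group mean
    col, level, _ = split
    left = X[:, col] < level
    return {"col": col, "level": level,
            "lo": grow(X[left], y[left], min_items),
            "hi": grow(X[~left], y[~left], min_items)}

def predict_one(node, row):
    """Answer the questions from the root down to a leaf."""
    while "value" not in node:
        node = node["lo"] if row[node["col"]] < node["level"] else node["hi"]
    return node["value"]

# Synthetic data: the target depends only on column 0
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = np.where(X[:, 0] < 5, 1.0, 3.0) + rng.normal(0, 0.1, size=200)
tree = grow(X, y)
print(tree["col"], tree["level"], predict_one(tree, np.array([2.0, 7.0])))
```

Skipping the lowest level of each column guarantees that neither group is ever empty. The search is brute force, so it is only practical on small data.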

Although this is an easy enough algorithm to implement yourself (and it is a good exercise to do so), we can save some time by using the implementation built into sklearn.

First, however, we need to do a little data preparation.

In [None]:
# my attempt at above algorithm
# https://www.analyticsvidhya.com/blog/2020/10/all-about-decision-tree-from-scratch-with-python-implementation/

# picking categorical and non categorical
# https://stackoverflow.com/questions/35826912/what-is-a-good-heuristic-to-detect-if-a-column-in-a-pandas-dataframe-is-categori
likely_cat = {}
for i in df.columns:
    #print(i)
    likely_cat[i] = 1.*df[i].nunique()/df[i].count() < 0.05 
    
# also handled by cont_cat_split

In [None]:
likely_cat

# these are the columns that might be categorical

In [None]:
len(df)

In [None]:
import math

# one level of a decision tree split
# with a single, arbitrarily chosen split point per column

# looping through each column
for i in df.columns:
    # looping through levels
    #print("levels for " , i)
    #print(df[i].nunique())
    
    # splitting into groups
    if df[i].dtype.kind in 'biufc' and df[i].nunique() > 20:
        # the column is continuous
        print(" FOR CONTINUOUS ", i)
        split_mean = df[i].mean()
        print(" SPLIT AT ", split_mean)
        
        # split the dataframe at the column mean; rows equal to the mean go right
        df_left = df[df[i] < split_mean]
        df_right = df[df[i] >= split_mean]
        
        print(" SALEPRICE FOR LEFT NODE ", df_left[dep_var].mean())
        print(" SALEPRICE FOR RIGHT NODE ", df_right[dep_var].mean())
        print(" ACTUAL SALEPRICE FOR NODE ", df[dep_var].mean())
        
        print("\n")
        
    else:
        # the column is categorical
        print(" FOR CATEGORICAL ", i)
        # split at the middle entry of the unique values (an arbitrary choice)
        split_cat = df[i].unique()[len(df[i].unique()) // 2]
        print(" SPLIT AT ", split_cat)
        
        df_cat = df[df[i] == split_cat]
        print(" SALEPRICE FOR SPLIT NODE ", df_cat[dep_var].mean())
        print(" ACTUAL SALEPRICE FOR NODE ", df[dep_var].mean())
        
        print("\n")
    

# we are not splitting recursively, nor trying every possible level of each column,
# but it gives a better understanding of what a decision tree does

### Handling dates in the dataset

A date can be treated as an ordinal value, but its format presents a unique challenge.

In fastai we expand the date into multiple columns, such as day of the week and month, through the `add_datepart` function.
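
To give a feel for what such an expansion produces, here is a pandas-only sketch on made-up dates. It is not fastai's implementation; it only mimics a few of the `sale...` column names, and the dates are hypothetical.

```python
import pandas as pd

# Hypothetical sale dates, not taken from the dataset
sales = pd.DataFrame({"saledate": pd.to_datetime(
    ["2011-10-03", "2011-12-25", "2012-01-31"])})

d = sales["saledate"].dt
sales["saleYear"] = d.year
sales["saleMonth"] = d.month
sales["saleDayofweek"] = d.dayofweek        # Monday is 0
sales["saleIs_month_end"] = d.is_month_end
# Seconds since 1970-01-01: one number that keeps the ordering of the dates
sales["saleElapsed"] = (sales["saledate"] - pd.Timestamp("1970-01-01")).dt.total_seconds()
print(sales.drop(columns="saledate"))
```

Each new column is a plain number or Boolean, which a decision tree can split on directly.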

In [None]:
from fastai.tabular.all import add_datepart

In [None]:
df = add_datepart(df, 'saledate')

In [None]:
# same for test
df_test = pd.read_csv(path/'Test.csv', low_memory=False)
df_test = add_datepart(df_test, 'saledate')

In [None]:
[o for o in df.columns if o.startswith('sale')]

### Using TabularPandas and TabularProc

We need to handle the missing data and the strings. Since sklearn doesn't do this out of the box, we will use fastai's `TabularProc`s `Categorify` and `FillMissing`.

A `TabularProc` is like a regular transform, except that:

* it modifies the object in place
* it runs once, when the data is first passed in, rather than lazily as the data is accessed

`Categorify` replaces a column with a numeric categorical column.

`FillMissing` replaces the missing values with the median of the column and creates a new Boolean column that indicates which values were filled in.
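
The effect of these two steps can be imitated with plain pandas. The sketch below is only an approximation of what the fastai procs do, on a made-up frame; the column `MachineHours` is a hypothetical name.

```python
import pandas as pd

# Hypothetical rows, not taken from the dataset
raw = pd.DataFrame({
    "Enclosure": ["OROPS", "EROPS", None, "OROPS"],
    "MachineHours": [120.0, None, 300.0, 80.0],
})

# Categorify-like step: strings become integer codes, with 0 kept for missing
cats = raw["Enclosure"].astype("category")
raw["Enclosure"] = cats.cat.codes + 1          # pandas itself uses -1 for missing

# FillMissing-like step: flag the gaps, then fill them with the median
raw["MachineHours_na"] = raw["MachineHours"].isna()
raw["MachineHours"] = raw["MachineHours"].fillna(raw["MachineHours"].median())
print(list(cats.cat.categories))
print(raw)
```

Keeping the `_na` flag lets the model learn from the fact that a value was missing, which the filled-in median alone would hide.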

In [None]:
from fastai.tabular.all import Categorify, FillMissing

In [None]:
# I guess we will use it often in tabular data
procs = [Categorify, FillMissing]

`TabularPandas` can also split the data into training and validation sets for us. We cannot choose the rows at random, because this is a time series.

`np.where` returns the indices of all `True` values.

Here the validation set is every row for which the condition below is false, that is, rows from 2011 or later whose sale month is October or later.
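
A tiny example makes the behaviour of the mask and of `np.where` concrete. The years and months below are made up, and the names `is_train`, `train_rows`, and `valid_rows` are illustrative.

```python
import numpy as np

# Hypothetical sale years and months, one entry per row
sale_year = np.array([2010, 2011, 2011, 2011, 2012])
sale_month = np.array([12, 9, 10, 11, 1])

is_train = (sale_year < 2011) | (sale_month < 10)
train_rows = np.where(is_train)[0]     # positions where the mask is True
valid_rows = np.where(~is_train)[0]    # positions where it is False
print(train_rows, valid_rows)
```

Note that the last row (January 2012) lands in the training rows, because any month before October satisfies the second half of the condition whatever the year. Whether that matters depends on the date range of the data being split.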

In [None]:
cond = (df.saleYear<2011) | (df.saleMonth<10)
train_idx = np.where(cond)[0]
valid_idx = np.where(~cond)[0]

splits = (list(train_idx), list(valid_idx))

`TabularPandas` is the tabular equivalent of a fastai `Datasets` object. It needs to be told which columns are continuous and which are categorical.

In [None]:
from fastai.tabular.all import cont_cat_split, TabularPandas

In [None]:
cont, cat = cont_cat_split(df, 1, dep_var=dep_var)
to = TabularPandas(df, procs,cat,cont, y_names=dep_var, splits=splits)

In [None]:
cont

In [None]:
len(to.train), len(to.valid)

In [None]:
to.show(3)

In [None]:
to1 = TabularPandas(df, procs, ['state', 'ProductGroup', 'Drive_System', 'Enclosure' ], [], y_names=dep_var, splits=splits)

In [None]:
to1.show(3)

In [None]:
to.items.head(3)['ProductGroup']
# the underlying items are all numeric

In [None]:
to1.items[['state', 'ProductGroup', 'Drive_System', 'Enclosure']].head(3)

In [None]:
to.classes['ProductSize']

In [None]:
try:
    to.items['ProductSize']
except Exception as e:
    print(e)

In [None]:
from fastai.tabular.all import save_pickle

In [None]:
save_pickle('to.pkl',to)

In [None]:
try:
    to = (path/'to.pkl').load()
except Exception as e:
    print(e)
    print('Am I living a lie?')

In [None]:
from fastai.tabular.all import load_pickle

In [None]:
to = load_pickle('to.pkl')

In [None]:
xs, y = to.train.xs, to.train.y
valid_xs, valid_y = to.valid.xs, to.valid.y

In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
m = DecisionTreeRegressor(max_leaf_nodes=4)
m.fit(xs, y)

In [None]:
# visualising tree - https://mljar.com/blog/visualize-decision-tree/
from sklearn import tree

In [None]:
text_representation = tree.export_text(m)
print(text_representation)

In [None]:
for i in to.__dir__():
    print(i)

In [None]:
print(to.x_names)

In [None]:
to.y_names

In [None]:
# class_names only applies to classifiers, so it is not needed for a regressor
_ = tree.plot_tree(m, feature_names=to.x_names, filled=True)

Let's try to explain this tree.
The top node represents the initial model before any splits have been made, when all the data is in one group.  
It predicts the average of the whole dataset: a log sale price of 10.10, with a mean squared error of 0.48.
The best split found was on `Coupler_System`.

Moving down and to the left, the next best split is on `YearMade`. The leaf nodes have no further questions to ask.

In [None]:
!pip install dtreeviz

In [None]:
from dtreeviz.trees import *

In [None]:
samp_idx = np.random.permutation(len(y))[:500]
dtreeviz(m, xs.iloc[samp_idx], y.iloc[samp_idx], xs.columns, dep_var,
        fontname='DejaVu Sans', scale=1.6, label_fontsize=10,
        orientation='LR')

Now we see a problem: the `YearMade` split looks more like a Christmas tree, with a few values at one extreme and an empty middle area. So we replace every `YearMade` below 1900 with 1950.

In [None]:
xs.loc[xs['YearMade']<1900, 'YearMade'] = 1950
valid_xs.loc[valid_xs['YearMade']<1900, 'YearMade'] = 1950

In [None]:
xs['YearMade'].unique()

In [None]:
m = DecisionTreeRegressor(max_leaf_nodes=4).fit(xs,y)

dtreeviz(m, xs.iloc[samp_idx], y.iloc[samp_idx], xs.columns, dep_var, scale=1.6, label_fontsize=10, orientation='LR')

In [None]:
# let's make a bigger tree with no stopping criteria

m = DecisionTreeRegressor()
m.fit(xs, y)

In [None]:
# a function to check the root mean squared error of our model

def r_mse(pred, y):
    return round(math.sqrt(((pred-y)**2).mean()), 6)

def m_rmse(m, xs, y):
    return r_mse(m.predict(xs), y)

In [None]:
m_rmse(m, xs,y)

# perfect (overfit) on the training data

In [None]:
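valid_xs = to.valid.xs
valid_y = to.valid.y
# re-reading to.valid.xs may discard the earlier YearMade fix, so reapply it
valid_xs.loc[valid_xs['YearMade']<1900, 'YearMade'] = 1950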

In [None]:
m_rmse(m, valid_xs, valid_y)

In [None]:
# this might be bad

m.get_n_leaves(), len(xs)

# we get so many leaves that of course it is overfitting
# let's require at least 25 rows in every leaf

In [None]:
m = DecisionTreeRegressor(min_samples_leaf=25)
# fit on xs so that the YearMade fix above is used for training too
m.fit(xs, y)
m_rmse(m, xs, y), m_rmse(m, valid_xs, valid_y)

In [None]:
m.get_n_leaves()

In [None]:
m_rmse(m, valid_xs, valid_y)

# better, but more improvements are needed

### Categorical Variables

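Categorical variables are often among the most important features and cannot simply be dropped. One option is to one-hot encode them (for example with `pd.get_dummies`), although a decision tree can also split directly on the integer codes that `Categorify` produces, which is what this notebook does.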

### Random forests

Bagging predictors is a method for generating multiple versions of a predictor and using these to get an aggregated predictor.

The procedure is:

1. Randomly choose a subset of the rows of your data.
1. Train a model using this subset.
1. Save that model, then return to step 1 a few times.
1. This gives you a number of trained models. To make a prediction, predict using each of the models, and then take the average of their predictions.
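
The four steps can be sketched directly with sklearn decision trees. This is an illustration on synthetic data, not the notebook's model; the subset size of 200 and the 25 repetitions are arbitrary choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Synthetic data: a noisy sine curve
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.3, size=300)
X_new = np.linspace(0, 6, 50).reshape(-1, 1)

models = []
for _ in range(25):
    rows = rng.choice(len(X), size=200, replace=False)   # 1. random subset of rows
    models.append(DecisionTreeRegressor().fit(X[rows], y[rows]))  # 2-3. train and keep

# 4. average the predictions of all the saved models
all_preds = np.stack([m.predict(X_new) for m in models])
bagged = all_preds.mean(axis=0)

def rmse(pred):
    """Error against the noise-free curve the data was generated from."""
    return float(np.sqrt(np.mean((pred - np.sin(X_new[:, 0])) ** 2)))

print(rmse(all_preds[0]), rmse(bagged))
```

The averaged prediction can never have a higher RMSE than the average RMSE of the individual models, which is the whole appeal of bagging.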

Breiman extended this idea: as well as randomly choosing rows for each model's training, he also randomly selected from a subset of columns when choosing each split. This is called a random forest.

When defining a `RandomForestRegressor`, `n_estimators` is the number of trees we want, `max_samples` is how many rows to sample for training each tree, `max_features` is how many columns to sample at each split point, and `min_samples_leaf` limits the depth of the tree. `n_jobs=-1` tells sklearn to use all our CPUs to build the trees in parallel.

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
def rf(xs, y, n_estimators=540, max_samples=200_000, max_features=0.5, min_samples_leaf=5, **kwargs):
    # extra keyword arguments are forwarded to RandomForestRegressor
    return RandomForestRegressor(n_jobs=-1, n_estimators=n_estimators,
        max_samples=max_samples, max_features=max_features,
        min_samples_leaf=min_samples_leaf, oob_score=True, **kwargs).fit(xs, y)

In [None]:
m = rf(xs, y)

In [None]:
m_rmse(m, xs, y), m_rmse(m, valid_xs, valid_y)

A random forest isn't very sensitive to hyperparameter choices such as `max_features`. `max_samples` can often be left at its default.

In [None]:
preds= np.stack([t.predict(valid_xs) for t in m.estimators_])

# m.estimators_ holds the individual decision trees of the random forest

In [None]:
r_mse(preds.mean(0), valid_y)

In [None]:
import matplotlib.pyplot as plt  # needed for the plots from here on

plt.plot([r_mse(preds[:i+1].mean(0), valid_y) for i in range(40)])

### Out of bag error

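The performance on our validation set is worse than on our training set. The out-of-bag (OOB) error is a way of measuring prediction error on a different subset of the training data: each row is predicted using only the trees that did not include that row in their training sample.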
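
To see how the OOB error differs from the ordinary training error, the sketch below fits a small forest on synthetic data and compares the two. The data and the forest settings are made up for the illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Synthetic data: a noisy linear target
X = rng.uniform(0, 10, size=(500, 3))
y = X[:, 0] + rng.normal(0, 1.0, size=500)

forest = RandomForestRegressor(n_estimators=50, oob_score=True,
                               random_state=0).fit(X, y)

def rmse(pred, target):
    return float(np.sqrt(np.mean((pred - target) ** 2)))

train_err = rmse(forest.predict(X), y)       # every tree votes on every row
oob_err = rmse(forest.oob_prediction_, y)    # only trees that never saw the row vote
print(train_err, oob_err)
```

Because the OOB error needs no separate validation set, it helps to tell apart a model that overfits from a validation set that simply differs from the training data.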

In [None]:
r_mse(m.oob_prediction_,y)

### Model Interpretation

These are the things to consider for every prediction model:
> How confident are we in our predictions using a particular row of data?

> For predicting with a particular row of data, what were the most important factors, and how did they influence that prediction?

> Which columns are the strongest predictors?

> Which columns are redundant?

> How do predictions vary as we vary these columns?

### Tree variance for confidence

Confidence can be estimated from the standard deviation of the predictions across the trees, instead of just their mean.

In [None]:
preds = np.stack([t.predict(valid_xs) for t in m.estimators_])
preds.shape

In [None]:
# now we have a prediction for every tree and every auction
# getting standard deviation of all trees
preds_std = preds.std(0)

In [None]:
preds_std[:5]

# this makes more of a difference in a production system

### Feature importance

We can use sklearn's `feature_importances_` attribute.

In [None]:
def rf_feat_importance(m,df):
    return pd.DataFrame({'cols': df.columns,'imp': m.feature_importances_}).sort_values('imp', ascending=False)

In [None]:
fi = rf_feat_importance(m, xs)
fi[:10]

In [None]:
# here's a plot of the relative importances

def plot_fi(fi):
    return fi.plot('cols', 'imp', 'barh', figsize=(12,7), legend=False)

plot_fi(fi[:30])

### Removing low importance Variables 

We keep only the columns whose importance `fi.imp` is greater than 0.005 and drop the rest.


In [None]:
to_keep = fi[fi.imp>0.005].cols
len(to_keep)

In [None]:
xs_imp = xs[to_keep]
valid_xs_imp = valid_xs[to_keep]

In [None]:
m = rf(xs_imp, y)

In [None]:
m_rmse(m, xs_imp, y), m_rmse(m, valid_xs_imp,valid_y)

In [None]:
len(xs.columns), len(xs_imp.columns)

In [None]:
plot_fi(rf_feat_importance(m, xs_imp))

### Removing redundant features

In [None]:
import scipy.stats  # makes sure scipy.stats.spearmanr is available below
from scipy.cluster import hierarchy as hc


# https://www.kaggle.com/saty101/fastai-course-v4-utils

In [None]:

def cluster_columns(df, figsize=(10,6), font_size=12):
    corr = np.round(scipy.stats.spearmanr(df).correlation, 4)
    corr_condensed = hc.distance.squareform(1-corr)
    z = hc.linkage(corr_condensed, method='average')
    fig = plt.figure(figsize=figsize)
    hc.dendrogram(z, labels=df.columns, orientation='left', leaf_font_size=font_size)
    plt.show()

In [None]:
cluster_columns(xs_imp)

This chart shows the pairs of columns that are most similar. Unsurprisingly, fields like `saleYear` and `saleElapsed` were merged early. Similarity is determined through rank correlation. Let's define a function that trains a random forest on a dataset and returns the OOB score. We use a lower `max_samples` and a higher `min_samples_leaf` so that it runs quickly.

In [None]:
def get_oob(df):
    m = RandomForestRegressor(n_estimators=40, min_samples_leaf=15,
                              max_samples=50000, max_features=0.5, n_jobs=-1, oob_score=True)
    m.fit(df,y)
    return m.oob_score_

In [None]:
get_oob(xs_imp)

In [None]:
# let's remove each of our potentially redundant variables, one at a time
{c:get_oob(xs_imp.drop(c, axis=1)) for c in (
    'saleYear', 'saleElapsed', 'ProductGroupDesc','ProductGroup',
    'fiModelDesc', 'fiBaseModel',
    'Hydraulics_Flow','Grouser_Tracks', 'Coupler_System')}

In [None]:
# removing multiple variables

to_drop = ['saleYear', 'ProductGroupDesc', 'fiBaseModel', 'Grouser_Tracks']
get_oob(xs_imp.drop(to_drop, axis=1))

In [None]:
xs_final = xs_imp.drop(to_drop, axis=1)
valid_xs_final = valid_xs_imp.drop(to_drop, axis=1)

In [None]:
path = Path("./")

In [None]:
save_pickle(path/'xs_final.pkl', xs_final)
save_pickle(path/'valid_xs_final.pkl', valid_xs_final)

In [None]:
xs_final = load_pickle(path/'xs_final.pkl')
valid_xs_final = load_pickle(path/'valid_xs_final.pkl')

In [None]:
m = rf(xs_final, y)
m_rmse(m, xs_final, y), m_rmse(m, valid_xs_final, valid_y)

### Partial Dependence

Partial dependence plots help us understand the relationship between a predictor and the dependent variable. Before that, it is a good idea to count the values per category to see how common each category is.

In [None]:
p = valid_xs_final['ProductSize'].value_counts(sort=False).plot.barh()
c = to.classes['ProductSize']
plt.yticks(range(len(c)), c)

In [None]:
ax = valid_xs_final['YearMade'].hist()

In [None]:
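# plot_partial_dependence was removed in scikit-learn 1.2;
# PartialDependenceDisplay.from_estimator is its replacement
from sklearn.inspection import PartialDependenceDisplay

fig, ax = plt.subplots(figsize=(12,4))

PartialDependenceDisplay.from_estimator(m, valid_xs_final, ['YearMade', 'ProductSize'],
                                        grid_resolution=20, ax=ax)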

### Data Leakage

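Data leakage happens when the model is given information about the target that would not legitimately be available at prediction time. Often cause and effect are reversed: something that happened after the fact is used as a predictor. For example, being happy might appear to predict winning, but if the data was collected after the win, most winners would be happy anyway.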


### Tree Interpreter 

These can help you to identify which factors influence specific predictions

In [None]:
!pip install treeinterpreter

In [None]:
!pip install waterfallcharts

In [None]:
import warnings
warnings.simplefilter('ignore', FutureWarning)

from treeinterpreter import treeinterpreter
from waterfall_chart import plot as waterfall

This helps in answering: for predicting with a particular row of data, what were the most important factors, and how did they influence that prediction?

For example, if a particular item is predicted to be expensive, we can see why exactly it is predicted that way.
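
The idea behind this kind of breakdown can be sketched for a single sklearn decision tree: start from the mean at the root and, at every split on the way down, credit the change in the node mean to the feature that was split on. The code below is a simplified illustration on synthetic data, not the `treeinterpreter` implementation; `explain` is a hypothetical helper.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Synthetic data with three features (float32, the dtype sklearn trees use internally)
X = rng.uniform(0, 10, size=(400, 3)).astype(np.float32)
y = 2 * X[:, 0] - X[:, 1] + rng.normal(0, 0.5, size=400)
model = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

def explain(model, row):
    """Split one prediction into a bias plus one contribution per feature."""
    t = model.tree_
    node = 0
    bias = t.value[0, 0, 0]                 # mean target at the root
    contributions = np.zeros(len(row))
    while t.children_left[node] != -1:      # -1 marks a leaf
        feat = t.feature[node]
        if row[feat] <= t.threshold[node]:
            child = t.children_left[node]
        else:
            child = t.children_right[node]
        # credit the change in the node mean to the feature that was split on
        contributions[feat] += t.value[child, 0, 0] - t.value[node, 0, 0]
        node = child
    return bias, contributions

row = X[0]
bias, contributions = explain(model, row)
print(bias, contributions, bias + contributions.sum())
```

The bias plus the sum of the contributions reproduces the tree's prediction for that row. For a forest, the same quantities would be averaged over the trees.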

In [None]:
row = valid_xs_final.iloc[:5]

In [None]:
prediction, bias, contributions = treeinterpreter.predict(m, row.values)

In [None]:
prediction[0], bias[0], contributions[0].sum()

In [None]:
waterfall(valid_xs_final.columns, contributions[0], threshold=0.08, 
          rotation_value=45,formatting='{:,.3f}');

The waterfall chart shows how the contribution of each column adds up, starting from the bias, to give the prediction for this row.

### Extrapolation problem in random forests

In [None]:
np.random.seed(42)

In [None]:
from fastai.tabular.all import torch

In [None]:
# let's take a simple task: making predictions from 40 data points
# that show a noisy linear relationship

x_lin = torch.linspace(0,20,steps=40)
y_lin = x_lin + torch.randn_like(x_lin)
plt.scatter(x_lin, y_lin)

In [None]:
# we need to turn our variable to a matrix with one column

xs_lin = x_lin.unsqueeze(1)
x_lin.shape, xs_lin.shape

In [None]:
x_lin[:, None].shape

# a more flexible way is to index the array with None, which adds a unit axis at that position

In [None]:
m_lin = RandomForestRegressor().fit(xs_lin[:30], y_lin[:30])

# we will use the first 30 rows to train the model

In [None]:
plt.scatter(x_lin, y_lin, 20)
plt.scatter(x_lin, m_lin.predict(xs_lin), color='red', alpha=0.5)

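Why is this happening? A random forest just averages the predictions of its trees, and each tree predicts the average of the training rows in a leaf. It can therefore never predict a value outside the range of the training data, and the predictions level off at the highest values it has seen. So for data with a trend, such as inflation, that rises beyond the training data, a random forest will not extrapolate.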
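
The ceiling can be checked numerically. The sketch below repeats the experiment with NumPy instead of PyTorch (an assumption made only to keep it self-contained) and compares the predictions with the largest training target.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
x = np.linspace(0, 20, 40)
y = x + rng.normal(0, 1, size=40)          # the same kind of noisy line

# Train on the first 30 points only, then predict all 40
forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(x[:30, None], y[:30])
preds = forest.predict(x[:, None])

# Every prediction is an average of training targets, so none can exceed
# the largest target the forest saw during training
print(preds[30:].max(), y[:30].max())
```

All ten points beyond the training range receive the same prediction, because every tree sends them to its right-most leaf.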

### Finding out-of-domain data

We want to know whether the validation set is distributed differently from the training set.

We use a random forest to predict whether a row is in the training set or the validation set. To see this in action, let's combine our training and validation sets and create a dependent variable that represents which dataset each row comes from.

In [None]:
# this is easy to do with a random forest

df_dom = pd.concat([xs_final, valid_xs_final])
is_valid = np.array([0]* len(xs_final) + [1]*len(valid_xs_final))

m = rf(df_dom, is_valid)
rf_feat_importance(m, df_dom)[:6]

This shows that three columns differ significantly between the training and validation sets: `saleElapsed`, `SalesID`, and `MachineID`.

In [None]:
m = rf(xs_final, y)
print('orig', m_rmse(m, valid_xs_final, valid_y))

for c in ('SalesID', 'saleElapsed', 'MachineID'):
    m = rf(xs_final.drop(c, axis=1), y)
    print(c, m_rmse(m, valid_xs_final.drop(c,axis=1), valid_y))

It looks like we can remove `SalesID` and `MachineID`.

In [None]:
time_vars = ['SalesID', 'MachineID']
xs_final_time = xs_final.drop(time_vars, axis=1)
valid_xs_time = valid_xs_final.drop(time_vars, axis=1)



In [None]:
m = rf(xs_final_time, y)
m_rmse(m, valid_xs_time, valid_y)

In [None]:
# another way to deal with out-of-domain data is simply to drop the old data

xs['saleYear'].hist()

Here is the result of training on only the more recent sales:

In [None]:
filt = xs['saleYear']>2004
xs_filt = xs_final_time[filt]
y_filt = y[filt]

In [None]:
m = rf(xs_filt, y_filt)
m_rmse(m, xs_filt, y_filt), m_rmse(m, valid_xs_time, valid_y)

## Let's see if using a neural network helps

In [None]:
df_nn = pd.read_csv('../input/bluebook-for-bulldozers/TrainAndValid.csv', low_memory=False)
df_nn['ProductSize'] = df_nn['ProductSize'].astype('category')
# set_categories returns a new Series; the inplace argument was removed in pandas 2.0
df_nn['ProductSize'] = df_nn['ProductSize'].cat.set_categories(sizes, ordered=True)
df_nn[dep_var] = np.log(df_nn[dep_var])
df_nn = add_datepart(df_nn, 'saledate')

We use the same columns for our neural network that we ended up with for the random forest.

In [None]:
df_nn_final = df_nn[list(xs_final_time.columns) + [dep_var]]

Categorical columns are handled through embeddings in a neural network, so fastai needs to decide which columns to treat as categorical. It compares the number of distinct levels in a column with `max_card`: if the number is lower, fastai treats the column as categorical. Embedding sizes larger than 10,000 should generally only be used after you've tested whether there are better ways to group the variable, so we use 9,000 as our `max_card`.

In [None]:
cont_nn,cat_nn = cont_cat_split(df_nn_final, max_card=9000, dep_var=dep_var)

We do not want to treat `saleElapsed` as categorical.

In [None]:
df_nn_final['saleElapsed'].head()

In [None]:
cont_nn.append('saleElapsed')
cat_nn.remove('saleElapsed')

We also want to make sure it is of numeric type.

In [None]:
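# df_nn_final is what gets passed to TabularPandas below, so convert the column there
df_nn_final = df_nn_final.assign(saleElapsed=df_nn_final['saleElapsed'].astype(int))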

In [None]:
# looking at the cardinality of all categorical variables
df_nn_final[cat_nn].nunique()

There are two variables pertaining to the "model" of the equipment, both with very high cardinalities, suggesting that they may be redundant. Let's try removing one of them, `fiModelDescriptor`.

In [None]:
xs_filt2 = xs_filt.drop('fiModelDescriptor', axis=1)
valid_xs_time2 = valid_xs_time.drop('fiModelDescriptor', axis=1)

In [None]:
m2 = rf(xs_filt2, y_filt)
m_rmse(m2, xs_filt2, y_filt), m_rmse(m2, valid_xs_time2, valid_y)

# there's minimal impact, so we will remove it

In [None]:
cat_nn.remove('fiModelDescriptor')

We can create our `TabularPandas` the same way as before, with one significant addition: we now need normalisation.

In [None]:
from fastai.tabular.all import Normalize

In [None]:
procs_nn = [Categorify, FillMissing, Normalize]
to_nn = TabularPandas(df_nn_final, procs_nn, cat_nn, cont_nn,
                      splits=splits, y_names=dep_var)

In [None]:
# since tabular models don't require much GPU RAM, we can use larger batch sizes
dls = to_nn.dataloaders(1024)

In [None]:
y = to_nn.train.y
y.min(), y.max()

# let's look at the range of our dependent variable

We use `tabular_learner` to create the model. We need a bigger model than the default, though, so we set the hidden layer sizes ourselves.

In [None]:
from fastai.tabular.all import tabular_learner, F

In [None]:
learn  = tabular_learner(dls, y_range=(8,12), layers=[500,250], 
                         n_out=1, loss_func=F.mse_loss)

In [None]:
learn.lr_find()

In [None]:
learn.fit_one_cycle(5, 1e-3)

In [None]:
preds, targs = learn.get_preds()
r_mse(preds, targs)

In [None]:
learn.save('nn1')

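In fastai, a tabular model is simply a model that takes in columns of continuous or categorical data and predicts a category (a classification model) or a continuous value (a regression model). `tabular_learner` is a function that builds a `TabularModel` and wraps it in a `TabularLearner`.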

In [None]:
#tabular_learner??

You'll see that like `collab_learner`, it first calls `get_emb_sz` to calculate appropriate embedding sizes (you can override these by using the `emb_szs` parameter, which is a dictionary containing any column names you want to set sizes for manually), and it sets a few other defaults. Other than that, it just creates the `TabularModel`, and passes that to `TabularLearner` (note that `TabularLearner` is identical to `Learner`, except for a customized `predict` method).

### Using Ensembling

Let's use the neural net and the random forest together. Fusion time!


One minor issue we have to be aware of is that our PyTorch model and our sklearn model create data of different types: PyTorch gives us a rank-2 tensor (i.e, a column matrix), whereas NumPy gives us a rank-1 array (a vector). `squeeze` removes any unit axes from a tensor, and `to_np` converts it into a NumPy array:
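
The shape mismatch is worth seeing once, because without `squeeze` the average does not fail; it silently produces a matrix. The sketch below uses NumPy arrays as stand-ins for the two sets of predictions, with made-up values.

```python
import numpy as np

# Stand-ins for the two sets of predictions (made-up values)
nn_preds = np.array([[10.0], [9.5], [11.0]])    # shape (3, 1), like the PyTorch output
rf_preds = np.array([10.2, 9.1, 10.8])          # shape (3,), like the sklearn output

# Without squeeze, broadcasting silently builds a 3x3 matrix
wrong = (nn_preds + rf_preds) / 2
# With squeeze, the unit axis is dropped and the average is element-wise
right = (nn_preds.squeeze() + rf_preds) / 2
print(wrong.shape, right.shape)
```

Checking the shape of the ensembled predictions before scoring them is a cheap safeguard against this.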

In [None]:
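from fastai.tabular.all import to_np

try:
    rf_preds = m.predict(valid_xs_time)
    ens_preds = (to_np(preds.squeeze()) + rf_preds) / 2
    # print the score, since a bare expression inside try is not displayed
    print(r_mse(ens_preds, valid_y))
except Exception as e:
    print(e)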

### Boosting

Note from chapter

So far our approach to ensembling has been to use bagging, which involves combining many models (each trained on a different data subset) together by averaging them. As we saw, when this is applied to decision trees, this is called a random forest.

There is another important approach to ensembling, called boosting, where we add models instead of averaging them. Here is how boosting works:

1. Train a small model that underfits your dataset.
1. Calculate the predictions in the training set for this model.
1. Subtract the predictions from the targets; these are called the "residuals" and represent the error for each point in the training set.
1. Go back to step 1, but instead of using the original targets, use the residuals as the targets for the training.
1. Continue doing this until you reach some stopping criterion, such as a maximum number of trees, or you observe your validation set error getting worse.

Using this approach, each new tree will be attempting to fit the error of all of the previous trees combined. Because we are continually creating new residuals, by subtracting the predictions of each new tree from the residuals from the previous tree, the residuals will get smaller and smaller.
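
The residual-fitting loop can be sketched with shallow sklearn trees. This is the plain version described above, on synthetic data, with no learning rate or other refinements that real gradient boosting libraries add; the depth of 2 and the 20 rounds are arbitrary.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Synthetic data: a noisy sine curve
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, size=300)

trees, train_mse = [], []
residuals = y.copy()                 # before any tree, the whole target is "error"
for _ in range(20):
    # a shallow tree underfits on purpose
    small_tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    trees.append(small_tree)
    residuals = residuals - small_tree.predict(X)    # what is still unexplained
    train_mse.append(float(np.mean(residuals ** 2)))

def boosted_predict(X_new):
    """Boosted prediction: the sum of every tree's output."""
    return np.sum([t.predict(X_new) for t in trees], axis=0)

print(train_mse[0], train_mse[-1])
```

The training error can only go down from round to round, which is exactly why a validation set is needed to decide when to stop.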

To make predictions with an ensemble of boosted trees, we calculate the predictions from each tree, and then add them all together. There are many models following this basic approach, and many names for the same models. Gradient boosting machines (GBMs) and gradient boosted decision trees (GBDTs) are the terms you're most likely to come across, or you may see the names of specific libraries implementing these; at the time of writing, XGBoost is the most popular.

Note that, unlike with random forests, with this approach there is nothing to stop us from overfitting. Using more trees in a random forest does not lead to overfitting, because each tree is independent of the others. But in a boosted ensemble, the more trees you have, the better the training error becomes, and eventually you will see overfitting on the validation set.

We are not going to go into detail on how to train a gradient boosted tree ensemble here, because the field is moving rapidly, and any guidance we give will almost certainly be outdated by the time you read this. As we write this, sklearn has just added a HistGradientBoostingRegressor class that provides excellent performance. There are many hyperparameters to tweak for this class, and for all gradient boosted tree methods we have seen. Unlike random forests, gradient boosted trees are extremely sensitive to the choices of these hyperparameters; in practice, most people use a loop that tries a range of different hyperparameters to find the ones that work best.


### Combining embeddings with other methods

The idea is to first train a neural network with categorical embeddings, and then use those embeddings instead of the raw categorical columns as the inputs to another model. In every case the chapter reports, the models are dramatically improved by using the embeddings instead of the raw categories.

This is a really important result, because it shows that you can get much of the performance improvement of a neural network without actually having to use a neural network at inference time. You could just use an embedding, which is literally just an array lookup, along with a small decision tree ensemble.

These embeddings need not even be necessarily learned separately for each model or task in an organization. Instead, once a set of embeddings are learned for some column for some task, they could be stored in a central place, and reused across multiple models. In fact, we know from private communication with other practitioners at large companies that this is already happening in many places.

### Conclusion

We have discussed two approaches to tabular modeling: decision tree ensembles and neural networks. We've also mentioned two different decision tree ensembles: random forests, and gradient boosting machines. Each is very effective, but each also has compromises:

Random forests are the easiest to train, because they are extremely resilient to hyperparameter choices and require very little preprocessing. They are very fast to train, and should not overfit if you have enough trees. But they can be a little less accurate, especially if extrapolation is required, such as predicting future time periods.

Gradient boosting machines in theory are just as fast to train as random forests, but in practice you will have to try lots of different hyperparameters. They can overfit, but they are often a little more accurate than random forests.

Neural networks take the longest time to train, and require extra preprocessing, such as normalization; this normalization needs to be used at inference time as well. They can provide great results and extrapolate well, but only if you are careful with your hyperparameters and take care to avoid overfitting.

We suggest starting your analysis with a random forest. This will give you a strong baseline, and you can be confident that it's a reasonable starting point. You can then use that model for feature selection and partial dependence analysis, to get a better understanding of your data.

From that foundation, you can try neural nets and GBMs, and if they give you significantly better results on your validation set in a reasonable amount of time, you can use them. If decision tree ensembles are working well for you, try adding the embeddings for the categorical variables to the data, and see if that helps your decision trees learn better.