In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

#Jupyter notebook tricks
#https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/

#https://ipython.readthedocs.io/en/stable/config/extensions/autoreload.html
#The %load_ext autoreload and %autoreload 2 lines further down automatically reload any changed modules before executing each cell.
#load_ext is an IPython magic command
#More about magic commands:
#1)https://ipython.org/ipython-doc/3/interactive/tutorial.html
#2)https://jakevdp.github.io/PythonDataScienceHandbook/01.03-magic-commands.html
#3)https://ipython.org/ipython-doc/3/interactive/reference.html - good explanation of magic commands
#autoreload is an IPython extension to automatically reload modules.
!pip install git+https://github.com/fastai/fastai@2e1ccb58121dc648751e2109fc0fbf6925aa8887
!apt update && apt install -y libsm6 libxext6
%load_ext autoreload
%autoreload 2

#The below line is used to plot charts inline in the notebook, instead of having the charts displayed in a separate window.
%matplotlib inline

import pandas as pd
import numpy as np

#Import the necessary libraries
from pandas_summary import DataFrameSummary
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from IPython.display import display
from sklearn import metrics
from fastai.imports  import *
from fastai.structured  import *

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [None]:
#We are executing a shell command here. You can do that by using the ! character before the command as
#shown here. You can execute any command you want here by preceding the command with the ! character
#Format: !<command to execute>
PATH = "../input/"
!ls {PATH}

In [None]:
#Read the contents of TrainAndValid.csv into a dataframe using the Pandas library
df_raw = pd.read_csv(f'{PATH}TrainAndValid.csv',low_memory=False,parse_dates=["saledate"])

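1. low_memory=False - pandas reads the whole file before inferring column types, rather than inferring them chunk by chunk, so columns with mixed values get a consistent type
2. parse_dates=["saledate"] - The "saledate" column is parsed into datetime values instead of being left as strings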
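
As a small illustration of what parse_dates does, the sketch below reads a made-up two-row CSV from memory and compares the resulting column types. The values are invented for the example; only the column names mirror the real file.

```python
import io
import pandas as pd

# Hypothetical miniature CSV; the rows are made up for illustration.
csv_text = "SalesID,SalePrice,saledate\n1,66000,2006-11-16\n2,57000,2004-03-26\n"

# Without parse_dates the date column is left as plain strings.
df_plain = pd.read_csv(io.StringIO(csv_text))

# With parse_dates the column is converted to datetime values.
df_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=["saledate"])

print(df_plain.dtypes["saledate"], df_dates.dtypes["saledate"])
print(df_dates.saledate.dt.year.tolist())
```

Once the column holds datetime values, the .dt accessor (used in the last line) becomes available, which is what makes the later date feature extraction possible.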

In [None]:
#Let's look at the top 5 rows using the Pandas DataFrame head() method
df_raw.head()

In [None]:
#The info method is useful to get a quick description of the data (number of columns, number of rows, datatype of each column)
df_raw.info()

As you can see above, we have a lot of missing data as well. 

By default, pandas truncates the display of a wide or long dataframe, so not all rows and columns are shown. The display_all function below temporarily raises those limits so that everything is displayed. It uses the options listed at: https://pandas.pydata.org/pandas-docs/stable/options.html. Also note that we transpose the dataframe (.T) before displaying it, so the original columns appear as rows and the original rows as columns. We do this because there is a huge number of columns to print.

In [None]:
def display_all(df):
    with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000): 
        display(df)

In [None]:
display_all(df_raw.tail().T)

In [None]:
display_all(df_raw.describe(include='all').T)

It's important to note what metric is being used for a project. Generally, selecting the metric(s) is an important part of the project setup. However, in this case Kaggle tells us what metric to use: RMSLE (root mean squared log error) between the actual and predicted auction prices. Therefore we take the log of the prices, so that RMSE will give us what we need.
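
To make the last sentence concrete, here is a minimal check on made-up prices that RMSE on log-transformed values is the same number as RMSLE on the raw values. It assumes the plain-log form of RMSLE used in this notebook (some definitions use log(1 + x) instead).

```python
import numpy as np

# Made-up prices, purely for illustration.
actual = np.array([10000.0, 25000.0, 60000.0])
predicted = np.array([12000.0, 20000.0, 66000.0])

def rmse(x, y):
    return np.sqrt(((x - y) ** 2).mean())

# RMSLE computed directly from the prices...
rmsle = np.sqrt(((np.log(predicted) - np.log(actual)) ** 2).mean())

# ...matches plain RMSE once both sides have been log-transformed.
rmse_on_logs = rmse(np.log(predicted), np.log(actual))

print(rmsle, rmse_on_logs)
```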

In [None]:
df_raw.SalePrice = np.log(df_raw.SalePrice)

NumPy lets us treat arrays, matrices, vectors, and high dimensional tensors as if they were ordinary Python variables, so a single call such as np.log is applied to every element of the column.

**Initial modelling using Random Forest algorithm**

In [None]:
#n_jobs=-1 indicates that the algorithm should use parallelism as part of its fit/predict phases. With n_jobs=-1
#scikit-learn will use all the CPUs available.
#For more information on this parameter, you can take a look at: http://scikit-learn.org/stable/glossary.html#term-n-jobs
#m = RandomForestRegressor(n_jobs=-1)
# The following code is expected to fail due to string values in the input data
#m.fit(df_raw.drop('SalePrice',axis=1),df_raw.SalePrice)

Everything in scikit-learn has the following form:
-Instantiate an object of the machine learning model.
-Fit the model to the data. As part of the fit operation, the algorithm tries to learn the relationship between the independent variables and the dependent variable. In our case, every column except SalePrice is an independent variable and the value to be predicted (SalePrice) is the dependent variable.
-In the call to drop, axis=1 means that a column (rather than a row) is removed.
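
The pattern above can be sketched on a small synthetic, all-numeric table (the column names a, b and target are invented for this example, so unlike the commented-out cell it does not fail on string values):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic, all-numeric data standing in for the real table.
rng = np.random.RandomState(0)
toy = pd.DataFrame({"a": rng.rand(100), "b": rng.rand(100)})
toy["target"] = 3 * toy.a + rng.normal(scale=0.1, size=100)

X = toy.drop("target", axis=1)   # axis=1: drop a column, not a row
y = toy.target

model = RandomForestRegressor(n_estimators=10, random_state=0)  # 1. instantiate
model.fit(X, y)                                                 # 2. fit
print(model.score(X, y))         # R² on the same data the model was fit on
```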

This dataset contains a mix of continuous and categorical variables.


Machine learning algorithms work only with numbers. Hence, as part of data pre-processing, we try to
convert any string values into numbers.

Here is some of the information we can extract from a date: year, month, quarter, day of month, day of week, week of year, is it a holiday? weekend? was it raining? was there a sport event that day? It really depends on what you are doing. If you are predicting soda sales in SoMa, you would probably want to know if there was a San Francisco Giants ball game that day. **Working out what is in a date is one of the most important pieces of feature engineering you can do**, and no machine learning algorithm can tell you whether the Giants were playing that day and that it was important. So this is where you need to do feature engineering.

The **add_datepart** method extracts particular date fields from a complete datetime for the purpose of constructing categoricals. You should always consider this feature extraction step when working with date-time. Without expanding your date-time into these additional fields, you can’t capture any trend/cyclical behavior as a function of time at any of these granularities.
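
As a rough idea of what this kind of extraction looks like, here is a minimal sketch using the pandas .dt accessor on two invented dates. It is not fastai's implementation of add_datepart, and apart from saleYear the column names are illustrative.

```python
import pandas as pd

# Two invented sale dates.
df = pd.DataFrame({"saledate": pd.to_datetime(["2006-11-16", "2004-03-26"])})

# Pull a few calendar fields out of the datetime column via the .dt accessor.
df["saleYear"] = df.saledate.dt.year
df["saleMonth"] = df.saledate.dt.month
df["saleDayofweek"] = df.saledate.dt.dayofweek        # Monday=0 ... Sunday=6
df["saleIs_month_end"] = df.saledate.dt.is_month_end

# Drop the original column once its parts have been extracted.
df = df.drop("saledate", axis=1)
print(df)
```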


In [None]:
add_datepart(df_raw,'saledate')
df_raw.saleYear.head()


After running add_datepart, many numerical columns have been added and the saledate column has been removed. This is not quite enough to get past the error we saw earlier, as we still have other columns that contain string values. Pandas has a concept of a category data type, but by default it will not turn anything into a category for you. Fast.ai provides a function called **train_cats** which creates categorical variables for everything that is a string. **Behind the scenes, it creates a column of integers and stores a mapping from the integers to the strings**. train_cats is called “train” because it is specific to the training data. It is important that the validation and test sets use the same category mappings (in other words, if 1 stands for “high” in the training dataset, then 1 should also stand for “high” in the validation and test datasets). **For validation and test datasets, use apply_cats instead.**
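
The idea of sharing one mapping between training and validation data can be sketched with plain pandas categoricals. This is an illustration on made-up values, not the code of train_cats or apply_cats.

```python
import pandas as pd

train = pd.DataFrame({"UsageBand": ["Low", "High", "Medium", "Low"]})
valid = pd.DataFrame({"UsageBand": ["Medium", "Low", "High"]})

# "Train" the mapping: pandas derives the categories from the training strings.
train["UsageBand"] = train["UsageBand"].astype("category")

# Apply the *same* categories to the validation data instead of inferring new ones.
valid["UsageBand"] = pd.Categorical(
    valid["UsageBand"], categories=train["UsageBand"].cat.categories
)

print(dict(enumerate(train["UsageBand"].cat.categories)))
print(train["UsageBand"].cat.codes.tolist(), valid["UsageBand"].cat.codes.tolist())
```

Because the validation column reuses the training categories, a given string maps to the same integer code in both frames.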

In [None]:
#train_cats will not change the way the dataframe looks, but behind the scenes it assigns a number to each
#of the categories.
train_cats(df_raw)
df_raw.UsageBand.cat.categories


In [None]:
#Check the columns in the dataframe
df_raw.columns


In [None]:
#There is a kind of categorical variable called “ordinal”. An ordinal categorical variable has some kind of order (e.g. “Low” < “Medium” < “High”). 
#Random forests are not terribly sensitive to that fact, but it is worth noting.
#Assigning the result back works whether or not the installed pandas version supports inplace=True here.

df_raw.UsageBand = df_raw.UsageBand.cat.set_categories(['High','Medium','Low'],ordered=True)

Normally, pandas will continue displaying the text categories, while treating them as numerical data internally. Optionally, we can replace the text categories with their numeric codes, which makes this variable non-categorical, like so:

In [None]:
df_raw.UsageBand = df_raw.UsageBand.cat.codes
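
To see how the declared order determines the codes, here is a small standalone sketch on an invented series (not the real UsageBand column):

```python
import pandas as pd

s = pd.Series(["Low", "High", None, "Medium"], dtype="category")

# Re-declare the categories in a meaningful order and assign the result back.
s = s.cat.set_categories(["High", "Medium", "Low"], ordered=True)

# Codes follow the declared order; missing values are coded as -1.
print(s.cat.codes.tolist())
```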

We're still not quite done - for instance we have lots of missing values, which we can't pass directly to a random forest.
The cell below counts the missing values in each column, sorts the result by index (pandas.Series.sort_index), which here holds the column names, and divides by the number of rows to give the fraction of missing values per column.

In [None]:
#https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isnull.html
display_all(df_raw.isnull().sum().sort_index()/len(df_raw))

Reading the CSV took about 10 seconds, and processing took another 10 seconds, so if we do not want to wait again, it is a good idea to save the result. Here we save it in feather format, which writes the data to disk in essentially the same layout that it has in RAM. This makes it a very fast way to save a dataframe and to read it back. The feather format is becoming a standard not only in Pandas but also in Java, Apache Spark, etc.

In [None]:
os.makedirs('tmp',exist_ok=True)
df_raw.to_feather('tmp/bulldozers-raw')

**Pre-processing**

In the future we can simply read it from this fast format.

In [None]:
df_raw = pd.read_feather('tmp/bulldozers-raw')

We will replace categories with their numeric codes, handle missing continuous values, and split the dependent variable into a separate variable.
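
The three steps in that sentence can be sketched with a small hypothetical helper, simple_proc, on an invented table. This is only an approximation of the kind of processing described, under the assumption that missing continuous values are filled with the median and flagged in an extra column; it is not the code of proc_df.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame({
    "SalePrice": [10.2, 9.8, 11.0, 10.5],
    "MachineHours": [1200.0, np.nan, 300.0, 800.0],
    "UsageBand": pd.Categorical(["Low", None, "High", "Low"]),
})

def simple_proc(df, y_name):
    """Hypothetical helper, not fastai's proc_df: returns (features, target)."""
    df = df.copy()
    y = df.pop(y_name).values                      # split off the dependent variable
    for name in list(df.columns):
        col = df[name]
        if isinstance(col.dtype, pd.CategoricalDtype):
            df[name] = col.cat.codes + 1           # missing (-1) becomes 0
        elif col.isnull().any():
            df[name + "_na"] = col.isnull()        # remember which rows were missing
            df[name] = col.fillna(col.median())    # fill with the column median
    return df, y

X, y = simple_proc(raw, "SalePrice")
print(X)
```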

In [None]:
df, y, nas = proc_df(df_raw,'SalePrice')

We now have something we can pass to a random forest!

In [None]:
m=RandomForestRegressor(n_jobs=-1)
m.fit(df,y)
m.score(df,y)

In [None]:
#With DataFrame, slicing inside of [] slices the rows. This is provided largely as a convenience since it is such a common operation.
#https://pandas.pydata.org/pandas-docs/stable/indexing.html

def split_vals(a,n): return a[:n].copy(), a[n:].copy()

n_valid = 12000  # same as Kaggle's test set size
n_trn = len(df)-n_valid
raw_train, raw_valid = split_vals(df_raw, n_trn)
X_train, X_valid = split_vals(df, n_trn)
y_train, y_valid = split_vals(y, n_trn)

X_train.shape, y_train.shape, X_valid.shape

Coefficient of determination (R²) - This is the value returned by m.score. A value of 1 means perfect predictions, 0 means the model does no better than the naive model that predicts the mean for all observations, and a negative value means that your model has performed worse than that naive model.
A great article which explains the coefficient of determination: **https://ragrawal.wordpress.com/2017/05/06/intuition-behind-r2-and-other-regression-evaluation-metrics/#comment-7387**
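
For reference, the coefficient of determination can be computed by hand as 1 minus the ratio of the model's squared error to the squared error of always predicting the mean. The sketch below uses four made-up values to show the three typical cases.

```python
import numpy as np

def r2(actual, predicted):
    ss_res = ((actual - predicted) ** 2).sum()        # squared error of the model
    ss_tot = ((actual - actual.mean()) ** 2).sum()    # squared error of predicting the mean
    return 1 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])

print(r2(y, y))                          # perfect predictions: 1.0
print(r2(y, np.full_like(y, y.mean())))  # the naive mean model: 0.0
print(r2(y, y[::-1]))                    # worse than the mean model: -3.0
```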




**Random Forests**
Having a validation set is one of the most important steps in building a machine learning model. According to Prof. Jeremy Howard, this step is frequently neglected in industry even though it is very important.
Now, let's try our model again with a training set and a validation set.
  

In [None]:
def rmse(x,y): return math.sqrt(((x - y)**2).mean())

def print_score(m):
    res = [rmse(m.predict(X_train),y_train), rmse(m.predict(X_valid),y_valid),
          m.score(X_train,y_train), m.score(X_valid,y_valid)]
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)



In [None]:
m = RandomForestRegressor(n_jobs=-1)
%time (m.fit(X_train,y_train))
print_score(m)

As per Prof. Jeremy Howard, if there is any operation that takes more than 10 seconds, then it becomes extremely difficult to work with that data interactively. Hence, one of the approaches that is taken is to work with a subset of the data. Once we decide upon the hyperparameters and are done with the feature engineering on this subset of the data, we then run the model on the entire dataset which takes much more time than it took on the subset of the data. We do this process in the below cell. Make sure that the validation set remains the same.

In [None]:
df_trn, y_trn, nas = proc_df(df_raw,'SalePrice', subset=30000, na_dict=nas)
X_train, _ = split_vals(df_trn,20000)
y_train, _ = split_vals(y_trn,20000)

m = RandomForestRegressor(n_jobs=-1)
%time (m.fit(X_train,y_train))
print_score(m)

**Building a single tree**

In [None]:
m = RandomForestRegressor(n_estimators=1, max_depth=3, bootstrap=False, n_jobs=-1)
m.fit(X_train,y_train)
print_score(m)

In [None]:
draw_tree(m.estimators_[0], df_trn, precision=3)

Let's create a bigger tree to see how it fares

In [None]:
#Here we have removed the max_depth parameter to see if that makes a difference
#As you can see, the R2 is better than the earlier R2. However, it's still not up to the mark.
m = RandomForestRegressor(n_estimators=1, bootstrap=False, n_jobs=-1)
m.fit(X_train, y_train)
print_score(m)

To make these trees better, we will build a forest, which consists of various trees. We will use a technique called **bagging**, to build the forest. Bagging should make the model more generalizable.

**Bagging**
To learn about bagging in random forests, let's start with our basic model again.

In [None]:
m = RandomForestRegressor(n_jobs=-1)
m.fit(X_train, y_train)
print_score(m)

In [None]:
#Good explanation of slicing multi dimensional numpy arrays can be found in the book: Python for Data Analysis by Wes McKinney. Refer Chapter 4
preds = np.stack([t.predict(X_valid) for t in m.estimators_]) 
preds[:,0],np.mean(preds[:,0]),y_valid[0]
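
The same averaging can be reproduced end to end on synthetic data. The sketch below is an assumption-laden illustration (hand-rolled bootstrap samples and standalone decision trees rather than RandomForestRegressor), comparing the average error of the individual trees with the error of their averaged prediction.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(300, 4)
y = np.sin(4 * X[:, 0]) + X[:, 1] + rng.normal(scale=0.3, size=300)
X_trn, y_trn, X_val, y_val = X[:200], y[:200], X[200:], y[200:]

# Fit each tree on its own bootstrap sample (rows drawn with replacement).
trees = []
for _ in range(20):
    idx = rng.randint(0, len(X_trn), len(X_trn))
    trees.append(DecisionTreeRegressor().fit(X_trn[idx], y_trn[idx]))

# One row of predictions per tree, as with np.stack over m.estimators_.
preds = np.stack([t.predict(X_val) for t in trees])

def mse(p):
    return ((p - y_val) ** 2).mean()

print(np.mean([mse(p) for p in preds]))   # average error of the individual trees
print(mse(preds.mean(axis=0)))            # error of the averaged (bagged) prediction
```

The squared error of the averaged prediction can never exceed the average squared error of the individual trees, which is the basic reason averaging many imperfect trees helps.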

Here is a plot of the R² values obtained from the first i trees. As we add more trees, R² improves, but the curve appears to flatten out.

In [None]:
plt.plot([metrics.r2_score(y_valid, np.mean(preds[:i+1], axis=0)) for i in range(10)]);



The shape of the above curve suggests that adding more trees isn't going to help us much. Let's check. (Compare this to our original model on a sample)

In [None]:
m = RandomForestRegressor(n_estimators=20, n_jobs=-1)
m.fit(X_train, y_train)
print_score(m)

In [None]:
m = RandomForestRegressor(n_estimators=40, n_jobs=-1)
m.fit(X_train, y_train)
print_score(m)

In [None]:
m = RandomForestRegressor(n_estimators=80, n_jobs=-1)
m.fit(X_train, y_train)
print_score(m)

As you can see, adding more trees does not help much. It will not make things worse, but it will stop improving things much. **This is the first hyperparameter to learn to set: the number of estimators**. **A good way to set it is to use as many as you have time to fit and that actually seem to be helping.** Adding more trees slows training down, but with fewer trees you can still get the same insights. **So when Jeremy builds most of his models, he starts with 20 or 30 trees and at the end of the project or at the end of the day’s work, he will use 1000 trees and run it overnight.**

Sometimes your dataset will be small and you will not want to pull out a validation set because doing so means you now do not have enough data to build a good model. However, random forests have a very clever trick called out-of-bag (OOB) error which can handle this (and more!)



Is our validation set worse than our training set because we're over-fitting, or because the validation set is for a different time period, or a bit of both? With the information we've shown so far, we can't tell. Out-of-bag (OOB) error can help with this question too.
The idea is to calculate error on the training set, but when computing the error for a row, to include only the trees that did not have that row in their training sample. This allows us to see whether the model is over-fitting, without needing a separate validation set.
This also has the benefit of allowing us to see whether our model generalizes even when we have only a small amount of data and want to avoid setting some of it aside as a validation set.
This is as simple as adding one more parameter (oob_score=True) to our model constructor. The print_score function defined earlier prints the OOB score as the last value.
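
The out-of-bag idea can be written out by hand. The following is a minimal sketch on synthetic data with hand-rolled bootstrap samples and standalone decision trees; it illustrates the calculation and is not scikit-learn's internal implementation of oob_score_.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
n = 200
X = rng.rand(n, 3)
y = 2 * X[:, 0] + rng.normal(scale=0.2, size=n)

pred_sum = np.zeros(n)   # running sum of out-of-bag predictions per row
pred_cnt = np.zeros(n)   # number of trees for which each row was out of bag

for _ in range(40):
    idx = rng.randint(0, n, n)                  # bootstrap sample for this tree
    oob = np.setdiff1d(np.arange(n), idx)       # rows this tree never saw
    tree = DecisionTreeRegressor().fit(X[idx], y[idx])
    pred_sum[oob] += tree.predict(X[oob])
    pred_cnt[oob] += 1

seen = pred_cnt > 0                              # rows that were out of bag at least once
oob_pred = pred_sum[seen] / pred_cnt[seen]
ss_res = ((y[seen] - oob_pred) ** 2).sum()
ss_tot = ((y[seen] - y[seen].mean()) ** 2).sum()
oob_r2 = 1 - ss_res / ss_tot
print(oob_r2)                                    # an OOB R², comparable in spirit to oob_score_
```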


In [None]:
#Possible explanation as to why the OOB score is better here than the R2 score of the validation set, whereas Jeremy says that it should generally be lower
#https://forums.fast.ai/t/oob-then-and-now-2017-11-vs-2018-10/23913
#The OOB score below also suggests that the different time period of the validation set matters here. The OOB score is calculated on data points from the same time range as the training data,
#and it comes out higher than the validation score, which points to the time difference as the main reason the validation set scores worse than the training data.

m = RandomForestRegressor(n_estimators=40, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
print_score(m)

**Reducing over-fitting**

**Subsampling**

Earlier, we took 30,000 rows and built all of our models on subsets of those 30,000 rows. Why not take a totally different subset of 30,000 each time? In other words, let’s leave the entire 389,125 records as they are, and if we want to make things faster, pick a different subset of 30,000 each time. So rather than bootstrapping the entire set of rows, we just randomly sample a subset of the data.

It turns out that one of the easiest ways to avoid over-fitting is also one of the best ways to speed up analysis: subsampling. Let's return to using our full dataset, so that we can demonstrate the impact of this technique.

In [None]:
df_trn, y_trn, nas = proc_df(df_raw, 'SalePrice')
X_train, X_valid = split_vals(df_trn, n_trn)
y_train, y_valid = split_vals(y_trn, n_trn)

**The basic idea is this**: rather than limit the total amount of data that our model can access, let's instead limit it to a different random subset per tree. **That way, given enough trees, the model can still see all the data, but for each individual tree it'll be just as fast as if we had cut down our dataset as before**.

In [None]:
set_rf_samples(20000)
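
set_rf_samples is a fastai helper. As an alternative that this notebook does not use, scikit-learn 0.22 and later can request a per-tree subsample directly through the max_samples parameter of the forest. The sketch below shows that parameter on synthetic data, with sizes chosen only for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(1000, 5)
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=1000)

# Each tree is grown on its own random draw of 200 rows (with replacement)
# from the full 1000, rather than on a full-size bootstrap sample.
m = RandomForestRegressor(n_estimators=40, max_samples=200,
                          oob_score=True, random_state=0)
m.fit(X, y)
print(m.oob_score_)
```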

In [None]:
m = RandomForestRegressor(n_jobs=-1, oob_score=True)
%time m.fit(X_train, y_train)
print_score(m)

**Since each additional tree allows the model to see more data, this approach can make additional trees more useful.**

In [None]:
m = RandomForestRegressor(n_estimators=40, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
print_score(m)

This will take the same amount of time to run as before, **but every tree has access to the entire dataset.**

**The biggest tip:** Most people run all of their models on all of the data all of the time using their best possible parameters, which is just pointless. If you are trying to find out which features are important and how they are related to each other, having that 4th decimal place of accuracy is not going to change any of your insights at all. Build most of your models on a sample that is large enough for your accuracy to be reasonable (within a reasonable distance of the best accuracy you can get) and small enough to train in a few seconds, so that you can do your analysis interactively.

**Tree building parameters** - Let's explore some more parameters

We revert to using a full bootstrap sample in order to show the impact of other over-fitting avoidance methods.

In [None]:
reset_rf_samples()

In [None]:
m = RandomForestRegressor(n_estimators=40, n_jobs=-1, oob_score=True)
%time m.fit(X_train, y_train)
print_score(m)

Here the OOB score is higher than the validation score. This is because our validation set is a different time period whereas OOB samples are random. It is much harder to predict a different time period.

**min_samples_leaf**

Another way to reduce over-fitting is to grow our trees less deeply. We do this by specifying (with min_samples_leaf) that we require some minimum number of rows in every leaf node. This has two benefits:

1. There are fewer decision rules for each leaf node; simpler models should generalize better
2. The predictions are made by averaging more rows in the leaf node, resulting in less volatility



In [None]:
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3, 
                          n_jobs=-1, oob_score=True) 
%time m.fit(X_train, y_train) 
print_score(m)

1. min_samples_leaf=3 : Only allow a split if it leaves at least 3 samples in each resulting leaf (before, we were going all the way down to 1). This means there will be one or two fewer levels of decisions, which roughly halves the number of decision criteria we have to train (i.e. faster training time).
2. For each tree, rather than predicting from just one point, we are taking the average of at least three points, so we would expect each tree to generalize better. But each tree is going to be slightly less powerful on its own.
3. The numbers that work well are 1, 3, 5, 10, 25, but it is relative to your overall dataset size.
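
The effect on tree size can be checked directly on a single decision tree. The sketch below uses synthetic data and standalone DecisionTreeRegressor objects (an illustration, not the forest fitted above) and records the number of leaves, the depth, and the smallest leaf for a few settings.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X[:, 0] + rng.normal(scale=0.1, size=200)

stats = {}
for leaf in (1, 3, 10):
    t = DecisionTreeRegressor(min_samples_leaf=leaf, random_state=0).fit(X, y)
    is_leaf = t.tree_.children_left == -1                 # leaves have no children
    smallest = int(t.tree_.n_node_samples[is_leaf].min()) # fewest rows in any leaf
    stats[leaf] = (t.get_n_leaves(), t.get_depth(), smallest)
    print(leaf, stats[leaf])
```

Larger values of min_samples_leaf give fewer leaves, each averaging over more rows.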


**max_features**

We can also increase the amount of variation amongst the trees by not only using a sample of rows for each tree, but also using a sample of columns for each split. We do this by specifying max_features, which is the proportion of features to randomly select from at each split.

In [None]:
m = RandomForestRegressor(n_estimators=40, min_samples_leaf=3,max_features=0.5, n_jobs=-1, oob_score=True) 
m.fit(X_train, y_train)
print_score(m)

* max_features=0.5 : The idea is that the **less correlated your trees are** with each other, the better. Imagine you had one column that was so much more predictive than all of the other columns that every single tree you built always started with that column. But there might be some interaction of variables where that interaction is more important than the individual column. So if every tree always splits on the same thing the first time, you will not get much variation in those trees.
* In addition to taking a subset of rows, at every single split point, take a different subset of columns.
* For row sampling, each new tree is based on a random set of rows; for column sampling, at every individual binary split we choose from a different subset of columns.
* 0.5 means randomly choose half of them. There are also special string values you can use, such as 'sqrt' or 'log2'.
* Good values to use are 1.0 (all features), 0.5, 'log2', or 'sqrt'. Note that scikit-learn treats a float as a fraction of the features but an integer as a count, so the integer 1 would mean a single feature per split.
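
The "one dominant column" scenario from the first bullet can be simulated. In this sketch on synthetic data, column 0 is made far more predictive than the rest, row sampling is switched off, and the hypothetical helper root_features reports which column each tree uses for its very first split.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(500, 6)
# Column 0 dominates the target; the other columns contribute only weakly.
y = 5 * X[:, 0] + 0.5 * X[:, 1:].sum(axis=1) + rng.normal(scale=0.1, size=500)

def root_features(max_features):
    # bootstrap=False removes row sampling, so only column sampling varies the trees.
    m = RandomForestRegressor(n_estimators=20, max_features=max_features,
                              bootstrap=False, random_state=0).fit(X, y)
    return [int(est.tree_.feature[0]) for est in m.estimators_]  # column used at each root

roots_all = set(root_features(None))    # all columns considered at every split
roots_half = set(root_features(0.5))    # a random half of the columns at every split
print(roots_all, roots_half)
```

With all columns available, every tree starts from the dominant column; with column sampling, trees whose first split did not get to see that column have to start elsewhere, which is the source of the extra variation.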


