This notebook is based on the fastai colab notebook: https://colab.research.google.com/github/fastai/fastbook/blob/master/09_tabular.ipynb#scrollTo=KuB9u4cudJMx

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
!pip install -Uqq fastbook kaggle waterfallcharts treeinterpreter dtreeviz
import fastbook
fastbook.setup_book()

In [None]:
#hide
from fastbook import *
from pandas.api.types import is_string_dtype, is_numeric_dtype, is_categorical_dtype
from fastai.tabular.all import *
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from dtreeviz.trees import *
from IPython.display import Image, display_svg, SVG

pd.options.display.max_rows = 20
pd.options.display.max_columns = 8

Look at the data

In [None]:
path = '/kaggle/input/tabular-playground-series-jul-2021/'
train = pd.read_csv('/kaggle/input/tabular-playground-series-jul-2021/train.csv', low_memory= False)
train.head()


**There are three target columns, so we will treat this as three separate regression problems, with one dataset per target.**

In [None]:
target_carbon_monoxide = train.target_carbon_monoxide.values
target_benzene = train.target_benzene.values
target_nitrogen_oxides = train.target_nitrogen_oxides.values


In [None]:
train_carbon_monoxide = pd.read_csv('/kaggle/input/tabular-playground-series-jul-2021/train.csv', low_memory= False)
train_carbon_monoxide.drop(['target_benzene','target_nitrogen_oxides'], axis=1, inplace= True)
train_carbon_monoxide.head()


In [None]:
train_benzene = pd.read_csv('/kaggle/input/tabular-playground-series-jul-2021/train.csv', low_memory= False)
train_benzene.drop(['target_carbon_monoxide','target_nitrogen_oxides'], axis=1, inplace= True)
train_benzene.head()


In [None]:
train_nitrogen_oxides = pd.read_csv('/kaggle/input/tabular-playground-series-jul-2021/train.csv', low_memory= False)
train_nitrogen_oxides.drop(['target_carbon_monoxide','target_benzene'], axis=1, inplace= True)
train_nitrogen_oxides.head()

In [None]:
test = pd.read_csv('/kaggle/input/tabular-playground-series-jul-2021/test.csv', low_memory= False)
test.head()



In [None]:
sample_submission = pd.read_csv('/kaggle/input/tabular-playground-series-jul-2021/sample_submission.csv',low_memory=False)
sample_submission

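Specifying low_memory=False (the default is True) makes pandas read the whole file at once before inferring each column's dtype, instead of processing it in chunks, which can produce mixed-type columns.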

In [None]:
train.columns

There are only 12 columns. 'target_carbon_monoxide', 'target_benzene', and 'target_nitrogen_oxides' are the target columns.

In [None]:
dep_var_tcm = 'target_carbon_monoxide'
dep_var_tb = 'target_benzene'
dep_var_tno ='target_nitrogen_oxides'

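Our evaluation metric is the root mean squared log error (RMSLE) between the actual and predicted values. If we take the log of the dependent variables, the ordinary RMSE on the transformed targets measures what we need. Strictly, RMSLE is defined with log(1 + y), so np.log1p would match the metric exactly; np.log, used below, is an approximation of it.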


In [None]:
train_carbon_monoxide[dep_var_tcm]= np.log(train_carbon_monoxide[dep_var_tcm])
train_benzene[dep_var_tb]= np.log(train_benzene[dep_var_tb])
train_nitrogen_oxides[dep_var_tno]= np.log(train_nitrogen_oxides[dep_var_tno])
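The link between RMSLE and a log-transformed target can be checked on a few numbers. This is a minimal sketch with made-up values; it uses np.log1p and np.expm1 (the log(1 + y) form of the metric), which is an assumption that differs from the np.log call in the cell above.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared log error, using log(1 + y)."""
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

def rmse(y_true, y_pred):
    """Plain root mean squared error."""
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

# Made-up values, for illustration only
y_true = np.array([0.5, 2.0, 8.0])
y_pred = np.array([0.7, 1.5, 9.0])

# RMSE computed on log1p-transformed values is the RMSLE of the raw values
score_raw = rmsle(y_true, y_pred)
score_log_space = rmse(np.log1p(y_true), np.log1p(y_pred))

# A model trained in log space predicts logs; np.expm1 undoes np.log1p
roundtrip = np.expm1(np.log1p(y_true))
```

Whichever transform is used for training, its inverse has to be applied to the predictions before they are submitted.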

First trying decision trees

We need to handle the dates properly. We want our model to make decisions based on how recent a date is, which day of the week it falls on, or which month it is in. To do this, we replace every date column with a set of date metadata columns, such as day of week, month, and whether the date falls at a month end. These columns provide categorical data that we suspect will be useful. We will use fastai's add_datepart function to do this.

In [None]:
train_carbon_monoxide = add_datepart(train_carbon_monoxide, 'date_time')
train_benzene = add_datepart(train_benzene, 'date_time')
train_nitrogen_oxides = add_datepart(train_nitrogen_oxides, 'date_time')
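To make the idea of date metadata columns concrete, here is a simplified sketch in plain pandas. The helper simple_datepart is hypothetical and only covers a few parts; it is not fastai's add_datepart, which produces more columns.

```python
import pandas as pd

def simple_datepart(df, col):
    """Expand a datetime column into a few numeric/boolean parts (simplified sketch)."""
    out = df.copy()
    dt = pd.to_datetime(out[col])
    out[col + 'Year'] = dt.dt.year
    out[col + 'Month'] = dt.dt.month
    out[col + 'Dayofweek'] = dt.dt.dayofweek          # Monday = 0
    out[col + 'Dayofyear'] = dt.dt.dayofyear
    out[col + 'Is_month_end'] = dt.dt.is_month_end
    # Seconds since the Unix epoch, so the model can still see how recent a row is
    out[col + 'Elapsed'] = (dt - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)
    return out.drop(columns=col)

# Two made-up rows, for illustration only
demo = pd.DataFrame({'date_time': ['2010-03-10 18:00:00', '2010-12-31 23:00:00'],
                     'sensor': [1.0, 2.0]})
parts = simple_datepart(demo, 'date_time')
```

The original date column is dropped at the end, because tree models cannot split on a raw timestamp string.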

In [None]:
' '.join(o for o in train_carbon_monoxide.columns if o.startswith('date'))
' '.join(o for o in train_benzene.columns if o.startswith('date'))
' '.join(o for o in train_nitrogen_oxides.columns if o.startswith('date'))

In [None]:
train_nitrogen_oxides.head()

In [None]:
train_nitrogen_oxides.columns

We will do the same for the test dataset

In [None]:
test = add_datepart(test, 'date_time')
' '.join(o for o in test.columns if o.startswith('date'))

Using fastai's TabularPandas and TabularProc to do preprocessing of the data

In [None]:
procs = [Categorify, FillMissing]

Because this is a time series, we have to be careful when partitioning the data into training and validation sets. If we take a closer look at the date range in the test set, we discover that it covers a 4-month period from January 2011, which is later in time than any date in the training set. This is a good design, because we have to build a model that predicts the future. So, if we want a useful validation set, it should also be later in time than the training set. Our training data ends around the end of December 2010, so we define a narrower training set consisting of the data from before October 2010, and a validation set consisting of the data from October 2010 onward.

In [None]:
#the three datasets share the same rows and dates, so one condition built from train_carbon_monoxide serves all of them
cond = (train_carbon_monoxide.date_timeYear<2011) & (train_carbon_monoxide.date_timeMonth<10)

train_cm_idx = np.where( cond)[0]
train_bz_idx =np.where(cond)[0]
train_no_idx= np.where(cond)[0]
valid_cm_idx = np.where(~cond)[0]
valid_bz_idx = np.where(~cond)[0]
valid_no_idx = np.where(~cond)[0]

splits_cm = (list(train_cm_idx),list(valid_cm_idx))
splits_bz = (list(train_bz_idx),list(valid_bz_idx))
splits_no = (list(train_no_idx),list(valid_no_idx))


In [None]:
print(len(splits_cm[1]))
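The time-based split can be illustrated on a tiny hourly series. This sketch uses made-up dates and compares against a single cut-off timestamp, which is an alternative to combining year and month columns as done above.

```python
import numpy as np
import pandas as pd

# 96 hourly timestamps spanning the cut-off; the dates are made up for illustration
demo = pd.DataFrame({
    'date_time': pd.date_range('2010-09-29', periods=96, freq=pd.Timedelta(hours=1))
})

cutoff = pd.Timestamp('2010-10-01')
is_train = (demo['date_time'] < cutoff).to_numpy()

# Row positions before the cut-off go to training, the rest to validation
train_idx = np.where(is_train)[0]
valid_idx = np.where(~is_train)[0]
splits = (list(train_idx), list(valid_idx))
```

Every validation row comes after every training row, which is the property a random split would not give us.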

Now we will tell TabularPandas which columns are continuous and which are categorical. We will do it automatically using the function cont_cat_split

In [None]:
cont_cm,cat_cm = cont_cat_split(train_carbon_monoxide, 1, dep_var=dep_var_tcm)
cont_bz,cat_bz = cont_cat_split(train_benzene, 1, dep_var=dep_var_tb)
cont_no,cat_no = cont_cat_split(train_nitrogen_oxides, 1, dep_var=dep_var_tno)
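The kind of rule such a split applies can be sketched in plain pandas. The helper simple_cont_cat_split below is hypothetical and is only an approximation under stated assumptions: floats are treated as continuous, integers are continuous when they have more than max_card distinct values, and everything else is categorical. It is not fastai's own code.

```python
import pandas as pd

def simple_cont_cat_split(df, max_card, dep_var):
    """Split feature names into (continuous, categorical) by dtype and cardinality."""
    cont, cat = [], []
    for name in df.columns:
        if name == dep_var:
            continue  # the target is neither kind of feature
        col = df[name]
        is_int = pd.api.types.is_integer_dtype(col.dtype)
        is_float = pd.api.types.is_float_dtype(col.dtype)
        if is_float or (is_int and col.nunique() > max_card):
            cont.append(name)
        else:
            cat.append(name)
    return cont, cat

# Made-up frame, for illustration only
demo = pd.DataFrame({
    'sensor': [0.1, 0.2, 0.3],
    'Month': [3, 4, 5],
    'Is_month_end': [False, False, True],
    'target': [1.0, 2.0, 3.0],
})
cont_names, cat_names = simple_cont_cat_split(demo, max_card=1, dep_var='target')
```

With a threshold of 1, practically every numeric column ends up continuous and only boolean or string columns remain categorical.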

In [None]:
from fastai.tabular import *
#for NN
data_cm = TabularDataLoaders.from_df(train_carbon_monoxide, cat_names=cat_cm, cont_names=cont_cm, procs=procs, 
                                 y_names="target_carbon_monoxide", bs=64)
data_bz = TabularDataLoaders.from_df(train_benzene, cat_names=cat_bz, cont_names=cont_bz, procs=procs, 
                                 y_names="target_benzene", bs=64)
data_no = TabularDataLoaders.from_df(train_nitrogen_oxides, cat_names=cat_no, cont_names=cont_no, procs=procs, 
                                 y_names="target_nitrogen_oxides", bs=64)

In [None]:
to_cm = TabularPandas(train_carbon_monoxide, procs, cat_cm, cont_cm, y_names=dep_var_tcm, splits=splits_cm)
to_bz = TabularPandas(train_benzene, procs, cat_bz, cont_bz, y_names=dep_var_tb, splits=splits_bz)
to_no = TabularPandas(train_nitrogen_oxides, procs, cat_no, cont_no, y_names=dep_var_tno, splits=splits_no)

In [None]:
len(to_bz.train),len(to_bz.valid)

In [None]:
to_bz.show(3)

In [None]:
to_bz.items.head(3)

Saving the pre-processed dataset so it can be loaded directly at a later stage

In [None]:
#if required, save a processed dataset using the code below and load it back later.
#save_pickle(Path('/kaggle/working')/'to_cm.pkl', to_cm)
#To read this back later, you would type:
#to_cm = load_pickle(Path('/kaggle/working')/'to_cm.pkl')

Creating the decision tree first

Defining our dependent and independent variables first

In [None]:
xs_cm,y_cm = to_cm.train.xs,to_cm.train.y
xs_bz,y_bz = to_bz.train.xs,to_bz.train.y
xs_no,y_no = to_no.train.xs,to_no.train.y
valid_xs_cm,valid_y_cm = to_cm.valid.xs,to_cm.valid.y
valid_xs_bz,valid_y_bz = to_bz.valid.xs,to_bz.valid.y
valid_xs_no,valid_y_no = to_no.valid.xs,to_no.valid.y

In [None]:
print(xs_cm.columns)

We can make a decision tree because our data is all numeric and has no missing values

In [None]:
#First doing some Decision tree evaluation on carbon monoxide dataset
m_cm= DecisionTreeRegressor(max_leaf_nodes=4)
m_cm.fit(xs_cm, y_cm);

For simplicity, the tree is limited to just 4 leaf nodes. Let's see how it looks.

In [None]:
draw_tree(m_cm, xs_cm, size=10, leaves_parallel=True, precision=2)

The picture above makes the tree easy to read. The top node represents the initial model before any splits have been done, when all the data is in one group. We can see it predicts a value of 0.41 for the logarithm of the target_carbon_monoxide column, with a mean squared error of 0.48. We can also see that it contains 4902 entries, which equals the size of our training set. Finally, the best first split was found on the column sensor2.

We can display the same information using Terence Parr's dtreeviz library:

In [None]:
samp_idx = np.random.permutation(len(y_cm))[:500]
dtreeviz(m_cm, xs_cm.iloc[samp_idx], y_cm.iloc[samp_idx], xs_cm.columns, dep_var_tcm,
        fontname='DejaVu Sans', scale=1.6, label_fontsize=10,
        orientation='LR')


No obvious outliers are visible here, so no modification of the dataset is required.

Let's create a bigger decision tree for each of the three datasets. Here we are not passing any stopping criteria such as max_leaf_nodes.

In [None]:
m_cm = DecisionTreeRegressor()
m_bz=DecisionTreeRegressor()
m_no=DecisionTreeRegressor()

m_cm.fit(xs_cm, y_cm);
m_bz.fit(xs_bz, y_bz);
m_no.fit(xs_no, y_no);

We will create a small function to check the RMSE

In [None]:
def r_mse(pred,y): return round(math.sqrt(((pred-y)**2).mean()), 6)
def m_rmse(m, xs, y): return r_mse(m.predict(xs), y)
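A quick hand-checkable example of the RMSE helper, using made-up arrays. The function is repeated here only so that the snippet runs on its own.

```python
import math
import numpy as np

def r_mse(pred, y):
    """Root mean squared error, rounded to 6 decimals."""
    return round(math.sqrt(((pred - y) ** 2).mean()), 6)

pred = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 2.0, 3.0, 8.0])

# Squared errors are 0, 0, 0, 16; their mean is 4, so the RMSE is 2
error = r_mse(pred, y)
```

Because the targets in this notebook are logs, the values returned by this helper are errors in log space.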

In [None]:
#return a tuple so that all three scores are displayed, not only the last one
m_rmse(m_cm, xs_cm, y_cm), m_rmse(m_bz, xs_bz, y_bz), m_rmse(m_no, xs_no, y_no)

This check was on the training set, let's check the validation set

In [None]:
m_rmse(m_cm, valid_xs_cm, valid_y_cm)


The model is overfitting badly, and the reason is that we have nearly as many leaf nodes as data points

In [None]:
m_cm.get_n_leaves(), len(xs_cm)

The reason is sklearn's default settings, which allow it to keep splitting nodes until there is only one item in each leaf node. Let's change the stopping rule so that every leaf node contains at least 25 samples

In [None]:
m_cm = DecisionTreeRegressor(min_samples_leaf=25)
m_cm.fit(to_cm.train.xs, to_cm.train.y)
m_rmse(m_cm, xs_cm, y_cm), m_rmse(m_cm, valid_xs_cm, valid_y_cm)

That's much better

In [None]:
#Checking number of leaves again
m_cm.get_n_leaves()

That's a far more reasonable decision tree

Let's get predictions from the decision tree regressors for all three datasets
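The effect of min_samples_leaf on tree size can be reproduced on synthetic data. This is a sketch with randomly generated features and targets, not the competition data.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem: one informative feature plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X[:, 0] + rng.normal(scale=0.5, size=500)

# Default settings: keep splitting until the leaves are pure
full = DecisionTreeRegressor(random_state=0).fit(X, y)
# Require at least 25 samples in every leaf
pruned = DecisionTreeRegressor(min_samples_leaf=25, random_state=0).fit(X, y)

n_full, n_pruned = full.get_n_leaves(), pruned.get_n_leaves()
```

With 500 rows and at least 25 rows per leaf, the constrained tree cannot have more than 20 leaves, so it is forced to average over groups instead of memorizing single rows.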

In [None]:
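#reuse the training preprocessing on the test set (all three datasets share the same features); undo the log on the predictions
to_test = to_cm.new(test)
to_test.process()
pred_cm = np.exp(m_cm.predict(to_test.xs))
pred_bz = np.exp(m_bz.predict(to_test.xs))
pred_no = np.exp(m_no.predict(to_test.xs))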

In [None]:
## create submission for DTR
#sample_submission[sample_submission.columns[1]] = pred_cm
#sample_submission[sample_submission.columns[2]] = pred_bz
#sample_submission[sample_submission.columns[3]] = pred_no
#sample_submission


In [None]:
#sample_submission.to_csv('submission_clf_gscv.csv', index=False)

Creating a random forest regressor

In [None]:
def rf(xs, y, n_estimators=40, max_samples=4902,
       max_features=0.5, min_samples_leaf=5, **kwargs):
    return RandomForestRegressor(n_jobs=-1, n_estimators=n_estimators,
        max_samples=max_samples, max_features=max_features,
        min_samples_leaf=min_samples_leaf, oob_score=True).fit(xs, y)

In [None]:
m_cm = rf(xs_cm, y_cm);
m_bz = rf(xs_bz, y_bz);
m_no = rf(xs_no, y_no);

In [None]:

m_rmse(m_cm, xs_cm, y_cm), m_rmse(m_cm, valid_xs_cm, valid_y_cm)
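Since the forests are built with oob_score=True, the out-of-bag predictions offer another error estimate that needs no validation set. Here is a sketch on synthetic data; the sizes and hyperparameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression problem, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = 2 * X[:, 0] + rng.normal(scale=0.3, size=400)

m = RandomForestRegressor(n_estimators=40, max_samples=200, max_features=0.5,
                          min_samples_leaf=5, oob_score=True,
                          random_state=0).fit(X, y)

# Each row is predicted only by the trees that did not see it during training
oob_rmse = float(np.sqrt(np.mean((m.oob_prediction_ - y) ** 2)))
train_rmse = float(np.sqrt(np.mean((m.predict(X) - y) ** 2)))
```

If the out-of-bag error is much lower than the validation error, the gap is likely caused by the validation period differing from the training period rather than by ordinary overfitting.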



Our validation RMSE has definitely improved compared to the DecisionTreeRegressor

**feature importance**

In [None]:
def rf_feat_importance(m, df):
    return pd.DataFrame({'cols':df.columns, 'imp':m.feature_importances_}
                       ).sort_values('imp', ascending=False)
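To see what the importance table looks like when the answer is known, here is a sketch on synthetic data in which only one column, named signal, drives the target. The column names and sizes are made up.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'noise_a': rng.normal(size=300),
    'signal': rng.normal(size=300),
    'noise_b': rng.normal(size=300),
})
# Only 'signal' influences the target
y = 3 * df['signal'] + rng.normal(scale=0.1, size=300)

m = RandomForestRegressor(n_estimators=20, random_state=0).fit(df, y)

# Same table layout as rf_feat_importance: one row per column, sorted by importance
fi = pd.DataFrame({'cols': df.columns, 'imp': m.feature_importances_}
                  ).sort_values('imp', ascending=False)
```

The importances sum to 1, so each value can be read as that column's share of the total reduction in error across the forest.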

In [None]:
#for carbon monoxide dataset
fi_cm = rf_feat_importance(m_cm, xs_cm)
fi_cm[:10]

In [None]:
#for benzene dataset
fi_bz = rf_feat_importance(m_bz, xs_bz)
fi_bz[:10]

In [None]:
#for nitrogen oxide dataset
fi_no = rf_feat_importance(m_no, xs_no)
fi_no[:10]

For all three datasets, sensor2 is the most important feature, while the ranking of the others varies. This agrees with the decision tree above, which chose sensor2 for its first split.

In [None]:
def plot_fi(fi):
    return fi.plot('cols', 'imp', 'barh', figsize=(12,7), legend=False)

plot_fi(fi_no[:30]);

Removing the columns with low importance. Let's try keeping only the columns with a feature importance above 0.008

In [None]:
#for cm
to_keep_cm = fi_cm[fi_cm.imp>0.008].cols
len(to_keep_cm)

In [None]:
#for bz
to_keep_bz = fi_bz[fi_bz.imp>0.008].cols
len(to_keep_bz)

In [None]:
#for no
to_keep_no = fi_no[fi_no.imp>0.008].cols
len(to_keep_no)

In [None]:
xs_imp_cm = xs_cm[to_keep_cm]
xs_imp_bz = xs_bz[to_keep_bz]
xs_imp_no = xs_no[to_keep_no]

In [None]:
valid_xs_imp_cm = valid_xs_cm[to_keep_cm]
valid_xs_imp_bz = valid_xs_bz[to_keep_bz]
valid_xs_imp_no = valid_xs_no[to_keep_no]

In [None]:
m_cm = rf(xs_imp_cm, y_cm)

Let's see the result

In [None]:
m_rmse(m_cm, xs_imp_cm, y_cm), m_rmse(m_cm, valid_xs_imp_cm, valid_y_cm)

The accuracy is almost the same, but there are fewer columns to deal with

Removing redundant features

In [None]:
cluster_columns(xs_imp_cm)

date_timeDayofyear and date_timeElapsed are very closely related, so one of them can be removed for the carbon_monoxide dataset. Let's check the same for benzene and nitrogen_oxides
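Why two date features end up so close can be seen from their rank correlation. This sketch uses made-up daily dates within one year and plain pandas; it is a stand-in for the idea behind the dendrogram, not the cluster_columns implementation.

```python
import pandas as pd

# 200 consecutive days inside a single year, made up for illustration
dates = pd.Series(pd.date_range('2010-03-10', periods=200, freq='D'))

feats = pd.DataFrame({
    'Dayofyear': dates.dt.dayofyear,
    'Elapsed': (dates - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1),
    'Dayofweek': dates.dt.dayofweek,
})

# Spearman correlation compares rankings, so any monotonic relationship scores 1
rank_corr = feats.corr(method='spearman')
```

Within one year, day of year and elapsed time always increase together, so a tree can make the same splits with either one. Day of week cycles and is therefore almost uncorrelated with both.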

In [None]:
cluster_columns(xs_imp_bz)

No closely related features remain here

In [None]:
cluster_columns(xs_imp_no)

Here, date_timeDayofyear, date_timeElapsed, and date_timeWeek merge very early, so two of them can be removed.

In [None]:
train_carbon_monoxide.dtypes

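We want the network's predictions to stay within the range of (log-transformed) gas concentrations seen in training, so we compute a y_range for each target. The maximum of each target is multiplied by 1.2 because the sigmoid used to enforce the range only approaches its upper limit asymptotically; the extra headroom keeps the largest observed values reachable.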

In [None]:
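#the targets are logs, so concentrations below 1 are negative: extend the lower bound below 0 when needed
min_y_cm = min(0.0, np.min(train_carbon_monoxide['target_carbon_monoxide'])*1.2)
max_y_cm = np.max(train_carbon_monoxide['target_carbon_monoxide'])*1.2
y_range_cm = torch.tensor([min_y_cm, max_y_cm])
y_range_cm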

In [None]:
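#the targets are logs, so concentrations below 1 are negative: extend the lower bound below 0 when needed
min_y_bz = min(0.0, np.min(train_benzene['target_benzene'])*1.2)
max_y_bz = np.max(train_benzene['target_benzene'])*1.2
y_range_bz = torch.tensor([min_y_bz, max_y_bz])
y_range_bz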

In [None]:
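#the targets are logs, so concentrations below 1 are negative: extend the lower bound below 0 when needed
min_y_no = min(0.0, np.min(train_nitrogen_oxides['target_nitrogen_oxides'])*1.2)
max_y_no = np.max(train_nitrogen_oxides['target_nitrogen_oxides'])*1.2
y_range_no = torch.tensor([min_y_no, max_y_no])
y_range_no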
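The reason for padding the upper limit can be seen by applying a scaled sigmoid by hand. This is a NumPy sketch of the usual formula, sigmoid(x) * (hi - lo) + lo; the value max_seen and the activations are hypothetical.

```python
import numpy as np

def sigmoid_range(x, lo, hi):
    """Squash x into the open interval (lo, hi) with a scaled sigmoid."""
    return 1.0 / (1.0 + np.exp(-x)) * (hi - lo) + lo

max_seen = 2.5                            # hypothetical largest target in training
activations = np.array([-4.0, 0.0, 4.0])  # hypothetical raw network outputs

# With the limit set exactly at max_seen, no finite activation can reach it
tight = sigmoid_range(activations, 0.0, max_seen)
# With 20% headroom, a moderate activation already exceeds max_seen
padded = sigmoid_range(activations, 0.0, max_seen * 1.2)
```

The same argument applies at the lower end: a target equal to the lower limit can only be approached, never predicted exactly.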

In [None]:
#learner for carbon_monoxide dataset
learn_cm = tabular_learner(data_cm, layers=[1000,500], 
                        y_range=y_range_cm, metrics=rmse)
#learner for benzene dataset
learn_bz = tabular_learner(data_bz, layers=[1000,500], 
                        y_range=y_range_bz, metrics=rmse)
#learner for nitrogen oxides dataset
learn_no = tabular_learner(data_no, layers=[1000,500], 
                        y_range=y_range_no, metrics=rmse)

In [None]:
learn_cm.model

In [None]:
learn_bz.model

In [None]:
print(pred_no)