In [None]:
!pip install fastai==0.7.0

**Dependencies**

In [None]:
import numpy as np
import pandas as pd
from collections import Counter

from fastai.imports import *
from fastai.structured import *

import os
print(os.listdir("../input"))

Load the training and testing sets. Listing the date columns in `parse_dates` makes pandas parse them as datetimes while loading, so they arrive as `datetime64` columns rather than strings and need no separate conversion step.
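
As a small illustration of what `parse_dates` changes, the sketch below reads the same CSV text twice. The rows and the `Value` column are made up for the example and are not taken from the competition files.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for the real files (made-up rows).
csv_text = """Application Date,Issue Date,Value
2018-01-05,2018-02-01,100
2018-03-10,,250
"""

date_cols = ["Application Date", "Issue Date"]

# Without parse_dates the date columns are loaded as plain strings.
raw = pd.read_csv(io.StringIO(csv_text))

# With parse_dates they become datetime64 columns; empty cells become NaT.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=date_cols)

print(raw.dtypes)
print(parsed.dtypes)
```

Missing dates survive as `NaT`, so they still show up in `isna().sum()`.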

In [None]:
date_cols = ["Application Date", "Issue Date", "Final Date", "Expiration Date"]
train_df = pd.read_csv('../input/train_file.csv', parse_dates=date_cols)
test_df = pd.read_csv('../input/test_file.csv', parse_dates=date_cols)

train_df.shape, test_df.shape

In [None]:
train_df.info()

In [None]:
train_df.head()

In [None]:
train_df.isna().sum()

In [None]:
test_df.isna().sum()

**Lots of missing values!**

Let's now see the target distribution of the training set. 
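
Alongside raw counts, class shares can make an imbalance easier to read. A minimal sketch with the standard library, using made-up labels in place of `train_df['Category'].values`:

```python
from collections import Counter

# Made-up labels standing in for train_df['Category'].values
labels = ["SINGLE FAMILY / DUPLEX"] * 6 + ["COMMERCIAL"] * 3 + ["INDUSTRIAL"]

counts = Counter(labels)
total = sum(counts.values())

# most_common() orders by count, so the dict keeps the largest class first.
shares = {label: n / total for label, n in counts.most_common()}

for label, share in shares.items():
    print(f"{label:<25} {share:.1%}")
```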

In [None]:
label_counts = Counter(train_df['Category'].values)
label_counts.most_common()

* **Start of feature engineering**:
    Merge the training and testing sets. Concatenation does not shuffle the instances, so the two sets can be separated again later by row position. Merging helps in two ways:
    * Features are generated consistently for both sets (for example, when one-hot encoding categorical features), which helps an ML model train with low bias.
    * It reduces the chance of data leakage.
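
The column-consistency point can be sketched with plain pandas. The `Permit Type` column and its values are hypothetical; the example only shows why encoding the concatenated frame keeps the feature columns aligned.

```python
import pandas as pd

# Made-up frames: the test set contains a value ("D") never seen in training.
train = pd.DataFrame({"Permit Type": ["A", "B", "A"]})
test = pd.DataFrame({"Permit Type": ["B", "D"]})

# Encoding each set separately produces different column sets.
sep_train = pd.get_dummies(train)
sep_test = pd.get_dummies(test)
print(list(sep_train.columns), list(sep_test.columns))

# Encoding the concatenation, then splitting by position, keeps them aligned.
combined = pd.concat([train, test], ignore_index=True)
encoded = pd.get_dummies(combined)
enc_train, enc_test = encoded.iloc[:len(train)], encoded.iloc[len(train):]
print(list(enc_train.columns) == list(enc_test.columns))
```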

In [None]:
test_df['Category'] = '' # Placeholder target so that train and test have the same columns

In [None]:
all_data_df = pd.concat([train_df, test_df])
all_data_df.shape

Both the datasets have the following date columns -
* Application Date
* Issue Date
* Final Date
* Expiration Date

The **fast.ai** function `add_datepart()` expands each of these columns into separate date-part features.

> The following method extracts particular date fields from a complete datetime for the purpose of constructing categoricals. You should always consider this feature extraction step when working with date-time. Without expanding your date-time into these additional fields, you can't capture any trend/cyclical behavior as a function of time at any of these granularities. - [Fast.AI's course on Machine Learning for coders](https://github.com/fastai/fastai/blob/master/courses/ml1/lesson1-rf.ipynb)
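
To make the idea concrete, here is a pandas-only sketch of the kind of expansion described in the quote. It is not fast.ai's implementation: the new column names are illustrative and only a handful of date parts are shown.

```python
import pandas as pd

df = pd.DataFrame({"Issue Date": pd.to_datetime(["2018-01-31", "2018-07-04"])})

col = df["Issue Date"]
df["Issue Year"] = col.dt.year
df["Issue Month"] = col.dt.month
df["Issue Dayofweek"] = col.dt.dayofweek          # Monday = 0
df["Issue Is_month_end"] = col.dt.is_month_end
# Seconds since the Unix epoch, a single monotonically increasing time feature.
df["Issue Elapsed"] = (col - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)

# The original datetime column is dropped once its parts are extracted.
df = df.drop(columns="Issue Date")
print(df)
```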

In [None]:
add_datepart(all_data_df,'Application Date')
add_datepart(all_data_df,'Issue Date')
add_datepart(all_data_df,'Final Date')
add_datepart(all_data_df,'Expiration Date')

In [None]:
all_data_df.shape

Another **fast.ai** function, `train_cats()`, converts string columns to `pandas` categories.
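
A rough pandas equivalent of that conversion is sketched below, assuming a hypothetical `Status` column. It shows that a category column stores its distinct values once and refers to them by integer code.

```python
import pandas as pd

df = pd.DataFrame({"Status": ["Issued", "Closed", "Issued"], "Value": [10, 20, 30]})

# Convert every string column to an ordered pandas categorical.
for name in list(df.columns):
    if pd.api.types.is_string_dtype(df[name]):
        df[name] = df[name].astype("category").cat.as_ordered()

print(df["Status"].cat.categories)   # distinct values, sorted
print(df["Status"].cat.codes)        # integer code per row
```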

In [None]:
train_cats(all_data_df)

In [None]:
all_data_df.Category.cat.categories

**Missing values' handling**
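
For numeric columns, the handling is roughly median imputation plus an indicator column that records where values were missing. The sketch below reproduces that idea in plain pandas on a made-up `Value` column; it is an approximation for illustration, not the library's code.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Value": [100.0, np.nan, 300.0, 500.0]})

for name in list(df.columns):
    col = df[name]
    if pd.api.types.is_numeric_dtype(col) and col.isna().any():
        df[name + "_na"] = col.isna()          # remember where the gaps were
        df[name] = col.fillna(col.median())    # fill the gaps with the median

print(df)
```

The indicator column lets a model treat "was missing" as a signal in its own right.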

In [None]:
df_x, df_y, nas = proc_df(all_data_df, 'Category')
df_x.shape, df_y.shape

In [None]:
label_counts = Counter(df_y)
label_counts.most_common()

The samples with category 0 belong to the testing set. Apart from that, the target distribution is unchanged. The encoding maps as follows:
- 5 -> SINGLE FAMILY / DUPLEX
- 4 -> MULTIFAMILY
- 3 -> INSTITUTIONAL
- 2 -> INDUSTRIAL
- 1 -> COMMERCIAL
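
These codes follow from how pandas orders categories: the values are sorted, so the empty-string placeholder used for the testing rows comes first and receives code 0. A small check with made-up rows:

```python
import pandas as pd

# Made-up target column; "" is the placeholder assigned to the testing rows.
labels = pd.Series(["MULTIFAMILY", "", "COMMERCIAL", "SINGLE FAMILY / DUPLEX",
                    "INSTITUTIONAL", "INDUSTRIAL", ""])

cat = labels.astype("category")

# Position in the sorted category list is the integer code.
code_to_label = dict(enumerate(cat.cat.categories))
print(code_to_label)
```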

We are done with the feature engineering part for now. We can now split the data back into the original training and testing sets.
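
Because concatenation keeps the row order, the split point is simply the number of training rows. A sketch on toy frames that derives it with `len()` instead of typing the number by hand:

```python
import pandas as pd

train = pd.DataFrame({"x": [1, 2, 3]})
test = pd.DataFrame({"x": [4, 5]})
combined = pd.concat([train, test], ignore_index=True)

def split_vals(a, n):
    """Split a into its first n rows and the remainder."""
    return a[:n], a[n:]

n_train = len(train)   # split point derived from the data
train_back, test_back = split_vals(combined, n_train)
print(train_back.shape, test_back.shape)
```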

In [None]:
def split_vals(a,n): return a[:n], a[n:]

In [None]:
df_train, test  = split_vals(df_x,33539)
y_train,_ = split_vals(df_y,33539)
df_train.shape, y_train.shape, test.shape

Next, split the training set further into training and validation sets: the first 28,539 rows are used for training and the remaining 5,000 for validation.

In [None]:
X_train, X_valid = split_vals(df_train,28539)
y_train, y_valid = split_vals(y_train,28539)

X_train.shape, y_train.shape, X_valid.shape, y_valid.shape

With limited time, trying out many different ML models by hand is tedious. H2O.ai's AutoML framework is pretty good for situations like this.

In [None]:
import h2o
from h2o.automl import H2OAutoML

h2o.init(max_mem_size='10G')

In [None]:
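# Work on copies so the assignment does not trigger pandas' SettingWithCopyWarning
X_train, X_valid = X_train.copy(), X_valid.copy()
X_train['Category'] = y_train
X_valid['Category'] = y_valid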

All H2O models require the data to be presented in a format called [H2OFrame](http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/frame.html). 

In [None]:
X_train_h2o = h2o.H2OFrame(X_train)
X_valid_h2o = h2o.H2OFrame(X_valid)
X_test_h2o = h2o.H2OFrame(test)

We have a multi-class (multinomial) classification problem here, and H2O requires the target column of a classification problem to be a `factor` (categorical) column.

In [None]:
X_train_h2o['Category'] = X_train_h2o['Category'].asfactor()
X_valid_h2o['Category'] = X_valid_h2o['Category'].asfactor()

In [None]:
predictors = X_train_h2o.columns[0:-1]
target = 'Category'

In [None]:
auto_h2o = H2OAutoML(seed = 1, sort_metric = 'mean_per_class_error')
%time auto_h2o.train(x=predictors,\
               y=target,\
               training_frame=X_train_h2o)

lb = auto_h2o.leaderboard
lb.head(rows=lb.nrows)

H2O's AutoML ranked a **Stacked Ensemble** model at the top of the leaderboard. Let's take a look at the model.

In [None]:
auto_h2o.leader

> So, the Stacked Ensemble consists of multinomial GLMs. GLMs are generally good for datasets having high dimensionality and sparsity, which is consistent with the hypothesis that H2O's AutoML picked a suitable model. Let's go ahead and see the confusion matrix of the model on the validation set.

In [None]:
auto_h2o.leader.confusion_matrix(X_valid_h2o)
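
To show how per-class error rates and a mean per-class error (the metric used to sort the leaderboard) relate to a confusion matrix, here is a pandas sketch on made-up labels. It does not call H2O and the numbers are not results from this notebook.

```python
import pandas as pd

# Made-up validation labels and predictions for three encoded classes.
actual    = pd.Series([5, 5, 5, 5, 1, 1, 4, 4], name="actual")
predicted = pd.Series([5, 5, 5, 1, 1, 1, 5, 4], name="predicted")

# Rows are actual classes, columns are predicted classes.
cm = pd.crosstab(actual, predicted)
print(cm)

# Error rate within each actual class, then the unweighted mean across classes.
per_class_error = (actual != predicted).groupby(actual).mean()
mean_per_class_error = per_class_error.mean()
print(per_class_error)
print(mean_per_class_error)
```

Because the mean is unweighted, a rare class with a high error rate pulls the metric up as much as a common one would.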

<h3>**Interpretation:**</h3> 
![cm_interpretation](https://i.ibb.co/V2hNKfS/Capture.png)
<center>[Source](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/performance-and-prediction.html#f1)</center>

In [None]:
label_counts = Counter(X_valid['Category'].values)
label_counts.most_common()

The model performs fairly well for classes 5 and 1 but poorly for classes 4, 3 and 2 (the encoded versions of the original categories). The likely cause is that these three classes are poorly represented in the training and validation sets, so the Stacked Ensemble model fails to capture the patterns underlying their samples. As a next step, a better split between the training and validation sets, one that distributes the samples of classes 4, 3 and 2 more evenly, may result in a better model.
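
One possible way to get such a split is to stratify by class, so that every class contributes the same fraction of its samples to the validation set. The sketch below is a hand-rolled NumPy version on made-up labels. The function name `stratified_split` and its parameters are hypothetical, not part of any library used above.

```python
import numpy as np

def stratified_split(labels, valid_frac=0.2, seed=0):
    """Return (train_idx, valid_idx) with each class split by valid_frac."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_idx, valid_idx = [], []
    for cls in np.unique(labels):
        # Shuffle the row positions of this class, then carve off a share.
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_valid = max(1, int(round(len(idx) * valid_frac)))
        valid_idx.extend(idx[:n_valid])
        train_idx.extend(idx[n_valid:])
    return np.sort(train_idx), np.sort(valid_idx)

# Made-up, imbalanced labels using the encoded class ids.
y = np.array([5] * 50 + [1] * 30 + [4] * 10 + [3] * 5 + [2] * 5)
train_idx, valid_idx = stratified_split(y, valid_frac=0.2)

classes, counts = np.unique(y[valid_idx], return_counts=True)
print(dict(zip(classes.tolist(), counts.tolist())))
```

With this approach every class appears in both sets, provided it has at least two samples.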

**Make predictions on the testing set and prepare the submission file**

In [None]:
test_preds = auto_h2o.leader.predict(X_test_h2o)

In [None]:
application_numbers = test_df['Application/Permit Number']
results = h2o.as_list(test_preds['predict']).iloc[:, 0]

In [None]:
final = pd.concat([application_numbers, results], axis=1)
final.rename(columns={'predict': 'Category'}, inplace=True)

- 5 -> SINGLE FAMILY / DUPLEX
- 4 -> MULTIFAMILY
- 3 -> INSTITUTIONAL
- 2 -> INDUSTRIAL
- 1 -> COMMERCIAL
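
The same decoding can also be written as a single dictionary lookup with `Series.map`. The sketch below uses a made-up `final` frame with invented application numbers.

```python
import pandas as pd

code_to_label = {
    5: "SINGLE FAMILY / DUPLEX",
    4: "MULTIFAMILY",
    3: "INSTITUTIONAL",
    2: "INDUSTRIAL",
    1: "COMMERCIAL",
}

# Made-up predictions; the application numbers are placeholders.
final = pd.DataFrame({"Application/Permit Number": [101, 102, 103],
                      "Category": [5, 1, 3]})

# Any code missing from the dictionary would become NaN, which makes gaps visible.
final["Category"] = final["Category"].map(code_to_label)
print(final)
```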

In [None]:
final.loc[final['Category'] == 5, 'Category'] = 'SINGLE FAMILY / DUPLEX'
final.loc[final['Category'] == 4, 'Category'] = 'MULTIFAMILY'
final.loc[final['Category'] == 3, 'Category'] = 'INSTITUTIONAL'
final.loc[final['Category'] == 2, 'Category'] = 'INDUSTRIAL'
final.loc[final['Category'] == 1, 'Category'] = 'COMMERCIAL'

print(final.head())

In [None]:
final.to_csv('submission.csv', sep=',', index=False)
!head -5 submission.csv