In [None]:
!pip install deep_autoviml

# Introduction

This notebook demonstrates how the **Deep AutoViML** AutoML product can automatically build a DL model for this competition and generate predictions for the submission.

**Note:** We have to install **Deep AutoViML** in the notebook interactively, since it is not part of the default set of Python packages deployed in the Kaggle notebook Docker image.

In [None]:
from deep_autoviml import deep_autoviml as deepauto
import pandas as pd
import datetime as dt

In [None]:
# main flow
start_time = dt.datetime.now()
print("Started at ", start_time)

# Data Loading

First of all, we read the data into memory.

In [None]:
trainfile = '/kaggle/input/tabular-playground-series-dec-2021/train.csv'
testfile = '/kaggle/input/tabular-playground-series-dec-2021/test.csv'
subfile = '/kaggle/input/tabular-playground-series-dec-2021/sample_submission.csv'

train = pd.read_csv(trainfile)
test = pd.read_csv(testfile)
sub = pd.read_csv(subfile)
print(train.shape, test.shape)
train.head()

# Data Preprocessing and Feature Engineering

We are going to check the frequency of the target class labels in the training set.

In [None]:
#### Check the class counts. If any class is too small, just drop those classes
target = 'Cover_Type'
train[target].value_counts()

As we can see, the classes labeled **4** and **5** are quite rare in the training set, so we will ignore them in the model training. To do that, we drop the training observations with those class labels:

In [None]:
rare_mask = train[target].isin([4, 5])  # rows belonging to the rare classes
print('rows dropped = ', rare_mask.sum())
train = train[~rare_mask]
print(train.shape)
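
The hard-coded labels above come from reading the `value_counts()` output by eye. As a minimal sketch of the same idea driven by a count threshold, the snippet below uses a toy frame; the data values and the `min_count` threshold are hypothetical and are not taken from the competition data.

```python
import pandas as pd

# Toy stand-in for the training data (hypothetical values).
toy = pd.DataFrame({
    "Elevation": [2500, 2600, 2700, 2800, 2900, 3000, 3100],
    "Cover_Type": [1, 1, 1, 2, 2, 2, 5],
})
target = "Cover_Type"
min_count = 2  # hypothetical threshold; pick it after inspecting value_counts()

counts = toy[target].value_counts()
rare_classes = counts[counts < min_count].index.tolist()

# Keep only the rows whose label is not in the rare set.
filtered = toy[~toy[target].isin(rare_classes)]
print("rare classes:", rare_classes)
print("rows dropped:", len(toy) - len(filtered))
```

Note that a model trained this way can never predict the dropped classes, so any test rows that truly belong to them will be misclassified.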

Also, we detected **'Soil_Type15'** and **'Soil_Type7'** to be useless features. They only have one value for each and every record in the training set (that is, they have zero variance) so they would not be helpful in the model training.

We are going to remove these features from both the training and test sets.

In [None]:
# drop useless features with zero variance
features_to_drop = ['Soil_Type15', 'Soil_Type7']
train = train.drop(features_to_drop, axis=1)
test = test.drop(features_to_drop, axis=1)
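
The notebook does not show how the zero-variance columns were found. One possible way, sketched here on a toy frame with hypothetical values, is to count the distinct values per column with `DataFrame.nunique()` and drop every column that has only one.

```python
import pandas as pd

# Toy stand-in for the training data (hypothetical values).
toy_train = pd.DataFrame({
    "Soil_Type1": [0, 1, 0, 1],
    "Soil_Type7": [0, 0, 0, 0],
    "Soil_Type15": [0, 0, 0, 0],
    "Elevation": [2500, 2600, 2700, 2800],
})
toy_test = toy_train.copy()

# A column with a single distinct value has zero variance.
n_unique = toy_train.nunique()
constant_cols = n_unique[n_unique <= 1].index.tolist()

# Detect on the training set only, then drop from both sets.
toy_train = toy_train.drop(columns=constant_cols)
toy_test = toy_test.drop(columns=constant_cols)
print(constant_cols)
```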

Next, we apply some additional feature engineering.

In [None]:
# drop Wilderness_Area3
train = train.drop(['Wilderness_Area3'], axis=1)
test = test.drop(['Wilderness_Area3'], axis=1)

# additional feature engineering
train['EHiElv'] = train['Horizontal_Distance_To_Roadways'] * train['Elevation']
test['EHiElv'] = test['Horizontal_Distance_To_Roadways'] * test['Elevation']

train.loc[train["Hillshade_9am"] < 0, "Hillshade_9am"] = 0
train.loc[train["Hillshade_9am"] > 255, "Hillshade_9am"] = 255
test.loc[test["Hillshade_9am"] < 0, "Hillshade_9am"] = 0
test.loc[test["Hillshade_9am"] > 255, "Hillshade_9am"] = 255

# Summed features pointed out by @craigmthomas (https://www.kaggle.com/c/tabular-playground-series-dec-2021/discussion/292823)
soil_features = [x for x in train.columns if x.startswith("Soil_Type")]
wilderness_features = [x for x in train.columns if x.startswith("Wilderness_Area")]

train["soil_type_count"] = train[soil_features].sum(axis=1)
test["soil_type_count"] = test[soil_features].sum(axis=1)

train["wilderness_area_count"] = train[wilderness_features].sum(axis=1)
test["wilderness_area_count"] = test[wilderness_features].sum(axis=1)
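
As a compact illustration of two of the steps above, the sketch below runs them on a toy frame with hypothetical values. It uses `Series.clip`, which should be equivalent to the pair of `.loc` assignments for limiting `Hillshade_9am` to the range 0 to 255, and it builds the row-wise count features the same way as the cell above.

```python
import pandas as pd

# Toy stand-in for the data (hypothetical values).
toy = pd.DataFrame({
    "Hillshade_9am": [-4, 120, 260],
    "Soil_Type1": [1, 0, 1],
    "Soil_Type2": [0, 0, 1],
    "Wilderness_Area1": [1, 0, 0],
    "Wilderness_Area2": [0, 0, 1],
})

# Limit out-of-range values to the interval [0, 255].
toy["Hillshade_9am"] = toy["Hillshade_9am"].clip(lower=0, upper=255)

# Row-wise counts of the active one-hot indicator columns.
soil_cols = [c for c in toy.columns if c.startswith("Soil_Type")]
wild_cols = [c for c in toy.columns if c.startswith("Wilderness_Area")]
toy["soil_type_count"] = toy[soil_cols].sum(axis=1)
toy["wilderness_area_count"] = toy[wild_cols].sum(axis=1)
print(toy)
```

The column lists are built from the training frame in the notebook, so the test frame must contain the same indicator columns for the sums to line up.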

# Deep AutoViML: Model training and predictions

First of all, we are going to set up some metadata for the model.

**Note:** The model architecture is selected with *keras_model_type*. The *'fast1'* option is Google's deep-and-wide NN architecture; the cell below uses the newer *'fast'* mode. Training runs for 100 epochs with the early stopping option, to find the right compromise between model accuracy and training time. See the **API** section of the documentation at https://github.com/AutoViML/deep_autoviml for more on the available model types.

In [None]:
project_name = "deep_autoviml"
keras_options = {'early_stopping': True}
model_options = {}
keras_model_type = 'fast'  # new fast mode


Now we launch the model training with the settings defined above (early stopping enabled and a relatively small number of epochs).

In [None]:
model, cat_vocab_dict = deepauto.fit(train, target, keras_model_type=keras_model_type,
		project_name=project_name, keras_options=keras_options,  
		model_options=model_options, save_model_flag=True, use_my_model='',
		model_use_case='', verbose=0)

Now we are ready to predict the class labels on the test set and write the predictions file for submission to the competition.

In [None]:
#predict
predictions = deepauto.predict(model, project_name, test_dataset=test,
            keras_model_type=keras_model_type, cat_vocab_dict=cat_vocab_dict)

In [None]:
# update submission with the predictions obtained
prediction_class = predictions[1]
sub['Cover_Type'] = prediction_class
# submit predictions
sub.to_csv("submission.csv", index=False)
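
Before submitting, it can help to run a few sanity checks on the predictions. The sketch below is an illustration only: the toy frames, the variable names, and the set of trained classes (labels 1 to 7 with 4 and 5 removed) are assumptions, not output from this notebook.

```python
import pandas as pd

# Toy stand-ins (hypothetical values) for the sample submission and the predicted labels.
toy_sub = pd.DataFrame({"Id": [10, 11, 12], "Cover_Type": [0, 0, 0]})
toy_pred = pd.Series([1, 2, 7])
trained_classes = {1, 2, 3, 6, 7}  # assumed label set after dropping classes 4 and 5

# There must be exactly one prediction per submission row.
assert len(toy_pred) == len(toy_sub)

# No predicted label should fall outside the classes seen in training.
unexpected = set(toy_pred.tolist()) - trained_classes
assert not unexpected

# Assign by position so that a mismatched index cannot silently misalign rows.
toy_sub["Cover_Type"] = toy_pred.to_numpy()
print(toy_sub["Cover_Type"].value_counts().sort_index())
```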

In [None]:
print('We are done. That is all, folks!')
finish_time = dt.datetime.now()
print("Finished at ", finish_time)
elapsed = finish_time - start_time
print("Elapsed time: ", elapsed)

# Summary

Out of the box, with very limited training time, we achieved quite good classification accuracy:
- **0.90615**: 'fast1' model type, with only 'Soil_Type15', 'Soil_Type7' features dropped, and without the data preprocessing suggested in https://www.kaggle.com/c/tabular-playground-series-dec-2021/discussion/293373
- **0.82723**: 'fast1' model type with additional feature preprocessing ('Soil_Type15', 'Soil_Type7' features dropped, outliers imputed for Aspect, Hillshade_9am, Hillshade_Noon, and Hillshade_3pm)
- **0.89217**: 'Soil_Type15', 'Soil_Type7' features dropped, EHiElv and EViElv features added
- **0.91313**: 'Soil_Type15', 'Soil_Type7' features dropped, EHiElv feature added
- **0.90490**: 'Soil_Type15', 'Soil_Type7' features dropped, EHiElv feature added, Aspect and Aspect2 feature manipulations
- **0.91323**: 'Soil_Type15', 'Soil_Type7' features dropped, 'EHiElv' feature added, and outlier smoothing for 'Hillshade_9am'
- **0.91164**: 'Soil_Type15', 'Soil_Type7' features dropped, 'EHiElv' feature added, and outlier smoothing for 'Hillshade_9am' and 'Hillshade_Noon'
- **0.90723**: 'Soil_Type15', 'Soil_Type7' features dropped, 'EHiElv' feature added, and outlier smoothing for 'Hillshade_9am' and 'Hillshade_3pm'
- **0.90957**: 'Soil_Type15', 'Soil_Type7' features dropped, 'EHiElv' and 'Highwater' features added, and outlier smoothing for 'Hillshade_9am'
- **0.89942**: 'Soil_Type15', 'Soil_Type7' features dropped, 'EHiElv' and 'EVDtH' features added, and outlier smoothing for 'Hillshade_9am'
- **0.91190**: 'Soil_Type15', 'Soil_Type7' features dropped, 'EHiElv' and 'EHDtH' features added, and outlier smoothing for 'Hillshade_9am'
- **0.90940**: 'Soil_Type15', 'Soil_Type7' features dropped, 'EHiElv' and 'Euclidean_Distance_to_Hydrolody' features added, and outlier smoothing for 'Hillshade_9am'
- **0.86319**: 'Soil_Type15', 'Soil_Type7' features dropped, 'EHiElv' and 'Manhattan_Distance_to_Hydrolody' features added, and outlier smoothing for 'Hillshade_9am'
- **0.91019**: 'Soil_Type15', 'Soil_Type7' features dropped, 'EHiElv' and 'Hydro_Fire_1' features added, and outlier smoothing for 'Hillshade_9am'
- **0.89677**: 'Soil_Type15', 'Soil_Type7' features dropped, 'EHiElv' and 'Hydro_Fire_2' features added, and outlier smoothing for 'Hillshade_9am'
- **0.90663**: 'Soil_Type15', 'Soil_Type7' features dropped, 'EHiElv' and 'Hydro_Road_1' features added, and outlier smoothing for 'Hillshade_9am'
- **0.90780**: 'Soil_Type15', 'Soil_Type7' features dropped, 'EHiElv' and 'Hydro_Road_2' features added, and outlier smoothing for 'Hillshade_9am'
- **0.91160**: 'Soil_Type15', 'Soil_Type7' features dropped, 'EHiElv' and 'Fire_Road_1' features added, and outlier smoothing for 'Hillshade_9am'
- **0.91052**: 'Soil_Type15', 'Soil_Type7' features dropped, 'EHiElv' and 'Fire_Road_2' features added, and outlier smoothing for 'Hillshade_9am'
- **0.91939**: 'Soil_Type15', 'Soil_Type7', and 'Wilderness_Area3' features dropped, 'EHiElv' feature added, and outlier smoothing for 'Hillshade_9am'
- **0.91339**: 'Soil_Type15', 'Soil_Type7', and 'Wilderness_Area3' features dropped, 'EHiElv' and 'Hillshade_3pm_is_zero' features added, and outlier smoothing for 'Hillshade_9am'
- **0.92358**: 'Soil_Type15', 'Soil_Type7', and 'Wilderness_Area3' features dropped, 'EHiElv' and 'soil_type_count' features added, and outlier smoothing for 'Hillshade_9am'
- **0.92918**: 'Soil_Type15', 'Soil_Type7', and 'Wilderness_Area3' features dropped, 'EHiElv', 'soil_type_count', and 'wilderness_area_count' features added, and outlier smoothing for 'Hillshade_9am'
- **0.91276**: 'Soil_Type15', 'Soil_Type7', and 'Wilderness_Area1' features dropped, 'EHiElv', 'soil_type_count', and 'wilderness_area_count' features added, and outlier smoothing for 'Hillshade_9am'



As we can see, the results are quite promising.

**Business-class accuracy**. The *'fast1'* model in its basic setup scored *90.6%* accuracy on the test set out of the box, and it could be further tuned to *92.92%* with a bit of additional feature engineering (see above). Such accuracy is very good for real-world ML classification projects (some are released with as little as 70% accuracy, depending on the misclassification toll and its impact on the particular business problem).

**Speed of model training and time to market under affordable resource allocation**. Model training took less than 30 minutes (both on Kaggle and on a local laptop). There was no need to set up any expensive hardware or cloud-based clusters to train the model.

Admittedly, this result still trails some manually tuned models.

However, it demonstrates that **Deep AutoViML** can deliver models of real-world quality without spending too much time and resources on setup and training (in business environments, delivering good-enough results quickly is often better than delivering results of exceptional quality on a very long timeline).