# Introduction

The purpose of this notebook is to:
- Study feature importance on the competition training set, extended with the additional features engineered per the suggestions in https://www.kaggle.com/c/tabular-playground-series-dec-2021/discussion/293612
- Train a basic CatBoost model on the important feature subset identified in that study

The feature importance study follows the flow of the brilliant experiment in https://www.kaggle.com/lordozvlad/tps-dec-fast-feature-importance-with-sklearnex by @lordozvlad. The essence of the flow is:

- Estimating feature importance with the permutation importance method, as implemented in the ELI5 package, with a random forest as the model used in the permutation rounds
- Accelerating scikit-learn-based operations with Intel's extension for scikit-learn (sklearnex)

The CatBoost model is not fully tuned; its purpose is to show how accurate (or inaccurate) the predictions are when only the important feature subset is used.


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import datetime as dt
from typing import Tuple, List, Dict

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
import plotly.offline


# read data
in_kaggle = True

def get_data_file_path(is_in_kaggle: bool) -> Tuple[str, str, str]:
    train_path = ''
    test_path = ''
    sample_submission_path = ''

    if is_in_kaggle:
        # running in Kaggle, inside the competition
        train_path = '../input/tabular-playground-series-dec-2021/train.csv'
        test_path = '../input/tabular-playground-series-dec-2021/test.csv'
        sample_submission_path = '../input/tabular-playground-series-dec-2021/sample_submission.csv'
    else:
        # running locally
        train_path = 'data/train.csv'
        test_path = 'data/test.csv'
        sample_submission_path = 'data/sample_submission.csv'

    return train_path, test_path, sample_submission_path

In [None]:
# main flow
start_time = dt.datetime.now()
print("Started at ", start_time)

In [None]:
%%time
# get the training set and labels
train_set_path, test_set_path, sample_subm_path = get_data_file_path(in_kaggle)

train = pd.read_csv(train_set_path)
test = pd.read_csv(test_set_path)

subm = pd.read_csv(sample_subm_path)

# Review of Class Labels in Training

In [None]:
# Check the class counts; classes that turn out to be too rare are dropped below
target = 'Cover_Type'
train[target].value_counts()

As we can see, classes 4 and 5 are quite rare in the training set, so we exclude them from model training by dropping the observations with those labels. As a consequence, the model will never predict either of these two classes.

In [None]:
# boolean mask of the observations that belong to the rare classes
rare_mask = train[target].isin([4, 5])
print('rows dropped = ', rare_mask.sum())
train = train[~rare_mask]
print(train.shape)
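
The same filtering can be written without hard-coding the labels, by deriving the rare classes from the counts. The sketch below is only an illustration on a made-up toy frame; `min_count` is a hypothetical threshold, not a value used in this notebook.

```python
import pandas as pd

# Toy stand-in for the competition frame; the values are made up.
toy = pd.DataFrame({"Cover_Type": [1, 1, 1, 2, 2, 2, 3, 3, 4, 5]})

min_count = 2  # hypothetical threshold, chosen only for this illustration

# Labels whose frequency falls below the threshold
counts = toy["Cover_Type"].value_counts()
rare_labels = counts[counts < min_count].index

# Keep only the observations whose label is not rare
filtered = toy[~toy["Cover_Type"].isin(rare_labels)]
print(sorted(filtered["Cover_Type"].unique()))
```

With a threshold, the list of dropped classes adapts automatically if the data changes, at the price of one more parameter to choose.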

# Additional Feature Engineering and Data Preprocessing

In [None]:
%%time
# remove useless features: the zero-variance soil types and the row identifier 'Id'
zero_variance_features = [ 'Soil_Type7', 'Soil_Type15', 'Id']

train = train.drop(zero_variance_features, axis=1)
test = test.drop(zero_variance_features, axis=1)

# extra feature engineering
def r(x):
    if x+180>360:
        return x-180
    else:
        return x+180

def fe(df):
    
    features_Hillshade = ['Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm']
    
    df['EHiElv'] = df['Horizontal_Distance_To_Roadways'] * df['Elevation']
    df['EViElv'] = df['Vertical_Distance_To_Hydrology'] * df['Elevation']
    df['Aspect2'] = df.Aspect.map(r)
    ### source: https://www.kaggle.com/c/tabular-playground-series-dec-2021/discussion/293373
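    # wrap Aspect into the 0..359 range; .loc avoids chained assignment,
    # which may silently fail to modify df
    df.loc[df["Aspect"] < 0, "Aspect"] += 360
    df.loc[df["Aspect"] > 359, "Aspect"] -= 360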
    df.loc[df["Hillshade_9am"] < 0, "Hillshade_9am"] = 0
    df.loc[df["Hillshade_Noon"] < 0, "Hillshade_Noon"] = 0
    df.loc[df["Hillshade_3pm"] < 0, "Hillshade_3pm"] = 0
    df.loc[df["Hillshade_9am"] > 255, "Hillshade_9am"] = 255
    df.loc[df["Hillshade_Noon"] > 255, "Hillshade_Noon"] = 255
    df.loc[df["Hillshade_3pm"] > 255, "Hillshade_3pm"] = 255
    ########
    df['Highwater'] = (df.Vertical_Distance_To_Hydrology < 0).astype(int)
    df['EVDtH'] = df.Elevation - df.Vertical_Distance_To_Hydrology
    df['EHDtH'] = (df.Elevation - df.Horizontal_Distance_To_Hydrology * 0.2).astype(int)
    df['Euclidean_Distance_to_Hydrolody'] = ((df['Horizontal_Distance_To_Hydrology']**2 + df['Vertical_Distance_To_Hydrology']**2)**0.5).astype(int)
    df['Manhattan_Distance_to_Hydrolody'] = df['Horizontal_Distance_To_Hydrology'] + df['Vertical_Distance_To_Hydrology']
    df['Hydro_Fire_1'] = df['Horizontal_Distance_To_Hydrology'] + df['Horizontal_Distance_To_Fire_Points']
    df['Hydro_Fire_2'] = abs(df['Horizontal_Distance_To_Hydrology'] - df['Horizontal_Distance_To_Fire_Points'])
    df['Hydro_Road_1'] = abs(df['Horizontal_Distance_To_Hydrology'] + df['Horizontal_Distance_To_Roadways'])
    df['Hydro_Road_2'] = abs(df['Horizontal_Distance_To_Hydrology'] - df['Horizontal_Distance_To_Roadways'])
    df['Fire_Road_1'] = abs(df['Horizontal_Distance_To_Fire_Points'] + df['Horizontal_Distance_To_Roadways'])
    df['Fire_Road_2'] = abs(df['Horizontal_Distance_To_Fire_Points'] - df['Horizontal_Distance_To_Roadways'])
    df['Hillshade_3pm_is_zero'] = (df.Hillshade_3pm == 0).astype(int)
    
    df["Hillshade_mean"] = df[features_Hillshade].mean(axis=1).astype(int)
    df['amp_Hillshade'] = (df[features_Hillshade].max(axis=1) - df[features_Hillshade].min(axis=1)).astype(int)
    return df

train = fe(train)
test = fe(test)

# Summed features pointed out by @craigmthomas (https://www.kaggle.com/c/tabular-playground-series-dec-2021/discussion/292823)
soil_features = [x for x in train.columns if x.startswith("Soil_Type")]
wilderness_features = [x for x in train.columns if x.startswith("Wilderness_Area")]

train["soil_type_count"] = train[soil_features].sum(axis=1)
test["soil_type_count"] = test[soil_features].sum(axis=1)

train["wilderness_area_count"] = train[wilderness_features].sum(axis=1)
test["wilderness_area_count"] = test[wilderness_features].sum(axis=1)
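
The element-wise `map(r)` call and the six `.loc` assignments for the hillshade columns can also be expressed with vectorized operations. The following is a minimal sketch on a made-up toy frame, assuming the same rules as in `fe` above; it is not a replacement for the cell, only an illustration of the equivalent NumPy and pandas calls.

```python
import numpy as np
import pandas as pd

def r(x):
    # Same rule as the helper used in the notebook
    return x - 180 if x + 180 > 360 else x + 180

# Toy values chosen to hit the boundaries; they are not competition data.
toy = pd.DataFrame({
    "Aspect": [0, 90, 180, 181, 359],
    "Hillshade_9am": [-5, 100, 260, 0, 255],
})

# Vectorized equivalent of toy["Aspect"].map(r)
aspect2_vectorized = np.where(
    toy["Aspect"] + 180 > 360, toy["Aspect"] - 180, toy["Aspect"] + 180
)
aspect2_mapped = toy["Aspect"].map(r).to_numpy()

# Vectorized equivalent of the two .loc assignments per hillshade column
hillshade_clipped = toy["Hillshade_9am"].clip(lower=0, upper=255)

print(aspect2_vectorized.tolist())
print(hillshade_clipped.tolist())
```

On a frame with millions of rows the vectorized forms typically run much faster than a Python-level `map`.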

In [None]:
%%time
def reduce_memory_usage(df):
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != 'object':
            c_min = df[col].min()
            c_max = df[col].max()
            
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    pass
        else:
            df[col] = df[col].astype('category')
    
    return df

In [None]:
%%time
train = reduce_memory_usage(train)
test  = reduce_memory_usage(test)
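
To see what such downcasting buys, one can compare `memory_usage` before and after. The sketch below uses a made-up toy frame and pandas' built-in `pd.to_numeric(..., downcast="integer")` as a stand-in for the integer branch of `reduce_memory_usage`; it does not reproduce the custom function line by line.

```python
import numpy as np
import pandas as pd

# Toy frame with two int64 columns of different ranges (made-up values)
toy = pd.DataFrame({
    "small": np.arange(0, 100, dtype=np.int64),
    "large": np.arange(0, 100, dtype=np.int64) * 1_000,
})

before = toy.memory_usage(deep=True).sum()

# Downcast each column to the smallest integer type that holds its values
compact = pd.DataFrame(
    {col: pd.to_numeric(toy[col], downcast="integer") for col in toy.columns}
)

after = compact.memory_usage(deep=True).sum()

print(compact.dtypes)
print("bytes before:", before, "bytes after:", after)
```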

# Intel® Extension for Scikit-learn

We are going to install the Intel Extension for Scikit-learn to accelerate the usual scikit-learn routines.

In [None]:
!pip install scikit-learn-intelex -q --progress-bar off > /dev/null 2>&1

Now we are ready to enable the acceleration with just two lines of code:

In [None]:
%%time
from sklearnex import patch_sklearn
patch_sklearn()


# Feature Importance with ELI5

One of the most basic questions we might ask of a model is: What features have the biggest impact on predictions?

This concept is called feature importance.

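There are multiple ways to measure feature importance. In this kernel we consider permutation importance, using the ELI5 library.

ELI5 provides a way to compute feature importances for any black-box estimator by measuring how the score decreases when a feature is not available. It implements permutation feature importance scoring: a feature is made "unavailable" by randomly shuffling its values, which breaks its relationship with the target while keeping the column's distribution intact.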
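
To make the idea concrete, here is a minimal from-scratch sketch of permutation importance in plain NumPy. It is not ELI5's implementation: the data is synthetic, `predict` is a hypothetical stand-in for a fitted model, and the function and parameter names (`permutation_importance`, `n_iter`, `seed`) are chosen only for this illustration.

```python
import numpy as np

data_rng = np.random.default_rng(42)

# Synthetic data: the label depends on column 0 only; column 1 is pure noise.
X_toy = data_rng.normal(size=(1000, 2))
y_toy = (X_toy[:, 0] > 0).astype(int)

def predict(features):
    # Hypothetical "fitted model": it thresholds column 0 and ignores column 1.
    return (features[:, 0] > 0).astype(int)

def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

def permutation_importance(predict_fn, X, y, n_iter=5, seed=0):
    """Mean drop in accuracy after shuffling each column, over n_iter shuffles."""
    rng = np.random.default_rng(seed)
    base_score = accuracy(y, predict_fn(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_iter):
            X_perm = X.copy()
            # Shuffle one column only; all other columns stay untouched
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops.append(base_score - accuracy(y, predict_fn(X_perm)))
        importances[j] = np.mean(drops)
    return importances

importances = permutation_importance(predict, X_toy, y_toy)
print(importances)
```

Shuffling the informative column costs a large share of the accuracy, while shuffling the noise column changes nothing, because the stand-in model never reads it. Note that the model is never retrained: only predictions on the shuffled validation data are recomputed.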

In [None]:
%%time
from sklearn.model_selection import train_test_split
from timeit import default_timer as timer

X, y = train.drop(['Cover_Type'], axis = 1), train['Cover_Type']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.3, random_state = 42)


In [None]:
import eli5
from eli5.sklearn import PermutationImportance
from timeit import default_timer as timer
from sklearn.ensemble import RandomForestClassifier

In [None]:
%%time
timeFirstI  = timer()
modelRF     = RandomForestClassifier(random_state = 42).fit(X_train, y_train)
perm        = PermutationImportance(modelRF, random_state = 42).fit(X_val, y_val)
timeSecondI = timer()

In [None]:
print("Total time with Intel Extension: {} seconds".format(timeSecondI - timeFirstI))

In [None]:
%%time
eli5.show_weights(perm, feature_names = X.columns.tolist())

In [None]:
%%time
pi_features = eli5.explain_weights_df(perm, feature_names = X_train.columns.tolist())
pi_features = pi_features.loc[pi_features['weight'] >= 0.0001]['feature'].tolist()

In [None]:
%%time
#show all important features
pi_features
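
The threshold filter above can be illustrated on a small frame that has the `feature` and `weight` columns the filter relies on. The weights below are made up for the illustration and are not results from this notebook.

```python
import pandas as pd

# Toy frame shaped like the importance table used above (made-up weights)
toy_weights = pd.DataFrame({
    "feature": ["Elevation", "EVDtH", "Soil_Type1", "Soil_Type2"],
    "weight": [0.25, 0.03, 0.00005, -0.0002],
})

threshold = 0.0001  # same cut-off as in the cell above

# Keep the features whose mean score drop reaches the threshold
selected = toy_weights.loc[toy_weights["weight"] >= threshold, "feature"].tolist()
print(selected)
```

A negative weight means that the score went up slightly when the feature was shuffled, which usually indicates that the feature carries no usable signal; such features fall below the threshold as well.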

## Feature importance notes

As we can see, the five most important features are 'Elevation', 'EVDtH', 'soil_type_count', 'EHDtH', and 'Wilderness_Area3'. A few observations:
- 'Elevation' is still the most important feature (as in the dataset with the raw features only)
- Three newly derived features take places 2 to 4
- 'Wilderness_Area3', despite its strong negative correlation with 'Wilderness_Area1', turns out to be one of the top five features for model training


In [None]:
%%time
# subset the training and validation sets with the important features only
X_trainPI = X_train.loc[:, pi_features]
X_valPI   = X_val.loc[:, pi_features]

# CatBoost Prediction with Important Features

In [None]:
test_data = test.loc[:, pi_features]

In [None]:
from catboost import CatBoostClassifier

cat_params = {
    'iterations': 20000,
    'depth': 7,
    'task_type' : 'GPU',
    'l2_leaf_reg': 5,
    'eval_metric': 'Accuracy',
}

cat = CatBoostClassifier(**cat_params)
cat.fit(X_trainPI, y_train, eval_set=(X_valPI, y_val))
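
Since the goal is to judge the accuracy obtained with the important feature subset, it can be handy to compute the validation accuracy explicitly after training. The helper below is a minimal sketch in plain NumPy; `validation_accuracy` is a hypothetical name, and the flattening step assumes that multiclass predictions may arrive as a column vector of shape `(n, 1)`.

```python
import numpy as np

def validation_accuracy(y_true, y_pred):
    # Flatten both inputs so that (n,) labels and (n, 1) predictions line up
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    return float(np.mean(y_true == y_pred))

# Toy check with made-up labels: three of the four predictions are correct
toy_true = np.array([1, 2, 3, 2])
toy_pred = np.array([[1], [2], [3], [1]])
toy_accuracy = validation_accuracy(toy_true, toy_pred)
print(toy_accuracy)
```

In this notebook the call would look like `validation_accuracy(y_val, cat.predict(X_valPI))`.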

In [None]:
# predict
predictions = cat.predict(test_data)
subm['Cover_Type'] = predictions

if in_kaggle:
    submission_path = 'submission.csv'
else:
    submission_path = 'output/catboost_eli5_prediction.csv'

subm.to_csv(submission_path, index = False)

In [None]:
print('We are done. That is all, folks!')
finish_time = dt.datetime.now()
print("Finished at ", finish_time)
elapsed = finish_time - start_time
print("Elapsed time: ", elapsed)

# References

- The basic flow of the feature importance experiment with ELI5, along with the use of the Intel accelerator for scikit-learn, is inherited from the nice notebook at https://www.kaggle.com/lordozvlad/tps-dec-fast-feature-importance-with-sklearnex by @lordozvlad
- The additional feature engineering is implemented per the excellent guideline thread [Feature engineering update thread](https://www.kaggle.com/c/tabular-playground-series-dec-2021/discussion/293612) by @lucamassaron
- The feature importance study from the original competition is exemplified by https://www.kaggle.com/mariannejoyleano/ml-forest-cover-feature-engineering-v01 (please note they used the embedded feature importance of several algorithms, as opposed to the ELI5-backed permutation feature importance method applied in this experiment)


