# Introduction

In this notebook, I attempt to exploit the sparsity of the binary features for soil type and wilderness area. This dataset is a synthetic version of the well-known [forest covertype dataset](https://archive.ics.uci.edu/ml/datasets/covertype). In the original dataset, the soil type and wilderness area features are one-hot encodings of the categorical variables `Soil_Type` and `Wilderness_Area`. In the synthetic version, one-hotness has been lost, but the columns are still relatively sparse. Can we still exploit that sparsity to benefit classifier training, especially for tree-based classifiers?

# Setting up data and baseline experiment

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
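# use the fully qualified option name; the bare 'max_columns' pattern is ambiguous in recent pandas
pd.set_option('display.max_columns', None)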

In [None]:
train_data =  pd.read_csv('../input/tabular-playground-series-dec-2021/train.csv')

In [None]:
def optimize_ints(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast integer columns to the smallest integer dtype that holds their values."""
    ints = df.select_dtypes(include=['int']).columns.tolist()
    df[ints] = df[ints].apply(pd.to_numeric, downcast='integer')
    return df

def optimize_floats(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast float columns to the smallest float dtype that holds their values."""
    floats = df.select_dtypes(include=['float']).columns.tolist()
    df[floats] = df[floats].apply(pd.to_numeric, downcast='float')
    return df

In [None]:
train_data = optimize_ints(train_data)

In [None]:
X = train_data.drop('Cover_Type',axis=1).set_index('Id')
y = train_data.set_index('Id').Cover_Type  # index y by Id too, so row labels match X

In [None]:
# remove constant features
to_remove = []
summary = X.describe()
for c in X.columns:
    if summary.loc['std',c]==0:
        to_remove.append(c)
        
print(to_remove)

X=X.drop(to_remove,axis=1)

In [None]:
# drop classes with at most one sample; a stratified split needs at least two per class
to_remove = []
for c in y.unique():
    indices = list(y[y==c].index)
    if len(indices) <= 1:
        to_remove += indices
        
print(to_remove)

X=X.drop(to_remove,axis=0)
y=y.drop(to_remove,axis=0)

In [None]:
from sklearn.preprocessing import LabelEncoder
y = LabelEncoder().fit_transform(y)

We note in particular memory usage of this baseline dataframe.

In [None]:
X.info(memory_usage='deep')

The whole dataframe consists of integers. We now check the smallest value in the dataframe, because we will later need an unused integer to represent a missing value.

In [None]:
X.min(axis=0).min()

Next we set up XGBoost to obtain a baseline for training time and quality (accuracy). Nothing groundbreaking here: we split the data into training and test sets and use some possibly non-optimal hyperparameters to illustrate the approach.

In [None]:
import xgboost as xgb
from xgboost import XGBClassifier

xgb.__version__

In [None]:
params = {"max_depth": 6,
          'subsample': 0.1,
          "colsample_bytree": 0.25,
          'learning_rate': 0.05,
          'min_child_weight': 0
         }
params['verbosity'] = 2
params['tree_method'] = 'gpu_hist'
params['predictor'] = 'gpu_predictor'
params['sampling_method'] = 'gradient_based'
params['n_jobs'] = -1
params['random_state']=42
params['n_estimators'] = 5000

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,shuffle=True,random_state=42,stratify=y)

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
%%time
clf = XGBClassifier(**params, use_label_encoder=False).fit(X_train,y_train, eval_metric=['merror','mlogloss'],
                                                           eval_set=[(X_train,y_train),(X_test,y_test)],verbose=1000)

In [None]:
%%time
print(F'Train accuracy: {accuracy_score(y_train,clf.predict(X_train))}')
print(F'Test accuracy: {accuracy_score(y_test,clf.predict(X_test))}')

# Sparse representation of dataframe

Let us first investigate how sparse the columns really are. 

In [None]:
sparse_cols = [c for c in X.columns if c.startswith('Soil_Type') or c.startswith('Wilderness_Area')]
X[sparse_cols].mean(axis=0)

As expected, soil type columns are sparse; wilderness area columns, not so much.
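
Column means show how sparse each column is, but not how far a group of columns has drifted from one-hot encoding. A minimal sketch of such a check, on a hypothetical toy frame (`toy` and its values are made up for illustration and are not taken from the competition data), counts the active flags per row:

```python
import pandas as pd

# Hypothetical toy frame with three soil-type flags (illustrative values only)
toy = pd.DataFrame({
    "Soil_Type1": [1, 0, 0, 1],
    "Soil_Type2": [0, 1, 0, 1],
    "Soil_Type3": [0, 0, 0, 0],
})

# Number of active flags per row; a true one-hot encoding has exactly one
active = toy.sum(axis=1)

print(active.value_counts().sort_index())  # distribution of active-flag counts
print((active == 1).mean())                # fraction of rows that are still one-hot
```

Applied to `X[sparse_cols]`, restricted to either the soil type or the wilderness area columns, the same two lines would show how many rows have zero, one, or several active flags.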

Now we choose \\(-32768\\) as our token for a "missing value". As checked earlier, this value does not occur in the training set (nor in the test set, which we checked behind the scenes). Using it requires widening the binary columns from `int8` to `int16`, but the additional space is more than compensated for once we convert to a sparse representation.

You may wonder why we need such a token when the dataframe has no missing values. That's correct, but we are going to treat the \\(0\\)'s in the binary columns as "missing". After all, \\(0\\) and \\(1\\) are just names for the two different states.
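
The two properties the sentinel relies on can be sanity-checked in a few lines. This is a sketch on a hypothetical toy frame (`toy`, `smallest`, `sentinel_is_free` and `fits_in_int8` are illustrative names, not part of the notebook): the sentinel must be smaller than every value in the data, and it does not fit in `int8`, which is why the binary columns have to be widened.

```python
import numpy as np
import pandas as pd

MISSING = np.iinfo(np.int16).min  # -32768, the smallest int16 value

# Hypothetical toy frame standing in for X (illustrative values only)
toy = pd.DataFrame({
    "Elevation": np.array([2500, -12, 3100], dtype=np.int16),
    "Soil_Type1": np.array([0, 1, 0], dtype=np.int8),
})

# Same check as above: the smallest value anywhere in the frame
smallest = int(toy.min(axis=0).min())

# The sentinel is safe only if no real value can collide with it
sentinel_is_free = smallest > MISSING

# int8 covers -128..127, so the sentinel needs a wider dtype
fits_in_int8 = np.iinfo(np.int8).min <= MISSING

print(smallest, sentinel_is_free, fits_in_int8)
```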

In [None]:
MISSING = -32768

In [None]:
Xs = X.copy()
Xs[sparse_cols] = Xs[sparse_cols].astype(np.int16) # expand to signed 16-bit integer
Xs[sparse_cols] = Xs[sparse_cols].replace(0,MISSING) # substitute 0's by -32768's
Xs[sparse_cols] = Xs[sparse_cols].astype(pd.SparseDtype(np.int16,MISSING)) # convert columns to sparse arrays

In [None]:
Xs.info(memory_usage='deep')

Everything looks good. Memory usage is actually reduced by almost 50%.
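
To see where the saving comes from, here is a small sketch on a hypothetical column with 1% density (the column is made up for illustration). With a custom fill value, pandas stores only the entries that differ from the fill value, plus their positions:

```python
import numpy as np
import pandas as pd

MISSING = -32768  # same sentinel as above

# Hypothetical toy column: 1,000 rows, of which every 100th is 1
dense = pd.Series(np.zeros(1000, dtype=np.int8))
dense.iloc[::100] = 1

# Same three steps as for Xs: widen, substitute the sentinel, convert to sparse
widened = dense.astype(np.int16).replace(0, MISSING)
sparse = widened.astype(pd.SparseDtype(np.int16, MISSING))

print(sparse.sparse.npoints)   # number of explicitly stored entries
print(sparse.sparse.density)   # fraction of entries that are stored
print(dense.memory_usage(index=False), sparse.memory_usage(index=False))
```

Only the ten entries equal to 1 are stored, so the sparse column is much smaller than even the original `int8` column. The saving shrinks as density grows, which is why the denser wilderness area columns benefit less.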

Now repeat the experiment with this sparse dataframe.

In [None]:
Xs_train,Xs_test,y_train,y_test = train_test_split(Xs,y,test_size=0.25,shuffle=True,random_state=42,stratify=y)

The reason we go through this apparently trivial exercise of replacing \\(0\\)'s with a missing token is that, as explained in the [XGBoost paper](https://arxiv.org/abs/1603.02754), XGBoost has a sparsity-aware algorithm for splitting samples at a node. It learns a "default direction" for samples with missing values and only needs to examine the samples that have a value. Long story short, we expect a speedup in training.
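
The idea can be illustrated with a simplified sketch for a single feature at a single node. This is not XGBoost's implementation: the function name `sparsity_aware_split`, the L2 term `lam`, and the toy gradients are assumptions made for illustration, and the complexity penalty from the paper is omitted. The point is that only rows with a present value are scanned, while missing rows enter through the node totals and are sent to whichever side gives the larger gain.

```python
import numpy as np

def sparsity_aware_split(x, grad, hess, missing, lam=1.0):
    """Sketch: best split of one feature when some entries equal `missing`.

    Returns (gain, threshold, default_direction). Present rows with a value
    <= threshold go left; missing rows follow default_direction.
    """
    G, H = float(grad.sum()), float(hess.sum())  # node totals, assumed known
    parent = G * G / (H + lam)

    present = np.flatnonzero(x != missing)       # only these rows are scanned
    order = present[np.argsort(x[present], kind="stable")]
    g_miss = G - float(grad[present].sum())      # statistics of the missing rows
    h_miss = H - float(hess[present].sum())

    def gain(gl, hl):
        gr, hr = G - gl, H - hl
        return gl * gl / (hl + lam) + gr * gr / (hr + lam) - parent

    # One candidate cut after each run of equal present values (ascending)
    candidates = []
    gl = hl = 0.0
    for pos, i in enumerate(order):
        gl += grad[i]
        hl += hess[i]
        if pos == len(order) - 1 or x[order[pos + 1]] != x[i]:
            candidates.append((float(x[i]), gl, hl))
    # Extra candidate: no present row on the left (all present rows go right)
    candidates.append((-np.inf, 0.0, 0.0))

    best = (0.0, None, None)
    for threshold, g_left, h_left in candidates:
        # Try both default directions for the missing rows
        for direction, g, h in (("right", g_left, h_left),
                                ("left", g_left + g_miss, h_left + h_miss)):
            score = gain(g, h)
            if score > best[0]:
                best = (score, threshold, direction)
    return best

MISSING = -32768
# Toy binary column after the 0 -> MISSING substitution, with made-up gradients
x = np.array([1, MISSING, MISSING, 1, MISSING, MISSING], dtype=np.int16)
grad = np.array([-1.0, 1.0, 1.0, -1.0, 1.0, 1.0])
hess = np.ones(6)

best = sparsity_aware_split(x, grad, hess, MISSING)
print(best)
```

In this toy case only two of the six rows are visited in the scan, and the chosen split separates the present rows from the missing ones. The sparser the column, the fewer rows each scan has to touch.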

In [None]:
%%time
clf = XGBClassifier(**params, use_label_encoder=False, missing=MISSING).fit(Xs_train,y_train, eval_metric=['merror','mlogloss'],
                                                           eval_set=[(Xs_train,y_train),(Xs_test,y_test)],verbose=1000)

It looks like we get more than a 25% speedup in training...

In [None]:
%%time
print(F'Train accuracy: {accuracy_score(y_train,clf.predict(Xs_train))}')
print(F'Test accuracy: {accuracy_score(y_test,clf.predict(Xs_test))}')

... and quality has not degraded.