# Introduction

This kernel explores whether we can reduce the dimensionality of the dataset in a way that tree-based ML algorithms can make use of. In particular, it was inspired by the kernel [TPS122021 Exploiting Sparsity for XGBoost](https://www.kaggle.com/siukeitin/tps122021-exploiting-sparsity-for-xgboost) that was discussed at [https://www.kaggle.com/c/tabular-playground-series-dec-2021/discussion/294808](https://www.kaggle.com/c/tabular-playground-series-dec-2021/discussion/294808).

In this kernel, we'll make use of the fact that the `Soil_Type` fields are binary-valued. Because of this, we can string all 40 of them together and construct a 40-bit integer representation of the `Soil_Type` field. To explain with a simple example, assume for a moment that we have only 8 `Soil_Type` fields:

```
Soil_Type1, Soil_Type2, Soil_Type3, Soil_Type4, Soil_Type5, Soil_Type6, Soil_Type7, Soil_Type8
```

All of the values are either `0` or `1` like so:

```
0, 0, 0, 0, 0, 0, 1, 1
```

If we concatenate them together, with `Soil_Type1` as the most significant bit, and store the result in an integer, we get a value that ranges from 0 to 255. With the example above we would have:

```
binary     decimal
00000011 = 3
```
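
The same packing can be sketched in a few lines of Python. This is only an illustration: `pack_bits` is a hypothetical helper name, and it assumes the first field (`Soil_Type1`) is the most significant bit, as in the example above.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 flags into one integer, first flag = most significant bit."""
    value = 0
    for bit in bits:
        # Make room for the next bit, then OR it into the lowest position
        value = (value << 1) | int(bit)
    return value


example = [0, 0, 0, 0, 0, 0, 1, 1]
packed = pack_bits(example)
print(format(packed, "08b"), "=", packed)
```

With all eight flags set, the result is 255, the upper end of the range mentioned above.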

We can do the same thing with all of the `Soil_Type` features, and generate 40-bit integer values. The benefit here is that we would retain all of the original information within the `Soil_Type` bits, but collapse a 40-dimensional space down to 1. Let's see how this looks.

# Convert To 40-Bit Integer

### Setup Dataframes

In [None]:
!pip install --upgrade scikit-learn

In [None]:
import pandas as pd
import numpy as np
import gc

train = pd.read_csv("../input/tabular-playground-series-dec-2021/train.csv")
test = pd.read_csv("../input/tabular-playground-series-dec-2021/test.csv")

### New Column Definition

Here we'll generate a new field for the `soiltype_label`, and store it in a new column.

In [None]:
train["soiltype_label"] = 0
test["soiltype_label"] = 0

train["soiltype_label"] = train["soiltype_label"].astype(np.int64)
test["soiltype_label"] = test["soiltype_label"].astype(np.int64)

soil_columns = [x for x in train.columns if x.startswith("Soil_Type")]

### Row Transformer

The function below will take a single row of data, and transform the `Soil_Type` fields into a 40-bit integer.

In [None]:
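def make_40_bit_int_from_soiltype(row):
    # Pack the Soil_Type flags into one integer (Soil_Type1 = most significant bit)
    value = 0
    for column in soil_columns:
        # Shift first, then OR in the next bit, so the last column lands in bit 0
        value = (value << 1) | int(row[column])
    return value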

### Apply Transformation

With the function above, all we have to do is `apply` it to our dataframe.

In [None]:
train["soiltype_label"] = train.apply(make_40_bit_int_from_soiltype, axis=1)
print(": Number of unique labels: {:,d}".format(train["soiltype_label"].nunique()))
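
Row-wise `apply` can be slow on a dataset of this size. As an aside, the same packing can be written as a single matrix-vector product. The sketch below uses a small stand-in NumPy array rather than the competition data, the name `pack_rows` is hypothetical, and it assumes the first column is the most significant bit, as in the introduction.

```python
import numpy as np


def pack_rows(bits):
    """Pack each row of a 2-D 0/1 array into one int64 (first column = highest bit)."""
    bits = np.asarray(bits, dtype=np.int64)
    n_bits = bits.shape[1]
    # Weights 2^(n-1), ..., 2^1, 2^0 so that the last column lands in bit 0
    weights = np.int64(1) << np.arange(n_bits - 1, -1, -1, dtype=np.int64)
    return bits @ weights


demo = np.array([
    [0, 0, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 0, 0, 0],
])
print(pack_rows(demo))
```

On the real dataframe, something like `pack_rows(train[soil_columns].to_numpy())` should produce the packing described in the introduction, provided `soil_columns` is ordered from `Soil_Type1` to `Soil_Type40`.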

# Comparing Information Representation Methods

Now that we have a single label representing all of the `Soil_Type` features, we should check to make sure we aren't losing any performance from the new features. First, we'll generate a LightGBM model with all of the `Soil_Type` features as they originally appeared.

In [None]:
# Drop Cover_Type 5, since we only have one example of it
train = train[(train["Cover_Type"] != 5)]

# Retain all of our features
features = [
    'Elevation', 'Aspect', 'Slope', 'Horizontal_Distance_To_Hydrology', 'Vertical_Distance_To_Hydrology',
    'Horizontal_Distance_To_Roadways', 'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm', 
    'Horizontal_Distance_To_Fire_Points', 'Wilderness_Area1', 'Wilderness_Area2', 'Wilderness_Area3',
    'Wilderness_Area4', 'Soil_Type1', 'Soil_Type2', 'Soil_Type3', 'Soil_Type4', 'Soil_Type5', 
    'Soil_Type6', 'Soil_Type7', 'Soil_Type8', 'Soil_Type9', 'Soil_Type10', 'Soil_Type11', 'Soil_Type12',
    'Soil_Type13', 'Soil_Type14', 'Soil_Type15', 'Soil_Type16', 'Soil_Type17', 'Soil_Type18', 
    'Soil_Type19', 'Soil_Type20', 'Soil_Type21', 'Soil_Type22', 'Soil_Type23', 'Soil_Type24',
    'Soil_Type25', 'Soil_Type26', 'Soil_Type27', 'Soil_Type28', 'Soil_Type29', 'Soil_Type30', 
    'Soil_Type31', 'Soil_Type32', 'Soil_Type33', 'Soil_Type34', 'Soil_Type35', 'Soil_Type36',
    'Soil_Type37', 'Soil_Type38', 'Soil_Type39', 'Soil_Type40']

In [None]:
%%time
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report
from lightgbm import LGBMClassifier
from lightgbm import early_stopping
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

target = train["Cover_Type"]
cv_rounds = 3

k_fold = StratifiedKFold(
    n_splits=cv_rounds,
    random_state=2021,
    shuffle=True,
)

train_preds = np.zeros(len(train.index), )
train_probas = np.zeros(len(train.index), )

for fold, (train_index, test_index) in enumerate(k_fold.split(train[features], target)):
    x_train = train[features].iloc[train_index]
    y_train = target.iloc[train_index]

    x_valid = train[features].iloc[test_index]
    y_valid = target.iloc[test_index]

    model = LGBMClassifier(
        random_state=2021,
        n_estimators=2000,
        verbose=-1,
        metric="softmax",
    )
    model.fit(
        x_train,
        y_train,
        eval_set=[(x_valid, y_valid)],
        callbacks=[early_stopping(50, verbose=False)],
    )

    train_oof_preds = model.predict(x_valid)
    train_preds[test_index] = train_oof_preds
    
    print("-- Fold {}:".format(fold+1))
    print("{}".format(classification_report(y_valid, train_oof_preds)))
    print("-- Accuracy: {}".format(accuracy_score(y_valid, train_oof_preds)))

print("-- Overall:")
print("{}".format(classification_report(target, train_preds)))
print("-- Accuracy: {}".format(accuracy_score(target, train_preds)))

train["unmodified_preds"] = train_preds

# Show the confusion matrix
confusion = confusion_matrix(train["Cover_Type"], train["unmodified_preds"])
cover_labels = [1, 2, 3, 4, 6, 7]
fig, ax = plt.subplots(figsize=(15, 15))
ax = sns.heatmap(confusion, annot=True, fmt=",d", xticklabels=cover_labels, yticklabels=cover_labels)
_ = ax.set_title("Confusion Matrix for LGB Classifier (Unmodified Dataset)", fontsize=15)
_ = ax.set_ylabel("Actual Class")
_ = ax.set_xlabel("Predicted Class")

del(train_preds)
del(confusion)
_ = gc.collect()

Now let's see what happens if we drop all of the `Soil_Type` columns and use our `soiltype_label` instead. We'll label encode it first, however, since LightGBM is likely to have a problem with huge 64-bit integer values used as category codes.

In [None]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
train["soiltype_label_encoded"] = encoder.fit_transform(train["soiltype_label"])
train["soiltype_label_encoded"] = train["soiltype_label_encoded"].astype(np.int16)

In [None]:
%%time
target = train["Cover_Type"]
cv_rounds = 3

k_fold = StratifiedKFold(
    n_splits=cv_rounds,
    random_state=2021,
    shuffle=True,
)

features = [x for x in features if not x.startswith("Soil_Type")]
features.insert(0, "soiltype_label_encoded")

train_preds = np.zeros(len(train.index), )
train_probas = np.zeros(len(train.index), )

for fold, (train_index, test_index) in enumerate(k_fold.split(train[features], target)):
    x_train = train[features].iloc[train_index]
    y_train = target.iloc[train_index]

    x_valid = train[features].iloc[test_index]
    y_valid = target.iloc[test_index]

    model = LGBMClassifier(
        random_state=2021,
        n_estimators=2000,
        verbose=-1,
        metric="softmax",
        cat_feature=[0],
    )
    model.fit(
        x_train,
        y_train,
        eval_set=[(x_valid, y_valid)],
        callbacks=[early_stopping(50, verbose=False)],
    )

    train_oof_preds = model.predict(x_valid)
    train_preds[test_index] = train_oof_preds
    
    print("-- Fold {}:".format(fold+1))
    print("{}".format(classification_report(y_valid, train_oof_preds)))
    print("-- Accuracy: {}".format(accuracy_score(y_valid, train_oof_preds)))

print("-- Overall:")
print("{}".format(classification_report(target, train_preds)))
print("-- Accuracy: {}".format(accuracy_score(target, train_preds)))

train["collapsed_preds"] = train_preds

# Show the confusion matrix
confusion = confusion_matrix(train["Cover_Type"], train["collapsed_preds"])
cover_labels = [1, 2, 3, 4, 6, 7]
fig, ax = plt.subplots(figsize=(15, 15))
ax = sns.heatmap(confusion, annot=True, fmt=",d", xticklabels=cover_labels, yticklabels=cover_labels)
_ = ax.set_title("Confusion Matrix for LGB Classifier (Collapsed Dataset)", fontsize=15)
_ = ax.set_ylabel("Actual Class")
_ = ax.set_xlabel("Predicted Class")

del(train_preds)
del(confusion)
_ = gc.collect()

We can compare the two runs to see what happened:

In [None]:
bar, ax = plt.subplots(figsize=(20, 10))
ax = sns.barplot(
    x=["Unmodified", "40-bit Integer"],
    y=[
        float(accuracy_score(target, train["unmodified_preds"])),
        accuracy_score(target, train["collapsed_preds"]),
    ],
)
_ = ax.set_title("Accuracy Score Based on Approach", fontsize=15)
_ = ax.set_xlabel("Approach")
_ = ax.set_ylabel("Accuracy Score")
_ = ax.set(ylim=(0.90, 1.0))
for p in ax.patches:
    height = p.get_height()
    ax.text(
        x=p.get_x()+(p.get_width()/2),
        y=height,
        s="{:.4f}".format(height),
        ha="center"
    )

One general observation can be made here:

1. The 40-bit integer representation performs better, probably due to the dimensionality reduction. We see accuracy improvements in `Cover_Type` classes `4`, `6`, and `7`.

The question, however, is whether we can apply the same integer encoding to the test set and expect to see the same performance.

# Train / Test Differences

The first thing to do is check whether the sets of integer representations in the training and testing sets overlap. The motivation is that certain combinations of binary features may occur in the testing set but not in the training set. This is problematic, since rows that differ in even a single bit are encoded as entirely different numeric values. For example, assume again that we have only 8 `Soil_Type` features, and that the training set and testing set each contain only the following row:

| Data Set | Type 1 | Type 2 | Type 3 | Type 4 | Type 5 | Type 6 | Type 7 | Type 8 | Encoded |
| :------- | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :----: | :-----: |
| Training | 0      | 0      | 0      | 0      | 0      | 0      | 1      | 1      | 3       |
| Testing  | 0      | 0      | 0      | 0      | 0      | 0      | 0      | 1      | 1       |

The problem is that while both the training set and testing set have `Soil_Type8` set to `1`, the training set also has `Soil_Type7` set to `1`. As we can see, the encoded values differ because of that. This results in a loss of information - specifically, the fact that `Soil_Type8` is the same in both datasets. Even worse, if we attempt to label encode the two sets, the default scikit-learn `LabelEncoder` will raise an error because the testing set contains values that were unseen in the original `fit` on the training set (and rightly so, since `LabelEncoder` is only meant to be used to transform `Y` variables, not features).

We can, however, use `OrdinalEncoder` to gauge how large the difference between the two datasets is.
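
The idea behind that check can be sketched without scikit-learn. The snippet below is a plain-Python stand-in, not `OrdinalEncoder`'s actual implementation: it assigns codes to the values seen in a toy training list and maps anything unseen to a sentinel. The helper names, the toy values, and the sentinel `-1` are all assumptions made for illustration.

```python
UNKNOWN = -1  # hypothetical sentinel for values never seen during fitting


def fit_codes(values):
    """Assign consecutive integer codes to the sorted unique values."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


def transform_codes(values, codes):
    """Look up each value's code, falling back to the sentinel when unseen."""
    return [codes.get(value, UNKNOWN) for value in values]


train_labels = [3, 3, 128, 64]   # toy encoded training values
test_labels = [3, 1, 64, 2]      # 1 and 2 never occur in training

codes = fit_codes(train_labels)
encoded_test = transform_codes(test_labels, codes)
mismatched = sum(1 for code in encoded_test if code == UNKNOWN)
print(encoded_test, mismatched)
```

Counting the sentinel values gives the number of testing rows whose encoding never appeared in training.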

### Transform Test Set

The first thing we need to do is apply the same 40-bit integer transformer to the testing set.

In [None]:
test["soiltype_label"] = test.apply(make_40_bit_int_from_soiltype, axis=1)
print(": Number of unique labels: {:,d}".format(test["soiltype_label"].nunique()))

### Check for Mismatches

Let's use the `OrdinalEncoder` and check to see how many rows in the testing set have values we haven't seen before.

In [None]:
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(
    dtype=np.int16,
    handle_unknown="use_encoded_value",
    unknown_value=32766,
)
train["soiltype_label_ordinal"] = encoder.fit_transform(train[["soiltype_label"]])
test["soiltype_label_ordinal"] = encoder.transform(test[["soiltype_label"]])

In [None]:
print(": Number of mismatched rows: {:,d}".format(
    test[(test["soiltype_label_ordinal"] == 32766)].shape[0]
))

Given that there are 1,000,000 rows in the testing set, having 6,843 that are mismatched is not bad (~0.7% of the testing data are impacted by this). However, this competition is all about rare cases, so we should examine if we can improve upon this.

# Encoding Using 5 x 8-bit Integers

Instead of using a single 40-bit integer, we could encode the soil types as a series of five 8-bit integers. By breaking up the encoded values, we may lose less information about soil types, since a single-bit difference now changes only one of the 5 numeric values instead of the whole encoding.
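
To make the split concrete, here is a small sketch in plain Python. `split_into_bytes` is a hypothetical helper, and it assumes the most significant byte holds `Soil_Type1` through `Soil_Type8`.

```python
def split_into_bytes(value, n_bytes=5):
    """Split an integer into n_bytes 8-bit chunks, most significant chunk first."""
    chunks = []
    for index in range(n_bytes - 1, -1, -1):
        # Shift the wanted byte down to the lowest 8 bits, then mask it off
        chunks.append((value >> (8 * index)) & 0xFF)
    return chunks


# Soil_Type1 set (bit 39) together with Soil_Type39 and Soil_Type40 (bits 1 and 0)
value = (1 << 39) | 0b11
print(split_into_bytes(value))
```

Shifting each chunk by a multiple of 8 (32, 24, 16, 8, 0) keeps every result within 0 to 255.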

In [None]:
def make_5_8_bit_ints_from_soiltype(row):
    integer1 = (np.int64(row["soiltype_label"]) & 0xFF00000000) >> 32
    integer2 = (np.int64(row["soiltype_label"]) & 0x00FF000000) >> 24
    integer3 = (np.int64(row["soiltype_label"]) & 0x0000FF0000) >> 16
    integer4 = (np.int64(row["soiltype_label"]) & 0x000000FF00) >> 8
    integer5 = (np.int64(row["soiltype_label"]) & 0x00000000FF)
    return integer1, integer2, integer3, integer4, integer5

### Transform the rows

In [None]:
train[["soiltype_int1", "soiltype_int2", "soiltype_int3", "soiltype_int4", "soiltype_int5"]] = train.apply(make_5_8_bit_ints_from_soiltype, axis=1, result_type="expand")
test[["soiltype_int1", "soiltype_int2", "soiltype_int3", "soiltype_int4", "soiltype_int5"]] = test.apply(make_5_8_bit_ints_from_soiltype, axis=1, result_type="expand")

In [None]:
int_encoder = OrdinalEncoder(
    dtype=np.int16,
    handle_unknown="use_encoded_value",
    unknown_value=32766,
)
train[["soiltype_label_int1", "soiltype_label_int2", "soiltype_label_int3", "soiltype_label_int4", "soiltype_label_int5"]] = int_encoder.fit_transform(train[["soiltype_int1", "soiltype_int2", "soiltype_int3", "soiltype_int4", "soiltype_int5"]])
test[["soiltype_label_int1", "soiltype_label_int2", "soiltype_label_int3", "soiltype_label_int4", "soiltype_label_int5"]] = int_encoder.transform(test[["soiltype_int1", "soiltype_int2", "soiltype_int3", "soiltype_int4", "soiltype_int5"]])

Let's check to see how many rows in our test set are now mismatched from the training set.

In [None]:
def check_for_missing_values(row):
    # Flag the row if any of the five encoded values is the "unknown" sentinel
    columns = ["soiltype_label_int{}".format(i) for i in range(1, 6)]
    return 1 if any(row[column] == 32766 for column in columns) else 0

test["missing_8_bit_value"] = test.apply(check_for_missing_values, axis=1)
print(": Number of mismatched rows: {:,d}".format(
    test["missing_8_bit_value"].sum()
))

We've now reduced the number of mismatches to 33. Let's re-run the LightGBM model and see if there is any accuracy increase with the new encoding, or if the additional dimensions reduce accuracy.

In [None]:
%%time
target = train["Cover_Type"]
cv_rounds = 3

k_fold = StratifiedKFold(
    n_splits=cv_rounds,
    random_state=2021,
    shuffle=True,
)

features.remove("soiltype_label_encoded")
features.insert(0, "soiltype_label_int1")
features.insert(1, "soiltype_label_int2")
features.insert(2, "soiltype_label_int3")
features.insert(3, "soiltype_label_int4")
features.insert(4, "soiltype_label_int5")

train_preds = np.zeros(len(train.index), )
train_probas = np.zeros(len(train.index), )

for fold, (train_index, test_index) in enumerate(k_fold.split(train[features], target)):
    x_train = train[features].iloc[train_index]
    y_train = target.iloc[train_index]

    x_valid = train[features].iloc[test_index]
    y_valid = target.iloc[test_index]

    model = LGBMClassifier(
        random_state=2021,
        n_estimators=2000,
        verbose=-1,
        metric="softmax",
        cat_feature=[0, 1, 2, 3, 4],
    )
    model.fit(
        x_train,
        y_train,
        eval_set=[(x_valid, y_valid)],
        callbacks=[early_stopping(50, verbose=False)],
    )

    train_oof_preds = model.predict(x_valid)
    train_preds[test_index] = train_oof_preds
    
    print("-- Fold {}:".format(fold+1))
    print("{}".format(classification_report(y_valid, train_oof_preds)))
    print("-- Accuracy: {}".format(accuracy_score(y_valid, train_oof_preds)))

print("-- Overall:")
print("{}".format(classification_report(target, train_preds)))
print("-- Accuracy: {}".format(accuracy_score(target, train_preds)))

train["integer_preds"] = train_preds

# Show the confusion matrix
confusion = confusion_matrix(train["Cover_Type"], train["integer_preds"])
cover_labels = [1, 2, 3, 4, 6, 7]
fig, ax = plt.subplots(figsize=(15, 15))
ax = sns.heatmap(confusion, annot=True, fmt=",d", xticklabels=cover_labels, yticklabels=cover_labels)
_ = ax.set_title("Confusion Matrix for LGB Classifier (8-bit Integers)", fontsize=15)
_ = ax.set_ylabel("Actual Class")
_ = ax.set_xlabel("Predicted Class")

del(train_preds)
del(confusion)
_ = gc.collect()

In [None]:
bar, ax = plt.subplots(figsize=(20, 10))
ax = sns.barplot(
    x=["Unmodified", "40-bit Integer", "5 x 8-bit Integers"],
    y=[
        float(accuracy_score(target, train["unmodified_preds"])),
        accuracy_score(target, train["collapsed_preds"]),
        accuracy_score(target, train["integer_preds"]),
    ],
)
_ = ax.set_title("Accuracy Score Based on Approach", fontsize=15)
_ = ax.set_xlabel("Approach")
_ = ax.set_ylabel("Accuracy Score")
_ = ax.set(ylim=(0.90, 1.0))
for p in ax.patches:
    height = p.get_height()
    ax.text(
        x=p.get_x()+(p.get_width()/2),
        y=height,
        s="{:.4f}".format(height),
        ha="center"
    )

The overall accuracy for the 5 x 8-bit integer approach is very slightly lower than that of the 40-bit integer approach. However, only 33 testing rows now contain values that were unseen in training. It is probably worthwhile to go with the 5 x 8-bit integer approach, since only a very small number of testing rows are affected.
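
Why the byte-wise encoding leaves so few unseen values can be illustrated with a toy example: a 40-bit value may be new as a whole even though each of its bytes has already appeared in training. The values and helper names below are made up for illustration.

```python
def to_bytes5(value):
    """Return the five 8-bit chunks of a 40-bit integer, most significant first."""
    return [(value >> shift) & 0xFF for shift in (32, 24, 16, 8, 0)]


train_values = [0x0100000000, 0x0000000003]  # toy training encodings
test_value = 0x0100000003                    # never seen as a whole

# Whole-value check: is the 40-bit code present in training?
seen_whole = test_value in set(train_values)

# Byte-wise check: has each byte position seen this byte value in training?
seen_per_position = [set(column) for column in zip(*(to_bytes5(v) for v in train_values))]
seen_bytes = all(
    byte in seen for byte, seen in zip(to_bytes5(test_value), seen_per_position)
)

print(seen_whole, seen_bytes)
```

Here the combined value is unknown to a single-column encoder, while every one of its five bytes is already known to a per-byte encoder.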

# Conclusions

It appears that we can reduce the overall dimensionality of the dataset significantly by encoding the `Soil_Type` variables as a series of five 8-bit integers. The encoding compresses the data while at the same time providing a boost to a vanilla LightGBM model's accuracy by nearly 0.5%. Presumably, the reduced dimensionality may also provide some benefit to other approaches, as we collapsed a 40-dimensional vector down to a 5-dimensional vector.

If you find this kernel useful, please consider upvoting it!