Initial EDA
==

In [TPS202112 - Parquet](https://www.kaggle.com/kaaveland/tps202112-parquet), I converted the competition CSV files to parquet to facilitate rapid testing and exploring, and we'll build on that dataset here.

In [TPS202112 - XGBoost Baseline](https://www.kaggle.com/kaaveland/tps202112-xgboost-baseline) and [TPS202112 - LGBM feature importance](https://www.kaggle.com/kaaveland/tps202112-lgbm-feature-importance) we established that boosters do pretty well on this problem, easily achieving LB scores of over 95.3% without looking at the data.

In this notebook, we'll try to gain some advantage from... actually looking at the data.

In [None]:
import os
import getpass
if getpass.getuser() == 'root': # kaggle
    %pip install -qU scikit-learn

import random
import seaborn as sns
import plotly.express as px
import pandas as pd
import numpy as np
import torch
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

random.seed(64)
np.random.seed(64)

data_root = os.environ.get('KAGGLE_DIR', '../input')
df = pd.read_parquet(f'{data_root}/tpsdec2021parquet/train.pq')
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=64)
n_jobs = min(cv.n_splits, os.cpu_count())

sns.set(
    style='darkgrid', context='notebook', rc={
        'figure.frameon': False,
        'figure.figsize': (16, 12),
        'legend.frameon': False,
    }
)

tree_method = 'gpu_hist' if torch.cuda.is_available() else 'hist'

df.info()

There are no nulls
==

I already know this, but just to have it out of the way:

In [None]:
df.isna().sum().sum()

Label distribution
==

This data set is wildly imbalanced:

In [None]:
px.bar(df.Cover_Type.value_counts(normalize=True))

It's quite possible that it's a good idea to try to model some of these separately. E.g., if we can get close to 100% accuracy on the question "is it either 1 or 2?", then maybe we could solve the others separately.
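
A minimal sketch of that two-stage idea, assuming a hypothetical helper `split_targets` (not part of this notebook) and a tiny made-up label series in place of `df.Cover_Type`. The first target asks "is it 1 or 2?", and the second keeps the original label only for the remaining rows:

```python
import pandas as pd


def split_targets(cover_type: pd.Series, common=(1, 2)):
    """Return (is_common, rare_labels) for a two-stage setup.

    `split_targets` and `common` are illustrative names, not notebook code.
    """
    is_common = cover_type.isin(common)
    # Labels for the second stage: only rows outside the common classes.
    rare_labels = cover_type[~is_common]
    return is_common, rare_labels


# Toy labels standing in for df.Cover_Type.
labels = pd.Series([1, 2, 2, 3, 7, 1, 4])
is_common, rare_labels = split_targets(labels)
print(is_common.mean())      # share of rows in the common classes
print(rare_labels.tolist())  # labels left for the second model
```

Whether this split actually helps would still have to be checked with cross-validation on the real data.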

Let's start looking at features.

Understanding Wilderness_Area
==

Let's start by looking into `Wilderness_Area`. I think there's a good chance that this is actually a single one-hot
encoded variable. Let's test that theory quickly:

In [None]:
is_wilds = df.columns.str.startswith('Wilderness')
wilds = df.columns[is_wilds]
df[wilds].sum(axis=1).value_counts()

That's patently false -- it's possible for all of these variables to be true at the same time. It's possibly interesting to sum them like this as a feature. Let's take a look at how commonly these are positive:

In [None]:
d = df[df.columns[is_wilds | (df.columns == 'Cover_Type')]]

px.bar(
    df[wilds].mean(), title='mean(variable)'
)

Are they correlated?

In [None]:
px.imshow(df[wilds].corr())


We have pretty strong correlation between `Wilderness_Area1` and `Wilderness_Area3`. Let's check if they can tell us anything about the likely target value:


In [None]:

px.bar(
    d.groupby('Cover_Type', as_index=False).mean().melt(id_vars=['Cover_Type'], value_name='mean(variable)'),
    x='Cover_Type', y='mean(variable)', color='variable', barmode='group'
)


This looks like it might have some value -- to me, there's this particular thing where `Wilderness_Area1` is almost never set outside of `Cover_Type = 1 | 2 | 7` and `Wilderness_Area4` is almost only seen in those cases.

I bet that these alone could give us a reasonable baseline:


In [None]:
tree = DecisionTreeClassifier()

cross_val_score(
    tree, df[wilds], df.Cover_Type, cv=cv, n_jobs=n_jobs
)


Right, beats the dummy classifier:


In [None]:
dummy = DummyClassifier(strategy='prior')

cross_val_score(
    dummy, df[wilds], df.Cover_Type, cv=cv, n_jobs=n_jobs,
)
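
As a sanity check on that baseline: `strategy='prior'` always predicts the most frequent class, so its accuracy should land close to the majority class share. A minimal sketch with made-up labels standing in for `df.Cover_Type`:

```python
import pandas as pd

# Toy labels; in the notebook this would be df.Cover_Type.
labels = pd.Series([2, 2, 2, 1, 1, 3])

shares = labels.value_counts(normalize=True)
majority_class = shares.idxmax()
majority_share = shares.max()

# Always predicting the majority class is right exactly this often.
always_majority_accuracy = (labels == majority_class).mean()
print(majority_class, majority_share, always_majority_accuracy)
```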

Let's check if summing these features could make an extra feature of some value:

In [None]:
cross_val_score(
    tree,
    pd.concat([df[wilds], df[wilds].sum(axis=1).rename('sum')], axis=1),
    df.Cover_Type,
    cv=cv,
    n_jobs=n_jobs
)


Nope, the decision tree can perfectly capture all that's being expressed here without it, so it seems unlikely we'll have much to gain by doing feature engineering on these.
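
One way to see why the sum cannot help a tree: it is a deterministic function of the indicator columns, so it never separates rows that the indicators alone leave together. A small sketch on made-up indicator columns (the names `a`, `b`, `c` and `total` are placeholders, not dataset columns):

```python
import pandas as pd

# Toy stand-in for df[wilds]: three 0/1 indicator columns.
X = pd.DataFrame({
    'a': [1, 0, 1, 0, 1, 0, 1],
    'b': [0, 1, 1, 0, 1, 0, 0],
    'c': [0, 0, 1, 0, 0, 1, 0],
})
X_with_sum = X.assign(total=X.sum(axis=1))

# Number of distinct input patterns a model can tell apart.
patterns_before = len(X.drop_duplicates())
patterns_after = len(X_with_sum.drop_duplicates())
print(patterns_before, patterns_after)
```

The count of distinct patterns is the same with and without the extra column, which is consistent with the unchanged scores above.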

Let's check out the boolean columns that are not `Wilderness_Area`.

Soil_Type
==

There are lots and lots of boolean soil type columns:


In [None]:
is_soiltype = df.columns.str.startswith('Soil_')
soils = df.columns[is_soiltype]

px.bar(df[soils].mean())

These appear to be quite sparse. Could they be some one-hot encoding of some other kind of feature?

In [None]:
px.bar(
    df[soils].sum(axis=1).value_counts(normalize=True)
)


Okay, it's definitely not the case that this is a single one-hot encoding. Let's try to see if there's anything obviously connecting the various features here with the label:


In [None]:
d = df[df.columns[is_soiltype | (df.columns == 'Cover_Type')]].groupby(
    'Cover_Type', as_index=False
).mean().melt(
    id_vars=['Cover_Type'], value_name='mean(variable)'
)

px.bar(
    d, x='variable', y='mean(variable)', facet_col='Cover_Type', facet_col_wrap=4
)


It seems like maybe these could help us separate some of the rarer classes, but these are very sparse in `Cover_Type <= 3`.

Let's see how far a model gets by only looking at these. For this, I'm going to regularize the tree a bit, to prevent it from growing very deep:


In [None]:
cross_val_score(
    DecisionTreeClassifier(max_depth=12), df[soils], df.Cover_Type, cv=cv, n_jobs=n_jobs,
)

Right, on their own, these are not much better than `DummyClassifier`. It's plausible that some other model could find them valuable:

In [None]:
cross_val_score(
    LogisticRegression(), df[soils], df.Cover_Type, cv=cv, n_jobs=n_jobs,
)

Well, no, not really. The decision tree couldn't make much use of them, and neither could a logistic regression. Let's quickly check a booster too:


In [None]:
from xgboost import XGBClassifier

cross_val_score(
    XGBClassifier(tree_method=tree_method), df[soils], df.Cover_Type, cv=cv, n_jobs=1
)


Right, these just aren't very strong features, it would appear.

Continuous features
==

These were considered by far the most important features by the boosters we've already trained, in particular `Elevation` was used very actively.

There aren't that many of these, so we'll check these out one at a time. But first, an overview plot:


In [None]:
conts = df.columns[(df.dtypes == np.float32) | (df.columns == 'Cover_Type')]
conts = df[conts].melt(id_vars=['Cover_Type']).astype({'Cover_Type': 'category'})

sns.displot(
    data=conts, facet_kws={'sharey': False, 'sharex': False}, common_bins=False,
    x='value', hue='Cover_Type', col='variable', col_wrap=3, bins=50,
);

Right, certainly easy to see why Elevation makes a difference. A lot of these don't look like they're normally distributed, so it might be interesting to do some transforms for linear models here.
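
To put a number on "not normally distributed", one option is to compare skewness before and after a transform. The sketch below uses the same shifted log that later cells in this notebook apply, wrapped in a hypothetical helper `shifted_log`, on a synthetic right-skewed feature (not the competition data):

```python
import numpy as np
import pandas as pd


def shifted_log(values: pd.Series) -> pd.Series:
    """log(x - min(x) + 1), which is defined even when x has negative values.

    `shifted_log` is a hypothetical helper name, not part of the notebook.
    """
    return np.log(values - values.min() + 1)


# Synthetic right-skewed feature that includes some negative values.
rng = np.random.default_rng(64)
feature = pd.Series(rng.lognormal(mean=5.0, sigma=1.0, size=10_000) - 20.0)

skew_before = feature.skew()
skew_after = shifted_log(feature).skew()
print(round(skew_before, 2), round(skew_after, 2))
```

A skewness closer to zero suggests the transformed feature is more symmetric, which tends to matter for linear models and not for trees.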

Let's look at them one at a time.

Elevation
==

Elevation is super-important:


In [None]:
sns.displot(df.astype({'Cover_Type': 'category'}), x='Elevation', hue='Cover_Type', kind='ecdf', aspect=2);


It looks like it's going to perfectly separate several of the cover types on its own:


In [None]:
cross_val_score(tree, df[['Elevation']], df.Cover_Type, cv=cv, n_jobs=n_jobs)


Let's recheck the label proportions:


In [None]:
100 * df.Cover_Type.value_counts(normalize=True)


So, this feature alone gets us to 89% accuracy. That is quite strong. This makes it very interesting to check whether the elevation distribution is roughly the same in train and test, so let's do that:


In [None]:
df_test = pd.read_parquet(f'{data_root}/tpsdec2021parquet/test.pq')

sns.displot(
    pd.concat([df[['Elevation']].assign(set='train'), df_test[['Elevation']].assign(set='test')]).reset_index(),
    x='Elevation', row='set', aspect=2, bins=100, facet_kws={'sharey': False}
);


In [None]:
sns.displot(
    pd.concat([df[['Elevation']].assign(set='train'), df_test[['Elevation']].assign(set='test')]).reset_index(),
    x='Elevation', hue='set', aspect=2, kind='ecdf',
);


It seems to me that there are differences: the test set has higher representation at lower Elevation. That probably also means the test set contains a larger share of the low-elevation classes. To see which classes those are, take the mean elevation per class:


In [None]:
df.groupby('Cover_Type')['Elevation'].agg(['mean', 'std'])


Test set probably has a higher density of Cover_Type = 4
==

This finding could be important later on. We can try to make use of this, for example by assigning higher class weight to this Cover_Type.
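
A minimal sketch of what such class weights could look like, assuming a hypothetical helper `balanced_class_weights` and made-up labels. It uses `n_samples / (n_classes * class_count)`, the formula scikit-learn documents for `class_weight='balanced'`, and then boosts `Cover_Type` 4 by an arbitrary factor chosen only for illustration:

```python
import pandas as pd


def balanced_class_weights(labels: pd.Series) -> dict:
    """Weights inversely proportional to class frequency.

    `balanced_class_weights` is a hypothetical helper, not notebook code.
    """
    counts = labels.value_counts()
    n_samples, n_classes = len(labels), len(counts)
    return (n_samples / (n_classes * counts)).to_dict()


# Toy labels standing in for df.Cover_Type.
labels = pd.Series([1, 1, 1, 1, 2, 2, 2, 4])
weights = balanced_class_weights(labels)
# Assumption for illustration: an extra, made-up boost for Cover_Type 4.
weights[4] *= 2.0
print(weights)
```

A dict like this can be passed as `class_weight=weights` to estimators such as `DecisionTreeClassifier` or `LogisticRegression`; the right boost factor would have to be tuned.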


In [None]:
sns.histplot(x=df.loc[df.Cover_Type == 4, 'Elevation'], kde=True);


The train set contains fewer than 1% of these samples, but the test set likely has many more, with its increased density below 2150 elevation.
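
The "increased density below 2150" could be quantified by comparing the share of rows under that cutoff in each set. The sketch below shows the comparison on synthetic stand-ins for `df.Elevation` and `df_test.Elevation`; the generated numbers are made up and say nothing about the real data:

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins; the real comparison would use df and df_test.
rng = np.random.default_rng(64)
train_elevation = pd.Series(rng.normal(2950, 280, size=20_000))
test_elevation = pd.Series(rng.normal(2900, 300, size=20_000))

threshold = 2150  # elevation cutoff mentioned in the text
share_train = (train_elevation < threshold).mean()
share_test = (test_elevation < threshold).mean()
print(f'train: {share_train:.4%}, test: {share_test:.4%}')
```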

Feature transformations for Elevation
--

Not sure what makes sense here, exactly. But I think we can get away with just scaling it, even for linear classifiers:

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

cross_val_score(
    make_pipeline(StandardScaler(), LogisticRegression()),
    df[['Elevation']], df.Cover_Type, cv=cv, n_jobs=n_jobs
)


Well, that's almost as good as the decision tree. What if we take the log, the square root, or the cube root?


In [None]:
cross_val_score(
    make_pipeline(StandardScaler(), LogisticRegression()),
    df[['Elevation']].apply(np.log), df.Cover_Type, cv=cv, n_jobs=n_jobs
)


In [None]:
cross_val_score(
    make_pipeline(StandardScaler(), LogisticRegression()),
    df[['Elevation']].apply(np.sqrt), df.Cover_Type, cv=cv, n_jobs=n_jobs
)

In [None]:
cross_val_score(
    make_pipeline(StandardScaler(), LogisticRegression()),
    df[['Elevation']].apply(np.cbrt), df.Cover_Type, cv=cv, n_jobs=n_jobs
)


We'll keep it in mind that a log transformation here could help linear models, and move on -- hold on, actually, I want to test one thing. I seem to remember `Wilderness_Area4` being very important for `Cover_Type in (4, 6)`. Might that just be a proxy for a certain kind of elevation?


In [None]:
sns.displot(
    x=df.Elevation, hue=df.Wilderness_Area4, kde=True, bins=50, aspect=2
);


`Wilderness_Area4` depends on `Elevation`
--

Okay, bingo. We were getting 60% accuracy by just classifying based on the `Wilderness_Area` variables, was that all `Wilderness_Area4`?


In [None]:
cross_val_score(
    tree,
    df[['Wilderness_Area1', 'Wilderness_Area2', 'Wilderness_Area3']],
    df.Cover_Type, cv=cv, n_jobs=n_jobs
)


No, this is still a lot better than the `DummyClassifier` -- but maybe that's also correlation with `Elevation`. Let's check:

In [None]:
from matplotlib import pyplot as plt

fig, axes = plt.subplots(2, 2, sharex=True)

for i, ax in enumerate(axes.flat, 1):
    sns.histplot(x=df.Elevation, hue=df[f'Wilderness_Area{i}'], kde=True, bins=50, ax=ax).set(title=f'Wilderness_Area{i}');

At a glance, perhaps it's only `Wilderness_Area4`. Using the wilderness features together with elevation results in stronger trees:

In [None]:
cross_val_score(
    tree,
    pd.concat([df[wilds], df[['Elevation']]], axis=1),
    df.Cover_Type, cv=cv, n_jobs=n_jobs
)

And stronger linear models too:

In [None]:
cross_val_score(
    make_pipeline(StandardScaler(), LogisticRegression()),
    pd.concat([df[wilds], df[['Elevation']].apply(np.log)], axis=1),
    df.Cover_Type, cv=cv, n_jobs=n_jobs
)


Still more features to get through, though. Let's move on!

Aspect
==

The aspect is in degrees azimuth; let's review its full distribution:


In [None]:
sns.displot(x=df.Aspect, kde=True, aspect=2, bins=100);


Well, perhaps it used to be in degrees in the original dataset, but this doesn't look exactly like degrees.
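
One quick way to make "doesn't look exactly like degrees" concrete would be to count values outside `[0, 360)`. The sketch below does that on toy values standing in for `df.Aspect`; treating out-of-range values as azimuths that wrapped around is an assumption, not something the data confirms:

```python
import numpy as np
import pandas as pd

# Toy values standing in for df.Aspect, including some outside [0, 360).
aspect = pd.Series([-33.0, 0.0, 45.0, 359.0, 360.0, 405.0])

outside = (aspect < 0) | (aspect >= 360)
share_outside = outside.mean()

# Assumption: out-of-range values are wrapped azimuths, so modulo 360
# maps them back into the usual range.
wrapped = np.mod(aspect, 360)
print(share_outside, wrapped.tolist())
```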

The bins look strange to me, almost as if this feature was categorical in part. How does this feature distribute for the various `Cover_Type` values?

In [None]:
sns.displot(
    x=df.Aspect, col=df.Cover_Type, kde=True, aspect=2, bins=50, col_wrap=4, facet_kws={'sharey': False}
);


To me, it doesn't look like there are massive differences between these, other than in the Cover_Types that have too few samples. Let's check with the tree:


In [None]:
cross_val_score(
    tree, df[['Aspect']], df.Cover_Type, cv=cv, n_jobs=n_jobs
)


That performed as well as the `DummyClassifier` -- on its own, this feature doesn't explain much. Let's check whether the test distribution is about the same as the train distribution:


In [None]:
sns.displot(
    pd.concat([df[['Aspect']].assign(set='train'), df_test[['Aspect']].assign(set='test')]).reset_index(),
    x='Aspect', row='set', bins=50, kde=True, aspect=2, facet_kws={'sharey': False}
);

This feature feels unimportant at this point, and I'm not going to worry about it much.

Slope
==

Also in degrees, let's check it out:


In [None]:
sns.displot(x=df.Slope, aspect=2, kde=True, bins=50);


That looks almost categorical in nature here. And it also looks like something that could benefit from being transformed. Let's see how this distributes with the labels:

In [None]:
sns.displot(
    x=df.Slope, col=df.Cover_Type, kde=True, aspect=2, bins=50, col_wrap=4, facet_kws={'sharey': False}
);

This doesn't look important at first glance. Can the decision tree make any use of it at all?


In [None]:
cross_val_score(
    tree, df[['Slope']], df.Cover_Type, cv=cv, n_jobs=n_jobs
)


No, doesn't look like it. Let's move on.

Horizontal_Distance_To_Hydrology
==

This is distance to the nearest water source:


In [None]:
sns.displot(x=df.Horizontal_Distance_To_Hydrology, kde=True, bins=100, aspect=2);


This has a tail, and we'd probably benefit from doing some sort of transform to it, for our linear models:


In [None]:
sns.displot(x=np.log(df.Horizontal_Distance_To_Hydrology - df.Horizontal_Distance_To_Hydrology.min() + 1), kde=True, bins=100, aspect=2);


Now, in the real world, it makes perfect sense to have 0 distance to Hydrology, but probably not negative distance to it? Does that imply direction?

Anyway, let's check if this explains anything with the labels:


In [None]:
sns.displot(
    x=np.log(df.Horizontal_Distance_To_Hydrology - df.Horizontal_Distance_To_Hydrology.min() + 1),
    kde=True, bins=100, aspect=2, col=df.Cover_Type, col_wrap=4, facet_kws={'sharey': False}
);

This also doesn't seem useful, let's check if the tree can make use of it to beat `DummyClassifier`:


In [None]:
cross_val_score(
    tree, df[['Horizontal_Distance_To_Hydrology']], df.Cover_Type, cv=cv, n_jobs=n_jobs
)


Right, it can't. As a last check, does negative distance mean anything?

In [None]:
df.assign(
    neg_distance=df.Horizontal_Distance_To_Hydrology < 0
).groupby(
    'Cover_Type'
).neg_distance.mean()

Possibly a _very_ weak correlation with Cover_Type = 4, but more likely that is just noise, since we have so few samples of that class. Let's leave this here.

Vertical_Distance_To_Hydrology
==

Also supposed to be meters:

In [None]:
sns.displot(df.Vertical_Distance_To_Hydrology, aspect=2, bins=100, kde=True);


Right, this too can be negative. Let's check out if this impacts the label:


In [None]:
sns.displot(
    x=df.Vertical_Distance_To_Hydrology, col=df.Cover_Type, kde=True, aspect=2, bins=100, col_wrap=4, facet_kws={'sharey': False}
);


At a glance, this doesn't seem so useful either. I'm going to throw the tree at it, and move on:


In [None]:
cross_val_score(
    tree, df[['Vertical_Distance_To_Hydrology']], df.Cover_Type, cv=cv, n_jobs=n_jobs
)


Seems on par with the dummy classifier to me, so we can probably move on.

Horizontal_Distance_To_Roadways
==

This one was rated as important by the boosters we trained earlier. It's supposed to be distance in meters again -- and I suppose it would be a good measurement of how likely something is to be impacted by human activity, such as logging or construction. Let's check this out:


In [None]:
sns.displot(df.Horizontal_Distance_To_Roadways, aspect=2, bins=100, kde=True);


In [None]:
sns.displot(
    x=df.Horizontal_Distance_To_Roadways, col=df.Cover_Type, kde=True, aspect=2, bins=100, col_wrap=4, facet_kws={'sharey': False}
);


Ah, this seems more promising. But, even though these distributions are distinct, they wouldn't be easy to separate. Maybe we need to have this in combination with something else for it to make sense? I don't think we can easily fix this with some numerical trick:

In [None]:
sns.displot(
    x=np.log(1 + df.Horizontal_Distance_To_Roadways - df.Horizontal_Distance_To_Roadways.min()),
    col=df.Cover_Type, kde=True, aspect=2, bins=100, col_wrap=4, facet_kws={'sharey': False}
);

In [None]:
cross_val_score(
    tree, df[['Horizontal_Distance_To_Roadways']], df.Cover_Type, cv=cv, n_jobs=n_jobs
)

That seems not helpful, unfortunately.

Hillshade_*
==

These seem similar enough that I'm going to try to do all three at once. This seems to be some sort of measurement of shade, but I can't work out what the unit is.

Oh well, we can have negative shade again!

In [None]:
shades = df[df.columns[(df.columns == 'Cover_Type') | df.columns.str.startswith('Hillshade')]]
d = shades.melt(id_vars=['Cover_Type'])

sns.displot(
    d, x='value', col='variable', row='Cover_Type', kde=True, bins=100, facet_kws={'sharey': False}
);


None of these seem particularly compelling to me. On their own, they mostly don't seem strongly connected to the label. Can our tree make use of these?


In [None]:
cross_val_score(
    tree, df[df.columns[df.columns.str.startswith('Hillshade')]], df.Cover_Type, cv=cv, n_jobs=n_jobs
)


Well, no, not really.

Horizontal_Distance_To_Fire_Points
==

Also supposed to be meters, let's check it out:

In [None]:
sns.displot(
    x=df.Horizontal_Distance_To_Fire_Points, bins=100, aspect=2, kde=True
);


In [None]:
sns.displot(
    x=df.Horizontal_Distance_To_Fire_Points, bins=100, aspect=2, kde=True, col=df.Cover_Type, col_wrap=4, facet_kws={'sharey': False}
);


On eyeballing it, this doesn't seem too promising either. It does look like maybe we could apply some transformation to this as well, to make it look more centered:


In [None]:
sns.displot(
    x=np.log(1 + df.Horizontal_Distance_To_Fire_Points - df.Horizontal_Distance_To_Fire_Points.min()),
    bins=100, aspect=2, kde=True, col=df.Cover_Type, col_wrap=4, facet_kws={'sharey': False}
);

In [None]:
sns.displot(
    x=np.sqrt(1 + df.Horizontal_Distance_To_Fire_Points - df.Horizontal_Distance_To_Fire_Points.min()),
    bins=100, aspect=2, kde=True, col=df.Cover_Type, col_wrap=4, facet_kws={'sharey': False}
);


But I don't think this makes it a lot more compelling as a feature. Let's try it:


In [None]:
cross_val_score(
    tree, df[['Horizontal_Distance_To_Fire_Points']], df.Cover_Type, cv=cv, n_jobs=n_jobs
)


Again, not beating the dummy classifier.

Feature inspection summary
==

That means we only found very few features that are strong enough to stand on their own -- `Elevation` doing nearly 90% accuracy on its own, and the rest being very underwhelming on their own. If we were to try out some feature engineering, lots of the continuous features could probably benefit from being transformed, if we were using linear models. Other than that, `Elevation` seems most exciting, or perhaps dealing with the major imbalances.

At this point, it would be interesting to see if there's any combination of features that helps, and does not contain `Elevation` -- and looking into different ways of encoding the bools may also be useful. It's probably going to be a lot of feature engineering work to make a good linear model here.
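
The per-feature checks in this notebook all follow the same pattern, so a search over feature combinations without `Elevation` could reuse a small helper like the one below. This is only a sketch: `score_feature_sets`, its arguments, and the toy columns `signal`, `noise_a`, `noise_b` are hypothetical, and the tree depth is an arbitrary choice.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier


def score_feature_sets(frame, target, feature_sets, cv):
    """Mean CV accuracy of a small tree per feature set, plus a dummy baseline.

    `score_feature_sets` and `feature_sets` are hypothetical names.
    """
    y = frame[target]
    # The dummy classifier ignores its inputs, so a constant column is enough.
    placeholder = np.zeros((len(frame), 1))
    scores = {
        'dummy': cross_val_score(
            DummyClassifier(strategy='prior'), placeholder, y, cv=cv
        ).mean()
    }
    for name, columns in feature_sets.items():
        tree = DecisionTreeClassifier(max_depth=4, random_state=64)
        scores[name] = cross_val_score(tree, frame[columns], y, cv=cv).mean()
    return pd.Series(scores).sort_values(ascending=False)


# Toy data: one informative column and two pure-noise columns.
rng = np.random.default_rng(64)
toy = pd.DataFrame({
    'signal': rng.normal(size=600),
    'noise_a': rng.normal(size=600),
    'noise_b': rng.normal(size=600),
})
toy['label'] = (toy['signal'] > 0.5).astype(int)

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=64)
scores = score_feature_sets(
    toy, 'label',
    {'signal': ['signal'], 'noise pair': ['noise_a', 'noise_b']},
    cv,
)
print(scores)
```

On the real data, `feature_sets` could hold combinations of the boolean and continuous columns, with anything that does not clearly beat the `dummy` row dropped from further consideration.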
