In this notebook, I compare the synthetic dataset from [Tabular Playground Series - Dec 2021](https://www.kaggle.com/c/tabular-playground-series-dec-2021) against the original (non-synthetic) competition dataset, [Forest Cover Type Prediction](https://www.kaggle.com/c/forest-cover-type-prediction).

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

In [None]:
original = pd.read_csv('../input/forest-cover-type-prediction/train.csv')
synthetic = pd.read_csv('../input/tabular-playground-series-dec-2021/train.csv')

# Shape

In [None]:
print('original', original.shape)
print('synthetic', synthetic.shape)

# Column Names and Data Types

In [None]:
(original.columns == synthetic.columns).all()

In [None]:
(original.dtypes == synthetic.dtypes).all()

As shown above, the column names and data types match between the two datasets.
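The element-wise checks above assume both frames have the same columns in the same order. As a more defensive sketch, the helper below (the name `compare_schemas` and the toy frames are my own, not part of either competition) reports which columns are missing on each side and which shared columns differ in dtype.

```python
import pandas as pd

def compare_schemas(a, b):
    """Return (columns only in a, columns only in b, {shared column: (dtype in a, dtype in b)})."""
    only_a = [c for c in a.columns if c not in b.columns]
    only_b = [c for c in b.columns if c not in a.columns]
    shared = [c for c in a.columns if c in b.columns]
    dtype_diff = {c: (a[c].dtype, b[c].dtype)
                  for c in shared if a[c].dtype != b[c].dtype}
    return only_a, only_b, dtype_diff

# Toy frames for illustration only (hypothetical data)
toy_a = pd.DataFrame({'Id': [1], 'Aspect': [10]})
toy_b = pd.DataFrame({'Id': [1], 'Aspect': [10.5], 'Extra': [0]})
only_a, only_b, dtype_diff = compare_schemas(toy_a, toy_b)
print(only_a, only_b, dtype_diff)
```

On the real data this would be called as `compare_schemas(original, synthetic)`; three empty results mean the schemas match.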

# Target Label/Class Distribution

In [None]:
target = 'Cover_Type'
def plot(df, name):
    count = df[target].value_counts().sort_index()
    plt.ticklabel_format(useOffset=False, style='plain')
    plt.bar(count.index, count)
    plt.xlabel('label/class')
    plt.title(name)
    plt.show()
plot(synthetic, 'synthetic')
plot(original, 'original')

The class distribution of the synthetic dataset is imbalanced, whereas the original dataset is balanced.

### Leaked Test Labels in the Original Competition

In the original competition, the **leaked test labels** have an imbalanced distribution (almost half of them are class 2, which resembles the train set of the synthetic dataset).

In the original competition, a classifier would normally have no access to information about the class imbalance in the test set, so it could not exploit that imbalance when making predictions.

In contrast, the synthetic training set does contain information about the class imbalance. A classifier should be able to benefit from it ***under the assumption*** that the synthetic test set is imbalanced in a similar way.
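To get a feeling for how much a classifier could gain from the imbalance alone, here is a minimal sketch of a majority-class baseline. The helper name `majority_baseline` and the toy labels are assumptions for illustration; nothing here comes from the competition data.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    labels = list(labels)
    label, n = Counter(labels).most_common(1)[0]
    return label, n / len(labels)

# Toy labels for illustration only (hypothetical data)
toy_labels = [2, 2, 2, 1, 3, 2]
label, share = majority_baseline(toy_labels)
print(label, round(share, 3))
```

On the real data this would be `majority_baseline(synthetic[target])`. For a balanced dataset the baseline is close to one divided by the number of classes, so any lift above that on the synthetic data comes purely from the class prior.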

# Aspect

In [None]:
pd.DataFrame(dict(
    synthetic = synthetic.Aspect.describe(),
    original = original.Aspect.describe()
)).round(0)

The original Aspect range is [0, 360] degrees, but the synthetic range is [-33, 407].
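One possible reading of the out-of-range values is that they are angles that went past a full turn. That is only an assumption on my side, and the data does not confirm it. Under that assumption, a modulo would map them back onto the circle, as in this small sketch (the helper name `wrap_degrees` is hypothetical):

```python
def wrap_degrees(angle):
    """Map any angle in degrees onto [0, 360); Python's % keeps the result non-negative."""
    return angle % 360

for a in (-33, 0, 359, 360, 407):
    print(a, '->', wrap_degrees(a))
```

Note that this maps 360 to 0, so the two ends of the original range collapse into one value. On a DataFrame the same idea would be `synthetic.Aspect % 360`.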

In [None]:
def plot(df, name):
    df.Aspect.value_counts().sort_index().plot(figsize=(11,5))
    plt.title(name)
    plt.xlabel('Aspect')
    plt.show()
plot(original, 'original')
plot(synthetic, 'synthetic')

Please note that the original curve looks shaky while the synthetic one looks smoother. This is simply because the synthetic dataset has a much larger number of samples.

If we smooth out the shakiness, both curves follow a similar shape that resembles a [sinusoid](https://en.wikipedia.org/wiki/Sine_wave). I would say that the original one definitely forms a sine wave, but I am not sure about the synthetic one.

I would also like to point out the sharp spike at `Aspect=0` in the synthetic dataset, which seems pretty strange.
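As a rough sketch of what "smoothing out the shakiness" could look like, a centered rolling mean over the per-degree counts is one option. The helper name `smoothed_counts`, the window size, and the toy values are my own choices. The window does not wrap around at 0/360, so the two ends are smoothed less.

```python
import pandas as pd

def smoothed_counts(values, window=15):
    """Count occurrences per value, sort by value, and apply a centered rolling mean."""
    counts = pd.Series(values).value_counts().sort_index()
    return counts.rolling(window, center=True, min_periods=1).mean()

# Toy values for illustration only (hypothetical data)
toy = smoothed_counts([0, 0, 1, 2, 2, 2], window=3)
print(toy.tolist())
```

On the real data this would be `smoothed_counts(original.Aspect).plot(figsize=(11,5))`.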

## Periodicity of Aspect

I would like to run an experiment to test the following:
1. Samples with an Aspect of 1 degree closely resemble samples with an Aspect of 359 degrees.
2. Samples with an Aspect of 90 degrees are very different from samples with an Aspect of 270 degrees.

A classifier should have difficulty separating the first pair (because of the similarity), but not the second.
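The intuition behind the two cases can be sketched with plain trigonometry: on a circle, 1 and 359 degrees are only 2 degrees apart, while 90 and 270 degrees are opposite. The helper names below are hypothetical and only illustrate the idea.

```python
import math

def circular_distance(a, b):
    """Smallest angle in degrees between two directions on a circle."""
    d = abs(a - b) % 360
    return min(d, 360 - d)

def aspect_to_xy(deg):
    """Encode an angle as a point on the unit circle (cos, sin)."""
    rad = math.radians(deg)
    return math.cos(rad), math.sin(rad)

print(circular_distance(1, 359))   # neighbours across the 0/360 seam
print(circular_distance(90, 270))  # east versus west
print(aspect_to_xy(1), aspect_to_xy(359))
```

The raw numbers 1 and 359 look far apart, but their `(cos, sin)` encodings are nearly identical, which is why the first case should be hard for a classifier if the data respects the periodicity.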

In [None]:
x = synthetic.copy()
for cname in ['Elevation', 'Aspect', 'Slope',
       'Horizontal_Distance_To_Hydrology', 'Vertical_Distance_To_Hydrology',
       'Horizontal_Distance_To_Roadways', 'Hillshade_9am', 'Hillshade_Noon',
       'Hillshade_3pm', 'Horizontal_Distance_To_Fire_Points']:
    before = len(x)
    x = x[x[cname].between(
        original[cname].min(),
        original[cname].max(),
    )]
    after = len(x)
    print(cname, after-before, before, after, sep='\t')
def clean_one_hot(x, cnames):
    # Keep only the rows where exactly one column of the one-hot group is set.
    cnt = x[cnames].sum(axis=1)
    before = len(x)
    x = x[cnt==1]
    after = len(x)
    print(after-before, before, after, sep='\t')
    return x
# Pass x explicitly: a default argument `x=x` is bound once, at definition time,
# so a second call would silently discard the Wilderness_Area filtering.
x = clean_one_hot(x, [
    'Wilderness_Area1', 'Wilderness_Area2', 'Wilderness_Area3', 'Wilderness_Area4'
])
x = clean_one_hot(x, [f'Soil_Type{i}' for i in range(1, 41)])
synthetic = x
del x

### Does an Aspect of 1 degree closely resemble 359 degrees?

A classifier should have difficulty separating samples near 1 degree from samples near 359 degrees (due to their similarity).

#### Original Dataset

In [None]:
high = original.Aspect.between(340,359)
low = original.Aspect.between(1,20)
high = original[high].copy()
high['isHigh'] = 1
low = original[low].copy()
low['isHigh'] = 0
x = pd.concat([high, low])
y = x.pop('isHigh')
print('total sample', len(x))
print('target class balance', y.mean())
x.pop('Aspect') # important, it's too easy if the classifier can see the Aspect directly
x.pop('Id')
x = x.drop(columns=['Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm']) # explained later
x = x.to_numpy()
y = y.to_numpy()
train, val = train_test_split(np.arange(len(x)), stratify=y, random_state=0)
est = RandomForestClassifier(random_state=0)
est.fit(x[train], y[train])
print('train score', accuracy_score(y[train], est.predict(x[train])))
print('val score', accuracy_score(y[val], est.predict(x[val])))

With a validation accuracy of only 57%, the classifier is clearly having a hard time separating Aspect in [1, 20] from Aspect in [340, 359], because those Aspect values are close to each other on the circle.

**Why do I drop the `Hillshade` columns?**

I believe Hillshade is strongly correlated with Aspect. For example, if your house faces East (`Aspect=90`), it will be brighter in the morning than houses facing West (`Aspect=270`), which will be darker in the morning. A large Hillshade value means brighter, and a small Hillshade value means darker.

Let's feed the Hillshade columns back into our classifier to test this.

In [None]:
high = original.Aspect.between(340,359)
low = original.Aspect.between(1,20)
high = original[high].copy()
high['isHigh'] = 1
low = original[low].copy()
low['isHigh'] = 0
x = pd.concat([high, low])
y = x.pop('isHigh')
print('total sample', len(x))
print('target class balance', y.mean())
x.pop('Aspect') # important, it's too easy if the classifier can see the Aspect directly
x.pop('Id')
# Hillshade data are not dropped this time
x = x.to_numpy()
y = y.to_numpy()
train, val = train_test_split(np.arange(len(x)), stratify=y, random_state=0)
est = RandomForestClassifier(random_state=0)
est.fit(x[train], y[train])
print('train score', accuracy_score(y[train], est.predict(x[train])))
print('val score', accuracy_score(y[val], est.predict(x[val])))

The validation accuracy improves from 57% to 98%, so Hillshade is strongly related to Aspect ... ***at least in the original dataset***.
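A more direct way to look at this relation, without a classifier, would be to correlate a Hillshade column with the sine and cosine of Aspect. This is only a sketch: the helper name `aspect_correlation` and the toy morning-light signal are my own assumptions, not values from the dataset.

```python
import numpy as np

def aspect_correlation(aspect_deg, values):
    """Pearson correlation of `values` with sin(Aspect) and with cos(Aspect)."""
    rad = np.radians(np.asarray(aspect_deg, dtype=float))
    values = np.asarray(values, dtype=float)
    sin_corr = np.corrcoef(np.sin(rad), values)[0, 1]
    cos_corr = np.corrcoef(np.cos(rad), values)[0, 1]
    return sin_corr, cos_corr

# Toy data for illustration only: a made-up signal that peaks for east-facing slopes
toy_aspect = np.arange(0, 360, 10)
toy_hillshade = 200 + 40 * np.sin(np.radians(toy_aspect))
sin_corr, cos_corr = aspect_correlation(toy_aspect, toy_hillshade)
print(round(sin_corr, 3), round(cos_corr, 3))
```

On the real data this would be `aspect_correlation(original.Aspect, original.Hillshade_9am)`, and the same call on `synthetic` allows a side-by-side comparison.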

#### Synthetic Dataset

In [None]:
high = synthetic.Aspect.between(340,359)
low = synthetic.Aspect.between(1,20)
high = synthetic[high].sample(999, random_state=0).copy()
high['isHigh'] = 1
low = synthetic[low].sample(999, random_state=0).copy()
low['isHigh'] = 0
x = pd.concat([high, low])
y = x.pop('isHigh')
x.pop('Aspect') # important, it's too easy if the classifier can see the Aspect directly
x.pop('Id')
x = x.drop(columns=['Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm']) # explained above
x = x.to_numpy()
y = y.to_numpy()
print('total sample', len(x))
print('target class balance', y.mean())
train, val = train_test_split(np.arange(len(x)), stratify=y, random_state=0)
est = RandomForestClassifier(random_state=0)
est.fit(x[train], y[train])
print('train score', accuracy_score(y[train], est.predict(x[train])))
print('val score', accuracy_score(y[val], est.predict(x[val])))

The accuracy is only 52%, which is pretty similar to our finding on the original data: the classifier is having a hard time distinguishing similar Aspect values (Aspect in [1, 20] versus Aspect in [340, 359]) ... **but let's try to use the Hillshade data now**.

In [None]:
high = synthetic.Aspect.between(340,359)
low = synthetic.Aspect.between(1,20)
high = synthetic[high].sample(999, random_state=0).copy()
high['isHigh'] = 1
low = synthetic[low].sample(999, random_state=0).copy()
low['isHigh'] = 0
x = pd.concat([high, low])
y = x.pop('isHigh')
x.pop('Aspect') # important, it's too easy if the classifier can see the Aspect directly
x.pop('Id')
# Hillshade data are not dropped this time
x = x.to_numpy()
y = y.to_numpy()
print('total sample', len(x))
print('target class balance', y.mean())
train, val = train_test_split(np.arange(len(x)), stratify=y, random_state=0)
est = RandomForestClassifier(random_state=0)
est.fit(x[train], y[train])
print('train score', accuracy_score(y[train], est.predict(x[train])))
print('val score', accuracy_score(y[val], est.predict(x[val])))


Now this is funny. The validation accuracy stays around 52%, while on the original dataset it improves drastically from 57% to 98%. Hence my conclusion is:

**In the original dataset, Aspect is strongly-correlated with Hillshade. But this doesn't hold true for the synthetic dataset.**
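The experiment cells above repeat the same steps with different ranges and columns. As a sketch only, they could be folded into one helper. The name `separability` and its parameters are hypothetical, and the toy frame is made up so that the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def separability(df, range_a, range_b, drop=(), n=None, seed=0):
    """Validation accuracy of a random forest telling Aspect range_a (label 1) from range_b (label 0)."""
    a = df[df['Aspect'].between(*range_a)]
    b = df[df['Aspect'].between(*range_b)]
    if n is not None:  # optional down-sampling of each group
        a = a.sample(min(n, len(a)), random_state=seed)
        b = b.sample(min(n, len(b)), random_state=seed)
    # Aspect and Id are always hidden from the classifier; `drop` hides extra columns
    x = pd.concat([a, b]).drop(columns=['Aspect', 'Id', *drop], errors='ignore')
    y = np.concatenate([np.ones(len(a), dtype=int), np.zeros(len(b), dtype=int)])
    x_tr, x_val, y_tr, y_val = train_test_split(
        x.to_numpy(), y, stratify=y, random_state=seed)
    est = RandomForestClassifier(random_state=seed).fit(x_tr, y_tr)
    return accuracy_score(y_val, est.predict(x_val))

# Toy frame for illustration only: Hillshade_9am fully reveals the side of the circle
toy = pd.DataFrame({'Id': range(360), 'Aspect': range(360)})
toy['Hillshade_9am'] = (toy['Aspect'] > 180) * 100
toy['Elevation'] = 2000
hillshade = ['Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm']
with_hillshade = separability(toy, (340, 359), (1, 20))
without_hillshade = separability(toy, (340, 359), (1, 20), drop=hillshade)
print(with_hillshade, without_hillshade)
```

With such a helper, each experiment becomes a single call, for example `separability(synthetic, (340, 359), (1, 20), drop=hillshade, n=999)`.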

### Is an Aspect of 90 degrees very different from 270 degrees?

A classifier should be able to easily distinguish samples near 90 degrees from samples near 270 degrees, since facing East differs a lot from facing West.

#### Original Dataset

In [None]:
east = original.Aspect.between(70,110)
west = original.Aspect.between(260,280)
east = original[east].sample(500, random_state=0).copy()
east['isEast'] = 1
west = original[west].copy()
west['isEast'] = 0
x = pd.concat([east, west])
y = x.pop('isEast')
print('total sample', len(x))
print('target class balance', y.mean())
x.pop('Aspect') # important, it's too easy if the classifier can see the Aspect directly
x.pop('Id')
x = x.drop(columns=['Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm']) # explained above
x = x.to_numpy()
y = y.to_numpy()
train, val = train_test_split(np.arange(len(x)), stratify=y, random_state=0)
est = RandomForestClassifier(random_state=0)
est.fit(x[train], y[train])
print('train score', accuracy_score(y[train], est.predict(x[train])))
print('val score', accuracy_score(y[val], est.predict(x[val])))

I would say that a validation score of 85% is pretty good. The classifier can easily separate the samples facing East from the samples facing West.

#### Synthetic Dataset

In [None]:
east = synthetic.Aspect.between(70,110)
west = synthetic.Aspect.between(260,280)
east = synthetic[east].sample(500, random_state=0).copy()
east['isEast'] = 1
west = synthetic[west].sample(500, random_state=0).copy()
west['isEast'] = 0
x = pd.concat([east, west])
y = x.pop('isEast')
print('total sample', len(x))
print('target class balance', y.mean())
x.pop('Aspect') # important, it's too easy if the classifier can see the Aspect directly
x.pop('Id')
x = x.drop(columns=['Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm']) # explained above
x = x.to_numpy()
y = y.to_numpy()
train, val = train_test_split(np.arange(len(x)), stratify=y, random_state=0)
est = RandomForestClassifier(random_state=0)
est.fit(x[train], y[train])
print('train score', accuracy_score(y[train], est.predict(x[train])))
print('val score', accuracy_score(y[val], est.predict(x[val])))

Now this is funny. An accuracy of only 48%, which is no better than chance for two equally sized groups, indicates that the classifier is having difficulty separating the samples facing East from the samples facing West.

We get 85% accuracy on the original dataset, but only 48% on the synthetic dataset.

## Take Away

The `Aspect` column in the synthetic dataset is quite different from what it is in the real data (the original dataset). It seems the generation of the synthetic data was not able to properly reproduce the periodicity of this feature, and it also failed to reproduce the correlation between features.