# December 2021 Exploratory Data Analysis

Here we have a multi-class classification problem - we wish to predict the "Cover Type" from a number of other attributes. The original dataset can be found [here](https://archive.ics.uci.edu/ml/datasets/Covertype), although the data used here has been extended from the original using a GAN.

Contents:

<a href='#read_in_data'>Read in and check data</a>

<a href='#target_variable'>Target variable</a>

<a href='#binary_variables'>Binary variables</a>

<a href='#continuous_variables'>Continuous variables</a>

In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt, ticker as mtick
import seaborn as sns

## Read in and check data
<a id='read_in_data'></a>

We read in our training data and take a look at a few entries - we can see that we have 56 attributes (including the id and target columns) and 4 million data points.

In [None]:
data = pd.read_csv("../input/tabular-playground-series-dec-2021/train.csv")
print(f"Size of data set is {data.shape}")
data.sample(5)

We do a quick check for any null values. Happily, it appears we don't have any.

In [None]:
data.isnull().sum()
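With 56 columns, the full per-column output is long. A minimal sketch of a more compact check, using a small made-up frame (`toy` and its values are illustrative, not the competition data), keeps only the columns that actually contain nulls:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training data (values are made up).
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [1, 2, 3],
    "c": [np.nan, np.nan, 0.5],
})

null_counts = toy.isnull().sum()

# Keep only the columns with at least one missing value.
cols_with_nulls = null_counts[null_counts > 0]
print(cols_with_nulls)

# A single number is enough to confirm "no nulls anywhere".
total_nulls = int(null_counts.sum())
print(f"Total missing values: {total_nulls}")
```

On the real data the same pattern would be applied to `data`; an empty `cols_with_nulls` corresponds to the "no nulls" result above.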

Finally, we take a brief look at the features themselves. We have 10 continuous variables on varying scales, and two groups of binary variables (wilderness area and soil type). Within each group the indicators are not mutually exclusive: a single data point can have more than one wilderness area or soil type set, or none at all.

In [None]:
data.describe().T
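One way to check whether a group of binary columns is mutually exclusive is to count how many flags are set in each row. The sketch below does this on a small made-up frame; `toy` and the helper `flags_per_row` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Toy frame standing in for the training data (values are made up).
toy = pd.DataFrame({
    "Wilderness_Area1": [1, 0, 1, 0],
    "Wilderness_Area2": [0, 0, 1, 0],
    "Soil_Type1":       [1, 1, 0, 0],
    "Soil_Type2":       [0, 1, 0, 0],
})


def flags_per_row(df, prefix):
    """Count how many indicator columns starting with `prefix` are set in each row."""
    cols = [c for c in df.columns if c.startswith(prefix)]
    return df[cols].sum(axis=1)


wilderness_counts = flags_per_row(toy, "Wilderness_Area")
soil_counts = flags_per_row(toy, "Soil_Type")

# Distribution of "number of flags set" per row: any value other than 1
# means the group is not a clean one-hot encoding.
print(wilderness_counts.value_counts().sort_index())
print(soil_counts.value_counts().sort_index())
```

Applied to `data`, rows with a count of 0 or of 2 and above are the ones that break the one-hot assumption.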

## Target variable
<a id="target_variable"></a>

We first take a quick look at the distribution of our target variable. The vast majority of our data points are concentrated in the first two categories, with over 50% of our data in the second category alone. The remaining five have far fewer observations - categories four and five are particularly sparse, with a single observation in category five.

In [None]:
ax = data["Cover_Type"].value_counts(normalize=True).sort_index().plot(kind="bar")
fig = plt.gcf()
fig.set_size_inches(20, 5)
ax.yaxis.set_major_formatter(mtick.FuncFormatter('{:.0%}'.format))
ax.set_title("Distribution of cover type", fontsize=20);

In [None]:
data["Cover_Type"].value_counts().sort_index()
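Counts and shares can also be viewed side by side, which makes very rare classes easy to flag. This is a minimal sketch on made-up labels; the `min_count` threshold is a hypothetical choice for illustration only.

```python
import pandas as pd

# Made-up labels standing in for data["Cover_Type"].
labels = pd.Series([1, 1, 1, 2, 2, 2, 2, 2, 3, 5], name="Cover_Type")

counts = labels.value_counts().sort_index()
summary = pd.DataFrame({"count": counts, "share": counts / counts.sum()})
print(summary)

# Hypothetical threshold: classes with fewer observations than this are flagged.
min_count = 2
rare_classes = summary.index[summary["count"] < min_count].tolist()
print(f"Classes with fewer than {min_count} observations: {rare_classes}")
```

A class with a single observation cannot appear in both halves of a train/validation split, so it is worth knowing about before setting up cross-validation.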

## Binary variables
<a id='binary_variables'></a>

### Wilderness area

We start by looking at the 4 wilderness areas. We can see that type 3 is the dominant class, followed by type 1. Looking at the distribution within each target category, we see this borne out in categories 1 and 2 (unsurprising, as these make up the vast majority of our observations). Wilderness area 4 looks like an interesting variable, as it accounts for a much larger share of the observations in our minor categories 3, 4 and 6.

In [None]:
wilderness_cols = [f"Wilderness_Area{i}" for i in range(1, 5)]
wilderness_data = data[wilderness_cols + ["Cover_Type"]]
ax = wilderness_data.sum()[wilderness_cols].plot(kind="bar")

fig = plt.gcf()
fig.set_size_inches(20, 5)
ax.set_title("Wilderness areas", fontsize=20);

In [None]:
wilderness_data = wilderness_data.groupby("Cover_Type").sum() 
wilderness_data_norm = wilderness_data.mul(1 / wilderness_data.sum(axis=1), axis=0)  # norm

ax = wilderness_data_norm.plot(kind="bar", stacked=True)
fig = plt.gcf()
fig.set_size_inches(20, 5)
ax.set_title("Cover type by wilderness area", fontsize=20);

In [None]:
wilderness_data
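The normalisation used for the stacked chart can be sanity-checked on a tiny example. The sketch below uses made-up values and `div`, which is equivalent to multiplying by the reciprocal; each row of the result should sum to 1.

```python
import pandas as pd

# Toy frame standing in for the wilderness columns (values are made up).
toy = pd.DataFrame({
    "Cover_Type":       [1, 1, 2, 2, 2, 3],
    "Wilderness_Area1": [1, 1, 1, 0, 0, 0],
    "Wilderness_Area4": [0, 0, 0, 1, 1, 1],
})

# Number of rows with each flag set, per cover type.
by_cover = toy.groupby("Cover_Type").sum()

# Divide each row by its own total so that every cover type sums to 1.
by_cover_norm = by_cover.div(by_cover.sum(axis=1), axis=0)
print(by_cover_norm)
```

Because the indicators are not mutually exclusive, these are shares of the flags set within each cover type rather than shares of rows.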

### Soil types

We perform a similar analysis on the soil types. Splitting each by cover type, it does appear at a glance that some soil types can help identify some of the less common classes, although again with such an imbalanced dataset we must be careful to avoid overfitting. We also note that soil types 7 and 15 do not have any associated observations.

In [None]:
soil_cols = [f"Soil_Type{i}" for i in range(1, 41)]
soil_data = data[soil_cols + ["Cover_Type"]]
ax = soil_data.sum()[soil_cols].plot(kind="bar")

fig = plt.gcf()
fig.set_size_inches(20, 5)
ax.set_title("Soil types", fontsize=20);

In [None]:
soil_data = soil_data.groupby("Cover_Type").sum() 
# Capture the axes returned by this plot so the title below is set on the right chart
ax = soil_data.T.plot(kind="bar", stacked=True)

fig = plt.gcf()
fig.set_size_inches(20, 5)
ax.set_title("Soil types by cover type", fontsize=20);
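Soil types with no observations can also be found programmatically rather than read off the chart. A minimal sketch on a made-up frame (column values are illustrative):

```python
import pandas as pd

# Toy frame standing in for the soil columns (values are made up).
toy = pd.DataFrame({
    "Soil_Type1": [1, 0, 1],
    "Soil_Type2": [0, 0, 0],
    "Soil_Type3": [0, 1, 0],
    "Cover_Type": [1, 2, 2],
})

soil_cols = [c for c in toy.columns if c.startswith("Soil_Type")]
col_totals = toy[soil_cols].sum()

# Indicator columns that are never set carry no information for a model.
empty_cols = col_totals[col_totals == 0].index.tolist()
print(empty_cols)
```

Columns found this way are constant, so they could be dropped before modelling without losing information.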

## Continuous variables
<a id='continuous_variables'></a>

We also take a look at the 10 continuous variables. We note that all 10 are very weakly correlated. For plotting distributions, I have grouped categories 3 - 7 into a single group. The distributions generally look the same across the (now) three categories, although the distribution of elevation differs notably between them. The elevation plot also highlights the potential error of grouping together the last five categories, as there appear to be two distinct groups within them - this is explored further in the final chart (which omits category 5, as it has a single observation).

In [None]:
cts_variables = ["Elevation", "Aspect", "Slope", "Horizontal_Distance_To_Hydrology", "Vertical_Distance_To_Hydrology", 
                 "Horizontal_Distance_To_Roadways", "Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm", "Horizontal_Distance_To_Fire_Points"]

# Capture the axes returned by the heatmap so the title below is set on the right chart
ax = sns.heatmap(data[cts_variables].corr())
fig = plt.gcf()
fig.set_size_inches(15, 10)
ax.set_title("Correlations", fontsize=20);
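A heatmap gives a visual impression; ranking the pairwise correlations by absolute value puts a number on it. The sketch below uses randomly generated data, with one strongly correlated pair built in on purpose so that the ranking has something to find. None of these values come from the competition data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Random toy data; column names are borrowed from the notebook for readability only.
toy = pd.DataFrame(rng.normal(size=(500, 3)), columns=["Elevation", "Aspect", "Slope"])
# Built-in strong negative relationship, purely to demonstrate the ranking.
toy["Hillshade_3pm"] = -toy["Aspect"] + 0.1 * rng.normal(size=500)

corr = toy.corr()

# Keep each pair once: mask everything except the strict upper triangle.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Rank pairs by the magnitude of their correlation.
pairs = pairs.sort_values(key=np.abs, ascending=False)
strongest_pair = pairs.index[0]
print(pairs.head())
```

On the real data, `pairs.abs().max()` would give the single number behind the "weakly correlated" observation.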

In [None]:
data_cts = data.loc[:, cts_variables + ["Cover_Type"]].copy()
# Keep classes 1 and 2 as string labels and group the rest, so the column has a single dtype
data_cts["Cover_Type"] = data_cts["Cover_Type"].map(lambda x: str(x) if x in (1, 2) else "other")

fig, axs = plt.subplots(5, 2)
for i, col in enumerate(cts_variables):
    ax = axs[i // 2, i % 2]
    data_to_plot = data_cts[[col, "Cover_Type"]]
    sns.histplot(data_to_plot.sample(frac=0.1), x=col, hue="Cover_Type", ax=ax)

fig.set_size_inches(20, len(cts_variables) * 3)
fig.tight_layout()

In [None]:
data_elevation = data.loc[data["Cover_Type"].isin([3, 4, 6, 7]), ["Cover_Type", "Elevation"]]
sns.displot(data_elevation, x="Elevation", hue="Cover_Type")

fig = plt.gcf()
fig.set_size_inches(15, 5)
plt.title("Minor groups' elevation distribution", fontsize=20);
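A per-class summary table is a useful companion to the distribution plot when checking whether the minor classes split into separate elevation bands. This sketch uses made-up elevations, chosen only to show the shape of the output:

```python
import pandas as pd

# Made-up elevations for a few minor classes (not taken from the competition data).
toy = pd.DataFrame({
    "Cover_Type": [3, 3, 4, 4, 7, 7],
    "Elevation":  [2400, 2500, 2200, 2300, 3300, 3400],
})

# Median, minimum and maximum elevation per cover type.
elevation_summary = toy.groupby("Cover_Type")["Elevation"].agg(["median", "min", "max"])
print(elevation_summary)
```

Classes whose ranges barely overlap are candidates for separate groups rather than a single "other" bucket.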
