# Cover Type Prediction: First glimpse

Sébastien Meyer

In [None]:
import numpy as np
import pandas as pd

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
font = {"size": 18}
mpl.rc("font", **font)

## Basic information about our dataset

In [None]:
# Read files
test_df = pd.read_csv("data/covtype.csv", index_col=["Id"])

training_ids = []

with open("data/training_ids.txt", "r", encoding="utf-8") as f:

    training_ids = f.read().split(",")
    training_ids = [int(x) for x in training_ids]

train_df = test_df.iloc[training_ids, :].copy()

print(train_df.shape)
train_df.head()

The training set contains 15120 samples, each with 54 features and 1 target (the cover type).

In [None]:
print(test_df.shape)

There are 581012 test samples, so the test set is roughly 38 times larger than the training set. Therefore, we must pay attention to overfitting.

## What is our task?

We want to know if we are facing a regression or a classification problem, and to know how many classes there are if we are facing a classification situation. Also, we need to know if the training set is imbalanced.

In [None]:
# Is the training set balanced or imbalanced?
print(train_df["Cover_Type"].value_counts())

The task is thus a classification problem with 7 possible labels. The training set is imbalanced, so we might need to weight the classes differently.
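One common way to weight classes is inversely to their frequency. The sketch below illustrates this on a toy label series, not on the notebook's data. The helper name `balanced_class_weights` is hypothetical, and the formula `n_samples / (n_classes * count)` is the "balanced" heuristic, which is only one possible choice.

```python
import pandas as pd


def balanced_class_weights(labels: pd.Series) -> dict:
    """Return weights inversely proportional to class frequencies."""
    counts = labels.value_counts()
    n_samples = len(labels)
    n_classes = counts.size
    # Rare classes get weights above 1, frequent classes below 1
    weights = n_samples / (n_classes * counts)
    return weights.to_dict()


# Toy labels (hypothetical, not the real Cover_Type column)
toy_labels = pd.Series([1, 1, 1, 1, 2, 2, 3, 3])
toy_weights = balanced_class_weights(toy_labels)
print(toy_weights)
```

On the real data, the same helper could be called on `train_df["Cover_Type"]`, and the resulting dictionary passed to a model that accepts per-class weights.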

In [None]:
# Distribution of cover types
plt.figure(figsize=(10, 8))

colors = plt.cm.tab10(np.linspace(0, 1, 7))
bins = np.arange(1, 9) - 0.5

# One histogram per cover type so that each class gets its own color and label
for cover_type, color in zip(range(1, 8), colors):
    plt.hist(train_df.loc[train_df["Cover_Type"] == cover_type, "Cover_Type"],
             label=f"Cover_Type = {cover_type}", bins=bins, alpha=0.75, color=color)

# Display the labels defined above
plt.legend(loc="best")

plt.tight_layout()
# plt.savefig("report/figures/label_repart.png", facecolor="white")
plt.show()

We want to check if there are any missing values.

In [None]:
# Are there missing values?
train_df.isnull().sum()

The dataset contains no missing values. Next, we will dive into the data types.

## Looking at the available features

In [None]:
# Numerical / Categorical features
train_df.dtypes

Here, we can see that all the features are numerical. We also want to know whether they are discrete or continuous. Continuous features take a distinct value for almost every data point, while discrete features take a limited number of values.

In [None]:
# Discrete / Continuous features
train_df.nunique()

It seems that all the features are discrete to some extent. Furthermore, the Wilderness_AreaX and Soil_TypeX features are binary, with values 0 and 1. The question we want to answer is whether each data point has exactly one Wilderness_AreaX feature and exactly one Soil_TypeX feature set to 1.

In [None]:
# Binary variables and target
wild_var = [f"Wilderness_Area{i}" for i in range(1, 5)]
soil_var = [f"Soil_Type{i}" for i in range(1, 41)]
label_var = ["Cover_Type"]

# Print the number of positive class for both types of binary features and the missing spots
print("Total number of 1 for Wilderness_AreaX features: ", train_df[wild_var].sum().sum())
print("Total number of 1 for Soil_TypeX features: ", train_df[soil_var].sum().sum())

print("\n")

print("Number of data point(s) with no Wilderness_AreaX: ", (train_df[wild_var].sum(axis=1) == 0).sum())
print("Number of data point(s) with no Soil_TypeX: ", (train_df[soil_var].sum(axis=1) == 0).sum())

These checks show that each data point has exactly one Wilderness_AreaX feature and exactly one Soil_TypeX feature set to 1: for each group, the total number of 1s equals the number of samples and no data point is left without a positive value.
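The same property can be tested directly, row by row. The following is a minimal sketch on a toy frame; the helper name `is_one_hot` and the toy columns are hypothetical.

```python
import pandas as pd


def is_one_hot(df: pd.DataFrame, columns: list) -> bool:
    """Check that every row has exactly one of `columns` set to 1."""
    block = df[columns]
    only_binary = block.isin([0, 1]).all().all()
    one_per_row = (block.sum(axis=1) == 1).all()
    return bool(only_binary and one_per_row)


# Toy frame (hypothetical): "A*" is a valid one-hot group, "B*" is not
toy = pd.DataFrame({
    "A1": [1, 0, 0], "A2": [0, 1, 1],
    "B1": [1, 0, 1], "B2": [1, 0, 0],
})
print(is_one_hot(toy, ["A1", "A2"]))
print(is_one_hot(toy, ["B1", "B2"]))
```

On the notebook's data, this would be called as `is_one_hot(train_df, wild_var)` and `is_one_hot(train_df, soil_var)`.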

Now, let's verify the values of numerical features.

In [None]:
# Separate discrete and continuous features
all_var = list(train_df.columns)
disc_var = wild_var + soil_var + label_var
cont_var = [x for x in all_var if x not in disc_var]

# Look at the coherence of the data
train_df[cont_var].describe()

In [None]:
pd.set_option("display.max_columns", None)

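def describe(df, stats):
    """Return df.describe() extended with the extra aggregations in `stats`."""
    d = df.describe()
    # DataFrame.append was removed in pandas 2.0, so we use pd.concat instead
    return pd.concat([d, df.reindex(d.columns, axis=1).agg(stats)])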

print(describe(train_df, ["sum"]))

pd.set_option("display.max_columns", 15)

The details given in the data description are as follows:

- Elevation, quantitative (meters): Elevation in meters
- Aspect, quantitative (azimuth): Aspect in degrees azimuth
- Slope, quantitative (degrees): Slope in degrees
- *Horizontal_Distance_To_Hydrology*, quantitative (meters): Horz Dist to nearest surface water features
- *Vertical_Distance_To_Hydrology*, quantitative (meters): Vert Dist to nearest surface water features
- *Horizontal_Distance_To_Roadways*, quantitative (meters): Horz Dist to nearest roadway
- *Hillshade_9am*, quantitative (0 to 255 index): Hillshade index at 9am, summer solstice
- *Hillshade_Noon*, quantitative (0 to 255 index): Hillshade index at noon, summer solstice
- *Hillshade_3pm*, quantitative (0 to 255 index): Hillshade index at 3pm, summer solstice
- *Horizontal_Distance_To_Fire_Points*, quantitative (meters): Horz Dist to nearest wildfire ignition points

The summary statistics are consistent with these descriptions, and the values seem reasonable.
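Such a coherence check can also be automated. The sketch below counts values outside an expected range on a toy frame. The hillshade bounds come from the data description above, while the aspect and slope bounds are assumptions based on their units; the helper name `out_of_range_counts` is hypothetical.

```python
import pandas as pd

# Expected ranges; the hillshade bounds come from the data description,
# the aspect and slope bounds are assumptions based on their units
expected_ranges = {
    "Hillshade_9am": (0, 255),
    "Hillshade_Noon": (0, 255),
    "Hillshade_3pm": (0, 255),
    "Aspect": (0, 360),
    "Slope": (0, 90),
}


def out_of_range_counts(df: pd.DataFrame, ranges: dict) -> pd.Series:
    """Count, per column, the values falling outside [low, high]."""
    counts = {
        col: int((~df[col].between(low, high)).sum())
        for col, (low, high) in ranges.items()
        if col in df.columns
    }
    return pd.Series(counts, dtype="int64")


# Toy rows (hypothetical) with one invalid hillshade value
toy = pd.DataFrame({"Hillshade_9am": [0, 128, 300], "Aspect": [10, 359, 45]})
violations = out_of_range_counts(toy, expected_ranges)
print(violations)
```

Calling `out_of_range_counts(train_df, expected_ranges)` would return one count per listed column; a result of all zeros would support the claim that the values are reasonable.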

Next, we will look at the distributions and correlations.

In [None]:
# Find correlations with the target and sort
corr = train_df.corr()["Cover_Type"].sort_values()
print(corr)

It appears that some of the Soil_TypeX and Wilderness_AreaX variables have the largest correlation, in absolute value, with the target variable. Let's look at them.

Also, the correlation for Soil_Type15 is NaN: the feature is constant (always 0 in the training set), so its variance is zero and the correlation is undefined. This variable cannot help with our task.
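The link between a constant feature and a NaN correlation can be reproduced on a toy frame. This is a minimal sketch with hypothetical column names, not the notebook's data.

```python
import pandas as pd

# Toy frame (hypothetical): "always_zero" mimics a feature that never varies
toy = pd.DataFrame({
    "always_zero": [0, 0, 0, 0],
    "varies": [0, 1, 0, 1],
    "target": [1, 2, 1, 2],
})

# Pearson correlation divides by the standard deviation, which is 0 here
corr_with_target = toy.corr()["target"]
print(corr_with_target)

# Constant columns have a single distinct value
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]
print(constant_cols)
```

The same list comprehension applied to `train_df` would list every constant column, which could then be dropped before modeling.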

In [None]:
# Count the positive values of the most correlated binary features
print("Number of positive Soil_Type38 values among training set: ", (train_df["Soil_Type38"] == 1).sum())
print("Number of positive Soil_Type39 values among training set: ", (train_df["Soil_Type39"] == 1).sum())
print("Number of positive Wilderness_Area1 values among training set: ", (train_df["Wilderness_Area1"] == 1).sum())
print("Number of positive Soil_Type29 values among training set: ", (train_df["Soil_Type29"] == 1).sum())

In [None]:
# Distribution of cover types when Soil_Type38 is equal to 1 and 0
plt.figure(1, figsize=(10, 8))
sns.kdeplot(train_df.loc[train_df["Soil_Type38"] == 0, "Cover_Type"], label="Soil_Type38 = 0", color="blue")
sns.kdeplot(train_df.loc[train_df["Soil_Type38"] == 1, "Cover_Type"], label="Soil_Type38 = 1", color="red")

plt.legend(loc="best")
plt.title("Distribution of Cover_Type by Soil_Type38")

plt.show()

In the case of Soil_Type38, we indeed observe that data points with value 1 are split between cover types 1 and 7, while data points with value 0 can be of any cover type, although type 7 is less likely.
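Because Cover_Type is a discrete label, a density plot smooths across the integer values. A normalized contingency table is an alternative that gives exact proportions. The sketch below uses `pd.crosstab` on toy data with hypothetical values.

```python
import pandas as pd

# Toy data (hypothetical values, not the real training set)
toy = pd.DataFrame({
    "Soil_Type38": [1, 1, 1, 0, 0, 0, 0, 0],
    "Cover_Type":  [1, 7, 7, 2, 3, 3, 5, 6],
})

# Each row sums to 1: share of every cover type for a given flag value
shares = pd.crosstab(toy["Soil_Type38"], toy["Cover_Type"], normalize="index")
print(shares.round(2))
```

With the notebook's data, `pd.crosstab(train_df["Soil_Type38"], train_df["Cover_Type"], normalize="index")` would give the share of each cover type among data points with and without this soil type.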

In [None]:
# Distribution of cover types when Soil_Type39 is equal to 1 and 0
plt.figure(1, figsize=(10, 8))
sns.kdeplot(train_df.loc[train_df["Soil_Type39"] == 0, "Cover_Type"], label="Soil_Type39 = 0", color="blue")
sns.kdeplot(train_df.loc[train_df["Soil_Type39"] == 1, "Cover_Type"], label="Soil_Type39 = 1", color="red")

plt.legend(loc="best")
plt.title("Distribution of Cover_Type by Soil_Type39")

plt.show()

The distribution for Soil_Type39 is almost the same as for Soil_Type38, which suggests that some soil types are strongly related to specific cover types. These features will be of great importance.

In [None]:
# Distribution of cover types when Soil_Type29 is equal to 1 and 0
plt.figure(1, figsize=(10, 8))
sns.kdeplot(train_df.loc[train_df["Soil_Type29"] == 0, "Cover_Type"], label="Soil_Type29 = 0", color="blue")
sns.kdeplot(train_df.loc[train_df["Soil_Type29"] == 1, "Cover_Type"], label="Soil_Type29 = 1", color="red")

plt.legend(loc="best")
plt.title("Distribution of Cover_Type by Soil_Type29")

plt.show()

For the Soil_Type29 feature, the distribution is less sharply concentrated when the value is 1. However, we can see that some cover types are immediately ruled out in that case.

In [None]:
# Distribution of cover types when Wilderness_Area1 is equal to 1 and 0
plt.figure(1, figsize=(10, 8))
sns.kdeplot(train_df.loc[train_df["Wilderness_Area1"] == 0, "Cover_Type"], label="Wilderness_Area1 = 0", color="blue")
sns.kdeplot(train_df.loc[train_df["Wilderness_Area1"] == 1, "Cover_Type"], label="Wilderness_Area1 = 1", color="red")

plt.legend(loc="best")
plt.title("Distribution of Cover_Type by Wilderness_Area1")

plt.show()

Interestingly, we observe that the distribution of cover types for Wilderness_Area1 is similar to the one for Soil_Type29. Later, we will see if these features are correlated.

Finally, let's look at the distribution of one of the most correlated continuous features.

In [None]:
# Distribution of Horizontal_Distance_To_Roadways for each cover type
plt.figure(figsize=(10, 8))

colors = plt.cm.brg(np.linspace(0, 1, 7))

# One density curve per cover type
for cover_type, color in zip(range(1, 8), colors):
    sns.kdeplot(train_df.loc[train_df["Cover_Type"] == cover_type, "Horizontal_Distance_To_Roadways"],
                label=f"Cover_Type = {cover_type}", color=color)

plt.legend(loc="best")
plt.title("Distribution of Horizontal_Distance_To_Roadways by Cover_Type")

plt.show()

This graph shows that the middle cover types (around 3-6) correspond to smaller values of Horizontal_Distance_To_Roadways, while cover types 1, 2 and 7 can take larger values. However, we see that within each group, Horizontal_Distance_To_Roadways does not help much in differentiating the cover types.

In terms of correlation between the features, we might not be able to get a lot of information. Indeed, the features most correlated with the target are binary indicators such as Soil_TypeX and Wilderness_AreaX. Let's plot a correlation heatmap for the Soil_TypeX and Wilderness_AreaX features most correlated with the target.

In [None]:
most_corr_disc = train_df[["Soil_Type38", "Soil_Type39", "Soil_Type29", "Wilderness_Area1", "Cover_Type"]]
most_corr_disc_corr = most_corr_disc.corr()

plt.figure(figsize=(8, 6))

# Heatmap of correlations
sns.heatmap(most_corr_disc_corr, cmap=plt.cm.RdYlBu_r, vmin=-0.23, annot=True, vmax=0.55)

plt.xticks(rotation=60)

plt.tight_layout()
# plt.savefig("report/figures/basecorr.png", facecolor="white")
plt.show()
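The features in the heatmap above are listed by hand. As a sketch of how such a selection could be automated, the helper below ranks features by the absolute value of their correlation with the target, so that strongly negative correlations are not overlooked. The name `top_correlated` and the toy frame are hypothetical.

```python
import pandas as pd


def top_correlated(df: pd.DataFrame, target: str, k: int) -> list:
    """Return the k features with the largest absolute correlation to target."""
    corr = df.corr()[target].drop(target)
    # dropna() discards constant features, whose correlation is undefined
    ranked = corr.dropna().abs().sort_values(ascending=False)
    return list(ranked.index[:k])


# Toy frame (hypothetical): "neg" is strongly but negatively correlated
toy = pd.DataFrame({
    "pos": [1, 2, 3, 5],
    "neg": [4, 3, 2, 1],
    "flat": [0, 0, 0, 0],
    "target": [1, 2, 3, 4],
})
selected = top_correlated(toy, "target", k=2)
print(selected)
```

On the notebook's data, `top_correlated(train_df, "Cover_Type", k=4)` would return four column names that could replace the hard-coded list, with `"Cover_Type"` appended before computing the heatmap.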

These results confirm what we observed: Soil_Type29 and Wilderness_Area1 carry similar information about the cover types, which are more likely to take small label values when these features are equal to 1. We also see two groups of cover types whose members appear harder to tell apart: 1-2-7 and 3-4-5-6.