In this notebook I explore the data to gain insights that will later help decide which model best suits this problem.

# Initialization

In [None]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_selection import f_classif
from sklearn.feature_selection import chi2
from sklearn.feature_selection import mutual_info_classif

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv('/kaggle/input/tabular-playground-series-dec-2021/train.csv', index_col='Id').reset_index(drop=True)
df.head()

In [None]:
X = df.drop(['Cover_Type'], axis=1)
y = df['Cover_Type']

In [None]:
y.value_counts()

The classes are very imbalanced. Class 5 has only 1 sample, and a single sample cannot be shared between the training and validation sets in a stratified split, so in my opinion it is better to just remove it.

For more discussion on this: https://www.kaggle.com/c/tabular-playground-series-dec-2021/discussion/293064

In [None]:
# Keep every row whose label is not 5 (the class with a single sample).
# A boolean mask also works if class 5 is absent or appears more than once.
keep = y != 5

X = X[keep]
y = y[keep]
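
The same idea can be generalized to any class that is too rare to split. The following is a minimal sketch on toy data; `min_count`, `X_toy` and `y_toy` are hypothetical names introduced for illustration and are not part of the competition data.

```python
import pandas as pd

# Toy target with one singleton class (5), mimicking the imbalance described above.
y_toy = pd.Series([1, 1, 2, 2, 2, 3, 3, 5])
X_toy = pd.DataFrame({"feature": range(len(y_toy))})

# Hypothetical threshold: a stratified split needs at least 2 samples per class.
min_count = 2

counts = y_toy.value_counts()
rare_classes = counts[counts < min_count].index

# Boolean mask that is True for rows belonging to sufficiently frequent classes.
keep = ~y_toy.isin(rare_classes)
X_kept, y_kept = X_toy[keep], y_toy[keep]

print(list(rare_classes), len(y_kept))
```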

# Splitting into training and validation sets

Splitting before EDA ensures that the validation dataset does not contribute to the decision making and is only used for validation.

In [None]:
# y.unique()

In [None]:
X_train, X_val, y_train, y_val = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
# df_train, df_val = train_test_split(df, test_size=0.2, random_state=42)
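
To illustrate what `stratify=y` does, here is a small sketch on toy data (the names `X_toy` and `y_toy` are made up for this example). With an 80/20 class ratio, both parts of the split should keep roughly the same share of the minority class.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 80 samples of class 0 and 20 samples of class 1.
y_toy = pd.Series([0] * 80 + [1] * 20)
X_toy = pd.DataFrame({"x": range(100)})

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Share of class 1 in each part; stratification keeps both close to 0.2.
train_share = y_tr.mean()
val_share = y_va.mean()
print(train_share, val_share)
```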

In [None]:
# df_train = df_train.reset_index(drop=True)
# df_train.head()
X_train

# Exploratory Data Analysis (EDA)

In [None]:
X_train.info()

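At first sight the dataset appears to have only numerical features, but some of them (like `Wilderness_Area{n}`) look like categorical features that have been encoded as binary columns.

We can confirm this by counting the unique values of each column and treating every column with at most two distinct values as categorical.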

In [None]:
numerical = []
categorical = []
for col in X_train.columns:
    if X_train[col].nunique() <= 2:
        categorical.append(col)
    else:
        numerical.append(col)
        
categorical

Therefore, features `Wilderness_Area{n}` and `Soil_Type{n}` are categorical features.
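
A possible follow-up check, sketched here on toy data only, is whether each group of binary columns behaves like a strict one-hot encoding, that is, whether exactly one column per group is set in every row. The toy values below are invented and say nothing about the real dataset.

```python
import pandas as pd

# Invented toy rows using the same column-name prefixes as the dataset.
toy = pd.DataFrame({
    "Wilderness_Area1": [1, 0, 0, 1],
    "Wilderness_Area2": [0, 1, 0, 1],
    "Soil_Type1": [1, 0, 1, 0],
    "Soil_Type2": [0, 1, 0, 1],
})

sum_counts = {}
for prefix in ("Wilderness_Area", "Soil_Type"):
    group = toy.filter(regex=f"^{prefix}")
    # Number of columns set to 1 in each row, then how often each count occurs.
    row_sums = group.sum(axis=1)
    sum_counts[prefix] = row_sums.value_counts().sort_index().to_dict()

# A strict one-hot group would only ever show a row sum of 1.
print(sum_counts)
```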

## Numerical Features

In [None]:
X_numerical = X_train[numerical]

X_numerical.head()

In [None]:
X_numerical.describe()

### Missing Values

In [None]:
X_numerical.isna().any()

The numerical features of the training data have no missing values.

The test data might, however, so we decide in advance how to fill any missing values we find.

The methodology used for numerical features is:
- Fill with the mean if the feature has a Gaussian distribution
- Fill with the median otherwise

To find out whether a feature is Gaussian, we plot a histogram of each feature.

In [None]:
for c in numerical:
    plt.hist(X_numerical[c], bins=100)
    plt.xlabel(c)
    plt.show()

From the plots, only `Elevation` and `Hillshade_3pm` look Gaussian-like. So, we will fill missing values in those two features with the mean and in the remaining numerical features with the median.
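
The fill rule above could be sketched as follows. This is an illustration on synthetic data: the frames `train` and `test`, the `Slope` column values and the `gaussian_like` list are assumptions made for the example. The key point is that the fill values are computed on the training data only and then applied to the test data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins for two numerical features.
train = pd.DataFrame({
    "Elevation": rng.normal(3000, 300, size=1000),  # roughly Gaussian
    "Slope": rng.exponential(10, size=1000),        # skewed
})

# Assumption: this list is chosen by eye from the histograms.
gaussian_like = ["Elevation"]

# Mean for Gaussian-like features, median for everything else.
fill_values = {
    col: train[col].mean() if col in gaussian_like else train[col].median()
    for col in train.columns
}

# Hypothetical test rows with gaps, filled with the training statistics.
test = pd.DataFrame({"Elevation": [3100.0, np.nan], "Slope": [np.nan, 5.0]})
test_filled = test.fillna(fill_values)
print(test_filled)
```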

### Feature Redundancy

Next, we will look at the correlation between features to find out whether any of them are redundant.

We will measure linear correlation with Pearson's correlation coefficient and monotonic (possibly non-linear) correlation with Spearman's rank correlation coefficient.

For both we will plot a correlation matrix to make the result readable.

Source: https://machinelearningmastery.com/how-to-use-correlation-to-understand-the-relationship-between-variables/
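
To make the difference between the two coefficients concrete, here is a small sketch on synthetic data (all column names are invented). An exponential relationship is monotonic but not linear, so Spearman's coefficient should be 1 while Pearson's is noticeably lower; a symmetric parabola is not monotonic, so both should be close to 0.

```python
import numpy as np
import pandas as pd

x = np.linspace(0, 5, 200)
toy = pd.DataFrame({
    "x": x,
    "exp_x": np.exp(x),           # monotonic, non-linear in x
    "parabola": (x - 2.5) ** 2,   # non-monotonic in x
})

pearson = toy.corr(method="pearson")
spearman = toy.corr(method="spearman")

print(pearson.loc["x"].round(3))
print(spearman.loc["x"].round(3))
```

Neither coefficient detects the parabola, which is a reminder that a low value in both matrices does not rule out every kind of dependence.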

In [None]:
pearson_corr = X_numerical.corr(method='pearson').abs()

fig, ax = plt.subplots(figsize=(6, 6))

plt.title("Correlation Plot\nAbsolute value of Pearson's Correlation Coefficient\n\n")
sns.heatmap(pearson_corr,
            cmap=sns.diverging_palette(230, 10, as_cmap=True),
            square=True,
            vmin=0,
            vmax=1,
            ax=ax)
plt.show()

In [None]:
spearman_corr = X_numerical.corr(method='spearman').abs()

fig, ax = plt.subplots(figsize=(6, 6))

plt.title('Correlation Plot\nAbsolute value of Spearman Correlation Coefficient\n\n')
sns.heatmap(spearman_corr,
            cmap=sns.diverging_palette(230, 10, as_cmap=True),
            square=True,
            vmin=0,
            vmax=1,
            ax=ax)
plt.show()

None of the numerical features appear to be strongly correlated with each other, so none of them looks redundant.
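
Reading a heatmap by eye can miss individual pairs, so a numeric check may help. The helper below is a hypothetical sketch, not part of the original analysis: `correlated_pairs` and its `threshold` of 0.8 are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd


def correlated_pairs(corr_abs, threshold=0.8):
    """Return feature pairs whose absolute correlation exceeds `threshold`."""
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (always 1) is ignored.
    mask = np.triu(np.ones(corr_abs.shape, dtype=bool), k=1)
    pairs = corr_abs.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Toy data: b is a linear function of a, c is unrelated.
toy = pd.DataFrame({
    "a": np.arange(10),
    "b": 2 * np.arange(10) + 1,
    "c": [3, 1, 4, 1, 5, 9, 2, 6, 5, 3],
})

result = correlated_pairs(toy.corr(method="pearson").abs())
print(result)
```

On the notebook's data the function could be called with `pearson_corr` or `spearman_corr`, which already hold absolute values.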

### Feature Selection

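Now, we will try to measure how relevant each feature is to the target.

For this we will use the ANOVA F-value to detect linear relationships and Kendall's $\tau$ coefficient for monotonic, possibly non-linear, ones. Kendall's $\tau$ is a rank correlation, so it treats the `Cover_Type` labels as if they were ordered; its values should be read with that in mind.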

Source: https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/

In [None]:
anova_f_values = f_classif(X_numerical, y_train)[0]

linear_corr = pd.Series(anova_f_values, index=X_numerical.columns)
linear_corr

The higher the ANOVA F-value, the more important the feature is in predicting the result.
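
As an illustration of how the F-values behave, the sketch below applies `f_classif` to synthetic data in which one feature's mean depends on the class and the other is pure noise. The data and column names are invented. `f_classif` also returns p-values, which can be kept alongside the F-values.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_classif

rng = np.random.default_rng(0)

# Three classes with 100 samples each.
y_toy = np.repeat([0, 1, 2], 100)
X_toy = pd.DataFrame({
    # Class-dependent mean: should receive a large F-value.
    "informative": rng.normal(loc=y_toy * 3.0, scale=1.0),
    # Same distribution in every class: should receive a small F-value.
    "noise": rng.normal(size=y_toy.size),
})

f_values, p_values = f_classif(X_toy, y_toy)
anova = pd.DataFrame({"F": f_values, "p_value": p_values}, index=X_toy.columns)
anova = anova.sort_values("F", ascending=False)
print(anova)
```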

In [None]:
non_linear_corr = X_numerical.corrwith(y_train, method='kendall')
non_linear_corr

Kendall's $\tau$ ranges from -1 to 1, so the closer its absolute value is to 1, the more important the feature is in predicting the result.
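
Because the sign of the coefficient only indicates direction, ranking by absolute value is one reasonable way to compare features. A minimal sketch on invented data, where `rising` and `falling` are equally informative but have opposite signs:

```python
import pandas as pd

X_toy = pd.DataFrame({
    "rising": [1, 2, 3, 4, 5, 6],
    "falling": [6, 5, 4, 3, 2, 1],
    "mixed": [2, 6, 1, 5, 4, 3],
})
y_toy = pd.Series([1, 1, 2, 2, 3, 3])

# Kendall's tau between every feature and the target.
tau = X_toy.corrwith(y_toy, method="kendall")

# Rank by strength of association, ignoring direction.
ranked = tau.abs().sort_values(ascending=False)
print(tau)
print(ranked)
```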

Whether to remove any features, and which ones, will be decided later by training some simple linear models, removing features one by one according to their correlation values, and evaluating the resulting scores.
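
One way such an experiment might look is sketched below. Everything in it is an assumption made for illustration: the synthetic data from `make_classification`, the choice of logistic regression as the simple linear model, the ANOVA F-value as the ranking, and 3-fold cross-validation as the score.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 3 informative features and 3 noise features.
X_arr, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_toy = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

# Rank features from weakest to strongest by ANOVA F-value.
ranking = pd.Series(f_classif(X_toy, y_toy)[0], index=X_toy.columns).sort_values()

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Start with all features, then drop the weakest remaining one at each step.
scores = {}
remaining = list(X_toy.columns)
for weakest in [None] + list(ranking.index[:-1]):
    if weakest is not None:
        remaining.remove(weakest)
    scores[len(remaining)] = cross_val_score(
        model, X_toy[remaining], y_toy, cv=3
    ).mean()

# Mean accuracy keyed by the number of features kept.
print(pd.Series(scores))
```

If the score stays flat while features are dropped, those features are candidates for removal.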

## Categorical Features

In [None]:
X_categorical = X_train[categorical]

X_categorical.head()

### Missing Values

In [None]:
X_categorical.isna().any()

The training data has no missing values, but the test data might. If it does, we will fill the missing values with the most frequent value of the feature.
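
A minimal sketch of this most-frequent-value fill, using invented toy frames `train_cat` and `test_cat`. As with the numerical features, the fill values come from the training data only.

```python
import numpy as np
import pandas as pd

# Toy binary features from the training data.
train_cat = pd.DataFrame({
    "Soil_Type1": [0, 0, 1, 0],
    "Wilderness_Area1": [1, 1, 0, 1],
})

# Most frequent value of each column (first row of the mode table).
most_frequent = train_cat.mode().iloc[0]

# Hypothetical test rows with gaps.
test_cat = pd.DataFrame({
    "Soil_Type1": [np.nan, 1],
    "Wilderness_Area1": [0, np.nan],
})
test_filled = test_cat.fillna(most_frequent).astype(int)
print(test_filled)
```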

### Feature Redundancy

Now, we will look for redundant categorical features.

As before, we will use Pearson's correlation coefficient and Spearman's rank correlation coefficient. For binary features the ranks are just a linear rescaling of the values, so the two matrices are expected to be identical.

For both we will plot a correlation matrix to make the result readable.

Source: https://machinelearningmastery.com/how-to-use-correlation-to-understand-the-relationship-between-variables/

In [None]:
pearson_corr = X_categorical.corr(method='pearson').abs()

fig, ax = plt.subplots(figsize=(6, 6))

plt.title("Correlation Plot\nAbsolute value of Pearson's Correlation Coefficient\n\n")
sns.heatmap(pearson_corr,
            cmap=sns.diverging_palette(230, 10, as_cmap=True),
            square=True,
            vmin=0,
            vmax=1,
            ax=ax)
plt.show()

In [None]:
spearman_corr = X_categorical.corr(method='spearman').abs()

fig, ax = plt.subplots(figsize=(6, 6))

plt.title('Correlation Plot\nAbsolute value of Spearman Correlation Coefficient\n\n')
sns.heatmap(spearman_corr,
            cmap=sns.diverging_palette(230, 10, as_cmap=True),
            square=True,
            vmin=0,
            vmax=1,
            ax=ax)
plt.show()

`Wilderness_Area1` and `Wilderness_Area3` are somewhat correlated, so we may decide to remove one of them.

The other important observation is that the rows and columns for `Soil_Type7` and `Soil_Type15` look odd in the plot. To understand why, we can check which values these features take.

In [None]:
X_categorical['Soil_Type7'].value_counts()

In [None]:
X_categorical['Soil_Type15'].value_counts()

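Both of these features take only a single value. A constant feature has zero variance, so its correlation with any other feature is undefined, and it carries no information about the target. It is better to remove them.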

In [None]:
X_categorical = X_categorical.drop(['Soil_Type7', 'Soil_Type15'], axis=1)
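
Instead of spotting constant columns in the plot, they could also be detected programmatically. The sketch below uses an invented toy frame; it shows that a constant column produces `NaN` correlations and how `nunique` finds such columns.

```python
import pandas as pd

toy = pd.DataFrame({
    "Soil_Type7": [0, 0, 0, 0],   # constant
    "Soil_Type8": [0, 1, 0, 0],
    "Soil_Type15": [0, 0, 0, 0],  # constant
})

# Correlations involving a zero-variance column are undefined (NaN).
corr = toy.corr()

# Columns with at most one distinct value are constant.
constant_cols = toy.columns[toy.nunique() <= 1].tolist()
reduced = toy.drop(columns=constant_cols)

print(constant_cols)
print(reduced.columns.tolist())
```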

### Feature Selection

Now, we will try to find feature relevance with the target.

For this we will use Chi-Squared test and Mutual Information.

Source: https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/

In [None]:
chi_square = chi2(X_categorical, y_train)[0]

chi_square = pd.Series(chi_square, index=X_categorical.columns)
chi_square

The higher the Chi-squared value, the more important the feature is in predicting the result.
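
To show how the statistic separates a related feature from an unrelated one, here is a sketch on synthetic binary data (the column names `linked` and `random` are invented). `chi2` also returns p-values, which can help judge whether a score is large enough to matter.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
n = 500

y_toy = rng.integers(0, 2, size=n)
X_toy = pd.DataFrame({
    # Equal to the target except for about 10% randomly flipped rows.
    "linked": y_toy ^ (rng.random(n) < 0.1),
    # Drawn independently of the target.
    "random": rng.integers(0, 2, size=n),
})

stats, p_values = chi2(X_toy, y_toy)
chi_table = pd.DataFrame({"chi2": stats, "p_value": p_values}, index=X_toy.columns)
print(chi_table)
```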

In [None]:
mutual_info = mutual_info_classif(X_categorical, y_train, discrete_features=True, random_state=42)

mutual_info = pd.Series(mutual_info, index=X_categorical.columns)
mutual_info

The higher the Mutual Information value, the more important the feature is in predicting the result.
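
For intuition about the scale of these values, the sketch below uses two hand-built binary features (names invented): one is an exact copy of a balanced binary target, the other is independent of it by construction. With discrete features the estimate is exact, so the copy should score the entropy of the target, $\ln 2 \approx 0.693$ nats, and the independent feature should score 0.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Balanced binary target; each (feature, target) combination below
# occurs equally often for the independent feature.
y_toy = np.array([0, 0, 1, 1] * 25)
X_toy = pd.DataFrame({
    "copy_of_target": y_toy,
    "independent": np.array([0, 1, 0, 1] * 25),
})

mi = mutual_info_classif(X_toy, y_toy, discrete_features=True, random_state=42)
mi = pd.Series(mi, index=X_toy.columns)
print(mi)
```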

Whether to remove any features, and which ones, will be decided later by training some simple linear models, removing features one by one according to their Chi-squared and Mutual Information scores, and evaluating the resulting scores.