## Introduction

While browsing the data description of the Forest Cover Type Competition (https://www.kaggle.com/c/forest-cover-type-prediction/data), I noticed that it includes some details on the characteristics and families of the soil types. While doing [my EDA](https://www.kaggle.com/possatti/tps-12-eda), I wondered whether those details could be used as features to improve model performance.

In this notebook I give it a try. I show how these features can be generated and experiment with them using a Random Forest model.

## Description of soil properties

From https://www.kaggle.com/c/forest-cover-type-prediction/data: 

> - 1 Cathedral family - Rock outcrop complex, extremely stony.
> - 2 Vanet - Ratake families complex, very stony.
> - 3 Haploborolis - Rock outcrop complex, rubbly.
> - 4 Ratake family - Rock outcrop complex, rubbly.
> - 5 Vanet family - Rock outcrop complex complex, rubbly.
> - 6 Vanet - Wetmore families - Rock outcrop complex, stony.
> - 7 Gothic family.
> - 8 Supervisor - Limber families complex.
> - 9 Troutville family, very stony.
> - 10 Bullwark - Catamount families - Rock outcrop complex, rubbly.
> - 11 Bullwark - Catamount families - Rock land complex, rubbly.
> - 12 Legault family - Rock land complex, stony.
> - 13 Catamount family - Rock land - Bullwark family complex, rubbly.
> - 14 Pachic Argiborolis - Aquolis complex.
> - 15 unspecified in the USFS Soil and ELU Survey.
> - 16 Cryaquolis - Cryoborolis complex.
> - 17 Gateview family - Cryaquolis complex.
> - 18 Rogert family, very stony.
> - 19 Typic Cryaquolis - Borohemists complex.
> - 20 Typic Cryaquepts - Typic Cryaquolls complex.
> - 21 Typic Cryaquolls - Leighcan family, till substratum complex.
> - 22 Leighcan family, till substratum, extremely bouldery.
> - 23 Leighcan family, till substratum - Typic Cryaquolls complex.
> - 24 Leighcan family, extremely stony.
> - 25 Leighcan family, warm, extremely stony.
> - 26 Granile - Catamount families complex, very stony.
> - 27 Leighcan family, warm - Rock outcrop complex, extremely stony.
> - 28 Leighcan family - Rock outcrop complex, extremely stony.
> - 29 Como - Legault families complex, extremely stony.
> - 30 Como family - Rock land - Legault family complex, extremely stony.
> - 31 Leighcan - Catamount families complex, extremely stony.
> - 32 Catamount family - Rock outcrop - Leighcan family complex, extremely stony.
> - 33 Leighcan - Catamount families - Rock outcrop complex, extremely stony.
> - 34 Cryorthents - Rock land complex, extremely stony.
> - 35 Cryumbrepts - Rock outcrop - Cryaquepts complex.
> - 36 Bross family - Rock land - Cryumbrepts complex, extremely stony.
> - 37 Rock outcrop - Cryumbrepts - Cryorthents complex, extremely stony.
> - 38 Leighcan - Moran families - Cryaquolls complex, extremely stony.
> - 39 Moran family - Cryorthents - Leighcan family complex, extremely stony.
> - 40 Moran family - Cryorthents - Rock land complex, extremely stony.

## Preparation

In [None]:
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display
from scipy.stats import uniform, randint, mode
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, RandomizedSearchCV
sns.set()

In [None]:
plt.rcParams['figure.figsize'] = (16, 4)

## Loading the data

In [None]:
soil_type_vars = [f'Soil_Type{i}' for i in range(1, 41)]
wilderness_area_vars = [f'Wilderness_Area{i}' for i in range(1, 5)]
binary_vars = soil_type_vars + wilderness_area_vars
numerical_vars = ['Elevation', 'Aspect', 'Slope', 'Horizontal_Distance_To_Hydrology', 'Vertical_Distance_To_Hydrology', 'Horizontal_Distance_To_Roadways', 'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm', 'Horizontal_Distance_To_Fire_Points']
features = numerical_vars + binary_vars
target = 'Cover_Type'

In [None]:
dtypes = {
    'Id': np.int32,
    'Elevation': np.int16,
    'Aspect': np.int16,
    'Slope': np.int8,
    'Horizontal_Distance_To_Hydrology': np.int16,
    'Vertical_Distance_To_Hydrology': np.int16,
    'Horizontal_Distance_To_Roadways': np.int16,
    'Hillshade_9am': np.int16,
    'Hillshade_Noon': np.int16,
    'Hillshade_3pm': np.int16,
    'Horizontal_Distance_To_Fire_Points': np.int16,
    'Cover_Type': np.int8,
}
binary_vars_dtypes = {c: np.int8 for c in binary_vars}
dtypes.update(binary_vars_dtypes)

In [None]:
train = pd.read_csv('/kaggle/input/tabular-playground-series-dec-2021/train.csv', dtype=dtypes)
test = pd.read_csv('/kaggle/input/tabular-playground-series-dec-2021/test.csv', dtype=dtypes)

In [None]:
# Only for some quick tests.
train_50k = train.sample(n=50_000, random_state=42)

In [None]:
# Soil_Type7 and Soil_Type15 are all zeros, so it's safe to drop them.
features.remove('Soil_Type7')
features.remove('Soil_Type15')
# There is only one observation for Cover Type 5, so I drop it too.
train = train.query('Cover_Type != 5')
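
The claim that a column is all zeros can be checked directly instead of being carried over from the EDA. A minimal sketch on a toy frame; the frame `toy` and the helper name `constant_columns` are illustrative and not part of the notebook:

```python
import pandas as pd

def constant_columns(df):
    """Return the names of columns that hold a single distinct value."""
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]

# Toy stand-in for the one-hot soil columns; the values are made up.
toy = pd.DataFrame({
    'Soil_Type1': [0, 1, 0, 0],
    'Soil_Type7': [0, 0, 0, 0],
    'Soil_Type15': [0, 0, 0, 0],
})

# Columns with a single distinct value carry no information for the model.
print(constant_columns(toy))
```

Running the same helper on `train[binary_vars]` should list the columns that can be removed from `features`.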

## Parsing interesting soil properties

In [None]:
soil_descriptions = '''
1;  Cathedral family - Rock outcrop complex, extremely stony.
2;  Vanet - Ratake families complex, very stony.
3;  Haploborolis - Rock outcrop complex, rubbly.
4;  Ratake family - Rock outcrop complex, rubbly.
5;  Vanet family - Rock outcrop complex complex, rubbly.
6;  Vanet - Wetmore families - Rock outcrop complex, stony.
7;  Gothic family.
8;  Supervisor - Limber families complex.
9;  Troutville family, very stony.
10; Bullwark - Catamount families - Rock outcrop complex, rubbly.
11; Bullwark - Catamount families - Rock land complex, rubbly.
12; Legault family - Rock land complex, stony.
13; Catamount family - Rock land - Bullwark family complex, rubbly.
14; Pachic Argiborolis - Aquolis complex.
15; unspecified in the USFS Soil and ELU Survey.
16; Cryaquolis - Cryoborolis complex.
17; Gateview family - Cryaquolis complex.
18; Rogert family, very stony.
19; Typic Cryaquolis - Borohemists complex.
20; Typic Cryaquepts - Typic Cryaquolls complex.
21; Typic Cryaquolls - Leighcan family, till substratum complex.
22; Leighcan family, till substratum, extremely bouldery.
23; Leighcan family, till substratum - Typic Cryaquolls complex.
24; Leighcan family, extremely stony.
25; Leighcan family, warm, extremely stony.
26; Granile - Catamount families complex, very stony.
27; Leighcan family, warm - Rock outcrop complex, extremely stony.
28; Leighcan family - Rock outcrop complex, extremely stony.
29; Como - Legault families complex, extremely stony.
30; Como family - Rock land - Legault family complex, extremely stony.
31; Leighcan - Catamount families complex, extremely stony.
32; Catamount family - Rock outcrop - Leighcan family complex, extremely stony.
33; Leighcan - Catamount families - Rock outcrop complex, extremely stony.
34; Cryorthents - Rock land complex, extremely stony.
35; Cryumbrepts - Rock outcrop - Cryaquepts complex.
36; Bross family - Rock land - Cryumbrepts complex, extremely stony.
37; Rock outcrop - Cryumbrepts - Cryorthents complex, extremely stony.
38; Leighcan - Moran families - Cryaquolls complex, extremely stony.
39; Moran family - Cryorthents - Leighcan family complex, extremely stony.
40; Moran family - Cryorthents - Rock land complex, extremely stony.
'''

In [None]:
from io import StringIO
soil_desc_df = pd.read_csv(StringIO(soil_descriptions.lower()), sep=';', header=None, names=['soil_id', 'description'])
soil_desc_df.info()
soil_desc_df.head()

Below, I manually selected keywords from the descriptions that could have potential. Using pandas, I created one column per keyword, whose boolean values mark whether that keyword shows up in the soil description.

In [None]:
soil_keywords = [
    'cathedral',
    'vanet',
    'ratake',
    'haploborolis',
    'wetmore',
    'troutville',
    'bullwark',
    'rogert',
    'cryaquolis',
    'cryaquepts',
    'cryaquolls',
    'granile',
    'como',
    'catamount',
    'cryorthents',
    'cryumbrepts',
    'bross',
    'rock outcrop',
    'rock land',
    'aquolis',
    'pachic argiborolis',
    'till substratum',
    'leighcan',
    'moran',
    'gothic',
    'legault',
    'limber',
    'supervisor',
    'cryoborolis',
    'gateview',
    'borohemists',
    'warm',
    'rubbly',
    'very stony',
    'extremely stony',
    'extremely bouldery',
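    # 'stony' has to be the last keyword in this list. Each matched keyword is removed
    # from the description, so an earlier 'stony' would eat the word inside
    # 'very stony' and 'extremely stony'.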
    'stony',
]

In [None]:
soil_desc_df['empty_description'] = soil_desc_df['description']
for keyword in soil_keywords:
    soil_desc_df[keyword] = soil_desc_df['empty_description'].str.find(keyword) >= 0
    soil_desc_df['empty_description'] = soil_desc_df['empty_description'].str.replace(keyword, '', regex=False)
    
# Strip filler words and punctuation so that only unprocessed words remain.
for meaningless_word in ['family', 'families', 'complex', ',', '.', '-']:
    soil_desc_df['empty_description'] = soil_desc_df['empty_description'].str.replace(meaningless_word, '', regex=False)
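
Because each matched keyword is deleted from the description before the next one is tested, the order of `soil_keywords` changes the result. A small sketch in plain Python; the helper `flag_keywords` is hypothetical and only mirrors the loop above on a single string:

```python
def flag_keywords(description, keywords):
    """Flag each keyword found in `description`, removing it once matched."""
    remaining = description.lower()
    flags = {}
    for keyword in keywords:
        flags[keyword] = keyword in remaining
        remaining = remaining.replace(keyword, '')
    return flags, remaining

desc = 'Leighcan family, warm, extremely stony.'

# Specific phrase first: the bare "stony" flag stays False.
good, _ = flag_keywords(desc, ['leighcan', 'warm', 'extremely stony', 'stony'])

# Generic word first: "stony" is consumed, so "extremely stony" is never found.
bad, _ = flag_keywords(desc, ['leighcan', 'warm', 'stony', 'extremely stony'])

print(good)
print(bad)
```

With the specific phrases listed first, the stoniness flags end up mutually exclusive, which is what the `how_stony` feature further down relies on.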

Now I check whether I missed any important keywords. The pandas Series below shows, for each description, the words that remain after removing the keywords and the filler words I chose to ignore.

In [None]:
soil_desc_df['empty_description']

"typic" is an adjective that appears before some keywords. I don't think it's important. E.g., is there a difference between "Typic Cryaquolis" and "Cryaquolis"? My guess is that the word "typic" was simply suppressed in some instances, which is why I'm ignoring it. I could be wrong, though.

Next, I tried to infer which of these features could be relevant. I do so by keeping only the keywords that show up in more than one soil description. If only one soil type has a certain keyword, then it brings no value at all, since the new feature would be identical to the respective `Soil_TypeX` feature.

In [None]:
important_keywords = soil_desc_df[soil_keywords].sum().to_frame('count')
important_keywords = important_keywords.query('count > 1')
important_keywords.sort_values('count', ascending=False)
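
To illustrate the filter above, here is a toy sketch showing that a keyword found in a single description only duplicates one `Soil_TypeX` column, while a shared keyword groups several soil types. The frames `toy_desc` and `obs` are made up for the example, and every toy observation has exactly one soil type:

```python
import pandas as pd

# Toy keyword table: one row per soil type (a hand-picked subset of the descriptions).
toy_desc = pd.DataFrame(
    {'cathedral': [True, False, False], 'leighcan': [False, True, True]},
    index=pd.Index([1, 24, 25], name='soil_id'),
)

# Toy observations, each with exactly one soil type.
obs = pd.DataFrame({'soil_id': [24, 1, 25, 1]})
one_hot = pd.get_dummies(obs['soil_id'], prefix='Soil_Type', prefix_sep='').astype(bool)
joined = obs.join(toy_desc, on='soil_id')

# 'cathedral' occurs in one description, so it duplicates Soil_Type1.
same_as_single = bool((joined['cathedral'] == one_hot['Soil_Type1']).all())

# 'leighcan' is shared, so it equals the union of Soil_Type24 and Soil_Type25.
same_as_group = bool(
    (joined['leighcan'] == (one_hot['Soil_Type24'] | one_hot['Soil_Type25'])).all()
)

print(same_as_single, same_as_group)
```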

I think how stony the soil is could be treated as an ordinal feature, so I created the feature `how_stony` below. Since I am working with a tree-based model (Random Forest), I don't use it, but the idea is here in case it's useful to anyone. I display the dataframe below with some coloring to make the logic of the feature easier to check.

In [None]:
stony_binary_features = ['rubbly', 'stony', 'very stony', 'extremely stony', 'extremely bouldery']
soil_desc_df['how_stony'] = (
    soil_desc_df['rubbly'] * 1 +
    soil_desc_df['stony'] * 2 +
    soil_desc_df['very stony'] * 3 +
    soil_desc_df['extremely stony'] * 4 +
    soil_desc_df['extremely bouldery'] * 5
)
soil_desc_df[stony_binary_features + ['how_stony']].sort_values('how_stony') \
    .style.highlight_max(subset=stony_binary_features).background_gradient('YlGn', subset='how_stony')

## Joining with train and test datasets

In [None]:
soil_important_features = important_keywords.index.tolist()
soil_important_features

OK, here is where I noticed that joining these soil features with the main dataset would not be as straightforward as I thought. Some observations in the training and test sets have multiple soil types! I had totally forgotten about this. In the original dataset, each observation had only one soil type...

In [None]:
sns.countplot(x=train[soil_type_vars].sum(axis=1).rename('number_of_soil_types'));

For the observations that have two or more soil types, which row of soil features should we join them with? We could merge/blend the soil features of the multiple soil types... Here, I went with the simplest approach: I only join the observations that have exactly one soil type. Those with no soil type, or with two or more, will not get any soil features.

In [None]:
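def insert_soil_features(df, skip_soil_ids=()):
    # Recover a single soil id per row from the one-hot Soil_Type columns.
    soil_id = sum([df[f'Soil_Type{i}'] * i for i in range(1, 41) if i not in skip_soil_ids])
    # Rows without exactly one soil type get id 0, which matches no description.
    soil_id[df[soil_type_vars].sum(axis=1) != 1] = 0
    df['soil_id'] = soil_id  # Note: this adds the column to the input dataframe in place.
    df_with_soil_features = df.join(soil_desc_df.set_index('soil_id')[soil_important_features], on='soil_id')
    # Unmatched rows (soil_id == 0) get False for every soil feature.
    df_with_soil_features[soil_important_features] = df_with_soil_features[soil_important_features].fillna(False).astype(bool)
    return df_with_soil_features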

In [None]:
soil_id = sum([train[f'Soil_Type{i}'] * i for i in range(1, 41)])
soil_id[train[soil_type_vars].sum(axis=1) != 1] = 0
train['soil_id'] = soil_id
train_with_soil_features = train.join(soil_desc_df.set_index('soil_id')[soil_important_features], on='soil_id')
train_with_soil_features[soil_important_features] = train_with_soil_features[soil_important_features].fillna(False)

In [None]:
train_with_soil_features = insert_soil_features(train, skip_soil_ids=[7, 15])
test_with_soil_features = insert_soil_features(test, skip_soil_ids=[7, 15])

In [None]:
# Just checking if I'm getting what I expected.
with pd.option_context('display.max_columns', None):
    display(test_with_soil_features.drop_duplicates(subset=['soil_id']).set_index('soil_id').sort_index()[soil_type_vars + soil_important_features])
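
The join above leaves rows with several soil types without any soil features. One possible alternative, which this notebook does not use, is to blend: a row gets a keyword if any of its soil types has it. With the one-hot soil columns as a matrix, that is a matrix product followed by a threshold. A sketch on toy arrays, where all names and values are illustrative:

```python
import numpy as np

# Rows: observations; columns: three toy soil types (one-hot, several may be set).
soil_one_hot = np.array([
    [1, 0, 0],   # exactly one soil type
    [0, 1, 1],   # two soil types
    [0, 0, 0],   # no soil type at all
])

# Rows: the same three soil types; columns: two toy keywords.
soil_keyword_table = np.array([
    [1, 0],
    [0, 1],
    [1, 1],
])

# Entry (i, k) counts how many soil types of observation i carry keyword k.
keyword_counts = soil_one_hot @ soil_keyword_table

# "Any" blend: the keyword is set if at least one soil type has it.
blended = keyword_counts > 0
print(blended)
```

On the real data, the two matrices would presumably come from `train[soil_type_vars]` and `soil_desc_df[soil_important_features]`, which line up as long as the rows of `soil_desc_df` stay ordered by soil id 1 to 40. Whether the blend helps the model is untested here.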

## Random Forest (without soil features)

In [None]:
%%time
n_folds = 10
X = train[features]
y = train[target]
X_test = test[features]
n_labels = y.nunique()

kf = StratifiedKFold(n_splits=n_folds, random_state=42, shuffle=True)
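# np.float and np.int were removed in NumPy 1.24; the builtin types are equivalent.
fold_accuracies = np.empty(shape=(n_folds), dtype=float)
fold_preds_test = np.empty(shape=(n_folds, len(X_test)), dtype=int)
# Probabilities must be stored as floats; an integer dtype would truncate them to 0.
fold_proba_test = np.empty(shape=(n_folds, len(X_test), n_labels), dtype=float)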
for fold_i, (train_idx, test_idx) in enumerate(kf.split(X, y)):
    X_train, X_val = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[test_idx]

    model = RandomForestClassifier(n_jobs=6)
    model.fit(X_train, y_train)

    preds_val = model.predict(X_val)
    fold_accuracies[fold_i] = accuracy_score(y_val, preds_val)
    print(f'Fold: {fold_i}, accuracy: {fold_accuracies[fold_i]:.4f}')

    # Prediction on test set.
    fold_proba_test[fold_i] = model.predict_proba(X_test)
    fold_preds_test[fold_i] = model.predict(X_test)

In [None]:
print(f'CV accuracy: {fold_accuracies.mean()} +- {fold_accuracies.std()}')

In [None]:
mode_result = mode(fold_preds_test, axis=0)
mode_submission = test[['Id']].copy()
mode_submission[target] = mode_result.mode.ravel()
mode_submission.to_csv('submission_without_soil_features.csv', index=False)
mode_submission.head()
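
The loop above also collects `fold_proba_test`, but the submission only uses the majority vote over `fold_preds_test`. Averaging the class probabilities across folds (soft voting) is a common alternative. A minimal sketch on toy numbers, assuming the probabilities are stored as floats and that column `j` corresponds to `model.classes_[j]`; the arrays below are made up for illustration:

```python
import numpy as np

# Toy class labels with a gap, similar to the notebook after dropping Cover_Type 5.
classes = np.array([1, 2, 6])

# Shape (n_folds, n_test_rows, n_classes): 3 folds, 2 test rows, 3 classes.
fold_proba = np.array([
    [[0.90, 0.10, 0.00], [0.2, 0.2, 0.6]],
    [[0.40, 0.60, 0.00], [0.1, 0.3, 0.6]],
    [[0.45, 0.55, 0.00], [0.3, 0.3, 0.4]],
])

# Soft voting: average probabilities over folds, then take the most likely class.
soft_vote = classes[fold_proba.mean(axis=0).argmax(axis=1)]

def majority(labels):
    """Return the most frequent label (the smallest one on ties)."""
    values, counts = np.unique(labels, return_counts=True)
    return values[counts.argmax()]

# Hard voting: each fold predicts a label, then the most frequent label wins.
fold_preds = classes[fold_proba.argmax(axis=2)]
hard_vote = np.apply_along_axis(majority, 0, fold_preds)

print(soft_vote, hard_vote)
```

The two schemes can disagree: in the first toy row, one very confident fold outweighs two lukewarm ones under soft voting. Because the labels are not contiguous, the `argmax` index has to be mapped back through the class labels rather than used as `index + 1`.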

### Feature importance

In [None]:
pd.DataFrame(model.feature_importances_, columns=['importance'], index=features) \
    .sort_values('importance', ascending=False).style.bar()

## Random Forest (using soil features)

In [None]:
%%time
n_folds = 10
X = train_with_soil_features[features + soil_important_features]
y = train_with_soil_features[target]
X_test = test_with_soil_features[features + soil_important_features]
n_labels = y.nunique()

kf = StratifiedKFold(n_splits=n_folds, random_state=42, shuffle=True)
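# np.float and np.int were removed in NumPy 1.24; the builtin types are equivalent.
fold_accuracies = np.empty(shape=(n_folds), dtype=float)
fold_preds_test = np.empty(shape=(n_folds, len(X_test)), dtype=int)
# Probabilities must be stored as floats; an integer dtype would truncate them to 0.
fold_proba_test = np.empty(shape=(n_folds, len(X_test), n_labels), dtype=float)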
for fold_i, (train_idx, test_idx) in enumerate(kf.split(X, y)):
    X_train, X_val = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[test_idx]

    model = RandomForestClassifier(n_jobs=6)
    model.fit(X_train, y_train)

    preds_val = model.predict(X_val)
    fold_accuracies[fold_i] = accuracy_score(y_val, preds_val)
    print(f'Fold: {fold_i}, accuracy: {fold_accuracies[fold_i]:.4f}')

    # Prediction on test set.
    fold_proba_test[fold_i] = model.predict_proba(X_test)
    fold_preds_test[fold_i] = model.predict(X_test)

In [None]:
print(f'CV accuracy: {fold_accuracies.mean()} +- {fold_accuracies.std()}')

In [None]:
mode_result = mode(fold_preds_test, axis=0)
mode_submission = test[['Id']].copy()
mode_submission[target] = mode_result.mode.ravel()
mode_submission.to_csv('submission_with_soil_features.csv', index=False)
mode_submission.head()

### Feature importance

In [None]:
pd.DataFrame(model.feature_importances_, columns=['importance'], index=features + soil_important_features) \
    .sort_values('importance', ascending=False).style.bar()