## Forest Cover Type Prediction

Based on: https://www.kaggle.com/c/forest-cover-type-prediction

This notebook analyzes the use of Decision Tree and Random Forest modeling to predict forest cover type in Roosevelt National Forest in northern Colorado. The goal is to predict which of 7 cover types exists in a 30 m x 30 m plot based on various geographic and environmental variables. Exploratory Data Analysis (EDA) of the training data was performed in collaboration with my teammates in a separate notebook. 

In [1]:
%matplotlib inline
import sys
import copy
import re
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier

3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
1.14.3
0.23.0


### Importing and Creating Datasets

In [2]:
### Load CSV into Dataframe
train = pd.read_csv('train.csv', index_col = 0)

### Split the data into train, dev, test with an 80-10-10 ratio
np.random.seed(0) # random seed so we get the same shuffle

X = train.drop('Cover_Type', axis = 1)
Y = train['Cover_Type']

shuffle = np.random.permutation(np.arange(X.shape[0])) # shuffle the data
X, Y = X.iloc[shuffle], Y.iloc[shuffle]

train_data, train_labels = X[:12096], Y[:12096]        # the train set
dev_data, dev_labels = X[12096:13608], Y[12096:13608]  # the dev set
test_data, test_labels = X[13608:], Y[13608:]          # the test set
print("Training Label Distribution:\n", train_labels.groupby(train_labels).size())  # check label distribution

Training Label Distribution:
 Cover_Type
1    1724
2    1726
3    1737
4    1705
5    1735
6    1732
7    1737
Name: Cover_Type, dtype: int64
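
The hard-coded cut points 12096 and 13608 correspond to 80% and 90% of 15,120 rows. As a minimal sketch, the same 80-10-10 split could be derived from the row count instead. The helper name `split_80_10_10` and the toy frame are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def split_80_10_10(X, y, seed=0):
    """Shuffle rows, then cut into 80% train, 10% dev, 10% test."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(X))
    X, y = X.iloc[order], y.iloc[order]
    n = len(X)
    train_end = n * 8 // 10   # integer arithmetic avoids float rounding
    dev_end = n * 9 // 10
    return ((X.iloc[:train_end], y.iloc[:train_end]),
            (X.iloc[train_end:dev_end], y.iloc[train_end:dev_end]),
            (X.iloc[dev_end:], y.iloc[dev_end:]))

# Toy stand-in with the row count implied by the cut points above (hypothetical data).
toy_X = pd.DataFrame({"feature": np.arange(15120)})
toy_y = pd.Series(np.arange(15120) % 7 + 1, name="Cover_Type")

train_part, dev_part, test_part = split_80_10_10(toy_X, toy_y)
print(len(train_part[0]), len(dev_part[0]), len(test_part[0]))
```

Deriving the cut points from `len(X)` keeps the ratio intact if the input file changes size.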


# Decision Tree, Random Forest, & Boosting Modeling

For replicability of the results, we used random_state = 0 for all of our Decision Tree and Random Forest models.
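
As a small illustration of why a fixed random_state matters, the sketch below fits the same Random Forest twice with the same seed and checks that the predicted probabilities match exactly. The synthetic data from `make_classification` and the helper `fit_predict` are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the forest cover dataset).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

def fit_predict(seed):
    """Fit a small forest with the given seed and return class probabilities."""
    model = RandomForestClassifier(n_estimators=20, random_state=seed)
    return model.fit(X, y).predict_proba(X)

# Two fits with the same seed should agree exactly.
same = np.array_equal(fit_predict(0), fit_predict(0))
print("identical results with the same seed:", same)
```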

In [3]:
# deep copy train_data, dev_data, test_data
dt_train = copy.deepcopy(train_data)
dt_dev = copy.deepcopy(dev_data)
dt_test = copy.deepcopy(test_data)

# keeping track of experiments and accuracies

dt_accuracies = []
dt_experiments = []
rf_accuracies = []
rf_experiments = []

## Functions

In [4]:
def model_performance(model, train_data, train_labels, dev_data, dev_labels, metrics = True):
    
    """
    Fits the given model on the train data and labels and predicts on the dev data
    If metrics is True, prints the classification report, confusion matrix, and accuracy
    Returns the accuracy on the dev data
    """
    
    model.fit(train_data, train_labels)
    predictions = model.predict(dev_data)
    accuracy = np.mean(dev_labels == predictions)
    
    if metrics:
        print(classification_report(dev_labels, predictions))
        print(confusion_matrix(dev_labels, predictions))
        print("\naccuracy:", accuracy)
    
    return accuracy

def importance_table(model, data, sort = True):
    """ Create a dataframe of feature importances for a fitted decision tree or random forest model (sorted in descending order by default) """
    
    table = pd.DataFrame({'importance':model.feature_importances_}, index = data.columns)
    
    if sort:
        return table.sort_values(by = 'importance', axis = 0, ascending = False)
    return table

## Decision Trees

### Basic Decision Tree Model

First, we established a baseline with a simple Decision Tree model, changing the criterion to entropy. This resulted in 79.23% accuracy on the development data. From the confusion matrix, we saw that the model has the most trouble with cover types 1, 2, 3, and 6. In examining feature importance of the baseline model, we saw that Elevation, Horizontal Distance to Roadways, Horizontal Distance to Fire Points, and Horizontal Distance to Hydrology were ranked the highest.

In [5]:
# specifying entropy, default is gini
dtree1 = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
dt_accuracies.append(model_performance(dtree1, dt_train, train_labels, dt_dev, dev_labels))
dt_experiments.append('Basic Decision Tree')
importance_table(dtree1, dt_train, sort = True)

              precision    recall  f1-score   support

           1       0.68      0.63      0.66       216
           2       0.65      0.58      0.62       226
           3       0.76      0.73      0.75       203
           4       0.92      0.95      0.93       243
           5       0.84      0.93      0.88       198
           6       0.78      0.78      0.78       222
           7       0.87      0.95      0.91       204

   micro avg       0.79      0.79      0.79      1512
   macro avg       0.79      0.79      0.79      1512
weighted avg       0.79      0.79      0.79      1512

[[137  50   1   0   4   1  23]
 [ 54 132   3   1  24   7   5]
 [  0   4 149   9   4  37   0]
 [  0   0  10 230   0   3   0]
 [  2   7   3   0 184   1   1]
 [  0   8  30   9   2 173   0]
 [  8   1   0   0   2   0 193]]

accuracy: 0.7923280423280423


Feature,importance
Elevation,0.551508
Horizontal_Distance_To_Roadways,0.074367
Horizontal_Distance_To_Fire_Points,0.0606
Horizontal_Distance_To_Hydrology,0.053318
Hillshade_9am,0.046529
Vertical_Distance_To_Hydrology,0.026441
Hillshade_Noon,0.026102
Aspect,0.024384
Wilderness_Area1,0.022788
Hillshade_3pm,0.022064
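
The accuracy and the per-class figures in the classification report can be recovered from the baseline confusion matrix printed above. The sketch below copies that matrix (rows are true cover types 1-7, columns are predictions) and recomputes the numbers with NumPy; the variable names are illustrative.

```python
import numpy as np

# Baseline Decision Tree confusion matrix, copied from the output above.
cm = np.array([
    [137,  50,   1,   0,   4,   1,  23],
    [ 54, 132,   3,   1,  24,   7,   5],
    [  0,   4, 149,   9,   4,  37,   0],
    [  0,   0,  10, 230,   0,   3,   0],
    [  2,   7,   3,   0, 184,   1,   1],
    [  0,   8,  30,   9,   2, 173,   0],
    [  8,   1,   0,   0,   2,   0, 193],
])

accuracy = np.trace(cm) / cm.sum()           # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)        # per true class (row sums)
precision = np.diag(cm) / cm.sum(axis=0)     # per predicted class (column sums)

print("accuracy:", accuracy)
for cover_type, (p, r) in enumerate(zip(precision, recall), start=1):
    print(cover_type, round(p, 2), round(r, 2))
```

The largest off-diagonal counts sit between types 1 and 2 (50 and 54) and between types 3 and 6 (37 and 30), which is where the lowest recall values come from.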


### Decision Tree: Optimizing max_depth parameter

After running the baseline Decision Tree model, we sought to optimize parameters that could improve performance and reduce overfitting. First, we looked at the max_depth parameter, which controls the maximum depth of the tree that is generated. Through GridSearchCV, we found that 24 is the optimal max_depth, which "prunes" our Decision Tree by 30 levels. In using a model with max_depth = 24, we saw slight improvement in accuracy from 79.23% to 79.70%. At the same time, the feature importances ranking looked similar to that of the baseline model.
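
To see how the cross-validated score varies across candidate depths, rather than only reading `best_params_`, one option is to tabulate `cv_results_`. This is a minimal sketch on synthetic data with a small hypothetical grid; it searches the classifier directly instead of through a Pipeline and omits the `iid` argument, which newer scikit-learn releases no longer accept.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the forest cover dataset).
X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

depths = [2, 4, 8, 16]  # illustrative grid, much smaller than the notebook's
search = GridSearchCV(
    DecisionTreeClassifier(criterion="entropy", random_state=0),
    {"max_depth": depths},
    cv=5,
)
search.fit(X, y)

# One row per candidate depth with its mean and spread across the 5 folds.
curve = pd.DataFrame(search.cv_results_)[
    ["param_max_depth", "mean_test_score", "std_test_score"]
]
print(curve)
print(search.best_params_)
```

Looking at the whole curve helps judge whether the best depth is a clear optimum or one of several nearly equivalent settings.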

### Decision Tree: Optimizing max_features

Next, we examined the max_features parameter, which sets the maximum number of features to consider when looking for the best split. We found that 40 features was the optimal setting, which yielded an accuracy of 79.56%. Overall, in terms of accuracy, precision, recall, and F-1 scores, this version was comparable to the model with max_depth = 24. However, this model has more nodes than the max_depth = 24 model, so we opted for the more parsimonious model given the comparable performance.
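
Tree size can be read from a fitted model's `tree_` attribute, which is one way to compare how parsimonious two trees are. The sketch below uses synthetic data and an illustrative depth cap of 5, both of which are assumptions; it is not the comparison run in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the forest cover dataset).
X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=4, random_state=0)

full = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
capped = DecisionTreeClassifier(criterion="entropy", max_depth=5,
                                random_state=0).fit(X, y)

# tree_.max_depth is the depth actually reached; tree_.node_count counts
# internal nodes and leaves together.
for name, model in [("unrestricted", full), ("max_depth = 5", capped)]:
    fitted = model.tree_
    print(name, "depth:", fitted.max_depth, "nodes:", fitted.node_count)
```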

In [6]:
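# this can take 2-3 mins to run
# note: the iid argument used in the GridSearchCV calls was removed in scikit-learn 0.24; omit it on newer versions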
dtpl = Pipeline([('classifier', dtree1)])

param_grid = dict(classifier__max_depth = [n for n in range(1, (dt_train.shape[1] + 1))])

dtgs = GridSearchCV(dtpl, param_grid, iid = True, refit = True, cv=5, return_train_score=True).fit(dt_train, train_labels)
print(dtgs.best_params_)

dtree2 = DecisionTreeClassifier(criterion = 'entropy', max_depth = 24, random_state = 0)
dt_accuracies.append(model_performance(dtree2, dt_train, train_labels, dt_dev, dev_labels))
dt_experiments.append('Decision Tree with max_depth = 24')

param_grid = dict(classifier__max_features = [n for n in range(1, (dt_train.shape[1] + 1))])

dtgs = GridSearchCV(dtpl, param_grid, iid = True, refit = True, cv=5, return_train_score=True).fit(dt_train, train_labels)
print(dtgs.best_params_)

dtree3 = DecisionTreeClassifier(criterion = 'entropy', max_features = 40, random_state = 0)
dt_accuracies.append(model_performance(dtree3, dt_train, train_labels, dt_dev, dev_labels))
dt_experiments.append('Decision Tree with max_features = 40')

{'classifier__max_depth': 24}
              precision    recall  f1-score   support

           1       0.69      0.65      0.67       216
           2       0.68      0.60      0.63       226
           3       0.76      0.74      0.75       203
           4       0.92      0.95      0.93       243
           5       0.84      0.93      0.88       198
           6       0.78      0.78      0.78       222
           7       0.87      0.95      0.91       204

   micro avg       0.80      0.80      0.80      1512
   macro avg       0.79      0.80      0.79      1512
weighted avg       0.79      0.80      0.79      1512

[[140  47   1   0   4   1  23]
 [ 49 135   4   1  24   8   5]
 [  0   3 150   9   4  37   0]
 [  0   0  10 230   0   3   0]
 [  3   6   3   0 184   1   1]
 [  0   8  30   9   2 173   0]
 [ 10   1   0   0   0   0 193]]

accuracy: 0.796957671957672
{'classifier__max_features': 40}
              precision    recall  f1-score   support

           1       0.67      0.66     

### Decision Tree: Optimizing min_samples_split & min_samples_leaf

Next, we optimized min_samples_split, the minimum number of samples required to split an internal node, and min_samples_leaf, the minimum number of samples required in a leaf node. From these optimizations, we found that fractions closest to 0 for both min_samples_split and min_samples_leaf achieved the best results. For example, setting min_samples_split to 0.01 pruned the tree back too far, reducing accuracy to 74.21%. This version had only 411 nodes, compared to approximately 3233 for both the baseline and optimized max_depth trees. With min_samples_split = 0.001, the accuracy (78.90%) was still lower than that of the baseline model (79.23%). Similarly, with min_samples_leaf = 0.001, the accuracy achieved was 76.85%.
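
When these parameters are given as floats, scikit-learn interprets them as fractions of the training set and rounds up, so it helps to translate them into sample counts. The sketch below does that arithmetic for the 12,096 training rows used here; the helper name `min_samples_from_fraction` is hypothetical. A fraction of 0.001 works out to 13 samples and 0.01 to 121.

```python
import math

n_train = 12096  # rows in the training split used in this notebook

def min_samples_from_fraction(fraction, n_samples):
    """Sample count implied by a fractional setting: ceil(fraction * n_samples)."""
    return math.ceil(fraction * n_samples)

for fraction in (0.001, 0.01):
    count = min_samples_from_fraction(fraction, n_train)
    print(fraction, "->", count, "samples")
```

Seen this way, min_samples_split = 0.01 stops splitting any node with fewer than 121 samples, which is consistent with the much smaller tree reported above.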

In [7]:
# candidate fractions from 0.001 to 0.009 in steps of 0.002
param_grid = dict(classifier__min_samples_split = [n for n in np.arange(0.001, 0.01, 0.002)])

dtgs = GridSearchCV(dtpl, param_grid, iid = True, refit = True, cv=5, return_train_score=True).fit(dt_train, train_labels)
print(dtgs.best_params_)

dtree4 = DecisionTreeClassifier(criterion = 'entropy', min_samples_split = 0.001, random_state = 0)
dt_accuracies.append(model_performance(dtree4, dt_train, train_labels, dt_dev, dev_labels, metrics = False))
dt_experiments.append('Decision Tree with min_samples_split = 0.001')

param_grid = dict(classifier__min_samples_leaf = [n for n in np.arange(0.001, 0.01, 0.002)])

dtgs = GridSearchCV(dtpl, param_grid, iid = True, refit = True, cv=5, return_train_score=True).fit(dt_train, train_labels)
print(dtgs.best_params_)

dtree5 = DecisionTreeClassifier(criterion = 'entropy', min_samples_leaf = 0.001, random_state = 0)
dt_accuracies.append(model_performance(dtree5, dt_train, train_labels, dt_dev, dev_labels, metrics = False))
dt_experiments.append('Decision Tree with min_samples_leaf = 0.001')

{'classifier__min_samples_split': 0.001}
{'classifier__min_samples_leaf': 0.001}


## Random Forest Modeling

### Random Forest: Optimizing n_estimators

To determine an appropriate range for the parameter grid, we initially tried 10, 1000, and 1500 trees. With 10 trees, the model had an accuracy of approximately 82%. The models with 1000 and 1500 trees both had accuracies around 86%. Since training time grows with the number of trees in a Random Forest, we chose to search for an optimal number of estimators below 1000. After running GridSearchCV, we found that the optimal number of trees is 600.
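
One inexpensive way to probe how accuracy changes with the number of trees, before committing to a full grid search, is to grow a single forest incrementally with `warm_start=True`. This is a sketch on synthetic data with illustrative forest sizes; it is an alternative to, not a reproduction of, the GridSearchCV run described above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the forest cover dataset).
X, y = make_classification(n_samples=600, n_features=12, n_informative=6,
                           n_classes=3, random_state=0)
X_tr, X_dev, y_tr, y_dev = train_test_split(X, y, test_size=0.25, random_state=0)

forest = RandomForestClassifier(criterion="entropy", warm_start=True, random_state=0)
sizes = [10, 50, 100]  # illustrative sizes, far smaller than the notebook's grid
scores = []
for n in sizes:
    forest.set_params(n_estimators=n)
    forest.fit(X_tr, y_tr)  # with warm_start, only the additional trees are trained
    scores.append(forest.score(X_dev, y_dev))

for n, score in zip(sizes, scores):
    print(n, "trees -> dev accuracy", round(score, 4))
```

Because earlier trees are reused, each step costs only the added trees, which makes it practical to scan a wide range of sizes.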

We also examined the feature importances from the Random Forest model and compared them to those of our Decision Tree model. The feature importance of Elevation was reduced from approximately 0.55 to 0.26, and soil types that previously had no importance gained some in this model. This may suggest that while Elevation is important, it is overemphasized in our Decision Tree model. Additionally, the importance of Wilderness_Area4 rose to approximately 0.08, whereas in the Decision Tree model its importance was less than 0.001.
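
A side-by-side table makes this kind of comparison easier to read than two separate importance listings. The sketch below builds one on synthetic data; the feature names `f0`..`f7` and the model sizes are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the forest cover dataset).
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)
cols = ["f" + str(i) for i in range(X.shape[1])]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=50, criterion="entropy",
                                random_state=0).fit(X, y)

# Importances of each model sum to 1, so the columns are directly comparable.
comparison = pd.DataFrame(
    {"tree": tree.feature_importances_, "forest": forest.feature_importances_},
    index=cols,
)
comparison["change"] = comparison["forest"] - comparison["tree"]
print(comparison.sort_values("forest", ascending=False))
```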

### Random Forest: Optimizing for max_depth

Given the results of our parameter optimization for Decision Trees, we chose to focus on optimizing max_depth for our Random Forest model. A lower max_depth may also help reduce training time given the optimal number of estimators. The optimal max_depth for Random Forest was found to be 35, which is somewhat higher than that of the Decision Tree model. The model with max_depth = 35 reached an accuracy of 86.18%, only a slight improvement over the baseline Random Forest model at 86.11%.

In [8]:
# note: gridsearch can take 10-15 mins to run
#rfpl = Pipeline([('classifier', RandomForestClassifier(criterion = 'entropy', random_state = 0))])

#param_grid = dict(classifier__n_estimators = [n for n in range(100, 1000, 100)])

#rfgs = GridSearchCV(rfpl, param_grid, iid = True, refit = True, cv=5, return_train_score=True).fit(dt_train, train_labels)
#print(rfgs.best_params_)

rfc1 = RandomForestClassifier(n_estimators = 600, criterion = 'entropy', random_state = 0)
rf_accuracies.append(model_performance(rfc1, dt_train, train_labels, dt_dev, dev_labels))
rf_experiments.append('Random Forest n_estimators = 600')
importance_table(rfc1, dt_train, sort = True)

              precision    recall  f1-score   support

           1       0.76      0.74      0.75       216
           2       0.78      0.69      0.73       226
           3       0.87      0.83      0.85       203
           4       0.92      0.98      0.95       243
           5       0.87      0.97      0.92       198
           6       0.86      0.85      0.86       222
           7       0.94      0.97      0.95       204

   micro avg       0.86      0.86      0.86      1512
   macro avg       0.86      0.86      0.86      1512
weighted avg       0.86      0.86      0.86      1512

[[160  39   0   0   6   0  11]
 [ 44 156   1   0  18   5   2]
 [  0   0 168  11   3  21   0]
 [  0   0   2 238   0   3   0]
 [  0   2   1   0 193   2   0]
 [  0   2  21   9   1 189   0]
 [  6   0   0   0   0   0 198]]

accuracy: 0.8611111111111112


Feature,importance
Elevation,0.262381
Horizontal_Distance_To_Roadways,0.088981
Wilderness_Area4,0.078875
Horizontal_Distance_To_Fire_Points,0.064422
Horizontal_Distance_To_Hydrology,0.049385
Vertical_Distance_To_Hydrology,0.043686
Hillshade_9am,0.041909
Aspect,0.039704
Hillshade_3pm,0.036511
Hillshade_Noon,0.036418


In [9]:
# gridsearch can take 7-10 mins to run
#rfpl = Pipeline([('classifier', rfc1)])
#param_grid = dict(classifier__max_depth = [n for n in range(10, 55, 5)])

#rfgs2 = GridSearchCV(rfpl, param_grid, iid = True, refit = True, cv=5, return_train_score=True).fit(dt_train, train_labels)
#print(rfgs2.best_params_)

rfc2 = RandomForestClassifier(n_estimators = 600, max_depth = 35, criterion = 'entropy', random_state = 0)
rf_accuracies.append(model_performance(rfc2, dt_train, train_labels, dt_dev, dev_labels))
rf_experiments.append('Random Forest n_estimators = 600, max_depth = 35')

              precision    recall  f1-score   support

           1       0.76      0.74      0.75       216
           2       0.79      0.69      0.74       226
           3       0.87      0.83      0.85       203
           4       0.92      0.98      0.95       243
           5       0.87      0.97      0.92       198
           6       0.86      0.86      0.86       222
           7       0.94      0.97      0.95       204

   micro avg       0.86      0.86      0.86      1512
   macro avg       0.86      0.86      0.86      1512
weighted avg       0.86      0.86      0.86      1512

[[160  39   0   0   6   0  11]
 [ 44 156   1   0  18   5   2]
 [  0   0 168  11   3  21   0]
 [  0   0   2 238   0   3   0]
 [  0   2   1   0 193   2   0]
 [  0   1  21   9   1 190   0]
 [  6   0   0   0   0   0 198]]

accuracy: 0.8617724867724867


In [10]:
### FINAL RESULTS

# Best Decision Tree

best_dt = DecisionTreeClassifier(criterion = 'entropy', max_depth = 24, random_state = 0)

# Best Random Forest

best_rf = RandomForestClassifier(criterion = 'entropy', n_estimators = 600, max_depth = 35,  random_state = 0)

## Feature Selection

- Summary
    - Dropping Slope and Aspect
        - Reasoning: Hillshade is calculated using Slope and Aspect so removing Slope and Aspect may make our model more parsimonious without information loss. Additionally, of the continuous variables, Slope and Aspect were among those with the least feature importance based on our best Decision Tree and Random Forest models.
    - Dropping soil types that had little to no feature importance
        - Reasoning: In our best Decision Tree model, 9 soil types had less than 0.0001 feature importance with 7 having no importance. In our best Random Forest model, 7 soil types had less than 0.0001 feature importance with 3 having no importance.
    - Dropping Slope, Aspect, and Unimportant Soil Types
        - This led us to examine how these models would perform if all of these variables were excluded, reducing the dataset to 43 columns. We did not see much improvement for the Decision Tree model, but the Random Forest model's accuracy increased to 87.04% and the F-1 scores increased for cover types 1, 2, 3, 6, and 7, some of which are the most confused in other models.

### Feature Selection: Dropping Slope and Aspect in Dataset

The Decision Tree model (optimized for max_depth) trained on the dataset excluding Slope and Aspect had an accuracy of 78.77%, about one percentage point below that of the Decision Tree model trained on the full dataset (79.70%). In comparing the confusion matrices, the model trained on the smaller dataset also confused cover types 1 and 2 more often than the model trained on the full dataset. The Random Forest model (optimized for max_depth), by contrast, achieved slightly better accuracy at 86.84% when trained on this smaller dataset. Below, we highlight the results from the Random Forest model.

In [11]:
# removing Slope and Aspect from the train and dev data
drop_col1 = ['Slope', 'Aspect']
dt_train2 = dt_train.drop(drop_col1, axis = 1)
dt_dev2 = dt_dev.drop(drop_col1, axis = 1)

# GridSearchCV suggested max_depth = 20 for this dataset; the run below reuses best_dt (max_depth = 24)
dt_accuracies.append(model_performance(best_dt, dt_train2, train_labels, dt_dev2, dev_labels, metrics = False))
dt_experiments.append('Feature Selection: No Slope, Aspect')
# GridSearchCV suggested max_depth = 40 for this dataset; the run below reuses best_rf (max_depth = 35)
rf_accuracies.append(model_performance(best_rf, dt_train2, train_labels, dt_dev2, dev_labels))
rf_experiments.append('Feature Selection: No Slope, Aspect')

              precision    recall  f1-score   support

           1       0.77      0.75      0.76       216
           2       0.78      0.70      0.74       226
           3       0.88      0.84      0.86       203
           4       0.93      0.98      0.96       243
           5       0.89      0.97      0.93       198
           6       0.86      0.86      0.86       222
           7       0.95      0.98      0.96       204

   micro avg       0.87      0.87      0.87      1512
   macro avg       0.87      0.87      0.87      1512
weighted avg       0.87      0.87      0.87      1512

[[161  42   0   0   5   0   8]
 [ 42 159   1   0  15   6   3]
 [  0   0 170  10   2  21   0]
 [  0   0   1 239   0   3   0]
 [  0   2   1   0 193   2   0]
 [  0   1  20   8   1 192   0]
 [  5   0   0   0   0   0 199]]

accuracy: 0.8683862433862434


### Feature Selection: Dropping the soil types that have little to no importance

In examining feature importances of the Decision Tree and Random Forest models, we saw that some soil types continued to show little or no importance. We examined the effect of dropping those specific soil types and retraining on this smaller dataset and saw little change for either model. For the Decision Tree model, the accuracy remained at 79.70%, and for the Random Forest model, the accuracy dipped slightly to 86.04% (from 86.18%). Below, we highlight the results for retraining the Random Forest model.

In [12]:
# this list is from the feature importances table from the best random forest
rf_table = importance_table(best_rf.fit(dt_train, train_labels), dt_train, sort = True)
drop_col2 = rf_table.index[rf_table['importance'] < 0.0001].tolist()

dt_train3 = dt_train.drop(drop_col2, axis = 1)
dt_dev3 = dt_dev.drop(drop_col2, axis = 1)

dt_accuracies.append(model_performance(best_dt, dt_train3, train_labels, dt_dev3, dev_labels, metrics = False))
dt_experiments.append('Feature Selection: No Unimporant Soil Types')
rf_accuracies.append(model_performance(best_rf, dt_train3, train_labels, dt_dev3, dev_labels))
rf_experiments.append('Feature Selection: No Unimporant Soil Types')

              precision    recall  f1-score   support

           1       0.76      0.73      0.75       216
           2       0.78      0.69      0.73       226
           3       0.87      0.82      0.85       203
           4       0.92      0.98      0.95       243
           5       0.88      0.97      0.92       198
           6       0.86      0.86      0.86       222
           7       0.94      0.97      0.95       204

   micro avg       0.86      0.86      0.86      1512
   macro avg       0.86      0.86      0.86      1512
weighted avg       0.86      0.86      0.86      1512

[[158  42   0   0   5   0  11]
 [ 43 157   1   0  18   5   2]
 [  0   0 167  11   3  22   0]
 [  0   0   3 237   0   3   0]
 [  0   2   1   0 193   2   0]
 [  0   1  20   9   1 191   0]
 [  6   0   0   0   0   0 198]]

accuracy: 0.8604497354497355


### Feature Selection: Dropping Slope, Aspect, & Unimportant Soil Types

Based on the results of the previous experiments, we examined the impact of removing the combination of these columns from the dataset and retraining both types of models. This resulted in 79.50% accuracy for the Decision Tree model, slightly below the 79.70% obtained on the full dataset. On the other hand, we saw an increase in accuracy for the Random Forest model to 87.04%.

In [13]:
drop_col3 = drop_col1 + drop_col2
dt_train4 = dt_train.drop(drop_col3, axis = 1)
dt_dev4 = dt_dev.drop(drop_col3, axis = 1)

dt_accuracies.append(model_performance(best_dt, dt_train4, train_labels, dt_dev4, dev_labels, metrics = False))
dt_experiments.append('Feature Selection: No Slope, Aspect, & Unimportant Soil Types')
rf_accuracies.append(model_performance(best_rf, dt_train4, train_labels, dt_dev4, dev_labels))
rf_experiments.append('Feature Selection: No Slope, Aspect, & Unimportant Soil Types')

              precision    recall  f1-score   support

           1       0.78      0.75      0.77       216
           2       0.79      0.70      0.74       226
           3       0.88      0.84      0.86       203
           4       0.93      0.98      0.95       243
           5       0.89      0.97      0.93       198
           6       0.86      0.86      0.86       222
           7       0.94      0.98      0.96       204

   micro avg       0.87      0.87      0.87      1512
   macro avg       0.87      0.87      0.87      1512
weighted avg       0.87      0.87      0.87      1512

[[163  39   0   0   5   0   9]
 [ 42 159   2   0  16   4   3]
 [  0   0 171  10   1  21   0]
 [  0   0   1 239   0   3   0]
 [  0   2   1   0 193   2   0]
 [  0   1  19   9   1 192   0]
 [  5   0   0   0   0   0 199]]

accuracy: 0.8703703703703703


## Feature Engineering: New Soil Type Features

By far, the soil types contribute the most to the high dimensionality of the train dataset since there are 40 types. From experimenting with feature selection, we know that there is some information value to the soil types, but there may be a way to reduce the dimensions without loss of information. First, we ran our best Random Forest model on just the soil types as a way to prioritize the importance of the soil types. 

The original problem description also came with information on each soil type. We did a basic analysis of the words/terms in these descriptions to create new one-hot encoded features such as stony, rubbly, Leighcan, Catamount, complex, etc. These denote soil qualities and the common families the soil types share. As we have seen with other models, some soil types have little to no importance, so we excluded these from the analysis to keep the number of dimensions lower. Through this, we were able to describe the soil types with 33 columns compared to the original 40. We also tested a version of the new features created from just the most important soil types, which reduced the columns to 23. Both versions performed comparably to the original dataset, which still performed slightly better.
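
The idea of turning description terms into indicator columns can be seen on a tiny example. The three descriptions below are illustrative stand-ins in the assumed "number, then text" layout, and the term list is hand-picked; neither is taken from the actual soil_types.csv file.

```python
import re
import pandas as pd

# Illustrative descriptions (assumed layout: soil type number followed by text).
toy = pd.DataFrame({0: [
    "1 Cathedral family - Rock outcrop complex, extremely stony.",
    "2 Vanet - Ratake families complex, very stony.",
    "3 Haploborolis - Rock outcrop complex, rubbly.",
]})

# Use the leading number as the soil category index.
toy["Soil_Cat"] = toy[0].apply(lambda s: int(re.findall(r"\d+", s)[0]))
toy = toy.set_index("Soil_Cat")

# One indicator column per term: 1 if the term occurs in the description.
terms = ["stony", "rubbly", "complex", "Rock outcrop"]
for term in terms:
    toy[term] = toy[0].str.contains(re.escape(term)).astype(int)

indicators = toy.drop(columns=0)
print(indicators)
```

Soil types that share a term end up sharing a column, which is how a handful of term columns can stand in for many one-hot soil type columns.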

In [14]:
# create datasets with just the soil types
dt_train_justsoil = dt_train.drop(list(train_data.columns[:14]), axis = 1)
dt_dev_justsoil = dt_dev.drop(list(dev_data.columns[:14]), axis = 1)

best_rf_soils = RandomForestClassifier(n_estimators = 600, max_depth = 35, criterion = 'entropy', random_state = 0).fit(dt_train_justsoil, train_labels)
# get the feature importance of the soil types
rf_importance_soil = importance_table(best_rf_soils, dt_train_justsoil, sort = True)
# we will use the mean of the feature importance to set the threshold of which soil types are important
importance_threshold = np.mean(rf_importance_soil['importance'])

# get the unimportant soil types
remove_soils = rf_importance_soil.index[rf_importance_soil['importance'] < 0.001].tolist()
remove_soil_ind = []
# get the indices for these soil types
for soil in remove_soils:
    remove_soil_ind.append(int(soil.split('Soil_Type')[1]) - 1)

# get the most important soil types
priority_soils = rf_importance_soil.index[rf_importance_soil['importance'] > importance_threshold].tolist()
priority_soil_ind = []
# get the indices for these soil types
for soil in priority_soils:
    priority_soil_ind.append(int(soil.split('Soil_Type')[1]) - 1)

# read in the soil type info
soils_analysis = pd.read_csv('soil_types.csv', header = None)
# drop the unimportant soil types so they won't be included in the description analysis
soils_analysis_all = soils_analysis.drop(remove_soil_ind, axis = 0)
# get only the most important soil types
soils_analysis_priority = soils_analysis.iloc[priority_soil_ind]

In [15]:
def get_terms(soil_desc):
    """ Takes in a dataframe of soil descriptions (in column 0) and builds a dict of word and word count """
    terms_dict = {}
    # soil type numbers (1-40) are labels, not descriptive terms
    soil_numbers = {str(n) for n in range(1, 41)}

    for soil in soil_desc[0]:
        # strip punctuation before splitting into words
        soil = re.sub(r'[^\w\s]','', soil)
        words = soil.split()

        for word in words:
            if word in soil_numbers:
                continue
            terms_dict[word] = terms_dict.get(word, 0) + 1
    
    return terms_dict

def filter_terms(terms_dict):
    """ Takes in the terms dictionary, removes unnecessary words, and adds in common two word terms """
    
    remove_terms = ['family', 'Rock', 'outcrop', 'extremely', 'families', 'very', 'land', 'till', 'substratum']

    # keep every term that is not in the removal list
    terms = [term for term in terms_dict.keys() if term not in remove_terms]
    # add back in two word terms
    terms = terms + ['Rock land', 'Rock outcrop', 'till substratum']
    
    return terms

def new_soil_feats(terms, soils_df, train_data, dev_data):
    
    """ Creates new dataset with added soil features """
    soils_df['Soil_Cat'] = soils_df[0].apply(lambda s: int(re.findall(r'\d+', s)[0]))
    soils_df.set_index('Soil_Cat', inplace = True)
    
    for term in terms:

        soils_df[term] = soils_df[0].apply(lambda s: int(len(re.findall(term, s)) == 1))

    # deep copy the train_data and dev_data
    fe_train = copy.deepcopy(train_data)
    fe_dev = copy.deepcopy(dev_data)
    
    # set up soil_cat for the merge
    fe_train['Soil_Cat'] = 0
    soil_names = ['Soil_Type' + str(i) for i in range(1, 41)]
    for i, name in enumerate(soil_names, 1):
        fe_train.loc[fe_train[name] == 1, 'Soil_Cat'] = i

    fe_dev['Soil_Cat'] = 0
    soil_names = ['Soil_Type' + str(i) for i in range(1, 41)]
    for i, name in enumerate(soil_names, 1):
        fe_dev.loc[fe_dev[name] == 1, 'Soil_Cat'] = i
    
    fe_train.drop(list(fe_train.columns[14:54]), axis = 1, inplace = True)
    fe_dev.drop(list(fe_dev.columns[14:54]), axis = 1, inplace = True)
    
    # create copy of train_dataset with these additional soil type dummy variables
    fe_train_sg = fe_train.merge(soils_df.drop(0, axis = 1), right_index = True, left_on = 'Soil_Cat', how='left')
    # take out unnecessary columns
    fe_train_sg.drop('Soil_Cat', axis = 1, inplace = True)

    fe_dev_sg = fe_dev.merge(soils_df.drop(0, axis = 1), right_index = True, left_on = 'Soil_Cat', how='left')
    fe_dev_sg.drop('Soil_Cat', axis = 1, inplace = True)
    
    return fe_train_sg, fe_dev_sg

# deep copy the soils analysis dataframe
soils_df_all = copy.deepcopy(soils_analysis)
soils_df_priority = copy.deepcopy(soils_analysis)

all_terms_dict = get_terms(soils_analysis_all)
priority_terms_dict = get_terms(soils_analysis_priority)
all_terms = filter_terms(all_terms_dict)
priority_terms = filter_terms(priority_terms_dict)

fe_train_all, fe_dev_all = new_soil_feats(all_terms, soils_df_all, dt_train, dt_dev)
fe_train_priority, fe_dev_priority= new_soil_feats(priority_terms, soils_df_priority, dt_train, dt_dev)

# gridsearch found that 900 estimators was optimal for both datasets
best_rf_fe = RandomForestClassifier(n_estimators = 900, criterion = 'entropy', random_state = 0)

print("Random Forest with New Soil Features from All Soil Types:")
rf_accuracies.append(model_performance(best_rf_fe, fe_train_all, train_labels, fe_dev_all, dev_labels))
rf_experiments.append('Feature Engineering: New Soil Features From All')
print("Random Forest with New Soil Features from Most Important Soil Types:")
rf_accuracies.append(model_performance(best_rf_fe, fe_train_priority, train_labels, fe_dev_priority, dev_labels))
rf_experiments.append('Feature Engineering: New Soil Features From Important Soil Types')

Random Forest with New Soil Features from All Soil Types:
              precision    recall  f1-score   support

           1       0.75      0.71      0.73       216
           2       0.76      0.70      0.73       226
           3       0.87      0.85      0.86       203
           4       0.93      0.98      0.95       243
           5       0.89      0.97      0.93       198
           6       0.87      0.86      0.87       222
           7       0.92      0.97      0.94       204

   micro avg       0.86      0.86      0.86      1512
   macro avg       0.86      0.86      0.86      1512
weighted avg       0.86      0.86      0.86      1512

[[154  45   0   0   4   0  13]
 [ 44 158   1   0  16   4   3]
 [  0   0 172   9   3  19   0]
 [  0   0   3 237   0   3   0]
 [  0   2   1   0 193   2   0]
 [  0   2  20   9   0 191   0]
 [  7   0   0   0   0   0 197]]

accuracy: 0.8611111111111112
Random Forest with New Soil Features from Most Important Soil Types:
              precision    r

## Decision Tree & Random Forest Final Result

Our experiments have shown that Random Forest outperforms Decision Tree for this problem, which is not unexpected given that Random Forest is a meta estimator that uses many Decision Trees. After various parameter optimization, feature selection, and feature engineering experiments, we find that the optimized Random Forest model with 600 trees and max_depth = 35, trained on the dataset without Slope, Aspect, and unimportant soil types, yields the best results on the development data.

When predicting on the test data, applying this feature selection increased accuracy by approximately 2-3% for both our best Decision Tree and Random Forest models compared to using the untransformed data. Please see the tables below for a summary of our experiments and final results.

In [16]:
# Apply the selected feature set (drop the columns listed in drop_col3) to every split
dt_train_final = dt_train.drop(drop_col3, axis = 1)
dt_dev_final = dt_dev.drop(drop_col3, axis = 1)
dt_test_final = dt_test.drop(drop_col3, axis = 1)

# Best Decision Tree on the test set, before and after feature selection
dt_accuracies.append(model_performance(best_dt, dt_train, train_labels, dt_test, test_labels, metrics = False))
dt_experiments.append('Best Model with Untransformed Test Data')
dt_accuracies.append(model_performance(best_dt, dt_train_final, train_labels, dt_test_final, test_labels, metrics = False))
dt_experiments.append('Best Model with Transformed Test Data')

# Best Random Forest on the test set, before and after feature selection
rf_accuracies.append(model_performance(best_rf, dt_train, train_labels, dt_test, test_labels, metrics = False))
rf_experiments.append('Best Model with Untransformed Test Data')
rf_accuracies.append(model_performance(best_rf, dt_train_final, train_labels, dt_test_final, test_labels, metrics = False))
rf_experiments.append('Best Model with Transformed Test Data')

dt_results = pd.DataFrame({'Experiment':dt_experiments, 'Accuracy':dt_accuracies})
dt_results.set_index('Experiment', inplace = True)

rf_results = pd.DataFrame({'Experiment':rf_experiments, 'Accuracy':rf_accuracies})
rf_results.set_index('Experiment', inplace = True)

In [18]:
print("Decision Tree Modeling Summary")
dt_results

Decision Tree Modeling Summary


Experiment,Accuracy
Basic Decision Tree,0.792328
Decision Tree with max_depth = 24,0.796958
Decision Tree with max_features = 40,0.795635
Decision Tree with min_samples_split = 0.001,0.789021
Decision Tree with min_samples_leaf = 0.001,0.768519
"Feature Selection: No Slope, Aspect",0.787698
Feature Selection: No Unimporant Soil Types,0.796958
"Feature Selection: No Slope, Aspect, & Unimportant Soil Types",0.794974
Best Model with Untransformed Test Data,0.80754
Best Model with Transformed Test Data,0.828704


In [19]:
print("Random Forest Modeling Summary")
rf_results

Random Forest Modeling Summary


Experiment,Accuracy
Random Forest n_estimators = 600,0.861111
"Random Forest n_estimators = 600, max_depth = 35",0.861772
"Feature Selection: No Slope, Aspect",0.868386
Feature Selection: No Unimporant Soil Types,0.86045
"Feature Selection: No Slope, Aspect, & Unimportant Soil Types",0.87037
Feature Engineering: New Soil Features From All,0.861111
Feature Engineering: New Soil Features From Important Soil Types,0.861772
Best Model with Untransformed Test Data,0.876984
Best Model with Transformed Test Data,0.892196
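
The effect of feature selection on test accuracy can be read from the last two rows of each summary table above. As a small worked example (the variable names are illustrative), the gain in percentage points is:

```python
# Test-set accuracies copied from the two summary tables above
dt_untransformed, dt_transformed = 0.807540, 0.828704
rf_untransformed, rf_transformed = 0.876984, 0.892196

# Absolute gain, expressed in percentage points
dt_gain = (dt_transformed - dt_untransformed) * 100
rf_gain = (rf_transformed - rf_untransformed) * 100

print(f"Decision Tree gain: {dt_gain:.2f} percentage points")
print(f"Random Forest gain: {rf_gain:.2f} percentage points")
```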


## AdaBoost

In [17]:
def adaboost_optimzer(best_model, estimators, learning_rates, transformed_train, transformed_dev):
    """Grid-search AdaBoost over n_estimators and learning_rate, scored on the dev set.

    best_model        - best version of the model to boost
    estimators        - list of integers to pass as the n_estimators argument
    learning_rates    - list of learning rates to pass as the learning_rate argument
    transformed_train - the version of train_data, after feature engineering, used with best_model
    transformed_dev   - the version of dev_data, after feature engineering, used with best_model

    Uses the notebook-level train_labels and dev_labels.
    """
    results = []
    for n in estimators:
        for lr in learning_rates:
            # Note: base_estimator was renamed to estimator in scikit-learn 1.2
            abc = AdaBoostClassifier(base_estimator=best_model, n_estimators=n, learning_rate=lr, random_state = 0)
            abc.fit(transformed_train, train_labels)
            accuracy = abc.score(transformed_dev, dev_labels)

            results.append([n, lr, accuracy])

    results_t = pd.DataFrame(results, columns = ["Maximum Number of Estimators", "Learning Rates", "Accuracy for Adaboost"])
    # Best-scoring combinations first
    return results_t.sort_values(by = 'Accuracy for Adaboost', ascending = False)

In [20]:
learning_rates = [0.001, 0.01, 0.1, 1]
estimators = list(range(100, 500, 100))

adaboost_optimzer(best_dt, estimators, learning_rates, dt_train, dt_dev)

Index,Maximum Number of Estimators,Learning Rates,Accuracy for Adaboost
2,100,0.1,0.806878
6,200,0.1,0.806878
10,300,0.1,0.806878
14,400,0.1,0.806878
0,100,0.001,0.80291
4,200,0.001,0.80291
8,300,0.001,0.80291
12,400,0.001,0.80291
1,100,0.01,0.79828
5,200,0.01,0.79828


In [21]:
learning_rates = [0.001, 0.01, 0.1, 1]
estimators = list(range(100, 500, 100))

adaboost_optimzer(best_dt, estimators, learning_rates, dt_train_final, dt_dev_final)

Index,Maximum Number of Estimators,Learning Rates,Accuracy for Adaboost
0,100,0.001,0.795635
1,100,0.01,0.795635
2,100,0.1,0.795635
3,100,1.0,0.795635
4,200,0.001,0.795635
5,200,0.01,0.795635
6,200,0.1,0.795635
7,200,1.0,0.795635
8,300,0.001,0.795635
9,300,0.01,0.795635


#### Voting Classifier: Multi-Model Ensemble

Using our two best models, SVM and Random Forest, we created a Voting Classifier with soft voting, which combines the class probabilities predicted by each model. Surprisingly, the Voting model was much more accurate on unscaled data, even though scaling is normally necessary for SVM. As we saw in our EDA and in other models, Elevation carries a lot of predictive importance. Since Elevation has the largest range among the unscaled features, it is likely overemphasized in the SVM model, which may contribute to the increased accuracy in this case. The model achieved an accuracy of 89.08% on the unscaled development dataset in conjunction with our feature selection method of excluding Slope, Aspect, and unimportant soil types.
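
Soft voting with equal weights averages the class probabilities predicted by each model and picks the class with the highest average. A minimal sketch with made-up probabilities for three samples and three classes (the numbers are illustrative, not output from our models):

```python
import numpy as np

# Hypothetical predicted probabilities (rows = samples, columns = classes)
svm_proba = np.array([[0.6, 0.3, 0.1],
                      [0.1, 0.8, 0.1],
                      [0.5, 0.3, 0.2]])
rf_proba = np.array([[0.5, 0.4, 0.1],
                     [0.2, 0.35, 0.45],
                     [0.2, 0.7, 0.1]])

# Each model's own prediction
svm_vote = svm_proba.argmax(axis=1)
rf_vote = rf_proba.argmax(axis=1)

# Soft vote: average the probabilities, then take the most probable class
avg_proba = (svm_proba + rf_proba) / 2
soft_vote = avg_proba.argmax(axis=1)

print("SVM :", svm_vote)
print("RF  :", rf_vote)
print("Soft:", soft_vote)
```

When the two models disagree, the more confident one wins, which is why soft voting works best when the predicted probabilities are well calibrated.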

On the unscaled test dataset, the Voting Classifier achieved 88.82% accuracy, slightly lower than the 89.22% of the single Random Forest model. In particular, it misclassified cover type 1 more often. However, it had better F1 scores for cover types 5 and 6 and better accuracy for cover types 2 and 3. In future iterations, this ensemble may be strengthened by including other well-calibrated models.
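
The scaled-versus-unscaled comparison discussed in this section could be probed further by scaling features only for the SVM while the Random Forest keeps the raw values. The following is a sketch of that idea on synthetic data. It is not what this notebook ran, and the pipeline, the synthetic dataset, and the inflated first column (a stand-in for a wide-range feature such as Elevation) are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data with one feature on a much larger scale than the rest
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X[:, 0] *= 1000

# Only the SVM sees standardized features; the forest receives the raw values
scaled_svm = make_pipeline(StandardScaler(),
                           SVC(kernel="rbf", probability=True, random_state=0))
forest = RandomForestClassifier(n_estimators=50, random_state=0)

voting = VotingClassifier([("svm", scaled_svm), ("rf", forest)], voting="soft")
voting.fit(X, y)

proba = voting.predict_proba(X)   # averaged class probabilities
print(proba.shape)
```

Comparing this variant against the fully unscaled ensemble on the development set would show whether the accuracy gain really comes from Elevation dominating the SVM.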

In [None]:
# probability = True makes the SVM output class probabilities, which soft voting requires
svm = SVC(gamma = 'scale', kernel = 'rbf', C = 1.0, probability = True, random_state = 0)
rfc = RandomForestClassifier(criterion = 'entropy', n_estimators = 600, max_depth = 35,  random_state = 0)
estimators = [('svm', svm), ('rf', rfc)]

# Soft voting combines the predicted class probabilities of the two models
voting = VotingClassifier(estimators, voting = 'soft')
print("SVM & Random Forest Ensemble with dev_data")
model_performance(voting, dt_train_final, train_labels, dt_dev_final, dev_labels)
print("SVM & Random Forest Ensemble with test_data")
model_performance(voting, dt_train_final, train_labels, dt_test_final, test_labels)