# Predicting Forest Cover Using Ensembles of Classifiers

### ***Yabra Muvdi***

In this project, I try to predict the class of forest cover (the predominant kind of tree cover) from strictly cartographic and environmental variables. The actual forest cover type for a given observation (a 30 x 30 meter cell) was determined from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. Independent variables were derived from data originally obtained from the US Geological Survey (USGS) and the USFS. The data is in raw form (not scaled) and contains categorical data for the qualitative independent variables (wilderness areas and soil types). Details on the data are in the *covertype.info* file and at https://archive.ics.uci.edu/ml/datasets/Covertype

# Steps 1 and 2

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import sklearn
import ipywidgets
from math import floor, ceil
import random
import time
from utils.helper_functions import *

In [2]:
data = pd.read_csv("./Data/MultiClass_Train_reduced.csv")
data

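|  | Elevation | Aspect | Slope | Horiz_dist_hydro | Vertical_dist_hydro | Horiz_dist_roadways | Hillshade_9am | Hillshade_Noon | Hillshade_3pm | Horiz_dist_firepoints | Cover_Type | Wilderness_Area | Soil_Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3202 | 34 | 10 | 0 | 0 | 2760 | 219 | 218 | 134 | 1734 | 1 | 3 | 38 |
| 1 | 3113 | 251 | 13 | 192 | 40 | 5600 | 191 | 249 | 195 | 2555 | 2 | 1 | 22 |
| 2 | 2801 | 77 | 9 | 510 | 17 | 1728 | 232 | 223 | 122 | 1087 | 2 | 1 | 12 |
| 3 | 3165 | 82 | 9 | 319 | 56 | 4890 | 233 | 225 | 124 | 1452 | 1 | 1 | 29 |
| 4 | 3048 | 333 | 11 | 124 | 31 | 2823 | 196 | 226 | 170 | 666 | 1 | 1 | 23 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 61001 | 3255 | 1 | 13 | 0 | 0 | 1552 | 201 | 215 | 151 | 713 | 1 | 1 | 38 |
| 61002 | 3170 | 170 | 25 | 417 | 61 | 2605 | 229 | 241 | 128 | 3350 | 2 | 3 | 33 |
| 61003 | 2994 | 170 | 13 | 134 | 18 | 1610 | 229 | 245 | 146 | 2394 | 2 | 3 | 33 |
| 61004 | 2543 | 135 | 4 | 124 | 17 | 524 | 227 | 238 | 145 | 1106 | 3 | 4 | 6 |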


In [3]:
def generate_dummies(input_file):
    """This function takes an input file, loads its data, separates it into
    the variable to predict (y) and its features (X) and generates the
    required dummy features.

    Warning: This function will only work if the column names in the provided
    input file are the same as the ones used for training.
    """
    
    import pandas as pd
    data = pd.read_csv(input_file)
    
    # Delete rows with missing values
    data = data.dropna(axis = 0)
    
    # Split X's and the Y
    y = data["Cover_Type"]  # This is the classification outcome: class of forest cover
    X = data.drop(['Cover_Type'], axis=1)
    
    # Preprocessing the features
    X_cont = X.drop(['Wilderness_Area', 'Soil_Type'], axis=1)
    
    wild_dum = pd.get_dummies(X.Wilderness_Area, drop_first = True)
    wild_dum.columns = ['Neota','Comanche', 'Cache']
    
    soil_dum = pd.get_dummies(X.Soil_Type, prefix = "soil", drop_first = True)
    
    X_cat = wild_dum.join(soil_dum)
    X_final = X_cont.join(X_cat) 
    
    return X_final, y
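To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up four-row frame (the values and the `wild` prefix are assumptions for illustration, not the real data). It also shows why deriving dummy names from a prefix can be safer than assigning a hard-coded list of names: the number of dummy columns depends on which categories are actually present in the file.

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up for illustration)
toy = pd.DataFrame({
    "Elevation": [3202, 3113, 2801, 2543],
    "Wilderness_Area": [3, 1, 1, 4],
})

# drop_first=True drops the lowest category present (here 1),
# which then acts as the baseline: its rows have all dummies equal to 0
wild_dum = pd.get_dummies(toy["Wilderness_Area"], prefix="wild", drop_first=True)

# Replace the original categorical column with its dummy columns
encoded = toy.drop(columns="Wilderness_Area").join(wild_dum)
print(encoded.columns.tolist())
```

Because category 2 never appears in this toy frame, only two dummy columns are produced; assigning three fixed names to them would raise an error.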

In [4]:
X, y = generate_dummies("./Data/MultiClass_Train_reduced.csv")

In [5]:
X

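|  | Elevation | Aspect | Slope | Horiz_dist_hydro | Vertical_dist_hydro | Horiz_dist_roadways | Hillshade_9am | Hillshade_Noon | Hillshade_3pm | Horiz_dist_firepoints | ... | soil_31 | soil_32 | soil_33 | soil_34 | soil_35 | soil_36 | soil_37 | soil_38 | soil_39 | soil_40 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3202 | 34 | 10 | 0 | 0 | 2760 | 219 | 218 | 134 | 1734 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 3113 | 251 | 13 | 192 | 40 | 5600 | 191 | 249 | 195 | 2555 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2801 | 77 | 9 | 510 | 17 | 1728 | 232 | 223 | 122 | 1087 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 3165 | 82 | 9 | 319 | 56 | 4890 | 233 | 225 | 124 | 1452 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 3048 | 333 | 11 | 124 | 31 | 2823 | 196 | 226 | 170 | 666 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 61001 | 3255 | 1 | 13 | 0 | 0 | 1552 | 201 | 215 | 151 | 713 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 61002 | 3170 | 170 | 25 | 417 | 61 | 2605 | 229 | 241 | 128 | 3350 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 61003 | 2994 | 170 | 13 | 134 | 18 | 1610 | 229 | 245 | 146 | 2394 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 61004 | 2543 | 135 | 4 | 124 | 17 | 524 | 227 | 238 | 145 | 1106 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |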


In [6]:
y

0        1
1        2
2        2
3        1
4        1
        ..
61001    1
61002    2
61003    2
61004    3
61005    2
Name: Cover_Type, Length: 61006, dtype: int64

In [7]:
#Training and test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=92)
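With its default settings, `train_test_split` holds out 25% of the rows and does not stratify by class. As a hedged aside, the sketch below shows on made-up imbalanced labels how the `stratify` argument keeps the class ratio identical in both partitions. This is an optional variation, not what the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (made up): 80 rows of class 0, 20 rows of class 1
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 80 + [1] * 20)

# stratify=y_toy preserves the 80/20 class ratio in the train and test partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=92
)

print(y_tr.mean(), y_te.mean())  # share of class 1 in each partition
```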

Next, we create a binary **y** variable that equals 1 if a forest is of class 2 or 3 and 0 otherwise.

In [8]:
y_23 = np.where((y == 2) | (y==3), 1, 0)

In [9]:
y_23.sum()

33492

In [10]:
y2 = pd.get_dummies(y)[2]

In [11]:
y3 = pd.get_dummies(y)[3]

In [12]:
y2.sum() + y3.sum()

33492

In [13]:
y_23_train = np.where((y_train == 2) | (y_train == 3), 1, 0)

In [14]:
y_23_test = np.where((y_test == 2) | (y_test == 3), 1, 0)
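The same binary target can also be written with `Series.isin`, which scales more gracefully if more classes are ever added to the positive group. A small sketch on made-up labels, checking that both formulations agree:

```python
import numpy as np
import pandas as pd

# Made-up cover types, for illustration only
y_toy = pd.Series([1, 2, 2, 1, 3, 5, 7])

# Rule used in the cells above: 1 if the cover type is 2 or 3, else 0
y_or = np.where((y_toy == 2) | (y_toy == 3), 1, 0)

# Equivalent formulation with isin
y_isin = y_toy.isin([2, 3]).astype(int).to_numpy()

print(y_isin)  # [0 1 1 0 1 0 0]
```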

Before proceeding any further, we should rescale the features so that they have a common mean and variance.

In [15]:
# Keep the fitted 'scaler' object so that the same transformation is applied to both the training and the test data
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(X_train)

X_train=scaler.transform(X_train)
X_test=scaler.transform(X_test)

X_train=pd.DataFrame(X_train, columns=X.columns)
X_test=pd.DataFrame(X_test, columns=X.columns)

In [16]:
X_train.describe()

|  | Elevation | Aspect | Slope | Horiz_dist_hydro | Vertical_dist_hydro | Horiz_dist_roadways | Hillshade_9am | Hillshade_Noon | Hillshade_3pm | Horiz_dist_firepoints | ... | soil_31 | soil_32 | soil_33 | soil_34 | soil_35 | soil_36 | soil_37 | soil_38 | soil_39 | soil_40 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 45754.0 | 45754.0 | 45754.0 | 45754.0 | 45754.0 | 45754.0 | 45754.0 | 45754.0 | 45754.0 | 45754.0 | ... | 45754.0 | 45754.0 | 45754.0 | 45754.0 | 45754.0 | 45754.0 | 45754.0 | 45754.0 | 45754.0 | 45754.0 |
| mean | -7.345515e-16 | -2.981689e-17 | 3.785348e-17 | -6.988334e-17 | -1.894615e-17 | -1.188017e-16 | 4.39178e-16 | 2.338762e-16 | -3.135432e-16 | 2.329445e-19 | ... | -2.795334e-17 | 7.57846e-17 | 4.565712e-17 | -4.379356e-17 | 4.503593e-18 | -6.211852e-18 | 4.658889e-18 | -5.155837e-17 | 1.708259e-17 | 5.093719e-17 |
| std | 1.000011 | 1.000011 | 1.000011 | 1.000011 | 1.000011 | 1.000011 | 1.000011 | 1.000011 | 1.000011 | 1.000011 | ... | 1.000011 | 1.000011 | 1.000011 | 1.000011 | 1.000011 | 1.000011 | 1.000011 | 1.000011 | 1.000011 | 1.000011 |
| min | -3.857424 | -1.392601 | -1.884201 | -1.269371 | -3.437989 | -1.505327 | -5.7181 | -7.114132 | -3.717809 | -1.504321 | ... | -0.2141902 | -0.315288 | -0.2908669 | -0.0557962 | -0.05419694 | -0.01322417 | -0.01870344 | -0.1641146 | -0.1584076 | -0.1210727 |
| 25% | -0.5366756 | -0.8663449 | -0.6853951 | -0.7582694 | -0.6771781 | -0.7990982 | -0.5207036 | -0.5129665 | -0.6133576 | -0.7196235 | ... | -0.2141902 | -0.315288 | -0.2908669 | -0.0557962 | -0.05419694 | -0.01322417 | -0.01870344 | -0.1641146 | -0.1584076 | -0.1210727 |
| 50% | 0.1360162 | -0.2598128 | -0.1525924 | -0.2377028 | -0.2827765 | -0.2314207 | 0.2217816 | 0.142111 | 0.01275027 | -0.2013123 | ... | -0.2141902 | -0.315288 | -0.2908669 | -0.0557962 | -0.05419694 | -0.01322417 | -0.01870344 | -0.1641146 | -0.1584076 | -0.1210727 |
| 75% | 0.7268461 | 0.9443318 | 0.513411 | 0.5384147 | 0.3859914 | 0.6236234 | 0.704397 | 0.6964074 | 0.6910338 | 0.436492 | ... | -0.2141902 | -0.315288 | -0.2908669 | -0.0557962 | -0.05419694 | -0.01322417 | -0.01870344 | -0.1641146 | -0.1584076 | -0.1210727 |
| max | 3.16135 | 1.818452 | 5.175435 | 5.308698 | 9.354341 | 3.037375 | 1.558255 | 1.553047 | 2.856324 | 3.907883 | ... | 4.668748 | 3.171704 | 3.437999 | 17.92237 | 18.45123 | 75.61911 | 53.46611 | 6.093302 | 6.312827 | 8.259501 |
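Two details of the scaling step are worth checking on a small example. First, the scaler is fitted on the training rows only, so the test set is transformed with the training mean and standard deviation. Second, the `std` row reported by `describe()` is 1.000011 rather than exactly 1 because pandas uses the sample standard deviation (dividing by n - 1), while `StandardScaler` normalizes by the population standard deviation (dividing by n). The sketch below uses synthetic data; the array sizes and distribution parameters are arbitrary assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic features standing in for the real ones (arbitrary parameters)
rng = np.random.default_rng(92)
train = rng.normal(loc=3000, scale=300, size=(500, 2))
test = rng.normal(loc=3000, scale=300, size=(100, 2))

scaler = StandardScaler().fit(train)  # statistics come from the training rows only
train_s = scaler.transform(train)
test_s = scaler.transform(test)       # reuses the training mean and std

pop_std = train_s.std(axis=0)                        # ddof=0: exactly 1 after scaling
sample_std = pd.DataFrame(train_s).std().to_numpy()  # ddof=1: what describe() reports

print(pop_std, sample_std)
```

With n = 45754 training rows, the same ratio sqrt(n / (n - 1)) gives the 1.000011 seen in the table.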


# Step 3: Build Binary Classification Models 

## Step 3.1: KNN

In [17]:
# Load K-NN from sklearn
from sklearn.neighbors import NearestNeighbors, KNeighborsClassifier 
#Initialize model
KNN = KNeighborsClassifier(n_neighbors = 10, weights = 'distance')

In [18]:
KNN.fit(X_train, y_23_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=10, p=2,
                     weights='distance')

## Step 3.2: Logistic Regression

In [20]:
from sklearn.linear_model import LogisticRegression
logistic = LogisticRegression(C=100, random_state= 92, solver='lbfgs', max_iter=1000)

In [21]:
logistic.fit(X_train, y_23_train)

LogisticRegression(C=100, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=92, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

## Step 3.3: Random Forest

In [25]:
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=20, max_leaf_nodes = 50)

In [26]:
forest.fit(X_train, y_23_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=50,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=20,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

# Step 4: Ensembling models

In [27]:
#Dictionary with my models
models = {'logistic': logistic,
              'knn': KNN,
              'random forest': forest}

In [28]:
models

{'logistic': LogisticRegression(C=100, class_weight=None, dual=False, fit_intercept=True,
                    intercept_scaling=1, l1_ratio=None, max_iter=1000,
                    multi_class='warn', n_jobs=None, penalty='l2',
                    random_state=92, solver='lbfgs', tol=0.0001, verbose=0,
                    warm_start=False),
 'knn': KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                      metric_params=None, n_jobs=None, n_neighbors=10, p=2,
                      weights='distance'),
 'random forest': RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                        max_depth=None, max_features='auto', max_leaf_nodes=50,
                        min_impurity_decrease=0.0, min_impurity_split=None,
                        min_samples_leaf=1, min_samples_split=2,
                        min_weight_fraction_leaf=0.0, n_estimators=20,
                        n_jobs=None, oob_score=False, random_state=None

In [29]:
def train_predict(model_list, xtrain=X_train, ytrain=y_23_train, xtest=X_test, ytest=y_23_test):
    # Collect the class-1 probabilities that each (already fitted) model predicts
    # on the test set; uncomment the fit call below to refit on the training set
    P = np.zeros((ytest.shape[0], len(model_list)))
    P = pd.DataFrame(P)

    cols = list()
    # Iterate over the dictionary passed in, not the global `models`
    for i, (name, m) in enumerate(model_list.items()):
        print("%s..." % name, end=" ", flush=False)
        #m.fit(xtrain, ytrain)
        P.iloc[:, i] = m.predict_proba(xtest)[:, 1]
        cols.append(name)
        print("done")

    P.columns = cols
    print("Done.\n")
    return P


from sklearn.metrics import roc_auc_score

def score_models(P, y):
    # Compute the ROC-AUC of each model's predicted probabilities on the test set
    print("Scoring models.")
    scores = []
    for m in P.columns:
        score = roc_auc_score(y, P.loc[:, m])
        scores.append(score)
        print("%-26s: %.3f" % (m, score))
    return P.columns, scores
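Since every model here is scored with ROC-AUC, it may help to recall what the number means: it is the share of (positive, negative) pairs in which the positive example receives the higher predicted probability. The sketch below computes it by hand on four made-up predictions and compares the result with `roc_auc_score`.

```python
from itertools import product
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities
y_true = [0, 0, 1, 1]
p_hat = [0.1, 0.4, 0.35, 0.8]

pos = [p for p, t in zip(p_hat, y_true) if t == 1]
neg = [p for p, t in zip(p_hat, y_true) if t == 0]

# Share of (positive, negative) pairs ranked correctly; a tie counts as one half
pairs = [1.0 if a > b else 0.5 if a == b else 0.0 for a, b in product(pos, neg)]
auc_manual = sum(pairs) / len(pairs)

auc_sklearn = roc_auc_score(y_true, p_hat)
print(auc_manual, auc_sklearn)  # 0.75 0.75
```

Three of the four pairs are ordered correctly (only 0.35 versus 0.4 is not), hence 0.75.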

In [30]:
#Get some first scores and predictions from the simple models
P = train_predict(models,X_train,y_23_train,X_test,y_23_test)
my_models,my_scores= score_models(P, y_23_test)

logistic... done
knn... done
random forest... done
Done.

Scoring models.
logistic                  : 0.820
knn                       : 0.941
random forest             : 0.846
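Before training a meta-learner, a simple baseline ensemble is to average the predicted probabilities across models. The sketch below does this on a made-up probability table with the same layout as `P` (one column per model); the model names and all numbers are hypothetical. On this toy data the two models err on different rows, so the average ranks every positive above every negative, but averaging is not guaranteed to help in general.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Made-up labels and per-model probabilities, laid out like P (one column per model)
y_true = [0, 0, 1, 1, 1, 0]
P_toy = pd.DataFrame({
    "model_a": [0.2, 0.6, 0.7, 0.4, 0.9, 0.1],
    "model_b": [0.3, 0.2, 0.4, 0.8, 0.6, 0.5],
})

# Row-wise mean of the model probabilities is the ensemble prediction
p_mean = P_toy.mean(axis=1)

for name in P_toy.columns:
    print("%-10s: %.3f" % (name, roc_auc_score(y_true, P_toy[name])))
print("%-10s: %.3f" % ("average", roc_auc_score(y_true, p_mean)))
```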


In [32]:
#Create a Meta-learner
from sklearn.ensemble import ExtraTreesClassifier

meta_learner = ExtraTreesClassifier(
    n_estimators=10,
    bootstrap=True,
    max_features=0.7,
    random_state=92)

In [33]:
models.values()

dict_values([LogisticRegression(C=100, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=92, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False), KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=10, p=2,
                     weights='distance'), RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=50,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=20,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, w

In [35]:
import mlens
from mlens.ensemble import SuperLearner
# Instantiate the ensemble
sl = SuperLearner(
    folds=5,
    random_state=92,
    verbose=2,
    backend="multiprocessing"
)

# Add the base learners and the meta learner
sl.add(list(models.values()), proba=True)
sl.add_meta(meta_learner, proba=True)

# Train the ensemble
sl.fit(X_train, y_23_train)


Fitting 2 layers


[MLENS] backend: threading


Processing layer-1             done | 00:02:48
Processing layer-2             done | 00:00:00
Fit complete                        | 00:02:49


SuperLearner(array_check=None, backend=None, folds=5,
       layers=[Layer(backend='threading', dtype=<class 'numpy.float32'>, n_jobs=-1,
   name='layer-1', propagate_features=None, raise_on_exception=True,
   random_state=4218, shuffle=False,
   stack=[Group(backend='threading', dtype=<class 'numpy.float32'>,
   indexer=FoldIndex(X=None, folds=5, raise_on_ex...rer=None)],
   n_jobs=-1, name='group-1', raise_on_exception=True, transformers=[])],
   verbose=1)],
       model_selection=False, n_jobs=None, raise_on_exception=True,
       random_state=92, sample_size=20, scorer=None, shuffle=False,
       verbose=2)

In [36]:
# Predict the test set
p_sl = sl.predict_proba(X_test)
scoreStack1 = roc_auc_score(y_23_test, p_sl[:, 1])
print("\nSuper Learner ROC-AUC score: %.3f" % scoreStack1)


Predicting 2 layers
Processing layer-1             done | 00:01:12
Processing layer-2             done | 00:00:00
Predict complete                    | 00:01:13

Super Learner ROC-AUC score: 0.915
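For readers without `mlens` installed, a comparable stacked ensemble can be sketched with scikit-learn's own `StackingClassifier`. This is a swapped-in alternative, not the method used above, and its internals differ from `SuperLearner`, so the scores are not expected to match. The data is synthetic (`make_classification`), and the base and meta learners simply reuse the hyperparameters from the earlier cells.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary problem standing in for the forest cover data
X_toy, y_toy = make_classification(n_samples=600, n_features=10,
                                   n_informative=5, random_state=92)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=92)

# Base learners with the same hyperparameters as in the notebook
base = [
    ("logistic", LogisticRegression(C=100, max_iter=1000)),
    ("knn", KNeighborsClassifier(n_neighbors=10, weights="distance")),
    ("random_forest", RandomForestClassifier(n_estimators=20, max_leaf_nodes=50,
                                             random_state=92)),
]
meta = ExtraTreesClassifier(n_estimators=10, bootstrap=True,
                            max_features=0.7, random_state=92)

# The meta learner is trained on 5-fold out-of-fold probabilities of the base learners
stack = StackingClassifier(estimators=base, final_estimator=meta,
                           cv=5, stack_method="predict_proba")
stack.fit(X_tr, y_tr)

auc = roc_auc_score(y_te, stack.predict_proba(X_te)[:, 1])
print("Stacking ROC-AUC score: %.3f" % auc)
```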

