In this notebook, we test a neural network as a model to predict the forest cover type from cartographic variables.


The data was obtained from the UCI Machine Learning Repository. It consists of 54 features and 581012 instances. As specified at https://archive.ics.uci.edu/ml/datasets/Covertype, the variables are:

Name / Data Type / Measurement / Description

Elevation / quantitative / meters / Elevation in meters

Aspect / quantitative / azimuth / Aspect in degrees azimuth

Slope / quantitative / degrees / Slope in degrees

Horizontal_Distance_To_Hydrology / quantitative / meters / Horz Dist to nearest surface water features

Vertical_Distance_To_Hydrology / quantitative / meters / Vert Dist to nearest surface water features

Horizontal_Distance_To_Roadways / quantitative / meters / Horz Dist to nearest roadway

Hillshade_9am / quantitative / 0 to 255 index / Hillshade index at 9am, summer solstice

Hillshade_Noon / quantitative / 0 to 255 index / Hillshade index at noon, summer solstice

Hillshade_3pm / quantitative / 0 to 255 index / Hillshade index at 3pm, summer solstice

Horizontal_Distance_To_Fire_Points / quantitative / meters / Horz Dist to nearest wildfire ignition points

Wilderness_Area (4 binary columns) / qualitative / 0 (absence) or 1 (presence) / Wilderness area designation

Soil_Type (40 binary columns) / qualitative / 0 (absence) or 1 (presence) / Soil Type designation

Cover_Type (7 types) / integer / 1 to 7 / Forest Cover Type designation  (response variable)

The forest cover type classes are:

    1 -- Spruce/Fir

    2 -- Lodgepole Pine

    3 -- Ponderosa Pine

    4 -- Cottonwood/Willow

    5 -- Aspen

    6 -- Douglas-fir

    7 -- Krummholz

and the number of instances for each class is

           Number of records of Spruce-Fir:                211840 
           
           Number of records of Lodgepole Pine:            283301 
           
           Number of records of Ponderosa Pine:             35754 
           
           Number of records of Cottonwood/Willow:           2747 
           
           Number of records of Aspen:                       9493 
           
           Number of records of Douglas-fir:                17367 
           
           Number of records of Krummholz:                  20510  
           
          	
           Total records:                                  581012

As we can see, the first two classes make up more than 85% of the data. To simplify things, we do not use all of the data: for each of the first two classes we randomly choose 35700 instances, and we ignore the data for the Cottonwood/Willow and Aspen classes. We thus work with a data set of 5 classes.
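The arithmetic behind this choice can be checked with a short sketch. It uses only the class counts listed above; the variable names are illustrative and not part of the notebook.

```python
# Class counts as listed above, keyed by the original labels 1..7.
counts = {1: 211840, 2: 283301, 3: 35754, 4: 2747, 5: 9493, 6: 17367, 7: 20510}

total = sum(counts.values())
# Fraction of all records belonging to Spruce/Fir and Lodgepole Pine.
share_top2 = (counts[1] + counts[2]) / float(total)

# Subsample the two large classes, drop classes 4 and 5, keep the rest whole.
n_sub = 35700
kept = {1: n_sub, 2: n_sub, 3: counts[3], 6: counts[6], 7: counts[7]}
n_used = sum(kept.values())

print('total records:', total)
print('share of first two classes: %.1f%%' % (100 * share_top2))
print('records used:', n_used)
```

With 35700 instances each, the two large classes end up roughly the same size as Ponderosa Pine (35754), the largest of the remaining classes.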

We use the NeuralNetwork class from our package ML_tools. To train the model we use Adam, a stochastic gradient-based update method introduced in http://arxiv.org/pdf/1412.6980.pdf.
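For readers unfamiliar with Adam, here is a minimal NumPy sketch of a single update step as described in the paper. It is not the ML_tools implementation: the function name `adam_step`, its signature and the default hyperparameters (the ones suggested in the paper) are assumptions made for illustration.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (hypothetical helper, not part of ML_tools).

    theta: parameters, grad: gradient at theta,
    m, v: running first and second moment estimates, t: step count (from 1).
    """
    m = beta1 * m + (1 - beta1) * grad        # moving average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2   # moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction for zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize f(theta) = theta^2, whose gradient is 2 * theta.
theta = np.array([1.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
first_theta = None
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.01)
    if t == 1:
        first_theta = theta.copy()
print(first_theta, theta)
```

On the first step the bias-corrected ratio is the sign of the gradient, so the parameter moves by about one learning rate regardless of the gradient's scale.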


In [1]:
#!/usr/bin/env python
from __future__ import print_function 
import os
import sys
import gzip
import pickle
import urllib
import numpy as np
try:
    import urllib.request  #Python 3; Python 2 uses urllib.urlretrieve instead
except ImportError:
    pass




In [2]:
#Retrieving data
current = os.getcwd() #current directory
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/covtype/covtype.data.gz'  #link to gzip-compressed file
filename = 'covtype.dat'
fullpath = os.path.join(current,filename)
if filename in os.listdir(current):
    print('Data file %s already present' %filename)
else:
    #Attempt to retrieve compressed file
    print('Retrieving and uncompressing file %s' %filename)
    zipped_fullpath = fullpath + '.gz'
    try:
        if sys.version_info[0] >= 3:  #for Python 3
            zipped_fullpath, _ = urllib.request.urlretrieve(url, zipped_fullpath)
        else: #for Python 2
            zipped_fullpath, _ = urllib.urlretrieve(url, zipped_fullpath)
        print('Compressed file successfully retrieved')
    except IOError as e:
        print('Cannot retrieve %s to %s: %s' % (url, zipped_fullpath, e))
    #Attempt to extract file
    print('\nExtracting file:')
    try:
        with gzip.open(zipped_fullpath,'rb') as file:
            s = file.read()  #the with block closes the file
        with open(fullpath,'wb') as f:
            f.write(s)
        print('File successfully extracted')
    except IOError as e:
        print('Could not decompress %s: %s' % (zipped_fullpath, e))

Retrieving and uncompressing file covtype.dat
Compressed file successfully retrieved

Extracting file:
File successfully extracted


Next, the data is loaded and the quantitative variables are normalized. The hillshade variables are index variables taking values between 0 and 255; they are linearly rescaled to take values between -1 and 1. The other quantitative variables are standardized in the usual way, by subtracting the mean and dividing by the standard deviation computed over the whole selected dataset. Finally, we partition the data into a training set (55%) and a validation set (45%).
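The two rescalings can be sketched on a toy array. The helper `normalize_features` and its arguments are hypothetical and only illustrate the idea; passing a stored `mean` and `std` back in shows how new data could be put on the same scale.

```python
import numpy as np

def normalize_features(X, zscore_cols, index_cols, mean=None, std=None):
    """Standardize zscore_cols and map 0..255 index_cols to [-1, 1].

    Hypothetical helper; mean and std are computed from X unless supplied.
    """
    X = X.astype(float).copy()
    if mean is None:
        mean = X[:, zscore_cols].mean(axis=0)
    if std is None:
        std = X[:, zscore_cols].std(axis=0)
    X[:, zscore_cols] = (X[:, zscore_cols] - mean) / std
    X[:, index_cols] = 2.0 * X[:, index_cols] / 255.0 - 1.0
    return X, mean, std

# Toy data: column 0 plays the role of elevation, column 1 of a hillshade index.
toy = np.array([[2000.0, 0.0],
                [3000.0, 255.0],
                [2500.0, 127.5]])
scaled, mu, sigma = normalize_features(toy, zscore_cols=[0], index_cols=[1])
print(scaled)

# Reusing the stored statistics on a new row keeps it on the same scale.
new_row = np.array([[2500.0, 255.0]])
new_scaled, _, _ = normalize_features(new_row, [0], [1], mean=mu, std=sigma)
print(new_scaled)
```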

In [3]:
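def relabel(y):
    #Maps the five retained class labels 1, 2, 3, 6, 7 to 0, 1, 2, 3, 4
    if y < 5:
        return y-1  #labels 1, 2, 3 -> 0, 1, 2
    else:
        return y-3  #labels 6, 7 -> 3, 4
vecrelabel = np.vectorize(relabel)  #elementwise version for numpy arrays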
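As a quick check of the relabelling, the sketch below applies the same mapping to the five retained labels. The `np.searchsorted` variant is an alternative added for illustration only and is not used in this notebook.

```python
import numpy as np

def relabel(y):
    # Same mapping as in the cell above.
    return y - 1 if y < 5 else y - 3

vecrelabel = np.vectorize(relabel)

kept_labels = np.array([1, 2, 3, 6, 7])   # classes 4 and 5 are dropped
new_labels = vecrelabel(kept_labels)
print(new_labels)  # [0 1 2 3 4]

# Alternative without np.vectorize: the position of each label in the
# sorted array of kept labels is its new label.
y = np.array([1.0, 6.0, 7.0, 3.0])
alt_labels = np.searchsorted(kept_labels, y)
print(alt_labels)  # [0 3 4 2]
```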


In [4]:
#Loading file as numpy array, preprocessing, partitioning data into training and validation datasets and finally
#storing the preprocessed data as a binary file if the file is not already present
filename = 'covtype_preprocess.dat'
if filename in os.listdir(current):
    print('Preprocessed data already present.')
    print('Retrieving numpy arrays')
    with open(filename,'rb') as f:  #arrays are read back in the order they were dumped
        Data_train  = pickle.load(f)
        Data_val = pickle.load(f)
        mean = pickle.load(f)
        std = pickle.load(f)
else:
    print('Preprocessing, partitioning and storing the data')
    Data_loaded  = np.loadtxt('covtype.dat', delimiter =',')
    print('Shape of Data_loaded:',Data_loaded.shape)
    #Selecting Data from Data_loaded:
    #1) Randomly choosing 35700 instances for each of the first two classes
    spruce_condition = (Data_loaded[:,-1] == 1)   
    Data_spruce = Data_loaded[spruce_condition,:]   #data for first class
    
    #replace=False samples without replacement, so no instance is picked twice
    selection = np.random.choice(len(Data_spruce), 35700, replace=False)
    Data_sprucesl = Data_spruce[selection,:]  #selection of instances for first class
    
    
    lodgepole_condition = (Data_loaded[:,-1] == 2)   
    Data_lodgepole = Data_loaded[lodgepole_condition,:] #data for second class
    
    selection = np.random.choice(len(Data_lodgepole), 35700, replace=False)
    Data_lodgepolesl = Data_lodgepole[selection,:]  #selection of instances for second class
   
    
    #rest of data
    willow_condition = (Data_loaded[:,-1] == 4)
    aspen_condition =  (Data_loaded[:,-1] == 5)
    condition3 = np.logical_or(willow_condition, aspen_condition)  #aspen or willow type
    condition = np.logical_not(np.logical_or(np.logical_or(spruce_condition,lodgepole_condition),condition3))
    Data_other = Data_loaded[condition,:]
    
    Data = np.concatenate((Data_other, Data_sprucesl,Data_lodgepolesl),axis=0) #data used
    print('Shape of data used: ',Data.shape)
    
    #Normalizing the data and relabelling the classes
    mean = np.mean(Data[:,[0,1,2,3,4,5,9]], axis = 0)
    std  = np.std(Data[:,[0,1,2,3,4,5,9]], axis = 0)
    Data[:,[0,1,2,3,4,5,9]] = (Data[:,[0,1,2,3,4,5,9]]-mean)/std
    Data[:,6:9] = 2.0*Data[:,6:9]/255.0 -1
    Data[:,-1] = vecrelabel(Data[:,-1]) 
    #Partitioning the data
    N = len(Data)
    u = np.random.permutation(np.arange(N))
    N_train = int(round(0.55*N))
    Data_train = Data[u[:N_train],:]
    Data_val = Data[u[N_train:],:]
   
    #Saving the data
    with open(filename,'wb') as f:
        pickle.dump(Data_train,f)
        pickle.dump(Data_val,f)
        pickle.dump(mean,f)
        pickle.dump(std,f)

print('Data_train shape:',Data_train.shape)
print('Data_val shape:',Data_val.shape)
  
  
    
    


Preprocessing, partitioning and storing the data
Shape of Data_loaded: (581012, 55)
Shape of data used:  (145031, 55)
Data_train shape: (79767, 55)
Data_val shape: (65264, 55)
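The shapes above follow from the 55% split of the 145031 selected rows. The sketch below reproduces them on a placeholder array; the helper `split_rows` and the fixed seed are assumptions added for reproducibility, since the notebook itself does not seed the random number generator.

```python
import numpy as np

def split_rows(data, train_fraction=0.55, seed=0):
    """Shuffle the rows and split them into training and validation parts.

    Hypothetical helper; the seed is an addition, not used in the notebook.
    """
    rng = np.random.RandomState(seed)
    order = rng.permutation(len(data))
    n_train = int(round(train_fraction * len(data)))
    return data[order[:n_train]], data[order[n_train:]]

# Placeholder data with as many rows as the selected dataset.
demo = np.arange(145031).reshape(-1, 1)
train, val = split_rows(demo)
print(train.shape, val.shape)  # (79767, 1) (65264, 1)
```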


In [9]:
#Preparing data for training. 
Xtr = Data_train[:,:-1]
print(Xtr.shape)
Ytr = Data_train[:,-1]
Ytr = Ytr.astype(int)

Xv  = Data_val[:,:-1]
print(Xv.shape)
Yv = Data_val[:,-1]
Yv = Yv.astype(int)





(79767, 54)
(65264, 54)


In [10]:
#Loading Neural_network module
#Assuming the ML_tools package is in the same directory that contains the covertype directory
containing_directory = os.path.split(current)[0]
fullpath = os.path.join(containing_directory,'ML_tools') #path to ML_tools package
sys.path.append(fullpath)
from Neural_network import NeuralNetwork



In [11]:
#Training a neural network
data_train = {}
data_train['Xtrain'] = Xtr
data_train['Ytrain'] = Ytr
data_val = {}
data_val['Xval'] = Xv
data_val['Yval'] = Yv
input_dims = Xtr.shape[1]

#Dictionary of optional arguments to be passed to the training method of the NeuralNetwork class
dict1 = {}
dict1['num_epochs'] = 30 #number of epochs
dict1['update_method'] = 'adam' #stochastic gradient-based update method used
dict1['print_rate'] = 50 #printing rate of loss values
dict1['rate_decay'] = 0.90
dict1['batch_size'] = 100
hyperparameters = {}
hyperparameters['learning_rate'] = 0.03
dict1['update_hyperparams'] = hyperparameters

#Regularization constant
reg = 0.001
model2 = NeuralNetwork(input_dims,[200, 70, 30],5,reg) 

#Training. Will take a while.
model2.stoch_train(data_train, data_val, **dict1)


(Epoch 0 / 30) training accuracy: 43.9266864744; validation accuracy: 43.9108850208
Iteration 50 / 23910) loss: 0.64427259922
Iteration 100 / 23910) loss: 0.761224580288
Iteration 150 / 23910) loss: 0.753147710085
Iteration 200 / 23910) loss: 0.724600511551
Iteration 250 / 23910) loss: 0.785159703732
Iteration 300 / 23910) loss: 0.664028719425
Iteration 350 / 23910) loss: 0.825770318985
Iteration 400 / 23910) loss: 0.586271295071
Iteration 450 / 23910) loss: 0.752010461807
Iteration 500 / 23910) loss: 0.604125889301
Iteration 550 / 23910) loss: 0.614490700722
Iteration 600 / 23910) loss: 0.741288555622
Iteration 650 / 23910) loss: 0.703974433422
Iteration 700 / 23910) loss: 0.594110956192
Iteration 750 / 23910) loss: 0.704151956081
(Epoch 1 / 30) training accuracy: 73.8651321975; validation accuracy: 74.1250919343
Iteration 800 / 23910) loss: 0.670640815735
Iteration 850 / 23910) loss: 0.604857423782
Iteration 900 / 23910) loss: 0.581287454605
Iteration 950 / 23910) loss: 0.75258788108
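The total of 23910 iterations in the log is consistent with 30 epochs of 797 mini-batches each. The sketch below checks this under the assumption that ML_tools uses floor division for the number of batches per epoch; that detail is inferred from the log, not from the package source.

```python
n_train = 79767      # rows in the training set
batch_size = 100
num_epochs = 30

# Assumption: a partial last batch is not counted (floor division).
iters_per_epoch = n_train // batch_size
total_iters = iters_per_epoch * num_epochs

print(iters_per_epoch, total_iters)  # 797 23910
```

This also matches the position of the "Epoch 1" line in the log, which appears between iterations 750 and 800.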

After 30 epochs of training, we obtain parameters corresponding to a training and validation accuracy of around 82%. To put this in perspective, the authors of the paper "Comparative Accuracies of Artificial Neural Networks and Discriminant Analysis in Predicting Forest Cover Types from Cartographic Variables" obtained a test accuracy of about 70% using a small training set and only 12 of the features. We might obtain better accuracy by letting the model train for longer.