title: "FOREST TYPE CLASSIFICATION"

date: 2019-10-01

tags: [data wrangling, data science, machine learning model]

header:
  image:
  
excerpt: "data wrangling, data science, machine learning model"

mathjax: "true"


# FOREST TYPE CLASSIFICATION
**Classify forest types based on information about the area**

## Project Description
This is a competition project hosted on [Kaggle.com](https://www.kaggle.com).

In this project we’ll predict what types of trees there are in an area based on various geographic features.

## Data Set
The dataset comes from a study conducted in four wilderness areas within the beautiful Roosevelt National Forest of northern Colorado. These areas represent forests with very little human disturbance, so the existing forest cover types there are more a result of ecological processes than of forest management practices.
The data is in raw form and contains categorical data such as wilderness areas and soil type.

Acknowledgements:
This dataset was provided by Jock A. Blackard and Colorado State University. We also thank the UCI machine learning repository for hosting the dataset. 
The data set can be downloaded from this [link](https://www.kaggle.com/c/learn-together/data).

## Evaluation
Models are evaluated on categorization accuracy.
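
As a rough illustration of the metric, accuracy is the share of rows whose predicted cover type equals the true one. The labels below are made up for the example and are not taken from the competition data.

```python
import numpy as np

# Hypothetical true and predicted cover types for six rows (illustrative only)
y_true = np.array([5, 5, 2, 2, 5, 1])
y_pred = np.array([5, 2, 2, 2, 5, 3])

# Accuracy = number of correct predictions / number of predictions
accuracy = np.mean(y_true == y_pred)
print(f"accuracy = {accuracy:.3f}")
```

Four of the six made-up predictions match, so the score here is 4/6.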

## Basic Random Forest Classifier

### Import Required Libraries

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import SelectKBest, chi2 # For feature selection

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

/kaggle/input/learn-together/train.csv
/kaggle/input/learn-together/sample_submission.csv
/kaggle/input/learn-together/test.csv


In [2]:
!pwd

/kaggle/working


### Load Data Set

In [3]:
print("Loading data set......")
train = pd.read_csv("../input/learn-together/train.csv")
test = pd.read_csv("../input/learn-together/test.csv")
print("Done...")

Loading data set......
Done...


### Exploratory Data Analysis

In [4]:
print("train data size:", train.shape)
train.head()

train data size: (15120, 56)


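|  | Id | Elevation | Aspect | Slope | Horizontal_Distance_To_Hydrology | Vertical_Distance_To_Hydrology | Horizontal_Distance_To_Roadways | Hillshade_9am | Hillshade_Noon | Hillshade_3pm | ... | Soil_Type32 | Soil_Type33 | Soil_Type34 | Soil_Type35 | Soil_Type36 | Soil_Type37 | Soil_Type38 | Soil_Type39 | Soil_Type40 | Cover_Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2596 | 51 | 3 | 258 | 0 | 510 | 221 | 232 | 148 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
| 1 | 2 | 2590 | 56 | 2 | 212 | -6 | 390 | 220 | 235 | 151 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
| 2 | 3 | 2804 | 139 | 9 | 268 | 65 | 3180 | 234 | 238 | 135 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 3 | 4 | 2785 | 155 | 18 | 242 | 118 | 3090 | 238 | 238 | 122 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 4 | 5 | 2595 | 45 | 2 | 153 | -1 | 391 | 220 | 234 | 150 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |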


In [5]:
print("test data size:", test.shape)
test.head()

test data size: (565892, 55)


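|  | Id | Elevation | Aspect | Slope | Horizontal_Distance_To_Hydrology | Vertical_Distance_To_Hydrology | Horizontal_Distance_To_Roadways | Hillshade_9am | Hillshade_Noon | Hillshade_3pm | ... | Soil_Type31 | Soil_Type32 | Soil_Type33 | Soil_Type34 | Soil_Type35 | Soil_Type36 | Soil_Type37 | Soil_Type38 | Soil_Type39 | Soil_Type40 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15121 | 2680 | 354 | 14 | 0 | 0 | 2684 | 196 | 214 | 156 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 15122 | 2683 | 0 | 13 | 0 | 0 | 2654 | 201 | 216 | 152 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 15123 | 2713 | 16 | 15 | 0 | 0 | 2980 | 206 | 208 | 137 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 15124 | 2709 | 24 | 17 | 0 | 0 | 2950 | 208 | 201 | 125 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 15125 | 2706 | 29 | 19 | 0 | 0 | 2920 | 210 | 195 | 115 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |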


In [6]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15120 entries, 0 to 15119
Data columns (total 56 columns):
Id                                    15120 non-null int64
Elevation                             15120 non-null int64
Aspect                                15120 non-null int64
Slope                                 15120 non-null int64
Horizontal_Distance_To_Hydrology      15120 non-null int64
Vertical_Distance_To_Hydrology        15120 non-null int64
Horizontal_Distance_To_Roadways       15120 non-null int64
Hillshade_9am                         15120 non-null int64
Hillshade_Noon                        15120 non-null int64
Hillshade_3pm                         15120 non-null int64
Horizontal_Distance_To_Fire_Points    15120 non-null int64
Wilderness_Area1                      15120 non-null int64
Wilderness_Area2                      15120 non-null int64
Wilderness_Area3                      15120 non-null int64
Wilderness_Area4                      15120 non-null int64
Soil_T

In [7]:
train.describe().T

|  | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Id | 15120.0 | 7560.5 | 4364.91237 | 1.0 | 3780.75 | 7560.5 | 11340.25 | 15120.0 |
| Elevation | 15120.0 | 2749.322553 | 417.678187 | 1863.0 | 2376.0 | 2752.0 | 3104.0 | 3849.0 |
| Aspect | 15120.0 | 156.676653 | 110.085801 | 0.0 | 65.0 | 126.0 | 261.0 | 360.0 |
| Slope | 15120.0 | 16.501587 | 8.453927 | 0.0 | 10.0 | 15.0 | 22.0 | 52.0 |
| Horizontal_Distance_To_Hydrology | 15120.0 | 227.195701 | 210.075296 | 0.0 | 67.0 | 180.0 | 330.0 | 1343.0 |
| Vertical_Distance_To_Hydrology | 15120.0 | 51.076521 | 61.239406 | -146.0 | 5.0 | 32.0 | 79.0 | 554.0 |
| Horizontal_Distance_To_Roadways | 15120.0 | 1714.023214 | 1325.066358 | 0.0 | 764.0 | 1316.0 | 2270.0 | 6890.0 |
| Hillshade_9am | 15120.0 | 212.704299 | 30.561287 | 0.0 | 196.0 | 220.0 | 235.0 | 254.0 |
| Hillshade_Noon | 15120.0 | 218.965608 | 22.801966 | 99.0 | 207.0 | 223.0 | 235.0 | 254.0 |
| Hillshade_3pm | 15120.0 | 135.091997 | 45.895189 | 0.0 | 106.0 | 138.0 | 167.0 | 248.0 |
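
The wilderness area and soil type appear in the data as separate 0/1 indicator columns (for example "Wilderness_Area1" to "Wilderness_Area4"). For exploration it can be handy to fold such a group back into a single label. The sketch below does this on a tiny made-up frame, assuming each row has exactly one indicator set; the frame and the variable names are illustrative, not part of the notebook.

```python
import pandas as pd

# Tiny made-up frame with the same one-hot layout as the Wilderness_Area columns
df = pd.DataFrame({
    "Wilderness_Area1": [1, 0, 0],
    "Wilderness_Area2": [0, 0, 1],
    "Wilderness_Area3": [0, 1, 0],
    "Wilderness_Area4": [0, 0, 0],
})

area_cols = [c for c in df.columns if c.startswith("Wilderness_Area")]

# Assumption: every row has exactly one indicator set to 1
assert (df[area_cols].sum(axis=1) == 1).all()

# idxmax returns, per row, the name of the column holding the 1
wilderness = (
    df[area_cols]
    .idxmax(axis=1)
    .str.replace("Wilderness_Area", "", regex=False)
    .astype(int)
)
print(wilderness.tolist())
```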


### Feature Selection

In this model, we do little feature engineering and build the model from the available features. The "Vertical_Distance_To_Hydrology" column contains negative values, which the "chi2" scoring function used for feature selection below does not accept, so we drop it. We also drop the "Id" column, which is not needed for the model.
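
scikit-learn's "chi2" raises an error when it is given negative values, which is a likely reason for dropping the column here. The sketch below, on made-up numbers, shows how to find such columns and one possible alternative: shifting a column so that its minimum is zero. The alternative is an assumption for illustration and is not what this post does.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Made-up feature matrix: the second column contains a negative value
X = np.array([[3, -6], [1, 65], [4, 118], [2, 0]])
y = np.array([5, 5, 2, 2])

# Indices of columns with at least one negative entry
negative_cols = np.where((X < 0).any(axis=0))[0]
print("columns with negative values:", negative_cols.tolist())

try:
    chi2(X, y)
except ValueError as err:
    print("chi2 rejected the input:", err)

# Possible alternative to dropping: shift only the columns whose minimum is below zero
X_shifted = X - np.minimum(X.min(axis=0), 0)
scores, p_values = chi2(X_shifted, y)
print("chi2 scores after shifting:", scores)
```

Whether shifting keeps the feature useful depends on the data, so it is worth comparing both variants with cross validation.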

In [8]:
# Declare target and predictors
print("Selecting features and target columns for model")
target = train['Cover_Type']
train_df = train.drop(["Cover_Type", "Id", "Vertical_Distance_To_Hydrology"], axis=1)
test_df = test.drop(["Id", "Vertical_Distance_To_Hydrology"], axis=1)


Selecting features and target columns for model


In [9]:
train_df.shape, test_df.shape

((15120, 53), (565892, 53))

To select the best features for our model, we will use the "SelectKBest" class from sklearn's feature selection module.
Since our target column is categorical, we will use "chi2" as the scoring function for "SelectKBest" and keep the 25 highest-scoring features (k=25).

In [10]:
# Feature  selection

best = SelectKBest(chi2, k=25).fit(train_df, target)
train_best = best.transform(train_df)
test_best = best.transform(test_df)
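
The transform step returns plain arrays, so the column names are lost. To see which features were kept, the fitted selector's `get_support()` mask can be applied to the original column names. The sketch below uses a made-up frame with hypothetical columns ("signal", "noise", "constant") rather than the competition data.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Made-up non-negative features: "signal" tracks the target, the others do not
rng = np.random.default_rng(0)
y = np.repeat([1, 2], 50)
demo = pd.DataFrame({
    "signal": np.where(y == 1, 10, 100) + rng.integers(0, 5, size=100),
    "noise": rng.integers(0, 5, size=100),
    "constant": np.ones(100, dtype=int),
})

selector = SelectKBest(chi2, k=1).fit(demo, y)

# get_support() returns a boolean mask over the input columns
selected = demo.columns[selector.get_support()].tolist()
print("selected:", selected)

# scores_ holds the chi2 statistic of every input column
print(pd.Series(selector.scores_, index=demo.columns).sort_values(ascending=False))
```

Applied to the notebook, the same pattern would be `train_df.columns[best.get_support()]`.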

### Create Model

We are using the RandomForestClassifier algorithm for our model.

In [11]:
# Create Model
print("Creating model")
rf = RandomForestClassifier(n_estimators=100)
print("Model created")


Creating model
Model created
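
A random forest is randomized, so scores can change slightly from run to run unless a seed is fixed. The sketch below shows the effect of the `random_state` parameter on synthetic data; the seed value 42 and the small forest size are arbitrary choices for illustration, not settings used in this post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data (not the competition data)
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# Two forests built with the same seed are trained identically
rf_a = RandomForestClassifier(n_estimators=20, random_state=42).fit(X, y)
rf_b = RandomForestClassifier(n_estimators=20, random_state=42).fit(X, y)

same = bool((rf_a.predict(X) == rf_b.predict(X)).all())
print("identical predictions:", same)
```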


### Cross validation of the model

In [12]:
print("Cross validation score")
print(cross_val_score(rf, train_best, target, cv=3, scoring="accuracy"))


Cross validation score
[0.79424603 0.77083333 0.78650794]
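
The three fold scores are easier to compare across experiments when reduced to a mean and a spread. A minimal sketch using the fold accuracies printed above:

```python
import numpy as np

# Fold accuracies copied from the cross-validation output above
scores = np.array([0.79424603, 0.77083333, 0.78650794])

mean, std = scores.mean(), scores.std()
print(f"accuracy: {mean:.4f} +/- {std:.4f}")
```

The mean comes out at roughly 0.78, which is the number to beat when trying other features or settings.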


### Fit Model and Predict

In [13]:
print("Fitting Model on training data set.....")
# Fit model to training data
rf.fit(train_best, target)
print("Predict on test data set....")
test_pred = rf.predict(test_best)



Fitting Model on training data set.....
Predict on test data set....


### Create Submission File

In [14]:
# Save test predictions to file
print("Creating submission file")
output = pd.DataFrame({'Id': test.Id,'Cover_Type': test_pred})
output.to_csv('submission_rf_1.csv', index=False)


Creating submission file
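
Before uploading, it can help to check the file against the format of sample_submission.csv. The sketch below runs such a check on made-up frames and an in-memory buffer. It assumes the sample file has the columns "Id" and "Cover_Type" and that cover types are coded 1 to 7; the `problems` list is a hypothetical helper for this example.

```python
import io
import pandas as pd

# Made-up stand-ins for the prediction frame and sample_submission.csv
output = pd.DataFrame({"Id": [15121, 15122, 15123], "Cover_Type": [2, 1, 5]})
sample = pd.DataFrame({"Id": [15121, 15122, 15123], "Cover_Type": [1, 1, 1]})

# Write to an in-memory buffer instead of disk, then read it back
buffer = io.StringIO()
output.to_csv(buffer, index=False)
buffer.seek(0)
written = pd.read_csv(buffer)

problems = []
if list(written.columns) != list(sample.columns):
    problems.append("column names differ from the sample")
if not written["Id"].equals(sample["Id"]):
    problems.append("Id column differs from the sample")
if not written["Cover_Type"].between(1, 7).all():
    problems.append("Cover_Type outside the assumed range 1..7")

print(problems or "submission looks consistent")
```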


### Further work
* A more accurate model can be built with feature engineering.
* Hyperparameter tuning is advised to improve this model's performance.
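
As a starting point for the tuning mentioned above, here is a minimal sketch using scikit-learn's `GridSearchCV`. It runs on synthetic data so that it is self-contained, and the grid values are hypothetical and deliberately small; on the real data, `X` and `y` would be replaced by `train_best` and `target`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for train_best / target (not the competition data)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# Hypothetical, deliberately small grid; a real search would cover more values
param_grid = {"n_estimators": [25, 50], "max_depth": [5, None]}

search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      cv=3, scoring="accuracy")
search.fit(X, y)

print("best parameters:", search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.3f}")
```

Every grid point is fitted once per fold, so the cost grows quickly with the size of the grid.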
