# Predicting Forest Cover Types

## Fire up

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from ggplot import *
import sklearn
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv('../input/covtype.csv')
print(df.describe(include = 'all'))
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
import matplotlib as mpl
import seaborn as sns

**The means and standard deviations of the binary variables suggest that they are not evenly distributed. The EDA below looks into this further.**

To avoid leaking information from the test set into our modeling decisions, we split the data into a training set and a test set before exploring it.

In [None]:
train,test = train_test_split(df,test_size=0.2,random_state=999)
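If the cover types turn out to be imbalanced, a stratified split may keep the class proportions similar in both sets. The following is only a sketch on toy data: the `toy` frame and its values are made up for illustration, while `stratify` is a regular parameter of `train_test_split`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data: an imbalanced label column (hypothetical values).
rng = np.random.RandomState(999)
toy = pd.DataFrame({
    "Elevation": rng.randint(1800, 3900, size=1000),
    "Cover_Type": rng.choice([1, 2, 3], size=1000, p=[0.6, 0.3, 0.1]),
})

# stratify keeps the label proportions nearly identical in both splits.
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=999, stratify=toy["Cover_Type"]
)

# Compare the share of each class in the two splits side by side.
train_share = toy_train["Cover_Type"].value_counts(normalize=True).sort_index()
test_share = toy_test["Cover_Type"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"train": train_share, "test": test_share}))
```

On the real data this would amount to passing `stratify=df['Cover_Type']` in the cell above.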

## Exploratory Data Analysis

Before fitting a machine learning algorithm to the data set, one should first look at each variable. Part of this analysis is inspired by @Alan (AJ) Pryor, Jr, to whom credit is due.

In [None]:
con = ['Elevation' , 'Aspect' , 'Slope', 'Horizontal_Distance_To_Hydrology' , 'Vertical_Distance_To_Hydrology' ,'Horizontal_Distance_To_Roadways','Hillshade_9am','Hillshade_Noon','Hillshade_3pm','Horizontal_Distance_To_Fire_Points','Cover_Type']
con_variables = train[con].copy()  # copy so that adding a column later does not trigger SettingWithCopyWarning

**Continuous Features vs Labels (Box Plots)**

In [None]:
for feature in con[:-1]:  # the last entry of con is the label, Cover_Type
    g = ggplot(con_variables, aes(x='Cover_Type', y=feature)) + geom_boxplot() + ggtitle('Box Plot of Cover Type and ' + feature) + theme_bw()
    print(g)

**Several of the continuous features, such as Elevation, Aspect, Slope, Horizontal Distance to Hydrology, Distance to Roadways and Distance to Fire Points, may be influential for the forest cover type.**

**Continuous Features Correlations (Correlation Matrix Calculated)**

In [None]:
Cor = con_variables.iloc[:,0:10]
Cor_matrxi = Cor.corr(method='pearson', min_periods=1)
print(Cor_matrxi)

In [None]:
fig, ax = plt.subplots()
heatmap = ax.pcolor(Cor_matrxi, cmap=plt.cm.Blues, alpha=0.8)
fig = plt.gcf()
fig.set_size_inches(6, 6)
ax.set_frame_on(False)
ax.set_yticks(np.arange(10) + 0.5, minor=False)
ax.set_xticks(np.arange(10) + 0.5, minor=False)
ax.set_xticklabels(con[0:10], minor=False)
ax.set_yticklabels(con[0:10], minor=False)
plt.xticks(rotation=90)

The heatmap shows that the hillshade variables are strongly correlated with one another, the afternoon one in particular. Aspect is also correlated with the hillshade variables, especially with the afternoon hillshade.
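A heatmap can be hard to read precisely, so it may help to list the most strongly correlated pairs as numbers. The sketch below uses a small synthetic frame (the values are made up, and the names `toy` and `pairs` are illustrative only). On the notebook's data, the same steps would be applied to the correlation matrix computed above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the continuous features (hypothetical values, not the real data).
rng = np.random.RandomState(0)
aspect = rng.uniform(0, 360, size=500)
toy = pd.DataFrame({
    "Aspect": aspect,
    "Hillshade_3pm": 0.3 * aspect + rng.normal(0, 10, size=500),
    "Slope": rng.uniform(0, 60, size=500),
})

corr = toy.corr(method="pearson")

# Keep only the strict upper triangle so that each pair appears exactly once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna()

# Order the pairs by absolute correlation, strongest first.
pairs = pairs.iloc[np.argsort(-np.abs(pairs.values))]
print(pairs)
```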

**Elevation vs Slope Visualization**

The geographical factors elevation and slope are also correlated with each other, and may jointly influence the cover type.

In [None]:
con_variables['Type'] = con_variables['Cover_Type'].astype(str)  # string labels so that ggplot treats the cover type as categorical

In [None]:
g=ggplot(con_variables,aes(x='Elevation',y='Slope',color='Type')) +geom_point() +theme_bw()
print(g)

In [None]:
g=ggplot(con_variables,aes(x='Elevation',y='Slope',color='Type')) +geom_point() +theme_bw()+facet_wrap('Type')
print(g)

The type patterns are recognizable. For example, Type 7 is always found at high elevation. Slope, on the other hand, may not be as influential as Elevation.

**Hill Shade Visualization**

Since the afternoon hillshade is correlated with Aspect, we plot the noon and 3 pm hillshade indices against Aspect to see whether any patterns emerge.

In [None]:
g=ggplot(con_variables,aes(x='Aspect',y='Hillshade_Noon',color='Type')) +geom_point() +theme_bw()+facet_wrap('Type')
print(g)

In [None]:
g=ggplot(con_variables,aes(x='Aspect',y='Hillshade_3pm',color='Type')) +geom_point() +theme_bw()+facet_wrap('Type')
print(g)

At the same time of day, the aspect-hillshade patterns are similar across cover types. However, the overall patterns at noon and at 3 pm differ from each other.

**Distance_To_Hydrology**

In [None]:
g=ggplot(con_variables,aes(x='Horizontal_Distance_To_Hydrology',y='Vertical_Distance_To_Hydrology',color='Type')) +geom_point() +theme_bw()+facet_wrap('Type')
print(g)

Generally, the middle part of each of the seven plots is 'fatter' than the rest. A possible explanation is that locations closer to water tend to be more favorable for plant growth.

## Feature Selection

To reduce the dimensionality of the data set, I apply tree-based feature selection.

In [None]:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
features = train.iloc[:,0:54]
label = train['Cover_Type']
clf = ExtraTreesClassifier()
clf = clf.fit(features, label)
model = SelectFromModel(clf, prefit=True)
New_features = model.transform(features)
print(New_features.shape)

Based on the feature importances of the model, the number of features is reduced to 11, which should speed up model fitting.
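For a tree ensemble with no explicit `threshold`, `SelectFromModel` keeps the features whose importance is at least the mean importance. The sketch below reproduces that rule by hand on synthetic data from `make_classification`; the names `toy_clf`, `selector` and `manual_mask` are illustrative and do not appear in the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data standing in for the real features (an assumption for illustration).
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)

toy_clf = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X, y)
selector = SelectFromModel(toy_clf, prefit=True)

# Reproduce the default rule manually: keep features with importance >= mean importance.
importances = toy_clf.feature_importances_
manual_mask = importances >= importances.mean()

print("kept by SelectFromModel:", selector.get_support().sum())
print("kept by manual rule:    ", manual_mask.sum())
```

Because `ExtraTreesClassifier()` in the cell above has no `random_state`, the number of selected features may vary slightly from run to run.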

## Model Fit

In [None]:
test_features = test.iloc[:,0:54]
Test_features = model.transform(test_features)
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
Classifiers = [
    DecisionTreeClassifier(),
    LogisticRegression(C=1e-9, solver='liblinear', max_iter=200),
    RandomForestClassifier(n_estimators=200),
]

In [None]:
from sklearn.metrics import accuracy_score
Model = []
Accuracy = []
for clf in Classifiers:
    fit = clf.fit(New_features, label)
    pred = fit.predict(Test_features)
    acc = accuracy_score(test['Cover_Type'], pred)  # compute once, reuse below
    Model.append(clf.__class__.__name__)
    Accuracy.append(acc)
    print('Accuracy of ' + clf.__class__.__name__ + ' is ' + str(acc))

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(test['Cover_Type'], pred)  # pred holds the predictions of the last classifier in the loop (the random forest)

In [None]:
# clf is the last classifier from the loop above, i.e. the random forest
importances = clf.feature_importances_
std = np.std([tree.feature_importances_ for tree in clf.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]
plt.figure()
plt.title("Feature importances")
plt.bar(range(New_features.shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
plt.xticks(range(New_features.shape[1]), indices)
plt.xlim([-1, New_features.shape[1]])
plt.show()

**Question: I have always been curious about how to look up the names of the selected features in Python. Could any Kaggler help me with that?**
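One possible answer: `SelectFromModel.get_support()` returns a boolean mask over the input columns, which can be used to index the column names of the original DataFrame. The sketch below shows this on synthetic data. The column names `feat_0` to `feat_7` and the variables `toy_features`, `toy_clf` and `toy_model` are made up for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the training features, with hypothetical column names.
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)
toy_features = pd.DataFrame(X, columns=["feat_%d" % i for i in range(8)])

toy_clf = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(toy_features, y)
toy_model = SelectFromModel(toy_clf, prefit=True)

# Boolean mask: True for every column that the selector keeps.
mask = toy_model.get_support()
selected_names = toy_features.columns[mask]
print(list(selected_names))
```

In this notebook, the equivalent call would presumably be `features.columns[model.get_support()]`, and those names could then replace the numeric tick labels in the importance plot.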

## Concluding Remark

With tree-based feature selection, the accuracy of the random forest model improves and its running time drops. The data set contains enough information to build a well-performing classification model.
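One way to check such a claim is to fit the same random forest with and without feature selection and compare accuracy and fitting time. The sketch below does this on synthetic data, so its numbers say nothing about the cover type data; all variable names are illustrative.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the real features (an assumption for illustration).
X, y = make_classification(n_samples=1000, n_features=30, n_informative=6,
                           n_redundant=0, n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=999)

# Fit the selector on the training split only, then reduce both splits with it.
selector = SelectFromModel(
    ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr),
    prefit=True,
)
variants = {
    "all features": (X_tr, X_te),
    "selected features": (selector.transform(X_tr), selector.transform(X_te)),
}

results = {}
for name, (tr, te) in variants.items():
    start = time.perf_counter()
    rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(tr, y_tr)
    fit_seconds = time.perf_counter() - start
    results[name] = {
        "n_features": tr.shape[1],
        "accuracy": accuracy_score(y_te, rf.predict(te)),
        "fit_seconds": fit_seconds,
    }

for name, row in results.items():
    print(name, row)
```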

If you have any advice or suggestions, please share them! Any help will be greatly appreciated!