![forest](https://i.imgur.com/3tGpAwL.jpg)
<h1 align="center">Exploring Roosevelt Natl. Forest Cover Types</h1><br>
Here's my exploration of this dataset for predicting forest cover types in the Colorado mountains. It's a bit exhaustive, since the goal is to gather all the information in one place and consolidate it in a more human-readable form. Much of this first section most likely reverses changes previously made to the data set, such as turning dummy columns back into categorical types.
<h2>Project setup</h2>

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

pd.set_option('display.max_columns', 100)

In [None]:
# Set up labels 
cover_types = {
    1: "Spruce/Fir",
    2: "Lodgepole Pine",
    3: "Ponderosa Pine",
    4: "Cottonwood/Willow",
    5: "Aspen",
    6: "Douglas-fir",
    7: "Krummholz"}
wild_areas = {
    1: "Rawah Wilderness Area",
    2: "Neota Wilderness Area",
    3: "Comanche Peak Wilderness Area",
    4: "Cache la Poudre Wilderness Area"}

<h2>Read input data files into Pandas dataframes</h2>

In [None]:
train_raw = pd.read_csv("../input/train.csv")
test_raw = pd.read_csv("../input/test.csv")
sample_submission = pd.read_csv("../input/sample_submission.csv")

In [None]:
train_raw.head()

<h2>Relabel data for interpretation</h2><br>
Since the raw dataframe is very wide, with separate indicator columns for each wilderness area and soil type, I'll clean it up for plotting and readability.

In [None]:
train = train_raw.copy()
test = test_raw.copy()

<h3>Relabel cover types with descriptive values</h3>

In [None]:
train['Cover_Type'] = train['Cover_Type'].apply(lambda x: cover_types[x])

<h3>Relabel wilderness areas with true names</h3>

In [None]:
df = train[["Wilderness_Area1","Wilderness_Area2","Wilderness_Area3","Wilderness_Area4"]]
df = df.idxmax(axis=1)
train["Wilderness_Area1"] = df.apply(lambda x: wild_areas[int(x.split("Wilderness_Area")[1])])
train = train.rename(columns = {"Wilderness_Area1": "Wilderness_Area"})
train.drop(["Wilderness_Area2", "Wilderness_Area3", "Wilderness_Area4"], axis=1, inplace=True)
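
To make the `idxmax` step above easier to follow, here is a minimal sketch on a made-up three-row frame. The toy values and the shortened `area_names` mapping are assumptions for illustration only; the column naming follows the dataset.

```python
import pandas as pd

# Toy frame with three one-hot "area" columns (hypothetical data)
toy = pd.DataFrame({
    "Wilderness_Area1": [1, 0, 0],
    "Wilderness_Area2": [0, 0, 1],
    "Wilderness_Area3": [0, 1, 0],
})
area_names = {1: "Rawah", 2: "Neota", 3: "Comanche Peak"}

# idxmax(axis=1) returns, per row, the name of the column holding the largest value (the 1)
hot_column = toy.idxmax(axis=1)

# Strip the prefix to recover the integer code, then map the code to a readable label
codes = hot_column.str.replace("Wilderness_Area", "", regex=False).astype(int)
toy_area = codes.map(area_names)
print(toy_area.tolist())
```

One caveat: `idxmax` returns the first column when several share the maximum, so a row with no indicator set would silently be assigned the first area. The approach relies on every row having exactly one indicator equal to 1.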

<h3>Restructure soil types as categorical column</h3>

In [None]:
train['Soil_Type1'] = train[train.columns[12:52]].idxmax(axis=1)
train = train.rename(columns = {"Soil_Type1": "Soil_Type"})
train.drop(train.columns[13:52], inplace=True, axis=1)

In [None]:
train.columns = train.columns.str.replace("_", " ")
train.drop("Id", inplace=True, axis=1)

<h2>Inspect newly labeled and organized data</h2>

In [None]:
train.head()

In [None]:
train.describe()

<h2>Relabel Test Data</h2> <br>
Here's all the same (relevant) relabeling applied to the test dataframe for later comparison.

In [None]:
# Relabel wilderness areas with true names
df = test[["Wilderness_Area1","Wilderness_Area2","Wilderness_Area3","Wilderness_Area4"]]
df = df.idxmax(axis=1)
test["Wilderness_Area1"] = df.apply(lambda x: wild_areas[int(x.split("Wilderness_Area")[1])])
test = test.rename(columns = {"Wilderness_Area1": "Wilderness_Area"})
test.drop(["Wilderness_Area2", "Wilderness_Area3", "Wilderness_Area4"], axis=1, inplace=True)
# Restructure soil types as categorical column
test['Soil_Type1'] = test[test.columns[12:52]].idxmax(axis=1)
test = test.rename(columns = {"Soil_Type1": "Soil_Type"})
test.drop(test.columns[13:52], inplace=True, axis=1)
test.columns = test.columns.str.replace("_", " ")
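
Since the same relabeling is applied to both frames, one option is a small helper that selects the indicator columns by name rather than by position. This is only a sketch: `collapse_one_hot` is a hypothetical helper, not part of the notebook, and the toy frames stand in for the real data.

```python
import pandas as pd

def collapse_one_hot(df, prefix, new_col):
    """Replace all columns whose name starts with `prefix` by one categorical column."""
    hot_cols = [c for c in df.columns if c.startswith(prefix)]
    out = df.drop(columns=hot_cols)
    # For each row, keep the name of the indicator column that is set
    out[new_col] = df[hot_cols].idxmax(axis=1)
    return out

# Toy stand-ins for the train and test frames (hypothetical values)
toy_train = pd.DataFrame({
    "Id": [1, 2],
    "Soil_Type1": [0, 1],
    "Soil_Type2": [1, 0],
    "Cover_Type": [3, 5],
})
toy_test = toy_train.drop(columns="Cover_Type")

toy_train_c = collapse_one_hot(toy_train, "Soil_Type", "Soil_Type")
toy_test_c = collapse_one_hot(toy_test, "Soil_Type", "Soil_Type")
print(toy_train_c)
```

Selecting by prefix avoids depending on slices such as `columns[12:52]`, which break if the column order changes. Unlike the cells above, this sketch appends the new column at the end instead of keeping it in place.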

<h2>Data Visualisation</h2>
<br>
Let's look at some plots of the dataset. For some of these we'll have to actually go back to the raw data in order to get the columns in a form that can be graphed properly. 

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
plt.figure(figsize=(9,4.5))
sns.countplot(x="Cover Type", data=train)
plt.xticks(rotation=65)
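
The class balance visible in the count plot can also be checked numerically. A minimal sketch, using a made-up label series in place of `train['Cover Type']`:

```python
import pandas as pd

# Toy labels standing in for train['Cover Type'] (hypothetical values)
labels = pd.Series(["Aspen", "Krummholz", "Aspen", "Krummholz"], name="Cover Type")

counts = labels.value_counts()                 # rows per class
shares = labels.value_counts(normalize=True)   # fraction of rows per class

# The classes are perfectly balanced when every class has the same count
is_balanced = counts.nunique() == 1
print(counts)
print("balanced:", is_balanced)
```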

We can see that each cover type is equally represented in the set. Let's look at how they're distributed relative to some of the other variables.

In [None]:
plt.figure(figsize=(9,4.5))
sns.countplot(x='Cover Type', hue='Wilderness Area', data=train)
plt.xticks(rotation=65)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2)
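
The grouped count plot can also be read as a table. The sketch below uses `pd.crosstab` on a made-up frame; the five toy rows are assumptions, and on the real data the two columns would come from `train`.

```python
import pandas as pd

# Toy rows standing in for the relabeled training frame (hypothetical values)
toy = pd.DataFrame({
    "Cover Type": ["Aspen", "Aspen", "Krummholz", "Krummholz", "Krummholz"],
    "Wilderness Area": ["Rawah", "Rawah", "Neota", "Neota", "Rawah"],
})

# Raw counts for every (cover type, wilderness area) pair
table = pd.crosstab(toy["Cover Type"], toy["Wilderness Area"])

# Share of each cover type found in each area (every row sums to 1)
row_shares = pd.crosstab(toy["Cover Type"], toy["Wilderness Area"], normalize="index")

# Total number of samples per area, to spot under-represented areas
area_totals = table.sum(axis=0)
print(table)
print(area_totals)
```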

The grouped count plot suggests a strong association between cover type and wilderness area, and shows that the Neota area has far fewer samples than the others. Let's do one more visualization using the soil types.

In [None]:
df = pd.DataFrame(train['Soil Type'].value_counts())   
df.reset_index(inplace=True)
df.columns = ['Soil Type', 'Soil Count']

In [None]:
plt.figure(figsize=(25, 7))
sns.barplot(x='Soil Type', y='Soil Count', data=df)
plt.xticks(rotation=80)
sns.set_context("notebook", font_scale=1.5)
plt.title('Train Soil Types')

Now we can see that the most common soil type by far is Soil_Type10, and the least common types are barely represented, if at all. Let's check on those rarer types to see if they have any entries at all.

In [None]:
df.tail()

Ok, there's at least one entry for all soil types, but we should be aware that some of them have very low representation. For good predictions, we'd hope the test data follows similar trends to the training set, so let's take a look at the test set to see its distribution.

In [None]:
df = pd.DataFrame(test['Soil Type'].value_counts())   
df.reset_index(inplace=True)
df.columns = ['Soil Type', 'Soil Count']

In [None]:
plt.figure(figsize=(25, 7))
sns.barplot(x='Soil Type', y='Soil Count', data=df)
plt.xticks(rotation=80)
sns.set_context("notebook", font_scale=1.5)
plt.title('Test Soil Types')

In [None]:
df.tail()
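
To compare the two soil distributions directly rather than by eye, the normalized counts can be placed side by side. This is a sketch on made-up series; on the real data they would be `train1['Soil Type']` and `test1['Soil Type']` or their equivalents.

```python
import pandas as pd

# Toy soil columns standing in for the train and test frames (hypothetical values)
train_soil = pd.Series(["Soil_Type10", "Soil_Type10", "Soil_Type29", "Soil_Type3"])
test_soil = pd.Series(["Soil_Type10", "Soil_Type29", "Soil_Type29", "Soil_Type29"])

# Share of rows per soil type in each set; types missing from one set become 0
compare = pd.concat(
    [train_soil.value_counts(normalize=True).rename("train_share"),
     test_soil.value_counts(normalize=True).rename("test_share")],
    axis=1,
).fillna(0.0)

# Largest gaps first
compare["abs_diff"] = (compare["train_share"] - compare["test_share"]).abs()
compare = compare.sort_values("abs_diff", ascending=False)
print(compare)
```

Shares are used instead of raw counts because the train and test sets differ in size.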

The test data has a very similar distribution of soil types to the training data. Although the ranking differs for individual types, the similarity seems sufficient for meaningful use. In combination with the other features, we'll see how the model copes with these discrepancies.
<br>

In [None]:
# Save altered data
train1 = train
test1 = test

# Reset train and test sets to original states
train = train_raw.copy()
test = test_raw.copy()

## Feature Building

The data set shows multiple columns of location in relation to certain geographical amenities. By combining these variables we can create new features as a function of distance from these amenities. These new features are likely to have higher importance in our prediction model because they are more specific descriptions than the isolated indicators of location. [Lathwal](https://www.kaggle.com/codename007) has already done this engineering well in [his notebook](https://www.kaggle.com/codename007/forest-cover-type-eda-baseline-model), so I will borrow his new features for use in this model. The next cell is his code.

In [None]:
####################### Train data #############################################
train['HF1'] = train['Horizontal_Distance_To_Hydrology']+train['Horizontal_Distance_To_Fire_Points']
train['HF2'] = abs(train['Horizontal_Distance_To_Hydrology']-train['Horizontal_Distance_To_Fire_Points'])
train['HR1'] = abs(train['Horizontal_Distance_To_Hydrology']+train['Horizontal_Distance_To_Roadways'])
train['HR2'] = abs(train['Horizontal_Distance_To_Hydrology']-train['Horizontal_Distance_To_Roadways'])
train['FR1'] = abs(train['Horizontal_Distance_To_Fire_Points']+train['Horizontal_Distance_To_Roadways'])
train['FR2'] = abs(train['Horizontal_Distance_To_Fire_Points']-train['Horizontal_Distance_To_Roadways'])
train['ele_vert'] = train.Elevation-train.Vertical_Distance_To_Hydrology

train['slope_hyd'] = (train['Horizontal_Distance_To_Hydrology']**2+train['Vertical_Distance_To_Hydrology']**2)**0.5
train.slope_hyd=train.slope_hyd.map(lambda x: 0 if np.isinf(x) else x) # remove infinite value if any

#Mean distance to Amenities 
train['Mean_Amenities']=(train.Horizontal_Distance_To_Fire_Points + train.Horizontal_Distance_To_Hydrology + train.Horizontal_Distance_To_Roadways) / 3 
#Mean Distance to Fire and Water 
train['Mean_Fire_Hyd']=(train.Horizontal_Distance_To_Fire_Points + train.Horizontal_Distance_To_Hydrology) / 2 

####################### Test data #############################################
test['HF1'] = test['Horizontal_Distance_To_Hydrology']+test['Horizontal_Distance_To_Fire_Points']
test['HF2'] = abs(test['Horizontal_Distance_To_Hydrology']-test['Horizontal_Distance_To_Fire_Points'])
test['HR1'] = abs(test['Horizontal_Distance_To_Hydrology']+test['Horizontal_Distance_To_Roadways'])
test['HR2'] = abs(test['Horizontal_Distance_To_Hydrology']-test['Horizontal_Distance_To_Roadways'])
test['FR1'] = abs(test['Horizontal_Distance_To_Fire_Points']+test['Horizontal_Distance_To_Roadways'])
test['FR2'] = abs(test['Horizontal_Distance_To_Fire_Points']-test['Horizontal_Distance_To_Roadways'])
test['ele_vert'] = test.Elevation-test.Vertical_Distance_To_Hydrology

test['slope_hyd'] = (test['Horizontal_Distance_To_Hydrology']**2+test['Vertical_Distance_To_Hydrology']**2)**0.5
test.slope_hyd=test.slope_hyd.map(lambda x: 0 if np.isinf(x) else x) # remove infinite value if any

#Mean distance to Amenities 
test['Mean_Amenities']=(test.Horizontal_Distance_To_Fire_Points + test.Horizontal_Distance_To_Hydrology + test.Horizontal_Distance_To_Roadways) / 3 
#Mean Distance to Fire and Water 
test['Mean_Fire_Hyd']=(test.Horizontal_Distance_To_Fire_Points + test.Horizontal_Distance_To_Hydrology) / 2
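
Because the train and test blocks above are identical apart from the frame name, the same features could be built by one function applied to both. The sketch below covers only a subset of the features; `add_distance_features` is a hypothetical name, and the toy values are made up.

```python
import numpy as np
import pandas as pd

def add_distance_features(df):
    """Return a copy of df with a few of the combined distance features added."""
    out = df.copy()
    hyd = out["Horizontal_Distance_To_Hydrology"]
    fire = out["Horizontal_Distance_To_Fire_Points"]
    road = out["Horizontal_Distance_To_Roadways"]
    out["HF1"] = hyd + fire
    out["HF2"] = (hyd - fire).abs()
    # Straight-line distance to water: sqrt(horizontal**2 + vertical**2)
    out["slope_hyd"] = np.hypot(hyd, out["Vertical_Distance_To_Hydrology"])
    out["Mean_Amenities"] = (hyd + fire + road) / 3
    return out

# Toy rows (hypothetical values) to show the arithmetic
toy = pd.DataFrame({
    "Horizontal_Distance_To_Hydrology": [3, 30],
    "Vertical_Distance_To_Hydrology": [4, -40],
    "Horizontal_Distance_To_Fire_Points": [10, 10],
    "Horizontal_Distance_To_Roadways": [2, 20],
})
toy_feat = add_distance_features(toy)
print(toy_feat[["HF1", "HF2", "slope_hyd", "Mean_Amenities"]])
```

Despite its name, `slope_hyd` is the Euclidean distance to the nearest water feature rather than a slope. Applying one function to both frames keeps the train and test features from drifting apart.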

In [None]:
features = [col for col in train.columns if col not in ['Cover_Type','Id']]

## Run Classifier Models

In [None]:
# Hold out 30% of the labeled training data so that I can measure model accuracy
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
X_train, X_test, y_train, y_test = train_test_split(train[features], train['Cover_Type'], test_size=0.3)
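
The split above is random on every run, so the reported accuracies will shift slightly between runs. A sketch of a reproducible, class-balanced alternative follows, using the `random_state` and `stratify` parameters of `train_test_split`. The toy arrays and the seed value 0 are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, two classes of 10 rows each (hypothetical values)
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.repeat([1, 2], 10)

# stratify keeps the class proportions equal in both parts;
# random_state makes the shuffle repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0)

# Running the same call again yields exactly the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0)
print(len(y_tr), len(y_te), np.array_equal(X_te, X_te2))
```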

In [None]:
predictions = pd.DataFrame()

### Random Forests

Considering the source data, Random Forests of decision trees seems like a poetically appropriate model to apply, although it may not be the most functional. Let's see how well it predicts.

In [None]:
# Benefit from n_estimators seems to level out around 1000
# with n_estimators=500, accuracy=0.78643 (119th place)
# with n_estimators=750, accuracy=0.78699 (114th place)

from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=750)

In [None]:
%%time
rfc.fit(X_train, y_train)

In [None]:
%%time
predictions['Random Forest'] = rfc.predict(X_test)

In [None]:
print(classification_report(y_test, predictions['Random Forest']))
print(confusion_matrix(y_test, predictions['Random Forest']))
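
A fitted random forest exposes `feature_importances_`, which offers a way to check whether the engineered features actually rank highly. The sketch below runs on synthetic data from `make_classification`; the column names `f0` to `f4` and the small forest size are placeholders. On the real model the index would be `features`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data (hypothetical)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=0)
cols = [f"f{i}" for i in range(5)]

demo_rfc = RandomForestClassifier(n_estimators=50, random_state=0)
demo_rfc.fit(X_demo, y_demo)

# Impurity-based importances, one per feature, summing to 1
importances = pd.Series(demo_rfc.feature_importances_, index=cols)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances tend to favor features with many distinct values, so the ranking is a rough guide rather than a definitive measure.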

### Extra Trees

Adding in the extra trees classifier will give another model to check against. 

In [None]:
from sklearn.ensemble import ExtraTreesClassifier
etc = ExtraTreesClassifier(n_estimators=750)

In [None]:
%%time
etc.fit(X_train, y_train)

In [None]:
%%time
predictions['Extra Trees'] = etc.predict(X_test)

In [None]:
print(classification_report(y_test, predictions['Extra Trees']))
print(confusion_matrix(y_test, predictions['Extra Trees']))

### Gradient Boosting 

A gradient boosting classifier, here wrapped in AdaBoost, will give me another model for the vote consensus.

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
gbc = AdaBoostClassifier(GradientBoostingClassifier(n_estimators=100), n_estimators=10, learning_rate=.1, algorithm='SAMME')


In [None]:
%%time
gbc.fit(X_train, y_train)

In [None]:
%%time
predictions['Gradient Boosting'] = gbc.predict(X_test)

In [None]:
print(classification_report(y_test, predictions['Gradient Boosting']))
print(confusion_matrix(y_test, predictions['Gradient Boosting']))

### Ada Boost

I'll also do an Ada boosting classifier on the Extra Trees model, and play with the parameters to squeeze better accuracy out of it.

In [None]:
from sklearn.ensemble import AdaBoostClassifier
abc = AdaBoostClassifier(ExtraTreesClassifier(n_estimators=500), n_estimators=500, learning_rate=.1, algorithm='SAMME')

In [None]:
%%time
abc.fit(X_train, y_train)

In [None]:
%%time
predictions['Ada Boost'] = abc.predict(X_test)

In [None]:
print(classification_report(y_test, predictions['Ada Boost']))
print(confusion_matrix(y_test, predictions['Ada Boost']))

## Tally up Votes


In [None]:
predictions.describe()

In [None]:
%%time
pred = predictions.mode(axis=1)

In [None]:
predictions.head()

In [None]:
pred.head()

In [None]:
print(classification_report(y_test, pred[0]))
print(confusion_matrix(y_test, pred[0]))
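
With four voters, a 2-2 split is possible, and it helps to see what `mode(axis=1)` does in that case. The sketch below uses made-up votes; the column names mirror the `predictions` frame.

```python
import pandas as pd

# Two toy rows of model votes (hypothetical values)
votes = pd.DataFrame({
    "Random Forest": [1, 3],
    "Extra Trees": [1, 3],
    "Gradient Boosting": [2, 3],
    "Ada Boost": [2, 1],
})

# Row 0 is a 2-2 tie between labels 1 and 2; row 1 has a clear winner (3)
modes = votes.mode(axis=1)
print(modes)

# Column 0 holds the smallest of the tied labels, so ties resolve to the lower class number
print(modes[0].tolist())
```

As far as I can tell, taking `pred[0]` therefore breaks ties by label order rather than by model quality, which may be one reason the vote does not beat the best single model. Weighting the votes or using an odd number of models would avoid this.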

## Get Predictions from Best Model

While Random Forests and Extra Trees both did well at around 89% accuracy, AdaBoost with Extra Trees scored highest at around 90%. Since the voting mechanism detracts from this accuracy, I'll submit my predictions from just the AdaBoost with Extra Trees model. The cells below run the model on the real test data.

In [None]:
X_train = train[features]
y_train = train['Cover_Type']
X_test = test[features]

In [None]:
abc2 = AdaBoostClassifier(ExtraTreesClassifier(n_estimators=500), n_estimators=500, learning_rate=.1, algorithm='SAMME')

In [None]:
%%time
abc2.fit(X_train, y_train)

In [None]:
%%time
pred = abc2.predict(X_test)

<h2>Write submission file</h2>

In [None]:
sub = pd.DataFrame({"Id": test["Id"], "Cover_Type": pred.astype('int')})
sub.head()

In [None]:
sub.to_csv("submission.csv", index=False)
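
Before uploading, a quick sanity check against the sample submission can catch shape or label problems. This is a sketch only: `check_submission` is a hypothetical helper, and the toy frames stand in for `sub` and `sample_submission`.

```python
import pandas as pd

def check_submission(sub, sample, valid_labels=(1, 2, 3, 4, 5, 6, 7)):
    """Raise AssertionError if sub does not match the sample's layout."""
    assert list(sub.columns) == list(sample.columns), "column mismatch"
    assert sub["Id"].equals(sample["Id"]), "Id mismatch"
    assert sub["Cover_Type"].isin(valid_labels).all(), "unexpected label"
    return True

# Toy frames standing in for sample_submission and sub (hypothetical values)
sample = pd.DataFrame({"Id": [101, 102, 103], "Cover_Type": [1, 1, 1]})
demo_sub = pd.DataFrame({"Id": [101, 102, 103], "Cover_Type": [2, 5, 7]})

print(check_submission(demo_sub, sample))
```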

This notebook is an example of what is possible with a simple analysis and prediction workflow, without much tuning of the model parameters. If you'd like to leave a comment, please do. I welcome any feedback to further my own learning process.