# Forest Cover Type Prediction
***
$16.07.2021$

__Any ideas on how to improve performance are appreciated!__

**Best accuracy: 0.71819.**

What can be improved: feature analysis and feature engineering.
***
A cool project where we classify the forest cover type of a patch of land based on its properties. <br>

The forest cover type characterizes the kind of trees that grow on the land with a given Id.

There are 7 types:

1. Spruce/Fir 
2. Lodgepole Pine
3. Ponderosa Pine
4. Cottonwood/Willow
5. Aspen
6. Douglas-fir
7. Krummholz
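
The class codes listed above can be kept in a small lookup table so that numeric predictions are easier to read later on. This is only a convenience sketch: the names `COVER_TYPE_NAMES` and `decode_cover_types` are hypothetical and are not used elsewhere in this notebook.

```python
# Hypothetical helper: map the integer Cover_Type codes to readable names.
COVER_TYPE_NAMES = {
    1: "Spruce/Fir",
    2: "Lodgepole Pine",
    3: "Ponderosa Pine",
    4: "Cottonwood/Willow",
    5: "Aspen",
    6: "Douglas-fir",
    7: "Krummholz",
}


def decode_cover_types(codes):
    """Translate an iterable of integer codes into class names."""
    return [COVER_TYPE_NAMES[int(code)] for code in codes]


print(decode_cover_types([1, 4, 7]))
```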

**Let's get into it!**

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier, VotingClassifier

import warnings
warnings.filterwarnings('ignore')

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
dataset = pd.read_csv('/kaggle/input/forest-cover-type-prediction/train.csv')
test = pd.read_csv('/kaggle/input/forest-cover-type-prediction/test.csv')

In [None]:
dataset.columns

In [None]:
dataset.describe()

In [None]:
Y = dataset['Cover_Type']

In [None]:
X = dataset.drop(['Cover_Type', 'Id'], axis=1)

ids = test['Id']
test = test.drop(['Id'], axis=1)

## 1. Data Exploration
One of the first questions that comes to mind when we see the data is how the features are distributed. <br>
**1.1 Elevation distribution**

In [None]:
ax = sns.displot(X, x='Elevation')
ax.set(xlabel ='meters')
plt.show()

<br>

**1.2 Slope distribution**

In [None]:
ax = sns.displot(X, x='Slope', color='green')
ax.set(xlabel ='Slope(Degrees)')
plt.show()

<br>

**1.3 Horizontal distance to water distribution**

In [None]:
ax = sns.displot(X, x='Horizontal_Distance_To_Hydrology', kind='kde')
ax.set(xlabel ='Meters')
plt.show()

<br>

**1.4 Vertical distance to water distribution**

In [None]:
ax = sns.displot(X, x='Vertical_Distance_To_Hydrology', kind='kde')
ax.set(xlabel ='Meters')
plt.show()

<br>

**1.5 Horizontal distance to roadways distribution**

In [None]:
ax = sns.displot(X, x='Horizontal_Distance_To_Roadways', color='brown')
ax.set(xlabel ='Meters')
plt.show()

<br>

**1.6 Horizontal distance to fire points distribution**

In [None]:
ax = sns.displot(X, x='Horizontal_Distance_To_Fire_Points', color='blue')
ax.set(xlabel ='Meters')
plt.show()

## 2. Feature Engineering

Maybe the type of trees in an area depends on the absolute (straight-line) distance to water?

For example, maybe cottonwood likes to be near water, while pines and spruces don't.

Let's test this by adding a new feature.


__Spoiler: this didn't help performance; it slightly reduced accuracy.__



In [None]:
#def add_abs_distance(dataframe):
#    dataframe['Absolute_Distance_To_Hydrology'] = (dataframe['Vertical_Distance_To_Hydrology'] ** 2 
#                                                   + dataframe['Horizontal_Distance_To_Hydrology'] ** 2) ** 0.5
#    # drop() returns a new frame, so assign the result back
#    dataframe = dataframe.drop(['Vertical_Distance_To_Hydrology', 'Horizontal_Distance_To_Hydrology'], axis=1)
#    return dataframe

#X = add_abs_distance(X)
#test = add_abs_distance(test)
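
The straight-line distance to water is the Euclidean norm of the horizontal and vertical offsets. Below is a minimal sketch of that feature on made-up values (the toy frame is an assumption for illustration, not rows from the dataset). It uses `np.hypot` and returns a copy instead of modifying its input.

```python
import numpy as np
import pandas as pd


def add_abs_distance(dataframe):
    """Return a copy with the straight-line distance to water added."""
    out = dataframe.copy()
    out['Absolute_Distance_To_Hydrology'] = np.hypot(
        out['Horizontal_Distance_To_Hydrology'],
        out['Vertical_Distance_To_Hydrology'],
    )
    return out


# Made-up values, for illustration only.
toy = pd.DataFrame({
    'Horizontal_Distance_To_Hydrology': [3, 30, 0],
    'Vertical_Distance_To_Hydrology': [4, -40, 12],
})
toy = add_abs_distance(toy)
print(toy['Absolute_Distance_To_Hydrology'].tolist())
```

This sketch keeps the two original columns. Whether to drop them afterwards is a separate choice that would need to be checked against validation accuracy.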

## 3. Models

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, 
                                                    test_size=0.3, random_state=42)

We will try different multiclass algorithms:

In [None]:
KNN = KNeighborsClassifier(n_neighbors=3)

KNN.fit(X_train, Y_train)
res = KNN.predict(X_test)

print(accuracy_score(Y_test, res))
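
`StandardScaler` is imported at the top of the notebook but is not applied in the cells shown. KNN relies on distances, so features with large numeric ranges can dominate the neighbour search. The sketch below illustrates the effect on synthetic data (an assumption for illustration; all names in it are hypothetical). It does not show that scaling helps on the forest data, which would have to be tested separately.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: one informative feature on a small scale and
# one pure-noise feature on a much larger scale.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=600)
informative = labels + rng.normal(0.0, 0.3, size=600)
noise = rng.normal(0.0, 1000.0, size=600)
features = np.column_stack([informative, noise])

F_train, F_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.3, random_state=42)

# KNN on the raw features: distances are dominated by the noise column.
raw_knn = KNeighborsClassifier(n_neighbors=3).fit(F_train, l_train)

# Same model, but every feature is standardized first.
scaled_knn = make_pipeline(
    StandardScaler(), KNeighborsClassifier(n_neighbors=3)
).fit(F_train, l_train)

raw_acc = raw_knn.score(F_test, l_test)
scaled_acc = scaled_knn.score(F_test, l_test)
print(f"raw: {raw_acc:.3f}, scaled: {scaled_acc:.3f}")
```

Wrapping the scaler and the classifier in one pipeline means the scaling statistics are learned from the training split only.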

In [None]:
DTC = DecisionTreeClassifier(max_depth=10)

DTC.fit(X_train, Y_train)
res = DTC.predict(X_test)

print(accuracy_score(Y_test, res))

In [None]:
GNB = GaussianNB()

GNB.fit(X_train, Y_train)
res = GNB.predict(X_test)

print(accuracy_score(Y_test, res))

In [None]:
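# Recent xgboost releases expect class labels 0..K-1, while Cover_Type is 1..7,
# so shift the labels for fitting and shift the predictions back.
XGBC = XGBClassifier(n_estimators=300)

XGBC.fit(X_train, Y_train - 1)
res = XGBC.predict(X_test) + 1

print(accuracy_score(Y_test, res))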

In [None]:
RFC = RandomForestClassifier(n_estimators=50)

RFC.fit(X_train, Y_train)
res = RFC.predict(X_test)

print(accuracy_score(Y_test, res))
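
Each model above is scored on a single 70/30 split, so the numbers depend on that particular split. One possible alternative is k-fold cross-validation, sketched below on synthetic data from `make_classification`. The data and the dictionary names are assumptions for illustration, and the printed scores say nothing about the forest dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem, standing in for the real features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, random_state=0)

candidates = {
    'KNN': KNeighborsClassifier(n_neighbors=3),
    'DTC': DecisionTreeClassifier(max_depth=10, random_state=0),
    'GNB': GaussianNB(),
}

# Mean accuracy over 5 folds for every candidate model.
cv_scores = {}
for name, model in candidates.items():
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy')
    cv_scores[name] = fold_scores.mean()
    print(f"{name}: {cv_scores[name]:.3f}")
```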

## 4. Voting
We will use a soft voting classifier.

In [None]:
estimators=[('KNN', KNeighborsClassifier(n_neighbors=3)),
            ('DTC', DecisionTreeClassifier(max_depth=10)), 
            ('XGBC', XGBClassifier(n_estimators=300)), 
            ('RFC', RandomForestClassifier(n_estimators=50))]

voting = VotingClassifier(estimators=estimators, voting='soft')

voting.fit(X_train, Y_train)
res = voting.predict(X_test)

print(accuracy_score(Y_test, res))
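
Soft voting picks the class with the highest average predicted probability across the models, whereas hard voting takes a majority of the individual predictions. The sketch below reproduces both rules by hand with NumPy on made-up probabilities (the arrays and class labels are hypothetical, not outputs of the models above), to show a case where the two rules disagree.

```python
import numpy as np

classes = np.array([1, 2, 3])  # hypothetical class labels

# Made-up predict_proba outputs of three classifiers for two samples.
proba_a = np.array([[0.9, 0.1, 0.0], [0.1, 0.2, 0.7]])
proba_b = np.array([[0.4, 0.6, 0.0], [0.2, 0.3, 0.5]])
proba_c = np.array([[0.4, 0.6, 0.0], [0.3, 0.3, 0.4]])

# Shape: (n_models, n_samples, n_classes)
stacked = np.stack([proba_a, proba_b, proba_c])

# Soft voting: average the probabilities, then take the most probable class.
soft_labels = classes[stacked.mean(axis=0).argmax(axis=1)]

# Hard voting, for comparison: every model votes for its own top class.
votes = stacked.argmax(axis=2)  # shape: (n_models, n_samples)
hard_labels = classes[
    [np.bincount(sample_votes, minlength=len(classes)).argmax()
     for sample_votes in votes.T]
]

print("soft:", soft_labels, "hard:", hard_labels)
```

For the first sample, one very confident model outweighs two mildly confident ones under soft voting, while hard voting follows the majority.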

## 5. Predict

In [None]:
predict = voting.predict(test)

In [None]:
# Keep Id as a regular column so that it is written to the CSV;
# setting it as the index and then passing index=False would drop it.
submission = pd.DataFrame({'Id': ids, 'Cover_Type': predict})
submission.to_csv('/kaggle/working/submission.csv', index=False)