## Forest Cover Type Prediction

_[Kaggle competition](https://www.kaggle.com/c/forest-cover-type-prediction/overview)_

Author: Piotr Cichacki<br/>
Date: 18.02.2021

<b>1) Goal: to predict the forest cover type from strictly cartographic variables.</b>

<b>2) Data description</b><br/>
The training set (15120 observations) contains both the features and the cover type. Each observation is a 30m x 30m patch.<br/>
There are 4 binary columns for wilderness area and 40 binary columns for soil type in which 0 = absence and 1 = presence. <br/>
Seven cover types (our target variable): spruce/fir, lodgepole pine, ponderosa pine, cottonwood/willow, aspen, douglas-fir, krummholz.



In [None]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

import warnings
warnings.filterwarnings('ignore')

In [None]:
train = pd.read_csv('../input/forest-cover-type-prediction/train.csv')
train.drop('Id', axis=1, inplace=True)
test = pd.read_csv('../input/forest-cover-type-prediction/test.csv')

### Quick overview of our data

In [None]:
print("Training data shape: ", train.shape)
print("Test data shape: ", test.shape)

In [None]:
train.head(10)

In [None]:
train.info()

In [None]:
train.describe()

In [None]:
print("There are missing values in the training dataset: ", train.isnull().sum().values.sum() > 0)
print("There are missing values in the test dataset: ", test.isnull().sum().values.sum() > 0)

#### Conclusions:

The original training file has 56 columns (55 once `Id` is dropped), but only 12 attributes besides the ID and the target variable, because 4 columns encode the wilderness area and 40 columns encode the soil type. <br/>
All attributes are of type int (the wilderness area and soil type columns are binary, so they take only the values 0 and 1). <br/>
We do not have to deal with missing values in our datasets.
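
These claims are easy to verify programmatically. Below is a minimal sketch that uses a tiny hypothetical frame (`toy`, not part of the competition data) in place of `train`. It checks that the indicator columns are binary and that exactly one indicator is set in every row:

```python
import pandas as pd

# Hypothetical stand-in for `train`: three one-hot indicator columns.
toy = pd.DataFrame({
    'Wilderness_Area1': [1, 0, 0, 1],
    'Wilderness_Area2': [0, 1, 0, 0],
    'Wilderness_Area3': [0, 0, 1, 0],
})

indicator_cols = [c for c in toy.columns if c.startswith('Wilderness_Area')]

# True if every indicator value is 0 or 1.
is_binary = bool(toy[indicator_cols].isin([0, 1]).all().all())

# True if exactly one indicator is set in every row (a valid one-hot encoding).
is_one_hot = bool((toy[indicator_cols].sum(axis=1) == 1).all())

print("Binary:", is_binary, "| one-hot:", is_one_hot)
```

On the real data the same check would be run on the `Wilderness_Area*` and `Soil_Type*` column groups. Whether every row passes is something to confirm rather than assume.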

### Our target variable: cover type

In [None]:
plt.title("Distribution of cover type")
counts = train['Cover_Type'].value_counts()
# Recent seaborn versions require x and y to be passed as keyword arguments.
sns.barplot(x=counts.index, y=counts.values)
plt.show()

We can see that every cover type occurs the same number of times, so the training set is balanced.

### Exploratory data analysis

First, I will convert the training set so that soil type and wilderness area each occupy a single column. Then I will present the data in contingency tables to check whether the distribution of cover types differs between groups.

In [None]:
# Collapse the 40 one-hot soil columns into one integer column.
# Assumes exactly one soil type is set per row, so idxmax returns its column name.
soil_type = train.loc[:, 'Soil_Type1':'Soil_Type40'].idxmax(axis=1)
soil_type = soil_type.str[len('Soil_Type'):].astype(int)

In [None]:
# Collapse the 4 one-hot wilderness columns into one integer column.
# Assumes exactly one wilderness area is set per row.
wilderness_area = train.loc[:, 'Wilderness_Area1':'Wilderness_Area4'].idxmax(axis=1)
wilderness_area = wilderness_area.str[len('Wilderness_Area'):].astype(int)

In [None]:
data = pd.concat([train.iloc[:, 0:10], wilderness_area, soil_type, train['Cover_Type']], axis=1)
data = data.rename(columns={0:'Wilderness_Area', 1:'Soil_Type'})

In [None]:
data.head()

In [None]:
pd.crosstab(data['Wilderness_Area'], data['Cover_Type'])

In [None]:
sns.catplot(data=data, kind='count', x='Cover_Type', hue='Wilderness_Area')
plt.title('Distribution of cover type between wilderness areas')
plt.show()

In [None]:
pd.crosstab(data['Cover_Type'], data['Soil_Type'])

The tables contain many zeros, which means that certain cover types occur only under certain conditions. Wilderness area and soil type are therefore strongly associated with cover type.
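
Reading zeros off a table is a visual argument. A chi-square test of independence puts a number on it. The sketch below runs `scipy.stats.chi2_contingency` on a small hypothetical frame (`toy`) rather than on the competition data, and adds Cramér's V as a rough measure of association strength. It only illustrates the technique and does not report results for this dataset:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data standing in for `data`.
toy = pd.DataFrame({
    'Wilderness_Area': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    'Cover_Type':      [1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1],
})

table = pd.crosstab(toy['Wilderness_Area'], toy['Cover_Type'])

# Chi-square test of independence: a small p-value suggests that the
# cover type distribution differs between wilderness areas.
chi2, p_value, dof, expected = chi2_contingency(table)

# Cramér's V rescales chi2 to the range [0, 1].
n = table.to_numpy().sum()
cramers_v = (chi2 / (n * (min(table.shape) - 1))) ** 0.5

print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.4f}, V={cramers_v:.2f}")
```

The chi-square approximation becomes unreliable when many expected counts are small, which can happen for rare soil types, so the p-value for the soil type table should be read with care.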

Now let's focus on the remaining attributes.

In [None]:
data.groupby(['Cover_Type']).mean()

In [None]:
columns = data.columns[:-3]

In [None]:
for column in columns:
    sns.displot(data, x=column, hue='Cover_Type', kind='kde', fill=True, palette='Paired')
    plt.title(column + ' distribution between cover types')
    plt.show()

In [None]:
for column in columns:
    sns.boxplot(x='Cover_Type', y=column, data=data, palette='Paired')
    plt.title(column + ' distribution between cover types')
    plt.show()

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

In [None]:
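scaler = MinMaxScaler()
train['Slope'] = scaler.fit_transform(train[['Slope']]).ravel()
# Reuse the fitted scaler so the test set is on the same scale as the training set.
test['Slope'] = scaler.transform(test[['Slope']]).ravel()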

In [None]:
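columns = ['Elevation', 'Aspect', 'Horizontal_Distance_To_Hydrology', 'Vertical_Distance_To_Hydrology', 
           'Horizontal_Distance_To_Roadways', 'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm']
# Fit one scaler per column on the training data only, then apply the same
# transform to the test set so the model sees both on the same scale.
for column in columns:
    scaler = MinMaxScaler()
    train[column] = scaler.fit_transform(train[[column]]).ravel()
    test[column] = scaler.transform(test[[column]]).ravel()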
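
Whatever scaling is learned from the training data has to be applied unchanged to any data the model later predicts on. One way to guarantee this is to bundle the scaler with the model. The following is a minimal sketch using scikit-learn's `Pipeline` and `ColumnTransformer` on hypothetical toy data. The names `X_fit`, `y_fit`, `X_new` and the toy target are illustrative assumptions, not part of this notebook:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Hypothetical toy data: two continuous features and one binary indicator.
X_fit = pd.DataFrame({
    'Elevation': rng.integers(2000, 3800, size=60),
    'Slope': rng.integers(0, 50, size=60),
    'Wilderness_Area1': rng.integers(0, 2, size=60),
})
y_fit = (X_fit['Elevation'] > 2900).astype(int)  # made-up target
X_new = X_fit.sample(10, random_state=0)

continuous = ['Elevation', 'Slope']
preprocess = ColumnTransformer(
    [('scale', MinMaxScaler(), continuous)],
    remainder='passthrough',  # leave binary indicator columns untouched
)

model = Pipeline([
    ('preprocess', preprocess),
    ('forest', RandomForestClassifier(n_estimators=50, random_state=0)),
])

# fit() learns the min/max from X_fit only; predict() reuses them on X_new.
model.fit(X_fit, y_fit)
predictions = model.predict(X_new)
```

With this design the raw test frame can be passed straight to `predict`, and cross-validation refits the scaler inside each fold.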

In [None]:
plt.figure(figsize=(10,10))
sns.heatmap(data.corr().round(2), annot=True)
plt.show()

### Building model



In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(train.drop(['Cover_Type'], axis=1), train['Cover_Type'], random_state=42)

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
lr = LogisticRegression(C=1)
lr.fit(X_train, y_train)

In [None]:
print("Accuracy on training set: ", lr.score(X_train, y_train))
print("Accuracy on test set: ", lr.score(X_test, y_test))

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rfc = RandomForestClassifier(n_estimators=1000)
rfc.fit(X_train, y_train)

In [None]:
print("Accuracy on training set: ", rfc.score(X_train, y_train))
print("Accuracy on test set: ", rfc.score(X_test, y_test))

Let's now evaluate our model using K-fold cross-validation.

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
cross_val_score(rfc, X_train, y_train, cv=3, scoring='accuracy')
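
Individual fold scores are easier to compare when summarised as a mean and a standard deviation. The sketch below does this on synthetic data from `make_classification`, which stands in for `X_train` / `y_train`. With an integer `cv` and a classifier, scikit-learn already uses stratified folds without shuffling; the explicit `StratifiedKFold` here adds shuffling with a fixed seed:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical synthetic data standing in for X_train / y_train.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Stratified folds keep the class proportions similar in every fold;
# shuffling with a fixed seed makes the split reproducible.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

scores = cross_val_score(forest, X_toy, y_toy, cv=cv, scoring='accuracy')
print(f"Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```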

The next step is to build a confusion matrix to see where our model makes mistakes.

In [None]:
from sklearn.model_selection import cross_val_predict

In [None]:
y_train_pred = cross_val_predict(rfc, X_train, y_train, cv=5)

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
conf_matrix = confusion_matrix(y_train, y_train_pred)
conf_matrix

In [None]:
plt.rcParams["figure.figsize"] = (8, 8)
plt.matshow(conf_matrix, interpolation ='nearest', cmap='plasma')
plt.title("Confusion matrix", fontdict={'fontsize':12})
plt.colorbar()
plt.show()

In [None]:
row_sums = conf_matrix.sum(axis=1, keepdims=True)
norm_conf_matrix = conf_matrix / row_sums

In [None]:
np.fill_diagonal(norm_conf_matrix, 0)

plt.rcParams["figure.figsize"] = (8, 8)
plt.matshow(norm_conf_matrix, interpolation ='nearest', cmap='plasma')
plt.title('Plot of the errors\n (divided by number of observations in the corresponding class)', fontdict={'fontsize':12})
plt.colorbar()
plt.show()

From the plot above we can see that many Lodgepole Pine samples (Cover_Type 2, matrix index 1) are classified as Spruce/Fir (Cover_Type 1, matrix index 0) and vice versa.
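
The most confused class pairs can also be read directly from the matrix instead of from the plot. The sketch below uses a hypothetical 3-class matrix (`conf`) with made-up counts. `confusion_matrix` orders rows and columns by sorted label, so the `labels` list maps matrix positions back to class labels:

```python
import numpy as np

# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted).
conf = np.array([
    [80, 18,  2],
    [15, 83,  2],
    [ 1,  3, 96],
])
labels = [1, 2, 3]  # class labels in the order used by the matrix

# Per-row error rates with the diagonal (correct predictions) zeroed out.
errors = conf / conf.sum(axis=1, keepdims=True)
np.fill_diagonal(errors, 0)

# Flattened indices of the two cells with the largest error rate, largest first.
order = np.argsort(errors, axis=None)[::-1][:2]
top_pairs = [
    (labels[i], labels[j], float(errors[i, j]))
    for i, j in zip(*np.unravel_index(order, errors.shape))
]

for true_label, predicted_label, rate in top_pairs:
    print(f"true {true_label} -> predicted {predicted_label}: {rate:.2f}")
```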

### Making predictions

In [None]:
submission = pd.DataFrame()
submission['Id'] = test['Id']
submission['Cover_Type'] = rfc.predict(test.drop('Id', axis=1))
submission.set_index('Id', inplace=True)
submission.to_csv('submission.csv')