# Forest Cover Type Prediction
Use cartographic variables to classify forest categories

## PyCaret

#### Problem
The study area includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. Each observation is a 30m x 30m patch. You are asked to predict an integer classification for the forest cover type. The seven types are:

    1 - Spruce/Fir
    2 - Lodgepole Pine
    3 - Ponderosa Pine
    4 - Cottonwood/Willow
    5 - Aspen
    6 - Douglas-fir
    7 - Krummholz

The training set (15120 observations) contains both features and the Cover_Type. The test set contains only the features. You must predict the Cover_Type for every row in the test set (565892 observations).

#### Evaluation Metric
Multi-class classification accuracy
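
As a minimal sketch of what this metric measures, accuracy is the share of rows whose predicted cover type equals the true one. The helper name `multiclass_accuracy` and the labels below are hypothetical and only serve as an illustration.

```python
from typing import Sequence


def multiclass_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels, for illustration only (cover types 1 to 7)
y_true = [1, 2, 2, 5, 7, 3]
y_pred = [1, 2, 1, 5, 7, 6]

accuracy = multiclass_accuracy(y_true, y_pred)
print(f"Accuracy: {accuracy:.4f}")
```

Every class counts the same per row, so a model that is strong on the frequent cover types can score well even if it is weak on a rare one.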

In [None]:
# Install PyCaret
!pip install pycaret

In [None]:
# install watermark
!pip install watermark

## Imports

In [None]:
import pandas as pd
import numpy as np
import random
import matplotlib as m
import matplotlib.pyplot as plt
import seaborn as sns

# import pycaret
import pycaret
from pycaret.classification import *  # import the classification module

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

# ignore warnings
import warnings
warnings.filterwarnings('ignore')

# make pandas show all columns
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Formatting the plots
plt.rcParams.update(plt.rcParamsDefault)
%matplotlib inline

plt.style.use('fivethirtyeight')
m.rcParams['axes.labelsize'] = 14
m.rcParams['xtick.labelsize'] = 12
m.rcParams['ytick.labelsize'] = 12
m.rcParams['figure.figsize'] = (15, 5)
m.rcParams['font.size'] = 12
m.rcParams['legend.fontsize'] = 'large'
m.rcParams['figure.titlesize'] = 'medium'
m.rcParams['text.color'] = 'k'
sns.set(rc={'figure.figsize':(15,5)})

In [None]:
# Versions of the packages used in this Jupyter notebook
%reload_ext watermark
%watermark -a "Forest Cover Type Prediction -- Jessica Cabral" --iversions
%watermark -n -t -z

In [None]:
np.random.seed(42)
random.seed(42)
random_seed = 42

## Import Data

In [None]:
train = pd.read_csv('../input/forest-cover-type-prediction/train.csv')
test = pd.read_csv('../input/forest-cover-type-prediction/test.csv')
sample_submission = pd.read_csv('../input/forest-cover-type-prediction/sampleSubmission.csv')


print('Train: {}'.format(train.shape))
print('test: {}'.format(test.shape))
print('sample_submission: {}'.format(sample_submission.shape))

In [None]:
display(train.head(), test.head())

train.shape, test.shape

## Pre-Processing

In [None]:
# remove the Id column

train = train.drop(columns=['Id'])
test = test.drop(columns=['Id'])

train.shape, test.shape

## Train and test split

In [None]:
# separate training and validation datasets

train, validation = train_test_split(train, test_size=0.33, random_state=random_seed)

train.shape, validation.shape

Why do we split the training data into training and validation sets?

We set up PyCaret to use only the training sample, which it subdivides again into its own train and test sets.
PyCaret itself uses these new samples to train the models and compute their metrics.

After training and tuning the model, we use our validation sample to check the model against data it hasn't seen yet, which gives us our most important metric: accuracy on unseen data.

Here is a drawing that tries to illustrate this:

![how_it_works](https://github.com/jcabralc/forest_cover_type_prediction/blob/master/imgs/how_it_works.png?raw=true)

Pretty cute drawing, isn't it?
:D
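
The same idea in numbers, as a rough sketch: the row count and the two fractions come from this notebook, while the exact rounding PyCaret applies in its internal split is an assumption, so the last two figures may be off by one row.

```python
import math

N_LABELLED = 15120          # rows in train.csv, from the problem description
VALIDATION_FRACTION = 0.33  # test_size passed to train_test_split
PYCARET_TRAIN_SIZE = 0.7    # train_size passed to setup()

# scikit-learn rounds a fractional test_size up to a whole number of rows
n_validation = math.ceil(VALIDATION_FRACTION * N_LABELLED)
n_pycaret_input = N_LABELLED - n_validation

# Assumption: PyCaret's internal split is approximated by simple rounding
n_pycaret_train = round(PYCARET_TRAIN_SIZE * n_pycaret_input)
n_pycaret_holdout = n_pycaret_input - n_pycaret_train

print(f"validation (kept aside)    : {n_validation}")
print(f"given to PyCaret           : {n_pycaret_input}")
print(f"  PyCaret train (approx.)  : {n_pycaret_train}")
print(f"  PyCaret hold-out (approx.): {n_pycaret_holdout}")
```

So roughly half of the labelled rows end up training the models; the rest is used for the two levels of checking.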

## Setup pycaret

In [None]:
exp_clf = setup(data = train,           # train data
              target = 'Cover_Type',   # feature that we are trying to predict
              train_size = 0.7)     # proportion of training data

## Train and compare models

This function trains all the models available in the model library and scores them using stratified cross-validation. The output is a score grid with Accuracy, AUC, Recall, Precision, F1, Kappa and MCC, averaged across the folds; the number of folds is set by the fold parameter.

#### Default params
compare_models(blacklist = None, whitelist = None, fold = 10, round = 4, sort = 'Accuracy', n_select = 1, turbo = True, verbose = True)
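
To make "stratified" concrete, here is a minimal pure-Python sketch of the idea: every fold keeps about the same class mix as the full data. It is an illustration only, not PyCaret's or scikit-learn's implementation; the function name `stratified_folds` and the balanced toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, n_folds=10, seed=42):
    """Assign each sample index to a fold so that every fold keeps
    roughly the same class proportions as the full label list."""
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    folds = [[] for _ in range(n_folds)]
    for class_indices in indices_by_class.values():
        rng.shuffle(class_indices)
        # Deal the shuffled indices of this class round-robin across folds
        for position, index in enumerate(class_indices):
            folds[position % n_folds].append(index)
    return folds


# Hypothetical, perfectly balanced labels: 100 samples for each of 7 classes
labels = [cover_type for cover_type in range(1, 8) for _ in range(100)]
folds = stratified_folds(labels, n_folds=10)

for fold_number, fold in enumerate(folds, start=1):
    class_counts = Counter(labels[i] for i in fold)
    print(fold_number, dict(sorted(class_counts.items())))
```

In cross-validation each fold is held out once while the model is fitted on the other folds, and the score grid reports the mean of the per-fold metrics.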


In [None]:
%%time

# Train the models using default params
best_model = compare_models()
print(best_model)

The AUC is returned as zero (0.0) when the target variable is multiclass (more than 2 classes), as in our case.

Our winner is the CatBoost Classifier, with 83.15% accuracy.

## Model Optimization

Let's see if we can improve our model's accuracy using PyCaret's tune_model function.

In [None]:
tuned_catboost = tune_model(best_model)
print(tuned_catboost)

With the default parameters

`tune_model(estimator = None, fold = 10, round = 4, n_iter = 10, custom_grid = None, optimize = 'Accuracy', choose_better = False, verbose = True)`

our accuracy dropped to 76.76%.

Interesting...

Let's check if we can do something about it.

In [None]:
# Increase the number of iterations (n_iter) to 35.
# A larger n_iter increases the training time, but it lets the search
# try more candidates, which often gives better performance.

tuned_catboost_v1 = tune_model(best_model, n_iter = 35)
print(tuned_catboost_v1)

# you can try different values for the n_iter param

In [None]:
tuned_catboost_v1.get_params()

We got 85.72% accuracy!


In [None]:
# Let's try a custom grid

# tune hyperparameters with custom_grid
params = {#'early_stopping_rounds': 15,
          'max_depth': list(range(3,10,1)),
          'learning_rate': [0.001, 0.01, 0.015, 0.02, 0.04, 0.1],
          #'n_estimators': list(range(100,300,50)),
          'iterations': [1000, 500, 1500, 800, 1100, 1200],
          }

tuned_catboost_v2 = tune_model(best_model, n_iter = 35, custom_grid = params)
print(tuned_catboost_v2)

# you can try different values for the n_iter param
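
A quick sketch of how much of the custom grid a search with n_iter = 35 can cover. It assumes that tune_model runs a random search that evaluates n_iter candidate settings; sampling without replacement is a simplification made here, not something taken from PyCaret.

```python
import itertools
import random

# Same search space as the custom grid in the cell above
params = {
    "max_depth": list(range(3, 10, 1)),
    "learning_rate": [0.001, 0.01, 0.015, 0.02, 0.04, 0.1],
    "iterations": [1000, 500, 1500, 800, 1100, 1200],
}

# Enumerate every combination of the three hyperparameters
names = list(params)
all_combinations = [
    dict(zip(names, values)) for values in itertools.product(*params.values())
]

# Assumption: the search draws n_iter distinct candidates at random
n_iter = 35
rng = random.Random(42)
sampled = rng.sample(all_combinations, n_iter)

coverage = n_iter / len(all_combinations)
print(f"{len(all_combinations)} combinations, {n_iter} sampled ({coverage:.1%})")
print("first candidate:", sampled[0])
```

Only a fraction of the grid is ever tried, so a different seed or a larger n_iter can land on a different, possibly better, setting.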

In [None]:
tuned_catboost_v2.get_params()

Not better.

## Model Evaluate

In [None]:
#evaluate a model
evaluate_model(tuned_catboost_v1)

It's possible to plot the metrics individually.

Here is an example:

In [None]:
# Compare test data predictions and results
plot_model(tuned_catboost_v1, plot='confusion_matrix')
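
For readers who want to see what sits behind that plot, here is a minimal sketch that counts a confusion matrix by hand. The function name `confusion_counts` and the labels are hypothetical; rows are actual classes and columns are predicted classes.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, classes):
    """Return a matrix where cell [i][j] counts samples of actual class
    classes[i] that were predicted as classes[j]."""
    pair_counts = Counter(zip(y_true, y_pred))
    return [
        [pair_counts[(actual, predicted)] for predicted in classes]
        for actual in classes
    ]


# Hypothetical labels, for illustration only
y_true = [1, 1, 2, 2, 2, 7]
y_pred = [1, 2, 2, 2, 1, 7]
classes = [1, 2, 7]

matrix = confusion_counts(y_true, y_pred, classes)
for actual, row in zip(classes, matrix):
    print(f"actual {actual}: {row}")

# Correct predictions sit on the diagonal
n_correct = sum(matrix[i][i] for i in range(len(classes)))
print(f"accuracy: {n_correct / len(y_true):.4f}")
```

Off-diagonal cells show which pairs of cover types the model mixes up, which a single accuracy number hides.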

## Interpret the Model

    # interpret overall model
    interpret_model(tuned_catboost_v2)

    # correlation shap plot
    interpret_model(tuned_catboost_v2, plot = 'correlation')

    # interactive reason plot
    interpret_model(tuned_catboost_v2, plot = 'reason')

Warning: interpret_model doesn’t support multiclass problems.

## Predictions

In [None]:
# predict on the hold-out set that setup() created
y_train_pred = predict_model(tuned_catboost_v1)

# predict the test dataframe
y_pred = predict_model(tuned_catboost_v1, data = test)

In [None]:
# view the predictions
display(y_train_pred[['Cover_Type', 'Label']], y_pred['Label'])

In [None]:
# Finalize model
final_tuned_catboost_v1 = finalize_model(tuned_catboost_v1)

In [None]:
# Save model
save_model(final_tuned_catboost_v1, 'final_tuned_catboost_v1_30082020')

## Submission 

In [None]:
#sample_submission
sample_submission['Cover_Type'] = y_pred['Label'].tolist()

# Lets see the head of our submission file
display(sample_submission.head())

# Analyse the % of Cover Types predicted
display(sample_submission['Cover_Type'].value_counts(normalize=True)*100)

# Save the submission file
file_name = '3-sub_catboost_pycaret' 
sample_submission.to_csv('{}.csv'.format(file_name), index=False)
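
Before uploading, a quick sanity check of the saved file can catch format mistakes. This is a sketch under assumptions: the header `Id,Cover_Type` is assumed from the sample submission, and the helper `validate_submission` plus the in-memory rows are hypothetical.

```python
import csv
import io

VALID_COVER_TYPES = set(range(1, 8))     # the seven cover types
EXPECTED_COLUMNS = ["Id", "Cover_Type"]  # assumed submission header


def validate_submission(csv_file, expected_rows):
    """Return a list of problems found in an open submission file."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames != EXPECTED_COLUMNS:
        return [f"unexpected columns: {reader.fieldnames}"]

    problems = []
    n_rows = 0
    for row in reader:
        n_rows += 1
        value = row["Cover_Type"] or ""
        if not value.isdigit() or int(value) not in VALID_COVER_TYPES:
            problems.append(f"row {n_rows}: invalid Cover_Type {value!r}")
    if n_rows != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {n_rows}")
    return problems


# Tiny in-memory stand-ins for the real file (hypothetical Ids and values)
good = io.StringIO("Id,Cover_Type\n15121,2\n15122,1\n15123,7\n")
bad = io.StringIO("Id,Cover_Type\n15121,2\n15122,8\n")

good_problems = validate_submission(good, expected_rows=3)
bad_problems = validate_submission(bad, expected_rows=3)
print("good file:", good_problems)
print("bad file :", bad_problems)
```

For the real file, open `3-sub_catboost_pycaret.csv` with `newline=''` and pass `expected_rows=565892`, the test set size given in the problem description.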