# Forest Cover Type Prediction

We will work through the following steps to complete this challenge:
1. Understand the business problem
2. Get the data
3. Discover and visualize insights (univariate and multivariate analysis)
4. Prepare data for ML algorithms
5. Select a model and train it
6. Fine-tune the model
7. Launch, monitor and maintain the system (not needed in this case).


## Understand the business problem
Design and implement a system that processes the unscaled numeric features and the binary features and predicts the forest cover type. This is a multi-class classification problem.
The test data is much larger than the training data.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## Get the data
We are mainly concerned with train.csv and test.csv here.
Let's load the training data into a pandas DataFrame.


In [None]:
dataset = pd.read_csv('/kaggle/input/forest-cover-type-prediction/train.csv')
dataset.head()

In [None]:
dataset.Cover_Type.unique()

As the unique values show, this is a multi-class classification task.
The validation set in test.csv is loaded later, when we generate the predictions.

It has no Cover_Type column, because that is the dependent (target) variable we will predict.
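
A quick way to confirm which column is absent from the test file is to compare the two column sets. The sketch below uses two tiny hand-made frames in place of train.csv and test.csv; the values are illustrative only, and `train_df` / `test_df` are hypothetical names that do not appear elsewhere in this notebook.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (values are illustrative only).
train_df = pd.DataFrame({"Id": [1, 2], "Elevation": [2596, 2590], "Cover_Type": [5, 2]})
test_df = pd.DataFrame({"Id": [15121, 15122], "Elevation": [2680, 2683]})

# Columns present in train but absent from test: this should be just the target.
missing_in_test = sorted(set(train_df.columns) - set(test_df.columns))
print(missing_in_test)
```

Running the same comparison on the real files should leave only the target column in the result.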

In [None]:
dataset.info()

In [None]:
dataset.describe()

## Discover and visualize insights
Let's start with univariate and bivariate analysis to understand our data.

In [None]:
dataset.groupby(['Cover_Type']).agg(['count'])['Id']

In [None]:
# Optional plot of the class counts (requires `import seaborn as sns`):
# sns.countplot(x='Cover_Type', data=dataset)
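
For a univariate look at the target without plotting, the class counts can also be turned into shares. This is a minimal sketch on a made-up label series (`cover_type` is a hypothetical stand-in for `dataset['Cover_Type']`), not the output of the real data.

```python
import pandas as pd

# Hypothetical labels standing in for dataset['Cover_Type'].
cover_type = pd.Series([1, 1, 2, 2, 3, 3, 3, 3], name="Cover_Type")

# Absolute counts per class, ordered by class label.
counts = cover_type.value_counts().sort_index()

# Share of each class; a perfectly balanced set would give equal shares.
shares = counts / counts.sum()

print(counts.to_dict())
print(shares.round(2).to_dict())
```

If the shares are roughly equal, plain accuracy is a reasonable first metric; if not, per-class metrics deserve a closer look.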

## Prepare data for ML algorithms
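As we have seen, the features have very different scales, so we need to bring them onto the same scale. Here we will use the StandardScaler class from sklearn.
Before scaling, it's really important to split the dataset into train and test sets, so that the scaler is fitted on the training rows only.
In the cell below the hold-out split is left commented out, and the whole training file is used for fitting.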
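
The split-then-scale order can be sketched as follows. This is an illustration on synthetic data, not the notebook's actual pipeline: the `demo` frame, the `scaler` name and the `stratify=y` argument are assumptions added here, while `test_size=0.25` and `random_state=42` mirror the commented-out call in the next cell.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training frame: two numeric features and a target.
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "Elevation": rng.normal(2750, 400, size=200),
    "Slope": rng.normal(16, 8, size=200),
    "Cover_Type": np.tile(np.arange(1, 5), 50),  # four balanced classes
})

X = demo.drop(columns="Cover_Type")
y = demo["Cover_Type"]

# Hold out 25% of the rows; stratify keeps class proportions similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Fit the scaler on the training part only, then reuse it on the held-out part.
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

Fitting the scaler on the training part alone keeps information about the held-out rows out of the preprocessing step.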

In [None]:
from sklearn.model_selection import train_test_split

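# First drop the Id column; it carries no predictive information.
dataset.drop(['Id'], axis=1, inplace=True)

# Cover_Type is the last column; everything before it is a feature.
# .copy() gives an independent frame, so assigning the scaled columns later
# does not trigger pandas' SettingWithCopyWarning.
X_train = dataset.iloc[:, :-1].copy()
y_train = dataset.iloc[:, -1]
# Optional hold-out split (disabled; the full training file is used for fitting):
# X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.25, random_state=42)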
# X_test = None
# y_test = None
print(dataset.shape)
# print(X.shape)
# print(y.shape)
# print(X_train.shape)
# print(y_train.shape)
# print(X_test.shape)
# print(y_test.shape)

In [None]:
from sklearn.preprocessing import StandardScaler

to_be_scaled_features = ['Elevation', 'Aspect', 'Slope', 'Horizontal_Distance_To_Hydrology',
       'Vertical_Distance_To_Hydrology', 'Horizontal_Distance_To_Roadways',
       'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm',
       'Horizontal_Distance_To_Fire_Points']

sc_X = StandardScaler()
X_train[to_be_scaled_features] = sc_X.fit_transform(X_train[to_be_scaled_features])
# X_test[to_be_scaled_features] = sc_X.transform(X_test[to_be_scaled_features])


In [None]:
dataset.columns

In [None]:
X_train.head()

As we have scaled the data and we don't have any missing values, let's train multiple machine learning models and see which performs best.

Let's try a couple of classifiers.
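
One hedged way to compare several models on equal footing is k-fold cross-validation. The sketch below runs on a synthetic dataset from `make_classification`, so the scores say nothing about the forest cover data; the `candidates` dictionary, the 5-fold setting and the reduced `n_estimators=50` are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic 3-class problem standing in for the scaled training data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=6,
    n_classes=3, random_state=42,
)

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "extra_trees": ExtraTreesClassifier(n_estimators=50, random_state=0),
}

# 5-fold cross-validated accuracy for each candidate model.
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    for name, model in candidates.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Cross-validation uses every training row for both fitting and scoring, which helps when no separate hold-out split is kept.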

In [None]:
# from sklearn.ensemble import RandomForestClassifier

# rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
# rf_classifier.fit(X_train, y_train)

# y_pred = rf_classifier.predict(X_test)

# from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# print(accuracy_score(y_test, y_pred))
# rf_proba = rf_classifier.predict_proba(X_test)
# print(roc_auc_score(y_test, rf_proba, multi_class='ovr'))
# confusion_matrix(y_test, y_pred)

The random forest classifier performs best among the models tried so far.
Before we optimise it further, let's try the XGBoost classifier.


In [None]:
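from xgboost import XGBClassifier

# Note: Cover_Type labels run from 1 to 7. Recent XGBoost releases expect
# labels 0..n_classes-1 and may reject these; if so, fit on `y_train - 1`
# and add 1 back to the predictions.
xgb = XGBClassifier(
    learning_rate=0.09,
    n_estimators=500,
    max_depth=30,
    nthread=4,
    objective='multi:softprob',  # per-class probabilities for multi-class
    subsample=0.75,              # each tree sees 75% of the rows
)

# Fit on the full (scaled) training data.
xgb.fit(X_train, y_train)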

# y_pred = xgb.predict(X_test)

# from sklearn.metrics import accuracy_score, classification_report, roc_auc_score

# print(accuracy_score(y_test, y_pred)*100)
# xgb_proba = xgb.predict_proba(X_test)
# print(roc_auc_score(y_test, xgb_proba, multi_class='ovr')*100)
# print(classification_report(y_test, y_pred))
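
Cover_Type labels in this dataset run from 1 to 7, whereas some gradient boosting libraries (recent XGBoost releases, as far as I am aware) expect class ids starting at 0. A minimal sketch of the shift-and-restore step, using a made-up label array and pretending the model reproduces its training labels exactly; `y_raw` and the other names are hypothetical:

```python
import numpy as np

# Hypothetical labels in the 1..7 range used by Cover_Type.
y_raw = np.array([5, 2, 1, 7, 3, 6, 4, 2])

# Shift to 0..6 before fitting a model that expects zero-based class ids.
y_zero_based = y_raw - 1

# Pretend these are the model's predictions (here simply the training labels).
pred_zero_based = y_zero_based.copy()

# Shift back to 1..7 before writing a submission.
pred_cover_type = pred_zero_based + 1

print(sorted(np.unique(y_zero_based).tolist()))
print(pred_cover_type.tolist())
```

Whether the shift is needed depends on the installed library version, so it is worth checking the fit call's error message before adding it.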

In [None]:
from lightgbm import LGBMClassifier

lgb = LGBMClassifier(
    learning_rate=0.09,
    num_leaves=500,
    boosting_type='gbdt',
    objective='multiclass',
    metric='multi_logloss',
    max_depth=30,
    # n_estimators=3000,
    # subsample_for_bin=4000,
    # min_split_gain=2,
    # min_child_weight=2,
    # min_child_samples=5,
    # Note: row subsampling only takes effect when subsample_freq > 0.
    subsample=0.75,
)

# Fit on the full (scaled) training data.
lgb.fit(X_train, y_train)


# y_pred = lgb.predict(X_test)

# from sklearn.metrics import accuracy_score, classification_report, roc_auc_score

# print(accuracy_score(y_test, y_pred)*100)
# lgb_proba = lgb.predict_proba(X_test)
# print(roc_auc_score(y_test, lgb_proba, multi_class='ovr')*100)
# print(classification_report(y_test, y_pred))

In [None]:
# from sklearn.ensemble import ExtraTreesClassifier

# et = ExtraTreesClassifier(n_estimators=500,n_jobs=-1,random_state=0)
# et.fit(X_train, y_train)

# Note: this scores the model on its own training data, so the numbers are optimistic.
# y_pred = et.predict(X_train)

# from sklearn.metrics import accuracy_score, classification_report, roc_auc_score

# print(accuracy_score(y_train, y_pred)*100)
# et_proba = et.predict_proba(X_train)
# print(roc_auc_score(y_train, et_proba, multi_class='ovr')*100)
# print(classification_report(y_train, y_pred))

In [None]:
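# Load the competition test set (called the validation set in this notebook).
validation_set = pd.read_csv('/kaggle/input/forest-cover-type-prediction/test.csv')
# validation_set.head()
# Keep the ids for the submission file, then drop them from the features.
ids = validation_set['Id']
validation_set.drop(['Id'], axis=1, inplace=True)
# Reuse the scaler fitted on the training data; do not refit it here.
validation_set[to_be_scaled_features] = sc_X.transform(validation_set[to_be_scaled_features])
y_result = xgb.predict(validation_set)

# Assemble the Id / Cover_Type pairs and write the submission file.
y_result = pd.Series(y_result, name='Cover_Type')
ids = pd.Series(ids, name='Id')
submission = pd.concat([ids, y_result], axis=1)
submission.to_csv('/kaggle/working/submission_xgb.csv', index=False)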
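
Before uploading, it can help to read the file back and run a few sanity checks. The sketch below builds a three-row stand-in submission in memory rather than touching /kaggle/working/; the ids, predictions and the `problems` list are hypothetical, and the 1 to 7 label range is an assumption about this competition's target.

```python
import io
import pandas as pd

# Hypothetical ids and predictions standing in for `ids` and `y_result`.
demo_ids = pd.Series([15121, 15122, 15123], name="Id")
demo_pred = pd.Series([2, 1, 5], name="Cover_Type")

demo_submission = pd.concat([demo_ids, demo_pred], axis=1)

# Write to an in-memory buffer instead of /kaggle/working/ for this sketch.
buffer = io.StringIO()
demo_submission.to_csv(buffer, index=False)
buffer.seek(0)

# Read it back and run basic sanity checks on the file contents.
check = pd.read_csv(buffer)
problems = []
if list(check.columns) != ["Id", "Cover_Type"]:
    problems.append("unexpected columns")
if check.isna().any().any():
    problems.append("missing values")
if not check["Cover_Type"].between(1, 7).all():
    problems.append("labels outside 1..7")
if check["Id"].duplicated().any():
    problems.append("duplicate ids")

print(problems or "submission looks well-formed")
```

The same checks can be pointed at the real CSV by replacing the buffer with the file path.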