
# Forest Cover Type Prediction with CatBoost
## Table of Contents
* [1. Overview](#1.)
* [2. Setup](#2.)
* [3. EDA & Preprocessing](#3.)
	* [3.1 Statistic Info](#3.1)
	* [3.2 Correlation Score](#3.2)
	* [3.3 Distribution of Label](#3.3)
	* [3.4 Drop unused columns](#3.4)
	* [3.5 Data Wrangling](#3.5)
	* [3.6 Train Validation Split](#3.6)
	* [3.7 Add new features](#3.7)
* [4. Model Development](#4.)
* [5. Submission](#5.)

<a id="1."></a>
## 1. Overview
In this notebook, I build a forest cover type prediction model using CatBoost. It is based on my notebook https://www.kaggle.com/lonnieqin/tps-12-21-dnn, written for the [Tabular Playground Series - Dec 2021 Competition](https://www.kaggle.com/c/tabular-playground-series-dec-2021).
<a id="2."></a>
## 2. Setup

In [None]:
import numpy as np
import pandas as pd
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
import os
import math
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
class Config:
    is_kaggle_platform = os.path.exists("/kaggle/input")
    dataset_name = "forest-cover-type-prediction"
    data_path = "/kaggle/input/%s/"%(dataset_name) if is_kaggle_platform else ""
    submit_filename = "submission.csv"
    label_name = "Cover_Type"
    id_field = "Id"
config = Config()

In [None]:
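if not config.is_kaggle_platform:
  import importlib.util
  import json
  # Install the Kaggle CLI only if the package is missing.
  if importlib.util.find_spec("kaggle") is None:
    !pip install kaggle
  kaggle_json = "/root/.kaggle/kaggle.json"
  if not os.path.exists(kaggle_json):
    # Write the credentials as valid JSON; replace both placeholders first.
    os.makedirs(os.path.dirname(kaggle_json), exist_ok=True)
    with open(kaggle_json, "w") as f:
      json.dump({"username": "{your username}", "key": "{your apikey}"}, f)
    !chmod 600 /root/.kaggle/kaggle.json
  !kaggle competitions download -c $config.dataset_name
  !unzip test.csv.zip
  !unzip train.csv.zip
  # Archive name chosen to match the sampleSubmission.csv read in the next cell.
  !unzip sampleSubmission.csv.zip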

In [None]:
train = pd.read_csv(config.data_path + "train.csv")
test = pd.read_csv(config.data_path + "test.csv")
sample_submission = pd.read_csv(config.data_path + "sampleSubmission.csv")

<a id="3."></a>
## 3. EDA & Preprocessing

In [None]:
train.head()

<a id="3.1"></a>
### 3.1 Statistic Info

In [None]:
train.info()

In [None]:
train.describe()

<a id="3.2"></a>
### 3.2 Correlation Score

In [None]:
corr = train.corr()
corr

In [None]:
corr.sort_values(ascending=False, inplace=True, by=config.label_name, key= lambda x: abs(x))
corr[config.label_name]

In [None]:
correlated_columns = corr[config.label_name][corr[config.label_name].abs() > 0.05].index
correlated_columns, len(correlated_columns)

In [None]:
correlation_score = train.corr()
correlated_features = correlation_score[config.label_name].sort_values(ascending=False).dropna()
correlated_columns = list(correlated_features[correlated_features.abs() > 0.05].index)
correlated_columns.remove(config.label_name)
print(correlated_columns)
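
The cells above keep the features whose absolute Pearson correlation with the label exceeds 0.05. As a minimal sketch, the same filter can be wrapped in a helper and tried on a tiny made-up frame. The function name `select_correlated` and the toy columns are hypothetical and not part of the original notebook.

```python
import pandas as pd

def select_correlated(df, label, threshold=0.05):
    """Return feature names whose |Pearson correlation| with `label` exceeds `threshold`."""
    # Constant columns produce NaN correlations, so drop them before filtering.
    scores = df.corr()[label].drop(label).dropna()
    return list(scores[scores.abs() > threshold].index)

# Toy data (made up): one informative column, one uncorrelated, one constant.
toy = pd.DataFrame({
    "signal": [0, 1, 2, 3],
    "orthogonal": [-1, 1, 1, -1],
    "constant": [0, 0, 0, 0],
    "label": [0, 1, 2, 3],
})
selected = select_correlated(toy, "label")
print(selected)  # ['signal']
```

In the notebook this would correspond to `select_correlated(train, config.label_name)`.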

In [None]:
corr2 = train[correlated_columns].corr()
corr2

In [None]:
plt.figure(figsize=(20, 20))
sns.heatmap(corr2, annot=True)

<a id="3.3"></a>
### 3.3 Distribution of Label
The label distribution in the training set is well balanced across cover types, as the plot and counts below show.

In [None]:
sns.countplot(x=config.label_name, data=train)

In [None]:
train[config.label_name].value_counts()
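
To put a number on how balanced the label is, one option is the ratio between the largest and the smallest class count, where 1.0 means perfectly balanced. The sketch below uses made-up labels; `imbalance_ratio` is a hypothetical helper, not something the notebook defines.

```python
import pandas as pd

def imbalance_ratio(labels):
    """Largest class count divided by smallest class count (1.0 = perfectly balanced)."""
    counts = pd.Series(labels).value_counts()
    return counts.max() / counts.min()

balanced = [1, 1, 2, 2, 3, 3]   # made-up labels, two samples per class
skewed = [1, 1, 1, 1, 2]        # made-up labels, class 1 dominates

print(imbalance_ratio(balanced))  # 1.0
print(imbalance_ratio(skewed))    # 4.0
```

In the notebook the call would be `imbalance_ratio(train[config.label_name])`.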

In [None]:
#train = train.drop(index = int(np.where(train[config.label_name] == 5)[0]))

<a id="3.4"></a>
### 3.4 Drop unused columns
The Id column is not needed for training, so remove it. Soil_Type7 and Soil_Type15 are dropped as well.

In [None]:
train.pop(config.id_field)
_ = test.pop(config.id_field)

In [None]:
train.pop("Soil_Type7")
train.pop("Soil_Type15")
test.pop("Soil_Type7")
_ = test.pop("Soil_Type15")
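
A common reason for dropping an indicator column is that it holds a single value across the whole training set and so cannot help a model. Whether that is the motivation here is an assumption. The sketch below only shows how such columns could be detected, using a made-up frame and a hypothetical helper `constant_columns`.

```python
import pandas as pd

def constant_columns(df):
    """Return the names of columns that contain at most one distinct value."""
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]

# Toy data (made up): Soil_A never varies, the other columns do.
toy = pd.DataFrame({
    "Soil_A": [0, 0, 0],
    "Soil_B": [0, 1, 0],
    "Elevation": [2500, 2600, 2700],
})
found = constant_columns(toy)
print(found)  # ['Soil_A']
```

Running `constant_columns(train)` before the `pop` calls would list any candidates in the real data.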

<a id="3.5"></a>
### 3.5 Data Wrangling
Luckily, there are no missing values in either the training or the test set, so no imputation is needed.

In [None]:
null_counts = train.isnull().sum()
print(null_counts[null_counts > 0])
null_counts = test.isnull().sum()
print(null_counts[null_counts > 0])

In [None]:
# Original feature columns, captured before any engineered features are added.
cols = list(train.columns)
cols.remove(config.label_name)
for data in [train, test]:
    # Bucket elevation into 50-unit bins.
    data['binned_elevation'] = data['Elevation'] // 50
    # log(1 + x) compresses the long right tail of the distance values.
    data['Horizontal_Distance_To_Roadways_Log'] = np.log1p(data['Horizontal_Distance_To_Roadways'])
    # Combine groups of soil-type indicator columns into single features.
    data['Soil_Type12_32'] = data['Soil_Type32'] + data['Soil_Type12']
    data['Soil_Type23_22_32_33'] = data['Soil_Type23'] + data['Soil_Type22'] + data['Soil_Type32'] + data['Soil_Type33']
    # Row-wise summary statistics over the original feature columns.
    data["mean"] = data[cols].mean(axis=1)
    data["min"] = data[cols].min(axis=1)
    data["max"] = data[cols].max(axis=1)
    data["std"] = data[cols].std(axis=1)

<a id="3.6"></a>
### 3.6 Train Validation Split

In [None]:
train_features, val_features = train_test_split(train, test_size=0.15, random_state=42)
train_targets = train_features.pop(config.label_name)
val_targets = val_features.pop(config.label_name)
train_features.head()
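
The split above is purely random. As an optional variation, which is not what this notebook does, passing `stratify` to `train_test_split` keeps the class proportions the same in both parts. The sketch below uses a made-up frame with two classes to show the effect.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (made up): 40 rows, two classes with 20 rows each.
toy = pd.DataFrame({"feature": range(40), "Cover_Type": [1] * 20 + [2] * 20})

# stratify= makes the validation part mirror the label proportions of the full frame.
train_part, val_part = train_test_split(
    toy, test_size=0.25, random_state=42, stratify=toy["Cover_Type"]
)

# Each class contributes 5 of the 10 validation rows.
print(val_part["Cover_Type"].value_counts().to_dict())
```

Applied to the notebook, the extra argument would be `stratify=train[config.label_name]`.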

<a id="4."></a>
## 4. Model Development

In [None]:
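cat_params = {
    'iterations': 5000,         # maximum number of boosting rounds
    'learning_rate': 0.03,
    'od_wait': 1000,            # overfitting detector: iterations to wait after the best one
    'depth': 7,                 # tree depth
    'task_type' : 'GPU',        # requires a GPU runtime; use 'CPU' otherwise
    'l2_leaf_reg': 3,           # L2 regularization coefficient
    'eval_metric': 'Accuracy',
    'devices' : '0',            # GPU device id
    'verbose' : 1000            # log every 1000 iterations
}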
cat = CatBoostClassifier(**cat_params)
cat.fit(train_features, train_targets, eval_set=(val_features, val_targets))
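
CatBoost logs a single accuracy figure for the validation set. To see which cover types are harder to predict, a per-class breakdown can help. The sketch below is a minimal example on made-up arrays; `per_class_recall` is a hypothetical helper. In the notebook the inputs would be `val_targets` and `cat.predict(val_features).reshape(-1)`.

```python
import numpy as np
import pandas as pd

def per_class_recall(y_true, y_pred):
    """Fraction of correctly predicted samples for each true class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # 1.0 where the prediction matches the true label, 0.0 otherwise.
    correct = pd.Series((y_true == y_pred).astype(float))
    # Average the hit indicator within each true class.
    return correct.groupby(y_true).mean()

# Made-up labels and predictions for three classes.
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 2, 2, 2, 3, 1]

recall = per_class_recall(y_true, y_pred)
accuracy = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
print(recall.to_dict())  # {1: 0.5, 2: 1.0, 3: 0.5}
```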


<a id="5."></a>
## 5. Submission

In [None]:
y_pred = cat.predict(test)
sample_submission[config.label_name] = y_pred.reshape(-1)
sample_submission.to_csv(config.submit_filename, index=False)
if not config.is_kaggle_platform:
  !kaggle competitions submit $config.dataset_name -m "Submission" -f $config.submit_filename
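
Before uploading, a quick sanity check of the submission frame can catch format mistakes early. The sketch below is a hypothetical helper, `check_submission`, run on made-up frames. It assumes the expected columns are `Id` and `Cover_Type` and that valid labels are the integers 1 to 7.

```python
import pandas as pd

def check_submission(submission, id_col, label_col, valid_labels):
    """Return a list of problems found; an empty list means no issue was detected."""
    expected = [id_col, label_col]
    if list(submission.columns) != expected:
        return [f"expected columns {expected}, got {list(submission.columns)}"]
    problems = []
    if submission[id_col].duplicated().any():
        problems.append("duplicate ids")
    if submission[label_col].isnull().any():
        problems.append("missing labels")
    if not submission[label_col].dropna().isin(valid_labels).all():
        problems.append("labels outside the valid set")
    return problems

valid = list(range(1, 8))  # cover types 1..7
good = pd.DataFrame({"Id": [1, 2, 3], "Cover_Type": [1, 7, 3]})  # made up
bad = pd.DataFrame({"Id": [1, 1, 2], "Cover_Type": [1, 8, 3]})   # made up

print(check_submission(good, "Id", "Cover_Type", valid))  # []
print(check_submission(bad, "Id", "Cover_Type", valid))
# ['duplicate ids', 'labels outside the valid set']
```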

**If you found my notebook useful, give me an upvote.**

If you are interested, you may also want to have a look at some of my earlier TPS notebooks.

- [Tabular Playground Series Prediction(Aug 2021)](https://www.kaggle.com/lonnieqin/tabular-playground-series-prediction)
- [Tabular Playground Prediction(Sep 2021) with CatBoost](https://www.kaggle.com/lonnieqin/catboost-tabular-playground-prediction-sep-2021)
- [Tabular Prediction(Oct 2021) with CatBoost](https://www.kaggle.com/lonnieqin/catboost-tabular-prediction-oct-2021)
- [TPS Prediction with DNN and KerasTuner (Oct 2021)](https://www.kaggle.com/lonnieqin/tps-prediction-with-dnn-and-kerastuner-oct-2021)
- [TPS-10-21: DNN](https://www.kaggle.com/lonnieqin/tps-10-21-dnn)
