---
# DECEMBER TPS

---

## 1. Introduction

For this competition, we will be predicting a categorical target based on a number of feature columns given in the data. The data is synthetically generated by a GAN that was trained on the data from the past [Forest Cover Type Prediction](https://www.kaggle.com/c/forest-cover-type-prediction/overview) competition. This dataset is (a) much larger, and (b) may or may not have the same relationship to the target as the original data.

Please refer to this [data page](https://www.kaggle.com/c/forest-cover-type-prediction/data) for a detailed explanation of the features.

#### Files

- train.csv - the training data with the target Cover_Type column
- test.csv - the test set; you will be predicting the Cover_Type for each row in this file (the target integer class)
- sample_submission.csv - a sample submission file in the correct format
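
As a minimal sketch of the submission layout, assuming the file holds an `Id` column followed by the predicted `Cover_Type` (the ids and predictions below are made-up placeholders, not competition data):

```python
import io
import pandas as pd

# Hypothetical ids and predicted classes, for illustration only.
ids = [4000000, 4000001, 4000002]
predicted = [2, 1, 2]

# The Id lives in the index, as it does when a file is read with index_col='Id'.
submission_demo = pd.DataFrame({"Cover_Type": predicted},
                               index=pd.Index(ids, name="Id"))

# Keep the index when writing: it carries the Id column.
buffer = io.StringIO()
submission_demo.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing with `index=False` in this setup would drop the `Id` column, so the index is kept on purpose.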

#### Evaluation metrics

* Submissions are evaluated on `multi-class classification accuracy`.
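
Multi-class accuracy is the fraction of rows whose predicted class matches the true class. A small sketch on toy labels (the helper name `multiclass_accuracy` and the label values are illustrative assumptions):

```python
import numpy as np

def multiclass_accuracy(y_true, y_pred):
    """Fraction of rows whose predicted class equals the true class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    return float(np.mean(y_true == y_pred))

# Toy labels (hypothetical): 4 of the 5 predictions match.
y_true_demo = [1, 2, 2, 3, 7]
y_pred_demo = [1, 2, 1, 3, 7]
demo_accuracy = multiclass_accuracy(y_true_demo, y_pred_demo)
print(demo_accuracy)
```

Because every row counts equally, a model can score well on this metric while ignoring very rare classes.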

---
#### About the project
The study area includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. Each observation is a 30m x 30m patch. You are asked to predict an integer classification for the forest cover type. The seven types are:

- 1 - Spruce/Fir
- 2 - Lodgepole Pine
- 3 - Ponderosa Pine
- 4 - Cottonwood/Willow
- 5 - Aspen
- 6 - Douglas-fir
- 7 - Krummholz

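In the original competition, the training set (15120 observations) contains both features and the Cover_Type. The test set contains only the features. You must predict the Cover_Type for every row in the test set (565892 observations). These counts describe the original data, not the much larger synthetic data used here.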

#### Data Fields
- Elevation - Elevation in meters
- Aspect - Aspect in degrees azimuth
- Slope - Slope in degrees
- Horizontal_Distance_To_Hydrology - Horz Dist to nearest surface water features
- Vertical_Distance_To_Hydrology - Vert Dist to nearest surface water features
- Horizontal_Distance_To_Roadways - Horz Dist to nearest roadway
- Hillshade_9am (0 to 255 index) - Hillshade index at 9am, summer solstice
- Hillshade_Noon (0 to 255 index) - Hillshade index at noon, summer solstice
- Hillshade_3pm (0 to 255 index) - Hillshade index at 3pm, summer solstice
- Horizontal_Distance_To_Fire_Points - Horz Dist to nearest wildfire ignition points
- Wilderness_Area (4 binary columns, 0 = absence or 1 = presence) - Wilderness area designation
- Soil_Type (40 binary columns, 0 = absence or 1 = presence) - Soil Type designation
- Cover_Type (7 types, integers 1 to 7) - Forest Cover Type designation
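
The binary indicator columns can be collapsed into a single integer label when that is more convenient. The sketch below uses a toy frame with the same column naming (values are made up). Because the data is synthetic, it may be worth checking how many indicators are active per row before relying on such a mapping.

```python
import pandas as pd

# Toy frame with the same column naming as the data (values are made up).
demo = pd.DataFrame({
    "Wilderness_Area1": [1, 0, 0],
    "Wilderness_Area2": [0, 0, 1],
    "Wilderness_Area3": [0, 1, 0],
    "Wilderness_Area4": [0, 0, 0],
})

area_cols = [c for c in demo.columns if c.startswith("Wilderness_Area")]

# Number of active indicators per row; exactly 1 for a clean one-hot encoding.
active = demo[area_cols].sum(axis=1)

# Name of the first column holding the row maximum, reduced to its trailing integer.
area_id = (demo[area_cols].idxmax(axis=1)
           .str.replace("Wilderness_Area", "", regex=False)
           .astype(int))
print(area_id.tolist(), active.tolist())
```

Rows where `active` is not 1 would get a misleading label from `idxmax`, so they need separate handling.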

#### The wilderness areas are:

- 1 - Rawah Wilderness Area
- 2 - Neota Wilderness Area
- 3 - Comanche Peak Wilderness Area
- 4 - Cache la Poudre Wilderness Area

#### The soil types are:

- 1 - Cathedral family - Rock outcrop complex, extremely stony.
- 2 - Vanet - Ratake families complex, very stony.
- 3 - Haploborolis - Rock outcrop complex, rubbly.
- 4 - Ratake family - Rock outcrop complex, rubbly.
- 5 - Vanet family - Rock outcrop complex, rubbly.
- 6 - Vanet - Wetmore families - Rock outcrop complex, stony.
- 7 - Gothic family.
- 8 - Supervisor - Limber families complex.
- 9 - Troutville family, very stony.
- 10 - Bullwark - Catamount families - Rock outcrop complex, rubbly.
- 11 - Bullwark - Catamount families - Rock land complex, rubbly.
- 12 - Legault family - Rock land complex, stony.
- 13 - Catamount family - Rock land - Bullwark family complex, rubbly.
- 14 - Pachic Argiborolis - Aquolis complex.
- 15 - unspecified in the USFS Soil and ELU Survey.
- 16 - Cryaquolis - Cryoborolis complex.
- 17 - Gateview family - Cryaquolis complex.
- 18 - Rogert family, very stony.
- 19 - Typic Cryaquolis - Borohemists complex.
- 20 - Typic Cryaquepts - Typic Cryaquolls complex.
- 21 - Typic Cryaquolls - Leighcan family, till substratum complex.
- 22 - Leighcan family, till substratum, extremely bouldery.
- 23 - Leighcan family, till substratum - Typic Cryaquolls complex.
- 24 - Leighcan family, extremely stony.
- 25 - Leighcan family, warm, extremely stony.
- 26 - Granile - Catamount families complex, very stony.
- 27 - Leighcan family, warm - Rock outcrop complex, extremely stony.
- 28 - Leighcan family - Rock outcrop complex, extremely stony.
- 29 - Como - Legault families complex, extremely stony.
- 30 - Como family - Rock land - Legault family complex, extremely stony.
- 31 - Leighcan - Catamount families complex, extremely stony.
- 32 - Catamount family - Rock outcrop - Leighcan family complex, extremely stony.
- 33 - Leighcan - Catamount families - Rock outcrop complex, extremely stony.
- 34 - Cryorthents - Rock land complex, extremely stony.
- 35 - Cryumbrepts - Rock outcrop - Cryaquepts complex.
- 36 - Bross family - Rock land - Cryumbrepts complex, extremely stony.
- 37 - Rock outcrop - Cryumbrepts - Cryorthents complex, extremely stony.
- 38 - Leighcan - Moran families - Cryaquolls complex, extremely stony.
- 39 - Moran family - Cryorthents - Leighcan family complex, extremely stony.
- 40 - Moran family - Cryorthents - Rock land complex, extremely stony.

---

## 2. EDA and Data Visualizations
### Imports

In [None]:
import os
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.figure_factory as ff
import plotly.graph_objects as go
from matplotlib.ticker import FormatStrFormatter

import gc

import warnings
warnings.filterwarnings('ignore')

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

### Load Data

In [None]:
train = pd.read_csv(r'/kaggle/input/tabular-playground-series-dec-2021/train.csv', index_col='Id')
test = pd.read_csv(r'/kaggle/input/tabular-playground-series-dec-2021/test.csv', index_col='Id')
submission= pd.read_csv(r'/kaggle/input/tabular-playground-series-dec-2021/sample_submission.csv', index_col='Id')

In [None]:
# this code snippet is taken from https://www.kaggle.com/c/tabular-playground-series-oct-2021/discussion/275854

def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

train = reduce_mem_usage(train)
test = reduce_mem_usage(test)
gc.collect()
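
The same idea, storing each column in a narrower dtype when its values allow it, can also be sketched with `pd.to_numeric(..., downcast=...)`. This is an alternative illustration on a toy frame, not the snippet used above. The helper name `downcast_numeric` and the column values are assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame stored in 64-bit dtypes (values are made up).
demo = pd.DataFrame({
    "binary_flag": np.array([0, 1, 1, 0], dtype=np.int64),
    "elevation": np.array([2596, 3100, 1850, 4000], dtype=np.int64),
    "ratio": np.array([0.5, 1.25, 2.0, 3.75], dtype=np.float64),
})

def downcast_numeric(df):
    """Return a copy whose numeric columns use a narrower dtype where pandas allows it."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out

small = downcast_numeric(demo)
print(small.dtypes)
print(demo.memory_usage(index=False).sum(), small.memory_usage(index=False).sum())
```

With 44 binary columns in this dataset, storing them as `int8` instead of `int64` accounts for most of the memory saved.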

In [None]:
print('shape')
print(train.shape)
print(test.shape)

print('Nullvalues')
display(train.isna().sum().sum())
display(test.isna().sum().sum())

### Dataset Overview

#### Data size
- Train data has 4,000,000 rows and 55 features including the target variable
- Test dataset has 1,000,000 rows and 54 features.

#### Missing Values
- There are no missing values in either the train or the test dataset!

#### Features
 - 10 features are numerical.
 - The remaining 44 features are categorical (binary indicators).
 
#### Target
- Multiclass target (1 to 7)
- The target distribution is NOT balanced.
- Classes 1 and 2 dominate by far, with class 2 the most frequent.
- Cover_Type 4 and Cover_Type 5 are almost non-existent: 377 instances of Cover_Type 4 and ONLY 1 instance of Cover_Type 5 out of 4 million rows!
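
The imbalance can be quantified with `value_counts`. The sketch below uses a toy target with one singleton class (values are made up) and flags classes that are too rare to appear on both sides of a split. The threshold `min_count` is an illustrative assumption.

```python
import pandas as pd

# Toy target with one singleton class (hypothetical values).
y_demo = pd.Series([1, 1, 2, 2, 2, 2, 3, 3, 5], name="Cover_Type")

counts = y_demo.value_counts().sort_index()
shares = counts / counts.sum()

# Classes with fewer than `min_count` rows cannot land in both train and validation.
min_count = 2
rare_classes = counts[counts < min_count].index.tolist()

# Boolean mask selecting the rows that belong to the remaining classes.
keep_mask = ~y_demo.isin(rare_classes)
print(counts.to_dict(), rare_classes, int(keep_mask.sum()))
```

A class with a single row, such as Cover_Type 5 here, ends up in either the training part or the validation part of a random split, never both.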

In [None]:
train.head()

### Features 

In [None]:
numerical_features = train.columns[:10]
categorical_features = train.columns[10:]

### Statistical Description (numerical features)


In [None]:
train[numerical_features].describe().T.sort_values(by='mean' , ascending = False)\
.style.background_gradient(cmap='Greys')\
.bar(subset=["mean",], color='#6495ED')\
.bar(subset=["max"], color='#ff355d')

In [None]:
test[numerical_features].describe().T.sort_values(by='mean' , ascending = False)\
.style.background_gradient(cmap='Greys')\
.bar(subset=["mean",], color='#6495ED')\
.bar(subset=["max"], color='#ff355d')

### Target Distribution

In [None]:
target = train['Cover_Type']

In [None]:
pal = ['#6495ED','#ff355d', '#8fdab0', 'red', 'red', '#000000', '#ccc000']
plt.figure(figsize=(8, 6))
ax = sns.countplot(x=target, palette=pal)
ax.set_title('Target variable distribution', fontsize=20, y=1.05)

sns.despine(right=True)
sns.despine(offset=10, trim=True)

### Are `Cover_Type4` & `Cover_Type5` really zeros?

In [None]:
for i in range(1, 8):
    print('Number of samples, Cover_Type{} = {}'.format(i, len(train[train['Cover_Type'] == i])))

### Features Data Visualization
- The numerical feature distributions appear similar in the train and test datasets.
- Most of the categorical features are zeros.
- Features `Wilderness_Area1`, `Wilderness_Area3`, `Soil_Type7`, `Soil_Type15` are different from the others.
 - In features `Wilderness_Area1` and `Wilderness_Area3` we see a *not insignificant* presence of category 1.
 - In features `Soil_Type7` and `Soil_Type15` we see no category 1 at all. They are all zeros.
 > These could be important features for the modeling!
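
Whether a column really is all zeros can be checked directly rather than read off a plot. The sketch below runs on a toy frame (values are made up). If a column turns out to be constant across the full training set, it carries no signal for a model, and dropping it is one option.

```python
import pandas as pd

# Toy frame: two all-zero indicator columns and two informative ones.
demo = pd.DataFrame({
    "Soil_Type6": [0, 1, 0, 0],
    "Soil_Type7": [0, 0, 0, 0],
    "Soil_Type15": [0, 0, 0, 0],
    "Elevation": [2596, 3100, 1850, 4000],
})

# A column with a single distinct value is constant.
n_unique = demo.nunique()
constant_cols = n_unique[n_unique <= 1].index.tolist()

# Drop the constant columns and keep the rest unchanged.
reduced = demo.drop(columns=constant_cols)
print(constant_cols, list(reduced.columns))
```

The check should be run on the full data, since a column that is all zeros in a sample may still contain a few ones overall.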

In [None]:
train_ = train.sample(40000, random_state=1221)
test_ = test.sample(10000, random_state=1221)

features = train.columns
numerical_features = features[:-1]

In [None]:
def density_plotter(a, b, title):    
    L = len(numerical_features[a:b])
    nrow= int(np.ceil(L/5))
    ncol= 5
    
    fig, ax = plt.subplots(nrow, ncol,figsize=(20, 6), sharey=False, facecolor='#dddddd')

    fig.subplots_adjust(top=0.90)
    i = 1
    for feature in numerical_features[a:b]:
        plt.subplot(nrow, ncol, i)
        ax = sns.kdeplot(train_[feature], fill=True,  color='#6495ED',  alpha=0.85, label='train')
        ax = sns.kdeplot(test_[feature], fill=True, color='#ff355d',  alpha=0.85, label='test')
        ax.yaxis.set_major_formatter(FormatStrFormatter('%.0f'))
        ax.xaxis.set_label_position('top')
        ax.set_ylabel('')
        ax.set_yticks([])        
        ax.set_xticks([])        
           
        i += 1

    lines, labels = fig.axes[-1].get_legend_handles_labels()    
    fig.legend(lines, labels, loc = 'upper right',borderaxespad= 4.0, title='data set') 

    plt.suptitle(title, fontsize=20)
    plt.show()

In [None]:
density_plotter(a=0, b=10, title='Density plot: train & test data (numerical features)')

In [None]:
## Noticeably different features 
ff = ['Wilderness_Area1', 'Wilderness_Area3', 'Soil_Type7', 'Soil_Type15']

In [None]:
def count_plot_testTrain(data1, data2, features, titleText):
    L = len(features)
    nrow= int(np.ceil(L/9))
    ncol= 9    
    remove_last= (nrow * ncol) - L    

    fig, ax = plt.subplots(nrow, ncol,figsize=(22, 14), sharey=True, facecolor='#dddddd')
    
    fig.subplots_adjust(top=0.92)
    i = 1
    for feature in features[:-1]:
        plt.subplot(nrow, ncol, i)
        ax = sns.countplot(x=feature, color='#6495ED', data=data1, label='train')
        ax = sns.countplot(x=feature, color='#ff355d', data=data2, label='test')
        ax.set_xlabel(feature)
        ax.set_ylabel('')
        ax.set_yticks([])        
        ax.xaxis.set_label_position('top')         
        
        
        # Highlight the features flagged above; the bars are already drawn.
        if feature in ff:
            ax.set_facecolor('cyan')
        
        i += 1
        
    lines, labels = fig.axes[-1].get_legend_handles_labels()    
    fig.legend(lines, labels, loc = 'upper right',borderaxespad= 4.0, title='data set') 

    plt.suptitle(titleText, fontsize=20)
    plt.show()

count_plot_testTrain(train_, test_, categorical_features, titleText='Train & test data categorical features count plots ')

In [None]:
def count_plot(data, features, titleText, hue=None):
    
    L = len(features)
    nrow= int(np.ceil(L/9))
    ncol= 9
    
    fig, ax = plt.subplots(nrow, ncol,figsize=(22, 14), sharey=True, facecolor='#dddddd')
    fig.subplots_adjust(top=0.92)
    
    i = 1
    for feature in features[:-1]:
        total = float(len(data)) 
        plt.subplot(nrow, ncol, i)
        ax = sns.countplot(x=feature, palette=pal, data=data, hue=hue)
        ax.set_xlabel(feature)
        ax.set_ylabel('')
        ax.xaxis.set_label_position('top')
        ax.set_yticks([]) 
        ax.get_legend().remove()
        
        if feature in ff:
            ax.set_facecolor('cyan')
        
        i += 1
        
    lines, labels = fig.axes[-1].get_legend_handles_labels()    
    fig.legend(lines, labels, loc = 'upper right',borderaxespad= 3.0,title='data set' ) 
    
    plt.suptitle(titleText ,fontsize = 20)
    plt.show() 
    
count_plot(train_, categorical_features, 'Train data cat_feats: target distribution (count plot)', hue='Cover_Type')

### Correlation Heatmap 
- Notice the correlation coeff. of the features we have identified above as 'different' (`Wilderness_Area1`, `Wilderness_Area3`, `Soil_Type7` and `Soil_Type15`)
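
Two details of such a heatmap can be sketched on a toy frame. A boolean upper-triangle mask hides the redundant half of the symmetric matrix. A zero-variance column (like an all-zero indicator) has an undefined correlation, which shows up as NaN and therefore as a blank row and column. The column names and random values below are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "a": rng.normal(size=50),
    "b": rng.normal(size=50),
    "all_zero": np.zeros(50),
})
corr = demo.corr()

# Boolean mask that hides the diagonal and the upper triangle of the matrix.
mask = np.triu(np.ones_like(corr, dtype=bool))

# A zero-variance column has an undefined (NaN) correlation with every column.
undefined = corr.columns[corr.isna().all()].tolist()
print(mask)
print(undefined)
```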

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(16 , 16), facecolor='#dddddd')
corr = train.sample(40000, random_state=2021).corr()

mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed from recent NumPy releases
mask[np.triu_indices_from(mask)] = True

sns.heatmap(corr, ax=ax, square=True, center=0, linewidth=1, #vmax=0.2, vmin=-0.2,
        cmap=sns.diverging_palette(240, 10, as_cmap=True),
        cbar_kws={"shrink": .85}, mask=mask ) 

ax.set_title('Correlation heatmap: All features', fontsize=24, y= 1.05);

#### Pairplots of the numerical features

In [None]:
cols = ['Elevation', 'Aspect', 'Slope', 'Horizontal_Distance_To_Hydrology',
       'Vertical_Distance_To_Hydrology', 'Horizontal_Distance_To_Roadways',
       'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm',
       'Horizontal_Distance_To_Fire_Points','Cover_Type']
df = train_[cols]

sns.pairplot(df, hue='Cover_Type', palette='coolwarm', corner=True)
plt.show()

## Models

#### XGBoost, Catboost, LGBM 
#### 1. Base models with no feature engineering and no hyperparameter tuning

In [None]:
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler

In [None]:
SEED =2021

X = train
y = train.pop('Cover_Type')

from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.33, random_state=2021, shuffle=True)

#### 1. XGBoost

In [None]:
clf_xgb = XGBClassifier(
    seed=SEED,
    objective="multi:softmax",
    tree_method = 'gpu_hist',
    predictor = 'gpu_predictor',)

clf_xgb.fit(X_train, y_train,
            eval_set=[(X_valid, y_valid)],
            early_stopping_rounds=40,
            verbose=500);
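
Depending on the installed XGBoost version, the classifier may expect class labels coded as consecutive integers starting at 0, in which case a target coded 1 to 7 needs re-encoding before fitting and decoding after predicting. Whether this applies here is an assumption, not something the notebook states. A minimal sketch of the round trip on toy labels:

```python
import numpy as np

# Toy labels in the competition's 1..7 coding (hypothetical sample).
y_demo = np.array([1, 2, 2, 3, 7, 6, 2, 1])

# `classes` holds the sorted original labels; `y_encoded` indexes into it.
classes, y_encoded = np.unique(y_demo, return_inverse=True)

# Pretend these are model outputs in the encoded space.
pred_encoded = np.array([0, 1, 4])

# Map encoded predictions back to the original label values.
pred_original = classes[pred_encoded]
print(y_encoded.tolist(), pred_original.tolist())
```

Note that the encoding only covers labels present in `y_demo`, so a class missing from the training split has no encoded value.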

In [None]:
preds_valid = np.array(clf_xgb.predict(X_valid, ))
valid_acc = accuracy_score(y_pred=preds_valid, y_true=y_valid)
print('Xgboost validation accuracy {}'.format(valid_acc))

In [None]:
preds_test = np.array(clf_xgb.predict(test))
submission['Cover_Type'] = preds_test
submission.head()

In [None]:
submission.to_csv("submission_xgb.csv")  # keep the index: it holds the 'Id' column

#### 2. Catboost

In [None]:
catb_params = {
              'loss_function': 'MultiClass',                
              'task_type':'GPU',
              'bootstrap_type':'Bernoulli'
             } 
clf_catboost = CatBoostClassifier(**catb_params)
clf_catboost.fit(X_train, y_train,
            eval_set=[(X_valid, y_valid)],
            early_stopping_rounds=40,
            verbose=500);

In [None]:
preds_valid = np.array(clf_catboost.predict(X_valid, ))
valid_acc = accuracy_score(y_pred=preds_valid, y_true=y_valid)
print('Catboost validation accuracy: {}'.format(valid_acc))

In [None]:
preds_test = np.array(clf_catboost.predict(test)).ravel()  # flatten in case predictions come back as a column vector
submission['Cover_Type'] = preds_test
submission.head() 

In [None]:
submission.to_csv("submission_catboost.csv")  # keep the index: it holds the 'Id' column

#### 3. LGBM

In [None]:
lgbm_params = {'objective': 'multiclass',  
          'random_state': SEED,
          'device': 'gpu'
          }
clf_lgbm= LGBMClassifier(**lgbm_params)
clf_lgbm.fit(X_train, y_train,
            eval_set=[(X_valid, y_valid)],
            early_stopping_rounds=40,
            verbose=500);

In [None]:
preds_valid = np.array(clf_lgbm.predict(X_valid, ))
valid_acc = accuracy_score(y_pred=preds_valid, y_true=y_valid)
print('LGBM validation accuracy: {}'.format(valid_acc))

In [None]:
preds_test = np.array(clf_lgbm.predict(test))
submission['Cover_Type'] = preds_test
submission.head()

In [None]:
submission.to_csv("submission_lgbm.csv")  # keep the index: it holds the 'Id' column

#### Summary base models 

- xgboost : 0.958579
- catboost: 0.959965
- lgbm: 0.943972

> CatBoost gives the best score of the three base models.


<!-- #### 4. TabNet

[I used this reference to build the TabNet model](https://github.com/dreamquark-ai/tabnet/blob/develop/forest_example.ipynb) (most of the params are also from the same ref)

!pip install pytorch_tabnet -q
from pytorch_tabnet.tab_model import TabNetClassifier
import torch -->

<!-- X = StandardScaler().fit_transform(train)
Y = y.astype(dtype=int) 
X_test = StandardScaler().fit_transform(test) -->

<!-- X_train, X_valid, y_train, y_valid = train_test_split(X, Y, test_size=0.33, random_state=2021, shuffle=True) -->

<!-- clf = TabNetClassifier(
    n_d=64, 
    n_a=64, 
    n_steps=5,
    gamma=1.5, 
    n_independent=2, 
    n_shared=2,
    cat_emb_dim=1,
    lambda_sparse=1e-4,
    momentum=0.3, 
    clip_value=2.,
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=2e-2),
    scheduler_params = {"gamma": 0.95,
                     "step_size": 20},
    scheduler_fn=torch.optim.lr_scheduler.StepLR, 
    epsilon=1e-15
) -->

<!-- max_epochs=15

clf.fit(
    X_train=X_train, y_train=y_train,
    eval_set=[(X_train, y_train), (X_valid, y_valid)],
    eval_name=['train', 'valid'],
    max_epochs=max_epochs, 
    patience=100,
    batch_size=20480, 
    virtual_batch_size=256
) -->

<!-- preds = clf.predict(X_test)
submission['Cover_Type'] = preds
submission.head()
submission.to_csv("submission_tabnet.csv", index=False) -->