# Problem Statement

In this competition, we use cartographic variables to predict the forest cover type of each sample.

'Cartographic' is an adjective meaning 'relating to the science or practice of drawing maps', as in "he started his own cartographic printing company".

We are going to cover the following steps:
1. Import Libraries
2. Model
3. Evaluation
4. Submission
5. References

Let's get started.

The full training dataset has 4,000,000 (4M) rows. We'll load only the first 25,000 rows so that we can explore the data more quickly.

# Import Libraries

In [None]:
import pandas as pd
from pathlib import Path

data_dir = Path('../input/tabular-playground-series-dec-2021/')

df_train = pd.read_csv(
    data_dir / "train.csv",
    index_col='Id',
    nrows=25000,  # comment out this line to use the full dataset
)

FEATURES = df_train.columns[:-1]
TARGET = df_train.columns[-1]

df_train.head()

The target attribute 'Cover_Type' contains 7 types of Forest Cover (1, 2, 3, 4, 5, 6, 7).
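
Before modelling, it can help to check how often each class actually appears in the rows that were loaded, since a small sample may not contain all seven labels. The following is a minimal sketch on a made-up `cover_type` series (a hypothetical stand-in for `df_train[TARGET]`); the counts shown are not taken from the competition data.

```python
import pandas as pd

# Toy stand-in for df_train[TARGET]; the real column is read from train.csv.
cover_type = pd.Series([1, 2, 2, 3, 2, 1, 7, 2], name="Cover_Type")

# Absolute and relative frequency of each class, ordered by label.
counts = cover_type.value_counts().sort_index()
shares = cover_type.value_counts(normalize=True).sort_index()
summary = pd.DataFrame({"count": counts, "share": shares})

# Labels from 1..7 that never occur in this sample.
missing = sorted(set(range(1, 8)) - set(counts.index))

print(summary)
print("missing classes:", missing)
```

On the real data, replacing `cover_type` with `df_train[TARGET]` should give the same kind of table for the loaded rows.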

# Model

Let's try out a simple XGBoost model. XGBoost can handle missing values natively. We use `XGBClassifier` (rather than `XGBRegressor`, for instance) because this is a classification problem.

In [None]:
from xgboost import XGBClassifier

X = df_train.loc[:, FEATURES]
y = df_train.loc[:, TARGET]

model = XGBClassifier(
    max_depth=3,
    subsample=0.5,
    colsample_bytree=0.5,
    n_jobs=-1,
    # Uncomment to train on a GPU; recommended for the whole training set.
    # (In XGBoost 2.0+, use tree_method='hist' with device='cuda' instead.)
    #tree_method='gpu_hist',
    random_state=0,
)
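
Recent XGBoost releases expect the class labels passed to `XGBClassifier` to be consecutive integers starting at 0, whereas `Cover_Type` runs from 1 to 7. Whether this raises an error depends on the installed version, so treat the following as a hedged sketch of one way to re-index the labels and map predictions back. The arrays `y_raw` and `pred_indices` are hypothetical stand-ins for the real target column and for the output of `model.predict`.

```python
import numpy as np

# Hypothetical target values standing in for df_train[TARGET].
y_raw = np.array([2, 1, 7, 2, 3, 1])

# classes: sorted unique labels; y_encoded: position of each label in classes.
classes, y_encoded = np.unique(y_raw, return_inverse=True)

# A model fitted on y_encoded predicts positions; map them back to labels.
pred_indices = np.array([0, 1, 3])  # stand-in for model.predict(...)
pred_labels = classes[pred_indices]

print(classes)      # labels present in the sample
print(y_encoded)    # 0-based codes to train on
print(pred_labels)  # predictions in the original label space
```

Note that the mapping covers only the labels present in the data it was built from, so it should be built from the same rows the model is trained on.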

# Evaluation

The evaluation metric is multi-class classification accuracy.
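
Accuracy here is simply the fraction of rows whose predicted class equals the true class. As a small illustration, with an `accuracy` helper and label lists that are made up for this example:

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true class label."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    return float(np.mean(y_true == y_pred))

# Hypothetical labels, for illustration only.
y_true = [1, 2, 2, 3, 7, 2]
y_pred = [1, 2, 3, 3, 7, 1]

acc = accuracy(y_true, y_pred)  # 4 of the 6 predictions are correct
print(f"accuracy = {acc:.3f}")
```

Every misclassification costs the same under this metric, regardless of which two classes are confused.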

In [None]:
from sklearn.model_selection import cross_validate
import warnings 
warnings.filterwarnings('ignore')

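def score(X, y, model, cv):
    """Cross-validate `model` and return per-fold results with mean and std."""
    scoring = ["accuracy"]
    scores = cross_validate(
        model, X, y, scoring=scoring, cv=cv, return_train_score=True
    )
    # Rows are metrics (fit_time, test_accuracy, ...); columns are folds.
    folds = pd.DataFrame(scores).T
    # Compute both statistics from the fold columns only. Chained lambdas in
    # assign() would let std see the freshly added mean column.
    return folds.assign(
        mean=folds.mean(axis=1),
        std=folds.std(axis=1),
    )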

scores = score(X, y, model, cv=2)

display(scores)
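
One detail to watch when summarising per-fold scores with `DataFrame.assign`: later keyword arguments see the columns created by earlier ones, so a `std` computed through a lambda after `mean` also includes the `mean` column. The sketch below uses made-up numbers in the shape `cross_validate` returns for `scoring=["accuracy"]`, `cv=2` and `return_train_score=True`; the values are illustrative, not results from this notebook.

```python
import pandas as pd

# Made-up fold results, shaped like the dict returned by cross_validate.
raw = {
    "fit_time": [1.0, 1.2],
    "score_time": [0.1, 0.1],
    "test_accuracy": [0.90, 0.94],
    "train_accuracy": [0.95, 0.97],
}

folds = pd.DataFrame(raw).T  # rows: metrics, columns: fold 0 and fold 1

# Both statistics computed from the fold columns only.
summary = folds.assign(
    mean=folds.mean(axis=1),
    std=folds.std(axis=1),
)

# Chained lambdas: the std lambda also sees the new "mean" column.
chained = folds.assign(
    mean=lambda x: x.mean(axis=1),
    std=lambda x: x.std(axis=1),
)

print(summary.loc["test_accuracy", ["mean", "std"]])
print(chained.loc["test_accuracy", ["mean", "std"]])
```

The means agree, but the chained version reports a smaller standard deviation because the mean itself is counted as an extra observation.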

# Submission

In [None]:
# Fit on all loaded training rows (the full training set once nrows is removed)
model.fit(X, y)

X_test = pd.read_csv(data_dir / "test.csv", index_col='Id')

# Make predictions
y_pred = pd.Series(
    model.predict(X_test),
    index=X_test.index,
    name=TARGET,
)

# Create submission file
y_pred.to_csv("submission_getting_started.csv")

# References
1. Thank you to Ryan Holbrook, Alexis Cook, and inversion, whose [notebook](https://www.kaggle.com/ryanholbrook/getting-started-september-2021-tabular-playground/notebook) demonstrates how to get started.
