Reading parquet data
==

In [this notebook](https://www.kaggle.com/kaaveland/tps202112-parquet) I converted the competition data to parquet format, so that future notebooks for this competition don't need to read the data from CSV.

CSV files have some pretty annoying disadvantages:

- It is slow to read data from CSV -- this is particularly egregious if your data has timestamp or string columns.
- They are big. This is part of the reason why they're slow: it simply takes a while to move that much text from disk into memory.
- They are untyped. In CSV, everything is a string -- it's up to the reading program to decide how to interpret the strings.
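
To make the "untyped" point concrete, here is a minimal sketch using only the standard library. The rows and column names are made up for illustration and are not from the competition data. Whatever types go into the CSV, strings come back out, and the reader has to restore the types by hand:

```python
import csv
import io
from datetime import datetime

# Hypothetical sample rows: an int, a float and a timestamp column.
rows = [
    {"id": 1, "height": 2596.5, "seen_at": datetime(2021, 12, 1, 8, 30)},
    {"id": 2, "height": 2804.0, "seen_at": datetime(2021, 12, 2, 9, 45)},
]

# Write the rows to an in-memory CSV "file".
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "height", "seen_at"])
writer.writeheader()
writer.writerows(rows)

# Read them back: every value is now a plain string.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
print({key: type(value).__name__ for key, value in loaded[0].items()})

# The reading program has to decide how to interpret each column.
restored = [
    {
        "id": int(row["id"]),
        "height": float(row["height"]),
        "seen_at": datetime.fromisoformat(row["seen_at"]),
    }
    for row in loaded
]
```

Parsing every cell like this is also part of why timestamp columns are so slow to read from CSV.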

Parquet files do much better in all of these aspects, at the cost of not being human-readable text files.

In this case, our `train.pq` is 77MB, vs 548MB for `train.csv` -- even though it contains the same data!

This is a pretty normal compression ratio, in my experience -- when there are low-cardinality columns or many repeated values, parquet files use tricks like run-length encoding to end up 2-10 times smaller than the equivalent CSV.
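
As a rough illustration of the run-length encoding idea, here is a toy sketch in plain Python. It is not parquet's actual encoder, and the column values are hypothetical; it only shows why long runs of repeated values compress so well:

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def run_length_decode(pairs):
    """Expand (value, run_length) pairs back into the original list."""
    return [value for value, count in pairs for _ in range(count)]


# Hypothetical low-cardinality column with long runs of the same value.
column = [1] * 5 + [2] * 3 + [1] * 4
encoded = run_length_encode(column)
print(encoded)  # [(1, 5), (2, 3), (1, 4)]
```

Twelve values shrink to three pairs here, and decoding them gives back the original column unchanged.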

Let's measure how long it takes to read the parquet files:

In [None]:
import pandas as pd
import lightgbm
import numpy as np
import random
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

random.seed(64)
np.random.seed(64)

In [None]:
%time df_test = pd.read_parquet('../input/tpsdec2021parquet/test.pq')
%time df = pd.read_parquet('../input/tpsdec2021parquet/train.pq')

That takes about 1 second on Kaggle. Let's compare that with the CSV files:

In [None]:
%time pd.read_csv('../input/tabular-playground-series-dec-2021/test.csv')
%time csv_train = pd.read_csv('../input/tabular-playground-series-dec-2021/train.csv')

At more than 20 seconds, there's really no contest -- especially because the CSV train file has all the wrong datatypes, whereas the parquet file remembers the types we selected in [the last notebook](https://www.kaggle.com/kaaveland/tps202112-parquet?scriptVersionId=81264309):

In [None]:
df.info()

Use `lightgbm` to estimate feature importances
==

Here's a baseline model I typed up to get some feature importance plots:

In [None]:
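# lightgbm's multiclass objective expects integer labels in [0, num_class),
# so encode Cover_Type and keep the encoder around to decode predictions later.
label_encoder = LabelEncoder()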

X_train, y_train = df.drop(columns=['Id', 'Cover_Type']), label_encoder.fit_transform(df.Cover_Type)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, shuffle=True, test_size=.2)
X_test = df_test.drop(columns=['Id'])

In [None]:
%%time

sane_defaults = {
    'objective': 'multiclass',
    'num_class': len(label_encoder.classes_),
    'learning_rate': .025,
    'seed': 64,
    'boosting': 'goss',
    'feature_fraction': .5,
    'force_row_wise': True,
    'metric': ['multi_logloss', 'multi_error'],
    'verbosity': -1,
    'first_metric_only': True,
}

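booster = lightgbm.train(
    params=sane_defaults,
    train_set=lightgbm.Dataset(X_train, label=y_train),
    num_boost_round=3000,
    valid_sets=[lightgbm.Dataset(X_val, label=y_val)],
    # lightgbm 4.0 removed the early_stopping_rounds and verbose_eval
    # arguments; these callbacks (available since 3.3) do the same job.
    callbacks=[
        lightgbm.early_stopping(stopping_rounds=50, first_metric_only=True),
        lightgbm.log_evaluation(period=100),
    ],
)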

In [None]:
px.bar(
    x=booster.feature_name(),
    y=booster.feature_importance(),
    title='importance_type = "split"'
)

In [None]:
px.bar(
    x=booster.feature_name(),
    y=booster.feature_importance('gain'),
    title='importance_type = "gain"'
)
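
The two plots can rank features very differently. As I understand it, `"split"` counts how many times a feature is used in a split, while `"gain"` sums the total gain of the splits that use it. Here is a toy sketch of that difference, with made-up feature names and gain values rather than anything read from the booster above:

```python
from collections import defaultdict

# Hypothetical splits from a tiny model: (feature used, gain of that split).
splits = [
    ("feature_a", 120.0),
    ("feature_a", 80.0),
    ("feature_b", 0.5),
    ("feature_b", 0.25),
    ("feature_b", 0.25),
]

split_importance = defaultdict(int)   # how often each feature is split on
gain_importance = defaultdict(float)  # total gain from splits on each feature

for feature, gain in splits:
    split_importance[feature] += 1
    gain_importance[feature] += gain

print(dict(split_importance))  # {'feature_a': 2, 'feature_b': 3}
print(dict(gain_importance))   # {'feature_a': 200.0, 'feature_b': 1.0}
```

In this sketch `feature_b` wins on split count while contributing almost nothing to the total gain, which is why it is worth looking at both plots.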

In [None]:
sub = df_test[['Id']].assign(
    Cover_Type=label_encoder.inverse_transform(booster.predict(X_test).argmax(axis=1))
)

sub.to_csv('submission.csv', index=False)