Converting data to parquet
==

I see a lot of people copying around a `reduce_memory_usage` function, which they immediately call on the result of reading from a CSV.

Obviously, it's nice to reduce memory usage, and for fast arithmetic it's preferable to use 32-bit floats rather than 64-bit ints. But there's a way to do that without having to copy a function around all the time.

Apache Parquet
--

Parquet is an open standard for an efficient binary file format for columnar data. It's supported by most tools that work with "big" tabular data, for example pandas, Spark, polars and Arrow. We use it at work, where it scales very nicely up to "billions of things".

In general, if you `df.to_parquet(...)` a `DataFrame` and later `pd.read_parquet(...)` the same file, all the columns will have the same dtype! To some extent, this works across languages and tools too -- if you've turned something into a `pd.CategoricalDtype` with `pandas`, chances are good that `polars`, which also supports such a concept, will read it back that way.

In [None]:
import os
import pandas as pd
import numpy as np

data_root = os.environ.get('KAGGLE_DIR', '../input')
tps_root = f'{data_root}/tabular-playground-series-dec-2021'
df = pd.read_csv(f'{tps_root}/train.csv')

bools = df.columns[df.nunique() <= 2]
dtypes = {col: np.bool_ for col in bools}
dtypes['Id'] = np.int32
dtypes['Cover_Type'] = np.int8
floats = {
    col: np.float32 for col in set(df.columns) - set(dtypes)
}
dtypes.update(floats)
df = df.astype(dtypes)
# test.csv has no Cover_Type column
test_dtypes = {col: dtype for col, dtype in dtypes.items() if col != 'Cover_Type'}
df_test = pd.read_csv(f'{tps_root}/test.csv', dtype=test_dtypes)

df.info()

So, we decided on some datatypes. We can just write these dataframes as parquet files now.
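
To get a rough feel for what such datatype choices save, here is a small sketch on synthetic data. The frame below is made up and only borrows a few column names from the competition data; it assumes the columns start out as 64-bit ints, which is what `read_csv` typically infers for integer-valued columns:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 1_000

# Synthetic stand-in: every column is a 64-bit int to begin with.
wide = pd.DataFrame({
    'Id': np.arange(n_rows, dtype=np.int64),
    'Elevation': rng.integers(1_800, 4_000, size=n_rows, dtype=np.int64),
    'Soil_Type1': rng.integers(0, 2, size=n_rows, dtype=np.int64),
})

# Same kind of mapping as above: int32 id, float32 measurement, bool indicator.
narrow = wide.astype({
    'Id': np.int32,
    'Elevation': np.float32,
    'Soil_Type1': np.bool_,
})

# Bytes used by the column data, ignoring the index.
bytes_wide = int(wide.memory_usage(index=False).sum())
bytes_narrow = int(narrow.memory_usage(index=False).sum())
print(bytes_wide, bytes_narrow)
```

Here the three columns go from 8 bytes per value each to 4, 4 and 1 bytes per value.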

If we want to use them in another notebook, we have two options:

- After saving the notebook, we can navigate to the "Data" pane, and there's a "Create dataset" button we can use to make it into a Dataset.
- But we could also just use the output of this notebook as input to the next one.

First things first, let's create the files:

In [None]:
df.to_parquet('train.pq')
df_test.to_parquet('test.pq')

Next step
==

This time around, I chose to create a dataset from these files. In [the next](https://www.kaggle.com/kaaveland/tps202112-lgbm-feature-importance?scriptVersionId=81261111) notebook, I've simply added that dataset to a competition notebook to load it. Note that there's no need to worry about data types in that notebook.

There's no particular reason why I turned these output files into a Kaggle dataset; I could just as easily have added the output of this notebook as input to the next. That is a very reasonable thing to do if you need to do some sort of expensive preprocessing! Or maybe you add a large number of features here, then work on selecting the best ones in the next notebook?

Feature engineering
==

A number of good features have been found -- I'm not going to link to everything in the discussions, but there are lots of notebooks and topics for this now.

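For this part, I will concatenate the train and test datasets. Almost all of these transformations only look at values within the same row, so this is perfectly fine. The one exception is `make_positive`, which shifts a column by its minimum over both datasets, so train and test get the same shift.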
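
As a small sanity check of that reasoning, the sketch below uses two made-up frames (the values are invented, only the column names match the data). It compares computing two of the row-wise features on the concatenated frame against computing them separately, and does the same for the `make_positive` helper used in the next cell, which depends on the column minimum rather than on a single row:

```python
import pandas as pd

toy_train = pd.DataFrame({
    'Aspect': [-30, 90],
    'Elevation': [2500, 3000],
    'Vertical_Distance_To_Hydrology': [-10, 40],
})
toy_test = pd.DataFrame({
    'Aspect': [400, 0],
    'Elevation': [2700, 2900],
    'Vertical_Distance_To_Hydrology': [-25, 5],
})

def row_wise(frame):
    # Each output value depends only on values in the same row.
    return pd.DataFrame({
        'Aspect_rolled_over': frame.Aspect % 360,
        'Hydrology_Elevation': frame.Elevation - frame.Vertical_Distance_To_Hydrology,
    })

def make_positive(series):
    # Same helper as in the notebook: depends on the minimum of the whole column.
    return series + abs(series.min())

toy_both = pd.concat([toy_train, toy_test], axis=0)

on_both = row_wise(toy_both)
separately = pd.concat([row_wise(toy_train), row_wise(toy_test)], axis=0)
print(on_both.equals(separately))

shifted_both = make_positive(toy_both.Vertical_Distance_To_Hydrology)
shifted_separately = pd.concat([
    make_positive(toy_train.Vertical_Distance_To_Hydrology),
    make_positive(toy_test.Vertical_Distance_To_Hydrology),
], axis=0)
print(shifted_both.equals(shifted_separately))
```

The row-wise features match either way. The shifted column does not, because the two frames have different minima; computing it on the concatenated frame gives train and test one shared shift.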

In [None]:
both = pd.concat([df, df_test], axis=0)

wilderness_sum = both[both.columns[both.columns.str.startswith('Wilderness')]].sum(axis=1).rename('Wilderness_Sum').astype(np.float32)
soiltype_sum = both[both.columns[both.columns.str.startswith('Soil_')]].sum(axis=1).rename('Soil_Type_Sum').astype(np.float32)

aspect_rollover = (both.Aspect % 360).rename('Aspect_rolled_over')
hydrology_elevation = (both.Elevation - both.Vertical_Distance_To_Hydrology).rename('Hydrology_Elevation')
water_vert_direction = both.Vertical_Distance_To_Hydrology.apply(np.sign).rename('Water_Vertical_direction')

def make_positive(series):
    # Shift the column so that a negative minimum ends up at 0
    return series + abs(series.min())

manhattan_hydrology = (
        make_positive(both.Horizontal_Distance_To_Hydrology) +
        make_positive(both.Vertical_Distance_To_Hydrology)).rename('Manhattan_Hydrology').astype(np.float32)

euclidean_hydrology = (
        make_positive(both.Horizontal_Distance_To_Hydrology) ** 2 +
        make_positive(both.Vertical_Distance_To_Hydrology) ** 2
).apply(np.sqrt).rename('Euclidian_Hydrology').astype(np.float32)

hillshape_clipped = both[both.columns[both.columns.str.startswith('Hillshade_')]].clip(lower=0, upper=255).add_suffix('_clipped')

fe = pd.concat([
    both.Elevation,
    both.Horizontal_Distance_To_Roadways,
    both.Horizontal_Distance_To_Fire_Points,
    both[both.columns[both.columns.str.startswith('Wilderness')]],
    both[both.columns[both.columns.str.startswith('Soil_')]].drop(columns=['Soil_Type7', 'Soil_Type15']),
    wilderness_sum,
    soiltype_sum,
    aspect_rollover,
    hydrology_elevation,
    water_vert_direction,
    manhattan_hydrology,
    euclidean_hydrology,
    hillshape_clipped,
    both.Cover_Type
], axis=1)

train_fe, test_fe = fe.loc[fe.Cover_Type.notna()], fe.loc[fe.Cover_Type.isna()]

fe.info()

In [None]:
train_fe = train_fe.astype({'Cover_Type': np.int8})
test_fe = test_fe.drop(columns=['Cover_Type'])

train_fe.to_parquet('train_fe.pq')
test_fe.to_parquet('test_fe.pq')
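
The concatenate-then-split pattern above relies on `Cover_Type` being missing for the test rows. Here is a minimal sketch of that mechanism on made-up frames (values invented), which also shows why the target is cast back to `np.int8` before writing:

```python
import numpy as np
import pandas as pd

toy_train = pd.DataFrame({
    'Elevation': np.array([2500, 3000], dtype=np.float32),
    'Cover_Type': np.array([1, 2], dtype=np.int8),
})
toy_test = pd.DataFrame({
    'Elevation': np.array([2700], dtype=np.float32),
})

toy_both = pd.concat([toy_train, toy_test], axis=0)

# The test row has no target, so Cover_Type holds NaN there and the
# column is upcast to a float dtype to make room for it.
print(toy_both.Cover_Type.dtype)

# Rows with a target are train rows; rows without one are test rows.
train_part = toy_both.loc[toy_both.Cover_Type.notna()].astype({'Cover_Type': np.int8})
test_part = toy_both.loc[toy_both.Cover_Type.isna()].drop(columns=['Cover_Type'])

print(train_part.dtypes)
print(test_part.dtypes)
```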