This is an inference kernel. Please [find the training kernel here](https://www.kaggle.com/kneroma/lgbm-on-lyft-tabular-data-training).

After digging deep into the Zarr file format and the Lyft L5Kit GitHub repos, I finally succeeded in converting the competition's dataset into **csv** files on which we can run classical models.

For those who are interested, [the csv dataset looks like this one](https://www.kaggle.com/kneroma/lyft-motion-prediction-autonomous-vehicles-as-csv). I haven't uploaded the whole dataset yet. Stay tuned!

Finally, [this notebook of mine may also help you dig into Zarr files and the Lyft L5Kit dataset format.](https://www.kaggle.com/kneroma/zarr-files-and-l5kit-data-for-dummies)

For the prediction, you can [find the test set as csv here](https://www.kaggle.com/kneroma/lyft-test-set-as-csv). 

<h4>Please consider upvoting the datasets to make them more visible for all of us.</h4>

In [None]:
import pandas as pd, numpy as np
import re,json
import itertools as it
from pathlib import Path

import lightgbm as lgb

pd.options.display.max_columns=305

# Loading the test set as CSV

In [None]:
# Load the test set; it contains `71122` rows, as expected
df = pd.read_csv("../input/lyft-test-set-as-csv/Lyft_test_set.csv")
print("df.shape:", df.shape)
df.head(10)

> Some of the columns are self-explanatory; for the others, please refer to the corresponding dataset for more details.

# Loading the LGBM models

In [None]:
def get_model_name(filename):
    # Raw string avoids the invalid "\d" escape; [xy] matches the axis letter only (no comma)
    return re.search(r"^(lgbm_[xy]_shift_\d+)", filename).group(1)
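
As a quick illustration of what the helper extracts, here is a minimal sketch using a raw-string pattern. The file stem below is hypothetical; it only follows the `lgbm_{axis}_shift_{NN}` naming that the prediction loop later relies on.

```python
import re

def get_model_name(filename):
    # Raw-string pattern; [xy] accepts exactly one axis letter.
    return re.search(r"^(lgbm_[xy]_shift_\d+)", filename).group(1)

# Hypothetical file stem, for illustration only.
name = get_model_name("lgbm_x_shift_07")

# The horizon is recovered from the part after "shift_".
shift = int(name.split("shift_")[1])
print(name, shift)
```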

In [None]:
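def get_models(path):
    """Map each model name to its model file and its training columns."""
    models = {}
    path = Path(path)
    for model in path.glob("lgbm*"):
        model_name = get_model_name(model.stem)
        # The time horizon is encoded after "shift_" in the model name
        shift = int(model_name.split("shift_")[1])
        # Each horizon has a JSON file listing the columns used for training
        meta = path.joinpath("meta_shift_{:02d}.json".format(shift))
        with meta.open() as f:
            train_cols = json.load(f)["TRAIN_COLS"]
        models[model_name] = {"model": model.as_posix(), "train_cols": train_cols}
    return models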
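
To make the expected folder layout concrete, the sketch below rebuilds a tiny version of it in a temporary directory and pairs each model file with its metadata. The `.txt` extension, the dummy file contents and the feature names `feat_a`/`feat_b` are assumptions for illustration; only the `lgbm*` prefix, the `meta_shift_NN.json` name and the `TRAIN_COLS` key come from the code above.

```python
import json
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical model files (extension and content are placeholders).
    (root / "lgbm_x_shift_01.txt").write_text("dummy model")
    (root / "lgbm_y_shift_01.txt").write_text("dummy model")
    # One metadata file per horizon, shared by the x and y models.
    (root / "meta_shift_01.json").write_text(
        json.dumps({"TRAIN_COLS": ["feat_a", "feat_b"]})
    )

    found = {}
    for model_path in sorted(root.glob("lgbm*")):
        shift = int(model_path.stem.split("shift_")[1])
        meta = root / "meta_shift_{:02d}.json".format(shift)
        found[model_path.stem] = json.loads(meta.read_text())["TRAIN_COLS"]

print(found)
```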

In [None]:
models = get_models("../input/lyft-models/lgbm_06")
len(models)

In [None]:
next(iter(models.items()))

I've trained **100** LGBM models: one for each of the *50 time horizons* × *2 spatial dimensions*.

The whole training took about one hour, and the prediction step is even faster.
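
The 50 × 2 grid can be spelled out as the list of dictionary keys the prediction step will look up. This is only a sketch of the naming scheme; the variable `expected_keys` is introduced here for illustration.

```python
import itertools as it

# One key per (time horizon, axis) pair: 50 horizons x 2 axes = 100 models.
expected_keys = [
    "lgbm_{}_shift_{:02d}".format(axis, shift)
    for shift, axis in it.product(range(1, 51), ["x", "y"])
]
print(len(expected_keys))
print(expected_keys[:4])
```

Comparing `set(expected_keys)` with `set(models)` is a cheap way to spot a missing model file before running the predictions.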

# Make prediction for the test set

In [None]:
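def make_colnames():
    # Columns are named coord_{axis}{mode}{step}: 3 modes (trajectories) x 50 time steps
    xcols = ["coord_x{}{}".format(mode, step) for mode in range(3) for step in range(50)]
    ycols = ["coord_y{}{}".format(mode, step) for mode in range(3) for step in range(50)]
    # Interleave x and y so that each (mode, step) pair appears as x then y
    cols = ["timestamp", "track_id"] + ["conf_0", "conf_1", "conf_2"] + list(it.chain(*zip(xcols, ycols)))
    return cols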
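
A short check of the resulting layout may help when reading the submission frame. The sketch below repeats the function so that it runs on its own; the loop variables are named `mode` and `step` here as a reading of the format, not as something the dataset documentation states.

```python
import itertools as it

def make_colnames():
    # coord_{axis}{mode}{step}: 3 modes x 50 time steps, x and y interleaved.
    xcols = ["coord_x{}{}".format(mode, step) for mode in range(3) for step in range(50)]
    ycols = ["coord_y{}{}".format(mode, step) for mode in range(3) for step in range(50)]
    return ["timestamp", "track_id", "conf_0", "conf_1", "conf_2"] + list(it.chain(*zip(xcols, ycols)))

cols = make_colnames()
print(len(cols))   # 2 id columns + 3 confidences + 300 coordinates
print(cols[:9])    # ids, confidences, then the first two (x, y) pairs
print(cols[105])   # first coordinate column of the second mode
```

The first 105 columns therefore hold the ids, the three confidences and the whole first trajectory.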

In [None]:
def predict(models, df):
    # Start from an all-NaN frame with the 305 submission columns
    sub = pd.DataFrame(np.full((len(df), 305), np.nan), columns=make_colnames())
    sub[["timestamp", "track_id"]] = df[["timestamp", "track_id"]]
    # Single-trajectory submission: all the confidence goes to the first mode
    sub["conf_0"] = 1.0

    for shift in range(1, 51):
        for suffix in ["x", "y"]:
            model_info = models["lgbm_{}_shift_{:02d}".format(suffix, shift)]

            model = lgb.Booster(model_file=model_info["model"])
            pred = model.predict(df[model_info["train_cols"]])

            # Horizon `shift` fills time step `shift - 1` of mode 0
            sub["coord_{}0{}".format(suffix, shift - 1)] = pred

        if shift % 10 == 0:
            print("shift: {}".format(shift))

    # The two unused modes (confidences and coordinates) are set to zero
    sub.fillna(0., inplace=True)

    return sub

In [None]:
sub = predict(models, df)

In [None]:
sub.iloc[:50, :105]

In [None]:
sub.to_csv("submission.csv", index=False)
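
Before submitting, a few sanity checks on the frame can catch obvious mistakes. The helper `check_submission` below is hypothetical, and its three checks (305 columns, no missing values, confidences summing to 1 per row) are assumptions inferred from how `predict` fills the frame. It is exercised on a small toy frame rather than on the real predictions.

```python
import numpy as np
import pandas as pd

def check_submission(sub):
    # Hypothetical helper: returns one boolean per sanity check.
    conf = sub[["conf_0", "conf_1", "conf_2"]].to_numpy()
    return {
        "n_columns_ok": sub.shape[1] == 305,
        "no_nan": not sub.isna().any().any(),
        "conf_sum_ok": bool(np.allclose(conf.sum(axis=1), 1.0)),
    }

# Toy frame with the same column layout as the submission (all values are dummies).
cols = ["timestamp", "track_id", "conf_0", "conf_1", "conf_2"] + [
    "coord_{}{}{}".format(axis, mode, step)
    for mode in range(3) for step in range(50) for axis in "xy"
]
toy = pd.DataFrame(np.zeros((4, 305)), columns=cols)
toy["conf_0"] = 1.0

report = check_submission(toy)
print(report)
```

On the real data, the same call would simply be `check_submission(sub)`.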

Getting such a score with no GPU computation or image processing is just beautiful. What's more, my LGBM models are not well trained and I engineered **zero** features! Needless to say, there is still room for improvement!

I will be publishing my training dataset and the whole conversion process soon. For now, my messy code needs some cleaning and refactoring :) .

<div style="text-align:center;font-size:Large"><a href="https://www.kaggle.com/kneroma">@Kkiller</a></div>