# Zarr files and L5kit data for dummies

>When I saw this competition for the first time, I said to myself: great, I'm going to have a lot of fun :) . And very quickly, I signed up. But in fact, I hadn't seen the real data yet. The data, although it seems quite rich to me, is presented in a format that makes you want to give up rather than touch it (I'm exaggerating a bit, of course).

>The few public notebooks that try to approach this ogre all use the same logic, which totally prevents you from understanding the data. Indeed, they all use the L5kit API, which is a great API (I think it is optimized from an application point of view). But let's not hide it, this data format is far from the one we are used to handling with Pandas. So all is lost? Well no! This is what I will try to show in this work.

In this notebook, we will discover together, step by step, what a **zarr** file is and, as a use case, we will apply our discoveries to the **L5kit zarr dataset**. Let's start right away!

To achieve our goals, we will need some ingredients:
* **Zarr** : as you may have guessed, it is the main package for handling **zarr** files
* **Numpy** :  zarr files are built on top of **Numpy arrays**
* **Pandas**:  our zarr files will be parsed into **Pandas DataFrames** for further analysis

In [None]:
try:
    import zarr
except  ModuleNotFoundError:
    ! pip install zarr > /dev/null
    ! pip install ipytree > /dev/null

In [None]:
import zarr
import pandas as pd, numpy as np
import itertools as it # I will be using the `itertools.chain` function
from pathlib import Path # for better file/path operations management

<h5 style="color:blue;text-align:center;">Please upvote the kernel if you find it useful. You'll motivate me to go through the junky documentation in order to make this competition Great Again :) !</h5>

# 1. Off-topic zarr tutorial for dummies
> This part is off-topic :), but useful nonetheless. Here, I will give a general introduction to zarr files. If you can't wait to get down to business, feel free to jump to part 2.

Zarr provides classes and functions for working with N-dimensional arrays that behave like NumPy arrays but whose data is divided into chunks and each chunk is compressed. If you are already familiar with HDF5 then Zarr arrays provide similar functionality, but with some additional flexibility.

## 1.1 Creating an array

Zarr has several functions for creating arrays. For example:

In [None]:
z = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4')
z

The code above creates a 2-dimensional array of 32-bit integers with 10000 rows and 10000 columns, divided into chunks where each chunk has 1000 rows and 1000 columns (and so there will be 100 chunks in total).
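
To make the chunk arithmetic explicit, here is a small sketch in plain Python (no zarr needed). The helper name `count_chunks` is mine, not part of any library; it just divides each dimension by the chunk length and rounds up, so it also covers shapes that are not an exact multiple of the chunk size.

```python
import math

def count_chunks(shape, chunks):
    """Return the number of chunks along each axis and the total number of chunks."""
    per_axis = [math.ceil(n / c) for n, c in zip(shape, chunks)]
    return per_axis, math.prod(per_axis)

# Same shape and chunking as the array created above
per_axis, total = count_chunks((10000, 10000), (1000, 1000))
print(per_axis, total)  # [10, 10] 100
```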

## 1.2 Reading and writing data

Zarr arrays support a similar interface to NumPy arrays for reading and writing data. For example, the entire array can be filled with a scalar value:

In [None]:
z[:] = 42

Regions of the array can also be written to, e.g.:

In [None]:
z[0, :] = np.arange(10000)
z[:, 0] = np.arange(10000)

The contents of the array can be retrieved by slicing, which will load the requested region into memory as a NumPy array, e.g.:

In [None]:
z[0, 0]

In [None]:
z[-1, -1]

In [None]:
z[0, :]

In [None]:
z[:]
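
Since the data is stored chunk by chunk, a slice has to read every chunk it overlaps. The sketch below is only arithmetic on the chunk grid (the helper `chunks_touched` is a hypothetical name, not a zarr function), but it gives a feel for how much work a given slice implies, assuming a regular chunk grid like the one above.

```python
def chunks_touched(start, stop, chunk_len):
    """Indices of the chunks along one axis that overlap the half-open range [start, stop)."""
    if stop <= start:
        return range(0)
    return range(start // chunk_len, (stop - 1) // chunk_len + 1)

# z[0, :] on a (10000, 10000) array with (1000, 1000) chunks
rows = chunks_touched(0, 1, 1000)       # a single row of chunks
cols = chunks_touched(0, 10000, 1000)   # every column of chunks
print(len(rows) * len(cols))  # 10
```

So reading one row of 10000 values still touches 10 chunks of a million values each, which is why the chunk layout matters for access patterns.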

## 1.3 Persistent arrays

In the examples above, compressed data for each chunk of the array was stored in main memory. Zarr arrays can also be stored on a file system, enabling persistence of data between sessions. For example:

In [None]:
z1 = zarr.open('data/example.zarr', mode='w', shape=(10000, 10000), chunks=(1000, 1000), dtype='i4')
z1

>The array above will store its configuration metadata and all compressed chunk data in a directory called ‘data/example.zarr’ relative to the current working directory. The zarr.convenience.open() function provides a convenient way to create a new persistent array or continue working with an existing array. Note that although the function is called “open”, there is no need to close an array: data are automatically flushed to disk, and files are automatically closed whenever an array is modified.

Persistent arrays support the same interface for reading and writing data, e.g.:

In [None]:
z1[:] = 42

In [None]:
z1[0, :] = np.arange(10000)

In [None]:
z1[:, 0] = np.arange(10000)

Check that the data have been written and can be read again:

In [None]:
z2 = zarr.open('data/example.zarr', mode='r')

In [None]:
np.all(z1[:] == z2[:])

If you are just looking for a fast and convenient way to save NumPy arrays to disk then load back into memory later, the functions zarr.convenience.save() and zarr.convenience.load() may be useful. E.g.:

In [None]:
a = np.arange(10)
zarr.save('data/example.zarr', a)
zarr.load('data/example.zarr')

## 1.4 Groups

Zarr supports hierarchical organization of arrays via groups. As with arrays, groups can be stored in memory, on disk, or via other storage systems that support a similar interface.

To create a group, use the zarr.group() function:

In [None]:
root = zarr.group()
root

Groups have a similar API to the Group class from h5py. For example, groups can contain other groups:

In [None]:
foo = root.create_group('foo')
bar = foo.create_group('bar')

Groups can also contain arrays, e.g.:

In [None]:
z1 = bar.zeros('baz', shape=(10000, 10000), chunks=(1000, 1000), dtype='i4')
z1

Members of a group can be accessed via the suffix notation, e.g.:

In [None]:
root['foo']

The ‘/’ character can be used to access multiple levels of the hierarchy in one call, e.g.:

In [None]:
root['foo/bar']

In [None]:
root['foo/bar/baz']

The zarr.hierarchy.Group.tree() method can be used to print a tree representation of the hierarchy, e.g.:

In [None]:
print(root.tree(expand=True))

## 1.5 User attributes

Zarr arrays and groups support custom key/value attributes, which can be useful for storing application-specific metadata. For example:

In [None]:
root = zarr.group()
root.attrs['foo'] = 'bar'
z = root.zeros('zzz', shape=(10000, 10000))
z.attrs['baz'] = 42
z.attrs['qux'] = [1, 4, 7, 12]
sorted(root.attrs)

In [None]:
'foo' in root.attrs

In [None]:
root.attrs['foo']

In [None]:
sorted(z.attrs)

In [None]:
z.attrs['baz']

In [None]:
z.attrs['qux']

**That's it for our Zarr tutorial. For more resources, you can visit [the official documentation](https://zarr.readthedocs.io/en/stable/tutorial.html#creating-an-array), from which I took most of the above examples.**

# 2. The L5Kit dataset

## 2.1 Introduction

The `L5Kit data` is stored in **zarr** format, which here is basically a set of numpy structured arrays. Conceptually, it is similar to a set of CSV files, each with records (rows) and named columns.
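
If structured arrays are new to you, here is a tiny sketch with a made-up record layout (it is **not** the L5Kit schema, just three toy fields) showing the "CSV-like" view: each field behaves like a column, each element like a row, and Pandas can read the whole thing directly.

```python
import numpy as np
import pandas as pd

# Hypothetical record layout, for illustration only
toy_dtype = np.dtype([("name", "U8"), ("x", np.float64), ("count", np.int64)])
records = np.array([("a", 1.5, 3), ("b", 2.5, 7)], dtype=toy_dtype)

print(records.dtype.names)  # the "column" names
print(records["x"])         # one column across all records
print(records[0])           # one row

# Flat structured arrays convert directly to a DataFrame
df = pd.DataFrame.from_records(records)
print(df)
```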

As for any **zarr** file, there must be a root folder. In our case, the root folder would likely look like **<...>/lyft-motion-prediction-autonomous-vehicles**. I set mine in the **DATA_ROOT** global variable as below :

In [None]:
# Set your root path to data
DATA_ROOT = Path("../input/lyft-motion-prediction-autonomous-vehicles")

In [None]:
zl5 = zarr.open(DATA_ROOT.joinpath("scenes/sample.zarr").as_posix(), mode="r")
zl5

So, the **L5Kit** data consists of groups of zarr datasets. We take a look by doing:

In [None]:
zl5.info

> The dataset name is obviously equal to "**/**" as we're in the root folder. It has the following members:
* **Scenes** :  a collection of frames
* **Frames** :  a collection of agents (the host agents + other agents)
* **Agents** : any object in circulation with the autonomous vehicle (AV)
* **Traffic_light_faces** : traffic lights and their faces (bulbs)

We can also see the dataset's **tree** by doing :

In [None]:
print(zl5.tree(expand=True))

So there are:
 * 100 **scenes** in the sample dataset
 * 24838 **frames**
 * 1893736 **agents**
 * 316008 **traffic_light_faces**

## 2.2 The L5Kit dataset: scenes

Let's get more info from the **scenes** :

In [None]:
zl5.scenes.info

### 2.2.1. What is a scene ?

Let's take a look into a **scene**'s `dtype`:

In [None]:
zl5.scenes.dtype

So, a **scene** consists of three blocks of things:
* **Frames** : a scene has a list of frames that start from ***scene.frame_index_interval\[0\]*** and ends at ***scene.frame_index_interval\[1\]***
* **Host** : a scene has a ***host*** which is the AV that films the scene.
* **Timestamps**: a scene has a ***start_time***  and an ***end_time***

**I will write a small function which takes in a scene and outputs those components as a `dict`.**

In [None]:
def parse_scene(scene):
    scene_dict = {
            "frame_index_interval_start": scene[0][0],
            "frame_index_interval_end": scene[0][1],
            "host":  scene[1],
            "start_time": scene[2],
            "end_time": scene[3]
        }
    return scene_dict

In [None]:
scene = zl5.scenes[0]
scene

In [None]:
parse_scene(scene)

**Nice one! A scene is less ugly now!** We can just iterate over all the scenes and get them into a pandas DataFrame, where we can do deeper analysis and create more features to train a good model.
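
A side note on the parsing style: indexing by position works, but structured records can also be read by field name, which tends to be easier to follow. Below is a minimal sketch on a toy record. The dtype `TOY_SCENE_DTYPE` is my assumption, inferred from the field names and positions used in this notebook; check `zl5.scenes.dtype` for the real layout before relying on it.

```python
import numpy as np

# Assumed layout (hypothetical), mirroring the fields parsed above
TOY_SCENE_DTYPE = np.dtype([
    ("frame_index_interval", np.int64, (2,)),
    ("host", "U16"),
    ("start_time", np.int64),
    ("end_time", np.int64),
])

def parse_scene_by_name(scene):
    """Same output as a positional parser, but reading fields by name."""
    start, end = scene["frame_index_interval"]
    return {
        "frame_index_interval_start": int(start),
        "frame_index_interval_end": int(end),
        "host": str(scene["host"]),
        "start_time": int(scene["start_time"]),
        "end_time": int(scene["end_time"]),
    }

# Made-up values, just to exercise the function
toy_scenes = np.array(
    [((0, 248), "host-a", 1_000_000_000, 26_000_000_000)],
    dtype=TOY_SCENE_DTYPE,
)
parsed = parse_scene_by_name(toy_scenes[0])
print(parsed)
```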

Instead of writing a simple function that naively iterates over the scenes, I will expose a robust interface that takes into account the fact that accessing (indexing) a **zarr** file is a somewhat **expensive operation**, as **Zarr** needs to unpack the compressed chunk before taking the right index. Instead of taking a single index at a time, I will take a range (slice) in order to make **Zarr** faster.
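
The slicing idea boils down to walking over the index range in fixed-size steps. Here is a sketch of just that bookkeeping, independent of zarr; `slice_bounds` is a hypothetical helper name, and the optional `max_chunks` cap is my reading of how the number of slices can be limited.

```python
def slice_bounds(start, end, chunk_size, max_chunks=None):
    """Yield (lo, hi) half-open bounds covering [start, end) in steps of chunk_size."""
    if max_chunks is not None:
        # Never read more than max_chunks slices
        end = min(end, start + max_chunks * chunk_size)
    for lo in range(start, end, chunk_size):
        yield lo, min(lo + chunk_size, end)

print(list(slice_bounds(0, 25, 10)))                # [(0, 10), (10, 20), (20, 25)]
print(list(slice_bounds(0, 25, 10, max_chunks=2)))  # [(0, 10), (10, 20)]
```

Each `(lo, hi)` pair would then be used as `array[lo:hi]`, so a compressed chunk is unpacked once per slice rather than once per item.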

### 2.2.2 Parsing scenes into a Pandas DataFrame

In [None]:
class BaseParser:
    """
    A robust and fast interface to load l5kit data into Pandas dataframes.

    Parameters
    ----------
    chunk_size: int, default=1000
        How many items to read in a single slice. The larger the better,
        as long as you have enough memory. Nevertheless, chunk sizes above `10_000` won't lead to
        a significant speed gain, as the original zarr files were chunked at 10_000.

    max_chunks: int, default=10
        Maximum number of slices (of `chunk_size` items each) to read from disk.

    root:
        Zarr data root path

    zarr_path:
        relative path or key to the data.
    """
    
    field = "scenes"
    dtypes = {}
    
    def __init__(self, start=0, end=None, chunk_size=1000, max_chunks=10, root=DATA_ROOT, zarr_path="scenes/sample.zarr"):
        
        self.start = start
        self.end = end
        self.chunk_size = chunk_size
        self.max_chunks = max_chunks
        

        self.root = Path(root)
        assert self.root.exists(), "There is nothing at {}!".format(self.root)
        self.zarr_path = Path(zarr_path)
        
     
    def parse(self):
        raise NotImplementedError
        
    def to_pandas(self, start=0, end=None, chunk_size=None, max_chunks=None):
        start = start or self.start
        end = end or self.end
        chunk_size = chunk_size or self.chunk_size
        max_chunks = max_chunks or self.max_chunks
        
        if not chunk_size or  not max_chunks: # One shot load, suitable for small zarr files
            df = zarr.load(self.root.joinpath(self.zarr_path).as_posix()).get(self.field)
            df = df[start:end]
            df = map(self.parse, df) 
        else: # Chunked load, suitable for large zarr files
            df = []
            with zarr.open(self.root.joinpath(self.zarr_path).as_posix(), "r") as zf:
                end = start+max_chunks*chunk_size if end is None else min(end, start+max_chunks*chunk_size)
                for i_start in range(start, end, chunk_size ):
                    items = zf[self.field][i_start: min(i_start + chunk_size,end)]
                    items = map(self.parse, items)
                    df.append(items)
            df = it.chain(*df)
            
        df = pd.DataFrame.from_records(df)
        for col, col_dtype in self.dtypes.items():
            df[col] = df[col].astype(col_dtype, copy=False)
        return df
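
As an alternative to parsing record by record, a structured array can also be flattened column by column, which avoids a Python-level loop over every item. This is only a sketch under my own assumptions: `structured_to_frame` is a hypothetical helper, the toy field names are inferred from the keys used in this notebook, and sub-array fields get numeric suffixes (`_0`, `_1`, ...) rather than the `_x`, `_y` names used elsewhere here.

```python
import numpy as np
import pandas as pd

def structured_to_frame(records):
    """Flatten a 1-D structured array into a DataFrame, one column per scalar component."""
    columns = {}
    for name in records.dtype.names:
        values = records[name]
        if values.ndim == 1:
            columns[name] = values
        else:
            # Sub-array field, e.g. shape (n, 3) or (n, 3, 3): one column per component
            flat = values.reshape(len(records), -1)
            for j in range(flat.shape[1]):
                columns[f"{name}_{j}"] = flat[:, j]
    return pd.DataFrame(columns)

# Toy frame-like records (assumed layout, made-up values)
toy_dtype = np.dtype([
    ("timestamp", np.int64),
    ("ego_translation", np.float64, (3,)),
    ("ego_rotation", np.float64, (3, 3)),
])
toy = np.zeros(4, dtype=toy_dtype)
toy["timestamp"] = np.arange(4)
toy["ego_rotation"] = np.eye(3)

toy_df = structured_to_frame(toy)
print(toy_df.shape)  # (4, 13)
```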

In [None]:
class SceneParser(BaseParser):
    field = "scenes"
    
    @staticmethod
    def parse(scene):
        scene_dict = {
            "frame_index_interval_start": scene[0][0],
            "frame_index_interval_end": scene[0][1],
            "host":  scene[1],
            "start_time": scene[2],
            "end_time": scene[3]
        }
        return scene_dict

In [None]:
sp = SceneParser(chunk_size=None, max_chunks=None, zarr_path="scenes/sample.zarr")

In [None]:
scenes = sp.to_pandas()
scenes.shape

In [None]:
scenes.head()

In [None]:
scenes["duration"] = (scenes["end_time"] -  scenes["start_time"])/1e9
scenes["num_frames"] = scenes["frame_index_interval_end"] - scenes["frame_index_interval_start"]

In [None]:
scenes.describe()
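
Dividing by `1e9` above treats the timestamps as nanoseconds. If, in addition, they count from the Unix epoch (an assumption on my side, worth checking on the real data), Pandas can turn them into readable datetimes. The values below are made up for the sketch.

```python
import pandas as pd

# Toy timestamps (made-up values), assumed to be nanoseconds since the Unix epoch
toy = pd.DataFrame({
    "start_time": [1_572_643_684_801_892_606],
    "end_time": [1_572_643_709_701_892_606],
})

toy["duration"] = (toy["end_time"] - toy["start_time"]) / 1e9  # seconds
toy["start_dt"] = pd.to_datetime(toy["start_time"], unit="ns")
print(toy[["duration", "start_dt"]])
```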

## 2.3 The L5Kit dataset: frames

As mentioned, **scenes** are made of **frames**. Each scene holds a reference to its frames, which start at ***frame_index_interval_start*** and end at ***frame_index_interval_end***.
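
These intervals also make it possible to go the other way and label every frame with the scene it belongs to. Here is a minimal sketch on toy intervals, assuming the intervals are half-open (`start` included, `end` excluded), sorted, and contiguous, which should be verified on the real data.

```python
import numpy as np

# Toy intervals (made up): three scenes covering frames [0, 3), [3, 5), [5, 9)
starts = np.array([0, 3, 5])
ends = np.array([3, 5, 9])

num_frames = ends - starts
# Repeat each scene index once per frame it contains
scene_index = np.repeat(np.arange(len(starts)), num_frames)
print(scene_index)  # [0 0 0 1 1 2 2 2 2]
```

The resulting array has one entry per frame, so it can be added as a `scene_index` column to a frames DataFrame for group-by style analysis.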

In [None]:
scene = scenes.iloc[-1]
scene

In [None]:
zl5.frames

In [None]:
scene_frames = zl5.frames[scene.frame_index_interval_start:scene.frame_index_interval_end]
frame = scene_frames[0]
frame

A **frame** consists of :
* ***timestamp***: the timestamp at which the state of the world was recorded
* ***agents***: the agents detected by the host (just a reference, given as an index interval)
* ***traffic lights***: the traffic light faces seen in the frame (also given as an index interval)
* ***information about the host***: translation, rotation
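
The host rotation comes as a 3x3 matrix, which is not the friendliest feature. A common trick is to reduce it to a single heading angle (yaw). The sketch below assumes the matrix is a proper rotation matrix with the z axis pointing up, so that yaw is the rotation about z; `yaw_from_rotation` is a hypothetical helper, not an L5Kit function.

```python
import numpy as np

def yaw_from_rotation(rot):
    """Heading angle in radians about the z axis of a 3x3 rotation matrix."""
    rot = np.asarray(rot, dtype=float).reshape(3, 3)
    return float(np.arctan2(rot[1, 0], rot[0, 0]))

# Build a pure rotation about z by 0.3 rad and recover the angle
angle = 0.3
rot_z = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle),  np.cos(angle), 0.0],
    [0.0,            0.0,           1.0],
])
print(round(yaw_from_rotation(rot_z), 6))  # 0.3
```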

In [None]:
def parse_frame(frame):
    frame_dict = {
        'timestamp': frame[0],
        'agent_index_interval_start': frame[1][0],
        'agent_index_interval_end': frame[1][1],
        'traffic_light_faces_index_interval_start': frame[2][0],
        'traffic_light_faces_index_interval_end': frame[2][1],
        'ego_translation_x': frame[3][0],
        'ego_translation_y': frame[3][1],
        'ego_translation_z': frame[3][2],
        'ego_rotation_xx': frame[4][0][0],
        'ego_rotation_xy': frame[4][0][1],
        'ego_rotation_xz': frame[4][0][2],
        'ego_rotation_yx': frame[4][1][0],
        'ego_rotation_yy': frame[4][1][1],
        'ego_rotation_yz': frame[4][1][2],
        'ego_rotation_zx': frame[4][2][0],
        'ego_rotation_zy': frame[4][2][1],
        'ego_rotation_zz': frame[4][2][2],
        
    }
    return frame_dict

In [None]:
parse_frame(frame)

In [None]:
class FrameParser(BaseParser):
    field = "frames"
    
    @staticmethod
    def parse(frame):
        frame_dict = {
            'timestamp': frame[0],
            'agent_index_interval_start': frame[1][0],
            'agent_index_interval_end': frame[1][1],
            'traffic_light_faces_index_interval_start': frame[2][0],
            'traffic_light_faces_index_interval_end': frame[2][1],
            'ego_translation_x': frame[3][0],
            'ego_translation_y': frame[3][1],
            'ego_translation_z': frame[3][2],
            'ego_rotation_xx': frame[4][0][0],
            'ego_rotation_xy': frame[4][0][1],
            'ego_rotation_xz': frame[4][0][2],
            'ego_rotation_yx': frame[4][1][0],
            'ego_rotation_yy': frame[4][1][1],
            'ego_rotation_yz': frame[4][1][2],
            'ego_rotation_zx': frame[4][2][0],
            'ego_rotation_zy': frame[4][2][1],
            'ego_rotation_zz': frame[4][2][2],

        }
        return frame_dict

    def to_pandas(self, start=0, end=None, chunk_size=None, max_chunks=None, scene=None):
        if scene is not None:
            start = scene.frame_index_interval_start
            end = scene.frame_index_interval_end
        
        df = super().to_pandas(start=start, end=end, chunk_size=chunk_size, max_chunks=max_chunks)
        return df

In [None]:
fp = FrameParser()

In [None]:
scene

In [None]:
frames = fp.to_pandas(scene=scene)
# frames = fp.to_pandas(scene=None)
frames.shape

In [None]:
frames.head()

In [None]:
frame = frames.iloc[0]
frame

## 2.4 The L5Kit dataset: agents

An agent is an object that moves in traffic alongside the host (AV).

In [None]:
zl5.agents

In [None]:
frame_agents = zl5.agents[int(frame.agent_index_interval_start):int(frame.agent_index_interval_end)]
agent = frame_agents[0]
agent

In [None]:
PERCEPTION_LABELS = [
    "PERCEPTION_LABEL_NOT_SET",
    "PERCEPTION_LABEL_UNKNOWN",
    "PERCEPTION_LABEL_DONTCARE",
    "PERCEPTION_LABEL_CAR",
    "PERCEPTION_LABEL_VAN",
    "PERCEPTION_LABEL_TRAM",
    "PERCEPTION_LABEL_BUS",
    "PERCEPTION_LABEL_TRUCK",
    "PERCEPTION_LABEL_EMERGENCY_VEHICLE",
    "PERCEPTION_LABEL_OTHER_VEHICLE",
    "PERCEPTION_LABEL_BICYCLE",
    "PERCEPTION_LABEL_MOTORCYCLE",
    "PERCEPTION_LABEL_CYCLIST",
    "PERCEPTION_LABEL_MOTORCYCLIST",
    "PERCEPTION_LABEL_PEDESTRIAN",
    "PERCEPTION_LABEL_ANIMAL",
    "AVRESEARCH_LABEL_DONTCARE",
]

In [None]:
class AgentParser(BaseParser):
    field = "agents"
    
    @staticmethod
    def parse(agent):
        frame_dict = {
            'centroid_x': agent[0][0],
            'centroid_y': agent[0][1],
            'extent_x': agent[1][0],
            'extent_y': agent[1][1],
            'extent_z': agent[1][2],
            'yaw': agent[2],
            "velocity_x":  agent[3][0],
            "velocity_y":  agent[3][1],
            "track_id":  agent[4],
        }
        for p_label, p in zip(PERCEPTION_LABELS, agent[5]):
            frame_dict["label_probabilities_{}".format(p_label)] = p
        return frame_dict

    def to_pandas(self, start=0, end=None, chunk_size=None, max_chunks=None, frame=None):
        if frame is not None:
            start = int(frame.agent_index_interval_start)
            end = int(frame.agent_index_interval_end)
        
        df = super().to_pandas(start=start, end=end, chunk_size=chunk_size, max_chunks=max_chunks)
        return df

In [None]:
ap = AgentParser()

In [None]:
agents = ap.to_pandas(frame=frame)
# agents = ap.to_pandas(frame=None)
agents.shape

In [None]:
agents.head()
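
With one probability column per label, two quick derived features are the most likely label and the speed. The sketch below uses a toy DataFrame with made-up values and only three of the label columns; the column naming follows the `label_probabilities_<LABEL>` pattern produced by the parser above.

```python
import numpy as np
import pandas as pd

prefix = "label_probabilities_"
# Toy agents (made-up values), using only three label columns for brevity
toy_agents = pd.DataFrame({
    prefix + "PERCEPTION_LABEL_CAR": [0.9, 0.1, 0.2],
    prefix + "PERCEPTION_LABEL_CYCLIST": [0.05, 0.1, 0.7],
    prefix + "PERCEPTION_LABEL_PEDESTRIAN": [0.05, 0.8, 0.1],
    "velocity_x": [3.0, 0.0, -1.0],
    "velocity_y": [4.0, 1.0, 0.0],
})

prob_cols = [c for c in toy_agents.columns if c.startswith(prefix)]
# Column with the highest probability per row, with the common prefix stripped
toy_agents["label"] = toy_agents[prob_cols].idxmax(axis=1).str[len(prefix):]
# Magnitude of the 2-D velocity vector
toy_agents["speed"] = np.hypot(toy_agents["velocity_x"], toy_agents["velocity_y"])
print(toy_agents[["label", "speed"]])
```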

## 2.5 The L5Kit dataset: traffic lights

In [None]:
zl5.traffic_light_faces

In [None]:
class TrafficLightParser(BaseParser):
    field = "traffic_light_faces"
    
    @staticmethod
    def parse(light):
        frame_dict = {
            'face_id': light[0],
            'traffic_light_id': light[1],
            'traffic_light_face_status_0': light[2][0],
            'traffic_light_face_status_1': light[2][1],
            'traffic_light_face_status_2': light[2][2],
        }
        return frame_dict

    def to_pandas(self, start=0, end=None, chunk_size=None, max_chunks=None, frame=None):
        if frame is not None:
            start = int(frame.traffic_light_faces_index_interval_start)
            end = int(frame.traffic_light_faces_index_interval_end)
        
        df = super().to_pandas(start=start, end=end, chunk_size=chunk_size, max_chunks=max_chunks)
        return df

In [None]:
tlp = TrafficLightParser()

In [None]:
lights = tlp.to_pandas(frame = frame)
# lights = tlp.to_pandas(frame = None)
lights.shape

In [None]:
lights.head()

<h2 style="color:blue;text-align:center;">Kkiller</h2>