# torchvideo Dataset demo

This notebook demonstrates the `VideoDataset` classes.

## Contents

1. [Set up](#Set-up)
  1. [Imports](#Imports)
  2. [Downloading media](#Downloading-media)
2. [The `VideoDataset` classes](#The-VideoDataset-classes)
  1. [ImageFolderVideoDataset](#ImageFolderVideoDataset)
  2. [VideoFolderDataset](#VideoFolderDataset)
  3. [GulpVideoDataset](#GulpVideoDataset)
3. [Labels](#Labels)
4. [Sampling frames](#Frame-sampling)

---

## Set up

### Imports

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
# Add the library path to sys.path so we can import torchvideo
import sys
sys.path.append('../src')
print(sys.executable)
print(sys.version)

In [None]:
from torchvideo.transforms import *
from torchvideo.datasets import *
from torchvideo.samplers import *
from torchvideo.datasets.vis import show_video
from torchvision.transforms import Compose, Lambda, Grayscale

---

### Downloading media

First we need to download a test video and prepare some toy datasets. We'll reuse the media used to test `torchvideo`. The `gen_test_media` script downloads a short clip of [Big Buck Bunny](https://peach.blender.org/) and creates datasets suitable for use with all `VideoDataset` classes:

- An [`ImageFolderVideoDataset`](https://torchvideo.readthedocs.io/en/latest/datasets.html#imagefoldervideodataset), where each example consists of a set of frames stored as images on disk.
- A [`VideoFolderDataset`](https://torchvideo.readthedocs.io/en/latest/datasets.html#videofolderdataset), where each example is stored as a video file.
- A [`GulpVideoDataset`](https://torchvideo.readthedocs.io/en/latest/datasets.html#gulpvideodataset), where frames are stored in a simple binary format of concatenated JPEGs (see the [GulpIO](https://github.com/TwentyBN/GulpIO) README for more information on this format).

In [None]:
%%bash
# Download the test media, but only if it isn't already present
if [[ ! -f ../tests/data/media/big_buck_bunny_360p_5mb.mp4 ]]; then
  cd ../tests/data/media 
  ./gen_test_media.sh > /dev/null 2>&1
  cd -
fi

---

## The VideoDataset classes

Now that we have some data, we can play around with `torchvideo`'s dataset classes.

### ImageFolderVideoDataset

This is the dataset class you're looking for if your videos have been dumped into individual frames.

#### Data layout

How should the data be laid out on disk?
The root dataset folder should contain one subdirectory per example. Each subdirectory should contain numbered images, one for each frame of the video.

In [None]:
%%bash
cd ../tests/data/media
echo "Top level folder contents: "
ls -l video_image_folder 
echo

echo "Example folder contents: "
ls -l video_image_folder/video0 
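
To make the layout concrete, here is a minimal sketch that builds a toy directory tree of the same shape using only the standard library. The helper `make_toy_layout` is hypothetical (it is not part of `torchvideo`), the files it creates are empty placeholders rather than valid JPEGs, and numbering frames from 1 is an assumption based on how `ffmpeg` numbers image sequences by default.

```python
import tempfile
from pathlib import Path

# Same filename template as the one passed to ImageFolderVideoDataset in this notebook.
FRAME_TEMPLATE = 'frame_{:05d}.jpg'


def make_toy_layout(root, video_names, frame_count):
    """Create root/<video>/frame_00001.jpg, ... as empty placeholder files."""
    root = Path(root)
    for name in video_names:
        video_dir = root / name
        video_dir.mkdir(parents=True, exist_ok=True)
        # Assumption: frames are numbered from 1, as ffmpeg does by default.
        for i in range(1, frame_count + 1):
            (video_dir / FRAME_TEMPLATE.format(i)).touch()
    return root


with tempfile.TemporaryDirectory() as tmp:
    root = make_toy_layout(tmp, ['video0', 'video1'], frame_count=3)
    # Map each example folder to the sorted list of frame files it contains.
    layout = {d.name: sorted(f.name for f in d.iterdir()) for d in sorted(root.iterdir())}
    for video, frames in layout.items():
        print(video, frames)
```

Each key in `layout` corresponds to one example in the dataset, and the sorted filenames give the frame order.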

#### Demo

In [None]:
dataset = ImageFolderVideoDataset('../tests/data/media/video_image_folder/', 'frame_{:05d}.jpg')
len(dataset)

By default, the dataset will load an example and convert it into a tensor. If you wish to perform data augmentation, pass a `transform` to the constructor; it will receive an iterator of `PIL.Image.Image` objects.

In [None]:
type(dataset[0])

Data is laid out in the `CTHW` format, that is `(channels, time, height, width)`:

In [None]:
dataset[0].shape
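
As an illustration of what `CTHW` means, the sketch below rearranges a toy clip stored frame by frame (time, height, width, channels) into channel-first order using plain nested lists. The helpers `thwc_to_cthw` and `shape` are hypothetical and exist only for this example; with real tensors one would typically use a single permute or transpose call instead.

```python
def thwc_to_cthw(frames):
    """Rearrange frames[t][h][w][c] into video[c][t][h][w] (nested lists)."""
    T, H, W, C = len(frames), len(frames[0]), len(frames[0][0]), len(frames[0][0][0])
    return [
        [[[frames[t][h][w][c] for w in range(W)] for h in range(H)] for t in range(T)]
        for c in range(C)
    ]


def shape(nested):
    """Return the dimensions of a regular nested list as a tuple."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


# Toy clip: 4 frames of 2x3 pixels with 3 channels each.
# Every pixel in frame t holds the values (t, t + 10, t + 20).
frames = [[[[t, t + 10, t + 20] for _ in range(3)] for _ in range(2)] for t in range(4)]
video = thwc_to_cthw(frames)

print(shape(frames))  # (4, 2, 3, 3): time, height, width, channels
print(shape(video))   # (3, 4, 2, 3): channels, time, height, width
```

Indexing `video[c][t]` therefore selects channel `c` of frame `t`.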

---

### VideoFolderDataset

This is the dataset class you're looking for if you've got a folder of video files, each representing an example in the dataset.

#### Data layout

The `VideoFolderDataset` expects a directory of video files.

In [None]:
%%bash
cd ../tests/data/media
echo "Top level folder contents: "
ls -l video_folder 

#### Demo

In [None]:
dataset = VideoFolderDataset('../tests/data/media/video_folder/')

By default, the dataset will load an example and convert it into a tensor. If you wish to perform data augmentation, pass a `transform` to the constructor; it will receive an iterator of `PIL.Image.Image` objects.

In [None]:
type(dataset[0])

---

### GulpVideoDataset

If you've [gulped](https://github.com/TwentyBN/GulpIO#gulp-a-dataset) your data, then you'll want to use this class.

#### Data layout

The `GulpVideoDataset` uses the [GulpIO](https://github.com/TwentyBN/GulpIO) format. You have to 'gulp' your files and then point the dataset at the root directory containing the `*.gulp` and `*.gmeta` files.

In [None]:
%%bash
cd ../tests/data/media
echo "Top level folder contents: "
ls -l gulp_output 

#### Demo

If you've stored your example labels in the metadata for each segment, you can access them by passing `label_field` with the name of the field in the `gmeta` JSON file.

In [None]:
dataset = GulpVideoDataset('../tests/data/media/gulp_output', label_field='label')

When a dataset has been constructed with a `LabelSet` it'll return both the frames and labels for an example:

In [None]:
frames, label = dataset[0]
print("Label: ", label)
print(frames.shape)

---

## Labels 

Typically you're going to want to get the label for an example when you load it; this is the job of the `LabelSet`. Decoupling the reading of video data from the storage of metadata gives a flexible model: you can choose how to store your labels and how to store your video data independently, then mix and match.

A `LabelSet` is a class that provides a `__getitem__` method which, when given the filename of a video folder or file, or a video ID, returns the corresponding label. All `VideoDataset` classes support the `label_set` kwarg in their constructors, so you can pass a label set to any subclass.

The `DummyLabelSet` returns the same label regardless of what it is passed. It's useful for testing, or for using code that demands a label when you just have to fake one.

In [None]:
label_set = DummyLabelSet(label=5)
label_set['video1'], label_set['any_video']
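
To illustrate the `__getitem__` contract described above, here is a minimal sketch of a dict-backed label set. `DictLabelSet` is a hypothetical class, not part of `torchvideo`, and it is written as a plain duck-typed class; whether the dataset classes additionally require subclassing `LabelSet` is not shown here and would need checking against the library.

```python
class DictLabelSet:
    """Hypothetical label set backed by a plain dict (not part of torchvideo)."""

    def __init__(self, labels, default=None):
        self._labels = dict(labels)
        self._default = default

    def __getitem__(self, video_name):
        # Known videos return their stored label.
        if video_name in self._labels:
            return self._labels[video_name]
        # Unknown videos fall back to the default label, if one was given.
        if self._default is not None:
            return self._default
        raise KeyError(video_name)


dict_label_set = DictLabelSet({'video0': 1, 'video1': 2}, default=0)
print(dict_label_set['video0'], dict_label_set['unknown_video'])
```

With `default=None`, looking up an unknown video raises `KeyError`, which surfaces missing labels early instead of silently faking them.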

When a dataset is given a label set, it will return both the frames and the label upon loading a video:

In [None]:
dataset = ImageFolderVideoDataset(
    '../tests/data/media/video_image_folder/', 'frame_{:05d}.jpg',
    label_set=label_set
)

frames, label = dataset[0]
print("Video shape:\t", frames.shape)
print("Label:\t\t", label)


The storage format of labels is typically very dataset dependent, so we provide a `LambdaLabelSet` class that wraps a user-provided function that returns a label:

In [None]:
import pandas as pd
labels_df = pd.DataFrame({
    'video':['video0', 'video1', 'video2'],
    'label': [1, 2, 3]
}).set_index('video')
label_set = LambdaLabelSet(lambda filename: labels_df.loc[filename]['label'])

label_set['video0'], label_set['video1'], label_set['video2']
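
If pandas isn't available, the same kind of lookup function can be built with the standard library. The sketch below parses labels from CSV text and returns a function mapping a video name to its label; `make_label_lookup` and the `video`/`label` column names are assumptions chosen to mirror the DataFrame above, and the returned function is the sort of callable one would wrap in a `LambdaLabelSet`.

```python
import csv
import io

# Toy label file with the same content as the DataFrame in the previous cell.
CSV_TEXT = """video,label
video0,1
video1,2
video2,3
"""


def make_label_lookup(csv_file):
    """Read 'video,label' rows and return a function: video name -> int label."""
    reader = csv.DictReader(csv_file)
    labels = {row['video']: int(row['label']) for row in reader}

    def lookup(video_name):
        return labels[video_name]

    return lookup


lookup = make_label_lookup(io.StringIO(CSV_TEXT))
print(lookup('video0'), lookup('video1'), lookup('video2'))
```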

---

## Frame sampling

Up until this point, we've always loaded all the frames from a video into memory. Loading frames is typically one of the most expensive operations in a video ML pipeline, so we would like to minimize this cost as much as possible and only load the frames we want.

Currently there are two dominant classes of frame sampling methods in video machine learning: dense sampling, where we sample a contiguous sequence of frames, and sparse sampling, where we sample frames far apart from each other.

We provide a variety of samplers supporting these sampling strategies.

The purpose of a frame sampler is to generate a set of frame indices given the length of a video in frames.

The `FullVideoSampler` is the default sampler used in all dataset classes. It generates a `slice` object covering the entire video clip. For a video 20 frames long, we get a slice starting at frame 0 and ending at frame 19 with a step size of 1:

In [None]:
sampler = FullVideoSampler()
sampler.sample(20)

We can also downsample the video by setting the `frame_step` parameter:

In [None]:
sampler = FullVideoSampler(frame_step=2)
sampler.sample(20)

Frame samplers can return 3 types of indices so that the video loaders can load frames in the most efficient way. To make it easier to see which frames are being sampled, it helps to convert these representations into a list of ints:

In [None]:
from torchvideo.samplers import frame_idx_to_list

sampler = FullVideoSampler(frame_step=2)
frame_idx_to_list(sampler.sample(20))

This makes it easier to see exactly what `slice(0, 20, 2)` represents in terms of frame indices.
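
For intuition, here is a rough sketch of how such a conversion could work. It assumes the three index representations are a single `slice`, a list of ints, and a list of `slice`s; that is an assumption, not something confirmed by this notebook. The function `to_index_list` is a hypothetical stand-in and not the library's `frame_idx_to_list`.

```python
def to_index_list(frame_idx):
    """Expand a slice, a list of ints, or a list of slices into a list of ints."""
    if isinstance(frame_idx, slice):
        # Assumes an explicit, non-negative stop, as in slice(0, 20, 2).
        start = frame_idx.start or 0
        step = frame_idx.step or 1
        return list(range(start, frame_idx.stop, step))
    indices = []
    for item in frame_idx:
        if isinstance(item, slice):
            indices.extend(to_index_list(item))
        else:
            indices.append(int(item))
    return indices


print(to_index_list(slice(0, 20, 2)))
print(to_index_list([slice(0, 2), slice(10, 12)]))
print(to_index_list([3, 1, 4]))
```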

Many methods use a fixed-length clip sampled from a larger video. The `ClipSampler` implements this.

In [None]:
sampler = ClipSampler(clip_length=10, frame_step=1)
frame_idx_to_list(sampler.sample(50))

In [None]:
sampler = ClipSampler(clip_length=10, frame_step=2)
frame_idx_to_list(sampler.sample(50))
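
The idea behind clip sampling can be sketched in a few lines of plain Python. The function `sample_clip` below is hypothetical: it assumes the clip's start position is chosen uniformly at random and that `frame_step` is the gap between consecutive sampled frames. The real `ClipSampler` may differ in these details.

```python
import random


def sample_clip(video_length, clip_length, frame_step=1, rng=random):
    """Return a slice selecting `clip_length` frames spaced `frame_step` apart."""
    # Number of source frames spanned from the first to the last sampled frame.
    span = (clip_length - 1) * frame_step + 1
    if span > video_length:
        raise ValueError("video is too short for the requested clip")
    # randint is inclusive on both ends, so the clip always fits in the video.
    start = rng.randint(0, video_length - span)
    return slice(start, start + span, frame_step)


clip = sample_clip(50, clip_length=10, frame_step=2, rng=random.Random(0))
indices = list(range(50))[clip]
print(indices)
```

Passing a seeded `random.Random` instance makes the sampled clip reproducible, which is handy when debugging a data pipeline.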

Sparse sampling methods sample very few frames, spaced far apart. The `TemporalSegmentSampler` samples frames in the way described in the [TSN paper](https://arxiv.org/abs/1608.00859).

A video is split into `n` segments (`segment_count`), and a snippet `l` frames long (`snippet_length`) is sampled from each segment: at a random offset during training, and centred within the segment during testing.

In [None]:
train_sampler = TemporalSegmentSampler(segment_count=3, snippet_length=2)
frame_idx_to_list(train_sampler.sample(100))

In [None]:
test_sampler = TemporalSegmentSampler(segment_count=3, snippet_length=2, test=True)
frame_idx_to_list(test_sampler.sample(100))