# YouTube 8M video-level dataset
> Prototyping with video-level features

- toc: true 
- badges: true
- comments: true
- categories: [youtube 8m, video data, tensorflow]

Play around with the [YouTube 8M](https://research.google.com/youtube8m/index.html) video-level dataset.

## Requirements

Use TensorFlow 2.6.0 or higher.

In [1]:
import tensorflow as tf

In [2]:
print(tf.__version__)

2.6.0
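The version requirement can also be checked in code. The sketch below is one way to do it with plain Python; `version_tuple` and `meets_minimum` are hypothetical helper names introduced here, not part of TensorFlow. Comparing tuples of integers avoids the pitfall of comparing version strings directly, where `"2.10.1"` would sort before `"2.6.0"`.

```python
def version_tuple(version):
    """Turn a version string such as '2.6.0' or '2.7.0-rc1' into a tuple of ints."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break  # stop at suffixes such as '-rc1'
            digits += ch
        parts.append(int(digits or 0))
    return tuple(parts)


def meets_minimum(version, minimum="2.6.0"):
    """Return True if `version` is at least `minimum` (hypothetical helper)."""
    return version_tuple(version) >= version_tuple(minimum)


print(meets_minimum("2.6.0"))   # True
print(meets_minimum("2.10.1"))  # True
print(meets_minimum("2.5.3"))   # False
```

In the notebook this would be called as `meets_minimum(tf.__version__)`.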


## Load data

The goal of this section is to create a `tf.data.Dataset` from a set of `.tfrecord` files. The sample data were downloaded with

```
curl data.yt8m.org/download.py | shard=1,100 partition=2/video/train mirror=us python
```

following the instructions on the YouTube 8M dataset [download page](https://research.google.com/youtube8m/download.html).

### Load raw dataset

Import libraries and specify `data_folder`.

In [4]:
import os
import glob
from tensorflow.data import TFRecordDataset

In [12]:
data_folder = "/home/default/video"

List `.tfrecord` files to be loaded.

In [14]:
filenames = glob.glob(os.path.join(data_folder, "*.tfrecord"))
print(filenames[0])
print(filenames[-1])

/home/default/video/train0093.tfrecord
/home/default/video/train3749.tfrecord
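`glob.glob` returns files in an arbitrary, filesystem-dependent order. If a reproducible order matters, the list can be sorted. The sketch below shows this on a temporary folder with empty placeholder files; `demo_folder` and the file names in it are made up for illustration and are not part of the dataset.

```python
import glob
import os
import tempfile

# Create a throwaway folder with two fake shard files and one unrelated file.
demo_folder = tempfile.mkdtemp()
for name in ["train3749.tfrecord", "train0093.tfrecord", "notes.txt"]:
    open(os.path.join(demo_folder, name), "w").close()

# Only *.tfrecord files match the pattern; sorted() makes the order deterministic.
demo_filenames = sorted(glob.glob(os.path.join(demo_folder, "*.tfrecord")))
print([os.path.basename(f) for f in demo_filenames])
# ['train0093.tfrecord', 'train3749.tfrecord']
```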


Load `.tfrecord` files into a raw (not parsed) dataset.

In [15]:
raw_dataset = tf.data.TFRecordDataset(filenames)
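
Each element of a raw `TFRecordDataset` is a scalar string tensor holding one serialized record. Before writing a parser, it can help to decode a single record with `tf.train.Example.FromString` and look at its feature keys. The sketch below does this on a tiny record written to a temporary file, so it runs without the real data; the file `toy.tfrecord` and its contents are invented for illustration.

```python
import os
import tempfile

import tensorflow as tf

# Write one toy record to a temporary .tfrecord file (stand-in for a real shard).
tmp_path = os.path.join(tempfile.mkdtemp(), "toy.tfrecord")
toy = tf.train.Example(features=tf.train.Features(feature={
    "id": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"toy0"])),
    "labels": tf.train.Feature(int64_list=tf.train.Int64List(value=[0, 12])),
}))
with tf.io.TFRecordWriter(tmp_path) as writer:
    writer.write(toy.SerializeToString())

# Read it back as raw bytes and decode the protocol buffer by hand.
toy_raw_dataset = tf.data.TFRecordDataset([tmp_path])
for raw_record in toy_raw_dataset.take(1):
    decoded = tf.train.Example.FromString(raw_record.numpy())
    feature_keys = sorted(decoded.features.feature.keys())
    print(feature_keys)
```

The same loop applied to `raw_dataset` should list the feature keys stored in the downloaded shards.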

### Parse raw dataset

Create a function to parse the raw data. According to the YouTube 8M dataset [download section](https://research.google.com/youtube8m/download.html), the video-level data are stored as `tensorflow.Example` protocol buffers with the following text format:

```
features: {
  feature: {
    key  : "id"
    value: {
      bytes_list: {
        value: (Video id)
      }
    }
  }
  feature: {
    key  : "labels"
    value: {
      int64_list: {
        value: [1, 522, 11, 172]  # label list
      }
    }
  }
  feature: {
    # Average of all 'rgb' features for the video
    key  : "mean_rgb"
    value: {
      float_list: {
        value: [1024 float features]
      }
    }
  }
  feature: {
    # Average of all 'audio' features for the video
    key  : "mean_audio"
    value: {
      float_list: {
        value: [128 float features]
      }
    }
  }
}
```
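
To make the layout concrete, a record with the same four features can be built by hand with `tf.train.Example`. This is only a sketch for experimentation: `make_example` is a hypothetical helper, and the id and zero-valued feature vectors are placeholders rather than real dataset values.

```python
import tensorflow as tf


def make_example(video_id, labels, mean_rgb, mean_audio):
    """Build a tf.train.Example mirroring the video-level layout (hypothetical helper)."""
    feature = {
        "id": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[video_id.encode("utf-8")])),
        "labels": tf.train.Feature(
            int64_list=tf.train.Int64List(value=labels)),
        "mean_rgb": tf.train.Feature(
            float_list=tf.train.FloatList(value=mean_rgb)),
        "mean_audio": tf.train.Feature(
            float_list=tf.train.FloatList(value=mean_audio)),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))


# Placeholder values: a made-up id, the label list from the text format above,
# and all-zero vectors of the documented lengths (1024 and 128).
example = make_example("abcd", [1, 522, 11, 172], [0.0] * 1024, [0.0] * 128)
serialized = example.SerializeToString()  # bytes, as stored in a .tfrecord file
```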

In [56]:
# Create a description of the features.
feature_description = {
    # Video id: a single byte string.
    'id': tf.io.FixedLenFeature([1], tf.string, default_value=''),
    # Labels: a variable-length list of int64 class ids.
    'labels': tf.io.FixedLenSequenceFeature([], tf.int64, default_value=0, allow_missing=True),
    # Video-level averages of the audio (128 floats) and RGB (1024 floats) features.
    'mean_audio': tf.io.FixedLenFeature([128], tf.float32, default_value=[0.0] * 128),
    'mean_rgb': tf.io.FixedLenFeature([1024], tf.float32, default_value=[0.0] * 1024),
}

def _parse_function(example_proto):
  # Parse the input `tf.train.Example` proto using the dictionary above.
  return tf.io.parse_single_example(example_proto, feature_description)

In [57]:
parsed_dataset = raw_dataset.map(_parse_function)
parsed_dataset

<MapDataset shapes: {id: (1,), labels: (None,), mean_audio: (128,), mean_rgb: (1024,)}, types: {id: tf.string, labels: tf.int64, mean_audio: tf.float32, mean_rgb: tf.float32}>
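
The `labels: (None,)` shape above means the number of labels varies from record to record. The sketch below illustrates this on two synthetic records parsed with a copy of the feature description; `make_serialized`, the toy ids, and the zero-valued vectors are assumptions made for the example, not dataset content.

```python
import tensorflow as tf

# Copy of the feature description used above.
feature_description = {
    'id': tf.io.FixedLenFeature([1], tf.string, default_value=''),
    'labels': tf.io.FixedLenSequenceFeature([], tf.int64, default_value=0, allow_missing=True),
    'mean_audio': tf.io.FixedLenFeature([128], tf.float32, default_value=[0.0] * 128),
    'mean_rgb': tf.io.FixedLenFeature([1024], tf.float32, default_value=[0.0] * 1024),
}


def make_serialized(video_id, labels):
    """Serialize a toy record with placeholder feature vectors (hypothetical helper)."""
    feature = {
        'id': tf.train.Feature(bytes_list=tf.train.BytesList(value=[video_id])),
        'labels': tf.train.Feature(int64_list=tf.train.Int64List(value=labels)),
        'mean_audio': tf.train.Feature(float_list=tf.train.FloatList(value=[0.0] * 128)),
        'mean_rgb': tf.train.Feature(float_list=tf.train.FloatList(value=[0.0] * 1024)),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()


# Two toy records with different numbers of labels.
serialized_records = [
    make_serialized(b"toy0", [0, 12]),
    make_serialized(b"toy1", [1, 522, 11, 172]),
]
toy_dataset = tf.data.Dataset.from_tensor_slices(serialized_records).map(
    lambda proto: tf.io.parse_single_example(proto, feature_description))

toy_labels = [record['labels'].numpy().tolist() for record in toy_dataset]
toy_ids = [record['id'].numpy()[0] for record in toy_dataset]
print(toy_labels)
```

Because the label length differs per record, a plain `batch()` call on such a dataset would fail; batching would need something like `padded_batch` or a fixed-size label encoding.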

### Check parsed dataset

In [59]:
for parsed_record in parsed_dataset.take(1):
  print(repr(parsed_record))

{'id': <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'eXbF'], dtype=object)>, 'labels': <tf.Tensor: shape=(2,), dtype=int64, numpy=array([ 0, 12])>, 'mean_audio': <tf.Tensor: shape=(128,), dtype=float32, numpy=
array([-1.2556146 ,  0.17297305,  0.53898615,  1.5446128 ,  1.4344678 ,
        0.41190457,  1.2042887 ,  0.9899097 , -0.28567997,  1.1892846 ,
        0.6182132 , -0.54916394, -0.02003632,  0.7124445 , -1.275734  ,
       -1.0121363 ,  0.8652152 ,  0.45430297, -0.5905393 , -0.8244694 ,
        0.95853716,  0.379509  , -1.1317158 ,  0.46737486,  1.3991169 ,
       -0.4367456 , -0.287044  , -0.7412639 ,  0.5608105 ,  0.9686536 ,
        0.36370906,  0.15887815,  1.1279035 , -0.08369077, -0.20577091,
       -1.467152  , -0.9784904 ,  0.44680086,  1.1796227 ,  0.14648826,
        1.3656982 ,  0.12989263, -0.9865609 , -1.2897152 ,  0.6123024 ,
        0.1184121 ,  0.49931577, -1.1900278 ,  0.0516886 ,  0.16899465,
       -1.0225939 , -0.6807922 , -1.1495618 ,  0.5336437 , -0.1