### This notebook explores what we have in the DFDC full train set metadata and json files.

The dataset for this notebook is at
https://www.kaggle.com/zaharch/train-set-metadata-for-dfdc

The train data for this competition is big, almost 500 GB, so I hope it is useful to have all the json files and the metadata in one dataframe.

The dataset includes, for each video file:
1. Info from the json files: **filename**, **folder**, **label**, **original**
2. **split**: train (118346 videos), public validation test (400 videos) or train sample (400 videos), 119146 videos in total. Note that the public validation and the train sample are subsets of the full train, so it is enough to mark them in this dataframe.
3. **md5**: hash of the full file
4. **wav.hash**: hash of the audio sequence, and **pxl.hash**: hash of a subset of pixels
5. The rest are metadata fields from the files, obtained with ffprobe. Note that I removed many columns that gave no new information.
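
As an illustration of the **md5** column, here is a minimal sketch of how a full-file MD5 could be computed with the standard library. The helper name `file_md5` and the chunk size are assumptions made for this example; it is not necessarily how the column in the dataset was produced.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path` (hypothetical helper)."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        # Read fixed-size chunks so a large video never has to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files with the same digest are byte-for-byte identical for all practical purposes, which is why duplicated **md5** values point to identical files.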

In [None]:
import numpy as np
import pandas as pd

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

In [None]:
data = pd.read_csv('/kaggle/input/train-set-metadata-for-dfdc/metadata', low_memory=False)

# Hashes

Fakes always have at least some pixel-level changes: no fake shares its **pxl.hash** with its original. That means that all audio fakes are also video fakes.

In [None]:
(data['pxl.hash'] == data['pxl.hash.orig']).value_counts()
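
The claim about audio fakes can be checked directly by comparing both hashes against the original's. The sketch below uses a toy dataframe with made-up hash values, and it assumes a `wav.hash.orig` column analogous to `pxl.hash.orig`; that column name is an assumption and may not exist in the real metadata.

```python
import pandas as pd

# Toy stand-in for fake videos only; all values here are invented.
toy = pd.DataFrame({
    'filename': ['a.mp4', 'b.mp4', 'c.mp4'],
    'pxl.hash': ['p1', 'p2', 'p3'],
    'pxl.hash.orig': ['p0', 'p0', 'p9'],
    'wav.hash': ['w0', 'w1', 'w9'],
    'wav.hash.orig': ['w0', 'w0', 'w9'],  # assumed column name
})

pixels_changed = toy['pxl.hash'] != toy['pxl.hash.orig']
audio_changed = toy['wav.hash'] != toy['wav.hash.orig']

# Rows where the audio changed but the pixels did not.
# An empty result means every audio fake is also a video fake.
audio_only = toy[audio_changed & ~pixels_changed]

print(pd.crosstab(pixels_changed, audio_changed,
                  rownames=['pixels_changed'], colnames=['audio_changed']))
```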

There are duplicates in all three of **md5**, **pxl.hash** and **wav.hash**. Duplicates in **wav.hash** are OK, but duplicates in **md5** mean that there are identical files in the dataset.

In [None]:
data['md5'].value_counts().value_counts().head()

In [None]:
data['wav.hash'].value_counts().value_counts().head()

In [None]:
data['pxl.hash'].value_counts().value_counts().head()
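
The double `value_counts()` used above can be hard to read at first: the inner call counts rows per hash, and the outer call counts how many hashes have each group size. A small sketch on an invented series shows the idea, along with `duplicated(keep=False)` as one way to pull out the duplicated rows themselves.

```python
import pandas as pd

# Invented hash values: h2 appears twice, h3 three times.
hashes = pd.Series(['h1', 'h2', 'h2', 'h3', 'h3', 'h3', 'h4'])

# Rows per hash value.
per_hash = hashes.value_counts()

# Number of hash values per group size (index = group size).
group_sizes = per_hash.value_counts()

# All rows whose hash occurs more than once.
dupe_rows = hashes[hashes.duplicated(keep=False)]

print(group_sizes.sort_index())
print(dupe_rows)
```

A group size of 1 means the hash is unique; any larger group size indicates duplicates.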

# Other fields

This is what the data looks like:

In [None]:
data.head()

In [None]:
data.shape

In [None]:
data.label.value_counts()

In [None]:
data.split.value_counts()

In [None]:
# Values of 'original' that are not filenames in the dataset ('NAN' itself is expected here)
set(data.original) - set(data.filename)

In [None]:
# Videos without an original (original == 'NAN') that no other video refers to
set(data.loc[data.original == 'NAN', 'filename']) - set(data.original)

In [None]:
# Distribution of the number of videos derived from each original
data.loc[data.original != 'NAN', 'original'].value_counts().hist(bins=40)

In [None]:
data.loc[data.original != 'NAN', 'original'].value_counts().value_counts().head()

In [None]:
# Cross-tabulate every column against the label; high-cardinality columns give very long tables
for col in data.columns:
    print(pd.crosstab(data[col], data['label']))

In [None]:
pd.crosstab(data['video.@display_aspect_ratio'], data['label'])

In [None]:
pd.crosstab([data['video.@display_aspect_ratio'], data['audio.@codec_time_base']], data['label'])