In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
print('directory contents:', ', '.join(os.listdir('/kaggle/input/deepfake-detection-challenge')))

# Subtract 1 from the train count because that folder also holds metadata.json.
print(
    'num train videos:', len(os.listdir('/kaggle/input/deepfake-detection-challenge/train_sample_videos/')) - 1,
    '\nnum test videos: ',  len(os.listdir('/kaggle/input/deepfake-detection-challenge/test_videos/'))
)

In [None]:
import cv2 as cv
from matplotlib import pyplot as plt
from tqdm import tqdm

In [None]:
train_dir = '/kaggle/input/deepfake-detection-challenge/train_sample_videos/'
train_video_files = [train_dir + x for x in os.listdir(train_dir) if x.endswith('.mp4')]
test_dir = '/kaggle/input/deepfake-detection-challenge/test_videos/'
test_video_files = [test_dir + x for x in os.listdir(test_dir)]

**Update:**<br>
My mistake: I thought we didn't have access to the train_sample labels, but we do. They are in a JSON file, `metadata.json`, inside the train_sample_videos folder.  

In [None]:
train_metadata = pd.read_json('/kaggle/input/deepfake-detection-challenge/train_sample_videos/metadata.json')
# read_json puts the filenames in columns; transpose so each row is one video.
train_metadata = train_metadata.T
train_metadata.head()

In [None]:
train_metadata['label'].value_counts(normalize=True)

Are about 80% of the labels FAKE in both train and test? It would be interesting to know the ratio for the full training dataset.
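
The same share can be computed without pandas. This is a minimal sketch, assuming `metadata.json` maps each filename to a record with a `label` field (as the `train_metadata['label']` column suggests). The helper `label_fractions` and the toy records are hypothetical and only for illustration:

```python
from collections import Counter


def label_fractions(metadata):
    """Return the fraction of videos carrying each label."""
    counts = Counter(record["label"] for record in metadata.values())
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Toy records, not real competition data.
toy_metadata = {
    "a.mp4": {"label": "FAKE"},
    "b.mp4": {"label": "FAKE"},
    "c.mp4": {"label": "FAKE"},
    "d.mp4": {"label": "REAL"},
}
fractions = label_fractions(toy_metadata)
print(fractions)
```

On the real file, passing the dictionary returned by `json.load` for `metadata.json` should give the same numbers as `value_counts(normalize=True)`.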

In [None]:
def show_first_frame(video_files, num_to_show=25):
    # Lay the frames out on a square grid; only root*root frames fit.
    root = max(int(num_to_show**.5), 1)
    video_files = video_files[:root * root]
    fig, axes = plt.subplots(root, root, figsize=(root*5, root*5), squeeze=False)
    for i, video_file in tqdm(enumerate(video_files), total=len(video_files)):
        ax = axes[i//root, i%root]
        fname = os.path.basename(video_file)
        cap = cv.VideoCapture(video_file)
        success, image = cap.read()
        cap.release()
        if success:
            # OpenCV decodes to BGR; matplotlib expects RGB.
            ax.imshow(cv.cvtColor(image, cv.COLOR_BGR2RGB))
        try:
            label = train_metadata.loc[fname, 'label']
            ax.set_title(f"{fname}: {label}")
        except KeyError:
            # Test videos have no entry in train_metadata.
            ax.set_title(fname)

## train videos

In [None]:
show_first_frame(train_video_files, num_to_show=25)

## test videos

In [None]:
show_first_frame(test_video_files, num_to_show=25)

There appear to be multiple videos per person in both the train_sample videos and the test videos.
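
One partial way to check this is to group the fakes by the real clip they were made from. The sketch below assumes each FAKE record in `metadata.json` carries an `original` field naming its source video; that is an assumption about the file layout, so confirm it against `train_metadata.head()` first. The helper `fakes_by_original` and the toy records are hypothetical:

```python
from collections import defaultdict


def fakes_by_original(metadata):
    """Map each source clip to the list of FAKE videos derived from it."""
    groups = defaultdict(list)
    for fname, record in metadata.items():
        original = record.get("original")  # assumed field name
        if record.get("label") == "FAKE" and original:
            groups[original].append(fname)
    return dict(groups)


# Toy records, not real competition data.
toy_metadata = {
    "fake1.mp4": {"label": "FAKE", "original": "real1.mp4"},
    "fake2.mp4": {"label": "FAKE", "original": "real1.mp4"},
    "fake3.mp4": {"label": "FAKE", "original": "real2.mp4"},
    "real1.mp4": {"label": "REAL", "original": None},
}
groups = fakes_by_original(toy_metadata)
print(groups)
```

This only links fakes that share a source clip. It would not detect the same person appearing in different source videos, which still needs a visual check or a face-matching step.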

## This test video frame looks very fake

In [None]:
fname = 'ahjnxtiamx.mp4'
fig, ax = plt.subplots(1,1, figsize=(12,12))
cap = cv.VideoCapture(test_dir + fname)
cap.set(cv.CAP_PROP_POS_FRAMES, 2)  # seek to frame index 2 (property id 1)
success, image = cap.read()
cap.release()
image = cv.cvtColor(image, cv.COLOR_BGR2RGB)  # BGR -> RGB for matplotlib

ax.imshow(image)
ax.set_title(fname)

This is just one example, but it raises a few questions: <br><br>
1) Were many different GAN architectures used to generate the dataset? (Probably yes, since many different teams collaborated to create the full dataset.)   <br><br>
2) Does it change our modelling approach if some very obvious fakes (created with older methods) are mixed in with the very realistic fakes (created with newer methods)? 

## Where's the full training data?

*Copied from the getting-started page: https://www.kaggle.com/c/deepfake-detection-challenge/overview/getting-started*

<div><div class="markdown-converter__text--rendered"><h3><strong>Datasets</strong>:</h3>
<p>There are 4 groups of datasets associated with this competition.</p>
<ol>
<li><strong>Training Set: This dataset, containing labels for the target, is available for download outside of Kaggle for competitors to build their models.</strong> It is broken up into 50 files, for ease of access and download. Due to its large size, it must be accessed through a GCS bucket which is only made available to participants after accepting the competition’s rules. Please read the rules fully before accessing the dataset, as they contain important details about the dataset’s permitted use. It is expected and encouraged that you train your models outside of Kaggle’s notebooks environment and submit to Kaggle by uploading the trained model as an external data source.</li>
<li><strong>Public Validation Set</strong>: When you commit your Kaggle notebook, the submission file output that is generated will be based on the small set of 400 videos/ids contained within this Public Validation Set. This is available on the Kaggle Data page as <code>test_videos.zip</code></li>
<li><strong>Public Test Set: This dataset is completely withheld and is what Kaggle’s platform computes the public leaderboard against.</strong> When you “Submit to Competition” from the “Output” file of a committed notebook that contains the competition’s dataset, your code will be re-run in the background against this Public Test Set. When the re-run is complete, the score will be posted to the public leaderboard. If the re-run fails, you will see an error reflected in your “My Submissions” page. Unfortunately, we are unable to surface any details about your error, so as to prevent error-probing. You are limited to 2 submissions per day, including submissions which error.</li>
<li><strong>Private Test Set: This dataset is privately held outside of Kaggle’s platform, and is used to compute the private leaderboard.</strong> It contains videos with a similar format and nature as the Training and Public Validation/Test Sets, but are real, organic videos with and without deepfakes. After the competition deadline, Kaggle transfers your 2 final selected submissions’ code to the host. They will re-run your code against this private dataset and return prediction submissions back to Kaggle for computing your final private leaderboard scores.</li>
</ol>