Code for paper "IntPhys: A Benchmark and Dataset for Intuitive Physics".

IntPhys - Baseline

Code accompanying the paper "IntPhys: A Benchmark and Dataset for Intuitive Physics". This repository contains forward frame prediction models, which predict future images in synthetic videos. These predictions are either raw images or semantic masks. At test time, the models are used to predict whether a video is physically plausible.


The training, dev and test sets can be found here. The training set consists of 15,000 videos of 100 frames each, showing object interactions and designed with UnrealEngine. It contains metadata (depth, masks, positions of objects...). The dev set consists of 360 videos, half of which show impossible events (e.g. an object disappearing). It contains the same kind of metadata (depth, masks, positions of objects...), as well as the true label (possible or impossible). The test set consists of 3,600 videos, half of which show impossible events (e.g. an object disappearing). It contains no metadata and no labels. Predictions can be submitted here; a leaderboard keeps track of challengers' performance.

Train samples

Train samples are always physically possible and have high variability

Test and Dev samples

Test and Dev samples have constrained variability and come as quadruplets: 2 possible cases and 2 impossible ones.


Each video comes with its associated depth field and object masks (each object has a unique id), along with a detailed status in JSON format.
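Since each object has a unique id in the masks, a mask frame can be split into one binary mask per object. The sketch below is illustrative only: it assumes the mask has already been loaded as a 2D integer NumPy array and that id 0 is the background, neither of which is stated above. `split_object_masks` is a hypothetical helper, not part of this repository.

```python
import numpy as np


def split_object_masks(mask, background=0):
    """Return {object_id: boolean mask} for every id found in `mask`.

    Assumption: `mask` is a 2D integer array and `background` is the id
    used for pixels that belong to no object.
    """
    ids = np.unique(mask)
    return {int(i): mask == i for i in ids if i != background}


# Toy 3x3 mask with two objects (ids 1 and 2).
mask = np.array([[0, 0, 1],
                 [2, 2, 1],
                 [2, 2, 0]])
objects = split_object_masks(mask)
print(sorted(objects))        # [1, 2]
print(int(objects[2].sum()))  # 4
```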


  • Python 3.5
  • PyTorch
  • Recommended: NVIDIA GPU. CPU only is supported but very slow.
  • Optional: Visdom for visualization.


Each model is given an input sequence and a target sequence, specified by the options --input_seq and --target_seq. These options are patterns used by the dataloader to create inputs and targets. For example, from --input_seq 1 3 --target_seq 8 the dataloader will return all triplets [1, 3 -> 8], [2, 4 -> 9], ..., [93, 95 -> 100] from every video, in batches specified by the option --bsz.
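The way a pattern slides over a video can be sketched as follows. This is a minimal illustration of the example above, not the repository's dataloader; `frame_patterns` is a hypothetical name, and 1-based frame indices with 100 frames per video are assumed.

```python
def frame_patterns(input_seq, target_seq, n_frames=100):
    """Slide the (input_seq, target_seq) pattern over a video of n_frames frames.

    Frame indices are 1-based, as in the example above.
    """
    span = max(input_seq + target_seq)         # last frame touched by the pattern
    samples = []
    for offset in range(n_frames - span + 1):  # shift until the pattern reaches the last frame
        inputs = [i + offset for i in input_seq]
        targets = [t + offset for t in target_seq]
        samples.append((inputs, targets))
    return samples


samples = frame_patterns([1, 3], [8])
print(samples[0])    # ([1, 3], [8])
print(samples[-1])   # ([93, 95], [100])
print(len(samples))  # 93
```

With --input_seq 1 --target_seq 1 the same logic yields one sample per frame, i.e. 100 samples per video.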

Three models:

resnet_ae: pretrained resnet-18 followed by a deconvolution network.

gan: generative adversarial network as described in the paper.

linear_rnn: recurrent neural network applied to an encoded representation of a frame. This is a beta version, not presented in the paper.


The dataloader uses .npy lists containing absolute paths to all videos. Scripts and create those lists.
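For illustration, such a list could be built along the following lines. This is a sketch under assumptions, not the repository's own script: `build_video_list` is a hypothetical helper, and it assumes each video is stored as one sub-directory directly under a root folder.

```python
import os

import numpy as np


def build_video_list(root, out_file):
    """Save the absolute paths of all sub-directories of `root` to a .npy file.

    Assumption: every sub-directory of `root` holds exactly one video.
    """
    videos = sorted(
        os.path.abspath(os.path.join(root, name))
        for name in os.listdir(root)
        if os.path.isdir(os.path.join(root, name))
    )
    np.save(out_file, np.array(videos))
    return videos
```

The resulting file can be read back with `np.load(out_file)`, which returns an array of path strings.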


Train a mask predictor only:

python --verbose --image_save --model Resnet_ae --input scene --target mask --input_seq 1 --target_seq 1

Train a forward model:

python --verbose --image_save --model Resnet_ae --input scene --target mask --input_seq 1 3 --target_seq 8

Train a GAN model:

python --verbose --image_save --input scene --model Gan --target mask --input_seq 1 3 --target_seq 8

Train a GAN on predicted masks instead of ground-truth masks (so that testing can be done on raw videos):

python --verbose --image_save --model Gan --input scene --target scene --input_seq 1 3 --target_seq 8 --maskPredictor path/to/trained/maskPredictor.pth

For GPU usage, add the option --gpu. For visualization with Visdom, add the option --visdom.

A few notes

Given the size of the training set, one may want to save more than one checkpoint per epoch; this can be done with the option --n_slices (--n_slices 3 will save 3 checkpoints per epoch).
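To make the effect of --n_slices concrete, the sketch below computes where checkpoints would fall if an epoch were cut into evenly spaced slices. How the repository actually splits an epoch is not stated here, so the even spacing, the helper name `checkpoint_steps` and the batch count of 900 are all assumptions.

```python
def checkpoint_steps(n_batches, n_slices):
    """Return the 1-based batch indices after which a checkpoint is saved.

    Assumption: slices are evenly spaced and the last one ends the epoch.
    """
    return [(k * n_batches) // n_slices for k in range(1, n_slices + 1)]


# Hypothetical epoch of 900 batches with --n_slices 3.
steps = checkpoint_steps(900, 3)
print(steps)  # [300, 600, 900]
```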