Conversation

@datumbox (Contributor) commented Jul 4, 2022

The cache path currently used by the video references doesn't factor in parameters that affect the stored data (such as clip length, Kinetics version and frame rate). This means that every time we switch to a different model, we must invalidate the cache. This PR avoids that by including the specific parameters in the SHA-1 hash used for caching.
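
For context, a minimal sketch of what a parameter-aware cache path can look like. The helper name _get_cache_path, the h[:10] truncation and the exact path layout are assumptions based on the reference training script and the log below, not necessarily the merged code verbatim:

import hashlib
import os
from argparse import Namespace

def _get_cache_path(filepath, args):
    # Mix the parameters that affect the stored clips into the hashed value, so that
    # changing clip length, Kinetics version or frame rate produces a different cache file.
    value = f"{filepath}-{args.clip_len}-{args.kinetics_version}-{args.frame_rate}"
    h = hashlib.sha1(value.encode()).hexdigest()
    cache_path = os.path.join("~", ".torch", "vision", "datasets", "kinetics", h[:10] + ".pt")
    return os.path.expanduser(cache_path)

# Illustrative values only; a different clip_len (or version/frame rate) yields a new file name.
args = Namespace(clip_len=16, kinetics_version="400", frame_rate=15)
print(_get_cache_path("/data/kinetics/train", args))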

Proof this works:

Loading data
Loading training data
It is recommended to pre-compute the dataset cache on a single-gpu first, as it will be faster
100%|██████████| 1205/1205 [03:21<00:00,  5.97it/s]
./vision/torchvision/datasets/video_utils.py:223: UserWarning: There aren't enough frames in the current video to get a clip for the given clip length and frames between clips. The video (and potentially others) will be skipped.
  warnings.warn(
Saving dataset_train to ~/.torch/vision/datasets/kinetics/9ff8f5833e.pt
Took 203.63010907173157
Loading validation data
Loading dataset_test from ~/.torch/vision/datasets/kinetics/9ff8f5833e.pt

@YosuaMichael (Contributor) left a comment

Thanks @datumbox for this PR! This is definitely a pain point during experimentation, and quite often the cache can lead to weird bugs if we are not aware of it.

I only have one small comment about adding step_between_clips; other than that it looks good.

import hashlib

value = f"{filepath}-{args.clip_len}-{args.kinetics_version}-{args.frame_rate}"
h = hashlib.sha1(value.encode()).hexdigest()

@YosuaMichael (Contributor) commented on this snippet:

Should we also add args.step_between_clips here?

@datumbox (Contributor, Author) replied:

We don't have such a parameter on args. If we introduce it, we should absolutely add it here.
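
Purely for illustration: if a step_between_clips argument were introduced as discussed, the hashed value would presumably be extended along these lines (args.step_between_clips is hypothetical; it does not exist on args today, and the other values are made up):

from argparse import Namespace

# Hypothetical args including the discussed step_between_clips field.
args = Namespace(clip_len=16, kinetics_version="400", frame_rate=15, step_between_clips=1)
filepath = "/data/kinetics/train"
value = f"{filepath}-{args.clip_len}-{args.kinetics_version}-{args.frame_rate}-{args.step_between_clips}"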

@YosuaMichael (Contributor) replied:

Got it

@datumbox merged commit 8f98aee into pytorch:main Jul 5, 2022
@datumbox deleted the references/param_cache branch July 5, 2022 08:34
facebook-github-bot pushed a commit that referenced this pull request Jul 6, 2022
Summary:
* Update the dataset cache to factor in parameters from the args.

* Fix linter

Reviewed By: jdsgomes

Differential Revision: D37643907

fbshipit-source-id: e0590c3e3b596dd3e12041c095c4b961941e0d38
3 participants