[WIP] UCF101 prototype with utilities for video loading #4838
💊 CI failures summary and remediations
As of commit f1a69e0 (more details on the Dr. CI page): 💚 Looks good so far! There are no failures yet. 💚
This comment was automatically generated by Dr. CI. Please report bugs/suggestions to the (internal) Dr. CI Users group.
Thanks a lot @bjuncek. I have some inline comments about the general infrastructure. I can't really comment on the validity of the video utility datapipes that you added, because I have too little experience with videos. I'll leave that up to other reviewers.
Co-authored-by: Philip Meier <github.pmeier@posteo.de>
…k/vision into bkorbar/prototypes/ucf101
Ok, so I've tried doing a pass on this, trying to fix the decoder inconsistency we've been talking about offline. I don't understand datapipes well enough to understand why pop from a dict would fail or why I'd need to annotate variables in a datapipe. Everything since |
def __iter__(self) -> Iterator[Dict[str, Any]]:
    for video_d in self.datapipe:
        buffer = video_d["file"]
        with av.open(buffer, metadata_errors="ignore") as container:
            stream = container.streams.video[0]
            time_base = stream.time_base

            # duration is given in time_base units as int
            duration = stream.duration

            # get video_stream timestamps
            # with a tolerance for pyav imprecision
            _ptss = torch.arange(duration - 7)
            _ptss = self._unfold(_ptss)
            # shuffle the clips
            perm = torch.randperm(_ptss.size(0))
            idx = perm[: self.num_clips_per_video]
            samples = _ptss[idx]

            for clip_pts in samples:
                start_pts = clip_pts[0].item()
                end_pts = clip_pts[-1].item()
                # video_timebase is the default time_base
                pts_unit = "pts"
                start_pts, end_pts, pts_unit = _video_opt._convert_to_sec(start_pts, end_pts, "pts", time_base)
                video_frames = video._read_from_stream(
                    container,
                    float(start_pts),
                    float(end_pts),
                    pts_unit,
                    stream,
                    {"video": 0},
                )

                vframes_list = [frame.to_ndarray(format="rgb24") for frame in video_frames]

                if vframes_list:
                    vframes = torch.as_tensor(np.stack(vframes_list))
                    # account for rounding errors in conversion
                    # FIXME: fix this in the code
                    vframes = vframes[: self.num_frames_per_clip, ...]

                else:
                    vframes = torch.empty((0, 1, 1, 3), dtype=torch.uint8)
                    print("FAIL")

                # [N,H,W,C] to [N,C,H,W]
                vframes = vframes.permute(0, 3, 1, 2)
                assert vframes.size(0) == self.num_frames_per_clip

                # TODO: support sampling rates (FPS change)
                # TODO: optimization (read all and select)

                yield {
                    "clip": vframes,
                    "pts": clip_pts,
                    "range": (start_pts, end_pts),
                    "video_meta": {
                        "time_base": float(stream.time_base),
                        "guessed_fps": float(stream.guessed_rate),
                    },
                    "path": video_d["path"],
                    "target": video_d["target"],
                }
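The index arithmetic in the hunk above (enumerate every window of consecutive pts values, shuffle, keep the first few) can be tried without a video file. The following is a minimal, dependency-free sketch, not the PR's code: `sliding_windows` and `sample_clips` are hypothetical stand-ins for `self._unfold` and the `torch.randperm` selection, whose actual implementations are not shown in this thread.

```python
import random
from typing import List


def sliding_windows(values: List[int], size: int, step: int = 1) -> List[List[int]]:
    """Return every window of `size` consecutive items, advancing by `step`.

    Hypothetical stand-in for `self._unfold`; the PR's helper may differ.
    """
    last_start = len(values) - size
    return [values[i : i + size] for i in range(0, last_start + 1, step)]


def sample_clips(duration: int, frames_per_clip: int, num_clips: int, seed: int = 0) -> List[List[int]]:
    """Pick up to `num_clips` random windows of pts values from range(duration)."""
    windows = sliding_windows(list(range(duration)), frames_per_clip)
    rng = random.Random(seed)
    rng.shuffle(windows)        # plays the role of torch.randperm
    return windows[:num_clips]  # fewer are returned if the video is short


clips = sample_clips(duration=20, frames_per_clip=8, num_clips=3)
for clip in clips:
    print(clip[0], clip[-1])  # start and end pts of each sampled clip
```

Like the hunk, this sketch treats every integer below the duration as a candidate timestamp. In streams where consecutive frames are more than one pts tick apart, a window of `k` pts values does not span `k` frames, which may be related to the trimming marked with the FIXME.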
Why not just do the following:
- sample m start positions
- for every start position, read k frames
- yield the k frames at once, m times
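The suggestion above could be sketched as a small generator. This is only an illustration under stated assumptions: `frames` is any indexable sequence of already-decoded frames, and `iter_random_clips` is a hypothetical name, not part of torchvision or this PR.

```python
import random
from typing import Iterator, List, Sequence, TypeVar

Frame = TypeVar("Frame")


def iter_random_clips(frames: Sequence[Frame], m: int, k: int, seed: int = 0) -> Iterator[List[Frame]]:
    """Sample m start positions, then yield the k frames following each one."""
    num_starts = len(frames) - k + 1
    if num_starts <= 0:
        return  # video shorter than one clip: yield nothing
    rng = random.Random(seed)
    # Draw distinct start positions; fewer than m if the video is short.
    starts = rng.sample(range(num_starts), min(m, num_starts))
    for start in starts:
        yield list(frames[start : start + k])  # k frames at once


# Integers stand in for decoded frames here.
for clip in iter_random_clips(list(range(30)), m=4, k=8):
    print(clip)
```

Yielding inside the loop keeps only one clip in memory at a time, which is the usual reason to prefer a generator over collecting all `m` clips first.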
Unless I'm missing something, this is exactly what I do:
- sample starting positions (line 132)
- for every start position (line 134) read k frames (line 140)
- yield the frames as a sample (line 168)
Are you suggesting to take the yield outside of the loop? If so, is there any benefit to this?
import numpy as np
import torch
from torchdata.datapipes.iter import IterDataPipe
from torchvision.io import video, _video_opt
I'm not sure if I would use _video_opt in here.
Sure.
Any particular reason why not?
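As a point of reference for this thread: assuming the private helper does the pts-to-seconds conversion its name suggests, that step can also be written directly as a multiplication by the stream's time base. A minimal sketch, assuming the time base is a `fractions.Fraction` (as PyAV's `stream.time_base` is); `pts_to_seconds` is a hypothetical name, not a torchvision function.

```python
from fractions import Fraction
from typing import Tuple


def pts_to_seconds(start_pts: int, end_pts: int, time_base: Fraction) -> Tuple[float, float]:
    """Convert a pts range to seconds: seconds = pts * time_base."""
    return float(start_pts * time_base), float(end_pts * time_base)


# Example: a time base of 1/12800 s per tick.
start_sec, end_sec = pts_to_seconds(6400, 12800, Fraction(1, 12800))
print(start_sec, end_sec)  # 0.5 1.0
```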
A simple pyav-based set of utilities with a POC implementation for the UCF101 dataset.
cc @pmeier @bjuncek