🐛 Bug
The audio stream does not correspond to the visual stream when torchvision.datasets.video_utils.VideoClips is used.
To Reproduce
Steps to reproduce the behavior:
- Here are the two videos I tested on: Archive.zip
- The code to reproduce
from torchvision.io import read_video
from torchvision.datasets.video_utils import VideoClips

VIDEO_PATH = './4fpkD4A_t1s_35000_45000.mp4'
VIDEO_PATH = './small.mp4'

if __name__ == "__main__":
    print(f'I am using: {VIDEO_PATH}')
    print(f'Output using torchvision.io.read_video:')
    visual, audio, info = read_video(VIDEO_PATH, pts_unit='sec')
    print('Visual:', visual.shape, 'Audio:', audio.shape, info)
    print(f'Output using torchvision.datasets.video_utils.VideoClips:')
    vclips = VideoClips([VIDEO_PATH], clip_length_in_frames=30, frames_between_clips=30)
    for i in range(vclips.num_clips()):
        visual, audio, info, vid_idx = vclips.get_clip(i)
        print(f'Clip #{i}', 'Visual:', visual.shape, 'Audio:', audio.shape, info)
- The output I see
I am using: ./small.mp4
Output using torchvision.io.read_video:
Visual: torch.Size([166, 320, 560, 3]) Audio: torch.Size([1, 266240]) {'video_fps': 30.0, 'audio_fps': 48000}
Output using torchvision.datasets.video_utils.VideoClips:
/home/vladimir/miniconda3/envs/bug_report_video_clips/lib/python3.8/site-packages/torchvision/io/video.py:103: UserWarning: The pts_unit 'pts' gives wrong results and will be removed in a follow-up version. Please use pts_unit 'sec'.
warnings.warn(
100.0%
Clip #0 Visual: torch.Size([30, 320, 560, 3]) Audio: torch.Size([1, 86016]) {'video_fps': 30.0, 'audio_fps': 48000}
Clip #1 Visual: torch.Size([30, 320, 560, 3]) Audio: torch.Size([1, 87142]) {'video_fps': 30.0, 'audio_fps': 48000}
Clip #2 Visual: torch.Size([30, 320, 560, 3]) Audio: torch.Size([1, 87255]) {'video_fps': 30.0, 'audio_fps': 48000}
Clip #3 Visual: torch.Size([30, 320, 560, 3]) Audio: torch.Size([1, 0]) {'video_fps': 30.0, 'audio_fps': 48000}
Clip #4 Visual: torch.Size([30, 320, 560, 3]) Audio: torch.Size([1, 0]) {'video_fps': 30.0, 'audio_fps': 48000}
I am using: ./4fpkD4A_t1s_35000_45000.mp4
Output using torchvision.io.read_video:
Visual: torch.Size([300, 720, 1280, 3]) Audio: torch.Size([2, 440320]) {'video_fps': 30.0, 'audio_fps': 44100}
Output using torchvision.datasets.video_utils.VideoClips:
/home/vladimir/miniconda3/envs/bug_report_video_clips/lib/python3.8/site-packages/torchvision/io/video.py:103: UserWarning: The pts_unit 'pts' gives wrong results and will be removed in a follow-up version. Please use pts_unit 'sec'.
warnings.warn(
100.0%
Clip #0 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 14336]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #1 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 15360]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #2 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 15360]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #3 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 15360]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #4 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 15360]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #5 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 15360]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #6 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 15360]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #7 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 15360]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #8 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 15360]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #9 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 15360]) {'video_fps': 30.0, 'audio_fps': 44100}
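As a sanity check on the visual side, the clip counts in the output above (5 clips for 166 frames, 10 clips for 300 frames) are what non-overlapping 30-frame windows would give. The sketch below assumes clips are formed as sliding windows over the frame indices; that is my assumption about how VideoClips splits a video, not something taken from its documentation, and `num_clips` here is a hypothetical helper rather than the library method.

```python
def num_clips(num_frames, clip_length_in_frames, frames_between_clips):
    """Count sliding windows of `clip_length_in_frames` frames taken every
    `frames_between_clips` frames (assumed windowing scheme, see lead-in)."""
    if num_frames < clip_length_in_frames:
        return 0
    return (num_frames - clip_length_in_frames) // frames_between_clips + 1


# Frame counts are taken from the read_video output above.
for name, num_frames in [("small.mp4", 166), ("4fpkD4A_t1s_35000_45000.mp4", 300)]:
    print(name, num_clips(num_frames, clip_length_in_frames=30, frames_between_clips=30))
```

Under that assumption the number of clips is consistent with the visual stream, so the mismatch appears to be limited to the audio.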
Expected behavior
- The output of torchvision.io.read_video is as expected; I provide it here for reference. The visual streams returned from torchvision.datasets.video_utils.VideoClips are also fine.
- I expect the audio returned by torchvision.datasets.video_utils.VideoClips().get_clip() to have a comparable number of samples, i.e. 48k or 44.1k for 1 second of 30 fps video. Instead, it returns either more samples than expected or just a fraction of them. Specifically, for ./small.mp4 it returns ~87k samples in each of the first three clips and 0 in the last two (expected 48k per clip), while for 4fpkD4A_t1s_35000_45000.mp4 it returns ~14k for the first clip and ~15k for each of the rest (expected 44.1k per clip). The latter does not even reach the expected 440k samples for the whole 10 s video. Similarly, the former totals (86016 + 87142 + 87255) = 260413 samples, which does not match the 266240 loaded by torchvision.io.read_video.
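The arithmetic behind the expectation can be spelled out in a few lines. This is only a sketch of how I arrive at the expected numbers, assuming a clip's audio should span the same duration as its frames; `expected_audio_samples` is a hypothetical helper, and the observed values are copied from the output above.

```python
def expected_audio_samples(clip_length_in_frames, video_fps, audio_fps):
    """Audio samples covering the same duration as the clip's video frames."""
    clip_duration_sec = clip_length_in_frames / video_fps
    return round(clip_duration_sec * audio_fps)


# Per-clip audio lengths and read_video totals copied from the output above.
observed = {
    "small.mp4": {
        "video_fps": 30.0,
        "audio_fps": 48000,
        "read_video_total": 266240,
        "clips": [86016, 87142, 87255, 0, 0],
    },
    "4fpkD4A_t1s_35000_45000.mp4": {
        "video_fps": 30.0,
        "audio_fps": 44100,
        "read_video_total": 440320,
        "clips": [14336] + [15360] * 9,
    },
}

for name, d in observed.items():
    per_clip = expected_audio_samples(30, d["video_fps"], d["audio_fps"])
    total = sum(d["clips"])
    print(f"{name}: expected {per_clip} samples per clip, got {d['clips']}")
    print(f"  sum over clips = {total}, read_video total = {d['read_video_total']}")
```

Neither the per-clip lengths nor the sums over all clips line up with what read_video returns for the same file.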
Environment
Collecting environment information...
PyTorch version: 1.5.1
Is debug build: No
CUDA used to build PyTorch: 10.2
OS: Ubuntu 18.04.4 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.10.2
Python version: 3.8
Is CUDA available: Yes
CUDA runtime version: Could not collect
Nvidia driver version: 440.44
cuDNN version: Could not collect
Versions of relevant libraries:
[pip3] numpy==1.18.5
[pip3] torch==1.5.1
[pip3] torchvision==0.6.0a0+35d732a
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.2.89 hfd86e86_1
[conda] mkl 2020.1 217
[conda] mkl-service 2.3.0 py38he904b0f_0
[conda] mkl_fft 1.1.0 py38h23d657b_0
[conda] mkl_random 1.1.1 py38h0573a6f_0
[conda] numpy 1.18.5 py38ha1c710e_0
[conda] numpy-base 1.18.5 py38hde5b4d6_0
[conda] pytorch 1.5.1 py3.8_cuda10.2.89_cudnn7.6.5_0 pytorch
[conda] torchvision 0.6.1 py38_cu102 pytorch
Additional context
Currently, VideoClips is not documented on the website, so my misunderstanding might stem from the missing documentation.