VideoClips: audio clips do not correspond to video clips #2474

@v-iashin

Description

🐛 Bug

The audio stream does not correspond to the visual stream when torchvision.datasets.video_utils.VideoClips is used.

To Reproduce

Steps to reproduce the behavior:

  1. Here are the two videos I tested on: Archive.zip
  2. The code to reproduce
from torchvision.io import read_video
from torchvision.datasets.video_utils import VideoClips

# Two test videos; the last assignment wins, so comment out one line to switch.
VIDEO_PATH = './4fpkD4A_t1s_35000_45000.mp4'
VIDEO_PATH = './small.mp4'

if __name__ == "__main__":
    print(f'I am using: {VIDEO_PATH}')
    print('Output using torchvision.io.read_video:')
    visual, audio, info = read_video(VIDEO_PATH, pts_unit='sec')
    print('Visual:', visual.shape, 'Audio:', audio.shape, info)
    
    print('Output using torchvision.datasets.video_utils.VideoClips:')
    vclips = VideoClips([VIDEO_PATH], clip_length_in_frames=30, frames_between_clips=30)
    for i in range(vclips.num_clips()):
        visual, audio, info, vid_idx = vclips.get_clip(i)
        print(f'Clip #{i}', 'Visual:', visual.shape, 'Audio:', audio.shape, info)
  3. The output I see
I am using: ./small.mp4
Output using torchvision.io.read_video:
Visual: torch.Size([166, 320, 560, 3]) Audio: torch.Size([1, 266240]) {'video_fps': 30.0, 'audio_fps': 48000}
Output using torchvision.datasets.video_utils.VideoClips:
/home/vladimir/miniconda3/envs/bug_report_video_clips/lib/python3.8/site-packages/torchvision/io/video.py:103: UserWarning: The pts_unit 'pts' gives wrong results and will be removed in a follow-up version. Please use pts_unit 'sec'.
  warnings.warn(
100.0%
Clip #0 Visual: torch.Size([30, 320, 560, 3]) Audio: torch.Size([1, 86016]) {'video_fps': 30.0, 'audio_fps': 48000}
Clip #1 Visual: torch.Size([30, 320, 560, 3]) Audio: torch.Size([1, 87142]) {'video_fps': 30.0, 'audio_fps': 48000}
Clip #2 Visual: torch.Size([30, 320, 560, 3]) Audio: torch.Size([1, 87255]) {'video_fps': 30.0, 'audio_fps': 48000}
Clip #3 Visual: torch.Size([30, 320, 560, 3]) Audio: torch.Size([1, 0]) {'video_fps': 30.0, 'audio_fps': 48000}
Clip #4 Visual: torch.Size([30, 320, 560, 3]) Audio: torch.Size([1, 0]) {'video_fps': 30.0, 'audio_fps': 48000}
I am using: ./4fpkD4A_t1s_35000_45000.mp4
Output using torchvision.io.read_video:
Visual: torch.Size([300, 720, 1280, 3]) Audio: torch.Size([2, 440320]) {'video_fps': 30.0, 'audio_fps': 44100}
Output using torchvision.datasets.video_utils.VideoClips:
/home/vladimir/miniconda3/envs/bug_report_video_clips/lib/python3.8/site-packages/torchvision/io/video.py:103: UserWarning: The pts_unit 'pts' gives wrong results and will be removed in a follow-up version. Please use pts_unit 'sec'.
  warnings.warn(
100.0%
Clip #0 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 14336]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #1 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 15360]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #2 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 15360]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #3 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 15360]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #4 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 15360]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #5 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 15360]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #6 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 15360]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #7 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 15360]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #8 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 15360]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #9 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 15360]) {'video_fps': 30.0, 'audio_fps': 44100}

Expected behavior

  1. The output of torchvision.io.read_video is as expected; I provide it here for reference. The visual streams returned by torchvision.datasets.video_utils.VideoClips are also fine.
  2. I expect each clip returned by torchvision.datasets.video_utils.VideoClips().get_clip() to contain roughly one second of audio, i.e. about 48k or 44.1k samples for 30 frames of 30 fps video. Instead, it returns either more samples than expected or only a fraction of them. Specifically, for ./small.mp4 it returns ~87k samples in each of the first three clips and 0 in the last two (expected 48k per clip), while for 4fpkD4A_t1s_35000_45000.mp4 it returns ~14k samples for the first clip and ~15k for each of the rest (expected 44.1k per clip). The latter does not even reach the expected ~440k samples for the whole 10 s video: the ten clips total 14336 + 9 × 15360 = 152576. Similarly, the former totals 86016 + 87142 + 87255 = 260413, which does not match the 266240 samples loaded by torchvision.io.read_video.
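The arithmetic behind these expectations can be written out as a small check. This is only a sketch of the reasoning: the helper name `expected_audio_samples` is made up for illustration, and the numbers are copied from the output above rather than read from the files.

```python
def expected_audio_samples(clip_length_in_frames, video_fps, audio_fps):
    """Audio samples covering the same duration as the given number of video frames."""
    clip_seconds = clip_length_in_frames / video_fps
    return round(clip_seconds * audio_fps)


# Values copied from the printed output above (not re-measured here).
observed = {
    './small.mp4': {
        'audio_fps': 48000,
        'read_video_total': 266240,
        'clips': [86016, 87142, 87255, 0, 0],
    },
    './4fpkD4A_t1s_35000_45000.mp4': {
        'audio_fps': 44100,
        'read_video_total': 440320,
        'clips': [14336] + [15360] * 9,
    },
}

for path, stats in observed.items():
    per_clip = expected_audio_samples(30, 30.0, stats['audio_fps'])
    print(path)
    print('  expected samples per clip:', per_clip)
    print('  observed samples per clip:', stats['clips'])
    print('  sum over clips:', sum(stats['clips']),
          'vs read_video total:', stats['read_video_total'])
```

For both videos the per-clip counts differ from the expected value, and the sum over clips differs from the total reported by read_video.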

Environment

Collecting environment information...
PyTorch version: 1.5.1
Is debug build: No
CUDA used to build PyTorch: 10.2

OS: Ubuntu 18.04.4 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.10.2

Python version: 3.8
Is CUDA available: Yes
CUDA runtime version: Could not collect

Nvidia driver version: 440.44
cuDNN version: Could not collect

Versions of relevant libraries:
[pip3] numpy==1.18.5
[pip3] torch==1.5.1
[pip3] torchvision==0.6.0a0+35d732a
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               10.2.89              hfd86e86_1  
[conda] mkl                       2020.1                      217  
[conda] mkl-service               2.3.0            py38he904b0f_0  
[conda] mkl_fft                   1.1.0            py38h23d657b_0  
[conda] mkl_random                1.1.1            py38h0573a6f_0  
[conda] numpy                     1.18.5           py38ha1c710e_0  
[conda] numpy-base                1.18.5           py38hde5b4d6_0  
[conda] pytorch                   1.5.1           py3.8_cuda10.2.89_cudnn7.6.5_0    pytorch
[conda] torchvision               0.6.1                py38_cu102    pytorch

Additional context

Currently, VideoClips is not documented on the website, so this may be a misunderstanding on my part caused by the missing documentation.
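To make explicit what I mean by audio "corresponding" to a clip, here is a minimal sketch of the audio sample range I would expect for each clip. It assumes a constant frame rate, that both streams start at time 0, and that clips are laid out every frames_between_clips frames; the helper `clip_audio_ranges` is hypothetical and is not part of torchvision, and it may not reflect how VideoClips actually selects frames.

```python
def clip_audio_ranges(num_frames, video_fps, audio_fps,
                      clip_length_in_frames, frames_between_clips):
    """Return (start_sample, end_sample) per clip, assuming both streams start at t=0."""
    ranges = []
    start_frame = 0
    # Only full-length clips are produced, matching the clip counts in the output above.
    while start_frame + clip_length_in_frames <= num_frames:
        end_frame = start_frame + clip_length_in_frames
        start_sample = round(start_frame / video_fps * audio_fps)
        end_sample = round(end_frame / video_fps * audio_fps)
        ranges.append((start_sample, end_sample))
        start_frame += frames_between_clips
    return ranges


# ./small.mp4: 166 frames at 30 fps, audio at 48 kHz (values from the output above).
ranges = clip_audio_ranges(166, 30.0, 48000, 30, 30)
for i, (start, end) in enumerate(ranges):
    print(f'Clip #{i}: audio samples [{start}, {end})')

# With the audio tensor returned by read_video (shape [channels, samples]),
# the matching slice for a clip would be audio[:, start:end].
```

Under these assumptions each clip of ./small.mp4 would span exactly 48000 samples, which is the behavior I expected from get_clip.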
