VideoClips: audio clips do not correspond to video clips #2474

## 🐛 Bug

The audio stream does not correspond to the visual stream when `torchvision.datasets.video_utils.VideoClips` is used.

## To Reproduce

Steps to reproduce the behavior:
1. Download the two videos I tested on: [Archive.zip](https://github.com/pytorch/vision/files/7903650/Archive.zip)
2. Run the following code:

```python
from torchvision.io import read_video
from torchvision.datasets.video_utils import VideoClips

# Pick one of the two test videos; swap the comments to test the other one.
# VIDEO_PATH = './4fpkD4A_t1s_35000_45000.mp4'
VIDEO_PATH = './small.mp4'

if __name__ == "__main__":
    print(f'I am using: {VIDEO_PATH}')

    # Reference: decode the whole file in a single call.
    print('Output using torchvision.io.read_video:')
    visual, audio, info = read_video(VIDEO_PATH, pts_unit='sec')
    print('Visual:', visual.shape, 'Audio:', audio.shape, info)

    # Split the same file into non-overlapping 30-frame clips.
    print('Output using torchvision.datasets.video_utils.VideoClips:')
    vclips = VideoClips([VIDEO_PATH], clip_length_in_frames=30, frames_between_clips=30)
    for i in range(vclips.num_clips()):
        visual, audio, info, vid_idx = vclips.get_clip(i)
        print(f'Clip #{i}', 'Visual:', visual.shape, 'Audio:', audio.shape, info)
```
3. The output I see:
```
I am using: ./small.mp4
Output using torchvision.io.read_video:
Visual: torch.Size([166, 320, 560, 3]) Audio: torch.Size([1, 266240]) {'video_fps': 30.0, 'audio_fps': 48000}
Output using torchvision.datasets.video_utils.VideoClips:
/home/vladimir/miniconda3/envs/bug_report_video_clips/lib/python3.8/site-packages/torchvision/io/video.py:103: UserWarning: The pts_unit 'pts' gives wrong results and will be removed in a follow-up version. Please use pts_unit 'sec'.
  warnings.warn(
100.0%
Clip #0 Visual: torch.Size([30, 320, 560, 3]) Audio: torch.Size([1, 86016]) {'video_fps': 30.0, 'audio_fps': 48000}
Clip #1 Visual: torch.Size([30, 320, 560, 3]) Audio: torch.Size([1, 87142]) {'video_fps': 30.0, 'audio_fps': 48000}
Clip #2 Visual: torch.Size([30, 320, 560, 3]) Audio: torch.Size([1, 87255]) {'video_fps': 30.0, 'audio_fps': 48000}
Clip #3 Visual: torch.Size([30, 320, 560, 3]) Audio: torch.Size([1, 0]) {'video_fps': 30.0, 'audio_fps': 48000}
Clip #4 Visual: torch.Size([30, 320, 560, 3]) Audio: torch.Size([1, 0]) {'video_fps': 30.0, 'audio_fps': 48000}
```

```
I am using: ./4fpkD4A_t1s_35000_45000.mp4
Output using torchvision.io.read_video:
Visual: torch.Size([300, 720, 1280, 3]) Audio: torch.Size([2, 440320]) {'video_fps': 30.0, 'audio_fps': 44100}
Output using torchvision.datasets.video_utils.VideoClips:
/home/vladimir/miniconda3/envs/bug_report_video_clips/lib/python3.8/site-packages/torchvision/io/video.py:103: UserWarning: The pts_unit 'pts' gives wrong results and will be removed in a follow-up version. Please use pts_unit 'sec'.
  warnings.warn(
100.0%
Clip #0 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 14336]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #1 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 15360]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #2 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 15360]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #3 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 15360]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #4 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 15360]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #5 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 15360]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #6 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 15360]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #7 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 15360]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #8 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 15360]) {'video_fps': 30.0, 'audio_fps': 44100}
Clip #9 Visual: torch.Size([30, 720, 1280, 3]) Audio: torch.Size([2, 15360]) {'video_fps': 30.0, 'audio_fps': 44100}
```



## Expected behavior
1. The output of `torchvision.io.read_video` is as expected; I provide it here for reference. The visual streams returned by `torchvision.datasets.video_utils.VideoClips` are also fine.
2. I expect the audio returned by `torchvision.datasets.video_utils.VideoClips().get_clip()` to span the same duration as the visual clip, i.e. about 48k or 44.1k samples for 1 second (30 frames) of 30 fps video. Instead, it returns either more samples than expected or only a fraction of them. Specifically, for `./small.mp4` it returns ~86-87k samples for each of the first three clips and 0 for the last two (expected 48k per clip), while for `4fpkD4A_t1s_35000_45000.mp4` it returns ~14k for the first clip and ~15k for each of the rest (expected 44.1k per clip). In the latter case the clips total `14336 + 9 * 15360 = 152576` samples, far short of the ~440k samples in the whole 10 s video. In the former case the clips total `86016 + 87142 + 87255 = 260413`, which does not match the `266240` samples loaded by `torchvision.io.read_video`.
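
To make the arithmetic behind these expectations explicit, here is a minimal sketch that compares the expected and observed audio sample counts. It assumes the audio of a clip should span exactly the same duration as its video frames. The helper `expected_audio_samples` and the `observed` dictionary are hypothetical names introduced for illustration; the numbers are copied from the outputs above.

```python
def expected_audio_samples(num_frames: int, video_fps: float, audio_fps: int) -> int:
    """Audio samples covering the same duration as `num_frames` video frames."""
    return round(num_frames / video_fps * audio_fps)


# Per-clip audio lengths copied from the outputs above (hypothetical container).
observed = {
    "small.mp4": {"audio_fps": 48000, "clips": [86016, 87142, 87255, 0, 0]},
    "4fpkD4A_t1s_35000_45000.mp4": {"audio_fps": 44100, "clips": [14336] + [15360] * 9},
}

CLIP_LENGTH_IN_FRAMES = 30
VIDEO_FPS = 30.0

report = {}
for name, data in observed.items():
    per_clip = expected_audio_samples(CLIP_LENGTH_IN_FRAMES, VIDEO_FPS, data["audio_fps"])
    report[name] = {
        "expected_per_clip": per_clip,
        "expected_total": per_clip * len(data["clips"]),
        "observed_total": sum(data["clips"]),
    }
    print(name, report[name])
```

Under that assumption, every clip should carry 48000 or 44100 samples respectively, and neither the per-clip nor the total observed counts match.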



## Environment
```
Collecting environment information...
PyTorch version: 1.5.1
Is debug build: No
CUDA used to build PyTorch: 10.2

OS: Ubuntu 18.04.4 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.10.2

Python version: 3.8
Is CUDA available: Yes
CUDA runtime version: Could not collect

Nvidia driver version: 440.44
cuDNN version: Could not collect

Versions of relevant libraries:
[pip3] numpy==1.18.5
[pip3] torch==1.5.1
[pip3] torchvision==0.6.0a0+35d732a
[conda] blas                      1.0                         mkl  
[conda] cudatoolkit               10.2.89              hfd86e86_1  
[conda] mkl                       2020.1                      217  
[conda] mkl-service               2.3.0            py38he904b0f_0  
[conda] mkl_fft                   1.1.0            py38h23d657b_0  
[conda] mkl_random                1.1.1            py38h0573a6f_0  
[conda] numpy                     1.18.5           py38ha1c710e_0  
[conda] numpy-base                1.18.5           py38hde5b4d6_0  
[conda] pytorch                   1.5.1           py3.8_cuda10.2.89_cudnn7.6.5_0    pytorch
[conda] torchvision               0.6.1                py38_cu102    pytorch
```
## Additional context

Currently, `VideoClips` is not documented on the website, so this may be a misunderstanding on my part that stems from the missing documentation.
