Description
Videos in SenselabDatasets that have audio do not convert to the exact same audio in HuggingFace. This is likely caused by a few different factors: extracting the audio from a video with torchvision does not produce the same waveform as using ffmpeg directly, and converting to a HuggingFace dataset via its Audio feature uses soundfile under the hood, which introduces additional distortions at some points.
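For reference, a minimal sketch of the two decode paths being compared; the video path and output filename are placeholders, not files from this repo:

```python
import subprocess

import torch
import torchaudio
import torchvision

VIDEO = "sample.mp4"  # hypothetical test file

# Path 1: torchvision demuxes and decodes the audio stream itself.
# aframes has shape (channels, samples); depending on the codec, values
# may already be floats in [-1, 1].
_, tv_audio, _ = torchvision.io.read_video(VIDEO, pts_unit="sec")
tv_audio = tv_audio.to(torch.float32)

# Path 2: ffmpeg extracts the audio to 32-bit float WAV, torchaudio loads it.
subprocess.run(
    ["ffmpeg", "-y", "-i", VIDEO, "-vn", "-c:a", "pcm_f32le", "extracted.wav"],
    check=True,
)
ff_audio, _ = torchaudio.load("extracted.wav")

# The decoders may pad or trim differently, so compare the common prefix.
n = min(tv_audio.shape[-1], ff_audio.shape[-1])
print(torch.equal(tv_audio[..., :n], ff_audio[..., :n]))    # typically False
print((tv_audio[..., :n] - ff_audio[..., :n]).abs().max())  # small but nonzero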
Steps to Reproduce
In dataset_test.py, we test a video and its extracted audio. The test currently checks that the converted tensors are close to each other (defined here as atol=1e-4), but the issue can be seen by checking for exact equality instead.
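A self-contained sketch of the kind of check involved; the helper and the synthetic perturbation are illustrative, not the actual test code:

```python
import torch

def check_roundtrip(original: torch.Tensor, roundtrip: torch.Tensor) -> None:
    """Compare a waveform before and after the HuggingFace roundtrip."""
    print("allclose(atol=1e-4):", torch.allclose(original, roundtrip, atol=1e-4))
    print("exactly equal:", torch.equal(original, roundtrip))
    print("max abs diff:", (original - roundtrip).abs().max().item())

# Illustrative stand-in: a waveform perturbed at the observed scale (~5e-5).
wave = torch.rand(2, 16_000) * 2 - 1
noise = (torch.rand_like(wave) - 0.5) * 1e-4  # uniform in [-5e-5, 5e-5]
check_roundtrip(wave, wave + noise)           # passes allclose, fails equal
```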
Expected Results
We would expect that, no matter which library is used to decode the audio from a video, converting to a HuggingFace dataset and back to a SenselabDataset would yield the same audio throughout the process, since the audio waveform is just a 2D tensor.
Actual Results
For videos, the audio diverges after converting to HuggingFace and back to Senselab, with a maximum absolute difference of around 5e-5, though notably not every value diverges. It's possible that differences in how the libraries handle silence, or near silence, cause the encodings to differ.
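If the near-silence hypothesis is worth testing, a diagnostic along these lines could check whether the divergent samples coincide with low-amplitude regions (the function and the threshold are hypothetical, not part of the codebase):

```python
import torch

def divergence_report(a: torch.Tensor, b: torch.Tensor, silence_thresh: float = 1e-3) -> None:
    """Summarize where two waveforms diverge and whether those samples are near-silent."""
    diff = (a - b).abs()
    diverged = diff > 0
    near_silent = a.abs() < silence_thresh
    print("fraction of samples diverged:", diverged.float().mean().item())
    print("max abs diff:", diff.max().item())
    if diverged.any():
        frac = (diverged & near_silent).float().sum() / diverged.float().sum()
        print("share of divergent samples that are near-silent:", frac.item())
```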
Additional Notes
Interestingly, this issue has not been seen when converting existing audio files, which leads me to believe it is a result of the different encodings of the audio extracted from a video (floating-point precision and bit depth).
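One back-of-the-envelope check on the bit-depth hypothesis: if any stage of the pipeline round-trips through 16-bit PCM, the quantization error alone is on the same order as the observed divergence. This is a sketch of the arithmetic, not a claim about which stage does it:

```python
import torch

# Rounding to the nearest int16 step loses at most half a step,
# i.e. 0.5 / 32767, roughly 1.5e-5 per sample, which is the same order
# of magnitude as the observed 5e-5 maximum divergence.
wave = torch.rand(1, 16_000) * 2 - 1               # float32 waveform in [-1, 1]
as_int16 = (wave * 32767).round().to(torch.int16)  # encode to 16-bit PCM
back = as_int16.to(torch.float32) / 32767          # decode back to float32
print((wave - back).abs().max().item())            # approximately 1.5e-5
```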