Fix audio truncation by adding 20-second silent buffer #173
avan06 wants to merge 1 commit into spotify:main
Modified the `predict` function in `inference.py` to always append 20 seconds of silence to the input audio before running inference. This prevents the model from incorrectly truncating the tail end of the audio, which was happening on long, continuous files due to CNN edge effects.
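The padding step described above can be sketched as follows. This is only an illustration of the idea, not the diff from this PR: the helper name `append_silence` is hypothetical, and it assumes the audio is already loaded as a mono 1-D NumPy array with a known sample rate.

```python
import numpy as np


def append_silence(audio: np.ndarray, sample_rate: int, seconds: float = 20.0) -> np.ndarray:
    """Return a copy of mono `audio` with `seconds` of zeros appended.

    Hypothetical helper for illustration; assumes a 1-D array of samples.
    """
    # Number of zero-valued samples needed for the requested duration.
    n_pad = int(round(seconds * sample_rate))
    # Match the input dtype so concatenation does not upcast the signal.
    silence = np.zeros(n_pad, dtype=audio.dtype)
    return np.concatenate([audio, silence])


# Example: one second of audio at 22050 Hz becomes 21 seconds after padding.
audio = np.ones(22050, dtype=np.float32)
padded = append_silence(audio, sample_rate=22050, seconds=20.0)
```

The sample rate of 22050 Hz is used here only as an example value; the padding length should be computed from whatever rate the audio was actually loaded at.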

Hi! It's good to know that adding 20 s of samples at the end seems to mitigate the issue, although I think we'd like to find the root cause before making any changes here, which I'm happy to help with! Could you share more details to help reproduce the issue (e.g., the audio file tested and the command run)?

Hi, sure, no problem. Here are the details to reproduce the issue.

    from basic_pitch.inference import predict
    from basic_pitch import ICASSP_2022_MODEL_PATH

    # Audio file used to reproduce the truncation.
    audio_path = "(GB)ONI V 隠忍を継ぐ者⧸Oni V: Innin wo Tsugumono-Soundtrack [JkWeyX7Hquc].flac"

    # Run inference with the bundled ICASSP 2022 model.
    model_output, midi_data, note_events = predict(
        audio_path,
        ICASSP_2022_MODEL_PATH,
        onset_threshold=0.55,
        frame_threshold=0.25,
        minimum_note_length=100,
        minimum_frequency=50,
        maximum_frequency=3000,
    )

    # Write the transcription to a MIDI file.
    midi_data.write("ONI V.mid")

Please help confirm, thank you. By the way, below is the log from my execution:

Hi @avan06, sorry for the delay in answering. Let me know if you would like to co-author commits in the other PR I created, as thanks for your contribution in finding the bug!

Hi @hyperc54,
When converting music with basic-pitch, the output is mistakenly truncated at the end for unknown reasons. This issue becomes more noticeable as the audio length increases. After repeated testing, it was found that simply adding 20 seconds of silence to the input audio in the `predict` function of `basic_pitch/inference.py` can prevent this issue from occurring.

Since basic-pitch trims silence before outputting, the added 20 seconds of silence here will not result in a longer output.
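
One way to check whether a transcription was cut short is to compare the end time of the last detected note with the duration of the input audio. The sketch below is a hypothetical diagnostic, not part of basic-pitch. It assumes each note event is a tuple whose first two fields are the start and end times in seconds, which should be verified against the installed version of the library.

```python
from typing import Sequence, Tuple


def tail_gap_seconds(note_events: Sequence[Tuple], audio_duration_s: float) -> float:
    """Return the seconds between the end of the last note and the end of the audio.

    Assumes (hypothetically) that event[1] is the note's end time in seconds.
    """
    if not note_events:
        # No notes at all: the whole file counts as untranscribed.
        return audio_duration_s
    # Notes may not be sorted by end time, so take the maximum.
    last_end = max(event[1] for event in note_events)
    return audio_duration_s - last_end


# Example with made-up events: the last note ends at 3.0 s in a 10.0 s file.
events = [(0.0, 1.5, 60, 0.8), (2.0, 3.0, 64, 0.7)]
gap = tail_gap_seconds(events, audio_duration_s=10.0)  # 7.0
```

A large gap on a file that contains music right up to its end would be consistent with the truncation described here, while a similar gap on a file that ends in silence would not indicate a problem.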