Any audio-related tutorials, articles, etc. on Hugging Face.
- human hearing, loudness vs. frequency: we perceive certain frequencies as louder or quieter than others, even when they are played with an equal amount of energy. In the 1930s, researchers actually measured this, resulting in the first of many “curves” that try to capture human loudness bias as a function of frequency. In short, humans have natural biases in how we hear loudness at different frequencies.
- human hearing, pitch vs. frequency: humans also have biases in how we hear pitch. Low frequencies sound “low” to us and higher ones sound “high”, but the relationship between frequency and perception is non-linear – a 100 Hz and a 200 Hz sine wave sound much further apart than a 10,000 Hz and a 10,100 Hz wave. (A small numerical sketch of both biases follows this list.)
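A minimal numerical sketch of both biases, assuming librosa is available; A-weighting and the mel scale are only common approximations of these perceptual curves, not the original 1930s measurements:

```python
import numpy as np
import librosa

# Loudness bias: A-weighting approximates how much quieter or louder a tone at a
# given frequency is perceived relative to the ~1-4 kHz region.
freqs = np.array([50.0, 100.0, 1000.0, 4000.0, 15000.0])
for f, w in zip(freqs, librosa.A_weighting(freqs)):
    print(f"{f:>8.0f} Hz -> {w:+6.1f} dB perceptual weighting")

# Pitch bias: on the mel scale, equal distances roughly correspond to equal
# perceived pitch distances, so the same 100 Hz gap "shrinks" at high frequencies.
low_gap = librosa.hz_to_mel(200.0) - librosa.hz_to_mel(100.0)
high_gap = librosa.hz_to_mel(10100.0) - librosa.hz_to_mel(10000.0)
print(f"100 vs 200 Hz:     {low_gap:.1f} mel apart")
print(f"10000 vs 10100 Hz: {high_gap:.1f} mel apart")
```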
Setting 1: an ML model trained on the (magnitude) spectrogram, re-using the original phase for reconstruction.
graph LR
A[(Input Audio)] -->B(STFT)
B --> C["(Magnitude) Spectrogram"] --> D(ML Models) --> E("Transformed \n(Magnitude) Spectrogram") --> G
B --> F[Phase Spectrogram] --> G(Inverse STFT)
G --> H(Generated \noutput audio)
style C fill:grey, color:black;
style E fill:grey, color:black;
style F fill:grey, color:black;
style H fill:grey, color:black;
- The magnitude spectrogram can be the one generated directly by the STFT, or one processed a step further, such as a mel spectrogram. (A minimal sketch of this setting follows below.)
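A minimal sketch of Setting 1 with numpy/librosa, assuming librosa and its downloadable example clip are available; `fake_model` is a hypothetical stand-in for whatever ML model transforms the magnitude spectrogram:

```python
import numpy as np
import librosa

# Load audio and compute the STFT.
y, sr = librosa.load(librosa.ex("trumpet"))
stft = librosa.stft(y, n_fft=1024, hop_length=256)

magnitude = np.abs(stft)   # what the model sees
phase = np.angle(stft)     # kept aside, untouched

def fake_model(mag):
    # Placeholder "transformation": light spectral denoising by thresholding.
    return np.where(mag > 0.01, mag, 0.0)

transformed_mag = fake_model(magnitude)

# Recombine the transformed magnitude with the *original* phase, then invert.
reconstructed = librosa.istft(
    transformed_mag * np.exp(1j * phase), hop_length=256, length=len(y)
)
```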
Setting 2: an ML model trained on the spectrogram, but using an algorithm or a vocoder to reconstruct the phase from the transformed spectrogram and convert the result back into time-domain audio output.
graph LR
A[(Input Audio)] -->B(STFT)
B --> C["(Magnitude) Spectrogram"] --> D(ML Models) --> E("Transformed \n(Magnitude) Spectrogram") --> F("phase reconstruction\n(GL-Algorithm, Vocoder etc)") --> G(Generated \noutput audio)
style C fill:grey, color:black;
style E fill:grey, color:black;
style G fill:grey, color:black;
- The Griffin-Lim algorithm uses the intuition that the STFT usually activates the same frequencies in neighboring frames to iteratively come up with a reasonable guess for the phase (see the sketch after this list).
- Vocoders: model components that probabilistically generate waveforms given magnitude spectrograms. WaveNet and WaveGlow are examples that are commonly used as pretrained vocoders. Vocoders don't just convert a spectrogram to a waveform; they can also help re-adjust the input spectrogram so that the output is better aligned with the ground truth.
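A minimal sketch of the Griffin-Lim variant of Setting 2 using `librosa.griffinlim`, again assuming librosa and its example clip; the true phase is simply discarded here to stand in for a model-produced magnitude spectrogram:

```python
import numpy as np
import librosa

# Compute a magnitude-only spectrogram (pretend it came out of the ML model).
y, sr = librosa.load(librosa.ex("trumpet"))
magnitude = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))

# Griffin-Lim iteratively estimates a plausible phase and returns a waveform.
y_rec = librosa.griffinlim(magnitude, n_iter=60, n_fft=1024, hop_length=256)
```

In practice, a pretrained neural vocoder would typically replace Griffin-Lim when higher output quality is needed.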
Setting 3: (end-to-end) an ML model that learns directly from the waveform input and produces a waveform output (a toy sketch follows the diagram).
graph LR
A[(Input Audio)] --> D(ML Models) --> G(Generated \noutput audio)
style G fill:grey, color:black;
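A toy sketch of Setting 3 in PyTorch; the architecture and layer sizes are purely illustrative and not taken from any particular model:

```python
import torch
import torch.nn as nn

class WaveToWave(nn.Module):
    """Toy end-to-end model: raw waveform in, raw waveform out, no spectrogram stage."""

    def __init__(self, channels: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=15, padding=7),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=15, padding=7),
            nn.ReLU(),
            nn.Conv1d(channels, 1, kernel_size=15, padding=7),
        )

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, 1, num_samples) -> (batch, 1, num_samples)
        return self.net(waveform)

model = WaveToWave()
dummy = torch.randn(2, 1, 16000)  # two 1-second clips at 16 kHz
print(model(dummy).shape)         # torch.Size([2, 1, 16000])
```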
- Spectral Leakage and Zero-Padding of the Discrete Fourier Transform, DSPIllustrations.com
- Loss Functions in Audio ML
- HF Audio Course: it covers Unit 0 through Unit 8, spanning audio data ETL, various audio-related tasks, and model introductions.
- FAU Prof. Mueller's website
- FAU/Preparation Course Python Notebooks
- FAU/Fundamentals of Music Processing Notebooks