Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better understanding the feature extraction process #13

Open
alebzk opened this issue Feb 7, 2018 · 11 comments
Open

Better understanding the feature extraction process #13

alebzk opened this issue Feb 7, 2018 · 11 comments

Comments

@alebzk
Copy link

alebzk commented Feb 7, 2018

Hi, first of all thank you for your amazing work!

I've been looking at the code and I am interested in better understanding the feature extraction process.
I have some questions, I will do my best to be as detailed as possibe.

Let me start from compute_frame_features() in denoise.c.
The first step is computing the energy in the Opus bands (done in frame_analysis()) and then the pitch is estimated/tracked. pitch_downsample() performs the 20ms frames downsampling by halving the samples using a [0.25, 0.5, 0.25] kernel around the even samples. Is this a less expensive way to perform downsampling and low-pass filtering jointly to avoid aliasing?

Then _celt_autocorr() is called with a lag of just 4 samples on the downsampled sequence (hence, 24kHz). Why just 4? What is achieved exactly with such a low lag? And if I look at _celt_autocorr(), I don't understand why the autocorrelation computed by celt_pitch_xcorr() is modified afterwards by summing the autocorrelation for different lags.
After that, the autocorrelation is further modified once _celt_autocorr() returns (below the // Noise floor -40 dB comment). Why is that done? And finally _celt_lpc() is called and the LPC coefficients modified (I mean lpc2) and used to filter the downsampled sequence via celt_fir5().
This whole part is a bit obscure to me and it's also hard to understand where some constants come from (e.g., ac[0] *= 1.0001f; and ac[i] -= ac[i]*(.008f*i)*(.008f*i);) - I've found some possible mappings between the dB and the linear scale, but I'm not fully sure.
Overall, pitch_downsample() looks like a pre-processing step before the pitch is sought in pitch_search(). It would be great if you can share details on what is done.

My apologies if I am asking something that may be obvious to others. I'm to some extent familiar with LPC and auto-correlation, a little bit with pitch tracking. That's probably why I can't grasp all the details in the code.

Cheers,
Alessio

@alebzk
Copy link
Author

alebzk commented Feb 13, 2018

Hi again,

It looks like the pitch estimation method you implemented is SIFT [1]. Right?
If there are other relevant details on the exact method, could you explain and/or share a reference?

Alessio

[1] Markel, J. (1972). The SIFT algorithm for fundamental frequency estimation. IEEE Transactions on Audio and Electroacoustics, 20(5), 367-377.

@alebzk
Copy link
Author

alebzk commented Feb 21, 2018

While waiting for an answer, I've checked the code better and managed to answer some of my questions. Plus, I got new ones.

This is what I've learned:

  • Using a [0.25, 0.5, 0.25] window for downsampling from 48k to 24k is indeed cheap, but it leads to a low pass filter with no steep decay; however, this should be ok for pitch estimation.
  • The autocorrelation with a maximum lag of 4 is computed since the pitch is estimated not on the audio frames directly, but on their LP residuals.
  • The estimated inverse filter coefficients used to compute the LP residuals are corrected assuming -40 dBFS noise floor; ac[0] *= 1.0001f accounts for the reduction done in ac[i] -= ac[i]*(.008f*i)*(.008f*i); the correction factors are modeled as such assuming white noise, the correlation of which follows the (0.008/i)^2 law.
  • pitch_downsample() performs both the 2x decimation and the LP residual extraction, whereas pitch_search() finds the best and the second best pitch candidates - which are further refined in remove_doubling().

My remaining questions are the following:

  • The pitch estimation is performed at 12 kHz and 24 kHz; is it truly needed? Could one use a lower sample rate?
  • Why is pitch_search() called passing PITCH_MAX_PERIOD-3*PITCH_MIN_PERIOD instead of PITCH_MAX_PERIOD for the max_pitch argument?
  • In find_best_pitch() the numerators are xcorr[i]^2, why not just xcorr[i]?
  • remove_doubling() uses second_check[16] = {0, 0, 3, 2, 3, 2, 5, 2, 3, 2, 3, 2, 5, 2, 3, 2}, it looks like these are multipliers used to look for specific (sub)harmonics; where do those values come from?
  • remove_doubling() looks at higher harmonics of the initial estimated pitch period, hence removing pitch period doubling errors; why there's no removal for halving errors?

Alessio

@jmvalin
Copy link
Member

jmvalin commented Feb 22, 2018

@alebzk the pitch estimation code is mostly copied from Opus, so it's quite possible everything in there isn't optimal for rnnoise. The idea as you seem to have guessed is to compute the auto-correlation on the residual. The (.008fi)(.008f*i) term here is means to approximate a Gaussian for lag windowing (stabilizes the LPC analysis). As for the c1 term, it's meant to convolve the LPC filter with a slight low-pass filter to make the analysis better. I'm sure it would be possible to run the pitch even lower than 12 kHz, but that seemed fine for the use I had.

Why is pitch_search() called passing PITCH_MAX_PERIOD-3*PITCH_MIN_PERIOD instead of PITCH_MAX_PERIOD for the max_pitch argument?

This avoids searching for very short periods, which can sometimes cause false detection due to formants. The very short periods are searched through remove_doubling()

In find_best_pitch() the numerators are xcorr[i]^2, why not just xcorr[i]

Yes, because we want to maximize xy/sqrt(xx*yy) but rather than compute a sqrt(), we just square everything and since the xx term is constant, we only need to maximize xy^2/yy

remove_doubling() uses second_check[16] = {0, 0, 3, 2, 3, 2, 5, 2, 3, 2, 3, 2, 5, 2, 3, 2}, it looks like these are multipliers used to look for specific (sub)harmonics; where do those values come from?

They were hand-tuned to look for expected peaks that wouldn't be expected for a different pitch period. For example, if there's a peak at T/6, then we can expect one at 5*T/6, which is a position you wouldn't expect to find a peak if the period was T/3.

remove_doubling() looks at higher harmonics of the initial estimated pitch period, hence removing pitch period doubling errors; why there's no removal for halving errors?

Period doubling (or tripling) is a common error for auto-correlation-based pitch estimators since if there's a periodicity at T, then there's also going to be a periodicity at 2T and 3T, ... OTOH, there won't be a periodicity at T/2.

@alebzk
Copy link
Author

alebzk commented Feb 22, 2018

Thanks so much for the detailed answer!

@zhly0
Copy link

zhly0 commented Jun 4, 2018

hi,@alebzk:
from your issue,I learned a lot,I am new to pitch detection.If my training sample's sample rate is 16000,is the macro define PITCH_BUF_SIZE need to change?What is the relation between the sample rate and the PITCH_BUF_SIZE?
expecting your reply,Thanks!

@alebzk
Copy link
Author

alebzk commented Jun 25, 2018

Hi @zhly0,

@jmvalin can surely say more, but I'm happy to give a first answer.

Yes. PITCH_BUF_SIZE unit is number of samples at 48 kHz; hence, it has to be adapted if the sample rate changes. However, note that the whole pitch estimation algorithm is kind of specific for the 48k case (there are 2 downsampling steps and the estimated pitch is used to compute spectral the cross-correlation features). If you want to avoid resampling from 16k to 48k, then part of the code must be adapted.

Alessio

@teslam
Copy link

teslam commented Sep 3, 2018

@alebzk thanks for your write up, very helpful (altough im just half way through). (and thanks to jmvalin ofcourse!)

I dont understand your comment: "The autocorrelation with a maximum lag of 4 is computed since the pitch is estimated not on the audio frames directly, but on their LP residuals." I dont understand where the residual is calculated (residual as in difference between x_lp and lpcestimated).
Are you referring to:
opus_val32 sum = SHL32(EXTEND32(x[i]), SIG_SHIFT)?

Thanks again

@jmvalin
Copy link
Member

jmvalin commented Sep 3, 2018

See this call:
celt_fir5(x_lp, lpc2, x_lp, len>>1, mem);
in pitch_downsample.c

@teslam
Copy link

teslam commented Sep 3, 2018

@jmvalin thanks! I have gone through that function plenty of times but double checked now when you wrote, and I think my problem comes from the fact that lpc coefficients are defined with opposite sign compared to how I thought they were.

I now assume that you define them as e.g. Matlab do: https://se.mathworks.com/help/signal/ref/lpc.html.

Thanks.

@shakingWaves
Copy link

@jmvalin @alebzk I do not understand the pitch estimation method,especially, the "pitch_search" function in pitch.c. can you share me some details?

@zuowanbushiwo
Copy link

@shakingWaves @zhly0 @jmvalin @alebzk
I also do not understand the pitch estimation method, If my training sample's sample rate is 16000, how to change pitch_downsample,pitch_search,remove_doubling those funtions?
Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants