Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FFT API rewrite #25

Closed
andrewstanfordjason opened this issue Feb 29, 2016 · 15 comments
Closed

FFT API rewrite #25

andrewstanfordjason opened this issue Feb 29, 2016 · 15 comments

Comments

@andrewstanfordjason
Copy link
Collaborator

Here is a suggested API for fft work:

typedef struct {
    int32_t re;
    int32_t im;
} lib_dsp_fft_complex_t;

typedef struct {
    int16_t re;
    int16_t im;
} lib_dsp_fft_complex_short_t;

void lib_dsp_fft_bit_reverse( lib_dsp_fft_complex_t pts[], unsigned N );
void lib_dsp_fft_inverse(lib_dsp_fft_complex_t pts[], unsigned N, const int32_t sine[]);
void lib_dsp_fft_forward(lib_dsp_fft_complex_t pts[], unsigned N, const int32_t sine[]);
void lib_dsp_fft_unmangle_real_inputs(lib_dsp_fft_complex_t pts[], unsigned N);
void lib_dsp_fft_mangle_real_inputs(lib_dsp_fft_complex_t pts[], unsigned N);

void lib_dsp_fft_bit_reverse_short( lib_dsp_fft_complex_short_t pts[], unsigned N );
void lib_dsp_fft_inverse_short(lib_dsp_fft_complex_short_t pts[], unsigned N, const int16_t sine[]);
void lib_dsp_fft_forward_short(lib_dsp_fft_complex_short_t pts[], unsigned N, const int16_t sine[]);
void lib_dsp_fft_unmangle_real_inputs_short(lib_dsp_fft_complex_short_t pts[], unsigned N);
void lib_dsp_fft_mangle_real_inputs_short(lib_dsp_fft_complex_short_t pts[], unsigned N);

The mangle and unmangle functions could do with a better name and I'm open to suggestions. The purpose of unmangle is to take the complex output of two N length real ffts and reorder them such that the array represents two N/2 length FFTs. mangle will do the opposite, i.e. take two N/2 length complex FFT and combine them such that they can be inverted with and iFFT.

This will mean a reduced amount of code, testing, docs and that FFTs can be fit into half the memory footprint.

@larry-xmos
Copy link
Contributor

Katja from http://www.katjaas.nl calls the unmangle/mangle operations just split spectrum and merge spectrum

@andrewstanfordjason
Copy link
Collaborator Author

that's much better name. We'll go with that. Thanks. So the API will be:


void lib_dsp_fft_bit_reverse( lib_dsp_fft_complex_t pts[], unsigned N );
void lib_dsp_fft_inverse(lib_dsp_fft_complex_t pts[], unsigned N, const int32_t sine[]);
void lib_dsp_fft_forward(lib_dsp_fft_complex_t pts[], unsigned N, const int32_t sine[]);
void lib_dsp_fft_split_spectrum(lib_dsp_fft_complex_t pts[], unsigned N);
void lib_dsp_fft_merge_spectrum(lib_dsp_fft_complex_t pts[], unsigned N);

@andrewstanfordjason
Copy link
Collaborator Author

Do we really need the 16 bit versions too? If we instead had an efficient 16 -> 32 and 32 -> 16 array converter functions then we could do away with the 16 bit implementations at the cost of some run time memory (one length N 32 bit array). The benefit, other than implementation, would be increased precision on the 16 bit version.

@andrewstanfordjason
Copy link
Collaborator Author

void lib_dsp_fft_short_to_long(lib_dsp_fft_complex_short_t pts_in[], lib_dsp_fft_complex_t pts_out[], unsigned N);

void lib_dsp_fft_long_to_short(lib_dsp_fft_complex_t pts_in[], lib_dsp_fft_complex_short_t pts_out[], unsigned N);

@henkmuller
Copy link
Collaborator

I love that guys graphics.

On 8 Mar 2016, at 12:10, Larry Snizek <notifications@github.commailto:notifications@github.com> wrote:

Katja from http://www.katjaas.nl calls the unmangle/mangle operations just split spectrum and merge spectrum


Reply to this email directly or view it on GitHubhttps://github.com//issues/25#issuecomment-193757874.

@johned0
Copy link
Contributor

johned0 commented Mar 8, 2016

ASJ - you might have a good point about only providing the 32 bit versions and then converting to 16 bit. In addition to the increased precision (which has been a big problem for Thomas, with the 16 bit version), this might actually also be more efficient than the direct 16 bit versions because the 16 bit versions are a lot slower, from Thomas' benchmarks.
Maybe we should get some feedback from Thomas and customer ?
J

@johned0
Copy link
Contributor

johned0 commented Mar 8, 2016

Love the graphics too, Henk

@andrewstanfordjason
Copy link
Collaborator Author

If someone could contact them then that would be good. I want to get
this done asap as I want to use it properly soon. Thanks

On 08/03/2016 14:05, John Edwards wrote:

ASJ - you might have a good point about only providing the 32 bit
versions and then converting to 16 bit. In addition to the increased
precision (which has been a big problem for Thomas, with the 16 bit
version), this might actually also be more efficient than the direct
16 bit versions because the 16 bit versions are a lot slower, from
Thomas' benchmarks.
Maybe we should get some feedback from Thomas and AISpeech ?
J


Reply to this email directly or view it on GitHub
#25 (comment).

@johned0
Copy link
Contributor

johned0 commented Mar 8, 2016

I can do that, Andrew. I've reviewed Thomas' emails and think I've got a handle on the performance differences. I'll let you know

@andrewstanfordjason
Copy link
Collaborator Author

The accuracy will be very limited for a 16 bit implementation with the
current algorithm. However, if we go to the 32 bit approach then this
issue will be gone. I've written automated tests to verify correctness
and accuracy of the current implementation(32 bit). Forward is good
backwards is bad!

On 08/03/2016 14:22, John Edwards wrote:

I can do that, Andrew. I've reviewed Thomas' emails and think I've got
a handle on the performance differences. I'll let you know


Reply to this email directly or view it on GitHub
#25 (comment).

@ThomasGmeinder
Copy link
Contributor

Hi Guys
The main constraint at customer is memory.
They need to do a 4k FFT on each of the 8 input channels. Sample data is 16 bit.
32 bit lib_dsp_fft_forward_tworeals was prohibitive as that would need 512k just for buffers:
4096 * 8 (channels) * 4 (bytes per sample) * 2 (re+im) * 2 (double buffers).
That was the motivation to implement the 16-bit version. But that would still need 256k and thus require partitioning the processing on two tiles.
And not leave much memory for the additional temporary 32-bit buffer: 4k_4_2 = 32k. I'm assuming a double buffer would be needed for performance.

But:
Now that the method using the halved spectrum was proposed by Andrew and accepted by customer, the buffer requirements are relaxed:
4096 * 8 (channels) * 2 (bytes per sample) * 2 (double buffers) = 128 kB
That means the additional temporary buffer proposed to convert from short int to int before the FFT should fit.
An additional advantage would be that this temp buffer is only needed on the one tile.

Here's the test for the forward FFT using the new method:
https://github.com/ThomasGmeinder/lib_dsp/blob/master/AN00209_xCORE-200_DSP_Library/app_transforms/src/app_transforms.xc#L279

Regarding Performance:
The 16 bit FFT lib_dsp_fft_forward_complex_shortfor 4k points takes 796855 cycles (7.9ms)
The 32 bit version lib_dsp_fft_forward_complex takes 524504 cycles (5.25ms)

The overhead of 2.54ms is due to unpacking short int from memory into int (for the maccs, + and - of the butterfly)
and then back into into short int before packing and storing to memory.

The function Andrew proposes to copy a 16-bit short int array in a 32-bit int array should be faster than that.
Something like this:
copy_loop:
//load 16-bit im and re from lib_dsp_fft_complex_t
{ld16s r6, r4[r2]; sub r2, r2, 1} // r6: re; decrement offset
{ld16s r3, r4[r2]; sub r2, r2, 1} // r3: im; decrement offset
std r3, r6, r5[r6] // store 32-bit im and re to lib_dsp_fft_complex_t
{bt r2, copy_loop; sub r6, r6, 1}

would take: 16384 cycles (0.164 ms).
Probably an additional cycle due to FNOP but that would still be a lot faster (0.205 ms).

Bottom line:
Performance and accuracy would be better.
The expense would be an additional 32k temp buffer which seems a worthy tradeoff.

But because memory is so tight for DOA+Beamforming+AEC (ideally also Keyword Recognition) we should also get feedback from customer.

Cheers, /T

@johned0
Copy link
Contributor

johned0 commented Mar 8, 2016

Thanks very much, Thomas,
I'll add your comments to the case, for Knight to check.
Enjoy the snow :-)
John

@ThomasGmeinder
Copy link
Contributor

I implemented and tested the array conversion functions:
void lib_dsp_fft_short_to_long(lib_dsp_fft_complex_short_t pts_in[], lib_dsp_fft_complex_t pts_out[], unsigned N);
void lib_dsp_fft_long_to_short(lib_dsp_fft_complex_t pts_in[], lib_dsp_fft_complex_short_t pts_out[], unsigned N);

Test:
https://github.com/ThomasGmeinder/lib_dsp/blob/master/AN00209_xCORE-200_DSP_Library/app_fft_short_int/src/app_fft_short_int.xc#L392

Performance (cycles per im/re data point):
5 cycles (incl 1 FNOP):
https://github.com/ThomasGmeinder/lib_dsp/blob/master/lib_dsp/src/lib_dsp_fft_short_to_long.S#L27

5 cycles:
https://github.com/ThomasGmeinder/lib_dsp/blob/master/lib_dsp/src/lib_dsp_fft_long_to_short.S#L28

@larry-xmos
Copy link
Contributor

that's much better name. We'll go with that. Thanks. So the API will be

Should be merge_spectra rather than merge_spectrum

@andrewstanfordjason
Copy link
Collaborator Author

all done now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants