FFT API rewrite #25

andrewstanfordjason · 2016-02-29T11:10:36Z

Here is a suggested API for fft work:

typedef struct {
    int32_t re;
    int32_t im;
} lib_dsp_fft_complex_t;

typedef struct {
    int16_t re;
    int16_t im;
} lib_dsp_fft_complex_short_t;

void lib_dsp_fft_bit_reverse( lib_dsp_fft_complex_t pts[], unsigned N );
void lib_dsp_fft_inverse(lib_dsp_fft_complex_t pts[], unsigned N, const int32_t sine[]);
void lib_dsp_fft_forward(lib_dsp_fft_complex_t pts[], unsigned N, const int32_t sine[]);
void lib_dsp_fft_unmangle_real_inputs(lib_dsp_fft_complex_t pts[], unsigned N);
void lib_dsp_fft_mangle_real_inputs(lib_dsp_fft_complex_t pts[], unsigned N);

void lib_dsp_fft_bit_reverse_short( lib_dsp_fft_complex_short_t pts[], unsigned N );
void lib_dsp_fft_inverse_short(lib_dsp_fft_complex_short_t pts[], unsigned N, const int16_t sine[]);
void lib_dsp_fft_forward_short(lib_dsp_fft_complex_short_t pts[], unsigned N, const int16_t sine[]);
void lib_dsp_fft_unmangle_real_inputs_short(lib_dsp_fft_complex_short_t pts[], unsigned N);
void lib_dsp_fft_mangle_real_inputs_short(lib_dsp_fft_complex_short_t pts[], unsigned N);

The mangle and unmangle functions could do with a better name and I'm open to suggestions. The purpose of unmangle is to take the complex output of two N length real ffts and reorder them such that the array represents two N/2 length FFTs. mangle will do the opposite, i.e. take two N/2 length complex FFT and combine them such that they can be inverted with and iFFT.

This will mean a reduced amount of code, testing, docs and that FFTs can be fit into half the memory footprint.

The text was updated successfully, but these errors were encountered:

larry-xmos · 2016-03-08T12:10:41Z

Katja from http://www.katjaas.nl calls the unmangle/mangle operations just split spectrum and merge spectrum

andrewstanfordjason · 2016-03-08T12:12:41Z

that's much better name. We'll go with that. Thanks. So the API will be:


void lib_dsp_fft_bit_reverse( lib_dsp_fft_complex_t pts[], unsigned N );
void lib_dsp_fft_inverse(lib_dsp_fft_complex_t pts[], unsigned N, const int32_t sine[]);
void lib_dsp_fft_forward(lib_dsp_fft_complex_t pts[], unsigned N, const int32_t sine[]);
void lib_dsp_fft_split_spectrum(lib_dsp_fft_complex_t pts[], unsigned N);
void lib_dsp_fft_merge_spectrum(lib_dsp_fft_complex_t pts[], unsigned N);

andrewstanfordjason · 2016-03-08T12:15:22Z

Do we really need the 16 bit versions too? If we instead had an efficient 16 -> 32 and 32 -> 16 array converter functions then we could do away with the 16 bit implementations at the cost of some run time memory (one length N 32 bit array). The benefit, other than implementation, would be increased precision on the 16 bit version.

andrewstanfordjason · 2016-03-08T12:16:44Z

void lib_dsp_fft_short_to_long(lib_dsp_fft_complex_short_t pts_in[], lib_dsp_fft_complex_t pts_out[], unsigned N);

void lib_dsp_fft_long_to_short(lib_dsp_fft_complex_t pts_in[], lib_dsp_fft_complex_short_t pts_out[], unsigned N);

henkmuller · 2016-03-08T12:49:09Z

I love that guys graphics.

On 8 Mar 2016, at 12:10, Larry Snizek <notifications@github.com mailto:notifications@github.com> wrote:

Katja from http://www.katjaas.nl calls the unmangle/mangle operations just split spectrum and merge spectrum

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/25#issuecomment-193757874.

johned0 · 2016-03-08T14:05:31Z

ASJ - you might have a good point about only providing the 32 bit versions and then converting to 16 bit. In addition to the increased precision (which has been a big problem for Thomas, with the 16 bit version), this might actually also be more efficient than the direct 16 bit versions because the 16 bit versions are a lot slower, from Thomas' benchmarks.
Maybe we should get some feedback from Thomas and customer ?
J

johned0 · 2016-03-08T14:05:40Z

Love the graphics too, Henk

andrewstanfordjason · 2016-03-08T14:10:31Z

If someone could contact them then that would be good. I want to get
this done asap as I want to use it properly soon. Thanks

On 08/03/2016 14:05, John Edwards wrote:

ASJ - you might have a good point about only providing the 32 bit
versions and then converting to 16 bit. In addition to the increased
precision (which has been a big problem for Thomas, with the 16 bit
version), this might actually also be more efficient than the direct
16 bit versions because the 16 bit versions are a lot slower, from
Thomas' benchmarks.
Maybe we should get some feedback from Thomas and AISpeech ?
J

—
Reply to this email directly or view it on GitHub
#25 (comment).

johned0 · 2016-03-08T14:22:25Z

I can do that, Andrew. I've reviewed Thomas' emails and think I've got a handle on the performance differences. I'll let you know

andrewstanfordjason · 2016-03-08T14:28:17Z

The accuracy will be very limited for a 16 bit implementation with the
current algorithm. However, if we go to the 32 bit approach then this
issue will be gone. I've written automated tests to verify correctness
and accuracy of the current implementation(32 bit). Forward is good
backwards is bad!

On 08/03/2016 14:22, John Edwards wrote:

I can do that, Andrew. I've reviewed Thomas' emails and think I've got
a handle on the performance differences. I'll let you know

—
Reply to this email directly or view it on GitHub
#25 (comment).

ThomasGmeinder · 2016-03-08T18:54:47Z

Hi Guys
The main constraint at customer is memory.
They need to do a 4k FFT on each of the 8 input channels. Sample data is 16 bit.
32 bit lib_dsp_fft_forward_tworeals was prohibitive as that would need 512k just for buffers:
4096 * 8 (channels) * 4 (bytes per sample) * 2 (re+im) * 2 (double buffers).
That was the motivation to implement the 16-bit version. But that would still need 256k and thus require partitioning the processing on two tiles.
And not leave much memory for the additional temporary 32-bit buffer: 4k_4_2 = 32k. I'm assuming a double buffer would be needed for performance.

But:
Now that the method using the halved spectrum was proposed by Andrew and accepted by customer, the buffer requirements are relaxed:
4096 * 8 (channels) * 2 (bytes per sample) * 2 (double buffers) = 128 kB
That means the additional temporary buffer proposed to convert from short int to int before the FFT should fit.
An additional advantage would be that this temp buffer is only needed on the one tile.

Here's the test for the forward FFT using the new method:
https://github.com/ThomasGmeinder/lib_dsp/blob/master/AN00209_xCORE-200_DSP_Library/app_transforms/src/app_transforms.xc#L279

Regarding Performance:
The 16 bit FFT lib_dsp_fft_forward_complex_shortfor 4k points takes 796855 cycles (7.9ms)
The 32 bit version lib_dsp_fft_forward_complex takes 524504 cycles (5.25ms)

The overhead of 2.54ms is due to unpacking short int from memory into int (for the maccs, + and - of the butterfly)
and then back into into short int before packing and storing to memory.

The function Andrew proposes to copy a 16-bit short int array in a 32-bit int array should be faster than that.
Something like this:
copy_loop:
//load 16-bit im and re from lib_dsp_fft_complex_t
{ld16s r6, r4[r2]; sub r2, r2, 1} // r6: re; decrement offset
{ld16s r3, r4[r2]; sub r2, r2, 1} // r3: im; decrement offset
std r3, r6, r5[r6] // store 32-bit im and re to lib_dsp_fft_complex_t
{bt r2, copy_loop; sub r6, r6, 1}

would take: 16384 cycles (0.164 ms).
Probably an additional cycle due to FNOP but that would still be a lot faster (0.205 ms).

Bottom line:
Performance and accuracy would be better.
The expense would be an additional 32k temp buffer which seems a worthy tradeoff.

But because memory is so tight for DOA+Beamforming+AEC (ideally also Keyword Recognition) we should also get feedback from customer.

Cheers, /T

johned0 · 2016-03-08T19:05:35Z

Thanks very much, Thomas,
I'll add your comments to the case, for Knight to check.
Enjoy the snow :-)
John

ThomasGmeinder · 2016-03-14T18:23:21Z

I implemented and tested the array conversion functions:
void lib_dsp_fft_short_to_long(lib_dsp_fft_complex_short_t pts_in[], lib_dsp_fft_complex_t pts_out[], unsigned N);
void lib_dsp_fft_long_to_short(lib_dsp_fft_complex_t pts_in[], lib_dsp_fft_complex_short_t pts_out[], unsigned N);

Test:
https://github.com/ThomasGmeinder/lib_dsp/blob/master/AN00209_xCORE-200_DSP_Library/app_fft_short_int/src/app_fft_short_int.xc#L392

Performance (cycles per im/re data point):
5 cycles (incl 1 FNOP):
https://github.com/ThomasGmeinder/lib_dsp/blob/master/lib_dsp/src/lib_dsp_fft_short_to_long.S#L27

5 cycles:
https://github.com/ThomasGmeinder/lib_dsp/blob/master/lib_dsp/src/lib_dsp_fft_long_to_short.S#L28

larry-xmos · 2016-03-16T12:02:41Z

that's much better name. We'll go with that. Thanks. So the API will be

Should be merge_spectra rather than merge_spectrum

andrewstanfordjason · 2016-06-30T16:06:33Z

all done now.

andrewstanfordjason closed this as completed Jun 30, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FFT API rewrite #25

FFT API rewrite #25

andrewstanfordjason commented Feb 29, 2016

larry-xmos commented Mar 8, 2016

andrewstanfordjason commented Mar 8, 2016

andrewstanfordjason commented Mar 8, 2016

andrewstanfordjason commented Mar 8, 2016

henkmuller commented Mar 8, 2016

johned0 commented Mar 8, 2016

johned0 commented Mar 8, 2016

andrewstanfordjason commented Mar 8, 2016

johned0 commented Mar 8, 2016

andrewstanfordjason commented Mar 8, 2016

ThomasGmeinder commented Mar 8, 2016

johned0 commented Mar 8, 2016

ThomasGmeinder commented Mar 14, 2016

larry-xmos commented Mar 16, 2016

andrewstanfordjason commented Jun 30, 2016

FFT API rewrite #25

FFT API rewrite #25

Comments

andrewstanfordjason commented Feb 29, 2016

larry-xmos commented Mar 8, 2016

andrewstanfordjason commented Mar 8, 2016

andrewstanfordjason commented Mar 8, 2016

andrewstanfordjason commented Mar 8, 2016

henkmuller commented Mar 8, 2016

johned0 commented Mar 8, 2016

johned0 commented Mar 8, 2016

andrewstanfordjason commented Mar 8, 2016

johned0 commented Mar 8, 2016

andrewstanfordjason commented Mar 8, 2016

ThomasGmeinder commented Mar 8, 2016

johned0 commented Mar 8, 2016

ThomasGmeinder commented Mar 14, 2016

larry-xmos commented Mar 16, 2016

andrewstanfordjason commented Jun 30, 2016