Slightly improved API. Dramatically improved implementation #1

Open
wants to merge 4 commits into
from

Conversation

Projects
None yet
3 participants
@Steve132

I dramatically improved the implementation of encode and decode to use efficient bit-twiddling and scaling to perform the morton-code encoding and decoding in constant time with only a few operations. The resulting code should be significantly faster.

I also changed the api to use pass-by-value where it was more efficient and more appropriate for a C-API. The resulting call-by-value change decreased error checking and should increase performance over pass-by-reference

Steve132 added some commits Sep 15, 2014

Update geohash.h
The API has no reason to pass the range arguments by reference.  Doing so slows down the function.  
The arguments are not return values, and the first thing the code does is dereference them and copy them into stack variables.  Passing the input ranges by value skips this step and has fewer errors.  Furthermore, the arguments are small enough (2 primatives) that putting them onto the stack by value can make the function call inline even faster.
Update geohash.c
Same change as before: change input arguments to stack based arguments for efficiency.  This also allows us to skip the case where the inputs == NULL in the error check
Update geohash.c
I re-wrote the encode and decode functions to provide a DRAMATICALLY faster interface that does no branching and uses only a few 64-bit integer operations with unrolled loops to do the interleaving/deinterleaving using binary magic numbers for efficient morton codes.  The code should go much much much faster now.
@yinqiwen

This comment has been minimized.

Show comment
Hide comment
@yinqiwen

yinqiwen Sep 16, 2014

Owner

I got a compile error

geohash.c: In function ‘geohash_decode’:
geohash.c:161:5: error: ‘ilato’ undeclared (first use in this function)
ilato = xyhilo; //get back the original integer coordinates
^
geohash.c:161:5: note: each undeclared identifier is reported only once for each function it appears in
geohash.c:162:5: error: ‘ilono’ undeclared (first use in this function)
ilono = xyhilo >> 32;
^
geohash.c:170:42: warning: incompatible implicit declaration of built-in function ‘ldexp’ [enabled by default]
area->latitude.min = lat_range.min + ldexp(ilato, -step) * lat_scale;

And IMO it's better to keep the old simple&stupid implementation, since this optimize one is hard to understand.

Owner

yinqiwen commented Sep 16, 2014

I got a compile error

geohash.c: In function ‘geohash_decode’:
geohash.c:161:5: error: ‘ilato’ undeclared (first use in this function)
ilato = xyhilo; //get back the original integer coordinates
^
geohash.c:161:5: note: each undeclared identifier is reported only once for each function it appears in
geohash.c:162:5: error: ‘ilono’ undeclared (first use in this function)
ilono = xyhilo >> 32;
^
geohash.c:170:42: warning: incompatible implicit declaration of built-in function ‘ldexp’ [enabled by default]
area->latitude.min = lat_range.min + ldexp(ilato, -step) * lat_scale;

And IMO it's better to keep the old simple&stupid implementation, since this optimize one is hard to understand.

@yinqiwen

This comment has been minimized.

Show comment
Hide comment
@yinqiwen

yinqiwen Sep 16, 2014

Owner

And about the 'geohash_decode' function, origin implementation and this optimize one have different result for same input.

Owner

yinqiwen commented Sep 16, 2014

And about the 'geohash_decode' function, origin implementation and this optimize one have different result for same input.

@yinqiwen

This comment has been minimized.

Show comment
Hide comment
@yinqiwen

yinqiwen Sep 16, 2014

Owner

I finally understand your implementation, it's brilliant.
I'll merge this by hand, and add two functions 'geohash_decode_fast/geohash_encode_fast'.

Owner

yinqiwen commented Sep 16, 2014

I finally understand your implementation, it's brilliant.
I'll merge this by hand, and add two functions 'geohash_decode_fast/geohash_encode_fast'.

@Steve132

This comment has been minimized.

Show comment
Hide comment
@Steve132

Steve132 Sep 16, 2014

So, sorry about the compile errors. I imagine they are easy to fix. I wrote this at work without access to a C compiler.

With regard to not matching the identical input and output, I didn't test this because, again, no C compiler. However, any differences should be minor

If you manage to fix the differences (which should be trivial, I'll do it if you don't), then I STRONGLY recommend not having two new functions. There's no reason to have two different APIs to do exactly the same thing but just one is different. The unix philosophy: "There should be one, and exactly one, obvious way to do something".

Users of your API shouldn't have to choose between the 'fast' way and the 'slow' way...it's confusing. Just provide one way.

So, sorry about the compile errors. I imagine they are easy to fix. I wrote this at work without access to a C compiler.

With regard to not matching the identical input and output, I didn't test this because, again, no C compiler. However, any differences should be minor

If you manage to fix the differences (which should be trivial, I'll do it if you don't), then I STRONGLY recommend not having two new functions. There's no reason to have two different APIs to do exactly the same thing but just one is different. The unix philosophy: "There should be one, and exactly one, obvious way to do something".

Users of your API shouldn't have to choose between the 'fast' way and the 'slow' way...it's confusing. Just provide one way.

@mattsta

This comment has been minimized.

Show comment
Hide comment
@mattsta

mattsta Sep 16, 2014

For comparison: the original encode took 140 ns on my machine (compiled with clang -O2) and the new encode now takes 0.57 ns to encode.

Decodes went from 100 ns to the same 0.57 ns.

(So, prior performance was 7 million encodes per second and 10 million decodes per second; now we can do 1.7 billion per second with either encodes or decodes).

New version performance comparison versus compiler optimization levels:

  • clang -O0: encode = 2 ns; decode = 0.57 ns
  • gcc -O0: encode = 101 ns; decode = 108 ns
  • clang -O1 and higher: encode = 0.57 ns; decode = 0.57 ns
  • gcc -O1: encode = 36 ns; decode = 54 ns
  • gcc -O2: encode = 30 ns; decode = 49 ns
  • gcc -O3: encode = 19 ns; decode = 21 ns

mattsta commented Sep 16, 2014

For comparison: the original encode took 140 ns on my machine (compiled with clang -O2) and the new encode now takes 0.57 ns to encode.

Decodes went from 100 ns to the same 0.57 ns.

(So, prior performance was 7 million encodes per second and 10 million decodes per second; now we can do 1.7 billion per second with either encodes or decodes).

New version performance comparison versus compiler optimization levels:

  • clang -O0: encode = 2 ns; decode = 0.57 ns
  • gcc -O0: encode = 101 ns; decode = 108 ns
  • clang -O1 and higher: encode = 0.57 ns; decode = 0.57 ns
  • gcc -O1: encode = 36 ns; decode = 54 ns
  • gcc -O2: encode = 30 ns; decode = 49 ns
  • gcc -O3: encode = 19 ns; decode = 21 ns
@Steve132

This comment has been minimized.

Show comment
Hide comment
@Steve132

Steve132 Sep 16, 2014

mattsta: can you post the assembly for the encode-decode on clang -O1 vs gcc -O3? I'm amazed there is almost a factor of 30 difference in performance and I really want to see what the compiler is doing.

mattsta: can you post the assembly for the encode-decode on clang -O1 vs gcc -O3? I'm amazed there is almost a factor of 30 difference in performance and I really want to see what the compiler is doing.

@mattsta

This comment has been minimized.

Show comment
Hide comment
@mattsta

mattsta Sep 16, 2014

Thanks for offering to check. I put files at https://gist.github.com/mattsta/cd338cb0253e85d8dfe0. It will be easier to just clone the gist and compare locally (git clone https://gist.github.com/mattsta/cd338cb0253e85d8dfe0). If you're feeling dangerous, you can try to open the gist in your browser, but the page is very large: Direct links: clang -O1 and gcc -O3.

The assembly files are from the version of geohash-int I keep at https://github.com/mattsta/geohash-int

Assembly was generated with: [gcc-4.9 | clang] geohash.c -o [compiler-options].s -O[0,1,2,3] -S -g -fverbose-asm

The timing results were found using my dumb looping timing tester: https://github.com/mattsta/krmt/blob/master/geo/test.c

It's possible the higher optimization levels were removing some useful steps since I didn't use the results of the function anywhere again.

mattsta commented Sep 16, 2014

Thanks for offering to check. I put files at https://gist.github.com/mattsta/cd338cb0253e85d8dfe0. It will be easier to just clone the gist and compare locally (git clone https://gist.github.com/mattsta/cd338cb0253e85d8dfe0). If you're feeling dangerous, you can try to open the gist in your browser, but the page is very large: Direct links: clang -O1 and gcc -O3.

The assembly files are from the version of geohash-int I keep at https://github.com/mattsta/geohash-int

Assembly was generated with: [gcc-4.9 | clang] geohash.c -o [compiler-options].s -O[0,1,2,3] -S -g -fverbose-asm

The timing results were found using my dumb looping timing tester: https://github.com/mattsta/krmt/blob/master/geo/test.c

It's possible the higher optimization levels were removing some useful steps since I didn't use the results of the function anywhere again.

@yinqiwen

This comment has been minimized.

Show comment
Hide comment
@yinqiwen

yinqiwen Sep 17, 2014

Owner

I tested with gcc 3.4.5/4.8.2 in linux, it's strange that the new decode function is slightly slower with gcc3.4.5 -O2, it seems that new glibc is much faster with math function 'ldexp'. I put all test record at
https://gist.github.com/yinqiwen/c6d0be258356996bd16b

Owner

yinqiwen commented Sep 17, 2014

I tested with gcc 3.4.5/4.8.2 in linux, it's strange that the new decode function is slightly slower with gcc3.4.5 -O2, it seems that new glibc is much faster with math function 'ldexp'. I put all test record at
https://gist.github.com/yinqiwen/c6d0be258356996bd16b

@yinqiwen

This comment has been minimized.

Show comment
Hide comment
@yinqiwen

yinqiwen Sep 18, 2014

Owner

@Steve132 @mattsta Now I use simple divide & shift operations instead of math function 'ldexp', the decode is about 300% faster than using ' ldexp'.
Test record: https://gist.github.com/yinqiwen/c6d0be258356996bd16b

Owner

yinqiwen commented Sep 18, 2014

@Steve132 @mattsta Now I use simple divide & shift operations instead of math function 'ldexp', the decode is about 300% faster than using ' ldexp'.
Test record: https://gist.github.com/yinqiwen/c6d0be258356996bd16b

@mattsta

This comment has been minimized.

Show comment
Hide comment
@mattsta

mattsta Sep 18, 2014

Using the new divide & shift decode, my time for a decode (gcc-4.9 -O2) is now 29 ns from the previous 50 ns. Good fix!

mattsta commented Sep 18, 2014

Using the new divide & shift decode, my time for a decode (gcc-4.9 -O2) is now 29 ns from the previous 50 ns. Good fix!

@Steve132

This comment has been minimized.

Show comment
Hide comment
@Steve132

Steve132 Sep 18, 2014

Try this out:

static inline double fastldiexp(double d,uint8 step)
{
    union
    { 
        double fval;
        struct
        {
            unsigned s:1;
            unsigned e:11;
            uint32_t m:52;
        };
    }
    fval=d;
    e-=step;
    return fval;
}


area->latitude.min = lat_range.min+fastldiexp(ilato,step)*lat_scale;
area->latitude.max = lat_range.min+fastldiexp(ilato+1,step)*lat_scale;
area->longitude.min = lon_range.min+fastldiexp(ilono,step)*lon_scale;
area->longitude.max = lon_range.min+fastldiexp(ilono+1,step)*lon_scale;

Try this out:

static inline double fastldiexp(double d,uint8 step)
{
    union
    { 
        double fval;
        struct
        {
            unsigned s:1;
            unsigned e:11;
            uint32_t m:52;
        };
    }
    fval=d;
    e-=step;
    return fval;
}


area->latitude.min = lat_range.min+fastldiexp(ilato,step)*lat_scale;
area->latitude.max = lat_range.min+fastldiexp(ilato+1,step)*lat_scale;
area->longitude.min = lon_range.min+fastldiexp(ilono,step)*lon_scale;
area->longitude.max = lon_range.min+fastldiexp(ilono+1,step)*lon_scale;
@mattsta

This comment has been minimized.

Show comment
Hide comment
@mattsta

mattsta Sep 19, 2014

That's some crazy compiler voodoo there. After fixing 8 syntax errors, it decodes at 36 ns compared to 29 ns for the all-shifting version. (gcc-4.9 -O2)

I think we've found the optimal results for now with the divide and shift operations. :)

mattsta commented Sep 19, 2014

That's some crazy compiler voodoo there. After fixing 8 syntax errors, it decodes at 36 ns compared to 29 ns for the all-shifting version. (gcc-4.9 -O2)

I think we've found the optimal results for now with the divide and shift operations. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment