Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

__builtin_clz returning wrong value in a loop #90

Closed
michaeljclark opened this issue Aug 23, 2017 · 22 comments
Closed

__builtin_clz returning wrong value in a loop #90

michaeljclark opened this issue Aug 23, 2017 · 22 comments

Comments

@michaeljclark
Copy link

@michaeljclark michaeljclark commented Aug 23, 2017

I was testing out a new clz function derived from the GNU MP Library's highly optimized clz. It has some nice properties compared to gcc's __builtin_clz. Both of them are table based however the gcc version uses a larger table. During the process I discovered a puzzling gcc 7.1.0 bug. It is puzzling because it only occurs on RISC-V and it only occurs when calling __builtin_clz in a loop.

Despite the odd gcc bug, take a look at the asm output of the GNU MP clz function, with RISC-V custom asm; and compare to __clzdi2. GNU MP has lots of platform specific inline assembly for performance. I think gcc's __builtin_clz (__clzdi2) is broken in this case:

I tried to make a reproducer outside of a loop and the bug does not occur:

$ cat clz7.c 
#include <assert.h>

int main()
{
	assert (__builtin_clz(2) == 30);
}
$ riscv64-linux-musl-gcc -fwrapv -O3 -static clz7.c 
$ qemu-riscv64 a.out

The program runs to completion on x86-64

$ x86_64-linux-musl-gcc -fwrapv -O3 -static clz6.c 
$ ./a.out 

However __builtin_clz returns incorrect values when called in a loop (29 is returned where x == 2).

$ riscv64-linux-musl-gcc -fwrapv -O3 -static clz6.c
$ qemu-riscv64 a.out 
1 31 31
2 29 30
Assertion failed: __builtin_clz(x) == clz6(x) (clz6.c: main: 50)
Aborted

Here is the test program (note: the printf is only enabled on riscv, and the inline asm version of the new function is likely correct, rather it is __builtin_clz that is returning erroneous values):

$ cat clz6.c
/* clz implementation derived from GNU MP Library longlong.h */

#include <stdio.h>
#include <assert.h>

const
unsigned char __clz_tab[129] =
{
  1,2,3,3,4,4,4,4,5,5,5,5,5,5,5,5,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
  7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
  8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,
  8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,
  9
};

#if defined (__riscv)
int clz6(unsigned x)
{
    unsigned shamt;
    __asm__(
        "li t0,0x1000000\n\t"
        "sltu t0,%1,t0\n\t"
        "neg %0,t0\n\t"
        "li t0,0x10000\n\t"
        "sltu t0,%1,t0\n\t"
        "sub %0,%0,t0\n\t"
        "li t0,0x100\n\t"
        "sltu t0,%1,t0\n\t"
        "sub %0,%0,t0"
     : "=&r" (shamt) : "r" (x));
    shamt = shamt*8 + 24 + 1;
    return 33 - shamt - __clz_tab[x >> shamt];
}
#else
int clz6(unsigned x)
{
    unsigned shamt;
    shamt = -(x < 0x1000000);
    shamt -= (x < 0x10000);
    shamt -= (x < 0x100);
    shamt = shamt*8 + 24 + 1;
    return 33 - shamt - __clz_tab[x >> shamt];
}
#endif

int main()
{
    for (int x = 1; x != 0; x++) {
#if defined (__riscv)
        printf("%d %d %d\n", x, __builtin_clz(x), clz6(x));
#endif
        assert (__builtin_clz(x) == clz6(x));
    }
}
@michaeljclark

This comment has been minimized.

Copy link
Author

@michaeljclark michaeljclark commented Aug 23, 2017

why is __builtin_clz expanding to __clzdi2 vs __clzsi2? We're passing an int.

@michaeljclark

This comment has been minimized.

Copy link
Author

@michaeljclark michaeljclark commented Aug 23, 2017

This looks like __builtin_clzll codegen not __builtin_clz (__builtin_clzll is definitely what we are calling, as it works on x86-64) which one would imagine should expand to __clzsi2. x86 generates BSR which is bit scan reverse, and returns the leading zeros relative to the LSB i.e. in reverse. The new LZCNT is only available on Haswell. It seems many arches have CLZ or CTLZ from reading GNU MP source.

The code size for __clzdi2 is smaller than the GNU MP version but it has a branchy loop shifting 8 bits at a time and accumulating whereas the GNU MP version is essentially unrolled, and it's 32-bit not 64-bit as this version appears to be. In any case the gcc codegen for this test case on riscv appears wrong.

__clzdi2:
	  li	a5,56
	  srl	a4,a0,a5
	  andi	a4,a4,255
	  bnez	a4,10300 <__clzdi2+0x12>
	  addi	a5,a5,-8
	  bnez	a5,102f2 <__clzdi2+0x4>
	  li	a4,64
	  sub	a4,a4,a5
	  srl	a5,a0,a5
	  auipc	a0,0x7
	  ld	a0,-210(a0) # 17238 <_GLOBAL_OFFSET_TABLE_+0x48>
	  add	a5,a5,a0
	  lbu	a0,0(a5) # 0 <main-0x100e8>
	  subw	a0,a4,a0
	  ret
@michaeljclark

This comment has been minimized.

Copy link
Author

@michaeljclark michaeljclark commented Aug 23, 2017

It seems gcc emits __clzdi2 and subtracts 32. I'll have to analyze the loop and see how it manages to get back an unusual result from __builtin_clz for a 32-bit integer loop induction variable.

$ cat clz7.c 
int clz(unsigned x) { return __builtin_clz(x); }
$ riscv64-linux-musl-gcc -O3 -static clz7.c
$ riscv64-linux-musl-objdump -d a.out
00000000000101e4 <clz>:
   101e4:	1502                	slli	a0,a0,0x20
   101e6:	1141                	addi	sp,sp,-16
   101e8:	9101                	srli	a0,a0,0x20
   101ea:	e406                	sd	ra,8(sp)
   101ec:	00c000ef          	jal	ra,101f8 <__clzdi2>
   101f0:	60a2                	ld	ra,8(sp)
   101f2:	3501                	addiw	a0,a0,-32
   101f4:	0141                	addi	sp,sp,16
   101f6:	8082                	ret
@michaeljclark

This comment has been minimized.

Copy link
Author

@michaeljclark michaeljclark commented Aug 23, 2017

The inline asm version was missing t0 clobber but that is not the issue. Fixed below.

$ riscv64-linux-musl-gcc -fwrapv -UUSE_RISCV_ASM -O3 -static clz6.c
$ qemu-riscv64 ./a.out 
1 31 31
2 29 30
Assertion failed: __builtin_clz(x) == clz6(x) (clz6.c: main: 52)

__builtin_clz fails against the pure C version. Pretty sure this is a genuine compiler bug.

/* clz implementation derived from GNU MP Library longlong.h */

#include <stdio.h>
#include <assert.h>

const
unsigned char __clz_tab[129] =
{
  1,2,3,3,4,4,4,4,5,5,5,5,5,5,5,5,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,6,
  7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
  8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,
  8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,8,
  9
};

#if defined(USE_RISCV_ASM)
int clz6(unsigned x)
{
    unsigned shamt;
    __asm__(
        "li t0,0x1000000\n\t"
        "sltu t0,%1,t0\n\t"
        "neg %0,t0\n\t"
        "li t0,0x10000\n\t"
        "sltu t0,%1,t0\n\t"
        "sub %0,%0,t0\n\t"
        "li t0,0x100\n\t"
        "sltu t0,%1,t0\n\t"
        "sub %0,%0,t0"
     : "=&r" (shamt) : "r" (x) : "t0");
    shamt = shamt*8 + 24 + 1;
    return 33 - shamt - __clz_tab[x >> shamt];
}
#else
int clz6(unsigned x)
{
    unsigned shamt;
    shamt = -(x < 0x1000000);
    shamt -= (x < 0x10000);
    shamt -= (x < 0x100);
    shamt = shamt*8 + 24 + 1;
    return 33 - shamt - __clz_tab[x >> shamt];
}
#endif

int main()
{
    for (int x = 1; x != 0; x++) {
#if defined (__riscv)
	printf("%d %d %d\n", x, __builtin_clz(x), clz6(x));
#endif
        assert (__builtin_clz(x) == clz6(x));
    }
}
@michaeljclark

This comment has been minimized.

Copy link
Author

@michaeljclark michaeljclark commented Aug 23, 2017

I wonder if we can provide __clzsi2 because in addition to the addi -32, the looping version of __clzdi2 is going to loop 4 times reading 0 bytes on the first 32-bits of the 64-bit word for the __builtin_clz (32-bit) case vs __builtin_clzll (64-bit).

Providing a 22 instruction __clzsi2 would be a win when one counts the 4-7 loops in the current __clzdi2 for the 32-bit case.

@aswaterman

This comment has been minimized.

Copy link
Member

@aswaterman aswaterman commented Aug 23, 2017

Why do you think it's a compiler bug? By my reading of the code, clz6(2) should return 30 (with shamt=1, x>>shamt=1, and __clz_tab[1]=2). Isn't this a bug in clz6?

@michaeljclark

This comment has been minimized.

Copy link
Author

@michaeljclark michaeljclark commented Aug 23, 2017

Look at the order in the printf. Better yet compile it. It works fine on x86.

__builtin_clz is erroneously returning 29 and clz6 is correctly returning 30

It works on x86-64.

@aswaterman

This comment has been minimized.

Copy link
Member

@aswaterman aswaterman commented Aug 23, 2017

Oh, I thought you were saying that clz6 was compiling wrong.

@michaeljclark

This comment has been minimized.

Copy link
Author

@michaeljclark michaeljclark commented Aug 23, 2017

I was really confused by this. I thought it was a bug in clz6 but after studying it I was suprised to find that __builtin_clz was returning an erroneous result.

#if defined (__riscv)
printf("%d %d %d\n", x, __builtin_clz(x), clz6(x));
#endif

@michaeljclark

This comment has been minimized.

Copy link
Author

@michaeljclark michaeljclark commented Aug 23, 2017

I was trying a few different clz variants. So far clz6 is my favorite 32-bit variant.

@michaeljclark

This comment has been minimized.

Copy link
Author

@michaeljclark michaeljclark commented Aug 23, 2017

Here is the asm for clz6, 23 instructions:

clz6(unsigned int):
	  li t0,0x1000000
	  sltu t0,a0,t0
	  neg a5,t0
	  li t0,0x10000
	  sltu t0,a0,t0
	  sub a5,a5,t0
	  li t0,0x100
	  sltu t0,a0,t0
	  sub a5,a5,t0
	  addiw a5,a5,3
	  slliw a5,a5,3
	  addiw a4,a5,1
	  srlw a0,a0,a4
	  slli a0,a0,32
	  lui a4,%hi(.LANCHOR0)
	  addi a4,a4,%lo(.LANCHOR0)
	  srli a0,a0,32
	  add a0,a4,a0
	  lbu a0,0(a0)
	  not a5,a5
	  addiw a5,a5,33
	  subw a0,a5,a0
	  ret

Compare this to riscv gcc’s __builtin_clz which calls __clzdi2 (the 64-bit version) and iterates 4-7 times reading 0 bytes in the first 4 iterations on the upper 32-bits of a double word. I haven't counted the cycles, but probably 30 or more (20 cycles reading 0 bytes for the upper 32-bits).

$ cat clz.c
int clz(unsigned x) { return __builtin_clz(x); }
$ iscv64-linux-musl-gcc -O3 -static clz.c
$ riscv64-linux-musl-objdump -d a.out
clz:
	  slli	a0,a0,0x20
	  addi	sp,sp,-16
	  srli	a0,a0,0x20
	  sd	ra,8(sp)
	  jal	ra,101f8 <__clzdi2>
	  ld	ra,8(sp)
	  addiw	a0,a0,-32
	  addi	sp,sp,16
	  ret

__clzdi2:
	  li	a5,56
	  srl	a4,a0,a5
	  andi	a4,a4,255
	  bnez	a4,10300 <__clzdi2+0x12>
	  addi	a5,a5,-8
	  bnez	a5,102f2 <__clzdi2+0x4>
	  li	a4,64
	  sub	a4,a4,a5
	  srl	a5,a0,a5
	  auipc	a0,0x7
	  ld	a0,-210(a0) # 17238 <_GLOBAL_OFFSET_TABLE_+0x48>
	  add	a5,a5,a0
	  lbu	a0,0(a5) # 0 <main-0x100e8>
	  subw	a0,a4,a0
	  ret
@aswaterman

This comment has been minimized.

Copy link
Member

@aswaterman aswaterman commented Aug 23, 2017

Let's focus on the bug in this ticket, not the clz implementation :)

It does appear that __clzdi2(2) is returning 61.

@michaeljclark

This comment has been minimized.

Copy link
Author

@michaeljclark michaeljclark commented Aug 23, 2017

OK. 👍

It runs to completion on x86-64. clz6 is effectively testing __builtin_clz. You can change the ifdef to not use the asm version and it will still fail on riscv. I think the asm version of clz6 is fine (but is not the bug here).

$ x86_64-linux-musl-gcc -fwrapv -O0 -static clz6.c 
$ ./a.out 
$
@michaeljclark

This comment has been minimized.

Copy link
Author

@michaeljclark michaeljclark commented Aug 23, 2017

$ x86_64-linux-musl-gcc -fwrapv -O3 -static clz6.c 
$ ./a.out 
$

and

$ riscv64-linux-musl-gcc -fwrapv -UUSE_RISCV_ASM -O0 -static clz6.c 
$ qemu-riscv64 ./a.out 
1 30 31
Assertion failed: __builtin_clz(x) == clz6(x) (clz6.c: main: 52)
Aborted

and

$ riscv64-linux-musl-gcc -fwrapv -UUSE_RISCV_ASM -O3 -static clz6.c 
$ qemu-riscv64 ./a.out 
1 31 31
2 29 30
Assertion failed: __builtin_clz(x) == clz6(x) (clz6.c: main: 52)
Aborted

Note: I'm using -U to undef USE_RISCV_ASM so we are using the same pure C version as x86, but clz6 is not the issue. -D should also work equivalently:

$ riscv64-linux-musl-gcc -fwrapv -DUSE_RISCV_ASM -O3 -static clz6.c 
$ qemu-riscv64 ./a.out 
1 31 31
2 29 30
Assertion failed: __builtin_clz(x) == clz6(x) (clz6.c: main: 52)
Aborted
@michaeljclark

This comment has been minimized.

Copy link
Author

@michaeljclark michaeljclark commented Aug 23, 2017

Odd isn't it.

Scalar evolution has duplicated the loop induction variable. Perhaps a signed vs unsigned issue.

One function is using s6 (__clzdi2) and the other function is using s7 (clz6 inlined) in the output I see from gcc-7.1.0. I'm going to try gcc-6.1.

   101a2:	001b8b1b          	addiw	s6,s7,1
   101a6:	000b0b9b          	sext.w	s7,s6
   101aa:	f60b9be3          	bnez	s7,10120 <main+0x38>
@michaeljclark

This comment has been minimized.

Copy link
Author

@michaeljclark michaeljclark commented Aug 23, 2017

2 more data points. Broken with riscv gcc-6.1. Works on clang x86-64 (Xcode 8.3.3).

$ clang --version
Apple LLVM version 8.1.0 (clang-802.0.42)
$ clang -fwrapv -O3 ../riscv-tools/mc-experiment/src/clz6.c
$ ./a.out
$ $ riscv64-unknown-elf-gcc --version
riscv64-unknown-elf-gcc (GCC) 6.1.0
$ riscv64-unknown-elf-gcc -fwrapv -O3 ../riscv-tools/mc-experiment/src/clz6.c
$ rv-sim a.out 
1 31 31
2 29 30
assertion "__builtin_clz(x) == clz6(x)" failed: file "../riscv-tools/mc-experiment/src/clz6.c", line 52, function: main
$ riscv64-unknown-elf-gcc -fwrapv -O0 ../riscv-tools/mc-experiment/src/clz6.c 
$ rv-sim a.out 
1 30 31
assertion "__builtin_clz(x) == clz6(x)" failed: file "../riscv-tools/mc-experiment/src/clz6.c", line 52, function: main
@michaeljclark

This comment has been minimized.

Copy link
Author

@michaeljclark michaeljclark commented Aug 23, 2017

Hey sorry.

I think it was an off by one in __clz_tab and a symbol collision. Interesting the linker doesn't complain. Two different GNU projects (GNU MP and GCC) with different __clz_tab

@michaeljclark

This comment has been minimized.

Copy link
Author

@michaeljclark michaeljclark commented Aug 23, 2017

Renamed __clz_tab to clz_tab.

Should really get a duplicate symbol warning and I wonder what happens when GNU MP is compiled assuming it defines __clz_tab in longlong.h. It won't appear on x86 because x86 lowers to BSR

@michaeljclark

This comment has been minimized.

Copy link
Author

@michaeljclark michaeljclark commented Aug 23, 2017

I have a feeling this bug may show up if someone links GNU MP and uses __builtin_clz, on platforms that it defines __clz_tab which clashes with GCC's __clz_tab. It's an upstream bug.

https://github.com/ryepdx/gmp/blob/master/longlong.h

@aswaterman

This comment has been minimized.

Copy link
Member

@aswaterman aswaterman commented Aug 23, 2017

Ah, yes, making the other table static makes the problem go away.

There's no error as a natural consequence of the ELF linking model. Objects are only linked in from an archive if they are needed to satisfy a dependence. Since there was no dependence on the object that contained __clz_tab, it wasn't linked in, hence no collision actually occurred.

@michaeljclark

This comment has been minimized.

Copy link
Author

@michaeljclark michaeljclark commented Aug 23, 2017

@michaeljclark

This comment has been minimized.

Copy link
Author

@michaeljclark michaeljclark commented Aug 23, 2017

Curious if that object is emitted on RISC-V GNU MP. If someone then linked -lgmp and used __builtin_clz in another module (which expects GCC's __clz_tab) they might have an issue.

I found the generic version in riscv-gnu-toolchain/riscv-gcc/libgcc/libgcc2.c

Some platforms override with asm. Some platforms define expansions for "clzsi2" and "clzdi2" in their metadata to emit machine instructions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.