Fix LLVM issue with bitcast+fptrunc vectorization and add type load/downconversion test #2915

sobomax · 2023-12-22T20:12:45Z

This patch is expected to fix Issue #1367 as well as add some tests around various types being downconverted after loading from disk. The test brings to light various inconsistencies that conversion might go with different backends and different type permutations, as well as hopefully help us to clean it over time to make it all more consistent.

The goal over time would be to reduce broken to 0.

The original bug I was after causes crash in the llvm-generated code, and it's only happens when the compiler sees multiple fptrunc operations like so:

define void @"E_4194304_4"(half* noalias %".1", i32* noalias %".2") "no-nans-fp-math"="true"
{
[...]
  %".17" = bitcast i32 %".7" to float
  %".18" = bitcast i32 %".10" to float
  %".19" = bitcast i32 %".13" to float
  %".20" = bitcast i32 %".16" to float
  %".21" = fptrunc float %".17" to half
  %".22" = fptrunc float %".18" to half
  %".23" = fptrunc float %".19" to half
  %".24" = fptrunc float %".20" to half

The code was crashing so dive into gdb revealed this:

Dump of assembler code for function E_4194304_4:
   0x00007ffff7bcd000 <+0>:  push   %rbp
   0x00007ffff7bcd001 <+1>:  push   %r15
   0x00007ffff7bcd003 <+3>:  push   %r14
   0x00007ffff7bcd005 <+5>:  push   %r13
   0x00007ffff7bcd007 <+7>:  push   %r12
   0x00007ffff7bcd009 <+9>:  push   %rbx
   0x00007ffff7bcd00a <+10>: sub    $0x18,%rsp
   0x00007ffff7bcd00e <+14>: mov    %rsi,%rbx
   0x00007ffff7bcd011 <+17>: mov    %rdi,%r14
   0x00007ffff7bcd014 <+20>: add    $0xc,%rbx
   0x00007ffff7bcd018 <+24>: xor    %ebp,%ebp
   0x00007ffff7bcd01a <+26>: movabs $0x0,%r15
   0x00007ffff7bcd024 <+36>: cs nopw 0x0(%rax,%rax,1)
   0x00007ffff7bcd02e <+46>: xchg   %ax,%ax
   0x00007ffff7bcd030 <+48>: movss  -0xc(%rbx),%xmm0
   0x00007ffff7bcd035 <+53>: movss  -0x8(%rbx),%xmm1
   0x00007ffff7bcd03a <+58>: movss  %xmm1,0xc(%rsp)
   0x00007ffff7bcd040 <+64>: movss  -0x4(%rbx),%xmm1
   0x00007ffff7bcd045 <+69>: movss  %xmm1,0x10(%rsp)
   0x00007ffff7bcd04b <+75>: movss  (%rbx),%xmm1
   0x00007ffff7bcd04f <+79>: movss  %xmm1,0x14(%rsp)
=> 0x00007ffff7bcd055 <+85>: call   *%r15
   0x00007ffff7bcd058 <+88>: mov    %ax,0xa(%rsp)
   0x00007ffff7bcd05d <+93>: movss  0xc(%rsp),%xmm0

%r15 is very obviously 0 at this point, so trying to compile the IR with the lcc gave me this:

        movss   -12(%r15), %xmm0                # xmm0 = mem[0],zero,zero,zero
        movss   -8(%r15), %xmm1                 # xmm1 = mem[0],zero,zero,zero
        movss   %xmm1, 32(%rsp)                 # 4-byte Spill
        movss   -4(%r15), %xmm1                 # xmm1 = mem[0],zero,zero,zero
        movss   %xmm1, 16(%rsp)                 # 4-byte Spill
        movss   (%r15), %xmm1                   # xmm1 = mem[0],zero,zero,zero
        movss   %xmm1, 12(%rsp)                 # 4-byte Spill
        callq   __truncsfhf2@PLT
        movaps  %xmm0, 48(%rsp)                 # 16-byte Spill
        movss   32(%rsp), %xmm0                 # 4-byte Reload
                                        # xmm0 = mem[0],zero,zero,zero
        callq   __truncsfhf2@PLT

Which was a bit unexpected but gave me a clue. So after trying few things that did not work, the +f16c' seems to do the trick.

None of the existing dtype tests hit the issue since you'd need multiple conversion operations to trigger the vectorization.

I've fixed the llvm so it passes all checks, as a bonus I've fixed coder to check if bfloat is supported and fallback to do conversion via LLVM backend if not. This makes coder working with GPU and TORCH, however for some reason it produces gibberish on torch (uint support is lacking?)

sobomax · 2024-01-05T20:47:13Z

Also fixes similar issue doing double->half conversion, that one emits a similar intrinsic:

  %".5" = getelementptr inbounds double, double* %".2", i32 %"loop0"
  %".6" = load double, double* %".5"
  %".7" = fptrunc double %".6" to half
  %".8" = getelementptr inbounds half, half* %".1", i32 %"loop0"

        movq    %rsi, %r14
        movq    %rdi, %r15
        xorl    %ebx, %ebx
        movabsq $__truncdfhf2, %r12
        .p2align        4, 0x90

After this change:

  %".5" = getelementptr inbounds double, double* %".2", i32 %"loop0"
  %".6" = load double, double* %".5"
  %".7" = fptrunc double %".6" to float
  %".8" = fptrunc float %".7" to half

The generated code:

        vmovsd  (%rsi,%rax,8), %xmm0            # xmm0 = mem[0],zero
        vcvtsd2ss       %xmm0, %xmm0, %xmm0
        vcvtps2ph       $4, %xmm0, %xmm0
        vpextrw $0, %xmm0, (%rdi,%rax,2)

being imported into the TG. We load the data as a stream of ints of various sizes, then bitcast them to match it original type and then often cast to some of our internal data types if needed. That turns out to be a non-trivial task to get right, esp dealing with historical backends with lots of hair and various level of support for some of the types involved. This test will excercise that conversion, the goal here is to have most of the backends at least provide a consistent behaviour when dealing with truncations and signedness. Metal and OpenCL seems to be the platforms that are closer than anyone else to the truth, OpenCL however also seems to vary between different backends.

vectorization would emit call to an undeclared function __truncsfhf2@PLT, which is not defined and essentially translates to a call via NULL pointer. What's annoying is that only happens when there are several bitcasts followed by the matching number of the fptrunc's. This only triggers when loading weights from the disk in bulk.

Also workaround llvm codegen bug doing double->half conversion.

Surprress warnings while doing type gymnastics.

github-actions · 2024-01-15T06:00:21Z

Changes

Name                              Lines    Diff    Tokens/Line    Diff
------------------------------  -------  ------  -------------  ------
tinygrad/codegen/linearizer.py      418      -2           18.4    -0.0
tinygrad/renderer/llvmir.py         135      -2           22.8    +1.2
tinygrad/dtype.py                    84      +5           17.0    +0.4
tinygrad/runtime/ops_llvm.py         49      -3           10.8    +0.9


total lines changes: -2

geohot · 2024-02-09T11:37:26Z

There's a whole bunch of different unrelated things in this PR, it's unreviewable. Want to submit small, clear win PRs?

sobomax force-pushed the pr_fix_llvm_load_cast branch from 5c4583f to d17ab4d Compare December 22, 2023 20:20

sobomax marked this pull request as draft December 22, 2023 20:54

sobomax force-pushed the pr_fix_llvm_load_cast branch 9 times, most recently from 84e8c90 to 4767f84 Compare December 26, 2023 06:01

sobomax marked this pull request as ready for review December 26, 2023 11:29

sobomax mentioned this pull request Dec 30, 2023

test_dtype fails with a segfault on LLVM on linux #1367

Open

sobomax force-pushed the pr_fix_llvm_load_cast branch from 4767f84 to a5961a2 Compare January 4, 2024 03:18

sobomax mentioned this pull request Jan 4, 2024

Get rid of the "fix bf16, TODO: check if device supports bf16" in coder #3004

Closed

sobomax force-pushed the pr_fix_llvm_load_cast branch from a5961a2 to b11b356 Compare January 5, 2024 20:34

sobomax force-pushed the pr_fix_llvm_load_cast branch from b11b356 to cc14236 Compare January 5, 2024 22:02

sobomax added 7 commits January 14, 2024 21:44

Properly cap output when doing conversion from float to ints.

bcdac7a

Also workaround llvm codegen bug doing double->half conversion.

Make min_val() and max_val() available in dtypes.

8691f12

llvm is no longer broken wrt FP16

43a643f

save lines

ccb287e

Uncripple on LLVM.

757ecc9

Surprress warnings while doing type gymnastics.

sobomax force-pushed the pr_fix_llvm_load_cast branch 2 times, most recently from d48e05c to 757ecc9 Compare January 15, 2024 05:59

geohot closed this Feb 9, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix LLVM issue with bitcast+fptrunc vectorization and add type load/downconversion test #2915

Fix LLVM issue with bitcast+fptrunc vectorization and add type load/downconversion test #2915

sobomax commented Dec 22, 2023 •

edited

Loading

sobomax commented Jan 5, 2024 •

edited

Loading

github-actions bot commented Jan 15, 2024

geohot commented Feb 9, 2024

Fix LLVM issue with bitcast+fptrunc vectorization and add type load/downconversion test #2915

Fix LLVM issue with bitcast+fptrunc vectorization and add type load/downconversion test #2915

Conversation

sobomax commented Dec 22, 2023 • edited Loading

sobomax commented Jan 5, 2024 • edited Loading

github-actions bot commented Jan 15, 2024

Changes

geohot commented Feb 9, 2024

sobomax commented Dec 22, 2023 •

edited

Loading

sobomax commented Jan 5, 2024 •

edited

Loading