Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix LLVM issue with bitcast+fptrunc vectorization and add type load/downconversion test #2915

Closed
wants to merge 7 commits into from

Conversation

sobomax
Copy link
Contributor

@sobomax sobomax commented Dec 22, 2023

This patch is expected to fix Issue #1367 as well as add some tests around various types being downconverted after loading from disk. The test brings to light various inconsistencies that conversion might go with different backends and different type permutations, as well as hopefully help us to clean it over time to make it all more consistent.

The goal over time would be to reduce broken to 0.

The original bug I was after causes crash in the llvm-generated code, and it's only happens when the compiler sees multiple fptrunc operations like so:

define void @"E_4194304_4"(half* noalias %".1", i32* noalias %".2") "no-nans-fp-math"="true"
{
[...]
  %".17" = bitcast i32 %".7" to float
  %".18" = bitcast i32 %".10" to float
  %".19" = bitcast i32 %".13" to float
  %".20" = bitcast i32 %".16" to float
  %".21" = fptrunc float %".17" to half
  %".22" = fptrunc float %".18" to half
  %".23" = fptrunc float %".19" to half
  %".24" = fptrunc float %".20" to half

The code was crashing so dive into gdb revealed this:

Dump of assembler code for function E_4194304_4:
   0x00007ffff7bcd000 <+0>:  push   %rbp
   0x00007ffff7bcd001 <+1>:  push   %r15
   0x00007ffff7bcd003 <+3>:  push   %r14
   0x00007ffff7bcd005 <+5>:  push   %r13
   0x00007ffff7bcd007 <+7>:  push   %r12
   0x00007ffff7bcd009 <+9>:  push   %rbx
   0x00007ffff7bcd00a <+10>: sub    $0x18,%rsp
   0x00007ffff7bcd00e <+14>: mov    %rsi,%rbx
   0x00007ffff7bcd011 <+17>: mov    %rdi,%r14
   0x00007ffff7bcd014 <+20>: add    $0xc,%rbx
   0x00007ffff7bcd018 <+24>: xor    %ebp,%ebp
   0x00007ffff7bcd01a <+26>: movabs $0x0,%r15
   0x00007ffff7bcd024 <+36>: cs nopw 0x0(%rax,%rax,1)
   0x00007ffff7bcd02e <+46>: xchg   %ax,%ax
   0x00007ffff7bcd030 <+48>: movss  -0xc(%rbx),%xmm0
   0x00007ffff7bcd035 <+53>: movss  -0x8(%rbx),%xmm1
   0x00007ffff7bcd03a <+58>: movss  %xmm1,0xc(%rsp)
   0x00007ffff7bcd040 <+64>: movss  -0x4(%rbx),%xmm1
   0x00007ffff7bcd045 <+69>: movss  %xmm1,0x10(%rsp)
   0x00007ffff7bcd04b <+75>: movss  (%rbx),%xmm1
   0x00007ffff7bcd04f <+79>: movss  %xmm1,0x14(%rsp)
=> 0x00007ffff7bcd055 <+85>: call   *%r15
   0x00007ffff7bcd058 <+88>: mov    %ax,0xa(%rsp)
   0x00007ffff7bcd05d <+93>: movss  0xc(%rsp),%xmm0

%r15 is very obviously 0 at this point, so trying to compile the IR with the lcc gave me this:

        movss   -12(%r15), %xmm0                # xmm0 = mem[0],zero,zero,zero
        movss   -8(%r15), %xmm1                 # xmm1 = mem[0],zero,zero,zero
        movss   %xmm1, 32(%rsp)                 # 4-byte Spill
        movss   -4(%r15), %xmm1                 # xmm1 = mem[0],zero,zero,zero
        movss   %xmm1, 16(%rsp)                 # 4-byte Spill
        movss   (%r15), %xmm1                   # xmm1 = mem[0],zero,zero,zero
        movss   %xmm1, 12(%rsp)                 # 4-byte Spill
        callq   __truncsfhf2@PLT
        movaps  %xmm0, 48(%rsp)                 # 16-byte Spill
        movss   32(%rsp), %xmm0                 # 4-byte Reload
                                        # xmm0 = mem[0],zero,zero,zero
        callq   __truncsfhf2@PLT

Which was a bit unexpected but gave me a clue. So after trying few things that did not work, the +f16c' seems to do the trick.

None of the existing dtype tests hit the issue since you'd need multiple conversion operations to trigger the vectorization.

I've fixed the llvm so it passes all checks, as a bonus I've fixed coder to check if bfloat is supported and fallback to do conversion via LLVM backend if not. This makes coder working with GPU and TORCH, however for some reason it produces gibberish on torch (uint support is lacking?)

@sobomax
Copy link
Contributor Author

sobomax commented Jan 5, 2024

Also fixes similar issue doing double->half conversion, that one emits a similar intrinsic:

  %".5" = getelementptr inbounds double, double* %".2", i32 %"loop0"
  %".6" = load double, double* %".5"
  %".7" = fptrunc double %".6" to half
  %".8" = getelementptr inbounds half, half* %".1", i32 %"loop0"
        movq    %rsi, %r14
        movq    %rdi, %r15
        xorl    %ebx, %ebx
        movabsq $__truncdfhf2, %r12
        .p2align        4, 0x90

After this change:

  %".5" = getelementptr inbounds double, double* %".2", i32 %"loop0"
  %".6" = load double, double* %".5"
  %".7" = fptrunc double %".6" to float
  %".8" = fptrunc float %".7" to half

The generated code:

        vmovsd  (%rsi,%rax,8), %xmm0            # xmm0 = mem[0],zero
        vcvtsd2ss       %xmm0, %xmm0, %xmm0
        vcvtps2ph       $4, %xmm0, %xmm0
        vpextrw $0, %xmm0, (%rdi,%rax,2)

being imported into the TG. We load the data as a stream of ints
of various sizes, then bitcast them to match it original type and
then often cast to some of our internal data types if needed.

That turns out to be a non-trivial task to get right, esp
dealing with historical backends with lots of hair and various
level of support for some of the types involved.

This test will excercise that conversion, the goal here is to
have most of the backends at least provide a consistent behaviour
when dealing with truncations and signedness.

Metal and OpenCL seems to be the platforms that are closer than
anyone else to the truth, OpenCL however also seems to vary
between different backends.
vectorization would emit call to an undeclared function
__truncsfhf2@PLT, which is not defined and essentially translates
to a call via NULL pointer. What's annoying is that only happens
when there are several bitcasts followed by the matching number
of the fptrunc's. This only triggers when loading weights from
the disk in bulk.
Also workaround llvm codegen bug doing double->half conversion.
Surprress warnings while doing type gymnastics.
@sobomax sobomax force-pushed the pr_fix_llvm_load_cast branch 2 times, most recently from d48e05c to 757ecc9 Compare January 15, 2024 05:59
Copy link
Contributor

Changes

Name                              Lines    Diff    Tokens/Line    Diff
------------------------------  -------  ------  -------------  ------
tinygrad/codegen/linearizer.py      418      -2           18.4    -0.0
tinygrad/renderer/llvmir.py         135      -2           22.8    +1.2
tinygrad/dtype.py                    84      +5           17.0    +0.4
tinygrad/runtime/ops_llvm.py         49      -3           10.8    +0.9


total lines changes: -2

@geohot
Copy link
Collaborator

geohot commented Feb 9, 2024

There's a whole bunch of different unrelated things in this PR, it's unreviewable. Want to submit small, clear win PRs?

@geohot geohot closed this Feb 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants