Metal gives Error Domain=MTLLibraryErrorDomain Code=1 "MTLLibrary is not formatted as a MetalLib file" #2226

Open
JChunX opened this issue Nov 6, 2023 · 49 comments
Labels
bug Something isn't working

Comments

@JChunX

JChunX commented Nov 6, 2023

Hi, I am going through the tinygrad documentation and trying to run the basic Tensor example on abstractions.py

Environment:

  • M1 Mac, Sonoma 14.0,
  • Python 3.10.13
  • tinygrad source install Nov 6th

try_tensor.py:

from tinygrad.tensor import Tensor
a = Tensor([2])
b = Tensor([3])
result = a + b
print(f"{a.numpy()} + {b.numpy()} = {result.numpy()}")

However I am getting an error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tinygrad/tinygrad/tensor.py", line 126, in numpy
    return self.detach().cast(dtypes.from_np(self.dtype.np)).contiguous().to('CPU').realize().lazydata.realized.toCPU().reshape(self.shape)
  File "/tinygrad/tinygrad/tensor.py", line 105, in realize
    run_schedule(self.lazydata.schedule())
  File "/tinygrad/tinygrad/realize.py", line 27, in run_schedule
    si.out.realized = Device[si.out.device].exec_ast(si.ast, output=si.out, inputs=si.inputs, var_vals=si.var_vals, **si.out._device_extra_args())
  File "/tinygrad/tinygrad/ops.py", line 292, in exec_ast
    if ast not in self.method_cache: self.method_cache[ast] = get_program()
  File "/tinygrad/tinygrad/ops.py", line 289, in get_program
    return self.to_program(k)
  File "/tinygrad/tinygrad/ops.py", line 239, in to_program
    display_name=k.display_name, runtime_args=runtime_args).build(self.compiler, self.runtime)
  File "/tinygrad/tinygrad/ops.py", line 193, in build
    self.lib = compiler.__wrapped__(self.prg) if getenv("DISABLE_COMPILER_CACHE") else compiler(self.prg)
  File "/tinygrad/tinygrad/helpers.py", line 206, in wrapper
    return diskcache_put(table, key, func(*args, **kwargs))
  File "/tinygrad/tinygrad/runtime/ops_metal.py", line 51, in compile_metal
    unwrap(library.serializeToURL_error_(Cocoa.NSURL.URLWithString_(f"file://{output_file.name}"), None))
  File "/tinygrad/tinygrad/runtime/ops_metal.py", line 38, in unwrap
    assert err is None, str(err)
AssertionError: Error Domain=MTLLibraryErrorDomain Code=1 "MTLLibrary is not formatted as a MetalLib file." UserInfo={NSLocalizedDescription=MTLLibrary is not formatted as a MetalLib file.}

This was the Metal program generated:

#include <metal_stdlib>
using namespace metal;
kernel void E_(device float* data0, 
               const device float* data1, 
               const device float* data2, 
               uint3 gid [[threadgroup_position_in_grid]], 
               uint3 lid [[thread_position_in_threadgroup]]) {
    float val0 = *(data1+0);
    float val1 = *(data2+0);
    *(data0+0) = (val0+val1);
}

Compile options:

<MTLCompileOptionsInternal: 0x10b7eea90>
    preprocessorMacros:  
    fastMathEnabled = 1 
    framebufferReadEnabled = 0 
    preserveInvariance = 0 
    optimizationLevel = MTLLibraryOptimizationLevelDefault 
    libraryType = MTLLibraryTypeExecutable 
    installName = <null> 
    compileSymbolVisibility =  0 
    allowReferencingUndefinedSymbols =  0 
    maxTotalThreadsPerThreadgroup =  0 
    languageVersion = default
@alexbaden

I noticed the same issue. If you go through Xcode instead of the Metal shader API via METAL_XCODE then your example works, but other examples are pretty slow (I suppose because each shader compile is slower and the number of compiles goes way up?)

@alexbaden

alexbaden commented Nov 8, 2023

I managed to get an output by setting the metal library type to dynamic and adding an install path to the MTLCompileOptions. This gets me an output, but now I am seeing the following error when we load the function from the compiled shader library:

validateMTLFunctionType:8593: failed assertion `type is not a valid MTLFunctionType.'

I managed to dump the shader output, which looks like this:

; ModuleID = 'shader.air'
source_filename = "/var/folders/cb/x67yc6g126nfkr5_tr0cdcnh0000gn/T/tmpw5t7x1sm.metallib"
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024-n8:16:32"
target triple = "air64-apple-macosx14.1.0"

; Function Attrs: argmemonly mustprogress nofree norecurse nosync nounwind willreturn
define void @E_(float addrspace(1)* nocapture noundef writeonly "air-buffer-no-alias" %data0, float addrspace(1)* nocapture noundef readonly "air-buffer-no-alias" %data1, float addrspace(1)* nocapture noundef readonly "air-buffer-no-alias" %data2, <3 x i32> noundef %gid, <3 x i32> noundef %lid) local_unnamed_addr #0 {
entry:
  %0 = load float, float addrspace(1)* %data1, align 4, !tbaa !14, !alias.scope !18, !noalias !21
  %1 = load float, float addrspace(1)* %data2, align 4, !tbaa !14, !alias.scope !24, !noalias !25
  %sub = fsub fast float %0, %1
  store float %sub, float addrspace(1)* %data0, align 4, !tbaa !14, !alias.scope !26, !noalias !27
  ret void
}

attributes #0 = { argmemonly mustprogress nofree norecurse nosync nounwind willreturn "approx-func-fp-math"="true" "frame-pointer"="all" "min-legal-vector-width"="96" "no-builtins" "no-infs-fp-math"="true" "no-nans-fp-math"="true" "no-signed-zeros-fp-math"="true" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "unsafe-fp-math"="true" }

!llvm.module.flags = !{!0, !1, !2, !3, !4, !5, !6, !7}
!air.version = !{!8}
!air.compile_options = !{!9, !10, !11}
!llvm.ident = !{!12}
!air.language_version = !{!13}

The code looks correct, but the metadata is not right. I built the same shader locally using the metal command line tools installed with Xcode 15 and got a target triple for macOS 14.0.0 and a different metal version in the later metadata. A working shader has additional metadata for the kernel type:

!air.kernel = !{!15}
!15 = !{void (float addrspace(1)*, float addrspace(1)*, float addrspace(1)*, <3 x i32>, <3 x i32>)* @E_, !16, !17}

Could there be any issue with whatever driver the GPU is using for runtime compilation? Or perhaps some bug in pyobjc, but I don't think that's likely as there have been no major changes to the metal shader library in macOS 14.

This is on MacOS 14.1 by the way. Everything worked fine on this machine before I upgraded to Sonoma.

@wozeparrot wozeparrot added the bug Something isn't working label Nov 8, 2023
@alexbaden

I spent some more time looking into this, and I am stuck. It looks like the main issue is that on macOS 14 you cannot get a MTLFunction from a MTLDynamicLibrary, and MTLDynamicLibrary is the only library object with a serializeTo method. Even a small standalone example in Swift gives the error:

validateMTLFunctionType, line 8593: error 'type is not a valid MTLFunctionType.'

Buried in the header file I found this comment about MTLLibraryType:

 /**
     @property type
     @abstract Which type the library should be compiled as. The default value is MTLLibraryTypeExecutable.
     @discussion MTLLibraryTypeExecutable is suitable to build a library of "kernel", "vertex" and "fragment" qualified functions.
     MTLLibraryType is suitable when the compilation result will instead be used to instantiate a MTLDynamicLibrary.
     MTLDynamicLibrary contains no qualified functions, but it's unqualified functions and variables can be used as an external dependency for compiling other libraries.
    */
    @available(macOS 11.0, *)
    open var libraryType: MTLLibraryType

But I have not found anything in the documentation about how to use unqualified functions - and presumably we want to compile to some sort of byte representation in the compile_metal function, not build some external library and link it elsewhere.

In summary, it looks like the executable MTLLibrary is what we want, but whatever process allowed us to get the executable library as bytes on macOS 13 has changed, and using serializable dynamic libraries doesn't seem to give us access to the function through the runtime (perhaps it did before?).

@geohot
Collaborator

geohot commented Nov 9, 2023

I have macOS 13. Is 14 broken?

@boltzmann-brain

I have the same error on macOS 13.4.1 trying to run the mnist example.

Traceback (most recent call last):
  File "/Users/leif-hancox-li/tinygrad/examples/mnist_gan.py", line 98, in <module>
    loss_d += train_discriminator(optim_d, data_real, data_fake)
  File "/Users/leif-hancox-li/tinygrad/examples/mnist_gan.py", line 61, in train_discriminator
    optimizer.step()
  File "/Users/leif-hancox-li/tinygrad/tinygrad/nn/optim.py", line 52, in step
    self.t.assign(self.t + 1).realize()
  File "/Users/leif-hancox-li/tinygrad/tinygrad/tensor.py", line 105, in realize
    run_schedule(self.lazydata.schedule())
  File "/Users/leif-hancox-li/tinygrad/tinygrad/realize.py", line 27, in run_schedule
    si.out.realized = Device[si.out.device].exec_ast(si.ast, output=si.out, inputs=si.inputs, var_vals=si.var_vals, **si.out._device_extra_args())
  File "/Users/leif-hancox-li/tinygrad/tinygrad/ops.py", line 292, in exec_ast
    if ast not in self.method_cache: self.method_cache[ast] = get_program()
  File "/Users/leif-hancox-li/tinygrad/tinygrad/ops.py", line 289, in get_program
    return self.to_program(k)
  File "/Users/leif-hancox-li/tinygrad/tinygrad/ops.py", line 237, in to_program
    return ASTRunner(k.function_name, src, k.global_size, k.local_size,
  File "/Users/leif-hancox-li/tinygrad/tinygrad/ops.py", line 193, in build
    self.lib = compiler.__wrapped__(self.prg) if getenv("DISABLE_COMPILER_CACHE") else compiler(self.prg)
  File "/Users/leif-hancox-li/tinygrad/tinygrad/helpers.py", line 216, in wrapper
    return diskcache_put(table, key, func(*args, **kwargs))
  File "/Users/leif-hancox-li/tinygrad/tinygrad/runtime/ops_metal.py", line 51, in compile_metal
    unwrap(library.serializeToURL_error_(Cocoa.NSURL.URLWithString_(f"file://{output_file.name}"), None))
  File "/Users/leif-hancox-li/tinygrad/tinygrad/runtime/ops_metal.py", line 38, in unwrap
    assert err is None, str(err)
AssertionError: Error Domain=MTLLibraryErrorDomain Code=1 "MTLLibrary is not formatted as a MetalLib file." UserInfo={NSLocalizedDescription=MTLLibrary is not formatted as a MetalLib file.}

@roelofvandijk
Contributor

roelofvandijk commented Nov 17, 2023

Runs fine on macOS 14.1.1, macbook pro(m2). The library that is generated has these options:

<native-selector libraryType of <MTLCompileOptionsInternal: 0x12664bdb0>
    preprocessorMacros:  
    fastMathEnabled = 1 
    framebufferReadEnabled = 0 
    preserveInvariance = 0 
    optimizationLevel = MTLLibraryOptimizationLevelDefault 
    libraryType = MTLLibraryTypeExecutable 
    installName = <null> 
    compileSymbolVisibility =  0 
    allowReferencingUndefinedSymbols =  0 
    maxTotalThreadsPerThreadgroup =  0 
    languageVersion = default>

@alexbaden: Note that this is a MTLLibraryTypeExecutable that serializes without error.

@nullhook
Contributor

nullhook commented Nov 17, 2023

this seems to be working on macOS 14, macbook pro (m1)

here's the IR code for the exact same kernel, which works. the metal api recognizes it as a valid metallib file.

; ModuleID = 'shader.air'
source_filename = "E_"
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v24:32:32-v32:32:32-v48:64:64-v64:64:64-v96:128:128-v128:128:128-v192:256:256-v256:256:256-v512:512:512-v1024:1024:1024-n8:16:32"
target triple = "air64-apple-macosx14.1.0"

; Function Attrs: argmemonly mustprogress nofree norecurse nosync nounwind willreturn
define void @E_(float addrspace(1)* nocapture writeonly "air-buffer-no-alias" %data0, float addrspace(1)* nocapture readonly "air-buffer-no-alias" %data1, float addrspace(1)* nocapture readonly "air-buffer-no-alias" %data2, <3 x i32> %gid, <3 x i32> %lid) local_unnamed_addr #0 {
entry:
  %0 = load float, float addrspace(1)* %data1, align 4, !tbaa !22, !alias.scope !26, !noalias !29
  %1 = load float, float addrspace(1)* %data2, align 4, !tbaa !22, !alias.scope !32, !noalias !33
  %add = fadd fast float %1, %0
  store float %add, float addrspace(1)* %data0, align 4, !tbaa !22, !alias.scope !34, !noalias !35
  ret void
}

attributes #0 = { argmemonly mustprogress nofree norecurse nosync nounwind willreturn "approx-func-fp-math"="true" "frame-pointer"="all" "min-legal-vector-width"="96" "no-builtins" "no-infs-fp-math"="true" "no-nans-fp-math"="true" "no-signed-zeros-fp-math"="true" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "unsafe-fp-math"="true" }

!llvm.module.flags = !{!0, !1, !2, !3, !4, !5, !6, !7}
!llvm.ident = !{!8}
!air.version = !{!9}
!air.language_version = !{!10}
!air.compile_options = !{!11, !12, !13}
!air.kernel = !{!14}

immediate mismatches i see, language_version, compile_options, ident, and kernel.

you could try patching the output binary with the above metadata and load it as a test.
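
To spot the mismatch without eyeballing the dumps, here is a minimal sketch that compares which named metadata nodes appear in each IR dump. The helper name named_metadata and the regex are illustrative assumptions, not part of tinygrad or Metal; the two inputs are the metadata lines copied from the dumps in this thread.

```python
import re

# Matches named-metadata definitions such as "!air.kernel = !{!14}".
# Numbered nodes like "!15 = !{...}" are skipped because they start with a digit.
NAMED_MD = re.compile(r"^!([A-Za-z_][\w.]*)\s*=\s*!\{", re.MULTILINE)

def named_metadata(ir_text: str) -> set[str]:
    """Return the names of all named metadata nodes found in an LLVM IR dump."""
    return set(NAMED_MD.findall(ir_text))

# Metadata lines from the failing dump earlier in this thread.
broken = """\
!llvm.module.flags = !{!0, !1, !2, !3, !4, !5, !6, !7}
!air.version = !{!8}
!air.compile_options = !{!9, !10, !11}
!llvm.ident = !{!12}
!air.language_version = !{!13}
"""

# Metadata lines from the working dump above.
working = """\
!llvm.module.flags = !{!0, !1, !2, !3, !4, !5, !6, !7}
!llvm.ident = !{!8}
!air.version = !{!9}
!air.language_version = !{!10}
!air.compile_options = !{!11, !12, !13}
!air.kernel = !{!14}
"""

missing = named_metadata(working) - named_metadata(broken)
print(sorted(missing))  # ['air.kernel']
```

This only compares node names; differences in the values behind language_version, compile_options and ident would still need a look at the numbered nodes they point to.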

@roelofvandijk
Contributor

This seems promising: #2369

@alexbaden

I am still having this problem - M1 MacBook Air (2020) with macOS 14.1.1. Interestingly with #2369 I get a different problem:

  File "/Users/alexb/Projects/tinygrad/tinygrad/ops.py", line 314, in to_program
    display_name=k.display_name, runtime_args=runtime_args).build(self.compiler, self.runtime)
  File "/Users/alexb/Projects/tinygrad/tinygrad/ops.py", line 280, in build
    self.clprg = runtime(self.name, self.lib)
  File "/Users/alexb/Projects/tinygrad/tinygrad/runtime/ops_metal.py", line 61, in __init__
    self.library = unwrap(METAL.device.newLibraryWithData_error_(data, None))
  File "/Users/alexb/Projects/tinygrad/tinygrad/runtime/ops_metal.py", line 45, in unwrap
    assert err is None, str(err)
AssertionError: Error Domain=MTLLibraryErrorDomain Code=1 "Invalid library file" UserInfo={NSLocalizedDescription=Invalid library file}

where line 61 is the call to newLibraryWithData. It would seem something is wacky with exporting the compiled shader library on certain M-series Macs? This worked fine on the exact same machine w/ macOS 13.x.

@nullhook
Contributor

nullhook commented Nov 20, 2023

the underlying issue is that serializeToURL isn't a publicly exposed method on MTLLibrary instances.

The pyobjc bindings we use inspect the Objective-C runtime for methods, regardless of whether they are documented. In general we should limit ourselves to the API described by Metal.framework.

The new error you're seeing is related to raw bytes being passed to newLibraryWithData; if you use the MetalLibrary instance directly you wouldn't need it.

@montezdesousa

montezdesousa commented Nov 20, 2023

I managed to skip that same error by setting the library type in the options to dynamic. Btw this works the same if writing to file.

But then kernel just crashes at line 70. Because fxn is None, even though it is inside the library...
Screenshot 2023-11-20 at 20 58 54

snapshot of library
Screenshot 2023-11-20 at 21 00 16

@nullhook
Contributor

nullhook commented Nov 20, 2023

MTLDynamicLibrary isn't a viable solution, it isn't supported on all devices. Macs with Intel or AMD gpus will suffer.

@nullhook
Contributor

nullhook commented Nov 20, 2023

@alexbaden can you try this: #2372 (still WIP)

@alexbaden

#2372 resolves this issue for me!

@bojanbabic

bojanbabic commented Nov 29, 2023

Same issue on Mac M1 with Sonoma 14.1.1 and Apple Silicon. It works if I apply patch from #2372 and add:
METAL_XCODE=1 to the shell script.

@BrianLitwin

BrianLitwin commented Nov 30, 2023

Same issue on Mac w/ 13.6 - "Invalid library file" at METAL.device.newLibraryWithData_error_ - #2372 resolves the issue

@chenyuxyz chenyuxyz pinned this issue Dec 2, 2023
@chenyuxyz chenyuxyz changed the title Running introductory Tensor example on Metal gives Error Domain=MTLLibraryErrorDomain Code=1 "MTLLibrary is not formatted as a MetalLib file" Metal gives Error Domain=MTLLibraryErrorDomain Code=1 "MTLLibrary is not formatted as a MetalLib file" Dec 2, 2023
@arthur-brainchain

arthur-brainchain commented Dec 3, 2023

I tried to run the patch from the other repo as mentioned by @bojanbabic above, and set METAL_XCODE, etc, but still not able to run, for example, beautiful_mnist.py or beautiful_cartpole.py with the patch. Could someone care to share a set of build steps with the fix, or alternatively suggest something else?

System Info:

Screenshot 2023-12-03 at 6 58 38 PM

Sonoma 14.1.1

@nima10khodaveisi
Contributor

xcode-select --switch /Applications/Xcode.app/Contents/Developer

fixed the issue for me.

@SamRaymond
Sponsor

This took a while...
My M2 Air works like a charm: Metal was in the site-packages. On my M1 Pro 16, Metal wasn't in the packages.
If I run METAL_XCODE=1 python docs/abstractions2.py
with DEVICE = "GPU" this works for me, same for other files.

Running on M1 Macbook Pro 16 inch 2021 (14.2.1 (23C71))

@claydugo

Piggybacking from @SamRaymond 's work.

I got to this issue after trying to run the hello world matmul example in the README

$ METAL_XCODE=1 DEBUG=3 python -c "from tinygrad import Tensor;N = 1024; a, b = Tensor.rand(N, N), Tensor.rand(N, N);c = (a.reshape(N, 1, N) * b.T.reshape(1, N, N)).sum(axis=2);print((c.numpy() - (a.numpy() @ b.numpy())).mean())"
*** METAL   rand  seed 1703711284 size 1048576         dtype dtypes.float
*** METAL   rand  seed 1703711285 size 1048576         dtype dtypes.float
  0 ━┳ STORE MemBuffer(idx=0, dtype=dtypes.float, st=ShapeTracker(views=(View(shape=(1024, 1024, 1), strides=(1024, 1, 0), offset=0, mask=None, contiguous=True),)))
  1  ┗━┳ SUM (1024, 1024, 1)
  2    ┗━┳ MUL
  3      ┣━━ LOAD MemBuffer(idx=1, dtype=dtypes.float, st=ShapeTracker(views=(View(shape=(1024, 1024, 1024), strides=(1024, 0, 1), offset=0, mask=None, contiguous=False),)))
  4      ┗━━ LOAD MemBuffer(idx=2, dtype=dtypes.float, st=ShapeTracker(views=(View(shape=(1024, 1024, 1024), strides=(0, 1, 1024), offset=0, mask=None, contiguous=False),)))
TENSOR CORES [(1, 1024, 1)] [(0, 1024, 1024)] tensor_core<METAL, [8, 8, 8], dtypes.float, dtypes.float>
3 alias 1: idxs= [Variable('gidx0', 0, 31), NumNode(0), NumNode(0), Variable('lidx3', 0, 1), NumNode(0), ((Variable('lidx4', 0, 15)//2)%4), NumNode(0), Variable('ridx7', 0, 127), (((Variable('lidx4', 0, 15)%2)*2)+((Variable('lidx4', 0, 15)//8)*4)+Variable('None', 0, 1)), NumNode(0), Variable('None', 0, 3), NumNode(0)]
4 alias 2: idxs= [NumNode(0), Variable('gidx1', 0, 7), Variable('lidx2', 0, 3), NumNode(0), (Variable('lidx4', 0, 15)//8), NumNode(0), (Variable('lidx4', 0, 15)%2), Variable('ridx7', 0, 127), (((Variable('lidx4', 0, 15)//2)%4)+(Variable('lidx3', 0, 1)*4)), Variable('None', 0, 1), NumNode(0), Variable('None', 0, 3)]
Traceback (most recent call last):
  File "<string>", line 4, in <module>
  File "/Users/clay/git/forks/tinygrad/tinygrad/tensor.py", line 131, in numpy
    return self.cast(self.dtype.scalar()).contiguous().realize().lazydata.base.realized.toCPU().astype(self.dtype.np, copy=True).reshape(self.shape)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/clay/git/forks/tinygrad/tinygrad/tensor.py", line 101, in realize
    run_schedule(self.lazydata.schedule())
  File "/Users/clay/git/forks/tinygrad/tinygrad/realize.py", line 28, in run_schedule
    prg = lower_schedule_item(si)
          ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/clay/git/forks/tinygrad/tinygrad/realize.py", line 21, in lower_schedule_item
    return Device[si.out.device].get_runner(si.ast)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/clay/git/forks/tinygrad/tinygrad/device.py", line 314, in get_runner
    def get_runner(self, ast:LazyOp) -> CompiledASTRunner: return self.to_program(self.get_linearizer(ast))
                                                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/clay/git/forks/tinygrad/tinygrad/device.py", line 286, in to_program
    return CompiledASTRunner(k.ast, k.name, src, k.global_size, k.local_size, runtime_args).build(self.compiler, self.runtime)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/clay/git/forks/tinygrad/tinygrad/device.py", line 255, in build
    self.clprg = runtime(self.name, self.lib)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/clay/git/forks/tinygrad/tinygrad/runtime/ops_metal.py", line 30, in __init__
    self.library = unwrap2(self.device.device.newLibraryWithData_error_(data, None))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/clay/git/forks/tinygrad/tinygrad/helpers.py", line 46, in unwrap2
    assert err is None, str(err)
           ^^^^^^^^^^^
AssertionError: Error Domain=MTLLibraryErrorDomain Code=1 "Invalid library file" UserInfo={NSLocalizedDescription=Invalid library file}

with GPU=1

GPU=1 METAL_XCODE=1 DEBUG=3 python -c "from tinygrad import Tensor;N = 1024; a, b = Tensor.rand(N, N), Tensor.rand(N, N);c = (a.reshape(N, 1, N) * b.T.reshape(1, N, N)).sum(axis=2);print((c.numpy() - (a.numpy() @ b.numpy())).mean())"
CLDevice: got 1 platforms and 1 devices
*** GPU   rand  seed 1703711410 size 1048576         dtype dtypes.float
*** GPU   rand  seed 1703711411 size 1048576         dtype dtypes.float
  0 ━┳ STORE MemBuffer(idx=0, dtype=dtypes.float, st=ShapeTracker(views=(View(shape=(1024, 1024, 1), strides=(1024, 1, 0), offset=0, mask=None, contiguous=True),)))
  1  ┗━┳ SUM (1024, 1024, 1)
  2    ┗━┳ MUL
  3      ┣━━ LOAD MemBuffer(idx=1, dtype=dtypes.float, st=ShapeTracker(views=(View(shape=(1024, 1024, 1024), strides=(1024, 0, 1), offset=0, mask=None, contiguous=False),)))
  4      ┗━━ LOAD MemBuffer(idx=2, dtype=dtypes.float, st=ShapeTracker(views=(View(shape=(1024, 1024, 1024), strides=(0, 1, 1024), offset=0, mask=None, contiguous=False),)))
*** GPU        1 r_32_16_8_16_256_4_4_4                arg   3 mem  0.01 GB tm   7579.38us/     7.58ms (  283.33 GFLOPS,    1.66 GB/s)
2.5320332e-09
avg:   283.33 GFLOPS     1.66 GB/s           total:     1 kernels     2.15 GOPS     0.01 GB     7.58 ms

On Sonoma 14.2.1, MacBook Air M1 2020

@10-zin

10-zin commented Dec 28, 2023

For all facing this issue -> just run the existing master with these two env variables "METAL_XCODE=1 DISABLE_COMPILER_CACHE=1".
It passes all the tests for m1 macos sonoma 14.x.x

Essentially there is some issue when you use the cached metal library, on M1 Air macOS 14.x.x at least.
With DISABLE_COMPILER_CACHE=1, line 254 in device.py calls the wrapped attr, so basically it runs without the added diskcache functionality and re-compiles every time, which makes it work.

Got the hint from #2372 and the latest discussion there.
This way you don't even need to change any code.
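
If you launch scripts from Python rather than a shell, the same two variables can be passed through a subprocess environment. A minimal sketch; workaround_env is a hypothetical helper name, and the only things taken from this thread are the two variable names and the try_tensor.py script.

```python
import os

def workaround_env(base=None):
    """Copy an environment mapping and add the two variables from this comment."""
    env = dict(os.environ if base is None else base)
    env["METAL_XCODE"] = "1"             # compile through Xcode's metal tools
    env["DISABLE_COMPILER_CACHE"] = "1"  # skip the on-disk compiler cache
    return env

# Example (not run here): launch a script with the workaround applied.
# import subprocess, sys
# subprocess.run([sys.executable, "try_tensor.py"], env=workaround_env(), check=True)
```

Copying the environment instead of mutating os.environ keeps the workaround scoped to the child process.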

@DKormann
Contributor

@certik you probably need to clear the cache; it should be at ~/Library/Caches/tinygrad/
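
A small sketch for clearing it from Python. The directory is the one named above, so treat it as an assumption for your setup; clear_cache is a hypothetical helper, not a tinygrad function.

```python
import shutil
from pathlib import Path

# Location taken from the comment above; it may differ on your machine.
DEFAULT_CACHE_DIR = Path.home() / "Library" / "Caches" / "tinygrad"

def clear_cache(cache_dir: Path = DEFAULT_CACHE_DIR) -> bool:
    """Delete the cache directory if it exists. Returns True if something was removed."""
    if cache_dir.is_dir():
        shutil.rmtree(cache_dir)
        return True
    return False

# Example (not run here): clear_cache()
```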

@dattienle2573

GPU=1 works for me but any ideas how to run with ipynb files?

@DKormann
Contributor

DKormann commented Mar 4, 2024

@dattienle2573
from tinygrad import Device
Device.DEFAULT = "GPU"

@Leikoe
Contributor

Leikoe commented Mar 14, 2024

seems related to this issue in pyobjc: ronaldoussoren/pyobjc#580,
people can/can't run the same snippet based on their python version.

Don't know if this helps, explains it...., but basically this is the error

import Metal, libdispatch

prg="""#include <metal_stdlib>
using namespace metal;
kernel void r_5(device int* data0, const device int* data1, uint3 gid [[threadgroup_position_in_grid]], uint3 lid [[thread_position_in_threadgroup]]) {
  int acc0 = -2147483648;
  int val0 = *(data1+0);
  int val1 = *(data1+1);
  int val2 = *(data1+2);
  int val3 = *(data1+3);
  int val4 = *(data1+4);
  int alu0 = max(((val0+1)*2*val0),0);
  int alu1 = max(((val1+1)*2*val1),0);
  int alu2 = max(((val2+1)*2*val2),0);
  int alu3 = max(((val3+1)*2*val3),0);
  int alu4 = max(((val4+1)*2*val4),0);
  int alu5 = max(alu0,acc0);
  int alu6 = max(alu1,alu5);
  int alu7 = max(alu2,alu6);
  int alu8 = max(alu3,alu7);
  int alu9 = max(alu4,alu8);
  *(data0+0) = alu9;
}"""



options = Metal.MTLCompileOptions.new()

print(f"{options=}")
print(f"{options.libraryType=}")
compiler = Metal.MTLCreateSystemDefaultDevice()




def unwrap2(x):
  ret, err = x
  assert err is None, str(err)
  return ret

r = compiler.newLibraryWithSource_options_error_(prg, options, None)
library = unwrap2(r)
lib = library.libraryDataContents().bytes().tobytes()

device = compiler

#### 
data = libdispatch.dispatch_data_create(lib, len(lib), None, None)
library = unwrap2(device.newLibraryWithData_error_(data, None))
fxn = library.newFunctionWithName_("r_5")
pipeline_state = unwrap2(device.newComputePipelineStateWithFunction_error_(fxn, None))

Which outputs

Traceback (most recent call last):
  File "/Users/tyoc213/github/tinygrad/test3.py", line 50, in <module>
    library = unwrap2(device.newLibraryWithData_error_(data, None))
  File "/Users/tyoc213/github/tinygrad/test3.py", line 38, in unwrap2
    assert err is None, str(err)
AssertionError: Error Domain=MTLLibraryErrorDomain Code=1 "Invalid library file" UserInfo={NSLocalizedDescription=Invalid library file}

Don't know if that makes more clear what is going on?

the snippet here works in python 3.12 but doesn't in 3.10.12 for me
note that running tinygrad with metal still gives the same error even though the snippet runs without error.

@Leikoe
Contributor

Leikoe commented Mar 14, 2024

update: it was compiler cache, the error in this issue doesn't occur when using python 3.12

@Leikoe
Contributor

Leikoe commented Mar 15, 2024

both python 3.12 and 3.10 use pyobjc-* 10.1.
Did the following tests:

  • Compile shader with python3.10, load on python3.12 (doesn't work)
  • Compile shader with python3.12, load on python3.10 (works)

Seems that we have a problem with the pyobjc metal compiler bindings on python <3.12
edit: after looking through the shader compiled by python3.10 objc-metal, we see that it doesn't begin with the MTLB magic
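
A quick way to check for that magic on the bytes you get back, as a minimal sketch. looks_like_metallib is a hypothetical helper; it only tests the first four bytes and is not a full validation of the file format.

```python
MTLB_MAGIC = b"MTLB"

def looks_like_metallib(blob: bytes) -> bool:
    """True if the bytes start with the 'MTLB' magic mentioned above."""
    return blob[:4] == MTLB_MAGIC

# Example (not run here), with `lib` being the bytes produced by the script above:
# print(looks_like_metallib(lib))
```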

@Leikoe
Contributor

Leikoe commented Mar 15, 2024

[image] The same shader compiled with the script above on python 3.12 (left) and python 3.10 (right). Both the beginning and the end of the compiled shader are wrong on python <3.12

@certik

certik commented Mar 15, 2024

@Leikoe great job investigating this!

@Leikoe
Contributor

Leikoe commented Mar 15, 2024

looks like xcode doesn't use the same metal compiler as objc metal bindings (14.0.0 for xcode and 14.1.0 on my machine)

@Leikoe
Contributor

Leikoe commented Mar 15, 2024

looking at metadata in the generated library bytes by objc-metal bindings

image (left is python3.12 and right is python3.10)

We see that they both are using the same "Apple metal version 32023.26"

@Leikoe
Contributor

Leikoe commented Mar 15, 2024

TLDR: on python 3.10.12, compiler.newLibraryWithSource_options_error_(prg, options, None) produces an invalid library.

Workarounds: use @10-zin 's answer

For all facing this issue -> just run the existing master with these two env variables "METAL_XCODE=1 DISABLE_COMPILER_CACHE=1". It passes all the tests for m1 macos sonoma 14.x.x

Essentially there is some issue when you use the cached metal library, on M1 Air macOS 14.x.x at least. With DISABLE_COMPILER_CACHE=1, line 254 in device.py calls the wrapped attr, so basically it runs without the added diskcache functionality and re-compiles every time, which makes it work.

Got the hint from #2372 and the latest discussion there. This way you don't even need to change any code.

or use python 3.12 or 3.8

(just discovered that it works in 3.8 idk at this point)

@Leikoe
Contributor

Leikoe commented Mar 15, 2024

new update: It is probably not a python version problem but rather a miniforge3 (or maybe conda?) problem.

my python3.10 was miniforge3's and by using brew's it fixed it.

This still doesn't make much sense as my pyobjc bindings are from pypi anyways and same ver.
image

update: created two venvs, one with brew's python3.10 and one with miniforge3's conda python3.10 and diff -bur all 4 pyobjc packages are identical, it's not pyobjc.

starting to think it's miniconda3's python binary 😭

@pushpendre

pushpendre commented Mar 16, 2024

Yeah can confirm, I was able to run the program using the following commands

/Users/$USER/.homebrew/bin/python3 -m venv hb_py_3_12
source hb_py_3_12/bin/activate
python -m pip install git+https://github.com/tinygrad/tinygrad.git
cat > try_tensor.py <<EOF
from tinygrad.tensor import Tensor
a = Tensor([2])
b = Tensor([3])
result = a + b
print(f"{a.numpy()} + {b.numpy()} = {result.numpy()}")
EOF
DEBUG=9 python try_tensor.py 

The above works and shows that it's using METAL, but then if I switch to conda's python

deactivate # deactivate existing virtualenv
conda activate tg
which python # /Users/$USER/anaconda3/envs/tg/bin/python
python try_tensor.py 

then I get the error.

Both conda and the virtualenv have same-ish python version

/Users/$USER/anaconda3/envs/tg/bin/python3 --version # Python 3.12.1
hb_py_3_12/bin/python --version # Python 3.12.2

The info strings for conda and homebrew's python are as follows:

Python 3.12.1 | packaged by Anaconda, Inc. | (main, Jan 19 2024, 09:45:58) [Clang 14.0.6 ] on darwin
Python 3.12.2 (main, Feb  6 2024, 20:19:44) [Clang 15.0.0 (clang-1500.1.0.2.5)] on darwin
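
To tell the two interpreters apart from inside a script, here is a rough heuristic sketch. looks_like_conda is a hypothetical helper; it assumes conda environments contain a conda-meta directory under sys.prefix and that conda builds mention their packager in sys.version, as the Anaconda info string above does.

```python
import os
import sys

def looks_like_conda(prefix: str = sys.prefix, version: str = sys.version) -> bool:
    """Heuristic check for a conda-packaged interpreter (not an official API)."""
    # conda environments keep package records in <prefix>/conda-meta
    has_conda_meta = os.path.isdir(os.path.join(prefix, "conda-meta"))
    # conda builds usually tag sys.version, e.g. "packaged by Anaconda, Inc."
    packager_tag = any(tag in version for tag in ("Anaconda", "conda-forge"))
    return has_conda_meta or packager_tag

# Example (not run here): print(looks_like_conda())
```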

@Leikoe
Contributor

Leikoe commented Mar 17, 2024

likely a pyobjc/libobjc problem: the objc message-send logs differ with the same python + pyobjc* packages

@Leikoe
Contributor

Leikoe commented Mar 17, 2024

weirder than expected, objective C logs from the working system python3.10.13 are duplicated but conda's python3.10.13 aren't ??

@Leikoe
Contributor

Leikoe commented Mar 17, 2024

For reference:
NSObjCMessageLoggingEnabled=YES or https://github.com/ronaldoussoren/pyobjc/blob/master/pyobjc-core/Examples/Scripts/instrumentSends.py to log sends
DYLD_PRINT_LIBRARIES=1 to see the loaded dylibs
OBJC_HELP=1 to list the Objective-C runtime debugging variables (set variable=YES to enable one)

see also:
https://wincent.com/wiki/Instrumenting_Objective-C_message_sends
https://www.dribin.org/dave/blog/archives/2006/04/22/tracing_objc/

@Leikoe
Contributor

Leikoe commented Mar 17, 2024

when running the same script with conda python3.10.13 and brew python 3.10.13 with
OBJC_PRINT_CLASS_SETUP=1, the following can be seen:

conda's (tinygrad3.10.13):

OBJC_PRINT_CLASS_SETUP=1 python extra/create_metallib.py                                                                                                               
objc[75281]: CLASS: found 78 classes during launch

brew:

(.venv) leo@leikos-macbook-pro tinygrad % OBJC_PRINT_CLASS_SETUP=1 python extra/create_metallib.py
objc[75405]: CLASS: found 2232 classes during launch

The objective C code running is indeed different when using conda's python
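
For comparing interpreters in bulk, the class count can be pulled out of that log line. A minimal sketch; classes_at_launch is a hypothetical helper and the pattern is based only on the two log lines shown above.

```python
import re

# Pattern modelled on "objc[75281]: CLASS: found 78 classes during launch".
CLASS_LINE = re.compile(r"objc\[\d+\]: CLASS: found (\d+) classes during launch")

def classes_at_launch(log: str):
    """Return the class count reported by OBJC_PRINT_CLASS_SETUP=1, or None if absent."""
    m = CLASS_LINE.search(log)
    return int(m.group(1)) if m else None

conda_log = "objc[75281]: CLASS: found 78 classes during launch"
brew_log = "objc[75405]: CLASS: found 2232 classes during launch"
print(classes_at_launch(conda_log), classes_at_launch(brew_log))  # 78 2232
```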

@Leikoe
Contributor

Leikoe commented Mar 18, 2024

This bug still occurs when using conda packaged PyObjc.

@Leikoe
Contributor

Leikoe commented Mar 18, 2024

bug occurred in pyobjc=10.1 and still occurs in pyobjc=10.2

@the-praxs

Got this error on my conda python 3.12. So can't say it's working on 3.12 entirely.

@Leikoe
Contributor

Leikoe commented Mar 20, 2024

Got this error on my conda Python 3.12, so I can't say it's working on 3.12 entirely.

Maybe I wasn't clear enough: every conda Python version I've tried triggers this bug.

@eugeneteoh

I'm getting the same error with miniconda on my Intel MacBook Pro 2020. Installing tinygrad directly on the system Python works fine. Another confirmation that this is specific to miniconda.
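
Since the reports point at conda interpreters specifically, it can help to state in a bug report which kind of Python is running. A minimal sketch, assuming the common heuristic that a conda environment has a `conda-meta` directory under `sys.prefix`; `is_conda_python` is a hypothetical helper, and the check is not an official conda API.

```python
import os
import sys

def is_conda_python(prefix=None):
    """Heuristic: conda environments keep package metadata in <prefix>/conda-meta."""
    prefix = sys.prefix if prefix is None else prefix
    return os.path.isdir(os.path.join(prefix, "conda-meta"))

print(f"prefix={sys.prefix} conda={is_conda_python()}")
```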

@Leikoe
Contributor

Leikoe commented Apr 17, 2024

If anyone wants to help: set a breakpoint with b -[_MTLDevice newLibraryWithSource:options:error:] when running metal_minimal.py under lldb, where metal_minimal.py contains:

import ctypes
from ctypes import CDLL, c_char_p, c_void_p, c_ulong, util


libobjc = CDLL(util.find_library("objc"))

# Class objc_getClass(const char *name)
libobjc.objc_getClass.restype = c_void_p
libobjc.objc_getClass.argtypes = [c_char_p]
ensure_bytes = lambda bs: bs if isinstance(bs, bytes) else bs.encode()
getClass = lambda name: libobjc.objc_getClass(ensure_bytes(name))

libobjc.sel_registerName.restype = c_void_p
libobjc.sel_registerName.argtypes = [c_char_p]

libobjc.objc_msgSend.restype = c_void_p
libobjc.objc_msgSend.argtypes = [c_void_p, c_void_p]

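def objc_msgSend(obj, sel, *args, restype=c_void_p, argtypes=()):
    # objc_msgSend has no fixed prototype, so set the signature for each call:
    # receiver and selector first, then the selector-specific argument types.
    msgSend = libobjc.objc_msgSend
    msgSend.restype = restype
    msgSend.argtypes = [c_void_p, c_void_p] + list(argtypes)
    return msgSend(obj, libobjc.sel_registerName(sel.encode()), *args)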

# CDLL("/System/Library/Frameworks/CoreFoundation.framework/CoreFoundation")
CDLL("/System/Library/Frameworks/Foundation.framework/Foundation")

NSString = getClass("NSString")
assert NSString is not None
def to_nsstring(s: bytes):
    r = objc_msgSend(NSString, "stringWithUTF8String:", ctypes.create_string_buffer(s), argtypes=[c_char_p])
    assert r is not None
    return r
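def from_nsstring(nsstring: c_void_p):
    # UTF8String returns a NUL-terminated C string, so no explicit size is needed
    # (NSString's length counts UTF-16 units, not UTF-8 bytes).
    return ctypes.string_at(objc_msgSend(nsstring, "UTF8String")).decode()
def from_nsdata(nsdata: c_void_p):
    # Read from the nsdata argument rather than the global libraryDataContents; length is an NSUInteger.
    return ctypes.string_at(objc_msgSend(nsdata, "bytes"), size=objc_msgSend(nsdata, "length", restype=c_ulong))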


metal = CDLL("/System/Library/Frameworks/Metal.framework/Metal")
# CDLL("/System/Library/Frameworks/MetalTools.framework/MetalTools")
# CDLL("/System/Library/Frameworks/MetalPerformanceShaders.framework/MetalPerformanceShaders")
core_graphics = CDLL("/System/Library/Frameworks/CoreGraphics.framework/CoreGraphics")

metal.MTLCreateSystemDefaultDevice.restype = c_void_p
metal.MTLCreateSystemDefaultDevice.argtypes = []

dev = metal.MTLCreateSystemDefaultDevice()
print(f"device pointer: {dev}")

source = """
#include <metal_stdlib>
using namespace metal;
kernel void r_32_256_2_20_20n1(device float* data0, const device float* data1, const device float* data2, const device float* data3, const device float* data4, const device float* data5, const device float* data6, uint3 gid [[threadgroup_position_in_grid]], uint3 lid [[thread_position_in_threadgroup]]) {
  threadgroup float temp[256];
  int gidx0 = gid.x; /* 32 */
  int lidx1 = lid.x; /* 256 */
  float acc0 = 0.0f;
  float val0 = *(data2+gidx0);
  float val1 = *(data3+gidx0);
  for (int ridx0 = 0; ridx0 < 2; ridx0++) {
    for (int ridx1 = 0; ridx1 < 20; ridx1++) {
      for (int ridx2 = 0; ridx2 < 20; ridx2++) {
        float val2 = *(data1+(gidx0*400)+(lidx1*25600)+(ridx0*12800)+(ridx1*20)+ridx2);
        int alu0 = ((gidx0*100)+(lidx1*6400)+(ridx0*3200)+((ridx1/2)*10)+(ridx2/2));
        float val3 = *(data4+alu0);
        float val4 = *(data5+alu0);
        float val5 = *(data6+alu0);
        float alu1 = (val2-val0);
        acc0 = ((alu1*val1*((float)(((alu1*val1)==val3))/val4)*val5)+acc0);
      }
    }
  }
  *(temp+lidx1) = acc0;
  threadgroup_barrier(mem_flags::mem_threadgroup);
  if ((lidx1<1)) {
    float acc1 = 0.0f;
    for (int ridx3 = 0; ridx3 < 256; ridx3++) {
      float val6 = *(temp+ridx3);
      acc1 = (val6+acc1);
    }
    *(data0+gidx0) = acc1;
  }
}"""
lib = objc_msgSend(dev, "newLibraryWithSource:options:error:", to_nsstring(source.encode()), c_void_p(0), c_void_p(0), argtypes=[c_void_p, c_void_p, c_void_p])
libraryDataContents = objc_msgSend(lib, "libraryDataContents")
print(from_nsdata(libraryDataContents))
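
A quick sanity check on the bytes printed by the script might look like the sketch below. It rests on an assumption that is not confirmed anywhere in this thread: that a well-formed MetalLib blob begins with the four ASCII bytes `MTLB`. Both `METALLIB_MAGIC` and `looks_like_metallib` are hypothetical names introduced here.

```python
# Assumption: MetalLib binaries start with the ASCII magic "MTLB".
METALLIB_MAGIC = b"MTLB"

def looks_like_metallib(data):
    """Return True if data is bytes-like and starts with the assumed MetalLib magic."""
    if not isinstance(data, (bytes, bytearray)):
        return False
    return bytes(data[:4]) == METALLIB_MAGIC

print(looks_like_metallib(b"MTLB\x01\x00"), looks_like_metallib(b"not a metallib"))
```

Running it on the output of `from_nsdata(libraryDataContents)` under both the conda and the brew interpreter would show whether the "not formatted as a MetalLib file" error corresponds to a blob with a different header.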

@Leikoe
Contributor

Leikoe commented Apr 30, 2024

After a long investigation, here is some news:

This bug is due to the compilation request having the wrong type (3 instead of 13) when conda's Python is used to call the Metal compiler.

The type is set as requestType in the XPCCompilerConnection::BuildRequestInternal call during compilation.

Investigating the root cause right now.
