Triton - Assertion failure: "Unexpected MMA layout version found" #142

Closed
clxyder opened this issue Apr 8, 2023 · 7 comments

clxyder commented Apr 8, 2023

Has anyone run into this issue? I am currently on commit 9463299 of the triton branch.

python: /project/lib/Analysis/Utility.cpp:136: bool mlir::supportMMA(mlir::Value, int): Assertion `(version == 1 || version == 2) && "Unexpected MMA layout version found"' failed.
Aborted

I found issue #1271 in the triton repo, but there doesn't seem to be a solution to it yet.
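
As a quick sanity check on the hardware angle, here is a minimal sketch (my assumption: the assertion fires because Triton's MMA layouts target tensor cores, which need compute capability 7.0 or newer, while a Pascal card like the 1070 Ti reports 6.1):

# Hedged sanity check: assumes the MMA assertion is tied to tensor-core
# support, which requires compute capability >= 7.0 (Volta or newer).
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")  # a GTX 1070 Ti reports 6.1
print("Tensor-core MMA available:", (major, minor) >= (7, 0))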

Setup Details

GPU: GTX 1070 Ti
CUDA: 11.8
OS: Win 10 via WSL2

Reproduction

To reproduce this, I am using the following setup for my model and quantized weights:

Python dependencies

  • safetensors==0.3.0
  • datasets==2.10.1
  • sentencepiece==0.1.97
  • transformers==4.28.0.dev0
  • accelerate==0.17.1
  • triton==2.0.0
  • torch==2.0.0+cu118
  • protobuf==3.20.3

Execution

I am running the inference with the following script:

models_dir="/repos/language-models/models"
llama_7b_hf="${models_dir}/llama-7b-hf"
llama_4bit="${models_dir}/llama7b-4bit-v2/llama7b-4bit-ts-ao-g128-v2.safetensors"
prompt="Building a website can be done in 10 simple steps:"

CUDA_VISIBLE_DEVICES=0 python llama_inference.py "${llama_7b_hf}" --wbits 4 --groupsize 128 --load "${llama_4bit}" --text "${prompt}" --max_length 512

CUDA

I was able to run the exact same model and quantized weights using the cuda branch at commit 610fdae.

Everything worked fine, but it just took a long time to load the model and perform inference.

After reading #82 (comment), I wanted to check out the triton branch and experience it for myself.

Has anyone run into a similar issue? Or is there someone who can vouch that this works on a 1070 Ti?


Ph0rk0z commented Apr 8, 2023

Me too.. Pascal card here. They fsck us.


clxyder commented Apr 9, 2023

Hey @Ph0rk0z, thank you for confirming the issue is with our hardware.

Would you mind leaving a comment on the triton issue I linked above to get the attention of the triton team?

Hopefully we can have a solution soon.

@C0deMunk33

Also getting this on the cuda branch; Pascal card here as well.


clxyder commented Apr 11, 2023

Thanks for letting us know, @C0deMunk33. Would you mind also leaving a comment on the triton issue?

Thanks again!


clxyder commented Apr 13, 2023

Closing, since triton-lang/triton#1505 seems to enable f32 inference on Pascal-series GPUs.

clxyder closed this as completed Apr 13, 2023

Ph0rk0z commented Apr 14, 2023

Does it work for you? I tried it and got this:

error: 'llvm.intr.fmuladd' op requires the same type for all operands and results
Pass execution failed
LLVM ERROR: Failed to translate TritonGPU to LLVM IR.
Aborted

I will check with the stock implementation.


Ph0rk0z commented Apr 14, 2023

Loading models, I now get this:

File "/home/mint/text-generation-webui/repositories/GPTQ-for-LLaMa/custom_autotune.py", line 72, in _bench
    except triton.compiler.OutOfResources:
AttributeError: module 'triton.compiler' has no attribute 'OutOfResources'

Doesn't appear to be running out of memory on GPU or CPU.
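
If it helps, here is a hedged compatibility sketch for that AttributeError: the OutOfResources exception lives at triton.compiler.OutOfResources in the pinned triton 2.0.0, and the alternate module paths below are only guesses about where it may sit in other builds; the fallback keeps the except clause in custom_autotune.py from crashing either way:

# Hedged compatibility sketch: probe a few possible homes for OutOfResources.
# Only triton.compiler is confirmed (triton 2.0.0); the other paths are assumptions.
import importlib

def find_out_of_resources():
    candidates = (
        "triton.compiler",         # triton 2.0.0
        "triton.compiler.errors",  # assumption: possible newer layout
        "triton.runtime.errors",   # assumption: possible newer layout
    )
    for name in candidates:
        try:
            mod = importlib.import_module(name)
        except ImportError:
            continue
        exc = getattr(mod, "OutOfResources", None)
        if exc is not None:
            return exc
    return Exception  # last resort so `except OutOfResources:` still works

OutOfResources = find_out_of_resources()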
