Hello, I was benchmarking this on an AMD device in the docker container rocm/pytorch:latest, which has PyTorch 2.2.0. The compilation works and so does the quantization, but generation is bugged.
./scripts/prepare.sh $MODEL_REPO
Model config {'block_size': 2048, 'vocab_size': 32000, 'n_layer': 32, 'n_head': 32, 'dim': 4096, 'intermediate_size': 11008, 'n_local_heads': 32, 'head_dim': 128, 'rope_base': 10000, 'norm_eps': 1e-05}
Saving checkpoint to checkpoints/meta-llama/Llama-2-7b-chat-hf/model.pth
Loading model ...
Quantizing model weights for int8 weight-only symmetric per-channel quantization
Writing quantized weights to checkpoints/meta-llama/Llama-2-7b-chat-hf/model_int8.pth
When I run generate:
python generate.py --compile --checkpoint_path checkpoints/$MODEL_REPO/model.pth --prompt "Hello, my name is"
Using device=cuda
Loading model ...
Time to load model: 23.11 seconds
/home/workspace/gpt-fast/model.py:189: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/workspace/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:507.)
y = F.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0)
/root/.local/lib/python3.9/site-packages/torch/backends/cuda/__init__.py:342: FutureWarning: torch.backends.cuda.sdp_kernel() is deprecated. In the future, this context manager will be removed. Please see, torch.nn.attention.sdpa_kernel() for the new context manager, with updated signature.
warnings.warn(
/root/.local/lib/python3.9/site-packages/torch/backends/cuda/__init__.py:342: FutureWarning: torch.backends.cuda.sdp_kernel() is deprecated. In the future, this context manager will be removed. Please see, torch.nn.attention.sdpa_kernel() for the new context manager, with updated signature.
warnings.warn(
/root/.local/lib/python3.9/site-packages/torch/backends/cuda/__init__.py:342: FutureWarning: torch.backends.cuda.sdp_kernel() is deprecated. In the future, this context manager will be removed. Please see, torch.nn.attention.sdpa_kernel() for the new context manager, with updated signature.
warnings.warn(
Compilation time: 60.60 seconds
Hello, my name is ⁇ ⁇ � ⁇ ⁇ ⁇ � ⁇ ⁇ ⁇ ⁇ � ⁇ � ⁇ � ⁇ ⁇ ⁇ ⁇ ⁇ � ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ � ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ �� ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ �� ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ � ⁇ ⁇ ⁇ ⁇ ⁇ � ⁇ ⁇ ⁇ ⁇ � ⁇ ⁇ ⁇ � ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ � ⁇ ⁇ ⁇ ⁇ ⁇ � ⁇ ⁇ ⁇ � ⁇ ⁇ ⁇ ⁇ � ⁇ ⁇ � ⁇ ⁇ � ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ � ⁇ �� ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ � ⁇ � ⁇ ⁇ ⁇ ⁇ � ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ � ⁇ � ⁇ �� ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ � ⁇ ⁇ ⁇ ⁇ �� ⁇ � ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ � ⁇ ��� ⁇ ⁇ � ⁇ ⁇ ⁇ ⁇ � ⁇ � ⁇ ⁇ � ⁇
Time for inference 1: 3.46 sec total, 57.76 tokens/sec
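The UserWarning above says this ROCm build lacks the memory-efficient SDPA kernel, so the compiled path may be hitting a different attention backend than eager mode. As a diagnostic (a sketch of my own, not part of gpt-fast), one can print which SDPA backends this build reports as enabled and force the plain math fallback to rule out a bad fused kernel:

```python
import torch

# Diagnostic sketch: report which scaled-dot-product-attention backends
# this PyTorch build has enabled.
print("flash:        ", torch.backends.cuda.flash_sdp_enabled())
print("mem_efficient:", torch.backends.cuda.mem_efficient_sdp_enabled())
print("math:         ", torch.backends.cuda.math_sdp_enabled())

# Force the math fallback for all SDPA calls. These global toggles are
# available in PyTorch 2.0-2.2; newer releases prefer the
# torch.nn.attention.sdpa_kernel context manager instead.
torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_math_sdp(True)
```

If generation under --compile is correct with only the math backend enabled, that would point at the ROCm fused attention kernels rather than at the quantization or compilation itself.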
Not sure what's going wrong, because I use the same container for another transformer-based model, and torch.compile works there for me.
Further, I tried compiling torch nightly 2.3.0 from source to see if that would change anything. It compiles, but no luck there either: generate gives the same garbled output.
Without --compile generation works as expected:
python generate.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --prompt "Hello, my name is"
Using device=cuda
Loading model ...
Time to load model: 4.04 seconds
Hello, my name is Samantha and I'm a 19-year-old from Australia. I've been playing guitar for about 4 years now and I absolutely love it. I'm mostly self-taught, but I do take lessons occasionally to improve my skills. My favorite genre to play is indie rock, but I also enjoy playing acoustic covers of pop and folk songs. When I'm not playing guitar, I'm usually listening to music or writing songs. I'm super excited to be here and can't wait to connect with other guitar enthusiasts! 🎸❤️t was a beautiful summer evening, and Samantha was sitting in her backyard, strumming her guitar. She had just finished a long day of work and was feeling a bit stressed, so she decided to take a break and play some music. As she played, she felt her worries slowly drifting away,
Time for inference 4: 7.30 sec total, 27.40 tokens/sec
Bandwidth achieved: 369.30 GB/s
Thanks