Hello, I was benchmarking this on an AMD device in the docker container rocm/pytorch:latest, which has PyTorch 2.2.0. The compilation works and so does the quantization, but generation is bugged.
./scripts/prepare.sh $MODEL_REPO
Model config {'block_size': 2048, 'vocab_size': 32000, 'n_layer': 32, 'n_head': 32, 'dim': 4096, 'intermediate_size': 11008, 'n_local_heads': 32, 'head_dim': 128, 'rope_base': 10000, 'norm_eps': 1e-05}
Saving checkpoint to checkpoints/meta-llama/Llama-2-7b-chat-hf/model.pth
Loading model ...
Quantizing model weights for int8 weight-only symmetric per-channel quantization
Writing quantized weights to checkpoints/meta-llama/Llama-2-7b-chat-hf/model_int8.pth
When I run generate:
python generate.py --compile --checkpoint_path checkpoints/$MODEL_REPO/model.pth --prompt "Hello, my name is"
Using device=cuda
Loading model ...
Time to load model: 23.11 seconds
/home/workspace/gpt-fast/model.py:189: UserWarning: 1Torch was not compiled with memory efficient attention. (Triggered internally at /home/workspace/pytorch/aten/src/ATen/native/transformers/hip/sdp_utils.cpp:507.)
y = F.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0)
/root/.local/lib/python3.9/site-packages/torch/backends/cuda/__init__.py:342: FutureWarning: torch.backends.cuda.sdp_kernel() is deprecated. In the future, this context manager will be removed. Please see, torch.nn.attention.sdpa_kernel() for the new context manager, with updated signature.
warnings.warn(
/root/.local/lib/python3.9/site-packages/torch/backends/cuda/__init__.py:342: FutureWarning: torch.backends.cuda.sdp_kernel() is deprecated. In the future, this context manager will be removed. Please see, torch.nn.attention.sdpa_kernel() for the new context manager, with updated signature.
warnings.warn(
/root/.local/lib/python3.9/site-packages/torch/backends/cuda/__init__.py:342: FutureWarning: torch.backends.cuda.sdp_kernel() is deprecated. In the future, this context manager will be removed. Please see, torch.nn.attention.sdpa_kernel() for the new context manager, with updated signature.
warnings.warn(
Compilation time: 60.60 seconds
Hello, my name is ⁇ ⁇ � ⁇ ⁇ ⁇ � ⁇ ⁇ ⁇ ⁇ � ⁇ � ⁇ � ⁇ ⁇ ⁇ ⁇ ⁇ � ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ � ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ �� ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ �� ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ � ⁇ ⁇ ⁇ ⁇ ⁇ � ⁇ ⁇ ⁇ ⁇ � ⁇ ⁇ ⁇ � ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ � ⁇ ⁇ ⁇ ⁇ ⁇ � ⁇ ⁇ ⁇ � ⁇ ⁇ ⁇ ⁇ � ⁇ ⁇ � ⁇ ⁇ � ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ � ⁇ �� ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ � ⁇ � ⁇ ⁇ ⁇ ⁇ � ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ � ⁇ � ⁇ �� ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ � ⁇ ⁇ ⁇ ⁇ �� ⁇ � ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ � ⁇ ��� ⁇ ⁇ � ⁇ ⁇ ⁇ ⁇ � ⁇ � ⁇ ⁇ � ⁇
Time for inference 1: 3.46 sec total, 57.76 tokens/sec
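The UserWarning above says this ROCm build lacks the memory-efficient SDPA kernel, so the compiled path may be hitting a different attention backend than eager mode. As a diagnostic (a sketch of my own, not part of gpt-fast), one can print which SDPA backends this build reports as enabled and force the plain math fallback to rule out a bad fused kernel:

```python
import torch

# Diagnostic sketch: report which scaled-dot-product-attention backends
# this PyTorch build has enabled.
print("flash:        ", torch.backends.cuda.flash_sdp_enabled())
print("mem_efficient:", torch.backends.cuda.mem_efficient_sdp_enabled())
print("math:         ", torch.backends.cuda.math_sdp_enabled())

# Force the math fallback for all SDPA calls. These global toggles are
# available in PyTorch 2.0-2.2; newer releases prefer the
# torch.nn.attention.sdpa_kernel context manager instead.
torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_math_sdp(True)
```

If generation under --compile is correct with only the math backend enabled, that would point at the ROCm fused attention kernels rather than at the quantization or compilation itself.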
Not sure what's going wrong, because I use the same container for another transformer-based model, and torch.compile works there for me.
Further, I tried compiling torch nightly 2.3.0 from source to see if that would change anything. It compiles, but no luck there either: generate gives the same garbled output.
Without --compile generation works as expected:
python generate.py --checkpoint_path checkpoints/$MODEL_REPO/model.pth --prompt "Hello, my name is"
Using device=cuda
Loading model ...
Time to load model: 4.04 seconds
Hello, my name is Samantha and I'm a 19-year-old from Australia. I've been playing guitar for about 4 years now and I absolutely love it. I'm mostly self-taught, but I do take lessons occasionally to improve my skills. My favorite genre to play is indie rock, but I also enjoy playing acoustic covers of pop and folk songs. When I'm not playing guitar, I'm usually listening to music or writing songs. I'm super excited to be here and can't wait to connect with other guitar enthusiasts! 🎸❤️t was a beautiful summer evening, and Samantha was sitting in her backyard, strumming her guitar. She had just finished a long day of work and was feeling a bit stressed, so she decided to take a break and play some music. As she played, she felt her worries slowly drifting away,
Time for inference 4: 7.30 sec total, 27.40 tokens/sec
Bandwidth achieved: 369.30 GB/s
Thanks