@FIR-1044 - llama.cpp: it's crashing for a new model due to a recent bug #71
Validated all Gemma 3 1B model variants (q4_K_M, fp16, and F32).
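All three runs below share the same prompt and sampling settings; only the model file changes. A minimal driver sketch for repeating the sweep (paths and flags copied verbatim from the commands below):

```bash
#!/usr/bin/env bash
# Re-run the Gemma 3 1B validation sweep over the three quantization variants.
# Model paths and llama-cli flags are taken from the runs below.
MODELS=(
  /proj/rel/sw/ggml/models/gemma3:1b-it-q4_K_M
  /proj/rel/sw/ggml/models/gemma3:1b-it-fp16
  /proj/rel/sw/ggml/models/google-gemma-3-1b-it-F32.gguf
)
for m in "${MODELS[@]}"; do
  build-posix/bin/llama-cli -p "my cat's name" -m "$m" \
    --device tSavorite -c 12288 --temp 0.0 --n-predict 6 \
    --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 \
    --repeat-last-n 5 --no-warmup --no-display-prompt
done
```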
[akapoor@wssw01 llama.cpp]$
[akapoor@wssw01 llama.cpp]$ build-posix/bin/llama-cli -p "my cat's name" -m /proj/rel/sw/ggml/models/gemma3:1b-it-q4_K_M --device tSavorite -c 12288 --temp 0.0 --n-predict 6 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup --no-display-prompt
I understand you're trying
llama_perf_sampler_print: sampling time = 0.01 ms / 12 runs ( 0.00 ms per token, 857142.86 tokens per second)
llama_perf_context_print: load time = 11005.97 ms
llama_perf_context_print: prompt eval time = 8869.34 ms / 15 tokens ( 591.29 ms per token, 1.69 tokens per second)
llama_perf_context_print: eval time = 5165.31 ms / 5 runs ( 1033.06 ms per token, 0.97 tokens per second)
llama_perf_context_print: total time = 21792.62 ms / 20 tokens
=== GGML Perf Summary ===
Op         Target   Runs TSI_KERNEL-RUN   Total us    Avg us
ADD        OPU      1144           1844     947503    828.24
MUL        OPU      3454           8404    4266409   1235.21
RMS_NORM   OPU      3454           3454    2435789    705.21
MUL_MAT    CPU     20598              0   78091817   3791.23
SCALE      CPU      2225              0       2710      1.22
CONT       CPU      4181              0     230922     55.23
RESHAPE    CPU      6551              0       2577      0.39
VIEW       CPU      9772              0       1294      0.13
PERMUTE    CPU      8016              0       1288      0.16
TRANSPOSE  CPU      2095              0        298      0.14
GET_ROWS   CPU       129              0        454      3.52
SET_ROWS   CPU      3883              0       2705      0.70
SOFT_MAX   OPU       572           3744    2125329   3715.61
ROPE       CPU      4395              0      36985      8.42
GLU        CPU      2156              0     296249    137.41
Interrupted by user
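In the perf summary, Avg us is Total us divided by Runs (e.g. ADD: 947503 / 1144 ≈ 828.24), and TSI_KERNEL-RUN is nonzero only for ops dispatched to the OPU. A quick consistency check over the pasted rows (a sketch; assumes the data rows, without the header, are saved to perf.txt):

```bash
# Columns: Op Target Runs TSI_KERNEL-RUN Total-us Avg-us
# Recompute the average from Total/Runs and compare with the reported value.
awk '{ printf "%-10s computed=%8.2f reported=%8.2f\n", $1, $5/$3, $6 }' perf.txt
```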
[akapoor@wssw01 llama.cpp]$ build-posix/bin/llama-cli -p "my cat's name" -m /proj/rel/sw/ggml/models/gemma3:1b-it-fp16 --device tSavorite -c 12288 --temp 0.0 --n-predict 6 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup --no-display-prompt
I understand you're looking
llama_perf_sampler_print: sampling time = 99.35 ms / 21 runs ( 4.73 ms per token, 211.38 tokens per second)
llama_perf_context_print: load time = 17071.58 ms
llama_perf_context_print: prompt eval time = 6913.49 ms / 15 tokens ( 460.90 ms per token, 2.17 tokens per second)
llama_perf_context_print: eval time = 3696.45 ms / 5 runs ( 739.29 ms per token, 1.35 tokens per second)
llama_perf_context_print: total time = 27211.75 ms / 20 tokens
=== GGML Perf Summary ===
Op         Target   Runs TSI_KERNEL-RUN   Total us    Avg us
ADD        OPU      1144           1844     954084    833.99
MUL        OPU      3454           8404    4300667   1245.13
RMS_NORM   OPU      3454           3454    2227842    645.00
MUL_MAT    CPU     20475              0   47999475   2344.30
SCALE      CPU      2226              0       2757      1.24
CONT       CPU      4213              0     234803     55.73
RESHAPE    CPU      6561              0       2271      0.35
VIEW       CPU      9892              0       1445      0.15
PERMUTE    CPU      7844              0       1228      0.16
TRANSPOSE  CPU      2000              0        351      0.18
GET_ROWS   CPU       117              0        353      3.02
SET_ROWS   CPU      3777              0       2629      0.70
SOFT_MAX   OPU       572           3744    2036892   3561.00
ROPE       CPU      4376              0      37872      8.65
GLU        CPU      2198              0     321821    146.42
Interrupted by user
[akapoor@wssw01 llama.cpp]$ build-posix/bin/llama-cli -p "my cat's name" -m /proj/rel/sw/ggml/models/google-gemma-3-1b-it-F32.gguf --device tSavorite -c 12288 --temp 0.0 --n-predict 6 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup --no-display-prompt
Please tell me your cat'
llama_perf_sampler_print: sampling time = 102.87 ms / 20 runs ( 5.14 ms per token, 194.43 tokens per second)
llama_perf_context_print: load time = 7667.50 ms
llama_perf_context_print: prompt eval time = 5957.95 ms / 14 tokens ( 425.57 ms per token, 2.35 tokens per second)
llama_perf_context_print: eval time = 3045.03 ms / 5 runs ( 609.01 ms per token, 1.64 tokens per second)
llama_perf_context_print: total time = 15241.09 ms / 19 tokens
=== GGML Perf Summary ===
Op         Target   Runs TSI_KERNEL-RUN   Total us    Avg us
ADD        OPU      1144           1794     996862    871.38
MUL        OPU      3454           8173    4397025   1273.02
RMS_NORM   OPU      3454           3454    2343316    678.44
MUL_MAT    CPU     20520              0   31297671   1525.23
SCALE      CPU      2188              0       3248      1.48
CONT       CPU      4227              0     244907     57.94
RESHAPE    CPU      6549              0       2715      0.41
VIEW       CPU      9830              0       1484      0.15
PERMUTE    CPU      7972              0       1506      0.19
TRANSPOSE  CPU      2017              0        314      0.16
GET_ROWS   CPU       115              0        356      3.10
SET_ROWS   CPU      3779              0       2909      0.77
SOFT_MAX   OPU       572           3640    2109735   3688.35
ROPE       CPU      4405              0      38399      8.72
GLU        CPU      2161              0     332427    153.83
Interrupted by user
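For a side-by-side view of decode speed across the three variants, the per-token eval time can be pulled from saved logs. A sketch, assuming each run above had been captured with tee to run-<variant>.log (hypothetical file names; the logs are not attached here):

```bash
#!/usr/bin/env bash
# Print the decode-phase ms/token reported by llama_perf_context_print.
# The regex ': +eval time' skips the "prompt eval time" line.
for v in q4_K_M fp16 F32; do
  ms=$(grep -E ': +eval time' "run-${v}.log" \
         | sed -E 's/.*\( *([0-9.]+) ms per token.*/\1/')
  printf '%-8s %s ms/token\n' "$v" "$ms"
done
# From the runs above: q4_K_M 1033.06, fp16 739.29, F32 609.01 ms/token.
```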
[akapoor@wssw01 llama.cpp]$