Conversation

@akapoor3518

Validated all Gemma models (q4_K_M, fp16, and F32).

[akapoor@wssw01 llama.cpp]$ build-posix/bin/llama-cli -p "my cat's name" -m /proj/rel/sw/ggml/models/gemma3:1b-it-q4_K_M --device tSavorite -c 12288 --temp 0.0 --n-predict 6 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup --no-display-prompt
I understand you're trying

quit

llama_perf_sampler_print: sampling time = 0.01 ms / 12 runs ( 0.00 ms per token, 857142.86 tokens per second)
llama_perf_context_print: load time = 11005.97 ms
llama_perf_context_print: prompt eval time = 8869.34 ms / 15 tokens ( 591.29 ms per token, 1.69 tokens per second)
llama_perf_context_print: eval time = 5165.31 ms / 5 runs ( 1033.06 ms per token, 0.97 tokens per second)
llama_perf_context_print: total time = 21792.62 ms / 20 tokens

=== GGML Perf Summary ===
Op         Target    Runs  TSI_KERNEL-RUN  Total (us)  Avg (us)
ADD        OPU       1144            1844      947503    828.24
MUL        OPU       3454            8404     4266409   1235.21
RMS_NORM   OPU       3454            3454     2435789    705.21
MUL_MAT    CPU      20598               0    78091817   3791.23
SCALE      CPU       2225               0        2710      1.22
CONT       CPU       4181               0      230922     55.23
RESHAPE    CPU       6551               0        2577      0.39
VIEW       CPU       9772               0        1294      0.13
PERMUTE    CPU       8016               0        1288      0.16
TRANSPOSE  CPU       2095               0         298      0.14
GET_ROWS   CPU        129               0         454      3.52
SET_ROWS   CPU       3883               0        2705      0.70
SOFT_MAX   OPU        572            3744     2125329   3715.61
ROPE       CPU       4395               0       36985      8.42
GLU        CPU       2156               0      296249    137.41
Interrupted by user
[akapoor@wssw01 llama.cpp]$ build-posix/bin/llama-cli -p "my cat's name" -m /proj/rel/sw/ggml/models/gemma3:1b-it-fp16 --device tSavorite -c 12288 --temp 0.0 --n-predict 6 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup --no-display-prompt
I understand you're looking

llama_perf_sampler_print: sampling time = 99.35 ms / 21 runs ( 4.73 ms per token, 211.38 tokens per second)
llama_perf_context_print: load time = 17071.58 ms
llama_perf_context_print: prompt eval time = 6913.49 ms / 15 tokens ( 460.90 ms per token, 2.17 tokens per second)
llama_perf_context_print: eval time = 3696.45 ms / 5 runs ( 739.29 ms per token, 1.35 tokens per second)
llama_perf_context_print: total time = 27211.75 ms / 20 tokens

=== GGML Perf Summary ===
Op         Target    Runs  TSI_KERNEL-RUN  Total (us)  Avg (us)
ADD        OPU       1144            1844      954084    833.99
MUL        OPU       3454            8404     4300667   1245.13
RMS_NORM   OPU       3454            3454     2227842    645.00
MUL_MAT    CPU      20475               0    47999475   2344.30
SCALE      CPU       2226               0        2757      1.24
CONT       CPU       4213               0      234803     55.73
RESHAPE    CPU       6561               0        2271      0.35
VIEW       CPU       9892               0        1445      0.15
PERMUTE    CPU       7844               0        1228      0.16
TRANSPOSE  CPU       2000               0         351      0.18
GET_ROWS   CPU        117               0         353      3.02
SET_ROWS   CPU       3777               0        2629      0.70
SOFT_MAX   OPU        572            3744     2036892   3561.00
ROPE       CPU       4376               0       37872      8.65
GLU        CPU       2198               0      321821    146.42
Interrupted by user
[akapoor@wssw01 llama.cpp]$ build-posix/bin/llama-cli -p "my cat's name" -m /proj/rel/sw/ggml/models/google-gemma-3-1b-it-F32.gguf --device tSavorite -c 12288 --temp 0.0 --n-predict 6 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup --no-display-prompt
Please tell me your cat'

llama_perf_sampler_print: sampling time = 102.87 ms / 20 runs ( 5.14 ms per token, 194.43 tokens per second)
llama_perf_context_print: load time = 7667.50 ms
llama_perf_context_print: prompt eval time = 5957.95 ms / 14 tokens ( 425.57 ms per token, 2.35 tokens per second)
llama_perf_context_print: eval time = 3045.03 ms / 5 runs ( 609.01 ms per token, 1.64 tokens per second)
llama_perf_context_print: total time = 15241.09 ms / 19 tokens

=== GGML Perf Summary ===
Op         Target    Runs  TSI_KERNEL-RUN  Total (us)  Avg (us)
ADD        OPU       1144            1794      996862    871.38
MUL        OPU       3454            8173     4397025   1273.02
RMS_NORM   OPU       3454            3454     2343316    678.44
MUL_MAT    CPU      20520               0    31297671   1525.23
SCALE      CPU       2188               0        3248      1.48
CONT       CPU       4227               0      244907     57.94
RESHAPE    CPU       6549               0        2715      0.41
VIEW       CPU       9830               0        1484      0.15
PERMUTE    CPU       7972               0        1506      0.19
TRANSPOSE  CPU       2017               0         314      0.16
GET_ROWS   CPU        115               0         356      3.10
SET_ROWS   CPU       3779               0        2909      0.77
SOFT_MAX   OPU        572            3640     2109735   3688.35
ROPE       CPU       4405               0       38399      8.72
GLU        CPU       2161               0      332427    153.83
Interrupted by user
[akapoor@wssw01 llama.cpp]$

@atrivedi-tsavoritesi left a comment

Approving, but it is not clear why this cleanup triggers a crash. Is it because in interactive mode, instead of waiting, we are cleaning up the backend?

@akapoor3518
Author

Actually, this wasn't an error condition. What I misread as a regular return in the middle of the program was in fact a return from inside a lambda, and that misreading led me to trigger cleanup prematurely. During cleanup I freed memory that a later memcpy still accessed, which caused the crash.
You're not seeing this on the FPGA because you weren't running in interactive mode.
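
For illustration, here is a minimal C++ sketch of that pitfall, with hypothetical names (backend_ctx, backend_free, run_interactive are stand-ins, not the actual llama.cpp code): a `return` inside a lambda exits only the lambda, so mistaking it for a return from the enclosing function can lead to freeing backend memory mid-loop, after which a later memcpy touches freed memory.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical stand-in for backend state (not the real llama.cpp types).
struct backend_ctx {
    float * scratch = nullptr;
    size_t  size    = 0;
};

static void backend_free(backend_ctx & ctx) {
    free(ctx.scratch);   // cleanup releases the scratch buffer
    ctx.scratch = nullptr;
}

static void run_interactive(backend_ctx & ctx) {
    auto handle_token = [](int token) {
        if (token < 0) {
            return;      // exits ONLY this lambda, not run_interactive()
        }
        // ... process the token ...
    };

    for (int token : {1, 2, -1, 3}) {
        handle_token(token);
        // BUG (the misreading): treating the lambda's `return` above as a
        // return from run_interactive() and freeing the backend here:
        //   backend_free(ctx);
    }

    // Later code still assumes ctx.scratch is alive. If cleanup already
    // ran inside the loop, this memcpy is a use-after-free.
    float out[4] = {0};
    memcpy(out, ctx.scratch, sizeof(out));
}

int main() {
    backend_ctx ctx;
    ctx.size    = 1024;
    ctx.scratch = (float *) calloc(ctx.size, sizeof(float));

    run_interactive(ctx);
    backend_free(ctx);   // correct: free once, after interactive use ends
    return 0;
}
```

In interactive mode the loop keeps running past the point where cleanup was triggered, which would explain why the crash only showed up there.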

@atrivedi-tsavoritesi

Got it, thanks for the clarification.

@akapoor3518 merged commit 11b4019 into master on Oct 25, 2025.