
@akapoor3518

Enhancements to Performance Statistics:

Added backend-level breakdown (e.g., CPU, TSAVORITE) for each operation.

Included unary operation details in both summary and detailed outputs.

Fixed column formatting and alignment in the summary and detailed CSV output for improved readability.

##########
Terminal output
[akapoor@wssw01 llama.cpp]$ ./build-posix/bin/llama-cli -p "my cat's name is" -m /proj/work/akapoor/llama.cpp-may22/llama.cpp/models/Tiny-Llama-v0.3-FP32-1.1B-F32.gguf --device tsavorite -c 12288 --temp 0.0 --n-predict 1 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup
my cat's name is L

llama_perf_sampler_print: sampling time = 2.02 ms / 8 runs ( 0.25 ms per token, 3966.29 tokens per second)
llama_perf_context_print: load time = 16983.31 ms
llama_perf_context_print: prompt eval time = 16428.90 ms / 7 tokens ( 2346.99 ms per token, 0.43 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 16985.71 ms / 8 tokens

=== GGML Perf Summary ===
Op               :  Runs   Total us     Avg us
ADD              :   171      28077     164.19
  [CPU       ]   :   170       5135      30.21
  [TSAVORITE ]   :     1      22942   22942.00
MUL              :   133    6164882   46352.50
  [CPU       ]   :    88       7098      80.66
  [TSAVORITE ]   :    45    6157784  136839.64
RMS_NORM         :   180       3266      18.14
  [CPU       ]   :   180       3266      18.14
MUL_MAT          :   713    7003799    9823.00
  [CPU       ]   :   713    7003799    9823.00
CPY              :   170       1426       8.39
  [CPU       ]   :   170       1426       8.39
CONT             :    86        264       3.07
  [CPU       ]   :    86        264       3.07
RESHAPE          :   310        183       0.59
  [CPU       ]   :   310        183       0.59
VIEW             :   294         42       0.14
  [CPU       ]   :   294         42       0.14
PERMUTE          :   303         68       0.22
  [CPU       ]   :   303         68       0.22
TRANSPOSE        :    78         19       0.24
  [CPU       ]   :    78         19       0.24
GET_ROWS         :    11       6916     628.73
  [CPU       ]   :    11       6916     628.73
SOFT_MAX         :    88       5600      63.64
  [CPU       ]   :    88       5600      63.64
ROPE             :   170       2998      17.64
  [CPU       ]   :   170       2998      17.64
UNARY            :    22    8308663  377666.50
  [TSAVORITE ]   :    22    8308663  377666.50
    -> SILU      :    22    8308663  377666.50

GGML Tsavorite Profiling Results:

Calls Total(ms) T/call Self(ms) Function

1      2.000     2.000     2.000  [ 0%] GGML Tsavorite 

========================================================================================================================
1 18573.000 18573.000 18573.000 [100%] TOTAL

[akapoor@wssw01 llama.cpp]$
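
For anyone checking the numbers: in the summary above, each op row appears to be the sum of its backend rows, with Avg us = Total us / Runs. Below is a minimal Python sketch of that roll-up, using the ADD and MUL figures copied from the output. The names `samples` and `roll_up` are hypothetical and are not taken from the ggml sources; the actual accumulation in the C code may be structured differently.

```python
from collections import defaultdict

# (op, backend, runs, total_us) copied from the ADD and MUL rows of the summary.
samples = [
    ("ADD", "CPU", 170, 5135),
    ("ADD", "TSAVORITE", 1, 22942),
    ("MUL", "CPU", 88, 7098),
    ("MUL", "TSAVORITE", 45, 6157784),
]


def roll_up(rows):
    """Sum runs and total microseconds per op across all backends.

    Hypothetical helper for illustration; returns
    {op: (runs, total_us, avg_us)}.
    """
    totals = defaultdict(lambda: [0, 0])
    for op, _backend, runs, total_us in rows:
        totals[op][0] += runs
        totals[op][1] += total_us
    return {
        op: (runs, total_us, total_us / runs)
        for op, (runs, total_us) in totals.items()
    }


summary = roll_up(samples)
for op, (runs, total_us, avg_us) in summary.items():
    print(f"{op:<16} : {runs:>5} {total_us:>10} {avg_us:>10.2f}")
```

The per-op totals this produces can be compared directly against the ADD and MUL rows printed by llama-cli.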

Snapshot of the detailed output written to file:
#########
[akapoor@wssw01 llama.cpp]$ cat ggml_perf-all-shape.log |more

=== GGML Detailed Op Perf (21526.203 ms total) ===
Backend    Op          Runs   Total ms   Avg ms  ne[0]  ne[1]  ne[2]  ne[3]
CPU        GET_ROWS       4      6.902    1.726   2048      7      1      1
CPU        RMS_NORM       4      0.347    0.087   2048      7      1      1
TSAVORITE  MUL            1    142.612  142.612   2048      7      1      1
CPU        MUL_MAT        4     34.957    8.739   2048      7      1      1
CPU        RESHAPE        2      0.004    0.002     64     32      7      1
CPU        ROPE           4      0.270    0.068     64     32      7      1
CPU        MUL_MAT        4      3.840    0.960    256      7      1      1
CPU        RESHAPE        4      0.003    0.001     64      4      7      1
CPU        ROPE           3      0.027    0.009     64      4      7      1
CPU        MUL_MAT        4      3.811    0.953    256      7      1      1
CPU        RESHAPE        4      0.002    0.001     64      4      7      1
CPU        VIEW           2      0.003    0.002   1792      1      1      1
CPU        CPY            4      0.043    0.011   1792      1      1      1
CPU        RESHAPE        2      0.000    0.000    256      7      1      1
CPU        TRANSPOSE      3      0.003    0.001      7    256      1      1
CPU        VIEW           3      0.000    0.000      7    256      1      1
CPU        CPY            4      0.034    0.009      7    256      1      1
CPU        VIEW           4      0.001    0.000     32      4     64      1
CPU        PERMUTE        4      0.004    0.001     32     64      4      1
CPU        VIEW           2      0.000    0.000     64      4     32      1
CPU        PERMUTE        2      0.001    0.001     64     32      4      1
CPU        PERMUTE        3      0.000    0.000     64      7     32      1
CPU        MUL_MAT        4      0.868    0.217     32      7     32      1
CPU        SOFT_MAX       4      0.256    0.064     32      7     32      1
--More--
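
To total the detailed log per backend, a small script along these lines may help. It is only a sketch: it assumes each data row has the nine whitespace-separated fields shown in the excerpt above (backend, op, runs, total ms, avg ms, ne[0..3]) and skips anything else. `OpRow`, `parse_row`, and `sample_log` are hypothetical names introduced here, and the three sample rows are copied from the log.

```python
from collections import defaultdict
from typing import NamedTuple, Optional, Tuple


class OpRow(NamedTuple):
    backend: str
    op: str
    runs: int
    total_ms: float
    avg_ms: float
    shape: Tuple[int, int, int, int]


def parse_row(line: str) -> Optional[OpRow]:
    """Parse one data row; return None for titles, headers, and pager lines."""
    parts = line.split()
    if len(parts) != 9:
        return None
    try:
        return OpRow(
            backend=parts[0],
            op=parts[1],
            runs=int(parts[2]),
            total_ms=float(parts[3]),
            avg_ms=float(parts[4]),
            shape=(int(parts[5]), int(parts[6]), int(parts[7]), int(parts[8])),
        )
    except ValueError:
        # Nine tokens but non-numeric fields, e.g. the "=== ... ===" title line.
        return None


# First rows of the detailed log, copied from the excerpt above.
sample_log = """\
=== GGML Detailed Op Perf (21526.203 ms total) ===
Backend Op Runs Total ms Avg ms ne[0] ne[1] ne[2] ne[3]
CPU GET_ROWS 4 6.902 1.726 2048 7 1 1
CPU RMS_NORM 4 0.347 0.087 2048 7 1 1
TSAVORITE MUL 1 142.612 142.612 2048 7 1 1
"""

# Accumulate total milliseconds per backend.
totals = defaultdict(float)
for line in sample_log.splitlines():
    row = parse_row(line)
    if row is not None:
        totals[row.backend] += row.total_ms

for backend, total_ms in sorted(totals.items()):
    print(f"{backend:<10} {total_ms:10.3f} ms")
```

Splitting on whitespace keeps the parser indifferent to column alignment, so it should work whether or not the columns are padded.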

@akapoor3518 akapoor3518 merged commit 8736109 into master Jul 1, 2025