@FIR-781 - LLama.cpp ggml Stats: Adding Backend and Unary OP Detail #31
Enhancements to Performance Statistics:
Added backend-level breakdown (e.g., CPU, TSAVORITE) for each operation.
Included unary operation details in both summary and detailed outputs.
Fixed column formatting and alignment in the summary and detailed CSV output for improved readability.
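As a rough illustration of the backend-level breakdown and unary detail described above, the sketch below models a two-level accumulator: each timing is stored per (op, backend, optional unary sub-op), and the summary rolls backend rows up into the op row. This is not the C code from this PR. The names `PerfStats`, `record`, and `summary_rows` are hypothetical, and the timings are placeholder values chosen only to show the rollup.

```python
from collections import defaultdict


class PerfStats:
    """Hypothetical accumulator: (op, backend, sub_op) -> [runs, total_us]."""

    def __init__(self):
        self._cells = defaultdict(lambda: [0, 0])

    def record(self, op, backend, elapsed_us, sub_op=None):
        # One call per executed node; sub_op is only set for UNARY ops.
        cell = self._cells[(op, backend, sub_op)]
        cell[0] += 1
        cell[1] += elapsed_us

    def summary_rows(self):
        """Return (label, runs, total_us, avg_us): op row, backend rows, sub-op rows."""
        per_op = defaultdict(lambda: [0, 0])
        per_backend = defaultdict(lambda: [0, 0])
        for (op, backend, _sub), (runs, total) in self._cells.items():
            per_op[op][0] += runs
            per_op[op][1] += total
            per_backend[(op, backend)][0] += runs
            per_backend[(op, backend)][1] += total

        rows = []
        for op, (runs, total) in per_op.items():
            rows.append((op, runs, total, total / runs))
            for (o, backend), (b_runs, b_total) in per_backend.items():
                if o != op:
                    continue
                rows.append((f"[{backend}]", b_runs, b_total, b_total / b_runs))
                for (o2, b2, sub), (s_runs, s_total) in self._cells.items():
                    if o2 == op and b2 == backend and sub is not None:
                        rows.append((f"-> {sub}", s_runs, s_total, s_total / s_runs))
        return rows


# Placeholder timings in microseconds, not measurements from this PR.
stats = PerfStats()
stats.record("ADD", "CPU", 10)
stats.record("ADD", "TSAVORITE", 20)
stats.record("UNARY", "TSAVORITE", 300, sub_op="SILU")

for label, runs, total, avg in stats.summary_rows():
    print(f"{label:<14}: {runs:>5} {total:>10} {avg:>12.2f}")
```

Keeping the finest-grained cell as the only stored state means the op-level and backend-level rows can never disagree with each other, since both are derived from the same counters at print time.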
##########
Terminal output
[akapoor@wssw01 llama.cpp]$ ./build-posix/bin/llama-cli -p "my cat's name is" -m /proj/work/akapoor/llama.cpp-may22/llama.cpp/models/Tiny-Llama-v0.3-FP32-1.1B-F32.gguf --device tsavorite -c 12288 --temp 0.0 --n-predict 1 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup
my cat's name is L
llama_perf_sampler_print: sampling time = 2.02 ms / 8 runs ( 0.25 ms per token, 3966.29 tokens per second)
llama_perf_context_print: load time = 16983.31 ms
llama_perf_context_print: prompt eval time = 16428.90 ms / 7 tokens ( 2346.99 ms per token, 0.43 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 16985.71 ms / 8 tokens
=== GGML Perf Summary ===
Op : Runs Total us Avg us
ADD : 171 28077 164.19
[CPU ] : 170 5135 30.21
[TSAVORITE ] : 1 22942 22942.00
MUL : 133 6164882 46352.50
[CPU ] : 88 7098 80.66
[TSAVORITE ] : 45 6157784 136839.64
RMS_NORM : 180 3266 18.14
[CPU ] : 180 3266 18.14
MUL_MAT : 713 7003799 9823.00
[CPU ] : 713 7003799 9823.00
CPY : 170 1426 8.39
[CPU ] : 170 1426 8.39
CONT : 86 264 3.07
[CPU ] : 86 264 3.07
RESHAPE : 310 183 0.59
[CPU ] : 310 183 0.59
VIEW : 294 42 0.14
[CPU ] : 294 42 0.14
PERMUTE : 303 68 0.22
[CPU ] : 303 68 0.22
TRANSPOSE : 78 19 0.24
[CPU ] : 78 19 0.24
GET_ROWS : 11 6916 628.73
[CPU ] : 11 6916 628.73
SOFT_MAX : 88 5600 63.64
[CPU ] : 88 5600 63.64
ROPE : 170 2998 17.64
[CPU ] : 170 2998 17.64
UNARY : 22 8308663 377666.50
[TSAVORITE ] : 22 8308663 377666.50
-> SILU : 22 8308663 377666.50
GGML Tsavorite Profiling Results:
Calls Total(ms) T/call Self(ms) Function
========================================================================================================================
1 18573.000 18573.000 18573.000 [100%] TOTAL
[akapoor@wssw01 llama.cpp]$
Snapshot of the detailed output written to file:
#########
[akapoor@wssw01 llama.cpp]$ cat ggml_perf-all-shape.log |more
=== GGML Detailed Op Perf (21526.203 ms total) ===
Backend Op Runs Total ms Avg ms ne[0] ne[1] ne[2] ne[3]
CPU GET_ROWS 4 6.902 1.726 2048 7 1 1
CPU RMS_NORM 4 0.347 0.087 2048 7 1 1
TSAVORITE MUL 1 142.612 142.612 2048 7 1 1
CPU MUL_MAT 4 34.957 8.739 2048 7 1 1
CPU RESHAPE 2 0.004 0.002 64 32 7 1
CPU ROPE 4 0.270 0.068 64 32 7 1
CPU MUL_MAT 4 3.840 0.960 256 7 1 1
CPU RESHAPE 4 0.003 0.001 64 4 7 1
CPU ROPE 3 0.027 0.009 64 4 7 1
CPU MUL_MAT 4 3.811 0.953 256 7 1 1
CPU RESHAPE 4 0.002 0.001 64 4 7 1
CPU VIEW 2 0.003 0.002 1792 1 1 1
CPU CPY 4 0.043 0.011 1792 1 1 1
CPU RESHAPE 2 0.000 0.000 256 7 1 1
CPU TRANSPOSE 3 0.003 0.001 7 256 1 1
CPU VIEW 3 0.000 0.000 7 256 1 1
CPU CPY 4 0.034 0.009 7 256 1 1
CPU VIEW 4 0.001 0.000 32 4 64 1
CPU PERMUTE 4 0.004 0.001 32 64 4 1
CPU VIEW 2 0.000 0.000 64 4 32 1
CPU PERMUTE 2 0.001 0.001 64 32 4 1
CPU PERMUTE 3 0.000 0.000 64 7 32 1
CPU MUL_MAT 4 0.868 0.217 32 7 32 1
CPU SOFT_MAX 4 0.256 0.064 32 7 32 1
--More--
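For post-processing the detailed log, a small script along these lines could total the time spent per backend. It is a sketch that assumes the whitespace-separated column layout shown in the excerpt above (backend, op, runs, total ms, avg ms, then four `ne` dimensions); if the file on disk is comma-separated, the split would need adjusting. The function name `totals_by_backend` is hypothetical.

```python
from collections import defaultdict


def totals_by_backend(lines):
    """Sum the 'Total ms' column per backend from detailed perf log lines."""
    totals = defaultdict(float)
    for line in lines:
        fields = line.split()
        # Data rows have 9 fields: backend, op, runs, total ms, avg ms, ne[0..3].
        if len(fields) != 9:
            continue
        backend, _op, runs, total_ms = fields[:4]
        if not runs.isdigit():
            continue  # banner or header row, not a data row
        totals[backend] += float(total_ms)
    return dict(totals)


# First lines of the excerpt above, used as sample input.
sample = [
    "=== GGML Detailed Op Perf (21526.203 ms total) ===",
    "Backend Op Runs Total ms Avg ms ne[0] ne[1] ne[2] ne[3]",
    "CPU GET_ROWS 4 6.902 1.726 2048 7 1 1",
    "CPU RMS_NORM 4 0.347 0.087 2048 7 1 1",
    "TSAVORITE MUL 1 142.612 142.612 2048 7 1 1",
]

for backend, total_ms in totals_by_backend(sample).items():
    print(f"{backend:<10} {total_ms:10.3f} ms")
```

To run it on the real file, pass an open file object instead of `sample`, for example `totals_by_backend(open("ggml_perf-all-shape.log"))`.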