@FIR-1006 - GGML: PERF changes with following option #61

akapoor3518 · 2025-10-07T05:15:02Z

./tsi-pkg-build.sh "debug"
drwxr-xr-x 3 akapoor tsiusers 4096 Oct 6 22:02 tsi-ggml
-rw-r--r-- 1 akapoor tsiusers 16596324 Oct 6 22:02 tsi-ggml-0.0.9.tz
-rw-r--r-- 1 akapoor tsiusers 0 Oct 6 22:08 tsi-op.txt
-rw-r--r-- 1 akapoor tsiusers 216208 Oct 6 22:08 ggml_perf-all-shape.log
[akapoor@wssw01 llama.cpp]$

POSIX result
[akapoor@wssw01 llama.cpp]$ build-posix/bin/llama-cli -p "my cat's name" -m /proj/rel/sw/ggml/models/Tiny-Llama-v0.3-FP32-1.1B-F32.gguf --device tSavorite -c 12288 --temp 0.0 --n-predict 10 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup --no-display-prompt
is Luna.
I'm a cat

llama_perf_sampler_print: sampling time = 24.92 ms / 16 runs ( 1.56 ms per token, 642.11 tokens per second)
llama_perf_context_print: load time = 2926.90 ms
llama_perf_context_print: prompt eval time = 2318.16 ms / 6 tokens ( 386.36 ms per token, 2.59 tokens per second)
llama_perf_context_print: eval time = 3709.19 ms / 9 runs ( 412.13 ms per token, 2.43 tokens per second)
llama_perf_context_print: total time = 6663.61 ms / 15 tokens

=== GGML Perf Summary ===
Op         Target   Runs   Total us    Avg us
ADD        OPU      2024    3127498   1545.21
MUL        OPU      2070    1265143    611.18
RMS_NORM   OPU      2070    1277210    617.01
MUL_MAT    CPU     36433   50387125   1383.01
CONT       CPU      7572     391048     51.64
RESHAPE    CPU     11215       5531      0.49
VIEW       CPU     17361       2274      0.13
PERMUTE    CPU     13563       1998      0.15
TRANSPOSE  CPU      3119        732      0.23
GET_ROWS   CPU       389       1047      2.69
SET_ROWS   CPU      7143       4922      0.69
SOFT_MAX   CPU      3606     313995     87.08
ROPE       CPU      7824      33861      4.33
GLU        OPU      1012    1198366   1184.16
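
To read a summary like the one above at a glance, the per-op totals can be turned into shares of the summed op time. This is a minimal sketch and not part of this PR: the helper names `parse_summary` and `shares` are made up for illustration, and the rows are copied from the table above.

```python
# Illustrative helper, not part of the PR: rank ops by their share of the
# summed "Total us" column of a GGML Perf Summary table (layout with Target).
SUMMARY = """\
ADD OPU 2024 3127498 1545.21
MUL OPU 2070 1265143 611.18
RMS_NORM OPU 2070 1277210 617.01
MUL_MAT CPU 36433 50387125 1383.01
CONT CPU 7572 391048 51.64
RESHAPE CPU 11215 5531 0.49
VIEW CPU 17361 2274 0.13
PERMUTE CPU 13563 1998 0.15
TRANSPOSE CPU 3119 732 0.23
GET_ROWS CPU 389 1047 2.69
SET_ROWS CPU 7143 4922 0.69
SOFT_MAX CPU 3606 313995 87.08
ROPE CPU 7824 33861 4.33
GLU OPU 1012 1198366 1184.16
"""


def parse_summary(text):
    """Return a list of (op, target, runs, total_us) tuples."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 5:
            continue  # skip the header, blank lines, and other layouts
        op, target, runs, total_us, _avg_us = parts
        rows.append((op, target, int(runs), int(total_us)))
    return rows


def shares(rows):
    """Return (op, target, share_of_summed_total) sorted by share, largest first."""
    grand_total = sum(total for _, _, _, total in rows)
    ranked = [(op, target, total / grand_total) for op, target, _, total in rows]
    return sorted(ranked, key=lambda r: r[2], reverse=True)


for op, target, share in shares(parse_summary(SUMMARY))[:5]:
    print(f"{op:<10} {target:<4} {share:6.1%}")
```

On this run MUL_MAT on the CPU accounts for roughly 87% of the summed per-op time, and the four OPU ops together for about 12%.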

OPU Profiling Results:

Calls Total(ms) T/call Self(ms) Function

Will test the builds below and add the results soon.
############
./tsi-pkg-build.sh
##########
POSIX
[akapoor@wssw01 llama.cpp]$ ls -lrt
total 33520
-rw-r--r-- 1 akapoor tsiusers 47860 Oct 3 11:30 AUTHORS
-rw-r--r-- 1 akapoor tsiusers 10730 Oct 3 11:30 CMakeLists.txt
-rw-r--r-- 1 akapoor tsiusers 4570 Oct 3 11:30 CMakePresets.json
-rw-r--r-- 1 akapoor tsiusers 451 Oct 3 11:30 CODEOWNERS
-rw-r--r-- 1 akapoor tsiusers 6872 Oct 3 11:30 CONTRIBUTING.md
-rw-r--r-- 1 akapoor tsiusers 1078 Oct 3 11:30 LICENSE
-rw-r--r-- 1 akapoor tsiusers 257 Oct 3 11:30 Makefile
-rw-r--r-- 1 akapoor tsiusers 32515 Oct 3 11:30 README.md
-rw-r--r-- 1 akapoor tsiusers 5347 Oct 3 11:30 SECURITY.md
-rwxr-xr-x 1 akapoor tsiusers 21760 Oct 3 11:30 build-xcframework.sh
drwxr-xr-x 2 akapoor tsiusers 4096 Oct 3 11:30 ci
drwxr-xr-x 2 akapoor tsiusers 4096 Oct 3 11:30 cmake
drwxr-xr-x 2 akapoor tsiusers 4096 Oct 3 11:30 common
-rwxr-xr-x 1 akapoor tsiusers 426709 Oct 3 11:30 convert_hf_to_gguf.py
-rwxr-xr-x 1 akapoor tsiusers 24225 Oct 3 11:30 convert_hf_to_gguf_update.py
-rwxr-xr-x 1 akapoor tsiusers 19106 Oct 3 11:30 convert_llama_ggml_to_gguf.py
-rwxr-xr-x 1 akapoor tsiusers 20291 Oct 3 11:30 convert_lora_to_gguf.py
drwxr-xr-x 6 akapoor tsiusers 4096 Oct 3 11:30 docs
drwxr-xr-x 30 akapoor tsiusers 4096 Oct 3 11:30 examples
-rw-r--r-- 1 akapoor tsiusers 1556 Oct 3 11:30 flake.lock
-rw-r--r-- 1 akapoor tsiusers 7243 Oct 3 11:30 flake.nix
drwxr-xr-x 5 akapoor tsiusers 4096 Oct 3 11:30 ggml
drwxr-xr-x 5 akapoor tsiusers 4096 Oct 3 11:30 gguf-py
drwxr-xr-x 2 akapoor tsiusers 4096 Oct 3 11:30 grammars
drwxr-xr-x 2 akapoor tsiusers 4096 Oct 3 11:30 include
drwxr-xr-x 2 akapoor tsiusers 4096 Oct 3 11:30 licenses
drwxr-xr-x 2 akapoor tsiusers 4096 Oct 3 11:30 media
-rw-r--r-- 1 akapoor tsiusers 3319 Oct 3 11:30 model-rerun.py
drwxr-xr-x 3 akapoor tsiusers 4096 Oct 3 11:30 models
-rw-r--r-- 1 akapoor tsiusers 163 Oct 3 11:30 mypy.ini
drwxr-xr-x 3 akapoor tsiusers 4096 Oct 3 11:30 pocs
-rw-r--r-- 1 akapoor tsiusers 124786 Oct 3 11:30 poetry.lock
drwxr-xr-x 2 akapoor tsiusers 4096 Oct 3 11:30 prompts
-rw-r--r-- 1 akapoor tsiusers 1336 Oct 3 11:30 pyproject.toml
-rw-r--r-- 1 akapoor tsiusers 616 Oct 3 11:30 pyrightconfig.json
-rw-r--r-- 1 akapoor tsiusers 551 Oct 3 11:30 requirements.txt
drwxr-xr-x 2 akapoor tsiusers 4096 Oct 3 11:30 requirements
drwxr-xr-x 4 akapoor tsiusers 4096 Oct 3 11:30 scripts
drwxr-xr-x 2 akapoor tsiusers 4096 Oct 3 11:30 tests
drwxr-xr-x 17 akapoor tsiusers 4096 Oct 3 11:30 tools
drwxr-xr-x 7 akapoor tsiusers 4096 Oct 3 11:30 vendor
drwxr-xr-x 8 akapoor tsiusers 4096 Oct 3 11:38 ggml-tsi-kernel
-rw-r--r-- 1 akapoor tsiusers 39107 Oct 3 16:37 perf_runs
-rw-r--r-- 1 akapoor tsiusers 16594787 Oct 3 18:07 tsi-ggml-0.0.8.tz
-rwxr-xr-x 1 akapoor tsiusers 5686 Oct 6 14:58 tsi-pkg-build.sh
drwxr-xr-x 3 akapoor tsiusers 8192 Oct 6 21:52 src
drwxr-xr-x 12 akapoor tsiusers 4096 Oct 6 22:15 build-posix
drwxr-xr-x 12 akapoor tsiusers 4096 Oct 6 22:17 build-fpga
drwxr-xr-x 3 akapoor tsiusers 4096 Oct 6 22:18 tsi-ggml
-rw-r--r-- 1 akapoor tsiusers 16594365 Oct 6 22:18 tsi-ggml-0.0.9.tz
-rw-r--r-- 1 akapoor tsiusers 0 Oct 6 22:19 tsi-op.txt
[akapoor@wssw01 llama.cpp]$ build-posix/bin/llama-cli -p "my cat's name" -m /proj/rel/sw/ggml/models/Tiny-Llama-v0.3-FP32-1.1B-F32.gguf --device tSavorite -c 12288 --temp 0.0 --n-predict 10 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup --no-display-prompt
is Luna.
I'm a cat

llama_perf_sampler_print: sampling time = 22.07 ms / 16 runs ( 1.38 ms per token, 725.06 tokens per second)
llama_perf_context_print: load time = 2801.38 ms
llama_perf_context_print: prompt eval time = 2187.48 ms / 6 tokens ( 364.58 ms per token, 2.74 tokens per second)
llama_perf_context_print: eval time = 3322.74 ms / 9 runs ( 369.19 ms per token, 2.71 tokens per second)
llama_perf_context_print: total time = 6148.70 ms / 15 tokens
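
The per-token figures in the `llama_perf_context_print` lines above follow directly from the elapsed time and the token count, which is handy when comparing runs. This is a minimal sketch under that assumption: the regular expression and the `per_token_stats` helper are illustrative and not part of llama.cpp or this PR.

```python
import re

# Two lines copied from the POSIX run above.
LOG = """\
llama_perf_context_print: prompt eval time = 2187.48 ms / 6 tokens ( 364.58 ms per token, 2.74 tokens per second)
llama_perf_context_print: eval time = 3322.74 ms / 9 runs ( 369.19 ms per token, 2.71 tokens per second)
"""

# Captures the phase name, the elapsed milliseconds, and the token/run count.
PATTERN = re.compile(
    r"llama_perf_context_print:\s+(?P<phase>[a-z ]+?) time\s*=\s*"
    r"(?P<ms>[\d.]+) ms / (?P<count>\d+) (?:tokens|runs)"
)


def per_token_stats(log):
    """Map phase -> (ms per token, tokens per second), recomputed from the raw numbers."""
    stats = {}
    for m in PATTERN.finditer(log):
        ms, count = float(m["ms"]), int(m["count"])
        # A zero elapsed time would otherwise divide by zero.
        tokens_per_s = float("inf") if ms == 0 else 1000.0 * count / ms
        stats[m["phase"]] = (ms / count, tokens_per_s)
    return stats


for phase, (ms_per_tok, tok_per_s) in per_token_stats(LOG).items():
    print(f"{phase:<12} {ms_per_tok:8.2f} ms/token {tok_per_s:6.2f} tokens/s")
```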

=== GGML Perf Summary ===
Op         Target   Runs   Total us    Avg us
ADD        OPU      2024    2548795   1259.29
MUL        OPU      2070    1208959    584.04
RMS_NORM   OPU      2070    1218512    588.65
MUL_MAT    CPU     36520   45462314   1244.86
CONT       CPU      7738     344245     44.49
RESHAPE    CPU     11191       4382      0.39
VIEW       CPU     17823       1957      0.11
PERMUTE    CPU     13396       1874      0.14
TRANSPOSE  CPU      3054        531      0.17
GET_ROWS   CPU       410        942      2.30
SET_ROWS   CPU      7805       4385      0.56
SOFT_MAX   CPU      3966     301781     76.09
ROPE       CPU      7815      31120      3.98
GLU        OPU      1012    1498672   1480.90

OPU Profiling Results:

Calls Total(ms) T/call Self(ms) Function

2090 690.4660 0.3304 83.0950 [ 8.42%] [Thread] tsi::runtime::TsavRTPosix::loadBlob
4180 607.0840 0.1452 607.0840 └─ [ 7.41%] tsi::runtime::executeWithTimeout
2090 0.2870 1.37e-04 0.2870 └─ [3.50e-03%] LOAD_BLOB Command Execution

############
FPGA
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv37_09_25_2025/bin# ./run_platform_test.sh
Check if tnApcMgr is running; if it is not, uncomment below line and execute the run_platform_test.sh script.
weights.safetensors exists
Running on v0.1.1.tsv37_09_25_2025
[2018-03-10 10:26:22.775] [error] [llama.cpp:14] No expected result file specified, disabling validation.
Usage: %s llama_reference.safetensors

[2018-03-10 10:26:22.785] [info] Build: 2025-09-18 15:35:19 v0.3.5 (3815dc9/HEAD) | Type: RelWithDebInfo | Device: FPGA
[2018-03-10 10:26:24.599] [info] [llama.cpp:63] Execution time: 1787 ms
[2018-03-10 10:26:24.599] [info] [llama.cpp:66] [LlamaForCausalLM_Random] No expected result file specified, skipping result validation.

Profiling Results (LlamaForCausalLM_Random):

Calls Total(ms) T/call Self(ms) Function

-  1346.9020    0.0000   1346.9020  [71.34%] [Thread] LlamaForCausalLM_Random

60 1120.1490 18.6691 0.6810 └─ [59.33%] tsi::runtime::TsavRT::getTensor
60 1119.2710 18.6545 1119.2710 └─ [59.28%] tsi::runtime::memory::SafeTensorsParser::loadTensors
120 0.1970 0.0016 0.1970 └─ [1.04e-02%] tsi::runtime::memory::SafeTensorsParser::getTensorBuffer
251 103.5070 0.4124 0.0000 └─ [ 5.48%] tsi::runtime::TsavRT::awaitCommandListCompletion
251 982.1793 3.9131 982.1793 └─ [52.02%] TXE 0 Idle
96 22.7013 0.2365 22.7013 └─ [ 1.20%] [ txe_blob_1 ]
16 10.5175 0.6573 10.5175 └─ [5.57e-01%] [ txe_blob_12 ]
8 9.0426 1.1303 9.0426 └─ [4.79e-01%] [ txe_blob_10 ]
8 8.3096 1.0387 8.3096 └─ [4.40e-01%] [ txe_blob_7 ]
8 8.1105 1.0138 8.1105 └─ [4.30e-01%] [ txe_blob_8 ]
32 7.6033 0.2376 7.6033 └─ [4.03e-01%] [ txe_blob_6 ]
8 7.4146 0.9268 7.4146 └─ [3.93e-01%] [ txe_blob_9 ]
8 5.9618 0.7452 5.9618 └─ [3.16e-01%] [ txe_blob_11 ]
16 1.5032 0.0940 1.5032 └─ [7.96e-02%] [ txe_blob_3 ]
16 1.5032 0.0939 1.5032 └─ [7.96e-02%] [ txe_blob_5 ]
16 1.4915 0.0932 1.4915 └─ [7.90e-02%] [ txe_blob_2 ]
16 1.4855 0.0928 1.4855 └─ [7.87e-02%] [ txe_blob_4 ]
3 0.4315 0.1438 0.4315 └─ [2.29e-02%] [ txe_blob_0 ]
1 57.5400 57.5400 56.7190 └─ [ 3.05%] tsi::runtime::TsavRTFPGA::finalize
1 0.8210 0.8210 0.8210 └─ [4.35e-02%] tsi::runtime::TsavRTFPGA::releaseTxes
1 24.9280 24.9280 22.5730 └─ [ 1.32%] tsi::runtime::TsavRTFPGA::initialize
1 1.1240 1.1240 1.1240 └─ [5.95e-02%] tsi::runtime::TsavRT::initialize
1 1.0890 1.0890 1.0890 └─ [5.77e-02%] tsi::runtime::TsavRTFPGA::initializeQueues
1 0.1420 0.1420 0.0910 └─ [7.52e-03%] tsi::runtime::TsavRTFPGA::sendNOPTestCommand
2 0.0510 0.0255 0.0510 └─ [2.70e-03%] tsi::runtime::executeWithTimeout
1 12.2820 12.2820 1.8020 └─ [6.51e-01%] tsi::runtime::TsavRT::initTensorLoader
1 9.3500 9.3500 9.3500 └─ [4.95e-01%] tsi::runtime::memory::SafeTensorsParser::parseJSONHeader
1 1.1300 1.1300 1.1300 └─ [5.99e-02%] tsi::runtime::memory::SafeTensorsParser::SafeTensorsParser
251 7.3810 0.0294 6.7520 └─ [3.91e-01%] tsi::runtime::TsavRT::finalizeCommandList
251 0.6290 0.0025 0.6290 └─ [3.33e-02%] tsi::runtime::executeWithTimeout
251 6.3240 0.0252 6.3240 └─ [3.35e-01%] tsi::runtime::TsavRT::addCommandToList
13 4.8120 0.3702 4.8120 └─ [2.55e-01%] tsi::runtime::TsavRTFPGA::loadBlob
33 3.0540 0.0925 3.0540 └─ [1.62e-01%] tsi::runtime::TsavRT::stridedCopy
129 2.4480 0.0190 2.4480 └─ [1.30e-01%] tsi::runtime::TsavRT::copy
527 2.3030 0.0044 2.3030 └─ [1.22e-01%] tsi::runtime::TsavRT::allocate
586 1.7530 0.0030 1.7530 └─ [9.28e-02%] tsi::runtime::TsavRT::deallocate
13 0.4210 0.0324 0.4210 └─ [2.23e-02%] tsi::runtime::TsavRTFPGA::unloadBlob

[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)

251 95.0160 0.3785 2.3550 [ 5.03%] [Thread] tsi::runtime::TsavRT::processResponses
251 92.6610 0.3692 92.6610 └─ [ 4.91%] tsi::runtime::executeWithTimeout

-  1887.9950    0.0000   1887.9950  [100.00%] TOTAL

========================================================================================================================

Counter Metrics:

Metric Min Max Avg

Queue_0_Occupancy 0.0000 1.0000 0.9921

my cat's name is Luna.

llama_perf_sampler_print: sampling time = 109.94 ms / 11 runs ( 9.99 ms per token, 100.06 tokens per second)
llama_perf_context_print: load time = 55526.58 ms
llama_perf_context_print: prompt eval time = 44015.07 ms / 6 tokens ( 7335.84 ms per token, 0.14 tokens per second)
llama_perf_context_print: eval time = 51390.07 ms / 4 runs (12847.52 ms per token, 0.08 tokens per second)
llama_perf_context_print: total time = 107040.20 ms / 10 tokens

=== GGML Perf Summary ===
Op         Target   Runs   Total us    Avg us
ADD        OPU       484    1827348   3775.51
MUL        OPU       495     994388   2008.86
RMS_NORM   OPU       495     995228   2010.56
MUL_MAT    CPU      8244  544153177  66005.97
CONT       CPU      1383    1296559    937.50
RESHAPE    CPU      1422      18541     13.04
VIEW       CPU      2188       2860      1.31
PERMUTE    CPU      1633       2441      1.49
TRANSPOSE  CPU       387        740      1.91
GET_ROWS   CPU        90      17732    197.02
SET_ROWS   CPU      1719      37145     21.61
SOFT_MAX   CPU       580    1002234   1727.99
ROPE       CPU      1717     125413     73.04
GLU        OPU       242    1128996   4665.27
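
For a rough POSIX-versus-FPGA comparison, the `Avg us` column of this summary can be divided by the same column of the POSIX run with the default `./tsi-pkg-build.sh` build earlier in this description. This is a minimal sketch, not part of this PR: the `slowdown` helper is illustrative, only a subset of ops is included, and the two runs come from different machines, so the ratios are indicative only.

```python
# Avg us per op, copied from the POSIX (default build) and FPGA summaries in this PR.
POSIX_AVG_US = {
    "ADD": 1259.29, "MUL": 584.04, "RMS_NORM": 588.65,
    "MUL_MAT": 1244.86, "SOFT_MAX": 76.09, "GLU": 1480.90,
}
FPGA_AVG_US = {
    "ADD": 3775.51, "MUL": 2008.86, "RMS_NORM": 2010.56,
    "MUL_MAT": 66005.97, "SOFT_MAX": 1727.99, "GLU": 4665.27,
}


def slowdown(posix, fpga):
    """Return op -> FPGA avg / POSIX avg for ops present in both runs."""
    return {op: fpga[op] / posix[op] for op in posix if op in fpga}


ratios = slowdown(POSIX_AVG_US, FPGA_AVG_US)
for op, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{op:<9} {ratio:6.1f}x")
```

With these numbers MUL_MAT is about 53x slower per call on the FPGA board, while the OPU-targeted ops (ADD, MUL, RMS_NORM, GLU) are roughly 3x to 3.5x slower.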

OPU Profiling Results:

Calls Total(ms) T/call Self(ms) Function

1    22.1660   22.1660     20.5430  [2.03e-02%] [Thread] tsi::runtime::TsavRTFPGA::initialize
1     0.7950    0.7950      0.7950  └─ [7.28e-04%] tsi::runtime::TsavRTFPGA::initializeQueues
1     0.6020    0.6020      0.6020  └─ [5.51e-04%] tsi::runtime::TsavRT::initialize
1     0.2260    0.2260      0.1680  └─ [2.07e-04%] tsi::runtime::TsavRTFPGA::sendNOPTestCommand
2     0.0580    0.0290      0.0580    └─ [5.31e-05%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRT::awaitCommandListCompletion (cumulative over all threads)

1310 1059.7610 0.8090 0.0000 [9.71e-01%] [Thread] tsi::runtime::TsavRT::awaitCommandListCompletion
1310 1.06e+05 80.6956 1.06e+05 └─ [96.81%] TXE 0 Idle
215 196.9416 0.9160 196.9416 └─ [1.80e-01%] [ txe_swiglu ]
225 139.7646 0.6212 139.7646 └─ [1.28e-01%] [ txe_rms_norm ]
440 133.4932 0.3034 133.4932 └─ [1.22e-01%] [ txe_mult ]
430 126.5029 0.2942 126.5029 └─ [1.16e-01%] [ txe_add ]

[Thread] tsi::runtime::TsavRT::finalizeCommandList (cumulative over all threads)

1310 726.3770 0.5545 690.6780 [6.65e-01%] [Thread] tsi::runtime::TsavRT::finalizeCommandList
1310 35.6990 0.0273 35.6990 └─ [3.27e-02%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)

1310 1428.4040 1.0904 44.9740 [ 1.31%] [Thread] tsi::runtime::TsavRT::processResponses
1310 1383.4300 1.0561 1383.4300 └─ [ 1.27%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRTFPGA::finalize (cumulative over all threads)

1    79.8770   79.8770     62.3710  [7.31e-02%] [Thread] tsi::runtime::TsavRTFPGA::finalize
1    17.5060   17.5060     17.5060  └─ [1.60e-02%] tsi::runtime::TsavRTFPGA::releaseTxes

[Thread] tsi::runtime::TsavRT::allocate (cumulative over all threads)

1537 73.0430 0.0475 73.0430 [6.69e-02%] [Thread] tsi::runtime::TsavRT::allocate

[Thread] OPU (cumulative over all threads)

1     4.9810    4.9810      4.9810  [4.56e-03%] [Thread] OPU

[Thread] tsi::runtime::TsavRTFPGA::loadBlob (cumulative over all threads)

1310 393.8620 0.3007 393.8620 [3.61e-01%] [Thread] tsi::runtime::TsavRTFPGA::loadBlob

[Thread] tsi::runtime::TsavRT::addCommandToList (cumulative over all threads)

1310 76.7380 0.0586 76.7380 [7.03e-02%] [Thread] tsi::runtime::TsavRT::addCommandToList

[Thread] tsi::runtime::TsavRTFPGA::unloadBlob (cumulative over all threads)

1310 99.9070 0.0763 99.9070 [9.15e-02%] [Thread] tsi::runtime::TsavRTFPGA::unloadBlob

[Thread] tsi::runtime::TsavRT::deallocate (cumulative over all threads)

1310 15.7400 0.0120 15.7400 [1.44e-02%] [Thread] tsi::runtime::TsavRT::deallocate

-   1.09e+05    0.0000    1.09e+05  [100.00%] TOTAL

========================================================================================================================

Counter Metrics:

Metric Min Max Avg

Queue_0_Occupancy 0.0000 1.0000 0.8255
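
The profile above prints percentages in both plain (`[96.81%]`) and scientific (`[1.80e-01%]`) notation, which makes the rows hard to rank by eye. This is a minimal sketch that normalizes them, not part of this PR or of the TSI profiler: the regular expression and the `parse_rows` helper are illustrative, and the rows are copied from the `awaitCommandListCompletion` block above.

```python
import re

PROFILE = """\
1310 1059.7610 0.8090 0.0000 [9.71e-01%] [Thread] tsi::runtime::TsavRT::awaitCommandListCompletion
1310 1.06e+05 80.6956 1.06e+05 └─ [96.81%] TXE 0 Idle
215 196.9416 0.9160 196.9416 └─ [1.80e-01%] [ txe_swiglu ]
225 139.7646 0.6212 139.7646 └─ [1.28e-01%] [ txe_rms_norm ]
440 133.4932 0.3034 133.4932 └─ [1.22e-01%] [ txe_mult ]
430 126.5029 0.2942 126.5029 └─ [1.16e-01%] [ txe_add ]
"""

# Matches "[96.81%]", "[ 8.42%]" or "[1.80e-01%]", followed by the row label.
PERCENT = re.compile(r"\[\s*(\d+(?:\.\d+)?(?:e[+-]?\d+)?)%\]\s+(.+)$")


def parse_rows(text):
    """Return (percent, label) pairs sorted by percent, largest first."""
    rows = []
    for line in text.splitlines():
        m = PERCENT.search(line)
        if m:
            rows.append((float(m.group(1)), m.group(2).strip()))
    return sorted(rows, reverse=True)


for pct, label in parse_rows(PROFILE):
    print(f"{pct:8.4f}%  {label}")
```

In this run the four `txe_*` kernels add up to about 0.55% of the profiled time against 96.81% for `TXE 0 Idle`, which is likely a reflection of MUL_MAT running on the CPU in the GGML summary.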

root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv37_09_25_2025/bin# ./run_llama_cli.sh
is Luna.

llama_perf_sampler_print: sampling time = 128.37 ms / 11 runs ( 11.67 ms per token, 85.69 tokens per second)
llama_perf_context_print: load time = 25702.35 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 51612.33 ms / 4 runs (12903.08 ms per token, 0.08 tokens per second)
llama_perf_context_print: total time = 64458.90 ms / 5 tokens

=== GGML Perf Summary ===
Op         Target   Runs   Total us    Avg us
ADD        OPU       440    1326424   3014.60
MUL        OPU       450     635470   1412.16
RMS_NORM   OPU       450    1059817   2355.15
MUL_MAT    CPU      7824  468509365  59881.05
CONT       CPU      1321    1161228    879.05
RESHAPE    CPU      1265      19943     15.77
VIEW       CPU      1992       3757      1.89
PERMUTE    CPU      1484       2468      1.66
TRANSPOSE  CPU       369        765      2.07
GET_ROWS   CPU        78      14125    181.09
SET_ROWS   CPU      1440      27984     19.43
SOFT_MAX   CPU       516     655100   1269.57
ROPE       CPU      1372      89013     64.88
GLU        OPU       220     867334   3942.43

OPU Profiling Results:

Calls Total(ms) T/call Self(ms) Function

1    38.8710   38.8710     34.3800  [5.83e-02%] [Thread] tsi::runtime::TsavRTFPGA::initialize
1     3.1180    3.1180      3.1180  └─ [4.68e-03%] tsi::runtime::TsavRTFPGA::initializeQueues
1     0.6880    0.6880      0.6880  └─ [1.03e-03%] tsi::runtime::TsavRT::initialize
1     0.6850    0.6850      0.6250  └─ [1.03e-03%] tsi::runtime::TsavRTFPGA::sendNOPTestCommand
2     0.0600    0.0300      0.0600    └─ [9.00e-05%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRT::awaitCommandListCompletion (cumulative over all threads)

624 503.8260 0.8074 0.0000 [7.56e-01%] [Thread] tsi::runtime::TsavRT::awaitCommandListCompletion
624 63502.1511 101.7663 63502.1511 └─ [95.28%] TXE 0 Idle
88 79.0270 0.8980 79.0270 └─ [1.19e-01%] [ txe_swiglu ]
180 68.7334 0.3819 68.7334 └─ [1.03e-01%] [ txe_rms_norm ]
180 55.1530 0.3064 55.1530 └─ [8.28e-02%] [ txe_mult ]
176 50.5374 0.2871 50.5374 └─ [7.58e-02%] [ txe_add ]

[Thread] tsi::runtime::TsavRT::finalizeCommandList (cumulative over all threads)

624 418.5910 0.6708 406.5780 [6.28e-01%] [Thread] tsi::runtime::TsavRT::finalizeCommandList
624 12.0130 0.0193 12.0130 └─ [1.80e-02%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)

624 720.9390 1.1554 27.6310 [ 1.08%] [Thread] tsi::runtime::TsavRT::processResponses
624 693.3080 1.1111 693.3080 └─ [ 1.04%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRTFPGA::finalize (cumulative over all threads)

1    81.3050   81.3050     63.1190  [1.22e-01%] [Thread] tsi::runtime::TsavRTFPGA::finalize
1    18.1860   18.1860     18.1860  └─ [2.73e-02%] tsi::runtime::TsavRTFPGA::releaseTxes

[Thread] tsi::runtime::TsavRT::allocate (cumulative over all threads)

806 58.3560 0.0724 58.3560 [8.76e-02%] [Thread] tsi::runtime::TsavRT::allocate

[Thread] OPU (cumulative over all threads)

1     6.1460    6.1460      6.1460  [9.22e-03%] [Thread] OPU

[Thread] tsi::runtime::TsavRTFPGA::loadBlob (cumulative over all threads)

624 244.2940 0.3915 244.2940 [3.67e-01%] [Thread] tsi::runtime::TsavRTFPGA::loadBlob

[Thread] tsi::runtime::TsavRT::addCommandToList (cumulative over all threads)

624 45.7490 0.0733 45.7490 [6.86e-02%] [Thread] tsi::runtime::TsavRT::addCommandToList

[Thread] tsi::runtime::TsavRTFPGA::unloadBlob (cumulative over all threads)

624 48.5660 0.0778 48.5660 [7.29e-02%] [Thread] tsi::runtime::TsavRTFPGA::unloadBlob

[Thread] tsi::runtime::TsavRT::deallocate (cumulative over all threads)

624 9.0120 0.0144 9.0120 [1.35e-02%] [Thread] tsi::runtime::TsavRT::deallocate

- 66649.4240    0.0000  66649.4240  [100.00%] TOTAL

========================================================================================================================

Counter Metrics:

Metric Min Max Avg

Queue_0_Occupancy 0.0000 1.0000 0.7732

root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv37_09_25_2025/bin#

##########
./tsi-pkg-build.sh "release"
POSIX
[akapoor@wssw01 llama.cpp]$ build-posix/bin/llama-cli -p "my cat's name" -m /proj/rel/sw/ggml/models/Tiny-Llama-v0.3-FP32-1.1B-F32.gguf --device tSavorite -c 12288 --temp 0.0 --n-predict 10 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup --no-display-prompt
is Luna.
I'm a cat

llama_perf_sampler_print: sampling time = 27.03 ms / 16 runs ( 1.69 ms per token, 591.96 tokens per second)

llama_perf_context_print: load time = 4058.84 ms
llama_perf_context_print: prompt eval time = 2441.94 ms / 6 tokens ( 406.99 ms per token, 2.46 tokens per second)
llama_perf_context_print: eval time = 3601.34 ms / 9 runs ( 400.15 ms per token, 2.50 tokens per second)
llama_perf_context_print: total time = 7689.99 ms / 15 tokens

=== GGML Perf Summary ===
Op           Runs   Total us    Avg us
ADD          2024    3039086   1501.52
MUL          2070    1252986    605.31
RMS_NORM     2070    1265910    611.55
MUL_MAT     36304   50427030   1389.02
CONT         7765     359564     46.31
RESHAPE     11239       4822      0.43
VIEW        17414       2088      0.12
PERMUTE     13406       1657      0.12
TRANSPOSE    3223        697      0.22
GET_ROWS      377       3201      8.49
SET_ROWS     7419       4735      0.64
SOFT_MAX     3810     327672     86.00
ROPE         7774      33695      4.33
GLU          1012    1256797   1241.89

OPU Profiling Results:

Calls Total(ms) T/call Self(ms) Function

2090 710.3070 0.3399 91.9660 [ 7.29%] [Thread] tsi::runtime::TsavRTPosix::loadBlob
4180 617.6930 0.1478 617.6930 └─ [ 6.34%] tsi::runtime::executeWithTimeout
2090 0.6480 3.10e-04 0.6480 └─ [6.65e-03%] LOAD_BLOB Command Execution

#######
FPGA
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv37_09_25_2025/bin#
*** file: tsi-ggml-0.0.9.tz
$ sz -vv tsi-ggml-0.0.9.tz
Sending: tsi-ggml-0.0.9.tz
Bytes Sent:16593627 BPS:89467

Transfer complete

*** exit status: 0 ***
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv37_09_25_2025/bin# tar -zxvf tsi-ggml-0.0.9.tz
tsi-ggml/blobs/
tsi-ggml/blobs/txe_add.blob
tsi-ggml/blobs/txe_sub.blob
tsi-ggml/blobs/txe_neg.blob
tsi-ggml/blobs/txe_sqrt.blob
tsi-ggml/blobs/txe_sqr.blob
tsi-ggml/blobs/txe_inv.blob
tsi-ggml/blobs/txe_sin.blob
tsi-ggml/blobs/txe_sigmoid.blob
tsi-ggml/blobs/txe_mult.blob
tsi-ggml/blobs/txe_abs_16.blob
tsi-ggml/blobs/txe_neg_16.blob
tsi-ggml/blobs/txe_div.blob
tsi-ggml/blobs/txe_abs.blob
tsi-ggml/blobs/txe_sqrt_16.blob
tsi-ggml/blobs/txe_sqr_16.blob
tsi-ggml/blobs/txe_inv_16.blob
tsi-ggml/blobs/txe_sin_16.blob
tsi-ggml/blobs/txe_silu.blob
tsi-ggml/blobs/txe_sigmoid_16.blob
tsi-ggml/blobs/txe_silu_16.blob
tsi-ggml/blobs/txe_swiglu_16.blob
tsi-ggml/blobs/txe_rms_norm_16.blob
tsi-ggml/blobs/txe_swiglu.blob
tsi-ggml/blobs/txe_rms_norm.blob
tsi-ggml/blobs/txe_add_16.blob
tsi-ggml/blobs/txe_sub_16.blob
tsi-ggml/blobs/txe_mult_16.blob
tsi-ggml/blobs/txe_div_16.blob
tsi-ggml/ggml.sh
tsi-ggml/libggml-base.so
tsi-ggml/libggml-cpu.so
tsi-ggml/libggml.so
tsi-ggml/libggml-tsavorite.so
tsi-ggml/libllama.so
tsi-ggml/llama-cli
tsi-ggml/simple-backend-tsi
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv37_09_25_2025/bin# cd tsi-ggml
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv37_09_25_2025/bin/tsi-ggml# ./ggml.sh
cd ..
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv37_09_25_2025/bin/tsi-ggml# cd ..
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv37_09_25_2025/bin# ./run_llama_cli.sh
is Luna.

llama_perf_sampler_print: sampling time = 126.21 ms / 11 runs ( 11.47 ms per token, 87.16 tokens per second)
llama_perf_context_print: load time = 50298.15 ms
llama_perf_context_print: prompt eval time = 38549.89 ms / 5 tokens ( 7709.98 ms per token, 0.13 tokens per second)
llama_perf_context_print: eval time = 51209.39 ms / 4 runs (12802.35 ms per token, 0.08 tokens per second)
llama_perf_context_print: total time = 101661.49 ms / 9 tokens

=== GGML Perf Summary ===
Op           Runs   Total us    Avg us
ADD           484    1588575   3282.18
MUL           495     915783   1850.07
RMS_NORM      495     990609   2001.23
MUL_MAT      8230  539900785  65601.55
CONT         1442    1347499    934.47
RESHAPE      1356      18341     13.53
VIEW         2054       2991      1.46
PERMUTE      1556       2548      1.64
TRANSPOSE     396        838      2.12
GET_ROWS       84      17034    202.79
SET_ROWS     1594      33222     20.84
SOFT_MAX      576     939702   1631.43
ROPE         1552      99430     64.07
GLU           242    1182510   4886.40

OPU Profiling Results:

Calls Total(ms) T/call Self(ms) Function

1    30.2260   30.2260     25.7880  [2.91e-02%] [Thread] tsi::runtime::TsavRTFPGA::initialize
1     3.1300    3.1300      3.1300  └─ [3.01e-03%] tsi::runtime::TsavRTFPGA::initializeQueues
1     0.6860    0.6860      0.6260  └─ [6.61e-04%] tsi::runtime::TsavRTFPGA::sendNOPTestCommand
2     0.0600    0.0300      0.0600    └─ [5.78e-05%] tsi::runtime::executeWithTimeout
1     0.6220    0.6220      0.6220  └─ [5.99e-04%] tsi::runtime::TsavRT::initialize

[Thread] tsi::runtime::TsavRT::awaitCommandListCompletion (cumulative over all threads)

1204 973.2090 0.8083 0.0000 [9.37e-01%] [Thread] tsi::runtime::TsavRT::awaitCommandListCompletion
1204 1.00e+05 83.3721 1.00e+05 └─ [96.68%] TXE 0 Idle
194 176.8172 0.9114 176.8172 └─ [1.70e-01%] [ txe_swiglu ]
225 129.9412 0.5775 129.9412 └─ [1.25e-01%] [ txe_rms_norm ]
397 120.6315 0.3039 120.6315 └─ [1.16e-01%] [ txe_mult ]
388 113.7148 0.2931 113.7148 └─ [1.10e-01%] [ txe_add ]

[Thread] tsi::runtime::TsavRT::finalizeCommandList (cumulative over all threads)

1204 655.9050 0.5448 629.0780 [6.32e-01%] [Thread] tsi::runtime::TsavRT::finalizeCommandList
1204 26.8270 0.0223 26.8270 └─ [2.58e-02%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)

1204 1311.2360 1.0891 49.8420 [ 1.26%] [Thread] tsi::runtime::TsavRT::processResponses
1204 1261.3940 1.0477 1261.3940 └─ [ 1.21%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRTFPGA::finalize (cumulative over all threads)

1    80.7280   80.7280     62.8320  [7.78e-02%] [Thread] tsi::runtime::TsavRTFPGA::finalize
1    17.8960   17.8960     17.8960  └─ [1.72e-02%] tsi::runtime::TsavRTFPGA::releaseTxes

[Thread] tsi::runtime::TsavRT::allocate (cumulative over all threads)

1431 73.1690 0.0511 73.1690 [7.05e-02%] [Thread] tsi::runtime::TsavRT::allocate

[Thread] OPU (cumulative over all threads)

1     5.1420    5.1420      5.1420  [4.95e-03%] [Thread] OPU

[Thread] tsi::runtime::TsavRTFPGA::loadBlob (cumulative over all threads)

1204 371.8770 0.3089 371.8770 [3.58e-01%] [Thread] tsi::runtime::TsavRTFPGA::loadBlob

[Thread] tsi::runtime::TsavRT::addCommandToList (cumulative over all threads)

1204 90.9340 0.0755 90.9340 [8.76e-02%] [Thread] tsi::runtime::TsavRT::addCommandToList

[Thread] tsi::runtime::TsavRTFPGA::unloadBlob (cumulative over all threads)

1204 91.5520 0.0760 91.5520 [8.82e-02%] [Thread] tsi::runtime::TsavRTFPGA::unloadBlob

[Thread] tsi::runtime::TsavRT::deallocate (cumulative over all threads)

1204 15.5410 0.0129 15.5410 [1.50e-02%] [Thread] tsi::runtime::TsavRT::deallocate

-   1.04e+05    0.0000    1.04e+05  [100.00%] TOTAL

========================================================================================================================

Counter Metrics:

Metric Min Max Avg

Queue_0_Occupancy 0.0000 1.0000 0.8292

root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv37_09_25_2025/bin#

dineshReddy6381

Approved

@FIR-1006 - GGML: PERF changes with following option

09a3864

akapoor3518 requested review from Nithyanand-G, atrivedi-tsavoritesi, dineshReddy6381, dmpatra, gkethamallax, mikeuhler and mmankal as code owners October 7, 2025 05:15

dineshReddy6381 approved these changes Oct 7, 2025

akapoor3518 merged commit 40bfeea into master Oct 7, 2025
