Skip to content

Conversation

@akapoor3518
Copy link

@akapoor3518 akapoor3518 commented Sep 3, 2025

We are currently implementing the following updates:

For customers, we will not display where the kernel is running (OPU or CPU).
Additional improvements are underway.
We’ve ensured that the full prompt response is printed before showing any profile data
TESTED at POSIX and FPGA
#############
POSIX LOG

[akapoor@wssw01 llama.cpp]$ build-posix/bin/llama-cli -p "my cat's name" -m /proj/rel/sw/ggml/models/Tiny-Llama-v0.3-FP32-1.1B-F32.gguf --device tSavorite -c 12288 --temp 0.0 --n-predict 100 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup --no-display-prompt
is Luna.
I'm a cat person and I love my cat Luna. She is a very cute cat and I love her fur. She is a very smart cat and she can do many things. She can climb trees and she can jump over furniture. She is a very loyal cat and she always protects her owner. She is a very loving cat and she always cares for her owner. She is a very kind and loving cat. She is

llama_perf_sampler_print: sampling time = 290.73 ms / 106 runs ( 2.74 ms per token, 364.60 tokens per second)

llama_perf_context_print: load time = 5149.74 ms
llama_perf_context_print: prompt eval time = 4346.23 ms / 6 tokens ( 724.37 ms per token, 1.38 tokens per second)
llama_perf_context_print: eval time = 81737.38 ms / 99 runs ( 825.63 ms per token, 1.21 tokens per second)
llama_perf_context_print: total time = 87267.48 ms / 105 tokens

=== GGML Perf Summary ===
Op Runs Total us Avg us
ADD 4400 12224789 2778.36
MUL 6700 15731162 2347.93
RMS_NORM 17128 67394 3.93
MUL_MAT 79229 127276753 1606.44
CPY 16159 50592 3.13
CONT 8178 8333 1.02
RESHAPE 32601 15146 0.46
VIEW 30039 4546 0.15
PERMUTE 30645 6038 0.20
TRANSPOSE 7135 1443 0.20
GET_ROWS 1099 2809 2.56
SOFT_MAX 8668 208601 24.07
ROPE 16822 97254 5.78
UNARY 2200 6906739 3139.43
-> SILU 2200 6906739 3139.43

OPU Profiling Results:

Calls Total(ms) T/call Self(ms) Function

1    22.8440   22.8440      0.2080  [2.58e-02%] [Thread] OPU 
1    22.6360   22.6360      8.5560  └─ [2.55e-02%] tsi::runtime::TsavRTPosix::initialize
1    13.9060   13.9060      0.3340    └─ [1.57e-02%] tsi::runtime::TsavRTPosix::initializeQueues
1    12.0330   12.0330     12.0330      └─ [1.36e-02%] tsi::runtime::TsavRT::awaitCommandListCompletion
1     1.3980    1.3980      1.3980      └─ [1.58e-03%] tsi::runtime::TsavRTPosix::requestTXEDevice
1     0.1410    0.1410      0.1240      └─ [1.59e-04%] tsi::runtime::TsavRT::finalizeCommandList
1     0.0170    0.0170      0.0170        └─ [1.92e-05%] tsi::runtime::executeWithTimeout
1     0.1740    0.1740      0.1740    └─ [1.96e-04%] tsi::runtime::TsavRT::initialize

[Thread] tsi::runtime::TsavRT::finalize (cumulative over all threads)

1     5.3730    5.3730      3.6490  [6.06e-03%] [Thread] tsi::runtime::TsavRT::finalize
1     1.7130    1.7130      0.0660  └─ [1.93e-03%] tsi::runtime::TsavRTPosix::detachFromTXEDevice
1     1.6470    1.6470      0.0900    └─ [1.86e-03%] tsi::runtime::TsavRT::executeSyncCommand
1     1.5200    1.5200      1.5200      └─ [1.71e-03%] tsi::runtime::TsavRT::awaitCommandListCompletion
1     0.0370    0.0370      0.0330      └─ [4.17e-05%] tsi::runtime::TsavRT::finalizeCommandList
1     0.0040    0.0040      0.0040        └─ [4.51e-06%] tsi::runtime::executeWithTimeout
2     0.0110    0.0055      0.0110  └─ [1.24e-05%] tsi::runtime::TsavRT::deallocate

[Thread] tsi::runtime::TsavRTPosix::loadBlob (cumulative over all threads)

13830 17822.4030 1.2887 583.5730 [20.10%] [Thread] tsi::runtime::TsavRTPosix::loadBlob
27660 17231.9610 0.6230 17231.9610 └─ [19.43%] tsi::runtime::executeWithTimeout
13830 6.8690 4.97e-04 6.8690 └─ [7.75e-03%] LOAD_BLOB Command Execution
13830 0.0000 0.0000 0.0000 └─ [0.00e+00%] Command{command=2 (LOAD_BLOB), blob_args=[2181038720[0x820...
13830 0.0000 0.0000 0.0000 └─ [0.00e+00%] TXE 0 Idle

[Thread] tsi::runtime::TsavRTPosix::unloadBlob (cumulative over all threads)

13830 5052.9680 0.3654 545.9530 [ 5.70%] [Thread] tsi::runtime::TsavRTPosix::unloadBlob
27660 4497.4390 0.1626 4497.4390 └─ [ 5.07%] tsi::runtime::executeWithTimeout
13830 9.5760 6.92e-04 9.5760 └─ [1.08e-02%] UNLOAD_BLOB Command Execution
13830 0.0000 0.0000 0.0000 └─ [0.00e+00%] Command{command=3 (UNLOAD_BLOB), blob_args=[2181038720[0x8...
13830 0.0000 0.0000 0.0000 └─ [0.00e+00%] TXE 0 Idle

[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)

13832 6337.2120 0.4582 114.5360 [ 7.15%] [Thread] tsi::runtime::TsavRT::processResponses
13832 6222.6760 0.4499 6222.6760 └─ [ 7.02%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRT::finalizeCommandList (cumulative over all threads)

13830 318.9850 0.0231 299.1750 [3.60e-01%] [Thread] tsi::runtime::TsavRT::finalizeCommandList
13830 19.8100 0.0014 19.8100 └─ [2.23e-02%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRT::allocate (cumulative over all threads)

13831 80.4440 0.0058 80.4440 [9.07e-02%] [Thread] tsi::runtime::TsavRT::allocate

[Thread] tsi::runtime::TsavRT::addCommandToList (cumulative over all threads)

13830 66.8830 0.0048 66.8830 [7.54e-02%] [Thread] tsi::runtime::TsavRT::addCommandToList

[Thread] tsi::runtime::TsavRT::awaitCommandListCompletion (cumulative over all threads)

13830 10979.1980 0.7939 10979.1980 [12.38%] [Thread] tsi::runtime::TsavRT::awaitCommandListCompletion

[Thread] tsi::runtime::TsavRT::deallocate (cumulative over all threads)

13830 30.4860 0.0022 30.4860 [3.44e-02%] [Thread] tsi::runtime::TsavRT::deallocate

- 88681.5350    0.0000  88681.5350  [100.00%] TOTAL

========================================================================================================================

Counter Metrics:

Metric Min Max Avg

Queue_0_Occupancy 0.0000 1.0000 0.9992

[akapoor@wssw01 llama.cpp]$
[akapoor@wssw01 llama.cpp]$

##########
TESTED at FPGA1
drwxr-xr-x 3 root root 960 Mar 10 2018 tsi-ggml_backup_aug26
-rw-r--r-- 1 root root 14067478 Sep 3 2025 tsi-ggml-0.0.6.tz
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv34_08_14_2025/bin# tar -xvzf tsi-ggml-0.0.6.tz
tsi-ggml/blobs/
tsi-ggml/blobs/txe_add.blob
tsi-ggml/blobs/txe_sub.blob
tsi-ggml/blobs/txe_mult.blob
tsi-ggml/blobs/txe_div.blob
tsi-ggml/blobs/txe_abs.blob
tsi-ggml/blobs/txe_neg.blob
tsi-ggml/blobs/txe_sqrt.blob
tsi-ggml/blobs/txe_sqr.blob
tsi-ggml/blobs/txe_inv.blob
tsi-ggml/blobs/txe_sin.blob
tsi-ggml/blobs/txe_sigmoid.blob
tsi-ggml/blobs/txe_silu.blob
tsi-ggml/ggml.sh
tsi-ggml/libggml-base.so
tsi-ggml/libggml-cpu.so
tsi-ggml/libggml.so
tsi-ggml/libggml-tsavorite.so
tsi-ggml/libllama.so
tsi-ggml/llama-cli
tsi-ggml/simple-backend-tsi
root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv34_08_14_2025/bin# ./run_llama_cli.sh
is Luna.

llama_perf_sampler_print: sampling time = 125.31 ms / 11 runs ( 11.39 ms per token, 87.78 tokens per second)
llama_perf_context_print: load time = 28414.99 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 92456.34 ms / 4 runs (23114.09 ms per token, 0.04 tokens per second)
llama_perf_context_print: total time = 105830.93 ms / 5 tokens

=== GGML Perf Summary ===
Op Runs Total us Avg us
ADD 176 574437 3263.85
MUL 268 765142 2855.01
RMS_NORM 601 27510 45.77
MUL_MAT 3076 254179146 82633.01
CPY 503 32486 64.58
CONT 174 1617 9.29
RESHAPE 683 8044 11.78
VIEW 618 1072 1.73
PERMUTE 583 902 1.55
TRANSPOSE 153 358 2.34
GET_ROWS 36 13093 363.69
SOFT_MAX 236 25738 109.06
ROPE 573 34942 60.98
UNARY 88 354482 4028.20
-> SILU 88 354482 4028.20

OPU Profiling Results:

Calls Total(ms) T/call Self(ms) Function

1   140.5670  140.5670     34.6830  [1.47e-01%] [Thread] OPU 
1   105.8840  105.8840     85.1790  └─ [1.10e-01%] tsi::runtime::TsavRTFPGA::initialize
1     9.1630    9.1630      9.1630    └─ [9.55e-03%] tsi::runtime::TsavRTFPGA::initializeQueues
1     8.4550    8.4550      8.4550    └─ [8.81e-03%] tsi::runtime::TsavRT::initialize
1     3.0870    3.0870      2.6060    └─ [3.22e-03%] tsi::runtime::TsavRTFPGA::sendNOPTestCommand
2     0.4810    0.2405      0.4810      └─ [5.01e-04%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRT::awaitCommandListCompletion (cumulative over all threads)

532 669.8210 1.2591 0.0000 [6.98e-01%] [Thread] tsi::runtime::TsavRT::awaitCommandListCompletion
532 1.85e+05 347.4875 1.85e+05 └─ [192.71%] TXE 0 Idle
268 232.5000 0.8675 232.5000 └─ [2.42e-01%] [ txe_mult ]
88 135.9557 1.5450 135.9557 └─ [1.42e-01%] [ txe_silu ]
176 106.0360 0.6025 106.0360 └─ [1.11e-01%] [ txe_add ]

[Thread] tsi::runtime::TsavRT::finalizeCommandList (cumulative over all threads)

532 423.8360 0.7967 417.7460 [4.42e-01%] [Thread] tsi::runtime::TsavRT::finalizeCommandList
532 6.0900 0.0114 6.0900 └─ [6.35e-03%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)

532 832.2530 1.5644 117.1030 [8.68e-01%] [Thread] tsi::runtime::TsavRT::processResponses
532 715.1500 1.3443 715.1500 └─ [7.45e-01%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRTFPGA::finalize (cumulative over all threads)

1    65.5700   65.5700     50.9570  [6.84e-02%] [Thread] tsi::runtime::TsavRTFPGA::finalize
1    14.6130   14.6130     14.6130  └─ [1.52e-02%] tsi::runtime::TsavRTFPGA::releaseTxes

[Thread] tsi::runtime::TsavRT::allocate (cumulative over all threads)

533 57.2040 0.1073 57.2040 [5.96e-02%] [Thread] tsi::runtime::TsavRT::allocate

[Thread] tsi::runtime::TsavRTFPGA::loadBlob (cumulative over all threads)

532 282.8630 0.5317 282.8630 [2.95e-01%] [Thread] tsi::runtime::TsavRTFPGA::loadBlob

[Thread] tsi::runtime::TsavRT::addCommandToList (cumulative over all threads)

532 42.4230 0.0797 42.4230 [4.42e-02%] [Thread] tsi::runtime::TsavRT::addCommandToList

[Thread] tsi::runtime::TsavRTFPGA::unloadBlob (cumulative over all threads)

532 42.6560 0.0802 42.6560 [4.45e-02%] [Thread] tsi::runtime::TsavRTFPGA::unloadBlob

[Thread] tsi::runtime::TsavRT::deallocate (cumulative over all threads)

532 10.0770 0.0189 10.0770 [1.05e-02%] [Thread] tsi::runtime::TsavRT::deallocate

- 95930.4680    0.0000  95930.4680  [100.00%] TOTAL

========================================================================================================================

Counter Metrics:

Metric Min Max Avg

Queue_0_Occupancy 0.0000 1.0000 0.6180

root@agilex7_dk_si_agf014ea:/usr/bin/tsi/v0.1.1.tsv34_08_14_2025/bin#

Copy link

@atrivedi-tsavoritesi atrivedi-tsavoritesi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add fpga logs as well

@akapoor3518 akapoor3518 merged commit 2e3b794 into master Sep 3, 2025
}
// Ensure all prompt response output is fully completed before printing profiling data.
// Free threadpools first to guarantee all worker threads have finished processing and output.
// This prevents interleaving of prompt response and profiling output, ensuring correct order.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spacing is not aligned

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants