@akapoor3518 akapoor3518 commented Oct 18, 2025

POSIX Validation
[akapoor@wssw01 llama.cpp]$ build-posix/bin/llama-cli -p "my cat's name" -m /proj/rel/sw/ggml/models/Tiny-Llama-v0.3-FP32-1.1B-F321.gguf --device tSavorite -c 12288 --temp 0.0 --n-predict 1 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup --no-display-prompt

OPU Profiling Results:

Calls Total(ms) T/call Self(ms) Function

1    42.7730   42.7730      3.2870  [ 2.02%] [Thread] tsi::runtime::TsavRTPosix::initialize
1    39.2080   39.2080      5.1000  └─ [ 1.85%] tsi::runtime::TsavRTPosix::initializeQueues
1    32.7540   32.7540     32.7540    └─ [ 1.55%] tsi::runtime::TsavRT::awaitCommandListCompletion
1     1.2580    1.2580      1.2580    └─ [5.95e-02%] tsi::runtime::TsavRTPosix::requestTXEDevice
1     0.0960    0.0960      0.0870    └─ [4.54e-03%] tsi::runtime::TsavRT::finalizeCommandList
1     0.0090    0.0090      0.0090      └─ [4.26e-04%] tsi::runtime::executeWithTimeout
1     0.2780    0.2780      0.2780  └─ [1.31e-02%] tsi::runtime::TsavRT::initialize

[Thread] tsi::runtime::TsavRT::finalize (cumulative over all threads)

1     2.9980    2.9980      2.6600  [1.42e-01%] [Thread] tsi::runtime::TsavRT::finalize
1     0.3120    0.3120      0.0610  └─ [1.48e-02%] tsi::runtime::TsavRTPosix::detachFromTXEDevice
1     0.2510    0.2510      0.0880    └─ [1.19e-02%] tsi::runtime::TsavRT::executeSyncCommand
1     0.1250    0.1250      0.1250      └─ [5.91e-03%] tsi::runtime::TsavRT::awaitCommandListCompletion
1     0.0380    0.0380      0.0340      └─ [1.80e-03%] tsi::runtime::TsavRT::finalizeCommandList
1     0.0040    0.0040      0.0040        └─ [1.89e-04%] tsi::runtime::executeWithTimeout
2     0.0260    0.0130      0.0260  └─ [1.23e-03%] tsi::runtime::TsavRT::deallocate

[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)

2    32.6360   16.3180      0.1350  [ 1.54%] [Thread] tsi::runtime::TsavRT::processResponses
2    32.5010   16.2505     32.5010  └─ [ 1.54%] tsi::runtime::executeWithTimeout

========================================================================================================================
- 2114.9780 0.0000 2114.9780 [100.00%] TOTAL

Counter Metrics:

Metric Min Max Avg

Queue_0_Occupancy 0.0000 1.0000 0.6667


FPGA Validation
With good and bad models
root@agilex7_dk_si_agf014eb:/usr/bin/tsi/v0.1.1.tsv38_10_12_2025/bin# ./run_llama_cli.sh
is Luna.

llama_perf_sampler_print: sampling time = 110.51 ms / 11 runs ( 10.05 ms per token, 99.54 tokens per second)
llama_perf_context_print: load time = 24828.69 ms
llama_perf_context_print: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: eval time = 51249.24 ms / 4 runs (12812.31 ms per token, 0.08 tokens per second)
llama_perf_context_print: total time = 63245.64 ms / 5 tokens

=== GGML Perf Summary ===
Op         Target  Runs  TSI_KERNEL-RUN   Total us    Avg us
ADD        OPU      440             440    1287759   2926.72
MUL        OPU      450             450     683687   1519.30
RMS_NORM   OPU      450             450    1174534   2610.08
MUL_MAT    CPU     7881               0  465226500  59031.40
CONT       CPU     1208               0    1153354    954.76
RESHAPE    CPU     1148               0      31816     27.71
VIEW       CPU     1770               0       2891      1.63
PERMUTE    CPU     1423               0       2425      1.70
TRANSPOSE  CPU      294               0        741      2.52
GET_ROWS   CPU       79               0      15513    196.37
SET_ROWS   CPU     1525               0      28783     18.87
SOFT_MAX   CPU      584               0     660988   1131.83
ROPE       CPU     1446               0     130146     90.00
GLU        OPU      220             220     840087   3818.58
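
For a quick read of the summary above, here is a minimal sketch that totals the "Total us" column per target and reports how much of the summed op time falls on MUL_MAT. The rows are copied from the table; the helper names (`time_by_target`, `share`) are hypothetical and not part of llama.cpp or the TSI runtime. These are accumulated per-op times, which I assume need not match wall-clock time.

```python
from collections import defaultdict

# Rows copied from the GGML Perf Summary above: (op, target, total_us).
ROWS = [
    ("ADD", "OPU", 1287759),
    ("MUL", "OPU", 683687),
    ("RMS_NORM", "OPU", 1174534),
    ("MUL_MAT", "CPU", 465226500),
    ("CONT", "CPU", 1153354),
    ("RESHAPE", "CPU", 31816),
    ("VIEW", "CPU", 2891),
    ("PERMUTE", "CPU", 2425),
    ("TRANSPOSE", "CPU", 741),
    ("GET_ROWS", "CPU", 15513),
    ("SET_ROWS", "CPU", 28783),
    ("SOFT_MAX", "CPU", 660988),
    ("ROPE", "CPU", 130146),
    ("GLU", "OPU", 840087),
]


def time_by_target(rows):
    """Sum total microseconds per execution target (hypothetical helper)."""
    totals = defaultdict(int)
    for _op, target, total_us in rows:
        totals[target] += total_us
    return dict(totals)


def share(rows, op):
    """Fraction of the summed op time attributed to a single op."""
    grand_total = sum(total_us for _, _, total_us in rows)
    op_total = sum(total_us for name, _, total_us in rows if name == op)
    return op_total / grand_total


if __name__ == "__main__":
    for target, total_us in sorted(time_by_target(ROWS).items()):
        print(f"{target}: {total_us / 1e6:.2f} s")
    print(f"MUL_MAT share: {share(ROWS, 'MUL_MAT'):.2%}")
```

Read this way, the CPU-side MUL_MAT dominates the accumulated op time in this run, while the four ops marked OPU account for only a small fraction.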

OPU Profiling Results:

Calls Total(ms) T/call Self(ms) Function

1    29.9400   29.9400     25.5800  [4.58e-02%] [Thread] tsi::runtime::TsavRTFPGA::initialize
1     2.9920    2.9920      2.9920  └─ [4.57e-03%] tsi::runtime::TsavRTFPGA::initializeQueues
1     0.7060    0.7060      0.6450  └─ [1.08e-03%] tsi::runtime::TsavRTFPGA::sendNOPTestCommand
2     0.0610    0.0305      0.0610    └─ [9.33e-05%] tsi::runtime::executeWithTimeout
1     0.6620    0.6620      0.6620  └─ [1.01e-03%] tsi::runtime::TsavRT::initialize

[Thread] tsi::runtime::TsavRT::awaitCommandListCompletion (cumulative over all threads)

624 559.2910 0.8963 0.0000 [8.55e-01%] [Thread] tsi::runtime::TsavRT::awaitCommandListCompletion
624 62293.2020 99.8288 62293.2020 └─ [95.23%] TXE 0 Idle
88 79.3556 0.9018 79.3556 └─ [1.21e-01%] [ txe_swiglu ]
180 68.6981 0.3817 68.6981 └─ [1.05e-01%] [ txe_rms_norm ]
180 55.1381 0.3063 55.1381 └─ [8.43e-02%] [ txe_mult ]
176 50.7105 0.2881 50.7105 └─ [7.75e-02%] [ txe_add ]

[Thread] OPU (cumulative over all threads)

1     5.4760    5.4760      5.0230  [8.37e-03%] [Thread] OPU 
1     0.4530    0.4530      0.4530  └─ [6.93e-04%] tsi::runtime::TsavRT::allocate

[Thread] tsi::runtime::TsavRT::finalizeCommandList (cumulative over all threads)

624 407.4300 0.6529 389.5260 [6.23e-01%] [Thread] tsi::runtime::TsavRT::finalizeCommandList
624 17.9040 0.0287 17.9040 └─ [2.74e-02%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)

624 734.5330 1.1771 26.1010 [ 1.12%] [Thread] tsi::runtime::TsavRT::processResponses
624 708.4320 1.1353 708.4320 └─ [ 1.08%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRTFPGA::finalize (cumulative over all threads)

1    80.3720   80.3720     61.9600  [1.23e-01%] [Thread] tsi::runtime::TsavRTFPGA::finalize
1    18.4120   18.4120     18.4120  └─ [2.81e-02%] tsi::runtime::TsavRTFPGA::releaseTxes

[Thread] tsi::runtime::TsavRT::allocate (cumulative over all threads)

626 67.4180 0.1077 67.4180 [1.03e-01%] [Thread] tsi::runtime::TsavRT::allocate

[Thread] tsi::runtime::TsavRTFPGA::loadBlob (cumulative over all threads)

624 239.5510 0.3839 239.5510 [3.66e-01%] [Thread] tsi::runtime::TsavRTFPGA::loadBlob

[Thread] tsi::runtime::TsavRT::addCommandToList (cumulative over all threads)

624 46.8340 0.0751 46.8340 [7.16e-02%] [Thread] tsi::runtime::TsavRT::addCommandToList

[Thread] tsi::runtime::TsavRTFPGA::unloadBlob (cumulative over all threads)

624 47.4160 0.0760 47.4160 [7.25e-02%] [Thread] tsi::runtime::TsavRTFPGA::unloadBlob

[Thread] tsi::runtime::TsavRT::deallocate (cumulative over all threads)

624 8.3380 0.0134 8.3380 [1.27e-02%] [Thread] tsi::runtime::TsavRT::deallocate

- 65412.8550    0.0000  65412.8550  [100.00%] TOTAL

========================================================================================================================

Counter Metrics:

Metric Min Max Avg

Queue_0_Occupancy 0.0000 1.0000 0.7556

root@agilex7_dk_si_agf014eb:/usr/bin/tsi/v0.1.1.tsv38_10_12_2025/bin# vi run_llama_cli.sh
root@agilex7_dk_si_agf014eb:/usr/bin/tsi/v0.1.1.tsv38_10_12_2025/bin# ./run_llama_cli.sh

OPU Profiling Results:

Calls Total(ms) T/call Self(ms) Function

1    30.3830   30.3830     26.0470  [ 1.40%] [Thread] tsi::runtime::TsavRTFPGA::initialize
1     3.0080    3.0080      3.0080  └─ [1.39e-01%] tsi::runtime::TsavRTFPGA::initializeQueues
1     0.6830    0.6830      0.6240  └─ [3.14e-02%] tsi::runtime::TsavRTFPGA::sendNOPTestCommand
2     0.0590    0.0295      0.0590    └─ [2.72e-03%] tsi::runtime::executeWithTimeout
1     0.6450    0.6450      0.6450  └─ [2.97e-02%] tsi::runtime::TsavRT::initialize

[Thread] tsi::runtime::TsavRTFPGA::finalize (cumulative over all threads)

1    57.4810   57.4810     56.5930  [ 2.65%] [Thread] tsi::runtime::TsavRTFPGA::finalize
1     0.8880    0.8880      0.8880  └─ [4.09e-02%] tsi::runtime::TsavRTFPGA::releaseTxes

========================================================================================================================
- 2171.7140 0.0000 2171.7140 [100.00%] TOTAL

Counter Metrics:

Metric Min Max Avg

Queue_0_Occupancy 0.0000 0.0000 0.0000


@akapoor3518 akapoor3518 merged commit 461411f into master Oct 18, 2025