@akapoor3518 akapoor3518 commented Oct 15, 2025

Posix Result
build-posix/bin/llama-cli -p "my cat's name" -m /proj/rel/sw/ggml/models/Tiny-Llama-v0.3-FP32-1.1B-F32.gguf --device tSavorite -c 12288 --temp 0.0 --n-predict 1 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup --no-display-prompt
llama_perf_sampler_print: sampling time = 2.32 ms / 7 runs ( 0.33 ms per token, 3021.15 tokens per second)

llama_perf_context_print: load time = 14263.70 ms
llama_perf_context_print: prompt eval time = 4560.23 ms / 6 tokens ( 760.04 ms per token, 1.32 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 14266.72 ms / 7 tokens

=== GGML Perf Summary ===
Op         Target  Runs  TSI_KERNEL-RUN  Total us     Avg us
ADD        OPU       44             254    156411    3554.80
MUL        OPU       45             260    144592    3213.16
RMS_NORM   OPU       45              45     66228    1471.73
MUL_MAT    CPU      693               0   5878977    8483.37
CONT       CPU      172               0      9045      52.59
RESHAPE    CPU      239               0       170       0.71
VIEW       CPU      398               0        68       0.17
PERMUTE    CPU      298               0        94       0.32
TRANSPOSE  CPU       71               0        17       0.24
GET_ROWS   CPU        9               0      1907     211.89
SET_ROWS   CPU      173               0       259       1.50
SOFT_MAX   OPU       22            4224   2242909  101950.41
ROPE       CPU      174               0      3156      18.14
GLU        OPU       22             127    133368    6062.18
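
One way to read the summary above is to convert each op's `Total us` into a share of the summed op time. The following is a minimal sketch in C and is not part of this PR: `op_row`, `op_total_us`, and `op_share` are hypothetical names chosen for illustration, and the rows are copied from the POSIX table above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical row type mirroring the Op / Target / "Total us" columns. */
struct op_row {
    const char *op;
    const char *target;   /* "OPU" or "CPU" */
    double      total_us;
};

/* Sum of total_us over all rows. */
static double op_total_us(const struct op_row *rows, size_t n) {
    assert(rows != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        sum += rows[i].total_us;
    }
    return sum;
}

/* Fraction (0..1) of the summed time spent in rows[idx]; 0 for an empty table. */
static double op_share(const struct op_row *rows, size_t n, size_t idx) {
    double sum = op_total_us(rows, n);
    return (sum > 0.0 && idx < n) ? rows[idx].total_us / sum : 0.0;
}

/* Rows copied from the POSIX "GGML Perf Summary" above. */
static const struct op_row posix_rows[] = {
    {"ADD",       "OPU", 156411},  {"MUL",      "OPU", 144592},
    {"RMS_NORM",  "OPU", 66228},   {"MUL_MAT",  "CPU", 5878977},
    {"CONT",      "CPU", 9045},    {"RESHAPE",  "CPU", 170},
    {"VIEW",      "CPU", 68},      {"PERMUTE",  "CPU", 94},
    {"TRANSPOSE", "CPU", 17},      {"GET_ROWS", "CPU", 1907},
    {"SET_ROWS",  "CPU", 259},     {"SOFT_MAX", "OPU", 2242909},
    {"ROPE",      "CPU", 3156},    {"GLU",      "OPU", 133368},
};
```

Calling `op_share(posix_rows, 14, 3)` gives the MUL_MAT share. With these rows, MUL_MAT (on CPU) accounts for roughly 68% of the summed op time and SOFT_MAX (on OPU) for roughly 26%.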

OPU Profiling Results:

Calls Total(ms) T/call Self(ms) Function

1    27.9170   27.9170      9.1830  [1.71e-01%] [Thread] tsi::runtime::TsavRTPosix::initialize
1    18.5750   18.5750      2.1900  └─ [1.13e-01%] tsi::runtime::TsavRTPosix::initializeQueues
1    15.2600   15.2600     15.2600    └─ [9.32e-02%] tsi::runtime::TsavRT::awaitCommandListCompletion
1     1.0570    1.0570      1.0570    └─ [6.46e-03%] tsi::runtime::TsavRTPosix::requestTXEDevice
1     0.0680    0.0680      0.0630    └─ [4.15e-04%] tsi::runtime::TsavRT::finalizeCommandList
1     0.0050    0.0050      0.0050      └─ [3.05e-05%] tsi::runtime::executeWithTimeout
1     0.1590    0.1590      0.1590  └─ [9.71e-04%] tsi::runtime::TsavRT::initialize

[Thread] tsi::runtime::TsavRT::finalize (cumulative over all threads)

1    40.3820   40.3820     39.8780  [2.47e-01%] [Thread] tsi::runtime::TsavRT::finalize
1     0.4940    0.4940      0.0530  └─ [3.02e-03%] tsi::runtime::TsavRTPosix::detachFromTXEDevice
1     0.4410    0.4410      0.0820    └─ [2.69e-03%] tsi::runtime::TsavRT::executeSyncCommand
1     0.3290    0.3290      0.3290      └─ [2.01e-03%] tsi::runtime::TsavRT::awaitCommandListCompletion
1     0.0300    0.0300      0.0270      └─ [1.83e-04%] tsi::runtime::TsavRT::finalizeCommandList
1     0.0030    0.0030      0.0030        └─ [1.83e-05%] tsi::runtime::executeWithTimeout
2     0.0100    0.0050      0.0100  └─ [6.11e-05%] tsi::runtime::TsavRT::deallocate

[Thread] tsi::runtime::TsavRTPosix::loadBlob (cumulative over all threads)

4910 781.6510 0.1592 138.1140 [ 4.77%] [Thread] tsi::runtime::TsavRTPosix::loadBlob
9820 642.4060 0.0654 642.4060 └─ [ 3.92%] tsi::runtime::executeWithTimeout
4910 1.1310 2.30e-04 1.1310 └─ [6.91e-03%] LOAD_BLOB Command Execution
4910 0.0000 0.0000 0.0000 └─ [0.00e+00%] Command{command=2 (LOAD_BLOB), blob_args=[2148009600[0x800...
4910 0.0000 0.0000 0.0000 └─ [0.00e+00%] TXE 0 Idle

[Thread] tsi::runtime::TsavRTPosix::unloadBlob (cumulative over all threads)

4910 791.2890 0.1612 158.7450 [ 4.83%] [Thread] tsi::runtime::TsavRTPosix::unloadBlob
9820 630.9930 0.0643 630.9930 └─ [ 3.85%] tsi::runtime::executeWithTimeout
4910 1.5510 3.16e-04 1.5510 └─ [9.47e-03%] UNLOAD_BLOB Command Execution
4910 0.0000 0.0000 0.0000 └─ [0.00e+00%] Command{command=3 (UNLOAD_BLOB), blob_args=[2148009600[0x8...
4910 0.0000 0.0000 0.0000 └─ [0.00e+00%] TXE 0 Idle

[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)

4912 865.4440 0.1762 37.8030 [ 5.29%] [Thread] tsi::runtime::TsavRT::processResponses
4912 827.6410 0.1685 827.6410 └─ [ 5.06%] tsi::runtime::executeWithTimeout

[Thread] OPU (cumulative over all threads)

FPGA Result
This currently fails on FPGA with a memory allocation failure, which Dushmanta is investigating:
perror("tsi_aligned_malloc: Failed to allocate aligned memory\n");

I have therefore disabled SOFT_MAX for FPGA and tested; the results are below.
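
The workaround described here, keeping SOFT_MAX off the accelerator on FPGA while leaving it offloaded on POSIX, could be expressed as a per-target capability check. The sketch below is only an illustration under assumed names: `demo_target`, `demo_op`, and `demo_supports_op` are hypothetical and are not taken from the tsavorite backend or from ggml.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical runtime targets and op identifiers, for illustration only. */
enum demo_target { DEMO_TARGET_POSIX, DEMO_TARGET_FPGA };
enum demo_op {
    DEMO_OP_ADD, DEMO_OP_MUL, DEMO_OP_RMS_NORM,
    DEMO_OP_GLU, DEMO_OP_SOFT_MAX, DEMO_OP_MUL_MAT
};

/* Returns true if the op should be offloaded to the accelerator on this
 * target; anything that returns false falls back to the CPU. */
static bool demo_supports_op(enum demo_target target, enum demo_op op) {
    assert(target == DEMO_TARGET_POSIX || target == DEMO_TARGET_FPGA);
    switch (op) {
    case DEMO_OP_ADD:
    case DEMO_OP_MUL:
    case DEMO_OP_RMS_NORM:
    case DEMO_OP_GLU:
        return true;                         /* offloaded on both targets */
    case DEMO_OP_SOFT_MAX:
        return target != DEMO_TARGET_FPGA;   /* disabled on FPGA for now */
    default:
        return false;                        /* e.g. MUL_MAT stays on CPU */
    }
}
```

This mirrors the two perf summaries in this description: SOFT_MAX is listed under OPU in the POSIX run and under CPU in the FPGA run, while MUL_MAT is on CPU in both.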

root@agilex7_dk_si_agf014eb:/usr/bin/tsi/v0.1.1.tsv38_10_12_2025/bin# tar -zxvf tsi-ggml-0.0.10_working.tz
tsi-ggml/blobs/
tsi-ggml/blobs/txe_add.blob
tsi-ggml/blobs/txe_sub.blob
tsi-ggml/blobs/txe_mult.blob
tsi-ggml/blobs/txe_div.blob
tsi-ggml/blobs/txe_abs.blob
tsi-ggml/blobs/txe_neg.blob
tsi-ggml/blobs/txe_sqrt.blob
tsi-ggml/blobs/txe_sqr.blob
tsi-ggml/blobs/txe_inv.blob
tsi-ggml/blobs/txe_sin.blob
tsi-ggml/blobs/txe_sigmoid.blob
tsi-ggml/blobs/txe_silu.blob
tsi-ggml/blobs/txe_swiglu.blob
tsi-ggml/blobs/txe_div_16.blob
tsi-ggml/blobs/txe_abs_16.blob
tsi-ggml/blobs/txe_rms_norm.blob
tsi-ggml/blobs/txe_soft_max.blob
tsi-ggml/blobs/txe_add_16.blob
tsi-ggml/blobs/txe_sub_16.blob
tsi-ggml/blobs/txe_mult_16.blob
tsi-ggml/blobs/txe_neg_16.blob
tsi-ggml/blobs/txe_sqrt_16.blob
tsi-ggml/blobs/txe_sqr_16.blob
tsi-ggml/blobs/txe_inv_16.blob
tsi-ggml/blobs/txe_sin_16.blob
tsi-ggml/blobs/txe_sigmoid_16.blob
tsi-ggml/blobs/txe_silu_16.blob
tsi-ggml/blobs/txe_swiglu_16.blob
tsi-ggml/blobs/txe_rms_norm_16.blob
tsi-ggml/ggml.sh
tsi-ggml/libggml-base.so
tsi-ggml/libggml-cpu.so
tsi-ggml/libggml.so
tsi-ggml/libggml-tsavorite.so
tsi-ggml/libllama.so
tsi-ggml/llama-cli
tsi-ggml/simple-backend-tsi
root@agilex7_dk_si_agf014eb:/usr/bin/tsi/v0.1.1.tsv38_10_12_2025/bin# ./run_llama_cli.sh
is Luna.

llama_perf_sampler_print: sampling time = 119.16 ms / 11 runs ( 10.83 ms per token, 92.32 tokens per second)
llama_perf_context_print: load time = 54439.14 ms
llama_perf_context_print: prompt eval time = 43831.27 ms / 6 tokens ( 7305.21 ms per token, 0.14 tokens per second)
llama_perf_context_print: eval time = 51031.02 ms / 4 runs (12757.75 ms per token, 0.08 tokens per second)
llama_perf_context_print: total time = 105619.16 ms / 10 tokens

=== GGML Perf Summary ===
Op         Target  Runs   Total us    Avg us
ADD        OPU      484    1680847   3472.82
MUL        OPU      495    1064577   2150.66
RMS_NORM   OPU      495    1029119   2079.03
MUL_MAT    CPU     8227  544547653  66190.31
CONT       CPU     1431    1299315    907.98
RESHAPE    CPU     1366      17971     13.16
VIEW       CPU     2174       3322      1.53
PERMUTE    CPU     1591       2693      1.69
TRANSPOSE  CPU      404        813      2.01
GET_ROWS   CPU       82      17480    213.17
SET_ROWS   CPU     1604      34087     21.25
SOFT_MAX   CPU      585    1001348   1711.71
ROPE       CPU     1557     101817     65.39
GLU        OPU      242    1067356   4410.56

OPU Profiling Results:

Calls Total(ms) T/call Self(ms) Function

1    41.6820   41.6820     31.8990  [3.87e-02%] [Thread] tsi::runtime::TsavRTFPGA::initialize
1     5.3710    5.3710      5.3710  └─ [4.98e-03%] tsi::runtime::TsavRTFPGA::initializeQueues
1     2.4750    2.4750      1.6420  └─ [2.30e-03%] tsi::runtime::TsavRTFPGA::sendNOPTestCommand
2     0.8330    0.4165      0.8330    └─ [7.73e-04%] tsi::runtime::executeWithTimeout
1     1.9370    1.9370      1.9370  └─ [1.80e-03%] tsi::runtime::TsavRT::initialize

[Thread] tsi::runtime::TsavRT::awaitCommandListCompletion (cumulative over all threads)

1310 1017.7360 0.7769 0.0000 [9.44e-01%] [Thread] tsi::runtime::TsavRT::awaitCommandListCompletion
1310 1.04e+05 79.5849 1.04e+05 └─ [96.71%] TXE 0 Idle
215 196.1439 0.9123 196.1439 └─ [1.82e-01%] [ txe_swiglu ]
225 140.6145 0.6250 140.6145 └─ [1.30e-01%] [ txe_rms_norm ]
440 133.5208 0.3035 133.5208 └─ [1.24e-01%] [ txe_mult ]
430 126.5711 0.2944 126.5711 └─ [1.17e-01%] [ txe_add ]

[Thread] OPU (cumulative over all threads)

1     5.3030    5.3030      5.0330  [4.92e-03%] [Thread] OPU 
1     0.2700    0.2700      0.2700  └─ [2.50e-04%] tsi::runtime::TsavRT::allocate

[Thread] tsi::runtime::TsavRT::finalizeCommandList (cumulative over all threads)

1310 746.8850 0.5701 729.0130 [6.93e-01%] [Thread] tsi::runtime::TsavRT::finalizeCommandList
1310 17.8720 0.0136 17.8720 └─ [1.66e-02%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)

1310 1492.0960 1.1390 53.5740 [ 1.38%] [Thread] tsi::runtime::TsavRT::processResponses
1310 1438.5220 1.0981 1438.5220 └─ [ 1.33%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRTFPGA::finalize (cumulative over all threads)

1    80.1760   80.1760     62.3420  [7.44e-02%] [Thread] tsi::runtime::TsavRTFPGA::finalize
1    17.8340   17.8340     17.8340  └─ [1.65e-02%] tsi::runtime::TsavRTFPGA::releaseTxes

[Thread] tsi::runtime::TsavRT::allocate (cumulative over all threads)

1312 68.7070 0.0524 68.7070 [6.37e-02%] [Thread] tsi::runtime::TsavRT::allocate

[Thread] tsi::runtime::TsavRTFPGA::loadBlob (cumulative over all threads)

1310 369.7450 0.2822 369.7450 [3.43e-01%] [Thread] tsi::runtime::TsavRTFPGA::loadBlob

[Thread] tsi::runtime::TsavRT::addCommandToList (cumulative over all threads)

1310 76.1190 0.0581 76.1190 [7.06e-02%] [Thread] tsi::runtime::TsavRT::addCommandToList

[Thread] tsi::runtime::TsavRTFPGA::unloadBlob (cumulative over all threads)

1310 98.7010 0.0753 98.7010 [9.16e-02%] [Thread] tsi::runtime::TsavRTFPGA::unloadBlob

[Thread] tsi::runtime::TsavRT::deallocate (cumulative over all threads)

1310 15.6920 0.0120 15.6920 [1.46e-02%] [Thread] tsi::runtime::TsavRT::deallocate

-   1.08e+05    0.0000    1.08e+05  [100.00%] TOTAL

========================================================================================================================

Counter Metrics:

Metric Min Max Avg

Queue_0_Occupancy 0.0000 1.0000 0.8544

root@agilex7_dk_si_agf014eb:/usr/bin/tsi/v0.1.1.tsv38_10_12_2025/bin#

static ggml_backend_buffer_t
ggml_backend_tsavorite_buffer_type_alloc_buffer(ggml_backend_buffer_type_t buft, size_t size) {
GGML_TSAVORITE_LOG_INFO("Start %s\n", __func__);
tsi_log_setup();

Do you need this code here?

Author

Yes, for cases where ggml_tsavorite_init is not called but the log file still needs to be set up.
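
Since the setup call can be reached from more than one entry point, such a function is usually written so that repeated calls are harmless. The sketch below shows one common pattern, a guard on already-initialized state. It is an assumption about how such a function could behave, not the actual body of `tsi_log_setup`; `demo_log_setup`, `demo_log_file`, `demo_log_open_count`, and the path argument are all hypothetical. A multi-threaded caller would need `pthread_once` or a similar primitive instead of a plain check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical log state, for illustration only. */
static FILE *demo_log_file = NULL;
static int   demo_log_open_count = 0;   /* how many times the sink was opened */

/* Opens the log sink on the first call; later calls are no-ops.
 * Falls back to stderr if the file cannot be opened.
 * Returns true once a sink is available. */
static bool demo_log_setup(const char *path) {
    assert(path != NULL);
    if (demo_log_file != NULL) {
        return true;                     /* already set up by another entry point */
    }
    demo_log_file = fopen(path, "a");
    if (demo_log_file == NULL) {
        demo_log_file = stderr;          /* keep logging usable without the file */
    }
    demo_log_open_count++;
    return true;
}
```

With this shape, both an init path and a buffer-allocation path can call `demo_log_setup` unconditionally, and the sink is opened only once regardless of which one runs first.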

@mikeuhler mikeuhler removed their request for review October 15, 2025 22:07
@akapoor3518 akapoor3518 merged commit 721bb4b into master Oct 17, 2025