@FIR-999 - Create SOFT_MAX for tsavorite-backend for GGML #63

akapoor3518 · 2025-10-15T21:40:16Z

Posix Result
build-posix/bin/llama-cli -p "my cat's name" -m /proj/rel/sw/ggml/models/Tiny-Llama-v0.3-FP32-1.1B-F32.gguf --device tSavorite -c 12288 --temp 0.0 --n-predict 1 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup --no-display-prompt
llama_perf_sampler_print: sampling time = 2.32 ms / 7 runs ( 0.33 ms per token, 3021.15 tokens per second)

llama_perf_context_print: load time = 14263.70 ms
llama_perf_context_print: prompt eval time = 4560.23 ms / 6 tokens ( 760.04 ms per token, 1.32 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 14266.72 ms / 7 tokens

=== GGML Perf Summary ===
Op         Target  Runs  TSI_KERNEL-RUN   Total us      Avg us
ADD        OPU       44             254     156411     3554.80
MUL        OPU       45             260     144592     3213.16
RMS_NORM   OPU       45              45      66228     1471.73
MUL_MAT    CPU      693               0    5878977     8483.37
CONT       CPU      172               0       9045       52.59
RESHAPE    CPU      239               0        170        0.71
VIEW       CPU      398               0         68        0.17
PERMUTE    CPU      298               0         94        0.32
TRANSPOSE  CPU       71               0         17        0.24
GET_ROWS   CPU        9               0       1907      211.89
SET_ROWS   CPU      173               0        259        1.50
SOFT_MAX   OPU       22            4224    2242909   101950.41
ROPE       CPU      174               0       3156       18.14
GLU        OPU       22             127     133368     6062.18
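
As a reading aid for the summary above: the Avg us column appears to be Total us divided by Runs, and the per-target totals show where the time goes. The sketch below recomputes both from the Posix rows. The names OpRow, posix_rows, avg_us and total_us_for are made up for illustration and are not part of the backend.

```cpp
#include <string>
#include <vector>

// One row of the GGML perf summary (illustrative type, not from the backend).
struct OpRow {
    std::string op;
    std::string target;  // "OPU" or "CPU"
    long runs;
    long total_us;
};

// Posix-run rows copied from the table above: Op, Target, Runs, Total us.
// The TSI_KERNEL-RUN column is left out because it is not needed here.
std::vector<OpRow> posix_rows() {
    return {
        {"ADD", "OPU", 44, 156411},      {"MUL", "OPU", 45, 144592},
        {"RMS_NORM", "OPU", 45, 66228},  {"MUL_MAT", "CPU", 693, 5878977},
        {"CONT", "CPU", 172, 9045},      {"RESHAPE", "CPU", 239, 170},
        {"VIEW", "CPU", 398, 68},        {"PERMUTE", "CPU", 298, 94},
        {"TRANSPOSE", "CPU", 71, 17},    {"GET_ROWS", "CPU", 9, 1907},
        {"SET_ROWS", "CPU", 173, 259},   {"SOFT_MAX", "OPU", 22, 2242909},
        {"ROPE", "CPU", 174, 3156},      {"GLU", "OPU", 22, 133368},
    };
}

// Average microseconds per run, assuming "Avg us" means Total us / Runs.
double avg_us(const OpRow &r) {
    return r.runs > 0 ? static_cast<double>(r.total_us) / static_cast<double>(r.runs) : 0.0;
}

// Sum of "Total us" over all rows dispatched to the given target.
long total_us_for(const std::vector<OpRow> &rows, const std::string &target) {
    long sum = 0;
    for (const OpRow &r : rows) {
        if (r.target == target) sum += r.total_us;
    }
    return sum;
}
```

With these rows, avg_us on the SOFT_MAX entry reproduces the 101950.41 figure, and SOFT_MAX makes up roughly 82% of the summed OPU time in the Posix run.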

OPU Profiling Results:

Calls Total(ms) T/call Self(ms) Function

1    27.9170   27.9170      9.1830  [1.71e-01%] [Thread] tsi::runtime::TsavRTPosix::initialize
1    18.5750   18.5750      2.1900  └─ [1.13e-01%] tsi::runtime::TsavRTPosix::initializeQueues
1    15.2600   15.2600     15.2600    └─ [9.32e-02%] tsi::runtime::TsavRT::awaitCommandListCompletion
1     1.0570    1.0570      1.0570    └─ [6.46e-03%] tsi::runtime::TsavRTPosix::requestTXEDevice
1     0.0680    0.0680      0.0630    └─ [4.15e-04%] tsi::runtime::TsavRT::finalizeCommandList
1     0.0050    0.0050      0.0050      └─ [3.05e-05%] tsi::runtime::executeWithTimeout
1     0.1590    0.1590      0.1590  └─ [9.71e-04%] tsi::runtime::TsavRT::initialize

[Thread] tsi::runtime::TsavRT::finalize (cumulative over all threads)

1    40.3820   40.3820     39.8780  [2.47e-01%] [Thread] tsi::runtime::TsavRT::finalize
1     0.4940    0.4940      0.0530  └─ [3.02e-03%] tsi::runtime::TsavRTPosix::detachFromTXEDevice
1     0.4410    0.4410      0.0820    └─ [2.69e-03%] tsi::runtime::TsavRT::executeSyncCommand
1     0.3290    0.3290      0.3290      └─ [2.01e-03%] tsi::runtime::TsavRT::awaitCommandListCompletion
1     0.0300    0.0300      0.0270      └─ [1.83e-04%] tsi::runtime::TsavRT::finalizeCommandList
1     0.0030    0.0030      0.0030        └─ [1.83e-05%] tsi::runtime::executeWithTimeout
2     0.0100    0.0050      0.0100  └─ [6.11e-05%] tsi::runtime::TsavRT::deallocate

[Thread] tsi::runtime::TsavRTPosix::loadBlob (cumulative over all threads)

4910 781.6510 0.1592 138.1140 [ 4.77%] [Thread] tsi::runtime::TsavRTPosix::loadBlob
9820 642.4060 0.0654 642.4060 └─ [ 3.92%] tsi::runtime::executeWithTimeout
4910 1.1310 2.30e-04 1.1310 └─ [6.91e-03%] LOAD_BLOB Command Execution
4910 0.0000 0.0000 0.0000 └─ [0.00e+00%] Command{command=2 (LOAD_BLOB), blob_args=[2148009600[0x800...
4910 0.0000 0.0000 0.0000 └─ [0.00e+00%] TXE 0 Idle

[Thread] tsi::runtime::TsavRTPosix::unloadBlob (cumulative over all threads)

4910 791.2890 0.1612 158.7450 [ 4.83%] [Thread] tsi::runtime::TsavRTPosix::unloadBlob
9820 630.9930 0.0643 630.9930 └─ [ 3.85%] tsi::runtime::executeWithTimeout
4910 1.5510 3.16e-04 1.5510 └─ [9.47e-03%] UNLOAD_BLOB Command Execution
4910 0.0000 0.0000 0.0000 └─ [0.00e+00%] Command{command=3 (UNLOAD_BLOB), blob_args=[2148009600[0x8...
4910 0.0000 0.0000 0.0000 └─ [0.00e+00%] TXE 0 Idle

[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)

4912 865.4440 0.1762 37.8030 [ 5.29%] [Thread] tsi::runtime::TsavRT::processResponses
4912 827.6410 0.1685 827.6410 └─ [ 5.06%] tsi::runtime::executeWithTimeout

[Thread] OPU (cumulative over all threads)

FPGA Result
This currently fails with a memory allocation failure at the following line; Dushmanta is investigating:
perror("tsi_aligned_malloc: Failed to allocate aligned memory\n");

I have therefore disabled SOFT_MAX on the FPGA and tested without it (SOFT_MAX runs on the CPU in the summary below). The result is below.

root@agilex7_dk_si_agf014eb:/usr/bin/tsi/v0.1.1.tsv38_10_12_2025/bin# tar -zxvf tsi-ggml-0.0.10_working.tz
tsi-ggml/blobs/
tsi-ggml/blobs/txe_add.blob
tsi-ggml/blobs/txe_sub.blob
tsi-ggml/blobs/txe_mult.blob
tsi-ggml/blobs/txe_div.blob
tsi-ggml/blobs/txe_abs.blob
tsi-ggml/blobs/txe_neg.blob
tsi-ggml/blobs/txe_sqrt.blob
tsi-ggml/blobs/txe_sqr.blob
tsi-ggml/blobs/txe_inv.blob
tsi-ggml/blobs/txe_sin.blob
tsi-ggml/blobs/txe_sigmoid.blob
tsi-ggml/blobs/txe_silu.blob
tsi-ggml/blobs/txe_swiglu.blob
tsi-ggml/blobs/txe_div_16.blob
tsi-ggml/blobs/txe_abs_16.blob
tsi-ggml/blobs/txe_rms_norm.blob
tsi-ggml/blobs/txe_soft_max.blob
tsi-ggml/blobs/txe_add_16.blob
tsi-ggml/blobs/txe_sub_16.blob
tsi-ggml/blobs/txe_mult_16.blob
tsi-ggml/blobs/txe_neg_16.blob
tsi-ggml/blobs/txe_sqrt_16.blob
tsi-ggml/blobs/txe_sqr_16.blob
tsi-ggml/blobs/txe_inv_16.blob
tsi-ggml/blobs/txe_sin_16.blob
tsi-ggml/blobs/txe_sigmoid_16.blob
tsi-ggml/blobs/txe_silu_16.blob
tsi-ggml/blobs/txe_swiglu_16.blob
tsi-ggml/blobs/txe_rms_norm_16.blob
tsi-ggml/ggml.sh
tsi-ggml/libggml-base.so
tsi-ggml/libggml-cpu.so
tsi-ggml/libggml.so
tsi-ggml/libggml-tsavorite.so
tsi-ggml/libllama.so
tsi-ggml/llama-cli
tsi-ggml/simple-backend-tsi
root@agilex7_dk_si_agf014eb:/usr/bin/tsi/v0.1.1.tsv38_10_12_2025/bin# ./run_llama_cli.sh
is Luna.

llama_perf_sampler_print: sampling time = 119.16 ms / 11 runs ( 10.83 ms per token, 92.32 tokens per second)
llama_perf_context_print: load time = 54439.14 ms
llama_perf_context_print: prompt eval time = 43831.27 ms / 6 tokens ( 7305.21 ms per token, 0.14 tokens per second)
llama_perf_context_print: eval time = 51031.02 ms / 4 runs (12757.75 ms per token, 0.08 tokens per second)
llama_perf_context_print: total time = 105619.16 ms / 10 tokens

=== GGML Perf Summary ===
Op         Target  Runs    Total us     Avg us
ADD        OPU      484     1680847    3472.82
MUL        OPU      495     1064577    2150.66
RMS_NORM   OPU      495     1029119    2079.03
MUL_MAT    CPU     8227   544547653   66190.31
CONT       CPU     1431     1299315     907.98
RESHAPE    CPU     1366       17971      13.16
VIEW       CPU     2174        3322       1.53
PERMUTE    CPU     1591        2693       1.69
TRANSPOSE  CPU      404         813       2.01
GET_ROWS   CPU       82       17480     213.17
SET_ROWS   CPU     1604       34087      21.25
SOFT_MAX   CPU      585     1001348    1711.71
ROPE       CPU     1557      101817      65.39
GLU        OPU      242     1067356    4410.56

OPU Profiling Results:

Calls Total(ms) T/call Self(ms) Function

1    41.6820   41.6820     31.8990  [3.87e-02%] [Thread] tsi::runtime::TsavRTFPGA::initialize
1     5.3710    5.3710      5.3710  └─ [4.98e-03%] tsi::runtime::TsavRTFPGA::initializeQueues
1     2.4750    2.4750      1.6420  └─ [2.30e-03%] tsi::runtime::TsavRTFPGA::sendNOPTestCommand
2     0.8330    0.4165      0.8330    └─ [7.73e-04%] tsi::runtime::executeWithTimeout
1     1.9370    1.9370      1.9370  └─ [1.80e-03%] tsi::runtime::TsavRT::initialize

[Thread] tsi::runtime::TsavRT::awaitCommandListCompletion (cumulative over all threads)

1310 1017.7360 0.7769 0.0000 [9.44e-01%] [Thread] tsi::runtime::TsavRT::awaitCommandListCompletion
1310 1.04e+05 79.5849 1.04e+05 └─ [96.71%] TXE 0 Idle
215 196.1439 0.9123 196.1439 └─ [1.82e-01%] [ txe_swiglu ]
225 140.6145 0.6250 140.6145 └─ [1.30e-01%] [ txe_rms_norm ]
440 133.5208 0.3035 133.5208 └─ [1.24e-01%] [ txe_mult ]
430 126.5711 0.2944 126.5711 └─ [1.17e-01%] [ txe_add ]

[Thread] OPU (cumulative over all threads)

1     5.3030    5.3030      5.0330  [4.92e-03%] [Thread] OPU 
1     0.2700    0.2700      0.2700  └─ [2.50e-04%] tsi::runtime::TsavRT::allocate

[Thread] tsi::runtime::TsavRT::finalizeCommandList (cumulative over all threads)

1310 746.8850 0.5701 729.0130 [6.93e-01%] [Thread] tsi::runtime::TsavRT::finalizeCommandList
1310 17.8720 0.0136 17.8720 └─ [1.66e-02%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)

1310 1492.0960 1.1390 53.5740 [ 1.38%] [Thread] tsi::runtime::TsavRT::processResponses
1310 1438.5220 1.0981 1438.5220 └─ [ 1.33%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRTFPGA::finalize (cumulative over all threads)

1    80.1760   80.1760     62.3420  [7.44e-02%] [Thread] tsi::runtime::TsavRTFPGA::finalize
1    17.8340   17.8340     17.8340  └─ [1.65e-02%] tsi::runtime::TsavRTFPGA::releaseTxes

[Thread] tsi::runtime::TsavRT::allocate (cumulative over all threads)

1312 68.7070 0.0524 68.7070 [6.37e-02%] [Thread] tsi::runtime::TsavRT::allocate

[Thread] tsi::runtime::TsavRTFPGA::loadBlob (cumulative over all threads)

1310 369.7450 0.2822 369.7450 [3.43e-01%] [Thread] tsi::runtime::TsavRTFPGA::loadBlob

[Thread] tsi::runtime::TsavRT::addCommandToList (cumulative over all threads)

1310 76.1190 0.0581 76.1190 [7.06e-02%] [Thread] tsi::runtime::TsavRT::addCommandToList

[Thread] tsi::runtime::TsavRTFPGA::unloadBlob (cumulative over all threads)

1310 98.7010 0.0753 98.7010 [9.16e-02%] [Thread] tsi::runtime::TsavRTFPGA::unloadBlob

[Thread] tsi::runtime::TsavRT::deallocate (cumulative over all threads)

1310 15.6920 0.0120 15.6920 [1.46e-02%] [Thread] tsi::runtime::TsavRT::deallocate

-   1.08e+05    0.0000    1.08e+05  [100.00%] TOTAL

========================================================================================================================

Counter Metrics:

Metric Min Max Avg

Queue_0_Occupancy 0.0000 1.0000 0.8544

root@agilex7_dk_si_agf014eb:/usr/bin/tsi/v0.1.1.tsv38_10_12_2025/bin#

atrivedi-tsavoritesi · 2025-10-15T22:06:57Z

ggml/src/ggml-tsavorite/ggml-tsavorite.cpp

 static ggml_backend_buffer_t
 ggml_backend_tsavorite_buffer_type_alloc_buffer(ggml_backend_buffer_type_t buft, size_t size) {
  GGML_TSAVORITE_LOG_INFO("Start %s\n", __func__);
+  tsi_log_setup();


Do you need this code here?

akapoor3518 · 2025-10-17

Yes. It covers cases where ggml_tsavorite_init is not called but the log file still needs to be set up.
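
The reply above describes a setup step that must be reachable from more than one entry point. One common way to make such a call safe to repeat is a function-local static, sketched below. This is only an illustration under assumptions: the body of tsi_log_setup() is not shown in this PR, and open_log_sink, log_sink, backend_init_sketch and alloc_buffer_sketch are hypothetical names.

```cpp
#include <cstdio>

// Counts how often the one-time setup body actually ran (illustration only).
static int g_setup_runs = 0;

// Hypothetical stand-in for the real setup work; the actual tsi_log_setup()
// body is not shown in this PR.
static std::FILE *open_log_sink() {
    ++g_setup_runs;
    return stderr;  // a real setup would presumably open a log file here
}

// Returns the shared log sink. The function-local static is initialized on the
// first call only, and C++11 guarantees that this initialization is thread-safe.
static std::FILE *log_sink() {
    static std::FILE *sink = open_log_sink();
    return sink;
}

// Hypothetical entry point A: explicit backend initialization.
static void backend_init_sketch() {
    std::fprintf(log_sink(), "init\n");
}

// Hypothetical entry point B: buffer allocation, which may run without A.
static void alloc_buffer_sketch() {
    std::fprintf(log_sink(), "alloc\n");
}
```

Whichever entry point runs first performs the setup, and later calls from either path reuse the same sink, so calling the setup from both places costs little.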

@FIR-999 - Create SOFT_MAX for tsavorite-backend for GGML

85b71fd

akapoor3518 requested review from Nithyanand-G, atrivedi-tsavoritesi, dineshReddy6381, dmpatra, gkethamallax, mikeuhler and mmankal as code owners October 15, 2025 21:40

atrivedi-tsavoritesi reviewed Oct 15, 2025

mikeuhler removed their request for review October 15, 2025 22:07

Added memory Alignment for 128 Bytes

41137ce
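
The commit above adds 128-byte alignment. A minimal sketch of that kind of allocation follows, assuming a POSIX system and using posix_memalign. It is not the backend's tsi_aligned_malloc, whose implementation is not shown here; aligned_malloc_sketch, round_up, is_aligned and kAlign are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>   // std::strerror
#include <stdlib.h>  // posix_memalign, free (POSIX)

constexpr std::size_t kAlign = 128;  // alignment named in the commit title

// Round a byte count up to the next multiple of `align` (must be a power of two).
constexpr std::size_t round_up(std::size_t n, std::size_t align) {
    return (n + align - 1) & ~(align - 1);
}

// Hypothetical helper: returns a 128-byte-aligned block, or nullptr on failure.
// The caller releases the block with free().
void *aligned_malloc_sketch(std::size_t size) {
    void *ptr = nullptr;
    const int rc = posix_memalign(&ptr, kAlign, round_up(size, kAlign));
    if (rc != 0) {
        // posix_memalign returns the error code instead of setting errno,
        // so report it with strerror(rc) rather than perror().
        std::fprintf(stderr, "aligned_malloc_sketch: %s\n", std::strerror(rc));
        return nullptr;
    }
    return ptr;
}

// True when the address is a multiple of `align`.
bool is_aligned(const void *p, std::size_t align) {
    return reinterpret_cast<std::uintptr_t>(p) % align == 0;
}
```

Rounding the size up as well as aligning the start address keeps the end of the buffer on a 128-byte boundary, which matters if the consumer reads or writes whole 128-byte units. Whether the backend needs that is an assumption here.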

atrivedi-tsavoritesi approved these changes Oct 17, 2025

akapoor3518 merged commit 721bb4b into master Oct 17, 2025
