@FIR-999 - Create SOFT_MAX for tsavorite-backend for GGML #63
Posix Result
build-posix/bin/llama-cli -p "my cat's name" -m /proj/rel/sw/ggml/models/Tiny-Llama-v0.3-FP32-1.1B-F32.gguf --device tSavorite -c 12288 --temp 0.0 --n-predict 1 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup --no-display-prompt
llama_perf_sampler_print: sampling time = 2.32 ms / 7 runs ( 0.33 ms per token, 3021.15 tokens per second)
llama_perf_context_print: load time = 14263.70 ms
llama_perf_context_print: prompt eval time = 4560.23 ms / 6 tokens ( 760.04 ms per token, 1.32 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 14266.72 ms / 7 tokens
=== GGML Perf Summary ===
Op Target Runs TSI_KERNEL-RUN Total us Avg us
ADD OPU 44 254 156411 3554.80
MUL OPU 45 260 144592 3213.16
RMS_NORM OPU 45 45 66228 1471.73
MUL_MAT CPU 693 0 5878977 8483.37
CONT CPU 172 0 9045 52.59
RESHAPE CPU 239 0 170 0.71
VIEW CPU 398 0 68 0.17
PERMUTE CPU 298 0 94 0.32
TRANSPOSE CPU 71 0 17 0.24
GET_ROWS CPU 9 0 1907 211.89
SET_ROWS CPU 173 0 259 1.50
SOFT_MAX OPU 22 4224 2242909 101950.41
ROPE CPU 174 0 3156 18.14
GLU OPU 22 127 133368 6062.18
OPU Profiling Results:
Calls Total(ms) T/call Self(ms) Function
[Thread] tsi::runtime::TsavRT::finalize (cumulative over all threads)
[Thread] tsi::runtime::TsavRTPosix::loadBlob (cumulative over all threads)
4910 781.6510 0.1592 138.1140 [ 4.77%] [Thread] tsi::runtime::TsavRTPosix::loadBlob
9820 642.4060 0.0654 642.4060 └─ [ 3.92%] tsi::runtime::executeWithTimeout
4910 1.1310 2.30e-04 1.1310 └─ [6.91e-03%] LOAD_BLOB Command Execution
4910 0.0000 0.0000 0.0000 └─ [0.00e+00%] Command{command=2 (LOAD_BLOB), blob_args=[2148009600[0x800...
4910 0.0000 0.0000 0.0000 └─ [0.00e+00%] TXE 0 Idle
[Thread] tsi::runtime::TsavRTPosix::unloadBlob (cumulative over all threads)
4910 791.2890 0.1612 158.7450 [ 4.83%] [Thread] tsi::runtime::TsavRTPosix::unloadBlob
9820 630.9930 0.0643 630.9930 └─ [ 3.85%] tsi::runtime::executeWithTimeout
4910 1.5510 3.16e-04 1.5510 └─ [9.47e-03%] UNLOAD_BLOB Command Execution
4910 0.0000 0.0000 0.0000 └─ [0.00e+00%] Command{command=3 (UNLOAD_BLOB), blob_args=[2148009600[0x8...
4910 0.0000 0.0000 0.0000 └─ [0.00e+00%] TXE 0 Idle
[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)
4912 865.4440 0.1762 37.8030 [ 5.29%] [Thread] tsi::runtime::TsavRT::processResponses
4912 827.6410 0.1685 827.6410 └─ [ 5.06%] tsi::runtime::executeWithTimeout
[Thread] OPU (cumulative over all threads)
FPGA Result
On FPGA this currently fails with a memory allocation failure, which Dushmanta is investigating. The error is reported at:
perror("tsi_aligned_malloc: Failed to allocate aligned memory\n");
I have therefore disabled SOFT_MAX on FPGA (it runs on the CPU instead) and tested. The result is below.
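While the allocation failure is being investigated, it may help to log the requested size and alignment together with the error code. The sketch below is an assumption-laden illustration: `aligned_alloc_or_report` is a hypothetical stand-in, and the real `tsi_aligned_malloc` is not shown in this PR and may not be built on `posix_memalign` at all. Note that `posix_memalign` returns its error code instead of setting `errno`, so `strerror(rc)` is used here where `perror` would print an unrelated `errno` value.

```c
#define _POSIX_C_SOURCE 200112L

#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for an aligned allocator; not the real tsi_aligned_malloc.
 * Returns NULL on failure and reports the size, alignment, and reason on stderr. */
static void *aligned_alloc_or_report(size_t alignment, size_t size) {
    void *ptr = NULL;

    /* posix_memalign reports failure through its return value (EINVAL or ENOMEM). */
    int rc = posix_memalign(&ptr, alignment, size);
    if (rc != 0) {
        fprintf(stderr,
                "aligned allocation failed: size=%zu alignment=%zu: %s\n",
                size, alignment, strerror(rc));
        return NULL;
    }

    /* Sanity check: the returned pointer honors the requested alignment. */
    assert(((uintptr_t)ptr % alignment) == 0);
    return ptr;
}
```

Logging the size makes it easier to tell whether the FPGA failure comes from an unexpectedly large request or from an invalid alignment argument.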
root@agilex7_dk_si_agf014eb:/usr/bin/tsi/v0.1.1.tsv38_10_12_2025/bin# tar -zxvf tsi-ggml-0.0.10_working.tz
tsi-ggml/blobs/
tsi-ggml/blobs/txe_add.blob
tsi-ggml/blobs/txe_sub.blob
tsi-ggml/blobs/txe_mult.blob
tsi-ggml/blobs/txe_div.blob
tsi-ggml/blobs/txe_abs.blob
tsi-ggml/blobs/txe_neg.blob
tsi-ggml/blobs/txe_sqrt.blob
tsi-ggml/blobs/txe_sqr.blob
tsi-ggml/blobs/txe_inv.blob
tsi-ggml/blobs/txe_sin.blob
tsi-ggml/blobs/txe_sigmoid.blob
tsi-ggml/blobs/txe_silu.blob
tsi-ggml/blobs/txe_swiglu.blob
tsi-ggml/blobs/txe_div_16.blob
tsi-ggml/blobs/txe_abs_16.blob
tsi-ggml/blobs/txe_rms_norm.blob
tsi-ggml/blobs/txe_soft_max.blob
tsi-ggml/blobs/txe_add_16.blob
tsi-ggml/blobs/txe_sub_16.blob
tsi-ggml/blobs/txe_mult_16.blob
tsi-ggml/blobs/txe_neg_16.blob
tsi-ggml/blobs/txe_sqrt_16.blob
tsi-ggml/blobs/txe_sqr_16.blob
tsi-ggml/blobs/txe_inv_16.blob
tsi-ggml/blobs/txe_sin_16.blob
tsi-ggml/blobs/txe_sigmoid_16.blob
tsi-ggml/blobs/txe_silu_16.blob
tsi-ggml/blobs/txe_swiglu_16.blob
tsi-ggml/blobs/txe_rms_norm_16.blob
tsi-ggml/ggml.sh
tsi-ggml/libggml-base.so
tsi-ggml/libggml-cpu.so
tsi-ggml/libggml.so
tsi-ggml/libggml-tsavorite.so
tsi-ggml/libllama.so
tsi-ggml/llama-cli
tsi-ggml/simple-backend-tsi
root@agilex7_dk_si_agf014eb:/usr/bin/tsi/v0.1.1.tsv38_10_12_2025/bin# ./run_llama_cli.sh
is Luna.
llama_perf_sampler_print: sampling time = 119.16 ms / 11 runs ( 10.83 ms per token, 92.32 tokens per second)
llama_perf_context_print: load time = 54439.14 ms
llama_perf_context_print: prompt eval time = 43831.27 ms / 6 tokens ( 7305.21 ms per token, 0.14 tokens per second)
llama_perf_context_print: eval time = 51031.02 ms / 4 runs (12757.75 ms per token, 0.08 tokens per second)
llama_perf_context_print: total time = 105619.16 ms / 10 tokens
=== GGML Perf Summary ===
Op Target Runs Total us Avg us
ADD OPU 484 1680847 3472.82
MUL OPU 495 1064577 2150.66
RMS_NORM OPU 495 1029119 2079.03
MUL_MAT CPU 8227 544547653 66190.31
CONT CPU 1431 1299315 907.98
RESHAPE CPU 1366 17971 13.16
VIEW CPU 2174 3322 1.53
PERMUTE CPU 1591 2693 1.69
TRANSPOSE CPU 404 813 2.01
GET_ROWS CPU 82 17480 213.17
SET_ROWS CPU 1604 34087 21.25
SOFT_MAX CPU 585 1001348 1711.71
ROPE CPU 1557 101817 65.39
GLU OPU 242 1067356 4410.56
OPU Profiling Results:
Calls Total(ms) T/call Self(ms) Function
[Thread] tsi::runtime::TsavRT::awaitCommandListCompletion (cumulative over all threads)
1310 1017.7360 0.7769 0.0000 [9.44e-01%] [Thread] tsi::runtime::TsavRT::awaitCommandListCompletion
1310 1.04e+05 79.5849 1.04e+05 └─ [96.71%] TXE 0 Idle
215 196.1439 0.9123 196.1439 └─ [1.82e-01%] [ txe_swiglu ]
225 140.6145 0.6250 140.6145 └─ [1.30e-01%] [ txe_rms_norm ]
440 133.5208 0.3035 133.5208 └─ [1.24e-01%] [ txe_mult ]
430 126.5711 0.2944 126.5711 └─ [1.17e-01%] [ txe_add ]
[Thread] OPU (cumulative over all threads)
[Thread] tsi::runtime::TsavRT::finalizeCommandList (cumulative over all threads)
1310 746.8850 0.5701 729.0130 [6.93e-01%] [Thread] tsi::runtime::TsavRT::finalizeCommandList
1310 17.8720 0.0136 17.8720 └─ [1.66e-02%] tsi::runtime::executeWithTimeout
[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)
1310 1492.0960 1.1390 53.5740 [ 1.38%] [Thread] tsi::runtime::TsavRT::processResponses
1310 1438.5220 1.0981 1438.5220 └─ [ 1.33%] tsi::runtime::executeWithTimeout
[Thread] tsi::runtime::TsavRTFPGA::finalize (cumulative over all threads)
[Thread] tsi::runtime::TsavRT::allocate (cumulative over all threads)
1312 68.7070 0.0524 68.7070 [6.37e-02%] [Thread] tsi::runtime::TsavRT::allocate
[Thread] tsi::runtime::TsavRTFPGA::loadBlob (cumulative over all threads)
1310 369.7450 0.2822 369.7450 [3.43e-01%] [Thread] tsi::runtime::TsavRTFPGA::loadBlob
[Thread] tsi::runtime::TsavRT::addCommandToList (cumulative over all threads)
1310 76.1190 0.0581 76.1190 [7.06e-02%] [Thread] tsi::runtime::TsavRT::addCommandToList
[Thread] tsi::runtime::TsavRTFPGA::unloadBlob (cumulative over all threads)
1310 98.7010 0.0753 98.7010 [9.16e-02%] [Thread] tsi::runtime::TsavRTFPGA::unloadBlob
[Thread] tsi::runtime::TsavRT::deallocate (cumulative over all threads)
1310 15.6920 0.0120 15.6920 [1.46e-02%] [Thread] tsi::runtime::TsavRT::deallocate
========================================================================================================================
Counter Metrics:
Metric Min Max Avg
Queue_0_Occupancy 0.0000 1.0000 0.8544
root@agilex7_dk_si_agf014eb:/usr/bin/tsi/v0.1.1.tsv38_10_12_2025/bin#