@akapoor3518

Following recent SDK changes, tsi_finalize() must be called whenever tsi_initialize() has been invoked, including on successful runs.
Even when only the CPU backend is actively used, the system still initializes all available backends (CPU and Tsavorite), so tsi_finalize() must be called at the end of the program to ensure proper cleanup.
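
The diff for this change is not shown in this conversation, so the following is only a minimal sketch of one way to guarantee the pairing described above. It assumes the SDK calls take no arguments; `tsi_initialize_stub()`, `tsi_finalize_stub()`, `TsiRuntimeGuard`, and `run_once()` are hypothetical names used for illustration, and the real `tsi_initialize()`/`tsi_finalize()` signatures may differ.

```cpp
#include <cstdio>

// Hypothetical stand-ins for the SDK entry points named in this PR.
// The real tsi_initialize()/tsi_finalize() signatures are assumptions here.
int g_init_calls = 0;
int g_finalize_calls = 0;

void tsi_initialize_stub() { ++g_init_calls; }
void tsi_finalize_stub()   { ++g_finalize_calls; }

// RAII guard: the constructor initializes the runtime and the destructor
// finalizes it, so every exit path from the enclosing scope runs cleanup.
class TsiRuntimeGuard {
public:
    TsiRuntimeGuard()  { tsi_initialize_stub(); }
    ~TsiRuntimeGuard() { tsi_finalize_stub(); }

    // Non-copyable, so finalize cannot run twice for one initialize.
    TsiRuntimeGuard(const TsiRuntimeGuard &) = delete;
    TsiRuntimeGuard & operator=(const TsiRuntimeGuard &) = delete;
};

// Simulated program body with one success path and one failure path.
int run_once(bool simulate_failure) {
    TsiRuntimeGuard guard;  // backends are initialized here

    if (simulate_failure) {
        std::fprintf(stderr, "simulated failure\n");
        return 1;           // guard's destructor still finalizes
    }
    return 0;               // the successful path finalizes as well
}
```

Calling `run_once(false)` and `run_once(true)` each increments both counters once, which is the property the description asks for: cleanup happens on success and on failure. A plain explicit call before every `return` would work equally well; the guard just makes it harder to miss a path.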

Validation Log

[akapoor@wssw01 llama.cpp]$ build-posix/bin/llama-cli -p "ヨーロッパには何人の国がありますか" -m /proj/rel/sw/ggml/models/SakanaAI-TinySwallow-1.5B-Instruct-F32.gguf --device none -c 12288 --temp 0.0 --n-predict 6 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup --no-display-prompt --single-turn
ヨーロッパには

llama_perf_sampler_print: sampling time = 42.21 ms / 25 runs ( 1.69 ms per token, 592.30 tokens per second)
llama_perf_context_print: load time = 7959.91 ms
llama_perf_context_print: prompt eval time = 6300.66 ms / 19 tokens ( 331.61 ms per token, 3.02 tokens per second)
llama_perf_context_print: eval time = 2889.78 ms / 5 runs ( 577.96 ms per token, 1.73 tokens per second)
llama_perf_context_print: total time = 10895.97 ms / 24 tokens

=== GGML Perf Summary ===
Op Target Runs TSI_KERNEL-RUN Total us Avg us
ADD CPU 8004 0 41379 5.17
MUL CPU 3193 0 19486 6.10
RMS_NORM CPU 3137 0 13568 4.33
MUL_MAT CPU 12498 0 56513273 4521.79
CPY CPU 60 0 340 5.67
RESHAPE CPU 5218 0 1299 0.25
VIEW CPU 6763 0 1026 0.15
PERMUTE CPU 3951 0 787 0.20
GET_ROWS CPU 177 0 554 3.13
SET_ROWS CPU 3096 0 2407 0.78
ROPE CPU 3444 0 24709 7.17
FLASH_ATTN_EXT CPU 1765 0 123465 69.95
GLU CPU 1684 0 103134 61.24

OPU Profiling Results:

Calls Total(ms) T/call Self(ms) Function

1    11.4970   11.4970      3.4820  [1.05e-01%] [Thread] tsi::runtime::TsavRTPosix::initialize
1     7.9250    7.9250      0.2790  └─ [7.25e-02%] tsi::runtime::TsavRTPosix::initializeQueues
1     6.9490    6.9490      6.9490    └─ [6.36e-02%] tsi::runtime::TsavRT::awaitCommandListCompletion
1     0.5540    0.5540      0.5540    └─ [5.07e-03%] tsi::runtime::TsavRTPosix::requestTXEDevice
1     0.1430    0.1430      0.1320    └─ [1.31e-03%] tsi::runtime::TsavRT::finalizeCommandList
1     0.0110    0.0110      0.0110      └─ [1.01e-04%] tsi::runtime::executeWithTimeout
1     0.0900    0.0900      0.0900  └─ [8.24e-04%] tsi::runtime::TsavRT::initialize

[Thread] tsi::runtime::TsavRT::finalize (cumulative over all threads)

1     1.1850    1.1850      0.5690  [1.08e-02%] [Thread] tsi::runtime::TsavRT::finalize
1     0.5970    0.5970      0.0520  └─ [5.46e-03%] tsi::runtime::TsavRTPosix::detachFromTXEDevice
1     0.5450    0.5450      0.0580    └─ [4.99e-03%] tsi::runtime::TsavRT::executeSyncCommand
1     0.4660    0.4660      0.4660      └─ [4.27e-03%] tsi::runtime::TsavRT::awaitCommandListCompletion
1     0.0210    0.0210      0.0180      └─ [1.92e-04%] tsi::runtime::TsavRT::finalizeCommandList
1     0.0030    0.0030      0.0030        └─ [2.75e-05%] tsi::runtime::executeWithTimeout
2     0.0190    0.0095      0.0190  └─ [1.74e-04%] tsi::runtime::TsavRT::deallocate

[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)

2     7.3630    3.6815      0.1450  [6.74e-02%] [Thread] tsi::runtime::TsavRT::processResponses
2     7.2180    3.6090      7.2180  └─ [6.61e-02%] tsi::runtime::executeWithTimeout

========================================================================================================================
- 10924.2980 0.0000 10924.2980 [100.00%] TOTAL

Counter Metrics:

Metric Min Max Avg

Queue_0_Occupancy 0.0000 1.0000 0.6667

[akapoor@wssw01 llama.cpp]$ build-posix/bin/llama-cli -p "ヨーロッパには何人の国がありますか" -m /proj/rel/sw/ggml/models/Tiny-Llama-v0.3-FP32-1.1B-F32.gguf --device none -c 12288 --temp 0.0 --n-predict 6 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup --no-display-prompt --single-turn
? どの国

llama_perf_sampler_print: sampling time = 9.75 ms / 25 runs ( 0.39 ms per token, 2564.63 tokens per second)
llama_perf_context_print: load time = 15577.61 ms
llama_perf_context_print: prompt eval time = 4315.17 ms / 19 tokens ( 227.11 ms per token, 4.40 tokens per second)
llama_perf_context_print: eval time = 1848.66 ms / 5 runs ( 369.73 ms per token, 2.70 tokens per second)
llama_perf_context_print: total time = 17437.72 ms / 24 tokens

=== GGML Perf Summary ===
Op Target Runs TSI_KERNEL-RUN Total us Avg us
ADD CPU 2752 0 23871 8.67
MUL CPU 2615 0 19151 7.32
RMS_NORM CPU 2780 0 14278 5.14
MUL_MAT CPU 9859 0 37101377 3763.20
CPY CPU 59 0 339 5.75
RESHAPE CPU 4752 0 3177 0.67
VIEW CPU 5131 0 849 0.17
PERMUTE CPU 3167 0 481 0.15
GET_ROWS CPU 187 0 655 3.50
SET_ROWS CPU 2534 0 1886 0.74
ROPE CPU 2716 0 21700 7.99
FLASH_ATTN_EXT CPU 1400 0 165174 117.98
GLU CPU 1394 0 51338 36.83

OPU Profiling Results:

Calls Total(ms) T/call Self(ms) Function

1    33.8910   33.8910      5.7050  [1.94e-01%] [Thread] tsi::runtime::TsavRTPosix::initialize
1    28.0950   28.0950     16.8860  └─ [1.61e-01%] tsi::runtime::TsavRTPosix::initializeQueues
1    10.3960   10.3960     10.3960    └─ [5.95e-02%] tsi::runtime::TsavRT::awaitCommandListCompletion
1     0.7510    0.7510      0.7510    └─ [4.29e-03%] tsi::runtime::TsavRTPosix::requestTXEDevice
1     0.0620    0.0620      0.0570    └─ [3.55e-04%] tsi::runtime::TsavRT::finalizeCommandList
1     0.0050    0.0050      0.0050      └─ [2.86e-05%] tsi::runtime::executeWithTimeout
1     0.0910    0.0910      0.0910  └─ [5.20e-04%] tsi::runtime::TsavRT::initialize

[Thread] tsi::runtime::TsavRT::finalize (cumulative over all threads)

1     1.6460    1.6460      0.8740  [9.41e-03%] [Thread] tsi::runtime::TsavRT::finalize
1     0.7510    0.7510      0.0540  └─ [4.29e-03%] tsi::runtime::TsavRTPosix::detachFromTXEDevice
1     0.6970    0.6970      0.0590    └─ [3.99e-03%] tsi::runtime::TsavRT::executeSyncCommand
1     0.6180    0.6180      0.6180      └─ [3.53e-03%] tsi::runtime::TsavRT::awaitCommandListCompletion
1     0.0200    0.0200      0.0180      └─ [1.14e-04%] tsi::runtime::TsavRT::finalizeCommandList
1     0.0020    0.0020      0.0020        └─ [1.14e-05%] tsi::runtime::executeWithTimeout
2     0.0210    0.0105      0.0210  └─ [1.20e-04%] tsi::runtime::TsavRT::deallocate

[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)

2    10.8960    5.4480      0.1470  [6.23e-02%] [Thread] tsi::runtime::TsavRT::processResponses
2    10.7490    5.3745     10.7490  └─ [6.15e-02%] tsi::runtime::executeWithTimeout

========================================================================================================================
- 17486.7300 0.0000 17486.7300 [100.00%] TOTAL

Counter Metrics:

Metric Min Max Avg

Queue_0_Occupancy 0.0000 1.0000 0.6667

[akapoor@wssw01 llama.cpp]$ build-posix/bin/llama-cli -p "my cat’s name" -m /proj/rel/sw/ggml/models/Tiny-Llama-v0.3-FP32-1.1B-F32.gguf --device none -c 12288 --temp 0.0 --n-predict 6 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup --no-display-prompt --single-turn
is Luna.

llama_perf_sampler_print: sampling time = 9.54 ms / 12 runs ( 0.79 ms per token, 1258.13 tokens per second)
llama_perf_context_print: load time = 1785.21 ms
llama_perf_context_print: prompt eval time = 1356.91 ms / 6 tokens ( 226.15 ms per token, 4.42 tokens per second)
llama_perf_context_print: eval time = 1827.41 ms / 5 runs ( 365.48 ms per token, 2.74 tokens per second)
llama_perf_context_print: total time = 3624.01 ms / 11 tokens

=== GGML Perf Summary ===
Op Target Runs TSI_KERNEL-RUN Total us Avg us
ADD CPU 2567 0 16399 6.39
MUL CPU 2425 0 11021 4.54
RMS_NORM CPU 2570 0 8644 3.36
MUL_MAT CPU 9859 0 25922896 2629.36
CPY CPU 58 0 409 7.05
RESHAPE CPU 4840 0 3105 0.64
VIEW CPU 5182 0 844 0.16
PERMUTE CPU 3085 0 440 0.14
GET_ROWS CPU 186 0 542 2.91
SET_ROWS CPU 2454 0 1771 0.72
ROPE CPU 2643 0 13420 5.08
FLASH_ATTN_EXT CPU 1387 0 64807 46.72
GLU CPU 1326 0 31194 23.52

OPU Profiling Results:

Calls Total(ms) T/call Self(ms) Function

1    31.8710   31.8710      6.6750  [8.67e-01%] [Thread] tsi::runtime::TsavRTPosix::initialize
1    25.0970   25.0970      4.9910  └─ [6.83e-01%] tsi::runtime::TsavRTPosix::initializeQueues
1    19.1340   19.1340     19.1340    └─ [5.20e-01%] tsi::runtime::TsavRT::awaitCommandListCompletion
1     0.8660    0.8660      0.8660    └─ [2.36e-02%] tsi::runtime::TsavRTPosix::requestTXEDevice
1     0.1060    0.1060      0.1010    └─ [2.88e-03%] tsi::runtime::TsavRT::finalizeCommandList
1     0.0050    0.0050      0.0050      └─ [1.36e-04%] tsi::runtime::executeWithTimeout
1     0.0990    0.0990      0.0990  └─ [2.69e-03%] tsi::runtime::TsavRT::initialize

[Thread] tsi::runtime::TsavRT::finalize (cumulative over all threads)

1     3.9800    3.9800      1.7860  [1.08e-01%] [Thread] tsi::runtime::TsavRT::finalize
1     2.1710    2.1710      0.0780  └─ [5.90e-02%] tsi::runtime::TsavRTPosix::detachFromTXEDevice
1     2.0930    2.0930      0.0960    └─ [5.69e-02%] tsi::runtime::TsavRT::executeSyncCommand
1     1.9610    1.9610      1.9610      └─ [5.33e-02%] tsi::runtime::TsavRT::awaitCommandListCompletion
1     0.0360    0.0360      0.0320      └─ [9.79e-04%] tsi::runtime::TsavRT::finalizeCommandList
1     0.0040    0.0040      0.0040        └─ [1.09e-04%] tsi::runtime::executeWithTimeout
2     0.0230    0.0115      0.0230  └─ [6.25e-04%] tsi::runtime::TsavRT::deallocate

[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)

2    20.3350   10.1675      0.1100  [5.53e-01%] [Thread] tsi::runtime::TsavRT::processResponses
2    20.2250   10.1125     20.2250  └─ [5.50e-01%] tsi::runtime::executeWithTimeout

========================================================================================================================
- 3677.1210 0.0000 3677.1210 [100.00%] TOTAL

Counter Metrics:

Metric Min Max Avg

Queue_0_Occupancy 0.0000 1.0000 0.6667

[akapoor@wssw01 llama.cpp]$ build-posix/bin/llama-cli -p "my cat’s name" -m /proj/rel/sw/ggml/models/Tiny-Llama-v0.3-FP32-1.1B-F32.gguf --device tSavorite -c 12288 --temp 0.0 --n-predict 6 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup --no-display-prompt --single-turn
is Luna.

llama_perf_sampler_print: sampling time = 10.12 ms / 12 runs ( 0.84 ms per token, 1185.54 tokens per second)
llama_perf_context_print: load time = 5221.85 ms
llama_perf_context_print: prompt eval time = 4381.37 ms / 6 tokens ( 730.23 ms per token, 1.37 tokens per second)
llama_perf_context_print: eval time = 5068.42 ms / 5 runs ( 1013.68 ms per token, 0.99 tokens per second)
llama_perf_context_print: total time = 10301.90 ms / 11 tokens

=== GGML Perf Summary ===
Op Target Runs TSI_KERNEL-RUN Total us Avg us
ADD OPU 704 914 1066133 1514.39
MUL OPU 720 935 524754 728.83
RMS_NORM OPU 720 720 483102 670.98
MUL_MAT CPU 12667 0 28239979 2229.41
CONT CPU 2377 0 116430 48.98
RESHAPE CPU 3584 0 2967 0.83
VIEW CPU 5100 0 816 0.16
PERMUTE CPU 4442 0 1348 0.30
TRANSPOSE CPU 991 0 278 0.28
GET_ROWS CPU 139 0 410 2.95
SET_ROWS CPU 2444 0 1868 0.76
SOFT_MAX OPU 352 14784 8494141 24131.08
ROPE CPU 2649 0 15887 6.00
GLU OPU 352 457 563632 1601.23

OPU Profiling Results:

Calls Total(ms) T/call Self(ms) Function

1    10.1890   10.1890      2.9420  [9.83e-02%] [Thread] tsi::runtime::TsavRTPosix::initialize
1     7.1490    7.1490      0.2350  └─ [6.90e-02%] tsi::runtime::TsavRTPosix::initializeQueues
1     6.3400    6.3400      6.3400    └─ [6.12e-02%] tsi::runtime::TsavRT::awaitCommandListCompletion
1     0.5060    0.5060      0.5060    └─ [4.88e-03%] tsi::runtime::TsavRTPosix::requestTXEDevice
1     0.0680    0.0680      0.0620    └─ [6.56e-04%] tsi::runtime::TsavRT::finalizeCommandList
1     0.0060    0.0060      0.0060      └─ [5.79e-05%] tsi::runtime::executeWithTimeout
1     0.0980    0.0980      0.0980  └─ [9.46e-04%] tsi::runtime::TsavRT::initialize

[Thread] tsi::runtime::TsavRT::finalize (cumulative over all threads)

1    40.4810   40.4810     37.9290  [3.91e-01%] [Thread] tsi::runtime::TsavRT::finalize
1     2.5410    2.5410      0.0500  └─ [2.45e-02%] tsi::runtime::TsavRTPosix::detachFromTXEDevice
1     2.4910    2.4910      0.0480    └─ [2.40e-02%] tsi::runtime::TsavRT::executeSyncCommand
1     2.4300    2.4300      2.4300      └─ [2.34e-02%] tsi::runtime::TsavRT::awaitCommandListCompletion
1     0.0130    0.0130      0.0110      └─ [1.25e-04%] tsi::runtime::TsavRT::finalizeCommandList
1     0.0020    0.0020      0.0020        └─ [1.93e-05%] tsi::runtime::executeWithTimeout
2     0.0110    0.0055      0.0110  └─ [1.06e-04%] tsi::runtime::TsavRT::deallocate

[Thread] tsi::runtime::TsavRTPosix::loadBlob (cumulative over all threads)

9210 1835.5240 0.1993 198.7350 [17.71%] [Thread] tsi::runtime::TsavRTPosix::loadBlob
18420 1636.1620 0.0888 1636.1620 └─ [15.79%] tsi::runtime::executeWithTimeout
9210 0.6270 6.81e-05 0.6270 └─ [6.05e-03%] LOAD_BLOB Command Execution
9210 0.0000 0.0000 0.0000 └─ [0.00e+00%] Command{command=2 (LOAD_BLOB), blob_args=[2148009600[0x800...
9210 0.0000 0.0000 0.0000 └─ [0.00e+00%] TXE 0 Idle

[Thread] tsi::runtime::TsavRTPosix::unloadBlob (cumulative over all threads)

9210 1500.5380 0.1629 263.1990 [14.48%] [Thread] tsi::runtime::TsavRTPosix::unloadBlob
18420 1235.3790 0.0671 1235.3790 └─ [11.92%] tsi::runtime::executeWithTimeout
9210 1.9600 2.13e-04 1.9600 └─ [1.89e-02%] UNLOAD_BLOB Command Execution
9210 0.0000 0.0000 0.0000 └─ [0.00e+00%] Command{command=3 (UNLOAD_BLOB), blob_args=[2148009600[0x8...
9210 0.0000 0.0000 0.0000 └─ [0.00e+00%] TXE 0 Idle

[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)

9212 1660.8980 0.1803 84.7190 [16.03%] [Thread] tsi::runtime::TsavRT::processResponses
9212 1576.1790 0.1711 1576.1790 └─ [15.21%] tsi::runtime::executeWithTimeout

[Thread] OPU (cumulative over all threads)

1     0.0970    0.0970      0.0450  [9.36e-04%] [Thread] OPU 
1     0.0520    0.0520      0.0520  └─ [5.02e-04%] tsi::runtime::TsavRT::allocate

[Thread] tsi::runtime::TsavRT::finalizeCommandList (cumulative over all threads)

9210 81.2290 0.0088 70.7740 [7.84e-01%] [Thread] tsi::runtime::TsavRT::finalizeCommandList
9210 10.4550 0.0011 10.4550 └─ [1.01e-01%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRT::allocate (cumulative over all threads)

9212 9.4650 0.0010 9.4650 [9.13e-02%] [Thread] tsi::runtime::TsavRT::allocate

[Thread] tsi::runtime::TsavRT::addCommandToList (cumulative over all threads)

9210 25.2420 0.0027 25.2420 [2.44e-01%] [Thread] tsi::runtime::TsavRT::addCommandToList

[Thread] tsi::runtime::TsavRT::awaitCommandListCompletion (cumulative over all threads)

9210 2053.8720 0.2230 2053.8720 [19.82%] [Thread] tsi::runtime::TsavRT::awaitCommandListCompletion

[Thread] tsi::runtime::TsavRT::deallocate (cumulative over all threads)

9210 6.5680 7.13e-04 6.5680 [6.34e-02%] [Thread] tsi::runtime::TsavRT::deallocate

- 10363.9040    0.0000  10363.9040  [100.00%] TOTAL

========================================================================================================================

Counter Metrics:

Metric Min Max Avg

Queue_0_Occupancy 0.0000 1.0000 0.9998

[akapoor@wssw01 llama.cpp]$ build-posix/bin/llama-cli -p "ヨーロッパには何人の国がありますか" -m /proj/rel/sw/ggml/models/SakanaAI-TinySwallow-1.5B-Instruct-F32.gguf --device tsavorite -c 12288 --temp 0.0 --n-predict 6 --repeat-penalty 1.5 -b 1024 --top-k 50 --top-p 0.9 --repeat-last-n 5 --no-warmup --no-display-prompt --single-turn

OPU Profiling Results:

Calls Total(ms) T/call Self(ms) Function

1    17.5990   17.5990      6.0710  [3.36e-01%] [Thread] tsi::runtime::TsavRTPosix::initialize
1    11.4340   11.4340      0.2870  └─ [2.18e-01%] tsi::runtime::TsavRTPosix::initializeQueues
1    10.3340   10.3340     10.3340    └─ [1.97e-01%] tsi::runtime::TsavRT::awaitCommandListCompletion
1     0.6800    0.6800      0.6800    └─ [1.30e-02%] tsi::runtime::TsavRTPosix::requestTXEDevice
1     0.1330    0.1330      0.1260    └─ [2.54e-03%] tsi::runtime::TsavRT::finalizeCommandList
1     0.0070    0.0070      0.0070      └─ [1.34e-04%] tsi::runtime::executeWithTimeout
1     0.0940    0.0940      0.0940  └─ [1.80e-03%] tsi::runtime::TsavRT::initialize

[Thread] tsi::runtime::TsavRT::finalize (cumulative over all threads)

1    14.2640   14.2640     13.7440  [2.72e-01%] [Thread] tsi::runtime::TsavRT::finalize
1     0.5140    0.5140      0.0740  └─ [9.82e-03%] tsi::runtime::TsavRTPosix::detachFromTXEDevice
1     0.4400    0.4400      0.0810    └─ [8.40e-03%] tsi::runtime::TsavRT::executeSyncCommand
1     0.3320    0.3320      0.3320      └─ [6.34e-03%] tsi::runtime::TsavRT::awaitCommandListCompletion
1     0.0270    0.0270      0.0250      └─ [5.16e-04%] tsi::runtime::TsavRT::finalizeCommandList
1     0.0020    0.0020      0.0020        └─ [3.82e-05%] tsi::runtime::executeWithTimeout
2     0.0060    0.0030      0.0060  └─ [1.15e-04%] tsi::runtime::TsavRT::deallocate

[Thread] tsi::runtime::TsavRTPosix::loadBlob (cumulative over all threads)

39 92.3920 2.3690 0.9830 [ 1.76%] [Thread] tsi::runtime::TsavRTPosix::loadBlob
78 91.3990 1.1718 91.3990 └─ [ 1.75%] tsi::runtime::executeWithTimeout
39 0.0100 2.56e-04 0.0100 └─ [1.91e-04%] LOAD_BLOB Command Execution
39 0.0000 0.0000 0.0000 └─ [0.00e+00%] Command{command=2 (LOAD_BLOB), blob_args=[2148533888[0x801...
39 0.0000 0.0000 0.0000 └─ [0.00e+00%] TXE 0 Idle

[Thread] tsi::runtime::TsavRTPosix::unloadBlob (cumulative over all threads)

39 6.1320 0.1572 1.1920 [1.17e-01%] [Thread] tsi::runtime::TsavRTPosix::unloadBlob
78 4.9100 0.0629 4.9100 └─ [9.38e-02%] tsi::runtime::executeWithTimeout
39 0.0300 7.69e-04 0.0300 └─ [5.73e-04%] UNLOAD_BLOB Command Execution
39 0.0000 0.0000 0.0000 └─ [0.00e+00%] Command{command=3 (UNLOAD_BLOB), blob_args=[2148533888[0x8...
39 0.0000 0.0000 0.0000 └─ [0.00e+00%] TXE 0 Idle

[Thread] tsi::runtime::TsavRT::processResponses (cumulative over all threads)

41 17.1700 0.4188 0.4880 [3.28e-01%] [Thread] tsi::runtime::TsavRT::processResponses
41 16.6820 0.4069 16.6820 └─ [3.19e-01%] tsi::runtime::executeWithTimeout

[Thread] OPU (cumulative over all threads)

1     0.1400    0.1400      0.0900  [2.67e-03%] [Thread] OPU 
1     0.0500    0.0500      0.0500  └─ [9.55e-04%] tsi::runtime::TsavRT::allocate

[Thread] tsi::runtime::TsavRT::finalizeCommandList (cumulative over all threads)

39 0.3960 0.0102 0.3530 [7.56e-03%] [Thread] tsi::runtime::TsavRT::finalizeCommandList
39 0.0430 0.0011 0.0430 └─ [8.21e-04%] tsi::runtime::executeWithTimeout

[Thread] tsi::runtime::TsavRT::allocate (cumulative over all threads)

41 0.1870 0.0046 0.1870 [3.57e-03%] [Thread] tsi::runtime::TsavRT::allocate

[Thread] tsi::runtime::TsavRT::addCommandToList (cumulative over all threads)

39 0.1340 0.0034 0.1340 [2.56e-03%] [Thread] tsi::runtime::TsavRT::addCommandToList

[Thread] tsi::runtime::TsavRT::awaitCommandListCompletion (cumulative over all threads)

39 7.8790 0.2020 7.8790 [1.50e-01%] [Thread] tsi::runtime::TsavRT::awaitCommandListCompletion

[Thread] tsi::runtime::TsavRT::deallocate (cumulative over all threads)

39 0.0540 0.0014 0.0540 [1.03e-03%] [Thread] tsi::runtime::TsavRT::deallocate

-  5235.2730    0.0000   5235.2730  [100.00%] TOTAL

========================================================================================================================

Counter Metrics:

Metric Min Max Avg

Queue_0_Occupancy 0.0000 1.0000 0.9917


@dineshReddy6381 left a comment

Approved

@akapoor3518 merged commit e6d62c9 into master on Oct 28, 2025