How to calculate the number of cached LoRAs #559

@limertang

Description

System Info

GPU Name: NVIDIA A800
TensorRT-LLM: 0.11.0
Nvidia Driver: 535.129.03
OS: Ubuntu 22.04
Triton Inference Server backend: tensorrtllm_backend

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  • Start the service; the startup log shows:

[TensorRT-LLM][INFO] Using 39976960 bytes for LoRA host cache
[TensorRT-LLM][INFO] Using 312836096 bytes for LoRA device cache
[TensorRT-LLM][INFO] Max LoRA size is 19988480
[TensorRT-LLM][INFO] LoRA host Cache can hold 1 max sized LoRAs
[TensorRT-LLM][INFO] LoRA device Cache can hold 8 max sized LoRAs

  • Send a request with a LoRA; the following error occurred:

[TensorRT-LLM][ERROR] Encountered an error when fetching new request: Error storing task=1 in PEFT cache -- Cache is full. There are no done tasks to evict (/home/jenkins/agent/workspace/LLM/release-0.11/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp:243)
1 0x7f20d8f960a0 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x74c0a0) [0x7f20d8f960a0]
2 0x7f20dac724e0 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::updatePeftCache(std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> const&) + 64
3 0x7f20daca3258 tensorrt_llm::executor::Executor::Impl::fetchNewRequests(int) + 2968
4 0x7f20daca4627 tensorrt_llm::executor::Executor::Impl::executionLoop() + 455

  • Set "lora_cache_host_memory_bytes" to 104857600 (100 MB) and restart the service; the log shows:

[TensorRT-LLM][INFO] Using 104857600 bytes for LoRA host cache
[TensorRT-LLM][INFO] Using 312836096 bytes for LoRA device cache
[TensorRT-LLM][INFO] Max LoRA size is 19988480
[TensorRT-LLM][INFO] LoRA host Cache can hold 3 max sized LoRAs
[TensorRT-LLM][INFO] LoRA device Cache can hold 8 max sized LoRAs

Theoretically, the host cache should hold only 2 LoRAs: 100 MB // 38.125 MB = 2, where 38.125 MB (39976960 bytes) is the per-LoRA footprint implied by the first log, which says that 39976960 bytes hold exactly 1 max-sized LoRA. But the log reports 3, so I think the log is wrong.
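
As a cross-check, here is my own arithmetic against the three cache sizes in the logs above. This is a hypothetical reverse-engineering of the log line only; I have not verified it against the peftCacheManager source:

    import math

    MAX_LORA_SIZE = 19988480  # bytes, "Max LoRA size" from the startup log

    # (cache size in bytes, "can hold N max sized LoRAs" from the log)
    observed = [
        (39976960, 1),    # initial host cache
        (104857600, 3),   # host cache after raising lora_cache_host_memory_bytes
        (312836096, 8),   # device cache
    ]

    for cache_bytes, logged in observed:
        naive = cache_bytes // MAX_LORA_SIZE                    # plain floor division
        ceil_2x = math.ceil(cache_bytes / (2 * MAX_LORA_SIZE))  # ceiling over a 2x footprint
        print(f"{cache_bytes}: floor={naive}, ceil_2x={ceil_2x}, logged={logged}")

All three logged counts match the ceil_2x column and none match plain floor division, so the per-LoRA footprint the cache accounts for seems to be roughly twice the logged max size, with the count rounded up. Whether the cache really computes it this way is a guess on my part.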

  • I sent the first request with lora-1 and the second request with lora-2; both worked fine, so I assume both LoRAs were cached in the host cache.

    I then sent a third request for lora-1 with only its lora_task_id (no weights or config), but got this warning:

[TensorRT-LLM][WARNING] LoRA task 1 not found in cache. Please send LoRA weights with request

Then I sent a fourth request for lora-2, again with only its lora_task_id, and it worked fine.
That is to say, lora-1 had been evicted, and I want to know why. (A toy simulation after this list illustrates one possible explanation.)

  • If I set "lora_cache_host_memory_bytes" to a larger value, step 3 works fine.
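
The following toy simulation reproduces the hit/miss pattern I observed. The LRU policy and the effective host-cache capacity of a single adapter are pure assumptions for illustration, not taken from the TensorRT-LLM code:

    from collections import OrderedDict

    class LruLoraCache:
        """Toy LRU cache keyed by lora_task_id, with capacity in whole adapters."""

        def __init__(self, capacity: int):
            self.capacity = capacity
            self.entries = OrderedDict()

        def request(self, task_id: str, weights_provided: bool) -> str:
            if task_id in self.entries:
                self.entries.move_to_end(task_id)  # refresh LRU position
                return "hit"
            if not weights_provided:
                # task_id-only request that misses: nothing gets stored
                return "miss: task not found, resend weights"
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
            self.entries[task_id] = True
            return "miss: weights stored"

    cache = LruLoraCache(capacity=1)
    for task, has_weights in [("lora-1", True), ("lora-2", True),
                              ("lora-1", False), ("lora-2", False)]:
        print(f"{task}: {cache.request(task, has_weights)}")

With capacity=1 this prints exactly the pattern above: lora-2 evicts lora-1, the task_id-only request for lora-1 misses, and lora-2 still hits. If the effective capacity were really 2, the third request should have hit, which is the heart of my question.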

Expected behavior

The number of LoRAs the host cache can hold equals lora_cache_host_memory_bytes // lora_size.

Actual behavior

The number of LoRAs the host cache can hold does not equal lora_cache_host_memory_bytes // lora_size.

Additional notes

If I set lora_cache_host_memory_bytes to 1 GB, I want to know exactly how many LoRAs can be cached.
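
For concreteness, here is what the candidate formulas from my arithmetic above would predict for 1 GiB. Both are guesses inferred from the logs, not confirmed behavior:

    import math

    MAX_LORA_SIZE = 19988480   # bytes, "Max LoRA size" from the startup log
    HOST_CACHE = 1 * 1024**3   # 1 GiB = 1073741824 bytes

    print(HOST_CACHE // MAX_LORA_SIZE)                  # naive floor: 53
    print(HOST_CACHE // (2 * MAX_LORA_SIZE))            # floor over a 2x footprint: 26
    print(math.ceil(HOST_CACHE / (2 * MAX_LORA_SIZE)))  # what the log would report, if my guess holds: 27

The spread between 53 and 26 is exactly the ambiguity I would like documented.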

Labels: bug (Something isn't working)