How to calculate the number of cached LoRAs #559

@limertang

Description

System Info

GPU Name: NVIDIA A800
TensorRT-LLM: 0.11.0
Nvidia Driver: 535.129.03
OS: Ubuntu 22.04
Triton Inference Server backend: tensorrtllm_backend

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  • Start the service; the startup log shows:

[TensorRT-LLM][INFO] Using 39976960 bytes for LoRA host cache
[TensorRT-LLM][INFO] Using 312836096 bytes for LoRA device cache
[TensorRT-LLM][INFO] Max LoRA size is 19988480
[TensorRT-LLM][INFO] LoRA host Cache can hold 1 max sized LoRAs
[TensorRT-LLM][INFO] LoRA device Cache can hold 8 max sized LoRAs

  • Send a request with a LoRA; the following error occurred:

[TensorRT-LLM][ERROR] Encountered an error when fetching new request: Error storing task=1 in PEFT cache -- Cache is full. There are no done tasks to evict (/home/jenkins/agent/workspace/LLM/release-0.11/L0_PostMerge/llm/cpp/tensorrt_llm/batch_manager/peftCacheManager.cpp:243)
1 0x7f20d8f960a0 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(+0x74c0a0) [0x7f20d8f960a0]
2 0x7f20dac724e0 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::updatePeftCache(std::shared_ptr<tensorrt_llm::batch_manager::LlmRequest> const&) + 64
3 0x7f20daca3258 tensorrt_llm::executor::Executor::Impl::fetchNewRequests(int) + 2968
4 0x7f20daca4627 tensorrt_llm::executor::Executor::Impl::executionLoop() + 455

  • Set "lora_cache_host_memory_bytes" to 104857600 (100 MB) and restart the service; the log shows:

[TensorRT-LLM][INFO] Using 104857600 bytes for LoRA host cache
[TensorRT-LLM][INFO] Using 312836096 bytes for LoRA device cache
[TensorRT-LLM][INFO] Max LoRA size is 19988480
[TensorRT-LLM][INFO] LoRA host Cache can hold 3 max sized LoRAs
[TensorRT-LLM][INFO] LoRA device Cache can hold 8 max sized LoRAs

Theoretically, the host cache should hold only 2 LoRAs: 100 MB // 38.125 MB = 2, where 38.125 MB (39976960 bytes) is the per-LoRA footprint implied by the first log, which says that 39976960 bytes hold exactly 1 max-sized LoRA. But the log reports 3, so I think the log is wrong.
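
As a cross-check, here is my own arithmetic against the three cache sizes in the logs above. This is a hypothetical reverse-engineering of the log line only; I have not verified it against the peftCacheManager source:

    import math

    MAX_LORA_SIZE = 19988480  # bytes, "Max LoRA size" from the startup log

    # (cache size in bytes, "can hold N max sized LoRAs" from the log)
    observed = [
        (39976960, 1),    # initial host cache
        (104857600, 3),   # host cache after raising lora_cache_host_memory_bytes
        (312836096, 8),   # device cache
    ]

    for cache_bytes, logged in observed:
        naive = cache_bytes // MAX_LORA_SIZE                    # plain floor division
        ceil_2x = math.ceil(cache_bytes / (2 * MAX_LORA_SIZE))  # ceiling over a 2x footprint
        print(f"{cache_bytes}: floor={naive}, ceil_2x={ceil_2x}, logged={logged}")

All three logged counts match the ceil_2x column and none match plain floor division, so the per-LoRA footprint the cache accounts for seems to be roughly twice the logged max size, with the count rounded up. Whether the cache really computes it this way is a guess on my part.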

  • I sent the first request with lora-1 and the second request with lora-2; both worked fine, so I assume both LoRAs were cached in the host cache.

    I then sent a third request for lora-1 with only its lora_task_id (no weights or config), but got this warning:

[TensorRT-LLM][WARNING] LoRA task 1 not found in cache. Please send LoRA weights with request

Then I sent a fourth request for lora-2, again with only its lora_task_id, and it worked fine.
That is to say, lora-1 had been evicted, and I want to know why. (A toy simulation after this list illustrates one possible explanation.)

  • If I set "lora_cache_host_memory_bytes" to a larger value, step 3 works fine.
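
The following toy simulation reproduces the hit/miss pattern I observed. The LRU policy and the effective host-cache capacity of a single adapter are pure assumptions for illustration, not taken from the TensorRT-LLM code:

    from collections import OrderedDict

    class LruLoraCache:
        """Toy LRU cache keyed by lora_task_id, with capacity in whole adapters."""

        def __init__(self, capacity: int):
            self.capacity = capacity
            self.entries = OrderedDict()

        def request(self, task_id: str, weights_provided: bool) -> str:
            if task_id in self.entries:
                self.entries.move_to_end(task_id)  # refresh LRU position
                return "hit"
            if not weights_provided:
                # task_id-only request that misses: nothing gets stored
                return "miss: task not found, resend weights"
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
            self.entries[task_id] = True
            return "miss: weights stored"

    cache = LruLoraCache(capacity=1)
    for task, has_weights in [("lora-1", True), ("lora-2", True),
                              ("lora-1", False), ("lora-2", False)]:
        print(f"{task}: {cache.request(task, has_weights)}")

With capacity=1 this prints exactly the pattern above: lora-2 evicts lora-1, the task_id-only request for lora-1 misses, and lora-2 still hits. If the effective capacity were really 2, the third request should have hit, which is the heart of my question.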

Expected behavior

The number of LoRAs the host cache can hold equals lora_cache_host_memory_bytes // lora_size.

Actual behavior

The number of LoRAs the host cache can hold does not equal lora_cache_host_memory_bytes // lora_size.

Additional notes

If I set lora_cache_host_memory_bytes to 1 GB, I want to know exactly how many LoRAs can be cached.
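
For concreteness, here is what the candidate formulas from my arithmetic above would predict for 1 GiB. Both are guesses inferred from the logs, not confirmed behavior:

    import math

    MAX_LORA_SIZE = 19988480   # bytes, "Max LoRA size" from the startup log
    HOST_CACHE = 1 * 1024**3   # 1 GiB = 1073741824 bytes

    print(HOST_CACHE // MAX_LORA_SIZE)                  # naive floor: 53
    print(HOST_CACHE // (2 * MAX_LORA_SIZE))            # floor over a 2x footprint: 26
    print(math.ceil(HOST_CACHE / (2 * MAX_LORA_SIZE)))  # what the log would report, if my guess holds: 27

The spread between 53 and 26 is exactly the ambiguity I would like documented.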

Labels: bug (Something isn't working)