COMB is a plug-and-play caching system for long-context LLM serving.
The repository is organized as follows:

COMB
├── benchmarks                 # For benchmarking
├── comb
│   ├── entrypoints
│   │   ├── api_server.py      # For the online server
│   │   └── comb.py            # For offline inference
│   ├── integration
│   │   ├── hf                 # hf transformers backend
│   │   ├── vllm               # vLLM backend
│   │   └── __init__.py
│   ├── storage
│   │   ├── chunk_processor.py # For generating PIC
│   │   ├── pic_allocator.py   # For allocating memory
│   │   ├── pic_manager.py     # For managing PIC
│   │   └── pic_utils.py
│   ├── transfer
│   │   └── cuda_ipc_utils.py  # For inter-process communication
│   ├── __init__.py
│   ├── output.py
│   └── supported_models.py
├── data
├── examples                   # For use cases
├── training                   # For training Comb models
├── environment.yml
└── requirements.txt
Run the following commands to prepare the environment. We recommend appending the two export commands to the end of your ~/.bashrc.
export PYTHONPATH=~/Comb:$PYTHONPATH
export TOKENIZERS_PARALLELISM=true
pip install -r requirements.txt
Install vLLM (recommended for efficiency and benchmarking):
pip install vllm
Currently we only support meta-llama/Llama-3.1-8B-Instruct and deepseek-ai/DeepSeek-V2-Lite-Chat. If you want to use another model, you can train a Comb model yourself by following our instructions.
You can find examples in the examples folder:
- basic.py for offline inference (a rough sketch follows at the end of this section).
- online_serving.py for the online server (a client sketch also follows).
See Instructions.
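For offline inference, examples/basic.py shows the real usage. As a rough orientation only, the snippet below sketches what a vLLM-style offline call might look like; the `Comb` class name, its import path, and the `generate()` signature are assumptions, not COMB's confirmed API.

```python
# Hypothetical sketch only: the import path, the Comb class, and the
# generate() signature are assumed here -- see examples/basic.py for the
# actual offline-inference API backed by comb/entrypoints/comb.py.
from comb.entrypoints.comb import Comb  # assumed entry point

# One of the two currently supported models.
engine = Comb(model="meta-llama/Llama-3.1-8B-Instruct")

# Request a completion over a long-context prompt.
output = engine.generate("Summarize the following report: ...")
print(output)
```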
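For the online path, examples/online_serving.py starts the server built on comb/entrypoints/api_server.py. Assuming the server exposes an OpenAI-compatible endpoint (an assumption here, along with the host, port, and placeholder API key), a client could look like the sketch below; check examples/online_serving.py for the actual startup command and request format.

```python
# Hypothetical client sketch. It assumes api_server.py serves an
# OpenAI-compatible API on localhost:8000; the base_url, api_key, and
# request format are placeholders, not confirmed by the COMB docs.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize this long document: ..."}],
)
print(response.choices[0].message.content)
```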