A high-throughput and memory-efficient inference and serving engine for LLMs
Topics: amd, cuda, inference, pytorch, transformer, llama, gpt, rocm, model-serving, tpu, hpu, mlops, xpu, llm, inferentia, llmops, llm-serving, qwen, deepseek, trainium
Updated Mar 7, 2025 - Python