A high-throughput and memory-efficient inference and serving engine for LLMs
Topics: amd, cuda, inference, pytorch, transformer, llama, gpt, rocm, model-serving, tpu, hpu, mlops, xpu, llm, inferentia, llmops, llm-serving, trainium
Updated Nov 16, 2024 - Python