A fast, lightweight, parallel inference server for Llama LLMs.
Topics: server · inference · llama · llm-inference · exllama · llama2 · flash-attention-2 · paged-attention · llama3 · exllamav2
Updated Jun 19, 2024 · Python