⚡ LLM Engine ⚡

The open source engine for fine-tuning large language models. LLM Engine is the easiest way to customize and serve LLMs. Use Scale's hosted version or run it in your own cloud.

💻 Quick Install

```shell
pip install scale-llm-engine
```

🤔 About

Foundation models are emerging as the building blocks of AI. However, deploying these models to the cloud and fine-tuning them still requires infrastructure and ML expertise, and can be expensive.

LLM Engine is a Python library, CLI, and Helm chart that provides everything you need to fine-tune and serve foundation models in the cloud using Kubernetes. Key features include:

🎁 Ready-to-use Fine-Tuning and Inference APIs for your favorite models: LLM Engine comes with ready-to-use APIs for your favorite open-source models, including MPT, Falcon, and LLaMA. Use Scale-hosted endpoints or deploy to your own infrastructure.

🐳 Deploying from any docker image: Turn any Docker image into an auto-scaling deployment with simple APIs.

🎙️ Optimized Inference: LLM Engine provides inference APIs for streaming responses and dynamically batching inputs for higher throughput and lower latency.

🤗 Open-Source Integrations: Deploy any Huggingface model with a single command.
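To illustrate how a streaming inference API is typically consumed, here is a minimal sketch. A stub generator stands in for the network call; the real client's chunk shape and `stream=True` parameter are assumptions for illustration, not the library's confirmed interface:

```python
# Sketch: consuming a streamed completion token-by-token.
# fake_stream() stands in for a streaming Completion.create call
# (an assumption for illustration -- no network access here).

def fake_stream():
    """Stand-in generator yielding completion tokens one at a time."""
    for token in ["Hello", ",", " world", "!"]:
        yield token

def collect_stream(chunks):
    """Accumulate streamed tokens into the full completion text."""
    pieces = []
    for token in chunks:
        # In an interactive app you would instead do:
        # print(token, end="", flush=True)
        pieces.append(token)
    return "".join(pieces)

print(collect_stream(fake_stream()))  # prints "Hello, world!"
```

The same accumulate-or-print loop applies regardless of how the chunks arrive, which is why streaming APIs pair naturally with generator-based clients.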

🔥 Features Coming Soon

Fast Cold-Start Times: To prevent GPUs from idling, LLM Engine automatically scales your model to zero when it's not in use and scales up within seconds, even for large foundation models.

💸 Cost-Optimized: Deploy AI models at lower cost than commercial endpoints, with cold-start and warm-down time included in the accounting.

🚀 Getting Started

Navigate to https://spellbook.scale.com, where you can get a Scale API key on the settings page. Set this API key as the SCALE_API_KEY environment variable by adding the following line to your .zshrc or .bash_profile:

```shell
export SCALE_API_KEY="[Your API key]"
```
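If you want to fail fast when the key is missing, you can read it the same way a client library typically would. This is a sketch assuming only the variable name SCALE_API_KEY from above; `load_api_key` is an illustrative helper, not part of the llmengine package:

```python
import os

def load_api_key(env=None):
    """Return the Scale API key from the environment, or raise
    with a helpful message if it is not set. `env` defaults to
    os.environ; a dict can be passed in for testing."""
    if env is None:
        env = os.environ
    key = env.get("SCALE_API_KEY")
    if not key:
        raise RuntimeError(
            "SCALE_API_KEY is not set; add the export line above to "
            "your .zshrc or .bash_profile and restart your shell."
        )
    return key

# Usage: key = load_api_key()
```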

With your API key set, you can now send LLM Engine requests using the Python client:

```python
from llmengine import Completion

response = Completion.create(
    model_name="llama-7b",
    prompt="Hello, my name is",
    max_new_tokens=10,
    temperature=0.2,
)
print(response.outputs[0].text)
```

See the documentation site for more details.
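Hosted endpoints can occasionally return transient errors, so production callers usually wrap the request in a retry loop with exponential backoff. A minimal sketch, where `create_fn` stands in for a zero-argument call to `Completion.create` and `complete_with_retry` is an illustrative helper (not part of the llmengine package); treating every exception as retryable is a simplification:

```python
import time

def complete_with_retry(create_fn, retries=3, backoff=0.1):
    """Call create_fn(), retrying on failure with exponential backoff.

    create_fn: zero-argument callable, e.g.
        lambda: Completion.create(model_name="llama-7b", prompt="Hi",
                                  max_new_tokens=10)
    For illustration, any exception is treated as retryable; real code
    would retry only on rate-limit or transient-network errors.
    """
    for attempt in range(retries):
        try:
            return create_fn()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            time.sleep(backoff * 2 ** attempt)
```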
