What is this? Minions is a communication protocol that enables small on-device models to collaborate with frontier models in the cloud. By reading long contexts only locally, we can reduce cloud costs with minimal or no quality degradation. This repository provides a demonstration of the protocol. Get started below, or see our paper and blogpost for more information.
Paper: Minions: Cost-efficient Collaboration Between On-device and Cloud Language Models
Blogpost: https://hazyresearch.stanford.edu/blog/2025-02-24-minions
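As a rough mental model, each round of the protocol keeps the long context on-device and sends only short messages to the cloud. The sketch below is purely conceptual (hypothetical helper methods, not this repository's actual API; the real protocol lives in minions/minion.py):

def minion_round(supervisor, worker, task, long_context):
    # Cloud model drafts a short question about the task; it never
    # receives the long context.
    question = supervisor.ask(task)
    # Local model reads the full context and answers the question.
    answer = worker.answer(question, long_context)
    # Cloud model synthesizes a final response from the short answer.
    return supervisor.synthesize(answer)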
We have tested the following setup on macOS and Ubuntu with Python 3.10-3.11 (note: Python 3.13 is not supported).
Optional: Create a virtual environment with your favorite package manager (e.g. conda, venv, uv)
conda create -n minions python=3.11
Step 1: Clone the repository and install the Python package.
git clone https://github.com/HazyResearch/minions.git
cd minions
pip install -e . # installs the minions package in editable mode
Note: for optional MLX-LM support, install the package with the following command:
pip install -e ".[mlx]"
Note: for optional Cartesia-MLX support, install the basic package as above and then follow the instructions below.
Step 2: Install a server for running the local model.
We support two servers for running local models: ollama and tokasaurus. You need to install at least one of these.
- You should use ollama if you do not have access to NVIDIA GPUs. Install ollama following the instructions at https://ollama.com/download. To enable Flash Attention, run launchctl setenv OLLAMA_FLASH_ATTENTION 1 and, if on a Mac, restart the Ollama app.
- You should use tokasaurus if you have access to NVIDIA GPUs and you are running the Minions protocol, which benefits from the high throughput of tokasaurus. Install tokasaurus with the following command:
uv pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ tokasaurus==0.0.1.post1
Optional: Install Cartesia-MLX (only available on Apple Silicon)
- Download Xcode
- Install the command line tools by running xcode-select --install
- Install nanobind by running the following command:
pip install nanobind@git+https://github.com/wjakob/nanobind.git@2f04eac452a6d9142dedb957701bdb20125561e4
- Install the Cartesia Metal backend by running the following command:
pip install git+https://github.com/cartesia-ai/edge.git#subdirectory=cartesia-metal
- Install the Cartesia-MLX package by running the following command:
pip install git+https://github.com/cartesia-ai/edge.git#subdirectory=cartesia-mlx
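To sanity-check the install, try importing the package. This is a minimal sketch that assumes the package imports under the name cartesia_mlx; adjust if your version differs:

# Verify the Cartesia-MLX install (the import name is an assumption).
import cartesia_mlx as cmx
print("cartesia_mlx is available")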
Step 3: Set your API key for at least one of the following cloud LLM providers.
If needed, create an OpenAI API key or Together AI API key for the cloud model.
# OpenAI
export OPENAI_API_KEY=<your-openai-api-key>
export OPENAI_BASE_URL=<your-openai-base-url> # Optional: Use a different OpenAI API endpoint
# Together AI
export TOGETHER_API_KEY=<your-together-api-key>
# OpenRouter
export OPENROUTER_API_KEY=<your-openrouter-api-key>
export OPENROUTER_BASE_URL=<your-openrouter-base-url> # Optional: Use a different OpenRouter API endpoint
# Perplexity
export PERPLEXITY_API_KEY=<your-perplexity-api-key>
export PERPLEXITY_BASE_URL=<your-perplexity-base-url> # Optional: Use a different Perplexity API endpoint
# Tokasaurus
export TOKASAURUS_BASE_URL=<your-tokasaurus-base-url> # Optional: Use a different Tokasaurus API endpoint
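Before launching the app, you can confirm which providers are configured with a small check (a sketch using the variable names above):

# Report which cloud-provider API keys are set in the environment.
import os

for var in ["OPENAI_API_KEY", "TOGETHER_API_KEY", "OPENROUTER_API_KEY", "PERPLEXITY_API_KEY"]:
    print(f"{var}: {'set' if os.environ.get(var) else 'missing'}")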
To try the Minion or Minions protocol, run the following command:
streamlit run app.py
If you see an error about the ollama client like the following:
An error occurred: Failed to connect to Ollama. Please check that Ollama is downloaded, running and accessible. https://ollama.com/download
try running the following command:
OLLAMA_FLASH_ATTENTION=1 ollama serve
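You can also verify that the Ollama server is reachable from Python (a sketch assuming Ollama's default address http://localhost:11434 and the requests package):

# Ping the local Ollama server before launching the app.
import requests

try:
    resp = requests.get("http://localhost:11434")
    resp.raise_for_status()
    print("Ollama is reachable:", resp.text.strip())
except requests.exceptions.RequestException:
    print("Could not reach Ollama. Is ollama serve running?")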
The following example uses an ollama local client and an openai remote client, with the minion protocol.
from minions.clients.ollama import OllamaClient
from minions.clients.openai import OpenAIClient
from minions.minion import Minion
local_client = OllamaClient(
    model_name="llama3.2",
)

remote_client = OpenAIClient(
    model_name="gpt-4o",
)
# Instantiate the Minion object with both clients
minion = Minion(local_client, remote_client)
context = """
Patient John Doe is a 60-year-old male with a history of hypertension. In his latest checkup, his blood pressure was recorded at 160/100 mmHg, and he reported occasional chest discomfort during physical activity.
Recent laboratory results show that his LDL cholesterol level is elevated at 170 mg/dL, while his HDL remains within the normal range at 45 mg/dL. Other metabolic indicators, including fasting glucose and renal function, are unremarkable.
"""
task = "Based on the patient's blood pressure and LDL cholesterol readings in the context, evaluate whether these factors together suggest an increased risk for cardiovascular complications."
# Execute the minion protocol for up to two communication rounds
output = minion(
    task=task,
    context=[context],
    max_rounds=2
)
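The call returns a dictionary of results; in our reading of the code, the final response is stored under the final_answer key (inspect the returned dict if your version differs):

# Print the supervisor's final response (key name assumed from minions/minion.py).
print(output["final_answer"])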
The following example uses an ollama local client and an openai remote client, with the minions protocol.
from minions.clients.ollama import OllamaClient
from minions.clients.openai import OpenAIClient
from minions.minions import Minions
from pydantic import BaseModel
class StructuredLocalOutput(BaseModel):
    explanation: str
    citation: str | None
    answer: str | None

local_client = OllamaClient(
    model_name="llama3.2",
    temperature=0.0,
    structured_output_schema=StructuredLocalOutput
)

remote_client = OpenAIClient(
    model_name="gpt-4o",
)
# Instantiate the Minions object with both clients
minion = Minions(local_client, remote_client)
context = """
Patient John Doe is a 60-year-old male with a history of hypertension. In his latest checkup, his blood pressure was recorded at 160/100 mmHg, and he reported occasional chest discomfort during physical activity.
Recent laboratory results show that his LDL cholesterol level is elevated at 170 mg/dL, while his HDL remains within the normal range at 45 mg/dL. Other metabolic indicators, including fasting glucose and renal function, are unremarkable.
"""
task = "Based on the patient's blood pressure and LDL cholesterol readings in the context, evaluate whether these factors together suggest an increased risk for cardiovascular complications."
# Execute the minions protocol for up to two communication rounds
output = minion(
    task=task,
    doc_metadata="Medical Report",
    context=[context],
    max_rounds=2
)
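The structured_output_schema argument constrains each local worker's reply to JSON matching StructuredLocalOutput. To see the shape the local model is asked to produce, you can instantiate the schema directly (hypothetical values; assumes Pydantic v2):

# Illustrate the JSON shape enforced on local-model outputs.
sample = StructuredLocalOutput(
    explanation="BP of 160/100 mmHg meets criteria for stage 2 hypertension.",
    citation="his blood pressure was recorded at 160/100 mmHg",
    answer="elevated cardiovascular risk",
)
print(sample.model_dump_json(indent=2))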
To run Minion/Minions in a notebook, check out minions.ipynb.
To run Minion/Minions in a CLI, check out minions_cli.py.
Set your choice of local and remote models by running the following commands. The format is <provider>/<model_name>. The available providers are ollama, openai, anthropic, together, perplexity, openrouter, groq, and mlx.
export MINIONS_LOCAL=ollama/llama3.2
export MINIONS_REMOTE=openai/gpt-4o
minions --help
minions --context <path_to_context> --protocol <minion|minions>
To use Azure OpenAI as the remote client, set the following environment variables:
# Azure OpenAI
export AZURE_OPENAI_API_KEY=your-api-key
export AZURE_OPENAI_ENDPOINT=https://your-resource-name.openai.azure.com/
export AZURE_OPENAI_API_VERSION=2024-02-15-preview
Here's an example of how to use Azure OpenAI with the Minions protocol in your own code:
from minions.clients.ollama import OllamaClient
from minions.clients.azure_openai import AzureOpenAIClient
from minions.minion import Minion
local_client = OllamaClient(
    model_name="llama3.2",
)

remote_client = AzureOpenAIClient(
    model_name="gpt-4o",  # This should match your deployment name
    api_key="your-api-key",
    azure_endpoint="https://your-resource-name.openai.azure.com/",
    api_version="2024-02-15-preview",
)
# Instantiate the Minion object with both clients
minion = Minion(local_client, remote_client)
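From here, the minion object is invoked exactly as in the earlier Ollama + OpenAI example, for instance:

# Run the protocol against the Azure-hosted remote model
# (task and context are placeholders; reuse the pattern from above).
output = minion(
    task="Summarize the patient's key cardiovascular risk factors.",
    context=["<your document text here>"],
    max_rounds=2,
)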
Maintainers:
- Avanika Narayan (contact: avanika@cs.stanford.edu)
- Dan Biderman (contact: biderman@stanford.edu)
- Sabri Eyuboglu (contact: eyuboglu@cs.stanford.edu)