A Runpod Serverless worker that serves a Hugging Face text generation model using the model caching feature. The model is pre-downloaded to a network volume and loaded at startup in offline mode, eliminating cold-start downloads.
Defaults to microsoft/Phi-3-mini-4k-instruct but works with any Hugging Face text generation model available through the model caching feature.
- Runpod's model caching feature downloads the model to `/runpod-volume/huggingface-cache/hub/` before the worker starts.
- The handler resolves the local snapshot path from the cache directory.
- The model and tokenizer are loaded once at startup in offline mode (`HF_HUB_OFFLINE=1`).
- Incoming requests are processed using a `transformers` text generation pipeline (see the sketch below).
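The snippet below is a minimal sketch of that flow, assuming the cache layout above. The helper name, defaults, and generation call are illustrative and may differ from the repo's actual handler.py.

```python
import os

# Keep transformers fully offline and point it at the network-volume cache.
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["HF_HOME"] = "/runpod-volume/huggingface-cache"

import runpod
from transformers import pipeline

MODEL_NAME = os.environ.get("MODEL_NAME", "microsoft/Phi-3-mini-4k-instruct")
CACHE_ROOT = "/runpod-volume/huggingface-cache/hub"


def resolve_snapshot_path(model_name: str) -> str:
    """Return the local snapshot directory for a cached model (illustrative helper)."""
    snapshots = os.path.join(
        CACHE_ROOT, f"models--{model_name.replace('/', '--')}", "snapshots"
    )
    # Each cached revision lives in its own subdirectory; use the first one found.
    return os.path.join(snapshots, os.listdir(snapshots)[0])


# Load once at startup so every request reuses the same pipeline.
generator = pipeline("text-generation", model=resolve_snapshot_path(MODEL_NAME), device_map="auto")


def handler(job):
    params = job["input"]
    result = generator(
        params.get("prompt", "Hello!"),
        max_new_tokens=params.get("max_tokens", 256),
        temperature=params.get("temperature", 0.7),
        do_sample=True,
    )
    return {"status": "success", "output": result[0]["generated_text"]}


runpod.serverless.start({"handler": handler})
```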
```
handler.py          # Serverless handler with model loading and inference
Dockerfile          # Container image based on runpod/pytorch:2.4.0
requirements.txt    # Python dependencies
build-and-push.sh   # Build and push Docker image to Docker Hub
```
- Go to Serverless > New Endpoint in the Runpod console.
- Under Container Image, enter a built image (e.g., `your-username/cached-model-worker:latest`), or use Import Git Repository to build directly from this repo.
- Select a GPU with at least 16 GB VRAM.
- Under the Model section, enter the model name: `microsoft/Phi-3-mini-4k-instruct`.
- Set container disk to at least 20 GB.
- Select Deploy Endpoint.
Using the included script:
```bash
chmod +x build-and-push.sh
./build-and-push.sh your-dockerhub-username
```

Or manually:
```bash
docker build -t your-username/cached-model-worker:latest .
docker push your-username/cached-model-worker:latest
```

Or with Depot for faster cloud builds:
```bash
depot build -t your-username/cached-model-worker:latest . --platform linux/amd64 --push
```

Example request:

```json
{
"input": {
"prompt": "What is the capital of France?",
"max_tokens": 256,
"temperature": 0.7
}
}
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| `prompt` | string | `"Hello!"` | The text prompt for generation. |
| `max_tokens` | integer | `256` | Maximum number of tokens to generate. |
| `temperature` | float | `0.7` | Sampling temperature (higher = more random). |
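For reference, a request in this format can be sent to a deployed endpoint over the `/runsync` route. The endpoint ID and API key below are placeholders, and the `requests` library is just one way to call it.

```python
import requests

ENDPOINT_ID = "your-endpoint-id"   # placeholder: copy from the endpoint page
API_KEY = "your-runpod-api-key"    # placeholder: a Runpod API key with access to the endpoint

response = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "input": {
            "prompt": "What is the capital of France?",
            "max_tokens": 256,
            "temperature": 0.7,
        }
    },
    timeout=120,
)
print(response.json())
```

A successful run wraps the handler's result in the response body, matching the output example below.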
```json
{
"output": {
"status": "success",
"output": "What is the capital of France?\n\nThe capital of France is Paris."
}
}
```

Set the `MODEL_NAME` environment variable on your endpoint to any Hugging Face model ID that is available through model caching:

```
MODEL_NAME=meta-llama/Llama-3.2-1B-Instruct
```

Make sure the model name in the endpoint's Model section matches the `MODEL_NAME` environment variable.
| Variable | Default | Description |
|---|---|---|
| `MODEL_NAME` | `microsoft/Phi-3-mini-4k-instruct` | Hugging Face model ID to load. |
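As a quick sanity check (a sketch, not part of the repo), the worker can confirm at startup that the model named by `MODEL_NAME` is actually present in the cache before trying to load it:

```python
import os

model_name = os.environ.get("MODEL_NAME", "microsoft/Phi-3-mini-4k-instruct")
# Hugging Face caches repos under models--{org}--{name} inside the hub directory.
cached_repo = os.path.join(
    "/runpod-volume/huggingface-cache/hub", f"models--{model_name.replace('/', '--')}"
)
if not os.path.isdir(cached_repo):
    raise RuntimeError(
        f"{model_name} is not in the model cache; add it in the endpoint's Model section."
    )
```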
Add this to your handler (or run it at worker startup) and check the worker logs to verify the cache directory:
```python
import os

cache_root = "/runpod-volume/huggingface-cache/hub"
if os.path.exists(cache_root):
    print(f"Cache root exists: {cache_root}")
    for item in os.listdir(cache_root):
        print(f"  {item}")
else:
    print(f"Cache root does NOT exist: {cache_root}")
```

If the cache directory is empty or missing, make sure you added the model in the Model section when creating the endpoint.
| Issue | Solution |
|---|---|
| Model downloads instead of using cache | Add the model in the endpoint's Model section. The cache path must be /runpod-volume/huggingface-cache, not /runpod/model-store/. |
| "No space left on device" | Increase Container Disk to at least 20 GB when creating the endpoint. |
| Slow cold starts | Verify the model is cached (check logs for [ModelStore] Using snapshot). Set Active Workers > 0 to keep workers warm. |
| `trust_remote_code` errors | Set `trust_remote_code=False` when the model is natively supported by `transformers`. Custom model code in cached snapshots can conflict with newer `transformers` versions. |
| Flash attention errors | Add attn_implementation="eager" to from_pretrained() if the base image does not include flash-attn. |
| PyTorch version mismatch | Use a base image with PyTorch >= 2.2. The runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04 image works with current transformers versions. |
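For the `trust_remote_code` and flash-attention rows above, the relevant `from_pretrained()` arguments look like this. This is a hedged sketch using the repo's default model; the actual handler may resolve the snapshot path explicitly instead of passing `cache_dir`.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    cache_dir="/runpod-volume/huggingface-cache/hub",  # hub cache on the network volume
    local_files_only=True,           # never fall back to downloading from the Hub
    trust_remote_code=False,         # prefer the native transformers implementation
    attn_implementation="eager",     # avoid flash-attn if it is not installed in the image
)
tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    cache_dir="/runpod-volume/huggingface-cache/hub",
    local_files_only=True,
)
```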
- Deploy a cached model - Full tutorial for this worker.
- Model caching - How model caching works on Runpod Serverless.
- Runpod Serverless - Serverless overview.