A Runpod Serverless worker that serves a Hugging Face text generation model using the model caching feature. The model is pre-downloaded to a network volume and loaded at startup in offline mode, eliminating cold-start downloads.
Defaults to microsoft/Phi-3-mini-4k-instruct but works with any Hugging Face text generation model available through the model caching feature.
- Runpod's model caching feature downloads the model to `/runpod-volume/huggingface-cache/hub/` before the worker starts.
- The handler resolves the local snapshot path from the cache directory.
- The model and tokenizer are loaded once at startup in offline mode (`HF_HUB_OFFLINE=1`).
- Incoming requests are processed using a `transformers` text generation pipeline (see the sketch below).
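The snippet below is a minimal sketch of that flow, assuming the cache layout above. The helper name, defaults, and generation call are illustrative and may differ from the repo's actual handler.py.

```python
import os

# Keep transformers fully offline and point it at the network-volume cache.
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["HF_HOME"] = "/runpod-volume/huggingface-cache"

import runpod
from transformers import pipeline

MODEL_NAME = os.environ.get("MODEL_NAME", "microsoft/Phi-3-mini-4k-instruct")
CACHE_ROOT = "/runpod-volume/huggingface-cache/hub"


def resolve_snapshot_path(model_name: str) -> str:
    """Return the local snapshot directory for a cached model (illustrative helper)."""
    snapshots = os.path.join(
        CACHE_ROOT, f"models--{model_name.replace('/', '--')}", "snapshots"
    )
    # Each cached revision lives in its own subdirectory; use the first one found.
    return os.path.join(snapshots, os.listdir(snapshots)[0])


# Load once at startup so every request reuses the same pipeline.
generator = pipeline("text-generation", model=resolve_snapshot_path(MODEL_NAME), device_map="auto")


def handler(job):
    params = job["input"]
    result = generator(
        params.get("prompt", "Hello!"),
        max_new_tokens=params.get("max_tokens", 256),
        temperature=params.get("temperature", 0.7),
        do_sample=True,
    )
    return {"status": "success", "output": result[0]["generated_text"]}


runpod.serverless.start({"handler": handler})
```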
```
handler.py          # Serverless handler with model loading and inference
Dockerfile          # Container image based on runpod/pytorch:2.4.0
requirements.txt    # Python dependencies
build-and-push.sh   # Build and push Docker image to Docker Hub
```
- Go to Serverless > New Endpoint in the Runpod console.
- Under Container Image, enter a built image (e.g., `your-username/cached-model-worker:latest`), or use Import Git Repository to build directly from this repo.
- Select a GPU with at least 16 GB VRAM.
- Under the Model section, enter the model name: `microsoft/Phi-3-mini-4k-instruct`.
- Set container disk to at least 20 GB.
- Select Deploy Endpoint.
Using the included script:
```bash
chmod +x build-and-push.sh
./build-and-push.sh your-dockerhub-username
```

Or manually:
```bash
docker build -t your-username/cached-model-worker:latest .
docker push your-username/cached-model-worker:latest
```

Or with Depot for faster cloud builds:
```bash
depot build -t your-username/cached-model-worker:latest . --platform linux/amd64 --push
```

Example request:

```json
{
"input": {
"prompt": "What is the capital of France?",
"max_tokens": 256,
"temperature": 0.7
}
}
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| `prompt` | string | `"Hello!"` | The text prompt for generation. |
| `max_tokens` | integer | `256` | Maximum number of tokens to generate. |
| `temperature` | float | `0.7` | Sampling temperature (higher = more random). |
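For reference, a request in this format can be sent to a deployed endpoint over the `/runsync` route. The endpoint ID and API key below are placeholders, and the `requests` library is just one way to call it.

```python
import requests

ENDPOINT_ID = "your-endpoint-id"   # placeholder: copy from the endpoint page
API_KEY = "your-runpod-api-key"    # placeholder: a Runpod API key with access to the endpoint

response = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "input": {
            "prompt": "What is the capital of France?",
            "max_tokens": 256,
            "temperature": 0.7,
        }
    },
    timeout=120,
)
print(response.json())
```

A successful run wraps the handler's result in the response body, matching the output example below.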
```json
{
"output": {
"status": "success",
"output": "What is the capital of France?\n\nThe capital of France is Paris."
}
}
```

Set the `MODEL_NAME` environment variable on your endpoint to any Hugging Face model ID that is available through model caching:

```
MODEL_NAME=meta-llama/Llama-3.2-1B-Instruct
```

Make sure the model name in the endpoint's Model section matches the `MODEL_NAME` environment variable.
| Variable | Default | Description |
|---|---|---|
| `MODEL_NAME` | `microsoft/Phi-3-mini-4k-instruct` | Hugging Face model ID to load. |
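As a quick sanity check (a sketch, not part of the repo), the worker can confirm at startup that the model named by `MODEL_NAME` is actually present in the cache before trying to load it:

```python
import os

model_name = os.environ.get("MODEL_NAME", "microsoft/Phi-3-mini-4k-instruct")
# Hugging Face caches repos under models--{org}--{name} inside the hub directory.
cached_repo = os.path.join(
    "/runpod-volume/huggingface-cache/hub", f"models--{model_name.replace('/', '--')}"
)
if not os.path.isdir(cached_repo):
    raise RuntimeError(
        f"{model_name} is not in the model cache; add it in the endpoint's Model section."
    )
```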
Add this to your handler (or run it at worker startup) and check the worker logs to verify the cache directory:
```python
import os

cache_root = "/runpod-volume/huggingface-cache/hub"
if os.path.exists(cache_root):
    print(f"Cache root exists: {cache_root}")
    for item in os.listdir(cache_root):
        print(f"  {item}")
else:
    print(f"Cache root does NOT exist: {cache_root}")
```

If the cache directory is empty or missing, make sure you added the model in the Model section when creating the endpoint.
| Issue | Solution |
|---|---|
| Model downloads instead of using cache | Add the model in the endpoint's Model section. The cache path must be /runpod-volume/huggingface-cache, not /runpod/model-store/. |
| "No space left on device" | Increase Container Disk to at least 20 GB when creating the endpoint. |
| Slow cold starts | Verify the model is cached (check logs for [ModelStore] Using snapshot). Set Active Workers > 0 to keep workers warm. |
| `trust_remote_code` errors | Set `trust_remote_code=False` when the model is natively supported by `transformers`. Custom model code in cached snapshots can conflict with newer `transformers` versions. |
| Flash attention errors | Add attn_implementation="eager" to from_pretrained() if the base image does not include flash-attn. |
| PyTorch version mismatch | Use a base image with PyTorch >= 2.2. The runpod/pytorch:2.4.0-py3.11-cuda12.4.1-devel-ubuntu22.04 image works with current transformers versions. |
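For the `trust_remote_code` and flash-attention rows above, the relevant `from_pretrained()` arguments look like this. This is a hedged sketch using the repo's default model; the actual handler may resolve the snapshot path explicitly instead of passing `cache_dir`.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    cache_dir="/runpod-volume/huggingface-cache/hub",  # hub cache on the network volume
    local_files_only=True,           # never fall back to downloading from the Hub
    trust_remote_code=False,         # prefer the native transformers implementation
    attn_implementation="eager",     # avoid flash-attn if it is not installed in the image
)
tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    cache_dir="/runpod-volume/huggingface-cache/hub",
    local_files_only=True,
)
```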
- Deploy a cached model - Full tutorial for this worker.
- Model caching - How model caching works on Runpod Serverless.
- Runpod Serverless - Serverless overview.