fileproxy

File-based RPC for running Python functions across network-isolated nodes.

Designed for HPC clusters where compute nodes lack internet access but share a filesystem with login nodes that do.

Installation

pip install fileproxy

Or from source:

git clone https://github.com/tboulet/fileproxy.git
cd fileproxy
pip install -e .

Quick Start

1. Define and start the server (login node)

Create a server script. This example proxies litellm.completion:

# server_script.py
import fileproxy
import litellm

if __name__ == "__main__":
    fileproxy.run_server({
        "litellm_completion": litellm.completion,
    })

Run it on the login node:

python server_script.py

Tip: On HPC clusters, run the server in a persistent terminal session (e.g., tmux) so it survives SSH disconnections. See guide_TMUX.md for a quick reference.

2. Use the proxy in your code (compute node)

import fileproxy

# Create a proxy that behaves like the original function
completion = fileproxy.proxy("litellm_completion")

# Use it exactly like litellm.completion
response = completion(model="gpt-4", messages=[{"role": "user", "content": "Hello"}])

The proxy serializes the arguments to a file, the server picks it up, runs the real function, and writes the result back. The proxy polls for the result and returns it.
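The round trip described above can be sketched with nothing but the standard library. This is an illustration of the idea, not fileproxy's actual code: the helper names (write_request, serve_once, read_response), the UUID-based file naming, and the (args, kwargs) payload layout are all assumptions made for the example.

```python
import pickle
import tempfile
import uuid
from pathlib import Path


def write_request(input_dir: Path, args: tuple, kwargs: dict) -> str:
    """Client side: serialize one call to a request file and return its id."""
    request_id = uuid.uuid4().hex  # hypothetical naming scheme
    with open(input_dir / f"{request_id}.pkl", "wb") as f:
        pickle.dump((args, kwargs), f)
    return request_id


def serve_once(input_dir: Path, output_dir: Path, func) -> int:
    """Server side: run every pending request once; return how many were handled."""
    handled = 0
    for request_path in sorted(input_dir.glob("*.pkl")):
        with open(request_path, "rb") as f:
            args, kwargs = pickle.load(f)
        result = func(*args, **kwargs)
        with open(output_dir / request_path.name, "wb") as f:
            pickle.dump(result, f)
        request_path.unlink()  # the request has been consumed
        handled += 1
    return handled


def read_response(output_dir: Path, request_id: str):
    """Client side: load the result the server wrote for this request."""
    with open(output_dir / f"{request_id}.pkl", "rb") as f:
        return pickle.load(f)


def add(a, b=0):
    return a + b


with tempfile.TemporaryDirectory() as tmp:
    input_dir, output_dir = Path(tmp, "input"), Path(tmp, "output")
    input_dir.mkdir()
    output_dir.mkdir()

    request_id = write_request(input_dir, (2,), {"b": 3})
    handled = serve_once(input_dir, output_dir, add)
    result = read_response(output_dir, request_id)
```

In the real setup the client and server run in separate processes on different nodes, and the client polls for the response file rather than reading it immediately.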

Multiple Functions

Register multiple functions on the same server:

# Server
import fileproxy
import litellm
import requests

if __name__ == "__main__":
    fileproxy.run_server({
        "litellm_completion": litellm.completion,
        "http_post": requests.post,
        "http_get": requests.get,
    })

# Client
import fileproxy

completion = fileproxy.proxy("litellm_completion")
http_post = fileproxy.proxy("http_post")
http_get = fileproxy.proxy("http_get")

Configuration

Data directory

By default, fileproxy stores request/response files in ~/.cache/fileproxy/. Override with:

  1. Constructor argument: fileproxy.proxy("func", base_dir="/path/to/dir")
  2. Environment variable: export FILEPROXY_DIR=/path/to/dir

The server and client must use the same base directory on a shared filesystem.
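One plausible way to resolve the directory is sketched below. The precedence shown (explicit argument first, then FILEPROXY_DIR, then the default) is an assumption based on the order of the list above, and resolve_base_dir is a hypothetical helper, not part of the fileproxy API.

```python
import os
from pathlib import Path
from typing import Optional

DEFAULT_DIR = Path("~/.cache/fileproxy")


def resolve_base_dir(base_dir: Optional[str] = None) -> Path:
    """Pick the data directory: explicit argument, then FILEPROXY_DIR, then default."""
    if base_dir is not None:
        return Path(base_dir).expanduser()
    env_dir = os.environ.get("FILEPROXY_DIR")
    if env_dir:
        return Path(env_dir).expanduser()
    return DEFAULT_DIR.expanduser()
```

Whatever the actual precedence is, both sides must end up with the same path, so setting FILEPROXY_DIR once in a shared shell profile is usually the least error-prone option.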

Workers (parallel execution)

By default, the server processes requests sequentially. To handle multiple requests concurrently (useful when registering multiple functions or serving multiple clients):

# Process up to 4 requests in parallel
fileproxy.run_server(functions, workers=4)

With workers=1 (default), requests are executed one at a time. With workers=2 or more, requests are dispatched to a thread pool. This is particularly useful when mixing slow functions (e.g., LLM calls) with fast ones (e.g., HTTP requests) — a slow call won't block unrelated requests.

Note: Registered functions must be thread-safe when using workers > 1. Most common use cases (HTTP requests, API calls) are thread-safe.
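The effect of the worker count can be seen with a plain thread pool. The sketch below uses concurrent.futures directly and does not touch fileproxy; slow_call and fast_call are hypothetical stand-ins for an LLM call and a quick HTTP request.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def slow_call() -> str:
    time.sleep(0.2)  # stands in for a slow LLM request
    return "slow"


def fast_call() -> str:
    return "fast"


def completion_order(workers: int) -> list:
    """Submit the slow call first, then the fast one; return results as they finish."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(slow_call), pool.submit(fast_call)]
        return [future.result() for future in as_completed(futures)]


one_worker = completion_order(1)   # the fast call queues behind the slow one
two_workers = completion_order(2)  # the fast call finishes first
```

With a single worker the fast request waits for the slow one; with two workers it completes independently.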

Timeouts

# Client waits 15s for server acknowledgement (default: 10s)
func = fileproxy.proxy("my_func", no_server_timeout=15.0)

The timeout only applies while waiting for the server to acknowledge the request (pick it up). Once the server starts processing, the client waits indefinitely — slow functions will not cause false timeouts.
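A minimal sketch of such a wait loop follows. It is not fileproxy's implementation: NoServerTimeout is a hypothetical stand-in for ServerNotRunningError, and the sentinel file name is made up. The point is that the deadline is only checked while the sentinel is absent.

```python
import tempfile
import threading
import time
from pathlib import Path


class NoServerTimeout(Exception):
    """Hypothetical stand-in for fileproxy.ServerNotRunningError."""


def wait_for_response(response_path, started_path,
                      no_server_timeout=10.0, poll_interval=0.1):
    """Poll for the response; the deadline applies only before acknowledgement."""
    deadline = time.monotonic() + no_server_timeout
    while not response_path.exists():
        if not started_path.exists() and time.monotonic() > deadline:
            raise NoServerTimeout(f"no acknowledgement after {no_server_timeout}s")
        time.sleep(poll_interval)
    return response_path.read_bytes()


with tempfile.TemporaryDirectory() as tmp:
    response_path = Path(tmp, "req.pkl")
    started_path = Path(tmp, "req_started")  # hypothetical sentinel name

    # Case 1: nothing acknowledges the request, so the call fails fast.
    try:
        wait_for_response(response_path, started_path, 0.05, 0.01)
        timed_out = False
    except NoServerTimeout:
        timed_out = True

    # Case 2: the sentinel exists, so the client keeps waiting past the deadline.
    started_path.touch()

    def finish_later():
        tmp_path = Path(tmp, "req.tmp")
        tmp_path.write_bytes(b"ok")
        tmp_path.replace(response_path)  # rename so the reader never sees a partial file

    timer = threading.Timer(0.2, finish_later)
    timer.start()
    payload = wait_for_response(response_path, started_path, 0.05, 0.01)
    timer.join()
```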

Poll interval

# Server checks for new requests every 0.5s (default: 0.2s)
fileproxy.run_server(functions, poll_interval=0.5)

# Client checks for response every 0.2s (default: 0.1s)
func = fileproxy.proxy("my_func", poll_interval=0.2)

How It Works

Compute Node (no internet)          Login Node (has internet)
─────────────────────────          ──────────────────────────

proxy("func")(*args, **kwargs)     Server polls input dir
  │                                  │
  ├─ Write request.pkl ──────────────┤
  │  to input dir                    ├─ Read request.pkl
  │                                  ├─ Create _started sentinel
  │  (client sees _started,          ├─ Call func(*args, **kwargs)
  │   disables timeout)              ├─ Write response.pkl (atomic)
  ├──────────────────────────────────┤  to output dir
  ├─ Read response.pkl               │
  ├─ Return result                   │

Directory structure

~/.cache/fileproxy/
├── func_name_1/
│   ├── input/       # Request files (.pkl)
│   └── output/      # Response files (.pkl) + _started sentinels
├── func_name_2/
│   ├── input/
│   └── output/
├── logs/
│   └── server_20260310_143000.log
└── server_heartbeat.json

Error Handling

fileproxy uses custom exception types to distinguish infrastructure errors from function errors:

import fileproxy
from fileproxy import FileProxyError, ServerNotRunningError

func = fileproxy.proxy("my_func")

try:
    result = func(args)
except ServerNotRunningError:
    # fileproxy infrastructure problem: server is not running
    print("Start the fileproxy server!")
except FileProxyError:
    # Other fileproxy infrastructure problem
    print("Something went wrong with the file proxy")
except ValueError:
    # Exception raised by the actual function on the server side
    # (re-raised with original type)
    print("The function itself failed")

  • FileProxyError: Base class for all fileproxy infrastructure errors.
  • ServerNotRunningError(FileProxyError): Server did not acknowledge the request within the timeout.
  • Server-side function exceptions are re-raised with their original type (not wrapped in FileProxyError).

Exception propagation details

When the proxied function raises an exception on the server, the proxy re-raises it on the client with the original exception type in most cases. For example, a server-side ValueError("bad input") becomes a client-side ValueError("bad input").

However, some exception classes have non-standard __init__ signatures that prevent Python's pickle from reconstructing them (e.g., litellm.RateLimitError requires llm_provider and model arguments). In these cases, the original exception cannot be faithfully reconstructed, so the proxy raises a RuntimeError instead, with a message of the form:

RuntimeError: Server-side RateLimitError: rate limited

In summary:

  • Standard exceptions (e.g., ValueError, TypeError, KeyError, most custom exceptions with a simple __init__(self, message) signature): re-raised with original type and message.
  • Non-picklable exceptions (non-standard __init__ that fails to round-trip through pickle): raised as RuntimeError("Server-side {OriginalType}: {original_message}").
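The pickle behaviour behind these two cases can be reproduced in isolation. In the sketch below, ProviderError is a hypothetical exception that mimics the shape of litellm.RateLimitError, and transport is an illustrative helper that applies the fallback described above; neither is part of fileproxy.

```python
import pickle


class ProviderError(Exception):
    """Hypothetical exception with extra required constructor arguments."""

    def __init__(self, message, llm_provider, model):
        super().__init__(message)  # only `message` ends up in self.args
        self.llm_provider = llm_provider
        self.model = model


def transport(exc: BaseException) -> BaseException:
    """Return what a client could re-raise after a pickle round trip."""
    try:
        return pickle.loads(pickle.dumps(exc))
    except Exception:
        # Reconstruction failed: fall back to a generic error that keeps
        # the original type name and message.
        return RuntimeError(f"Server-side {type(exc).__name__}: {exc}")


plain = transport(ValueError("bad input"))
odd = transport(ProviderError("rate limited", "openai", "gpt-4"))
```

Unpickling rebuilds an exception by calling its class with self.args, which here holds only the message, so ProviderError fails with a missing-argument TypeError and the fallback produces the RuntimeError.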

Logs

Server logs are written to {base_dir}/logs/server_YYYYMMDD_HHMMSS.log and also printed to the server terminal. Each log file corresponds to one server session.

Important Notes

Multiple servers

Do not run multiple fileproxy servers with the same base_dir. On startup, the server checks for an existing heartbeat and raises FileProxyError if another server appears to be running. To override and kill the old server, use force=True:

# force=True signals the old server to stop, waits for it to shut down,
# then starts the new server
fileproxy.run_server(functions, force=True)

If you need truly independent servers running simultaneously, use different base_dir values:

FILEPROXY_DIR=~/.cache/fileproxy-project-a python server_a.py
FILEPROXY_DIR=~/.cache/fileproxy-project-b python server_b.py

Restarting the server

When you restart the server, it clears all pending request/response files. Any client calls that were in-flight will eventually time out with ServerNotRunningError. This is by design — it prevents stale requests from a previous session from being processed.

Checking server status

From any node that shares the filesystem:

import fileproxy

info = fileproxy.status()
print(info["alive"])       # True/False
print(info["functions"])   # ["litellm_completion", "http_post", ...]
print(info["pid"])         # Server process ID
print(info["requests_processed"])  # Total requests handled
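How an "alive" check might be derived from a heartbeat file is sketched below. The contents of server_heartbeat.json are not documented here, so the "timestamp" field, the max_age threshold, and the heartbeat_is_fresh helper are all hypothetical.

```python
import json
import tempfile
import time
from pathlib import Path


def heartbeat_is_fresh(heartbeat_path: Path, max_age: float = 5.0) -> bool:
    """Return True if the heartbeat file exists and was updated recently."""
    try:
        data = json.loads(heartbeat_path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return False
    # "timestamp" is an assumed field name holding seconds since the epoch.
    return time.time() - data["timestamp"] <= max_age


with tempfile.TemporaryDirectory() as tmp:
    heartbeat_path = Path(tmp, "server_heartbeat.json")
    missing = heartbeat_is_fresh(heartbeat_path)

    heartbeat_path.write_text(json.dumps({"timestamp": time.time()}))
    fresh = heartbeat_is_fresh(heartbeat_path)

    heartbeat_path.write_text(json.dumps({"timestamp": time.time() - 60}))
    stale = heartbeat_is_fresh(heartbeat_path)
```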

Safety Mechanisms

  • Atomic writes: Responses are written to a .tmp file then renamed, preventing clients from reading partial data.
  • Started sentinel: When the server begins processing a request, it creates a _started marker file. The client uses this to distinguish "server is processing (wait)" from "server is not running (fail fast)."
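  • Exception propagation: If the function raises an exception on the server, the exception object is pickled and re-raised on the client side with its original type. Exceptions that cannot be reconstructed from the pickle are raised as RuntimeError instead (see Exception propagation details).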
  • Unpicklable response handling: If the server cannot pickle the response (e.g., it contains open file handles), the client receives a FileProxyError instead of hanging.
  • Cleanup: Request, response, and sentinel files are removed after processing.
  • Startup cleanup: The server clears stale files from previous runs on startup.
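The atomic-write pattern from the first bullet can be sketched as follows. This is a generic write-then-rename example, not fileproxy's code; write_atomic and the ".tmp" suffix handling are assumptions for illustration.

```python
import os
import pickle
import tempfile
from pathlib import Path


def write_atomic(path: Path, obj) -> None:
    """Pickle obj to a sibling .tmp file, then rename it into place."""
    tmp_path = path.with_name(path.name + ".tmp")
    with open(tmp_path, "wb") as f:
        pickle.dump(obj, f)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to disk before the rename
    # The rename is atomic on POSIX when both paths are on the same filesystem.
    os.replace(tmp_path, path)


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp, "response.pkl")
    write_atomic(target, {"status": "ok"})
    with open(target, "rb") as f:
        loaded = pickle.load(f)
    leftovers = [p.name for p in Path(tmp).iterdir() if p.name.endswith(".tmp")]
```

A reader polling for response.pkl either finds no file or finds the complete one, never a half-written pickle.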

Limitations

  • Arguments and return values must be picklable. Most Python objects are (strings, dicts, lists, numbers, dataclasses, etc.); lambdas, open file handles, and generators are not.
  • Latency overhead of ~100-200ms per call due to filesystem polling.
  • Server and client must share a filesystem (e.g., NFS home directory on HPC clusters). Local-only filesystems like /tmp won't work across nodes.
  • If the server crashes (e.g., killed by OOM) while processing a request, the client will wait indefinitely for that request. Restart the server to recover.
  • If the server and client use different Python environments, server-side exceptions from libraries not installed on the client will be raised as RuntimeError instead of their original type.
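To check the first limitation before sending a value through the proxy, a quick test with the standard pickle module is enough. The is_picklable helper below is a hypothetical convenience, not part of fileproxy.

```python
import pickle


def is_picklable(obj) -> bool:
    """Return True if pickle can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        # pickle raises different exception types depending on the object.
        return False


dict_ok = is_picklable({"model": "gpt-4", "messages": [{"role": "user"}]})
lambda_ok = is_picklable(lambda x: x)
generator_ok = is_picklable(i for i in range(3))
```

Note that a successful dumps only shows the value can be serialized on this side; the receiving side still needs the same classes importable to load it.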

License

MIT
