[RFC] IOChain: request/response filters for OpenAI-compatible serving #26222
mukeshbaphna started this conversation in Ideas
RFC: IOChain request/response filters for SGLang inference serving
Author: Mukesh Baphna (@mukeshbaphna)
Status: Draft / seeking maintainer feedback
Related:
Summary
SGLang should provide a small, explicit request/response filter pipeline at the OpenAI-compatible serving layer.
The goal is to let deployments inspect, reject, annotate, or observe inference requests and responses without forking SGLang or monkey-patching internal code. This proposal is intentionally narrower than a general plugin system: it focuses only on request/response lifecycle hooks around OpenAI-compatible serving.
Motivation
Today, customization is possible at the HTTP middleware layer, but HTTP middleware only sees raw transport-level request/response data. It does not have structured access to:
This leaves production users with weak extension points for common serving needs such as:
The broader need overlaps with the plugin-system request in #13825, while #6621 covers custom HTTP middleware. IOChain would complement those efforts by adding a structured inference-serving hook point rather than another raw HTTP middleware layer.
Goals
Non-goals
Proposed API
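A minimal sketch of what such a filter interface could look like, assuming a Python base class. Only the names IOFilter, on_request, on_response, and blocking come from this RFC; the async method signatures, the no-op defaults, and the AuditLogFilter example are illustrative assumptions, not a settled API.

```python
from typing import Any


class IOFilter:
    """Hypothetical base class; subclasses override the hooks they need."""

    # Per the RFC: True means the request waits for the hook to finish,
    # False means the hook is scheduled as a background task.
    blocking: bool = True

    async def on_request(self, ctx: Any) -> None:
        """Called once the OpenAI-compatible request is parsed (no-op by default)."""

    async def on_response(self, ctx: Any) -> None:
        """Called once the response is complete (no-op by default)."""


class AuditLogFilter(IOFilter):
    """Illustrative telemetry-style filter that only records what it sees."""

    blocking = False  # best-effort export, should never delay the request

    def __init__(self) -> None:
        self.events: list[tuple[str, Any]] = []

    async def on_request(self, ctx: Any) -> None:
        self.events.append(("request", ctx))

    async def on_response(self, ctx: Any) -> None:
        self.events.append(("response", ctx))
```

The context argument is typed as Any here only to keep the sketch self-contained; in the proposal it would be the IOContext object.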
blocking = True means the request waits for the hook to finish.
blocking = False means SGLang schedules the hook as a background task. This is intended for telemetry, audit logging, and best-effort export.
Proposed context object
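A minimal sketch of a context object of this kind, assuming a Python dataclass. Only IOContext and metadata are named in this RFC; every other field (request_id, endpoint, request, response) and the with_metadata helper are hypothetical placeholders chosen for illustration. The sketch shows the immutable variant, in which a hook derives an updated context instead of mutating shared state.

```python
from dataclasses import dataclass, field, replace
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class IOContext:
    # All fields except `metadata` are hypothetical placeholders.
    request_id: str
    endpoint: str                                   # e.g. "/v1/chat/completions"
    request: Mapping[str, Any]                      # parsed OpenAI-compatible request
    response: Optional[Mapping[str, Any]] = None    # filled in after generation
    metadata: Mapping[str, Any] = field(default_factory=dict)

    def with_metadata(self, **updates: Any) -> "IOContext":
        """Return a copy whose metadata includes `updates`; the original is untouched."""
        return replace(self, metadata={**self.metadata, **updates})


ctx = IOContext(
    request_id="req-1",
    endpoint="/v1/chat/completions",
    request={"model": "example-model", "messages": []},
)
tagged = ctx.with_metadata(tenant="team-a")
```

Note that frozen=True only prevents reassigning fields; a dict stored in metadata could still be mutated in place, so a stricter design would need a read-only mapping.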
Open question: metadata may need to be mutable if filters are expected to share state between ingress and egress. If SGLang prefers immutable context objects, hooks can return an updated context instead.
Lifecycle
For non-streaming requests:
- IOContext is created after the OpenAI-compatible request is parsed.
- on_request filters run in configured order.
- on_response filters run in configured order.
For streaming requests:
- on_request runs before stream generation starts.
- on_response runs after completion.
Open question: if a streaming request is disconnected or cancelled, should on_response run with cancellation/error metadata?
Initial registration proposal
Add a CLI/config option similar in spirit to middleware registration:
Each value resolves to a Python class or factory. Filters run in the order provided.
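To illustrate "each value resolves to a Python class or factory", here is a sketch built on importlib. The "module:attribute" string format and the resolve_filter and load_filters helper names are assumptions made for this example; the RFC leaves the exact option syntax open.

```python
import importlib
from typing import Any, Callable, Iterable, List


def resolve_filter(spec: str) -> Callable[[], Any]:
    """Resolve 'package.module:attribute' to a class or zero-argument factory."""
    module_name, sep, attr = spec.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'module:attribute', got {spec!r}")
    target = getattr(importlib.import_module(module_name), attr)
    if not callable(target):
        raise TypeError(f"{spec!r} is not a class or factory")
    return target


def load_filters(specs: Iterable[str]) -> List[Any]:
    """Instantiate filters, preserving the order in which they were configured."""
    return [resolve_filter(spec)() for spec in specs]


# Standard-library callables stand in for real filter classes here.
filters = load_filters(["collections:OrderedDict", "collections:Counter"])
```

Failing fast on a malformed or non-callable value at startup keeps misconfiguration from surfacing later as a per-request error.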
Open question: maintainers may prefer config-file registration instead of, or in addition to, CLI registration.
Error handling
For blocking filters:
- Errors in on_request should fail the request with a clear server-side error path.
- Explicit rejection uses IOFilterReject.
- Errors in on_response should be logged and should not corrupt an already generated response unless maintainers prefer strict failure semantics.
For non-blocking filters:
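A sketch of how the blocking and non-blocking failure paths might differ for request-side hooks, using asyncio. Treating IOFilterReject as an exception, the run_request_filters helper, and the logger name are all assumptions; the RFC does not fix these details.

```python
import asyncio
import logging
from typing import Any, Iterable, List

logger = logging.getLogger("iochain.sketch")  # hypothetical logger name


class IOFilterReject(Exception):
    """Assumed here to be an exception a filter raises to reject a request."""


async def _run_in_background(flt: Any, ctx: Any) -> None:
    # Non-blocking hook: failures are logged and never reach the client.
    try:
        await flt.on_request(ctx)
    except Exception:
        logger.exception("non-blocking filter %r failed", flt)


async def run_request_filters(filters: Iterable[Any], ctx: Any) -> List[asyncio.Task]:
    """Run on_request hooks in configured order and return any background tasks."""
    background: List[asyncio.Task] = []
    for flt in filters:
        if getattr(flt, "blocking", True):
            # Blocking hook: IOFilterReject or any other exception propagates,
            # so the caller can fail the request before generation starts.
            await flt.on_request(ctx)
        else:
            background.append(asyncio.create_task(_run_in_background(flt, ctx)))
    return background
```

The returned tasks should be kept referenced by the caller until they finish, since asyncio holds only weak references to tasks and an unreferenced task may be garbage-collected mid-run.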
Request mutation
This RFC proposes starting with observation plus explicit rejection.
Request mutation can be added later if maintainers want prompt rewriting or request normalization. Starting without mutation keeps the first version easier to reason about and avoids unclear ownership of validation, tokenization, and scheduler invariants.
Open question: should v1 allow mutation of safe fields, or should mutation be deferred?
Security considerations
Performance considerations
Compatibility
This should be additive and disabled by default.
Existing HTTP middleware behavior should remain unchanged. IOChain should be compatible with existing OpenAI-compatible endpoints and should not require users to modify model code.
Proposed implementation phases
Phase 1: minimal observability hooks
- IOFilter, IOContext, and filter-chain execution.
- on_request and on_response.
Phase 2: streaming completion hooks
Phase 3: built-in examples
Phase 4: optional policy/rejection support
Questions for maintainers
Suggested next step
If maintainers are open to this direction, I can open a draft implementation PR for Phase 1 with tests and benchmark numbers for no-filter and no-op-filter overhead.