-
Notifications
You must be signed in to change notification settings - Fork 102
feat: Load the SchedulerConfig from a configuration file/text and make it easier to add plugins #881
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
✅ Deploy Preview for gateway-api-inference-extension ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
Hi @shmuelk. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/ok-to-test |
"sigs.k8s.io/yaml" | ||
) | ||
|
||
type Config struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend to treat this similar to CRDs so that we version it, and also place this under https://github.com/kubernetes-sigs/gateway-api-inference-extension/tree/main/api/config
See examples:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, this need to be a generic config api for epp, we need to encapsulate the plugins configuration in its own struct so we can accommodate other parameters (most flags should move here basically)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused. In the two examples you mention in your comment, while the files are under api/config, they are not really configuration kind of things, but rather true CRDs for CRs that are listened to by the appropriate controllers.
Are you proposing that the config for the EPP be a CRD with the EPP using K8S APIs to listen for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, this need to be a generic config api for epp, we need to encapsulate the plugins configuration in its own struct so we can accommodate other parameters (most flags should move here basically)
I attempted to make the configuration generic for the EPP, which is why it is under pkg/epp/common/config.
I understand the desire, to place the plugins configuration under a a sub-structure. Which parts exactly did you think should go there (plugin_definitions, profile_picker, and/or scheduler_profiles) ?
I would like to see plugin_definitions and scheduler_profiles as top level items.
scheduler_profiles, while today it is completely about plugins, this may change in the future to include flags and other options.
plugin_definitions, while clearly all about plugins, gets referenced in multiple places in the configuration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused. In the two examples you mention in your comment, while the files are under api/config, they are not really configuration kind of things, but rather true CRDs for CRs that are listened to by the appropriate controllers.
They are not CRDs, they don't get deployed on the cluster.
Are you proposing that the config for the EPP be a CRD with the EPP using K8S APIs to listen for it?
No, not deployed as CRDs (i.e., not have a CustomResourceDefinition manifest to deploy them on the cluster). It is still a yaml file we read from a file path. What I am suggesting is that the go struct mimics how we define CRDs, for the most part this means having an inlined metav1.TypeMeta
parameter and define a group+version for it.
That is how it was done for the various k8s components like kube-scheduler and kube-controller-manager, and what we are doing for other custom controllers (like Kueue and LeaderWorkerSet).
This helps with versioning and makes us consistent with other projects under k8s.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I attempted to make the configuration generic for the EPP, which is why it is under pkg/epp/common/config.
In other projects under k8s, this was placed under api/config
, I suggest to be consistent with that precedence; we can name the struct EndpointPickerConfiguration
.
I understand the desire, to place the plugins configuration under a a sub-structure. Which parts exactly did you think should go there (plugin_definitions, profile_picker, and/or scheduler_profiles) ?
When I suggested this, I was thinking about the three sub components that we are trying to make extensible: endpoint picking, flow control and data layer / metrics scraping. I was wondering if we need to split the plugin configuration for each (e.g, a struct for SchedulingConfig, FlowControlConfig and MetricsScrapingConfig)
The other option, which is pretty much what you have in this PR, is to have a single list of plugins across all layers that get registered at the extension points they implement. A slight tweak to the names and types you have:
type EndpointPickerConfig struct {
metav1.TypeMeta
Plugins []PluginConfig
SchedulingProfiles []SchedulingProfile
}
type PluginConfig struct {
Name string
Parameters json.RawMessage
}
type SchedulingProfile struct {
Name string
Plugins []string // references to plugins
}
In the above, we don't even have an explicit ProfilePicker parameter; the profile picker would just be one of the plugins. We can add validations, like only one plugin in the Plugins
list implements the ProfilePicker extension point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like your second option, with one tweak. The PluginConfig struct needs an additional field, PluginName. The Name field simply provides a name that can be referenced in the SchedulingProfile. This allows for multiple instances of the same plugin with different parameters. An example of this might a Load aware plugin that you want to configure differently for different levels of service.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would the config be plumped in this case?
If a plugin has one config, it is easy, we only pass it at instantiation time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Name field of the PluginConfiguration does NOT refer to the name of the plugin. It is a name that can be referred to in SchedulingProfile entries. There is my proposal (and code) an additional field PluginName that is the name of the plugin (i.e. what the plugin's Name() function returns).
As an aside, there is apparently a proposal to rename the Plugin.Name function to Plugin.Type. That would make things clearer here as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ahg-g I have accepted your suggestion and made the configuration structure K8S CR like and the code now uses the K8S apimachinery runtime to load the structs. There are many other changes as well that I think will make things more acceptable
} | ||
|
||
// Load config either from supplied text or from a file | ||
func LoadConfig(configText []byte, fileName string, log logr.Logger) (*Config, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Defining it similar to a CRD would allow us to use the decoder pkg from k8s, see this example: https://github.com/kubernetes-sigs/jobset/blob/a7a669590b3483715c193e0af53dbb5f4f066b03/pkg/config/config.go#L19
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my comment above
@shmuelk see #881 (comment) for the commit messages that need to be fixed. |
@shmuelk this PR contains several unrelated commits that need to be cleaned up. |
I have rebased in better fashion, which I hope eliminated this issue |
|
||
// +required | ||
// +kubebuilder:validation:Required | ||
Plugins map[string]PluginSpec `json:"plugins"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Plugins and SchedulingProfiles fields are now both arrays. A Name field has been added to both.
PluginName string `json:"pluginName"` | ||
|
||
// +optional | ||
Parameters json.RawMessage `json:"parameters"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in kube-scheduler, we used runtime.Object, probably allows us to use k8s api-machinery functions for decoding https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/apis/config/types.go#L207
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the code in the K8S scheduler, runtime.Object is simply cast to the appropriate type used by the plugin. Those types seem to include a standard K8S TypeMeta stanza. You want every plugin's parameters object to start with apiVersion and kind? You were trying to same extra name tags, isn't here even more acute?
Also if we go with runtime.Object, every plugin with parameters would need to register it's schema in the global schema for the config...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we use a secondary schema that this not used/known to the k8s client and only used in configuration?
I do think that the use of Object makes sense as it would natively separate the plugin type from its name, which is currently unclear, IMO. This is a side note - I'll open an issue to discuss to avoid sidetracking the feedback.
cmd/epp/main.go
Outdated
// --- Initialize Core EPP Components --- | ||
scheduler := scheduling.NewScheduler(datastore) | ||
if schedulerV2 { | ||
// register all of the known plgin factories |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// register all of the known plgin factories | |
// register all of the known plugin factories |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
// EndpointPickerConfig struct. | ||
// | ||
// This naming convension is required by the defalter-gen code. | ||
func SetDefaults_EndpointPickerConfig(cfg *EndpointPickerConfig) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should use this function to default the parameters of the various in-tree plugins because defaulting is associated with the api version. I believe this function will be automatically invoked the file is decoded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general this "extension point" is used to set default values in the CR in question. Look at the examples you pointed me to. I don't think it should be used to specify default values for parameters of plugins. The plugins should handle that themselves.
This could be used, if we wanted to create default configurations, albeit in a round about way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The in-tree plugins config types should be part of the versioned api, their defaulting should be as well.
See as an example: https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/apis/config/v1/defaults.go
We can handle that as a followup though to keep this PR focused on defining the api.
cmd/epp/main.go
Outdated
var instantiatedPlugins map[string]plugins.Plugin | ||
var scheduler *scheduling.Scheduler | ||
|
||
if len(*configText) != 0 || len(*configFile) != 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
recommend to simplify it and validate that only one flag is set, otherwise we will need to define merge semantics.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want the ability to specify the configuration from a file or from in-line text in the commanmd arguments. Envoy supports this. This enables things like:
containers:
- name: epp
args:
- -poolName
- ${POOL_NAME}
- -poolNamespace
- "default"
- -v
- "4"
- --zap-encoder
- "json"
- -grpcPort
- "9002"
- -grpcHealthPort
- "9003"
- --configText
- |
apiVersion: inference.networking.x-k8s.io/v1alpha1
kind: EndpointPickerConfig
plugins:
lowQueue:
pluginName: low-queue
parameters:
threshold: 10
maxScore:
pluginName: max-score
profileHandler:
pluginName: single-profile
schedulingProfiles:
default:
plugins:
- pluginRef: lowQueue
- pluginRef: maxScore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps at this point, I'll just add an error message if one specifies both the configFile and configText parameters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps at this point, I'll just add an error message if one specifies both the configFile and configText parameters?
sounds good
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Added a check in flags validation code
cmd/epp/main.go
Outdated
@@ -336,3 +379,21 @@ func verifyMetricMapping(mapping backendmetrics.MetricMapping, logger logr.Logge | |||
logger.Info("Not scraping metric: LoraRequestInfo") | |||
} | |||
} | |||
|
|||
func RegisterAllPlgugins() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func RegisterAllPlgugins() { | |
func RegisterAllPlugins() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
cmd/epp/main.go
Outdated
scheduler = scheduling.NewScheduler(datastore) | ||
} | ||
|
||
if reqHeaderBasedSchedulerForTesting { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is repeated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved inside the switch
cmd/epp/main.go
Outdated
var theConfig *v1alpha1.EndpointPickerConfig | ||
var instantiatedPlugins map[string]plugins.Plugin | ||
var scheduler *scheduling.Scheduler | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If no config is provided, we need to use a default one. This actually raises the question of how to disable default plugins. We should probably have a disabled
flag in the plugin struct that defaults to false. I will leave a comment on the api file in this regards.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current code that I have written, will use the existing GIE methods of configuration, is none is provided.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, we can do a followup PR to standardize defaulting.
* feat: add health check for epp cluster Signed-off-by: zhengkezhou1 <madzhou1@gmail.com> * remove tls Signed-off-by: zhengkezhou1 <madzhou1@gmail.com> * don't use tls Signed-off-by: zhengkezhou1 <madzhou1@gmail.com> * health checking flag Signed-off-by: zhengkezhou1 <madzhou1@gmail.com> * fix import Signed-off-by: zhengkezhou1 <madzhou1@gmail.com> * add tls options Signed-off-by: zhengkezhou1 <madzhou1@gmail.com> --------- Signed-off-by: zhengkezhou1 <madzhou1@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this PR is in a reasonable shape to merge. There are a number of followups to do:
-
Concrete items to execute on:
- Enable registering out of tree plugins
- Cleanup the scheduler config instantiation paths and converge on one only, for example, when the config flag is not set, start from a default struct.
-
Open items to still converge on:
- Config API structure, I am still in favor of the approach we currently have.
- Parameters json.RawMessage vs runtime.Object and versioning of the parameters
- Do we want to default the Plugins and SchedulingProfiles parameters? If so, we need to define merging semantics and how to disable a plugin enabled by default (Add a disabled parameter to PluginSpec?)
what do you folks think?
@@ -23,6 +23,9 @@ import ( | |||
) | |||
|
|||
func main() { | |||
// Register all known plugin factories | |||
runner.RegisterAllPlgugins() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need a way to register out-of-tree plugins, for the example the function could accept a variadic parameter of out-of-tree plugins
overall I feel comfortable with most of the code in this PR.
|
That already exists. Simply invoke the plugins.Register function for any out-of-tree plugins one wants to add. |
…#820) Signed-off-by: Ira <IRAR@il.ibm.com>
kubernetes-sigs#980) * Add a metrics to track loaded adapters * Update the sample manifests * Add explanation of metrics from dyanmic LoRA adapter sidecar * Add explanation of metrics from dyanmic LoRA adapter sidecar (take 2) * Update metrics.md based on feedback
…sigs#928) * refactor: Replace prefix cache structure with golang-lru Signed-off-by: Kfir Toledo <kfir.toledo@ibm.com> Co-authored-by: Maroon Ayoub <maroon.ayoub@ibm.com> * fix: rename prefix scorer parameters and convert test to benchmark test Signed-off-by: Kfir Toledo <kfir.toledo@ibm.com> * feat: Add per server LRU capacity Signed-off-by: Kfir Toledo <kfir.toledo@ibm.com> * fix: Fix typos and error handle Signed-off-by: Kfir Toledo <kfir.toledo@ibm.com> * fix: add safety check for LRUCapacityPerServer Signed-off-by: Kfir Toledo <kfir.toledo@ibm.com> --------- Signed-off-by: Kfir Toledo <kfir.toledo@ibm.com> Co-authored-by: Maroon Ayoub <maroon.ayoub@ibm.com>
…ernetes-sigs#834) * copy of accepted inference pool test to start from. * add yaml file for the test * update time out * update the yaml file to add port 9002 * read timeout config from local repo * remove excess comments * correct spelling for scenarios * check route condition on RouteConditionResolvedRefs * remove empty lines in yaml * set optional/defaulted fields as unspecified * fix timeout * fix boilerplate header * change varialbe names to use primary secondary consistently. * remove extra comments * factor out common code * Add actual http traffic validation using echo-basic * remove extra comments from manifest * remove modifiedTimeoutConfig.HTTPRouteMustHaveCondition per review comment. * intermediate update * fix the test run * factor out common code * move epp def to shared manifest * remove extra comments * revert back to two epps * add to do for epp image * switch to GeneralMustHaveConditionTimeout * undo gateway version changes * remove unused HTTPRouteMustHaveConditions * update doc string for GetPod * update docstring * Remove resource type from names in manifests. * remove type from name * remove health check * add todo for combining getpod methods
Signed-off-by: Shmuel Kallner <kallner@il.ibm.com>
Signed-off-by: Shmuel Kallner <kallner@il.ibm.com>
Signed-off-by: Shmuel Kallner <kallner@il.ibm.com>
// +required | ||
// +kubebuilder:validation:Required | ||
// PluginName specifies the plugin to be instantiated. | ||
PluginName string `json:"pluginName"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another follow up is to probably change this to "type" @elevran
/approve |
/lgtm would be easier to iterate after merging. |
/approve Agreed, this is workable. /unhold |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ahg-g, kfswain, shmuelk The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
This PR adds the ability to:
Fixes #839
This PR adds a configuration of the form:
The
plugins
section declares the plugin instances to be used in the EPP. A single named plugin instances can be used in multipleSchedulerProfile
s as well as in other components of the EPP. It is an array structs with the following fields:name
: A name to be used in the field SchedulingProfile.pluginRef. This field is optional, if omitted the value of thepluginName
field will used.pluginName
: names the plugin to instantiate.parameters
: passed unparsed to the named plugin's factory function when creating the instance.The
scheduleringProfiles
section defines the set ofSchedulingProfile
s that will be in the created in theSchedulerConfig
. It is an array of structs with the following fields:name
: The name of the SchedulingProfileplugins
: The array ofPlugin
s to be associated with thisSchedulerProfile
. Each element in this array has the following fields:pluginRef
: names a plugin instance by the name it was given in theplugins
section.weight
: the weight to be used if the plugin is aScorer
The startup of the EPP will choose the ProfileHandler plugin and any PostResponse plugins automatically from those specified in the plugins section of the configuration.
To enable the instantiation of plugins from the configuration file/text a singleton/static plugin registry was added. Each plugin was modified to add a factory function. The EPP
main()
function invokes a new function that registers all of the plugins' factory functions in the registry.