
[KIT-80] MLFlow RNS Store #42

Closed
dongreenberg opened this issue Mar 31, 2023 · 0 comments
dongreenberg commented Mar 31, 2023

Right now we support two kinds of RNS stores for saving and loading: the Runhouse RNS and the git repo. MLFlow offers a high degree of flexibility in the storage backends users can persist their logs and experiments to, and many DS teams already have these stores set up. However, MLFlow only provides first-class support for models as primitives saved to and loaded from the store (which is funny, because "models" are a primitive we specifically don't support, on purpose). Today, people save and load other infrastructure metadata as free-form strings (e.g. s3 paths; a sketch of this workflow follows the list below), but this has significant limitations:

  1. Sharing infra via strings is like sharing files via filepaths: it's not an "active," ready-to-use object the way a shared Google Doc is.
  2. MLFlow doesn't make it easy to pull these strings programmatically (as far as I can see, and I've asked several active users in the MLFlow Slack about it); they intend for them to be surfaced through the UI or via their search API.
  3. Strings obviously only go so far - no one is sharing the equivalent of a Runhouse function's metadata through these strings to allow for shared services.
    From MLFlow Slack:

[screenshot of the MLFlow Slack thread]
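For concreteness, here is a minimal sketch of the string-based workaround described above, using stock MLFlow APIs (log_param plus search_runs). The experiment name, param key, and s3 path are hypothetical placeholders:

```python
import mlflow

mlflow.set_experiment("bert-dropout")  # hypothetical experiment name

# Producer: record the location of an artifact as a free-form string.
with mlflow.start_run(run_name="preprocess"):
    mlflow.log_param("preprocessed_table_path", "s3://my-bucket/tables/bert_dropout_v5")

# Consumer: the main programmatic route back to that string is the search API,
# which returns a DataFrame of runs rather than a ready-to-use object.
runs = mlflow.search_runs(filter_string="params.preprocessed_table_path LIKE 's3://%'")
table_path = runs.iloc[0]["params.preprocessed_table_path"]
# `table_path` is just a string; the consumer still has to reconstruct the
# table, credentials, and compute on their own.
```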

I think there are a few possible APIs we can provide here:

  1. Allow users to use MLFlow as the RNS store via the same Resource.save() or Resource.from_name() etc. APIs. This would mean many users who already use MLFlow could start sharing Runhouse resources immediately, without any approvals for external metadata storage or setup, and it would integrate into their existing MLFlow-centric experiment workflows. The disadvantage is the experiment- or project-centric structure in which MLFlow presents metadata; we wouldn't be capturing these resources in the Runhouse RNS to show them in a team-centric way.
  2. Integrate with the MLFlow tracer to allow save() to write to both MLFlow (with names only) and the Runhouse RNS (a sketch of this flow follows the list). That way, users can view Runhouse's interfaces for a single pane of glass into the infra resources, and MLFlow's as an experiment-, project-, or model-centric homepage. Essentially, this would mean using MLFlow's experiment and project structure to add foldering convenience to resources, while avoiding two sources of truth for the metadata. If the user has a particular MLFlow experiment set and then requests rh.Table.from_name("bert_dropout_v5"), we'll go to MLFlow first to get the full RNS path for that resource, and then to Runhouse to fetch the resource itself. We could also support an API to pull a dict of all available resources for an experiment at once.
  3. Introduce an mlflow.runhouse integration (or model type) which facilitates saving and loading of Runhouse resources. This would allow saving and loading resources in a way familiar to MLFlow users, but would also add a lot of new non-model things into users' model registries.
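To make option 2 concrete, here is a rough sketch of what the tracer integration might look like. The helper names, tag keys, and the `rns_address` attribute are assumptions for illustration, not an existing API:

```python
import mlflow
import runhouse as rh

def save_with_mlflow(resource, name: str):
    """Hypothetical helper: persist full metadata to the Runhouse RNS, and
    record only the name/path as a tag on the active MLFlow run."""
    resource.save(name)
    # `rns_address` is assumed to be the resource's full RNS path.
    mlflow.set_tag(f"runhouse.{name}", resource.rns_address)
    return resource

def table_from_experiment(name: str, run_id: str):
    """Hypothetical helper: resolve the full RNS path from the MLFlow run's
    tags, then fetch the resource itself from Runhouse."""
    run = mlflow.get_run(run_id)
    rns_path = run.data.tags[f"runhouse.{name}"]
    return rh.Table.from_name(rns_path)
```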

The easiest way to think about the user journey is like this (showing a notebook-centric workflow in a system like Databricks just to stress the assumptions, but this would all work even more simply in a git+IDE setting):

  1. User is in a notebook working on an experiment. They're using MLFlow to document hyperparameters, metrics, and logs as they set up the experiment. They create a number of functions (pull, preprocess, train, eval, etc.) and artifacts (preprocessed tables, folders of model outputs or intermediate checkpoints, etc.) which they want to dispatch to remote infra and preserve within the experiment. They use Runhouse to facilitate the infra interactions, and then call .save() on their resources to persist the metadata and the record of the resources used in the experiment (probably including versions) to MLFlow (a sketch of this step, and of loading the resources back later, follows the list).
  2. Another user needs to reproduce or play with the experiment. They either 1) open the MLFlow UI and see the resources used for the experiment, so they can begin playing with them in a new notebook or script (or do the same by pulling them through the API), or 2) open the notebook and begin working with the resources without having to regenerate them, because they can be loaded from RNS.
  3. Eventually, this experiment is chosen to move to(ward) production. An initial inference function can be shared by RNS name with a customer team for QA without needing to undergo an export step, and the notebook can even be scheduled to run repeatedly with the inference function auto-updating (and an export step to an inference engine can be added, obviously). If the user wants to schedule the logic to run as a dependency of another job, or have fault tolerance, monitoring, etc., they can simply adapt the notebook into a single script to drop into their orchestrator (with many heavy or light options there). The user can load their resources as-is by name inside their reproduction script (e.g. rh.Function.from_name("yolo_v5_training_dropout")), copy out the full logic (including functions) from the notebook, or copy out the logic and flow but move the reusable functions into a shared git repo and import them into the script.
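Here is a minimal sketch of steps 1 and 3 in code, assuming Runhouse notebook APIs roughly as referenced above (rh.function, .save(), rh.Function.from_name()); the cluster setup, names, and hyperparameters are hypothetical:

```python
import mlflow
import runhouse as rh

# Step 1: inside the experiment notebook.
gpu = rh.cluster(name="rh-a10x", instance_type="A10G:1")  # assumed cluster config

def train(dropout: float = 0.5):
    ...  # training logic elided

with mlflow.start_run(run_name="yolo_v5_dropout"):
    mlflow.log_param("dropout", 0.5)
    remote_train = rh.function(train).to(gpu)  # dispatch to remote infra
    remote_train(dropout=0.5)
    # Persist the resource metadata so it can be loaded by name later.
    remote_train.save("yolo_v5_training_dropout")

# Step 3: a teammate or a scheduled reproduction script reuses the function
# by name, without regenerating it.
reloaded = rh.Function.from_name("yolo_v5_training_dropout")
```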

Cc @rmehyde, @ankmathur96

From SyncLinear.com | KIT-80

@dongreenberg changed the title from [KIT-80] MLFlow RNS connector to [KIT-80] MLFlow RNS on Mar 31, 2023
@dongreenberg changed the title from [KIT-80] MLFlow RNS to [KIT-80] MLFlow RNS Store on Apr 4, 2023
@dongreenberg closed this as not planned on Oct 4, 2023