[Train] [Tune] Refactor MLflow #20802

amogkam · 2021-11-30T19:45:10Z

Pulls Tune's MLflow logging logic out into a shared MLflow util.
Adds an MLflow logger callback to Ray Train.
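
The shape of this refactor can be sketched roughly as follows. This is only an illustration of "one shared helper, two thin callbacks". Every class and method name below is hypothetical and not taken from the PR, and the helper records calls in memory instead of talking to MLflow.

```python
from typing import Any, Dict, List


class SharedMLflowUtil:
    """Hypothetical shared helper; stores what would be logged to MLflow."""

    def __init__(self) -> None:
        self.logged: List[Dict[str, Any]] = []

    def log_metrics(self, metrics: Dict[str, float], step: int) -> None:
        # A real helper would forward this to MLflow; here we only record it.
        self.logged.append({"step": step, "metrics": dict(metrics)})


class TuneStyleCallback:
    """Hypothetical Tune-side adapter: forwards one trial result."""

    def __init__(self, util: SharedMLflowUtil) -> None:
        self.util = util

    def on_trial_result(self, iteration: int, result: Dict[str, float]) -> None:
        self.util.log_metrics(result, step=iteration)


class TrainStyleCallback:
    """Hypothetical Train-side adapter: forwards one worker's result per call."""

    def __init__(self, util: SharedMLflowUtil, worker_to_log: int = 0) -> None:
        self.util = util
        self.worker_to_log = worker_to_log
        self.step = 0

    def handle_result(self, results: List[Dict[str, float]]) -> None:
        # `results` holds one dict per worker; only the chosen worker is logged.
        self.util.log_metrics(results[self.worker_to_log], step=self.step)
        self.step += 1
```

Both adapters stay small because the logging behavior lives in one place, which is the motivation the description gives for the shared util.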

Closes #20642

Checks

- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Yard1

This looks awesome!

python/ray/util/ml_utils/mlflow.py

matthewdeng · 2021-11-30T20:43:12Z

.buildkite/pipeline.yml

@@ -715,7 +715,7 @@
    - bazel test --config=ci $(./scripts/bazel_export_options) --build_tests_only --test_tag_filters=client_unit_tests,-gpu_only --test_env=RAY_CLIENT_MODE=1 python/ray/util/sgd/...

 - label: ":octopus: Tune/SGD/Modin/Dask tests and examples. Python 3.7"
-  conditions: ["RAY_CI_TUNE_AFFECTED", "RAY_CI_SGD_AFFECTED"]
+  conditions: ["RAY_CI_TUNE_AFFECTED", "RAY_CI_TRAIN_AFFECTED"]


matthewdeng (Nov 30, 2021): Why are we removing RAY_CI_SGD_AFFECTED here?

amogkam (Dec 3, 2021): I don't think it was ever needed in the first place.

python/ray/train/callbacks/callback.py

matthewdeng · 2021-11-30T20:52:07Z

python/ray/train/callbacks/logging.py

+    Args:
+        tracking_uri (Optional[str]): The tracking URI for where to manage
+            experiments and runs. This can either be a local file path or a
+            remote server. This arg gets passed directly to mlflow
+            initialization.
+        registry_uri (Optional[str]): The registry URI that gets passed
+            directly to mlflow initialization.
+        experiment_id (Optional[str]): The experiment id of an already
+            existing experiment. If not
+            passed in, experiment_name will be used.
+        experiment_name (Optional[str]): The experiment name to use for this
+            Train run.
+            If the experiment with the name already exists with MLflow,
+            it will be used. If not, a new experiment will be created with
+            this name.


matthewdeng (Nov 30, 2021): Can you document the behavior when these are None?

amogkam (Nov 30, 2021): Could you elaborate more on what you'd like to see here? There is information in the description for the arguments on the behavior if None is passed in.

matthewdeng (Dec 1, 2021): Ah, so it's not clear to me from this doc what the behavior is if tracking_uri or registry_uri is None. Taking another look at the implementation and docs, I now see that it's mentioned in the logdir doc. It would be helpful to include this information in each of these parameter docs as well.

I also can't tell from this doc at all what happens if experiment_name is None (and the corresponding environment variable is not set). From the error message, it seems like at least one of experiment_name or experiment_id must be set.

amogkam (Dec 4, 2021): Got it, thanks for the explanation! Added it to the docstring.
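
The None handling discussed in this thread can be sketched as below. This is a minimal illustration and not the code in the PR: the function name, the dict standing in for the MLflow backend, the environment variable names, and the choice to raise on an unknown experiment_id are all assumptions. What it follows from the thread is the order: experiment_id is used if given, otherwise experiment_name is looked up or created, and an error is raised when neither is available.

```python
import os
from typing import Mapping, MutableMapping, Optional


def resolve_experiment_id(
    store: MutableMapping[str, str],
    experiment_id: Optional[str] = None,
    experiment_name: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the id of the experiment to log to.

    `store` maps experiment name -> experiment id and is a stand-in
    for the MLflow backend (hypothetical, for illustration only).
    """
    env = os.environ if env is None else env

    # Each argument falls back to its environment variable when it is None.
    # The variable names are assumptions made for this sketch.
    if experiment_id is None:
        experiment_id = env.get("MLFLOW_EXPERIMENT_ID")
    if experiment_name is None:
        experiment_name = env.get("MLFLOW_EXPERIMENT_NAME")

    if experiment_id is not None:
        # Assumption: an id must refer to an already existing experiment.
        if experiment_id not in store.values():
            raise ValueError(f"No experiment with id {experiment_id!r} exists.")
        return experiment_id

    if experiment_name is not None:
        if experiment_name not in store:
            # Create the experiment under the smallest unused integer id.
            new_id = 0
            while str(new_id) in store.values():
                new_id += 1
            store[experiment_name] = str(new_id)
        return store[experiment_name]

    raise ValueError("Set at least one of experiment_id or experiment_name.")
```

For example, with `store = {"existing": "7"}`, passing `experiment_name="existing"` returns `"7"`, while passing a new name adds an entry to `store` and returns its id.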

python/ray/train/callbacks/logging.py

python/ray/train/trainer.py

python/ray/train/callbacks/logging.py

krfricke

Looks great so far

.buildkite/pipeline.yml

python/ray/train/callbacks/callback.py

python/ray/train/callbacks/logging.py

python/ray/train/trainer.py

python/ray/tune/integration/mlflow.py

amogkam · 2021-12-04T01:32:58Z

@matthewdeng @krfricke please take another look!

amogkam · 2021-12-13T18:02:08Z

@matthewdeng can you take another look please?

Also, I agree with the future TODOs you listed out. Do you want to make GitHub issues for them?

.buildkite/pipeline.ml.yml

matthewdeng

Functionality LGTM!

.buildkite/pipeline.ml.yml

doc/source/train/user_guide.rst

python/ray/train/examples/mlflow_fashion_mnist_example.py

amogkam · 2021-12-22T01:17:38Z

Failing tests are unrelated, going to merge

amogkam added 15 commits November 17, 2021 12:56

wip

9437251

wip

2720e9c

wip

9740782

wip

b73503e

add file

0331ef6

cleanup tune integration

4efb081

finish tune

4cdcb31

add train

ae64f04

update example

41e9b0d

almost done

0557561

finish

567924c

formatting

b004dee

Merge branch 'master' of https://github.com/ray-project/ray into train-mlflow

d02237a

CI

06de0d9

formatting

4755286

amogkam assigned matthewdeng, Yard1, krfricke and xwjiang2010 Nov 30, 2021

Yard1 approved these changes Nov 30, 2021

python/ray/util/ml_utils/mlflow.py (resolved)

matthewdeng reviewed Nov 30, 2021

update error message

5950734

matthewdeng reviewed Dec 1, 2021

python/ray/train/callbacks/logging.py (outdated, resolved)

krfricke reviewed Dec 2, 2021

amogkam added 2 commits December 3, 2021 17:25

address comments

1faaea4

Address comments

0683718

amogkam requested review from matthewdeng and krfricke December 4, 2021 01:32

fix datasets+train test

de719eb

amogkam added 3 commits December 13, 2021 09:32

address comments

ac564e9

fix failing tests

15e9c21

docs

3cba50b

amogkam requested a review from matthewdeng December 13, 2021 18:01

amogkam commented Dec 13, 2021

.buildkite/pipeline.ml.yml (outdated, resolved)

amogkam and others added 3 commits December 13, 2021 10:03

Update .buildkite/pipeline.ml.yml

939f3b2

fix cuj example

4379615

Merge branch 'train-mlflow' of github.com:amogkam/ray into train-mlflow

34c64b7

matthewdeng approved these changes Dec 13, 2021

.buildkite/pipeline.ml.yml (resolved)

doc/source/train/user_guide.rst (outdated, resolved)

python/ray/train/examples/mlflow_fashion_mnist_example.py (resolved)

amogkam added 16 commits December 13, 2021 10:46

update examples page

61bfe2e

Merge branch 'master' of https://github.com/ray-project/ray into train-mlflow

192420e

update docs

8880fb1

update user guide

4caad5b

fix test

d031913

add more steps to the user guide

6d35f86

formatting

c827cb4

update logs

66c0b3c

Merge branch 'master' of https://github.com/ray-project/ray into train-mlflow

4f9eb20

update

d2aa72c

updates

a5ffbd0

fix test

fb7325f

bump tensorboard version

95fdc64

fix

b045256

fix artifact recursion

dd894ce

fix test

c2f8dab

amogkam merged commit 57db464 into ray-project:master Dec 22, 2021

amogkam deleted the train-mlflow branch December 22, 2021 01:17

tgaddair mentioned this pull request Apr 9, 2022

Fixed Mlflow integration in Ray Tune to set the correct experiment ID ludwig-ai/ludwig#1894

Merged

Yard1 left a comment

matthewdeng Nov 30, 2021

amogkam Dec 3, 2021

matthewdeng Nov 30, 2021

amogkam Nov 30, 2021

matthewdeng Dec 1, 2021

amogkam Dec 4, 2021

krfricke left a comment

amogkam commented Dec 4, 2021

amogkam commented Dec 13, 2021

matthewdeng left a comment

amogkam commented Dec 22, 2021
