Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MLflow autologging issue # 1618 #2092

Draft
wants to merge 14 commits into
base: master
Choose a base branch
from
103 changes: 103 additions & 0 deletions docs/userguide/torch_forecasting_models.md
Original file line number Diff line number Diff line change
Expand Up @@ -461,6 +461,109 @@ model.fit(...)

*Note* : The callback will give one more element in the `loss_logger.val_loss` as the model trainer performs a validation sanity check before the training begins.

#### Example with MLflow Autologging
madtoinou marked this conversation as resolved.
Show resolved Hide resolved

MLflow using interface (UI) and autologging to track Dart's pytorch models.
```python
import pandas as pd
from torchmetrics import MeanAbsolutePercentageError
madtoinou marked this conversation as resolved.
Show resolved Hide resolved
from darts.dataprocessing.transformers import Scaler
from darts.datasets import AirPassengersDataset
from darts.models import NBEATSModel

# read data
series = AirPassengersDataset().load()

# create training and validation sets:
train, val = series.split_after(pd.Timestamp(year=1957, month=12, day=1))

# normalize the time series
transformer = Scaler()
train = transformer.fit_transform(train)
val = transformer.transform(val)

# any TorchMetric or val_loss can be used as the monitor
torch_metrics = torchmetrics.regression.MeanAbsolutePercentageError()

# MLflow setup
## Run this command with environment activated: mlflow ui --port xxxx (e.g. 5000, 5001, 5002)
# Copy and paste url from command line to web browser
import mlflow
import torchmetrics
from mlflow.data.pandas_dataset import PandasDataset

mlflow.pytorch.autolog(log_every_n_epoch=1, log_every_n_step=None,
log_models=True, log_datasets=True, disable=False,
exclusive=False, disable_for_unsupported_versions=False,
silent=False, registered_model_name=None, extra_tags=None
)

import mlflow.pytorch
from mlflow.client import MlflowClient

model_name = "Darts"

with mlflow.start_run(nested=True) as run:

dataset: PandasDataset = mlflow.data.from_pandas(series, source="AirPassengersDataset")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO should call series.pd_dataframe() because in my tests it worked only on DataFrames not Series:
dataset: PandasDataset = mlflow.data.from_pandas(series.pd_dataframe(), source="AirPassengersDataset")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added via commit c15eb1e
Thank you!


# Log the dataset to the MLflow Run. Specify the "training" context to indicate that the
# dataset is used for model training
mlflow.log_input(dataset, context="training")

mlflow.log_param("model_type", "Darts_Pytorch_model")
mlflow.log_param("input_chunk_length", 24)
mlflow.log_param("output_chunk_length", 12)
mlflow.log_param("n_epochs", 500)
mlflow.log_param("model_name", 'NBEATS_MLflow')
mlflow.log_param("log_tensorboard", True)
mlflow.log_param("torch_metrics", "torchmetrics.regression.MeanAbsolutePercentageError()")
mlflow.log_param("nr_epochs_val_period", 1)
mlflow.log_param("pl_trainer_kwargs", "{callbacks: [loss_logger]}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make this a bit more concise, with a list of parameters to log and access them in the model.model_params, model.pl_module_params or model.trainer_params attributes (potentially after fusing these dictionaries into one).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @madtoinou
I made the code more concise as suggested. Refer to commit f798244

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @madtoinou , @dennisbader ,

I guess you may have been busy with the latest Darts release. Did you have time to look at my latest changes? Does this work to merge the first part of the MLFlow integration as previously discussed?

Cheers!

By the way, nicely done! I need to test the latest release speeds. :)



from pytorch_lightning.callbacks import Callback

class LossLogger(Callback):
def __init__(self):
self.train_loss = []
self.val_loss = []

# will automatically be called at the end of each epoch
def on_train_epoch_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
self.train_loss.append(float(trainer.callback_metrics["train_loss"]))

def on_validation_epoch_end(self, trainer: "pl.Trainer", pl_module: "pl.LightningModule") -> None:
self.val_loss.append(float(trainer.callback_metrics["val_loss"]))


loss_logger = LossLogger()

# create the model
model = NBEATSModel(
input_chunk_length=24,
output_chunk_length=12,
n_epochs=500,
model_name='NBEATS_MLflow',
log_tensorboard=True,
torch_metrics=torch_metrics,
nr_epochs_val_period=1,
pl_trainer_kwargs={"callbacks": [loss_logger]})

# use validation dataset
model.fit(
series=train,
val_series=val,
)

# predit
forecast = model.predict(len(val))

# Registering model
model_uri = f"runs:/{run.info.run_id}/darts-NBEATS"
mlflow.register_model(model_uri=model_uri, name=model_name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO runs currently do not save models as artifacts becaus there is no call to log_model(). Therefore model_uri is not pointing to a valid model. Please correct me if I'm wrong, however, it did not work in my tests. If it works, can you pleas add an example how to load with loaded_model = mlflow.<flavor>.load_model(model_uri=model_uri) ?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will need to have a look at this, my understanding is that it saves it both using mlflow.<model_flavor>.log_model() or mlflow.register_model(). Refer to https://mlflow.org/docs/latest/model-registry.html.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As i understand it you can only (kind of) promote a already saved model to a registered model..

Model
An MLflow Model is created from an experiment or run that is logged with one of the model flavor’s mlflow.<model_flavor>.log_model() methods. Once logged, this model can then be registered with the Model Registry.

Registered Model
An MLflow Model can be registered with the Model Registry. A registered model has a unique name, contains versions, aliases, tags, and other metadata.
https://mlflow.org/docs/latest/model-registry.html#concepts

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As i understand it you can only (kind of) promote a already saved model to a registered model..

Model
An MLflow Model is created from an experiment or run that is logged with one of the model flavor’s mlflow.<model_flavor>.log_model() methods. Once logged, this model can then be registered with the Model Registry.
Registered Model
An MLflow Model can be registered with the Model Registry. A registered model has a unique name, contains versions, aliases, tags, and other metadata.
https://mlflow.org/docs/latest/model-registry.html#concepts

Hello @turbotimon ,
My plans was to share how to do the first part of issue # 1618. As it allows you to track the experiments (so one don't get lost if you are doing more of a few iterations), select the best one, see metrics progress, etc.
I will have a look at what you are suggesting. However, the normal mlflow.python.loadmodel or mlflow.pytorch.loadmodel doesn't seems to work because the way Darts is wrapping torch.nn modules. All test I have done so far didn't work to save and load the model. Only to record hyperparameters, other torchmetrics related information, dataset hash, etc. Thus, I propose to share this, and later we can add saving and loading models (but my last weekend research it is pointing out that the model.py files and common models may need to be rewritten to be compatible with MLFlow (I might be wrong).

Cheers!

Copy link
Contributor

@turbotimon turbotimon Feb 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cargecla1 Yes, that's a good idea to split these two things

If using the mlflow model registry for darts completely fails, you could also mentioning the workaround I proposed in the issue. Which was manually saving/loading the model as an artifact. Something like:

dartsmodel.save("mymodel.pickle")
mlflow.log_artifact("mymodel.pickle")
# later, load artifact from mlflow and do e.g. a dartsmodel=RNNModel.load("mymodel.pickle")

Let me know if i can help anything!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @turbotimon

After trying to save the Darts model using dartsmodel.save("mymodel.pickle") as suggested, I am getting the following error:

File ~.conda\envs\venv_darts\lib\site-packages\torch\serialization.py:653, in _save(obj, zip_file, pickle_module, pickle_protocol) 651 pickler = pickle_module.Pickler(data_buf, protocol=pickle_protocol) 652 pickler.persistent_id = persistent_id --> 653 pickler.dump(obj) 654 data_value = data_buf.getvalue() 655 zip_file.write_record('data.pkl', data_value, len(data_value))

AttributeError: Can't pickle local object 'LayerSummary._register_hook..hook'

Would you know why this is happening?

Cheers

Hi @cla-ra3426, sorry i missed your question. No idea.. but must have to do somehting with darts itself and not mlflow i suppose

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cargecla1 Yes, that's a good idea to split these two things
If using the mlflow model registry for darts completely fails, you could also mentioning the workaround I proposed in the issue. Which was manually saving/loading the model as an artifact. Something like:

dartsmodel.save("mymodel.pickle")
mlflow.log_artifact("mymodel.pickle")
# later, load artifact from mlflow and do e.g. a dartsmodel=RNNModel.load("mymodel.pickle")

Let me know if i can help anything!

Hello @turbotimon , @madtoinou , @dennisbader,

Is it possible to split this issue as discussed above? so we can share how to use, train, track, monitor and save the models using MLFlow? Just leaving loading the model for a later release?

Cheers

Hello gents (@turbotimon , @madtoinou , @dennisbader,)

Have you had time to consider my proposal above? "Is it possible to split this issue as discussed above? so we can share how to use, train, track, monitor and save the models using MLFlow? Just leaving loading the model for a later release?"

The only solution I have right now it is to train the model with MLFlow to track and monitor the model and retrain it with pure Darts to be able to save it and load to predict, Darts saving method "fails" when run inside a MLFlow run, and MLFlow log and save methods don't want to work with Darts either.

Thank you in advice for your consideration!

Cheers!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @cargecla1,

Sorry for the delay, loading model is kind-of part of linked issue so it would probably be better to also include it in this PR. But if it too troublesome, we can treat it separately.

@cla-ra3426,

The error you're getting when trying to pickle the model is probably due to the (pytorch-lightning) callbacks. Can you try removing them prior to exporting the model?

I will test this example more thoroughly when I have more time, try to see if I can come up with a solution for the loading.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @madtoinou

I will try this and see how it goes.

Thanks for this suggestion and for considering splitting the problem into two if above doesn't work.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @madtoinou ,

I took your suggestion on board as per above, but that didn't fixed the issue with loading and predicting steps.

Refer to commit f798244

```

## Performance Recommendations
This section recaps the main factors impacting the performance when
training and using torch-based models.
Expand Down
Loading