BUG: Loss and metrics are not logged correctly #726

Closed
NickleDave opened this issue Oct 23, 2023 · 1 comment
Labels
BUG Something isn't working

Comments

@NickleDave
Collaborator

Describe the bug
None of the values that we log during runs through the CLI are recorded correctly in the TensorBoard log files.

To Reproduce
Train any model, e.g., as described in the tutorial: https://vak.readthedocs.io/en/latest/get_started/autoannotate.html
Here is how I am testing: running a learning curve in a development environment, set up as described here: https://vak.readthedocs.io/en/latest/development/contributors.html#setting-up-a-development-environment

vak learncurve tests/data_for_tests/generated/configs/TweetyNet_learncurve_audio_cbin_annot_notmat.toml

Then load the events file with vak.common.tensorboard.events2df, inspect the columns, and plot the values. You will see there is no train_loss, and none of the validation metrics make sense -- if anything, the error values appear to increase slightly at the end of a run.
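
For reference, this is roughly the inspection I am doing. A minimal sketch only: it assumes events2df accepts a path to a single events file and returns a pandas DataFrame with one column per logged scalar, and the results directory and glob pattern are just placeholders for wherever learncurve wrote its logs.

```python
# Sketch only: the results path and glob pattern are placeholders,
# and the exact signature of events2df is assumed here.
import pathlib

import matplotlib.pyplot as plt

from vak.common.tensorboard import events2df

events_path = sorted(
    pathlib.Path("./results").glob("**/events.out.tfevents.*")
)[0]
df = events2df(events_path)

print(df.columns)  # no 'train_loss' column shows up here

# plot whatever validation metrics were logged
df.filter(like="val_").plot(subplots=True)
plt.show()
```
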
[screenshot: plots of the logged scalars loaded from the events file]

You can verify that this is not just a bug in vak by starting up the TensorBoard server, where you will again see that there is no train loss and all the validation metrics look like garbage.
For example, here we see that the validation loss in the logs goes up (even though the progress-bar logging shows that accuracy increases and the model appears to converge):
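
To start the server I run something like the following, pointing it at the results directory (the path here is just a placeholder for wherever the run saved its logs):

tensorboard --logdir ./results
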
[screenshot: TensorBoard scalars view showing validation loss increasing over the run]

However, if I train basically the same model in a script, I don't have this issue. When I again inspect the logs with TensorBoard, I see that the train loss is logged and the validation metrics appear to make sense:
[screenshots: TensorBoard view from the script run, showing train loss logged and sensible validation metrics]

The script I used to check this is here:
https://github.com/vocalpy/vak/blob/mre-train-loss-logged-in-script-not-library/test-lightning-loss-tweetynet.ipynb
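
For context, the kind of Lightning logging calls that should produce these scalars look roughly like the sketch below. This is a toy model, not the actual vak/TweetyNet code.

```python
# Toy example of the standard Lightning logging pattern; not vak code.
import lightning
import torch


class ToyModel(lightning.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(10, 2)
        self.loss_fn = torch.nn.CrossEntropyLoss()

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = self.loss_fn(self.layer(x), y)
        # this call is what should put a 'train_loss' scalar in the events file
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = self.loss_fn(self.layer(x), y)
        # logged per epoch; these are the values that look wrong
        # when training through the vak CLI
        self.log("val_loss", loss, on_epoch=True)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```
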

For now, the workaround is to run everything as a script 😦

Expected behavior
Logged values are correct

Screenshots
See above

Desktop (please complete the following information):

  • Operating System: Linux
  • Version: 1.0.0a4, dev


@NickleDave NickleDave added the BUG Something isn't working label Oct 23, 2023
NickleDave added a commit that referenced this issue May 6, 2024
* Change type annotations on models.definition.validate to just be Type, not Type[ModelDefinition]

* Rewrite models.base.Model to not subclass LightningModule, and instead to have class variables definition and family, that are *both* used by the from_config method to make a new instance of the family with instances of the definition's attributes (trust me, this makes perfect sense)

* Change model decorator to not subclass family, and to instead subclass Model, then make a new instance of the subclass, add that instance to the registry, and then return the instance. This makes it possible to call the 'from_config' method of the instance and get a new class instance. Think of the instance as a Singleton

* Rewrite FrameClassificationModel to subclass LightningModule directly, remove from_config method

* Rewrite ParametricUMAPModel to subclass LightningModule directly, remove from_config method

* Rewrite base.Model class, move/rename methods so that we will get a singleton that can return new lightning.Module instances with the appropriate instances of network + loss + optimizer + metrics from its from_config method

* Add load_state_dict_from_path method to FrameClassificationModel and ParametricUmapModel

* Change model_family decorator to check if family_class is a subtype of LightningModule, not vak.base.Model

* Rewrite models.base.Model.__init__ to take definition and family attributes, that are used by from_config method

* Fix how models.decorator.model makes Model instance -- we don't subclass anymore, but we do change Model instance's __name__,__doc__, and __module__ to match those of the definition

* Fix FrameClassificationModel to subclass lightning.LightningModule (not pytorch_lightning.LightningModule :eyeroll:) and to no longer pass network, loss, etc. to super().__init__ since it's no longer a sub-class of models.base.Model

* Fix ParametricUMAPModel to subclass lightning.LightningModule (not lightning.pytorch.LightningModule) and to no longer pass network, loss, etc. to super().__init__ since it's no longer a sub-class of models.base.Model

* Fix how we add a Model instance to MODEL_REGISTRY

* Fix src/vak/models/frame_classification_model.py to set network/loss/optimizer/metrics as attributes on self

* Fix src/vak/models/parametric_umap_model.py to set network/loss/optimizer/metrics as attributes on self

* Fix how we get MODEL_FAMILY_FROM_NAME dict in models.registry.__getattr__

* Fix classes in tests/test_models/conftest.py so we can use them to run tests

* Fix tests in tests/test_models/test_base.py

* Add method from_instances to vak.models.base.Model

* Rename vak.models.base.Model -> vak.models.factory.ModelFactory

* Add tests in tests/test_models/test_factory.py from test_frame_classification_model

* Fix unit test in tests/test_models/test_convencoder_umap.py

* Fix unit tests in tests/test_models/test_decorator.py

* Fix unit tests in tests/test_models/test_tweetynet.py

* Fix adding load_from_state_dict method to ParametricUMAPModel

* Fix unit tests in tests/test_models/test_frame_classification_model.py

* Rename method in tests/test_models/test_convencoder_umap.py

* Fix unit tests in tests/test_models/test_ed_tcn.py

* Add a unit test from another test_models module to test_factory.py

* Add a unit test from another test_models module to test_factory.py

* Fix unit tests in tests/test_models/test_registry.py

* Remove unused fixture 'monkeypath' in tests/test_models/test_frame_classification_model.py

* Fix unit tests in tests/test_models/test_parametric_umap_model.py

* BUG: Fix how we check if we need to add empty dicts to model config in src/vak/config/model.py

* Rename model_class -> model_factory in src/vak/models/get.py

* Clean up docstring in src/vak/models/factory.py

* Fix how we parametrize two unit tests in tests/test_config/test_model.py

* Fix ConvEncoderUMAP configs in tests/data_for_tests/configs to have network.encoder sub-table

* Rewrite docstring, fix type annotations, rename vars for clarity in src/vak/models/decorator.py

* Revise docstring in src/vak/models/definition.py

* Revise type hinting + docstring in src/vak/models/get.py

* Revise docstring + comment in src/vak/models/registry.py

* Fix unit test in tests/test_models/test_factory.py

* Fix ParametricUMAPModel to use a ModuleDict

* Fix unit test in tests/test_models/test_convencoder_umap.py

* Fix unit test in tests/test_models/test_factory.py

* Fix unit test in tests/test_models/test_parametric_umap_model.py

* Fix common.tensorboard.events2df to avoid a pandas error about re-indexing with duplicate values -- we need to skip the 'epoch' Scalar since it's all zeros
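
The last bullet is the change that directly relates to this issue. A minimal pandas-only sketch of the problem it describes (the actual internals of events2df may differ):

```python
# Pandas-only illustration of the duplicate-value problem; the real
# events2df reads rows like these from a TensorBoard events file.
import pandas as pd

# long-format scalars: the 'epoch' tag is written with step 0 every time,
# so its (tag, step) values repeat
scalars = pd.DataFrame(
    {
        "tag": ["epoch", "epoch", "val_acc", "val_acc"],
        "step": [0, 0, 50, 100],
        "value": [0.0, 0.0, 0.80, 0.85],
    }
)

# reshaping on an index with duplicate entries raises a ValueError:
# scalars.pivot(index="step", columns="tag", values="value")

# dropping the all-zeros 'epoch' scalar first avoids the error
df = (
    scalars[scalars["tag"] != "epoch"]
    .pivot(index="step", columns="tag", values="value")
)
print(df)
```
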
@NickleDave
Collaborator Author

Fixed by #753
