check for `_unit_system_name` in `save_as_dataset` #4316

chrishavlin · 2023-01-28T00:23:15Z

There might be a better way to fix this, I'm not super familiar with the unit system attributes and how they're handled on re-load. But I wanted to get this in before the weekend.

yt/frontends/ytdata/utilities.py

brittonsmith

Is this happening because of something special about the base dataset? Why does it not have a unit system name? It might be worth understanding this before this goes in.

brittonsmith · 2023-01-30T14:53:27Z

yt/frontends/ytdata/utilities.py

+        if hasattr(ds, "_unit_system_name"):
+            _yt_array_hdf5_attr(fh, "unit_system_name", ds._unit_system_name)
+        else:
+            _yt_array_hdf5_attr(
+                fh, "unit_system_name", ds.unit_system.name.split("_")[0]
+            )


Maybe something like this? Note, I haven't tested if it solves the problem, just a code suggestion.

Suggested change

-        if hasattr(ds, "_unit_system_name"):
-            _yt_array_hdf5_attr(fh, "unit_system_name", ds._unit_system_name)
-        else:
-            _yt_array_hdf5_attr(
-                fh, "unit_system_name", ds.unit_system.name.split("_")[0]
-            )
+        name = getattr(ds, "_unit_system_name", ds.unit_system.name.split("_")[0])
+        _yt_array_hdf5_attr(fh, "unit_system_name", name)

After the digging this morning, I think it might also be totally sufficient to only write the _unit_system_name attribute

_yt_array_hdf5_attr(fh, "unit_system_name", ds._unit_system_name)

I initially wasn't 100% sure that having unit_system ensures always having _unit_system_name and so I put it in the if block, but I suspect it might be the case. I'll check on that more thoroughly...

I ended up changing how unit_system.name is initially set

Actually ended up having to switch back to this approach. But this section is now only writing ds._unit_system_name
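The `getattr` fallback suggested above can be tried out on stand-in objects. This is only a sketch: `unit_system_name_to_save` and the two `SimpleNamespace` objects are hypothetical and not part of yt, and the `"cgs"` value is just a placeholder.

```python
from types import SimpleNamespace


def unit_system_name_to_save(ds):
    """Pick the unit system name to write out for a dataset-like object.

    Hypothetical helper for illustration only; it is not part of yt.
    """
    # getattr evaluates its default eagerly, so ds.unit_system must exist
    # even when ds._unit_system_name is present.
    fallback = ds.unit_system.name.split("_")[0]
    return getattr(ds, "_unit_system_name", fallback)


# Stand-in objects, not real yt datasets.
with_attr = SimpleNamespace(
    _unit_system_name="cgs",
    unit_system=SimpleNamespace(name="2493b7e1126463ce3bae781a48ef579b"),
)
without_attr = SimpleNamespace(
    unit_system=SimpleNamespace(name="code_2493b7e1126463ce3bae781a48ef579b"),
)

print(unit_system_name_to_save(with_attr))     # cgs
print(unit_system_name_to_save(without_attr))  # code
```

One thing the sketch makes visible: the one-line `getattr` form still evaluates `ds.unit_system.name` on every call, whereas the `hasattr` form only touches it in the `else` branch.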

chrishavlin · 2023-01-30T17:47:01Z

Is this happening because of something special about the base dataset? Why does it not have a unit system name? It might be worth understanding this before this goes in.

I agree! Should have marked it as a WIP on Friday when I pushed it up :)

But I think I figured it out after a bit more digging...

The following

import yt
ds = yt.load("AM06/AM06.out1.00300.athdf")
print(ds.unit_system.name)

prints out '2493b7e1126463ce3bae781a48ef579b'

That string comes from how the code UnitSystem is created: Dataset._assign_unit_system calls yt.units.unit_systems.create_code_unit_system, which looks like:

def create_code_unit_system(unit_registry, current_mks_unit=None):
    code_unit_system = UnitSystem(
        name=unit_registry.unit_system_id,
        length_unit="code_length",
        mass_unit="code_mass",
        time_unit="code_time",
        temperature_unit="code_temperature",
        current_mks_unit=current_mks_unit,
        registry=unit_registry,
    )
    <----------- TRIMMED --------------->

That unit_registry.unit_system_id attribute being used to set the name is a hashed property set by unyt.unit_registry.UnitRegistry (the UnitRegistry docstring reads: "This is a unique identifier for the unit registry created from a FNV hash.").

The history around that name= line is fairly confusing -- seems to me it was changed temporarily in #2728 (in commit df5bc0336f6f07816290ed3754d236dddeb7ab01) to

    name="code_{}".format(unit_registry.unit_system_id),

which actually would fix the bug here since when you call save_as_dataset, it would split it and save as code:

        _yt_array_hdf5_attr(
            fh, "unit_system_name", ds.unit_system.name.split("_")[0]
        )

but then the name was changed back to just the UnitSystem hash while that PR was iterated on.
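To make the effect of the prefix concrete, here is a small sketch in plain Python (no yt needed) of what `split("_")[0]` returns for the bare hash versus the `code_`-prefixed name. The hash is the one printed earlier in this comment; the variable names are only for illustration.

```python
unit_system_id = "2493b7e1126463ce3bae781a48ef579b"

bare_name = unit_system_id
prefixed_name = "code_{}".format(unit_system_id)

# With no underscore to split on, the first element is the whole hash,
# so the per-session hash is what would get written to the file.
print(bare_name.split("_")[0])

# With the prefix, only "code" survives the split.
print(prefixed_name.split("_")[0])
```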

You need to restart the session to get the error because yt.unit_system_registry is only initialized when yt is first imported. So when you load a dataset, a unique hash for that dataset's code UnitSystem gets added to the registry:

import yt
print(yt.unit_system_registry.keys())
ds = yt.load("AM06/AM06.out1.00300.athdf")
print(yt.unit_system_registry.keys())

prints

dict_keys(['cgs', 'mks', 'imperial', 'galactic', 'solar', 'geometrized', 'planck'])
dict_keys(['cgs', 'mks', 'imperial', 'galactic', 'solar', 'geometrized', 'planck', '2493b7e1126463ce3bae781a48ef579b'])

but when you restart your kernel, yt.unit_system_registry is reset, leading to a key error when trying to re-load the file that was saved with prj.data_source.save_as_dataset() (since that unique UnitSystem hash no longer exists).
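The session dependence can be mimicked with a plain dict standing in for the registry. This is a simplified model, not yt's actual implementation: `fresh_registry` and the placeholder `object()` values are assumptions made for illustration.

```python
BUILTIN_SYSTEMS = ["cgs", "mks", "imperial", "galactic", "solar", "geometrized", "planck"]


def fresh_registry():
    """Mimic what a new session starts with: only the built-in systems."""
    return {name: object() for name in BUILTIN_SYSTEMS}


# Session 1: loading a dataset registers its per-dataset hash.
registry = fresh_registry()
dataset_hash = "2493b7e1126463ce3bae781a48ef579b"
registry[dataset_hash] = object()
saved_name = dataset_hash  # the name that ends up written to the file
assert saved_name in registry  # a re-load in the same session finds it

# Session 2: restarting the kernel rebuilds the registry from scratch.
registry = fresh_registry()
try:
    registry[saved_name]
except KeyError as err:
    print("KeyError on re-load:", err)
```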

So in the end, I think my fix here might actually be a good approach. I'm also still thinking about how to write a test for this -- it's tricky because it's session-scope dependent. There might be a nice way to do it with pytest...
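One possible way to get around the session scope in a test, sketched here without yt, is to run each step in its own interpreter via `subprocess`. `run_in_fresh_interpreter` is a hypothetical helper, and the snippets only demonstrate that module-level state does not carry over between processes.

```python
import subprocess
import sys


def run_in_fresh_interpreter(code):
    """Run a snippet in a new Python process so no module-level state carries over.

    Hypothetical helper for illustration; not part of yt or pytest.
    """
    return subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
    )


def test_state_does_not_leak_between_processes():
    # Each call gets its own interpreter, so state set in the first call
    # is gone in the second, much like restarting a kernel.
    first = run_in_fresh_interpreter(
        "import sys; sys.leaked = 1; print(hasattr(sys, 'leaked'))"
    )
    second = run_in_fresh_interpreter(
        "import sys; print(hasattr(sys, 'leaked'))"
    )
    assert first.stdout.strip() == "True"
    assert second.stdout.strip() == "False"
```

For the actual bug, the first snippet would presumably load the dataset and call save_as_dataset, and the second would call yt.load on the saved file and check the return code. That part is left out here since it needs the sample data.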

chrishavlin · 2023-01-30T17:52:58Z

changed to WIP for now while I test things out more carefully -- feel free to continue commenting though! I just didn't want it to get merged inadvertently...

neutrinoceros · 2023-01-30T21:25:27Z

The history around that name= line is fairly confusing -- seems to me it was changed temporarily in #2728 (in commit df5bc0336f6f07816290ed3754d236dddeb7ab01) to
   name="code_{}".format(unit_registry.unit_system_id),
which actually would fix the bug here since when you call save_as_dataset, it would split it and save as code:
       _yt_array_hdf5_attr(
           fh, "unit_system_name", ds.unit_system.name.split("_")[0]
       )
but then the name was changed back to just the UnitSystem hash while that PR was iterated on.

I think we completely missed this in #2728, great job finding your way back there!

If just re-adding the code_ prefix fixes the bug I think this is the way to go here. I think adding a couple of comments on the .split call as well as on the line where the prefix is added, linking to #4315, would be a sufficient guard to prevent this bug from reappearing, if writing a test for it is as complex as it sounds.

chrishavlin · 2023-01-31T00:00:43Z

If just re-adding the code_ prefix fixes the bug I think this is the way to go here.

I agree! It is simpler, and it does fix the bug here. I did add a minimal unit test -- it's not a full test of the bug, but it at least checks that loading back in a saved dataset has the proper unit_system.name (which fails on main).

chrishavlin · 2023-01-31T16:07:54Z

Ok, so it turns out switching the name to include the code prefix led to some tricky test failures from key errors internal to unyt (in unit_registry._sanitize_unit_system). So I went back to changing which attribute is written from within save_as_dataset.

Apologies for the back and forth on this. But I think I should be done now assuming all the tests pass...

neutrinoceros · 2023-01-31T16:26:36Z

If tests pass this time, I suggest rebasing the branch to a single commit so the history will hopefully be easier to follow in the future

neutrinoceros

LGTM, thanks @chrishavlin!
I'll give @brittonsmith some time to respond before I merge.

chrishavlin · 2023-02-10T21:13:10Z

pre-commit.ci autofix

for more information, see https://pre-commit.ci

chrishavlin added the bug label Jan 28, 2023

brittonsmith reviewed Jan 30, 2023

chrishavlin changed the title ~~check for _unit_system_name in save_as_dataset~~ [WIP] check for _unit_system_name in save_as_dataset Jan 30, 2023

chrishavlin changed the title ~~[WIP] check for _unit_system_name in save_as_dataset~~ check for _unit_system_name in save_as_dataset Jan 30, 2023

write out _unit_system_name

e66c31c

chrishavlin force-pushed the fix_unit_system_name_reload branch from 03144f3 to e66c31c Compare January 31, 2023 18:28

neutrinoceros previously approved these changes Feb 2, 2023

Merge branch 'main' into fix_unit_system_name_reload

5880be5

chrishavlin dismissed neutrinoceros’s stale review via 5880be5 February 10, 2023 21:05

[pre-commit.ci] auto fixes from pre-commit.com hooks

4902ada

for more information, see https://pre-commit.ci

neutrinoceros previously approved these changes Feb 10, 2023

Merge branch 'main' into fix_unit_system_name_reload

baf6074

neutrinoceros dismissed their stale review via baf6074 February 20, 2023 11:45

neutrinoceros enabled auto-merge February 20, 2023 11:45

neutrinoceros approved these changes Feb 20, 2023

neutrinoceros merged commit 3c564f2 into yt-project:main Feb 20, 2023

chrishavlin deleted the fix_unit_system_name_reload branch June 14, 2023 18:51
