
Convert dtype=object arrays if possible #518

Merged — niklassiemer merged 9 commits into master from backwards_hdf_dtype_obj on Nov 18, 2021

Conversation

niklassiemer (Member, Author)

No description provided.

pyiron_base/generic/hdfio.py (outdated)
#TODO: remove this function upon 1.0.0 release
@staticmethod
def _convert_dtype_obj_array(obj: np.ndarray):
result = np.array(obj.tolist())
niklassiemer (Member, Author)

If someone has a better function to do this, I would be happy. Especially since the docstring states:

Notes
-----
The array may be recreated via ``a = np.array(a.tolist())``, although this
may sometimes lose precision.
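A quick sketch of that round trip (the array contents here are made up for illustration; only the `np.array(a.tolist())` recipe comes from the docstring):

```python
import numpy as np

# A dtype=object array as it may come back from HDF5: a 1D array whose
# elements are equally shaped sub-arrays.
obj = np.empty(2, dtype=object)
obj[0] = np.array([1.0, 2.0, 3.0])
obj[1] = np.array([4.0, 5.0, 6.0])

# The docstring's recipe: rebuild a dense array from the nested lists.
a = np.array(obj.tolist())

print(a.shape, a.dtype)  # a dense (2, 3) numeric array instead of dtype='O'
```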

Member

I wouldn't be too worried. I think they mean that you might get a conversion from int64 down to int32. AFAIK we don't have any users who care about using long int/floats (or short ones for more memory efficiency), so any sloppiness here should be perfectly safe.

@liamhuber (Member) left a comment

lgtm

f"Returned array was converted from dtype='O' to dtype={result.dtype} "
f"via `np.array(result.tolist())`.\n"
f"Please run rewrite_hdf5() to update this data! "
f"To update all your data run update tool.")
Member

On the PR that introduces the tool we need to remember to come back and reference it here.

Member

^^ @pmrv

pmrv (Contributor) commented Nov 16, 2021

Am currently working on identifying affected projects and testing the update bot. I can verify that normal lammps jobs submitted with 0.3.8 are affected, e.g. when you do

>>> !h5ls -r {j.project_hdf5.file_name}/{j.name}/output/generic
/cells                   Group
/cells/data              Dataset {3003, 3}
/cells/index             Dataset {1001}
/energy_pot              Dataset {1001}
/energy_tot              Dataset {1001}
/forces                  Group
/forces/data             Dataset {54054, 3}
/forces/index            Dataset {1001}
/indices                 Group
/indices/data            Dataset {54054}
/indices/index           Dataset {1001}
/positions               Group
/positions/data          Dataset {54054, 3}
/positions/index         Dataset {1001}
/pressures               Group
/pressures/data          Dataset {3003, 3}
/pressures/index         Dataset {1001}
/steps                   Dataset {1001}
/temperature             Dataset {1001}
/unwrapped_positions     Group
/unwrapped_positions/data Dataset {54054, 3}
/unwrapped_positions/index Dataset {1001}
/velocities              Group
/velocities/data         Dataset {54054, 3}
/velocities/index        Dataset {1001}
/volume                  Dataset {1001}

instead of this with this branch

>>> !h5ls -r {j.project_hdf5.file_name}/{j.name}/output/generic
/cells                   Dataset {1001, 3, 3}
/energy_pot              Dataset {1001}
/energy_tot              Dataset {1001}
/forces                  Dataset {1001, 54, 3}
/indices                 Dataset {1001, 54}
/positions               Dataset {1001, 54, 3}
/pressures               Dataset {1001, 3, 3}
/steps                   Dataset {1001}
/temperature             Dataset {1001}
/unwrapped_positions     Dataset {1001, 54, 3}
/velocities              Dataset {1001, 54, 3}
/volume                  Dataset {1001}
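The two layouts line up with simple shape arithmetic (a sketch; the flattening into `/data` plus `/index` datasets is inferred from the listings above, not taken from h5io's actual code):

```python
import numpy as np

n_steps, n_atoms = 1001, 54

# dtype=object array of per-step (n_atoms, 3) arrays, as written before this fix
forces = np.empty(n_steps, dtype=object)
for i in range(n_steps):
    forces[i] = np.zeros((n_atoms, 3))

# Old layout: one concatenated dataset plus a per-step index
flat = np.concatenate(forces.tolist())   # corresponds to /forces/data  {54054, 3}
index = np.arange(n_steps)               # corresponds to /forces/index {1001}

# New layout: a single dense dataset
dense = np.array(forces.tolist())        # corresponds to /forces       {1001, 54, 3}

print(flat.shape, dense.shape)
```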

pmrv (Contributor) commented Nov 16, 2021

Then with the update script I just pushed I get the 'correct' thing again

>>> !python /u/zora/software/pyiron_base/update_scripts/pyiron_base_0.3_to_0.4.py {pr.path}

100%|█████████████████████████████████████████████| 1/1 [00:00<00:00,  3.66it/s]
>>> !h5ls -r {j.project_hdf5.file_name}/{j.name}/output/generic

/cells                   Dataset {1001, 3, 3}
/energy_pot              Dataset {1001}
/energy_tot              Dataset {1001}
/forces                  Dataset {1001, 54, 3}
/indices                 Dataset {1001, 54}
/positions               Dataset {1001, 54, 3}
/pressures               Dataset {1001, 3, 3}
/steps                   Dataset {1001}
/temperature             Dataset {1001}
/unwrapped_positions     Dataset {1001, 54, 3}
/velocities              Dataset {1001, 54, 3}
/volume                  Dataset {1001}

pmrv (Contributor) commented Nov 16, 2021

I'd like to test it on a larger project, but I'm running into the other bug at #517 with this branch, so it will have to wait until tomorrow.

niklassiemer (Member, Author)

> Am currently working on identifying affected projects and testing the update bot. I can verify that normal lammps jobs submitted with 0.3.8 are affected […]

Now I am slightly confused. You are talking about different hdf files produced with 0.3.8 and this branch, which should behave like the current master in this respect? If yes, then I am fine :)

pmrv (Contributor) commented Nov 16, 2021

> Now I am slightly confused. You are talking about different hdf files produced with 0.3.8 and this branch, which should behave like the current master in this respect? If yes, then I am fine :)

Yes, everything after #503 should write as in the second example, everything before (including 0.3.8) as in the first example.

niklassiemer (Member, Author)

Although I canceled the windows-latest 3.9 test, the output shows the complete test suite passing. Thus, I take this as a passing test! The Codacy complaint is irrelevant. Therefore, @pmrv, merge once you have been able to test it.

niklassiemer mentioned this pull request Nov 17, 2021
niklassiemer (Member, Author)

We really should hurry a bit and merge this and #519 and release 0.4.0 tomorrow. We need the writing of dtype=object arrays to be fixed and the h5io issue to be solved in our conda release as well!

niklassiemer (Member, Author)

@pmrv Do you want to add something to the robot already or leave it as it is and make it faster in the next release?

niklassiemer merged commit 9253669 into master on Nov 18, 2021
The delete-merged-branch bot deleted the backwards_hdf_dtype_obj branch on November 18, 2021 17:17