Opening partly written HDF5 files with happi after simulation crash #655

Closed
DoubleAgentDave opened this issue Sep 25, 2023 · 16 comments
Labels
feature-request: something that could be added to the code
needs-user-input: the issue cannot be resolved without additional information

Comments

@DoubleAgentDave
Contributor

DoubleAgentDave commented Sep 25, 2023

Unfortunately, some of the clusters I use occasionally crash, which can cause errors when writing to HDF5 files. With happi I sometimes can't open HDF5 files that contain incompletely written data, i.e. a probe data file in which some dump times are only half written and some of the data is missing.

For example, I recently ran a simulation with a probe diagnostic that writes an HDF5 dataset every 4 timesteps. It is quite likely that if the simulation hits the wall time before a write completes, or crashes during a write, there will be half-written data in the file. But the previously written data could still be useful, and the simulation may not need to be rerun.

When you use happi to open these files the following error occurs:

signal = Ez_probe.getData()
  File "~/Smilei/happi/_Diagnostics/Diagnostic.py", line 163, in getData
    data.append( self._dataAtTime(t) )
  File "~/Smilei/happi/_Diagnostics/Diagnostic.py", line 862, in _dataLinAtTime
    A = self._getDataAtTime(t)
  File "~/Smilei/happi/_Diagnostics/Probe.py", line 372, in _getDataAtTime
    data = self._dataForTime[t][n,first:last]
TypeError: 'NoneType' object is not subscriptable

When I look at the file in something like HDFCompass I can see that there are two empty data lines, but the rest of the data is intact:

[Image: hdfcompass]

I think it would be relatively simple to use a try-except statement somewhere to obtain this data in this scenario. It would likely be relatively easy to simulate this problem using a correctly written HDF5 file and adding a couple of unexpected data sets at the end of the file.
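
A minimal sketch of that idea, assuming h5py and NumPy are available. It does not touch happi itself: `scan_datasets` and `make_test_file` are hypothetical helper names, the zero-padded dataset names are made up for illustration, and the "unexpected" entry is only a stand-in for crash damage (a group where a dataset is expected), so real half-written files may fail in other ways.

```python
import h5py
import numpy as np


def scan_datasets(path):
    """Split the top-level keys of an HDF5 file into readable and faulty ones."""
    readable, faulty = [], []
    with h5py.File(path, "r") as f:
        for key in f:
            try:
                obj = f[key]
                if not isinstance(obj, h5py.Dataset):
                    raise TypeError("not a dataset")
                obj[()]  # force a full read; raises if the data cannot be read
                readable.append(key)
            except (KeyError, OSError, TypeError, ValueError, RuntimeError):
                faulty.append(key)
    return readable, faulty


def make_test_file(path):
    """Write three valid datasets plus one unexpected entry (illustrative only)."""
    with h5py.File(path, "w") as f:
        for step in (0, 4, 8):
            f.create_dataset(f"{step:010d}", data=np.arange(6.0).reshape(2, 3))
        # Stand-in for a damaged entry: a group where a dataset is expected.
        f.create_group(f"{12:010d}")
```

Calling `make_test_file("demo.h5")` followed by `scan_datasets("demo.h5")` returns the three dataset names as readable and the extra entry as faulty, which is the kind of filtering a try-except inside the reader could perform.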

@DoubleAgentDave DoubleAgentDave added the feature-request something that could be added to the code label Sep 25, 2023
@DoubleAgentDave
Contributor Author

Admittedly, the 'flush_every' option reduces the chance of the HDF5 file being corrupted during a write while the simulation is running, so part of what I said is not quite right. The problem remains, though, that some data at the end of the HDF5 file is sometimes miswritten during a crash.

@mccoys
Contributor

mccoys commented Sep 25, 2023

Thank you for suggesting this. I actually had the same comment a few days ago from a colleague.

@DoubleAgentDave
Contributor Author

DoubleAgentDave commented Sep 26, 2023

Sorry, the code in my previous comment was bad. This recreates the probes HDF5 file (at least I think it does) and seems to work every time as far as I have tested:

```python
import h5py

# Copy every readable dataset (with its attributes) into a new file,
# skipping keys that cannot be opened.
with h5py.File("Probes0.h5", "r") as f_src, h5py.File("Probes0_fixed.h5", "w") as f_dest:
    for key in f_src:
        try:
            f_dest.create_dataset_like(key, f_src[key])
            f_dest[key][()] = f_src[key][()]
            for attrib in f_src[key].attrs.keys():
                f_dest[key].attrs.create(attrib, f_src[key].attrs[attrib])
        except KeyError:
            print("faulty key = " + str(key))

    # Copy the file-level attributes as well.
    for attrib in f_src.attrs.keys():
        f_dest.attrs.create(attrib, f_src.attrs[attrib])
```

@DoubleAgentDave
Contributor Author

Just to note: the above script doesn't always work. The individual attributes must also be tested before the key is written to the new H5 file, as sometimes a key can be created correctly but not filled with attributes correctly.
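
One way to act on that note, as a hedged sketch rather than what happi does: read each dataset's data and all of its attributes into memory first, and only create the entry in the new file once every read has succeeded. `copy_valid_datasets` is a hypothetical helper name, and the set of exceptions caught is an assumption about how damaged entries fail.

```python
import h5py
import numpy as np


def copy_valid_datasets(src_path, dest_path):
    """Copy top-level datasets whose data and attributes can all be read.

    Returns the list of keys that were skipped.
    """
    skipped = []
    with h5py.File(src_path, "r") as src, h5py.File(dest_path, "w") as dest:
        for key in src:
            try:
                # Read everything before writing, so nothing is created
                # in the destination if any read fails.
                dset = src[key]
                if not isinstance(dset, h5py.Dataset):
                    raise TypeError("not a dataset")
                data = dset[()]
                attrs = {name: dset.attrs[name] for name in dset.attrs}
            except (KeyError, OSError, TypeError, ValueError, RuntimeError):
                skipped.append(key)
                continue
            new = dest.create_dataset_like(key, dset)
            new[()] = data
            for name, value in attrs.items():
                new.attrs[name] = value
        # Copy the file-level attributes as well.
        for name in src.attrs:
            dest.attrs[name] = src.attrs[name]
    return skipped
```

For example, `copy_valid_datasets("Probes0.h5", "Probes0_fixed.h5")` would write the repaired file and return the keys it had to drop.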

@mccoys
Contributor

mccoys commented Nov 28, 2023

Do you have an idea of how to reproduce this? I cannot get a corrupted file.

@DoubleAgentDave
Contributor Author

DoubleAgentDave commented Nov 28, 2023 via email

@mccoys
Contributor

mccoys commented Nov 28, 2023

It does not happen on my system for some reason. Would you be able to produce a small example and send it with dropbox or equivalent?

@DoubleAgentDave
Contributor Author

DoubleAgentDave commented Nov 28, 2023 via email

@DoubleAgentDave
Contributor Author

> It does not happen on my system for some reason. Would you be able to produce a small example and send it with dropbox or equivalent?

I've sent you a link in element chat

@mccoys
Contributor

mccoys commented Nov 30, 2023

I have not received it. My name on element is fredpz

@DoubleAgentDave
Contributor Author

DoubleAgentDave commented Nov 30, 2023 via email

@DoubleAgentDave
Contributor Author

DoubleAgentDave commented Nov 30, 2023 via email

@DoubleAgentDave
Contributor Author

ok, sent again, hopefully right person this time :)

@mccoys
Contributor

mccoys commented Dec 6, 2023

I made a change for happi in the develop branch. Could you test it?

@mccoys mccoys added the needs-user-input the issue cannot be resolved without additional information label Dec 11, 2023
@DoubleAgentDave
Contributor Author

Yes, that seems to allow me to access files which I couldn't before, thanks! That's eliminated a step which was quite annoying and will save me some time too, really appreciated!

@DoubleAgentDave
Contributor Author

Just to be clear, with the old version I tried to access a broken probes0.h5 file and got this error:

Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/media/david_blackman/left_external/broadband/simulation_results/bandwidth/redone_diags/fixed_ions/narrow/../../../../py1D/make_signal_files.py", line 193, in start
    signal = Ez_probe.getData()
  File "/home/david_blackman/codes/smilei_new/Smilei/happi/_Diagnostics/Diagnostic.py", line 163, in getData
    data.append( self._dataAtTime(t) )
  File "/home/david_blackman/codes/smilei_new/Smilei/happi/_Diagnostics/Diagnostic.py", line 862, in _dataLinAtTime
    A = self._getDataAtTime(t)
  File "/home/david_blackman/codes/smilei_new/Smilei/happi/_Diagnostics/Probe.py", line 372, in _getDataAtTime
    data = self._dataForTime[t][n,first:last]
TypeError: 'NoneType' object is not subscriptable

Now I get no error and successfully build up my probe signals, so I can process them properly!
