Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue with parallelization for 5.2.2 #1117

Closed
torresramiro350 opened this issue Feb 5, 2024 · 6 comments · Fixed by #1118
Closed

Issue with parallelization for 5.2.2 #1117

torresramiro350 opened this issue Feb 5, 2024 · 6 comments · Fixed by #1118
Labels
bug (unverified) The problem described would be a bug, but needs to be triaged

Comments

@torresramiro350
Copy link

In uproot 5.1.2, I was able to parallelize the reading of ROOT files with the following example below. However, with the new release uproot 5.2.2, I get this error:

AttributeError("'FSSpecSource' object has no attribute '_fh'")
import numpy as np
import uproot
import yaml
from numpy.typing import NDArray

from dask.base import compute
from dask.delayed import delayed
from dask.distributed import Client

cli = Client(n_workers=5)

@delayed
def get_array( bin_id: str, inmap: uproot.ReadOnlyDirectory) -> tuple[str, NDArray[np.float64]]:
    outdata_bin = inmap[f"nHit{bin_id.zfill(2)}/data/count"].array().to_numpy()
    return bin_id, outdata_bin

@delayed
def get_bkg( bin_id: str, inmap: uproot.ReadOnlyDirectory) -> tuple[str, NDArray[np.float64]]:
    outdata_bin = inmap[f"nHit{bin_id.zfill(2)}/bkg/count"].array().to_numpy()
    return bin_id, outdata_bin

maptree="./maptree.root"

with uproot.open(maptree) as inmap:
    analysis_bin_names = inmap["BinInfo/name"].array().to_numpy().astype(str)

    tasks = [get_array(bin_id, inmap) for bin_id in analysis_bin_names]
    tasks_bkg = [get_bkg(bin_id, inmap) for bin_id in analysis_bin_names]

    counts_info = compute(*tasks)
    bkg_info = compute(*tasks_bkg)
    data = dict(counts_info)
    bkg = dict(bkg_info)
cli.close()

An example of the file can be retrieved here: https://data.hawc-observatory.org/datasets/geminga2017/geminga2017-download/maptree.root

Additionally, now with the new release I try to parallelize reading larger of +1GB ROOT files and what it used to take < 1 minute now runs for +3 minutes.

@torresramiro350 torresramiro350 added the bug (unverified) The problem described would be a bug, but needs to be triaged label Feb 5, 2024
@jpivarski
Copy link
Member

One way we can get AttributeErrors when introducing a new default like FSSpecSource is by expecting it on all Sources but the new Source doesn't have it. That's not what happened here: _fh is only on FSSpecSource, so it's an implementation detail of that particular class. For the attribute to be missing, there must be multiple ways to create the class, and the case that broke for you, @torresramiro350, is one that took an alternative that failed to make the attribute.

Reading the code, it looks like _fh is a transient file handle, and it's allowed to be None. (That's its initial state in __init__.) In fact, it looks like it's missing from __setstate__, which reconstitutes an object during unpickling:

def __getstate__(self):
self._fh = None
state = dict(self.__dict__)
state.pop("_executor")
state.pop("_file")
state.pop("_fh")
return state
def __setstate__(self, state):
self.__dict__ = state
self._open()

Since _fh and _file are both introduced with an initial value of None by __init__ (just before _open), it seems like they ought to be reintroduced by __setstate__, so I did that in #1118.

I have not tested it, so let me know, @torresramiro350, if that's right.

@lobis, is this correct? I'll add you as a reviewer to the PR so that you can weigh in there, but do you see anything else that might be a problem here?

@torresramiro350
Copy link
Author

Introducing self._fh = None in __setstate__ does seem to fix the issue.

@jpivarski
Copy link
Member

When @lobis approves it, I'll merge it. Thanks for testing!

@torresramiro350
Copy link
Author

torresramiro350 commented Feb 5, 2024

I also tried running the example above with uproot 5.2.2 and it seems to take considerably longer to read a ROOT file of ~ 1GB than in version 5.1.2. I'll try to generate a file with fake numbers that can somehow help to diagnose it further.

I realize that will have to be on another issue. Thanks for the help!

@jpivarski
Copy link
Member

That should be a different issue. I've been hearing a few things about local file access being slower; maybe we should revert to MemoryMappedSource for local files and use fsspec only for remote files...

@jpivarski jpivarski linked a pull request Feb 6, 2024 that will close this issue
@jpivarski
Copy link
Member

I linked this issue to the PR after merging the PR. Closing the issue manually...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug (unverified) The problem described would be a bug, but needs to be triaged
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants