Reading histogram input instead of ntuples #219

Closed
alexander-held opened this issue Apr 23, 2021 · 4 comments · Fixed by #289
Labels
config (Affects configuration schema), enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

alexander-held commented Apr 23, 2021

Currently cabinetry is designed with columnar data input in mind (e.g. ntuples). Support for histogram input would also be useful. A conceptual issue with histograms is that constructing them already requires information about regions, samples, and systematic uncertainties. This information then has to be duplicated in the cabinetry config to build a workspace, and duplication of information is not ideal.

For context: ntuple reading

When reading ntuple inputs, the main steps of the workflow that are relevant to this discussion are the following:

cabinetry.template_builder.create_histograms(config)
cabinetry.template_postprocessor.run(config)
ws = cabinetry.workspace.build(config)

The first step identifies all template histograms that have to be built, reads the input columnar data at the locations provided in the config, and stores the resulting histograms in a specific format in the location the user provides in the config. The second step reads these histograms that were just produced, and applies optional post-processing, creating another set of histograms with a fixed naming scheme (in the same user-provided location). The third step uses the histograms in this location to build a workspace (using the post-processed histograms when available, and the original histograms otherwise).
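The preference for post-processed histograms in the third step can be sketched as a small lookup. This is only an illustration: the function name, the `_modified` suffix and the `.npz` extension are assumptions made for the example and are not taken from this issue.

```python
from pathlib import Path


def resolve_histogram_path(folder: Path, name: str, suffix: str = "_modified") -> Path:
    """Pick the post-processed histogram file if present, else the original.

    The suffix and file extension are assumptions for this sketch.
    """
    modified = folder / f"{name}{suffix}.npz"
    original = folder / f"{name}.npz"
    if modified.exists():
        # post-processing has been run for this template
        return modified
    if original.exists():
        # fall back to the histogram written by the template building step
        return original
    raise FileNotFoundError(f"no histogram named {name!r} in {folder}")
```

With a lookup like this, the workspace building step does not need to know whether post-processing was run.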

Possible approaches

1) Users provide histograms in expected format and location

The user puts all histograms in the location cabinetry expects, correctly formatted and named. The name is given by cabinetry.histo.build_name. A cabinetry config is required to run cabinetry.workspace.build as a first step of the workflow. Since the config currently requires ntuple-related arguments, those could be filled with placeholders since they are not needed. A better solution would be to make those arguments optional.
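As a rough sketch of what "correctly named" could mean for approach 1), the helper below derives a file name from the region, sample, systematic and template. It is a hypothetical stand-in written for illustration: the authoritative scheme is whatever cabinetry.histo.build_name returns, and users should call that function rather than rely on this example.

```python
from typing import Optional


def template_file_name(
    region: str,
    sample: str,
    systematic: Optional[str] = None,
    template: Optional[str] = None,
) -> str:
    """Hypothetical naming scheme, not necessarily identical to build_name."""
    # assumption: nominal templates carry the label "Nominal"
    parts = [region, sample, systematic if systematic is not None else "Nominal"]
    if template is not None:
        # e.g. "Up" / "Down" for the two variations of a systematic
        parts.append(template)
    # assumption: spaces are replaced so the name is safe to use as a file name
    return "_".join(parts).replace(" ", "-")
```

For example, `template_file_name("Signal Region", "ttbar")` gives `"Signal-Region_ttbar_Nominal"` under these assumptions.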

2) Users provide a function that points to histograms

Users could provide a function that works similarly to cabinetry.histo.Histogram.from_config, but loads histograms from wherever they are stored. The mechanisms would probably be similar to template building overrides. The template histogram creation step in cabinetry would be bypassed (see issue below), possibly leading to some issues with existing functionality (e.g. cabinetry.visualize.templates would have to be rewritten).

3) Re-use template building overrides

The existing template building overrides technique could be used to provide a custom function to cabinetry that returns histogram information. This is already possible without any changes required. It would work like ntuple inputs, and cabinetry stores the histograms created in this step in the location at which it expects them when building a workspace. This step could be made more convenient by allowing custom config properties, which users could then use in their logic to construct paths to their histograms.
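The override idea can be illustrated with a small, self-contained registry. This is a simplified stand-in and not cabinetry's actual override API: `OverrideRegistry`, `build_template`, the dictionary-based histogram type and the in-memory `USER_HISTOGRAMS` store are all hypothetical names used only for this sketch.

```python
from typing import Callable, Dict, List, Optional, Tuple

Histogram = Dict[str, List[float]]  # stand-in for a real histogram object
Builder = Callable[[str, str, str], Histogram]


class OverrideRegistry:
    """Stores user functions together with the templates they apply to."""

    def __init__(self) -> None:
        self._rules: List[Tuple[Optional[str], Optional[str], Builder]] = []

    def register(self, region: Optional[str] = None, sample: Optional[str] = None):
        def decorator(func: Builder) -> Builder:
            self._rules.append((region, sample, func))
            return func

        return decorator

    def find(self, region: str, sample: str) -> Optional[Builder]:
        # None acts as a wildcard; the first matching rule wins
        for rule_region, rule_sample, func in self._rules:
            if rule_region in (None, region) and rule_sample in (None, sample):
                return func
        return None


def build_template(registry: OverrideRegistry, region: str, sample: str,
                   systematic: str, default: Builder) -> Histogram:
    """Use a matching override if one exists, else the default builder."""
    builder = registry.find(region, sample) or default
    return builder(region, sample, systematic)


registry = OverrideRegistry()

# hypothetical store of histograms the user already produced elsewhere
USER_HISTOGRAMS = {
    ("SR", "signal", "Nominal"): {"yields": [1.0, 2.0], "stdev": [0.1, 0.2]},
}


@registry.register(region="SR")
def load_user_histogram(region: str, sample: str, systematic: str) -> Histogram:
    return USER_HISTOGRAMS[(region, sample, systematic)]


def default_builder(region: str, sample: str, systematic: str) -> Histogram:
    # placeholder for the ntuple-based default, which is out of scope here
    return {"yields": [0.0], "stdev": [0.0]}
```

The user function only has to return histogram information; everything downstream (storing the result, post-processing, workspace building) stays unchanged.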

4) Extend config schema to support histogram inputs natively

The config could be extended with a set of options that work similarly to the input file path specification for ntuples. cabinetry.template_builder.create_histograms would then read histograms and save the results again in the expected location for workspace building. It would also be possible to skip this step and directly proceed with workspace building, reading histograms from the custom locations instead (see resulting issue below).
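One way such options could look is a path template with per-region, per-sample and per-variation placeholders. The sketch below is purely illustrative: the placeholder names `RegionPath`, `SamplePath` and `VariationPath`, the config layout and the `histogram_paths` helper are assumptions for this example and not a schema defined in this issue.

```python
from typing import Dict, List

# hypothetical config fragment, written as a Python dict for illustration
config = {
    "General": {"InputPath": "inputs/{RegionPath}/{SamplePath}/{VariationPath}.npz"},
    "Regions": [{"Name": "SR", "RegionPath": "signal_region"}],
    "Samples": [{"Name": "ttbar", "SamplePath": "ttbar"}],
}


def histogram_paths(config: Dict, variations: List[str]) -> Dict[str, str]:
    """Resolve the input location of every region/sample/variation template."""
    template = config["General"]["InputPath"]
    paths = {}
    for region in config["Regions"]:
        for sample in config["Samples"]:
            for variation in variations:
                key = f'{region["Name"]}/{sample["Name"]}/{variation}'
                paths[key] = template.format(
                    RegionPath=region["RegionPath"],
                    SamplePath=sample["SamplePath"],
                    VariationPath=variation,
                )
    return paths
```

Because the same region and sample definitions drive both the path lookup and the workspace, the amount of duplicated information stays small.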

Comparison of approaches

Approaches 1) and 3) are already usable, but could be made more convenient. Approach 4) may be the easiest to use.

Approaches 2) and possibly 4) (depending on implementation) suffer from the potential issue (below) that parsing through inputs to the workspace becomes more difficult, and that it is less clear how to fit in post-processing. The advantage, on the other hand, is that one fewer step is required in the workflow.

Issues with skipping the template creation step in cabinetry

When reading histograms from a custom location instead of the place cabinetry expects them in for workspace building, cabinetry.visualize.templates would have to be rewritten. Furthermore, smoothing and other post-processing steps may be more difficult or impossible. When post-processing is not required, it would be faster to read the histogram information directly from the original source. However, since the histogram-to-histogram copy step (the equivalent of cabinetry.template_builder.create_histograms) should in general be reasonably fast, it may be worth keeping it anyway.

ntadej commented May 19, 2021

Thanks for pointing me to this issue after the vCHEP talk.

I know that the risk of duplication is high, but I suppose some content will always be duplicated. I suppose signal region optimisations will be done outside of cabinetry. So either one already has histograms ready from this step, or one needs to copy the selection to the configuration here and also set up the input ntuples in a way that cabinetry can process them. I am a bit worried that ntuples -> workspace might take a very long time, which might not be desirable.

Slightly related comment: I am newly working on a top measurement and also still keeping in touch with exotics searches and might be interested in running those through cabinetry.

@alexander-held

I agree that in practice some duplication may not be easily avoided. When performing something like a signal region optimization, typically the quantity of interest is still something derived from using the workspace. Histograms would then still need to be built one way or another.

One way to minimize duplication could be using a cabinetry-like configuration (or something that could easily be converted) to steer the ntuple->histogram step, even if this is run outside of cabinetry.
An alternative pattern could be using cabinetry to call external tools that efficiently perform ntuple->histogram, possibly customized for individual analysis needs.

The ntuple->histogram handling in cabinetry has not yet been optimized for performance. There is some low-hanging fruit that should speed it up quite a bit. One concrete idea is to collect all histogram construction instructions and send them together to a tool (which does not yet exist), which would then group them to process them efficiently through e.g. coffea. By putting this optimization into an external layer, analyzers could also use it for their own histogram creation code. Another thing that does not yet exist but would help a lot with performance is a caching mechanism that avoids reproducing histograms after a change in the fit model that does not affect them.
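A minimal sketch of the two ideas, grouping instructions and caching, could look as follows. The instruction format (plain dictionaries with an `input_file` key), the function names and the in-memory cache are all hypothetical; the tool described above does not exist, so this only illustrates the concept.

```python
import hashlib
import json
from collections import defaultdict
from typing import Dict, List, Set

Instruction = Dict[str, str]  # hypothetical description of one histogram to build


def fingerprint(instruction: Instruction) -> str:
    """Stable hash of everything that determines the histogram content."""
    payload = json.dumps(instruction, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def group_by_input(instructions: List[Instruction]) -> Dict[str, List[Instruction]]:
    """Group instructions so that each input file only has to be read once."""
    groups: Dict[str, List[Instruction]] = defaultdict(list)
    for instruction in instructions:
        groups[instruction["input_file"]].append(instruction)
    return dict(groups)


def pending(instructions: List[Instruction], cache: Set[str]) -> List[Instruction]:
    """Keep only instructions whose result is not cached yet."""
    return [ins for ins in instructions if fingerprint(ins) not in cache]
```

A change to the fit model that leaves an instruction untouched leaves its fingerprint untouched as well, so the corresponding histogram would not need to be rebuilt.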

The main motivation I am aware of for using histogram inputs to cabinetry-like tools is that these tools are typically slower at producing histograms than custom code that can be optimized for the specific use case. Are there other reasons beyond this? I would be interested to find out.

Happy to hear about your interest, please do not hesitate to get in touch if you run into issues!

alexander-held added this to "To do" in v0.4 via automation on Aug 24, 2021
@alexander-held

Moving this up to a target for v0.4 following a conversation with @gordonwatts and @BenGalewsky. This feature should allow for a demonstrator of using cabinetry with data served by ServiceX and saved as histograms via coffea.

@alexander-held

#289 implements approach 4) described above: it extends the config schema and adds an API to cabinetry to support histogram inputs. This feature should become available in version v0.4 relatively soon after these changes are merged. A new issue will track possible follow-up items: #291.

alexander-held moved this from "To do" to "In progress" in v0.4 on Oct 8, 2021
alexander-held added this to "To do" in Core functionality via automation on Oct 8, 2021
alexander-held moved this from "To do" to "In progress" in Core functionality on Oct 8, 2021
Core functionality automation moved this from "In progress" to "Done" on Oct 8, 2021
v0.4 automation moved this from "In progress" to "Done" on Oct 8, 2021