
Possible to integrate CMS's Combine workflow? #344

Open · kratsg opened this issue Oct 25, 2018 · 22 comments
Labels: follow up, research, experimental stuff

kratsg (Contributor) commented Oct 25, 2018

Question

CMS uses a tool called Combine which is built on top of RooStats/RooFit.

This seems quite feasible: CMS workspaces are defined in plaintext files called datacards, so we could provide a datacard2json tool to translate a datacard into something usable by pyhf.
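
A minimal sketch (not an existing tool) of what such a datacard2json translation could produce for the simplest case: a single-bin counting datacard with one signal, one background, and a single lnN nuisance. The function name and yields below are placeholders, and real datacards carry much more structure.

```python
# Hypothetical sketch: build a pyhf workspace for a one-bin counting
# experiment, the kind of model the simplest Combine datacards describe.
import pyhf


def counting_card_to_pyhf(sig_rate, bkg_rate, observed, bkg_lnN=1.1):
    """Illustrative translation of a one-bin counting datacard."""
    spec = {
        "channels": [
            {
                "name": "bin1",
                "samples": [
                    {
                        "name": "signal",
                        "data": [sig_rate],
                        # Combine's signal strength r maps to a pyhf normfactor
                        "modifiers": [{"name": "r", "type": "normfactor", "data": None}],
                    },
                    {
                        "name": "background",
                        "data": [bkg_rate],
                        # a lnN of kappa roughly maps to a normsys with hi=kappa, lo=1/kappa
                        "modifiers": [
                            {
                                "name": "bkg_norm",
                                "type": "normsys",
                                "data": {"hi": bkg_lnN, "lo": 1.0 / bkg_lnN},
                            }
                        ],
                    },
                ],
            }
        ],
        "observations": [{"name": "bin1", "data": [observed]}],
        "measurements": [{"name": "meas", "config": {"poi": "r", "parameters": []}}],
        "version": "1.0.0",
    }
    return pyhf.Workspace(spec)


# purely illustrative numbers
workspace = counting_card_to_pyhf(sig_rate=5.0, bkg_rate=10.0, observed=12)
model = workspace.model()
data = workspace.data(model)
```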

kratsg added the follow up, research, and experimental stuff labels on Oct 25, 2018
matthewfeickert (Member) commented Oct 27, 2018

After having some discussions at the US LUA meeting I think that we might want to talk with Josh Bendavid (@bendavid) about this.

jonas-eschle commented

Since this is still an open issue, let me mention that on 25.6 there will be a meeting, including Josh and others, to talk about implementations of binned/template PDFs in order to move this niche of the community closer together.

matthewfeickert (Member) commented

Tagging @mattbellis and @benkrikler here, given the email conversation that Matt started (thanks Matt!) RE: what would need to be done to extend the HistFactory JSON v1.0.0 schema to allow translation of CMS Combine cards. @mattbellis provided a toy Combine card that we can start with. @benkrikler had started some discussion on this front at CHEP 2019, so his thoughts and input are very welcome here too.

kratsg self-assigned this on Jan 12, 2020
kratsg (Contributor, Author) commented Jan 12, 2020

I'll assign myself to this for now, since I have a small side project looking into Combine (part of my SUSY role in ATLAS) and want to explore some code for this.

matthewfeickert (Member) commented Dec 4, 2020

So it seems that CMS has added some rather complete tutorials that describe the Combine model (HT @kpedro88).

lukasheinrich (Contributor) commented

Together with #1188 it should be much more straightforward to build a Combine-like model.

kratsg (Contributor, Author) commented Mar 3, 2021

Paging @alexander-held.

alexander-held (Member) commented

I was curious about the possibility of converting datacards into pyhf workspaces and wrote a small utility https://github.com/alexander-held/datacard-to-pyhf. I do not know much about CMS Combine and the datacard format, so the implementation likely has a range of issues. The most glaring one is that it only supports single-bin channels (and no shape systematics) at the moment.
It runs fine with the toy example from above, resulting in a best-fit of

r =  0.9040 -0.2753 +0.3202

The paper reports 0.93 +0.26 −0.23 (stat.) +0.13 −0.09 (syst.) in the abstract. With another simple example, I do not see perfect agreement between the fit with Combine and pyhf (via MINUIT) either, so there are probably other differences to be understood.
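
For reference, a sketch of how a best-fit value and uncertainty could be extracted from a converted workspace with pyhf and iminuit. The workspace filename is a placeholder, and return_uncertainties gives symmetric (Hesse-style) errors, so reproducing the asymmetric interval quoted above would require a MINOS-style scan instead.

```python
# Sketch: maximum-likelihood fit of a converted workspace with the
# iminuit-backed optimizer; "workspace.json" is a placeholder path.
import json

import pyhf

pyhf.set_backend("numpy", "minuit")  # use iminuit for the minimization

with open("workspace.json") as f:
    workspace = pyhf.Workspace(json.load(f))

model = workspace.model()
data = workspace.data(model)

# best-fit parameter values with symmetric (Hesse) uncertainties
result = pyhf.infer.mle.fit(data, model, return_uncertainties=True)
poi_index = model.config.poi_index
bestfit, uncertainty = result[poi_index]
print(f"r = {bestfit:.4f} +/- {uncertainty:.4f}")
```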

lukasheinrich (Contributor) commented

Awesome, that's a great start. Taking on the simplest example and successively adding features was also pyhf's approach in general. Tagging @clelange as well.

nsmith- commented Mar 5, 2021

In case it's helpful, @andrzejnovak put together a conda recipe for Combine in cms-analysis/HiggsAnalysis-CombinedLimit#648.
Though it is still Python 2 :(

nucleosynthesis commented

Are you able to run the "standalone" version of Combine [1] (it works on a CernVM)? That might help better compare the expected results from Combine vs pyhf, since the tutorial cards, even the more advanced ones, are not identical to the real thing in the papers.

[1] http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/#standalone-version-of-combine

alexander-held (Member) commented

Could a standalone Docker image also be possible? Having no CVMFS dependence at all would be useful to allow running validations anywhere.

andrzejnovak (Member) commented

@alexander-held check the PR Nick linked

alexander-held (Member) commented

Having the conda version available is great! I view a ready-to-use Docker image as complementary to that (I guess with conda there is compilation involved?).

andrzejnovak (Member) commented

Sure, just wanted to point out that with the conda env you can build the image on the fly as well without having to access cvmfs when compiling stuff.

nucleosynthesis commented Apr 22, 2021

Any way to run standalone is fine. I'm not sure how well the conda-env version is synced with the main branch (102x vs 112x), but for this I don't think it matters too much. I just wasn't sure whether the comparison by @alexander-held was a direct comparison with a Combine run or not.

alexander-held (Member) commented Apr 22, 2021

@nucleosynthesis Yes, for the comparison with pyhf I was running Combine on lxplus. I saw small differences for these example datacards when comparing a result obtained with Combine to a result obtained by converting the model to pyhf and then minimizing the resulting HistFactory version of the model. The differences were small enough for me to be confident that the conversion is generally working, but slightly larger than what I would expect purely from slight differences in minimization. They probably come down to things like interpolation algorithm differences.

While writing this comment I noticed one discrepancy: my lnN treatment for a value 1.2 would use 0.8 for -1σ, but I should use 1/1.2. I will give this a try.
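
As a quick illustration of the convention being discussed: a Combine lnN with kappa = 1.2 scales the yield by kappa at +1σ and by 1/kappa at -1σ, which maps naturally onto a pyhf normsys modifier rather than a symmetric ±20% variation. The nuisance-parameter name below is hypothetical.

```python
# Illustrative check of the lnN convention discussed above: for kappa = 1.2
# the -1 sigma scaling is 1/kappa (~0.833), not 0.8.
kappa = 1.2
print(f"+1 sigma: {kappa}")           # 1.2
print(f"-1 sigma: {1 / kappa:.4f}")   # 0.8333, not 0.8

# In a pyhf spec this maps naturally onto a normsys modifier on the sample;
# the nuisance-parameter name here is hypothetical.
modifier = {
    "name": "bkg_lnN",
    "type": "normsys",
    "data": {"hi": kappa, "lo": 1 / kappa},
}
```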

Would you recommend the CombineHarvester Python API for datacard parsing? I remember looking at it last month but did not know how complete and up-to-date it is. I think the biggest challenge in creating the corresponding pyhf model for a given datacard is figuring out how to correctly parse the datacard format.

Is there a good place to ask technical questions about Combine model building details (as a non-CMS member)?

nucleosynthesis commented

CombineHarvester is probably overkill, though it should be up to date. A while ago we added a Python dumping option to the datacard parser, --dump-datacard, which prints to stdout an equivalent Python script that does the same thing as running text2workspace.py over the datacard. That might be helpful for seeing what is being mapped to what (see here: http://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part2/settinguptheanalysis/#automatic-production-of-datacards-and-workspaces).

For discussions or questions to the Combine team, probably the easiest thing for now is to submit an issue here and add the "question" label.

alexander-held (Member) commented

I had not noticed --dump-datacard before, but this looks super useful. Thanks a lot!

matthewfeickert (Member) commented

Just noting here that @ajgilbert gave a session on Combine (:+1:) at the first hands-on workshop on publication of statistical models; the last 4 slides are relevant to pyhf/Combine interoperability and probability model preservation.

mattbellis commented

There's a CMS Top workshop taking place this week where Combine will be discussed. It is at a time that I can't attend, but I'm going to try to reach out to the speaker(s) to see if there's any interest in understanding how / if we can create examples comparing and contrasting Combine and pyhf.

Building some hypersimple comparison examples is on a very long to-do list of mine. :)

maxgalli commented

Hi! I don't know if this can be useful, but a while ago I started working on @alexander-held's repo with the intention of adding support for datacards from shape-based analyses. You can find it here, and the output can be tested e.g. with this datacard.

A few huge disclaimers:

  • The reason I started working on this was mostly to have a way to easily handle and visualize huge datacards like the ones I use for the Run 2 differential combination, and the JSON format is probably the best way to do it (text files like datacards clearly don't scale well when there are a lot of channels, samples, and modifiers). This means that the way I translate some aspects is not compatible with pyhf itself and would have to be changed: see e.g. up and down modifiers, where I introduce a dictionary called shape with entries up and down containing the bin values for each modifier, which is clearly incompatible with a pyhf analysis flow (see the sketch after this list).
  • As you can see, there are quite a few types of modifiers that I didn't even try to translate; in these cases I simply ignored them with a comment or translated them to something completely meaningless.
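
For comparison, a sketch of how per-bin up/down template variations are expressed natively in the pyhf schema, via a histosys modifier. The sample name, bin contents, and systematic name below are purely illustrative.

```python
# Sketch: a pyhf sample carrying a shape systematic as a histosys modifier,
# instead of a custom "shape" dictionary. All names and numbers are made up.
sample = {
    "name": "ggH",
    "data": [12.0, 30.5, 44.1],  # nominal bin contents
    "modifiers": [
        {
            "name": "JES",  # hypothetical shape systematic
            "type": "histosys",
            "data": {
                "hi_data": [12.8, 31.9, 45.0],  # +1 sigma template
                "lo_data": [11.3, 29.2, 43.4],  # -1 sigma template
            },
        }
    ],
}
```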
