
fix: Report expected p-values based on their quantiles under background hypotheses #1162

Merged (31 commits) on Jan 20, 2021

Conversation

@lukasheinrich (Contributor) commented on Oct 30, 2020

Description

Fixes issue reported by @kpachal

This changes the way we report expected limits.

We used to:

  • find quantiles in the test statistic distribution [q1, q2, ..., q5]
  • report CLs/p-values for those test statistics [p1, p2, ..., p5]

However, this can lead to orderings in which the p-values are not monotonic, because the relationship between CLs and the test statistic value is not necessarily monotonic.

We now:

  • compute a sample of CLs values based on test statistics sampled from toys
  • report expected CLs/p-values as the quantiles of that CLs sample [p1, p2, ..., p5]

This is also what ROOT does. It might raise some future questions regarding #966 but is fine for now.
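As a standalone illustration of the difference (a minimal sketch, not pyhf internals: cls_of is a made-up, deliberately non-monotonic mapping from test statistic to CLs), quantiles of the CLs sample are ordered by construction, while CLs evaluated at test-statistic quantiles need not be:

import numpy as np

# stand-in for test statistics sampled from background-only toys
rng = np.random.default_rng(0)
bkg_teststats = rng.chisquare(df=1, size=10_000)

def cls_of(teststat):
    # hypothetical, deliberately non-monotonic CLs(test statistic) mapping
    return np.exp(-teststat) * (1 + 0.4 * np.sin(8 * teststat))

quantiles = [2.275, 15.865, 50.0, 84.135, 97.725]  # -2 sigma ... +2 sigma

# old approach: quantiles in test-stat space, then map each one to CLs
old_band = cls_of(np.percentile(bkg_teststats, quantiles))

# new approach: map every toy to CLs first, then take quantiles of that sample
new_band = np.percentile(cls_of(bkg_teststats), quantiles)

diffs = np.diff(old_band)
print(bool(np.all(diffs <= 0) or np.all(diffs >= 0)))  # ordering not guaranteed
print(bool(np.all(np.diff(new_band) >= 0)))            # True: sample quantiles are ordered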

import pyhf
import json
import numpy as np
import matplotlib.pyplot as plt
import pyhf.contrib.viz.brazil as brazil

w = pyhf.Workspace(json.loads('''
{
    "channels": [
        {
            "name": "SR_combined",
            "samples": [
                {
                    "data": [4],
                    "modifiers": [
                        {"data": null,"name": "lumi","type": "lumi"},
                        {"data": null,"name": "mu","type": "normfactor"}
                    ],
                    "name": "signal"
                },
                {
                    "data": [0.1],
                    "modifiers": [
                        {"data": null,"name": "lumi","type": "lumi"},
                        {"data": {"hi": 2.0,"lo": 0.5},"name": "uncorr_bkguncrt","type": "normsys"}
                    ],
                    "name": "background"
                }
            ]
        }
    ],
    "measurements": [
        {
            "config": {
                "parameters": [
                    {"auxdata": [1.0],"bounds": [[0.5,1.5]],"inits": [1.0],"name": "lumi","sigmas": [0.01]},
                    {"bounds": [[0.0,10.0]],"inits": [1.0],"name": "mu"}
                ],
                "poi": "mu"
            },
            "name": "Measurement"
        }
    ],
    "observations": [{"data": [0],"name": "SR_combined"}
    ],
    "version": "1.0.0"
}
'''))
m = w.model()
d = w.data(m)

# scan the POI and compute observed + expected CLs with the toy-based calculator
mutests = np.linspace(0.5, 1.0, 4)
results = [
    pyhf.infer.hypotest(
        mu, d, m, return_expected_set=True, calctype='toybased', ntoys=15000
    )
    for mu in mutests
]

fig, ax = plt.subplots()
brazil.plot_results(ax, mutests, results)
plt.semilogy()
plt.ylim(1e-3, 1)
plt.xlim(0, 1.35)

produces this plot:

[figure: toy-based CLs Brazil band over the mu scan]

ReadTheDocs build: https://pyhf.readthedocs.io/en/fix_for_kate/api.html#inference

For the PR Assignees:

  • Summarize commit messages into a comprehensive review of the PR:
    • Add two ROOT validation scripts for toys
    • Use the same initialization parameters for b and s+b toys
    • Leave (expected) p-value calculation to the calculators
      • Add pvalues and expected_pvalues to the calculator API (see the sketch below)
    • Toy-based calculator returns expected values based on CLs distribution quantiles

Co-authored-by: Matthew Feickert <matthew.feickert@cern.ch>
Co-authored-by: Giordon Stark <kratsg@gmail.com>
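A minimal sketch of that calculator-level API for toys (assuming ToyCalculator usage mirrors the AsymptoticCalculator example shown in review below; the model and toy count are illustrative):

import pyhf

model = pyhf.simplemodels.hepdata_like(
    signal_data=[12.0, 11.0], bkg_data=[50.0, 52.0], bkg_uncerts=[3.0, 7.0]
)
data = [51, 48] + model.config.auxdata
mu_test = 1.0

calc = pyhf.infer.calculators.ToyCalculator(data, model, ntoys=1000)
teststat = calc.teststatistic(mu_test)
sig_plus_bkg_dist, bkg_dist = calc.distributions(mu_test)

# new API: p-value calculation now lives on the calculator
CLsb, CLb, CLs = calc.pvalues(teststat, sig_plus_bkg_dist, bkg_dist)
# expected bands from quantiles of the CLs sample (the change in this PR)
CLsb_exp, CLb_exp, CLs_exp = calc.expected_pvalues(sig_plus_bkg_dist, bkg_dist)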

@kratsg (Contributor) commented on Oct 30, 2020

Tests need to get fixed. We should probably also add a test that asserts the correct behavior.
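For example, a test along these lines could assert that the expected CLs band comes out ordered (a sketch reusing the hypotest call from the description; the model and toy count are illustrative):

import numpy as np
import pyhf

def test_expected_cls_band_is_ordered():
    model = pyhf.simplemodels.hepdata_like(
        signal_data=[12.0, 11.0], bkg_data=[50.0, 52.0], bkg_uncerts=[3.0, 7.0]
    )
    data = [51, 48] + model.config.auxdata
    _, expected_set = pyhf.infer.hypotest(
        1.0, data, model, return_expected_set=True, calctype='toybased', ntoys=2000
    )
    # quantiles of the CLs sample: the band must be ordered from -2 sigma to +2 sigma
    band = [float(value) for value in expected_set]
    assert np.all(np.diff(band) >= 0)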

@lukasheinrich (Contributor, Author) commented on Oct 30, 2020

It's not completely fixed though: the green band is not within the yellow one. The numerics are still an open issue.

[figure: Brazil band with the 1-sigma band not fully contained in the 2-sigma band]

@lukasheinrich (Contributor, Author) commented on Nov 4, 2020

I added a validation script StandardHypoTestDemo.py, which roughly follows StandardHypoTestDemo.C except that it is configured for exclusion instead of discovery.

With this we can cross-check toys against ROOT.

For the workspace 'xmlimport_input_bkg.json' (which is xmlimport_input_bkg except with the data set to the background expectation) it looks like this:

[figure: ROOT StandardHypoTestDemo output for the workspace]

https://root.cern.ch/doc/v614/StandardHypoTestDemo_8C.html

The pyhf code is:

[figure: screenshot of the pyhf cross-check code]
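The screenshot itself is not recoverable here, but a pyhf call consistent with the description (a toy-based exclusion-style hypotest on that workspace) would look roughly like this sketch; the file name comes from the comment above and the toy count is illustrative:

import json
import pyhf

with open("xmlimport_input_bkg.json") as spec_file:
    workspace = pyhf.Workspace(json.load(spec_file))
model = workspace.model()
data = workspace.data(model)

# toy-based hypothesis test at a fixed signal strength
CLs_obs, CLs_exp_set = pyhf.infer.hypotest(
    1.0, data, model, return_expected_set=True, calctype='toybased', ntoys=5000
)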

@lukasheinrich (Contributor, Author) commented:

OK, some more important points:

  • in this regime the lumi uncertainty becomes important, so one must
    • add a lumi modifier to each sample
    • add a lumi sigma to the parameter config

(a quick check of this wiring is sketched after the workspace below)

Once this is done, the delta peak gets broadened (left: ROOT, right: pyhf toys):

[figure: test statistic distributions before and after adding the lumi constraint]

The Brazil band still doesn't look too great, but it's looking better. Trying with more toys:

[figure: Brazil band with more toys]

This is the workspace

{
    "channels": [
        {
            "name": "SR_combined",
            "samples": [
                {
                    "data": [
                        1
                    ],
                    "modifiers": [
                        {
                            "data": null,
                            "name": "lumi",
                            "type": "lumi"
                        },
                        {
                            "data": null,
                            "name": "mu",
                            "type": "normfactor"
                        }
                    ],
                    "name": "signal"
                },
                {
                    "data": [
                        0.1
                    ],
                    "modifiers": [
                        {
                            "data": null,
                            "name": "lumi",
                            "type": "lumi"
                        },
                        {
                            "data": {
                                "hi": 1.1,
                                "lo": 0.9
                            },
                            "name": "uncorr_bkguncrt",
                            "type": "normsys"
                        }
                    ],
                    "name": "background"
                }
            ]
        }
    ],
    "measurements": [
        {
            "config": {
                "parameters": [
                    {
                        "auxdata": [
                            1.0
                        ],
                        "bounds": [
                            [
                                0.5,
                                1.5
                            ]
                        ],
                        "inits": [
                            1.0
                        ],
                        "name": "lumi",
                        "sigmas": [
                            0.1
                        ]
                    },
                    {
                        "bounds": [
                            [
                                0.0,
                                10.0
                            ]
                        ],
                        "inits": [
                            1.0
                        ],
                        "name": "mu"
                    }
                ],
                "poi": "mu"
            },
            "name": "Measurement"
        }
    ],
    "observations": [
        {
            "data": [
                0
            ],
            "name": "SR_combined"
        }
    ],
    "version": "1.0.0"
}
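As referenced above, a quick check that the lumi pieces are wired up (a sketch; the file name workspace.json is illustrative and assumes the spec above has been saved to disk):

import json
import pyhf

with open("workspace.json") as spec_file:
    workspace = pyhf.Workspace(json.load(spec_file))
model = workspace.model()

# the lumi parameter should now exist and carry the configured sigma
print("lumi" in model.config.parameters)      # expect True
print(model.config.param_set("lumi").sigmas)  # expect [0.1]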

@lgtm-com (bot) commented on Nov 4, 2020

This pull request introduces 2 alerts when merging a8c2f29 into 81c9adb - view on LGTM.com

new alerts:

  • 2 for Unused local variable

@lgtm-com (bot) commented on Nov 6, 2020

This pull request introduces 2 alerts when merging d4d1898 into a3b34a5 - view on LGTM.com

new alerts:

  • 2 for Unused local variable

@lgtm-com (bot) commented on Nov 8, 2020

This pull request introduces 4 alerts when merging 63eaa6d into e4011ff - view on LGTM.com

new alerts:

  • 3 for Variable defined multiple times
  • 1 for Unused local variable

@codecov (bot) commented on Nov 8, 2020

Codecov Report

Merging #1162 (26ada72) into master (0e71f2f) will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master    #1162   +/-   ##
=======================================
  Coverage   97.48%   97.48%           
=======================================
  Files          63       63           
  Lines        3733     3740    +7     
  Branches      530      531    +1     
=======================================
+ Hits         3639     3646    +7     
  Misses         55       55           
  Partials       39       39           
Flag        Coverage Δ
unittests   97.48% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files                  Coverage Δ
src/pyhf/infer/intervals.py     100.00% <ø> (ø)
src/pyhf/infer/__init__.py      100.00% <100.00%> (ø)
src/pyhf/infer/calculators.py   100.00% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@lukasheinrich lukasheinrich changed the title fix: unify inits feat: report expected p-values based on their quantiles under background hypotheses Nov 9, 2020
@lukasheinrich lukasheinrich changed the title feat: report expected p-values based on their quantiles under background hypotheses fix: report expected p-values based on their quantiles under background hypotheses Nov 9, 2020
@lukasheinrich (Contributor, Author) commented on Nov 9, 2020

Leaving this for later.

This is now the plot with ROOT:

[figure: ROOT Brazil band]

and this with pyhf:

[figure: pyhf Brazil band]

which should be equivalent within the toy sample variance (20k toys).

@kratsg kratsg added the feat/enhancement New feature or request label Jan 19, 2021
@matthewfeickert matthewfeickert added the tests pytest label Jan 20, 2021
@matthewfeickert (Member) left a comment
Thanks for the nice rebasing work @kratsg.

So a concern that I have here, that I myself contributed to when working on this weeks ago, is that we're introducing inconsistent return types for CLs-like values now. Previously, all CLs-like / p-value-like values were 0-d tensors

# Ensure that all CL values are 0-d tensors

However, now we have things like hypotest returning 0-d tensors while pvalues and expected_pvalues return floats for NumPy but 0-d tensors for all other backends, even though all of the examples show CLs-like / p-value-like values being returned.

Example: Run the docstring example from this PR for expected_pvalues for various backends and note the return type of CLs_exp_band[0].

import pyhf

for backend in ["numpy", "pytorch", "jax"]:
    print(f"\nbackend: {backend}")
    pyhf.set_backend(backend)
    model = pyhf.simplemodels.hepdata_like(
        signal_data=[12.0, 11.0], bkg_data=[50.0, 52.0], bkg_uncerts=[3.0, 7.0]
    )
    observations = [51, 48]
    data = observations + model.config.auxdata
    mu_test = 1.0
    asymptotic_calculator = pyhf.infer.calculators.AsymptoticCalculator(
        data, model, test_stat="qtilde"
    )
    _ = asymptotic_calculator.teststatistic(mu_test)
    sig_plus_bkg_dist, bkg_dist = asymptotic_calculator.distributions(mu_test)
    CLsb_exp_band, CLb_exp_band, CLs_exp_band = asymptotic_calculator.expected_pvalues(
        sig_plus_bkg_dist, bkg_dist
    )
    print(f"CLs expected band: {CLs_exp_band}")
    print(
        f"of type: {type(CLs_exp_band[0])} and shape {pyhf.tensorlib.shape(CLs_exp_band[0])}"
    )

gives


backend: numpy
CLs expected band: [0.0026062609501074576, 0.01382005356161206, 0.06445320535890459, 0.23525643861460702, 0.573036205919389]
of type: <class 'numpy.float64'> and shape ()

backend: pytorch
CLs expected band: [tensor(0.0026), tensor(0.0138), tensor(0.0645), tensor(0.2353), tensor(0.5730)]
of type: <class 'torch.Tensor'> and shape ()

backend: jax
CLs expected band: [DeviceArray(0.00260626, dtype=float64), DeviceArray(0.01382005, dtype=float64), DeviceArray(0.0644532, dtype=float64), DeviceArray(0.23525643, dtype=float64), DeviceArray(0.57303619, dtype=float64)]
of type: <class 'jax.interpreters.xla._DeviceArray'> and shape ()

We should try to come to a consensus on what the return type for a CLs-like / p-value-like value should be. From looking back at PR #944 and Issue #714, I think that CLs-like values should be 0-d tensors, as that lets us sidestep this difference in backend behavior (I believe this is the motivation for the current behavior). While conceptually it makes sense to have a p-value-like quantity just be a float to emphasize its scalar nature, having it be a 0-d tensor still makes it appear to a user to have scalar-like behavior.
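A follow-up along the lines proposed here might normalize returns through the tensorlib (a sketch of the idea, not the eventual fix):

import pyhf

def as_0d_tensor(pvalue):
    # wrap any scalar-like p-value in a 0-d tensor of the active backend,
    # so every backend returns the same type
    return pyhf.tensorlib.astensor(pvalue)

pyhf.set_backend("numpy")
wrapped = as_0d_tensor(0.05)
print(type(wrapped), pyhf.tensorlib.shape(wrapped))  # 0-d array, shape ()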

The validation notebooks here are fine for now and we can revise them when we address Issue #1241.

@kratsg all the notebook stuff is great though. 👍

Review threads: docs/api.rst (resolved), src/pyhf/infer/__init__.py (outdated, resolved), src/pyhf/infer/calculators.py (outdated, resolved)
@matthewfeickert (Member) left a comment

The CLs-like / p-value-like return type problem has now moved to Issue #1268 and will be resolved in a follow-up PR.

@matthewfeickert matthewfeickert added the API Changes the public API label Jan 20, 2021
@kratsg kratsg merged commit fed49ca into master Jan 20, 2021
@kratsg kratsg deleted the fix_for_kate branch January 20, 2021 19:44
Labels: API (Changes the public API), docs (Documentation related), feat/enhancement (New feature or request), fix (A bug fix), tests (pytest)
Development

Successfully merging this pull request may close these issues:

  • Update ToDo on readme to indicate that we've added support for toys
  • Toys giving mysterious results