Skip to content
This repository has been archived by the owner on Jan 27, 2023. It is now read-only.

Question on behavior of weights and variances #97

Closed
matthewfeickert opened this issue Mar 17, 2021 · 9 comments · Fixed by #98
Closed

Question on behavior of weights and variances #97

matthewfeickert opened this issue Mar 17, 2021 · 9 comments · Fixed by #98
Labels
question Further information is requested

Comments

@matthewfeickert
Copy link
Member

matthewfeickert commented Mar 17, 2021

From the discussions on Gitter today motivated by matthewfeickert/heputils#24 I think I am confused by the behavior of uproot3-methods treatment of weights and variances. Here is a short example

$ cat requirements.txt 
uproot~=4.0.6
uproot3~=3.14.4
$ pip list | grep "uproot"
uproot            4.0.6
uproot3           3.14.4
uproot3-methods   0.10.0
# issue.py
import numpy as np
import uproot
import uproot3

if __name__ == "__main__":
    bins = np.arange(0, 8)
    counts = np.array([2 ** x for x in range(len(bins[:-1]))])
    with uproot3.recreate("test.root", compression=uproot3.ZLIB(4)) as outfile:
        outfile["data"] = (counts, bins)

    root_file = uproot.open("test.root")
    hist = root_file["data"]

    print(f"hist has weights: {hist.weighted}")
    print(f"hist values: {hist.values()}")
    print("\nvariances are square of uncertainties\n")
    print(f"hist errors: {hist.errors()}")
    print(f"hist variances: {hist.variances()}")
    assert hist.variances().tolist() == np.square(hist.errors()).tolist()
    print("\nbut errors are not sqrt of values\n")

    print(f"expected errors to be: {np.sqrt(hist.values())}")
$ python issue.py 
hist has weights: True
hist values: [ 1  2  4  8 16 32 64]

variances are square of uncertainties

hist errors: [ 1.  2.  4.  8. 16. 32. 64.]
hist variances: [1.000e+00 4.000e+00 1.600e+01 6.400e+01 2.560e+02 1.024e+03 4.096e+03]

but errors are not sqrt of values

expected errors to be: [1.         1.41421356 2.         2.82842712 4.         5.65685425
 8.        ]

As help for .errors gives

Help on method errors in module uproot.behaviors.TH1:

errors(flow=False) method of uproot.dynamic.Model_TH1I_v3 instance
    Args:
        flow (bool): If True, include underflow and overflow bins before and
            after the normal (finite-width) bins.
    
    Errors (uncertainties) in the :ref:`uproot.behaviors.TH1.Histogram.values`
    as a 1, 2, or 3 dimensional ``numpy.ndarray`` of ``numpy.float64``.
    
    If ``fSumw2`` (weights) are available, they will be used in the
    calculation of the errors. If not, errors are assumed to be the square
    root of the values.
    
    Setting ``flow=True`` increases the length of each dimension by two.

I think(?) this is due to the behavior of

valuesarray = numpy.empty(len(content) + 2, dtype=content.dtype)
valuesarray[1:-1] = content
valuesarray[0] = 0
valuesarray[-1] = 0
out.extend(valuesarray)
out._fSumw2 = valuesarray ** 2

where it seems that the weights are taken to be the values of the NumPy array — this is not what I would have expected.

What is the proper way to create an uproot3 TH1 with Poisson uncertainties?

@jpivarski
Copy link
Member

The confusing thing about this (I believe this is what you're hitting) is that the TH1 is a NumPy array. That array corresponds to the values (bin contents) of the histogram. It might also have a _fSumw2 attribute attached to it, which is another NumPy array, that corresponds to the errors.

The issue is that ROOT's TH1 is a subclass of TArray, and in Uproot 3 I made TArray a NumPy array subclass. This confusion motivated a less direct modeling of C++ classes in Uproot 4, in which Python objects representing C++ objects contain their superclasses instead of inheriting from them.

Not only did this introduce the strange semantics in which the histogram values are not an attribute; they are the object, but it also made other subclasses of TH1, such as TProfile, have incorrect values and errors (because we didn't explicitly implement it, and it inherited TH1's methods).

At least, that's what I think you're running into above.

@henryiii
Copy link
Member

The problem is this line:

out._fSumw2 = valuesarray ** 2

Which sets fSumw2 assuming that the values in the ndarray were made with a single fill with a large weight, rather than out._fSumw2 = valuesarray, which would assume all fills were made with weight=1 and would match ROOT's behavior when adding weights. (I don't see an obvious path for if the ndarray already has an fSumw2 attribute attached, but didn't check closely).

@matthewfeickert
Copy link
Member Author

matthewfeickert commented Mar 17, 2021

The confusing thing about this (I believe this is what you're hitting) is that the TH1 is a NumPy array. That array corresponds to the values (bin contents) of the histogram. It might also have a _fSumw2 attribute attached to it, which is another NumPy array, that corresponds to the errors.

Hm, it seems like this will always happen though given that it is present here using only uproot3

# uproot3_only.py
import numpy as np
import uproot3

if __name__ == "__main__":
    bins = np.arange(0, 8)
    counts = np.array([2 ** x for x in range(len(bins[:-1]))])
    with uproot3.recreate("test.root", compression=uproot3.ZLIB(4)) as outfile:
        outfile["data"] = (counts, bins)

    root_file = uproot3.open("test.root")
    hist = root_file["data"]

    print(f"hist._fSumw2 ({type(hist._fSumw2)}): {hist._fSumw2}")
    print(f"hist values ({type(hist.values)}): {hist.values}")
    print(f"hist variances ({type(hist.variances)}): {hist.variances}")
$ python uproot3_only.py 
hist._fSumw2 (<class 'uproot3.rootio.TArrayD'>): [0.0, 1.0, 4.0, 16.0, 64.0, 256.0, 1024.0, 4096.0, 0.0]
hist values (<class 'numpy.ndarray'>): [ 1  2  4  8 16 32 64]
hist variances (<class 'numpy.ndarray'>): [1.000e+00 4.000e+00 1.600e+01 6.400e+01 2.560e+02 1.024e+03 4.096e+03]

So won't it always have _fSumw2 from

out._fSumw2 = valuesarray ** 2

and then the variances are set from it? :/ I may be missing something obvious, but from what Henry has pointed out it seems like this will always be the case if you're writing a histogram from numpy arrays.

@kratsg
Copy link
Contributor

kratsg commented Mar 17, 2021

@henryiii to be clear, is this just boiling down to whether one needs to do sum(values)**2 versus sum(x**2 for x in values) ?

@matthewfeickert
Copy link
Member Author

I think I now understand that this is also highlighting a difference in how ROOT and uproot3 write files given that

# example_pyroot.py
import uproot
from ROOT import TH1F, TFile, kFALSE, kTRUE


def write_root_file(filename):
    n_bins = 7
    hist = TH1F("data", "", n_bins, 0, n_bins)
    hist.SetStats(kFALSE)

    for bin_idx in range(0, n_bins):
        hist.SetBinContent(bin_idx + 1, 2 ** bin_idx)

    hist.Sumw2(kTRUE)

    write_file = TFile(filename, "recreate")
    hist.Write()
    write_file.Close()


if __name__ == "__main__":
    write_root_file("pyroot_output.root")

    root_file = uproot.open("pyroot_output.root")
    hist = root_file["data"]

    print(f"hist has weights: {hist.weighted}")
    print(f"hist values: {hist.values()}")
    print(f"hist errors: {hist.errors()}")
    print(f"hist variances: {hist.variances()}")

gives

$ python example_pyroot.py 
hist has weights: True
hist values: [ 1.  2.  4.  8. 16. 32. 64.]
hist errors: [1.         1.41421356 2.         2.82842712 4.         5.65685425
 8.        ]
hist variances: [ 1.  2.  4.  8. 16. 32. 64.]

I need to think on this more, as I obviously don't understand the problem as well as everyone else, but would changing the behavior of TH1's from_numpy be enough and be correct?

@henryiii
Copy link
Member

henryiii commented Mar 18, 2021

The call hist.Sumw2(kTRUE) sets the variances equal to the entries. Then the error is the square root of the entries. Uproot is setting the variance equal to the square of the entries, making the error equal to the entries. I think it's a simple bug due to the fact "sumw2" has a square in it, so it looks like it should be set with a square.

If you have a bin with 5 entries, the usual thing to do is assume that it was filled 5 times, so sumw2 = 1² + 1² + 1² + 1² + 1² = 5. Then the error is √5. Usually you do not assume that this was filled once with weight 5 (which would be with error 5).

@henryiii
Copy link
Member

I think the above answers @kratsg's question - yes, when it says "sumw2", it's each weight that's squared, not the sum of them, so setting this with valuesarray ** 2 is only correct if there was only one weight in the sum.

@matthewfeickert
Copy link
Member Author

@jpivarski If you're happy with the solution Henry poses above I have a branch on uproot3-methods that implements this that I'll make a PR from.

@jpivarski
Copy link
Member

Sorry—just catching up. Yes: this sounds like a good solution, though I'll understand it better when I see it as a diff in a PR. Thanks!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
question Further information is requested
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants