Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: warn about TBranch name, alias name conflict. #776

Merged

Conversation

jpivarski
Copy link
Member

I normally dislike warnings, but if we made this case an exception, then there would be reasonable-to-do things that wouldn't be possible to do. So, okay, a warning.

@jpivarski jpivarski linked an issue Nov 2, 2022 that may be closed by this pull request
@jpivarski
Copy link
Member Author

jpivarski commented Nov 2, 2022

I need convincing that this is a good thing to do. (I'm still thinking about it.)

In particular, let me know, @wiso, if this would have caught your problem. (Note: it's only being implemented for 5.x, since it's a new feature; I'm asking if this is the sort of thing that would have fixed it, not whether you can successfully run with this patch.)

@wiso
Copy link
Contributor

wiso commented Nov 3, 2022

I have checkout this branch and I confirm that in the example in #775 now I get the warning

/home/turra/uproot5/src/uproot/language/python.py:422: NameConflictWarning: 'phiModCalo' is both an alias and a branch name
  warnings.warn(
         phiModCalo
0          0.005856
1         -0.004650
2         -0.005619
3         -0.003007
4         -0.004062
...             ...
4650957    0.002802
4650958   -0.007556
4650959    0.001636
4650960   -0.002500
4650961    0.003881

[4650962 rows x 1 columns]

while in the other case I get the same error as before, as expected.

So, yes, this would have caught my problem.

Now I am thinking if this is always the desired behavior. For example you can have an input file where you know that one branch is wrong, but you can recompute it from other branches. In this case you may want to overwrite it with an alias, with the same name of the branch (since maybe you have other codes that assume that particular branch name). By the way this is just a warning, and not an error, so I guess it is ok.

As a comparison RDataFrame is raising an error when calling Define, e.g.

runtime_error: RDataFrame::Define: cannot define column "phiModCalo". A branch with that name is already present in the input TTree/TChain. Use Redefine to force redefinition.
  ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void>::Define(basic_string_view<char,char_traits<char> > name, basic_string_view<char,char_traits<char> > expression) =>
    runtime_error: RDataFrame::Define: cannot define column "phiModCalo". A branch with that name is already present in the input TTree/TChain. Use Redefine to force redefinition.

by the way they have a method Redefine which works.

@jpivarski
Copy link
Member Author

jpivarski commented Nov 3, 2022

I made it a warning, rather than an error, to prevent undue restrictions that might not even be under a user's control: a ROOT file could contain a TBranch fName in a TTree that has an fAliases of the same name—I don't want to make it impossible to read such a file.

Adding the ability to choose to the API, as RDataFrame has with Define and Redefine, would expand the size of the Uproot expressions API. This is all in the name of making something easier; having more knobs to turn—and knowing about them!—is not necessarily easier.

Ultimately, I don't think physicists should be doing their analysis in the Uproot expressions API. It's not comparable to RDataFrame, in which you're supposed to put code into Define nodes as strings, and there's a benefit to doing so. As long as the Uproot expressions are just Python expressions in strings, there's neither an interface nor a performance benefit to doing so. In other fora (Gitter, mostly), I've been recommending against using expressions for computations.

Why are they there at all? Here's how it went, historically:

  1. Uproot didn't support fAliases, and some ROOT files have significant simplifications in fAliases (such as edm::Product/many/levels/deep/Muon/ptMuon_pt).
  2. To support fAliases, Uproot needs to evaluate strings as expressions, ideally as TTreeFormula, since that's the language they were developed for.
  3. If Uproot is going to evaluate expressions anyway, it might as well take user-provided expressions as well as the ones that are baked into the fAliases member of the TTree.1
  4. I didn't have time to write a TTreeFormula parser. The formulate project ought to do it, but it's broken and doesn't have an active maintainer. So Uproot will interpret the expressions as Python "for now."
  5. The state of things hasn't changed.

So I'm not wanting to build up the expressions API more than it already is, which would include providing an API to express the define/redefine distinction, and raising an error on redefinition might make TTrees unreadable in ways that have little to do with the redefined expression, which users can't change.

I think this is the least-bad option. I'd support anyone who wants to revamp formulate (writing parsers is fun!), though it's too late for a PythonLanguage → TTreeFormulaLanguage switch in defaults before Uproot version 5.0.0 (early December). We could do it later with a deprecation cycle.

I guess, then, I'll merge this PR.

Footnotes

  1. Maybe that was the fatal error, adding an interface only because the underlying implementation exists, not because of a clear user-need.

@jpivarski jpivarski merged commit 2492876 into main Nov 3, 2022
@jpivarski jpivarski deleted the jpivarski/warn-about-TBranch-name-alias-name-conflict branch November 3, 2022 17:34
@wiso
Copy link
Contributor

wiso commented Nov 4, 2022

Thank you @jpivarski. By the way, what you are saying about aliases is worrying me. I am moving from uproot3 to 4. Before I used uproot to get a panda dataframe and then I used numexpr to compute new columns. Now, with uproot4, I am using alias to do the same. Should I go back? Probably not since I am limited by I/O and not computation.

@jpivarski
Copy link
Member Author

I think the problem with the expressions interface is that it makes it look like the expressions are calculated very early, perhaps even while the data are being read. This is not the case. It just pulls out arrays like normal, then runs the Python code in the strings. That's something that you could do as easily by running Python code not in strings.

Since that Python code is primarily NumPy ufuncs, NumExpr will be more efficient because it is a single pass over the arrays, rather than one pass per mathematical expression. (It's more complicated than that now with the fact that NumPy can fuse some operations, but with NumExpr, the entire expression is always fused.) This difference circumvents the potential bottleneck between RAM and CPU, which is roughly 10‒20 Gbps. If you're I/O limited, meaning disk I/O, the fastest SSDs are 7 Gbps (and that's extreme), so it's probably not a limiting factor.

To directly answer your question: the NumExpr strings would be faster, but you're limited by I/O anyway so you probably won't see it. But more importantly, prefer doing computations in Pandas and only I/O in Uproot because Pandas is a computation library and Uproot is an I/O library. Pandas, RDataFrame, Awkward Array, NumExpr, Numba, Dask, xarray, ... should all be better at doing computations than Uproot.


However, you just gave me an idea: crossing the boundary from Uproot 4.x to Uproot 5.x (this December), we're allowed to make backward-incompatible changes. As I was saying above, there isn't enough time to develop a TTreeFormula expression-handler, but we could replace the default Python expression-handler with a NumExpr expression-handler. The NumExpr language is more limited but faster, which at least creates a distinction between what you can do in strings and what you can do outside of strings.

I crossed out the above after thinking about it for a few minutes. Rushing an interface breaking change on this motivation sounds like a bad idea. We could take our time and add the NumExpr language (possibly with TTreeFormula → NumExpr translation, which would exclude the slice/loop/reducer parts of the TTreeFormula language) as a non-default, and maybe change the default through a deprecation cycle. There could be a range of version numbers in which non-trivial expressions require you to explicitly set the language=... argument, and after that phase, the default language changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Warn about name conflict between aliases and TBranch names
2 participants