Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalid RDataFrames returned from functions in pyROOT when constructing from a TChain due to garbage collection #10965

Closed
KeanuGh opened this issue Jul 14, 2022 · 8 comments
Assignees

Comments

@KeanuGh
Copy link

KeanuGh commented Jul 14, 2022

Describe the bug

RDataFrame constructed within a function with a TChain becomes inaccessible when returned, unless TChain is returned with it (segfault occurs).

Expected behavior

RDataFrame constructed with a TChain should be able to be returned from a function

To Reproduce

import ROOT
import glob

def build_rdataframe(filepath, tree):
    chain = ROOT.TChain(tree)
    for file in glob.glob(filepath):
        chain.Add(file)
    Rdf = ROOT.RDataFrame(chain)

    return chain, Rdf

chain, df = build_rdataframe("path/to/files", "tree")
print(df.Count().GetValue())
# works

def build_rdataframe_2(filepath, tree):                                                                                                                                                     
    chain = ROOT.TChain(tree)
    for file in glob.glob(filepath):
        chain.Add(file)
    Rdf = ROOT.RDataFrame(chain)

    return Rdf

df = build_rdataframe_2("path/to/files", "tree")
print(df.Count().GetValue())
# segfault

Setup

  1. ROOT version 6.24.00 / Python 3.10.4 | ROOT version 6.20.06 / Python 2.7.5
  2. Arch Linux / Centos7
  3. ROOT downloaded from conda-forge / Arch linux repository / lxplus
    (tested on multiple OSs/environments)
@KeanuGh KeanuGh added the bug label Jul 14, 2022
@eguiraud eguiraud self-assigned this Jul 15, 2022
@ikabadzhov
Copy link
Contributor

Remark: If all trees of the TChain has the same name, and the TChain does not have friends, you can instead use this ctor: https://root.cern/doc/master/classROOT_1_1RDataFrame.html#adcebb79c00694365296d054626e87742
Further note that this ctor accepts filename globs.

@eguiraud
Copy link
Member

Hi @KeanuGh ,

thank you very much for your report. This is a lifetime problem that's related to the behavior of the underlying C++, and that makes it extremely hard to solve for this particular constructor.

However, we are introducing what we plan to be a superior alternative to this problematic pattern, RDatasetSpec, used for example as:

def make_df():
   ds = ROOT.RDF.Experimental.RDatasetSpec(treename, paths)
   return ROOT.RDataFrame(ds)

The constructor taking a RDatasetSpec will cover all cases covered by TChain (including friend trees, but probably not indexed friend trees) and it does not have the lifetime issue you describe. RDatasetSpec is available in the master branch since last week.

@KeanuGh
Copy link
Author

KeanuGh commented Jul 18, 2022

Thanks for the replies @eguiraud and @ikabadzhov, RDatasetSpec sounds very useful! The only reason that I had been using TChains thus far is that RDataFrames don't seem to be able to handle wildcards in directories (ie path/to/sample*/*.root as opposed to path/to/sample/*.root). I didn't realise that a list of globs could be passed directly to the constructor so that is a solution I am implementing for now.

@KeanuGh
Copy link
Author

KeanuGh commented Jul 18, 2022

Perhaps it would be possible to throw up an error or warning when this type of issue is encountered? as it took me an embarrassingly long time to find the source of the segfault!

@eguiraud
Copy link
Member

eguiraud commented Aug 1, 2022

RDataFrames don't seem to be able to handle wildcards in directories

can you provide an example of what does not work? RDF forwards the globs directly to TChain so what works for TChain should work for RDF.

Perhaps it would be possible to throw up an error or warning when this type of issue is encountered?

a use-after-delete is undefined behavior in C++ which means (among other things) that in general it cannot be detected. I'll try to discuss with a few people the options we have.

@eguiraud
Copy link
Member

Hello again, we discussed this internally, brainstormed solutions etc. and the outcome is that this cannot be fixed in RDF (or possibly at all): as I mentioned above this is a quirk of managing C++ objects from Python: because of C++ lifetimes and undefined behavior with use-after-deletes, there is no surefire way in which RDF can detect that the TChain object it's referring to went out of scope.

Please do this instead:

def build_rdataframe_2(filepath, tree):                                                                                                                                                     
    return ROOT.RDataFrame(tree, filepath)

or if you need friend trees (from ROOT v6.28):

def build_rdataframe_2(filepath, tree):    
    spec = ROOT.RDF.Experimental.RDatasetSpec()\
       .AddGroup(
            (
                "samples",
                ["subTree1", "subTree2"],
                ["PYspecTestFile5.root", "PYspecTestFile6.root"],
             )
        )\
        .WithGlobalFriends("anotherTree", "PYspecTestFile7.root", "friendTree")
                                                                                                                                                
    return ROOT.RDataFrame(spec)

@KeanuGh
Copy link
Author

KeanuGh commented Feb 13, 2023

I've come back to this recently and have found a little workaround. If the TChain is instead declared in C++ via the gInterpreter, it will not be garbage collected. eg:

def fill_rdataframe(paths, trees):
    """create TChain in c++ in order for it not to be garbage collected by python"""
    ROOT.gInterpreter.Declare(
        f"""
            TChain chain;
            void fill_chain() {{
                std::vector<std::string> paths = {{\"{'","'.join(paths)}\"}};
                std::vector<std::string> trees = {{\"{'","'.join(trees)}\"}};
                for (const auto& path : paths) {{
                    for (const auto& tree : trees) {{
                        chain.Add((path + "?#" + tree).c_str());
                    }}
                }}
            }}
        """
    )
    ROOT.fill_chain()

    return ROOT.RDataFrame(ROOT.chain)

Rdf = fill_rdataframe(["file1", "file2"], ["tree1", "tree2"])

the conversion from python lists into c++ vectors is a little messy but it works

@eguiraud
Copy link
Member

That is safe: the TChain is declared at global scope, so it will be destroyed only at program exit. Declaring it as a Python global variable (i.e. at module scope) should have the same effect.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants