
TMVA cannot be initialized properly in distributed Python environments #13798

Closed
vepadulano opened this issue Oct 3, 2023 · 3 comments


@vepadulano (Member)

Check duplicate issues.

  • Checked for duplicates

Description

As seen in a few forum posts:

https://root-forum.cern.ch/t/issue-with-rdataframe-using-spark-cluster-on-lcg-102-and-higher/56568/3

https://root-forum.cern.ch/t/error-while-using-rdataframe-with-spark-cluster-analytix/56006

https://swan-community.web.cern.ch/t/distributed-rdataframes-with-spark/690

When running on lxplus/SWAN with recent LCG stacks, TMVA fails its part of the Python initialization with the following error:

File "/cvmfs/sft.cern.ch/lcg/views/LCG_103swan/x86_64-centos7-gcc11-opt/lib/ROOT/_pythonization/_tmva/__init__.py", line 25, in <module>
    hasRDF = gSystem.GetFromPipe("root-config --has-dataframe") == "yes"
ValueError: TString TSystem::GetFromPipe(const char* command) =>
    ValueError: nullptr result where temporary expected

This has surfaced when users try to run distributed RDataFrame applications on such platforms; the applications cannot even start because of the reported error.

Reproducer

See the related forum posts linked above. A minimal sketch of the kind of application they describe is given below.
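
The following is only an illustrative sketch, not a snippet taken from the forum posts: a distributed RDataFrame using the Spark backend, which on the affected LCG/SWAN setups fails when the workers import ROOT and hit the TMVA initialization error reported above. The trivial workload is an assumption.

import ROOT

# Distributed RDataFrame with the Spark backend (available in ROOT >= 6.26).
RDataFrame = ROOT.RDF.Experimental.Distributed.Spark.RDataFrame

# Empty data source with 100 entries; any real analysis would do.
df = RDataFrame(100)

# Triggering the event loop ships the work to the Spark workers, where the
# failing Python initialization of ROOT/TMVA happens.
print(df.Count().GetValue())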

ROOT version

6.26 and above (based on the LCG stacks reported)

Installation method

LCG build

Operating system

Linux

Additional context

No response

@vepadulano added this to the 6.30/00 milestone Oct 3, 2023
@vepadulano self-assigned this Oct 3, 2023
@vepadulano changed the title TMVA cannot be initialized properly in distributed environments → TMVA cannot be initialized properly in distributed Python environments Oct 3, 2023
@vepadulano (Member, Author)

The direct reason for the failure seems to be that, at least when using SWAN, the ROOT installation is somehow ill-formed. Here is a simpler reproducer that just uses Spark primitives to try to run the root-config command on a worker:

[Screenshot: Spark-based reproducer and its output, showing that root-config cannot be run on the worker]
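
A hedged sketch of that reproducer (the exact snippet is only visible in the screenshot above): use plain Spark primitives, without importing ROOT, to run root-config on a worker and report what happens there.

import subprocess
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

def run_root_config(_):
    # Run root-config on the worker and return its output, or the error
    # raised when the command cannot be found on the worker's PATH.
    try:
        out = subprocess.run(["root-config", "--has-dataframe"],
                             capture_output=True, text=True, check=True)
        return out.stdout.strip()
    except Exception as exc:
        return repr(exc)

print(sc.parallelize(range(1), numSlices=1).map(run_root_config).collect())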

In all fairness, I don't understand why that command is called in the first place (which happens here), so I will investigate whether it is needed at all.
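
For illustration only (this is an assumption, not the change that was eventually merged): one way such a check could be done in-process, without shelling out to root-config, is to ask the already-loaded ROOT build for its configured features.

import ROOT

# root-config --features lists "dataframe" when ROOT was built with
# RDataFrame; the same information is available from the loaded build.
has_rdf = "dataframe" in str(ROOT.gROOT.GetConfigFeatures())
# An even simpler Python-level check would be: hasattr(ROOT, "RDataFrame")
print("RDataFrame available:", has_rdf)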

vepadulano added a commit to vepadulano/root that referenced this issue Oct 3, 2023
Which also relied on TSystem::GetFromPipe which sometimes may not work
properly for any reason. See
root-project#13798 for more details.
@dpiparo (Member) commented Oct 4, 2023

Hello. I think this issue is quite relevant, not only for distributed execution but also for PyROOT usability in general. This is why I proposed these changes, in case they turn out to be useful: #13803

@vepadulano (Member, Author)

Thanks @dpiparo for your PR; after merging it I was able to continue the investigation of this issue. After removing this code from the TMVA initialization we are seeing different problems, so I will open a new issue for those and consider this one closed. Still, since all the problems we see are related to the same triggering factor, I will also write it down here for completeness: it turns out that the Spark workers have a different set of environment variables from the one available in the client session. In particular, PATH and ROOT_INCLUDE_PATH are completely removed on the Spark workers, which was also causing the absence of the root-config command.
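
A sketch (assuming a working PySpark session) of how that environment mismatch can be inspected, by comparing PATH and ROOT_INCLUDE_PATH as seen by the client and by the Spark workers:

import os
from pyspark import SparkContext

VARS = ("PATH", "ROOT_INCLUDE_PATH")

def env_snapshot(_=None):
    # Collect the variables of interest from the current process environment.
    return {var: os.environ.get(var) for var in VARS}

sc = SparkContext.getOrCreate()
print("client :", env_snapshot())
print("workers:", sc.parallelize(range(1), numSlices=1).map(env_snapshot).collect())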

@vepadulano added this to Issues in Fixed in 6.30/00 via automation Oct 5, 2023