
ROOT loads unneeded PCMs #13000

Open
1 task done
ktf opened this issue Jun 13, 2023 · 65 comments
Labels: bug, experiment (Affects an experiment / reported by its software & computing experts)


ktf commented Jun 13, 2023

Check duplicate issues.

  • Checked for duplicates

Describe the bug

When doing a simple TFile::Open(), ROOT initialises the interpreter on the first TEnv::GetValue needed by that method. This causes it to load all the PCMs it can find, regardless of whether they are needed to read the file (in our case we only have branches with basic types, e.g. floats, ints, or vectors thereof).
The net result is 70MB of extra RSS, for no particular reason, in any process which uses ROOT. You can find more details in the attached Instruments screenshot.
Moreover, single-stepping with the debugger and printing out the LoadModule argument, it seems that the PCMs for SOFIE, PROOF and 3D graphics are particularly onerous, none of which we need or link.

[Instruments allocation screenshot]

What is the expected behaviour?

The best would be for ROOT to be smart enough to lazily load the PCMs when they are actually needed, not upfront. A second-best option would be the ability to start the interpreter in a clean-slate state and load by hand only the PCMs needed to read a file.

How to reproduce?

On macOS using Instruments one can do:

xcrun xctrace record --output root.trace --time-limit 30s --template Allocations --launch -- /Users/ktf/src/sw/osx_arm64/ROOT/alice-v6-28-04-local2/bin/root.exe -q -b http://alimonitor.cern.ch/train-workdir/testdata/LFN/alice/data/2015/LHC15o/000246392/pass2/PWGZZ/Run3_Conversion/320_20220701-0813/AOD/019/AO2D.root
open root.trace

and get the report in question. Using the ALICE environment should not matter; in principle you should be able to load it from CVMFS.

ROOT version

6.28.04, but we have been seeing this for a while.

How did you install ROOT?

ALICE build environment

Which operating system are you using?

macOS, linux

Additional context

No response


Axel-Naumann commented Jun 13, 2023

Info from @ktf : with 100 processes linked against ROOT running on these machines that's an extra 70MB*100=7GB used by (unneeded?) PCMs.


vgvassilev commented Jun 14, 2023

The best would be for ROOT to be smart enough to lazily load the pcms when they are actually needed, not upfront.

Yes, and unfortunately that is as good as it gets with the current infrastructure. Here I've put a bit more details about what could be improved to further reduce the memory footprint: cms-sw/cmssw#41810 (comment)

Note that the loading of the pcm files is mmap-ed and is a no-op. The problem is that some sections of the PCMs are not lazy and are eagerly deserialized. Using a PCH won't make things much better, since the PCH still eagerly deserializes some sections. That is, the deficiency is not in ROOT but in Clang itself.

Could you rerun using export ROOTDEBUG=7 and attach the output? Sometimes we look up an identifier (like I) which is not found, and that triggers deserialization of entities. That can sometimes be avoided.


ktf commented Jun 14, 2023

I will provide you the output for ROOTDEBUG=7.

Could you elaborate a bit on the mmap? I would expect that using MAP_SHARED would be enough of a workaround for our (ALICE) use case, because what is really killing us is the fact the memory is not shared among the many processes we have. Any reason why that cannot be done? Did I misunderstand something?

@vgvassilev (Member):

I will provide you the output for ROOTDEBUG=7.

Could you elaborate a bit on the mmap?

The pcm files are "mmapped" when being opened. That practically reserves some virtual memory with the promise that it won't increase your rss unless something is actually needed from that file. Here is how this is achieved:

Buf = FileMgr.getBufferForFile(NewModule->File,

I believe your issue is that, due to some identifier lookup, we start loading pcm files which have sections requiring eager deserialization, where the mmap manifests as a real rss increase.

I would expect that using MAP_SHARED would be enough of a workaround for our (ALICE) use case, because what is really killing us is the fact the memory is not shared among the many processes we have. Any reason why that cannot be done? Did I misunderstand something?

The reason is that the serialization in Clang has deficiencies and reads from disk when a pcm file is loaded. I've been hunting down these cases and sometimes we could avoid them. That's why I was looking for some output that could help us do that.

That being said, we could make some effort to split the startup phase of ROOT into loading and initialization. Then we could move the registration of pcm files and setting up the ROOT's runtime as part of the initialization process but realistically, if you use ROOT for anything you'd probably need these both...


ktf commented Jun 14, 2023

Ok, so you are saying that it's not the mmapping that causes the problem, but the subsequent allocations due to some deserialisation. I checked, and indeed the mmap is called with MAP_SHARED...

Find attached the ROOTDEBUG=7 log. This is only for the component which reads the flat ROOT files (i.e. only basic types).

log.txt

@vgvassilev (Member):

According to the log ROOT is looking up Experimental (which is a namespace) and pulls in all modules that have namespace partitions with the name Experimental. We have tried to improve that with #10969 but probably missed some bit. Would applying it help?


ktf commented Jun 15, 2023

The behavior seems to have indeed changed, yet I still see the same amount of RSS being allocated in the same codepath. See attached log.

log.txt.gz


ktf commented Jun 19, 2023

Anything else which I can try here?

@vgvassilev (Member):

@ktf,

cat /Users/vvassilev/Downloads/log\ 2.txt | grep 'on demand'
[89725:internal-dpl-aod-reader]: Loading 'Core' on demand for 'TArrayF'
[89725:internal-dpl-aod-reader]: Loading 'Cling_Runtime' on demand for 'T'
[89725:internal-dpl-aod-reader]: Loading 'Smatrix' on demand for 'MatRepStd'
[89725:internal-dpl-aod-reader]: Loading 'MathCore' on demand for 'R'
[89725:internal-dpl-aod-reader]: Loading 'std' on demand for 'complex'
[89725:internal-dpl-aod-reader]: Loading 'GenVector' on demand for 'Cartesian2D'
[89725:internal-dpl-aod-reader]: Loading 'ROOT_Foundation_Stage1_NoRTTI' on demand for 'TVirtualIsAProxy'
[89725:internal-dpl-aod-reader]: Loading 'Tree' on demand for 'TIOFeatures'
[89725:internal-dpl-aod-reader]: Loading 'MultiProc' on demand for 'ReadBuffer'
[89725:internal-dpl-aod-reader]: Loading 'Hist' on demand for 'TH1'
[89725:internal-dpl-aod-reader]: Loading 'TreePlayer' on demand for 'TTreeFormula'
[89725:internal-dpl-aod-reader]: Loading 'RIO' on demand for 'TMapFile'
[89725:internal-dpl-aod-reader]: Loading 'ROOT_Rtypes' on demand for 'UChar_t'

Looks quite good. However, for some reason we pre-load a lot more modules than I expect. Do you have an up-to-date ROOTSYS/lib/modules.idx file?


ktf commented Jun 20, 2023

I have whatever is generated by building ROOT 6.28.4. I attached it. Is there a way to trim it down for a specific proxy?

modules.idx.zip


ktf commented Jun 22, 2023

Any further idea?


ktf commented Jun 27, 2023

Anything else we can try here? I would really love to get rid of the overhead from ROOT.

@vgvassilev (Member):

Anything else we can try here? I would really love to get rid of the overhead from ROOT.

Apologies for being slow, I have been attending a workshop these past several days... This is what ROOT (without the suggested patch) does for me:

ROOTDEBUG=7 root.exe -l -b -q 2>&1 | grep Preloading | wc -l
21
vvassilev@vv-nuc /build/vvassilev/root_build $ ROOTDEBUG=7 root.exe -l -b -q 2>&1 | grep 'on demand' | cut -d' ' -f 2 | sort | uniq | wc -l
35

Essentially it loads 56 pcm files at startup. That's still pretty bad but not 100.

/usr/bin/time -v root.exe -l -b -q shows
User time (seconds): 0.08
System time (seconds): 0.04
Maximum resident set size (kbytes): 179344

Still probably too much memory being used, but that's what the current state of the art in modules technology allows us.

The numbers are not great, but I cannot reproduce the pcm loads that you report. Can you prepare some debug environment for me to check out? I will need a standard build of ROOT, perhaps with -DCMAKE_BUILD_TYPE=Debug -DLLVM_BUILD_TYPE=Debug?


ktf commented Jun 28, 2023

TL;DR: I repeated your tests with our environment; see below for the actual setup. Without the ROOT file being opened, I get more preloaded dictionaries (119) but roughly the same memory footprint. Adding a ROOT file on the command line adds an additional 60MB, and I see SOFIE and PyMVA being loaded on demand (see also #13055).

Long story:

You can set up the same environment by going to lxplus and doing:

→ ALIBUILD_ARCH_PREFIX="Packages" WORK_DIR=/cvmfs/alice.cern.ch/el7-x86_64 . /cvmfs/alice.cern.ch/el7-x86_64/Packages/ROOT/v6-28-04-9/etc/profile.d/init.sh
/cvmfs/alice.cern.ch/el7-x86_64/Packages/AliEn-Runtime/v2-19-le-136/etc/profile.d/init.sh:source:7: no such file or directory: /cvmfs/alice.cern.ch/el7-x86_64/Packages/ApMon-CPP/v2.2.8-alice5-40/etc/profile.d/init.sh

You can safely ignore the ApMon message.

Without the root file, I still see over one hundred modules preloaded, while the "on demand" count and the memory usage are roughly the same.

# eulisse at lxplus707.cern.ch in ~ [9:39:17]
→ ROOTDEBUG=7 root.exe -l -b -q 2>&1 | grep Preloading | wc -l
119

# eulisse at lxplus707.cern.ch in ~ [9:39:24]
→ ROOTDEBUG=7 root.exe -l -b -q 2>&1 | grep 'on demand' | cut -d' ' -f 2 | sort | uniq | wc -l

34

# eulisse at lxplus707.cern.ch in ~ [9:39:57]
→ /usr/bin/time -v root.exe -l -b -q

        Command being timed: "root.exe -l -b -q"
        User time (seconds): 0.22
        System time (seconds): 0.26
        Percent of CPU this job got: 62%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.78
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 139016
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 56647
        Voluntary context switches: 5479
        Involuntary context switches: 23
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

With the root file being loaded, the preloading stays almost the same, while "on demand" goes to 41. This also results in a 60MB jump in memory:

→ ROOTDEBUG=7 root.exe -l -b -q ~/public/AO2D.root 2>&1 | grep Preloading | wc -l
120

→ ROOTDEBUG=7 root.exe -l -b -q ~/public/AO2D.root 2>&1 | grep 'on demand' | cut -d' ' -f 2 | sort | uniq | wc -l
41

→ /usr/bin/time -v root.exe -l -b -q ~/public/AO2D.root

Attaching file /afs/cern.ch/user/e/eulisse/public/AO2D.root as _file0...
(TFile *) 0x3910f20
Command exited with non-zero status 255
        Command being timed: "root.exe -l -b -q /afs/cern.ch/user/e/eulisse/public/AO2D.root"
        User time (seconds): 0.42
        System time (seconds): 0.30
        Percent of CPU this job got: 72%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.01
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 207928
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 76766
        Voluntary context switches: 9921
        Involuntary context switches: 28
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 255

The difference indeed seems to come from loading SOFIE and PyMVA on demand (why?).

→ diff -Naur =(ROOTDEBUG=7 root.exe -l -b -q ~/public/AO2D.root 2>&1 | grep Preloading) =(ROOTDEBUG=7 root.exe -l -b -q  2>&1 | grep Preloading)
--- /tmp/zsh3krahi      2023-06-28 09:47:35.214307973 +0200
+++ /tmp/zshDEGUqN      2023-06-28 09:47:35.843315522 +0200
@@ -117,4 +117,3 @@
 Info in <TCling::__LoadModule>: Preloading module MultiProc.
 Info in <TCling::__LoadModule>: Preloading module Imt.
 Info in <TCling::__LoadModule>: Preloading module MathCore.
-Info in <TCling::__LoadModule>: Preloading module Tree.
→ diff -Naur =(ROOTDEBUG=7 root.exe -l -b -q ~/public/AO2D.root 2>&1 | grep 'on demand' | cut -d' ' -f 2 | sort | uniq) =(ROOTDEBUG=7 root.exe -l -b -q 2>&1 | grep 'on demand' | cut -d' ' -f 2 | sort | uniq)
--- /tmp/zshD2FXJ0      2023-06-28 09:48:55.414270426 +0200
+++ /tmp/zsh1jCtJt      2023-06-28 09:48:56.293280974 +0200
@@ -1,5 +1,4 @@
 'Cling_Runtime'
-'Cling_Runtime_Extra'
 'Core'
 'Foam'
 'Genetic'
@@ -9,7 +8,6 @@
 'Hist'
 'HistFactory'
 'Imt'
-'libc'
 'MathCore'
 'MathMore'
 'Minuit'
@@ -17,7 +15,6 @@
 'MultiProc'
 'Net'
 'NetxNG'
-'PyMVA'
 'Rint'
 'RIO'
 'RooFitCore'
@@ -26,16 +23,12 @@
 'RooStats'
 'ROOTDataFrame'
 'ROOT_Foundation_Stage1_NoRTTI'
-'ROOT_Rtypes'
-'ROOTTMVASofie'
-'ROOTTMVASofieParser'
 'ROOTTreeViewer'
 'ROOTVecOps'
 'ROOTWebDisplay'
 'Smatrix'
 'std'
 'Thread'
-'TMVA'
 'Tree'
 'TreePlayer'
 'XMLIO'

Full list of modules being preloaded can be found below.

Info in <TCling::__LoadModule>: Preloading module libc.
Info in <TCling::__LoadModule>: Preloading module std.
Info in <TCling::__LoadModule>: Preloading module _Builtin_intrinsics.
Info in <TCling::__LoadModule>: Preloading module ROOT_Foundation_C.
Info in <TCling::__LoadModule>: Preloading module ROOT_Config.
Info in <TCling::__LoadModule>: Preloading module ROOT_Rtypes.
Info in <TCling::__LoadModule>: Preloading module ROOT_Foundation_Stage1_NoRTTI.
Info in <TCling::__LoadModule>: Preloading module Core.
Info in <TCling::__LoadModule>: Preloading module Rint.
Info in <TCling::__LoadModule>: Preloading module RIO.
Info in <TCling::__LoadModule>: Preloading module MathCore.
Info in <TCling::__LoadModule>: Preloading module Hist.
Info in <TCling::__LoadModule>: Preloading module Unfold.
Info in <TCling::__LoadModule>: Preloading module RHTTPSniff.
Info in <TCling::__LoadModule>: Preloading module Imt.
Info in <TCling::__LoadModule>: Preloading module HistPainter.
Info in <TCling::__LoadModule>: Preloading module PyMVA.
Info in <TCling::__LoadModule>: Preloading module RMySQL.
Info in <TCling::__LoadModule>: Preloading module RHTTP.
Info in <TCling::__LoadModule>: Preloading module FitPanel.
Info in <TCling::__LoadModule>: Preloading module ROOTVecOps.
Info in <TCling::__LoadModule>: Preloading module ProofDraw.
Info in <TCling::__LoadModule>: Preloading module Quadp.
Info in <TCling::__LoadModule>: Preloading module GeomPainter.
Info in <TCling::__LoadModule>: Preloading module ROOT_Foundation_Stage1_NoRTTI.
Info in <TCling::__LoadModule>: Preloading module xlocale.h.
Info in <TCling::__LoadModule>: Preloading module Genetic.
Info in <TCling::__LoadModule>: Preloading module Eve.
Info in <TCling::__LoadModule>: Preloading module TreeViewer.
Info in <TCling::__LoadModule>: Preloading module Physics.
Info in <TCling::__LoadModule>: Preloading module ROOTTMVASofieParser.
Info in <TCling::__LoadModule>: Preloading module EG.
Info in <TCling::__LoadModule>: Preloading module Tree.
Info in <TCling::__LoadModule>: Preloading module HistFactory.
Info in <TCling::__LoadModule>: Preloading module Spectrum.
Info in <TCling::__LoadModule>: Preloading module Monalisa.
Info in <TCling::__LoadModule>: Preloading module Matrix.
Info in <TCling::__LoadModule>: Preloading module Html.
Info in <TCling::__LoadModule>: Preloading module Hist.
Info in <TCling::__LoadModule>: Preloading module GuiHtml.
Info in <TCling::__LoadModule>: Preloading module _Builtin_intrinsics.
Info in <TCling::__LoadModule>: Preloading module Gpad.
Info in <TCling::__LoadModule>: Preloading module TMVAGui.
Info in <TCling::__LoadModule>: Preloading module Postscript.
Info in <TCling::__LoadModule>: Preloading module RGL.
Info in <TCling::__LoadModule>: Preloading module Fumili.
Info in <TCling::__LoadModule>: Preloading module Net.
Info in <TCling::__LoadModule>: Preloading module Geom.
Info in <TCling::__LoadModule>: Preloading module RooFitMore.
Info in <TCling::__LoadModule>: Preloading module X3d.
Info in <TCling::__LoadModule>: Preloading module Cling_Runtime.
Info in <TCling::__LoadModule>: Preloading module EGPythia6.
Info in <TCling::__LoadModule>: Preloading module SPlot.
Info in <TCling::__LoadModule>: Preloading module RooFit.
Info in <TCling::__LoadModule>: Preloading module RCsg.
Info in <TCling::__LoadModule>: Preloading module ROOT_Config.
Info in <TCling::__LoadModule>: Preloading module Rint.
Info in <TCling::__LoadModule>: Preloading module XMLParser.
Info in <TCling::__LoadModule>: Preloading module MultiProc.
Info in <TCling::__LoadModule>: Preloading module RooStats.
Info in <TCling::__LoadModule>: Preloading module RooFitRDataFrameHelpers.
Info in <TCling::__LoadModule>: Preloading module GeomBuilder.
Info in <TCling::__LoadModule>: Preloading module Proof.
Info in <TCling::__LoadModule>: Preloading module Ged.
Info in <TCling::__LoadModule>: Preloading module Recorder.
Info in <TCling::__LoadModule>: Preloading module GuiBld.
Info in <TCling::__LoadModule>: Preloading module ROOTWebDisplay.
Info in <TCling::__LoadModule>: Preloading module RooFitCore.
Info in <TCling::__LoadModule>: Preloading module Gui.
Info in <TCling::__LoadModule>: Preloading module SQLIO.
Info in <TCling::__LoadModule>: Preloading module XMLIO.
Info in <TCling::__LoadModule>: Preloading module ROOT_Rtypes.
Info in <TCling::__LoadModule>: Preloading module std.
Info in <TCling::__LoadModule>: Preloading module RIO.
Info in <TCling::__LoadModule>: Preloading module GX11TTF.
Info in <TCling::__LoadModule>: Preloading module ROOTTMVASofie.
Info in <TCling::__LoadModule>: Preloading module ProofPlayer.
Info in <TCling::__LoadModule>: Preloading module ASImage.
Info in <TCling::__LoadModule>: Preloading module libc.
Info in <TCling::__LoadModule>: Preloading module ROOT_Foundation_C.
Info in <TCling::__LoadModule>: Preloading module MathMore.
Info in <TCling::__LoadModule>: Preloading module RooFitHS3.
Info in <TCling::__LoadModule>: Preloading module Foam.
Info in <TCling::__LoadModule>: Preloading module SpectrumPainter.
Info in <TCling::__LoadModule>: Preloading module Minuit2.
Info in <TCling::__LoadModule>: Preloading module Core.
Info in <TCling::__LoadModule>: Preloading module MLP.
Info in <TCling::__LoadModule>: Preloading module ROOTDataFrame.
Info in <TCling::__LoadModule>: Preloading module GenVector.
Info in <TCling::__LoadModule>: Preloading module _Builtin_stddef_max_align_t.
Info in <TCling::__LoadModule>: Preloading module NetxNG.
Info in <TCling::__LoadModule>: Preloading module Minuit.
Info in <TCling::__LoadModule>: Preloading module RootAuth.
Info in <TCling::__LoadModule>: Preloading module TMVA.
Info in <TCling::__LoadModule>: Preloading module Graf3d.
Info in <TCling::__LoadModule>: Preloading module ASImageGui.
Info in <TCling::__LoadModule>: Preloading module Graf.
Info in <TCling::__LoadModule>: Preloading module GX11.
Info in <TCling::__LoadModule>: Preloading module Gdml.
Info in <TCling::__LoadModule>: Preloading module ROOTTreeViewer.
Info in <TCling::__LoadModule>: Preloading module ProofBench.
Info in <TCling::__LoadModule>: Preloading module Cling_Runtime_Extra.
Info in <TCling::__LoadModule>: Preloading module MathCore.
Info in <TCling::__LoadModule>: Preloading module Gviz3d.
Info in <TCling::__LoadModule>: Preloading module WebGui6.
Info in <TCling::__LoadModule>: Preloading module ROOTTPython.
Info in <TCling::__LoadModule>: Preloading module TreePlayer.
Info in <TCling::__LoadModule>: Preloading module RooFitJSONInterface.
Info in <TCling::__LoadModule>: Preloading module Smatrix.
Info in <TCling::__LoadModule>: Preloading module SessionViewer.
Info in <TCling::__LoadModule>: Preloading module Thread.
Info in <TCling::__LoadModule>: Preloading module Core.
Info in <TCling::__LoadModule>: Preloading module Rint.
Info in <TCling::__LoadModule>: Preloading module Thread.
Info in <TCling::__LoadModule>: Preloading module RIO.
Info in <TCling::__LoadModule>: Preloading module Net.
Info in <TCling::__LoadModule>: Preloading module MultiProc.
Info in <TCling::__LoadModule>: Preloading module Imt.
Info in <TCling::__LoadModule>: Preloading module MathCore.

while the full list of on demand modules is:

'Cling_Runtime'
'Cling_Runtime_Extra'
'Core'
'Foam'
'Genetic'
'GenVector'
'Geom'
'Gpad'
'Hist'
'HistFactory'
'Imt'
'libc'
'MathCore'
'MathMore'
'Minuit'
'Minuit2'
'MultiProc'
'Net'
'NetxNG'
'PyMVA'
'Rint'
'RIO'
'RooFitCore'
'RooFitHS3'
'RooFitJSONInterface'
'RooStats'
'ROOTDataFrame'
'ROOT_Foundation_Stage1_NoRTTI'
'ROOT_Rtypes'
'ROOTTMVASofie'
'ROOTTMVASofieParser'
'ROOTTreeViewer'
'ROOTVecOps'
'ROOTWebDisplay'
'Smatrix'
'std'
'Thread'
'TMVA'
'Tree'
'TreePlayer'
'XMLIO'

the file used can be found in /afs/cern.ch/user/e/eulisse/public/AO2D.root.


ktf commented Jun 28, 2023

Ok, indeed SOFIE gets loaded because of the Experimental namespace, I guess. I do not see it with your patch. I will try to do a full build on CVMFS with it applied.


ktf commented Jun 30, 2023

Could it be that the issue is TBranch now has Experimental::Internal::TBulkBranchRead as a member?

@vgvassilev (Member):

Could it be that the issue is TBranch now has Experimental::Internal::TBulkBranchRead as a member?

Yes! Very likely.

@vgvassilev (Member):

@ktf, I was just on a call with @hahnjo and it seems we can re-introduce a memory optimization that we lost in the last llvm upgrade. That is, a version of 5280ade.

ktf added a commit to ktf/root that referenced this issue Jun 30, 2023
As described in root-project#13000,
ROOT will happily load all the PCMs related to the Experimental
namespace, just because such a namespace is found as a transient member
of TBranch.

This results in a non-negligible amount of private memory being used,
due to the fact that a number of Experimental features will see their
PCMs deserialised.

ktf commented Jun 30, 2023

Could it be that the issue is TBranch now has Experimental::Internal::TBulkBranchRead as a member?

Yes! Very likely.

That's actually it. If I remove the Experimental namespace and simply use Internal::TBulkBranchRead, it does not load the other Experimental bits either.


ktf commented Jul 1, 2023

and I just realised that the "big modules" are not necessarily big; they just happen to trigger the resizing of some pool. 5280ade seems to give 4-5 MB: nothing dramatic.

@vgvassilev (Member):

Could it be that the issue is TBranch now has Experimental::Internal::TBulkBranchRead as a member?

Yes! Very likely.

That's actually it. If I remove the Experimental namespace and simply use Internal::TBulkBranchRead, it does not load the other Experimental bits either.

However, that in practice shows a more generic design problem in the clang modules system. That is, when we look up a namespace identifier, clang (rightfully) tries to collect all namespace partitions from all reachable modules. This has to do with things like overload resolution. Due to the autoloading system, ROOT essentially considers all modules reachable and thus loads them. The only reliable way to fix this is to make the clang module loads a no-op, which is a bit of a challenge…

@vgvassilev (Member):

and I just realised that the "big modules" are not necessarily big, they just happen to trigger the resizing of some pool. 5280ade seems to give 4 / 5 MB: nothing dramatic.

That pool resizing has to do with source locations, right? Can you show the stack trace?


ktf commented Jul 1, 2023

That pool resizing has to do with source locations, right? Can you show the stack trace?

Apparently so (assuming SLoc is source locations):

  12.22 MB     start		
  12.22 MB     main	/Users/ktf/src/sw/SOURCES/ROOT/alice-v6-28-04/0/main/src/rmain.cxx:84	
  12.22 MB     TRint::TRint(char const*, int*, char**, void*, int, bool, bool)	/Users/ktf/src/sw/SOURCES/ROOT/alice-v6-28-04/0/core/rint/src/TRint.cxx:175	
  12.22 MB     TApplication::TApplication(char const*, int*, char**, void*, int)	/Users/ktf/src/sw/SOURCES/ROOT/alice-v6-28-04/0/core/base/src/TApplication.cxx:153	
  12.22 MB     ROOT::Internal::GetROOT2()	/Users/ktf/src/sw/SOURCES/ROOT/alice-v6-28-04/0/core/base/src/TROOT.cxx:385	
  12.22 MB     TROOT::InitInterpreter()	/Users/ktf/src/sw/SOURCES/ROOT/alice-v6-28-04/0/core/base/src/TROOT.cxx:2072	
  12.22 MB     CreateInterpreter	/Users/ktf/src/sw/SOURCES/ROOT/alice-v6-28-04/0/core/metacling/src/TCling.cxx:614	
  12.22 MB     TCling::TCling(char const*, char const*, char const* const*, void*)	/Users/ktf/src/sw/SOURCES/ROOT/alice-v6-28-04/0/interpreter/cling/include/cling/Interpreter/Interpreter.h:0	
  12.22 MB     cling::Interpreter::Interpreter(int, char const* const*, char const*, std::__1::vector<std::__1::shared_ptr<clang::ModuleFileExtension>, std::__1::allocator<std::__1::shared_ptr<clang::ModuleFileExtension>>> const&, void*, bool, cling::Interpreter const*)		
  12.22 MB     cling::Interpreter::Initialize(bool, bool, llvm::SmallVectorImpl<llvm::StringRef>&)		
  12.21 MB     cling::Interpreter::declare(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, cling::Transaction**)		
  12.21 MB     cling::IncrementalParser::Compile(llvm::StringRef, cling::CompilationOptions const&)		
  12.21 MB     cling::IncrementalParser::ParseInternal(llvm::StringRef)		
  12.21 MB     clang::Parser::ParseTopLevelDecl(clang::OpaquePtr<clang::DeclGroupRef>&, bool)		
  12.21 MB     clang::Preprocessor::Lex(clang::Token&)		
  12.21 MB     clang::Lexer::LexTokenInternal(clang::Token&, bool)		
  12.21 MB     clang::Preprocessor::HandleDirective(clang::Token&)		
  12.21 MB     clang::Preprocessor::HandleIncludeDirective(clang::SourceLocation, clang::Token&, clang::DirectoryLookup const*, clang::FileEntry const*)		
  12.21 MB     clang::Preprocessor::HandleHeaderIncludeOrImport(clang::SourceLocation, clang::Token&, clang::Token&, clang::SourceLocation, clang::DirectoryLookup const*, clang::FileEntry const*)		
  12.21 MB     clang::CompilerInstance::loadModule(clang::SourceLocation, llvm::ArrayRef<std::__1::pair<clang::IdentifierInfo*, clang::SourceLocation>>, clang::Module::NameVisibilityKind, bool)		
  12.21 MB     clang::CompilerInstance::findOrCompileModuleAndReadAST(llvm::StringRef, clang::SourceLocation, clang::SourceLocation, bool)		
  12.21 MB     clang::ASTReader::ReadAST(llvm::StringRef, clang::serialization::ModuleKind, clang::SourceLocation, unsigned int, llvm::SmallVectorImpl<clang::ASTReader::ImportedSubmodule>*)		
  12.15 MB     clang::ASTReader::ReadASTBlock(clang::serialization::ModuleFile&, unsigned int)		
   8.20 MB     clang::SourceManager::AllocateLoadedSLocEntries(unsigned int, unsigned int)		
   8.20 MB     llvm::SmallVectorBase<unsigned int>::grow_pod(void*, unsigned long, unsigned long)		
   8.20 MB     realloc		

However, I also see other bumps. For example, I see 14MB which seems to be allocated for QualType, whatever that means.

[Instruments screenshot of the QualType allocations]

@vgvassilev (Member):

8.20 MB clang::SourceManager::AllocateLoadedSLocEntries(unsigned int, unsigned int)

Yes, that's what I was talking about. That's the relevant line of code that we need to "fix" in clang. We try hard to avoid this operation by delaying module loads as long as possible using the global module index (modules.idx) file.

However, the QualType thing is new to me. Do you think we go through here? One can see the module file contents by running in ROOT:

make llvm-bcanalyzer
llvm-bcanalyzer $ROOTSYS/lib/some_file.pcm


ktf commented Jul 2, 2023

If I understand correctly from:

for x in lib/*.pcm; do cat $x | ./llvm-bcanalyzer | grep SPECIAL_TYPES; done | wc -l

every single PCM we have has 137 SPECIAL_TYPES. See the attached dump of RooStats.pcm, which seems to be responsible for the bump.

I am now trying to do a debug build to check if we go through the line you suggested.


hahnjo commented Jul 7, 2023

Do you mean #10969? We tried something along these lines, but see the verdict...

Yes, we need the namespaces in the GMI for things like ROOT::RDF. But if the AST already knows about a namespace, I'm not sure if we actually need to do anything when presented with that DeclarationName...

@vgvassilev (Member):

https://cernbox.cern.ch/s/J7w2JH0D0YXdcyA for the 1.6GB log...

It seems we are hit by code with a lot of templates. Very roughly, the *SpecializationType entries are numerous and quite big in terms of memory representation:

vvassilev@vv-nuc /tmp $ cat log | grep Type | wc -l
10165751
vvassilev@vv-nuc /tmp $ cat log | grep SpecializationType | wc -l
566599

I am not sure what we could do. So far I cannot find a recurring pattern, except for a bunch of instantiations that might be part of libstdc++ (e.g. std.pcm); however, I cannot confirm that's an issue, since the printing does not include where a subtype comes from. I guess I am stuck and I might need to disturb @zygoloid again...


ktf commented Jul 10, 2023

Ok, let me know if you have other tests which you would like me to do.

One more question for my understanding: are those big vectors actually needed at all when reading a file? If not, why aren't they swapped out to disk, either implicitly via some madvise or by simply caching the results to disk? The latter might even give some startup advantage, assuming they are offsets into the pcms and do not change (not for the basic ROOT ones, at least).

@vgvassilev (Member):

Ok, let me know if you have other tests which you would like me to do.

I was a bit surprised to learn that:

Thank you for the nice investigations. What are the next steps? I tried #13139 but that does not help us. I see a different memory profile, but the extra memory due to ROOT is still around 70MB (even a bit more with that PR, AFAICT).

Is that with or without opening the file? My expectation is that root.exe -l -b -q will take a lot less rss.

ping.

One more question for my understanding: are those big vectors actually needed at all when reading a file? If not, why aren't they swapped out to disk, either implicitly via some madvise or by simply caching the results to disk? The latter might even give some startup advantage, assuming they are offsets into the pcms and do not change (not for the basic ROOT ones, at least).

They are not needed upon module loads. We tried to implement some sparse vectors to mitigate this issue; however, I feel like we did not have the persistence to make it work. That is, we can implement a lazy vector structure that behaves just like vector, except that the reserve operation does not allocate. Instead, we allocate slabs around the access pattern (we override the subscript operator, etc.). This would help us avoid eager allocations and, hopefully, if there is a clear access pattern, optimize things. Our implementation seems to be still here.

Another easier thing to try is reviving https://reviews.llvm.org/D89749

@ktf
Copy link
Contributor Author

ktf commented Jul 11, 2023

Is that with or without opening the file? My expectation is that root.exe -l -b -q will take a lot less rss.
ping.

With the file.

@ktf
Copy link
Contributor Author

ktf commented Jul 11, 2023

To be clear, as I said above, the profile changed; however, the total sum for cling initialisation is still at 70 MB.

@vgvassilev
Copy link
Member

Is that with or without opening the file? My expectation is that root.exe -l -b -q will take a lot less rss.
ping.

With the file.

I meant to ping on the stats without a file.

To be clear, as I said above, the profile changed; however, the total sum for cling initialisation is still at 70 MB.

Can you share the new profile?

@ktf
Copy link
Contributor Author

ktf commented Sep 12, 2023

So I finally managed to get back to this and try a few things; as a result, in my tests I managed to go from 82 MB of overhead to 62 MB. See #13641 for the actual code.

While there are still a few things to clean up, I think what I have is a fairly general and quite self-contained change, so I would appreciate feedback and, if people like it, instructions on how to proceed to make sure this ends up upstream.

The solution is based on a newly introduced PagedVector<T> which allows registering callbacks which get invoked whenever the associated range is accessed the first time. At such point, the range is actually allocated and each element of the range is passed to the callback.

Initially I developed the mechanism to be general enough to do complex initialisations of the elements, but I later realised this is not actually needed. In order to mitigate the issue with the large vectors TypesLoaded and DeclsLoaded, one only needs something which does the default construction of their elements, for ranges small enough to avoid too-large upfront allocations. The code which is currently there in root / clang is already smart enough to lazily initialise only the elements which are actually needed, so the whole issue is to find a good tradeoff between the number of ranges and keeping them sparse enough.

A few caveats:

  • PagedVector is a bad name, I agree. Suggestions are welcome.
  • At the moment the implementation allows generic callbacks for ranges, however as I said we only need default initialisation. Removing such flexibility should buy another 1 MB from my test.
  • In case DeclID* and QualType can default to all 0, one could probably achieve the same result in a more elegant way using calloc / realloc.
  • There are probably other vectors which can be optimised in the same way, in particular IdentifiersLoaded as already mentioned by @vgvassilev.

Comments?

@vgvassilev
Copy link
Member

vgvassilev commented Sep 12, 2023

Comments?

This looks awesome!

A few caveats:

* `PagedVector` is a bad name, I agree. Suggestions are welcome.

We did not come up with a better name. Ours was SparseVector. Just to confirm, we do not pre-allocate the entire vector but we deterministically allocate the slabs. That is, if I hardcoded an offset 101 in my code and the pre-allocation allocated just a slab [0-100), then when I do PagedVector[101] it would put what I need where I need it, right?

* At the moment the implementation allows generic callbacks for ranges, however as I said we only need default initialisation. Removing such flexibility should buy another 1 MB from my test.

If it is not needed, let's keep it simple. It would be easier to go through the upstream llvm reviews.

* In case `DeclID*` and `QualType` can default to all 0, one could probably achieve the same result in a more elegant way using calloc / realloc.

Can we make the API of the PagedVector such that it becomes a drop-in replacement? I have not looked deeply in the PR but I see some work being done on the users side.

* There are probably other vectors which can be optimised in the same way, in particular IdentifiersLoaded as already mentioned by @vgvassilev.

The source locations offset would be a major source of improvement if this technique flies there.

PP: It seems llvm has some facility along these lines: https://llvm.org/doxygen/classllvm_1_1SparseBitVector.html

@ktf
Copy link
Contributor Author

ktf commented Sep 13, 2023

We did not come up with a better name. Ours was SparseVector. Just to confirm, we do not pre-allocate the entire vector but we deterministically allocate the slabs. That is, if I hardcoded an offset 101 in my code and the pre-allocation allocated just a slab [0-100), then when I do PagedVector[101] it would put what I need where I need it, right?

No, you would need first to register the range [100, 200) via addEmptyRange(). I could try to modify it for that behaviour, if you prefer. Maybe it would allow me to get rid of the logarithmic lookup, actually (basically, one less vector to worry about).

If it is not needed, let's keep it simple. It would be easier to go through the upstream llvm reviews.

Ok, I will clean it up a bit more.

Can we make the API of the PagedVector such that it becomes a drop-in replacement? I have not looked deeply in the PR but I see some work being done on the users side.

Yes, users would need to call addEmptyRange() where they now do resize(). I guess I could actually hide addEmptyRange inside the resize. Is it allowed to use std::pmr::vector in the llvm codebase?

The source locations offset would be a major source of improvement if this technique flies there.

I couldn't find the source locations vector anymore. Could you point it to me?

PP: It seems llvm has some facility along these lines: https://llvm.org/doxygen/classllvm_1_1SparseBitVector.html

I will have a look.

@vgvassilev
Copy link
Member

Yes, users would need to addEmptyRange where they now do resize(). I guess I could actually hide addEmptyRange inside the resize. Is it allowed to use std::pmr::vector in the llvm codebase?

I don't think so. I could not find uses of it in clang.

The source locations offset would be a major source of improvement if this technique flies there.

I couldn't find the source locations vector anymore. Could you point it to me?

From here you will need to jump to the clang::SourceManager object.

@ktf
Copy link
Contributor Author

ktf commented Sep 13, 2023

@vgvassilev I have updated the PR to include a similar patch for the SourceManager and at least the trivial test seems to work fine, including a nice 9MB reduction in memory allocations. I am now testing with the ALICE analysis framework.

I have also done a few of the cleanups, and it now only exposes an "expand" interface (basically resize without shrinking). AFAICT, it's not worth implementing the full "resize" functionality, given that it's complicated and at least the places I fixed never shrink.

I also think the calloc approach might not be a good idea, given that realloc does not guarantee that the newly added memory is zeroed; besides, page ranges might be a tad too big for vectors of pointers, and so on. I would put that idea aside, at least for now.

My current plan:

  • Verify with larger tests using ALICE analysis and asynchronous reconstruction.
  • Unify the two PagedVector implementations I currently have (one for ASTReader and one for SourceManager).
  • Use fixed size ranges, so that I can get rid of the ranges in my internal index and just keep where each page was allocated.

@ktf
Copy link
Contributor Author

ktf commented Sep 13, 2023

For the record, the new top offender is actually findOrCreateModule.

image

@vgvassilev
Copy link
Member

For the record, the new top offender is actually findOrCreateModule.

image

That's probably the loading of module std.pcm...

@ktf
Copy link
Contributor Author

ktf commented Sep 14, 2023

For the record, I have done what I outlined above and updated the PR. I do see some drastic improvements for some of our workflows (250MB out of 1GB) where we have many processes initialising the interpreter. For others, where the usage of ROOT is limited to reading files in a single process, the improvement is not so obvious. Simply opening a file does show improvement as well (I am down to 49MB, when I also patch the FileInfo vector in the HeaderSearch).

The PR somehow seems to die with some old memory corruption which I am pretty sure I fixed and I cannot reproduce anymore. Is there any need to clean some cache?

Notice I have also submitted the patch to llvm itself and it passes their CI (llvm/llvm-project#66430 (comment))

@hahnjo
Copy link
Member

hahnjo commented Sep 15, 2023

The PR somehow seems to die with some old memory corruption which I am pretty sure I fixed and I cannot reproduce anymore. Is there any need to clean some cache?

I cannot reproduce it either, but a number of Linux builds agree that the assertion can still be triggered. PRs should be built from a clean state (in Jenkins), but even if not you are not changing the on-disk format so incremental builds must also work correctly. This needs investigation.

Notice I have also submitted the patch to llvm itself and it passes their CI (llvm/llvm-project#66430 (comment))

Very likely the pre-merge CI does not test a large scale module setup. You will have to test this yourself and prove that the changes don't regress the case when modules are used (almost) completely. This will be needed to convince the wider community that it is solving an actual problem, or at least not regressing the "normal" compiler use case.

@ktf
Copy link
Contributor Author

ktf commented Sep 15, 2023

I cannot reproduce it either, but a number of Linux builds agree that the assertion can still be triggered. PRs should be built from a clean state (in Jenkins), but even if not you are not changing the on-disk format so incremental builds must also work correctly. This needs investigation.

What I think is happening is that the broken PR generated some broken file, and the new PR is not able to survive such broken file. IMHO, this is a problem with the build setup (not being able to notice broken files and delete them) more than with my PR (which of course has no dependency on some intermediate state). Shall I just open a new PR and we see how that goes?

@ktf
Copy link
Contributor Author

ktf commented Sep 15, 2023

Very likely the pre-merge CI does not test a large scale module setup. You will have to test this yourself and prove that the changes don't regress the case when modules are used (almost) completely. This will be needed to convince the wider community that it is solving an actual problem, or at least not regressing the "normal" compiler use case.

I'll do what it takes, thank you for pointing that out.

@hahnjo
Copy link
Member

hahnjo commented Sep 15, 2023

I cannot reproduce it either, but a number of Linux builds agree that the assertion can still be triggered. PRs should be built from a clean state (in Jenkins), but even if not you are not changing the on-disk format so incremental builds must also work correctly. This needs investigation.

What I think is happening is that the broken PR generated some broken file, and the new PR is not able to survive such broken file. IMHO, this is a problem with the build setup (not being able to notice broken files and delete them) more than with my PR (which of course has no dependency on some intermediate state). Shall I just open a new PR and we see how that goes?

I'm telling you that this is not the case - there are no left-behind files (in Jenkins) that influence future PR runs. We know this from changing the on-disk format / upgrading LLVM / etc. There is a problem and it needs to be investigated.

@dpiparo
Copy link
Member

dpiparo commented Feb 5, 2024

I think quite some improvements landed in 6.30.04 (released) and master. @ktf is it difficult for you to check whether the issue is fixed and, if so, close the item? If not, I can start from your reproducer and proceed.

@ktf
Copy link
Contributor Author

ktf commented Feb 5, 2024

Indeed the patch I provided cuts half of the overhead, however there is still 40 MB per process that I cannot really justify at the moment. In our case that translates to 4 GB of RSS. While I appreciate that being completely lazy in the PCM loading is probably complicated, maybe some tactical solution could be employed (like it was done for the PagedVector). For example, I am not convinced ReadSLocEntry needs to keep the buffer around. There are moreover a few more places where the PagedVector could be used effectively; I will try to propose a separate PR for that.

The reproducer is as easy as opening a ROOT file, see the main issue. A new profile is:

image

@vgvassilev
Copy link
Member

I am wondering if we are not pushing repetitive pragma diagnostic mappings. We have tons of pragmas coming from the Linkdef files; I am not sure if they are needed in the module file itself...

@vgvassilev
Copy link
Member

In theory, if the software stack (not just ROOT) uses modules the RSS should also go down. Maybe that's worth trying on your end...

@ktf
Copy link
Contributor Author

ktf commented Feb 5, 2024

I am wondering if we are not pushing repetitive pragma diagnostic mappings. We have tons of pragmas coming from the Linkdef files; I am not sure if they are needed in the module file itself...

Any way I can confirm this?

@vgvassilev
Copy link
Member

I am wondering if we are not pushing repetitive pragma diagnostic mappings. We have tons of pragmas coming from the Linkdef files; I am not sure if they are needed in the module file itself...

Any way I can confirm this?

Not directly. Either we can run rootcling with -v4 and count the mappings, or we can insert some printing statements in ::WritePragmaDiagnosticMapping... I suspect making the pragma diagnostic mappings lazy would be art...
