- 
                Notifications
    You must be signed in to change notification settings 
- Fork 1.4k
[df] Change default Snapshot compression settings #20030
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[df] Change default Snapshot compression settings #20030
Conversation
cbda4a1    to
    81c2ac0      
    Compare
  
    There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
| Test Results    22 files      22 suites   3d 23h 3m 48s ⏱️ For more details on these failures, see this check. Results for commit 4d8eb60. ♻️ This comment has been updated with latest results. | 
81c2ac0    to
    7c5c04f      
    Compare
  
    | @pcanal I am now slightly in doubt about this current approach. For the information message to be actually visible, a user would have to explicitly activate the "info" logging in their application by writing #include <ROOT/RLogger.hxx>
// this increases RDF's verbosity level as long as the `verbosity` variable is in scope
auto verbosity = ROOT::RLogScopedVerbosity(ROOT::Detail::RDF::RDFLogChannel(), ROOT::ELogLevel::kInfo);Whereas I believe our goal here is to show the message once per application anytime the user is writing at least one Snapshot call in their program. By turning the "info" to a "warning", we would get that behaviour. | 
| I think a Warning is fine, considering that you are warning the user about the change in behavior | 
| Thanks for adding it to the RN, which do not reach all users, but at least make the change officially documented in writing. | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! I agree with the suggestion to turn this into a warning. I added one suggestion to the RN to add some additional context, but it's also okay for me to leave them as-is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm in favour of changing the default, and also think that it's good to notify users, but I think the way how the message gets out could be improved:
- An entry in the release notes is warranted
- The Infomessage should show as info, so we might have to move toTError.h'sInfo. (A warning would be too high in my opinion).
- And I think the message should be silenceable using an entry in TEnv.
How about:
The default compression settings of Snapshot have been changed from 101 (ZLIB with compression level 1) to 505 (ZSTD with compression level 5).
See https://root.cern/
Silence this message addingRDF_SNAPSHOT_INFO=0to the process environment or .rootrc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Vincenzo!
7c5c04f    to
    09afe35      
    Compare
  
    | I have found a way to show  The remaining discussion point is about how long this info message should stay around. I believe it should only stay in 6.38 as leaving it in an LTS release has the concrete risk of sticking around in user code for many years. | 
| 
 How about the shorter  
 After today's discussion, 6.38 seems to be enough. | 
| The message appears to be: The message is being obscured by the prefix which in this case in uninformative. Something closer to the following might be more readable.  | 
| 
 I really do not think a rewriting of the implementation of RLogger is in scope for this PR | 
| 
 Maybe that's not required. Adding  | 
| I have added another comment to address the PR reviews @hageboeck @pcanal , this is what a simple example looks like on my terminal Please let me know how you like it. Afterwards, I will take care of resolving the conflicts and most importantly fix the test failures resulting from the extra info print | 
The default Snapshot compression setting has always been 101. Historically, this was done for simplicity reasons and following the principle of least surprise. TTree was the only output format available with Snapshot, so the operation was defaulting to the same value used by TTree. Now that Snapshot supports more than one output format, this reason is less strong than before. It has been established that 505 is a better default compression setting overall, so RDataFrame should follow that. The main disadvantage is that this change is hard to communicate. This commit proposes to introduce an information message being shown once per program execution, and only if the program is calling Snapshot. This message is supposed to help the users detect changes in their output file size with the next development cycle (6.38.x) and should be removed afterwards.
5e922a1    to
    4d8eb60      
    Compare
  
    | 
 This is more readable. Thanks. | 
The default Snapshot compression setting has always been 101. Historically, this was done for simplicity reasons and following the principle of least surprise. TTree was the only output format available with Snapshot, so the operation was defaulting to the same value used by TTree. Now that Snapshot supports more than one output format, this reason is less strong than before. It has been established that 505 is a better default compression setting overall, so RDataFrame should follow that. The main disadvantage is that this change is hard to communicate. This commit proposes to introduce an information message being shown once per program execution, and only if the program is calling Snapshot. This message is supposed to help the users detect changes in their output file size with the next development cycle (6.38.x) and should be removed afterwards.
As discussed during the ROOT I/O meeting