2 changes: 1 addition & 1 deletion README/ReleaseNotes/v640/index.md
@@ -272,7 +272,7 @@ Given the risk of silently incorrect physics results, and the absence of known w

 ## RDataFrame

-- The message shown in ROOT 6.38 to inform users about change of default compression setting used by Snapshot (was 101 before 6.38, became 505 in 6.38) is now removed.
+- The change of default compression settings used by Snapshot for the TTree output data format introduced in 6.38 (101 before 6.38, 505 in 6.38) is reverted. That choice was based on the evidence available at the time, which indicated that ZSTD outperformed ZLIB in all cases for the available datasets. New evidence shows that this is not always the case, notably for TTree branches made of collections where many (up to all) of the collections are empty. The investigation is described at https://github.com/vepadulano/ttree-lossless-compression-studies. The new defaults in the Snapshot options are `kUndefined` for the compression algorithm and `0` for the compression level. When Snapshot detects `kUndefined` in the options, it sets the compression settings to 101 for TTree output and 505 for RNTuple output.
 - Signatures of the HistoND and HistoNSparseD operations have been changed. Previously, the list of input column names was allowed to contain an extra column for event weights. This was done to align the logic with the THnBase::Fill method. But this signature was inconsistent with all other Histo* operations, which have a separate function argument that represents the column to get the weights from. Thus, HistoND and HistoNSparseD both now have a separate function argument for the weights. The previous signature is still supported, but deprecated: a warning will be raised if the user passes the column name of the weights as an extra element of the list of input column names. In a future version of ROOT this functionality will be removed. From now on, creating a (sparse) N-dim histogram with weights should be done by calling `HistoN[Sparse]D(histoModel, inputColumns, weightColumn)`.

 ## Python Interface
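
The values 101 and 505 quoted in the release note pack a compression algorithm and a level into one integer. A minimal standalone sketch of that encoding follows, assuming the `algorithm * 100 + level` layout. The `Algo` enum, `Decoded`, `Encode`, `Decode`, and `CheckRoundTrip` names are hypothetical stand-ins for illustration and are not ROOT's own types or functions.

```cpp
#include <cassert>

// Hypothetical stand-in for ROOT's algorithm enum. The numeric values are an
// assumption, chosen so that ZLIB level 1 packs to 101 and ZSTD level 5 to 505.
enum class Algo : int { kZLIB = 1, kZSTD = 5 };

struct Decoded {
   int algorithm;
   int level;
};

// Pack algorithm and level into one integer: algorithm * 100 + level.
constexpr int Encode(Algo algo, int level)
{
   return static_cast<int>(algo) * 100 + level;
}

// Split a packed setting back into its two parts.
constexpr Decoded Decode(int settings)
{
   return {settings / 100, settings % 100};
}

// The two defaults named in the release note.
static_assert(Encode(Algo::kZLIB, 1) == 101, "TTree default");
static_assert(Encode(Algo::kZSTD, 5) == 505, "RNTuple default");

// Runtime round trip, callable from any test or main().
inline void CheckRoundTrip()
{
   const Decoded d = Decode(Encode(Algo::kZSTD, 5));
   assert(d.algorithm == 5 && d.level == 5);
}
```

Read this way, "101" means ZLIB at level 1 and "505" means ZSTD at level 5, which matches the wording of the updated option documentation in this diff.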
9 changes: 4 additions & 5 deletions tree/dataframe/inc/ROOT/RSnapshotOptions.hxx
@@ -56,13 +56,13 @@ Note that for RNTuple, the defaults correspond to those set in RNTupleWriteOptio
 <td><code>fCompressionAlgorithm</code></td>
 <td><code>ROOT::RCompressionSetting::EAlgorithm</code></td>
 <td>Zstd</td>
-<td>Compression algorithm for the output dataset</td>
+<td>Compression algorithm for the output dataset, defaults to ROOT::RCompressionSetting::EAlgorithm::EValues::kUndefined. This is converted to ZLIB by default for TTree and ZSTD by default for RNTuple</td>
 </tr>
 <tr>
 <td><code>fCompressionLevel</code></td>
 <td><code>int</code></td>
 <td>5</td>
-<td>Compression level for the output dataset</td>
+<td>Compression level for the output dataset, defaults to 0 (uncompressed). If the default value of `fCompressionAlgorithm` is not modified, the compression level is changed to 1 by default for TTree and 5 by default for RNTuple</td>
 </tr>
 <tr>
 <td><code>fOutputFormat</code></td>
@@ -184,9 +184,8 @@ struct RSnapshotOptions {
    }
    std::string fMode = "RECREATE"; ///< Mode of creation of output file
    ESnapshotOutputFormat fOutputFormat = ESnapshotOutputFormat::kDefault; ///< Which data format to write to
-   ECAlgo fCompressionAlgorithm =
-      ROOT::RCompressionSetting::EAlgorithm::kZSTD; ///< Compression algorithm of output file
-   int fCompressionLevel = 5; ///< Compression level of output file
+   ECAlgo fCompressionAlgorithm = ECAlgo::kUndefined; ///< Compression algorithm of output file
+   int fCompressionLevel = 0; ///< Compression level of output file
    bool fLazy = false; ///< Do not start the event loop when Snapshot is called
    bool fOverwriteIfExists = false; ///< If fMode is "UPDATE", overwrite object in output file if it already exists
    bool fVector2RVec = true; ///< If set to true will convert std::vector columns to RVec when saving to disk
34 changes: 27 additions & 7 deletions tree/dataframe/src/RDFSnapshotHelpers.cxx
@@ -364,6 +364,28 @@ void SetBranchesHelper(TTree *inputTree, TTree &outputTree,
    throw std::logic_error(
       "RDataFrame::Snapshot: something went wrong when creating a TTree branch, please report this as a bug.");
 }
+
+auto GetSnapshotCompressionSettings(const ROOT::RDF::RSnapshotOptions &options)
+{
+   using CompAlgo = ROOT::RCompressionSetting::EAlgorithm::EValues;
+   using OutputFormat = ROOT::RDF::ESnapshotOutputFormat;
+
+   if (options.fOutputFormat == OutputFormat::kTTree || options.fOutputFormat == OutputFormat::kDefault) {
+      // The default compression setting for TTree is 101
+      if (options.fCompressionAlgorithm == CompAlgo::kUndefined) {
+         return ROOT::CompressionSettings(CompAlgo::kZLIB, 1);
+      }
+      return ROOT::CompressionSettings(options.fCompressionAlgorithm, options.fCompressionLevel);
+   } else if (options.fOutputFormat == OutputFormat::kRNTuple) {
+      // The default compression setting for RNTuple is 505
+      if (options.fCompressionAlgorithm == CompAlgo::kUndefined) {
+         return ROOT::CompressionSettings(CompAlgo::kZSTD, 5);
+      }
+      return ROOT::CompressionSettings(options.fCompressionAlgorithm, options.fCompressionLevel);
+   } else {
+      throw std::invalid_argument("RDataFrame::Snapshot: unrecognized output format");
+   }
+}
 } // namespace

 ROOT::Internal::RDF::RBranchData::RBranchData(std::string inputBranchName, std::string outputBranchName, bool isDefine,
@@ -535,8 +557,7 @@ void ROOT::Internal::RDF::UntypedSnapshotTTreeHelper::SetEmptyBranches(TTree *in
 void ROOT::Internal::RDF::UntypedSnapshotTTreeHelper::Initialize()
 {
    fOutputFile.reset(
-      TFile::Open(fFileName.c_str(), fOptions.fMode.c_str(), /*ftitle=*/"",
-                  ROOT::CompressionSettings(fOptions.fCompressionAlgorithm, fOptions.fCompressionLevel)));
+      TFile::Open(fFileName.c_str(), fOptions.fMode.c_str(), /*ftitle=*/"", GetSnapshotCompressionSettings(fOptions)));
    if (!fOutputFile)
       throw std::runtime_error("Snapshot: could not create output file " + fFileName);

@@ -774,9 +795,9 @@ void ROOT::Internal::RDF::UntypedSnapshotTTreeHelperMT::SetEmptyBranches(TTree *

 void ROOT::Internal::RDF::UntypedSnapshotTTreeHelperMT::Initialize()
 {
-   const auto cs = ROOT::CompressionSettings(fOptions.fCompressionAlgorithm, fOptions.fCompressionLevel);
    auto outFile =
-      std::unique_ptr<TFile>{TFile::Open(fFileName.c_str(), fOptions.fMode.c_str(), /*ftitle=*/fFileName.c_str(), cs)};
+      std::unique_ptr<TFile>{TFile::Open(fFileName.c_str(), fOptions.fMode.c_str(), /*ftitle=*/fFileName.c_str(),
+                                         GetSnapshotCompressionSettings(fOptions))};
    if (!outFile)
       throw std::runtime_error("Snapshot: could not create output file " + fFileName);
    fOutputFile = outFile.get();
@@ -929,7 +950,7 @@ void ROOT::Internal::RDF::UntypedSnapshotRNTupleHelper::Initialize()
    model->Freeze();

    ROOT::RNTupleWriteOptions writeOptions;
-   writeOptions.SetCompression(fOptions.fCompressionAlgorithm, fOptions.fCompressionLevel);
+   writeOptions.SetCompression(GetSnapshotCompressionSettings(fOptions));
    writeOptions.SetInitialUnzippedPageSize(fOptions.fInitialUnzippedPageSize);
    writeOptions.SetMaxUnzippedPageSize(fOptions.fMaxUnzippedPageSize);
    writeOptions.SetApproxZippedClusterSize(fOptions.fApproxZippedClusterSize);
@@ -1151,8 +1172,7 @@ ROOT::Internal::RDF::SnapshotHelperWithVariations::SnapshotHelperWithVariations(

    TDirectory::TContext fileCtxt;
    fOutputHandle = std::make_shared<SnapshotOutputWriter>(
-      TFile::Open(filename.data(), fOptions.fMode.c_str(), /*ftitle=*/"",
-                  ROOT::CompressionSettings(fOptions.fCompressionAlgorithm, fOptions.fCompressionLevel)));
+      TFile::Open(filename.data(), fOptions.fMode.c_str(), /*ftitle=*/"", GetSnapshotCompressionSettings(fOptions)));
    if (!fOutputHandle->fFile)
       throw std::runtime_error(std::string{"Snapshot: could not create output file "} + std::string{filename});
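
One consequence of the helper added in this file is worth spelling out: only `fCompressionAlgorithm` is compared against `kUndefined`, so a custom `fCompressionLevel` set on its own is replaced by the per-format default. The sketch below models that behaviour without depending on ROOT. The `Algo`, `Format`, `Options`, and `Resolve` names are hypothetical stand-ins, the enum values are assumptions made for illustration, and the packed result assumes the `algorithm * 100 + level` layout.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-ins for the ROOT enums; the numeric values are assumptions.
enum class Algo : int { kZLIB = 1, kLZMA = 2, kZSTD = 5, kUndefined = 6 };
enum class Format { kDefault, kTTree, kRNTuple };

// Mirrors the three option fields that the helper in this diff reads.
struct Options {
   Format fOutputFormat = Format::kDefault;
   Algo fCompressionAlgorithm = Algo::kUndefined; // sentinel: "pick for me"
   int fCompressionLevel = 0;
};

// Resolve the packed compression setting (algorithm * 100 + level).
int Resolve(const Options &o)
{
   // An explicit algorithm wins, together with whatever level was set.
   if (o.fCompressionAlgorithm != Algo::kUndefined)
      return static_cast<int>(o.fCompressionAlgorithm) * 100 + o.fCompressionLevel;

   // Sentinel algorithm: the level field is ignored and the format decides.
   switch (o.fOutputFormat) {
   case Format::kDefault:
   case Format::kTTree: return 101;   // ZLIB, level 1
   case Format::kRNTuple: return 505; // ZSTD, level 5
   }
   throw std::invalid_argument("unrecognized output format");
}

// Small self-check, callable from any test or main().
inline void CheckLevelOnlyIsIgnored()
{
   Options o;
   o.fCompressionLevel = 9; // algorithm left at the sentinel
   assert(Resolve(o) == 101);
}
```

Under these assumptions, a caller who wants a non-default level has to set the algorithm as well; setting the level alone has no effect.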

16 changes: 16 additions & 0 deletions tree/dataframe/test/dataframe_snapshot.cxx
@@ -247,6 +247,22 @@ TEST(RDFSnapshotMore, BasketSizePreservation)
    TestBasketSizePreservation();
 }

+// Test for default compression settings
+TEST(RDFSnapshotMore, DefaultCompressionSettings)
+{
+   struct FileGuardRAII {
+      std::string fFilename{"RDFSnapshotMore_default_compression_settings.root"};
+      std::string fTreeName{"tree"};
+      ~FileGuardRAII() { std::remove(fFilename.c_str()); }
+   } fileGuard;
+   ROOT::RDataFrame df{1};
+   df.Define("x", [] { return 42; }).Snapshot(fileGuard.fTreeName, fileGuard.fFilename, {"x"});
+
+   auto f = std::make_unique<TFile>(fileGuard.fFilename.c_str());
+   EXPECT_EQ(f->GetCompressionAlgorithm(), ROOT::RCompressionSetting::EAlgorithm::EValues::kZLIB);
+   EXPECT_EQ(f->GetCompressionLevel(), 1);
+}
+
 // fixture that provides fixed and variable sized arrays as RDF columns
 class RDFSnapshotArrays : public ::testing::Test {
 protected:
20 changes: 20 additions & 0 deletions tree/dataframe/test/dataframe_snapshot_ntuple.cxx
@@ -170,6 +170,26 @@ TEST(RDFSnapshotRNTuple, WriteOpts)
    }
 }

+TEST(RDFSnapshotRNTuple, DefaultCompressionSettings)
+{
+   FileRAII fileGuard{"RDFSnapshotRNTuple_default_compression_settings.root"};
+   const std::vector<std::string> columns = {"x"};
+
+   auto df = ROOT::RDataFrame(25ull).Define("x", [] { return 10; });
+
+   RSnapshotOptions opts;
+   opts.fOutputFormat = ROOT::RDF::ESnapshotOutputFormat::kRNTuple;
+
+   auto sdf = df.Snapshot("ntuple", fileGuard.GetPath(), {"x"}, opts);
+
+   EXPECT_EQ(columns, sdf->GetColumnNames());
+
+   auto reader = RNTupleReader::Open("ntuple", fileGuard.GetPath());
+   auto compSettings = *reader->GetDescriptor().GetClusterDescriptor(0).GetColumnRange(0).GetCompressionSettings();
+   // The RNTuple default should be 505
+   EXPECT_EQ(505, compSettings);
+}
+
 TEST(RDFSnapshotRNTuple, Compression)
 {
    FileRAII fileGuard{"RDFSnapshotRNTuple_compression.root"};