In [None]:
import ROOT

# The ROOT file

With ROOT, arbitrary objects can be serialized, i.e. written to a file on disk. The ROOT file has several advantages:

* It is binary
* It can be transparently compressed to reduce disk usage
* It lets you organize objects in a logical, file-system-like structure, e.g. objects can be written to different subdirectories in the same file for easier categorization
* It can be read transparently from remote storage locations via the XRootD, HTTP and S3 protocols

The `TFile` class is used to interact with a ROOT file, both for writing and for reading. Let's see a first simple example:

In [None]:
with ROOT.TFile("my_file.root", "RECREATE") as f:
    h = ROOT.TH1D("my_histo", "Example histogram", 100, -4, 4)
    f.WriteObject(h)

The example above:
* Opens a ROOT file in writing mode
* Creates a ROOT histogram object, giving it the name `"my_histo"`
* Writes the object to the file via the `WriteObject` method
* Writes the file to disk and closes it automatically at the end of the `with` context

The ROOT file can be opened in various modes, summarized in this table:
<center><img src="../images/ruw_2025_training_io_image_1.png"></center>

We should now have a file called `my_file.root` in the current directory. We can confirm that via the `%%bash` Jupyter notebook cell magic, which allows us to run bash commands from a cell:

In [None]:
%%bash
ls -l my_file.root

ROOT also provides a variety of command line utilities (see [the manual](https://root.cern/manual/root_files/#root-command-line-tools) for a full list). In particular, the `rootls` command lists the contents of a ROOT file. Let's confirm that the `my_histo` object was actually stored in the ROOT file:

In [None]:
%%bash
rootls -l my_file.root

Finally, let's see how we can programmatically retrieve the histogram we just wrote to the file.

We can access the histogram by its name using `TFile::Get()`.

In [None]:
with ROOT.TFile("my_file.root") as f: # READ is the default mode
    h = f.Get("my_histo")
    print(h)

# The HEP dataset

High Energy Physics data consists of many (billions of) statistically independent collision events.

Laying out the data as an "event class" and then serializing and writing N instances of that class to a file would be very inefficient.

ROOT offers columnar data formats to store physics data in a flexible and optimized way. In a simplified representation, the dataset can be imagined as a table. Each row usually represents one physics event, so rows are often simply called "events". Columns contain the values of various physical properties of the particles.

<center><img src="../images/ruw_2025_training_io_image_2.png"></center>
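
To make the difference between a row-wise and a columnar layout concrete, here is a small pure-Python sketch. It does not use ROOT and says nothing about ROOT's actual on-disk format; the column names `run`, `pt` and `eta` and their values are made up for illustration. The point is only that an analysis needing a single property reads one column instead of visiting every full record.

```python
# Row-wise layout: one record per event, all properties stored together
events_rowwise = [
    {"run": 1, "pt": 10.5, "eta": 0.3},
    {"run": 1, "pt": 22.1, "eta": -1.2},
    {"run": 2, "pt": 7.8, "eta": 2.0},
]

# Columnar layout: one contiguous sequence per property
events_columnar = {
    "run": [1, 1, 2],
    "pt": [10.5, 22.1, 7.8],
    "eta": [0.3, -1.2, 2.0],
}

# Row-wise: every record is visited even though only "pt" is needed
mean_pt_rowwise = sum(event["pt"] for event in events_rowwise) / len(events_rowwise)

# Columnar: only the "pt" column is touched
pt_column = events_columnar["pt"]
mean_pt_columnar = sum(pt_column) / len(pt_column)

print(mean_pt_rowwise, mean_pt_columnar)
```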

But a typical HEP dataset is rarely representable as a flat table: column values can be of arbitrary data types, including complex structures comprising hierarchies of collections and containers. In fact, a large variety of C++ types can be written to a column of a ROOT dataset:

* Fundamental types: `int`, `float`, ...
* C++ standard containers: `std::vector`, `std::unordered_map`, ...
* Arbitrary C++ classes defined by the user


<center><img src="../images/ruw_2025_training_io_image_3.png"></center>

Let's now take a look at one example ROOT dataset. The following is an open dataset containing events from the CMS detector taken in 2012 (see the full description in the [CERN open data portal](https://opendata.cern.ch/record/12341)). We can open the ROOT file that contains this dataset and print the dataset schema. Note that the file is stored at a remote location; ROOT reads it via the XRootD protocol:

In [None]:
path = "root://eospublic.cern.ch//eos/opendata/cms/derived-data/AOD2NanoAODOutreachTool/Run2012BC_DoubleMuParked_Muons.root"
with ROOT.TFile.Open(path) as f:
    events = f["Events"] # Events is the name of the dataset
    events.Print()

A few key things to note:
* The dataset is written in the [TTree columnar format](https://root.cern/manual/trees/). This is the format used in production by the LHC experiments at CERN, where it stores more than 2 EB of physics data.
* It contains 61 million entries, or events. Note that this is just one file. Physics analyses usually process thousands of files, and thus many billions of events.
* The `Print` method of `TTree` shows us the dataset schema. Some column names, e.g. `nMuon`, refer to columns of fundamental types, `int` in this case. Other columns, for example `Muon_pt`, contain collections of elements. The same column can have a different size in different entries, for example because a different number of particles participated in each event.

## RNTuple

RNTuple is the evolution of TTree columnar data storage, targeting the future computing challenges of physics experiments. For more details, see the [ROOT I/O overview at the ROOT Users Workshop 2025](https://indico.cern.ch/event/1505384/contributions/6730571/).

RNTuple builds on the experience gathered with TTree over more than two decades and is designed with modern computer architectures and programming best practices in mind. It is also a columnar format: it reduces the complex layout of the data types used to fill the dataset down to their most fundamental components, which allows for more storage optimizations.
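
One way to picture this reduction to fundamental components is a variable-size collection split into two flat columns: one holding all elements back to back, and one holding the position where each entry's elements end. The pure-Python sketch below is a conceptual illustration only, under that assumption; it does not use ROOT, the helper `read_event` is hypothetical, and the real on-disk encoding differs in its details.

```python
# Three events, each with a different number of values (one of them is empty)
events = [[0.5, 1.5], [], [2.0, 3.0, 4.0]]

content = []  # all values of all events, stored back to back
offsets = []  # offsets[i] is the end position of event i within `content`
for values in events:
    content.extend(values)
    offsets.append(len(content))


def read_event(i):
    """Rebuild the collection of event i from the two flat columns."""
    start = offsets[i - 1] if i > 0 else 0
    return content[start:offsets[i]]


print(content)
print(offsets)
print([read_event(i) for i in range(len(events))])
```

Both `content` and `offsets` are plain sequences of a single fundamental type, which is what makes them easy to pack and compress.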

Let's see a first example of writing an RNTuple to a ROOT file:

In [None]:
import numpy

model = ROOT.RNTupleModel.Create()
model.MakeField["std::vector<float>"]("values")

with ROOT.RNTupleWriter.Recreate(model, "events", "file_with_rntuple.root") as writer:
    entry = writer.CreateEntry()
    for _ in range(5):
        # We are creating a different array of values for each event, with a different size
        # This is very typical in HEP datasets!
        entry["values"] = numpy.random.normal(size=numpy.random.randint(1, 5))
        writer.Fill(entry)

In the example above:
* We create an `RNTupleModel` first. This is the class representing the dataset schema that we want to store at a later point.
* Through the `RNTupleModel`, we can register typed columns. In RNTuple, a top-level column of the table is called a "field". A field can in turn be made up of multiple columns, which represent the different components in the hierarchy of its data type.
* We thus register the creation of a field called `values` that will hold one collection of type `std::vector<float>` per event.
* We then create an `RNTupleWriter` with the name of the dataset `events` and the name of the file `file_with_rntuple.root`.
* `CreateEntry` gives us an object that can be filled with the actual values. It acts as a Python dictionary, where the key is the name of the field that we want to fill with some values.
* We want to fill 5 entries in total, so we do that in a `for` loop. At every iteration, we fill the `values` field with an array of random values of randomly chosen size.
* We call `Fill` to actually store the new values into the RNTuple at each iteration.
* Finally, at the end of the `with` context the writer will automatically finalize the writing of the RNTuple to the file.

Now let's read back the data we just wrote to file:

In [None]:
with ROOT.TFile("file_with_rntuple.root") as f:
    ntuple = f["events"]
    reader = ROOT.RNTupleReader.Open(ntuple)
    nentries = reader.GetNEntries()

    print(f"The RNTuple has {nentries} entries")
    print("Printing values from RNTuple:\n")
    entry = reader.CreateEntry()
    for n in range(nentries):
        reader.LoadEntry(n, entry)
        print(f"entry={n}, values={entry['values']}")

    print("\nPrinting summary information about the RNTuple dataset:\n")
    reader.PrintInfo()

In the example above:
* We used the `RNTupleReader` class to open the RNTuple dataset we had previously written
* The same concept of an entry is used, but this time the values are loaded from disk into the entry
* We can print the contents of the entry using standard Python dictionary syntax
* The `PrintInfo` method shows a summary of the RNTuple, including the dataset schema

## Converting existing TTree data to RNTuple

Existing TTree data can easily be converted to the new RNTuple data format through the `RNTupleImporter` class. The importer preserves the TTree dataset schema in most cases, with a few exceptions where certain types are not supported in RNTuple (see [the documentation](https://root.cern.ch/doc/master/classROOT_1_1Experimental_1_1RNTupleImporter.html) for more details). The following example shows how to convert a TTree to an RNTuple: it's practically a single line of code!

In [None]:
import os

dataset_name = "data"
input_path = "root://eospublic.cern.ch//eos/root-eos/testfiles/CMS_Open_Dataset.root"
output_path = "imported_rntuple.root"

# Remove any output file left over from a previous run, so that this cell can be rerun
try:
    os.remove(output_path)
except FileNotFoundError:
    pass


importer = ROOT.Experimental.RNTupleImporter.Create(input_path, dataset_name, output_path)
importer.Import()

# Open the newly created RNTuple file and check its contents
with ROOT.TFile(output_path) as f:
    ntuple = f.Get(dataset_name)
    reader = ROOT.RNTupleReader.Open(ntuple)
    reader.PrintInfo()

## Inspecting RNTuple

The storage-related information about an RNTuple dataset can be further inspected via the `RNTupleInspector` class, which is meant for studying the storage efficiency of an RNTuple. It provides information at the level of the RNTuple itself, at the (sub)field level and at the column level. For more details, refer to [the documentation](https://root.cern/doc/v634/classROOT_1_1Experimental_1_1RNTupleInspector.html).

Let's now take a closer look at the RNTuple we have just imported:

In [None]:
inspector = ROOT.Experimental.RNTupleInspector.Create(dataset_name, output_path)

In [None]:
print(f"The compression factor is {inspector.GetCompressionFactor()}")
print(f"The compression settings are '{inspector.GetCompressionSettingsAsString()}'")
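
The compression factor printed above is assumed here to be the ratio of uncompressed to compressed size. The sketch below illustrates that arithmetic with the Python standard library only: `zlib` stands in for whichever algorithm ROOT applies, and the 1000 synthetic values are made up for the example, so the resulting number is not comparable to the one reported by the inspector.

```python
import struct
import zlib

# A synthetic "column" of 1000 single-precision floats with a repeating pattern
values = [float(i % 10) for i in range(1000)]

# Pack the values into a contiguous binary buffer (4 bytes per float)
raw = struct.pack(f"<{len(values)}f", *values)

# Compress the buffer; zlib is only a stand-in for ROOT's compression
compressed = zlib.compress(raw)

# Assumed definition: uncompressed size divided by compressed size
factor = len(raw) / len(compressed)
print(f"uncompressed={len(raw)} B, compressed={len(compressed)} B, factor={factor:.1f}")
```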

In [None]:
inspector.PrintColumnTypeInfo()

In [None]:
c = ROOT.TCanvas()
type_info_h = inspector.GetColumnTypeInfoAsHist(ROOT.Experimental.ENTupleInspectorHist.kCount)
type_info_h.Draw()
c.Draw()

This was just a taste of the wide set of tools available to inspect your RNTuple datasets, and more are still being added. Stay up to date with ROOT to get the latest!

<center><img src="../images/ruw_2025_training_io_image_4.png"></center>