# HIP RDataFrame Demonstration 4: Loading multiple files and skimming

In this demonstration we will go through:
1. How to load multiple files to a single RDF with TChain
2. How to skim the files by first filtering them and then saving the filtered RDF as a `.root` file

-- Made by Nico Toikka

## Initialization

Import the only library you'll ever need.

In [1]:
import ROOT

Welcome to JupyROOT 6.30/04


ROOT and RDFs can run the event loop on multiple threads automatically. You enable this with the following command, which has to be called before the RDF is created. If no number is specified, ROOT defaults to all threads that are available to it, so be careful!

In [2]:
ROOT.EnableImplicitMT(4)
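On a shared machine it can be safer to cap the thread count at what the machine actually offers. The following is a minimal sketch in plain Python. The helper `pick_thread_count` is a hypothetical name introduced here for illustration and is not part of ROOT.

```python
import os


def pick_thread_count(requested, available=None):
    """Clamp a requested thread count to the range [1, available cores]."""
    if available is None:
        # os.cpu_count() may return None, so fall back to a single thread
        available = os.cpu_count() or 1
    return max(1, min(requested, available))


# 4 is the value used in the cell above
n_threads = pick_thread_count(4)
print(f"Thread count to pass to ROOT.EnableImplicitMT: {n_threads}")
```

The result can then be passed on as `ROOT.EnableImplicitMT(n_threads)`. Note that the helper never returns 0, because calling `EnableImplicitMT` with 0 or without an argument means "use everything available".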

## Find the files and load the RDF

We will use the same ZeroBias dataset from EOS as in demonstration 2. With the Python module `glob` we can find all the `.root` files under a directory, including its subdirectories, and collect them in a list.

In [3]:
import glob

directory_path = '/eos/cms/store/data/Run2023D/ZeroBias/NANOAOD/PromptReco-v2/'

root_files = glob.glob(f'{directory_path}/**/*.root', recursive=True)

# Print the list of .root files
print(root_files)

['/eos/cms/store/data/Run2023D/ZeroBias/NANOAOD/PromptReco-v2/000/370/616/00000/5a510161-4c3e-4509-b1fc-0ee35fb5c493.root', '/eos/cms/store/data/Run2023D/ZeroBias/NANOAOD/PromptReco-v2/000/370/648/00000/efcb2c35-f56e-4da9-994c-dcdc285d57fc.root', '/eos/cms/store/data/Run2023D/ZeroBias/NANOAOD/PromptReco-v2/000/370/654/00000/11e7a7a6-a89f-46fb-8d8b-b65d545424ae.root', '/eos/cms/store/data/Run2023D/ZeroBias/NANOAOD/PromptReco-v2/000/370/663/00000/d08ee08b-f37d-45bc-b987-4a184720d07e.root', '/eos/cms/store/data/Run2023D/ZeroBias/NANOAOD/PromptReco-v2/000/370/664/00000/25aa2ec7-1ff4-437a-bc5f-53e7f59227d6.root', '/eos/cms/store/data/Run2023D/ZeroBias/NANOAOD/PromptReco-v2/000/370/665/00000/9d62bf7e-dd07-47df-b246-b2caa019f169.root', '/eos/cms/store/data/Run2023D/ZeroBias/NANOAOD/PromptReco-v2/000/370/666/00000/d46fa0c3-c716-4b9d-85fb-c92a4a43885a.root', '/eos/cms/store/data/Run2023D/ZeroBias/NANOAOD/PromptReco-v2/000/370/667/00000/7380749c-5c23-42f0-b7f5-bc67e1ba6e70.root', '/eos/cms/store
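To see what the `**` pattern with `recursive=True` does without access to EOS, here is a small self-contained sketch that builds a throwaway directory tree and searches it. The directory layout and file names are made up for illustration. Sorting the result is an optional extra that gives a reproducible file order, since `glob` does not guarantee one.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as top:
    # Hypothetical layout that mimics run-number subdirectories
    layout = [
        ("000/370/616", "a.root"),
        ("000/370/648", "b.root"),
        ("000/370/648", "notes.txt"),  # not a .root file, should be skipped
    ]
    for sub, name in layout:
        os.makedirs(os.path.join(top, sub), exist_ok=True)
        open(os.path.join(top, sub, name), "w").close()

    # '**' matches any number of nested directories when recursive=True
    found = sorted(glob.glob(os.path.join(top, "**", "*.root"), recursive=True))
    relative = [os.path.relpath(path, top) for path in found]

print(relative)
```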

Now, with the files in a list, we can create a TChain from the `Events` trees in the files.

In [4]:
tree = "Events"
chain = ROOT.TChain(tree)

for file in root_files:
    chain.Add(file)

print(f"Number of entries: {chain.GetEntries()}")

Number of entries: 7285827




The TChain can be used directly to initialize the RDF.

**Note** From the program's point of view, the TChain is the "seed" of the RDF. If the TChain goes out of scope for any reason (for example, because it was created inside a function that has already returned), using the RDF will cause a crash.

In [5]:
df = ROOT.RDataFrame(chain)
df.Describe()

Dataframe from TChain Events in files
  /eos/cms/store/data/Run2023D/ZeroBias/NANOAOD/PromptReco-v2/000/370/616/00000/5a510161-4c3e-4509-b1fc-0ee35fb5c493.root
  /eos/cms/store/data/Run2023D/ZeroBias/NANOAOD/PromptReco-v2/000/370/648/00000/efcb2c35-f56e-4da9-994c-dcdc285d57fc.root
  /eos/cms/store/data/Run2023D/ZeroBias/NANOAOD/PromptReco-v2/000/370/654/00000/11e7a7a6-a89f-46fb-8d8b-b65d545424ae.root
  /eos/cms/store/data/Run2023D/ZeroBias/NANOAOD/PromptReco-v2/000/370/663/00000/d08ee08b-f37d-45bc-b987-4a184720d07e.root
  /eos/cms/store/data/Run2023D/ZeroBias/NANOAOD/PromptReco-v2/000/370/664/00000/25aa2ec7-1ff4-437a-bc5f-53e7f59227d6.root
  /eos/cms/store/data/Run2023D/ZeroBias/NANOAOD/PromptReco-v2/000/370/665/00000/9d62bf7e-dd07-47df-b246-b2caa019f169.root
  /eos/cms/store/data/Run2023D/ZeroBias/NANOAOD/PromptReco-v2/000/370/666/00000/d46fa0c3-c716-4b9d-85fb-c92a4a43885a.root
  /eos/cms/store/data/Run2023D/ZeroBias/NANOAOD/PromptReco-v2/000/370/667/00000/7380749c-5c23-42f0-b7f5-bc67
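One way to respect the scope note above when the loading code is wrapped in a function is to return the TChain together with the RDF, so the caller keeps both alive. This is a sketch under that assumption. The function name `load_events` is hypothetical, and `ROOT` is imported inside the function only so that the definition itself does not require ROOT to be installed.

```python
def load_events(files, tree_name="Events"):
    """Build a TChain from `files` and return it together with its RDataFrame."""
    import ROOT  # needs a working ROOT installation when the function is called

    chain = ROOT.TChain(tree_name)
    for path in files:
        chain.Add(path)

    df = ROOT.RDataFrame(chain)
    # Return the chain as well: the RDF only refers to it and does not own it
    return chain, df
```

Usage would look like `chain, df = load_events(root_files)`. As long as the name `chain` is kept around in the notebook, the RDF stays usable.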

## Filtering, defining new columns and saving the RDF

That is a large RDF. Let's apply the same dijet cuts as before to get rid of some events.

For reference, the cuts are
- require at least two jets
- require at least one of the two leading jets to be in the barrel ($|\eta| < 1.3$)
- if there is a third jet, require its $p_T$ to be less than 0.2 times the average $p_T$ of the two leading jets.

Let's also define some new columns that we would like to include in the `.root` file.

In [6]:
alpha = 0.2
eta_cut = 1.3

df = (df.Filter("nJet >= 2", "Require at least 2 jets")
        .Filter(f"abs(Jet_eta[0]) < {eta_cut} || abs(Jet_eta[1]) < {eta_cut}", f"Require one of the leading jets in the abs(eta) < {eta_cut} region")
        .Filter(f"nJet > 2 ? Jet_pt[2] / (0.5 * (Jet_pt[0] + Jet_pt[1])) < {alpha} : true", f"Third jets pt should be less than {alpha} of the dijet average pt")
        .Define("Jet_leading", "Jet_pt[0]")
        .Define("Jet_trailing", "Jet_pt[1]")
)
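The three filter expressions above can be spelled out for a single event in plain Python, which makes the logic easier to check by hand. This is only an illustration: the function name `passes_dijet_cuts` and the example values are made up, and the jet lists are assumed to be ordered by decreasing pt, as the filter expressions above also assume.

```python
def passes_dijet_cuts(jet_pt, jet_eta, alpha=0.2, eta_cut=1.3):
    """Return True if one event passes all three dijet cuts."""
    # Cut 1: at least two jets
    if len(jet_pt) < 2:
        return False
    # Cut 2: at least one of the two leading jets in the barrel
    if not (abs(jet_eta[0]) < eta_cut or abs(jet_eta[1]) < eta_cut):
        return False
    # Cut 3: a third jet, if present, must be soft compared to the dijet average pt
    if len(jet_pt) > 2:
        average_pt = 0.5 * (jet_pt[0] + jet_pt[1])
        return jet_pt[2] / average_pt < alpha
    return True


# Two jets, one of them in the barrel, no third jet
print(passes_dijet_cuts([100.0, 90.0], [0.5, 2.0]))
# Third jet carries 30 / 95 of the average pt, which is above alpha
print(passes_dijet_cuts([100.0, 90.0, 30.0], [0.5, 0.2, 0.1]))
```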

Let's see the effect:

In [7]:
df.Report().Print()

Require at least 2 jets: pass=1657932    all=7285827    -- eff=22.76 % cumulative eff=22.76 %
Require one of the leading jets in the abs(eta) < 1.3 region: pass=1011727    all=1657932    -- eff=61.02 % cumulative eff=13.89 %
Third jets pt should be less than 0.2 of the dijet average pt: pass=647699     all=1011727    -- eff=64.02 % cumulative eff=8.89 %


Only about 9 % of the original events are left, which is a good reduction for our skim.
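The two efficiency columns in the report can be reproduced from the pass counts: `eff` compares each cut to the events that reached it, while `cumulative eff` compares it to the original number of events. A short sketch using the numbers printed above (the cut labels are shortened here):

```python
n_all = 7285827
steps = [
    ("At least 2 jets", 1657932),
    ("One leading jet in barrel", 1011727),
    ("Third-jet veto", 647699),
]

rows = []
previous = n_all
for name, n_pass in steps:
    eff = 100.0 * n_pass / previous      # relative to the previous cut
    cumulative = 100.0 * n_pass / n_all  # relative to all events
    rows.append((name, round(eff, 2), round(cumulative, 2)))
    previous = n_pass

for name, eff, cumulative in rows:
    print(f"{name}: eff={eff:.2f} % cumulative eff={cumulative:.2f} %")
```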

Now let's save this RDF to a file, but let's also limit the columns to just have the "Jet_" ones.

In [8]:
# GetColumnNames() returns C++ std::string objects; str() converts each one to a Python str
jet_cols = [str(s) for s in df.GetColumnNames() if str(s).startswith("Jet_")]
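It is worth checking what a prefix selection like this keeps and drops. The sketch below applies the same `startswith("Jet_")` test to a short, made-up list of column names. It shows that the newly defined `Jet_leading` and `Jet_trailing` are kept, while `nJet` does not match the prefix and would have to be added explicitly if you want it in the selection.

```python
# Hypothetical subset of column names, for illustration only
columns = ["run", "event", "nJet", "Jet_pt", "Jet_eta",
           "Jet_leading", "Jet_trailing", "Muon_pt"]

jet_cols_demo = [name for name in columns if name.startswith("Jet_")]
dropped = [name for name in columns if name not in jet_cols_demo]

print("kept:   ", jet_cols_demo)
print("dropped:", dropped)
```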

Saving an RDF happens with `.Snapshot()`. You should see the [documentation](https://root.cern/doc/master/classROOT_1_1RDF_1_1RInterface.html#ac5903d3acec8c52f13cbd405371f7fb7) for further details on topics such as compression.

Snapshot can be quite slow, especially with larger RDFs. It can be made faster by selecting fewer columns and applying tighter filters.

In [9]:
df.Snapshot(tree, "dijetFromZB.root", jet_cols)

<cppyy.gbl.ROOT.RDF.RResultPtr<ROOT::RDF::RInterface<ROOT::Detail::RDF::RLoopManager,void> > object at 0xd6f53f0>
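A quick sanity check after saving is to open the new file as its own RDF and look at the number of events and the stored columns. The following is a sketch, not part of the original demonstration. The function name `summarize_skim` is hypothetical, and `ROOT` is imported inside the function only so that the definition does not require ROOT to be installed.

```python
def summarize_skim(path="dijetFromZB.root", tree_name="Events"):
    """Open a skimmed file and return (number of events, list of column names)."""
    import ROOT  # needs a working ROOT installation when the function is called

    skim = ROOT.RDataFrame(tree_name, path)
    n_events = skim.Count().GetValue()
    columns = [str(name) for name in skim.GetColumnNames()]
    return n_events, columns
```

Calling `summarize_skim()` after the cell above should give an event count that matches the last `pass=` value in the report.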