# uproot essentials
A quick walkthrough of the essential operations performed with **uproot** to read and manipulate ROOT files in Python.

https://uproot.readthedocs.io/en/latest/basic.html

In [2]:
import uproot
import awkward as ak
import numpy as np
import hist

## Explore the TFile

In [3]:
cms_opendata_file = "root://eospublic.cern.ch//eos/opendata/cms/mc/RunIISummer20UL16NanoAODv9/DYJetsToLL_M-50_TuneCP5_13TeV-madgraphMLM-pythia8/NANOAODSIM/106X_mcRun2_asymptotic_v17-v1/40000/14B6A8AE-C9FE-D744-80A4-DDE5D008C1CD.root"
file = uproot.open(cms_opendata_file)

In [4]:
file.classnames()

{'tag;1': 'TObjString',
 'Events;1': 'TTree',
 'LuminosityBlocks;1': 'TTree',
 'Runs;1': 'TTree',
 'MetaData;1': 'TTree',
 'ParameterSets;1': 'TTree'}

In [5]:
file.keys()

['tag;1',
 'Events;1',
 'LuminosityBlocks;1',
 'Runs;1',
 'MetaData;1',
 'ParameterSets;1']

Let's get the TTree by indexing the file like a dictionary

In [6]:
events = file["Events"]

We can easily inspect the full content of the TTree

In [None]:
events.show()

or get the type of all the branches

In [None]:
events.typenames()

The number of "rows" or events is accessible as

In [8]:
events.num_entries

1434319

It is also possible to open the TTree directly from the file

In [9]:
events = uproot.open(f"{cms_opendata_file}:Events")
events

<TTree 'Events' (1504 branches) at 0x7af031135a50>

## Arrays from TTree

There are many ways to read data from TTrees in *uproot*. 

- Read a single branch into a NumPy array or a pandas Series
- Read the full TTree data as an Awkward Array
- Read part of the TTree by specifying a list of branches
- Apply a cut over the TTree rows before reading the data

In [None]:
events.arrays?

In [16]:
met = events.arrays(["MET_pt", "MET_phi"], library="np", entry_stop=5000)

In [17]:
met

{'MET_pt': array([31.452927 , 40.65485  , 32.816944 , ..., 23.492601 ,  2.4211023,
        44.237286 ], dtype=float32),
 'MET_phi': array([-0.28570557, -2.8354492 , -0.5496826 , ...,  0.22387695,
         1.7602539 , -1.3271484 ], dtype=float32)}

If we request "flat" arrays with `library="np"` (NumPy), as in the cell above, we get a dictionary of NumPy arrays. With `library="ak"` the result is a single Awkward Array instead.

In [18]:
E = events.arrays(["Electron_pt", "Electron_eta", "Electron_phi"], library="ak", entry_stop=5000)

By default the `arrays()` method of the TTree reads all the entries of the requested branches. We will see later how to iterate over chunks of data.
For this demonstration we read only the first 5000 events (`entry_stop=5000`).

In [26]:
E

<Array [{Electron_pt: [], ... ] type='5000 * {"Electron_pt": var * float32, "Ele...'>

In [20]:
E.fields

['Electron_pt', 'Electron_eta', 'Electron_phi']

In [27]:
E.Electron_pt

<Array [[], [], [], ... 33.3, 32.3, 10.6], []] type='5000 * var * float32'>

This is an Awkward Array containing the requested branches.

## Iterating over entries

In [30]:
for batch in events.iterate(step_size=1000, entry_stop=10_000):
    print(repr(batch))

<Array [{run: 1, ... L1simulation_step: True}] type='1000 * {"run": uint32, "lum...'>
<Array [{run: 1, ... L1simulation_step: True}] type='1000 * {"run": uint32, "lum...'>
<Array [{run: 1, ... L1simulation_step: True}] type='1000 * {"run": uint32, "lum...'>
<Array [{run: 1, ... L1simulation_step: True}] type='1000 * {"run": uint32, "lum...'>
<Array [{run: 1, ... L1simulation_step: True}] type='1000 * {"run": uint32, "lum...'>
<Array [{run: 1, ... L1simulation_step: True}] type='1000 * {"run": uint32, "lum...'>
<Array [{run: 1, ... L1simulation_step: True}] type='1000 * {"run": uint32, "lum...'>
<Array [{run: 1, ... L1simulation_step: True}] type='1000 * {"run": uint32, "lum...'>
<Array [{run: 1, ... L1simulation_step: True}] type='1000 * {"run": uint32, "lum...'>
<Array [{run: 1, ... L1simulation_step: True}] type='1000 * {"run": uint32, "lum...'>


## Iterating over files

In [31]:
import yaml
with open("datasets.yaml") as f:
    datasets = yaml.safe_load(f)

In [None]:
uproot.iterate?

In [36]:
files = [f"{path}:Events" for path in datasets["DYJetsToLL"]["files"][0:3]]
for batch in uproot.iterate(files,
                            filter_name=["Jet_pt", "Jet_eta", "Jet_phi"],
                            step_size=100_000):
    print(batch)

[{Jet_eta: [], Jet_phi: [], Jet_pt: []}, ... 0.945], Jet_pt: [48, 42.9, 25.3]}]
[{Jet_eta: [-2.25, -3.79, 2.86, 2.66, -3.13], ... Jet_pt: [39.9, 35.1, 22, 20.5]}]
[{Jet_eta: [1.23, 2.67], Jet_phi: [1.14, -2.85, ... -2.51], Jet_pt: [29.5]}]
[{Jet_eta: [0.446, -0.574, 2.14], Jet_phi: [-1.09, ... 57.6, 51.8, 21.4, 19.7, 16]}]
[{Jet_eta: [-2.76], Jet_phi: [1.22], Jet_pt: [16.1, ... Jet_phi: [], Jet_pt: []}]
[{Jet_eta: [-1.41, -1.45], Jet_phi: [-1.8, ... Jet_pt: [34.4, 29.4, 28.4, 15.5]}]
[{Jet_eta: [-0.616], Jet_phi: [1.94], Jet_pt: [, ... Jet_pt: [37.8, 32.2, 15.3]}]
[{Jet_eta: [-1.99, -1.19, -0.0124], Jet_phi: [, ... 55.5, 53.9, 27.9, 22.5, 15.8]}]
[{Jet_eta: [1.61, 1.42, 3.51, 4.42, -0.189, -3.76, ... 115, 93.9, 27.6, 24.1, 20.6]}]
[{Jet_eta: [0.0519, -1.16, -2.49, -3.79, -2.33], ... -0.62], Jet_pt: [17.4]}]
[{Jet_eta: [2.31, 1.78, -1.31, 2.38, 0.196], ... 59.6, 42.3, 36.4, 19.5, 17]}]
[{Jet_eta: [-1.33, -0.825, 2.23, 2.96], ... Jet_pt: [51.5, 21.1, 20.8, 16.9]}]
[{Jet_eta: [-2.81], Jet... (output truncated)

## Writing objects to ROOT file

# Awkward arrays

Awkward arrays are very similar to **numpy** arrays, but they can accommodate multidimensional arrays with a different size for each entry.
For example: the pT of the jets for a set of events can be represented as a 2D array, where each row (event) can have a different number of columns (a different number of jets).

In [40]:
events = uproot.open(f"{cms_opendata_file}:Events")
df = events.arrays(entry_stop=5000)

In [55]:
jets_pt = df.Jet_pt

The `tolist()` method is useful to print out the full array when debugging and checking the structure

In [56]:
jets_pt[0:5].tolist()

[[],
 [48.1875, 47.84375, 16.28125],
 [45.0625, 27.578125, 16.078125],
 [21.0, 16.640625],
 [26.96875, 20.203125, 19.875, 19.703125]]

In [57]:
N_jets = ak.num(jets_pt)

In [46]:
N_jets

<Array [0, 3, 3, 2, 4, 2, ... 4, 2, 8, 3, 9, 4] type='5000 * int64'>

Simple operations can be applied to the full array (as with NumPy arrays). The **broadcasting** rules work in the same way as in NumPy

In [53]:
jets_pt * 2

<Array [[], [96.4, 95.7, ... 37.6, 37.2, 36]] type='5000 * var * float32'>

In [54]:
jets_pt ** 2

<Array [[], [2.32e+03, ... 353, 345, 324]] type='5000 * var * float32'>

NumPy functions can also be applied

In [58]:
np.sqrt(jets_pt)

<Array [[], [6.94, 6.92, ... 4.34, 4.31, 4.24]] type='5000 * var * float32'>

## Indexing

In [60]:
jets_pt[1]

<Array [48.2, 47.8, 16.3] type='3 * float32'>

We can ask for a range of columns, and this also works if the events have different lengths

In [63]:
jets_pt[0:10, :5]

<Array [[], [48.2, 47.8, ... 21.3], [43.2]] type='10 * var * float32'>

But if we request the first jet specifically and it is not available in one of the events, an exception is raised. Have a look at the Padding section.

In [64]:
jets_pt[0:10, 0]

ValueError: in ListOffsetArray64 attempting to get 0, index out of range

(https://github.com/scikit-hep/awkward-1.0/blob/1.10.3/src/cpu-kernels/awkward_NumpyArray_getitem_next_at.cpp#L21)

`ak.firsts` avoids the exception by returning `None` for the events without jets

In [65]:
ak.firsts(jets_pt)

<Array [None, 48.2, 45.1, ... 39.4, 48.2, 51.4] type='5000 * ?float32'>

## Masking


## Splitting / combining columns

In [48]:
jet = ak.zip({"pt": df.Jet_pt, "eta": df.Jet_eta, "phi": df.Jet_phi})

In [50]:
jet.fields

['pt', 'eta', 'phi']

In [51]:
ak.unzip(jet)

(<Array [[], [48.2, 47.8, ... 18.8, 18.6, 18]] type='5000 * var * float32'>,
 <Array [[], [-1.48, ... 0.106, 3.7, 2.67]] type='5000 * var * float32'>,
 <Array [[], [1.27, ... -3.1, 0.846, -3.05]] type='5000 * var * float32'>)

## Flattening


## Padding