# PURSUE Python for HEP: Uproot I/O

## Introduction

* Scikit-HEP is a community-driven project. It is a software ecosystem designed to provide a coherent collection of Python libraries for HEP. It is aimed at standardizing data analysis in HEP.
* It integrates with other tools such as NumPy, pandas, Matplotlib, and scikit-learn.
* Among its major packages we have
  * Uproot: For reading and writing ROOT files without requiring ROOT.
  * Awkward Array: For handling jagged data structures
  * Hist: For creating and manipulating histograms
  * Vector: For manipulating and operating on vectors of different kinds in a high-performance manner
  * Boost-Histogram: For fast, multi-dimensional histogramming
  * Particle: For handling particle physics data, including properties and PDG codes
  * iminuit: For fitting
* In this tutorial, you will be introduced to the main components of the Scikit-HEP project. We will start with Uproot and build from there.
* For more information on the Scikit-HEP project and the tools it offers, please visit the official website [Scikit-HEP Project Website](https://scikit-hep.org/)

## ROOT Files

* ROOT files are a binary format designed for the storage and analysis of large amounts of HEP data. A ROOT file is structured hierarchically, like a small filesystem that can contain nested directories.
* A ROOT file can contain a variety of data types, such as TTrees and histograms.

<div style="text-align: center;">
  <img src="./assets/roottree.png" alt="roottree" style="width: 400px"/>
</div>


## Uproot
* Uproot allows us to read and write ROOT files. It interoperates with NumPy and pandas, as well as with the other packages offered by Scikit-HEP.

## Opening a File

* To move forward, run the following cells to import `Uproot` and to download the sample data.

In [None]:
# Run this cell to import Uproot
import uproot
import skhep_testdata

# Downloads test file and returns path to it
filename = skhep_testdata.data_path("uproot-Event.root")
file = uproot.open(filename)
file

* This file object actually represents a directory, and its contents are accessible through a dict-like interface.

In [None]:
file.keys()

In [None]:
file.values()

In [None]:
file.items()

In [None]:
file.classnames()

* The types seen here are
  * `TProcessID`:
    * ROOT class that keeps track of process IDs in ROOT files. It is used internally by ROOT to manage objects and their references, ensuring that objects have unique identifiers across files or sessions. It is typically not needed for analysis.
  * `TH1F`:
    * One-dimensional histogram with floating-point bin contents.
  * `TTree`:
    * ROOT class used for efficient storage and access of large datasets.
    * Can be conceptualized as a table in a database or a DataFrame in pandas, where each column (branch) can contain a different type of data, and each row (entry) corresponds to a single event or data point.
    * A TBranch is stored as a sequence of TBaskets, which are batches of entries. This allows a branch that is too large to fit in memory to be read in pieces. A TBasket is the smallest unit that can be read from a TTree.
* We can read the histogram and use the methods provided by Uproot to convert it to other formats, such as NumPy arrays.
* The `;1` at the end of each key is a cycle number, which ROOT uses to distinguish versions of an object written under the same name.
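
Keys such as `hstat;1` pair an object name with a cycle number after the semicolon. As a rough illustration of that naming scheme, the snippet below splits a key into its two parts. Note that `split_key` is a hypothetical helper written for this tutorial, not part of Uproot; Uproot itself also accepts a key without the suffix (for example `file["hstat"]`) and then returns the highest cycle.

```python
def split_key(key):
    """Split a ROOT key such as "hstat;1" into (name, cycle)."""
    name, sep, cycle = key.rpartition(";")
    if not sep or not cycle.isdigit():
        # No cycle suffix present: the whole string is the name.
        return key, None
    return name, int(cycle)


print(split_key("hstat;1"))  # ('hstat', 1)
print(split_key("T"))        # ('T', None)
```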

In [None]:
h = file["hstat;1"]
h

In [None]:
# Using the hist Scikit-HEP library (more on that later...)
h.to_hist().plot(linewidth=0.75, color="red")

In [None]:
# Converting the histogram object to numpy arrays
h.to_numpy()
# First array is the data, second one is for bins

In [None]:
# We can then plot this data using matplotlib
import matplotlib.pyplot as plt
hist_data, bin_edges = h.to_numpy()
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2
plt.hist(bin_edges[:-1], bin_edges, weights=hist_data, histtype="step", color="red", linewidth=0.75)
plt.show()

In [None]:
# `h` also has methods which let us directly access the values, errors and bin edges
print(h.values())
print(h.errors())
print(h.axis("x").edges())
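
Once the bin contents and edges are available as NumPy arrays, simple summary quantities can be computed directly from them. The following is a minimal sketch that uses `np.histogram` on made-up numbers as a stand-in for the pair returned by `h.to_numpy()`; the square-root errors assume an unweighted histogram.

```python
import numpy as np

# Stand-in for the (bin contents, bin edges) pair returned by h.to_numpy()
contents, edges = np.histogram(
    [0.5, 1.5, 1.5, 2.5, 2.5, 2.5], bins=3, range=(0.0, 3.0)
)

centers = (edges[:-1] + edges[1:]) / 2      # midpoint of each bin
total = contents.sum()                      # number of entries inside the range
mean = (contents * centers).sum() / total   # mean estimated from the binned data
errors = np.sqrt(contents)                  # Poisson errors, assuming unweighted fills

print(centers, total, mean, errors)
```

The mean obtained this way is only an estimate, because every entry is treated as if it sat at the center of its bin.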

* We can also read the TTree and show a list of its contents as follows

In [None]:
t = file["T;1"]
t.show()

In [None]:
# Other ways to get the same information
print(t["event/fNtrack"].typename)
print(t["event/fNtrack"].interpretation)

* The most direct way to read the data from each of these branches of this TTree is to use the `array` method.

In [None]:
t["event/fNtrack"].array()

* Notice that this is not a regular NumPy array but a new class of object: an Awkward Array. We will see more of them later, along with just how powerful they are. For now, just know that these arrays solve the limitation we saw NumPy arrays have: Awkward Arrays can store jagged data (i.e. arrays containing sub-arrays of different sizes).

In [None]:
type(t["event/fNtrack"].array())

* The data we have at hand at the moment does not contain anything that would allow us to see jagged data in action, but we can synthesize an example. We will see what other things Awkward Arrays allow us to do later.

In [None]:
import awkward as ak

arr = ak.Array([
    [1, 2],
    [1, 2, 5, 8],
    [],
    [2, 5, 8, 2, 5, 8]
])
type(arr[0])

## Writing to a ROOT file

* In addition to reading, you can also write data with Uproot. To do this, the file must first be opened for writing. We can choose to create a completely new file or update an existing one.
* The functions we can use to open a ROOT file for writing are:
  * `uproot.recreate()`: Creates a new ROOT file with the given filename. If the file already exists, it will be overwritten by an empty ROOT file of the same name. Returns a file handle that can be used to write data.
  * `uproot.update()`: Opens an existing ROOT file in "update" mode. It is used for modifying existing files without deleting their contents. Returns a file handle that can be used to write data.

In [None]:
output1 = uproot.recreate("newrootfile.root")
output1

In [None]:
# Adding a string
output1["some_str"] = "Wow! I added this to a ROOT file myself!"
print(f"Keys: {output1.keys()}")
output1.values()

In [None]:
# Adding a histogram
output1["some_histogram"] = file["hstat;1"]
output1.values()

In [None]:
# Adding a histogram within a nested directory
import numpy as np
output1["nested_directory/another_histogram"] = np.histogram(
    np.random.normal(0, 1, 1000000)
)

In [None]:
# One way to create a TTree: assign a dict of branch arrays, then append more entries with extend
output1["tree1"] = {
    "x": np.random.randint(0, 10, 1000000),
    "y": np.random.normal(0, 1, 1000000),
}
output1["tree1"].extend(
    {"x": np.random.randint(0, 10, 1000000), "y": np.random.normal(0, 1, 1000000)}
)
output1["tree1"].extend(
    {"x": np.random.randint(0, 10, 1000000), "y": np.random.normal(0, 1, 1000000)}
)

In [None]:
# Another way: declare the branch types with mktree, then fill the tree with extend
output1.mktree("tree2", {"x": np.int32, "y": np.float64})
for _ in range(20):
    output1["tree2"].extend(
        {"x": np.random.randint(0, 10, 1000000), "y": np.random.normal(0, 1, 1000000)}
    )

In [None]:
# Each call to extend creates a new basket in each branch of `tree2`
output1["tree2"].num_baskets

The data types that can be written to files are listed in the [Uproot documentation](https://uproot.readthedocs.io/en/latest/basic.html#writing-objects-to-a-file).