# MySchema

This is a rendered copy of myschema.ipynb. You can optionally run it interactively on [Binder at this link](https://mybinder.org/v2/gh/yihui-lai/coffea/5a506e83975baa75fdf3ac92de720aef79aa1446).

The interpretation of the TTree data is configurable via schema objects. A schema teaches the event processor how to group variables (branches) into collections, so that operations can be run over an entire collection at once.

In this demo, we will create our own schema and implement our own [behaviors](https://awkward-array.readthedocs.io/en/latest/ak.behavior.html). 

First, let's open the ROOT file with `NanoAODSchema` and see what is inside. The events object can be instantiated as follows:


In [1]:
from coffea.nanoevents import NanoEventsFactory, BaseSchema, NanoAODSchema
fname = "https://raw.githubusercontent.com/CoffeaTeam/coffea/master/tests/samples/nano_dy.root"
events = NanoEventsFactory.from_root(
           fname, 
           schemaclass=NanoAODSchema
         ).events()
print(events.Electron.fields)

['deltaEtaSC', 'dr03EcalRecHitSumEt', 'dr03HcalDepth1TowerSumEt', 'dr03TkSumPt', 'dr03TkSumPtHEEP', 'dxy', 'dxyErr', 'dz', 'dzErr', 'eCorr', 'eInvMinusPInv', 'energyErr', 'eta', 'hoe', 'ip3d', 'jetPtRelv2', 'jetRelIso', 'mass', 'miniPFRelIso_all', 'miniPFRelIso_chg', 'mvaFall17V1Iso', 'mvaFall17V1noIso', 'mvaFall17V2Iso', 'mvaFall17V2noIso', 'pfRelIso03_all', 'pfRelIso03_chg', 'phi', 'pt', 'r9', 'sieie', 'sip3d', 'mvaTTH', 'charge', 'cutBased', 'cutBased_Fall17_V1', 'jetIdx', 'pdgId', 'photonIdx', 'tightCharge', 'vidNestedWPBitmap', 'vidNestedWPBitmapHEEP', 'convVeto', 'cutBased_HEEP', 'isPFcand', 'lostHits', 'mvaFall17V1Iso_WP80', 'mvaFall17V1Iso_WP90', 'mvaFall17V1Iso_WPL', 'mvaFall17V1noIso_WP80', 'mvaFall17V1noIso_WP90', 'mvaFall17V1noIso_WPL', 'mvaFall17V2Iso_WP80', 'mvaFall17V2Iso_WP90', 'mvaFall17V2Iso_WPL', 'mvaFall17V2noIso_WP80', 'mvaFall17V2noIso_WP90', 'mvaFall17V2noIso_WPL', 'seedGain', 'genPartIdx', 'genPartFlav', 'cleanmask', 'jetIdxG', 'photonIdxG', 'genPartIdxG']


Now we can copy the skeleton of a schema class:

In [2]:
class MySchema(BaseSchema):
    """
    my schema
    """
    def __init__(self, base_form):
        super().__init__(base_form)
        self._form["contents"] = self._build_collections(self._form["contents"])

    def _build_collections(self, branch_forms):
        output = {}
        return output

    @property
    def behavior(self):
        """
        Behaviors necessary to implement this schema
        """
        behavior = {}
        return behavior

As you can see, this schema is very simple and not yet useful: `_build_collections` returns an empty dictionary. If we build `events` again with our own schema, we find that it contains nothing.

In [3]:
events = NanoEventsFactory.from_root(
           fname, 
           schemaclass=MySchema
         ).events()
events.fields

[]

## Create collections

In a schema, `branch_forms` is a Python dictionary that describes the available branches and is used to define how they are grouped.

By default (`BaseSchema`), it is completely flat:
```python
branch_forms = {
  "particle_pt": {},
  "particle_eta": {},
  "particle_phi": {},
  "particle_mass": {},
  ...
}
```

What we want is to put some branches into the same collection:

```python
new_branch_forms = {
  "particle": zip_forms({
      "pt": branch_forms["particle_pt"],
      "eta": branch_forms["particle_eta"],
      "phi": branch_forms["particle_phi"],
      "mass": branch_forms["particle_mass"],
  })
}
```
So instead of accessing `particle_pt`, we access `particle.pt`.

All of this is implemented in the `_build_collections` method of the schema.
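The grouping idea can be sketched without coffea at all. The snippet below is only an illustration: `group_branches` is a hypothetical helper, and plain strings stand in for the real form objects that `zip_forms` operates on.

```python
# Toy stand-in for zip_forms: plain dicts instead of coffea form objects.
def group_branches(branch_forms, collection, fields):
    """Collect `{collection}_{field}` entries under their short field names."""
    return {field: branch_forms[f"{collection}_{field}"] for field in fields}

# Flat layout, as BaseSchema would present it (values are placeholders).
branch_forms = {
    "particle_pt": "pt-form",
    "particle_eta": "eta-form",
    "particle_phi": "phi-form",
    "particle_mass": "mass-form",
}

new_branch_forms = {
    "particle": group_branches(branch_forms, "particle", ["pt", "eta", "phi", "mass"])
}
print(list(new_branch_forms))               # ['particle']
print(new_branch_forms["particle"]["pt"])   # pt-form
```

The real `zip_forms` additionally records the collection name and, optionally, a record name for behaviors, which this toy version leaves out.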

For example, let's add the `Electron` collection to our schema. To do this we also need to import `zip_forms`.

In [4]:
from coffea.nanoevents.schemas.base import zip_forms, nest_jagged_forms
class MySchema(BaseSchema):
    """
    my schema
    """
    def __init__(self, base_form):
        super().__init__(base_form)
        self._form["contents"] = self._build_collections(self._form["contents"])

    def _build_collections(self, branch_forms):
        output = {}
        output["Electron"] = zip_forms(
            {
                "pt": branch_forms["Electron_pt"],
                "eta": branch_forms["Electron_eta"],
                "phi": branch_forms["Electron_phi"],
                "mass": branch_forms["Electron_mass"],
                #"xx": branch_forms["Electron_xx"],
            },
            "Electron",
        )
        return output

    @property
    def behavior(self):
        """
        Behaviors necessary to implement this schema
        """
        behavior = {}
        return behavior

We have now created a schema with one collection, `Electron`, built from the branches `Electron_pt`, `Electron_eta`, `Electron_phi` and `Electron_mass`.
Let's build `events` again:

In [5]:
events = NanoEventsFactory.from_root(
           fname, 
           schemaclass=MySchema
         ).events()
print(events.fields)
print(events.Electron.fields)

['Electron']
['pt', 'eta', 'phi', 'mass']


With the branches grouped, we can build a mask and apply a selection to the whole collection at once:

In [6]:
mask = (events.Electron.pt>3) & (events.Electron.pt<60)
good_elec = events.Electron[mask]
print(good_elec.pt)
print(good_elec.eta)

[[], [29.6], [51.7], [10.7, 8.6], [], [9.91, ... [], [15.6], [], [7.68], [], []]
[[], [1.83], [-0.904], [-2.19, 1.65], [], ... [], [-0.0595], [], [0.381], [], []]
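To make explicit what the jagged mask does per event, here is a pure-Python sketch using nested lists in place of awkward arrays. The three toy events reuse the first `pt` values printed in this notebook; the list comprehensions are only an illustration, not how awkward implements masking.

```python
# Three toy events with 0, 1 and 2 electrons.
electron_pt = [[], [29.6], [60.1, 51.7]]

# Per-electron boolean mask with the same jagged shape as electron_pt.
mask = [[3 < pt < 60 for pt in event] for event in electron_pt]

# Keep only the electrons whose mask entry is True, event by event.
good_pt = [
    [pt for pt, keep in zip(event, event_mask) if keep]
    for event, event_mask in zip(electron_pt, mask)
]
print(mask)     # [[], [True], [False, True]]
print(good_pt)  # [[], [29.6], [51.7]]
```

Note that the number of events is unchanged; only the electrons inside each event are filtered.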


However, if you request a branch that your ROOT file doesn't contain, an error is raised.
For example, uncomment the following line in `MySchema`:
```python
"xx": branch_forms["Electron_xx"],
```
Run the code above again and you will see:
```text
KeyError: 'Electron_xx'
```
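One way to get a clearer message than a bare `KeyError` is to check the requested names before building the collection. The sketch below assumes only that `branch_forms` behaves like a dictionary keyed by branch name; `missing_branches` is a hypothetical helper, not part of coffea.

```python
def missing_branches(branch_forms, collection, fields):
    """Return the `{collection}_{field}` names that are absent from branch_forms."""
    wanted = [f"{collection}_{field}" for field in fields]
    return [name for name in wanted if name not in branch_forms]

# Placeholder for the dictionary passed to _build_collections.
branch_forms = {
    "Electron_pt": {},
    "Electron_eta": {},
    "Electron_phi": {},
    "Electron_mass": {},
}

missing = missing_branches(branch_forms, "Electron", ["pt", "eta", "xx"])
print(missing)  # ['Electron_xx']
```

Inside `_build_collections`, a non-empty result could be turned into an error that lists all missing branches at once.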

## Create behavior

Aside from collections, we can also attach a `behavior` to a collection. This means additional awkward arrays are generated on the fly by predefined algorithms.

Many common physics behaviors are already provided in coffea; you can find them in [methods](https://github.com/CoffeaTeam/coffea/tree/a95401cad91e88ceac47a4c693068bc4cbc7d338/coffea/nanoevents/methods).

To write our own coffea behavior, we first need to define the `behavior`.
In the following code, we define `MyBehavior`. It has a single property, `plus1`, which returns `particle.pt + 1` when you access `particle.plus1`.

We also need to pass the [`record_name`](https://github.com/CoffeaTeam/coffea/blob/a95401cad91e88ceac47a4c693068bc4cbc7d338/coffea/nanoevents/schemas/base.py#L24) to `zip_forms` in the schema's `_build_collections` method, to tell the collection which `behavior` it should use.




In [7]:
import awkward 
mybehavior={}
@awkward.mixin_class(mybehavior)
class MyBehavior:
    """
    A test
    """
    @property
    def plus1(self):
        """
        Return pt + 1 for every particle in the collection.
        """
        return self.pt + 1

class MySchema(BaseSchema):
    """
    my schema
    """
    def __init__(self, base_form):
        super().__init__(base_form)
        self._form["contents"] = self._build_collections(self._form["contents"])

    def _build_collections(self, branch_forms):
        output = {}
        output["Electron"] = zip_forms(
            {
                "pt": branch_forms["Electron_pt"],
                "eta": branch_forms["Electron_eta"],
                "phi": branch_forms["Electron_phi"],
                "mass": branch_forms["Electron_mass"],
                #"xx": branch_forms["Electron_xx"],
            },
            "Electron",
            "MyBehavior",
        )
        return output

    @property
    def behavior(self):
        """
        Behaviors necessary to implement this schema
        """
        behavior = {}
        behavior.update(mybehavior)
        return behavior

Now try our self-defined behavior:

In [8]:
events = NanoEventsFactory.from_root(
           fname, 
           schemaclass=MySchema
         ).events()
print(events.Electron.pt)
print(events.Electron.plus1)

[[], [29.6], [60.1, 51.7], [10.7, 8.6], [], ... [], [15.6], [], [7.68], [], []]
[[], [30.6], [61.1, 52.7], [11.7, 9.6], [], ... [], [16.6], [], [8.68], [], []]
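The link between the record name `"MyBehavior"` and the class can be pictured as a dictionary lookup. The following is a deliberately simplified, pure-Python toy: `register` and `toy_behavior` are hypothetical names, nested lists stand in for awkward arrays, and the real `awkward.mixin_class` does considerably more than this.

```python
# Toy registry, loosely mimicking a behavior dict that maps a record name to a class.
toy_behavior = {}

def register(registry):
    """Hypothetical decorator: store the decorated class under its own name."""
    def decorator(cls):
        registry[cls.__name__] = cls
        return cls
    return decorator

@register(toy_behavior)
class MyBehavior:
    def __init__(self, pt):
        self.pt = pt  # jagged list of pt values, one inner list per event

    @property
    def plus1(self):
        # Computed on access; nothing extra is stored alongside pt.
        return [[value + 1 for value in event] for event in self.pt]

# Look the class up by its record name and wrap some toy data with it.
electrons = toy_behavior["MyBehavior"]([[], [29.5], [10.5, 8.5]])
print(electrons.plus1)  # [[], [30.5], [11.5, 9.5]]
```

The point of the sketch is only that the string passed as `record_name` must match the name under which the behavior class was registered.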


## Inherit from NanoEvents

So far we have shown how to create custom schemas and behaviors. If you don't want to write a new schema, you can instead name your TBranches following the NanoAOD naming convention and use `NanoAODSchema`; the collections will then be recognized automatically.

Looking at the content of a [NanoAOD file](https://cms-nanoaod-integration.web.cern.ch/integration/master-cmsswmaster/mc102X_doc.html), most collections consist of two kinds of branches: `{collection}_{object}` and `n{collection}`.

The `_build_collections` method of `NanoAODSchema` groups the branches that share a collection name and uses `n{collection}` to compute the offsets of that collection:

```python
    def _build_collections(self, branch_forms):
        # parse into high-level records (collections, list collections, and singletons)
        collections = set(k.split("_")[0] for k in branch_forms)
        collections -= set(
            k for k in collections if k.startswith("n") and k[1:] in collections
        )

        # Create offsets virtual arrays
        for name in collections:
            if "n" + name in branch_forms:
                branch_forms["o" + name] = transforms.counts2offsets_form(
                    branch_forms["n" + name]
                )
```
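To see what the excerpt above does, the name parsing can be run on a handful of branch names, and the counts-to-offsets step can be written as a cumulative sum. This is a sketch on plain Python lists; `counts_to_offsets` is a hypothetical stand-in for what `transforms.counts2offsets_form` arranges to be computed, not coffea's implementation.

```python
from itertools import accumulate

branch_names = ["run", "nMuon", "Muon_pt", "Muon_eta", "nElectron", "Electron_pt"]

# Same set logic as in the excerpt: take the prefix before the first "_",
# then drop counter names such as "nMuon" whose collection is also present.
collections = set(k.split("_")[0] for k in branch_names)
collections -= set(
    k for k in collections if k.startswith("n") and k[1:] in collections
)
print(sorted(collections))  # ['Electron', 'Muon', 'run']

def counts_to_offsets(counts):
    """Cumulative sum with a leading zero: event i spans offsets[i]:offsets[i + 1]."""
    return [0] + list(accumulate(counts))

n_muon = [0, 2, 1]  # muons per event (made-up numbers)
offsets = counts_to_offsets(n_muon)
print(offsets)  # [0, 0, 2, 3]
```

With the offsets in hand, the flat `Muon_pt` values of event `i` are the slice `offsets[i]:offsets[i + 1]`.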

For example, a ROOT macro that follows this naming convention can store a collection either as fixed-size arrays or as `std::vector`s:
```C++
{
    UInt_t run, event, luminosityBlock;
    UInt_t nMuon;
    Float_t Muon_pt[9999];
    Float_t Muon_eta[9999];
    Float_t Muon_phi[9999];
    Float_t Muon_mass[9999];
    
    UInt_t nElectron;
    std::vector<float> Electron_pt;
    std::vector<float> Electron_eta;
    std::vector<float> Electron_phi;
    std::vector<float> Electron_mass;
 
    TFile *Tfile = TFile::Open("example_ntuple.root","RECREATE");
    TTree *ttree = new TTree("Events","");

    // run, luminosityBlock and event are UInt_t, so use the unsigned leaf type "i"
    ttree->Branch("run", &run, "run/i");
    ttree->Branch("luminosityBlock", &luminosityBlock, "luminosityBlock/i");
    ttree->Branch("event", &event, "event/i");
    
    ttree->Branch("nMuon", &nMuon, "nMuon/i");
    ttree->Branch("Muon_pt", &Muon_pt, "Muon_pt[nMuon]/F");
    ttree->Branch("Muon_eta", &Muon_eta, "Muon_eta[nMuon]/F");
    ttree->Branch("Muon_phi", &Muon_phi, "Muon_phi[nMuon]/F");
    ttree->Branch("Muon_mass", &Muon_mass, "Muon_mass[nMuon]/F");
    
    ttree->Branch("nElectron", &nElectron, "nElectron/i");
    ttree->Branch("Electron_pt", &Electron_pt);
    ttree->Branch("Electron_eta", &Electron_eta);
    ttree->Branch("Electron_phi", &Electron_phi);
    ttree->Branch("Electron_mass", &Electron_mass);
    
    for (Int_t ev=0;ev<100;ev++) {
        run=1;
        event=ev;
        luminosityBlock=1000*ev;
        Int_t nmu = Int_t(5*gRandom->Rndm());
        Int_t nele = Int_t(5*gRandom->Rndm());
        nMuon = nmu;
        nElectron = nele;
        
        if(nmu < 0) nmu = 0;
        for (Int_t im=0;im<nmu;im++) {
            Muon_pt[im] = gRandom->Gaus(50,10);
            Muon_eta[im] = gRandom->Rndm();
            Muon_phi[im] = gRandom->Rndm();
            Muon_mass[im] = 200;
        }
        
        if(nele < 0) nele = 0;
        Electron_pt.clear();
        Electron_eta.clear();
        Electron_phi.clear();
        Electron_mass.clear();
        for (Int_t in=0;in<nele;in++) {
            Electron_pt.push_back(gRandom->Gaus(50,10));
            Electron_eta.push_back(gRandom->Rndm());
            Electron_phi.push_back(gRandom->Rndm());
            Electron_mass.push_back(1);
        }
        ttree->Fill();
    }
    
    ttree->Write();
}
```
By doing so, you can also make use of the `behaviors` in `NanoAODSchema` automatically, which is very convenient. 


## nest_jagged_forms

A more involved task is the construction of embedded objects, for example subjets stored as a list of objects within each jet of a jet collection.

Again, let's use a simple macro generating our toy root ntuples:
```C++
{

    Int_t Jets_nJet;
    vector<double> Jets_Pt;         //[Jets_nJet]
    vector<double> Jets_Eta;        //[Jets_nJet]
    vector<double> Jets_Phi;        //[Jets_nJet]
    vector<double> Jets_E;          //[Jets_nJet]
    vector<int> Jets_SubjetsCounts; //[Jets_nJet]
    
    Int_t Jets_nsubjet;
    vector<double> Jets_subjet_Pt;  //[Jets_nsubjet]
    vector<double> Jets_subjet_Eta; //[Jets_nsubjet]
    vector<double> Jets_subjet_Phi; //[Jets_nsubjet]
    vector<double> Jets_subjet_E;   //[Jets_nsubjet]
 
    TFile *Tfile = TFile::Open("example_ntuple.root","RECREATE");
    TTree *ttree = new TTree("Events","");

    ttree->Branch("Jets_nJet", &Jets_nJet, "Jets_nJet/I");
    ttree->Branch("Jets_Pt", &Jets_Pt);
    ttree->Branch("Jets_Eta", &Jets_Eta);
    ttree->Branch("Jets_Phi", &Jets_Phi);
    ttree->Branch("Jets_E", &Jets_E);
    ttree->Branch("Jets_SubjetsCounts", &Jets_SubjetsCounts);
    
    ttree->Branch("Jets_nsubjet", &Jets_nsubjet, "Jets_nsubjet/I");
    ttree->Branch("Jets_subjet_Pt", &Jets_subjet_Pt);
    ttree->Branch("Jets_subjet_Eta", &Jets_subjet_Eta);
    ttree->Branch("Jets_subjet_Phi", &Jets_subjet_Phi);
    ttree->Branch("Jets_subjet_E", &Jets_subjet_E);
    
    for (Int_t ev=0;ev<100;ev++) {
        Jets_Pt.clear();
        Jets_Eta.clear();
        Jets_Phi.clear();
        Jets_E.clear();
        Jets_SubjetsCounts.clear();
        Jets_subjet_Pt.clear();
        Jets_subjet_Eta.clear();
        Jets_subjet_Phi.clear();
        Jets_subjet_E.clear();

        Jets_nJet = Int_t(3 * gRandom->Rndm());
        Jets_nsubjet = 0;
        //this event has {Jets_nJet} jets 
        for (int i = 0; i < Jets_nJet; i++)
        {
            Jets_Pt.push_back(10 * gRandom->Rndm());
            Jets_Eta.push_back(gRandom->Rndm());
            Jets_Phi.push_back(gRandom->Rndm());
            Jets_E.push_back(gRandom->Gaus(50, 10));
            
            //each jet has {jets_sub} subjets
            Int_t jets_sub = Int_t(2 * gRandom->Rndm()) + 1;
            Jets_nsubjet += jets_sub;
            //Jets_SubjetsCounts stores the number of subjets in each jet
            Jets_SubjetsCounts.push_back(jets_sub);
            for (int j = 0; j < jets_sub; j++)
            {
                Jets_subjet_Pt.push_back(10 * gRandom->Rndm());
                Jets_subjet_Eta.push_back(gRandom->Rndm());
                Jets_subjet_Phi.push_back(gRandom->Rndm());
                Jets_subjet_E.push_back(gRandom->Gaus(25, 10));
            }
        }
        ttree->Fill();
    }
    
    ttree->Write();
}
```
What we would like to have are objects like:
```python
Jets.pt
Jets.subjet.pt
```
rather than:
```python
Jets.pt
Jets_subjet.pt
```
The following code shows how to nest one list of objects inside a collection. For this we need a branch that gives the number of nested objects for each parent object, from which the offsets are derived. In our case, this is `Jets_SubjetsCounts`.

In [9]:
class MySchema(BaseSchema):
    """
    my schema
    """
    def __init__(self, base_form):
        super().__init__(base_form)
        self._form["contents"] = self._build_collections(self._form["contents"])

    def _build_collections(self, branch_forms):
        output = {}
        output["Jets"] = zip_forms(
            {
                "pt": branch_forms["Jets_Pt"],
                "eta": branch_forms["Jets_Eta"],
                "phi": branch_forms["Jets_Phi"],
                "e": branch_forms["Jets_E"],
                "SubjetsCounts": branch_forms["Jets_SubjetsCounts"],
            },
            "Jets",
        )
        output["Jets_subjet"] = zip_forms(
            {
                "pt": branch_forms["Jets_subjet_Pt"],
                "eta": branch_forms["Jets_subjet_Eta"],
                "phi": branch_forms["Jets_subjet_Phi"],
                "e": branch_forms["Jets_subjet_E"],
            },
            "Jets_subjet",
        )
        
        nest_jagged_forms(
            output["Jets"],
            output.pop("Jets_subjet"),
            "SubjetsCounts",
            "subjet",
        )

        return output
  
    @property
    def behavior(self):
        """
        Behaviors necessary to implement this schema
        """
        behavior = {}
        return behavior

In [10]:
fname = "https://raw.githubusercontent.com/yihui-lai/coffea/master/tests/samples/example_ntuple.root"
events = NanoEventsFactory.from_root(
           fname, 
           schemaclass=MySchema
         ).events()
print(events.fields)
print(events.Jets.fields)
print(events.Jets.subjet.fields)
print(events.Jets.pt)
print(events.Jets.SubjetsCounts)
print(events.Jets.subjet.pt)

['Jets']
['pt', 'eta', 'phi', 'e', 'SubjetsCounts', 'subjet']
['pt', 'eta', 'phi', 'e']
[[], [3.47], [], [0.999], [], [1.33, ... [3.37, 1.52], [7.77, 4.24], [1.34, 4.87]]
[[], [1], [], [2], [], [2], [2], [1, 2], ... 2], [], [2], [], [1, 1], [2, 2], [2, 2]]
[[], [[7.24]], [], [[4.23, 0.0375]], ... [4.74, 8.69]], [[8.96, 7.94], [9.39, 3.33]]]
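The extra level of nesting seen in `events.Jets.subjet.pt` comes from splitting each event's flat subjet list according to `SubjetsCounts`. A minimal pure-Python sketch of that splitting for a single event is shown below; `nest_by_counts` is a hypothetical helper for illustration and is not how `nest_jagged_forms` is implemented.

```python
def nest_by_counts(flat, counts):
    """Split one event's flat subjet list into one sublist per jet."""
    nested, start = [], 0
    for n in counts:
        nested.append(flat[start:start + n])
        start += n
    return nested

# One toy event: two jets with 1 and 2 subjets (values are made up).
subjet_counts = [1, 2]
subjet_pt = [7.0, 4.0, 0.5]

print(nest_by_counts(subjet_pt, subjet_counts))  # [[7.0], [4.0, 0.5]]
```

For this to work, the counts of an event must sum to the number of subjets stored for that event, which is what the macro above guarantees by filling both in the same loop.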
