Sanjiban Sengupta edited this page Aug 24, 2021 · 6 revisions

Root Storage of Deep Learning Models in TMVA

GSoC'21 with CERN-HSF

Following is the Google Summer of Code 2021 final report on the project "Root Storage of Deep Learning Models in TMVA", conducted under CERN-HSF.

Project Details

Student's Name: Sanjiban Sengupta
Mentors: Lorenzo Moneta, Sitong An, Anirudh Dagar
Organization: Root-Project (CERN-HSF)
Organization Code Repository: https://github.com/root-project/root
Project Page: https://summerofcode.withgoogle.com/projects/#5424575602491392
Code Implementations: https://github.com/root-project/root/pulls?q=author:sanjibansg
Documentation Blog: https://blog.sanjiban.ml/series/gsoc


About Project

The Toolkit for Multivariate Data Analysis (TMVA) is a sub-module of ROOT that provides a machine learning environment for the training, testing, and evaluation of various multivariate methods, especially those used in High-Energy Physics. Recently, the TMVA team introduced SOFIE (System for Fast Inference code Emit), which provides its own intermediate representation of deep learning models following the ONNX standard. To facilitate the usage, storage, and exchange of these models, this project aimed at developing storage functionality for deep learning models in the .root format, which is popular in the High-Energy Physics community.
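The ONNX-style intermediate representation mentioned above can be pictured as a graph of operator nodes connecting named tensors. The following is a conceptual sketch only; the names `OpNode` and `Graph` are illustrative and are not SOFIE's actual classes:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Illustrative sketch of an ONNX-style intermediate representation:
// a model is a graph of operator nodes that read and write named tensors.
struct OpNode {
    std::string opType;                 // e.g. "Gemm", "Relu", "Transpose"
    std::vector<std::string> inputs;    // names of input tensors
    std::vector<std::string> outputs;   // names of output tensors
};

struct Graph {
    std::vector<OpNode> nodes;          // topologically ordered operators

    // Count how many nodes of a given operator type the graph contains.
    std::size_t count(const std::string& type) const {
        std::size_t n = 0;
        for (const auto& node : nodes)
            if (node.opType == type) ++n;
        return n;
    }
};
```

An inference-code generator then only has to walk `nodes` in order and emit code for each operator, which is the design SOFIE follows.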

Deliverables

  1. Functionality for serialization of RModel, for storing a trained deep learning model in the .root format.
  2. Functionality for parsing a Keras .h5 file into an RModel object for generation of inference code.
  3. Functionality for parsing a PyTorch .pt file into an RModel object for generation of inference code.
  4. Tests and Tutorials for various parsers of TMVA SOFIE's RModel object.

Implementations

1. Serialization of RModel PR#8666

  • Link to Blog article: https://blog.sanjiban.ml/root-project-introducing-sofie

  • Description RModel is the primary class defined in SOFIE for storing the configuration and weights of a trained deep learning model, and ROperator is the abstract base class from which the various operators are derived. Following the ONNX standard, each ROperator is responsible for generating operator-specific inference code that acts on the input tensors and produces the outputs according to the given attributes. The RModel class had to be made serializable so that it can be saved in the .root format.
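As described above, each operator contributes its own piece of the generated inference code. A minimal sketch of that pattern follows; the names `Op`, `ReluOp`, and `GenerateAll` are illustrative and not SOFIE's real ROperator interface:

```cpp
#include <memory>
#include <sstream>
#include <string>
#include <vector>

// Illustrative abstract operator: each concrete operator emits the C++
// statement(s) that implement it in the generated inference function.
struct Op {
    virtual ~Op() = default;
    virtual std::string Generate() const = 0;
};

// A toy element-wise ReLU that emits one line of inference code.
struct ReluOp : Op {
    std::string in, out;
    ReluOp(std::string i, std::string o) : in(std::move(i)), out(std::move(o)) {}
    std::string Generate() const override {
        return out + " = (" + in + " > 0) ? " + in + " : 0;";
    }
};

// Concatenate the code emitted by every operator in graph order.
inline std::string GenerateAll(const std::vector<std::unique_ptr<Op>>& ops) {
    std::ostringstream code;
    for (const auto& op : ops) code << op->Generate() << '\n';
    return code.str();
}
```

This virtual `Generate()` dispatch is why new operators can be added without touching the model class itself.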

  • Progress

    • Modifying the Data Structures
      • Modifying struct InitializedTensor
      • Modifying class RModel & ROperator
    • Modifying the LinkDef file
    • Adding the Custom Streamer to RModel
    • Tests
      • Emit Files for generating header files
      • Tests for Parser
  • Interface

    //Writing ROOT File
    TFile file("model.root","CREATE");
    using namespace TMVA::Experimental;
    SOFIE::RModel model = SOFIE::PyKeras::Parse("trained_model_dense.h5");
    model.Write("model");
    file.Close();
    
    //Reading ROOT File
    TFile file("model.root","READ");
    using namespace TMVA::Experimental;
    SOFIE::RModel *model = nullptr;
    file.GetObject("model",model);
    file.Close();
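The custom streamer listed above is needed because a trained model carries raw weight buffers that must be flattened to a byte sequence on write and rebuilt on read. The following is a self-contained sketch of that round trip, with a plain `std::vector<char>` standing in for a ROOT buffer; it is not the actual TMVA streamer code:

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Toy "model": a weight buffer that must survive a write/read round trip.
struct ToyModel {
    std::vector<float> weights;
};

// "Streaming out": flatten the length-prefixed weight buffer into bytes,
// the same job a custom ROOT Streamer does into a TBuffer.
inline std::vector<char> StreamOut(const ToyModel& m) {
    std::size_t n = m.weights.size();
    std::vector<char> buf(sizeof(n) + n * sizeof(float));
    std::memcpy(buf.data(), &n, sizeof(n));
    std::memcpy(buf.data() + sizeof(n), m.weights.data(), n * sizeof(float));
    return buf;
}

// "Streaming in": rebuild the model from the flat byte sequence.
inline ToyModel StreamIn(const std::vector<char>& buf) {
    std::size_t n = 0;
    std::memcpy(&n, buf.data(), sizeof(n));
    ToyModel m;
    m.weights.resize(n);
    std::memcpy(m.weights.data(), buf.data() + sizeof(n), n * sizeof(float));
    return m;
}
```

Writing the element count before the data is what lets the read side allocate the buffer before copying, which is the essential job a custom streamer performs for members that ROOT cannot stream automatically.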
    

2. Keras Parser for RModel PR#8430

  • Link to blog article https://blog.sanjiban.ml/root-project-keras-parser-for-sofie

  • Description A converter for Keras .h5 models was required to translate Keras Sequential API and Functional API models into an RModel object for the subsequent generation of inference code.

  • Progress

    • Restructured SOFIE to avoid dependency conflicts between different Python libraries
    • Parser function for extracting the model information and weights and instantiating an RModel object
      • Support for Keras Sequential API Models
      • Support for Keras Functional API Models
      • Supports Dense (with relu activation), ReLU and Permute Layers
      • Header file for the function
      • Function implementation
    • Converter function writing the RModel containing the model information into a ROOT file.
      • Header file for the function
      • Function implementation
    • Tests
      • Emit Files for generating header files
      • Tests for Parser
    • Tutorials
  • Interface

    //Parser returns an RModel object
    using namespace TMVA::Experimental::SOFIE;
    RModel model = PyKeras::Parse("trained_model_dense.h5");
    
    //Converter writes a ROOT file directly
    PyKeras::ConvertToRoot("trained_model_dense.h5");
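Internally, a layer-by-layer parser of this kind typically dispatches on the layer's type string, mapping each supported type (Dense, ReLU, Permute) to code that builds the corresponding operator. The following is a hypothetical sketch of that dispatch pattern, not the actual PyKeras implementation:

```cpp
#include <functional>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical dispatch table: map each supported Keras layer type to a
// handler that would append the corresponding operator to the model.
using LayerHandler = std::function<void(const std::string& layerName)>;

struct LayerDispatcher {
    std::map<std::string, LayerHandler> handlers;

    // Look up the handler for this layer type and invoke it, or fail
    // loudly for layer types the parser does not support.
    void Parse(const std::string& layerType, const std::string& layerName) const {
        auto it = handlers.find(layerType);
        if (it == handlers.end())
            throw std::runtime_error("Unsupported layer type: " + layerType);
        it->second(layerName);   // e.g. build a Gemm/ReLU/Transpose operator
    }
};
```

Failing fast on an unknown layer type keeps the converter honest: a model using an unsupported layer is rejected at parse time rather than producing wrong inference code.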
    

3. PyTorch Parser for RModel PR#8684

  • Link to Blog Article https://blog.sanjiban.ml/root-project-pytorch-parser-for-sofie

  • Description A converter was required for parsing PyTorch .pt models saved using TorchScript into an RModel object for the subsequent generation of inference code. The developed functionality requires the shape of each input tensor and its data type. If not specified, the data type defaults to Float, but the vector of shapes is a mandatory parameter.

  • Progress

    • Parser function for extracting the model information and weights and instantiating an RModel object

      • Support for PyTorch nn.Module, nn.Sequential, nn.ModuleList containers.
      • Supports Linear, ReLU and Transpose Layers/operations.
      • Supports tensors of dynamic axes.
      • Header file for the function
      • Function implementation
    • Converter function writing the RModel containing the model information into a ROOT file.

      • Header file for the function
      • Function implementation
    • Tests

      • Emit Files for generating header files
      • Tests for Parser
    • Tutorials

  • Interface

    //Parser returns an RModel object
    using namespace TMVA::Experimental::SOFIE;
    
    //Building the vector of input shapes
    std::vector<size_t> s1{120,1};
    std::vector<std::vector<size_t>> inputShape{s1};
    RModel model = PyTorch::Parse("trained_model_dense.pt",inputShape);
    
    //Converter writes a ROOT file directly
    PyTorch::ConvertToRoot("trained_model_dense.pt",inputShape);
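Because the PyTorch parser takes the input shapes explicitly (one `std::vector<size_t>` per input tensor), small helpers like the following are useful for sanity-checking them before parsing. These are illustrative helpers, not part of the TMVA API:

```cpp
#include <cstddef>
#include <vector>

// Total number of elements implied by one input shape, e.g. {120, 1} -> 120.
inline std::size_t TotalElements(const std::vector<std::size_t>& shape) {
    std::size_t n = 1;
    for (std::size_t d : shape) n *= d;
    return n;
}

// A shape is usable for static code generation only if every axis has a
// known, non-zero extent; here 0 marks an unknown/dynamic axis.
inline bool IsFullySpecified(const std::vector<std::size_t>& shape) {
    if (shape.empty()) return false;
    for (std::size_t d : shape)
        if (d == 0) return false;
    return true;
}
```

Checks like these matter because the generated inference code sizes its tensor buffers at generation time, so every axis the generator relies on must be known up front.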
    

4. Tests & Tutorials

  • Tests were built on Google's GTest framework. Python scripts, run through the C-Python API, were developed to generate models and save them. These models were then parsed, and the correctness of the parsers was validated by comparing the outputs of the generated inference code with those of the saved models when called on the same input tensors.
  • Simple tutorials were built (PR#8874) showcasing use cases of the parsers, generation of inference code, and usage of the functions defined in the RModel class.
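The output comparison described above amounts to an element-wise check within a floating-point tolerance, along these lines (an illustrative helper, not the actual GTest code):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Compare the inference output of the generated code against the reference
// output of the original model, element by element, within a tolerance.
inline bool OutputsMatch(const std::vector<float>& generated,
                         const std::vector<float>& reference,
                         float tol = 1e-5f) {
    if (generated.size() != reference.size()) return false;
    for (std::size_t i = 0; i < generated.size(); ++i)
        if (std::fabs(generated[i] - reference[i]) > tol) return false;
    return true;
}
```

A tolerance rather than exact equality is essential here, since the generated code and the original framework may order floating-point operations differently.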

5. Extras

After implementing the expected deliverables, I started working on the development of the ROOT storage of BDTs. The implementation required a class serving as the primary data structure for holding the model configuration and weights, serializable into a .root file; a Parse function for translating a BDT model trained in TMVA and saved as an .xml file; and, lastly, a mapping interface to TMVA tree inference for generating inference code. The class was initially implemented by Jonas Rembser (https://github.com/guitargeek/tmva-to-xgboost/), and I made the necessary modifications.

  • Interface
    //Parser loads the BDT model from .xml to RootStorage::BDT object
    TMVA::Experimental::RootStorage::BDT model;
    bool usePurity = true;
    model.Parse("TMVA_CNN_Classification_BDT.weights.xml",usePurity);
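Inference on a parsed BDT reduces to walking each tree from the root: compare one feature against a threshold, descend left or right, and return the leaf response. The following is a self-contained sketch of that traversal, using a flat array layout chosen for illustration; it does not reflect the RootStorage::BDT internals:

```cpp
#include <vector>

// One decision tree in flat array form: internal node i tests
// features[cutVar[i]] <= cutValue[i]; negative child indices encode leaves.
struct Tree {
    std::vector<int>   cutVar;      // feature index tested at each node
    std::vector<float> cutValue;    // threshold at each node
    std::vector<int>   left, right; // child node index, or -(leafIndex+1)
    std::vector<float> leaf;        // leaf responses
};

// Walk one tree for a single event and return its leaf response.
inline float Score(const Tree& t, const std::vector<float>& features) {
    int node = 0;
    while (true) {
        int next = (features[t.cutVar[node]] <= t.cutValue[node])
                       ? t.left[node] : t.right[node];
        if (next < 0) return t.leaf[-next - 1];  // reached a leaf
        node = next;
    }
}
```

A boosted ensemble then sums `Score` over all its trees; storing the nodes as flat arrays rather than pointer-linked objects is what makes the structure easy to serialize and fast to traverse.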
    

Contributions

Pull Request PR Number Status
Restructured SOFIE #8594
Serialisation of RModel #8666
Modifying AddOutputTensorNameList() #8640
PyKeras Converter TMVA #8430
PyTorch Converter TMVA #8684
Tutorials for RModel Parsers #8874
Root Storage of BDT #8873

Future Plan

  • Documenting the data structures and functions in SOFIE and the parsers using Doxygen.

  • Contribute to ROOT & TMVA for implementing, improving, and debugging code.

  • Development of Root Storage of BDT

    • Develop the mapping interface for inference code generation from class RootStorage::BDT
    • Researching the conversion of scikit-learn based BDT models to the class RootStorage::BDT for subsequent inference
    • Adding tests & tutorials
  • Development of ROperators

    • Implementing classes for various ROperators for ONNX & ONNX-ml

Conclusion

The planned goals of the project were successfully implemented. Currently in an experimental stage, SOFIE requires continuous development and holds promising applications in the inference of deep learning models. I wish to keep contributing to the project by implementing functionalities, improving features, and debugging issues. I gained an in-depth understanding of the ROOT project and its applications in High-Energy Physics. While working on the project, I faced numerous challenges but learned how to tackle them. Along the way, I learned about many tools, methods, and concepts for developing robust applications. It was a dream to work with people from the largest particle physics facility in the world; I am grateful for the opportunity and guidance they gave me, and I sincerely hope for the chance to work with them again.


Acknowledgement

First of all, I thank Google for organizing this event of massive learning, networking, and hands-on experience in open-source software development. I am highly grateful to my mentors Lorenzo Moneta, Sitong An, and Anirudh Dagar, and to CERN-HSF, for providing me the opportunity to work on the project and for all the guidance and help they have given. I am also thankful to TMVA team member Omar Andres Zapata Mesa for his help and support in implementing and debugging the functionalities. Lastly, I thank all the student developers for making this program a success, my friends and seniors for their continuous help and support, and my parents for their belief, guidance, and support in all my endeavors.