
Manifesto

Eduardo Rodrigues edited this page Feb 16, 2017 · 8 revisions

Scikit-HEP Project Manifesto

Note

Work in progress Manifesto ! Contains material under active discussion and design.

Version as of 2024-06-25, 11h37.


The Scikit-HEP project (http://scikit-hep.org/) is a community-driven and community-oriented project with the aim of providing Particle Physics at large with a Python package containing core and common tools.

The project started in Autumn 2016 and is presently being defined.

The project homepage is http://scikit-hep.org/. The (future) releases are registered on PyPI. The development is occurring at the project's GitHub page.

The project is licensed under a 3-clause BSD style license.

Project started with:

  • Vanya Belyaev (ITEP, Moscow - LHCb)
  • Noel Dawe (University of Melbourne - ATLAS)
  • David Lange (Princeton University - CMS, DIANA)
  • Sasha Mazurov (University of Birmingham - LHCb)
  • Jim Pivarski (Princeton University - CMS, DIANA)
  • Eduardo Rodrigues (University of Cincinnati - DIANA, LHCb)
  • Simulation: wrappers for Monte Carlo engines and other generators of simulated data.
  • Datasets: data in various sources, such as ROOT, Numpy/Pandas, databases, wrapped in a common interface.
  • Aggregations: e.g. histograms that summarize or project a dataset.
  • Modeling: data models and fitting utilities.
  • Visualization: interface to graphics engines, from ROOT and Matplotlib to maybe even d3 or plot.ly.
  • Community effort coordinated by a few people, to be identified in due time (set will grow with the project).
  • Provide core and common tools rather than try and do everything. JP: What's needed most is a set of gateways between existing tools to normalize their different choices of conventions and automatically convert data among them. These are the barriers that currently exist and need to be broken down.
  • Exploit Astropy's idea of affiliated packages. JP: And the Numpy/Scipy ecosystem, which I'm more familiar with than Astropy. Numpy had the advantage of predating the scientific Python world, so they could establish the conventions used everywhere else. We have a situation in which many tools exist with different conventions. Was that true of Astropy?
  • Python package supporting Python 2.6, 2.7 and 3.4.
  • Strict requirements for well-documented code, with a test suite.
  • Code should be as generic as possible and various backends should be considered, be it for plotting or I/O. Obvious examples are ROOT and matplotlib. JP: And Keras, TensorFlow, Scikit-Learn, PySpark.

First stab from ER. Feel free to edit ...:

Package and/or topic      Name
Continuous integration    Jim & Sasha
Data aggregation          Jim (Histogrammar) & Noel
Documentation             All
Histogramming             Jim & Noel & Vanya
Packaging                 Jim & Sasha
Scripts                   Eduardo & Vanya
Simulation                Eduardo & Noel
Datasets                  Noel & Vanya
Units & constants         Eduardo
Visualization             Dave & Noel
Outreach                  Eduardo & Jim
Website (OK, not code)    ?

We need to start building the website with basic info and links. Seems natural to have something already available when going public at the DIANA/HEP topical meeting on 27/02/2017. For now the website index.html contains basic info. A person interested in taking care of designing a website is being identified?

Clearly a very important aspect of the project, to have it up and running "at all times". HEP software carries a lot of legacy code, so for now we need to support at least Python 2.6, 2.7 and 3.4.

Continuous integration done with Travis.

This topic should also take care of test coverage, see https://coveralls.io. We need to add this to the project.

Possibilities to discuss, in order of priority (best guess so far):

  1. CONDA installation/channel.
  2. Standard PIP installation.
  3. PIP installation with wrapped ROOT.
  4. CVMFS at CERN.
  5. Spack installation.

There are advantages, disadvantages and issues in all cases. Needs discussion in due time.

The usual delicate point: not much fun, but very important.

Use reStructuredText format for all documentation in .py files.

Question (ER): code is documented by construction. Fine. But where to add usage documentation? Next to the functions, methods, etc.? Or at the top of the files, in what becomes the __doc__?

One also needs to think about a living and self-generated (?) document such as the one at https://github.com/rootpy/rootpy/tree/master/docs/.

JP: Where documentation should go:

  • Python doc strings: top of each module, top of each class, top of each method. See PEP 257 on writing brief, to the point docstrings.
  • I'm going to argue that comments are usually not necessary and often misleading, since they get out of date easily. We'll see if docstrings are wrong when reading the generated Sphinx docs, but we'll only see comments if we're actively reading the code. Assertions or assertion plus comment are much better than comments.
  • Sphinx documentation on scikit-hep.org. This is where most newcomers will start. Not only should the text here be introductory (unlike the docstrings), but it should also be clear that this is the introduction. I've had the problem of users finding the fine-details documentation before finding the introduction and thinking it didn't have an introduction.
  • Complete examples should be presented as Gists. If we put complete examples on scikit-hep.org or in the repository, we'll be expected to keep them up-to-date, and the maintenance cost accretes. Gists have no social expectation of up-to-dateness, with a comments section for users to suggest updates. Users can also make their own Gists (most physicists have GitHub accounts) that are discoverable via the same search. See Histogrammar documentation for how this can work: search buttons for Gists and StackOverflow embedded in our own documentation.
  • A StackOverflow tag, and we should subscribe to it to hear users' problems. Gists and StackOverflow are a track toward developing a self-sustaining community. A mailing list like RootTalk leads to constant maintenance by the core group.
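
As an illustration of the docstring and assertion points above, here is a minimal sketch. The function invariant_mass is hypothetical and not part of Scikit-HEP; it only shows a brief reStructuredText docstring in the spirit of PEP 257 and a precondition expressed as an assertion rather than as a comment:

```python
"""Hypothetical module illustrating the docstring conventions discussed above."""

import math


def invariant_mass(energy, px, py, pz):
    """Return the invariant mass of a particle from its four-momentum.

    :param energy: energy component, in the same units as the momenta.
    :param px: x component of the momentum.
    :param py: y component of the momentum.
    :param pz: z component of the momentum.
    :returns: the invariant mass, in the units of the inputs.
    """
    mass_squared = energy ** 2 - (px ** 2 + py ** 2 + pz ** 2)
    # The assertion states the precondition and fails loudly if it is violated,
    # whereas a plain comment could silently go out of date.
    assert mass_squared >= 0.0, "space-like four-momentum: m^2 < 0"
    return math.sqrt(mass_squared)
```

The docstring is what Sphinx would pick up for the generated reference pages, so an error in it is visible without reading the source.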

ER: lots of ideas but not sure at all whether that will fly for physicists. Remember, physicists are not computing engineers and do not think and work the same way. We have to make it work for physicists. So, for example, StackOverflow does not convince me as a means to document a HEP software project. But I may change my mind ... This is yet another example where we will have to seek feedback from users and the community.

We need to agree on common conventions for the code, not just on the meta-language to use. Is the layout of https://github.com/scikit-hep/scikit-hep/blob/master/skhep/units/__init__.py and https://github.com/scikit-hep/scikit-hep/blob/master/skhep/units/prefixes.py suitable?

The way to reach out to the community, train and explain. To be discussed and prepared in due time. All are expected to contribute.

We need to make it easy for the community to get in touch, provide feedback, and, of course, contribute.

In the medium term we will need 2 mailing lists, probably:

  • One list for communication among developers and active users of Scikit-HEP.
  • Maybe another list for getting in touch with the core team in case privacy is needed?

ER: the first use case cries for a Google groups list. As for the second maybe our scikit-hep.org site provides already the possibility of a mailing list such as feedback@scikit-hep.org?

JP: I'm not sure how we can set up a mailing address with our DNS (short of running a mail host at all times). If the mailing address is actually a CERN e-group but the link is clearly spelled out on the website, that will be good enough. It's not like people memorize a support e-mail address.

First proposal from ER on the structure of the repository. Not complete nor final! Work very much in progress ...

scikit-hep/
   .gitignore
   README.rst
   setup.py
   ci/                       # continuous integration
   docs/                     # documentation
     logos/                  # for the official Scikit-HEP logos
   licenses/
      LICENSE.rst            # package license
   rc/                       # for RC / configuration files
      rootrc.template        # maybe handy to ship a template
      scikitheprc.template   # the Scikit-HEP config file template
      scikitheprc.default    # the Scikit-HEP config file default
   requirements/             # place to specify dependencies
      matplotlib.txt
      ROOT.txt
   scripts/                  # suite of Scikit-HEP scripts
   skhep/                    # the actual repo for all python subpackages
      __init__.py
      exceptions.py
      logger.py
      aggregation/
         __init__.py
      config/                # place to collect modules for python configuration, see e.g. affiliated packages in astropy
         __init__.py
      constants/
         __init__.py         # make it a configuration in the RC file which CODATA year to use, from data/
         constants.py
      data/                  # For all sorts of data files
         CODATA_2014.py      # File taken from http://physics.nist.gov/cuu/Constants/
      datasets/
         __init__.py
      extern/
         __init__.py
         six.py
      io/
         __init__.py
      math/
         __init__.py
      modeling/              # models and fitting related material
         __init__.py
      simulation/            # more general term than simply 'generators'
         __init__.py
      stats/
         __init__.py
      units/
         __init__.py
         prefixes.py
         units.py
      utils/
         __init__.py
      visualization/
         __init__.py
   tests/
      data/
      __init__.py
   tutorials/        # for sure we will want tutorials,
      examples/      # be it simple examples (via scripts)
      notebooks/     # or more advanced (per topic) tutorials, nicely prepared as Jupyter notebooks

A detailed discussion follows below.

licenses/
Probably a handy directory to hold not only this package's license but also licenses for anything we decide to ship with it. Suggest LICENSE.rst for the package license and LICENSE_<PackageOrModuleName>.rst for license of a package/module shipped with scikit-hep.

JP: The main LICENSE file has to be top-level (without extensions?) for GitHub to recognize it.

ER: no, that is not true, see paragraph "Where does the license live on my repository?" at https://help.github.com/articles/open-source-licensing/. So I definitely prefer to collect all licenses in one place.

ci/

We may well need in the near future a place to add scripts and material for continuous integration. JP: When we need it. As much as possible, we should strive to have a simple CI, simple installation procedure, etc.

ER: Completely agree. Simplicity is always the way to go.

Most software packages we use have (.)XXXrc files, e.g. ROOT, IPython, Emacs, matplotlib. They are widespread and it is highly likely that scikit-hep will need one.

ER: suggestion to prepare a directory rc/ for these run commands files. Examples are:

  • A template file for scikit-hep.
  • A default rc file for scikit-hep to make it trivial for the user to know what are the defaults ;-).
  • A template file for ROOT, taken from the standard ROOT installation. And similarly for other packages.

JP: What do these configuration files hold? Aren't these equivalent to Python global variables? If so, why not make them Python global variables, so that they can be configured programmatically? If the skhep module's behavior is modified by something set outside of a script, such as a text-based configuration file, then it will be harder for users to diagnose each other's bugs.
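
To make the programmatic alternative concrete, here is a minimal sketch of settings held in a Python object instead of an rc file. The class Config and the keys codata_year and default_backend are assumptions made for this illustration, not an existing Scikit-HEP API:

```python
class Config(object):
    """Hypothetical container for package-wide settings with explicit defaults."""

    # Defaults live in code, so they can be inspected from any Python session.
    _defaults = {"codata_year": 2014, "default_backend": "numpy"}

    def __init__(self):
        self._values = dict(self._defaults)

    def get(self, key):
        return self._values[key]

    def set(self, key, value):
        # Reject unknown keys so that typos are caught immediately.
        if key not in self._defaults:
            raise KeyError("unknown configuration key: %r" % key)
        self._values[key] = value

    def reset(self):
        """Restore all settings to their defaults."""
        self._values = dict(self._defaults)


# A single module-level instance plays the role of the rc file.
config = Config()

# At the top of a user script, any non-standard behaviour is spelled out:
config.set("default_backend", "root")
```

With this approach a script that behaves differently from the defaults says so in its own first lines, which is the diagnosability argument made above.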

ER: do we also want a separate requirements/ directory to specify installation/package dependencies similarly to what rootpy does? Seems reasonable to me.

JP: Doesn't the setup.py file have a requirements section? Moreover, setup.py's requirements are automatically parsed by PIP to go fetch the dependencies. In this day and age, dependencies should not be manual.
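
One possible way to reconcile the two positions, offered only as a sketch: keep the files under requirements/ and let setup.py read them into its install_requires argument. The helper parse_requirements and the file content shown are hypothetical, and the parsing handles simple version specifiers only:

```python
def parse_requirements(text):
    """Return the requirement specifiers found in the text of a requirements file."""
    requirements = []
    for line in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if line:
            requirements.append(line)
    return requirements


# Hypothetical content of requirements/matplotlib.txt.
example = """
# plotting backend
matplotlib>=1.5
"""

install_requires = parse_requirements(example)
# In setup.py the list would be passed on: setup(..., install_requires=install_requires)
```

Either way, the dependencies end up declared in setup.py, so PIP can resolve them without manual steps.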

Subpackage docs/ for the user guide, the API and command/scripts references. JP: We have a https://github.com/scikit-hep/scikit-hep.github.io for the tutorials; what would go in this directory?

Place also to add scikit-hep logos, under logos/.

Subpackage tutorials/ for:

  • examples/: simple self-contained scripts. JP: As stated above, I prefer these to be on something like Gist, where there's zero overhead to adding a new one, users can contribute and comment, and (most importantly) there's no expectation that they're up-to-date.
  • notebooks/: for more advanced (per topic) tutorials, nicely prepared as Jupyter notebooks. JP: Jupyter notebooks are a great idea for presenting incrementally executable examples with embedded explanations and plots. These can also be Gists, which would be particularly good for Jupyter because it would both shorten the time to creation and also (very importantly) the time for a user to view it and get started using it. If the notebooks were in a directory on GitHub, the users would have to (1) install Jupyter and (2) download the notebook to load it locally. With it auto-rendered on a website, they can peruse before deciding to look closely at any one example.

Unit tests will go in a directory tests/ at the top level rather than having a series of directories under skhep/<module>/tests/. Then it is a good idea to mirror the structure under skhep/, i.e. have tests/units/, tests/constants/, etc. It might seem like a burden but let's not forget that some of the packages will have various modules and should hence have various dedicated test files.

Another point: we are almost sure to need a subdirectory data/ to hold data (e.g. ROOT files) for tests, and that's another argument for a top level tests/. That would make it easy to strip off everything not required for normal execution to make a small tarball. (Sometimes you have to do these awful things...)
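
A sketch of what one such mirrored test file could look like, using the standard unittest module. The file name tests/units/test_prefixes.py and the constant kilo are stand-ins chosen for this illustration:

```python
import unittest

# Stand-in for a constant that would be imported from skhep/units/prefixes.py.
kilo = 1.0e3


class TestPrefixes(unittest.TestCase):
    """Would live in tests/units/test_prefixes.py, mirroring skhep/units/prefixes.py."""

    def test_kilo_value(self):
        self.assertEqual(kilo, 1000.0)

    def test_kilo_is_float(self):
        self.assertIsInstance(kilo, float)


# Load and run the test case programmatically; on the command line one would
# typically rely on test discovery instead.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestPrefixes)
result = unittest.TestResult()
suite.run(result)
```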

One can also exploit the doctest module, not to test that the code is correct (see the unit tests above) but rather to check that the examples in the documentation agree with what the code actually does.
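
A minimal sketch of that use of doctest follows. The function to_gev is hypothetical; the point is only that the example embedded in its docstring is executed and compared with the real output:

```python
import doctest


def to_gev(value_in_mev):
    """Convert an energy from MeV to GeV.

    >>> to_gev(1500.0)
    1.5
    """
    return value_in_mev / 1000.0


# Collect the examples from the docstring and run them against the code.
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
for test in finder.find(to_gev, name="to_gev", globs={"to_gev": to_gev}):
    runner.run(test)
results = runner.summarize(verbose=False)
```

If the implementation changes and the docstring is not updated, the number of failures reported in results becomes non-zero.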

Scripts are extremely handy for well-defined and simple tasks. They avoid the need to write code snippets for common tasks. Agreed that Scikit-HEP scripts should have a common prefix since, unless the user is using virtualenv, they're invading his/her PATH. This way it is clear when a script is Scikit-HEP related. The skhep- prefix is the way to go.

Example of useful scripts could be:

  • Trivial script printing basic info on Scikit-HEP such as version, website and GitHub repository (this one already exists in the repo):

    skhep-info
    
  • Conversion from one backend to another. A possibility is:

    skhep-convert --from file.root --to file.h5 --ignore-errors
    

    (The --ignore-errors option would be a real option whereas --from and --to would be required arguments.)

  • Print the basic units in HEP and defined in the package:

    skhep-print-units
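
As a sketch of how the skhep-convert command line above could be parsed, assuming the standard argparse module; the destination names source and target are choices made for this illustration:

```python
import argparse


def build_parser():
    """Build the command-line parser for a hypothetical skhep-convert script."""
    parser = argparse.ArgumentParser(
        prog="skhep-convert",
        description="Convert a file from one backend format to another.",
    )
    # 'from' is a reserved word in Python, hence the explicit dest names.
    parser.add_argument("--from", dest="source", required=True,
                        help="input file, e.g. file.root")
    parser.add_argument("--to", dest="target", required=True,
                        help="output file, e.g. file.h5")
    # A genuine option: absent means False.
    parser.add_argument("--ignore-errors", action="store_true",
                        help="continue past objects that cannot be converted")
    return parser


# Parse the example command line shown above.
args = build_parser().parse_args(
    ["--from", "file.root", "--to", "file.h5", "--ignore-errors"]
)
```

Marking --from and --to as required=True keeps the readable flag syntax while still making argparse reject a call that omits either of them.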
    

For now separated into 2 different subpackages skhep/aggregation/ and skhep/datasets/. Unclear whether this separation is needed ... probably. JP: Yes, they're very different things.

Datasets should be seen as ntuples in the sense of ROOT.

ER: idea for histograms, maybe too naive/unrealistic/...: implementation of a base class with the ability to convert among various backends and read/write from the same backends. The module should have a natural pythonic interface for the representation of histograms and a straightforward conversion to specific histogram classes in wide-spread packages such as ROOT, etc.

Requirements:

  • Core functionality required/expected for/from a histogram, of course.
  • Needs to implement to() and from() methods.
  • Handy methods of checking possible backends, e.g. print_backends().
  • Read and write methods that will be dealt with in the io module, so something like write( filename, backend=None ) (the backend option is only necessary for backends such as databases storing serialised objects.).

These .to(...) methods would call behind the scenes the relevant modules io.root, io.numpy, etc., implementing the read and write methods of each backend.
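
The requirements above can be sketched as follows. Everything here is hypothetical (class name, registry, the toy "dict" backend), and it is not how Histogrammar or any existing package does it. Note that from is a reserved word in Python, so the import direction needs a different method name; from_backend is used here:

```python
class Histogram1D(object):
    """Hypothetical backend-neutral 1D histogram with uniform binning."""

    # Registry mapping a backend name to a pair (exporter, importer).
    _backends = {}

    def __init__(self, nbins, low, high, counts=None):
        assert nbins > 0 and high > low
        self.nbins, self.low, self.high = nbins, low, high
        self.counts = list(counts) if counts is not None else [0.0] * nbins

    def fill(self, x, weight=1.0):
        """Add weight to the bin containing x; values outside the range are ignored."""
        if self.low <= x < self.high:
            width = (self.high - self.low) / float(self.nbins)
            # Clamp the index to guard against rounding at the upper edge.
            index = min(int((x - self.low) / width), self.nbins - 1)
            self.counts[index] += weight

    @classmethod
    def register_backend(cls, name, exporter, importer):
        cls._backends[name] = (exporter, importer)

    @classmethod
    def backends(cls):
        """Return the sorted names of the registered backends."""
        return sorted(cls._backends)

    def to(self, backend):
        """Convert this histogram to the representation of the given backend."""
        return self._backends[backend][0](self)

    @classmethod
    def from_backend(cls, backend, obj):
        """Build a histogram from an object of the given backend."""
        return cls._backends[backend][1](obj)


# A toy "dict" backend stands in for real ones such as ROOT or Numpy.
Histogram1D.register_backend(
    "dict",
    lambda h: {"nbins": h.nbins, "low": h.low, "high": h.high,
               "counts": list(h.counts)},
    lambda d: Histogram1D(d["nbins"], d["low"], d["high"], d["counts"]),
)

hist = Histogram1D(4, 0.0, 4.0)
hist.fill(2.5)
copy = Histogram1D.from_backend("dict", hist.to("dict"))
```

A registry of this kind lets each backend live in its own module and register itself only when its dependency can be imported.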

JP: Histogrammar covers a lot of this: it could be the first "associated package." The Scikit-HEP wrapper would be necessary to bind it to Scikit-HEP's notion of datasets and visualizations. Once we have at least datasets, I can merge Histogrammar in.

Subpackage skhep/config/* to collect python configuration-related code. The astropy project, for example, puts here code to deal with affiliated packages.

JP: I don't understand what would be configured here. Much like rc/, I think these sorts of global configurations should be set by Python code (at the top of each script that uses customizations, so that it's clear how the behavior is non-standard). An example of this is Numpy's seterr function that specifies how to handle NaNs and such.

Subpackages skhep/units/ and skhep/constants/.

A first version of the units module is ready. It contains the basic units. Derived units will follow shortly.

The definition of common physical constants has also been added.
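
For readers unfamiliar with this style of units module, here is a minimal sketch of the multiplicative convention. The choice of base units (MeV, mm and ns equal to 1, as in CLHEP) is an assumption for this illustration and may differ from what skhep/units actually defines:

```python
# Base units are set to 1 (assumed here: MeV, mm, ns).
MeV = 1.0
mm = 1.0
ns = 1.0

# Other units are defined relative to the base units.
keV = 1.0e-3 * MeV
GeV = 1.0e3 * MeV
cm = 10.0 * mm
m = 1000.0 * mm

# Multiply by a unit to enter a value, divide by a unit to express it.
energy = 1.5 * GeV            # stored internally in MeV
energy_in_gev = energy / GeV  # expressed back in GeV
length = 25.0 * cm            # stored internally in mm
```

The benefit of the convention is that quantities remain plain floats, so they work unchanged with Numpy arrays and any other numerical code.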

Possible candidates for data files under skhep/data/:

  • CODATA_<year>.py.
  • mass_width_<year>.mcd that is the PDG particle data table (see comment on the PyPDT project under "Affiliated projects").

JP: Another model to follow is the timezone data package (tzdata?), which is versioned by YYYYl where l is a letter (26 possible updates per year). If it's versioned separately from Scikit-HEP itself (because of when new data comes out), it should be another "associated package."

ER: do we want/need a dedicated suite for exception handling? The exceptions should also take care of non-implemented features. A submodule such as skhep/exceptions/ is probably not needed. An exceptions.py module seems enough.
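
A sketch of what such an exceptions.py could contain. All class names are hypothetical; the idea is a single base class that users can catch, plus a dedicated exception for features not implemented yet:

```python
class SkhepError(Exception):
    """Base class for all exceptions raised by the package (hypothetical name)."""


class BackendNotAvailableError(SkhepError):
    """Raised when an optional backend, e.g. ROOT, cannot be imported."""


class SkhepNotImplementedError(SkhepError, NotImplementedError):
    """Raised by features that are planned but not implemented yet."""


def read_hdf5(filename):
    """Placeholder reader showing how a missing feature would be reported."""
    raise SkhepNotImplementedError(
        "HDF5 input is not implemented yet: %s" % filename
    )
```

Deriving from both SkhepError and the built-in NotImplementedError means the exception is caught by code that knows about the package and by generic code alike.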

Looking around, various projects bundle handy third-party packages and modules as external modules, see for example rootpy. They are shipped along with the package to avoid an extra dependency.

We can simply prepare for this use case with a subpackage skhep/extern/.

JP: Much better to do dependencies in the standard way (PIP) than absorb them like ROOT or Geant4. This will be a dependency-heavy project, anyway.

Likely to be a very important subpackage, skhep/io/, to deal with the I/O from/to the various backends the project will consider.

JP: Probably each part, like datasets, aggregations, visualizations, will have their own I/O. I don't see what a top-level io/ directory would do.

Do we want/need extra code for logging purposes? Package logging code can go in skhep/logger/ unless there is not much, in which case a module logger.py does the job. We should use Python's built-in logger.
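
A minimal sketch of a logger.py built on the standard logging module. The helper get_logger and the logger name "skhep" are assumptions for this illustration. Note that logging.NullHandler was added in Python 2.7, so Python 2.6 support would need a small fallback class:

```python
import logging


def get_logger(name="skhep"):
    """Return a package logger that stays silent unless the user configures logging."""
    logger = logging.getLogger(name)
    # A NullHandler avoids "no handlers could be found" warnings without
    # forcing any output format on the application.
    if not logger.handlers:
        logger.addHandler(logging.NullHandler())
    return logger


log = get_logger("skhep.units")
log.debug("units module loaded")  # discarded unless the user enables logging
```

Users who want to see the messages would then call logging.basicConfig(level=logging.DEBUG) in their own script, keeping the choice on their side.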

Need both skhep/math/ and skhep/stats/ directories since rather different topics in terms of code and usage.

A central part of the functionality scikit-hep will offer. Unclear at this stage whether to collect everything under a single skhep/modeling/ subpackage or rather split into skhep/models/ and skhep/fit/ for example.

JP: Maybe fit under modeling?

Simulation-related tools will go under skhep/simulation/.

Subpackage skhep/utils/ as a placeholder for what does not fit elsewhere.

Subpackage skhep/visualization/ for all matters concerning visualization. This is far from a little subpackage since the code to develop will have to deal with the various backends we want to consider.

More advanced topic to be discussed with lower priority for now.

ER: ideas for affiliated packages:

  • The ROOT team plans to have PyROOT as a genuine Python project. We should discuss with them how both packages should best talk to each other ... Does it make sense to have PyROOT as part of Scikit-HEP given that ROOT is central anyway? To be discussed once we identify a contact person from the ROOT team.
  • hep_ml for reweighting of distributions (https://github.com/arogozhnikov/hep_ml).
  • A Python API for Hydra, a C++ header-only library designed for data analysis (https://github.com/MultithreadCorner/Hydra).

ER: note that in some cases it might be useful to promote a package from affiliated to part of the core of scikit-hep.

ER suggests preparing a first public release v0.1 with just the units and constants modules, as soon as they are ready, so likely in early January. The functionality will clearly be very minimalistic at such a stage. Still, the release would have several benefits:

  • First module(s) implemented and documented.
  • Expose the package looks and documentation layout.
  • Test the integration in PyPI, namely the preparation of a release and the smooth (hopefully) download and installation on a laptop.

Releases v0.x would then be incremental, following new additions.

For these v0.x releases ER would suggest not to go full blast with a Scikit-HEP universal suite for histograms and tuples, which are central concepts in HEP. One could aim at releasing the API but using as a temporary Scikit-HEP implementation the ROOT backend. When moving to the real Scikit-HEP implementation the user would not have to adapt much code, if any. Even better, the first version of the histograms and ntuples could exploit the enhanced ROOT objects as implemented in Ostap.

There are all sorts of variations to the above. The important point is that the v0.x releases are seen as milestones towards a first release v1.0 to a wider audience. Versions v0.x would serve as examples when presenting the project to a smaller community and getting feedback; and this during the first months of the development phase.

JP: Let's start with a three digit version, without a "v". We'll be happier later if we do (see PEP 440). So X.Y.Z, where Z is just bug-fixes. We'll probably want to maintain a separate branch for each X.Y combination, so that we can bug-fix on old versions. By that logic, "master" is the bleeding edge. Bug-fixes in the X.Y branches have to be pushed out to master. Histogrammar is already structured this way.

ER: v0.x was just a way of saying version 0.x. Of course I did not mean to add v to the version number since that is not the Python way ;-). So we agree. And I also agree on your comments on master and x.y branches.
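
The X.Y.Z scheme and the per-X.Y maintenance branches can be sketched as below. Both helper functions are hypothetical, and naming a branch after its X.Y pair is an assumption made for this illustration:

```python
def parse_version(version):
    """Split an 'X.Y.Z' string into a tuple of three integers."""
    parts = version.split(".")
    if len(parts) != 3:
        raise ValueError("expected X.Y.Z, got %r" % version)
    return tuple(int(part) for part in parts)


def maintenance_branch(version):
    """Return the name of the branch that would receive bug-fixes for this version."""
    major, minor, _ = parse_version(version)
    return "%d.%d" % (major, minor)


# Tuples compare numerically, unlike the raw strings ("0.10.0" < "0.9.3").
newer = parse_version("0.10.0") > parse_version("0.9.3")
branch = maintenance_branch("0.1.2")
```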

We look forward to contributions from the community at large and need to assemble a team with complementary expertise. This is not for the immediate future, but soon-ish once we have reached a conclusion on most of the above.

In fact the presentation of the project at the DIANA topical meeting of February 20th will be a good opportunity to get a feeling for who might be interested in joining the effort ...

In particular we should welcome contacts from:

  • The ROOT team.
  • All LHC experiments.
  • Neutrino experiments, ongoing and planned.
  • Dark matter experiments.
  • The FCC community.
  • The simulation community be it Geant4 or MC generator experts.
  • The Belle II experiment.
  • The SHiP experiment under design.