Merge ef18d78 into 7852e65

weaverba137 · Aug 9, 2017 · e163e51
2 parents 7852e65 + ef18d78
Showing 16 changed files with 739 additions and 144 deletions.
diff --git a/.travis.yml b/.travis.yml
@@ -25,9 +25,9 @@ addons:
             - dvipng
 python:
     - 2.7
-    - 3.3
     - 3.4
     - 3.5
+    - 3.6
 env:
     global:
         # The following versions are the 'default' for tests, unless
@@ -46,13 +46,16 @@ matrix:
     # OS X support is still experimental, so don't penalize failures.
     allow_failures:
         - os: osx
+        - os: linux
+          python: 3.5
+          env: MAIN_CMD='pycodestyle' SETUP_CMD='--count hpsspy'
 
     include:
         # Check for sphinx doc build warnings - we do this first because it
         # runs for a long time
         - os: linux
           python: 3.5
-          env: SETUP_CMD='build_sphinx'
+          env: SETUP_CMD='build_sphinx --warning-is-error'
           # -w is an astropy extension
 
         # Coverage test, pass the results to coveralls.
@@ -63,14 +66,14 @@ matrix:
         # PEP 8 compliance.
         - os: linux
           python: 3.5
-          env: MAIN_CMD='pep8' SETUP_CMD='--count hpsspy'
+          env: MAIN_CMD='pycodestyle' SETUP_CMD='--count hpsspy'
 
 # before_install:
 #     - curl ipinfo.io
 
 install:
-    - if [[ $MAIN_CMD == 'pep8' ]]; then pip install pep8; fi
-    - if [[ $SETUP_CMD == 'build_sphinx' ]]; then pip install Sphinx; fi
+    - if [[ $MAIN_CMD == 'pycodestyle' ]]; then pip install pycodestyle; fi
+    - if [[ $SETUP_CMD == build_sphinx* ]]; then pip install Sphinx; fi
     - if [[ $MAIN_CMD == 'coverage' ]]; then pip install coverage coveralls; fi
     # - pip install -r requirements.txt
 

diff --git a/doc/changes.rst b/doc/changes.rst
@@ -2,10 +2,14 @@
 Release Notes
 =============
 
-0.3.1 (unreleased)
+0.4.0 (unreleased)
 ------------------
 
 * Add ``--version`` option.
+* Add Python 3.6, remove 3.3.
+* Add many quality-assurance checks and additional documentation (PR `#2`_).
+
+.. _`#2`: https://github.com/weaverba137/hpsspy/pull/2
 
 0.3.0 (2017-01-18)
 ------------------

diff --git a/doc/conf.py b/doc/conf.py
@@ -40,13 +40,13 @@
 
 # Configuration for intersphinx, copied from astropy.
 intersphinx_mapping = {
-    'python': ('http://docs.python.org/', None),
+    'python': ('http://docs.python.org/3/', None),
     # 'python3': ('http://docs.python.org/3/', path.abspath(path.join(path.dirname(__file__), 'local/python3links.inv'))),
-    'numpy': ('http://docs.scipy.org/doc/numpy/', None),
-    'scipy': ('http://docs.scipy.org/doc/scipy/reference/', None),
-    'matplotlib': ('http://matplotlib.org/', None),
-    'astropy': ('http://docs.astropy.org/en/stable/', None),
-    'h5py': ('http://docs.h5py.org/en/latest/', None)
+    # 'numpy': ('http://docs.scipy.org/doc/numpy/', None),
+    # 'scipy': ('http://docs.scipy.org/doc/scipy/reference/', None),
+    # 'matplotlib': ('http://matplotlib.org/', None),
+    # 'astropy': ('http://docs.astropy.org/en/stable/', None),
+    # 'h5py': ('http://docs.h5py.org/en/latest/', None)
     }
 
 # Add any paths that contain templates here, relative to this directory.

diff --git a/doc/configuration.rst b/doc/configuration.rst
@@ -0,0 +1,200 @@
+====================
+HPSSPy Configuration
+====================
+
+Introduction
+++++++++++++
+
+The primary HPSSPy command-line program :command:`missing_from_hpss` is
+configured with a JSON_ file.  Both the JSON standard and the
+Python :mod:`json` library are very strict.  However, there is a quick way
+to check the validity of a JSON file::
+
+    python -c 'import json; j = open("config.json"); data = json.load(j); j.close()'
+
+where ``"config.json"`` should be replaced with the name of the file to be
+tested.
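The one-line check above can also be written as a small function that reports the parser's error message instead of a traceback. This is only a sketch: `check_json` is a hypothetical helper for illustration, not part of HPSSPy.

```python
import json


def check_json(path):
    """Return (True, None) if the file parses as JSON, else (False, message).

    Hypothetical helper for illustration; HPSSPy does not provide it.
    """
    try:
        # The context manager closes the file even if parsing fails.
        with open(path) as handle:
            json.load(handle)
    except (IOError, ValueError) as err:
        # json raises a ValueError subclass on malformed input.
        return False, str(err)
    return True, None
```

Calling `check_json("config.json")` returns a pair whose second element, on failure, is the parser's message, which normally includes the line and column of the problem.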
+
+The top-level JSON container should be an "object", equivalent to a Python
+:class:`dict`.  The simplest possible file that satisfies this requirement
+is::
+
+    {
+    }
+
+An empty object is valid JSON, but it gives :command:`missing_from_hpss`
+nothing to work with.  You will need the further data described below.
+
+.. _JSON: http://json.org
+
+Metadata
+++++++++
+
+The configuration file should contain a top-level keyword ``"config"``.
+The value should itself be a :class:`dict`, containing some important
+metadata::
+
+    {
+        "config": {
+            "root": "/global/project/projectdirs/my_project",
+            "hpss_root": "/nersc/projects/my_project",
+            "physical_disks": ["my_project"]
+        }
+    }
+
+root/
+    The directory that contains *all* the data associated with the project.
+
+hpss\_root/
+    The path on the HPSS tape system that will contain the backups.
+
+physical\_disks/
+    If the data are spread across several physical disks and linked into
+    the root path via symlinks, the various physical disks need to be listed
+    here.  If the value is equivalent to ``False``, *e.g.*,
+    ``null``, ``false`` or ``[]``, this means that the
+    ``"root"`` disk contains all the physical data.  If the value
+    is equivalent to a one-item list containing ``os.path.basename(root)``,
+    then this *also* means that the ``"root"`` disk contains all the physical
+    data.  A list of simple names generates the physical disks by
+    substitution on the basename of the ``"root"`` value.  More complicated
+    configurations are possible; see :func:`hpsspy.scan.physical_disks`.
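One possible reading of the three simple cases for ``physical_disks`` is sketched below. The function `sketch_physical_disks` is hypothetical and is not the implementation of `hpsspy.scan.physical_disks`; in particular, it assumes that "substitution on the basename" means replacing the last path component of ``"root"`` with each listed name, and it does not handle the more complicated configurations.

```python
import os.path


def sketch_physical_disks(root, disks):
    """Illustrative only: expand a simple "physical_disks" value to paths.

    Assumes each listed name replaces the basename of root. The real
    hpsspy.scan.physical_disks may behave differently.
    """
    # null, false and [] all mean: the root disk holds everything.
    if not disks:
        return [root]
    # A one-item list naming the root's own basename means the same thing.
    if len(disks) == 1 and disks[0] == os.path.basename(root):
        return [root]
    # Otherwise substitute each simple name for the basename of root.
    parent = os.path.dirname(root)
    return [os.path.join(parent, name) for name in disks]
```

Under these assumptions, a root of ``/projects/my_project`` with ``["disk_a", "disk_b"]`` would expand to ``/projects/disk_a`` and ``/projects/disk_b``.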
+
+Sections
+++++++++
+
+Inside the root directory, as described above, there may be several top-level
+directories.  For the purposes of this documentation, these are called
+"sections" or "releases".  The terms are interchangeable.  Each section
+has configuration items that describe its structure::
+
+    {
+        "config": {
+            "root": "/projects/my_project",
+            "hpss_root": "/hpss/projects/my_project",
+            "physical_disks": ["my_project"]
+        },
+        "data": {
+            "exclude": [],
+            "d1": {
+                "d1/batch/.*$": "d1/batch.tar",
+                "d1/([^/]+\\.txt)$": "d1/\\1",
+                "d1/templates/[^/]+$": "d1/templates/templates_files.tar"
+            }
+        }
+    }
+
+The :command:`missing_from_hpss` command works on one section at a time.
+The name of the section is passed on the command-line::
+
+    missing_from_hpss config.json data
+
+This would read the ``"data"`` section above.
+
+Each section should have an ``"exclude"`` keyword, whose value is a list
+of files to be ignored.  In the example above, in order to ignore the file
+``/projects/my_project/data/d1/README.html``, the ``"exclude"`` value
+would be ``["d1/README.html"]``.  Note that this is relative to the
+path ``/projects/my_project/data``, since ``"data"`` is the section being
+processed.
+
+Mapping File Names to HPSS Archives
++++++++++++++++++++++++++++++++++++
+
+Within a section, each immediate subdirectory should be described with
+a keyword in the configuration file.  :command:`missing_from_hpss` will
+complain about any subdirectory that is not configured, but this will not
+necessarily cause it to fail.  In the example above,
+``/projects/my_project/data/d1`` is configured.
+
+There are many possible ways to bundle files for archiving.  Generally you
+want to make archives as large as possible, without spilling onto multiple
+tapes.  However, with highly structured, deeply-nested directory structures,
+this isn't always the best way to do it from a data *retrieval* viewpoint.
+
+Consider this scenario.  ``/projects/my_project/data`` has been archived to
+ten tape archives called ``data00.tar``, ``data01.tar``, ... ``data09.tar``.
+The file ``/projects/my_project/data/d1/templates/d1_template_05.fits``
+needs to be recovered.  Which tape archive contains it?
+
+Now consider the scenario where the files in
+``/projects/my_project/data/d1/templates`` have been archived to
+``/hpss/projects/my_project/data/d1/templates/d1_templates_files.tar``.
+Now is it easier to recover the file?
+
+One should still try to make archives as big as possible, but generally
+speaking, long-term archiving of large, complex data sets should be
+done by **someone who actually knows the structure of the data set**.
+
+In coding terms we describe a portion of a directory tree hierarchy
+using regular expressions to match *files* in that portion.  Then we map
+files that match that regular expression to tape archive files.
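The mapping from file names to archive names can be illustrated with the ``"d1"`` entries from the example configuration. This is a minimal sketch, not HPSSPy's own code: `archive_for` and `SECTION_MAP` are hypothetical names, and the use of `re.match` with `Match.expand` is an assumption about how such a lookup could work.

```python
import re

# The "d1" entries from the example configuration, after JSON decoding
# (each "\\" in the JSON file becomes a single backslash here).
SECTION_MAP = {
    r"d1/batch/.*$": r"d1/batch.tar",
    r"d1/([^/]+\.txt)$": r"d1/\1",
    r"d1/templates/[^/]+$": r"d1/templates/templates_files.tar",
}


def archive_for(path, mapping=SECTION_MAP):
    """Return the archive name for a section-relative path, or None."""
    for pattern, template in mapping.items():
        match = re.match(pattern, path)
        if match is not None:
            # expand() substitutes \1, \2, ... with the matched groups.
            return match.expand(template)
    return None
```

With this sketch, every file below ``d1/batch/`` maps to the single archive ``d1/batch.tar``, a text file directly in ``d1/`` maps to a copy of itself, and a path that matches no pattern yields ``None``.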
+
+Regular Expression Details
+++++++++++++++++++++++++++
+
+The HPSSPy package, and :command:`missing_from_hpss` in particular, will
+validate the regular expressions used in the configuration file, in addition
+to checking the overall validity of the JSON file itself.  That is, a bad
+regular expression will be rejected before it has any chance to "touch" any
+real data.
+
+The regular expressions should follow Python's conventions,
+described in :mod:`re`.  In addition to those conventions, this package
+imposes some additional requirements, conventions and idioms:
+
+* Requirements
+
+  - Backslashes must be escaped in JSON files.  For example, the
+    metacharacter ``\d`` (match a single decimal digit) becomes ``\\d``.
+  - Regular expressions should end with the end-of-line marker ``$``.
+
+* Conventions
+
+  - Any archive file name ending in ``.tar`` is assumed to be an HTAR file,
+    and that command will be used to construct it.
+  - Any archive file *not* ending in ``.tar`` will simply be copied to
+    HPSS as is.
+  - When constructing an archive file, :command:`missing_from_hpss` will
+    obtain the directory it needs to archive from the name of the *archive*
+    file, not the regular expression itself.  This is because regular
+    expression *substitution* is performed on the archive file name.
+    For example, ``batch.tar`` means "archive a batch/ directory".
+    For longer file names, the "suffix" of the file will be used.
+    ``data_d1_batch.tar`` also means "archive a batch/ directory", because
+    ``data_d1_`` is stripped off.
+  - An archive file name that ends with ``_files.tar``, *e.g.*,
+    ``foo/bar_files.tar``, is a signal to :command:`missing_from_hpss` to
+    construct the archive file in a certain way: not by descending into a
+    directory, but by constructing an explicit list of files and building
+    an archive file out of that.
+
+* Idioms
+
+  - Archive the entire contents of a directory into a single file:
+    ``"foo/.*$" : "foo.tar"``.
+  - Archive several subdirectories of a directory, each into its own file:
+    ``"foo/(bar|baz|flub)/.*$" : "foo/foo_\\1.tar"``.  The name of the
+    directory matched in parentheses will be substituted into the file name.
+  - Archive arbitrary subdirectories of a *set* of subdirectories:
+    ``"d1/foo/(ab|bc|cd|de|ef)/([^/]+)/.*$":"d1/foo/\\1/d1_foo_\\1_\\2.tar"``
+  - Match files in a directory, but not any files in any
+    subdirectory: ``"foo/[^/]+$" : "foo_files.tar"``.  See also the
+    ``_files.tar`` convention mentioned above.
+  - Group some but not all subdirectories in a directory into a single
+    archive file for efficiency: ``"foo/([0-9])([0-9][0-9])/.*$" : "foo/foo_\\1XX.tar"``.
+    Note the ending of the archive file, and that the directories have to
+    have a very uniform naming convention (three and only three digits
+    in this example).
+  - Do not create an archive file, just copy the file, as is, to HPSS:
+    ``"d1/README\\.txt$" : "d1/README.txt"``.  Similarly, for a set of TXT files:
+    ``"d1/([^/]+\\.txt)$" : "d1/\\1"``.
+  - An example with lots of substitutions::
+
+        "d1/foo/([0-9a-zA-Z_-]+)/sub-([0-9]+)/([0-9]+)/.*$" : "d1/foo/\\1/spectra-\\2/\\1_spectra-\\2_\\3.tar"
+
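The two requirements above (escaped backslashes and a trailing ``$``) can be checked by hand before running anything against real data. The following is a sketch only: `validate_patterns` is a hypothetical helper, the sample patterns are invented for illustration, and HPSSPy's own validation may check more or report differently.

```python
import json
import re


def validate_patterns(mapping):
    """Return a list of (pattern, problem) pairs; empty if none are found.

    Hypothetical helper: checks only that each key compiles as a Python
    regular expression and ends with '$'.
    """
    problems = []
    for pattern in mapping:
        try:
            re.compile(pattern)
        except re.error as err:
            problems.append((pattern, "does not compile: {0}".format(err)))
            continue
        if not pattern.endswith("$"):
            problems.append((pattern, "does not end with '$'"))
    return problems


# A raw string keeps the text exactly as it would appear in the JSON file:
# the JSON "\\d" decodes to the regular expression \d.
text = r'''{
    "d1/run\\d+/.*$": "d1/runs.tar",
    "d1/(unclosed/.*$": "d1/bad.tar",
    "d1/logs/.*": "d1/logs.tar"
}'''
mapping = json.loads(text)
problems = validate_patterns(mapping)
```

Here the first pattern passes, the second is reported because its parenthesis is never closed, and the third is reported because it lacks the trailing ``$``.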
+Finally, for truly monumentally-complicated directory trees, there is a
+`JSON file`_ included with this distribution describing the SDSS_ data tree
+that can be used for examples.  To view the equivalent files and directories
+for section ``"dr12"``, for example, visit https://data.sdss.org/sas/dr12.
+
+.. _SDSS: https://www.sdss.org
+.. _`JSON file`: https://github.com/weaverba137/hpsspy/blob/master/hpsspy/data/sdss.json
diff --git a/doc/index.rst b/doc/index.rst
@@ -35,8 +35,10 @@ Contents
 .. toctree::
    :maxdepth: 1
 
-   changes
+   configuration
+   using
    api
+   changes
 
 Indices and tables
 ++++++++++++++++++