Merge 0f60098 into 7852e65

weaverba137 · Aug 9, 2017 · 24e1be1
2 parents 7852e65 + 0f60098
commit 24e1be1

Showing 15 changed files with 715 additions and 136 deletions.
diff --git a/.travis.yml b/.travis.yml
@@ -25,9 +25,9 @@ addons:
             - dvipng
 python:
     - 2.7
-    - 3.3
     - 3.4
     - 3.5
+    - 3.6
 env:
     global:
         # The following versions are the 'default' for tests, unless
@@ -46,13 +46,16 @@ matrix:
     # OS X support is still experimental, so don't penalize failures.
     allow_failures:
         - os: osx
+        - os: linux
+          python: 3.5
+          env: MAIN_CMD='pycodestyle' SETUP_CMD='--count hpsspy'
 
     include:
         # Check for sphinx doc build warnings - we do this first because it
         # runs for a long time
         - os: linux
           python: 3.5
-          env: SETUP_CMD='build_sphinx'
+          env: SETUP_CMD='build_sphinx --warning-is-error'
           # -w is an astropy extension
 
         # Coverage test, pass the results to coveralls.
@@ -63,14 +66,14 @@ matrix:
         # PEP 8 compliance.
         - os: linux
           python: 3.5
-          env: MAIN_CMD='pep8' SETUP_CMD='--count hpsspy'
+          env: MAIN_CMD='pycodestyle' SETUP_CMD='--count hpsspy'
 
 # before_install:
 #     - curl ipinfo.io
 
 install:
-    - if [[ $MAIN_CMD == 'pep8' ]]; then pip install pep8; fi
-    - if [[ $SETUP_CMD == 'build_sphinx' ]]; then pip install Sphinx; fi
+    - if [[ $MAIN_CMD == 'pycodestyle' ]]; then pip install pycodestyle; fi
+    - if [[ $SETUP_CMD == build_sphinx* ]]; then pip install Sphinx; fi
     - if [[ $MAIN_CMD == 'coverage' ]]; then pip install coverage coveralls; fi
     # - pip install -r requirements.txt
 

diff --git a/doc/changes.rst b/doc/changes.rst
@@ -6,6 +6,11 @@ Release Notes
 ------------------
 
 * Add ``--version`` option.
+* Add Python 3.6, remove 3.3.
+* Add many quality-assurance checks and additional documentation (PR `#2`_).
+* Todo: document command-line use, unit tests.
+
+.. _`#2`: https://github.com/weaverba137/hpsspy/pull/2
 
 0.3.0 (2017-01-18)
 ------------------

diff --git a/doc/conf.py b/doc/conf.py
@@ -40,13 +40,13 @@
 
 # Configuration for intersphinx, copied from astropy.
 intersphinx_mapping = {
-    'python': ('http://docs.python.org/', None),
+    'python': ('http://docs.python.org/3/', None),
     # 'python3': ('http://docs.python.org/3/', path.abspath(path.join(path.dirname(__file__), 'local/python3links.inv'))),
-    'numpy': ('http://docs.scipy.org/doc/numpy/', None),
-    'scipy': ('http://docs.scipy.org/doc/scipy/reference/', None),
-    'matplotlib': ('http://matplotlib.org/', None),
-    'astropy': ('http://docs.astropy.org/en/stable/', None),
-    'h5py': ('http://docs.h5py.org/en/latest/', None)
+    # 'numpy': ('http://docs.scipy.org/doc/numpy/', None),
+    # 'scipy': ('http://docs.scipy.org/doc/scipy/reference/', None),
+    # 'matplotlib': ('http://matplotlib.org/', None),
+    # 'astropy': ('http://docs.astropy.org/en/stable/', None),
+    # 'h5py': ('http://docs.h5py.org/en/latest/', None)
     }
 
 # Add any paths that contain templates here, relative to this directory.

diff --git a/doc/configuration.rst b/doc/configuration.rst
@@ -0,0 +1,197 @@
+====================
+HPSSPy Configuration
+====================
+
+Introduction
+++++++++++++
+
+The primary HPSSPy command-line program :command:`missing_from_hpss` is
+configured with a JSON_ file.  Both the JSON standard and the
+Python :mod:`json` library are very strict.  There is, however, a quick way
+to check the validity of a JSON file::
+
+    python -c 'import json; j = open("config.json"); data = json.load(j); j.close()'
+
+where ``"config.json"`` should be replaced with the name of the file to be
+tested.
+
+The top-level JSON container should be an "object", equivalent to a Python
+:class:`dict`.  The simplest possible file that satisfies this requirement
+is::
+
+    {
+    }
+
+Obviously, that's not very much to go on.  You will need further data
+described below.
+
+.. _JSON: http://json.org
+
+Metadata
+++++++++
+
+The configuration file should contain a top-level keyword ``"config"``.
+The value should itself be a :class:`dict`, containing some important
+metadata::
+
+    {
+        "config": {
+            "root": "/global/project/projectdirs/my_project",
+            "hpss_root": "/nersc/projects/my_project",
+            "physical_disks": ["my_project"]
+        }
+    }
+
+root/
+    The directory that contains *all* the data associated with the project.
+
+hpss\_root/
+    The path on the HPSS tape system that will contain the backups.
+
+physical\_disks/
+    If the data are spread across several physical disks and linked into
+    the root path via symlinks, the various physical disks need to be listed
+    here.  If the value is equivalent to ``False``, *e.g.*
+    ``null``, ``false`` or ``[]``, this means that the
+    ``"root"`` disk contains all the physical data.  If the value
+    is equivalent to a one-item list containing ``os.path.basename(root)``,
+    then this *also* means that the ``"root"`` disk contains all the physical
+    data.  A list of simple names generates the physical disks by
+    substitution on the basename of the ``"root"`` value.  More complicated
+    configurations are possible; see :func:`hpsspy.scan.physical_disks`.
+
+Sections
+++++++++
+
+Inside the root directory, as described above, there may be several top-level
+directories.  For the purposes of this documentation, these are called
+"sections" or "releases".  The terms are interchangeable.  Each section
+has configuration items that describe its structure::
+
+    {
+        "config": {
+            "root": "/projects/my_project",
+            "hpss_root": "/hpss/projects/my_project",
+            "physical_disks": ["my_project"]
+        },
+        "data": {
+            "exclude": [],
+            "d1": {
+                "d1/batch/.*$": "d1/batch.tar",
+                "d1/([^/]+\\.txt)": "d1/\\1",
+                "d1/templates/[^/]+$": "d1/templates/templates_files.tar"
+            }
+        }
+    }
+
+The :command:`missing_from_hpss` command works on one section at a time.
+The name of the section is passed on the command-line::
+
+    missing_from_hpss config.json data
+
+This would read the ``"data"`` section above.
+
+Each section should have an ``"exclude"`` keyword, whose value is a list
+of files to be ignored.  In the example above, in order to ignore the file
+``/projects/my_project/data/d1/README.html``, the ``"exclude"`` value
+would be ``["d1/README.html"]``.  Note that this is relative to the
+path ``/projects/my_project/data``, since ``"data"`` is the section being
+processed.
+
+Mapping File Names to HPSS Archives
++++++++++++++++++++++++++++++++++++
+
+Within a section, each immediate subdirectory should be described with
+a keyword in the configuration file.  :command:`missing_from_hpss` will
+complain about any that are missing, but won't necessarily fail.  In the
+example above, ``/projects/my_project/data/d1`` is configured.
+
+There are many possible ways to bundle files for archiving.  Generally you
+want to make archives as large as possible, without spilling onto multiple
+tapes.  However, with highly structured, deeply-nested directory structures,
+this isn't always the best way to do it from a data *retrieval* viewpoint.
+
+Consider this scenario.  ``/projects/my_project/data`` has been archived to
+ten tape archives called ``data00.tar``, ``data01.tar``, ... ``data09.tar``.
+The file ``/projects/my_project/data/d1/templates/d1_template_05.fits``
+needs to be recovered.  Which tape archive contains it?
+
+Now consider the scenario where the files in
+``/projects/my_project/data/d1/templates`` have been archived to
+``/hpss/projects/my_project/data/d1/templates/d1_templates_files.tar``.
+Now is it easier to recover the file?
+
+One should still try to make archives as big as possible, but generally
+speaking, long-term archiving of large, complex data sets should be
+done by **someone who actually knows the structure of the data set**.
+
+In coding terms we describe a portion of a directory tree hierarchy
+using regular expressions to match *files* in that portion.  Then we map
+files that match that regular expression to tape archive files.
+
+Regular Expression Details
+++++++++++++++++++++++++++
+
+The HPSSPy package and :command:`missing_from_hpss` will validate the
+regular expressions used in the configuration file, in addition to checking
+the overall validity of the JSON file itself.  That is, a bad regular
+expression will be rejected before it has any chance to "touch" any real data.
+
+The regular expressions should follow Python's conventions,
+described in :mod:`re`.  In addition to those conventions, this package
+imposes some additional requirements, conventions and idioms:
+
+* Requirements
+
+  - Backslashes must be escaped in JSON files.  For example, the
+    metacharacter ``\d`` (match a single decimal digit) becomes ``\\d``.
+    "Double-escaping" is not required (if you don't know what this is,
+    don't worry about it).
+
+* Conventions
+
+  - Any archive file name ending in ``.tar`` is assumed to be an HTAR file,
+    and :command:`htar` will be used to construct it.
+  - Any archive file *not* ending in ``.tar`` will simply be copied to
+    HPSS as is.
+  - When constructing an archive file, :command:`missing_from_hpss` will
+    obtain the directory it needs to archive from the name of the *archive*
+    file, not the regular expression itself.  This is because regular
+    expression *substitution* is performed on the archive file name.
+    For example ``batch.tar`` means "archive a batch/ directory".
+    For longer file names, the "suffix" of the file will be used.
+    ``data_d1_batch.tar`` also means "archive a batch/ directory", because
+    ``data_d1_`` is stripped off.
+  - An archive file name that ends with ``_files.tar``, *e.g.*
+    ``foo/bar_files.tar``, is a signal to :command:`missing_from_hpss` to
+    construct the archive file in a certain way: not by descending into a
+    directory, but by constructing an explicit list of files and building
+    an archive file out of that.
+  - Regular expressions should end with the end-of-line marker ``$``.
+
+* Idioms
+
+  - Archive the entire contents of a directory into a single file:
+    ``"foo/.*$" : "foo.tar"``.
+  - Archive several subdirectories of a directory, each into its own file:
+    ``"foo/(bar|baz|flub)/.*$" : "foo/foo_\\1.tar"``.  The name of the
+    directory matched in parentheses will be substituted into the file name.
+  - Archive arbitrary subdirectories of a *set* of subdirectories:
+    ``"d1/foo/(ab|bc|cd|de|ef)/([^/]+)/.*$":"d1/foo/\\1/d1_foo_\\1_\\2.tar"``
+  - Match files in a directory, but not any files in any
+    subdirectory: ``"foo/[^/]+$" : "foo_files.tar"``.  See also the
+    ``_files.tar`` convention mentioned above.
+  - Do not create an archive file, just copy the file, as is, to HPSS:
+    ``"d1/README\\.txt" : "d1/README.txt"``.  Similarly, for a set of TXT files:
+    ``"d1/([^/]+\\.txt)" : "d1/\\1"``.
+  - An example with lots of substitutions::
+
+        "d1/foo/([0-9a-zA-Z_-]+)/sub-([0-9]+)/([0-9]+)/.*$" : "d1/foo/\\1/spectra-\\2/\\1_spectra-\\2_\\3.tar"
+
+Finally, for truly monumentally-complicated directory trees, there is a
+`JSON file`_ included with this distribution describing the SDSS_ data tree
+that can be used for examples.  To view the equivalent files and directories
+for section ``"dr12"``, for example, visit https://data.sdss.org/sas/dr12.
+
+.. _SDSS: https://www.sdss.org
+.. _`JSON file`: https://github.com/weaverba137/hpsspy/blob/master/hpsspy/data/sdss.json
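
Illustration for doc/configuration.rst (not part of this commit): a minimal sketch of how the validation and mapping rules described in that file could be exercised by hand. It is not HPSSPy's implementation. The configuration text is a hypothetical copy of the ``"data"`` example, the helper names `compile_section` and `map_file` are invented here, and anchoring patterns with `re.match` is an assumption about how the patterns are applied.

```python
import json
import re

# Hypothetical configuration text, modelled on the "data" section in
# doc/configuration.rst.  The raw string keeps the JSON escapes (\\.) intact.
CONFIG_TEXT = r"""
{
    "config": {
        "root": "/projects/my_project",
        "hpss_root": "/hpss/projects/my_project",
        "physical_disks": ["my_project"]
    },
    "data": {
        "exclude": ["d1/README.html"],
        "d1": {
            "d1/batch/.*$": "d1/batch.tar",
            "d1/([^/]+\\.txt)": "d1/\\1",
            "d1/templates/[^/]+$": "d1/templates/templates_files.tar"
        }
    }
}
"""


def compile_section(config, section):
    """Compile every regular expression in one section.

    re.compile raises re.error on an invalid pattern, so a bad expression
    is rejected here, before any file name is examined.
    """
    compiled = []
    for key, value in config[section].items():
        if key == "exclude":
            continue  # a list of ignored files, not a regex-to-archive map
        for pattern, archive in value.items():
            compiled.append((re.compile(pattern), archive))
    return compiled


def map_file(name, compiled, exclude=()):
    """Return the archive names a file maps to (ideally exactly one)."""
    if name in exclude:
        return []
    matches = ((rx.match(name), archive) for rx, archive in compiled)
    # Match.expand substitutes \1, \2, ... into the archive file name.
    return [m.expand(archive) for m, archive in matches if m]


config = json.loads(CONFIG_TEXT)  # raises ValueError on invalid JSON
compiled = compile_section(config, "data")
exclude = config["data"]["exclude"]

for name in ("d1/batch/run1/a.dat",
             "d1/notes.txt",
             "d1/templates/d1_template_05.fits",
             "d1/README.html"):
    print(name, "->", map_file(name, compiled, exclude))
```

File names are written relative to the section directory, as the ``"exclude"`` description requires. A file that yields an empty list is either excluded or not covered by any expression.
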
diff --git a/doc/index.rst b/doc/index.rst
@@ -35,8 +35,10 @@ Contents
 .. toctree::
    :maxdepth: 1
 
-   changes
+   configuration
+   using
    api
+   changes
 
 Indices and tables
 ++++++++++++++++++

diff --git a/doc/using.rst b/doc/using.rst
@@ -0,0 +1,105 @@
+============
+Using HPSSPy
+============
+
+Introduction
+++++++++++++
+
+The primary *command-line* interface to HPSSPy is the script
+:command:`missing_from_hpss`, which is automatically generated by the
+package install process.  If you need to generate this script manually, it
+is equivalent to::
+
+    #!/usr/bin/env python
+    from sys import exit
+    from hpsspy.scan import main
+    exit(main())
+
+Options
++++++++
+
+There are a lot of command-line options.  ``missing_from_hpss --help`` will
+display all of them.  Only the short forms of the options are
+shown here.
+
+-c DIR      Cache files (described below) are written to
+            ``$HOME/scratch`` by default.  This option
+            allows the user to choose any directory.
+-D          Delete and recreate the disk cache file
+            (described below).
+-H          Delete and recreate the HPSS cache file
+            (described below).
+-l N        Limit archive files to this size in GB.
+            The default is 1024 GB (1 TB).
+-p          Issue the HPSS commands necessary to actually
+            back up the files found that need to be backed up.
+-r N        Issue a progress report on how many files
+            have been analyzed after ``N`` files
+            (default 10,000).
+-t          Test mode.  Try not to make any changes.
+            Also pretend that there are no files backed up to HPSS.
+-v          Print *lots* of extra information.
+--version   Print a version string and exit.
+
+Besides the options described above, :command:`missing_from_hpss` requires
+two positional arguments::
+
+    missing_from_hpss config.json section
+
+The two arguments are the path to a configuration file and a section of that
+file to process.  These are extensively described in the
+:doc:`configuration document <configuration>`.
+
+Cache Files
++++++++++++
+
+:command:`missing_from_hpss` uses a few cache files, primarily to reduce
+its memory footprint.  These files will be stored in ``$HOME/scratch``
+by default.  The files are:
+
+Disk Cache
+    A CSV file of the form ``disk_cache_<section>.csv``, where ``<section>`` is
+    the section (as defined above) specified on the command-line.  The
+    columns are file name and file size in bytes.
+
+HPSS Cache
+    A plain-text file of the form ``hpss_cache_<section>.txt``,
+    where ``<section>`` is the section (as defined above) specified on
+    the command-line.  This is simply a list of files found on HPSS.
+
+Missing File Cache
+    A JSON file of the form ``missing_files_<section>.json``,
+    where ``<section>`` is the section (as defined above) specified on the
+    command-line.  It contains a map of HPSS archive files to the files that
+    belong in that archive.  In addition, the size of the resulting archive
+    files (modulo small overheads from the archive file creation process)
+    will be saved to this file.
+
+These files are *not* cleaned up by default because they are very useful
+for debugging purposes.
+
+Testing and Quality Assurance
++++++++++++++++++++++++++++++
+
+To test a configuration file, just run :command:`missing_from_hpss` with the
+``-t`` (``--test``) option as described above.  Aside from creating cache
+files in a scratch directory as described above, this mode will not alter
+any of the data, either on disk or on HPSS.
+
+In addition to validating JSON files and regular expressions, as
+described in the :doc:`configuration document <configuration>`,
+:command:`missing_from_hpss` will:
+
+1. Make sure all regular expressions are actually used.
+2. Make sure all files actually match *one and only one* regular expression.
+3. Create a manifest file containing the actual files on disk matched and
+   the archive file they map to.  This is one and the same as the
+   "Missing File Cache" described above.
+4. Make sure that all archive file sizes are less than a user-defined limit
+   (default 1 TB), configurable on the command-line.
+
+HPSSPy Library
+++++++++++++++
+
+For programmatic access to HPSS, the :doc:`HPSSPy library <api>` provides
+equivalents of :mod:`os` and :mod:`os.path` that operate on the HPSS filesystem.
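
Illustration for doc/using.rst (not part of this commit): a rough sketch of the four quality-assurance checks listed under "Testing and Quality Assurance". It is not HPSSPy's code. The function `check_mapping`, the sample file list and the sample patterns are all hypothetical, and reading the size limit as multiples of 1024**3 bytes is an assumption about what "GB" means here.

```python
import re
from collections import defaultdict


def check_mapping(files, patterns, limit_gb=1024.0):
    """Run the four checks on a {file name: size in bytes} mapping.

    `patterns` maps regular expressions to archive file names, as in one
    subdirectory entry of the configuration file.
    """
    compiled = [(re.compile(p), a) for p, a in patterns.items()]
    used = set()
    unmatched, multiple = [], []
    manifest = defaultdict(list)   # archive file -> files it contains
    sizes = defaultdict(int)       # archive file -> total size in bytes
    for name, size in sorted(files.items()):
        hits = []
        for rx, archive in compiled:
            m = rx.match(name)
            if m:
                hits.append((rx.pattern, m.expand(archive)))
        used.update(pattern for pattern, _ in hits)
        if not hits:
            unmatched.append(name)       # check 2: no expression matches
        elif len(hits) > 1:
            multiple.append(name)        # check 2: more than one matches
        else:
            archive = hits[0][1]
            manifest[archive].append(name)   # check 3: the manifest
            sizes[archive] += size
    limit_bytes = limit_gb * 1024**3     # assumption: 1 GB = 1024**3 bytes
    return {
        "unused": sorted(set(patterns) - used),   # check 1
        "unmatched": unmatched,
        "multiple": multiple,
        "manifest": dict(manifest),
        "oversized": sorted(a for a, s in sizes.items() if s > limit_bytes),  # check 4
    }


# Hypothetical sample data.
PATTERNS = {
    "d1/batch/.*$": "d1/batch.tar",
    "d1/([^/]+\\.txt)$": "d1/\\1",
    "d1/templates/[^/]+$": "d1/templates/templates_files.tar",
    "d1/unused/.*$": "d1/unused.tar",
}
FILES = {
    "d1/batch/a.dat": 2048,
    "d1/batch/b.dat": 1024,
    "d1/notes.txt": 10,
    "d1/templates/t.fits": 500,
    "d1/stray.bin": 1,
}

report = check_mapping(FILES, PATTERNS)
for key, value in report.items():
    print(key, value)
```

With this sample data the report flags one expression that is never used and one file that no expression covers; lowering `limit_gb` far enough makes the batch archive show up as oversized.
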