Merge 0f60098 into 7852e65
weaverba137 committed Aug 9, 2017
2 parents 7852e65 + 0f60098 commit 24e1be1
Showing 15 changed files with 715 additions and 136 deletions.
13 changes: 8 additions & 5 deletions .travis.yml
@@ -25,9 +25,9 @@ addons:
- dvipng
python:
- 2.7
- 3.3
- 3.4
- 3.5
- 3.6
env:
global:
# The following versions are the 'default' for tests, unless
@@ -46,13 +46,16 @@ matrix:
# OS X support is still experimental, so don't penalize failuures.
allow_failures:
- os: osx
- os: linux
python: 3.5
env: MAIN_CMD='pycodestyle' SETUP_CMD='--count hpsspy'

include:
# Check for sphinx doc build warnings - we do this first because it
# runs for a long time
- os: linux
python: 3.5
env: SETUP_CMD='build_sphinx'
env: SETUP_CMD='build_sphinx --warning-is-error'
# -w is an astropy extension

# Coverage test, pass the results to coveralls.
@@ -63,14 +66,14 @@ matrix:
# PEP 8 compliance.
- os: linux
python: 3.5
env: MAIN_CMD='pep8' SETUP_CMD='--count hpsspy'
env: MAIN_CMD='pycodestyle' SETUP_CMD='--count hpsspy'

# before_install:
# - curl ipinfo.io

install:
- if [[ $MAIN_CMD == 'pep8' ]]; then pip install pep8; fi
- if [[ $SETUP_CMD == 'build_sphinx' ]]; then pip install Sphinx; fi
- if [[ $MAIN_CMD == 'pycodestyle' ]]; then pip install pycodestyle; fi
- if [[ $SETUP_CMD == build_sphinx* ]]; then pip install Sphinx; fi
- if [[ $MAIN_CMD == 'coverage' ]]; then pip install coverage coveralls; fi
# - pip install -r requirements.txt

5 changes: 5 additions & 0 deletions doc/changes.rst
@@ -6,6 +6,11 @@ Release Notes
------------------

* Add ``--version`` option.
* Add Python 3.6, remove 3.3.
* Add many quality-assurance checks and additional documentation (PR `#2`_).
* Todo: document command-line use, unit tests.

.. _`#2`: https://github.com/weaverba137/hpsspy/pull/2

0.3.0 (2017-01-18)
------------------
12 changes: 6 additions & 6 deletions doc/conf.py
@@ -40,13 +40,13 @@

# Configuration for intersphinx, copied from astropy.
intersphinx_mapping = {
'python': ('http://docs.python.org/', None),
'python': ('http://docs.python.org/3/', None),
# 'python3': ('http://docs.python.org/3/', path.abspath(path.join(path.dirname(__file__), 'local/python3links.inv'))),
'numpy': ('http://docs.scipy.org/doc/numpy/', None),
'scipy': ('http://docs.scipy.org/doc/scipy/reference/', None),
'matplotlib': ('http://matplotlib.org/', None),
'astropy': ('http://docs.astropy.org/en/stable/', None),
'h5py': ('http://docs.h5py.org/en/latest/', None)
# 'numpy': ('http://docs.scipy.org/doc/numpy/', None),
# 'scipy': ('http://docs.scipy.org/doc/scipy/reference/', None),
# 'matplotlib': ('http://matplotlib.org/', None),
# 'astropy': ('http://docs.astropy.org/en/stable/', None),
# 'h5py': ('http://docs.h5py.org/en/latest/', None)
}

# Add any paths that contain templates here, relative to this directory.
197 changes: 197 additions & 0 deletions doc/configuration.rst
@@ -0,0 +1,197 @@
====================
HPSSPy Configuration
====================

Introduction
++++++++++++

The primary HPSSPy command-line program :command:`missing_from_hpss` is
configured with a JSON_ file. Both the JSON standard and the
Python :mod:`json` library are very strict. However, there is a quick way
to check the validity of a JSON file::

python -c 'import json; j = open("config.json"); data = json.load(j); j.close()'

where ``"config.json"`` should be replaced with the name of the file to be
tested.

The top-level JSON container should be an "object", equivalent to a Python
:class:`dict`. The simplest possible file that satisfies this requirement
is::

{
}

An empty object is valid, but it does not configure anything. The
additional data that are required are described below.
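
The two checks above (valid JSON, and an "object" as the top-level
container) can be combined in a short script that also reports where
parsing failed. This is a minimal sketch and not part of HPSSPy: the
function name ``check_json`` is hypothetical, and it assumes Python 3.5
or later, where :mod:`json` raises ``json.JSONDecodeError``.

```python
import json


def check_json(path):
    """Check that ``path`` contains valid JSON with an object at top level.

    Hypothetical helper, not part of HPSSPy.
    Returns ``(True, '')`` on success, ``(False, message)`` otherwise.
    """
    try:
        with open(path) as j:
            data = json.load(j)
    except json.JSONDecodeError as e:
        # lineno, colno and msg are attributes of JSONDecodeError.
        return (False, "line {0:d}, column {1:d}: {2}".format(e.lineno, e.colno, e.msg))
    if not isinstance(data, dict):
        return (False, "top-level container is not an object")
    return (True, "")
```

Calling ``check_json("config.json")`` returns a success flag and, on
failure, a message giving the line and column of the first syntax error.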

.. _JSON: http://json.org

Metadata
++++++++

The configuration file should contain a top-level keyword ``"config"``.
The value should itself be a :class:`dict`, containing some important
metadata::

{
"config": {
"root": "/global/project/projectdirs/my_project",
"hpss_root": "/nersc/projects/my_project",
"physical_disks": ["my_project"]
}
}

root/
The directory that contains *all* the data associated with the project.

hpss\_root/
The path on the HPSS tape system that will contain the backups.

physical\_disks/
If the data are spread across several physical disks and linked into
the root path via symlinks, the various physical disks need to be listed
here. If the value is equivalent to ``False``, *e.g.*,
``null``, ``false`` or ``[]``, this means that the
``"root"`` disk contains all the physical data. If the value
is equivalent to a one-item list containing ``os.path.basename(root)``,
then this *also* means that the ``"root"`` disk contains all the physical
data. A list of simple names generates the physical disks by
substitution on the basename of the ``"root"`` value. More complicated
configurations are possible; see :func:`hpsspy.scan.physical_disks`.
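
The simple cases of ``"physical_disks"`` described above can be
illustrated with a few lines of Python. This is only a sketch under
stated assumptions: ``simple_physical_disks`` is a hypothetical name, it
covers only the empty value and the list of simple names, and the real
:func:`hpsspy.scan.physical_disks` may differ in detail.

```python
import posixpath


def simple_physical_disks(root, disks):
    """Expand simple disk names by substitution on the basename of ``root``.

    Hypothetical illustration only; not the HPSSPy implementation.
    """
    if not disks:
        # null, false or [] : the root disk holds all the physical data.
        return [root]
    parent = posixpath.dirname(root)
    # Replacing the basename of root with each name is the same as
    # joining each name onto the parent directory of root.
    return [posixpath.join(parent, d) for d in disks]
```

With ``root`` set to ``/global/project/projectdirs/my_project``, the value
``["my_project"]`` yields the root itself, while two hypothetical names
such as ``["my_project_a", "my_project_b"]`` yield two sibling directories
of the root.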

Sections
++++++++

Inside the root directory, as described above, there may be several top-level
directories. For the purposes of this documentation, these are called
"sections" or "releases". The terms are interchangeable. Each section
has configuration items that describe its structure::

{
"config": {
"root": "/projects/my_project",
"hpss_root": "/hpss/projects/my_project",
"physical_disks": ["my_project"]
},
"data": {
"exclude": [],
"d1": {
"d1/batch/.*$": "d1/batch.tar",
"d1/([^/]+\\.txt)": "d1/\\1",
"d1/templates/[^/]+$": "d1/templates/templates_files.tar"
}
}
}

The :command:`missing_from_hpss` command works on one section at a time.
The name of the section is passed on the command-line::

missing_from_hpss config.json data

This would process the ``"data"`` section above.

Each section should have an ``"exclude"`` keyword, whose value is a list
of files to be ignored. In the example above, in order to ignore the file
``/projects/my_project/data/d1/README.html``, the ``"exclude"`` value
would be ``["d1/README.html"]``. Note that this is relative to the
path ``/projects/my_project/data``, since ``"data"`` is the section being
processed.
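
The relative-path rule for ``"exclude"`` can be sketched as follows. The
function name ``is_excluded`` is hypothetical and the code only
illustrates the rule stated above; it is not taken from HPSSPy.

```python
import posixpath


def is_excluded(full_path, root, section, exclude):
    """Return True if ``full_path`` is listed in ``exclude``.

    Entries in ``exclude`` are interpreted relative to ``root/section``.
    Hypothetical illustration only.
    """
    section_root = posixpath.join(root, section)
    relative = posixpath.relpath(full_path, section_root)
    return relative in exclude
```

For the example above, ``/projects/my_project/data/d1/README.html`` is
reduced to ``d1/README.html`` before it is compared with the list.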

Mapping File Names to HPSS Archives
+++++++++++++++++++++++++++++++++++

Within a section, each immediate subdirectory should be described with
a keyword in the configuration file. :command:`missing_from_hpss` will
complain about any subdirectory that is not configured, although this will
not necessarily cause it to fail. In the
example above, ``/projects/my_project/data/d1`` is configured.

There are many possible ways to bundle files for archiving. Generally you
want to make archives as large as possible, without spilling onto multiple
tapes. However, with highly structured, deeply-nested directory structures,
this isn't always the best way to do it from a data *retrieval* viewpoint.

Consider this scenario. ``/projects/my_project/data`` has been archived to
ten tape archives called ``data00.tar``, ``data01.tar``, ... ``data09.tar``.
The file ``/projects/my_project/data/d1/templates/d1_template_05.fits``
needs to be recovered. Which tape archive contains it?

Now consider the scenario where the files in
``/projects/my_project/data/d1/templates`` have been archived to
``/hpss/projects/my_project/data/d1/templates/d1_templates_files.tar``.
Now is it easier to recover the file?

One should still try to make archives as big as possible, but generally
speaking, long-term archiving of large, complex data sets should be
done by **someone who actually knows the structure of the data set**.

In coding terms, we describe a portion of a directory tree hierarchy
using regular expressions that match *files* in that portion. Then we map
the files that match each regular expression to tape archive files.

Regular Expression Details
++++++++++++++++++++++++++

The HPSSPy package and :command:`missing_from_hpss` will validate the
regular expressions used in the configuration file, in addition to checking
the overall validity of the JSON file itself. That is, a bad regular
expression will be rejected before it has any chance to "touch" any real data.
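
A similar check can be run by hand before a configuration file is used.
The sketch below is an assumption about how such validation could be
done, using only :func:`re.compile`; the name ``bad_patterns`` is
hypothetical and this is not the validation code used by HPSSPy itself.

```python
import re


def bad_patterns(mapping):
    """Return a list of ``(pattern, message)`` for patterns that do not compile.

    ``mapping`` is a dict of regular expression -> archive file name,
    as found inside a section of the configuration file.
    Hypothetical illustration only.
    """
    errors = []
    for pattern in mapping:
        try:
            re.compile(pattern)
        except re.error as e:
            errors.append((pattern, str(e)))
    return errors
```

An empty list means that every regular expression compiled.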

The regular expressions should follow Python's conventions,
described in :mod:`re`. In addition to those conventions, this package
imposes some additional requirements, conventions and idioms:

* Requirements

- Backslashes must be escaped in JSON files. For example, the
metacharacter ``\d`` (match a single decimal digit) becomes ``\\d``.
"Double-escaping" is not required (if you don't know what this is,
don't worry about it).

* Conventions

- Any archive file name ending in ``.tar`` is assumed to be an HTAR file,
and that command will be used to construct it.
- Any archive file *not* ending in ``.tar`` will simply be copied to
HPSS as is.
- When constructing an archive file, :command:`missing_from_hpss` will
obtain the directory it needs to archive from the name of the *archive*
file, not the regular expression itself. This is because regular
expression *substitution* is performed on the archive file name.
For example, ``batch.tar`` means "archive a batch/ directory".
For longer file names, the "suffix" of the file will be used.
``data_d1_batch.tar`` also means "archive a batch/ directory", because
``data_d1_`` is stripped off.
- An archive filename that ends with ``_files.tar``, *e.g.* ``foo/bar_files.tar``,
is a signal to :command:`missing_from_hpss` to construct
the archive file in a certain way: not by descending into a directory,
but by constructing an explicit list of files and building an archive
file out of that list.
- Regular expressions should end with the end-of-line marker ``$``.

* Idioms

- Archive the entire contents of a directory into a single file:
``"foo/.*$" : "foo.tar"``.
- Archive several subdirectories of a directory, each into their own file:
``"foo/(bar|baz|flub)/.*$" : "foo/foo_\\1.tar"``. The name of the
directory matched in parentheses will be substituted into the file name.
- Archive arbitrary subdirectories of a *set* of subdirectories:
``"d1/foo/(ab|bc|cd|de|ef)/([^/]+)/.*$":"d1/foo/\\1/d1_foo_\\1_\\2.tar"``
- Match files in a directory, but not any files in any
subdirectory: ``"foo/[^/]+$" : "foo_files.tar"``. See also the
``_files.tar`` convention mentioned above.
- Do not create an archive file, just copy the file, as is, to HPSS:
``"d1/README\\.txt" : "d1/README.txt"``. Similarly, for a set of TXT files:
``"d1/([^/]+\\.txt)" : "d1/\\1"``.
- An example with lots of substitutions::

"d1/foo/([0-9a-zA-Z_-]+)/sub-([0-9]+)/([0-9]+)/.*$" : "d1/foo/\\1/spectra-\\2/\\1_spectra-\\2_\\3.tar"

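The interaction between JSON escaping and regular expression substitution
in the idioms above can be tried out directly in Python. This is a
sketch, not the HPSSPy implementation: the function ``archive_for`` and
the test file name are hypothetical, and it assumes that patterns are
applied with :func:`re.match` and that substitution behaves like
``Match.expand``.

```python
import json
import re

# A fragment of a section exactly as it would appear in the JSON file.
# The raw string keeps the doubled backslashes that JSON requires.
text = r'''{"d1/foo/(ab|bc|cd|de|ef)/([^/]+)/.*$": "d1/foo/\\1/d1_foo_\\1_\\2.tar"}'''

# After decoding, each "\\1" in the file is a single backslash followed by 1.
mapping = json.loads(text)


def archive_for(filename, mapping):
    """Return the archive file that ``filename`` maps to, or None.

    Hypothetical illustration only.
    """
    for pattern, replacement in mapping.items():
        m = re.match(pattern, filename)
        if m is not None:
            # Substitute the matched groups into the archive file name.
            return m.expand(replacement)
    return None
```

Here ``archive_for("d1/foo/ab/run7/x.fits", mapping)`` evaluates to
``"d1/foo/ab/d1_foo_ab_run7.tar"``: the first group supplies ``ab`` and
the second group supplies ``run7``.
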
Finally, for truly monumentally-complicated directory trees, there is a
`JSON file`_ included with this distribution describing the SDSS_ data tree
that can be used for examples. To view the equivalent files and directories
for section ``"dr12"``, for example, visit https://data.sdss.org/sas/dr12.

.. _SDSS: https://www.sdss.org
.. _`JSON file`: https://github.com/weaverba137/hpsspy/blob/master/hpsspy/data/sdss.json
4 changes: 3 additions & 1 deletion doc/index.rst
@@ -35,8 +35,10 @@ Contents
.. toctree::
:maxdepth: 1

changes
configuration
using
api
changes

Indices and tables
++++++++++++++++++
105 changes: 105 additions & 0 deletions doc/using.rst
@@ -0,0 +1,105 @@
============
Using HPSSPy
============

Introduction
++++++++++++

The primary *command-line* interface to HPSSPy is the script
:command:`missing_from_hpss`, which is automatically generated by the
package install process. If you need to generate this script manually, it
is equivalent to::

#!/usr/bin/env python
from sys import exit
from hpsspy.scan import main
exit(main())

Options
+++++++

There are a lot of command-line options. ``missing_from_hpss --help`` will
display all of them. Only the short versions of the options are
shown here.

-c DIR Cache files (described below) are written to
``$HOME/scratch`` by default. This option
allows the user to choose any directory.
-D Delete and recreate the disk cache file
(described below).
-H Delete and recreate the HPSS cache file
(described below).
-l N Limit archive files to this size in GB.
The default is 1024 GB (1 TB).
-p Issue the HPSS commands necessary to actually
back up the files found that need to be backed up.
-r N Issue a progress report on how many files
have been analyzed after ``N`` files
(default 10,000).
-t Test mode. Try not to make any changes.
Also pretend that there are no files backed up to HPSS.
-v Print *lots* of extra information.
--version Print a version string and exit.

Besides the options described above, :command:`missing_from_hpss` requires
two positional arguments::

missing_from_hpss config.json section

The two arguments are the path to a configuration file and a section of that
file to process. These are extensively described in the
:doc:`configuration document <configuration>`.

Cache Files
+++++++++++

:command:`missing_from_hpss` uses a few cache files, primarily to reduce its
memory footprint. These files will be stored in ``$HOME/scratch``
by default. The files are:

Disk Cache
A CSV file of the form ``disk_cache_<section>.csv``, where ``<section>`` is
the section (as defined above) specified on the command-line. The
columns are file name and file size in bytes.

HPSS Cache
A plain-text file of the form ``hpss_cache_<section>.txt``,
where ``<section>`` is the section (as defined above) specified on
the command-line. This is simply a list of files found on HPSS.

Missing File Cache
A JSON file of the form ``missing_files_<section>.json``,
where ``<section>`` is the section (as defined above) specified on the
command-line. It contains a map of HPSS archive files to the files that
belong in that archive. In addition, the size of the resulting archive files
(modulo small overheads from the archive file creation process) will
be saved to this file.

These files are *not* cleaned up by default because they are very useful
for debugging purposes.
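
The naming scheme of the cache files can be summarized in a few lines.
This sketch is not taken from HPSSPy: the function ``cache_files`` is
hypothetical, and it assumes that all three files are written to the
same cache directory.

```python
import os
import posixpath


def cache_files(section, cache_dir=None):
    """Return the expected cache file names for ``section``.

    Hypothetical illustration only; assumes that all three files share
    one cache directory, ``$HOME/scratch`` by default.
    """
    if cache_dir is None:
        cache_dir = posixpath.join(os.environ['HOME'], 'scratch')
    return {
        'disk': posixpath.join(cache_dir, 'disk_cache_{0}.csv'.format(section)),
        'hpss': posixpath.join(cache_dir, 'hpss_cache_{0}.txt'.format(section)),
        'missing': posixpath.join(cache_dir, 'missing_files_{0}.json'.format(section)),
    }
```

For the ``data`` section, ``cache_files("data")["disk"]`` would be
``$HOME/scratch/disk_cache_data.csv`` with ``$HOME`` expanded.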

Testing and Quality Assurance
+++++++++++++++++++++++++++++

To test a configuration file, just run :command:`missing_from_hpss` with the
``--test`` (``-t``) option described above. Aside from creating cache files in
a scratch directory, this mode will not alter any
data, either on disk or on HPSS.

In addition to validating JSON files and regular expressions, as
described in the :doc:`configuration document <configuration>`,
:command:`missing_from_hpss` will:

1. Make sure all regular expressions are actually used.
2. Make sure all files actually match *one and only one* regular expression.
3. Create a manifest file containing the actual files on disk matched and
the archive file they map to. This is the same file as the
"Missing File Cache" described above.
4. Make sure that all archive file sizes are less than a user-defined limit
(default 1 TB), configurable on the command-line.
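
The first two checks in the list above can be reproduced on a small list
of file names. This is a sketch under stated assumptions, not the code
that :command:`missing_from_hpss` runs: the function ``check_coverage``
is hypothetical and it assumes patterns are applied with
:func:`re.match`.

```python
import re


def check_coverage(files, patterns):
    """Return ``(unused, unmatched, multiple)``.

    unused    : patterns that matched no file.
    unmatched : files that matched no pattern.
    multiple  : files that matched more than one pattern.
    Hypothetical illustration only.
    """
    compiled = [(p, re.compile(p)) for p in patterns]
    used = set()
    unmatched = []
    multiple = []
    for f in files:
        hits = [p for p, r in compiled if r.match(f) is not None]
        used.update(hits)
        if len(hits) == 0:
            unmatched.append(f)
        elif len(hits) > 1:
            multiple.append(f)
    unused = [p for p in patterns if p not in used]
    return (unused, unmatched, multiple)
```

A configuration passes these two checks when all three returned lists
are empty.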

HPSSPy Library
++++++++++++++

For programmatic access to HPSS, the :doc:`HPSSPy library <api>` provides
equivalents of :mod:`os` and :mod:`os.path` that operate on the HPSS filesystem.