lots of documentation updates
weaverba137 committed Aug 9, 2017
1 parent 888820f commit 0f60098
Showing 7 changed files with 216 additions and 46 deletions.
1 change: 1 addition & 0 deletions doc/changes.rst
@@ -8,6 +8,7 @@ Release Notes
* Add ``--version`` option.
* Add Python 3.6, remove 3.3.
* Add many quality-assurance checks and additional documentation (PR `#2`_).
* Todo: document command-line use, unit tests.

.. _`#2`: https://github.com/weaverba137/hpsspy/pull/2

12 changes: 6 additions & 6 deletions doc/conf.py
@@ -40,13 +40,13 @@

 # Configuration for intersphinx, copied from astropy.
 intersphinx_mapping = {
-    'python': ('http://docs.python.org/', None),
+    'python': ('http://docs.python.org/3/', None),
     # 'python3': ('http://docs.python.org/3/', path.abspath(path.join(path.dirname(__file__), 'local/python3links.inv'))),
-    'numpy': ('http://docs.scipy.org/doc/numpy/', None),
-    'scipy': ('http://docs.scipy.org/doc/scipy/reference/', None),
-    'matplotlib': ('http://matplotlib.org/', None),
-    'astropy': ('http://docs.astropy.org/en/stable/', None),
-    'h5py': ('http://docs.h5py.org/en/latest/', None)
+    # 'numpy': ('http://docs.scipy.org/doc/numpy/', None),
+    # 'scipy': ('http://docs.scipy.org/doc/scipy/reference/', None),
+    # 'matplotlib': ('http://matplotlib.org/', None),
+    # 'astropy': ('http://docs.astropy.org/en/stable/', None),
+    # 'h5py': ('http://docs.h5py.org/en/latest/', None)
 }

# Add any paths that contain templates here, relative to this directory.
34 changes: 16 additions & 18 deletions doc/configuration.rst
@@ -49,9 +49,16 @@ hpss\_root/
The path on the HPSS tape system that will contain the backups.

physical\_disks/
If the data is spread across several physical disks and linked into
If the data are spread across several physical disks and linked into
the root path via symlinks, the various physical disks need to be listed
here.
here. If the value is equivalent to ``False``, *e.g.*,
[``null``, ``false``, ``[]``], this means that the
``"root"`` disk contains all the physical data. If the value
is equivalent to a one-item list containing ``os.path.basename(root)``,
then this *also* means that the ``"root"`` disk contains all the physical
data. A list of simple names generates the physical disks by
substitution on the basename of the ``"root"`` value. More complicated
configurations are possible; see :func:`hpsspy.scan.physical_disks`.
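
The rules above can be sketched in a few lines of Python. This is a minimal illustration that mirrors the behaviour of :func:`hpsspy.scan.physical_disks` as it appears in this commit; the function name ``resolve_disks`` and the example paths are hypothetical, chosen only for illustration.

```python
from os.path import basename, join


def resolve_disks(release_root, config):
    """Return a tuple of physical disk paths for ``release_root``.

    Hypothetical stand-in for hpsspy.scan.physical_disks.
    """
    pd = config.get('physical_disks')
    if not pd:
        # A missing key, null, false or [] all mean "root holds everything".
        return (release_root,)
    broot = basename(config['root'])
    if len(pd) == 1 and pd[0] == broot:
        # A one-item list naming the root itself means the same thing.
        return (release_root,)
    if pd[0].startswith('/'):
        # Absolute paths: append the release directory to each disk.
        return tuple(join(d, basename(release_root)) for d in pd)
    # Simple names: substitute each name for the basename of root.
    return tuple(release_root.replace(broot, d) for d in pd)


config = {'root': '/foo/bar/baz', 'physical_disks': ['baz0', 'baz1']}
disks = resolve_disks('/foo/bar/baz/data', config)
print(disks)
```

Note that the substitution branch relies on plain string replacement, so a basename that also occurs elsewhere in the path would be replaced there as well.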

Sections
++++++++
@@ -181,19 +188,10 @@ imposes some additional requirements, conventions and idioms:

"d1/foo/([0-9a-zA-Z_-]+)/sub-([0-9]+)/([0-9]+)/.*$" : "d1/foo/\\1/spectra-\\2/\\1_spectra-\\2_\\3.tar"

Quality Assurance
+++++++++++++++++

In addition to validating JSON files and regular expressions,
:command:`missing_from_hpss` will:

1. Make sure all regular expressions are actually used.
2. Make sure all files actually match *one and only one* regular expression.
3. Create a manifest file containing the actual files on disk matched and
the archive file the map to. In addition the size of the resulting files
(modulo small overheads from the archive file creation process) will
be saved to this file. The manifest file will by default be written
to ``$HOME/scratch/missing_files_<section>.json``, where ``<section>`` is
the section (as defined above) specified on the command-line.
4. Make sure that all archive file sizes are less than a user-defined limit
(default 1 TB), configurable on the command-line.
Finally, for truly monumentally-complicated directory trees, there is a
`JSON file`_ included with this distribution describing the SDSS_ data tree
that can be used for examples. To view the equivalent files and directories
for section ``"dr12"``, for example, visit https://data.sdss.org/sas/dr12.

.. _SDSS: https://www.sdss.org
.. _`JSON file`: https://github.com/weaverba137/hpsspy/blob/master/hpsspy/data/sdss.json
3 changes: 2 additions & 1 deletion doc/index.rst
@@ -36,8 +36,9 @@ Contents
:maxdepth: 1

configuration
changes
using
api
changes

Indices and tables
++++++++++++++++++
105 changes: 105 additions & 0 deletions doc/using.rst
@@ -0,0 +1,105 @@
============
Using HPSSPy
============

Introduction
++++++++++++

The primary *command-line* interface to HPSSPy is the script
:command:`missing_from_hpss`, which is automatically generated by the
package install process. If you need to generate this script manually, it
is equivalent to::

    #!/usr/bin/env python
    from sys import exit
    from hpsspy.scan import main
    exit(main())

Options
+++++++

:command:`missing_from_hpss` accepts many command-line options;
``missing_from_hpss --help`` will display all of them. Only the short
versions of the options are shown here.

-c DIR     Cache files (described below) are written to
           ``$HOME/scratch`` by default. This option
           allows the user to choose any directory.
-D         Delete and recreate the disk cache file
           (described below).
-H         Delete and recreate the HPSS cache file
           (described below).
-l N       Limit archive files to this size in GB.
           The default is 1024 GB (1 TB).
-p         Issue the HPSS commands necessary to actually
           back up the files found that need to be backed up.
-r N       Issue a progress report on how many files
           have been analyzed after ``N`` files
           (default 10,000).
-t         Test mode. Try not to make any changes.
           Also pretend that there are no files backed up to HPSS.
-v         Print *lots* of extra information.
--version  Print a version string and exit.

Besides the options described above, :command:`missing_from_hpss` requires
two positional arguments::

    missing_from_hpss config.json section

The two arguments are the path to a configuration file and a section of that
file to process. These are extensively described in the
:doc:`configuration document <configuration>`.
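
As a rough illustration of how the options and positional arguments combine, the following sketch assembles a test-mode invocation from Python. The helper name ``build_command``, the cache directory, and the section name ``"data"`` are assumptions made for the example; only the option letters come from the list above. The command is printed rather than run, since :command:`missing_from_hpss` may not be installed.

```python
import shlex


def build_command(config, section, cache=None, limit=None,
                  test=True, verbose=False):
    """Assemble an argument list for missing_from_hpss (hypothetical helper)."""
    cmd = ['missing_from_hpss']
    if cache is not None:
        cmd += ['-c', cache]          # alternate cache directory
    if limit is not None:
        cmd += ['-l', str(limit)]     # archive size limit in GB
    if test:
        cmd.append('-t')              # test mode
    if verbose:
        cmd.append('-v')              # extra logging
    # The two positional arguments always come last.
    cmd += [config, section]
    return cmd


cmd = build_command('config.json', 'data', cache='/tmp/hpss_cache', limit=512)
print(' '.join(shlex.quote(c) for c in cmd))
```

The resulting list could be handed to :func:`subprocess.run` once the script is available on the path.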

Cache Files
+++++++++++

:command:`missing_from_hpss` uses a few cache files primarily to reduce
memory footprint. These files will be stored in ``$HOME/scratch``
by default. The files are:

Disk Cache
    A CSV file of the form ``disk_files_<section>.csv``, where ``<section>`` is
    the section (as defined above) specified on the command-line. The
    columns are file name and file size in bytes.

HPSS Cache
    A plain-text file of the form ``hpss_cache_<section>.txt``,
    where ``<section>`` is the section (as defined above) specified on
    the command-line. This is simply a list of files found on HPSS.

Missing File Cache
    A JSON file of the form ``$HOME/scratch/missing_files_<section>.json``,
    where ``<section>`` is the section (as defined above) specified on the
    command-line. It contains a map of HPSS archive files to the files that
    belong in that archive. In addition the size of the resulting files
    (modulo small overheads from the archive file creation process) will
    be saved to this file.

These files are *not* cleaned up by default because they are very useful
for debugging purposes.
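
Because the cache files are kept, they can be inspected after a run. The sketch below summarizes a missing-file cache, assuming the layout built by ``find_missing`` in this commit: a mapping from archive name to a dictionary with ``files`` and ``size`` keys. The sample archive names, the sizes, and the helper ``summarize`` are invented for illustration.

```python
import json
import os
import tempfile


def summarize(cache_path, limit_gb=1024):
    """Return (archive, n_files, size_gb, too_large) tuples, sorted by name."""
    with open(cache_path) as fp:
        missing = json.load(fp)
    report = []
    for archive in sorted(missing):
        # Sizes are stored in bytes; convert to GB for comparison.
        size_gb = missing[archive]['size'] / 1024 / 1024 / 1024
        report.append((archive, len(missing[archive]['files']),
                       size_gb, size_gb > limit_gb))
    return report


# Invented sample data standing in for missing_files_<section>.json.
sample = {
    'd1/foo/a_spectra-1.tar': {'files': ['d1/foo/a/1.fits', 'd1/foo/a/2.fits'],
                               'size': 3 * 1024**3},
    'd1/bar.tar': {'files': ['d1/bar/x.dat'], 'size': 1024**3 // 2},
}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'missing_files_example.json')
    with open(path, 'w') as fp:
        json.dump(sample, fp)
    report = summarize(path, limit_gb=2)

for archive, n, size_gb, too_large in report:
    print('{0}: {1:d} files, {2:.2f} GB{3}'.format(
        archive, n, size_gb, ' (over limit)' if too_large else ''))
```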

Testing and Quality Assurance
+++++++++++++++++++++++++++++

To test a configuration file, run :command:`missing_from_hpss` with the
``--test`` option (``-t``, described above). Aside from creating cache files in
a scratch directory as described above, this mode will not alter any
data, either on disk or on HPSS.

In addition to validating JSON files and regular expressions, as
described in the :doc:`configuration document <configuration>`,
:command:`missing_from_hpss` will:

1. Make sure all regular expressions are actually used.
2. Make sure all files actually match *one and only one* regular expression.
3. Create a manifest file containing the actual files on disk matched and
   the archive file they map to. This is one and the same as the
   "Missing File Cache" described above.
4. Make sure that all archive file sizes are less than a user-defined limit
   (default 1 TB), configurable on the command-line.
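
Checks 1 and 2 can be illustrated with a short sketch. It is not the code used by ``find_missing``; the function ``check_patterns``, the patterns, and the file names are assumptions chosen to show the idea of counting matches per file and per pattern.

```python
import re


def check_patterns(mapping, files):
    """Return (unused_patterns, unmatched_files, multiply_matched_files)."""
    compiled = [(re.compile(p), repl) for p, repl in mapping]
    used = {p: 0 for p, _ in mapping}
    unmatched, multiple = [], []
    for f in files:
        # Collect every pattern that matches this file.
        hits = [rx for rx, _ in compiled if rx.match(f) is not None]
        for rx in hits:
            used[rx.pattern] += 1
        if not hits:
            unmatched.append(f)
        elif len(hits) > 1:
            multiple.append(f)
    unused = [p for p, n in used.items() if n == 0]
    return unused, unmatched, multiple


# Hypothetical mapping of regular expressions to archive names.
mapping = [(r'd1/foo/([0-9]+)/.*$', r'd1/foo/\1.tar'),
           (r'd1/foo/.*\.txt$', r'd1/foo_txt.tar'),
           (r'd1/bar/.*$', r'd1/bar.tar')]
files = ['d1/foo/123/a.fits', 'd1/foo/123/notes.txt', 'd1/baz/b.fits']
unused, unmatched, multiple = check_patterns(mapping, files)
print(unused, unmatched, multiple)
```

In this sample, one pattern is never used, one file matches nothing, and one file matches two patterns, so a configuration like this would fail all of the first two checks.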

HPSSPy Library
++++++++++++++

For programmatic access to HPSS, the :doc:`HPSSPy library <api>` provides
equivalents of :mod:`os` and :mod:`os.path` that operate on the HPSS filesystem.
73 changes: 53 additions & 20 deletions hpsspy/scan.py
@@ -175,13 +175,14 @@ def find_missing(hpss_map, hpss_files, disk_files_cache, missing_files,
nmultiple = 0
missing = dict()
pattern_used = dict()
section_warning = set()
with open(disk_files_cache) as t:
reader = csv.DictReader(t)
for row in reader:
f = row['Name']
nfiles += 1
if f in hpss_map["exclude"]:
logger.info("%s skipped.", f)
logger.info("%s is excluded.", f)
continue
section = f.split('/')[0]
try:
@@ -191,17 +192,19 @@ def find_missing(hpss_map, hpss_files, disk_files_cache, missing_files,
# If the section is not described, that's not
# good, but continue.
#
logger.error("%s is in a directory not " +
"described in the configuration!",
f)
if section not in section_warning:
section_warning.add(section)
logger.warning("Directory %s is not " +
"described in the configuration!",
section)
continue
if not s:
#
# If the section is blank, that's OK.
#
logger.info("%s is in a directory not yet " +
"configured.",
f)
if section not in section_warning:
section_warning.add(section)
logger.warning("Directory %s is not configured!", section)
continue
#
# Now check if it is mapped.
@@ -214,18 +217,15 @@ def find_missing(hpss_map, hpss_files, disk_files_cache, missing_files,
if m is not None:
pattern_used[r[0].pattern] += 1
reName = r[0].sub(r[1], f)
if reName in hpss_files:
logger.debug("%s in %s.", f, reName)
mapped += 1
else:
if reName not in hpss_files:
if reName in missing:
missing[reName]['files'].append(f)
missing[reName]['size'] += int(row['Size'])
else:
missing[reName] = {'files': [f],
'size': int(row['Size'])}
logger.debug("%s in %s.", f, reName)
mapped += 1
logger.debug("%s in %s.", f, reName)
mapped += 1
if mapped == 0:
logger.error("%s is not mapped to any file on HPSS!", f)
nmissing += 1
@@ -248,11 +248,15 @@ def find_missing(hpss_map, hpss_files, disk_files_cache, missing_files,
if pattern_used[p] == 0:
logger.critical("Pattern '%s' was never used!", p)
return False
nbackups = 0
for k in missing:
logger.info('%s is %d bytes.', k, missing[k]['size'])
nbackups += len(missing[k]['files'])
if missing[k]['size']/1024/1024/1024 > limit:
logger.critical("HPSS file %s would be too large!", k)
return False
if nbackups > 0:
logger.info('%d files selected for backup.', nbackups)
return (nmissing == 0) and (nmultiple == 0)


@@ -280,8 +284,6 @@ def process_missing(missing_cache, disk_root, hpss_root, dirmode='2770',
from .os import makedirs
from .util import get_tmpdir, hsi, htar
logger = logging.getLogger(__name__ + '.process_missing')
if test:
logger.setLevel(logging.DEBUG)
logger.debug("Processing missing files from %s.", missing_cache)
with open(missing_cache) as fp:
missing = json.load(fp)
@@ -448,6 +450,36 @@ def scan_hpss(hpss_root, hpss_files_cache, clobber=False):
return hpss_files


def physical_disks(release_root, config):
    """Convert a root path into a list of physical disks containing data.

    Parameters
    ----------
    release_root : :class:`str`
        The "official" path to the data.
    config : :class:`dict`
        A dictionary containing path information.

    Returns
    -------
    :func:`tuple`
        A tuple containing the physical disk paths.
    """
    from os.path import basename, join
    try:
        pd = config['physical_disks']
    except KeyError:
        return (release_root,)
    if not pd:
        return (release_root,)
    broot = basename(config['root'])
    if ((len(pd) == 1) and (pd[0] == broot)):
        return (release_root,)
    if pd[0].startswith('/'):
        return tuple([join(d, basename(release_root)) for d in pd])
    return tuple([release_root.replace(broot, d) for d in pd])


def main():
"""Entry-point for command-line scripts.
@@ -496,7 +528,7 @@ def main():
help="Test mode. Try not to make any changes.")
parser.add_argument('-v', '--verbose', action='store_true',
dest='verbose',
help="Increase verbosity.")
help="Increase verbosity. Increase it a lot.")
parser.add_argument('-V', '--version', action='version',
version="%(prog)s " + hpsspyVersion)
parser.add_argument('config', metavar='FILE',
@@ -507,8 +539,10 @@ def main():
     #
     # Logging
     #
-    ll = logging.INFO
-    if options.test or options.verbose:
+    ll = logging.WARNING
+    if options.test:
+        ll = logging.INFO
+    if options.verbose:
         ll = logging.DEBUG
     log_format = '%(asctime)s %(name)s %(levelname)s: %(message)s'
     logging.basicConfig(level=ll, format=log_format,
@@ -542,8 +576,7 @@ def main():
disk_files_cache = join(options.cache,
'disk_files_{0}.csv'.format(options.release))
logger.debug("disk_files_cache = '%s'", disk_files_cache)
disk_roots = [release_root.replace(basename(config['root']), d)
for d in config['physical_disks']]
disk_roots = physical_disks(release_root, config)
status = scan_disk(disk_roots, disk_files_cache,
clobber=options.clobber_disk)
if not status:
34 changes: 33 additions & 1 deletion hpsspy/test/test_scan.py
@@ -16,7 +16,7 @@
import os
import sys
import re
from ..scan import compile_map
from ..scan import compile_map, physical_disks


class TestScan(unittest.TestCase):
@@ -73,6 +73,38 @@ def test_compile_map(self):
new_map = compile_map(self.config, 'redux')
self.assertEqual(err.colno, 8)

    def test_physical_disks(self):
        """Test physical disk path setup.
        """
        release_root = '/foo/bar/baz/data'
        config = {'root': '/foo/bar/baz'}
        pd = physical_disks(release_root, config)
        self.assertEqual(pd, (release_root,))
        config['physical_disks'] = None
        pd = physical_disks(release_root, config)
        self.assertEqual(pd, (release_root,))
        config['physical_disks'] = False
        pd = physical_disks(release_root, config)
        self.assertEqual(pd, (release_root,))
        config['physical_disks'] = []
        pd = physical_disks(release_root, config)
        self.assertEqual(pd, (release_root,))
        config['physical_disks'] = ['baz']
        pd = physical_disks(release_root, config)
        self.assertEqual(pd, (release_root,))
        config['physical_disks'] = ['baz0', 'baz1', 'baz2']
        pd = physical_disks(release_root, config)
        self.assertEqual(pd, ('/foo/bar/baz0/data',
                              '/foo/bar/baz1/data',
                              '/foo/bar/baz2/data'))
        config['physical_disks'] = ['/foo/bar0/baz',
                                    '/foo/bar1/baz',
                                    '/foo/bar2/baz']
        pd = physical_disks(release_root, config)
        self.assertEqual(pd, ('/foo/bar0/baz/data',
                              '/foo/bar1/baz/data',
                              '/foo/bar2/baz/data'))


def test_suite():
"""Allows testing of only this module with the command::
