Add tests and documentation. #2

Merged: 40 commits, Aug 10, 2017
Commits
72a4a2f
tweak tests
weaverba137 Aug 2, 2017
a4bfab4
update log formatting
weaverba137 Aug 2, 2017
dbe6593
fix broken test
weaverba137 Aug 2, 2017
f8a49d0
allow test mode to override file creation
weaverba137 Aug 3, 2017
d2737c5
fix formatting
weaverba137 Aug 3, 2017
24d612b
add tests for regular expressions
weaverba137 Aug 8, 2017
e14e085
add configuration validator
weaverba137 Aug 8, 2017
6580f68
fix variable name
weaverba137 Aug 8, 2017
2ac9e90
fix import error [skip ci]
weaverba137 Aug 8, 2017
810f55d
fix object name [skip ci]
weaverba137 Aug 8, 2017
67b58e8
explicitly set debug
weaverba137 Aug 8, 2017
baeab92
add additional tests
weaverba137 Aug 8, 2017
9f9dfe0
fix import error
weaverba137 Aug 8, 2017
6b1082e
expand filename [skip ci]
weaverba137 Aug 8, 2017
fece9bc
fix log message [skip ci]
weaverba137 Aug 8, 2017
d6d1855
print sizes
weaverba137 Aug 8, 2017
7903771
fix log message [skip ci]
weaverba137 Aug 8, 2017
d1482dc
save file sizes in scan disk stage
weaverba137 Aug 8, 2017
9ac4fa1
move validation into main script
weaverba137 Aug 8, 2017
df3e551
fix variable name
weaverba137 Aug 8, 2017
28b0a57
fix variable name
weaverba137 Aug 8, 2017
2a3a747
fix variable name
weaverba137 Aug 8, 2017
c8be3dc
update changes.rst
weaverba137 Aug 8, 2017
289c2cf
fix variable name
weaverba137 Aug 8, 2017
3386e79
fix missing variable
weaverba137 Aug 8, 2017
888820f
add configuration document
weaverba137 Aug 9, 2017
0f60098
lots of documentation updates
weaverba137 Aug 9, 2017
71cf771
adding unit tests
weaverba137 Aug 9, 2017
3e03ed9
add capability to group some types of directories
weaverba137 Aug 9, 2017
f5314c1
fix index error
weaverba137 Aug 9, 2017
fc39275
fix function error
weaverba137 Aug 9, 2017
ef18d78
fix regex error
weaverba137 Aug 9, 2017
c6a4774
nicer formatting
weaverba137 Aug 9, 2017
03d819d
add unit tests
weaverba137 Aug 9, 2017
d978f62
fix exception missing in older Python
weaverba137 Aug 9, 2017
3c0a184
fix up some python version problems
weaverba137 Aug 10, 2017
1c0b636
find correct exception
weaverba137 Aug 10, 2017
509383f
make mock commands easier to use
weaverba137 Aug 10, 2017
9c9c31f
add tests of stat
weaverba137 Aug 10, 2017
911e70e
fix test failure
weaverba137 Aug 10, 2017
13 changes: 8 additions & 5 deletions .gitignore
@@ -7,18 +7,18 @@
 # Packages
 *.egg
 *.egg-info
-dist
-build
+dist/
+build/
 eggs
 parts
-bin
+bin/
 var
 sdist
 MANIFEST
 develop-eggs
 .installed.cfg
-lib
-lib64
+lib/
+lib64/
 __pycache__
 
 # Sphinx
@@ -50,3 +50,6 @@ nosetests.xml
 
 # Mac OSX
 .DS_Store
+
+# don't ignore test/bin
+!hpsspy/test/bin/*
13 changes: 8 additions & 5 deletions .travis.yml
@@ -25,9 +25,9 @@ addons:
             - dvipng
 python:
     - 2.7
-    - 3.3
     - 3.4
     - 3.5
+    - 3.6
 env:
     global:
         # The following versions are the 'default' for tests, unless
@@ -46,13 +46,16 @@ matrix:
     # OS X support is still experimental, so don't penalize failures.
     allow_failures:
         - os: osx
+        - os: linux
+          python: 3.5
+          env: MAIN_CMD='pycodestyle' SETUP_CMD='--count hpsspy'
 
     include:
         # Check for sphinx doc build warnings - we do this first because it
         # runs for a long time
         - os: linux
           python: 3.5
-          env: SETUP_CMD='build_sphinx'
+          env: SETUP_CMD='build_sphinx --warning-is-error'
           # -w is an astropy extension
 
         # Coverage test, pass the results to coveralls.
@@ -63,14 +66,14 @@ matrix:
         # PEP 8 compliance.
         - os: linux
           python: 3.5
-          env: MAIN_CMD='pep8' SETUP_CMD='--count hpsspy'
+          env: MAIN_CMD='pycodestyle' SETUP_CMD='--count hpsspy'
 
 # before_install:
 # - curl ipinfo.io
 
 install:
-    - if [[ $MAIN_CMD == 'pep8' ]]; then pip install pep8; fi
-    - if [[ $SETUP_CMD == 'build_sphinx' ]]; then pip install Sphinx; fi
+    - if [[ $MAIN_CMD == 'pycodestyle' ]]; then pip install pycodestyle; fi
+    - if [[ $SETUP_CMD == build_sphinx* ]]; then pip install Sphinx; fi
     - if [[ $MAIN_CMD == 'coverage' ]]; then pip install coverage coveralls; fi
 # - pip install -r requirements.txt
 
6 changes: 5 additions & 1 deletion doc/changes.rst
@@ -2,10 +2,14 @@
 Release Notes
 =============
 
-0.3.1 (unreleased)
+0.4.0 (unreleased)
 ------------------
 
 * Add ``--version`` option.
+* Add Python 3.6, remove 3.3.
+* Add many quality-assurance checks and additional documentation (PR `#2`_).
+
+.. _`#2`: https://github.com/weaverba137/hpsspy/pull/2
 
 0.3.0 (2017-01-18)
 ------------------
12 changes: 6 additions & 6 deletions doc/conf.py
@@ -40,13 +40,13 @@
 
 # Configuration for intersphinx, copied from astropy.
 intersphinx_mapping = {
-    'python': ('http://docs.python.org/', None),
+    'python': ('http://docs.python.org/3/', None),
     # 'python3': ('http://docs.python.org/3/', path.abspath(path.join(path.dirname(__file__), 'local/python3links.inv'))),
-    'numpy': ('http://docs.scipy.org/doc/numpy/', None),
-    'scipy': ('http://docs.scipy.org/doc/scipy/reference/', None),
-    'matplotlib': ('http://matplotlib.org/', None),
-    'astropy': ('http://docs.astropy.org/en/stable/', None),
-    'h5py': ('http://docs.h5py.org/en/latest/', None)
+    # 'numpy': ('http://docs.scipy.org/doc/numpy/', None),
+    # 'scipy': ('http://docs.scipy.org/doc/scipy/reference/', None),
+    # 'matplotlib': ('http://matplotlib.org/', None),
+    # 'astropy': ('http://docs.astropy.org/en/stable/', None),
+    # 'h5py': ('http://docs.h5py.org/en/latest/', None)
 }
 
 # Add any paths that contain templates here, relative to this directory.
200 changes: 200 additions & 0 deletions doc/configuration.rst
@@ -0,0 +1,200 @@
====================
HPSSPy Configuration
====================

Introduction
++++++++++++

The primary HPSSPy command-line program :command:`missing_from_hpss` is
configured with a JSON_ file. Both the JSON standard and the
Python :mod:`json` library are very strict. However, there is a quick way
to check the validity of a JSON file::

    python -c 'import json; j = open("config.json"); data = json.load(j); j.close()'

where ``"config.json"`` should be replaced with the name of the file to be
tested.

The top-level JSON container should be an "object", equivalent to a Python
:class:`dict`. The simplest possible file that satisfies this requirement
is::

    {
    }

Obviously, that's not very much to go on. You will need the additional data
described below.
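
The two checks described above (the file parses as JSON, and the top-level
container is an object) can be combined in a short script. This is only a
sketch: the function names ``check_config_text`` and ``check_config_file``
are hypothetical and are not part of HPSSPy.

```python
import json


def check_config_text(text):
    """Parse JSON text and return a tuple (ok, message)."""
    try:
        data = json.loads(text)
    except ValueError as e:
        # json.JSONDecodeError (Python 3.5+) is a subclass of ValueError;
        # older versions raise ValueError directly.
        return (False, "invalid JSON: {0}".format(e))
    if not isinstance(data, dict):
        return (False, "top-level container is not an object")
    return (True, "ok")


def check_config_file(filename):
    """Apply check_config_text() to the contents of a file."""
    with open(filename) as j:
        return check_config_text(j.read())
```

For example, ``check_config_file("config.json")`` returns ``(True, "ok")``
for the minimal file shown above, and a tuple starting with ``False`` for a
file containing a trailing comma.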

.. _JSON: http://json.org

Metadata
++++++++

The configuration file should contain a top-level keyword ``"config"``.
The value should itself be a :class:`dict`, containing some important
metadata::

    {
        "config": {
            "root": "/global/project/projectdirs/my_project",
            "hpss_root": "/nersc/projects/my_project",
            "physical_disks": ["my_project"]
        }
    }

root/
    The directory that contains *all* the data associated with the project.

hpss\_root/
    The path on the HPSS tape system that will contain the backups.

physical\_disks/
    If the data are spread across several physical disks and linked into
    the root path via symlinks, the various physical disks need to be listed
    here. If the value is equivalent to ``False`` (*e.g.*, ``null``,
    ``false`` or ``[]``), this means that the ``"root"`` disk contains all
    the physical data. If the value is equivalent to a one-item list
    containing ``os.path.basename(root)``, then this *also* means that the
    ``"root"`` disk contains all the physical data. A list of simple names
    generates the physical disks by substitution on the basename of the
    ``"root"`` value. More complicated configurations are possible; see
    :func:`hpsspy.scan.physical_disks`.
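
The three simple cases described for ``physical_disks`` can be illustrated
with a few lines of Python. This is a sketch of the rules as stated in the
text, *not* the implementation of :func:`hpsspy.scan.physical_disks`; the
function name ``expand_physical_disks`` and the disk names ``disk_a`` and
``disk_b`` are hypothetical.

```python
import os


def expand_physical_disks(config):
    """Return the list of directories that hold the physical data.

    Only the simple cases are handled: a false-equivalent value, a
    one-item list naming the root's own basename, and a list of
    simple names substituted for the basename of the root.
    """
    root = config["root"]
    disks = config.get("physical_disks")
    base = os.path.basename(root)
    if not disks or disks == [base]:
        # Everything lives on the "root" disk itself.
        return [root]
    parent = os.path.dirname(root)
    # Replace the basename of the root with each simple name.
    return [os.path.join(parent, name) for name in disks]


config = {"root": "/global/project/projectdirs/my_project",
          "physical_disks": ["disk_a", "disk_b"]}
print(expand_physical_disks(config))
```

With the ``"physical_disks": ["my_project"]`` value from the example above,
the same function returns a one-item list containing only the root path.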

Sections
++++++++

Inside the root directory, as described above, there may be several top-level
directories. For the purposes of this documentation, these are called
"sections" or "releases". The terms are interchangeable. Each section
has configuration items that describe its structure::

    {
        "config": {
            "root": "/projects/my_project",
            "hpss_root": "/hpss/projects/my_project",
            "physical_disks": ["my_project"]
        },
        "data": {
            "exclude": [],
            "d1": {
                "d1/batch/.*$": "d1/batch.tar",
                "d1/([^/]+\\.txt)$": "d1/\\1",
                "d1/templates/[^/]+$": "d1/templates/templates_files.tar"
            }
        }
    }

The :command:`missing_from_hpss` command works on one section at a time.
The name of the section is passed on the command line::

    missing_from_hpss config.json data

This would read the ``"data"`` section above.

Each section should have an ``"exclude"`` keyword, whose value is a list
of files to be ignored. In the example above, in order to ignore the file
``/projects/my_project/data/d1/README.html``, the ``"exclude"`` value
would be ``["d1/README.html"]``. Note that this is relative to the
path ``/projects/my_project/data``, since ``"data"`` is the section being
processed.
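
To make the relative-path rule concrete, the following sketch computes the
name that would be compared against the ``"exclude"`` list. The function
``is_excluded`` is hypothetical and only illustrates the rule; it is not
taken from HPSSPy.

```python
import os


def is_excluded(path, root, section, exclude):
    """Return True if an absolute path is listed in the exclude list.

    Names in the exclude list are relative to the section directory,
    that is, to os.path.join(root, section).
    """
    section_root = os.path.join(root, section)
    relative = os.path.relpath(path, section_root)
    return relative in exclude


print(is_excluded("/projects/my_project/data/d1/README.html",
                  "/projects/my_project", "data", ["d1/README.html"]))
```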

Mapping File Names to HPSS Archives
+++++++++++++++++++++++++++++++++++

Within a section, each immediate subdirectory should be described with
a keyword in the configuration file. :command:`missing_from_hpss` will
complain if not, but it won't necessarily cause it to fail. In the
example above, ``/projects/my_project/data/d1`` is configured.

There are many possible ways to bundle files for archiving. Generally you
want to make archives as large as possible, without spilling onto multiple
tapes. However, with highly structured, deeply-nested directory structures,
this isn't always the best way to do it from a data *retrieval* viewpoint.

Consider this scenario. ``/projects/my_project/data`` has been archived to
ten tape archives called ``data00.tar``, ``data01.tar``, ... ``data09.tar``.
The file ``/projects/my_project/data/d1/templates/d1_template_05.fits``
needs to be recovered. Which tape archive contains it?

Now consider the scenario where the files in
``/projects/my_project/data/d1/templates`` have been archived to
``/hpss/projects/my_project/data/d1/templates/d1_templates_files.tar``.
Now is it easier to recover the file?

One should still try to make archives as big as possible, but generally
speaking, long-term archiving of large, complex data sets should be
done by **someone who actually knows the structure of the data set**.

In coding terms, we describe a portion of a directory tree hierarchy
using regular expressions to match *files* in that portion. Then we map
files that match a given regular expression to tape archive files.
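
As a rough illustration of this idea, the sketch below maps file names to
archive names using the ``"d1"`` entries from the example configuration.
The function name ``archive_name`` is hypothetical, and the use of
``match()`` and ``expand()`` from the :mod:`re` module is an assumption
about one reasonable way to do this, not a description of HPSSPy's
internals.

```python
import re

# The "d1" mapping from the example configuration, written as Python raw
# strings (so each "\\" in the JSON file appears here as a single "\").
mapping = {
    r"d1/batch/.*$": r"d1/batch.tar",
    r"d1/([^/]+\.txt)$": r"d1/\1",
    r"d1/templates/[^/]+$": r"d1/templates/templates_files.tar",
}


def archive_name(filename, mapping):
    """Return the archive name for a file, or None if nothing matches."""
    for pattern, archive in mapping.items():
        m = re.match(pattern, filename)
        if m is not None:
            # Substitute any matched groups (\1, \2, ...) into the name.
            return m.expand(archive)
    return None


for name in ("d1/batch/run1/a.dat", "d1/notes.txt",
             "d1/templates/d1_template_05.fits", "d2/unknown.dat"):
    print(name, "->", archive_name(name, mapping))
```

A file that maps to ``None`` is not covered by any regular expression in
this mapping.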

Regular Expression Details
++++++++++++++++++++++++++

The HPSSPy package and :command:`missing_from_hpss` will validate the
regular expressions used in the configuration file, in addition to checking
the overall validity of the JSON file itself. That is, a bad regular
expression will be rejected before it has any chance to "touch" any real data.
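
A check of this kind can be approximated by compiling every regular
expression before any files are scanned. The sketch below is an assumption
about how such a check might look; the function name ``invalid_patterns``
is hypothetical and is not the validator shipped with HPSSPy.

```python
import re


def invalid_patterns(mapping):
    """Return a list of (pattern, error message) for bad expressions."""
    bad = []
    for pattern in mapping:
        try:
            re.compile(pattern)
        except re.error as e:
            bad.append((pattern, str(e)))
    return bad


# The second pattern has an unbalanced parenthesis.
problems = invalid_patterns({"d1/batch/.*$": "d1/batch.tar",
                             "d1/(batch/.*$": "d1/batch.tar"})
for pattern, message in problems:
    print("Bad regular expression {0!r}: {1}".format(pattern, message))
```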

The regular expressions should follow Python's conventions,
described in :mod:`re`. In addition to those conventions, this package
imposes some additional requirements, conventions and idioms:

* Requirements

  - Backslashes must be escaped in JSON files. For example, the
    metacharacter ``\d`` (match a single decimal digit) becomes ``\\d``.
  - Regular expressions should end with the end-of-line marker ``$``.

* Conventions

  - Any archive file name ending in ``.tar`` is assumed to be an HTAR file,
    and that command will be used to construct it.
  - Any archive file *not* ending in ``.tar`` will simply be copied to
    HPSS as is.
  - When constructing an archive file, :command:`missing_from_hpss` will
    obtain the directory it needs to archive from the name of the *archive*
    file, not the regular expression itself. This is because regular
    expression *substitution* is performed on the archive file name.
    For example, ``batch.tar`` means "archive a batch/ directory".
    For longer file names, the "suffix" of the file will be used.
    ``data_d1_batch.tar`` also means "archive a batch/ directory", because
    ``data_d1_`` is stripped off.
  - An archive file name that ends with ``_files.tar``, *e.g.*
    ``foo/bar_files.tar``, is a signal to :command:`missing_from_hpss` to
    construct the archive file in a certain way: not by descending into a
    directory, but by constructing an explicit list of files and building
    an archive file out of that.

* Idioms

  - Archive the entire contents of a directory into a single file:
    ``"foo/.*$" : "foo.tar"``.
  - Archive several subdirectories of a directory, each into their own file:
    ``"foo/(bar|baz|flub)/.*$" : "foo/foo_\\1.tar"``. The name of the
    directory matched in parentheses will be substituted into the file name.
  - Archive arbitrary subdirectories of a *set* of subdirectories:
    ``"d1/foo/(ab|bc|cd|de|ef)/([^/]+)/.*$":"d1/foo/\\1/d1_foo_\\1_\\2.tar"``
  - Match files in a directory, but not any files in any
    subdirectory: ``"foo/[^/]+$" : "foo_files.tar"``. See also the
    ``_files.tar`` convention mentioned above.
  - Group some but not all subdirectories in a directory into a single
    archive file for efficiency:
    ``"foo/([0-9])([0-9][0-9])/.*$" : "foo/foo_\\1XX.tar"``.
    Note the ending of the archive file, and that the directories have to
    have a very uniform naming convention (three and only three digits
    in this example).
  - Do not create an archive file, just copy the file, as is, to HPSS:
    ``"d1/README\\.txt$" : "d1/README.txt"``. Similarly, for a set of TXT
    files: ``"d1/([^/]+\\.txt)$" : "d1/\\1"``.
  - An example with lots of substitutions::

        "d1/foo/([0-9a-zA-Z_-]+)/sub-([0-9]+)/([0-9]+)/.*$" : "d1/foo/\\1/spectra-\\2/\\1_spectra-\\2_\\3.tar"

Finally, for truly monumentally-complicated directory trees, there is a
`JSON file`_ included with this distribution describing the SDSS_ data tree
that can be used for examples. To view the equivalent files and directories
for section ``"dr12"``, for example, visit https://data.sdss.org/sas/dr12.
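
A few of the idioms listed above can be tried out interactively before
committing them to a configuration file. The sketch below starts from JSON
text, so it also shows how the doubled backslashes are decoded before the
regular expression engine sees them. The function name ``archive_for`` and
the sample file names are hypothetical, and the matching logic is an
assumption for illustration, not HPSSPy's own code.

```python
import json
import re

# JSON text exactly as it would appear in a configuration file.  The
# Python raw string keeps the doubled backslashes intact for the parser.
text = r'''
{
    "foo/(bar|baz|flub)/.*$": "foo/foo_\\1.tar",
    "foo/([0-9])([0-9][0-9])/.*$": "foo/foo_\\1XX.tar",
    "d1/([^/]+\\.txt)$": "d1/\\1"
}
'''
mapping = json.loads(text)  # after decoding, "\\1" has become "\1"

# Requirement from the list above: every expression ends with "$".
assert all(pattern.endswith("$") for pattern in mapping)


def archive_for(filename):
    """Return the archive name for a file, or None if nothing matches."""
    for pattern, archive in mapping.items():
        m = re.match(pattern, filename)
        if m is not None:
            return m.expand(archive)
    return None


for name in ("foo/bar/x/y.dat", "foo/123/file.dat", "d1/README.txt"):
    print(name, "->", archive_for(name))
```

In this sketch all files below ``foo/123/`` map to ``foo/foo_1XX.tar``, as
would files below any other three-digit directory whose name starts with
``1``, which is the grouping behavior described in the idiom.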

.. _SDSS: https://www.sdss.org
.. _`JSON file`: https://github.com/weaverba137/hpsspy/blob/master/hpsspy/data/sdss.json
4 changes: 3 additions & 1 deletion doc/index.rst
@@ -35,8 +35,10 @@ Contents
.. toctree::
:maxdepth: 1

changes
configuration
using
api
changes

Indices and tables
++++++++++++++++++