Showing 15 changed files with 715 additions and 136 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,197 @@ | ||
==================== | ||
HPSSPy Configuration | ||
==================== | ||
|
||
Introduction | ||
++++++++++++ | ||
|
||
The primary HPSSPy command-line program :command:`missing_from_hpss` is | ||
configured with a JSON_ file. Both the JSON standard and the | ||
Python :mod:`json` library are very strict. There is, however, a quick way | ||
to check the validity of a JSON file:: | ||
|
||
python -c 'import json; j = open("config.json"); data = json.load(j); j.close()' | ||
|
||
where ``"config.json"`` should be replaced with the name of the file to be | ||
tested. | ||
|
||
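The one-liner above only reports success or a traceback. As a minimal sketch (the function names ``check_json_text`` and ``check_json_file`` are hypothetical and not part of HPSSPy), the same check can be wrapped so that it reports where the first error occurs:

```python
import json


def check_json_text(text):
    """Return (True, data) if text is valid JSON, else (False, message)."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as err:
        # lineno and colno locate the first problem the parser encountered.
        return False, f"line {err.lineno}, column {err.colno}: {err.msg}"
    return True, data


def check_json_file(path):
    """Apply check_json_text() to the contents of the file at path."""
    with open(path) as handle:
        return check_json_text(handle.read())
```

Calling ``check_json_file("config.json")`` returns the parsed data on success, or a short message that points at the offending line and column.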
The top-level JSON container should be an "object", equivalent to a Python | ||
:class:`dict`. The simplest possible file that satisfies this requirement | ||
is:: | ||
|
||
{ | ||
} | ||
|
||
Obviously, that's not very much to go on. You will need further data | ||
described below. | ||
|
||
.. _JSON: http://json.org | ||
|
||
Metadata | ||
++++++++ | ||
|
||
The configuration file should contain a top-level keyword ``"config"``. | ||
The value should itself be a :class:`dict`, containing some important | ||
metadata:: | ||
|
||
    { | ||
        "config": { | ||
            "root": "/global/project/projectdirs/my_project", | ||
            "hpss_root": "/nersc/projects/my_project", | ||
            "physical_disks": ["my_project"] | ||
        } | ||
    } | ||
|
||
root/ | ||
The directory that contains *all* the data associated with the project. | ||
|
||
hpss\_root/ | ||
The path on the HPSS tape system that will contain the backups. | ||
|
||
physical\_disks/ | ||
If the data are spread across several physical disks and linked into | ||
the root path via symlinks, the various physical disks need to be listed | ||
here. If the value is equivalent to ``False``, *e.g.*, | ||
``null``, ``false``, or ``[]``, this means that the | ||
``"root"`` disk contains all the physical data. If the value | ||
is equivalent to a one-item list containing ``os.path.basename(root)``, | ||
then this *also* means that the ``"root"`` disk contains all the physical | ||
data. A list of simple names generates the physical disks by | ||
substitution on the basename of the ``"root"`` value. More complicated | ||
configurations are possible, see :func:`hpsspy.scan.physical_disks`. | ||
|
||
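The three simple cases described for ``physical_disks`` can be sketched as follows. This is only an illustration of the rules as stated above, not the code in :func:`hpsspy.scan.physical_disks`; the function name ``expand_physical_disks`` is hypothetical, and the more complicated configurations are not covered:

```python
import posixpath


def expand_physical_disks(root, physical_disks):
    """Return the list of directories that hold the physical data.

    Assumption: a list of simple names replaces the basename of root.
    """
    base = posixpath.basename(root)
    # null, false, [] and a one-item list naming the root itself all mean
    # "the root disk contains everything".
    if not physical_disks or list(physical_disks) == [base]:
        return [root]
    parent = posixpath.dirname(root)
    return [posixpath.join(parent, name) for name in physical_disks]
```

For example, with ``"root": "/projects/my_project"`` and ``"physical_disks": ["disk01", "disk02"]``, this sketch yields ``/projects/disk01`` and ``/projects/disk02``.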
Sections | ||
++++++++ | ||
|
||
Inside the root directory, as described above, there may be several top-level | ||
directories. For the purposes of this documentation, these are called | ||
"sections" or "releases". The terms are interchangeable. Each section | ||
has configuration items that describe its structure:: | ||
|
||
    { | ||
        "config": { | ||
            "root": "/projects/my_project", | ||
            "hpss_root": "/hpss/projects/my_project", | ||
            "physical_disks": ["my_project"] | ||
        }, | ||
        "data": { | ||
            "exclude": [], | ||
            "d1": { | ||
                "d1/batch/.*$": "d1/batch.tar", | ||
                "d1/([^/]+\\.txt)": "d1/\\1", | ||
                "d1/templates/[^/]+$": "d1/templates/templates_files.tar" | ||
            } | ||
        } | ||
    } | ||
|
||
The :command:`missing_from_hpss` command works on one section at a time. | ||
The name of the section is passed on the command-line:: | ||
|
||
missing_from_hpss config.json data | ||
|
||
This would read the ``"data"`` section above. | ||
|
||
Each section should have an ``"exclude"`` keyword, whose value is a list | ||
of files to be ignored. In the example above, in order to ignore the file | ||
``/projects/my_project/data/d1/README.html``, the ``"exclude"`` value | ||
would be ``["d1/README.html"]``. Note that this is relative to the | ||
path ``/projects/my_project/data``, since ``"data"`` is the section being | ||
processed. | ||
|
||
Mapping File Names to HPSS Archives | ||
+++++++++++++++++++++++++++++++++++ | ||
|
||
Within a section, each immediate subdirectory should be described with | ||
a keyword in the configuration file. :command:`missing_from_hpss` will | ||
complain about any subdirectory that is not, but this won't necessarily cause it to fail. In the | ||
example above, ``/projects/my_project/data/d1`` is configured. | ||
|
||
There are many possible ways to bundle files for archiving. Generally you | ||
want to make archives as large as possible, without spilling onto multiple | ||
tapes. However, with highly structured, deeply-nested directory structures, | ||
this isn't always the best way to do it from a data *retrieval* viewpoint. | ||
|
||
Consider this scenario. ``/projects/my_project/data`` has been archived to | ||
ten tape archives called ``data00.tar``, ``data01.tar``, ... ``data09.tar``. | ||
The file ``/projects/my_project/data/d1/templates/d1_template_05.fits`` | ||
needs to be recovered. Which tape archive contains it? | ||
|
||
Now consider the scenario where the files in | ||
``/projects/my_project/data/d1/templates`` have been archived to | ||
``/hpss/projects/my_project/data/d1/templates/d1_templates_files.tar``. | ||
Now is it easier to recover the file? | ||
|
||
One should still try to make archives as big as possible, but generally | ||
speaking, long-term archiving of large, complex data sets should be | ||
done by **someone who actually knows the structure of the data set**. | ||
|
||
In coding terms we describe a portion of a directory tree hierarchy | ||
using regular expressions to match *files* in that portion. Then we map | ||
files that match that regular expression to tape archive files. | ||
|
||
Regular Expression Details | ||
++++++++++++++++++++++++++ | ||
|
||
The HPSSPy package and :command:`missing_from_hpss` will validate the | ||
regular expressions used in the configuration file, in addition to checking | ||
the overall validity of the JSON file itself. That is, a bad regular | ||
expression will be rejected before it has any chance to "touch" any real data. | ||
|
||
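A check of this kind can be sketched with :func:`re.compile`. The helper ``invalid_patterns`` below is hypothetical and only illustrates the idea; it is not the validation code used by HPSSPy:

```python
import re


def invalid_patterns(mapping):
    """Return {pattern: error message} for every key that fails to compile."""
    bad = {}
    for pattern in mapping:
        try:
            re.compile(pattern)
        except re.error as err:
            # The pattern is rejected before it is applied to any file name.
            bad[pattern] = str(err)
    return bad
```

An empty result means every regular expression in the mapping compiled; a non-empty result names each bad expression along with the reason it was rejected.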
The regular expressions should follow Python's conventions, | ||
described in :mod:`re`. In addition to those conventions, this package | ||
imposes some additional requirements, conventions and idioms: | ||
|
||
* Requirements | ||
|
||
- Backslashes must be escaped in JSON files. For example, the | ||
metacharacter ``\d`` (match a single decimal digit) becomes ``\\d``. | ||
"Double-escaping" is not required (if you don't know what this is, | ||
don't worry about it). | ||
|
||
* Conventions | ||
|
||
- Any archive file name ending in ``.tar`` is assumed to be an HTAR file, | ||
and that command will be used to construct it. | ||
- Any archive file *not* ending in ``.tar`` will simply be copied to | ||
HPSS as is. | ||
- When constructing an archive file, :command:`missing_from_hpss` will | ||
obtain the directory it needs to archive from the name of the *archive* | ||
file, not the regular expression itself. This is because regular | ||
expression *substitution* is performed on the archive file name. | ||
For example ``batch.tar`` means "archive a batch/ directory". | ||
For longer file names, the "suffix" of the file will be used. | ||
``data_d1_batch.tar`` also means "archive a batch/ directory", because | ||
``data_d1_`` is stripped off. | ||
- An archive filename that ends with ``_files.tar``, *e.g.*, ``foo/bar_files.tar``, | ||
is a signal to :command:`missing_from_hpss` to construct | ||
the archive file in a certain way, not by descending into a directory, | ||
but by constructing an explicit list of files and building an archive | ||
file out of that. | ||
- Regular expressions should end with the end-of-line marker ``$``. | ||
|
||
* Idioms | ||
|
||
- Archive the entire contents of a directory into a single file: | ||
``"foo/.*$" : "foo.tar"``. | ||
- Archive several subdirectories of a directory, each into their own file: | ||
``"foo/(bar|baz|flub)/.*$" : "foo/foo_\\1.tar"``. The name of the | ||
directory matched in parentheses will be substituted into the file name. | ||
- Archive arbitrary subdirectories of a *set* of subdirectories: | ||
``"d1/foo/(ab|bc|cd|de|ef)/([^/]+)/.*$":"d1/foo/\\1/d1_foo_\\1_\\2.tar"`` | ||
- Match files in a directory, but not any files in any | ||
subdirectory: ``"foo/[^/]+$" : "foo_files.tar"``. See also the | ||
``_files.tar`` convention mentioned above. | ||
- Do not create an archive file, just copy the file, as is, to HPSS: | ||
``"d1/README\\.txt" : "d1/README.txt"``. Similarly, for a set of TXT files: | ||
``"d1/([^/]+\\.txt)" : "d1/\\1"``. | ||
- An example with lots of substitutions:: | ||
|
||
"d1/foo/([0-9a-zA-Z_-]+)/sub-([0-9]+)/([0-9]+)/.*$" : "d1/foo/\\1/spectra-\\2/\\1_spectra-\\2_\\3.tar" | ||
|
||
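The backslash requirement, the ``.tar`` conventions and the substitution idioms above can be tried out in a few lines of Python. This is a rough sketch under stated assumptions: the helpers ``archive_for`` and ``archive_kind`` are hypothetical, patterns are assumed to be applied with :func:`re.match` from the start of the relative path, and the TXT pattern has been given a trailing ``$`` to follow the convention above. None of this is taken from the HPSSPy source:

```python
import json
import re

# A fragment in the style of the configuration examples. The doubled
# backslashes are what JSON requires; json.loads() turns them into the
# single backslashes that the re module expects.
fragment = r'''
{
    "d1/batch/.*$": "d1/batch.tar",
    "d1/([^/]+\\.txt)$": "d1/\\1",
    "d1/templates/[^/]+$": "d1/templates/templates_files.tar",
    "d1/foo/(bar|baz)/.*$": "d1/foo/foo_\\1.tar"
}
'''
mapping = json.loads(fragment)


def archive_for(path, mapping):
    """Return the archive name produced by every pattern that matches path."""
    archives = []
    for pattern, template in mapping.items():
        match = re.match(pattern, path)
        if match:
            # expand() substitutes \1, \2, ... into the archive file name.
            archives.append(match.expand(template))
    return archives


def archive_kind(name):
    """Classify an archive name according to the conventions listed above."""
    if name.endswith("_files.tar"):
        return "htar, explicit file list"
    if name.endswith(".tar"):
        return "htar, whole directory"
    return "plain copy"
```

For instance, ``archive_for("d1/foo/baz/deep/file.fits", mapping)`` gives ``["d1/foo/foo_baz.tar"]``, and ``archive_kind("d1/notes.txt")`` reports a plain copy.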
Finally, for truly monumentally-complicated directory trees, there is a | ||
`JSON file`_ included with this distribution describing the SDSS_ data tree | ||
that can be used for examples. To view the equivalent files and directories | ||
for section ``"dr12"``, for example, visit https://data.sdss.org/sas/dr12. | ||
|
||
.. _SDSS: https://www.sdss.org | ||
.. _`JSON file`: https://github.com/weaverba137/hpsspy/blob/master/hpsspy/data/sdss.json |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,105 @@ | ||
============ | ||
Using HPSSPy | ||
============ | ||
|
||
Introduction | ||
++++++++++++ | ||
|
||
The primary *command-line* interface to HPSSPy is the script | ||
:command:`missing_from_hpss`, which is automatically generated by the | ||
package install process. If you need to generate this script manually, it | ||
is equivalent to:: | ||
|
||
#!/usr/bin/env python | ||
from sys import exit | ||
from hpsspy.scan import main | ||
exit(main()) | ||
|
||
Options | ||
+++++++ | ||
|
||
There are a lot of command-line options. ``missing_from_hpss --help`` will | ||
display all of them. Only the short versions of the options are | ||
shown here. | ||
|
||
-c DIR Cache files (described below) are written to | ||
``$HOME/scratch`` by default. This option | ||
allows the user to choose any directory. | ||
-D Delete and recreate the disk cache file | ||
(described below). | ||
-H Delete and recreate the HPSS cache file | ||
(described below). | ||
-l N Limit archive files to this size in GB. | ||
The default is 1024 GB (1 TB). | ||
-p Issue the HPSS commands necessary to actually | ||
back up the files found that need to be backed up. | ||
-r N Issue a progress report on how many files | ||
have been analyzed after ``N`` files | ||
(default 10,000). | ||
-t Test mode. Try not to make any changes. | ||
Also pretend that there are no files backed up to HPSS. | ||
-v Print *lots* of extra information. | ||
--version Print a version string and exit. | ||
|
||
Besides the options described above, :command:`missing_from_hpss` requires | ||
two positional arguments:: | ||
|
||
missing_from_hpss config.json section | ||
|
||
The two arguments are the path to a configuration file and a section of that | ||
file to process. These are extensively described in the | ||
:doc:`configuration document <configuration>`. | ||
|
||
Cache Files | ||
+++++++++++ | ||
|
||
:command:`missing_from_hpss` uses a few cache files primarily to reduce | ||
memory footprint. These files will be stored in ``$HOME/scratch`` | ||
by default. The files are: | ||
|
||
Disk Cache | ||
A CSV file of the form ``disk_cache_<section>.csv``, where ``<section>`` is | ||
the section (as defined above) specified on the command-line. The | ||
columns are file name and file size in bytes. | ||
|
||
HPSS Cache | ||
A plain-text file of the form ``hpss_cache_<section>.txt``, | ||
where ``<section>`` is the section (as defined above) specified on | ||
the command-line. This is simply a list of files found on HPSS. | ||
|
||
Missing File Cache | ||
A JSON file of the form ``$HOME/scratch/missing_files_<section>.json``, | ||
where ``<section>`` is the section (as defined above) specified on the | ||
command-line. It contains a map of HPSS archive files to the files that | ||
belong in that archive. In addition, the size of the resulting files | ||
(modulo small overheads from the archive file creation process) will | ||
be saved to this file. | ||
|
||
These files are *not* cleaned up by default because they are very useful | ||
for debugging purposes. | ||
|
||
Testing and Quality Assurance | ||
+++++++++++++++++++++++++++++ | ||
|
||
To test a configuration file just run :command:`missing_from_hpss` with the | ||
``--test`` option as described above. Aside from creating cache files in | ||
a scratch directory as described above, this mode will not alter any | ||
data, either on disk or on HPSS. | ||
|
||
In addition to validating JSON files and regular expressions, as | ||
described in the :doc:`configuration document <configuration>`, | ||
:command:`missing_from_hpss` will: | ||
|
||
1. Make sure all regular expressions are actually used. | ||
2. Make sure all files actually match *one and only one* regular expression. | ||
3. Create a manifest file containing the actual files on disk matched and | ||
the archive file they map to. This is one and the same as the | ||
"Missing File Cache" described above. | ||
4. Make sure that all archive file sizes are less than a user-defined limit | ||
(default 1 TB), configurable on the command-line. | ||
|
||
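The first two checks in the list above can be illustrated with a short sketch. The function ``check_coverage`` is hypothetical and assumes patterns are applied with ``match`` from the start of each relative file name; it is not the implementation used by :command:`missing_from_hpss`:

```python
import re


def check_coverage(files, mapping):
    """Return (unmatched files, ambiguous files, unused patterns)."""
    compiled = {pattern: re.compile(pattern) for pattern in mapping}
    used = set()
    unmatched, ambiguous = [], []
    for name in files:
        hits = [p for p, rx in compiled.items() if rx.match(name)]
        used.update(hits)
        if not hits:
            # No regular expression claims this file.
            unmatched.append(name)
        elif len(hits) > 1:
            # More than one regular expression claims this file.
            ambiguous.append(name)
    unused = [p for p in mapping if p not in used]
    return unmatched, ambiguous, unused
```

A clean configuration returns three empty lists: every file matches exactly one regular expression, and every regular expression matches at least one file.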
HPSSPy Library | ||
++++++++++++++ | ||
|
||
For programmatic access to HPSS, the :doc:`HPSSPy library <api>` provides | ||
equivalents of :mod:`os` and :mod:`os.path` that operate on the HPSS filesystem. |