Commit

Merge branch 'develop' for 0.15.0
ctberthiaume committed Jun 19, 2019
2 parents 83eb73a + 72b08cd commit e167e8d
Showing 14 changed files with 543 additions and 212 deletions.
4 changes: 4 additions & 0 deletions .dockerignore
@@ -1 +1,5 @@
/pyinstaller/
/.tox/
/.pytest_cache/
/.git/
/__pycache__/
84 changes: 48 additions & 36 deletions README.md
@@ -10,7 +10,8 @@ This package is compatible with Python 3.7.

Single file executables of the `seaflowpy` command-line tool
for MacOS and Linux can be downloaded from the project's github
[release page](https://github.com/armbrustlab/seaflowpy/releases).
[releases page](https://github.com/armbrustlab/seaflowpy/releases).
This is the recommended method if only the command-line tool is required.

### Docker

@@ -45,6 +46,20 @@ seaflowpy version
deactivate
```

## Configuration

To use `seaflowpy sfl manifest`, AWS credentials must be configured.
The easiest way to do this is to install the `awscli` Python package
and run its interactive configuration.

```sh
pip3 install awscli
aws configure
```

This stores AWS configuration in `~/.aws`, which `seaflowpy` uses to
access SeaFlow data in S3 storage.
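
As a quick check that the configuration step worked, the sketch below looks for a profile with both key fields in the `credentials` file that `aws configure` writes. The function name `has_aws_credentials` is hypothetical and is not part of `seaflowpy`. It only confirms that the fields exist, not that the keys are valid or that they grant access to any particular bucket.

```python
import configparser
from pathlib import Path


def has_aws_credentials(aws_dir, profile="default"):
    """Return True if aws_dir/credentials defines both key fields for profile."""
    parser = configparser.ConfigParser()
    # read() silently skips files that do not exist
    parser.read(Path(aws_dir) / "credentials")
    if not parser.has_section(profile):
        return False
    section = parser[profile]
    return "aws_access_key_id" in section and "aws_secret_access_key" in section


if __name__ == "__main__":
    print(has_aws_credentials(Path.home() / ".aws"))
```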

## Integration with R

To call `seaflowpy` from R, update the PATH environment variable in
@@ -70,51 +85,48 @@ Run `seaflowpy --help` to begin exploring the CLI usage documentation.
SFL validation sub-commands are available under the `seaflowpy sfl` command.
The usage details for each command can be accessed as `seaflowpy sfl <cmd> -h`.

#### `seaflowpy sfl convert-gga`

Converts GGA coordinate values to decimal degree. Otherwise the file is
unchanged.

#### `seaflowpy sfl dedup`
The basic workflow should be:

Remove lines in an SFL file with duplicate "FILE" values.
Because it's impossible to know which of the duplicated SFL entries
corresponds to which EVT file, all duplicate rows are removed.
A unique list of removed files is printed to STDERR.
1) If starting with an SDS file, first convert to SFL with `seaflowpy sds2sfl`

#### `seaflowpy sfl manifest`
2) If the SFL file is output from `sds2sfl` or is a raw SeaFlow SFL file,
convert it to a normalized format with `seaflowpy sfl print`.
This command can be used to concatenate multiple SFL files,
e.g. merge all SFL files in day-of-year directories.

Compare EVT files listed in an SFL file with EVT files on-disk
or in cloud object storage.
This can serve as a quick sanity check for the internal consistency of a
SeaFlow cruise data folder.
NB, it's normal for one file to be missing from the SFL file
or EVT day of year folder around midnight.
3) Check for potential errors or warnings with `seaflowpy sfl validate`.

#### `seaflowpy sfl print`
4) Fix errors and warnings. Duplicate file errors can be fixed with `seaflowpy sfl dedup`.
Bad lat/lon errors may be fixed with `seaflowpy sfl convert-gga`,
assuming the bad coordinates are GGA to begin with.
This can be checked with `seaflowpy sfl detect-gga`.
Other errors or missing values may need to be fixed manually.

Print a standard version of an SFL file with only the necessary columns.
The correct day of year folder will be added to "FILE" column values if not
present. "DATE" column will be created if not present from "FILE" column values
(only applies to new-style datestamped file names).
Any other required columns which are missing will be created with "NA" values.
5) (Optional) Update event rates based on true event counts and file duration
with `seaflowpy sfl fix-event-rate`.
True event counts for raw EVT files can be determined with `seaflowpy evt count`.
If filtering has already been performed then event counts can be pulled from
the `all_count` column of the opp table in the SQLite3 database,
e.g. `sqlite3 -separator $'\t' SCOPE_14.db 'SELECT file, all_count FROM opp ORDER BY file'`

#### `seaflowpy sfl validate`
6) (Optional) As a check for dataset completeness,
the list of files in an SFL file can be compared to the actual EVT files present
with `seaflowpy sfl manifest`. It's normal for a few files to differ,
especially near midnight. If a large number of files are missing it may be a
sign that the data transfer was incomplete or the SFL file is missing some days.

Validate key values in an SFL file. The following checks are performed:
7) Once all errors or warnings have been fixed, do a final `seaflowpy sfl validate`
before adding the SFL file to the appropriate repository.
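
The event-rate idea in step 5 can be sketched in Python as a rough illustration. It assumes an `opp` table with `file` and `all_count` columns, as the step describes, and it takes file durations from a caller-supplied dictionary. The file names, the 180 second durations, and the function `event_rates` are all hypothetical. The source does not show how `seaflowpy sfl fix-event-rate` actually computes its values, so this is only the count-divided-by-duration arithmetic.

```python
import sqlite3


def event_rates(conn, durations):
    """Map file -> events per second using opp.all_count and a duration lookup."""
    rows = conn.execute("SELECT file, all_count FROM opp ORDER BY file")
    rates = {}
    for file_id, all_count in rows:
        seconds = durations.get(file_id)
        if seconds:  # skip files with unknown or zero duration
            rates[file_id] = all_count / seconds
    return rates


# Toy in-memory database standing in for a real cruise database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE opp (file TEXT, all_count INTEGER)")
conn.executemany(
    "INSERT INTO opp VALUES (?, ?)",
    [("2014_185/a.evt", 18000), ("2014_185/b.evt", 9000)],
)
rates = event_rates(conn, {"2014_185/a.evt": 180.0, "2014_185/b.evt": 180.0})
print(rates)
```

A real opp table holds more columns than the two used here, and the durations would come from the SFL file rather than a hand-written dictionary.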

* all required columns are present
* "FILE" column values have day of year folders, are in the proper format,
in chronological order, and are unique
* "DATE" column values are in the proper format, represent valid date and times,
and are UTC
* "LAT" and "LON" coordinate column values are valid decimal degree values
## Development

Because some of these errors can affect every row of the file
(e.g. out of order files), only the first error of each type is printed.
To get a full printout of all errors run the command with `--verbose`.
### Source code structure

## Development
This project follows the [Git feature branch workflow](https://www.atlassian.com/git/tutorials/comparing-workflows/feature-branch-workflow).
Active development happens on the `develop` branch and on feature branches which are eventually merged into `develop`.
Commits on the `master` branch represent stable release snapshots with version tags and build products,
merged from `develop` with `--no-ff` to create a single commit in `master`
while keeping the complete commit history in `develop`.

### Build

4 changes: 2 additions & 2 deletions build.sh
@@ -106,7 +106,7 @@ deactivate
# Optional, upload wheel and source tarball to PyPI
# --------------------------------------------------------------------------- #
# Test against test PyPI repo
# twine upload -r https://test.pypi.org/legacy/ dist/seaflowpy-x.x.x*
# twine upload --repository-url https://test.pypi.org/legacy/ dist/seaflowpy-*

# Create a virtualenv and test install from test.pypi.org
# python -m venv pypi-test
@@ -115,4 +115,4 @@ deactivate
# pypi-test/bin/seaflowpy version

# Then upload to the real PyPI
# twine upload dist/seaflowpy-x.x.x*
# twine upload dist/seaflowpy-*
4 changes: 3 additions & 1 deletion setup.py
@@ -11,7 +11,6 @@
url='https://github.com/armbrustlab/seaflowpy',
author='Chris T. Berthiaume',
author_email='chrisbee@uw.edu',
license='GPL3',
packages=find_packages(where='src'),
package_dir={'': 'src'},
include_package_data=True,
@@ -26,5 +25,8 @@
'seaflowpy=seaflowpy.cli.cli:cli'
]
},
classifiers=[
'License :: OSI Approved :: GNU General Public License v3 (GPLv3)'
],
zip_safe=True
)
5 changes: 3 additions & 2 deletions src/seaflowpy/cli/commands/dayofyear_cmd.py
@@ -8,11 +8,12 @@
help='Print 3 columns: input path, file name, day of year dir.')
@click.argument('files', nargs=-1, type=click.Path())
def dayofyear_cmd(verbose, files):
"""Get calculated day of year dir from filename timestamp.
"""
Gets calculated day of year dir from filename timestamp.
File paths must be new-style datestamped paths. Any part of the file
path except for the filename will be ignored. The filename may include a
'.gz' extension.
'.gz' extension. Outputs to STDOUT.
"""
output = []
for file in files:
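
For readers unfamiliar with the day of year calculation described in the docstring, here is a minimal sketch. It rests on two assumptions that this diff does not confirm: that new-style file names begin with a timestamp like `2014-07-04T00-00-02+00-00`, and that the day of year directory is named `YYYY_DDD`. The function `dayofyear_dir` is hypothetical and is not the `seaflowpy` implementation.

```python
import datetime
import os
import re

# Assumed new-style timestamp prefix, e.g. 2014-07-04T00-00-02+00-00
STAMP_RE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})T\d{2}-\d{2}-\d{2}")


def dayofyear_dir(path):
    """Return an assumed 'YYYY_DDD' folder name for a datestamped file path."""
    name = os.path.basename(path)  # directory components are ignored
    if name.endswith(".gz"):
        name = name[:-3]  # the filename may carry a .gz extension
    match = STAMP_RE.match(name)
    if match is None:
        raise ValueError(f"not a datestamped file name: {name}")
    year, month, day = (int(g) for g in match.groups())
    doy = datetime.date(year, month, day).timetuple().tm_yday
    return f"{year}_{doy:03d}"
```

Under these assumptions, `dayofyear_dir("any/dir/2014-07-04T00-00-02+00-00.gz")` gives `2014_185`, since 4 July is day 185 of a non-leap year.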
51 changes: 26 additions & 25 deletions src/seaflowpy/cli/commands/db_cmd.py
@@ -13,10 +13,6 @@ def db_cmd():


@db_cmd.command('create')
@click.option('-i', '--infile', required=True, type=click.File(),
help='Input SFL file. - for stdin.')
@click.option('-d', '--db', 'dbpath', required=True,
help='SQLite3 database file.')
@click.option('-c', '--cruise',
help='Supply a cruise name here to override any found in the filename.')
@click.option('-f', '--force', is_flag=True,
@@ -27,16 +23,22 @@ def db_cmd():
help='Supply a instrument serial number here to override any found in the filename.')
@click.option('-v', '--verbose', is_flag=True,
help='Report all errors.')
def db_create_cmd(infile, dbpath, cruise, force, json, serial, verbose):
"""Create database from SFL file.
@click.argument('sfl-file', nargs=1, type=click.File())
@click.argument('db-file', nargs=1, type=click.Path(writable=True))
def db_create_cmd(cruise, force, json, serial, verbose, sfl_file, db_file):
"""
Creates database from SFL file.
Write processed SFL file data to SQLite3 database files. Data will be
Writes processed SFL-FILE data to SQLite3 database file. Data will be
checked before inserting. If any errors are found the first of each type
will be reported and no data will be written.
will be reported and no data will be written. To read from STDIN use '-'
for SFL-FILE. SFL-FILE should have the <cruise name> and <instrument serial>
embedded in the filename as '<cruise name>_<instrument serial>.sfl'. If not,
specify as options. Errors or warnings are output to STDOUT.
"""
if infile is not sys.stdin:
if sfl_file is not sys.stdin:
# Try to read cruise and serial from filename
results = sfl.parse_sfl_filename(infile.name)
results = sfl.parse_sfl_filename(sfl_file.name)
if results:
if cruise is None:
cruise = results[0]
@@ -46,20 +48,20 @@ def db_create_cmd(infile, dbpath, cruise, force, json, serial, verbose):
# Try to read cruise and serial from database if not already defined
if cruise is None:
try:
cruise = db.get_cruise(dbpath)
cruise = db.get_cruise(db_file)
except SeaFlowpyError as e:
pass
if serial is None:
try:
serial = db.get_serial(dbpath)
serial = db.get_serial(db_file)
except SeaFlowpyError as e:
pass

# Make sure cruise and serial are defined somewhere
if cruise is None or serial is None:
raise click.ClickException('instrument serial and cruise must both be specified either in filename as <cruise>_<instrument-serial>.sfl, as command-line options, or in database metadata table.')

df = sfl.read_file(infile)
df = sfl.read_file(sfl_file)

df = sfl.fix(df)
errors = sfl.check(df)
@@ -71,18 +73,17 @@ def db_create_cmd(infile, dbpath, cruise, force, json, serial, verbose):
sfl.print_tsv_errors(errors, sys.stdout, print_all=verbose)
if not force and len([e for e in errors if e["level"] == "error"]) > 0:
sys.exit(1)
sfl.save_to_db(df, dbpath, cruise, serial)
sfl.save_to_db(df, db_file, cruise, serial)
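
The lookup order that `db_create_cmd` uses for cruise and serial (command-line options first, then the file name, then the database) can be sketched on its own. The helper names `split_sfl_filename` and `resolve_metadata` are hypothetical, and `db_values` stands in for the `db.get_cruise` and `db.get_serial` calls. Splitting at the last underscore is an assumption made so that cruise names containing underscores, such as `SCOPE_14`, still parse. The real `sfl.parse_sfl_filename` may behave differently.

```python
import os


def split_sfl_filename(path):
    """Hypothetical parser: split '<cruise>_<serial>.sfl' at the last underscore."""
    stem, ext = os.path.splitext(os.path.basename(path))
    if ext != ".sfl" or "_" not in stem:
        return None
    cruise, serial = stem.rsplit("_", 1)
    if not cruise or not serial:
        return None
    return cruise, serial


def resolve_metadata(cruise, serial, filename=None, db_values=(None, None)):
    """Fill missing cruise/serial from the file name, then from database values."""
    if filename is not None:
        parsed = split_sfl_filename(filename)
        if parsed:
            cruise = cruise if cruise is not None else parsed[0]
            serial = serial if serial is not None else parsed[1]
    if cruise is None:
        cruise = db_values[0]
    if serial is None:
        serial = db_values[1]
    if cruise is None or serial is None:
        raise ValueError("cruise and instrument serial must both be specified")
    return cruise, serial
```

Explicit options always win, so `resolve_metadata("X", None, "SCOPE_14_740.sfl")` keeps cruise `X` and takes only the serial from the file name.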


@db_cmd.command('import-filter-params')
@click.option('-d', '--db', 'dbpath', required=True,
help='SQLite3 database file.')
@click.option('-i', '--infile', required=True, type=click.File(),
help='Input filter parameters CSV. - for stdin.')
@click.option('-c', '--cruise',
help='Supply a cruise name for parameter selection. If not provided cruise in database will be used.')
def db_import_filter_params_cmd(dbpath, infile, cruise):
"""Import filter parameters to database.
@click.argument('filter-file', nargs=1, type=click.File())
@click.argument('db-file', nargs=1, type=click.Path(exists=True, writable=True))
def db_import_filter_params_cmd(cruise, filter_file, db_file):
"""
Imports filter parameters to database.
File paths must be new-style datestamped paths. Any part of the file
path except for the filename will be ignored. The filename may include a
@@ -92,7 +93,7 @@ def db_import_filter_params_cmd(dbpath, infile, cruise):
# If cruise not supplied, try to get from db
if cruise is None:
try:
cruise = db.get_cruise(dbpath)
cruise = db.get_cruise(db_file)
except SeaFlowpyError:
pass

@@ -104,19 +105,19 @@ def db_import_filter_params_cmd(dbpath, infile, cruise):
"na_filter": True,
"encoding": "utf-8"
}
df = pd.read_csv(infile, **defaults)
df = pd.read_csv(filter_file, **defaults)
df.columns = [c.replace('.', '_') for c in df.columns]
params = df[df.cruise == cruise]
if len(params.index) == 0:
raise click.ClickException('no filter parameters found for cruise %s' % cruise)
db.save_filter_params(dbpath, params.to_dict('index').values())
db.save_filter_params(db_file, params.to_dict('index').values())


@db_cmd.command('merge')
@click.argument('db1', type=click.Path(exists=True))
@click.argument('db2', type=click.Path(exists=True))
@click.argument('db2', type=click.Path(exists=True, writable=True))
def db_merge_cmd(db1, db2):
"""Merge SQLite3 db1 into db2.
"""Merges SQLite3 DB1 into DB2.
Only merges gating, poly, filter tables.
"""