Make it possible to do eager saving #138

Merged
pnuu merged 6 commits into pytroll:main from feature-eager-saving on Mar 10, 2022

Conversation

@pnuu (Member) commented Feb 24, 2022

When saving multiple datasets to a single CF/NetCDF4 file using the syntax introduced in #51, I got random crashes inside the XArray library with RuntimeError: NetCDF: Not a valid ID. Some internet searching suggests this happens when a write uses dimensions that have not yet been defined in the file.

My solution is to add a config option that forces eager saving instead of delaying the writes and calling compute_writer_results() afterwards. With this PR applied, I haven't seen the crash once in 50 consecutive runs.

  • Tests added
  • Tests passed
  • Passes flake8 trollflow2
  • Fully documented
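
For illustration, a minimal sketch of what the eager/lazy switch amounts to; the save_all helper and the way the flag is plumbed through are hypothetical, not the actual plugin code:

#!/usr/bin/env python

from satpy.writers import compute_writer_results


def save_all(scene, save_kwargs_per_file, eager_writing=False):
    """Save each file either eagerly or with one combined compute at the end."""
    # eager_writing=True: each save_datasets() call computes and writes
    # immediately, so every netCDF dimension exists before the next write.
    delayed_results = []
    for save_kwargs in save_kwargs_per_file:
        res = scene.save_datasets(compute=eager_writing, **save_kwargs)
        if not eager_writing:
            delayed_results.append(res)
    if not eager_writing:
        # Lazy path: a single combined compute at the end. This is the
        # code path where the random RuntimeError shows up.
        compute_writer_results(delayed_results)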

@pnuu pnuu added the enhancement label Feb 24, 2022
@pnuu pnuu self-assigned this Feb 24, 2022
@pnuu pnuu requested a review from mraspaud as a code owner February 24, 2022 09:28
codecov bot commented Feb 24, 2022

Codecov Report

Merging #138 (d4eb0ca) into main (73c9e35) will decrease coverage by 0.08%.
The diff coverage is 100.00%.

❗ Current head d4eb0ca differs from the pull request's most recent head dbd4e5d. Consider uploading reports for the commit dbd4e5d to get more accurate results.


@@            Coverage Diff             @@
##             main     #138      +/-   ##
==========================================
- Coverage   95.56%   95.48%   -0.08%     
==========================================
  Files          11       11              
  Lines        2365     2372       +7     
==========================================
+ Hits         2260     2265       +5     
- Misses        105      107       +2     
Flag        Coverage Δ
unittests   95.48% <100.00%> (-0.08%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files                        Coverage Δ
trollflow2/plugins/__init__.py        92.84% <100.00%> (-0.38%) ⬇️
trollflow2/tests/test_trollflow2.py   99.47% <100.00%> (+<0.01%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@pnuu (Member, Author) commented Feb 24, 2022

Here's a short script using Satpy directly that demonstrates the random crashes:

#!/usr/bin/env python

import glob

import numpy as np
from satpy import Scene
from satpy.writers import compute_writer_results

COMPUTE = False


def main():
    fnames = glob.glob(
        "/home/lahtinep/data/satellite/polar/ears_pps/*48965.nc")
    dsets = ['cma', 'ct', 'ctth_alti', 'ctth_pres', 'ctth_tempe']
    glbl = Scene(reader='nwcsaf-pps_nc', filenames=fnames)
    glbl.load(dsets)
    dtype_int16 = {'dtype': np.int16}
    encoding = {'cma': dtype_int16, 'ct': dtype_int16}
    res = glbl.save_datasets(writer='cf', filename="/tmp/pps_test.nc",
                             encoding=encoding, include_lonlats=True, compute=COMPUTE)
    if not COMPUTE:
        compute_writer_results([res])


if __name__ == "__main__":
    main()

I'm trying to find an example that wouldn't require actual data.

@pnuu (Member, Author) commented Feb 24, 2022

And the corresponding trollflow2.yaml configuration:

product_list:

  output_dir:
    /tmp/
  fname_pattern:
    "{start_time:%Y%m%d_%H%M}_{platform_name}_{areaname}_EARS_PPS.nc"
  reader: nwcsaf-pps_nc
  subscribe_topics:
    - /test/ears/avhrr/pps/gatherer
  eager_writing: True

  areas:
    null:
      priority: 1
      areaname: swath
      products:
        ("cma", "ct", "ctth_alti", "ctth_pres", "ctth_tempe"):
          formats:
          - format: nc
            writer: cf
            encoding:
              cma:
                dtype: !!python/name:numpy.int16
              ct:
                dtype: !!python/name:numpy.int16
            include_lonlats: True


workers:
  - fun: !!python/name:trollflow2.plugins.create_scene
  - fun: !!python/name:trollflow2.plugins.load_composites
  - fun: !!python/name:trollflow2.plugins.resample
  - fun: !!python/name:trollflow2.plugins.save_datasets

@pnuu (Member, Author) commented Feb 24, 2022

Another pure-Satpy example using SEVIRI HRIT data:

#!/usr/bin/env python

import glob

from satpy import Scene
from satpy.writers import compute_writer_results

COMPUTE = False


def main():
    fnames = glob.glob(
        "/home/lahtinep/data/satellite/geo/0deg/*202202031045*")
    dsets = ['VIS006', 'VIS008']
    glbl = Scene(reader='seviri_l1b_hrit', filenames=fnames)
    glbl.load(dsets)
    res = glbl.save_datasets(writer='cf', filename="/tmp/seviri_test.nc",
                             include_lonlats=True, compute=COMPUTE)
    if not COMPUTE:
        compute_writer_results([res])


if __name__ == "__main__":
    main()

@pnuu (Member, Author) commented Feb 24, 2022

UPDATED VERSION

And a version that fails without any Pytroll code:

#!/usr/bin/env python

import datetime as dt

import numpy as np
import dask.array as da
import xarray as xr

COMPUTE = False
FNAME = "/tmp/xr_test.nc"


def main():
    y = np.arange(1000, dtype=np.uint16)
    x = np.arange(2000, dtype=np.uint16)
    now = dt.datetime.utcnow()
    times = xr.DataArray(np.array([now + dt.timedelta(seconds=i) for i in range(y.size)], dtype=np.datetime64),
                         coords={'y': y})

    # Write root
    root = xr.Dataset({}, attrs={'global': 'attribute'})
    written = [root.to_netcdf(FNAME, mode='w')]

    # Write first dataset
    data1 = xr.DataArray(da.random.random((y.size, x.size)), dims=['y', 'x'],
                         coords={'y': y, 'x': x, 'time': times})
    dset1 = xr.Dataset({'data1': data1})
    written.append(dset1.to_netcdf(FNAME, mode='a', compute=COMPUTE))

    # Write second dataset using the same time coordinates
    data2 = xr.DataArray(da.random.random((y.size, x.size)), dims=['y', 'x'],
                         coords={'y': y, 'x': x, 'time': times})
    dset2 = xr.Dataset({'data2': data2})
    written.append(dset2.to_netcdf(FNAME, mode='a', compute=COMPUTE))

    if not COMPUTE:
        da.compute(written)


if __name__ == "__main__":
    main()
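
Setting COMPUTE = True above exercises the eager path this PR forces: each to_netcdf() call writes immediately, so the dimensions are fully defined in the file before the next append opens it. A self-contained eager equivalent (the time coordinate is omitted for brevity, and FNAME differs so it doesn't clobber the lazy test file):

#!/usr/bin/env python

import numpy as np
import dask.array as da
import xarray as xr

FNAME = "/tmp/xr_test_eager.nc"


def main():
    y = np.arange(1000, dtype=np.uint16)
    x = np.arange(2000, dtype=np.uint16)

    # Write root
    xr.Dataset({}, attrs={'global': 'attribute'}).to_netcdf(FNAME, mode='w')

    for name in ('data1', 'data2'):
        data = xr.DataArray(da.random.random((y.size, x.size)),
                            dims=['y', 'x'], coords={'y': y, 'x': x})
        # compute defaults to True: each append is fully written, and its
        # dimensions defined, before the next append opens the file.
        xr.Dataset({name: data}).to_netcdf(FNAME, mode='a')


if __name__ == "__main__":
    main()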

@pnuu (Member, Author) commented Feb 24, 2022

Filed an issue with XArray: pydata/xarray#6300

@mraspaud (Member) left a comment

Thanks for adding this feature. I think it just needs a note in the general documentation that this option will most likely slow down processing significantly.

@pnuu (Member, Author) commented Mar 9, 2022

I added a note to the example configuration file. This will most likely be a temporary solution until pydata/xarray#6300 is fixed.
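
The note is along these lines (illustrative wording, not a verbatim quote from the shipped example config):

  # Save the datasets eagerly, computing each output file as soon as it
  # is written. Works around pydata/xarray#6300, but will most likely
  # slow down processing significantly.
  eager_writing: True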

@mraspaud (Member) left a comment

LGTM

@mraspaud mraspaud enabled auto-merge March 9, 2022 14:42
@pnuu pnuu disabled auto-merge March 10, 2022 06:05
@pnuu pnuu merged commit c9b3ad1 into pytroll:main Mar 10, 2022
@pnuu pnuu deleted the feature-eager-saving branch March 10, 2022 06:05