Make it possible to do eager saving #138

Merged
pnuu merged 6 commits into pytroll:main from feature-eager-saving on Mar 10, 2022

Conversation

@pnuu (Member) commented Feb 24, 2022

When saving multiple datasets to a single CF/NetCDF4 file using the syntax introduced in #51, I got random crashes inside the XArray library with RuntimeError: NetCDF: Not a valid ID. Some internet searching suggests this happens when a write uses dimensions that have not yet been defined in the file.

My solution is to add a config option that forces eager saving instead of delaying the writes and calling compute_writer_results() afterwards. With this PR applied, I haven't seen the crash once in 50 consecutive runs.

  • Tests added
  • Tests passed
  • Passes flake8 trollflow2
  • Fully documented
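
For illustration, a minimal sketch of what the eager/lazy switch amounts to; the save_all helper and the way the flag is plumbed through are hypothetical, not the actual plugin code:

#!/usr/bin/env python

from satpy.writers import compute_writer_results


def save_all(scene, save_kwargs_per_file, eager_writing=False):
    """Save each file either eagerly or with one combined compute at the end."""
    # eager_writing=True: each save_datasets() call computes and writes
    # immediately, so every netCDF dimension exists before the next write.
    delayed_results = []
    for save_kwargs in save_kwargs_per_file:
        res = scene.save_datasets(compute=eager_writing, **save_kwargs)
        if not eager_writing:
            delayed_results.append(res)
    if not eager_writing:
        # Lazy path: a single combined compute at the end. This is the
        # code path where the random RuntimeError shows up.
        compute_writer_results(delayed_results)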

@pnuu pnuu added the enhancement label Feb 24, 2022
@pnuu pnuu self-assigned this Feb 24, 2022
@pnuu pnuu requested a review from mraspaud as a code owner February 24, 2022 09:28
codecov bot commented Feb 24, 2022

Codecov Report

Merging #138 (d4eb0ca) into main (73c9e35) will decrease coverage by 0.08%.
The diff coverage is 100.00%.

❗ Current head d4eb0ca differs from the pull request's most recent head dbd4e5d. Consider uploading reports for the commit dbd4e5d to get more accurate results.


@@            Coverage Diff             @@
##             main     #138      +/-   ##
==========================================
- Coverage   95.56%   95.48%   -0.08%     
==========================================
  Files          11       11              
  Lines        2365     2372       +7     
==========================================
+ Hits         2260     2265       +5     
- Misses        105      107       +2     
Flag        Coverage Δ
unittests   95.48% <100.00%> (-0.08%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files                        Coverage Δ
trollflow2/plugins/__init__.py        92.84% <100.00%> (-0.38%) ⬇️
trollflow2/tests/test_trollflow2.py   99.47% <100.00%> (+<0.01%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@pnuu (Member, Author) commented Feb 24, 2022

Here's a short script using Satpy directly that demonstrates the random crashes:

#!/usr/bin/env python

import glob

import numpy as np
from satpy import Scene
from satpy.writers import compute_writer_results

COMPUTE = False


def main():
    fnames = glob.glob(
        "/home/lahtinep/data/satellite/polar/ears_pps/*48965.nc")
    dsets = ['cma', 'ct', 'ctth_alti', 'ctth_pres', 'ctth_tempe']
    glbl = Scene(reader='nwcsaf-pps_nc', filenames=fnames)
    glbl.load(dsets)
    dtype_int16 = {'dtype': np.int16}
    encoding = {'cma': dtype_int16, 'ct': dtype_int16}
    res = glbl.save_datasets(writer='cf', filename="/tmp/pps_test.nc",
                             encoding=encoding, include_lonlats=True, compute=COMPUTE)
    if not COMPUTE:
        compute_writer_results([res])


if __name__ == "__main__":
    main()

I'm trying to find an example that wouldn't require actual data.

@pnuu (Member, Author) commented Feb 24, 2022

And the corresponding trollflow2.yaml configuration:

product_list:

  output_dir:
    /tmp/
  fname_pattern:
    "{start_time:%Y%m%d_%H%M}_{platform_name}_{areaname}_EARS_PPS.nc"
  reader: nwcsaf-pps_nc
  subscribe_topics:
    - /test/ears/avhrr/pps/gatherer
  eager_writing: True

  areas:
    null:
      priority: 1
      areaname: swath
      products:
        ("cma", "ct", "ctth_alti", "ctth_pres", "ctth_tempe"):
          formats:
          - format: nc
            writer: cf
            encoding:
              cma:
                dtype: !!python/name:numpy.int16
              ct:
                dtype: !!python/name:numpy.int16
            include_lonlats: True


workers:
  - fun: !!python/name:trollflow2.plugins.create_scene
  - fun: !!python/name:trollflow2.plugins.load_composites
  - fun: !!python/name:trollflow2.plugins.resample
  - fun: !!python/name:trollflow2.plugins.save_datasets

@pnuu (Member, Author) commented Feb 24, 2022

Another pure-Satpy example using SEVIRI HRIT data:

#!/usr/bin/env python

import glob

from satpy import Scene
from satpy.writers import compute_writer_results

COMPUTE = False


def main():
    fnames = glob.glob(
        "/home/lahtinep/data/satellite/geo/0deg/*202202031045*")
    dsets = ['VIS006', 'VIS008']
    glbl = Scene(reader='seviri_l1b_hrit', filenames=fnames)
    glbl.load(dsets)
    res = glbl.save_datasets(writer='cf', filename="/tmp/seviri_test.nc",
                             include_lonlats=True, compute=COMPUTE)
    if not COMPUTE:
        compute_writer_results([res])


if __name__ == "__main__":
    main()

@pnuu (Member, Author) commented Feb 24, 2022

UPDATED VERSION

And a version that fails without any Pytroll code:

#!/usr/bin/env python

import datetime as dt

import numpy as np
import dask.array as da
import xarray as xr

COMPUTE = False
FNAME = "/tmp/xr_test.nc"


def main():
    y = np.arange(1000, dtype=np.uint16)
    x = np.arange(2000, dtype=np.uint16)
    now = dt.datetime.utcnow()
    times = xr.DataArray(np.array([now + dt.timedelta(seconds=i) for i in range(y.size)], dtype=np.datetime64),
                         coords={'y': y})

    # Write root
    root = xr.Dataset({}, attrs={'global': 'attribute'})
    written = [root.to_netcdf(FNAME, mode='w')]

    # Write first dataset
    data1 = xr.DataArray(da.random.random((y.size, x.size)), dims=['y', 'x'],
                         coords={'y': y, 'x': x, 'time': times})
    dset1 = xr.Dataset({'data1': data1})
    written.append(dset1.to_netcdf(FNAME, mode='a', compute=COMPUTE))

    # Write second dataset using the same time coordinates
    data2 = xr.DataArray(da.random.random((y.size, x.size)), dims=['y', 'x'],
                         coords={'y': y, 'x': x, 'time': times})
    dset2 = xr.Dataset({'data2': data2})
    written.append(dset2.to_netcdf(FNAME, mode='a', compute=COMPUTE))

    if not COMPUTE:
        da.compute(written)


if __name__ == "__main__":
    main()
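
Setting COMPUTE = True above exercises the eager path this PR forces: each to_netcdf() call writes immediately, so the dimensions are fully defined in the file before the next append opens it. A self-contained eager equivalent (the time coordinate is omitted for brevity, and FNAME differs so it doesn't clobber the lazy test file):

#!/usr/bin/env python

import numpy as np
import dask.array as da
import xarray as xr

FNAME = "/tmp/xr_test_eager.nc"


def main():
    y = np.arange(1000, dtype=np.uint16)
    x = np.arange(2000, dtype=np.uint16)

    # Write root
    xr.Dataset({}, attrs={'global': 'attribute'}).to_netcdf(FNAME, mode='w')

    for name in ('data1', 'data2'):
        data = xr.DataArray(da.random.random((y.size, x.size)),
                            dims=['y', 'x'], coords={'y': y, 'x': x})
        # compute defaults to True: each append is fully written, and its
        # dimensions defined, before the next append opens the file.
        xr.Dataset({name: data}).to_netcdf(FNAME, mode='a')


if __name__ == "__main__":
    main()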

@pnuu (Member, Author) commented Feb 24, 2022

Filed an issue with XArray: pydata/xarray#6300

@mraspaud (Member) left a comment

Thanks for adding this feature. I think it just needs a note in the general documentation that this option will most likely slow down processing significantly.

@pnuu (Member, Author) commented Mar 9, 2022

I added a note to the example configuration file. This will most likely be a temporary solution until pydata/xarray#6300 is fixed.
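
The note is along these lines (illustrative wording, not a verbatim quote from the shipped example config):

  # Save the datasets eagerly, computing each output file as soon as it
  # is written. Works around pydata/xarray#6300, but will most likely
  # slow down processing significantly.
  eager_writing: True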

@mraspaud (Member) left a comment

LGTM

@mraspaud mraspaud enabled auto-merge March 9, 2022 14:42
@pnuu pnuu disabled auto-merge March 10, 2022 06:05
@pnuu pnuu merged commit c9b3ad1 into pytroll:main Mar 10, 2022
@pnuu pnuu deleted the feature-eager-saving branch March 10, 2022 06:05