
Optimise the FCI L1c/netcdf_utils by introducing on-demand variables collection and caching #2237

Merged — 20 commits merged into pytroll:main from ameraner:feature_optimise_fcil1c on Dec 1, 2022

Conversation

@ameraner (Member) commented Oct 18, 2022

A first attempt at optimising the NetCDF4FileHandler for faster FCI L1c loading.

Idea: skip the size-based caching at Scene initialisation (cache_var_size=0), and instead read and cache variables as xarray/numpy objects on demand, when they are first needed.

Also only required variables/attributes (listed in the reader YAML) are collected. This reduces the number of collected variables/attributes from 93880 to 2070.
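The on-demand pattern can be sketched like this (a minimal, hypothetical illustration — the class and method names are made up here and this is not the actual satpy implementation):

```python
import numpy as np


class OnDemandNetCDFCache:
    """Minimal sketch of on-demand variable caching; names are hypothetical."""

    def __init__(self):
        # Nothing is collected or cached at init time (the cache_var_size=0 idea).
        self.cached_file_content = {}

    def _read_variable(self, var_path):
        # Stand-in for actually reading `var_path` from the netCDF file.
        return np.arange(3)

    def get_and_cache(self, var_path):
        # Read the variable on first access only, then serve it from the cache.
        if var_path not in self.cached_file_content:
            self.cached_file_content[var_path] = self._read_variable(var_path)
        return self.cached_file_content[var_path]
```

The first access pays the file-read cost; every later access for the same path returns the cached object.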

This PR also included a number of experiments aimed at solving the single-threadedness reported in #2186, so we'll be closing that issue with this PR:

  • create file handlers in parallel: slower than single-threaded

  • read datasets (scn.load() call) in parallel: slower than single-threaded

  • Closes #2186 (FCI L1c reader is single threaded)

  • Tests added

  • Fully documented
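For reference, the parallel file-handler creation tested above was along these lines (a hedged sketch using concurrent.futures; create_file_handler is a hypothetical stand-in). It measured slower than the plain loop, plausibly because the underlying netCDF4/HDF5 file access is serialised by a library-level lock:

```python
from concurrent.futures import ThreadPoolExecutor


def create_file_handler(filename):
    # Stand-in for building one file handler; in the real reader this is
    # dominated by netCDF4.Dataset(filename, 'r').
    return {"filename": filename}


filenames = ["chunk_0001.nc", "chunk_0002.nc", "chunk_0003.nc"]

# Parallel variant of the per-file loop; in the PR measurements this came
# out slower than the single-threaded version.
with ThreadPoolExecutor(max_workers=4) as executor:
    handlers = list(executor.map(create_file_handler, filenames))
```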

@ameraner (Member, Author) commented Oct 18, 2022

Testing code snippet, using one repeat cycle (RC) of data from the latest test data (download link from the EUM sftp):

import time
import warnings
from glob import glob

import numpy as np
from satpy import Scene

warnings.filterwarnings("ignore", category=UserWarning, module=r'.*crs')
warnings.filterwarnings("ignore", category=RuntimeWarning, module=r'.*dask')

# debug_on()

path_to_testdata = 'path_to_test_data/RCRC0072/'

scn_time = []
load_time = []
save_time = []
total_time = []

channels = [
    'vis_06',
    'vis_08',
    'ir_105',
    'ir_87',
]

n = 5

for i in range(n):
    print(i)
    scene_start_time = time.time()
    scn = Scene(filenames=glob(path_to_testdata + "*BODY*.nc"), reader='fci_l1c_nc')
    scn_elapsed = time.time() - scene_start_time
    scn_time.append(scn_elapsed)
    print(f"Done. It took {scn_elapsed:.2f}s to start Scene.")

    load_start_time = time.time()
    scn.load(channels, upper_right_corner='NE')
    load_elapsed = time.time() - load_start_time
    load_time.append(load_elapsed)
    print(f"Done. It took {load_elapsed:.2f}s to load datasets.")

    save_start_time = time.time()
    scn.save_datasets()
    save_elapsed = time.time() - save_start_time
    save_time.append(save_elapsed)
    print(f"Done. It took {save_elapsed:.2f}s to save datasets.")

    total_elapsed = time.time() - scene_start_time
    total_time.append(total_elapsed)
    print(f"Done. It took {total_elapsed:.2f}s for all ops.")

    del scn

print("*" * 50)
print(f"Done. It took {np.mean(scn_time):.2f}s to start Scene.")
print(f"Done. It took {np.mean(load_time):.2f}s to load datasets.")
print(f"Done. It took {np.mean(save_time):.2f}s to save datasets.")
print(f"Done. It took {np.mean(total_time):.2f}s for all ops.")

@ameraner (Member, Author) commented Oct 18, 2022

From the code above, the average stats for loading and saving two VIS and two IR channels are:
Before this PR (reverting to cache_var_size=10000):

Done. It took 10.15s  to start Scene.
Done. It took 4.72s  to load datasets.
Done. It took 13.03s  to save datasets.
Done. It took 27.90s  for all ops.

after this PR:

Done. It took 6.21s  to start Scene.
Done. It took 6.19s  to load datasets.
Done. It took 13.08s  to save datasets.
Done. It took 25.49s  for all ops.

So: we shave some time off the Scene initialisation thanks to the skipped caching, loading the datasets takes a bit longer (the variables now have to be read on demand), and in total we save a couple of seconds.

@ameraner (Member, Author) commented Oct 18, 2022

Note that almost all of the remaining Scene initialisation time now comes from the file opening here:

file_handle = netCDF4.Dataset(self.filename, 'r')

@codecov bot commented Nov 15, 2022

Codecov Report

Merging #2237 (563e303) into main (8c1ccdd) will increase coverage by 0.21%.
The diff coverage is 97.79%.

@@            Coverage Diff             @@
##             main    #2237      +/-   ##
==========================================
+ Coverage   94.13%   94.35%   +0.21%     
==========================================
  Files         293      310      +17     
  Lines       45079    46554    +1475     
==========================================
+ Hits        42437    43926    +1489     
+ Misses       2642     2628      -14     
Flag            Coverage Δ
behaviourtests  4.59% <0.00%> (-0.09%) ⬇️
unittests       94.99% <97.79%> (+0.20%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
satpy/readers/netcdf_utils.py 98.32% <96.25%> (-1.68%) ⬇️
satpy/readers/fci_l1c_nc.py 98.19% <100.00%> (+0.02%) ⬆️
satpy/tests/reader_tests/test_fci_l1c_nc.py 100.00% <100.00%> (ø)
satpy/tests/reader_tests/test_netcdf_utils.py 96.12% <100.00%> (+1.22%) ⬆️
satpy/readers/seadas_l2.py 96.84% <0.00%> (-1.28%) ⬇️
satpy/tests/test_resample.py 88.90% <0.00%> (-0.37%) ⬇️
satpy/composites/ahi.py 100.00% <0.00%> (ø)
satpy/readers/hrit_jma.py 97.94% <0.00%> (ø)
satpy/tests/test_utils.py 100.00% <0.00%> (ø)
satpy/readers/yaml_reader.py 97.50% <0.00%> (ø)
... and 48 more


@coveralls commented Nov 15, 2022

Coverage Status

Coverage increased (+0.2%) to 94.946% when pulling 563e303 on ameraner:feature_optimise_fcil1c into e7d24a3 on pytroll:main.

@pnuu (Member) commented Nov 17, 2022

I'm not happy with how I implemented the channel name replacement/expansion, because it is completely based on the FCI reader and file structure. I think it should be done in a more generic way, maybe using a dictionary with a "replace the key with each item in the given list" kind of approach.

@djhoese (Member) commented Nov 17, 2022

Can you use Python format strings?

@pnuu pnuu marked this pull request as ready for review November 18, 2022 07:37
@pnuu (Member) commented Nov 18, 2022

Can you use python format strings?

Oh yeah, .format() is cleaner than string replacement. Done in f0cda99
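For illustration, the .format()-based expansion looks roughly like this (the template path below is hypothetical; the real YAML entries differ):

```python
# Hypothetical templated variable path as it might be listed in the reader
# YAML; one concrete path is produced per channel via str.format() instead
# of string replacement.
template = "data/{channel_name}/measured/effective_radiance"
channels = ["vis_06", "vis_08", "ir_87", "ir_105"]

variable_paths = [template.format(channel_name=channel) for channel in channels]
```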

@mraspaud (Member) left a comment

Looks good, thanks a lot for the speedup! The caching function doesn't seem to be well tested according to codecov? Otherwise just a couple of questions/comments.

Review threads on satpy/readers/fci_l1c_nc.py and satpy/readers/netcdf_utils.py (resolved)
@ameraner ameraner changed the title Optimise the FCI L1c/netcdf_utils by introducing on-demand caching Optimise the FCI L1c/netcdf_utils by introducing on-demand variables collection and caching Nov 23, 2022
@djhoese djhoese added enhancement code enhancements, features, improvements component:readers optimization labels Nov 23, 2022
@pnuu (Member) commented Nov 24, 2022

I upgraded my local netCDF4 to the same version as in CI (1.6.2), and the test_netcdf_utils tests pass.

ameraner added a commit to ameraner/satpy that referenced this pull request Nov 25, 2022
…ter during get_dataset

uses the caching trick as discussed in pytroll#2237 (comment)
@pnuu (Member) commented Nov 28, 2022

I got the tests failing locally as well when I run all the reader tests. If I go back to 17db0de, the tests pass.

@pnuu (Member) commented Nov 28, 2022

So:

  • decorating the method works in the tests, but causes a memory leak
  • using the trick in the video or here works when running the code, but causes tests to fail if all the reader tests are run; running only the tests in test_netcdf_utils.py works fine

@pnuu (Member) commented Nov 28, 2022

Going back to the original caching with dictionaries; I don't want to spend too much time wondering why the functools approach causes netCDF4 to break.
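On the memory-leak point: functools.lru_cache applied to a method is a known pitfall, because the cache holds a reference to self and so keeps every instance alive for the lifetime of the process. A per-instance dictionary is collected together with the instance. A minimal sketch (not the satpy code; the class names are made up):

```python
import functools


class LeakyHandler:
    # lru_cache on a method stores `self` as part of the cache key in a
    # cache that outlives the instance, so handlers are never freed.
    @functools.lru_cache(maxsize=None)
    def get_variable(self, name):
        return name.upper()


class DictCachedHandler:
    # A plain per-instance dictionary avoids the leak: the cache dies
    # together with the instance.
    def __init__(self):
        self._cache = {}

    def get_variable(self, name):
        if name not in self._cache:
            self._cache[name] = name.upper()
        return self._cache[name]
```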

@pnuu (Member) commented Nov 28, 2022

I added a couple of tests for get_and_cache_npxr() as @mraspaud requested.

If everything seems fine for merging, please let's do that. We can then continue the work in the next PRs and inspect the caching and possible test side effects on a more stable base (one not being changed all the time by me).

@ameraner (Member, Author) commented Nov 28, 2022

I've run a few more tests, and added a few more usages of get_and_cache_npxr in the code. Here are the results (for the test code above, now on the EWC machine):

                    main    first commit (40d1d67)   custom var collect (c7a4c3d)   expand get_and_cache usage (528ee31)
Scene (s)           6.90    4.08                     3.76                           3.81
load (s)            3.10    3.91                     3.80                           2.70
save_datasets (s)   12.38   12.29                    12.38                          12.34
Total (s)           22.38   20.28                    19.94                          18.85
Memory (MiB)        3070    3058                     2974                           2998

This is for the case of local data, with the maximum memory usage monitored with the profile decorator from memory_profiler.
In summary:

  • the first commit reduces the Scene init by deactivating the caching, and increases loading time as the variables need to be read from scratch. Memory is reduced due to less variable caching.
  • the custom variable collection further reduces the Scene init time by reducing the variables and metadata collection - this is particularly useful for the h5netcdf/remote reading case, as seen in other tests. Memory is reduced due to less variables and metadata collection.
  • the last commit adds more variables to the caching mechanism, most importantly the x/y geolocation 1-d arrays, reducing loading time. Memory is slightly increased again.
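As an aside, a dependency-free way to get a comparable peak-memory figure is the stdlib tracemalloc module (a sketch with a toy workload; this is not what produced the table above, which used memory_profiler):

```python
import tracemalloc


def workload():
    # Stand-in for the Scene creation / load / save_datasets sequence.
    return [bytearray(1024) for _ in range(100)]


tracemalloc.start()
workload()
current_bytes, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"peak traced memory: {peak_bytes / 1024 ** 2:.2f} MiB")
```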

@mraspaud (Member) left a comment

LGTM, thanks a lot for optimising the reader!


Successfully merging this pull request may close these issues.

FCI L1c reader is single threaded
5 participants