Optimise the FCI L1c/netcdf_utils by introducing on-demand variables collection and caching #2237
Conversation
Testing code snippet, using one RC of data from the latest test data (download link from the EUM sftp):

```python
import time
import warnings
from glob import glob

import numpy as np

from satpy import Scene

warnings.filterwarnings("ignore", category=UserWarning, module=r'.*crs')
warnings.filterwarnings("ignore", category=RuntimeWarning, module=r'.*dask')

# debug_on()

path_to_testdata = 'path_to_test_data/RCRC0072/'

scn_time = []
load_time = []
save_time = []
total_time = []

channels = [
    'vis_06',
    'vis_08',
    'ir_105',
    'ir_87',
]

n = 5
for i in range(n):
    print(i)
    scene_start_time = time.time()
    scn = Scene(filenames=glob(path_to_testdata + "*BODY*.nc"), reader='fci_l1c_nc')
    scn_time.append(time.time() - scene_start_time)
    print(f"Done. It took {time.time() - scene_start_time:.2f}s to start Scene.")
    load_start_time = time.time()
    scn.load(channels, upper_right_corner='NE')
    load_time.append(time.time() - load_start_time)
    print(f"Done. It took {time.time() - load_start_time:.2f}s to load datasets.")
    save_start_time = time.time()
    scn.save_datasets()
    save_time.append(time.time() - save_start_time)
    print(f"Done. It took {time.time() - save_start_time:.2f}s to save datasets.")
    print(f"Done. It took {time.time() - scene_start_time:.2f}s for all ops.")
    total_time.append(time.time() - scene_start_time)
    del scn

print("*" * 50)
print(f"Done. It took {np.mean(scn_time):.2f}s to start Scene.")
print(f"Done. It took {np.mean(load_time):.2f}s to load datasets.")
print(f"Done. It took {np.mean(save_time):.2f}s to save datasets.")
print(f"Done. It took {np.mean(total_time):.2f}s for all ops.")
```
From the code above, the average stats for loading and saving two VIS and two IR channels are:
after this PR:
So we shave some time off the Scene initialisation thanks to the skipped caching, loading the datasets takes a bit longer (the variables need to be fetched now), and overall we save a couple of seconds.
Note that almost all of the remaining Scene initialisation time now comes from the file opening in `satpy/readers/netcdf_utils.py` (line 100 at e7d24a3).
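If the file open dominates initialisation, one way to chip away at it would be to defer the open until a variable is first requested. A minimal sketch under stated assumptions (the `LazyNCFile` class and `opener` callable are hypothetical illustrations, not satpy code):

```python
# Hypothetical sketch (not satpy code): defer the expensive file open until
# a variable is first requested, so constructing the handler stays cheap.

class LazyNCFile:
    def __init__(self, filename, opener):
        self._filename = filename
        self._opener = opener  # e.g. netCDF4.Dataset in a real reader
        self._handle = None

    @property
    def handle(self):
        if self._handle is None:  # open only on first access
            self._handle = self._opener(self._filename)
        return self._handle

    def get_variable(self, name):
        return self.handle[name]
```

With a counting fake opener one can verify the file is opened exactly once, and only when a variable is actually read.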
Codecov Report
```diff
@@            Coverage Diff             @@
##             main    #2237      +/-   ##
==========================================
+ Coverage   94.13%   94.35%   +0.21%
==========================================
  Files         293      310      +17
  Lines       45079    46554    +1475
==========================================
+ Hits        42437    43926    +1489
+ Misses       2642     2628      -14
```
I'm not happy with how I implemented the channel name replacement/expansion, because it's completely based on the FCI reader and file structure. I think it should be done in a more generic way, maybe with a dictionary-based "replace the key with each item in the given list" kind of approach.
Can you use python format strings? |
Oh yeah, |
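The format-string suggestion could look roughly like this sketch (the `expand_file_keys` name, placeholder name, and example paths are hypothetical, not the actual reader implementation):

```python
# Hypothetical sketch: expand templated variable paths from a reader YAML
# with str.format, instead of hard-coding FCI-specific name replacement.

def expand_file_keys(templates, channels):
    """Expand '{channel}' placeholders in each template path."""
    expanded = []
    for template in templates:
        if "{channel}" in template:
            expanded.extend(template.format(channel=ch) for ch in channels)
        else:
            expanded.append(template)  # no placeholder: keep as-is
    return expanded

templates = [
    "data/{channel}/measured/effective_radiance",
    "attr/platform",
]
print(expand_file_keys(templates, ["vis_06", "ir_105"]))
# ['data/vis_06/measured/effective_radiance',
#  'data/ir_105/measured/effective_radiance', 'attr/platform']
```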
Looks good, thanks a lot for the speedup! The caching function doesn't seem to be well tested according to codecov? Otherwise, just a couple of questions/comments.
I upgraded my local
…ter during get_dataset uses the caching trick as discussed in pytroll#2237 (comment)
The tests also fail locally when I run all the reader tests; if I go back to 17db0de, they pass.
So:
Going back to the original caching with dictionaries; I don't want to spend too much time wondering why the
I added a couple of tests. If everything seems fine for merging, let's do that. We can then continue working on the next PRs and inspect the caching and possible test side effects on a more stable base (one not being changed all the time by me).
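One way to rule out caching side effects between tests is to make the cache explicitly resettable, so every test can start from a clean state. A hypothetical sketch (the `CachingHandler` class and `clear_cache` method are illustrations, not the actual satpy test setup):

```python
# Hypothetical sketch (not satpy code): a per-handler cache that tests can
# reset, so cached entries cannot leak from one test into the next.

class CachingHandler:
    def __init__(self, loader):
        self._loader = loader  # callable doing the real (slow) variable read
        self._cache = {}

    def get(self, key):
        if key not in self._cache:  # load once, then serve from cache
            self._cache[key] = self._loader(key)
        return self._cache[key]

    def clear_cache(self):
        """Call from test setup/teardown to avoid cross-test side effects."""
        self._cache.clear()
```

A counting loader makes the caching behaviour easy to assert on: two `get` calls hit the loader once, and after `clear_cache` the next `get` reloads.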
I've run a few more tests, and added a few more usages of
This is for the case of local data, with the maximum memory usage monitored with
LGTM, thanks a lot for optimising the reader!
A first attempt at optimising the NetCDF4FileHandler for faster FCI L1c loading.

Idea: skip the size-based caching at Scene initialisation (`cache_var_size=0`) and instead get and cache variables as numpy-backed xarrays when needed. In addition, only the required variables/attributes (those listed in the reader YAML) are collected. This reduces the number of collected variables/attributes from 93880 to 2070.

This PR also includes the results of a lot of tests aimed at solving the single-threadedness reported in #2186, so we'll be closing that issue with this PR:
- creating file handlers in parallel: slower than single-threaded
- reading datasets (the `scn.load()` call) in parallel: slower than single-threaded

Closes #2186 (FCI L1c reader is single threaded)
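The on-demand approach described above can be sketched roughly as follows (a simplified illustration, not the actual `NetCDF4FileHandler` code; the class name and the dict standing in for file contents are hypothetical):

```python
# Hypothetical sketch: collect only the variable names the reader config asks
# for, and load/cache each variable on first access instead of caching
# everything up front at Scene initialisation.
import numpy as np

class OnDemandNCReader:
    def __init__(self, file_contents, required_names):
        # keep only the entries the reader YAML actually lists
        self._available = {k: v for k, v in file_contents.items()
                           if k in required_names}
        self._cache = {}

    def __getitem__(self, name):
        if name not in self._cache:  # load on demand, then reuse
            self._cache[name] = np.asarray(self._available[name])
        return self._cache[name]

fake_file = {"vis_06/radiance": [1.0, 2.0], "unused/var": [0.0]}
reader = OnDemandNCReader(fake_file, required_names={"vis_06/radiance"})
print(reader["vis_06/radiance"])  # loaded and cached on first access
```

The filtering step is what shrinks the collected variables/attributes (93880 down to 2070 in the PR description); the lazy `__getitem__` is what moves the cost from Scene initialisation to dataset loading.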
- [x] Tests added
- [x] Fully documented