Preload (FCI) filehandlers for eager processing #2686
base: main
Conversation
First beginning of work to preload filehandlers before the files are present. Not much implementation yet; just a skeleton of what it might look like in the YAMLReader.
Codecov Report

```diff
@@            Coverage Diff             @@
##             main    #2686      +/-   ##
==========================================
+ Coverage   95.88%   95.94%   +0.05%
==========================================
  Files         371      373       +2
  Lines       52835    53430     +595
==========================================
+ Hits        50663    51265     +602
+ Misses       2172     2165       -7
==========================================
```
Pull Request Test Coverage Report for Build 7933498151

Warning: this coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch, so it may include changes from outside the original pull request, including potentially unrelated coverage changes.
Prepare GEOSegmentYAMLReader for preloading filehandlers for files corresponding to segments that don't exist yet. Early draft implementation that appears to work, with limitations. The implementation in GEOSegmentYAMLReader still needs tweaking, tests need to be improved, and the corresponding file handler (for now just FCI) needs to be able to handle it.
Continue the pre-loading implementation in GEOSegmentYAMLReader. Add unit tests.
Don't raise an error that we can't predict the remaining files if this functionality was not requested.
Add a Preloadable class to netcdf_utils. This so far implements pre-loading filehandlers for expected-but-absent files when a single one already exists, taking a defined set of data variables and their attributes from the first segment. Still to be implemented: taking other information from other repeat cycles, via on-disk caching.
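The idea of seeding later segments from the first one might look roughly like this (a sketch against a plain mapping stand-in for an open NetCDF dataset; the variable name is made up):

```python
from types import SimpleNamespace


def collect_shared_info(first_segment, variable_names):
    """Collect values and attributes of selected variables from the first
    available segment, so that file handlers for segments that have not
    arrived yet can be seeded with them.

    `first_segment` is any mapping of variable name to an object exposing
    `value` and `attrs`; a real implementation would read these from the
    open NetCDF file instead.
    """
    return {name: {"value": first_segment[name].value,
                   "attrs": dict(first_segment[name].attrs)}
            for name in variable_names}


# hypothetical example segment
segment = {"grid_type": SimpleNamespace(value="2km", attrs={"long_name": "grid type"})}
shared = collect_shared_info(segment, ["grid_type"])
```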
Cache data variables between repeat cycles.
Continue working on repeat-cycle caching.
The good news is that my test script now passes without errors. The bad news is that the resulting image has no data / is all black.
Side request as mentioned on Slack: move the "does NetCDF file exist" functionality to its own function that is delayed. Hold on to the result of this in your file handler(s). Pass the result of that (the filename, I think) to the _get_delayed_value function, which is also delayed, and then go on as you are. This way the globbing and sleeping only has to happen once.
There is something wrong with the values and the calibration. For one test case, for one pixel, loading
When preloading, unset automaskandscale.
The bug with the wrong values was because automaskandscale was being applied automatically by the NetCDF library. After unsetting this, the values seem to match the reference. This image was created with scene creation, loading, and construction of the resampling graph all happening before any data were accessed: processing took 28.4 seconds, 1.89 GB RAM, and 121% CPU. Classical processing took 19.0 seconds, 2.03 GB RAM, and 73% CPU. Now that it's functional, I will work on adding tests and improving the implementation.
Very nice! How long did it take after the last segment had "arrived"?
Split the delayed loading from file into one delayed function that waits for the file and another that loads a variable from the file.
Add tests for the functionality to wait for a file to appear in a delayed object.
Add a test method to test getting a delayed value from a (delayed) file
Progress on unit tests for the purposes of delayed loading.
I haven't done a full test including a realistic simulation of delayed segment arrival yet. But when I skip computing, it takes 19 seconds with 421 MB RAM.
Add more tests for the preloadable mixin in netcdf_utils. Clean up some unused code and add checks/guards in this mixin.
Improve tests for the yaml reader in case of preloading file handlers. Tests fail.
Improve the tests for the YAML reader and the NetCDF utils. Tolerate absence of time_tags and variable_tags for preloadables. Verify presence of required_netcdf_filehandlers on creation rather than on caching time.
Using datetime.min for an artificial date leads to different strftime results between platforms (see https://bugs.python.org/msg307401). Use datetime.max instead.
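For reference, the platform issue only bites for very small years; a quick check of the chosen sentinel:

```python
import datetime

# datetime.min (year 1) gives platform-dependent strftime output, because
# some C libraries zero-pad the year to four digits and others don't
# (see https://bugs.python.org/msg307401). datetime.max (year 9999) is
# formatted identically everywhere.
sentinel = datetime.datetime.max
stamp = sentinel.strftime("%Y%m%d%H%M%S")  # → '99991231235959'
```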
Simplify _new_filehandler_instances. Hopefully this will satisfy CodeScene. Replace "/dev/null" with os.devnull for cross-platform support.
In test waiting for file to appear, wait longer. Maybe this will fix the "file not found" on 3.11 problem.
Using this "in the wild" I get HDF5 errors and segmentation faults after files appear. Maybe too many open files (the old problem).
I haven't looked at the code, but are you maybe only waiting for existence of the file? My guess would be it exists, but isn't finished being written.
I am, and it sounds plausible, but... I get those error messages even if the files are already there and thus found immediately. But I don't get a segfault, and an image is produced despite the barrage of error messages reported by HDF5-DIAG. But in any case it would be good to look for the same type of events that trollstalker does. When it does work (in a test where I manually copy over files), the dask graph looks like this, which might be expected, although it doesn't seem to do a lot before it has the last files that it needs:
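A crude way to approximate "writer finished" without filesystem events (a stdlib sketch; the inotify CLOSE_WRITE events that trollstalker listens for would be more robust):

```python
import os
import time


def wait_until_stable(path, poll=1.0, settle=2):
    """Wait until the file exists and its size has stopped changing for
    `settle` consecutive polls, as a rough proxy for the writer having
    finished with it.
    """
    while not os.path.exists(path):
        time.sleep(poll)
    last_size = -1
    stable = 0
    while stable < settle:
        size = os.path.getsize(path)
        stable = stable + 1 if size == last_size else 0
        last_size = size
        time.sleep(poll)
    return path
```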
😨 Is this an HDF5 multiple reader limitation maybe?
The HDF5-DIAG problem might be not directly related to this issue. https://stackoverflow.com/a/75899745/974555
The HDF5-diag barrage vanishes if I downgrade netcdf4, but the segfaults still happen. They're not due to incomplete files, because they also occur if I link rather than copy files into the target directory. I seem to recall working on the segfault problem in May 2019 in Madison and then it was due to too many open files, but pydata/xarray#2954 was fixed by pydata/xarray#3082 so that can't be the explanation.
Maybe... it seems four workers all find the same file simultaneously and then I get the crash:
No segfault with
Use xarray rather than netCDF4 for retrieving the variable name. Maybe this will help to resolve the concurrency problem.
Still getting segfaults when using
It's when trying to get the lock that it segfaults. https://github.com/pydata/xarray/blob/main/xarray/backends/locks.py#L161-L163
I have not yet tested the non-nominal case, such as when a segment fails to appear. I think the entire repeat cycle would fail, which is not optimal.
When trying this out with data as they come in, it turns out the distribution of dask tasks to workers is not optimal. The distribution of the 40 "wait for files to appear" tasks over the workers is not consecutive, and appears non-deterministic. This means that workers may be occupied waiting for chunks 33–40, while nobody is processing chunks 1–32 until after 33–40 have been processed, etc. I should check how we can prioritise the scheduling. One workaround may be to use 40 workers. In this case, the first image is ready 20 seconds after the last chunk has arrived. The last image is ready after 2 minutes 31 seconds.
Write about limitations in the documentation. Adapt how things are logged.
Unless I install netcdf4=1.5.8, I get many HDF5_DIAG errors and segmentation faults. With netCDF4=1.5.8 (which implies Python 3.10) operational processing appears to work.
Have you found a workaround for this? I believe dask should be considering these in the order that they are defined, but I could be wrong about that if dask sees them as all happening at the same time. I also think there are ways to define priorities to tasks, but this may only be in the "futures" or distributed workflows.
The only workaround I've found is to run with 40 workers.
I think we should ask the dask community about this. We can't be the first ones to stumble on this.
Related reading for getting the "waiting for external files" tasks done in the right order; it might be possible to use "secede" and "rejoin" (only with the distributed scheduler?):
https://stackoverflow.com/questions/56222869/waiting-for-external-dependencies-in-dask
https://docs.dask.org/en/latest/futures.html
https://distributed.dask.org/en/stable/task-launch.html#getting-the-client-on-a-worker
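A sketch of the secede/rejoin approach from those links (distributed scheduler only; the pattern and poll interval are made up):

```python
import glob
import time

from distributed import secede, rejoin


def wait_for_file(pattern, poll=10):
    """Poll for a file on a distributed worker without hogging a slot.

    secede() removes this task's thread from the worker's pool while it
    sleeps, so other tasks (e.g. processing chunks that already arrived)
    can run; rejoin() re-enters the pool before returning.
    """
    secede()
    while not (matches := glob.glob(pattern)):
        time.sleep(poll)
    rejoin()
    return matches[0]
```

This would make the 40 waiting tasks effectively free while they sleep, sidestepping the worker-count workaround.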
Fixing a merge conflict in test_readers.py
Refactor the unit test test_preloaded_instances into multiple unit tests testing different cases. This should hopefully improve the code health as measured by CodeScene.
Refactor _wait_for_file to reduce the cyclomatic complexity
Refactor to simplify conditional logic.
The remaining
Great work! I have a few questions and comments, but looks good overall.
```diff
@@ -1317,6 +1338,129 @@ def _get_new_areadef_heights(self, previous_area, previous_seg_size, **kwargs):
         return new_height_proj_coord, new_height_px

+    def _new_filehandler_instances(self, filetype_info, filename_items, fh_kwargs=None):
```
Could all these new methods be put into a new subclass? Might help with SOC.
I started that way at some point, but was not sure how to proceed. Suppose we have GEOPreloadableSegmentYAMLReader (GPSYR) as a subclass of GEOSegmentYAMLReader (GSYR); then some questions arise:
- Would we change the five readers that currently use GSYR directly to use GPSYR instead, or only those where we know pre-loading works? Probably only those where it works.
- How do we combine GPSYR with GEOVariableSegmentYAMLReader (GVSYR)? Does GVSYR subclass GPSYR, suggesting that every GVSYR supports pre-loading? Currently this is true (fci_l1c_nc is the only GVSYR), but perhaps not in the future. Or are GVSYR and GPSYR both subclasses of GSYR, and then a dedicated GEOPreloadableVariableSegmentYAMLReader (GPVSYR) would subclass both GVSYR and GPSYR, leading to diamond inheritance? I prefer to avoid the diamond problem. Or is the preloadable yaml reader a mixin that doesn't subclass GSYR at all?

I couldn't decide and concluded that I can run away and avoid those questions by not creating a class at all 😄
```python
class Preloadable:
    """Mixin class for pre-loading.

    This is a mixin class designed to use with file handlers that support
    preloading. Subclasses deriving from this mixin are expected to be
    geo-segmentable file handlers, and can be created based on a glob pattern
    for a non-existing file (as well as a path or glob pattern to an existing
```
Should this class be called e.g. PreloadableSegments to make the main use case clear?
```python
    list(g)


def test_get_cache_filename(tmp_path):
```
Could this test be split in multiple smaller ones?
```python
def __init__(self, *args, **kwargs):
    """Initialise object."""
    self.preload = kwargs.pop("preload", False)
    super().__init__(*args, **kwargs)
```
I'm wondering if we should use satpy's config system for this instead of adding yet another keyword argument. In my mind, a config setting would make sense, as I don't think we would want to mix preloadable and non-preloadable data.
Add functionality to preload filehandlers for files that have not yet arrived on disk. This works by passing a single file (the first chunk) and then it will generate glob patterns for all remaining chunks.
It seems to work for loading only FDHSI data.
Deferred: support for mixing FDHSI and HRFI. This PR is ambitious enough as it is, and I would like to postpone that complication to a later PR.
Remaining problem: segmentation faults due to different dask workers simultaneously opening the same file.
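The "generate glob patterns for all remaining chunks" step can be sketched like this (the `{segment:04d}` naming scheme is hypothetical, not the real FCI filename pattern):

```python
def remaining_patterns(template, present, total=40):
    """Build glob patterns for every segment still expected, given a
    filename template with a zero-padded segment counter and the set of
    segment numbers already on disk."""
    return [template.format(segment=i)
            for i in range(1, total + 1)
            if i not in present]


pats = remaining_patterns("scene_chunk_{segment:04d}_*.nc", {1}, total=3)
# → ['scene_chunk_0002_*.nc', 'scene_chunk_0003_*.nc']
```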