High memory usage generating composites from ABI/AHI #1902

Closed

Plantain opened this issue Nov 29, 2021 · 16 comments

@Plantain

Using the AHI/ABI readers with Himawari/GOES data and producing composites uses in excess of 24GB of RAM, even with a single worker/thread. I suspect this is more than necessary, and probably more than the code needed when it was originally written.

To Reproduce
Run https://gist.github.com/Plantain/18afecfc8f6c049aa8fbc7f92e7d8284 with decompressed Himawari-8 full-disk imagery.
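
For reference, a minimal sketch of this kind of workflow; the data path, chunk size, and output directory here are placeholders and the exact code is in the gist:

import os
os.environ["PYTROLL_CHUNK_SIZE"] = "1024"  # chunk size; the gist sets this as well (set before importing satpy)

from glob import glob
from satpy import Scene

scn = Scene(reader="ahi_hsd", filenames=glob("/data/himawari8/hsd/*FLDK*.DAT"))
scn.load(["true_color"])
new_scn = scn.resample(resampler="native")  # bring all bands to a common grid for the composite
new_scn.save_datasets(base_dir="/tmp")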

Expected behavior
I don't know what an appropriate amount of memory usage is, but I suspect it is considerably less than 24GB for a single worker. I understand Dask is meant to enable chunking of tasks into smaller size components, and something is not working correctly here.

Actual results
Observing memory usage with top shows it consistently using ~24GB of RAM

Environment Info:

  • OS: Ubuntu 20.04
  • Satpy Version: satpy==0.31.1.dev146+g764171f0, also 0.31.0
  • PyResample Version: pyresample==1.22.1
  • Dask Version: dask==2021.11.2 (also tried dask 2021.03)
djhoese added this to To do in PCW Autumn 2021 via automation Nov 29, 2021
djhoese self-assigned this Nov 29, 2021
@Plantain (Author)

See the memory usage plot from the gist (updated to only profile resource usage)

[image: bokeh_plot (memory usage profile)]

@yukaribbba (Contributor) commented Nov 29, 2021

In my experience with GK-2A data, setting PYTROLL_CHUNK_SIZE to 1024 limits the maximum memory usage to around 16GB, and processing is faster as well. I hope this also works for other geostationary satellites.

Also, I remember that a long time ago Satpy didn't have this issue; you could get results quickly with acceptable memory usage.
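
A minimal sketch of that suggestion; note that the environment variable generally needs to be set before Satpy is imported (the scripts in this thread do exactly that):

import os
os.environ["PYTROLL_CHUNK_SIZE"] = "1024"  # must be set before the satpy import below

from satpy import Scene  # readers created from here on should pick up the 1024-pixel chunk size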

@djhoese (Member) commented Nov 29, 2021

@Plantain Can you try passing tiled=True to save_datasets and see if/how that improves your memory?
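
A minimal sketch of that suggestion, continuing from a loaded and resampled Scene (scn) as in the gist; the output directory is a placeholder:

# tiled=True is forwarded to the GeoTIFF writer as a rasterio creation option
scn.save_datasets(base_dir="/tmp", tiled=True)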

@Plantain (Author)

> In my experience with GK-2A data, setting PYTROLL_CHUNK_SIZE to 1024 limits the maximum memory usage to around 16GB, and processing is faster as well. I hope this also works for other geostationary satellites.
>
> Also, I remember that a long time ago Satpy didn't have this issue; you could get results quickly with acceptable memory usage.

If you look at the attached script, that is already set.

@Plantain (Author)

> @Plantain Can you try passing tiled=True to save_datasets and see if/how that improves your memory?

[image: side-by-side memory profile comparison]
The right graph is tiled=False. You can see it uses ~5GB less memory in the final writing phase, but there is no difference in the peak.

@djhoese (Member) commented Nov 30, 2021

@Plantain That is a very good data point to have. This shows very clearly that for some reason dask (or something in this processing) is holding on to the memory before writing. I think I'm seeing the same thing in my own testing, but am still trying to narrow down what is allocating what and trying to understand everything that the dask diagnostics are telling me.

What I'm currently working on and what I have figured out are:

  1. When using dask's CacheProfiler I see that one task is continuously holding on to (caching) tasks as processing continues. This could be a misunderstanding of mine or a bug in the way I set up my test script, like dask needing to hold on to all the individual tasks until it can generate a final result. However (see below) I think I've gotten around that.
  2. Passing mask_space=False to the reader cuts off about 1.5GB of memory when loading a single channel. I think there are some optimizations I could do that would help reduce this, but I don't think it is the main issue.
  3. For some reason trying to generate overview takes almost 4 minutes, whereas generating true_color_nocorr takes a little less than 45 seconds. overview should be incredibly simple compared to the true_color_nocorr composite, especially since I'm skipping the enhancing and writing steps in my script.
  4. Setting the chunk size to 2200 will likely produce the best overall performance, as each segment is 2200 rows high. This reduces the number of "mini" chunks, since 2200 rows with a chunk size of 1024 split into (1024, 1024, 152). It didn't seem to affect overall memory usage for me though.
My Script
import os
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["PYTROLL_CHUNK_SIZE"] = "2200"
from datetime import datetime
from glob import glob
from dask.diagnostics import CacheProfiler, ResourceProfiler, Profiler, visualize
import numpy as np
import dask
import dask.array as da
from satpy import Scene


class PassThroughStore:
    """Array-like target for da.store that writes chunks into a pre-allocated in-memory numpy array (no disk IO)."""

    def __init__(self, shape, dtype):
        self._store = np.empty(shape, dtype=dtype)

    def __setitem__(self, key, value):
        self._store[key] = value


def run():
    # name = "true_color"
    # name = "true_color_nocorr"
    # name = "overview"
    name = 0.65
    scn = Scene(reader="ahi_hsd", filenames=glob("/data/satellite/ahi/hsd/2330/*FLDK*.DAT"), reader_kwargs={"mask_space": False})
    # scn.load([name], calibration="counts", pad_data=False)
    scn.load([name], calibration="counts", pad_data=True)
    new_scn = scn
    # new_scn = scn.resample(resampler="native")
    print(new_scn[name].shape, new_scn[name].data.numblocks)
    final_store = PassThroughStore(new_scn[name].shape, new_scn[name].dtype)
    # new_scn[name].compute()
    # new_scn[name].data[:4000, :4000].compute()
    # block_view = new_scn[name].data.blocks
    # block_view[0, 0].compute()
    # new_scn[name].data.visualize("profile_ahi_visualize_noresample.svg")
    da.store(new_scn[name].data, final_store, lock=False)
    # da.store(new_scn[name].data[:18000, :18000], PassThroughStore(), lock=False)
    print("Done storing")


def run_profiled():
    with dask.config.set(num_workers=4), CacheProfiler() as cprof, ResourceProfiler(0.4) as rprof, Profiler() as prof:
        run()
    filename = f"profile_ahi_{datetime.now():%Y%m%d_%H%M%S}.html"
    visualize([prof, rprof, cprof], filename=filename, show=False)
    cwd = os.getcwd()
    print("file://" + os.path.join(cwd, filename))


run_profiled()
# run()

@simonrp84 (Member)

Just a small addition: the IR segments are 550 lines; only band 3 has 2200-line segments. So maybe setting a chunk size of 550 would be more optimal, as that can be shared by all bands?

@djhoese (Member) commented Nov 30, 2021

I've spent a lot of today updating my test script so I can be a little more flexible with what I load and so I can save various profile HTML files or performance reports. I've decided to switch to the full true_color composite to see if I can identify issues. I still haven't done the update to how the ahi_hsd reader uses memmap, but I'm getting there. I got distracted looking at the task graphs for AHI and ABI true_color generation. In one of @Plantain's plots on Slack it looked like the memory was increasing until it started saving to disk. Running my own tests, which don't include writing, I noticed that this reduction in memory usage seems to line up with when dask has finished computing all the chunks for the various angles:

[image: dask task stream and resource profile]

The first 80% of that task stream (top plot) is the angle generation. After that it starts generating/loading the band data from the input files. It seems that once it does that it can finally start dropping some of the angle information, as it has actually been applied to the data and is no longer needed. What's really weird though is that I then ran the same thing but forced the input data chunks so there should have been one chunk per file (each file is a segment). With that, the profiles now look like this:

[image: dask task stream and resource profile with forced single-chunk inputs]

So it took much less time and used much less memory, and the memory doesn't continuously increase; it goes up and down. My best guesses for why this is are:

  1. We are giving the dask scheduler a better problem to solve. Instead of being saturated by a ton of angle-based tasks, it is able to more easily decipher the dependencies and schedule them in a more optimal order.
  2. Somehow these single-chunk cases make the memory mapping behave much more nicely and something 🪄 magical happens.

Still not completely sure, but I do want to point out that the situation in the first screenshot is the same thing that is happening with the ABI reader, just not as much memory is used.
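
For reference, a hedged sketch of inspecting the chunk layout when the chunk size is matched to the segment height, along the lines of the script above (whether this actually yields one chunk per file depends on the band and the reader internals):

import os
os.environ["PYTROLL_CHUNK_SIZE"] = "2200"  # band 3 segment height; set before importing satpy

from glob import glob
from satpy import Scene

scn = Scene(reader="ahi_hsd", filenames=glob("/data/satellite/ahi/hsd/2330/*FLDK*.DAT"))
scn.load([0.65])
print(scn[0.65].data.chunks)  # check the resulting chunk layout before computing anything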

@djhoese (Member) commented Nov 30, 2021

So I think my local changes have improved this, and we have some other small things we could do. As mentioned above, I started noticing that dask was scheduling all the angle-related operations first and wasn't even getting to the other calculations (loading the data from the input files) until all (or almost all) of them were finished. So I played around a bit and tried hacking the modifiers so the rayleigh correction would just apply an adjustment of - 0 to the data and the sunz correction was just / 1. Since this cut out the use of any of the angles, the memory usage went down to ~3GB and the code finished in ~30 seconds. Then I re-enabled the sunz correction and memory usage went up to ~10GB peak and ~60 seconds to complete.

I then refactored the angle generation after I noticed that the cos(sza) generated for the sunz correction wasn't using the same angle generation that the rayleigh correction uses. This didn't really change much (and it shouldn't) with my current case of only having the sunz correction and no rayleigh. So then I updated the rayleigh correction to not call pyspectral but still use the angles: refl_cor_band = sunz * satz * ssadiff * 0. This shot the memory usage up to ~30GB at peak with the familiar pyramid of memory usage. Then I updated rayleigh again to call pyspectral's actual code and this brought the peak memory usage up to ~35GB in ~250s. This is pretty good considering that the last time I ran this previously I got ~50GB peak memory in ~300s.

The increased memory usage overall makes sense to me. You're including 3 additional angle arrays (sat azimuth, sat zenith, solar azimuth) and the rayleigh correction uses all the angles + the red band + the data being corrected in the same function. That's a lot.

TODO:

  1. Why did only having the sunz correction still cause all the angles to be calculated before getting to the data loading? Is this really just a worst-case scenario for dask's scheduler?
  2. Look into map_blocks'ing parts of the rayleigh correction so it can maybe use less memory or depend on fewer input arrays at once (see the rough sketch below).
  3. Look into a possible duplicate masking done in the RatioSharpeningCompositor, since the GenericCompositor (its parent) already does something similar.

Edit: I'll try to make a pull request later tonight.
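
A rough, hypothetical sketch of the map_blocks idea in TODO item 2; the arrays and the correction math are placeholders (the real correction lives in pyspectral):

import dask.array as da

# Dummy stand-ins for the band being corrected and the three angle arrays,
# all assumed to share the same shape and chunking.
band_data = da.random.random((5500, 5500), chunks=(1100, 5500))
sunz = da.random.random((5500, 5500), chunks=(1100, 5500))
satz = da.random.random((5500, 5500), chunks=(1100, 5500))
ssadiff = da.random.random((5500, 5500), chunks=(1100, 5500))

def _correct_block(data, sunz, satz, ssadiff):
    # Placeholder math standing in for the real rayleigh correction.
    refl_cor_band = sunz * satz * ssadiff * 0
    return data - refl_cor_band

# Each output chunk only needs the matching chunk of each input, so dask can
# release angle chunks as soon as they have been consumed.
corrected = da.map_blocks(_correct_block, band_data, sunz, satz, ssadiff, dtype=band_data.dtype)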

@djhoese (Member) commented Dec 1, 2021

As GitHub shows, I made #1909 and #1910, which reduce the memory usage a bit. I've now made the biggest improvement by changing how lons/lats are generated in pyresample. It completes in about the same amount of time, but I went from a ~30-35GB memory peak to ~5.5GB.

[image: resource profile after the pyresample lon/lat change]

The main idea with the fix (I'll be making a pull request this afternoon) is that pyresample generates lons/lats for an AreaDefinition by first generating the x/y coordinate vectors (so 1D), then using np.meshgrid to generate two 2D arrays from those, representing the X and Y coordinates for each pixel of the 2D AreaDefinition. It then passes these to pyproj to be transformed from x/y projection space to lon/lat space. The issue is/was that this makes dask think that all the chunks depend on the same np.arange that was used to generate the projection vectors. And since these arange results are then passed to meshgrid, you get a big series of connections in the dask task graph. In principle this doesn't seem to cause any connections that aren't accurate, but it does make it harder for dask to know if it can continue down a specific thread of processing, because it thinks it needs to hold on to all these arange and x/y arrays and that they are somehow associated with all of the other ones.
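
A rough illustration (not the actual pyresample code; the sizes and chunking are made up) of the coordinate-generation pattern described above:

import dask.array as da

# 1D projection coordinate vectors, analogous to the arange-based x/y vectors.
x = da.arange(5500, chunks=1024, dtype=float)
y = da.arange(5500, chunks=1024, dtype=float)

# meshgrid expands them to 2D, so every chunk of xx/yy traces back to the same
# handful of arange tasks in the dask graph...
xx, yy = da.meshgrid(x, y)

# ...and anything computed from them (for example the pyproj x/y -> lon/lat
# transform) inherits those shared dependencies, which encourages the scheduler
# to hold on to the intermediate arrays for the whole computation.
lons = xx * 0.0  # placeholder for the actual pyproj transform
lats = yy * 0.0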

I'm going to eat some lunch, do some meetings, and then make a pull request on pyresample for this.

WARNING: I have not actually tried generating geotiffs with the changes I've made and have no idea if they are even still correct...but none of the tests fail.

@pnuu (Member) commented Dec 1, 2021

Ooh, that's remarkable, let's hope it works all the way!

@mraspaud (Member) commented Dec 1, 2021

That last improvement is really nice, and will provide a boost to all area-based data!

@djhoese (Member) commented Dec 3, 2021

Here's what I get with my newest update to the angles PR in satpy and my get_proj_coords PR in pyresample when generating an AHI true_color all the way to a tiled geotiff (260-290 seconds) with 4 workers on my laptop:

[image: AHI true_color resource profile]

Here's ABI with the same situation (<140s):

[image: ABI true_color resource profile]

@djhoese (Member) commented Dec 3, 2021

@Plantain Do you have a plot of what it looked like with ABI before my changes?

@djhoese (Member) commented Dec 3, 2021

Never mind, I switched my pyresample branch and here's what ABI used to look like:

[image: ABI resource profile before the changes]

What previously used ~17GB now only uses ~6GB.

@djhoese (Member) commented Dec 6, 2021

Pyresample 1.22.2 was released on Friday and includes the major improvements shown here. We plan to release Satpy this week, which will include the other, smaller performance improvements. Closing this as those Satpy improvements are now in the satpy main branch.
