
Memory issues loading pp files with default LOAD_POLICY #6404


Description

@david-bentley

🐛 Bug Report

With Iris >= 3.11 and the introduction of LOAD_POLICY/COMBINE_POLICY, loading some pp files from the UM fails with memory errors. Note that the files load as expected with the 'legacy' LOAD_POLICY, but fail as described below with the 'default' LOAD_POLICY.

I've tried running on our compute cluster with 64GB memory and still get the same issue. I can point someone to the files in the Met Office if required.

Script to reproduce:

import glob
import iris

policy = "legacy"  # or "default"
iris.LOAD_POLICY.set(policy)

files = glob.glob("*.pp")

cubes = iris.load(files)
print(cubes)
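As an aside, the same interface can also be used as a context manager, applying a policy only for the duration of a load; a minimal sketch, assuming the LOAD_POLICY.context() form introduced alongside LOAD_POLICY in Iris >= 3.11:

import glob
import iris

files = glob.glob("*.pp")

# Apply the policy just for this load; the previous settings are restored
# automatically when the context exits.
with iris.LOAD_POLICY.context("legacy"):
    cubes = iris.load(files)
print(cubes)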

default LOAD_POLICY

$ time python read.py
...
long traceback
...
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 15.6 GiB for an array with shape (71, 6, 1920, 2560, 2) and data type float32

real  38m48.660s
user  35m16.419s
sys   3m7.086s

legacy LOAD_POLICY

$ time python read.py
0: m01s03i318 / (unknown)              (pseudo_level: 5; time: 13; latitude: 1920; longitude: 2560)
1: high_type_cloud_area_fraction / (1) (time: 37; latitude: 1920; longitude: 2560)
2: land_binary_mask / (1)              (latitude: 1920; longitude: 2560)
3: land_binary_mask / (1)              (latitude: 1920; longitude: 2560)
4: low_type_cloud_area_fraction / (1)  (time: 37; latitude: 1920; longitude: 2560)
5: mass_fraction_of_cloud_ice_in_air / (kg kg-1) (time: 37; model_level_number: 71; latitude: 1920; longitude: 2560)
6: mass_fraction_of_cloud_liquid_water_in_air / (kg kg-1) (time: 37; model_level_number: 71; latitude: 1920; longitude: 2560)
7: medium_type_cloud_area_fraction / (1) (time: 37; latitude: 1920; longitude: 2560)
8: surface_altitude / (m)              (time: 37; latitude: 1920; longitude: 2560)
9: surface_altitude / (m)              (time: 37; latitude: 1920; longitude: 2560)
10: surface_snow_amount / (kg m-2)      (time: 37; latitude: 1920; longitude: 2560)
11: x_wind / (m s-1)                    (time: 37; latitude: 1921; longitude: 2560)
12: y_wind / (m s-1)                    (time: 37; latitude: 1921; longitude: 2560)

real  12m57.740s
user  9m32.780s
sys   2m51.927s

How To Reproduce

Steps to reproduce the behaviour:

  1. Download some UM pp files from the MASS archive:
$ du -ch *pp | grep total
8.0G  total
  2. Read the files:
import glob
import iris

files = glob.glob("*.pp")

cubes = iris.load(files)
print(cubes)
  3. The script is killed with a memory error:
$ time python read.py
...
long traceback
...
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 15.6 GiB for an array with shape (71, 6, 1920, 2560, 2) and data type float32

real  38m48.660s
user  35m16.419s
sys   3m7.086s

Expected behaviour

These are pretty standard pp files coming out of the UM global model forecast, so I would expect them to load successfully with the default LOAD_POLICY/COMBINE_POLICY.

Environment

  • OS & Version: RHEL7
  • Iris Version: >=3.11

Activity

pp-mo (Member) commented on Apr 14, 2025

Status update 2025-04-14

I have recently investigated this and found that the problem relates to multiple orography fields.
It is "cured" by adopting iris.LOAD_POLICY.set("legacy"), as you might expect -- i.e. this gives results equivalent to Iris 3.10.

Unfortunately I haven't (yet) managed to produce a standalone testcase, but am loading a single 500 MB file
(for reference, prodm_op_gl-mn_20250327_06_012.pp)
to reproduce the problem.
This (slightly) reduced testcase has 3 timesteps, and contains an orography field for each timestep, i.e. each with a different time point, though in fact the payload content is (unsurprisingly) identical.

In these cases, the "old" Iris simply ignores the varying orography, producing 3 cubes like this:

>>> print(cubes)
0: mass_fraction_of_cloud_ice_in_air / (kg kg-1) (time: 3; model_level_number: 71; latitude: 1920; longitude: 2560)
1: mass_fraction_of_cloud_liquid_water_in_air / (kg kg-1) (time: 3; model_level_number: 71; latitude: 1920; longitude: 2560)
2: surface_altitude / (m)              (time: 3; latitude: 1920; longitude: 2560)

In these legacy loads, the main data cubes may have a time dimension, but they never have a time-dependent orography.
The separate orography (i.e. "surface_altitude") cubes, however, are time-dependent if time-varying orography appears in the input.
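As a quick sanity check that the time-stamped orography fields really do carry identical payloads, something like the following can be run against the single-file legacy load printed above (a minimal sketch; assumes the cubes variable from that load):

import numpy as np

# Pull out the time-varying surface_altitude cube and confirm that every
# timestep holds the same payload.
orog = cubes.extract_cube("surface_altitude")  # (time, latitude, longitude)
data = orog.data
print(all(np.array_equal(data[0], data[i]) for i in range(1, data.shape[0])))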

(
Aside:
I'm surprised that it is not issuing a "multiple references" warning for this in all cases:
... /iris/fileformats/rules.py:44: IrisUserWarning: Multiple reference cubes for orography
I'm seeing that it does do that if I load the whole original set of files, but not for my reduced single-file case.
There's a distinction here that I don't understand.
)

Summary

  • We do expect the "newer" Iris (version >=3.11) to load this treating the orography itself as time-dependent.
    • What is unexpected is not that it loads differently, but that it fails.
  • This looks like a hitherto unknown concatenate bug, which we need to investigate.
  • Using LOAD_POLICY.set("legacy") cures the problem in these cases:
    • this may in fact be correct + appropriate for these input cases where, in effect, there is duplication in the orography input fields which should not be respected;
    • there may be a case here for altering the default loading behaviour, given that we knowingly adopted a backwards-incompatible approach as "more correct".
trexfeathers (Contributor) commented on May 13, 2025

@pp-mo has explained to me an opportunity for improving chunk awareness in hybrid height coordinates. Regardless of the problems with prodm_op_gl-mn_20250327_06_012.pp, this improvement may become necessary once we start seeing more time-dependent hybrid heights, since those lead to over-large chunk sizes.

My loose understanding

  • The way 'aux factory' coordinates are constructed means they are NOT chunked.
  • This is usually workable since their maximum size is 72 * 1920 * 2560 (height * lat * lon).
  • Adding a time dimension now risks the chunk being larger than available memory (a worked size check follows this list).
  • When we create the final array for the hybrid height coordinate, we need chunking to be calculated by _optimum_chunksize:
    def _optimum_chunksize(
  • Final array creation happens somewhere around here:
    # Build the points array
    nd_points_by_key = self._remap(dependency_dims, derived_dims)
    points = self._derive(
        nd_points_by_key["pressure_at_top"],
        nd_points_by_key["sigma"],
        nd_points_by_key["surface_air_pressure"],
    )
  • Unfortunately array creation is repeated for every type of aux factory, and must also be implemented again if the user defines their own aux factory, so fixing the chunking will need some refactoring so that we can use common code.
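For scale, the failing allocation in the traceback above can be re-derived with simple arithmetic on the reported shape:

# Shape from the MemoryError: (71, 6, 1920, 2560, 2) as float32 (4 bytes each).
nbytes = 71 * 6 * 1920 * 2560 * 2 * 4
print(f"{nbytes / 2**30:.1f} GiB")  # -> 15.6 GiB, matching the traceback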
pp-mo (Member) commented on Jun 4, 2025

WIP Status update 2025-06-04

Fix prototype is currently up on the "concatmem_prob" branch; see the changes there.

  • test/exercise in /home/users/patrick.peglar/worklog/concat_bug/sample_concat_fail.py
  • key change is in _aux_factory.py
    • rework _derive to call a new _inner_calc
    • LATEST made switchable
    • TODO make this a generic solution shared across all factory coord types
  • currently mixed with various debug-related changes
    • _concatenate.py : array hashing debug (including more human-friendly array hash naming)
    • _data_manager.py : no-touch realisation control
    • _cube.py : make derived coords tuple from list instead of iterator
  • also a separate fix proposed in rules.py/_dereference_args() --> use core_data so reference targets can be lazy (see the sketch after this list)
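A minimal sketch of that last point (dereference_lazy is a hypothetical name for illustration, not the actual Iris code; Cube.core_data() is real Iris API):

# Hypothetical helper illustrating the proposed rules.py fix: hand back the
# reference payload lazily instead of realising it up front.
def dereference_lazy(reference_cube):
    # core_data() returns the lazy (dask) array while the cube's data is
    # unrealised, whereas .data would load the whole payload into memory.
    return reference_cube.core_data()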

NOTE : although this problem was initially seen as potentially due to the change in loading policy introducing a "concatenate" step, in practice I don't now think this is the key difference: it is just that the chunks got bigger, since the derived coordinate now varies on the time dimension as well -- i.e. it is the time-varying nature that matters.
Meanwhile, I think that the one-chunk nature of the derived coord has always been there; it only just got bad enough to notice.

ASIDE : there is a lesson here about dask chunking in general.
E.g. we have y * x where the operands broadcast together as (1, N) and (M, 1), and are chunked as (1, N) and (M, 1) -- both quite small individually (e.g. dim coords, single chunks).
But the result can then end up with chunks of (M, N), i.e. a size of M*N, which may be rather huge.
So we need to rechunk to get, say, (Q, N) chunks for a manageable size: and this requires rechunking x from (M, 1) to (Q, 1).
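A minimal sketch of that effect with dask (the sizes are made up purely for illustration):

import dask.array as da

M, N, Q = 100_000, 50_000, 1_000  # illustrative sizes only

# Two small single-chunk operands that broadcast against each other.
y = da.ones((1, N), chunks=(1, N))
x = da.ones((M, 1), chunks=(M, 1))

# The broadcast product comes out as one enormous (M, N) chunk ...
print((x * y).chunks)  # ((100000,), (50000,))

# ... so an operand must be rechunked first to bound the result chunk size.
x_small = x.rechunk((Q, 1))
print((x_small * y).chunks)  # ((1000, 1000, ...), (50000,)) -> (Q, N) chunks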

linked a pull request that will close this issue: Allow derived coord references to be lazy. #6517 on Jun 18, 2025


Participants: @pp-mo, @trexfeathers, @stephenworsley, @david-bentley
