
jupyter repr caching deleted netcdf file #4240

Closed
michaelaye opened this issue Jul 21, 2020 · 9 comments · Fixed by #4879

Comments

@michaelaye

What happened:

While testing xarray data storage in a Jupyter notebook with varying data sizes, saving to netCDF each time, I noticed that open_dataset/open_dataarray (both show this behaviour) continue to return data from the first testing run, ignoring the fact that each run deletes the previously created netCDF file.
This only happens once the repr was used to display the xarray object.
But once in this error mode, even objects that previously printed fine show the wrong data.

This was hard to track down, as it depends on the precise sequence of cells in Jupyter.

What you expected to happen:

When I use open_dataset/open_dataarray, the resulting object should reflect the current state of the file on disk.

Minimal Complete Verifiable Example:

import xarray as xr
from pathlib import Path
import numpy as np

def test_repr(nx):
    # Create a random array of length nx, remove any leftover file from a
    # previous run, and write the array to a fresh netCDF file.
    ds = xr.DataArray(np.random.rand(nx))
    path = Path("saved_on_disk.nc")
    if path.exists():
        path.unlink()
    ds.to_netcdf(path)
    return path

When executed in a cell with print for display, all is fine:

test_repr(4)
print(xr.open_dataset("saved_on_disk.nc"))
test_repr(5)
print(xr.open_dataset("saved_on_disk.nc"))

but as soon as one cell uses the Jupyter repr:

xr.open_dataset("saved_on_disk.nc")

all future file reads show the data from the last repr use, even after executing the test function again, and even when using print rather than the repr.

Anything else we need to know?:

Here's a notebook showing the issue:
https://gist.github.com/05c2542ed33662cdcb6024815cc0c72c

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.6 | packaged by conda-forge | (default, Jun 1 2020, 18:57:50)
[GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-40-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.16.0
pandas: 1.0.5
numpy: 1.19.0
scipy: 1.5.1
netCDF4: 1.5.3
pydap: None
h5netcdf: None
h5py: 2.10.0
Nio: None
zarr: None
cftime: 1.2.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.1.5
cfgrib: None
iris: None
bottleneck: None
dask: 2.21.0
distributed: 2.21.0
matplotlib: 3.3.0
cartopy: 0.18.0
seaborn: 0.10.1
numbagg: None
pint: None
setuptools: 49.2.0.post20200712
pip: 20.1.1
conda: installed
pytest: 6.0.0rc1
IPython: 7.16.1
sphinx: 3.1.2

michaelaye changed the title from "jupyter repr caching deleted file netcdf file" to "jupyter repr caching deleted netcdf file" on Jul 21, 2020
@shoyer
Member

shoyer commented Jul 25, 2020

Thanks for the clear example!

This happens due to xarray's caching logic for files:

class CachingFileManager(FileManager):
    """Wrapper for automatically opening and closing file objects.

    Unlike files, CachingFileManager objects can be safely pickled and passed
    between processes. They should be explicitly closed to release resources,
    but a per-process least-recently-used cache for open files ensures that you
    can safely create arbitrarily large numbers of FileManager objects.

    Don't directly close files acquired from a FileManager. Instead, call
    FileManager.close(), which ensures that closed files are removed from the
    cache as well.

    Example usage:

        manager = FileManager(open, 'example.txt', mode='w')
        f = manager.acquire()
        f.write(...)
        manager.close()  # ensures file is closed

    Note that as long as previous files are still cached, acquiring a file
    multiple times from the same FileManager is essentially free:

        f1 = manager.acquire()
        f2 = manager.acquire()
        assert f1 is f2
    """

This means that when you open the same filename, xarray doesn't actually reopen the file from disk -- instead it points to the same file object already cached in memory.

I can see why this could be confusing. We do need this caching logic for files opened from the same backends.*DataStore class, but we could include some sort of unique identifier (e.g., from uuid) to ensure each separate call to xr.open_dataset results in a separately cached/opened file object:

manager = CachingFileManager(
    netCDF4.Dataset, filename, mode=mode, kwargs=kwargs
)
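
For illustration, here is a minimal self-contained sketch of that idea; the class name, cache dict, and method names below are stand-ins, not xarray's actual implementation:

import uuid

# Stand-in for xarray's global, per-process LRU cache of open files.
_FILE_CACHE = {}

class UniqueKeyFileManager:
    """Sketch: each manager instance carries a unique token, so two
    managers opening the same filename get separate cache entries."""

    def __init__(self, opener, filename, **kwargs):
        self._opener = opener
        self._filename = filename
        self._kwargs = kwargs
        self._id = uuid.uuid4()  # fresh token per open_dataset call

    def _key(self):
        # Because the token is part of the key, a second manager for the
        # same path never reuses the first manager's cached file object.
        return (self._filename, self._id)

    def acquire(self):
        key = self._key()
        if key not in _FILE_CACHE:
            _FILE_CACHE[key] = self._opener(self._filename, **self._kwargs)
        return _FILE_CACHE[key]

    def close(self):
        f = _FILE_CACHE.pop(self._key(), None)
        if f is not None:
            f.close()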

@michaelaye
Author

Is there a workaround to force reopening the file without restarting the notebook?

@michaelaye
Author

Now I'm wondering why the caching logic is only triggered by the repr. As you can see, when printed, the object always updated to the state on disk.

@shoyer
Member

shoyer commented Jul 25, 2020

Probably the easiest workaround is to call .close() on the original dataset. Failing that, the file is cached in xarray.backends.file_manager.FILE_CACHE, which you could muck around with.

I believe it only gets activated by repr() because array values from a netCDF file are loaded lazily. Not 100% sure without more testing, though.
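
For illustration, both workarounds in one snippet (a sketch, reusing the example file from above):

import xarray as xr

# Workaround 1: close the original dataset so its cached file handle is
# released before reopening.
ds = xr.open_dataset("saved_on_disk.nc")
ds.close()

# Workaround 2: clear xarray's global file cache, which forces the next
# open to re-read the file from disk.
xr.backends.file_manager.FILE_CACHE.clear()

ds = xr.open_dataset("saved_on_disk.nc")  # now reflects the file on disk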

@markusritschel

Would it be an option to consider the time stamp of the file's last change as a caching criterion?
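
For illustration, a sketch of what such a criterion might look like; cache_key here is hypothetical, not part of xarray:

import os

def cache_key(path: str) -> tuple:
    # Hypothetical: fold the file's last-modification time into the
    # cache key, so replacing the file on disk automatically misses the
    # stale cache entry and triggers a fresh open.
    return (path, os.path.getmtime(path))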

@kmuehlbauer
Contributor

kmuehlbauer commented Jan 21, 2021

I've stumbled over this weird behaviour many times and wondered why it happens. So AFAICT @shoyer hit the nail on the head, but the root cause is that the Dataset is somehow added to the notebook namespace if one just evaluates it in a cell.

This doesn't happen if you invoke the __repr__ via

display(xr.open_dataset("saved_on_disk.nc"))

I've forced myself to use either print or display for xarray data. As this also happens if the Dataset is attached to a variable, you need to explicitly delete (or .close()) the variable in question before opening the file again.

try:
    del ds
except NameError:
    pass
ds = xr.open_dataset("saved_on_disk.nc")
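
Opening the dataset as a context manager is another way to guarantee the handle is released (a sketch using the same file as above):

# On exit the context manager closes the underlying file handle, so a
# later open_dataset sees the file as it currently is on disk.
with xr.open_dataset("saved_on_disk.nc") as ds:
    print(ds)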

shoyer added a commit to shoyer/xarray that referenced this issue Feb 7, 2021
This means that explicitly opening a file multiple times with
``open_dataset`` (e.g., after modifying it on disk) now reopens the file
from scratch, rather than reusing a cached version.

If users want to reuse the cached file, they can reuse the same xarray
object. We don't need this for handling many files in Dask (the original
motivation for caching), because in those cases only a single
CachingFileManager is created.

I think this should fix some long-standing usability issues: pydata#4240, pydata#4862

Conveniently, this also obviates the need for some messy reference
counting logic.
@shoyer
Member

shoyer commented Feb 7, 2021

I have a tentative fix for this in #4879. It would be great if someone could give it a try to verify that it resolves the issue.

@brianmapes

+1
Complicated, and still vexing this user a year+ later, but it is easier for me to just restart the kernel again and again than to work through this thread and #4879, which is closed but doesn't seem to have succeeded, if I read correctly?

@mullenkamp

mullenkamp commented Oct 4, 2022

Running xarray.backends.file_manager.FILE_CACHE.clear() fixed the issue for me. I couldn't find any other way to stop xarray from pulling up some old data from a newly saved file. I'm using the h5netcdf engine with xarray version 2022.6.0 by the way.

dcherian added a commit that referenced this issue Oct 18, 2022
* Cache files for different CachingFileManager objects separately

This means that explicitly opening a file multiple times with
``open_dataset`` (e.g., after modifying it on disk) now reopens the file
from scratch, rather than reusing a cached version.

If users want to reuse the cached file, they can reuse the same xarray
object. We don't need this for handling many files in Dask (the original
motivation for caching), because in those cases only a single
CachingFileManager is created.

I think this should fix some long-standing usability issues: #4240, #4862

Conveniently, this also obviates the need for some messy reference
counting logic.

* Fix whats-new message location

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add id to CachingFileManager

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* restrict new test to only netCDF files

* fix whats-new message

* skip test on windows

* Revert "[pre-commit.ci] auto fixes from pre-commit.com hooks"

This reverts commit e637165.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Revert "Fix whats-new message location"

This reverts commit 6bc80e7.

* fixups

* fix syntax

* tweaks

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix types for mypy

* add uuid

* restore ref_counts

* doc tweaks

* close files inside test_open_mfdataset_list_attr

* remove unused itertools

* don't use refcounts

* re-enable ref counting

* cleanup

* Apply typing suggestions from code review

Co-authored-by: Illviljan <14371165+Illviljan@users.noreply.github.com>

* fix import of Hashable

* ignore __init__ type

* fix whats-new

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
Co-authored-by: Illviljan <14371165+Illviljan@users.noreply.github.com>
Co-authored-by: dcherian <deepak@cherian.net>
keewis pushed a commit to keewis/xarray that referenced this issue Oct 19, 2022