
jupyter repr caching deleted netcdf file #4240

Closed
michaelaye opened this issue Jul 21, 2020 · 9 comments · Fixed by #4879

Comments

@michaelaye

What happened:

While testing xarray data storage in a Jupyter notebook with varying data sizes, saving to netCDF each time, I noticed that open_dataset/open_dataarray (both show this behaviour) continue to return data from the first testing run, ignoring the fact that each run deletes the previously created netCDF file.
This only happens once the repr was used to display the xarray object.
But once in this error mode, even objects that previously printed fine show the wrong data.

This was hard to track down, as it depends on the precise sequence of cells in Jupyter.

What you expected to happen:

When I use open_dataset/open_dataarray, the resulting object should reflect the current state of the file on disk.

Minimal Complete Verifiable Example:

import xarray as xr
from pathlib import Path
import numpy as np

def test_repr(nx):
    # Create a random array of length nx, remove any leftover file from a
    # previous run, and write the array to a fresh netCDF file.
    ds = xr.DataArray(np.random.rand(nx))
    path = Path("saved_on_disk.nc")
    if path.exists():
        path.unlink()
    ds.to_netcdf(path)
    return path

When executed in a cell with print for display, all is fine:

test_repr(4)
print(xr.open_dataset("saved_on_disk.nc"))
test_repr(5)
print(xr.open_dataset("saved_on_disk.nc"))

but as soon as one cell uses the Jupyter repr:

xr.open_dataset("saved_on_disk.nc")

all future file reads show the data from the last repr use, even after executing the test function again, and even when using print rather than the repr.

Anything else we need to know?:

Here's a notebook showing the issue:
https://gist.github.com/05c2542ed33662cdcb6024815cc0c72c

Environment:

Output of xr.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.6 | packaged by conda-forge | (default, Jun 1 2020, 18:57:50)
[GCC 7.5.0]
python-bits: 64
OS: Linux
OS-release: 5.4.0-40-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
libhdf5: 1.10.6
libnetcdf: 4.7.4

xarray: 0.16.0
pandas: 1.0.5
numpy: 1.19.0
scipy: 1.5.1
netCDF4: 1.5.3
pydap: None
h5netcdf: None
h5py: 2.10.0
Nio: None
zarr: None
cftime: 1.2.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: 1.1.5
cfgrib: None
iris: None
bottleneck: None
dask: 2.21.0
distributed: 2.21.0
matplotlib: 3.3.0
cartopy: 0.18.0
seaborn: 0.10.1
numbagg: None
pint: None
setuptools: 49.2.0.post20200712
pip: 20.1.1
conda: installed
pytest: 6.0.0rc1
IPython: 7.16.1
sphinx: 3.1.2

michaelaye changed the title from "jupyter repr caching deleted file netcdf file" to "jupyter repr caching deleted netcdf file" on Jul 21, 2020
@shoyer
Member

shoyer commented Jul 25, 2020

Thanks for the clear example!

This happens due to xarray's caching logic for files:

class CachingFileManager(FileManager):
    """Wrapper for automatically opening and closing file objects.

    Unlike files, CachingFileManager objects can be safely pickled and passed
    between processes. They should be explicitly closed to release resources,
    but a per-process least-recently-used cache for open files ensures that you
    can safely create arbitrarily large numbers of FileManager objects.

    Don't directly close files acquired from a FileManager. Instead, call
    FileManager.close(), which ensures that closed files are removed from the
    cache as well.

    Example usage:

        manager = FileManager(open, 'example.txt', mode='w')
        f = manager.acquire()
        f.write(...)
        manager.close()  # ensures file is closed

    Note that as long as previous files are still cached, acquiring a file
    multiple times from the same FileManager is essentially free:

        f1 = manager.acquire()
        f2 = manager.acquire()
        assert f1 is f2
    """

This means that when you open the same filename, xarray doesn't actually reopen the file from disk -- instead it points to the same file object already cached in memory.

I can see why this could be confusing. We do need this caching logic for files opened from the same backends.*DataStore class, but we could include some sort of unique identifier (e.g., from uuid) to ensure each separate call to xr.open_dataset results in a separately cached/opened file object:

manager = CachingFileManager(
    netCDF4.Dataset, filename, mode=mode, kwargs=kwargs
)
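
For illustration, here is a minimal self-contained sketch of that idea; the class name, cache dict, and method names below are stand-ins, not xarray's actual implementation:

import uuid

# Stand-in for xarray's global, per-process LRU cache of open files.
_FILE_CACHE = {}

class UniqueKeyFileManager:
    """Sketch: each manager instance carries a unique token, so two
    managers opening the same filename get separate cache entries."""

    def __init__(self, opener, filename, **kwargs):
        self._opener = opener
        self._filename = filename
        self._kwargs = kwargs
        self._id = uuid.uuid4()  # fresh token per open_dataset call

    def _key(self):
        # Because the token is part of the key, a second manager for the
        # same path never reuses the first manager's cached file object.
        return (self._filename, self._id)

    def acquire(self):
        key = self._key()
        if key not in _FILE_CACHE:
            _FILE_CACHE[key] = self._opener(self._filename, **self._kwargs)
        return _FILE_CACHE[key]

    def close(self):
        f = _FILE_CACHE.pop(self._key(), None)
        if f is not None:
            f.close()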

@michaelaye
Author

Is there a workaround to force reopening the file without restarting the notebook?

@michaelaye
Author

Now I'm wondering why the caching logic is only triggered by the repr. As you can see, when printed, the object always updated to the state on disk.

@shoyer
Member

shoyer commented Jul 25, 2020

Probably the easiest workaround is to call .close() on the original dataset. Failing that, the file is cached in xarray.backends.file_manager.FILE_CACHE, which you could muck around with.

I believe it only gets activated by repr() because array values from a netCDF file are loaded lazily. Not 100% sure without more testing, though.
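
For illustration, both workarounds in one snippet (a sketch, reusing the example file from above):

import xarray as xr

# Workaround 1: close the original dataset so its cached file handle is
# released before reopening.
ds = xr.open_dataset("saved_on_disk.nc")
ds.close()

# Workaround 2: clear xarray's global file cache, which forces the next
# open to re-read the file from disk.
xr.backends.file_manager.FILE_CACHE.clear()

ds = xr.open_dataset("saved_on_disk.nc")  # now reflects the file on disk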

@markusritschel

Would it be an option to consider the time stamp of the file's last change as a caching criterion?
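
For illustration, a sketch of what such a criterion might look like; cache_key here is hypothetical, not part of xarray:

import os

def cache_key(path: str) -> tuple:
    # Hypothetical: fold the file's last-modification time into the
    # cache key, so replacing the file on disk automatically misses the
    # stale cache entry and triggers a fresh open.
    return (path, os.path.getmtime(path))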

@kmuehlbauer
Contributor

kmuehlbauer commented Jan 21, 2021

I've stumbled over this weird behaviour many times and wondered why it happens. So AFAICT @shoyer hit the nail on the head, but the root cause is that the Dataset is somehow added to the notebook namespace if one just evaluates it in a cell.

This doesn't happen if you invoke the __repr__ via

display(xr.open_dataset("saved_on_disk.nc"))

I've forced myself to use either print or display for xarray data. As this also happens if the Dataset is attached to a variable, you need to explicitly delete (or .close()) the variable in question before opening the file again.

try:
    del ds
except NameError:
    pass
ds = xr.open_dataset("saved_on_disk.nc")
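
Opening the dataset as a context manager is another way to guarantee the handle is released (a sketch using the same file as above):

# On exit the context manager closes the underlying file handle, so a
# later open_dataset sees the file as it currently is on disk.
with xr.open_dataset("saved_on_disk.nc") as ds:
    print(ds)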

shoyer added a commit to shoyer/xarray that referenced this issue Feb 7, 2021
This means that explicitly opening a file multiple times with
``open_dataset`` (e.g., after modifying it on disk) now reopens the file
from scratch, rather than reusing a cached version.

If users want to reuse the cached file, they can reuse the same xarray
object. We don't need this for handling many files in Dask (the original
motivation for caching), because in those cases only a single
CachingFileManager is created.

I think this should fix some long-standing usability issues: pydata#4240, pydata#4862

Conveniently, this also obviates the need for some messy reference
counting logic.
@shoyer
Member

shoyer commented Feb 7, 2021

I have a tentative fix for this in #4879. It would be great if someone could give it a try to verify that it resolves the issue.

@brianmapes

+1
Complicated, and still vexing this user a year+ later, but it is easier for me to just restart the kernel again and again than to work through this thread and #4879, which is closed but doesn't seem to have succeeded, if I read correctly?

@mullenkamp

mullenkamp commented Oct 4, 2022

Running xarray.backends.file_manager.FILE_CACHE.clear() fixed the issue for me. I couldn't find any other way to stop xarray from pulling up some old data from a newly saved file. I'm using the h5netcdf engine with xarray version 2022.6.0 by the way.

dcherian added a commit that referenced this issue Oct 18, 2022
* Cache files for different CachingFileManager objects separately

This means that explicitly opening a file multiple times with
``open_dataset`` (e.g., after modifying it on disk) now reopens the file
from scratch, rather than reusing a cached version.

If users want to reuse the cached file, they can reuse the same xarray
object. We don't need this for handling many files in Dask (the original
motivation for caching), because in those cases only a single
CachingFileManager is created.

I think this should fix some long-standing usability issues: #4240, #4862

Conveniently, this also obviates the need for some messy reference
counting logic.

* Fix whats-new message location

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add id to CachingFileManager

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* restrict new test to only netCDF files

* fix whats-new message

* skip test on windows

* Revert "[pre-commit.ci] auto fixes from pre-commit.com hooks"

This reverts commit e637165.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Revert "Fix whats-new message location"

This reverts commit 6bc80e7.

* fixups

* fix syntax

* tweaks

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix types for mypy

* add uuid

* restore ref_counts

* doc tweaks

* close files inside test_open_mfdataset_list_attr

* remove unused itertools

* don't use refcounts

* re-enable ref counting

* cleanup

* Apply typing suggestions from code review

Co-authored-by: Illviljan <14371165+Illviljan@users.noreply.github.com>

* fix import of Hashable

* ignore __init__ type

* fix whats-new

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Deepak Cherian <dcherian@users.noreply.github.com>
Co-authored-by: Illviljan <14371165+Illviljan@users.noreply.github.com>
Co-authored-by: dcherian <deepak@cherian.net>
keewis pushed a commit to keewis/xarray that referenced this issue Oct 19, 2022