Convert remote files to FSFile objects automatically #2096

Merged: 37 commits (merged May 9, 2022)

Commits
0ed9b23  Convert remote files to FSFile objects (pnuu, Apr 27, 2022)
36e4b1d  Fix usage for dictionary of readers and their files (pnuu, Apr 27, 2022)
5b189bd  Make TestScene::test_init_with_fsfile pass (pnuu, Apr 27, 2022)
35592f8  Handle files already being FSFiles (pnuu, Apr 27, 2022)
6331ff8  Add missing square brackets for list comprehension (pnuu, Apr 27, 2022)
b379e5f  Let import error stop processing when remote files can't be handled (pnuu, Apr 27, 2022)
810d927  Add s3fs to test and CI requirements (pnuu, Apr 27, 2022)
aa876d7  Assume all paths having backslash in them be windows local files (pnuu, Apr 28, 2022)
180aba9  Add some documentation for remote reading (pnuu, Apr 28, 2022)
ae7b4ca  Add example reading public GOES-16/ABI data (pnuu, Apr 28, 2022)
a7f6b69  Move remote reading documentation to its own page (pnuu, May 2, 2022)
59a7caa  Clarify documentation for fsspec configuration (pnuu, May 2, 2022)
f35d010  Add table of readers supporting reading remote data (pnuu, May 2, 2022)
7c5e41c  Link FSFile to remote reading documentation (pnuu, May 3, 2022)
f66a22b  Move check_file_protocols() to utils.py (pnuu, May 3, 2022)
bbd50c1  Fix imports (pnuu, May 3, 2022)
67b28ac  Add kwarg for fsspec storage options (pnuu, May 3, 2022)
2a8f4ca  Use reader_kwargs for storage options (pnuu, May 4, 2022)
b31488d  Make CodeFactor happier (pnuu, May 4, 2022)
521c3f6  Refactor storage option handling (pnuu, May 4, 2022)
45ff2aa  Fix storage option handling for dictionaries (pnuu, May 4, 2022)
69eb6a1  Move storage option parsing to utilities (pnuu, May 4, 2022)
58ae0ce  Make CodeFactor happier (pnuu, May 4, 2022)
28de307  Clarify variable naming (pnuu, May 4, 2022)
e62fd87  Use more descriptive name for remote file handling (pnuu, May 4, 2022)
f442165  Rename test (pnuu, May 5, 2022)
4cd2f4a  Add a section on caching (pnuu, May 5, 2022)
7ef8b6f  Update doc/source/remote_reading.rst (pnuu, May 5, 2022)
1c27ea6  Remove unnecessary doctest setup (pnuu, May 5, 2022)
d508067  Clarify cache documentation (pnuu, May 5, 2022)
0f2c7d9  Update doc/source/remote_reading.rst (pnuu, May 5, 2022)
c6fc935  Clarify note on reader_kwargs overriding only some config options (pnuu, May 6, 2022)
361e4d0  Show example us simplecache, add a note about thread safety (pnuu, May 6, 2022)
fd3156b  Make get_storage_options_from_reader_kwargs() public (pnuu, May 6, 2022)
00bc379  Move imports from function to module (pnuu, May 6, 2022)
1fe6dd6  Add fsspec as optional dependency for remote file reading (pnuu, May 6, 2022)
c45fd89  Move import of FSFile back to function due to circular import (pnuu, May 6, 2022)
1 change: 1 addition & 0 deletions continuous_integration/environment.yaml
@@ -44,6 +44,7 @@ dependencies:
- pytest-cov
- pytest-lazy-fixture
- fsspec
- s3fs
- pylibtiff
- python-geotiepoints
- pooch
1 change: 1 addition & 0 deletions doc/source/index.rst
@@ -56,6 +56,7 @@ the base Satpy installation.
examples/index
quickstart
readers
remote_reading
composites
resample
enhancements
173 changes: 173 additions & 0 deletions doc/source/remote_reading.rst
@@ -0,0 +1,173 @@
====================
Reading remote files
====================

Using a single reader
=====================

Some of the readers in Satpy can read data directly over various transfer protocols. This is done
using `fsspec <https://filesystem-spec.readthedocs.io/en/latest/index.html>`_ and the various packages
it builds on.

As an example, reading ABI data from public AWS S3 storage can be done in the following way::

    from satpy import Scene

    storage_options = {'anon': True}
    filenames = ['s3://noaa-goes16/ABI-L1b-RadC/2019/001/17/*_G16_s20190011702186*']
    scn = Scene(reader='abi_l1b', filenames=filenames, reader_kwargs={'storage_options': storage_options})
    scn.load(['true_color_raw'])

Reading from S3 as above requires the `s3fs` library to be installed in addition to `fsspec`.
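Internally, Satpy inspects each filename to decide whether it should be wrapped into a remote-capable file object; notably, paths containing a backslash are assumed to be Windows local files (see commit aa876d7). The sketch below illustrates that decision with a hypothetical helper; it is not the actual `satpy.utils` API.

```python
# Illustrative sketch of Satpy's remote-vs-local filename decision.
# The function name is hypothetical, not part of satpy.utils.
def is_remote_path(path):
    """Guess whether a filename points at a remote filesystem.

    Paths containing a backslash are assumed to be Windows local paths,
    since a drive letter such as ``C:`` would otherwise look like a protocol.
    """
    if "\\" in path:
        return False
    protocol, _, rest = path.partition("://")
    return bool(rest) and protocol not in ("", "file")


print(is_remote_path("s3://noaa-goes16/ABI-L1b-RadC/file.nc"))  # True
print(is_remote_path("C:\\data\\goes\\file.nc"))                # False
print(is_remote_path("/local/path/file.nc"))                    # False
```

Filenames that pass a check like this are converted to :class:`~satpy.readers.FSFile` objects automatically by the Scene.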

As an alternative, the storage options can be given using
`fsspec configuration <https://filesystem-spec.readthedocs.io/en/latest/features.html#configuration>`_.
For the above example, the configuration could be saved to `s3.json` in the `fsspec` configuration directory
(by default the `~/.config/fsspec/` directory on Linux)::

    {
        "s3": {
            "anon": "true"
        }
    }
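The same configuration file can be created programmatically. A minimal sketch, assuming the default Linux location; `fsspec` also honors the `FSSPEC_CONFIG_DIR` environment variable for relocating the configuration directory:

```python
import json
import os

# Locate the fsspec configuration directory (FSSPEC_CONFIG_DIR overrides
# the default ~/.config/fsspec used on Linux).
conf_dir = os.environ.get("FSSPEC_CONFIG_DIR",
                          os.path.expanduser("~/.config/fsspec"))
os.makedirs(conf_dir, exist_ok=True)

# Write the per-protocol options; a JSON boolean works the same as the
# {'anon': True} dict passed via storage_options above.
with open(os.path.join(conf_dir, "s3.json"), "w") as fh:
    json.dump({"s3": {"anon": True}}, fh, indent=2)
```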

.. note::

    Options given in `reader_kwargs` override only the matching options given in the configuration file and
    everything else is left as-is. In case of problems in data access, remove the configuration file to see if
    that solves the issue.


For reference, reading SEVIRI HRIT data from a local S3 storage works the same way::

    filenames = [
        's3://satellite-data-eumetcast-seviri-rss/H-000-MSG3*202204260855*',
    ]
    storage_options = {
        "client_kwargs": {"endpoint_url": "https://PLACE-YOUR-SERVER-URL-HERE"},
        "secret": "VERYBIGSECRET",
        "key": "ACCESSKEY"
    }
    scn = Scene(reader='seviri_l1b_hrit', filenames=filenames, reader_kwargs={'storage_options': storage_options})
    scn.load(['WV_073'])
Comment on lines +42 to +51:

    Member: Should the doctests skip this?

    Member Author (pnuu): Are the doctests run by default on all files? I don't see any skips on
    https://satpy.readthedocs.io/en/stable/quickstart.html which definitely wouldn't work without having
    the data on the defined path.

    Member: Oh, ok, I thought we were doctesting everything. I got confused by the doctest setup at the
    top of the file. Should we have a clear comment on the top of the files that are not tested? Should
    this file actually be tested?

    Member Author (pnuu): I didn't know what that meant so just copied it from quickstart.rst where the
    documentation was originally. I'm not sure these can be tested apart from the GOES case, but I'm also
    not sure whether having the tests access AWS/S3 and interact with the data is reasonable...

    Member: I don't think I want regular unit tests running S3 downloads, but doctests may be "OK". I
    don't think we run doctests as part of CI as most of our examples use fake non-existent paths.

Using the `fsspec` configuration file `s3.json`, the equivalent configuration would look like this::

    {
        "s3": {
            "client_kwargs": {"endpoint_url": "https://PLACE-YOUR-SERVER-URL-HERE"},
            "secret": "VERYBIGSECRET",
            "key": "ACCESSKEY"
        }
    }
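To keep credentials out of code and version-controlled configuration, the storage options above can also be assembled from environment variables. A minimal sketch; the variable names (`S3_ENDPOINT_URL` etc.) are illustrative, not a Satpy or fsspec convention:

```python
import os

# Assemble storage options for a private S3-compatible endpoint from
# environment variables instead of hardcoding secrets.
storage_options = {
    "client_kwargs": {
        "endpoint_url": os.environ.get("S3_ENDPOINT_URL",
                                       "https://PLACE-YOUR-SERVER-URL-HERE"),
    },
    "secret": os.environ.get("S3_SECRET_KEY", ""),
    "key": os.environ.get("S3_ACCESS_KEY", ""),
}
```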


Using multiple readers
======================

If multiple readers are used and the required credentials differ, the storage options are passed per reader like this::

    reader1_filenames = [...]
    reader2_filenames = [...]
    filenames = {
        'reader1': reader1_filenames,
        'reader2': reader2_filenames,
    }
    reader1_storage_options = {...}
    reader2_storage_options = {...}
    reader_kwargs = {
        'reader1': {
            'option1': 'foo',
            'storage_options': reader1_storage_options,
        },
        'reader2': {
            'option1': 'foo',
            'storage_options': reader2_storage_options,
        }
    }
    scn = Scene(filenames=filenames, reader_kwargs=reader_kwargs)
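Before creating the readers, the Scene separates the ``storage_options`` entries from the remaining reader keyword arguments (this PR exposes that logic as ``satpy.utils.get_storage_options_from_reader_kwargs``). The sketch below is a simplified, illustrative stand-in for the per-reader case, not the actual implementation:

```python
def split_storage_options_per_reader(reader_kwargs):
    """Separate per-reader ``storage_options`` from the remaining kwargs.

    Simplified stand-in for what Satpy does internally; not the actual
    ``satpy.utils.get_storage_options_from_reader_kwargs``.
    """
    storage_options = {}
    cleaned = {}
    for reader, kwargs in reader_kwargs.items():
        kwargs = dict(kwargs)  # do not mutate the caller's dictionaries
        storage_options[reader] = kwargs.pop("storage_options", {})
        cleaned[reader] = kwargs
    return storage_options, cleaned


reader_kwargs = {
    'reader1': {'option1': 'foo', 'storage_options': {'anon': True}},
    'reader2': {'option1': 'foo'},
}
storage_options, cleaned = split_storage_options_per_reader(reader_kwargs)
```

The cleaned dictionary is what the reader instances receive, while the storage options are used only for opening the remote files.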


Caching the remote files
========================

Caching the remote files locally can speed up the overall processing time significantly, especially if the data
are re-used, for example when testing. The caching can be done by taking advantage of the `fsspec caching mechanism
<https://filesystem-spec.readthedocs.io/en/latest/features.html#caching-files-locally>`_::

    reader_kwargs = {
        'storage_options': {
            's3': {'anon': True},
            'simple': {
                'cache_storage': '/tmp/s3_cache',
            }
        }
    }

    filenames = ['simplecache::s3://noaa-goes16/ABI-L1b-RadC/2019/001/17/*_G16_s20190011702186*']
    scn = Scene(reader='abi_l1b', filenames=filenames, reader_kwargs=reader_kwargs)
    scn.load(['true_color_raw'])
    scn2 = scn.resample(scn.coarsest_area(), resampler='native')
    scn2.save_datasets(base_dir='/tmp/', tiled=True, blockxsize=512, blockysize=512, driver='COG', overviews=[])


The following table shows the timings for running the above code with different cache statuses:

.. _cache_timing_table:

.. list-table:: Processing times without and with caching
    :header-rows: 1
    :widths: 40 30 30

    * - Caching
      - Elapsed time
      - Notes
    * - No caching
      - 650 s
      - remove `reader_kwargs` and `simplecache::` from the code
    * - File cache
      - 66 s
      - Initial run
    * - File cache
      - 13 s
      - Second run

.. note::

    The cache is not cleaned by Satpy nor fsspec so the user should handle cleaning excess files from `cache_storage`.


.. note::

    Only `simplecache` is considered thread-safe, so using the other caching mechanisms may or may not work depending
    on the reader, Dask scheduler or the phase of the moon.


Resources
=========

See :class:`~satpy.readers.FSFile` for direct usage of `fsspec` with Satpy, and the
`fsspec documentation <https://filesystem-spec.readthedocs.io/en/latest/index.html>`_ for more details on
connection options.


Supported readers
=================

.. _reader_table:

.. list-table:: Satpy Readers capable of reading remote files using `fsspec`
    :header-rows: 1
    :widths: 70 30

    * - Description
      - Reader name
    * - MSG (Meteosat 8 to 11) SEVIRI data in HRIT format
      - `seviri_l1b_hrit`
    * - GOES-R imager data in netcdf format
      - `abi_l1b`
    * - NOAA GOES-R ABI L2+ products in netcdf format
      - `abi_l2_nc`
    * - Sentinel-3 A and B OLCI Level 1B data in netCDF4 format
      - `olci_l1b`
    * - Sentinel-3 A and B OLCI Level 2 data in netCDF4 format
      - `olci_l2`

Review comments on this table:

    Member: In a future PR I'll really need to include this information in a big table of all the readers.

    Member Author (pnuu): Agreed.

    Member Author (pnuu): Forgot: @BENR0 started on that front in #1547 but the PR has been at draft stage
    since Feb 2021.
5 changes: 4 additions & 1 deletion satpy/readers/__init__.py
@@ -647,7 +647,10 @@ def _get_reader_kwargs(reader, reader_kwargs):
 class FSFile(os.PathLike):
     """Implementation of a PathLike file object, that can be opened.
 
-    This is made to be used in conjunction with fsspec or s3fs. For example::
+    Giving the filenames to :class:`Scene` with valid transfer protocols will automatically
+    use this class, so manual usage of this class is needed mainly for fine-grained control.
+
+    This class is made to be used in conjunction with fsspec or s3fs. For example::
 
         from satpy import Scene
21 changes: 16 additions & 5 deletions satpy/scene.py
@@ -34,6 +34,7 @@
from satpy.node import CompositorNode, MissingDependencies, ReaderNode
from satpy.readers import load_readers
from satpy.resample import get_area_def, prepare_resampler, resample_dataset
from satpy.utils import convert_remote_files_to_fsspec, get_storage_options_from_reader_kwargs
from satpy.writers import load_writer

LOG = logging.getLogger(__name__)
@@ -106,21 +107,31 @@ def __init__(self, filenames=None, reader=None, filter_parameters=None,
                 sub-dictionaries to pass different arguments to different
                 reader instances.
 
+                Keyword arguments for remote file access are also given in this dictionary.
+                See `documentation <https://satpy.readthedocs.io/en/stable/remote_reading.html>`_
+                for usage examples.
+
         """
         self.attrs = dict()
 
+        storage_options, cleaned_reader_kwargs = get_storage_options_from_reader_kwargs(reader_kwargs)
+
         if filter_parameters:
-            if reader_kwargs is None:
-                reader_kwargs = {}
+            if cleaned_reader_kwargs is None:
+                cleaned_reader_kwargs = {}
             else:
-                reader_kwargs = reader_kwargs.copy()
-            reader_kwargs.setdefault('filter_parameters', {}).update(filter_parameters)
+                cleaned_reader_kwargs = cleaned_reader_kwargs.copy()
+            cleaned_reader_kwargs.setdefault('filter_parameters', {}).update(filter_parameters)
 
         if filenames and isinstance(filenames, str):
             raise ValueError("'filenames' must be a list of files: Scene(filenames=[filename])")
 
+        if filenames:
+            filenames = convert_remote_files_to_fsspec(filenames, storage_options)
+
         self._readers = self._create_reader_instances(filenames=filenames,
                                                       reader=reader,
-                                                      reader_kwargs=reader_kwargs)
+                                                      reader_kwargs=cleaned_reader_kwargs)
         self._datasets = DatasetDict()
         self._wishlist = set()
         self._dependency_tree = DependencyTree(self._readers)
74 changes: 74 additions & 0 deletions satpy/tests/test_scene.py
@@ -632,6 +632,80 @@ def test_available_dataset_names_no_readers(self):
        name_list = scene.available_dataset_names(composites=True)
        assert name_list == []

    def test_storage_options_from_reader_kwargs_no_options(self):
        """Test getting storage options from reader kwargs.

        Case where there are no options given.
        """
        filenames = ["s3://data-bucket/file1", "s3://data-bucket/file2", "s3://data-bucket/file3"]
        with mock.patch('satpy.scene.load_readers'):
            with mock.patch('fsspec.open_files') as open_files:
                Scene(filenames=filenames)
        open_files.assert_called_once_with(filenames)

    def test_storage_options_from_reader_kwargs_single_dict_no_options(self):
        """Test getting storage options from reader kwargs for remote files.

        Case where a single dict is given for all readers without storage options.
        """
        filenames = ["s3://data-bucket/file1", "s3://data-bucket/file2", "s3://data-bucket/file3"]
        reader_kwargs = {'reader_opt': 'foo'}
        with mock.patch('satpy.scene.load_readers'):
            with mock.patch('fsspec.open_files') as open_files:
                Scene(filenames=filenames, reader_kwargs=reader_kwargs)
        open_files.assert_called_once_with(filenames)

    def test_storage_options_from_reader_kwargs_single_dict(self):
        """Test getting storage options from reader kwargs.

        Case where a single dict is given for all readers with some common storage options.
        """
        filenames = ["s3://data-bucket/file1", "s3://data-bucket/file2", "s3://data-bucket/file3"]
        reader_kwargs = {'reader_opt': 'foo'}
        expected_reader_kwargs = reader_kwargs.copy()
        storage_options = {'option1': '1'}
        reader_kwargs['storage_options'] = storage_options
        with mock.patch('satpy.scene.load_readers') as load_readers:
            with mock.patch('fsspec.open_files') as open_files:
                Scene(filenames=filenames, reader_kwargs=reader_kwargs)
        call_ = load_readers.mock_calls[0]
        assert call_.kwargs['reader_kwargs'] == expected_reader_kwargs
        open_files.assert_called_once_with(filenames, **storage_options)

    def test_storage_options_from_reader_kwargs_per_reader(self):
        """Test getting storage options from reader kwargs.

        Case where each reader has their own storage options.
        """
        from copy import deepcopy

        filenames = {
            "reader1": ["s3://data-bucket/file1"],
            "reader2": ["s3://data-bucket/file2"],
            "reader3": ["s3://data-bucket/file3"],
        }
        storage_options_1 = {'option1': '1'}
        storage_options_2 = {'option2': '2'}
        storage_options_3 = {'option3': '3'}
        reader_kwargs = {
            "reader1": {'reader_opt_1': 'foo'},
            "reader2": {'reader_opt_2': 'bar'},
            "reader3": {'reader_opt_3': 'baz'},
        }
        expected_reader_kwargs = deepcopy(reader_kwargs)
        reader_kwargs['reader1']['storage_options'] = storage_options_1
        reader_kwargs['reader2']['storage_options'] = storage_options_2
        reader_kwargs['reader3']['storage_options'] = storage_options_3

        with mock.patch('satpy.scene.load_readers') as load_readers:
            with mock.patch('fsspec.open_files') as open_files:
                Scene(filenames=filenames, reader_kwargs=reader_kwargs)
        call_ = load_readers.mock_calls[0]
        assert call_.kwargs['reader_kwargs'] == expected_reader_kwargs
        assert mock.call(filenames["reader1"], **storage_options_1) in open_files.mock_calls
        assert mock.call(filenames["reader2"], **storage_options_2) in open_files.mock_calls
        assert mock.call(filenames["reader3"], **storage_options_3) in open_files.mock_calls


class TestFinestCoarsestArea:
"""Test the Scene logic for finding the finest and coarsest area."""