Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python file opener VSI plugin #2898

Merged
merged 12 commits into from
Aug 31, 2023
Merged

Python file opener VSI plugin #2898

merged 12 commits into from
Aug 31, 2023

Conversation

sgillies
Copy link
Member

@sgillies sgillies commented Aug 11, 2023

Resolves #2888 maybe definitely.

This PR adds a new opener keyword argument to rasterio.open(), making an analogy to the opener kwarg of Python's builtin open().

The opener keyword argument must be a callable which takes a string as its one positional argument and returns a read-only Python file-like object with read, seek, tell, and close methods. The opener is called by GDAL's virtual filesystem machinery and thus its file-like object serves GDAL's format drivers.

Example usage:

with rasterio.open("zip://*.tif::tests/data/files.zip", opener=fsspec.open) as src:
    print(src.profile)

When the opener needs extra arguments, one can use functools.partial.

from functools import partial
with rasterio.open("dask::s3://bucket/key", opener=partial(fsspec.open, s3={"anon": True})) as src:
    print(src.profile)

Applications of this include:

  • Opening datasets stored in proprietary systems
  • Integration with auth systems or protocols not supported by GDAL

Currently we're testing with io.open, zipfile.ZipFile.open, and fsspec.open. I expect this to extend to gzip and fs_s3fs openers in the same way.

@sgillies sgillies self-assigned this Aug 11, 2023
rasterio/__init__.py Outdated Show resolved Hide resolved
sgillies and others added 3 commits August 14, 2023 17:16
The missing piece was keeping a reference to opened files.
@sgillies sgillies changed the title Rough draft of a new VSI plugin Python file opener VSI plugin Aug 17, 2023
@sgillies sgillies marked this pull request as ready for review August 18, 2023 00:49
@sgillies
Copy link
Member Author

@djhoese does this look useful to you? I think it's close to solving the Python-GDAL issues of #2850 forever, but does require a fairly major change in usage.

@djhoese
Copy link
Contributor

djhoese commented Aug 18, 2023

Very interesting. From what I'm understanding it definitely seems interesting. Right now, is this an additional interface or is this PR completely replacing the old interface(s)?

I honestly haven't been using this functionality even though I was the one who originally added it. It was more a stepping stone for fsspec users, but I didn't need it directly for my usual work.

rasterio/_vsiopener.pyx Outdated Show resolved Hide resolved
rasterio/_vsiopener.pyx Outdated Show resolved Hide resolved
rasterio/_vsiopener.pyx Outdated Show resolved Hide resolved
Sean Gillies added 4 commits August 18, 2023 16:34
@sgillies
Copy link
Member Author

@djhoese this would be an additional interface. I think FilePath will need to be deprecated and removed eventually. Your work won't be lost, though, it will live on in the new opener argument.

@sgillies
Copy link
Member Author

@groutr @vincentsarago @snowman2 anything else you'd like to see before this gets merged?

rasterio/_vsiopener.pyx Outdated Show resolved Hide resolved
rasterio/_vsiopener.pyx Outdated Show resolved Hide resolved
Eliminates the fsspec OpenFile special case
rasterio/_vsiopener.pyx Outdated Show resolved Hide resolved
@sgillies sgillies mentioned this pull request Aug 30, 2023
rasterio/_vsiopener.pyx Outdated Show resolved Hide resolved
Copy link
Member

@snowman2 snowman2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work @sgillies 👍

Co-authored-by: Alan D. Snow <alansnow21@gmail.com>
sgillies pushed a commit that referenced this pull request Aug 31, 2023
This depends on PR #2898.

Resolves #2906.
@sgillies sgillies merged commit b9716ce into main Aug 31, 2023
16 checks passed
sgillies pushed a commit that referenced this pull request Sep 1, 2023
This depends on PR #2898.

Resolves #2906.
sgillies added a commit that referenced this pull request Sep 1, 2023
* Rough draft of a new VSI plugin

Resolves #2888 maybe

* Getting closer

* Working implementation!

The missing piece was keeping a reference to opened files.

* Add more tests

* Add more documentation about opener

* Improvements based on feedback from @rouault

* Update docstrings

* Update the VSI topic in docs

* Don't accept fsspec OpenFile instances as opener

Instead, we'll make a way for rasterio.open to accept them as the
"fp" positional argument

* Store contexts of vsipyopener file objects

Eliminates the fsspec OpenFile special case

* Strip file mode for the zipfile case

* Update rasterio/_vsiopener.pyx

Co-authored-by: Alan D. Snow <alansnow21@gmail.com>

---------

Co-authored-by: Sean Gillies <seangillies@Seans-MacBook-Air.local>
Co-authored-by: Alan D. Snow <alansnow21@gmail.com>
sgillies pushed a commit that referenced this pull request Sep 1, 2023
This depends on PR #2898.

Resolves #2906.
@sgillies sgillies deleted the issue2888 branch October 17, 2023 20:36
@sgillies sgillies added this to the 1.4.0 milestone Oct 27, 2023
@vincentsarago
Copy link
Member

👋 I've been playing with this recently https://github.com/vincentsarago/vsifile

Maybe I'm doing something absolutely wrong (and surely duplicating what fsspec is trying to solve) but I wanted to see if we could maybe create a super simple file handler with simple persistent cache (at least for file header).

Sadly, right now it's pretty inefficient because GDAL won't merge request nor use internal cache (this is really weird to me) as I'm trying to explain in vincentsarago/vsifile#1

No real actions needed it's mostly a FYI

@vincentsarago
Copy link
Member

FYI: I think the opener needs to implement https://github.com/OSGeo/gdal/blob/4d94400adc0a47fc5c9091adb8a73a5aadeb5d4a/port/cpl_vsil_plugin.cpp#L80-L97 to enable SMART multi range request :-)

@djhoese
Copy link
Contributor

djhoese commented Nov 7, 2023

@vincentsarago Doesn't that assume that the python file-like object is multi-reader friendly? And thread safe? I don't think that is true for most file-like objects.

@vincentsarago
Copy link
Member

Doesn't that assume that the python file-like object is multi-reader friendly?

Maybe, I'm not saying that all file-like object should support this

And thread safe? I don't think that is true for most file-like objects.

Yeah that's definitely something that should be tested (I usually work in single thread)

@vincentsarago
Copy link
Member

vincentsarago commented Nov 7, 2023

Cython is not my forte so if someone is interested to help it will be really appreciated

I'm just trying to add the

# in https://github.com/rasterio/rasterio/blob/75f770a309c24c97c859b7b20985e10c00a4c514/rasterio/_vsiopener.pyx#L57-L58
callbacks_struct.read_multi_range = <VSIFilesystemPluginReadMultiRangeCallback>pyopener_read_multi_range
# in https://github.com/rasterio/rasterio/blob/75f770a309c24c97c859b7b20985e10c00a4c514/rasterio/_vsiopener.pyx#L150
cdef int pyopener_read_multi_range(void *pFile,
                                   int nRanges,
                                   char *ppData,
                                   vsi_l_offset[:] panOffsets,
                                   size_t[:] panSizes) except -1 with gil:
    """Read several ranges of bytes from file.

    Reads nRanges objects of panSizes[i] bytes from the indicated file at the offset panOffsets[i] into the buffer ppData[i].
    Ranges must be sorted in ascending start offset, and must not overlap each other.

    """
    cdef object file_obj = <object>pFile
    file_obj.read_multi_range(nRanges, ppData, panOffsets, panSizes)
    return 0

but the definition of pyopener_read_multi_range is not right

I guess we'll also need to make it optional in some way!

we can start a proper PR if people are interested

Note: the AdviseRead and GetRangeStatus are much more advanced and I don't personally need them for now (ref OSGeo/gdal#6456)

@rouault
Copy link
Contributor

rouault commented Nov 8, 2023

Doesn't that assume that the python file-like object is multi-reader friendly? And thread safe?

ReadMultiRange() is only going to be called from a single thread at a time on a single VSILFILE* object
Only PRead() needs to be multi-thread safe, but it is not (yet?) available in the file plugin mechanism

@vincentsarago
Copy link
Member

cdef int pyopener_read_multi_range(void *pFile,
                                   int nRanges,
                                   void **ppData,
                                   vsi_l_offset *panOffsets,
                                   size_t *panSizes) except -1 with gil:
    """Read several ranges of bytes from file.

        Reads nRanges objects of panSizes[i] bytes from the indicated file at the offset panOffsets[i] into the buffer ppData[i].
        Ranges must be sorted in ascending start offset, and must not overlap each other.

        ref: https://gdal.org/api/cpl.html#_CPPv419VSIFReadMultiRangeLiPPvPK12vsi_l_offsetPK6size_tP8VSILFILE

        Parameters
        ----------
        pFile: File handle
            file handle opened with VSIFOpenL().
        nRanges : int
            number of ranges to read.
        ppData : array of buffer
            Array of nRanges buffer into which the data should be read
            (ppData[i] must be at list panSizes[i] bytes).
        panOffsets : array of int
            Array of nRanges offsets at which the data should be read.
        panSizes : array of in
            Array of nRanges sizes of objects to read (in bytes).

        int VSIPluginHandle::ReadMultiRange(int nRanges, **ppData, vsi_l_offset *panOffsets, size_t *panSizes)
        {
            return poFS->ReadMultiRange(cbData, nRanges, ppData, panOffsets, panSizes);
        }
    """
    cdef object file_obj = <object>pFile

    # Convert panOffsets and panSizes to Python lists
    cdef list offsets = [int(panOffsets[i]) for i in range(nRanges)]
    cdef list sizes = [int(panSizes[i]) for i in range(nRanges)]

    # Call the Python method with the converted arguments
    cdef list python_data = file_obj.read_multi_range(nRanges, offsets, sizes)
    for i in range(nRanges):
        memcpy(ppData[i], <void*><char*>python_data[i], len(python_data[i]))

    return 0

took me only 2 days 😅 (there might be a lot of issue with this code, but at least it works 😄 )

@vincentsarago
Copy link
Member

Note:

There are 2 important variables nBufferSize and nCacheSize which have influence over GDAL internal cache https://github.com/OSGeo/gdal/blob/f77e05dd2fbfd96933621de9a42f763fd158933c/port/cpl_vsil_plugin.cpp#L143-L153

I'm not sure if Rasterio's should have default for those or try to use the default from GDAL but having those set in the callbacks_struct makes vsiopener much more performant!

@vincentsarago
Copy link
Member

@sgillies happy to open an issue instead of commenting here!

Quick question: is the registery thread safe (e.g what happens if multiple threads try to register/deregister the same file path ?

I'm getting this kind of error sometimes 👇

  File "rasterio/_vsiopener.pyx", line 205, in _opener_registration
    _ = _OPENER_REGISTRY.pop(filename)
KeyError: 'https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/15/T/VK/2023/10/S2B_15TVK_20231008_0_L2A/TCI.tif'

@djhoese
Copy link
Contributor

djhoese commented Nov 13, 2023

At least for the initial design, it was not thread safe. We didn't have a lot of test/user cases for the initial proof of concept and the initial implementation has changed a couple times since then. Please create new issues for what you've been discussing here so the individual topics don't get lost and please mention the relevant people.

sgillies added a commit that referenced this pull request Dec 22, 2023
* Support writing to Python files

This depends on PR #2898.

Resolves #2906.

* Update change log

* Fix PR reference

* Use CPLError to convey opener failure out of the callback

* Call CPLError with 4 arguments

* Register opener for filename and mode

Also make sure that deregistration happens on cleanup

* Remove with from test

* Update opener documentation

---------

Co-authored-by: Sean Gillies <seangillies@Seans-MacBook-Air.local>
sgillies added a commit that referenced this pull request Dec 28, 2023
* Support writing to Python files

This depends on PR #2898.

Resolves #2906.

* Update change log

* Fix PR reference

* Allow fsspec OpenFile instances as dataset name

Resolves #2905

* Remove line left after online merge

---------

Co-authored-by: Sean Gillies <seangillies@Seans-MacBook-Air.local>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

rasterio 1.3.8 fails to read a raster from AWS S3 bucket
6 participants