Accessing datasets located in buffers using MemoryFile and ZipMemoryFile #977

Open
sgillies opened this issue Feb 3, 2017 · 3 comments

sgillies commented Feb 3, 2017

Rasterio has different ways to access datasets located on disk or at network addresses and datasets located in memory buffers. This document explains the former once again and then introduces the latter for the first time.

Accessing datasets on your filesystem

To access datasets on disk, give a filesystem path to rasterio.open().

import rasterio

# Open a dataset located in a local file.
with rasterio.open('data/RGB.byte.tif') as dataset:
    print(dataset.profile)

Equivalently, use a file:// URL.

with rasterio.open('file://data/RGB.byte.tif') as dataset:
    print(dataset.profile)

Accessing datasets in a zip archive

To access a dataset located in a local zip file, pass a zip:// URL (Apache VFS style) to rasterio.open().

with rasterio.open('zip://data/files.zip!RGB.byte.tif') as dataset:
    print(dataset.profile)

Accessing network datasets

Datasets at http://, https://, or s3:// (AWS CLI style) network locations can be accessed by passing these locators to rasterio.open(). See #942 for details.

The difference from GDAL

If you're a GDAL user, you may be used to passing strings like /vsizip/foo.zip to call for zip file handling and strings like /vsicurl/https://example.com/foo.tif to call for HTTP protocol handling. Rasterio registers handlers by URL schemes instead. Rasterio uses GDAL's special strings internally, but they are not part of the Rasterio API.
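
To make the correspondence concrete, here is a minimal sketch of how URL-style locators could be translated into GDAL's special strings. The function name to_vsi_path and its scheme table are hypothetical and exist only for illustration; this is not Rasterio's internal implementation, just the general shape of the mapping.

```python
from urllib.parse import urlparse

# Hypothetical scheme table for illustration; the /vsi*/ prefixes are GDAL's.
VSI_PREFIXES = {"zip": "vsizip", "s3": "vsis3", "http": "vsicurl", "https": "vsicurl"}


def to_vsi_path(locator):
    """Translate a URL-style locator into a GDAL virtual filesystem path."""
    parts = urlparse(locator)
    if parts.scheme in ("", "file"):
        # Plain filesystem paths and file:// URLs need no special prefix.
        return parts.netloc + parts.path
    prefix = VSI_PREFIXES[parts.scheme]
    if parts.scheme in ("http", "https"):
        # /vsicurl/ expects the complete URL after the prefix.
        return f"/{prefix}/{locator}"
    # Apache VFS style: "archive!member" becomes "archive/member".
    target = (parts.netloc + parts.path).replace("!", "/")
    return f"/{prefix}/{target}"


for locator in ("zip://data/files.zip!RGB.byte.tif", "https://example.com/foo.tif"):
    print(locator, "->", to_vsi_path(locator))
```

Keeping the translation in one place is what lets the public API stay URL-based while GDAL still receives the strings it understands.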

Accessing datasets in memory buffers

Rasterio can access datasets located in the buffers of Python objects without writing the buffers to disk. To see how, read the bytes of any GeoTIFF file.

with open('data/RGB.byte.tif', 'rb') as f:
    data = f.read()

The buffer of data's value contains that GeoTIFF. To make it available to Rasterio (and GDAL), give data to a MemoryFile and then open the dataset using MemoryFile.open().

from rasterio.io import MemoryFile

with MemoryFile(data) as memfile:
    with memfile.open() as dataset:
        print(dataset.profile)

As there is only one dataset per MemoryFile, MemoryFile.open() needs no filename or path argument. In many cases the usage can be condensed to the following.

with MemoryFile(data).open() as dataset:
    print(dataset.profile)

MemoryFile is like Python's BytesIO class but has an additional special feature: the bytes buffer is mapped to a virtual file for use by GDAL. The virtual file is deleted when the MemoryFile closes.

You can also pass a file-like object opened in binary mode to MemoryFile(). This is for convenience only: the bytes of the file are read immediately into a bytes object.

fp = open('data/RGB.byte.tif', 'rb')

with MemoryFile(fp).open() as dataset:
    print(dataset.profile)
    rgb_profile = dataset.profile
    rgb_data = dataset.read()

Note that the profile and band data of that dataset have been captured for use in other examples below.

Performance notes

Recognize the above as a more memory-intensive way of getting the same results as the very first example in this document. Generally speaking, raster data formats are optimized for random access and GDAL format drivers need datasets to be written entirely onto disk or into memory and mapped to a virtual file. Using MemoryFile to hold a large GeoTIFF doesn't require a hard disk (which is good for serverless applications) but loads the entire GeoTIFF into RAM.
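
To see why handing over a file object saves no memory, consider this small sketch of an eager read. The helper as_bytes is hypothetical and not part of Rasterio; it only illustrates that a file-like object is drained completely before anything else happens, so memory use grows with the size of the file.

```python
from io import BytesIO


def as_bytes(file_or_bytes):
    """Return the full contents of a bytes-like or binary file-like object."""
    if hasattr(file_or_bytes, "read"):
        # File-like object: read everything from the current position at once.
        return file_or_bytes.read()
    # Already bytes-like: copy into an immutable bytes value.
    return bytes(file_or_bytes)


# Placeholder bytes stand in for a real GeoTIFF.
payload = as_bytes(BytesIO(b"II*\x00 not a real GeoTIFF"))
print(len(payload))
```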

Writing to MemoryFile

A MemoryFile can also be written to. You can create a GeoTIFF (for example) in memory and then stream its bytes elsewhere without writing to disk. In this case you must bind the MemoryFile to a name so it can be referenced later.

with MemoryFile() as memfile:
    with memfile.open(**rgb_profile) as dataset:
        dataset.write(rgb_data)

    memfile.seek(0)
    print(memfile.read(1000))

Writing band data to the opened dataset modifies the virtual file and consequently the MemoryFile buffer.

Be kind: rewind

Note well: after dataset closes, the memfile position is left at its end.
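
The same behavior can be seen with Python's own BytesIO, which MemoryFile is compared to above. This sketch uses only the standard library and placeholder bytes; it is an analogy, not a Rasterio example.

```python
from io import BytesIO

buffer = BytesIO()
buffer.write(b"GeoTIFF bytes would go here")

# The position now sits at the end, so a read returns nothing.
at_end = buffer.read()

# Rewinding to the start makes the whole buffer readable again.
buffer.seek(0)
after_rewind = buffer.read()

print(at_end, after_rewind)
```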

Zip files in a buffer

The ZipMemoryFile class is mostly the same, but is for use with a buffer that contains a zip archive.

from rasterio.io import ZipMemoryFile

fp = open('data/files.zip', 'rb')

with ZipMemoryFile(fp) as zipmem:
    with zipmem.open('RGB.byte.tif') as dataset:
        print(dataset.profile)

This is much the same interface as that of zipfile.ZipFile.
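
For comparison, here is the parallel pattern with zipfile.ZipFile from the standard library. The archive is built in memory and its single member holds placeholder bytes rather than a real GeoTIFF, so this sketch shows only the shape of the interface.

```python
from io import BytesIO
import zipfile

# Build a small archive in memory; the member content is placeholder bytes.
archive = BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("RGB.byte.tif", b"placeholder, not a real GeoTIFF")

# Reading mirrors ZipMemoryFile: wrap the buffer, then open a member by name.
archive.seek(0)
with zipfile.ZipFile(archive) as zf:
    with zf.open("RGB.byte.tif") as member:
        content = member.read()

print(content)
```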

Writing in-memory zip files

Writing to a ZipMemoryFile is not currently supported, but it is possible to do so using Python's zipfile library and Rasterio's MemoryFile together.

from io import BytesIO
import zipfile

with BytesIO() as bytes_buffer:
    with zipfile.ZipFile(bytes_buffer, 'w') as zf:

        with MemoryFile() as memfile:
            with memfile.open(**rgb_profile) as dataset:
                dataset.write(rgb_data)
                
            memfile.seek(0)
            zf.writestr('foo.tif', memfile.read())

    bytes_buffer.seek(0)
    with ZipMemoryFile(bytes_buffer).open('foo.tif') as dataset:
        print(dataset.profile)

Final notes on convenience features

By popular request, rasterio.open() can also take a file object opened in binary modes 'rb' or 'wb' as its first argument.

with open('data/RGB.byte.tif', 'rb') as f:
    with rasterio.open(f) as dataset:
        print(dataset.profile)

A MemoryFile is created internally to hold the bytes read from the input file object. This is therefore not the best way to read or write datasets already on disk and addressable by name.

As is the case for every printed profile, the output is the following.

{'tiled': False, 'transform': Affine(300.0379266750948, 0.0, 101985.0,
       0.0, -300.041782729805, 2826915.0), 'width': 791, 'dtype': 'uint8', 'interleave': 'pixel', 'driver': 'GTiff', 'crs': CRS({'init': 'epsg:32618'}), 'count': 3, 'height': 718, 'nodata': 0.0}

Rasterio has different ways to access datasets located on disk or at network addresses and datasets located in memory buffers. The features are acquired from GDAL, but the abstractions are different, more Pythonic.

@sgillies sgillies added the devlog label Feb 3, 2017
@sgillies sgillies added this to the 1.0 milestone Feb 13, 2017
@sgillies sgillies added this to Done in Rasterio 1.0.0 Final Jun 8, 2017
@sgillies sgillies removed this from Done in Rasterio 1.0.0 Final Apr 4, 2018
@sgillies sgillies removed this from the 1.0 milestone Jun 21, 2018

jhamman commented Jun 11, 2020

@sgillies - quick question on one part of this document (which is very useful by the way).

"You can also pass a file-like object opened in binary mode to MemoryFile(). This is for convenience only, the bytes of the file are read immediately into a bytes object."

Can you help me understand the technical reasons as to why this is the case? Would it be possible in rasterio to pass file-like objects that are not immediately read into memory? Obviously some streaming access is possible for other buffer types (local file, s3, etc.).

@martindurant

Yes, you would think that you can pass any type compatible with that MemoryFile interface, where methods to return sets of bytes from the target were implemented to read on demand, not hold everything literally in memory.

@sgillies

@jhamman GDAL requires a GByte * (unsigned char *) buffer: https://gdal.org/api/cpl.html#_CPPv420VSIFileFromMemBufferPKcP5GByte12vsi_l_offseti, hence we read file objects entirely into memory and then provide that buffer to GDAL.
