Failure to retry a read for an s3-COG (GeoTIFF on AWS s3) #2119

Open
dazza-codes opened this issue Feb 18, 2021 · 9 comments
@dazza-codes (Contributor)

When rasterio reads from S3 and hits an s3 throttle limit (503), code that wraps the reader in a retry block fails to retry (sample function below). It appears that rasterio/GDAL has registered the dataset as some kind of missing dataset (s3 object) and refuses to retry reading it.

2021-02-18 08:03:18,849 | INFO | raster_file_metadata:1671 | Reading GeoTIFF metadata: s3://tst-research-dem/n40w085.tif
[WARNING] 2021-02-18T08:03:18.935Z 6b4780eb-93f1-4226-a7c0-8564329c21c5 CPLE_AppDefined in HTTP response code on https://tst-research-dem.s3.amazonaws.com/n40w085.tif: 503
2021-02-18 08:03:18,998 | ERROR | raster_file_metadata:1681 | Please reduce your request rate.

# first retry:
2021-02-18 08:03:19,720 | ERROR | raster_file_metadata:1681 | '/vsis3/tst-research-dem/n40w085.tif' does not exist in the file system, and is not recognized as a supported dataset name.

# second retry:
2021-02-18 08:03:21,032 | ERROR | raster_file_metadata:1681 | '/vsis3/tst-research-dem/n40w085.tif' not recognized as a supported file format.

Sample function to read/retry a COG metadata:

# assume the geotiff arg is "s3://tst-research-dem/n40w085.tif"
# LOGGER, RETRY_DELAY, and raster_metadata() are module-level values defined elsewhere

import random
import time
from typing import Dict, Optional

import rasterio
import rasterio.errors


def raster_file_metadata(geotiff: str) -> Optional[Dict]:
    LOGGER.info("Reading GeoTIFF metadata: %s", geotiff)
    with rasterio.Env():
        retry_jitter = random.uniform(0.1, 0.3)
        tries = 0
        while tries < 3:
            tries += 1
            try:
                with rasterio.open(geotiff) as src:
                    # raster_metadata function only uses src.* metadata methods
                    return raster_metadata(src)
            except rasterio.errors.RasterioIOError as err:
                LOGGER.error(err)
                time.sleep(RETRY_DELAY + retry_jitter)
    return None  # all three attempts raised RasterioIOError

Are there better ways to use rasterio to retry failed reads of s3-COG metadata and data? Are there values for rasterio.Env() or other configuration details that avoid this failure to retry when the first read hits an s3 throttle error (503)? Is this a known feature of rasterio/GDAL, or is it a bug?

Versions are from binary wheel installations (pip only, with the GDAL libs bundled in rasterio), running on AWS Lambda runtime containers for Python 3.7:

$ poetry show rasterio
name         : rasterio
version      : 1.2.0
description  : Fast and direct raster I/O for use with Numpy and SciPy

dependencies
 - affine *
 - attrs *
 - certifi *
 - click >=4.0,<8
 - click-plugins *
 - cligj >=0.5
 - numpy *
 - snuggs >=1.4.1

$ python
Python 3.7.9 | packaged by conda-forge | (default, Dec  9 2020, 21:08:20) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import rasterio
>>> rasterio.gdal_version()
'3.2.1'
@sgillies (Member)

This sounds related to #1877 and OSGeo/gdal#2294, but must be subtly different.

@dazza-codes (Contributor, Author)

Comments / questions on those links:

  • the semantics for HTTP (/vsis3) on a 503 (a retry could work) differ from a 403/404 (don't bother retrying; not authorized or does not exist)
  • is there a recommended rasterio option to force-clear the cache on exit of the rasterio.Env?
    • is it already released in a pip package with a GDAL binary wheel?
    • would it effectively call gdal.VSICurlClearCache(), or set some other GDAL/CURL flag that might "escape" the rasterio.Env and be set at a session level?
    • is the recommended pattern to use something like CPL_VSIL_CURL_NON_CACHED=/vsis3/bucket/alpha.tif in a rasterio.Env() that wraps a call to one specific s3-COG? (see the sketch after this list)
    • it is not easy to understand the scope of some GDAL env settings, how they are (or are not) managed by rasterio.Env, and how they could be isolated to specific s3-COG reads and retries.
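
A minimal sketch of the CPL_VSIL_CURL_NON_CACHED pattern referenced above (the bucket/key follow the issue example; whether this option alone is enough to avoid the stale "does not exist" state is an assumption):

import rasterio

# scope the non-cached behaviour to this one /vsis3 path only
vsis3_path = "/vsis3/tst-research-dem/n40w085.tif"
with rasterio.Env(CPL_VSIL_CURL_NON_CACHED=vsis3_path):
    with rasterio.open("s3://tst-research-dem/n40w085.tif") as src:
        print(src.profile)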

@dazza-codes (Contributor, Author) commented Feb 24, 2021

This might not be an optimal pattern, but it's a first pass at a function to read s3-COG metadata with actual retries:

def raster_file_metadata(geotiff: str) -> Optional[Dict]:
    """
    Read raster metadata.

    :param geotiff: a GeoTIFF file
    :return: a dictionary of metadata, if the geotiff exists
    """
    LOGGER.info("Reading GeoTIFF metadata: %s", geotiff)
    retry_delay = random.uniform(0.1, 0.5)
    retry_jitter = retry_delay / 10
    max_tries = 3
    num_tries = 0
    while num_tries < max_tries:
        num_tries += 1
        try:
            gdal_env = {}  # TODO: add GDAL env-vars
            if geotiff.startswith("s3:"):
                vsis3_path = geotiff.replace("s3:/", "/vsis3")
                gdal_env["CPL_VSIL_CURL_NON_CACHED"] = vsis3_path
            with rasterio.Env(**gdal_env):
                with rasterio.open(geotiff) as src:
                    return raster_metadata(src)
        except rasterio.errors.RasterioIOError as err:
            LOGGER.error(err)
            time.sleep(retry_delay + num_tries * retry_jitter)

Update: something like this function can run into a different kind of error:

[WARNING] 2021-03-01T19:33:49.298Z b1de2cf9-9ba4-4d9d-8ea8-8369b8a238e8 CPLE_AppDefined in HTTP response code on https://{bucket}.s3.amazonaws.com/{key-path}/{file-name}.tif: 500

This might be related to additional GDAL env-vars.

@sgillies (Member) commented Mar 1, 2021

Below https://gdal.org/user/virtual_file_systems.html#vsicurl-http-https-ftp-files-random-access I found this sentence.

Starting with GDAL 2.3, the GDAL_HTTP_MAX_RETRY (number of attempts) and GDAL_HTTP_RETRY_DELAY (in seconds) configuration option can be set, so that request retries are done in case of HTTP errors 429, 502, 503 or 504.

It's my understanding that this can also be used for vsis3 requests. Could you give that a try before we see about patching GDAL or enabling workarounds in rasterio?
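
For example, a minimal sketch of passing those options through rasterio.Env (the dataset path follows the issue example; the delay value is illustrative):

import rasterio

# let GDAL retry 429/502/503/504 responses itself before raising an error
with rasterio.Env(GDAL_HTTP_MAX_RETRY=3, GDAL_HTTP_RETRY_DELAY=0.5):
    with rasterio.open("s3://tst-research-dem/n40w085.tif") as src:
        print(src.meta)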

@dazza-codes (Contributor, Author)

If that works, how does rasterio expose any HTTP status codes so that consuming code can make smarter decisions about how to handle IO errors? If it does not yet, would it be possible for rasterio to expose HTTP status codes on /vsis3 read requests? Only an error message seems to be available on a rasterio IOError, e.g.

(Pdb) err.value
RasterioIOError("'/vsis3/key-path/missing.tif' does not exist in the file system,
 and is not recognized as a supported dataset name.")
(Pdb) err.value.args
("'/vsis3/key-path/missing.tif' does not exist in the file system, and is not recognized as a supported dataset name.",)
(Pdb) err.value.args[0]
"'/vsis3/key-path/missing.tif' does not exist in the file system, and is not recognized as a supported dataset name."
(Pdb) dir(err.value)
['__cause__', '__class__', '__context__', '__delattr__', '__dict__', '__dir__', 
'__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', 
'__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', 
'__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__',
'__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__suppress_context__', 
'__traceback__', '__weakref__', 'args', 'characters_written', 'errno', 'filename', 
'filename2', 'strerror', 'with_traceback']
(Pdb) err.value.errno is None
True

The HTTP response might be buried too deep in the library stack (?).

Traceback (most recent call last):
  File "rasterio/_base.pyx", line 260, in rasterio._base.DatasetBase.__init__
  File "rasterio/_shim.pyx", line 78, in rasterio._shim.open_dataset
  File "rasterio/_err.pyx", line 215, in rasterio._err.exc_wrap_pointer
rasterio._err.CPLE_OpenFailedError: '/vsis3/key-path/missing.tif' does not exist in the file system, and is not recognized as a supported dataset name.
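
Since the status code is not surfaced, one possible workaround (not a rasterio API; a sketch using boto3, with a hypothetical bucket/key) is to check whether the object exists before deciding if a RasterioIOError is worth retrying:

import boto3
from botocore.exceptions import ClientError

def s3_object_exists(bucket: str, key: str) -> bool:
    """HEAD the object; a 404 means missing, other error codes may be transient."""
    try:
        boto3.client("s3").head_object(Bucket=bucket, Key=key)
        return True
    except ClientError as err:
        # treat anything other than a 404 as "possibly exists, worth retrying"
        return err.response["Error"]["Code"] != "404"

# e.g. only keep retrying rasterio.open() when the object is actually there
if s3_object_exists("a-bucket", "missing.tif"):
    pass  # retry the read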

@sgillies (Member) commented Mar 2, 2021

Right, @dazza-codes, the responses aren't accessible from a GDAL API. You can see them in stderr if you set CPL_CURL_VERBOSE=1, but that's it. In a better world, GDAL would let us bring our own HTTP client, but it does not.
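
A minimal sketch of enabling that from Python via rasterio.Env (the verbose curl traffic goes to stderr; the dataset path follows the issue example):

import rasterio

# CPL_CURL_VERBOSE dumps libcurl request/response details to stderr
with rasterio.Env(CPL_DEBUG=True, CPL_CURL_VERBOSE=True):
    with rasterio.open("s3://tst-research-dem/n40w085.tif") as src:
        # a small windowed read keeps the request log short
        src.read(1, window=((0, 256), (0, 256)))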

@dazza-codes (Contributor, Author) commented Mar 2, 2021

I'll run some load tests with updated settings and report back when requests hit 503s on s3-COG reads. The updated settings explore the use of:

        # Starting with GDAL 2.3, the GDAL_HTTP_MAX_RETRY (number of attempts) and
        # GDAL_HTTP_RETRY_DELAY (in seconds) configuration option can be set, so that request
        # retries are done in case of HTTP errors 429, 502, 503 or 504.
        gdal_env["GDAL_HTTP_MAX_RETRY"] = 3
        gdal_env["GDAL_HTTP_RETRY_DELAY"] = 0.5  # default is 30 sec

BTW, CPL_CURL_VERBOSE=1 is very useful for debugging and profiling. When reading an unknown number of points with src.sample from an s3-COG, setting the curl chunk size can dramatically reduce the number of partial-content GET requests to s3. Reducing the number of GET requests also helps to avoid s3 rate-throttling errors. For example,

        # Partial downloads (requires the HTTP server to support random reading) are done with a
        # 16 KB granularity by default. Starting with GDAL 2.3, the chunk size can be configured
        # with the CPL_VSIL_CURL_CHUNK_SIZE configuration option, with a value in bytes. If the
        # driver detects sequential reading it will progressively increase the chunk size up to
        # 2 MB to improve download performance.
        if n_points > 100:
            gdal_env["CPL_VSIL_CURL_CHUNK_SIZE"] = 5 * 16384
        elif n_points > 1:
            gdal_env["CPL_VSIL_CURL_CHUNK_SIZE"] = 3 * 16384

When reading with src.sample for random points (or cells) across, say, 3 bands (or indexes), I wonder if the s3-COG should be written out with interleaved bands? (That is, maybe src.sample could then read contiguous bytes for a cell across N bands.) While profiling, our sample s3-COG has:

Image Structure Metadata:
  INTERLEAVE=BAND
Band 1 Block=514x7 Type=UInt16, ColorInterp=Gray
Band 2 Block=514x7 Type=UInt16, ColorInterp=Undefined
Band 3 Block=514x7 Type=UInt16, ColorInterp=Undefined
Band 4 Block=514x7 Type=UInt16, ColorInterp=Undefined
Band 5 Block=514x7 Type=UInt16, ColorInterp=Undefined

It might be tricky to manage this setting, depending on how many files are in an s3 prefix, and it is a little counter-intuitive, but allowing GDAL to read the s3 prefix to list the objects seems to result in fewer s3 GET requests overall (and more efficient 404s with better error messages).

        # By default (GDAL_DISABLE_READDIR_ON_OPEN=FALSE), GDAL establishes a list
        # of all the files in the directory of the file passed to GDALOpen(). This
        # can result in speed-ups in some use cases, but also to major slow downs
        # when the directory contains thousands of other files. When set to TRUE,
        # GDAL will not try to establish the list of files.
        gdal_env["GDAL_DISABLE_READDIR_ON_OPEN"] = False

@dazza-codes (Contributor, Author) commented Mar 3, 2021

Update: it seems to work with GDAL_HTTP_MAX_RETRY. That might be all that is required to close this ticket, or it could be transformed into a documentation issue.

Some relevant GDAL env-var settings used to load-test s3-COG reads:

    retry_delay = random.uniform(0.2, 0.6)
    retry_jitter = retry_delay / 10
    retry_delay += retry_jitter
    max_tries = 3
    gdal_env = {
        "GDAL_CACHEMAX": 1024_000_000,  # 1 Gb in bytes
        "GDAL_DISABLE_READDIR_ON_OPEN": False,
        # # debug options for libcurl verbose outputs
        # "CPL_DEBUG": True,
        # "CPL_CURL_VERBOSE": True,
    }
    if geotiff.startswith("s3:"):
        # https://gdal.org/user/virtual_file_systems.html#vsicurl-http-https-ftp-files-random-access
        # Note that CPL_VSIL_CURL_NON_CACHED is NOT set
        gdal_env["VSI_CACHE"] = True
        gdal_env["VSI_CACHE_SIZE"] = gdal_env["GDAL_CACHEMAX"]
        gdal_env["GDAL_HTTP_MAX_RETRY"] = max_tries
        gdal_env["GDAL_HTTP_RETRY_DELAY"] = retry_delay

    try:
        with rasterio.Env(**gdal_env) as rio_env:
            with rasterio.open(geotiff) as src:
                # do stuff
                return stuff
    except rasterio.errors.RasterioIOError as err:
        LOGGER.error("%s rasterio read failed", geotiff)
        LOGGER.error(err)
        raise

An example log under load-testing where s3 rate throttling is expected and observed as 503 errors:

[WARNING] 2021-03-03T19:37:40.312Z CPLE_AppDefined in HTTP error code: 503 - https://a-bucket.s3.amazonaws.com/geotiff.aux. Retrying again in 0.5 secs
[WARNING] 2021-03-03T19:37:40.795Z CPLE_AppDefined in HTTP error code: 503 - https://a-bucket.s3.amazonaws.com/geotiff.aux. Retrying again in 1.0 secs
[WARNING] 2021-03-03T19:37:41.796Z CPLE_AppDefined in HTTP error code: 503 - https://a-bucket.s3.amazonaws.com/geotiff.aux. Retrying again in 2.2 secs
[WARNING] 2021-03-03T19:37:44.290Z CPLE_AppDefined in HTTP response code on https://a-bucket.s3.amazonaws.com/geotiff.aux: 503
2021-03-03 19:37:44,155 | INFO | extract_raster_points:1036 | s3://a-bucket/geotiff.tif extract cells (1)...
2021-03-03 19:37:44,347 | INFO | extract_raster_points:1050 | s3://a-bucket/geotiff.tif extract cells (1) done

One possible performance problem with this approach is that it applies to any GDAL read for any of the files GDAL searches for, such as geotiff.aux sidecar files. In our s3 datasets there are only *.tif files, with no aux or other sidecar files. We need to ensure that GDAL only looks for the geotiff.tif file itself and nothing else; otherwise it can take 3 retries and a useless 2+ seconds searching for files that we already know are not there and GDAL should not look for.

Additional GDAL options:

        # narrow the allowed files to read
        gdal_env["GDAL_DISABLE_READDIR_ON_OPEN"] = "EMPTY_DIR"
        gdal_env["CPL_VSIL_CURL_ALLOWED_EXTENSIONS"] = ".tif"

@rasterio rasterio deleted a comment from sid-ks May 3, 2021
@rasterio rasterio deleted a comment from sid-ks May 3, 2021
@sgillies (Member) commented May 3, 2021

I deleted some irrelevant comments. Please don't do this, people. Use the discussion group linked in the README under "support".
