Read errors when reading COGs from S3 with many threads #1828

Closed
Kirill888 opened this issue Nov 12, 2019 · 9 comments
@Kirill888
Contributor

Expected behavior and actual behavior.

I'm reading a number of Cloud Optimized GeoTIFF images from a public S3 bucket with the aws_unsigned=True option, using many threads to speed things up. Once the number of threads gets "high enough", I start seeing errors of this kind:

RasterioIOError: Read or write failed. /vsis3/landsat-pds/L8/100/072/LC81000722013115LGN01/LC81000722013115LGN01_B1.TIF, band 1: IReadBlock failed at X offset 2, Y offset 11: TIFFReadEncodedTile() failed.

To be honest, I suspect the problem is in the GDAL code. The error seems to be triggered more often when reading the entire raster at once. In the test example linked below I have not been able to trigger it when reading data in stripes, but I have seen it happen elsewhere even when reading only part of a file.
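To make the two read patterns concrete, here is a minimal sketch of a whole-band read next to a striped read. It is not the code from the linked gist: `stripe_bounds`, `read_whole` and `read_in_stripes` are hypothetical helper names, the stripe height of 512 rows is an arbitrary choice, and `src` is assumed to be an already open rasterio dataset.

```python
def stripe_bounds(height, stripe):
    """Yield (row_off, n_rows) pairs that cover `height` rows in chunks of `stripe`."""
    for row_off in range(0, height, stripe):
        yield row_off, min(stripe, height - row_off)


def read_whole(src):
    """Read band 1 in one call (the pattern that fails more often in this report)."""
    return src.read(1)


def read_in_stripes(src, stripe=512):
    """Read band 1 as full-width horizontal stripes, one window at a time."""
    # Imported lazily so the helpers above can be used without rasterio installed.
    from rasterio.windows import Window

    for row_off, n_rows in stripe_bounds(src.height, stripe):
        yield src.read(1, window=Window(0, row_off, src.width, n_rows))
```

Each stripe is a separate, smaller request to GDAL, so if the failure depends on how many blocks are in flight at once, this pattern would plausibly be less likely to trigger it.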

I believe this is the same issue as reported in #1686.

Steps to reproduce the problem.

see here:

https://gist.github.com/Kirill888/55148f21e0dcc2cf3d88e9e6abd349f7

With this example I have only observed errors at read time, but in "production" I have seen failures during open as well. I'm using a public bucket for convenience of reporting, but I have observed the same problems with signed S3 requests.
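For readers who do not want to open the gist, the shape of the workload is roughly as follows. This is a sketch reconstructed from the description and the traceback in this report, not the gist's actual code: `open_unsigned`, `read_one` and `read_many` are hypothetical names, the default of 32 threads is arbitrary, and entering `rasterio.Env` inside each worker rests on the assumption that environment settings do not carry over to other threads.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager


@contextmanager
def open_unsigned(url):
    """Open `url` with unsigned S3 requests and no dataset sharing."""
    # Imported lazily so the rest of the sketch loads without rasterio installed.
    import rasterio

    with rasterio.Env(aws_unsigned=True):
        with rasterio.open(url, sharing=False) as src:
            yield src


def pixel_sha1(src):
    """Hash band 1 of an open dataset, read in a single call."""
    pix = src.read(1)
    return hashlib.sha1(pix.tobytes()).hexdigest()


def read_one(url, opener=open_unsigned):
    """Return (url, sha1, error); errors are captured instead of raised."""
    try:
        with opener(url) as src:
            return (url, pixel_sha1(src), None)
    except Exception as e:
        return (url, None, e)


def read_many(urls, n_threads=32, opener=open_unsigned):
    """Run read_one over `urls` with a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(lambda u: read_one(u, opener), urls))
```

Collecting the exception in the result tuple keeps one bad read from hiding the rest, and comparing hashes across runs shows whether a read that did not raise still returned different pixels.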

Operating system

Ubuntu 18.04, running in AWS us-west-2, m5.4xlarge.

Rasterio version and provenance

1.1.0, installed in binary mode from PyPI

---------------------------------------------------------------------------
CPLE_AppDefinedError                      Traceback (most recent call last)
rasterio/_io.pyx in rasterio._io.DatasetReaderBase._read()

rasterio/shim_rasterioex.pxi in rasterio._shim.io_multi_band()

rasterio/_err.pyx in rasterio._err.exc_wrap_int()

CPLE_AppDefinedError: /vsis3/landsat-pds/L8/100/072/LC81000722013115LGN01/LC81000722013115LGN01_B1.TIF, band 1: IReadBlock failed at X offset 2, Y offset 11: TIFFReadEncodedTile() failed.

During handling of the above exception, another exception occurred:

RasterioIOError                           Traceback (most recent call last)
<ipython-input-7-6ea40a64db9e> in <module>
----> 1 raise bad[0][1]

<ipython-input-1-8052a1cc6ae9> in test_workload(url, rio_opts, stripe)
     41     try:
     42         with rasterio.open(url, sharing=False) as src:
---> 43             return (url, pixel_sha1(src, stripe), None)
     44     except Exception as e:
     45         return (url, None, e)

<ipython-input-1-8052a1cc6ae9> in pixel_sha1(src, stripe)
     13     _hash = hashlib.sha1()
     14     if stripe is None:
---> 15         pix = src.read(1)
     16         _hash.update(pix.data)
     17     else:

rasterio/_io.pyx in rasterio._io.DatasetReaderBase.read()

rasterio/_io.pyx in rasterio._io.DatasetReaderBase._read()

RasterioIOError: Read or write failed. /vsis3/landsat-pds/L8/100/072/LC81000722013115LGN01/LC81000722013115LGN01_B1.TIF, band 1: IReadBlock failed at X offset 2, Y offset 11: TIFFReadEncodedTile() failed.
@sgillies
Member

@Kirill888 there was a thread recently on gdal-dev where @rouault suspects that there is a problem in the GDAL block cache or GTiff driver.

https://lists.osgeo.org/pipermail/gdal-dev/2019-October/051016.html
https://lists.osgeo.org/pipermail/gdal-dev/2019-November/051026.html

And there are reports that seem related like this one: OSGeo/gdal#1244.

@Kirill888
Contributor Author

@sgillies thanks, I'll read through those. From a quick glance it does look like this is the same issue.

@Kirill888
Contributor Author

The fact that larger reads are more likely to fail is consistent with the cache hypothesis. I have tried disabling the cache and still saw errors, but as discussed in that issue, the cache cannot be disabled fully; it is still used.
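The comment does not say which settings were tried. One plausible reading, shown here purely as an assumption, is shrinking GDAL's raster block cache to zero through the `GDAL_CACHEMAX` configuration option passed via `rasterio.Env`. `NO_BLOCK_CACHE` and `read_without_block_cache` are hypothetical names.

```python
# Assumed options; the comment above does not list the ones actually tried.
NO_BLOCK_CACHE = {"GDAL_CACHEMAX": 0}


def read_without_block_cache(url, opts=NO_BLOCK_CACHE):
    """Read band 1 of `url` with the block cache limit set to zero."""
    # Imported lazily so the module loads without rasterio installed.
    import rasterio

    with rasterio.Env(aws_unsigned=True, **opts):
        with rasterio.open(url, sharing=False) as src:
            return src.read(1)
```

As noted in the comment, a setting like this limits the cache rather than removing it, so errors that persist under it do not rule the cache out.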

@rouault
Contributor

rouault commented Nov 12, 2019

> there was a thread recently on gdal-dev where @rouault suspects that there is a problem in the GDAL block cache or GTiff driver.
>
> https://lists.osgeo.org/pipermail/gdal-dev/2019-October/051016.html
> https://lists.osgeo.org/pipermail/gdal-dev/2019-November/051026.html

For the record, this recent thread has nothing to do with /vsicurl or /vsis3 issues. It is a concurrency issue with writes to datasets and can be reproduced with only local files.

@rouault
Contributor

rouault commented Nov 12, 2019

Proposed fix for the /vsis3/ issue in OSGeo/gdal#2012

@sgillies
Member

Thanks @rouault ! @Kirill888 I'm going to try your example with a patched GDAL 3.0 now.

@sgillies
Member

@Kirill888 using your notebook, I didn't see any failures with my patched GDAL.

@sgillies
Member

I'm going to patch GDAL 2.4.3 in the wheels we upload to PyPI. See rasterio/rasterio-wheels#30.

@Kirill888
Contributor Author

Awesome, thanks @sgillies and @rouault
