Concurrently reading from COGs through vsicurl fails with a segfault in 1.1.3 #1876
Comments
@mihi314 sorry about the trouble and thank you for the traceback. I've found a few other issues and discussions that look related, like openssl/openssl#2809, and am digging into them. Fiona 1.8.13.post1 and rasterio 1.1.3 introduced GDAL 2.4.4 in their wheels. There is a small chance this is a GDAL bug and I will look into that as well. |
Thanks for looking into this! I've tried some more things and noticed that the reading part is not strictly necessary. The error also occurs when just opening the file, without reading anything. It also occurs with non-white images, although much less reliably. So it might be somewhat related to OSGeo/gdal#1244 after all? |
@mihi314 thanks for providing the program. I've been able to trigger the And I have another data point. After getting the crash, I edited the program to comment out the import of fiona, ran it again and see a thread-related GDAL error.
|
I've updated the test program so that we're using the latest and greatest Python modules.

    from concurrent.futures import ThreadPoolExecutor

    import fiona
    import rasterio


    def read(file):
        with rasterio.open(file, sharing=False) as src:
            print("starting to read", flush=True)
            src.read(window=rasterio.windows.Window(0, 0, 5000, 5000))
            print("done reading", flush=True)


    failing = "/vsicurl/https://storage.googleapis.com/temporary_eu_west_4/mostly_white.tif"

    with ThreadPoolExecutor(4) as executor:
        executor.map(read, [failing]*4)

This will produce the crash with rasterio 1.1.3 and fiona 1.8.13.post1. If I downgrade fiona to 1.8.13, which has GDAL 2.4.3, I cannot produce the crash. This seems to be a clue that there is a bug related to GDAL 2.4.4. |
I downgraded to rasterio 1.1.2, which also has GDAL 2.4.3, and can produce the crash. It's not specific to GDAL 2.4.4. Looks to me like a function from fiona's copy of libgdal or libcurl is being called unexpectedly and results in some operation on uninitialized data. I guess when different versions of GDAL are in play this does not happen (waves hands). I cannot produce the crash with a single worker thread (1 instead of 4 in the executor). |
Here's the backtrace I'm looking at. A function in rasterio's libgdal calls a function in fiona's libcurl (we're starting to go off the rails right there), which then eventually calls a function in rasterio's libgdal (a curl callback maybe?). And... splat.
|
I'm unable to reproduce the crash on my macbook using rasterio 1.1.3 and fiona 1.8.13.post1 wheels (GDAL 2.4.4). Looks like the differences between curl and openssl in the wheels are important. |
Yes, it seems to have something to do with the shared libraries that get loaded. I tried comparing the combination rasterio==1.1.3 and fiona==1.8.13.post1 with rasterio==1.1.3 and fiona==1.8.13. The first case is the failing one; the other one works. When looking at the relevant opened libraries with strace, I see the following difference:
rasterio==1.1.3 and fiona==1.8.13.post1:
rasterio==1.1.3 and fiona==1.8.13:
I.e. in the working case, fiona and rasterio both load their own version of libcurl and libgeos. In the failing case however only the fiona versions get loaded. Possibly because these were already loaded by fiona, as the rasterio==1.1.3 wheels also ship with Exactly why this should lead to the crashes however I have no clue. |
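For anyone who wants to repeat this comparison without strace, here is a minimal sketch that lists which copies of a library are mapped into a process by parsing the text of `/proc/self/maps`. It is Linux-only, it is an alternative to the strace approach described above rather than what was actually used, and the function names `loaded_libraries` and `current_process_libraries` are made up for illustration.

```python
def loaded_libraries(maps_text, names=("libcurl", "libgdal", "libgeos")):
    """Return {name: sorted distinct paths} of mapped shared objects.

    `maps_text` is the content of /proc/<pid>/maps. Each line has the form
    "address perms offset dev inode pathname"; anonymous mappings have no
    pathname and are skipped.
    """
    found = {name: set() for name in names}
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, nothing to report
        path = fields[5].strip()
        basename = path.rsplit("/", 1)[-1]
        for name in names:
            if basename.startswith(name) and ".so" in basename:
                found[name].add(path)
    return {name: sorted(paths) for name, paths in found.items()}


def current_process_libraries(names=("libcurl", "libgdal", "libgeos")):
    """Convenience wrapper for the running process (Linux only)."""
    with open("/proc/self/maps") as f:
        return loaded_libraries(f.read(), names)
```

Calling `current_process_libraries()` after `import fiona` and `import rasterio` should show, for each library, whether one copy or two copies ended up loaded, which is the difference between the failing and the working combination described above.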
@mihi314 thank you for that update. I am not sure why we have libcurl-ed5c192c.so.4.4.0 in the fiona 1.8.13 wheel and libcurl-ea538880.so.4.4.0 in the rasterio 1.1.3 wheel. I thought I was compiling curl the same way in each wheel builder; I'll look into that. I'm slightly optimistic that making sure the libs have unique hashes might help a lot. |
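To make the "unique hashes" idea concrete, here is a rough sketch that flags bundled library names shipped by more than one wheel. It assumes the collision can be seen in the file names alone, using file names as a stand-in for SONAMEs, which this thread does not confirm. The function name `find_name_collisions` is hypothetical, and the colliding names in the usage note are invented placeholders.

```python
from collections import defaultdict


def find_name_collisions(bundles):
    """Return {file name: sorted package names} for shared-library file
    names that are bundled by more than one package.

    `bundles` maps a package name to the file names of the shared
    libraries bundled in its wheel.
    """
    owners = defaultdict(set)
    for package, filenames in bundles.items():
        for filename in filenames:
            owners[filename].add(package)
    return {
        name: sorted(packages)
        for name, packages in owners.items()
        if len(packages) > 1
    }
```

With the two names quoted above, `find_name_collisions({"fiona": ["libcurl-ed5c192c.so.4.4.0"], "rasterio": ["libcurl-ea538880.so.4.4.0"]})` returns an empty dict, which matches the working combination. Two wheels that both shipped, say, a placeholder `libcurl-aaaa.so.4` would be reported as a collision.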
I'm happy to say that I can't reproduce the segfault using rasterio 1.1.4.dev0 wheels built by rasterio/rasterio-wheels#47 and fiona 1.8.13.post1. I still do see a printed warning about a mutex that I mentioned in OSGeo/gdal#2278, but it doesn't halt the program. |
In conclusion, this is not a rasterio or GDAL bug, but a problem with shared library SONAMEs that I'm solving in rasterio/rasterio-wheels#44. |
Thank you very much :) |
Expected behavior and actual behavior.
Concurrently reading from COGs through vsicurl fails with a segfault or other error:
*** Error in `python3': double free or corruption (!prev): 0x00007fddb4088da0 ***
The accompanying backtrace:
There are a few conditions needed for the issue to pop up:
Steps to reproduce the problem.
I can reproduce the problem about 80% of the time with the following script:
Using the following docker image:
Built image available here:
mengelhard/rasterio_issue:latest
Rasterio version and provenance
rasterio-1.1.3-cp35-cp35m-manylinux1_x86_64.whl
Fiona-1.8.13.post1-cp35-cp35m-manylinux1_x86_64.whl
We had the same issue with rasterio 1.1.2, but the above script does not seem to reproduce it in that case.