Better defined and tested concurrent Envs #997
@geowurster I'm continuing to use More rigorous tests of the environments are forthcoming.
CPLGetConfigOption() will return the value of the config option, whether it was defined through an environment variable, CPLSetConfigOption(), or CPLSetThreadLocalConfigOption() (from the same thread). CPLGetThreadLocalConfigOption() will return the value of the config option, but only if it has been set with CPLSetThreadLocalConfigOption().
@sgillies How do you feel about just issuing a warning when using Unless the user wants a different configuration per thread this should work on every GDAL version, although I think it requires a change in

from concurrent.futures import ThreadPoolExecutor
import rasterio as rio

def _process(path):
    with rio.open(path) as src:
        pass

with rio.Env(), ThreadPoolExecutor(4) as pool:
    for res in pool.map(_process, ['tests/data/RGB.byte.tif']):
        pass
The main thread uses CPLSetConfigOption and thus child threads inherit from main's configuration. Child threads use the thread local version and are thus isolated from each other. Tests of this behavior have been added and documentation added to _env (and soon to the manual).
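The lookup rules described above can be modeled in plain Python. This is only a toy model, not GDAL's implementation: the `ConfigModel` class, its method names, and the `run_in_thread` helper are hypothetical stand-ins for the CPL functions named in this thread, and the sketch assumes that a thread-local value takes precedence over a process-wide value, which in turn takes precedence over an environment variable.

```python
import os
import threading


class ConfigModel(object):
    """Toy model of GDAL-style config lookup (hypothetical, not GDAL code)."""

    def __init__(self):
        self._global = {}                # process-wide options
        self._local = threading.local()  # per-thread options

    def _thread_options(self):
        # Each thread lazily gets its own private dict.
        if not hasattr(self._local, "options"):
            self._local.options = {}
        return self._local.options

    def set_global(self, key, value):
        # Stand-in for CPLSetConfigOption.
        self._global[key] = value

    def set_thread_local(self, key, value):
        # Stand-in for CPLSetThreadLocalConfigOption.
        self._thread_options()[key] = value

    def get_thread_local(self, key, default=None):
        # Stand-in for CPLGetThreadLocalConfigOption: this thread's values only.
        return self._thread_options().get(key, default)

    def get(self, key, default=None):
        # Stand-in for CPLGetConfigOption: thread-local value first,
        # then process-wide value, then environment variable.
        options = self._thread_options()
        if key in options:
            return options[key]
        if key in self._global:
            return self._global[key]
        return os.environ.get(key, default)


def run_in_thread(func):
    """Run func in a new thread and return its result."""
    result = []
    thread = threading.Thread(target=lambda: result.append(func()))
    thread.start()
    thread.join()
    return result[0]


config = ConfigModel()
config.set_global("GDAL_CACHEMAX", "256")  # set by the main thread


def child_with_override():
    # The override is visible only inside this thread.
    config.set_thread_local("GDAL_CACHEMAX", "64")
    return config.get("GDAL_CACHEMAX")


def child_without_override():
    return (config.get("GDAL_CACHEMAX"),
            config.get_thread_local("GDAL_CACHEMAX"))


overridden = run_in_thread(child_with_override)
inherited, local_only = run_in_thread(child_without_override)
```

In this model a child thread that sets nothing sees the main thread's value through `get`, while `get_thread_local` returns nothing for it, which is why reading with the non-thread-local getter gives child threads inheritance.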
@geowurster I've made the change in Got a few minutes for review @geowurster?
@sgillies sorry, didn't realize you were done. I should have some time tomorrow to review.
I think being forced to wrap threads in a rasterio.Env() might be surprising for some. What do you think about something like this:
import threading

class GDALEnv(object):
    def __init__(self):
        self._have_registered_drivers = False
        # One lock shared by all threads; a fresh Lock() per call guards nothing.
        self._lock = threading.Lock()

    def start(self):
        # The outer check keeps threads from acquiring the lock once drivers are
        # registered, and the inner check avoids a potential race condition.
        if not self._have_registered_drivers:
            with self._lock:
                if not self._have_registered_drivers:
                    GDALAllRegister()
                    self._have_registered_drivers = True
This allows the first thread to quietly register drivers while still allowing the thread to inherit from a parent rasterio.Env() if needed, at the cost of some edge cases and potential surprises for those with only a passing familiarity with GDAL's internals.
I'm starting to think that rasterio.Env() should be completely invisible by default but easy to manage for those with more advanced use cases, and I'm not entirely sure that just wanting to work with different files in different threads fits that advanced use case, especially since it doesn't necessarily require any modifications to the GDAL environment.
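The double-checked locking idea in this comment can be exercised without GDAL. The sketch below is an assumption-laden stand-in: `DriverRegistry`, `_register_all`, and the `register_calls` counter are hypothetical names, and the counter replaces the real driver registration call so that the "register exactly once" behavior can be observed.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class DriverRegistry(object):
    """Hypothetical stand-in for an environment that registers drivers once."""

    def __init__(self):
        self._registered = False
        self._lock = threading.Lock()  # one lock shared by every thread
        self.register_calls = 0

    def _register_all(self):
        # Placeholder for the real registration call (e.g. GDALAllRegister).
        self.register_calls += 1

    def start(self):
        # Fast path: skip the lock entirely once registration is done.
        if not self._registered:
            with self._lock:
                # Re-check: another thread may have registered while we waited.
                if not self._registered:
                    self._register_all()
                    self._registered = True


registry = DriverRegistry()
with ThreadPoolExecutor(max_workers=8) as pool:
    # list() drains the iterator so every call has completed.
    list(pool.map(lambda _: registry.start(), range(32)))
```

The lock has to live on the instance (or module) so that all threads contend for the same object; the flag is set only after registration finishes, so a thread that skips the lock never sees a half-initialized state.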
from rasterio.env import get_gdal_config


class TestThreading(unittest.TestCase):
I think it's worth testing with a multiprocessing.Process() as well as doing a small I/O call to ensure that drivers are actually registered.
@geowurster Agreed. I got it in the next two commits: thread and process pool executors, with and without an Env() in the main thread.
Discovered options needed to be a thread local as well. I'm removing the "WIP" from the title.
@sgillies Will take a look later today.
This looks good! 🎉
I suspect the code coverage drop is due to some tests now running in threads and processes.
I made a few inline comments but none are blockers. Two more small requests:
- Some tests introduced in this PR depend on concurrent.futures, which is not part of the standard library in Python 2, but has been backported as the futures package. We only get it because boto -> s3transfer -> futures. It's probably a good idea to add this to the tests install extras for Python 2.
- Add tests/data/white-gemini-iv.zip to .gitignore
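One way the first request might be handled is to add the backport conditionally when building the test extras. This is a sketch only: the `build_test_requirements` helper, the `extras_require` layout, and the `pytest` entry are illustrative assumptions, not the project's actual setup.py.

```python
import sys


def build_test_requirements(version_info=sys.version_info):
    """Return test dependencies; the names here are illustrative only."""
    requirements = ["pytest"]
    # concurrent.futures joined the standard library in Python 3.2;
    # older interpreters need the 'futures' backport from PyPI.
    if tuple(version_info[:2]) < (3, 2):
        requirements.append("futures")
    return requirements


# Hypothetical fragment that would be passed to setup(extras_require=...).
extras_require = {"test": build_test_requirements()}
```

Declaring the dependency explicitly means the tests no longer rely on it arriving transitively through boto.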
rasterio/_env.pyx
Outdated
"""GDAL and OGR driver management."""
"""GDAL and OGR driver and configuration management

Note: Only the main thread may load drivers. This means that new threads
No longer needed.
rasterio/_env.pyx
Outdated
The main thread always utilizes CPLSetConfigOption. Child threads
utilize CPLSetThreadLocalConfigOption instead. All threads use
CPLGetConfigOption and not CPLGetThreadLocalConfigOption, thus child
threads will inherit config options from the main thread.
Might not hurt to add "unless the option is set to a new value inside the thread"
rasterio/env.py
Outdated
local = ThreadEnv()

# When the outermost 'rasterio.Env()' executes '__enter__' it probes the
Could you move this long comment closer to _discovered_options, maybe just as a docstring on ThreadEnv()? It may provide those dealing with both osgeo and Rasterio with some important background info.
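For readers unfamiliar with per-thread state, the following sketch shows how a `threading.local` subclass keeps attributes such as `_discovered_options` separate in each thread. It assumes `ThreadEnv` is such a subclass; the attribute names and the `SOME_OPTION` key are illustrative and may not match the code under review.

```python
import threading


class ThreadEnv(threading.local):
    """Per-thread state; attribute names are assumptions for illustration."""

    def __init__(self):
        # threading.local runs __init__ once in each thread that touches
        # the object, so every thread starts from these defaults.
        self._env = None
        self._discovered_options = None


local = ThreadEnv()
# Set in the main thread only; SOME_OPTION is a made-up option name.
local._discovered_options = {"SOME_OPTION": "ON"}


def read_in_child():
    # A new thread sees its own freshly initialized attributes.
    return local._discovered_options


seen = []
thread = threading.Thread(target=lambda: seen.append(read_in_child()))
thread.start()
thread.join()
```

Because each thread gets its own copy, options discovered in the main thread are not visible as Python state in a child thread, which is consistent with the remark that discovered options needed to be thread local.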
This also closes #986.
Includes the work in PR #993.
Resolves #996.