Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better defined and tested concurrent Envs #997

Merged
merged 10 commits into from Mar 28, 2017
Merged

Better defined and tested concurrent Envs #997

merged 10 commits into from Mar 28, 2017

Conversation

sgillies
Copy link
Member

@sgillies sgillies commented Mar 16, 2017

Includes the work in PR #993.

Resolves #996.

@sgillies
Copy link
Member Author

@geowurster I'm continuing to use CPLGetConfigOption in this branch because 1) the thread local version doesn't exist in GDAL 1.11 and 2) I don't understand the difference between CPLGetConfigOption and CPLGetThreadLocalConfigOption in GDAL 2.1. Each get a pointer to thread-local config options, right? The former uses a mutex (in a way I don't understand) and falls back on environment variables, the latter does not. @rouault can you explain what I'm missing and whether I'm going to experience trouble by not using CPLGetThreadLocalConfigOption?

More rigorous tests of the environments are forthcoming.

@rouault
Copy link
Contributor

rouault commented Mar 16, 2017

CPLGetConfigOption() will return the value of the config option, be it either defined through environement variable, CPLSetConfigOption() or CPLSetThreadLocalConfigOption() (from the same thread).

CPLGetThreadLocalConfigOption() will return the value of the config option, but only if it has been set with CPLSetThreadLocalConfigOption()

@geowurster
Copy link
Contributor

geowurster commented Mar 16, 2017

@sgillies How do you feel about just issuing a warning when using Env() inside a thread on GDAL<2.0? Seems like we can use a shim and use CPL*ConfigOption() when CPL*ThreadLocalConfigOption() is not available.

Unless the user wants a different configuration per thread this should work on every GDAL version, although I think it requires a change in @ensure_env. Right now @ensure_env always creates a new environment, but if one exists it shouldn't create one.

from concurrent.futures import ThreadPoolExecutor

import rasterio as rio

def _process(path):
    with rio.open(path) as src:
        pass

with rio.Env(), ThreadPoolExecutor(4) as pool:
    for res in pool.map(_process, ['tests/data/RGB.byte.tif']):
        pass

The main thread uses CPLSetConfigOption and thus child threads
inherit from main's configuration. Child threads use the thread
local version and are thus isolated from each other.

Tests of this behavior have been added and documentation added
to _env (and soon to the manual).
@sgillies
Copy link
Member Author

sgillies commented Mar 17, 2017

@geowurster I've made the change in ensure_env but I think that we can avoid the use of a shim. After more thought, I believe it's less surprising if child threads inherit the main thread's configuration and use CPLGetConfigOption.

Got a few minutes for review @geowurster?

@geowurster
Copy link
Contributor

@sgillies sorry, didn't realize you were done. I should have some time tomorrow to review.

Copy link
Contributor

@geowurster geowurster left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think being forced to wrap threads in a rasterio.Env() might be surprising for some. What do you think about something like this:

class GDALEnv):
    
    def __init__(self):
        self._have_registered_drivers = False

    def start(self):
        # The outer if statement prevents each thread from acquiring a lock when the environment starts, and the inner avoids a potential race condition.
         if not self._have_registered_drivers:
            with threading.Lock():
                if not self._have_registered_drivers:
                    GDALAllRegister()

This allows the first thread to quietly register drivers while still allowing the thread to inherit from a parent rasterio.Env() if needed, at the cost of some edge cases and potential surprises for those with passing familiarity of GDAL's internals.

I'm starting to think that the rasterio.Env() should be completely invisible by default but easy to manage for those with more advanced use cases, and I'm not entirely sure that wanting to just work with different files in different threads completely fits that advanced use case, especially since it doesn't necessarily require any modifications to the GDAL environment.

from rasterio.env import get_gdal_config


class TestThreading(unittest.TestCase):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think its worth testing with a multiprocessing.Process() as well as doing a small I/O call to ensure that drivers are actually registered.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@geowurster Agreed. I got it in the next two commits: thread and process pool executors, with and without an Env() in the main thread.

@sgillies
Copy link
Member Author

Discovered options needed to be a thread local as well. I'm removing the "WIP" from the title.

@sgillies sgillies changed the title WIP: Switch to CPLSetThreadLocalConfigOption Better defined and tested concurrent Envs Mar 24, 2017
@geowurster
Copy link
Contributor

@sgillies Will take a look later today.

Copy link
Contributor

@geowurster geowurster left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good! 🎉

I suspect the code coverage drop is due to some tests now running in threads and processes.

I made a few inline comments but none are blockers. Two more small requests:

  1. Some tests introduced in this PR depend on concurrent.futures, which is not part of the standard library in Python 2, but has been backported as the futures package. We only get it because boto -> s3transfer -> futures. It's probably a good idea to add this to the tests install extras for Python 2.
  2. Add tests/data/white-gemini-iv.zip to .gitignore

"""GDAL and OGR driver management."""
"""GDAL and OGR driver and configuration management

Note: Only the main thread may load drivers. This means that new threads
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer needed.

The main thread always utilizes CPLSetConfigOption. Child threads
utilize CPLSetThreadLocalConfigOption instead. All threads use
CPLGetConfigOption and not CPLGetThreadLocalConfigOption, thus child
threads will inherit config options from the main thread.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might not hurt to add "unless the option is set to a new value inside the thread"

rasterio/env.py Outdated

local = ThreadEnv()

# When the outermost 'rasterio.Env()' executes '__enter__' it probes the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move this long comment closer to _discovered_options, maybe just as a docstring on ThreadEnv()? It may provide those dealing with both osgeo and Rasterio with some important background info.

@geowurster
Copy link
Contributor

This also closes #986

@geowurster geowurster mentioned this pull request Mar 27, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants