
Use Separate Pools for Raster IO + Cache #5219

Merged · 8 commits into develop from test/threadpools · Oct 14, 2019

Conversation

notthatbreezy (Contributor)

Overview

In #5200 we removed `parTraverse` because we had evidence that tiles with many IO operations were exhausting the default execution context's threads and blocking on requests to S3. Previously, we had experimented with different execution contexts for the whole application, but not with dedicated contexts for specific workloads. This PR reintroduces `parTraverse` into our rendering and runs it on a cached thread pool dedicated to that context.

Additionally, the memcached client had an internal pool sized to the number of cores present. This PR adjusts it to also use a cached thread pool, on the understanding that since this work is IO-bound we should not limit ourselves to the machine's core count.
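
For concreteness, here is a minimal sketch of the shape this takes, assuming cats-effect 2 and cats' `parTraverse`; the names `RasterIOPools`, `fetchAll`, and `read` are invented for illustration and are not the PR's actual code:

```scala
import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext
import cats.effect.{ContextShift, IO}
import cats.syntax.parallel._

object RasterIOPools {
  // Unbounded cached pool reserved for blocking raster / S3 reads
  val rasterIO: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newCachedThreadPool())

  // In the application this ContextShift would come from IOApp
  implicit val cs: ContextShift[IO] =
    IO.contextShift(ExecutionContext.global)

  // Run each read on the raster-io pool in parallel, shifting back afterwards
  def fetchAll(keys: List[String])(read: String => IO[Array[Byte]]): IO[List[Array[Byte]]] =
    keys.parTraverse(key => cs.evalOn(rasterIO)(read(key)))
}
```

The same `Executors.newCachedThreadPool()` construction is what the memcached client is assumed to be handed in place of its core-count-sized pool.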

Checklist

  • Description of PR is in an appropriate section of the changelog and grouped with similar changes if possible

Notes

Load test from develop:
[screenshot: load test results]

Load test from this PR:
[screenshot: load test results]

It doesn't move the needle much, if at all, but the story makes more sense conceptually and adds little complexity, so I think it's still worth it.

I did test out using fibers, but there wasn't a noticeable bump, and on the global execution context we were still running into issues parallelizing the fetches to S3.

Testing Instructions

  • Rebuild Jars
  • Start server and browse around

Closes #5199
Closes #5191

Lknechtli (Contributor) commented Oct 10, 2019

Loading the NYC project at a low zoom level seems to cripple the tile server with raster-io threads.
The heap usage signals to me that something memory-bandwidth-limited is happening, rapidly creating and destroying objects that then need to be cleaned up every so often (not so sure about this after the later testing).
[screenshot: heap usage]

There are 200-300 raster-io threads active.
[screenshot: active raster-io threads]

It doesn't seem right to me that the raster-io threads would be active for this entire time. The resource usage on my computer shows it's not spending much time spinning the CPU or fetching things on the network, so this tells me we're contending on a resource shared across all these threads and probably hitting some serious context-switching overhead.

Lknechtli (Contributor)

Going by this, it looks like the actual read is the problem, so I might be running into GDAL working under extreme memory pressure.
[screenshot]

I think we need to make it so that paintedRender can be interrupted and release any resources it's using (GDAL, etc.), because when we have severe pressure like this, later requests just keep piling up IO threads and putting more pressure on the blocking resource. It took about 10-15 minutes for the tile server to finish the actual reads on a single page of tiles, which is pretty extreme.
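
As a rough sketch of that idea, assuming cats-effect 2: acquiring the dataset as a `Resource` means cancellation or a timeout always runs the release step, so a blocked read can't keep holding GDAL state while requests pile up behind it. `RasterHandle` and `readTile` are hypothetical names, not the project's actual `paintedRender` code.

```scala
import scala.concurrent.duration._
import cats.effect.{ContextShift, IO, Resource, Timer}

// Hypothetical handle around a GDAL-backed dataset
trait RasterHandle {
  def readBytes(): Array[Byte]
  def close(): Unit
}

// Open the handle as a Resource so cancellation or the timeout always closes it,
// then cap the read so stuck requests don't accumulate indefinitely
def readTile(open: IO[RasterHandle])(
    implicit timer: Timer[IO], cs: ContextShift[IO]): IO[Array[Byte]] =
  Resource
    .make(open)(handle => IO(handle.close()))
    .use(handle => IO(handle.readBytes()))
    .timeout(30.seconds)
```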

Lknechtli (Contributor) commented Oct 10, 2019

Another test case, with 50 raster-io threads and first reads on each tile:

First read on a tile containing a single image:
[screenshot]

First read on a tile containing >50 images (same project, zoomed out):
[screenshot]

Interestingly, the reads that happen after the initial burst (delayed due to the 50 thread limit) go much faster, ranging from a high of 60 seconds to as low as 15 seconds.

The number of threads currently fetching has a drastic impact on how fast the reads happen: with 17 threads going, reads max out at about a minute; with 50 threads, they exceed 100 seconds.

This is not IO- or compute-bound: my network activity is pretty consistently under 100 kb/s with a few spikes, which I assume are image fetches, and the CPU doesn't max out any cores.

When I limit things to 10 threads, each individual read takes about 25-30 seconds, but the overall time to complete is about 160 seconds, as opposed to nearly 10 minutes with an unlimited number of threads.

Limited to 4 threads, reads take 6-10 seconds each (1.5-2.5 sec/img bandwidth), for a total of 112 seconds.
2 threads takes about 3 seconds per image (1.5 sec/img bandwidth), totalling 93 seconds.
1 thread takes 1.3-1.6 seconds per image (1.5 sec/img bandwidth), totalling 94 seconds.
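
One way to express that kind of cap in code, assuming cats-effect 2, is to gate the parallel traversal behind a `Semaphore` so only n reads run at once; `boundedParTraverse` is a name made up for this sketch, not something in the PR:

```scala
import cats.effect.{ContextShift, IO}
import cats.effect.concurrent.Semaphore
import cats.syntax.parallel._

// Fan out with parTraverse, but let at most `n` reads hit GDAL / S3 concurrently
def boundedParTraverse[A, B](n: Long, items: List[A])(f: A => IO[B])(
    implicit cs: ContextShift[IO]): IO[List[B]] =
  Semaphore[IO](n).flatMap { sem =>
    items.parTraverse(a => sem.withPermit(f(a)))
  }
```

The 10-thread case above would then look something like `boundedParTraverse(10L, tileUris)(fetchTile)`, with `tileUris` and `fetchTile` standing in for whatever the real read path is.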

Lknechtli (Contributor) left a comment

Good once we switch to a smaller thread pool to avoid crushing GDAL, until we get GeoTIFF raster sources working.
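
A minimal sketch of what that smaller pool could look like, assuming the same `ExecutionContext` wiring as the PR description; the size of 4 only mirrors the best total time in the measurements above and would need tuning:

```scala
import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext

// Small fixed-size pool for GDAL reads; sized from the thread-count experiments above
val gdalReadPool: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(4))
```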

@notthatbreezy notthatbreezy merged commit a134247 into develop Oct 14, 2019
@notthatbreezy notthatbreezy deleted the test/threadpools branch October 14, 2019 13:49
Successfully merging this pull request may close these issues:

  • Use fibers for processing mosaics concurrently
  • Prototype Using Cached Threadpool for Raster IO