
Slxu kill query at aborted downloading #602

Merged
Merged 57 commits on Sep 24, 2014

Conversation

@slxu (Contributor) commented Aug 25, 2014

Review after PR #598

Fix #509

slxu added 30 commits July 29, 2014 17:56
The input buffer status checking breaks the backpressure under the
following conditions:
1. By design, given two adjacent events of the same type, the second
one is not triggered.
2. The buffer becomes full, and the buffer-full listener starts
executing.
3. At the same time, a thread polls data from the input buffer.
4. The buffer is no longer full, so the channel read pause is not
performed.
5. Because of 1, the buffer-full event will never be triggered again.
6. The backpressure is broken.

Removing the checking code fixes the problem.
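The race above can be illustrated with a toy buffer (all names here are hypothetical, not the actual Myria classes). The key point of the fix is that the buffer-full path pauses reads unconditionally, instead of re-checking the size, since a concurrent poll can make that re-check fail and the edge-triggered full event will then never fire again:

```java
import java.util.ArrayDeque;
import java.util.Queue;

/**
 * Minimal sketch of the fixed pattern. The buffer-full event is
 * edge-triggered (two adjacent events of the same type are not both
 * delivered), so the full handler must pause right away rather than
 * re-check "is the buffer still full?".
 */
final class FlowControlSketch {
  private final Queue<Object> buffer = new ArrayDeque<>();
  private final int capacity;
  private boolean readPaused = false;

  FlowControlSketch(int capacity) { this.capacity = capacity; }

  synchronized void offer(Object msg) {
    buffer.add(msg);
    if (buffer.size() >= capacity) {
      // Fixed version: no "still full?" re-check here; pause immediately.
      readPaused = true;
    }
  }

  synchronized Object poll() {
    Object msg = buffer.poll();
    if (readPaused && buffer.size() <= capacity / 2) {
      readPaused = false; // resume reads once the buffer has drained enough
    }
    return msg;
  }

  synchronized boolean isReadPaused() { return readPaused; }
}
```

In the buggy version, step 3's concurrent poll ran between the full event firing and the listener's re-check, so the pause was skipped forever.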
Temporarily add a special download_test API to implement a big-data
download test. This method can be removed if query result
stream-back to the client is implemented (if needed).
The pipeline executors have their own memory management mechanism,
which works by specifying the estimated size of message caches.
However, it cannot work together with FlowControlInputBuffer or any
other backpressure mechanism.

The reason is that Netty provides a boolean switch for each connection
to turn data transmission on and off. The executors and the
FlowControlInputBuffers both use this switch to implement
backpressure. If both the executors and the FlowControlInputBuffers
are present, we have three layers of threads:

Netty worker threads --> pipeline executor --> myria query executor

(--> means "pushes messages into")

When the Myria query executor finds that the input buffer is too full
and turns the connection switch off to stop reading data, the pipeline
executor does not know that; it keeps pushing data into the input
buffer and restarts the network data transmission.

Currently, all the pipeline executors do is serialize/deserialize
messages, so removing them should not hurt performance.
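The conflict over the shared connection switch can be sketched in a few lines (names hypothetical): two independent layers flip the same boolean, and the layer that resumes is unaware that the other layer deliberately paused.

```java
/**
 * Sketch: two layers sharing one per-connection read switch clobber
 * each other's decisions. This models the conflict, not the Netty API.
 */
final class SharedSwitchSketch {
  private boolean channelReadable = true;

  // The query executor pauses reads because its input buffer is full.
  synchronized void queryExecutorPause() { channelReadable = false; }

  // The pipeline executor, unaware of the pause, resumes reads because
  // its own message-cache estimate says there is room.
  synchronized void pipelineExecutorResume() { channelReadable = true; }

  synchronized boolean readable() { return channelReadable; }
}
```

With both layers present, the resume silently undoes the pause, which is why only one layer can own the switch.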
The actual resume and pause of connection reads are conducted by
Netty worker threads. When query executor threads call these methods,
awaitUninterruptibly causes the query executor threads to sleep
while they are holding the spin locks, and the Netty worker threads
need those locks to get the information required to conduct the
actual resume and pause.

Move the resume/pause code out of the spin-lock-protected block.
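The shape of that fix can be sketched as follows (hypothetical names, not the Myria code): decide what to do while holding the lock, but run the blocking wait only after releasing it, so the worker thread that must complete the operation can still acquire the lock.

```java
import java.util.concurrent.locks.ReentrantLock;

/**
 * Sketch: keep blocking calls outside lock-protected blocks.
 * Buggy shape:   lock(); blockingPause.run(); unlock();
 * The worker thread that completes the pause needs the same lock,
 * so the caller sleeps holding it and the worker spins forever.
 */
final class PauseResumeSketch {
  private final ReentrantLock stateLock = new ReentrantLock();
  private boolean paused = false;

  void pauseReads(Runnable blockingPause) {
    boolean shouldPause;
    stateLock.lock();
    try {
      shouldPause = !paused; // state transition decided under the lock
      paused = true;
    } finally {
      stateLock.unlock();
    }
    if (shouldPause) {
      blockingPause.run(); // the blocking wait happens outside the lock
    }
  }
}
```

The second call below skips the blocking operation because the state transition already happened, which mirrors how only one actual pause should be issued per transition.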
Only overriding the Timeout Rule can disable the global timeout
setting; the test-specific timeout setting does not work.
The Netty I/O threads now only do network transmission and data
serialization/deserialization.

The old implementation may cause distributed deadlock because:
1. Network connections are created in producer.init.
2. init is blocking.
3. Suppose Machine A.producer 1.init is executed by worker1 and wants
to connect to Machine B.consumer 1, while at the same time Machine
B.producer 1, executed by worker2, wants to connect to Machine
A.consumer 1. Unfortunately, Netty assigns the connection acceptance
on Machine A from Machine B.producer 1 to worker1, and the connection
acceptance on Machine B from Machine A.producer 1 to worker2. Both
worker threads are blocked in init, so neither can run its accept,
and deadlock happens.
Useless; it always reports false connection errors. Let the
error-handling code process connection problems.
This is the major fix of this branch. The Netty framework does not
actually guarantee that the readability of a channel is set in the
same order as the setReadability method calls when they are made
sequentially by different Netty worker threads. This problem may
cause an unexpectedly delayed read pause, which is the main cause of
the query hang.

Implement sequential actual readability setting.
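One standard way to get this ordering guarantee, sketched below with hypothetical names (this is not the actual Myria implementation), is to funnel every readability change for a channel through a per-channel single-threaded executor, so the actual changes are applied in submission order:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Sketch: serialize the actual readability changes of one channel.
 * A single-threaded executor applies the changes strictly in the
 * order setReadable was called, no matter which thread called it.
 */
final class OrderedReadabilitySketch {
  private final ExecutorService channelOrder = Executors.newSingleThreadExecutor();
  private final AtomicBoolean readable = new AtomicBoolean(true);
  final List<Boolean> applied = new CopyOnWriteArrayList<>();

  void setReadable(final boolean value) {
    channelOrder.execute(() -> {
      readable.set(value);   // the "actual" readability change
      applied.add(value);    // recorded here only to observe the order
    });
  }

  void shutdown() throws InterruptedException {
    channelOrder.shutdown();
    channelOrder.awaitTermination(5, TimeUnit.SECONDS);
  }
}
```

Without such serialization, two worker threads each applying one change can finish in either order, which is exactly the delayed-pause symptom described above.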
Putting the listeners into an executor may break the
readability-setting order.
Remove a possible NullPointerException; met one in an experiment.
Throw exceptions on error states (mapping) instead of silently
trying to correct the state.
Other threads may change the readability between setReadable and isReadable.
Remove a possible deadlock between ChannelContext.statemachinelock
and channelpool.updatelock by always acquiring the updatelock before
the statemachinelock.
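The fix is the classic lock-ordering rule: if every code path acquires the two locks in the same global order, no cycle of "each thread holds one lock and waits for the other" can form. A minimal sketch (the lock names echo the commit message; the methods are hypothetical):

```java
/** Sketch: a fixed global lock order prevents deadlock. */
final class LockOrderSketch {
  static final Object updateLock = new Object();        // channelpool.updatelock
  static final Object stateMachineLock = new Object();  // ChannelContext.statemachinelock

  // Path A (e.g. the channel pool updating a channel's state):
  // updateLock first, then stateMachineLock.
  static String pathA() {
    synchronized (updateLock) {
      synchronized (stateMachineLock) {
        return "A";
      }
    }
  }

  // Path B (e.g. a channel state transition) takes the SAME order,
  // so two threads can never each hold one lock while waiting for
  // the other.
  static String pathB() {
    synchronized (updateLock) {
      synchronized (stateMachineLock) {
        return "B";
      }
    }
  }
}
```

The deadlock existed precisely because one path previously took statemachinelock first.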
Add an input buffer size variable and protect it with a lock. The
code is much simplified, and it is also much easier to reason about
the correctness of an IB implementation, because the buffer size is
the single state variable that controls all the flow-control event
triggers and the data-pulling thread.
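A sketch of that design, with hypothetical names and thresholds: the queue size, read and written only under one monitor, drives both the flow-control trigger and the blocking of the data-pulling thread, so there is a single state variable to reason about.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/**
 * Sketch: the buffer size is the single state variable. It decides
 * when to request a read pause (flow control) and when the pulling
 * thread blocks, all under the same monitor.
 */
final class SingleStateIB {
  private final Queue<Object> queue = new ArrayDeque<>();
  private final int softCap; // hypothetical flow-control threshold
  private boolean pauseRequested = false;

  SingleStateIB(int softCap) { this.softCap = softCap; }

  synchronized void put(Object m) {
    queue.add(m);
    if (queue.size() >= softCap) {
      pauseRequested = true; // trigger derived from size alone
    }
    notifyAll(); // wake the pulling thread
  }

  synchronized Object take() throws InterruptedException {
    while (queue.isEmpty()) {
      wait(); // the pulling thread blocks on the same state
    }
    Object m = queue.poll();
    if (queue.size() < softCap) {
      pauseRequested = false; // resume trigger, also derived from size
    }
    return m;
  }

  synchronized boolean pauseRequested() { return pauseRequested; }
}
```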
1. Test under 30 workers.
2. Catalog.addWorker is super slow (2~3 seconds per call on my
machine); implement addWorkers to batch the additions.
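The batching idea can be sketched like this (the Catalog API here is simulated, not the real Myria Catalog; commit() stands in for whatever per-call cost, such as a database commit, makes addWorker slow):

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: pay the per-call cost once per batch instead of once per worker. */
final class CatalogSketch {
  final List<String> workers = new ArrayList<>();
  int commits = 0;

  private void commit() { commits++; } // stands in for the slow part of addWorker

  // Old path: one commit per worker, ~30 commits for 30 workers.
  void addWorker(String w) {
    workers.add(w);
    commit();
  }

  // New path: one commit for the whole batch.
  void addWorkers(List<String> ws) {
    workers.addAll(ws);
    commit();
  }
}
```

With a 2~3 second per-commit cost, batching 30 additions into one commit turns a minute-plus of setup into a few seconds.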
@coveralls

Coverage Status

Coverage increased (+0.09%) when pulling 30dda95 on slxu-kill-query-at-aborted-downloading into c9c36d3 on master.

@slxu (Contributor, Author) commented Aug 25, 2014

Yes, it's very strange. The tests fail quite easily under gradle
check in Travis, and sometimes on my laptop, although very rarely.
But I'm also running gradle check in a loop on my desktop; it has
now run successfully for several days, thousands of rounds, with no
errors.

In Travis, I noticed a long time ago that the SequenceTest fails
quite easily.

Currently I have no definitive explanation for the failures. When
the tests are run under gradle, I have not found a good way to get
enough debug information when failures occur.

Conflicts:
	systemtest/edu/washington/escience/myria/systemtest/SystemTestBase.java
@coveralls

Coverage Status

Coverage increased (+0.3%) when pulling 5219f84 on slxu-kill-query-at-aborted-downloading into 4b5fba6 on master.

@coveralls

Coverage Status

Coverage increased (+0.7%) when pulling 1b16f80 on slxu-kill-query-at-aborted-downloading into 4985696 on master.

@coveralls

Coverage Status

Coverage increased (+0.13%) when pulling 7edcfa9 on slxu-kill-query-at-aborted-downloading into d07616a on master.

@coveralls

Coverage Status

Coverage increased (+0.27%) when pulling e612a10 on slxu-kill-query-at-aborted-downloading into c085de9 on master.

@coveralls

Coverage Status

Coverage increased (+0.47%) when pulling d806d3a on slxu-kill-query-at-aborted-downloading into d9fe0d8 on master.

@dhalperi (Member) commented

@slxu the BigDataTest completely destroys my machine, which runs out
of memory. This is probably exactly a symptom of the underlying
issue.

@coveralls

Coverage Status

Coverage increased (+0.19%) when pulling ba8fa3f on slxu-kill-query-at-aborted-downloading into d9fe0d8 on master.

dhalperi added a commit that referenced this pull request Sep 24, 2014

Slxu kill query at aborted downloading
@dhalperi dhalperi merged commit 519c37e into master Sep 24, 2014
@dhalperi dhalperi deleted the slxu-kill-query-at-aborted-downloading branch September 24, 2014 11:48
Development

Successfully merging this pull request may close these issues.

stopped downloads lead to zombie queries
3 participants