Slxu kill query at aborted downloading #602
Conversation
The input-buffer status check breaks backpressure under the following conditions:
1. By design, given two adjacent events of the same type, the second one is not triggered.
2. When the buffer becomes full, the buffer-full listener starts executing.
3. At the same time, another thread polls a message from the input buffer.
4. The buffer is now no longer full, so the channel-read pause is not performed.
5. Because of (1), the buffer-full event never gets triggered again.
6. Backpressure is broken.
Removing the status check fixes the problem.
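A minimal, self-contained model of this race (all names are illustrative, not Myria's actual classes) shows why re-checking the buffer status inside the listener permanently loses the pause once adjacent full events are collapsed:

```java
// Hypothetical model of the bug: the buffer-full event is edge-triggered
// (two adjacent events of the same type collapse into one), so a listener
// that re-checks "is the buffer still full?" can permanently miss its
// one chance to pause the channel.
public class EdgeTriggeredEventBug {
    enum Event { FULL }
    private Event lastEvent = null;
    private Event pendingEvent = null;
    private int size = 0;
    private final int capacity = 2;
    public boolean paused = false;
    public final boolean recheckBeforePausing; // true = the buggy version

    public EdgeTriggeredEventBug(boolean buggy) { recheckBeforePausing = buggy; }

    public void offer() {
        size++;
        if (size >= capacity && lastEvent != Event.FULL) {
            lastEvent = Event.FULL;    // adjacent same-type events collapse
            pendingEvent = Event.FULL; // listener runs later, on another thread
        }
    }

    public void poll() { if (size > 0) size--; }

    public void runListener() {
        if (pendingEvent == Event.FULL) {
            pendingEvent = null;
            // Buggy re-check: a poll() that slipped in between makes the
            // buffer look non-full, so the pause is skipped, and (by the
            // edge-trigger rule above) no second FULL event will arrive.
            // Without the re-check, we pause unconditionally; the later
            // recover event will resume reading.
            if (!recheckBeforePausing || size >= capacity) paused = true;
        }
    }
}
```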
Temporarily add a special download_test API to implement a big-data download test. This method can be removed once query-result stream-back to the client is implemented (if needed).
The pipeline executors have their own memory-management mechanism, which works by specifying the estimated size of the message caches. However, they cannot work together with FlowControlInputBuffer or any other backpressure mechanism. The reason is that Netty provides a single boolean switch per connection to turn data transmission on/off, and both the executors and the FlowControlInputBuffers use this switch to implement backpressure. With both present, we have three layers of threads: Netty worker threads --> pipeline executors --> Myria query executors, where --> means "pushes messages into". When a Myria query executor finds the input buffer too full and flips the connection switch to stop reading data, the pipeline executor does not know that; it keeps pushing data into the input buffer and restarts the network data transmission. Currently the pipeline executors do nothing but serialize/deserialize messages, so removing them should not hurt performance.
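The conflict reduces to a toy model (hypothetical names, not the Myria or Netty API): one boolean read switch per connection, written by two independent layers that do not coordinate, so the last writer silently wins:

```java
// Toy model of two backpressure layers sharing one per-connection
// boolean read switch. Each layer writes the switch based only on its
// own state, so one layer's pause can be undone by the other's resume.
public class SharedSwitchConflict {
    private boolean readable = true;              // the single Netty-style switch

    public void queryExecutorPause()    { readable = false; } // input buffer full
    public void pipelineExecutorResume(){ readable = true;  } // its own cache drained

    public boolean isReadable() { return readable; }
}
```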
The actual resume/pause of connection reads is performed by the Netty worker threads. When query-executor threads call these methods, awaitUninterruptibly puts the query-executor threads to sleep while they are still holding the spin locks, and the Netty worker threads need those locks to get the information required to perform the actual resume/pause. Move the resume/pause code out of the spin-lock-protected block.
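The fix follows a standard pattern, sketched below with illustrative names (not Myria's actual classes): never block on an I/O thread's result while holding a lock that the I/O thread itself must acquire; decide under the lock, then wait outside it.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the restructured pause path: the state is
// inspected and updated under the lock, but the blocking wait for the
// I/O thread happens only after the lock is released.
public class PauseResumeSketch {
    private final ReentrantLock stateLock = new ReentrantLock();
    private final ExecutorService ioThread = Executors.newSingleThreadExecutor();
    private boolean readable = true;

    public void pauseRead() {
        boolean shouldPause;
        stateLock.lock();
        try {
            shouldPause = readable;  // inspect/update state under the lock
            readable = false;
        } finally {
            stateLock.unlock();      // release BEFORE any blocking wait
        }
        if (shouldPause) {
            Future<?> f = ioThread.submit(() -> {
                stateLock.lock();    // the I/O thread also needs this lock
                try { /* perform the actual channel pause here */ }
                finally { stateLock.unlock(); }
            });
            try {
                f.get();             // safe: we no longer hold stateLock
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
    }

    public boolean isReadable() {
        stateLock.lock();
        try { return readable; } finally { stateLock.unlock(); }
    }

    public void shutdown() { ioThread.shutdown(); }
}
```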
Only an overriding Timeout rule can disable the global timeout setting; the test-specific timeout setting does not work.
…twork-backpressure
The Netty I/O threads should only do network transmission and data serialization/deserialization. The old implementation may cause distributed deadlock because: 1. network connections are created in producer.init; 2. init is blocking; 3. suppose Machine A.producer1.init is executed by worker thread 1 and wants to connect to Machine B.consumer1, while at the same time Machine B.producer1, executed by worker thread 2, wants to connect to Machine A.consumer1. Unfortunately, Netty may assign the acceptance of the connection from Machine B.producer1 (on Machine A) to worker thread 1, and the acceptance of the connection from Machine A.producer1 (on Machine B) to worker thread 2. Since both worker threads are blocked in init, neither connection is ever accepted, and deadlock happens.
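The cyclic wait can be reproduced locally with two single-thread executors standing in for the two machines' worker threads (a toy model, not Myria code): each "init" blocks on work submitted to the other worker, whose only thread is itself blocked in its own "init".

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Toy reproduction of the distributed deadlock: each machine's single
// worker thread blocks in "init" waiting for a connection acceptance
// that only the other machine's (equally blocked) worker could run.
public class BlockingInitDeadlock {
    public static boolean deadlocks() {
        ExecutorService workerA = Executors.newSingleThreadExecutor();
        ExecutorService workerB = Executors.newSingleThreadExecutor();
        CountDownLatch bothStarted = new CountDownLatch(2);
        Callable<Object> initOnA = () -> {
            bothStarted.countDown(); bothStarted.await();
            return workerB.submit(() -> "accepted-on-B").get(); // blocks forever
        };
        Callable<Object> initOnB = () -> {
            bothStarted.countDown(); bothStarted.await();
            return workerA.submit(() -> "accepted-on-A").get(); // blocks forever
        };
        Future<Object> a = workerA.submit(initOnA);
        workerB.submit(initOnB);
        try {
            a.get(300, TimeUnit.MILLISECONDS); // never completes
            return false;
        } catch (TimeoutException e) {
            return true;                       // classic cyclic wait detected
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            workerA.shutdownNow();
            workerB.shutdownNow();
        }
    }
}
```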
It is useless and always reports false connection errors. Let the error-handling code process connection problems.
This is the major fix of this branch. The Netty framework does not actually guarantee that the order in which a channel's readability is actually set matches the order of the setReadability calls when they are made sequentially by different Netty worker threads. This problem can cause an unexpectedly delayed read pause, which is the main cause of the query hang. Implement sequential application of the actual readability settings.
Putting the listeners into an executor may break the readability-setting order.
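One way to guarantee ordered application without handing the listeners to an executor is a per-channel FIFO queue drained by at most one thread at a time, sketched below with hypothetical names (a StringBuilder stands in for the real channel):

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: readability changes for a channel are queued and
// drained by exactly one thread at a time, so they are applied in the
// order they were requested, regardless of which thread requests them.
public class SerializedReadabilitySetter {
    private final ConcurrentLinkedQueue<Boolean> pending = new ConcurrentLinkedQueue<>();
    private final AtomicBoolean draining = new AtomicBoolean(false);
    private final StringBuilder applied = new StringBuilder(); // stand-in channel

    public void setReadable(boolean readable) {
        pending.offer(readable);
        // CAS elects a single drainer; changes are applied in FIFO order.
        while (draining.compareAndSet(false, true)) {
            Boolean next;
            while ((next = pending.poll()) != null) {
                synchronized (applied) { applied.append(next ? 'R' : 'P'); }
            }
            draining.set(false);
            if (pending.isEmpty()) break; // re-check to avoid a lost update
        }
    }

    public String appliedOrder() {
        synchronized (applied) { return applied.toString(); }
    }
}
```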
To remove a possible NullPointerException; met one in an experiment.
and throw exceptions on error states (in the mapping) instead of silently trying to correct the state
Other threads may change the readability between the setReadable and isReadable calls.
Remove a possible deadlock between ChannelContext.statemachinelock and channelpool.updatelock by always acquiring the updatelock before the statemachinelock.
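The fix is the standard lock-ordering discipline, sketched here with illustrative names: if every code path that needs both locks acquires them in the same fixed order, a cyclic wait between them is impossible.

```java
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch (hypothetical names): deadlock between two locks
// is prevented by imposing a global order -- updateLock is always taken
// strictly before stateMachineLock, never the reverse.
public class LockOrdering {
    static final ReentrantLock updateLock = new ReentrantLock();
    static final ReentrantLock stateMachineLock = new ReentrantLock();

    // Every method that needs both locks takes them in the same order.
    static int guardedWork(int x) {
        updateLock.lock();            // 1st: the pool-wide lock
        try {
            stateMachineLock.lock();  // 2nd: the per-channel lock
            try {
                return x + 1;         // stand-in for updating channel state
            } finally {
                stateMachineLock.unlock();
            }
        } finally {
            updateLock.unlock();
        }
    }
}
```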
Add an input-buffer size variable and protect it with a lock. The code is much simplified, and it is also much easier to reason about the correctness of an IB implementation, because the buffer size is the single state variable that controls all the flow-control event triggers and the data-pulling thread.
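A minimal sketch of the idea (hypothetical names, with event counters standing in for the real listeners): every size change happens under one lock, and both flow-control triggers fire at exactly the point where the size crosses its threshold.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of an input buffer whose single locked size counter drives both
// flow-control triggers: buffer-full when the size reaches capacity, and
// recover when draining brings it down to the recover threshold.
public class FlowControlBufferSketch<E> {
    private final Deque<E> queue = new ArrayDeque<>();
    private final int capacity;
    private final int recoverTrigger;
    private final Object sizeLock = new Object();
    public int fullEvents = 0, recoverEvents = 0; // stand-ins for listeners

    public FlowControlBufferSketch(int capacity, int recoverTrigger) {
        this.capacity = capacity;
        this.recoverTrigger = recoverTrigger;
    }

    public void offer(E e) {
        synchronized (sizeLock) {        // size changes only under the lock
            queue.addLast(e);
            if (queue.size() >= capacity) {
                fullEvents++;            // fire buffer-full exactly here
            }
        }
    }

    public E poll() {
        synchronized (sizeLock) {
            E e = queue.pollFirst();
            if (e != null && queue.size() == recoverTrigger) {
                recoverEvents++;         // fire recover exactly here
            }
            return e;
        }
    }
}
```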
1. Test under 30 workers. 2. Catalog.addWorker is super slow (2-3 seconds per call on my machine); implement addWorkers to batch the additions.
Yes, it's very strange. The tests fail very easily under gradle check in Travis, and sometimes on my laptop, although very rarely. But I have also been running gradle check in an infinite loop on my desktop; it has now run successfully for several days, thousands of rounds, with no errors. In Travis, I noticed a long time ago that the SequenceTest fails very easily. Currently I have no definitive explanation for the failures. When the tests run under Gradle, I have not yet found a good way to get enough debug information when failures occur.
Conflicts: systemtest/edu/washington/escience/myria/systemtest/SystemTestBase.java
…aborted-downloading
…github.com/uwescience/myria into slxu-kill-query-at-aborted-downloading
…aborted-downloading Conflicts: src/edu/washington/escience/myria/parallel/Server.java
…aborted-downloading
@slxu the BigDataTest simply destroys my machine, which completely runs out of memory. This is probably exactly a symptom of the underlying issue.
…wnloading Slxu kill query at aborted downloading
Review after PR #598
Fix #509