
Slxu kill query at aborted downloading #602

Merged
Merged 57 commits on Sep 24, 2014

Conversation

@slxu (Contributor) commented Aug 25, 2014

Review after PR #598

Fix #509

slxu added 30 commits July 29, 2014 17:56
The input buffer status checking breaks the backpressure under the
following conditions:
1. By design, given two adjacent events of the same type, the second
one is not triggered.
2. The buffer becomes full, and the buffer-full listener starts
executing.
3. At the same time, a thread polls data from the input buffer.
4. The buffer is no longer full, so the channel read pause is not
performed.
5. Because of 1, the buffer-full event will never be triggered again.
6. The backpressure is broken.

Removing the checking code fixes the problem.
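The race above can be illustrated with a toy buffer (all names here are hypothetical, not the actual Myria classes). The key point of the fix is that the buffer-full path pauses reads unconditionally, instead of re-checking the size, since a concurrent poll can make that re-check fail and the edge-triggered full event will then never fire again:

```java
import java.util.ArrayDeque;
import java.util.Queue;

/**
 * Minimal sketch of the fixed pattern. The buffer-full event is
 * edge-triggered (two adjacent events of the same type are not both
 * delivered), so the full handler must pause right away rather than
 * re-check "is the buffer still full?".
 */
final class FlowControlSketch {
  private final Queue<Object> buffer = new ArrayDeque<>();
  private final int capacity;
  private boolean readPaused = false;

  FlowControlSketch(int capacity) { this.capacity = capacity; }

  synchronized void offer(Object msg) {
    buffer.add(msg);
    if (buffer.size() >= capacity) {
      // Fixed version: no "still full?" re-check here; pause immediately.
      readPaused = true;
    }
  }

  synchronized Object poll() {
    Object msg = buffer.poll();
    if (readPaused && buffer.size() <= capacity / 2) {
      readPaused = false; // resume reads once the buffer has drained enough
    }
    return msg;
  }

  synchronized boolean isReadPaused() { return readPaused; }
}
```

In the buggy version, step 3's concurrent poll ran between the full event firing and the listener's re-check, so the pause was skipped forever.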
Temporarily add a special download_test API to implement a big-data
download test. This method can be removed if query result
stream-back to the client is implemented (if needed).
The pipeline executors have their own memory management mechanism,
which works by specifying the estimated size of message caches.
However, it cannot work together with FlowControlInputBuffer or any
other backpressure mechanism.

The reason is that Netty provides a boolean switch for each connection
to turn data transmission on and off. The executors and the
FlowControlInputBuffers both use this switch to implement
backpressure. If both the executors and the FlowControlInputBuffers
are present, we have three layers of threads:

Netty worker threads --> pipeline executor --> myria query executor

(--> means "pushes messages into")

When the Myria query executor finds that the input buffer is too full
and turns the connection switch off to stop reading data, the pipeline
executor does not know that; it keeps pushing data into the input
buffer and restarts the network data transmission.

Currently, all the pipeline executors do is serialize/deserialize
messages, so removing them should not hurt performance.
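The conflict over the shared connection switch can be sketched in a few lines (names hypothetical): two independent layers flip the same boolean, and the layer that resumes is unaware that the other layer deliberately paused.

```java
/**
 * Sketch: two layers sharing one per-connection read switch clobber
 * each other's decisions. This models the conflict, not the Netty API.
 */
final class SharedSwitchSketch {
  private boolean channelReadable = true;

  // The query executor pauses reads because its input buffer is full.
  synchronized void queryExecutorPause() { channelReadable = false; }

  // The pipeline executor, unaware of the pause, resumes reads because
  // its own message-cache estimate says there is room.
  synchronized void pipelineExecutorResume() { channelReadable = true; }

  synchronized boolean readable() { return channelReadable; }
}
```

With both layers present, the resume silently undoes the pause, which is why only one layer can own the switch.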
The actual resume and pause of connection reads are conducted by
Netty worker threads. When query executor threads call these methods,
awaitUninterruptibly causes the query executor threads to sleep
while they are holding the spin locks, and the Netty worker threads
need those locks to get the information required to conduct the
actual resume and pause.

Move the resume/pause code out of the spin-lock-protected block.
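The shape of that fix can be sketched as follows (hypothetical names, not the Myria code): decide what to do while holding the lock, but run the blocking wait only after releasing it, so the worker thread that must complete the operation can still acquire the lock.

```java
import java.util.concurrent.locks.ReentrantLock;

/**
 * Sketch: keep blocking calls outside lock-protected blocks.
 * Buggy shape:   lock(); blockingPause.run(); unlock();
 * The worker thread that completes the pause needs the same lock,
 * so the caller sleeps holding it and the worker spins forever.
 */
final class PauseResumeSketch {
  private final ReentrantLock stateLock = new ReentrantLock();
  private boolean paused = false;

  void pauseReads(Runnable blockingPause) {
    boolean shouldPause;
    stateLock.lock();
    try {
      shouldPause = !paused; // state transition decided under the lock
      paused = true;
    } finally {
      stateLock.unlock();
    }
    if (shouldPause) {
      blockingPause.run(); // the blocking wait happens outside the lock
    }
  }
}
```

The second call below skips the blocking operation because the state transition already happened, which mirrors how only one actual pause should be issued per transition.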
Only overriding the Timeout Rule can disable the global timeout
setting; the test-specific timeout setting does not work.
The Netty I/O threads now only do network transmission and data
serialization/deserialization.

The old implementation may cause distributed deadlock because:
1. Network connections are created in producer.init.
2. init is blocking.
3. Suppose Machine A.producer 1.init is executed by worker1 and wants
to connect to Machine B.consumer 1, while at the same time Machine
B.producer 1, executed by worker2, wants to connect to Machine
A.consumer 1. Unfortunately, Netty assigns the connection acceptance
on Machine A from Machine B.producer 1 to worker1, and the connection
acceptance on Machine B from Machine A.producer 1 to worker2. Both
worker threads are blocked in init, so neither can run its accept,
and deadlock happens.
Useless; it always reports false connection errors. Let the
error-handling code process connection problems.
This is the major fix of this branch. The Netty framework does not
actually guarantee that the readability of a channel is set in the
same order as the setReadability method calls when they are made
sequentially by different Netty worker threads. This problem may
cause an unexpectedly delayed read pause, which is the main cause of
the query hang.

Implement sequential actual readability setting.
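One standard way to get this ordering guarantee, sketched below with hypothetical names (this is not the actual Myria implementation), is to funnel every readability change for a channel through a per-channel single-threaded executor, so the actual changes are applied in submission order:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Sketch: serialize the actual readability changes of one channel.
 * A single-threaded executor applies the changes strictly in the
 * order setReadable was called, no matter which thread called it.
 */
final class OrderedReadabilitySketch {
  private final ExecutorService channelOrder = Executors.newSingleThreadExecutor();
  private final AtomicBoolean readable = new AtomicBoolean(true);
  final List<Boolean> applied = new CopyOnWriteArrayList<>();

  void setReadable(final boolean value) {
    channelOrder.execute(() -> {
      readable.set(value);   // the "actual" readability change
      applied.add(value);    // recorded here only to observe the order
    });
  }

  void shutdown() throws InterruptedException {
    channelOrder.shutdown();
    channelOrder.awaitTermination(5, TimeUnit.SECONDS);
  }
}
```

Without such serialization, two worker threads each applying one change can finish in either order, which is exactly the delayed-pause symptom described above.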
Putting the listeners into an executor may break the
readability-setting order.
Remove a possible NullPointerException; met one in an experiment.
Throw exceptions on error states (mapping) instead of silently
trying to correct the state.
Other threads may change the readability between setReadable and isReadable.
Remove a possible deadlock between ChannelContext.statemachinelock
and channelpool.updatelock by always acquiring the updatelock before
the statemachinelock.
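The fix is the classic lock-ordering rule: if every code path acquires the two locks in the same global order, no cycle of "each thread holds one lock and waits for the other" can form. A minimal sketch (the lock names echo the commit message; the methods are hypothetical):

```java
/** Sketch: a fixed global lock order prevents deadlock. */
final class LockOrderSketch {
  static final Object updateLock = new Object();        // channelpool.updatelock
  static final Object stateMachineLock = new Object();  // ChannelContext.statemachinelock

  // Path A (e.g. the channel pool updating a channel's state):
  // updateLock first, then stateMachineLock.
  static String pathA() {
    synchronized (updateLock) {
      synchronized (stateMachineLock) {
        return "A";
      }
    }
  }

  // Path B (e.g. a channel state transition) takes the SAME order,
  // so two threads can never each hold one lock while waiting for
  // the other.
  static String pathB() {
    synchronized (updateLock) {
      synchronized (stateMachineLock) {
        return "B";
      }
    }
  }
}
```

The deadlock existed precisely because one path previously took statemachinelock first.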
Add an input buffer size variable and protect it with a lock. The
code is much simplified, and it is also much easier to reason about
the correctness of an IB implementation, because the buffer size is
the single state variable that controls all the flow-control event
triggers and the data-pulling thread.
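A sketch of that design, with hypothetical names and thresholds: the queue size, read and written only under one monitor, drives both the flow-control trigger and the blocking of the data-pulling thread, so there is a single state variable to reason about.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/**
 * Sketch: the buffer size is the single state variable. It decides
 * when to request a read pause (flow control) and when the pulling
 * thread blocks, all under the same monitor.
 */
final class SingleStateIB {
  private final Queue<Object> queue = new ArrayDeque<>();
  private final int softCap; // hypothetical flow-control threshold
  private boolean pauseRequested = false;

  SingleStateIB(int softCap) { this.softCap = softCap; }

  synchronized void put(Object m) {
    queue.add(m);
    if (queue.size() >= softCap) {
      pauseRequested = true; // trigger derived from size alone
    }
    notifyAll(); // wake the pulling thread
  }

  synchronized Object take() throws InterruptedException {
    while (queue.isEmpty()) {
      wait(); // the pulling thread blocks on the same state
    }
    Object m = queue.poll();
    if (queue.size() < softCap) {
      pauseRequested = false; // resume trigger, also derived from size
    }
    return m;
  }

  synchronized boolean pauseRequested() { return pauseRequested; }
}
```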
1. Test under 30 workers.
2. Catalog.addWorker is super slow (2~3 seconds per call on my
machine); implement addWorkers to batch the additions.
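The batching idea can be sketched like this (the Catalog API here is simulated, not the real Myria Catalog; commit() stands in for whatever per-call cost, such as a database commit, makes addWorker slow):

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: pay the per-call cost once per batch instead of once per worker. */
final class CatalogSketch {
  final List<String> workers = new ArrayList<>();
  int commits = 0;

  private void commit() { commits++; } // stands in for the slow part of addWorker

  // Old path: one commit per worker, ~30 commits for 30 workers.
  void addWorker(String w) {
    workers.add(w);
    commit();
  }

  // New path: one commit for the whole batch.
  void addWorkers(List<String> ws) {
    workers.addAll(ws);
    commit();
  }
}
```

With a 2~3 second per-commit cost, batching 30 additions into one commit turns a minute-plus of setup into a few seconds.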
@coveralls

Coverage Status

Coverage increased (+0.09%) when pulling 30dda95 on slxu-kill-query-at-aborted-downloading into c9c36d3 on master.

@slxu (Contributor, Author) commented Aug 25, 2014

Yes, it's very strange. The tests fail quite easily under gradle
check in Travis, and sometimes on my laptop, although very rarely.
But I'm also running gradle check in a loop on my desktop; it has
now run successfully for several days, thousands of rounds, with no
errors.

In Travis, I noticed a long time ago that the SequenceTest fails
quite easily.

Currently I have no definitive explanation for the failures. When
the tests are run under gradle, I have not found a good way to get
enough debug information when failures occur.

Conflicts:
	systemtest/edu/washington/escience/myria/systemtest/SystemTestBase.java
@coveralls

Coverage Status

Coverage increased (+0.3%) when pulling 5219f84 on slxu-kill-query-at-aborted-downloading into 4b5fba6 on master.

@coveralls

Coverage Status

Coverage increased (+0.7%) when pulling 1b16f80 on slxu-kill-query-at-aborted-downloading into 4985696 on master.

@coveralls

Coverage Status

Coverage increased (+0.13%) when pulling 7edcfa9 on slxu-kill-query-at-aborted-downloading into d07616a on master.

@coveralls

Coverage Status

Coverage increased (+0.27%) when pulling e612a10 on slxu-kill-query-at-aborted-downloading into c085de9 on master.

@coveralls

Coverage Status

Coverage increased (+0.47%) when pulling d806d3a on slxu-kill-query-at-aborted-downloading into d9fe0d8 on master.

@dhalperi (Member) commented

@slxu the BigDataTest completely destroys my machine, which runs out
of memory. This is probably exactly a symptom of the underlying
issue.

@coveralls

Coverage Status

Coverage increased (+0.19%) when pulling ba8fa3f on slxu-kill-query-at-aborted-downloading into d9fe0d8 on master.

dhalperi added a commit that referenced this pull request Sep 24, 2014

Slxu kill query at aborted downloading
@dhalperi dhalperi merged commit 519c37e into master Sep 24, 2014
@dhalperi dhalperi deleted the slxu-kill-query-at-aborted-downloading branch September 24, 2014 11:48
Development

Successfully merging this pull request may close these issues.

stopped downloads lead to zombie queries
3 participants