cqlsh times out in test_copy_to_with_child_process_crashing dtest #37
Comments
Seen a couple of times; I could never reproduce it locally.
I wonder if those |
FWIW, I saw this locally after adding some debug printouts:
To help understand spurious timeout error we encounter. Refs scylladb/scylla-cqlsh#37 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Hi again, now with printing stdout to the log.
This one is unique compared to the others:
The question now is which part of the code it's coming from.
Too bad we didn't have the full backtrace of that error; it seems we would if we used
I wish it had the full backtrace.
LOL
https://github.com/scylladb/scylla-dtest/pull/3538 might help a bit further.
with the following command-line args
After digging into it more closely, it seems one option would be to replace the whole pipeline with
It looks like all those pipes were introduced in 54a7cf9. I'll try reverting it to see if it behaves better.
And the answer is no, it's exactly the same problem. It seems I didn't go all the way back to where the pipes were introduced, which is in https://issues.apache.org/jira/browse/CASSANDRA-11053
CopyTask uses ReceivingChannels and SendingChannels, which are lists of pipes created by CopyTask; those pipes are sent to a list of subprocesses so the main task can communicate with them. In one dtest that deliberately makes one of those child processes break and exit, we see it get stuck forever from time to time. The reason is that the CopyTask process hangs on a `recv` call on one of those pipes. Since the pipes are copied into the child processes, there is one fd open in CopyTask and one fd open in the child process. When the child process closes its fd, `recv()` doesn't raise EOF, since there is still an open fd that might send data. So we need to close the local pipes in CopyTask after all child processes are started.
Fixes: scylladb#37
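For illustration, here is a minimal sketch of the behaviour described in that commit message. It is not the actual copyutil code: `crashing_child` and `parent_sees_eof` are hypothetical names, and it assumes a POSIX system with the `fork` start method, so the child inherits the parent's pipe fds. It uses `poll()` with a timeout in place of a bare `recv()` so the "stuck" case returns instead of hanging.

```python
import multiprocessing as mp
import os


def crashing_child(send_end):
    # Hypothetical worker: dies before sending anything, like the
    # deliberately crashing child process in the dtest.
    os._exit(1)


def parent_sees_eof(close_local_copy, timeout=1.0):
    """Return True if the parent observes EOF after the child dies."""
    ctx = mp.get_context("fork")  # assumption: POSIX, fork start method
    recv_end, send_end = ctx.Pipe(duplex=False)
    child = ctx.Process(target=crashing_child, args=(send_end,))
    child.start()
    if close_local_copy:
        # The idea behind the fix: drop the parent's copy of the write end
        # once the child has started, so the child holds the only writer.
        send_end.close()
    child.join()
    try:
        if not recv_end.poll(timeout):
            # A writer is still open (the parent's own copy), so a plain
            # recv() would block here forever.
            return False
        recv_end.recv()
        return False  # not reached in this sketch: the child sends nothing
    except EOFError:
        return True
    finally:
        recv_end.close()
        if not close_local_copy:
            send_end.close()
```

Calling `parent_sees_eof(True)` should report EOF as soon as the child is gone, while `parent_sees_eof(False)` only returns because of the timeout; with a blocking `recv()` it would hang the way the test did.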
After banging my head on it again and again, I've found the root cause...
and that's the fix: #49
* tools/cqlsh 66ae7eac...426fa0ea (8):
  > Updated Scylla Driver[Issue scylladb/scylla-cqlsh#55]
  > copyutil: closing the local end of pipes after processes starts
  > setup.py: specify Cython language_level explicitly
  > setup.py: pass extensions as a list
  > setup.py: reindent block in else branch
  > setup.py: early return in get_extension()
  > reloc: install build==0.10.0
  > reloc: add --verbose option to build_reloc.sh
Fixes: scylladb/scylla-cqlsh#37
Closes #15685
See for example https://jenkins.scylladb.com/view/master/job/scylla-master/job/dtest-daily-release/249/testReport/cqlsh_tests.cqlsh_copy_tests/TestCqlshCopy/Run_Dtest_Parallel_Cloud_Machines___FullDtest___full_split000___test_copy_to_with_child_process_crashing/
This might be related to the error injection that is triggered in this test.