
[Flaky test] Flaky data_node #4047

Closed
mkindahl opened this issue Feb 2, 2022 · 3 comments
Assignees
Labels
flaky-test Issue about a flaky test

Comments

mkindahl (Contributor) commented Feb 2, 2022

Which test is flaky?

data_node

Since when has the test been flaky?

No idea

Link to the failed test run

https://github.com/timescale/timescaledb/runs/5034821435?check_suite_focus=true

Log output

diff -u /home/runner/work/timescaledb/timescaledb/tsl/test/expected/data_node.out /home/runner/work/timescaledb/timescaledb/build/tsl/test/results/data_node.out
--- /home/runner/work/timescaledb/timescaledb/tsl/test/expected/data_node.out	2022-02-02 10:12:13.614481898 +0000
+++ /home/runner/work/timescaledb/timescaledb/build/tsl/test/results/data_node.out	2022-02-02 10:15:49.317060231 +0000
@@ -1551,155 +1551,4 @@
 \set ON_ERROR_STOP 1
 -- do actualy copy
 CALL timescaledb_experimental.copy_chunk(chunk=>'_timescaledb_internal._dist_hyper_9_12_chunk', source_node=> 'data_node_1', destination_node => 'data_node_2');
-SELECT * FROM test.remote_exec(NULL, $$ SELECT * from show_chunks('dist_test'); $$);
-NOTICE:  [data_node_1]:  SELECT * from show_chunks('dist_test')
-NOTICE:  [data_node_1]:
-show_chunks                                 
---------------------------------------------
-_timescaledb_internal._dist_hyper_9_12_chunk
-_timescaledb_internal._dist_hyper_9_15_chunk
-(2 rows)
-
-
-NOTICE:  [data_node_2]:  SELECT * from show_chunks('dist_test')
-NOTICE:  [data_node_2]:
-show_chunks                                 
---------------------------------------------
-_timescaledb_internal._dist_hyper_9_13_chunk
-_timescaledb_internal._dist_hyper_9_12_chunk
-(2 rows)
-
-
-NOTICE:  [data_node_3]:  SELECT * from show_chunks('dist_test')
-NOTICE:  [data_node_3]:
-show_chunks                                 
---------------------------------------------
-_timescaledb_internal._dist_hyper_9_14_chunk
-(1 row)
-
-
- remote_exec 
--------------
- 
-(1 row)
-
-SELECT * FROM test.remote_exec(ARRAY['data_node_2'], $$ SELECT sum(device) FROM _timescaledb_internal._dist_hyper_9_12_chunk; $$);
-NOTICE:  [data_node_2]:  SELECT sum(device) FROM _timescaledb_internal._dist_hyper_9_12_chunk
-NOTICE:  [data_node_2]:
-sum
----
-406
-(1 row)
-
-
- remote_exec 
--------------
- 
-(1 row)
-
--- ensure that chunk exists on a destination data node
-\set ON_ERROR_STOP 0
-CALL timescaledb_experimental.copy_chunk(chunk=>'_timescaledb_internal._dist_hyper_9_12_chunk', source_node=> 'data_node_1', destination_node => 'data_node_2');
-ERROR:  chunk "_dist_hyper_9_12_chunk" already exists on destination data node "data_node_2"
-\set ON_ERROR_STOP 1
--- now try to move the same chunk from data node 2 to 3
-CALL timescaledb_experimental.move_chunk(chunk=>'_timescaledb_internal._dist_hyper_9_12_chunk', source_node=> 'data_node_2', destination_node => 'data_node_3');
-SELECT * FROM test.remote_exec(NULL, $$ SELECT * from show_chunks('dist_test'); $$);
-NOTICE:  [data_node_1]:  SELECT * from show_chunks('dist_test')
-NOTICE:  [data_node_1]:
-show_chunks                                 
---------------------------------------------
-_timescaledb_internal._dist_hyper_9_12_chunk
-_timescaledb_internal._dist_hyper_9_15_chunk
-(2 rows)
-
-
-NOTICE:  [data_node_2]:  SELECT * from show_chunks('dist_test')
-NOTICE:  [data_node_2]:
-show_chunks                                 
---------------------------------------------
-_timescaledb_internal._dist_hyper_9_13_chunk
-(1 row)
-
-
-NOTICE:  [data_node_3]:  SELECT * from show_chunks('dist_test')
-NOTICE:  [data_node_3]:
-show_chunks                                 
---------------------------------------------
-_timescaledb_internal._dist_hyper_9_14_chunk
-_timescaledb_internal._dist_hyper_9_12_chunk
-(2 rows)
-
-
- remote_exec 
--------------
- 
-(1 row)
-
-SELECT * FROM test.remote_exec(ARRAY['data_node_3'], $$ SELECT sum(device) FROM _timescaledb_internal._dist_hyper_9_12_chunk; $$);
-NOTICE:  [data_node_3]:  SELECT sum(device) FROM _timescaledb_internal._dist_hyper_9_12_chunk
-NOTICE:  [data_node_3]:
-sum
----
-406
-(1 row)
-
-
- remote_exec 
--------------
- 
-(1 row)
-
-SELECT sum(device) FROM dist_test; 
- sum 
------
- 846
-(1 row)
-
--- Check that they can be called from inside a procedure without
--- generating warnings or error messages (#3495).
-CREATE OR REPLACE PROCEDURE copy_wrapper(regclass, text, text)
-AS $$
-BEGIN
-    CALL timescaledb_experimental.copy_chunk($1, $2, $3);
-END
-$$
-LANGUAGE PLPGSQL;
-CREATE OR REPLACE PROCEDURE move_wrapper(regclass, text, text)
-AS $$
-BEGIN
-    CALL timescaledb_experimental.move_chunk($1, $2, $3);
-END
-$$
-LANGUAGE PLPGSQL;
-SELECT chunk_name, replica_nodes, non_replica_nodes
-FROM timescaledb_experimental.chunk_replication_status;
-       chunk_name       |       replica_nodes       |     non_replica_nodes     
-------------------------+---------------------------+---------------------------
- _dist_hyper_9_12_chunk | {data_node_1,data_node_3} | {data_node_2}
- _dist_hyper_9_13_chunk | {data_node_2}             | {data_node_1,data_node_3}
- _dist_hyper_9_14_chunk | {data_node_3}             | {data_node_1,data_node_2}
- _dist_hyper_9_15_chunk | {data_node_1}             | {data_node_2,data_node_3}
-(4 rows)
-
-CALL copy_wrapper('_timescaledb_internal._dist_hyper_9_14_chunk', 'data_node_3', 'data_node_2');
-CALL move_wrapper('_timescaledb_internal._dist_hyper_9_13_chunk', 'data_node_2', 'data_node_1');
-SELECT chunk_name, replica_nodes, non_replica_nodes
-FROM timescaledb_experimental.chunk_replication_status;
-       chunk_name       |       replica_nodes       |     non_replica_nodes     
-------------------------+---------------------------+---------------------------
- _dist_hyper_9_12_chunk | {data_node_1,data_node_3} | {data_node_2}
- _dist_hyper_9_13_chunk | {data_node_1}             | {data_node_2,data_node_3}
- _dist_hyper_9_14_chunk | {data_node_3,data_node_2} | {data_node_1}
- _dist_hyper_9_15_chunk | {data_node_1}             | {data_node_2,data_node_3}
-(4 rows)
-
-DROP PROCEDURE copy_wrapper;
-DROP PROCEDURE move_wrapper;
-RESET ROLE;
-DROP DATABASE :DN_DBNAME_1;
-DROP DATABASE :DN_DBNAME_2;
-DROP DATABASE :DN_DBNAME_3;
-DROP DATABASE :DN_DBNAME_4;
-DROP DATABASE :DN_DBNAME_5;
-DROP DATABASE :DN_DBNAME_6;
+ERROR:  [data_node_1]: replication slot "ts_copy_1_12" is active for PID 12123

Reason for flakiness

The copy_chunk procedure does not wait for its worker processes to shut down properly, causing a race condition in subsequent calls.
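The error above (the replication slot still being active) points at the missing step: a bounded wait for the slot to be released before the next call reuses it. A minimal, library-free sketch of that wait pattern follows; the `is_active` callback is hypothetical and stands in for querying `pg_replication_slots` on the source node.

```python
import time

def wait_until_inactive(is_active, timeout=30.0, interval=0.1):
    """Poll until is_active() returns False, or raise after timeout seconds."""
    deadline = time.monotonic() + timeout
    while is_active():
        if time.monotonic() >= deadline:
            raise TimeoutError("replication slot still active")
        time.sleep(interval)

# Example: a simulated slot that becomes free after a few polls.
state = {"polls_left": 3}
def fake_slot_active():
    state["polls_left"] -= 1
    return state["polls_left"] > 0

wait_until_inactive(fake_slot_active)
print("slot released")
```

This is only an illustration of the synchronization that was missing, not the actual fix that was merged.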

mkindahl added the flaky-test label on Feb 2, 2022
afiskon (Contributor) commented Feb 15, 2022

For me, it sometimes fails like this on GitHub:

+++ /home/runner/work/timescaledb/timescaledb/build/tsl/test/results/data_node.out
@@ -1551,155 +1551,4 @@
 \set ON_ERROR_STOP 1
 -- do actualy copy
 CALL timescaledb_experimental.copy_chunk(chunk=>'_timescaledb_internal._dist_hyper_9_12_chunk', source_node=> 'data_node_1', destination
_node => 'data_node_2');

[...]

+ERROR:  [data_node_2]: subscription sync wait timedout

The test works fine locally but takes ~25 seconds. The dataset is not that large:

INSERT INTO dist_test SELECT t, (abs(timestamp_hash(t::timestamp)) % 10) + 1, 0.10 FROM generate_series('2018-03-02 1:00'::TIMESTAMPTZ, '2018-03-08 1:00', '1 hour') t;
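For scale: the series bounds above span six days at one-hour steps, so the dataset is only 145 rows. A quick check of that arithmetic (Python, since it is just endpoint math; `generate_series` includes both endpoints):

```python
from datetime import datetime, timedelta

# Bounds and step from the generate_series call above.
start = datetime(2018, 3, 2, 1)
stop = datetime(2018, 3, 8, 1)
step = timedelta(hours=1)

# generate_series is inclusive of both endpoints.
rows = (stop - start) // step + 1
print(rows)  # 145
```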

I wonder if we could reduce this time.

afiskon (Contributor) commented Feb 16, 2022

"subscription sync wait timedout" reproduces on master (Regression PG14.2 Debug ubuntu-20.04): https://github.com/timescale/timescaledb/runs/5206405227?check_suite_focus=true

The data_node test hangs for hours and the log is full of:

 [12461]  WARNING:  out of background worker slots
 [12461]  HINT:  You might need to increase max_worker_processes.
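The hint lines up with the eventual fix, which ran these tests with a larger background-worker budget (the commits below mention max_bgw_8.conf). An illustrative config fragment only; the actual values in max_bgw_8.conf may differ:

```
# postgresql.conf fragment (illustrative, not the actual max_bgw_8.conf contents)
max_worker_processes = 16              # total background worker slots
timescaledb.max_background_workers = 8 # workers TimescaleDB may claim
```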

One other thing I noticed is that it seems to reproduce only on Debug builds; Release builds never failed. Any chance we have some buggy and/or expensive Assert()s?

UPD: Reproduces on both 14.1 and 14.2, so at least we know this has nothing to do with the PostgreSQL version.

nikkhils assigned nikkhils and pmwkaa, then unassigned nikkhils, on Feb 16, 2022
pmwkaa added a commit to pmwkaa/timescaledb that referenced this issue Feb 18, 2022
Fix flaky data_node test by moving tests associated with copy/move chunk
functionality into a separate test file/group and with increased
background workers (using max_bgw_8.conf).

Issue: timescale#4047
pmwkaa added a commit to pmwkaa/timescaledb that referenced this issue Feb 21, 2022
Fix flaky data_node test by moving tests associated with copy/move chunk
functionality into a separate test file/group and with increased
background workers (using max_bgw_8.conf).

Issue: timescale#4047
pmwkaa added a commit that referenced this issue Feb 21, 2022
Fix flaky data_node test by moving tests associated with copy/move chunk
functionality into a separate test file/group and with increased
background workers (using max_bgw_8.conf).

Issue: #4047
nikkhils (Contributor) commented:

PR merged. Closing.
