Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: add_job or an alter_job can crash an unrelated running job #5537

Closed
lkshminarayanan opened this issue Apr 10, 2023 · 2 comments · Fixed by #5543
Closed

[Bug]: add_job or an alter_job can crash an unrelated running job #5537

lkshminarayanan opened this issue Apr 10, 2023 · 2 comments · Fixed by #5543
Assignees
Labels
bgw The background worker subsystem, including the scheduler bug

Comments

@lkshminarayanan
Copy link
Contributor

lkshminarayanan commented Apr 10, 2023

What type of bug is this?

Crash

What subsystems and features are affected?

Background worker, User-Defined Action (UDA)

What happened?

The start_scheduled_jobs function mistakenly sorts the scheduled_jobs list in-place. As a result, when the ts_update_scheduled_jobs_list function compares the updated list of scheduled jobs with the existing scheduled jobs list, it is comparing a list that is sorted by job_id to one that is sorted by next_start time.

This inconsistency can lead to various issues, such as erroneously calling terminate_and_cleanup_job on a running job that doesn't require termination. This will cause the job to terminate with the following log :

2023-04-10 16:20:06.765 IST [23737] FATAL:  terminating background worker "User-Defined Action [1000]" due to administrator command
2023-04-10 16:20:06.765 IST [23737] CONTEXT:  SQL statement "UPDATE test SET a = a + 1"
	PL/pgSQL function update_test(integer,jsonb) line 4 at SQL statement
2023-04-10 16:20:06.787 IST [23715] LOG:  background worker "User-Defined Action [1000]" (PID 23737) exited with exit code 1

The job_errors table also reports the error :

select job_id, err_message from timescaledb_information.job_errors;
 job_id |             err_message             
--------+-------------------------------------
   1000 | job crash detected, see server logs
(1 row)

TimescaleDB version affected

main (517dee9)

PostgreSQL version used

15.2

What operating system did you use?

Ubuntu 22.04

What installation method did you use?

Source

What platform did you run on?

On prem/Self-hosted

Relevant log output and stack trace

Not applicable

How can we reproduce the bug?

Session 1 - Run the following SQL script to create the procedure and related jobs

-- Create a test table
CREATE TABLE test (a int);
INSERT INTO test VALUES (1);

-- Create a procedure that updates the value of a column in the test table
CREATE OR REPLACE PROCEDURE update_test(job_id INT, config JSONB)
LANGUAGE PLPGSQL AS
$$
BEGIN
  RAISE LOG 'Executing job %', job_id;
  UPDATE test SET a = a + 1;
  RAISE LOG 'Job % succeeded', job_id;
END
$$;

-- Create two jobs that use the update_test.
-- Note that these two jobs use the same procedure but the
-- same issue can be reproduced with different procedures as well.
SELECT add_job('update_test', '5s', initial_start => now()::timestamptz + interval '2s');
SELECT add_job('update_test', '5s');

-- Begin a transaction to keep the jobs in 'running' state
BEGIN;
UPDATE test SET a = a + 1;

Session 2 - create a new job that is unrelated to the jobs above. When this is executed in a new session, it will crash the first job that started at initial_start => now()::timestamptz + interval '2s'.

CREATE OR REPLACE PROCEDURE dummy_pg_stat_proc(job_id INT, config JSONB)
LANGUAGE PLPGSQL AS
$$
BEGIN
  PERFORM * FROM pg_stat_activity;
END
$$;

SELECT add_job('dummy_pg_stat_proc', '5s');
@lkshminarayanan lkshminarayanan self-assigned this Apr 10, 2023
@lkshminarayanan lkshminarayanan added the bgw The background worker subsystem, including the scheduler label Apr 10, 2023
@lkshminarayanan
Copy link
Contributor Author

In the main scheduler loop, the ts_update_scheduled_jobs_list function is called whenever the the bgw_job_cache is invalidated and not just during create/alter job.

lkshminarayanan added a commit to lkshminarayanan/timescaledb that referenced this issue Apr 11, 2023
The start_scheduled_jobs function mistakenly sorts the scheduled_jobs
list in-place. As a result, when the ts_update_scheduled_jobs_list
function compares the updated list of scheduled jobs with the existing
scheduled jobs list, it is comparing a list that is sorted by job_id to
one that is sorted by next_start time. Fix that by properly copying the
scheduled_jobs list into a new list and use that for sorting.

Fixes timescale#5537
lkshminarayanan added a commit to lkshminarayanan/timescaledb that referenced this issue Apr 11, 2023
The start_scheduled_jobs function mistakenly sorts the scheduled_jobs
list in-place. As a result, when the ts_update_scheduled_jobs_list
function compares the updated list of scheduled jobs with the existing
scheduled jobs list, it is comparing a list that is sorted by job_id to
one that is sorted by next_start time. Fix that by properly copying the
scheduled_jobs list into a new list and use that for sorting.

Fixes timescale#5537
lkshminarayanan added a commit to lkshminarayanan/timescaledb that referenced this issue Apr 12, 2023
The start_scheduled_jobs function mistakenly sorts the scheduled_jobs
list in-place. As a result, when the ts_update_scheduled_jobs_list
function compares the updated list of scheduled jobs with the existing
scheduled jobs list, it is comparing a list that is sorted by job_id to
one that is sorted by next_start time. Fix that by properly copying the
scheduled_jobs list into a new list and use that for sorting.

Fixes timescale#5537
@lkshminarayanan
Copy link
Contributor Author

A simpler reproducible testcase that doesn't involve any tables or transactions :

Create a proc that does nothing other than pg_sleeping and couple of jobs that use it :

-- Proc that sleeps for 1m - to keep the test jobs in running state
CREATE OR REPLACE PROCEDURE proc_that_sleeps(job_id INT, config JSONB)
LANGUAGE PLPGSQL AS
$$
BEGIN
    PERFORM pg_sleep(60);
END
$$;

-- create new jobs and ensure that the second one gets scheduled
-- before the first one by adjusting the initial_start values
SELECT add_job('proc_that_sleeps', '1h', initial_start => now()::timestamptz + interval '2s');
SELECT add_job('proc_that_sleeps', '1h', initial_start => now()::timestamptz - interval '2s');

Wait for the jobs to start running :

postgres=# SELECT job_id, job_status, next_start,
       total_runs, total_successes, total_failures
  FROM timescaledb_information.job_stats ORDER BY job_id;
 job_id | job_status |            next_start            | total_runs | total_successes | total_failures 
--------+------------+----------------------------------+------------+-----------------+----------------
      1 | Scheduled  | 2023-04-12 22:02:17.324489+05:30 |          1 |               1 |              0
      2 | Scheduled  | 2023-05-01 05:30:00+05:30        |          1 |               1 |              0
   1000 | Running    | -infinity                        |          1 |               0 |              0
   1001 | Running    | -infinity                        |          1 |               0 |              0
(4 rows)

Now add create another job :

SELECT add_job('proc_that_sleeps', '1h');

This will cause the job with id 1000 to fail :

postgres=# SELECT job_id, job_status, next_start,
       total_runs, total_successes, total_failures
  FROM timescaledb_information.job_stats ORDER BY job_id;
 job_id | job_status |            next_start            | total_runs | total_successes | total_failures 
--------+------------+----------------------------------+------------+-----------------+----------------
      1 | Scheduled  | 2023-04-12 22:02:17.324489+05:30 |          1 |               1 |              0
      2 | Scheduled  | 2023-05-01 05:30:00+05:30        |          1 |               1 |              0
   1000 | Scheduled  | -infinity                        |          1 |               0 |              0
   1001 | Scheduled  | 2023-04-12 22:02:19.025757+05:30 |          1 |               1 |              0
   1002 | Running    | -infinity                        |          1 |               0 |              0
(5 rows)

Time: 3.439 ms
postgres=# SELECT job_id, err_message
  FROM timescaledb_information.job_errors;
 job_id |             err_message             
--------+-------------------------------------
   1000 | job crash detected, see server logs
(1 row)

Time: 2.102 ms

lkshminarayanan added a commit to lkshminarayanan/timescaledb that referenced this issue Apr 12, 2023
The start_scheduled_jobs function mistakenly sorts the scheduled_jobs
list in-place. As a result, when the ts_update_scheduled_jobs_list
function compares the updated list of scheduled jobs with the existing
scheduled jobs list, it is comparing a list that is sorted by job_id to
one that is sorted by next_start time. Fix that by properly copying the
scheduled_jobs list into a new list and use that for sorting.

Fixes timescale#5537
lkshminarayanan added a commit to lkshminarayanan/timescaledb that referenced this issue Apr 13, 2023
The start_scheduled_jobs function mistakenly sorts the scheduled_jobs
list in-place. As a result, when the ts_update_scheduled_jobs_list
function compares the updated list of scheduled jobs with the existing
scheduled jobs list, it is comparing a list that is sorted by job_id to
one that is sorted by next_start time. Fix that by properly copying the
scheduled_jobs list into a new list and use that for sorting.

Fixes timescale#5537
lkshminarayanan added a commit to lkshminarayanan/timescaledb that referenced this issue Apr 14, 2023
The start_scheduled_jobs function mistakenly sorts the scheduled_jobs
list in-place. As a result, when the ts_update_scheduled_jobs_list
function compares the updated list of scheduled jobs with the existing
scheduled jobs list, it is comparing a list that is sorted by job_id to
one that is sorted by next_start time. Fix that by properly copying the
scheduled_jobs list into a new list and use that for sorting.

Fixes timescale#5537
lkshminarayanan added a commit that referenced this issue Apr 14, 2023
The start_scheduled_jobs function mistakenly sorts the scheduled_jobs
list in-place. As a result, when the ts_update_scheduled_jobs_list
function compares the updated list of scheduled jobs with the existing
scheduled jobs list, it is comparing a list that is sorted by job_id to
one that is sorted by next_start time. Fix that by properly copying the
scheduled_jobs list into a new list and use that for sorting.

Fixes #5537
github-actions bot pushed a commit that referenced this issue Apr 14, 2023
The start_scheduled_jobs function mistakenly sorts the scheduled_jobs
list in-place. As a result, when the ts_update_scheduled_jobs_list
function compares the updated list of scheduled jobs with the existing
scheduled jobs list, it is comparing a list that is sorted by job_id to
one that is sorted by next_start time. Fix that by properly copying the
scheduled_jobs list into a new list and use that for sorting.

Fixes #5537

(cherry picked from commit a383c8d)
timescale-automation pushed a commit that referenced this issue Apr 14, 2023
The start_scheduled_jobs function mistakenly sorts the scheduled_jobs
list in-place. As a result, when the ts_update_scheduled_jobs_list
function compares the updated list of scheduled jobs with the existing
scheduled jobs list, it is comparing a list that is sorted by job_id to
one that is sorted by next_start time. Fix that by properly copying the
scheduled_jobs list into a new list and use that for sorting.

Fixes #5537

(cherry picked from commit a383c8d)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bgw The background worker subsystem, including the scheduler bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant