Slow queries from pg-boss #126

Closed
mshwery opened this issue Aug 30, 2019 · 8 comments
mshwery commented Aug 30, 2019

We have an archive retention policy of 30 days before we evict rows from the archive table, but this seems to produce very poor query times as the number of archived jobs grows...

Should pgboss create an index on archivedOn, since its eviction query depends heavily on that column?
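(For reference, the index in question would look something like this; the name is illustrative, not part of pg-boss's actual schema:)

-- Hypothetical index on the eviction column
CREATE INDEX IF NOT EXISTS archive_archivedon_idx ON pgboss.archive (archivedOn);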

timgit (Owner) commented Aug 30, 2019

Would you share the number of records in the archive and how long it's taking? I think it's fine to add an index, and it would be great if you could add it and share the perf difference.

mshwery commented Sep 2, 2019

Here are a few examples of slow query logs we are seeing. All are related to the archiving or eviction queries:

27 seconds

{
  "data": {
    "parameters": [
      "1 hour"
    ],
    "query": "\n    WITH archived_rows AS (\n      DELETE FROM pgboss.job\n      WHERE\n        (completedOn + CAST($1 as interval) < now())\n        OR (\n          state = 'created'\n          AND name LIKE '__state__completed__%'\n          AND createdOn + CAST($1 as interval) < now()\n        )\n      RETURNING *\n    )\n    INSERT INTO pgboss.archive (\n      id, name, priority, data, state, retryLimit, retryCount, retryDelay, retryBackoff, startAfter, startedOn, singletonKey, singletonOn, expireIn, createdOn, completedOn\n    )\n    SELECT \n      id, name, priority, data, state, retryLimit, retryCount, retryDelay, retryBackoff, startAfter, startedOn, singletonKey, singletonOn, expireIn, createdOn, completedOn\n    FROM archived_rows\n  ",
    "time": 27139
  },
  "message": "Slow query execution time: 27139"
}

219 seconds

{
  "data": {
    "parameters": [
      "30 days"
    ],
    "query": "\n    DELETE FROM pgboss.archive\n    WHERE (archivedOn + CAST($1 as interval) < now())\n  ",
    "time": 219570
  },
  "message": "Slow query execution time: 219570"
}

145 seconds

{
  "data": {
    "parameters": [
      "30 days"
    ],
    "query": "\n    DELETE FROM pgboss.archive\n    WHERE (archivedOn + CAST($1 as interval) < now())\n  ",
    "time": 145607
  },
  "message": "Slow query execution time: 145607"
}

50 seconds

{
  "data": {
    "parameters": [
      "1 hour"
    ],
    "query": "\n    WITH archived_rows AS (\n      DELETE FROM pgboss.job\n      WHERE\n        (completedOn + CAST($1 as interval) < now())\n        OR (\n          state = 'created'\n          AND name LIKE '__state__completed__%'\n          AND createdOn + CAST($1 as interval) < now()\n        )\n      RETURNING *\n    )\n    INSERT INTO pgboss.archive (\n      id, name, priority, data, state, retryLimit, retryCount, retryDelay, retryBackoff, startAfter, startedOn, singletonKey, singletonOn, expireIn, createdOn, completedOn\n    )\n    SELECT \n      id, name, priority, data, state, retryLimit, retryCount, retryDelay, retryBackoff, startAfter, startedOn, singletonKey, singletonOn, expireIn, createdOn, completedOn\n    FROM archived_rows\n  ",
    "time": 50476
  },
  "message": "Slow query execution time: 50476"
}

There are many of these in the logs each day. I can get you more details later on how many records, etc., and the result of adding an index on the archive table.

mshwery commented Sep 3, 2019

~24M rows at the moment (about 10x more than expected), though I see this is partly because pgboss also keeps a row representing each job state transition, like __state__completed__FOO_JOB.
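(A rough way to see how much of the table those transition rows account for; this is just a sketch, reusing the same LIKE pattern pg-boss itself uses in its archive query above:)

SELECT count(*) FILTER (WHERE name LIKE '__state__completed__%') AS state_rows,
       count(*) AS total_rows
FROM pgboss.archive;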

mshwery commented Nov 18, 2019

@timgit another thought – would it be reasonable to allow individual queues to define their own archive retention? I realize that might be a big shift in the inner workings of the lib, but in our case we have some jobs that we want to keep around for a while, and others that can be evicted within a day.

mshwery commented Nov 26, 2019

Looks like the index can't help because of how the comparison is written. We need to isolate the indexed column on the left-hand side of the comparison:

Before:

WHERE (archivedOn + CAST($1 as interval) < now())

After:

WHERE (archivedOn < now() - CAST($1 as interval))
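For reference, the full idea looks like this; the index name matches the IDX_archivedon that appears in the plan below, and '30 days' stands in for the $1 parameter:

CREATE INDEX "IDX_archivedon" ON pgboss.archive (archivedOn);

DELETE FROM pgboss.archive
WHERE archivedOn < now() - CAST('30 days' AS interval);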

mshwery commented Nov 26, 2019

With a small enough table this doesn't seem to help either – the planner still does a Seq Scan because the predicate is an open-ended range (everything older than now() - interval, with no lower bound).

I was able to get the index used by using BETWEEN to create a bounded range instead, but it's not foolproof because we don't know the oldest timestamp in the table. And if the range is sufficiently large compared to the size of the table, it may still not use the index.

BEGIN;
EXPLAIN ANALYZE DELETE FROM pgboss.archive WHERE archivedOn BETWEEN (now() - CAST('2 days' as interval)) AND (now() - CAST('1 day' as interval));
ROLLBACK;
Delete on archive  (cost=14.17..1346.46 rows=573 width=6) (actual time=5.882..5.882 rows=0 loops=1)
  ->  Bitmap Heap Scan on archive  (cost=14.17..1346.46 rows=573 width=6) (actual time=0.123..0.530 rows=570 loops=1)
"        Recheck Cond: ((archivedon >= (now() - '2 days'::interval)) AND (archivedon <= (now() - '1 day'::interval)))"
        Heap Blocks: exact=136
"        ->  Bitmap Index Scan on ""IDX_archivedon""  (cost=0.00..14.03 rows=573 width=0) (actual time=0.098..0.098 rows=570 loops=1)"
"              Index Cond: ((archivedon >= (now() - '2 days'::interval)) AND (archivedon <= (now() - '1 day'::interval)))"
Planning time: 0.088 ms
Execution time: 5.917 ms

timgit (Owner) commented Nov 26, 2019

Thanks for looking into this and playing around with the options. There are several ways I could address this, some more advanced, such as partitioning the archive by date so eviction becomes dropping a partition instead of running a costly query.
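A sketch of what that could look like with declarative partitioning (Postgres 10+); the table definition is abbreviated and the names are hypothetical, not pg-boss's actual schema:

-- Hypothetical daily-partitioned archive
CREATE TABLE archive_by_day (
  id uuid NOT NULL,
  name text NOT NULL,
  data jsonb,
  archivedOn timestamptz NOT NULL DEFAULT now()
) PARTITION BY RANGE (archivedOn);

CREATE TABLE archive_2019_11_26 PARTITION OF archive_by_day
  FOR VALUES FROM ('2019-11-26') TO ('2019-11-27');

-- Evicting a day of archived jobs is then just a metadata operation:
DROP TABLE archive_2019_11_26;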

Until that happens, I think the best course of action would be to create your own retention system that pulls the jobs you need out of the archive, so you won't have to rely on it for anything important. If you ran your own process that copies the meaningful jobs out of the archive on an interval shorter than the archive interval, you could lower the archive retention period enough that the execution time is no longer a problem.
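As a sketch of that kind of side process – the destination table is hypothetical, and the column list is trimmed to the interesting ones:

-- Copy meaningful jobs out of the archive before pg-boss evicts them.
-- Assumes myapp.job_history exists with a unique constraint on id.
INSERT INTO myapp.job_history (id, name, data, state, createdOn, completedOn)
SELECT id, name, data, state, createdOn, completedOn
FROM pgboss.archive
WHERE name NOT LIKE '__state__completed__%'
ON CONFLICT (id) DO NOTHING;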

timgit mentioned this issue Dec 21, 2019
timgit (Owner) commented Jan 9, 2020

I added an index on archivedOn in 3.2.1, which should help in most cases, although I agree it's not a perfect solution for very large volumes.

timgit closed this as completed Jan 9, 2020