Slow queries from pg-boss #126

Closed
mshwery opened this issue Aug 30, 2019 · 8 comments
mshwery commented Aug 30, 2019

We have an archive retention policy of 30 days before we evict rows from the archive table, but this seems to produce very poor query times as the number of archived jobs grows...

Should pgboss create an index on archivedOn, since its eviction query depends heavily on that column?
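(For reference, the index in question would look something like this; the name is illustrative, not part of pg-boss's actual schema:)

-- Hypothetical index on the eviction column
CREATE INDEX IF NOT EXISTS archive_archivedon_idx ON pgboss.archive (archivedOn);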

timgit (Owner) commented Aug 30, 2019

Would you share the number of records in the archive and how long it's taking? I think it's fine to add an index, and it would be great if you could add it and share the perf difference.

mshwery commented Sep 2, 2019

Here are a few examples of slow query logs we are seeing. All are related to the archiving or eviction queries:

27 seconds

{
  "data": {
    "parameters": [
      "1 hour"
    ],
    "query": "\n    WITH archived_rows AS (\n      DELETE FROM pgboss.job\n      WHERE\n        (completedOn + CAST($1 as interval) < now())\n        OR (\n          state = 'created'\n          AND name LIKE '__state__completed__%'\n          AND createdOn + CAST($1 as interval) < now()\n        )\n      RETURNING *\n    )\n    INSERT INTO pgboss.archive (\n      id, name, priority, data, state, retryLimit, retryCount, retryDelay, retryBackoff, startAfter, startedOn, singletonKey, singletonOn, expireIn, createdOn, completedOn\n    )\n    SELECT \n      id, name, priority, data, state, retryLimit, retryCount, retryDelay, retryBackoff, startAfter, startedOn, singletonKey, singletonOn, expireIn, createdOn, completedOn\n    FROM archived_rows\n  ",
    "time": 27139
  },
  "message": "Slow query execution time: 27139"
}

219 seconds

{
  "data": {
    "parameters": [
      "30 days"
    ],
    "query": "\n    DELETE FROM pgboss.archive\n    WHERE (archivedOn + CAST($1 as interval) < now())\n  ",
    "time": 219570
  },
  "message": "Slow query execution time: 219570"
}

145 seconds

{
  "data": {
    "parameters": [
      "30 days"
    ],
    "query": "\n    DELETE FROM pgboss.archive\n    WHERE (archivedOn + CAST($1 as interval) < now())\n  ",
    "time": 145607
  },
  "message": "Slow query execution time: 145607"
}

50 seconds

{
  "data": {
    "parameters": [
      "1 hour"
    ],
    "query": "\n    WITH archived_rows AS (\n      DELETE FROM pgboss.job\n      WHERE\n        (completedOn + CAST($1 as interval) < now())\n        OR (\n          state = 'created'\n          AND name LIKE '__state__completed__%'\n          AND createdOn + CAST($1 as interval) < now()\n        )\n      RETURNING *\n    )\n    INSERT INTO pgboss.archive (\n      id, name, priority, data, state, retryLimit, retryCount, retryDelay, retryBackoff, startAfter, startedOn, singletonKey, singletonOn, expireIn, createdOn, completedOn\n    )\n    SELECT \n      id, name, priority, data, state, retryLimit, retryCount, retryDelay, retryBackoff, startAfter, startedOn, singletonKey, singletonOn, expireIn, createdOn, completedOn\n    FROM archived_rows\n  ",
    "time": 50476
  },
  "message": "Slow query execution time: 50476"
}

There are many of these in the logs each day. I can get you more details later on how many records, etc., and the result of adding an index on the archive table.

mshwery commented Sep 3, 2019

~24M rows at the moment (about 10x more than expected), though I see this is partly because pgboss also keeps a row representing each job state transition, like __state__completed__FOO_JOB.
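(A rough way to see how much of the table those transition rows account for; this is just a sketch, reusing the same LIKE pattern pg-boss itself uses in its archive query above:)

SELECT count(*) FILTER (WHERE name LIKE '__state__completed__%') AS state_rows,
       count(*) AS total_rows
FROM pgboss.archive;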

mshwery commented Nov 18, 2019

@timgit another thought – would it be reasonable to allow individual queues to define their own archive retention? I realize that might be a big shift in the inner workings of the lib, but in our case we have some jobs that we want to keep around for a while, and others that can be evicted within a day.

mshwery commented Nov 26, 2019

Looks like the index can't help because of how the comparison is written. We need to isolate the indexed column on the left-hand side of the comparison:

Before:

WHERE (archivedOn + CAST($1 as interval) < now())

After:

WHERE (archivedOn < now() - CAST($1 as interval))
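For reference, the full idea looks like this; the index name matches the IDX_archivedon that appears in the plan below, and '30 days' stands in for the $1 parameter:

CREATE INDEX "IDX_archivedon" ON pgboss.archive (archivedOn);

DELETE FROM pgboss.archive
WHERE archivedOn < now() - CAST('30 days' AS interval);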

mshwery commented Nov 26, 2019

With a small enough table this doesn't seem to help either – the planner still does a Seq Scan because the predicate is an open-ended range (everything older than now() - interval, with no lower bound).

I was able to get the index used by using BETWEEN to create a bounded range instead, but it's not foolproof because we don't know the oldest timestamp in the table. And if the range is sufficiently large compared to the size of the table, it may still not use the index.

BEGIN;
EXPLAIN ANALYZE DELETE FROM pgboss.archive WHERE archivedOn BETWEEN (now() - CAST('2 days' as interval)) AND (now() - CAST('1 day' as interval));
ROLLBACK;
Delete on archive  (cost=14.17..1346.46 rows=573 width=6) (actual time=5.882..5.882 rows=0 loops=1)
  ->  Bitmap Heap Scan on archive  (cost=14.17..1346.46 rows=573 width=6) (actual time=0.123..0.530 rows=570 loops=1)
"        Recheck Cond: ((archivedon >= (now() - '2 days'::interval)) AND (archivedon <= (now() - '1 day'::interval)))"
        Heap Blocks: exact=136
"        ->  Bitmap Index Scan on ""IDX_archivedon""  (cost=0.00..14.03 rows=573 width=0) (actual time=0.098..0.098 rows=570 loops=1)"
"              Index Cond: ((archivedon >= (now() - '2 days'::interval)) AND (archivedon <= (now() - '1 day'::interval)))"
Planning time: 0.088 ms
Execution time: 5.917 ms

timgit (Owner) commented Nov 26, 2019

Thanks for looking into this and playing around with the options. There are several ways I could address this, some more advanced, such as partitioning the archive by date so eviction becomes dropping a partition instead of running a costly query.
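A sketch of what that could look like with declarative partitioning (Postgres 10+); the table definition is abbreviated and the names are hypothetical, not pg-boss's actual schema:

-- Hypothetical daily-partitioned archive
CREATE TABLE archive_by_day (
  id uuid NOT NULL,
  name text NOT NULL,
  data jsonb,
  archivedOn timestamptz NOT NULL DEFAULT now()
) PARTITION BY RANGE (archivedOn);

CREATE TABLE archive_2019_11_26 PARTITION OF archive_by_day
  FOR VALUES FROM ('2019-11-26') TO ('2019-11-27');

-- Evicting a day of archived jobs is then just a metadata operation:
DROP TABLE archive_2019_11_26;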

Until that happens, I think the best course of action would be to create your own retention system that pulls the jobs you need out of the archive, so you won't have to rely on it for anything important. If you ran your own process that copies the meaningful jobs out of the archive on an interval shorter than the archive interval, you could lower the archive retention period enough that the execution time is no longer a problem.
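As a sketch of that kind of side process – the destination table is hypothetical, and the column list is trimmed to the interesting ones:

-- Copy meaningful jobs out of the archive before pg-boss evicts them.
-- Assumes myapp.job_history exists with a unique constraint on id.
INSERT INTO myapp.job_history (id, name, data, state, createdOn, completedOn)
SELECT id, name, data, state, createdOn, completedOn
FROM pgboss.archive
WHERE name NOT LIKE '__state__completed__%'
ON CONFLICT (id) DO NOTHING;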

timgit mentioned this issue Dec 21, 2019
timgit (Owner) commented Jan 9, 2020

I added an index on archivedOn in 3.2.1, which should help in most cases, although I agree it's not a perfect solution for very large volumes.

timgit closed this as completed Jan 9, 2020