
clarify docs on keepUntil, deleteInternal, and retention #167

Closed
bhsiaoNFLX opened this issue May 16, 2020 · 14 comments · Fixed by #172

Comments

@bhsiaoNFLX
Contributor

bhsiaoNFLX commented May 16, 2020

When looking at the archive table, it took me a while to realize that keepuntil doesn't actually represent the time an archived job will be kept until. I was trying to use one of the deleteInterval... options to control keepuntil, only to realize later that they are not related. The docs currently say When jobs in the archive table become eligible for deletion., but looking at the code, it seems like it is really the retention... options that control the value of keepuntil. Also, afaik deleteInterval... isn't associated with a corresponding date column in the database like the other options are. Is there a reason that's not the case? i.e. something along the lines of a willPurgeOn

P.S. I love pg-boss, keep up the fantastic work.

@timgit
Owner

timgit commented May 16, 2020

Thanks for the feedback. I agree that this can be confusing, primarily because there are 2 job storage tables and 2 different options which control when job records are removed from both of them. I also agree the documentation could be far, far better, as I obviously don't spend as much time writing docs as I do code.

First of all, your question about keepuntil confused me at first, because keepUntil is an internal name and a database column name, but it doesn't appear in the api or configuration. You're right that it's related to the retention options, and it controls how long a job should stay in the "hot" queue table (pgboss.job).

Once a job is moved into the "cold" queue table (pgboss.archive), it remains there based upon a date calculation using the "delete interval" configuration options. The default is 7 days, which means "7 days since the job was archived". The archive table has an additional timestamp column named archivedOn for this purpose.
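As an illustration, the knobs described above map onto constructor options roughly like this. The option names are the ones used in this thread and the v4 configuration docs; the values are illustrative only, not recommendations:

```javascript
// Sketch of the maintenance-related pg-boss constructor options
// discussed in this thread (illustrative values).
const maintenanceOptions = {
  retentionDays: 30,       // "hot" table (pgboss.job): drives the keepuntil column
  archiveIntervalHours: 1, // when completed jobs move from pgboss.job to pgboss.archive
  deleteIntervalDays: 7    // when archived jobs are deleted, measured from archivedon
};
```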

Hope this helps!

@bhsiaoNFLX
Contributor Author

bhsiaoNFLX commented May 16, 2020

Thanks for the clarity - everything mostly makes sense to me (I think). Writing out my understanding here in case other folks have the same type of questions.

I see what you mean - the name keepuntil starts to make a lot more sense when thinking in terms of job retention in the "hot" queue (I was incorrectly thinking of it in terms of the "cold" queue, which is the wrong mental picture).

Totally understand, writing docs takes time, naming things is hard. And to be fair, I think the docs are overall pretty great! I think my original confusion stems from the fact that it is relatively easy to map these api config values to their internal columns:

  • archive... options --> archive.archiveon
  • expireIn... options --> job.expirein

but the following knowledge requires some guess work:

  • retention... options --> job.keepuntil
  • delete... options --> currently no dedicated column

Does this make sense?

I should also mention the reason I especially care is that I have a use case where I need to keep the archived jobs for 90 days. Currently I'm achieving this by setting deleteIntervalDays = 90, which I'm convinced works based on the deleteTest.js unit test. For sanity reasons, it'd be nice to hop into the db tables and check explicitly when my jobs are going to be permanently purged from the archive table. Right now (I believe) this is not possible because there is no dedicated column for that, so I sort of have to just trust that they will be purged in a timely manner. Furthermore, in my use case it is possible that deleteIntervalDays will be dynamic (changes won't be very frequent, but are possible regardless), which is another big reason why I care about these sanity checks.
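Since there's no dedicated purge-date column, one workaround is to compute the estimate yourself from archivedon. A minimal sketch; this helper is hypothetical, not part of pg-boss, and just mirrors the date math described in this thread:

```javascript
// Hypothetical helper: estimate when an archived job will be purged,
// given its archivedon value and the configured deleteIntervalDays.
function estimatePurgeDate(archivedOn, deleteIntervalDays) {
  const purge = new Date(archivedOn);
  purge.setUTCDate(purge.getUTCDate() + deleteIntervalDays);
  return purge;
}

estimatePurgeDate(new Date('2020-05-20T04:53:55Z'), 90).toISOString();
// → '2020-08-18T04:53:55.000Z'
```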

Not a blocking issue for me but just letting you know where I'm coming from. I also understand if mine isn't the typical use case for a queue, where I'm somewhat treating the data in the archive table as the source of truth (leveraging pg-boss's automatic purge instead of rolling my own).

Alternatively, it occurs to me that perhaps I can just set archiveIntervalDays = 90 (and keep the default deleteIntervalDays) to achieve a similar effect. With this approach, I still can't quite check exactly when a specific job will be permanently purged, since the archivedon column only exists in the archive table, not the job table, but at least I'm able to check which jobs are pending to be purged in the archive table. I need to mull over this approach a bit more, but please let me know if you think of any obvious caveats with this alternative approach. The only one I can think of is the fact that I will have hundreds of thousands of completed jobs sitting in the hot queue, so I'm not sure if this is bad practice and/or has performance implications given how the maintenance operations work.

@timgit
Owner

timgit commented May 16, 2020

Haha. Yes, indeed naming things is one of the hardest things. For example, I use the term "interval" in 2 different ways, even causing myself to be confused. Adding to this difficulty is that I've kept most of the original api intact and added to it since v1. For example, I didn't originally build pg-boss with a cold archive table, so "archive" used to mean "delete". I think the only way out of this is to do a semver major with new configuration options to clear up the confusion.

The configuration docs have this:

Archive completed jobs
When jobs become eligible for archive after completion.

and

Delete archived jobs
When jobs in the archive table become eligible for deletion.

While this may clear up some confusion, "Why would I care about setting the archive interval?" is now a valid question, since arguably once a job is completed it could also be immediately moved into "yet another queue table" similar to the archive to consolidate how the completed jobs work. All of the throttling and debouncing ignores completed jobs, so they don't really need to stay in the hot table for normal operations.

Setting the deleteIntervalDays config to 90 will keep jobs in the cold archive for 90 days after they arrive. This is tracked by the column pgboss.archive.archivedOn.

@bhsiaoNFLX
Contributor Author

bhsiaoNFLX commented May 20, 2020

Hey sorry for the late reply. Work carried me away.

I think the only way out of this is to do a semver major with new configuration options to clear up the confusion.

Cool! I did not know about semver until now - I'll have to read up on it.

Seting the deleteIntervalDays config to 90 will keep jobs in the cold archive for 90 days after they arrive.

Yup, I am doing this already and it fits my use case well!

This is tracked by the column pgboss.archive.archivedOn.

That sentence confuses me. I would love to be proven wrong, but it seems like that is not the case for me. I'm on 4.2.0. Maybe it'll help if I show a few rows from my pgboss.archive table:

createdon                   archivedon
2020-05-20 03:53:16.329802  2020-05-20 04:53:55.183965
2020-05-20 03:53:03.356071  2020-05-20 04:53:55.183965
...
2020-05-20 01:12:29.815483  2020-05-20 02:13:37.504774
2020-05-20 01:12:03.955974  2020-05-20 02:13:37.504774

As you can see, archivedon is always about 1 hour after createdon, which is consistent with the default archive interval that I'm currently using. Fyi my jobs are short-lived, less than 1 minute being active.
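The pattern in those rows can be checked mechanically. A small hypothetical helper (timestamps assumed UTC) that measures the createdon-to-archivedon gap:

```javascript
// Sanity check on the rows above: the archivedon - createdon gap should
// roughly match the configured archive interval (1 hour by default).
function archiveDelayMinutes(createdOn, archivedOn) {
  return Math.round((new Date(archivedOn) - new Date(createdOn)) / 60000);
}

archiveDelayMinutes('2020-05-20T03:53:16.329Z', '2020-05-20T04:53:55.183Z');
// → 61
```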

@bhsiaoNFLX
Contributor Author

Also, to catch an error in my previous comment:

Furthermore, in my use case it is possible that deleteIntervalDays will be dynamic (changes won't be very frequent, but are possible regardless), which is another big reason why I care about these sanity checks.

I take back what I said about the dynamic-ness, since I forgot "delete interval" is a constructor option, not a publish option, so changing the delete interval value at runtime is not possible without doing a stop() + start() or disconnect() + connect(), which I'd rather avoid since it adds complexity.

@timgit
Owner

timgit commented May 20, 2020

If archivedOn is 1 hour after createdOn, this just means archiveInterval is probably set to 1 hour. This means completed jobs are moved from the hot job table to the cold archive table. Jobs will be deleted from the archive table 90 days from archivedOn.

@bhsiaoNFLX
Contributor Author

bhsiaoNFLX commented May 20, 2020

True, but hypothetically, if I change deleteIntervalDays to something different in the future, say 120, and redeploy my server with the change, I wouldn't be able to tell explicitly by looking at archive.archivedOn which jobs are going to be purged in 90 days versus which ones in 120 days. Again, not really a big deal, just saying. And I can certainly understand that you probably don't want to add yet another date column to track this, because it may add confusion.

@timgit
Owner

timgit commented May 20, 2020

ArchivedOn is set by now() in postgres. If you restart using new settings, only the new settings are used.

@bhsiaoNFLX
Contributor Author

So in my prior example, after a restart all archived jobs will be purged at archivedon + 120 days. Thanks for clarifying - good to know.

@bhsiaoNFLX
Contributor Author

bhsiaoNFLX commented May 21, 2020

But for future reference and applications, I'd really like to know whether there are any negative performance implications when hundreds of thousands of completed jobs sit in the hot queue for a relatively long period after completion. I imagine the answer is yes, which is why you set the archive interval default to only 1 hour, and I'm guessing you call it the "hot" queue because you prefer jobs not to stay there too long. However, if the answer is no, I may prefer using archiveIntervalDays = 90 instead of deleteIntervalDays = 90, where I shift my mental model a bit and really treat jobs in the archive table as "deleted."

Also, backtracking your previous response:

While this may clear up some confusion, "Why would I care about setting the archive interval?" is now a valid question, since arguably once a job is completed it could also be immediately moved into "yet another queue table" similar to the archive to consolidate how the completed jobs work. All of the throttling and debouncing ignores completed jobs, so they don't really need to stay in the hot table for normal operations.

I don't quite follow this paragraph, especially the last sentence (sorry). Forgive me, since I don't actually use throttle/debounce in my application. In my case, I do care about setting the delete / archive intervals so I can control how long my data lives.

Lastly, in the docs under Events

archived is raised each time 1 or more jobs are archived. The payload is an integer representing the number of jobs archived.

Have you considered enriching the payload to also contain the ids of the archived jobs? This would be useful if you wanted to trigger something to happen after the specific job gets archived, like trigger a TypeORM entity delete.

@timgit
Owner

timgit commented May 21, 2020

Yes, there are negative performance issues when tables grow too large.

The confusion around my statement "Why would I care about setting the archive interval?" is on me. I am brainstorming here about distributing jobs across more tables in the future to address the performance issues with large tables. Please disregard this, lol.

pg-boss used to return job ids in events in an older release, but I removed it since it doesn't scale well when the queue volume increases.

One last thing that you may find helpful: you should feel free to copy data out of the archive table and into another table or system if the simple archive settings offered in pg-boss don't meet your requirements. The 2 important things I'm trying to do with the archive are 1) to keep the primary job table record count as low as possible for the highest volume of incoming jobs, and 2) to keep jobs around for a while for auditing or troubleshooting.

@bhsiaoNFLX
Contributor Author

bhsiaoNFLX commented May 21, 2020

Thanks again for clearing these things up.

It actually makes me feel better that my original assumption to keep the hot queue count small was correct, which means I thankfully don't need to rewrite any code.

The reason I asked the question about ids in events is that I use both TypeORM and pg-boss in my application, but couldn't find a great way to glue these two technologies together (it doesn't have to be TypeORM per se). My current object model separates entities into two categories: ones that by nature need to be cleared away go into pg-boss queues, while ones that are perpetual I store via TypeORM. The obvious disadvantage here is that I now have to manage CRUD using two paradigms instead of one, i.e. using jsonb operators to query against jsonb data + TypeORM Active Record/Data Mapper for persistent entities. This then gave me the idea of using pg-boss job ids to correlate with TypeORM entities, then whenever pg-boss events are raised, triggering an ORM delete to synchronize. But since you mention this doesn't scale well, I guess scratch everything I just said - I can live with 2 paradigms. Again, the bright side is I don't need to rewrite any code.

@timgit
Owner

timgit commented May 22, 2020

If you need to respond to how jobs are completed, you could use the onComplete() subscriptions, too. Just a thought
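A minimal sketch of that idea, assuming boss is a started pg-boss instance and deleteEntity is your own delete-by-id function. The job.data.request.id path for the original job's id is my reading of the completion job shape, so verify it against your pg-boss version:

```javascript
// Hypothetical glue: when a job completes, delete the correlated ORM entity.
// `boss` is assumed to be a started PgBoss instance; `deleteEntity` is a
// hypothetical delete-by-id function from your ORM layer.
function syncOnComplete(boss, queue, deleteEntity) {
  return boss.onComplete(queue, job => deleteEntity(job.data.request.id));
}
```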

@bhsiaoNFLX
Contributor Author

Yes! I do use that for another use case of mine :)
