feat: plan queue functionality #2782

ghaiszaher · 2022-12-12T19:33:00Z

what

This is a follow-up on feat: plan queue functionality #1923 which was closed because the fork was deleted.
The implementation of the queue functionality is complete and the code is almost production-ready. We are seeking your assistance and would appreciate your feedback if this is the right direction and if this functionality could be considered.
What we have done so far
- Feature flag to enable/disable the queue.
- Automatically entering the queue upon a plan request if another PR has the lock.
- Granting the lock to the next PR in queue when the previous PR is unlocked (either via explicit atlantis unlock command, from Atlantis UI or by merging/closing the PR).
- Notifying the author that the PR is ready to be planned for the unlocked environments/workspaces.
  - We avoided automatic planning because: 1) A PR can have more than one lock and not all of them might be released yet. 2) It was cumbersome to store the initial command in case extra arguments were passed.
- Support for Redis.
- 🐞 bug-fix: if a just-dequeued PR is unlocked (e.g. with atlantis unlock or from UI), the delete lock call fails with deleting lock failed with: remove .../default.tfplan: no such file or directory -> this is fixed by ignoring non-existing file errors when deleting a plan
- Showing the queue in the main UI and in the locks view.
What is missing:
- Handling plan errors with respect to the queue: by adding a global feature flag which allows the user to either:
  - Grant the lock to the next person in the queue (current implementation)
  - Or keep the lock until explicitly unlocked -> queue remains unchanged
- More commands to handle the queue: e.g. jump in front of the queue, or get out of the queue.
- Ability to remove a lock from the queue from the UI
- More test coverage.
  - end to end tests for the feature flag
  - feat: plan queue functionality #1923 (comment)
  - feat: plan queue functionality #1923 (comment)
  - feat: plan queue functionality #1923 (comment)
  - feat: plan queue functionality #1923 (comment)
- if a PR is merged or closed without having the lock -> also remove from the queue.
  Note: This should also be done if the queue is disabled
- Test rendered html content in web_templates
- when "Discard Plan & Unlock" button is clicked in the UI -> stay in the same page if the lock is given to the next person

tests

Docker image published to https://github.com/ghaiszaher/atlantis/pkgs/container/atlantis is tested and the queueing functionality works properly

why

Reasons are explained in the discussion #1722

references

Related discussion Plan queue functionality #1722
Previous PR feat: plan queue functionality #1923 (from older fork)
Previous PR feat: plan queue functionality #2781 (targeted incorrect parent branch)
Related Feature Req: Autoplan on state unlock #1617

contributors

@monikma @amir-elgayed @ghaiszaher

nitrocode · 2023-01-13T01:39:15Z

Due to the related discussion having so many upvotes, it might be worth resurrecting this.

@ghaiszaher would you be able to resolve the conflicts and verify that this works in your system?

ghaiszaher · 2023-01-27T16:04:54Z

Due to the related discussion having so many upvotes, it might be worth resurrecting this.

@ghaiszaher would you be able to resolve the conflicts and verify that this works in your system?

@nitrocode I resolved the conflicts but I discovered a bug when trying to replan a PR that already has the lock, because with the current logic there is a call to UnlockByPull before planning and thus messing up the queue.

I am working on a fix.

ghaiszaher · 2023-01-28T09:12:40Z

Due to the related discussion having so many upvotes, it might be worth resurrecting this.
@ghaiszaher would you be able to resolve the conflicts and verify that this works in your system?

@nitrocode I resolved the conflicts but I discovered a bug when trying to replan a PR that already has the lock, because with the current logic there is a call to UnlockByPull before planning and thus messing up the queue.

I am working on a fix.

@nitrocode I fixed the bug and tried it locally (only with BoltDB), it's working properly now.
We still need to take care of some comments from @lilincmu's review #1923 (review)

nitrocode · 2023-01-28T15:11:53Z

@ghaiszaher that's great to hear. Looking forward to your changes. I changed the pr body into a check list so we can follow your progress. 😄 thank you for your efforts!

ghaiszaher

@nitrocode thanks a lot. I fixed some of the issues and updated the PR description. We're getting closer to a production-ready version but still need some help.

I am moving the initial discussions started in the old PR to this one.

#1923 (comment)
@monikma

About your questions on the tests - yes we have not implemented all of them, as we first wanted to check with you if this is some functionality that sounds like a good idea, and if how we go about implementing it would make sense to you (as mentioned in the PR description). So my understanding is "yes it does" ? :)

We could continue by applying the suggestions and extending tests. Would that make sense to you? What do you expect from us to call our PR "production ready", and get closer to merging it?

One topic we circled around but never properly got to, is feature flag to enable/disable the feature. If you had any suggestions there, it would be great (meaning how feature flags are usually approached in this codebase, we would not want to break some convention).

What do you think about these points? Just to know how to continue with some parts.

Also feel free to review the PR in its current state 😄

ghaiszaher · 2023-01-29T15:05:09Z

server/events/project_locker.go

+			// TODO monikma #8 this will be called if there was a plan error and the lock was automatically dropped;
+			// Should we assure dequeuing of the next PR here too?
+			_, _, err := p.Locker.Unlock(lockAttempt.LockKey, false)


Original discussion: #1923 (comment)

@monikma

Yes, what do you think, what would make more sense?

Just drop everything and wait for the lock owner to fix their plan and acquire the lock again, or

Say "your loss, now you are out of the queue", and dequeue the next PR, or

Quickly re-acquire the lock for this PR so that the owner can plan as soon as they fixed it

.. we have not really decided. I guess 1 is not a good option as then a third person could accidentally get the lock.. also 3 would not be fair since fixing can take time. So I guess 2. Yes, we have not implemented it yet, we were waiting for some general feedback on the functionality first :)

Option 1: this would mean anyone can obtain the lock and get in front of the queue in the meantime
Option 3: That was our initial option but we were not sure why we release the lock in case of plan error in the first place.
Maybe we can also have Option 4: feature flag to either go with Option 2 (dequeue on plan error) or Option 3 (always keep the lock on plan error).

@nitrocode any thoughts?

cc: @runatlantis/maintainers and @krrrr38 for visibility

I like Option 4 that way the behavior is configurable to the user workflow that they prefer.

@jamengual since this PR is getting very large, do you think it makes sense to introduce such feature flag in a follow-up, and for now just give the lock to the next in queue?

I think we can add them in another PR but the feature if approved will not be released if the flags are not added, so we will have to have both PRs

I see, then I will consider it in this PR if you don't mind the size.

ghaiszaher · 2023-01-29T16:09:13Z

server/core/redis/redis_test.go

@@ -79,6 +79,88 @@ func TestRedisWithTLS(t *testing.T) {
 	_ = newTestRedisTLS(s)
 }

+func TestGetQueueByLock(t *testing.T) {


There's a lot of repetition between this file and boltdb_test.go. Would it be possible to combine all these in one generic file, and let the tests run twice using both redis and boltdb (since we expect the exact same behavior)?

Perhaps. Did you want to do this refactoring as a separate pr?

I think this was answered by @jamengual in this comment: #2782 (comment)

But if this is a valid option, I can try to work on it next.

jamengual · 2023-01-29T18:56:35Z

mmmm I don't know if this could make it confusing for people since it does not follow the stablished pattern.

…

On Sun, Jan 29, 2023, 8:09 a.m. Ghais Zaher ***@***.***> wrote: ***@***.**** commented on this pull request. ------------------------------ In server/core/redis/redis_test.go <#2782 (comment)> : > @@ -79,6 +79,88 @@ func TestRedisWithTLS(t *testing.T) { _ = newTestRedisTLS(s) } +func TestGetQueueByLock(t *testing.T) { There's a lot of repetition between this file and boltdb_test.go. Would it be possible to combine all these in one generic file, and let them run using both redis and boltdb (since we expect the exact same behavior)? — Reply to this email directly, view it on GitHub <#2782 (review)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAQ3ERD4HPZQBKXU44ZBWFDWU2I3LANCNFSM6AAAAAAS4KLVGA> . You are receiving this because your review was requested.Message ID: ***@***.***>

server/core/db/boltdb.go

server/core/locking/locking.go

server/core/redis/redis.go

server/static/bindata_assetfs.go

server/server.go

nitrocode · 2023-02-10T01:02:53Z

server/controllers/api_controller.go

@@ -92,7 +92,7 @@ func (a *APIController) Plan(w http.ResponseWriter, r *http.Request) {
 		a.apiReportError(w, http.StatusInternalServerError, err)


Could you add some end to end tests too since this is such a large change?

nitrocode · 2023-02-10T01:03:14Z

server/events/unlock_command_runner.go

+	u.commentOnDequeuedPullRequests(ctx, dequeueStatus)
+}
+
+func (u *UnlockCommandRunner) commentOnDequeuedPullRequests(ctx *command.Context, dequeueStatus models.DequeueStatus) {


Could you add tests for these functions?

@nitrocode same issue here, should this function be exported just for the tests?

@nitrocode any thoughts on this?

@ghaiszaher Looks like some tests already exist in command_test_runner.go

nitrocode · 2023-02-10T01:03:55Z

server/controllers/templates/web_templates.go

@@ -114,6 +127,22 @@ var IndexTemplate = template.Must(template.New("index.html.tmpl").Parse(`
      <a href="{{ $basePath }}{{.LockPath}}">
        <div class="twelve columns button content lock-row">
        <div class="list-title">{{.RepoFullName}} <span class="heading-font-size">#{{.PullNum}}</span> <code>{{.Path}}</code> <code>{{.Workspace}}</code></div>
+        {{ if .Queue }}


Any web_templates tests that we can add for these changes?

Can't find any test regarding the rendered content

Yep, I also could not find a test for this. I was hoping maybe we could add one in this PR to test verify/validate this change.

server/controllers/api_controller.go

jamengual · 2023-08-14T14:52:11Z

@ghaiszaher do you think you will have time to finish this? Thanks.

ghaiszaher · 2023-08-15T14:19:16Z

@ghaiszaher do you think you will have time to finish this? Thanks.

@jamengual Yes. In the meantime, I would appreciate your opinion on the question in this discussion: #2782 (comment)

ghaiszaher · 2023-08-15T22:26:50Z

server/events/plan_command_runner.go

+		// TODO(Ghais): make sure it's fine to remove this line:
+		//_, _, err = p.lockingLocker.UnlockByPull(baseRepo.FullName, pull.Num)


@jamengual what do you think about deleting this line? We wouldn't really benefit from removing all the locks here. As long as the plans are deleted with p.deletePlans, it should be fine to keep the locks assigned to the current pull requests.

I think we need to bring @GenPage into this conversation but I do agree with you

I'm not against it, but I have some concerns about the plan deletion code within pending_plan_finder, it's a bit dated, and it doesn't make sense to me while we handle deleting plans in multiple places. Wouldn't it be better to use the deletePlan function in working_dir.go instead?

# Conflicts: # server/events/delete_lock_command.go # server/events/delete_lock_command_test.go

GenPage · 2023-09-25T19:43:52Z

I've started reviewing this but will take some time. It's a pretty big change

GenPage · 2023-12-11T18:01:20Z

We will place this on hold as we want to introduce more locking around a plan queue when we have a solution for the current regressions concerning #3345. I would like more eyes on this PR before introducing this new feature.

nitrocode added the work-in-progress label Dec 17, 2022

ghaiszaher marked this pull request as ready for review January 28, 2023 09:09

ghaiszaher requested a review from a team as a code owner January 28, 2023 09:09

github-actions bot added the go Pull requests that update Go code label Jan 28, 2023

nitrocode added the needs tests Change requires tests label Jan 28, 2023

ghaiszaher mentioned this pull request Jan 29, 2023

feat: plan queue functionality #1923

Closed

ghaiszaher commented Jan 29, 2023

View reviewed changes