Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: plan queue functionality #2782

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open

Conversation

ghaiszaher
Copy link
Contributor

@ghaiszaher ghaiszaher commented Dec 12, 2022

what

  • This is a follow-up on feat: plan queue functionality #1923 which was closed because the fork was deleted.

  • The implementation of the queue functionality is complete and the code is almost production-ready. We are seeking your assistance and would appreciate your feedback if this is the right direction and if this functionality could be considered.

  • What we have done so far

    • Feature flag to enable/disable the queue.
    • Automatically entering the queue upon a plan request if another PR has the lock.
    • Granting the lock to the next PR in queue when the previous PR is unlocked (either via explicit atlantis unlock command, from Atlantis UI or by merging/closing the PR).
    • Notifying the author that the PR is ready to be planned for the unlocked environments/workspaces.
      • We avoided automatic planning because: 1) A PR can have more than one lock and not all of them might be released yet. 2) It was cumbersome to store the initial command in case extra arguments were passed.
    • Support for Redis.
    • 🐞 bug-fix: if a just-dequeued PR is unlocked (e.g. with atlantis unlock or from UI), the delete lock call fails with deleting lock failed with: remove .../default.tfplan: no such file or directory -> this is fixed by ignoring non-existing file errors when deleting a plan
    • Showing the queue in the main UI and in the locks view.
    image image
  • What is missing:

    • Handling plan errors with respect to the queue: by adding a global feature flag which allows the user to either:
      • Grant the lock to the next person in the queue (current implementation)
      • Or keep the lock until explicitly unlocked -> queue remains unchanged
    • More commands to handle the queue: e.g. jump in front of the queue, or get out of the queue.
    • Ability to remove a lock from the queue from the UI
    • More test coverage.
    • if a PR is merged or closed without having the lock -> also remove from the queue.
      Note: This should also be done if the queue is disabled
    • Test rendered html content in web_templates
    • when "Discard Plan & Unlock" button is clicked in the UI -> stay in the same page if the lock is given to the next person

tests

why

Reasons are explained in the discussion #1722

references

contributors

@monikma @amir-elgayed @ghaiszaher

@nitrocode
Copy link
Member

Due to the related discussion having so many upvotes, it might be worth resurrecting this.

@ghaiszaher would you be able to resolve the conflicts and verify that this works in your system?

@ghaiszaher
Copy link
Contributor Author

Due to the related discussion having so many upvotes, it might be worth resurrecting this.

@ghaiszaher would you be able to resolve the conflicts and verify that this works in your system?

@nitrocode I resolved the conflicts but I discovered a bug when trying to replan a PR that already has the lock, because with the current logic there is a call to UnlockByPull before planning and thus messing up the queue.

I am working on a fix.

@ghaiszaher ghaiszaher marked this pull request as ready for review January 28, 2023 09:09
@ghaiszaher ghaiszaher requested a review from a team as a code owner January 28, 2023 09:09
@github-actions github-actions bot added the go Pull requests that update Go code label Jan 28, 2023
@ghaiszaher
Copy link
Contributor Author

Due to the related discussion having so many upvotes, it might be worth resurrecting this.
@ghaiszaher would you be able to resolve the conflicts and verify that this works in your system?

@nitrocode I resolved the conflicts but I discovered a bug when trying to replan a PR that already has the lock, because with the current logic there is a call to UnlockByPull before planning and thus messing up the queue.

I am working on a fix.

@nitrocode I fixed the bug and tried it locally (only with BoltDB), it's working properly now.
We still need to take care of some comments from @lilincmu's review #1923 (review)

@nitrocode nitrocode added the needs tests Change requires tests label Jan 28, 2023
@nitrocode
Copy link
Member

@ghaiszaher that's great to hear. Looking forward to your changes. I changed the pr body into a check list so we can follow your progress. 😄 thank you for your efforts!

Copy link
Contributor Author

@ghaiszaher ghaiszaher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nitrocode thanks a lot. I fixed some of the issues and updated the PR description. We're getting closer to a production-ready version but still need some help.

I am moving the initial discussions started in the old PR to this one.

#1923 (comment)
@monikma

About your questions on the tests - yes we have not implemented all of them, as we first wanted to check with you if this is some functionality that sounds like a good idea, and if how we go about implementing it would make sense to you (as mentioned in the PR description). So my understanding is "yes it does" ? :)

We could continue by applying the suggestions and extending tests. Would that make sense to you? What do you expect from us to call our PR "production ready", and get closer to merging it?

One topic we circled around but never properly got to, is feature flag to enable/disable the feature. If you had any suggestions there, it would be great (meaning how feature flags are usually approached in this codebase, we would not want to break some convention).

What do you think about these points? Just to know how to continue with some parts.

Also feel free to review the PR in its current state 😄

Comment on lines 94 to 97
// TODO monikma #8 this will be called if there was a plan error and the lock was automatically dropped;
// Should we assure dequeuing of the next PR here too?
_, _, err := p.Locker.Unlock(lockAttempt.LockKey, false)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Original discussion: #1923 (comment)

@monikma

Yes, what do you think, what would make more sense?

  1. Just drop everything and wait for the lock owner to fix their plan and acquire the lock again, or
  2. Say "your loss, now you are out of the queue", and dequeue the next PR, or
  3. Quickly re-acquire the lock for this PR so that the owner can plan as soon as they fixed it

.. we have not really decided. I guess 1 is not a good option as then a third person could accidentally get the lock.. also 3 would not be fair since fixing can take time. So I guess 2. Yes, we have not implemented it yet, we were waiting for some general feedback on the functionality first :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Option 1: this would mean anyone can obtain the lock and get in front of the queue in the meantime
Option 3: That was our initial option but we were not sure why we release the lock in case of plan error in the first place.
Maybe we can also have Option 4: feature flag to either go with Option 2 (dequeue on plan error) or Option 3 (always keep the lock on plan error).

@nitrocode any thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc: @runatlantis/maintainers and @krrrr38 for visibility

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like Option 4 that way the behavior is configurable to the user workflow that they prefer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jamengual since this PR is getting very large, do you think it makes sense to introduce such feature flag in a follow-up, and for now just give the lock to the next in queue?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can add them in another PR but the feature if approved will not be released if the flags are not added, so we will have to have both PRs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, then I will consider it in this PR if you don't mind the size.

@@ -79,6 +79,88 @@ func TestRedisWithTLS(t *testing.T) {
_ = newTestRedisTLS(s)
}

func TestGetQueueByLock(t *testing.T) {
Copy link
Contributor Author

@ghaiszaher ghaiszaher Jan 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a lot of repetition between this file and boltdb_test.go. Would it be possible to combine all these in one generic file, and let the tests run twice using both redis and boltdb (since we expect the exact same behavior)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps. Did you want to do this refactoring as a separate pr?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this was answered by @jamengual in this comment: #2782 (comment)

But if this is a valid option, I can try to work on it next.

@jamengual
Copy link
Contributor

jamengual commented Jan 29, 2023 via email

server/core/db/boltdb.go Outdated Show resolved Hide resolved
server/core/db/boltdb.go Outdated Show resolved Hide resolved
server/core/db/boltdb.go Outdated Show resolved Hide resolved
server/core/db/boltdb.go Outdated Show resolved Hide resolved
server/core/db/boltdb.go Outdated Show resolved Hide resolved
server/server.go Outdated Show resolved Hide resolved
@@ -92,7 +92,7 @@ func (a *APIController) Plan(w http.ResponseWriter, r *http.Request) {
a.apiReportError(w, http.StatusInternalServerError, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add some end to end tests too since this is such a large change?

u.commentOnDequeuedPullRequests(ctx, dequeueStatus)
}

func (u *UnlockCommandRunner) commentOnDequeuedPullRequests(ctx *command.Context, dequeueStatus models.DequeueStatus) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add tests for these functions?

Copy link
Contributor Author

@ghaiszaher ghaiszaher Apr 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nitrocode same issue here, should this function be exported just for the tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nitrocode any thoughts on this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ghaiszaher Looks like some tests already exist in command_test_runner.go

@@ -114,6 +127,22 @@ var IndexTemplate = template.Must(template.New("index.html.tmpl").Parse(`
<a href="{{ $basePath }}{{.LockPath}}">
<div class="twelve columns button content lock-row">
<div class="list-title">{{.RepoFullName}} <span class="heading-font-size">#{{.PullNum}}</span> <code>{{.Path}}</code> <code>{{.Workspace}}</code></div>
{{ if .Queue }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any web_templates tests that we can add for these changes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't find any test regarding the rendered content

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I also could not find a test for this. I was hoping maybe we could add one in this PR to test verify/validate this change.

@jamengual
Copy link
Contributor

@ghaiszaher do you think you will have time to finish this? Thanks.

@ghaiszaher
Copy link
Contributor Author

@ghaiszaher do you think you will have time to finish this? Thanks.

@jamengual Yes. In the meantime, I would appreciate your opinion on the question in this discussion: #2782 (comment)

Comment on lines +244 to +245
// TODO(Ghais): make sure it's fine to remove this line:
//_, _, err = p.lockingLocker.UnlockByPull(baseRepo.FullName, pull.Num)
Copy link
Contributor Author

@ghaiszaher ghaiszaher Aug 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jamengual what do you think about deleting this line? We wouldn't really benefit from removing all the locks here. As long as the plans are deleted with p.deletePlans, it should be fine to keep the locks assigned to the current pull requests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to bring @GenPage into this conversation but I do agree with you

Copy link
Member

@GenPage GenPage Sep 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not against it, but I have some concerns about the plan deletion code within pending_plan_finder, it's a bit dated, and it doesn't make sense to me while we handle deleting plans in multiple places. Wouldn't it be better to use the deletePlan function in working_dir.go instead?

@GenPage GenPage added this to the v0.26.0 milestone Aug 31, 2023
@GenPage GenPage added the waiting-on-review Waiting for a review from a maintainer label Sep 8, 2023
@GenPage GenPage self-requested a review September 8, 2023 22:23
@GenPage GenPage self-assigned this Sep 8, 2023
# Conflicts:
#	server/events/delete_lock_command.go
#	server/events/delete_lock_command_test.go
@GenPage
Copy link
Member

GenPage commented Sep 25, 2023

I've started reviewing this but will take some time. It's a pretty big change

@GenPage GenPage modified the milestones: v0.26.0, v0.27.0 Oct 6, 2023
@GenPage
Copy link
Member

GenPage commented Dec 11, 2023

We will place this on hold as we want to introduce more locking around a plan queue when we have a solution for the current regressions concerning #3345. I would like more eyes on this PR before introducing this new feature.

@GenPage GenPage added feature New functionality/enhancement needs discussion Large change that needs review from community/maintainers never-stale labels Dec 11, 2023
@GenPage GenPage removed this from the v0.27.0 milestone Dec 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs Documentation feature New functionality/enhancement go Pull requests that update Go code needs discussion Large change that needs review from community/maintainers needs tests Change requires tests never-stale waiting-on-review Waiting for a review from a maintainer work-in-progress
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants