Parallel planning race condition under specific conditions #2341

Closed
ribejara-te opened this issue Jun 28, 2022 · 2 comments · Fixed by #2348
Labels
bug Something isn't working

Comments

@ribejara-te
Contributor

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request. Searching for pre-existing feature requests helps us consolidate datapoints for identical requirements into a single place, thank you!
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.

Overview of the Issue

The parallel planning mechanism for different projects with the same workspace name has a race condition where N goroutines will try to clone the same repository into the same directory concurrently.

Reproduction Steps

  1. Set parallel_plan: true in atlantis.yaml
  2. Have 2 different projects in atlantis.yaml with different directories but the same workspace name
  3. Raise a pull request that triggers a plan in both projects
  4. Wait for Atlantis to report both plans (these will succeed)
  5. Trigger one more plan by commenting atlantis plan
  6. Wait for both plans to fail

Logs

Plan output for one project:

running git clone --branch REDACTED --depth=1 --single-branch https://REDACTED@REDACTED /home/atlantis/.atlantis/repos/REDACTED/XXXX/default: Cloning into '/home/atlantis/.atlantis/repos/REDACTED/XXXX/default'...
fatal: Unable to create '/home/atlantis/.atlantis/repos/REDACTED/XXXX/default/.git/shallow.lock': No such file or directory
: exit status 128

Plan output for the other project:

running git clone --branch REDACTED --depth=1 --single-branch https://REDACTED@REDACTED /home/atlantis/.atlantis/repos/REDACTED/XXXX/default: Cloning into '/home/atlantis/.atlantis/repos/REDACTED/XXXX/default'...
fatal: Unable to read current working directory: No such file or directory
fatal: fetch-pack: invalid index-pack output
: exit status 128

Relevant server logs (you'll see why these are relevant later on):

will re-clone repo, could not determine if was at correct commit: git rev-parse HEAD: exit status 128: HEAD
fatal: ambiguous argument 'HEAD': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'

Relevant stacktrace:

github.com/runatlantis/atlantis/server/events.(*FileWorkspace).Clone
  github.com/runatlantis/atlantis/server/events/working_dir.go:105
github.com/runatlantis/atlantis/server/events.(*DefaultProjectCommandRunner).doPlan
  github.com/runatlantis/atlantis/server/events/project_command_runner.go:374
github.com/runatlantis/atlantis/server/events.(*DefaultProjectCommandRunner).Plan
  github.com/runatlantis/atlantis/server/events/project_command_runner.go:208
github.com/runatlantis/atlantis/server/events.(*ProjectOutputWrapper).updateProjectPRStatus
  github.com/runatlantis/atlantis/server/events/project_command_runner.go:169
github.com/runatlantis/atlantis/server/events.(*ProjectOutputWrapper).Plan
  github.com/runatlantis/atlantis/server/events/project_command_runner.go:149
github.com/runatlantis/atlantis/server/events.RunAndEmitStats
  github.com/runatlantis/atlantis/server/events/instrumented_project_command_runner.go:39
github.com/runatlantis/atlantis/server/events.(*InstrumentedProjectCommandRunner).Plan
  github.com/runatlantis/atlantis/server/events/instrumented_project_command_runner.go:13
github.com/runatlantis/atlantis/server/events.runProjectCmdsParallel.func1
  github.com/runatlantis/atlantis/server/events/project_command_pool_executor.go:28

Environment details

Version:

v0.19.5-pre.20220616

Repo atlantis.yaml file:

version: 3
parallel_plan: true
projects:
  - dir: projects/one
    autoplan:
      enabled: true
      when_modified: ['*.tf*']
  - dir: projects/two
    autoplan:
      enabled: true
      when_modified: ['*.tf*']

Root cause

When we trigger the second plan via atlantis plan, this is what happens:

  1. We get to this line according to our logs (which, confusingly enough, says "applies" rather than "plans"; I guess that's a typo)
  2. This runs the Plan(...) function N times, one per plan up to --parallel-pool-size, concurrently via goroutines
  3. Which basically wraps the doPlan(...) function
  4. Which runs the p.WorkingDir.Clone(...) function
  5. Which, as we saw in the logs above, fails to run git rev-parse and therefore calls the forceClone(...) function
  6. Which (1) deletes the directory, (2) creates the directory and (3) clones the repository into the directory (this is done concurrently by both plan goroutines; there's no locking/semaphore mechanism that prevents it).

And of course, since two goroutines are running rm && mkdir && git clone on the same directory concurrently, they both fail at some point, with different errors each time, since race conditions are unpredictable.

Fixing this should be as straightforward as adding some mutex mechanism before calling p.WorkingDir.Clone(...). What do you think?

ribejara-te added the bug label Jun 28, 2022
@plutino

plutino commented Jun 28, 2022

We ran into the same problem. We have a mono-repo setup with about 50 different terraform directories, each with 3 terraform workspaces corresponding to 3 different Atlantis workflows. Occasionally some changes may trigger all Atlantis projects to plan (e.g., changes on a common dependent module), and they'd all fail with the same error OP mentioned.

@ribejara-te
Contributor Author

I've raised #2348 with my fix proposal.

jamengual pushed a commit that referenced this issue Jul 27, 2022
* fix: repository cloning race condition (#2341)

* fix: switched from sync to golang.org/x/sync/semaphore

* fix: check value of sem.Acquire(...)
krrrr38 pushed a commit to krrrr38/atlantis that referenced this issue Dec 16, 2022
…s#2348)

* fix: repository cloning race condition (runatlantis#2341)

* fix: switched from sync to golang.org/x/sync/semaphore

* fix: check value of sem.Acquire(...)