Skip to content

Conversation

@mrnugget
Copy link
Contributor

@mrnugget mrnugget commented Feb 9, 2021

This does two things:

  1. When workspaces are used in the campaign spec, only an archive of the workspace (including all subdirectories and subworkspaces) is downloaded. That helps a lot with monorepos where it's unfeasible to download the whole repository.
  2. When only the workspace is downloaded two files from the root of the repository are additionally downloaded, if they exist: .gitignore and .gitattributes.

The code is by not polished yet, but it works, I added a lot of tests and hope to get a first round of reviews that I can then work off tomorrow.

What's missing:

  • I haven't tested the volume workspace mode manually yet
  • I want to remove the duplication between unzip and (wc *dockerBindWorkspaceCreator) copyToWorkspace and already have prepareCopyDestinationFile
  • Changelog entry
  • Make fetchOnlyWorkspace a flag that enables this feature
  • Documentation in sourcegraph/sourcegraph

Big questions:

  • Do we want to make only-download-workspace-archive the default behaviour?
  • Do we want to make download-additional-files the default behaviour?

Thank you for reviewing.

Yours truly,

image

@mrnugget mrnugget marked this pull request as ready for review February 10, 2021 16:29
@mrnugget mrnugget requested a review from a team February 10, 2021 16:29
Copy link
Contributor

@LawnGnome LawnGnome left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM on a code level, but I do have a concern over the additional file behaviour: .gitignore and .gitattributes can exist at intermediate levels too, and they're complicated enough in terms of how they interact (given the pattern format is more than a simple glob, but includes negation and directory-specific syntax) that this probably won't replicate the exact behaviour a user would normally have with a full monorepo checkout.

Specifically, I'm thinking about this kind of example (I'm using .gitignore, but everything would apply to .gitattributes as well):

.
├── a
│   ├── b
│   │   └── .gitignore
│   └── .gitignore
└── .gitignore

If the workspace is a/b, then a normal commit would have ignore rules applied starting from the root, with each lower level overriding the previous level. Since negation comes in here, where you could end up with a problem would be if, say, ./.gitignore included node_modules, but ./a/.gitignore then negated that: with the current implementation, node_modules would be ignored when it shouldn't be.

I think what we'd have to do here is to replicate the skeleton of the monorepo when creating the workspace: include all intermediate directories, and init the Git repo at the root, even though the working directory would still be a/b (in that example).

Honestly, that feels pretty complicated to implement within our current structure. Should we continue to support multiple workspace creators, or should we collapse it back down into a single case (presumably volume, or the RPC volume approach I've been espousing)?

@eseliger
Copy link
Member

I agree with Adam, this is slightly different behavior than git would have and that would be bad, I think. Code-wise this looks good and it's something that can be super valuable when having large repositories, but this needs to work, otherwise users will be very confused about changes introduced to the repo, I think.

Regarding your two questions:

Do we want to make only-download-workspace-archive the default behaviour?

I think the answer here would be no, for the following reason:
Often, workspaces rely on files that live outside of that folder, like a global tsconfig.json file, that is just extended by this sub-directory workspace. I think this is valid for many files and just white-listing those would be hard to do. Not sure if it should be configurable from the spec, but at least I think it's a somewhat confusing thing, especially if the monorepo structure is not that well isolating the separate workspaces.

Do we want to make download-additional-files the default behaviour?

I think yes. These files are required for the under-the-hood process of generating the patch. It should ideally never be surprisingly large, just because a node_modules is not ignored properly or whatever. That can probably lead to a lot frustration before figuring out why that is. After all, our users don't know and care about the inner workings of src-cli.

@mrnugget
Copy link
Contributor Author

Ah! I see. Okay, either I'm misunderstanding what the two of you wrote or what you describe, Adam, is exactly what we're doing:

I think what we'd have to do here is to replicate the skeleton of the monorepo when creating the workspace: include all intermediate directories, and init the Git repo at the root, even though the working directory would still be a/b (in that example).

If your workspace is at a/b/c then the working directory will be in c, but a/b still exist and .gitignore will be in the root.

See this snippet from the tests:

{
Run: `if [[ -f ".gitignore" ]]; then echo "yes" >> gitignore-exists; fi`,
Container: "doesntmatter:13",
},
{
Run: `if [[ $(basename $(pwd)) == "a" && -f "../.gitignore" ]]; then echo "yes" >> gitignore-exists; fi`,
Container: "doesntmatter:13",
},
{
Run: `if [[ $(basename $(pwd)) == "b" && -f "../../.gitignore" ]]; then echo "yes" >> gitignore-exists; fi`,
Container: "doesntmatter:13",
},

In b the .gitignore is downloaded to ../../.gitignore, in a it's at ../.gitignore, and in the root it's at ./.gitignore.

I think the bit that you might be missing is that with workspaces we only download a/b but when it's unzipped the git repository root will look like:

.
├── a
│   ├── b
│   │   └── <files in b here>
│   └── <no files here>
└── <no files here, except .gitignore and .gitattributes>

Does that make sense?

(Sidenote: this was also the thing that made me surprised by how easy it was to implement :) I simply unzip another ZIP file and then change the working directory for the script in the container from /work to /work/a/b.)

@mrnugget
Copy link
Contributor Author

AHA! I just talked to @eseliger about this on Slack and he told me what I was missing: what if there's a .gitignore in a too? Or, if, say, your workspace is in a/b/c/d: what if there are .gitignore files in a, b, c?

I understand. Good news and famous last words: I don't think it's hard to built this :) I'll give it a shot.

@mrnugget
Copy link
Contributor Author

@eseliger

I think the answer here would be no, for the following reason:

I agree. Question is: what should we call the option in the spec?

workspaces:
  - rootAtLocationOf: package.json
    in: "github.com/sourcegraph/automation-testing"
    subdirectoryArchive: true

Or fetchOnlyWorkspace: bool, or noFullArchive: bool?

@eseliger
Copy link
Member

onlyFetchWorkspace sounds best out of the three suggested IMO, but not sure. I'm very bad at naming :)

@mrnugget
Copy link
Contributor Author

@LawnGnome take a look at af602c6. @eseliger and I just worked on this together. It was rather simple :) Takes us a few requests more to download the additional files, but I think we can later optimize that maybe server side?

@malomarrec
Copy link

malomarrec commented Feb 11, 2021

onlyFetchWorkspace sounds best IMO as well. And I like the imperative naming fetch, it helps understand what it means
subdirectoryArchive sounds confusing to me.

Copy link
Contributor

@LawnGnome LawnGnome left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Do we want to make only-download-workspace-archive the default behaviour?

As I said this morning, in a perfect world I'd make the default depend on the size of the repo: larger repos would default to yes; smaller to no. In practice, though, I'd vote for no.

Do we want to make download-additional-files the default behaviour?

Yes.

what should we call the option in the spec?

onlyFetchWorkspace seems reasonable to me, but I'd also potentially suggest sparseWorkspace, by analogy with the --sparse option supported by git clone.

@mrnugget
Copy link
Contributor Author

@malomarrec @eseliger What do you think of sparseWorkspace? I like it a lot, sounds cool 😎

@eseliger
Copy link
Member

hmm sparseWorkspace - I never heard the word sparse in that context before. Maybe shallow is the word I would use? No strong opinion though

@mrnugget
Copy link
Contributor Author

Alright, I can see that. onlyFetchWorkspace it is then.

@mrnugget
Copy link
Contributor Author

@LawnGnome Erik and I just paired on the configuration flag and tested it manually too. I'm going to merge it now, but feel free to leave comments for follow-ups if you find anything.

@mrnugget mrnugget merged commit 225502c into main Feb 12, 2021
@mrnugget mrnugget deleted the mrn/workspace-subdir-archives branch February 12, 2021 13:34
scjohns pushed a commit that referenced this pull request Apr 24, 2023
* Only fetch archive of subdirectories in workspaces

* Fetch additional files

* Add a comment

* Make fake file mux more generic

* Copy additional files into workspaces before creating git repo

* Add TODOs

* Fix directory check on Windows

* Hash local file path

* Fix typo in docstring

* More debug output

* Try this

* Remove debug output

* Sort names before mounting them

* Remove left-over slice

* Remove duplication in copying of files

* Hopefully fix copying

* Download additional files on way up from workspace to root dir

* Add a unit test for additional files

* Fix on Windows

* Debug output

* Do not use os.FileSeparator for URLs

* Update changelog entry

* Disable download of only workspace by default, add flag

* Add a docstring in schema
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants