Only fetch archive of subdirectories in workspaces #462
This LGTM on a code level, but I do have a concern over the additional file behaviour: .gitignore and .gitattributes can exist at intermediate levels too, and they're complicated enough in terms of how they interact (given the pattern format is more than a simple glob, but includes negation and directory-specific syntax) that this probably won't replicate the exact behaviour a user would normally have with a full monorepo checkout.
Specifically, I'm thinking about this kind of example (I'm using .gitignore, but everything would apply to .gitattributes as well):
.
├── a
│   ├── b
│   │   └── .gitignore
│   └── .gitignore
└── .gitignore
If the workspace is a/b, then a normal commit would have ignore rules applied starting from the root, with each lower level overriding the previous level. Since negation comes in here, where you could end up with a problem would be if, say, ./.gitignore included node_modules, but ./a/.gitignore then negated that: with the current implementation, node_modules would be ignored when it shouldn't be.
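To make the negation problem concrete, here is a minimal sketch of layered ignore rules. It is a drastically simplified model, not Git's actual matcher: the `rule` type and `ignored` function are hypothetical names, and a rule matches only on an exact base name. It shows only the precedence order described above, where deeper levels override shallower ones.

```go
package main

import (
	"fmt"
	"path"
)

// rule is a hypothetical, simplified ignore rule: it matches any path whose
// base name equals name. negate mirrors a leading "!" in a .gitignore line.
type rule struct {
	name   string
	negate bool
}

// ignored applies the rule sets in order from the repository root down to the
// deepest directory. The last matching rule wins, so deeper levels override
// shallower ones.
func ignored(levels [][]rule, p string) bool {
	result := false
	for _, rules := range levels {
		for _, r := range rules {
			if path.Base(p) == r.name {
				result = !r.negate
			}
		}
	}
	return result
}

func main() {
	root := []rule{{name: "node_modules"}}            // ./.gitignore
	a := []rule{{name: "node_modules", negate: true}} // ./a/.gitignore
	var ab []rule                                     // ./a/b/.gitignore (empty in this example)

	target := "a/b/node_modules"

	// Full checkout: all three levels are consulted.
	fmt.Println(ignored([][]rule{root, a, ab}, target)) // false

	// Root plus workspace only: the negation in ./a/.gitignore is lost.
	fmt.Println(ignored([][]rule{root, ab}, target)) // true
}
```

The second call is the situation described above: skipping the intermediate level leaves node_modules ignored when it shouldn't be.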
I think what we'd have to do here is to replicate the skeleton of the monorepo when creating the workspace: include all intermediate directories, and init the Git repo at the root, even though the working directory would still be a/b (in that example).
Honestly, that feels pretty complicated to implement within our current structure. Should we continue to support multiple workspace creators, or should we collapse it back down into a single case (presumably volume, or the RPC volume approach I've been espousing)?
I agree with Adam: this is slightly different behavior than Git would have, and I think that would be bad. Code-wise this looks good, and it's something that can be super valuable with large repositories, but it needs to work; otherwise users will be very confused about the changes introduced to the repo. Regarding your two questions:
I think the answer here would be no, for the following reason:
I think yes. These files are required for the under-the-hood process of generating the patch. The patch should ideally never be surprisingly large just because a node_modules is not ignored properly or whatever. That can probably lead to a lot of frustration before figuring out why that is. After all, our users don't know or care about the inner workings of src-cli.
    
Ah! I see. Okay, either I'm misunderstanding what the two of you wrote, or what you describe, Adam, is exactly what we're doing:
If your workspace is at  See this snippet from the tests: src-cli/internal/campaigns/executor_test.go, lines 327 to 338 in 84a1b16
In  I think the bit that you might be missing is that with workspaces we only download  Does that make sense? (Sidenote: this was also the thing that made me surprised by how easy it was to implement :) I simply unzip another ZIP file and then change the working directory for the script in the container from
    
AHA! I just talked to @eseliger about this on Slack and he told me what I was missing: what if there's a  I understand. Good news and famous last words: I don't think it's hard to build this :) I'll give it a shot.
    
          
I agree. Question is: what should we call the option in the spec?
workspaces:
  - rootAtLocationOf: package.json
    in: "github.com/sourcegraph/automation-testing"
    subdirectoryArchive: true
Or
    
@LawnGnome take a look at af602c6. @eseliger and I just worked on this together. It was rather simple :) It takes us a few more requests to download the additional files, but I think we can maybe optimize that server-side later?
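As a rough illustration of why the extra requests scale with workspace depth, the sketch below lists one candidate path per additional file and directory level, walking from the workspace up to the repository root. The function name `candidatePaths` and the variable `additionalFileNames` are hypothetical, not taken from af602c6, and including the workspace level itself is an assumption. The `path` package is used rather than `path/filepath` so the result stays slash-separated on Windows, which is what URL paths need.

```go
package main

import (
	"fmt"
	"path"
)

// additionalFileNames lists the files discussed in this thread.
// (Hypothetical variable name; the real implementation may differ.)
var additionalFileNames = []string{".gitignore", ".gitattributes"}

// candidatePaths returns slash-separated paths of additional files to look
// for, starting at the workspace directory and moving up one directory at a
// time until the repository root is reached.
func candidatePaths(workspace string) []string {
	var out []string
	dir := path.Clean(workspace)
	for {
		for _, name := range additionalFileNames {
			out = append(out, path.Join(dir, name))
		}
		// "." is the root of a relative path, "/" of an absolute one.
		if dir == "." || dir == "/" {
			break
		}
		dir = path.Dir(dir)
	}
	return out
}

func main() {
	for _, p := range candidatePaths("a/b") {
		fmt.Println(p)
	}
}
```

For the workspace a/b this yields six candidates (two files at each of a/b, a, and the root), so one request per candidate grows linearly with the depth of the workspace.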
    
LGTM!
Do we want to make only-download-workspace-archive the default behaviour?
As I said this morning, in a perfect world I'd make the default depend on the size of the repo: larger repos would default to yes; smaller to no. In practice, though, I'd vote for no.
Do we want to make download-additional-files the default behaviour?
Yes.
what should we call the option in the spec?
onlyFetchWorkspace seems reasonable to me, but I'd also potentially suggest sparseWorkspace, by analogy with the --sparse option supported by git clone.
@malomarrec @eseliger What do you think of

hmm

Alright, I can see that.

@LawnGnome Erik and I just paired on the configuration flag and tested it manually too. I'm going to merge it now, but feel free to leave comments for follow-ups if you find anything.
    
* Only fetch archive of subdirectories in workspaces
* Fetch additional files
* Add a comment
* Make fake file mux more generic
* Copy additional files into workspaces before creating git repo
* Add TODOs
* Fix directory check on Windows
* Hash local file path
* Fix typo in docstring
* More debug output
* Try this
* Remove debug output
* Sort names before mounting them
* Remove left-over slice
* Remove duplication in copying of files
* Hopefully fix copying
* Download additional files on way up from workspace to root dir
* Add a unit test for additional files
* Fix on Windows
* Debug output
* Do not use os.FileSeparator for URLs
* Update changelog entry
* Disable download of only workspace by default, add flag
* Add a docstring in schema
This does two things:
* When workspaces are used in the campaign spec, only an archive of the workspace (including all subdirectories and subworkspaces) is downloaded. That helps a lot with monorepos, where it's infeasible to download the whole repository.
* Additional files are downloaded as well: .gitignore and .gitattributes.
The code is not polished yet, but it works. I added a lot of tests and hope to get a first round of reviews that I can then work off tomorrow.
What's missing:
* unzip and (wc *dockerBindWorkspaceCreator) copyToWorkspace and already have prepareCopyDestinationFile
* fetchOnlyWorkspace a flag that enables this feature
* sourcegraph/sourcegraph
Big questions:
Thank you for reviewing.
Yours truly,