Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exclude .git for cloud storage #1494

Merged
merged 6 commits into from
Dec 8, 2022
Merged

Exclude .git for cloud storage #1494

merged 6 commits into from
Dec 8, 2022

Conversation

infwinston
Copy link
Member

@infwinston infwinston commented Dec 7, 2022

This is a quick patch to make s3 exclude .git for #1485. However, this might be a surprising behavior to some users.
Alternative is to implement our own code to read .gitignore and generate corresponding directories to exclude (i.e., parsing those **/*.pyc syntax ourselves)
Should we implement that?

Note sky launch will respect .gitignore and putting .git/ inside is enough to exclude the directory. or should we make it default for all rsync?

Tested

  • sky spot launch a workdir containing .git
  • mounted storage with S3, GCS

@infwinston
Copy link
Member Author

Hm tests seem to be broken now due to python version issue #609 (comment)

@concretevitamin
Copy link
Collaborator

Could we run some tests and write a Tested section?

@infwinston
Copy link
Member Author

Ah yes I already did some tests. updated the post.

Copy link
Collaborator

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @infwinston. I've run the following tests for AWS/S3 only:

» cat test.yaml                                         
run: |
  set -ex
  pwd
  ls -lthr

mkdir dir-with-no-git
sky spot launch test.yaml --cloud aws -n dir-with-no-git --workdir=dir-with-no-git

mkdir dir-with-git-file; cd dir-with-git-file; touch .git
sky spot launch test.yaml --cloud aws -n dir-with-git-file --workdir=dir-with-git-file

Can you help test them on GCS before merging? It'd also be great to run the file mount smoke tests.

@infwinston
Copy link
Member Author

I just tested spot launch with GCP and confirm that the GCS bucket doesn't contain .git/.

(sky) ubuntu@ip-172-31-94-104:~/sky_workdir/skypilot$ sky spot launch test.yaml --cloud gcp -n dir-with-git-file --workdir=dir-with-git-file
Task from YAML spec: test.yaml
Launching a new spot task 'dir-with-git-file'. Proceed? [Y/n]:
I 12-08 01:25:06 execution.py:575] Translating file_mounts with local source paths to SkyPilot Storage...
I 12-08 01:25:06 execution.py:601] Workdir 'dir-with-git-file' will be synced to cloud storage 'skypilot-workdir-ubuntu-3fe64d43'.
I 12-08 01:25:06 execution.py:672] Uploading sources to cloud storage. See: sky storage ls
I 12-08 01:25:08 storage.py:1342] Created GCS bucket skypilot-workdir-ubuntu-3fe64d43 in US-CENTRAL1 with storage class STANDARD
Launching managed spot job dir-with-git-file from spot controller...
Launching spot controller...
I 12-08 01:25:13 cloud_vm_ray_backend.py:1001] To view detailed progress: tail -n100 -f /home/ubuntu/sky_logs/sky-2022-12-08-01-25-09-515628/provision.log
I 12-08 01:25:23 cloud_vm_ray_backend.py:1257] Launching on GCP us-central1 (us-central1-a)

@infwinston
Copy link
Member Author

using_file_mounts smoke test passed. merging this PR.

@infwinston
Copy link
Member Author

infwinston commented Dec 8, 2022

hmm I just realized our bert_qa example shown on the spot job doc may fail. because git checkout is used with workdir... @concretevitamin wdyt?
https://skypilot.readthedocs.io/en/latest/examples/spot-jobs.html#task-yaml

One simple solution is to modify the YAML in the doc to use git clone. but we may need to be careful that some users might actually want to use git checkout with workdir.

@concretevitamin
Copy link
Collaborator

Great catch!

  1. (meta comment) Is this example covered by some smoke test? If so, we should really run smoke tests as a blanket check and rely less on manual detection (not saying this is the case here).

  2. I think it is fine, if we make the following changes:

  • In the code of this PR, detect whether the directory .git exists, and if so print a clear warning like, "Detected .git directory; ignoring it due to the potential large number of files. Git operations inside setup/run commands will not work."
  • As part of this PR, update that doc page, and any other .py example files. E.g.,
# git clone https://github.com/huggingface/transformers.git ~/transformers

->

# git clone https://github.com/huggingface/transformers.git ~/transformers -b v4.18.0

@infwinston
Copy link
Member Author

infwinston commented Dec 8, 2022

Smoke test passed. Just done an end-to-end test with and without excluding .git. and confirmed the cloud buckets as expected. But looks like the difference of uploading time is not big.

.git excluded

(sky) ubuntu@ip-172-31-94-104:~/sky_workdir/skypilot$ sky spot launch --workdir . ls
Task from command: ls
Launching a new spot task 'sky-770d-ubuntu'. Proceed? [Y/n]: Y
I 12-08 06:02:42 execution.py:575] Translating file_mounts with local source paths to SkyPilot Storage...
I 12-08 06:02:42 execution.py:601] Workdir '.' will be synced to cloud storage 'skypilot-workdir-ubuntu-73b8d216'.
I 12-08 06:02:42 execution.py:672] Uploading sources to cloud storage. See: sky storage ls
I 12-08 06:02:44 storage.py:1345] Created GCS bucket skypilot-workdir-ubuntu-73b8d216 in US-CENTRAL1 with storage class STANDARD
W 12-08 06:02:44 storage.py:725] '.git' directory under '.' is excluded during sync
Launching managed spot job sky-770d-ubuntu from spot controller...
Launching spot controller...
I 12-08 06:02:52 cloud_vm_ray_backend.py:1001] To view detailed progress: tail -n100 -f /home/ubuntu/sky_logs/sky-2022-12-08-06-02-49-316146/provision.log

.git included

(sky) ubuntu@ip-172-31-94-104:~/sky_workdir/skypilot$ sky spot launch --workdir . ls
Task from command: ls
Launching a new spot task 'sky-bfb5-ubuntu'. Proceed? [Y/n]:
I 12-08 06:04:19 execution.py:575] Translating file_mounts with local source paths to SkyPilot Storage...
I 12-08 06:04:19 execution.py:601] Workdir '.' will be synced to cloud storage 'skypilot-workdir-ubuntu-9bb2ab63'.
I 12-08 06:04:19 execution.py:672] Uploading sources to cloud storage. See: sky storage ls
I 12-08 06:04:21 storage.py:1339] Created GCS bucket skypilot-workdir-ubuntu-9bb2ab63 in US-CENTRAL1 with storage class STANDARD
Launching managed spot job sky-bfb5-ubuntu from spot controller...
Launching spot controller...
I 12-08 06:04:30 cloud_vm_ray_backend.py:1001] To view detailed progress: tail -n100 -f /home/ubuntu/sky_logs/sky-2022-12-08-06-04-27-121384/provision.log

@infwinston
Copy link
Member Author

@concretevitamin I also add some warning if .git is detected. PTAL

W 12-08 06:02:44 storage.py:725] '.git' directory under '.' is excluded during sync

Copy link
Collaborator

@concretevitamin concretevitamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit concerning there's no speed difference.

Maybe try the example in #1463?

sky/data/storage.py Outdated Show resolved Hide resolved
Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
@infwinston
Copy link
Member Author

infwinston commented Dec 8, 2022

The example in #1463 seems to be related to a local network issue that occasionally happens to my laptop. Now it seems to work well with and without .git. speed is also similar.

Will keep an eye on this.

@infwinston infwinston merged commit 8652b6b into master Dec 8, 2022
@infwinston infwinston deleted the exclude-dot-git branch December 8, 2022 07:29
iojw pushed a commit that referenced this pull request Feb 18, 2023
* s3 exclude .git

* gcs

* comments

* warning and fix doc

* Update sky/data/storage.py

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>

* fix

Co-authored-by: Zongheng Yang <zongheng.y@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants