Add Azure Blob Storage remote #868
Conversation
This change adds a new cloud remote based on Azure Blob Storage [1] and the azure-storage-blob SDK [2]. The implementation closely follows the patterns laid out by the existing cloud remotes (AWS and GCP). Notable differences include:

- No use of a lock in the progress-bar callback and no requirement to look up the total transfer size ahead of time: the Azure Storage SDK provides the total transfer size and the current transfer offset in its progress callback.
- When `exists` is called, check the existence of every blob instead of listing the entire container. Container list operations can be expensive, so unless the number of checked files is very large, making separate requests is likely to be more efficient. Please let me know if a list-then-check implementation of `exists` would be preferred.
- For increased flexibility, access credentials for the Azure Storage backend can be provided either connection-string style via the remote URL or via environment variables.

The Azure Storage backend can be emulated via Azurite [3].

[1] https://azure.microsoft.com/en-us/services/storage/blobs/
[2] https://github.com/Azure/azure-storage-python
[3] https://github.com/Azure/azurite

This change was created together with @EricSchles.
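The progress-callback difference can be sketched as follows. This is an illustration, not the actual patch: `make_progress_callback` and `update_progress` are assumed names, while the `progress_callback=` hook matches the azure-storage-blob 1.x/2.x upload/download helpers, which invoke it with the current byte offset and the total size on every chunk.

```python
def make_progress_callback(name, update_progress):
    """Build a callback for azure-storage-blob's progress_callback hook.

    The SDK passes both the current offset and the total size on every
    chunk, so no lock and no up-front size lookup are needed.
    """
    def callback(current, total):
        # `current` and `total` come straight from the SDK.
        update_progress(name, current, total)
    return callback

# Usage sketch (BlockBlobService is the azure-storage-blob <= 2.x client;
# `progress` is an assumed dvc-style progress reporter, not a real API):
#
# blob_service.create_blob_from_path(
#     container, blob_name, local_path,
#     progress_callback=make_progress_callback(blob_name, progress.update),
# )
```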
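The dual credential sources could look roughly like this sketch. The `azure://` URL layout and the environment-variable names here are assumptions for illustration; consult the patch itself for the exact names dvc uses.

```python
import os

def parse_azure_url(url):
    """Parse an azure:// remote URL carrying connection-string style
    key=value pairs, e.g.:

        azure://ContainerName=data;AccountName=me;AccountKey=secret

    Falls back to environment variables for any missing value.
    (Key and variable names are illustrative assumptions.)
    """
    body = url[len("azure://"):]
    # split("=", 1) keeps base64 '=' padding inside account keys intact
    parts = dict(
        pair.split("=", 1) for pair in body.split(";") if "=" in pair
    )
    return {
        "container": parts.get("ContainerName"),
        "account_name": parts.get("AccountName")
        or os.environ.get("AZURE_STORAGE_ACCOUNT_NAME"),
        "account_key": parts.get("AccountKey")
        or os.environ.get("AZURE_STORAGE_ACCOUNT_KEY"),
    }
```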
efiop left a comment:
Great patch! Thank you!
```python
def exists(self, path_infos):
    # TODO: why does S3 fetch all blobs first?
```
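For context, the per-blob approach the TODO refers to could be sketched like this; `BlockBlobService.exists(container_name, blob_name)` is part of the azure-storage-blob 1.x/2.x API, while the surrounding names are assumptions:

```python
def exists(blob_service, container, paths):
    # One existence request per blob instead of listing the whole
    # container; cheap when only a handful of paths is checked.
    return [blob_service.exists(container, path) for path in paths]
```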
It is simply much faster :) We mostly use the exists() method when filtering a bulk of cache files to decide whether we need to download/upload them, and in S3 list_objects is much, much faster than checking keys one by one. If the blob_service.exists() method is plenty fast as is, there is no reason to redo this part of the code.
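The list-then-check strategy described here could be sketched as follows (again with assumed names; `list_blobs(container_name)` exists on the 1.x/2.x `BlockBlobService`):

```python
def exists_via_listing(blob_service, container, paths):
    # A single listing pass, then local membership tests: one (paged)
    # request amortized over all paths, instead of one request per path.
    existing = {blob.name for blob in blob_service.list_blobs(container)}
    return [path in existing for path in paths]
```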
Let's put this as a comment instead of this TODO.
Fixes #442.
The current exists() implementation is fine, thank you! I'll set up a test shortly and, if it is too slow, will change it to list-then-check logic. @c-w @EricSchles Thank you guys so much for the amazing work! I will be sure to release a new version this week, after I set up Azure for testing.
Thanks for the super quick merge, @efiop. It would be excellent to know the results of the test for multiple-exists versus list-then-check, so please keep me in the loop :) Do also let me know if you need any help setting up Blob Storage for testing; e.g. Azurite works well, and I've used it with Docker on Travis on other OSS projects in the past.
Sure :)
Actually, I'm not very familiar with Azure, and I would really appreciate it if you could add Azurite to our Travis setup so we could test your driver with it. Could you add it, please? I will be happy to provide any help needed. Once the test is up and running on CI, I will be able to release a new version of dvc with your driver included right away.
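For local or CI testing against Azurite, the well-known development-storage account works out of the box. The account name and key below are the ones Microsoft publishes for emulator use, and the endpoint port is Azurite's default for the blob service; how dvc would consume this connection string is left open here.

```python
# Well-known Azure storage emulator credentials (safe to commit; they are
# public and only valid against a local emulator such as Azurite).
AZURITE_CONNECTION_STRING = (
    "DefaultEndpointsProtocol=http;"
    "AccountName=devstoreaccount1;"
    "AccountKey=Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6"
    "tq/K1SZFPTOtr/KBHBeksoGMGw==;"
    "BlobEndpoint=http://127.0.0.1:10000/devstoreaccount1;"
)
```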