[Storage] Replace gsutil with an aio library #631
Conversation
Codecov Report
Additional details and impacted files

@@            Coverage Diff             @@
##           master     #631      +/-   ##
==========================================
+ Coverage   71.89%   80.98%    +9.09%
==========================================
  Files          56       55        -1
  Lines        4639     4669       +30
  Branches      675      671        -4
==========================================
+ Hits         3335     3781      +446
+ Misses       1250      860      -390
+ Partials       54       28       -26
medusa/storage/google_storage.py
Outdated
with GSUtil(self.config) as gsutil:
    for parent, src_paths in _group_by_parent(srcs):
        yield self._upload_paths(gsutil, parent, src_paths, dest)

@retry(stop_max_attempt_number=MAX_UP_DOWN_LOAD_RETRIES, wait_fixed=5000)
Issue: we should probably retry individual blob uploads (_upload_blob()) instead of this one. It will work nicely with the resumable transfer mode.
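For illustration, a minimal sketch of a per-blob retry loop (upload_blob_with_retry and do_upload are illustrative names, not the PR's actual code; the 5 attempts and 5 s wait mirror the decorator above):

import asyncio
import logging

MAX_UP_DOWN_LOAD_RETRIES = 5

async def upload_blob_with_retry(do_upload, src, dest):
    # retry a single blob upload instead of the whole batch; with resumable
    # uploads a failed attempt only restarts this blob's transfer
    for attempt in range(1, MAX_UP_DOWN_LOAD_RETRIES + 1):
        try:
            return await do_upload(src, dest)
        except Exception as exc:
            logging.warning("Uploading %s failed (attempt %d/%d): %s",
                            src, attempt, MAX_UP_DOWN_LOAD_RETRIES, exc)
            if attempt == MAX_UP_DOWN_LOAD_RETRIES:
                raise
            await asyncio.sleep(5)  # mirrors wait_fixed=5000 ms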
Done.
medusa/storage/google_storage.py
Outdated
    )
    resp = resp['resource']
else:
    resp = await self.gcs_storage.upload_from_filename(
Issue: upload_from_filename() will read the whole file into memory, which will obviously not work for us.
Let's use upload() instead and pass it a file object as the file_data arg. Let's also set force_resumable_upload to true and disable the timeout for now.
Multipart uploads in this lib don't seem to work as we would want them to anyway (seems like all parts are read at once, they're not uploaded concurrently and there's no retry when a part fails).
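A rough sketch of what the per-blob upload could look like with upload(), passing a file object and forcing resumable mode (the standalone function shape, bucket_name argument, and the generous timeout value are assumptions, not the PR's final code):

from gcloud.aio.storage import Storage

UPLOAD_TIMEOUT_SECONDS = 3600  # generous value; how the timeout is actually "disabled" is not shown here

async def upload_blob(gcs_storage: Storage, bucket_name: str, src: str, dest: str):
    # hand upload() an open file object so the whole file is not read into memory up front
    with open(src, 'rb') as src_file:
        return await gcs_storage.upload(
            bucket_name,
            dest,
            file_data=src_file,
            force_resumable_upload=True,  # chunked, resumable transfer
            timeout=UPLOAD_TIMEOUT_SECONDS,
        )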
Done.
medusa/storage/google_storage.py
Outdated
with GSUtil(self.config) as gsutil:
    for parent, src_paths in _group_by_parent(srcs):
        yield self._download_paths(gsutil, parent, src_paths, dest)

@retry(stop_max_attempt_number=MAX_UP_DOWN_LOAD_RETRIES, wait_fixed=5000)
Issue: I think we should move the retries to the individual downloads instead.
Since the retry swallows the exceptions, I'd also catch it, log it and then re-raise it for better observability of these failed attempts.
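A sketch of the catch/log/re-raise idea on a single download (the standalone function shape is an assumption; the per-blob retry is expected to wrap this call):

import logging
from gcloud.aio.storage import Storage

async def download_blob(gcs_storage: Storage, bucket_name: str, src: str, dest_path: str):
    try:
        await gcs_storage.download_to_filename(bucket_name, src, dest_path)
    except Exception as exc:
        # log before re-raising so the retry does not silently swallow the failure
        logging.error("Downloading %s to %s failed: %s", src, dest_path, exc)
        raise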
Done.
medusa/backup_node.py
Outdated
@@ -31,7 +31,7 @@
 from medusa.index import add_backup_start_to_index, add_backup_finish_to_index, set_latest_backup_in_index
 from medusa.monitoring import Monitoring
 from medusa.storage import Storage, format_bytes_str, ManifestObject, divide_chunks
-from medusa.storage.google_storage import GSUTIL_MAX_FILES_PER_CHUNK
+from medusa.storage.google_storage import GOOGLE_MAX_FILES_PER_CHUNK
Question: Do we still need this? I have a feeling this was related to gsutil.
You're correct, we don't need this. The new storage driver implementations handle this themselves (eventually we'll do this in the AbstractStorage anyway).
Large file downloads are failing due to the download method being invoked.
medusa/storage/google_storage.py
Outdated
logging.debug("Blob {} last modification time is {}".format(blob.name, blob.extra["last_modified"]))
return parser.parse(blob.extra["last_modified"])

try:
    await self.gcs_storage.download_to_filename(
Issue: This method reads the whole file at once and puts its content into memory. We need to use download_stream() instead, which returns a BufferedStream that should be writeable to a file without loading all the data into memory.
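A minimal sketch of streaming a download to disk, assuming download_stream() returns a stream exposing an async read(size) (the chunk size and standalone function shape are illustrative):

from gcloud.aio.storage import Storage

DOWNLOAD_CHUNK_SIZE = 1024 * 1024  # 1 MiB per read; illustrative value

async def download_blob_streamed(gcs_storage: Storage, bucket_name: str, src: str, dest_path: str):
    # write the object to disk chunk by chunk instead of buffering it all in memory
    stream = await gcs_storage.download_stream(bucket_name, src)
    with open(dest_path, 'wb') as out_file:
        while True:
            chunk = await stream.read(DOWNLOAD_CHUNK_SIZE)
            if not chunk:
                break
            out_file.write(chunk)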
Done. Pushed a new commit, together with a rebase on a recent master.
setup.py
Outdated
@@ -70,7 +70,8 @@
     'dnspython>=2.2.1',
     'asyncio==3.4.3',
     'aiohttp==3.8.5',
-    'aiohttp-s3-client==0.8.17'
+    'aiohttp-s3-client==0.8.17',
+    'gcloud-aio-storage==8.3.0'
Note: I added this dependency which was missing from setup.py
Looks great but I'm wondering about the concurrency level of some operations.
medusa/storage/google_storage.py
Outdated
async def _delete_objects(self, objects: t.List[AbstractBlob]):
    coros = [self._delete_object(obj) for obj in objects]
    await asyncio.gather(*coros)
Question: Is this going to delete all objects concurrently? Maybe our concurrency setting should apply here to avoid sending too many concurrent requests?
Yes, it's all at once. Chunking might be a good idea. Do I chunk it in chunks of config.max_concurrent_transfers size?
Do I chunk it in chunks of config.max_concurrent_transfers size?
Yes, sounds good
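For illustration, a minimal sketch of chunking the concurrent deletes by the concurrency setting (divide_chunks mirrors the helper imported in backup_node.py above; delete_object stands in for the per-object coroutine):

import asyncio

def divide_chunks(items, n):
    # yield consecutive chunks of at most n items
    for i in range(0, len(items), n):
        yield items[i:i + n]

async def delete_objects_chunked(delete_object, objects, max_concurrent_transfers):
    # at most max_concurrent_transfers delete requests are in flight at any time
    for chunk in divide_chunks(objects, max_concurrent_transfers):
        await asyncio.gather(*[delete_object(obj) for obj in chunk])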
medusa/storage/google_storage.py
Outdated
async def _download_blobs(self, srcs: t.List[t.Union[Path, str]], dest: t.Union[Path, str]):
    coros = [self._download_blob(src, dest) for src in map(str, srcs)]
    await asyncio.gather(*coros)
Question: is it downloading all files concurrently at the same time?
Yes, the same as with the deletes. Do we chunk by concurrent transfers?
I think we should chunk, otherwise we may overwhelm the network which could have unforeseen impacts.
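The same chunking shape would apply to the downloads, e.g. (sketch, reusing the divide_chunks helper and asyncio import from the delete example above; names are illustrative):

async def download_blobs_chunked(download_blob, srcs, dest, max_concurrent_transfers):
    # process srcs in batches so only max_concurrent_transfers downloads run concurrently
    for chunk in divide_chunks([str(src) for src in srcs], max_concurrent_transfers):
        await asyncio.gather(*[download_blob(src, dest) for src in chunk])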
SonarCloud Quality Gate failed. 0 Bugs. No coverage information.
Awesome stuff!
* [Storage] Replace gsutil with an aio library
* [Storage] Don't chunk files outside of storage drivers
* [Storage/GCS] Move retries to uploads of individual blobs
* [Storage/GCS] Use timeouts everywhere and force resumable where relevant
* [Storage/GCS] Move retries for downloads to individual blobs
* [GCS Storage] Download via stream instead of all into memory
* [GCS Storage] Chunk deletes/downloads by config.concurrent_transfers
There is a bit of duplicated code, but that will go into the AbstractStorage once we do #628.

The S3 ITs are failing because I enforced KMS and there is a KMS test that is not working because #612 is still open.
Oh, and I forgot to remove unused code, will handle asap.
Fixes #627.
Fixes #637.