-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added support for handling 206 response in eta.core.web.download_file()
- Fix for Issue #620
#621
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rohis06 appreciate your work here! Let me know what you think of my comments so far
eta/core/web.py
Outdated
pb.update(8 * len(chunk)) | ||
|
||
while True: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are specifically targeting 206 responses here, can we instead check r
for this status code and only use the range request if a 206 is encountered?
I see a few potential issues with the current approach:
_get_content_length()
may returnNone
(check out the implementation below) which will cause the new code here to fail- I believe that not all HTTP servers support range requests
- If a request is truncated for a reason other than 206, then we enter a
while True
loop, which seems dangerous
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@brimoor, thanks for reviewing it and providing the above comments! :)
If we are specifically targeting 206 responses here, can we instead check
r
for this status code and only use the range request if a 206 is encountered?
That's a valid concern, and I even tried implementing it this way at the start, but only later did I figure out that the first time when it enters this loop, r.status_code
is 200 and not 206 as shown in the wget output here.
I see a few potential issues with the current approach:
_get_content_length()
may returnNone
(check out the implementation below) which will cause the new code here to fail
Sure, so in that case, we can update the while loop condition as shown in the code snippet below.
- I believe that not all HTTP servers support range requests
This is an interesting issue, and I tried doing some research on this. From what I found online, we can resume a file download only when an HTTP server accepts range requests. If it doesn't, it's not possible to resume a download from where it was left off. This link helped clarify a few things. So to ensure we can send a range request to the HTTP server, we can check before entering the while loop whether the Accepts-Range
header is set to something other than None, as shown in the code snippet below.
- If a request is truncated for a reason other than 206, then we enter a
while True
loop, which seems dangerous
Actually, I don't think we would run into this scenario since _get_streaming_response()
handles this for us. As seen in the existing implementation, if we receive a status code other than 200/206, it raises a WebSessionError
, and the control would never come back to the while loop to resume the download.
Here's the updated code snippet:
def _do_download(self, r, f):
size_bytes = _get_content_length(r)
total_downloaded_bytes = 0
size_bits = 8 * size_bytes if size_bytes is not None else None
with etau.ProgressBar(
size_bits, use_bits=True, quiet=self.quiet
) as pb:
for chunk in r.iter_content(chunk_size=self.chunk_size):
f.write(chunk)
total_downloaded_bytes += len(chunk)
pb.update(8 * len(chunk))
while size_bytes is not None and ("Accept-Ranges" in r.headers and r.headers["Accept-Ranges"] is not None):
remaining_bytes = size_bytes - total_downloaded_bytes
if remaining_bytes > 0:
logger.debug(
"Continuing download...Total downloaded bytes: %d, Remaining bytes: %d"
% (total_downloaded_bytes, remaining_bytes)
)
r = self._get_streaming_response(
r.url,
headers={
"Range": "bytes=%d-" % total_downloaded_bytes
},
)
for chunk in r.iter_content(chunk_size=self.chunk_size):
f.write(chunk)
total_downloaded_bytes += len(chunk)
pb.update(8 * len(chunk))
else:
break
Kindly let me know if this updated code would do the needful!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the thorough analysis; your proposed implementation looks great to me!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, @brimoor, for your feedback and approval!
I've committed the changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for this!
This pull request addresses the issue outlined in #620.
With these changes,
eta.core.web.download_file()
now enables the complete download of any file from the internet, even when obtained as a 206 partial content response.The implementation has undergone thorough testing with the provided script. The script attempts to download the same tar file that initially caused the issue reported in #620. Additionally, have validated scenarios where the download is not a 206 response, all of which function correctly.
Upon completion of the download, the md5sum of the downloaded file was verified and matched the expected value.