Added support for handling 206 response in `eta.core.web.download_file()` - Fix for Issue #620 #621

rohis06 · 2024-02-29T01:03:32Z

This pull request addresses the issue outlined in #620.

With these changes, eta.core.web.download_file() downloads the complete file even when the server returns a 206 Partial Content response.

The implementation was tested with the script below, which downloads the same tar file that caused the issue reported in #620. I also validated downloads that do not involve a 206 response, and they work correctly.

#!/usr/bin/env python
# pragma pylint: disable=redefined-builtin
# pragma pylint: disable=unused-wildcard-import
# pragma pylint: disable=wildcard-import
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
from builtins import *

# pragma pylint: enable=redefined-builtin
# pragma pylint: enable=unused-wildcard-import
# pragma pylint: enable=wildcard-import

import logging
import os

import eta.core.web as etaw
import eta.core.utils as etau


logger = logging.getLogger(__name__)


FILE_ID = "http://data.csail.mit.edu/places/places365/test_256.tar"

path = os.path.join(os.path.dirname(__file__), "test.tar")
etaw.download_file(FILE_ID, path=path)

Upon completion of the download, the md5sum of the downloaded file was verified and matched the expected value.
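For readers who want to repeat the checksum verification, here is a minimal sketch using only the standard library. The helper name `md5_of_file` and the chunk size are assumptions for illustration, and the expected digest for the tar file is not stated in this thread, so it has to come from the dataset's own published checksum.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at `path`.

    The file is read in fixed-size chunks so that large archives do not
    need to fit in memory.
    """
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The result can be compared as a plain string against the published checksum, for example `md5_of_file("test.tar") == expected_md5`, where `expected_md5` is a value you supply.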

brimoor

@rohis06 appreciate your work here! Let me know what you think of my comments so far

brimoor · 2024-02-29T02:29:56Z

eta/core/web.py

                pb.update(8 * len(chunk))

+            while True:


If we are specifically targeting 206 responses here, can we instead check r for this status code and only use the range request if a 206 is encountered?

I see a few potential issues with the current approach:

_get_content_length() may return None (check out the implementation below) which will cause the new code here to fail

I believe that not all HTTP servers support range requests

If a request is truncated for a reason other than 206, then we enter a while True loop, which seems dangerous

@brimoor, thanks for reviewing it and providing the above comments! :)

> If we are specifically targeting 206 responses here, can we instead check r for this status code and only use the range request if a 206 is encountered?

That's a valid concern. I initially tried implementing it this way, but later found that the first time execution enters this loop, r.status_code is 200 and not 206, as shown in the wget output here.

> I see a few potential issues with the current approach:
>
> _get_content_length() may return None (check out the implementation below) which will cause the new code here to fail

Sure, so in that case, we can update the while loop condition as shown in the code snippet below.

> I believe that not all HTTP servers support range requests

This is an interesting issue, and I did some research on it. From what I found, a file download can be resumed only when the HTTP server accepts range requests; otherwise it is not possible to resume from where the download left off. This link helped clarify a few things. To ensure we can send a range request to the HTTP server, we can check before entering the while loop whether the Accept-Ranges header is set to something other than None, as shown in the code snippet below.
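As a side note on the header check described above, here is a small sketch of a slightly stricter test. It is not the check used in this PR: the helper name `supports_byte_ranges` is hypothetical, and it assumes `headers` behaves like a mapping with a `get` method. HTTP allows a server to send `Accept-Ranges: none` to state that it does not support range requests, so comparing the value against `bytes` is more conservative than testing only that the header is present.

```python
def supports_byte_ranges(headers):
    """Return True if the response headers advertise byte-range support.

    `headers` is any mapping with a `get` method. A plain dict is
    case-sensitive, so the key must be spelled exactly "Accept-Ranges".
    """
    value = headers.get("Accept-Ranges")
    if value is None:
        # Header absent: the server did not advertise range support
        return False
    # "none" explicitly signals that ranges are unsupported
    return value.strip().lower() == "bytes"
```

A server may still honor range requests without advertising them, so this check can produce false negatives; it only avoids sending a range request to a server that has said it will not honor one.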

> If a request is truncated for a reason other than 206, then we enter a while True loop, which seems dangerous

Actually, I don't think we would run into this scenario since _get_streaming_response() handles this for us. As seen in the existing implementation, if we receive a status code other than 200/206, it raises a WebSessionError, and the control would never come back to the while loop to resume the download.
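To illustrate the behavior described above, here is a rough sketch of a status check that accepts only 200 and 206. It is not the actual body of `_get_streaming_response()`, which is not shown in this thread; the `WebSessionError` class below is a local stand-in and `check_streaming_status` is a hypothetical name.

```python
class WebSessionError(Exception):
    """Local stand-in for the exception type named in this thread."""


def check_streaming_status(status_code):
    """Return `status_code` if it is 200 or 206, otherwise raise.

    Raising here means the caller's resume loop never sees an error
    response, so it cannot keep retrying on one.
    """
    if status_code not in (200, 206):
        raise WebSessionError("Unexpected status code %d" % status_code)
    return status_code
```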

Here's the updated code snippet:

def _do_download(self, r, f):
    size_bytes = _get_content_length(r)
    total_downloaded_bytes = 0
    size_bits = 8 * size_bytes if size_bytes is not None else None
    with etau.ProgressBar(
        size_bits, use_bits=True, quiet=self.quiet
    ) as pb:
        for chunk in r.iter_content(chunk_size=self.chunk_size):
            f.write(chunk)
            total_downloaded_bytes += len(chunk)
            pb.update(8 * len(chunk))

        while size_bytes is not None and ("Accept-Ranges" in r.headers and r.headers["Accept-Ranges"] is not None):
            remaining_bytes = size_bytes - total_downloaded_bytes
            if remaining_bytes > 0:
                logger.debug(
                    "Continuing download...Total downloaded bytes: %d, Remaining bytes: %d"
                    % (total_downloaded_bytes, remaining_bytes)
                )
                r = self._get_streaming_response(
                    r.url,
                    headers={
                        "Range": "bytes=%d-" % total_downloaded_bytes
                    },
                )
                for chunk in r.iter_content(chunk_size=self.chunk_size):
                    f.write(chunk)
                    total_downloaded_bytes += len(chunk)
                    pb.update(8 * len(chunk))
            else:
                break
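The resume logic in the snippet above can be tried without a network connection. The sketch below is an illustration under stated assumptions, not the code merged in this PR: `fetch(offset)` is a hypothetical callable standing in for a range request that starts at byte `offset`, and `make_truncating_fetch` simulates a server that cuts every response short. It also adds two guards that the snippet above does not have, a cap on attempts and a check that each request made progress.

```python
import io


def download_with_resume(fetch, out, total_size, max_attempts=10):
    """Write `total_size` bytes to `out`, re-requesting from the current
    offset whenever a response ends early.

    `fetch(offset)` must return an iterable of byte chunks starting at
    byte `offset` of the remote file.
    """
    downloaded = 0
    for _ in range(max_attempts):
        before = downloaded
        for chunk in fetch(downloaded):
            out.write(chunk)
            downloaded += len(chunk)
        if downloaded >= total_size:
            return downloaded
        if downloaded == before:
            # No bytes arrived on this attempt; stop rather than loop forever
            break
    raise IOError(
        "Download incomplete: %d of %d bytes" % (downloaded, total_size)
    )


def make_truncating_fetch(payload, max_bytes_per_request, chunk_size=4):
    """Simulate a server that sends at most `max_bytes_per_request` bytes
    per request, starting at the requested offset."""

    def fetch(offset):
        end = min(len(payload), offset + max_bytes_per_request)
        for start in range(offset, end, chunk_size):
            yield payload[start:min(start + chunk_size, end)]

    return fetch
```

With a 50-byte payload and a simulated limit of 16 bytes per request, `download_with_resume(make_truncating_fetch(payload, 16), io.BytesIO(), 50)` needs four requests to assemble the full payload. The progress check is a design choice: without it, a server that repeatedly returns an empty body would keep the loop running until the attempt cap is reached.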

Please let me know if this updated code addresses your concerns!

Thanks for the thorough analysis; your proposed implementation looks great to me!

Thank you, @brimoor, for your feedback and approval!
I've committed the changes.

brimoor

LGTM, thanks for this!

brimoor · 2024-03-02T18:43:24Z

@rohis06 just FYI- this patch will be publicly available in voxel51-eta==0.12.6, which will go live when #622 is merged!

rohis06 · 2024-03-02T19:05:11Z

> @rohis06 just FYI- this patch will be publicly available in voxel51-eta==0.12.6, which will go live when #622 is merged!

Thanks for letting me know, @brimoor! Looking forward to contributing more to @voxel51 :)

added support for handling 206 response

555651d

brimoor requested changes Feb 29, 2024

brimoor requested a review from swheaton February 29, 2024 02:31

review comments addressed

d56be35

brimoor approved these changes Mar 1, 2024

brimoor merged commit 2b36830 into voxel51:develop Mar 1, 2024

brimoor mentioned this pull request Mar 2, 2024

eta.core.web.download_file() failing #620

Closed

rohis06 commented Feb 29, 2024

brimoor left a comment

brimoor Feb 29, 2024

rohis06 Mar 1, 2024

brimoor Mar 1, 2024

rohis06 Mar 1, 2024

brimoor left a comment

brimoor commented Mar 2, 2024

rohis06 commented Mar 2, 2024
