Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added support for handling 206 response in eta.core.web.download_file() - Fix for Issue #620 #621

Merged
merged 2 commits into from
Mar 1, 2024

Conversation

rohis06
Copy link
Contributor

@rohis06 rohis06 commented Feb 29, 2024

This pull request addresses the issue outlined in #620.

With these changes, eta.core.web.download_file() now enables the complete download of any file from the internet, even when obtained as a 206 partial content response.

The implementation has undergone thorough testing with the provided script. The script attempts to download the same tar file that initially caused the issue reported in #620. Additionally, have validated scenarios where the download is not a 206 response, all of which function correctly.

#!/usr/bin/env python
# pragma pylint: disable=redefined-builtin
# pragma pylint: disable=unused-wildcard-import
# pragma pylint: disable=wildcard-import
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
from builtins import *

# pragma pylint: enable=redefined-builtin
# pragma pylint: enable=unused-wildcard-import
# pragma pylint: enable=wildcard-import

import logging
import os

import eta.core.web as etaw
import eta.core.utils as etau


logger = logging.getLogger(__name__)


FILE_ID = "http://data.csail.mit.edu/places/places365/test_256.tar"

path = os.path.join(os.path.dirname(__file__), "test.tar")
etaw.download_file(FILE_ID, path=path)

Upon completion of the download, the md5sum of the downloaded file was verified and matched the expected value.

Copy link
Contributor

@brimoor brimoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rohis06 appreciate your work here! Let me know what you think of my comments so far

eta/core/web.py Outdated
pb.update(8 * len(chunk))

while True:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are specifically targeting 206 responses here, can we instead check r for this status code and only use the range request if a 206 is encountered?

I see a few potential issues with the current approach:

  • _get_content_length() may return None (check out the implementation below) which will cause the new code here to fail
  • I believe that not all HTTP servers support range requests
  • If a request is truncated for a reason other than 206, then we enter a while True loop, which seems dangerous

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@brimoor, thanks for reviewing it and providing the above comments! :)

If we are specifically targeting 206 responses here, can we instead check r for this status code and only use the range request if a 206 is encountered?

That's a valid concern, and I even tried implementing it this way at the start, but only later did I figure out that the first time when it enters this loop, r.status_code is 200 and not 206 as shown in the wget output here.

I see a few potential issues with the current approach:

  • _get_content_length() may return None (check out the implementation below) which will cause the new code here to fail

Sure, so in that case, we can update the while loop condition as shown in the code snippet below.

  • I believe that not all HTTP servers support range requests

This is an interesting issue, and I tried doing some research on this. From what I found online, we can resume a file download only when an HTTP server accepts range requests. If it doesn't, it's not possible to resume a download from where it was left off. This link helped clarify a few things. So to ensure we can send a range request to the HTTP server, we can check before entering the while loop whether the Accepts-Range header is set to something other than None, as shown in the code snippet below.

  • If a request is truncated for a reason other than 206, then we enter a while True loop, which seems dangerous

Actually, I don't think we would run into this scenario since _get_streaming_response() handles this for us. As seen in the existing implementation, if we receive a status code other than 200/206, it raises a WebSessionError, and the control would never come back to the while loop to resume the download.

Here's the updated code snippet:

    def _do_download(self, r, f):
        size_bytes = _get_content_length(r)
        total_downloaded_bytes = 0
        size_bits = 8 * size_bytes if size_bytes is not None else None

        with etau.ProgressBar(
            size_bits, use_bits=True, quiet=self.quiet
        ) as pb:
            for chunk in r.iter_content(chunk_size=self.chunk_size):
                f.write(chunk)
                total_downloaded_bytes += len(chunk)
                pb.update(8 * len(chunk))

            while size_bytes is not None and ("Accept-Ranges" in r.headers and r.headers["Accept-Ranges"] is not None):
                remaining_bytes = size_bytes - total_downloaded_bytes
                if remaining_bytes > 0:
                    logger.debug(
                        "Continuing download...Total downloaded bytes: %d, Remaining bytes: %d"
                        % (total_downloaded_bytes, remaining_bytes)
                    )
                    r = self._get_streaming_response(
                        r.url,
                        headers={
                            "Range": "bytes=%d-" % total_downloaded_bytes
                        },
                    )
                    for chunk in r.iter_content(chunk_size=self.chunk_size):
                        f.write(chunk)
                        total_downloaded_bytes += len(chunk)
                        pb.update(8 * len(chunk))
                else:
                    break

Kindly let me know if this updated code would do the needful!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the thorough analysis; your proposed implementation looks great to me!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @brimoor, for your feedback and approval!
I've committed the changes.

Copy link
Contributor

@brimoor brimoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for this!

@brimoor brimoor merged commit 2b36830 into voxel51:develop Mar 1, 2024
@brimoor
Copy link
Contributor

brimoor commented Mar 2, 2024

@rohis06 just FYI- this patch will be publicly available in voxel51-eta==0.12.6, which will go live when #622 is merged!

@rohis06
Copy link
Contributor Author

rohis06 commented Mar 2, 2024

@rohis06 just FYI- this patch will be publicly available in voxel51-eta==0.12.6, which will go live when #622 is merged!

Thanks for letting me know, @brimoor! Looking forward to contributing more to @voxel51 :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants