Skip to content

Conversation

pmeier
Copy link
Collaborator

@pmeier pmeier commented Mar 30, 2022

We have plenty of reports that the download of datasets hosted on GDrive does not work as expected:

Although most of them are closed, they see another comment from time to time since a user stumbled upon "the same issue"

The errors non-descriptive in most cases. The problem is that we don't check the MD5 sum after the download and naively write the response from GDrive to disk. In contrast, on download_url we perform such a check

# check integrity of downloaded file
if not check_integrity(fpath, md5):
raise RuntimeError("File not found or corrupted.")

The most common failure case is GDrive returning an unknown API response as HTML. This PR adds a MD5 check after the download with an additional HTML check to make the error message more descriptive.

@facebook-github-bot
Copy link
Contributor

facebook-github-bot commented Mar 30, 2022

💊 CI failures summary and remediations

As of commit a6cdc84 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI (expand for details).

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Click here to manually regenerate this comment.

Conflicts:
	torchvision/datasets/utils.py
Copy link
Member

@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @pmeier .

The most common failure case is GDrive returning an unknown API response as HTML

Shouldn't we check the HTML code of the download / reply from GDrive instead of downloading then?

@pmeier
Copy link
Collaborator Author

pmeier commented Apr 11, 2022

Shouldn't we check the HTML code of the download / reply from GDrive instead of downloading then?

Unfortunately, GDrive mostly just returns 200 and thus there is very little we can do without actually inspecting the response. Maybe we can improve this by always analyzing the first chunk of data rather than only if we detect an MD5 mismatch. Thoughts?

@NicolasHug
Copy link
Member

analyzing the first chunk of data rather than only if we detect an MD5 mismatch. Thoughts

If you mean checking against the HTML regex right after the files are downloaded, then yeah I feel like this might be where the check needs to be

Copy link
Member

@NicolasHug NicolasHug left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Philip

@pmeier pmeier merged commit 16af667 into pytorch:main May 20, 2022
@pmeier pmeier deleted the gdrive-download-error branch May 20, 2022 13:40
facebook-github-bot pushed a commit that referenced this pull request Jun 1, 2022
Summary:
* improve error handling for GDrive downloads

* perform HTML check regardless of MD5 check

Reviewed By: NicolasHug

Differential Revision: D36760932

fbshipit-source-id: 1cad96e1505f88f6945c048d9c3e0fbe1ccfd00f
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants