improve error handling for GDrive downloads #5704

pmeier · 2022-03-30T08:04:03Z

We have plenty of reports that the download of datasets hosted on GDrive does not work as expected:

Although most of them are closed, they still receive new comments from time to time when a user stumbles upon "the same issue".

The errors are non-descriptive in most cases. The problem is that we don't check the MD5 sum after the download and naively write the response from GDrive to disk. In contrast, `download_url` performs such a check:

vision/torchvision/datasets/utils.py

Lines 150 to 152 in aa21197

    # check integrity of downloaded file
    if not check_integrity(fpath, md5):
        raise RuntimeError("File not found or corrupted.")

The most common failure case is GDrive returning an unknown API response as HTML. This PR adds an MD5 check after the download, plus an additional HTML check to make the error message more descriptive.
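
For illustration, here is a minimal sketch of that idea. It is not the actual diff. The helper names `md5_of_file`, `looks_like_html`, and `verify_download` are hypothetical, and the HTML heuristic (a regex on the first bytes of the file) is only an assumption about what such a check might look like.

```python
import hashlib
import re

# Assumed heuristic: an HTML payload typically starts with a doctype or an <html> tag.
_HTML_START = re.compile(rb"\s*<(!doctype\s+html|html)\b", re.IGNORECASE)


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def looks_like_html(path, num_bytes=1024):
    """Return True if the first bytes of the file look like an HTML document."""
    with open(path, "rb") as fh:
        return _HTML_START.match(fh.read(num_bytes)) is not None


def verify_download(path, md5):
    """Raise a descriptive RuntimeError if the downloaded file is not the expected one."""
    if md5_of_file(path) == md5:
        return
    if looks_like_html(path):
        # The server answered with a web page instead of the requested file.
        raise RuntimeError(
            f"The MD5 checksum of {path} does not match and the file looks like HTML. "
            "The host probably returned an error or confirmation page instead of the file."
        )
    raise RuntimeError("File not found or corrupted.")
```

In this sketch the HTML check only runs after an MD5 mismatch, so a valid download pays no extra cost beyond the checksum.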

facebook-github-bot · 2022-03-30T08:04:13Z

💊 CI failures summary and remediations

As of commit a6cdc84 (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚


Conflicts: torchvision/datasets/utils.py

NicolasHug

Thanks @pmeier.

> The most common failure case is GDrive returning an unknown API response as HTML

Shouldn't we check the HTML code of the download / reply from GDrive instead of downloading then?

torchvision/datasets/utils.py

pmeier · 2022-04-11T09:35:06Z

> Shouldn't we check the HTML code of the download / reply from GDrive instead of downloading then?

Unfortunately, GDrive mostly just returns 200 and thus there is very little we can do without actually inspecting the response. Maybe we can improve this by always analyzing the first chunk of data rather than only if we detect an MD5 mismatch. Thoughts?
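
A rough sketch of what "always analyzing the first chunk" could look like. The function name `save_chunks_checked` and the regex are hypothetical and only meant to illustrate the idea; the input is assumed to be any iterable of `bytes` chunks.

```python
import itertools
import os
import re

# Assumed heuristic: an HTML payload typically starts with a doctype or an <html> tag.
_HTML_START = re.compile(rb"\s*<(!doctype\s+html|html)\b", re.IGNORECASE)


def save_chunks_checked(chunks, destination):
    """Write an iterable of byte chunks to `destination`.

    The first chunk is inspected before anything is written, so an HTML
    response is rejected up front, independent of any MD5 check.
    """
    chunks = iter(chunks)
    first = next(chunks, b"")
    if _HTML_START.match(first):
        raise RuntimeError(
            f"The response for {os.path.basename(destination)} looks like an HTML document "
            "rather than the requested file. Nothing was written to disk."
        )
    with open(destination, "wb") as fh:
        # Re-attach the chunk that was consumed for the inspection.
        for chunk in itertools.chain([first], chunks):
            fh.write(chunk)
```

With `requests`, `chunks` could be `response.iter_content(chunk_size=32 * 1024)` on a response opened with `stream=True`. An MD5 check would still be needed afterwards to catch truncated or otherwise corrupted files.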

NicolasHug · 2022-04-11T09:41:59Z

> analyzing the first chunk of data rather than only if we detect an MD5 mismatch. Thoughts?

If you mean checking against the HTML regex right after the files are downloaded, then yeah, I feel like this might be where the check needs to be.

NicolasHug

Thanks Philip

Summary:
* improve error handling for GDrive downloads
* perform HTML check regardless of MD5 check

Reviewed By: NicolasHug

Differential Revision: D36760932

fbshipit-source-id: 1cad96e1505f88f6945c048d9c3e0fbe1ccfd00f

improve error handling for GDrive downloads

d467132

pmeier added enhancement module: datasets labels Mar 30, 2022

pmeier requested a review from NicolasHug March 30, 2022 08:04

facebook-github-bot added the cla signed label Mar 30, 2022

pmeier mentioned this pull request Apr 1, 2022

Caltech101, Caltech256 downloads are broken due to Google Drive redirect "scan for viruses" popup #5716

Closed

Merge branch 'main'

a6cdc84

Conflicts: torchvision/datasets/utils.py

NicolasHug reviewed Apr 11, 2022

pmeier added 2 commits April 11, 2022 11:43

Merge branch 'main' into gdrive-download-error

db4a377

perform HTML check regardless of MD5 check

5584810

pmeier requested a review from NicolasHug April 11, 2022 10:00

pmeier mentioned this pull request May 20, 2022

Replace torchvision.datasets.utils with functionality from torchdata #6060

Draft

NicolasHug approved these changes May 20, 2022

Merge branch 'main' into gdrive-download-error

ee9d38e

pmeier merged commit 16af667 into pytorch:main May 20, 2022

pmeier deleted the gdrive-download-error branch May 20, 2022 13:40

abhi-glitchhg mentioned this pull request Jun 1, 2022

Failed to download CelebA dataset using download=True #1920

Closed
