MNIST dataset no longer downloads (repeat of old fixed March 2020 problem) #3497

Closed
kesterlester opened this issue Mar 3, 2021 · 10 comments · Fixed by #3498

Comments

@kesterlester

kesterlester commented Mar 3, 2021

🐛 Bug

This seems to be a recurrence of the issue reported in #1938, which was fixed in March 2020 and then closed, but has now reappeared. Several people in #1938 report that the problem reappeared within the last 12 hours.

To Reproduce

Steps to reproduce the behavior:

  1. Open a fresh Google Colab notebook
  2. Try something like the following:
import torchvision
from torchvision import datasets, transforms
transform = transforms.Compose([transforms.ToTensor(),
                              transforms.Normalize((0.5,), (0.5,)),
                              ])
trainset = datasets.MNIST('PATH_TO_STORE_TRAINSET', download=True, train=True, transform=transform)

Results in:

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to PATH_TO_STORE_TRAINSET/MNIST/raw/train-images-idx3-ubyte.gz
0/? [00:00<?, ?it/s]
---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
<ipython-input-16-492e382ce34e> in <module>()
      4                               transforms.Normalize((0.5,), (0.5,)),
      5                               ])
----> 6 trainset = datasets.MNIST('PATH_TO_STORE_TRAINSET', download=True, train=True, transform=transform)

11 frames
/usr/local/lib/python3.7/dist-packages/torchvision/datasets/mnist.py in __init__(self, root, train, transform, target_transform, download)
     77 
     78         if download:
---> 79             self.download()
     80 
     81         if not self._check_exists():

/usr/local/lib/python3.7/dist-packages/torchvision/datasets/mnist.py in download(self)
    144         for url, md5 in self.resources:
    145             filename = url.rpartition('/')[2]
--> 146             download_and_extract_archive(url, download_root=self.raw_folder, filename=filename, md5=md5)
    147 
    148         # process and save as torch files

/usr/local/lib/python3.7/dist-packages/torchvision/datasets/utils.py in download_and_extract_archive(url, download_root, extract_root, filename, md5, remove_finished)
    254         filename = os.path.basename(url)
    255 
--> 256     download_url(url, download_root, filename, md5)
    257 
    258     archive = os.path.join(download_root, filename)

/usr/local/lib/python3.7/dist-packages/torchvision/datasets/utils.py in download_url(url, root, filename, md5)
     82                 )
     83             else:
---> 84                 raise e
     85         # check integrity of downloaded file
     86         if not check_integrity(fpath, md5):

/usr/local/lib/python3.7/dist-packages/torchvision/datasets/utils.py in download_url(url, root, filename, md5)
     70             urllib.request.urlretrieve(
     71                 url, fpath,
---> 72                 reporthook=gen_bar_updater()
     73             )
     74         except (urllib.error.URLError, IOError) as e:  # type: ignore[attr-defined]

/usr/lib/python3.7/urllib/request.py in urlretrieve(url, filename, reporthook, data)
    245     url_type, path = splittype(url)
    246 
--> 247     with contextlib.closing(urlopen(url, data)) as fp:
    248         headers = fp.info()
    249 

/usr/lib/python3.7/urllib/request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    220     else:
    221         opener = _opener
--> 222     return opener.open(url, data, timeout)
    223 
    224 def install_opener(opener):

/usr/lib/python3.7/urllib/request.py in open(self, fullurl, data, timeout)
    529         for processor in self.process_response.get(protocol, []):
    530             meth = getattr(processor, meth_name)
--> 531             response = meth(req, response)
    532 
    533         return response

/usr/lib/python3.7/urllib/request.py in http_response(self, request, response)
    639         if not (200 <= code < 300):
    640             response = self.parent.error(
--> 641                 'http', request, response, code, msg, hdrs)
    642 
    643         return response

/usr/lib/python3.7/urllib/request.py in error(self, proto, *args)
    567         if http_err:
    568             args = (dict, 'default', 'http_error_default') + orig_args
--> 569             return self._call_chain(*args)
    570 
    571 # XXX probably also want an abstract factory that knows when it makes

/usr/lib/python3.7/urllib/request.py in _call_chain(self, chain, kind, meth_name, *args)
    501         for handler in handlers:
    502             func = getattr(handler, meth_name)
--> 503             result = func(*args)
    504             if result is not None:
    505                 return result

/usr/lib/python3.7/urllib/request.py in http_error_default(self, req, fp, code, msg, hdrs)
    647 class HTTPDefaultErrorHandler(BaseHandler):
    648     def http_error_default(self, req, fp, code, msg, hdrs):
--> 649         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    650 
    651 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 403: Forbidden

Expected behavior

The dataset should just be loaded (as indeed it was this morning).

Environment

Collecting environment information...
PyTorch version: 1.7.1+cu101
Is debug build: False
CUDA used to build PyTorch: 10.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: 6.0.0-1ubuntu2 (tags/RELEASE_600/final)
CMake version: version 3.12.0

Python version: 3.7 (64-bit runtime)
Is CUDA available: False
CUDA runtime version: 11.0.221
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.0.4
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.5
[pip3] torch==1.7.1+cu101
[pip3] torchsummary==1.5.1
[pip3] torchtext==0.3.1
[pip3] torchvision==0.8.2+cu101
[conda] Could not collect



cc @pmeier
@kesterlester
Author

Also posted issue here: googlecolab/colabtools#1889

@bensussman

Yes, I am seeing this issue as of today (it was not happening yesterday).

@kaushikb11

kaushikb11 commented Mar 3, 2021

Commenting here, for updates! 👀

@ForeverAT

I have faced this problem too.

@kesterlester
Author

The Colab team at Google closed my report (googlecolab/colabtools#1889) with the following statement:

This isn't a colab issue; the data you're trying to download is refusing based on user-agent:

!curl -s -O -H 'xUser-agent: Python-urllib/3.7' http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
successfully downloads ~9.9MiB of data (note the leading 'x' in the User-agent header) but

!curl -s -O -H 'User-agent: Python-urllib/3.7' http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
only gets the 403 page.

You probably want to report this to whoever hosts the data.
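The same check can be sketched in Python instead of curl. This is only an illustration: `probe_status` is a hypothetical helper name, not part of torchvision or Colab, and it simply reports the status code a server returns for a given `User-Agent` value.

```python
import urllib.error
import urllib.request


def probe_status(url, user_agent, timeout=10.0):
    """Return the HTTP status code the server sends for the given User-Agent.

    Hypothetical helper for illustration only.
    """
    # An explicit User-Agent header on the request takes precedence over
    # urllib's default "Python-urllib/<version>" value.
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx responses; the code is on the exception.
        return err.code
```

Calling it twice against the MNIST URL above, once with `'Python-urllib/3.7'` and once with any other string, should show whether the server still distinguishes between the two.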

It's an interesting observation, if nothing else. Still, I think it's hard to pin this on yann.lecun.com: any solution that depends critically on that one website being up is probably fragile anyway, even if MNIST is "his". Surely there is a more future-proof place for torchvision to download its data from?

@fmassa
Member

fmassa commented Mar 3, 2021

Thanks for the heads up, we will be investigating this issue

@andresgtn

andresgtn commented Mar 3, 2021

Copy this snippet to the top of your notebook, run it, and then load your datasets as usual:

from six.moves import urllib
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
urllib.request.install_opener(opener)
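On Python 3 the `six` shim should not be needed, since `six.moves.urllib` maps onto the standard library there. A minimal standard-library-only sketch of the same workaround follows; the function name `install_browser_like_opener` is made up for this example.

```python
import urllib.request


def install_browser_like_opener(user_agent="Mozilla/5.0"):
    """Install a global urllib opener that sends `user_agent` by default.

    Hypothetical helper name; the calls themselves are plain urllib.request.
    """
    opener = urllib.request.build_opener()
    # Replace the default ("User-agent", "Python-urllib/<version>") header.
    opener.addheaders = [("User-agent", user_agent)]
    # Every later urllib.request.urlopen / urlretrieve call uses this opener.
    urllib.request.install_opener(opener)
    return opener


install_browser_like_opener()
```

Because the opener is installed globally, this affects every later `urllib.request` call in the process, not only the MNIST download.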

@pmeier
Collaborator

pmeier commented Mar 3, 2021

Yeah, I'm pretty sure they blacklisted the standard urllib user agent.

@pmeier
Collaborator

pmeier commented Mar 3, 2021

Can confirm that the default user agent is blacklisted. We will get in touch with the author and see if the change is intentional or not.

  • If it is unintentional, this should resolve itself in the next few days.
  • If it is intentional, we could simply change the default user agent.

The problem with the latter is that we are going to publish a new release tomorrow, and any patch we provide will not be included. Thus, let's hope it is the former.

Anyway, until this is resolved, follow #3497 (comment) for a workaround.

@pmeier
Collaborator

pmeier commented Mar 3, 2021

We have decided to do the user agent fix anyway. It will also make its way into the upcoming release (#3499).
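For readers wondering what a user agent fix can look like in general, here is a rough sketch that sets the header per request rather than globally. It is not the actual torchvision patch; `download_with_user_agent` and the `USER_AGENT` value are assumptions made up for this example.

```python
import shutil
import urllib.request

# Hypothetical value for illustration; not the string torchvision uses.
USER_AGENT = "example-downloader/0.1"


def download_with_user_agent(url, destination, user_agent=USER_AGENT, timeout=30.0):
    """Download `url` to `destination`, sending an explicit User-Agent header."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=timeout) as response, \
            open(destination, "wb") as out_file:
        # Stream the body to disk in chunks instead of reading it all at once.
        shutil.copyfileobj(response, out_file)
    return destination
```

Setting the header on the request itself keeps the change local to this download and leaves other `urllib` users in the same process untouched, unlike a globally installed opener.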
