Downloading MNIST dataset with torchvision gives HTTP Error 403 #1938

Closed
Borda opened this issue Mar 4, 2020 · 51 comments · Fixed by kubeflow/katib#1457

Comments

@Borda
Contributor

Borda commented Mar 4, 2020

🐛 Bug

I'm getting a 403 error when I try to download the MNIST dataset with torchvision 0.4.2.

To Reproduce

../.local/lib/python3.6/site-packages/torchvision/datasets/mnist.py:68: in __init__
    self.download()
../.local/lib/python3.6/site-packages/torchvision/datasets/mnist.py:135: in download
    download_and_extract_archive(url, download_root=self.raw_folder, filename=filename)
../.local/lib/python3.6/site-packages/torchvision/datasets/utils.py:248: in download_and_extract_archive
    download_url(url, download_root, filename, md5)
../.local/lib/python3.6/site-packages/torchvision/datasets/utils.py:96: in download_url
    raise e
../.local/lib/python3.6/site-packages/torchvision/datasets/utils.py:84: in download_url
    reporthook=gen_bar_updater()
/usr/local/lib/python3.6/urllib/request.py:248: in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
/usr/local/lib/python3.6/urllib/request.py:223: in urlopen
    return opener.open(url, data, timeout)
/usr/local/lib/python3.6/urllib/request.py:532: in open
    response = meth(req, response)
/usr/local/lib/python3.6/urllib/request.py:642: in http_response
    'http', request, response, code, msg, hdrs)
/usr/local/lib/python3.6/urllib/request.py:570: in error
    return self._call_chain(*args)
/usr/local/lib/python3.6/urllib/request.py:504: in _call_chain
    result = func(*args)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <urllib.request.HTTPDefaultErrorHandler object at 0x7efbf9edaac8>
req = <urllib.request.Request object at 0x7efbf9eda8d0>
fp = <http.client.HTTPResponse object at 0x7efbf9edaf98>, code = 403
msg = 'Forbidden', hdrs = <http.client.HTTPMessage object at 0x7efbf9ea22b0>

    def http_error_default(self, req, fp, code, msg, hdrs):
>       raise HTTPError(req.full_url, code, msg, hdrs, fp)
E       urllib.error.HTTPError: HTTP Error 403: Forbidden

Environment

  • torch==1.3.1
  • torchvision==0.4.2

Additional context

https://app.circleci.com/jobs/github/PyTorchLightning/pytorch-lightning/6877

@fmassa
Member

fmassa commented Mar 4, 2020

Thanks for reporting! I can reproduce the issue locally, and downloading from the browser works.

I don't yet know what the root cause is though.

@fmassa
Member

fmassa commented Mar 4, 2020

I think we might need to pass a header in the download_url function

def download_url(url, root, filename=None, md5=None):
    """Download a file from a url and place it in root.
    Args:
        url (str): URL to download file from
        root (str): Directory to place downloaded file in
        filename (str, optional): Name to save the file under. If None, use the basename of the URL
        md5 (str, optional): MD5 checksum of the download. If None, do not check
    """
    from six.moves import urllib
    root = os.path.expanduser(root)
    if not filename:
        filename = os.path.basename(url)
    fpath = os.path.join(root, filename)
    makedir_exist_ok(root)
    # check if file is already present locally
    if check_integrity(fpath, md5):
        print('Using downloaded and verified file: ' + fpath)
    else:   # download the file
        try:
            print('Downloading ' + url + ' to ' + fpath)
            urllib.request.urlretrieve(
                url, fpath,
                reporthook=gen_bar_updater()
            )
        except (urllib.error.URLError, IOError) as e:
            if url[:5] == 'https':
                url = url.replace('https:', 'http:')
                print('Failed download. Trying https -> http instead.'
                      ' Downloading ' + url + ' to ' + fpath)
                urllib.request.urlretrieve(
                    url, fpath,
                    reporthook=gen_bar_updater()
                )
            else:
                raise e
        # check integrity of downloaded file
        if not check_integrity(fpath, md5):
            raise RuntimeError("File not found or corrupted.")
according to https://stackoverflow.com/questions/13303449/urllib2-httperror-http-error-403-forbidden

cc @cpuhrsch @vincentqb @zhangguanheng66 for awareness

@soumith
Member

soumith commented Mar 4, 2020

this is because the download links for mnist at https://github.com/pytorch/vision/blob/master/torchvision/datasets/mnist.py#L33-L36 are hosted on yann.lecun.com and that server has moved under CloudFlare protection.

@fmassa we need to maybe mirror and change the URLs to maybe the PyTorch S3 bucket or something

@Borda
Contributor Author

Borda commented Mar 4, 2020

so could we make a hot-fix somehow?

@fmassa
Member

fmassa commented Mar 4, 2020

@Borda I haven't tried the current hotfix I mentioned, but I think it might be possible, would you be able to try it and send a PR? Otherwise I'll look into it early next week (I'm working towards ECCV deadline tomorrow)

And I would rather avoid hosting the datasets ourselves, as this would set a precedent for us storing the datasets.

@eduardo4jesus

eduardo4jesus commented Mar 4, 2020

Is there any way to have a quick fix without using master?
I am concerned about the changes I might have to make in my code to go from the version I am using (1.4.0) to master.

@mvelebit

mvelebit commented Mar 4, 2020

@eduardo4jesus You can explicitly add headers as stated above, something like:

opener = urllib.request.URLopener()
opener.addheader('User-Agent', some_user_agent)
opener.retrieve(
    url, fpath,
    reporthook=gen_bar_updater()
)

(line 81 and onwards in vision/torchvision/datasets/utils.py). Seems to be a quick workaround that works.

nvcastet added a commit to nvcastet/horovod that referenced this issue Mar 4, 2020
See pytorch/vision#1938

Signed-off-by: Nicolas V Castet <nvcastet@us.ibm.com>
@nvcastet

nvcastet commented Mar 4, 2020

@eduardo4jesus You could patch your model script at the top using:

from six.moves import urllib
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
urllib.request.install_opener(opener)

It will use that user agent for the entire script assuming the opener does not get overwritten somewhere else.
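
If changing the opener for the whole process is undesirable, attaching the header to a single request is one alternative. The following is a minimal sketch using only the standard library; `build_request` and `fetch` are hypothetical helper names (not part of torchvision), and `Mozilla/5.0` is simply the value used in the snippet above.

```python
import shutil
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Return a Request that carries its own User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, dest_path, user_agent="Mozilla/5.0"):
    """Download url to dest_path without touching the global opener."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest_path
```

This only helps when fetching the files yourself; it does not change how torchvision downloads internally, which is why the global `install_opener` approach is the one that works without editing the library.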

@nvcastet

nvcastet commented Mar 4, 2020

To make it work for python 2 as well:

import urllib
try:
    # For python 2
    class AppURLopener(urllib.FancyURLopener):
        version = "Mozilla/5.0"

    urllib._urlopener = AppURLopener()
except AttributeError:
    # For python 3: the request submodule must be imported explicitly
    import urllib.request
    opener = urllib.request.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')]
    urllib.request.install_opener(opener)

@joergsimon

joergsimon commented Mar 4, 2020

so for python 3 I now use the following snippet:

from torchvision import datasets
import torchvision.transforms as transforms
import urllib.request  # plain "import urllib" does not guarantee urllib.request is loaded

num_workers = 0
batch_size = 20
basepath = 'some/base/path'
transform = transforms.ToTensor()

def set_header_for(url, filename):
    # note: URLopener has been deprecated since Python 3.3
    opener = urllib.request.URLopener()
    opener.addheader('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36')
    opener.retrieve(url, f'{basepath}/{filename}')

set_header_for('http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz', 'train-images-idx3-ubyte.gz')
set_header_for('http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz', 'train-labels-idx1-ubyte.gz')
set_header_for('http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz', 't10k-images-idx3-ubyte.gz')
set_header_for('http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz', 't10k-labels-idx1-ubyte.gz')
train_data = datasets.MNIST(root='data', train=True,
                            download=True, transform=transform)
test_data = datasets.MNIST(root='data', train=False,
                           download=False, transform=transform)

You would need to modify the basepath variable, of course.

@fatihbeyhan

I've just got the same problem. Waiting for the answer without changing codes... (ROOKIE ALERT)

@knamdar

knamdar commented Mar 5, 2020

I've just got the same problem. Waiting for the answer without changing codes... (ROOKIE ALERT)

Clone this to your working dir:
https://github.com/knamdar/data

@joergsimon

joergsimon commented Mar 5, 2020 via email

@eduardo4jesus

eduardo4jesus commented Mar 5, 2020

from six.moves import urllib
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
urllib.request.install_opener(opener)

@nvcastet, thank you so much for the clarification. At that point I had misunderstood and thought I would have to go into the Torchvision library and change one of its internal files, which would not be a smooth move on Colab/Kaggle.

vision/torchvision/datasets/utils.py

@fmassa
Member

fmassa commented Mar 5, 2020

This should have been fixed now, there is no need to update torchvision.

All should be working as before, without any change on the user side.

This was fixed on the server hosting the original dataset (thanks @soumith !).

As such, I'm closing this issue but let us know if you still face this issue.

@fmassa fmassa closed this as completed Mar 5, 2020
@Flock1

Flock1 commented Mar 11, 2021

copy this snippet at the top of your notebook, run it, and then just load your datasets as usual...

from six.moves import urllib
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
urllib.request.install_opener(opener)

This worked. I am new to this, so can you tell me what this piece of code does?
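
In short, the snippet replaces the default `User-agent` header that `urllib` sends (which identifies the client as `Python-urllib/<version>`) with a browser-like value, and `install_opener` makes that opener the one used by later `urlopen` and `urlretrieve` calls in the same process. A small sketch that makes no network requests and only inspects the headers (the variable names are illustrative):

```python
import urllib.request

# A freshly built opener advertises itself as "Python-urllib/<version>".
default_opener = urllib.request.build_opener()
default_headers = dict(default_opener.addheaders)

# The workaround swaps that header for a browser-like value...
patched_opener = urllib.request.build_opener()
patched_opener.addheaders = [("User-agent", "Mozilla/5.0")]
patched_headers = dict(patched_opener.addheaders)

# ...and install_opener() would make it the process-wide default:
# urllib.request.install_opener(patched_opener)

print("default:", default_headers["User-agent"])
print("patched:", patched_headers["User-agent"])
```

Whether the server accepts the request is still up to the server; other reports in this thread show that the header change did not always help.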

@HugoSchmutz

copy this snippet at the top of your notebook, run it, and then just load your datasets as usual...

from six.moves import urllib
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
urllib.request.install_opener(opener)

Didn't work for me on Colab today

@mlelarge

You can download it from my webpage:
https://gist.github.com/mlelarge/60ddefa9e16bc06f7f4fc7bff769bdb1

@junpuf

junpuf commented Mar 11, 2021

Did anyone get urllib.error.HTTPError: HTTP Error 503: Service Unavailable last night?

@ChengguiSun

copy this snippet at the top of your notebook, run it, and then just load your datasets as usual...
from six.moves import urllib
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
urllib.request.install_opener(opener)

Didn't work for me on Colab today

Didn't work for me either

@BernardoOlisan

Did anyone get urllib.error.HTTPError: HTTP Error 503: Service Unavailable last night?

Yep, yesterday it was working for me but now it is not. I don't know what happened. Is there a solution for this?

@alisterburt

alisterburt commented Mar 12, 2021

@BernardoOlisan @ChengguiSun the solution from @mlelarge works great, the following worked for me in a notebook

!wget www.di.ens.fr/~lelarge/MNIST.tar.gz
!tar -zxvf MNIST.tar.gz

from torchvision.datasets import MNIST
from torchvision import transforms

mnist_train = MNIST('./', download=False,
                    transform=transforms.Compose([
                        transforms.ToTensor(),
                    ]), train=True)

@HugoSchmutz

Thank you ! @mlelarge and @alisterburt

@BernardoOlisan

!wget www.di.ens.fr/~lelarge/MNIST.tar.gz
!tar -zxvf MNIST.tar.gz

nope still got the exact same error

@alisterburt

@BernardoOlisan do you get the same error when trying to download from that link in a browser? Just tried again locally and all worked fine...

@scaomath

scaomath commented Mar 12, 2021

!wget www.di.ens.fr/~lelarge/MNIST.tar.gz
!tar -zxvf MNIST.tar.gz

nope still got the exact same error

You have to have a folder called MNIST/raw/ in the folder with your dataset script or notebook

torchvision.datasets.MNIST(root='./', download=False,
                    transform=transforms.Compose([
                        transforms.ToTensor(),
                    ]), train=True)

If your code is

torchvision.datasets.MNIST(root='./data', download=False,
                    transform=transforms.Compose([
                        transforms.ToTensor(),
                    ]), train=True)

then the folder has to be structured as

├── data
│   ├── MNIST
│         ├── processed
│         └── raw
└── Python script
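
A quick way to confirm this layout before calling MNIST with download=False is to check for the expected subfolders. This is a minimal sketch; `missing_mnist_dirs` is a hypothetical helper, and because the need for a `processed` folder may depend on the torchvision version, that check is optional here.

```python
import os


def missing_mnist_dirs(root, require_processed=False):
    """Return the expected MNIST subfolders that are absent under root."""
    expected = [os.path.join(root, "MNIST", "raw")]
    if require_processed:
        expected.append(os.path.join(root, "MNIST", "processed"))
    # Keep only the paths that do not exist as directories.
    return [path for path in expected if not os.path.isdir(path)]
```

Calling `missing_mnist_dirs('./data')` with the same root that is passed to MNIST returns an empty list when the folders are in place; anything it returns is a path the archive was not extracted to.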

@andrewssobral

andrewssobral commented Mar 12, 2021

The solution proposed by @mlelarge works fine for me, but in my case I have:

# For new versions of TorchVision
wget -O MNIST.tar.gz https://activeeon-public.s3.eu-west-2.amazonaws.com/datasets/MNIST.new.tar.gz

# For old versions of TorchVision (<= 1.0.1*)
wget -O MNIST.tar.gz https://activeeon-public.s3.eu-west-2.amazonaws.com/datasets/MNIST.old.tar.gz

tar -zxvf MNIST.tar.gz

then, I do:

...
train_loader = torch.utils.data.DataLoader(
  datasets.MNIST('./', train=True, download=False,
                 transform=transforms.Compose([
                     transforms.ToTensor(),
                     transforms.Normalize((0.1307,), (0.3081,))
                 ])),
  batch_size=args.batch_size, shuffle=True, **kwargs)
...

and everything works well.

@fmassa
Member

fmassa commented Mar 12, 2021

This should be fixed (again) in the next torchvision nightly, and the fix will be present in the next minor release of torchvision, which should be out soon.

See #3544 for more details

@andrewssobral

Perfect, thank you @fmassa !

@ChengguiSun

@alisterburt, thank you. I used that solution too. Sometimes it works, sometimes it doesn't. It works for me tonight. It's just very slow to download the dataset.

@BernardoOlisan @ChengguiSun the solution from @mlelarge works great, the following worked for me in a notebook

!wget www.di.ens.fr/~lelarge/MNIST.tar.gz
!tar -zxvf MNIST.tar.gz

from torchvision.datasets import MNIST
from torchvision import transforms

mnist_train = MNIST('./', download=False,
                    transform=transforms.Compose([
                        transforms.ToTensor(),
                    ]), train=True)

@mohamed-180

try this

from torchvision import datasets
datasets.MNIST(".", download=True)

@scaomath

@alisterburt, thank you. I used that solution too. Sometimes it works, sometimes it doesn't. It works for me tonight. It's just very slow to download the dataset.

That is because www.di.ens.fr/~lelarge is apparently @mlelarge's personal web space on a university host, which probably cannot handle the bandwidth of every amateur data scientist relying on torchvision.datasets. We should each download the original files from LeCun's site manually, upload them to personal cloud storage that supports HTTP, and then use wget.

@EricPengShuai

EricPengShuai commented Aug 4, 2021

@BernardoOlisan @ChengguiSun the solution from @mlelarge works great, the following worked for me in a notebook

!wget www.di.ens.fr/~lelarge/MNIST.tar.gz
!tar -zxvf MNIST.tar.gz

from torchvision.datasets import MNIST
from torchvision import transforms

mnist_train = MNIST('./', download=False,
                    transform=transforms.Compose([
                        transforms.ToTensor(),
                    ]), train=True)

@alisterburt I encountered the following problems after trying

SYSTEM_WGETRC = c:/progra~1/wget/etc/wgetrc
syswgetrc = C:\Download\GnuWin32-wget/etc/wgetrc
--2021-08-04 19:11:37--  http://www.di.ens.fr/~lelarge/MNIST.tar.gz
Resolving www.di.ens.fr... 129.199.99.14
Connecting to www.di.ens.fr|129.199.99.14|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://www.di.ens.fr/~lelarge/MNIST.tar.gz [following]
--2021-08-04 19:11:37--  https://www.di.ens.fr/~lelarge/MNIST.tar.gz
Connecting to www.di.ens.fr|129.199.99.14|:443... connected.
Unable to establish SSL connection.
tar: Error opening archive: Failed to open 'MNIST.tar.gz'
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-2-48b8be9a8697> in <module>
      8                     transform=transforms.Compose([
      9                         transforms.ToTensor(),
---> 10                     ]), train=True)

c:\users\peng\anaconda3\envs\pytorch_1.8\lib\site-packages\torchvision\datasets\mnist.py in __init__(self, root, train, transform, target_transform, download)
     80
     81         if not self._check_exists():
---> 82             raise RuntimeError('Dataset not found.' +
     83                                ' You can use download=True to download it')
     84

RuntimeError: Dataset not found. You can use download=True to download it

@rodrsilv-stark

Have you tried to decompress the file first?

@EricPengShuai

Have you tried to decompress the file first?

No, I just tried it in IPython (Anaconda, PyTorch 1.8) on Windows. It worked on my MacBook, though, and I don't know why.
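
The log above suggests that the wget step itself failed (the SSL connection could not be established), so MNIST.tar.gz never existed for tar to open, and torchvision then found no dataset. Where wget and tar are unavailable or unreliable, the same two steps can be approximated with the Python standard library. A minimal sketch, assuming the archive URL from the earlier comment is still reachable; `extract_archive` and `download_and_extract` are hypothetical helper names.

```python
import os
import tarfile
import urllib.request


def extract_archive(archive_path, dest_dir="."):
    """Unpack a .tar.gz archive, failing early if it is missing or empty."""
    if not os.path.isfile(archive_path) or os.path.getsize(archive_path) == 0:
        raise FileNotFoundError(
            f"{archive_path} is missing or empty; the download probably failed"
        )
    with tarfile.open(archive_path, "r:gz") as archive:
        # Only extract archives from a source you trust.
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions support safer extraction filters.
            archive.extractall(dest_dir, filter="data")
        else:
            archive.extractall(dest_dir)
    return dest_dir


def download_and_extract(url, archive_path="MNIST.tar.gz", dest_dir="."):
    """Rough equivalent of `wget <url>` followed by `tar -zxvf <archive>`."""
    urllib.request.urlretrieve(url, archive_path)
    return extract_archive(archive_path, dest_dir)
```

For example, `download_and_extract('https://www.di.ens.fr/~lelarge/MNIST.tar.gz')` would mirror the two notebook commands, after which MNIST('./', download=False, ...) can be called as before. If the download raises an SSL error here too, the problem is likely in the local network or certificate setup rather than in torchvision.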

@SiddhantSadangi

SiddhantSadangi commented May 14, 2024

@Borda - seems like this issue has resurfaced

@fancellu

fancellu commented May 14, 2024

I try

dataset = torchvision.datasets.MNIST(root="mnist/", train=True, download=True, transform=torchvision.transforms.ToTensor())

and I get 403s

Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Failed to download (trying next):
HTTP Error 403: Forbidden

@dandv

dandv commented May 23, 2024

Still getting this error with torchvision==0.16.2.

trainset = torchvision.datasets.MNIST(
    "./downloads/mnist",
    download=True,
    train=True,
)
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Failed to download (trying next):
HTTP Error 403: Forbidden

Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Failed to download (trying next):
HTTP Error 403: Forbidden

@mahtin

mahtin commented Jul 1, 2024

Reading the files from http://yann.lecun.com/exdb/mnist/ still fails intermittently. However, this simple fix avoids the intermittent failures:

< datasets_url = 'http://yann.lecun.com/exdb/mnist/'
> datasets_url = 'https://storage.googleapis.com/cvdf-datasets/mnist/'

The same files are available there and in many other places, for example: https://github.com/golbin/TensorFlow-MNIST/raw/master/mnist/data/

I think it's a sad case of the original site being either overrun or broken.
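
Rather than replacing one URL with another, the two can be tried in order. The "Failed to download (trying next)" messages quoted earlier suggest recent torchvision versions already do something similar internally; the following is only a standalone sketch of the idea. `download_with_fallback` is a hypothetical helper, the mirror list contains just the two base URLs mentioned in this comment, and no checksum verification is performed.

```python
import os
import urllib.error
import urllib.request

MIRRORS = [
    "http://yann.lecun.com/exdb/mnist/",
    "https://storage.googleapis.com/cvdf-datasets/mnist/",
]


def download_with_fallback(filename, dest_dir, mirrors=MIRRORS,
                           retrieve=urllib.request.urlretrieve):
    """Try each mirror in turn and return the URL that succeeded."""
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, filename)
    errors = []
    for base in mirrors:
        url = base + filename
        try:
            retrieve(url, dest)
            return url
        except (urllib.error.URLError, OSError) as exc:
            # HTTP 403/503 responses surface here as HTTPError (a URLError).
            errors.append((url, exc))
    raise RuntimeError(f"All mirrors failed for {filename}: {errors}")
```

The `retrieve` parameter exists only so the download function can be swapped out, for instance for one that sets a custom User-Agent. Files fetched this way should still be checked against known checksums before use.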
