mldata.org is down (for good?) #8588

Closed
fradav opened this issue Mar 15, 2017 · 54 comments

@fradav

fradav commented Mar 15, 2017

Description

Unable to retrieve dataset from mldata.org
The site is down.

Steps/Code to Reproduce

from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original')

Expected Results

[mnist data loaded]

Actual Results


TimeoutError Traceback (most recent call last)
C:\Users\frada\Dev\Python\Miniconda3\lib\urllib\request.py in do_open(self, http_class, req, **http_conn_args)
1317 h.request(req.get_method(), req.selector, req.data, headers,
-> 1318 encode_chunked=req.has_header('Transfer-encoding'))
1319 except OSError as err: # timeout error

C:\Users\frada\Dev\Python\Miniconda3\lib\http\client.py in request(self, method, url, body, headers, encode_chunked)
1238 """Send a complete request to the server."""
-> 1239 self._send_request(method, url, body, headers, encode_chunked)
1240

C:\Users\frada\Dev\Python\Miniconda3\lib\http\client.py in _send_request(self, method, url, body, headers, encode_chunked)
1284 body = _encode(body, 'body')
-> 1285 self.endheaders(body, encode_chunked=encode_chunked)
1286

C:\Users\frada\Dev\Python\Miniconda3\lib\http\client.py in endheaders(self, message_body, encode_chunked)
1233 raise CannotSendHeader()
-> 1234 self._send_output(message_body, encode_chunked=encode_chunked)
1235

C:\Users\frada\Dev\Python\Miniconda3\lib\http\client.py in _send_output(self, message_body, encode_chunked)
1025 del self._buffer[:]
-> 1026 self.send(msg)
1027

C:\Users\frada\Dev\Python\Miniconda3\lib\http\client.py in send(self, data)
963 if self.auto_open:
--> 964 self.connect()
965 else:

C:\Users\frada\Dev\Python\Miniconda3\lib\http\client.py in connect(self)
935 self.sock = self._create_connection(
--> 936 (self.host,self.port), self.timeout, self.source_address)
937 self.sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

C:\Users\frada\Dev\Python\Miniconda3\lib\socket.py in create_connection(address, timeout, source_address)
721 if err is not None:
--> 722 raise err
723 else:

C:\Users\frada\Dev\Python\Miniconda3\lib\socket.py in create_connection(address, timeout, source_address)
712 sock.bind(source_address)
--> 713 sock.connect(sa)
714 return sock

TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond

Versions

Windows-10-10.0.14393-SP0
Python 3.6.0 |Continuum Analytics, Inc.| (default, Dec 23 2016, 11:57:41) [MSC v.1900 64 bit (AMD64)]
NumPy 1.12.0
SciPy 0.19.0
Scikit-Learn 0.18.1

@dredwilliams

I'm having the same problem, both with the scikit-learn function and when going directly to mldata.org ...

Any thoughts?

@jnothman
Member

jnothman commented Mar 20, 2017 via email

@lesteve
Member

lesteve commented Mar 20, 2017

I suppose we should be deprecating mldata fetchers :(

Is there any indication that mldata.org is not going to be back up any time soon? I did not find anything from a quick googling. Also, this is a bit of a wild guess, but I was confusing mldata.org with mlcomp.org at first, which is supposed to be taken down in March 2017. Maybe it is the same for you, @jnothman.

@jnothman
Member

jnothman commented Mar 20, 2017 via email

@lesteve
Member

lesteve commented Mar 20, 2017

5 days is also a surprising downtime for something of this nature!

Agreed. I'll try to contact one of the website maintainers I found through Google and see what happens.

@jnothman
Member

jnothman commented Mar 20, 2017 via email

@iampawansingh

Until mldata.org is back up, what's the workaround if the data has not already been downloaded?

@lesteve
Member

lesteve commented Mar 22, 2017

what's the workaround if the data has not already been downloaded?

Have you tried googling the name of the dataset? You may find a copy somewhere. In principle you just need to find the .mat file and put it in ~/scikit_learn_data/mldata.

Alternatively, find someone who uses scikit-learn and has already downloaded it (or maybe yourself on another computer) and ask them to share the content of their ~/scikit_learn_data/mldata folder.
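
The manual step described above (dropping a .mat file into the mldata cache folder) could be scripted roughly as follows. This is only a sketch: `install_mat_file` is a made-up helper name, and it assumes the cache lives in ~/scikit_learn_data unless the SCIKIT_LEARN_DATA environment variable points elsewhere. The file must keep the name fetch_mldata expects (mnist-original.mat for 'MNIST original').

```python
import os
import shutil
from pathlib import Path


def install_mat_file(src, data_home=None):
    """Copy a downloaded .mat file into <data_home>/mldata and return the new path.

    Hypothetical helper, not part of scikit-learn.
    """
    if data_home is None:
        # Assumption: scikit-learn's default cache folder, which can be
        # overridden through the SCIKIT_LEARN_DATA environment variable.
        data_home = os.environ.get("SCIKIT_LEARN_DATA", "~/scikit_learn_data")
    target_dir = Path(data_home).expanduser() / "mldata"
    target_dir.mkdir(parents=True, exist_ok=True)
    # Keep the original file name so the fetcher can find the cached copy.
    target = target_dir / Path(src).name
    shutil.copyfile(src, target)
    return target
```

After calling it on a copy of the file obtained elsewhere, fetch_mldata should find the cached file and skip the download.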

@iampawansingh

I tried googling but could not find the .mat file. However, I found a link to the data I was looking for in the book from which the data was taken. With this data, though, I am not able to reproduce the result shown on the sklearn user guide page.

Because of this, I wanted the data in the exact format, so that I can tell whether the bad result is due to the data or to the modelling.

@lesteve
Member

lesteve commented Mar 23, 2017

I googled and found https://github.com/amplab/datascience-sp14/blob/master/lab7/mldata/mnist-original.mat in a matter of seconds. I was able to reproduce the output of this example:
http://scikit-learn.org/stable/auto_examples/neural_networks/plot_mnist_filters.html

Hope this helps although it is hard to know because you are not very explicit about what you are trying to do ...

@iampawansingh

iampawansingh commented Mar 23, 2017

@lesteve sorry for being a bit vague! I am trying to replicate the Gaussian Process Regressor example - http://scikit-learn.org/stable/auto_examples/gaussian_process/plot_gpr_co2.html
I am not able to fetch the .mat file, so I cannot replicate this example. As mentioned before, I did get the data after some searching, but even with this data I am not able to replicate the results.
I have been struggling with this for the last two days, so any pointer here would be of great help.

@mikiobraun

Hello there, I'll try to get in touch with mldata's admin. I'll let you know about the updates.

It should in theory be provided indefinitely...

@cjermain

FYI, mldata.org is still down.

@iampawansingh

While struggling to get the data, I found this GitHub repository, which contains most of the data - https://github.com/vincentarelbundock/Rdatasets However, this involves downloading an .Rda or CSV file and converting it into a format that sklearn can consume.
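
The CSV conversion mentioned above could look roughly like this standard-library sketch. The helper name `csv_to_xy`, the column names `time` and `value`, and the two sample rows are made up for illustration; they are not taken from the repository.

```python
import csv
import io


def csv_to_xy(csv_text, feature_columns, target_column):
    """Split CSV text into a feature matrix X and a target vector y."""
    reader = csv.DictReader(io.StringIO(csv_text))
    X, y = [], []
    for row in reader:
        # One list of floats per sample, in the requested column order.
        X.append([float(row[name]) for name in feature_columns])
        y.append(float(row[target_column]))
    return X, y


# Hypothetical sample data; replace with the content of the downloaded file.
sample = "time,value\n1959.0,315.42\n1959.083,316.31\n"
X, y = csv_to_xy(sample, ["time"], "value")
```

The nested lists are array-like, so they can be passed to an estimator's fit method directly or wrapped in NumPy arrays first.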

@mikiobraun

mldata.org (and also mloss.org unfortunately) servers were very sick... we're on it.

@raghavrv
Member

@jnothman @amueller @mfeurer Time to add openml.org fetcher? ;)

@jnothman
Member

jnothman commented Mar 28, 2017 via email

@ageron
Contributor

ageron commented Mar 29, 2017

Or mirrors?

@lesteve
Member

lesteve commented Mar 29, 2017

Or figshare ... there is an issue #7425 and a PR #7429.

@ogrisel
Member

ogrisel commented Mar 29, 2017

The way these things are going, I wish we had some way to not rely on the
availability of a single unassured host. Ugh. Torrents anyone?

I like the torrents solution as it's decentralized but unfortunately there are many corporate and institutional environments where the bittorrent protocol is banned...

That being said http://academictorrents.com is a great tracker, especially if you are interested in fetching large datasets like MSCOCO, ImageNet and OpenImages.

@GaelVaroquaux
Member

GaelVaroquaux commented Mar 29, 2017 via email

@gfilla

gfilla commented Mar 29, 2017

Anyone looking specifically for code to get the MNIST data set can use this from TensorFlow:

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/")

mnist_images = mnist.train.images
mnist_labels = mnist.train.labels

@ageron
Contributor

ageron commented Mar 31, 2017

@mikiobraun Any news? It seems that mldata.org is returning "Page unavailable" for every request. AFAICT, mldata.org was managed by the European project PASCAL2, which was closed about 3 years ago. Who is in charge of (and paying for) the servers now?

@mikiobraun

Hi @ageron. The service is hosted by TU Berlin, who agreed to keep it running indefinitely (it is essentially a single-instance VM with some NAS-attached disk storage). The admin at TU Berlin is on it. Thanks for your patience.

@ageron
Contributor

ageron commented Apr 6, 2017

Thanks for your feedback @mikiobraun.

@ageron
Contributor

ageron commented Apr 7, 2017

In case someone needs this, here's a function that downloads MNIST from another source and stores it in the default location where scikit-learn stores mldata datasets (~/scikit_learn_data/mldata/). So after you call this function, fetch_mldata("MNIST original") will work fine.

import os
from contextlib import closing
from shutil import copyfileobj

from six.moves import urllib
from sklearn.datasets import get_data_home


def fetch_mnist(data_home=None):
    """Download MNIST into scikit-learn's mldata cache unless it is already there."""
    mnist_alternative_url = "https://github.com/amplab/datascience-sp14/raw/master/lab7/mldata/mnist-original.mat"
    # fetch_mldata looks for cached files in <data_home>/mldata
    data_home = get_data_home(data_home=data_home)
    data_home = os.path.join(data_home, 'mldata')
    if not os.path.exists(data_home):
        os.makedirs(data_home)
    mnist_save_path = os.path.join(data_home, "mnist-original.mat")
    # Skip the download when the file is already cached
    if not os.path.exists(mnist_save_path):
        # closing() makes sure the HTTP response is released on Python 2 and 3
        with closing(urllib.request.urlopen(mnist_alternative_url)) as mnist_url:
            with open(mnist_save_path, "wb") as matlab_file:
                copyfileobj(mnist_url, matlab_file)

Here's an example:

>>> fetch_mnist()
>>> from sklearn.datasets import fetch_mldata
>>> mnist = fetch_mldata("MNIST original")
>>> mnist
{'target': array([ 0.,  0.,  0., ...,  9.,  9.,  9.]), 'data': array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=uint8), 'COL_NAMES': ['label', 'data'], 'DESCR': 'mldata.org dataset: mnist-original'}

@kracekumar

Keras uploaded the dataset to S3. Amazon charges for S3 based on the traffic it serves.

Another approach is to upload the dataset to a GitHub/GitLab repo and use the HTTP URL in the code.

@shatu

shatu commented Apr 18, 2017

BTW, is the list of all the datasets hosted at mldata.org available somewhere?

@ageron
Contributor

ageron commented Apr 19, 2017

@shatu Can't find a nice list, but mldata.org is partly browsable using archive.org, for example here.

@chengsoonong

mldata.org is back up.

I agree that there are challenges with keeping the response time small when it does go down though. We will talk to openml to see how we can migrate/mirror mldata there. The tricky bit is to try to retain the versions.

@lesteve
Member

lesteve commented May 11, 2017

@chengsoonong thanks! Closing this one.

@lesteve lesteve closed this as completed May 11, 2017
@omtinez

omtinez commented Mar 23, 2018

Can we re-open this issue now that it appears to be back offline?

@lesteve
Member

lesteve commented Mar 23, 2018

@omtinez what makes you think mldata.org is offline? For example this works fine for me (I made sure the data was not cached locally by deleting ~/scikit_learn_data/mldata):

from sklearn.datasets import fetch_mldata
fetch_mldata('MNIST original')

@omtinez

omtinez commented Mar 23, 2018

It appears to be back online now, albeit working very slowly for me...

@ThomasDelteil
Contributor

ThomasDelteil commented Apr 19, 2018

it appears down for me at this moment

edit: back up now, definitely needs something more stable than this

@lesteve
Member

lesteve commented Apr 19, 2018

edit: back up now, definitely needs something more stable than this

@ThomasDelteil there is some ongoing work on an OpenML fetcher. You are more than welcome to help on the #9908 PR, e.g. by reviewing, trying it out, giving feedback etc ...

soorya19 added a commit to soorya19/sparsity-based-defenses that referenced this issue Jul 16, 2018
Added a workaround to download MNIST data since mldata.org keeps going down (scikit-learn/scikit-learn#8588)
@ghost

ghost commented Aug 23, 2018

For those whose local file is not working, try creating a new notebook and doing the same thing.

@jnothman
Member

I'm not sure what you mean, @nakebull. At least we now have fetch_openml, although openml and mldata have different datasets, and openml delivers data in a text-based format that is more flexible but slower to load.
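
For readers switching to fetch_openml, a minimal sketch might look like the following. It assumes scikit-learn 0.20 or later and that OpenML hosts MNIST under the name "mnist_784" (version 1). The wrapper `load_mnist_from_openml` and its `fetcher` parameter are made up here so that the download can be swapped out, for example in tests.

```python
def load_mnist_from_openml(fetcher=None):
    """Return (data, target) for MNIST as served by OpenML.

    Hypothetical wrapper; `fetcher` defaults to sklearn.datasets.fetch_openml.
    """
    if fetcher is None:
        # Imported lazily so the function can be used with a stand-in fetcher
        # on machines without scikit-learn or network access.
        from sklearn.datasets import fetch_openml
        fetcher = fetch_openml
    # Assumption: dataset name and version as hosted on OpenML.
    bunch = fetcher("mnist_784", version=1)
    return bunch.data, bunch.target
```

Note that the dataset is not guaranteed to match the mldata copy exactly (for instance in label dtype or sample order), so results from older examples may need adjusting.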

@joaquinvanschoren
Contributor

I'll try to get all mldata.org datasets into OpenML (the mldata folks agreed to this). At the moment, I sadly can't reach the mldata server. Did anyone ever download all of them? That would be a huge help. Thanks!

@lesteve
Member

lesteve commented Nov 15, 2018

@joaquinvanschoren I just emailed you the address of someone I have contacted in the past about mldata.org problems and who has been very helpful each time. Let me know if you don't receive my email.

@glathrom

glathrom commented Feb 9, 2019

You can download the data from THE MNIST DATABASE
and here is a repository with a couple of functions to load the data for you. https://github.com/glathrom/sklearn_MNIST_load
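
The files on the MNIST database page are distributed in the IDX binary format. As a rough illustration of what such loading functions have to do, here is a minimal standard-library parser. It is a sketch, not the code from the linked repository; `parse_idx` is a name made up here, it expects already decompressed bytes, and it only handles unsigned-byte data (type code 0x08), which is what the MNIST image and label files use.

```python
import struct


def parse_idx(raw):
    """Parse IDX-formatted bytes into (dims, data).

    dims is a tuple of dimension sizes; data is the flat payload as bytes.
    """
    # Header: two zero bytes, one type-code byte, one dimension-count byte.
    zero, dtype_code, ndim = struct.unpack(">HBB", raw[:4])
    if zero != 0 or dtype_code != 0x08:
        raise ValueError("not an unsigned-byte IDX file")
    # Each dimension size is a big-endian 32-bit unsigned integer.
    header_end = 4 + 4 * ndim
    dims = struct.unpack(">" + "I" * ndim, raw[4:header_end])
    data = raw[header_end:]
    expected = 1
    for size in dims:
        expected *= size
    if len(data) != expected:
        raise ValueError("payload length does not match the header")
    return dims, data
```

For the downloaded .gz files, the bytes could be obtained with gzip.open(path, "rb").read() before calling the parser; an image file then yields dims of (count, rows, columns) and a label file yields (count,).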

@jnothman
Member

jnothman commented Feb 10, 2019 via email

@shortshortday

Thanks for your solution, @lesteve!

It works after I put https://github.com/amplab/datascience-sp14/blob/master/lab7/mldata/mnist-original.mat into ~/scikit_learn_data/mldata/.

mnist = fetch_mldata('MNIST original')

BTW, does anyone know why http://mldata.org/ was down for so long?

@jnothman
Member

jnothman commented Mar 28, 2019 via email
