Fathom datasets should be open, public and freely available #28

Closed
tobigithub opened this issue May 19, 2017 · 1 comment

Comments

@tobigithub

Hi,
just my 2 cents: the datasets needed to run this benchmark should be public, open, and freely available. Currently, some of the proposed datasets are not: http://fathom.readthedocs.io/en/latest/quickstart/#downloading-data

Telling users to obtain Atari ROMs (potentially illegally), when Atari recently filed copyright claims against several developers (https://www.google.com/search?q=Atari+ROM+copyright), creates a problem not only for developers but also for users.

Also, multiple datasets require signing licenses or logging in. That is usually counterproductive and tedious, and it limits the user base. As a solution, one could use synthetic datasets, sets that are broadly available, or smaller sets that are simply duplicated. The benchmark will probably perform fine, though of course without a renowned dataset. I am sure that is not always possible, and coming from a different field I cannot propose any good replacements, but I think it's a valid thought.
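
To make the synthetic and duplicated-set idea concrete, here is a rough sketch in plain Python. It is only an illustration under stated assumptions: the helper names, image shape, and class count are hypothetical and not part of Fathom, and a real workload would need tensors shaped to match each model's input.

```python
import itertools
import random


def synthetic_batches(num_batches, batch_size=4, height=8, width=8,
                      channels=3, num_classes=10, seed=0):
    """Yield (images, labels) batches filled with random values.

    Hypothetical helper, not part of Fathom: the shapes and class count
    are placeholders chosen for illustration only.
    """
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    for _ in range(num_batches):
        # images is a nested list indexed as [sample][row][column][channel]
        images = [[[[rng.random() for _ in range(channels)]
                    for _ in range(width)]
                   for _ in range(height)]
                  for _ in range(batch_size)]
        labels = [rng.randrange(num_classes) for _ in range(batch_size)]
        yield images, labels


def repeat_small_dataset(samples, total):
    """Cycle through a small sample list until `total` items are produced."""
    return list(itertools.islice(itertools.cycle(samples), total))
```

Random inputs like these would exercise the same compute paths for profiling, but they say nothing about model accuracy.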

@svrama
Collaborator

svrama commented May 19, 2017

Hi @tobigithub, thanks for your insightful feedback.

You raise many good points. Because Fathom's goal is to provide a suite of well-known models for profiling, we've sometimes had to grit our teeth and inconvenience users in obtaining the necessary datasets.

Regarding the Atari ROMs specifically, I'm not a lawyer, but I believe their use for academic research falls under fair use. That said, I agree that it would be nice for there to be a canonical task for reinforcement learning which does not skirt copyright law.

Unfortunately, several of the datasets (e.g., ImageNet, LDC datasets) do require registration, sometimes with a fee. While this is not ideal, it is the reality of obtaining the datasets considered standard in the machine learning community.

Adding synthetic datasets has been on our todo list for a while, but since most users had access to these canonical datasets already, we haven't prioritized that. If you have any specific needs which require synthetic datasets, let us know and we can try to set something up for you.

@svrama svrama closed this as completed May 19, 2017