
Grab rdatasets #561

Merged: 18 commits into statsmodels:master on Nov 15, 2012

Conversation

jseabold
Member

Add a get_rdatasets function with optional caching. Caching is turned off by default; I'm not sure whether we want to change that to on. Maybe save that decision for once we have global configs set up?

I also used this as an opportunity to pull the data-utils functions back into datasets, as we wanted to do a long time ago but never had a reason to.

Usage (dataset names are not case-sensitive; I don't know whether this can lead to name clashes, or even whether the datasets in R are uniquely named across packages):

import statsmodels.api as sm
# no caching (the default)
duncan = sm.datasets.grab_rdataset("Duncan")

# save it to cache
stackloss = sm.datasets.grab_rdataset("stackloss", cache=True)

# load from cache
stackloss = sm.datasets.grab_rdataset("stackloss", cache=True)

longley = sm.datasets.grab_rdataset("longley")

@jseabold
Member Author

Ok, should work on all Python versions now. For some reason

"asdasd".encode("zip")

does not work on my Python 3 build, even though I have zlib support and the docs indicate that it should. I had to work around it.
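For what it's worth, a workaround that sidesteps the string codec entirely is to call zlib directly; this is a minimal sketch of that approach (not necessarily the exact workaround used in the PR):

```python
import zlib

# Python 3 no longer accepts bytes-to-bytes codecs like "zip" via
# str.encode(), so "asdasd".encode("zip") can raise LookupError there.
# Compressing the UTF-8 bytes with zlib directly works the same way
# on both Python 2 and 3.
compressed = zlib.compress("asdasd".encode("utf-8"))
restored = zlib.decompress(compressed).decode("utf-8")
```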

Replaced httplib2 with urllib2, since httplib2 is not in the standard library. This meant I had to rewrite the caching code.

Now that we have caching, we return a Dataset (bunch) object with a DataFrame in the data attribute, plus attributes for descr, package, and from_cache.

If you decide to add the rst (or markdown) version of the docs to the repo, I'll add them to the Dataset object. It should be good to merge, though, if we don't want to wait. Usage is still:

from statsmodels.datasets import get_rdataset

# no caching (the default)
duncan = get_rdataset("Duncan")

# download and save to the cache
duncan = get_rdataset("Duncan", True)

# subsequent calls load from the cache
duncan = get_rdataset("Duncan", True)

Caching is dumb in that I'm not checking the HTTP response headers, and there's currently no way to overwrite a cached version: you'd have to delete the file from the cache directory (or clear the whole directory) to re-download. There's also no download_missing option, so if you ask to use the cache and the file isn't there, it will be downloaded. That seemed reasonable to me.

The default is not to use the cache. I'm not sure about making caching the default; I think we may want to, since otherwise we have to download both the data and the index file with all the metadata every time. I adapted get_data_home and clear_data_home from scikit-learn for the cache's data-home directory. By default it's HOME/statsmodels_data.
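For reference, the scikit-learn-style helpers mentioned above might look roughly like this; the function names come from the comment, but the implementation here is a sketch of the pattern, not the actual statsmodels code:

```python
import os
import shutil

def get_data_home(data_home=None):
    # Default to HOME/statsmodels_data; create the directory on first use.
    if data_home is None:
        data_home = os.path.join(os.path.expanduser("~"), "statsmodels_data")
    if not os.path.exists(data_home):
        os.makedirs(data_home)
    return data_home

def clear_data_home(data_home=None):
    # Delete the cache directory and everything in it to force re-downloads.
    shutil.rmtree(get_data_home(data_home))
```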

@vincentarelbundock
Contributor

Cool. I registered an issue vincentarelbundock/Rdatasets#5

I'll think about that whenever I get around to the cleanups I already have planned (they shouldn't affect users or URLs or anything).

@vincentarelbundock
Contributor

Sorry, just one thing: I think this needs to be case-sensitive. We have both Titanic.csv and titanic.csv, for example.

http://vincentarelbundock.github.com/Rdatasets/doc/titanic.html
http://vincentarelbundock.github.com/Rdatasets/doc/Titanic.html

@jseabold
Member Author

Oh, that's annoying. I had been wondering whether we even need to be able to specify the package, as with the pandas.rpy.load_data function, since I seem to recall that dataset names aren't unique across packages anyway.

@vincentarelbundock
Contributor

I did encounter at least one name that was used by two packages. I didn't think much of it then, but it could be a more serious problem if people actually start using this. As of 3 minutes ago, Rdatasets groups docs and files in subfolders based on the package with which the datasets were distributed. I also made a pull request to adjust some of the links in the statsmodels documentation: #570

If I were you, I'd have the grab-datasets function:

  1. Download the csv index.
  2. Check whether the dataset name provided is unique. If it is, download the doc and csv using the raw links provided in the csv index.
  3. Otherwise, stop and print a message asking users to pass the package name as a second argument to the grab_dataset function.
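Steps 2 and 3 above could be sketched roughly like this; `resolve_item` is a hypothetical helper, and the "Package"/"Item" column names are my assumption about the index layout, not the actual PR code:

```python
import csv
import io

def resolve_item(index_text, name, package=None):
    # Step 2: collect all index rows whose dataset name matches.
    rows = [r for r in csv.DictReader(io.StringIO(index_text))
            if r["Item"] == name]
    if package is not None:
        rows = [r for r in rows if r["Package"] == package]
    if len(rows) == 1:
        return rows[0]  # unique: caller can download the doc and csv links
    if not rows:
        raise ValueError("dataset %r not found in the index" % name)
    # Step 3: ambiguous; ask the user to pass the package name.
    pkgs = ", ".join(sorted(r["Package"] for r in rows))
    raise ValueError("dataset %r exists in several packages (%s); "
                     "pass the package name as a second argument"
                     % (name, pkgs))
```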

@vincentarelbundock
Contributor

Sorry for the changes. You're the first actual user, so you've made me realize the original design of Rdatasets wasn't optimal. I do intend to keep things very stable once we agree on a convenient formula.

@jseabold
Member Author

No worries. It's an easy fix, and this definitely makes more sense.

@jseabold
Member Author

Ok, I think this is good to go now. I changed it so that you need to pass the package name (default 'datasets'), the Dataset object now has the rst docstring available, and I added it to the main datasets documentation.

@jseabold
Member Author

Any idea why this is failing? I apparently have no way to check the build status; everything looks fine here on Python 2.6 and 3.2.

jseabold added a commit that referenced this pull request Nov 15, 2012
@jseabold jseabold merged commit 663aea6 into statsmodels:master Nov 15, 2012
PierreBdR pushed a commit to PierreBdR/statsmodels that referenced this pull request Sep 2, 2014