
Grab rdatasets #561

Merged: 18 commits into statsmodels:master on Nov 15, 2012

Conversation

jseabold
Member

Add a get_rdatasets function with optional caching. Caching is turned off by default; I'm not sure whether we want to change that to on. Maybe save that decision for once we have global configs set up?

I also used this as an opportunity to pull the data-utils functions back into datasets, as we wanted to do a long time ago but never had a reason to.

Usage (dataset names are not case-sensitive; I don't know whether this can lead to name clashes, or even whether the datasets in R are uniquely named across packages):

import statsmodels.api as sm
# no caching (the default)
duncan = sm.datasets.grab_rdataset("Duncan")

# save it to cache
stackloss = sm.datasets.grab_rdataset("stackloss", cache=True)

# load from cache
stackloss = sm.datasets.grab_rdataset("stackloss", cache=True)

longley = sm.datasets.grab_rdataset("longley")

@jseabold
Member Author

Ok, should work on all Python versions now. For some reason

"asdasd".encode("zip")

does not work on my Python 3 build, even though I have zlib support and the docs indicate that it should. I had to work around it.
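For what it's worth, a workaround that sidesteps the string codec entirely is to call zlib directly; this is a minimal sketch of that approach (not necessarily the exact workaround used in the PR):

```python
import zlib

# Python 3 no longer accepts bytes-to-bytes codecs like "zip" via
# str.encode(), so "asdasd".encode("zip") can raise LookupError there.
# Compressing the UTF-8 bytes with zlib directly works the same way
# on both Python 2 and 3.
compressed = zlib.compress("asdasd".encode("utf-8"))
restored = zlib.decompress(compressed).decode("utf-8")
```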

Replaced httplib2 with urllib2, since httplib2 is not in the standard library. This meant I had to rewrite the caching code.

Now that we have caching, we return a Dataset (bunch) object with a DataFrame in the data attribute, plus attributes for descr, package, and from_cache.

If you decide to add the rst (or markdown) version of the docs to the repo, I'll add them to the Dataset object. It should be good to merge, though, if we don't want to wait. Usage is still:

from statsmodels.datasets import get_rdataset

# no caching (the default)
duncan = get_rdataset("Duncan")

# download and save to the cache
duncan = get_rdataset("Duncan", True)

# subsequent calls load from the cache
duncan = get_rdataset("Duncan", True)

Caching is dumb in that I'm not checking the HTTP response headers, and there's currently no way to overwrite a cached version: you'd have to delete the file from the cache directory (or clear the whole directory) to re-download. There's also no download_missing option, so if you ask to use the cache and the file isn't there, it will be downloaded. That seemed reasonable to me.

The default is not to use the cache. I'm not sure about making caching the default; I think we may want to, since otherwise we have to download both the data and the index file with all the metadata every time. I adapted get_data_home and clear_data_home from scikit-learn for the cache's data-home directory. By default it's HOME/statsmodels_data.
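For reference, the scikit-learn-style helpers mentioned above might look roughly like this; the function names come from the comment, but the implementation here is a sketch of the pattern, not the actual statsmodels code:

```python
import os
import shutil

def get_data_home(data_home=None):
    # Default to HOME/statsmodels_data; create the directory on first use.
    if data_home is None:
        data_home = os.path.join(os.path.expanduser("~"), "statsmodels_data")
    if not os.path.exists(data_home):
        os.makedirs(data_home)
    return data_home

def clear_data_home(data_home=None):
    # Delete the cache directory and everything in it to force re-downloads.
    shutil.rmtree(get_data_home(data_home))
```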

@vincentarelbundock
Contributor

Cool. I registered an issue vincentarelbundock/Rdatasets#5

I'll think about that whenever I get around to the cleanups I already have planned (they shouldn't affect users or URLs or anything).

@vincentarelbundock
Contributor

Sorry, just one thing: I think this needs to be case-sensitive. We have both Titanic.csv and titanic.csv, for example.

http://vincentarelbundock.github.com/Rdatasets/doc/titanic.html
http://vincentarelbundock.github.com/Rdatasets/doc/Titanic.html

@jseabold
Member Author

Oh, that's annoying. I had been wondering whether we even need to be able to specify the package, as with the pandas.rpy.load_data function, since I seem to recall that dataset names aren't unique across packages anyway.

@vincentarelbundock
Contributor

I did encounter at least one name that was used by two packages. I didn't think much of it then, but it could be a more serious problem if people actually start using this. As of 3 minutes ago, Rdatasets groups docs and files in subfolders based on the package with which the datasets were distributed. I also made a pull request to adjust some of the links in the statsmodels documentation: #570

If I were you, I'd have the grab-datasets function:

  1. Download the csv index.
  2. Check whether the dataset name provided is unique. If it is, download the doc and csv using the raw links provided in the csv index.
  3. Otherwise, stop and print a message asking users to pass the package name as a second argument to the grab_dataset function.
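Steps 2 and 3 above could be sketched roughly like this; `resolve_item` is a hypothetical helper, and the "Package"/"Item" column names are my assumption about the index layout, not the actual PR code:

```python
import csv
import io

def resolve_item(index_text, name, package=None):
    # Step 2: collect all index rows whose dataset name matches.
    rows = [r for r in csv.DictReader(io.StringIO(index_text))
            if r["Item"] == name]
    if package is not None:
        rows = [r for r in rows if r["Package"] == package]
    if len(rows) == 1:
        return rows[0]  # unique: caller can download the doc and csv links
    if not rows:
        raise ValueError("dataset %r not found in the index" % name)
    # Step 3: ambiguous; ask the user to pass the package name.
    pkgs = ", ".join(sorted(r["Package"] for r in rows))
    raise ValueError("dataset %r exists in several packages (%s); "
                     "pass the package name as a second argument"
                     % (name, pkgs))
```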

@vincentarelbundock
Contributor

Sorry for the changes. You're the first actual user, so you've made me realize the original design of Rdatasets wasn't optimal. I do intend to keep things very stable once we agree on a convenient formula.

@jseabold
Member Author

No worries. It's an easy fix, and this definitely makes more sense.

@jseabold
Member Author

Ok, I think this is good to go now. I changed it so that you need to pass the package name (default 'datasets'), the Dataset object now has the rst docstring available, and I added it to the main datasets documentation.

@jseabold
Member Author

Any idea why this is failing? I apparently have no way to check the build status; everything looks fine here on Python 2.6 and 3.2.

jseabold added a commit that referenced this pull request Nov 15, 2012
@jseabold jseabold merged commit 663aea6 into statsmodels:master Nov 15, 2012
PierreBdR pushed a commit to PierreBdR/statsmodels that referenced this pull request Sep 2, 2014