Grab rdatasets #561
Conversation
Ok, should work on all Python versions now. For some reason
does not work on my Python 3 build, though I have zlib support and the docs indicate that it should. I dunno; I had to work around it. I replaced httplib2 with urllib2, since httplib2 is not in the standard library, which meant I had to rewrite the caching code. Now that we have caching, we return a Dataset (bunch) object with a DataFrame in the data attribute, plus attributes for descr, package, and from_cache. If you decide to add the rst (or markdown) version of the docs to the repo, I'll add them to the Dataset object. Should be good to merge, though, if we don't want to wait. Usage is still
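The Dataset (bunch) object described above can be sketched minimally like this. This is a hypothetical reduction, not the actual statsmodels class, which carries more machinery; it just shows the dict-with-attribute-access pattern behind data, descr, package, and from_cache:

```python
class Dataset(dict):
    """Minimal bunch: a dict whose keys double as attributes.

    Hypothetical sketch of the Dataset object described above; the
    real statsmodels class has more to it than this.
    """
    def __init__(self, **kwargs):
        dict.__init__(self, kwargs)
        # Point the instance __dict__ at the dict itself, so that
        # ds.data and ds["data"] refer to the same entry.
        self.__dict__ = self
```

A returned dataset would then expose the attributes mentioned above, e.g. `Dataset(data=frame, descr=docs, package="datasets", from_cache=False)`.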
Caching is dumb because I'm not checking the HTTP response headers, and there's currently no way to overwrite the cached version: you'd have to delete the file from the cache directory, or clear the whole directory, to re-download. Also, there's no download_missing option, so if you ask to use the cache and it's not there, it will download it. Seemed reasonable to me. The default is not to use the cache; I'm not sure about making caching the default. I think we may want to, since otherwise we have to download both the data and the index file with all the metadata every time. I adapted get_data_home and clear_data_home from scikit-learn for the caching data home directory. By default it's in HOME/statsmodels_data.
Cool. I registered an issue: vincentarelbundock/Rdatasets#5. I'll think about that whenever I get around to doing the cleanups I already planned (shouldn't affect users or URLs or anything).
Sorry, just one thing: I think this needs to be case-sensitive. We have both Titanic.csv and titanic.csv, for example. http://vincentarelbundock.github.com/Rdatasets/doc/titanic.html
Oh, that's annoying. I was wondering whether we even need to be able to specify the package, like with the
I did encounter at least one name that was used by two packages. I didn't think much of it then, but it could be a more serious problem if people actually start using this. As of 3 minutes ago, Rdatasets groups docs and files in subfolders based on the package with which the datasets were distributed. I also made a pull request to adjust some of the links in the statsmodels documentation: #570 If I were you, the grab datasets function would:
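With the new per-package subfolders, a case-sensitive, package-qualified URL could be built along these lines. This is a hypothetical sketch (the function name and exact path layout are assumptions, not what either project necessarily shipped):

```python
BASE_URL = "http://vincentarelbundock.github.com/Rdatasets"

def dataset_url(dataname, package="datasets"):
    """Build a package-qualified CSV URL (hypothetical sketch).

    Deliberately case-sensitive: Titanic.csv and titanic.csv are
    distinct files, so no lower()/upper() normalization happens here.
    """
    return "%s/csv/%s/%s.csv" % (BASE_URL, package, dataname)
```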
Sorry for the changes. You're the first actual user, so you've made me realize the original design of Rdatasets was not optimal. I really do intend to keep things very stable as soon as we agree on a winning/convenient formula.
No worries. It's an easy fix, and this definitely makes more sense. |
Ok, I think this is good to go now. Changed it so that you need to add the package name (default 'datasets'), the Dataset object now has the rst docstring available, and I added it to the main datasets documentation. |
Any idea why this is failing? I apparently have no way to check the build status. Everything looks fine here in Python 2.6 and 3.2. |
Grab rdatasets
Add a get_rdatasets function with optional caching. Caching is turned off by default; not sure if we want to change the default to on. Maybe save that for once we have global configs set up?
I also used this to pull the data utils functions back into datasets like we wanted to do a long time ago. Just never had a reason.
Usage (it's not case-sensitive for the dataset name; I don't know whether this can lead to name clashes, or even whether the datasets in R are uniquely named across packages):
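The case-insensitive matching described above can be sketched like this, which also shows exactly where a name clash would bite. The helper name and the list-of-available-names input are assumptions for illustration:

```python
def find_dataset(name, available):
    """Case-insensitive lookup over known dataset names.

    Hypothetical sketch: with both Titanic.csv and titanic.csv in
    Rdatasets, a lookup like this becomes ambiguous.
    """
    matches = [d for d in available if d.lower() == name.lower()]
    if not matches:
        raise ValueError("dataset %r not found" % name)
    if len(matches) > 1:
        raise ValueError("ambiguous name %r: matches %s" % (name, matches))
    return matches[0]
```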