-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Add CSV backend #723
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add CSV backend #723
Conversation
This is great. Do you think we could use odo to abstract away from the backend? https://odo.readthedocs.org/en/latest/ |
I started reading up on and experimenting with blaze, odo, and friends
this weekend, and they all look really nice. But at I don't yet have a
clear picture of how to best integrate them and would be glad to see any
suggestions or proposals for this (but perhaps we should start a
separate, dedicated issue for exploring that?).
|
@kyleam, I think dask with its array of parallel shared memory, async and distributed schedulers and chunking system would be the best bet, aside from odo, for integration. More info, including an offer from the library author to help, can be found in #707. Matt can help with use case guidance as well. |
pymc3/tests/test_csv_backend.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why two underscores?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to make the indices visually distinct since it's common for
variable names to have both underscores and digits. But if people
have a strong preference against it, I'm ok changing it to one
underscore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. I think its good to keep.
Is pandas ever a hard dependency to install? Is there a reason not to make it a hard dep? Otherwise looks good to me. |
I'm for making Pandas a non-optional dependency. If we do that, I'd |
@twiecki @fonnesbeck ? Thoughts on making pandas a hard dep? |
Given that Pandas is now a core library in the scientific stack, forcing users to install it is not unreasonable. |
+1 on adding pandas as a hard dependency |
@kyleam can you add the dependency to setup.py? Then I think we're good to go. |
Will do. |
This has a couple of advantages over the Text backend. 1. The values are stored as in a table per chain, not a file per variable. This is easier to inspect and work with directly if desired (e.g., with pd.read_csv). 2. Values are stored during sampling, not kept in memory.
The main reason to keep the Text backend around was that pandas was an optional dependency. Now that this is no longer the case, the original Text backend doesn't offer any advantages over the CSV backend. (Even if someone prefers not to write to files while sampling, they can sample with the NDArray backend and then use the CSV dump function.) Rename CSV to Text. This name is more appropriate because the values being stored as plain text is the important feature, not which delimiter is used.
Updated. Does anyone have an issue with removing Text (and renaming CSV to Text)? |
What additional capabilities beyond CSV does Text currently have? If it doesn't really have any, then I'm all for it. |
None (see commit message). |
Yeah, lets replace it then. |
This has a couple of advantages over the Text backend.
variable. This is easier to inspect and work with directly if
desired (e.g., with pd.read_csv).
This relies on Pandas, which currently isn't listed as a hard
dependency.