pandas API #3

Open
datapythonista opened this issue Aug 31, 2019 · 2 comments

@datapythonista
Member

There are cases where the pandas API can be inconsistent or not very intuitive.

Also, the pandas namespaces are huge (Series has around 200 public attributes/methods).
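One rough way to check the size of a namespace is to count the names that do not start with an underscore. The sketch below is only an illustration: `public_names` is a hypothetical helper (not part of pandas), and the exact count depends on the installed pandas version, so no number is asserted here.

```python
def public_names(obj):
    """Return the sorted names on `obj` that do not start with an underscore."""
    return sorted(name for name in dir(obj) if not name.startswith("_"))


try:
    import pandas as pd
except ImportError:
    # pandas is not installed; the helper still works on any other class.
    pd = None

if pd is not None:
    names = public_names(pd.Series)
    # The count varies between pandas versions.
    print(f"pd.Series exposes {len(names)} public names")
```

The same helper can be pointed at `pd.DataFrame` or at a reimplementation of the API to compare how much surface area each one exposes.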

It would be useful to discuss possible improvements to the API, places where people reimplementing it thought they were replicating something wrong, and general ideas for making the pandas public API easier for users.

@xhochy
Member

xhochy commented Sep 3, 2019

It would be nice to also see what a dream-like DataFrame API would look like. I guess there are multiple opinions on what a DataFrame API should look like, but it would be really good to cover them. This also goes a bit in the direction of #4, as there will be choices like eager vs. lazy API, in-place modification vs. full immutability, out-of-core/in-memory/distributed, and so on.
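To make the eager vs. lazy choice concrete, here is a toy sketch in plain Python. `EagerColumn` and `LazyColumn` are hypothetical names invented for this illustration and do not correspond to any pandas, Arrow, or Modin class. Both variants are immutable (every method returns a new object); they differ only in when the work happens.

```python
class EagerColumn:
    """Eager: each method computes immediately and returns a new object."""

    def __init__(self, values):
        self.values = list(values)

    def map(self, fn):
        return EagerColumn(fn(v) for v in self.values)

    def filter(self, pred):
        return EagerColumn(v for v in self.values if pred(v))


class LazyColumn:
    """Lazy: methods only record steps; nothing runs until collect()."""

    def __init__(self, values, steps=()):
        self._values = values
        self._steps = tuple(steps)

    def map(self, fn):
        return LazyColumn(self._values, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyColumn(self._values, self._steps + (("filter", pred),))

    def collect(self):
        # Run all recorded steps in a single pass over the data.
        out = []
        for v in self._values:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    v = fn(v)
                elif not fn(v):
                    keep = False
                    break
            if keep:
                out.append(v)
        return out


eager = EagerColumn(range(6)).map(lambda x: x * 2).filter(lambda x: x > 4)
lazy = LazyColumn(range(6)).map(lambda x: x * 2).filter(lambda x: x > 4)
print(eager.values)    # computed step by step, with an intermediate list
print(lazy.collect())  # computed in one pass, only when requested
```

The lazy variant sees the whole chain before running anything, which is what gives a library room to reorder or fuse steps; the eager variant is easier to debug because every intermediate result exists.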

Everyone is vocal that some design choices made in the history of pandas are regretted nowadays, e.g. https://wesmckinney.com/blog/apache-arrow-pandas-internals/. We cannot solve them with a single API, but we can definitely improve on the pandas API.

While Apache Arrow aims to provide an in-memory format for DataFrame-like data, basic algorithms, and a lot of I/O, its intention is not to provide an end-user API. It is, though, a tool for building future DataFrame APIs for end users by providing the necessary, performant building blocks.

@devin-petersohn
Contributor

> It would be useful to discuss possible improvements to the API, places where people reimplementing it thought they were replicating something wrong, and general ideas for making the pandas public API easier for users.

This is something we're exploring as well in Modin, but at a multiprocessing/distributed level. We are taking an academic approach toward solving some of these issues. We are also building Modin to be pluggable, and we have played around with Arrow compute kernels and one of the Gandiva LLVM operators to see how they affect performance in the multiprocessing setting. (Spoiler alert: Arrow is fast 😄.) Modin is modular to allow for these types of improvements (you can now run it on Dask too).

> It would be nice to also see what a dream-like DataFrame API would look like.

I doubt there's an easy answer here. Once something becomes a standard, it is very difficult to change that standard in a significant way, and for better or worse pandas (and by extension its API) is a standard. "The best API is the one you already know." Ease of use is also relative, because any issue I have with pandas is almost guaranteed to have been answered on StackOverflow before. "How do I ... in pandas?" is really easy to find answers for. As intuitive as a new API could be, it's going to be tough to beat the internet help/community of pandas.

> Everyone is vocal that some design choices made in the history of pandas are regretted nowadays, e.g. https://wesmckinney.com/blog/apache-arrow-pandas-internals/.

Part of the mission of Modin is to meet users where they are and take on the challenges that presents. Some things within the pandas API will never be fast in a distributed environment. Some operations are extremely difficult to support. Truthfully, the API scope problem is different from the execution problem, because a sufficiently intelligent query planner would be able to identify poorly written code and optimize it. That includes understanding what a user is trying to do rather than what they are typing. To that end, we have a reduced internal API for the operations that we support, because we want there to be one implementation for a given behavior.
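As a toy illustration of what "identify poorly written code and optimize it" could mean, the sketch below rewrites a "sort everything, then take the first n rows" plan into a single top-n step. This is not Modin's planner or internal API; the plan format and the `optimize` and `execute` functions are assumptions made up for this example.

```python
import heapq


def optimize(plan):
    """Rewrite ("sort",) followed by ("head", n) into one ("top_n", n) step."""
    out = []
    i = 0
    while i < len(plan):
        step = plan[i]
        nxt = plan[i + 1] if i + 1 < len(plan) else None
        if step == ("sort",) and nxt is not None and nxt[0] == "head":
            out.append(("top_n", nxt[1]))
            i += 2  # both steps were consumed by the rewrite
        else:
            out.append(step)
            i += 1
    return out


def execute(plan, values):
    """Run a plan (a list of tuples) over a list of comparable values."""
    data = list(values)
    for step in plan:
        op = step[0]
        if op == "sort":
            data = sorted(data)
        elif op == "head":
            data = data[: step[1]]
        elif op == "top_n":
            # Selects the n smallest values without sorting the full input.
            data = heapq.nsmallest(step[1], data)
        else:
            raise ValueError(f"unknown step: {op!r}")
    return data


user_plan = [("sort",), ("head", 3)]  # what the user typed
fast_plan = optimize(user_plan)       # what the user meant
assert execute(user_plan, [5, 3, 1, 4, 2]) == execute(fast_plan, [5, 3, 1, 4, 2])
```

Both plans return the same rows; the rewritten one avoids a full sort. A rewrite like this only needs the plan, not the user-facing method names, which is one reason a small internal API with one implementation per behavior is easier to optimize.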

From a systems perspective, pandas is an easy system to hate because it breaks so much of the conventional wisdom that database work has brought us over the last 40+ years. Part of my PhD work is to bridge this gap without losing the things that make pandas what it is.

It might seem like I love the pandas API, but I do not (trust me on this 😄). I wanted to lay out the challenges I see in changing or making a new API/library and I think it's a lot harder than just making things faster or simplifying the API.
