Brings distributed execution and optional freedom from pandas #47
Merged
Conversation
Functions can create or take in scalars. Adding this to the example so that people can see that more explicitly.
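For illustration, a minimal sketch of what this looks like in Hamilton code -- the function and column names here are hypothetical, not necessarily the ones used in the example:

```python
import pandas as pd


def avg_spend(spend: pd.Series) -> float:
    """A Hamilton function can produce a scalar from a series."""
    return spend.mean()


def spend_zero_mean(spend: pd.Series, avg_spend: float) -> pd.Series:
    """...and another function can take that scalar in, alongside a series."""
    return spend - avg_spend
```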
Hamilton was hardcoded to return a pandas dataframe. This change introduces a module `base` that houses:

1. An interface object `ResultMixin` with a static function called `build_result`.
2. An interface object `HamiltonGraphAdapter` that houses the interface required to augment execution more broadly.

`ResultMixin`: the idea is that we'll use this class as the interface describing the shape of the result to return. I then went ahead and implemented `DictResult`, `PandasDataFrameResult`, and `NumpyMatrixResult`.

`HamiltonGraphAdapter`: paves the way for distributed computation and for specifying different return types, since it allows us to augment and wrap Hamilton functions as we walk the graph. All graph adapters will therefore inherit from and override the functions in `HamiltonGraphAdapter`. I decided to have it extend `ResultMixin` to simplify things: in some cases a `HamiltonGraphAdapter` will need to stitch together a result itself, in other cases it will just delegate to a mixin.

Why the two "type" functions in `HamiltonGraphAdapter`? We do a static compile-time check for type equivalence when building a graph, and another when we're inspecting inputs. We need to delegate this check to the adapter because some type mismatches might be expected (e.g. dask can mimic pandas). I don't necessarily like wiring the adapter all the way through, but it works; I think we'll be able to change things easily, since basically all the code in graph.py should be viewed as internal code.

Otherwise these changes should be backwards compatible -- passing in a `HamiltonGraphAdapter` is optional. The `SimplePythonDataFrameHamiltonGraphAdapter`, which mirrors the current Hamilton behavior, is the default if none is provided.

Comment on `FunctionGraph`: I expect to further augment `FunctionGraph` to tease out certain parts of it and enable a broader array of use cases. Otherwise I think we're pretty set on "black-box" reuse: reusing `FunctionGraph` in different contexts by passing in an object implementing a set interface, versus using inheritance.
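To make the `ResultMixin` idea concrete, here is a minimal sketch of a custom implementation. It assumes `build_result` receives the computed outputs as keyword arguments, mirroring the `DictResult`/`PandasDataFrameResult` implementations described above; the class name and conversion logic are hypothetical.

```python
import typing

import pandas as pd

from hamilton import base  # the module introduced in this PR


class JSONReadyResult(base.ResultMixin):
    """Hypothetical result shape: a plain dict of lists, ready for json.dumps."""

    @staticmethod
    def build_result(**outputs: typing.Any) -> dict:
        def _to_plain(value: typing.Any) -> typing.Any:
            # Convert pandas series to plain lists; pass scalars and other values through.
            if isinstance(value, pd.Series):
                return value.tolist()
            return value

        return {name: _to_plain(value) for name, value in outputs.items()}
```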
This moves things to an experimental package -- to make it clear that these are early bits of code. It then implements the `HamiltonGraphAdapter` interface, adds tests, and adds a hello world. The salient thing to note about the hello world is that we introduce a `data_loaders` and a `business_logic` module. These separate the Hamilton functions by concern: the idea is that `business_logic` is fairly fixed and invariant, so if you wanted to go from local development to running on a dask cluster, you'd swap the `data_loaders` module for some other module that loads things the way you want them.
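A sketch of that separation (file contents are hypothetical; only the module names come from the hello world):

```python
# business_logic.py -- platform-agnostic transforms; unchanged whether you run
# locally on pandas or on a dask cluster.
import pandas as pd


def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
    return spend / signups
```

```python
# data_loaders.py -- local/pandas loading. To move to dask, swap this module for
# one that loads dask dataframes instead; business_logic stays untouched.
import pandas as pd


def spend(raw_data_path: str) -> pd.Series:
    return pd.read_csv(raw_data_path)["spend"]


def signups(raw_data_path: str) -> pd.Series:
    return pd.read_csv(raw_data_path)["signups"]
```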
This code is an MVP to get things running on Ray! One reason the `base.ResultMixin` methods are static functions on an object is so that Ray can serialize them. Otherwise, I believe this is an easy way to run Hamilton on a multicore system locally, as well as to connect to a remote cluster. The hello world is practically identical to the dask one, apart from setting up the graph adapter and the Ray config.
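A sketch of wiring this up -- only the `h_ray` module name is confirmed in this PR; the adapter class name, its constructor argument, and the driver keyword are assumptions for illustration:

```python
import ray

from hamilton import base, driver
from hamilton.experimental import h_ray  # experimental module added in this PR

import business_logic  # hello-world style modules, as sketched above
import data_loaders

ray.init()  # local multicore by default; pass an address to attach to a remote cluster

# Assumed names: RayGraphAdapter and its result_builder argument.
adapter = h_ray.RayGraphAdapter(result_builder=base.PandasDataFrameResult())
dr = driver.Driver({"raw_data_path": "spend.csv"}, data_loaders, business_logic, adapter=adapter)
df = dr.execute(["spend_per_signup"])
ray.shutdown()
```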
This code works on Spark 3.2, where Koalas officially becomes part of Spark. It seems to work as intended. We will likely want to always set `set_option('compute.ops_on_diff_frames', True)`, since people will likely want Hamilton to handle doing joins, versus having to do that upfront before passing data into Hamilton. Note: Spark and Dask don't implement 100% of the pandas API, but they implement enough of it that simple aggregations and the most commonly used functions do work.
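For reference, setting that option with the pandas API that ships in Spark 3.2 (a minimal snippet, assuming `pyspark.pandas` rather than the standalone `databricks.koalas` package):

```python
import pyspark.pandas as ps  # Koalas as folded into Spark 3.2+

# Allow operations that combine series/frames backed by different Spark frames,
# so Hamilton can handle joins rather than requiring them upfront.
ps.set_option("compute.ops_on_diff_frames", True)
```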
So that people can install the right dependencies as necessary to use these.
The circleci config was off. This fixes that. Otherwise we add tests for ray and dask, and then tell it where to run the unit tests -- because otherwise it was running all tests, and because dependencies were off things were failing.
skrawcz changed the title from "Brings distributed and freedom from pandas" to "Brings distributed execution and freedom from pandas" on Feb 3, 2022.
skrawcz changed the title from "Brings distributed execution and freedom from pandas" to "Brings distributed execution and optional freedom from pandas" on Feb 3, 2022.
So that people have some orientation as to what is going on.
It can return Any type, because it delegates to the passed in ResultMixin object.
So that people know how to run it.
So that people know what's required.
So that people know what to install if they want to run the example.
This allows you to easily insert a ResultMixin of your choice if you just want regular python execution. TODO: follow up with an example.
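A minimal sketch of what that looks like. The adapter class name here is an assumption (only `SimplePythonDataFrameHamiltonGraphAdapter` is named in this PR); the idea is simply that a plain-python adapter delegates result building to whichever `ResultMixin` you hand it:

```python
from hamilton import base, driver

import business_logic  # hello-world style modules, as sketched above
import data_loaders

# Assumed adapter name; it wraps the ResultMixin of your choice.
adapter = base.SimplePythonGraphAdapter(base.DictResult())
dr = driver.Driver({"raw_data_path": "spend.csv"}, data_loaders, business_logic, adapter=adapter)
result = dr.execute(["spend_per_signup"])  # a plain dict instead of a pandas dataframe
```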
So that it's clear that it only expects numpy arrays.
It adds information on scaling characteristics, and pointers to documentation where it makes sense. I guess over time we'll add more to these classes as we get feedback about what works, and what doesn't.
Using a boolean was limiting: it did not allow people to control the visualization. Instead, I think a dictionary of kwargs is the way to signal visualization. I tested that things work with an empty dictionary. Otherwise the user is pointed to the docs for it.
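Purely illustrative of the before/after shape -- the parameter name and call site are assumptions, not the actual API; only the boolean-to-dict-of-kwargs change is described above:

```python
from hamilton import driver

import business_logic  # hello-world style modules, as sketched above
import data_loaders

dr = driver.Driver({"raw_data_path": "spend.csv"}, data_loaders, business_logic)

# Before (hypothetical parameter name): a bare boolean, no control over rendering.
# dr.execute(["spend_per_signup"], display_graph=True)

# After: a dict of kwargs forwarded to the visualization. An empty dict renders with
# defaults; keys let you customize the output (see the docs for the options).
dr.execute(["spend_per_signup"], display_graph={})
```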
Since the visualization libs are not installed by default, why spam the logs? Require the user to flip it to True, and then deal with installing those extra dependencies.
This is to: 1. Force us to have things that work across platform easily. 2. Showcase that people can not only scale code easily, but port it too!
We want to install "complete" by default. This fixes that.
We do not have extensive testing -- but at least locally I have had no issues running python 3.8 and 3.9.
This mirrors the h_dask and h_ray tests, and adds this to run in circleci. Note regarding circleci: since we need Java, we have to change the image. The image has Python but not pip, so I need to do some apt-get work and use aptitude to install things so that Python is in a working state on the image. It does complain about bdist not installing with pyspark, but that doesn't seem to break actually running the code.
Can we reorganize tests? I propose the following:
The symlink is clever, but I'd rather they just refer to the same thing? Happy to push this off till later, but I'd really prefer that the tests all live together...
elijahbenizzy
approved these changes
Feb 6, 2022
As discussed, shipping for the 1.3.0 release