
Brings distributed execution and optional freedom from pandas #47

Merged
merged 28 commits into main from distributed_prototype
Feb 6, 2022

Conversation

skrawcz
Collaborator

@skrawcz skrawcz commented Feb 3, 2022

[Short description explaining the high-level reason for the pull request]

Additions

Removals

Changes

Testing

Screenshots

If applicable

Notes

Todos

Checklist

  • PR has an informative and human-readable title
  • Changes are limited to a single goal (no scope creep)
  • Code can be automatically merged (no conflicts)
  • Code follows the standards laid out in the TODO link to standards
  • Passes all existing automated tests
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future todos are captured in comments
  • Project documentation has been updated (including the "Unreleased" section of the CHANGELOG)
  • Reviewers requested with the Reviewers tool ➡️

Testing checklist

Python

  • python 3.6
  • python 3.7

chmp and others added 12 commits January 8, 2022 15:53
Functions can create or take in scalars. Adding this to the
example so that people can see that more explicitly.
Hamilton was hardcoded to return a pandas dataframe. This change
introduces a module `base` that houses:

1. an interface object `ResultMixin` with a static function called "build_result".
2. an interface object `HamiltonGraphAdapter` that houses the required interface to augment execution more broadly.

`ResultMixin`: The idea is that we'll use this class as the interface to describe the shape required to return different results.
I then went ahead and implemented DictResult, PandasDataFrameResult, and NumpyMatrixResult.

`HamiltonGraphAdapter`: paves the way for distributed computation & specifying different return types, since it allows us to augment and wrap Hamilton
functions as we're walking the graph. All GraphAdapters will therefore inherit from and override the functions in `HamiltonGraphAdapter`.
I decided to have it extend ResultMixin to simplify things: in some cases the HamiltonGraphAdapter will have its own need
to stitch together a result, in other cases it will just delegate to a mixin.

Why those two "type" functions in `HamiltonGraphAdapter`? We do a static, compile-time check for type equivalence when building a graph, and
another when we're inspecting inputs. We need to delegate these checks to the adapter because some type mismatches
might be expected (e.g. dask can mimic pandas). I don't necessarily like wiring the
adapter all the way through, but it works; I think we'll be able to change things
easily since basically all the code in graph.py should be viewed as internal code.

Otherwise all these changes should be backwards compatible -- passing in a HamiltonGraphAdapter is optional. The
`SimplePythonDataFrameHamiltonGraphAdapter`, which mirrors the current Hamilton behavior, is the default if none
is provided.

Comment on FunctionGraph: I expect to further augment FunctionGraph to tease out certain parts of it to enable
a broader array of use cases. Otherwise I think we're pretty set on "black-box" reuse, i.e. getting reuse of the
FunctionGraph in different contexts by passing in an object with a set interface, versus using inheritance.
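
For orientation, here is a minimal sketch of the interfaces described above. It only assumes what this description states (a static `build_result`, plus Dict/pandas/numpy implementations); the real classes in `base` handle more edge cases, so treat this as illustrative rather than the actual implementation.

```python
import abc
import typing

import numpy as np
import pandas as pd


class ResultMixin(object):
    """Describes the shape required to build a particular kind of result."""

    @staticmethod
    @abc.abstractmethod
    def build_result(**outputs: typing.Any) -> typing.Any:
        """Static so adapters can use (and serialize) it without instance state."""
        pass


class DictResult(ResultMixin):
    @staticmethod
    def build_result(**outputs: typing.Any) -> typing.Dict[str, typing.Any]:
        # Just hand back the computed outputs keyed by function name.
        return outputs


class PandasDataFrameResult(ResultMixin):
    @staticmethod
    def build_result(**outputs: typing.Any) -> pd.DataFrame:
        # Stitch the named outputs together into a dataframe.
        return pd.DataFrame(outputs)


class NumpyMatrixResult(ResultMixin):
    @staticmethod
    def build_result(**outputs: typing.Any) -> np.matrix:
        # Simplified: assumes all outputs are equal-length 1-D arrays.
        return np.matrix(list(outputs.values())).T
```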
This moves things to an experimental package -- to make it clear
that these are early bits of code.

It then implements the HamiltonGraphAdapter interface, adds
tests, and a hello world.

The salient thing to note about the hello world is that we
introduce a data_loaders module and a business_logic module. These
separate the Hamilton logic functions by concern. The idea is that business_logic
is fairly fixed and invariant, and if you wanted to go from local development
to running on a dask cluster, you'd swap the data_loaders
module for some other module that loads things the way you want them loaded.
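
An illustrative sketch of that split, with hypothetical function names rather than the actual hello-world code:

```python
# data_loaders.py -- swap this module out to change where/how data is loaded,
# e.g. pandas locally vs. dask.dataframe on a cluster. (Hypothetical example.)
import pandas as pd


def spend(marketing_csv_path: str) -> pd.Series:
    """Loads the spend column from a local CSV."""
    return pd.read_csv(marketing_csv_path)["spend"]


def signups(marketing_csv_path: str) -> pd.Series:
    """Loads the signups column from a local CSV."""
    return pd.read_csv(marketing_csv_path)["signups"]


# business_logic.py -- stays the same whether you run locally or on a cluster,
# because it only depends on the (pandas-like) inputs it declares.
def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
    """Marketing spend divided by signups."""
    return spend / signups
```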
This code is an MVP to get things running on Ray!

One reason the base.ResultMixin functions are static
functions on an object is so that Ray can serialize them.

Otherwise I believe this is an easy way to run hamilton
on a multicore system locally -- as well as connect
to a remote cluster.

The hello world is practically identical to the dask one,
apart from setting up the GraphAdapter and ray config.
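
A sketch of how that wiring might look. The `RayGraphAdapter` name follows the `h_ray` module this PR adds, but treat the exact constructor signature, config keys, and output names as assumptions:

```python
import ray

from hamilton import base, driver
from hamilton.experimental import h_ray

import business_logic
import data_loaders  # the hello-world-style modules described above (hypothetical content)

ray.init()  # local multicore by default; pass an address to use a remote cluster

# Delegate result building to a ResultMixin; a static method serializes cleanly for Ray.
adapter = h_ray.RayGraphAdapter(result_builder=base.PandasDataFrameResult())
dr = driver.Driver({"marketing_csv_path": "marketing.csv"},
                   data_loaders, business_logic, adapter=adapter)
df = dr.execute(["spend_per_signup"])  # each node can run as a Ray task

ray.shutdown()
```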
This code works on Spark 3.2, where Koalas officially became part of Spark.

This code seems to work as intended. We will likely want to always set
`set_option('compute.ops_on_diff_frames', True)`, as people will
likely want Hamilton to handle doing joins, versus having to do that
upfront before passing data into Hamilton.

Note: Spark and Dask don't implement 100% of the pandas API, but they
cover enough of it that simple aggregations and the most commonly used functions
do work.
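
For reference, the option mentioned above would be set like this with the pandas-on-Spark API that Koalas became in Spark 3.2 (with standalone Koalas, the equivalent is `databricks.koalas.set_option`):

```python
import pyspark.pandas as ps

# Allow operations across differently-indexed frames/series, so Hamilton can
# combine outputs (effectively doing the joins) instead of requiring the user
# to join everything upfront before handing data over.
ps.set_option("compute.ops_on_diff_frames", True)
```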
So that people can install the right dependencies as necessary to
use these.
The CircleCI config was off. This fixes that.

We also add tests for Ray and Dask, and
tell CircleCI where to run the unit tests -- because
otherwise it was running all tests, and since
dependencies were off, things were failing.
@skrawcz skrawcz changed the title Brings distributed and freedom from pandas Brings distributed execution and freedom from pandas Feb 3, 2022
@skrawcz skrawcz changed the title Brings distributed execution and freedom from pandas Brings distributed execution and optional freedom from pandas Feb 3, 2022
Stefan Krawczyk and others added 16 commits February 2, 2022 22:23
So that people have some orientation as to what is going on.
It can return any type, because it delegates to
the passed-in ResultMixin object.
So that people know how to run it.
So that people know what's required.
So that people know what to install if they want to run the example.
This allows you to easily insert a ResultMixin of your
choice if you just want regular python execution.

TODO: follow up with an example.
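
In the meantime, a rough sketch of what this enables. The `SimplePythonGraphAdapter` adapter name is an assumption here (check `hamilton.base` for the actual name); the ResultMixin implementations are the ones described earlier in this PR:

```python
from hamilton import base, driver

import business_logic
import data_loaders  # hypothetical modules, as in the hello world above

# Plug in a ResultMixin of your choice for plain-python execution; DictResult
# returns the computed outputs as a dict instead of a pandas DataFrame.
adapter = base.SimplePythonGraphAdapter(base.DictResult())
dr = driver.Driver({"marketing_csv_path": "marketing.csv"},
                   data_loaders, business_logic, adapter=adapter)
result = dr.execute(["spend_per_signup"])  # -> {"spend_per_signup": <pd.Series>}
```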
So that it's clear that it only expects numpy arrays.
It adds information on scaling characteristics, and pointers to
documentation where it makes sense.

I guess over time we'll add more to these classes as we get
feedback about what works, and what doesn't.
Using a boolean was limiting; it did not allow people
to control the visualization. Instead, I think a dictionary
of kwargs is the way to signal the visualization.

I tested that things work with an empty dictionary. Otherwise
the user is pointed to the docs for it.
Since by default the visualization libs are not installed,
why spam the logs? Require the user to flip it to True
and then deal with installing those extra dependencies.
This is to:

1. Force us to have things that work across platforms easily.
2. Showcase that people can not only scale code easily, but port it too!
We want to install "complete" by default. This fixes that.
We do not have extensive testing -- but at least locally
I have had no issues running python 3.8 and 3.9.
This mirrors the h_dask and h_ray tests.

Adds this to run in CircleCI.

Note regarding CircleCI: since we need Java, we
have to change the image. The image has Python
but not pip, so I need to do some apt-get work
and use aptitude to install packages so that Python
is in a working state on the image. It does complain
about bdist not installing with pyspark; however,
that doesn't seem to break actually running the code.
@elijahbenizzy
Collaborator

elijahbenizzy commented Feb 6, 2022

Can we reorganize tests? graph_adapter_tests probably shouldn't be in its own directory.

I propose the following:

  1. We move it all to an experimental directory under tests
  2. They share resources
  3. Pytest can be run on the subdirectories.

Symlink is clever, but I'd rather they just refer to the same thing?

Happy to push this off till later, but I'd really prefer if tests are all together...

@elijahbenizzy elijahbenizzy mentioned this pull request Feb 6, 2022
@elijahbenizzy elijahbenizzy marked this pull request as ready for review February 6, 2022 04:03
Collaborator

@elijahbenizzy elijahbenizzy left a comment


As discussed, shipping for the 1.3.0 release

@elijahbenizzy elijahbenizzy merged commit ae2ae12 into main Feb 6, 2022
@elijahbenizzy elijahbenizzy deleted the distributed_prototype branch February 6, 2022 20:54
gitbook-com bot pushed a commit that referenced this pull request Feb 9, 2022