
Brings distributed execution and optional freedom from pandas #47

Merged
merged 28 commits into main from distributed_prototype
Feb 6, 2022

Conversation

skrawcz
Collaborator

@skrawcz skrawcz commented Feb 3, 2022

[Short description explaining the high-level reason for the pull request]

Additions

Removals

Changes

Testing

Screenshots

If applicable

Notes

Todos

Checklist

  • PR has an informative and human-readable title
  • Changes are limited to a single goal (no scope creep)
  • Code can be automatically merged (no conflicts)
  • Code follows the standards laid out in the TODO link to standards
  • Passes all existing automated tests
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future todos are captured in comments
  • Project documentation has been updated (including the "Unreleased" section of the CHANGELOG)
  • Reviewers requested with the Reviewers tool ➡️

Testing checklist

Python

  • python 3.6
  • python 3.7

chmp and others added 12 commits January 8, 2022 15:53
Functions can create or take in scalars. Adding this to the
example so that people can see that more explicitly.
Hamilton was hardcoded to return a pandas dataframe. This change
introduces a module `base` that houses:

1. an interface object `ResultMixin` with a static function called "build_result".
2. an interface object `HamiltonGraphAdapter` that houses the required interface to augment execution more broadly.

`ResultMixin`: The idea is that we'll use this class as the interface to describe the shape required to return different results.
I then went ahead and implemented DictResult, PandasDataFrameResult, and NumpyMatrixResult.

`HamiltonGraphAdapter`: paves the way for distributed computation & specifying different return types, since it allows us to augment and wrap Hamilton
functions as we're walking the graph. All GraphAdapters will therefore inherit from and override the functions in `HamiltonGraphAdapter`.
I decided to have it extend ResultMixin to simplify things: in some cases the HamiltonGraphAdapter will have its own need
to stitch together a result, in other cases it will just delegate to a mixin.

Why those two "type" functions in `HamiltonGraphAdapter`? We do a static, compile-time check for type equivalence when building a graph, and
another when we're inspecting inputs. We need to delegate these checks to the adapter because some type mismatches
might be expected (e.g. dask can mimic pandas). I don't necessarily like wiring the
adapter all the way through, but it works; I think we'll be able to change things
easily since basically all the code in graph.py should be viewed as internal code.

Otherwise all these changes should be backwards compatible -- passing in a HamiltonGraphAdapter is optional. The
`SimplePythonDataFrameHamiltonGraphAdapter`, which mirrors the current Hamilton behavior, is the default if none
is provided.

Comment on FunctionGraph: I expect to further augment FunctionGraph to tease out certain parts of it to enable
a broader array of use cases. Otherwise I think we're pretty set on "black-box" reuse, i.e. getting reuse of the
FunctionGraph in different contexts by passing in an object with a set interface, versus using inheritance.
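
For orientation, here is a minimal sketch of the interfaces described above. It only assumes what this description states (a static `build_result`, plus Dict/pandas/numpy implementations); the real classes in `base` handle more edge cases, so treat this as illustrative rather than the actual implementation.

```python
import abc
import typing

import numpy as np
import pandas as pd


class ResultMixin(object):
    """Describes the shape required to build a particular kind of result."""

    @staticmethod
    @abc.abstractmethod
    def build_result(**outputs: typing.Any) -> typing.Any:
        """Static so adapters can use (and serialize) it without instance state."""
        pass


class DictResult(ResultMixin):
    @staticmethod
    def build_result(**outputs: typing.Any) -> typing.Dict[str, typing.Any]:
        # Just hand back the computed outputs keyed by function name.
        return outputs


class PandasDataFrameResult(ResultMixin):
    @staticmethod
    def build_result(**outputs: typing.Any) -> pd.DataFrame:
        # Stitch the named outputs together into a dataframe.
        return pd.DataFrame(outputs)


class NumpyMatrixResult(ResultMixin):
    @staticmethod
    def build_result(**outputs: typing.Any) -> np.matrix:
        # Simplified: assumes all outputs are equal-length 1-D arrays.
        return np.matrix(list(outputs.values())).T
```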
This moves things to an experimental package -- to make it clear
that these are early bits of code.

It then implements the HamiltonGraphAdapter interface, adds
tests, and a hello world.

The salient thing to note about the hello world is that we
introduce a data_loaders module and a business_logic module. These
separate the Hamilton logic functions by concern. The idea is that business_logic
is fairly fixed and invariant, and if you wanted to go from local development
to running on a dask cluster, you'd swap the data_loaders
module for some other module that loads things the way you want them loaded.
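
An illustrative sketch of that split, with hypothetical function names rather than the actual hello-world code:

```python
# data_loaders.py -- swap this module out to change where/how data is loaded,
# e.g. pandas locally vs. dask.dataframe on a cluster. (Hypothetical example.)
import pandas as pd


def spend(marketing_csv_path: str) -> pd.Series:
    """Loads the spend column from a local CSV."""
    return pd.read_csv(marketing_csv_path)["spend"]


def signups(marketing_csv_path: str) -> pd.Series:
    """Loads the signups column from a local CSV."""
    return pd.read_csv(marketing_csv_path)["signups"]


# business_logic.py -- stays the same whether you run locally or on a cluster,
# because it only depends on the (pandas-like) inputs it declares.
def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
    """Marketing spend divided by signups."""
    return spend / signups
```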
This code is an MVP to get things running on Ray!

One reason the base.ResultMixin functions are static
functions on an object is so that Ray can serialize them.

Otherwise I believe this is an easy way to run hamilton
on a multicore system locally -- as well as connect
to a remote cluster.

The hello world is practically identical to the dask one,
apart from setting up the GraphAdapter and ray config.
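
A sketch of how that wiring might look. The `RayGraphAdapter` name follows the `h_ray` module this PR adds, but treat the exact constructor signature, config keys, and output names as assumptions:

```python
import ray

from hamilton import base, driver
from hamilton.experimental import h_ray

import business_logic
import data_loaders  # the hello-world-style modules described above (hypothetical content)

ray.init()  # local multicore by default; pass an address to use a remote cluster

# Delegate result building to a ResultMixin; a static method serializes cleanly for Ray.
adapter = h_ray.RayGraphAdapter(result_builder=base.PandasDataFrameResult())
dr = driver.Driver({"marketing_csv_path": "marketing.csv"},
                   data_loaders, business_logic, adapter=adapter)
df = dr.execute(["spend_per_signup"])  # each node can run as a Ray task

ray.shutdown()
```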
This code works on Spark 3.2, where Koalas officially became part of Spark.

This code seems to work as intended. We will likely want to always set
`set_option('compute.ops_on_diff_frames', True)`, as people will
likely want Hamilton to handle doing joins, versus having to do that
upfront before passing data into Hamilton.

Note: Spark and Dask don't implement 100% of the pandas API, but they
cover enough of it that simple aggregations and the most commonly used functions
do work.
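
For reference, the option mentioned above would be set like this with the pandas-on-Spark API that Koalas became in Spark 3.2 (with standalone Koalas, the equivalent is `databricks.koalas.set_option`):

```python
import pyspark.pandas as ps

# Allow operations across differently-indexed frames/series, so Hamilton can
# combine outputs (effectively doing the joins) instead of requiring the user
# to join everything upfront before handing data over.
ps.set_option("compute.ops_on_diff_frames", True)
```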
So that people can install the right dependencies as necessary to
use these.
The CircleCI config was off. This fixes that.

We also add tests for Ray and Dask, and
tell CircleCI where to run the unit tests -- because
otherwise it was running all tests, and since
dependencies were off, things were failing.
@skrawcz skrawcz changed the title Brings distributed and freedom from pandas Brings distributed execution and freedom from pandas Feb 3, 2022
@skrawcz skrawcz changed the title Brings distributed execution and freedom from pandas Brings distributed execution and optional freedom from pandas Feb 3, 2022
Stefan Krawczyk and others added 16 commits February 2, 2022 22:23
So that people have some orientation as to what is going on.
It can return any type, because it delegates to
the passed-in ResultMixin object.
So that people know how to run it.
So that people know what's required.
So that people know what to install if they want to run the example.
This allows you to easily insert a ResultMixin of your
choice if you just want regular python execution.

TODO: follow up with an example.
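
In the meantime, a rough sketch of what this enables. The `SimplePythonGraphAdapter` adapter name is an assumption here (check `hamilton.base` for the actual name); the ResultMixin implementations are the ones described earlier in this PR:

```python
from hamilton import base, driver

import business_logic
import data_loaders  # hypothetical modules, as in the hello world above

# Plug in a ResultMixin of your choice for plain-python execution; DictResult
# returns the computed outputs as a dict instead of a pandas DataFrame.
adapter = base.SimplePythonGraphAdapter(base.DictResult())
dr = driver.Driver({"marketing_csv_path": "marketing.csv"},
                   data_loaders, business_logic, adapter=adapter)
result = dr.execute(["spend_per_signup"])  # -> {"spend_per_signup": <pd.Series>}
```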
So that it's clear that it only expects numpy arrays.
It adds information on scaling characteristics, and pointers to
documentation where it makes sense.

I guess over time we'll add more to these classes as we get
feedback about what works, and what doesn't.
Using a boolean was limiting; it did not allow people
to control the visualization. Instead, I think a dictionary
of kwargs is the way to signal the visualization.

I tested that things work with an empty dictionary. Otherwise
the user is pointed to the docs for it.
Since by default the visualization libs are not installed,
why spam the logs? Require the user to flip it to True
and then deal with installing those extra dependencies.
This is to:

1. Force us to have things that work across platforms easily.
2. Showcase that people can not only scale code easily, but port it too!
We want to install "complete" by default. This fixes that.
We do not have extensive testing -- but at least locally
I have had no issues running python 3.8 and 3.9.
This mirrors the h_dask and h_ray tests.

Adds this to run in CircleCI.

Note regarding CircleCI: since we need Java, we
have to change the image. The image has Python
but not pip, so I need to do some apt-get work
and use aptitude to install packages so that Python
is in a working state on the image. It does complain
about bdist not installing with pyspark; however,
that doesn't seem to break actually running the code.
@elijahbenizzy
Collaborator

elijahbenizzy commented Feb 6, 2022

Can we reorganize tests? graph_adapter_tests probably shouldn't be in its own directory.

I propose the following:

  1. We move it all to an experimental directory under tests
  2. They share resources
  3. Pytest can be run on the subdirectories.

Symlink is clever, but I'd rather they just refer to the same thing?

Happy to push this off till later, but I'd really prefer if tests are all together...

@elijahbenizzy elijahbenizzy mentioned this pull request Feb 6, 2022
@elijahbenizzy elijahbenizzy marked this pull request as ready for review February 6, 2022 04:03
Collaborator

@elijahbenizzy elijahbenizzy left a comment


As discussed, shipping for the 1.3.0 release

@elijahbenizzy elijahbenizzy merged commit ae2ae12 into main Feb 6, 2022
@elijahbenizzy elijahbenizzy deleted the distributed_prototype branch February 6, 2022 20:54
gitbook-com bot pushed a commit that referenced this pull request Feb 9, 2022