This repository has been archived by the owner on Jul 3, 2023. It is now read-only.

Adds/edits/updates some documentation
skrawcz committed May 13, 2021
1 parent 8dfbac3 commit aa89ddd
Showing 11 changed files with 504 additions and 295 deletions.
8 changes: 5 additions & 3 deletions CONTRIBUTING.md
@@ -1,6 +1,6 @@
# Guidance on how to contribute

> All contributions to this project will be released under the Affero General Public License v3 (AGPLv3).
> By submitting a pull request or filing a bug, issue, or
> feature request, you are agreeing to comply with this waiver of copyright interest.
> Details can be found in our [CLA](CLA.md) and [LICENSE](LICENSE).
@@ -29,5 +29,7 @@ Generally speaking, you should fork this repository, make changes in your
own fork, and then submit a pull request. All new code should have associated
unit tests that validate implemented features and the presence or lack of defects.
Additionally, the code should follow any stylistic and architectural guidelines
prescribed by the project. For us here, this means you install a pre-commit hook and use
the given style files. Basically, you should mimic the styles and patterns in the Hamilton code-base.

In terms of getting setup to develop, we invite you to read our [developer setup guide](developer_setup.md).
381 changes: 92 additions & 289 deletions README.md

Large diffs are not rendered by default.

138 changes: 138 additions & 0 deletions basics.md
@@ -0,0 +1,138 @@
# Hamilton Basics

There are two parts to Hamilton:

1. Hamilton Functions.

Hamilton Functions are what you, the end user, write.

2. Hamilton Driver.

Once you've written your functions, you will need to use the Hamilton Driver to build the DAG and orchestrate
execution.

Let's dive deeper into these parts below, but first a word on terminology.

We use the following terms interchangeably; each of them refers to the same concept in Hamilton:

* column
* variable
* node
* function

That's because we represent columns as functions, which become parts of a directed acyclic graph. That is:
a column is part of a dataframe. To compute a column, we write a function with input variables. From these functions
we create a DAG, representing each function as a node and linking each input variable by an edge to the node that produces it.
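To make this concrete, here is a minimal sketch (not Hamilton's actual code) of how function signatures alone define the nodes and edges of such a DAG; the functions `b` and `c` below are made up for illustration:

```python
import inspect

# Hypothetical Hamilton-style functions; each parameter name refers to
# another node (or a user-provided input) of the same name.
def b(a: int) -> int:
    return a + 1

def c(a: int, b: int) -> int:
    return a * b

# Each function becomes a node; each parameter becomes an incoming edge.
edges = {fn.__name__: list(inspect.signature(fn).parameters) for fn in (b, c)}
# edges == {'b': ['a'], 'c': ['a', 'b']}
```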

## Hamilton Functions
Using Hamilton is all about writing functions. From these functions a dataframe is constructed for you at execution time.

A simple (but rather contrived) example of Hamilton code that adds two numbers is as follows:

```python
def _sum(*vars):
    """Helper function to sum numbers.
    This is here to demonstrate that functions starting with _ do not get processed by hamilton.
    """
    return sum(vars)


def sum_a_b(a: int, b: int) -> int:
    """Adds a and b together
    :param a: The first number to add
    :param b: The second number to add
    :return: The sum of a and b
    """
    return _sum(a, b)  # Delegates to a helper function
```

While this looks like a simple python function, there are a few components to note:
1. The function name `sum_a_b` is a globally unique key. In the DAG there can only be one function named `sum_a_b`.
While this is not optimal for functionality reuse, it makes it extremely easy to learn exactly how a node in the DAG is generated,
and separate out that logic for debugging/iterating.
2. The function `sum_a_b` depends on two upstream nodes -- `a` and `b`. This means that these values must either be:
* Defined by another function
* Passed in by the user as a configuration variable (see `Hamilton Driver Code` below)
3. The function `sum_a_b` makes full use of the python type-hint system. This is required in Hamilton,
as it allows us to type-check the inputs and outputs to match with upstream producers and downstream consumers. In this case,
we know that the input `a` has to be an integer, the input `b` has to also be an integer, and anything that declares `sum_a_b` as an input
has to declare it as an integer.
4. Standard python documentation is a first-class citizen. As we have a 1:1 relationship between python functions and
nodes, each function documentation also describes a piece of business logic.
5. Functions that start with _ are ignored, and not included in the DAG. Hamilton tries to make use of every function
in a module, so this allows us to easily indicate helper functions that won't become part of the DAG.
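The type hints Hamilton reads are ordinary Python annotations, which any framework can introspect via the standard library. A quick illustration (this is standard `typing` behavior, not Hamilton's internal code):

```python
import typing

def sum_a_b(a: int, b: int) -> int:
    return a + b

# The annotations are available at runtime, which is what makes
# type-checking the DAG's edges possible.
hints = typing.get_type_hints(sum_a_b)
# hints == {'a': int, 'b': int, 'return': int}
```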


### Python Types & Hamilton

Hamilton makes use of python's type-hinting feature to check compatibility between function outputs and function inputs. However,
this is not particularly sophisticated, largely due to the lack of available tooling in python. Thus, generic types do not function correctly.
The following will not work:

```python
from typing import Dict

def some_func() -> Dict[str, int]:
    return {1: 2}
```

The following will both work:
```python
from typing import Dict

def some_func() -> Dict:
    return {1: 2}
```

```python
def some_func() -> dict:
    return {1: 2}
```

While this is unfortunate, the typing API in python is not yet sophisticated enough to rely on accurate subclass validation.
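For the curious, a short demonstration of why subscripted generics are awkward to validate (again, standard `typing`-module behavior, nothing Hamilton-specific):

```python
from typing import Dict

# A subscripted generic is a distinct object from the plain type...
assert Dict[str, int] != dict

# ...and isinstance() refuses subscripted generics outright,
# so runtime validation of Dict[str, int] is not straightforward.
try:
    isinstance({}, Dict[str, int])
except TypeError:
    print("isinstance() cannot check subscripted generics")
```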

## Hamilton Driver Code
For documentation on the actual Hamilton Driver code, we invite the reader to [read the Driver class source code](/hamilton/driver.py) directly.

At a high level, the driver code does two things:

1. Create a Directed Acyclic Graph (DAG) from functions you define.
```python
from hamilton import driver
dr = driver.Driver(config, *modules_to_load) # this creates the DAG from the modules you pass in.
```
2. It orchestrates execution given expected output and provided input.
```python
df = dr.execute(final_vars, overrides, display_graph) # this executes the DAG appropriately to create the dataframe.
```

The driver object also has a few other methods, e.g. `display_all_functions()`, `list_available_variables()`, but they're
really only used for debugging purposes.

Let's dive into the driver constructor call, and the execute method.

### Constructor Call to Driver()
The constructor call is pretty simple. Each constructor call sets up a DAG for execution given some configuration.
So if you want to change something about the DAG, very likely you'll need to create a new Driver() object.

#### config: Dict[str, Any], i.e. Configuration
The configuration is used not just to feed data to the DAG, but also to determine the structure of the DAG.
As such, it is passed in to the constructor and used during DAG creation. This enables decorators like `@config.when`.

Otherwise the contents of the _config_ dictionary should include all the inputs required for whatever final output you
want to create. The configuration dictionary should not be used for overriding what Hamilton will compute.
To do this, use the `override` parameter as part of the `execute()` -- see below.

#### *modules: ModuleType
This can be any number of modules. We traverse the modules in the order they are provided.

### Driver.execute()
The execute function determines the DAG walk required to get the requisite final variables (aka columns) that you want
in the dataframe. It also ensures that you have provided everything to execute properly.

Once it executes it uses a dictionary to memoize results, so that everything is only computed once. It executes the DAG
via a recursive depth-first-traversal, which leads to the possibility (although highly unlikely) of hitting python
recursion depth errors. If that happens, the culprit is almost always a circular reference in the graph. We suggest
displaying the DAG to verify this.
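As an illustration of that execution strategy, here is a minimal sketch of memoized depth-first traversal. This is illustrative only, an assumption about the shape of the algorithm rather than the actual Driver implementation:

```python
import inspect

# Sketch (not Hamilton's code) of memoized depth-first execution.
def execute_node(name, functions, inputs, memo):
    """Compute node `name`, recursively computing its dependencies first."""
    if name in memo:
        return memo[name]          # already computed -- memoization
    if name in inputs:
        memo[name] = inputs[name]  # user-provided value
        return memo[name]
    fn = functions[name]
    kwargs = {dep: execute_node(dep, functions, inputs, memo)
              for dep in inspect.signature(fn).parameters}
    memo[name] = fn(**kwargs)
    return memo[name]

def sum_a_b(a: int, b: int) -> int:
    return a + b

memo = {}
result = execute_node('sum_a_b', {'sum_a_b': sum_a_b}, {'a': 1, 'b': 2}, memo)
# result == 3; memo now caches 'a', 'b', and 'sum_a_b'
```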

To help speed up development of new or existing Hamilton Functions, we enable you to _override_ parts of the DAG. What
this means is that before calling `execute()`, you have computed some result that you want to use instead of what Hamilton
would produce. To do so, you just pass in a dictionary of `{'col_name': YOUR_VALUE}` as the overrides argument to the
execute function.

To visualize the DAG that would be executed, pass the flag `display_graph=True` to execute. It will render an image in PDF format.
173 changes: 173 additions & 0 deletions decorators.md
@@ -0,0 +1,173 @@
# Decorators

While the 1:1 mapping of column -> function implementation is powerful, we've implemented a few decorators to promote
business-logic reuse. The decorators we've defined are as follows
(source can be found in [function_modifiers](hamilton/function_modifiers.py)):

## @parametrized
Expands a single function into n, each of which corresponds to a function in which the parameter value is replaced by
that specific value.
```python
import pandas as pd
from hamilton.function_modifiers import parametrized
import internal_package_with_logic

ONE_OFF_DATES = {
    # (output name, doc string): input value to function
    ('D_ELECTION_2016', 'US Election 2016 Dummy'): '2016-11-12',
    ('SOME_OUTPUT_NAME', 'Doc string for this thing'): 'value to pass to function',
}

# `parameter` matches the name of the argument in the function below
@parametrized(parameter='one_off_date', assigned_output=ONE_OFF_DATES)
def create_one_off_dates(date_index: pd.Series, one_off_date: str) -> pd.Series:
    """Given a date index, produces a series where a 1 is placed at the date index that would contain that event."""
    one_off_dates = internal_package_with_logic.get_business_week(one_off_date)
    return internal_package_with_logic.bool_to_int(date_index.isin([one_off_dates]))
```
We see here that `parametrized` allows you to keep your code DRY by reusing the same function to create multiple
distinct outputs. The _parameter_ keyword argument has to match one of the arguments in the function; the rest of
the arguments are pulled from outside the DAG. The _assigned_output_ keyword argument takes in a dictionary of
tuple(output name, documentation string) -> value.
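Conceptually, the expansion is similar to pre-binding the parameter for each dictionary entry. The sketch below uses a made-up `render_date` function and is not the decorator's real implementation:

```python
import functools

def render_date(date_index, one_off_date):
    """Stand-in for create_one_off_dates above, minus the internal logic."""
    return f"{one_off_date} applied over {date_index}"

ONE_OFF_DATES = {
    ('D_ELECTION_2016', 'US Election 2016 Dummy'): '2016-11-12',
}

# One new "node" per dictionary entry, each with one_off_date pre-bound.
expanded = {
    output_name: functools.partial(render_date, one_off_date=value)
    for (output_name, doc), value in ONE_OFF_DATES.items()
}
# expanded['D_ELECTION_2016'](date_index='idx') == '2016-11-12 applied over idx'
```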

## @extract_columns
This works on a function that outputs a dataframe whose columns we want to extract and make individually
available for consumption. It expands a single function into _n functions_, each of which takes in the output dataframe
and outputs a specific column as named in the `extract_columns` decorator.
```python
import pandas as pd
from hamilton.function_modifiers import extract_columns

@extract_columns('fiscal_date', 'fiscal_week_name', 'fiscal_month', 'fiscal_quarter', 'fiscal_year')
def fiscal_columns(date_index: pd.Series, fiscal_dates: pd.DataFrame) -> pd.DataFrame:
    """Extracts the fiscal column data.
    We want to ensure that it has the same spine as date_index.
    :param fiscal_dates: the input dataframe to extract.
    :return:
    """
    df = pd.DataFrame({'date_index': date_index}, index=date_index.index)
    merged = df.join(fiscal_dates, how='inner')
    return merged
```
Note: if you have a list of columns to extract, then when you call `@extract_columns` you should call it with an
asterisk like this:
```python
import pandas as pd
from hamilton.function_modifiers import extract_columns

@extract_columns(*my_list_of_column_names)
def my_func(...) -> pd.DataFrame:
    """..."""
```

## @does
`@does` is a decorator that lets a stub function delegate its execution to another function that runs over all the input
parameters. You can't pass just any function to `@does`: it has to accept any number of inputs and process them all in the same way.
```python
import pandas as pd
from hamilton.function_modifiers import does
import internal_package_with_logic

def sum_series(**series: pd.Series) -> pd.Series:
    ...

@does(sum_series)
def D_XMAS_GC_WEIGHTED_BY_DAY(D_XMAS_GC_WEIGHTED_BY_DAY_1: pd.Series,
                              D_XMAS_GC_WEIGHTED_BY_DAY_2: pd.Series) -> pd.Series:
    """Adds D_XMAS_GC_WEIGHTED_BY_DAY_1 and D_XMAS_GC_WEIGHTED_BY_DAY_2"""
    pass

@does(internal_package_with_logic.identity_function)
def copy_of_x(x: pd.Series) -> pd.Series:
    """Just returns x"""
    pass
```
The example here is a function whose only job is to sum all the parameters together, so we annotate the stub with
the `@does` decorator and pass it the `sum_series` function.
The `@does` decorator is currently limited to functions that take a single generic `**kwargs` argument.
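For completeness, one plausible body for the `sum_series` stub above (an assumption on our part, since the example elides it), relying on pandas index alignment:

```python
import pandas as pd

# Hypothetical implementation of the elided sum_series stub.
def sum_series(**series: pd.Series) -> pd.Series:
    """Sums all series passed in; pandas aligns them on their indexes."""
    return sum(series.values())

combined = sum_series(day_1=pd.Series([1, 2]), day_2=pd.Series([3, 4]))
# combined.tolist() == [4, 6]
```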

## @model
`@model` allows you to abstract a function that is a model. You will need to implement models that make sense for
your business case. Reach out if you need examples.

Under the hood, they're just DAG nodes whose inputs are determined by a configuration parameter. A model takes in
two required parameters:
1. The class it uses to run the model. If you're external to Stitch Fix, you will need to write your own; internally,
see the internal docs for this. Basically, the class defined determines what the function actually does.
2. The configuration key that determines how the model functions. This is just the name of a configuration parameter
that stores the way the model is run.

The following is an example usage of `@model`:

```python
import pandas as pd
from hamilton.function_modifiers import model
import internal_package_with_logic

@model(internal_package_with_logic.GLM, 'model_p_cancel_manual_res')
# This runs a GLM (Generalized Linear Model)
# The associated configuration parameter is 'model_p_cancel_manual_res',
# which points to the results of loading the model_p_cancel_manual_res table
def prob_cancel_manual_res() -> pd.Series:
    pass
```

`GLM` here is not part of the Hamilton framework; it is instead a user-defined model class.

Models optionally accept an `output_column` parameter -- this is specifically for when the name of the function differs
from the output column that it writes to, e.g. if you use the model result as an intermediate object and manipulate
it later. This is necessary because various dependent columns that a model queries
(e.g. `MULTIPLIER_...` and `OFFSET_...`) are derived from the model's name.

## @config.when*

`@config.when` allows you to specify different implementations depending on configuration parameters.

The following use cases are supported:
1. A column is present for only one value of a config parameter -- in this case, we define a function only once,
with a `@config.when`
```python
import pandas as pd
from hamilton.function_modifiers import config

# signups_parent_before_launch is only present in the kids business line
@config.when(business_line='kids')
def signups_parent_before_launch(signups_from_existing_womens_tf: pd.Series) -> pd.Series:
    """TODO:
    :param signups_from_existing_womens_tf:
    :return:
    """
    return signups_from_existing_womens_tf
```
2. A column is implemented differently for different business inputs, e.g. in the case of Stitch Fix gender intent.
```python
import pandas as pd
from hamilton.function_modifiers import config, model
import internal_package_with_logic

# Some 21 day autoship cadence does not exist for kids, so we just return 0s
@config.when(gender_intent='kids')
def percent_clients_something__kids(date_index: pd.Series) -> pd.Series:
    return pd.Series(index=date_index.index, data=0.0)

# In other business lines, we have a model for it
@config.when_not(gender_intent='kids')
@model(internal_package_with_logic.GLM, 'some_model_name', output_column='percent_clients_something')
def percent_clients_something_model() -> pd.Series:
    pass
```
Note the following:
- The function cannot have the same name in the same file (or Python gets unhappy), so we name it with a
__ (double underscore) suffix. The double-underscore suffix is removed before the function goes into the DAG.
- There is currently no `@config.otherwise(...)` decorator, so make sure your `@config.when` decorators cover the full set of
configuration possibilities.
Any missing cases will not have that output column (and subsequent downstream nodes may error out if they ask for it).
To make this easier, we have a few more `@config` decorators:

- `@config.when_not(param=value)` Will be included if the parameter is _not_ equal to the value specified.
- `@config.when_in(param=[value1, value2, ...])` Will be included if the parameter is equal to one of the specified
values.
- `@config.when_not_in(param=[value1, value2, ...])` Will be included if the parameter is not equal to any of the
specified values.
- `@config` If you're feeling adventurous, you can pass in a lambda function that takes in the entire configuration
and resolves to `True` or `False`. You probably don't want to do this.
43 changes: 43 additions & 0 deletions developer_setup.md
@@ -0,0 +1,43 @@
# Developer/Contributor Setup

## Repo organization

This repository is organized as follows:

1. hamilton/ is code to orchestrate and execute the graph.
2. tests/ is the place where unit tests (or light integration tests) are located.

## How to contribute

1. Checkout the repo. If external to Stitch Fix, fork the repo.
2. Create a virtual environment for it. See python algo curriculum slides for details.
3. Activate the virtual environment and install all dependencies. There is one requirements file for the package, one for making comparisons, and one for running unit tests; `pip install -r requirements*.txt` should install all three for you.
4. Make pycharm depend on that virtual environment & install required dependencies (it should prompt you because it'll read the requirements.txt file).
5. `brew install pre-commit` if you haven't.
6. Run `pre-commit install` from the root of the repository.
7. Create a branch off of the latest master branch. `git checkout -b my_branch`.
8. Do your work & commit it.
9. Push to github and create a PR.
10. When you push to github, circle ci will kick off unit tests and migration tests (for Stitch Fix users only).


## How to run unit tests

You need to have installed the `requirements-test.txt` dependencies into the environment you're running in for this to work. You can run tests two ways:

1. Through pycharm/command line.
2. Using circle ci locally. The config for this lives in `.circleci/config.yml` which also shows commands to run tests
from the command line.

### Using pycharm to execute & debug unit tests

You can debug and execute unit tests in pycharm easily. To set it up, you just hit `Edit configurations` and then
add New > Python Tests > pytest. You then want to specify the `tests/` folder under `Script path`, and ensure the
python environment executing it is the appropriate one with all the dependencies installed. If you add `-v` to the
additional arguments part, you'll then get verbose diffs if any tests fail.

### Using circle ci locally

You need to install the circleci command line tooling for this to work. See the unit testing algo curriculum slides for details.
Once you have installed it, you just need to run `circleci local execute` from the root directory, and it'll run the entire suite of tests
that are set up to run each time you push a commit to a branch in github.
11 changes: 11 additions & 0 deletions examples/hello_world/my_functions.py
@@ -0,0 +1,11 @@
import pandas as pd


def avg_3wk_spend(spend: pd.Series) -> pd.Series:
"""Rolling 3 week average spend."""
return spend.rolling(3).mean()


def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
"""The cost per signup in relation to spend."""
return spend / signups
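Outside of Hamilton, these are just ordinary pandas functions, so you can sanity-check their logic directly (the input values below are made up):

```python
import pandas as pd

spend = pd.Series([10.0, 10.0, 20.0, 40.0])
signups = pd.Series([1.0, 10.0, 50.0, 100.0])

# Same logic as avg_3wk_spend and spend_per_signup above.
rolling_avg = spend.rolling(3).mean()   # first two entries are NaN
cost_per_signup = spend / signups
# cost_per_signup.tolist() == [10.0, 1.0, 0.4, 0.4]
```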
