This repository has been archived by the owner on Jul 3, 2023. It is now read-only.

Commit

GitBook: [#65] No subject
skrawcz authored and gitbook-bot committed Feb 21, 2022
1 parent 73297ec commit 970bb74
Showing 5 changed files with 54 additions and 35 deletions.
2 changes: 1 addition & 1 deletion SUMMARY.md
Original file line number Diff line number Diff line change
@@ -23,8 +23,8 @@
* [Function Naming](best-practices/function-naming.md)
* [Code Organization](best-practices/code-organization.md)
* [Function Modifiers](best-practices/function-modifiers.md)
* [Loading Data](best-practices/loading-data.md)
* [Common Indices](best-practices/common-indices.md)
* [Loading Data](best-practices/loading-data.md)
* [Output Immutability](best-practices/output-immutability.md)
* [Extensions](extensions.md)
* [Talks | Podcasts | Blogs](talks-or-podcasts-or-blogs.md)
31 changes: 1 addition & 30 deletions best-practices/code-organization.md
@@ -1,35 +1,6 @@
---
description: Hamilton will force you to organize your code! Here's some tip
description: Guidebook coming! We appreciate contributions, as always...
---

# Code Organization

Hamilton forces you to put your code into modules that are distinct from where you run your code. 

You'll soon find that a single Python module does not make sense, and you'll very likely start to group like functions together, creating domain-specific modules. _Use this to your development advantage!_

At Stitch Fix we:

1. Use modules to model team thinking, e.g. date\_features.py.
2. Use modules to help isolate what you’re working on.
3. Use modules to replace parts of your Hamilton dataflow very easily for different contexts.

## Team thinking

You'll need to curate your modules. We suggest orienting this around how teams think about the business. 

E.g. marketing spend features should be in the same module, or in separate modules but in the same directory/package.

This will then make it easy for people to browse the code base and discover what is available. 

## Helps isolate what you're working on

Grouping functions into modules then helps set the tone for what you're working on. It helps set the "namespace", if you will, for that function. Thus you can have the same function name used in multiple modules, as long as only one of those modules is imported to build the DAG.

Thus modules help you create boundaries in your code base to isolate functions that you'll want to change inputs to.

## Enables you to replace parts of your DAG easily for different contexts

The names you provide as inputs to functions form a defined "interface", to borrow a computer science term. So if you want to swap, change, or augment an input, having the function that maps to it defined in one or more other modules provides a lot of flexibility. Rather than having a single module with all functions defined in it, separating the functions into different modules could be a productivity win.

Why? Because when you tell Hamilton which functions constitute your dataflow (i.e. DAG), you can simply replace, add, or change the modules being passed. So if you want to compute inputs for certain functions differently, this composability of including or excluding modules when building the DAG provides a lot of flexibility that you can exploit to make your development cycle faster.
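
As a rough illustration of swapping modules, here is a minimal sketch in plain Python. It does not use Hamilton itself: `compute` is a hypothetical stand-in for a dataflow driver, `make_module` only exists so the example fits in one file, and all module and function names are assumptions made up for this example.

```python
import inspect
import types


def make_module(name: str, **functions) -> types.ModuleType:
    """Builds an in-memory module; real code would use separate .py files."""
    module = types.ModuleType(name)
    for func_name, func in functions.items():
        setattr(module, func_name, func)
    return module


def _prod_marketing_spend() -> list:
    # Stand-in for a real load, e.g. from a data warehouse.
    return [10.0, 20.0, 30.0]


def _test_marketing_spend() -> list:
    # Small fixed data for tests or local development.
    return [1.0, 1.0]


def total_marketing_spend(marketing_spend: list) -> float:
    # The parameter name declares which upstream function feeds this one.
    return sum(marketing_spend)


# Two modules define the same function name, `marketing_spend`.
prod_loaders = make_module("prod_loaders", marketing_spend=_prod_marketing_spend)
test_loaders = make_module("test_loaders", marketing_spend=_test_marketing_spend)
spend_features = make_module(
    "spend_features", total_marketing_spend=total_marketing_spend
)


def compute(output: str, *modules: types.ModuleType):
    """Hypothetical driver: wires functions together by parameter name."""
    functions = {}
    for module in modules:
        for name, func in vars(module).items():
            if inspect.isfunction(func) and not name.startswith("_"):
                functions[name] = func

    def resolve(name: str):
        func = functions[name]
        kwargs = {p: resolve(p) for p in inspect.signature(func).parameters}
        return func(**kwargs)

    return resolve(output)


# Same feature logic, different context: only the loader module changes.
prod_total = compute("total_marketing_spend", prod_loaders, spend_features)
test_total = compute("total_marketing_spend", test_loaders, spend_features)
```

Only one of the two loader modules is passed at a time, so the shared name `marketing_spend` never collides, and `spend_features` stays untouched in both contexts.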
21 changes: 18 additions & 3 deletions best-practices/common-indices.md
@@ -1,5 +1,20 @@
# Shared Indices
---
description: If you're creating dataframes, then this will apply to you!
---

While Hamilton is a general-purpose framework, we've found a common pattern is to manipulate datasets that have shared indices (spines). Although this might not apply to every use case (e.g. more complex joins with Spark dataframes), a large number of use cases become possible if every dataframe in your pipeline shares an index. This is particularly pertinent when writing transformations over (non-event-based) time-series data.
# Common Indices

While Hamilton is a general-purpose framework, we've found a common pattern is to manipulate datasets that have shared indices (spines) for creating dataframes.

Although this might not apply to every use case (e.g. more complex joins with Spark dataframes), a large number of use cases become possible if every dataframe in your pipeline shares an index. This is particularly pertinent when writing transformations over (non-event-based) time-series data.

Hamilton currently has no means of enforcing a shared spine, so it is up to the writer of the function to validate input data as necessary. Thus we recommend the following if you are creating a dataframe as output:

### Best practice:

1. Load data via functions, defined in their own specific module.
2. Take that loaded data, and transform/ensure indexes match the output you want to create.
3. Continue with transformations.

For time-series modeling, this will mean you provide a common time-series index. Or, if you're creating features for input to a classification model, e.g. over clients, then ensure the index is client\_ids.
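
The steps above can be sketched with pandas roughly as follows. This is a minimal sketch, not a prescribed layout: the function names, the `client_id` and `order_count` columns, and the data are all assumptions for illustration, and in practice the loading functions would live in their own module.

```python
import pandas as pd


def client_index() -> pd.Index:
    # The common index (spine) every output should share.
    return pd.Index([101, 102, 103], name="client_id")


def raw_orders() -> pd.DataFrame:
    # Step 1: load data. Stand-in for a real load; client 103 is missing
    # and the rows are out of order on purpose.
    return pd.DataFrame({"client_id": [102, 101], "order_count": [5, 2]})


def order_count(raw_orders: pd.DataFrame, client_index: pd.Index) -> pd.Series:
    # Step 2: align the loaded data to the common index.
    series = raw_orders.set_index("client_id")["order_count"]
    return series.reindex(client_index, fill_value=0)


def is_repeat_customer(order_count: pd.Series) -> pd.Series:
    # Step 3: downstream transforms inherit the index with no extra work.
    return order_count > 1


spine = client_index()
counts = order_count(raw_orders(), spine)
features = pd.DataFrame(
    {"order_count": counts, "is_repeat_customer": is_repeat_customer(counts)}
)
```

Because every column is aligned to the same index before any transformation, assembling the final dataframe needs no joins.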

While hamilton currently has no means of enforcing shared-spine, it is up to the writer of the function to validate input data as necessary. 
14 changes: 13 additions & 1 deletion best-practices/function-modifiers.md
@@ -1,6 +1,18 @@
---
description: Guidebook coming! We appreciate contributions, as always...
description: The `@` above Hamilton Functions
---

# Function Modifiers

Hamilton has a bunch of function modifiers, i.e. Python decorators, to modify function behavior.

The behaviors vary based on the function modifier. Please see [available-decorators.md](../reference/api-reference/available-decorators.md "mention") for the current list of supported ones.

## Why would I use them?

Function modifiers exist either to:

1. Enable you to concisely make lots of functions that vary by inputs, e.g. to keep your code DRYer.
2. Provide functionality to make Hamilton more powerful, e.g. so you can break apart multiple outputs.

Unfortunately there isn't an easy rule to give here, other than: read the list of decorators and their functionality, and use them when you see the need arise 😀.
21 changes: 21 additions & 0 deletions best-practices/output-immutability.md
@@ -1,2 +1,23 @@
---
description: In Hamilton, functions are only called once!
---

# Output Immutability



Immutability means that once a "data structure", e.g. a column, is created and output by a function, its values cannot be changed.

When Hamilton figures out the execution call path, it walks it and calls functions only once. This means that if the output of a function is immutable, then there's only one place it was created; it's not modified anywhere else. This provides a great debugging experience if there are ever issues in your dataflow. We believe that by default, one should always strive for immutability of outputs.

However, it is up to you, the Hamilton function writer, to ensure that immutability is adhered to.

### Best practice:

1. To preserve “immutability” of outputs, don’t mutate passed in data structures.

   e.g. if you get passed in a pandas series, don’t mutate it.

   1. Test for this in your unit tests if this is something important to you!
2. Otherwise YMMV with debugging:
   1. Clearly document mutating inputs in your function documentation if you do mutate inputs provided. That will make debugging your code that much simpler!
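
One way to test for this in a unit test is sketched below, assuming pandas series as inputs. The function names and the helper `leaves_input_unchanged` are hypothetical and made up for this example.

```python
import pandas as pd


def spend_zero_filled_mutating(spend: pd.Series) -> pd.Series:
    # Anti-pattern: writes into the series that was passed in.
    spend[spend.isna()] = 0.0
    return spend


def spend_zero_filled(spend: pd.Series) -> pd.Series:
    # Preferred: fillna returns a new series and leaves the input alone.
    return spend.fillna(0.0)


def leaves_input_unchanged(func, series: pd.Series) -> bool:
    """Returns True if calling func does not modify the series passed in."""
    snapshot = series.copy(deep=True)
    func(series)
    # Series.equals treats NaNs in the same position as equal.
    return series.equals(snapshot)


safe = leaves_input_unchanged(spend_zero_filled, pd.Series([1.0, None, 3.0]))
unsafe = leaves_input_unchanged(
    spend_zero_filled_mutating, pd.Series([1.0, None, 3.0])
)
```

A check like this can sit next to the usual output assertions, so an accidental in-place change fails the test instead of surfacing later in a downstream function.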
