
Adds more context to the feature engineering example README
To help set the tone and explain what feature engineering is,
as well as add more context about the scenarios and the task.
skrawcz committed Feb 21, 2023
1 parent c7b0aae commit f175938
Showing 1 changed file with 110 additions and 18 deletions: examples/feature_engineering/README.md
# Feature Engineering

What is feature engineering? It's the process of transforming data for input to a "model".

To make models better, it's common to perform and try a lot of "transforms". This is where Hamilton comes in.
Hamilton allows you to:
* write different transformations in a straightforward and formulaic manner
* keep them managed and versioned with computational lineage (if using something like git)
* benefit from a great testing and documentation story

which allows you to sanely iterate, maintain, and determine what works best for your modeling domain.
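
To make this concrete, here is a minimal sketch of what Hamilton feature transforms can look like. The function and column names below are illustrative (they loosely mirror the `age_mean` and `age_std_dev` features discussed later): each function defines a feature, and its parameter names declare what it depends on, whether raw inputs or other features.

```python
# features.py: a sketch of a Hamilton feature module (names are illustrative).
import pandas as pd


def age_mean(age: pd.Series) -> float:
    """Mean of the raw `age` column (an aggregation feature)."""
    return age.mean()


def age_std_dev(age: pd.Series) -> float:
    """Standard deviation of the raw `age` column (an aggregation feature)."""
    return age.std()


def age_zero_mean_unit_variance(age: pd.Series, age_mean: float, age_std_dev: float) -> pd.Series:
    """`age` normalized using the aggregates above (a row-level map feature)."""
    return (age - age_mean) / age_std_dev
```

A Hamilton `Driver` then stitches these functions into a dataflow and computes whichever features you request, given the raw inputs.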

In this example series, we'll skip talking about the benefits of Hamilton here, and instead focus on how to use it
for feature engineering. But first, some context on what challenges you're likely to face with feature engineering
in general.

# What is hard about feature engineering?
There are certain dimensions that make feature engineering hard:

1. Code: Organizing and maintaining code for reuse/collaboration/discoverability.
2. Lineage: Keeping track of what data is being used for what purpose.
3. Deployment: Offline vs online vs streaming needs.

## Code: Organizing and maintaining code for reuse/collaboration/discoverability
> Individuals build features, but teams own them.

Have you ever dreaded taking over someone else's code? This is a common problem with feature engineering!

Why? The code for feature engineering is often spread out across many different files (e.g. scripts, notebooks, libraries),
created by many individuals, and written in many different ways. This makes it hard to reuse code, collaborate, and
discover what code is available, and therefore hard to maintain what is actually being used in "production" and
what is not.

## Lineage: Keeping track of what data is being used for what purpose
With the growth of data teams, along with data governance & privacy regulations, knowing what data is being used
and for what purpose is something the business needs to be able to answer easily. A "modeler" is often not a
stakeholder in needing this visibility; they just want to build models. But these concerns are often put on
their plate to address, which slows down their ability to build and ship features, and thus models.

Not having lineage or visibility into what data is being used for what purpose can lead to a lot of problems:
- teams break data assumptions without knowing it, e.g. upstream team stops updating data used downstream.
- teams are not aware of what data is available to them, e.g. duplication of data & effort.
- teams have to spend time figuring out what data is being used for what purpose, e.g. to audit models.
- teams struggle to debug inherited feature workflows, e.g. to fix bugs or add new features.

## Deployment: Offline vs online vs streaming needs
This is a big topic. We won't do it justice here, but let's try to give a brief overview of two main problems:

(1) There are a lot of different deployment needs when you get something to production. For example, you might want to:
- run a batch job to generate features for a model
- hit a webservice to make predictions in real time, which needs features computed on the fly or retrieved from a cache (e.g. a feature store)
- run a streaming job to generate features for a model in real time
- support all three, or some subset, of the above ways of deploying features

So the challenge is, how do you design your processes to take into account your deployment needs?

(2) Do you implement features once, twice, or three times? To enable (1), you need to ask yourself: can we share feature code, or
do we need to reimplement it for every system that we want to use it in?

With (1) and (2) in mind, you can see that there are a lot of different dimensions to consider when designing your
feature engineering processes. They have to connect with each other, and be flexible enough to support your specific
deployment needs.

# Using Hamilton for Feature Engineering for Batch/Offline
If you **only** need to deploy features for batch jobs, then stop right there. You don't need these examples,
as they are focused on how to bridge the gap between "offline" and "online" feature engineering. You should instead
browse the other examples like `data_quality`.

# Using Hamilton for Feature Engineering for Batch/Offline and Online/Streaming
The example scenarios here are for people who have to deal with both batch and online feature engineering.

We provide two examples for two common scenarios that occur if you have this need. Note: the example code in these
scenarios aims to illustrate how to think about and frame feature engineering with Hamilton. It contains minimal features so as
not to overwhelm you, and leaves out some implementation details that you would need to fill in for your specific use case,
e.g. fitting a model using the features, or where to store aggregate feature values.

## Scenario Context
A fairly common task is that you need to do feature engineering in an offline setting (e.g. batch via Airflow),
as well as in an online setting (e.g. a synchronous request via FastAPI). What commonly
happens is that the feature code is not shared, resulting in two implementations
that lead to subtle bugs and hard-to-maintain code.

With this example series, we show how you can use Hamilton to:

1. write a feature once. (scenarios 1 and 2)
2. leverage that feature code anywhere that python runs. e.g. in batch and online. (scenarios 1 and 2)
3. show how to modularize components so that if you have values cached in a feature store,
you can inject those values into your feature computation needs. (scenario 2)

# Scenarios
The task that we're modeling here isn't that important, but if you must know, we're trying to predict the number of
hours of absence that an employee will have given some information about them; this is based off the `data_quality`
example, which is based off of the [Metaflow+Hamilton example](https://outerbounds.com/blog/developing-scalable-feature-engineering-dags/),
where Hamilton was used for the feature engineering process -- in that example only offline feature engineering was modeled.

Assumptions we're using:
1. You have a fixed set of features that you want to compute for a model, which you have determined a priori to be useful.
2. We are agnostic of the actual model -- and skip any details of that in the examples.

Let's explain the context of the two scenarios a bit more.

## Scenario 1: the simple case - ETL + Online API
In this scenario we assume we can get the same raw inputs at prediction time as would be provided at training time.

This is a straightforward process if all your feature transforms are [map operations](https://en.wikipedia.org/wiki/Map_(higher-order_function)).
If, however, you have some transforms that are aggregations, then you need to be careful about how you connect your offline
ETL with your online serving.

In this example, there are two features, `age_mean` and `age_std_dev`, that we avoid recomputing in an online setting.
Instead, we "store" the values for them when we compute features in the offline ETL, and then use those "stored" values
at prediction time to get the right feature computation to happen.
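
Here is a rough sketch (module and variable names are illustrative, not the exact example code) of how that can work with Hamilton: the offline ETL computes and persists the aggregate values, and the online path reuses the same feature module but passes the stored values in as overrides, so the aggregation features are never recomputed at request time.

```python
# A sketch assuming a `features` module like the one shown earlier.
import pandas as pd
from hamilton import base, driver

import features  # hypothetical module defining age_mean, age_std_dev, age_zero_mean_unit_variance, ...

# A dict result builder makes it easy to pull out scalar aggregate values.
adapter = base.SimplePythonGraphAdapter(base.DictResult())
dr = driver.Driver({}, features, adapter=adapter)

# Offline ETL: compute everything, then persist the aggregates.
training_data = pd.DataFrame({"age": [25, 32, 41, 57]})  # stand-in for your real training data
offline_result = dr.execute(
    ["age_zero_mean_unit_variance", "age_mean", "age_std_dev"],
    inputs={"age": training_data["age"]},
)
stored_aggregates = {
    "age_mean": offline_result["age_mean"],
    "age_std_dev": offline_result["age_std_dev"],
}
# Persist `stored_aggregates` (e.g. alongside the model artifact) for the online service to load.

# Online request: same feature code, but inject the stored aggregates.
request_data = pd.DataFrame({"age": [36]})  # raw input arriving with the request
online_result = dr.execute(
    ["age_zero_mean_unit_variance"],
    inputs={"age": request_data["age"]},
    overrides=stored_aggregates,  # short-circuits recomputing the aggregation features
)
```

The important part is that `age_mean` and `age_std_dev` are only ever defined once, in the feature module; the offline and online paths just differ in whether they compute those values or supply them.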

## Scenario 2: the more complex case - request doesn't have all the raw data - ETL + Online API
In this scenario we assume we are not passed the raw data in the request, but need to fetch it ourselves as part of the online API request.

We will pretend to hit a feature store that will provide us with the required data to compute the features for
input to the model. This example shows one way to modularize your Hamilton code so that you can swap out the "source"
of the data. To simplify the example, we assume that we can get all the input data we need from a feature store, rather
than it also coming in via the request.

A good exercise would be to note the differences between this scenario (2) and scenario 1 in how they structure
the code with Hamilton.
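
As a loose sketch of that modularization (the module, function, and feature store pieces named here are hypothetical), the functions that provide the raw inputs live in their own "loader" module, and the online service gets a version that fetches from the feature store:

```python
# online_loader.py: hypothetical request-time data source; a stub stands in for a real feature store client.
import pandas as pd


def _feature_store_lookup(client_id: int, feature_name: str) -> float:
    """Placeholder: replace with your real feature store / database client call."""
    return 42.0


def age(client_id: int) -> pd.Series:
    """Raw `age` for the requested entity, fetched from the feature store rather than the request payload."""
    return pd.Series([_feature_store_lookup(client_id, "age")])
```

Because Hamilton wires functions together by name, the feature module itself does not change: the batch ETL would construct its driver with a loader module that reads the training dataset (e.g. `driver.Driver({}, features, offline_loader)`), while the online service constructs it with the module above (e.g. `driver.Driver({}, features, online_loader)`).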

# What's next?
Jump into each directory and read the README, it'll explain how the example is set up and how things should work.
