Adds first scenario for feature engineering examples #311

Open · wants to merge 7 commits into base: main
124 changes: 124 additions & 0 deletions examples/feature_engineering/README.md
@@ -0,0 +1,124 @@
# Feature Engineering

What is feature engineering? It's the process of transforming data into inputs for a "model".

To make models better, it's common to write and try out a lot of "transforms". This is where Hamilton comes in.
Hamilton allows you to:
* write different transformations in a straightforward and formulaic manner
* keep them managed and versioned, with computational lineage (if you use something like git)
* benefit from a great testing and documentation story

which lets you sanely iterate, maintain, and determine what works best for your modeling domain.

In this example series, we'll skip discussing the benefits of Hamilton and instead focus on how to use it
for feature engineering. But first, some context on the challenges you're likely to face with feature engineering
in general.

# What is hard about feature engineering?
There are certain dimensions that make feature engineering hard:

1. Code: Organizing and maintaining code for reuse/collaboration/discoverability.
2. Lineage: Keeping track of what data is being used for what purpose.
3. Deployment: Offline vs online vs streaming needs.

## Code: Organizing and maintaining code for reuse/collaboration/discoverability.
> Individuals build features, but teams own them.

Have you ever dreaded taking over someone else's code? This is a common problem with feature engineering!

Why? The code for feature engineering is often spread out across many different files (scripts, notebooks, libraries, etc.),
created by many individuals, and written in many different styles. This makes it hard to reuse code,
collaborate, and discover what code is available, and therefore hard to know what is actually being used in "production" and
what is not.

## Lineage: Keeping track of what data is being used for what purpose
With the growth of data teams, and with data governance & privacy regulations, businesses increasingly need to be able to
answer what data is being used and for what purpose. A "modeler" is often not the stakeholder who needs this visibility;
they just want to build models. Yet these concerns frequently land on their plate, which slows down their ability to
build and ship features, and thus models.

Not having lineage or visibility into what data is being used for what purpose can lead to a lot of problems:
- teams break data assumptions without knowing it, e.g. upstream team stops updating data used downstream.
- teams are not aware of what data is available to them, e.g. duplication of data & effort.
- teams have to spend time figuring out what data is being used for what purpose, e.g. to audit models.
- teams struggle to debug inherited feature workflows, e.g. to fix bugs or add new features.

## Deployment: Offline vs online vs streaming needs
This is a big topic. We won't do it justice here, but let's give a brief overview of the two main problems:

(1) There are a lot of different deployment needs once you get something to production. For example, you might want to:
- run a batch job to generate features for a model.
- hit a webservice to make predictions in real time, which needs features computed on the fly or retrieved from a cache (e.g. a feature store).
- run a streaming job to generate features for a model in real time.
- support all three, or some subset, of the above ways of deploying features.

So the challenge is: how do you design your processes to take into account your deployment needs?

(2) Do you implement features once, twice, or three times? To enable (1), you need to ask yourself: can we share
feature code, or do we need to reimplement it for every system that we want to use it in?

With (1) and (2) in mind, you can see that there are a lot of dimensions to consider when designing your
feature engineering processes. They have to connect with each other, and be flexible enough to support your specific
deployment needs.

# Using Hamilton for Feature Engineering for Batch/Offline
If you **only** need to deploy features for batch jobs, then stop right there. You don't need these examples,
as they are focused on how to bridge the gap between "offline" and "online" feature engineering. You should instead
browse the other examples, like `data_quality`.

# Using Hamilton for Feature Engineering for Batch/Offline and Online/Streaming
These example scenarios are for people who have to deal with both batch and online feature engineering.

We provide examples for two common scenarios that occur if you have this need. Note: the example code in these
scenarios aims to illustrate how to think about and frame the problem using Hamilton. It contains minimal features so as
not to overwhelm you, and leaves out some implementation details that you would need to fill in for your specific use case,
e.g. fitting a model using the features, or where to store aggregate feature values.

## Scenario Context
A not-too-uncommon task is needing to do feature engineering in an offline setting (e.g. batch via Airflow),
as well as in an online setting (e.g. a synchronous request via FastAPI). What commonly
happens is that the feature code is not shared, and you end up with two implementations
that result in subtle bugs and hard-to-maintain code.

With this example series, we show how you can use Hamilton to:

1. write a feature once (see the sketch just below this list). (scenarios 1 and 2)
2. leverage that feature code anywhere that Python runs, e.g. in batch and online. (scenarios 1 and 2)
3. modularize components, so that if you have values cached in a feature store,
you can inject those values into your feature computation. (scenario 2)
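
To give a flavor of what "writing a feature once" looks like, here is a minimal sketch of a Hamilton feature
function (the function name is the output, and the parameters are its inputs). Treat the snippet as
illustrative rather than a copy of the `features.py` in these scenarios:

```python
import pandas as pd


def age_zero_mean_unit_variance(age: pd.Series, age_mean: float, age_std_dev: float) -> pd.Series:
    """Normalizes `age` using dataset-level mean and standard deviation.

    Offline, `age_mean` and `age_std_dev` are computed from the full dataset;
    online, previously stored values are supplied instead (e.g. via `overrides=`).
    """
    return (age - age_mean) / age_std_dev
```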

The task that we're modeling here isn't that important, but if you must know: we're trying to predict the number of
hours of absence that an employee will have, given some information about them. This is based on the `data_quality`
example, which is itself based on the [Metaflow+Hamilton example](https://outerbounds.com/blog/developing-scalable-feature-engineering-dags/),
where Hamilton was used for the feature engineering process; in that example, only offline feature engineering was modeled.

Assumptions we're making:
1. You have a fixed set of features, determined a priori as being useful, that you want to compute for a model.
2. We are agnostic of the actual model, and skip any details of it in the examples.

Let's explain the context of the two scenarios a bit more.

## Scenario 1: the simple case - ETL + Online API
In this scenario we assume we can get the same raw inputs at prediction time as would be provided at training time.

This is a straightforward process if all your feature transforms are [map operations](https://en.wikipedia.org/wiki/Map_(higher-order_function)).
If, however, some of your transforms are aggregations, then you need to be careful about how you connect your offline
ETL with your online serving.

In this example there are two such features, `age_mean` and `age_std_dev`, that we avoid recomputing in the online setting.
Instead, we "store" their values when we compute features in the offline ETL, and then use those "stored" values
at prediction time so that the right feature computation happens.
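
To make the map vs. aggregation distinction concrete, here is a small sketch (the first function name is
illustrative, not necessarily one defined in `features.py`):

```python
import pandas as pd


def weight_per_height_cm(weight: pd.Series, height: pd.Series) -> pd.Series:
    """A row-wise "map" feature: safe to compute identically offline and online."""
    return weight / height


def age_mean(age: pd.Series) -> float:
    """An aggregation feature: it needs the whole dataset, so it is computed in the
    offline ETL and its stored value is injected (e.g. via `overrides=`) online."""
    return age.mean()
```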

## Scenario 2: the more complex case - request doesn't have all the raw data - ETL + Online API
In this scenario we assume we are not passed the raw data, but need to fetch it ourselves as part of handling the online API request.

We will pretend to hit a feature store that provides us with the data required to compute the features that are
input to the model. This example shows one way to modularize your Hamilton code so that you can swap out the "source"
of the data (see the sketch below). To simplify the example, we assume that we can get all the input data we need from the feature store,
rather than it also coming in via the request.
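
Concretely, the swap happens when you construct the Hamilton Driver: you pass in a different data-loading module
depending on the context. A rough sketch, where `offline_loader` exists in these examples but
`feature_store_loader` is a hypothetical name for the online counterpart:

```python
from hamilton import driver

import features
import offline_loader  # batch: loads the raw flat file
# import feature_store_loader  # online: would fetch the same inputs from a feature store

# The offline ETL builds its DAG from the flat-file loader plus the shared feature code.
dr = driver.Driver({}, offline_loader, features)

# The online service would instead do something like:
# dr = driver.Driver({}, feature_store_loader, features)
```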

A good exercise is to note the differences between this scenario (2) and scenario 1 in how they structure
the code with Hamilton.

# What's next?
Jump into each directory and read its README; it explains how the example is set up and how things should work.
75 changes: 75 additions & 0 deletions examples/feature_engineering/scenario_1/README.md
@@ -0,0 +1,75 @@
# Scenario 1: the simple case - ETL + Online API

Context required to understand this scenario:
1. You have read the main README in the directory above.
2. You understand what it means to have an extract, transform, load (ETL) process for creating features: it ingests raw data,
runs transformations, and then creates a training set for you to train a model on.
3. You understand the need for an online API from which you want to serve predictions. The assumption here is
that you can provide it with the same raw data that you would have access to in your ETL process.
4. You understand why aggregation features, like `mean()` or `std_dev()`, make sense only in an
offline setting where you have all the data. In an online setting, computing them likely does not make sense.

# How it works

1. You run `etl.py`, and in addition to storing the features/training a model, you store the aggregate global values
for `age_mean` and `age_std_dev`. `etl.py` uses `offline_loader.py`, `features.py`, and `named_model_feature_sets.py`.
2. You then run `fastapi_server.py`, which is the online webservice with the trained model (not shown here). `fastapi_server.py`
uses `features.py` and `named_model_feature_sets.py`.
Note: in real life you would need to figure out a process to inject the aggregate global values for `age_mean` and `age_std_dev`
into the feature computation process. If you're getting started, these could be hardcoded values, stored in a file that
is loaded much like the model, queried from a database, etc.; one minimal sketch is shown below. You'll want
to ensure these values match whatever values the model was trained with. If you need help here, join our Slack and we're happy to help you figure this out!
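
For instance, here is one minimal way to hand the values off. The file name and JSON format are assumptions for
illustration, not part of the example code:

```python
import json

# etl.py side: persist the aggregate values computed over the full training data.
with open("invariant_values.json", "w") as f:
    json.dump({"age_mean": 33.0, "age_std_dev": 12.0}, f)

# fastapi_server.py side: load them back at startup, instead of the
# hardcoded values returned by `load_invariant_feature_values()`.
with open("invariant_values.json") as f:
    invariant_feature_values = json.load(f)
```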

Here's a mental image of how things work and how they relate to the files/modules & Hamilton:
![offline-online-image](FeaturesExampleScenario1.svg?sanitize=True)


Otherwise, here is a description of all the files and what they do:

## offline_loader.py
Contains logic to load raw data. Here it's a flat file, but it could come from
a database, etc.

## features.py
The feature transform logic that takes raw data and transforms it into features. It contains some runtime
data quality checks using Pandera.

Important note: there are two aggregation features defined, `age_mean` and `age_std_dev`, that are computed on the
`age` column. These make sense to compute in an offline setting where you have all the data, but in an online setting where
you'd be performing inference, that doesn't make sense. So for the online case, these computations are "overridden" in
`fastapi_server.py` with the values that were computed in the offline setting and stored (as mentioned above
and below, it's up to you how to store/sync them). A nice thing in Hamilton is that we can also "tag" these two
feature transforms with information to indicate to someone reading the code that they should be overridden in the
online feature computation context.
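
A tagged aggregation feature could look roughly like the following; this is a sketch, and the exact tag keys and
values used in `features.py` may differ:

```python
import pandas as pd

from hamilton.function_modifiers import tag


@tag(environment="offline_only")  # hint to readers: override this value online.
def age_mean(age: pd.Series) -> float:
    """Average age across the whole dataset."""
    return age.mean()
```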

## etl.py
This script mimics what one might do to fit a model: extract data, transform it into features,
and then load the features somewhere or fit a model. It's pretty basic and is meant
to be illustrative. It is not complete, i.e. it doesn't save anything or fit a model; it just extracts and transforms data
into features to create a dataframe.

As seen in this image of what is executed, we see that data is pulled from a data source and transformed into features.
![offline execution](offline_execution.dot.png)

Note: you need to store `age_mean` and
`age_std_dev` and then somehow get those values to plug into the code in `fastapi_server.py`; we "hand-wave" this here.

## named_model_feature_sets.py
Rather than hardcoding in two places what features the model should have, we define
them in a single place and import them where needed; this is simple if you can share the code easily.
However, you'll have to determine how best to do this in your setup. There are many ways to do it;
come ask in the Slack channel if you need help.
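
As a rough sketch of what such a module contains (the feature names below are placeholders, not necessarily the
ones used in this example):

```python
# named_model_feature_sets.py: single source of truth for the model's input features.
model_x_features = [
    "age_zero_mean_unit_variance",
    "has_children",
    "is_summer",
    # ... whatever else the model was trained on.
]
```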

## fastapi_server.py
The FastAPI server that serves the predictions. It's pretty basic and is meant to
illustrate the steps required to serve a prediction from a model where
you want to use the same feature computation logic as in your ETL process.

Note: the aggregation feature values are provided at run time and are the same
for all predictions; how you "link" or "sync" these values to the webservice & model
is up to you. In this example we just hardcode them.

Here is the DAG that is executed when a request is made. As you can see, no data is loaded, because we assume
the data comes in via the API request. Note: `age_mean` and `age_std_dev` are overridden with values and would
not be executed (our visualization doesn't take overrides into account just yet).
![online execution](online_execution.dot.png)
61 changes: 61 additions & 0 deletions examples/feature_engineering/scenario_1/etl.py
@@ -0,0 +1,61 @@
"""
This is part of an offline ETL that you'd likely have.
You pull data from a source, and transform it into features, and then save/fit a model with them.

Here we ONLY use Hamilton to create the features, with comment stubs for the rest of the ETL that would normally
be here.
"""
import features
import named_model_feature_sets
import offline_loader
import pandas as pd

from hamilton import driver


def create_features(source_location: str) -> pd.DataFrame:
"""Extracts and transforms data to create feature set.

Hamilton functions encode:
- pulling the data
- transforming the data into features

Hamilton then handles building a dataframe.

:param source_location: the location to load data from.
:return: a pandas dataframe.
"""
model_features = named_model_feature_sets.model_x_features
config = {}
dr = driver.Driver(config, offline_loader, features)
# Visualize the DAG if you need to:
# dr.display_all_functions('./offline_my_full_dag.dot', {"format": "png"})
dr.visualize_execution(
model_features,
"./offline_execution.dot",
{"format": "png"},
inputs={"location": source_location},
)
df = dr.execute(
# add age_mean and age_std_dev to the features
model_features + ["age_mean", "age_std_dev"],
inputs={"location": source_location},
)
return df


if __name__ == "__main__":
# stick in command line args here
_source_location = "../../data_quality/pandera/Absenteeism_at_work.csv"
_features_df = create_features(_source_location)
# we need to store `age_mean` and `age_std_dev` somewhere for the online side.
# exercise for the reader: where would you store them for your context?
# ideas: with the model? in a database? in a file? in a feature store? (all reasonable answers it just
# depends on your context).
_age_mean = _features_df["age_mean"].values[0]
_age_std_dev = _features_df["age_std_dev"].values[0]
print(_features_df)
# Then do something with the features_df, e.g. define functions to do the following:
# save_features(features_df[named_model_feature_sets.model_x_features], "my_model_features.csv")
# train_model(features_df[named_model_feature_sets.model_x_features])
# etc.
131 changes: 131 additions & 0 deletions examples/feature_engineering/scenario_1/fastapi_server.py
@@ -0,0 +1,131 @@
""""
This is a simple example of a FastAPI server that uses Hamilton on the request
path to transform the data into features, and then uses a fake model to make
a prediction.

The assumption here is that you get all the raw data passed in via the request.

Otherwise, for aggregation-type features, you need to pass in stored values,
which we have mocked out with `load_invariant_feature_values`.
"""

import fastapi
import features
import named_model_feature_sets
import pandas as pd
import pydantic

from hamilton import base
from hamilton.experimental import h_async

app = fastapi.FastAPI()

# know the model schema somehow.
model_input_features = named_model_feature_sets.model_x_features


def load_invariant_feature_values() -> dict:
"""This function would load the invariant feature values from a database or file.
:return: a dictionary of invariant feature values.
"""
return {
"age_mean": 33.0,
"age_std_dev": 12.0,
}


def fake_model(df: pd.DataFrame) -> pd.Series:
"""Function to simulate a model"""
# do some transformation.
return df.sum() # this is nonsensical but provides a single number.


# you would load the model from disk or a registry -- here it's a function.
model = fake_model
# need to load the invariant features somehow.
invariant_feature_values = load_invariant_feature_values()

# we instantiate an async driver once for the life of the app:
dr = h_async.AsyncDriver({}, features, result_builder=base.SimplePythonDataFrameGraphAdapter())


class PredictRequest(pydantic.BaseModel):
"""Assumption here is that all this data is available via the request that comes in."""

id: int
reason_for_absence: int
month_of_absence: int
day_of_the_week: int
seasons: int
transportation_expense: int
distance_from_residence_to_work: int
service_time: int
age: int
work_load_average_per_day: float
hit_target: int
disciplinary_failure: int
education: int
son: int
social_drinker: int
social_smoker: int
pet: int
weight: int
height: int
body_mass_index: int


@app.post("/predict")
async def predict_model_version1(request: PredictRequest) -> dict:
"""Illustrates how a prediction could be made that needs to compute some features first.
In this version we pass the stored invariant feature values (e.g. `age_mean`, `age_std_dev`)
in as overrides to the feature computation, rather than recomputing them per request.

If you wanted to visualize execution, you could do something like:
dr.visualize_execution(model_input_features,
'./online_execution.dot',
{"format": "png"},
inputs=input_series)

:param request: the request body.
:return: a dictionary with the prediction value.
"""
# one liner to quickly create some series from the request.
input_series = pd.DataFrame([request.dict()]).to_dict(orient="series")
# create the features -- point here is we're reusing the same code as in the training!
# with the ability to provide static values for things like `age_mean` and `age_std_dev`.
features = await dr.execute(
model_input_features, inputs=input_series, overrides=invariant_feature_values
)
prediction = model(features)
return {"prediction": prediction.values[0]}


if __name__ == "__main__":
# If you run this as a script, then the app will be started on localhost:8000
import uvicorn

uvicorn.run(app, host="0.0.0.0", port=8000)

# here's a request you can cut and paste into http://localhost:8000/docs
example_request_input = {
"id": 11,
"reason_for_absence": 26,
"month_of_absence": 7,
"day_of_the_week": 3,
"seasons": 2,
"transportation_expense": 1,
"distance_from_residence_to_work": 1,
"service_time": 289,
"age": 36,
"work_load_average_per_day": 13,
"hit_target": 33,
"disciplinary_failure": 239.554,
"education": 97,
"son": 0,
"social_drinker": 1,
"social_smoker": 2,
"pet": 1,
"weight": 90,
"height": 172,
"body_mass_index": 30, # remove this comma to make it valid json.
}