This repository has been archived by the owner on Jul 3, 2023. It is now read-only.

Commit
Completes feature engineering scenario 1 example
This example shows how one might use Hamilton to compute
features in an offline and online fashion. The assumption here
is that the request passed into the API has all the raw data
required to compute features.

This example also shows how one might "override" some values
that are required for computing features; in this example they
are `age_mean` and `age_std_dev`. This can be required when
computing aggregation features does not make sense at
inference time.
skrawcz committed Feb 19, 2023
1 parent 3cbd38c commit 36799a9
Showing 8 changed files with 80 additions and 18 deletions.
10 changes: 6 additions & 4 deletions examples/feature_engineering/README.md
@@ -1,6 +1,6 @@
# write features once, use anywhere
A not too uncommon task is that you need to do feature engineering in an offline (e.g. batch via airflow)
setting, as well as an online (e.g. synchronous request via FastAPI). What commonly
setting, as well as an online setting (e.g. synchronous request via FastAPI). What commonly
happens is that the code for features is not shared, resulting in two implementations
that produce subtle bugs and hard-to-maintain code.

@@ -18,13 +18,15 @@ overwhelm you.

## Scenario 1: the simple case - ETL + Online API
Assume we can get the same raw inputs at prediction time as were provided at training time.
However, we don't want to recompute `age_mean` and `age_std_dev` because recomputing them doesn't make sense.
Instead, we store the result of that at training time, and then use it at prediction time to get the right
features for the model.
However, we don't want to recompute `age_mean` and `age_std_dev` because recomputing aggregation features
doesn't make sense in an online setting (usually). Instead, we "store" the values for them when we compute features,
and then use those "stored" values at prediction time to get the right computation.
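The idea can be sketched in a few lines of plain pandas. The numbers are invented for illustration and are not the example's real data:

```python
import pandas as pd

# Offline (batch): the full `age` column is available, so aggregations are meaningful.
training_ages = pd.Series([25.0, 32.0, 47.0, 51.0, 38.0])
age_mean = training_ages.mean()
age_std_dev = training_ages.std()  # sample std dev (ddof=1), pandas' default

# Online: a request carries a single row, so recomputing mean/std over it would
# be meaningless. Reuse the stored training-time values to standardize the
# incoming age instead.
request_age = 29.0
standardized_age = (request_age - age_mean) / age_std_dev
```

How the two values travel from the batch job to the online service (file, database, feature store) is left open, just as in the example.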

## Scenario 2: the more complex case - request doesn't have all the raw data - ETL + Online API
At prediction time we might only have some of the raw data required to compute a prediction. To get the rest
we need to make a call, e.g. to a feature store or a database, that will provide us with that information.
This example shows one way to modularize your Hamilton code so that you can swap out the "source" of the data.
A good exercise would be to note the differences between this scenario (2) and scenario 1.
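A hedged sketch of that modularization, using invented names (`OfflineLoader`, `OnlineLoader`) rather than the repo's actual modules: both "sources" expose the same output, so the shared transform code never knows which one produced its input. In Hamilton you would put each source in its own module and pass the right module to `driver.Driver(...)`; here plain objects stand in for modules.

```python
import pandas as pd

class OfflineLoader:
    """Reads the full raw dataset from a flat file / warehouse."""
    @staticmethod
    def raw_data(source_location: str) -> pd.DataFrame:
        return pd.DataFrame({"age": [25, 32, 47]})  # stand-in for a real read

class OnlineLoader:
    """Builds the same-shaped frame from an API request payload."""
    @staticmethod
    def raw_data(request_payload: dict) -> pd.DataFrame:
        return pd.DataFrame([request_payload])

def compute_features(df: pd.DataFrame) -> pd.DataFrame:
    # Shared transform logic: written once, used by both paths.
    out = df.copy()
    out["age_plus_one"] = out["age"] + 1
    return out

offline_features = compute_features(OfflineLoader.raw_data("data.csv"))
online_features = compute_features(OnlineLoader.raw_data({"age": 29}))
```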

# What next?
Jump into each directory and read the README; it explains how the example is set up and how things should work.
35 changes: 27 additions & 8 deletions examples/feature_engineering/scenario_1/README.md
@@ -4,8 +4,6 @@ Assumptions:
2. you have an online API from where you want to serve the predictions, and you can provide the same raw data to it
that you would have access to in your ETL process.

TODO: picture

## ETL Process
ETL stands for Extract, Transform, Load. This is the process of taking raw data, transforming it into features,
and then loading the data somewhere/doing something with it. E.g. you pull raw data, transform it into features,
@@ -17,7 +15,7 @@ for this example.
Here is a description of all the files and what they do.
Note: aggregation features, like `mean()` or `std_dev()`, make sense only in an
offline setting where you have all the data. In an online setting, computing them
does not make sense. In `etl.py` there is a note that you need to store `age_mean` and
does not make sense (probably). In `etl.py` there is a note that you need to store `age_mean` and
`age_std_dev` and then somehow get those values to plug into the code in `fastapi_server.py`.
If you're getting started, these could be hardcoded values, or stored to a file that
is loaded much like the model, or queried from a database, etc. Though you'll want
@@ -28,19 +26,40 @@ Contains logic to load raw data. Here it's a flat file, but it could be going
to a database, etc.

#### features.py
The feature transform logic that takes raw data and transforms it into features.
The feature transform logic that takes raw data and transforms it into features. It contains some runtime
data quality checks using Pandera.

Important note: there are two aggregation features defined, `age_mean` and `age_std_dev`, that are computed on the
`age` column. These make sense to compute in an offline setting where you have all the data, but in an online setting where
you'd be performing inference, that doesn't make sense. So for the online case, these computations are "overridden" in
`fastapi_server.py` with the values that were computed in the offline setting and that you have stored (as mentioned above
and below, it's up to you how to store/sync them).
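A hedged sketch of that mechanic in Hamilton's declarative style, where a function's name is the feature it produces and its parameter names declare its dependencies. This is simplified: the real `features.py` also carries Pandera checks, and the exact function names below are illustrative.

```python
import pandas as pd

def age_mean(age: pd.Series) -> float:
    # Aggregation: only meaningful over the full offline dataset.
    return float(age.mean())

def age_std_dev(age: pd.Series) -> float:
    # Aggregation: only meaningful over the full offline dataset.
    return float(age.std())

def age_zero_mean_unit_variance(age: pd.Series, age_mean: float, age_std_dev: float) -> pd.Series:
    # Depends on the aggregations by name; it doesn't care where they came from.
    return (age - age_mean) / age_std_dev

# Offline: let the aggregations run over the full column.
ages = pd.Series([20.0, 30.0, 40.0])
offline = age_zero_mean_unit_variance(ages, age_mean(ages), age_std_dev(ages))

# Online: "override" the aggregations with stored training-time values instead
# of recomputing them over a single-row request.
stored_mean, stored_std = 30.0, 10.0
online = age_zero_mean_unit_variance(pd.Series([35.0]), stored_mean, stored_std)
```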

#### etl.py
This script that mimics what one might do to fit a model: extract data, transform into features,
This script mimics what one might do to fit a model: extract data, transform into features,
and then load features somewhere or fit a model. It's pretty basic and is meant
to be illustrative.
to be illustrative. It is not complete, i.e. it doesn't save anything or fit a model; it just extracts and transforms data
into features to create a dataframe.

As seen in this image of what is executed - data is pulled from a data source and transformed into features.
![offline execution](offline_execution.dot.png)
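One way to handle the "store `age_mean` and `age_std_dev`" step that `etl.py` leaves to the reader is a JSON file written next to the model artifact. `save_invariant_features` and `load_invariant_features` are hypothetical helpers, not functions from the example:

```python
import json
import tempfile
from pathlib import Path

def save_invariant_features(values: dict, path: Path) -> None:
    # Persist the invariant aggregation values so the online service can load them.
    path.write_text(json.dumps(values))

def load_invariant_features(path: Path) -> dict:
    return json.loads(path.read_text())

with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "invariant_features.json"
    save_invariant_features({"age_mean": 38.6, "age_std_dev": 11.2}, artifact)
    restored = load_invariant_features(artifact)
```

A database row, a feature store entry, or bundling the values with the pickled model would work just as well; the point is only that the batch job writes and the server reads.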

#### constants.py
Rather than hardcoding what features the model should have in two places, we define
it in a single place and import it where needed.
it in a single place and import it where needed; this is simple if you can share the code easily.
However, how best to do this is something you'll have to determine for your setup. There are many ways to do it;
come ask in the Slack channel if you need help.

#### fastapi_server.py
The FastAPI server that serves the predictions. It's pretty basic and is meant to
illustrate the steps of what's required to serve a prediction from a model, where
you want to use the same feature computation logic as in your ETL process.
Note: aggregation features

Note: the aggregation feature values are provided at run time and are the same
for all predictions -- how you "link" or "sync" these values to the webservice & model
is up to you; in this example we just hardcode them.
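A sketch of what "override" means here. Hamilton's `Driver.execute(...)` accepts an `overrides=` mapping whose values are used verbatim instead of being computed; the toy dispatcher below mimics that short-circuit by hand (names and numbers invented):

```python
import pandas as pd

# Hardcoded at server start, as in the example; could equally be loaded from a file.
STORED_OVERRIDES = {"age_mean": 30.0, "age_std_dev": 10.0}

def compute(name: str, inputs: dict, overrides: dict) -> float:
    if name in overrides:  # short-circuit: overridden nodes are never recomputed
        return overrides[name]
    if name == "age_mean":
        return float(inputs["age"].mean())
    if name == "age_std_dev":
        return float(inputs["age"].std())
    raise KeyError(name)

# A single-row "request": the aggregations come from the overrides, not the row.
request = {"age": pd.Series([35.0])}
mean = compute("age_mean", request, STORED_OVERRIDES)
std = compute("age_std_dev", request, STORED_OVERRIDES)
standardized = ((request["age"] - mean) / std).iloc[0]
```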

Here is the DAG that is executed when a request is made. As you can see, no data is loaded, as we're assuming
that data comes from the API request. Note: `age_mean` and `age_std_dev` are overridden with values and would
not be executed (our visualization doesn't take into account overrides just yet).
![online execution](online_execution.dot.png)
24 changes: 19 additions & 5 deletions examples/feature_engineering/scenario_1/etl.py
@@ -11,15 +11,28 @@


def create_features(source_location: str) -> pd.DataFrame:
"""Extracts and transforms data to create feature set.
Hamilton functions encode:
- pulling the data
- transforming the data into features
Hamilton then handles building a dataframe.
:param source_location: the location to load data from.
:return: a pandas dataframe.
"""
model_features = constants.model_x_features
config = {}
dr = driver.Driver(config, offline_loader, features)
# Visualize the DAG if you need to:
# dr.display_all_functions('./offline_my_full_dag.dot', {"format": "png"})
# dr.visualize_execution(model_features,
# './offline_execution.dot',
# {"format": "png"},
# inputs={"location": "../data_quality/pandera/Absenteeism_at_work.csv"})
dr.visualize_execution(
model_features,
"./offline_execution.dot",
{"format": "png"},
inputs={"location": source_location},
)
df = dr.execute(
# add age_mean and age_std_dev to the features
model_features + ["age_mean", "age_std_dev"],
@@ -34,7 +47,8 @@ def create_features(source_location: str) -> pd.DataFrame:
_features_df = create_features(_source_location)
# we need to store `age_mean` and `age_std_dev` somewhere for the online side.
# exercise for the reader: where would you store them for your context?
# ideas: with the model? in a database? in a file? in a feature store?
# ideas: with the model? in a database? in a file? in a feature store? (all reasonable answers; it just
# depends on your context).
_age_mean = _features_df["age_mean"].values[0]
_age_std_dev = _features_df["age_std_dev"].values[0]
print(_features_df)
23 changes: 22 additions & 1 deletion examples/feature_engineering/scenario_1/fastapi_server.py
@@ -1,3 +1,14 @@
""""
This is a simple example of a FastAPI server that uses Hamilton on the request
path to transform the data into features, and then uses a fake model to make
a prediction.
The assumption here is that you get all the raw data passed in via the request.
Otherwise, for aggregation-type features, you need to pass in a stored value
that we have mocked out with `load_invariant_feature_values`.
"""

import constants
import fastapi
import features
@@ -69,6 +80,15 @@ async def predict_model_version1(request: PredictRequest) -> dict:
"""Illustrates how a prediction could be made that needs to compute some features first.
In this version we go to the feature store, and then pass in what we get from the feature
store as overrides to the model.
If you wanted to visualize execution, you could do something like:
dr.visualize_execution(model_input_features,
'./online_execution.dot',
{"format": "png"},
inputs=input_series)
:param request: the request body.
:return: a dictionary with the prediction value.
"""
# one liner to quickly create some series from the request.
input_series = pd.DataFrame([request.dict()]).to_dict(orient="series")
@@ -87,6 +107,7 @@ async def predict_model_version1(request: PredictRequest) -> dict:

uvicorn.run(app, host="0.0.0.0", port=8000)

# here's a request you can cut and paste into http://localhost:8000/docs
example_request_input = {
"id": 11,
"reason_for_absence": 26,
@@ -107,5 +128,5 @@ async def predict_model_version1(request: PredictRequest) -> dict:
"pet": 1,
"weight": 90,
"height": 172,
"body_mass_index": 30,
"body_mass_index": 30, # remove this comma to make it valid json.
}
5 changes: 5 additions & 0 deletions examples/feature_engineering/scenario_1/requirements.txt
@@ -0,0 +1,5 @@
fastapi
pandas
pandera
sf-hamilton
uvicorn
1 change: 1 addition & 0 deletions examples/feature_engineering/scenario_2/README.md
@@ -0,0 +1 @@
STAY TUNED...!
