This repository has been archived by the owner on Jul 3, 2023. It is now read-only.

Commit
Completes feature engineering scenario 1 example
This example shows how one might use Hamilton to compute
features in an offline and online fashion. The assumption here
is that the request passed into the API has all the raw data
required to compute features.

This example also shows how one might "override" some values
that are required for computing features; in this example they
are `age_mean` and `age_std_dev`. This can be required when
computing aggregation features does not make sense at
inference time.
skrawcz committed Feb 19, 2023
1 parent 3cbd38c commit 36799a9
Showing 8 changed files with 80 additions and 18 deletions.
10 changes: 6 additions & 4 deletions examples/feature_engineering/README.md
@@ -1,6 +1,6 @@
# write features once, use anywhere
A not too uncommon task is that you need to do feature engineering in an offline (e.g. batch via airflow)
setting, as well as an online (e.g. synchronous request via FastAPI). What commonly
setting, as well as an online setting (e.g. synchronous request via FastAPI). What commonly
happens is that the code for features is not shared, resulting in two implementations
that produce subtle bugs and hard-to-maintain code.

@@ -18,13 +18,15 @@ overwhelm you.

## Scenario 1: the simple case - ETL + Online API
Assume we can get the same raw inputs at prediction time as were provided at training time.
However, we don't want to recompute `age_mean` and `age_std_dev` because recomputing them doesn't make sense.
Instead, we store the result of that at training time, and then use it at prediction time to get the right
features for the model.
However, we don't want to recompute `age_mean` and `age_std_dev` because recomputing aggregation features
doesn't make sense in an online setting (usually). Instead, we "store" the values for them when we compute features,
and then use those "stored" values at prediction time to get the right computation.
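The idea can be sketched in a few lines of plain pandas. The numbers are invented for illustration and are not the example's real data:

```python
import pandas as pd

# Offline (batch): the full `age` column is available, so aggregations are meaningful.
training_ages = pd.Series([25.0, 32.0, 47.0, 51.0, 38.0])
age_mean = training_ages.mean()
age_std_dev = training_ages.std()  # sample std dev (ddof=1), pandas' default

# Online: a request carries a single row, so recomputing mean/std over it would
# be meaningless. Reuse the stored training-time values to standardize the
# incoming age instead.
request_age = 29.0
standardized_age = (request_age - age_mean) / age_std_dev
```

How the two values travel from the batch job to the online service (file, database, feature store) is left open, just as in the example.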

## Scenario 2: the more complex case - request doesn't have all the raw data - ETL + Online API
At prediction time we might only have some of the raw data required to compute a prediction. To get the rest
we need to make a call, e.g. to a feature store or a database, that will provide us with that information.
This example shows one way to modularize your Hamilton code so that you can swap out the "source" of the data.
A good exercise would be to note the differences between this scenario (2) and scenario 1.
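A hedged sketch of that modularization, using invented names (`OfflineLoader`, `OnlineLoader`) rather than the repo's actual modules: both "sources" expose the same output, so the shared transform code never knows which one produced its input. In Hamilton you would put each source in its own module and pass the right module to `driver.Driver(...)`; here plain objects stand in for modules.

```python
import pandas as pd

class OfflineLoader:
    """Reads the full raw dataset from a flat file / warehouse."""
    @staticmethod
    def raw_data(source_location: str) -> pd.DataFrame:
        return pd.DataFrame({"age": [25, 32, 47]})  # stand-in for a real read

class OnlineLoader:
    """Builds the same-shaped frame from an API request payload."""
    @staticmethod
    def raw_data(request_payload: dict) -> pd.DataFrame:
        return pd.DataFrame([request_payload])

def compute_features(df: pd.DataFrame) -> pd.DataFrame:
    # Shared transform logic: written once, used by both paths.
    out = df.copy()
    out["age_plus_one"] = out["age"] + 1
    return out

offline_features = compute_features(OfflineLoader.raw_data("data.csv"))
online_features = compute_features(OnlineLoader.raw_data({"age": 29}))
```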

# What next?
Jump into each directory and read the README; it explains how the example is set up and how things should work.
35 changes: 27 additions & 8 deletions examples/feature_engineering/scenario_1/README.md
@@ -4,8 +4,6 @@ Assumptions:
2. you have an online API from where you want to serve the predictions, and you can provide the same raw data to it
that you would have access to in your ETL process.

TODO: picture

## ETL Process
ETL stands for Extract, Transform, Load. This is the process of taking raw data, transforming it into features,
and then loading the data somewhere/doing something with it. E.g. you pull raw data, transform it into features,
@@ -17,7 +15,7 @@ for this example.
Here is a description of all the files and what they do.
Note: aggregation features, like `mean()` or `std_dev()`, make sense only in an
offline setting where you have all the data. In an online setting, computing them
does not make sense. In `etl.py` there is a note that you need to store `age_mean` and
does not make sense (probably). In `etl.py` there is a note that you need to store `age_mean` and
`age_std_dev` and then somehow get those values to plug into the code in `fastapi_server.py`.
If you're getting started, these could be hardcoded values, or stored to a file that
is loaded much like the model, or queried from a database, etc. Though you'll want
@@ -28,19 +26,40 @@ Contains logic to load raw data. Here it's a flat file, but it could be going
to a database, etc.

#### features.py
The feature transform logic that takes raw data and transforms it into features.
The feature transform logic that takes raw data and transforms it into features. It contains some runtime
data quality checks using Pandera.

Important note: there are two aggregation features defined, `age_mean` and `age_std_dev`, that are computed on the
`age` column. These make sense to compute in an offline setting where you have all the data, but in an online setting where
you'd be performing inference, that doesn't make sense. So for the online case, these computations are "overridden" in
`fastapi_server.py` with the values that were computed in the offline setting and that you have stored (as mentioned above
and below, it's up to you how to store/sync them).
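A hedged sketch of that mechanic in Hamilton's declarative style, where a function's name is the feature it produces and its parameter names declare its dependencies. This is simplified: the real `features.py` also carries Pandera checks, and the exact function names below are illustrative.

```python
import pandas as pd

def age_mean(age: pd.Series) -> float:
    # Aggregation: only meaningful over the full offline dataset.
    return float(age.mean())

def age_std_dev(age: pd.Series) -> float:
    # Aggregation: only meaningful over the full offline dataset.
    return float(age.std())

def age_zero_mean_unit_variance(age: pd.Series, age_mean: float, age_std_dev: float) -> pd.Series:
    # Depends on the aggregations by name; it doesn't care where they came from.
    return (age - age_mean) / age_std_dev

# Offline: let the aggregations run over the full column.
ages = pd.Series([20.0, 30.0, 40.0])
offline = age_zero_mean_unit_variance(ages, age_mean(ages), age_std_dev(ages))

# Online: "override" the aggregations with stored training-time values instead
# of recomputing them over a single-row request.
stored_mean, stored_std = 30.0, 10.0
online = age_zero_mean_unit_variance(pd.Series([35.0]), stored_mean, stored_std)
```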

#### etl.py
This script that mimics what one might do to fit a model: extract data, transform into features,
This script mimics what one might do to fit a model: extract data, transform into features,
and then load features somewhere or fit a model. It's pretty basic and is meant
to be illustrative.
to be illustrative. It is not complete, i.e. it doesn't save anything or fit a model; it just extracts and transforms data
into features to create a dataframe.

As seen in this image of what is executed - data is pulled from a data source and transformed into features.
![offline execution](offline_execution.dot.png)
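One way to handle the "store `age_mean` and `age_std_dev`" step that `etl.py` leaves to the reader is a JSON file written next to the model artifact. `save_invariant_features` and `load_invariant_features` are hypothetical helpers, not functions from the example:

```python
import json
import tempfile
from pathlib import Path

def save_invariant_features(values: dict, path: Path) -> None:
    # Persist the invariant aggregation values so the online service can load them.
    path.write_text(json.dumps(values))

def load_invariant_features(path: Path) -> dict:
    return json.loads(path.read_text())

with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "invariant_features.json"
    save_invariant_features({"age_mean": 38.6, "age_std_dev": 11.2}, artifact)
    restored = load_invariant_features(artifact)
```

A database row, a feature store entry, or bundling the values with the pickled model would work just as well; the point is only that the batch job writes and the server reads.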

#### constants.py
Rather than hardcoding what features the model should have in two places, we define
it in a single place and import it where needed.
it in a single place and import it where needed; this is simple if you can share the code easily.
However, how best to do this is something you'll have to determine for your setup. There are many ways to do it;
come ask in the Slack channel if you need help.

#### fastapi_server.py
The FastAPI server that serves the predictions. It's pretty basic and is meant to
illustrate the steps of what's required to serve a prediction from a model, where
you want to use the same feature computation logic as in your ETL process.
Note: aggregation features

Note: the aggregation feature values are provided at run time and are the same
for all predictions -- how you "link" or "sync" these values to the webservice & model
is up to you; in this example we just hardcode them.
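A sketch of what "override" means here. Hamilton's `Driver.execute(...)` accepts an `overrides=` mapping whose values are used verbatim instead of being computed; the toy dispatcher below mimics that short-circuit by hand (names and numbers invented):

```python
import pandas as pd

# Hardcoded at server start, as in the example; could equally be loaded from a file.
STORED_OVERRIDES = {"age_mean": 30.0, "age_std_dev": 10.0}

def compute(name: str, inputs: dict, overrides: dict) -> float:
    if name in overrides:  # short-circuit: overridden nodes are never recomputed
        return overrides[name]
    if name == "age_mean":
        return float(inputs["age"].mean())
    if name == "age_std_dev":
        return float(inputs["age"].std())
    raise KeyError(name)

# A single-row "request": the aggregations come from the overrides, not the row.
request = {"age": pd.Series([35.0])}
mean = compute("age_mean", request, STORED_OVERRIDES)
std = compute("age_std_dev", request, STORED_OVERRIDES)
standardized = ((request["age"] - mean) / std).iloc[0]
```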

Here is the DAG that is executed when a request is made. As you can see, no data is loaded, as we're assuming
that data comes from the API request. Note: `age_mean` and `age_std_dev` are overridden with values and would
not be executed (our visualization doesn't take into account overrides just yet).
![online execution](online_execution.dot.png)
24 changes: 19 additions & 5 deletions examples/feature_engineering/scenario_1/etl.py
@@ -11,15 +11,28 @@


def create_features(source_location: str) -> pd.DataFrame:
"""Extracts and transforms data to create feature set.
Hamilton functions encode:
- pulling the data
- transforming the data into features
Hamilton then handles building a dataframe.
:param source_location: the location to load data from.
:return: a pandas dataframe.
"""
model_features = constants.model_x_features
config = {}
dr = driver.Driver(config, offline_loader, features)
# Visualize the DAG if you need to:
# dr.display_all_functions('./offline_my_full_dag.dot', {"format": "png"})
# dr.visualize_execution(model_features,
# './offline_execution.dot',
# {"format": "png"},
# inputs={"location": "../data_quality/pandera/Absenteeism_at_work.csv"})
dr.visualize_execution(
model_features,
"./offline_execution.dot",
{"format": "png"},
inputs={"location": source_location},
)
df = dr.execute(
# add age_mean and age_std_dev to the features
model_features + ["age_mean", "age_std_dev"],
@@ -34,7 +47,8 @@ def create_features(source_location: str) -> pd.DataFrame:
_features_df = create_features(_source_location)
# we need to store `age_mean` and `age_std_dev` somewhere for the online side.
# exercise for the reader: where would you store them for your context?
# ideas: with the model? in a database? in a file? in a feature store?
# ideas: with the model? in a database? in a file? in a feature store? (all reasonable answers; it just
# depends on your context).
_age_mean = _features_df["age_mean"].values[0]
_age_std_dev = _features_df["age_std_dev"].values[0]
print(_features_df)
23 changes: 22 additions & 1 deletion examples/feature_engineering/scenario_1/fastapi_server.py
@@ -1,3 +1,14 @@
""""
This is a simple example of a FastAPI server that uses Hamilton on the request
path to transform the data into features, and then uses a fake model to make
a prediction.
The assumption here is that you get all the raw data passed in via the request.
Otherwise, for aggregation-type features, you need to pass in a stored value
that we have mocked out with `load_invariant_feature_values`.
"""

import constants
import fastapi
import features
@@ -69,6 +80,15 @@ async def predict_model_version1(request: PredictRequest) -> dict:
"""Illustrates how a prediction could be made that needs to compute some features first.
In this version we go to the feature store, and then pass in what we get from the feature
store as overrides to the model.
If you wanted to visualize execution, you could do something like:
dr.visualize_execution(model_input_features,
'./online_execution.dot',
{"format": "png"},
inputs=input_series)
:param request: the request body.
:return: a dictionary with the prediction value.
"""
# one liner to quickly create some series from the request.
input_series = pd.DataFrame([request.dict()]).to_dict(orient="series")
@@ -87,6 +107,7 @@ async def predict_model_version1(request: PredictRequest) -> dict:

uvicorn.run(app, host="0.0.0.0", port=8000)

# here's a request you can cut and paste into http://localhost:8000/docs
example_request_input = {
"id": 11,
"reason_for_absence": 26,
@@ -107,5 +128,5 @@ async def predict_model_version1(request: PredictRequest) -> dict:
"pet": 1,
"weight": 90,
"height": 172,
"body_mass_index": 30,
"body_mass_index": 30, # remove this comma to make it valid json.
}
5 changes: 5 additions & 0 deletions examples/feature_engineering/scenario_1/requirements.txt
@@ -0,0 +1,5 @@
fastapi
pandas
pandera
sf-hamilton
uvicorn
1 change: 1 addition & 0 deletions examples/feature_engineering/scenario_2/README.md
@@ -0,0 +1 @@
STAY TUNED...!
