Chicago Taxi Example
The Chicago Taxi example demonstrates an end-to-end workflow: analyzing, validating, and transforming data, then training, analyzing, and serving a model. It uses the following TFX components:
- ExampleGen ingests and splits the input dataset.
- StatisticsGen calculates statistics for the dataset.
- SchemaGen examines the statistics and creates a data schema.
- ExampleValidator looks for anomalies and missing values in the dataset.
- Transform performs feature engineering on the dataset.
- Trainer trains the model using TensorFlow Estimators.
- Evaluator performs deep analysis of the training results.
- ModelValidator ensures that the model is "good enough" to be pushed to production.
- Pusher deploys the model to a serving infrastructure.
Inference in the example is powered by:
- TensorFlow Serving for serving.
This example uses the Taxi Trips dataset released by the City of Chicago.
Note: This site provides applications using data that has been modified for use from its original source, www.cityofchicago.org, the official website of the City of Chicago. The City of Chicago makes no claims as to the content, accuracy, timeliness, or completeness of any of the data provided at this site. The data provided at this site is subject to change at any time. It is understood that the data provided at this site is being used at one’s own risk.
The local example relies on the following tools:
- Apache Airflow is used for pipeline orchestration.
- Apache Beam is used for distributed processing.
- TensorFlow is used for model training, evaluation and inference.
Development for this example will be isolated in a Python virtual environment. This allows us to experiment with different versions of dependencies.
There are many ways to install virtualenv. See the TensorFlow install guides for different platforms; here are a couple:
- For Linux:
sudo apt-get install python-pip python-virtualenv python-dev build-essential
- For Mac:
sudo easy_install pip
pip install --upgrade virtualenv
Create a Python 3.6 virtual environment for this example and activate the virtualenv:
virtualenv -p python3.6 taxi_pipeline
source ./taxi_pipeline/bin/activate
Configure common paths:
export AIRFLOW_HOME=~/airflow
export TAXI_DIR=~/taxi
export TFX_DIR=~/tfx
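Later steps expand AIRFLOW_HOME, TAXI_DIR, and TFX_DIR directly, so an unset variable silently turns a path such as $TAXI_DIR/data/simple into /data/simple. The following is a minimal sketch of a guard, written for POSIX sh. The helper name require_vars is hypothetical and is not part of TFX or Airflow.

```shell
# Hypothetical helper (not part of TFX): report every named variable that is
# unset or empty, and return non-zero if any is missing.
# Note: the loop variables (missing, name, value) are global in POSIX sh.
require_vars() {
  missing=0
  for name in "$@"; do
    # Look up the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage: stop early if any of the paths configured above is absent.
# require_vars AIRFLOW_HOME TAXI_DIR TFX_DIR || exit 1
```

Because the helper uses eval, it should only be called with literal variable names, never with untrusted input.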
Next, install the dependencies required by the Chicago Taxi example:
pip install tensorflow==1.14.0
pip install apache-airflow==1.10.5
pip install tfx==0.14.0
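The example pins exact versions, so it can help to confirm that the virtualenv really contains them before going further. This is a small sketch that checks one pin against the name==version lines printed by pip freeze. The helper name has_pin is hypothetical.

```shell
# Hypothetical helper: succeed only if "name==version" appears as a complete
# line on stdin. -F treats the dots in the version literally, -x matches the
# whole line, -i ignores case in the package name.
has_pin() {
  grep -qixF "$1==$2"
}

# Usage, inside the activated virtualenv:
# pip freeze | has_pin tensorflow 1.14.0 || echo "unexpected tensorflow version" >&2
# pip freeze | has_pin apache-airflow 1.10.5 || echo "unexpected airflow version" >&2
# pip freeze | has_pin tfx 0.14.0 || echo "unexpected tfx version" >&2
```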
Next, initialize Airflow:
airflow initdb
Copy the pipeline definition to Airflow's DAG directory
The benefit of the local example is that you can edit any part of the pipeline and experiment very quickly with various components. First let's download the data for the example:
mkdir -p $TAXI_DIR/data/simple
wget -O $TAXI_DIR/data/simple/data.csv "https://github.com/tensorflow/tfx/blob/master/tfx/examples/chicago_taxi_pipeline/data/simple/data.csv?raw=true"
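A failed or redirected download can leave an HTML page saved under the name data.csv, which only surfaces later as a confusing ExampleGen error. The sketch below is a rough sanity check on the first line of the file. The helper name looks_like_csv is hypothetical, and the heuristic (a comma in the header, no leading "<") is an assumption, not a full CSV validation.

```shell
# Hypothetical check: a CSV header contains commas and does not start with "<".
looks_like_csv() {
  header=$(head -n 1 "$1") || return 1   # missing or unreadable file
  case "$header" in
    "<"*) return 1 ;;   # HTML or XML, for example an error page
    *,*)  return 0 ;;   # at least two comma-separated columns
    *)    return 1 ;;   # empty file or a single column
  esac
}

# Usage:
# looks_like_csv "$TAXI_DIR/data/simple/data.csv" || echo "download looks wrong" >&2
```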
Next, copy the TFX pipeline definition to Airflow's DAGs directory ($AIRFLOW_HOME/dags) so it can run the pipeline. To find the location of your TFX installation, use this command:
pip show tfx
Set the TFX_EXAMPLES environment variable to the Chicago Taxi example directory under the location shown. The copy commands below rely on it.
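One way to set TFX_EXAMPLES without copying the path by hand is to parse the Location field that pip show prints. The sketch below assumes the examples ship inside the installed package at tfx/examples/chicago_taxi_pipeline, the same layout as the repository path used in the download URL above. Verify that the directory exists in your installation. The helper name tfx_examples_dir is hypothetical.

```shell
# Hypothetical helper: read `pip show` output on stdin and print the examples
# directory. ASSUMPTION: the examples live under
# <Location>/tfx/examples/chicago_taxi_pipeline in the installed package.
tfx_examples_dir() {
  sed -n 's|^Location: \(.*\)$|\1/tfx/examples/chicago_taxi_pipeline|p'
}

# Usage, inside the activated virtualenv:
# export TFX_EXAMPLES="$(pip show tfx | tfx_examples_dir)"
# ls "$TFX_EXAMPLES"   # confirm the directory exists before copying from it
```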
Copy the Chicago Taxi example pipeline into the Airflow DAG folder.
mkdir -p $AIRFLOW_HOME/dags/
cp $TFX_EXAMPLES/taxi_pipeline_simple.py $AIRFLOW_HOME/dags/
The module file taxi_utils.py, used by the Trainer and Transform components, will reside in $TAXI_DIR. Copy it there:
cp $TFX_EXAMPLES/taxi_utils.py $TAXI_DIR
Run the local example
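Start the Airflow webserver (in the 'taxi_pipeline' virtualenv):
airflow webserver
Open a new terminal window, activate the 'taxi_pipeline' virtualenv there:
source ./taxi_pipeline/bin/activate
and start the Airflow scheduler:
airflow scheduler
Open a browser to 127.0.0.1:8080 and click on the chicago_taxi_simple pipeline. Click the Graph View option to see the pipeline as a graph of its components.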
Run the example
If you were looking at the graph view, click on the DAGs button to get back to the DAGs view.
Enable the chicago_taxi_simple pipeline in Airflow by toggling the DAG to On. Now that it is schedulable, click on the Trigger DAG button (triangle inside a circle) to start the run. You can view the status by clicking on the started job, found in the Last run column. This process will take several minutes.
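The same two UI actions (toggle the DAG on, then trigger it) can also be done from a terminal. The sketch below wraps the Airflow 1.10 CLI subcommands unpause and trigger_dag. The wrapper name trigger_dag_cli and the DRY_RUN switch are hypothetical conveniences, not part of Airflow.

```shell
# Hypothetical wrapper: unpause a DAG and then trigger one run of it.
# `airflow unpause` and `airflow trigger_dag` are Airflow 1.10 CLI commands.
# With DRY_RUN=1 the commands are printed instead of executed.
trigger_dag_cli() {
  dag_id=$1
  for subcommand in unpause trigger_dag; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "airflow $subcommand $dag_id"
    else
      airflow "$subcommand" "$dag_id" || return 1
    fi
  done
}

# Usage, in the 'taxi_pipeline' virtualenv with the scheduler running:
# trigger_dag_cli chicago_taxi_simple
```

The run still appears in the web UI, so its status can be followed there as described above.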
Serve the TensorFlow model
Once the pipeline completes, the Pusher copies the model to the directory configured in the example code. Start a local model server that reads from that directory:
LOCAL_MODEL_DIR=$TAXI_DIR/serving_model/taxi_simple \
start_model_server_local.sh
The server picks up the latest model under that path.
To classify examples with the served model, run the following, replacing CHANGE_TO_LATEST_DIR with the most recent SchemaGen output directory:
EXAMPLES_FILE=~/taxi/data/simple/data.csv \
SCHEMA_FILE=~/tfx/pipelines/chicago_taxi_simple/SchemaGen/output/CHANGE_TO_LATEST_DIR/schema.pbtxt \
classify_local.sh
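The CHANGE_TO_LATEST_DIR placeholder has to be filled in by hand after every run. A minimal sketch of one way to look it up follows. It assumes that each run writes its output into a subdirectory named with a plain integer and that the largest number is the newest run. Check your own output directory before relying on this. The helper name latest_numeric_subdir is hypothetical.

```shell
# Hypothetical helper: print the name of the subdirectory of $1 with the
# largest numeric name, or nothing if there is none.
# ASSUMPTION: run directories are named with plain integers.
latest_numeric_subdir() {
  for path in "$1"/*/; do
    [ -d "$path" ] || continue   # skips the unexpanded pattern when empty
    basename "$path"
  done | grep -E '^[0-9]+$' | sort -n | tail -n 1
}

# Usage:
# latest=$(latest_numeric_subdir ~/tfx/pipelines/chicago_taxi_simple/SchemaGen/output)
# SCHEMA_FILE=~/tfx/pipelines/chicago_taxi_simple/SchemaGen/output/$latest/schema.pbtxt
```

Sorting numerically rather than alphabetically matters here: with directories 2, 9, and 10, an alphabetical sort would pick 9.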
Chicago Taxi Flink Example (Python 2.7, 3.5, 3.6, 3.7)
Start local Flink cluster and Beam job server:
git clone https://github.com/tensorflow/tfx ~/tfx-source && pushd ~/tfx-source
sh tfx/examples/chicago_taxi/setup_beam_on_flink.sh
Follow the Chicago Taxi Example instructions above, replacing 'taxi_pipeline_simple' with 'taxi_pipeline_portable_beam'. The Flink Cluster Dashboard is available at http://localhost:8081.
Chicago Taxi Spark Example (Python 2.7, 3.5, 3.6, 3.7)
Start local Spark cluster and Beam job server:
git clone https://github.com/tensorflow/tfx ~/tfx-source && pushd ~/tfx-source
sh tfx/examples/chicago_taxi/setup_beam_on_spark.sh
Follow the Chicago Taxi Example instructions above, replacing 'taxi_pipeline_simple' with 'taxi_pipeline_portable_beam'. The Spark Cluster Dashboard is available at http://localhost:8081.
Please see the TFX User Guide to learn more.