Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
PiperOrigin-RevId: 236950503
- Loading branch information
jarekw
authored and
tensorflow-extended-team
committed
Mar 6, 2019
1 parent
d86292a
commit 4f4f76c
Showing
2 changed files
with
196 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,196 @@ | ||
# Chicago Taxi Example | ||
|
||
The Chicago Taxi example demonstrates the end-to-end workflow and steps of how | ||
to analyze, validate and transform data, train a model, analyze and serve it. It | ||
uses the following [TFX](https://www.tensorflow.org/tfx) components: | ||
|
||
* [ExampleGen](https://github.com/tensorflow/tfx/blob/master/docs/guide/examplegen.md) | ||
ingests and splits the input dataset. | ||
* [StatisticsGen](https://github.com/tensorflow/tfx/blob/master/docs/guide/statsgen.md) | ||
calculates statistics for the dataset. | ||
* [SchemaGen](https://github.com/tensorflow/tfx/blob/master/docs/guide/schemagen.md) | ||
SchemaGen examines the statistics and creates a data schema. | ||
* [ExampleValidator](https://github.com/tensorflow/tfx/blob/master/docs/guide/exampleval.md) | ||
looks for anomalies and missing values in the dataset. | ||
* [Transform](https://github.com/tensorflow/tfx/blob/master/docs/guide/transform.md) | ||
performs feature engineering on the dataset. | ||
* [Trainer](https://github.com/tensorflow/tfx/blob/master/docs/guide/trainer.md) | ||
trains the model using TensorFlow [Estimators](https://www.tensorflow.org/guide/estimators) | ||
* [Evaluator](https://github.com/tensorflow/tfx/blob/master/docs/guide/evaluator.md) | ||
performs deep analysis of the training results. | ||
* [ModelValidator](https://github.com/tensorflow/tfx/blob/master/docs/guide/modelval.md) | ||
ensures that the model is "good enough" to be pushed to production. | ||
* [Pusher](https://github.com/tensorflow/tfx/blob/master/docs/guide/pusher.md) | ||
deploys the model to a serving infrastructure. | ||
|
||
Inference in the example is powered by: | ||
|
||
* [TensorFlow Serving](https://www.tensorflow.org/serving) for serving. | ||
|
||
## The dataset | ||
|
||
This example uses the [Taxi Trips dataset](https://data.cityofchicago.org/Transportation/Taxi-Trips/wrvz-psew) | ||
released by the City of Chicago. | ||
|
||
Note: This site provides applications using data that has been modified | ||
for use from its original source, www.cityofchicago.org, the official website of | ||
the City of Chicago. The City of Chicago makes no claims as to the content, | ||
accuracy, timeliness, or completeness of any of the data provided at this site. | ||
The data provided at this site is subject to change at any time. It is | ||
understood that the data provided at this site is being used at one’s own risk. | ||
|
||
You can [read more](https://cloud.google.com/bigquery/public-data/chicago-taxi) about | ||
the dataset in [Google BigQuery](https://cloud.google.com/bigquery/). Explore | ||
the full dataset in the | ||
[BigQuery UI](https://bigquery.cloud.google.com/dataset/bigquery-public-data:chicago_taxi_trips). | ||
|
||
## Local prerequisites | ||
|
||
* [Apache Airflow](https://airflow.apache.org/) is used for pipeline orchestration. | ||
* [Apache Beam](https://beam.apache.org/) is used for distributed processing. | ||
* [Docker](https://www.docker.com/) is used for containerization. | ||
* [TensorFlow](https://tensorflow.org) is used for model training, evaluation and inference. | ||
|
||
### Install dependencies | ||
|
||
Development for this example will be isolated in a Python virtual environment. | ||
This allows us to experiment with different versions of dependencies. | ||
|
||
There are many ways to install `virtualenv`, see the | ||
[TensorFlow install guides](https://www.tensorflow.org/install) for different | ||
platforms, but here are a couple: | ||
|
||
* For Linux: | ||
|
||
<pre class="devsite-terminal devsite-click-to-copy"> | ||
sudo apt-get install python-pip python-virtualenv python-dev build-essential | ||
</pre> | ||
|
||
* For Mac: | ||
|
||
<pre class="devsite-terminal devsite-click-to-copy"> | ||
sudo easy_install pip | ||
pip install --upgrade virtualenv | ||
</pre> | ||
|
||
Create a Python 2.7 virtual environment for this example and activate the | ||
`virtualenv`: | ||
|
||
<pre class="devsite-terminal devsite-click-to-copy"> | ||
virtualenv -p python2.7 taxi | ||
source ./taxi/bin/activate | ||
</pre> | ||
|
||
Configure common paths: | ||
|
||
<pre class="devsite-terminal devsite-click-to-copy"> | ||
export AIRFLOW_HOME=~/airflow | ||
export TAXI_DIR=~/taxi | ||
export TFX_DIR=~/tfx | ||
</pre> | ||
|
||
Next, install the dependencies required by the Chicago Taxi example: | ||
|
||
<!--- bring back once requirements.txt file is available | ||
<pre class="devsite-terminal devsite-click-to-copy"> | ||
pip install -r requirements.txt | ||
</pre> | ||
--> | ||
|
||
<pre class="devsite-terminal devsite-click-to-copy"> | ||
pip install tensorflow | ||
pip install docker | ||
export SLUGIFY_USES_TEXT_UNIDECODE=yes | ||
pip install apache-airflow | ||
pip install tfx==0.12.0rc3 | ||
</pre> | ||
|
||
Next, initialize Airflow | ||
|
||
<pre class="devsite-terminal devsite-click-to-copy"> | ||
airflow initdb | ||
</pre> | ||
|
||
### Copy the pipeline definition to Airflow's DAG directory | ||
|
||
The benefit of the local example is that you can edit any part of the pipeline | ||
and experiment very quickly with various components. The example comes with a | ||
small subset of the Taxi Trips dataset as CSV files. | ||
|
||
Note: Make sure to execute the following commands from the <code>tfx/examples/chicago_taxi_pipeline</code> directory. | ||
|
||
Let's copy the dataset CSV to the directory where TFX ExampleGen will ingest it | ||
from: | ||
|
||
<pre class="devsite-terminal devsite-click-to-copy"> | ||
mkdir -p $TAXI_DIR | ||
cp -r data/ $TAXI_DIR | ||
</pre> | ||
|
||
Let's copy the TFX pipeline definition to Airflow's | ||
<code>DAGs directory</code> <code>($AIRFLOW_HOME/dags)</code> so it can run the pipeline. | ||
|
||
<pre class="devsite-terminal devsite-click-to-copy"> | ||
mkdir -p $AIRFLOW_HOME/dags/taxi | ||
cp taxi_pipeline_simple.py $AIRFLOW_HOME/dags/taxi | ||
</pre> | ||
|
||
The module file <code>taxi_utils.py</code> used by the Trainer and Transform | ||
components will reside in $TAXI_DIR, let's copy it there. | ||
|
||
<pre class="devsite-terminal devsite-click-to-copy"> | ||
cp taxi_utils.py $TAXI_DIR | ||
</pre> | ||
|
||
## Run the local example | ||
|
||
### Start Airflow | ||
|
||
Start the <code>Airflow webserver</code>: | ||
|
||
<pre class="devsite-terminal devsite-click-to-copy"> | ||
airflow webserver | ||
</pre> | ||
|
||
Open a new terminal window: | ||
|
||
<pre class="devsite-terminal devsite-click-to-copy"> | ||
virtualenv -p python2.7 taxi | ||
source ./taxi/bin/activate | ||
</pre> | ||
|
||
and start the <code>Airflow scheduler</code>: | ||
|
||
<pre class="devsite-terminal devsite-click-to-copy"> | ||
airflow scheduler | ||
</pre> | ||
|
||
Open a browser to <code>127.0.0.1:8080</code> and click on the <code>chicago_taxi_simple</code> example. | ||
It should look like the image below if you click the Graph View option. | ||
|
||
![Pipeline view](chicago_taxi_pipeline_simple.png) | ||
|
||
### Run the example | ||
|
||
If you were looking at the graph above, go back to the <code>DAGs View</code> to enable the | ||
<code>chicago_taxi_simple</code>. | ||
Enable the pipeline in airflow by toggling the DAG to <code>On</code>. Now that it is | ||
schedulable, click on the <code>run button</code> (triangle) to start the run. | ||
You can view status by clicking on the started job, found in the <code>Last run</code> column. | ||
|
||
### Serve the TensorFlow model | ||
|
||
Once the pipeline completes, the model will be copied by the [Pusher](https://github.com/tensorflow/tfx/blob/master/docs/guide/pusher.md) | ||
to the directory configured in the example code: | ||
|
||
<pre class="devsite-terminal devsite-click-to-copy"> | ||
ls $TFX_DIR/pipelines/serving_model/taxi_simple | ||
</pre> | ||
|
||
|
||
To serve the model with [TensorFlow Serving](https://www.tensorflow.org/serving) | ||
please follow the instructions [here](https://github.com/tensorflow/tfx/blob/master/examples/chicago_taxi/README.md#serve-the-tensorflow-model). | ||
|
||
# Learn more | ||
|
||
Please see the [TFX User Guide](https://github.com/tensorflow/tfx/blob/master/docs/guide/index.md) to learn more. |
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.