<a href="https://colab.research.google.com/github/rcrowe-google/schemacomponent/blob/Nirzari%2Ffeature%2Fexample/example/taxi_example_colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chicago taxi example using TFX schema curation custom component

This example demonstrates the schema curation custom component. A user-defined function, `schema_fn`, defined in `module_file.py`, changes the schema feature `tips` from required to optional.

Base code taken from: https://github.com/tensorflow/tfx/blob/master/docs/tutorials/tfx/components_keras.ipynb

## Setup

### Install TFX

**Note: In Google Colab, because of package updates, the first time you run this cell you must restart the runtime (Runtime > Restart runtime ...).**

In [None]:
!pip install -U tfx

In [None]:
x = !pwd

if 'schemacomponent' not in str(x):
  !git clone https://github.com/rcrowe-google/schemacomponent
  %cd schemacomponent/example

## Chicago taxi example pipeline


In [None]:
import os
import pprint
import tempfile
import urllib.request

import absl
import tensorflow as tf
import tensorflow_model_analysis as tfma
tf.get_logger().propagate = False
pp = pprint.PrettyPrinter()

from tfx import v1 as tfx
from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext

%load_ext tfx.orchestration.experimental.interactive.notebook_extensions.skip

In [None]:
from schemacomponent.component import component

### Set up pipeline paths

In [None]:
# This is the root directory for your TFX pip package installation.
_tfx_root = tfx.__path__[0]

### Download example data
We download the example dataset for use in our TFX pipeline.

The dataset we're using is the [Taxi Trips dataset](https://data.cityofchicago.org/Transportation/Taxi-Trips/wrvz-psew) released by the City of Chicago. The columns in this dataset are:

<table>
<tr><td>pickup_community_area</td><td>fare</td><td>trip_start_month</td></tr>
<tr><td>trip_start_hour</td><td>trip_start_day</td><td>trip_start_timestamp</td></tr>
<tr><td>pickup_latitude</td><td>pickup_longitude</td><td>dropoff_latitude</td></tr>
<tr><td>dropoff_longitude</td><td>trip_miles</td><td>pickup_census_tract</td></tr>
<tr><td>dropoff_census_tract</td><td>payment_type</td><td>company</td></tr>
<tr><td>trip_seconds</td><td>dropoff_community_area</td><td>tips</td></tr>
</table>

With this dataset, we will build a model that predicts the `tips` of a trip.

In [None]:
_data_root = tempfile.mkdtemp(prefix='tfx-data')
DATA_PATH = 'https://raw.githubusercontent.com/tensorflow/tfx/master/tfx/examples/chicago_taxi_pipeline/data/simple/data.csv'
_data_filepath = os.path.join(_data_root, "data.csv")
urllib.request.urlretrieve(DATA_PATH, _data_filepath)
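
As a quick sanity check after the download, the header row of the CSV can be compared against the column list above. The following is a minimal sketch, not part of the original notebook: the helper name `missing_columns` and the constant `EXPECTED_COLUMNS` are hypothetical, and it assumes the first line of the file is a header row.

```python
import csv

# Column names copied from the table above.
EXPECTED_COLUMNS = {
    "pickup_community_area", "fare", "trip_start_month",
    "trip_start_hour", "trip_start_day", "trip_start_timestamp",
    "pickup_latitude", "pickup_longitude", "dropoff_latitude",
    "dropoff_longitude", "trip_miles", "pickup_census_tract",
    "dropoff_census_tract", "payment_type", "company",
    "trip_seconds", "dropoff_community_area", "tips",
}


def missing_columns(csv_path, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from the CSV header."""
    with open(csv_path, newline="") as f:
        # An empty file yields an empty header instead of raising StopIteration.
        header = next(csv.reader(f), [])
    return set(expected) - set(header)
```

Calling `missing_columns(_data_filepath)` should return an empty set if the downloaded file contains every column listed in the table.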

## Run TFX components 
In the cells that follow, we create TFX components one by one and generate a `schema` using the `SchemaGen` component.

In [None]:
context = InteractiveContext()

# Create and run the ExampleGen component
example_gen = tfx.components.CsvExampleGen(input_base=_data_root)
context.run(example_gen)

# Create and run the StatisticsGen component
statistics_gen = tfx.components.StatisticsGen(
    examples=example_gen.outputs['examples'])
context.run(statistics_gen)

# Create and run the SchemaGen component
schema_gen = tfx.components.SchemaGen(
    statistics=statistics_gen.outputs['statistics'],
    infer_feature_shape=False)
context.run(schema_gen)


## Schema curation custom component

The schema curation component changes `tips` into an `optional` feature.

The code that modifies the schema lives in the user-supplied `schema_fn` in `module_file.py`.
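
The contents of `module_file.py` are not shown in this notebook, so the following is only a minimal sketch of what such a `schema_fn` might look like. It assumes that the function receives the inferred schema and returns the curated one, and that a feature counts as required when its `presence.min_fraction` is 1.0, as in TensorFlow Metadata schemas. The `SimpleNamespace` objects are stand-ins that mimic only the fields used here, so the snippet runs without TFX installed; in the pipeline the argument would be the schema produced by `SchemaGen`.

```python
from types import SimpleNamespace


def schema_fn(schema):
    """Relax the presence constraint on `tips` and return the schema.

    Assumed signature: takes the inferred schema, returns the curated one.
    """
    for feature in schema.feature:
        if feature.name == "tips":
            # Assumption: min_fraction == 1.0 means "required";
            # lowering it makes the feature optional.
            feature.presence.min_fraction = 0.0
    return schema


# Stand-in schema for a local demonstration (hypothetical, not a real proto).
demo_schema = SimpleNamespace(feature=[
    SimpleNamespace(name="fare", presence=SimpleNamespace(min_fraction=1.0)),
    SimpleNamespace(name="tips", presence=SimpleNamespace(min_fraction=1.0)),
])
curated = schema_fn(demo_schema)
```

Only `tips` is touched; every other feature keeps the presence constraint that `SchemaGen` inferred.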


### Display inferred schema

In the inferred schema, the `tips` feature is shown as a `required` feature:


      tips | FLOAT | required | single

In [None]:
# Display the inferred schema
context.show(schema_gen.outputs['schema'])

### Modifying schema 

In [None]:
# Create the schema curation component
schema_curation = component.SchemaCuration(
    schema=schema_gen.outputs['schema'],
    module_file='module_file.py')
context.run(schema_curation)

### Display modified schema

The `tips` feature is now `optional` in the modified schema.

In [None]:
context.show(schema_curation.outputs['custom_schema'])