# Part 1: Basic Usages of ml.Metadata

Instructions to play with the demo
* Step 1: copy this notebook.
* Step 2: Connect to ML Metadata Demo by tfx-dev 
>* Step 2, Option 2: if this doesn't work, join the MDB group tfx-dev.
>*  Step 2, Option 3: If this doesn't work, run your own instance (for 20 instances).
>>*  Patch CL/223555619
>>*  change user in third_party/ml_metadata/google/demo/colab_pool.borg
>>*  blaze build -c opt third_party/ml_metadata/google/demo:notebook.par
>>*  borgcfg third_party/ml_metadata/google/demo/colab_pool.borg reload




* Step 3: Run the code blocks below.

## Import Packages

In [0]:
from colabtools import adhoc_import
from google3.file.base import pywrapfile
from google3.file.recordio.python import recordio
import tensorflow_data_validation as tfdv
import os
import time

from google3.third_party.ml_metadata.metadata_store import metadata_store
from google3.third_party.ml_metadata.proto import metadata_store_pb2
from google3.third_party.tensorflow_metadata.proto.v0 import statistics_pb2
from google3.third_party.tensorflow_metadata.proto.v0 import schema_pb2

## Interaction with a Metadata Store
### [ConnectionConfig](https://cs.corp.google.com/piper///depot/google3/third_party/ml_metadata/proto/metadata_store.proto?rcl=224366994&l=219) provides the options to use a list of physical backend for storing the metadata in a **MetadataStore**.
- Fake (in memory db)
- SQLite (db file)
- MySql (db server)

In [0]:
# Use Connection Config to create a store
connection_config = metadata_store_pb2.ConnectionConfig()
connection_config.fake_database.SetInParent()
store = metadata_store.MetadataStore(connection_config)

### [MetadataStore](https://cs.corp.google.com/piper///depot/google3/third_party/ml_metadata/metadata_store/metadata_store.py?rcl=224393676&l=34) contains a list of APIs to create and manipulate metadata. The main concepts are:
- Types: 
  * It defines the concepts of possible things in a pipeline, such as components and generated files.
  * One can define **ArtifactType** to describe a set of files, for instance, in the current TFX:
    - Data can be viewed as a type of Artifact with properties such as _span_, _split_, _version_ 
    - Stats, Schema are other types of Artifact.     
  * Similarly, one can define **ExecutionType** to describe a set of similar component's run, e.g.,:
    - StatsGen's run is a type of Execution, always uses some file in Data type and generates files in Stats type
    - DataValidator may run in a mode of generating a file in Schema type by looking at a Stats typed file. In addition it can run in a mode to validate a Stats typed file by given a Schema typed file, and generates Anomaly typed file. 
  * with the type associated with a pipeline artifact, it is more than a file stored as PPPs. The pipeline's description and execution history can be captured in a structured way, and used for later purposes, such as provenance tracking, analyzing errors in a run, and even auditing against policies. 
- **Artifact** and **Execution**:
 * with the types, when the pipeline runs, the actual files, and component runs happen. 
 * Metadata Store allows to ingest the component runs' history, generated files including their physical locations, and related properties.
 * It provides transactional methods and allows orchestration relies on the atomicity of Metadata Store operatoins.
- **Event**
 * describes the relationships between Artifacts and Executions, such as Input and Output.
 
 
Next code block, we illustrate how to create and query those major concepts in the metadata store. 

### a) Before pipeline run, register artifact and execution types 

In [0]:
# Create ArtifactTypes, e.g., Data and Schema
data_type = metadata_store_pb2.ArtifactType()
data_type.name = "DataSet"
data_type.properties["span"] = metadata_store_pb2.INT
data_type.properties["split"] = metadata_store_pb2.STRING
data_type.properties["version"] = metadata_store_pb2.INT
data_type_id = store.put_artifact_type(data_type)

stats_type = metadata_store_pb2.ArtifactType()
stats_type.name = "Statistics"
stats_type.properties["state"] = metadata_store_pb2.STRING
stats_type_id = store.put_artifact_type(stats_type)

# Create ExecutionTpye, e.g., StatsGen
statsgen_type = metadata_store_pb2.ExecutionType()
statsgen_type.name = "StatsGen"
statsgen_type.properties["state"] = metadata_store_pb2.STRING
statsgen_type_id = store.put_execution_type(statsgen_type)

### b) During pipeline run, track component run status, generated artifacts, and their lineage
1) Let's prepare some data first. **BE SURE TO CHANGE THE `suffix` variable below to your LDAP**:

In [0]:
suffix = 'test_sandbox' # 'martinz_2' # Add your LDAP here, or your_ldap_2 if you want to start fresh
# Prepare data
BASE_DIR = "/tmp/ml_metadata_demo_" + suffix + "/"
train_path = BASE_DIR + "train"
test_path = BASE_DIR + "test_path"

!fileutil mkdir -p $BASE_DIR
!fileutil cp -f /cns/is-d/home/mingzhong/bug_party/training_10k.tfrecord $train_path
!fileutil cp -f /cns/is-d/home/mingzhong/bug_party/test.tfrecord $test_path

training_10k.tfrecord 100% |Goooooooooooogle|   8.90M   13.03M/s Time: 00:00:00
test.tfrecord 100% |Goooooooooooooooooooogle|  887.7K    2.99M/s Time: 00:00:00


2) Let's track these two train/test files in the metadata store using the Data type we defined

In [0]:
# During a StatsGen run, tracking the input/output files.
def publish_data_artifact(store, uri, span, split, version):
  data_artifact = metadata_store_pb2.Artifact()
  data_artifact.uri = uri
  data_artifact.properties["span"].int_value = span
  data_artifact.properties["split"].string_value = split
  data_artifact.properties["version"].int_value = version
  data_artifact.type_id = data_type_id
  [artifact_id] = store.put_artifacts([data_artifact])
  return artifact_id

train_data_id = publish_data_artifact(store, train_path, 0, "TRAIN", 0)
train_id = train_data_id # TODO: remove this
test_data_id = publish_data_artifact(store, test_path, 0, "TEST", 0)
test_id = test_data_id # TODO: remove this

3) Let's check whether the two artifacts are actually stored properly

_Note_: the metadata store APIs returns stored metadata model as protos defined in [metadata_store.proto](https://cs.corp.google.com/piper///depot/google3/third_party/ml_metadata/proto/metadata_store.proto).

In [0]:
all_artifacts = store.get_artifacts()
print all_artifacts

[id: 1
type_id: 1
uri: "/tmp/ml_metadata_demo_test_sandbox/train"
properties {
  key: "span"
  value {
    int_value: 0
  }
}
properties {
  key: "split"
  value {
    string_value: "TRAIN"
  }
}
properties {
  key: "version"
  value {
    int_value: 0
  }
}
, id: 2
type_id: 1
uri: "/tmp/ml_metadata_demo_test_sandbox/test_path"
properties {
  key: "span"
  value {
    int_value: 0
  }
}
properties {
  key: "split"
  value {
    string_value: "TEST"
  }
}
properties {
  key: "version"
  value {
    int_value: 0
  }
}
]


4) Next let's run the StatsGen component using tfdv, and we illustrate metadata ingestions calls for future components. Please take a closer look at the inline comments in the following code blocks. 

In [0]:
# a) To start the component run, the caller (orchestration engine) use the  
#    metadata store to get the location of the artifact. 
[training_data] = store.get_artifacts_by_id([train_data_id])

# b) In this case, the tfdv statsgen works on the training data. We find it from 
#    the metadata store, and passes its location(uri) to tfdv statsgen.
stats_file = tfdv.generate_statistics_from_tfrecord(training_data.uri)

In [0]:
# c) When component publishes its Artifact (`stats_filepath`), the underline 
#    implementation and uses the metadata store to publish the file as an 
#    and Artifact, so that it will be visiable to the downstream components.
def publish_stats(statsgen_output, file_uri, user_properties):
  # when the output is ready, create a unpublished artifact
  stats_artifact = metadata_store_pb2.Artifact()
  stats_artifact.uri = file_uri
  stats_artifact.type_id = stats_type_id
  for name, value in user_properties.items():
    stats_artifact.custom_properties[name].string_value = value
  stats_artifact.properties["state"].string_value = "UNPUBLISHED"
  # register it to database, and then write to file 
  [stats_artifact_id] = store.put_artifacts([stats_artifact])
  # writing to disk, so even if it fails, the file can still be GCed
  with recordio.RecordWriter(file_uri, "a") as output_file:
    output_file.WriteRecord(statsgen_output.SerializeToString())
  # once it finishes writing to disk, we update its status to COMPLETE so that
  # the following components can use it now. 
  stats_artifact.id = stats_artifact_id
  stats_artifact.properties["state"].string_value = "COMPLETE"
  return store.put_artifacts([stats_artifact])[0]
  
# Note while publishing more user properties can be attached. 
train_stats_id = publish_stats(stats_file, BASE_DIR + "train_stats.pbtxt", \
                 {"comment": "generated status for demo day"})

# Let's check the stored artifact
store.get_artifacts_by_id([train_stats_id])

[id: 3
 type_id: 2
 uri: "/tmp/ml_metadata_demo_test_sandbox/train_stats.pbtxt"
 properties {
   key: "state"
   value {
     string_value: "COMPLETE"
   }
 }
 custom_properties {
   key: "comment"
   value {
     string_value: "generated status for demo day"
   }
 }]

5) In addition to keeping track of artifacts using the metadata store, the runs of components can be captured by Executions. Futhermore, the Input/Output lineage of component runs and their dependent artifacts can be captured by Events in the metadata. We still use the above StatsGen example to show how to use MetadataStore APIs to connect the dots.

In [0]:
# Tracking runs as executions in MetadataStore
# Illustrating possible state transitions in a component run. 
def component_run_with_metadata(execution_type_id, input_id, output_id):
  # 1. component begins, register the run in the metadata store
  component_run = metadata_store_pb2.Execution()
  component_run.type_id = execution_type_id;
  component_run.properties["state"].string_value = "RUNNING"
  [run_id] = store.put_executions([component_run])
  # 2. declare the artifact will be processed in the run
  input_event = metadata_store_pb2.Event()
  input_event.artifact_id = input_id
  input_event.execution_id = run_id 
  input_event.type = metadata_store_pb2.Event.DECLARED_INPUT
  store.put_events([input_event])  
  # 3. component starts 
  # ... finished reading the artifact from the storage
  #     then it marks it as real input, and start processing
  input_event.type = metadata_store_pb2.Event.INPUT
  store.put_events([input_event])  
  # ... processing the input
  # ... almost finished, before writing to disk
  #     it declares an output, first create Artifact, get its `output_id`
  output_event = metadata_store_pb2.Event()
  output_event.artifact_id = output_id
  output_event.execution_id = run_id 
  output_event.type = metadata_store_pb2.Event.DECLARED_OUTPUT
  store.put_events([output_event]) 
  # 4. component publishes the output
  # ... write to disk
  #     then change output artifact to COMPLETE, and add an output event
  output_event.type = metadata_store_pb2.Event.OUTPUT
  store.put_events([output_event]) 
  # 5. component finishes, updated its Executionn
  component_run.id = run_id
  component_run.properties["state"].string_value = "COMPLETED"
  return store.put_executions([component_run])[0]


run_id = component_run_with_metadata(statsgen_type_id, train_data_id, train_stats_id)
print "The StatsGen Run in MetadataStore: \n"
print store.get_executions_by_id([run_id])
print "\nThe Associated Events of that StatsGen Run: \n"
for e in store.get_events_by_execution_ids([run_id]):
  print e
  print 

The StatsGen Run in MetadataStore: 

[id: 1
type_id: 3
properties {
  key: "state"
  value {
    string_value: "COMPLETED"
  }
}
]

The Associated Events of that StatsGen Run: 

artifact_id: 1
execution_id: 1
type: DECLARED_INPUT
milliseconds_since_epoch: 1544235439831


artifact_id: 1
execution_id: 1
type: INPUT
milliseconds_since_epoch: 1544235439831


artifact_id: 3
execution_id: 1
type: DECLARED_OUTPUT
milliseconds_since_epoch: 1544235439831


artifact_id: 3
execution_id: 1
type: OUTPUT
milliseconds_since_epoch: 1544235439831




# Part 2: ML Metadata in the Chicago Taxi notebook




First, **you need to run the code above before running the code below**.


We begin with a few methods that represent a barebones orchestration system. Here, executions have one input and output. We have also defined some basic types. 

In [0]:
# This cell basically defines a very primitive orchestration system, where
# executions have one input and output.
# decent support for RecordIO proto artifacts.

# Notice that this cell is defining the types in the store. You can define
# whatever types with whatever properties you want.

def create_metadata_store():
  """Make a fake, local metadata store."""
  connection_config = metadata_store_pb2.ConnectionConfig()
  connection_config.fake_database.SetInParent()
  return metadata_store.MetadataStore(connection_config)

def get_schema_type(store):
  """Gets the schema type ID, or creates one if it doesn't exist."""
  artifact_type = metadata_store_pb2.ArtifactType()
  artifact_type.name = "Schema"
  artifact_type.properties["version"] = metadata_store_pb2.INT
  return store.put_artifact_type(artifact_type)

def get_data_type(store):
  """Gets the data type ID, or creates one if it doesn't exist."""
  artifact_type = metadata_store_pb2.ArtifactType()
  artifact_type.name = "Data"
  artifact_type.properties["span"] = metadata_store_pb2.INT
  artifact_type.properties["split"] = metadata_store_pb2.STRING
  artifact_type.properties["version"] = metadata_store_pb2.INT
  return store.put_artifact_type(artifact_type)

def get_stats_type(store):
  """Gets the stats type ID, or creates one if it doesn't exist."""
  artifact_type = metadata_store_pb2.ArtifactType()
  artifact_type.name = "Stats"
  return store.put_artifact_type(artifact_type)

def get_stats_gen_type(store):
  """Gets the type of a Stats execution."""
  execution_type = metadata_store_pb2.ExecutionType()
  execution_type.name = "Stats"
  return store.put_execution_type(execution_type)

def get_infer_schema_type(store):
  """Gets the type of a Stats execution."""
  execution_type = metadata_store_pb2.ExecutionType()
  execution_type.name = "InferSchema"
  return store.put_execution_type(execution_type)

def get_stats_gen_execution(store):
  """Returns a local stats gen execution object.
  
  The result can be put in the database, or used in is_already_run below.
  """
  execution = metadata_store_pb2.Execution()
  execution.type_id = get_stats_gen_type(store)
  return execution

def get_infer_schema_execution(store):
  """Returns a local stats gen execution object.
  
  The result can be put in the database, or used in is_already_run below.
  """
  execution = metadata_store_pb2.Execution()
  execution.type_id = get_infer_schema_type(store)
  return execution


##### A light API on top of metadata store. ####################################

def put_event(store, execution_id, artifact_id, is_input):
  """Commits a single event to the repository."""
  event = metadata_store_pb2.Event()
  event.artifact_id = artifact_id
  event.execution_id = execution_id
  event.type = metadata_store_pb2.Event.DECLARED_INPUT if is_input else metadata_store_pb2.Event.DECLARED_OUTPUT
  store.put_events([event])

def publish_execution(store, execution, input_artifact_ids, output_artifact_ids):
  [execution_id] = store.put_executions([execution])
  # This can also be done as a single transaction.
  # Note: paths on inputs and outputs are coming soon!
  # Exercise: have this method call put_events(...) once, so that all events
  # are completed as a single transaction.
  for x in input_artifact_ids:
    put_event(store, execution_id, x, True)
  for x in output_artifact_ids:
    put_event(store, execution_id, x, False)


# An example of how to use provenance to drive orchestration.
def get_old_execution_id_or_none(store, execution, input_artifact_id):
  """Test if something was already run."""
  # Find events with input_artifact_id.
  events = store.get_events_by_artifact_ids([input_artifact_id])
   # Get the events where it was an official input.
  input_events = filter(lambda e:e.type==metadata_store_pb2.Event.DECLARED_INPUT, events)
  if not input_events:
    return None
  # Get the executions corresponding to those inputs.
  old_executions = store.get_executions_by_id([e.execution_id for e in input_events])

  for old in old_executions:
    # Also a good idea to check that the properties are equal.
    if old.type_id == execution.type_id:
      return old.id
  return None


# Tests if an execution has already been run.
def is_already_run(store, execution, input_artifact_id):
  """Test if something was already run."""
  return get_old_execution_id_or_none(store, execution, input_artifact_id) is not None

# Get outputs of a previous execution.
def get_outputs(store, execution, input_artifact_id):
  """Find any outputs from an earlier, similar execution."""
  old_execution_id = get_old_execution_id_or_none(store, execution, input_artifact_id) is not None
  if old_execution_id is None:
    return None

  # Find events with input_artifact_id.
  events = store.get_events_by_execution_ids([old_execution_id])
  # Get the events where it was an official output.
  output_events = filter(lambda e:e.type==metadata_store_pb2.Event.DECLARED_OUTPUT, events)
  return [e.artifact_id for e in output_events]



#### A fake stats execution.""""""

def run_fake_stats(store, data_artifact_id):
  """A fake stats run."""
  stats_gen_type = get_stats_gen_type(store)
  [data_artifact] = store.get_artifacts_by_id([data_artifact_id])

  stats_artifact = metadata_store_pb2.Artifact()
  stats_artifact.type_id = get_stats_type(store)
  stats_gen_execution = get_stats_gen_execution(store)
  [stats_artifact_id] = store.put_artifacts([stats_artifact])
  publish_execution(store, get_stats_gen_execution(store), [data_artifact_id],
                    [stats_artifact_id])
  return stats_artifact_id



def pretty_print_artifact(artifact, type_names):
  result = ("{Type: " + type_names[artifact.type_id] + 
            ", ID: " + str(artifact.id) +
            " uri: " + artifact.uri + "{")
  
  for k,v in artifact.properties.items():
    if v.HasField("int_value"):
      result += k + ":" + str(v.int_value) + ","
    if v.HasField("string_value"):
      result += k + ":" + v.string_value + ","
  result += "}}"

def display_all_artifacts(store):
  all_artifacts = store.get_artifacts()

  # This serves as a cache: we could also query the database directly
  # for the names of the types of the artifacts.
  type_names = {get_schema_type(store):"Schema",
               get_stats_type(store):"Stats",
               get_data_type(store):"Data"}


  # Displays all the artifacts.
  for artifact in all_artifacts:
    pretty_print_artifact(artifact, type_names)

def display_all_executions(store):
  for execution in store.get_executions():
    print(execution)


We can also introduce methods for reading and writing various types of artifacts. These can be either general (as in publish_proto and get_proto) or specific (as in publish_stats or get_schema).

In [0]:
# Here are some methods for publishing and getting protos from artifacts.

def publish_proto(store, artifact, message):
  """Serializes a proto."""
  artifact_copy = metadata_store_pb2.Artifact()
  artifact_copy.type_id = artifact.type_id
  [artifact_id] = store.put_artifacts([artifact_copy])
  # Note the non-atomicity here. This can be solved with a state property.
  artifact_copy.MergeFrom(artifact)
  artifact_copy.uri = BASE_DIR + str(artifact_id)
  artifact_copy.id = artifact_id
  # TODO(martinz): write the proto to disk.
  with recordio.RecordWriter(artifact_copy.uri, "a") as output_file:
    output_file.WriteRecord(message.SerializeToString())
  # Commit the whole artifact. Note that any properties that are changed are
  # updated, and any that are removed are deleted.
  store.put_artifacts([artifact_copy])
  return artifact_id

# An example of generic code that can run on top of metadata.
def get_proto(store, artifact_id, message):
  """Deserializes a proto, given an artifact ID."""
  [artifact] = store.get_artifacts_by_id([artifact_id])
  with recordio.RecordReader(artifact.uri) as reader:
      # TODO(martinz): check that you read SOMETHING.
      for record in reader:
        message.ParseFromString(record)

def publish_stats(store, stats_proto):
  """Because stats has no properties except provenance, we can publish it directly."""
  artifact = metadata_store_pb2.Artifact()
  artifact.type_id = get_stats_type(store)
  return publish_proto(store, artifact, stats_proto)

def get_stats(store, stats_id):
  result = statistics_pb2.DatasetFeatureStatisticsList()
  get_proto(store, stats_id, result)
  return result

def publish_schema(store, schema_proto):
  """Here, we are ignoring the version."""
  artifact = metadata_store_pb2.Artifact()
  artifact.type_id = get_schema_type(store)
  return publish_proto(store, artifact, schema_proto)

def get_schema(store, stats_id):
  result = schema_pb2.Schema()
  get_proto(store, stats_id, result)
  return result

def run_real_stats(store, data_id):
  [data_artifact] = store.get_artifacts_by_id([data_id])
  output_stats = tfdv.generate_statistics_from_tfrecord(data_artifact.uri)
  
  output_stats_id = publish_stats(store, output_stats)
  
  publish_execution(store, get_stats_gen_execution(store), [data_id], [output_stats_id])
  return output_stats_id


def visualize_statistics(store, stats_id):
  tfdv.visualize_statistics(get_stats(store, stats_id))

def infer_schema(store, stats_id):
  # TODO: check if this has already been run.
  
  schema_id = publish_schema(store, tfdv.infer_schema(get_stats(store, stats_id)))
  
  publish_execution(store, get_infer_schema_execution(store), [stats_id], [schema_id])
  return schema_id

def display_schema(store, schema_id):
  tfdv.display_schema(get_schema(store, schema_id))  


In [0]:
# No requirement to have properties correspond to their locations on disk.
train_id = publish_data_artifact(store, train_path, 0, "TRAIN", 0)
test_id = publish_data_artifact(store, test_path, 0, "TEST", 0)


This notebook describes how to explore and validate Chicago Taxi dataset using TensorFlow Data Validation.

## Memoization

Here, we are running TFDV's generate_statistics_from_tf_record. However, we have the capacity to return a cached result if we have run statistics before. Note that this directly relies on provenance, as opposed to directory structure.

In [0]:
def run_with_cache(store, data_id):
  if is_already_run(store, get_stats_gen_execution(store), data_id):
    print("returning cached result")
    [result] = get_outputs(store, get_stats_gen_execution(store), data_id)
    return result
  else:
    print("returning real result")
    return run_real_stats(store, data_id)

# Uncomment this to run with or without caching.
# Notice how only one cached result is returned.
stats_id = run_with_cache(store, train_id)
# stats_id = run_real_stats(store, train_id)
visualize_statistics(store, stats_id)

returning real result


If you want to play around with this, you can:
* Add a property to the data artifact type that identifies the type of input.
* When run_real_stats is run, modify it to call generate_statistics_from_tf_record or generate_statistics_from_csv.

In [0]:
schema_id = infer_schema(store, stats_id)

In general, TFDV uses conservative heuristics to infer stable data properties
from the statistics in order to avoid overfitting the schema to the specific
dataset. It is strongly advised to **review the inferred schema and refine
it as needed**, to capture any domain knowledge about the data that TFDV's
heuristics might have missed.

## Visualizing Artifacts

In addition, we want to be able to visualize artifacts.

In [0]:
display_schema(store, schema_id)

Unnamed: 0_level_0,Type,Presence,Valency,Domain
Feature name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
'trip_id',BYTES,required,,-
'tips',FLOAT,required,,-
'community_areas',BYTES,optional,single,"(-inf,inf)"
'miles',FLOAT,required,,-
'end',BYTES,optional,single,-
'pu_longitude',FLOAT,optional,single,-
'taxi_id',BYTES,required,,-
'pay_type',STRING,required,,'pay_type'
'do_com_area',BYTES,optional,single,"(-inf,inf)"
'total',FLOAT,required,,-


Unnamed: 0_level_0,Values
Domain,Unnamed: 1_level_1
'pay_type',"'Cash', 'Credit Card', 'Dispute', 'No Charge', 'Pcard', 'Prcard', 'Unknown'"
'company',"'0118 - 42111 Godfrey S.Awir', '1085 - 72312 N and W Cab Co', '1085 - N and W Cab Co', '1247 - 72807 Daniel Ayertey', '2092 - 61288 Sbeih company', '2733 - 74600 Benny Jona', '2733 - Benny Jona', '2809 - 95474 C & D Cab Co Inc.', '3141 - 87803 Zip Cab', '3152 - 97284 Crystal Abernathy', '3201 - C&D Cab Co Inc', '3385 - Eman Cab', '3591 - 63480 Chuks Cab', '3620 - 52292 David K. Cab Corp.', '3623 - 72222 Arrington Enterprises', '3897 - 57856 Ilie Malec', '4053 - Adwar H. Nikola', '4197 - 41842 Royal Star', '4197 - Royal Star', '4615 - 83503 Tyrone Henderson', '4615 - Tyrone Henderson', '4623 - 27290 Jay Kim', '4623 - Jay Kim', '5006 - 39261 Salifu Bawa', '5129 - 87128', '5724 - 75306 KYVI Cab Inc', '585 - 88805 Valley Cab Co', '5874 - 73628 Sergey Cab Corp.', '6488 - 83287 Zuha Taxi', '6747 - Mueen Abdalla', 'Blue Ribbon Taxi Association Inc.', 'Chicago Elite Cab Corp.', 'Chicago Elite Cab Corp. (Chicago Carriag', 'Chicago Medallion Leasing INC', 'Chicago Medallion Management', 'Choice Taxi Association', 'Dispatch Taxi Affiliation', 'KOAM Taxi Association', 'Northwest Management LLC', 'Taxi Affiliation Services', 'Top Cab Affiliation'"


In [0]:
# Compute stats over eval data.
test_stats_id = run_with_cache(store, test_id)

# We have written a method for visualizing statistics.
visualize_statistics(store, test_stats_id)

returning real result


In [0]:
# Compare stats of eval data with training data.

# Here, we are grabbing protos from their ids, and passing them into a method.

tfdv.visualize_statistics(lhs_statistics=get_stats(store, test_stats_id), rhs_statistics=get_stats(store, stats_id),
                          lhs_name='TEST_DATASET', rhs_name='TRAIN_DATASET')

## After the pipeline run, the users can use the metadata store to query artifacts, e.g., locations, properties, lineages.

For example, the following block uses the lineage from the data to find derived artifacts within 1-hop. 

In [0]:
def find_data(path):
  for artifact in store.get_artifacts():
    if artifact.uri == path:
      print "  type: ", artifact.type_id
      print "  uri: ", artifact.uri
      print "  span: ", artifact.properties["span"].int_value
      print "  split: ", artifact.properties["split"].string_value
      print "  version: ", artifact.properties["version"].int_value
      return artifact.id
  return -1
  
# 1. find the artifact
print "Querying 1-hop dependencies of the training data:"
print "Found training data:"
artifact_id = find_data(train_path)

# 2. find the component runs which used this artifact
print "Used by the following executions as input:"
execution_ids = []
for event in store.get_events_by_artifact_ids([artifact_id]):
  if event.type == metadata_store_pb2.Event.INPUT:
    execution_ids.append(event.execution_id)
    print "  - execution id: ", event.execution_id, \
          " (occured at:", time.ctime(int(event.milliseconds_since_epoch/1000)), ")"

# 3. find the output artifacts by those component runs
print "The following artifacts are derived from it:"
derived_artifact_ids = []
for event in store.get_events_by_execution_ids(execution_ids):
  if event.type == metadata_store_pb2.Event.OUTPUT:
    derived_artifact_ids.append(event.artifact_id)
    print "  - artifact id: ", event.artifact_id, " by execution id: ", event.execution_id
    
# 4. list the artifacts, if it is a Stats type, output it
found_latest_stats_artifact = ""
print "Artifact details :"
for artifact in store.get_artifacts_by_id(derived_artifact_ids):
  print "  path: ", artifact.uri
  print "  is Stats Type: ", artifact.type_id == stats_type_id
  found_latest_stats_artifact = artifact

Querying 1-hop dependencies of the training data:
Found training data:
  type:  1
  uri:  /tmp/ml_metadata_demo_test_sandbox/train
  span:  0
  split:  TRAIN
  version:  0
Used by the following executions as input:
  - execution id:  1  (occured at: Fri Dec  7 18:17:19 2018 )
The following artifacts are derived from it:
  - artifact id:  3  by execution id:  1
Artifact details :
  path:  /tmp/ml_metadata_demo_test_sandbox/train_stats.pbtxt
  is Stats Type:  True


## More Than One Input Artifact

Exercise: Make a method that records anomalies. Create a type for anomalies, a type for validate_statistics execution. If time permits, add caching.

Note: at present, we do not have named inputs in events. We plan to fix that very soon. Nonetheless, in this setting you can use the types to distinguish between the stats and the schema.

In [0]:

# Check eval data for errors by validating the eval data stats using the previously inferred schema.
# Note: here we are directly accessing the protos from the metadata and passing them to a method.
# Wee! Multiple layers of abstraction!
anomalies = tfdv.validate_statistics(get_stats(store, test_stats_id), get_schema(store, schema_id))

In [0]:
def get_validate_statistics_type(store):
  raise Unimplemented()

def publish_anomales(store, anomalies):
  # See publish_schema and publish_stats 
  raise Unimplemented()
  
def get_anomalies(store):
  # See get_schema and get_stats

  # Hint: result = anomalies_pb2.Anomalies()
  raise Unimplemented()

def validate_statistics(store, stats_id, schema_id):
  raise Unimplemented()
  # Run validate_statistics, and record the result.
  # Hint: anomalies = tfdv.validate_statistics(get_stats(store, test_stats_id), get_schema(store, schema_id))


def display_anomalies(store, anomalies_id):
  # Hint: tfdv.display_anomalies(anomalies)
  raise Unimplemented()

def get_anomalies_id_for_data(store, data_id):
  """Gets the anomalies for the data, or None if it doesn't exist."""
  # Hint: get the stats generated by the data.
  # Then get the anomalies generated by the model.
  # NOTE: the current checks for an execution only work for executions that
  # have one input. How can they 
  raise Unimplemented()

def display_anomalies_for_data(store, data_id):
  # Hint: get the right anomalies, then display them. Print an error if no
  # anomalies exist.
  raise Unimplemented()


  

# Part 3: Adding Anomalies 

The anomalies indicate that out of domain values were found for features `company` and `payment_type` in the stats in < 1% of the feature values. If this was expected, then the schema can be updated as follows.

In [0]:
anomalies_id = validate_statistics(store, test_stats_id, schema_id)
display_anomalies(anomalies_id)


NameError: ignored

In [0]:


def get_anomalies_id_for_data(store, data_id):
  """Gets the anomalies for the data, or None if it doesn't exist."""
  # Hint: get the stats generated by the data.
  # Then get the anomalies generated by the model.
  # NOTE: the current checks for an execution only work for executions that
  # have one input. How can they 
  raise Unimplemented()

def display_anomalies_for_data(store, data_id):
  # Hint: get the right anomalies, then display them. Print an error if no
  # anomalies exist.
  raise Unimplemented()


In [0]:
# We can also operate directly on the protos.
def update_schema(store, schema_id):
  schema = get_schema(store, schema_id)
  
  # Relax the minimum fraction of values that must come from the domain for feature company.
  company = tfdv.get_feature(schema, 'company')
  company.distribution_constraints.min_domain_mass = 0.9

  # Add new value to the domain of feature payment_type.
  payment_type_domain = tfdv.get_domain(schema, 'payment_type')
  payment_type_domain.value.append('Prcard')
  # Return a new schema.
  return publish_schema(store, schema)

In [0]:
new_schema_id = update_schema(store, schema_id)


updated_anomalies_id = validate_statistics(store, test_stats_id, new_schema_id)

In [0]:
tfdv.display_anomalies(updated_anomalies)

# Part 4: Safeguards
While the system is very flexible, there are a variety of safeguards. For example:

1.   You can't change the type of an artifact or execution when you update it. 
2.   (For now), you can't change a named type's properties.
3.   You can't create an event that points to artifact IDs or executions that don't exist.
4.   You can't set a (regular) property for an artifact where that property is not in the type.





# Part 5: Build Your Own Orchestration System

ML Metadata provides a basic interface for metadata, but it is intentionally not an orchestration system, and there are few limitations in the data model.

1.   Write a class that wraps metadata store and provides more restrictive API to the database.


```
class MyMetadataStore(object):
  def __init__(self):
    self._metadata_store = metadata_store.MetadataStore(...)
```


2.   Add a state property to all the artifacts. Write a query that only selects artifacts where the state is live.
3.   Add a timestamp property to all the artifacts. Write a "current version" query that selects the most recent schema. Hint: make your own create_artifact_type that appends the new properties?
3.   Add a pipeline_id property to all the artifacts. When you create your wrapper class, set a pipeline ID. Write a variant of the API that limits you to only artifacts with the current pipeline ID (e.g., have a new get_artifacts() that calls the existing get_artifacts(), then filters out artifacts without the right ID: have a put_artifacts() that sets the pipeline ID before committing the artifacts, et cetera.)   



