<a href="https://colab.research.google.com/github/soerenml/tfx/blob/main/TFX_intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ML Ops with TFX

Before turning to the main code, we should define MLops and explain why it matters. Hapke and Nelson (2020) compare MLops to Ford's invention of scalable production: decompose the production of a product into several scalable steps.

In 2020, ML is still often done ad hoc in notebooks that lack lineage and classical software engineering rigour. Several authors have stressed that machine learning engineering differs from classical software engineering because the data (i.e. the environment) is in constant flux.

This notebook is based on Hapke and Nelson (2020): it paraphrases their findings and applies their code samples to my own problem.

## Why MLops?

According to Hapke and Nelson (2020), MLops is needed to:

+ standardize and encapsulate processes,
+ save costs,
+ prevent bugs,
+ focus on innovation, not maintenance,
+ be GDPR compliant.

All this is achieved through **automation**. The end goal is a fully automated pipeline with complete lineage. To achieve this, Hapke and Nelson (2020) define ten distinct steps in a machine learning pipeline:

### 1. Data ingestion
Data ingestion does not process the incoming data; it focuses purely on ingestion. Nevertheless, this step can become quite sophisticated depending on the type (batch or stream) and size of the data.



In [7]:
!pip install tfx



In [21]:
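import tensorflow as tf
import tfx

# Print the installed package versions.
# Use tfx.__version__ for the TFX line; passing tf.__version__ twice is what
# makes the recorded output below show 2.3.0 for both packages.
print("""
Package versions\n
Tensorflow: {tf}
TFX: {tfx}""".format(
    tf=tf.__version__,
    tfx=tfx.__version__))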


Package versions

Tensorflow: 2.3.0
TFX: 2.3.0


## 0. Set up project infrastructure

In [49]:
import os
import shutil

# Recreate the working folders from scratch so that every run starts clean.
# './Metadata' is the pipeline root used by the InteractiveContext below
# (the name is case-sensitive on Linux); './data' holds the raw CSV input.
for folder in ('./Metadata', './data'):
    if os.path.exists(folder):
        shutil.rmtree(folder)
    os.makedirs(folder)

In [50]:
%%bash
cd data
wget 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'

--2020-12-13 20:15:53--  https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 60302 (59K) [text/plain]
Saving to: ‘titanic.csv’

     0K .......... .......... .......... .......... .......... 84% 6.85M 0s
    50K ........                                              100% 10.8M=0.008s

2020-12-13 20:15:53 (7.25 MB/s) - ‘titanic.csv’ saved [60302/60302]
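Before handing the file to TFX, it can help to confirm that the download is a readable CSV. The following is a minimal sketch using only the standard library; the helper name `summarize_csv` is an assumption introduced for illustration, and the path is taken from the download cell above.

```python
import csv
import os


def summarize_csv(path):
    """Return the header row and the number of data rows of a CSV file."""
    with open(path, newline='') as f:
        reader = csv.reader(f)
        header = next(reader, [])          # first row holds the column names
        n_rows = sum(1 for _ in reader)    # remaining rows are data rows
    return header, n_rows


# Only run the check if the file from the download cell is present.
csv_path = './data/titanic.csv'
if os.path.exists(csv_path):
    header, n_rows = summarize_csv(csv_path)
    print('Columns:', header)
    print('Data rows:', n_rows)
```

The check deliberately does no parsing or cleaning; in line with the ingestion step described earlier, it only verifies that there is something to ingest.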



## 2. Define metadata

TFX offers different options for storing metadata:

+ in-memory,
+ SQLite,
+ MySQL.

In this notebook I am going to use <code>interactive pipelines</code>.
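The three storage options correspond to different ML Metadata connection settings. The configuration sketch below assumes the `ml-metadata` package that is installed alongside TFX; the SQLite path and all MySQL values are placeholders chosen for illustration, not settings used in this notebook.

```python
from ml_metadata.proto import metadata_store_pb2

# Option 1: in-memory store; its contents are lost when the process ends.
in_memory_config = metadata_store_pb2.ConnectionConfig()
in_memory_config.fake_database.SetInParent()

# Option 2: SQLite file (the path is an assumption for this example).
sqlite_config = metadata_store_pb2.ConnectionConfig()
sqlite_config.sqlite.filename_uri = './Metadata/metadata.sqlite'
sqlite_config.sqlite.connection_mode = 3  # READWRITE_OPENCREATE

# Option 3: MySQL server (all values below are placeholders).
mysql_config = metadata_store_pb2.ConnectionConfig()
mysql_config.mysql.host = 'localhost'
mysql_config.mysql.port = 3306
mysql_config.mysql.database = 'mlmd'
mysql_config.mysql.user = 'mlmd_user'
mysql_config.mysql.password = 'change-me'
```

Such a config can, as far as I know, be handed to `InteractiveContext` through its `metadata_connection_config` argument; when that argument is omitted, the context sets up its own store under the pipeline root.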



In [39]:
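from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext

# The interactive context runs components one at a time inside the notebook
# and stores their artifacts under pipeline_root.
context = InteractiveContext(
    pipeline_name='',
    pipeline_root='./Metadata')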



In [53]:
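from tfx.components import CsvExampleGen
from tfx.utils.dsl_utils import external_input

# Point the component at the folder that holds the CSV file(s).
examples = external_input("./data")
# CsvExampleGen ingests the CSV data and writes it out as train and eval splits.
example_gen = CsvExampleGen(input=examples)
# Execute the component inside the interactive context.
context.run(example_gen)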



Execution result:

| Field | Value |
|---|---|
| `.execution_id` | 3 |
| `.component` | CsvExampleGen at 0x7f5e331664a8 |
| `.component.inputs` | {} |
| `.component.outputs['examples']` | Channel of type 'Examples' (1 artifact) at 0x7f5e32dd4ac8 |

Output artifact `['examples']._artifacts[0]` (Artifact of type 'Examples' at 0x7f5e32dd4b00):

| Field | Value |
|---|---|
| `.type` | `<class 'tfx.types.standard_artifacts.Examples'>` |
| `.uri` | ./Metadata/CsvExampleGen/examples/3 |
| `.span` | 0 |
| `.split_names` | ["train", "eval"] |
| `.version` | 0 |

Execution properties (`.exec_properties`):

| Field | Value |
|---|---|
| `['input_base']` | ./data |
| `['input_config']` | `{"splits": [{"name": "single_split", "pattern": "*"}]}` |
| `['output_config']` | `{"split_config": {"splits": [{"hash_buckets": 2, "name": "train"}, {"hash_buckets": 1, "name": "eval"}]}}` |
| `['output_data_format']` | 6 |
| `['custom_config']` | None |
| `['range_config']` | None |
| `['span']` | 0 |
| `['version']` | None |
| `['input_fingerprint']` | split:single_split,num_files:1,total_bytes:60302,xor_checksum:1607890553,sum_checksum:1607890553 |
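The `output_config` in the execution properties above assigns 2 hash buckets to `train` and 1 to `eval`, which amounts to a split of roughly two thirds to one third. The sketch below illustrates the general idea of hash-bucket splitting with the standard library only. It is not TFX's implementation: the function name `assign_split`, the use of MD5, and the synthetic `row-i` records are all assumptions made for illustration.

```python
import hashlib


def assign_split(record, splits=(("train", 2), ("eval", 1))):
    """Map a record to a split name using a stable hash of its content."""
    total_buckets = sum(n for _, n in splits)
    # A content hash is stable across runs, so a record always lands in the
    # same split (MD5 is an illustrative choice, not what TFX uses).
    digest = hashlib.md5(record.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % total_buckets
    upper = 0
    for name, n_buckets in splits:
        upper += n_buckets
        if bucket < upper:
            return name
    raise AssertionError("bucket index outside of the configured splits")


# Synthetic records standing in for CSV rows.
rows = ["row-{}".format(i) for i in range(3000)]
counts = {"train": 0, "eval": 0}
for row in rows:
    counts[assign_split(row)] += 1
print(counts)
```

Because the assignment depends only on the record content, re-running the split on the same data reproduces the same partition, which is what makes this approach attractive for reproducible pipelines.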
