# Lab5: AI Pipeline
### Supporting Diabetes research 
Cloud AI Platform Pipelines provides a way to deploy robust, repeatable machine learning pipelines along with monitoring, auditing, version tracking, and reproducibility, and delivers an enterprise-ready, easy to install, secure execution environment for your ML workflows. 
The goal is to democratize machine learning and to increase development speed by eliminating the need for data movement. In this tutorial, you use the sample dataset for BigQuery.


## Step 1. Set up your environment.

AI Platform Pipelines will prepare a development environment to build a pipeline, and a Kubeflow Pipeline cluster to run the newly built pipeline.

**NOTE:** To select a particular TensorFlow version, or select a GPU instance, create a TensorFlow pre-installed instance in AI Platform Notebooks.

**NOTE:** There might be some errors during package installation. For example: 

>"ERROR: some-package 0.some_version.1 has requirement other-package!=2.0.,<3,>=1.15, but you'll have other-package 2.0.0 which is incompatible." You can ignore these errors for now.


Install `tfx`, `kfp`, and `skaffold`, and add the installation path to the `PATH` environment variable.

In [1]:
# Install tfx and kfp Python packages.
!pip install --user --upgrade -q tfx==0.21.2
!pip install --user --upgrade -q kfp==0.2.5
# Download skaffold and set it executable.
!curl -Lo skaffold https://storage.googleapis.com/skaffold/releases/latest/skaffold-linux-amd64 && chmod +x skaffold && mv skaffold /home/jupyter/.local/bin/

ERROR: google-cloud-storage 1.26.0 has requirement google-resumable-media<0.6dev,>=0.5.0, but you'll have google-resumable-media 0.4.1 which is incompatible.
ERROR: google-cloud-bigquery 1.17.1 has requirement google-resumable-media<0.5.0dev,>=0.3.1, but you'll have google-resumable-media 0.5.1 which is incompatible.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 44.9M  100 44.9M    0     0   102M      0 --:--:-- --:--:-- --:--:--  102M


## 1.1 Set Path

In [5]:
# Set `PATH` to include user python binary directory and a directory containing `skaffold`.
PATH=%env PATH
%env PATH={PATH}:/home/jupyter/.local/bin

env: PATH=/usr/local/cuda/bin:/opt/conda/bin:/opt/conda/condabin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/home/jupyter/.local/bin:/home/jupyter/.local/bin
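The output above lists `/home/jupyter/.local/bin` twice, which happens when the cell is run more than once. This is harmless, but a small guard avoids the repetition. The following is a minimal sketch rather than part of the original notebook: the helper name `append_to_path` is hypothetical, and it assumes the notebook's `!` commands inherit `os.environ`.

```python
import os


def append_to_path(path_value, new_dir):
    """Append new_dir to a PATH-style string unless it is already listed."""
    # Split on the platform separator (":" on Linux) and drop empty entries.
    entries = [entry for entry in path_value.split(os.pathsep) if entry]
    if new_dir not in entries:
        entries.append(new_dir)
    return os.pathsep.join(entries)


# Running this cell repeatedly does not add the directory a second time.
local_bin = os.path.join(os.path.expanduser("~"), ".local", "bin")
os.environ["PATH"] = append_to_path(os.environ.get("PATH", ""), local_bin)
```

Note that this sketch only prevents new duplicates; it does not remove duplicates that are already present in `PATH`.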


## 1.2 Check version

In [6]:
!python3 -c "import tfx; print('TFX version: {}'.format(tfx.__version__))"

TFX version: 0.21.2
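The version check above covers only the `tfx` Python package. The command-line tools used later in this lab (`skaffold`, `tfx`, `gsutil`, `gcloud`) also need to be reachable through `PATH`. The following is a minimal sketch of such a check using only the standard library; the helper name `missing_tools` is hypothetical and not part of the original notebook.

```python
import shutil


def missing_tools(names, search_path=None):
    """Return the tool names that cannot be found on the search path.

    search_path uses the same format as the PATH environment variable;
    when it is None, shutil.which falls back to the current PATH.
    """
    return [name for name in names if shutil.which(name, path=search_path) is None]


# An empty list means every tool was found.
print("Missing tools:", missing_tools(["skaffold", "tfx", "gsutil", "gcloud"]))
```

If `skaffold` is reported as missing, re-run the cell in section 1.1 so that `/home/jupyter/.local/bin` is on `PATH`.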


## 1.3 Check GCP Project

In [7]:
# Read GCP project id from env.
shell_output=!gcloud config list --format 'value(core.project)' 2>/dev/null
GCP_PROJECT_ID=shell_output[0]
print("GCP project ID:" + GCP_PROJECT_ID)

GCP project ID:covid-19-271622
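The cell above indexes `shell_output[0]` directly, which raises an `IndexError` if `gcloud` prints nothing (for example, when no default project is configured). The following is a minimal sketch of a more explicit check; the helper name `project_id_from_output` is hypothetical and not part of the original notebook.

```python
def project_id_from_output(lines):
    """Return the first non-empty line, or raise if gcloud printed nothing."""
    for line in lines:
        if line.strip():
            return line.strip()
    raise ValueError(
        "No GCP project is configured; "
        "run `gcloud config set project <PROJECT_ID>` and retry."
    )


# Example with the value shown in the output above.
print("GCP project ID:" + project_id_from_output(["covid-19-271622"]))
```

In the notebook, you could call it as `GCP_PROJECT_ID = project_id_from_output(shell_output)`.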


We also need access to your KFP cluster. You can find it in the Google Cloud Console under the "AI Platform > Pipeline" menu. The "endpoint" of the KFP cluster can be read from the URL of the Pipelines dashboard, or from the URL of the Getting Started page where you launched this notebook. Let's create an `ENDPOINT` variable and set it to the KFP cluster endpoint. **ENDPOINT should contain only the hostname part of the URL.** For example, if the URL of the KFP dashboard is `https://1e9deb537390ca22-dot-asia-east1.pipelines.googleusercontent.com/#/start`, the ENDPOINT value is `1e9deb537390ca22-dot-asia-east1.pipelines.googleusercontent.com`.

>**NOTE: You MUST set your ENDPOINT value below.**
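Since only the hostname part of the dashboard URL is needed, it can be extracted programmatically instead of by hand. The following is a minimal sketch using `urllib.parse` from the standard library; the helper name `endpoint_from_url` is hypothetical, and the URL is the example one quoted above.

```python
from urllib.parse import urlparse


def endpoint_from_url(dashboard_url):
    """Return only the hostname part of a KFP dashboard URL."""
    hostname = urlparse(dashboard_url).hostname
    if not hostname:
        # This also triggers when the URL has no scheme such as "https://".
        raise ValueError("No hostname found in URL: {!r}".format(dashboard_url))
    return hostname


example_url = "https://1e9deb537390ca22-dot-asia-east1.pipelines.googleusercontent.com/#/start"
print(endpoint_from_url(example_url))
# prints: 1e9deb537390ca22-dot-asia-east1.pipelines.googleusercontent.com
```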

## 1.4 Set AI Pipeline Kube cluster endpoint

In [8]:
# This refers to the KFP cluster endpoint
ENDPOINT='4dfa62a617d46f32-dot-us-central2.pipelines.googleusercontent.com' # Enter your ENDPOINT here.
if not ENDPOINT:
    from absl import logging
    logging.error('Set your ENDPOINT in this cell.')

## 1.5 Set custom image

In [9]:
# Docker image name for the pipeline image 
CUSTOM_TFX_IMAGE='gcr.io/' + GCP_PROJECT_ID + '/tfx-pipeline'

## Step 2. Copy the predefined template to your project directory.

In this step, we will create a working pipeline project directory and files by copying additional files from a predefined template.

You may give your pipeline a **different name by changing the `PIPELINE_NAME` below.** This will also become the name of the project directory where your files will be put.

## 1.6 Set pipeline name

In [10]:
PIPELINE_NAME="mch_pipeline"
import os
PROJECT_DIR=os.path.join(os.path.expanduser("~"),"AIHub",PIPELINE_NAME)

This TFX pipeline uses the `mhc peptide` template together with the TFX Python package. If you are planning to solve a point-wise prediction problem, including classification and regression, this template can be used as a starting point.

We will copy the pre-built Peptide Prediction pipeline files from a shared GCS bucket. Alternatively, the
`tfx template copy` CLI command copies predefined template files into your project directory.

**Note:** Create a folder called '**mch_pipeline**' (matching `PIPELINE_NAME`) under AIHub in the file browser on the left panel.
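Instead of creating the folder in the file browser, the project directory that `PROJECT_DIR` points to can also be created from Python. The following is a minimal sketch, not part of the original notebook; the helper name `ensure_project_dir` is hypothetical, and the demonstration runs in a temporary directory so it does not touch your home folder.

```python
import os
import tempfile


def ensure_project_dir(base_dir, pipeline_name):
    """Create base_dir/pipeline_name if needed and return its path."""
    project_dir = os.path.join(base_dir, pipeline_name)
    # exist_ok=True makes the call safe to repeat when the folder already exists.
    os.makedirs(project_dir, exist_ok=True)
    return project_dir


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as base:
    created = ensure_project_dir(os.path.join(base, "AIHub"), "mch_pipeline")
    print(os.path.isdir(created))  # prints: True
```

In the notebook itself, `os.makedirs(PROJECT_DIR, exist_ok=True)` would have the same effect for the directory defined in section 1.6.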

## 1.7 Run once to copy template files

In [11]:
#!tfx template copy \
#  --pipeline_name={PIPELINE_NAME} \
#  --destination_path={PROJECT_DIR} \
#  --model=taxi

# 2. Validate setup

In [12]:
%cd {PROJECT_DIR}

/home/jupyter/AIHub/mch_pipeline


## 2.1 Check files

In [13]:
!ls

beam_dag_runner.py     features.py	       model.py
build.yaml	       features_test.py        model_test.py
configs.py	       hparams.py	       old
data		       __init__.py	       pipeline.py
data_validation.ipynb  kubeflow_dag_runner.py  preprocessing.py
deployedmhc	       mch_pipeline.tar.gz     preprocessing_test.py
Dockerfile	       model_analysis.ipynb    __pycache__


## 2.2 Test template feature set

In [14]:
!python3 features_test.py

Running tests under Python 3.7.6: /opt/conda/bin/python3
[ RUN      ] FeaturesTest.testNumberOfBucketFeatureBucketCount
[  FAILED  ] FeaturesTest.testNumberOfBucketFeatureBucketCount
[ RUN      ] FeaturesTest.testTransformedNames
[       OK ] FeaturesTest.testTransformedNames
[ RUN      ] FeaturesTest.test_session
[  SKIPPED ] FeaturesTest.test_session
FAIL: testNumberOfBucketFeatureBucketCount (__main__.FeaturesTest)
testNumberOfBucketFeatureBucketCount (__main__.FeaturesTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "features_test.py", line 33, in testNumberOfBucketFeatureBucketCount
    len(features.CATEGORICAL_FEATURE_MAX_VALUES))
AssertionError: 1 != 0

----------------------------------------------------------------------
Ran 3 tests in 0.002s

FAILED (failures=1, skipped=1)


If some tests fail at this point, you can skip them for now and continue with the lab.
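The failure above, `AssertionError: 1 != 0`, comes from a length comparison involving `features.CATEGORICAL_FEATURE_MAX_VALUES`. One plausible reading is that `features.py` declares features as parallel lists (a list of feature keys and a same-length list of per-feature settings), and that one categorical key is declared while the max-values list is empty. The sketch below illustrates that kind of check with hypothetical values; the name `CATEGORICAL_FEATURE_KEYS`, the feature name, and the helper `check_parallel_lists` are assumptions and are not taken from this lab's `features.py`.

```python
def check_parallel_lists(keys, values, label):
    """Return an error message if two parallel lists differ in length, else None."""
    if len(keys) != len(values):
        return "{}: {} keys but {} values".format(label, len(keys), len(values))
    return None


# Hypothetical contents that would reproduce a "1 != 0" style mismatch.
CATEGORICAL_FEATURE_KEYS = ["example_feature"]
CATEGORICAL_FEATURE_MAX_VALUES = []

print(check_parallel_lists(
    CATEGORICAL_FEATURE_KEYS, CATEGORICAL_FEATURE_MAX_VALUES, "categorical"))
# prints: categorical: 1 keys but 0 values
```

If this is indeed the cause, adding one max value per categorical key (or removing the unused key) in `features.py` should make the test pass.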

## 2.3 Review GCS bucket list

In [15]:
# You can list your buckets using `gsutil`. The following command shows bucket names without the `gs://` prefix and the trailing slash.
!gsutil ls | cut -d / -f 3

artifacts.covid-19-271622.appspot.com
bq_epitope_workshop
cancer_vaccine
covid-19-271622_cloudbuild
dataproc-staging-us-central1-598002519658-gxbdggqa
dataproc-temp-us-central1-598002519658-hu8gmvbe
edward_heart
hla_peptide_dataset
hostedkfp-default-y95ed9e0de
hostedkfp-default-yfu8n9ppw9
mhc_peptide_model
mhc_pipelines
virus_mutations


# 3. Create pipeline based on template

In [16]:
!tfx pipeline create  \
--pipeline_path=kubeflow_dag_runner.py \
--endpoint={ENDPOINT} \
--build_target_image={CUSTOM_TFX_IMAGE}

CLI
Creating pipeline
Detected Kubeflow.
Use --engine flag if you intend to use a different orchestrator.
Reading build spec from build.yaml
Target image gcr.io/covid-19-271622/tfx-pipeline is not used. If the build spec is provided, update the target image in the build spec file build.yaml.
Use skaffold to build the container image.
/home/jupyter/.local/bin/skaffold
[33mWARN[0m[0000] {{.IMAGE_NAME}} is deprecated, envTemplate's template should only specify the tag value. See https://skaffold.dev/docs/pipeline-stages/taggers/ 
New container image is built. Target image is available in the build spec file.
INFO:absl:Neither eval_config nor feature_slicing_spec is passed, the model is treated as estimator.
Instructions for updating:
ModelValidator is deprecated, use Evaluator instead.
INFO:absl:Adding upstream dependencies for component BigQueryExampleGen
INFO:absl:Adding upstream dependencies for component StatisticsGen
INFO:absl:   ->  Component: BigQueryExampleGen
INFO:absl:Adding u
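Outside a notebook, the same `tfx pipeline create` call can be assembled as an argument list and passed to `subprocess`. The following is a minimal sketch under that assumption: the helper name `tfx_pipeline_create_args` is hypothetical, and it uses only the three flags shown in the cell above.

```python
import shlex


def tfx_pipeline_create_args(pipeline_path, endpoint, target_image):
    """Build the argument list for `tfx pipeline create`."""
    if not endpoint:
        raise ValueError("ENDPOINT must be set to the KFP cluster hostname.")
    return [
        "tfx", "pipeline", "create",
        "--pipeline_path={}".format(pipeline_path),
        "--endpoint={}".format(endpoint),
        "--build_target_image={}".format(target_image),
    ]


args = tfx_pipeline_create_args(
    "kubeflow_dag_runner.py",
    "4dfa62a617d46f32-dot-us-central2.pipelines.googleusercontent.com",
    "gcr.io/covid-19-271622/tfx-pipeline",
)
# Print the command for review before running it.
print(" ".join(shlex.quote(arg) for arg in args))
```

To actually run it, pass the list to `subprocess.run(args, check=True)` from the project directory. Building a list rather than a single string avoids shell quoting problems.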

## 3.1 Run an experiment on a pipeline

In [56]:
!tfx run create --pipeline_name={PIPELINE_NAME} --endpoint={ENDPOINT}

CLI
Creating a run for pipeline: mch_pipeline
Detected Kubeflow.
Use --engine flag if you intend to use a different orchestrator.
Run created for pipeline: mch_pipeline
+-----------------+--------------------------------------+----------+---------------------------+
| pipeline_name   | run_id                               | status   | created_at                |
| mch_pipeline    | a4ab65d0-72f0-4a21-86d1-bd6d914da852 |          | 2020-07-30T06:44:33+00:00 |
+-----------------+--------------------------------------+----------+---------------------------+
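The `run_id` in the table above is what identifies this run in the Pipelines dashboard. If you capture the CLI output as text, the table can be parsed with a few lines of Python. The following is a minimal sketch; the helper name `parse_run_table` is hypothetical, and it assumes the `|`-delimited layout shown above, which may differ in other TFX versions.

```python
def parse_run_table(cli_output):
    """Parse rows of a '|'-delimited table into dicts keyed by the header row."""
    rows = []
    header = None
    for line in cli_output.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip border lines and ordinary log lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first table line holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


sample = """
+-----------------+--------------------------------------+----------+---------------------------+
| pipeline_name   | run_id                               | status   | created_at                |
| mch_pipeline    | a4ab65d0-72f0-4a21-86d1-bd6d914da852 |          | 2020-07-30T06:44:33+00:00 |
+-----------------+--------------------------------------+----------+---------------------------+
"""
print(parse_run_table(sample)[0]["run_id"])
# prints: a4ab65d0-72f0-4a21-86d1-bd6d914da852
```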


# 4. Update pipeline and run an experiment for modifications. 
Data scientists can quickly plug in and explore various models (see module: https://github.com/testpilot0/covid/tree/master/mch_pipeline/model.py line 87-84)

In [59]:
# Update the pipeline
!tfx pipeline update \
--pipeline_path=kubeflow_dag_runner.py \
--endpoint={ENDPOINT}
# You can run the pipeline the same way.
!tfx run create --pipeline_name {PIPELINE_NAME} --endpoint={ENDPOINT}

CLI
Updating pipeline
Detected Kubeflow.
Use --engine flag if you intend to use a different orchestrator.
Reading build spec from build.yaml
Use skaffold to build the container image.
/home/jupyter/.local/bin/skaffold
New container image is built. Target image is available in the build spec file.
INFO:absl:Neither eval_config nor feature_slicing_spec is passed, the model is treated as estimator.
Instructions for updating:
ModelValidator is deprecated, use Evaluator instead.
INFO:absl:Adding upstream dependencies for component BigQueryExampleGen
INFO:absl:Adding upstream dependencies for component StatisticsGen
INFO:absl:   ->  Component: BigQueryExampleGen
INFO:absl:Adding upstream dependencies for component SchemaGen
INFO:absl:   ->  Component: StatisticsGen
INFO:absl:Adding upstream dependencies for component ExampleValidator
INFO:absl:   ->  Component: SchemaGen
INFO:absl:   ->  Component: StatisticsGen
INFO:absl:Adding upstream dependencies for component Transform
INFO:absl:   ->  

## 4.1 Update-only command

In [65]:
# Update the pipeline
!tfx pipeline update \
--pipeline_path=kubeflow_dag_runner.py \
--endpoint={ENDPOINT}

CLI
Updating pipeline
Detected Kubeflow.
Use --engine flag if you intend to use a different orchestrator.
Reading build spec from build.yaml
Use skaffold to build the container image.
/home/jupyter/.local/bin/skaffold
New container image is built. Target image is available in the build spec file.
INFO:absl:Neither eval_config nor feature_slicing_spec is passed, the model is treated as estimator.
Instructions for updating:
ModelValidator is deprecated, use Evaluator instead.
INFO:absl:Adding upstream dependencies for component BigQueryExampleGen
INFO:absl:Adding upstream dependencies for component StatisticsGen
INFO:absl:   ->  Component: BigQueryExampleGen
INFO:absl:Adding upstream dependencies for component SchemaGen
INFO:absl:   ->  Component: StatisticsGen
INFO:absl:Adding upstream dependencies for component Transform
INFO:absl:   ->  Component: BigQueryExampleGen
INFO:absl:   ->  Component: SchemaGen
INFO:absl:Adding upstream dependencies for component ExampleValidator
INFO:absl:  

## 4.2 Run updated pipeline

In [4]:
# You can run the pipeline the same way.
!tfx run create --pipeline_name {PIPELINE_NAME} --endpoint={ENDPOINT}

CLI
Creating a run for pipeline: {PIPELINE_NAME}
Detected Kubeflow.
Use --engine flag if you intend to use a different orchestrator.
Pipeline "{PIPELINE_NAME}" does not exist.
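The output above contains the literal text `{PIPELINE_NAME}` rather than `mch_pipeline`. A likely cause, given that this cell is numbered `In [4]`, is that the kernel was restarted and the cell ran before `PIPELINE_NAME` was defined again; IPython leaves `{name}` untouched in a `!` command when the variable is undefined. Re-running the cells that set `PIPELINE_NAME` and `ENDPOINT` should resolve it. The following is a minimal sketch of a guard that spots such leftovers in a command string or in captured output; the helper name `unexpanded_placeholders` is hypothetical.

```python
import re

# Matches {NAME} where NAME looks like a Python identifier.
_PLACEHOLDER = re.compile(r"\{[A-Za-z_][A-Za-z0-9_]*\}")


def unexpanded_placeholders(text):
    """Return every {NAME}-style placeholder still present in text."""
    return _PLACEHOLDER.findall(text)


print(unexpanded_placeholders('Pipeline "{PIPELINE_NAME}" does not exist.'))
# prints: ['{PIPELINE_NAME}']
```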


## [Graph](https://4dfa62a617d46f32-dot-us-central2.pipelines.googleusercontent.com/#/runs/details/8df9444f-a0d9-4c7e-8522-330e7ccd883e) representation of this pipeline
+ The model can be explored in [TensorBoard](https://4dfa62a617d46f32-dot-us-central2.pipelines.googleusercontent.com/apis/v1beta1/_proxy/viewer-639153f43fb50fb8f89ca441e93c719f36059592-service.default.svc.cluster.local:80/tensorboard/viewer-639153f43fb50fb8f89ca441e93c719f36059592/#graphs&run=serving_model_dir)

# This is the end of Lab5! Congratulations!