# Notebook to demonstrate end-to-end reproducible machine learning on Kubernetes

This notebook is divided into two sections:
1. Reproducible machine learning workflow with TensorFlow 2: A semantic segmentation problem
2. Deployment of 1) on Kubernetes with Kubeflow & Pachyderm for full provenance and reproducibility

# 1. Reproducible machine learning workflow with TensorFlow 2: A semantic segmentation problem

Machine learning algorithms often require complex computation. This is increasingly true as we move towards complex network architectures that require giga- or teraFLOPs of computation.
We rely more and more on GPUs and other efficient hardware, but speed and efficiency sometimes come at the cost of reproducibility. Different GPU architectures, with their different streaming multiprocessor designs, may give different results.
Even parallelism on a CPU may give different results. A very interesting presentation by Corden, [Consistency of Floating Point Results or Why doesn’t my application always give the same answer?](https://www.nccs.nasa.gov/images/FloatingPoint_consistency.pdf), covers some of these issues in detail.
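
A common root cause is that floating-point addition is not associative, so any change in the order of a reduction (for example, a different number of threads) can change the result. The sketch below illustrates this in plain Python; `chunked_sum` is a hypothetical helper that only mimics a parallel reduction, and the chunk counts are chosen for illustration.

```python
import random

# Floating-point addition is not associative.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False


def chunked_sum(values, n_chunks):
    """Sum values as n_chunks partial sums, mimicking a parallel reduction."""
    size = -(-len(values) // n_chunks)  # ceiling division
    partials = [sum(values[i:i + size]) for i in range(0, len(values), size)]
    return sum(partials)


rng = random.Random(0)
values = [rng.uniform(-1e6, 1e6) for _ in range(100_000)]
# The two totals may differ in their last few digits.
print(chunked_sum(values, 1), chunked_sum(values, 8))
```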
			      	
It is not just hardware: some libraries that perform intensive computations do not guarantee reproducibility for some routines. One example is [NVIDIA's deep neural network library](https://developer.nvidia.com/cudnn) (cuDNN), which does not guarantee bitwise-identical results for some routines, even on the same GPU ([ref](https://docs.nvidia.com/deeplearning/sdk/cudnn-developer-guide/index.html#reproducibility)).

Then we have randomness! This includes algorithmic randomness such as dropout, random initialization and random augmentation, as well as process- and practice-based randomness such as shuffling of data. Unseeded randomness also makes reproducibility very hard.
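
As a minimal illustration of why seeding matters, the sketch below shuffles dataset indices with and without a seed. `shuffled_indices` is a hypothetical helper written for this example, using only the standard library.

```python
import random


def shuffled_indices(n, seed=None):
    """Return a shuffled list of indices; a fixed seed makes the order repeatable."""
    rng = random.Random(seed)  # local generator, independent of global state
    indices = list(range(n))
    rng.shuffle(indices)
    return indices


# Same seed, same order on every call.
assert shuffled_indices(10, seed=42) == shuffled_indices(10, seed=42)

# Without a seed the generator is initialized from system entropy,
# so the order is generally not repeatable between calls or runs.
print(shuffled_indices(10), shuffled_indices(10))
```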

With TensorFlow 2.0, 100% reproducible deep learning is possible if it is used correctly. This is thanks to the excellent work of [Duncan Riach](https://github.com/duncanriach) on [tensorflow-determinism](https://github.com/NVIDIA/tensorflow-determinism), and to the wider TensorFlow team.
[“Determinism in deep learning” by Duncan Riach @ GTC 2019](https://drive.google.com/file/d/18pmjeiXWqzHWB8mM2mb3kjN4JSOZBV4A/view) is a very interesting presentation as well.

Besides seeding all randomness in every layer of my ML network, calling the [set_global_determinism](https://github.com/suneeta-mall/e2e-ml-on-k8s/blob/master/pypkg/pylib/utils.py#L15-L42) method guarantees 100% identical results when fed the same dataset.

```python
import logging
import os
import random

import numpy as np
import tensorflow as tf


def set_global_determinism(seed=42, fast_n_close=False):
    """
        Enable 100% reproducibility on operations related to tensor and randomness.
        Parameters:
        seed (int): seed value for global randomness
        fast_n_close (bool): if True, only seed the random generators and skip the
            slower deterministic settings, trading determinism/reproducibility for speed
    """
    set_seeds(seed=seed)
    if fast_n_close:
        return

    logging.warning("*******************************************************************************")
    logging.warning("*** set_global_determinism is called,setting full determinism, will be slow ***")
    logging.warning("*******************************************************************************")

    # Ask TensorFlow and cuDNN to use deterministic implementations of their ops.
    os.environ['TF_DETERMINISTIC_OPS'] = '1'
    os.environ['TF_CUDNN_DETERMINISTIC'] = '1'
    # Run ops on a single thread so that the order of execution is fixed.
    # https://www.tensorflow.org/api_docs/python/tf/config/threading/set_inter_op_parallelism_threads
    tf.config.threading.set_inter_op_parallelism_threads(1)
    tf.config.threading.set_intra_op_parallelism_threads(1)
    # Apply the patch from the tensorflow-determinism package.
    from tfdeterminism import patch
    patch()


def set_seeds(seed=42):
    # Seed every source of randomness in use: Python hashing, random, TensorFlow and NumPy.
    os.environ['PYTHONHASHSEED'] = str(seed)
    random.seed(seed)
    tf.random.set_seed(seed)
    np.random.seed(seed)
```
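
One way to check that determinism actually holds is to run the same training twice and compare a hash of the resulting weights. The sketch below uses a toy pure-Python "model" in place of the real network; `toy_train` and `fingerprint` are hypothetical helpers written for this illustration and are not part of the repository.

```python
import hashlib
import random
from array import array


def toy_train(seed, steps=100):
    """Stand-in for a training run: seeded init followed by noisy updates."""
    rng = random.Random(seed)
    weights = [rng.gauss(0.0, 1.0) for _ in range(8)]
    for _ in range(steps):
        weights = [w - 0.01 * (w + rng.gauss(0.0, 0.1)) for w in weights]
    return weights


def fingerprint(weights):
    """SHA-256 over the raw float64 bytes, so any bitwise difference shows up."""
    return hashlib.sha256(array("d", weights).tobytes()).hexdigest()


run_a = fingerprint(toy_train(seed=42))
run_b = fingerprint(toy_train(seed=42))
print(run_a == run_b)  # True: same seed, same operations, same bits
```

With a real Keras model, the same comparison could presumably be applied to the bytes of each array returned by `model.get_weights()`.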

To run the e2e example locally, run the command below. It implements the following machine learning workflow:
![Machine learning workflow](resources/ml-workflow.jpg "Machine learning workflow")

The training code is scripted in [train.py](app/train.py). See [ML workflow steps details](ML_WORKFLOWS.md) for more information on the other steps.

In [None]:
! docker run -t -i --rm --name e2e-ml --entrypoint bash suneetamall/e2e-ml-on-k8s:latest /run_e2e.sh

# 2. Deployment of 1) on Kubernetes with Kubeflow & Pachyderm for full provenance and reproducibility

However, 100% reproducible machine learning code comes at a cost. For this particular pet segmentation problem, training time was observed to be a noticeable 3.75 times higher.
This may not be acceptable if the variance in results is only on the order of 0.5 units (e.g. accuracy ranging over 91.0-91.5 across runs). In such scenarios, maintaining full provenance
becomes really important. Provenance should not be limited to the lineage of the training data; it should extend to the entire workflow, including model serving and infrastructure components.
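
To make the idea of full provenance concrete, the sketch below shows one possible shape for a lineage record. `ProvenanceRecord` and its fields are hypothetical and only illustrate the concept; in this demo the actual tracking is delegated to Pachyderm and Kubeflow, and the commit values are placeholders.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical record tying a model to everything that produced it."""
    data_commit: str   # version of the input data
    code_commit: str   # version of the training code
    image: str         # container image used to run the step
    hyperparameters: dict = field(default_factory=dict)

    def lineage_id(self):
        """Stable identifier derived from the full record."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


record = ProvenanceRecord(
    data_commit="<data-commit>",   # placeholder
    code_commit="<code-commit>",   # placeholder
    image="suneetamall/e2e-ml-on-k8s:latest",
    hyperparameters={"seed": 42},
)
print(record.lineage_id())
```

Any change to the data, code, image or hyperparameters yields a different identifier, which is the property a provenance system needs in order to tell two models apart.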

This section caters to that scenario. Follow the instructions below for the demo:

Prerequisites
- Install kubectl matching your Kubernetes version
- Install [pachctl 1.9.8](https://github.com/pachyderm/pachyderm/releases/tag/v1.9.8)
- Bring your own Kubernetes cluster and install ArgoCD ([quick guide](https://github.com/suneeta-mall/e2e-ml-on-k8s/blob/master/cluster-conf/README.md#configuring-kubernetes-cluster-with-gitops))

In [None]:
# Install ArgoCD app
! kubectl apply -f https://raw.githubusercontent.com/suneeta-mall/e2e-ml-on-k8s/master/cluster-conf/e2e-ml-argocd-app.yaml

In [None]:
# Configure pachctl 
! pachctl config update context `pachctl config get active-context` --namespace kubeflow
! pachctl port-forward & 

# Pipeline specification
See [ML workflow steps details](ML_WORKFLOWS.md) for a detailed introduction to each step of the ML workflow.
Two types of ML workflow pipeline spec are defined:
1. [Pachyderm only pipeline](cluster-conf/k8s/ml-workflow/pachyderm-specs.yaml)
2. [Pachyderm in conjunction with Kubeflow](cluster-conf/k8s/ml-workflow/extend_pachyderm-specs-with-kubeflow.yaml)
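
For readers unfamiliar with Pachyderm, a pipeline spec names the pipeline, the container and command to run, and the input repository. The sketch below builds a minimal spec of that shape as a Python dict. It is not taken from the linked spec files: `make_pipeline_spec` is a hypothetical helper, the command is a placeholder, and the `warehouse` input for a `transform` step is an assumption based on the pipeline names in this demo.

```python
import json


def make_pipeline_spec(name, image, cmd, input_repo, glob="/"):
    """Build a minimal Pachyderm-style pipeline spec as a plain dict."""
    return {
        "pipeline": {"name": name},
        "transform": {"image": image, "cmd": list(cmd)},
        # A glob of "/" treats the whole input repo as a single datum.
        "input": {"pfs": {"repo": input_repo, "glob": glob}},
    }


spec = make_pipeline_spec(
    name="transform",
    image="suneetamall/e2e-ml-on-k8s:latest",
    cmd=["python", "transform.py"],  # placeholder command
    input_repo="warehouse",          # assumed upstream repo
)
print(json.dumps(spec, indent=2))
```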

In [None]:
# Pachyderm only pipeline
! pygmentize cluster-conf/k8s/ml-workflow/pachyderm-specs.yaml
! pachctl create pipeline -f cluster-conf/k8s/ml-workflow/pachyderm-specs.yaml

In [None]:
# Pachyderm in conjunction with Kubeflow
! pygmentize cluster-conf/k8s/ml-workflow/extend_pachyderm-specs-with-kubeflow.yaml
! pachctl create pipeline -f cluster-conf/k8s/ml-workflow/extend_pachyderm-specs-with-kubeflow.yaml


# Checking for status on processes
See the [pachctl](https://docs.pachyderm.com/latest/reference/pachctl/pachctl_list_job/) reference for how to check job status.
Example output may look like the following:
```
ID                               PIPELINE    STARTED      DURATION       RESTART PROGRESS  DL       UL       STATE
116788baecd54d0ba15f4dd5372a8f15 model-kf    46 hours ago 50 minutes     0       1 + 0 / 1 43.32MiB 278.7MiB success
c7efc83593854f8291af6db577fd286c release     2 days ago   18 seconds     0       1 + 0 / 1 177.7MiB 0B       success
8ae9a293f2b94ab5ba2ae02186783025 evaluate    2 days ago   11 minutes     0       1 + 0 / 1 180.7MiB 1.277MiB success
046f26e008fe4c009c7f338997aaedb6 tune-kf     2 days ago   4 hours        0       1 + 0 / 1 0B       116B     success
2e6dec46669e4360bbf6e2ba3ee6da61 calibrate   2 days ago   5 minutes      0       1 + 0 / 1 143.2MiB 37.4MiB  success
b5d2feebc0ef4d15893e27562b11636f tensorboard 2 days ago   -              0       0 + 0 / 0 0B       0B       running
9a74b4195b044390a7afb08c0dde63e4 model       2 days ago   8 hours        0       1 + 0 / 1 43.32MiB 1.224GiB success
54db4db243b24ec8aa2bbe4d3b8ce8bd train       2 days ago   7 hours        0       1 + 0 / 1 43.32MiB 974.5MiB success
f3d0b49316254e968fd996caeb333a14 tune        2 days ago   4 hours        0       1 + 0 / 1 43.32MiB 1.088GiB success
016942a72bc34dca882f318fa3779a68 transform   3 days ago   9 minutes      0       1 + 0 / 1 777.7MiB 43.32MiB success
7468207b23aa4e5097be5a926d15a0b4 warehouse   3 days ago   About a minute 0       1 + 0 / 1 773.5MiB 777.7MiB success
```

In [None]:
! pachctl list job
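
When many jobs are listed, it can help to reduce the table to one state per pipeline. The sketch below parses text in the layout of the example output; `job_states` is a hypothetical helper, and it assumes that the pipeline name is the second whitespace-separated token, the state is the last one, and jobs are listed newest first.

```python
from collections import Counter


def job_states(listing):
    """Map each pipeline to the state of its most recent job."""
    states = {}
    for line in listing.strip().splitlines()[1:]:  # skip the header row
        tokens = line.split()
        if len(tokens) < 3:
            continue
        pipeline, state = tokens[1], tokens[-1]
        states.setdefault(pipeline, state)  # keep the first (newest) entry
    return states


sample = """\
ID                               PIPELINE    STARTED      DURATION       RESTART PROGRESS  DL       UL       STATE
116788baecd54d0ba15f4dd5372a8f15 model-kf    46 hours ago 50 minutes     0       1 + 0 / 1 43.32MiB 278.7MiB success
b5d2feebc0ef4d15893e27562b11636f tensorboard 2 days ago   -              0       0 + 0 / 0 0B       0B       running
"""
states = job_states(sample)
print(states)  # {'model-kf': 'success', 'tensorboard': 'running'}
print(Counter(states.values()))
```

The listing could be captured from the CLI with, for example, `subprocess.run(["pachctl", "list", "job"], capture_output=True, text=True).stdout`.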



# Checking predictions
Once the release step is complete, a Seldon deployment is created, which starts a prediction server with the released version of the model.

In [None]:
# Enable port-forward if no ingress/loadbalancer is set
! kubectl port-forward -n istio-system svc/istio-ingressgateway 8080:80

In [None]:
import numpy as np
import seldon_core
from pylib import TUNE_CONF
from pylib import load_validation as load_test, dataset_for_split, display, binarize, IMG_CHANNEL, IMG_HEIGHT, IMG_WIDTH
from seldon_core.seldon_client import SeldonClient

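release_commit = input()  # commit ID of the released model, entered interactively
input_data_dir = "resources"
prediction_version = f'petset-{release_commit[0:6]}'  # deployment name uses the first 6 characters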

test_slice = dataset_for_split(input_data_dir, "calibration")
test_dataset = test_slice.map(load_test, num_parallel_calls=TUNE_CONF).batch(1)

In [None]:
# Requesting prediction from versioned API
sc = SeldonClient(gateway="istio", transport="rest", deployment_name=prediction_version,
                  namespace='kubeflow', gateway_endpoint='localhost:8080')
for img, mask in test_dataset.take(1):
    # http://localhost:8080/seldon/kubeflow/petset-c517ec/api/v0.1/predictions
    # https://docs.seldon.io/projects/seldon-core/en/latest/workflow/serving.html
    r = sc.predict(shape=(IMG_HEIGHT, IMG_WIDTH, IMG_CHANNEL), data=img[0].numpy(), payload_type='ndarray', names=[])
    predictions = seldon_core.utils.seldon_message_to_json(r.response)
    prediction_res = np.asarray(predictions['data']['ndarray'])
    display([img[0], mask[0], prediction_res, binarize(prediction_res)], 
            title=['Input Image', 'True Mask', 'Predicted Mask', 'Thresholded Mask'])
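
For clients that do not use `SeldonClient`, the versioned endpoint can be addressed directly over REST. The sketch below only builds the URL and request body and does not send anything. `prediction_url` and `ndarray_payload` are hypothetical helpers; the URL pattern follows the example in the comment above, and the body uses Seldon's `data.ndarray` envelope.

```python
import json


def prediction_url(release_commit, namespace="kubeflow", host="localhost:8080"):
    """Build the versioned REST endpoint for a released model."""
    deployment = f"petset-{release_commit[:6]}"  # same naming as prediction_version
    return f"http://{host}/seldon/{namespace}/{deployment}/api/v0.1/predictions"


def ndarray_payload(pixels):
    """Wrap nested lists of pixel values in the data.ndarray envelope."""
    return json.dumps({"data": {"names": [], "ndarray": pixels}})


print(prediction_url("c517ec"))
# http://localhost:8080/seldon/kubeflow/petset-c517ec/api/v0.1/predictions
print(ndarray_payload([[0.0, 1.0], [1.0, 0.0]]))
```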
    

In [None]:
# Sending feedback to the versioned API. This currently uses the reinforcement-style send_feedback API
# and ignores the reward, but custom APIs can be added to the serving component for feedback.
# truth_pred = seldon_core.utils.seldon_message_to_json(r.response)
fb_res = sc.feedback(prediction_request=r.request, prediction_response=r.response, reward=None)