# Covid-19 lung classification pipeline in Kubeflow

In this notebook, the **Covid-19 lung classification notebook** is split into components and executed as a **Kubeflow pipeline** run. A pipeline is a description of an ML workflow that includes all of its steps in the form of components. A pipeline component is a self-contained set of user code, packaged as a Docker image, that performs one step in the pipeline, for example data preprocessing, data transformation, or model training. For a conventional data science notebook to run as a Kubeflow pipeline, it has to be brought into a Kubeflow-*friendly* format, which is the purpose of this notebook.

![pics](pics/lung_Kubeflow.JPG)

The pipeline trains the new network (only the new classifier part) to differentiate X-ray images of healthy lungs from X-ray images of Covid-19 infected lungs.

## Imports & Constants

In [230]:
import IPython.display
import json
import kfp
import kfp.dsl as dsl
import kfp.components as comp
from kfp.components import (
    InputPath,
    OutputPath
)
import requests
from typing import NamedTuple

## Load reusable components, define data location & name, MinIO, and namespace

Reusable components for repetitive steps are loaded first. The components are stored as **.yaml** files in a coworker's GitHub repository and are loaded via their URLs. Kubeflow is designed to let data scientists reuse components for frequently occurring steps of the ML workflow, for example downloading the data. Other steps that reuse components here are model conversion, model upload, and model deployment. The components for those steps are only compatible with models trained using *TensorFlow*. For models built with other frameworks, other components need to be loaded or defined in the notebook.

The dataset used in this notebook was uploaded to the file hosting service Box. Its URL and file name are defined next, along with the model name. Kubeflow ships with MinIO to store all of its pipelines, artifacts, and logs. The MinIO URL, username, and password must be specified here.

Kubeflow comes with multi-user isolation which simplifies user operations because each user only views and edits the Kubeflow components and model artifacts defined in their configuration. Isolation uses Kubernetes **Namespaces**. The Namespace needs to be specified before the other steps of the pipeline can be defined. 

In [232]:
LABELS = [
    "Covid",
    "Normal"
]


DOWNLOAD_AND_EXTRACT_COMPONENT_URL = "https://raw.githubusercontent.com/lehrig/kubeflow-ppc64le-components/main/data-extraction/download-and-extract-from-url/component.yaml"
CONVERT_MODEL_TO_ONNX_COMPONENT_URL = "https://raw.githubusercontent.com/lehrig/kubeflow-ppc64le-components/main/model-building/convert-to-onnx/component.yaml"
UPLOAD_MODEL_COMPONENT_URL = "https://raw.githubusercontent.com/lehrig/kubeflow-ppc64le-components/main/model-building/upload-model/component.yaml"
DEPLOY_MODEL_WITH_KSERVE_COMPONENT_URL = "https://raw.githubusercontent.com/lehrig/kubeflow-ppc64le-components/main/model-deployment/deploy-model-with-kserve/component.yaml"

DATASET_URL = "https://ibm.box.com/shared/static/5k8j40bj9niw4lqsslz9l4eydjj96s0w.zip"
DATASET_FILE_NAME = "Covid_lungs.zip"
MODEL_NAME = "covid-classification"

MINIO_URL = "minio-service.kubeflow:9000"
MINIO_USER = "minio"
MINIO_PASS = "minio123"

with open("/var/run/secrets/kubernetes.io/serviceaccount/namespace") as f:
    NAMESPACE = f.read()
NAMESPACE

'user-example-com'
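
Reading the namespace from the service-account file can be made slightly more defensive. The following is a minimal sketch, not part of the original notebook: the helper `read_namespace` and its `default` fallback are hypothetical, and a temporary file stands in for the real service-account path.

```python
import os
import tempfile


def read_namespace(path, default="default"):
    """Return the namespace stored at `path`, or `default` if the file is missing."""
    try:
        with open(path) as f:
            # strip() guards against a trailing newline ending up in the name
            return f.read().strip() or default
    except FileNotFoundError:
        return default


# Demonstration: a temporary file stands in for the service-account file.
with tempfile.TemporaryDirectory() as tmp:
    ns_file = os.path.join(tmp, "namespace")
    with open(ns_file, "w") as f:
        f.write("user-example-com\n")
    demo_namespace = read_namespace(ns_file)
    missing_namespace = read_namespace(os.path.join(tmp, "missing"))

print(demo_namespace)
print(missing_namespace)
```

The fallback is only useful when the notebook is run outside a cluster, where the service-account file does not exist.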

In [234]:
client = kfp.Client()

# Pipeline
## 1.1 Load dataset

The first component downloads the data and extracts it from a zip file.

In [237]:
download_and_extract_comp = comp.load_component_from_url(
    DOWNLOAD_AND_EXTRACT_COMPONENT_URL
)

## 1.2 Preprocessing

In the second component, all preprocessing is done before the data can be used to train the model. The data scientist decides which steps qualify as preprocessing and incorporates the corresponding code into this component. In this example, the image paths are collected from the dataset directory, each image is loaded and resized, and the images and their class labels are appended to two lists. The data and labels are then converted to NumPy arrays. Finally, one-hot encoding is performed on the labels.

Besides the preprocessing code, the component follows a fixed structure: the **input** and **output paths** are defined at the top, **packages & modules** are imported, the **data** is loaded, the relevant code is run, and the result is saved to a **new data directory**. The component also receives a **base image** that contains the packages needed to run its code. This structure stays the same for every subsequent component.

In [239]:
def preprocess_data(
    data_dir: InputPath(str),
    prep_dir: OutputPath(str)
):
    import os
    import requests
    import numpy as np
    import cv2
    from imutils import paths
    from sklearn.preprocessing import LabelBinarizer
    from tensorflow.keras.utils import to_categorical

    print("[INFO] loading images...")
    imagePaths = list(paths.list_images(data_dir))
    print("length of imagePaths: "+ str(len(imagePaths)))
    data = []
    labels = []

    # loop over the image paths
    for imagePath in imagePaths:
        # extract the class label (directory-name) from the filename
        label = imagePath.split(os.path.sep)[-2]
        print("label: " + label)
        if (label=="covid" or label=="normal"):
            # load the image, swap color channels, and resize it to be a fixed
            # 224x224 pixels while ignoring aspect ratio
            image = cv2.imread(imagePath)
            image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
            image = cv2.resize(image, (224, 224))

            # update the data and labels lists, respectively
            data.append(image)
            labels.append(label)
            
    print("length of labels: "+ str(len(labels))) 
    print("length of data: "+ str(len(data)))

    data = np.array(data) / 255.0
    labels = np.array(labels)

    lb = LabelBinarizer()
    labels = lb.fit_transform(labels)
    labels = to_categorical(labels)
    
    print("length of labels: "+ str(len(labels))) 
    print("length of data: "+ str(len(data)))

    if not os.path.exists(prep_dir):
        os.makedirs(prep_dir)

    np.savez(f'/{prep_dir}/data.npz', data=data, labels=labels)

preprocess_data_comp = kfp.components.create_component_from_func(
    func=preprocess_data,
    packages_to_install=["imutils"],
    base_image='quay.io/ibm/kubeflow-notebook-image-ppc64le:elyra3.7.0-py3.8-tensorflow-cpu2.7.0'
)
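
The label-encoding step can be illustrated in isolation. For two classes, `LabelBinarizer` produces a single 0/1 column (with classes sorted alphabetically) and `to_categorical` expands it into two columns. The pure-Python sketch below reproduces the resulting layout under that assumption; the helper `one_hot_encode` is hypothetical and is not the library implementation.

```python
def one_hot_encode(labels):
    """Map string labels to one-hot rows; classes are ordered alphabetically."""
    classes = sorted(set(labels))
    index = {name: i for i, name in enumerate(classes)}
    encoded = []
    for label in labels:
        row = [0.0] * len(classes)
        row[index[label]] = 1.0
        encoded.append(row)
    return classes, encoded


classes, encoded = one_hot_encode(["covid", "normal", "normal", "covid"])
print(classes)
print(encoded)
```

With this ordering, column 0 corresponds to `covid` and column 1 to `normal`, which matches the order of the `LABELS` list defined earlier.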

## 1.3 Train test split
The **data and labels** were saved to *prep_dir* by the preprocessing component and are loaded again in this component, which splits them into training and test data.

In [241]:
def traintestsplit_data(
    prep_dir: InputPath(str),
    file_name: str,
    traintest_dir: OutputPath(str)
):
    """Split the preprocessed data into training and test sets. Saves the result into `traintest_dir`."""

    import os
    import numpy as np
    from sklearn.preprocessing import LabelBinarizer
    from tensorflow.keras.utils import to_categorical
    from sklearn.model_selection import train_test_split
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    # Load the archive once and read both arrays from it
    prep_data = np.load(f'/{prep_dir}/data.npz')
    data = prep_data['data']
    labels = prep_data['labels']
    
    print("length of labels: "+ str(len(labels))) 
    print("length of data: "+ str(len(data)))
    
    (trainX, testX, trainY, testY) = train_test_split(data, labels, test_size=0.40, stratify=labels, random_state=42)

    print(trainX.shape)
    print(testX.shape)
    print(trainY.shape)
    print(testY.shape)

    if not os.path.exists(traintest_dir):
        os.makedirs(traintest_dir)

    np.savez(f'/{traintest_dir}/train_data.npz', trainX, trainY)
    np.savez(f'/{traintest_dir}/val_data.npz', testX, testY)

    print(f'Data saved to {traintest_dir}:')
    print(os.listdir(traintest_dir))

traintestsplit_data_comp = kfp.components.create_component_from_func(
    func=traintestsplit_data,
    base_image='quay.io/ibm/kubeflow-component-tensorflow-cpu:latest'
)
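
The split component passes its arrays to `np.savez` positionally, which is why later components read them back by position through `.files[0]` and `.files[1]`. The sketch below shows the key names NumPy assigns in that case and contrasts them with explicit keyword names. The toy arrays and file names are illustrative assumptions only.

```python
import os
import tempfile

import numpy as np

train_x = np.zeros((3, 2))
train_y = np.array([0, 1, 0])

with tempfile.TemporaryDirectory() as tmp:
    # Positional arrays are stored under the generated names arr_0, arr_1, ...
    path = os.path.join(tmp, "train_data.npz")
    np.savez(path, train_x, train_y)
    with np.load(path) as archive:
        positional_keys = list(archive.files)
        loaded_x = archive["arr_0"]

    # Keyword arguments give the arrays self-describing names instead.
    named_path = os.path.join(tmp, "train_named.npz")
    np.savez(named_path, x=train_x, y=train_y)
    with np.load(named_path) as archive:
        named_keys = list(archive.files)

print(positional_keys)
print(named_keys)
```

Keyword names make the reading side less dependent on argument order, at the cost of having to keep the names in sync between components.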

## 1.4 Train the model
In this component the model is trained and then saved to the **model directory**. The component also applies the [ImageDataGenerator](https://keras.io/api/preprocessing/image/), which loads and augments images in batches for image classification tasks. Combined with the `fit_generator()` method (deprecated in TensorFlow 2.x in favor of `fit()`, which accepts generators directly), it makes it possible to avoid keeping all of the training data in memory, because only the current batch has to be loaded. The `ImageDataGenerator` class can also modify images, e.g. by shifting, rotating, flipping, or color transformation.
In the code cell below, an object of this class is instantiated that randomly rotates images by up to 15°.

In [262]:
def train_model(
    traintest_dir: InputPath(str),
    model_dir: OutputPath(str)
):
    """Trains model. Once trained, the model is persisted to `model_dir`."""

    import os
    import numpy as np
    import tensorflow as tf
    from tensorflow.keras.preprocessing.image import ImageDataGenerator
    from tensorflow.keras.applications import VGG16
    from tensorflow.keras.layers import AveragePooling2D, Dropout, Flatten, Dense, Input
    from tensorflow.keras.models import Model
    from tensorflow.keras.optimizers import Adam
    
    train_data = np.load(f'{traintest_dir}/train_data.npz')
    trainX = train_data[train_data.files[0]]
    trainY = train_data[train_data.files[1]]
    
    val_data = np.load(f'{traintest_dir}/val_data.npz')
    testX = val_data[val_data.files[0]]
    testY = val_data[val_data.files[1]]
    
    INIT_LR = 1e-3  # initial learning rate
    EPOCHS = 25     # number of training epochs
    BS = 10         # training batch size

    trainAug = ImageDataGenerator(rotation_range=15, fill_mode="nearest")

    baseModel = VGG16(weights="imagenet", include_top=False, input_tensor=Input(shape=(224, 224, 3)))

    headModel = baseModel.output
    headModel = AveragePooling2D(pool_size=(4, 4))(headModel)
    headModel = Flatten(name="flatten")(headModel)
    headModel = Dense(64, activation="relu")(headModel)
    headModel = Dropout(0.5)(headModel)
    headModel = Dense(2, activation="softmax")(headModel)

    model = Model(inputs=baseModel.input, outputs=headModel)

    for layer in baseModel.layers:
        layer.trainable = False

    model.summary()

    print("[INFO] compiling model...")
    # `learning_rate` replaces the deprecated `lr` alias
    opt = Adam(learning_rate=INIT_LR, decay=INIT_LR / EPOCHS)
    model.compile(loss="binary_crossentropy", optimizer=opt, metrics=["accuracy"])

    print("[INFO] training classifier part of the network...")
    hist = model.fit_generator(
        trainAug.flow(trainX, trainY, batch_size=BS),
        steps_per_epoch=len(trainX) // BS,
        validation_data=(testX, testY),
        validation_steps=len(testX) // BS,
        verbose=False,
        epochs=EPOCHS)

    print("Model train history:")
    print(hist.history)

    if not os.path.exists(model_dir):
        os.makedirs(model_dir)
        
    model.save(model_dir)
    print(f"Model saved to: {model_dir}")


train_model_comp = kfp.components.create_component_from_func(
    func=train_model,
    output_component_file='train_model_component.yaml',
    base_image='quay.io/ibm/kubeflow-component-tensorflow-gpu:2.7.0-dev'
)
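
The batch-wise feeding described for this component can be sketched without Keras. The hypothetical generator `batch_generator` below yields one slice at a time and loops forever, and `steps_per_epoch` is computed with the same floor division as in `train_model`. This is not the `ImageDataGenerator` implementation: it performs no augmentation or shuffling, and the toy lists stand in for image arrays.

```python
from itertools import islice


def batch_generator(samples, labels, batch_size):
    """Yield (samples, labels) batches indefinitely, one slice at a time."""
    n = len(samples)
    while True:
        for start in range(0, n, batch_size):
            end = start + batch_size
            yield samples[start:end], labels[start:end]


samples = list(range(25))          # stand-in for 25 images
labels = [i % 2 for i in samples]  # stand-in for class labels
BS = 10

# Same arithmetic as in train_model: floor division drops the partial batch.
steps_per_epoch = len(samples) // BS
epoch = list(islice(batch_generator(samples, labels, BS), steps_per_epoch))
samples_seen = sum(len(batch_x) for batch_x, _ in epoch)

print(steps_per_epoch, samples_seen)
```

With 25 samples and a batch size of 10, floor division gives 2 steps, so 20 samples are consumed in the first epoch of this sketch and the remaining partial batch is the next one the generator yields.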

## 1.5 Evaluate model with validation data
This component loads the trained model and the held-out test split from the previously created directories, runs the model on the test images, and prints the predicted class indices.

In [263]:
def evaluate_model(
    traintest_dir: InputPath(str),
    model_dir: InputPath(str)  
):
    """Loads the saved model from `model_dir`, runs it on the test split stored in
    `traintest_dir`, and prints the predicted class indices."""

    import json
    import numpy as np
    import tensorflow as tf
    from collections import namedtuple
    from sklearn.metrics import classification_report
    from sklearn.preprocessing import LabelBinarizer
    
    val_data = np.load(f'{traintest_dir}/val_data.npz')
    testX = val_data[val_data.files[0]]
    testY = val_data[val_data.files[1]]
    
    BS = 10
    
    model = tf.keras.models.load_model(model_dir)

    print("[INFO] Apply model on test data...")
    predIdxs = model.predict(testX, batch_size=BS)

    predIdxs = np.argmax(predIdxs, axis=1)

    print(predIdxs)

evaluate_model_comp = kfp.components.create_component_from_func(
    func=evaluate_model,
    output_component_file='evaluate_model_component.yaml',
    base_image='quay.io/ibm/kubeflow-component-tensorflow-cpu:latest'
)
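
The component above prints the predicted class indices but does not write a metrics file. The following is a minimal sketch of how an accuracy value could be computed from one-hot labels and predicted indices and then serialized in the metrics layout that Kubeflow Pipelines v1 reads (a `metrics` list whose entries have `name`, `numberValue`, and `format`). The helpers `accuracy` and `build_metrics` and the toy data are hypothetical, and wiring the result into the component through an output path is left out.

```python
import json


def accuracy(y_true_one_hot, y_pred_idx):
    """Fraction of samples whose predicted index matches the one-hot label."""
    true_idx = [row.index(max(row)) for row in y_true_one_hot]
    correct = sum(t == p for t, p in zip(true_idx, y_pred_idx))
    return correct / len(true_idx)


def build_metrics(acc):
    """Wrap an accuracy value in the KFP v1 metrics structure."""
    return {
        "metrics": [
            {"name": "accuracy", "numberValue": acc, "format": "PERCENTAGE"}
        ]
    }


test_y = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]  # toy one-hot labels
pred_idx = [0, 1, 0, 0]                                    # toy predictions

metrics = build_metrics(accuracy(test_y, pred_idx))
metrics_json = json.dumps(metrics)
print(metrics_json)
```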

## 1.6 Create confusion matrix
This pipeline uses a **confusion matrix** to visualize the performance of the model. The confusion matrix is created in its own self-contained component, which follows the same structure as the other components. The **model**, the **input data**, and the required **packages** are loaded first; the matrix is then computed and written as inline CSV metadata for the Kubeflow Pipelines UI. Like the other components, this one needs a base image containing all the modules required to run its code.

In [270]:
def plot_confusion_matrix(
        traintest_dir: InputPath(str),
        model_dir: InputPath(str),
        labels: list,
        mlpipeline_ui_metadata_path: OutputPath()):
    import json
    import logging
    import numpy as np
    import pandas as pd
    from sklearn.metrics import confusion_matrix
    import sys
    import tensorflow as tf

    logging.basicConfig(
        stream=sys.stdout,
        level=logging.INFO,
        format='%(levelname)s %(asctime)s: %(message)s'
    )
    
    val_data = np.load(f'{traintest_dir}/val_data.npz')
    testX = val_data[val_data.files[0]]
    testY = val_data[val_data.files[1]]
    
    model = tf.keras.models.load_model(model_dir)

    y_true = np.argmax(testY, axis=1)
    y_pred = np.argmax(model.predict(testX), axis=1)
    # Use a separate name so the imported `confusion_matrix` function is not shadowed
    cm = confusion_matrix(y_true, y_pred)

    data = []
    for target_index, target_row in enumerate(cm):
        for predicted_index, count in enumerate(target_row):
            data.append((labels[target_index], labels[predicted_index], count))

    df = pd.DataFrame(
        data,
        columns=['target', 'predicted', 'count']
    )

    metadata = {
      'outputs': [{
        'type': 'confusion_matrix',
        'format': 'csv',
        'schema': [
          {'name': 'target', 'type': 'CATEGORY'},
          {'name': 'predicted', 'type': 'CATEGORY'},
          {'name': 'count', 'type': 'NUMBER'},
        ],
        "storage": "inline",
        'source': df.to_csv(
            columns=['target', 'predicted', 'count'],
            header=False,
            index=False),
        'labels': labels,
      }]
    }

    logging.info("Dumping mlpipeline_ui_metadata...")
    with open(mlpipeline_ui_metadata_path, 'w') as metadata_file:
        json.dump(metadata, metadata_file)

    logging.info("Finished.")


plot_confusion_matrix_comp = kfp.components.create_component_from_func(
    func=plot_confusion_matrix,
    base_image='quay.io/ibm/kubeflow-notebook-image-ppc64le:elyra3.7.0-py3.8-tensorflow-cpu2.7.0'
)
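
The `(target, predicted, count)` rows that the component writes as CSV can be reproduced with the standard library alone. The sketch below counts label pairs and emits one row per combination of classes. The helper `confusion_rows` and the toy predictions are illustrative assumptions, not the component's actual data.

```python
import csv
import io
from collections import Counter


def confusion_rows(y_true, y_pred, labels):
    """Return (target, predicted, count) rows for every label pair."""
    counts = Counter(zip(y_true, y_pred))
    return [
        (labels[t], labels[p], counts[(t, p)])
        for t in range(len(labels))
        for p in range(len(labels))
    ]


labels = ["Covid", "Normal"]  # index 0 and 1 must match the class indices
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

rows = confusion_rows(y_true, y_pred, labels)

# Serialize without a header row, as the component does with df.to_csv.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The order of `labels` matters: each position has to correspond to the class index produced during preprocessing, otherwise the rows are mislabeled.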

## 1.7 Convert model to ONNX (by reusing a Kubeflow component)

In [92]:
convert_model_to_onnx_comp = kfp.components.load_component_from_url(
    CONVERT_MODEL_TO_ONNX_COMPONENT_URL
)

## 1.8 Upload model to MinIO artifact store (by reusing a Kubeflow component)

In [106]:
upload_model_comp = kfp.components.load_component_from_url(
    UPLOAD_MODEL_COMPONENT_URL
)

## 1.9 Deploy the model using KServe (by reusing a Kubeflow component)

In [107]:
deploy_model_with_kserve_comp = kfp.components.load_component_from_url(
    DEPLOY_MODEL_WITH_KSERVE_COMPONENT_URL
)

## 2 Pipeline
After all the components have been specified, the pipeline is defined using the **@dsl.pipeline** decorator. The pipeline determines the order in which the components run and which parameters are passed between them.

In [271]:
@dsl.pipeline(
  name='Covid19 lung classification',
  description='A pipeline that performs a Covid19 classification on lung images'
)
def covid19_lung_classification_pipeline(
            labels: list,
            dataset_url: str,
            dataset_file_name: str = "data.zip",
            data_dir: str = "/train/data",
            prep_dir: str = "/train/prep_data",
            traintest_dir: str = "/train/traintest_data",
            model_dir: str = "/train/model",
            model_name: str = "covid-classification",
            minio_url: str = MINIO_URL,
            minio_user: str = MINIO_USER,
            minio_pass: str = MINIO_PASS):
    
    download_and_extract_task = download_and_extract_comp(
        url=dataset_url,
        file_name=dataset_file_name
    )

    preprocess_data_task = preprocess_data_comp(
        download_and_extract_task.outputs['data_path']
    )
    
    traintestsplit_data_task = traintestsplit_data_comp(
        preprocess_data_task.outputs['prep_dir'],
        file_name=dataset_file_name
    )  
    
    train_model_task = train_model_comp(
        traintestsplit_data_task.output
    ).set_gpu_limit(1)

    evaluate_model_task = evaluate_model_comp(
        traintestsplit_data_task.output,
        train_model_task.output
    )

    plot_confusion_matrix_task = plot_confusion_matrix_comp(
        traintestsplit_data_task.output,
        train_model_task.output,
        labels
    )

    convert_model_to_onnx_task = convert_model_to_onnx_comp(
        train_model_task.output
    )

    upload_model_task = upload_model_comp(
        convert_model_to_onnx_task.output,
        minio_url,
        minio_user,
        minio_pass,
        model_name=model_name
    )

    deploy_model_with_kserve_task = deploy_model_with_kserve_comp(
        model_name=model_name
    )

    deploy_model_with_kserve_task.after(upload_model_task)
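
The execution order follows from the dependencies between tasks: most edges come from passing one task's output to another, while the edge from upload to deployment exists only because of the explicit `.after()` call. The sketch below mirrors that wiring with the standard-library `graphlib` module (Python 3.9+) and computes one valid order. It does not use the Kubeflow SDK, and the task names are shortened labels chosen for illustration.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
dependencies = {
    "download_and_extract": set(),
    "preprocess_data": {"download_and_extract"},
    "traintestsplit_data": {"preprocess_data"},
    "train_model": {"traintestsplit_data"},
    "evaluate_model": {"traintestsplit_data", "train_model"},
    "plot_confusion_matrix": {"traintestsplit_data", "train_model"},
    "convert_model_to_onnx": {"train_model"},
    "upload_model": {"convert_model_to_onnx"},
    # No data flows from upload to deploy; this edge comes from `.after()`.
    "deploy_model_with_kserve": {"upload_model"},
}

execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

Tasks without a dependency between them, such as evaluation and ONNX conversion, may appear in either order here and can run in parallel in the pipeline.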

## 3 Run the pipeline within an experiment
After the pipeline arguments are defined, the pipeline run is executed. Click *Run details*, which appears below the cell, to view the run in the Kubeflow Pipelines UI, which opens in the browser.

In [272]:
arguments = {
    'labels': LABELS,
    'dataset_url': DATASET_URL,
    'dataset_file_name': DATASET_FILE_NAME,
    'data_dir': '/train/data',
    'prep_dir': '/train/prep_data',
    'traintest_dir' : '/train/traintest_data',
    'model_dir': '/train/model',
    'model_name': MODEL_NAME,
    'minio_url': MINIO_URL,
    'minio_user': MINIO_USER,
    'minio_pass': MINIO_PASS
}

client.create_run_from_pipeline_func(
    covid19_lung_classification_pipeline,
    arguments=arguments,
    namespace=NAMESPACE
)

RunPipelineResult(run_id=1303620f-00ba-4dbe-8c82-41aed3e458c1)