# Insurance charges regression pipeline in Kubeflow

This notebook segments the **insurance charge regression notebook** into components and executes them as a **Kubeflow pipeline** run. A pipeline is a description of an ML workflow that includes every step of the workflow as a component. A pipeline component is a self-contained set of user code, packaged as a Docker image, that performs one step in the pipeline, for example data preprocessing, data transformation, or model training. To run as a Kubeflow pipeline, a conventional data science notebook first has to be brought into a Kubeflow-*friendly* format, which is what this notebook demonstrates.

![pics](pics/insurance_Kubeflow.JPG)

## Load reusable components, define data location & name, MinIO, and namespace

Reusable components for repetitive steps are loaded first. The components are hosted in a coworker's GitHub repository as **.yaml** files and are loaded by URL. Kubeflow is designed to let data scientists reuse components for steps of the ML workflow that occur frequently, for example downloading the data. The dataset used in this notebook was uploaded to the file hosting service Box. Its URL and file name are defined next, along with the model name. Kubeflow ships with MinIO to store all of its pipelines, artifacts and logs. The MinIO URL, username and password are also defined here.

Kubeflow comes with multi-user isolation which simplifies user operations because each user only views and edits the Kubeflow components and model artifacts defined in their configuration. Isolation uses Kubernetes **Namespaces**. The Namespace needs to be specified before the other steps of the pipeline can be defined. 

In [68]:
DOWNLOAD_AND_EXTRACT_COMPONENT_URL = "https://raw.githubusercontent.com/lehrig/kubeflow-ppc64le-components/main/data-extraction/download-and-extract-from-url/component.yaml"

DATASET_URL = "https://ibm.box.com/shared/static/yqdpzhhe4x878hxcgu4a6uobra1dq7e9.zip"
DATASET_FILE_NAME = "insurance.zip"
MODEL_NAME = "insurance-cost-regression"

MINIO_URL = "minio-service.kubeflow:9000"
MINIO_USER = "minio"
MINIO_PASS = "minio123"

with open("/var/run/secrets/kubernetes.io/serviceaccount/namespace") as f:
    NAMESPACE = f.read()
NAMESPACE

'user-example-com'
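
The namespace file only exists inside a pod, and its content may carry trailing whitespace depending on how it was written. A minimal sketch of a more defensive read follows. The helper name `read_namespace` and the fallback value `"default"` are assumptions made for illustration and are not part of the Kubeflow API.

```python
import os
import tempfile

# Path used by the cell above; Kubernetes mounts it into pods.
NAMESPACE_FILE = "/var/run/secrets/kubernetes.io/serviceaccount/namespace"


def read_namespace(path: str = NAMESPACE_FILE, default: str = "default") -> str:
    """Return the namespace stored at `path`, or `default` if the file is missing.

    Both the function name and the fallback are hypothetical.
    """
    try:
        with open(path) as f:
            # strip() guards against a trailing newline in the file.
            return f.read().strip()
    except FileNotFoundError:
        return default


# Local demonstration: a temporary file stands in for the mounted secret.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "namespace")
    with open(demo_path, "w") as f:
        f.write("user-example-com\n")

    demo_namespace = read_namespace(demo_path)
    missing_namespace = read_namespace(os.path.join(tmp, "does-not-exist"))
```

Inside the notebook pod, calling `read_namespace()` without arguments would read the same file as the cell above.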

The packages needed to build and run Kubeflow pipelines are imported.

In [55]:
import kfp
import kfp.components as comp
from typing import NamedTuple
import kfp.dsl as dsl
from kfp.components import (
    InputPath,
    OutputPath
)

In [56]:
client = kfp.Client()

# 1 Components
## 1.1 Load Dataset

The first component downloads the data and extracts it from a zip file.

In [69]:
download_and_extract_comp = comp.load_component_from_url(
    DOWNLOAD_AND_EXTRACT_COMPONENT_URL
)

## 1.2 Preprocessing

In the second component, all preprocessing is done before the data is used to train the model. The data scientist decides which steps qualify as preprocessing and incorporates that code into this component. In this example, the non-numerical features 'sex', 'smoker', and 'region' are transformed into numerical features using scikit-learn's **LabelEncoder**. After preprocessing, the dataframe is saved in **.pkl** format to a new data directory called *prep_data_dir*.

Besides the preprocessing code, the component follows a fixed structure: **Input** and **Output paths** are defined at the top, **packages & modules** are imported, the **data** is loaded, the relevant code runs, and the result is saved to a **new data directory**. Finally, the component is assigned a **base image** that contains all the packages needed to run the code inside it. This structure stays the same for every subsequent component.
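
At run time, a parameter annotated with `InputPath` or `OutputPath` receives an ordinary path string, so a component function can be exercised locally before it is wrapped into a component. The following is a minimal sketch under that assumption. `toy_component`, its CSV content, and the temporary directories are hypothetical stand-ins, and plain `str` annotations are used so that the example runs without `kfp` installed.

```python
import csv
import os
import tempfile


def toy_component(data_dir: str, prep_data_dir: str) -> None:
    """Toy stand-in for a component: read from data_dir, write to prep_data_dir."""
    # 1. In a real component, the import statements would go here,
    #    inside the function body.

    # 2. Load the input data from the input path.
    with open(os.path.join(data_dir, "input.csv"), newline="") as f:
        rows = list(csv.DictReader(f))

    # 3. The actual step; here, upper-case one column.
    for row in rows:
        row["region"] = row["region"].upper()

    # 4. Save the result to the output directory.
    os.makedirs(prep_data_dir, exist_ok=True)
    with open(os.path.join(prep_data_dir, "output.csv"), "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["age", "region"])
        writer.writeheader()
        writer.writerows(rows)


# Call the function directly with local directories, as a quick smoke test.
with tempfile.TemporaryDirectory() as tmp:
    in_dir = os.path.join(tmp, "data")
    out_dir = os.path.join(tmp, "prep_data")
    os.makedirs(in_dir)
    with open(os.path.join(in_dir, "input.csv"), "w", newline="") as f:
        f.write("age,region\n19,southwest\n33,northeast\n")

    toy_component(in_dir, out_dir)

    with open(os.path.join(out_dir, "output.csv"), newline="") as f:
        result = [dict(row) for row in csv.DictReader(f)]
```

The same kind of direct call can serve as a quick check of a real component function before a pipeline run is submitted.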

## Label Encoder

In [94]:
def preprocess_data(
    data_dir: InputPath(str),
    prep_data_dir: OutputPath(str)
):
    from sklearn.preprocessing import LabelEncoder
    import numpy as np
    import os
    import pandas as pd
    
    data = f'{data_dir}/insurance.csv'
    
    insurancedf=pd.read_csv(data,na_values=[" ","null"])

    catFeats=['sex','smoker','region']
    for cf in catFeats:
        print("\nFeature %s :"%cf)
        print(insurancedf[cf].value_counts())

    for cf in catFeats:
        insurancedf[cf] = LabelEncoder().fit_transform(insurancedf[cf].values)

    if not os.path.exists(prep_data_dir):
        os.makedirs(prep_data_dir)

    insurancedf.to_pickle(f'{prep_data_dir}/insurancedf.pkl')

preprocess_data_comp = kfp.components.create_component_from_func(
    func=preprocess_data,
    base_image='quay.io/ibm/kubeflow-notebook-image-ppc64le:elyra3.7.0-py3.8-tensorflow-cpu2.7.0',
)
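
To illustrate what the encoding loop above does to each categorical column, here is a minimal pure-Python sketch of label encoding. Each distinct category is mapped to an integer according to its sorted order, which mirrors how scikit-learn's `LabelEncoder` assigns codes. The function name `label_encode` is hypothetical and the sample values are made up for illustration.

```python
from typing import Dict, List, Tuple


def label_encode(values: List[str]) -> Tuple[List[int], Dict[str, int]]:
    """Map each distinct string to an integer code based on sorted order."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], mapping


# Made-up sample columns resembling the categorical features.
smoker = ["yes", "no", "no", "yes"]
region = ["southwest", "southeast", "northwest", "northeast"]

smoker_codes, smoker_mapping = label_encode(smoker)
region_codes, region_mapping = label_encode(region)
```

The integer codes impose an arbitrary order on categories such as region, which is worth keeping in mind when the encoded columns feed a linear model.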

## 1.3 Train model

The **insurancedf** dataframe saved to *prep_data_dir* by the preprocessing component is loaded again in the next component. This component splits the data into training and test sets and trains the model. After the model is trained, it is saved to the **model directory**. The train and test splits are also saved so that they can be used in the next component.

In [117]:
def train_model(
    prep_data_dir: comp.InputPath(str),
    model_dir: comp.OutputPath(str),
    traintest_dir: comp.OutputPath(str)
):
    """The train test split is done and then the model is trained"""
    
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    import pandas as pd
    import os
    import numpy as np
    import pickle

    insurancedf = pd.read_pickle(f'{prep_data_dir}/insurancedf.pkl')
    
    X=insurancedf.values[:,:-1]
    y=insurancedf.values[:,-1]
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=0)
    
    linreg=LinearRegression()
    linreg.fit(X_train,y_train)
    
    if not os.path.exists(model_dir):
        os.makedirs(model_dir)
        
    if not os.path.exists(traintest_dir):
        os.makedirs(traintest_dir)
        
    np.savez(f'{traintest_dir}/train_data.npz', X_train, y_train)
    np.savez(f'{traintest_dir}/val_data.npz', X_test, y_test)
    
    filename = f'{model_dir}/finalized_model.sav'
    # Use a context manager so the file is flushed and closed reliably.
    with open(filename, 'wb') as f:
        pickle.dump(linreg, f)
    
train_model_comp = kfp.components.create_component_from_func(
    func=train_model,
    base_image='quay.io/ibm/kubeflow-notebook-image-ppc64le:elyra3.7.0-py3.8-tensorflow-cpu2.7.0'
)
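
Because the arrays are passed to `np.savez` positionally in the component above, NumPy stores them under the default keys `arr_0` and `arr_1`. The following small sketch shows that round trip, using an in-memory buffer and made-up arrays instead of the component's files.

```python
import io

import numpy as np

# Made-up stand-ins for the test split.
X_test = np.array([[19.0, 0.0], [33.0, 1.0]])
y_test = np.array([16884.92, 4449.46])

# Save positionally, as the component does, but into memory instead of a file.
buffer = io.BytesIO()
np.savez(buffer, X_test, y_test)
buffer.seek(0)

val_data = np.load(buffer)
keys = list(val_data.files)     # default names assigned by np.savez
X_loaded = val_data["arr_0"]    # first positional array (features)
y_loaded = val_data["arr_1"]    # second positional array (targets)
```

Passing keyword arguments instead, for example `np.savez(path, X=X_test, y=y_test)`, would make the stored keys self-describing.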

## 1.4 Model evaluation

The final component evaluates the model. It loads the necessary packages, the test data, and the trained model from the previously created directories, and prints the model's predictions for the test set.

In [124]:
def evaluate_model(
    traintest_dir: comp.InputPath(str),
    model_dir: comp.InputPath(str),
):

    import numpy as np
    import pickle

    val_data = np.load(f'{traintest_dir}/val_data.npz')
    X_test = val_data[val_data.files[0]]

    # Use a context manager so the model file is closed after loading.
    with open(f'{model_dir}/finalized_model.sav', 'rb') as f:
        model = pickle.load(f)

    ypred=model.predict(X_test)

    print(ypred)

evaluate_model_comp = kfp.components.create_component_from_func(
    func=evaluate_model,
    base_image='quay.io/ibm/kubeflow-notebook-image-ppc64le:elyra3.7.0-py3.8-tensorflow-cpu2.7.0'
)
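
The component above prints the raw predictions only. As a sketch of how an evaluation score could be derived from them, the following computes the root mean squared error (RMSE) and the coefficient of determination (R²) in plain Python. The helper names and the sample numbers are hypothetical. Inside the component, the true targets would come from the second array stored in `val_data.npz`.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error between targets and predictions."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination; assumes y_true is not constant."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up targets and predictions for illustration.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]

error = rmse(y_true, y_pred)
score = r2_score(y_true, y_pred)
```

scikit-learn offers equivalent metrics in `sklearn.metrics`, which would be the more natural choice inside the component since the base image is already used with scikit-learn.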

# 2 Pipeline

After all the components have been specified, the pipeline is defined using the **@dsl.pipeline** decorator. The pipeline determines the succession of components to run and which parameters to pass between them. 

In [129]:
@dsl.pipeline(
  name='Insurance regression pipeline',
  description='insurance regression ....'
)
def insurance_pipeline(dataset_url: str,
                    dataset_file_name: str = "data.zip",
                    data_dir: str = "/train/data",
                    prep_data_dir: str = "/train/prep_data",
                    model_dir: str = "/train/model",
                    model_name: str = "insurance-regression",
                    minio_url: str = MINIO_URL,
                    minio_user: str = MINIO_USER,
                    minio_pass: str = MINIO_PASS):
    download_and_extract_task = download_and_extract_comp(
        url=dataset_url,
        file_name=dataset_file_name
    )

    preprocess_data_task = preprocess_data_comp(
        download_and_extract_task.outputs['data_path']
    )

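    # NOTE: scikit-learn's LinearRegression runs on the CPU and the base image
    # is a CPU build, so this GPU limit can probably be dropped.
    train_model_task = train_model_comp(
        preprocess_data_task.output
    ).set_gpu_limit(1)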

    evaluate_model_task = evaluate_model_comp(
        train_model_task.outputs['traintest_dir'],
        train_model_task.outputs['model_dir']
    ).set_gpu_limit(1)
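
In the pipeline function above, no explicit ordering is declared. The order follows from which task outputs are passed as inputs to which components. The sketch below illustrates that idea with the standard-library `graphlib` module (Python 3.9+). It is an illustration of dependency-driven ordering, not Kubeflow's actual scheduler, and the task names are shorthand labels chosen for the example.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks whose outputs it consumes.
dependencies = {
    "download_and_extract": set(),
    "preprocess_data": {"download_and_extract"},
    "train_model": {"preprocess_data"},
    "evaluate_model": {"train_model"},
}

# A topological sort yields an order in which every task runs
# only after the tasks it depends on have finished.
execution_order = list(TopologicalSorter(dependencies).static_order())
```

Because each task here depends on exactly one predecessor, the resulting order is a single chain from download to evaluation.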

## 2.1 Run the pipeline

After the pipeline arguments are defined, the pipeline run is executed. Click the *Run details* link that appears below the cell to view the run in the Kubeflow Pipelines UI, which opens in the browser.

In [130]:
# Specify argument values for your pipeline run.
arguments = {
    'dataset_url': DATASET_URL,
    'dataset_file_name': DATASET_FILE_NAME,
    'data_dir': '/train/data',
    'prep_data_dir': '/train/prep_data',
    'model_dir': '/train/model',
    'model_name': MODEL_NAME,
    'minio_url': MINIO_URL,
    'minio_user': MINIO_USER,
    'minio_pass': MINIO_PASS
}

client.create_run_from_pipeline_func(
    insurance_pipeline,
    arguments=arguments,
    namespace=NAMESPACE
)

RunPipelineResult(run_id=cf0ed133-9a16-4b43-bf8c-38acaac02ae5)