<i>Copyright (c) Microsoft Corporation. All rights reserved.<br>
Licensed under the MIT License.</i>
<br><br>
# SVD Hyperparameter Tuning with Kubeflow

In this notebook, we show how to tune the hyperparameters of a matrix factorization algorithm, SVD (Singular Value Decomposition) from the Surprise library, by utilizing **[Kubeflow](https://www.kubeflow.org/)** in the context of movie recommendations. Kubeflow is a machine learning toolkit for [Kubernetes](https://kubernetes.io/) which makes deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable.

We walk through the key steps of deploying Kubeflow on [AKS (Azure Kubernetes Service)](https://azure.microsoft.com/en-us/services/kubernetes-service/) and using it to run hyperparameter tuning experiments, without going into every detail.

For more details about the **SVD** algorithm:
* [Surprise SVD deep-dive notebook](../02_model/surprise_svd_deep_dive.ipynb)
* [Original paper](http://papers.nips.cc/paper/3208-probabilistic-matrix-factorization.pdf)
* [Surprise homepage](https://surprise.readthedocs.io/en/stable/)
  
Regarding **Kubeflow**, please refer to:
* [Azure Kubeflow labs github repo](https://github.com/Azure/kubeflow-labs)
* [Kubeflow official doc: Getting started on Kubernetes](https://www.kubeflow.org/docs/started/getting-started-k8s/)
* [Hyperparameter tuning a Tensorflow model on Kubeflow with GPU cluster](https://github.com/loomlike/hyperparameter-tuning-on-kubernetes)

## 0. Prerequisites

* Docker (if you want to create your own docker image) - To install, see [docker site](https://docs.docker.com/install/).
* Azure CLI - The easiest way is to use [Azure DSVM](https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/).
  - You need the Azure CLI version 2.0.64 or later installed and configured. Run `az --version` to find the version. If you need to install or upgrade, see [Install Azure CLI](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli-apt?view=azure-cli-latest#update).
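
The version requirement above can also be checked programmatically. The following is a minimal sketch that compares dotted version strings as integer tuples; the helper names `parse_version` and `meets_minimum` are illustrative and not part of the Azure CLI or this repository.

```python
def parse_version(text):
    """Turn a dotted version string such as '2.0.64' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(installed, minimum="2.0.64"):
    """Return True when the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


print(meets_minimum("2.0.63"))  # False
print(meets_minimum("2.0.64"))  # True
print(meets_minimum("2.1.0"))   # True
```

Comparing tuples rather than raw strings avoids ordering mistakes such as treating `2.0.9` as newer than `2.0.64`.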

In [1]:
%load_ext autoreload
%autoreload 2

In [12]:
import os
import sys
sys.path.append("../../")
import time
from tempfile import TemporaryDirectory

import surprise
import papermill as pm
import pandas as pd

from reco_utils.common.constants import SEED
from reco_utils.dataset import movielens
from reco_utils.dataset.python_splitters import python_random_split
from reco_utils.evaluation.python_evaluation import rmse, precision_at_k, ndcg_at_k
from reco_utils.recommender.surprise.surprise_utils import compute_rating_predictions, compute_ranking_predictions
from reco_utils.kubeflow.manifest.utils import (
    choice,
    uniform,
    make_hypertune_manifest,
    worker_manifest,
)
from reco_utils.kubeflow.manifest import (
    Goal,
    SearchType,
    WorkerType,
)

print("System version: {}".format(sys.version))
print("Surprise version: {}".format(surprise.__version__))

System version: 3.6.8 |Anaconda, Inc.| (default, Dec 29 2018, 19:04:46) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]
Surprise version: 1.0.6


## 1. Setup
During the setup, we create an AKS cluster and install Kubeflow on it.

#### 1.1 AKS setup
To create the AKS cluster, first make sure you are signed in to the Azure CLI with the correct subscription.

In [None]:
!az login

In [None]:
!az account show

To change the subscription, run `az account set --subscription <YOUR-SUBSCRIPTION-NAME-OR-ID>`.

Next, **set the desired names for your resource group and AKS cluster, as well as the region in which to create the resources,** in the following cell.

In [3]:
RG_NAME = "reco-aks-rg"  # YOUR-RESOURCE-GROUP-NAME
AKS_NAME = "reco-aks"    # RESOURCE-NAME
LOCATION = "eastus"      # RESOURCE-REGION. To list all available regions, run 'az account list-locations' and see the 'name' key

Then, run the following commands to create the resources. This example will create four [Standard_D2_v2](https://docs.microsoft.com/en-us/azure/virtual-machines/linux/sizes-general#dv2-series) CPU VM nodes for the cluster.

In [None]:
# Create resource group
!az group create --name {RG_NAME} --location {LOCATION}

# Create AKS cluster
!az aks create \
    --resource-group {RG_NAME} \
    --name {AKS_NAME} \
    --node-count 4 \
    --node-vm-size Standard_D2_v2 \
    --enable-addons monitoring \
    --generate-ssh-keys

Creating an AKS cluster may take a few minutes. If the creation is successful, you'll see something like:
```
{- Finished ..ion done[############################################]  100.0000%
  "aadProfile": null,
    "addonProfiles": {
      "omsagent": {
        "config": {
  ...
```

Now, install [Kubernetes CLI](https://docs.microsoft.com/en-us/azure/aks/tutorial-kubernetes-deploy-cluster#install-the-kubernetes-cli) `kubectl` for running commands against the cluster. If you already installed kubectl, skip this cell.

In [None]:
!az aks install-cli

Connect the CLI to your cluster by running:

In [None]:
!az aks get-credentials --resource-group {RG_NAME} --name {AKS_NAME}

If you get an error, check that you have read/write permissions on the Kubernetes config file. On a Linux machine, the file is at `~/.kube/config`.

To verify the connection of CLI to your cluster, run:

In [3]:
!kubectl get nodes

NAME                       STATUS   ROLES   AGE   VERSION
aks-nodepool1-30917087-0   Ready    agent   23h   v1.12.8
aks-nodepool1-30917087-1   Ready    agent   23h   v1.12.8
aks-nodepool1-30917087-2   Ready    agent   23h   v1.12.8
aks-nodepool1-30917087-3   Ready    agent   23h   v1.12.8


If the connection is successful, the node information will be printed like this:
```
NAME                       STATUS    ROLES     AGE       VERSION
aks-nodepool1-17965807-0   Ready     agent     11m       v1.12.8
...
```
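
If you prefer to check node readiness in code rather than by eye, the text output can be parsed. This is a rough sketch that assumes the column layout shown above (a header row containing `NAME` and `STATUS`); the function name `nodes_not_ready` is hypothetical.

```python
def nodes_not_ready(kubectl_output):
    """Return the names of nodes whose STATUS column is not exactly 'Ready'."""
    lines = [line for line in kubectl_output.strip().splitlines() if line.strip()]
    header, rows = lines[0].split(), lines[1:]
    # Locate columns by header name instead of assuming fixed positions.
    name_idx, status_idx = header.index("NAME"), header.index("STATUS")
    return [
        row.split()[name_idx]
        for row in rows
        if row.split()[status_idx] != "Ready"
    ]


sample = """
NAME                       STATUS   ROLES   AGE   VERSION
aks-nodepool1-30917087-0   Ready    agent   23h   v1.12.8
aks-nodepool1-30917087-1   Ready    agent   23h   v1.12.8
"""
print(nodes_not_ready(sample))  # []
```

An empty list means every listed node reports `Ready`. Any other status string, including combined ones, is treated as not ready by this sketch.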

#### 1.2 Kubeflow setup
Kubeflow makes use of *[ksonnet](https://www.kubeflow.org/docs/components/ksonnet/)* to help manage deployments.

First, set up environment variables and download the ksonnet file by running the following scripts:

In [32]:
os.environ["OS_TYPE"] = "linux"  # Use "darwin" for Mac
os.environ["KS_VER"] = "0.13.1"
os.environ["KS_PKG"] = "ks_{0}_{1}_amd64".format(os.environ["KS_VER"], os.environ["OS_TYPE"])
os.environ["PATH"] = "{0}:{1}/bin/{2}".format(
    os.environ["PATH"],
    os.environ["HOME"],
    os.environ["KS_PKG"]
)
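
Instead of hard-coding `OS_TYPE`, the value could be derived from the running system. The sketch below assumes the ksonnet release naming pattern used in the cell above (`ks_<version>_<os>_amd64`); the helper `ksonnet_package` is illustrative only.

```python
import platform


def ksonnet_package(version, os_type=None):
    """Build a ksonnet release package name, e.g. 'ks_0.13.1_linux_amd64'."""
    if os_type is None:
        # platform.system() returns 'Linux' on Linux and 'Darwin' on macOS.
        os_type = platform.system().lower()
    return "ks_{0}_{1}_amd64".format(version, os_type)


print(ksonnet_package("0.13.1", "linux"))   # ks_0.13.1_linux_amd64
print(ksonnet_package("0.13.1", "darwin"))  # ks_0.13.1_darwin_amd64
```

Calling `ksonnet_package("0.13.1")` without an OS argument picks the name for the machine the notebook runs on.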

In [33]:
%%bash

wget -O /tmp/${KS_PKG}.tar.gz https://github.com/ksonnet/ksonnet/releases/download/v${KS_VER}/${KS_PKG}.tar.gz -q
mkdir -p ${HOME}/bin
tar -xvf /tmp/${KS_PKG}.tar.gz -C ${HOME}/bin

x ks_0.13.1_darwin_amd64/CHANGELOG.md
x ks_0.13.1_darwin_amd64/CODE-OF-CONDUCT.md
x ks_0.13.1_darwin_amd64/CONTRIBUTING.md
x ks_0.13.1_darwin_amd64/LICENSE
x ks_0.13.1_darwin_amd64/README.md
x ks_0.13.1_darwin_amd64/ks


Next, download and extract Kubeflow, then deploy it. For more details about this process, please see the [installation instructions for Kubeflow on Azure](https://www.kubeflow.org/docs/azure/deploy/install-kubeflow/).

In [37]:
os.environ["KFCTL_VER"] = "0.5.1"
os.environ["KFCTL_PKG"] = "kfctl_v{}_{}".format(os.environ["KFCTL_VER"], os.environ["OS_TYPE"])
os.environ["PATH"] = "{0}:{1}/bin".format(
    os.environ["PATH"],
    os.environ["HOME"]
)
os.environ['KFAPP'] = "kfapp"

In [35]:
%%bash

wget -O /tmp/${KFCTL_PKG}.tar.gz https://github.com/kubeflow/kubeflow/releases/download/v${KFCTL_VER}/${KFCTL_PKG}.tar.gz -q
tar -xvf /tmp/${KFCTL_PKG}.tar.gz -C ${HOME}/bin

x ./kfctl


In [None]:
%%bash

kfctl init ${KFAPP}
cd ${KFAPP}
kfctl generate k8s
kfctl apply k8s

To verify the deployment, check kubeflow pods as follows:

In [4]:
!kubectl -n kubeflow get pods

NAME                                                       READY   STATUS    RESTARTS   AGE
ambassador-7b8477f667-96mmt                                1/1     Running   0          21h
ambassador-7b8477f667-kbchr                                1/1     Running   0          21h
ambassador-7b8477f667-mswcr                                1/1     Running   0          21h
argo-ui-9cbd45fdf-sgm6k                                    1/1     Running   0          21h
centraldashboard-796c755dcf-v6ctn                          1/1     Running   0          21h
jupyter-web-app-589f8756c9-rvvlx                           1/1     Running   0          21h
katib-ui-7c6997fd96-rqkdw                                  1/1     Running   0          21h
metacontroller-0                                           1/1     Running   0          21h
minio-594df758b9-sqnfc                                     1/1     Running   0          21h
ml-pipeline-75b5d4585-dqwzb                                1/1     Running   0  

We change the default namespace to "kubeflow" so that we don't need to pass the `-n kubeflow` argument to every *kubectl* command in this example.

In [9]:
!kubectl config set-context {AKS_NAME} --namespace=kubeflow

Context "junminaks-cpu" modified.


#### 1.3 Persistent volume setup
One last thing to do before moving to the next step is to create a persistent volume to store our dataset. A PersistentVolumeClaim (PVC) is a request for storage by a user. For details, see [persistent volumes with Azure files](https://docs.microsoft.com/en-us/azure/aks/azure-files-dynamic-pv). Here, we create 10 GiB of storage, which is defined in *[reco_utils/kubeflow/manifest/azure-file-pvc.yaml](../../reco_utils/kubeflow/manifest/azure-file-pvc.yaml)*.
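
The contents of the manifest file are not reproduced in this notebook. As a rough illustration of what such a claim usually contains, the sketch below builds a PersistentVolumeClaim as a Python dictionary. The claim name, storage class, access mode and size are assumptions taken from the `kubectl get pvc` output in this section, and `make_pvc` is a hypothetical helper; the actual YAML file in the repository may differ.

```python
import json


def make_pvc(name, storage_class, size, access_mode="ReadWriteMany"):
    """Build a PersistentVolumeClaim manifest as a plain dictionary."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": [access_mode],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


# Values assumed from the 'kubectl get pvc azurefile' output (10Gi, RWX).
pvc = make_pvc(name="azurefile", storage_class="azurefile", size="10Gi")
print(json.dumps(pvc, indent=2))
```

`ReadWriteMany` lets several pods mount the same volume at once, which suits concurrent tuning workers reading one dataset.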

In [3]:
!kubectl apply -f ../../reco_utils/kubeflow/manifest/azure-file-sc.yaml
!kubectl apply -f ../../reco_utils/kubeflow/manifest/azure-pvc-roles.yaml
!kubectl apply -f ../../reco_utils/kubeflow/manifest/azure-file-pvc.yaml

storageclass.storage.k8s.io/azurefile created
clusterrole.rbac.authorization.k8s.io/system:azure-cloud-provider created
clusterrolebinding.rbac.authorization.k8s.io/system:azure-cloud-provider created
persistentvolumeclaim/azurefile created


To verify the deployment, run:

In [10]:
!kubectl get pvc azurefile

NAME        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
azurefile   Bound    pvc-3086e6ad-8e16-11e9-9ec7-46b2371e9153   10Gi       RWX            azurefile      20h


## 2. Experiment Preparation
#### 2.1 Dataset
1. Download the data and split it into training, validation and testing sets.
2. Upload the training, validation and testing sets to our PVC. To do that:
   1. Attach a pod to the PVC
   2. Copy the datasets onto the pod
   3. Delete the pod
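
To make the splitting step concrete, here is a pure-Python sketch of a seeded random split by ratio. It is not the `python_random_split` implementation from `reco_utils`; the function `random_split` and the toy `ratings` list are made up for illustration.

```python
import random


def random_split(rows, ratios, seed=42):
    """Shuffle rows and cut them into consecutive parts sized by the ratios."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    parts, start = [], 0
    for i, ratio in enumerate(ratios):
        # The last part takes the remainder so rounding never drops a row.
        if i == len(ratios) - 1:
            end = len(shuffled)
        else:
            end = start + round(ratio * len(shuffled))
        parts.append(shuffled[start:end])
        start = end
    return parts


# Toy data: 100 (user, item, rating) tuples.
ratings = [(user, item, 3.0) for user in range(10) for item in range(10)]
train_rows, val_rows, test_rows = random_split(ratings, [0.7, 0.15, 0.15])
print(len(train_rows), len(val_rows), len(test_rows))  # 70 15 15
```

Fixing the seed makes the split reproducible, so every tuning run sees the same training and validation data.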

In [53]:
# Select MovieLens data size: 100k, 1m, 10m, or 20m
MOVIELENS_DATA_SIZE = '100k'

TRAIN_FILE_NAME = "movielens_" + MOVIELENS_DATA_SIZE + "_train.pkl"
VAL_FILE_NAME = "movielens_" + MOVIELENS_DATA_SIZE + "_val.pkl"
TEST_FILE_NAME = "movielens_" + MOVIELENS_DATA_SIZE + "_test.pkl"

USERCOL = 'userID'
ITEMCOL = 'itemID'

In [52]:
data = movielens.load_pandas_df(
    size=MOVIELENS_DATA_SIZE,
    header=[USERCOL, ITEMCOL, "rating"]
)

data.head()

100%|██████████| 4.81k/4.81k [00:00<00:00, 5.14kKB/s]


|   | userID | itemID | rating |
|---|--------|--------|--------|
| 0 | 196 | 242 | 3.0 |
| 1 | 186 | 302 | 3.0 |
| 2 | 22 | 377 | 1.0 |
| 3 | 244 | 51 | 2.0 |
| 4 | 166 | 346 | 1.0 |


In [54]:
train, validation, test = python_random_split(data, [0.7, 0.15, 0.15], seed=SEED)

In [23]:
tmpdir = TemporaryDirectory()

train_pickle_path = os.path.join(tmpdir.name, TRAIN_FILE_NAME)
train.to_pickle(train_pickle_path)

val_pickle_path = os.path.join(tmpdir.name, VAL_FILE_NAME)
validation.to_pickle(val_pickle_path)

test_pickle_path = os.path.join(tmpdir.name, TEST_FILE_NAME)
test.to_pickle(test_pickle_path)

Now we create a pod by using [reco_utils/kubeflow/manifest/pvc-loader.yaml](../../reco_utils/kubeflow/manifest/pvc-loader.yaml) to upload the datasets into the `/data` folder.

In [11]:
!kubectl delete pod pvc-loader  # Delete if the pod already exists
!kubectl apply -f ../../reco_utils/kubeflow/manifest/pvc-loader.yaml

pod "pvc-loader" deleted
pod/pvc-loader created


In [24]:
# Upload data files
!kubectl cp {train_pickle_path} pvc-loader:/data/
!kubectl cp {val_pickle_path} pvc-loader:/data/
!kubectl cp {test_pickle_path} pvc-loader:/data/

In [51]:
# Verify
!kubectl exec pvc-loader -- bash -c "ls /data/"

d5997a7396c4c1db
fada5d956cce34c3
movielens_100k_test.pkl
movielens_100k_train.pkl
movielens_100k_val.pkl
sb3123f2df3038dc


After uploading the data, we don't need the pod anymore, so remove it.

In [26]:
!kubectl delete pod pvc-loader 

pod "pvc-loader" deleted


#### 2.2 Training scripts

We prepare a training script [reco_utils/kubeflow/svd_training.py](../../reco_utils/kubeflow/svd_training.py) for the hyperparameter tuning, which will log our target metrics such as [RMSE](https://en.wikipedia.org/wiki/Root-mean-square_deviation) and/or [NDCG](https://en.wikipedia.org/wiki/Discounted_cumulative_gain) to Katib so that we can track the metrics and optimize the primary metric.
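
How the script hands metrics to Katib is not shown in this notebook. One convention, assumed here, is that the Katib metrics collector scans the worker's standard output for lines of the form `name=value`. The sketch below prints metrics in that shape; `format_metric` and `log_metrics` are hypothetical helpers and the numbers are placeholders, so check `svd_training.py` for what the real script does.

```python
def format_metric(name, value):
    """Format one metric as a 'name=value' line."""
    return "{}={}".format(name, value)


def log_metrics(metrics):
    """Print each metric on its own line and return the printed lines."""
    lines = [format_metric(name, value) for name, value in metrics.items()]
    for line in lines:
        print(line)
    return lines


# Placeholder values, not results from an actual run.
log_metrics({"rmse": 1.5, "precision_at_k": 0.25})
```

The metric names printed this way would need to match the names registered in the study's metric list for them to be tracked.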

We use the Docker image containing our Recommender repo as well as the training script. For more details, see [reco_utils/kubeflow/docker/Dockerfile](../../reco_utils/kubeflow/docker/Dockerfile).

#### 2.3 Parameters

We define a search space for the hyperparameters. All the parameter values will be passed into our training script.
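
To illustrate what a random search does with such a space, the sketch below draws one value per parameter for each trial. The helpers `choice_spec`, `uniform_spec` and `sample_trial` are stand-ins written for this example; they are not the `choice` and `uniform` functions imported from `reco_utils`, whose return values are not shown in this notebook.

```python
import random


def choice_spec(options):
    """Describe a categorical parameter."""
    return {"type": "choice", "options": list(options)}


def uniform_spec(low, high):
    """Describe a continuous parameter sampled uniformly from [low, high]."""
    return {"type": "uniform", "low": low, "high": high}


def sample_trial(space, rng):
    """Draw one value per parameter to form a single trial configuration."""
    trial = {}
    for name, spec in space.items():
        if spec["type"] == "choice":
            trial[name] = rng.choice(spec["options"])
        else:
            trial[name] = rng.uniform(spec["low"], spec["high"])
    return trial


space = {
    "--n-factors": choice_spec([10, 50, 100, 150, 200]),
    "--lr-bu": uniform_spec(1e-6, 0.1),
    "--reg-bu": uniform_spec(1e-6, 1),
}
rng = random.Random(0)  # fixed seed so the draws are repeatable
for _ in range(3):
    print(sample_trial(space, rng))
```

Each printed dictionary corresponds to the arguments one worker would receive in addition to the fixed script parameters.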

In [67]:
EXP_NAME = "movielens-" + MOVIELENS_DATA_SIZE + "-svd"
PRIMARY_METRIC = 'precision_at_k'
PRIMARY_METRIC_GOAL = Goal.MAXIMIZE
IDEAL_METRIC_VALUE = 1.0
RATING_METRICS = ['rmse']
RANKING_METRICS = ['precision_at_k', 'ndcg_at_k']  

REMOVE_SEEN = True
K = 10
RANDOM_STATE = 0
VERBOSE = True
NUM_EPOCHS = 30
BIASED = True

MAX_TOTAL_RUNS = 8  # Number of runs (training-and-evaluation) to search for the best hyperparameters. TODO: raise to 100
MAX_CONCURRENT_RUNS = 8

# PVC mount path
STORAGE_MOUNT_PATH = "/data"

script_params = {
    '--datastore': STORAGE_MOUNT_PATH,
    '--train-datapath': TRAIN_FILE_NAME,
    '--validation-datapath': VAL_FILE_NAME,
    '--output-dir': "outputs",
    '--surprise-reader': "ml-100k",
    '--rating-metrics': RATING_METRICS,
    '--ranking-metrics': RANKING_METRICS,
    '--usercol': USERCOL,
    '--itemcol': ITEMCOL,
    '--k': K,
    '--random-state': RANDOM_STATE,
    '--epochs': NUM_EPOCHS,
}

if BIASED:
    script_params['--biased'] = ''
if VERBOSE:
    script_params['--verbose'] = ''
if REMOVE_SEEN:
    script_params['--remove-seen'] = ''

# hyperparameters search space
# We do not set 'lr_all' and 'reg_all' because they will be overwritten by the other lr_ and reg_ parameters
hyperparams = {
    '--n-factors': choice([10, 50, 100, 150, 200]),
    '--init-mean': uniform(-0.5, 0.5),
    '--init-std-dev': uniform(0.01, 0.2),
    '--lr-bu': uniform(1e-6, 0.1), 
    '--lr-bi': uniform(1e-6, 0.1), 
    '--lr-pu': uniform(1e-6, 0.1), 
    '--lr-qi': uniform(1e-6, 0.1), 
    '--reg-bu': uniform(1e-6, 1),
    '--reg-bi': uniform(1e-6, 1), 
    '--reg-pu': uniform(1e-6, 1), 
    '--reg-qi': uniform(1e-6, 1)
}

Now we create worker and study manifests.

In [74]:
worker_spec = worker_manifest(
    worker_type=WorkerType.WORKER,
    image_name='loomlike/reco',
    entry_script='/app/reco_utils/kubeflow/svd_training.py',
    params=script_params,
    is_hypertune=True,
    storage_path=STORAGE_MOUNT_PATH,
    use_gpu=False,
)

studyjob_name, studyjob_file = make_hypertune_manifest(
    study_name=EXP_NAME,
    tag="random-1",
    search_type=SearchType.RANDOM,
    total_runs=MAX_TOTAL_RUNS,
    concurrent_runs=MAX_CONCURRENT_RUNS,
    primary_metric=PRIMARY_METRIC,
    goal=PRIMARY_METRIC_GOAL,
    ideal_metric_value=IDEAL_METRIC_VALUE,
    metrics=RATING_METRICS+RANKING_METRICS,
    hyperparams=hyperparams,
    worker_spec=worker_spec
)

StudyJob manifest has been generated.        To start, run 'kubectl create -f jobs/movielens-100k-svd-random-1.yaml'


## 3. Experiments

In [84]:
# Delete previous StudyJob of the same name if exists
!kubectl delete studyjob {studyjob_name}

# Create a StudyJob
!kubectl create -f {studyjob_file}

studyjob.kubeflow.org "movielens-100k-svd-random-1" deleted
studyjob.kubeflow.org/movielens-100k-svd-random-1 created


In [79]:
!kubectl describe studyjob {studyjob_name}

Name:         movielens-100k-svd-random-1
Namespace:    kubeflow
Labels:       controller-tools.k8s.io=1.0
Annotations:  <none>
API Version:  kubeflow.org/v1alpha1
Kind:         StudyJob
Metadata:
  Creation Timestamp:  2019-06-16T03:41:24Z
  Finalizers:
    clean-studyjob-data
  Generation:        1
  Resource Version:  461385
  Self Link:         /apis/kubeflow.org/v1alpha1/namespaces/kubeflow/studyjobs/movielens-100k-svd-random-1
  UID:               9d4ed7c8-8fe8-11e9-83b5-32708d49e78a
Spec:
  Metricsnames:
    rmse
    precision_at_k
    ndcg_at_k
  Objectivevaluename:  precision_at_k
  Optimizationgoal:    1
  Optimizationtype:    maximize
  Owner:               crd
  Parameterconfigs:
    Feasible:
      List:
        10
        50
        100
        150
        200
    Name:           --n-factors
    Parametertype:  categorical
    Feasible:
      Max:          0.5
      Min:          -0.5
    Name:           --init-mean
    Parametertype: 

In [47]:
!kubectl get pod

NAME                                                       READY   STATUS        RESTARTS   AGE
ad37cfbb9d1307a9-1560645420-fncjh                          0/1     Completed     0          18m
ad37cfbb9d1307a9-1560645480-gt6fm                          0/1     Error         0          16m
adf1589fe7e7f510-1560646380-4j7ps                          0/1     Error         0          2m2s
ambassador-7b8477f667-96mmt                                1/1     Running       0          2d5h
ambassador-7b8477f667-kbchr                                1/1     Running       0          2d5h
ambassador-7b8477f667-mswcr                                1/1     Running       0          2d5h
argo-ui-9cbd45fdf-sgm6k                                    1/1     Running       0          2d5h
b4148ba65f097977-1560646380-bksm7                          0/1     Error         0          2m2s
b43384de369facca-1560646380-blmpz                          0/1     Error         0          2m2s
c78adcbe3fc9f484-1560645360-8nslb

In [49]:
!kubectl describe job ad37cfbb9d1307a9-1560645420

Name:           ad37cfbb9d1307a9-1560645420
Namespace:      kubeflow
Selector:       controller-uid=db30dd3f-8fce-11e9-9ec7-46b2371e9153
Labels:         controller-uid=db30dd3f-8fce-11e9-9ec7-46b2371e9153
                job-name=ad37cfbb9d1307a9-1560645420
Annotations:    <none>
Parallelism:    1
Completions:    1
Start Time:     Sat, 15 Jun 2019 20:37:01 -0400
Completed At:   Sat, 15 Jun 2019 20:38:51 -0400
Duration:       110s
Pods Statuses:  0 Running / 1 Succeeded / 0 Failed
Pod Template:
  Labels:           controller-uid=db30dd3f-8fce-11e9-9ec7-46b2371e9153
                    job-name=ad37cfbb9d1307a9-1560645420
  Service Account:  metrics-collector
  Containers:
   ad37cfbb9d1307a9:
    Image:      gcr.io/kubeflow-images-public/katib/metrics-collector:v0.1.2-alpha-156-g4ab3dbd
    Port:       <none>
    Host Port:  <none>
    Args:
      ./metricscollector
      -s
      sb3123f2df3038dc
      -t
      g3cb546f948e8801
      -w
      ad37cfbb9d1307a

## 4. Results

In [14]:
# Get best run and printout metrics
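
The cell above is left as a placeholder. As a sketch of the selection logic only, the snippet below picks the best trial from a list of per-trial metric records. The record layout, the `best_trial` helper and all values are made up for illustration; they are not output from this experiment, and retrieving the real records from Katib is not covered here.

```python
def best_trial(trials, metric, maximize=True):
    """Return the trial with the best value of the given metric."""
    pick = max if maximize else min
    return pick(trials, key=lambda trial: trial["metrics"][metric])


# Made-up records for illustration, not results from an actual run.
trials = [
    {"id": "trial-a", "metrics": {"precision_at_k": 0.08, "rmse": 1.10}},
    {"id": "trial-b", "metrics": {"precision_at_k": 0.10, "rmse": 1.03}},
    {"id": "trial-c", "metrics": {"precision_at_k": 0.09, "rmse": 0.98}},
]

print(best_trial(trials, "precision_at_k")["id"])        # trial-b
print(best_trial(trials, "rmse", maximize=False)["id"])  # trial-c
```

The `maximize` flag mirrors the optimization goal: ranking metrics such as precision are maximized, while error metrics such as RMSE are minimized.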

Now evaluate the metrics on the test data. To do this, get the SVD model that was saved as model.dump in the training script.

In [17]:
# Load model and test

In [18]:
svd = surprise.dump.load('aml_model/model.dump')[1]

In [19]:
test_results = {}
predictions = compute_rating_predictions(svd, test, usercol=USERCOL, itemcol=ITEMCOL)
for metric in RATING_METRICS:
    test_results[metric] = eval(metric)(test, predictions)

# RECOMMEND_SEEN was never defined; derive it from REMOVE_SEEN (set in section 2.3)
all_predictions = compute_ranking_predictions(svd, train, usercol=USERCOL, itemcol=ITEMCOL, recommend_seen=not REMOVE_SEEN)
for metric in RANKING_METRICS:
    test_results[metric] = eval(metric)(test, all_predictions, col_prediction='prediction', k=K)

print(test_results)

{'rmse': 1.0331492610799313, 'precision_at_k': 0.09968017057569298, 'ndcg_at_k': 0.1160964958978592}


## 5. Concluding Remarks

We showed how to tune **all** the hyperparameters accepted by Surprise SVD simultaneously, by utilizing Kubeflow on AKS.

TODO add insights

#### Cleanup

To uninstall Kubeflow, run:
```
cd ${KFAPP}
# If you want to delete all the resources, including storage.
kfctl delete all --delete_storage
# If you want to preserve storage, which contains metadata and information
# from mlpipeline.
kfctl delete all
```

To remove the AKS cluster,

TODO
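
Until this section is filled in, one possible approach is sketched below, under the assumption that the cluster and resource group were created with the names used earlier in this notebook. It builds `az aks delete` and `az group delete` commands and only prints them by default. The helper names are hypothetical, and deleting a resource group removes everything in it, so review the commands before running them for real.

```python
import subprocess


def aks_delete_command(resource_group, cluster_name):
    """Build the Azure CLI command that deletes one AKS cluster."""
    return [
        "az", "aks", "delete",
        "--resource-group", resource_group,
        "--name", cluster_name,
        "--yes",  # skip the interactive confirmation prompt
    ]


def resource_group_delete_command(resource_group):
    """Build the Azure CLI command that deletes a whole resource group."""
    return ["az", "group", "delete", "--name", resource_group, "--yes"]


def run(command, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(command))
    if not dry_run:
        subprocess.run(command, check=True)


# Names assumed from the setup section (RG_NAME and AKS_NAME).
run(aks_delete_command("reco-aks-rg", "reco-aks"))
run(resource_group_delete_command("reco-aks-rg"))
```

Keeping `dry_run=True` as the default makes it harder to delete resources by accident when re-running the notebook.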