<i>Copyright (c) Microsoft Corporation. All rights reserved.<br>
Licensed under the MIT License.</i>
<br><br>
# SVD Hyperparameter Tuning with Kubeflow

In this notebook, we show how to tune the hyperparameters of a matrix factorization algorithm, SVD (Singular Value Decomposition) from the Surprise library, by utilizing [**Kubeflow**](https://www.kubeflow.org/) in the context of movie recommendations. Kubeflow is a machine learning toolkit for [Kubernetes](https://kubernetes.io/) which makes deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable.

We outline the process of deploying Kubeflow on [AKS (Azure Kubernetes Service)](https://azure.microsoft.com/en-us/services/kubernetes-service/) and use [**Katib**](https://www.kubeflow.org/docs/components/hyperparameter/) (a Kubeflow component dedicated to hyperparameter tuning tasks) to run hyperparameter tuning experiments. We cover the key steps without going into every detail.

For more details about the **SVD** algorithm:
* [Surprise SVD deep-dive notebook](../02_model/surprise_svd_deep_dive.ipynb)
* [Original paper](http://papers.nips.cc/paper/3208-probabilistic-matrix-factorization.pdf)
* [Surprise homepage](https://surprise.readthedocs.io/en/stable/)
  
Regarding **Kubeflow**, please refer to:
* [Azure Kubeflow labs github repo](https://github.com/Azure/kubeflow-labs)
* [Kubeflow official doc: Getting started on Kubernetes](https://www.kubeflow.org/docs/started/getting-started-k8s/)
* [Hyperparameter tuning a Tensorflow model on Kubeflow with GPU cluster](https://github.com/loomlike/hyperparameter-tuning-on-kubernetes)

## 0. Prerequisites

* Docker (if you want to create your own docker image) - To install, see [docker site](https://docs.docker.com/install/).
* Azure CLI - The easiest way is to use [Azure DSVM](https://azure.microsoft.com/en-us/services/virtual-machines/data-science-virtual-machines/).
  - You need the Azure CLI version 2.0.64 or later installed and configured. Run `az --version` to find the version. If you need to install or upgrade, see [Install Azure CLI](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli-apt?view=azure-cli-latest#update).
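
The version requirement above can be checked programmatically. The following is a minimal sketch, assuming the string passed in contains the CLI version (for example, a line copied from the `az --version` output). The helper names `parse_version` and `meets_minimum` are illustrative and not part of any Azure tooling.

```python
import re


def parse_version(text):
    """Return the first dotted version number found in text as a tuple of ints."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no version number found in: {!r}".format(text))
    return tuple(int(part) for part in match.groups())


def meets_minimum(installed, minimum="2.0.64"):
    """Compare versions numerically, so that 2.0.100 counts as newer than 2.0.64."""
    return parse_version(installed) >= parse_version(minimum)


print(meets_minimum("azure-cli 2.0.66"))  # True
print(meets_minimum("2.0.9"))             # False
```

Comparing tuples of integers avoids the pitfall of plain string comparison, where `"2.0.9"` would sort after `"2.0.64"`.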

In [10]:
%load_ext autoreload
%autoreload 2

In [11]:
import os
import subprocess
import sys
sys.path.append("../../")
import time
from tempfile import TemporaryDirectory

import surprise
import papermill as pm
import pandas as pd

from reco_utils.common.constants import SEED
from reco_utils.dataset import movielens
from reco_utils.dataset.python_splitters import python_random_split
from reco_utils.evaluation.python_evaluation import rmse, precision_at_k, ndcg_at_k
from reco_utils.recommender.surprise.surprise_utils import compute_rating_predictions, compute_ranking_predictions
from reco_utils.kubeflow.utils import (
    choice,
    uniform,
    make_hypertune_manifest,
    make_worker_spec,
    get_study_metrics,
    get_study_result,
)
from reco_utils.kubeflow.manifest import (
    Goal,
    SearchType,
    WorkerType,
)

print("System version: {}".format(sys.version))
print("Surprise version: {}".format(surprise.__version__))

System version: 3.6.8 |Anaconda, Inc.| (default, Feb 11 2019, 15:03:47) [MSC v.1915 64 bit (AMD64)]
Surprise version: 1.0.6


## 1. Setup
During setup, we create an AKS cluster and install Kubeflow on it.

#### 1.1 AKS setup
To create an AKS cluster, first make sure you are signed in to the Azure CLI with the correct subscription.

In [None]:
!az login

In [None]:
!az account show

To change the subscription, run `az account set --subscription <YOUR-SUBSCRIPTION-NAME-OR-ID>`.

Next, **set** the desired *resource group name*, *AKS name*, and the *region* in which to create the resources in the following cell.

In [1]:
RG_NAME = "junminaks-cpu"   # YOUR-RESOURCE-GROUP-NAME
AKS_NAME = "junminaks-cpu"  # RESOURCE-NAME
LOCATION = "eastus"  # RESOURCE-REGION. To list all available regions, run 'az account list-locations' and see the 'name' key

Then, run the following commands to create the resources. This example will create **eight** [Standard_D2_v2](https://docs.microsoft.com/en-us/azure/virtual-machines/linux/sizes-general#dv2-series) CPU VM nodes for the cluster.

In [None]:
# Create resource group
!az group create --name {RG_NAME} --location {LOCATION}

# Create AKS cluster
!az aks create \
    --resource-group {RG_NAME} \
    --name {AKS_NAME} \
    --node-count 8 \
    --node-vm-size Standard_D2_v2 \
    --enable-addons monitoring \
    --generate-ssh-keys

Creating an AKS cluster may take a few minutes. If the creation is successful, you'll see something like:
```
{- Finished ..ion done[############################################]  100.0000%
  "aadProfile": null,
    "addonProfiles": {
      "omsagent": {
        "config": {
  ...
```

Now, install [Kubernetes CLI](https://docs.microsoft.com/en-us/azure/aks/tutorial-kubernetes-deploy-cluster#install-the-kubernetes-cli) `kubectl` for running commands against the cluster. If you already installed kubectl, skip this cell.

In [None]:
!az aks install-cli

Connect the CLI to your cluster by running:

In [2]:
!az aks get-credentials --resource-group {RG_NAME} --name {AKS_NAME}

# If you already have the credential and want to change the current context, run:
# !kubectl config use-context {AKS_NAME}

Switched to context "junminaks-cpu".


If you get an error, check that you have read/write permissions on the Kubernetes config file. On a Linux machine, the file is at `~/.kube/config`.
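
The permission check can be done from Python. This is a minimal sketch that assumes the default config location under the home directory; the helper name `kubeconfig_is_writable` is illustrative. If your setup points to a different file through the `KUBECONFIG` environment variable, pass that path explicitly.

```python
import os


def kubeconfig_is_writable(path=None):
    """Return True if the kubeconfig file exists and is readable and writable."""
    if path is None:
        # Assumed default location: ~/.kube/config
        path = os.path.join(os.path.expanduser("~"), ".kube", "config")
    return os.path.isfile(path) and os.access(path, os.R_OK | os.W_OK)


print(kubeconfig_is_writable())
```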

To verify the connection of CLI to your cluster, run:

In [3]:
!kubectl get nodes

NAME                       STATUS    ROLES     AGE       VERSION
aks-nodepool1-30917087-0   Ready     agent     1d        v1.12.8
aks-nodepool1-30917087-1   Ready     agent     1d        v1.12.8
aks-nodepool1-30917087-2   Ready     agent     1d        v1.12.8
aks-nodepool1-30917087-3   Ready     agent     1d        v1.12.8
aks-nodepool1-30917087-4   Ready     agent     1d        v1.12.8
aks-nodepool1-30917087-5   Ready     agent     1d        v1.12.8
aks-nodepool1-30917087-6   Ready     agent     1d        v1.12.8
aks-nodepool1-30917087-7   Ready     agent     1d        v1.12.8


If the connection is successful, the node information is printed like:
```
NAME                       STATUS    ROLES     AGE       VERSION
aks-nodepool1-17965807-0   Ready     agent     11m       v1.12.8
...
```
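
With many nodes, it can help to check the `STATUS` column automatically. The sketch below parses text in the tabular format shown above and lists the nodes that are not ready. The function name `not_ready_nodes` and the `NotReady` row in the sample are made up for illustration. In a notebook, the text could come from something like `subprocess.check_output(["kubectl", "get", "nodes"], universal_newlines=True)`.

```python
def not_ready_nodes(kubectl_output):
    """Return names of nodes whose STATUS column is not exactly 'Ready'."""
    lines = [line for line in kubectl_output.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty kubectl output")
    header = lines[0].split()
    name_idx = header.index("NAME")
    status_idx = header.index("STATUS")
    bad = []
    for row in lines[1:]:
        columns = row.split()
        if columns[status_idx] != "Ready":
            bad.append(columns[name_idx])
    return bad


# Illustrative sample; the second row is invented to show a failing node
sample = """
NAME                       STATUS     ROLES     AGE       VERSION
aks-nodepool1-30917087-0   Ready      agent     1d        v1.12.8
aks-nodepool1-30917087-1   NotReady   agent     1d        v1.12.8
"""
print(not_ready_nodes(sample))  # ['aks-nodepool1-30917087-1']
```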

#### 1.2 Kubeflow setup
Kubeflow makes use of *[ksonnet](https://www.kubeflow.org/docs/components/ksonnet/)* to help manage deployments.

First, set up the environment variables and download the ksonnet package by running the following scripts.
Please note that here we use the **Ubuntu** versions of *ksonnet* and the Kubeflow deployment tool, *kfctl* (**not kubectl**).

If you use a different OS, please refer to each application's official documentation.
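
As a convenience, the OS part of the package name can be derived rather than typed by hand. The sketch below is one possible approach, assuming an amd64 machine and the `ks_<version>_<os>_amd64` naming used in this notebook; the helper `ksonnet_package` is hypothetical. `platform.system()` returns values such as `"Linux"`, `"Darwin"`, or `"Windows"`.

```python
import platform


def ksonnet_package(version, os_type=None):
    """Build the ksonnet release folder name, e.g. ks_0.13.1_linux_amd64."""
    if os_type is None:
        os_type = platform.system().lower()
    if os_type not in ("linux", "darwin"):
        # Other platforms are outside the scope of this sketch
        raise ValueError("unsupported OS for this sketch: {}".format(os_type))
    return "ks_{0}_{1}_amd64".format(version, os_type)


print(ksonnet_package("0.13.1", "linux"))  # ks_0.13.1_linux_amd64
```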

In [9]:
os.environ["OS_TYPE"] = "linux"  # Use "darwin" for macOS
os.environ["KS_VER"] = "0.13.1"
os.environ["KS_PKG"] = "ks_{0}_{1}_amd64".format(os.environ["KS_VER"], os.environ["OS_TYPE"])
os.environ["PATH"] = "{0}:{1}/bin/{2}".format(
    os.environ["PATH"],
    os.environ["HOME"],
    os.environ["KS_PKG"]
)

In [10]:
%%bash

wget -O /tmp/${KS_PKG}.tar.gz https://github.com/ksonnet/ksonnet/releases/download/v${KS_VER}/${KS_PKG}.tar.gz -q
mkdir -p ${HOME}/bin
tar -xvf /tmp/${KS_PKG}.tar.gz -C ${HOME}/bin

x ks_0.13.1_darwin_amd64/CHANGELOG.md
x ks_0.13.1_darwin_amd64/CODE-OF-CONDUCT.md
x ks_0.13.1_darwin_amd64/CONTRIBUTING.md
x ks_0.13.1_darwin_amd64/LICENSE
x ks_0.13.1_darwin_amd64/README.md
x ks_0.13.1_darwin_amd64/ks


Next, download and extract Kubeflow, then deploy it. For more details about this process, please see the [installation instructions for Kubeflow on Azure](https://www.kubeflow.org/docs/azure/deploy/install-kubeflow/).

In [11]:
os.environ["KFCTL_VER"] = "0.5.1"
os.environ["KFCTL_PKG"] = "kfctl_v{}_{}".format(os.environ["KFCTL_VER"], os.environ["OS_TYPE"])
os.environ["PATH"] = "{0}:{1}/bin".format(
    os.environ["PATH"],
    os.environ["HOME"]
)
os.environ['KFAPP'] = "kfapp"

In [12]:
%%bash

wget -O /tmp/${KFCTL_PKG}.tar.gz https://github.com/kubeflow/kubeflow/releases/download/v${KFCTL_VER}/${KFCTL_PKG}.tar.gz -q
tar -xvf /tmp/${KFCTL_PKG}.tar.gz -C ${HOME}/bin

x ./kfctl


In [13]:
%%bash

kfctl init ${KFAPP}
cd ${KFAPP}
kfctl generate k8s
kfctl apply k8s

time="2019-06-16T21:46:15-04:00" level=info msg="deploying kubeflow application" filename="cmd/apply.go:35"


To verify the deployment, check kubeflow pods as follows:

In [None]:
!kubectl -n kubeflow get pods

We change the namespace to be "kubeflow" so that we don't need to use `-n kubeflow` argument for every *kubectl* command in this example.

In [15]:
!kubectl config set-context {AKS_NAME} --namespace=kubeflow

Context "junminaks-cpu" modified.


#### 1.3 Persistent volume setup
One last thing to do before moving to the next step is to create a persistent volume to store our dataset. A PersistentVolumeClaim (PVC) is a request for storage by a user. For details, see [persistent volumes with Azure files](https://docs.microsoft.com/en-us/azure/aks/azure-files-dynamic-pv). Here, we request **100Gi** of storage, as defined in *[reco_utils/kubeflow/manifest/azure-file-pvc.yaml](../../reco_utils/kubeflow/manifest/azure-file-pvc.yaml)*.

In [34]:
!kubectl apply -f ../../reco_utils/kubeflow/manifest/azure-file-sc.yaml
!kubectl apply -f ../../reco_utils/kubeflow/manifest/azure-pvc-roles.yaml
!kubectl apply -f ../../reco_utils/kubeflow/manifest/azure-file-pvc.yaml

storageclass.storage.k8s.io "azurefile" configured
clusterrole.rbac.authorization.k8s.io "system:azure-cloud-provider" configured
clusterrolebinding.rbac.authorization.k8s.io "system:azure-cloud-provider" configured
persistentvolumeclaim "azurefile" unchanged


<a id='check-pvc'></a>
To verify the deployment, run:

In [35]:
!kubectl get pvc azurefile

NAME        STATUS    VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
azurefile   Bound     pvc-52a5a833-912b-11e9-a0f8-7a177b70b80b   100Gi      RWX            azurefile      5h


## 2. Experiment Preparation
#### 2.1 Dataset
1. Download data and split into training, validation and testing sets
2. Upload the training and validation sets to our PVC. To do that,
  1. Attach a pod to the PVC
  2. Copy the datasets onto the pod
  3. Delete the pod

In [8]:
# Select MovieLens data size: 100k, 1m, 10m, or 20m
MOVIELENS_DATA_SIZE = '100k'

TRAIN_FILE_NAME = "movielens_" + MOVIELENS_DATA_SIZE + "_train.pkl"
VAL_FILE_NAME = "movielens_" + MOVIELENS_DATA_SIZE + "_val.pkl"
TEST_FILE_NAME = "movielens_" + MOVIELENS_DATA_SIZE + "_test.pkl"

USERCOL = 'userID'
ITEMCOL = 'itemID'

In [25]:
data = movielens.load_pandas_df(
    size=MOVIELENS_DATA_SIZE,
    header=[USERCOL, ITEMCOL, "rating"]
)

data.head()

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.81k/4.81k [00:00<00:00, 5.57kKB/s]


Unnamed: 0,userID,itemID,rating
0,196,242,3.0
1,186,302,3.0
2,22,377,1.0
3,244,51,2.0
4,166,346,1.0


In [26]:
train, validation, test = python_random_split(data, [0.7, 0.15, 0.15], seed=SEED)

In [42]:
tmpdir = TemporaryDirectory()

train_pickle_path = os.path.join(tmpdir.name, TRAIN_FILE_NAME)
train.to_pickle(train_pickle_path)

val_pickle_path = os.path.join(tmpdir.name, VAL_FILE_NAME)
validation.to_pickle(val_pickle_path)

test_pickle_path = os.path.join(tmpdir.name, TEST_FILE_NAME)
test.to_pickle(test_pickle_path)

Before moving forward, make sure your PVC has been deployed. Check the deployment status by re-running the command we executed earlier, `!kubectl get pvc azurefile`, or go back to the [earlier cell](#check-pvc) where we checked the PVC status and re-run it.
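
Instead of re-running the check by hand, the status can be polled. The following is a minimal sketch, assuming `kubectl` is on the `PATH` and the PVC lives in the `kubeflow` namespace. The helper names `pvc_phase` and `wait_until_bound` are illustrative; the `get_phase` parameter exists only so the polling logic can be exercised without a cluster.

```python
import subprocess
import time


def pvc_phase(name, namespace="kubeflow"):
    """Ask kubectl for the PVC phase, e.g. 'Pending' or 'Bound'."""
    cmd = [
        "kubectl", "get", "pvc", name,
        "-n", namespace,
        "-o", "jsonpath={.status.phase}",
    ]
    return subprocess.check_output(cmd, universal_newlines=True).strip()


def wait_until_bound(name, get_phase=pvc_phase, timeout=300, interval=5):
    """Poll until the PVC reports 'Bound'; return False if the timeout expires."""
    deadline = time.time() + timeout
    while True:
        if get_phase(name) == "Bound":
            return True
        if time.time() >= deadline:
            return False
        time.sleep(interval)
```

Calling `wait_until_bound("azurefile")` would then block until the claim is bound or five minutes have passed.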

Once the PVC is bound, we create a pod by using [reco_utils/kubeflow/manifest/pvc-loader.yaml](../../reco_utils/kubeflow/manifest/pvc-loader.yaml) so that we can upload the datasets into its `/data` folder.

In [5]:
!kubectl delete pod pvc-loader
!kubectl apply -f ../../reco_utils/kubeflow/manifest/pvc-loader.yaml

Error from server (NotFound): pods "pvc-loader" not found


pod "pvc-loader" created


In [43]:
# Upload data files
!kubectl cp {train_pickle_path} pvc-loader:/data/
!kubectl cp {val_pickle_path} pvc-loader:/data/
!kubectl cp {test_pickle_path} pvc-loader:/data/

error: \Users\jumin\AppData\Local\Temp\tmpmgygrxxq\movielens_100k_train.pkl no such file or directory
error: \Users\jumin\AppData\Local\Temp\tmpmgygrxxq\movielens_100k_val.pkl no such file or directory
error: \Users\jumin\AppData\Local\Temp\tmpmgygrxxq\movielens_100k_test.pkl no such file or directory


In [6]:
# Verify
!kubectl exec pvc-loader -- bash -c "ls /data/"

movielens-100k-svd-random-1-af8821edb8b1d144
movielens-100k-svd-random-1-b017795ae0918bd0
movielens-100k-svd-random-1-bd3dc277f8fce935
movielens-100k-svd-random-1-e6536d1be72a6c24
movielens-100k-svd-random-1-k336a298aaa10d61
movielens-100k-svd-random-1-kec9fe5b8c084659
movielens-100k-svd-random-1-o468ed2f86057a3e
movielens-100k-svd-random-1-o991921dc222ad75
movielens-100k-svd-random-1-obffdfa5b8c0a4c5
movielens-100k-svd-random-1-t2d6e2cdee09776c
movielens-100k-svd-random-1-te7858132bdcbdfe
movielens-100k-svd-random-1-z4dba7cddabf6502
movielens_100k_test.pkl
movielens_100k_train.pkl
movielens_100k_val.pkl


#### 2.2 Training scripts

We prepare a training script, [reco_utils/kubeflow/svd_training.py](../../reco_utils/kubeflow/svd_training.py), for the hyperparameter tuning. It logs our target metrics, such as [RMSE](https://en.wikipedia.org/wiki/Root-mean-square_deviation) and/or [NDCG](https://en.wikipedia.org/wiki/Discounted_cumulative_gain), to *Katib* so that we can track the metrics and optimize the primary metric. At the end, the script also saves the trained model to the output folder so that we can download and validate the model on the test set later.
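
To make the tracked metrics concrete, here is a simplified, single-user illustration of a rating metric (RMSE) and a ranking metric (precision at k). These are not the `reco_utils` implementations, which operate on DataFrames and average over users; the function names are deliberately different so they do not shadow the imported ones.

```python
import math


def simple_rmse(actual, predicted):
    """Root-mean-square error between two equal-length rating lists."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def simple_precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are in the relevant set."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k


# Lower is better for RMSE, higher is better for precision at k
print(simple_rmse([3.0, 1.0, 2.0], [2.0, 1.0, 4.0]))
print(simple_precision_at_k([242, 302, 377, 51], {302, 51, 999}, k=4))  # 0.5
```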

We use the Docker image containing our Recommender repo as well as the training script. For more details, see [reco_utils/kubeflow/docker/Dockerfile](../../reco_utils/kubeflow/docker/Dockerfile).

#### 2.3 Parameters

We define a search space for the hyperparameters. All the parameter values will be passed into our training script.

In [12]:
EXP_NAME = "movielens-" + MOVIELENS_DATA_SIZE + "-svd"
PRIMARY_METRIC = 'precision_at_k'
PRIMARY_METRIC_GOAL = Goal.MAXIMIZE
IDEAL_METRIC_VALUE = 1.0
RATING_METRICS = ['rmse']
RANKING_METRICS = ['precision_at_k', 'ndcg_at_k']  

REMOVE_SEEN = True
K = 10
RANDOM_STATE = 0
VERBOSE = True
NUM_EPOCHS = 30
BIASED = True

MAX_TOTAL_RUNS = 100  # Number of runs (training-and-evaluation) to search for the best hyperparameters. 
MAX_CONCURRENT_RUNS = 8

# PVC mount path
STORAGE_MOUNT_PATH = "/data"

script_params = {
    '--datastore': STORAGE_MOUNT_PATH,
    '--train-datapath': TRAIN_FILE_NAME,
    '--validation-datapath': VAL_FILE_NAME,
    '--surprise-reader': "ml-100k",
    '--rating-metrics': RATING_METRICS,
    '--ranking-metrics': RANKING_METRICS,
    '--usercol': USERCOL,
    '--itemcol': ITEMCOL,
    '--k': K,
    '--random-state': RANDOM_STATE,
    '--epochs': NUM_EPOCHS,
}

if BIASED:
    script_params['--biased'] = ''
if VERBOSE:
    script_params['--verbose'] = ''
if REMOVE_SEEN:
    script_params['--remove-seen'] = ''

# hyperparameters search space
# We do not set 'lr_all' and 'reg_all' because they will be overwritten by the other lr_ and reg_ parameters
hyperparams = {
    '--n-factors': choice([10, 50, 100, 150, 200]),
    '--init-mean': uniform(-0.5, 0.5),
    '--init-std-dev': uniform(0.01, 0.2),
    '--lr-bu': uniform(1e-6, 0.1), 
    '--lr-bi': uniform(1e-6, 0.1), 
    '--lr-pu': uniform(1e-6, 0.1), 
    '--lr-qi': uniform(1e-6, 0.1), 
    '--reg-bu': uniform(1e-6, 1),
    '--reg-bi': uniform(1e-6, 1), 
    '--reg-pu': uniform(1e-6, 1), 
    '--reg-qi': uniform(1e-6, 1)
}
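
Conceptually, a random search draws each hyperparameter independently from its range for every trial. The sketch below illustrates that idea with the standard library only. The tuple-based `space` format is a hypothetical stand-in, not what `choice` and `uniform` from `reco_utils.kubeflow.utils` return, and this is not how Katib implements its search.

```python
import random

# Hypothetical search-space format: ("choice", options) or ("uniform", low, high)
space = {
    "--n-factors": ("choice", [10, 50, 100, 150, 200]),
    "--init-mean": ("uniform", -0.5, 0.5),
    "--lr-bu": ("uniform", 1e-6, 0.1),
    "--reg-bu": ("uniform", 1e-6, 1),
}


def sample_trial(space, rng):
    """Draw one hyperparameter configuration, independently per parameter."""
    trial = {}
    for name, spec in space.items():
        if spec[0] == "choice":
            trial[name] = rng.choice(spec[1])
        elif spec[0] == "uniform":
            trial[name] = rng.uniform(spec[1], spec[2])
        else:
            raise ValueError("unknown parameter type: {}".format(spec[0]))
    return trial


rng = random.Random(0)  # fixed seed so the draws are repeatable
for _ in range(3):
    print(sample_trial(space, rng))
```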

Now we create worker and study manifests.

In [13]:
# Change this to repeat the experiment without over-writing the previous StudyJob deployment.
TAG = "random-1"
# Change this to select different search algorithm
SEARCH_TYPE = SearchType.RANDOM

worker_spec = make_worker_spec(
    name=EXP_NAME,
    tag=TAG,
    worker_type=WorkerType.WORKER,
    image_name='loomlike/reco',
    entry_script='/app/reco_utils/kubeflow/svd_training.py',
    params=script_params,
    is_hypertune=True,
    storage_path=STORAGE_MOUNT_PATH,
    use_gpu=False,
)

studyjob_name, studyjob_file = make_hypertune_manifest(
    search_type=SEARCH_TYPE,
    total_runs=MAX_TOTAL_RUNS,
    concurrent_runs=MAX_CONCURRENT_RUNS,
    primary_metric=PRIMARY_METRIC,
    goal=PRIMARY_METRIC_GOAL,
    ideal_metric_value=IDEAL_METRIC_VALUE,
    metrics=RATING_METRICS+RANKING_METRICS,
    hyperparams=hyperparams,
    worker_spec=worker_spec
)

StudyJob manifest has been generated. To start, run 'kubectl create -f jobs\movielens-100k-svd-random-1.yaml'


## 3. Experiments
Now, we deploy the studyjob and monitor the status by using *kubectl*.

In [22]:
# Delete the previous StudyJob of the same name, if it exists
!kubectl delete studyjob {studyjob_name}

# Create a StudyJob
!kubectl create -f {studyjob_file}

studyjob.kubeflow.org "movielens-100k-svd-random-1" deleted
studyjob.kubeflow.org "movielens-100k-svd-random-1" created


In [None]:
!kubectl describe studyjob {studyjob_name}

To check more details about each **job** (trial) and its **pod**, you can use **kubectl** commands like:
```
# To list all the jobs
kubectl get job 

# To check a specific job
kubectl describe job <your-job-id>

# To check a specific pod
kubectl describe pod <your-pod-id>
```

## 4. Results

#### 4.1 Dashboard
To access the Kubeflow dashboard, we use port forwarding to reach the *ambassador* service.

In [46]:
subprocess.Popen("kubectl port-forward svc/ambassador 8080:80", shell=True)

<subprocess.Popen at 0x231a32cb518>

Then, open [localhost:8080](http://localhost:8080) from a browser, go to **Katib Dashboard** tab and select the StudyJob you want to see from the list.

You can also create a new job from the dashboard.
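
A process started with `subprocess.Popen` keeps running in the background until it is stopped. One way to make sure the forwarding ends when you are done is to wrap it in a context manager. This is a sketch only; `background_process` is a hypothetical helper, and it passes the command as an argument list rather than a shell string.

```python
import contextlib
import subprocess


@contextlib.contextmanager
def background_process(args):
    """Start a process and make sure it is terminated when the block exits."""
    proc = subprocess.Popen(args)
    try:
        yield proc
    finally:
        proc.terminate()
        proc.wait()


# Example usage (commented out because it needs a cluster connection):
# with background_process(["kubectl", "port-forward", "svc/ambassador", "8080:80"]):
#     input("Dashboard at http://localhost:8080, press Enter to stop forwarding")
```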

TODO: Screen-shot

#### 4.2 Result query

You can also query the results by using the REST API or gRPC. The gRPC API definition is available in the [Katib official github repo](https://github.com/kubeflow/katib/blob/master/pkg/api/v1alpha1/api.proto).

Here, we use our helper functions to query the results.

In [60]:
study_result = get_study_result(
    studyjob_name=studyjob_name,
    write_result=True,
    verbose=False,
)

Name:         movielens-100k-svd-random-1
Namespace:    kubeflow
Labels:       controller-tools.k8s.io=1.0
Annotations:  <none>
API Version:  kubeflow.org/v1alpha1
Kind:         StudyJob
Metadata:
  Creation Timestamp:  2019-06-17T02:42:03Z
  Finalizers:
    clean-studyjob-data
  Generation:        1
  Resource Version:  17595
  Self Link:         /apis/kubeflow.org/v1alpha1/namespaces/kubeflow/studyjobs/movielens-100k-svd-random-1
  UID:               7cf75fc6-90a9-11e9-8304-9a611a16de8f
Spec:
  Metricsnames:
    rmse
    precision_at_k
    ndcg_at_k
  Objectivevaluename:  precision_at_k
  Optimizationgoal:    1
  Optimizationtype:    maximize
  Owner:               crd
  Parameterconfigs:
    Feasible:
      List:
        10
        50
        100
        150
        200
    Name:           --n-factors
    Parametertype:  categorical
    Feasible:
      Max:          0.5
      Min:          -0.5
    Name:           --init-mean
    Parametertype:  double
    Feasible:
      Max:      

In [73]:
study_result['Best Trial Id']
# model_id = get_best_model_id(study_result)
# model_id
# get_study_metrics(study_result['Studyid'], [], [''])

'd67868ca8f5816d2'
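
The best trial is the one with the best value of the primary metric, maximized or minimized according to the goal. The sketch below shows that selection on made-up trial records. The record layout and the helper `best_trial` are hypothetical and do not reflect the structure returned by `get_study_result`.

```python
def best_trial(trials, metric, maximize=True):
    """Return the trial record with the best value of the given metric."""
    pick = max if maximize else min
    return pick(trials, key=lambda trial: trial["metrics"][metric])


# Made-up trial records for illustration only
trials = [
    {"id": "trial-a", "metrics": {"precision_at_k": 0.08, "rmse": 0.97}},
    {"id": "trial-b", "metrics": {"precision_at_k": 0.11, "rmse": 1.02}},
]
print(best_trial(trials, "precision_at_k")["id"])        # trial-b
print(best_trial(trials, "rmse", maximize=False)["id"])  # trial-a
```

The example also shows why the choice of primary metric matters: the trial with the best ranking metric is not necessarily the one with the lowest RMSE.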

#### 4.3 Test
We now have the best-performing hyperparameters. Next, we evaluate the metrics on the test data. To do that, download the best model we stored while training.

In [None]:
# Load model and test

In [None]:
svd = surprise.dump.load('aml_model/model.dump')[1]

In [None]:
test_results = {}
predictions = compute_rating_predictions(svd, test, usercol=USERCOL, itemcol=ITEMCOL)
for metric in RATING_METRICS:
    test_results[metric] = eval(metric)(test, predictions)

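# REMOVE_SEEN (section 2.3) is True when seen items should be excluded, so recommend_seen is its negation
all_predictions = compute_ranking_predictions(svd, train, usercol=USERCOL, itemcol=ITEMCOL, recommend_seen=not REMOVE_SEEN)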
for metric in RANKING_METRICS:
    test_results[metric] = eval(metric)(test, all_predictions, col_prediction='prediction', k=K)

print(test_results)

## 5. Concluding Remarks

We showed how to tune **all** the hyperparameters accepted by Surprise SVD simultaneously, by utilizing Kubeflow on AKS.

TODO add insights

#### Cleanup

To uninstall Kubeflow,
```
cd ${KFAPP}
# If you want to delete all the resources, including storage.
kfctl delete all --delete_storage
# If you want to preserve storage, which contains metadata and information
# from mlpipeline.
kfctl delete all
```

To remove AKS cluster,

TODO