<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>

<i>Licensed under the MIT License.</i>

# Train SAR Single Node on MovieLens with Azure Machine Learning (Python, CPU)

This notebook provides an example of how to train SAR on remote compute resources. Details and discussion of SAR can be found in the [SAR Python CPU Movielens](https://github.com/Microsoft/Recommenders/blob/master/notebooks/00_quick_start/sar_movielens.ipynb) notebook.

This notebook requires an AzureML workspace to be set up. Please follow [this notebook](https://github.com/Azure/MachineLearningNotebooks/blob/master/configuration.ipynb) on how to create an AzureML workspace.

# 0 Setup environment
### Get workspace and create experiment
The workspace was created in the configuration notebook and can be loaded using `Workspace.from_config()`.

The compute target created in the configuration notebook is called `cpucluster`.

In [5]:
# set the environment path to find Recommenders
import sys
sys.path.append("../../")

import azureml
from azureml.core import Workspace, Run, Experiment

# load workspace configuration from the config.json file in the current folder.
ws = Workspace.from_config()
# create experiment
exp = Experiment(workspace=ws, name='movielens-sar')
# attach amlcompute
compute_target = ws.compute_targets["cpucluster"]

Found the config file in: /data/home/testuser/notebooks/Recommenders/notebooks/00_quick_start/aml_config/config.json
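
`Workspace.from_config()` reads the workspace details from a `config.json` file such as the one reported above. If loading fails, it can help to check that the file carries the expected keys first. The sketch below is a minimal, dependency-free check; the helper name `missing_config_keys` is hypothetical and the sample values are placeholders, not real identifiers.

```python
import json
import os
import tempfile

# Keys that an AzureML workspace config.json is expected to carry.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def missing_config_keys(path):
    """Return the required keys that are absent or empty in the file at `path`."""
    with open(path) as f:
        config = json.load(f)
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Demonstration with a throwaway file that deliberately omits workspace_name.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "config.json")
    with open(sample_path, "w") as f:
        json.dump({"subscription_id": "<subscription-id>",
                   "resource_group": "<resource-group>"}, f)
    missing = missing_config_keys(sample_path)

print(missing)
```

An empty list means all three keys are present.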


In [6]:
# top k items to recommend
TOP_K = 10

# Select Movielens data size: 100k, 1m, 10m, or 20m
MOVIELENS_DATA_SIZE = '100k'

### Download dataset and upload to data store
Make the data available for remote training by uploading it from your local machine to Azure. The datastore is a convenient construct associated with your workspace for uploading and downloading data and for interacting with it from your remote compute targets. It is backed by an Azure Blob storage account.

The data files are uploaded into a directory named `movielens` at the root of the datastore.

In [7]:
from reco_utils.dataset import movielens
import os

# download dataset
os.makedirs('./data', exist_ok = True)

datapath, item_datapath = movielens.download_datafile(
    size=MOVIELENS_DATA_SIZE,
    local_cache_path='./data/ml.zip'
)
# print (datapath, item_datapath)

# upload to data store
ds = ws.get_default_datastore()
print(ds.datastore_type, ds.account_name, ds.container_name)
ds.upload(src_dir='./data', target_path='movielens', overwrite=True, show_progress=True)

# clean up
movielens._clean_up(datapath)
movielens._clean_up(item_datapath)

/data/home/testuser/notebooks/Recommenders/notebooks/00_quick_start/data
AzureBlob setupwsstoragewzlfyzlr azureml-blobstore-427c91f1-92ba-45a5-b01e-4e923a230fcc
Uploading ./data/ml.zip
Uploading ./data/u.data
Uploading ./data/u.item
Uploaded ./data/u.item, 1 files out of an estimated total of 3
Uploaded ./data/u.data, 2 files out of an estimated total of 3
Uploaded ./data/ml.zip, 3 files out of an estimated total of 3
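
Of the uploaded files, `u.data` is the one the training script reads later. For the 100k dataset it is a tab-separated file with user, item, rating, and timestamp columns and no header row. The sketch below parses a few rows in that layout using only the standard library. The rows are made up for illustration, and `read_ratings` is a hypothetical helper, not part of `reco_utils`.

```python
import csv
import io

HEADER = ["UserId", "MovieId", "Rating", "Timestamp"]

# Illustrative rows in the u.data layout (tab-separated, no header line).
sample = "1\t10\t4\t880000000\n1\t20\t5\t880000100\n2\t10\t3\t880000200\n"


def read_ratings(handle):
    """Parse tab-separated rating records into a list of dicts keyed by HEADER."""
    rows = []
    for user, item, rating, timestamp in csv.reader(handle, delimiter="\t"):
        values = [int(user), int(item), float(rating), int(timestamp)]
        rows.append(dict(zip(HEADER, values)))
    return rows


ratings = read_ratings(io.StringIO(sample))
print(len(ratings), ratings[0])
```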


# 1 Train on remote cluster
### Create a directory
Create a directory to deliver the necessary code from your computer to the remote resource.

In [8]:
script_folder = './movielens-sar'
os.makedirs(script_folder, exist_ok=True)

### Create a training script
To submit the job to the cluster, first create a training script. Run the following code to create the training script called `train.py` in the directory you just created. The script loads the uploaded data, trains SAR, evaluates the top-k recommendations, logs the metrics to the run, and saves the model to the `outputs` folder.

In [9]:
%%writefile $script_folder/train.py

import argparse
import os
import numpy as np
import pandas as pd
import papermill as pm
import itertools
import logging
import time

from azureml.core import Run
from sklearn.externals import joblib

from reco_utils.dataset import movielens
from reco_utils.dataset.python_splitters import python_random_split
from reco_utils.evaluation.python_evaluation import map_at_k, ndcg_at_k, precision_at_k, recall_at_k
from reco_utils.recommender.sar.sar_singlenode import SARSingleNode

# get hold of the current run
run = Run.get_context()

# let user feed in 3 parameters: the location of the data files (from datastore),
# the number of items to recommend, and the Movielens data size
parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder mounting point')
parser.add_argument('--top-k', type=int, dest='top_k', default=10, help='top k items to recommend')
parser.add_argument('--data-size', type=str, dest='data_size', default='100k', help='Movielens data size: 100k, 1m, 10m, or 20m')
args = parser.parse_args()

data_folder = os.path.join(args.data_folder, 'movielens')
print('Data folder:', data_folder)

# load data into pandas data frame
data = movielens.load_pandas_df_from_ds(
    size=args.data_size,
    header=['UserId','MovieId','Rating','Timestamp'],
    ds_path=os.path.join(data_folder, 'u.data'))

# Convert the float precision to 32-bit in order to reduce memory consumption 
data.loc[:, 'Rating'] = data['Rating'].astype(np.float32)

# print the first rows so they show up in the driver log
print(data.head())

train, test = python_random_split(data)

# instantiate the SAR algorithm and set the index
header = {
    "col_user": "UserId",
    "col_item": "MovieId",
    "col_rating": "Rating",
    "col_timestamp": "Timestamp",
}

logging.basicConfig(level=logging.DEBUG, 
                    format='%(asctime)s %(levelname)-8s %(message)s')

model = SARSingleNode(
    remove_seen=True, similarity_type="jaccard", 
    time_decay_coefficient=30, time_now=None, timedecay_formula=True, **header
)

# train the SAR model
start_time = time.time()

model.fit(train)

train_time = time.time() - start_time
run.log(name="Training time", value="Took {} seconds for training.".format(train_time))

start_time = time.time()

top_k = model.recommend_k_items(test)

test_time = time.time() - start_time
run.log(name="Prediction time", value="Took {} seconds for prediction.".format(test_time))

# TODO: remove this call when the model returns same type as input
top_k['UserId'] = pd.to_numeric(top_k['UserId'])
top_k['MovieId'] = pd.to_numeric(top_k['MovieId'])

# evaluate
eval_map = map_at_k(test, top_k, col_user="UserId", col_item="MovieId", 
                    col_rating="Rating", col_prediction="prediction", 
                    relevancy_method="top_k", k=args.top_k)
eval_ndcg = ndcg_at_k(test, top_k, col_user="UserId", col_item="MovieId", 
                      col_rating="Rating", col_prediction="prediction", 
                      relevancy_method="top_k", k=args.top_k)
eval_precision = precision_at_k(test, top_k, col_user="UserId", col_item="MovieId", 
                                col_rating="Rating", col_prediction="prediction", 
                                relevancy_method="top_k", k=args.top_k)
eval_recall = recall_at_k(test, top_k, col_user="UserId", col_item="MovieId", 
                          col_rating="Rating", col_prediction="prediction", 
                          relevancy_method="top_k", k=args.top_k)

run.log(name="Model", value=model.model_str)
run.log(name="Top K", value=args.top_k)
run.log(name="MAP", value=eval_map)
run.log(name="NDCG", value=eval_ndcg)
run.log(name="Precision@K", value=eval_precision)
run.log(name="Recall@K", value=eval_recall)

pm.record("map", eval_map)
pm.record("ndcg", eval_ndcg)
pm.record("precision", eval_precision)
pm.record("recall", eval_recall)
pm.record("train_time", train_time)
pm.record("test_time", test_time)

os.makedirs('outputs', exist_ok=True)
# note file saved in the outputs folder is automatically uploaded into experiment record
joblib.dump(value=model, filename='outputs/movielens_sar_model.pkl')

Overwriting ./movielens-sar/train.py
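
The training script reports Precision@K and Recall@K among its metrics. As a rough intuition for what those numbers mean, the sketch below computes both for a single user from plain Python lists. It is a simplified illustration only: the function names are hypothetical, and the `reco_utils` evaluators work on data frames, apply a relevancy method, and aggregate over users, so their results may differ in detail.

```python
def user_precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant for one user."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k


def user_recall_at_k(recommended, relevant, k):
    """Fraction of one user's relevant items that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)


# Made-up example: four ranked recommendations, three items the user actually liked.
recommended = ["A", "B", "C", "D"]
relevant = {"B", "D", "E"}

precision = user_precision_at_k(recommended, relevant, k=4)  # 2 hits out of 4 slots
recall = user_recall_at_k(recommended, relevant, k=4)        # 2 hits out of 3 relevant
print(precision, recall)
```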


In [10]:
import shutil
shutil.rmtree('./movielens-sar/reco_utils/', ignore_errors=True)
shutil.copytree('../../reco_utils/', './movielens-sar/reco_utils/')

'./movielens-sar/reco_utils/'
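
Because everything in the script folder is sent to the cluster, it can be worth leaving out files the remote job does not need, such as bytecode caches. The sketch below shows one way to do that with `shutil.ignore_patterns`. It runs against a temporary stand-in directory rather than the real `reco_utils` package, and the file names in it are illustrative.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the package directory: one source file and one cache file.
    src = os.path.join(tmp, "reco_utils")
    os.makedirs(os.path.join(src, "__pycache__"))
    open(os.path.join(src, "__init__.py"), "w").close()
    open(os.path.join(src, "__pycache__", "__init__.cpython-36.pyc"), "w").close()

    # Same remove-then-copy pattern as above, but skipping bytecode caches.
    dst = os.path.join(tmp, "movielens-sar", "reco_utils")
    shutil.rmtree(dst, ignore_errors=True)
    shutil.copytree(src, dst, ignore=shutil.ignore_patterns("__pycache__", "*.pyc"))
    copied = sorted(os.listdir(dst))

print(copied)
```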

### Create an estimator
An estimator object is used to submit the run.  Create your estimator by running the following code to define:

* The name of the estimator object, `est`
* The directory that contains your scripts. All the files in this directory are uploaded into the cluster nodes for execution. 
* The compute target. In this case you will use the AmlCompute you created.
* The training script name, `train.py`
* Parameters required from the training script 
* Python packages needed for training

In this tutorial, the compute target is AmlCompute. The `--data-folder` parameter is set to the mounted datastore (`ds.as_mount()`).

In [11]:
from azureml.train.estimator import Estimator

script_params = {
    '--data-folder': ds.as_mount(),
    '--top-k': TOP_K,
    '--data-size': MOVIELENS_DATA_SIZE
}

est = Estimator(source_directory=script_folder,
                script_params=script_params,
                compute_target=compute_target,
                entry_script='train.py',
                conda_packages=['pandas'],
                pip_packages=['papermill', 'sklearn'])
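
The keys of `script_params` match the flags that `train.py` declares with `argparse`. The sketch below illustrates that link by flattening a parameter dictionary into a command-line argument list and parsing it locally. How AzureML actually serializes the parameters is an assumption here, `to_argv` and `build_parser` are hypothetical helpers, and `/mnt/datastore` is a made-up stand-in for the path that replaces `ds.as_mount()` at run time.

```python
import argparse


def to_argv(script_params):
    """Flatten {'--flag': value} into ['--flag', 'value', ...] with string values."""
    argv = []
    for flag, value in script_params.items():
        argv.extend([flag, str(value)])
    return argv


def build_parser():
    """Declare the same three flags that the training script expects."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--data-folder', type=str, dest='data_folder')
    parser.add_argument('--top-k', type=int, dest='top_k', default=10)
    parser.add_argument('--data-size', type=str, dest='data_size', default='100k')
    return parser


# Local stand-in for the parameters passed to the estimator.
params = {'--data-folder': '/mnt/datastore', '--top-k': 10, '--data-size': '100k'}
args = build_parser().parse_args(to_argv(params))
print(args.data_folder, args.top_k, args.data_size)
```

Parsing the flags locally like this is a quick way to catch a misspelled key before a job is queued.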

### Submit the job to the cluster
Run the experiment by submitting the estimator object.

In [12]:
run = exp.submit(config=est)
run

| Experiment | Id | Type | Status | Details Page | Docs Page |
|---|---|---|---|---|---|
| movielens-sar | movielens-sar_1551122288_2b59e79a | azureml.scriptrun | Queued | Link to Azure Portal | Link to Documentation |


# 2 Monitor a remote run

### Jupyter widget

Watch the progress of the run with a Jupyter widget.  Like the run submission, the widget is asynchronous and provides live updates every 10-15 seconds until the job completes.

In [13]:
from azureml.widgets import RunDetails
RunDetails(run).show()

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…
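
Outside a notebook, where the widget is not available, the run can instead be polled until it reaches a terminal state. The SDK offers `run.wait_for_completion(show_output=True)` for this. The sketch below shows the underlying idea with a generic polling loop. The helper `wait_until_done` is hypothetical, the set of terminal state names is an assumption, and a fake status source stands in for `run.get_status` so the example runs without a workspace.

```python
import time

# Assumed names of the states in which a run has finished.
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def wait_until_done(get_status, poll_seconds=10.0, timeout_seconds=3600.0):
    """Poll get_status() until it returns a terminal state or the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("run still in state {!r}".format(status))
        time.sleep(poll_seconds)


# Stand-in for run.get_status: replays a fixed sequence of states.
fake_states = iter(["Queued", "Running", "Running", "Completed"])
final_status = wait_until_done(lambda: next(fake_states), poll_seconds=0.01)
print(final_status)
```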

Once the run is complete, you can see files associated with that run.

In [15]:
print(run.get_file_names())

['azureml-logs/55_batchai_execution.txt', 'azureml-logs/60_control_log.txt', 'azureml-logs/80_driver_log.txt', 'azureml-logs/azureml.log', 'outputs/movielens_sar_model.pkl']
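
The list mixes log files with the artifacts the script wrote to `outputs`. A small filter, sketched below on the names printed above, picks out the saved model. The helper `output_files` is hypothetical. With a live run, each selected name could then be passed to `run.download_file(name=..., output_file_path=...)` to fetch it locally.

```python
def output_files(file_names, prefix="outputs/"):
    """Keep only the files stored under the given folder prefix."""
    return [name for name in file_names if name.startswith(prefix)]


# File names as printed by run.get_file_names() above.
file_names = [
    "azureml-logs/55_batchai_execution.txt",
    "azureml-logs/60_control_log.txt",
    "azureml-logs/80_driver_log.txt",
    "azureml-logs/azureml.log",
    "outputs/movielens_sar_model.pkl",
]

model_files = output_files(file_names)
print(model_files)
```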
