## Overview

This notebook shows how to export the movie recommendation model built on DSX to BigInsights.
Once the model is on BigInsights, it can be used in several ways, for example:

- from Oozie to make batch predictions
- from Spark Streaming to make realtime predictions

We will use Python ssh and scp libraries to move data and code between DSX and BigInsights.

### Set up SSH

Read cluster credentials

In [23]:
with open('credentials', 'r') as f:
    # strip the trailing newline so it does not end up in the password
    (hostname, username, password) = f.readline().strip().split(',')
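
Reading the credentials can be made a little more defensive. The following is a minimal sketch, assuming the file holds a single `hostname,username,password` line; the helper name `read_credentials` and the placeholder values are hypothetical and not part of the original notebook.

```python
import os
import tempfile


def read_credentials(path):
    """Return (hostname, username, password) from the first line of `path`."""
    with open(path, 'r') as f:
        # Drop the line ending so it does not become part of the password.
        first_line = f.readline().rstrip('\r\n')
    # Split on the first two commas only, so the password may contain commas.
    fields = first_line.split(',', 2)
    if len(fields) != 3:
        raise ValueError('expected a line of the form hostname,username,password')
    hostname, username, password = fields
    return hostname.strip(), username.strip(), password


# Demonstration with a throwaway file and placeholder values.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'credentials')
with open(demo_path, 'w') as f:
    f.write('example.host,demouser,s3cret,with,commas\n')

print(read_credentials(demo_path))
```

Failing early on a malformed file gives a clearer error than a later SSH authentication failure.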

Install required ssh and scp libraries

In [24]:
!pip install --user --upgrade --force --quiet git+https://github.com/snowch/nb_utils
    
# note to install a specific commit sha
# !pip install --user --upgrade --force --quiet git+https://github.com/snowch/nb_utils@commit_sha

Print the commit SHA at the head of the nb_utils master branch, which is the version just installed, for traceability purposes

In [44]:
import requests
requests.get('https://api.github.com/repos/snowch/nb_utils/git/refs/heads/master').json()['object']['sha']

u'04d1ae56e6093a2ac52818661bd8679df2a36c9b'
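
The SHA printed above can be fed back into the pinned install form shown in the earlier comment. This is a small sketch; the helper name `pinned_pip_spec` is hypothetical, and the check only confirms that the value looks like a full 40-character Git SHA, not that the commit exists.

```python
import re


def pinned_pip_spec(sha, repo='https://github.com/snowch/nb_utils'):
    """Build a pip requirement string that pins `repo` to one commit."""
    # A full Git SHA-1 is 40 lowercase hexadecimal characters.
    if not re.match(r'[0-9a-f]{40}\Z', sha):
        raise ValueError('not a full commit sha: {0!r}'.format(sha))
    return 'git+{0}@{1}'.format(repo, sha)


print(pinned_pip_spec('04d1ae56e6093a2ac52818661bd8679df2a36c9b'))
```

The resulting string can be passed to `pip install` in place of the unpinned URL to reproduce a run against the same code.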

Load the ssh utilities

In [26]:
from ssh_utils import ssh_utils
ssh = ssh_utils.SshUtil(hostname, username, password)

Let's verify ssh is working by listing the root folder contents of hdfs

In [27]:
ssh.cmd('hdfs dfs -ls /')

Found 9 items
drwxrwxr-x   - ams      hdfs             0 2016-11-02 03:03 /amshbase
drwxrwxrwx   - yarn     hadoop           0 2016-11-02 00:33 /app-logs
drwxr-xr-x   - hdfs     hdfs             0 2016-07-05 07:07 /apps
drwxr-xr-x   - hdfs     hdfs             0 2016-07-05 07:07 /iop
drwxr-xr-x   - mapred   hdfs             0 2016-07-05 07:05 /mapred
drwxrwxrwx   - mapred   hadoop           0 2016-07-05 07:05 /mr-history
drwx------   - demouser biusers          0 2016-11-02 00:30 /securedir
drwxrwxrwx   - hdfs     hdfs             0 2016-11-02 00:29 /tmp
drwxr-xr-x   - hdfs     hdfs             0 2016-11-02 00:29 /user


The next commands run on DSX. We create a tar archive containing our model.

In [28]:
!rm -f recommender_model.tgz
!tar czf recommender_model.tgz recommender_model/
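
The same archive can be built without shelling out, using the standard-library `tarfile` module. This is a sketch only: the helper name `make_archive` is hypothetical, and the demonstration runs on a throwaway folder that stands in for `recommender_model/`.

```python
import os
import tarfile
import tempfile


def make_archive(folder, archive_path):
    """Create a gzip-compressed tar of `folder`, replacing any old archive."""
    if os.path.exists(archive_path):
        os.remove(archive_path)
    with tarfile.open(archive_path, 'w:gz') as tar:
        # arcname keeps only the folder name, so the archive
        # extracts to ./<folder name>/ wherever it is unpacked.
        tar.add(folder, arcname=os.path.basename(os.path.normpath(folder)))
    return archive_path


# Demonstration on a throwaway folder standing in for recommender_model/.
work_dir = tempfile.mkdtemp()
model_dir = os.path.join(work_dir, 'recommender_model')
os.makedirs(os.path.join(model_dir, 'data'))
with open(os.path.join(model_dir, 'data', 'placeholder.txt'), 'w') as f:
    f.write('not a real model\n')

archive = make_archive(model_dir, os.path.join(work_dir, 'recommender_model.tgz'))
with tarfile.open(archive, 'r:gz') as tar:
    names = sorted(tar.getnames())
print(names)
```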

On BigInsights, delete any models that were copied across on previous runs of the notebook.

In [29]:
ssh.cmd('rm -rf ./recommender_model.tgz ./recommender_model')

Copy over our new model

In [30]:
ssh.put('recommender_model.tgz')

Verify that it was copied successfully

In [31]:
ssh.cmd('ls -l ./recommender_model.tgz')

-rw-r--r--. 1 demouser biusers 2129960 Nov  2 15:28 ./recommender_model.tgz
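
The size reported by `ls -l` can be checked against the local archive rather than inspected by eye. A minimal sketch, assuming the listing line is available as a string (whether `ssh.cmd` returns its output or only prints it is not confirmed by this notebook); the helper name `size_from_ls_line` is hypothetical.

```python
def size_from_ls_line(ls_line):
    """Return the size in bytes from one line of `ls -l` output."""
    # Fields: mode, link count, owner, group, size, month, day, time, name.
    fields = ls_line.split()
    return int(fields[4])


sample = '-rw-r--r--. 1 demouser biusers 2129960 Nov  2 15:28 ./recommender_model.tgz'
remote_size = size_from_ls_line(sample)
print(remote_size)  # 2129960
```

On DSX the value could then be compared with `os.path.getsize('recommender_model.tgz')`; a mismatch would suggest an incomplete transfer.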


Extract the model archive

In [32]:
ssh.cmd('tar xzf ./recommender_model.tgz')

Verify the extracted model folders

In [33]:
ssh.cmd('find ./recommender_model')

./recommender_model
./recommender_model/data
./recommender_model/data/user
./recommender_model/data/user/.part-r-00000-70815abd-bcd7-4945-9bc5-b0796c698570.gz.parquet.crc
./recommender_model/data/user/part-r-00001-70815abd-bcd7-4945-9bc5-b0796c698570.gz.parquet
./recommender_model/data/user/._SUCCESS.crc
./recommender_model/data/user/part-r-00000-70815abd-bcd7-4945-9bc5-b0796c698570.gz.parquet
./recommender_model/data/user/_SUCCESS
./recommender_model/data/user/_common_metadata
./recommender_model/data/user/.part-r-00001-70815abd-bcd7-4945-9bc5-b0796c698570.gz.parquet.crc
./recommender_model/data/user/._common_metadata.crc
./recommender_model/data/user/_metadata
./recommender_model/data/user/._metadata.crc
./recommender_model/data/product
./recommender_model/data/product/part-r-00000-a9529fe2-fc9e-4b18-95c7-6300fd989442.gz.parquet
./recommender_model/data/product/.part-r-00000-a9529fe2-fc9e-4b18-95c7-6300fd989442.gz.parquet.crc
./recommender_model/data/product/._SUCCESS.crc
./recommend

Remove any models that were copied to BigInsights HDFS on previous runs of the notebook, then copy the model from the BigInsights local file system to HDFS.

In [34]:
model_path = 'hdfs:///user/{0}/recommender_model'.format(username)

In [35]:
print(model_path)

hdfs:///user/demouser/recommender_model


In [36]:
ssh.cmd('hdfs dfs -rm -r -skipTrash {0}'.format(model_path)) # it's ok if this fails
ssh.cmd('hdfs dfs -copyFromLocal ./recommender_model {0}'.format(model_path))

Deleted hdfs:///user/demouser/recommender_model


Verify the model exists in HDFS

In [37]:
ssh.cmd('hdfs dfs -ls {0}'.format(model_path))

Found 2 items
drwxr-xr-x   - demouser hdfs          0 2016-11-02 15:29 hdfs:///user/demouser/recommender_model/data
drwxr-xr-x   - demouser hdfs          0 2016-11-02 15:29 hdfs:///user/demouser/recommender_model/metadata


Copy a Scala Spark class to the cluster to make the predictions.<br/>
See https://github.com/snowch/demo_2710/blob/master/scala_predictor/MovieRating.scala

In [38]:
ssh.cmd('rm -f movie-rating.jar')
ssh.cmd('wget -q -O movie-rating.jar https://github.com/snowch/demo_2710/blob/master/scala_predictor/movie-rating_2.10-1.0.jar?raw=true')
ssh.cmd('ls -l movie-rating.jar')

-rw-r--r--. 1 demouser biusers 2472 Nov  2 15:29 movie-rating.jar


Set the user_id and movie_id we want a prediction for

In [39]:
user_id = 0
movie_id = 500

Execute the Spark class

In [40]:
output_path = 'hdfs:///user/{0}/rating'.format(username)
print(output_path)

hdfs:///user/demouser/rating


In [41]:
# ensure the output folder is clean
ssh.cmd('hdfs dfs -rm -f -r -skipTrash {0}'.format(output_path))
ssh.cmd('spark-submit --class "MovieRating" --master yarn-cluster ./movie-rating.jar {0} {1} {2} {3}'.format(model_path, user_id, movie_id, output_path))

Deleted hdfs:///user/demouser/rating
16/11/02 15:29:13 WARN SparkConf: The configuration key 'spark.yarn.applicationMaster.waitTries' has been deprecated as of Spark 1.3 and and may be removed in the future. Please use the new key 'spark.yarn.am.waitTime' instead.
16/11/02 15:29:13 WARN SparkConf: The configuration key 'spark.yarn.applicationMaster.waitTries' has been deprecated as of Spark 1.3 and and may be removed in the future. Please use the new key 'spark.yarn.am.waitTime' instead.
16/11/02 15:29:14 INFO TimelineClientImpl: Timeline service address: http://bi-hadoop-prod-4194.bi.services.us-south.bluemix.net:8188/ws/v1/timeline/
16/11/02 15:29:15 INFO RMProxy: Connecting to ResourceManager at bi-hadoop-prod-4194.bi.services.us-south.bluemix.net/172.16.237.1:8050
16/11/02 15:29:16 INFO Client: Requesting a new application from cluster with 1 NodeManagers
16/11/02 15:29:16 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluste
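
The `spark-submit` command above is assembled with plain string formatting, which works for the values used here but would break if a path contained spaces or shell metacharacters. The sketch below quotes every value; the helper name `spark_submit_command` is hypothetical, and it assumes Python 3, where `shlex.quote` is available (on Python 2, `pipes.quote` plays the same role).

```python
import shlex


def spark_submit_command(jar, main_class, args, master='yarn-cluster'):
    """Build a spark-submit command line with every value shell-quoted."""
    parts = ['spark-submit', '--class', main_class, '--master', master, jar]
    parts.extend(str(a) for a in args)
    return ' '.join(shlex.quote(p) for p in parts)


cmd = spark_submit_command(
    './movie-rating.jar',
    'MovieRating',
    ['hdfs:///user/demouser/recommender_model', 0, 500,
     'hdfs:///user/demouser/rating'],
)
print(cmd)
```

The resulting string could be passed to `ssh.cmd` in place of the formatted string; for the values in this notebook it produces the same command text.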

Verify we have some ratings

In [42]:
ssh.cmd('hdfs dfs -ls {0}'.format(output_path))

Found 2 items
-rw-r--r--   3 demouser hdfs          0 2016-11-02 15:29 hdfs:///user/demouser/rating/_SUCCESS
-rw-r--r--   3 demouser hdfs         24 2016-11-02 15:29 hdfs:///user/demouser/rating/part-00000


Let's check the predicted rating

In [43]:
ssh.cmd('hdfs dfs -cat {0}/*'.format(output_path))

0,500,5.518930176128497
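
The output line has the form `user_id,movie_id,rating`, so it is straightforward to turn back into typed values on the DSX side. A minimal sketch, assuming the line is available as a string; the helper name `parse_prediction` is hypothetical.

```python
def parse_prediction(line):
    """Parse 'user_id,movie_id,rating' into (int, int, float)."""
    user, movie, rating = line.strip().split(',')
    return int(user), int(movie), float(rating)


user, movie, rating = parse_prediction('0,500,5.518930176128497')
print(user, movie, rating)
```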
