## Overview

This is a work-in-progress ...

### Setup SSH

Read cluster credentials

In [1]:
with open('credentials', 'r') as f:
    # strip the trailing newline so it does not end up in the password
    (hostname, username, password) = f.readline().strip().split(',')

Install required ssh and scp libraries

In [2]:
!pip install --user --upgrade --force --quiet git+https://github.com/snowch/nb_utils

Print the latest commit on the nb_utils master branch (the version just installed) for traceability purposes

In [3]:
import requests, json
requests.get('https://api.github.com/repos/snowch/nb_utils/git/refs/heads/master').json()

{u'object': {u'sha': u'04d1ae56e6093a2ac52818661bd8679df2a36c9b',
  u'type': u'commit',
  u'url': u'https://api.github.com/repos/snowch/nb_utils/git/commits/04d1ae56e6093a2ac52818661bd8679df2a36c9b'},
 u'ref': u'refs/heads/master',
 u'url': u'https://api.github.com/repos/snowch/nb_utils/git/refs/heads/master'}

Set up a utility object to make ssh and scp commands easier

In [4]:
from ssh_utils import ssh_utils
ssh = ssh_utils.SshUtil(hostname, username, password)

Let's verify ssh is working by listing the contents of the HDFS root folder

In [5]:
ssh.cmd('hdfs dfs -ls /')

Found 9 items
drwxrwxr-x   - ams      hdfs             0 2016-11-02 03:03 /amshbase
drwxrwxrwx   - yarn     hadoop           0 2016-11-02 00:33 /app-logs
drwxr-xr-x   - hdfs     hdfs             0 2016-07-05 07:07 /apps
drwxr-xr-x   - hdfs     hdfs             0 2016-07-05 07:07 /iop
drwxr-xr-x   - mapred   hdfs             0 2016-07-05 07:05 /mapred
drwxrwxrwx   - mapred   hadoop           0 2016-07-05 07:05 /mr-history
drwx------   - demouser biusers          0 2016-11-02 00:30 /securedir
drwxrwxrwx   - hdfs     hdfs             0 2016-11-02 00:29 /tmp
drwxr-xr-x   - hdfs     hdfs             0 2016-11-02 00:29 /user
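
A listing like the one above can also be checked programmatically. The following is a minimal sketch, assuming the command output is available as a string (whether `ssh.cmd` returns its output or only prints it is not shown in this notebook). `parse_hdfs_ls` is a hypothetical helper and is not part of `ssh_utils`.

```python
def parse_hdfs_ls(listing):
    """Return (permissions, owner, group, path) tuples from `hdfs dfs -ls` text."""
    entries = []
    for line in listing.splitlines():
        fields = line.split()
        # Entry lines have 8 columns: permissions, replication, owner, group,
        # size, date, time, path. The "Found N items" header and blank lines
        # have fewer columns and are skipped.
        if len(fields) < 8 or fields[0][0] not in "d-":
            continue
        # Simple re-join so paths with single spaces survive
        # (runs of whitespace inside a path would be collapsed).
        path = " ".join(fields[7:])
        entries.append((fields[0], fields[2], fields[3], path))
    return entries


sample = """Found 2 items
drwxrwxr-x   - ams      hdfs             0 2016-11-02 03:03 /amshbase
drwx------   - demouser biusers          0 2016-11-02 00:30 /securedir
"""

for perms, owner, group, path in parse_hdfs_ls(sample):
    print(path, owner, group, perms)
```

A check such as `any(path == '/user' for _, _, _, path in parse_hdfs_ls(text))` would then confirm that an expected folder is present.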


The next command runs on DSX. It creates a tar archive containing our model.

In [6]:
!rm -f recommender_model.tgz
!tar czf recommender_model.tgz recommender_model/
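
The same archive could be built without shelling out, using Python's standard `tarfile` module. This is a sketch rather than the notebook's own method; `make_model_archive` is a hypothetical helper name.

```python
import os
import tarfile


def make_model_archive(model_dir, archive_path):
    """Create a gzip-compressed tar archive of model_dir, replacing any old one."""
    if os.path.exists(archive_path):
        os.remove(archive_path)  # same effect as `rm -f`
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps the top-level folder name inside the archive,
        # matching `tar czf recommender_model.tgz recommender_model/`.
        top_level = os.path.basename(os.path.normpath(model_dir))
        tar.add(model_dir, arcname=top_level)
    return archive_path
```

Calling `make_model_archive('recommender_model', 'recommender_model.tgz')` from the notebook's working directory should produce an archive with the same layout as the shell commands above.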

On BigInsights, delete any models that were copied across on previous runs of the notebook.

In [7]:
ssh.cmd('rm -rf ./recommender_model.tgz ./recommender_model')

Copy over our new model

In [8]:
ssh.put('recommender_model.tgz')

Verify that it was copied OK

In [9]:
ssh.cmd('ls -l ./recommender_model.tgz')

-rw-r--r--. 1 demouser biusers 2129960 Nov  2 15:56 ./recommender_model.tgz


Unzip the model archive

In [10]:
ssh.cmd('tar xzf ./recommender_model.tgz')

Verify the unzipped model folders

In [11]:
ssh.cmd('find ./recommender_model')

./recommender_model
./recommender_model/data
./recommender_model/data/user
./recommender_model/data/user/.part-r-00000-70815abd-bcd7-4945-9bc5-b0796c698570.gz.parquet.crc
./recommender_model/data/user/part-r-00001-70815abd-bcd7-4945-9bc5-b0796c698570.gz.parquet
./recommender_model/data/user/._SUCCESS.crc
./recommender_model/data/user/part-r-00000-70815abd-bcd7-4945-9bc5-b0796c698570.gz.parquet
./recommender_model/data/user/_SUCCESS
./recommender_model/data/user/_common_metadata
./recommender_model/data/user/.part-r-00001-70815abd-bcd7-4945-9bc5-b0796c698570.gz.parquet.crc
./recommender_model/data/user/._common_metadata.crc
./recommender_model/data/user/_metadata
./recommender_model/data/user/._metadata.crc
./recommender_model/data/product
./recommender_model/data/product/part-r-00000-a9529fe2-fc9e-4b18-95c7-6300fd989442.gz.parquet
./recommender_model/data/product/.part-r-00000-a9529fe2-fc9e-4b18-95c7-6300fd989442.gz.parquet.crc
./recommender_model/data/product/._SUCCESS.crc
./recommend

Remove any models that were copied to BigInsights HDFS on previous runs of the notebook, then copy the model from the BigInsights local file system to HDFS.

In [12]:
model_path = 'hdfs:///user/{0}/recommender_model'.format(username)

In [51]:
print(model_path)

hdfs:///user/demouser/recommender_model


In [52]:
ssh.cmd('hdfs dfs -rm -r -skipTrash {0}'.format(model_path)) # it's ok if this fails
ssh.cmd('hdfs dfs -copyFromLocal ./recommender_model {0}'.format(model_path))

Deleted hdfs:///user/demouser/recommender_model


Verify the model exists in HDFS

In [15]:
ssh.cmd('hdfs dfs -ls {0}'.format(model_path))

Found 2 items
drwxr-xr-x   - demouser hdfs          0 2016-11-02 15:56 hdfs:///user/demouser/recommender_model/data
drwxr-xr-x   - demouser hdfs          0 2016-11-02 15:56 hdfs:///user/demouser/recommender_model/metadata


Copy the jar containing the Scala Spark class that does the predictions to the cluster<br/>
See https://github.com/snowch/demo_2710/blob/master/scala_streaming_predictor/src/main/scala/MovieRating.scala

In [26]:
ssh.cmd('rm -f movie-rating.jar')

# note we are now using the scala_streaming_predictor project
ssh.cmd('wget -q -O movie-rating.jar https://github.com/snowch/demo_2710/blob/master/scala_streaming_predictor/movie-rating_2.10-1.0.jar?raw=true')
ssh.cmd('ls -l movie-rating.jar')

-rw-r--r--. 1 demouser biusers 9432739 Nov  2 16:06 movie-rating.jar


Upload the MessageHub properties file and convert it into a Spark configuration file that also contains the model path

In [72]:
ssh.put("messagehub.properties")

ssh.cmd("""
    # spark requires properties to be prefixed with 'spark.'
    sed -i -e 's/^/spark./' messagehub.properties
    
    # spark requires property name value pairs to be separated by a space
    sed -i -e 's/=/ /' messagehub.properties
    
    # add the model path to the properties
    echo "\nspark.model_path {0}" >> messagehub.properties
    
    # rename the properties file inline with spark conventions
    mv messagehub.properties messagehub.conf
    
    # add the cluster's spark settings to the configuration
    cat /usr/iop/current/spark-client/conf/spark-defaults.conf >> messagehub.conf
    
""".format(model_path)
)

# uncomment to debug
# ssh.cmd("cat messagehub.conf")
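
The `sed` and `echo` steps above can also be expressed in Python, which makes the transformation easier to test before anything is uploaded. This is a minimal sketch under stated assumptions: `properties_to_spark_conf` is a hypothetical helper, the property names in the sample are invented for illustration, and unlike the `sed` commands it skips blank lines and comments and trims whitespace around keys and values. It does not append the cluster's `spark-defaults.conf`.

```python
def properties_to_spark_conf(properties_text, model_path):
    """Convert `key=value` properties text into Spark `spark.key value` lines."""
    lines = []
    for raw in properties_text.splitlines():
        line = raw.strip()
        # Unlike the sed commands, skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Only the first '=' separates key from value, as with `sed 's/=/ /'`.
        key, sep, value = line.partition("=")
        if sep:
            lines.append("spark.{0} {1}".format(key.strip(), value.strip()))
        else:
            lines.append("spark.{0}".format(line))
    # Add the model path, as the `echo ... >> messagehub.properties` step does.
    lines.append("spark.model_path {0}".format(model_path))
    return "\n".join(lines) + "\n"


# Hypothetical property names, for illustration only.
sample_properties = "bootstrap.servers=host1:9093,host2:9093\napi_key=abc=def\n"
print(properties_to_spark_conf(sample_properties,
                               "hdfs:///user/demouser/recommender_model"))
```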

First check we don't have any existing spark submit jobs running

In [73]:
ssh.cmd('yarn application -list')

Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):0
                Application-Id	    Application-Name	    Application-Type	      User	     Queue	             State	       Final-State	       Progress	                       Tracking-URL
16/11/02 16:34:01 INFO impl.TimelineClientImpl: Timeline service address: http://bi-hadoop-prod-4194.bi.services.us-south.bluemix.net:8188/ws/v1/timeline/
16/11/02 16:34:01 INFO client.RMProxy: Connecting to ResourceManager at bi-hadoop-prod-4194.bi.services.us-south.bluemix.net/172.16.237.1:8050


If there are existing YARN jobs for the MovieRating application, kill them

In [74]:
# ssh.cmd('yarn application -kill application_1478046523919_0017')
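
Rather than copying application ids by hand, the kill commands could be derived from the `yarn application -list` output. The sketch below assumes the listing text is available as a string and that its columns are tab-separated, as in the output shown in this notebook. `kill_commands` is a hypothetical helper; it only builds the command strings and does not run them.

```python
def kill_commands(listing, app_name):
    """Build `yarn application -kill` commands for applications named app_name."""
    commands = []
    for line in listing.splitlines():
        if not line.startswith("application_"):
            continue  # skip the header and log lines
        # Columns: Application-Id, Application-Name, Application-Type, ...
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) > 1 and fields[1] == app_name:
            commands.append("yarn application -kill {0}".format(fields[0]))
    return commands


sample_listing = (
    "                Application-Id\t    Application-Name\t    Application-Type\n"
    "application_1478046523919_0017\t         MovieRating\t               SPARK\n"
    "application_1478046523919_0018\t          OtherJob\t               SPARK\n"
)

for command in kill_commands(sample_listing, "MovieRating"):
    print(command)
```

Each resulting string could then be passed to `ssh.cmd(...)` after a manual check.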

Execute the spark class

In [75]:
ssh.cmd('spark-submit --class "MovieRating" --master yarn-cluster --properties-file messagehub.conf ./movie-rating.jar > /dev/null 2>&1 &')

# get the currently running yarn applications
ssh.cmd('sleep 5 && yarn application -list | grep "^application_"')

application_1478046523919_0026	         MovieRating	               SPARK	  demouser	   default	          ACCEPTED	         UNDEFINED	             0%	                                N/A
16/11/02 16:34:11 INFO impl.TimelineClientImpl: Timeline service address: http://bi-hadoop-prod-4194.bi.services.us-south.bluemix.net:8188/ws/v1/timeline/
16/11/02 16:34:12 INFO client.RMProxy: Connecting to ResourceManager at bi-hadoop-prod-4194.bi.services.us-south.bluemix.net/172.16.237.1:8050
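
The fixed `sleep 5` above only shows the application in whatever state it has reached by then (ACCEPTED in this run). An alternative is to poll the listing until the application reaches the wanted state. This is a sketch, not the notebook's method: `wait_for_state` and its `fetch_listing` parameter are hypothetical, and `fetch_listing` stands for any callable that returns the text of `yarn application -list`, assumed to be tab-separated as in the output shown here.

```python
import time


def wait_for_state(fetch_listing, app_name, wanted="RUNNING",
                   attempts=10, delay=5.0):
    """Poll a YARN listing until app_name reaches the wanted state.

    Returns the application id, or None if the state was not seen
    within the given number of attempts.
    """
    for _ in range(attempts):
        for line in fetch_listing().splitlines():
            if not line.startswith("application_"):
                continue  # skip the header and log lines
            # Columns: id, name, type, user, queue, state, ...
            fields = [field.strip() for field in line.split("\t")]
            if (len(fields) >= 6 and fields[1] == app_name
                    and fields[5] == wanted):
                return fields[0]
        time.sleep(delay)
    return None


# Simulated listings standing in for two successive calls to the cluster.
ROW = "application_1478046523919_0026\tMovieRating\tSPARK\tdemouser\tdefault\t{0}"
listings = iter([ROW.format("ACCEPTED"), ROW.format("RUNNING")])
print(wait_for_state(lambda: next(listings), "MovieRating", delay=0))
```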


In [76]:
ssh.cmd('yarn application -list')
ssh.cmd('yarn logs -applicationId application_1478046523919_0026')

Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1
                Application-Id	    Application-Name	    Application-Type	      User	     Queue	             State	       Final-State	       Progress	                       Tracking-URL
application_1478046523919_0026	         MovieRating	               SPARK	  demouser	   default	           RUNNING	         UNDEFINED	            10%	          http://172.16.237.2:35227
16/11/02 16:34:29 INFO impl.TimelineClientImpl: Timeline service address: http://bi-hadoop-prod-4194.bi.services.us-south.bluemix.net:8188/ws/v1/timeline/
16/11/02 16:34:29 INFO client.RMProxy: Connecting to ResourceManager at bi-hadoop-prod-4194.bi.services.us-south.bluemix.net/172.16.237.1:8050
/app-logs/demouser/logs/application_1478046523919_0026 does not have any log files.
16/11/02 16:34:33 INFO impl.TimelineClientImpl: Timeline service address: http://bi-hadoop-prod-4194.bi.services.us-south.bluemix.net:8188/ws/v1/timel

In [78]:
# some commands for debugging

# ssh.cmd('yarn application -list -appStates ALL')

# get the logs if there is an issue
# ssh.cmd('yarn logs -applicationId application_1478046523919_0025')

# kill the application
# ssh.cmd('yarn application -kill applicationId')

After running the code above and seeing your application running (Application-Name = MovieRating):

 0. ensure the notebook kernel is stopped for step 09
 1. open another window with the notebook for step 08
 2. send some messages to MessageHub: **STEP 08 (A) - Produce Prediction Requests**
 3. consume the responses: **STEP 08 (B) - Consume Prediction Responses**