## Overview

This notebook shows how to export the movie recommendation model that was built on DSX to BigInsights.
Once the model is on BigInsights, it can be used in several ways, e.g.

- from Oozie to make batch predictions
- from Spark Streaming to make realtime predictions

We will use Python ssh and scp libraries (paramiko and scp) to move data and code between DSX and BigInsights.

### Set up SSH

Read cluster credentials

In [43]:
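with open('credentials', 'r') as f:
    # the file holds a single line: hostname,username,password
    # strip() drops the trailing newline so it does not end up in the password
    (hostname, username, password) = f.readline().strip().split(',')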
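
A slightly more defensive way to read the same line is sketched below. It assumes the `credentials` file holds a single `hostname,username,password` line; the helper name `parse_credentials` is hypothetical and not part of the original notebook. Note that it also strips whitespace around each field, which would be wrong for a password that genuinely starts or ends with a space.

```python
def parse_credentials(line):
    """Split one 'hostname,username,password' line into a 3-tuple."""
    # Split at most twice so that commas inside the password are preserved.
    parts = [part.strip() for part in line.split(',', 2)]
    if len(parts) != 3 or not all(parts):
        raise ValueError('expected a line of the form hostname,username,password')
    return tuple(parts)
```

It could be used as `hostname, username, password = parse_credentials(f.readline())`.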

Install required ssh and scp libraries

In [44]:
!pip install --user --quiet paramiko
!pip install --user --quiet scp

The ssh library does not work out of the box in this environment, so we need to patch the backend discovery of the cryptography package it depends on.

In [45]:
def patch_crypto_be_discovery():
    # Monkey patches cryptography's backend detection.
    from cryptography.hazmat import backends

    try:
        from cryptography.hazmat.backends.commoncrypto.backend import backend as be_cc
    except ImportError:
        be_cc = None

    try:
        from cryptography.hazmat.backends.openssl.backend import backend as be_ossl
    except ImportError:
        be_ossl = None

    backends._available_backends_list = [ be for be in (be_cc, be_ossl) if be is not None ]
patch_crypto_be_discovery()

Set up a utility method to make running ssh commands easier

In [46]:
import paramiko
s = paramiko.SSHClient()
s.load_system_host_keys()
s.set_missing_host_key_policy(paramiko.AutoAddPolicy())

def ssh_cmd(command):
    s.connect(hostname, 22, username, password)
    try:
        # kinit will fail on Basic clusters, but that can be ignored
        s.exec_command('kinit -k -t {0}.keytab {0}@IBM.COM'.format(username))
        (stdin, stdout, stderr) = s.exec_command(command)
        # print everything the command wrote to stdout, then to stderr
        for line in stdout.readlines():
            print(line.rstrip())
        for line in stderr.readlines():
            print(line.rstrip())
    finally:
        # always release the connection, even if the command raises
        s.close()
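
The helper above only prints output, so a failed remote command is easy to miss. The sketch below shows one way to also capture the exit status. It assumes `client` is an already connected paramiko `SSHClient` (or any object with the same `exec_command` interface); the name `run_remote` is hypothetical and not part of the original notebook.

```python
def run_remote(client, command):
    """Run `command` on a connected client and return (status, out, err)."""
    stdin, stdout, stderr = client.exec_command(command)
    # Read both streams fully before asking for the exit status.
    out = stdout.read().decode('utf-8', 'replace')
    err = stderr.read().decode('utf-8', 'replace')
    # recv_exit_status() blocks until the remote command has finished.
    status = stdout.channel.recv_exit_status()
    return status, out, err
```

A non-zero `status` then signals that a step such as `hdfs dfs -copyFromLocal` failed, instead of relying on reading the printed output.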

Set up utility methods to make scp commands easier

In [47]:
from scp import SCPClient

def scp_put(filenames):
    s.connect(hostname, 22, username, password)
    # kinit will fail on Basic clusters, but that can be ignored
    s.exec_command('kinit -k -t {0}.keytab {0}@IBM.COM'.format(username))
    # the with block closes the scp channel on exit
    with SCPClient(s.get_transport()) as scp:
        scp.put(filenames)
    # close the underlying ssh connection
    s.close()

def scp_get(filenames):
    # connect on the standard ssh port, as in scp_put
    s.connect(hostname, 22, username, password)
    with SCPClient(s.get_transport()) as scp:
        scp.get(filenames)
    s.close()

Let's verify that ssh is working by listing the contents of the HDFS root folder

In [48]:
ssh_cmd('hdfs dfs -ls /')

Found 9 items
drwxrwxr-x   - ams      hdfs             0 2016-10-15 15:27 /amshbase
drwxrwxrwx   - yarn     hadoop           0 2016-10-16 19:48 /app-logs
drwxr-xr-x   - hdfs     hdfs             0 2016-07-05 07:07 /apps
drwxr-xr-x   - hdfs     hdfs             0 2016-07-05 07:07 /iop
drwxr-xr-x   - mapred   hdfs             0 2016-07-05 07:05 /mapred
drwxrwxrwx   - mapred   hadoop           0 2016-07-05 07:05 /mr-history
drwx------   - demouser biusers          0 2016-10-15 12:50 /securedir
drwxrwxrwx   - hdfs     hdfs             0 2016-10-15 12:49 /tmp
drwxr-xr-x   - hdfs     hdfs             0 2016-10-15 12:50 /user


The next command runs on DSX. It creates a tar archive containing our model.

In [93]:
!rm -f recommender_model.tgz
!tar czf recommender_model.tgz recommender_model/
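
The same archive could also be built without shelling out, using Python's standard `tarfile` module. This is a minimal sketch, assuming the model was saved to a local `recommender_model/` folder as above; the function name `make_model_archive` is hypothetical.

```python
import os
import tarfile

def make_model_archive(model_dir, archive_path):
    """Write a gzip-compressed tar of model_dir, replacing any old archive."""
    if os.path.exists(archive_path):
        os.remove(archive_path)
    with tarfile.open(archive_path, 'w:gz') as tar:
        # arcname keeps the top-level folder name inside the archive,
        # matching what `tar czf recommender_model.tgz recommender_model/` produces.
        tar.add(model_dir, arcname=os.path.basename(os.path.normpath(model_dir)))
    return archive_path
```

For example, `make_model_archive('recommender_model', 'recommender_model.tgz')` would replace the two shell commands in the cell above.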

On BigInsights, delete any models that were copied across on previous runs of the notebook.

In [108]:
ssh_cmd('rm -rf ./recommender_model.tgz ./recommender_model')

Copy over our new model

In [109]:
scp_put('recommender_model.tgz')

Verify that it was copied ok

In [110]:
ssh_cmd('ls -l ./recommender_model.tgz')

-rw-r--r--. 1 demouser biusers 2126955 Oct 17 21:52 ./recommender_model.tgz
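
Listing the file only confirms that it exists and shows its size. A stricter check would compare checksums, as sketched below. This assumes the `sha256sum` command is available on the cluster, which the original notebook does not state; the helper names are hypothetical.

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a local file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

def matches_remote_checksum(local_path, sha256sum_output):
    """Compare a local file with the first field of `sha256sum` output."""
    fields = sha256sum_output.split()
    return bool(fields) and fields[0].lower() == sha256_of_file(local_path)
```

The second argument would be the text returned by running `sha256sum ./recommender_model.tgz` on the cluster, captured rather than printed.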


Unzip the model archive

In [111]:
ssh_cmd('tar xzf ./recommender_model.tgz')

Verify the unzipped model folders

In [112]:
ssh_cmd('find ./recommender_model')

./recommender_model
./recommender_model/data
./recommender_model/data/user
./recommender_model/data/user/._SUCCESS.crc
./recommender_model/data/user/.part-r-00000-22606c73-f35a-4143-993e-dbf47b7c0ce8.gz.parquet.crc
./recommender_model/data/user/.part-r-00001-22606c73-f35a-4143-993e-dbf47b7c0ce8.gz.parquet.crc
./recommender_model/data/user/_SUCCESS
./recommender_model/data/user/_common_metadata
./recommender_model/data/user/._common_metadata.crc
./recommender_model/data/user/part-r-00000-22606c73-f35a-4143-993e-dbf47b7c0ce8.gz.parquet
./recommender_model/data/user/part-r-00001-22606c73-f35a-4143-993e-dbf47b7c0ce8.gz.parquet
./recommender_model/data/user/_metadata
./recommender_model/data/user/._metadata.crc
./recommender_model/data/product
./recommender_model/data/product/.part-r-00000-8485d197-34ac-4732-8989-78658d33b29d.gz.parquet.crc
./recommender_model/data/product/._SUCCESS.crc
./recommender_model/data/product/_SUCCESS
./recommender_model/data/product/part-r-00000-8485d197-34ac-473

Remove any models that were copied to BigInsights HDFS on previous runs of the notebook, then copy the model from the BigInsights local file system to HDFS.

In [113]:
model_path = 'hdfs:///user/{0}/recommender_model'.format(username)
output_path = 'hdfs:///user/{0}/rating'.format(username)

In [114]:
# -f avoids an error message when there is no previous model to delete
ssh_cmd('hdfs dfs -rm -f -r -skipTrash {0}'.format(model_path))
ssh_cmd('hdfs dfs -copyFromLocal ./recommender_model {0}'.format(model_path))

Deleted hdfs:///user/demouser/recommender_model


Verify the model exists in HDFS

In [115]:
ssh_cmd('hdfs dfs -ls {0}'.format(model_path))

Found 2 items
drwxr-xr-x   - demouser hdfs          0 2016-10-17 21:52 hdfs:///user/demouser/recommender_model/data
drwxr-xr-x   - demouser hdfs          0 2016-10-17 21:52 hdfs:///user/demouser/recommender_model/metadata


Download a jar containing the Scala Spark class that does the predictions onto the cluster<br/>
For the source, see https://github.com/snowch/demo_2710/blob/master/scala_predictor/MovieRating.scala

In [116]:
ssh_cmd('rm -f movie-rating.jar')
ssh_cmd('wget -q -O movie-rating.jar https://github.com/snowch/demo_2710/blob/master/scala_predictor/movie-rating_2.10-1.0.jar?raw=true')
ssh_cmd('ls -l movie-rating.jar')

-rw-r--r--. 1 demouser biusers 2200 Oct 17 21:52 movie-rating.jar


The user_id and movie_id we want predictions for

In [117]:
user_id = 0
movie_id = 500

Execute the Spark class on the cluster with spark-submit

In [118]:
# ensure the output folder is clean
ssh_cmd('hdfs dfs -rm -f -r -skipTrash {0}'.format(output_path))
ssh_cmd('spark-submit --class "MovieRating" --master yarn-cluster ./movie-rating.jar {0} {1} {2} {3}'.format(model_path, user_id, movie_id, output_path))

Deleted hdfs:///user/demouser/rating
16/10/17 21:52:24 WARN SparkConf: The configuration key 'spark.yarn.applicationMaster.waitTries' has been deprecated as of Spark 1.3 and and may be removed in the future. Please use the new key 'spark.yarn.am.waitTime' instead.
16/10/17 21:52:24 WARN SparkConf: The configuration key 'spark.yarn.applicationMaster.waitTries' has been deprecated as of Spark 1.3 and and may be removed in the future. Please use the new key 'spark.yarn.am.waitTime' instead.
16/10/17 21:52:25 INFO TimelineClientImpl: Timeline service address: http://bi-hadoop-prod-4261.bi.services.us-south.bluemix.net:8188/ws/v1/timeline/
16/10/17 21:52:26 INFO RMProxy: Connecting to ResourceManager at bi-hadoop-prod-4261.bi.services.us-south.bluemix.net/172.16.115.1:8050
16/10/17 21:52:26 INFO Client: Requesting a new application from cluster with 1 NodeManagers
16/10/17 21:52:27 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluste
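
The `spark-submit` command above is assembled with plain string formatting, which would break if an argument ever contained a space or a shell metacharacter. The sketch below shows one way to build the same command with each part shell-quoted. The function name `spark_submit_command` is hypothetical; the flags are the ones already used in the cell above.

```python
try:
    from shlex import quote  # Python 3
except ImportError:
    from pipes import quote  # Python 2

def spark_submit_command(jar, main_class, app_args, master='yarn-cluster'):
    """Build a shell-safe spark-submit command line as a single string."""
    parts = ['spark-submit', '--class', main_class, '--master', master, jar]
    # Application arguments follow the jar, in the order the class expects.
    parts.extend(str(arg) for arg in app_args)
    return ' '.join(quote(part) for part in parts)
```

With the values from this notebook, the call would be `ssh_cmd(spark_submit_command('./movie-rating.jar', 'MovieRating', [model_path, user_id, movie_id, output_path]))`.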

Verify we have some ratings

In [119]:
ssh_cmd('hdfs dfs -ls {0}'.format(output_path))

Found 3 items
-rw-r--r--   3 demouser hdfs          0 2016-10-17 21:52 hdfs:///user/demouser/rating/_SUCCESS
-rw-r--r--   3 demouser hdfs          0 2016-10-17 21:52 hdfs:///user/demouser/rating/part-00000
-rw-r--r--   3 demouser hdfs         32 2016-10-17 21:52 hdfs:///user/demouser/rating/part-00001


Let's check the predicted rating

In [120]:
ssh_cmd('hdfs dfs -cat {0}/*'.format(output_path))

Rating(0,500,5.519096770617339)
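
To use the prediction in further Python code on DSX, the printed line can be parsed into numbers. This is a minimal sketch that assumes the output always has the `Rating(user,product,rating)` shape shown above; the function name `parse_rating` is hypothetical.

```python
import re

# Matches lines such as: Rating(0,500,5.519096770617339)
_RATING_RE = re.compile(r'Rating\((\d+),(\d+),([-+0-9.eE]+|NaN)\)')

def parse_rating(line):
    """Return (user_id, movie_id, rating) parsed from one output line."""
    match = _RATING_RE.search(line)
    if match is None:
        raise ValueError('not a Rating line: {0!r}'.format(line))
    user, product, rating = match.groups()
    return int(user), int(product), float(rating)
```

This requires capturing the command output as a string instead of printing it, which the `ssh_cmd` helper in this notebook does not do.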
