## Overview


This notebook loads the MovieLens ml-1m dataset onto the cluster.

### Read the cluster connection information

First, let's read our previously saved credentials for the cluster.

In [58]:
with open('credentials', 'r') as f:
    # strip the trailing newline so it does not end up in the password
    (hostname, username, password) = f.readline().strip().split(',')
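
The parsing step above can be sketched as a small standalone function. This is only an illustration: the helper name `parse_credentials` and the sample values are hypothetical, and it assumes the `credentials` file holds a single `hostname,username,password` line.

```python
def parse_credentials(line):
    """Split a 'hostname,username,password' line into its three fields."""
    # Drop the line ending first so it cannot leak into the password.
    line = line.rstrip('\r\n')
    # At most two splits, so commas inside the password are preserved.
    hostname, username, password = line.split(',', 2)
    return hostname, username, password


# Hypothetical sample values, not real credentials.
print(parse_credentials('host.example.com,demouser,s3cret\n'))
```

Limiting the split to two separators is a design choice made here for safety; the original cell splits on every comma.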

### Set up an SSH library for running commands on the cluster

We can now set up a Python SSH library.

In [59]:
!pip install --user --quiet paramiko

In this environment the default installation of paramiko fails, so we need to patch the backend discovery of the cryptography library it depends on.

In [60]:
def patch_crypto_be_discovery():
    # Monkey patches cryptography's backend detection.
    from cryptography.hazmat import backends

    try:
        from cryptography.hazmat.backends.commoncrypto.backend import backend as be_cc
    except ImportError:
        be_cc = None

    try:
        from cryptography.hazmat.backends.openssl.backend import backend as be_ossl
    except ImportError:
        be_ossl = None

    backends._available_backends_list = [ be for be in (be_cc, be_ossl) if be is not None ]

patch_crypto_be_discovery()

Finally, we can set up a convenience function for executing SSH commands on the cluster.

In [61]:
import paramiko
s = paramiko.SSHClient()
s.load_system_host_keys()
s.set_missing_host_key_policy(paramiko.AutoAddPolicy())

def ssh_cmd(command):
    s.connect(hostname, 22, username, password)
    # kinit will fail on Basic clusters, but that can be ignored
    s.exec_command('kinit -k -t {0}.keytab {0}@IBM.COM'.format(username))
    (stdin, stdout, stderr) = s.exec_command(command)
    for line in stdout.readlines():
        print(line.rstrip())
    for line in stderr.readlines():
        print(line.rstrip())
    s.close()
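
The helper above prints the output but does not report whether the remote command succeeded. A minimal sketch of a variant that also returns the exit status follows. The name `run_remote` is hypothetical, and it assumes `client` is an already-connected `paramiko.SSHClient` (or any object with a compatible `exec_command`).

```python
def run_remote(client, command):
    """Run command on a connected client; return (exit_status, output_lines)."""
    stdin, stdout, stderr = client.exec_command(command)
    # Collect stdout first, then stderr, as the notebook's helper does.
    lines = [line.rstrip() for line in stdout.readlines()]
    lines += [line.rstrip() for line in stderr.readlines()]
    # recv_exit_status() blocks until the remote command has finished.
    status = stdout.channel.recv_exit_status()
    return status, lines
```

After `s.connect(hostname, 22, username, password)`, a call such as `status, lines = run_remote(s, 'hdfs dfs -ls ./ratings.dat')` would let the notebook stop early when `status` is non-zero.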

### Retrieve the ml-1m dataset

In [70]:
# make sure we don't have any data hanging around from previous runs
ssh_cmd('rm -rf ml-1m ml-1m.zip movies.dat users.dat ratings.dat')

# download the data to the BigInsights local filesystem
ssh_cmd('wget --quiet http://files.grouplens.org/datasets/movielens/ml-1m.zip')

# unzip the data
ssh_cmd('unzip ml-1m.zip')

Archive:  ml-1m.zip
   creating: ml-1m/
  inflating: ml-1m/movies.dat
  inflating: ml-1m/ratings.dat
  inflating: ml-1m/README
  inflating: ml-1m/users.dat


### Upload the data to HDFS

In [71]:
# make sure we don't have any data hanging around from previous runs
ssh_cmd('hdfs dfs -rm -f ./ratings.dat')

Copy the data from the BigInsights local filesystem to HDFS and verify that it was copied.

In [72]:
ssh_cmd('hdfs dfs -put ./ml-1m/ratings.dat ./ratings.dat')
ssh_cmd('hdfs dfs -ls ./ratings.dat')

-rw-r--r--   3 demouser hdfs   24594131 2016-10-17 09:20 ratings.dat
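
The listing above can also be checked programmatically rather than by eye. The sketch below parses one `hdfs dfs -ls` file entry into its fields. The function name `parse_hdfs_ls_line` is hypothetical, and it assumes the eight-column layout shown in the output above (permissions, replication, owner, group, size, date, time, path).

```python
def parse_hdfs_ls_line(line):
    """Parse one file entry printed by `hdfs dfs -ls` into a dict."""
    # Split on whitespace into at most 8 fields so a path with spaces stays whole.
    fields = line.split(None, 7)
    if len(fields) != 8:
        raise ValueError('unexpected ls line: %r' % line)
    perms, replication, owner, group, size, date, time, path = fields
    return {'owner': owner, 'size': int(size), 'path': path}


# The line printed by the cell above.
entry = parse_hdfs_ls_line(
    '-rw-r--r--   3 demouser hdfs   24594131 2016-10-17 09:20 ratings.dat')
print(entry['size'])
```

The parsed size could then be compared with the size of the local `ml-1m/ratings.dat` before the local copy is deleted.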


Finally, let's remove the data we downloaded to the local filesystem.

In [73]:
ssh_cmd('rm -rf ml-1m ml-1m.zip')