## Overview


This notebook loads the MovieLens ml-1m dataset onto the cluster.

### Read the cluster connection information

First, let's retrieve the credentials we saved earlier for the cluster.

In [13]:
with open('credentials', 'r') as f:
    # strip the trailing newline so it does not end up in the password
    (hostname, username, password) = f.readline().strip().split(',')
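
The one-line parse above assumes the `credentials` file holds a single `hostname,username,password` line. Below is a minimal sketch of a stricter parser under that same assumption. `parse_credentials` is a hypothetical helper, not part of nb_utils, and it further assumes that only the password may contain commas.

```python
def parse_credentials(line):
    """Split a 'hostname,username,password' line into its three fields."""
    # Remove only the line terminator that readline() keeps.
    # Split at most twice so a password containing commas stays intact
    # (assumption: hostname and username never contain commas).
    parts = line.rstrip('\r\n').split(',', 2)
    if len(parts) != 3:
        raise ValueError('expected hostname,username,password')
    return tuple(parts)


# Example with made-up values
demo_host, demo_user, demo_password = parse_credentials('host.example.com,demouser,secret\n')
```

Failing loudly on a malformed line is cheaper than debugging a rejected SSH login later.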

### Set up an SSH library for running commands on the cluster

We can now set up a Python SSH library.

In [14]:
!pip install --user --upgrade --force --quiet git+https://github.com/snowch/nb_utils

Print the commit that the nb_utils master branch currently points to, for traceability.

In [15]:
import requests, json
requests.get('https://api.github.com/repos/snowch/nb_utils/git/refs/heads/master').json()

{u'object': {u'sha': u'04d1ae56e6093a2ac52818661bd8679df2a36c9b',
  u'type': u'commit',
  u'url': u'https://api.github.com/repos/snowch/nb_utils/git/commits/04d1ae56e6093a2ac52818661bd8679df2a36c9b'},
 u'ref': u'refs/heads/master',
 u'url': u'https://api.github.com/repos/snowch/nb_utils/git/refs/heads/master'}

Load the SSH utilities.

In [16]:
from ssh_utils import ssh_utils
ssh = ssh_utils.SshUtil(hostname, username, password)

### Retrieve the ml-1m dataset

In [17]:
# make sure we don't have any data hanging around from previous runs
ssh.cmd('rm -rf ml-1m ml-1m.zip movies.dat users.dat ratings.dat')

# download the data to the BigInsights local filesystem
ssh.cmd('wget --quiet http://files.grouplens.org/datasets/movielens/ml-1m.zip')

# unzip the data
ssh.cmd('unzip ml-1m.zip')

Archive:  ml-1m.zip
   creating: ml-1m/
  inflating: ml-1m/movies.dat
  inflating: ml-1m/ratings.dat
  inflating: ml-1m/README
  inflating: ml-1m/users.dat


### Upload the data to HDFS

In [18]:
# make sure we don't have any data hanging around from previous runs
ssh.cmd('''
    hdfs dfs -rm -f -skipTrash ./ratings.dat
    hdfs dfs -rm -r -f -skipTrash ./rating
    hdfs dfs -rm -r -f -skipTrash ./recommender_model
''')

Deleted ratings.dat


Copy the data from the BigInsights local filesystem to HDFS and verify that it was copied.

In [19]:
ssh.cmd('''
    hdfs dfs -put ./ml-1m/ratings.dat ./ratings.dat
    hdfs dfs -ls ./
''')

Found 2 items
drwxr-xr-x   - demouser hdfs          0 2016-11-02 10:37 .sparkStaging
-rw-r--r--   3 demouser hdfs   24594131 2016-11-02 15:08 ratings.dat
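
The listing above can also be checked programmatically instead of by eye. Below is a minimal sketch that parses `hdfs dfs -ls` text and returns the size of a named entry. `listed_size` is a hypothetical helper, and because the notebook does not show whether `ssh.cmd` returns its output as a string, the example works on the listing pasted in as text.

```python
def listed_size(ls_output, name):
    """Return the size in bytes of `name` in `hdfs dfs -ls` output, or None."""
    for line in ls_output.splitlines():
        # Columns: permissions, replication, owner, group, size, date, time, path.
        # Limit the split so a path containing spaces stays in one field.
        fields = line.split(None, 7)
        if len(fields) == 8 and fields[7] == name:
            return int(fields[4])
    return None


# Listing copied from the cell output above
listing = """Found 2 items
drwxr-xr-x   - demouser hdfs          0 2016-11-02 10:37 .sparkStaging
-rw-r--r--   3 demouser hdfs   24594131 2016-11-02 15:08 ratings.dat
"""

print(listed_size(listing, 'ratings.dat'))  # 24594131
```

A result of `None` means the file is not listed, which is a cue to rerun the upload before deleting the local copy.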


Finally, let's remove the data we downloaded to the local filesystem.

In [20]:
ssh.cmd('rm -rf ml-1m ml-1m.zip')