## Overview

This notebook retrieves movie rating data from BigInsights over WebHDFS. The data was copied to the cluster in the previous step.

### Read the cluster connection information

We retrieve the cluster hostname, username and password that were saved to DSX in Step 1.

In [12]:
with open('credentials', 'r') as f:
    # strip() drops the trailing newline, which would otherwise end up in the password
    (hostname, username, password) = f.readline().strip().split(',')

The next cell sets up a Python object we can use to interact with our cluster. If you are using this notebook with an 'Enterprise' cluster, you will need to uncomment the `'verify': False` line in the cell.

In [13]:
from pywebhdfs.webhdfs import PyWebHdfsClient
hdfs = PyWebHdfsClient( 
    base_uri_pattern="https://{0}:8443/gateway/default/webhdfs/v1".format(hostname),
    request_extra_opts={
        'auth': (username, password),
        # 'verify': False, # uncomment this for Enterprise clusters
    }
)
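
To make the connection settings more concrete, the following is a minimal sketch of the kind of URL a WebHDFS read request resolves to, using the gateway base URI from the cell above and the standard WebHDFS `op=OPEN` operation. The helper `webhdfs_open_url` and the example hostname and user are hypothetical and exist only for illustration; how `PyWebHdfsClient` assembles its requests internally may differ.

```python
def webhdfs_open_url(hostname, hdfs_path, port=8443, topology="default"):
    """Build a WebHDFS OPEN (read file) URL for a gateway-fronted cluster.

    Hypothetical helper for illustration; not part of pywebhdfs.
    """
    # Strip leading slashes so the path joins cleanly onto the base URI.
    clean_path = hdfs_path.lstrip("/")
    base = "https://{0}:{1}/gateway/{2}/webhdfs/v1".format(hostname, port, topology)
    return "{0}/{1}?op=OPEN".format(base, clean_path)


# Example hostname and user are placeholders, not real cluster values.
print(webhdfs_open_url("cluster.example.com", "/user/alice/ratings.dat"))
```

Printing the URL this way can help when debugging connection errors, since the same address can be tried with any HTTP client.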

### Load the movie rating data

We can now load the rating data from WebHDFS and save it to DSX local file storage. Note that DSX storage space is limited per user.

In [14]:
ratings_path = '//user/{0}/ratings.dat'.format(username)
ratings_data = hdfs.read_file(ratings_path)
# Binary mode writes the payload unchanged on both Python 2 (str) and Python 3 (bytes)
with open('ratings.dat', 'wb') as f:
    f.write(ratings_data)
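
Because local storage is limited, it can be useful to check how much space the downloaded file takes. The sketch below uses only the standard library; the helper name `file_size_mb` is an assumption introduced for this example.

```python
import os


def file_size_mb(path):
    """Return the size of a local file in megabytes (1 MB = 1024 * 1024 bytes)."""
    return os.path.getsize(path) / (1024.0 * 1024.0)


# Report the size only if the download from the previous cell exists.
if os.path.exists('ratings.dat'):
    print('ratings.dat: {0:.1f} MB'.format(file_size_mb('ratings.dat')))
```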

Let's visually inspect the data to get a 'feel' for it:

In [15]:
!head -3 ratings.dat
!echo
!tail -3 ratings.dat

1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968

6040::562::5::956704746
6040::1096::4::956715648
6040::1097::4::956715569


Note the format of the dataset:

- There is no header record.
- Each line has four fields separated by `::`, in the order `UserID::MovieID::Rating::Timestamp`.
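
As a rough illustration of this format, a single line can be split on `::` and converted to typed fields. This is a minimal sketch: the `Rating` tuple and `parse_rating` function are names made up for this example, and it assumes all four fields are whole numbers, as in the sample rows above.

```python
from collections import namedtuple

# Field names mirror the UserID::MovieID::Rating::Timestamp layout.
Rating = namedtuple('Rating', ['user_id', 'movie_id', 'rating', 'timestamp'])


def parse_rating(line):
    """Parse one 'UserID::MovieID::Rating::Timestamp' line into a Rating."""
    user_id, movie_id, rating, timestamp = line.strip().split('::')
    return Rating(int(user_id), int(movie_id), int(rating), int(timestamp))


# First row of the sample output shown earlier.
sample = parse_rating('1::1193::5::978300760')
print(sample)
```

Since the file has no header record, every line can be passed through the same function without skipping the first row.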