## Overview

This notebook retrieves movie rating data from BigInsights over WebHDFS. The data was copied to the cluster in the previous step.

### Read the cluster connection information

We retrieve the cluster hostname, username and password that were saved to DSX in Step 1.

In [16]:
with open('credentials', 'r') as f:
    # strip() removes any trailing newline so it does not end up in the password
    (hostname, username, password) = f.readline().strip().split(',')

The next cell sets up a Python object that we can use to interact with the cluster. If you are using this notebook with an 'Enterprise' cluster, you will need to uncomment the `'verify': False` line shown in the cell.

In [17]:
!pip install --user --quiet pywebhdfs
from pywebhdfs.webhdfs import PyWebHdfsClient
hdfs = PyWebHdfsClient( 
    base_uri_pattern="https://{0}:8443/gateway/default/webhdfs/v1".format(hostname),
    request_extra_opts={
        'auth': (username, password),
        # 'verify': False, # uncomment this for Enterprise clusters
    }
)
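Every call the client makes is a WebHDFS REST request whose URL carries the operation in an `op` query parameter. As a rough sketch of how such a URL is put together, the hypothetical helper below (`webhdfs_url` is not part of pywebhdfs) reuses the port and gateway path from the `base_uri_pattern` above; the hostname in the example call is a placeholder.

```python
def webhdfs_url(hostname, path, op, port=8443, gateway='gateway/default'):
    """Build a WebHDFS REST URL behind the gateway (hypothetical helper)."""
    # Drop leading slashes so the path joins cleanly onto .../webhdfs/v1/
    clean_path = path.lstrip('/')
    return 'https://{0}:{1}/{2}/webhdfs/v1/{3}?op={4}'.format(
        hostname, port, gateway, clean_path, op)

# 'example-host' is a placeholder, not a real cluster
print(webhdfs_url('example-host', '/user/demouser/ratings.dat', 'OPEN'))
```

`OPEN` reads a file; other standard WebHDFS operations such as `LISTSTATUS` or `GETFILESTATUS` follow the same URL shape.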

### Load the movie rating data

We can now load the rating data from webHDFS and save it onto DSX local file storage.

First, set the path to the file in HDFS:

In [3]:
ratings_path = '//user/{0}/ratings.dat'.format(username)
print(ratings_path)

//user/demouser/ratings.dat


Now retrieve the file contents into a variable (see NOTE 1 for a discussion of this approach):

In [19]:
ratings_data = hdfs.read_file(ratings_path)

Save the data to a file on DSX:

In [19]:
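# read_file returns the raw response body, so write in binary mode
# (this also works when the body is bytes under Python 3)
with open('ratings.dat', 'wb') as f:
    f.write(ratings_data)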

Let's visually inspect the data to get a 'feel' for it:

In [20]:
!head -3 ratings.dat
!echo
!tail -3 ratings.dat

1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968

6040::562::5::956704746
6040::1096::4::956715648
6040::1097::4::956715569
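If shell commands are not available in your environment, a similar peek can be done in plain Python. This is a minimal sketch: `peek` is a hypothetical helper, and the demo writes a small stand-in file rather than touching `ratings.dat`.

```python
import os
import tempfile
from collections import deque
from itertools import islice

def peek(path, n=3):
    """Return the first n and last n lines of a text file without loading it all."""
    with open(path) as f:
        head = [line.rstrip('\n') for line in islice(f, n)]
    with open(path) as f:
        # deque with maxlen keeps only the last n lines seen
        tail = [line.rstrip('\n') for line in deque(f, maxlen=n)]
    return head, tail

# Demo on a small stand-in file; in the notebook you would pass 'ratings.dat'
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.dat')
with open(sample_path, 'w') as f:
    f.write('1::1193::5::978300760\n1::661::3::978302109\n'
            '1::914::3::978301968\n6040::562::5::956704746\n')

head, tail = peek(sample_path, n=2)
print(head)
print(tail)
```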


Note the format of the dataset:

- There is no header record.
- The fields are `UserID::MovieID::Rating::Timestamp`, separated by `::`.
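To make the record layout concrete, here is a small sketch that parses lines of this shape into named tuples. The `Rating` type and `parse_rating` function are illustrative names, and treating all four fields as integers is an assumption based on the sample rows shown above.

```python
from collections import namedtuple

# Field names are illustrative; the file itself has no header record
Rating = namedtuple('Rating', ['user_id', 'movie_id', 'rating', 'timestamp'])

def parse_rating(line):
    """Parse one 'UserID::MovieID::Rating::Timestamp' record into a Rating."""
    user_id, movie_id, rating, timestamp = line.strip().split('::')
    # Assumption: every field is an integer, as in the sample rows
    return Rating(int(user_id), int(movie_id), int(rating), int(timestamp))

sample = [
    '1::1193::5::978300760',
    '1::661::3::978302109',
    '1::914::3::978301968',
]
ratings = [parse_rating(line) for line in sample]
print(ratings[0])
```

The same function could be applied line by line while iterating over `ratings.dat`, which avoids holding the raw text and the parsed records in memory at the same time.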

---
## NOTE 1

The approach used in this notebook is a tactical solution:

- The DSX storage space is limited to a few GB per user.
- Lab Services is open sourcing a Spark WebHDFS connector that will read WebHDFS data directly into a Spark DataFrame.<br>See https://issues.apache.org/jira/browse/BAHIR-67 for more information.
- Future offerings of DSX and BigInsights will have tighter integration between DSX and BigInsights Spark.<br>See https://datascix.uservoice.com/forums/387207-general/suggestions/16274593-integrate-with-biginsights.
- We are reading all of the data into memory in the notebook, which will not scale.<br>There is a pull request on the pywebhdfs library to fix this: https://github.com/pywebhdfs/pywebhdfs/pull/46.
- A similar approach could be coded by hand using Python's requests library, streaming the file to disk in chunks:

```
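import datetime
import requests

# hostname, username and password come from the credentials file read earlier;
# webhdfs_filepath and local_filepath must be set before running this
chunk_size = 200000000  # read in 200 MB chunks
url = "https://{0}:8443/gateway/default/webhdfs/v1/{1}?op=OPEN".format(hostname, webhdfs_filepath)
r = requests.get(url, auth=(username, password), verify=True, allow_redirects=True, stream=True)
r.raise_for_status()  # stop early on an HTTP error instead of saving an error page

chunk_num = 1
with open(local_filepath, 'wb') as f:
    for chunk in r.iter_content(chunk_size):
        if chunk:  # filter out keep-alive chunks
            print('{0} writing chunk {1}'.format(datetime.datetime.now(), chunk_num))
            f.write(chunk)
            chunk_num += 1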
```