<img style="float: left"  src="images/hdfs.png">
<img style="float: right" src="images/surfsara.png">

<hr style="clear: both" />

# HDFS introduction

Below are a number of exercises in Python and some in Shell. Press Shift-Enter to execute a cell. Press Tab for code completion.

In this notebook we will start with some HDFS basics:

1. Basic file operations using [snakebite](https://github.com/spotify/snakebite)
2. File operations using the hdfs shell application
3. Reading from a file on HDFS

## HDFS at SURFsara

The Hadoop cluster at SURFsara is configured as an HDFS cluster. It currently consists of 198 machines, each offering part of its local disk space to the distributed file system. The configured capacity is 2.26 PB. Taking the default replication into account, there is room for 753 TB of data. The system hosts around 28 million files and directories for various users.
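
The usable capacity quoted above follows from dividing the raw capacity by the replication factor. The quick check below is a sketch that assumes the HDFS default replication factor of 3 and decimal units (1 PB = 1000 TB); neither assumption is stated explicitly in the text.

```python
# Raw capacity as configured on the cluster, in petabytes (figure from the text).
raw_capacity_pb = 2.26
# Assumption: the HDFS default replication factor of 3.
replication_factor = 3

# Every block is stored replication_factor times, so usable space shrinks by that factor.
usable_tb = raw_capacity_pb * 1000 / replication_factor
print("Usable capacity: about %d TB" % usable_tb)
```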

To accommodate many users, the Hadoop cluster at SURFsara is secured by [Kerberos](https://en.wikipedia.org/wiki/Kerberos_(protocol)). To make use of any cluster service we first need to authenticate. The notebook environment is preconfigured with credentials that we only need to initialize. Execute the next cell to do this. The exclamation mark in the cell instructs Jupyter to execute that line as a shell command instead of Python code.

In [None]:
! kinit.sh
! echo "My user name on the cluster is:" `klist | egrep -o 'hadws[0-9]+'`
temp = !klist | egrep -o 'hadws[0-9]+'
cluster_user = temp[0]

If all went well you should see some output listing your configured user (note that it should match the first part of the notebook URL). This username has been assigned to the `cluster_user` variable accessible from this notebook.
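
The same extraction can also be done in Python instead of `egrep`, which makes it easy to handle the case where no ticket is present. This is a minimal sketch: the `sample_klist_output` text and the `extract_cluster_user` helper are made up for illustration, and real `klist` output will look different.

```python
import re

# Hypothetical klist output, invented for illustration only.
sample_klist_output = """Ticket cache: FILE:/tmp/krb5cc_1000
Default principal: hadws42@EXAMPLE.ORG
"""

def extract_cluster_user(klist_output):
    """Return the first 'hadws<number>' user name found, or None if there is none."""
    match = re.search(r"hadws[0-9]+", klist_output)
    return match.group(0) if match else None

print(extract_cluster_user(sample_klist_output))
```

Returning `None` rather than indexing into an empty result avoids an `IndexError` when authentication has failed.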

## Basic HDFS operations

Snakebite is a Python library that provides a pure Python HDFS client. The client uses protobuf to communicate with the NameNode and comes as both a library and a command line interface. Currently, the snakebite client supports most actions that involve the NameNode, as well as reading data from DataNodes. Writing data is not supported; it can be done with either the HDFS Java API or hdfs shell commands. The former is beyond the scope of this tutorial; the latter will be used in an exercise in this notebook.
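
Because snakebite cannot write, one way to upload a file from Python is to call the hdfs shell through `subprocess`. The sketch below assumes the `hdfs` command is on the `PATH` and that a valid Kerberos ticket exists; `build_put_command`, `put_file` and the file name `notes.txt` are hypothetical names chosen for this example.

```python
import subprocess

def build_put_command(local_path, hdfs_path, overwrite=False):
    """Build the argument list for `hdfs dfs -put`."""
    command = ["hdfs", "dfs", "-put"]
    if overwrite:
        command.append("-f")  # -f replaces the destination if it already exists
    command.extend([local_path, hdfs_path])
    return command

def put_file(local_path, hdfs_path, overwrite=False):
    """Upload a local file; raises CalledProcessError if the hdfs command fails."""
    subprocess.check_call(build_put_command(local_path, hdfs_path, overwrite))

# Only print the command here; running it requires the hdfs client and a ticket.
print(" ".join(build_put_command("notes.txt", "notes.txt")))
```

On the cluster, `put_file("notes.txt", "notes.txt")` would copy the local file to a relative HDFS path, which resolves to the user's HDFS home directory.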

First, initialize a client for HDFS. For more detail on the snakebite API, see the [snakebite documentation](http://snakebite.readthedocs.org/en/latest/).


In [None]:
from snakebite.client import AutoConfigClient
client = AutoConfigClient()

The client object exposes various methods. Let's start with a listing of directory contents using the `ls` method. It takes a list of HDFS paths as its argument. We will list a public data directory: `/data/public/hadws`.

In [None]:
import pprint
pp = pprint.PrettyPrinter(indent=2)
for i in client.ls(["/data/public/hadws"]):
    pp.pprint(i)

snakebite returns the result of the listing as a generator of dicts, one per entry. Note that many properties that are available on regular file systems, such as size, path, owner and permissions, are present on HDFS as well. One property that is less common is `block_replication`. This factor denotes how many copies of the file's blocks are stored on the cluster file system. List the path and the replication factor:

In [None]:
# Print each path together with its replication factor.
for i in client.ls(["/data/public/hadws"]):
    print(i["path"] + " " + str(i["block_replication"]))
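
The replication factor also determines how much raw disk space a file occupies. The sketch below adds up the logical size and the replicated size of a listing. It runs on made-up entries so that it works without a cluster, and it assumes that the listing dicts carry `length` and `file_type` keys next to `path` and `block_replication`; `storage_footprint` is a hypothetical helper.

```python
# Made-up entries that mimic the dicts from a listing; only the keys used here are included.
sample_listing = [
    {"path": "/data/example/a.txt", "file_type": "f", "length": 1000, "block_replication": 3},
    {"path": "/data/example/b.txt", "file_type": "f", "length": 500, "block_replication": 2},
    {"path": "/data/example/dir", "file_type": "d", "length": 0, "block_replication": 0},
]

def storage_footprint(entries):
    """Return (logical_bytes, raw_bytes) summed over the regular files in entries."""
    logical = 0
    raw = 0
    for entry in entries:
        if entry["file_type"] != "f":
            continue  # skip directories, which hold no blocks themselves
        logical += entry["length"]
        raw += entry["length"] * entry["block_replication"]
    return logical, raw

logical_bytes, raw_bytes = storage_footprint(sample_listing)
print("logical: %d bytes, raw: %d bytes" % (logical_bytes, raw_bytes))
```

With a live client, the same function could be given `client.ls(["/data/public/hadws"])` directly, since it only iterates over the entries once.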

Let's proceed to read a file. Both snakebite and the HDFS command line support a `text` operation, which reads data from HDFS and converts it to text (note that this is not possible for every data format). Use the `text` method to print alice.txt.

In [None]:
# text() takes a list of paths and yields the file contents as text.
text = client.text(["/data/public/hadws/alice.txt"])
for i in text:
    print(i)

Using the hdfs command line program for this is very similar:

In [None]:
! hdfs dfs -text /data/public/hadws/alice.txt
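
Once the text has been read, it can be processed like any other Python string. The sketch below counts the most frequent words. It assumes only that the reader yields strings containing the file's text, and it runs on made-up chunks so that no cluster access is needed; `top_words` and `sample_chunks` are hypothetical names.

```python
from collections import Counter

def top_words(chunks, n=2):
    """Return the n most common lowercase words in an iterable of text chunks."""
    # Join first, so a word split across two chunks is still counted as one word.
    text = "".join(chunks)
    words = text.lower().split()
    return Counter(words).most_common(n)

# Made-up input standing in for the strings read from HDFS.
sample_chunks = ["the cat and ", "the dog and the bird"]
print(top_words(sample_chunks))
```

Joining all chunks keeps the example simple but holds the whole file in memory, which is fine for a small text such as alice.txt and not for very large files.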

In [None]:
import json
# Print each listing entry as indented JSON.
for i in client.ls(["/data/public/hadws"]):
    print(json.dumps(i, indent=2))

In [None]:
# snakebite cannot write to HDFS, so upload the file with the hdfs shell instead.
! hdfs dfs -put /etc/passwd foobar

In [None]:
# Show the blocks of a file and where their replicas are stored.
! hdfs fsck /data/public/hadws/alice.txt -files -blocks -locations -racks