<img style="float: left"  src="images/hdfs.png">
<img style="float: right" src="images/surfsara.png">

<hr style="clear: both" />

# HDFS introduction

Below are a number of exercises in Python and some in shell. Press Shift-Enter to execute the code in a cell. You can trigger code completion with the Tab key.

In this notebook we will start with some HDFS basics.

## HDFS at SURFsara

The Hadoop cluster at SURFsara is configured as an HDFS cluster. It currently consists of 198 machines, each offering part of its local disk space to the distributed file system. The configured capacity is 2.26 PB. Taking the default replication factor of 3 into account, there is room for about 753 TB of data. The system hosts around 28 million files and directories for various users.
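
The usable-capacity figure follows from the replication factor. Here is a minimal sketch of that arithmetic, assuming the HDFS default replication factor of 3 and decimal units (1 PB = 1000 TB). The variable names are illustrative only.

```python
# Raw capacity reported for the cluster, in petabytes (figure from the text).
configured_capacity_pb = 2.26

# Assumption: the HDFS default replication factor of 3 is in effect.
replication_factor = 3

# Every block is stored replication_factor times, so usable space is raw / factor.
usable_tb = configured_capacity_pb * 1000 / replication_factor

print("Usable capacity: about %d TB" % round(usable_tb))
```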

To accommodate many users, the Hadoop cluster at SURFsara is secured by [Kerberos](https://en.wikipedia.org/wiki/Kerberos_(protocol)). To make use of any cluster service we first need to authenticate. The notebook environment is preconfigured with credentials that we only need to initialize. Execute the next cell to do this. The exclamation mark in the cell instructs Jupyter to execute that line as a shell command instead of Python code.

In [None]:
import os
! kinit.sh
cluster_user = !klist | egrep -o 'hadws[0-9]+'
cluster_user = str(cluster_user[0])
print("My user name on the cluster is %s" % cluster_user)
os.environ['CLUSTER_USER'] = cluster_user

If all went well you should see some output listing your configured user (note that it should match the first part of the notebook URL). This username has been assigned to the `cluster_user` variable accessible from this notebook and to the `$CLUSTER_USER` shell environment variable.
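
The `egrep -o 'hadws[0-9]+'` step in the cell above simply pulls the first matching user name out of the `klist` output. The same extraction can be sketched in plain Python. The sample `klist` text and the helper name `find_cluster_user` are hypothetical, and the real output on the cluster will look different.

```python
import re

# Hypothetical klist output; the real principal and realm will differ.
sample_klist_output = """Ticket cache: FILE:/tmp/krb5cc_1000
Default principal: hadws42@EXAMPLE.ORG
"""

def find_cluster_user(klist_output):
    """Return the first 'hadws<digits>' name found in the text, or None."""
    match = re.search(r"hadws[0-9]+", klist_output)
    return match.group(0) if match else None

print(find_cluster_user(sample_klist_output))
```

Returning `None` when nothing matches makes a failed authentication easier to spot than indexing into an empty list.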

## Basic HDFS operations

Snakebite is a Python library that provides a pure Python HDFS client. The client uses protobuf to communicate with the Namenode and comes as both a library and a command-line interface. Currently, the snakebite client supports most actions that involve the Namenode, as well as reading data from Datanodes. Note that writing data is not supported. Writing can be done with either the HDFS Java API or the `hdfs` shell commands. The former is beyond the scope of this tutorial; the latter will be used in the exercises in this notebook.

First initialize a client for HDFS. For more details on the snakebite API, see the [snakebite documentation](http://snakebite.readthedocs.org/en/latest/).


In [None]:
from snakebite.client import AutoConfigClient
client = AutoConfigClient()

The client object exposes various methods. Let's start by listing the contents of a directory with the `ls` method. The method takes a list of HDFS paths as its argument. We will list a public data directory: `/data/public/hadws`

In [None]:
import pprint
pp = pprint.PrettyPrinter(indent=2)
for i in client.ls(["/data/public/hadws"]):
    pp.pprint(i)

snakebite returns the listing as a generator that yields one dict per entry. Note that many properties found on regular file systems, such as size, path, owner and permissions, are present on HDFS as well. Something that is less common, though, is the `block_replication` factor. This factor denotes how many copies of each block of the file are stored on the cluster file system. List the path and the replication factor:

In [None]:
for i in client.ls(["/data/public/hadws"]):
    print(i["path"] + " " +  str(i["block_replication"]))
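
To experiment with the listing entries without a cluster connection, the sketch below works on hand-written dicts. The entries are hypothetical: they only mimic the shape of what `client.ls()` yields, the values are made up, and the `length` key (file size in bytes) is an assumption about the snakebite result format.

```python
# Hypothetical entries shaped like the dicts yielded by client.ls();
# only a few keys are included and all values are invented.
sample_listing = [
    {"path": "/data/public/hadws/alice.txt", "length": 167518, "block_replication": 3},
    {"path": "/data/public/hadws/other.txt", "length": 1024, "block_replication": 3},
]

def summarize(entry):
    """Format one entry as 'path  replication=N  raw_bytes=M'."""
    # Raw bytes = logical size times the number of copies kept by HDFS.
    raw_bytes = entry["length"] * entry["block_replication"]
    return "%s  replication=%d  raw_bytes=%d" % (
        entry["path"], entry["block_replication"], raw_bytes)

for entry in sample_listing:
    print(summarize(entry))
```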

Let's proceed to read a file. Both snakebite and the HDFS command line support a `text` operation in which data is read from HDFS and converted to text (note that this is not possible for all data formats). Use the `text` method to print `alice.txt`.

In [None]:
text = client.text(["/data/public/hadws/alice.txt"])
for i in text:
    print(i)

Using the `hdfs` command-line program for this is very similar:

In [None]:
! hdfs dfs -text /data/public/hadws/alice.txt

The `/data/public/hadws/` directory contains the data files that will be used in subsequent notebooks. As an exercise, copy them to the `/user/$CLUSTER_USER` directory. Snakebite does not offer a copy method, so we will do this using the [`hdfs dfs`](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html#cp) shell commands. Note that the reference manual uses `hadoop fs`, which is a synonym for `hdfs dfs`. Please use the latter to copy all data files to the directory of your `$CLUSTER_USER`.

In [None]:
! hdfs dfs -ls /user/"$CLUSTER_USER"
! hdfs dfs -cp /data/public/hadws/* /user/"$CLUSTER_USER"
! hdfs dfs -ls /user/"$CLUSTER_USER"

Next, make a subdirectory named `tmp` in `/user/$CLUSTER_USER`. Copy `alice.txt` to this subdirectory under the new name `wonderland.txt`, then recursively list the `/user/$CLUSTER_USER` directory.

In [None]:
! hdfs dfs -mkdir /user/"$CLUSTER_USER"/tmp
! hdfs dfs -cp /user/"$CLUSTER_USER"/alice.txt /user/"$CLUSTER_USER"/tmp/wonderland.txt
! hdfs dfs -ls -R /user/"$CLUSTER_USER"

Remember that HDFS stores your files in blocks that are replicated across the cluster. There is a command that can show you information about the physical location of those blocks. Execute the next cell to see where the blocks for the `/user/$CLUSTER_USER/tmp/wonderland.txt` are located.

In [None]:
!hdfs fsck /user/"$CLUSTER_USER"/tmp/wonderland.txt -files -blocks -locations -racks

The `hdfs fsck` output should list the IP addresses and rack names of the machines that hold the block (e.g. /S43/145.100.41.61:1004). Next, increase the replication factor of the `/user/$CLUSTER_USER/tmp/wonderland.txt` file to 10 by using the `hdfs dfs -setrep` command:

In [None]:
! hdfs dfs -setrep 10 /user/"$CLUSTER_USER"/tmp/wonderland.txt

Think about what happened when you increased the replication before using `hdfs fsck` again to check the results. What effects will increasing the replication have for processing the data and fault tolerance?
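
One way to reason about this question is to put numbers on it. The sketch below is a rough estimate, not a description of the cluster's actual configuration: the file size is hypothetical, the 128 MB block size is the Hadoop 2 default and may differ here, and `replication_footprint` is an illustrative helper.

```python
import math

def replication_footprint(file_bytes, replication, block_bytes=128 * 1024 * 1024):
    """Return (blocks, replicas, raw_bytes, tolerated_losses) for one file."""
    # A file occupies ceil(size / block size) blocks; the last block may be partial.
    blocks = int(math.ceil(file_bytes / float(block_bytes))) if file_bytes else 0
    replicas = blocks * replication          # block copies spread over the cluster
    raw_bytes = file_bytes * replication     # disk space consumed in total
    tolerated_losses = replication - 1       # copies of a block that may be lost
    return blocks, replicas, raw_bytes, tolerated_losses

file_bytes = 150 * 1024  # hypothetical file size
for replication in (3, 10):
    print("replication=%d -> %s" % (replication, replication_footprint(file_bytes, replication)))
```

More replicas cost proportionally more disk space, but they also give the scheduler more machines on which the data can be read locally.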

In [None]:
!hdfs fsck /user/"$CLUSTER_USER"/tmp/wonderland.txt -files -blocks -locations -racks

Was the output what you expected? Chances are that the system has not yet created all replicas, in which case you will see a message about under-replicated blocks. HDFS block operations are not executed with high priority, which minimizes peaks in overall network usage.
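
To check how far replication has progressed, you can count the locations reported for a block and group them by rack. The sketch below parses location strings of the form `/rack/ip:port`, as in the example above. The sample line is hypothetical (only the first location comes from this notebook), and the exact layout of real `hdfs fsck` output varies between Hadoop versions.

```python
import re
from collections import defaultdict

# Hypothetical fragment of fsck output; all but the first location are invented.
sample_locations = "[/S43/145.100.41.61:1004, /S43/145.100.41.62:1004, /S44/145.100.42.10:1004]"

# Matches /<rack>/<IPv4 address>:<port>
LOCATION = re.compile(r"/([^/\s,\[\]]+)/(\d{1,3}(?:\.\d{1,3}){3}):(\d+)")

def replicas_per_rack(text):
    """Map each rack name to the list of datanode IPs holding a replica."""
    racks = defaultdict(list)
    for rack, ip, _port in LOCATION.findall(text):
        racks[rack].append(ip)
    return dict(racks)

racks = replicas_per_rack(sample_locations)
found = sum(len(ips) for ips in racks.values())
target = 10  # replication factor requested with -setrep

print("replicas found: %d of %d" % (found, target))
if found < target:
    print("block is still under-replicated")
```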

Finally, download `/user/$CLUSTER_USER/tmp/wonderland.txt` to the machine hosting the notebook and list the current directory of the notebook environment. Then delete the `/user/$CLUSTER_USER/tmp/wonderland.txt` file and the `/user/$CLUSTER_USER/tmp/` directory from HDFS. List your HDFS home directory at the end.

In [None]:
! hdfs dfs -get /user/"$CLUSTER_USER"/tmp/wonderland.txt .
! ls -lah
! hdfs dfs -rm -R /user/"$CLUSTER_USER"/tmp
! hdfs dfs -ls