<img style="float: left"  src="images/hdfs.png">
<img style="float: right" src="images/surfsara.png">

<hr style="clear: both" />

# HDFS introduction

Below are number of exercises in Python and some in Shell. Press Shift-Enter to execute the code. You can use code completion by using tab.

In this notebook we will start with some HDFS basics.

## HDFS at SURFsara

The Hadoop cluster at SURFsara is configured as HDFS cluster. It currently consists of 198 machines each offering part of their local disk space to the distributed file system. The configured capacity is 2.26 PB. Taking default replication into account there is room for 753 TB of data. The system hosts around 28 million files and directories for various users. 

To accomodate many users the Hadoop cluster at SURFsara is secured by [Kerberos](https://en.wikipedia.org/wiki/Kerberos_(protocol)). In order to make use of any cluster services we will first need to authenticate. The notebook environment is preconfigured with credentials that we only need to initialize. Execute the next cell to do this. The exclamation mark in the cell instructs Jupyter to execute that line as a shell command instead of Python code. 

In [62]:
! kinit.sh
! echo "My user name on the cluster is:" `klist | egrep -o 'hadws[0-9]+'`
! export CLUSTER_USER=`klist | egrep -o 'hadws[0-9]+'`
temp = !klist | egrep -o 'hadws[0-9]+'
% env CLUSTER_USER=temp

My user name on the cluster is: hadws29
env: CLUSTER_USER=temp


If all went well you should see some output listing your configured user (note that it should match the first part of the notebook URL). This username has been assigned to the `cluster_user` variable accessible from this notebook.

## Basic HDFS operations

Snakebite is a python library that provides a pure python HDFS client. The client uses protobuf for communicating with the NameNode and comes in the form of a library and a command line interface. Currently, the snakebite client supports most actions that involve the Namenode and reading data from DataNodes. Note that writing data is currently not supported. Writing can be achieved by using either the HDFS Java API or hdfs shell commands. The former is beyond the scope of this tutorial the latter will be used in exercises in this notebook. 

First initialize a client for HDFS. For more detail concerning the snakebite API please see the [snakebite documentation](http://snakebite.readthedocs.org/en/latest/)


In [None]:
from snakebite.client import AutoConfigClient
client = AutoConfigClient()

The client object exposes various methods. Let's start with a listing of directory contents using the `ls` function. The function takes as argument a path on HDFS. We will list a public data directory: `/data/public/hadws`

In [None]:
import pprint
pp = pprint.PrettyPrinter(indent=2)
for i in client.ls(["/data/public/hadws"]):
    pp.pprint(i)

snakebite returns the result of the listing as an array of dicts. Note that many properties that are available on regular file systems, such as size, path, owner and permissions are present on HDFS as well. Something that is not very common though is the `block_replication` factor. This factor denotes how many times the file is present on the cluster  file system. List the path and the replication factor:

In [None]:
import pprint
pp = pprint.PrettyPrinter(indent=2)
for i in client.ls(["/data/public/hadws"]):
    print(i["path"] + " " +  str(i["block_replication"]))

Let's proceed to read a file. Both snakebite and the HDFS command line support a text operation where data is read from HDFS and converted to text (note that this is not always possible for all data formats). Use the text method to print alice.txt

In [None]:
text = client.text(["/data/public/hadws/alice.txt"])
for i in text:
    print i

Using the hdfs command line program for this is very similar:

In [None]:
! hdfs dfs -text /data/public/hadws/alice.txt

The `/data/public/hadws/`directory contains the data files that will be used in subsequent notebooks. As an exercise you are required to copy them to the `/user` directory. Snakebite does not offer any copy method so we will do this using the [`hdfs dfs`](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html#cp) shell commands. Note that the reference manual uses `hadoop fs` this is a synonym for `hdfs dfs` - please use the latter to copy all data files to the directory of your `cluster_user`. 

In [61]:
print cluster_user
! hdfs dfs -ls /user/"$CLUSTER_USER"

hadws29
Found 205 items
drwx------   - TUD-DS01  hdfs             0 2015-07-09 11:14 /user/TUD-DS01
drwx------   - TUD-DS02  hdfs             0 2015-09-11 20:43 /user/TUD-DS02
drwx------   - TUD-DS03  hdfs             0 2015-08-28 12:26 /user/TUD-DS03
drwx------   - TUD-DS04  hdfs             0 2015-07-31 15:03 /user/TUD-DS04
drwx------   - TUD-DS05  hdfs             0 2015-07-20 02:06 /user/TUD-DS05
drwx------   - TUD-DS06  hdfs             0 2015-09-25 09:48 /user/TUD-DS06
drwx------   - TUD-DS07  hdfs             0 2015-06-17 15:24 /user/TUD-DS07
drwx------   - TUD-DS08  hdfs             0 2015-06-17 15:25 /user/TUD-DS08
drwx------   - TUD-DS09  hdfs             0 2015-06-17 15:25 /user/TUD-DS09
drwx------   - TUD-DS10  hdfs             0 2015-06-17 15:29 /user/TUD-DS10
drwx------   - TUD-DS11  hdfs             0 2015-06-17 15:29 /user/TUD-DS11
drwx------   - TUD-DS12  hdfs             0 2015-06-17 15:30 /user/TUD-DS12
drwx------   - TUD-DS13  hdfs             0 2015-06-17 15:30 /us

<hr style="clear: both" />

In [None]:
import json
for i in client.ls(["/data/public/hadws"]):
    print(i, indent=2))

In [None]:
client.put("/etc/passwd", "foobar")

In [None]:
!hdfs fsck -files -blocks -locations -racks

In [None]:
Copy alice naar /user/hadws