In [None]:
%load_ext dockermagic

# HDFS

## HDFS - Web Interface

- Master node
    - NameNode: http://localhost:9870
    - Secondary NameNode: http://localhost:9868
- Worker nodes
    - hadoop1
        - DataNode: http://localhost:9864
    - hadoop2
        - DataNode: http://localhost:9865
    - hadoop3
        - DataNode: http://localhost:9866

## HDFS - CLI

In [None]:
%%dockerexec hadoop

source /opt/envvars.sh
hdfs --help

## Filesystem Basic Commands

- https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/FileSystemShell.html

Download books from Project Gutenberg (http://www.gutenberg.org/):

- Moby Dick; Or, The Whale by Herman Melville
- Pride and Prejudice by Jane Austen
- Dracula by Bram Stoker

In [None]:
%%dockerexec hadoop

source /opt/envvars.sh
# -p: do not fail if the directory already exists (cell can be re-run)
mkdir -p /opt/datasets
cd /opt/datasets

wget -qc http://www.gutenberg.org/files/2701/2701-0.txt -O mobydick.txt
wget -qc http://www.gutenberg.org/files/1342/1342-0.txt -O prideandprejudice.txt
wget -qc http://www.gutenberg.org/cache/epub/345/pg345.txt -O dracula.txt

ls /opt/datasets

### Create gutenberg folder in HDFS

In [None]:
%%dockerexec hadoop

source /opt/envvars.sh

cd /opt/datasets

# -p: create missing parent directories and do not fail if the folder exists
hdfs dfs -mkdir -p /user/hadoop/gutenberg

### Copy books to HDFS

In [None]:
%%dockerexec hadoop

source /opt/envvars.sh

cd /opt/datasets

hdfs dfs -put * /user/hadoop/gutenberg
# hdfs dfs -copyFromLocal * /user/hadoop/gutenberg

### List files in HDFS

In [None]:
%%dockerexec hadoop

source /opt/envvars.sh

cd /opt/datasets

hdfs dfs -ls /user/hadoop/gutenberg

### Show the first/last kilobyte of a file

In [None]:
%%dockerexec hadoop

source /opt/envvars.sh

cd /opt/datasets

# hdfs dfs -head /user/hadoop/gutenberg/mobydick.txt
hdfs dfs -tail /user/hadoop/gutenberg/prideandprejudice.txt

### Show whole file - CAREFUL (prints the entire file to the cell output)

In [None]:
%%dockerexec hadoop

source /opt/envvars.sh

cd /opt/datasets

hdfs dfs -cat /user/hadoop/gutenberg/dracula.txt

### Append local file contents to a file in HDFS

In [None]:
%%dockerexec hadoop

source /opt/envvars.sh

cd /opt/datasets

hdfs dfs -appendToFile mobydick.txt prideandprejudice.txt dracula.txt /user/hadoop/allbooks.txt

### Copy allbooks.txt (in HDFS) to gutenberg directory (in HDFS)
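
HDFS resolves a relative path such as allbooks.txt against the user's HDFS home directory, which is why the relative and absolute forms used in this notebook refer to the same file. A minimal sketch of that rule, assuming the default home prefix /user and the HDFS user hadoop (the function name resolve_hdfs_path is hypothetical, not an hdfs command):

```shell
# Resolve an HDFS path the way the shell client is assumed to:
# absolute paths are kept, relative ones are placed under /user/<user>.
resolve_hdfs_path() {
  local user=$1 path=$2
  case "$path" in
    /*) printf '%s\n' "$path" ;;
    *)  printf '/user/%s/%s\n' "$user" "$path" ;;
  esac
}

resolve_hdfs_path hadoop allbooks.txt              # /user/hadoop/allbooks.txt
resolve_hdfs_path hadoop gutenberg/allbooks.txt    # /user/hadoop/gutenberg/allbooks.txt
resolve_hdfs_path hadoop /user/hadoop/gutenberg    # unchanged
```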


In [None]:
%%dockerexec hadoop

source /opt/envvars.sh

cd /opt/datasets

hdfs dfs -cp allbooks.txt /user/hadoop/gutenberg
hdfs dfs -ls -h -R

### Copy allbooks.txt to local filesystem

In [None]:
%%dockerexec hadoop

source /opt/envvars.sh

cd /opt/datasets

hdfs dfs -get allbooks.txt .
# hdfs dfs -copyToLocal /user/hadoop/allbooks.txt .
ls -l allbooks.txt
rm allbooks.txt

### Remove file in HDFS

In [None]:
%%dockerexec hadoop

source /opt/envvars.sh

cd /opt/datasets

hdfs dfs -rm allbooks.txt
# hdfs dfs -rm /user/hadoop/allbooks.txt


### Move file in HDFS (also used for renaming)

In [None]:
%%dockerexec hadoop

source /opt/envvars.sh

cd /opt/datasets

hdfs dfs -mv gutenberg/allbooks.txt gutenberg/books.txt

### Print statistics on folder

In [None]:
%%dockerexec hadoop

source /opt/envvars.sh

cd /opt/datasets

# tab-separated header and rows ($'...' makes bash expand \t into real tabs)
printf "name\ttype\tsize\treps\n"
hdfs dfs -stat $'%n\t%F\t%b\t%r' /user/hadoop/gutenberg/*

### Get several files from HDFS and merge to a single local file

In [None]:
%%dockerexec hadoop

source /opt/envvars.sh

cd /opt/datasets

hdfs dfs -getmerge /user/hadoop/gutenberg mergebooks.txt
ls -l mergebooks.txt
rm mergebooks.txt

### Remove directory and files (-R recursive)

In [None]:
%%dockerexec hadoop

source /opt/envvars.sh

cd /opt/datasets

hdfs dfs -rm -R /user/hadoop/gutenberg

## Using HDFS in a MapReduce job

### Copy files to HDFS

In [None]:
%%dockerexec hadoop

source /opt/envvars.sh
cd /opt/datasets

hdfs dfs -mkdir -p /user/hadoop/gutenberg
hdfs dfs -put mobydick.txt prideandprejudice.txt dracula.txt /user/hadoop/gutenberg

### Run MapReduce application specifying HDFS folders for input and output files

In [None]:
%%dockerexec hadoop

source /opt/envvars.sh
cd /opt/hadoop/share/hadoop/mapreduce

# run wordcount application
hadoop jar ./hadoop-mapreduce-examples-$HADOOP_VERSION.jar wordcount \
/user/hadoop/gutenberg /user/hadoop/gutenberg-output
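
To get a feel for what the job computes, the same counting can be approximated locally on a small input. This is a sketch with plain Unix tools, not the MapReduce implementation. It assumes tokens are split on whitespace and results are written as word, tab, count, sorted by word in byte order. The function name local_wordcount is hypothetical.

```shell
# Count whitespace-separated tokens on stdin and print "token<TAB>count",
# sorted by token in byte order.
local_wordcount() {
  tr -s '[:space:]' '\n' \
    | grep -v '^$' \
    | LC_ALL=C sort \
    | uniq -c \
    | awk '{ print $2 "\t" $1 }'
}

# Tiny made-up input, only for illustration
printf 'the whale the sea\nthe ship\n' | local_wordcount
```

Running it on a few lines of one book and comparing against the matching entries in part-r-00000 can serve as a quick sanity check of the job output.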

### Show output files

In [None]:
%%dockerexec hadoop

source /opt/envvars.sh

hdfs dfs -ls /user/hadoop/gutenberg-output
hdfs dfs -head /user/hadoop/gutenberg-output/part-r-00000

### Copy an HDFS file to the local filesystem

In [None]:
%%dockerexec hadoop

source /opt/envvars.sh

cd /tmp

hdfs dfs -get /user/hadoop/gutenberg-output/part-r-00000 gutenberg-output.txt
head /tmp/gutenberg-output.txt

### Remove output folder on HDFS

In [None]:
%%dockerexec hadoop

source /opt/envvars.sh

hdfs dfs -rm -R /user/hadoop/gutenberg-output

### Running MapReduce with 2 reduce tasks (-Dmapreduce.job.reduces=2)

In [None]:
%%dockerexec hadoop

source /opt/envvars.sh

cd /opt/hadoop/share/hadoop/mapreduce

# run wordcount application with 2 reducers
hadoop jar ./hadoop-mapreduce-examples-$HADOOP_VERSION.jar wordcount \
-Dmapreduce.job.reduces=2 \
/user/hadoop/gutenberg /user/hadoop/gutenberg-output

### List output folder contents
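
Each reduce task writes its own output file, so a job with two reducers is expected to leave two part-r-NNNNN files (plus a _SUCCESS marker) in the output folder. A small sketch that prints the expected names for a given reducer count follows. The helper name expected_part_files is hypothetical.

```shell
# Print the output file names expected from a job with the given
# number of reduce tasks (one part-r-NNNNN file per reducer).
expected_part_files() {
  local reducers=$1 i=0
  while [ "$i" -lt "$reducers" ]; do
    printf 'part-r-%05d\n' "$i"
    i=$(( i + 1 ))
  done
}

expected_part_files 2
```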

In [None]:
%%dockerexec hadoop

source /opt/envvars.sh

hdfs dfs -ls /user/hadoop/gutenberg-output

### Merge HDFS output files into a single local file

In [None]:
%%dockerexec hadoop

source /opt/envvars.sh

cd /tmp

hdfs dfs -getmerge /user/hadoop/gutenberg-output gutenberg-output.txt
head /tmp/gutenberg-output.txt

# remove output folder
hdfs dfs -rm -R /user/hadoop/gutenberg-output

## Advanced Commands

- https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html

### Verify HDFS cluster status

In [None]:
%%dockerexec hadoop

source /opt/envvars.sh

# print topology
hdfs dfsadmin -printTopology

printf "\n%40s\n\n" |tr " " "="

hdfs dfsadmin -report

### Replication factor
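
As a rough illustration of what the replication factor costs, the sketch below estimates how many blocks a file occupies and how many raw bytes it consumes across the cluster. It assumes a block size of 128 MB (the Hadoop default for dfs.blocksize) and a made-up file size of 1,000,000 bytes. The function names are hypothetical.

```shell
# Number of HDFS blocks needed for a file (ceiling division).
block_count() {
  local size_bytes=$1 block_bytes=$2
  echo $(( (size_bytes + block_bytes - 1) / block_bytes ))
}

# Raw bytes stored across the cluster for a given replication factor.
raw_usage() {
  local size_bytes=$1 replication=$2
  echo $(( size_bytes * replication ))
}

block_bytes=$(( 128 * 1024 * 1024 ))   # assumed dfs.blocksize
block_count 1000000 "$block_bytes"     # prints 1: the file fits in one block
raw_usage 1000000 3                    # prints 3000000
```

The fsck listings in the following cells show the actual block count and replica locations for each file.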

#### List folder block location

In [None]:
%%dockerexec hadoop

source /opt/envvars.sh

hdfs fsck /user/hadoop/gutenberg -files -blocks -locations

#### Change replication factor of all files in directory to 3

In [None]:
%%dockerexec hadoop

source /opt/envvars.sh

hdfs dfs -setrep 3 /user/hadoop/gutenberg

#### List folder block location

In [None]:
%%dockerexec hadoop

source /opt/envvars.sh

hdfs fsck /user/hadoop/gutenberg -files -blocks -locations

#### Change replication factor back to 2

In [None]:
%%dockerexec hadoop

source /opt/envvars.sh

hdfs dfs -setrep 2 /user/hadoop/gutenberg

#### List folder block location

In [None]:
%%dockerexec hadoop

source /opt/envvars.sh

hdfs fsck /user/hadoop/gutenberg -files -blocks -locations

### Decommission nodes

- Hosts to decommission are listed in the file referenced by dfs.hosts.exclude in hdfs-site.xml
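
The cells below overwrite the exclude file as a whole. When several hosts are involved it can be handy to add or remove single entries instead. The following is a sketch under the assumption that the exclude file is plain text with one hostname per line. The helper names exclude_host and include_host are hypothetical, and the demo uses a temporary file as a stand-in for /opt/hadoop/etc/hadoop/dfs.exclude.

```shell
# Add a host to the exclude file unless it is already listed.
exclude_host() {
  local file=$1 host=$2
  touch "$file"
  grep -qxF -- "$host" "$file" || printf '%s\n' "$host" >> "$file"
}

# Remove a host from the exclude file, keeping all other entries.
include_host() {
  local file=$1 host=$2 tmp
  tmp=$(mktemp)
  grep -vxF -- "$host" "$file" > "$tmp" || true   # grep exits 1 when nothing is left
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

exclude_file=$(mktemp)                  # stand-in for the real exclude file
exclude_host "$exclude_file" hadoop1
exclude_host "$exclude_file" hadoop1    # second call is a no-op
cat "$exclude_file"
include_host "$exclude_file" hadoop1
rm -f "$exclude_file"
```

As in the cells below, the NameNode only picks up a change after hdfs dfsadmin -refreshNodes has been run.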

In [None]:
%%dockerexec hadoop

source /opt/envvars.sh

# Decommissioning hadoop1
cat > /opt/hadoop/etc/hadoop/dfs.exclude << EOF
hadoop1
EOF

hdfs dfsadmin -refreshNodes

- **NameNode:** http://localhost:9870

#### Report HDFS status

In [None]:
%%dockerexec hadoop

source /opt/envvars.sh

hdfs dfsadmin -report

#### Recommission all nodes

In [None]:
%%dockerexec hadoop

source /opt/envvars.sh

cat > /opt/hadoop/etc/hadoop/dfs.exclude << EOF
EOF

hdfs dfsadmin -refreshNodes

#### Report HDFS status

In [None]:
%%dockerexec hadoop

source /opt/envvars.sh

hdfs dfsadmin -report

### Handling DataNode failures

- Timeouts defined in hdfs-site.xml
    - dfs.namenode.heartbeat.recheck-interval = 10000 ms (10 seconds)
    - dfs.heartbeat.interval = 3 seconds
- A DataNode is marked dead after: timeout = 2 x recheck-interval + 10 x heartbeat.interval
    - timeout = 2 x 10 s + 10 x 3 s = 50 seconds
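
The arithmetic above can be reproduced with a small helper. This is a sketch, with the hypothetical function name datanode_dead_timeout, that takes the recheck interval in milliseconds and the heartbeat interval in seconds, matching the units of the two properties.

```shell
# Seconds after which the NameNode considers a silent DataNode dead:
# 2 x recheck-interval + 10 x heartbeat interval.
# Arguments: recheck interval in milliseconds, heartbeat interval in seconds.
datanode_dead_timeout() {
  local recheck_ms=$1 heartbeat_s=$2
  echo $(( 2 * recheck_ms / 1000 + 10 * heartbeat_s ))
}

datanode_dead_timeout 10000 3     # values used in this cluster: prints 50
datanode_dead_timeout 300000 3    # Hadoop's shipped defaults: prints 630
```

With the shipped defaults a failed DataNode would only be reported dead after 10.5 minutes, which is why the recheck interval is lowered here. The two arguments can be taken from the hdfs getconf calls in the next cell, assuming they return plain numbers.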

In [None]:
%%dockerexec hadoop

source /opt/envvars.sh

# get dfs.namenode.heartbeat.recheck-interval
hdfs getconf -confKey dfs.namenode.heartbeat.recheck-interval

# get dfs.heartbeat.interval
hdfs getconf -confKey dfs.heartbeat.interval

#### Simulate node failure

In [None]:
%%dockerexec hadoop

source /opt/envvars.sh

ssh hadoop1 'kill -9 $(cat /tmp/hadoop-hadoop-datanode.pid)'

- **NameNode:** http://localhost:9870

In [None]:
%%dockerexec hadoop

source /opt/envvars.sh

hdfs dfsadmin -report

#### Restart DataNode

In [None]:
%%dockerexec hadoop

source /opt/envvars.sh

ssh hadoop1 /opt/hadoop/bin/hdfs --daemon start datanode

#### Refresh nodes

In [None]:
%%dockerexec hadoop

source /opt/envvars.sh

hdfs dfsadmin -refreshNodes
hdfs dfsadmin -report