# Hadoop Installation and Testing

> **WARNING: Executing this notebook can compromise your virtual machines. Do not execute any cell twice; if you run into problems, start again from the very first cell.**

## 0. Prerequisites

Load commands and files from `commands.py`.

In [3]:
%reload_ext autoreload
%autoreload 1
%aimport commands

Helper function used throughout this notebook:

In [4]:
def run_ssh(host, *command):
    """
    Execute the commands over SSH on host as the 'hadoop' user, printing progress information.

    :param host: IP address of the target machine
    :type host: string
    :param command: commands to execute, in order
    :type command: string
    """
    print('===== \x1b[31m' + 'Started on ' + host + '\x1b[0m =====')
    for cmd in command:
        print(cmd)
        !ssh hadoop@{host} {cmd}
    print('===== \x1b[31m' + 'Completed on ' + host + '\x1b[0m =====')
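
The `!ssh` shell escape only works inside IPython. As a rough sketch, assuming key-based SSH access for the `hadoop` user, the same helper could be built on `subprocess` for use in a plain script. The names `build_ssh_command` and `run_ssh_subprocess` are hypothetical and are not part of `commands.py`.

```python
import subprocess


def build_ssh_command(host, cmd, user='hadoop'):
    """Return the argv list that runs `cmd` on `host` over SSH."""
    return ['ssh', f'{user}@{host}', cmd]


def run_ssh_subprocess(host, *commands, runner=subprocess.run):
    """Run each command on `host` in order and return the exit codes.

    `runner` is injectable so the helper can be exercised without a network.
    """
    exit_codes = []
    for cmd in commands:
        print(cmd)
        result = runner(build_ssh_command(host, cmd))
        exit_codes.append(result.returncode)
    return exit_codes
```

Returning the exit codes makes it possible to stop early when a remote command fails, which the notebook version does not do.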

**a.** Populate the following dictionary with the IP addresses (as keys) and the hostnames (as values) of all your virtual machines.

In [5]:
VMS = {'172.16.0.225': 'hadoop-namenode',
       '172.16.0.221': 'hadoop-datanode-2', 
       '172.16.0.224': 'hadoop-datanode-3'}

**b.** Populate the following variable with the IP address of the virtual machine with the namenode role.

In [6]:
NAMENODE_IP = '172.16.0.225'

**c.** Populate the following variable with the IP addresses of the remaining virtual machines.

In [7]:
REMAINING_IPS = [
    '172.16.0.221',
    '172.16.0.224'
]
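
The list can also be derived from `VMS`, so that the two cannot drift apart. A small sketch, with the values copied from the cells above; `remaining_ips` is a hypothetical helper name:

```python
VMS = {'172.16.0.225': 'hadoop-namenode',
       '172.16.0.221': 'hadoop-datanode-2',
       '172.16.0.224': 'hadoop-datanode-3'}
NAMENODE_IP = '172.16.0.225'


def remaining_ips(vms, namenode_ip):
    """Return every IP in `vms` except the namenode's, in insertion order."""
    if namenode_ip not in vms:
        raise ValueError(f'{namenode_ip} is not a key of VMS')
    return [ip for ip in vms if ip != namenode_ip]


REMAINING_IPS = remaining_ips(VMS, NAMENODE_IP)
print(REMAINING_IPS)  # ['172.16.0.221', '172.16.0.224']
```

The explicit check catches a mistyped `NAMENODE_IP` before any remote command runs.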

## 1. Download and install Hadoop
---
**a.** Download [hadoop-3.1.3.tar.gz](https://archive.apache.org/dist/hadoop/common/hadoop-3.1.3/hadoop-3.1.3.tar.gz) to the home folder of each of your virtual machines.

In [8]:
for host in VMS:
    run_ssh(host, commands.WGET_CMD)

===== Started on 172.16.0.225 =====
wget --progress=bar:force -c -O /home/hadoop/hadoop.tar.gz https://archive.apache.org/dist/hadoop/common/hadoop-3.1.3/hadoop-3.1.3.tar.gz
--2020-04-16 22:14:35--  https://archive.apache.org/dist/hadoop/common/hadoop-3.1.3/hadoop-3.1.3.tar.gz
Resolving archive.apache.org (archive.apache.org)... 163.172.17.199
Connecting to archive.apache.org (archive.apache.org)|163.172.17.199|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 338075860 (322M) [application/x-gzip]
Saving to: ‘/home/hadoop/hadoop.tar.gz’


2020-04-16 22:14:43 (41.6 MB/s) - ‘/home/hadoop/hadoop.tar.gz’ saved [338075860/338075860]

===== Completed on 172.16.0.225 =====
===== Started on 172.16.0.221 =====
wget --progress=bar:force -c -O /home/hadoop/hadoop.tar.gz https://archive.apache.org/dist/hadoop/common/hadoop-3.1.3/hadoop-3.1.3.tar.gz
--2020-04-16 22:14:44--  https://archive.apache.org/dist/hadoop/common/hadoop-3.1.3/hadoop-3.1.3.tar

**b.** Decompress the Hadoop package into `/opt/hadoop` and remove the downloaded archive:

In [9]:
for host in VMS:
    run_ssh(host, commands.TAR_CMD, commands.RM_CMD)

===== Started on 172.16.0.225 =====
tar -xvf hadoop.tar.gz --directory=/opt/hadoop --exclude=hadoop-3.1.0/share/doc --strip 1 > /dev/null
rm /home/hadoop/hadoop.tar.gz
===== Completed on 172.16.0.225 =====
===== Started on 172.16.0.221 =====
tar -xvf hadoop.tar.gz --directory=/opt/hadoop --exclude=hadoop-3.1.0/share/doc --strip 1 > /dev/null
rm /home/hadoop/hadoop.tar.gz
===== Completed on 172.16.0.221 =====
===== Started on 172.16.0.224 =====
tar -xvf hadoop.tar.gz --directory=/opt/hadoop --exclude=hadoop-3.1.0/share/doc --strip 1 > /dev/null
rm /home/hadoop/hadoop.tar.gz
===== Completed on 172.16.0.224 =====


**c.** Hadoop relies on several environment variables. The following commands append them to the `/home/hadoop/.bashrc` file on each machine:

In [10]:
for host in VMS:
    !ssh hadoop@{host} 'printf "%s\n" {commands.get_bashrc()} >> ~/.bashrc'
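
`printf "%s\n"` reuses its format string for every argument, so each argument lands on its own line of `.bashrc`. The sketch below shows one way such a command could be assembled with safe quoting. The two `export` lines are assumptions based on the `/opt/hadoop` install path; the real lines come from `commands.get_bashrc()`, and `build_append_command` is a hypothetical helper.

```python
import shlex

# Hypothetical settings: the real lines come from commands.get_bashrc().
BASHRC_LINES = [
    'export HADOOP_HOME=/opt/hadoop',
    'export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin',
]


def build_append_command(lines, path='~/.bashrc'):
    """Build a shell command that appends each line to `path`.

    Every line is single-quoted by shlex.quote, so `$PATH` is written
    literally instead of being expanded when the command runs.
    """
    quoted = ' '.join(shlex.quote(line) for line in lines)
    return "printf '%s\\n' " + quoted + ' >> ' + path


print(build_append_command(BASHRC_LINES))
```

Because the resulting string contains single quotes, it is easier to hand it to `ssh` as one list element through `subprocess` than to nest it inside another quoted shell string.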

**d.** The following commands check the Hadoop installation (you should see no errors):

In [11]:
for host in VMS:
    run_ssh(host, 'hadoop version')

===== Started on 172.16.0.225 =====
hadoop version
Hadoop 3.1.3
Source code repository https://gitbox.apache.org/repos/asf/hadoop.git -r ba631c436b806728f8ec2f54ab1e289526c90579
Compiled by ztang on 2019-09-12T02:47Z
Compiled with protoc 2.5.0
From source with checksum ec785077c385118ac91aadde5ec9799
This command was run using /opt/hadoop/share/hadoop/common/hadoop-common-3.1.3.jar
===== Completed on 172.16.0.225 =====
===== Started on 172.16.0.221 =====
hadoop version
Hadoop 3.1.3
Source code repository https://gitbox.apache.org/repos/asf/hadoop.git -r ba631c436b806728f8ec2f54ab1e289526c90579
Compiled by ztang on 2019-09-12T02:47Z
Compiled with protoc 2.5.0
From source with checksum ec785077c385118ac91aadde5ec9799
This command was run using /opt/hadoop/share/hadoop/common/hadoop-common-3.1.3.jar
===== Completed on 172.16.0.221 =====
===== Started on 172.16.0.224 =====
hadoop version
Hadoop 3.1.3
Source code repository https://gitbox.apache.

## 2. Configure the namenode

**a.** Update the `core-site.xml` file located at `/opt/hadoop/etc/hadoop/` to define the name node URI on this machine.

In [12]:
!ssh hadoop@{NAMENODE_IP} 'printf "%s\n" {commands.get_namenode_core_site(VMS[NAMENODE_IP])} > /opt/hadoop/etc/hadoop/core-site.xml'
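
All Hadoop `*-site.xml` files share the same layout: a `<configuration>` root holding `<property>` elements, each with a `<name>` and a `<value>`. The following sketch generates such a file for `core-site.xml`, where `fs.defaultFS` carries the namenode URI. The helper `hadoop_site_xml` is hypothetical and port 9000 is an assumption; the actual content is produced by `commands.get_namenode_core_site`.

```python
import xml.etree.ElementTree as ET


def hadoop_site_xml(properties):
    """Serialize a dict of Hadoop properties as a *-site.xml document."""
    root = ET.Element('configuration')
    for name, value in properties.items():
        prop = ET.SubElement(root, 'property')
        ET.SubElement(prop, 'name').text = name
        ET.SubElement(prop, 'value').text = str(value)
    return ET.tostring(root, encoding='unicode')


# Port 9000 is an assumption; use whatever your cluster expects.
core_site = hadoop_site_xml({'fs.defaultFS': 'hdfs://hadoop-namenode:9000'})
print(core_site)
```

The same helper would cover `hdfs-site.xml`, `yarn-site.xml` and `mapred-site.xml`, since only the property names and values differ.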

**b.** Update the `hdfs-site.xml` file located at `/opt/hadoop/etc/hadoop/` to configure the HDFS subsystem, including the local filesystem path where the namenode persistently stores the namespace and transaction logs.

In [13]:
!ssh hadoop@{NAMENODE_IP} 'printf "%s\n" {commands.get_namenode_hdfs_site()} > /opt/hadoop/etc/hadoop/hdfs-site.xml'

**c.** Update the `yarn-site.xml` file located at `/opt/hadoop/etc/hadoop` to configure the YARN subsystem.

In [14]:
!ssh hadoop@{NAMENODE_IP} 'printf "%s\n" {commands.get_namenode_yarn_site()} > /opt/hadoop/etc/hadoop/yarn-site.xml'

**d.** Update the `mapred-site.xml` file located at `/opt/hadoop/etc/hadoop` to configure the MapReduce subsystem.

In [15]:
!ssh hadoop@{NAMENODE_IP} 'printf "%s\n" {commands.get_namenode_mapred_site(VMS[NAMENODE_IP])} > /opt/hadoop/etc/hadoop/mapred-site.xml'

**e.** Update the `workers` file located at `/opt/hadoop/etc/hadoop` to list the worker nodes of the cluster.

In [16]:
!ssh hadoop@{NAMENODE_IP} 'printf "%s\n" {commands.get_workers(VMS)} > /opt/hadoop/etc/hadoop/workers'
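
The `workers` file is plain text with one hostname per line. The `jps` output later in this notebook shows a DataNode and a NodeManager on all three machines, so the sketch below assumes that every host in `VMS`, the namenode included, is listed. `workers_file` is a hypothetical helper; the real content comes from `commands.get_workers`.

```python
VMS = {'172.16.0.225': 'hadoop-namenode',
       '172.16.0.221': 'hadoop-datanode-2',
       '172.16.0.224': 'hadoop-datanode-3'}


def workers_file(vms):
    """Return the body of the workers file: one hostname per line."""
    return '\n'.join(vms.values())


print(workers_file(VMS))
```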

## 3. Configure the datanodes

**a.** Update the `core-site.xml` file located at `/opt/hadoop/etc/hadoop/` to define the namenode URI on the other datanodes.

In [17]:
for host in REMAINING_IPS:
    !ssh hadoop@{host} 'printf "%s\n" {commands.get_datanode_core_site(VMS[NAMENODE_IP])} > /opt/hadoop/etc/hadoop/core-site.xml'

**b.** Update the `hdfs-site.xml` file located at `/opt/hadoop/etc/hadoop/` to configure the HDFS subsystem.

In [18]:
for host in REMAINING_IPS:
    !ssh hadoop@{host} 'printf "%s\n" {commands.get_datanode_hdfs_site()} > /opt/hadoop/etc/hadoop/hdfs-site.xml'

**c.** Update the `yarn-site.xml` file located at `/opt/hadoop/etc/hadoop` to configure the YARN subsystem.

In [19]:
for host in REMAINING_IPS:
    !ssh hadoop@{host} 'printf "%s\n" {commands.get_datanode_yarn_site()} > /opt/hadoop/etc/hadoop/yarn-site.xml'

**d.** Update the `mapred-site.xml` file located at `/opt/hadoop/etc/hadoop` to configure the MapReduce subsystem.

In [20]:
for host in REMAINING_IPS:
    !ssh hadoop@{host} 'printf "%s\n" {commands.get_datanode_mapred_site()} > /opt/hadoop/etc/hadoop/mapred-site.xml'

## 4. Start, test and stop Hadoop

Delete the local HDFS storage directories on every machine, then stop any HDFS daemons that are still running.
Note: **This causes the loss of all information stored in HDFS**.

In [21]:
for host in VMS:
    run_ssh(host, 'rm -rf /opt/hdfs/namenode/*')
    run_ssh(host, 'rm -rf /opt/hdfs/datanode/*')
    
host = NAMENODE_IP
!ssh hadoop@{host} 'stop-dfs.sh'

Stopping namenodes on [hadoop-namenode]
Stopping datanodes
Stopping secondary namenodes [hadoop-namenode]


Format the HDFS filesystem at the namenode.

In [22]:
host = NAMENODE_IP
!ssh hadoop@{host} 'hdfs namenode -format -force'

2020-04-16 22:16:54,966 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = hadoop-namenode/172.16.0.225
STARTUP_MSG:   args = [-format, -force]
STARTUP_MSG:   version = 3.1.3


STARTUP_MSG:   classpath = /opt/hadoop/etc/hadoop:/opt/hadoop/share/hadoop/common/lib/jersey-json-1.19.jar:/opt/hadoop/share/hadoop/common/lib/jackson-jaxrs-1.9.13.jar:/opt/hadoop/share/hadoop/common/lib/jackson-annotations-2.7.8.jar:/opt/hadoop/share/hadoop/common/lib/netty-3.10.5.Final.jar:/opt/hadoop/share/hadoop/common/lib/commons-logging-1.1.3.jar:/opt/hadoop/share/hadoop/common/lib/jetty-security-9.3.24.v20180605.jar:/opt/hadoop/share/hadoop/common/lib/re2j-1.1.jar:/opt/hadoop/share/hadoop/common/lib/error_prone_annotations-2.2.0.jar:/opt/hadoop/share/hadoop/common/lib/jackson-databind-2.7.8.jar:/opt/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar:/opt/hadoop/share/hadoop/common/lib/jetty-util-9.3.24.v20180605.jar:/opt/hadoop/share/hadoop/common/lib/metrics-core-3.2.4.jar:/opt/hadoop/share/hadoop/common/lib/commons-configuration2-2.1.1.jar:/opt/hadoop/share/hadoop/common/lib/woodstox-core-5.0.3.jar:/opt/hadoop/share/hadoop/common/lib/gson-2.2.4.jar:/opt/hadoop/share/hadoo

2020-04-16 22:16:55,240 INFO namenode.NameNode: createNameNode [-format, -force]
Formatting using clusterid: CID-88963bcc-895b-498d-9d9e-8bfde3833e23
2020-04-16 22:16:57,066 INFO namenode.FSEditLog: Edit logging is async:true
2020-04-16 22:16:57,117 INFO namenode.FSNamesystem: KeyProvider: null
2020-04-16 22:16:57,119 INFO namenode.FSNamesystem: fsLock is fair: true
2020-04-16 22:16:57,125 INFO namenode.FSNamesystem: Detailed lock hold time metrics enabled: false
2020-04-16 22:16:57,130 INFO namenode.FSNamesystem: fsOwner             = hadoop (auth:SIMPLE)
2020-04-16 22:16:57,131 INFO namenode.FSNamesystem: supergroup          = supergroup
2020-04-16 22:16:57,131 INFO namenode.FSNamesystem: isPermissionEnabled = false
2020-04-16 22:16:57,131 INFO namenode.FSNamesystem: HA Enabled: false
2020-04-16 22:16:57,223 INFO common.Util: dfs.datanode.fileio.profiling.sampling.percentage set to 0. Disabling file IO profiling
2020-04-16 22:16:57,251 INFO blockmanagement.DatanodeManager: dfs.block.

Create the HDFS home folder for the `hadoop` user. HDFS must be running for this step, so it is started first and stopped afterwards.

In [23]:
!ssh hadoop@{host} 'start-dfs.sh'
!ssh hadoop@{host} 'hadoop fs -mkdir -p /user/hadoop'
!ssh hadoop@{host} 'stop-dfs.sh'

Starting namenodes on [hadoop-namenode]
Starting datanodes
Starting secondary namenodes [hadoop-namenode]
Stopping namenodes on [hadoop-namenode]
Stopping datanodes
Stopping secondary namenodes [hadoop-namenode]


Start HDFS & YARN.

In [24]:
host = NAMENODE_IP
!ssh hadoop@{host} 'start-dfs.sh'
!ssh hadoop@{host} 'start-yarn.sh'

Starting namenodes on [hadoop-namenode]
Starting datanodes
Starting secondary namenodes [hadoop-namenode]
Starting resourcemanager
Starting nodemanagers


Check that all daemons are up and running.

In [25]:
for host in VMS:
    run_ssh(host, 'jps')

===== Started on 172.16.0.225 =====
jps
219747 SecondaryNameNode
219489 DataNode
220262 NodeManager
3000 HistoryServer
220059 ResourceManager
220699 Jps
219263 NameNode
===== Completed on 172.16.0.225 =====
===== Started on 172.16.0.221 =====
jps
154352 Jps
153968 DataNode
154154 NodeManager
===== Completed on 172.16.0.221 =====
===== Started on 172.16.0.224 =====
jps
197748 Jps
197364 DataNode
197550 NodeManager
===== Completed on 172.16.0.224 =====
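
Reading the `jps` listings by eye is error-prone on larger clusters. A minimal sketch of an automated check follows, assuming the daemon sets seen in the output above are the ones expected for this setup. `parse_jps`, `missing_daemons` and the two `EXPECTED_*` constants are hypothetical names.

```python
# Daemon names taken from the jps output above (assumed to be the expected set).
EXPECTED_ON_NAMENODE = {'NameNode', 'SecondaryNameNode', 'ResourceManager',
                        'DataNode', 'NodeManager'}
EXPECTED_ON_DATANODE = {'DataNode', 'NodeManager'}


def parse_jps(output):
    """Return the set of process names found in '<pid> <name>' lines."""
    names = set()
    for line in output.splitlines():
        parts = line.split()
        # Skip banners and the echoed command; keep only '<pid> <name>' rows.
        if len(parts) == 2 and parts[0].isdigit():
            names.add(parts[1])
    return names


def missing_daemons(output, expected):
    """Return the expected daemons that do not appear in the jps output."""
    return expected - parse_jps(output)


sample = """jps
154352 Jps
153968 DataNode
154154 NodeManager"""
print(missing_daemons(sample, EXPECTED_ON_DATANODE))  # set()
```

An empty set means every expected daemon is running on that host.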


Run an example provided by Hadoop.
* Wait about a minute before running it, so that the daemons can complete their initialization
* Ignore the initial `No such file or directory` errors
* Ignore the messages from the `sasl.SaslDataTransferClient` logger

In [None]:
host = NAMENODE_IP
!ssh hadoop@{host} 'hadoop fs -rm -r input output'
!ssh hadoop@{host} 'hadoop fs -put /opt/hadoop/etc/hadoop/ input'
!ssh hadoop@{host} "hadoop jar /opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.3.jar grep /user/hadoop/input/*.xml /user/hadoop/output 'dfs[a-z.]+'"
!ssh hadoop@{host} 'hadoop fs -cat output/part-r-00000'

rm: `input': No such file or directory
rm: `output': No such file or directory
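
The `grep` example job extracts every string matching the regular expression from the input files and counts how often each one occurs. A local Python sketch of the same computation follows, working on in-memory strings rather than HDFS files. `grep_count` is a hypothetical helper and the sample lines are made up for illustration.

```python
import re
from collections import Counter


def grep_count(texts, pattern):
    """Count every regex match across `texts`, most frequent first."""
    regex = re.compile(pattern)
    counts = Counter()
    for text in texts:
        counts.update(regex.findall(text))
    return counts.most_common()


# Made-up input lines standing in for the uploaded *.xml files.
sample = [
    '<name>dfs.replication</name>',
    '<name>dfs.namenode.name.dir</name> <name>dfs.replication</name>',
]
for match, count in grep_count(sample, r'dfs[a-z.]+'):
    print(count, match)
```

On a cluster the same counting is split across map and reduce tasks, which is what makes the job a useful smoke test for both HDFS and YARN.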


Stop HDFS & YARN.

In [None]:
host = NAMENODE_IP
!ssh hadoop@{host} 'stop-dfs.sh'
!ssh hadoop@{host} 'stop-yarn.sh'