In this example I will be setting up an Apache Hadoop Multi-node Cluster on AWS and processing a set of input text files using the Hadoop Streaming API and Python Mapper/Reducers. The simple application will perform the same task we saw in my Spark App but instead use the Hadoop framework to output the alphabet letter counts.
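Before the cluster setup, it helps to see the contract a Streaming mapper/reducer pair must honor: read lines on stdin, write tab-separated key/value pairs to stdout, and rely on the framework to sort keys between the two phases. The sketch below is illustrative only, assuming the letter-counting task described above; the actual mapper/reducer files used by the application (and their names) are not shown in this section.

```python
#!/usr/bin/env python
# Minimal sketch of Hadoop Streaming mapper/reducer logic for counting
# alphabet letters. The single-file layout and the "map"/"reduce" role
# argument are illustrative assumptions, not the app's actual code.
import sys

def mapper(lines):
    # Emit one (letter, 1) pair per alphabetic character.
    for line in lines:
        for ch in line.lower():
            if ch.isalpha():
                yield ch, 1

def reducer(pairs):
    # Sum counts per key. Hadoop Streaming hands the reducer its keys
    # already sorted, so equal keys arrive contiguously.
    current, total = None, 0
    for key, count in pairs:
        if key == current:
            total += count
        else:
            if current is not None:
                yield current, total
            current, total = key, count
    if current is not None:
        yield current, total

if __name__ == "__main__" and len(sys.argv) > 1:
    if sys.argv[1] == "map":
        for k, v in mapper(sys.stdin):
            sys.stdout.write("%s\t%d\n" % (k, v))
    elif sys.argv[1] == "reduce":
        tagged = (line.rstrip("\n").split("\t") for line in sys.stdin)
        for k, v in reducer((key, int(n)) for key, n in tagged):
            sys.stdout.write("%s\t%d\n" % (k, v))
```

Outside the cluster, the pipeline can be smoke-tested locally (assuming the file is saved as letters.py) with cat input.txt | python letters.py map | sort | python letters.py reduce.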
The setup involves a main node and two data nodes and uses Hadoop v2.10.0, available from one of the mirrors listed at https://hadoop.apache.org/releases.html
Note that, as an extra precaution, I have ultimately decided that this application's cluster should be configured to use the Private DNS Names. In addition, I have set up SSH tunnels on my /32 network mask for viewing the Web applications instead of exposing the public DNS names and ports (the security group generated dynamically with Ansible is discussed here)
For this example, I will click on Launch Instance and select an Amazon Linux t2.micro 64-bit (x86) configuration (Free tier eligible). For now I will not perform additional configurations of the instances, add storage, or attach any security group(s) but I will add a tag (under the Add Tags section) with a key Name for each of the three instances and values HadoopMainNode, HadoopDataNode1, and HadoopDataNode2 respectively. As I launch each instance I will be using a pre-existing key pair (i.e., certificate).
Once the instances are fully up and running you should be able to see them on the dashboard as shown in the screenshot below along with their Public DNS and IP (not shown). You should also be able to log on to any of these instances as ec2-user from an SSH terminal using your private key or, since this feature is automatically enabled on Amazon Linux instances, using the EC2 Browser-based SSH connection.
On each of the three instances, install:
- Java 1.8 JDK (java-1.8.0-openjdk.x86_64)
- Git (2.23.0) (optional)
- mkdir /var/applications
- Download Hadoop v2.10.0 (wget http://apache-mirror.8birdsvideo.com/hadoop/common/hadoop-2.10.0/hadoop-2.10.0.tar.gz, or choose another mirror from the https://hadoop.apache.org/releases page) and run tar xvf hadoop-2.10.0.tar.gz
- mv hadoop-2.10.0 hadoop
- Edit /var/applications/hadoop/etc/hadoop/hadoop-env.sh: replace export JAVA_HOME=${JAVA_HOME} with export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk
- Add to the /var/applications/hadoop/etc/hadoop/core-site.xml configuration as shown below:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://<HadoopMainNodeHost>:9000</value>
</property>
</configuration>
- Create the hadoop user, then create the HDFS data directory and give the hadoop user ownership of it and of the Hadoop installation:
mkdir -p /usr/local/hadoop/hdfs/data
chown -R hadoop:hadoop /usr/local/hadoop/hdfs/data
chown -R hadoop:hadoop /var/applications
- Set up passwordless ssh for each node:
sudo su hadoop
cd ~
mkdir .ssh
chmod 700 .ssh && cd .ssh
ssh-keygen (run just on main node and just hit enter with each prompt)
- Copy the public key generated on the main node (~/.ssh/id_rsa.pub) to the ~/.ssh/authorized_keys file on each node (and ensure access permissions are set to 644!)
- Edit the ~/.ssh/config file on the main node (ensuring access permissions are set to 644 when done) to include the following:
Host MainNode
HostName <HadoopMainNodeHost>
User hadoop
IdentityFile ~/.ssh/id_rsa
Host DataNode1
HostName <HadoopDataNode1Host>
User hadoop
IdentityFile ~/.ssh/id_rsa
Host DataNode2
HostName <HadoopDataNode2Host>
User hadoop
IdentityFile ~/.ssh/id_rsa
- Edit the /var/applications/hadoop/etc/hadoop/hdfs-site.xml file on the main node by adding the following configuration:
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///usr/local/hadoop/hdfs/data</value>
</property>
</configuration>
- Copy the /var/applications/hadoop/etc/hadoop/mapred-site.xml.template to mapred-site.xml in the same directory and edit to include:
<configuration>
<property>
<name>mapreduce.jobtracker.address</name>
<value><HadoopMainNodeHost>:54311</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
- Edit the /var/applications/hadoop/etc/hadoop/yarn-site.xml to include the following:
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value><HadoopMainNodeHost></value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value><HadoopMainNodeHost>:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value><HadoopMainNodeHost>:8031</value>
</property>
</configuration>
- Create a /var/applications/hadoop/etc/hadoop/masters file and add the following:
<HadoopMainNodeHost>
- Edit the /var/applications/hadoop/etc/hadoop/slaves file and add the following:
<HadoopDataNode1Host>
<HadoopDataNode2Host>
- On the main node, format the NameNode, copy the configuration to the data nodes, and start DFS and YARN (assuming HADOOP_HOME has been exported as /var/applications/hadoop):
cd /var/applications
./hadoop/bin/hdfs namenode -format
cd $HADOOP_HOME/etc/hadoop
for i in `cat slaves`; do
scp *.xml $i:$HADOOP_HOME/etc/hadoop;
done
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
You can also view the data nodes in the DFS Health UI (http://<HadoopMainNodeHost>:50070/dfshealth.html#tab-datanode) as shown in the example screenshot below:
Note: If you stop/start any of the instances and have not set up a domain name you will need to update the public DNS in the following locations:
- Main Node:
~/.ssh/config (references to main and data nodes updated)
/var/applications/hadoop/etc/hadoop/core-site.xml
/var/applications/hadoop/etc/hadoop/mapred-site.xml
/var/applications/hadoop/etc/hadoop/yarn-site.xml
/var/applications/hadoop/etc/hadoop/masters
/var/applications/hadoop/etc/hadoop/slaves
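Updating the public DNS name by hand in all of those files is tedious, so the substitution can be scripted. The helper below is a hypothetical sketch (not part of the setup above): it assumes the file layout used in this walkthrough and that the hadoop user's home directory is /home/hadoop.

```python
#!/usr/bin/env python
# Hypothetical helper that swaps an old public DNS name for a new one
# across the configuration files listed above. The paths assume the
# layout used in this walkthrough.
import sys

HADOOP_CONF = "/var/applications/hadoop/etc/hadoop"
FILES = [
    HADOOP_CONF + "/core-site.xml",
    HADOOP_CONF + "/mapred-site.xml",
    HADOOP_CONF + "/yarn-site.xml",
    HADOOP_CONF + "/masters",
    HADOOP_CONF + "/slaves",
    "/home/hadoop/.ssh/config",  # assumed hadoop user home directory
]

def update_host(paths, old, new):
    """Rewrite each file in place, replacing old with new where present;
    return the list of files that were actually changed."""
    changed = []
    for path in paths:
        with open(path) as f:
            text = f.read()
        if old in text:
            with open(path, "w") as f:
                f.write(text.replace(old, new))
            changed.append(path)
    return changed

if __name__ == "__main__" and len(sys.argv) == 3:
    for path in update_host(FILES, sys.argv[1], sys.argv[2]):
        print("updated %s" % path)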
The Ansible AWS Instance documentation contains instructions using Ansible (i.e., using the hadoop_conf.yml and hadoop_daemons.yml playbooks) for the automated deployment of the configuration and the DFS/YARN restart. Using these playbooks will automate the process if the configuration (i.e., Public DNS/IP or other parameters) changes.