In this example I will be setting up an Apache Hadoop Multi-node Cluster on AWS and processing a set of input text files using the Hadoop Streaming API and Python Mapper/Reducers. The simple application will perform the same task we saw in my Spark App but instead use the Hadoop framework to output the alphabet letter counts.
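Before the cluster setup, it helps to see the contract a Streaming mapper/reducer pair must honor: read lines on stdin, write tab-separated key/value pairs to stdout, and rely on the framework to sort keys between the two phases. The sketch below is illustrative only, assuming the letter-counting task described above; the actual mapper/reducer files used by the application (and their names) are not shown in this section.

```python
#!/usr/bin/env python
# Minimal sketch of Hadoop Streaming mapper/reducer logic for counting
# alphabet letters. The single-file layout and the "map"/"reduce" role
# argument are illustrative assumptions, not the app's actual code.
import sys

def mapper(lines):
    # Emit one (letter, 1) pair per alphabetic character.
    for line in lines:
        for ch in line.lower():
            if ch.isalpha():
                yield ch, 1

def reducer(pairs):
    # Sum counts per key. Hadoop Streaming hands the reducer its keys
    # already sorted, so equal keys arrive contiguously.
    current, total = None, 0
    for key, count in pairs:
        if key == current:
            total += count
        else:
            if current is not None:
                yield current, total
            current, total = key, count
    if current is not None:
        yield current, total

if __name__ == "__main__" and len(sys.argv) > 1:
    if sys.argv[1] == "map":
        for k, v in mapper(sys.stdin):
            sys.stdout.write("%s\t%d\n" % (k, v))
    elif sys.argv[1] == "reduce":
        tagged = (line.rstrip("\n").split("\t") for line in sys.stdin)
        for k, v in reducer((key, int(n)) for key, n in tagged):
            sys.stdout.write("%s\t%d\n" % (k, v))
```

Outside the cluster, the pipeline can be smoke-tested locally (assuming the file is saved as letters.py) with cat input.txt | python letters.py map | sort | python letters.py reduce.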
The setup involves a main node and two data nodes and uses Hadoop v2.10.0, available from one of the mirrors listed at https://hadoop.apache.org/releases.html
Note that, as an extra precaution, I have ultimately decided that this application's cluster should be configured to use the Private DNS Names. In addition, I have set up SSH tunnels on my /32 network mask for viewing the Web applications instead of exposing the public DNS names and ports (the security group generated dynamically with Ansible is discussed here)
For this example, I will click on Launch Instance and select an Amazon Linux t2.micro 64-bit (x86) configuration (Free tier eligible). For now I will not perform additional configurations of the instances, add storage, or attach any security group(s) but I will add a tag (under the Add Tags section) with a key Name for each of the three instances and values HadoopMainNode, HadoopDataNode1, and HadoopDataNode2 respectively. As I launch each instance I will be using a pre-existing key pair (i.e., certificate).
Once the instances are fully up and running you should be able to see them on the dashboard as shown in the screenshot below along with their Public DNS and IP (not shown). You should also be able to log on to any of these instances as ec2-user from an SSH terminal using your private key or, since this feature is automatically enabled on Amazon Linux instances, using the EC2 Browser-based SSH connection.
On each of the three instances, install:
- Java 1.8 JDK (java-1.8.0-openjdk.x86_64)
- Git (2.23.0) (optional)
- mkdir /var/applications
- Download Hadoop v2.10.0 (wget http://apache-mirror.8birdsvideo.com/hadoop/common/hadoop-2.10.0/hadoop-2.10.0.tar.gz, or choose another mirror from the https://hadoop.apache.org/releases page) and run tar xvf hadoop-2.10.0.tar.gz
- mv hadoop-2.10.0 hadoop
- Edit /var/applications/hadoop/etc/hadoop/hadoop-env.sh: replace export JAVA_HOME=${JAVA_HOME} with export JAVA_HOME=/usr/lib/jvm/jre-1.8.0-openjdk
- Add to the /var/applications/hadoop/etc/hadoop/core-site.xml configuration as shown below:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://<HadoopMainNodeHost>:9000</value>
</property>
</configuration>
- Create the hadoop user, then create the HDFS data directory and give the hadoop user ownership of it and of the Hadoop installation:
mkdir -p /usr/local/hadoop/hdfs/data
chown -R hadoop:hadoop /usr/local/hadoop/hdfs/data
chown -R hadoop:hadoop /var/applications
- Set up passwordless ssh for each node:
sudo su hadoop
cd ~
mkdir .ssh
chmod 700 .ssh && cd .ssh
ssh-keygen (run just on main node and just hit enter with each prompt)
- Copy the public key generated on the main node (~/.ssh/id_rsa.pub) to the ~/.ssh/authorized_keys file on each node (and ensure access permissions are set to 644!)
- Edit the ~/.ssh/config file on the main node (ensuring access permissions are set to 644 when done) to include the following:
Host MainNode
HostName <HadoopMainNodeHost>
User hadoop
IdentityFile ~/.ssh/id_rsa
Host DataNode1
HostName <HadoopDataNode1Host>
User hadoop
IdentityFile ~/.ssh/id_rsa
Host DataNode2
HostName <HadoopDataNode2Host>
User hadoop
IdentityFile ~/.ssh/id_rsa
- Edit the /var/applications/hadoop/etc/hadoop/hdfs-site.xml file on the main node by adding the following configuration:
<configuration>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///usr/local/hadoop/hdfs/data</value>
</property>
</configuration>
- Copy the /var/applications/hadoop/etc/hadoop/mapred-site.xml.template to mapred-site.xml in the same directory and edit to include:
<configuration>
<property>
<name>mapreduce.jobtracker.address</name>
<value><HadoopMainNodeHost>:54311</value>
</property>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
- Edit the /var/applications/hadoop/etc/hadoop/yarn-site.xml to include the following:
<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value><HadoopMainNodeHost></value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value><HadoopMainNodeHost>:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value><HadoopMainNodeHost>:8031</value>
</property>
</configuration>
- Create a /var/applications/hadoop/etc/hadoop/masters file and add the following:
<HadoopMainNodeHost>
- Edit the /var/applications/hadoop/etc/hadoop/slaves file and add the following:
<HadoopDataNode1Host>
<HadoopDataNode2Host>
- On the main node, format the NameNode, copy the configuration to the data nodes, and start DFS and YARN (assuming HADOOP_HOME has been exported as /var/applications/hadoop):
cd /var/applications
./hadoop/bin/hdfs namenode -format
cd $HADOOP_HOME/etc/hadoop
for i in `cat slaves`; do
scp *.xml $i:$HADOOP_HOME/etc/hadoop;
done
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
You can also view the data nodes in the DFS Health UI (http://<HadoopMainNodeHost>:50070/dfshealth.html#tab-datanode) as shown in the example screenshot below:
Note: If you stop/start any of the instances and have not set up a domain name you will need to update the public DNS in the following locations:
- Main Node:
~/.ssh/config (references to main and data nodes updated)
/var/applications/hadoop/etc/hadoop/core-site.xml
/var/applications/hadoop/etc/hadoop/mapred-site.xml
/var/applications/hadoop/etc/hadoop/yarn-site.xml
/var/applications/hadoop/etc/hadoop/masters
/var/applications/hadoop/etc/hadoop/slaves
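Updating the public DNS name by hand in all of those files is tedious, so the substitution can be scripted. The helper below is a hypothetical sketch (not part of the setup above): it assumes the file layout used in this walkthrough and that the hadoop user's home directory is /home/hadoop.

```python
#!/usr/bin/env python
# Hypothetical helper that swaps an old public DNS name for a new one
# across the configuration files listed above. The paths assume the
# layout used in this walkthrough.
import sys

HADOOP_CONF = "/var/applications/hadoop/etc/hadoop"
FILES = [
    HADOOP_CONF + "/core-site.xml",
    HADOOP_CONF + "/mapred-site.xml",
    HADOOP_CONF + "/yarn-site.xml",
    HADOOP_CONF + "/masters",
    HADOOP_CONF + "/slaves",
    "/home/hadoop/.ssh/config",  # assumed hadoop user home directory
]

def update_host(paths, old, new):
    """Rewrite each file in place, replacing old with new where present;
    return the list of files that were actually changed."""
    changed = []
    for path in paths:
        with open(path) as f:
            text = f.read()
        if old in text:
            with open(path, "w") as f:
                f.write(text.replace(old, new))
            changed.append(path)
    return changed

if __name__ == "__main__" and len(sys.argv) == 3:
    for path in update_host(FILES, sys.argv[1], sys.argv[2]):
        print("updated %s" % path)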
The Ansible AWS Instance documentation contains instructions using Ansible (i.e., using the hadoop_conf.yml and hadoop_daemons.yml playbooks) for the automated deployment of the configuration and the DFS/YARN restart. Using these playbooks will automate the process if the configuration (i.e., Public DNS/IP or other parameters) changes.