hadoop-app

In this example I will set up an Apache Hadoop multi-node cluster on AWS and process a set of input text files using the Hadoop Streaming API with a Python mapper and reducer. This simple application performs the same task we saw in my Spark App, but uses the Hadoop framework to output the alphabet letter counts.
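
The repository's actual scripts may differ, but a minimal sketch of a Streaming mapper and reducer for this letter-count task could look like the following (the script names mapper.py and reducer.py are assumptions, referenced again in the job-submission example later on):

#!/usr/bin/env python
# mapper.py (hypothetical name): emit "<letter>\t1" for every alphabetic character read from stdin
import sys

for line in sys.stdin:
    for ch in line.lower():
        if ch.isalpha():
            print("%s\t1" % ch)

#!/usr/bin/env python
# reducer.py (hypothetical name): sum the per-letter counts
# (Streaming sorts the mapper output by key before it reaches the reducer)
import sys

current, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key == current:
        count += int(value)
    else:
        if current is not None:
            print("%s\t%d" % (current, count))
        current, count = key, int(value)
if current is not None:
    print("%s\t%d" % (current, count))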

The setup involves a main node and two data nodes and uses Hadoop v2.10.0, available from one of the mirrors listed at https://hadoop.apache.org/releases.html.

Note that, as an extra precaution, I ultimately decided that this application's cluster should be configured to use the Private DNS names. In addition, I have set up SSH tunnels restricted to my /32 network mask for viewing the web applications instead of exposing the public DNS names and ports (the security group generated dynamically with Ansible is discussed here).
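
For illustration, such a tunnel could be opened roughly as follows (the key file name and the <HadoopMainNodePrivateDNS>/<HadoopMainNodePublicDNS> placeholders are assumptions; the ports are the Hadoop 2.x defaults for the YARN (8088) and DFS health (50070) UIs referenced later in this document):

# forward local ports to the main node's web UIs over SSH
ssh -i ~/.ssh/my-key.pem -N \
    -L 8088:<HadoopMainNodePrivateDNS>:8088 \
    -L 50070:<HadoopMainNodePrivateDNS>:50070 \
    ec2-user@<HadoopMainNodePublicDNS>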

Launch the Instances

For this example, I will click Launch Instance and select an Amazon Linux t2.micro 64-bit (x86) configuration (Free tier eligible). For now I will not perform any additional instance configuration, add storage, or attach any security group(s), but I will add a tag (under the Add Tags section) with the key Name for each of the three instances and the values HadoopMainNode, HadoopDataNode1, and HadoopDataNode2, respectively. As I launch each instance I will use a pre-existing key pair.

Once the instances are fully up and running you should be able to see them on the dashboard as shown in the screenshot below along with their Public DNS and IP (not shown). You should also be able to log on to any of these instances as ec2-user from an SSH terminal using your private key or, since this feature is automatically enabled on Amazon Linux instances, using the EC2 Browser-based SSH connection.

Running Instances

Additional Setup for each Node (as root user)

Install Dependencies

  • Java 1.8 JDK (java-1.8.0-openjdk.x86_64)
  • Git (2.23.0) (optional)
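
On Amazon Linux these can be installed with yum, along the lines of:

# as root on each node (package names as listed above)
yum install -y java-1.8.0-openjdk.x86_64
yum install -y git   # optional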

Hadoop Setup

Install the Application and Configure JAVA_HOME
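
The detailed commands for this step are not reproduced here; below is a minimal sketch, assuming Hadoop 2.10.0 is unpacked under /var/applications/hadoop (the directory used throughout this README) and treating the download URL as just an example:

# as root on each node: download and unpack Hadoop 2.10.0 (mirror URL is an example)
mkdir -p /var/applications && cd /var/applications
curl -O https://archive.apache.org/dist/hadoop/common/hadoop-2.10.0/hadoop-2.10.0.tar.gz
tar -xzf hadoop-2.10.0.tar.gz && mv hadoop-2.10.0 hadoop

# resolve the installed JDK location and persist JAVA_HOME in hadoop-env.sh
echo "export JAVA_HOME=$(readlink -f /usr/bin/java | sed 's|/jre/bin/java||; s|/bin/java||')" \
    >> /var/applications/hadoop/etc/hadoop/hadoop-env.sh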

  • Edit the /var/applications/hadoop/etc/hadoop/core-site.xml file to include the following configuration (this identifies the main node's HDFS endpoint):

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://<HadoopMainNodeHost>:9000</value>
  </property>
</configuration>
  • Create the HDFS data directory:

mkdir -p /usr/local/hadoop/hdfs/data

  • Create the hadoop user and change ownership of the Hadoop directories:

useradd hadoop
chown -R hadoop:hadoop /usr/local/hadoop/hdfs/data
chown -R hadoop:hadoop /var/applications
  • Set up passwordless ssh for each node:
sudo su hadoop
cd ~
mkdir .ssh
chmod 700 .ssh && cd .ssh
ssh-keygen   # run only on the main node and hit Enter at each prompt
  • Copy the generated public key to the ~/.ssh/authorized_keys file on each node (and ensure access permissions are set to 644!)
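
For example (assuming the key pair was generated with the default name, ~/.ssh/id_rsa):

# on the main node, as the hadoop user
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 644 ~/.ssh/authorized_keys
# then append the same public key to ~/.ssh/authorized_keys on each data node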

  • Edit the ~/.ssh/config file (ensuring access permissions are set to 644 when done) to include the following:

Host MainNode
    HostName <HadoopMainNodeHost>
    User hadoop
    IdentityFile ~/.ssh/id_rsa

Host DataNode1
    HostName <HadoopDataNode1Host>
    User hadoop
    IdentityFile ~/.ssh/id_rsa

Host DataNode2
    HostName <HadoopDataNode2Host>
    User hadoop
    IdentityFile ~/.ssh/id_rsa
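
With this configuration in place, you should be able to reach the data nodes from the main node without a password, e.g.:

# from the main node, as the hadoop user
ssh DataNode1 hostname
ssh DataNode2 hostname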

Setup the HDFS Properties on the Main Node

  • Edit the /var/applications/hadoop/etc/hadoop/hdfs-site.xml file on the main node by adding the following configuration:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///usr/local/hadoop/hdfs/data</value>
  </property>
</configuration>

Setup the MapReduce Properties on the Main Node

  • Copy the /var/applications/hadoop/etc/hadoop/mapred-site.xml.template to mapred-site.xml in the same directory and edit to include:
<configuration>
  <property>
    <name>mapreduce.jobtracker.address</name>
    <value><HadoopMainNodeHost>:54311</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

Setup the YARN Properties on the Main Node

  • Edit the /var/applications/hadoop/etc/hadoop/yarn-site.xml to include the following:
<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value><HadoopMainNodeHost></value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value><HadoopMainNodeHost>:8030</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value><HadoopMainNodeHost>:8031</value>
  </property>
</configuration>

Setup Master and Slaves on the Main Node

  • Create a /var/applications/hadoop/etc/hadoop/masters file and add the following:
<HadoopMainNodeHost>

  • Edit the /var/applications/hadoop/etc/hadoop/slaves file and add the following:

<HadoopDataNode1Host>
<HadoopDataNode2Host>

Start the Hadoop Cluster

Format the HDFS File System on the Main Node

cd /var/applications
./hadoop/bin/hdfs namenode -format

Copy the XML Configuration Files to the Mirrored Directories on the Slave Nodes

# HADOOP_HOME is assumed to point to the Hadoop install directory (/var/applications/hadoop)
cd $HADOOP_HOME/etc/hadoop
for i in `cat slaves`; do
    scp *.xml $i:$HADOOP_HOME/etc/hadoop;
done

Start the Hadoop Cluster

# run as the hadoop user on the main node
cd /var/applications
./hadoop/sbin/start-dfs.sh
./hadoop/sbin/start-yarn.sh
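
A quick way to confirm the daemons started (in addition to the web UIs below) is jps, which ships with the JDK:

# run as the hadoop user on each node
jps
# main node: typically NameNode, SecondaryNameNode and ResourceManager
# data nodes: typically DataNode and NodeManager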

Verify that Nodes are Visible/Active in the UI (screenshot below)

Hadoop Cluster

You can also view the data nodes in the DFS Health UI (http://<HadoopMainNodeHost>:50070/dfshealth.html#tab-datanode) as shown in the example screenshot below:

DFS Health UI
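
With the cluster running, the letter-count Streaming job described in the introduction can be submitted roughly as follows (the mapper.py/reducer.py script names, the local input directory, and the HDFS input/output paths are assumptions for illustration):

# stage the input text files in HDFS and run the Streaming job (as the hadoop user on the main node)
/var/applications/hadoop/bin/hdfs dfs -mkdir -p /user/hadoop/input
/var/applications/hadoop/bin/hdfs dfs -put input/*.txt /user/hadoop/input
/var/applications/hadoop/bin/hadoop jar \
    /var/applications/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.10.0.jar \
    -files mapper.py,reducer.py \
    -mapper mapper.py \
    -reducer reducer.py \
    -input /user/hadoop/input \
    -output /user/hadoop/output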

Stopping and Re-starting AWS Instances

Note: If you stop/start any of the instances and have not set up a domain name you will need to update the public DNS in the following locations:

  • Main Node:
~/.ssh/config (update the references to the main and data nodes)
/var/applications/hadoop/etc/hadoop/core-site.xml
/var/applications/hadoop/etc/hadoop/mapred-site.xml
/var/applications/hadoop/etc/hadoop/yarn-site.xml
/var/applications/hadoop/etc/hadoop/masters
/var/applications/hadoop/etc/hadoop/slaves
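
A quick (hypothetical) way to swap the old hostname for the new one across these files, assuming OLD_DNS and NEW_DNS hold the two names:

# as the hadoop user on the main node; back up the files before editing in place
cd /var/applications/hadoop/etc/hadoop
sed -i "s/${OLD_DNS}/${NEW_DNS}/g" core-site.xml mapred-site.xml yarn-site.xml masters slaves
sed -i "s/${OLD_DNS}/${NEW_DNS}/g" ~/.ssh/config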

Automatic Deployment of the Configuration and Daemon Restart using Ansible

The Ansible AWS Instance documentation contains instructions for automating the deployment of this configuration and the DFS/YARN restart with Ansible (i.e., using the hadoop_conf.yml and hadoop_daemons.yml playbooks). These playbooks automate the process whenever the configuration (e.g., the public DNS/IP or other parameters) changes.
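
For reference, the playbooks would be invoked along these lines (the inventory file name is an assumption; see the linked documentation for the actual invocation):

# push the updated Hadoop configuration and restart the DFS/YARN daemons
ansible-playbook -i hosts hadoop_conf.yml
ansible-playbook -i hosts hadoop_daemons.yml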