Building Hadoop from source

Shawfeng Dong edited this page Nov 2, 2016 · 2 revisions

In this article we describe how to build Apache Hadoop 2.5.2 from source on Hyades.

Download Hadoop source

$ cd /scratch
$ wget http://apache.spinellicreations.com/hadoop/common/hadoop-2.5.2/hadoop-2.5.2-src.tar.gz
$ tar xvfz hadoop-2.5.2-src.tar.gz
$ cd hadoop-2.5.2-src

According to BUILDING.txt, the requirements for building Hadoop 2.5.2 are:

  • Unix System
  • JDK 1.6+
  • Maven 3.0 or later
  • Findbugs 1.3.9 (if running findbugs)
  • ProtocolBuffer 2.5.0
  • CMake 2.6 or newer (if compiling native code)
  • Zlib devel (if compiling native code)
  • openssl devel (if compiling native hadoop-pipes)
  • Internet connection for first build (to fetch all Maven and Hadoop dependencies)

Let's first install the missing requirements.
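
Before installing anything, it can help to see which of the listed tools are already on the PATH. The following is a minimal sketch, not part of the original procedure; the function name check_tools is hypothetical, and it only reports presence, not versions.

```shell
#!/bin/sh
# check_tools (hypothetical helper): report whether each named tool is on PATH.
check_tools() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found at $(command -v "$tool")"
        else
            echo "$tool: missing"
        fi
    done
}

# Tools taken from the requirements list above.
check_tools java mvn protoc cmake
```

Anything reported as missing is covered in the sections that follow.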

Maven

Apache Maven is a software project management and comprehension tool. Based on the concept of a project object model (POM), Maven can manage a Java development project's build, reporting and documentation from a central piece of information.

Download Maven 3.2.3 from one of the mirrors:

$ cd /scratch/
$ wget http://mirror.metrocast.net/apache/maven/maven-3/3.2.3/binaries/apache-maven-3.2.3-bin.tar.gz

Unpack the tarball:

$ tar xvfz apache-maven-3.2.3-bin.tar.gz -C /pfs/sw/java

Create a module file (/pfs/sw/modulefiles/maven/3.2.3) that sets the following environment variables:

M2_HOME=/pfs/sw/java/apache-maven-3.2.3
PATH=$M2_HOME/bin:$PATH
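
On a system without Environment Modules, roughly the same effect can be had by exporting the variables directly in the shell. This is only a sketch of what the module file sets, assuming the install path used above; it is not the module file itself.

```shell
#!/bin/sh
# Plain-shell equivalent of the settings made by the maven/3.2.3 module file.
export M2_HOME=/pfs/sw/java/apache-maven-3.2.3
export PATH="$M2_HOME/bin:$PATH"

# Confirm that the shell now resolves mvn (prints nothing if Maven is absent).
command -v mvn || true
```

Unlike a module, these settings cannot be undone with module unload; they last until the shell exits.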

Load the module:

$ module load maven

Protocol Buffers

As of December 1st, 2014, the latest release of Protocol Buffers is 2.6.0. However, Hadoop 2.5.2 requires exactly Protocol Buffers 2.5.0!

Download Protocol Buffers 2.5.0:

$ cd /scratch/
$ wget https://protobuf.googlecode.com/files/protobuf-2.5.0.tar.bz2

Build and install Protocol Buffers 2.5.0:

$ module load python
$ tar xvfj protobuf-2.5.0.tar.bz2
$ cd protobuf-2.5.0
$ ./configure --prefix=/pfs/sw/serial/gcc/protobuf-2.5.0
$ make
$ make check
$ make install

Create a module file (/pfs/sw/modulefiles/protobuf/2.5.0) that sets the following environment variables:

PATH=/pfs/sw/serial/gcc/protobuf-2.5.0/bin:$PATH
PKG_CONFIG_PATH=/pfs/sw/serial/gcc/protobuf-2.5.0/lib/pkgconfig:$PKG_CONFIG_PATH

Load the module:

$ module load protobuf
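
Because the build insists on this exact version, it is worth confirming which protoc the shell now picks up. A minimal sketch follows; the helper name is_protoc_250 is hypothetical, and it relies on protoc --version printing a line of the form "libprotoc 2.5.0".

```shell
#!/bin/sh
# is_protoc_250 (hypothetical helper): succeed only if the given
# `protoc --version` output reports exactly version 2.5.0.
is_protoc_250() {
    [ "$1" = "libprotoc 2.5.0" ]
}

if command -v protoc >/dev/null 2>&1; then
    found=$(protoc --version)
    if is_protoc_250 "$found"; then
        echo "protoc 2.5.0 found"
    else
        echo "unexpected protoc version: $found"
    fi
else
    echo "protoc not on PATH; did 'module load protobuf' succeed?"
fi
```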

CMake

CMake 2.8 is available in the CentOS 6 repositories:

# yum install cmake

Build Hadoop

$ cd /scratch/hadoop-2.5.2-src

Create binary distribution with native libraries:

$ mvn package -Pdist,native -DskipTests=true -Dtar

NOTE: the Hadoop documentation contains a typo. The option to skip tests is -DskipTests, not -Dskiptests.

The resulting distribution is stored in /scratch/hadoop-2.5.2-src/hadoop-dist/target/.

Fix native libraries of the official Hadoop release

To improve performance, Hadoop tries to load native implementations of certain components[3]. These implementations are shipped as dynamically linked native libraries in the lib/native directory. Although the native libraries in the official Hadoop 2.5.2 release are 64-bit, they are linked against GLIBC_2.14 symbols, whereas the glibc on RHEL/CentOS 6 is version 2.12. As a result, the stock native libraries cannot be loaded, and Hadoop prints the following warning:

WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
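
The mismatch can be confirmed by comparing the glibc symbol versions a library references with the glibc installed on the system. The sketch below is one way to do that with objdump and getconf. The helper name max_glibc_required, the HADOOP_HOME variable, and the library file name libhadoop.so are assumptions, and sort -V requires GNU coreutils.

```shell
#!/bin/sh
# max_glibc_required (hypothetical helper): read `objdump -T` output on stdin
# and print the highest GLIBC_x.y symbol version it mentions.
max_glibc_required() {
    grep -o 'GLIBC_[0-9][0-9.]*' | sed 's/^GLIBC_//' | sort -V | tail -n 1
}

# Assumed location of the stock library; adjust to the actual install.
lib="${HADOOP_HOME:-/pfs/sw/bigdata/hadoop-2.5.2}/lib/native/libhadoop.so"

if [ -r "$lib" ] && command -v objdump >/dev/null 2>&1; then
    echo "libhadoop needs glibc >= $(objdump -T "$lib" | max_glibc_required)"
    # On glibc systems this prints e.g. "glibc 2.12".
    echo "system provides: $(getconf GNU_LIBC_VERSION)"
fi
```

If the first number is higher than the second, the dynamic loader will refuse the library and Hadoop falls back to the Java implementations.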

Let's fix this by overwriting the native libraries in the official Hadoop 2.5.2 release with the ones we just built:

# cp /scratch/hadoop-2.5.2-src/hadoop-dist/target/hadoop-2.5.2/lib/native/* /pfs/sw/bigdata/hadoop-2.5.2/lib/native/
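
To check that the replacement worked, the hadoop checknative command can report which native libraries load. The following is a sketch only: the helper name libhadoop_loaded is hypothetical, and it assumes the report contains a line of the form "hadoop:  true /path/to/libhadoop.so...", which may differ between releases.

```shell
#!/bin/sh
# libhadoop_loaded (hypothetical helper): succeed if the checknative report
# on stdin shows the hadoop native library as loaded.
libhadoop_loaded() {
    grep -Eq '^hadoop:[[:space:]]+true'
}

if command -v hadoop >/dev/null 2>&1; then
    # With -a, checknative exits non-zero if any native library is missing,
    # so capture the report rather than relying on the exit status.
    report=$(hadoop checknative -a 2>&1 || true)
    echo "$report"
    if echo "$report" | libhadoop_loaded; then
        echo "libhadoop loaded"
    else
        echo "libhadoop still not loaded"
    fi
else
    echo "hadoop not on PATH"
fi
```

The NativeCodeLoader warning shown earlier should also no longer appear when running Hadoop commands.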