Final platform code of the VELaSSCo (http://www.velassco.eu) EC-funded project. The goal of the project was to combine big data tools and technologies with scientific visualization algorithms to provide new visual analysis methods for large-scale simulations.
Setting up VELaSSCo platform
This tutorial guides you through the installation of Hadoop and HBase on a multi-node cluster. Furthermore, we explain how to set up the Injector and the VELaSSCo platform to make everything run. In our example we use:
- 4 Machines - 1 Master and 3 Slaves
- Scientific Linux 7.3 as operating system
- OpenJDK 1.8.0_121
- Hadoop 2.7.3 - http://mirror.yannic-bonenberger.com/apache/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
- HBase 1.2.4 - http://mirror.serversupportforum.de/apache/hbase/1.2.4/hbase-1.2.4-bin.tar.gz
- Thrift 0.9.3
Create a separate user on every machine you want to use, e.g. "velassco". Note: don't forget to add the user to the /etc/sudoers file.
```
su
useradd velassco
passwd velassco
```
Afterwards edit the /etc/hosts file on every machine so that we can work with aliases instead of IP addresses and the machines know about each other. To do so, find out the IPs of your machines and edit the hosts file accordingly.
sudo vim /etc/hosts
Although most tutorials say that you only need to add the master and the machine itself to each slave's hosts file, I ran into problems without adding all the other machines too.
```
# hosts file
x.x.x.1 master
x.x.x.2 slave1
x.x.x.3 slave2
x.x.x.4 slave3
```
Afterwards we have to create an SSH key for the cluster user on the master machine and enable passwordless SSH login to the slaves. The commands below use a hadoop user; substitute velassco if that is the account you created earlier. Note: don't add a password during key creation - just press enter to use the defaults.
```
su - hadoop
ssh-keygen -t rsa
ssh-copy-id hadoop@slave1
ssh-copy-id hadoop@slave2
ssh-copy-id hadoop@slave3
```
Java and .bashrc
If you don't have Java installed, please install it on every machine. The version listed in the prerequisites worked for me - for a compatibility list please refer to https://wiki.apache.org/hadoop/HadoopJavaVersions
In my example I used the default Java location /usr/bin/java. For Hadoop/HBase to find Java, you have to specify an environment variable in your .bashrc file. To minimize the editing of this file, please copy the whole block below, as we will need the other entries later when we set up Hadoop/HBase.
```
cd
vim .bashrc
```
```
# .bashrc file

# Java
export JAVA_HOME=/usr

# Hadoop
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_INSTALL=$HADOOP_HOME

# HBase
export HBASE_HOME=/usr/local/hbase
export PATH=$PATH:$HBASE_HOME/bin
```

Afterwards, reload the file with:

```
source .bashrc
```
NOTE: Please turn off all firewall services (e.g. firewalld, iptables) or you will run into problems.
First of all we have to create the storage locations for the namenode and datanodes on your filesystem. In this example I use "/home/hadoop/hadoopinfra" as the data location, but any directory on the machine should work. Create the following folder structure on every machine:
```
cd
mkdir hadoopinfra
mkdir hadoopinfra/hdfs
mkdir hadoopinfra/hdfs/namenode
mkdir hadoopinfra/hdfs/datanode
sudo chown -R hadoop hadoopinfra/
```
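As a shortcut, the same structure can be created in one step with `mkdir -p`, which creates missing parent directories automatically. The sketch below uses a throwaway /tmp path so it is safe to try anywhere; for the real setup the location stays /home/hadoop/hadoopinfra:

```shell
# Same layout as above, built with one mkdir -p invocation.
# /tmp/hadoopinfra_demo is a placeholder path used only for this demo.
mkdir -p /tmp/hadoopinfra_demo/hdfs/namenode /tmp/hadoopinfra_demo/hdfs/datanode
ls /tmp/hadoopinfra_demo/hdfs
```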
The easiest way to install Hadoop (and HBase) is to download and configure it on your master machine and, after the configuration, duplicate the folder to all the slave machines.
```
# Downloading, unpacking and setting folder rights
cd /usr/local
wget http://mirror.yannic-bonenberger.com/apache/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
tar -xzf hadoop-2.7.3.tar.gz
sudo mv hadoop-2.7.3 hadoop
sudo chown -R hadoop hadoop/
```
To configure Hadoop for the fully-distributed mode you have to adjust several files in the /usr/local/hadoop/etc/hadoop directory.

First search for and adjust the JAVA_HOME environment variable in hadoop-env.sh.

Afterwards please copy each block from here into the corresponding file, between its <configuration>...</configuration> tags:
```
# core-site.xml
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://master/</value>
  <description>NameNode URI</description>
</property>
```
```
# yarn-site.xml
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>master</value>
  <description>The hostname of the ResourceManager</description>
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
  <description>shuffle service for MapReduce</description>
</property>
```
```
# hdfs-site.xml
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/home/hadoop/hadoopinfra/hdfs/datanode</value>
  <description>DataNode directory for storing data chunks.</description>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/home/hadoop/hadoopinfra/hdfs/namenode</value>
  <description>NameNode directory for namespace and transaction logs storage.</description>
</property>
<property>
  <name>dfs.replication</name>
  <value>3</value>
  <description>Number of replicas for each chunk.</description>
</property>
```
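For capacity planning, keep in mind what dfs.replication = 3 means: every block is stored three times across the cluster, so raw disk usage is roughly triple the logical data size. A back-of-the-envelope check (the 2 TB figure is only an illustrative assumption, not from this setup):

```python
# With dfs.replication = 3, each HDFS block lives on three datanodes.
replication = 3
logical_data_tb = 2                       # hypothetical dataset size in TB
raw_usage_tb = logical_data_tb * replication
print(raw_usage_tb)                       # -> 6 (TB of raw capacity needed)
```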
For the next file you first have to create it from its template:
```
cp mapred-site.xml.template mapred-site.xml
```
Then add the following as before:
```
# mapred-site.xml
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
  <description>Execution framework.</description>
</property>
```
Also please edit the slaves file and add all the machines to it. As our master is also a datanode, we have to add it there as well.
```
# slaves
master
slave1
slave2
slave3
```
After this has been done we need to copy the hadoop folder to each slave machine. Note that since you don't have permission to write into /usr/local over scp directly, you have to copy into the home directory first and then move and chown the folder to the desired location on each slave.
```
# On your master machine
cd /usr/local
sudo scp -rp hadoop/* hadoop@slave1:~/hadoop
sudo scp -rp hadoop/* hadoop@slave2:~/hadoop
sudo scp -rp hadoop/* hadoop@slave3:~/hadoop
```
```
# On your slave machines
cd
sudo mv hadoop/ /usr/local
cd /usr/local/
sudo chown -R hadoop hadoop/
```
Verifying Hadoop Installation
Now we have to format the namenode by using the command:
```
hdfs namenode -format
```
Sometimes this won't work because of a lack of permissions - then you can try the following commands. If you use sudo, you have to re-own the directory afterwards:
```
sudo /usr/local/hadoop/bin/hdfs namenode -format
sudo chown -R velassco /home/velassco/hadoopinfra
```
Afterwards the filesystem should be formatted and you should be able to start Hadoop and YARN with the following commands:
```
start-dfs.sh && start-yarn.sh
```
Then you should be able to access the cluster web interface (the YARN ResourceManager UI) in a browser via http://localhost:8088/
You can test whether the installation works correctly by navigating to the examples directory and running an example MapReduce computation:
```
cd /usr/local/hadoop/share/hadoop/mapreduce
yarn jar hadoop-mapreduce-examples-2.7.3.jar pi 10 100
```
If everything works correctly this should compute a rather bad approximation of pi.
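For intuition, the pi example is a Monte Carlo style estimator: it samples points in the unit square and counts how many land inside the quarter circle (the Hadoop job actually uses a quasi-Monte Carlo variant, but the idea is the same). A minimal single-machine sketch of that idea:

```python
import random

def estimate_pi(samples, seed=42):
    """Estimate pi by sampling points in the unit square and counting
    how many fall inside the quarter circle of radius 1."""
    rng = random.Random(seed)
    inside = sum(
        1 for _ in range(samples)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return 4.0 * inside / samples

print(estimate_pi(100_000))
```

With only 1000 samples (10 maps x 100 samples, as in the command above) the estimate is correspondingly rough, which is why the job's output deviates noticeably from pi.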
The HBase installation should be easy if you set up Hadoop correctly. Just download and unpack HBase into /usr/local as you did before with Hadoop. Don't forget to chown the directory.
Afterwards navigate to the /usr/local/hbase/conf directory and edit the specified files as follows:
Here you have to uncomment and adjust the paths to your preference.
```
# hbase-env.sh
export JAVA_HOME=/usr
export HBASE_PID_DIR=/var/hadoop/pids
export HBASE_MANAGES_ZK=true
export HBASE_HEAPSIZE=4000
export HBASE_OFFHEAPSIZE=8G
```
Then you should create and own the directory for the pids on every machine with:
```
sudo mkdir /var/hadoop
sudo mkdir /var/hadoop/pids
sudo chown -R velassco /var/hadoop
```
Continue to edit the following files on your master.
```
# hbase-site.xml on master
<property>
  <name>hbase.master</name>
  <value>master:60000</value>
</property>
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://master/hbase</value>
</property>
<property>
  <name>hbase.zookeeper.property.dataDir</name>
  <value>/home/velassco/zookeper</value>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>master</value>
</property>
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>16777216</value>
</property>
<property>
  <name>zookeeper.session.timeout</name>
  <value>1200000</value>
</property>
<property>
  <name>hbase.zookeeper.property.tickTime</name>
  <value>6000</value>
</property>
```
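Two of the values above are easy to misread without their units: hbase.hregion.max.filesize is given in bytes and zookeeper.session.timeout in milliseconds. A quick conversion of the settings used here:

```python
# Values taken from the hbase-site.xml above.
region_max_filesize = 16777216   # bytes  -> region split threshold
session_timeout_ms = 1200000     # milliseconds -> ZooKeeper session timeout

print(region_max_filesize / 2**20)   # -> 16.0  (MB per region before splitting)
print(session_timeout_ms / 60000)    # -> 20.0  (minutes before a session expires)
```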
```
# regionservers
master
slave1
slave2
slave3
```
Finally you have to transfer the hbase folder to your other machines and adjust hbase-site.xml in /usr/local/hbase/conf on the slaves so that it only contains the following entries:
```
# hbase-site.xml on slaves
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://master/hbase</value>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>master</value>
</property>
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>
<property>
  <name>hbase.hregion.max.filesize</name>
  <value>16777216</value>
</property>
<property>
  <name>zookeeper.session.timeout</name>
  <value>1200000</value>
</property>
<property>
  <name>hbase.zookeeper.property.tickTime</name>
  <value>6000</value>
</property>
```
Then you have to start HBase on the master:

```
start-hbase.sh
```

To access the HBase shell, type:

```
hbase shell
```

You should get something like this if you type status once the shell is invoked:
```
hbase(main):001:0> status
1 active master, 0 backup masters, 4 servers, 0 dead, 0.7500 average load
hbase(main):002:0>
```
Some of the VELaSSCo VQueries require a certain amount of data to be passed between the MapReduce nodes and between VELaSSCo's Query Manager and the HBase Thrift server. To avoid crashes of the HBase Thrift daemon, the Java VM memory limit shall be raised.
To adjust this limit edit the file /usr/local/hbase/conf/hbase-env.sh and make the following changes:
```
# hbase-env.sh
# The maximum amount of heap to use, in MB. Default is 1000.
# It was raised to 10000 at one point:
# export HBASE_HEAPSIZE=10000
export HBASE_HEAPSIZE=4000

# Uncomment below if you intend to use off heap cache.
# For example, to allocate 8G of offheap, set the value to "8G".
export HBASE_OFFHEAPSIZE=8G
```
and restart the HBase services.
Another interesting parameter to adjust for a correct distribution of the data among the different HBase region servers is the splitting size.
To adjust this limit edit the file /usr/local/hbase/conf/hbase-site.xml and change the value of the hbase.hregion.max.filesize property (set to 16777216 bytes, i.e. 16 MB, in the configuration above).
For adding, converting and injecting data into the platform please refer to this readme here
Building the VELaSSCo Platform
To build the VELaSSCo platform you simply have to run the following commands:
```
cd path/to/VELASSCO/modules
make
```
Afterwards you need to create a link to the relevant lib folder with the following commands:
```
cd path/to/VELASSCO/modules/Platform/AnalyticsYarnJobs
ln -s lib_eddie lib
```
You also have to copy and rename the necessary JAR files in the AnalyticsYarnJobs directory so that they carry your machine's name as extension (in our example, master) as follows:
```
# for every file
cp GetBoundaryOfAMesh_pez001.jar GetBoundaryOfAMesh_master.jar
cp GetBoundingBoxOfAModel_pez001.jar GetBoundingBoxOfAModel_master.jar
cp GetDiscrete2ContinuumOfAModel_pez001.jar GetDiscrete2ContinuumOfAModel_master.jar
cp GetListOfVerticesFromMesh_pez001.jar GetListOfVerticesFromMesh_master.jar
cp GetMissingIDsOfVerticesWithoutResults_pez001.jar GetMissingIDsOfVerticesWithoutResults_master.jar
cp GetSimplifiedMesh_Average_pez001.jar GetSimplifiedMesh_Average_master.jar
cp GetSimplifiedMeshWithResult_Average_pez001.jar GetSimplifiedMeshWithResult_Average_master.jar
```
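Instead of copying each file by hand, a glob loop handles any number of *_pez001.jar files at once. The sketch below demonstrates the renaming pattern on empty placeholder files in a throwaway /tmp directory (the directory and the two sample file names are only for the demo; in practice run the loop inside AnalyticsYarnJobs):

```shell
# Demo setup: placeholder jars in a scratch directory.
mkdir -p /tmp/yarnjobs_demo && cd /tmp/yarnjobs_demo
touch GetBoundaryOfAMesh_pez001.jar GetSimplifiedMesh_Average_pez001.jar

# The renaming pattern: *_pez001.jar -> *_master.jar
for f in *_pez001.jar; do
  cp "$f" "${f%_pez001.jar}_master.jar"
done
ls *_master.jar
```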
To run the whole platform you have to have:
- Hadoop started (start-dfs.sh && start-yarn.sh) and HBase started (start-hbase.sh)
- The Thrift server started (hbase thrift start)
- The data injected (see the data injection readme referenced above)
Then you need to start the Query Manager as follows:
```
cd path/to/VELASSCO/modules/Platform/
./QueryManager.exe -port 26267
```
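As a quick smoke test that the Query Manager is actually listening before pointing a client at it, you can probe the port from Python. This helper is not part of the platform, just a convenience sketch; the host and port match the command above:

```python
import socket

def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Probe the Query Manager port used in the example above.
print(is_port_open("localhost", 26267))
```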
Now that the query manager is running, you are able to use CIMNE's GiD visualization client (http://www.gidhome.com/download/), and its VELaSSCo plugin, to test the query manager. You can find a document in the following link which describes how to connect to a query manager and use its functionalities:
In addition, more information about using the VELaSSCo platform, and webinar videos, can be found here:
| Component | Partner | License |
|---|---|---|
| Data Injection for the Open Source version | ATOS | Dual (AGPL v3/commercial). Free for research and VELaSSCo partners |
| OS Query Manager | INRIA CIMNE | MIT |
| FEM analytics | CIMNE | MIT. Covering the majority of the VQueries for FEM |
| DEM analytics | UEDIN | Based on Discrete to Continuum (D2C) software, proprietary to UEDIN. Limited set of VQueries. |
| LRB-spline analytics | SINTEF | Dual (AGPL v3/commercial) |
| Kratos + GiD for FSI (Kratos VELaSSCo version) | CIMNE | Kratos OS (BSD), plugin MIT |
| GiD commercial / GiD VELaSSCo plugin | CIMNE | MIT |
| iFX VELaSSCo plugin | Fraunhofer | Commercial (selling license or version) |
| LRB-splines visualization plugin | SINTEF | Dual (AGPL v3/commercial) |
| Data Ingestion for DEM | JOTNE | MIT (DEM version of the Data Injector for compatibility with EDM. It is open sourced to foster the creation of new or improved injections to EDM) |
| EDM (EXPRESS Data Manager™) big data extensions | JOTNE | Commercial |
| EDM VELaSSCo queries | JOTNE | MIT (Commercially interesting to foster the creation of VELaSSCo queries for EDM in the future by the EDM community) |
| Standard extension to support DEM simulations ISO10303-209 | UEDIN JOTNE | ISO standard |