# Spark on YARN Installation

### _Sample Hostnames and IPs_

node0 10.11.0.2

node1 10.11.0.3

node2 10.11.0.4

_unless stated otherwise, run the following commands on node0_

### _SSH Configuration_

yum install -y openssh-server openssh-clients

ssh-keygen -t rsa -P ""

echo "10.11.0.2 node0
10.11.0.3 node1
10.11.0.4 node2" >> /etc/hosts;

ssh-copy-id -i ~/.ssh/id_rsa.pub root@node0

ssh-copy-id -i ~/.ssh/id_rsa.pub root@node1

ssh-copy-id -i ~/.ssh/id_rsa.pub root@node2

_if no password is set for the root user:_

_copy the contents of ~/.ssh/id_rsa.pub from root@node0 and append them manually to ~/.ssh/authorized_keys on node1 and node2_
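
The `>> /etc/hosts` step above appends the same three lines again every time it is re-run. A minimal sketch of an idempotent alternative follows; `add_host_entry` is a hypothetical helper name, and the demo writes to a scratch file rather than to /etc/hosts.

```shell
# add_host_entry is a hypothetical helper, not part of any package.
# It appends "ip name" to a hosts file only if the name is not mapped yet.
add_host_entry() {
    local file="$1" ip="$2" name="$3"
    # look for the hostname as a whole field on a non-comment line
    if ! awk -v n="$name" '!/^[ \t]*#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 } END { exit !found }' "$file" 2>/dev/null; then
        printf '%s %s\n' "$ip" "$name" >> "$file"
    fi
}

# Demo against a scratch file; on a real node the target would be /etc/hosts.
hosts_file=$(mktemp)
while read -r ip name; do
    add_host_entry "$hosts_file" "$ip" "$name"
done <<'EOF'
10.11.0.2 node0
10.11.0.3 node1
10.11.0.4 node2
EOF
cat "$hosts_file"
```

Calling the helper again with an entry that is already present leaves the file unchanged.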

### _Install JDK_

wget 'https://repo.huaweicloud.com/java/jdk/8u202-b08/jdk-8u202-linux-x64.rpm' 

yum install jdk-8u202-linux-x64.rpm -y 

echo 'export JAVA_HOME=/usr/java/jdk1.8.0_202-amd64
export PATH=$PATH:$JAVA_HOME/bin' > /etc/profile.d/jdk.sh

source /etc/profile.d/jdk.sh

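scp /etc/profile.d/jdk.sh node1:/etc/profile.d/jdk.sh

scp /etc/profile.d/jdk.sh node2:/etc/profile.d/jdk.sh

_jdk.sh only sets the environment; the JDK itself must also be installed on node1 and node2, so repeat the download and install commands there_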


_if the machines use the ARM architecture, install the JDK from the tarball instead (on every node):_

wget https://repo.huaweicloud.com/java/jdk/8u202-b08/jdk-8u202-linux-arm64-vfp-hflt.tar.gz

tar -xvf jdk-8u202-linux-arm64-vfp-hflt.tar.gz

mkdir -p /usr/java

mv jdk1.8.0_202 /usr/java/

echo 'export JAVA_HOME=/usr/java/jdk1.8.0_202
export PATH=$PATH:$JAVA_HOME/bin' > /etc/profile.d/jdk.sh

source /etc/profile.d/jdk.sh

scp /etc/profile.d/jdk.sh node1:/etc/profile.d/jdk.sh

scp /etc/profile.d/jdk.sh node2:/etc/profile.d/jdk.sh
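
The jdk.sh file has to contain a literal `$PATH` and `$JAVA_HOME`, so that they are resolved at login time and not when the file is written. One way to avoid escaping mistakes is a quoted here-document. The sketch below assumes a hypothetical helper named `write_jdk_profile` and writes to a scratch file rather than to /etc/profile.d/jdk.sh.

```shell
# write_jdk_profile is a hypothetical helper: it writes a profile script that
# exports JAVA_HOME and appends its bin directory to PATH.
write_jdk_profile() {
    local java_home="$1" out="$2"
    # This line is expanded now, so JAVA_HOME gets its concrete value.
    printf 'export JAVA_HOME=%s\n' "$java_home" > "$out"
    # The quoted 'EOF' keeps $PATH and $JAVA_HOME literal in the file.
    cat >> "$out" <<'EOF'
export PATH=$PATH:$JAVA_HOME/bin
EOF
}

# Demo against a scratch file; the real target would be /etc/profile.d/jdk.sh.
profile=$(mktemp)
write_jdk_profile /usr/java/jdk1.8.0_202-amd64 "$profile"
cat "$profile"
```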

### _Download Hadoop and Spark_

mkdir -p /data/big-data-app/res

cd /data/big-data-app/res

wget https://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-3.1.3/spark-3.1.3-bin-hadoop2.7.tgz

wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.7/hadoop-2.7.7.tar.gz

tar -xvf spark-3.1.3-bin-hadoop2.7.tgz

mv spark-3.1.3-bin-hadoop2.7 ../spark3

tar -xvf hadoop-2.7.7.tar.gz

mv hadoop-2.7.7 ../hadoop2
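
The extract-then-rename steps can also be done in one pass. This is a sketch under the assumption that the installed tar supports `--strip-components` (GNU tar and bsdtar do); `unpack_to` is a hypothetical helper, and the demo uses a tiny stand-in archive instead of the real downloads.

```shell
# unpack_to is a hypothetical helper: it extracts a tarball so that the
# contents of its single top-level directory land directly in the target.
unpack_to() {
    local archive="$1" target="$2"
    mkdir -p "$target"
    tar -xf "$archive" -C "$target" --strip-components=1
}

# Demo with a stand-in archive built in a scratch directory.
work=$(mktemp -d)
mkdir -p "$work/hadoop-2.7.7/bin"
echo "placeholder" > "$work/hadoop-2.7.7/bin/hdfs"
tar -czf "$work/hadoop-2.7.7.tar.gz" -C "$work" hadoop-2.7.7
unpack_to "$work/hadoop-2.7.7.tar.gz" "$work/hadoop2"
ls "$work/hadoop2/bin"
```

With the real archives the call would look like: unpack_to hadoop-2.7.7.tar.gz /data/big-data-app/hadoop2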


### _HDFS_

yum install -y openssh-clients rsync

echo 'export HADOOP_HOME=/data/big-data-app/hadoop2
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_HOME=$HADOOP_HOME
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin' > /etc/profile.d/hadoop.sh

source /etc/profile.d/hadoop.sh

scp /etc/profile.d/hadoop.sh node1:/etc/profile.d/hadoop.sh

scp /etc/profile.d/hadoop.sh node2:/etc/profile.d/hadoop.sh

echo 'export SPARK_HOME=/data/big-data-app/spark3
export PATH=$SPARK_HOME/bin:$PATH' > /etc/profile.d/spark.sh

source /etc/profile.d/spark.sh

scp /etc/profile.d/spark.sh node1:/etc/profile.d/spark.sh

scp /etc/profile.d/spark.sh node2:/etc/profile.d/spark.sh

### _Hadoop Configuration_

cd $HADOOP_CONF_DIR


vi capacity-scheduler.xml

_replace the existing yarn.scheduler.capacity.resource-calculator property with_
```xml
<!-- DefaultResourceCalculator only uses Memory -->
<property>
    <name>yarn.scheduler.capacity.resource-calculator</name>
    <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
```

vi core-site.xml

_replace_

```xml
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://node0:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/data/hadoop</value>
    </property>
</configuration>
```
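
After editing a site file it can help to read a value back. On an installed cluster, hdfs getconf -confKey fs.defaultFS does this through Hadoop itself. The sketch below is a dependency-free alternative for a quick check; `get_prop` is a hypothetical helper that scans line by line and is not a real XML parser, so it assumes each `<name>` and `<value>` sits on a single line as in the snippets in this guide.

```shell
# get_prop is a hypothetical helper: it prints the <value> that follows a
# given <name> in a Hadoop *-site.xml file (line-based, not a full XML parser).
get_prop() {
    local file="$1" key="$2"
    awk -v key="$key" '
        /<name>/ {
            n = $0; sub(/.*<name>/, "", n); sub(/<\/name>.*/, "", n)
            found = (n == key)
        }
        found && /<value>/ {
            v = $0; sub(/.*<value>/, "", v); sub(/<\/value>.*/, "", v)
            print v; exit
        }
    ' "$file"
}

# Demo against a scratch copy of the core-site.xml content from this guide.
conf=$(mktemp)
cat > "$conf" <<'EOF'
<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://node0:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/data/hadoop</value>
    </property>
</configuration>
EOF
get_prop "$conf" fs.defaultFS
```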

vi hdfs-site.xml

_replace_

```xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>node0:50090</value>
    </property>
    <property>
        <name>dfs.namenode.secondary.https-address</name>
        <value>node0:50091</value>
    </property>
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
</configuration>
```

vi mapred-site.xml

_replace_

```xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
```

_on every node_

cp $SPARK_HOME/yarn/spark-3.1.3-yarn-shuffle.jar $HADOOP_HOME/share/hadoop/yarn/lib/

vi yarn-site.xml

_replace_

```xml
<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>node0</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>32768</value>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>16384</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle,spark_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
        <value>org.apache.spark.network.yarn.YarnShuffleService</value>
    </property>
    <property>
        <name>spark.shuffle.service.port</name>
        <value>7337</value>
    </property>
</configuration>
```
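
The yarn.scheduler.maximum-allocation-mb value caps the size of a single container, which in turn caps spark.executor.memory. The arithmetic below is a sketch that assumes Spark's default overhead rule (the larger of 384 MB and 10% of executor memory) and ignores YARN's rounding to the minimum allocation size as well as any off-heap or PySpark memory settings. `container_mb` is a hypothetical helper name.

```shell
# container_mb is a hypothetical helper. Assumption: the container request is
# executor memory plus max(384, 10% of executor memory), in whole megabytes.
container_mb() {
    local mem="$1"
    local overhead=$(( mem / 10 ))
    if [ "$overhead" -lt 384 ]; then
        overhead=384
    fi
    echo $(( mem + overhead ))
}

max_alloc=16384   # yarn.scheduler.maximum-allocation-mb from yarn-site.xml
for mem in 2048 14895 14896; do
    total=$(container_mb "$mem")
    if [ "$total" -le "$max_alloc" ]; then
        echo "executor memory ${mem}m -> ${total} MB container: fits"
    else
        echo "executor memory ${mem}m -> ${total} MB container: exceeds ${max_alloc} MB"
    fi
done
```

Under these assumptions an executor of roughly 14.5 GB is the largest that fits into the 16384 MB limit configured here.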

vi slaves

_replace the contents with_

```text
node1
node2
```

### _Spark Configuration_

echo "node1
node2" > $SPARK_HOME/conf/workers

echo "spark.eventLog.enabled true
spark.eventLog.compress true
spark.eventLog.dir hdfs:///logs
spark.yarn.historyServer.address node0:18080
" >> $SPARK_HOME/conf/spark-defaults.conf

export SPARK_HISTORY_OPTS="-Dspark.history.ui.port=18080 \
-Dspark.history.retainedApplications=3 \
-Dspark.history.fs.logDirectory=hdfs://node0:9000/logs"

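_the history server needs HDFS running and the event log directory present, so run the next commands after the Start service step_

hdfs dfs -mkdir -p /logs

cd $SPARK_HOME/sbin

./start-history-server.sh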
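
Because the settings are appended with `>>`, re-running the step leaves duplicate keys in the Spark configuration file. A minimal sketch of a replace-or-append helper follows; `set_conf` is a hypothetical name, and the demo writes to a scratch file rather than to the file under $SPARK_HOME/conf.

```shell
# set_conf is a hypothetical helper: it sets "key value" in a Spark-style
# properties file, replacing any earlier line for the same key.
set_conf() {
    local file="$1" key="$2" value="$3" tmp
    touch "$file"
    tmp=$(mktemp)
    # keep every line whose first field is not the key, then append the new pair
    awk -v k="$key" '$1 != k' "$file" > "$tmp"
    printf '%s %s\n' "$key" "$value" >> "$tmp"
    # write back through the original file so its permissions are preserved
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Demo against a scratch file.
conf=$(mktemp)
set_conf "$conf" spark.eventLog.enabled true
set_conf "$conf" spark.eventLog.dir hdfs:///logs
set_conf "$conf" spark.eventLog.dir hdfs://node0:9000/logs   # replaces the previous line
cat "$conf"
```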

### _Distribute Resources_

rsync -arv /data/big-data-app node1:/data/

rsync -arv /data/big-data-app node2:/data/

### _Format namenode_

cd $HADOOP_HOME/bin
./hdfs namenode -format

### _Start service_

cd $HADOOP_HOME/sbin

./start-all.sh

_on every node_

jps

_expected processes_

```text
node0:
NameNode
SecondaryNameNode
ResourceManager

node1:
DataNode
NodeManager

node2:
DataNode
NodeManager
```
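
The comparison against the expected list can be scripted. The sketch below assumes the usual jps output format of one "pid name" pair per line; `missing_daemons` is a hypothetical helper, and the sample output with its PIDs is made up for the demo.

```shell
# missing_daemons is a hypothetical helper: given the text printed by jps and
# a list of expected daemon names, it prints every expected name that is absent.
missing_daemons() {
    local jps_output="$1" name
    shift
    for name in "$@"; do
        # jps prints "<pid> <name>", so compare the second field exactly
        if ! printf '%s\n' "$jps_output" | awk -v n="$name" '$2 == n { ok = 1 } END { exit !ok }'; then
            echo "$name"
        fi
    done
}

# Demo with made-up jps output (the PIDs are placeholders).
sample="4021 NameNode
4310 ResourceManager
4688 Jps"
missing_daemons "$sample" NameNode SecondaryNameNode ResourceManager
```

On a worker node the call would look like: missing_daemons "$(jps)" DataNode NodeManager
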
_URLs_

```text
HDFS URL http://node0:50070
YARN URL http://node0:8088
```

### _Common Problem_

_NodeManagers fail to initialize because Hadoop cannot find spark-3.1.3-yarn-shuffle.jar; copy it on the affected node and restart the NodeManager:_

cp $SPARK_HOME/yarn/spark-3.1.3-yarn-shuffle.jar $HADOOP_HOME/share/hadoop/yarn/lib/

### _Spark Submit_

cd $SPARK_HOME

_example 1 (no --master given, so it runs locally unless spark.master is set elsewhere)_

bin/spark-submit examples/src/main/python/pi.py

_example 2_

./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode cluster examples/jars/spark-examples*.jar