## Configure Spark 3

Let us configure Spark 3 on our single-node Hadoop and Spark cluster so that Spark runs in YARN mode.

* Update **/opt/spark3/conf/spark-env.sh** with the following environment variables.

```shell
export HADOOP_HOME="/opt/hadoop"
export HADOOP_CONF_DIR="/opt/hadoop/etc/hadoop"
```
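
Spark finds YARN and HDFS through `HADOOP_CONF_DIR`, so a quick sanity check on that directory can catch a wrong path before the first job is submitted. The sketch below is not part of the original setup: `check_hadoop_conf_dir` is a hypothetical helper name, and it assumes the standard Hadoop layout in which `core-site.xml` and `yarn-site.xml` sit in the configuration directory.

```shell
# Hypothetical helper: succeed only if the given directory looks like a
# Hadoop configuration directory (assumes the standard file names).
check_hadoop_conf_dir() {
    conf_dir="$1"
    if [ ! -d "$conf_dir" ]; then
        echo "not a directory: $conf_dir" >&2
        return 1
    fi
    for f in core-site.xml yarn-site.xml; do
        if [ ! -f "$conf_dir/$f" ]; then
            echo "missing $conf_dir/$f" >&2
            return 1
        fi
    done
    echo "ok: $conf_dir"
}

# Example: check_hadoop_conf_dir "$HADOOP_CONF_DIR"
```

The function only inspects the local file system, so it can be run before any Hadoop daemon is started.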

* Update **/opt/spark3/conf/spark-defaults.conf** with the following properties.

```properties
spark.driver.extraJavaOptions     -Dderby.system.home=/tmp/derby/
spark.sql.repl.eagerEval.enabled  true
spark.master                      yarn
spark.eventLog.enabled            true
spark.eventLog.dir                hdfs:///spark3-logs
spark.history.provider            org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory     hdfs:///spark3-logs
spark.history.fs.update.interval  10s
spark.history.ui.port             18080
spark.yarn.historyServer.address  localhost:18080
spark.yarn.jars                   hdfs:///spark3-jars/*.jar
```
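
The History Server can only list applications whose event logs are written to the directory it reads, which is why `spark.eventLog.dir` and `spark.history.fs.logDirectory` carry the same value above. As a rough sketch, the two settings could be compared with a small script. The function names `spark_conf_value` and `check_event_log_dirs` are hypothetical, and the parser assumes the simple whitespace-separated `key value` layout used in this file.

```shell
# Hypothetical helper: print the value of a property from a
# spark-defaults.conf-style file (whitespace-separated "key value" lines).
spark_conf_value() {
    while read -r key value || [ -n "$key" ]; do
        if [ "$key" = "$2" ]; then
            printf '%s\n' "$value"
            return 0
        fi
    done < "$1"
    return 1
}

# Succeed only when both log directory settings exist and are identical.
check_event_log_dirs() {
    event_dir="$(spark_conf_value "$1" spark.eventLog.dir)" || return 1
    history_dir="$(spark_conf_value "$1" spark.history.fs.logDirectory)" || return 1
    [ "$event_dir" = "$history_dir" ]
}

# Example: check_event_log_dirs /opt/spark3/conf/spark-defaults.conf
```

A plain string comparison is used on purpose: two different spellings of the same HDFS location would be reported as a mismatch, which is a conservative choice for a setup check.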

* Update **/opt/hive/conf/hive-site.xml** with the following setting.

```xml
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>
```

* Create HDFS directories for the event logs and the Spark jars, then copy the Spark jars to the HDFS folder referenced by **spark.yarn.jars**.

```shell
hdfs dfs -mkdir /spark3-jars
hdfs dfs -mkdir /spark3-logs

hdfs dfs -put /opt/spark3/jars/* /spark3-jars
```
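
Running `hdfs dfs -mkdir` a second time fails because the directory already exists, and `-put` refuses to overwrite files that are already in HDFS. A re-runnable variant of the step above might look like the sketch below. The function names `ensure_hdfs_dir` and `upload_spark_jars` are hypothetical. The flags used (`-test -d`, `-mkdir -p`, `-put -f`) are standard `hdfs dfs` options, and the upload is narrowed to `*.jar` to match the glob in **spark.yarn.jars**.

```shell
# Hypothetical helper: create an HDFS directory only if it is missing,
# so the step can be repeated without errors.
ensure_hdfs_dir() {
    hdfs dfs -test -d "$1" || hdfs dfs -mkdir -p "$1"
}

# Hypothetical helper: upload local jars to an HDFS directory.
# -f overwrites jars that already exist in the target directory.
upload_spark_jars() {
    ensure_hdfs_dir "$2" && hdfs dfs -put -f "$1"/*.jar "$2"
}

# Example (paths from this guide):
#   ensure_hdfs_dir /spark3-logs
#   upload_spark_jars /opt/spark3/jars /spark3-jars
```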

* By default, Spark cannot access Hive Metastore databases and tables. Perform the steps below to integrate Spark with the Hive Metastore.
  * Create a soft link to **hive-site.xml** in the Spark conf folder.
  * Install the **Postgres JDBC** jar in the Spark jars folder.

```shell
sudo ln -s /opt/hive/conf/hive-site.xml /opt/spark3/conf/
sudo wget https://jdbc.postgresql.org/download/postgresql-42.2.19.jar \
    -O /opt/spark3/jars/postgresql-42.2.19.jar
```
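
Both steps are easy to get subtly wrong: a soft link can point at a file that does not exist, and a failed download can leave no jar behind. The following sketch checks for both conditions. `check_hive_integration` is a hypothetical helper name, and it assumes the Spark home layout used above (`conf/` and `jars/` subfolders).

```shell
# Hypothetical helper: verify the two Hive integration steps for a Spark home.
check_hive_integration() {
    spark_home="$1"
    link="$spark_home/conf/hive-site.xml"
    # -e follows symlinks, so a dangling link fails this test.
    if [ ! -e "$link" ]; then
        echo "hive-site.xml is missing or a broken link: $link" >&2
        return 1
    fi
    # At least one PostgreSQL JDBC jar must be present in the jars folder.
    for jar in "$spark_home"/jars/postgresql-*.jar; do
        if [ -f "$jar" ]; then
            echo "ok: $jar"
            return 0
        fi
    done
    echo "no postgresql-*.jar in $spark_home/jars" >&2
    return 1
}

# Example: check_hive_integration /opt/spark3
```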

******************************************************************************************************************

**My config**

* Update **$SPARK_HOME/conf/spark-env.sh** with the following environment variables.

```shell
# Use $HOME rather than ~: the shell does not expand ~ inside double quotes.
export HADOOP_HOME="$HOME/.sdkman/candidates/hadoop/current"
export HADOOP_CONF_DIR="$HOME/.sdkman/candidates/hadoop/current/etc/hadoop"
```

* Update **$SPARK_HOME/conf/spark-defaults.conf** with the following properties.

```properties
spark.driver.extraJavaOptions     -Dderby.system.home=/tmp/derby/
spark.sql.repl.eagerEval.enabled  true
spark.master                      yarn
spark.eventLog.enabled            true
spark.eventLog.dir                hdfs://0.0.0.0:9000/spark3-logs
spark.history.provider            org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory     hdfs://0.0.0.0:9000/spark3-logs
spark.history.fs.update.interval  10s
spark.history.ui.port             18080
spark.yarn.historyServer.address  localhost:18080
spark.yarn.jars                   hdfs://0.0.0.0:9000/spark3-jars/*.jar
```

* Update **/opt/hive/conf/hive-site.xml** with the following setting.

```xml
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
  </property>
```

**Practice on my machine**

In [3]:
!pwd


/home/nghiaht7/data-engineer/data-engineering-essentials/06_setup_bigdata_ecosystem/02_setup_hive_and_spark


In [3]:
!source /home/nghiaht7/data-engineer/data-engineering-essentials/start-all.sh

Starting namenodes on [0.0.0.0]
Starting datanodes
Starting secondary namenodes [justdoit]
Starting resourcemanager
Starting nodemanagers
cluster_util_db
372182 NodeManager
371585 DataNode
371440 NameNode
372355 Jps
371820 SecondaryNameNode
355896 RunJar
372027 ResourceManager
CONTAINER ID   IMAGE      COMMAND                  CREATED             STATUS             PORTS                                       NAMES
d64410874ee6   postgres   "docker-entrypoint.s…"   About an hour ago   Up About an hour   0.0.0.0:6432->5432/tcp, :::6432->5432/tcp   cluster_util_db


In [1]:
!hdfs dfs -ls -R / 

drwxr-xr-x   - nghiaht7 supergroup          0 2021-08-24 17:10 /public
drwxr-xr-x   - nghiaht7 supergroup          0 2021-08-24 17:10 /public/retail_db
drwxr-xr-x   - nghiaht7 supergroup          0 2021-08-24 17:10 /public/retail_db/categories
-rw-r--r--   1 nghiaht7 supergroup       1029 2021-08-24 17:10 /public/retail_db/categories/part-00000
drwxr-xr-x   - nghiaht7 supergroup          0 2021-08-24 17:10 /public/retail_db/customers
-rw-r--r--   1 nghiaht7 supergroup     953719 2021-08-24 17:10 /public/retail_db/customers/part-00000
drwxr-xr-x   - nghiaht7 supergroup          0 2021-08-24 17:10 /public/retail_db/order_items
-rw-r--r--   1 nghiaht7 supergroup    5408880 2021-08-24 17:10 /public/retail_db/order_items/part-00000
drwxr-xr-x   - nghiaht7 supergroup          0 2021-08-24 17:10 /public/retail_db/orders
-rw-r--r--   1 nghiaht7 supergroup    2999944 2021-08-24 17:10 /public/retail_db/orders/part-00000
-rw-r--r--   1 nghiaht7 supergroup         60 2021-08-24 17:10 /public/retai

In [2]:
!hdfs dfs -mkdir /spark3-jars

In [4]:
!hdfs dfs -mkdir /spark3-logs

In [6]:
!hdfs dfs -ls -R / |grep spark

drwxr-xr-x   - nghiaht7 supergroup          0 2021-08-24 17:30 /spark3-jars
drwxr-xr-x   - nghiaht7 supergroup          0 2021-08-24 17:30 /spark3-logs


In [7]:
!hdfs dfs -put $SPARK_HOME/jars/* /spark3-jars

In [8]:
!hdfs dfs -ls -R / | grep jar

drwxr-xr-x   - nghiaht7 supergroup          0 2021-08-24 17:32 /spark3-jars
-rw-r--r--   1 nghiaht7 supergroup     136363 2021-08-24 17:31 /spark3-jars/HikariCP-2.5.1.jar
-rw-r--r--   1 nghiaht7 supergroup     232470 2021-08-24 17:32 /spark3-jars/JLargeArrays-1.5.jar
-rw-r--r--   1 nghiaht7 supergroup    1175798 2021-08-24 17:32 /spark3-jars/JTransforms-3.1.jar
-rw-r--r--   1 nghiaht7 supergroup     386529 2021-08-24 17:32 /spark3-jars/RoaringBitmap-0.9.0.jar
-rw-r--r--   1 nghiaht7 supergroup     236660 2021-08-24 17:32 /spark3-jars/ST4-4.0.4.jar
-rw-r--r--   1 nghiaht7 supergroup      30035 2021-08-24 17:31 /spark3-jars/accessors-smart-1.2.jar
-rw-r--r--   1 nghiaht7 supergroup      69409 2021-08-24 17:31 /spark3-jars/activation-1.1.1.jar
-rw-r--r--   1 nghiaht7 supergroup     134044 2021-08-24 17:31 /spark3-jars/aircompressor-0.10.jar
-rw-r--r--   1 nghiaht7 supergroup    1168113 2021-08-24 17:31 /spark3-jars/algebra_2.12-2.0.0-M2.jar
-rw-r--r--   1 nghiaht7 supergroup     167761 20

-rw-r--r--   1 nghiaht7 supergroup     200223 2021-08-24 17:31 /spark3-jars/hk2-api-2.6.1.jar
-rw-r--r--   1 nghiaht7 supergroup     203358 2021-08-24 17:31 /spark3-jars/hk2-locator-2.6.1.jar
-rw-r--r--   1 nghiaht7 supergroup     131590 2021-08-24 17:31 /spark3-jars/hk2-utils-2.6.1.jar
-rw-r--r--   1 nghiaht7 supergroup    1502280 2021-08-24 17:31 /spark3-jars/htrace-core4-4.1.0-incubating.jar
-rw-r--r--   1 nghiaht7 supergroup     767140 2021-08-24 17:31 /spark3-jars/httpclient-4.5.6.jar
-rw-r--r--   1 nghiaht7 supergroup     328347 2021-08-24 17:31 /spark3-jars/httpcore-4.4.12.jar
-rw-r--r--   1 nghiaht7 supergroup      27156 2021-08-24 17:31 /spark3-jars/istack-commons-runtime-3.0.8.jar
-rw-r--r--   1 nghiaht7 supergroup    1282424 2021-08-24 17:31 /spark3-jars/ivy-2.4.0.jar
-rw-r--r--   1 nghiaht7 supergroup      67889 2021-08-24 17:31 /spark3-jars/jackson-annotations-2.10.0.jar
-rw-r--r--   1 nghiaht7 supergroup     348635 2021-08-24 17:31 /spark3-jars/jackson-core-2.10.0.jar
-rw

-rw-r--r--  10 nghiaht7 supergroup   40623961 2021-08-24 17:30 /tmp/hadoop-yarn/staging/nghiaht7/.staging/job_1629797692104_0002/job.jar


In [9]:
!hdfs dfs -ls -R / | grep hive

-rw-r--r--   1 nghiaht7 supergroup     183472 2021-08-24 17:31 /spark3-jars/hive-beeline-2.3.7.jar
-rw-r--r--   1 nghiaht7 supergroup      43387 2021-08-24 17:31 /spark3-jars/hive-cli-2.3.7.jar
-rw-r--r--   1 nghiaht7 supergroup     436980 2021-08-24 17:31 /spark3-jars/hive-common-2.3.7.jar
-rw-r--r--   1 nghiaht7 supergroup   10839104 2021-08-24 17:31 /spark3-jars/hive-exec-2.3.7-core.jar
-rw-r--r--   1 nghiaht7 supergroup     116311 2021-08-24 17:31 /spark3-jars/hive-jdbc-2.3.7.jar
-rw-r--r--   1 nghiaht7 supergroup     326419 2021-08-24 17:31 /spark3-jars/hive-llap-common-2.3.7.jar
-rw-r--r--   1 nghiaht7 supergroup    8194428 2021-08-24 17:31 /spark3-jars/hive-metastore-2.3.7.jar
-rw-r--r--   1 nghiaht7 supergroup     916206 2021-08-24 17:31 /spark3-jars/hive-serde-2.3.7.jar
-rw-r--r--   1 nghiaht7 supergroup    1679364 2021-08-24 17:31 /spark3-jars/hive-service-rpc-3.1.2.jar
-rw-r--r--   1 nghiaht7 supergroup      54116 2021-08-24 17:31 /spark3-jars/hive-shims-0.23-2.3.7.jar
-rw-r