# Spark JDBC to Databases

- [Overview](#spark-jdbc-overview)
- [Setup](#spark-jdbc-setup)
  - [Define Environment Variables](#spark-jdbc-define-envir-vars)
  - [Initiate a Spark JDBC Session](#spark-jdbc-init-session)
    - [Load Driver Packages Dynamically](#spark-jdbc-init-dynamic-pkg-load)
    - [Load Driver Packages Locally](#spark-jdbc-init-local-pkg-load)
- [Connect to Databases Using Spark JDBC](#spark-jdbc-connect-to-dbs)
  - [Connect to a MySQL Database](#spark-jdbc-to-mysql)
    - [Connect to a Public MySQL Instance](#spark-jdbc-to-mysql-public)
    - [Connect to a Test or Temporary MySQL Instance](#spark-jdbc-to-mysql-test-or-temp)
  - [Connect to a PostgreSQL Database](#spark-jdbc-to-postgresql)
  - [Connect to an Oracle Database](#spark-jdbc-to-oracle)
  - [Connect to an MS SQL Server Database](#spark-jdbc-to-ms-sql-server)
  - [Connect to a Redshift Database](#spark-jdbc-to-redshift)
- [Cleanup](#spark-jdbc-cleanup)
  - [Delete Data](#spark-jdbc-delete-data)
  - [Release Spark Resources](#spark-jdbc-release-spark-resources)

<a id="spark-jdbc-overview"></a>
## Overview

Spark SQL includes a data source that can read data from other databases using Java database connectivity (**JDBC**).
The results are returned as a Spark DataFrame that can easily be processed in Spark SQL or joined with other data sources.
For more information, see the [Spark documentation](https://spark.apache.org/docs/2.3.1/sql-programming-guide.html#jdbc-to-other-databases).

<a id="spark-jdbc-setup"></a>
## Setup

<a id="spark-jdbc-define-envir-vars"></a>
### Define Environment Variables

Begin by initializing some environment variables.

> **Note:** You need to edit the following code to assign valid values to the database variables (`DB_XXX`).

In [1]:
import os

# Read Iguazio Data Science Platform ("the platform") environment variables into local variables
V3IO_USER = os.getenv('V3IO_USERNAME')
V3IO_HOME = os.getenv('V3IO_HOME')
V3IO_HOME_URL = os.getenv('V3IO_HOME_URL')

# Define database environment variables
# TODO: Edit the variable definitions to assign valid values for your environment.
%env DB_HOST = ""        # Database host as a fully qualified name (FQN)
%env DB_PORT = ""        # Database port number
%env DB_DRIVER = ""      # Database driver [mysql|postgresql|oracle:thin|sqlserver]
%env DB_Name = ""        # Database|schema name
%env DB_TABLE = ""       # Table name
%env DB_USER = ""        # Database username
%env DB_PASSWORD = ""    # Database user password

os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages mysql:mysql-connector-java:5.1.39 pyspark-shell"

env: DB_HOST=""        # Database host's fully qualified name
env: DB_PORT=""        # Port num of the database
env: DB_DRIVER=""      # Database Driver [postgresql|mysql|oracle:thin|sqlserver]
env: DB_Name=""        # Database|Schema Name
env: DB_TABLE=""       # Table Name
env: DB_USER=""        # Database User Name
env: DB_PASSWORD=""    # Database User's Password
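The `DB_XXX` values are typically combined into a JDBC connection URL before they are passed to Spark. The following is a minimal sketch of that step, not part of the original notebook: `build_jdbc_url` is a hypothetical helper, and the URL shapes are the commonly used forms for each driver, so check them against your driver's documentation.

```python
def build_jdbc_url(driver, host, port, name):
    """Compose a JDBC URL from connection parts (hypothetical helper)."""
    if driver == "oracle:thin":
        # Oracle thin driver, service-name form
        return f"jdbc:oracle:thin:@//{host}:{port}/{name}"
    if driver == "sqlserver":
        # SQL Server takes the database name as a connection property
        return f"jdbc:sqlserver://{host}:{port};databaseName={name}"
    if driver in ("mysql", "postgresql"):
        return f"jdbc:{driver}://{host}:{port}/{name}"
    raise ValueError(f"Unsupported driver: {driver!r}")


# Placeholder values for illustration only
print(build_jdbc_url("mysql", "db.example.com", 3306, "db1"))
```

In a notebook you would pass values read with `os.getenv` instead of the placeholder literals used here.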


<a id="spark-jdbc-init-session"></a>
### Initiate a Spark JDBC Session

You can select between two methods for initiating a Spark session with JDBC drivers ("Spark JDBC session"):

- [Load Driver Packages Dynamically](#spark-jdbc-init-dynamic-pkg-load) (preferred)
- [Load Driver Packages Locally](#spark-jdbc-init-local-pkg-load)

<a id="spark-jdbc-init-dynamic-pkg-load"></a>
#### Load Driver Packages Dynamically

The preferred method for initiating a Spark JDBC session is to load the required JDBC driver packages dynamically from https://spark-packages.org/ by doing the following:

1. Set the `PYSPARK_SUBMIT_ARGS` environment variable to `"--packages <group>:<name>:<version> pyspark-shell"`.
2. Initiate a new Spark session.

The following example demonstrates how to initiate a Spark session that uses version 5.1.39 of the **mysql-connector-java** MySQL JDBC database driver (`mysql:mysql-connector-java:5.1.39`).

In [None]:
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

# Configure the Spark JDBC driver package
# TODO: Replace `mysql:mysql-connector-java:5.1.39` with the required driver-package information.
os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages mysql:mysql-connector-java:5.1.39 pyspark-shell"

# Initiate a new Spark session; you can change the application name
spark = SparkSession.builder.appName("Spark JDBC tutorial").getOrCreate()
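If a session needs more than one driver, `--packages` accepts a comma-separated list of coordinates. The sketch below shows one way to assemble the value; `packages_submit_args` is a hypothetical helper, and the PostgreSQL coordinate is only an illustrative assumption, so substitute the coordinates and versions that you actually need.

```python
import os


def packages_submit_args(packages):
    """Build a PYSPARK_SUBMIT_ARGS value for one or more package coordinates."""
    if not packages:
        raise ValueError("At least one package coordinate is required")
    for coord in packages:
        # Each coordinate must have the form <group>:<name>:<version>
        if len(coord.split(":")) != 3:
            raise ValueError(f"Expected <group>:<name>:<version>, got {coord!r}")
    return "--packages " + ",".join(packages) + " pyspark-shell"


os.environ["PYSPARK_SUBMIT_ARGS"] = packages_submit_args([
    "mysql:mysql-connector-java:5.1.39",
    "org.postgresql:postgresql:42.2.5",  # illustrative coordinate (assumption)
])
print(os.environ["PYSPARK_SUBMIT_ARGS"])
```

The variable must be set before the first Spark session is created in the process, because it is read when the JVM is launched.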

<a id="spark-jdbc-init-local-pkg-load"></a>
#### Load Driver Packages Locally

You can also load the Spark JDBC driver package from the local file system of your Iguazio Data Science Platform ("the platform").
It's recommended that you use this method only if you don't have an internet connection ("dark-site installations") or if there's no official Spark package for your database.
The platform comes pre-deployed with MySQL, PostgreSQL, Oracle, Redshift, and MS SQL Server JDBC driver packages, which are found in the **/spark/3rd_party** directory (**$SPARK_HOME/3rd_party**).
You can also copy additional driver packages or different versions of the pre-deployed drivers to the platform &mdash; for example, from the **Data** dashboard page.

To load a JDBC driver package locally, you need to set the `spark.driver.extraClassPath` and `spark.executor.extraClassPath` Spark configuration properties to the path to a Spark JDBC driver package in the platform's file system.
You can do this using either of the following alternative methods:

- Preconfigure the path to the driver package &mdash;

  1. In your Spark-configuration file &mdash; **$SPARK_HOME/conf/spark-defaults.conf** &mdash; set the `extraClassPath` configuration properties to the path to the relevant driver package:
    ```properties
    spark.driver.extraClassPath    <path to a JDBC driver package>
    spark.executor.extraClassPath  <path to a JDBC driver package>
    ```
  2. Initiate a new Spark session.

- Configure the path to the driver package as part of the initiation of a new Spark session:
  ```python
  spark = SparkSession.builder. \
    appName("<app name>"). \
    config("spark.driver.extraClassPath", "<path to a JDBC driver package>"). \
    config("spark.executor.extraClassPath", "<path to a JDBC driver package>"). \
    getOrCreate()
  ```

The following example demonstrates how to initiate a Spark session that uses the pre-deployed version 8.0.13 of the **mysql-connector-java** MySQL JDBC database driver (**/spark/3rd_party/mysql-connector-java-8.0.13.jar**).

In [None]:
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

# METHOD I
# Edit your Spark configuration file ($SPARK_HOME/conf/spark-defaults.conf) and set the `spark.driver.extraClassPath` and
# `spark.executor.extraClassPath` properties to the local file-system path to a pre-deployed Spark JDBC driver package.
# Replace "/spark/3rd_party/mysql-connector-java-8.0.13.jar" with the relevant path.
#     spark.driver.extraClassPath    /spark/3rd_party/mysql-connector-java-8.0.13.jar
#     spark.executor.extraClassPath  /spark/3rd_party/mysql-connector-java-8.0.13.jar
#
# Then, initiate a new Spark session; you can change the application name.
# spark = SparkSession.builder.appName("Spark JDBC tutorial").getOrCreate()

# METHOD II
# Initiate a new Spark session; you can change the application name.
# Set the same `extraClassPath` configuration properties as in Method I as part of the initiation command.
# Replace "/spark/3rd_party/mysql-connector-java-8.0.13.jar" with the relevant
# local file-system path to a pre-deployed Spark JDBC driver package.
spark = SparkSession.builder. \
    appName("Spark JDBC tutorial"). \
    config("spark.driver.extraClassPath", "/spark/3rd_party/mysql-connector-java-8.0.13.jar"). \
    config("spark.executor.extraClassPath", "/spark/3rd_party/mysql-connector-java-8.0.13.jar"). \
    getOrCreate()
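When several local driver JARs are needed, both `extraClassPath` properties take a classpath string in which the entries are joined by the platform path separator. The following sketch builds the two settings from a list of paths; `extra_classpath_conf` is a hypothetical helper, and the second JAR path is a placeholder rather than a file that is known to exist on the platform.

```python
import os


def extra_classpath_conf(jar_paths):
    """Return the driver and executor classpath settings for a list of JAR paths."""
    classpath = os.pathsep.join(jar_paths)  # ":" on Linux
    return {
        "spark.driver.extraClassPath": classpath,
        "spark.executor.extraClassPath": classpath,
    }


settings = extra_classpath_conf([
    "/spark/3rd_party/mysql-connector-java-8.0.13.jar",
    "/spark/3rd_party/another-driver.jar",  # placeholder path (assumption)
])
for key, value in settings.items():
    print(key, "=", value)

# With PySpark available, the settings could be applied like this:
#     builder = SparkSession.builder.appName("Spark JDBC tutorial")
#     for key, value in settings.items():
#         builder = builder.config(key, value)
#     spark = builder.getOrCreate()
```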

In [3]:
import pprint

# Verify your configuration: run the following code to list the current Spark configurations, and check the output to verify that the
# `spark.driver.extraClassPath` and `spark.executor.extraClassPath` properties are set to the correct local driver-package path.
conf = spark.sparkContext.getConf().getAll()

pprint.pprint(conf)

[('spark.sql.catalogImplementation', 'in-memory'),
 ('spark.driver.extraLibraryPath', '/hadoop/etc/hadoop'),
 ('spark.app.id', 'app-20190704070308-0001'),
 ('spark.executor.memory', '2G'),
 ('spark.executor.id', 'driver'),
 ('spark.jars',
  'file:///spark/v3io-libs/v3io-hcfs_2.11.jar,file:///spark/v3io-libs/v3io-spark2-object-dataframe_2.11.jar,file:///spark/v3io-libs/v3io-spark2-streaming_2.11.jar,file:///igz/.ivy2/jars/mysql_mysql-connector-java-5.1.39.jar'),
 ('spark.cores.max', '4'),
 ('spark.executorEnv.V3IO_ACCESS_KEY', 'bb79fffa-7582-4fd2-9347-a350335801fc'),
 ('spark.driver.extraClassPath',
  '/spark/3rd_party/mysql-connector-java-8.0.13.jar'),
 ('spark.executor.extraJavaOptions', '"-Dsun.zip.disableMemoryMapping=true"'),
 ('spark.driver.port', '33751'),
 ('spark.driver.host', '10.233.92.91'),
 ('spark.executor.extraLibraryPath', '/hadoop/etc/hadoop'),
 ('spark.submit.pyFiles',
  '/igz/.ivy2/jars/mysql_mysql-connector-java-5.1.39.jar'),
 ('spark.app.name', 'Spark JDBC tutorial'
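Instead of scanning the printed list by eye, the check can be done programmatically. The sketch below assumes a list of `(key, value)` pairs in the same shape as the configuration output; `check_classpath` is a hypothetical helper, and the sample list is hand-written for illustration, not taken from a live session.

```python
def check_classpath(conf_pairs, expected_path):
    """Report whether each extraClassPath property equals the expected path."""
    conf = dict(conf_pairs)
    keys = ("spark.driver.extraClassPath", "spark.executor.extraClassPath")
    return {key: conf.get(key) == expected_path for key in keys}


# Hand-written sample in the same shape as the configuration list
sample_conf = [
    ("spark.app.name", "Spark JDBC tutorial"),
    ("spark.driver.extraClassPath",
     "/spark/3rd_party/mysql-connector-java-8.0.13.jar"),
]
result = check_classpath(
    sample_conf, "/spark/3rd_party/mysql-connector-java-8.0.13.jar")
print(result)
```

A `False` entry means the property is either missing or points to a different path. With a live session, pass the list returned by the configuration call in place of `sample_conf`.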

<a id="spark-jdbc-connect-to-dbs"></a>
## Connect to Databases Using Spark JDBC

<a id="spark-jdbc-to-mysql"></a>
### Connect to a MySQL Database

- [Connect to a Public MySQL Instance](#spark-jdbc-to-mysql-public)
- [Connect to a Test or Temporary MySQL Instance](#spark-jdbc-to-mysql-test-or-temp)

<a id="spark-jdbc-to-mysql-public"></a>
#### Connect to a Public MySQL Instance

In [4]:
# Load data from a JDBC source
dfMySQL = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:mysql://mysql-rfam-public.ebi.ac.uk:4497/Rfam") \
    .option("dbtable", "Rfam.family") \
    .option("user", "rfamro") \
    .option("password", "") \
    .option("driver", "com.mysql.jdbc.Driver") \
    .load()

dfMySQL.show()

+--------+-------------+---------+--------------------+--------------------+--------------------+----------------+--------------+------------+--------------------+--------------------+------------------+--------------------+--------------------+--------+--------+--------------+----------+--------------------+--------------------+-----------------+--------------------+---------------+--------+------------+---------+------------+--------------+----+----+---------------+-------+----------+-------------------+-------------------+
|rfam_acc|      rfam_id|auto_wiki|         description|              author|         seed_source|gathering_cutoff|trusted_cutoff|noise_cutoff|             comment|         previous_id|           cmbuild|         cmcalibrate|            cmsearch|num_seed|num_full|num_genome_seq|num_refseq|                type|    structure_source|number_of_species|number_3d_structures|num_pseudonokts|tax_seed|ecmli_lambda| ecmli_mu|ecmli_cal_db|ecmli_cal_hits|maxl|clen|match_pair_n

<a id="spark-jdbc-to-mysql-test-or-temp"></a>
#### Connect to a Test or Temporary MySQL Instance

> **Note:** The following code won't work if the MySQL instance has been shut down.

In [None]:
dfMySQL = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:mysql://172.31.33.215:3306/db1") \
    .option("dbtable", "db1.fruit") \
    .option("user", "root") \
    .option("password", "my-secret-pw") \
    .option("driver", "com.mysql.jdbc.Driver") \
    .load()

dfMySQL.show()
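Hard-coding a password in a notebook cell is easy to leak. As a sketch of one alternative, the options can be collected in a dictionary that takes the password from an environment variable. `jdbc_options` is a hypothetical helper, and it assumes that `DB_PASSWORD` holds only the password value.

```python
import os


def jdbc_options(url, table, user, driver=None, password_var="DB_PASSWORD"):
    """Build JDBC reader options, reading the password from the environment."""
    options = {
        "url": url,
        "dbtable": table,
        "user": user,
        "password": os.environ.get(password_var, ""),
    }
    if driver:
        options["driver"] = driver
    return options


opts = jdbc_options("jdbc:mysql://172.31.33.215:3306/db1", "db1.fruit", "root",
                    driver="com.mysql.jdbc.Driver")
print(sorted(opts))  # print the option names only, not the password

# With an active Spark session:
#     dfMySQL = spark.read.format("jdbc").options(**opts).load()
```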

<a id="spark-jdbc-to-postgresql"></a>
### Connect to a PostgreSQL Database

In [None]:
# Load data from a JDBC source
dfPS = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql:dbserver") \
    .option("dbtable", "schema.tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .load()

dfPS2 = spark.read \
    .jdbc("jdbc:postgresql:dbserver", "schema.tablename",
          properties={"user": "username", "password": "password"})

# Specify DataFrame column data types on read
dfPS3 = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql:dbserver") \
    .option("dbtable", "schema.tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .option("customSchema", "id DECIMAL(38, 0), name STRING") \
    .load()

# Save data to a JDBC source
dfPS.write \
    .format("jdbc") \
    .option("url", "jdbc:postgresql:dbserver") \
    .option("dbtable", "schema.tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .save()

dfPS2.write \
    .jdbc("jdbc:postgresql:dbserver", "schema.tablename",
          properties={"user": "username", "password": "password"})

# Specify create table column data types on write
dfPS.write \
    .option("createTableColumnTypes", "name CHAR(64), comments VARCHAR(1024)") \
    .jdbc("jdbc:postgresql:dbserver", "schema.tablename", properties={"user": "username", "password": "password"})
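The `customSchema` and `createTableColumnTypes` options both take a comma-separated string of `column type` pairs. When the column list is long or generated, it can be easier to build that string from data. The following is a small sketch; `schema_ddl` is a hypothetical helper.

```python
def schema_ddl(columns):
    """Join (column, type) pairs into a 'col TYPE, col TYPE' string."""
    return ", ".join(f"{name} {sql_type}" for name, sql_type in columns)


custom_schema = schema_ddl([("id", "DECIMAL(38, 0)"), ("name", "STRING")])
create_types = schema_ddl([("name", "CHAR(64)"), ("comments", "VARCHAR(1024)")])

print(custom_schema)  # id DECIMAL(38, 0), name STRING
print(create_types)   # name CHAR(64), comments VARCHAR(1024)
```

The resulting strings can be passed as the option values in the read and write calls of the preceding cell.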

<a id="spark-jdbc-to-oracle"></a>
### Connect to an Oracle Database

In [None]:
# Read a table from Oracle (table: hr.emp)
dfORA = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:oracle:thin:username/password@//hostname:portnumber/SID") \
    .option("dbtable", "hr.emp") \
    .option("user", "db_user_name") \
    .option("password", "password") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .load()

dfORA.printSchema()

dfORA.show()

# Read a query from Oracle
query = "(select empno,ename,dname from emp, dept where emp.deptno = dept.deptno) emp"

dfORA1 = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:oracle:thin:username/password@//hostname:portnumber/SID") \
    .option("dbtable", query) \
    .option("user", "db_user_name") \
    .option("password", "password") \
    .option("driver", "oracle.jdbc.driver.OracleDriver") \
    .load()

dfORA1.printSchema()

dfORA1.show()
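When a query is passed through `dbtable`, Spark places the value in a `FROM` clause, so the query must be wrapped in parentheses and given an alias, as in the `query` string of the preceding cell. A minimal sketch of that wrapping follows; `as_dbtable_subquery` is a hypothetical helper. It omits the `AS` keyword because Oracle does not accept `AS` before a table alias.

```python
def as_dbtable_subquery(sql, alias):
    """Wrap a SELECT statement so it can be used as a `dbtable` value."""
    sql = sql.strip().rstrip(";").rstrip()  # a trailing semicolon would break the subquery
    return f"({sql}) {alias}"


query = as_dbtable_subquery(
    "select empno, ename, dname from emp, dept where emp.deptno = dept.deptno;",
    "emp_dept",
)
print(query)
```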

<a id="spark-jdbc-to-ms-sql-server"></a>
### Connect to an MS SQL Server Database

In [None]:
# Read a table from MS SQL Server
dfMS = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:sqlserver://hostname:portnumber;databaseName=DB") \
    .option("dbtable", "db_table_name") \
    .option("user", "db_user_name") \
    .option("password", "password") \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .load()

dfMS.printSchema()

dfMS.show()

<a id="spark-jdbc-to-redshift"></a>
### Connect to a Redshift Database

In [None]:
# Read data from a table
dfRS = spark.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
    .option("dbtable", "my_table") \
    .option("tempdir", "s3n://path/for/temp/data") \
    .load()

# Read data from a query
dfRS = spark.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
    .option("query", "select x, count(*) from my_table group by x") \
    .option("tempdir", "s3n://path/for/temp/data") \
    .load()

# Write data back to a table
dfRS.write \
  .format("com.databricks.spark.redshift") \
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
  .option("dbtable", "my_table_copy") \
  .option("tempdir", "s3n://path/for/temp/data") \
  .mode("error") \
  .save()

# Use IAM role-based authentication
dfRS.write \
  .format("com.databricks.spark.redshift") \
  .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
  .option("dbtable", "my_table_copy") \
  .option("tempdir", "s3n://path/for/temp/data") \
  .option("aws_iam_role", "arn:aws:iam::123456789000:role/redshift_iam_role") \
  .mode("error") \
  .save()

<a id="spark-jdbc-cleanup"></a>
## Cleanup

Prior to exiting, release disk space, computation, and memory resources consumed by the active session:

- [Delete Data](#spark-jdbc-delete-data)
- [Release Spark Resources](#spark-jdbc-release-spark-resources)

<a id="spark-jdbc-delete-data"></a>
### Delete Data

You can optionally delete any of the directories or files that you created.
See the instructions in the [Creating and Deleting Container Directories](https://www.iguazio.com/docs/tutorials/latest-release/getting-started/containers/#create-delete-container-dirs) tutorial.
For example, the following code uses a local file-system command to delete a **&lt;running user&gt;/examples/spark-jdbc** directory in the "users" container.
Edit the path, as needed, then remove the comment mark (`#`) and run the code.

In [None]:
# !rm -rf /User/examples/spark-jdbc/
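The same cleanup can be done from Python rather than through a shell command. The sketch below uses the standard library only; `delete_dir` is a hypothetical helper, and the demonstration runs against a temporary directory so that nothing under **/User** is touched.

```python
import os
import shutil
import tempfile


def delete_dir(path):
    """Delete a directory tree if it exists; return True if it was removed."""
    if os.path.isdir(path):
        shutil.rmtree(path)
        return True
    return False


# Demonstrate on a throwaway directory
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "data.csv"), "w").close()
removed = delete_dir(demo_dir)
print(removed, os.path.exists(demo_dir))
```

To delete the tutorial directory, call the function with the edited path in place of `demo_dir`.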

<a id="spark-jdbc-release-spark-resources"></a>
### Release Spark Resources

When you're done, run the following command to stop your Spark session and release its computation and memory resources:

In [None]:
spark.stop()