
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

<img src="https://files.training.databricks.com/images/Apache-Spark-Logo_TM_200px.png" style="float: left; margin: 20px"/>

# Connecting to JDBC

Apache Spark&trade; and Databricks&reg; allow you to connect to a number of data stores using JDBC.

## In this lesson you:
* Read data from a JDBC connection 
* Parallelize your read operation to leverage distributed computation

## Audience
* Primary Audience: Data Engineers
* Additional Audiences: Data Scientists and Data Pipeline Engineers

## Prerequisites
* Web browser: Chrome

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Setup & Classroom-Cleanup<br>

For each lesson to execute correctly, please make sure to run the **`Classroom-Setup`** cell at the start of each lesson (see the next cell) and the **`Classroom-Cleanup`** cell at the end of each lesson.

In [4]:
%run "./Includes/Classroom-Setup"

<iframe  
src="//fast.wistia.net/embed/iframe/i07uvaoqgh?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/i07uvaoqgh?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

## Java Database Connectivity

Java Database Connectivity (JDBC) is an application programming interface (API) that defines how Java applications connect to databases.  Spark is written in Scala, which runs on the Java Virtual Machine (JVM).  This makes JDBC the preferred method for connecting to data whenever possible. Hadoop and Hive run on the JVM, and databases such as MySQL and PostgreSQL provide JDBC drivers, so all of them interface easily with Spark clusters.

Databases are advanced technologies that benefit from decades of research and development. To leverage the inherent efficiencies of database engines, Spark uses an optimization called predicate pushdown.  **Predicate pushdown uses the database itself to handle certain parts of a query (the predicates).**  In mathematics and functional programming, a predicate is anything that returns a Boolean.  In SQL terms, this often refers to the `WHERE` clause.  Since the database is filtering data before it arrives on the Spark cluster, there's less data transfer across the network and fewer records for Spark to process.  Spark's Catalyst Optimizer includes predicate pushdown communicated through the JDBC API, making JDBC an ideal data source for Spark workloads.
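The effect of pushing a predicate down can be sketched without Spark. The example below is only an illustration: it uses Python's built-in `sqlite3` module as a stand-in for a remote database, and the `people` table and its rows are made up. It compares how many rows leave the database when the filter runs on the client versus inside the database.

```python
import sqlite3

# Hypothetical in-memory table standing in for a remote database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER, salary INTEGER)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [(i, 50000 + 1000 * i) for i in range(1, 101)],
)

# Without pushdown: every row is transferred, then filtered on the client.
all_rows = conn.execute("SELECT id, salary FROM people").fetchall()
client_filtered = [row for row in all_rows if row[1] > 140000]

# With pushdown: the database evaluates the WHERE clause and
# only the matching rows are transferred.
pushed_rows = conn.execute(
    "SELECT id, salary FROM people WHERE salary > 140000"
).fetchall()

rows_transferred_without = len(all_rows)
rows_transferred_with = len(pushed_rows)
print(rows_transferred_without, rows_transferred_with)
```

Both approaches return the same records; the difference is how much data crosses the connection before the filter is applied.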

In the road map for ETL, this is the **Extract and Validate** step:

<img src="https://files.training.databricks.com/images/eLearning/ETL-Part-1/ETL-Process-1.png" style="border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa"/>

### Recalling the Design Pattern

Recall the design pattern for connecting to data from the previous lesson:  
<br>
1. Define the connection point.
2. Define connection parameters such as access credentials.
3. Add necessary options. 

After following these steps, read data using `spark.read.option(<option key>, <option value>).<connection_type>(<endpoint>)`.  The JDBC connection uses this same formula with added complexity over what was covered in the previous lesson.
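As a rough sketch of this pattern applied to JDBC, the helper below chains the generic `format("jdbc")` reader with the `url`, `user`, `password`, and `dbtable` options. The function name `read_jdbc_table` is hypothetical (it is not part of Spark or this course), and the sketch assumes `spark` is the session that a Databricks notebook provides.

```python
def read_jdbc_table(spark, url, table, user, password):
    """Hypothetical helper: read one table over JDBC with the generic reader."""
    return (
        spark.read
        .format("jdbc")                # step 1: the connection type
        .option("url", url)            # step 1: the connection point
        .option("user", user)          # step 2: access credentials
        .option("password", password)  # step 2: access credentials
        .option("dbtable", table)      # step 3: further options, here the table to read
        .load()
    )
```

In a notebook, this could be called with a JDBC URL, a table name, and credentials such as the ones defined later in this lesson.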

<iframe  
src="//fast.wistia.net/embed/iframe/2clbjyxese?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/2clbjyxese?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

Run the cell below to confirm you are using the right driver.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Each notebook has a default language that appears in the upper corner of the screen next to the notebook name, and you can easily switch between languages in a notebook. To change languages, start your cell with `%python`, `%scala`, `%sql`, or `%r`.

In [10]:
%scala
// run this regardless of language type
Class.forName("org.postgresql.Driver")

// Loading the class confirms that the PostgreSQL JDBC driver is on the classpath;
// a ClassNotFoundException here means the driver is missing.


Define your database connection criteria. In this case, you need the hostname, port, and database name. 

Access the database `training` via port `5432` of a Postgres server sitting at the endpoint `server1.databricks.training`.

Combine the connection criteria into a URL.

In [12]:

jdbcHostname = "server1.databricks.training"
jdbcPort = 5432
jdbcDatabase = "training"
jdbcUrl = f"jdbc:postgresql://{jdbcHostname}:{jdbcPort}/{jdbcDatabase}"
  

Create a connection properties object with the username and password for the database.

In [14]:
connectionProps = {
  "user": "readonly",
  "password": "readonly"
}

Read from the database by passing the URL, table name, and connection properties into `spark.read.jdbc()`.

In [16]:
tableName = "training.people_1m"

peopleDF = spark.read.jdbc(url=jdbcUrl, table=tableName, properties=connectionProps)

display(peopleDF)


id,firstName,middleName,lastName,gender,birthDate,ssn,salary
21592,Christi,Rubye,Baynom,F,1953-11-06T00:00:00.000+0000,957-64-2902,56055
21593,Jani,Breann,O'Gormley,F,1981-05-25T00:00:00.000+0000,966-80-3908,87217
21594,Adriene,Olinda,Twine,F,1989-12-13T00:00:00.000+0000,974-80-7229,77349
21595,Lorene,Corazon,Klosges,F,1974-12-13T00:00:00.000+0000,903-50-4421,67685
21596,Charlene,Elba,Hold,F,1960-07-07T00:00:00.000+0000,928-55-2975,64066
21597,Marvella,Janice,Cathrall,F,1956-10-20T00:00:00.000+0000,955-41-2380,94479
21598,Luella,Mabelle,Devennie,F,1979-07-15T00:00:00.000+0000,963-85-6723,114841
21599,Maribel,Joannie,Justice,F,1960-05-09T00:00:00.000+0000,970-51-1506,69513
21600,Lucinda,Rachel,Ollerhead,F,1983-06-19T00:00:00.000+0000,983-48-5816,73936
21601,Simonne,Katie,Squires,F,1996-11-01T00:00:00.000+0000,990-84-3794,64578


## Exercise 1: Parallelizing JDBC Connections

The command above was executed as a serial read through a single connection to the database. This works well for small data sets; at scale, parallel reads are necessary for optimal performance.

See the [Managing Parallelism](https://docs.databricks.com/spark/latest/data-sources/sql-databases.html#managing-parallelism) section of the Databricks documentation.

### Step 1: Find the Range of Values in the Data

Parallel JDBC reads entail assigning a range of values for a given partition to read from. The first step of this divide-and-conquer approach is to find bounds of the data.
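To make the divide-and-conquer idea concrete, here is a simplified sketch of how a range of values might be turned into one `WHERE` clause per partition. It is modeled on how Spark is understood to split a JDBC read, but the function name `jdbc_partition_predicates` is hypothetical and the exact stride arithmetic may differ between Spark versions.

```python
def jdbc_partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Return one WHERE-clause string per partition (simplified sketch)."""
    if num_partitions < 2:
        return [None]  # a single partition reads the whole table, no predicate needed

    # Width of each partition's range. Integer division is assumed here;
    # for negative bounds Python's floor division rounds differently from the JVM.
    stride = upper_bound // num_partitions - lower_bound // num_partitions

    predicates = []
    current = lower_bound
    for i in range(num_partitions):
        lower = f"{column} >= {current}" if i > 0 else None
        current += stride
        upper = f"{column} < {current}" if i < num_partitions - 1 else None
        if lower is None:
            # The first partition is open-ended below and also collects NULLs.
            predicates.append(f"{upper} OR {column} IS NULL")
        elif upper is None:
            # The last partition is open-ended above.
            predicates.append(lower)
        else:
            predicates.append(f"{lower} AND {upper}")
    return predicates


predicates = jdbc_partition_predicates("id", 1, 1_000_000, 8)
for predicate in predicates:
    print(predicate)
```

Because the first and last ranges are open-ended, the bounds in this sketch only shape the stride; they do not filter rows out of the table. Bounds that fit the data poorly therefore produce unevenly sized partitions rather than missing records.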

Calculate the range of values in the `id` column of `peopleDF`. Save the minimum to `dfMin` and the maximum to `dfMax`.  **This should be the number itself rather than a DataFrame that contains the number.**  Use `.first()` to get a Scala or Python object.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** See the `min()` and `max()` functions in Python `pyspark.sql.functions` or Scala `org.apache.spark.sql.functions`.

In [19]:

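# TODO
from pyspark.sql.functions import min, max  # note: these shadow Python's built-in min and max

# Compute both bounds in a single aggregation; .first() returns a Row,
# so index into it to get the plain numbers rather than a DataFrame.
bounds = peopleDF.select(min("id"), max("id")).first()
dfMin = bounds[0]
dfMax = bounds[1]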


# Notes: other ways to get the largest value in a DataFrame column.
# The examples assume a DataFrame `df` with a numeric column "A".
#
# Method 1: Use describe()
# float(df.describe("A").filter("summary = 'max'").select("A").collect()[0].asDict()['A'])
#
# Method 2: Use SQL
# df.createOrReplaceTempView("df_table")
# spark.sql("SELECT MAX(A) as maxval FROM df_table").collect()[0].asDict()['maxval']
#
# Method 3: Use groupby()
# df.groupby().max('A').collect()[0].asDict()['max(A)']
#
# Method 4: Convert to RDD
# df.select("A").rdd.max()[0]



In [20]:
# TEST - Run this cell to test your solution

dbTest("ET1-P-04-01-01", 1, dfMin)
dbTest("ET1-P-04-01-02", 1000000, dfMax)

print("Tests passed!")

### Step 2: Define the Connection Parameters

<a href="https://docs.databricks.com/spark/latest/data-sources/sql-databases.html#manage-parallelism" target="_blank">Referencing the documentation,</a> define the connection parameters for this read.

Use 8 partitions.

Assign the results to `peopleDFParallel`.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Setting the column for your parallel read introduces unexpected behavior due to a bug in Spark. To make sure Spark uses the capitalization of your column, use `'"id"'` for your column. <a href="https://github.com/apache/spark/pull/20370#issuecomment-359958843" target="_blank"> Monitor the issue here.</a>

In [22]:

# ANSWER
peopleDFParallel = spark.read.jdbc(
  url=jdbcUrl,                    # the JDBC URL
  table="training.people_1m",     # the name of the table
  column="id",                    # an integral column used to split the read into partitions
  lowerBound=dfMin,               # the minimum value of `column`, used to decide the partition stride
  upperBound=dfMax,               # the maximum value of `column`, used to decide the partition stride
  numPartitions=8,                # the number of partitions (and parallel connections)
  properties=connectionProps      # the connection properties
)

display(peopleDFParallel)


id,firstName,middleName,lastName,gender,birthDate,ssn,salary
1,Lydia,Ula,Rubinowicz,F,1997-02-02T00:00:00.000+0000,927-54-8759,70110
2,Diamond,Carletta,Melesk,F,1984-10-21T00:00:00.000+0000,939-18-5247,74024
3,Yen,Julienne,Recher,F,1988-11-24T00:00:00.000+0000,929-26-8667,83619
4,Mallie,Albertina,Icom,F,1997-03-17T00:00:00.000+0000,921-87-2459,84369
5,Neda,Adele,Sansam,F,1997-05-25T00:00:00.000+0000,948-60-9586,63300
6,Brittaney,Marisela,Ingerfield,F,1966-06-20T00:00:00.000+0000,921-43-9011,84172
7,Annetta,Jenny,Ghiroldi,F,1987-01-22T00:00:00.000+0000,997-84-2238,79905
8,Jinny,Ethel,Tunno,F,1970-04-23T00:00:00.000+0000,909-25-2848,113081
9,Sherise,Lorita,McArte,F,1956-02-27T00:00:00.000+0000,914-17-3474,107826
10,Hilaria,Samira,Dana,F,1993-08-08T00:00:00.000+0000,940-79-2466,72104


In [23]:
# TEST - Run this cell to test your solution
dbTest("ET1-P-04-02-01", 8, peopleDFParallel.rdd.getNumPartitions())

print("Tests passed!")

### Step 3: Compare the Serial and Parallel Reads

Compare the two reads with the `%timeit` magic command.

Display the number of partitions in each DataFrame by running the following:

In [26]:
print("Partitions:", peopleDF.rdd.getNumPartitions())
print("Partitions:", peopleDFParallel.rdd.getNumPartitions())

Invoke `%timeit` followed by a call to `.describe()`, which computes summary statistics, on both `peopleDF` and `peopleDFParallel`.

In [28]:

%timeit peopleDF.describe()
# Recorded on one run (approximate): 20.8 s ± 410 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit peopleDFParallel.describe()
# Recorded on one run: 21.7 s ± 1.53 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


What is the difference between serial and parallel reads?  Note that your results can vary drastically depending on the cluster and the number of partitions you use.
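The `%timeit` output format ("mean ± std. dev. of 7 runs, 1 loop each") can be approximated outside a notebook with Python's standard `timeit` module. This is a minimal sketch: `time_action` and `placeholder_action` are hypothetical names, the placeholder stands in for a call such as `peopleDF.describe()`, and IPython's exact standard deviation formula may differ from the one used here.

```python
import statistics
import timeit


def time_action(action, runs=7):
    """Run `action` once per run and return (mean, std. dev.) in seconds."""
    timings = timeit.repeat(action, repeat=runs, number=1)
    return statistics.mean(timings), statistics.pstdev(timings)


def placeholder_action():
    # Stand-in workload; in a notebook this would be the Spark action to measure.
    sum(range(100_000))


mean_s, std_s = time_action(placeholder_action)
print(f"{mean_s:.4f} s ± {std_s:.4f} s per loop (mean ± std. dev. of 7 runs, 1 loop each)")
```

Comparing the standard deviation with the gap between two means is a quick way to judge whether a measured difference is meaningful or within run-to-run noise.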

## Review

**Question:** What is JDBC?  
**Answer:** JDBC stands for Java Database Connectivity, and is a Java API for connecting to databases such as MySQL, Hive, and other data stores.

**Question:** How does Spark read from a JDBC connection by default?  
**Answer:** With a serial read.  Given a partitioning column, bounds, and a number of partitions, Spark conducts a parallel read, which is usually faster at scale.  Parallel reads take full advantage of Spark's distributed architecture.

**Question:** What is the general design pattern for connecting to your data?  
**Answer:** The general design pattern is as follows:

1. Define the connection point
2. Define connection parameters such as access credentials
3. Add necessary options such as for headers or parallelization

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Cleanup<br>

Run the **`Classroom-Cleanup`** cell below to remove any artifacts created by this lesson.

In [32]:
%run "./Includes/Classroom-Cleanup"

## Next Steps

Start the next lesson, [Applying Schemas to JSON Data]($./ETL1 05 - Applying Schemas to JSON Data ).

## Additional Topics & Resources

**Q:** My tool can't connect via JDBC.  Can I connect via <a href="https://en.wikipedia.org/wiki/Open_Database_Connectivity" target="_blank">ODBC instead</a>?  
**A:** Yes.  The best practice is generally to use JDBC connections wherever possible since Spark runs on the JVM.  In cases where JDBC is either not supported or is less performant, use the Simba ODBC driver instead.  See <a href="https://docs.databricks.com/user-guide/clusters/jdbc-odbc.html" target="_blank">the Databricks documentation on connecting BI tools</a> for more details.


**Q:** How can I connect my Spark cluster to Azure Cosmos DB?  
**A:** Microsoft has developed an Azure Cosmos DB Spark Connector. Start with the <a href="https://docs.azuredatabricks.net/spark/latest/data-sources/azure/cosmosdb-connector.html#azure-cosmos-db" target="_blank">Databricks Azure Cosmos DB</a> documentation.

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>