
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

# Database Writes

Apache Spark&trade; and Databricks&reg; allow you to write to a number of target databases in parallel, storing the transformed data from your ETL job.

## In this lesson you:
* Write to a target database in serial and in parallel
* Repartition DataFrames to optimize table inserts
* Coalesce DataFrames to minimize data shuffling

## Audience
* Primary Audience: Data Engineers
* Additional Audiences: Data Scientists and Data Pipeline Engineers

## Prerequisites
* Web browser: Please use a <a href="https://docs.databricks.com/user-guide/supported-browsers.html#supported-browsers" target="_blank">supported browser</a>.
* Concept (optional): <a href="https://academy.databricks.com/collections/frontpage/products/etl-part-1-data-extraction" target="_blank">ETL Part 1 course from Databricks Academy</a>

<iframe  
src="//fast.wistia.net/embed/iframe/ziolp0desw?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/ziolp0desw?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

### Database Writes in Spark

Writing to a database in Spark differs from other tools largely due to its distributed nature. There are a number of variables that can be tweaked to optimize performance, largely relating to how data is organized on the cluster. Partitions are the first step in understanding performant database connections.

**A partition is a portion of your total data set,** which is divided into many of these portions so Spark can distribute your work across a cluster. 

The other concept needed to understand Spark's computation is a slot (also known as a core). **A slot/core is a resource available for the execution of computation in parallel.** In brief, a partition refers to the distribution of data while a slot refers to the distribution of computation.

<div><img src="https://files.training.databricks.com/images/eLearning/ETL-Part-2/partitions-and-cores.png" style="height: 400px; margin: 20px"/></div>

As a general rule of thumb, the number of partitions should be a multiple of the number of cores. For instance, with 5 partitions and 8 slots, 3 of the slots sit idle. With 9 partitions and 8 slots, the job takes about twice as long as it would with 8 partitions, because 7 slots sit idle while the extra partition finishes.
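The arithmetic behind this rule of thumb can be sketched in plain Python. This is a simplified model, assuming every partition takes the same time to process and each slot works on one partition at a time. The function names `waves` and `idle_slots_in_last_wave` are illustrative and are not part of any Spark API.

```python
import math

def waves(num_partitions, num_slots):
    """Number of rounds needed when each slot processes one partition at a time."""
    return math.ceil(num_partitions / num_slots)

def idle_slots_in_last_wave(num_partitions, num_slots):
    """Number of slots with nothing to do during the final round."""
    remainder = num_partitions % num_slots
    return 0 if remainder == 0 else num_slots - remainder

# Compare several partition counts on a hypothetical 8-slot cluster
for partitions in (5, 8, 9, 16):
    print(partitions, waves(partitions, 8), idle_slots_in_last_wave(partitions, 8))
```

In this model, 9 partitions on 8 slots need 2 rounds with 7 idle slots in the second round, while 16 partitions also need 2 rounds but leave no slot idle.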

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> For a refresher on connecting to databases with JDBC, see Lesson 4 of <a href="https://academy.databricks.com/collections/frontpage/products/etl-part-1-data-extraction" target="_blank">ETL Part 1 course from Databricks Academy</a>.

### Managing Partitions

In the context of JDBC database writes, **the number of partitions determines the number of connections used to push data through the JDBC API.** There are two ways to control this parallelism:  

| Function | Transformation Type | Use | Evenly distributes data across partitions? |
| :----------------|:----------------|:----------------|:----------------| 
| `.coalesce(n)`   | narrow (does not shuffle data) | reduce the number of partitions | no |
| `.repartition(n)`| wide (includes a shuffle operation) | increase the number of partitions | yes |

Run the cell below to mount the data.

In [0]:
%run "./Includes/Classroom-Setup"

Start by importing a DataFrame of Wikipedia pageviews.

In [0]:
wikiDF = (spark.read
  .parquet("/mnt/training/wikipedia/pageviews/pageviews_by_second.parquet")
)
display(wikiDF)

timestamp,site,requests
2015-03-22T14:13:34,mobile,1425
2015-03-22T14:23:18,desktop,2534
2015-03-22T14:36:47,desktop,2444
2015-03-22T14:38:39,mobile,1488
2015-03-22T14:57:11,mobile,1519
2015-03-22T15:03:18,mobile,1559
2015-03-22T15:16:47,mobile,1510
2015-03-22T15:45:03,desktop,2673
2015-03-22T15:58:32,desktop,2463
2015-03-22T16:06:11,desktop,2525


View the number of partitions by changing the DataFrame into an RDD and using the `.getNumPartitions()` method.  
Since the Parquet file was saved with 5 partitions, those partitions are retained when you import the data.

In [0]:
partitions = wikiDF.rdd.getNumPartitions()
print("Partitions: {0:,}".format( partitions ))

To increase the number of partitions to 16, use `.repartition()`.

In [0]:
repartitionedWikiDF = wikiDF.repartition(16)
print("Partitions: {0:,}".format( repartitionedWikiDF.rdd.getNumPartitions() ))

To reduce the number of partitions, use `.coalesce()`.

In [0]:
coalescedWikiDF = repartitionedWikiDF.coalesce(2)
print("Partitions: {0:,}".format( coalescedWikiDF.rdd.getNumPartitions() ))
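The difference between the two methods can be illustrated with a toy model in plain Python, where a partition is a list of records. This is only a sketch of the idea, not Spark's actual algorithm: the real implementation also accounts for data locality, and its record assignment differs in detail. The names `model_coalesce` and `model_repartition` are hypothetical.

```python
def model_coalesce(partitions, n):
    """Merge neighboring partitions into n groups; whole partitions move together."""
    if n >= len(partitions):
        # Coalescing cannot increase the partition count, so nothing changes
        return [list(p) for p in partitions]
    merged = []
    for i in range(n):
        start = i * len(partitions) // n
        end = (i + 1) * len(partitions) // n
        merged.append([row for p in partitions[start:end] for row in p])
    return merged

def model_repartition(partitions, n):
    """Redistribute every record round-robin across n new partitions (a full shuffle)."""
    rows = [row for p in partitions for row in p]
    return [rows[i::n] for i in range(n)]

# 16 partitions of 10 records each
original = [list(range(p * 10, p * 10 + 10)) for p in range(16)]
print(sorted(len(p) for p in model_coalesce(original, 10)))
print(sorted(len(p) for p in model_repartition(original, 10)))
```

In this model, coalescing 16 equal partitions to 10 yields four partitions of 10 records and six of 20, while repartitioning yields ten partitions of 16 records each. Coalescing to a number larger than the current partition count leaves the data unchanged.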

### Configure Default Partitions

Spark uses a default value of 200 partitions for the output of a shuffle, a value that comes from the real-world experience of Spark engineers. This is an adjustable configuration setting. Run the following cell to see this value.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Get and set any number of different configuration settings in this manner. <a href="https://spark.apache.org/docs/latest/configuration.html" target="_blank">See the Spark documents</a> for details.

In [0]:
spark.conf.get("spark.sql.shuffle.partitions")

Adjust the number of partitions with the following cell.  **This changes the number of partitions after a shuffle operation.**

In [0]:
spark.conf.set("spark.sql.shuffle.partitions", "8")

Now check to see how this changes an operation involving a data shuffle, such as an `.orderBy()`.  Recall that you coalesced `coalescedWikiDF` to 2 partitions.

In [0]:
orderByPartitions = coalescedWikiDF.orderBy("requests").rdd.getNumPartitions()
print("Partitions: {0:,}".format( orderByPartitions ))

The `.orderBy()` triggered the repartition of the DataFrame into 8 partitions.  Now reset the default value.

In [0]:
spark.conf.set("spark.sql.shuffle.partitions", "200")
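Setting a value and then resetting it by hand is easy to forget. One possible pattern, sketched here, wraps the change in a context manager so the previous value is always restored. The helper `temporary_conf` is a hypothetical name, not a Spark API. It relies only on the `get` and `set` methods used above, and it assumes the key already has a value, as `spark.sql.shuffle.partitions` does by default.

```python
from contextlib import contextmanager

@contextmanager
def temporary_conf(conf, key, value):
    """Set `key` to `value` inside the with-block, then restore the previous value."""
    previous = conf.get(key)  # assumes the key already has a value
    conf.set(key, value)
    try:
        yield
    finally:
        conf.set(key, previous)  # runs even if the block raises an error

# Usage in a notebook where `spark` and `coalescedWikiDF` are defined:
# with temporary_conf(spark.conf, "spark.sql.shuffle.partitions", "8"):
#     print(coalescedWikiDF.orderBy("requests").rdd.getNumPartitions())
```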

### Parallel Database Writes

Database writes are the inverse of what was covered in Lesson 4 of ETL Part 1.  In that lesson you defined the number of partitions in the call to the database.  

**By contrast, when writing to a database, the number of active connections to the database is determined by the number of partitions of the DataFrame.**

Examine this by writing `wikiDF` to a Parquet file in your `userhome` directory.  Recall that `wikiDF` has 5 partitions.

In [0]:
wikiDF.write.mode("OVERWRITE").parquet(userhome+"/wiki.parquet")

Examine the number of partitions in `wiki.parquet`.

In [0]:
for i in dbutils.fs.ls(userhome+"/wiki.parquet"):
  print(i)

This file has 5 parts, meaning Spark wrote the data through 5 different connections to this directory in the file system.
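The same relationship between partitions and connections applies to a JDBC target. The following is a minimal sketch of capping the number of connections before a write. The helper names, URL, table name, and credentials are all hypothetical placeholders. The only Spark calls used are `.rdd.getNumPartitions()`, `.coalesce()`, and `DataFrameWriter.jdbc()`.

```python
def partitions_for_write(current_partitions, max_connections):
    """Partition count to write with so connections never exceed max_connections."""
    if max_connections < 1:
        raise ValueError("max_connections must be at least 1")
    return min(current_partitions, max_connections)

def write_with_connection_cap(df, url, table, max_connections, properties):
    """Append df to a JDBC table using at most max_connections partitions."""
    current = df.rdd.getNumPartitions()
    target = partitions_for_write(current, max_connections)
    if target < current:
        df = df.coalesce(target)  # narrow: merges partitions without a shuffle
    df.write.jdbc(url=url, table=table, mode="append", properties=properties)
    return target

# Hypothetical usage; every value below is a placeholder:
# write_with_connection_cap(
#     wikiDF,
#     url="jdbc:postgresql://example-host:5432/exampledb",
#     table="pageviews",
#     max_connections=4,
#     properties={"user": "example_user", "password": "example_password"},
# )
```

Using `.coalesce()` here avoids a shuffle, at the cost of possibly uneven partitions. If an even distribution matters more than avoiding the shuffle, `.repartition()` is the alternative.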

### Examining in the Spark UI

Click the arrow next to `Spark Jobs` under the following code cell in order to see a breakdown of the job you triggered. Click the next arrow to see a breakdown of the stages.

In [0]:
wikiDF.repartition(12).write.mode("OVERWRITE").parquet(userhome+"/wiki.parquet")

The initial write ran 5 tasks, one for each partition of the data.  When you repartitioned the DataFrame to 12 partitions, the write needed 12 tasks, one for each partition, preceded by a stage that shuffles the data.  Run the following and observe how the repartitioning changes the number of tasks.

In [0]:
wikiDF.repartition(10).write.mode("OVERWRITE").parquet(userhome+"/wiki.parquet")

Click `View` to examine more details about the operation in the Spark UI.

### A Note on Upserts

Upserts insert a record into a database if it doesn't already exist, and update the existing record if it does.  **Upserts are not supported in core Spark** due to the transactional nature of upserting and the immutable nature of Spark. You can only append or overwrite.  Databricks offers a data management system called Databricks Delta that does allow for upserts and other transactional functionality. [See the Databricks Delta docs for more information.](https://docs.databricks.com/delta/delta-batch.html#upserts-merge-into)
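The semantics of an upsert can be illustrated in plain Python with a dictionary standing in for a table. This is not Spark or Delta code. It only shows the insert-or-update behavior, and the `id` key column is an assumption made for the example.

```python
def upsert(table, record, key="id"):
    """Insert record if its key is new; otherwise update the existing row in place."""
    existing = table.get(record[key])
    if existing is None:
        table[record[key]] = dict(record)
        return "inserted"
    existing.update(record)
    return "updated"

table = {}
print(upsert(table, {"id": 1, "site": "mobile", "requests": 1425}))  # new key
print(upsert(table, {"id": 1, "requests": 1500}))                    # existing key
print(table)
```

The second call changes a row that already exists, which is the kind of in-place mutation that appending or overwriting cannot express.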

## Exercise 1: Changing Partitions

Change the number of partitions to prepare the optimal database write.

### Step 1: Import Helper Functions and Data

A function is defined for you to print out the number of records in each DataFrame.  Run the following cell to define that function.

In [0]:
def printRecordsPerPartition(df):
  '''
  Utility method to count & print the number of records in each partition
  '''
  print("Per-Partition Counts:")
  
  def countInPartition(iterator): 
    yield __builtin__.sum(1 for _ in iterator)
    
  results = (df.rdd                   # Convert to an RDD
    .mapPartitions(countInPartition)  # For each partition, count
    .collect()                        # Return the counts to the driver
  )

  for result in results: 
    print("* " + str(result))

Import the data sitting in `/mnt/training/wikipedia/pageviews/pageviews_by_second.parquet` to `wikiDF`.

In [0]:
# TODO
wikiDF = ( spark.read.parquet("/mnt/training/wikipedia/pageviews/pageviews_by_second.parquet")
           )

display(wikiDF)

timestamp,site,requests
2015-03-22T14:13:34,mobile,1425
2015-03-22T14:23:18,desktop,2534
2015-03-22T14:36:47,desktop,2444
2015-03-22T14:38:39,mobile,1488
2015-03-22T14:57:11,mobile,1519
2015-03-22T15:03:18,mobile,1559
2015-03-22T15:16:47,mobile,1510
2015-03-22T15:45:03,desktop,2673
2015-03-22T15:58:32,desktop,2463
2015-03-22T16:06:11,desktop,2525


In [0]:
# TEST - Run this cell to test your solution
dbTest("ET2-P-06-01-01", 7200000, wikiDF.count())
dbTest("ET2-P-06-01-02", ['timestamp', 'site', 'requests'], wikiDF.columns)

print("Tests passed!")

Print the count of records by partition using `printRecordsPerPartition()`.

In [0]:
printRecordsPerPartition(wikiDF)

### Step 2: Repartition the Data

Define three new DataFrames:

* `wikiDF1Partition`: `wikiDF` with 1 partition
* `wikiDF16Partition`: `wikiDF` with 16 partitions
* `wikiDF128Partition`: `wikiDF` with 128 partitions

In [0]:
# TODO
wikiDF1Partition =  wikiDF.repartition(1)
wikiDF16Partition =  wikiDF.repartition(16)
wikiDF128Partition =  wikiDF.repartition(128)

In [0]:
# TEST - Run this cell to test your solution

dbTest("ET2-P-06-02-01", 1, wikiDF1Partition.rdd.getNumPartitions())
dbTest("ET2-P-06-02-02", 16, wikiDF16Partition.rdd.getNumPartitions())
dbTest("ET2-P-06-02-03", 128, wikiDF128Partition.rdd.getNumPartitions())

print("Tests passed!")

### Step 3: Examine the Distribution of Records

Use `printRecordsPerPartition()` to examine the distribution of records across the partitions.

In [0]:
# TODO
printRecordsPerPartition(wikiDF1Partition)
printRecordsPerPartition(wikiDF16Partition)
printRecordsPerPartition(wikiDF128Partition)

### Step 4: Coalesce `wikiDF16Partition` and Examine the Results

Coalesce `wikiDF16Partition` to `10` partitions, saving the result to `wikiDF16PartitionCoalesced`.

In [0]:
# TODO
wikiDF16PartitionCoalesced = wikiDF16Partition.coalesce(10)

In [0]:
# TEST - Run this cell to test your solution

dbTest("ET2-P-06-03-01", 10, wikiDF16PartitionCoalesced.rdd.getNumPartitions())

print("Tests passed!")

Examine the new distribution of data using `printRecordsPerPartition`.  Is the distribution uniform?  Why or why not?

In [0]:
printRecordsPerPartition(wikiDF16PartitionCoalesced)

In [0]:
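# The distribution is not uniform. `.coalesce()` merges existing partitions without a shuffle, so 16 partitions cannot be divided evenly among 10: some of the new partitions hold more of the original partitions than others.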

## Review
**Question:** How do you determine the number of connections to the database you write to?  
**Answer:** Spark makes one connection for each partition in your data. Increasing the number of partitions increases the database connections.

**Question:** How do you increase and decrease the number of partitions in your data?  
**Answer:** `.repartition(n)` increases the number of partitions in your data. It can also decrease the number of partitions, but since this is a wide operation it should be used sparingly.  `.coalesce(n)` decreases the number of partitions in your data. If you use `.coalesce(n)` with a number greater than the current number of partitions, this DataFrame method will have no effect.

**Question:** How can you change the default number of partitions?  
**Answer:** Changing the configuration parameter `spark.sql.shuffle.partitions` will alter the default number of partitions used after a shuffle operation.

## Next Steps

Start the next lesson, [Table Management]($./07-Table-Management ).

## Additional Topics & Resources

**Q:** Where can I find more information on reading the Spark UI?  
**A:** Check out the Databricks blog on <a href="https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html" target="_blank">Understanding your Spark Application through Visualization</a>

&copy; 2019 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>