d-sandbox

<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

-sandbox
<img src="https://files.training.databricks.com/images/Apache-Spark-Logo_TM_200px.png" style="float: left: margin: 20px"/>

# Loading Data and Productionalizing

Apache Spark&trade; and Databricks&reg; allow you to productionalize code by scheduling notebooks for regular execution.
## In this lesson you:
* Load data using the Apache Parquet format
* Automate a pipeline using the Databricks `Jobs` functionality

## Audience
* Primary Audience: Data Engineers
* Additional Audiences: Data Scientists and Data Pipeline Engineers

## Prerequisites
* Web browser: Chrome

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Setup & Classroom-Cleanup<br>

For each lesson to execute correctly, please make sure to run the **`Classroom-Setup`** cell at the start of each lesson (see the next cell) and the **`Classroom-Cleanup`** cell at the end of each lesson.

In [4]:
%run "./Includes/Classroom-Setup"

<iframe  
src="//fast.wistia.net/embed/iframe/b38tovvtgm?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/b38tovvtgm?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

-sandbox
## Introductory Productionalizing

Incorporating notebooks into production workflows will be covered in detail in later courses. This lesson focuses on two aspects of productionalizing: Parquet as a best practice for loading data from ETL jobs and scheduling jobs.

In the road map for ETL, this is the **Load and Automate** step:

<img src="https://files.training.databricks.com/images/eLearning/ETL-Part-1/ETL-Process-4.png" style="border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa"/>

-sandbox
## Writing Parquet

BLOB stores like S3 and the Azure Blob are the data storage option of choice on Databricks, and Parquet is the storage format of choice.  [Apache Parquet](https://parquet.apache.org/documentation/latest/) is a highly efficient, column-oriented data format that shows massive performance increases over other options such as CSV. For instance, Parquet compresses data repeated in a given column and preserves the schema from a write.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> When writing data to DBFS, the best practice is to use Parquet.

<iframe  
src="//fast.wistia.net/embed/iframe/i7u61oyvcu?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/i7u61oyvcu?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

Import Chicago crime data.

In [10]:
crimeDF = (spark.read
  .option("delimiter", "\t")
  .option("header", True)
  .option("timestampFormat", "mm/dd/yyyy hh:mm:ss a")
  .option("inferSchema", True)
  .csv("/mnt/training/Chicago-Crimes-2018.csv")
)
display(crimeDF)

ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
23811,JB141441,2018-01-05T01:10:00.000+0000,118XX S INDIANA AVE,0110,HOMICIDE,FIRST DEGREE MURDER,VACANT LOT,False,False,532,5,9,53,01A,1179707.0,1826280.0,2018,2018-01-12T15:49:14.000+0000,41.678585145,-87.617837834,"(41.678585145, -87.617837834)"
11228589,JB148990,2018-01-23T09:00:00.000+0000,072XX S VERNON AVE,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,OTHER,False,False,323,3,6,69,11,,,2018,2018-01-12T15:49:14.000+0000,,,
11228563,JB148931,2018-01-31T10:12:00.000+0000,040XX N KEYSTONE AVE,1154,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT $300 AND UNDER,APARTMENT,False,False,1722,17,39,16,11,,,2018,2018-01-12T15:49:14.000+0000,,,
11228555,JB148885,2018-01-01T14:00:00.000+0000,017XX W CONGRESS PKWY,0820,THEFT,$500 AND UNDER,HOSPITAL BUILDING/GROUNDS,False,False,1231,12,2,28,06,,,2018,2018-01-12T15:49:14.000+0000,,,
11228430,JB148675,2018-01-27T21:00:00.000+0000,061XX S EBERHART AVE,0560,ASSAULT,SIMPLE,RESIDENCE,False,True,313,3,20,42,08A,,,2018,2018-01-12T15:49:14.000+0000,,,
11228401,JB148683,2018-01-02T12:00:00.000+0000,038XX N SAWYER AVE,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,RESIDENCE,False,False,1733,17,33,16,11,,,2018,2018-01-12T15:49:14.000+0000,,,
11228347,JB148599,2018-01-28T19:00:00.000+0000,008XX E 45TH ST,0620,BURGLARY,UNLAWFUL ENTRY,RESIDENCE,False,False,221,2,4,39,05,,,2018,2018-01-12T15:49:14.000+0000,,,
11228291,JB148591,2018-01-10T16:45:00.000+0000,010XX E 53RD ST,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,,False,False,233,2,4,41,11,,,2018,2018-01-12T15:49:14.000+0000,,,
11228287,JB148482,2018-01-03T15:45:00.000+0000,0000X W C1 ST,0810,THEFT,OVER $500,AIRPORT TERMINAL LOWER LEVEL - NON-SECURE AREA,False,False,1651,16,41,76,06,,,2018,2018-01-12T15:49:14.000+0000,,,
11228268,JB148558,2018-01-04T16:00:00.000+0000,044XX S MICHIGAN AVE,2825,OTHER OFFENSE,HARASSMENT BY TELEPHONE,APARTMENT,False,True,215,2,3,38,26,,,2018,2018-01-12T15:49:14.000+0000,,,


Rename the columns in `CrimeDF` so there are no spaces or invalid characters. This is required by Spark and is a best practice.  Use camel case.

In [12]:
cols = crimeDF.columns
titleCols = [''.join(j for j in i.title() if not j.isspace()) for i in cols]
camelCols = [column[0].lower()+column[1:] for column in titleCols]

crimeRenamedColsDF = crimeDF.toDF(*camelCols)
display(crimeRenamedColsDF)

id,caseNumber,date,block,iucr,primaryType,description,locationDescription,arrest,domestic,beat,district,ward,communityArea,fbiCode,xCoordinate,yCoordinate,year,updatedOn,latitude,longitude,location
23811,JB141441,2018-01-05T01:10:00.000+0000,118XX S INDIANA AVE,0110,HOMICIDE,FIRST DEGREE MURDER,VACANT LOT,False,False,532,5,9,53,01A,1179707.0,1826280.0,2018,2018-01-12T15:49:14.000+0000,41.678585145,-87.617837834,"(41.678585145, -87.617837834)"
11228589,JB148990,2018-01-23T09:00:00.000+0000,072XX S VERNON AVE,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,OTHER,False,False,323,3,6,69,11,,,2018,2018-01-12T15:49:14.000+0000,,,
11228563,JB148931,2018-01-31T10:12:00.000+0000,040XX N KEYSTONE AVE,1154,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT $300 AND UNDER,APARTMENT,False,False,1722,17,39,16,11,,,2018,2018-01-12T15:49:14.000+0000,,,
11228555,JB148885,2018-01-01T14:00:00.000+0000,017XX W CONGRESS PKWY,0820,THEFT,$500 AND UNDER,HOSPITAL BUILDING/GROUNDS,False,False,1231,12,2,28,06,,,2018,2018-01-12T15:49:14.000+0000,,,
11228430,JB148675,2018-01-27T21:00:00.000+0000,061XX S EBERHART AVE,0560,ASSAULT,SIMPLE,RESIDENCE,False,True,313,3,20,42,08A,,,2018,2018-01-12T15:49:14.000+0000,,,
11228401,JB148683,2018-01-02T12:00:00.000+0000,038XX N SAWYER AVE,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,RESIDENCE,False,False,1733,17,33,16,11,,,2018,2018-01-12T15:49:14.000+0000,,,
11228347,JB148599,2018-01-28T19:00:00.000+0000,008XX E 45TH ST,0620,BURGLARY,UNLAWFUL ENTRY,RESIDENCE,False,False,221,2,4,39,05,,,2018,2018-01-12T15:49:14.000+0000,,,
11228291,JB148591,2018-01-10T16:45:00.000+0000,010XX E 53RD ST,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,,False,False,233,2,4,41,11,,,2018,2018-01-12T15:49:14.000+0000,,,
11228287,JB148482,2018-01-03T15:45:00.000+0000,0000X W C1 ST,0810,THEFT,OVER $500,AIRPORT TERMINAL LOWER LEVEL - NON-SECURE AREA,False,False,1651,16,41,76,06,,,2018,2018-01-12T15:49:14.000+0000,,,
11228268,JB148558,2018-01-04T16:00:00.000+0000,044XX S MICHIGAN AVE,2825,OTHER OFFENSE,HARASSMENT BY TELEPHONE,APARTMENT,False,True,215,2,3,38,26,,,2018,2018-01-12T15:49:14.000+0000,,,


-sandbox
Write to Parquet by calling the following method on a DataFrame: `.write.parquet("mnt/<destination>.parquet")`.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Specify the write mode (for example, `overwrite` or `append`) using `.mode()`.

[See the documentation for additional specifications.](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=parquet#pyspark.sql.DataFrameWriter.parquet)

In [14]:
print(workingDir)

In [15]:
targetPath = f"{workingDir}/crime.parquet"
crimeRenamedColsDF.write.mode("overwrite").parquet(targetPath)
# could have been append as well


In [16]:
%fs ls dbfs:/user/tbresee@mail.smu.edu/etl_part_1/etl1_07___loading_data_and_productionalizing_psp/crime.parquet


path,name,size
dbfs:/user/tbresee@mail.smu.edu/etl_part_1/etl1_07___loading_data_and_productionalizing_psp/crime.parquet/_SUCCESS,_SUCCESS,0
dbfs:/user/tbresee@mail.smu.edu/etl_part_1/etl1_07___loading_data_and_productionalizing_psp/crime.parquet/_committed_2896337623577134971,_committed_2896337623577134971,222
dbfs:/user/tbresee@mail.smu.edu/etl_part_1/etl1_07___loading_data_and_productionalizing_psp/crime.parquet/_started_2896337623577134971,_started_2896337623577134971,0
dbfs:/user/tbresee@mail.smu.edu/etl_part_1/etl1_07___loading_data_and_productionalizing_psp/crime.parquet/part-00000-tid-2896337623577134971-a0214ffa-ee76-447f-bbf6-bcca5d7a8e42-51-1-c000.snappy.parquet,part-00000-tid-2896337623577134971-a0214ffa-ee76-447f-bbf6-bcca5d7a8e42-51-1-c000.snappy.parquet,1214058
dbfs:/user/tbresee@mail.smu.edu/etl_part_1/etl1_07___loading_data_and_productionalizing_psp/crime.parquet/part-00001-tid-2896337623577134971-a0214ffa-ee76-447f-bbf6-bcca5d7a8e42-52-1-c000.snappy.parquet,part-00001-tid-2896337623577134971-a0214ffa-ee76-447f-bbf6-bcca5d7a8e42-52-1-c000.snappy.parquet,321174


In [17]:
# two files because i have two cores ? 

-sandbox
Review how this command writes the Parquet file. An advantage of Parquet is that, unlike a CSV file which is normally a single file, Parquet is distributed so each partition of data in the cluster writes to its own "part". Notice the different log data included in this directory.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Write other file formats in this same way (for example, `.write.csv("mnt/<destination>.csv")`)

In [20]:
targetPath

In [21]:
display(dbutils.fs.ls(targetPath))

path,name,size
dbfs:/user/tbresee@mail.smu.edu/etl_part_1/etl1_07___loading_data_and_productionalizing_psp/crime.parquet/_SUCCESS,_SUCCESS,0
dbfs:/user/tbresee@mail.smu.edu/etl_part_1/etl1_07___loading_data_and_productionalizing_psp/crime.parquet/_committed_2896337623577134971,_committed_2896337623577134971,222
dbfs:/user/tbresee@mail.smu.edu/etl_part_1/etl1_07___loading_data_and_productionalizing_psp/crime.parquet/_started_2896337623577134971,_started_2896337623577134971,0
dbfs:/user/tbresee@mail.smu.edu/etl_part_1/etl1_07___loading_data_and_productionalizing_psp/crime.parquet/part-00000-tid-2896337623577134971-a0214ffa-ee76-447f-bbf6-bcca5d7a8e42-51-1-c000.snappy.parquet,part-00000-tid-2896337623577134971-a0214ffa-ee76-447f-bbf6-bcca5d7a8e42-51-1-c000.snappy.parquet,1214058
dbfs:/user/tbresee@mail.smu.edu/etl_part_1/etl1_07___loading_data_and_productionalizing_psp/crime.parquet/part-00001-tid-2896337623577134971-a0214ffa-ee76-447f-bbf6-bcca5d7a8e42-52-1-c000.snappy.parquet,part-00001-tid-2896337623577134971-a0214ffa-ee76-447f-bbf6-bcca5d7a8e42-52-1-c000.snappy.parquet,321174


-sandbox
Use the `repartition` DataFrame method to repartition the data to limit the number of separate parts.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> What appears to the user as a single DataFrame is actually data distributed across a cluster.  Each cluster holds _partitions_, or parts, of the data.  By repartitioning, we define how many different parts of our data to have.

In [23]:
repartitionedPath = f"{workingDir}/crimeRepartitioned.parquet"
crimeRenamedColsDF.repartition(1).write.mode("overwrite").parquet(repartitionedPath)

Now look at how many parts are in the new folder. You have one part for each partition. Since you repartitioned the DataFrame with a value of `1`, now all the data is in `part-00000`.

In [25]:
display(dbutils.fs.ls(repartitionedPath))

path,name,size
dbfs:/user/tbresee@mail.smu.edu/etl_part_1/etl1_07___loading_data_and_productionalizing_psp/crimeRepartitioned.parquet/_SUCCESS,_SUCCESS,0
dbfs:/user/tbresee@mail.smu.edu/etl_part_1/etl1_07___loading_data_and_productionalizing_psp/crimeRepartitioned.parquet/_committed_9089915443373710447,_committed_9089915443373710447,123
dbfs:/user/tbresee@mail.smu.edu/etl_part_1/etl1_07___loading_data_and_productionalizing_psp/crimeRepartitioned.parquet/_started_9089915443373710447,_started_9089915443373710447,0
dbfs:/user/tbresee@mail.smu.edu/etl_part_1/etl1_07___loading_data_and_productionalizing_psp/crimeRepartitioned.parquet/part-00000-tid-9089915443373710447-852a2ab6-373c-47d6-b072-a5a3cfb6c419-37-1-c000.snappy.parquet,part-00000-tid-9089915443373710447-852a2ab6-373c-47d6-b072-a5a3cfb6c419-37-1-c000.snappy.parquet,1487398


In [26]:
#  BOTTOM LINE:
# USING THE REPARTITION COMMAND WILL OUTPUT A SINGLE FILE...

-sandbox
### Automate by Scheduling a Job

Scheduling a job allows you to perform a batch process at a regular interval. Schedule email updates for successful completion and error logs.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Since jobs are not available in the Community Edition version of Databricks, you are unable to follow along in Community Edition.

-sandbox

1. Click **Jobs** in the left-hand panel of the screen.
<div><img src="https://files.training.databricks.com/images/eLearning/ETL-Part-1/Jobs.png" style="height: 200px" style="margin-bottom: 20px; height: 150px; border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa; margin: 20px"/></div>
2. Click **Create Job**.
<div><img src="https://files.training.databricks.com/images/eLearning/ETL-Part-1/Jobs2.png" style="height: 200px" style="margin-bottom: 20px; height: 150px; border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa; margin: 20px"/></div>
3. Perform the following:
 - Name the job
 - Choose the notebook the job will execute
 - Specify the cluster
 - Choose a daily job
 - Send yourself an email alert
<div><img src="https://files.training.databricks.com/images/eLearning/ETL-Part-1/Jobs3.png" style="height: 200px" style="margin-bottom: 20px; height: 150px; border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa; margin: 20px"/></div>

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Remember to turn off the job so it does not execute indefinitely.

-sandbox
## Exercise 1 (Optional): Productionalizing a Job

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Community Edition users are not able to complete this exercise.

-sandbox
### Step 1: Run All

Click **Run All** to confirm the notebook runs.  If there are any errors, fix them.

<div><img src="https://files.training.databricks.com/images/eLearning/ETL-Part-1/Jobs4.png" style="height: 200px" style="margin-bottom: 20px; height: 150px; border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa"/></div>

### Step 2: Schedule a Job

Schedule this notebook to run using the steps above.

## Review

**Question:** What is the recommended storage format to use with Spark?
**Answer:** Apache Parquet is a highly optimized solution for data storage and is the recommended option for storage where possible.  In addition to offering benefits like compression, it's distributed, so a given partition of data writes to its own file, enabling parallel reads and writes. Formats like CSV are prone to corruption since a single missing comma could corrupt the data. Also, the data cannot be parallelized.

**Question:** How do you schedule a regularly occurring task in Databricks?
**Answer:** The Jobs tab of a Databricks notebook or the new [Jobs API](https://docs.databricks.com/api/latest/jobs.html) allows for job automation.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Cleanup<br>

Run the **`Classroom-Cleanup`** cell below to remove any artifacts created by this lesson.

In [34]:
%run "./Includes/Classroom-Cleanup"

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> All done!</h2>

Thank you for your participation!

## Additional Topics & Resources

**Q:** Where can I get more information on scheduling jobs on Databricks?
**A:** Check out the Databricks documentation on <a href="https://docs.databricks.com/user-guide/jobs.html" target="_blank">Scheduling Jobs on Databricks</a>

**Q:** How can I schedule complex jobs, such as those involving dependencies between jobs?
**A:** There are two options for complex jobs.  The easiest solution is <a href="https://docs.databricks.com/user-guide/notebooks/notebook-workflows.html" target="_blank">Notebook Workflows</a>, which involves using one notebook that triggers the execution of other notebooks. For more complexity, <a href="https://databricks.com/blog/2017/07/19/integrating-apache-airflow-with-databricks.html" target="_blank">Databricks integrates with the open source workflow scheduler Apache Airflow.</a>

**Q:** How do I perform spark-submit jobs?
**A:** Spark-submit is the process for running Spark jobs in the open source implementation of Spark.  [Jobs](https://docs.databricks.com/user-guide/jobs.html) and [the jobs API](https://docs.databricks.com/api/latest/jobs.html) are a robust option offered in the Databricks environment.  You can also launch spark-submit jobs through the jobs UI as well

-sandbox
&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>