
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

# ETL Process Overview

Apache Spark&trade; and Databricks&reg; allow you to create an end-to-end _extract, transform, load (ETL)_ pipeline.

## In this lesson you:
* Create a basic end-to-end ETL pipeline
* Demonstrate the Spark approach to ETL pipelines

## Audience
* Primary Audience: Data Engineers
* Additional Audiences: Data Scientists and Data Pipeline Engineers

## Prerequisites
* Web browser: Please use a <a href="https://docs.databricks.com/user-guide/supported-browsers.html#supported-browsers" target="_blank">supported browser</a>.

<iframe  
src="//fast.wistia.net/embed/iframe/rd9d11fwe6?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/rd9d11fwe6?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

### The Spark Approach

Spark offers a compute engine and connectors to virtually any data source. By leveraging easily scaled infrastructure and accessing data where it lives, Spark addresses the core needs of a big data application.

These principles comprise the Spark approach to ETL, providing a unified and scalable approach to big data pipelines: <br><br>

1. Databricks and Spark offer a **unified platform** 
 - Spark on Databricks combines ETL, stream processing, machine learning, and collaborative notebooks.
 - Data scientists, analysts, and engineers can write Spark code in Python, Scala, SQL, and R.
2. Spark's unified platform is **scalable to petabytes of data and clusters of thousands of nodes**.  
 - The same code written on smaller data sets scales to large workloads, often with only small changes.
3. Spark on Databricks decouples data storage from the compute and query engine.  
 - Spark's query engine **connects to any number of data sources** such as S3, Azure Blob Storage, Redshift, and Kafka.  
 - This **minimizes costs**; a dedicated cluster does not need to be maintained and the compute cluster is **easily updated to the latest version** of Spark.
 
<img src="https://files.training.databricks.com/images/eLearning/ETL-Part-1/Workload_Tools_2-01.png" style="border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa"/>

### A Basic ETL Job

In this lesson you use web log files from the <a href="https://www.sec.gov/dera/data/edgar-log-file-data-set.html" target="_blank">US Securities and Exchange Commission website</a> to do a basic ETL for a day of server activity. You will extract the fields of interest and load them into persistent storage.

<iframe  
src="//fast.wistia.net/embed/iframe/95uh9cxyb3?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/95uh9cxyb3?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

### Getting Started

Run the following cell to configure our "classroom."

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Remember to attach your notebook to a cluster. Click <b>Detached</b> in the upper-left corner and then select your preferred cluster.

<img src="https://files.training.databricks.com/images/eLearning/attach-to-cluster.png" style="border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa"/>

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Setup & Classroom-Cleanup<br>

For each lesson to execute correctly, please make sure to run the **`Classroom-Setup`** cell at the start of each lesson (see the next cell) and the **`Classroom-Cleanup`** cell at the end of each lesson.

In [0]:
%run "./Includes/Classroom-Setup"

The Databricks File System (DBFS) is an HDFS-like interface to bulk data stores like Amazon S3 and Azure Blob Storage.

Pass the path `/mnt/training/EDGAR-Log-20170329/EDGAR-Log-20170329.csv` into `spark.read.csv` to access data stored in DBFS. Use the `header` option to specify that the first line of the file is the header.

In [0]:
path = "/mnt/training/EDGAR-Log-20170329/EDGAR-Log-20170329.csv"

logDF = (spark
  .read
  .option("header", True)
  .csv(path)
  .sample(withReplacement=False, fraction=0.3, seed=3) # using a sample to reduce data size
)

display(logDF)

ip,date,time,zone,cik,accession,extention,code,size,idx,norefer,noagent,find,crawler,browser
101.71.41.ihh,2017-03-29,00:00:00,0.0,1437491.0,0001245105-17-000052,xslF345X03/primary_doc.xml,301.0,687.0,0.0,0.0,0.0,10.0,0.0,
104.196.240.dda,2017-03-29,00:00:00,0.0,1270985.0,0001188112-04-001037,.txt,200.0,7619.0,0.0,0.0,0.0,10.0,0.0,
107.23.85.jfd,2017-03-29,00:00:00,0.0,1059376.0,0000905148-07-006108,-index.htm,200.0,2727.0,1.0,0.0,0.0,10.0,0.0,
107.23.85.jfd,2017-03-29,00:00:00,0.0,1059376.0,0000905148-08-001993,-index.htm,200.0,2710.0,1.0,0.0,0.0,10.0,0.0,
107.23.85.jfd,2017-03-29,00:00:00,0.0,1059376.0,0001104659-09-046963,-index.htm,200.0,2715.0,1.0,0.0,0.0,10.0,0.0,
107.23.85.jfd,2017-03-29,00:00:00,0.0,1364986.0,0000914121-06-002243,-index.htm,200.0,2786.0,1.0,0.0,0.0,10.0,0.0,
107.23.85.jfd,2017-03-29,00:00:00,0.0,1364986.0,0000914121-06-002251,-index.htm,200.0,2784.0,1.0,0.0,0.0,10.0,0.0,
108.240.248.gha,2017-03-29,00:00:00,0.0,1540159.0,0001217160-12-000029,f332scottlease.htm,200.0,49578.0,0.0,0.0,0.0,10.0,0.0,
108.59.8.fef,2017-03-29,00:00:00,0.0,732834.0,0001209191-15-017349,xslF345X03/doc4.xml,301.0,673.0,0.0,0.0,0.0,10.0,0.0,
108.91.91.hbc,2017-03-29,00:00:00,0.0,1629769.0,0001209191-17-023204,.txt,301.0,675.0,0.0,0.0,0.0,10.0,0.0,
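
To see what the header option does, and why values such as `301.0` are still text at this point, here is a plain-Python sketch using the standard `csv` module on two rows copied from the output above. It is an analogy to Spark's CSV reader, not a reproduction of it: unless a schema is supplied or inferred, Spark also reads every CSV column as a string. The name `sample_csv` is made up for illustration.

```python
import csv
import io

# Two rows copied from the output above, preceded by the header line.
sample_csv = (
    "ip,date,time,zone,cik,accession,extention,code,size,idx,norefer,noagent,find,crawler,browser\n"
    "101.71.41.ihh,2017-03-29,00:00:00,0.0,1437491.0,0001245105-17-000052,xslF345X03/primary_doc.xml,301.0,687.0,0.0,0.0,0.0,10.0,0.0,\n"
    "104.196.240.dda,2017-03-29,00:00:00,0.0,1270985.0,0001188112-04-001037,.txt,200.0,7619.0,0.0,0.0,0.0,10.0,0.0,\n"
)

# DictReader treats the first line as column names, much like the header option.
rows = list(csv.DictReader(io.StringIO(sample_csv)))

first = rows[0]
print(first["ip"], first["code"])    # values are keyed by the header names
print(type(first["code"]).__name__)  # every value is still a string
print(float(first["code"]) >= 500)   # cast explicitly before comparing numerically
```

The trailing comma in each row means the last column, `browser`, is present but empty, which is worth remembering when validating the data.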


Next, review the server-side errors, which have error codes in the 500s.

In [0]:
from pyspark.sql.functions import col

serverErrorDF = (logDF
  .filter((col("code") >= 500) & (col("code") < 600))
  .select("date", "time", "extention", "code")
)

display(serverErrorDF)

date,time,extention,code
2017-03-29,00:00:12,.txt,503.0
2017-03-29,00:00:16,-index.htm,503.0
2017-03-29,00:00:24,-index.htm,503.0
2017-03-29,00:00:44,-index.htm,503.0
2017-03-29,00:01:01,-index.htm,503.0
2017-03-29,00:01:01,-index.htm,503.0
2017-03-29,00:01:02,-index.htm,503.0
2017-03-29,00:01:03,-index.htm,503.0
2017-03-29,00:01:03,-index.htm,503.0
2017-03-29,00:01:04,-index.htm,503.0


### Data Validation

One aspect of ETL jobs is to validate that the data is what you expect.  This includes:<br><br>
* Approximately the expected number of records
* The expected fields are present
* No unexpected missing values
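
The three checks above can be sketched in plain Python on a handful of in-memory rows. This is a minimal illustration, not the course's validation code: the function name `validate`, the `tolerance` parameter, and the sample rows are all assumptions introduced here.

```python
def validate(rows, expected_fields, expected_count, tolerance=0.1):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []

    # 1. Approximately the expected number of records.
    low = expected_count * (1 - tolerance)
    high = expected_count * (1 + tolerance)
    if not low <= len(rows) <= high:
        problems.append(f"record count {len(rows)} is outside the expected range")

    for i, row in enumerate(rows):
        # 2. The expected fields are present.
        missing = [f for f in expected_fields if f not in row]
        if missing:
            problems.append(f"row {i}: missing fields {missing}")
        # 3. No unexpected missing values in the fields that are present.
        empty = [f for f in expected_fields if f in row and row[f] in (None, "")]
        if empty:
            problems.append(f"row {i}: empty values in {empty}")
    return problems


# Hypothetical rows shaped like the server-error output above.
good = [
    {"date": "2017-03-29", "time": "00:00:12", "code": "503.0"},
    {"date": "2017-03-29", "time": "00:00:16", "code": "503.0"},
]
bad = [{"date": "2017-03-29", "time": "", "code": "503.0"}]

fields = ["date", "time", "code"]
print(validate(good, fields, expected_count=2))
print(validate(bad, fields, expected_count=2))
```

Returning a list of problems rather than raising on the first failure makes it easier to see everything that is wrong with a batch in one pass.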

<iframe  
src="//fast.wistia.net/embed/iframe/k3mf97q7nn?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/k3mf97q7nn?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

Take a look at the server-side errors by hour to confirm the data meets your expectations. Visualize it by selecting the bar graph icon once the table is displayed. <br><br>
<div><img src="https://files.training.databricks.com/images/eLearning/ETL-Part-1/visualization.png" style="height: 400px; margin-bottom: 20px; border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa"/></div>

In [0]:
from pyspark.sql.functions import from_utc_timestamp, hour, col

countsDF = (serverErrorDF
  .select(hour(from_utc_timestamp(col("time"), "GMT")).alias("hour"))
  .groupBy("hour")
  .count()
  .orderBy("hour")
)

display(countsDF)

hour,count
0,2030
1,1638
2,1123
3,1093
4,1118
5,1168
6,1089
7,1054
8,1055
9,1022


The distribution of errors by hour meets the expectations.  There is an uptick in errors around midnight, possibly due to server maintenance at this time.
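
The group-by-hour logic can also be sketched without Spark, which may help when reasoning about what the aggregation returns. The sketch below uses only the standard library. The first three time strings appear in the server-error output shown earlier; the last two are hypothetical values added so that more than one hour shows up.

```python
from collections import Counter
from datetime import datetime

# Three times from the earlier output, plus two hypothetical later ones.
times = ["00:00:12", "00:00:16", "00:01:01", "01:15:00", "01:59:59"]

# Parse each string, keep only the hour, and count occurrences per hour.
counts = Counter(datetime.strptime(t, "%H:%M:%S").hour for t in times)

# Sort by hour, mirroring orderBy("hour").
for hour, count in sorted(counts.items()):
    print(hour, count)
```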

### Saving Back to DBFS

A common and highly effective design pattern in the Databricks and Spark ecosystem involves loading structured data back to DBFS as a Parquet file. Learn more about [Parquet, a scalable and optimized data storage format](http://parquet.apache.org/).

Save the parsed DataFrame back to DBFS as Parquet using the `.write` method.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> All clusters have storage available to them in the `/tmp/` directory.  In the case of Community Edition clusters, this is a small, but helpful, amount of storage.  
<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> If you run out of storage, use the command `dbutils.fs.rm("/tmp/<my directory>", True)` to recursively remove all items from a directory.  Note that this is a permanent action.

In [0]:
targetPath = workingDir + "/log20170329/serverErrorDF.parquet"

(serverErrorDF
  .write
  .mode("overwrite") # overwrites a file if it already exists
  .parquet(targetPath)
)

### Our ETL Pipeline

Here's what the ETL pipeline you just built looks like.  In the rest of this course you will work with more complex versions of this general pattern.

| Code | Stage |
|------|-------|
| `logDF = (spark`                                                                          | Extract |
| &nbsp;&nbsp;&nbsp;&nbsp;`.read`                                                           | Extract |
| &nbsp;&nbsp;&nbsp;&nbsp;`.option("header", True)`                                         | Extract |
| &nbsp;&nbsp;&nbsp;&nbsp;`.csv(<source>)`                                                  | Extract |
| `)`                                                                                       | Extract |
| `serverErrorDF = (logDF`                                                                  | Transform |
| &nbsp;&nbsp;&nbsp;&nbsp;`.filter((col("code") >= 500) & (col("code") < 600))`             | Transform |
| &nbsp;&nbsp;&nbsp;&nbsp;`.select("date", "time", "extention", "code")`                    | Transform |
| `)`                                                                                       | Transform |
| `(serverErrorDF.write`                                                                 | Load |
| &nbsp;&nbsp;&nbsp;&nbsp;`.parquet(<destination>))`                                      | Load |

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> This is a distributed job, so it can easily scale to fit the demands of your data set.
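
To make the separation of stages in the table explicit, here is a plain-Python sketch that puts each stage in its own function. It is an illustration under stated assumptions, not the Spark job itself: the function names, the sample input, and the output file name are hypothetical, and JSON lines stand in for Parquet because the standard library has no Parquet writer.

```python
import csv
import io
import json
import os
import tempfile


def extract(text):
    """Extract: parse CSV text whose first line is the header."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: keep server-side errors (5xx) and the fields of interest."""
    keep = ("date", "time", "extention", "code")
    return [
        {k: row[k] for k in keep}
        for row in rows
        if 500 <= float(row["code"]) < 600
    ]


def load(rows, path):
    """Load: persist the result, one JSON object per line."""
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


# Hypothetical input that reuses some of the lesson's column names.
source = (
    "date,time,extention,code,size\n"
    "2017-03-29,00:00:00,.txt,200.0,7619.0\n"
    "2017-03-29,00:00:12,.txt,503.0,0.0\n"
)

target = os.path.join(tempfile.mkdtemp(), "server_errors.jsonl")
load(transform(extract(source)), target)

# Read the file back to confirm what was persisted.
with open(target) as f:
    saved = [json.loads(line) for line in f]
print(saved)
```

Keeping the stages as separate functions means each one can be tested or swapped independently, for example by pointing `extract` at a different source.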

<iframe  
src="//fast.wistia.net/embed/iframe/qu6fxg1f6a?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/qu6fxg1f6a?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

## Exercise 1: Perform an ETL Job

Write a basic ETL script that captures the 20 most active website users and loads the results to DBFS.

### Step 1: Create a DataFrame of Aggregate Statistics

Create a DataFrame `ipCountDF` that uses `logDF` to create a count of each time a given IP address appears in the logs, with the counts sorted in descending order.  The result should have two columns: `ip` and `count`.

In [0]:
# TODO
from pyspark.sql.functions import col, desc

ipCountDF = (logDF
  .select("ip")
  .groupBy("ip")
  .count()
  .orderBy(desc("count"))
)

display(ipCountDF)


ip,count
213.152.28.bhe,518548
158.132.91.haf,497361
117.91.6.caf,239912
132.195.122.djf,197267
117.91.2.aha,152731
173.52.208.ehd,146767
108.91.91.hbc,143232
117.91.7.hgh,133447
97.100.78.cjb,130156
217.174.255.dgd,123039
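
Step 1 keeps every IP address, while the stated goal of the exercise is the 20 most active users. In Spark, one way to trim the result is to call `.limit(20)` on the DataFrame after the descending sort. The count, sort, and trim logic can be sketched in plain Python as follows; the sample values and the name `top_n` are hypothetical.

```python
from collections import Counter

# Hypothetical log entries; only the IP field matters here.
ips = ["a", "b", "a", "c", "a", "b"]

top_n = 2  # the exercise would use 20

# most_common sorts by count in descending order and keeps the first top_n.
ip_counts = Counter(ips).most_common(top_n)
print(ip_counts)
```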


In [0]:
# TEST - Run this cell to test your solution
ip1, count1 = ipCountDF.first()
cols = set(ipCountDF.columns)

dbTest("ET1-P-02-01-01", "213.152.28.bhe", ip1)
dbTest("ET1-P-02-01-02", True, count1 > 500000 and count1 < 550000)
dbTest("ET1-P-02-01-03", True, 'count' in cols)
dbTest("ET1-P-02-01-04", True, 'ip' in cols)

print("Tests passed!")

### Step 2: Save the Results

Use your temporary folder to save the results back to DBFS as `workingDir + "/ipCount.parquet"`.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** If you run out of space, use `%fs rm -r /tmp/<my directory>` to recursively (and permanently) remove all items from a directory.

In [0]:
# TODO
writePath = workingDir + "/ipCount.parquet"

(ipCountDF
  .write
  .mode("overwrite") # overwrites a file if it already exists
  .parquet(writePath)
)

In [0]:
# TEST - Run this cell to test your solution
from pyspark.sql.functions import desc

writePath = workingDir + "/ipCount.parquet"

ipCountDF2 = (spark
  .read
  .parquet(writePath)
  .orderBy(desc("count"))
)
ip1, count1 = ipCountDF2.first()
cols = ipCountDF2.columns

dbTest("ET1-P-02-02-01", "213.152.28.bhe", ip1)
dbTest("ET1-P-02-02-02", True, count1 > 500000 and count1 < 550000)
dbTest("ET1-P-02-02-03", True, "count" in cols)
dbTest("ET1-P-02-02-04", True, "ip" in cols)

print("Tests passed!")

Check that the load worked by listing the files in **`writePath`**.

Spark writes Parquet output as a directory, with the data divided across a number of part files.

If successful, you see a `_SUCCESS` file as well as the data split across a number of parts.

In [0]:
display(dbutils.fs.ls(writePath))

path,name,size
dbfs:/user/vivek.sivalingam@rhsmith.umd.edu/etl_part_1/etl1_02___etl_process_overview_psp/ipCount.parquet/_SUCCESS,_SUCCESS,0
dbfs:/user/vivek.sivalingam@rhsmith.umd.edu/etl_part_1/etl1_02___etl_process_overview_psp/ipCount.parquet/_committed_6129939551467295247,_committed_6129939551467295247,3458
dbfs:/user/vivek.sivalingam@rhsmith.umd.edu/etl_part_1/etl1_02___etl_process_overview_psp/ipCount.parquet/_started_6129939551467295247,_started_6129939551467295247,0
dbfs:/user/vivek.sivalingam@rhsmith.umd.edu/etl_part_1/etl1_02___etl_process_overview_psp/ipCount.parquet/part-00000-tid-6129939551467295247-c8071b7b-2816-4464-8ac8-168c93e76091-1360-1-c000.snappy.parquet,part-00000-tid-6129939551467295247-c8071b7b-2816-4464-8ac8-168c93e76091-1360-1-c000.snappy.parquet,5372
dbfs:/user/vivek.sivalingam@rhsmith.umd.edu/etl_part_1/etl1_02___etl_process_overview_psp/ipCount.parquet/part-00001-tid-6129939551467295247-c8071b7b-2816-4464-8ac8-168c93e76091-1361-1-c000.snappy.parquet,part-00001-tid-6129939551467295247-c8071b7b-2816-4464-8ac8-168c93e76091-1361-1-c000.snappy.parquet,4896
dbfs:/user/vivek.sivalingam@rhsmith.umd.edu/etl_part_1/etl1_02___etl_process_overview_psp/ipCount.parquet/part-00002-tid-6129939551467295247-c8071b7b-2816-4464-8ac8-168c93e76091-1362-1-c000.snappy.parquet,part-00002-tid-6129939551467295247-c8071b7b-2816-4464-8ac8-168c93e76091-1362-1-c000.snappy.parquet,5095
dbfs:/user/vivek.sivalingam@rhsmith.umd.edu/etl_part_1/etl1_02___etl_process_overview_psp/ipCount.parquet/part-00003-tid-6129939551467295247-c8071b7b-2816-4464-8ac8-168c93e76091-1363-1-c000.snappy.parquet,part-00003-tid-6129939551467295247-c8071b7b-2816-4464-8ac8-168c93e76091-1363-1-c000.snappy.parquet,5029
dbfs:/user/vivek.sivalingam@rhsmith.umd.edu/etl_part_1/etl1_02___etl_process_overview_psp/ipCount.parquet/part-00004-tid-6129939551467295247-c8071b7b-2816-4464-8ac8-168c93e76091-1364-1-c000.snappy.parquet,part-00004-tid-6129939551467295247-c8071b7b-2816-4464-8ac8-168c93e76091-1364-1-c000.snappy.parquet,5016
dbfs:/user/vivek.sivalingam@rhsmith.umd.edu/etl_part_1/etl1_02___etl_process_overview_psp/ipCount.parquet/part-00005-tid-6129939551467295247-c8071b7b-2816-4464-8ac8-168c93e76091-1365-1-c000.snappy.parquet,part-00005-tid-6129939551467295247-c8071b7b-2816-4464-8ac8-168c93e76091-1365-1-c000.snappy.parquet,4776
dbfs:/user/vivek.sivalingam@rhsmith.umd.edu/etl_part_1/etl1_02___etl_process_overview_psp/ipCount.parquet/part-00006-tid-6129939551467295247-c8071b7b-2816-4464-8ac8-168c93e76091-1366-1-c000.snappy.parquet,part-00006-tid-6129939551467295247-c8071b7b-2816-4464-8ac8-168c93e76091-1366-1-c000.snappy.parquet,4365
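
The same check can be done programmatically rather than by eye. The sketch below classifies a list of file names in plain Python; the function name `summarize_listing` and the shortened file names are hypothetical. In a notebook, the names could plausibly be collected with `[f.name for f in dbutils.fs.ls(writePath)]`.

```python
def summarize_listing(names):
    """Classify file names from a Parquet output directory listing."""
    return {
        # The marker file written when the job completes successfully.
        "success": "_SUCCESS" in names,
        # The files that hold the actual data.
        "parts": sorted(
            n for n in names
            if n.startswith("part-") and n.endswith(".parquet")
        ),
    }


# Shortened, hypothetical names following the pattern in the listing above.
names = [
    "_SUCCESS",
    "_committed_123",
    "_started_123",
    "part-00000-tid-123-c000.snappy.parquet",
    "part-00001-tid-123-c000.snappy.parquet",
]

summary = summarize_listing(names)
print(summary["success"], len(summary["parts"]))
```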


## Review
**Question:** What does ETL stand for and what are the stages of the process?  
**Answer:** ETL stands for `extract-transform-load`:

1. *Extract* refers to ingesting data.  Spark easily connects to data in a number of different sources.
2. *Transform* refers to applying structure, parsing fields, cleaning data, and/or computing statistics.
3. *Load* refers to loading data to its final destination, usually a database or data warehouse.

**Question:** How does the Spark approach to ETL deal with devops issues such as updating a software version?  
**Answer:** By decoupling storage and compute, updating your Spark version is as easy as spinning up a new cluster.  Your old code will easily connect to S3, Azure Blob Storage, or other storage.  This also avoids the challenge of keeping a cluster always running, such as with Hadoop clusters.

**Question:** How does the Spark approach to data applications differ from other solutions?  
**Answer:** Spark offers a unified solution to use cases that would otherwise need individual tools. For instance, Spark combines machine learning, ETL, stream processing, and a number of other solutions all with one technology.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Cleanup<br>

Run the **`Classroom-Cleanup`** cell below to remove any artifacts created by this lesson.

In [0]:
%run "./Includes/Classroom-Cleanup"

## Next Steps

Start the next lesson, [Connecting to S3]($./03-Connecting-to-S3).

## Additional Topics & Resources

**Q:** Where can I get more information on building ETL pipelines?  
**A:** Check out the Spark Summit talk on <a href="https://databricks.com/session/building-robust-etl-pipelines-with-apache-spark" target="_blank">Building Robust ETL Pipelines with Apache Spark</a>

**Q:** Where can I find out more information on moving from traditional ETL pipelines towards Spark?  
**A:** Check out the Spark Summit talk <a href="https://databricks.com/session/get-rid-of-traditional-etl-move-to-spark" target="_blank">Get Rid of Traditional ETL, Move to Spark!</a>

**Q:** What are the visualization options in Databricks?  
**A:** Databricks provides a wide variety of <a href="https://docs.databricks.com/user-guide/visualizations/index.html" target="_blank">built-in visualizations</a>.  Databricks also supports a variety of third-party visualization libraries, including <a href="https://d3js.org/" target="_blank">d3.js</a>, <a href="https://matplotlib.org/" target="_blank">matplotlib</a>, <a href="http://ggplot.yhathq.com/" target="_blank">ggplot</a>, and <a href="https://plot.ly/" target="_blank">plotly</a>.

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>