# ETL Process Overview

Apache Spark&trade; and Azure Databricks&reg; allow you to create an end-to-end _extract, transform, load (ETL)_ pipeline.

### The Spark Approach

Spark offers a compute engine and connectors to virtually any data source. By leveraging easily scaled infrastructure and accessing data where it lives, Spark addresses the core needs of a big data application.

The following principles make up the Spark approach to ETL, which provides a unified and scalable way to build big data pipelines:

1. Databricks and Spark offer a **unified platform** 
 - Spark on Databricks combines ETL, stream processing, machine learning, and collaborative notebooks.
 - Data scientists, analysts, and engineers can write Spark code in Python, Scala, SQL, and R.
2. Spark's unified platform is **scalable to petabytes of data and clusters of thousands of nodes**.  
 - The same code written on smaller data sets scales to large workloads, often with only small changes.
3. Spark on Databricks decouples data storage from the compute and query engine.  
 - Spark's query engine **connects to any number of data sources** such as S3, Azure Blob Storage, Redshift, and Kafka.  
 - This **minimizes costs**; a dedicated cluster does not need to be maintained and the compute cluster is **easily updated to the latest version** of Spark.
 
<img src="https://files.training.databricks.com/images/eLearning/ETL-Part-1/Workload_Tools_2-01.png" style="border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa"/>

### A Basic ETL Job

In this lesson you use web log files from the <a href="https://www.sec.gov/dera/data/edgar-log-file-data-set.html" target="_blank">US Securities and Exchange Commission website</a> to run a basic ETL job on a day of server activity. You will extract the fields of interest and load them into persistent storage.

### Getting Started

Run the following cell to configure our "classroom."

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Remember to attach your notebook to a cluster. Click <b>Detached</b> in the upper left hand corner and then select your preferred cluster.

<img src="https://files.training.databricks.com/images/eLearning/attach-to-cluster.png" style="border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa"/>

Run the cell below to mount the data. Details on how this works are covered in the next lesson.

In [6]:
%run "./Includes/Classroom-Setup"

The Databricks File System (DBFS) is an HDFS-like interface to bulk data stores such as Azure Blob Storage.

Pass the path `/mnt/training/EDGAR-Log-20170329/EDGAR-Log-20170329.csv` into `spark.read.csv` to access data stored in DBFS. Use the `header` option to specify that the first line of the file is the header.

In [8]:
path = "/mnt/training/EDGAR-Log-20170329/EDGAR-Log-20170329.csv"

logDF = (spark
  .read
  .option("header", True)
  .csv(path)
  .sample(withReplacement=False, fraction=0.3, seed=3) # using a sample to reduce data size
)

display(logDF)

Next, review the server-side errors, which have error codes in the 500s.

In [10]:
from pyspark.sql.functions import col

serverErrorDF = (logDF
  .filter((col("code") >= 500) & (col("code") < 600))
  .select("date", "time", "extention", "code")
)

display(serverErrorDF)

### Data Validation

One aspect of an ETL job is validating that the data is what you expect. This includes:
* Approximately the expected number of records
* The expected fields are present
* No unexpected missing values
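
The first two checks can be written as small assertions. The sketch below is one possible way to do it in plain Python. The helper names `missing_columns` and `within_tolerance`, the sample values, and the 10% tolerance are illustrative assumptions and are not part of the course material.

```python
def missing_columns(actual, expected):
    """Return the expected column names that are absent from `actual`, sorted."""
    return sorted(set(expected) - set(actual))


def within_tolerance(actual_count, expected_count, tolerance=0.10):
    """Return True if actual_count is within +/- tolerance of expected_count."""
    if expected_count <= 0:
        raise ValueError("expected_count must be positive")
    return abs(actual_count - expected_count) <= tolerance * expected_count


# Hypothetical values standing in for a DataFrame's `.columns` and `.count()`.
columns = ["date", "time", "extention", "code"]
print(missing_columns(columns, ["date", "time", "code"]))  # []
print(within_tolerance(95, 100))                           # True
```

With the DataFrames in this lesson, `serverErrorDF.columns` and `serverErrorDF.count()` would supply the actual values. Checking for missing values needs a pass over the data itself and is not covered by this sketch.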

Take a look at the server-side errors by hour to confirm the data meets your expectations. Visualize it by selecting the bar graph icon once the table is displayed. <br><br>
<div><img src="https://files.training.databricks.com/images/eLearning/ETL-Part-1/visualization.png" style="height: 400px; margin-bottom: 20px; border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa"/></div>

In [13]:
from pyspark.sql.functions import from_utc_timestamp, hour, col

countsDF = (serverErrorDF
  .select(hour(from_utc_timestamp(col("time"), "GMT")).alias("hour"))
  .groupBy("hour")
  .count()
  .orderBy("hour")
)

display(countsDF)

The distribution of errors by hour meets expectations. There is an uptick in errors around midnight, possibly due to server maintenance at that time.
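
Because `groupBy` only returns hours that have at least one error, an hour with no rows is easy to overlook in the table. Here is a minimal sketch of a check for such gaps, assuming the hour values have been collected into a Python list. `missing_hours` is a hypothetical helper name and the sample data is made up.

```python
def missing_hours(hours_present):
    """Return the hours of the day (0-23) that do not appear in hours_present."""
    present = set(hours_present)
    return [h for h in range(24) if h not in present]


# Made-up example: every hour except 3 and 4 has at least one error.
sample_hours = [h for h in range(24) if h not in (3, 4)]
print(missing_hours(sample_hours))  # [3, 4]
```

With the DataFrame above, the input could be built with `[row["hour"] for row in countsDF.collect()]`.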

### Saving Back to DBFS

A common and highly effective design pattern in the Databricks and Spark ecosystem involves writing structured data back to DBFS in the Parquet format. Learn more about [the scalable and optimized data storage format Parquet here](http://parquet.apache.org/).

Save the parsed DataFrame back to DBFS as Parquet using the `.write` method.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> All clusters have storage available to them in the `/tmp/` directory.  In the case of Community Edition clusters, this is a small, but helpful, amount of storage.  
<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> If you run out of storage, use the command `dbutils.fs.rm("/tmp/<my directory>", True)` to recursively remove all items from a directory.  Note that this is a permanent action.

In [16]:
(serverErrorDF
  .write
  .mode("overwrite") # overwrites a file if it already exists
  .parquet("/tmp/log20170329/serverErrorDF.parquet")
)

### Our ETL Pipeline

Here's what the ETL pipeline you just built looks like.  In the rest of this course you will work with more complex versions of this general pattern.

| Code | Stage |
|------|-------|
| `logDF = (spark`                                                                          | Extract |
| &nbsp;&nbsp;&nbsp;&nbsp;`.read`                                                           | Extract |
| &nbsp;&nbsp;&nbsp;&nbsp;`.option("header", True)`                                         | Extract |
| &nbsp;&nbsp;&nbsp;&nbsp;`.csv(<source>)`                                                  | Extract |
| `)`                                                                                       | Extract |
| `serverErrorDF = (logDF`                                                                  | Transform |
| &nbsp;&nbsp;&nbsp;&nbsp;`.filter((col("code") >= 500) & (col("code") < 600))`             | Transform |
| &nbsp;&nbsp;&nbsp;&nbsp;`.select("date", "time", "extention", "code")`                    | Transform |
| `)`                                                                                       | Transform |
| `(serverErrorDF.write`                                                                 | Load |
| &nbsp;&nbsp;&nbsp;&nbsp;`.parquet(<destination>))`                                      | Load |

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> This is a distributed job, so it can easily scale to fit the demands of your data set.
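
The same three stages can be tried without a cluster. The following is a small plain-Python analogy, not Spark code: it uses the standard `csv` and `json` modules, made-up sample rows, and a temporary directory as the destination. The function names and the JSON output format are illustrative assumptions.

```python
import csv
import io
import json
import os
import tempfile

# Made-up sample rows using the same column names as the lesson's DataFrame.
RAW_CSV = """date,time,extention,code
2017-03-29,00:00:01,.txt,200
2017-03-29,00:00:02,.htm,503
2017-03-29,00:00:03,.xml,500
"""


def extract(text):
    """Extract: parse CSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: keep only rows whose status code is in the 500s."""
    return [row for row in rows if 500 <= int(row["code"]) < 600]


def load(rows, destination):
    """Load: write the rows to a JSON file and return its path."""
    with open(destination, "w") as f:
        json.dump(rows, f)
    return destination


with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "server_errors.json")
    out_path = load(transform(extract(RAW_CSV)), target)
    # Read the file back to confirm what was persisted.
    with open(out_path) as f:
        saved = json.load(f)

print([row["code"] for row in saved])  # ['503', '500']
```

Each function maps to one stage of the table above. The Spark version differs mainly in that every stage runs in parallel across the cluster.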

## Exercise 1: Perform an ETL Job

Write a basic ETL script that captures the 20 most active website users and loads the results to DBFS.

### Step 1: Create a DataFrame of Aggregate Statistics

Create a DataFrame `ipCountDF` that uses `logDF` to create a count of each time a given IP address appears in the logs, with the counts sorted in descending order.  The result should have two columns: `ip` and `count`.

In [20]:
# ANSWER
from pyspark.sql.functions import desc

ipCountDF = (logDF
  .select("ip")
  .groupBy("ip")
  .count()
  .orderBy(desc("count"))
)

display(ipCountDF)
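
To see what this aggregation computes, here is the same count-and-sort logic on a small in-memory list, using `collections.Counter` from the Python standard library. The IP strings are made up for illustration and `top_ips` is a hypothetical helper.

```python
from collections import Counter


def top_ips(ips, n):
    """Return the n most frequent values as (ip, count) pairs, most frequent first."""
    return Counter(ips).most_common(n)


sample_ips = ["10.0.0.a", "10.0.0.b", "10.0.0.a", "10.0.0.c", "10.0.0.a", "10.0.0.b"]
print(top_ips(sample_ips, 2))  # [('10.0.0.a', 3), ('10.0.0.b', 2)]
```

On the Spark side, one way to keep only the 20 most active users is `ipCountDF.limit(20)`, which returns the first 20 rows of the sorted result.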

In [21]:
# TEST - Run this cell to test your solution
ip1, count1 = ipCountDF.first()
cols = set(ipCountDF.columns)

dbTest("ET1-P-02-01-01", "213.152.28.bhe", ip1)
dbTest("ET1-P-02-01-02", True, count1 > 500000 and count1 < 550000)
dbTest("ET1-P-02-01-03", {'count', 'ip'}, cols)

print("Tests passed!")

### Step 2: Save the Results

Use your temporary folder to save the results back to DBFS as `ipCount.parquet`.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** If you run out of space, use `%fs rm -r /tmp/<my directory>` to recursively (and permanently) remove all items from a directory.

In [23]:
# ANSWER
(ipCountDF
  .write
  .mode("overwrite")
  .parquet("/tmp/ipCount.parquet")
)

In [24]:
# TEST - Run this cell to test your solution
from pyspark.sql.functions import desc

ipCountDF2 = (spark
  .read
  .parquet("/tmp/ipCount.parquet")
  .orderBy(desc("count"))
)
ip1, count1 = ipCountDF2.first()
cols = set(ipCountDF2.columns)

dbTest("ET1-P-02-02-01", "213.152.28.bhe", ip1)
dbTest("ET1-P-02-02-02", True, count1 > 500000 and count1 < 550000)
dbTest("ET1-P-02-02-03", {'count', 'ip'}, cols)

print("Tests passed!")

Check that the load worked by using `%fs ls <path>`.  Spark writes Parquet output as a directory containing a number of files.  If the write succeeded, you see a `_SUCCESS` file as well as the data split across a number of part files.

In [26]:
%fs ls /tmp/ipCount.parquet
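
The same check can be scripted. Here is a minimal sketch, assuming the file names in the output directory are available as a list of strings. `load_succeeded` is a hypothetical helper, and the example file names are invented to resemble Spark's `part-` naming.

```python
def load_succeeded(file_names):
    """Return True if a _SUCCESS marker and at least one part file are present."""
    has_marker = "_SUCCESS" in file_names
    has_parts = any(name.startswith("part-") for name in file_names)
    return has_marker and has_parts


print(load_succeeded(["_SUCCESS", "part-00000-example.snappy.parquet"]))  # True
print(load_succeeded(["part-00000-example.snappy.parquet"]))              # False
```

In a Databricks notebook, the list could come from `[f.name for f in dbutils.fs.ls("/tmp/ipCount.parquet")]`.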

## Review
**Question:** What does ETL stand for and what are the stages of the process?  
**Answer:** ETL stands for `extract-transform-load`:
1. *Extract* refers to ingesting data.  Spark easily connects to data in a number of different sources.
2. *Transform* refers to applying structure, parsing fields, cleaning data, and/or computing statistics.
3. *Load* refers to loading data to its final destination, usually a database or data warehouse.

**Question:** How does the Spark approach to ETL deal with devops issues such as updating a software version?  
**Answer:** By decoupling storage and compute, updating your Spark version is as easy as spinning up a new cluster.  Your existing code connects to Azure Blob Storage, or other storage, in the same way as before.  This also avoids the challenge of keeping a cluster always running, as is common with Hadoop clusters.

**Question:** How does the Spark approach to data applications differ from other solutions?  
**Answer:** Spark offers a unified solution to use cases that would otherwise need individual tools. For instance, Spark combines machine learning, ETL, stream processing, and a number of other solutions all with one technology.

## Next Steps

Start the next lesson, [Connecting to Azure Blob Storage]($./03-Connecting-to-Azure-Blob-Storage ).

## Additional Topics & Resources

**Q:** Where can I get more information on building ETL pipelines?  
**A:** Check out the Spark Summit talk on <a href="https://databricks.com/session/building-robust-etl-pipelines-with-apache-spark" target="_blank">Building Robust ETL Pipelines with Apache Spark</a>

**Q:** Where can I find out more information on moving from traditional ETL pipelines towards Spark?  
**A:** Check out the Spark Summit talk <a href="https://databricks.com/session/get-rid-of-traditional-etl-move-to-spark" target="_blank">Get Rid of Traditional ETL, Move to Spark!</a>

**Q:** What are the visualization options in Databricks?  
**A:** Databricks provides a wide variety of <a href="https://docs.azuredatabricks.net/user-guide/visualizations/index.html#id1" target="_blank">built-in visualizations</a>.  Databricks also supports a variety of third-party visualization libraries, including <a href="https://d3js.org/" target="_blank">d3.js</a>, <a href="https://matplotlib.org/" target="_blank">matplotlib</a>, <a href="http://ggplot.yhathq.com/" target="_blank">ggplot</a>, and <a href="https://plot.ly/" target="_blank">plotly</a>.