
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

# Course Overview and Setup

Databricks&reg; provides an Apache Spark&trade; as-a-service workspace environment, making it easy to manage clusters and explore data interactively.

## Databricks Delta

Databricks&reg; Delta is a transactional storage layer designed specifically to harness the power of Apache Spark and Databricks DBFS. The core abstraction of Databricks Delta is an optimized Spark table that stores your data as Parquet files in DBFS and maintains a transaction log that efficiently tracks changes to the table.
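The relationship between the data files and the transaction log can be sketched with a toy model in plain Python. This is a simplified illustration, not the actual Delta implementation: it assumes each log entry is a JSON object carrying either an `add` or a `remove` action with a file `path`, and the file names and the `replay_log` helper are hypothetical.

```python
import json

def replay_log(log_lines):
    """Replay commit actions in order and return the set of live data files."""
    live_files = set()
    for line in log_lines:
        action = json.loads(line)
        if "add" in action:
            live_files.add(action["add"]["path"])
        elif "remove" in action:
            live_files.discard(action["remove"]["path"])
    return live_files

# Hypothetical log: two appends, then a rewrite that swaps both files for one.
log = [
    json.dumps({"add": {"path": "part-0000.parquet"}}),
    json.dumps({"add": {"path": "part-0001.parquet"}}),
    json.dumps({"remove": {"path": "part-0000.parquet"}}),
    json.dumps({"remove": {"path": "part-0001.parquet"}}),
    json.dumps({"add": {"path": "part-0002.parquet"}}),
]

current_files = replay_log(log)             # state after every action
files_after_two_commits = replay_log(log[:2])  # state at an earlier point in the log
```

Replaying only a prefix of the log yields an earlier state of the table, which is the intuition behind reading older versions of the data.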

## Lessons
0. Course Overview and Setup
0. Create Table
0. Append Table
0. Upsert Table
0. Streaming 
0. Optimization
0. Databricks Delta Architecture
0. Time Travel

## Audience
* Primary Audience: Data Engineers
* Secondary Audience: Data Analysts and Data Scientists

## Prerequisites
* Web browser: **Chrome**
* A cluster configured with **8 cores** and **DBR 6.2**
* Suggested Courses from <a href="https://academy.databricks.com/" target="_blank">Databricks Academy</a>:
  - ETL Part 1
  - Spark-SQL


<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Before You Start</h2>

Before starting this course, you will need to create a cluster and attach it to this notebook.

Please configure your cluster to use Databricks Runtime version **6.2**, which includes:
- Python Version 3.x
- Scala Version 2.11
- Apache Spark 2.4.4

<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> Do not use an ML or GPU-accelerated runtime.

Step-by-step instructions for creating a cluster are included here:
- <a href="https://www.databricks.training/step-by-step/creating-clusters-on-azure" target="_blank">Azure Databricks</a>
- <a href="https://www.databricks.training/step-by-step/creating-clusters-on-aws" target="_blank">Databricks on AWS</a>
- <a href="https://www.databricks.training/step-by-step/creating-clusters-on-ce" target="_blank">Databricks Community Edition (CE)</a>

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> This courseware has been tested against the specific DBR listed above. Using an untested DBR may yield unexpected results and/or various errors. If the required DBR has been deprecated, please <a href="https://academy.databricks.com/" target="_blank">download an updated version of this course</a>.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Setup
In general, all courses are designed to run on one of the following Databricks platforms:
* Databricks Community Edition (CE)
* Databricks (an AWS-hosted service)
* Azure-Databricks (an Azure-hosted service)

<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> Some features are not available on the Community Edition, which limits the ability of some courses to be executed in that environment. Please see the course's prerequisites for specific information on this topic.

<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> Additionally, private installations of Databricks (e.g., accounts provided by your employer) may impose other limitations, such as restrictive permissions or language restrictions (for example, prohibiting the use of Scala), which will further inhibit some courses from being executed in those environments.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** All courses provided by Databricks Academy rely on custom variables, functions, and settings to provide you with the best experience possible.

For each lesson to execute correctly, please make sure to run the **`Classroom-Setup`** cell at the start of each lesson (see the next cell) and the **`Classroom-Cleanup`** cell at the end of each lesson.

In [5]:
%run "./Includes/Classroom-Setup"

<iframe  
src="//fast.wistia.net/embed/iframe/q6wgvu9noh?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/q6wgvu9noh?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

# The Challenge with Data Lakes
### Or, it's not a Data Lake, it's a Data Swamp


A <b>Data Lake</b>: 
* Is a storage repository that inexpensively stores a vast amount of raw data in its native format.
* Consists of current and historical data dumps in various formats including XML, JSON, CSV, Parquet, etc.
* May contain operational relational databases with live transactional data.
* Is, in effect, a dumping ground of amorphous data.

To extract meaningful information out of a Data Lake, we need to resolve problems like:
* Schema enforcement when new tables are introduced 
* Table repairs when any new data is inserted into the data lake
* Frequent refreshes of metadata 
* Bottlenecks of small file sizes for distributed computations
* Difficulty re-sorting data by an index (e.g., `userID`) when the data is spread across many files and partitioned by another column (e.g., `eventTime`)
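To make the first item in this list concrete, here is a minimal sketch of schema enforcement in plain Python. It is not how any particular engine implements it: the `EXPECTED_SCHEMA` columns (beyond `userID` and `eventTime`, which appear above) and the helper names are hypothetical, and a real system would validate typed columns rather than Python dictionaries.

```python
# Hypothetical table schema: column name -> required Python type.
EXPECTED_SCHEMA = {"userID": int, "eventTime": str, "action": str}

def validate_record(record, schema=EXPECTED_SCHEMA):
    """Raise ValueError if the record's columns or types differ from the schema."""
    if set(record) != set(schema):
        raise ValueError(
            f"column mismatch: expected {sorted(schema)}, got {sorted(record)}"
        )
    for column, expected_type in schema.items():
        if not isinstance(record[column], expected_type):
            raise ValueError(
                f"column {column!r} expected {expected_type.__name__}, "
                f"got {type(record[column]).__name__}"
            )

def append_validated(table, records, schema=EXPECTED_SCHEMA):
    """Validate every record first, so a bad batch leaves the table untouched."""
    for record in records:
        validate_record(record, schema)
    table.extend(records)
```

Validating the whole batch before appending anything means a single malformed record rejects the write instead of leaving the table half-updated.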

# The Solution: Databricks Delta

Databricks Delta is a unified data management system that brings reliability and performance (10-100x faster than Spark on Parquet) to cloud data lakes.  Delta's core abstraction is a Spark table with built-in reliability and performance optimizations.
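Because the core abstraction is a Spark table, working with it looks like ordinary DataFrame code with the `delta` format name. The following is a minimal sketch: the helper names `write_delta` and `read_delta` are hypothetical, `spark` is assumed to be the session that Databricks notebooks predefine, and `df` is assumed to be any existing Spark DataFrame.

```python
def write_delta(df, path, mode="append"):
    """Write a Spark DataFrame to `path` in Delta format.

    `mode` is passed straight to DataFrameWriter.mode
    (for example "append" or "overwrite").
    """
    df.write.format("delta").mode(mode).save(path)

def read_delta(spark, path):
    """Load the Delta table stored at `path` as a Spark DataFrame."""
    return spark.read.format("delta").load(path)
```

In a notebook attached to a cluster, a call such as `write_delta(df, "/tmp/events")` followed by `read_delta(spark, "/tmp/events")` would round-trip the data; the path here is only a placeholder.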

You can read and write data stored in Databricks Delta using the same familiar Apache Spark SQL batch and streaming APIs you use to work with Hive tables or DBFS directories. Databricks Delta provides the following functionality:<br><br>

* <b>ACID transactions</b> - Multiple writers can simultaneously modify a data set and see consistent views.
* <b>DELETES/UPDATES/UPSERTS</b> - Writers can modify a data set without interfering with jobs reading the data set.
* <b>Automatic file management</b> - Data access speeds up by organizing data into large files that can be read efficiently.
* <b>Statistics and data skipping</b> - Reads are 10-100x faster when statistics are tracked about the data in each file, allowing Delta to avoid reading irrelevant information.
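The idea behind statistics and data skipping can be illustrated with a small plain-Python sketch. It assumes that a minimum and maximum value of one column is recorded per file; the `FileStats` class, the `files_to_scan` helper, and the file names are hypothetical, and the real feature tracks more than this.

```python
from dataclasses import dataclass

@dataclass
class FileStats:
    path: str
    min_user_id: int
    max_user_id: int

def files_to_scan(stats, user_id):
    """Return only the files whose [min, max] range could contain user_id."""
    return [s.path for s in stats if s.min_user_id <= user_id <= s.max_user_id]

# Hypothetical per-file statistics for a table with three data files.
stats = [
    FileStats("part-0000.parquet", 1, 100),
    FileStats("part-0001.parquet", 101, 200),
    FileStats("part-0002.parquet", 201, 300),
]

# A query for userID 150 only needs to open the second file.
selected = files_to_scan(stats, 150)
```

The pruning is only effective when values are clustered within files; if every file spans the full range of `userID` values, no file can be skipped.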

## Review Questions

**Question:** What is Databricks Delta?<br>
**Answer:** Databricks Delta is a mechanism for effectively managing the flow of data (<b>data pipeline</b>) to and from a <b>Data Lake</b>.

**Question:** What are some of the pain points of existing data pipelines?<br>
**Answer:** 
* Introduction of new tables requires schema creation 
* Whenever any new data is inserted into the data lake, table repairs are required
* Metadata must be frequently refreshed
* Small file sizes become a bottleneck for distributed computations
* If data is sorted by, say, `eventTime`, it can be computationally expensive to sort the data by a different column, say, `userID`

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Classroom-Cleanup

During the course of this lesson, files, tables, and other artifacts may have been created.

These artifacts create clutter, consume resources (generally in the form of storage), and may incur some [minor] long-term expense.

You can remove these artifacts by running the **`Classroom-Cleanup`** cell below.

In [11]:
%run ./Includes/Classroom-Cleanup

<h2><img src="https://files.training.databricks.com/images/105/logo_spark_tiny.png"> Next Steps</h2>

Start the next lesson, [Create]($./Delta 02 - Create).

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>