# Course Overview and Setup

Azure Databricks&reg; provides an Apache Spark&trade; as-a-service workspace environment, making it easy to manage clusters and explore data interactively.

## Databricks Delta

Databricks&reg; Delta is a transactional storage layer designed specifically to harness the power of Apache Spark and Databricks DBFS. The core abstraction of Databricks Delta is an optimized Spark table that stores your data as Parquet files in DBFS and maintains a transaction log that efficiently tracks changes to the table.
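To make the idea of a transaction log concrete, here is a toy sketch in plain Python. It is not Delta's actual log format or implementation; the class `ToyTableLog`, its methods, and the file names are all hypothetical and exist only for illustration. The sketch shows how an append-only list of "files added / files removed" entries is enough to reconstruct which data files make up the table at any version.

```python
import json


class ToyTableLog:
    """Toy model (hypothetical): a table whose state is defined by an append-only log."""

    def __init__(self):
        # Each entry is one committed transaction, stored as a JSON string.
        self.entries = []

    def commit(self, add=(), remove=()):
        """Record one transaction listing the data files it adds and removes."""
        entry = {"version": len(self.entries), "add": list(add), "remove": list(remove)}
        self.entries.append(json.dumps(entry))
        return entry["version"]

    def snapshot(self, version=None):
        """Replay the log up to `version` (inclusive) and return the live data files."""
        last = len(self.entries) - 1 if version is None else version
        live = set()
        for raw in self.entries[: last + 1]:
            entry = json.loads(raw)
            live.update(entry["add"])
            live.difference_update(entry["remove"])
        return sorted(live)


log = ToyTableLog()
log.commit(add=["part-0000.parquet", "part-0001.parquet"])  # initial write
log.commit(add=["part-0002.parquet"])                       # append
log.commit(add=["part-0003.parquet"],                       # rewrite two files as one
           remove=["part-0000.parquet", "part-0001.parquet"])

print(log.snapshot())           # files a reader sees at the latest version
print(log.snapshot(version=1))  # files a reader saw right after the append
```

Because readers only ever look at the files listed by a committed version, a writer can add or replace files without disturbing a read that is already in progress.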

**The course is composed of the following lessons:**  
1. Course Overview and Setup
2. Create Table
3. Append Table
4. Upsert Table
5. Streaming 
6. Optimization
7. Databricks Delta Architecture
8. Capstone Project

# The Challenge with Data Lakes
### AKA: It's not a Data Lake, it's a Data CESSPOOL

A <b>Data Lake</b>: 
* Is a storage repository that inexpensively stores a vast amount of raw data in its native format.
* Consists of current and historical data dumps in various formats including XML, JSON, CSV, Parquet, etc.
* May contain operational relational databases with live transactional data.
* In effect, it's a dumping ground of amorphous data.

To extract meaningful information out of a Data Lake, we need to resolve problems like:
* Schema enforcement when new tables are introduced 
* Table repairs when any new data is inserted into the data lake
* Frequent refreshes of metadata 
* Bottlenecks of small file sizes for distributed computations
* Difficulty re-sorting data by an index (e.g. userID) if the data is spread across many files and partitioned by another column (e.g. eventTime)

# The Solution: Databricks Delta

Databricks Delta is a Spark table with built-in reliability and performance optimizations.
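From a notebook, a Delta table is used through the ordinary Spark DataFrame reader and writer. The following is a minimal sketch, assuming it runs on a Databricks cluster with Delta available; the function name `append_json_to_delta` and both path arguments are hypothetical, and `spark` is the `SparkSession` that Databricks notebooks predefine.

```python
def append_json_to_delta(spark, source_path, delta_path):
    """Append JSON records to a Delta table directory, then read the table back.

    `spark` is assumed to be the notebook's SparkSession; both paths are
    placeholders supplied by the caller.
    """
    # Read the raw input with the regular batch reader.
    incoming = spark.read.format("json").load(source_path)

    # Write it out as Delta; only the format name differs from a Parquet write.
    incoming.write.format("delta").mode("append").save(delta_path)

    # Read the Delta table back as a DataFrame.
    return spark.read.format("delta").load(delta_path)
```

In a notebook cell this would be called as `append_json_to_delta(spark, source_path, delta_path)` with paths of your own choosing.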

You can read and write data stored in Databricks Delta using the same familiar Apache Spark SQL batch and streaming APIs you use to work with Hive tables or DBFS directories. Databricks Delta provides the following functionality:<br><br>

* <b>ACID transactions</b> - Multiple writers can simultaneously modify a data set and see consistent views.
* <b>DELETES/UPDATES/UPSERTS</b> - Writers can modify a data set without interfering with jobs reading the data set.
* <b>Automatic file management</b> - Data access speeds up by organizing data into large files that can be read efficiently.
* <b>Statistics and data skipping</b> - Reads can be 10-100x faster when statistics are tracked about the data in each file, allowing Delta to avoid reading irrelevant information.
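The data-skipping idea in the last bullet can be sketched in a few lines of plain Python. This is a toy illustration, not Delta's implementation: the `FileStats` class, the `files_to_read` function, the file names, and the ID ranges are all hypothetical. It assumes each file records the minimum and maximum `userID` it contains, so a lookup can rule out files without opening them.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Per-file statistics (hypothetical): the range of user IDs stored in one file."""
    path: str
    min_user_id: int
    max_user_id: int


def files_to_read(stats, user_id):
    """Return only the files whose [min, max] range could contain `user_id`."""
    return [s.path for s in stats if s.min_user_id <= user_id <= s.max_user_id]


stats = [
    FileStats("part-0000.parquet", 1, 1000),
    FileStats("part-0001.parquet", 1001, 2000),
    FileStats("part-0002.parquet", 2001, 3000),
]

# Only the second file's range covers user 1500, so the other two are skipped.
print(files_to_read(stats, 1500))
```

The benefit depends on how the data is laid out: if every file spans the full range of IDs, no file can be skipped.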

# Up and Running with Databricks

Before we continue with Databricks Delta, a little digression on setting up your Databricks account. 

You may wish to skip this section if you already have Databricks up and running.

Create a notebook and Spark cluster.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** This step requires you to navigate Databricks while doing this lesson.  We recommend you <a href="" target="_blank">open a second browser window</a> when navigating Databricks to view these instructions in one window while navigating in the other.

### Step 1
Databricks notebooks are backed by clusters, or networked computers that work together to process your data. Create a Spark cluster (if you already have a running cluster, skip to **Step 2**):
1. In your new window, click the **Clusters** button in the sidebar.
<div><img src="https://s3-us-west-2.amazonaws.com/curriculum-release/images/eLearning/create-cluster-4.png" style="height: 200px"/></div><br/>
2. Click the **Create Cluster** button.
<div><img src="https://s3-us-west-2.amazonaws.com/curriculum-release/images/eLearning/create-cluster-5.png" style="border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa"/></div><br/>
3. Name your cluster. Use your name or initials to easily differentiate your cluster from your coworkers' clusters.
4. Select the cluster type. We recommend the latest runtime and Scala **2.11**.
5. Specify your cluster configuration.
  * For clusters created on a **Community Edition** shard the default values are sufficient for the remaining fields.
  * For all other environments, refer to your company's policy on creating and using clusters.<br/><br/>
6. Click the **Create Cluster** button to launch the cluster.
<div><img src="https://s3-us-west-2.amazonaws.com/curriculum-release/images/eLearning/create-cluster-2.png" style="height: 300px; border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa"/></div>

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Check with your local system administrator to see if there is a recommended default cluster at your company to use for the rest of the class. This could save you some money!

### Step 2
Create a new notebook in your home folder:
1. Click the **Home** button in the sidebar.
<div><img src="https://s3-us-west-2.amazonaws.com/curriculum-release/images/eLearning/home.png" style="height: 200px"/></div><br/>
2. Right-click on your home folder.
3. Select **Create**.
4. Select **Notebook**.
<div><img src="https://s3-us-west-2.amazonaws.com/curriculum-release/images/eLearning/create-notebook-1.png" style="height: 150px; border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa"/></div><br/>
5. Name your notebook `First Notebook`.<br/>
6. Set the language to **Python**.<br/>
7. Select the cluster to which to attach this notebook.  
<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> If a cluster is not currently running, this option will not exist.
8. Click **Create**.
<div>
  <div style="float:left"><img src="https://s3-us-west-2.amazonaws.com/curriculum-release/images/eLearning/create-notebook-2b.png" style="width:400px; border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa"/></div>
  <div style="float:left">&nbsp;&nbsp;&nbsp;or&nbsp;&nbsp;&nbsp;</div>
  <div style="float:left"><img src="https://s3-us-west-2.amazonaws.com/curriculum-release/images/eLearning/create-notebook-2.png" style="width:400px; border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa"/></div>
  <div style="clear:both"></div>
</div>

### Step 3

Now that you have a notebook, use it to run code.
1. In the first cell of your notebook, type `1 + 1`. 
2. To run the cell, click the run icon and select **Run Cell**.
<div><img src="https://s3-us-west-2.amazonaws.com/curriculum-release/images/eLearning/run-notebook-1.png" style="width:600px; margin-bottom:1em; border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa"/></div>
<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> You can also run a cell by typing **Ctrl-Enter**.

    1 + 1


### Attach and Run

If your notebook was not previously attached to a cluster you might receive the following prompt: 
<div><img src="https://s3-us-west-2.amazonaws.com/curriculum-release/images/eLearning/run-notebook-2.png" style="margin-bottom:1em; border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa"/></div>

Before you click **Attach and Run**, make sure the prompt names the correct cluster.

If it is not the correct cluster, click **Cancel** and follow the steps in the next cell, **Attach & Detach**.

### Attach & Detach

If your notebook is detached you can attach it to another cluster:  
<img src="https://s3-us-west-2.amazonaws.com/curriculum-release/images/eLearning/attach-to-cluster.png" style="border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa"/>
<br/>
<br/>
<br/>
If your notebook is attached to a cluster you can:
* Detach your notebook from the cluster
* Restart the cluster
* Attach to another cluster
* Open the Spark UI
* View the Driver's log files

<img src="https://s3-us-west-2.amazonaws.com/curriculum-release/images/eLearning/detach-from-cluster.png" style="margin-bottom:1em; border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa"/>

## Summary
* To create a notebook click the down arrow on a folder and select **Create Notebook**.
* To import notebooks click the down arrow on a folder and select **Import**.
* To attach to a Spark cluster, select **Attached/Detached**, directly below the notebook title.
* Create clusters using the **Clusters** button on the left sidebar.

## Review Questions

**Question:** What is Databricks Delta?<br>
**Answer:** Databricks Delta is a mechanism for effectively managing the flow of data (<b>data pipeline</b>) to and from a <b>Data Lake</b>.

**Question:** What are some of the pain points of existing data pipelines?<br>
**Answer:** 
* Introduction of new tables requires schema creation 
* Whenever any new data is inserted into the data lake, table repairs are required
* Metadata must be frequently refreshed
* Small file sizes become a bottleneck for distributed computations
* If data is sorted by a particular index (e.g. eventTime), it is very difficult to re-sort the data by a different index (e.g. userID)

**Question:** How do you create a notebook?  
**Answer:** Sign in to Azure Databricks, select the **Home** icon from the sidebar, right-click your home folder, select **Create**, and then **Notebook**. In the **Create Notebook** dialog, specify the name of your notebook and the default programming language.

**Question:** How do you create a cluster?  
**Answer:** Select the **Clusters** icon on the sidebar, click the **Create Cluster** button, specify the settings for your cluster, and then click **Create Cluster**.

**Question:** How do you attach a notebook to a cluster?  
**Answer:** If you run a command while detached, you may be prompted to connect to a cluster. To connect to a specific cluster, open the cluster menu by clicking the **Attached/Detached** menu item and then selecting your desired cluster.

## Next Steps

This course is available in Python and Scala.  Start the next lesson, **02-Create**.
1. Click the **Home** icon in the left sidebar
2. Select your home folder
3. Select the folder **Delta-Version #**
4. Open the notebook **02-Create** in either the Python or Scala folder


<img src="https://s3-us-west-2.amazonaws.com/curriculum-release/images/eLearning/Delta/course-import-2.png" style="margin-bottom: 5px; border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa; width: auto; height: auto; max-height: 350px"/>

## Additional Topics & Resources
**Q:** Where can I find documentation on Databricks Delta?  
**A:** See <a href="https://docs.azuredatabricks.net/delta/index.html" target="_blank">Databricks Delta Guide</a>.

**Q:** Are there additional docs I can reference to find my way around Azure Databricks?  
**A:** See <a href="https://docs.azuredatabricks.net/getting-started/index.html" target="_blank">Getting Started with Databricks</a>.

**Q:** Where can I learn more about the cluster configuration options?  
**A:** See <a href="https://docs.azuredatabricks.net/user-guide/clusters/index.html" target="_blank">Spark Clusters on Databricks</a>.

**Q:** Can I import formats other than .dbc files?  
**A:** Yes, see <a href="https://docs.azuredatabricks.net/user-guide/notebooks/notebook-manage.html#import-a-notebook" target="_blank">Importing notebooks</a>.