# Course Overview and Setup
## ETL Part 1: Data Extraction

In this course, data engineers access data where it lives and then apply data extraction best practices, including schemas, corrupt record handling, and parallelized code. By the end of this course, you will be able to extract data from multiple sources, use schema inference, apply user-defined schemas, and navigate the Databricks and Spark documentation to source solutions.

### ETL with Azure Databricks and Spark

The **extract, transform, load (ETL)** process takes data from one or more sources, transforms it, normally by adding structure, and then loads it into a target database. 

A common ETL job takes log files from a web server, parses out the pertinent fields so they can be readily queried, and then loads them into a database.
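
The log-file job described above can be sketched in plain Python. This is only an illustration of the three stages, not the Spark-based approach the course teaches: the sample log lines, the `LOG_PATTERN` regular expression, the `requests` table, and the function names are all hypothetical, and an in-memory SQLite database stands in for the target database.

```python
import re
import sqlite3

# Hypothetical sample lines in Common Log Format; a real job would read files.
RAW_LOG_LINES = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '198.51.100.7 - - [10/Oct/2023:13:56:01 +0000] "POST /login HTTP/1.1" 302 0',
]

# Named groups pick out the fields we want to query later.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+)'
)


def extract(lines):
    """Extract: yield raw lines from the source (here, an in-memory list)."""
    yield from lines


def transform(lines):
    """Transform: add structure by parsing each line; skip lines that do not match."""
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            yield (
                match["ip"],
                match["timestamp"],
                match["method"],
                match["path"],
                int(match["status"]),
                int(match["size"]),
            )


def load(rows, connection):
    """Load: write the structured rows into the target table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS requests "
        "(ip TEXT, timestamp TEXT, method TEXT, path TEXT, status INTEGER, size INTEGER)"
    )
    connection.executemany("INSERT INTO requests VALUES (?, ?, ?, ?, ?, ?)", rows)
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(extract(RAW_LOG_LINES)), connection)

# Once loaded, the fields can be queried directly.
status_counts = connection.execute(
    "SELECT status, COUNT(*) FROM requests GROUP BY status ORDER BY status"
).fetchall()
print(status_counts)
```

Keeping each stage as its own function mirrors the extract, transform, load split, so any one stage can be swapped out (for example, a different source or target) without touching the others.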

ETL may seem simple: applying structure to data so it’s in a desired form. However, the complexity of ETL is in the details. Data engineers building ETL pipelines must understand and apply the following concepts:<br><br>

* Optimizing data formats and connections
* Determining the ideal schema
* Handling corrupt records
* Automating workloads

This course addresses these concepts.
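
Two of these concepts, applying a schema and handling corrupt records, can be illustrated with a minimal plain-Python sketch. It is not the Spark API that the course uses; `SCHEMA`, `RAW_RECORDS`, `conforms`, and `split_records` are hypothetical names chosen for this example, which sets aside any record that is malformed or does not match the expected field types instead of failing the whole job.

```python
import json

# Hypothetical user-defined schema: field name -> expected Python type.
SCHEMA = {"id": int, "url": str}

# Hypothetical input: one valid record and two corrupt ones.
RAW_RECORDS = [
    '{"id": 1, "url": "/index.html"}',
    '{"id": "two", "url": "/about"}',  # wrong type for "id"
    '{"id": 3, "url": "/contact"',     # malformed JSON (missing closing brace)
]


def conforms(record, schema):
    """Return True if the record has exactly the schema's fields with the expected types."""
    return (
        isinstance(record, dict)
        and set(record) == set(schema)
        and all(isinstance(record[name], expected) for name, expected in schema.items())
    )


def split_records(lines, schema):
    """Separate parseable, schema-conforming records from corrupt ones."""
    good, corrupt = [], []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            corrupt.append(line)  # keep the raw text for later inspection
            continue
        if conforms(record, schema):
            good.append(record)
        else:
            corrupt.append(line)
    return good, corrupt


good, corrupt = split_records(RAW_RECORDS, SCHEMA)
print(len(good), "good record(s);", len(corrupt), "corrupt record(s)")
```

Retaining the raw text of each corrupt record, rather than discarding it, makes it possible to inspect and repair bad input after the job finishes.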

<img src="https://files.training.databricks.com/images/eLearning/ETL-Part-1/ETL-overview.png" style="border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa"/>

Stay tuned for upcoming courses which will cover:<br><br>

* Complex and performant data transformations
* Schema changes over time  
* Recovery from job failures
* Avoiding duplicate records

## Exercise 1

Create a notebook and Spark cluster.

### Step 1
Databricks notebooks are backed by clusters, or networked computers, that process data. Create a Spark cluster (*if you already have a running cluster, skip to* **Step 3**):
1. Select the **Clusters** icon in the sidebar.
<div><img src="https://files.training.databricks.com/images/eLearning/create-cluster-4.png" style="height: 200px; margin: 20px"/></div>
2. Click the **Create Cluster** button.
<div><img src="https://files.training.databricks.com/images/eLearning/create-cluster-5.png" style="border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa; margin: 20px"/></div>
3. Name your cluster. Use your name or initials to easily differentiate your cluster from your coworkers' clusters.
4. Select the cluster type. We recommend the latest Databricks runtime (**3.3**, **3.4**, etc.) and Scala **2.11**.
5. Specify your cluster configuration.
  * For clusters created on a **Community Edition** shard the default values are sufficient for the remaining fields.
  * For all other shards, please refer to your company's policy on private clusters.<br/><br/>
6. Click the **Create Cluster** button.
<div><img src="https://files.training.databricks.com/images/eLearning/create-cluster-2.png" style="height: 300px; border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa; margin: 20px"/></div>


<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Check with your local system administrator to see if there is a recommended default cluster at your company to use for the rest of the class. This could save you money!

### Step 2

Create a new notebook in your home folder:
1. Select the **Home** icon in the sidebar.
<div><img src="https://files.training.databricks.com/images/eLearning/home.png" style="height: 200px; margin: 20px"/></div>
2. Right-click your home folder.
3. Select **Create**.
4. Select **Notebook**.
<div><img src="https://files.training.databricks.com/images/eLearning/create-notebook-1.png" style="height: 150px; border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa; margin: 20px"/></div>
5. Name your notebook `My Notebook`.<br/>
6. Set the language to **Python**.<br/>
7. Select the cluster to attach this notebook to.  
<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> If a cluster is not currently running, this option will not exist.
8. Click **Create**.
<div>
  <div style="float:left"><img src="https://files.training.databricks.com/images/eLearning/create-notebook-2b.png" style="width:400px; border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa; margin: 20px"/></div>
  <div style="float:left; margin-left: 3em; margin-right: 3em">or</div>
  <div style="float:left"><img src="https://files.training.databricks.com/images/eLearning/create-notebook-2.png" style="width:400px; border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa; margin: 20px"/></div>
  <div style="clear:both"></div>
</div>

### Step 3

Now that you have a notebook, use it to run code.
1. In the first cell of your notebook, type `1 + 1`. 
2. Run the cell: Click the **Run** icon and then select **Run Cell**.
<div><img src="https://files.training.databricks.com/images/eLearning/run-notebook-1.png" style="width:600px; margin-bottom:1em; border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa; margin: 20px"/></div>
<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> **Ctrl-Enter** also runs a cell.

```python
1 + 1
```
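
A cell is not limited to a single expression. As a slightly larger check that the notebook is attached and running, a cell along these lines could be used; the function and variable names are hypothetical and the code is plain Python, so it does not depend on any Spark or Databricks API.

```python
def sum_of_squares(limit):
    """Return the sum of the squares of the integers 0 through limit - 1."""
    return sum(value * value for value in range(limit))


total = sum_of_squares(10)
print(total)
```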


### Attach and Run

If your notebook was not previously attached to a cluster you might receive the following prompt: 
<div><img src="https://files.training.databricks.com/images/eLearning/run-notebook-2.png" style="margin-bottom:1em; border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa; margin: 20px"/></div>

If you click **Attach and Run**, first make sure that you are attaching to the correct cluster.

If it is not the correct cluster, click **Cancel** instead and see the next section, **Attach & Detach**.

### Attach & Detach

If your notebook is detached you can attach it to another cluster:  
<img src="https://files.training.databricks.com/images/eLearning/attach-to-cluster.png" style="border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa; margin: 20px"/>

If your notebook is attached to a cluster you can:
* Detach your notebook from the cluster.
* Restart the cluster.
* Attach to another cluster.
* Open the Spark UI.
* View the Driver's log files.

<img src="https://files.training.databricks.com/images/eLearning/detach-from-cluster.png" style="margin-bottom:1em; border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa"/>

## Summary
* Click the down arrow on a folder and select the **Create Notebook** option to create notebooks.
* Click the down arrow on a folder and select the **Import** option to import notebooks.
* Select the **Attached/Detached** option directly below the notebook title to attach to a Spark cluster.
* Create clusters using the Clusters button on the left sidebar.

## Review

**Question:** How do you create a Notebook?  
**Answer:** Sign in to Azure Databricks, select the **Home** icon from the sidebar, right-click your home folder, select **Create**, and then **Notebook**. In the **Create Notebook** dialog, specify the name of your notebook and the default programming language.

**Question:** How do you create a cluster?  
**Answer:** Select the **Clusters** icon on the sidebar, click the **Create Cluster** button, specify the settings for your cluster, and then click **Create Cluster**.

**Question:** How do you attach a notebook to a cluster?  
**Answer:** If you run a command while detached, you may be prompted to connect to a cluster. To connect to a specific cluster, open the cluster menu by clicking the **Attached/Detached** menu item and then selecting your desired cluster.

## Next Steps

This course is available in Python and Scala. Start the next lesson, **ETL Process Overview**.
1. Click the **Home** icon in the left sidebar.
2. Select your home folder.
3. Select the folder **ETL-Part-1**.
4. Open the notebook **02-ETL-Process-Overview** in either the Python or Scala folder.


<img src="https://files.training.databricks.com/images/eLearning/ETL-Part-1/Course-Import2-Azure.png" style="margin-bottom: 5px; border: 1px solid #aaa; border-radius: 10px 10px 10px 10px; box-shadow: 5px 5px 5px #aaa; width: auto; height: auto; max-height: 383px"/>

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> The Python and Scala content is identical except for the language used.

## Additional Topics & Resources
**Q:** Are there additional docs I can reference to find my way around Azure Databricks?  
**A:** See <a href="https://docs.azuredatabricks.net/getting-started/index.html" target="_blank">Getting Started Guides</a>.

**Q:** Where can I learn more about the cluster configuration options?  
**A:** See <a href="https://docs.azuredatabricks.net/user-guide/clusters/index.html#id1" target="_blank">Spark Clusters on Azure Databricks</a>.

**Q:** Can I import formats other than .dbc files?  
**A:** Yes, see <a href="https://docs.azuredatabricks.net/user-guide/notebooks/notebook-manage.html#notebook-external-formats" target="_blank">Importing Notebooks</a>.