




# Basic Transformations
This lesson covers basic ways to copy, overwrite, and update Delta Lake tables on the Databricks Data Intelligence Platform. We will clone tables, atomically replace their contents, upsert changes with **`MERGE`**, and define tables with generated columns and constraints.

## Learning Objectives
By the end of this lesson, you should be able to:
- Copy Delta Lake tables with **`DEEP CLONE`** and **`SHALLOW CLONE`**
- Use **`CREATE OR REPLACE TABLE`** and **`INSERT OVERWRITE`** to atomically overwrite tables
- Use **`MERGE`** to upsert records and to avoid inserting duplicates
- Use Spark SQL DDL to define tables with generated columns and constraints




## Run Setup

The setup script will create the data and declare necessary values for the rest of this notebook to execute.

In [0]:
%run ./Includes/Classroom-Setup-02.2

## Cloning Delta Lake Tables
Delta Lake has two options for efficiently copying Delta Lake tables.

**`DEEP CLONE`** fully copies data and metadata from a source table to a target. This copy occurs incrementally, so executing this command again can sync changes from the source to the target location.

In [0]:
CREATE OR REPLACE TABLE historical_sales_clone
DEEP CLONE historical_sales_bronze;

Because all the data files must be copied over, this can take quite a while for large datasets.
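
The incremental behavior described above can be sketched by simply running the same statement a second time. This is a minimal sketch that assumes the clone from the previous cell already exists; it uses only the table names from this lesson.

```sql
-- Re-running the same statement syncs the clone with the current state of the source.
CREATE OR REPLACE TABLE historical_sales_clone
DEEP CLONE historical_sales_bronze;

-- Each sync should show up as a new entry in the history of the target table.
DESCRIBE HISTORY historical_sales_clone;
```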

If you wish to create a copy of a table quickly to test out applying changes without the risk of modifying the current table, **`SHALLOW CLONE`** can be a good option. Shallow clones just copy the Delta transaction logs, meaning that the data doesn't move.

In [0]:
CREATE OR REPLACE TABLE historical_sales_shallow_clone
SHALLOW CLONE historical_sales_bronze;

In either case, data modifications applied to the cloned version of the table will be tracked and stored separately from the source. Cloning is a great way to set up tables for testing SQL code while still in development.
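
To see that changes to a clone stay separate from the source, a quick check along these lines can help. It is only a sketch: it empties the shallow clone created above, so re-run the clone cell afterward if you want the copy back.

```sql
-- Remove every row from the shallow clone only.
DELETE FROM historical_sales_shallow_clone;

-- Compare row counts: the source table should be unchanged.
SELECT 'source' AS table_name, count(*) AS row_count FROM historical_sales_bronze
UNION ALL
SELECT 'shallow_clone' AS table_name, count(*) AS row_count FROM historical_sales_shallow_clone;
```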

## Complete Overwrites

We can use overwrites to atomically replace all of the data in a table. There are multiple benefits to overwriting tables instead of deleting and recreating tables:
- Overwriting a table is much faster because it doesn’t need to list the directory recursively or delete any files.
- The old version of the table still exists, so you can easily retrieve the old data using Time Travel.
- It’s an atomic operation. Concurrent queries can still read the table while you are overwriting it.
- Due to ACID transaction guarantees, if overwriting the table fails, the table will be in its previous state.

Spark SQL provides two easy methods to accomplish complete overwrites.

Some students may have noticed that the previous lesson on CTAS statements actually used CRAS statements (to avoid potential errors if a cell was run multiple times).

**`CREATE OR REPLACE TABLE`** (CRAS) statements fully replace the contents of a table each time they execute.

Note: This table was created using a CRAS statement in the `Classroom-Setup` script.

In [0]:
CREATE OR REPLACE TABLE events AS
  SELECT * FROM parquet.`${da.paths.datasets}/ecommerce/raw/events-historical`;

DESCRIBE HISTORY events;

The table history shows that a previous version of this table was replaced. Version 0 was written by the CRAS statement in the `Classroom-Setup` script; version 1 was written by the CRAS statement in the previous cell.
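
Because the replaced version still exists, it can be queried with Time Travel. The sketch below assumes that version 0 is still available in the history of `events` (that is, it has not been vacuumed) and only reads data.

```sql
-- Row count of the table as it existed at version 0, before the overwrite.
SELECT count(*) AS rows_at_v0 FROM events VERSION AS OF 0;

-- Row count of the current version, for comparison.
SELECT count(*) AS rows_now FROM events;
```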

## INSERT OVERWRITE

**`INSERT OVERWRITE`** produces nearly the same outcome as the CRAS statement above: data in the target table is replaced by data from the query.

**`INSERT OVERWRITE`**:

- Can only overwrite an existing table, not create a new one like our CRAS statement
- Can overwrite only with new records that match the current table schema -- and thus can be a "safer" technique for overwriting an existing table without disrupting downstream consumers
- Can overwrite individual partitions

In [0]:
INSERT OVERWRITE events
  SELECT * FROM parquet.`${da.paths.datasets}/ecommerce/raw/events-historical`;
DESCRIBE HISTORY events;

The table history records the operation as a WRITE.

A primary difference between using CRAS and using `INSERT OVERWRITE` has to do with how Delta Lake enforces schema on write.

Whereas a CRAS statement will allow us to completely redefine the contents of our target table, **`INSERT OVERWRITE`** will fail if we try to change our schema (unless we provide optional settings). 

Uncomment and run the cell below to generate an expected error message.

In [0]:
-- INSERT OVERWRITE events
-- SELECT *, current_timestamp() FROM parquet.`${da.paths.datasets}/ecommerce/raw/sales-historical`

## Merge Updates

You can upsert data from a source table, view, or DataFrame into a target Delta table using the **`MERGE`** SQL operation. Delta Lake supports inserts, updates and deletes in **`MERGE`**, and supports extended syntax beyond the SQL standards to facilitate advanced use cases.

<strong><code>
MERGE INTO target a<br/>
USING source b<br/>
ON {merge_condition}<br/>
WHEN MATCHED THEN {matched_action}<br/>
WHEN NOT MATCHED THEN {not_matched_action}<br/>
</code></strong>

We will use the **`MERGE`** operation to update historic users data with updated emails and new users.

In [0]:
CREATE OR REPLACE TEMP VIEW users_update AS 
SELECT *, current_timestamp() AS updated 
FROM parquet.`${da.paths.datasets}/ecommerce/raw/users-30m`

The main benefits of **`MERGE`**:
* updates, inserts, and deletes are completed as a single transaction
* multiple conditionals can be added in addition to matching fields
* provides extensive options for implementing custom logic

Below, we'll only update records if the current row has a **`NULL`** email and the new row does not. 

All unmatched records from the new batch will be inserted.

In [0]:
MERGE INTO users a
USING users_update b
ON a.user_id = b.user_id
WHEN MATCHED AND a.email IS NULL AND b.email IS NOT NULL THEN
  UPDATE SET email = b.email, updated = b.updated
WHEN NOT MATCHED THEN 
  INSERT (user_id, email, updated)
  VALUES (b.user_id, b.email, b.updated)

Note that we explicitly specify the behavior of this statement for both the **`MATCHED`** and **`NOT MATCHED`** conditions. The logic shown here is just one example of what can be applied; it is not indicative of all **`MERGE`** behavior.
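
As one more illustration of the logic a **`MERGE`** can carry, the sketch below uses a delete action instead of an update. The `users_optout` view is hypothetical and is not created anywhere in this lesson, so treat this as a pattern rather than a cell to run as-is.

```sql
-- Sketch only: `users_optout` is a hypothetical view holding the user_id
-- values that should be removed from the target table.
MERGE INTO users a
USING users_optout b
ON a.user_id = b.user_id
WHEN MATCHED THEN
  DELETE
```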

## Insert-Only Merge for Deduplication

A common ETL use case is to collect logs or other ever-appending datasets into a Delta table through a series of append operations. 

Many source systems can generate duplicate records. With merge, you can avoid inserting the duplicate records by performing an insert-only merge.

This optimized command uses the same **`MERGE`** syntax but provides only a **`WHEN NOT MATCHED`** clause.

Below, we use this to confirm that records with the same **`user_id`** and **`event_timestamp`** aren't already in the **`events`** table.

In [0]:
MERGE INTO events a
USING events_update b
ON a.user_id = b.user_id AND a.event_timestamp = b.event_timestamp
WHEN NOT MATCHED AND b.traffic_source = 'email' THEN 
  INSERT *
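
An insert-only merge compares incoming rows against the target, not against each other, so duplicates inside the incoming batch itself would still be inserted. A hedged variant of the statement above de-duplicates the source first. It assumes that exact duplicate rows are the concern and that every column in `events_update` supports **`DISTINCT`**.

```sql
MERGE INTO events a
USING (
  -- Collapse exact duplicate rows inside the incoming batch first.
  SELECT DISTINCT * FROM events_update
) b
ON a.user_id = b.user_id AND a.event_timestamp = b.event_timestamp
WHEN NOT MATCHED AND b.traffic_source = 'email' THEN
  INSERT *
```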


 
## Filtering and Renaming Columns from Existing Tables

Simple transformations like changing column names or omitting columns from target tables can be easily accomplished during table creation.

The following statement creates a new table containing a subset of columns from the **`historical_sales_bronze`** table. 

Here, we'll presume that we're intentionally leaving out information that potentially identifies the user or that provides itemized purchase details. We'll also rename our fields with the assumption that a downstream system has different naming conventions than our source data.

In [0]:
CREATE OR REPLACE TABLE purchases AS
SELECT order_id AS id, transaction_timestamp, purchase_revenue_in_usd AS price
FROM historical_sales_bronze;

SELECT * FROM purchases LIMIT 10;


 
## Declare Schema with Generated Columns

Note in the cell above that the `transaction_timestamp` column appears to be some variant of a Unix timestamp, which may not be the most useful for our analysts to derive insights. This is a situation where generated columns would be beneficial.

Generated columns are a special type of column whose values are automatically generated based on a user-specified function over other columns in the Delta table. We first divide the timestamp that is currently in microseconds by 1e6 (1 million). We then use `CAST` to cast the result to a [TIMESTAMP](https://docs.databricks.com/en/sql/language-manual/data-types/timestamp-type.html). Then, we `CAST` to [DATE](https://docs.databricks.com/en/sql/language-manual/data-types/date-type.html).
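
The conversion steps can be tried on a single literal before they are placed in a table definition. The microsecond value here is made up for illustration, and the resulting date depends on the session time zone.

```sql
SELECT
  1592194221828000 AS ts_micros,                                         -- hypothetical value in microseconds
  1592194221828000 / 1e6 AS ts_seconds,                                  -- step 1: microseconds to seconds
  cast(1592194221828000 / 1e6 AS TIMESTAMP) AS ts,                       -- step 2: seconds since epoch to TIMESTAMP
  cast(cast(1592194221828000 / 1e6 AS TIMESTAMP) AS DATE) AS date_value  -- step 3: TIMESTAMP to DATE
```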

The code below demonstrates creating a new table while:
1. Specifying column names and types
1. Adding a <a href="https://docs.databricks.com/en/delta/generated-columns.html" target="_blank">generated column</a> to calculate the date
1. Providing a descriptive column comment for the generated column

Note that, at this point, the table contains no data. When we add data to the table that does not already contain a date value, the `date` column will be computed.

In [0]:
CREATE OR REPLACE TABLE purchase_dates (
  id STRING, 
  transaction_timestamp STRING, 
  price STRING,
  date DATE GENERATED ALWAYS AS (
    cast(cast(transaction_timestamp/1e6 AS TIMESTAMP) AS DATE))
    COMMENT "generated based on `transaction_timestamp` column");

SELECT * FROM purchase_dates

Let's add some data to the table.

The cell below uses the `MERGE INTO` command introduced earlier in this lesson. Note that our generated column, `date`, is properly computed from the `transaction_timestamp` column.

As with any Delta Lake table, a query automatically reads the most recent snapshot of the table; you never need to run **`REFRESH TABLE`**.

Lastly, note that if a field that would otherwise be generated is included in an insert to a table, this insert will fail if the value provided does not exactly match the value that would be derived by the logic used to define the generated column.

**NOTE**: The cell below configures a setting to allow for generating columns when using a Delta Lake **`MERGE INTO`** statement: **`SET spark.databricks.delta.schema.autoMerge.enabled=true`**

In [0]:
SET spark.databricks.delta.schema.autoMerge.enabled=true; 

MERGE INTO purchase_dates a
USING purchases b
ON a.id = b.id
WHEN NOT MATCHED THEN
  INSERT *;

SELECT * FROM purchase_dates;
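
The failure case for generated columns described earlier can be sketched with an insert that supplies a mismatched `date`. All values below are made up for illustration, and the statement is expected to fail because the supplied date differs from the date derived from `transaction_timestamp`.

```sql
-- Expected to fail: DATE '1999-01-01' does not match the value
-- generated from the (hypothetical) transaction_timestamp.
INSERT INTO purchase_dates VALUES
  ('hypothetical-id', '1592194221828000', '43.5', DATE '1999-01-01')
```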


## Add a Table Constraint

When an insert supplies a value for a generated column that does not match the derived value, the resulting error message refers to a **`CHECK constraint`**. Generated columns are a special implementation of check constraints.

Because Delta Lake enforces schema on write, Databricks can support standard SQL constraint management clauses to ensure the quality and integrity of data added to a table.

Databricks currently supports two types of constraints:
* <a href="https://docs.databricks.com/delta/delta-constraints.html#not-null-constraint" target="_blank">**`NOT NULL`** constraints</a>
* <a href="https://docs.databricks.com/delta/delta-constraints.html#check-constraint" target="_blank">**`CHECK`** constraints</a>

In both cases, you must ensure that no data violating the constraint is already in the table prior to defining the constraint. Once a constraint has been added to a table, data violating the constraint will result in write failure.

Below, we'll add a **`CHECK`** constraint to the **`date`** column of our table. Note that **`CHECK`** constraints look like standard **`WHERE`** clauses you might use to filter a dataset.

In [0]:
ALTER TABLE purchase_dates ADD CONSTRAINT valid_date CHECK (date > '2020-01-01');
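
Once the constraint exists, a write that violates it should be rejected. The sketch below uses made-up values whose timestamp falls at the start of 2019, so the generated `date` is earlier than the constraint allows. It also assumes that the generated column may be omitted from the column list so that Delta Lake computes it.

```sql
-- Expected to fail: the generated date is before 2020-01-01,
-- which violates the valid_date constraint.
INSERT INTO purchase_dates (id, transaction_timestamp, price)
VALUES ('hypothetical-id', '1546300800000000', '10.0');

-- Optional: a constraint can be removed again by name.
-- ALTER TABLE purchase_dates DROP CONSTRAINT valid_date;
```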

Table constraints are shown in the **`TBLPROPERTIES`** field.

In [0]:
DESCRIBE EXTENDED purchase_dates

The metadata fields added to the table provide useful information to understand when records were inserted and from where. This can be especially helpful if troubleshooting problems in the source data becomes necessary.

All of the comments and properties for a given table can be reviewed using **`DESCRIBE TABLE EXTENDED`**.

**NOTE**: Delta Lake automatically adds several table properties on table creation.
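
To look at the table properties on their own, without the rest of the extended description, something like the following can be used. The property key in the second statement is an assumption about how Delta Lake names the constraint added above; check the output of the first statement for the actual key.

```sql
-- List every table property, including any constraints.
SHOW TBLPROPERTIES purchase_dates;

-- Look up a single property by key (key name assumed, not confirmed by this lesson).
SHOW TBLPROPERTIES purchase_dates ('delta.constraints.valid_date');
```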


 
Run the following cell to delete the tables and files associated with this lesson.

In [0]:
%python
DA.cleanup()


&copy; 2024 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the 
<a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/><a href="https://databricks.com/privacy-policy">Privacy Policy</a> | 
<a href="https://databricks.com/terms-of-use">Terms of Use</a> | 
<a href="https://help.databricks.com/">Support</a>