# Getting started with Delta Lake

[Delta Lake](https://delta.io/) is an open storage format for the data in your Lakehouse. It adds an abstraction layer, the transaction log, on top of plain data files and serves as the storage foundation of your Lakehouse.

In [0]:
%run ./_resources/00-setup $reset_all_data=false

In [0]:
print(f"Our user dataset is stored under our Volume={folder}/user_parquet")

## Creating our first Delta Lake table:

In [0]:
%sql
CREATE TABLE IF NOT EXISTS user_delta (
  id BIGINT,
  creation_date TIMESTAMP,
  firstname STRING,
  lastname STRING,
  email STRING,
  address STRING,
  gender INT,
  age_group INT
);

-- Load data in the newly created table:
COPY INTO user_delta FROM '/Volumes/delta_learning/dev/delta_lake_raw_data/user_parquet'
FILEFORMAT = parquet;


In [0]:
%sql
SELECT * FROM user_delta;

In [0]:
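# Read the raw Parquet files from the Volume:
data_parquet = spark.read.parquet("/Volumes/delta_learning/dev/delta_lake_raw_data/user_parquet")

# Save as a managed table (Delta is the default format on Databricks); overwrite so the cell can be re-run:
data_parquet.write.mode("overwrite").saveAsTable("p_user_delta")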

In [0]:
display(spark.read.table("p_user_delta"))

## Upgrading an existing Parquet or Iceberg table to Delta Lake:

In [0]:
%sql
-- Only for Parquet tables:
CONVERT TO DELTA database_name.table_name;

-- If the table is partitioned:
CONVERT TO DELTA parquet.`s3://my-bucket/path/to/table`
  PARTITIONED BY (date DATE);

-- Uses Iceberg manifest for metadata:
CONVERT TO DELTA iceberg.`s3://my-bucket/path/to/table`;

## Unified Batch and Streaming operations:

In [0]:
# Read the insertion of data:
(spark.readStream
      .option("ignoreDeletes", "true")
      .option("ignoreChanges", "true")
      .table("user_delta")
      .createOrReplaceTempView("v_user_delta_stream"))

In [0]:
df = spark.sql("""
                SELECT gender, ROUND(AVG(age_group), 2)
                FROM v_user_delta_stream
                GROUP BY gender
               """)

display(df, checkpointLocation=get_chkp_folder(folder))

**Wait** until the stream is up and running before executing the code below:

In [0]:
%sql
INSERT INTO user_delta (id, creation_date, firstname, lastname, email, address, gender, age_group)
VALUES (99999, now(), 'Quentin', 'Ambard', 'quentin.ambard@databricks.com', 'FR', 2, 3);

## Full DML Support:

In [0]:
%sql
UPDATE user_delta SET age_group = 4 WHERE id = 99999

In [0]:
%sql
DELETE FROM user_delta WHERE id = 99999

In [0]:
%sql
CREATE TABLE IF NOT EXISTS user_updates 
  (id BIGINT, creation_date TIMESTAMP, 
   firstname STRING, lastname STRING, 
   email STRING, address STRING, 
   gender INT, age_group INT);

DELETE FROM user_updates;

INSERT INTO user_updates VALUES (1,     now(), 'Marco',   'polo',   'marco@polo.com',    'US', 2, 3); 
INSERT INTO user_updates VALUES (2,     now(), 'John',    'Doe',    'john@doe.com',      'US', 2, 3);
INSERT INTO user_updates VALUES (99999, now(), 'Quentin', 'Ambard', 'qa@databricks.com', 'FR', 2, 3);

SELECT * FROM user_updates;

In [0]:
%sql
MERGE INTO user_delta AS t
USING user_updates AS s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *;

SELECT * FROM user_delta WHERE id IN (1 ,2, 99999);
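
The `MERGE` above is an upsert keyed on `id`: matched rows take every column from the source, and unmatched source rows are inserted. Here is a minimal plain-Python sketch of that behavior, using lists of dicts as stand-ins for the two tables. The `merge_upsert` helper and the sample rows are illustrative assumptions, not part of Delta Lake's API.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def merge_upsert(target: List[Row], source: List[Row], key: str = "id") -> List[Row]:
    """Toy model of MERGE ... WHEN MATCHED UPDATE SET * WHEN NOT MATCHED INSERT *."""
    # Assumes keys are unique within the target.
    merged = {row[key]: dict(row) for row in target}
    for row in source:
        # Matched key: replace all columns. Unmatched key: insert the row.
        merged[row[key]] = dict(row)
    return list(merged.values())


# Hypothetical sample rows, trimmed to a few columns.
target = [
    {"id": 1, "firstname": "Old", "age_group": 1},
    {"id": 7, "firstname": "Untouched", "age_group": 2},
]
source = [
    {"id": 1, "firstname": "Marco", "age_group": 3},
    {"id": 99999, "firstname": "Quentin", "age_group": 3},
]

result = merge_upsert(target, source)
for row in result:
    print(row)
```

Unlike this sketch, a real `MERGE` fails when several source rows match the same target row, so deduplicate the source first.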

## Enforce Data Quality with constraint:

Delta Lake supports constraints. You can add a `CHECK` constraint with any boolean expression, and every write to the table must then satisfy it.

_Note: This enforces quality at the table level. Delta Live Tables offer much more advanced quality rules and expectations in data pipelines._
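
To make the behavior concrete, here is a minimal plain-Python sketch of a table-level check. `ToyTable` and `ConstraintViolation` are hypothetical names invented for this illustration and are not Delta Lake classes. The sketch assumes two things: existing rows are validated when the constraint is added, and a write that contains any violating row is rejected as a whole.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


class ConstraintViolation(Exception):
    """Raised when a row fails a named check (hypothetical error type)."""


class ToyTable:
    def __init__(self) -> None:
        self.rows: List[Row] = []
        self.constraints: Dict[str, Callable[[Row], bool]] = {}

    def add_constraint(self, name: str, check: Callable[[Row], bool]) -> None:
        # Existing rows are validated before the constraint is accepted.
        if not all(check(row) for row in self.rows):
            raise ConstraintViolation(f"existing rows violate {name}")
        self.constraints[name] = check

    def insert(self, new_rows: List[Row]) -> None:
        # Validate the whole batch first so a failure writes nothing.
        for row in new_rows:
            for name, check in self.constraints.items():
                if not check(row):
                    raise ConstraintViolation(f"{name} violated by {row}")
        self.rows.extend(new_rows)


table = ToyTable()
table.add_constraint("id_not_null", lambda row: row["id"] is not None)
table.insert([{"id": 1, "firstname": "Marco"}])

try:
    # One bad row is enough to reject the entire batch.
    table.insert([{"id": 2, "firstname": "John"}, {"id": None, "firstname": "Quentin"}])
except ConstraintViolation as err:
    print("rejected:", err)

print(len(table.rows))
```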

In [0]:
%sql
ALTER TABLE user_delta ADD CONSTRAINT id_not_null CHECK (id IS NOT NULL);

In [0]:
%sql
-- This command will fail because we insert a user with a null id:
INSERT INTO user_delta (id, creation_date, firstname, lastname, email, address, gender, age_group)
VALUES (null, now(), 'Quentin', 'Ambard', 'quentin.ambard@databricks.com', 'FR', 2, 3);

## Travel back in Time:

In [0]:
%sql
DESCRIBE HISTORY user_delta

In [0]:
%sql
-- Current state of the rows, as a baseline for time travel by version number or timestamp:
SELECT * FROM user_delta WHERE id IN (1 ,2, 99999);

In [0]:
%sql
SELECT * FROM user_delta 
VERSION AS OF 2
WHERE id IN (1 ,2, 99999);

In [0]:
%sql
-- Restore a Previous Version:
RESTORE TABLE user_delta TO VERSION AS OF 2;

In [0]:
%sql
-- Remove data files that are no longer referenced by the table and are older than 200 hours:
VACUUM user_delta RETAIN 200 HOURS;
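
Time travel and `RESTORE` work because every change is recorded as a new version that lists which data files were added and removed. The plain-Python sketch below models that idea only. `ToyLog` and the `part-N` file names are assumptions made up for illustration, and the real transaction log carries much more (schema, statistics, checkpoints).

```python
from typing import Dict, Iterable, List, Set


class ToyLog:
    """Toy commit log: each commit lists data files added and removed."""

    def __init__(self) -> None:
        self.commits: List[Dict[str, Set[str]]] = []

    def commit(self, add: Iterable[str] = (), remove: Iterable[str] = ()) -> int:
        self.commits.append({"add": set(add), "remove": set(remove)})
        return len(self.commits) - 1  # version number of the new commit

    def files_as_of(self, version: int) -> Set[str]:
        if not 0 <= version < len(self.commits):
            raise ValueError(f"unknown version {version}")
        live: Set[str] = set()
        # Replay commits 0..version to find the files visible at that version.
        for entry in self.commits[: version + 1]:
            live -= entry["remove"]
            live |= entry["add"]
        return live

    def restore(self, version: int) -> int:
        # Restoring appends a new commit; history is never rewritten.
        wanted = self.files_as_of(version)
        current = self.files_as_of(len(self.commits) - 1)
        return self.commit(add=wanted - current, remove=current - wanted)


log = ToyLog()
log.commit(add=["part-0"])                     # version 0: initial load
log.commit(add=["part-1"])                     # version 1: insert
log.commit(add=["part-2"], remove=["part-0"])  # version 2: update rewrites part-0

print(sorted(log.files_as_of(1)))
restored = log.restore(1)
print(restored, sorted(log.files_as_of(restored)))
```

In this model, `VACUUM` would physically delete files such as `part-0` once no recent version needs them, which is why versions older than the retention period can no longer be queried or restored.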

## CLONE Delta Tables:
You can create a copy of an existing Delta table at a specific version using the `clone` command. This is very useful to:
- Get data from a PROD environment to a STAGING one.
- Archive a specific version for regulatory reasons.

There are two types of clones:
* A **deep clone** is a clone that copies the source table data to the clone target in addition to the metadata of the existing table. 
* A **shallow clone** is a clone that does not copy the data files to the clone target. The table metadata is equivalent to the source. These clones are cheaper to create.

Any changes made to either deep or shallow clones affect only the clones themselves and not the source table.

*Note: Shallow clones point to the source table's data files. Running `VACUUM` on the source table may delete those files and break the shallow clone!*
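
The difference between the two clone types, and the `VACUUM` risk, can be sketched in plain Python. Everything here is a toy model: the `storage` dict stands in for cloud storage, a table is just a list of file paths, and the helper names are hypothetical. The retention period is ignored for brevity.

```python
from typing import Dict, List

# Toy object store: file path -> rows held in that file.
storage: Dict[str, List[dict]] = {}


def read_table(manifest: List[str]) -> List[dict]:
    rows: List[dict] = []
    for path in manifest:
        rows.extend(storage[path])  # KeyError if the file was vacuumed
    return rows


def shallow_clone(manifest: List[str]) -> List[str]:
    # Copies metadata only: the clone points at the source's files.
    return list(manifest)


def deep_clone(manifest: List[str], target_prefix: str) -> List[str]:
    # Copies the data files as well, so the clone is self-contained.
    cloned = []
    for i, path in enumerate(manifest):
        new_path = f"{target_prefix}/part-{i}.parquet"
        storage[new_path] = list(storage[path])
        cloned.append(new_path)
    return cloned


def vacuum(prefix: str, live: List[str]) -> None:
    # Delete files under the table's folder that it no longer references.
    for path in [p for p in storage if p.startswith(prefix + "/") and p not in live]:
        del storage[path]


storage["source/part-0.parquet"] = [{"id": 1, "age_group": 3}]
source = ["source/part-0.parquet"]

shallow = shallow_clone(source)
deep = deep_clone(source, "deep")

# An update on the source rewrites the file, then VACUUM cleans up.
storage["source/part-1.parquet"] = [{"id": 1, "age_group": 4}]
source = ["source/part-1.parquet"]
vacuum("source", source)

print(read_table(deep))
try:
    read_table(shallow)
    shallow_broken = False
except KeyError:
    shallow_broken = True
print("shallow clone broken:", shallow_broken)
```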

In [0]:
%sql
-- Shallow clone (zero copy):
CREATE TABLE IF NOT EXISTS user_delta_clone
  SHALLOW CLONE user_delta
  VERSION AS OF 2;

SELECT * FROM user_delta_clone;

In [0]:
%sql
-- Deep clone (copy data):
CREATE TABLE IF NOT EXISTS user_delta_clone_deep
  DEEP CLONE user_delta;

SELECT * FROM user_delta_clone_deep;

## Generated columns:

In [0]:
%sql
CREATE TABLE IF NOT EXISTS user_delta_generated_id (
  id BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 10000 INCREMENT BY 1),
  firstname STRING, 
  lastname STRING, 
  email STRING, 
  address STRING
);

INSERT INTO user_delta_generated_id (firstname, lastname, email, address)
  SELECT firstname, lastname, email, address FROM user_delta;

SELECT * FROM user_delta_generated_id;

In [0]:
DBDemos.stop_all_streams()