# <img src="https://files.training.databricks.com/images/DeltaLake-logo.png" width=80px> Open Source Delta Lake

[Delta Lake](https://delta.io/) is an open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads.

<img src="https://www.evernote.com/l/AAF4VIILJtFNZLuvZjGGhZTr2H6Z0wh6rOYB/image.png" width=900px>

In [0]:
# Configure the source data path and the Delta Lake path

sourcePath = "/databricks-datasets/learning-spark-v2/loans/loan-risks.snappy.parquet"
deltaPath = "/tmp/loans_delta"

# Read the Parquet file and save it as a Delta Lake table

spark.read.format("parquet").load(sourcePath).write.format("delta").save(deltaPath)


Output: the write creates a `_delta_log` directory and a Parquet data file (`.snappy.parquet`) under `/tmp/loans_delta`.

In [0]:
%fs ls "/tmp/loans_delta" 

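The **_delta_log** folder holds the table's transaction log as a set of files: `.json` files (the log entries, one per commit), `.crc` files (overall statistics for a version), and files that optimize reading the log.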

In [0]:
%fs ls "/tmp/loans_delta/_delta_log/"  

In [0]:
%fs head /tmp/loans_delta/_delta_log/00000000000000000000.crc
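
The file names in `_delta_log` encode the table version, as in the `.crc` file read above. A minimal sketch, assuming the convention of a 20-digit, zero-padded version number; the helper `log_file_name` is hypothetical and not part of any Delta Lake API:

```python
def log_file_name(version: int, extension: str = "json") -> str:
    """Build a _delta_log file name for a table version (hypothetical helper)."""
    if version < 0:
        raise ValueError("version must be non-negative")
    # Assumption: versions are written as 20-digit, zero-padded integers.
    return f"{version:020d}.{extension}"


first_crc = log_file_name(0, "crc")
second_commit = log_file_name(1)
print(first_crc)
print(second_commit)
```

Because the names are zero-padded, sorting them as plain strings also sorts them by version, which is why a directory listing shows the commits in order.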

Create a temporary view over the Delta Lake table:

In [0]:
spark.read.format("delta").load(deltaPath).createOrReplaceTempView("loans_delta")

spark.sql("SELECT * FROM loans_delta LIMIT 5").show()

Spark SQL queries can also run directly on a directory of data by using ``delta.`<deltaPath>` ``, where `<deltaPath>` is the absolute path written out inside backticks.

In [0]:
spark.sql("SELECT * FROM delta.deltaPath LIMIT 5") # Error: Path must be absolute (deltaPath here is literal text, not the Python variable)

In [0]:
spark.sql("SELECT * FROM delta.`deltaPath` LIMIT 5") # Error: Path must be absolute (backticks alone do not substitute the variable)

In [0]:
spark.sql("SELECT * FROM delta.`/tmp/loans_delta` LIMIT 5").show() # Shows the first 5 rows

In [0]:
spark.sql("SELECT count(*) FROM delta.`/tmp/loans_delta`").show() # Shows the number of records

In [0]:
# Substitute the absolute path held in the Python variable deltaPath;
# display() renders the result as a formatted table
display(spark.sql("SELECT * FROM delta.`{}` LIMIT 5".format(deltaPath)))
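
The cells above differ only in how the path reaches the SQL text. Inside a Python string, `deltaPath` is just the characters `deltaPath`, so the variable's value has to be substituted before Spark sees the query. A minimal sketch of that substitution, assuming a hypothetical helper `delta_query` (not a Spark or Delta Lake API) whose absolute-path check is a simplification:

```python
def delta_query(path: str, limit: int = 5) -> str:
    """Build a SELECT over a path-based Delta table (hypothetical helper)."""
    # Simplified check (assumption): accept "/..." paths or URIs such as "dbfs:/...".
    if not (path.startswith("/") or ":/" in path):
        raise ValueError(f"Path must be absolute: {path!r}")
    # Backticks make the whole path a single quoted identifier after `delta.`
    return f"SELECT * FROM delta.`{path}` LIMIT {int(limit)}"


deltaPath = "/tmp/loans_delta"
query = delta_query(deltaPath)
print(query)
```

In the notebook, the resulting string could then be passed to `spark.sql(query)`. Passing the literal text `"deltaPath"` to the helper raises the `ValueError` instead of producing a query.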

In [0]:
%sql
SELECT * FROM delta.`/tmp/loans_delta` LIMIT 5

### Create a Table Using Delta Lake

Create a table called `loans_data_delta` using `DELTA` out of the above data.

**The notation is:**
> `CREATE TABLE <table-name>` <br>
  `USING DELTA` <br>
  `LOCATION <path-to-data>` <br>
  
Tables created with a specified `LOCATION` are considered unmanaged by the metastore. Unlike a managed table, where no path is specified, an unmanaged table’s files are not deleted when you `DROP` the table. However, changes to either the registered table or the files will be reflected in both locations.

**Best Practice**
> Managed tables require that the data for your table be stored in DBFS. Unmanaged tables only store metadata in DBFS. 

**Note**
> Since Delta Lake stores schema (and partition) info in the `_delta_log` directory, we do not have to specify partition columns!

In [0]:
spark.sql("""
     DROP TABLE IF EXISTS loans_data_delta
""")
spark.sql("""
  CREATE TABLE loans_data_delta
  USING DELTA
  LOCATION '{}'
""".format(deltaPath))

In [0]:
%sql
SELECT count(*) FROM loans_data_delta

### Metadata

Because data already backs `loans_data_delta`,
the table in the Hive metastore automatically inherits the schema, partitioning,
and table properties of the existing data.

Note that the Hive metastore stores only the table name, path, and database info;
the actual schema is stored in the `_delta_log` directory, as shown below.

In [0]:
%fs ls /tmp/loans_delta/_delta_log/ 
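
To make the "schema lives in the log" point concrete, here is a minimal sketch of how a schema could be read back from a commit file. It assumes the Delta log layout of one JSON action per line, with a `metaData` action carrying `schemaString` and `partitionColumns`. The sample text is illustrative only: it is not copied from `/tmp/loans_delta`, and the column names `id` and `state` are made up.

```python
import json

# Illustrative commit content (hypothetical values, NOT the real loans table log).
sample_commit = "\n".join([
    json.dumps({"commitInfo": {"operation": "WRITE"}}),
    json.dumps({"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}}),
    json.dumps({"metaData": {
        "id": "hypothetical-table-id",
        "format": {"provider": "parquet", "options": {}},
        # The schema is itself a JSON document stored as a string.
        "schemaString": json.dumps({
            "type": "struct",
            "fields": [
                {"name": "id", "type": "long", "nullable": True, "metadata": {}},
                {"name": "state", "type": "string", "nullable": True, "metadata": {}},
            ],
        }),
        "partitionColumns": [],
        "configuration": {},
    }}),
])


def read_schema(commit_text):
    """Return ((name, type) pairs, partition columns) from the first metaData action."""
    for line in commit_text.splitlines():
        action = json.loads(line)
        if "metaData" in action:
            meta = action["metaData"]
            schema = json.loads(meta["schemaString"])
            columns = [(field["name"], field["type"]) for field in schema["fields"]]
            return columns, meta["partitionColumns"]
    return None, None


columns, partition_columns = read_schema(sample_commit)
print(columns)
print(partition_columns)
```

This is why `CREATE TABLE ... USING DELTA LOCATION ...` needs no column or partition list: the information is already recorded next to the data.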

`DESCRIBE DETAIL <tableName>` displays the table's metadata.

As long as a Delta Lake table already has some data in place, its schema can be inferred rather than declared.

In [0]:
%sql
DESCRIBE DETAIL loans_data_delta

### Key Takeaways

> Saving to Delta Lake is as easy as saving to Parquet, but it also creates a `_delta_log` directory of log files alongside the data.

> Creating tables with Delta Lake is straightforward, and you do not need to specify schemas when data is already in place.