# Overview
As we saw in the [Intro To The Data Lakehouse Architecture Notebook](../Intro%20To%20Data%20Lakehouse.ipynb), Databricks Delta Lake is an open source implementation of the third-generation architecture for a data store. It seeks to provide the best of both data warehouses and data lakes.

Delta Lake is implemented as a "storage layer" that sits on top of a Data Lake (generation II). Recall that a Data Lake is essentially a storage system for files (i.e., a file system). Some popular data stores expose an API rather than a POSIX file system mount point, but that is beyond the scope of this notebook. The important thing here is that Delta Lake sits on top of a Data Lake.

There are a number of popular Data Lake providers that are [supported](https://docs.delta.io/latest/delta-storage.html) including: 

- Amazon S3
- Microsoft Azure storage
- HDFS
- Google Cloud Storage
- Oracle Cloud Infrastructure
- IBM Cloud Object Storage

However, one can also use a traditional POSIX file system, so we can choose basically any path mounted on our local file system. In my case, I am using Ceph as my infinitely scalable file system, but that is another discussion.

For this notebook we will use a local filesystem path (a local directory) which points to a directory in this repository as our Data Lake. (More on this later).
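
As a minimal sketch of what "a local directory as our Data Lake" amounts to, the snippet below creates a directory and derives a table path inside it. The function name `make_local_data_lake`, the directory name `delta-demo-lake`, and the table name `example_table` are hypothetical placeholders, not the paths used in this repository; a scratch directory under the system temp dir is used so that nothing is written into the repository.

```python
import tempfile
from pathlib import Path


def make_local_data_lake(root) -> Path:
    """Create (if needed) and return an absolute directory that acts as the lake root."""
    lake = Path(root).expanduser().resolve()
    lake.mkdir(parents=True, exist_ok=True)
    return lake


# Hypothetical location: a scratch directory under the system temp dir.
lake_path = make_local_data_lake(Path(tempfile.gettempdir()) / "delta-demo-lake")

# Each table would live in its own sub-directory of the lake.
table_path = lake_path / "example_table"  # hypothetical table name

print(f"Data Lake root: {lake_path}")
print(f"Table path:     {table_path}")
```

The resulting path string is what would later be handed to Spark when reading or writing a table.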

## Features
According to the [documentation](https://docs.delta.io/latest/delta-intro.html), Delta Lake offers the following features:

- ACID transactions on Spark: Serializable isolation levels ensure that readers never see inconsistent data.
- Scalable metadata handling: Leverages Spark distributed processing power to handle all the metadata for petabyte-scale tables with billions of files at ease.
- Streaming and batch unification: A table in Delta Lake is a batch table as well as a streaming source and sink. Streaming data ingest, batch historic backfill, interactive queries all just work out of the box.
- Schema enforcement: Automatically handles schema variations to prevent insertion of bad records during ingestion.
- Time travel: Data versioning enables rollbacks, full historical audit trails, and reproducible machine learning experiments.
- Upserts and deletes: Supports merge, update and delete operations to enable complex use cases like change-data-capture, slowly-changing-dimension (SCD) operations, streaming upserts, and so on.
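
To make two of these features concrete, here is a hedged sketch of how time travel and an upsert can be expressed through the PySpark API. It assumes a `SparkSession` that is already configured for Delta Lake and an existing Delta table at `path`; the function names and the `id` join column are hypothetical. The delta-spark import sits inside the functions so that the definitions load even before the software is installed.

```python
def read_table_version(spark, path, version):
    """Time travel: load the table as it was at a given commit version."""
    return (
        spark.read.format("delta")
        .option("versionAsOf", version)
        .load(path)
    )


def upsert_by_id(spark, path, updates_df):
    """Upsert: update rows whose id matches, insert the rest."""
    from delta.tables import DeltaTable  # provided by the delta-spark package

    target = DeltaTable.forPath(spark, path)
    (
        target.alias("t")
        .merge(updates_df.alias("s"), "t.id = s.id")  # 'id' is an assumed key column
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )


def table_history(spark, path):
    """Audit trail: return a DataFrame describing the commits made to the table."""
    from delta.tables import DeltaTable

    return DeltaTable.forPath(spark, path).history()
```

None of these functions runs anything until it is called with a live Spark session, so they are safe to define at this point in the notebook.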

In this notebook we will put hands on keyboard to understand these features.

# 1. Installing Software
Delta Lake is tightly integrated with Apache Spark. Looking at the [quick start guide](https://docs.delta.io/latest/quick-start.html#set-up-apache-spark-with-delta-lake), we see that Apache Spark (and PySpark if using Python) is the main interface for interacting with Delta Lake.



Taking a deeper look at the [GitHub page](https://github.com/delta-io/delta), we see that:
> Delta Lake is a storage layer that brings scalable, ACID transactions to Apache Spark and other big-data engines.

This section mentions Spark frequently, so if you are not familiar with Apache Spark, I would recommend reviewing the [introductory material](../../../Machine%20Learning/Big%20Data%20And%20Big%20Compute/Apache%20Spark/README.md) in this repository.

## 1.1 Install Apache Spark And Pyspark
As mentioned in the [introductory material](../../../Machine%20Learning/Big%20Data%20And%20Big%20Compute/Apache%20Spark/README.md), we are running on Spark 3.1.1. Consult that material for information on installing Spark or PySpark.

## 1.2. Install Delta Lake Packages
The documentation was a bit sparse on installing the Delta Lake software. The first thing to decide is which version to install. According to the [documentation](https://docs.delta.io/latest/releases.html#compatibility-with-apache-spark), we have the following compatibility matrix:


<table border="1" class="docutils">
<colgroup>
<col width="41%">
<col width="59%">
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Delta Lake version</th>
<th class="head">Apache Spark version</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td> 1.1.x</td>
<td> 3.2.x</td>
</tr>
<tr class="row-odd"><td> 1.0.x</td>
<td> 3.1.x</td>
</tr>
<tr class="row-even"><td> 0.7.x and 0.8.x</td>
<td> 3.0.x</td>
</tr>
<tr class="row-odd"><td> Below 0.7.0</td>
<td> 2.4.2 - 2.4.<em>&lt;latest&gt;</em></td>
</tr>
</tbody>
</table>

As we have been using Spark 3.1.1 we will be installing Delta Lake 1.0.x.
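
The lookup we just did by eye can be sketched as a small helper. The mapping below is transcribed from the table above; the function name `delta_series_for_spark` is hypothetical and is not part of any Delta Lake or Spark API.

```python
# Spark (major, minor) line -> compatible Delta Lake series, per the table above.
DELTA_FOR_SPARK = {
    (3, 2): "1.1.x",
    (3, 1): "1.0.x",
    (3, 0): "0.7.x and 0.8.x",
    (2, 4): "below 0.7.0",
}


def delta_series_for_spark(spark_version: str) -> str:
    """Return the Delta Lake series listed as compatible with a Spark version."""
    parts = [int(p) for p in spark_version.split(".")[:3]]
    key = (parts[0], parts[1])
    if key not in DELTA_FOR_SPARK:
        raise ValueError(f"Spark {spark_version} is not covered by the matrix")
    # The table only lists Spark 2.4.2 and later for the 2.4 line.
    if key == (2, 4) and len(parts) > 2 and parts[2] < 2:
        raise ValueError(f"Spark {spark_version} is older than 2.4.2")
    return DELTA_FOR_SPARK[key]


print(delta_series_for_spark("3.1.1"))  # 1.0.x
```

The matrix only covers the releases listed at the time of writing, so newer Spark versions raise an error here rather than guessing.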

Delta Lake exists as a set of JARs that extend and stack on top of the Apache Spark stack.

### 1.2.1. Install delta-spark Python Library
This PyPI package contains the Python APIs for using Delta Lake with Apache Spark. However, the package does not include the related Scala JAR files that are the core of the code base (recall that Spark is written in Scala/Java). The JARs required by Delta Lake will be fetched at runtime after adding specific configurations to the Spark driver.
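
As a sketch of what those configurations look like, the helper below builds the Spark settings described in the Delta Lake quick start guide: the Maven coordinate of the core JAR plus the SQL extension and catalog classes. The helper names are hypothetical, and the Scala version `2.12` is an assumption that matches the pre-built Spark 3.1.x distributions; adjust it if your Spark build differs.

```python
def delta_spark_conf(delta_version="1.0.1", scala_version="2.12"):
    """Return Spark settings that pull in the Delta Lake JARs at startup."""
    return {
        # Maven coordinate that Spark resolves and downloads when the driver starts.
        "spark.jars.packages": f"io.delta:delta-core_{scala_version}:{delta_version}",
        # Registers the Delta Lake SQL extension with the session.
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        # Uses the Delta catalog implementation for the default catalog.
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    }


def build_delta_session(app_name="delta-demo"):
    """Create a SparkSession with the settings above (requires pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily; needs pyspark installed

    builder = SparkSession.builder.appName(app_name)
    for key, value in delta_spark_conf().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


for key, value in delta_spark_conf().items():
    print(f"{key} = {value}")
```

The delta-spark package also provides a `configure_spark_with_delta_pip` helper that sets the package coordinate for you; the explicit dictionary is used here only to show which settings are involved.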


For more information, see the [PyPI page](https://pypi.org/project/delta-spark/).

```python
! pip install delta-spark==1.0.1
```

```
Collecting delta-spark==1.0.1
  Downloading delta_spark-1.0.1-py3-none-any.whl (17 kB)
Collecting importlib-metadata>=3.10.0
  Downloading importlib_metadata-4.11.3-py3-none-any.whl (18 kB)
Collecting zipp>=0.5
  Downloading zipp-3.7.0-py3-none-any.whl (5.3 kB)
Installing collected packages: zipp, importlib-metadata, delta-spark
Successfully installed delta-spark-1.0.1 importlib-metadata-4.11.3 zipp-3.7.0
You should consider upgrading via the '/usr/local/bin/python3 -m pip install --upgrade pip' command.
```
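
After installing, it can be worth confirming which versions actually ended up in the environment, since a mismatch with the compatibility matrix may only show up later at runtime. A minimal check using the standard library follows; the helper name `installed_version` is hypothetical, and `importlib.metadata` requires Python 3.8 or newer.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("delta-spark", "pyspark"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```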

# 2. Running Spark with Delta Lake integration
