### Important Note!

This notebook was run on the 12.2 LTS runtime version!

There are no concerns regarding cluster size (memory and cores).

The purpose of this notebook is only to show the Python equivalents of the SQL commands. **Always use the SQL version as the reference**, as it was the one used during the Databricks SQL course.

## Databricks Notebook Magic Commands

| Command | Result |
|--------|--------|
| `%lsmagic` | Lists all magic commands. |
| `%md` | Renders the cell as Markdown. |
| `%sql` | Use SQL. |
| `%python` | Use Python. |
| `%scala` | Use Scala. |
| `%r` | Use R. |
| `%run` | Executes a Python file or another notebook. |
| `%who` | Shows all variables. |
| `%env` | Sets environment variables. |
| `%fs` | Enables file system commands from DBFS Utils (`dbutils.fs`). |
| `%sh` | Executes Shell commands on cluster. |
| `%matplotlib` | Matplotlib backend resources. |
| `%config` | Sets notebook configuration options. |
| `%pip` | Install Python packages. |
| `%load` | Loads the contents of a file into a cell. |
| `%reload` | Reloads contents. |
| `%jobs` | Lists running jobs. |

In [0]:
%sql
SELECT "HELLO, WORLD! (using SQL)"

"HELLO, WORLD! (using SQL)"
"HELLO, WORLD! (using SQL)"


In [0]:
print("HELLO, WORLD! (using Python)")

HELLO, WORLD! (using Python)


In [0]:
%scala
println("HELLO, WORLD! (using Scala)")

In [0]:
%r
print("HELLO, WORLD! (using r)")

[1] "HELLO, WORLD! (using r)"

In [0]:
%sh
echo "HELLO, WORLD!"

HELLO, WORLD!


In [0]:
A = 30

In [0]:
B = A

In [0]:
C = B / 3

## What is Apache Spark?

Apache Spark is a processing engine that can analyze data using SQL, Python, Scala, R, and Java. It also includes frameworks for machine learning, graph processing, and streaming:

![Spark Engines](https://raw.githubusercontent.com/owshq-plumbers/trn-cc-databricks-sql-zero-to-zero/main/images/spark-engine.png)
<br/>
<br/>

* The DataFrame API is an abstraction built on top of RDDs, yet thanks to the Catalyst Optimizer it can run roughly 5-20x faster than traditional RDD code.

![Spark Unified Engine](https://raw.githubusercontent.com/owshq-plumbers/trn-cc-databricks-sql-zero-to-zero/main/images/unified-engine.png)


## Spark Cluster Architecture: Drivers, Executors, Slots & Tasks
![Spark Physical Cluster, slots](https://raw.githubusercontent.com/owshq-plumbers/trn-cc-databricks-sql-zero-to-zero/main/images/spark-driver.png)

When you create a cluster, you can choose between **single-node** and **multi-node**, and you have to specify the driver machine type as well as the worker machine type.


## Spark Jobs, Lazy Evaluation, Transformations & Actions
![Spark Jobs, Lazy Evaluation, Transformations & Actions](https://raw.githubusercontent.com/owshq-plumbers/trn-cc-databricks-sql-zero-to-zero/main/images/spark_bookclub.png)

* Every Apache Spark execution is classified as a Spark job, which is then divided into stages that group tasks (the smallest unit of work).
* Because Apache Spark follows the lazy evaluation model, every operation is either an **action** or a **transformation**.
* Every Spark job is triggered by an action, while stages are organized according to shuffle boundaries and query optimization rules.
* Shuffle? Yes, Spark transformations are further divided into **narrow transformations** and **wide transformations (shuffle)**.
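
The points above can be illustrated without a cluster. The following is a minimal pure-Python sketch, not Spark itself: `LazyDataset` and its methods are hypothetical names invented for this illustration. Transformations only record a step in a plan; the `collect()` action executes the plan. The sketch also assumes, as a simplification, that every wide transformation starts a new stage.

```python
class LazyDataset:
    """Toy stand-in for a DataFrame: records steps, runs nothing until an action."""

    def __init__(self, rows, plan=None):
        self._rows = list(rows)
        self._plan = plan or []  # list of (name, kind, function)

    def _with_step(self, name, kind, fn):
        # Transformations return a NEW dataset with a longer plan (no work done yet).
        return LazyDataset(self._rows, self._plan + [(name, kind, fn)])

    def filter(self, predicate):
        # Narrow: each row is handled independently of all others.
        return self._with_step(
            "filter", "narrow", lambda rows: [r for r in rows if predicate(r)]
        )

    def distinct(self):
        # Wide: rows must be compared across the whole dataset (a shuffle in Spark).
        return self._with_step("distinct", "wide", lambda rows: sorted(set(rows)))

    def stages(self):
        # Simplifying assumption of this sketch: each wide step opens a new stage.
        stages, current = [], []
        for name, kind, _ in self._plan:
            if kind == "wide" and current:
                stages.append(current)
                current = []
            current.append(name)
        if current:
            stages.append(current)
        return stages

    def collect(self):
        # Action: only now is the recorded plan executed, step by step.
        rows = self._rows
        for _, _, fn in self._plan:
            rows = fn(rows)
        return rows


ds = (
    LazyDataset([3, 1, 3, 8, 1, 10])
    .filter(lambda x: x > 1)
    .distinct()
    .filter(lambda x: x < 10)
)
print(ds.stages())   # [['filter'], ['distinct', 'filter']]
print(ds.collect())  # [3, 8]
```

Nothing is computed while the chain of `filter` and `distinct` calls is built; the work happens only inside `collect()`, which mirrors how a Spark job is triggered by an action.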



## Some examples of Spark Action

| Command | Result |
|--------|--------|
| `collect()` | Returns an array that contains all the rows in this Dataset. |
| `count()` | Returns the number of rows in the Dataset. |
| `first()` | Returns the first row. |
| `foreach(f)` | Applies a function f to all rows. |
| `head()` | Returns the first row. |
| `show(..)` | Displays the top 20 rows of the Dataset in tabular form. |
| `take(n)` | Returns the first n rows in the Dataset. |


## Some examples of Spark Transformation

| Command | Type | Result |
|--------|--------|-------------|
| `filter()` | Narrow | Returns a new DataFrame after applying the filter function to the source dataset. |
| `distinct()` | Wide | Returns a new DataFrame containing the distinct rows in this DataFrame. |
| `join()` | Wide | Joins with another DataFrame, using the given join expression. |
| `union()` | Narrow | Combines two DataFrames and returns the new DataFrame. |
| `repartition()` | Wide | Returns a new DataFrame (hash partitioned) partitioned by the given partitioning expressions. |
| `groupBy()` | Wide | Groups the DataFrame using the specified columns, so we can run aggregation on them. |
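
To make the Narrow/Wide column concrete, here is a small pure-Python sketch that models partitions as plain lists. It is only an illustration under stated assumptions: the rows, the helper names `narrow_filter` and `wide_group_sum`, and the toy hash are all invented here and are not Spark APIs.

```python
from collections import defaultdict

# Two partitions of (gender, salary) rows; the values are made up for illustration.
partitions = [
    [("F", 100), ("M", 80), ("F", 60)],
    [("M", 120), ("F", 90)],
]


def narrow_filter(parts, predicate):
    # Narrow: every output partition depends on exactly one input partition.
    return [[row for row in part if predicate(row)] for part in parts]


def wide_group_sum(parts, num_output_partitions):
    # Wide: rows with the same key must meet in the same output partition,
    # so rows are redistributed by key (the "shuffle") before aggregating.
    shuffled = [defaultdict(int) for _ in range(num_output_partitions)]
    for part in parts:
        for key, value in part:
            target = sum(key.encode()) % num_output_partitions  # deterministic toy hash
            shuffled[target][key] += value
    return [dict(p) for p in shuffled]


high_paid = narrow_filter(partitions, lambda row: row[1] >= 80)
totals = wide_group_sum(high_paid, num_output_partitions=2)
print(high_paid)  # [[('F', 100), ('M', 80)], [('M', 120), ('F', 90)]]
print(totals)     # [{'F': 190}, {'M': 200}]
```

The filter never moves a row out of its partition, while the grouped sum has to bring the `F` rows from both input partitions together, which is the data movement that makes a transformation wide.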

## What is Delta Lake?

![Delta Lake](https://raw.githubusercontent.com/owshq-plumbers/trn-cc-databricks-sql-zero-to-zero/main/images/delta-lake.png)

* Delta Lake is an open-source storage framework that enables building a Lakehouse architecture with compute engines. It is the **native table format on Databricks** and uses Parquet as the physical format to store all data.
* Delta tables are built upon the **ACID guarantees** provided by the open source Delta Lake protocol. ACID stands for atomicity, consistency, isolation, and durability.
* Delta Lake can implement the ACID properties thanks to a **transaction log** that is updated on **every commit** to the data. During a transaction, data files are written to the file directory backing the table. When the transaction completes, a new entry is committed to the transaction log that includes the paths to all files written during the transaction. Each commit increments the table version and makes new data files visible to read operations.
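
The commit mechanism described above can be sketched in a few lines of plain Python. This is a simplified model, not the real Delta log format: the `commits` list, the file names, and the `live_files` helper are assumptions made for illustration (real log entries carry much more metadata per `add` and `remove` action).

```python
# Each commit is a list of actions; file names are invented for this sketch.
commits = [
    # version 0: table created with two data files
    [{"add": "part-0.parquet"}, {"add": "part-1.parquet"}],
    # version 1: one file rewritten (for example by an UPDATE)
    [{"remove": "part-0.parquet"}, {"add": "part-2.parquet"}],
    # version 2: remaining files rewritten into a single file
    [{"remove": "part-1.parquet"}, {"remove": "part-2.parquet"}, {"add": "part-3.parquet"}],
]


def live_files(log, version=None):
    """Replay commits 0..version and return the data files visible to readers."""
    if version is None:
        version = len(log) - 1  # default to the latest version
    files = set()
    for actions in log[: version + 1]:
        for action in actions:
            if "add" in action:
                files.add(action["add"])
            elif "remove" in action:
                files.discard(action["remove"])
    return sorted(files)


print(live_files(commits))             # ['part-3.parquet']
print(live_files(commits, version=0))  # ['part-0.parquet', 'part-1.parquet']
```

Replaying the log only up to an earlier version is, in this simplified model, what reading an older table version amounts to: the old data files are still on disk, they are just no longer referenced by the latest version.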

### Creating a Delta Table

In [0]:
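# Load the sample Delta dataset and register it as a managed table.
df = spark.read.format("delta").load("/databricks-datasets/learning-spark-v2/people/people-10m.delta")
table_name = "people_10millions"
df.write.format("delta").saveAsTable(table_name)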

In [0]:
dbutils.fs.ls('dbfs:/databricks-datasets/learning-spark-v2/people/')

[FileInfo(path='dbfs:/databricks-datasets/learning-spark-v2/people/.DS_Store', name='.DS_Store', size=6148, modificationTime=1602178938000),
 FileInfo(path='dbfs:/databricks-datasets/learning-spark-v2/people/README.md', name='README.md', size=215, modificationTime=1587072727000),
 FileInfo(path='dbfs:/databricks-datasets/learning-spark-v2/people/people-10m.delta/', name='people-10m.delta/', size=0, modificationTime=1694513790235),
 FileInfo(path='dbfs:/databricks-datasets/learning-spark-v2/people/people-10m.parquet/', name='people-10m.parquet/', size=0, modificationTime=1694513790235),
 FileInfo(path='dbfs:/databricks-datasets/learning-spark-v2/people/people-with-header-10m.csv.bzip/', name='people-with-header-10m.csv.bzip/', size=0, modificationTime=1694513790235),
 FileInfo(path='dbfs:/databricks-datasets/learning-spark-v2/people/people-with-header-10m.txt', name='people-with-header-10m.txt', size=608145966, modificationTime=1587072733000),
 FileInfo(path='dbfs:/databricks-datasets/l

### Reading a Delta Table

In [0]:
df_titanic_censored = spark.read.table("titanic.clean.titanic_censored")
display(df_titanic_censored.limit(5))

name,sex,age,embarked,survived
"Allen, Miss. Elisabeth Walton",female,29,Southampton,1
"Allison, Mr. Hudson Joshua Creighton",male,30,Southampton,0
"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25,Southampton,0
"Anderson, Mr. Harry",male,48,Southampton,1
"Andrews, Miss. Kornelia Theodosia",female,63,Southampton,1


In [0]:
df_people = spark.read.table("default.people_10millions")
display(df_people.take(5))

id,firstName,middleName,lastName,gender,birthDate,ssn,salary
5016568,Enrique,Emmett,Carvil,H,1960-01-17T05:00:00.000+0000,909-88-7612,62241
5016569,Jordan,Alva,Penk,H,1979-05-02T04:00:00.000+0000,936-39-5888,74778
5016570,Leo,Merlin,Conkay,H,1981-05-21T04:00:00.000+0000,919-71-2948,82321
5016571,Bernard,Wiley,Thackham,H,1983-10-12T04:00:00.000+0000,976-38-5505,72797
5016572,Devin,Loyd,Gipp,H,1990-01-20T05:00:00.000+0000,934-44-9546,99538


In [0]:
%fs
ls 'dbfs:/databricks-datasets/learning-spark-v2/people/'

path,name,size,modificationTime
dbfs:/databricks-datasets/learning-spark-v2/people/.DS_Store,.DS_Store,6148,1602178938000
dbfs:/databricks-datasets/learning-spark-v2/people/README.md,README.md,215,1587072727000
dbfs:/databricks-datasets/learning-spark-v2/people/people-10m.delta/,people-10m.delta/,0,1694513960298
dbfs:/databricks-datasets/learning-spark-v2/people/people-10m.parquet/,people-10m.parquet/,0,1694513960298
dbfs:/databricks-datasets/learning-spark-v2/people/people-with-header-10m.csv.bzip/,people-with-header-10m.csv.bzip/,0,1694513960298
dbfs:/databricks-datasets/learning-spark-v2/people/people-with-header-10m.txt,people-with-header-10m.txt,608145966,1587072733000
dbfs:/databricks-datasets/learning-spark-v2/people/people-with-header-10m.txt.gz,people-with-header-10m.txt.gz,273242367,1587072734000
dbfs:/databricks-datasets/learning-spark-v2/people/people-with-header-10m.txt.snappy/,people-with-header-10m.txt.snappy/,0,1694513960298


In [0]:
display(df_people.selectExpr("count(1)"))

count(1)
10000000


### Getting Details about a Delta Table (properties, size, location, etc.)

In [0]:
# describe is a method and must be called; it returns summary statistics for the columns.
display(df_titanic_censored.describe())

In [0]:
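from delta.tables import DeltaTable

df_titanic_clean = spark.read.table("titanic.clean.titanic_clean")
# describe()/summary() return column statistics; DeltaTable.detail() mirrors SQL DESCRIBE DETAIL.
display(DeltaTable.forName(spark, "titanic.clean.titanic_clean").detail())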

format,id,name,description,location,createdAt,lastModified,partitionColumns,clusteringColumns,numFiles,sizeInBytes,properties,minReaderVersion,minWriterVersion,tableFeatures,statistics
delta,06985207-25ff-43c6-8655-784ce8a1242d,titanic.clean.titanic_clean,Clean Titanic table,abfss://unitycatalog@mdwdatabricksuc.dfs.core.windows.net/f2c6382d-e8a1-48cb-a313-a64aaadf3277/tables/28346e45-cccd-48cf-84f8-bd30533a6855,2023-09-02T14:37:18.259+0000,2023-09-02T14:38:17.000+0000,List(),,1,28702,Map(),1,2,"List(appendOnly, invariants)",Map()


In [0]:
from delta.tables import DeltaTable
display(DeltaTable.forName(spark, "default.people_10millions").detail())

format,id,name,description,location,createdAt,lastModified,partitionColumns,clusteringColumns,numFiles,sizeInBytes,properties,minReaderVersion,minWriterVersion,tableFeatures,statistics
delta,d68baa71-9dbe-46a2-82c7-ebbd00513e1a,hive_metastore.default.people_10millions,,dbfs:/user/hive/warehouse/people_10millions,2023-09-12T01:18:02.999+0000,2023-09-12T01:18:17.000+0000,List(),,4,236570588,Map(),1,2,"List(appendOnly, invariants)",Map()


In [0]:
# Describe the table itself; a temp view with the same name would shadow it.
display(spark.sql("DESCRIBE TABLE EXTENDED default.people_10millions"))

col_name,data_type,comment
id,int,
firstName,string,
middleName,string,
lastName,string,
gender,string,
birthDate,timestamp,
ssn,string,
salary,int,
,,
# Delta Statistics Columns,,


### Inspecting the Delta table history (creation, upserts, deletions, etc)

In [0]:
from delta.tables import DeltaTable
display(DeltaTable.forName(spark, "default.people_10millions").history())

version,timestamp,userId,userName,operation,operationParameters,job,notebook,clusterId,readVersion,isolationLevel,isBlindAppend,operationMetrics,userMetadata,engineInfo
6,2023-09-12T01:18:17.000+0000,2657276046068378,vitojon@outlook.com,CREATE OR REPLACE TABLE AS SELECT,"Map(partitionBy -> [], description -> null, isManaged -> true, properties -> {}, statsOnLoad -> false)",,List(639040945428698),0911-185656-lx56e10s,5.0,WriteSerializable,False,"Map(numFiles -> 4, numOutputBytes -> 236570588, numOutputRows -> 10000000)",,Databricks-Runtime/13.3.x-scala2.12
5,2023-09-12T01:14:20.000+0000,2657276046068378,vitojon@outlook.com,VACUUM END,Map(status -> COMPLETED),,List(639040945428698),0911-185656-lx56e10s,4.0,SnapshotIsolation,True,"Map(numDeletedFiles -> 0, numVacuumedDirectories -> 1)",,Databricks-Runtime/13.3.x-scala2.12
4,2023-09-12T01:14:15.000+0000,2657276046068378,vitojon@outlook.com,VACUUM START,"Map(defaultRetentionMillis -> 604800000, retentionCheckEnabled -> true)",,List(639040945428698),0911-185656-lx56e10s,3.0,SnapshotIsolation,True,"Map(numFilesToDelete -> 0, sizeOfDataToDelete -> 0)",,Databricks-Runtime/13.3.x-scala2.12
3,2023-09-12T01:03:50.000+0000,2657276046068378,vitojon@outlook.com,DELETE,"Map(predicate -> [""(birthDate#3733 >= 2000-01-01 00:00:00)""])",,List(639040945428698),0911-185656-lx56e10s,2.0,WriteSerializable,False,"Map(numRemovedFiles -> 4, numRemovedBytes -> 236570588, numCopiedRows -> 9983242, numDeletionVectorsAdded -> 0, numDeletionVectorsRemoved -> 0, numAddedChangeFiles -> 0, executionTimeMs -> 16482, numDeletedRows -> 16758, scanTimeMs -> 1358, numAddedFiles -> 4, numAddedBytes -> 236178357, rewriteTimeMs -> 15123)",,Databricks-Runtime/13.3.x-scala2.12
2,2023-09-12T01:02:45.000+0000,2657276046068378,vitojon@outlook.com,UPDATE,"Map(predicate -> [""(gender#2916 = F)""])",,List(639040945428698),0911-185656-lx56e10s,1.0,WriteSerializable,False,"Map(numRemovedFiles -> 4, numRemovedBytes -> 236570589, numCopiedRows -> 4812698, numDeletionVectorsAdded -> 0, numDeletionVectorsRemoved -> 0, numAddedChangeFiles -> 0, executionTimeMs -> 16701, scanTimeMs -> 1691, numAddedFiles -> 4, numUpdatedRows -> 5187302, numAddedBytes -> 236570588, rewriteTimeMs -> 15010)",,Databricks-Runtime/13.3.x-scala2.12
1,2023-09-12T01:02:26.000+0000,2657276046068378,vitojon@outlook.com,UPDATE,"Map(predicate -> [""(gender#2230 = M)""])",,List(639040945428698),0911-185656-lx56e10s,0.0,WriteSerializable,False,"Map(numRemovedFiles -> 3, numRemovedBytes -> 177167561, numCopiedRows -> 2670016, numDeletionVectorsAdded -> 0, numDeletionVectorsRemoved -> 0, numAddedChangeFiles -> 0, executionTimeMs -> 14614, scanTimeMs -> 1745, numAddedFiles -> 3, numUpdatedRows -> 4812698, numAddedBytes -> 177167561, rewriteTimeMs -> 12834)",,Databricks-Runtime/13.3.x-scala2.12
0,2023-09-12T00:54:34.000+0000,2657276046068378,vitojon@outlook.com,CREATE TABLE AS SELECT,"Map(partitionBy -> [], description -> null, isManaged -> true, properties -> {}, statsOnLoad -> false)",,List(639040945428698),0911-185656-lx56e10s,,WriteSerializable,True,"Map(numFiles -> 4, numOutputBytes -> 236570589, numOutputRows -> 10000000)",,Databricks-Runtime/13.3.x-scala2.12


## Changing a Delta Table

#### Updating existing data

In [0]:
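from delta.tables import DeltaTable

# update() is a DeltaTable method, not a DataFrame method.
people_table = DeltaTable.forName(spark, "default.people_10millions")
# Relabel 'M' as 'H' first, then 'F' as 'M'; the order matters.
people_table.update(condition="gender = 'M'", set={"gender": "'H'"})
people_table.update(condition="gender = 'F'", set={"gender": "'M'"})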

num_affected_rows
5187302


#### Removing rows/values

In [0]:
from delta.tables import DeltaTable
DeltaTable.forName(spark, "default.people_10millions").delete("birthDate >= '2000-01-01'")

num_affected_rows
16758


In [0]:
display(df_people.selectExpr("count(1)"))

count(1)
9983242


### Inspecting the Delta table history again (creation, upserts, deletions, etc)

In [0]:
from delta.tables import DeltaTable
display(DeltaTable.forName(spark, "default.people_10millions").history())

version,timestamp,userId,userName,operation,operationParameters,job,notebook,clusterId,readVersion,isolationLevel,isBlindAppend,operationMetrics,userMetadata,engineInfo
9,2023-09-12T10:28:01.000+0000,2657276046068378,vitojon@outlook.com,DELETE,"Map(predicate -> [""(birthDate#3498 >= 2000-01-01 00:00:00)""])",,List(639040945428698),0912-101048-c963s7e0,8.0,WriteSerializable,False,"Map(numRemovedFiles -> 4, numRemovedBytes -> 236570369, numCopiedRows -> 9983242, numDeletionVectorsAdded -> 0, numDeletionVectorsRemoved -> 0, numAddedChangeFiles -> 0, executionTimeMs -> 13955, numDeletedRows -> 16758, scanTimeMs -> 922, numAddedFiles -> 4, numAddedBytes -> 236178138, rewriteTimeMs -> 13032)",,Databricks-Runtime/13.3.x-scala2.12
8,2023-09-12T10:22:53.000+0000,2657276046068378,vitojon@outlook.com,UPDATE,"Map(predicate -> [""(gender#2545 = F)""])",,List(639040945428698),0912-101048-c963s7e0,7.0,WriteSerializable,False,"Map(numRemovedFiles -> 0, numRemovedBytes -> 0, numCopiedRows -> 0, numDeletionVectorsAdded -> 0, numDeletionVectorsRemoved -> 0, numAddedChangeFiles -> 0, executionTimeMs -> 465, scanTimeMs -> 464, numAddedFiles -> 0, numUpdatedRows -> 0, numAddedBytes -> 0, rewriteTimeMs -> 1)",,Databricks-Runtime/13.3.x-scala2.12
7,2023-09-12T10:22:50.000+0000,2657276046068378,vitojon@outlook.com,UPDATE,"Map(predicate -> [""(gender#1621 = M)""])",,List(639040945428698),0912-101048-c963s7e0,6.0,WriteSerializable,False,"Map(numRemovedFiles -> 4, numRemovedBytes -> 236570588, numCopiedRows -> 4812698, numDeletionVectorsAdded -> 0, numDeletionVectorsRemoved -> 0, numAddedChangeFiles -> 0, executionTimeMs -> 16781, scanTimeMs -> 1760, numAddedFiles -> 4, numUpdatedRows -> 5187302, numAddedBytes -> 236570369, rewriteTimeMs -> 15011)",,Databricks-Runtime/13.3.x-scala2.12
6,2023-09-12T01:18:17.000+0000,2657276046068378,vitojon@outlook.com,CREATE OR REPLACE TABLE AS SELECT,"Map(partitionBy -> [], description -> null, isManaged -> true, properties -> {}, statsOnLoad -> false)",,List(639040945428698),0911-185656-lx56e10s,5.0,WriteSerializable,False,"Map(numFiles -> 4, numOutputBytes -> 236570588, numOutputRows -> 10000000)",,Databricks-Runtime/13.3.x-scala2.12
5,2023-09-12T01:14:20.000+0000,2657276046068378,vitojon@outlook.com,VACUUM END,Map(status -> COMPLETED),,List(639040945428698),0911-185656-lx56e10s,4.0,SnapshotIsolation,True,"Map(numDeletedFiles -> 0, numVacuumedDirectories -> 1)",,Databricks-Runtime/13.3.x-scala2.12
4,2023-09-12T01:14:15.000+0000,2657276046068378,vitojon@outlook.com,VACUUM START,"Map(defaultRetentionMillis -> 604800000, retentionCheckEnabled -> true)",,List(639040945428698),0911-185656-lx56e10s,3.0,SnapshotIsolation,True,"Map(numFilesToDelete -> 0, sizeOfDataToDelete -> 0)",,Databricks-Runtime/13.3.x-scala2.12
3,2023-09-12T01:03:50.000+0000,2657276046068378,vitojon@outlook.com,DELETE,"Map(predicate -> [""(birthDate#3733 >= 2000-01-01 00:00:00)""])",,List(639040945428698),0911-185656-lx56e10s,2.0,WriteSerializable,False,"Map(numRemovedFiles -> 4, numRemovedBytes -> 236570588, numCopiedRows -> 9983242, numDeletionVectorsAdded -> 0, numDeletionVectorsRemoved -> 0, numAddedChangeFiles -> 0, executionTimeMs -> 16482, numDeletedRows -> 16758, scanTimeMs -> 1358, numAddedFiles -> 4, numAddedBytes -> 236178357, rewriteTimeMs -> 15123)",,Databricks-Runtime/13.3.x-scala2.12
2,2023-09-12T01:02:45.000+0000,2657276046068378,vitojon@outlook.com,UPDATE,"Map(predicate -> [""(gender#2916 = F)""])",,List(639040945428698),0911-185656-lx56e10s,1.0,WriteSerializable,False,"Map(numRemovedFiles -> 4, numRemovedBytes -> 236570589, numCopiedRows -> 4812698, numDeletionVectorsAdded -> 0, numDeletionVectorsRemoved -> 0, numAddedChangeFiles -> 0, executionTimeMs -> 16701, scanTimeMs -> 1691, numAddedFiles -> 4, numUpdatedRows -> 5187302, numAddedBytes -> 236570588, rewriteTimeMs -> 15010)",,Databricks-Runtime/13.3.x-scala2.12
1,2023-09-12T01:02:26.000+0000,2657276046068378,vitojon@outlook.com,UPDATE,"Map(predicate -> [""(gender#2230 = M)""])",,List(639040945428698),0911-185656-lx56e10s,0.0,WriteSerializable,False,"Map(numRemovedFiles -> 3, numRemovedBytes -> 177167561, numCopiedRows -> 2670016, numDeletionVectorsAdded -> 0, numDeletionVectorsRemoved -> 0, numAddedChangeFiles -> 0, executionTimeMs -> 14614, scanTimeMs -> 1745, numAddedFiles -> 3, numUpdatedRows -> 4812698, numAddedBytes -> 177167561, rewriteTimeMs -> 12834)",,Databricks-Runtime/13.3.x-scala2.12
0,2023-09-12T00:54:34.000+0000,2657276046068378,vitojon@outlook.com,CREATE TABLE AS SELECT,"Map(partitionBy -> [], description -> null, isManaged -> true, properties -> {}, statsOnLoad -> false)",,List(639040945428698),0911-185656-lx56e10s,,WriteSerializable,True,"Map(numFiles -> 4, numOutputBytes -> 236570589, numOutputRows -> 10000000)",,Databricks-Runtime/13.3.x-scala2.12


### Using Time Travel to recover an old version of a Delta Table

In [0]:
spark.sql("CREATE OR REPLACE TABLE people_10millions AS SELECT * FROM people_10millions VERSION AS OF 2")

num_affected_rows,num_inserted_rows


In [0]:
from delta.tables import DeltaTable
DeltaTable.forName(spark, "default.people_10millions").detail()

format,id,name,description,location,createdAt,lastModified,partitionColumns,clusteringColumns,numFiles,sizeInBytes,properties,minReaderVersion,minWriterVersion,tableFeatures,statistics
delta,cc79eec3-ea3e-4068-821b-c300d5d133d3,hive_metastore.default.people_10millions,,dbfs:/user/hive/warehouse/people_10millions,2023-09-12T10:32:42.057+0000,2023-09-12T10:32:54.000+0000,List(),,4,236570588,Map(),1,2,"List(appendOnly, invariants)",Map()


In [0]:
from delta.tables import DeltaTable
display(DeltaTable.forName(spark, "default.people_10millions").detail())

format,id,name,description,location,createdAt,lastModified,partitionColumns,clusteringColumns,numFiles,sizeInBytes,properties,minReaderVersion,minWriterVersion,tableFeatures,statistics
delta,cc79eec3-ea3e-4068-821b-c300d5d133d3,hive_metastore.default.people_10millions,,dbfs:/user/hive/warehouse/people_10millions,2023-09-12T10:32:42.057+0000,2023-09-12T10:32:54.000+0000,List(),,4,236570588,Map(),1,2,"List(appendOnly, invariants)",Map()


In [0]:
dbutils.fs.ls('dbfs:/user/hive/warehouse/people_10millions/_delta_log')

[FileInfo(path='dbfs:/user/hive/warehouse/people_10millions/_delta_log/00000000000000000000.crc', name='00000000000000000000.crc', size=6072, modificationTime=1694480080000),
 FileInfo(path='dbfs:/user/hive/warehouse/people_10millions/_delta_log/00000000000000000000.json', name='00000000000000000000.json', size=5122, modificationTime=1694480074000),
 FileInfo(path='dbfs:/user/hive/warehouse/people_10millions/_delta_log/00000000000000000001.00000000000000000006.compacted.json', name='00000000000000000001.00000000000000000006.compacted.json', size=10002, modificationTime=1694481499000),
 FileInfo(path='dbfs:/user/hive/warehouse/people_10millions/_delta_log/00000000000000000001.crc', name='00000000000000000001.crc', size=6072, modificationTime=1694480548000),
 FileInfo(path='dbfs:/user/hive/warehouse/people_10millions/_delta_log/00000000000000000001.json', name='00000000000000000001.json', size=4696, modificationTime=1694480546000),
 FileInfo(path='dbfs:/user/hive/warehouse/people_10milli

In [0]:
display(spark.read.json("/user/hive/warehouse/people_10millions/_delta_log/00000000000000000003.json"))

add,commitInfo,remove
,"List(0911-185656-lx56e10s, Databricks-Runtime/13.3.x-scala2.12, false, WriteSerializable, List(639040945428698), DELETE, List(16482, 236178357, 0, 4, 9983242, 16758, 0, 0, 236570588, 4, 15123, 1358), List([""(birthDate#3733 >= 2000-01-01 00:00:00)""]), 2, List(false, false, false), 1694480630866, 355c9cdc-eea5-45ee-970a-24dca4f60201, 2657276046068378, vitojon@outlook.com)",
,,"List(true, 1694480630858, true, part-00001-842d31c6-df6a-47dc-ac30-e601115e4b4b-c000.snappy.parquet, 59655845, List(1694480072000001, 1694480072000001, 1694480072000001, 268435456))"
,,"List(true, 1694480630858, true, part-00003-9344270e-70b7-4712-a5ea-b1495be6ab43-c000.snappy.parquet, 59403027, List(1694480072000003, 1694480072000003, 1694480072000003, 268435456))"
,,"List(true, 1694480630858, true, part-00000-0d03c716-971a-4eed-b0ed-0c41d86d9b7e-c000.snappy.parquet, 59319885, List(1694480072000000, 1694480072000000, 1694480072000000, 268435456))"
,,"List(true, 1694480630858, true, part-00002-a65ece48-0cdc-49b6-aa36-065b209f156b-c000.snappy.parquet, 58191831, List(1694480072000002, 1694480072000002, 1694480072000002, 268435456))"
"List(true, 1694480629000, part-00000-5d3da8b5-0278-4524-9db6-66bcd5714620-c000.snappy.parquet, 59220845, {""numRecords"":2509072,""minValues"":{""id"":1267751,""firstName"":""Aaron"",""middleName"":""Aaron"",""lastName"":""A'Barrow"",""gender"":""H"",""birthDate"":""1951-12-31T05:00:00.000Z"",""ssn"":""666-10-1005"",""salary"":-24221},""maxValues"":{""id"":6280265,""firstName"":""Zulma"",""middleName"":""Zulma"",""lastName"":""Zywicki"",""gender"":""M"",""birthDate"":""1999-12-31T05:00:00.000Z"",""ssn"":""999-98-9981"",""salary"":170371},""nullCount"":{""id"":0,""firstName"":0,""middleName"":0,""lastName"":0,""gender"":0,""birthDate"":0,""ssn"":0,""salary"":0}}, List(1694480072000000, 1694480072000000, 1694480072000000, 268435456))",,
"List(true, 1694480630000, part-00001-6dc30b02-e32a-4a89-857a-e73c211ddb41-c000.snappy.parquet, 59558946, {""numRecords"":2509281,""minValues"":{""id"":3766824,""firstName"":""Aaron"",""middleName"":""Aaron"",""lastName"":""A'Barrow"",""gender"":""H"",""birthDate"":""1951-12-31T05:00:00.000Z"",""ssn"":""666-10-1008"",""salary"":-25644},""maxValues"":{""id"":7544037,""firstName"":""Zulma"",""middleName"":""Zulma"",""lastName"":""Zywicki"",""gender"":""M"",""birthDate"":""1999-12-31T05:00:00.000Z"",""ssn"":""999-98-9985"",""salary"":180841},""nullCount"":{""id"":0,""firstName"":0,""middleName"":0,""lastName"":0,""gender"":0,""birthDate"":0,""ssn"":0,""salary"":0}}, List(1694480072000001, 1694480072000001, 1694480072000001, 268435456))",,
"List(true, 1694480629000, part-00002-d75beb7f-b9ca-45d5-a39f-b5df12057754-c000.snappy.parquet, 58094265, {""numRecords"":2451834,""minValues"":{""id"":7544038,""firstName"":""Aaron"",""middleName"":""Aaron"",""lastName"":""A'Barrow"",""gender"":""H"",""birthDate"":""1951-12-31T05:00:00.000Z"",""ssn"":""666-10-1009"",""salary"":-21931},""maxValues"":{""id"":10000000,""firstName"":""Zulma"",""middleName"":""Zulma"",""lastName"":""Zywicki"",""gender"":""M"",""birthDate"":""1999-12-31T05:00:00.000Z"",""ssn"":""999-98-9989"",""salary"":170562},""nullCount"":{""id"":0,""firstName"":0,""middleName"":0,""lastName"":0,""gender"":0,""birthDate"":0,""ssn"":0,""salary"":0}}, List(1694480072000002, 1694480072000002, 1694480072000002, 268435456))",,
"List(true, 1694480630000, part-00003-d4a3cc96-e43b-4dfd-b11b-6851b84b494d-c000.snappy.parquet, 59304301, {""numRecords"":2513055,""minValues"":{""id"":1,""firstName"":""Abbey"",""middleName"":""Abbey"",""lastName"":""A'Barrow"",""gender"":""M"",""birthDate"":""1951-12-31T05:00:00.000Z"",""ssn"":""666-10-1010"",""salary"":-26884},""maxValues"":{""id"":3766823,""firstName"":""Zulma"",""middleName"":""Zulma"",""lastName"":""Zywicki"",""gender"":""M"",""birthDate"":""1999-12-31T05:00:00.000Z"",""ssn"":""999-98-9924"",""salary"":168650},""nullCount"":{""id"":0,""firstName"":0,""middleName"":0,""lastName"":0,""gender"":0,""birthDate"":0,""ssn"":0,""salary"":0}}, List(1694480072000003, 1694480072000003, 1694480072000003, 268435456))",,


### Cleaning up and getting rid of the history

In [0]:
from delta.tables import DeltaTable
DeltaTable.forName(spark, "default.people_10millions").vacuum()

path
dbfs:/user/hive/warehouse/people_10millions
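
The `VACUUM START` entries in the history above report `defaultRetentionMillis -> 604800000`, which is 7 days. The sketch below is a simplified pure-Python model of that retention check, not Delta's actual implementation: the helper `vacuum_candidates` and the file names are hypothetical, and the removal timestamp is borrowed from the `remove` entries of the commit file shown earlier.

```python
RETENTION_MS = 604_800_000  # 7 days, the default retention reported by VACUUM START


def vacuum_candidates(removed_files, now_ms, retention_ms=RETENTION_MS):
    """Return files whose removal from the log is older than the retention window."""
    cutoff = now_ms - retention_ms
    return sorted(
        path for path, removed_at_ms in removed_files.items() if removed_at_ms < cutoff
    )


# Hypothetical file names, both logically removed at the same instant.
removed = {
    "part-old-a.parquet": 1694480630858,
    "part-old-b.parquet": 1694480630858,
}

one_day_ms = 24 * 60 * 60 * 1000
print(vacuum_candidates(removed, now_ms=1694480630858 + one_day_ms))      # []
print(vacuum_candidates(removed, now_ms=1694480630858 + 8 * one_day_ms))  # ['part-old-a.parquet', 'part-old-b.parquet']
```

Under this model, files that were replaced only minutes earlier are still inside the retention window, which is consistent with the `numDeletedFiles -> 0` reported by `VACUUM END` in the history.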


In [0]:
display(df_people.selectExpr("count(1)"))

count(1)
9983242


In [0]:
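from delta.tables import DeltaTable
people_table = DeltaTable.forName(spark, "default.people_10millions")
display(people_table.restoreToVersion(0))
# people_table.restoreToTimestamp("2023-09-13")  # alternative: restore by timestamp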

table_size_after_restore,num_of_files_after_restore,num_removed_files,num_restored_files,removed_files_size,restored_files_size
236570589,4,4,4,236178357,236570589


In [0]:
display(df_people.selectExpr("count(1)"))

count(1)
10000000


In [0]:
# Describe the catalog table directly; no temporary view is needed.
display(spark.sql("DESCRIBE TABLE EXTENDED titanic.clean.titanic_censored"))

col_name,data_type,comment
name,string,Name of the passenger
sex,string,
age,string,
embarked,string,
survived,string,
,,
# Detailed Table Information,,
Catalog,titanic,
Database,clean,
Table,titanic_censored,


In [0]:
from delta.tables import DeltaTable
display(DeltaTable.forName(spark, "titanic.clean.titanic_censored").history())

In [0]:
dbutils.fs.ls('dbfs:/databricks-datasets/')

[FileInfo(path='dbfs:/databricks-datasets/COVID/', name='COVID/', size=0, modificationTime=1694518516218),
 FileInfo(path='dbfs:/databricks-datasets/README.md', name='README.md', size=976, modificationTime=1532502324000),
 FileInfo(path='dbfs:/databricks-datasets/Rdatasets/', name='Rdatasets/', size=0, modificationTime=1694518516218),
 FileInfo(path='dbfs:/databricks-datasets/SPARK_README.md', name='SPARK_README.md', size=3359, modificationTime=1455505834000),
 FileInfo(path='dbfs:/databricks-datasets/adult/', name='adult/', size=0, modificationTime=1694518516218),
 FileInfo(path='dbfs:/databricks-datasets/airlines/', name='airlines/', size=0, modificationTime=1694518516218),
 FileInfo(path='dbfs:/databricks-datasets/amazon/', name='amazon/', size=0, modificationTime=1694518516218),
 FileInfo(path='dbfs:/databricks-datasets/asa/', name='asa/', size=0, modificationTime=1694518516218),
 FileInfo(path='dbfs:/databricks-datasets/atlas_higgs/', name='atlas_higgs/', size=0, modificationTime=