# Week 2 Lab: File Formats, Delta Tables, and Time Travel

In this lab you will explore the differences between CSV and Delta Lake (Parquet) storage.

You'll see firsthand how:
- **Columnar storage** (Parquet/Delta) dramatically reduces I/O for analytical queries
- **File size** can shrink by an order of magnitude thanks to Parquet's encoding and compression techniques (the exact ratio depends on the data)
- **Time travel** lets you query, audit, and restore previous versions of a table

---
## Prerequisites

Before running this notebook, run the `generate_sensor_data` notebook to generate the CSV data under
`/FileStore/hwe-data/week2/sensor_readings/` (10 million rows).

---
## Step 1: Create Schema

We'll use a `week2` schema to hold everything we build in this lab.

In [None]:
CREATE SCHEMA IF NOT EXISTS week2

---
## Step 2: Explore the CSV Data

First, let's look at the raw CSV data using `read_files`. This is how Databricks reads
non-Delta files: it parses the CSV text on every query.

Preview the first few rows. Even for a small preview, Databricks has to scan and parse raw text (and sample it to infer a schema), because CSV carries no index or metadata.

In [None]:
SELECT *
FROM read_files(
  '/FileStore/hwe-data/week2/sensor_readings',
  format => 'csv',
  header => true
)
LIMIT 10

Check the schema that Databricks inferred from the CSV. CSV has no type information embedded
in the file, so every column is either STRING or a type that Databricks guessed by sampling the values.

In [None]:
DESCRIBE SELECT *
FROM read_files(
  '/FileStore/hwe-data/week2/sensor_readings',
  format => 'csv',
  header => true
)
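
To see why CSV needs either inference or explicit casts, here is a minimal Python sketch that runs outside Databricks using only the standard library. The two sample rows and their values are made up for illustration; only the column names are taken from this lab's dataset.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample rows; only the column names match the lab's dataset.
raw = io.StringIO(
    "sensor_id,reading_timestamp,reading_value,battery_pct,maintenance_flag\n"
    "sensor-0001,2024-01-01 00:00:00,21.5,87,false\n"
    "sensor-0002,2024-01-01 00:00:01,19.0,42,true\n"
)

rows = list(csv.DictReader(raw))

# Every field comes back as text: the file itself carries no type information.
field_types = {name: type(value).__name__ for name, value in rows[0].items()}


def cast_row(row):
    """Apply explicit casts, similar in spirit to CAST in a SQL query."""
    return {
        "sensor_id": row["sensor_id"],
        "reading_timestamp": datetime.strptime(
            row["reading_timestamp"], "%Y-%m-%d %H:%M:%S"
        ),
        "reading_value": float(row["reading_value"]),
        "battery_pct": int(row["battery_pct"]),
        "maintenance_flag": row["maintenance_flag"].lower() == "true",
    }


typed_rows = [cast_row(r) for r in rows]
print(field_types)
print(typed_rows[0])
```

A reader that infers types is doing the same guesswork automatically, which is why inference costs an extra look at the data.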

---
## Step 3: Create a Delta Table from CSV

Let's create a Delta table from the CSV using CREATE TABLE AS SELECT (CTAS).
This converts the CSV data into Delta format (Parquet files + transaction log).

In [None]:
CREATE OR REPLACE TABLE week2.sensor_readings AS
SELECT
  sensor_id,
  sensor_type,
  location,
  CAST(reading_timestamp AS TIMESTAMP) AS reading_timestamp,
  CAST(reading_value AS DOUBLE) AS reading_value,
  unit,
  CAST(battery_pct AS INT) AS battery_pct,
  CAST(signal_strength AS INT) AS signal_strength,
  status,
  firmware_version,
  CAST(deployed_date AS DATE) AS deployed_date,
  CAST(maintenance_flag AS BOOLEAN) AS maintenance_flag
FROM read_files(
  '/FileStore/hwe-data/week2/sensor_readings',
  format => 'csv',
  header => true
)

Now compare the Delta table's schema to the CSV. The Delta table records the types we cast to
(TIMESTAMP, DOUBLE, INT, BOOLEAN, DATE) because Delta stores schema metadata alongside the data,
so nothing has to be inferred or re-cast at query time.

In [None]:
DESCRIBE week2.sensor_readings

---
## Step 4: Compare File Sizes

Delta tables store data as Parquet files, which use columnar compression:
- **Dictionary encoding** for low-cardinality strings (sensor_type has only 5 values → stored as integer codes)
- **Delta encoding** for sequential values (timestamps increase by 1 second → store only the difference); despite the name, this is a Parquet encoding and has nothing to do with Delta Lake
- **Bit packing** for small integers (battery_pct 0–100 needs only 7 bits, not 32)
- **Run-length encoding** for repeated values (maintenance_flag is mostly false → store "false × N")

An uncompressed CSV file stores every value as plain text, so none of these techniques apply.
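
The four techniques above can be sketched in a few lines of plain Python. This is a simplified illustration, not how a Parquet writer is actually implemented: real writers pick encodings per column chunk and usually layer a general-purpose codec on top. All sample values are hypothetical.

```python
def dictionary_encode(values):
    """Replace each distinct value with a small integer code."""
    dictionary = sorted(set(values))
    code_of = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [code_of[v] for v in values]


def delta_encode(values):
    """Keep the first value, then store only differences between neighbors."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def bits_needed(max_value):
    """Minimum number of bits that can represent 0..max_value."""
    return max(1, max_value.bit_length())


def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs


# Hypothetical sample values, chosen only to make the output easy to read.
sensor_types = ["temperature", "humidity", "temperature", "pressure", "humidity"]
timestamps = [1704067200, 1704067201, 1704067202, 1704067203]
flags = [False, False, False, False, True, False, False]

print(dictionary_encode(sensor_types))
# (['humidity', 'pressure', 'temperature'], [2, 0, 2, 1, 0])
print(delta_encode(timestamps))   # [1704067200, 1, 1, 1]
print(bits_needed(100))           # 7
print(run_length_encode(flags))   # [[False, 4], [True, 1], [False, 2]]
```

In each case the encoded form holds small integers or short runs instead of repeated text, which is what makes the later compression step so effective.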

Check the Delta table's size on disk using `DESCRIBE DETAIL`. Look at the `sizeInBytes` column.

In [None]:
DESCRIBE DETAIL week2.sensor_readings

Count the rows to confirm both formats hold the same number of rows.

In [None]:
SELECT
  (SELECT COUNT(*) FROM week2.sensor_readings) AS delta_row_count,
  (SELECT COUNT(*)
   FROM read_files(
     '/FileStore/hwe-data/week2/sensor_readings',
     format => 'csv',
     header => true
   )
  ) AS csv_row_count

---
## Step 5: Compare Query Performance

This is where the difference really shows. We'll run the same queries against CSV and Delta,
then compare how much data each query had to read.

### How to view I/O statistics

After each query runs, look at the bottom of the cell output for the execution time
(e.g., "Took 2.34 seconds"). Click on it to open the **Query Profile**.

In the Query Profile, click on the **Scan** operator (the box at the bottom of the diagram).
The right panel will show:
- **data read size** — how many bytes were read from storage
- **rows read** — how many rows were scanned

Compare these numbers between the CSV and Delta versions of each query.

### Test 1: Aggregate a single column

This query only needs the `reading_value` column. The Delta/Parquet version can skip
all 11 other columns entirely — it only reads the one it needs. The CSV version must
read and parse every column of every row.

**CSV** — Run this, then open the Query Profile and note the data read size.

In [None]:
SELECT AVG(CAST(reading_value AS DOUBLE)) AS avg_reading
FROM read_files(
  '/FileStore/hwe-data/week2/sensor_readings',
  format => 'csv',
  header => true
)

**Delta** — Run this, then compare the data read size to the CSV version above.

In [None]:
SELECT AVG(reading_value) AS avg_reading
FROM week2.sensor_readings

### Test 2: Filter and aggregate two columns

This query filters on `sensor_type` and aggregates `reading_value`. With Parquet,
Databricks reads only these two columns. With CSV, it reads everything.

**CSV**

In [None]:
SELECT
  sensor_type,
  COUNT(*) AS readings,
  ROUND(AVG(CAST(reading_value AS DOUBLE)), 2) AS avg_value
FROM read_files(
  '/FileStore/hwe-data/week2/sensor_readings',
  format => 'csv',
  header => true
)
GROUP BY sensor_type
ORDER BY sensor_type

**Delta**

In [None]:
SELECT
  sensor_type,
  COUNT(*) AS readings,
  ROUND(AVG(reading_value), 2) AS avg_value
FROM week2.sensor_readings
GROUP BY sensor_type
ORDER BY sensor_type

### Test 3: Multi-column analytical query

A more realistic query that groups by two columns and aggregates a third. Even here, Delta reads far less data
because it only reads the 3 columns referenced in the query (`sensor_type`, `status`, `battery_pct`), not all 12.

**CSV**

In [None]:
SELECT
  sensor_type,
  status,
  COUNT(*) AS readings,
  ROUND(AVG(CAST(battery_pct AS INT)), 1) AS avg_battery
FROM read_files(
  '/FileStore/hwe-data/week2/sensor_readings',
  format => 'csv',
  header => true
)
GROUP BY sensor_type, status
ORDER BY sensor_type, status

**Delta**

In [None]:
SELECT
  sensor_type,
  status,
  COUNT(*) AS readings,
  ROUND(AVG(battery_pct), 1) AS avg_battery
FROM week2.sensor_readings
GROUP BY sensor_type, status
ORDER BY sensor_type, status

### What you should see

| Metric | CSV | Delta |
|--------|-----|-------|
| Data read size (Test 1) | ~full file size | 1 of 12 columns (a small fraction of the table size) |
| Data read size (Test 2) | ~full file size | 2 of 12 columns |
| Data read size (Test 3) | ~full file size | 3 of 12 columns |
| Query time | Slower | Faster |

The CSV queries always read the entire file regardless of which columns you select.
Delta/Parquet reads only the columns your query references — this is called **column pruning**.

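Combined with compression, Delta can read an order of magnitude less data than CSV for analytical queries like these. Use the data read sizes you noted in the Query Profile to work out the actual ratio for this dataset.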
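
Column pruning can be illustrated with a toy model in plain Python. This is a sketch under simplifying assumptions: the table has 4 made-up columns instead of 12, both layouts are uncompressed text, and "bytes read" is just the length of the buffer each function has to touch.

```python
import random

random.seed(0)  # fixed seed so the toy numbers are reproducible

# Hypothetical 4-column table (the lab's table has 12 columns).
n = 1000
table = {
    "sensor_id": [f"sensor-{i % 50:04d}" for i in range(n)],
    "sensor_type": [random.choice(["temperature", "humidity", "pressure"]) for _ in range(n)],
    "reading_value": [round(random.uniform(0, 100), 2) for _ in range(n)],
    "status": [random.choice(["ok", "warning"]) for _ in range(n)],
}

# Row-oriented layout: one text line per row, like CSV.
row_layout = "\n".join(
    ",".join(str(table[col][i]) for col in table) for i in range(n)
).encode()

# Column-oriented layout: each column stored as its own contiguous chunk.
column_layout = {
    col: "\n".join(str(v) for v in values).encode()
    for col, values in table.items()
}


def avg_from_rows(data):
    """Must scan every byte, because the fields of each row are interleaved."""
    values = [float(line.split(",")[2]) for line in data.decode().split("\n")]
    return sum(values) / len(values), len(data)


def avg_from_columns(chunks, column):
    """Reads only the requested column's chunk; the others are never touched."""
    chunk = chunks[column]
    values = [float(v) for v in chunk.decode().split("\n")]
    return sum(values) / len(values), len(chunk)


row_avg, row_bytes = avg_from_rows(row_layout)
col_avg, col_bytes = avg_from_columns(column_layout, "reading_value")
print(f"row layout read {row_bytes} bytes, column layout read {col_bytes} bytes")
```

Both functions return the same average, but the column layout touches only one chunk. Parquet adds encoding and compression on top of this, which widens the gap further.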

---
## Step 6: Time Travel

Delta Lake maintains a transaction log that records every change to a table.
This means you can:
- **Query previous versions** of the data
- **See the history** of all operations
- **Restore** to a previous version if something goes wrong

First, let's check the current state of the table. We'll look at a specific sensor to
make the changes easy to track.

In [None]:
SELECT sensor_id, sensor_type, reading_value, status
FROM week2.sensor_readings
WHERE sensor_id = 'sensor-0001'
LIMIT 5

Check the table's version history before making changes.

In [None]:
DESCRIBE HISTORY week2.sensor_readings

Now let's make a change — update all readings for sensor-0001 to set their status to `'maintenance'`.

In [None]:
UPDATE week2.sensor_readings
SET status = 'maintenance'
WHERE sensor_id = 'sensor-0001'

Verify the update took effect.

In [None]:
SELECT sensor_id, sensor_type, reading_value, status
FROM week2.sensor_readings
WHERE sensor_id = 'sensor-0001'
LIMIT 5

Check the history again — you should see a new version from the UPDATE operation.

In [None]:
DESCRIBE HISTORY week2.sensor_readings

### Query a previous version

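Use `VERSION AS OF` to read the data as it was before the update.
In a fresh run of this lab, version 0 is the table as the CTAS first wrote it. If you re-ran Step 3
or see extra entries in `DESCRIBE HISTORY`, use the version number just before the UPDATE instead.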

In [None]:
SELECT sensor_id, sensor_type, reading_value, status
FROM week2.sensor_readings VERSION AS OF 0
WHERE sensor_id = 'sensor-0001'
LIMIT 5

The original status values are still there in version 0, even though the current
version shows `'maintenance'`. Delta kept the old Parquet files and just recorded
which files belong to which version in the transaction log.
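
The idea of "which files belong to which version" can be sketched as a replay over a list of commits. This is a simplified model in plain Python, not Delta's actual log format; the file names and the shape of each commit record are made up for illustration.

```python
# Each commit lists the data files it adds and removes. File names are hypothetical.
log = [
    {"version": 0, "operation": "CREATE TABLE AS SELECT",
     "add": ["part-000.parquet", "part-001.parquet"], "remove": []},
    {"version": 1, "operation": "UPDATE",
     "add": ["part-002.parquet"], "remove": ["part-000.parquet"]},
]


def files_at_version(log, version):
    """Replay commits 0..version to find which files make up that snapshot."""
    live = set()
    for commit in log:
        if commit["version"] > version:
            break
        live -= set(commit["remove"])
        live |= set(commit["add"])
    return live


print(sorted(files_at_version(log, 0)))  # ['part-000.parquet', 'part-001.parquet']
print(sorted(files_at_version(log, 1)))  # ['part-001.parquet', 'part-002.parquet']
```

Note that the UPDATE only marks `part-000.parquet` as removed in the log. The file itself is still in storage, which is why a query against version 0 can still read it.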

### Restore to a previous version

If the UPDATE was a mistake, we can undo it by restoring the table to version 0.

In [None]:
RESTORE TABLE week2.sensor_readings TO VERSION AS OF 0

Verify the restore worked — the status values should be back to their original values.

In [None]:
SELECT sensor_id, sensor_type, reading_value, status
FROM week2.sensor_readings
WHERE sensor_id = 'sensor-0001'
LIMIT 5

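Check history one more time. The RESTORE appears as yet another version in the log, so even the
mistaken UPDATE is still recorded. Older versions stay queryable only while their data files are
retained: running `VACUUM` deletes files older than the retention period (7 days by default).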

In [None]:
DESCRIBE HISTORY week2.sensor_readings
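
Why does RESTORE show up as a new version rather than deleting the later ones? A minimal sketch in plain Python, assuming each version is modeled as a set of hypothetical file names: restoring appends a new snapshot that matches the old one, and the commit records only the difference from the current version.

```python
def restore(snapshots, target_version):
    """Append a new snapshot equal to an older one; history is not rewritten."""
    current = snapshots[-1]
    target = snapshots[target_version]
    commit = {
        "operation": "RESTORE",
        "add": sorted(target - current),     # files to bring back
        "remove": sorted(current - target),  # files to drop from the current version
    }
    return snapshots + [set(target)], commit


# Hypothetical file sets for version 0 (original) and version 1 (after the UPDATE).
snapshots = [
    {"part-000.parquet", "part-001.parquet"},
    {"part-001.parquet", "part-002.parquet"},
]

snapshots, commit = restore(snapshots, target_version=0)
print(f"latest version: {len(snapshots) - 1}")
print(commit)
```

After the call there are three versions: version 2 has the same files as version 0, and version 1 is still in the list, just as the UPDATE still appears in `DESCRIBE HISTORY`.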

---
## Step 7: Clean Up

Drop the `week2` schema and all its tables.

In [None]:
DROP SCHEMA IF EXISTS week2 CASCADE