<img src="https://www.apache.org/logos/res/spark/spark.png" alt="spark-logo" style="width: 500px;"/>

First, let's set up our tables using PySpark. We tear down any existing objects to avoid conflicts, then recreate the `nyc.taxis` table to store data from the <a href="https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page">NYC Taxi and Limousine Commission Record Data</a>. We will initially load a month of trip data and run a simple query to verify that the data was written to the table correctly.

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Jupyter").getOrCreate()

spark

## Load One Month of NYC Taxi/Limousine Trip Data

This notebook uses the New York City Taxi and Limousine Commission Trip Record Data available on the AWS Open Data Registry. It contains records of trips taken by taxis and for-hire vehicles in New York City. This data is stored in an Iceberg table called `taxis`.

To be able to rerun the notebook several times, let's drop the table and the views if they exist to start fresh.

In [None]:
%%sql
CREATE DATABASE IF NOT EXISTS nyc;

In [None]:
%%sql
DROP TABLE IF EXISTS nyc.taxis;

In [None]:
%%sql
DROP VIEW IF EXISTS nyc.long_distances;

In [None]:
%%sql
DROP VIEW IF EXISTS nyc.negative_amounts;

## Create the `nyc.taxis` table in Spark

In [None]:
%%sql
CREATE TABLE
  nyc.taxis (
    VendorID BIGINT,
    tpep_pickup_datetime TIMESTAMP,
    tpep_dropoff_datetime TIMESTAMP,
    passenger_count DOUBLE,
    trip_distance DOUBLE,
    RatecodeID DOUBLE,
    store_and_fwd_flag string,
    PULocationID BIGINT,
    DOLocationID BIGINT,
    payment_type BIGINT,
    fare_amount DOUBLE,
    extra DOUBLE,
    mta_tax DOUBLE,
    tip_amount DOUBLE,
    tolls_amount DOUBLE,
    improvement_surcharge DOUBLE,
    total_amount DOUBLE,
    congestion_surcharge DOUBLE,
    airport_fee DOUBLE
  ) USING iceberg PARTITIONED BY (days (tpep_pickup_datetime))
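The `days(tpep_pickup_datetime)` clause asks Iceberg to partition rows by a day transform of the pickup timestamp. The Iceberg spec describes the result of that transform as the number of days since 1970-01-01. The sketch below is a plain Python illustration of that arithmetic, not Iceberg's implementation; the helper name `day_partition` is hypothetical, and the timestamps are assumed to be naive values already in UTC.

```python
from datetime import date, datetime

EPOCH = date(1970, 1, 1)


def day_partition(ts: datetime) -> int:
    """Return the whole days between 1970-01-01 and the timestamp's date.

    Hypothetical helper: assumes `ts` is a naive datetime already in UTC.
    """
    return (ts.date() - EPOCH).days


pickups = [
    datetime(2022, 1, 1, 0, 35, 40),
    datetime(2022, 1, 1, 23, 59, 59),
    datetime(2022, 1, 2, 0, 0, 1),
]

# Trips picked up on the same calendar day map to the same partition value.
for ts in pickups:
    print(ts.isoformat(), "->", day_partition(ts))
```

The first two trips share a partition value while the third falls into the next one, which is what lets an engine skip whole days when a query filters on the pickup time.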

## Write a month of trip data to `nyc.taxis`

In [None]:
df = spark.read.parquet("/home/iceberg/data/yellow_tripdata_2022-01.parquet")
df.writeTo("nyc.taxis").append()

In [None]:
%%sql
SELECT
  *
FROM
  nyc.taxis
LIMIT
  10

<img src="https://www.apache.org/logos/res/iceberg/iceberg.png" alt="iceberg-rest-logo" style="width: 500px;bottom: 0;margin: 0;"/>
<div style="font-size:60px;top: 0;margin: 0;"> {<span style="color:#287ABE;font-weight:bold;">REST</span>:<span style="color:#287ABE;font-weight:bold;">Catalog</span>} </div>

## View the Tables in the catalog 👀

Now that we've created a table, let's look at the table metadata in the REST catalog to have a quick refresher at the table metadata highlights.

In [None]:
from json import loads, dumps
from IPython.display import JSON

#call the rest api using curl and assign back to IPython SList
taxis_table_meta=!curl -s http://rest:8181/v1/namespaces/nyc/tables/taxis

#parse table metadata payload into Python dictionary and print payload
table_meta_dict=loads(taxis_table_meta.spstr)
print(dumps(table_meta_dict, indent=2)[0:600], "\n  ...", "\n}")
#print(dumps(table_meta_dict, indent=2))
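If you would rather not shell out to `curl`, the same endpoint can be read with the Python standard library. This is a minimal sketch, assuming the REST catalog is reachable at the same `http://rest:8181` address used in the cell above; the helper names `metadata_url` and `load_metadata` are hypothetical and not part of any Iceberg client.

```python
import json
from urllib.request import urlopen

# Assumption: same host and port as the curl call in the previous cell.
REST_BASE = "http://rest:8181/v1"


def metadata_url(namespace: str, name: str, kind: str = "tables") -> str:
    """Build the load-table or load-view URL for the REST catalog."""
    if kind not in ("tables", "views"):
        raise ValueError(f"kind must be 'tables' or 'views', got {kind!r}")
    return f"{REST_BASE}/namespaces/{namespace}/{kind}/{name}"


def load_metadata(namespace: str, name: str, kind: str = "tables") -> dict:
    """Fetch the JSON payload and parse it into a dictionary."""
    with urlopen(metadata_url(namespace, name, kind)) as response:
        return json.load(response)


# Only the URL is printed here; call load_metadata("nyc", "taxis")
# from inside the notebook environment to perform the request.
print(metadata_url("nyc", "taxis"))
```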

In [None]:
metadata_location=table_meta_dict["metadata-location"]
metadata=table_meta_dict["metadata"]


#display summary table metadata in an interactive json view
JSON(
    {
        "metadata-location": metadata_location,
        "location": metadata["location"],
        "table-uuid": metadata["table-uuid"],
        "schemas":[{
            "schema-id": schema["schema-id"],
            "fields": ", ".join([
                "::".join([ 
                  str(field["id"]),
                  field["name"],
                  field["type"],
                  "required" if field["required"] else "nullable"
                ]) 
                for field in schema['fields']
            ]),
            "current-schema": "✳️" if schema["schema-id"] == metadata["current-schema-id"] else "false"
         } for schema in metadata["schemas"]],
        "last-sequence-number": metadata["last-sequence-number"],
        "snapshots":[ {
            "sequence-number": snapshot["sequence-number"],
            "snapshot-id": snapshot["snapshot-id"],
            "summary": str(snapshot['summary']),
            "manifest-list": snapshot['manifest-list'],
            "schema-id": snapshot['schema-id'],
            "refs": str(["{}::{}".format(name, ref["type"]) for name, ref in metadata["refs"].items() if ref["snapshot-id"]==snapshot["snapshot-id"]]),
            "current-snapshot": "✳️" if snapshot["snapshot-id"] == metadata["current-snapshot-id"] else "false"
         } for snapshot in metadata["snapshots"]],
        "snapshot-log": [str(log) for log in metadata["snapshot-log"]],
        "metadata-log": [str(log) for log in metadata["metadata-log"]],
        "statistics": str(metadata["statistics"]),
        "partition-statistics": str(metadata["partition-statistics"])
    },
    root='.table-metadata',
    expanded=True
)

## Query `nyc.taxis` from Trino

<img src="https://raw.githubusercontent.com/trinodb/presentations/main/assets/logos/cbb-isolated.svg" alt="trino-logo" style="width: 500px;"/>

To verify that Trino also reads the shared Iceberg table representation correctly, let's run the same query we ran in Spark, as well as a query that pulls out specific fields and makes assumptions about types based on the schema we saw above. This is nothing new, but it is important for understanding what the Iceberg table metadata offers us in terms of interoperability.

In [None]:
from IPython.display import Markdown, display

# Run using the Trino CLI since %%sql magic is from pySpark and only connects to Spark
trino_out=!(trino --server='http://trino-iceberg:8080/iceberg' \
--output-format='MARKDOWN' \
--execute="\
SELECT \
  * \
FROM \
  nyc.taxis \
LIMIT 10") 2> /dev/null

display(Markdown(trino_out.nlstr))

In [None]:
from IPython.display import Markdown, display

# Run using the Trino CLI since %%sql magic is from pySpark and only connects to Spark
trino_out=!(trino --server='http://trino-iceberg:8080' \
--output-format='MARKDOWN' \
--execute="\
SELECT \
  vendorid, \
  format_datetime(tpep_pickup_datetime, 'YYYY-DDD HH:mm') AS pickup,     /* Trino timestamp type */\
  format_datetime(tpep_dropoff_datetime, 'YYYY-DDD HH:mm') AS dropoff,   /* Trino timestamp type */\
  CAST(passenger_count AS INT) AS num_riders,                            /* Trino Numeric type */\
  trip_distance AS distance, \
  PULocationID AS pickup_loc_id, \
  DOLocationID AS dropoff_loc_id, \
  payment_type AS pay_type, \
  fare_amount AS base, \
  format('%.2f%%', mta_tax) AS tax,                                      /* Trino DOUBLE type */\
  tip_amount AS tip, \
  total_amount AS total \
FROM \
  iceberg.nyc.taxis \
LIMIT 10") 2> /dev/null

display(Markdown(trino_out.nlstr))

<img src="https://www.apache.org/logos/res/spark/spark.png" alt="spark-logo" style="width: 500px;"/>

## Create a view

Let's create an Iceberg view to look at the longest distances travelled and the total amount of the trips.

In [None]:
%%sql
CREATE VIEW
  nyc.long_distances (
    vendor_id COMMENT 'Vendor ID',
    pickup_date,
    dropoff_date,
    distance COMMENT 'Trip Distance',
    total COMMENT 'Total amount'
  ) AS
SELECT
  VendorID,
  tpep_pickup_datetime,
  tpep_dropoff_datetime,
  trip_distance,
  total_amount
FROM
  nyc.taxis
ORDER BY
  trip_distance

In [None]:
%%sql
SELECT
  *
FROM
  nyc.long_distances
LIMIT
  10

## Update View to order results differently

The output isn't as helpful as imagined, so let's update the view and change the order of columns and the ordering of the results.

In [None]:
%%sql
CREATE
OR REPLACE VIEW nyc.long_distances (
  distance COMMENT 'Trip Distance',
  total COMMENT 'Total amount',
  vendor_id COMMENT 'Vendor ID',
  pickup_date,
  dropoff_date
) AS
SELECT
  trip_distance,
  total_amount,
  VendorID,
  tpep_pickup_datetime,
  tpep_dropoff_datetime
FROM
  nyc.taxis
WHERE
  trip_distance > 35
ORDER BY
  total_amount,
  trip_distance

In [None]:
%%sql
SELECT
  *
FROM
  nyc.long_distances
LIMIT 
  10

<img src="https://www.apache.org/logos/res/iceberg/iceberg.png" alt="iceberg-rest-logo" style="width: 500px;bottom: 0;margin: 0;"/>
<div style="font-size:60px;top: 0;margin: 0;"> {<span style="color:#287ABE;font-weight:bold;">REST</span>:<span style="color:#287ABE;font-weight:bold;">Catalog</span>} </div>

## View the Views in the catalog 👀

Now that we've both created and replaced a view, let's look at the view metadata in the REST catalog and compare it with the table metadata we saw before. It's important to notice both the differences and the similarities in what data is represented.

In [None]:
from json import loads, dumps
from IPython.display import JSON

#call the rest api using curl and assign back to IPython SList
view_meta=!curl -s http://rest:8181/v1/namespaces/nyc/views/long_distances

#parse view metadata payload into Python dictionary and print payload
view_meta_dict=loads(view_meta.spstr)
print(dumps(view_meta_dict, indent=2)[0:600], "\n  ...", "\n}")
#print(dumps(view_meta_dict, indent=2))

In [None]:
#get basic view metadata
metadata_location=view_meta_dict["metadata-location"]
metadata=view_meta_dict["metadata"]

current_schema_id = next(version for version in metadata["versions"] if version["version-id"] == metadata["current-version-id"])["schema-id"]

#display summary view metadata in an interactive json view
JSON(
    {
        "metadata-location": metadata_location,
        "location": metadata["location"],
        "view_uuid": metadata["view-uuid"],
        "schemas":[{
            "schema-id": schema["schema-id"],
            "fields": ", ".join([
                "::".join([ 
                  str(field["id"]),
                  field["name"],
                  field["type"],
                  "required" if field["required"] else "nullable"
                ]) 
                for field in schema['fields']
            ]),
            "current-schema": "✳️" if schema["schema-id"] == current_schema_id else "false"
         } for schema in metadata["schemas"]],
        "versions":[ {
            "version-id": version["version-id"],
            "summary": str(version['summary']),
            "default-namespace": version['default-namespace'],
            "schema-id": version['schema-id'],
            "representations": version['representations'],
            "current-version": "✳️" if version["version-id"] == metadata["current-version-id"] else "false"
         } for version in metadata["versions"]],
        "version-log": [str(log) for log in metadata["version-log"]]
    },
    root='.view-metadata',
    expanded=True
)

<img src="https://raw.githubusercontent.com/trinodb/presentations/main/assets/logos/cbb-isolated.svg" alt="trino-logo" style="width: 500px;"/>

## Query `nyc.long_distances` view from Trino

In [None]:
from IPython.display import Markdown, display

# Run using the Trino CLI since %%sql magic is from pySpark and only connects to Spark
trino_out=!(trino --server='http://trino-iceberg:8080/iceberg' \
--output-format='MARKDOWN' \
--execute="\
SELECT \
  * \
FROM \
  nyc.long_distances \
LIMIT 10")

print('\n'.join([line for line in trino_out.l if line.startswith('Query')]))

Although this seems like a bug, it is actually intentional, based on the current discussion in [the Trino Pull Request](https://github.com/trinodb/trino/pull/19818#discussion_r1400212612). This is a great point to discuss the meat of the differences between table and view representations.


## Fundamental difference between Iceberg Tables and Views

### Table And View similarities
When you work with views and tables directly in SQL, it can be hard to see how the [Iceberg Table Spec](https://iceberg.apache.org/spec/) and the [Iceberg View Spec](https://iceberg.apache.org/view-spec/) model them differently. This is because views are designed to provide the same experience as a table in SQL. Let's first talk about the properties and traits that the two representations share.

* Both views and tables are associated with a schema containing column ids, names, types, and whether each column is nullable.
* They both store the **warehouse** and **namespace** to address the table/view in SQL.
* They both track schema changes over time.
* They both hold information that tells them where to find data, but what they store and how that information is used to find data is where these implementations diverge.
* Both use Iceberg's [optimistic concurrency](https://iceberg.apache.org/spec/#optimistic-concurrency) model with a catalog, updating a common [metadata location](https://iceberg.apache.org/view-spec/#metadata-location). This ensures that evolving a view definition carries the same guarantees as evolving a table.
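To make the overlap concrete, the sketch below compares the metadata keys that the summary cells earlier in this notebook read from the table payload and from the view payload. The two sets are copied from those cells for illustration only; they are not an exhaustive list of fields in either spec.

```python
# Keys read from metadata in the table summary cell (illustrative subset).
table_keys = {
    "location", "table-uuid", "schemas", "current-schema-id",
    "last-sequence-number", "snapshots", "current-snapshot-id", "refs",
    "snapshot-log", "metadata-log", "statistics", "partition-statistics",
}

# Keys read from metadata in the view summary cell (illustrative subset).
view_keys = {
    "location", "view-uuid", "schemas",
    "versions", "current-version-id", "version-log",
}

print("shared:    ", sorted(table_keys & view_keys))
print("table only:", sorted(table_keys - view_keys))
print("view only: ", sorted(view_keys - table_keys))
```

Among these keys, only the location and the schemas are shared. Snapshots and statistics appear only on the table side, and versions only on the view side.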

### Table Spec
Iceberg table metadata contains the information needed for any compute engine to retrieve the data from storage correctly and efficiently. It holds an abstract internal representation of the table, including:

* The actual location of the data files on disk and columnar ranges of each file to avoid reading files from storage that do not contain relevant information.
* The snapshots of committed data to a table over time.
* The partitioning scheme over time to enable skipping the reads of irrelevant partitions of a table.
* The ordering of the data on disk.
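As a rough illustration of how per-file column ranges let an engine skip files, the sketch below filters a list of data files for the predicate `trip_distance > 35`. The `DataFile` class, the file paths, and the bound values are all hypothetical stand-ins and do not mirror Iceberg's manifest format.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    """Hypothetical stand-in for a data file entry with per-column bounds."""
    path: str
    lower_bounds: dict = field(default_factory=dict)
    upper_bounds: dict = field(default_factory=dict)


def may_match_greater_than(data_file: DataFile, column: str, value: float) -> bool:
    """Return True unless the upper bound proves no row has column > value."""
    upper = data_file.upper_bounds.get(column)
    # Without a recorded bound we cannot rule the file out, so we keep it.
    return upper is None or upper > value


files = [
    DataFile("data/a.parquet", {"trip_distance": 0.0}, {"trip_distance": 12.4}),
    DataFile("data/b.parquet", {"trip_distance": 0.5}, {"trip_distance": 88.1}),
    DataFile("data/c.parquet"),  # no statistics recorded
]

to_scan = [f.path for f in files if may_match_greater_than(f, "trip_distance", 35)]
print(to_scan)
```

The first file is skipped because its largest recorded `trip_distance` is below the threshold. The file without statistics has to be read, since nothing proves it is irrelevant.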

### View Spec
Iceberg views can store the same view in multiple SQL dialects, though currently there is no mechanism in Iceberg or the spec that ensures more than one dialect is actually represented. A view in Iceberg holds the view query in the SQL dialect of the defining compute engine using these abstractions:

* The [view representation](https://iceberg.apache.org/view-spec/#representations) contains the raw query stored in `sql`, the query syntax `dialect`, and the `type` of representation (currently limited to `"sql"`). It's worth noting that, although no direct procedure supports it yet, Iceberg views can store multiple representations of the view in different dialects. This will enable different compute engines to pick the most appropriate representation and either transpile it themselves or run it natively if it's a direct match.
* As opposed to being explicitly defined, as with the `CREATE TABLE` statement, a view's schema is derived from the result set of the query. In some cases, the `CREATE VIEW` statement will allow you to add a schema to explicitly cast all fields to the same type.
* A [view version](https://iceberg.apache.org/view-spec/#versions) is the combination of the current schema, the current view representation, and a few more fields. Similar to snapshots, view versions have a log field that details when they've been updated.
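The lookup an engine has to perform can be sketched as follows: find the current version, then search its representations for a matching dialect. The metadata dictionary is a trimmed, hypothetical sample shaped like the REST payload shown earlier, and the helper names are illustrative. How a real engine reacts to a missing dialect (fail, transpile, or try another representation) is up to that engine.

```python
from typing import Optional

# Hypothetical, trimmed sample shaped like the "metadata" object of a view.
sample_view_metadata = {
    "current-version-id": 2,
    "versions": [
        {
            "version-id": 1,
            "representations": [
                {"type": "sql", "dialect": "spark", "sql": "SELECT 1"},
            ],
        },
        {
            "version-id": 2,
            "representations": [
                {
                    "type": "sql",
                    "dialect": "spark",
                    "sql": "SELECT trip_distance FROM nyc.taxis WHERE trip_distance > 35",
                },
            ],
        },
    ],
}


def current_version(metadata: dict) -> dict:
    """Return the version entry whose id matches current-version-id."""
    return next(
        version
        for version in metadata["versions"]
        if version["version-id"] == metadata["current-version-id"]
    )


def sql_for_dialect(metadata: dict, dialect: str) -> Optional[str]:
    """Return the SQL text stored for `dialect`, or None if there is none."""
    for representation in current_version(metadata)["representations"]:
        if (
            representation["type"] == "sql"
            and representation["dialect"].lower() == dialect.lower()
        ):
            return representation["sql"]
    return None


print("spark:", sql_for_dialect(sample_view_metadata, "spark"))
print("trino:", sql_for_dialect(sample_view_metadata, "trino"))
```

In this sample only a Spark representation exists, so the lookup for `trino` returns `None`. An engine in that position must decide whether to reject the view or attempt to run a foreign dialect.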

## Table vs View Guarantees

Using schemas, both tables and views ensure that data types adhere to a contract, and this type checking leads to better data quality. The catch is that views are tightly coupled to the compute engine's query syntax, the state of the compute engine, and its typing system to infer the correct schema. Unlike Postgres views, which use a uniform typing system over Postgres's own tables where types can be validated, the behavior when reading an Iceberg view written in a SQL dialect other than your own is up to the query engine, based on how it implements error handling around corrupted or uncertain states. We'll show below how Spark enables reading view dialects outside of Spark. Both of these approaches have their tradeoffs.

Perhaps this doesn't concern you, since you'll get an error message in most cases. However, more nefarious silent failures can occur when view models lack full visibility of the schema and physical locations of tables. This issue is not Iceberg-specific; it derives from using interoperable engines on top of storage layers.


# Listing and describing views

## Create a `nyc.negative_amounts` view, this time in Trino
It appears that there are trips with negative total amounts. Let's display these results in a separate view.

In [None]:
from IPython.display import Markdown, display

# Run using the Trino CLI since %%sql magic is from pySpark and only connects to Spark
trino_out=!(trino --server='http://trino-iceberg:8080' \
--catalog='iceberg' \
--output-format='MARKDOWN' \
--execute="\
CREATE OR REPLACE VIEW nyc.negative_amounts AS \
SELECT \
  total_amount, \
  trip_distance, \
  VendorID, \
  tpep_pickup_datetime, \
  tpep_dropoff_datetime \
FROM \
  nyc.taxis \
WHERE \
  total_amount < 0 \
ORDER BY \
  total_amount \
") 2> /dev/null

display(Markdown(trino_out.nlstr))

We should be able to query this view in Trino since Trino created this dialect. Let's first show the views in Trino. Trino does [not currently support a way to filter views versus tables](https://github.com/trinodb/trino/issues/2999#issuecomment-1127533014), so to verify which views will show, let's run a `SHOW TABLES` command, followed by a simple query of the new Trino view.

In [None]:
from IPython.display import Markdown, display

# Run using the Trino CLI since %%sql magic is from pySpark and only connects to Spark
trino_out=!(trino --server='http://trino-iceberg:8080/iceberg' \
--output-format='MARKDOWN' \
--execute="SHOW TABLES IN nyc") 2> /dev/null

display(Markdown(trino_out.nlstr))

In [None]:
from IPython.display import Markdown, display

# Run using the Trino CLI since %%sql magic is from pySpark and only connects to Spark
trino_out=!(trino --server='http://trino-iceberg:8080/iceberg' \
--output-format='MARKDOWN' \
--execute="\
SELECT \
  * \
FROM \
  nyc.negative_amounts \
LIMIT 10") 2> /dev/null

display(Markdown(trino_out.nlstr))

In [None]:
from IPython.display import Markdown, display

# Run using the Trino CLI since %%sql magic is from pySpark and only connects to Spark
trino_out=!(trino --server='http://trino-iceberg:8080/iceberg' \
--output-format='MARKDOWN' \
--execute="SHOW CREATE VIEW nyc.negative_amounts")

display(Markdown(trino_out.nlstr))

In [None]:
from json import loads, dumps
from IPython.display import JSON

#call the rest api using curl and assign back to IPython SList
view_meta=!curl -s http://rest:8181/v1/namespaces/nyc/views/negative_amounts

#parse view metadata payload into Python dictionary and print payload
view_meta_dict=loads(view_meta.spstr)
#print(dumps(view_meta_dict, indent=2)[0:600], "\n  ...", "\n}")
print(dumps(view_meta_dict, indent=2))

# Listing and describing views

In [None]:
%%sql
SHOW VIEWS IN nyc

In [None]:
spark.catalog.setCurrentCatalog("demo")

In [None]:
%%sql
SELECT
  *
FROM
  nyc.negative_amounts
LIMIT 10

In [None]:
%%sql
DESCRIBE nyc.long_distances

In [None]:
%%sql
DESCRIBE EXTENDED nyc.long_distances

# Displaying the CREATE statement of a view

In [None]:
%%sql
SHOW
CREATE TABLE
  nyc.long_distances

# Altering and displaying properties of a view

This will add a new property and also update the comment of the view. 
The comment will be shown when describing the view.
The end of this section will also remove a property from the view.

In [None]:
%%sql
SHOW TBLPROPERTIES nyc.long_distances

In [None]:
%%sql
ALTER VIEW nyc.long_distances
SET
  TBLPROPERTIES (
    'key1' = 'val1',
    'key2' = 'val2',
    'comment' = 'This is a view comment'
  )

In [None]:
%%sql
SHOW TBLPROPERTIES nyc.long_distances

In [None]:
%%sql
DESCRIBE EXTENDED nyc.long_distances

In [None]:
%%sql
ALTER VIEW nyc.long_distances UNSET TBLPROPERTIES ('key1')

In [None]:
%%sql
SHOW TBLPROPERTIES nyc.long_distances