# Managing Iceberg tables

In this part of the workshop we'll look at the different ways Iceberg enables you to optimize and maintain your tables.

You can learn more in the Iceberg [documentation](https://iceberg.apache.org/docs/latest/spark-procedures/#metadata-management).

### Starting Spark

Start Spark and connect to your Polaris Catalog.

In [None]:
## Update with your principal user credentials (from Polaris Catalog)

clientId="0b8097fb53c92862"
clientSecret="85c2af291ebdc578d95efd768aeac0e5"

In [None]:
## Start the Spark application and connect to our Polaris Catalog

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('iceberg_lab') \
.config('spark.sql.defaultCatalog', 'polaris') \
.config('spark.sql.catalog.polaris', 'org.apache.iceberg.spark.SparkCatalog') \
.config('spark.sql.catalog.polaris.type', 'rest') \
.config('spark.sql.catalog.polaris.header.X-Iceberg-Access-Delegation','true') \
.config('spark.sql.catalog.polaris.client.region','us-east-1') \
.config('spark.sql.catalog.polaris.uri','http://polaris-catalog:8181/api/catalog') \
.config('spark.sql.catalog.polaris.credential',clientId+':'+clientSecret) \
.config('spark.sql.catalog.polaris.warehouse','polariscatalog') \
.config('spark.sql.catalog.polaris.scope','PRINCIPAL_ROLE:ALL') \
.config('spark.sql.catalog.polaris.token-refresh-enabled', 'true') \
.getOrCreate()

### Create a table and load some data

You'll create a table and load some data.  We'll then optimize these files by compacting them.

In [None]:
import requests
import json

### https://data.cityofnewyork.us/NYC-BigApps/Citi-Bike-System-Data/vsnr-94wk

r = requests.get('https://gbfs.citibikenyc.com/gbfs/en/station_status.json')
station_status = r.json()

with open("/home/iceberg/notebooks/station_status.json", "w") as f:
    for item in station_status['data']['stations']:
        ## Write one JSON object per line so Spark can read the file as line-delimited JSON
        json.dump(item, f)
        f.write('\n')
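
If you want to check the file format without calling the live feed, the following is a minimal sketch of the same idea: write one JSON object per line and read the file back. The sample records, the temporary path, and the helper names `write_json_lines` and `read_json_lines` are made up for illustration and are not part of the workshop environment.

```python
import json
import os
import tempfile

# Hypothetical sample records standing in for station_status['data']['stations'].
sample_stations = [
    {"station_id": "a1", "num_bikes_available": 3, "is_installed": 1},
    {"station_id": "b2", "num_bikes_available": 0, "is_installed": 0},
]

def write_json_lines(records, path):
    """Write one JSON object per line, each terminated by a single newline."""
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

def read_json_lines(path):
    """Parse the file back into a list of dicts, skipping any blank lines."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample_station_status.json")
write_json_lines(sample_stations, sample_path)
round_trip = read_json_lines(sample_path)
print(len(round_trip), "records written and read back")
```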

In [None]:
spark.sql('DROP TABLE IF EXISTS demo.stations PURGE')

## The JSON reader infers the schema by default; "header" and "inferSchema" are CSV options
df = spark.read.format("json") \
          .load("/home/iceberg/notebooks/station_status.json")

df.repartition(100).write.saveAsTable('demo.stations')

In [None]:
%%sql

SELECT * FROM demo.stations limit 10

Check how many files were created.  In this example, we forced Spark to split the data into 100 files with `repartition(100)`, but in the real world small files accumulate naturally as data is written.

In [None]:
%%sql

SELECT count(*) FROM polaris.demo.stations.files

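### Rewrite data files, a.k.a. compaction

Compaction is an important process that combines small files into fewer, larger files.

We start off by compacting our table. Setting `min-input-files` to 2 makes a group of files eligible for rewriting as soon as it contains at least two files, and setting `rewrite-job-order` to `bytes-asc` rewrites the smallest file groups first.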

In [None]:
ret = spark.sql("CALL polaris.system.rewrite_data_files(table => 'demo.stations', options => map('min-input-files','2', 'rewrite-job-order','bytes-asc'))")
ret.show()
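
When no `strategy` is given, `rewrite_data_files` uses bin-packing. To build some intuition for why 100 tiny files can end up as one, here is a simplified sketch of greedy bin-packing in plain Python. It is an illustration only, not Iceberg's actual planner; the function name `bin_pack`, the 1 MB file sizes, and the 512 MB target are assumptions chosen for the example.

```python
def bin_pack(file_sizes_mb, target_size_mb):
    """Greedy first-fit packing: put each file into the first group that has room."""
    groups = []
    for size in sorted(file_sizes_mb, reverse=True):
        for group in groups:
            if sum(group) + size <= target_size_mb:
                group.append(size)
                break
        else:
            # No existing group has room, so start a new one.
            groups.append([size])
    return groups

# Hypothetical input: 100 small files of 1 MB each and a 512 MB target file size.
small_files = [1] * 100
packed = bin_pack(small_files, target_size_mb=512)
print(len(small_files), "files ->", len(packed), "file(s) after packing")
```

With larger inputs the same routine produces several output groups, each of which would become one rewritten file.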

Inspect the `files` metadata table again and you'll see that only a single file remains.

In [None]:
%%sql

SELECT count(*) FROM polaris.demo.stations.files

***Before starting this step, drop the table and recreate it as before so we can test out other compaction scenarios.***

In the following compaction scenario we sort the data during compaction. Iceberg supports bin-packing as well as sorting, using either a standard sort order or Z-order.
- Bin-packing simply combines small files into fewer, larger files without sorting the rows.
- Sorting organizes rows by a sort key so that similar data is colocated in the same files, making reads more efficient.
- Z-order is a more complex ordering across several columns that comes with its own pros and cons.
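
To make the Z-order idea more concrete, the sketch below interleaves the bits of two small integers to build a single sort key, so that rows close to each other in both dimensions tend to land near each other in the sorted output. This is a toy illustration under simplifying assumptions (two non-negative integer columns, a hypothetical `interleave_bits` helper); Iceberg's own implementation handles column types differently. In Iceberg you would request this ordering with `sort_order => 'zorder(col1, col2)'`.

```python
def interleave_bits(x, y, bits=16):
    """Build a Z-order (Morton) key by interleaving the bits of x and y."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # bits of x go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)    # bits of y go to odd positions
    return key

# Sort a small 4x4 grid of (x, y) points by their Z-order key.
points = [(x, y) for x in range(4) for y in range(4)]
z_sorted = sorted(points, key=lambda p: interleave_bits(p[0], p[1]))
print(z_sorted[:4])
```

The first four points in the sorted output form a 2x2 block of the grid, which is the clustering effect that lets a query filtering on either column skip more files.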

In [None]:
ret = spark.sql("CALL polaris.system.rewrite_data_files(table => 'demo.stations', strategy => 'sort', sort_order => 'station_id DESC NULLS LAST,legacy_id DESC NULLS LAST')")
ret.show()

Another interesting optimization is to compact only those files that match a specific filter.  This is helpful when there is large skew in the data and the low-cardinality data is not often compacted because it stays under the file-count or byte-size threshold.

In [None]:
## The filter is passed to the procedure as a string
ret = spark.sql("CALL polaris.system.rewrite_data_files(table => 'demo.stations', where => 'is_installed = 1')")
ret.show()
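
Because the `where` filter and the option values are all passed as quoted strings, it is easy to get the quoting wrong when you write these statements by hand. One way to avoid that is a small helper that assembles the statement for you. The sketch below is a hypothetical convenience function (`rewrite_data_files_sql` is not part of Iceberg or Spark), and it deliberately refuses values containing single quotes rather than guessing at escaping rules.

```python
def rewrite_data_files_sql(table, where=None, options=None, catalog="polaris"):
    """Build a CALL statement for rewrite_data_files with quoted arguments."""
    def quote(value):
        text = str(value)
        if "'" in text:
            # Escaping is out of scope for this sketch, so fail loudly instead.
            raise ValueError("single quotes are not handled in this sketch")
        return "'" + text + "'"

    args = ["table => " + quote(table)]
    if where is not None:
        args.append("where => " + quote(where))
    if options:
        pairs = ", ".join(quote(k) + ", " + quote(v) for k, v in options.items())
        args.append("options => map(" + pairs + ")")
    return "CALL " + catalog + ".system.rewrite_data_files(" + ", ".join(args) + ")"

stmt = rewrite_data_files_sql("demo.stations", where="is_installed = 1")
print(stmt)
```

You could then pass the result to `spark.sql(stmt)` in the notebook.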

### Expiring snapshots

As you have already noticed, Iceberg creates lots of snapshots to keep track of changes.  Each snapshot creates manifest files that track the table's data files and partitions.  Together, the snapshots also maintain the full table history so you can time travel in queries.  However, all of this takes up storage and costs you money.

It's good practice to expire old snapshots after some period of time or once a certain number of snapshots has accumulated.

First, inspect your `snapshots` metadata table and decide which snapshot to expire.

In [None]:
%%sql

SELECT * FROM polaris.demo.stations.snapshots

In [None]:
## Replace the snapshot ID below with a snapshot_id from your own snapshots table
ret = spark.sql("CALL polaris.system.expire_snapshots(table => 'demo.stations', snapshot_ids => ARRAY(642880844932688596))")
ret.show()

Inspect the `snapshots` table again and you'll see that the old snapshot has been removed.
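
Listing snapshot IDs by hand works for a demo, but expiring by age or by count is usually more practical. The `expire_snapshots` procedure also accepts `older_than` and `retain_last` arguments for this. The sketch below only builds the statement as a string; the helper name `expire_snapshots_sql`, the seven-day window, the fixed reference date, and the choice to retain five snapshots are assumptions for illustration.

```python
from datetime import datetime, timedelta

def expire_snapshots_sql(table, older_than, retain_last, catalog="polaris"):
    """Build a CALL statement that expires snapshots older than a cutoff."""
    # Format the cutoff with millisecond precision, e.g. 2024-01-01 00:00:00.000
    ts = older_than.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]
    return (
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', "
        f"older_than => TIMESTAMP '{ts}', "
        f"retain_last => {retain_last})"
    )

# Hypothetical policy: expire snapshots older than 7 days, but keep the last 5.
# A fixed reference date is used here so the output is reproducible.
cutoff = datetime(2024, 1, 8, 0, 0, 0) - timedelta(days=7)
stmt = expire_snapshots_sql("demo.stations", cutoff, retain_last=5)
print(stmt)
```

In the notebook you would compute the cutoff from `datetime.now()` and run the result with `spark.sql(stmt)`.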