Nessie Demo
===========
This demo shows how to use the Nessie Python API together with Spark 3 and Iceberg.

Initialize Pyspark + Nessie environment
----------------------------------------------

In [1]:
import os
from pyspark.sql import Row, SparkSession
from pyspark import SparkConf, SparkContext
from py4j.java_gateway import java_import

spark = SparkSession.builder \
                    .config("spark.sql.execution.pyarrow.enabled", "true") \
                    .config("spark.hadoop.fs.defaultFS", 'file://' + os.getcwd() + '/spark_warehouse') \
                    .config("spark.hadoop.nessie.url", os.getenv("NESSIE_ENDPOINT")) \
                    .config("spark.hadoop.nessie.ref", "main") \
                    .config("spark.sql.catalog.nessie", "com.dremio.nessie.iceberg.spark.NessieIcebergSparkCatalog") \
                    .getOrCreate()
sc = spark.sparkContext
jvm = sc._gateway.jvm

java_import(jvm, "com.dremio.nessie.iceberg.NessieCatalog")
java_import(jvm, "org.apache.iceberg.catalog.TableIdentifier")
java_import(jvm, "org.apache.iceberg.Schema")
java_import(jvm, "org.apache.iceberg.types.Types")
java_import(jvm, "org.apache.iceberg.PartitionSpec")
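The session setup above reads `NESSIE_ENDPOINT` with `os.getenv`, which returns `None` when the variable is unset. A minimal sketch of a guard around that is shown here: the helper `nessie_spark_conf` is hypothetical, and the example endpoint is only inferred from the URLs that appear in the error output later in this demo. It collects the same settings in a plain dict and fails early with a readable message.

```python
import os


def nessie_spark_conf(endpoint=None, ref="main", warehouse_dir=None):
    """Return the Spark settings used in this demo as a dict (hypothetical helper)."""
    endpoint = endpoint or os.getenv("NESSIE_ENDPOINT")
    if not endpoint:
        # Fail early instead of handing None to the Spark builder.
        raise ValueError("Set NESSIE_ENDPOINT or pass endpoint explicitly")
    warehouse_dir = warehouse_dir or os.path.join(os.getcwd(), "spark_warehouse")
    return {
        "spark.sql.execution.pyarrow.enabled": "true",
        "spark.hadoop.fs.defaultFS": "file://" + warehouse_dir,
        "spark.hadoop.nessie.url": endpoint,
        "spark.hadoop.nessie.ref": ref,
        "spark.sql.catalog.nessie": "com.dremio.nessie.iceberg.spark.NessieIcebergSparkCatalog",
    }


# Assumed endpoint and warehouse path, for illustration only.
conf = nessie_spark_conf(
    endpoint="http://nessie:19120/api/v1",
    warehouse_dir="/tmp/spark_warehouse",
)
for key, value in sorted(conf.items()):
    print(key, "=", value)
```

Each pair could then be passed to the builder with `.config(key, value)` in a loop before calling `.getOrCreate()`.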

Set up nessie branches
----------------------------

- Branch `main` already exists
- Create branch `dev`
- List all branches
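The branch operations listed above can be pictured with a toy model in which a branch is just a name pointing at a commit hash. This is a sketch for intuition, not Nessie's implementation, and `create_branch` is a hypothetical helper. Creating `dev` copies the hash that `main` currently points to, which is why the two branches show the same hash right after creation.

```python
# Toy model: a branch is a name that points at a commit hash.
refs = {"main": "e0b41c30f0710277532f51242994e10acfdc46bf"}


def create_branch(refs, name, source="main"):
    """Create `name` pointing at the commit that `source` currently points to."""
    if name in refs:
        raise ValueError(f"branch {name!r} already exists")
    refs[name] = refs[source]
    return refs[name]


create_branch(refs, "dev")
for name, commit in refs.items():
    print(name, commit)
```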

In [2]:
!nessie branch dev




In [3]:
!nessie --verbose branch

* main  e0b41c30f0710277532f51242994e10acfdc46bf comment
  dev   e0b41c30f0710277532f51242994e10acfdc46bf comment



Create tables under dev branch
-------------------------------------

Creating two tables under the `dev` branch:
- region
- nation

It is not yet possible to create tables using PySpark and Iceberg, so the Iceberg
Java API is called through the Py4J gateway instead.

In [4]:
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("nessie.ref", "dev")
catalog = jvm.NessieCatalog(hadoop_conf)

# Creating region table
region_name = jvm.TableIdentifier.parse("testing.region")
region_schema = jvm.Schema([
    jvm.Types.NestedField.optional(1, "R_REGIONKEY", jvm.Types.LongType.get()),
    jvm.Types.NestedField.optional(2, "R_NAME", jvm.Types.StringType.get()),
    jvm.Types.NestedField.optional(3, "R_COMMENT", jvm.Types.StringType.get()),
])
region_spec = jvm.PartitionSpec.unpartitioned()

region_table = catalog.createTable(region_name, region_schema, region_spec)
region_df = spark.read.load("data/region.parquet")
region_df.write.option('hadoop.nessie.ref', 'dev').format("iceberg").mode("overwrite").save("testing.region")

# Creating nation table
nation_name = jvm.TableIdentifier.parse("testing.nation")
nation_schema = jvm.Schema([
    jvm.Types.NestedField.optional(1, "N_NATIONKEY", jvm.Types.LongType.get()),
    jvm.Types.NestedField.optional(2, "N_NAME", jvm.Types.StringType.get()),
    jvm.Types.NestedField.optional(3, "N_REGIONKEY", jvm.Types.LongType.get()),
    jvm.Types.NestedField.optional(4, "N_COMMENT", jvm.Types.StringType.get()),
])
nation_spec = jvm.PartitionSpec.builderFor(nation_schema).truncate("N_NAME", 2).build()
nation_table = catalog.createTable(nation_name, nation_schema, nation_spec)

nation_df = spark.read.load("data/nation.parquet")
nation_df.write.option('hadoop.nessie.ref', 'dev').format("iceberg").mode("overwrite").save("testing.nation")
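The `nation` table is partitioned with `truncate("N_NAME", 2)`. For strings, Iceberg's truncate transform keeps at most the first W characters, so rows whose names share a two-letter prefix land in the same partition. The sketch below imitates that in plain Python to show the grouping. The helper names are hypothetical, and `SYRIA` is an illustrative value that does not come from the demo data.

```python
from collections import defaultdict


def truncate_string(value, width):
    """Mimic a string truncate transform: keep at most `width` characters."""
    return value[:width]


def group_by_partition(names, width=2):
    """Group names by their truncated prefix, as a partitioned layout would."""
    partitions = defaultdict(list)
    for name in names:
        partitions[truncate_string(name, width)].append(name)
    return dict(partitions)


# Illustrative values, not read from data/nation.parquet.
sample = ["SYLDAVIA", "SAN THEODOROS", "SYRIA"]
for prefix, members in sorted(group_by_partition(sample).items()):
    print(prefix, members)
```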


Check generated tables
----------------------------

Check tables generated under the dev branch (and that the main branch does not
have any tables)

In [5]:
!nessie contents --list




In [6]:
!nessie contents --list --ref dev

UNKNOWN:
	testing.nation
	testing.region



Note that the `dev` and `main` branches now point to different commits.

In [7]:
!nessie --verbose branch

* main  e0b41c30f0710277532f51242994e10acfdc46bf comment
  dev   b7dd56f9ea974fb62ee7d50813464d51a438f205 comment



Dev promotion
-------------

Promote the `dev` branch to `main`.

* main now has the same tables as dev
* main and dev point to the same commit

In [8]:
!nessie merge dev --force




In [9]:
!nessie contents --list

UNKNOWN:
	testing.nation
	testing.region



In [10]:
!nessie --verbose branch

* main  b7dd56f9ea974fb62ee7d50813464d51a438f205 comment
  dev   b7dd56f9ea974fb62ee7d50813464d51a438f205 comment



Create `etl` branch
----------------------

- Create a branch `etl` out of `main`
- add data to nation
- alter the schema of region
- create table city
- query the tables in `etl`
- query the tables in `main`
- promote `etl` branch to `main`

In [11]:
!nessie branch etl main




In [12]:
Nation = Row("N_NATIONKEY", "N_NAME", "N_REGIONKEY", "N_COMMENT")
new_nations = spark.createDataFrame([
    Nation(25, "SYLDAVIA", 3, "King Ottokar's Sceptre"),
    Nation(26, "SAN THEODOROS", 1, "The Picaros")])
new_nations.write.option('hadoop.nessie.ref', 'etl').format("iceberg").mode("append").save("testing.nation")

In [13]:
!nessie --verbose branch

* main  b7dd56f9ea974fb62ee7d50813464d51a438f205 comment
  dev   b7dd56f9ea974fb62ee7d50813464d51a438f205 comment
  etl   db44e81a4762fb4c479135bb397c097811cee2fd comment



In [14]:
# changing the default branch
hadoop_conf.set('nessie.ref', 'etl')

etl_catalog = jvm.NessieCatalog(hadoop_conf)
etl_catalog.loadTable(region_name).updateSchema().addColumn('R_ABBREV', jvm.Types.StringType.get()).commit()

In [15]:
!nessie --verbose branch

* main  b7dd56f9ea974fb62ee7d50813464d51a438f205 comment
  dev   b7dd56f9ea974fb62ee7d50813464d51a438f205 comment
  etl   87ebd562faf6d3a7d39c934d8ad393b62825b31c comment
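Adding `R_ABBREV` changes the table's schema without rewriting existing rows, so rows written before the change have no value for the new column and read back as null. The sketch below models that with plain dicts. It is an illustration of the behaviour, not the Iceberg API; `read_with_schema` is a hypothetical helper and the comment values are placeholders.

```python
def read_with_schema(rows, columns):
    """Project each stored row onto `columns`, filling missing ones with None."""
    return [{col: row.get(col) for col in columns} for row in rows]


# Rows written before R_ABBREV existed (comment values are placeholders).
stored_rows = [
    {"R_REGIONKEY": 0, "R_NAME": "AFRICA", "R_COMMENT": "..."},
    {"R_REGIONKEY": 1, "R_NAME": "AMERICA", "R_COMMENT": "..."},
]
schema = ["R_REGIONKEY", "R_NAME", "R_COMMENT"]
evolved_schema = schema + ["R_ABBREV"]  # mirrors addColumn('R_ABBREV', ...)

for row in read_with_schema(stored_rows, evolved_schema):
    print(row)
```

The stored rows are left untouched; only the projection applied at read time changes.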



In [16]:
# Creating city table
sc.getConf().set("spark.hadoop.nessie.ref", "etl")
spark.sql("CREATE TABLE nessie.testing.city (C_CITYKEY BIGINT, C_NAME STRING, N_NATIONKEY BIGINT, C_COMMENT STRING) USING iceberg PARTITIONED BY (N_NATIONKEY)")

DataFrame[]

In [17]:
!nessie --verbose branch

* main  b7dd56f9ea974fb62ee7d50813464d51a438f205 comment
  dev   b7dd56f9ea974fb62ee7d50813464d51a438f205 comment
  etl   c3065a3d139b96bb882f723908de1d59664713b1 comment



In [18]:
from pynessie import init
nessie = init()
nessie.list_keys('main').entries

[Entry(kind='UNKNOWN', name=EntryName(elements=['testing', 'nation'])),
 Entry(kind='UNKNOWN', name=EntryName(elements=['testing', 'region']))]

In [19]:
[i.name for i in nessie.list_keys('etl').entries]

[EntryName(elements=['testing', 'nation']),
 EntryName(elements=['testing', 'city']),
 EntryName(elements=['testing', 'region'])]
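The two key listings above can be compared mechanically to see what `etl` would add to `main`. The sketch below works on the element lists copied from the output rather than on a live client, and `key_set` is a hypothetical helper.

```python
def key_set(entries):
    """Turn lists of key elements into a set of hashable tuples."""
    return {tuple(elements) for elements in entries}


# Element lists copied from the outputs above.
main_keys = key_set([["testing", "nation"], ["testing", "region"]])
etl_keys = key_set([["testing", "nation"], ["testing", "city"], ["testing", "region"]])

only_in_etl = sorted(".".join(key) for key in etl_keys - main_keys)
only_in_main = sorted(".".join(key) for key in main_keys - etl_keys)
print("only in etl:", only_in_etl)
print("only in main:", only_in_main)
```

With a live client, the input could presumably be built as `key_set(e.name.elements for e in nessie.list_keys('etl').entries)`, assuming the `elements` attribute shown in the printed `EntryName` objects is accessible that way.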

In [26]:
{i.name:i.hash_ for i in nessie.list_references()}

{'main': 'b7dd56f9ea974fb62ee7d50813464d51a438f205',
 'dev': 'b7dd56f9ea974fb62ee7d50813464d51a438f205',
 'etl': 'c3065a3d139b96bb882f723908de1d59664713b1'}

In [23]:
nessie.merge('main', 'etl')

NessieConflictException: Entity already exists at : 409 Client Error: Conflict for url: http://nessie:19120/api/v1/trees/branch/main/merge?expectedHash=b7dd56f9ea974fb62ee7d50813464d51a438f205

In [25]:
!nessie merge etl --force

Nessie Exception is Entity already exists at : 409 Client Error: Conflict for url: http://nessie:19120/api/v1/trees/branch/main/merge?expectedHash=b7dd56f9ea974fb62ee7d50813464d51a438f205 with status code: 409.
Server status: CONFLICT
Server message: The branch [main] does not have the expected hash [b7dd56f9ea974fb62ee7d50813464d51a438f205].
Server traceback: com.dremio.nessie.error.NessieConflictException: The branch [main] does not have the expected hash [b7dd56f9ea974fb62ee7d50813464d51a438f205].
	at com.dremio.nessie.services.rest.TreeResource.mergeRefIntoBranch(TreeResource.java:207)
	at com.dremio.nessie.services.rest.TreeResource_Subclass.mergeRefIntoBranch$$superaccessor11(TreeResource_Subclass.zig:2998)
	at com.dremio.nessie.services.rest.TreeResource_Subclass$$function$$11.apply(TreeResource_Subclass$$function$$11.zig:47)
	at io.quarkus.arc.impl.AroundInvokeInvocationContext.proceed(AroundInvokeInvocationContext.java:54)
	at io.quarkus.hib

In [22]:
{i.name:i.hash_ for i in nessie.list_references()}

{'main': 'b7dd56f9ea974fb62ee7d50813464d51a438f205',
 'dev': 'b7dd56f9ea974fb62ee7d50813464d51a438f205',
 'etl': 'c3065a3d139b96bb882f723908de1d59664713b1'}
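The merge request carries an `expectedHash` parameter, and the server message refers to the branch not having the expected hash. That pattern is an optimistic concurrency check: the update is applied only if the target branch still points where the caller believes it does. The sketch below shows the general idea with a toy reference map. It is not Nessie's implementation and not a diagnosis of the failure above; `ConflictError` and `merge` are hypothetical names.

```python
class ConflictError(Exception):
    """Stand-in for a 409 Conflict response (hypothetical name)."""


def merge(refs, target, source, expected_hash):
    """Move `target` to `source`'s commit only if `target` is at `expected_hash`."""
    current = refs[target]
    if current != expected_hash:
        raise ConflictError(
            f"The branch [{target}] does not have the expected hash [{expected_hash}]."
        )
    refs[target] = refs[source]
    return refs[target]


refs = {
    "main": "b7dd56f9ea974fb62ee7d50813464d51a438f205",
    "etl": "c3065a3d139b96bb882f723908de1d59664713b1",
}

try:
    # A stale expectation: main moved on from this hash earlier in the demo.
    merge(refs, "main", "etl", expected_hash="e0b41c30f0710277532f51242994e10acfdc46bf")
except ConflictError as err:
    print("conflict:", err)

# With the current hash as the expectation, the toy merge goes through.
merge(refs, "main", "etl", expected_hash=refs["main"])
print(refs["main"] == refs["etl"])
```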

Create `experiment` branch
--------------------------------

- create `experiment` branch from `main`
- drop `nation` table
- add data to `region` table
- compare `experiment` and `main` tables

In [27]:
!nessie branch experiment main




In [28]:
# changing the default branch
hadoop_conf.set('nessie.ref', 'experiment')

catalog = jvm.NessieCatalog(hadoop_conf)
catalog.dropTable(jvm.TableIdentifier.parse("testing.nation"), False)

True

In [29]:
spark.sql("set spark.hadoop.nessie.ref=experiment")
spark.sql('INSERT INTO TABLE nessie.testing.region VALUES (5, "AUSTRALIA", "Let\'s hop there", "AUS")')
spark.sql('INSERT INTO TABLE nessie.testing.region VALUES (6, "ANTARTICA", "It\'s cold", "ANT")')

DataFrame[]

In [30]:
!nessie contents --list --ref experiment

UNKNOWN:
	testing.region



Let's take a look at the contents of the region table on the experiment branch.
Notice the use of the `nessie` catalog.

In [39]:
spark.sql("select * from nessie.testing.`region`").toPandas()

   R_REGIONKEY  R_NAME       R_COMMENT                                          R_ABBREV
0  0            AFRICA       lar deposits. blithely final packages cajole. ...
1  1            AMERICA      hs use ironic, even requests. s
2  2            ASIA         ges. thinly even pinto beans ca
3  3            EUROPE       ly final courts cajole furiously final excuse
4  4            MIDDLE EAST  uickly special accounts cajole carefully blith...
5  5            AUSTRALIA    Let's hop there                                    AUS
6  6            ANTARTICA    It's cold                                          ANT


Now compare this to the contents of the region table on another branch. Notice the
use of the `@etl` suffix to view data on the `etl` branch.

In [45]:
spark.sql("select * from nessie.testing.`region@etl`").toPandas()

   R_REGIONKEY  R_NAME       R_COMMENT                                          R_ABBREV
0  0            AFRICA       lar deposits. blithely final packages cajole. ...
1  1            AMERICA      hs use ironic, even requests. s
2  2            ASIA         ges. thinly even pinto beans ca
3  3            EUROPE       ly final courts cajole furiously final excuse
4  4            MIDDLE EAST  uickly special accounts cajole carefully blith...
5  5            AUSTRALIA    Let's hop there                                    AUS
6  6            ANTARTICA    It's cold                                          ANT
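The two queries above differ only in the table identifier: a plain name reads from the session's current reference, while a `name@ref` form inside backticks names the reference explicitly. A small helper can build both forms; `table_at` is a hypothetical name and the defaults simply mirror the catalog and namespace used in this demo.

```python
def table_at(table, ref=None, catalog="nessie", namespace="testing"):
    """Build a backtick-quoted table name, optionally pinned to a reference."""
    name = f"{table}@{ref}" if ref else table
    return f"{catalog}.{namespace}.`{name}`"


print(table_at("region"))
print(table_at("region", ref="etl"))
```

The result can be dropped into a query string, for example `spark.sql(f"select * from {table_at('region', ref='etl')}")`.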


In [44]:
!nessie --verbose branch

  experiment  8679b0d54540fe40e0254f4c64cecc724ddd854e comment
* main        b7dd56f9ea974fb62ee7d50813464d51a438f205 comment
  dev         b7dd56f9ea974fb62ee7d50813464d51a438f205 comment
  etl         0e5c4fedab0c559fa4727dd201ac7a7e503188b6 comment
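To check branch state programmatically instead of by eye, the verbose listing can be parsed into a name-to-hash mapping. The sketch below assumes the line layout shown above (an optional `*` marker, then name, hash, and comment) and strips terminal color codes if present. `parse_branches` is a hypothetical helper, and the layout is inferred from this output only.

```python
import re

# Matches terminal color sequences such as ESC[33m.
ANSI = re.compile(r"\x1b\[[0-9;]*m")


def parse_branches(text):
    """Parse `nessie --verbose branch` style output into {name: hash}."""
    branches = {}
    for line in text.splitlines():
        line = ANSI.sub("", line).strip()
        if not line:
            continue
        if line.startswith("*"):
            # The marker flags the default branch; drop it before splitting.
            line = line[1:].strip()
        name, commit = line.split()[:2]
        branches[name] = commit
    return branches


listing = """\
  experiment  8679b0d54540fe40e0254f4c64cecc724ddd854e comment
* main        b7dd56f9ea974fb62ee7d50813464d51a438f205 comment
  dev         b7dd56f9ea974fb62ee7d50813464d51a438f205 comment
  etl         0e5c4fedab0c559fa4727dd201ac7a7e503188b6 comment
"""
branches = parse_branches(listing)
same_as_main = sorted(
    name for name, commit in branches.items()
    if commit == branches["main"] and name != "main"
)
print("branches at the same commit as main:", same_as_main)
```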

