<img src="./images/logo.svg" alt="lakeFS logo" width=300/> <img src="https://www.apache.org/logos/res/iceberg/iceberg.png" alt="Apache Iceberg logo" width=300/>  

## lakeFS ❤️ Apache Iceberg: a Medallion Architecture example using the Spark client

* [📚 lakeFS Apache Iceberg Integration Docs](https://docs.lakefs.io/integrations/iceberg.html)
* [Getting started with Iceberg in Spark](https://iceberg.apache.org/docs/nightly/spark-getting-started/)

## Prerequisites

###### This notebook requires a connection to lakeFS Cloud or lakeFS Enterprise.
###### Register for lakeFS Cloud: https://lakefs.cloud/register or contact us for a lakeFS Enterprise key: https://lakefs.io/contact-sales/

## Config

**_If you're not using the provided lakeFS server and MinIO storage, change these values to match your environment_**

### lakeFS endpoint and credentials

In [None]:
lakefsEndPoint = 'http://lakefs:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io' 
lakefsAccessKey = 'AKIAIOSFOLKFSSAMPLES'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'

### Object Storage

In [None]:
storageNamespace = 's3://example' # e.g. "s3://bucket"

---

## Setup

**(you shouldn't need to change anything in this section, just run it)**

In [None]:
repo_name = "lakefs-spark-medallion-iceberg"

### Versioning Information

In [None]:
mainBranch = "main"
devBranch = "dev"

### Iceberg Information

In [None]:
myCatalog = "my_catalog"
warehouseDir = "./tmp-spark-warehouse"

### Import libraries

In [None]:
import os
import lakefs
from assets.lakefs_demo import print_commit, print_diff

### Set environment variables

In [None]:
os.environ["LAKECTL_SERVER_ENDPOINT_URL"] = lakefsEndPoint
os.environ["LAKECTL_CREDENTIALS_ACCESS_KEY_ID"] = lakefsAccessKey
os.environ["LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY"] = lakefsSecretKey

### Define lakeFS UI Endpoint

In [None]:
# Both Docker-internal endpoints are reached from the browser through the same local port
if lakefsEndPoint.startswith(('http://host.docker.internal', 'http://lakefs:8000')):
    lakefsUIEndPoint = 'http://localhost:8084'
else:
    lakefsUIEndPoint = lakefsEndPoint

#### Verify lakeFS credentials by getting lakeFS version

In [None]:
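print("Verifying lakeFS credentials…")
try:
    v = lakefs.client.Client().version
except Exception as e:
    # Catch Exception instead of using a bare except, so a keyboard interrupt still stops the cell
    print(f"🛑 failed to get lakeFS version: {e}")
else:
    print(f"…✅lakeFS credentials verified\n\nℹ️lakeFS version {v}")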

### Define lakeFS Repository

In [None]:
repo = lakefs.Repository(repo_name).create(storage_namespace=f"{storageNamespace}/{repo_name}", default_branch=mainBranch, exist_ok=True)
branchMain = repo.branch(mainBranch)
print(repo)

---

### Set up Spark

**_If you're not using the provided MinIO storage, change the S3 storage endpoint (e.g. http://s3.us-east-1.amazonaws.com) and credentials to match your environment_**

In [None]:
from pyspark.sql import SparkSession

storage_endpoint = "http://minio:9000"
storage_access_key = "minioadmin"
storage_secret_key = "minioadmin"

spark = SparkSession.builder.appName("Iceberg / Jupyter") \
        .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.8.1") \
        .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
        .config("spark.hadoop.fs.s3a.endpoint", storage_endpoint) \
        .config("spark.hadoop.fs.s3a.access.key", storage_access_key) \
        .config("spark.hadoop.fs.s3a.secret.key", storage_secret_key) \
        .config("spark.hadoop.fs.s3a.path.style.access", "true") \
        .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider") \
        .config("spark.sql.catalog." + myCatalog, "org.apache.iceberg.spark.SparkCatalog") \
        .config("spark.sql.catalog." + myCatalog + ".type", "rest") \
        .config("spark.sql.catalog." + myCatalog + ".uri", lakefsEndPoint + "/iceberg/api") \
        .config("spark.sql.catalog." + myCatalog + ".oauth2-server-uri", lakefsEndPoint + "/iceberg/api/v1/oauth/tokens") \
        .config("spark.sql.catalog." + myCatalog + ".credential", lakefsAccessKey + ":" + lakefsSecretKey) \
        .config("spark.sql.catalog." + myCatalog + ".prefix", "lakefs") \
        .config("spark.sql.warehouse.dir", warehouseDir) \
        .config("spark.sql.catalog." + myCatalog + ".warehouse", warehouseDir) \
        .config("spark.sql.catalog." + myCatalog + ".io-impl", "org.apache.iceberg.hadoop.HadoopFileIO") \
        .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
        .getOrCreate()
spark.sparkContext.setLogLevel("INFO")

spark
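
To see which options the REST catalog receives, the `spark.sql.catalog.*` settings above can be assembled into a plain dictionary and inspected before the session is built. This is a minimal sketch: `iceberg_catalog_conf` is a hypothetical helper, not part of lakeFS or Iceberg, and the argument values are placeholders modelled on this notebook's defaults.

```python
def iceberg_catalog_conf(catalog, endpoint, access_key, secret_key, warehouse):
    """Build the catalog-scoped Spark options used in the session above."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": f"{endpoint}/iceberg/api",
        f"{prefix}.oauth2-server-uri": f"{endpoint}/iceberg/api/v1/oauth/tokens",
        f"{prefix}.credential": f"{access_key}:{secret_key}",
        f"{prefix}.prefix": "lakefs",
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.io-impl": "org.apache.iceberg.hadoop.HadoopFileIO",
    }

# Placeholder values; substitute the variables defined earlier in the notebook
conf = iceberg_catalog_conf(
    "my_catalog", "http://lakefs:8000", "KEY", "SECRET", "./tmp-spark-warehouse"
)
for key, value in conf.items():
    # Mask the credential so secrets do not end up in notebook output
    print(key, "=", "***" if key.endswith(".credential") else value)
```

Each pair could then be passed to `SparkSession.builder.config(key, value)` in a loop instead of chaining the calls by hand.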

### Create Iceberg namespaces

In [None]:
print(f"myCatalog: {myCatalog}, repo_name: {repo_name}, mainBranch: {mainBranch}")

In [None]:
%sql CREATE NAMESPACE IF NOT EXISTS {myCatalog}.`{repo_name}`.{mainBranch}.bronze
%sql CREATE NAMESPACE IF NOT EXISTS {myCatalog}.`{repo_name}`.{mainBranch}.silver
%sql CREATE NAMESPACE IF NOT EXISTS {myCatalog}.`{repo_name}`.{mainBranch}.gold

### List namespaces in the main branch

In [None]:
%sql SHOW NAMESPACES IN {myCatalog}.`{repo_name}`.{mainBranch}
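
Every statement in this notebook addresses objects in the order catalog, repository, branch, namespace, table. The repository name contains hyphens, so it is wrapped in backticks for Spark SQL. The sketch below builds such identifiers in one place; `table_ident` is a hypothetical helper introduced only for illustration.

```python
def table_ident(catalog, repo, ref, namespace, table=None):
    """Build a Spark SQL identifier for a namespace or table on a lakeFS ref."""
    # Backticks quote the repository name, which contains hyphens
    parts = [catalog, f"`{repo}`", ref, namespace]
    if table is not None:
        parts.append(table)
    return ".".join(parts)

print(table_ident("my_catalog", "lakefs-spark-medallion-iceberg", "main", "bronze", "authors"))
# my_catalog.`lakefs-spark-medallion-iceberg`.main.bronze.authors
```

Swapping the `ref` argument from `main` to `dev` is all it takes to point the same query at a different branch, which is what the later cells rely on.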

---

## Create Iceberg tables in the lakeFS catalog `main` branch and bronze namespace

In [None]:
icebergNamespace = 'bronze'

In [None]:
# create authors table
%sql CREATE TABLE IF NOT EXISTS {myCatalog}.`{repo_name}`.{mainBranch}.{icebergNamespace}.authors(id int, name string) USING iceberg

In [None]:
# create books table
%sql CREATE TABLE IF NOT EXISTS {myCatalog}.`{repo_name}`.{mainBranch}.{icebergNamespace}.books(id int, title string, author_id int) USING iceberg

In [None]:
# create book_sales table
%sql CREATE TABLE IF NOT EXISTS {myCatalog}.`{repo_name}`.{mainBranch}.{icebergNamespace}.book_sales(id int, sale_date date, book_id int, price double) USING iceberg;

### List tables in the main branch

In [None]:
%sql SHOW TABLES IN {myCatalog}.`{repo_name}`.{mainBranch}.{icebergNamespace}

### Insert data into tables

In [None]:
# Insert data into the authors table
%sql INSERT INTO {myCatalog}.`{repo_name}`.{mainBranch}.{icebergNamespace}.authors (id, name) \
VALUES (1, "J.R.R. Tolkien"), (2, "George R.R. Martin"), \
       (3, "Agatha Christie"), (4, "Isaac Asimov"), (5, "Stephen King");

In [None]:
# Insert data into the books table
%sql INSERT INTO {myCatalog}.`{repo_name}`.{mainBranch}.{icebergNamespace}.books (id, title, author_id) \
VALUES (1, "The Lord of the Rings", 1), (2, "The Hobbit", 1), \
       (3, "A Song of Ice and Fire", 2), (4, "A Clash of Kings", 2), \
       (5, "And Then There Were None", 3), (6, "Murder on the Orient Express", 3), \
       (7, "Foundation", 4), (8, "I, Robot", 4), \
       (9, "The Shining", 5), (10, "It", 5);

In [None]:
# Insert data into the book_sales table
%sql INSERT INTO {myCatalog}.`{repo_name}`.{mainBranch}.{icebergNamespace}.book_sales (id, sale_date, book_id, price) \
VALUES (1, DATE '2024-04-12', 1, 25.50), \
       (2, DATE '2024-04-11', 2, 17.99), \
       (3, DATE '2024-04-10', 3, 12.95), \
       (4, DATE '2024-04-13', 4, 32.00), \
       (5, DATE '2024-04-12', 5, 29.99), \
       (6, DATE '2024-03-15', 1, 23.99), \
       (7, DATE '2024-02-22', 2, 19.50), \
       (8, DATE '2024-01-10', 3, 14.95), \
       (9, DATE '2023-12-05', 4, 28.00), \
       (10, DATE '2023-11-18', 5, 27.99), \
       (11, DATE '2023-10-26', 2, 18.99), \
       (12, DATE '2023-10-12', 1, 22.50), \
       (13, DATE '2024-04-09', 3, 11.95), \
       (14, DATE '2024-03-28', 4, 35.00), \
       (15, DATE '2024-04-05', 5, 31.99), \
       (16, DATE '2024-03-01', 1, 27.50), \
       (17, DATE '2024-02-14', 2, 21.99), \
       (18, DATE '2024-01-07', 3, 13.95), \
       (19, DATE '2023-12-20', 4, 29.00), \
       (20, DATE '2023-11-03', 5, 28.99); 

# Main demo starts here 🚦 👇🏻

## Read my production data from my main branch

In [None]:
%sql SELECT * FROM {myCatalog}.`{repo_name}`.{mainBranch}.{icebergNamespace}.authors LIMIT 5;

In [None]:
%sql SELECT * FROM {myCatalog}.`{repo_name}`.{mainBranch}.{icebergNamespace}.books LIMIT 5;

In [None]:
%sql SELECT * FROM {myCatalog}.`{repo_name}`.{mainBranch}.{icebergNamespace}.book_sales LIMIT 5;

## Transform the data - Create a development sandbox

In [None]:
branchDev = repo.branch(devBranch).create(source_reference=mainBranch, exist_ok=True)
print(f"{devBranch} ref:", branchDev.get_commit().id)

## Read data from my development sandbox

In [None]:
%sql SELECT * FROM {myCatalog}.`{repo_name}`.{devBranch}.{icebergNamespace}.book_sales LIMIT 5;

## Running transformation pipeline in isolation

### Remove Cancelled Sales in the silver layer

In [None]:
# create the silver book_sales table on the dev branch
%sql CREATE TABLE IF NOT EXISTS {myCatalog}.`{repo_name}`.{devBranch}.silver.book_sales(id int, sale_date date, book_id int, price double) USING iceberg;

In [None]:
%sql INSERT INTO {myCatalog}.`{repo_name}`.{devBranch}.silver.book_sales \
     SELECT * FROM {myCatalog}.`{repo_name}`.{devBranch}.bronze.book_sales \
     WHERE id NOT IN (10, 15, 2, 1, 6);

In [None]:
%sql SELECT * FROM {myCatalog}.`{repo_name}`.{devBranch}.silver.book_sales
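
The `NOT IN` filter above removes five of the twenty bronze rows, so the silver table should hold fifteen. The same logic in plain Python serves as a sanity check that does not need Spark. The `(id, book_id)` pairs mirror the sample rows inserted earlier; the variable names are illustrative only.

```python
# (id, book_id) pairs taken from the bronze book_sales sample rows
bronze = [
    (1, 1), (2, 2), (3, 3), (4, 4), (5, 5),
    (6, 1), (7, 2), (8, 3), (9, 4), (10, 5),
    (11, 2), (12, 1), (13, 3), (14, 4), (15, 5),
    (16, 1), (17, 2), (18, 3), (19, 4), (20, 5),
]

# Ids excluded by the WHERE clause in the silver INSERT
cancelled_ids = {10, 15, 2, 1, 6}

# Keep only rows whose id is not in the cancelled set
silver = [row for row in bronze if row[0] not in cancelled_ids]
print(len(bronze), len(silver))  # 20 15
```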

## Attach data classification, source and target in the metadata

In [None]:
dataClassification = 'transformed'
source = lakefsUIEndPoint + '/repositories/' + repo_name + '/objects?ref=' + devBranch + '&path=_lakefs_tables/iceberg/namespaces/bronze/tables/book_sales'
target = lakefsUIEndPoint + '/repositories/' + repo_name + '/objects?ref=' + devBranch + '&path=_lakefs_tables/iceberg/namespaces/silver/tables/book_sales'
kwargs={'allow_empty': True}

ref = branchDev.commit(
    message='Added transformed data in ' + repo_name + ' repository!',
    metadata={'using': 'Iceberg REST Catalog',
              'data classification': dataClassification,
              '::lakefs::source::url[url:ui]': source,
              '::lakefs::target::url[url:ui]': target},
    **kwargs)
print_commit(ref.get_commit())
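
The `source` and `target` links above differ only in the namespace, and the same pattern is repeated for the gold layer. A small function can build them, as sketched below. `table_ui_url` is a hypothetical helper; the URL layout is copied from the cell above, and the argument values are this notebook's defaults.

```python
def table_ui_url(ui_endpoint, repo, ref, namespace, table):
    """Link to an Iceberg table's objects in the lakeFS UI, using the layout from the cell above."""
    path = f"_lakefs_tables/iceberg/namespaces/{namespace}/tables/{table}"
    return f"{ui_endpoint}/repositories/{repo}/objects?ref={ref}&path={path}"

ui = "http://localhost:8084"
repo_name = "lakefs-spark-medallion-iceberg"
source = table_ui_url(ui, repo_name, "dev", "bronze", "book_sales")
target = table_ui_url(ui, repo_name, "dev", "silver", "book_sales")
print(source)
print(target)
```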

### Partition table in the gold layer

In [None]:
# create the gold book_sales table, partitioned by book_id
%sql CREATE TABLE IF NOT EXISTS {myCatalog}.`{repo_name}`.{devBranch}.gold.book_sales(id int, sale_date date, book_id int, price double) USING iceberg \
     PARTITIONED BY (book_id)

In [None]:
%sql INSERT INTO {myCatalog}.`{repo_name}`.{devBranch}.gold.book_sales \
     SELECT * FROM {myCatalog}.`{repo_name}`.{devBranch}.silver.book_sales

In [None]:
%sql SELECT * FROM {myCatalog}.`{repo_name}`.{devBranch}.gold.book_sales
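
Partitioning by `book_id` groups rows for the same book into the same partition. The plain-Python sketch below applies that grouping to the silver rows (the bronze sample minus the five removed ids) to show how many rows each partition would hold. It does not touch Iceberg, and the names are illustrative.

```python
from collections import Counter

# sale id -> book_id, taken from the bronze book_sales sample rows
book_id_by_sale = {
    1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 1, 7: 2, 8: 3, 9: 4, 10: 5,
    11: 2, 12: 1, 13: 3, 14: 4, 15: 5, 16: 1, 17: 2, 18: 3, 19: 4, 20: 5,
}
removed_ids = {10, 15, 2, 1, 6}  # ids filtered out when building the silver table

# Count the remaining rows per book_id, i.e. per partition value
rows_per_partition = Counter(
    book_id for sale_id, book_id in book_id_by_sale.items() if sale_id not in removed_ids
)
print(sorted(rows_per_partition.items()))
# [(1, 2), (2, 3), (3, 4), (4, 4), (5, 2)]
```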

## Attach data classification, source and target in the metadata

In [None]:
dataClassification = 'partitioned'
source = lakefsUIEndPoint + '/repositories/' + repo_name + '/objects?ref=' + devBranch + '&path=_lakefs_tables/iceberg/namespaces/silver/tables/book_sales'
target = lakefsUIEndPoint + '/repositories/' + repo_name + '/objects?ref=' + devBranch + '&path=_lakefs_tables/iceberg/namespaces/gold/tables/book_sales'
kwargs={'allow_empty': True}

ref = branchDev.commit(
    message='Partitioned data in ' + repo_name + ' repository!',
    metadata={'using': 'Iceberg REST Catalog',
              'data classification': dataClassification,
              '::lakefs::source::url[url:ui]': source,
              '::lakefs::target::url[url:ui]': target},
    **kwargs)
print_commit(ref.get_commit())

### Merge Changes

In [None]:
res = branchDev.merge_into(branchMain)
print(res)

### After merging the dev branch into main, you can atomically roll back all changes

In [None]:
branchMain.revert(parent_number=1, reference=mainBranch)

## More Questions?

###### Join the lakeFS Slack group - https://lakefs.io/slack