<img src="./images/logo.svg" alt="lakeFS logo" width=300/> <img src="https://www.apache.org/logos/res/iceberg/iceberg.png" alt="Apache Iceberg logo" width=300/>  

## lakeFS ❤️ Apache Iceberg - an integration example using the Trino client

* [📚 lakeFS Apache Iceberg Integration Docs](https://docs.lakefs.io/integrations/iceberg.html)
* [Getting started with Trino's Iceberg connector](https://trino.io/docs/current/connector/iceberg.html)

## Prerequisites

* ###### Review the [README](https://github.com/treeverse/lakeFS-samples/blob/main/01_standalone_examples/trino/README.md) if you haven't provisioned the Trino container yet.

* ###### This notebook requires a connection to lakeFS Cloud or lakeFS Enterprise.
    ###### Register for lakeFS Cloud: https://lakefs.cloud/register or contact us for a lakeFS Enterprise key: https://lakefs.io/contact-sales/

## Config

**_If you're not using the provided lakeFS server and MinIO storage, change these values to match your environment._**

### lakeFS endpoint and credentials

In [1]:
lakefsEndPoint = 'http://lakefs:8000' # e.g. 'https://username.aws_region_name.lakefscloud.io' 
lakefsAccessKey = 'AKIAIOSFOLKFSSAMPLES'
lakefsSecretKey = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'

### Object Storage

In [2]:
storageNamespace = 's3://example' # e.g. "s3://bucket"

---

## Setup

**(you shouldn't need to change anything in this section, just run it)**

In [3]:
repo_name = "lakefs-trino-iceberg"

### Versioning Information

In [4]:
mainBranch = "main"
devBranch = "dev"
icebergNamespace = "lakefs_demo"
myCatalog = "lakefs"

### Install and import libraries

In [None]:
!pip install trino==0.334.0

In [5]:
import os
import lakefs
import trino

### Set environment variables

In [6]:
os.environ["LAKECTL_SERVER_ENDPOINT_URL"] = lakefsEndPoint
os.environ["LAKECTL_CREDENTIALS_ACCESS_KEY_ID"] = lakefsAccessKey
os.environ["LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY"] = lakefsSecretKey

#### Verify lakeFS credentials by getting lakeFS version

In [7]:
print("Verifying lakeFS credentials…")
try:
    v = lakefs.client.Client().version
except Exception:
    print("🛑 failed to get lakeFS version")
else:
    print(f"…✅lakeFS credentials verified\n\nℹ️lakeFS version {v}")

Verifying lakeFS credentials…
🛑 failed to get lakeFS version
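
If the check above reports a failure, one thing worth ruling out first is a missing or empty setting. The sketch below only inspects the three `LAKECTL_*` variables set earlier. The helper name `missing_lakectl_settings` is hypothetical and not part of the lakeFS SDK, and it does not test whether the endpoint is reachable or the keys are valid.

```python
import os

# The three variables set in the "Set environment variables" cell above.
REQUIRED_VARS = (
    "LAKECTL_SERVER_ENDPOINT_URL",
    "LAKECTL_CREDENTIALS_ACCESS_KEY_ID",
    "LAKECTL_CREDENTIALS_SECRET_ACCESS_KEY",
)

def missing_lakectl_settings(env=os.environ):
    # Hypothetical helper: return the names of required variables
    # that are unset or blank in the given mapping.
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]

missing = missing_lakectl_settings()
if missing:
    print("Missing settings:", ", ".join(missing))
else:
    print("All lakeFS settings are present; check the endpoint URL and network next.")
```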


### Define lakeFS Repository

In [None]:
repo = lakefs.Repository(repo_name).create(storage_namespace=f"{storageNamespace}/{repo_name}", default_branch=mainBranch, exist_ok=True)
branchMain = repo.branch(mainBranch)
print(repo)

---

### Define Trino Cursor

**_If you're not running Trino in a separate Docker container as part of the lakeFS Samples, change the host and port to match your Trino environment._**

In [None]:
def trino_cursor(catalog, schema):
    # Connect to Trino with the given catalog and schema,
    # and return a cursor for executing statements.
    conn = trino.dbapi.connect(
        host='host.docker.internal',
        port=8080,
        user='lakefs_user',
        catalog=catalog,
        schema=schema,
    )
    return conn.cursor()
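
When Trino runs somewhere else, one option is to read the host and port from environment variables instead of editing the function. This is only a sketch: the variable names `TRINO_HOST` and `TRINO_PORT` and the helper `trino_settings` are assumptions made for illustration, not something the lakeFS samples or the Trino client define. The defaults mirror the values used above.

```python
import os

def trino_settings(env=os.environ):
    # Hypothetical helper: TRINO_HOST and TRINO_PORT are illustrative
    # names, falling back to the values used in this notebook.
    return {
        "host": env.get("TRINO_HOST", "host.docker.internal"),
        "port": int(env.get("TRINO_PORT", "8080")),
        "user": "lakefs_user",
    }

print(trino_settings({"TRINO_HOST": "trino.example.internal", "TRINO_PORT": "8443"}))
```

The resulting dictionary could then be unpacked into the connection call, for example `trino.dbapi.connect(**trino_settings(), catalog=catalog, schema=schema)`.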

In [None]:
cursor = trino_cursor(myCatalog, icebergNamespace)
cursor.execute("SHOW CATALOGS").fetchall()

### Create Iceberg namespace

In [None]:
lakefs_demo_ns = f"{repo_name}.{mainBranch}.{icebergNamespace}"
cursor.execute(f'CREATE SCHEMA IF NOT EXISTS "{lakefs_demo_ns}"')
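
Every statement in the rest of this notebook addresses tables through a single quoted schema of the form `"<repository>.<branch>.<namespace>"`, so switching branches only changes the middle component. A small helper can make that pattern explicit. `table_ref` is a hypothetical name introduced here for illustration; it is not part of lakeFS or Trino.

```python
def table_ref(repo, branch, namespace, table):
    # Hypothetical helper: build '"repo.branch.namespace".table', the
    # identifier form used by the queries in this notebook.
    schema = f"{repo}.{branch}.{namespace}"
    if '"' in schema:
        raise ValueError("schema components must not contain double quotes")
    return f'"{schema}".{table}'

# Same table, addressed on two different branches.
print(table_ref("lakefs-trino-iceberg", "main", "lakefs_demo", "authors"))
print(table_ref("lakefs-trino-iceberg", "dev", "lakefs_demo", "authors"))
```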

### List namespaces

In [None]:
cursor.execute(f"SHOW SCHEMAS FROM {myCatalog}").fetchall()

---

## Create Iceberg tables in the lakeFS catalog `main` branch

In [None]:
# create authors table
cursor.execute(f'CREATE TABLE IF NOT EXISTS "{repo_name}.{mainBranch}.{icebergNamespace}".authors (id INTEGER, name VARCHAR)')

In [None]:
# create books table
cursor.execute(f'CREATE TABLE IF NOT EXISTS "{repo_name}.{mainBranch}.{icebergNamespace}".books (id INTEGER, title VARCHAR, author_id INTEGER)')

In [None]:
# create book_sales table
cursor.execute(f'CREATE TABLE IF NOT EXISTS "{repo_name}.{mainBranch}.{icebergNamespace}".book_sales (id INTEGER, sale_date DATE, book_id INTEGER, price DOUBLE)')

### List tables in the main branch

In [None]:
cursor.execute(f'SHOW TABLES FROM "{repo_name}.{mainBranch}.{icebergNamespace}"').fetchall()

### Insert data into tables

In [None]:
# Insert data into the authors table
cursor.execute(f"INSERT INTO \"{repo_name}.{mainBranch}.{icebergNamespace}\".authors (id, name) \
VALUES (1, 'J.R.R. Tolkien'), (2, 'George R.R. Martin'), \
       (3, 'Agatha Christie'), (4, 'Isaac Asimov'), (5, 'Stephen King')")

In [None]:
# Insert data into the books table
cursor.execute(f"INSERT INTO \"{repo_name}.{mainBranch}.{icebergNamespace}\".books (id, title, author_id) \
VALUES (1, 'The Lord of the Rings', 1), (2, 'The Hobbit', 1), \
       (3, 'A Song of Ice and Fire', 2), (4, 'A Clash of Kings', 2), \
       (5, 'And Then There Were None', 3), (6, 'Murder on the Orient Express', 3), \
       (7, 'Foundation', 4), (8, 'I, Robot', 4), \
       (9, 'The Shining', 5), (10, 'It', 5)")

In [None]:
# Insert data into the book_sales table
cursor.execute(f"INSERT INTO \"{repo_name}.{mainBranch}.{icebergNamespace}\".book_sales (id, sale_date, book_id, price) \
VALUES (1, DATE '2024-04-12', 1, 25.50), \
       (2, DATE '2024-04-11', 2, 17.99), \
       (3, DATE '2024-04-10', 3, 12.95), \
       (4, DATE '2024-04-13', 4, 32.00), \
       (5, DATE '2024-04-12', 5, 29.99), \
       (6, DATE '2024-03-15', 1, 23.99), \
       (7, DATE '2024-02-22', 2, 19.50), \
       (8, DATE '2024-01-10', 3, 14.95), \
       (9, DATE '2023-12-05', 4, 28.00), \
       (10, DATE '2023-11-18', 5, 27.99), \
       (11, DATE '2023-10-26', 2, 18.99), \
       (12, DATE '2023-10-12', 1, 22.50), \
       (13, DATE '2024-04-09', 3, 11.95), \
       (14, DATE '2024-03-28', 4, 35.00), \
       (15, DATE '2024-04-05', 5, 31.99), \
       (16, DATE '2024-03-01', 1, 27.50), \
       (17, DATE '2024-02-14', 2, 21.99), \
       (18, DATE '2024-01-07', 3, 13.95), \
       (19, DATE '2023-12-20', 4, 29.00), \
       (20, DATE '2023-11-03', 5, 28.99)")

# Main demo starts here 🚦 👇🏻

## Read my production data from my main branch

In [None]:
cursor.execute(f'SELECT * FROM "{repo_name}.{mainBranch}.{icebergNamespace}".authors').fetchall()

In [None]:
cursor.execute(f'SELECT * FROM "{repo_name}.{mainBranch}.{icebergNamespace}".books').fetchall()

In [None]:
cursor.execute(f'SELECT * FROM "{repo_name}.{mainBranch}.{icebergNamespace}".book_sales').fetchall()

## Mess with the data - Create a development sandbox

In [None]:
branchDev = repo.branch(devBranch).create(source_reference=mainBranch, exist_ok=True)
print(f"{devBranch} ref:", branchDev.get_commit().id)

## Read data from my development sandbox

In [None]:
cursor.execute(f'SELECT * FROM "{repo_name}.{devBranch}.{icebergNamespace}".book_sales').fetchall()

In [None]:
cursor.execute(f"SELECT 'Prod', SUM(price) AS total_sales \
     FROM \"{repo_name}.{mainBranch}.{icebergNamespace}\".book_sales \
     UNION ALL \
     SELECT 'Dev', SUM(price) AS total_sales \
     FROM \"{repo_name}.{devBranch}.{icebergNamespace}\".book_sales").fetchall()

## Running pipelines in isolation

### Remove Cancelled Sales

In [None]:
cursor.execute(f"DELETE FROM \"{repo_name}.{devBranch}.{icebergNamespace}\".book_sales \
     WHERE id IN (10, 15, 2, 1, 6)")

### Who are my top selling authors?

In [None]:
cursor.execute(f"SELECT \
        au.name AS author_name, \
        ROUND(SUM(s.price), 2) AS total_sales \
     FROM \"{repo_name}.{devBranch}.{icebergNamespace}\".books b \
     LEFT JOIN \"{repo_name}.{devBranch}.{icebergNamespace}\".authors au ON b.author_id = au.id \
     LEFT JOIN \"{repo_name}.{devBranch}.{icebergNamespace}\".book_sales s ON b.id = s.book_id \
     GROUP BY au.name \
     ORDER BY total_sales DESC \
     LIMIT 3").fetchall()

### Compare dev and main

In [None]:
cursor.execute(f"SELECT \
        au.name AS author_name, \
        ROUND(SUM(s.price), 2) AS total_sales \
     FROM \"{repo_name}.{mainBranch}.{icebergNamespace}\".books b \
     LEFT JOIN \"{repo_name}.{mainBranch}.{icebergNamespace}\".authors au ON b.author_id = au.id \
     LEFT JOIN \"{repo_name}.{mainBranch}.{icebergNamespace}\".book_sales s ON b.id = s.book_id \
     GROUP BY au.name \
     ORDER BY total_sales DESC \
     LIMIT 3").fetchall()
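
As a sanity check on the two result sets, the same aggregation can be reproduced in plain Python from the rows inserted earlier, without touching Trino. This is a sketch under a few assumptions: the helper `top_authors` is hypothetical, the dev branch differs from main only by the five deleted sales, and authors with no sales (whose `SUM` is `NULL` in the SQL) are left out on the assumption that they do not rank in the top three.

```python
from collections import defaultdict

# book_id -> author name, derived from the books and authors inserts above
# (only books that appear in book_sales are listed).
BOOK_AUTHOR = {1: "J.R.R. Tolkien", 2: "J.R.R. Tolkien",
               3: "George R.R. Martin", 4: "George R.R. Martin",
               5: "Agatha Christie"}

# sale id -> (book_id, price), copied from the book_sales insert above.
SALES = {
    1: (1, 25.50), 2: (2, 17.99), 3: (3, 12.95), 4: (4, 32.00),
    5: (5, 29.99), 6: (1, 23.99), 7: (2, 19.50), 8: (3, 14.95),
    9: (4, 28.00), 10: (5, 27.99), 11: (2, 18.99), 12: (1, 22.50),
    13: (3, 11.95), 14: (4, 35.00), 15: (5, 31.99), 16: (1, 27.50),
    17: (2, 21.99), 18: (3, 13.95), 19: (4, 29.00), 20: (5, 28.99),
}
CANCELLED = {10, 15, 2, 1, 6}  # ids removed on the dev branch

def top_authors(sales, limit=3):
    # Hypothetical helper: sum prices per author, round to 2 decimals,
    # and return the highest totals first.
    totals = defaultdict(float)
    for book_id, price in sales.values():
        totals[BOOK_AUTHOR[book_id]] += price
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(name, round(total, 2)) for name, total in ranked[:limit]]

main_sales = SALES
dev_sales = {k: v for k, v in SALES.items() if k not in CANCELLED}

print("main:", top_authors(main_sales))
print("dev: ", top_authors(dev_sales))
```

Under these assumptions, removing the cancelled sales moves George R.R. Martin ahead of J.R.R. Tolkien on dev, while main keeps the original order until the merge.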

### Merge Changes

In [None]:
res = branchDev.merge_into(branchMain)
print(res)

## More Questions?

###### Join the lakeFS Slack group - https://lakefs.io/slack