# DataHub + Iceberg + Spark Lineage Demo

This notebook demonstrates the full data platform integration:
- **Apache Spark** - Distributed data processing
- **Apache Iceberg** - Table format with time travel
- **Apache Polaris** - Iceberg REST Catalog
- **MinIO** - S3-compatible storage
- **DataHub** - Data catalog with lineage tracking

## What we'll do:
1. Create source tables in Iceberg
2. Transform the data (this generates lineage)
3. Verify the tables in the Polaris catalog
4. Run the Polaris ingestion to sync metadata, then view the lineage in DataHub


## 1. Setup Spark Session

This cell uses the pre-configured `connector` module, which enables OpenLineage so that DataHub can track lineage.
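
The `connector` module itself is not shown in this notebook. As a rough sketch of the kind of settings such a helper might apply, the function below builds a plain dictionary of Spark properties for an Iceberg REST catalog plus the OpenLineage listener. The function name `build_spark_conf`, the URLs, the warehouse name, and the lineage namespace are hypothetical placeholders, not values taken from this project; credentials and MinIO/S3 settings are left out.

```python
def build_spark_conf(
    catalog_name="polaris",
    catalog_uri="http://polaris:8181/api/catalog",  # hypothetical URL
    warehouse="demo_warehouse",                     # hypothetical warehouse name
    lineage_url="http://datahub-gms:8080",          # hypothetical URL
    lineage_namespace="spark-demo",                 # hypothetical namespace
):
    """Return Spark properties for an Iceberg REST catalog with OpenLineage."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        # Iceberg SQL extensions and a REST-backed catalog
        "spark.sql.extensions": (
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
        ),
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": catalog_uri,
        f"{prefix}.warehouse": warehouse,
        # OpenLineage listener that emits lineage events over HTTP
        "spark.extraListeners": "io.openlineage.spark.agent.OpenLineageSparkListener",
        "spark.openlineage.transport.type": "http",
        "spark.openlineage.transport.url": lineage_url,
        "spark.openlineage.namespace": lineage_namespace,
    }


conf = build_spark_conf()
for key in sorted(conf):
    print(f"{key} = {conf[key]}")
```

Each pair could be passed to `SparkSession.builder.config(key, value)`; the real helper may set more or different properties.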


In [None]:
from connector import create_spark_session

# Create Spark session with Polaris + OpenLineage config
spark = create_spark_session("datahub-lineage-demo")
print(f"Spark version: {spark.version}")
print(f"Available catalogs: {spark.catalog.listCatalogs()}")


## 2. Create Source Data

We'll create two source tables that will be joined to produce a derived table.


In [None]:
# Switch to the Polaris catalog
spark.sql("USE polaris")

# Create a demo namespace (database)
spark.sql("CREATE NAMESPACE IF NOT EXISTS demo")
spark.sql("USE demo")

print("Created namespace: polaris.demo")


In [None]:
# Create sample customers table
customers_data = [
    (1, "Alice", "alice@example.com", "Copenhagen"),
    (2, "Bob", "bob@example.com", "Aarhus"),
    (3, "Charlie", "charlie@example.com", "Odense"),
    (4, "Diana", "diana@example.com", "Aalborg"),
    (5, "Erik", "erik@example.com", "Copenhagen"),
]

customers_df = spark.createDataFrame(
    customers_data,
    ["customer_id", "name", "email", "city"]
)

# Write as Iceberg table
customers_df.writeTo("polaris.demo.customers").createOrReplace()
print("Created table: polaris.demo.customers")
customers_df.show()


In [None]:
# Create sample orders table
from datetime import date

orders_data = [
    (101, 1, date(2024, 1, 15), 150.00, "completed"),
    (102, 2, date(2024, 1, 16), 250.00, "completed"),
    (103, 1, date(2024, 1, 17), 75.00, "completed"),
    (104, 3, date(2024, 1, 18), 300.00, "pending"),
    (105, 4, date(2024, 1, 19), 125.00, "completed"),
    (106, 5, date(2024, 1, 20), 450.00, "completed"),
    (107, 1, date(2024, 1, 21), 200.00, "cancelled"),
    (108, 2, date(2024, 1, 22), 180.00, "completed"),
]

orders_df = spark.createDataFrame(
    orders_data,
    ["order_id", "customer_id", "order_date", "amount", "status"]
)

# Write as Iceberg table
orders_df.writeTo("polaris.demo.orders").createOrReplace()
print("Created table: polaris.demo.orders")
orders_df.show()


## 3. Transform Data (Generates Lineage)

Now we'll create a derived table by joining customers and orders.
This transformation generates lineage that DataHub will track.
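
To make the expected result concrete, the same aggregation can be worked through in plain Python on the sample rows, with no Spark involved. This is only a cross-check of the logic, not part of the pipeline; the dictionary layout is an illustrative choice.

```python
from collections import defaultdict
from datetime import date

customers = {1: "Alice", 2: "Bob", 3: "Charlie", 4: "Diana", 5: "Erik"}
orders = [
    (101, 1, date(2024, 1, 15), 150.00, "completed"),
    (102, 2, date(2024, 1, 16), 250.00, "completed"),
    (103, 1, date(2024, 1, 17), 75.00, "completed"),
    (104, 3, date(2024, 1, 18), 300.00, "pending"),
    (105, 4, date(2024, 1, 19), 125.00, "completed"),
    (106, 5, date(2024, 1, 20), 450.00, "completed"),
    (107, 1, date(2024, 1, 21), 200.00, "cancelled"),
    (108, 2, date(2024, 1, 22), 180.00, "completed"),
]

# Keep only completed orders, grouped by customer
completed = defaultdict(list)
for _order_id, customer_id, order_date, amount, status in orders:
    if status == "completed":
        completed[customer_id].append((order_date, amount))

# Aggregate per customer; customers without completed orders never appear,
# which mirrors the inner join
summary = {}
for customer_id, rows in completed.items():
    amounts = [amount for _, amount in rows]
    summary[customer_id] = {
        "name": customers[customer_id],
        "total_orders": len(rows),
        "total_spent": sum(amounts),
        "avg_order_value": sum(amounts) / len(amounts),
        "last_order_date": max(order_date for order_date, _ in rows),
    }

for customer_id in sorted(summary):
    print(customer_id, summary[customer_id])
```

Charlie (customer 3) has only a pending order, so `customer_summary` should contain four rows rather than five.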


In [None]:
from pyspark.sql import functions as F

# Read source tables
customers = spark.table("polaris.demo.customers")
orders = spark.table("polaris.demo.orders")

# Customer order summary: aggregate completed orders per customer,
# then inner-join the customer details

customer_summary = (
    orders
    .filter(F.col("status") == "completed")
    .groupBy("customer_id")
    .agg(
        F.count("order_id").alias("total_orders"),
        F.sum("amount").alias("total_spent"),
        F.avg("amount").alias("avg_order_value"),
        F.max("order_date").alias("last_order_date")
    )
    .join(customers, "customer_id")
    .select(
        "customer_id",
        "name",
        "email",
        "city",
        "total_orders",
        "total_spent",
        "avg_order_value",
        "last_order_date"
    )
)

# Write derived table - THIS GENERATES LINEAGE!
customer_summary.writeTo("polaris.demo.customer_summary").createOrReplace()
print("Created derived table: polaris.demo.customer_summary")
print("\nLineage: customers + orders -> customer_summary")
customer_summary.show()


In [None]:
# Create another derived table: City-level analytics
city_analytics = (
    spark.table("polaris.demo.customer_summary")
    .groupBy("city")
    .agg(
        F.count("customer_id").alias("customer_count"),
        F.sum("total_spent").alias("city_revenue"),
        F.avg("avg_order_value").alias("city_avg_order")
    )
    .orderBy(F.desc("city_revenue"))
)

# Write - generates more lineage
city_analytics.writeTo("polaris.demo.city_analytics").createOrReplace()
print("Created derived table: polaris.demo.city_analytics")
print("\nLineage: customer_summary -> city_analytics")
city_analytics.show()


## 4. Verify Tables in Iceberg

Let's verify all tables exist in the Polaris catalog.
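
Because Iceberg records a snapshot for each write, an earlier version of a table can be read back. The helper below is a sketch, assuming Spark 3.3+ with the Iceberg runtime (which accepts `VERSION AS OF`); `read_oldest_snapshot` is a hypothetical name, and it expects the live `spark` session from the setup cell.

```python
def read_oldest_snapshot(spark, table):
    """Return a DataFrame of `table` as of its oldest recorded snapshot."""
    # The `history` metadata table lists snapshots and when each became current
    history = spark.sql(
        f"SELECT snapshot_id FROM {table}.history ORDER BY made_current_at"
    ).collect()
    if not history:
        raise ValueError(f"No snapshots recorded for {table}")
    oldest_id = history[0]["snapshot_id"]
    # Time travel: query the table as of that snapshot
    return spark.sql(f"SELECT * FROM {table} VERSION AS OF {oldest_id}")


# Usage in this notebook (requires the running Spark session):
# read_oldest_snapshot(spark, "polaris.demo.customers").show()
```

On a freshly created table the oldest snapshot is also the current one, so the output matches a normal read until the table is rewritten.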


In [None]:
# List all tables in the demo namespace
print("Tables in polaris.demo:")
spark.sql("SHOW TABLES IN polaris.demo").show()

# Show table details
print("\nCustomers table schema:")
spark.sql("DESCRIBE polaris.demo.customers").show()

print("\nIceberg table history (time travel):")
spark.sql("SELECT * FROM polaris.demo.customers.history").show()


## 5. View in DataHub

Now open DataHub to see the tables and lineage:

1. **Access DataHub UI:**
   ```bash
   kubectl port-forward svc/datahub-datahub-frontend -n datahub 9002:9002
   ```
   Open: http://localhost:9002

2. **Run Polaris ingestion to sync metadata:**
   ```bash
   kubectl apply -f k8s/datahub/ingestion-configmap.yaml
   kubectl delete job datahub-ingest-polaris -n datahub --ignore-not-found
   kubectl apply -f k8s/datahub/ingestion-polaris.yaml
   ```

3. **Explore in the UI:**
   - **Search** for "customers", "orders", or "customer_summary"
   - Open a dataset's **Lineage** tab to see the data flow
   - Open its **Schema** tab to see the columns

4. **Expected Lineage Graph:**
   ```
   customers ─────┐
                  ├──> customer_summary ──> city_analytics
   orders ────────┘
   ```
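
For a scripted check of the same graph, the expected edges can be written down as data, together with the dataset URNs one would search for. DataHub dataset URNs follow the pattern `urn:li:dataset:(urn:li:dataPlatform:<platform>,<name>,<env>)`; the `iceberg` platform, the `polaris.demo` prefix, and the `PROD` environment used here are assumptions that depend on how the ingestion recipe and the lineage emitter are configured. The helper names are hypothetical.

```python
# Expected direct upstreams, taken from the lineage graph above
EXPECTED_UPSTREAMS = {
    "customer_summary": ["customers", "orders"],
    "city_analytics": ["customer_summary"],
}


def dataset_urn(table, platform="iceberg", namespace="polaris.demo", env="PROD"):
    """Build a DataHub dataset URN (platform, namespace, env are assumptions)."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{namespace}.{table},{env})"


def all_upstreams(table, edges=EXPECTED_UPSTREAMS):
    """Collect direct and transitive upstream tables, sorted by name."""
    found = []
    stack = list(edges.get(table, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.append(current)
            stack.extend(edges.get(current, []))
    return sorted(found)


for table in EXPECTED_UPSTREAMS:
    print(dataset_urn(table), "<-", all_upstreams(table))
```

If the URNs shown in the DataHub UI differ, adjust the defaults to match rather than treating the mismatch as missing lineage.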


In [None]:
# Print summary
print("="*60)
print("DEMO COMPLETE!")
print("="*60)
print("\nCreated tables:")
print("  - polaris.demo.customers (source)")
print("  - polaris.demo.orders (source)")
print("  - polaris.demo.customer_summary (derived)")
print("  - polaris.demo.city_analytics (derived)")
print("\nLineage generated:")
print("  customers + orders -> customer_summary -> city_analytics")
print("\nView in DataHub:")
print("  kubectl port-forward svc/datahub-datahub-frontend -n datahub 9002:9002")
print("  http://localhost:9002")
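print("\nTo sync metadata to DataHub:")
print("  kubectl apply -f k8s/datahub/ingestion-configmap.yaml")
print("  kubectl delete job datahub-ingest-polaris -n datahub --ignore-not-found")
print("  kubectl apply -f k8s/datahub/ingestion-polaris.yaml")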


In [None]:
# Stop Spark session
spark.stop()
