# SparkConnector Demonstration (k3d/JupyterHub + Spark-on-Kubernetes)

This demo supports **two ways** to choose the environment (`sbx`, `dev`, `test`, or `prod`):

1) **JupyterHub profile selection (recommended best practice)**
   - The profile injects `DST_ENV`, `DST_BUCKET`, `POLARIS_WAREHOUSE`.

2) **Git branch mapping (optional legacy behavior)**
   - If a git repo is present in the notebook pod, this demo can map the current
     branch to an environment and set the same env vars *before* Spark starts.
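
To check what the selected profile injected, the three variables can be read back before anything else runs. The sketch below is illustrative only: `PROFILE_VARS`, `read_profile_env`, and `missing_profile_vars` are hypothetical helper names, not part of the `utilities` package.

```python
import os

# Variable names taken from the profile description above.
PROFILE_VARS = ("DST_ENV", "DST_BUCKET", "POLARIS_WAREHOUSE")


def read_profile_env(environ=None):
    """Return the profile variables as a dict (value is None if unset)."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(name) for name in PROFILE_VARS}


def missing_profile_vars(environ=None):
    """Return the names of profile variables that are unset or empty."""
    return [name for name, value in read_profile_env(environ).items() if not value]
```

Calling `missing_profile_vars()` in the first notebook cell gives an empty list when the profile has set all three variables.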

Branch → Environment mapping used here:
- `feature/*` (and any missing or unrecognized branch) → `sbx`
- `dev` / `develop` → `dev`
- `release/*` or `hotfix/*` → `test`
- `main` / `master` → `prod`

Notes:
- MinIO credentials are **user credentials** (`MINIO_USER` / `MINIO_PASSWORD`) and are
  provided at spawn time by the JupyterHub start form.
- This file is mounted into the notebook pod at `/opt/leru/getting_started`.
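
A quick way to confirm that the credentials reached the notebook without printing them is sketched below. It assumes the start form exposes them as environment variables named `MINIO_USER` and `MINIO_PASSWORD`; `credential_status` is a hypothetical helper, not part of the demo.

```python
import os


def credential_status(environ=None, names=("MINIO_USER", "MINIO_PASSWORD")):
    """Report whether each credential variable is set, without exposing its value.

    Assumption: the credentials are passed to the pod as environment variables.
    """
    environ = os.environ if environ is None else environ
    return {name: ("set" if environ.get(name) else "missing") for name in names}
```

Printing `credential_status()` shows only `set` or `missing` per variable, so the secret never ends up in saved notebook output.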

## 1) Verify LER-U mount

In [2]:
import os
import utilities

print("utilities.__file__ =", utilities.__file__)
print("/opt/leru exists    =", os.path.exists("/opt/leru"))

utilities.__file__ = /opt/leru/utilities/__init__.py
/opt/leru exists    = True


## 2) Optional: derive sbx/dev/test/prod from git branch

In [3]:
import subprocess
from pathlib import Path


def _git_branch(repo_dir: str | None = None) -> str | None:
    """Return current git branch name, or None if not a git repo."""
    cmd = ["git"]
    if repo_dir:
        cmd += ["-C", repo_dir]
    cmd += ["rev-parse", "--abbrev-ref", "HEAD"]
    try:
        out = subprocess.check_output(cmd, stderr=subprocess.DEVNULL).decode().strip()
        return None if out in ("", "HEAD") else out
    except Exception:
        return None


def _branch_to_env(branch: str | None) -> str:
    if not branch:
        return "sbx"
    if branch.startswith("feature/"):
        return "sbx"
    if branch in ("dev", "develop"):
        return "dev"
    if branch.startswith(("release/", "hotfix/")):
        return "test"
    if branch in ("main", "master"):
        return "prod"
    return "sbx"


# Choose where to look for git:
# - If you have this repo cloned in the pod, prefer it
# - Otherwise use current working directory
repo_candidate = "/home/jovyan/spark-k8-hub"
repo_dir = repo_candidate if Path(repo_candidate, ".git").exists() else None

branch = _git_branch(repo_dir)
env_from_git = _branch_to_env(branch)

print("git_repo_dir =", repo_dir or "(none)")
print("git_branch   =", branch)
print("env_from_git =", env_from_git)

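# Toggle: set to True if you want git to override the JupyterHub profile env vars.
# Note: with no repo or branch detected, env_from_git falls back to "sbx" and still overrides them.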
USE_GIT_FOR_ENV = True

if USE_GIT_FOR_ENV:
    os.environ["DST_ENV"] = env_from_git
    os.environ["DST_BUCKET"] = f"s3a://{env_from_git}"
    os.environ["POLARIS_WAREHOUSE"] = env_from_git
    # optional for debugging
    if branch:
        os.environ["DST_GIT_BRANCH"] = branch

print("DST_ENV          =", os.environ.get("DST_ENV"))
print("DST_BUCKET       =", os.environ.get("DST_BUCKET"))
print("POLARIS_WAREHOUSE=", os.environ.get("POLARIS_WAREHOUSE"))

git_repo_dir = (none)
git_branch   = None
env_from_git = sbx
DST_ENV          = sbx
DST_BUCKET       = s3a://sbx
POLARIS_WAREHOUSE= sbx
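
In the run above no repository was found, yet the profile variables were still overwritten with `sbx` because the toggle was on. A possible variant that leaves the profile values alone unless a branch was actually detected is sketched here. `apply_git_env` is a hypothetical helper and `branch_to_env` repeats the mapping from the cell above so the example is self-contained.

```python
import os


def branch_to_env(branch):
    """Same mapping as the cell above; None or unknown branches map to 'sbx'."""
    if not branch:
        return "sbx"
    if branch in ("dev", "develop"):
        return "dev"
    if branch.startswith(("release/", "hotfix/")):
        return "test"
    if branch in ("main", "master"):
        return "prod"
    return "sbx"  # feature/* and anything unrecognized


def apply_git_env(branch, environ=None):
    """Hypothetical helper: override the profile variables only if a branch was detected.

    Returns True when the variables were overwritten, False when they were left alone.
    """
    environ = os.environ if environ is None else environ
    if not branch:
        return False  # no repo or detached HEAD: keep whatever the profile injected
    env = branch_to_env(branch)
    environ["DST_ENV"] = env
    environ["DST_BUCKET"] = f"s3a://{env}"
    environ["POLARIS_WAREHOUSE"] = env
    environ["DST_GIT_BRANCH"] = branch  # optional, for debugging
    return True
```

With this guard, a pod without a clone keeps the environment chosen in the JupyterHub profile, while a pod with a checked-out branch behaves as in the original cell.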


## 3) Create Spark session via SparkConnector

In [4]:
from utilities.spark_connector import SparkConnector

connector = SparkConnector(size="XS", force_new=True)
spark = connector.session

print("\n--- Connector env ---")
print("env_name      =", connector.env.env_name)
print("runtime       =", connector.env.runtime)
print("spark_master  =", connector.env.spark_master)
print("bucket        =", connector.env.bucket)
print("catalog_type  =", connector.env.catalog_type)


 CONFIGURING SPARK SESSION
  User:        root
  Branch:      unknown
  Environment: sbx
  Bucket:      s3a://sbx
  Size:        XS
  Runtime:     kubernetes


25/12/16 11:16:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/12/16 11:16:24 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
25/12/16 11:16:24 WARN Utils: The configured local directories are not expected to be URIs; however, got suspicious values [s3a://sbx/spark-tmp/]. Please check your configured local directories.
25/12/16 11:16:29 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties



 SPARK SESSION ACTIVE
  Environment:  sbx
  Branch:       unknown
  Bucket:       s3a://sbx
  Size:         XS


--- Connector env ---
env_name      = sbx
runtime       = kubernetes
spark_master  = k8s://https://kubernetes.default.svc
bucket        = s3a://sbx
catalog_type  = in-memory


## 4) Spark sanity check

In [5]:
print("count =", spark.range(1000).count())

count = 1000


## 5) Delta write/read to the selected bucket

In [6]:
path = f"{connector.env.bucket}/demo/connector_demonstrator_env_select/delta_table"
(
    spark.range(10)
    .withColumnRenamed("id", "n")
    .write.format("delta")
    .mode("overwrite")
    .save(path)
)

print("Wrote Delta to:", path)
print("Read back:")
spark.read.format("delta").load(path).show()

Wrote Delta to: s3a://sbx/demo/connector_demonstrator_env_select/delta_table
Read back:


25/12/16 11:16:38 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
+---+
|  n|
+---+
|  5|
|  6|
|  7|
|  8|
|  9|
|  0|
|  1|
|  2|
|  3|
|  4|
+---+



## 6) Polaris/Iceberg smoke test (uses POLARIS_WAREHOUSE / catalog)

In [7]:
try:
    # The Spark catalog is always registered as "polaris"; the warehouse (Polaris catalog)
    # name comes from the environment (POLARIS_WAREHOUSE).
    spark.sql("CREATE DATABASE IF NOT EXISTS polaris.demo").show()
    spark.sql("DROP TABLE IF EXISTS polaris.demo.users")
    spark.sql(
        """
        CREATE TABLE polaris.demo.users (
            id INT,
            name STRING
        )
        USING iceberg
        """
    )
    spark.sql("INSERT INTO polaris.demo.users VALUES (1, 'Alice'), (2, 'Bob')")
    spark.sql("SELECT * FROM polaris.demo.users").show()
    print("✅ Polaris/Iceberg smoke test OK")
except Exception as e:
    print("⚠️ Polaris/Iceberg smoke test skipped/failed:")
    print(e)



++
||
++
++



+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  2|  Bob|
+---+-----+

✅ Polaris/Iceberg smoke test OK


## 7) Cleanup

In [8]:
connector.stop()
print("Stopped Spark session")

Stopping Spark session...


25/12/16 11:16:56 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed.


Session stopped.
Stopped Spark session
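
If an earlier cell raises before the cleanup step runs, the session (and possibly its executor pods) may stay alive until the kernel is shut down. One way to guard against that is a small context manager that always calls `stop()`. This is a generic sketch; `stopping` is a hypothetical helper, and it assumes only that the wrapped object has a `stop()` method, as `connector` does above.

```python
from contextlib import contextmanager


@contextmanager
def stopping(resource):
    """Yield `resource` and call its stop() method on exit, even after an error."""
    try:
        yield resource
    finally:
        resource.stop()
```

In a script, the demo steps could then run inside `with stopping(SparkConnector(size="XS", force_new=True)) as connector:` so that the session is stopped whether or not the body succeeds.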
