# SparkConnector Demonstration (k3d/JupyterHub + Spark-on-Kubernetes)

This version is adapted for **this platform**:
- JupyterHub singleuser pod running inside Kubernetes
- Spark master is **Kubernetes** (`k8s://https://kubernetes.default.svc`)
- MinIO and Polaris are reached via **cluster DNS**
- LER-U utilities are mounted into the notebook pod at `/opt/leru`

What this demo shows:
- The `utilities` package is imported from `/opt/leru/utilities`
- `SparkConnector` creates a SparkSession configured for Spark-on-K8s
- A small Delta write/read against MinIO (S3A)
- Optional Polaris/Iceberg operations

Notes:
- We do **not** use git branch switching in this environment.
- Dynamic allocation behavior depends on the Spark configuration. Spark-on-Kubernetes has no standalone “Spark Master UI”, so executor scaling cannot be observed there.

## 1) Verify LER-U mount and import path

The LER-U folder is mounted into the notebook pod at runtime:

- Container path: `/opt/leru`
- Utilities: `/opt/leru/utilities`

`PYTHONPATH` is set to include `/opt/leru`, so `import utilities` works.

In [7]:
import os
import utilities

print("utilities.__file__ =", utilities.__file__)
print("/opt/leru exists    =", os.path.exists("/opt/leru"))
print("/opt/leru/utilities =", os.path.exists("/opt/leru/utilities"))

utilities.__file__ = /opt/leru/utilities/__init__.py
/opt/leru exists    = True
/opt/leru/utilities = True
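
If the import ever fails, it can help to confirm that the mount's parent directory is actually on Python's search path. The helper below is a minimal sketch; `is_on_sys_path` is a hypothetical name, not part of the `utilities` package.

```python
import os
import sys


def is_on_sys_path(directory, search_path=None):
    """Return True if `directory` is one of the entries Python searches for imports."""
    search_path = sys.path if search_path is None else search_path
    target = os.path.normpath(directory)
    # Skip empty entries (they mean "current directory") and compare normalized paths.
    return any(os.path.normpath(entry) == target for entry in search_path if entry)


# `import utilities` needs the parent folder (/opt/leru), not /opt/leru/utilities itself.
print("/opt/leru on sys.path =", is_on_sys_path("/opt/leru"))
```

A `False` here, combined with a successful import, would suggest the package is being picked up from somewhere other than the mount.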


## 2) Create a Spark session via SparkConnector

The connector detects that it is running in Kubernetes and applies the defaults listed below.

You can optionally override these defaults with environment variables:

- `MINIO_USER` / `MINIO_PASSWORD` (default: `admin` / `password`)
- `MINIO_HOSTNAME` (default: `minio.minio.svc`)
- `DST_BUCKET` (default: `s3a://polaris`)
- `SPARK_EXECUTOR_IMAGE` (defaults to the notebook image tag)
- `POLARIS_URI` / `POLARIS_TOKEN_URI` (defaults match this repo)
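
The override pattern described above can be sketched as follows. This is an illustration only: `DEFAULTS` and `resolve_settings` are hypothetical names, and the real connector may resolve its settings differently. Only the variables whose defaults are spelled out in the list are included.

```python
import os

# Defaults copied from the list above (assumption: the connector uses these exact values).
DEFAULTS = {
    "MINIO_USER": "admin",
    "MINIO_PASSWORD": "password",
    "MINIO_HOSTNAME": "minio.minio.svc",
    "DST_BUCKET": "s3a://polaris",
}


def resolve_settings(environ=None):
    """Return each setting, preferring a non-empty environment variable over its default."""
    environ = os.environ if environ is None else environ
    return {key: environ.get(key) or default for key, default in DEFAULTS.items()}


settings = resolve_settings()
print("bucket in effect =", settings["DST_BUCKET"])
```

To override a value from a notebook, setting it (for example `os.environ["DST_BUCKET"] = ...`) before the connector is constructed is presumably required, since the environment is most likely read at construction time.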

In [8]:
from utilities.spark_connector import SparkConnector

connector = SparkConnector(size="XS", force_new=True)
spark = connector.session

print("\n--- Connector env ---")
print("env_name      =", connector.env.env_name)
print("runtime       =", connector.env.runtime)
print("spark_master  =", connector.env.spark_master)
print("bucket        =", connector.env.bucket)
print("catalog_type  =", connector.env.catalog_type)

print("\nSpark version =", spark.version)


 CONFIGURING SPARK SESSION
  User:        root
  Branch:      unknown
  Environment: k8s
  Bucket:      s3a://polaris
  Size:        XS
  Runtime:     kubernetes


25/12/16 09:21:46 WARN SparkContext: Another SparkContext is being constructed (or threw an exception in its constructor). This may indicate an error, since only one SparkContext should be running in this JVM (see SPARK-2243). The other SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
py4j.Gateway.invoke(Gateway.java:238)
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
py4j.commands.Con

Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://kubernetes.default.svc/api/v1/namespaces/jhub-dev/configmaps. Message: configmaps "spark-exec-856bcd9b2667ed8f-conf-map" already exists. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=null, kind=configmaps, name=spark-exec-856bcd9b2667ed8f-conf-map, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=configmaps "spark-exec-856bcd9b2667ed8f-conf-map" already exists, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=AlreadyExists, status=Failure, additionalProperties={}).
	at io.fabric8.kubernetes.client.KubernetesClientException.copyAsCause(KubernetesClientException.java:238)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:518)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:535)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleCreate(OperationSupport.java:340)
	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleCreate(BaseOperation.java:703)
	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.handleCreate(BaseOperation.java:92)
	at io.fabric8.kubernetes.client.dsl.internal.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:42)
	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.create(BaseOperation.java:1108)
	at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.create(BaseOperation.java:92)
	at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend.setUpExecutorConfigMap(KubernetesClusterSchedulerBackend.scala:90)
	at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend.start(KubernetesClusterSchedulerBackend.scala:114)
	at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:235)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:599)
	at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:238)
	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://kubernetes.default.svc/api/v1/namespaces/jhub-dev/configmaps. Message: configmaps "spark-exec-856bcd9b2667ed8f-conf-map" already exists. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=null, kind=configmaps, name=spark-exec-856bcd9b2667ed8f-conf-map, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=configmaps "spark-exec-856bcd9b2667ed8f-conf-map" already exists, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=AlreadyExists, status=Failure, additionalProperties={}).
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:671)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.requestFailure(OperationSupport.java:651)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.assertResponseCode(OperationSupport.java:600)
	at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.lambda$handleResponse$0(OperationSupport.java:560)
	at java.base/java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:642)
	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
	at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2079)
	at io.fabric8.kubernetes.client.http.StandardHttpClient.lambda$completeOrCancel$10(StandardHttpClient.java:140)
	at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
	at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
	at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2079)
	at io.fabric8.kubernetes.client.http.ByteArrayBodyHandler.onBodyDone(ByteArrayBodyHandler.java:52)
	at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
	at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
	at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
	at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2079)
	at io.fabric8.kubernetes.client.okhttp.OkHttpClientImpl$OkHttpAsyncBody.doConsume(OkHttpClientImpl.java:137)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	... 1 more
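
The run above failed with HTTP 409 (`AlreadyExists`) while Spark was creating its executor ConfigMap. A likely reading, given the preceding `Another SparkContext is being constructed` warning, is that an earlier context in the same kernel left a ConfigMap with the same name behind. Restarting the kernel is the simplest remedy. Alternatively, the stale object can be deleted by hand. The sketch below only extracts the namespace and name from the error text and builds the `kubectl` command; `stale_configmap_cleanup_command` is a hypothetical helper, and running the command assumes `kubectl` is available and the service account may delete ConfigMaps.

```python
import re


def stale_configmap_cleanup_command(error_text):
    """Build a kubectl command for the ConfigMap named in a 409 AlreadyExists error."""
    namespace = re.search(r"/namespaces/([^/\s]+)/configmaps", error_text)
    name = re.search(r'configmaps "([^"]+)" already exists', error_text)
    if not (namespace and name):
        return None  # The text does not look like the error shown above.
    return f"kubectl delete configmap {name.group(1)} -n {namespace.group(1)}"


# Message fragment copied from the traceback above.
message = (
    "Failure executing: POST at: "
    "https://kubernetes.default.svc/api/v1/namespaces/jhub-dev/configmaps. "
    'Message: configmaps "spark-exec-856bcd9b2667ed8f-conf-map" already exists.'
)
print(stale_configmap_cleanup_command(message))
```

After the ConfigMap is removed (or the kernel restarted), re-running the cell should let Spark create it afresh.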


## 3) Quick Spark sanity check

In [None]:
print("count =", spark.range(1000).count())

## 4) Write/read a small Delta dataset to MinIO (S3A)

We write to a path under the configured bucket.

In [None]:
path = f"{connector.env.bucket}/demo/connector_demonstrator_k8s/delta_table"
(
    spark.range(10)
    .withColumnRenamed("id", "n")
    .write.format("delta")
    .mode("overwrite")
    .save(path)
)

print("Wrote Delta to:", path)
print("Read back:")
spark.read.format("delta").load(path).show()
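
The f-string above works as long as the bucket value has no trailing slash. If `DST_BUCKET` is overridden with something like `s3a://polaris/`, the resulting path gets a doubled slash. A small helper can normalize this; `join_s3a_path` is a hypothetical name used only for illustration.

```python
def join_s3a_path(bucket, *parts):
    """Join a bucket URI and key segments with exactly one slash between them."""
    segments = [bucket.rstrip("/")]
    # Drop empty segments and stray slashes so the key never contains "//".
    segments += [part.strip("/") for part in parts if part.strip("/")]
    return "/".join(segments)


print(join_s3a_path("s3a://polaris/", "demo", "connector_demonstrator_k8s", "delta_table"))
```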

## 5) (Optional) Polaris/Iceberg smoke test

If Polaris is deployed (it is part of this repo's k3d stack) and the active session has a catalog named `polaris`, this should work.

In [9]:
try:
    # DDL statements return an empty DataFrame, so there is nothing to show().
    spark.sql("CREATE DATABASE IF NOT EXISTS polaris.demo")
    spark.sql("DROP TABLE IF EXISTS polaris.demo.users")
    spark.sql(
        """
        CREATE TABLE polaris.demo.users (
            id INT,
            name STRING
        )
        USING iceberg
        """
    )
    spark.sql("INSERT INTO polaris.demo.users VALUES (1, 'Alice'), (2, 'Bob')")
    spark.sql("SELECT * FROM polaris.demo.users").show()
    print("✅ Polaris/Iceberg smoke test OK")
except Exception as e:
    print("⚠️ Polaris/Iceberg smoke test skipped/failed:")
    print(e)

⚠️ Polaris/Iceberg smoke test skipped/failed:
An error occurred while calling o141.sql.
: org.apache.spark.SparkException: [INTERNAL_ERROR] Undefined error message parameter for error class: '_LEGACY_ERROR_TEMP_1055'. Parameters: Map(database -> polaris.demo)
	at org.apache.spark.SparkException$.internalError(SparkException.scala:92)
	at org.apache.spark.SparkException$.internalError(SparkException.scala:96)
	at org.apache.spark.ErrorClassesJsonReader.getErrorMessage(ErrorClassesJSONReader.scala:56)
	at org.apache.spark.SparkThrowableHelper$.getMessage(SparkThrowableHelper.scala:53)
	at org.apache.spark.SparkThrowableHelper$.getMessage(SparkThrowableHelper.scala:40)
	at org.apache.spark.sql.AnalysisException.<init>(AnalysisException.scala:47)
	at org.apache.spark.sql.AnalysisException.<init>(AnalysisException.scala:70)
	at org.apache.spark.sql.errors.QueryCompilationErrors$.invalidDatabaseNameError(QueryCompilationErrors.scala:875)
	at org.apache.spark.sql.catalyst.analysis.ResolveSess
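
The `invalidDatabaseNameError` for `polaris.demo` suggests that Spark treated the whole string as a database name in the session catalog, which is what happens when no catalog named `polaris` is registered on the active session. Spark registers catalogs through `spark.sql.catalog.<name>` settings, so a quick check before running the smoke test is possible. The sketch below works on a plain dict; `catalog_is_registered` is a hypothetical helper, and the sample values are illustrative rather than taken from this repo.

```python
def catalog_is_registered(spark_conf, catalog_name):
    """Return True if the config has a `spark.sql.catalog.<name>` entry."""
    return f"spark.sql.catalog.{catalog_name}" in spark_conf


# Illustrative config only; the values this platform actually sets may differ.
sample_conf = {
    "spark.sql.catalog.polaris": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.polaris.type": "rest",
}
print("polaris registered =", catalog_is_registered(sample_conf, "polaris"))
```

Against a live session, the dict can be built with `dict(spark.sparkContext.getConf().getAll())`. If the key is missing, the session was probably not created by the connector run in step 2 (which failed above), and the smoke test will not succeed until it is.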

## 6) Cleanup

In [10]:
connector.stop()
print("Stopped Spark session")

Stopped Spark session
