# Spark Operator Testing & Debugging

This notebook helps test and debug the Spark Operator integration with Kubeflow in the Kind cluster.

## What This Tests

1. ✅ Spark Operator deployment status
2. ✅ SparkApplication CRD availability
3. ✅ Submit test SparkApplication
4. ✅ Monitor SparkApplication lifecycle
5. ✅ Retrieve driver and executor logs
6. ✅ Troubleshoot common issues
7. ✅ Check service account and RBAC configuration

## Setup

In [None]:
import json
import subprocess
import time
from datetime import datetime
from typing import Dict, List, Optional

# Configuration
NAMESPACE = "dfp"
CLUSTER_NAME = "dfp-kind"

print(f"Testing Spark Operator in namespace: {NAMESPACE}")
print(f"Kind cluster: {CLUSTER_NAME}")

## Utility Functions

In [None]:
def run_kubectl(args: List[str], check: bool = True) -> subprocess.CompletedProcess:
    """Run kubectl command and return result."""
    cmd = ["kubectl"] + args
    result = subprocess.run(
        cmd,
        capture_output=True,
        text=True,
        check=False,
    )
    if check and result.returncode != 0:
        print(f"❌ Command failed: {' '.join(cmd)}")
        print(f"   stderr: {result.stderr}")
        raise subprocess.CalledProcessError(result.returncode, cmd, result.stdout, result.stderr)
    return result


def print_section(title: str):
    """Print a formatted section header."""
    print("\n" + "=" * 70)
    print(f"  {title}")
    print("=" * 70 + "\n")


def print_json(data: dict):
    """Pretty print JSON data."""
    print(json.dumps(data, indent=2))


print("✓ Utility functions loaded")
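
The nested status lookup `status.applicationState.state` is repeated in the listing, monitoring, and waiting cells. A small helper could centralize it. This is a minimal sketch; `get_app_state` is a name introduced here for illustration and is not used elsewhere in the notebook.

```python
from typing import Any, Dict


def get_app_state(app: Dict[str, Any], default: str = "UNKNOWN") -> str:
    """Return status.applicationState.state, or `default` when it is missing."""
    state = app.get("status", {}).get("applicationState", {}).get("state")
    # Treat a missing or empty state as "not reported yet".
    return state or default


# Hand-written objects shaped like `kubectl get sparkapplication -o json` output.
print(get_app_state({"status": {"applicationState": {"state": "RUNNING"}}}))
print(get_app_state({"metadata": {"name": "no-status-yet"}}))
```

The fallback also covers an empty state string, which the plain `.get(..., "UNKNOWN")` chain would pass through unchanged.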

## 1. Check Prerequisites

In [None]:
print_section("1. Checking Prerequisites")

# Check kubectl
try:
    result = run_kubectl(["version", "--client", "-o", "json"])
    version_info = json.loads(result.stdout)
    print(f"✓ kubectl version: {version_info['clientVersion']['gitVersion']}")
except Exception as e:
    print(f"❌ kubectl not found or not configured: {e}")

# Check cluster connection
try:
    result = run_kubectl(["cluster-info"])
    print(f"\n✓ Cluster connection established")
    print(result.stdout)
except Exception as e:
    print(f"❌ Cannot connect to cluster: {e}")

# Check namespace
try:
    result = run_kubectl(["get", "namespace", NAMESPACE])
    print(f"✓ Namespace '{NAMESPACE}' exists")
except Exception as e:
    print(f"❌ Namespace '{NAMESPACE}' not found: {e}")

## 2. Check Spark Operator Deployment

In [None]:
print_section("2. Spark Operator Deployment Status")

try:
    # Get deployment
    result = run_kubectl(["-n", NAMESPACE, "get", "deployment", "spark-operator", "-o", "json"])
    deployment = json.loads(result.stdout)
    
    # Extract key info
    image = deployment["spec"]["template"]["spec"]["containers"][0]["image"]
    replicas = deployment["spec"]["replicas"]
    available_replicas = deployment["status"].get("availableReplicas", 0)
    ready_replicas = deployment["status"].get("readyReplicas", 0)
    
    print(f"✓ Spark Operator Deployment Found")
    print(f"  Image: {image}")
    print(f"  Desired Replicas: {replicas}")
    print(f"  Ready Replicas: {ready_replicas}")
    print(f"  Available Replicas: {available_replicas}")
    
    if ready_replicas == replicas:
        print(f"\n✅ Spark Operator is READY")
    else:
        print(f"\n⚠️  Spark Operator is NOT READY")
        
    # Get pod status
    result = run_kubectl(["-n", NAMESPACE, "get", "pods", "-l", "app=spark-operator", "-o", "json"])
    pods = json.loads(result.stdout)
    
    print(f"\nPods:")
    for pod in pods["items"]:
        name = pod["metadata"]["name"]
        phase = pod["status"]["phase"]
        print(f"  • {name}: {phase}")
        
except subprocess.CalledProcessError:
    print("❌ Spark Operator deployment not found!")
    print("\nTo install:")
    print("  kubectl apply -k infra/k8s/kind/addons/spark-operator/")
    print("  # or")
    print("  task spark-operator:up")

## 3. Check SparkApplication CRD

In [None]:
print_section("3. SparkApplication Custom Resource Definition")

try:
    # Check CRD exists
    result = run_kubectl(["get", "crd", "sparkapplications.sparkoperator.k8s.io", "-o", "json"])
    crd = json.loads(result.stdout)
    
    print(f"✓ SparkApplication CRD Found")
    print(f"  Name: {crd['metadata']['name']}")
    print(f"  Group: {crd['spec']['group']}")
    print(f"  Scope: {crd['spec']['scope']}")
    
    print(f"\n  Versions:")
    for version in crd["spec"]["versions"]:
        served = "✓" if version["served"] else "✗"
        storage = "(storage)" if version["storage"] else ""
        print(f"    {served} {version['name']} {storage}")
        
    print(f"\n✅ SparkApplication CRD is available")
    
except subprocess.CalledProcessError:
    print("❌ SparkApplication CRD not found!")
    print("\nThe CRD should be installed with the Spark Operator.")

## 4. List Existing SparkApplications

In [None]:
print_section("4. Existing SparkApplications")

try:
    result = run_kubectl(["-n", NAMESPACE, "get", "sparkapplications", "-o", "json"])
    apps = json.loads(result.stdout)
    
    if not apps["items"]:
        print("No SparkApplications found in namespace.")
    else:
        print(f"Found {len(apps['items'])} SparkApplication(s):\n")
        for app in apps["items"]:
            name = app["metadata"]["name"]
            state = app.get("status", {}).get("applicationState", {}).get("state", "UNKNOWN")
            creation = app["metadata"]["creationTimestamp"]
            print(f"  • {name}")
            print(f"    State: {state}")
            print(f"    Created: {creation}")
            print()
            
except subprocess.CalledProcessError as e:
    print(f"❌ Failed to list SparkApplications: {e.stderr}")

## 5. Create Test SparkApplication

This cell stores a small PySpark script in a ConfigMap, mounts it into the driver and executor pods, and submits a SparkApplication that runs it.

In [None]:
print_section("5. Creating Test SparkApplication")

# Generate unique name
test_app_name = f"spark-test-{datetime.now().strftime('%Y%m%d-%H%M%S')}"
print(f"Creating SparkApplication: {test_app_name}\n")

# Simple PySpark test job
test_job_code = '''
import sys
from pyspark.sql import SparkSession

print("="*70)
print("Starting Spark Test Job")
print("="*70)

spark = SparkSession.builder.appName("spark-test").getOrCreate()

print(f"Spark version: {spark.version}")
print(f"Python version: {sys.version}")

# Create test DataFrame
data = [("Alice", 34), ("Bob", 45), ("Charlie", 29)]
df = spark.createDataFrame(data, ["name", "age"])

print("\\nTest DataFrame:")
df.show()

print("\\nDataFrame count:", df.count())
print("\\nDataFrame schema:")
df.printSchema()

print("\\n" + "="*70)
print("✅ Spark Test Job Completed Successfully")
print("="*70)

spark.stop()
'''

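# Create ConfigMap with test job.
# The script must be indented under the `test_job.py: |` block scalar,
# otherwise the manifest is not valid YAML.
import textwrap

indented_job_code = textwrap.indent(test_job_code.strip("\n"), "    ")
test_configmap = f"""
apiVersion: v1
kind: ConfigMap
metadata:
  name: {test_app_name}-job
  namespace: {NAMESPACE}
data:
  test_job.py: |
{indented_job_code}
"""

print("Creating ConfigMap with test job...")
try:
    # Pipe the manifest to kubectl on stdin; run_kubectl() has no way to pass input.
    subprocess.run(["kubectl", "apply", "-f", "-"], input=test_configmap, text=True, check=True, capture_output=True)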
    print("✓ ConfigMap created\n")
except Exception as e:
    print(f"❌ Failed to create ConfigMap: {e}")

# Create SparkApplication
test_spark_app = f"""
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: {test_app_name}
  namespace: {NAMESPACE}
spec:
  type: Python
  mode: cluster
  pythonVersion: "3"
  sparkVersion: "3.5.3"
  image: "bitnami/spark:3.5.3"
  imagePullPolicy: IfNotPresent
  mainApplicationFile: "local:///opt/spark/work-dir/test_job.py"
  
  restartPolicy:
    type: Never
  
  driver:
    cores: 1
    memory: "512m"
    serviceAccount: spark-operator
    labels:
      version: "3.5.3"
    volumeMounts:
      - name: test-job
        mountPath: /opt/spark/work-dir
  
  executor:
    instances: 1
    cores: 1
    memory: "512m"
    labels:
      version: "3.5.3"
    volumeMounts:
      - name: test-job
        mountPath: /opt/spark/work-dir
  
  volumes:
    - name: test-job
      configMap:
        name: {test_app_name}-job
"""

print("Creating SparkApplication...")
try:
    subprocess.run(["kubectl", "apply", "-f", "-"], input=test_spark_app, text=True, check=True, capture_output=True)
    print(f"✓ SparkApplication '{test_app_name}' created\n")
    print(f"Monitor with:")
    print(f"  kubectl -n {NAMESPACE} get sparkapplication {test_app_name}")
    print(f"  kubectl -n {NAMESPACE} logs -f {test_app_name}-driver")
except Exception as e:
    print(f"❌ Failed to create SparkApplication: {e}")
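
Building the manifest as a Python dict and serializing it to JSON is one way to avoid hand-indenting a script inside a YAML string, since `kubectl apply -f -` accepts JSON as well as YAML. The sketch below assumes a helper named `build_job_configmap` and sample names that are purely illustrative.

```python
import json
from typing import Any, Dict


def build_job_configmap(name: str, namespace: str, filename: str, source: str) -> Dict[str, Any]:
    """Build a ConfigMap manifest that stores `source` under the key `filename`."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "namespace": namespace},
        "data": {filename: source},
    }


# Hypothetical names and script content, for illustration only.
manifest = build_job_configmap(
    name="spark-test-example-job",
    namespace="dfp",
    filename="test_job.py",
    source='print("hello from spark")\n',
)
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)

# To apply it, pipe the JSON to kubectl on stdin, for example:
# subprocess.run(["kubectl", "apply", "-f", "-"], input=manifest_json, text=True, check=True)
```

JSON escaping preserves newlines and quotes in the script, so no indentation step is needed.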

## 6. Monitor SparkApplication Status

Run this cell multiple times to watch the status change.

In [None]:
print_section(f"6. Monitoring SparkApplication: {test_app_name}")

try:
    result = run_kubectl(["-n", NAMESPACE, "get", "sparkapplication", test_app_name, "-o", "json"])
    app = json.loads(result.stdout)
    
    # Extract status info
    state = app.get("status", {}).get("applicationState", {}).get("state", "UNKNOWN")
    submission_time = app.get("status", {}).get("submissionTime", "N/A")
    termination_time = app.get("status", {}).get("terminationTime", "N/A")
    driver_info = app.get("status", {}).get("driverInfo", {})
    
    print(f"Application: {test_app_name}")
    print(f"State: {state}")
    print(f"Submission Time: {submission_time}")
    print(f"Termination Time: {termination_time}")
    
    if driver_info:
        print(f"\nDriver Info:")
        print(f"  Pod Name: {driver_info.get('podName', 'N/A')}")
        print(f"  Web UI Service: {driver_info.get('webUIServiceName', 'N/A')}")
    
    # State-specific messages
    if state == "COMPLETED":
        print(f"\n✅ SparkApplication COMPLETED successfully!")
    elif state in ["FAILED", "SUBMISSION_FAILED"]:
        print(f"\n❌ SparkApplication {state}!")
        error_message = app.get("status", {}).get("applicationState", {}).get("errorMessage", "")
        if error_message:
            print(f"   Error: {error_message}")
    elif state in ["SUBMITTED", "RUNNING"]:
        print(f"\n⏳ SparkApplication is {state}...")
    else:
        # No status reported yet (the lookup defaults to "UNKNOWN") or a transitional state
        print(f"\n⏳ SparkApplication is waiting on the operator (state: {state})...")
    
    # Check pods
    print(f"\nAssociated Pods:")
    result = run_kubectl(["-n", NAMESPACE, "get", "pods", "-l", f"sparkoperator.k8s.io/app-name={test_app_name}", "-o", "json"])
    pods = json.loads(result.stdout)
    
    if not pods["items"]:
        print("  No pods created yet")
    else:
        for pod in pods["items"]:
            name = pod["metadata"]["name"]
            phase = pod["status"]["phase"]
            role = pod["metadata"]["labels"].get("spark-role", "unknown")
            print(f"  • {name} ({role}): {phase}")
            
except subprocess.CalledProcessError:
    print(f"❌ SparkApplication '{test_app_name}' not found")

## 7. Wait for Completion

This cell waits for the SparkApplication to complete (or fail).

In [None]:
print_section(f"7. Waiting for SparkApplication to Complete")

print(f"Waiting for {test_app_name}...\n")

timeout = 300  # 5 minutes
start_time = time.time()
last_state = None

while time.time() - start_time < timeout:
    try:
        result = run_kubectl(["-n", NAMESPACE, "get", "sparkapplication", test_app_name, "-o", "json"])
        app = json.loads(result.stdout)
        state = app.get("status", {}).get("applicationState", {}).get("state", "UNKNOWN")
        
        if state != last_state:
            elapsed = int(time.time() - start_time)
            print(f"[{elapsed}s] State: {state}")
            last_state = state
        
        if state in ["COMPLETED", "FAILED", "SUBMISSION_FAILED"]:
            print(f"\nFinal state reached: {state}")
            if state == "COMPLETED":
                print("✅ SparkApplication completed successfully!")
            else:
                print(f"❌ SparkApplication {state}")
            break
        
        time.sleep(5)
        
    except subprocess.CalledProcessError:
        print(f"❌ Failed to get SparkApplication status")
        break
else:
    print(f"\n⏱️ Timeout reached ({timeout}s)")
    print(f"Last state: {last_state}")
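
The polling loop above can also be written as a reusable function that takes the state lookup as a callable, which makes it testable without a cluster. This is a sketch under that assumption; `wait_for_terminal_state` and its parameters are names introduced here, and the terminal states are the ones the cell above checks for.

```python
import time
from typing import Callable, Iterable, Optional, Tuple

TERMINAL_STATES = ("COMPLETED", "FAILED", "SUBMISSION_FAILED")


def wait_for_terminal_state(
    fetch_state: Callable[[], str],
    timeout: float = 300.0,
    interval: float = 5.0,
    terminal_states: Iterable[str] = TERMINAL_STATES,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[Optional[str], bool]:
    """Poll `fetch_state` until it returns a terminal state or the timeout expires.

    Returns (last_state, finished); finished is False when the timeout was hit.
    """
    terminal = set(terminal_states)
    deadline = time.monotonic() + timeout
    last_state: Optional[str] = None
    while time.monotonic() < deadline:
        state = fetch_state()
        if state != last_state:
            # Only report transitions, not every poll.
            print(f"State: {state}")
            last_state = state
        if state in terminal:
            return state, True
        sleep(interval)
    return last_state, False


# Simulated state sequence standing in for repeated kubectl calls.
states = iter(["SUBMITTED", "RUNNING", "RUNNING", "COMPLETED"])
result = wait_for_terminal_state(lambda: next(states), timeout=10, sleep=lambda _: None)
print(result)  # ('COMPLETED', True)
```

In the notebook, `fetch_state` would wrap the `run_kubectl` call and JSON lookup from the cell above.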

## 8. View Driver Logs

In [None]:
print_section("8. Driver Pod Logs")

try:
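    # Get driver pod name. `[*]` yields empty output when no pod matches,
    # whereas `[0]` would make kubectl exit with an index error.
    result = run_kubectl([
        "-n", NAMESPACE,
        "get", "pods",
        "-l", f"sparkoperator.k8s.io/app-name={test_app_name},spark-role=driver",
        "-o", "jsonpath={.items[*].metadata.name}"
    ])
    pod_names = result.stdout.split()
    driver_pod = pod_names[0] if pod_names else ""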
    
    if not driver_pod:
        print("❌ Driver pod not found")
    else:
        print(f"Driver pod: {driver_pod}\n")
        print("Logs:")
        print("=" * 70)
        
        result = run_kubectl(["-n", NAMESPACE, "logs", driver_pod, "--tail=100"])
        print(result.stdout)
        print("=" * 70)
        
except subprocess.CalledProcessError as e:
    print(f"❌ Failed to get logs: {e.stderr}")

## 9. View Executor Logs

In [None]:
print_section("9. Executor Pod Logs")

try:
    # Get executor pods
    result = run_kubectl([
        "-n", NAMESPACE,
        "get", "pods",
        "-l", f"sparkoperator.k8s.io/app-name={test_app_name},spark-role=executor",
        "-o", "json"
    ])
    pods = json.loads(result.stdout)
    
    if not pods["items"]:
        print("No executor pods found (may have been cleaned up)")
    else:
        for pod in pods["items"]:
            pod_name = pod["metadata"]["name"]
            print(f"\nExecutor pod: {pod_name}")
            print("=" * 70)
            
            result = run_kubectl(["-n", NAMESPACE, "logs", pod_name, "--tail=50"])
            print(result.stdout)
            print("=" * 70)
            
except subprocess.CalledProcessError as e:
    print(f"❌ Failed to get executor logs: {e.stderr}")

## 10. Describe SparkApplication

Get detailed information including events.

In [None]:
print_section("10. SparkApplication Details")

try:
    result = run_kubectl(["-n", NAMESPACE, "describe", "sparkapplication", test_app_name])
    print(result.stdout)
except subprocess.CalledProcessError as e:
    print(f"❌ Failed to describe SparkApplication: {e.stderr}")

## 11. Check Spark Operator Logs

View the Spark Operator controller logs to see how it processed the SparkApplication.

In [None]:
print_section("11. Spark Operator Controller Logs")

try:
    result = run_kubectl([
        "-n", NAMESPACE,
        "logs",
        "deployment/spark-operator",
        "--tail=100"
    ])
    print(result.stdout)
except subprocess.CalledProcessError as e:
    print(f"❌ Failed to get operator logs: {e.stderr}")
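
Operator logs cover every application the controller handles, so narrowing them to one SparkApplication can help. The following is a minimal sketch of a case-insensitive line filter; `filter_log_lines` is a hypothetical helper, and the sample log lines are made up and do not reflect the operator's real log format.

```python
from typing import List


def filter_log_lines(log_text: str, needle: str) -> List[str]:
    """Return the log lines that mention `needle` (case-insensitive)."""
    needle_lower = needle.lower()
    return [line for line in log_text.splitlines() if needle_lower in line.lower()]


# Made-up log lines for illustration; real operator output will look different.
sample_logs = "\n".join([
    "reconciling app spark-test-20240101-000000",
    "reconciling app other-app",
    "submitted Spark-Test-20240101-000000 to cluster",
])
for line in filter_log_lines(sample_logs, "spark-test-20240101-000000"):
    print(line)
```

In the notebook this could be applied to `result.stdout` from the cell above with `test_app_name` as the search term.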

## 12. Troubleshooting Helpers

In [None]:
print_section("12. Troubleshooting Information")

print("Common Issues and Solutions:\n")

print("1. ImagePullBackOff:")
print("   - Check if Spark image exists")
print(f"   - For kind, load image: kind load docker-image bitnami/spark:3.5.3 --name {CLUSTER_NAME}\n")

print("2. Pods stuck in Pending:")
print("   - Check node resources: kubectl top nodes (requires metrics-server)")
print(f"   - Check pod events: kubectl -n {NAMESPACE} describe pod <pod-name>\n")

print("3. Application stuck in SUBMITTED:")
print("   - Check Spark Operator logs (see section 11)")
print("   - Verify RBAC permissions (see section 13)")
print("   - Check if service account exists\n")

print("4. Application FAILED:")
print("   - Check driver logs (see section 8)")
print("   - Check application state error message (see section 6)")
print("   - Verify job code syntax\n")

print("\nUseful Commands:")
print(f"  # Watch SparkApplication")
print(f"  kubectl -n {NAMESPACE} get sparkapplication {test_app_name} -w")
print(f"\n  # Follow driver logs")
print(f"  kubectl -n {NAMESPACE} logs -f {test_app_name}-driver")
print(f"\n  # Check all pods")
print(f"  kubectl -n {NAMESPACE} get pods -l sparkoperator.k8s.io/app-name={test_app_name}")
print(f"\n  # Delete SparkApplication")
print(f"  kubectl -n {NAMESPACE} delete sparkapplication {test_app_name}")

## 13. Check Service Account and RBAC

In [None]:
print_section("13. Service Account and RBAC Check")

# Check service account
try:
    result = run_kubectl(["-n", NAMESPACE, "get", "serviceaccount", "spark-operator"])
    print("✓ ServiceAccount 'spark-operator' exists\n")
    print(result.stdout)
except subprocess.CalledProcessError:
    print("❌ ServiceAccount 'spark-operator' not found")

print("\n" + "-" * 70 + "\n")

# Check ClusterRole
try:
    result = run_kubectl(["get", "clusterrole", "spark-operator"])
    print("✓ ClusterRole 'spark-operator' exists\n")
    print(result.stdout)
except subprocess.CalledProcessError:
    print("❌ ClusterRole 'spark-operator' not found")

print("\n" + "-" * 70 + "\n")

# Check ClusterRoleBinding
try:
    result = run_kubectl(["get", "clusterrolebinding", "spark-operator"])
    print("✓ ClusterRoleBinding 'spark-operator' exists\n")
    print(result.stdout)
except subprocess.CalledProcessError:
    print("❌ ClusterRoleBinding 'spark-operator' not found")
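
Checking that the RBAC objects exist does not show what the service account is actually allowed to do. `kubectl auth can-i` with `--as` can answer that. The sketch below only builds the argument lists; `can_i_args` is a hypothetical helper, and the list of permissions is an assumption about what a Spark driver typically needs, not something taken from this repository.

```python
from typing import List


def can_i_args(namespace: str, service_account: str, verb: str, resource: str) -> List[str]:
    """Build kubectl arguments asking whether a service account may perform an action."""
    subject = f"system:serviceaccount:{namespace}:{service_account}"
    return ["auth", "can-i", verb, resource, "--as", subject, "-n", namespace]


# Assumed list of permissions; adjust to the cluster's actual requirements.
checks = [("create", "pods"), ("delete", "pods"), ("create", "services"), ("create", "configmaps")]
for verb, resource in checks:
    args = can_i_args("dfp", "spark-operator", verb, resource)
    print("kubectl " + " ".join(args))
```

In the notebook, each list could be passed to `run_kubectl(args, check=False)`, because `kubectl auth can-i` exits non-zero when the answer is "no"; the answer itself is in `stdout`.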

## 14. Clean Up Test Resources

In [None]:
print_section("14. Clean Up Test Resources")

print(f"Deleting test resources for: {test_app_name}\n")

# Delete SparkApplication
try:
    result = run_kubectl(["-n", NAMESPACE, "delete", "sparkapplication", test_app_name])
    print(f"✓ Deleted SparkApplication: {test_app_name}")
except subprocess.CalledProcessError:
    print(f"⚠️  SparkApplication '{test_app_name}' not found (may already be deleted)")

# Delete ConfigMap
try:
    result = run_kubectl(["-n", NAMESPACE, "delete", "configmap", f"{test_app_name}-job"])
    print(f"✓ Deleted ConfigMap: {test_app_name}-job")
except subprocess.CalledProcessError:
    print(f"⚠️  ConfigMap '{test_app_name}-job' not found (may already be deleted)")

print(f"\n✅ Cleanup complete")

## Summary

This notebook tested:
- ✅ Spark Operator deployment and health
- ✅ SparkApplication CRD availability
- ✅ Creating and submitting SparkApplications
- ✅ Monitoring application lifecycle
- ✅ Retrieving logs from driver and executors
- ✅ RBAC configuration

### Next Steps

If the test passed, you can now:
1. Run the Kronodroid pipeline with `--transform-runner spark-operator`
2. Create custom SparkApplications for your data processing needs
3. Integrate with MinIO and LakeFS for data versioning
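
For the MinIO step, a SparkApplication usually needs S3A settings in its `sparkConf`. The following is a rough sketch only: the endpoint and credentials are placeholders, `s3a_spark_conf` is a hypothetical helper, and it assumes the Spark image ships the Hadoop S3A connector. Real credentials are better supplied through Kubernetes secrets than inline values.

```python
from typing import Dict


def s3a_spark_conf(endpoint: str, access_key: str, secret_key: str, use_ssl: bool = False) -> Dict[str, str]:
    """Return sparkConf entries pointing the Hadoop S3A connector at an S3-compatible endpoint."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # Path-style addressing, commonly needed for self-hosted S3-compatible stores.
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.connection.ssl.enabled": str(use_ssl).lower(),
    }


# Placeholder endpoint and credentials, for illustration only.
conf = s3a_spark_conf("http://minio.example:9000", "EXAMPLE_KEY", "EXAMPLE_SECRET")
for key, value in conf.items():
    print(f"{key}: {value}")
```

The resulting mapping could be placed under `spec.sparkConf` in a SparkApplication manifest.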

### Useful Documentation

- Spark Operator docs: https://github.com/kubeflow/spark-operator
- SparkApplication examples: https://github.com/kubeflow/spark-operator/tree/master/examples
- Local setup: `infra/k8s/kind/addons/spark-operator/README.md`