# Spark Operator Test Notebook

Interactive notebook for testing and debugging the Kubeflow Spark Operator.

## Prerequisites

- Kind cluster running with Spark Operator deployed (`task spark-operator:up`)
- kubectl configured to access the cluster
- Python 3 environment with IPython/Jupyter (the notebook calls `kubectl` through the standard-library `subprocess` module)
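
The prerequisites above can be checked before running any cell. The following is a minimal sketch, assuming the command-line tools of interest are `kubectl`, `kind`, and `docker`. That tool list and the helper name `find_tools` are illustrative and not part of the notebook.

```python
import shutil
from typing import Dict, Iterable, Optional


def find_tools(tools: Iterable[str] = ("kubectl", "kind", "docker")) -> Dict[str, Optional[str]]:
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    # shutil.which returns the executable's full path, or None when it cannot be found.
    return {tool: shutil.which(tool) for tool in tools}


# Report one line per tool, reusing the notebook's bracketed status style.
for tool, path in find_tools().items():
    marker = "[OK]" if path else "[ERROR]"
    print(f"{marker} {tool}: {path or 'not found on PATH'}")
```

This only confirms that the executables exist. It does not verify that the cluster is reachable, which the connectivity check in section 2 covers.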

## What This Notebook Tests

1. **Spark Operator Deployment Status** - Verify the operator is running
2. **CRD Availability** - Check SparkApplication CRD is installed
3. **RBAC Configuration** - Verify service accounts and permissions
4. **SparkApplication Submission** - Submit a test Pi calculation job
5. **Application Monitoring** - Track job progress and state changes
6. **Log Retrieval** - Get driver and executor logs
7. **Troubleshooting** - Diagnose common issues
8. **Cleanup** - Remove test resources

## 1. Setup and Configuration

In [1]:
import subprocess
import json
import time
import os
from datetime import datetime
from typing import Optional, Dict, List, Any
from dataclasses import dataclass
from IPython.display import display, HTML, clear_output

In [2]:
# Configuration
NAMESPACE = os.getenv('SPARK_NAMESPACE', 'dfp')
SPARK_IMAGE = os.getenv('SPARK_IMAGE', 'apache/spark:3.5.7-python3')
SPARK_VERSION = '3.5.7'
SERVICE_ACCOUNT = 'spark-operator'
CLUSTER_NAME = os.getenv('CLUSTER_NAME', 'dfp-kind')

print(f"Namespace: {NAMESPACE}")
print(f"Spark Image: {SPARK_IMAGE}")
print(f"Spark Version: {SPARK_VERSION}")
print(f"Service Account: {SERVICE_ACCOUNT}")
print(f"Cluster Name: {CLUSTER_NAME}")

Namespace: dfp
Spark Image: apache/spark:3.5.7-python3
Spark Version: 3.5.7
Service Account: spark-operator
Cluster Name: dfp-kind


In [3]:
def run_kubectl(args: List[str], *, check: bool = True, capture: bool = True) -> subprocess.CompletedProcess:
    """Run a kubectl command and return the result."""
    cmd = ['kubectl'] + args
    try:
        result = subprocess.run(
            cmd,
            check=check,
            capture_output=capture,
            text=True
        )
        return result
    except subprocess.CalledProcessError as e:
        print(f"Command failed: {' '.join(cmd)}")
        if e.stdout:
            print(f"stdout: {e.stdout}")
        if e.stderr:
            print(f"stderr: {e.stderr}")
        raise


def kubectl_get_json(resource: str, name: str = '', namespace: str = NAMESPACE) -> Optional[Dict]:
    """Get a Kubernetes resource as JSON."""
    try:
        args = ['-n', namespace, 'get', resource]
        if name:
            args.append(name)
        args.extend(['-o', 'json'])
        result = run_kubectl(args)
        return json.loads(result.stdout)
    except (subprocess.CalledProcessError, json.JSONDecodeError):
        # Missing resource or unparsable output: report "not available" to the caller
        return None


def print_status(status: str, message: str):
    """Print a status message with icon."""
    icons = {
        'success': '[OK]',
        'warning': '[WARN]',
        'error': '[ERROR]',
        'info': '[INFO]',
        'pending': '[...]'
    }
    icon = icons.get(status, '[?]')
    print(f"{icon} {message}")

## 2. Cluster Connectivity Check

In [4]:
def check_cluster_connectivity() -> bool:
    """Verify kubectl can connect to the cluster."""
    print("Checking cluster connectivity...\n")
    
    try:
        # Get current context
        result = run_kubectl(['config', 'current-context'])
        context = result.stdout.strip()
        print_status('success', f"Current context: {context}")
        
        # Check cluster info
        result = run_kubectl(['cluster-info'], check=False)
        if result.returncode == 0:
            print_status('success', "Cluster is accessible")
            # Extract control plane URL
            for line in result.stdout.split('\n'):
                if 'control plane' in line.lower() or 'master' in line.lower():
                    print(f"    {line.strip()}")
        else:
            print_status('error', "Cannot connect to cluster")
            print(f"    stderr: {result.stderr}")
            return False
        
        # Check namespace exists
        # Namespaces are cluster-scoped, so no -n flag is needed here
        result = run_kubectl(['get', 'namespace', NAMESPACE], check=False)
        if result.returncode == 0:
            print_status('success', f"Namespace '{NAMESPACE}' exists")
        else:
            print_status('warning', f"Namespace '{NAMESPACE}' does not exist")
            print(f"    Create it with: kubectl create namespace {NAMESPACE}")
            return False
        
        return True
        
    except FileNotFoundError:
        print_status('error', "kubectl not found in PATH")
        return False
    except Exception as e:
        print_status('error', f"Unexpected error: {e}")
        return False


cluster_ok = check_cluster_connectivity()

Checking cluster connectivity...

[OK] Current context: kind-dfp-kind
[OK] Cluster is accessible
    Kubernetes control plane is running at https://127.0.0.1:54595
[OK] Namespace 'dfp' exists


## 3. Spark Operator Deployment Status

In [5]:
def check_spark_operator_deployment() -> Dict[str, Any]:
    """Check if the Spark Operator is deployed and running."""
    print("Checking Spark Operator deployment...\n")
    
    result = {
        'installed': False,
        'ready': False,
        'image': None,
        'replicas': {'desired': 0, 'ready': 0},
        'pods': []
    }
    
    # Check deployment
    deployment = kubectl_get_json('deployment', 'spark-operator')
    
    if not deployment:
        print_status('error', "Spark Operator deployment not found")
        print("\n    To install Spark Operator:")
        print("    kubectl apply -k infra/k8s/kind/addons/spark-operator/")
        print("    # or: task spark-operator:up")
        return result
    
    result['installed'] = True
    
    # Get image
    containers = deployment.get('spec', {}).get('template', {}).get('spec', {}).get('containers', [])
    if containers:
        result['image'] = containers[0].get('image', 'unknown')
        print_status('success', f"Spark Operator image: {result['image']}")
    
    # Check replicas: the desired count lives in spec; status.replicas is the current pod count
    status = deployment.get('status', {})
    result['replicas']['desired'] = deployment.get('spec', {}).get('replicas', 0)
    result['replicas']['ready'] = status.get('readyReplicas', 0)
    
    if result['replicas']['ready'] >= 1:
        result['ready'] = True
        print_status('success', f"Replicas: {result['replicas']['ready']}/{result['replicas']['desired']} ready")
    else:
        print_status('warning', f"Replicas: {result['replicas']['ready']}/{result['replicas']['desired']} ready")
    
    # Get pod details
    pods = kubectl_get_json('pods', '', NAMESPACE)
    if pods:
        for pod in pods.get('items', []):
            name = pod.get('metadata', {}).get('name', '')
            if 'spark-operator' in name:
                phase = pod.get('status', {}).get('phase', 'Unknown')
                result['pods'].append({'name': name, 'phase': phase})
                status_icon = 'success' if phase == 'Running' else 'warning'
                print_status(status_icon, f"Pod: {name} ({phase})")
    
    # Check Spark version compatibility
    if result['image']:
        if '3.5' in result['image']:
            print_status('success', "Operator supports Spark 3.5.x")
        else:
            print_status('warning', "Operator version may not match Spark 3.5.x")
    
    return result


operator_status = check_spark_operator_deployment()

Checking Spark Operator deployment...

[OK] Spark Operator image: ghcr.io/kubeflow/spark-operator:v1beta2-1.4.3-3.5.0
[OK] Replicas: 1/1 ready
[OK] Pod: spark-operator-6bf7947df-7tkbg (Running)
[OK] Operator supports Spark 3.5.x
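
The compatibility check above is a substring test (`'3.5' in image`), which would also match an operator version such as `1.3.5`. A stricter sketch follows, assuming the image tag has the layout `<api version>-<operator version>-<spark version>` as seen in the output above. That layout is inferred from this one image and may not hold for other operator releases. The helper names are hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumed tag layout, inferred from the image printed above:
#   <api version>-<operator version>-<spark version>, e.g. v1beta2-1.4.3-3.5.0
_TAG_PATTERN = re.compile(
    r"^(?P<api>v\w+)-(?P<operator>\d+(?:\.\d+)*)-(?P<spark>\d+(?:\.\d+)*)$"
)


def parse_operator_tag(image: str) -> Optional[Tuple[str, str]]:
    """Return (operator_version, spark_version), or None if the tag does not match."""
    _, sep, tag = image.rpartition(":")
    if not sep:
        return None  # image reference has no tag at all
    match = _TAG_PATTERN.match(tag)
    if not match:
        return None
    return match.group("operator"), match.group("spark")


def same_minor(version_a: str, version_b: str) -> bool:
    """Compare only the major.minor components of two dotted versions."""
    return version_a.split(".")[:2] == version_b.split(".")[:2]


parsed = parse_operator_tag("ghcr.io/kubeflow/spark-operator:v1beta2-1.4.3-3.5.0")
if parsed:
    operator_version, spark_version = parsed
    print(f"operator={operator_version} spark={spark_version} "
          f"matches 3.5.7 minor: {same_minor(spark_version, '3.5.7')}")
```

Returning `None` for an unrecognized tag keeps the caller free to fall back to a warning rather than a false positive.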


In [6]:
def get_spark_operator_logs(tail: int = 50) -> str:
    """Get recent logs from the Spark Operator."""
    print(f"Spark Operator logs (last {tail} lines):\n")
    print("=" * 80)
    
    try:
        result = run_kubectl(['-n', NAMESPACE, 'logs', 'deployment/spark-operator', f'--tail={tail}'])
        print(result.stdout)
        return result.stdout
    except subprocess.CalledProcessError as e:
        print(f"Failed to get logs: {e.stderr}")
        return ""
    finally:
        print("=" * 80)


# Uncomment to view operator logs
# get_spark_operator_logs()

## 4. CRD and RBAC Verification

In [7]:
def check_spark_crds() -> Dict[str, bool]:
    """Check if SparkApplication CRDs are installed."""
    print("Checking Spark Operator CRDs...\n")
    
    crds = {
        'sparkapplications.sparkoperator.k8s.io': False,
        'scheduledsparkapplications.sparkoperator.k8s.io': False
    }
    
    try:
        result = run_kubectl(['get', 'crd', '-o', 'json'])
        crd_list = json.loads(result.stdout)
        
        for item in crd_list.get('items', []):
            name = item.get('metadata', {}).get('name', '')
            if name in crds:
                crds[name] = True
                print_status('success', f"CRD installed: {name}")
        
        for crd, installed in crds.items():
            if not installed:
                print_status('error', f"CRD missing: {crd}")
        
        return crds
        
    except Exception as e:
        print_status('error', f"Failed to check CRDs: {e}")
        return crds


crd_status = check_spark_crds()

Checking Spark Operator CRDs...

[OK] CRD installed: scheduledsparkapplications.sparkoperator.k8s.io
[OK] CRD installed: sparkapplications.sparkoperator.k8s.io


In [8]:
def check_rbac_configuration() -> Dict[str, Any]:
    """Check RBAC configuration for Spark Operator."""
    print("Checking RBAC configuration...\n")
    
    result = {
        'service_account': False,
        'cluster_role': False,
        'cluster_role_binding': False
    }
    
    # Check ServiceAccount
    try:
        sa = kubectl_get_json('serviceaccount', SERVICE_ACCOUNT)
        if sa:
            result['service_account'] = True
            print_status('success', f"ServiceAccount '{SERVICE_ACCOUNT}' exists")
        else:
            print_status('error', f"ServiceAccount '{SERVICE_ACCOUNT}' not found")
    except Exception as e:
        print_status('error', f"Failed to check ServiceAccount: {e}")
    
    # Check ClusterRole
    try:
        cr_result = run_kubectl(['get', 'clusterrole', 'spark-operator', '-o', 'json'], check=False)
        if cr_result.returncode == 0:
            result['cluster_role'] = True
            print_status('success', "ClusterRole 'spark-operator' exists")
        else:
            print_status('warning', "ClusterRole 'spark-operator' not found (may use different name)")
    except Exception as e:
        print_status('error', f"Failed to check ClusterRole: {e}")
    
    # Check ClusterRoleBinding
    try:
        crb_result = run_kubectl(['get', 'clusterrolebinding', 'spark-operator', '-o', 'json'], check=False)
        if crb_result.returncode == 0:
            result['cluster_role_binding'] = True
            print_status('success', "ClusterRoleBinding 'spark-operator' exists")
        else:
            print_status('warning', "ClusterRoleBinding 'spark-operator' not found (may use different name)")
    except Exception as e:
        print_status('error', f"Failed to check ClusterRoleBinding: {e}")
    
    return result


rbac_status = check_rbac_configuration()

Checking RBAC configuration...

[OK] ServiceAccount 'spark-operator' exists
[OK] ClusterRole 'spark-operator' exists
[OK] ClusterRoleBinding 'spark-operator' exists


## 5. Submit Test SparkApplication

In [9]:
def generate_test_spark_application(app_name: str) -> str:
    """Generate a simple test SparkApplication YAML (Pi calculation)."""
    return f"""apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: {app_name}
  namespace: {NAMESPACE}
spec:
  type: Scala
  mode: cluster
  sparkVersion: "{SPARK_VERSION}"
  image: "{SPARK_IMAGE}"
  imagePullPolicy: IfNotPresent
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-{SPARK_VERSION}.jar"
  arguments:
    - "100"
  sparkConf:
    "spark.jars.ivy": "/tmp/.ivy2"
  restartPolicy:
    type: Never
  driver:
    cores: 1
    memory: "512m"
    serviceAccount: {SERVICE_ACCOUNT}
    env:
      - name: HOME
        value: "/tmp"
  executor:
    instances: 1
    cores: 1
    memory: "512m"
    serviceAccount: {SERVICE_ACCOUNT}
    env:
      - name: HOME
        value: "/tmp"
"""


def generate_python_test_spark_application(app_name: str) -> str:
    """Generate a Python test SparkApplication YAML."""
    return f"""apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: {app_name}
  namespace: {NAMESPACE}
spec:
  type: Python
  mode: cluster
  pythonVersion: "3"
  sparkVersion: "{SPARK_VERSION}"
  image: "{SPARK_IMAGE}"
  imagePullPolicy: IfNotPresent
  mainApplicationFile: "local:///opt/spark/examples/src/main/python/pi.py"
  arguments:
    - "100"
  sparkConf:
    "spark.jars.ivy": "/tmp/.ivy2"
  restartPolicy:
    type: Never
  driver:
    cores: 1
    memory: "512m"
    serviceAccount: {SERVICE_ACCOUNT}
    env:
      - name: HOME
        value: "/tmp"
  executor:
    instances: 1
    cores: 1
    memory: "512m"
    serviceAccount: {SERVICE_ACCOUNT}
    env:
      - name: HOME
        value: "/tmp"
"""


# Generate a unique application name
TEST_APP_NAME = f"spark-test-{datetime.now().strftime('%Y%m%d-%H%M%S')}"
print(f"Test application name: {TEST_APP_NAME}")

Test application name: spark-test-20260112-125105


In [10]:
def submit_spark_application(app_name: str, use_python: bool = True) -> bool:
    """Submit a test SparkApplication to the cluster."""
    print(f"Submitting SparkApplication: {app_name}\n")
    
    # Generate YAML
    if use_python:
        yaml_content = generate_python_test_spark_application(app_name)
        print_status('info', "Using Python Pi example")
    else:
        yaml_content = generate_test_spark_application(app_name)
        print_status('info', "Using Scala SparkPi example")
    
    print(f"\nSparkApplication YAML:\n")
    print("-" * 60)
    print(yaml_content)
    print("-" * 60)
    
    try:
        # Apply the YAML
        result = subprocess.run(
            ['kubectl', 'apply', '-f', '-'],
            input=yaml_content,
            text=True,
            check=True,
            capture_output=True
        )
        print_status('success', f"SparkApplication created: {app_name}")
        print(f"    {result.stdout.strip()}")
        return True
        
    except subprocess.CalledProcessError as e:
        print_status('error', "Failed to create SparkApplication")
        print(f"    stderr: {e.stderr}")
        return False


# Submit the test application (Python by default)
# Set use_python=False to use Scala SparkPi instead
submitted = submit_spark_application(TEST_APP_NAME, use_python=True)

Submitting SparkApplication: spark-test-20260112-125105

[INFO] Using Python Pi example

SparkApplication YAML:

------------------------------------------------------------
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-test-20260112-125105
  namespace: dfp
spec:
  type: Python
  mode: cluster
  pythonVersion: "3"
  sparkVersion: "3.5.7"
  image: "apache/spark:3.5.7-python3"
  imagePullPolicy: IfNotPresent
  mainApplicationFile: "local:///opt/spark/examples/src/main/python/pi.py"
  arguments:
    - "100"
  sparkConf:
    "spark.jars.ivy": "/tmp/.ivy2"
  restartPolicy:
    type: Never
  driver:
    cores: 1
    memory: "512m"
    serviceAccount: spark-operator
    env:
      - name: HOME
        value: "/tmp"
  executor:
    instances: 1
    cores: 1
    memory: "512m"
    serviceAccount: spark-operator
    env:
      - name: HOME
        value: "/tmp"

------------------------------------------------------------
[OK] SparkApplication created: s

## 6. Monitor SparkApplication

In [11]:
def get_spark_application_status(app_name: str) -> Dict[str, Any]:
    """Get the current status of a SparkApplication."""
    app = kubectl_get_json('sparkapplication', app_name)
    
    if not app:
        return {'found': False, 'state': 'NOT_FOUND'}
    
    status = app.get('status', {})
    app_state = status.get('applicationState', {})
    
    return {
        'found': True,
        'state': app_state.get('state', 'UNKNOWN'),
        'error_message': app_state.get('errorMessage', ''),
        'driver_info': status.get('driverInfo', {}),
        'executor_state': status.get('executorState', {}),
        'last_submission_attempt_time': status.get('lastSubmissionAttemptTime', ''),
        'termination_time': status.get('terminationTime', ''),
        'spark_application_id': status.get('sparkApplicationId', '')
    }


def print_spark_application_status(app_name: str) -> Dict[str, Any]:
    """Print formatted SparkApplication status."""
    status = get_spark_application_status(app_name)
    
    if not status['found']:
        print_status('error', f"SparkApplication '{app_name}' not found")
        return status
    
    state = status['state']
    state_icons = {
        'COMPLETED': 'success',
        'RUNNING': 'info',
        'SUBMITTED': 'pending',
        'PENDING_RERUN': 'pending',
        'FAILED': 'error',
        'SUBMISSION_FAILED': 'error',
        'FAILING': 'warning',
        'INVALIDATING': 'warning'
    }
    
    icon = state_icons.get(state, 'info')
    print_status(icon, f"State: {state}")
    
    if status['spark_application_id']:
        print(f"    Spark Application ID: {status['spark_application_id']}")
    
    if status['driver_info']:
        driver = status['driver_info']
        print(f"    Driver Pod: {driver.get('podName', 'N/A')}")
        if driver.get('webUIAddress'):
            print(f"    Spark UI: {driver.get('webUIAddress')}")
    
    if status['error_message']:
        print(f"    Error: {status['error_message']}")
    
    return status


# Check current status
print(f"\nCurrent status of {TEST_APP_NAME}:\n")
current_status = print_spark_application_status(TEST_APP_NAME)


Current status of spark-test-20260112-125105:

[INFO] State: UNKNOWN
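
The `UNKNOWN` shown above is the fallback value in `get_spark_application_status`: it is returned whenever `status.applicationState.state` is absent, including when the resource has no `status` block at all. A small sketch that separates those two cases follows. The function name and the `NO_STATUS` label are hypothetical, not operator states.

```python
from typing import Any, Dict


def describe_state(app: Dict[str, Any]) -> str:
    """Return the application state, separating 'no status yet' from a real state."""
    status = app.get("status")
    if not status:
        # Hypothetical label: the resource exists but no status block has been written.
        return "NO_STATUS"
    state = status.get("applicationState", {}).get("state", "")
    # A status block without a state keeps the notebook's existing fallback.
    return state or "UNKNOWN"


print(describe_state({"metadata": {"name": "demo"}}))
print(describe_state({"status": {"applicationState": {"state": "RUNNING"}}}))
```

Seeing `NO_STATUS` for more than a few seconds would suggest looking at the operator logs (`get_spark_operator_logs()`), since the status block is expected to be populated by the operator.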


In [12]:
def monitor_spark_application(app_name: str, timeout_seconds: int = 300, poll_interval: int = 5) -> Dict[str, Any]:
    """Monitor a SparkApplication until completion or timeout."""
    print(f"Monitoring SparkApplication: {app_name}")
    print(f"Timeout: {timeout_seconds}s, Poll interval: {poll_interval}s\n")
    
    terminal_states = {'COMPLETED', 'FAILED', 'SUBMISSION_FAILED'}
    start_time = time.time()
    last_state = None
    
    while True:
        elapsed = int(time.time() - start_time)
        
        if elapsed > timeout_seconds:
            print_status('warning', f"Timeout after {elapsed}s")
            return {'success': False, 'reason': 'TIMEOUT', 'last_state': last_state}
        
        status = get_spark_application_status(app_name)
        
        if not status['found']:
            print_status('warning', f"[{elapsed}s] SparkApplication not found, waiting...")
            time.sleep(poll_interval)
            continue
        
        state = status['state']
        
        # Print state changes
        if state != last_state:
            timestamp = datetime.now().strftime('%H:%M:%S')
            print(f"[{timestamp}] [{elapsed}s] State: {state}")
            
            if status['driver_info'].get('podName'):
                print(f"           Driver: {status['driver_info']['podName']}")
            
            last_state = state
        
        # Check for terminal state
        if state in terminal_states:
            if state == 'COMPLETED':
                print_status('success', f"SparkApplication completed in {elapsed}s")
                return {'success': True, 'state': state, 'elapsed': elapsed}
            else:
                print_status('error', f"SparkApplication {state} after {elapsed}s")
                if status['error_message']:
                    print(f"    Error: {status['error_message']}")
                return {'success': False, 'state': state, 'error': status['error_message'], 'elapsed': elapsed}
        
        time.sleep(poll_interval)


# Monitor the test application
result = monitor_spark_application(TEST_APP_NAME, timeout_seconds=300)

Monitoring SparkApplication: spark-test-20260112-125105
Timeout: 300s, Poll interval: 5s

[12:52:26] [0s] State: UNKNOWN


KeyboardInterrupt: 
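
The run above stayed in `UNKNOWN` until it was interrupted by hand. One way to avoid waiting out the full timeout is a stall guard that gives up when the state has not changed for a while. The sketch below is an assumption-laden variant of the polling loop, not the notebook's `monitor_spark_application`: the `stall_seconds` parameter, the `STALLED` and `TIMEOUT` return values, and the injectable `clock` and `sleep` arguments are all introduced here for illustration.

```python
import time
from typing import Callable, Optional

TERMINAL_STATES = {"COMPLETED", "FAILED", "SUBMISSION_FAILED"}


def wait_for_terminal_state(
    fetch_state: Callable[[], str],
    timeout_seconds: float = 300,
    poll_interval: float = 5,
    stall_seconds: Optional[float] = 60,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_state until terminal; return the state, 'STALLED', or 'TIMEOUT'."""
    start = clock()
    last_state = None
    last_change = start
    while True:
        now = clock()
        if now - start > timeout_seconds:
            return "TIMEOUT"
        state = fetch_state()
        if state in TERMINAL_STATES:
            return state
        if state != last_state:
            # Remember when the state last changed so a stall can be detected.
            last_state, last_change = state, now
        elif stall_seconds is not None and now - last_change >= stall_seconds:
            return "STALLED"
        sleep(poll_interval)


# Usage with a canned sequence of states and no real sleeping:
states = iter(["SUBMITTED", "RUNNING", "COMPLETED"])
print(wait_for_terminal_state(lambda: next(states), sleep=lambda _: None))
```

In the notebook, `fetch_state` could be `lambda: get_spark_application_status(TEST_APP_NAME)['state']`. Passing `clock` and `sleep` as arguments keeps the loop testable without a cluster.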

## 7. Log Retrieval

In [13]:
def get_driver_logs(app_name: str, tail: int = 100) -> str:
    """Get logs from the driver pod."""
    print(f"Driver logs for {app_name} (last {tail} lines):\n")
    print("=" * 80)
    
    driver_pod = f"{app_name}-driver"
    
    try:
        result = run_kubectl(['-n', NAMESPACE, 'logs', driver_pod, f'--tail={tail}'], check=False)
        if result.returncode == 0:
            print(result.stdout)
            return result.stdout
        else:
            # Try with --previous flag for crashed containers
            result = run_kubectl(['-n', NAMESPACE, 'logs', driver_pod, '--previous', f'--tail={tail}'], check=False)
            if result.returncode == 0:
                print("(Previous container logs)")
                print(result.stdout)
                return result.stdout
            else:
                print(f"No logs available: {result.stderr}")
                return ""
    except Exception as e:
        print(f"Error getting logs: {e}")
        return ""
    finally:
        print("=" * 80)


# Get driver logs
driver_logs = get_driver_logs(TEST_APP_NAME)

Driver logs for spark-test-20260112-125105 (last 100 lines):

26/01/12 20:51:43 INFO TaskSetManager: Finished task 89.0 in stage 0.0 (TID 89) in 71 ms on 10.244.0.37 (executor 1) (90/100)
26/01/12 20:51:43 INFO TaskSetManager: Starting task 91.0 in stage 0.0 (TID 91) (10.244.0.37, executor 1, partition 91, PROCESS_LOCAL, 8998 bytes) 
26/01/12 20:51:43 INFO TaskSetManager: Finished task 90.0 in stage 0.0 (TID 90) in 70 ms on 10.244.0.37 (executor 1) (91/100)
26/01/12 20:51:43 INFO TaskSetManager: Starting task 92.0 in stage 0.0 (TID 92) (10.244.0.37, executor 1, partition 92, PROCESS_LOCAL, 8998 bytes) 
26/01/12 20:51:43 INFO TaskSetManager: Finished task 91.0 in stage 0.0 (TID 91) in 72 ms on 10.244.0.37 (executor 1) (92/100)
26/01/12 20:51:43 INFO TaskSetManager: Starting task 93.0 in stage 0.0 (TID 93) (10.244.0.37, executor 1, partition 93, PROCESS_LOCAL, 8998 bytes) 
26/01/12 20:51:43 INFO TaskSetManager: Finished task 92.0 in stage 0.0 (TID 92) in 71 ms on 10.244.0.37 (executor 1)
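
The `TaskSetManager` lines above end with a `(finished/total)` counter, which gives a rough progress figure for the job. The following sketch extracts the most recent counter from captured driver logs. It assumes the log format shown above, and the helper name is hypothetical.

```python
import re
from typing import Optional, Tuple

# Matches the trailing "(finished/total)" counter on "Finished task" lines.
_PROGRESS = re.compile(r"Finished task .* \((\d+)/(\d+)\)\s*$")


def latest_task_progress(log_text: str) -> Optional[Tuple[int, int]]:
    """Return (finished, total) from the last matching log line, or None."""
    # Walk backwards so the most recent counter wins.
    for line in reversed(log_text.splitlines()):
        match = _PROGRESS.search(line)
        if match:
            return int(match.group(1)), int(match.group(2))
    return None


sample = (
    "26/01/12 20:51:43 INFO TaskSetManager: Finished task 91.0 in stage 0.0 "
    "(TID 91) in 72 ms on 10.244.0.37 (executor 1) (92/100)"
)
print(latest_task_progress(sample))
```

Lines without a counter, such as `Starting task` entries or truncated lines, are skipped rather than treated as errors.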

In [14]:
def get_executor_pods(app_name: str) -> List[str]:
    """Get list of executor pod names."""
    try:
        result = run_kubectl([
            '-n', NAMESPACE, 'get', 'pods',
            '-l', f'sparkoperator.k8s.io/app-name={app_name},spark-role=executor',
            '-o', 'jsonpath={.items[*].metadata.name}'
        ])
        pods = result.stdout.strip().split()
        return [p for p in pods if p]
    except Exception:
        return []


def get_executor_logs(app_name: str, tail: int = 50) -> Dict[str, str]:
    """Get logs from all executor pods."""
    executor_pods = get_executor_pods(app_name)
    
    if not executor_pods:
        print("No executor pods found")
        return {}
    
    logs = {}
    for pod in executor_pods:
        print(f"\nExecutor logs for {pod} (last {tail} lines):")
        print("-" * 60)
        try:
            result = run_kubectl(['-n', NAMESPACE, 'logs', pod, f'--tail={tail}'], check=False)
            if result.returncode == 0:
                print(result.stdout)
                logs[pod] = result.stdout
            else:
                print(f"No logs available: {result.stderr}")
        except Exception as e:
            print(f"Error: {e}")
        print("-" * 60)
    
    return logs


# Uncomment to get executor logs
# executor_logs = get_executor_logs(TEST_APP_NAME)

## 8. Troubleshooting Helpers

In [15]:
def diagnose_spark_application(app_name: str) -> Dict[str, Any]:
    """Run comprehensive diagnostics on a SparkApplication."""
    print(f"Diagnosing SparkApplication: {app_name}\n")
    print("=" * 80)
    
    diagnosis = {
        'app_status': None,
        'pod_issues': [],
        'events': [],
        'recommendations': []
    }
    
    # 1. Get SparkApplication status
    print("\n[1] SparkApplication Status")
    print("-" * 40)
    status = get_spark_application_status(app_name)
    diagnosis['app_status'] = status
    
    if not status['found']:
        print_status('error', "SparkApplication not found")
        diagnosis['recommendations'].append("Verify the application name and namespace")
        return diagnosis
    
    print(f"State: {status['state']}")
    if status['error_message']:
        print(f"Error: {status['error_message']}")
    
    # 2. Check driver pod
    print("\n[2] Driver Pod Status")
    print("-" * 40)
    driver_pod = f"{app_name}-driver"
    
    try:
        result = run_kubectl(['-n', NAMESPACE, 'get', 'pod', driver_pod, '-o', 'json'], check=False)
        if result.returncode == 0:
            pod_data = json.loads(result.stdout)
            pod_status = pod_data.get('status', {})
            phase = pod_status.get('phase', 'Unknown')
            print(f"Pod: {driver_pod}")
            print(f"Phase: {phase}")
            
            # Check container statuses
            for cs in pod_status.get('containerStatuses', []):
                state = cs.get('state', {})
                if 'waiting' in state:
                    reason = state['waiting'].get('reason', 'Unknown')
                    message = state['waiting'].get('message', '')
                    print(f"Container waiting: {reason}")
                    if message:
                        print(f"  Message: {message[:200]}")
                    
                    if 'ImagePull' in reason:
                        diagnosis['pod_issues'].append(f"Image pull issue: {reason}")
                        diagnosis['recommendations'].append(
                            f"Pre-load image into kind: docker pull {SPARK_IMAGE} && "
                            f"kind load docker-image {SPARK_IMAGE} --name {CLUSTER_NAME}"
                        )
                elif 'terminated' in state:
                    exit_code = state['terminated'].get('exitCode', -1)
                    reason = state['terminated'].get('reason', 'Unknown')
                    print(f"Container terminated: {reason} (exit code: {exit_code})")
                    diagnosis['pod_issues'].append(f"Container terminated: {reason}")
        else:
            print(f"Driver pod not found: {driver_pod}")
            diagnosis['recommendations'].append("Check if Spark Operator is running and has permissions")
    except Exception as e:
        print(f"Error checking driver pod: {e}")
    
    # 3. Get events
    print("\n[3] Recent Events")
    print("-" * 40)
    try:
        result = run_kubectl([
            '-n', NAMESPACE, 'get', 'events',
            '--field-selector', f'involvedObject.name={app_name}',
            '--sort-by=.lastTimestamp'
        ], check=False)
        if result.stdout.strip():
            print(result.stdout)
            diagnosis['events'] = result.stdout.split('\n')
        else:
            print("No events found for this application")
            
        # Also check driver pod events
        result = run_kubectl([
            '-n', NAMESPACE, 'get', 'events',
            '--field-selector', f'involvedObject.name={driver_pod}',
            '--sort-by=.lastTimestamp'
        ], check=False)
        if result.stdout.strip():
            print(f"\nDriver pod events:")
            print(result.stdout)
    except Exception as e:
        print(f"Error getting events: {e}")
    
    # 4. Recommendations
    if diagnosis['recommendations']:
        print("\n[4] Recommendations")
        print("-" * 40)
        for i, rec in enumerate(diagnosis['recommendations'], 1):
            print(f"{i}. {rec}")
    
    print("\n" + "=" * 80)
    return diagnosis


# Run diagnostics on the test application
# diagnosis = diagnose_spark_application(TEST_APP_NAME)

In [22]:
def list_all_spark_applications() -> List[Dict]:
    """List all SparkApplications in the namespace."""
    print(f"SparkApplications in namespace '{NAMESPACE}':\n")
    
    apps = kubectl_get_json('sparkapplication', '')
    
    if not apps or not apps.get('items'):
        print("No SparkApplications found")
        return []
    
    result = []
    for app in apps.get('items', []):
        name = app.get('metadata', {}).get('name', 'unknown')
        state = app.get('status', {}).get('applicationState', {}).get('state', 'UNKNOWN')
        created = app.get('metadata', {}).get('creationTimestamp', 'N/A')
        
        result.append({'name': name, 'state': state, 'created': created})
        
        state_icon = {
            'COMPLETED': '[OK]',
            'RUNNING': '[RUN]',
            'SUBMITTED': '[SUB]',
            'FAILED': '[FAIL]',
            'SUBMISSION_FAILED': '[FAIL]'
        }.get(state, '[?]')
        
        print(f"  {state_icon:7s} {name:50s} {created}")
    
    return result


all_apps = list_all_spark_applications()

SparkApplications in namespace 'dfp':

  [?]     spark-test-20260112-125105                         2026-01-12T20:51:31Z
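
The listing prints the raw `creationTimestamp`, which is in UTC. When scanning many applications, an age column can be easier to read. A minimal sketch, assuming timestamps always use the second-resolution `...Z` form shown above (both helper names are hypothetical):

```python
from datetime import datetime, timezone
from typing import Optional


def parse_k8s_timestamp(value: str) -> datetime:
    """Parse a UTC timestamp such as '2026-01-12T20:51:31Z'."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def format_age(created: str, now: Optional[datetime] = None) -> str:
    """Render the time since `created` as a compact string like '1h05m'."""
    now = now or datetime.now(timezone.utc)
    # Clamp at zero in case of small clock skew between cluster and notebook host.
    seconds = max(0, int((now - parse_k8s_timestamp(created)).total_seconds()))
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    if hours:
        return f"{hours}h{minutes:02d}m"
    if minutes:
        return f"{minutes}m{secs:02d}s"
    return f"{secs}s"


print(format_age("2026-01-12T20:51:31Z"))
```

The optional `now` argument exists so the function can be exercised with a fixed reference time.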


In [19]:
def check_cluster_resources() -> Dict[str, Any]:
    """Check available cluster resources."""
    print("Cluster Resource Summary\n")
    print("=" * 60)
    
    result = {'nodes': [], 'pods_in_namespace': 0}
    
    # Get node resources
    try:
        nodes = kubectl_get_json('nodes', '')
        if nodes:
            print("\nNodes:")
            for node in nodes.get('items', []):
                name = node.get('metadata', {}).get('name', 'unknown')
                status = node.get('status', {})
                allocatable = status.get('allocatable', {})
                
                cpu = allocatable.get('cpu', 'N/A')
                memory = allocatable.get('memory', 'N/A')
                
                # Check conditions
                ready = 'Unknown'
                for cond in status.get('conditions', []):
                    if cond.get('type') == 'Ready':
                        ready = cond.get('status', 'Unknown')
                
                print(f"  {name}: Ready={ready}, CPU={cpu}, Memory={memory}")
                result['nodes'].append({'name': name, 'ready': ready, 'cpu': cpu, 'memory': memory})
    except Exception as e:
        print(f"Error getting nodes: {e}")
    
    # Get pod count in namespace
    try:
        pods = kubectl_get_json('pods', '')
        if pods:
            pod_count = len(pods.get('items', []))
            result['pods_in_namespace'] = pod_count
            print(f"\nPods in '{NAMESPACE}': {pod_count}")
    except Exception as e:
        print(f"Error getting pods: {e}")
    
    print("\n" + "=" * 60)
    return result


# Check cluster resources
resources = check_cluster_resources()

Cluster Resource Summary


Nodes:
  dfp-kind-control-plane: Ready=True, CPU=10, Memory=8025424Ki

Pods in 'dfp': 8
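
Node memory is reported as a Kubernetes quantity (`8025424Ki` above), which is awkward to compare against the driver and executor memory requests. The sketch below converts the common integer forms to bytes and GiB. It is deliberately partial: fractional values, exponent notation, and the milli suffix are not handled, and the helper names are hypothetical.

```python
# Binary (power-of-two) and decimal (power-of-ten) suffixes for memory quantities.
_BINARY = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}
_DECIMAL = {"k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4}


def memory_to_bytes(quantity: str) -> int:
    """Convert an integer Kubernetes memory quantity (e.g. '8025424Ki') to bytes."""
    for suffix, factor in {**_BINARY, **_DECIMAL}.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: already a plain byte count


def to_gib(quantity: str) -> float:
    """Express a memory quantity in GiB."""
    return memory_to_bytes(quantity) / 1024**3


print(f"{to_gib('8025424Ki'):.2f} GiB allocatable")
```

Note that the `memory: "512m"` values in the SparkApplication specs use Spark's own size notation, so they should not be passed to this parser.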



## 9. Cleanup

In [18]:
def delete_spark_application(app_name: str, wait: bool = True) -> bool:
    """Delete a SparkApplication."""
    print(f"Deleting SparkApplication: {app_name}")
    
    try:
        result = run_kubectl(['-n', NAMESPACE, 'delete', 'sparkapplication', app_name], check=False)
        
        if result.returncode == 0:
            print_status('success', f"SparkApplication '{app_name}' deleted")
            
            if wait:
                print("Waiting for resources to be cleaned up...")
                time.sleep(5)
                
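                # Verify pods are gone
                pods = get_executor_pods(app_name)
                # check=False avoids a bare except and the noisy "Command failed"
                # output when the driver pod has already been removed
                driver_check = run_kubectl(
                    ['-n', NAMESPACE, 'get', 'pod', f'{app_name}-driver'], check=False
                )
                driver_exists = driver_check.returncode == 0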
                
                if not pods and not driver_exists:
                    print_status('success', "All pods cleaned up")
                else:
                    print_status('warning', "Some pods may still be terminating")
            
            return True
        else:
            if 'NotFound' in result.stderr:
                print_status('warning', f"SparkApplication '{app_name}' not found (already deleted?)")
            else:
                print_status('error', f"Failed to delete: {result.stderr}")
            return False
            
    except Exception as e:
        print_status('error', f"Error deleting SparkApplication: {e}")
        return False


# Delete the test application
# Uncomment the line below to delete
# delete_spark_application(TEST_APP_NAME)
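
`delete_spark_application` waits a fixed five seconds before checking for leftover pods, which can be too short for slow terminations and longer than needed for fast ones. One possible alternative is to poll until a condition holds or a timeout expires. The sketch below is a generic helper; `wait_until` and its parameters are hypothetical names, not part of the notebook.

```python
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool], timeout: float = 60.0, interval: float = 2.0) -> bool:
    """Poll `predicate` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Demo with a stand-in predicate that succeeds on the third call.
calls = iter([False, False, True])
print(wait_until(lambda: next(calls), timeout=5, interval=0.01))  # True

# Possible use in place of time.sleep(5), assuming get_executor_pods
# returns an empty list once the executors are gone:
# wait_until(lambda: not get_executor_pods(app_name), timeout=60)
```

The predicate is always checked at least once, so a zero timeout still reports a condition that already holds.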

In [None]:
def cleanup_completed_applications(dry_run: bool = True) -> int:
    """Delete all completed or failed SparkApplications."""
    print(f"Cleaning up completed/failed SparkApplications (dry_run={dry_run})\n")
    
    apps = kubectl_get_json('sparkapplication', '')
    
    if not apps or not apps.get('items'):
        print("No SparkApplications found")
        return 0
    
    to_delete = []
    terminal_states = {'COMPLETED', 'FAILED', 'SUBMISSION_FAILED'}
    
    for app in apps.get('items', []):
        name = app.get('metadata', {}).get('name', '')
        state = app.get('status', {}).get('applicationState', {}).get('state', '')
        
        if state in terminal_states:
            to_delete.append({'name': name, 'state': state})
    
    if not to_delete:
        print("No completed/failed applications to clean up")
        return 0
    
    print(f"Found {len(to_delete)} application(s) to clean up:")
    for app in to_delete:
        print(f"  - {app['name']} ({app['state']})")
    
    if dry_run:
        print(f"\nDRY RUN: Would delete {len(to_delete)} application(s)")
        print("Set dry_run=False to actually delete")
    else:
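        print(f"\nDeleting {len(to_delete)} application(s)...")
        # Count only successful deletions; delete_spark_application returns False on failure
        deleted = 0
        for app in to_delete:
            if delete_spark_application(app['name'], wait=False):
                deleted += 1
        print(f"\nDeleted {deleted} of {len(to_delete)} application(s)")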
    
    return len(to_delete)


# Preview cleanup (dry run)
cleanup_completed_applications(dry_run=True)

# Uncomment to actually delete
# cleanup_completed_applications(dry_run=False)

Cleaning up completed/failed SparkApplications (dry_run=True)

No completed/failed applications to clean up


0
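
The cleanup above selects applications purely by terminal state. If recently finished applications should be kept for inspection, the selection could also take age into account. The sketch below works on the `items` list of a `kubectl get sparkapplication -o json` response and uses `metadata.creationTimestamp`; `select_old_terminal_apps` and the sample data are hypothetical and only illustrate the filtering logic.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List, Optional

TERMINAL_STATES = {'COMPLETED', 'FAILED', 'SUBMISSION_FAILED'}


def select_old_terminal_apps(
    items: List[Dict[str, Any]],
    min_age: timedelta,
    now: Optional[datetime] = None,
) -> List[str]:
    """Return names of terminal-state applications created at least `min_age` ago."""
    now = now or datetime.now(timezone.utc)
    selected = []
    for app in items:
        state = app.get('status', {}).get('applicationState', {}).get('state', '')
        created = app.get('metadata', {}).get('creationTimestamp')
        if state not in TERMINAL_STATES or not created:
            continue
        # Kubernetes timestamps are RFC 3339 in UTC, e.g. '2024-01-01T12:00:00Z'.
        created_at = datetime.strptime(created, '%Y-%m-%dT%H:%M:%SZ').replace(tzinfo=timezone.utc)
        if now - created_at >= min_age:
            selected.append(app['metadata']['name'])
    return selected


# Made-up sample data for illustration only.
sample = [
    {'metadata': {'name': 'pi-old', 'creationTimestamp': '2024-01-01T00:00:00Z'},
     'status': {'applicationState': {'state': 'COMPLETED'}}},
    {'metadata': {'name': 'pi-new', 'creationTimestamp': '2024-01-01T11:30:00Z'},
     'status': {'applicationState': {'state': 'FAILED'}}},
    {'metadata': {'name': 'pi-running', 'creationTimestamp': '2024-01-01T00:00:00Z'},
     'status': {'applicationState': {'state': 'RUNNING'}}},
]
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(select_old_terminal_apps(sample, timedelta(hours=1), now=now))  # ['pi-old']
```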

## 10. Quick Reference Commands

Useful kubectl commands for Spark Operator debugging:

```bash
# List all SparkApplications
kubectl -n dfp get sparkapplication

# Describe a SparkApplication
kubectl -n dfp describe sparkapplication <app-name>

# Get SparkApplication YAML
kubectl -n dfp get sparkapplication <app-name> -o yaml

# View driver logs
kubectl -n dfp logs <app-name>-driver

# Follow driver logs
kubectl -n dfp logs -f <app-name>-driver

# View Spark Operator logs
kubectl -n dfp logs deployment/spark-operator

# List all Spark-related pods
kubectl -n dfp get pods -l sparkoperator.k8s.io/app-name

# Delete a SparkApplication
kubectl -n dfp delete sparkapplication <app-name>

# Check CRDs
kubectl get crd | grep spark

# Pre-load Spark image into kind
docker pull apache/spark:3.5.7-python3
kind load docker-image apache/spark:3.5.7-python3 --name dfp-kind
```

## 11. Summary and Next Steps

### What We Tested

1. Cluster connectivity and namespace existence
2. Spark Operator deployment and readiness
3. CRD installation (SparkApplication, ScheduledSparkApplication)
4. RBAC configuration (ServiceAccount, ClusterRole, ClusterRoleBinding)
5. SparkApplication submission and lifecycle monitoring
6. Log retrieval from driver and executor pods
7. Troubleshooting and cluster resource checks
8. Cleanup of test resources

### Common Issues and Solutions

| Issue | Solution |
|-------|----------|
| ImagePullBackOff | Pre-load image: `kind load docker-image <image> --name dfp-kind` |
| SparkApplication stuck in SUBMITTED | Check operator logs: `kubectl -n dfp logs deployment/spark-operator` |
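| Permission denied | Verify the ServiceAccount and its RBAC bindings, e.g. `kubectl auth can-i create pods -n dfp --as=system:serviceaccount:dfp:spark-operator` |
| Driver OOMKilled | Increase `spec.driver.memory` (or `spec.driver.memoryOverhead`) in the SparkApplication spec |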
| Executor not starting | Check resource availability with `kubectl describe node` |

### For Production Workloads

See `orchestration/kubeflow/dfp_kfp/components/kronodroid_spark_operator_transform_component.py` for a complete example with:
- Iceberg integration
- S3/MinIO connectivity
- LakeFS catalog configuration
- Proper dependency management