# Module 2: SLURM and Spark on SDSC Expanse

**DSC 232R - Big Data Analysis Using Spark**

This notebook covers:
1. Understanding SLURM job scheduling
2. Writing effective SLURM scripts
3. Configuring Spark for HPC environments
4. Using Singularity containers

**Note**: This notebook is designed for learning the concepts locally. The SLURM commands are shown for reference only; running them requires access to an HPC cluster.

## Key Takeaways

- **SLURM** manages resources and schedules jobs on HPC clusters
- **sbatch** submits batch jobs; **srun** runs interactive/parallel commands
- **Spark configuration** must match your SLURM resource allocation
- **Singularity containers** ensure reproducible environments

---

## 1. SLURM Basics

### What is SLURM?

SLURM (Simple Linux Utility for Resource Management) is a job scheduler that:
- Allocates compute resources (nodes, CPUs, memory, GPUs)
- Manages job queues with priorities
- Tracks usage for billing/allocation purposes

In [None]:
# SLURM Command Reference (for use on SDSC)
# These commands won't work locally - they're here for documentation

slurm_commands = {
    'sbatch': 'Submit a batch job script',
    'srun': 'Run a command within an allocation',
    'squeue': 'View job queue status',
    'scancel': 'Cancel a job',
    'sinfo': 'View partition/node information',
    'sacct': 'View job accounting data',
    'seff': 'View job efficiency metrics'
}

print("SLURM Command Reference")
print("=" * 50)
for cmd, desc in slurm_commands.items():
    print(f"{cmd:12} - {desc}")

SLURM Command Reference
sbatch       - Submit a batch job script
srun         - Run a command within an allocation
squeue       - View job queue status
scancel      - Cancel a job
sinfo        - View partition/node information
sacct        - View job accounting data
seff         - View job efficiency metrics


### Expanse Partitions

Different partitions have different resource limits and queue times:

In [None]:
import pandas as pd

partitions = pd.DataFrame({
    'Partition': ['debug', 'shared', 'compute', 'gpu', 'gpu-shared'],
    'Max Nodes': [2, 1, 32, 4, 1],
    'Max Time': ['30 min', '48 hrs', '48 hrs', '48 hrs', '48 hrs'],
    'Max Cores/Node': [128, 128, 128, 40, 10],
    'Use Case': [
        'Testing, development',
        'Small jobs (<128 cores)',
        'Large parallel jobs',
        'Multi-GPU jobs',
        'Single GPU jobs'
    ]
})

print("SDSC Expanse Partitions")
print("=" * 80)
print(partitions.to_string(index=False))

SDSC Expanse Partitions
 Partition  Max Nodes Max Time  Max Cores/Node                Use Case
     debug          2   30 min             128    Testing, development
    shared          1   48 hrs             128 Small jobs (<128 cores)
   compute         32   48 hrs             128     Large parallel jobs
       gpu          4   48 hrs              40          Multi-GPU jobs
gpu-shared          1   48 hrs              10         Single GPU jobs
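
To make the table easier to apply, here is a minimal sketch that filters the CPU partitions by a request's node count, cores per node, and runtime. The limits are copied from the table above and should be re-checked against current SDSC documentation; `PARTITION_LIMITS` and `fitting_partitions` are illustrative names, not part of any SLURM or Expanse tool.

```python
# Limits copied from the partition table above (CPU partitions only).
PARTITION_LIMITS = {
    "debug":   {"max_nodes": 2,  "max_hours": 0.5, "max_cores_per_node": 128},
    "shared":  {"max_nodes": 1,  "max_hours": 48,  "max_cores_per_node": 128},
    "compute": {"max_nodes": 32, "max_hours": 48,  "max_cores_per_node": 128},
}

def fitting_partitions(nodes, cores_per_node, hours):
    """Return the partitions whose limits cover the requested resources."""
    return [
        name
        for name, limits in PARTITION_LIMITS.items()
        if nodes <= limits["max_nodes"]
        and cores_per_node <= limits["max_cores_per_node"]
        and hours <= limits["max_hours"]
    ]

print(fitting_partitions(nodes=1, cores_per_node=32, hours=4))
print(fitting_partitions(nodes=4, cores_per_node=128, hours=12))
```

A request that fits several partitions still needs a judgment call: `shared` suits jobs that use part of a node, while `compute` is meant for jobs that use whole nodes.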


---

## 2. Anatomy of a SLURM Script

Let's break down a SLURM job script piece by piece:

In [None]:
slurm_script = '''
#!/bin/bash
#SBATCH --job-name=my_spark_job      # Name shown in queue
#SBATCH --partition=shared           # Which queue to use
#SBATCH --nodes=1                    # Number of nodes
#SBATCH --ntasks-per-node=1          # MPI tasks (usually 1 for Spark)
#SBATCH --cpus-per-task=32           # CPU cores per task
#SBATCH --mem=128G                   # Memory per node
#SBATCH --time=04:00:00              # Wall clock limit (HH:MM:SS)
#SBATCH --account=uci150             # Allocation to charge
#SBATCH --output=logs/job_%j.out     # Standard output (%j = job ID)
#SBATCH --error=logs/job_%j.err      # Standard error

# Load required modules
module load singularitypro

# Run the job
singularity exec container.sif python script.py
'''

print("Example SLURM Script")
print("=" * 60)
print(slurm_script)

Example SLURM Script

#!/bin/bash
#SBATCH --job-name=my_spark_job      # Name shown in queue
#SBATCH --partition=shared           # Which queue to use
#SBATCH --nodes=1                    # Number of nodes
#SBATCH --ntasks-per-node=1          # MPI tasks (usually 1 for Spark)
#SBATCH --cpus-per-task=32           # CPU cores per task
#SBATCH --mem=128G                   # Memory per node
#SBATCH --time=04:00:00              # Wall clock limit (HH:MM:SS)
#SBATCH --account=uci150             # Allocation to charge
#SBATCH --output=logs/job_%j.out     # Standard output (%j = job ID)
#SBATCH --error=logs/job_%j.err      # Standard error

# Load required modules
module load singularitypro

# Run the job
singularity exec container.sif python script.py
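
The triple-quoted string above starts with a blank line, so saving it to a file unchanged would put `#!/bin/bash` on line 2 rather than line 1. The sketch below builds the same script from a dictionary of directives and keeps the shebang first. `render_sbatch_script` is a hypothetical helper written for this notebook, not a SLURM API.

```python
def render_sbatch_script(directives, commands):
    """Build a job script as text; the shebang must be the very first line."""
    lines = ["#!/bin/bash"]
    lines += [f"#SBATCH --{key}={value}" for key, value in directives.items()]
    lines.append("")          # blank line between directives and commands
    lines += commands
    return "\n".join(lines) + "\n"

script = render_sbatch_script(
    {
        "job-name": "my_spark_job",
        "partition": "shared",
        "nodes": 1,
        "ntasks-per-node": 1,
        "cpus-per-task": 32,
        "mem": "128G",
        "time": "04:00:00",
        "account": "uci150",
        "output": "logs/job_%j.out",
        "error": "logs/job_%j.err",
    },
    [
        "module load singularitypro",
        "singularity exec container.sif python script.py",
    ],
)
print(script)
```

As far as I know, SLURM does not create the directory named in `--output`, so running `mkdir -p logs` before `sbatch` is a safe habit.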



### Understanding SBATCH Options

In [None]:
sbatch_options = pd.DataFrame({
    'Option': [
        '--partition', '--nodes', '--ntasks', '--cpus-per-task',
        '--mem', '--time', '--gres', '--account'
    ],
    'Description': [
        'Queue/partition name',
        'Number of compute nodes',
        'Total MPI tasks',
        'CPU cores per task',
        'Memory per node',
        'Wall clock time limit',
        'Generic resources (GPUs)',
        'Allocation account'
    ],
    'Example': [
        'shared, compute, gpu',
        '1, 4, 32',
        '1 (for Python/Spark)',
        '8, 32, 64, 128',
        '32G, 128G, 256G',
        '01:00:00, 48:00:00',
        'gpu:1, gpu:4',
        'uci150'
    ]
})

print("SBATCH Options Reference")
print("=" * 80)
print(sbatch_options.to_string(index=False))

SBATCH Options Reference
         Option              Description              Example
    --partition     Queue/partition name shared, compute, gpu
        --nodes  Number of compute nodes             1, 4, 32
       --ntasks          Total MPI tasks 1 (for Python/Spark)
--cpus-per-task       CPU cores per task       8, 32, 64, 128
          --mem          Memory per node      32G, 128G, 256G
         --time    Wall clock time limit   01:00:00, 48:00:00
         --gres Generic resources (GPUs)         gpu:1, gpu:4
      --account       Allocation account               uci150
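
The `--time` examples above use the `HH:MM:SS` form. A small sketch for producing that string from a duration in hours follows; `format_walltime` is an illustrative name, and it assumes the hour field may exceed 24, as in the `48:00:00` example.

```python
def format_walltime(hours):
    """Convert a duration in hours to the HH:MM:SS form used by --time."""
    total_seconds = round(hours * 3600)
    hh, remainder = divmod(total_seconds, 3600)
    mm, ss = divmod(remainder, 60)
    return f"{hh:02d}:{mm:02d}:{ss:02d}"

for hours in (0.5, 1.25, 2, 48):
    print(f"{hours:>5} h -> {format_walltime(hours)}")
```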


---

## 3. Spark Configuration for HPC

### Memory Layout

When running Spark on HPC, you need to carefully allocate memory:

In [None]:
def calculate_spark_config(total_memory_gb, num_cores):
    """
    Calculate optimal Spark configuration based on SLURM allocation.

    Guidelines:
    - Driver: ~10-15% of total memory
    - Executor: ~75-80% of total memory
    - Overhead: ~10% for OS and Spark internals
    """
    driver_mem = max(4, int(total_memory_gb * 0.12))  # 12% for driver
    overhead = max(2, int(total_memory_gb * 0.08))    # 8% overhead
    executor_mem = total_memory_gb - driver_mem - overhead

    # Shuffle partitions: 2-3x number of cores
    shuffle_partitions = num_cores * 2

    return {
        'spark.driver.memory': f'{driver_mem}g',
        'spark.executor.memory': f'{executor_mem}g',
        'spark.sql.shuffle.partitions': shuffle_partitions,
        'spark.default.parallelism': shuffle_partitions,
        'spark.driver.maxResultSize': '4g'
    }

# Example configurations
configs = [
    (32, 8),    # Small job: 32 GB, 8 cores
    (128, 32),  # Medium job: 128 GB, 32 cores
    (256, 64),  # Large job: 256 GB, 64 cores
]

print("Spark Configuration Examples")
print("=" * 60)

for mem, cores in configs:
    config = calculate_spark_config(mem, cores)
    print(f"\n{mem} GB Memory, {cores} Cores:")
    for key, value in config.items():
        print(f"  {key}: {value}")

Spark Configuration Examples

32 GB Memory, 8 Cores:
  spark.driver.memory: 4g
  spark.executor.memory: 26g
  spark.sql.shuffle.partitions: 16
  spark.default.parallelism: 16
  spark.driver.maxResultSize: 4g

128 GB Memory, 32 Cores:
  spark.driver.memory: 15g
  spark.executor.memory: 103g
  spark.sql.shuffle.partitions: 64
  spark.default.parallelism: 64
  spark.driver.maxResultSize: 4g

256 GB Memory, 64 Cores:
  spark.driver.memory: 30g
  spark.executor.memory: 206g
  spark.sql.shuffle.partitions: 128
  spark.default.parallelism: 128
  spark.driver.maxResultSize: 4g


### SparkSession Configuration Template

In [None]:
spark_template = '''
from pyspark.sql import SparkSession
import os

def create_spark_session_for_slurm():
    """
    Create SparkSession configured for SLURM environment.
    Reads resource allocation from SLURM environment variables.
    """
    # Get SLURM allocation
    cpus = int(os.environ.get('SLURM_CPUS_PER_TASK', 8))
    mem_str = os.environ.get('SLURM_MEM_PER_NODE', '32G')

    # Parse memory (handle both '32G' and '32000' formats)
    if 'G' in mem_str:
        mem_gb = int(mem_str.replace('G', ''))
    else:
        mem_gb = int(mem_str) // 1024

    # Calculate allocations
    driver_mem = max(4, mem_gb // 8)
    executor_mem = mem_gb - driver_mem - 2

    spark = SparkSession.builder \\
        .appName(os.environ.get('SLURM_JOB_NAME', 'SparkJob')) \\
        .master(f'local[{cpus}]') \\
        .config('spark.driver.memory', f'{driver_mem}g') \\
        .config('spark.executor.memory', f'{executor_mem}g') \\
        .config('spark.driver.maxResultSize', '4g') \\
        .config('spark.sql.shuffle.partitions', str(cpus * 2)) \\
        .config('spark.default.parallelism', str(cpus * 2)) \\
        .getOrCreate()

    return spark
'''

print("SparkSession Template for SLURM")
print("=" * 60)
print(spark_template)

SparkSession Template for SLURM

from pyspark.sql import SparkSession
import os

def create_spark_session_for_slurm():
    """
    Create SparkSession configured for SLURM environment.
    Reads resource allocation from SLURM environment variables.
    """
    # Get SLURM allocation
    cpus = int(os.environ.get('SLURM_CPUS_PER_TASK', 8))
    mem_str = os.environ.get('SLURM_MEM_PER_NODE', '32G')
    
    # Parse memory (handle both '32G' and '32000' formats)
    if 'G' in mem_str:
        mem_gb = int(mem_str.replace('G', ''))
    else:
        mem_gb = int(mem_str) // 1024
    
    # Calculate allocations
    driver_mem = max(4, mem_gb // 8)
    executor_mem = mem_gb - driver_mem - 2
    
    spark = SparkSession.builder \
        .appName(os.environ.get('SLURM_JOB_NAME', 'SparkJob')) \
        .master(f'local[{cpus}]') \
        .config('spark.driver.memory', f'{driver_mem}g') \
        .config('spark.executor.memory', f'{executor_mem}g') \
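        .config('spark.driver.maxResultSize', '4g') \
        .config('spark.sql.shuffle.partitions', str(cpus * 2)) \
        .config('spark.default.parallelism', str(cpus * 2)) \
        .getOrCreate()
    
    return spark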
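
The memory parsing in the template handles `'32G'` and a bare number, but it would raise an error on values such as `'32000M'` or a lowercase `'32g'`. A slightly more tolerant sketch is below. It assumes SLURM's convention that a bare number means megabytes and that the suffixes are K, M, G, or T; `slurm_mem_to_gb` is a hypothetical helper name.

```python
def slurm_mem_to_gb(mem_str):
    """Convert a SLURM memory string ('128G', '32000M', '32000') to whole GB."""
    units_in_mb = {"K": 1 / 1024, "M": 1, "G": 1024, "T": 1024 * 1024}
    text = mem_str.strip().upper()
    if text and text[-1] in units_in_mb:
        megabytes = float(text[:-1]) * units_in_mb[text[-1]]
    else:
        megabytes = float(text)   # bare number: treated as MB
    return int(megabytes // 1024)  # round down to whole GB

for value in ("128G", "32g", "131072M", "32000"):
    print(f"{value:>8} -> {slurm_mem_to_gb(value)} GB")
```

A separate caveat on the template itself: with `local[N]` the driver and the worker threads share a single JVM, so to my understanding `spark.driver.memory` is the setting that sizes the heap in that mode, and `spark.executor.memory` matters mainly when executors run as separate processes.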

---

## 4. Local Simulation: Understanding Resource Allocation

Let's simulate how Spark uses resources locally:

In [None]:
from pyspark.sql import SparkSession
import time
import psutil
import os

# Get local system resources
local_cpus = os.cpu_count()
local_memory_gb = psutil.virtual_memory().total / (1024**3)

print(f"Local System Resources")
print(f"=" * 40)
print(f"CPUs: {local_cpus}")
print(f"Memory: {local_memory_gb:.1f} GB")

Local System Resources
CPUs: 2
Memory: 12.7 GB


In [None]:
# Create a SparkSession with limited resources (simulating a shared node).
# Use about half the local cores (at least 2) and ~30% of local memory
# so the rest of the system stays responsive.

sim_cpus = max(2, local_cpus // 2)
sim_memory = max(2, int(local_memory_gb * 0.3))

print(f"Simulated SLURM Allocation")
print(f"=" * 40)
print(f"CPUs: {sim_cpus}")
print(f"Memory: {sim_memory} GB")

spark = SparkSession.builder \
    .appName("SLURM_Simulation") \
    .master(f"local[{sim_cpus}]") \
    .config("spark.driver.memory", f"{max(1, sim_memory // 4)}g") \
    .config("spark.sql.shuffle.partitions", str(sim_cpus * 2)) \
    .getOrCreate()

print(f"\nSparkSession created:")
print(f"  App Name: {spark.sparkContext.appName}")
print(f"  Master: {spark.sparkContext.master}")
print(f"  Default Parallelism: {spark.sparkContext.defaultParallelism}")

Simulated SLURM Allocation
CPUs: 2
Memory: 3 GB

SparkSession created:
  App Name: SLURM_Simulation
  Master: local[2]
  Default Parallelism: 2


In [None]:
# Demonstrate parallelism with different partition counts

# Generate test data
n_elements = 1_000_000
data = list(range(n_elements))

def measure_execution_time(num_partitions):
    """Measure time to process data with given partition count."""
    rdd = spark.sparkContext.parallelize(data, num_partitions)

    start = time.time()
    result = rdd.map(lambda x: x * 2).reduce(lambda a, b: a + b)
    elapsed = time.time() - start

    return elapsed, result

# Test different partition counts
partition_counts = [sim_cpus, sim_cpus * 2, sim_cpus * 4, sim_cpus * 8]

print(f"\nPartition Performance Test ({n_elements:,} elements)")
print("=" * 50)
print(f"{'Partitions':>12} {'Time (s)':>12} {'Per Partition':>15}")
print("-" * 50)

for num_parts in partition_counts:
    elapsed, result = measure_execution_time(num_parts)
    elements_per_partition = n_elements // num_parts
    print(f"{num_parts:>12} {elapsed:>12.3f} {elements_per_partition:>15,}")


Partition Performance Test (1,000,000 elements)
  Partitions     Time (s)   Per Partition
--------------------------------------------------
           2        3.046         500,000
           4        1.405         250,000
           8        1.909         125,000
          16        3.950          62,500


### Partition Count Guidelines

The optimal number of partitions depends on:
1. **Number of cores** - Each partition can run on one core
2. **Data size** - Each partition should have enough data to process efficiently
3. **Memory per core** - Partitions must fit in available memory

In [None]:
def recommend_partitions(data_size_gb, memory_per_core_gb, num_cores):
    """
    Recommend partition count based on data and resources.

    Rules of thumb:
    - Target 100-200 MB per partition for typical workloads
    - At least 2x partitions per core for parallelism
    - Don't exceed memory capacity
    """
    target_partition_size_gb = 0.15  # 150 MB target

    # Based on data size
    partitions_by_size = max(1, int(data_size_gb / target_partition_size_gb))

    # Based on parallelism (2-3x cores)
    partitions_by_cores = num_cores * 2

    # Based on memory (each partition needs ~2x its size in memory)
    max_partitions_by_memory = int((memory_per_core_gb * num_cores) / (target_partition_size_gb * 2))

    recommended = min(partitions_by_size, max_partitions_by_memory)
    recommended = max(recommended, partitions_by_cores)  # At least 2x cores

    return {
        'recommended': recommended,
        'by_data_size': partitions_by_size,
        'by_core_count': partitions_by_cores,
        'max_by_memory': max_partitions_by_memory
    }

# Example scenarios
scenarios = [
    ('Small job', 10, 4, 8),     # 10 GB data, 4 GB/core, 8 cores
    ('Medium job', 50, 4, 32),   # 50 GB data, 4 GB/core, 32 cores
    ('Large job', 200, 4, 64),   # 200 GB data, 4 GB/core, 64 cores
]

print("Partition Recommendations")
print("=" * 70)

for name, data_gb, mem_per_core, cores in scenarios:
    rec = recommend_partitions(data_gb, mem_per_core, cores)
    print(f"\n{name}: {data_gb} GB data, {cores} cores, {mem_per_core} GB/core")
    print(f"  Recommended partitions: {rec['recommended']}")
    print(f"  (by size: {rec['by_data_size']}, by cores: {rec['by_core_count']}, max by memory: {rec['max_by_memory']})")

Partition Recommendations

Small job: 10 GB data, 8 cores, 4 GB/core
  Recommended partitions: 66
  (by size: 66, by cores: 16, max by memory: 106)

Medium job: 50 GB data, 32 cores, 4 GB/core
  Recommended partitions: 333
  (by size: 333, by cores: 64, max by memory: 426)

Large job: 200 GB data, 64 cores, 4 GB/core
  Recommended partitions: 853
  (by size: 1333, by cores: 128, max by memory: 853)
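
In the large-job scenario the memory cap (853) wins over the size-based count (1333), so each partition ends up larger than the 150 MB target. The sketch below works out the average partition size implied by each recommendation. The scenario numbers are copied from the output above, and `partition_size_mb` is an illustrative helper that assumes 1 GB = 1024 MB.

```python
# (data size in GB, recommended partitions) taken from the output above.
scenarios = {
    "Small job": (10, 66),
    "Medium job": (50, 333),
    "Large job": (200, 853),
}

def partition_size_mb(data_size_gb, num_partitions):
    """Average partition size in MB, using 1 GB = 1024 MB."""
    return data_size_gb * 1024 / num_partitions

for name, (data_gb, parts) in scenarios.items():
    print(f"{name}: {partition_size_mb(data_gb, parts):.0f} MB per partition")
```

The small and medium jobs stay close to the target, while the large job lands at roughly 240 MB per partition, which is a signal to request more memory or accept larger partitions.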


---

## 5. Singularity Container Usage

Singularity containers ensure consistent environments across different systems:

In [None]:
singularity_commands = '''
# Singularity Command Reference
# ==============================

# 1. Build a container (requires sudo, do locally)
sudo singularity build container.sif Singularity.def

# 2. Transfer to SDSC
scp container.sif user@login.expanse.sdsc.edu:/expanse/lustre/projects/uci150/user/

# 3. Interactive shell in container
singularity shell container.sif

# 4. Execute a command in container
singularity exec container.sif python --version

# 5. Bind external paths (access host filesystem)
singularity exec --bind /expanse container.sif python script.py

# 6. Multiple bind points
singularity exec \\
    --bind /expanse/lustre/projects/uci150 \\
    --bind /scratch/$USER \\
    container.sif python script.py

# 7. Set environment variables
singularity exec --env PYTHONPATH=/app container.sif python script.py
'''

print(singularity_commands)


# Singularity Command Reference

# 1. Build a container (requires sudo, do locally)
sudo singularity build container.sif Singularity.def

# 2. Transfer to SDSC
scp container.sif user@login.expanse.sdsc.edu:/expanse/lustre/projects/uci150/user/

# 3. Interactive shell in container
singularity shell container.sif

# 4. Execute a command in container
singularity exec container.sif python --version

# 5. Bind external paths (access host filesystem)
singularity exec --bind /expanse container.sif python script.py

# 6. Multiple bind points
singularity exec \
    --bind /expanse/lustre/projects/uci150 \
    --bind /scratch/$USER \
    container.sif python script.py

# 7. Set environment variables
singularity exec --env PYTHONPATH=/app container.sif python script.py
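
When the bind paths and environment variables vary between jobs, it can help to assemble the `singularity exec` command programmatically rather than editing strings by hand. This is a minimal sketch using only the `--bind` and `--env` options shown above; `singularity_exec_argv` is a hypothetical helper, and the paths are the ones from the reference.

```python
import shlex

def singularity_exec_argv(image, command, binds=(), env=None):
    """Assemble the argument list for `singularity exec` with binds and env vars."""
    argv = ["singularity", "exec"]
    for path in binds:
        argv += ["--bind", path]
    for key, value in (env or {}).items():
        argv += ["--env", f"{key}={value}"]
    argv.append(image)
    argv += list(command)
    return argv

argv = singularity_exec_argv(
    "container.sif",
    ["python", "script.py"],
    binds=["/expanse/lustre/projects/uci150"],
    env={"PYTHONPATH": "/app"},
)
print(shlex.join(argv))  # shell-quoted form, suitable for pasting into a job script
```

Keeping the command as a list also means it can be passed directly to `subprocess.run` without shell quoting problems.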



### Container Definition File

In [None]:
singularity_def = '''
Bootstrap: docker
From: python:3.10-slim

%labels
    Author DSC 232R
    Description Spark + Ray container for HPC

%post
    # System packages
    apt-get update && apt-get install -y \\
        openjdk-17-jdk-headless \\
        wget curl git \\
        && rm -rf /var/lib/apt/lists/*

    # Python packages (quote version specifiers so the shell does not treat > as a redirect)
    pip install --no-cache-dir \\
        "pyspark>=3.5.0" \\
        numpy pandas matplotlib \\
        scikit-learn xgboost \\
        jupyter

%environment
    export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
    export PATH=$JAVA_HOME/bin:$PATH

%runscript
    python "$@"
'''

print("Example Singularity Definition")
print("=" * 50)
print(singularity_def)

Example Singularity Definition

Bootstrap: docker
From: python:3.10-slim

%labels
    Author DSC 232R
    Description Spark + Ray container for HPC

%post
    # System packages
    apt-get update && apt-get install -y \
        openjdk-17-jdk-headless \
        wget curl git \
        && rm -rf /var/lib/apt/lists/*

    # Python packages (quote version specifiers so the shell does not treat > as a redirect)
    pip install --no-cache-dir \
        "pyspark>=3.5.0" \
        numpy pandas matplotlib \
        scikit-learn xgboost \
        jupyter

%environment
    export JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64
    export PATH=$JAVA_HOME/bin:$PATH

%runscript
    python "$@"



---

## 6. Exercise: Write a SLURM Script

Create a SLURM script for the following scenario:

**Requirements:**
- Process 50 GB of weather data
- Need ~200 GB memory for processing overhead
- Want to use 64 cores
- Expected runtime: 2 hours
- Use the shared partition

In [None]:
# Exercise: Fill in the SLURM script template

# Think about:
# 1. Which partition supports 64 cores on one node?
# 2. How much memory can you request on shared partition?
# 3. What's the appropriate time format?

your_script = '''
#!/bin/bash
#SBATCH --job-name=weather_analysis
#SBATCH --partition=???           # Which partition?
#SBATCH --nodes=???
#SBATCH --ntasks-per-node=???
#SBATCH --cpus-per-task=???       # 64 cores
#SBATCH --mem=???                 # ~200 GB needed
#SBATCH --time=???                # 2 hours
#SBATCH --account=uci150
#SBATCH --output=logs/weather_%j.out
#SBATCH --error=logs/weather_%j.err

module load singularitypro

singularity exec \\
    --bind /expanse \\
    ray_spark_dsc232r.sif \\
    python weather_analysis.py
'''

print("Fill in the blanks in the script above!")
print("\nHints:")
print("- 'shared' partition allows up to 128 cores per node")
print("- Max memory on shared is 256 GB per node")
print("- Time format is HH:MM:SS")

Fill in the blanks in the script above!

Hints:
- 'shared' partition allows up to 128 cores per node
- Max memory on shared is 256 GB per node
- Time format is HH:MM:SS


In [None]:
# Solution

solution_script = '''
#!/bin/bash
#SBATCH --job-name=weather_analysis
#SBATCH --partition=shared         # Shared supports up to 128 cores/node
#SBATCH --nodes=1                  # Single node is sufficient
#SBATCH --ntasks-per-node=1        # One task for Spark driver
#SBATCH --cpus-per-task=64         # 64 cores requested
#SBATCH --mem=200G                 # 200 GB as specified
#SBATCH --time=02:00:00            # 2 hours (HH:MM:SS format)
#SBATCH --account=uci150
#SBATCH --output=logs/weather_%j.out
#SBATCH --error=logs/weather_%j.err

module load singularitypro

singularity exec \\
    --bind /expanse \\
    ray_spark_dsc232r.sif \\
    python weather_analysis.py
'''

print("Solution:")
print("=" * 50)
print(solution_script)

Solution:

#!/bin/bash
#SBATCH --job-name=weather_analysis
#SBATCH --partition=shared         # Shared supports up to 128 cores/node
#SBATCH --nodes=1                  # Single node is sufficient
#SBATCH --ntasks-per-node=1        # One task for Spark driver
#SBATCH --cpus-per-task=64         # 64 cores requested
#SBATCH --mem=200G                 # 200 GB as specified
#SBATCH --time=02:00:00            # 2 hours (HH:MM:SS format)
#SBATCH --account=uci150
#SBATCH --output=logs/weather_%j.out
#SBATCH --error=logs/weather_%j.err

module load singularitypro

singularity exec \
    --bind /expanse \
    ray_spark_dsc232r.sif \
    python weather_analysis.py



---

## 7. Job Monitoring Reference

Commands to monitor and debug jobs on SDSC:

In [None]:
monitoring_commands = '''
# SLURM Job Monitoring Commands
# =============================

# View your jobs in the queue
squeue -u $USER

# Detailed job information
scontrol show job <job_id>

# View output in real-time
tail -f logs/job_12345.out

# Check job efficiency after completion
seff <job_id>

# View detailed accounting
sacct -j <job_id> --format=JobID,JobName,Partition,State,ExitCode,Elapsed,MaxRSS,MaxVMSize

# Cancel a job
scancel <job_id>

# Cancel all your jobs
scancel -u $USER

# View partition information
sinfo -p shared

# View your allocation balance
expanse-client user
'''

print(monitoring_commands)


# SLURM Job Monitoring Commands

# View your jobs in the queue
squeue -u $USER

# Detailed job information
scontrol show job <job_id>

# View output in real-time
tail -f logs/job_12345.out

# Check job efficiency after completion
seff <job_id>

# View detailed accounting
sacct -j <job_id> --format=JobID,JobName,Partition,State,ExitCode,Elapsed,MaxRSS,MaxVMSize

# Cancel a job
scancel <job_id>

# Cancel all your jobs
scancel -u $USER

# View partition information
sinfo -p shared

# View your allocation balance
expanse-client user
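
For post-run analysis, `sacct` can emit pipe-delimited output (its `--parsable2` option), which is easy to load into Python. The sketch below parses that format into dictionaries. The sample text is made up purely for illustration and does not come from a real job; `parse_sacct` is a hypothetical helper.

```python
def parse_sacct(text):
    """Parse pipe-delimited `sacct --parsable2` output into a list of dicts."""
    lines = [line for line in text.strip().splitlines() if line]
    header = lines[0].split("|")
    return [dict(zip(header, line.split("|"))) for line in lines[1:]]

# Made-up sample for illustration only; real output will differ.
sample = """JobID|JobName|State|Elapsed|MaxRSS
12345|my_spark_job|COMPLETED|01:02:03|
12345.batch|batch|COMPLETED|01:02:03|2048000K
"""

for row in parse_sacct(sample):
    print(row["JobID"], row["State"], row["MaxRSS"] or "-")
```

Comparing `MaxRSS` against the `--mem` request is one way to decide whether the next run can ask for less memory.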



### Common Issues and Solutions

In [None]:
troubleshooting = pd.DataFrame({
    'Issue': [
        'Job killed (exit 137)',
        'Job timeout',
        'ModuleNotFoundError',
        'FileNotFoundError',
        'Job PENDING forever'
    ],
    'Cause': [
        'Out of memory (OOM)',
        'Exceeded time limit',
        'Package not in container',
        'Path not bound in Singularity',
        'Resources not available'
    ],
    'Solution': [
        'Increase --mem or optimize code',
        'Increase --time or parallelize more',
        'Rebuild container with package',
        'Add --bind /path to singularity exec',
        'Try debug partition or reduce resources'
    ]
})

print("Troubleshooting Guide")
print("=" * 90)
print(troubleshooting.to_string(index=False))

Troubleshooting Guide
                Issue                         Cause                                Solution
Job killed (exit 137)           Out of memory (OOM)         Increase --mem or optimize code
          Job timeout           Exceeded time limit     Increase --time or parallelize more
  ModuleNotFoundError      Package not in container          Rebuild container with package
    FileNotFoundError Path not bound in Singularity    Add --bind /path to singularity exec
  Job PENDING forever       Resources not available Try debug partition or reduce resources
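
The `137` in the first row follows the shell convention that an exit code above 128 means the process was terminated by signal `code - 128`, and 137 - 128 = 9 is `SIGKILL`. An out-of-memory kill is a common cause of `SIGKILL`, though not the only possible one. The sketch below decodes such codes with the standard `signal` module; `describe_exit_code` is an illustrative helper, and signal names depend on the platform.

```python
import signal

def describe_exit_code(code):
    """Explain a shell-style exit code; values above 128 mean 'killed by signal'."""
    if code > 128:
        try:
            return f"killed by {signal.Signals(code - 128).name}"
        except ValueError:
            return f"killed by unknown signal {code - 128}"
    return "success" if code == 0 else f"exited with error status {code}"

for code in (0, 1, 137, 143):
    print(f"{code:>3}: {describe_exit_code(code)}")
```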


---

## Summary

### Key Concepts

1. **SLURM** schedules jobs on HPC clusters
   - `sbatch` for batch jobs
   - `srun` for interactive commands
   - `squeue` to monitor jobs

2. **Partitions** have different resource limits
   - `debug`: Fast queue, limited time
   - `shared`: Up to 128 cores, 256 GB
   - `compute`: Full nodes, large jobs

3. **Spark configuration** must match allocation
   - Driver memory: ~10-15% of total
   - Executor memory: ~75-80% of total
   - Partitions: 2-3x number of cores

4. **Singularity** provides reproducible environments
   - Build locally, run on cluster
   - Bind paths for filesystem access

In [None]:
# Cleanup
spark.stop()
print("SparkSession stopped.")

SparkSession stopped.


---

## Next: Module 3 - Introduction to Ray

In the next module, we'll introduce Ray and cover:
- Ray Core: Tasks and Actors
- Ray Data: Distributed datasets
- Ray Train: Distributed training

See: `03_ray_core.ipynb`