submit: replace shell-level cd job_workdir with scheduler --chdir directive (bundled with directory-model design) #19

@ultimatile

Summary

JOB_TEMPLATE in src/hpc/job.py currently sets the job's working
directory via a shell-level cd $job_workdir line in the rendered
script. Replacing this with a scheduler directive — Slurm's
#SBATCH --chdir= (a.k.a. -D), PJM's -d / --directory — would
move the working-directory choice from the script body into the
scheduler API.

This is a deferred design question, not a known bug. Filing for
visibility so it can be revisited together with related directory-
model decisions.
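
A minimal sketch of the two template shapes under discussion. The
template bodies, the use of string.Template, and the render helper
are assumptions for illustration only; the actual JOB_TEMPLATE in
src/hpc/job.py may look different.

```python
from string import Template

# Hypothetical stand-in for the current template: working directory
# is set by a shell-level cd in the script body.
TEMPLATE_WITH_CD = Template("""\
#!/bin/bash
#SBATCH --job-name=$job_name
cd $job_workdir
$command
""")

# Hypothetical stand-in for the proposed template: working directory
# is passed to Slurm as a directive. #SBATCH lines must appear before
# the first executable command in the script.
TEMPLATE_WITH_CHDIR = Template("""\
#!/bin/bash
#SBATCH --job-name=$job_name
#SBATCH --chdir=$job_workdir
$command
""")


def render(template, job_name, job_workdir, command):
    """Fill in a job script template (illustrative helper, not the project's API)."""
    return template.substitute(
        job_name=job_name, job_workdir=job_workdir, command=command
    )


old_script = render(TEMPLATE_WITH_CD, "demo", "/scratch/u/run1", "srun ./a.out")
new_script = render(TEMPLATE_WITH_CHDIR, "demo", "/scratch/u/run1", "srun ./a.out")
```

In the second form the script body contains no cd line at all, so
there is no shell step that can fail before the user's command runs.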

Robustness motivation

A shell-level cd has known failure modes that a scheduler
directive avoids:

  • Silent wrong-CWD execution. If $job_workdir does not exist
    or is not accessible at runtime, cd fails without exiting
    the script (default bash semantics without set -e). Subsequent
    commands then run in whatever directory the script started in
    (for Slurm, the submit directory by default), the job completes
    successfully from the scheduler's perspective, and the user
    discovers the wrong directory only by inspecting the output.
    With --chdir, the scheduler itself handles the path at submit
    or job-start time. How an invalid path is treated (rejected,
    failed at start, or replaced by a fallback directory) is
    scheduler-specific and should be confirmed per backend before
    relying on it.
  • Implicit "before the cd" zone. Any shell line emitted before
    cd runs in the submit-time CWD. The current template happens to
    have no such lines, but the contract depends on never adding any.
    --chdir makes the entire script body run in $job_workdir
    unconditionally.
  • Scheduler-side visibility. scontrol show job <id> reports
    WorkDir=, which reflects --chdir when it is set and the
    submit-time directory otherwise. With cd, the effective working
    directory is hidden inside the script body and not queryable
    from scheduler metadata.
  • Requeue / preemption. A --chdir-set working directory is
    recorded by the scheduler as part of the job and applied again
    on requeue. With cd, the script must re-execute the line. The
    end result is the same in the simple case, but the
    scheduler-level path removes one shell pre-condition.
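
The silent wrong-CWD failure mode above can be reproduced locally
without a scheduler. The sketch below is illustrative only: it
assumes a POSIX shell is on PATH (bash if available, otherwise sh)
and that /nonexistent/job_workdir does not exist; the helper names
are hypothetical.

```python
import os
import shutil
import subprocess
import tempfile

MISSING_DIR = "/nonexistent/job_workdir"  # assumed not to exist on this machine


def run_script(script, start_dir):
    """Run a shell script body starting from start_dir and capture the result."""
    shell = shutil.which("bash") or "sh"
    return subprocess.run(
        [shell, "-c", script], cwd=start_dir, capture_output=True, text=True
    )


def demo():
    # The temporary directory plays the role of the submit-time CWD.
    with tempfile.TemporaryDirectory() as submit_dir:
        # Unguarded cd: the failure is reported on stderr, the script continues.
        unguarded = run_script(f"cd {MISSING_DIR}\npwd", submit_dir)
        # Guarded cd: the script stops before any later command runs.
        guarded = run_script(f"cd {MISSING_DIR} || exit 1\npwd", submit_dir)
        return {
            "submit_dir": os.path.realpath(submit_dir),
            "unguarded_exit": unguarded.returncode,
            "unguarded_cwd": os.path.realpath(unguarded.stdout.strip()),
            "guarded_exit": guarded.returncode,
            "guarded_stdout": guarded.stdout,
        }


result = demo()
```

The unguarded script exits with status 0 and reports the starting
directory as its CWD, which is the "successful but wrong" outcome
described above. The guarded variant shows the shell-level
mitigation that a scheduler directive would make unnecessary.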

Why this is not a single-PR fix

The change interacts with a longer-running directory-model
question — whether to introduce a sourcedir concept distinct from
cluster.workdir. Once a second remote-side directory exists, the
following questions all need consistent answers:

  • Should --chdir point at workdir or sourcedir?
  • A user's own cd subdir in the script — is subdir resolved
    relative to workdir, sourcedir, or whichever directory the
    scheduler placed the job in?
  • The CWD-relative walker (cwd_relative = Path.cwd().relative_to(project_root) in src/hpc/cli.py) maps
    local CWD to a remote subpath. Which root does that subpath
    attach to?
  • [env] setup commands — do they run before or after the
    scheduler's chdir takes effect, and does this matter for paths
    used in module load / source / export?

Each combination of answers produces a different mental model for
users. Introducing --chdir standalone risks locking in a partial
answer that constrains the later sourcedir design.
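
To make the walker question concrete, here is a small sketch of
how the same local CWD lands in different remote locations
depending on which root it attaches to. Only the relative_to
expression comes from src/hpc/cli.py; the remote_cwd helper, the
example paths, and the sourcedir root are hypothetical.

```python
from pathlib import PurePosixPath


def remote_cwd(local_cwd, project_root, remote_root):
    """Map a local CWD to a remote path under remote_root (illustrative helper).

    Mirrors cwd_relative = Path.cwd().relative_to(project_root), but takes
    the local CWD as an argument so the example is deterministic.
    Raises ValueError if local_cwd is not inside project_root.
    """
    cwd_relative = PurePosixPath(local_cwd).relative_to(PurePosixPath(project_root))
    return PurePosixPath(remote_root) / cwd_relative


# Hypothetical roots: one for cluster.workdir, one for a future sourcedir.
workdir = "/scratch/user/proj"
sourcedir = "/home/user/src/proj"

under_workdir = remote_cwd("/home/me/proj/exp/a", "/home/me/proj", workdir)
under_sourcedir = remote_cwd("/home/me/proj/exp/a", "/home/me/proj", sourcedir)
```

The relative subpath (exp/a) is identical in both cases; only the
root differs, which is why --chdir and the walker need to agree on
a single answer.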

Acceptance

This issue is closed when:

  • A directory-model design has been decided (workdir-only, or
    workdir + sourcedir with documented semantics for each axis
    above), AND
  • JOB_TEMPLATE has been updated to use the scheduler directive
    (#SBATCH --chdir= for Slurm, -d for PJM) consistent with
    that design, AND
  • The walker logic and [env] setup interaction are verified
    against the new model.

Closing without addressing the directory-model question is not
acceptable; the standalone refactor would lock in implicit
answers.
