Slurm adaptor got invalid key/value pair in output #72

arnikz opened this issue Feb 25, 2020 · 16 comments

arnikz commented Feb 25, 2020

The GridEngine cluster at UMCU has recently been upgraded to Slurm (v19), which will replace GE soon-ish. So I tested the sv-callers workflow, but all Slurm jobs failed (I also tried without the --max-memory arg, see release notes).

xenon -vvv scheduler slurm --location local:// submit --name smk.{rule} --inherit-env --cores-per-task {threads} --max-run-time 5 --max-memory {resources.mem_mb} --working-directory . --stderr stderr-%j.log --stdout stdout-%j.log
slurm adaptor: Got invalid key/value pair in output: Cgroup Support Configuration:
Error submitting jobscript (exit code 1):
13:18:55.487 [main] DEBUG n.e.x.a.s.ScriptingScheduler - creating sub scheduler for slurm adaptor at local://
13:18:55.498 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - Creating JobQueueScheduler for Adaptor local with multiQThreads: 4 and pollingDelay: 1000
13:18:55.501 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: Submitting job
13:18:55.506 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: Created Job local-0
13:18:55.507 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: Submitting job to queue unlimited
13:18:55.508 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: Waiting for interactive job to start.
13:18:55.543 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: getJobStatus for job local-0
13:18:55.543 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: findJob for job local-0
13:18:55.544 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: findJob for job local-0
13:18:55.544 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: findJob for job local-0
13:18:55.544 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: findJob for job local-0
13:18:55.545 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: cleanupJob for job local-0
13:18:55.545 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: cleanupJob for job local-0
13:18:55.545 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: cleanupJob for job local-0
13:18:55.545 [main] DEBUG n.e.x.a.s.RemoteCommandRunner - CommandRunner took 44 ms, executable = scontrol, arguments = [show, config], exitcode = 0, stdout:
Configuration data as of 2020-02-24T13:18:55
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = none
AccountingStorageHost   = htp-batch-01
AccountingStorageLoc    = N/A
AccountingStoragePort   = 6819
AccountingStorageTRES   = cpu,mem,energy,node,billing,fs/disk,vmem,pages
AccountingStorageType   = accounting_storage/slurmdbd
AccountingStorageUser   = N/A
AccountingStoreJobComment = Yes
AcctGatherEnergyType    = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq      = 0 sec
AcctGatherProfileType   = acct_gather_profile/none
AllowSpecResourcesUsage = 0
AuthAltTypes            = (null)
AuthInfo                = (null)
AuthType                = auth/munge
BatchStartTimeout       = 10 sec
BOOT_TIME               = 2020-02-17T18:10:10
BurstBufferType         = (null)
CheckpointType          = checkpoint/none
CliFilterPlugins        = (null)
ClusterName             = spider
CommunicationParameters = (null)
CompleteWait            = 0 sec
CoreSpecPlugin          = core_spec/none
CpuFreqDef              = Unknown
CpuFreqGovernors        = Performance,OnDemand,UserSpace
CredType                = cred/munge
DebugFlags              = (null)
DefMemPerCPU            = 8000
DisableRootJobs         = No
EioTimeout              = 60
EnforcePartLimits       = ALL
Epilog                  = /data/
EpilogMsgTime           = 2000 usec
EpilogSlurmctld         = (null)
ExtSensorsType          = ext_sensors/none
ExtSensorsFreq          = 0 sec
FairShareDampeningFactor = 1
FastSchedule            = 1
FederationParameters    = (null)
FirstJobId              = 1
GetEnvTimeout           = 2 sec
GresTypes               = (null)
GpuFreqDef              = high,memory=high
GroupUpdateForce        = 1
GroupUpdateTime         = 600 sec
HASH_VAL                = Match
HealthCheckInterval     = 0 sec
HealthCheckNodeState    = ANY
HealthCheckProgram      = (null)
InactiveLimit           = 0 sec
JobAcctGatherFrequency  = 30
JobAcctGatherType       = jobacct_gather/linux
JobAcctGatherParams     = NoOverMemoryKill
JobCheckpointDir        = /var/slurm/checkpoint
JobCompHost             = localhost
JobCompLoc              = /var/log/slurm_jobcomp.log
JobCompPort             = 0
JobCompType             = jobcomp/none
JobCompUser             = root
JobContainerType        = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobDefaults             = (null)
JobFileAppend           = 0
JobRequeue              = 0
JobSubmitPlugins        = (null)
KeepAliveTime           = SYSTEM_DEFAULT
KillOnBadExit           = 0
KillWait                = 30 sec
LaunchParameters        = (null)
LaunchType              = launch/slurm
Layouts                 = 
Licenses                = (null)
LicensesUsed            = (null)
LogTimeFormat           = iso8601_ms
MailDomain              = (null)
MailProg                = /bin/mail
MaxArraySize            = 1001
MaxJobCount             = 10000
MaxJobId                = 67043328
MaxMemPerCPU            = 8000
MaxStepCount            = 40000
MaxTasksPerNode         = 512
MCSPlugin               = mcs/none
MCSParameters           = (null)
MessageTimeout          = 10 sec
MinJobAge               = 300 sec
MpiDefault              = none
MpiParams               = (null)
MsgAggregationParams    = (null)
NEXT_JOB_ID             = 26161
NodeFeaturesPlugins     = (null)
OverTimeLimit           = 0 min
PluginDir               = /usr/lib64/slurm
PlugStackConfig         = /etc/slurm/plugstack.conf
PowerParameters         = (null)
PowerPlugin             = 
PreemptMode             = OFF
PreemptType             = preempt/none
PreemptExemptTime       = 00:00:00
PriorityParameters      = (null)
PrioritySiteFactorParameters = (null)
PrioritySiteFactorPlugin = (null)
PriorityDecayHalfLife   = 1-00:00:00
PriorityCalcPeriod      = 00:05:00
PriorityFavorSmall      = No
PriorityFlags           = 
PriorityMaxAge          = 7-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType            = priority/multifactor
PriorityWeightAge       = 10000
PriorityWeightAssoc     = 0
PriorityWeightFairShare = 10000
PriorityWeightJobSize   = 0
PriorityWeightPartition = 0
PriorityWeightQOS       = 10000
PriorityWeightTRES      = (null)
PrivateData             = none
ProctrackType           = proctrack/cgroup
Prolog                  = (null)
PrologEpilogTimeout     = 65534
PrologSlurmctld         = (null)
PrologFlags             = Alloc,Contain,X11
PropagatePrioProcess    = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram           = (null)
ReconfigFlags           = (null)
RequeueExit             = (null)
RequeueExitHold         = (null)
ResumeFailProgram       = (null)
ResumeProgram           = (null)
ResumeRate              = 300 nodes/min
ResumeTimeout           = 60 sec
ResvEpilog              = (null)
ResvOverRun             = 0 min
ResvProlog              = (null)
ReturnToService         = 1
RoutePlugin             = route/default
SallocDefaultCommand    = (null)
SbcastParameters        = (null)
SchedulerParameters     = (null)
SchedulerTimeSlice      = 30 sec
SchedulerType           = sched/backfill
SelectType              = select/cons_res
SelectTypeParameters    = CR_CORE_MEMORY
SlurmUser               = slurm(1001)
SlurmctldAddr           = (null)
SlurmctldDebug          = info
SlurmctldHost[0]        = htp-batch-01(
SlurmctldLogFile        = /var/log/slurm/slurmctld.log
SlurmctldPort           = 6817
SlurmctldSyslogDebug    = unknown
SlurmctldPrimaryOffProg = (null)
SlurmctldPrimaryOnProg  = (null)
SlurmctldTimeout        = 120 sec
SlurmctldParameters     = (null)
SlurmdDebug             = info
SlurmdLogFile           = /var/log/slurm/slurmd.log
SlurmdParameters        = (null)
SlurmdPidFile           = /var/run/
SlurmdPort              = 6818
SlurmdSpoolDir          = /var/spool/slurmd
SlurmdSyslogDebug       = unknown
SlurmdTimeout           = 300 sec
SlurmdUser              = root(0)
SlurmSchedLogFile       = (null)
SlurmSchedLogLevel      = 0
SlurmctldPidFile        = /var/run/
SlurmctldPlugstack      = (null)
SLURM_CONF              = /etc/slurm/slurm.conf
SLURM_VERSION           = 19.05.5
SrunEpilog              = (null)
SrunPortRange           = 0-0
SrunProlog              = (null)
StateSaveLocation       = /var/spool/slurm_state
SuspendExcNodes         = (null)
SuspendExcParts         = (null)
SuspendProgram          = (null)
SuspendRate             = 60 nodes/min
SuspendTime             = NONE
SuspendTimeout          = 30 sec
SwitchType              = switch/none
TaskEpilog              = (null)
TaskPlugin              = affinity,cgroup
TaskPluginParam         = (null type)
TaskProlog              = (null)
TCPTimeout              = 2 sec
TmpFS                   = /tmp
TopologyParam           = (null)
TopologyPlugin          = topology/none
TrackWCKey              = No
TreeWidth               = 50
UsePam                  = 0
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 60 sec
VSizeFactor             = 0 percent
WaitTime                = 0 sec
X11Parameters           = (null)

Cgroup Support Configuration:
AllowedDevicesFile      = /etc/slurm/cgroup_allowed_devices_file.conf
AllowedKmemSpace        = (null)
AllowedRAMSpace         = 100.0%
AllowedSwapSpace        = 0.0%
CgroupAutomount         = yes
CgroupMountpoint        = /sys/fs/cgroup
ConstrainCores          = yes
ConstrainDevices        = no
ConstrainKmemSpace      = no
ConstrainRAMSpace       = no
ConstrainSwapSpace      = no
MaxKmemPercent          = 100.0%
MaxRAMPercent           = 100.0%
MaxSwapPercent          = 100.0%
MemorySwappiness        = (null)
MinKmemSpace            = 30 MB
MinRAMSpace             = 30 MB
TaskAffinity            = no

Slurmctld(primary) at htp-batch-01 is UP

arnikz commented Feb 25, 2020

Perhaps it's a good time to update the Docker images.

arnikz commented Mar 10, 2020

@sverhoeven: could you give an estimate of how much time is required to fix this? Thanks.

The ScriptingParser used in the SlurmScheduler class does not know about sections (lines ending in ":"). I think it would take at least a day to write a robust parser and a couple of hours to create new Xenon and Xenon-* releases.

I've just added a single statement to ignore all lines without an = sign in them.
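The workaround described here (skip every line that has no `=` sign) can be sketched roughly as follows. This is a Python illustration, not Xenon's actual Java code; the function name `parse_key_value_lines` and the shortened sample output are assumptions made for the example.

```python
from typing import Dict


def parse_key_value_lines(output: str) -> Dict[str, str]:
    """Parse `key = value` lines, ignoring any line without an '=' sign."""
    result: Dict[str, str] = {}
    for line in output.splitlines():
        if "=" not in line:
            # Section headers ("Cgroup Support Configuration:"), blank lines
            # and status lines ("Slurmctld(primary) at ... is UP") end up here.
            continue
        # Split on the first '=' only, so values may contain '=' themselves.
        key, _, value = line.partition("=")
        result[key.strip()] = value.strip()
    return result


# Shortened sample based on the `scontrol show config` output in this issue.
sample = """Configuration data as of 2020-02-24T13:18:55
ClusterName             = spider
SLURM_VERSION           = 19.05.5
GpuFreqDef              = high,memory=high

Cgroup Support Configuration:
ConstrainCores          = yes

Slurmctld(primary) at htp-batch-01 is UP
"""

config = parse_key_value_lines(sample)
```

Note that this flattens the sections into a single map, so a key that appears in more than one section would be overwritten; a section-aware parser would avoid that.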

I'll add some tests and release a version with this fix.

It's fixed in the jobstatus-bug branch of xenon. Or it parses the example shown above at least.

We would need a slurm 19 container to do proper testing?

arnikz commented Mar 11, 2020

It's fixed in the jobstatus-bug branch of xenon. Or it parses the example shown above at least.

We would need a slurm 19 container to do proper testing?

arnikz commented Mar 19, 2020

Hi, I've tested my workflow with xenon-cli 3.0.5beta1 + new slurm-19 image but the jobs are still failing. Please heeelp!

The conda xenon-cli 3.0.5beta1 package was just made for non-linux users (#73).

It does not include the fix from the branch; it is a build with the Xenon v3.0.4 release.

jmaassen commented Mar 20, 2020

Hmmm... my (new) unit test does parse the output correctly.

I think there may be some version mixup with xenon somewhere. I'll see if I can find the problem.

update: Ah, it seems the fix may be in the jobstatus-bug branch ;-)

I'll clean up the branch and test it with the other (non-slurm) scripting adaptors. I can then merge it into master and make a new release.

I created a draft PR xenon-middleware/xenon#670 for the jobstatus-bug branch, to see the test failures more easily.

Hmmm... most of the tests pass, except for one integration test. Apparently the sbatch argument "--workdir" was changed to "--chdir" at some point. Will fix.
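A client that has to talk to both older and newer Slurm installations could choose the flag based on the `SLURM_VERSION` value reported by `scontrol show config`. The sketch below is a Python illustration, not what Xenon does; the helper names are hypothetical and the `(17, 11)` cutoff is an assumption (the thread only says the flag changed "at some point"), so check the sbatch documentation for the versions you target.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a version string such as '19.05.5' into (19, 5, 5).

    Non-numeric parts are skipped.
    """
    return tuple(int(part) for part in version.split(".") if part.isdigit())


def working_dir_flag(slurm_version: str,
                     chdir_since: Tuple[int, ...] = (17, 11)) -> str:
    """Pick the sbatch working-directory flag for a given Slurm version.

    `chdir_since` is an ASSUMED cutoff, not taken from the issue thread.
    """
    if parse_version(slurm_version) >= chdir_since:
        return "--chdir"
    return "--workdir"


# The cluster in this issue reports SLURM_VERSION = 19.05.5.
flag = working_dir_flag("19.05.5")
```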

Fixed in the 3.1.0 release.

CLI v3.0.5 released on conda with Xenon 3.1.0. Please test.

arnikz commented Mar 23, 2020

All works fine with the latest release on Slurm 19. Thanks!

arnikz closed this as completed Mar 23, 2020
