Slurm adaptor got invalid key/value pair in output #72

Closed

arnikz opened this issue Feb 25, 2020 · 16 comments
arnikz commented Feb 25, 2020

The GridEngine cluster at UMCU has recently been upgraded to Slurm (v19), which will replace GE soon. I tested the sv-callers workflow, but all Slurm jobs failed (also without the --max-memory argument; see the release notes).

xenon -vvv scheduler slurm --location local:// submit --name smk.{rule} --inherit-env --cores-per-task {threads} --max-run-time 5 --max-memory {resources.mem_mb} --working-directory . --stderr stderr-%j.log --stdout stdout-%j.log
slurm adaptor: Got invalid key/value pair in output: Cgroup Support Configuration:
Error submitting jobscript (exit code 1):
13:18:55.487 [main] DEBUG n.e.x.a.s.ScriptingScheduler - creating sub scheduler for slurm adaptor at local://
13:18:55.498 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - Creating JobQueueScheduler for Adaptor local with multiQThreads: 4 and pollingDelay: 1000
13:18:55.501 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: Submitting job
13:18:55.506 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: Created Job local-0
13:18:55.507 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: Submitting job to queue unlimited
13:18:55.508 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: Waiting for interactive job to start.
13:18:55.543 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: getJobStatus for job local-0
13:18:55.543 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: findJob for job local-0
13:18:55.544 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: findJob for job local-0
13:18:55.544 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: findJob for job local-0
13:18:55.544 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: findJob for job local-0
13:18:55.545 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: cleanupJob for job local-0
13:18:55.545 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: cleanupJob for job local-0
13:18:55.545 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: cleanupJob for job local-0
13:18:55.545 [main] DEBUG n.e.x.a.s.RemoteCommandRunner - CommandRunner took 44 ms, executable = scontrol, arguments = [show, config], exitcode = 0, stdout:
Configuration data as of 2020-02-24T13:18:55
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = none
AccountingStorageHost   = htp-batch-01
AccountingStorageLoc    = N/A
AccountingStoragePort   = 6819
AccountingStorageTRES   = cpu,mem,energy,node,billing,fs/disk,vmem,pages
AccountingStorageType   = accounting_storage/slurmdbd
AccountingStorageUser   = N/A
AccountingStoreJobComment = Yes
AcctGatherEnergyType    = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq      = 0 sec
AcctGatherProfileType   = acct_gather_profile/none
AllowSpecResourcesUsage = 0
AuthAltTypes            = (null)
AuthInfo                = (null)
AuthType                = auth/munge
BatchStartTimeout       = 10 sec
BOOT_TIME               = 2020-02-17T18:10:10
BurstBufferType         = (null)
CheckpointType          = checkpoint/none
CliFilterPlugins        = (null)
ClusterName             = spider
CommunicationParameters = (null)
CompleteWait            = 0 sec
CoreSpecPlugin          = core_spec/none
CpuFreqDef              = Unknown
CpuFreqGovernors        = Performance,OnDemand,UserSpace
CredType                = cred/munge
DebugFlags              = (null)
DefMemPerCPU            = 8000
DisableRootJobs         = No
EioTimeout              = 60
EnforcePartLimits       = ALL
Epilog                  = /data/tmpdir-epilogue.sh
EpilogMsgTime           = 2000 usec
EpilogSlurmctld         = (null)
ExtSensorsType          = ext_sensors/none
ExtSensorsFreq          = 0 sec
FairShareDampeningFactor = 1
FastSchedule            = 1
FederationParameters    = (null)
FirstJobId              = 1
GetEnvTimeout           = 2 sec
GresTypes               = (null)
GpuFreqDef              = high,memory=high
GroupUpdateForce        = 1
GroupUpdateTime         = 600 sec
HASH_VAL                = Match
HealthCheckInterval     = 0 sec
HealthCheckNodeState    = ANY
HealthCheckProgram      = (null)
InactiveLimit           = 0 sec
JobAcctGatherFrequency  = 30
JobAcctGatherType       = jobacct_gather/linux
JobAcctGatherParams     = NoOverMemoryKill
JobCheckpointDir        = /var/slurm/checkpoint
JobCompHost             = localhost
JobCompLoc              = /var/log/slurm_jobcomp.log
JobCompPort             = 0
JobCompType             = jobcomp/none
JobCompUser             = root
JobContainerType        = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobDefaults             = (null)
JobFileAppend           = 0
JobRequeue              = 0
JobSubmitPlugins        = (null)
KeepAliveTime           = SYSTEM_DEFAULT
KillOnBadExit           = 0
KillWait                = 30 sec
LaunchParameters        = (null)
LaunchType              = launch/slurm
Layouts                 = 
Licenses                = (null)
LicensesUsed            = (null)
LogTimeFormat           = iso8601_ms
MailDomain              = (null)
MailProg                = /bin/mail
MaxArraySize            = 1001
MaxJobCount             = 10000
MaxJobId                = 67043328
MaxMemPerCPU            = 8000
MaxStepCount            = 40000
MaxTasksPerNode         = 512
MCSPlugin               = mcs/none
MCSParameters           = (null)
MessageTimeout          = 10 sec
MinJobAge               = 300 sec
MpiDefault              = none
MpiParams               = (null)
MsgAggregationParams    = (null)
NEXT_JOB_ID             = 26161
NodeFeaturesPlugins     = (null)
OverTimeLimit           = 0 min
PluginDir               = /usr/lib64/slurm
PlugStackConfig         = /etc/slurm/plugstack.conf
PowerParameters         = (null)
PowerPlugin             = 
PreemptMode             = OFF
PreemptType             = preempt/none
PreemptExemptTime       = 00:00:00
PriorityParameters      = (null)
PrioritySiteFactorParameters = (null)
PrioritySiteFactorPlugin = (null)
PriorityDecayHalfLife   = 1-00:00:00
PriorityCalcPeriod      = 00:05:00
PriorityFavorSmall      = No
PriorityFlags           = 
PriorityMaxAge          = 7-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType            = priority/multifactor
PriorityWeightAge       = 10000
PriorityWeightAssoc     = 0
PriorityWeightFairShare = 10000
PriorityWeightJobSize   = 0
PriorityWeightPartition = 0
PriorityWeightQOS       = 10000
PriorityWeightTRES      = (null)
PrivateData             = none
ProctrackType           = proctrack/cgroup
Prolog                  = (null)
PrologEpilogTimeout     = 65534
PrologSlurmctld         = (null)
PrologFlags             = Alloc,Contain,X11
PropagatePrioProcess    = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram           = (null)
ReconfigFlags           = (null)
RequeueExit             = (null)
RequeueExitHold         = (null)
ResumeFailProgram       = (null)
ResumeProgram           = (null)
ResumeRate              = 300 nodes/min
ResumeTimeout           = 60 sec
ResvEpilog              = (null)
ResvOverRun             = 0 min
ResvProlog              = (null)
ReturnToService         = 1
RoutePlugin             = route/default
SallocDefaultCommand    = (null)
SbcastParameters        = (null)
SchedulerParameters     = (null)
SchedulerTimeSlice      = 30 sec
SchedulerType           = sched/backfill
SelectType              = select/cons_res
SelectTypeParameters    = CR_CORE_MEMORY
SlurmUser               = slurm(1001)
SlurmctldAddr           = (null)
SlurmctldDebug          = info
SlurmctldHost[0]        = htp-batch-01(10.0.0.14)
SlurmctldLogFile        = /var/log/slurm/slurmctld.log
SlurmctldPort           = 6817
SlurmctldSyslogDebug    = unknown
SlurmctldPrimaryOffProg = (null)
SlurmctldPrimaryOnProg  = (null)
SlurmctldTimeout        = 120 sec
SlurmctldParameters     = (null)
SlurmdDebug             = info
SlurmdLogFile           = /var/log/slurm/slurmd.log
SlurmdParameters        = (null)
SlurmdPidFile           = /var/run/slurmd.pid
SlurmdPort              = 6818
SlurmdSpoolDir          = /var/spool/slurmd
SlurmdSyslogDebug       = unknown
SlurmdTimeout           = 300 sec
SlurmdUser              = root(0)
SlurmSchedLogFile       = (null)
SlurmSchedLogLevel      = 0
SlurmctldPidFile        = /var/run/slurmctld.pid
SlurmctldPlugstack      = (null)
SLURM_CONF              = /etc/slurm/slurm.conf
SLURM_VERSION           = 19.05.5
SrunEpilog              = (null)
SrunPortRange           = 0-0
SrunProlog              = (null)
StateSaveLocation       = /var/spool/slurm_state
SuspendExcNodes         = (null)
SuspendExcParts         = (null)
SuspendProgram          = (null)
SuspendRate             = 60 nodes/min
SuspendTime             = NONE
SuspendTimeout          = 30 sec
SwitchType              = switch/none
TaskEpilog              = (null)
TaskPlugin              = affinity,cgroup
TaskPluginParam         = (null type)
TaskProlog              = (null)
TCPTimeout              = 2 sec
TmpFS                   = /tmp
TopologyParam           = (null)
TopologyPlugin          = topology/none
TrackWCKey              = No
TreeWidth               = 50
UsePam                  = 0
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 60 sec
VSizeFactor             = 0 percent
WaitTime                = 0 sec
X11Parameters           = (null)

Cgroup Support Configuration:
AllowedDevicesFile      = /etc/slurm/cgroup_allowed_devices_file.conf
AllowedKmemSpace        = (null)
AllowedRAMSpace         = 100.0%
AllowedSwapSpace        = 0.0%
CgroupAutomount         = yes
CgroupMountpoint        = /sys/fs/cgroup
ConstrainCores          = yes
ConstrainDevices        = no
ConstrainKmemSpace      = no
ConstrainRAMSpace       = no
ConstrainSwapSpace      = no
MaxKmemPercent          = 100.0%
MaxRAMPercent           = 100.0%
MaxSwapPercent          = 100.0%
MemorySwappiness        = (null)
MinKmemSpace            = 30 MB
MinRAMSpace             = 30 MB
TaskAffinity            = no

Slurmctld(primary) at htp-batch-01 is UP

stderr:
arnikz (Author) commented Feb 25, 2020

Perhaps it's a good time to update the Docker images.

arnikz (Author) commented Mar 10, 2020

@sverhoeven: could you give an estimate of how much time is required to fix this? Thanks.

sverhoeven (Member) commented

The ScriptingParser used in the SlurmScheduler class does not know about section headers (lines ending in ':'). I think it would take at least a day to write a robust parser and a couple of hours to create new Xenon and Xenon-* releases.
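
For illustration only, here is a rough sketch (plain Java, hypothetical class name, not the actual Xenon ScriptingParser) of what a section-aware parse of the scontrol show config output could look like, where a header such as "Cgroup Support Configuration:" starts a new section instead of being treated as a key/value pair:

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a section-aware key/value parser for the output
// of "scontrol show config". Lines like "Cgroup Support Configuration:"
// start a new section; everything else is treated as "Key = Value".
public final class SectionedConfigParser {

    public static Map<String, Map<String, String>> parse(String output) {
        Map<String, Map<String, String>> sections = new LinkedHashMap<>();
        String current = "main";
        sections.put(current, new HashMap<>());

        for (String line : output.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            int eq = trimmed.indexOf('=');
            if (eq < 0) {
                // No '=' at all: either a section header ("... Configuration:")
                // or a status line ("Slurmctld(primary) at ... is UP").
                if (trimmed.endsWith(":")) {
                    current = trimmed.substring(0, trimmed.length() - 1);
                    sections.put(current, new HashMap<>());
                }
                continue;
            }
            String key = trimmed.substring(0, eq).trim();
            String value = trimmed.substring(eq + 1).trim();
            sections.get(current).put(key, value);
        }
        return sections;
    }
}
```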

jmaassen (Member) commented

I've just added a single statement to ignore all lines without an = sign in them.
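
As an aside, a minimal sketch of what such a guard amounts to, assuming a plain key=value split; the class and method names below are made up and this is not the actual Xenon code:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the guard described above: any scontrol
// output line without an '=' sign (section headers, status lines) is
// skipped before splitting it into a key/value pair.
public final class KeyValueLines {

    public static Map<String, String> parse(List<String> lines) {
        Map<String, String> result = new HashMap<>();
        for (String line : lines) {
            if (!line.contains("=")) {
                continue; // e.g. "Cgroup Support Configuration:"
            }
            String[] parts = line.split("=", 2);
            result.put(parts[0].trim(), parts[1].trim());
        }
        return result;
    }
}
```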

jmaassen (Member) commented

I'll add some tests and release a version with this fix.

jmaassen (Member) commented

It's fixed in the jobstatus-bug branch of xenon; at least, it parses the example shown above.

Would we need a Slurm 19 container to do proper testing?

arnikz (Author) commented Mar 11, 2020

> It's fixed in the jobstatus-bug branch of xenon; at least, it parses the example shown above.

Thanks.

> Would we need a Slurm 19 container to do proper testing?

Yes.

arnikz (Author) commented Mar 19, 2020

Hi, I've tested my workflow with xenon-cli 3.0.5beta1 and the new slurm-19 image, but the jobs are still failing. Please help!

sverhoeven (Member) commented

The conda xenon-cli 3.0.5beta1 package was only made for non-Linux users (#73).

It does not include the fix from the https://github.com/xenon-middleware/xenon/tree/jobstatus-bug branch; it is a build against the Xenon v3.0.4 release.

jmaassen (Member) commented Mar 20, 2020

Hmmm... my (new) unit test does parse the output correctly.

I think there may be some version mix-up with xenon somewhere. I'll see if I can find the problem.

Update: ah, it seems the fix may still only be in the jobstatus-bug branch ;-)

jmaassen (Member) commented

I'll clean up the branch and test it with the other (non-Slurm) scripting adaptors. I can then merge it into master and make a new release.

sverhoeven (Member) commented

I created a draft PR xenon-middleware/xenon#670 for the jobstatus-bug branch, to see the test failures more easily.

jmaassen (Member) commented

Hmmm... most of the tests pass, except for one integration test. Apparently the sbatch argument --workdir has changed to --chdir at some point. Will fix.
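
For reference, a minimal hypothetical sketch of that flag difference when building the sbatch argument list; the helper below is made up and is not Xenon's actual code:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the flag change mentioned above: newer Slurm
// releases accept --chdir where older releases used --workdir to set
// the job's working directory.
public final class SbatchArgs {

    public static List<String> workingDirArgs(String workingDirectory, boolean newSlurm) {
        List<String> args = new ArrayList<>();
        if (newSlurm) {
            args.add("--chdir=" + workingDirectory);   // Slurm 19 style
        } else {
            args.add("--workdir=" + workingDirectory); // older Slurm style
        }
        return args;
    }
}
```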

jmaassen (Member) commented

Fixed in the 3.1.0 release.

sverhoeven (Member) commented

CLI v3.0.5 has been released on conda with Xenon 3.1.0. Please test.

arnikz (Author) commented Mar 23, 2020

All works fine with the latest release on Slurm 19. Thanks!

arnikz closed this as completed Mar 23, 2020