Slurm adaptor got invalid key/value pair in output #72

Closed

arnikz opened this issue Feb 25, 2020 · 16 comments
arnikz commented Feb 25, 2020

The GridEngine cluster at UMCU has recently been upgraded to Slurm (v19), which will replace GE soon. I tested the sv-callers workflow, but all Slurm jobs failed (also without the --max-memory argument; see the release notes).

xenon -vvv scheduler slurm --location local:// submit --name smk.{rule} --inherit-env --cores-per-task {threads} --max-run-time 5 --max-memory {resources.mem_mb} --working-directory . --stderr stderr-%j.log --stdout stdout-%j.log
slurm adaptor: Got invalid key/value pair in output: Cgroup Support Configuration:
Error submitting jobscript (exit code 1):
13:18:55.487 [main] DEBUG n.e.x.a.s.ScriptingScheduler - creating sub scheduler for slurm adaptor at local://
13:18:55.498 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - Creating JobQueueScheduler for Adaptor local with multiQThreads: 4 and pollingDelay: 1000
13:18:55.501 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: Submitting job
13:18:55.506 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: Created Job local-0
13:18:55.507 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: Submitting job to queue unlimited
13:18:55.508 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: Waiting for interactive job to start.
13:18:55.543 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: getJobStatus for job local-0
13:18:55.543 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: findJob for job local-0
13:18:55.544 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: findJob for job local-0
13:18:55.544 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: findJob for job local-0
13:18:55.544 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: findJob for job local-0
13:18:55.545 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: cleanupJob for job local-0
13:18:55.545 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: cleanupJob for job local-0
13:18:55.545 [main] DEBUG n.e.x.a.schedulers.JobQueueScheduler - local: cleanupJob for job local-0
13:18:55.545 [main] DEBUG n.e.x.a.s.RemoteCommandRunner - CommandRunner took 44 ms, executable = scontrol, arguments = [show, config], exitcode = 0, stdout:
Configuration data as of 2020-02-24T13:18:55
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = none
AccountingStorageHost   = htp-batch-01
AccountingStorageLoc    = N/A
AccountingStoragePort   = 6819
AccountingStorageTRES   = cpu,mem,energy,node,billing,fs/disk,vmem,pages
AccountingStorageType   = accounting_storage/slurmdbd
AccountingStorageUser   = N/A
AccountingStoreJobComment = Yes
AcctGatherEnergyType    = acct_gather_energy/none
AcctGatherFilesystemType = acct_gather_filesystem/none
AcctGatherInterconnectType = acct_gather_interconnect/none
AcctGatherNodeFreq      = 0 sec
AcctGatherProfileType   = acct_gather_profile/none
AllowSpecResourcesUsage = 0
AuthAltTypes            = (null)
AuthInfo                = (null)
AuthType                = auth/munge
BatchStartTimeout       = 10 sec
BOOT_TIME               = 2020-02-17T18:10:10
BurstBufferType         = (null)
CheckpointType          = checkpoint/none
CliFilterPlugins        = (null)
ClusterName             = spider
CommunicationParameters = (null)
CompleteWait            = 0 sec
CoreSpecPlugin          = core_spec/none
CpuFreqDef              = Unknown
CpuFreqGovernors        = Performance,OnDemand,UserSpace
CredType                = cred/munge
DebugFlags              = (null)
DefMemPerCPU            = 8000
DisableRootJobs         = No
EioTimeout              = 60
EnforcePartLimits       = ALL
Epilog                  = /data/tmpdir-epilogue.sh
EpilogMsgTime           = 2000 usec
EpilogSlurmctld         = (null)
ExtSensorsType          = ext_sensors/none
ExtSensorsFreq          = 0 sec
FairShareDampeningFactor = 1
FastSchedule            = 1
FederationParameters    = (null)
FirstJobId              = 1
GetEnvTimeout           = 2 sec
GresTypes               = (null)
GpuFreqDef              = high,memory=high
GroupUpdateForce        = 1
GroupUpdateTime         = 600 sec
HASH_VAL                = Match
HealthCheckInterval     = 0 sec
HealthCheckNodeState    = ANY
HealthCheckProgram      = (null)
InactiveLimit           = 0 sec
JobAcctGatherFrequency  = 30
JobAcctGatherType       = jobacct_gather/linux
JobAcctGatherParams     = NoOverMemoryKill
JobCheckpointDir        = /var/slurm/checkpoint
JobCompHost             = localhost
JobCompLoc              = /var/log/slurm_jobcomp.log
JobCompPort             = 0
JobCompType             = jobcomp/none
JobCompUser             = root
JobContainerType        = job_container/none
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobDefaults             = (null)
JobFileAppend           = 0
JobRequeue              = 0
JobSubmitPlugins        = (null)
KeepAliveTime           = SYSTEM_DEFAULT
KillOnBadExit           = 0
KillWait                = 30 sec
LaunchParameters        = (null)
LaunchType              = launch/slurm
Layouts                 = 
Licenses                = (null)
LicensesUsed            = (null)
LogTimeFormat           = iso8601_ms
MailDomain              = (null)
MailProg                = /bin/mail
MaxArraySize            = 1001
MaxJobCount             = 10000
MaxJobId                = 67043328
MaxMemPerCPU            = 8000
MaxStepCount            = 40000
MaxTasksPerNode         = 512
MCSPlugin               = mcs/none
MCSParameters           = (null)
MessageTimeout          = 10 sec
MinJobAge               = 300 sec
MpiDefault              = none
MpiParams               = (null)
MsgAggregationParams    = (null)
NEXT_JOB_ID             = 26161
NodeFeaturesPlugins     = (null)
OverTimeLimit           = 0 min
PluginDir               = /usr/lib64/slurm
PlugStackConfig         = /etc/slurm/plugstack.conf
PowerParameters         = (null)
PowerPlugin             = 
PreemptMode             = OFF
PreemptType             = preempt/none
PreemptExemptTime       = 00:00:00
PriorityParameters      = (null)
PrioritySiteFactorParameters = (null)
PrioritySiteFactorPlugin = (null)
PriorityDecayHalfLife   = 1-00:00:00
PriorityCalcPeriod      = 00:05:00
PriorityFavorSmall      = No
PriorityFlags           = 
PriorityMaxAge          = 7-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType            = priority/multifactor
PriorityWeightAge       = 10000
PriorityWeightAssoc     = 0
PriorityWeightFairShare = 10000
PriorityWeightJobSize   = 0
PriorityWeightPartition = 0
PriorityWeightQOS       = 10000
PriorityWeightTRES      = (null)
PrivateData             = none
ProctrackType           = proctrack/cgroup
Prolog                  = (null)
PrologEpilogTimeout     = 65534
PrologSlurmctld         = (null)
PrologFlags             = Alloc,Contain,X11
PropagatePrioProcess    = 0
PropagateResourceLimits = ALL
PropagateResourceLimitsExcept = (null)
RebootProgram           = (null)
ReconfigFlags           = (null)
RequeueExit             = (null)
RequeueExitHold         = (null)
ResumeFailProgram       = (null)
ResumeProgram           = (null)
ResumeRate              = 300 nodes/min
ResumeTimeout           = 60 sec
ResvEpilog              = (null)
ResvOverRun             = 0 min
ResvProlog              = (null)
ReturnToService         = 1
RoutePlugin             = route/default
SallocDefaultCommand    = (null)
SbcastParameters        = (null)
SchedulerParameters     = (null)
SchedulerTimeSlice      = 30 sec
SchedulerType           = sched/backfill
SelectType              = select/cons_res
SelectTypeParameters    = CR_CORE_MEMORY
SlurmUser               = slurm(1001)
SlurmctldAddr           = (null)
SlurmctldDebug          = info
SlurmctldHost[0]        = htp-batch-01(10.0.0.14)
SlurmctldLogFile        = /var/log/slurm/slurmctld.log
SlurmctldPort           = 6817
SlurmctldSyslogDebug    = unknown
SlurmctldPrimaryOffProg = (null)
SlurmctldPrimaryOnProg  = (null)
SlurmctldTimeout        = 120 sec
SlurmctldParameters     = (null)
SlurmdDebug             = info
SlurmdLogFile           = /var/log/slurm/slurmd.log
SlurmdParameters        = (null)
SlurmdPidFile           = /var/run/slurmd.pid
SlurmdPort              = 6818
SlurmdSpoolDir          = /var/spool/slurmd
SlurmdSyslogDebug       = unknown
SlurmdTimeout           = 300 sec
SlurmdUser              = root(0)
SlurmSchedLogFile       = (null)
SlurmSchedLogLevel      = 0
SlurmctldPidFile        = /var/run/slurmctld.pid
SlurmctldPlugstack      = (null)
SLURM_CONF              = /etc/slurm/slurm.conf
SLURM_VERSION           = 19.05.5
SrunEpilog              = (null)
SrunPortRange           = 0-0
SrunProlog              = (null)
StateSaveLocation       = /var/spool/slurm_state
SuspendExcNodes         = (null)
SuspendExcParts         = (null)
SuspendProgram          = (null)
SuspendRate             = 60 nodes/min
SuspendTime             = NONE
SuspendTimeout          = 30 sec
SwitchType              = switch/none
TaskEpilog              = (null)
TaskPlugin              = affinity,cgroup
TaskPluginParam         = (null type)
TaskProlog              = (null)
TCPTimeout              = 2 sec
TmpFS                   = /tmp
TopologyParam           = (null)
TopologyPlugin          = topology/none
TrackWCKey              = No
TreeWidth               = 50
UsePam                  = 0
UnkillableStepProgram   = (null)
UnkillableStepTimeout   = 60 sec
VSizeFactor             = 0 percent
WaitTime                = 0 sec
X11Parameters           = (null)

Cgroup Support Configuration:
AllowedDevicesFile      = /etc/slurm/cgroup_allowed_devices_file.conf
AllowedKmemSpace        = (null)
AllowedRAMSpace         = 100.0%
AllowedSwapSpace        = 0.0%
CgroupAutomount         = yes
CgroupMountpoint        = /sys/fs/cgroup
ConstrainCores          = yes
ConstrainDevices        = no
ConstrainKmemSpace      = no
ConstrainRAMSpace       = no
ConstrainSwapSpace      = no
MaxKmemPercent          = 100.0%
MaxRAMPercent           = 100.0%
MaxSwapPercent          = 100.0%
MemorySwappiness        = (null)
MinKmemSpace            = 30 MB
MinRAMSpace             = 30 MB
TaskAffinity            = no

Slurmctld(primary) at htp-batch-01 is UP

stderr:
arnikz (Author) commented Feb 25, 2020

Perhaps it's a good time to update the Docker images.

arnikz (Author) commented Mar 10, 2020

@sverhoeven: could you give an estimate of how much time is required to fix this? Thanks.

sverhoeven (Member) commented

The ScriptingParser used in the SlurmScheduler class does not know about section headers (lines ending in ':'). I think it would take at least a day to write a robust parser and a couple of hours to create new Xenon and Xenon-* releases.
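
For illustration only, here is a rough sketch (plain Java, hypothetical class name, not the actual Xenon ScriptingParser) of what a section-aware parse of the scontrol show config output could look like, where a header such as "Cgroup Support Configuration:" starts a new section instead of being treated as a key/value pair:

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a section-aware key/value parser for the output
// of "scontrol show config". Lines like "Cgroup Support Configuration:"
// start a new section; everything else is treated as "Key = Value".
public final class SectionedConfigParser {

    public static Map<String, Map<String, String>> parse(String output) {
        Map<String, Map<String, String>> sections = new LinkedHashMap<>();
        String current = "main";
        sections.put(current, new HashMap<>());

        for (String line : output.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            int eq = trimmed.indexOf('=');
            if (eq < 0) {
                // No '=' at all: either a section header ("... Configuration:")
                // or a status line ("Slurmctld(primary) at ... is UP").
                if (trimmed.endsWith(":")) {
                    current = trimmed.substring(0, trimmed.length() - 1);
                    sections.put(current, new HashMap<>());
                }
                continue;
            }
            String key = trimmed.substring(0, eq).trim();
            String value = trimmed.substring(eq + 1).trim();
            sections.get(current).put(key, value);
        }
        return sections;
    }
}
```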

jmaassen (Member) commented

I've just added a single statement to ignore all lines without an = sign in them.
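
As an aside, a minimal sketch of what such a guard amounts to, assuming a plain key=value split; the class and method names below are made up and this is not the actual Xenon code:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the guard described above: any scontrol
// output line without an '=' sign (section headers, status lines) is
// skipped before splitting it into a key/value pair.
public final class KeyValueLines {

    public static Map<String, String> parse(List<String> lines) {
        Map<String, String> result = new HashMap<>();
        for (String line : lines) {
            if (!line.contains("=")) {
                continue; // e.g. "Cgroup Support Configuration:"
            }
            String[] parts = line.split("=", 2);
            result.put(parts[0].trim(), parts[1].trim());
        }
        return result;
    }
}
```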

jmaassen (Member) commented

I'll add some tests and release a version with this fix.

jmaassen (Member) commented

It's fixed in the jobstatus-bug branch of xenon; at least, it parses the example shown above.

Would we need a Slurm 19 container to do proper testing?

arnikz (Author) commented Mar 11, 2020

> It's fixed in the jobstatus-bug branch of xenon; at least, it parses the example shown above.

Thanks.

> Would we need a Slurm 19 container to do proper testing?

Yes.

arnikz (Author) commented Mar 19, 2020

Hi, I've tested my workflow with xenon-cli 3.0.5beta1 and the new slurm-19 image, but the jobs are still failing. Please help!

sverhoeven (Member) commented

The conda xenon-cli 3.0.5beta1 package was only made for non-Linux users (#73).

It does not include the fix from the https://github.com/xenon-middleware/xenon/tree/jobstatus-bug branch; it is a build against the Xenon v3.0.4 release.

jmaassen (Member) commented Mar 20, 2020

Hmmm... my (new) unit test does parse the output correctly.

I think there may be some version mix-up with xenon somewhere. I'll see if I can find the problem.

Update: ah, it seems the fix may still only be in the jobstatus-bug branch ;-)

jmaassen (Member) commented

I'll clean up the branch and test it with the other (non-Slurm) scripting adaptors. I can then merge it into master and make a new release.

sverhoeven (Member) commented

I created a draft PR xenon-middleware/xenon#670 for the jobstatus-bug branch, to see the test failures more easily.

jmaassen (Member) commented

Hmmm... most of the tests pass, except for one integration test. Apparently the sbatch argument --workdir has changed to --chdir at some point. Will fix.
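
For reference, a minimal hypothetical sketch of that flag difference when building the sbatch argument list; the helper below is made up and is not Xenon's actual code:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the flag change mentioned above: newer Slurm
// releases accept --chdir where older releases used --workdir to set
// the job's working directory.
public final class SbatchArgs {

    public static List<String> workingDirArgs(String workingDirectory, boolean newSlurm) {
        List<String> args = new ArrayList<>();
        if (newSlurm) {
            args.add("--chdir=" + workingDirectory);   // Slurm 19 style
        } else {
            args.add("--workdir=" + workingDirectory); // older Slurm style
        }
        return args;
    }
}
```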

jmaassen (Member) commented

Fixed in the 3.1.0 release.

sverhoeven (Member) commented

CLI v3.0.5 has been released on conda with Xenon 3.1.0. Please test.

arnikz (Author) commented Mar 23, 2020

All works fine with the latest release on Slurm 19. Thanks!

arnikz closed this as completed Mar 23, 2020