
job submission: allow custom resource requests #32


@lukasheinrich

#31 would introduce default resources, but of course custom ones could be submitted at job submission time. I'm thinking we could just reuse the k8s syntax straight away:

https://kubernetes.io/docs/user-guide/compute-resources/
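
As a rough sketch, reusing that syntax directly inside a step could look like the following (the placement of the resources block here is purely illustrative and not an existing REANA field):

workflow:
  type: serial
  specification:
    steps:
      - environment: 'reanahub/reana-env-root6'
        # hypothetical: Kubernetes-style requests/limits reused verbatim
        resources:
          requests:
            memory: "2Gi"
            cpu: "1"
          limits:
            memory: "4Gi"
            cpu: "2"
        commands:
          - echo hello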

@alintulu commented Sep 1, 2020

Continuing the discussion on resource requests: the reana.yaml already has a resources clause, which is used when the workflow needs to access CVMFS.

workflow:
  resources:
    cvmfs:
      - fcc.cern.ch

This clause could maybe be used for requesting CPU/memory or setting an accounting group for HTCondor jobs. One suggestion:

workflow:
  resources:
    htcondor:
      accounting_group: 'physics_group'
      cpu: 4

However, a problem arises when it comes to HTCondor submission parameters such as max runtime. Since the resources clause is workflow-wide, there is only one value for the whole workflow, whereas the max runtime of HTCondor jobs would need to be set for every step independently. The value could even vary from step to step, depending on the sample.

@tiborsimko any thoughts on how to best implement this? Maybe max_runtime could follow the same structure as kerberos: true

workflow:
      type: serial
      specification:
        steps:
          - environment: 'reanahub/reana-env-root6'
            kerberos: true
            max_runtime: 3600
            commands:
              - echo hello

@tiborsimko

Yes, I fully agree we should specify them alongside each workflow step and not globally as for the CVMFS volumes. (That one is for volume mounting, so it could be defined globally more easily.)

(1) For a Yadage example, see kubernetes_uid in the resources clause:

eventselection:
  process:
    process_type: interpolated-script-cmd
    interpreter: bash
    script: |
      source /home/atlas/release_setup.sh
      source /analysis/build/x86*/setup.sh

      cat << 'EOF' > recast_xsecs.txt
      id/I:name/C:xsec/F:kfac/F:eff/F:relunc/F
      {did} {name} {xsec_in_pb} 1.0 1.0 1.0
      EOF

      echo {dxaod_file} > recast_inputs.txt
      myEventSelection {submitDir} recast_inputs.txt recast_xsecs.txt {lumi_in_ifb}
  publisher:
    publisher_type: interpolated-pub
    publish:
      histfile: '{submitDir}/hist-sample.root'
  environment:
    environment_type: 'docker-encapsulated'
    image: reanahub/reana-demo-atlas-recast-eventselection
    imagetag: '1.0'
    resources:
      - kubernetes_uid: 500

We could have there:

resources:
  - kubernetes_uid: 500
  - htcondor_accounting_group: 'physics_group'
  - htcondor_cpu: 4

Perhaps we can use a simple flat structure (i.e. htcondor_foo) since we already have kubernetes_uid there.

Something to cross-check with Yadage's native use of resources? CC @lukasheinrich @danikam: are you also interested in Kubernetes or HTCondor memory/processor settings?

(2) For CWL it is similar; we are using "hints" there:

steps:
  first:
    hints:
      reana:
        compute_backend: htcondorcern
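
A sketch of how resource requests might ride along in the same reana hints block (the fields other than compute_backend are hypothetical additions, for illustration only):

steps:
  first:
    hints:
      reana:
        compute_backend: htcondorcern
        # hypothetical resource fields, not currently supported:
        htcondor_accounting_group: 'physics_group'
        cpu: 4
        memory: '2Gi'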

(3) For Serial, we can do as you suggest and let them live alongside the kerberos: true clause. However, it might also be nice to have a special "resources" sub-clause there, like Yadage's "resources" or CWL's "hints", for consistency. That would mean altering the current behaviour of kerberos, though, which is perhaps not a good moment to address right now... Something to think about for later? (A sketch of what it could look like follows below.)
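
For illustration, such a Serial "resources" sub-clause could look roughly like this (a hypothetical structure; today kerberos: true lives directly on the step):

workflow:
  type: serial
  specification:
    steps:
      - environment: 'reanahub/reana-env-root6'
        resources:
          kerberos: true
          htcondor_accounting_group: 'physics_group'
          max_runtime: 3600
        commands:
          - echo hello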

@alintulu commented Sep 2, 2020

@clelange

@alintulu commented Sep 2, 2020

Concerning the max_runtime parameter, it could be implemented more generically (not as htcondor_maxruntime) and mapped to the HTCondor parameter +MaxRuntime when HTCondor is the chosen backend, and to the pods when the backend is Kubernetes.
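
For illustration, the step specification could stay backend-agnostic while the job controller translates the value per backend (the translation shown in the comments is an assumption about how it could be implemented):

workflow:
  type: serial
  specification:
    steps:
      - environment: 'reanahub/reana-env-root6'
        # generic value; per backend this could become:
        #   HTCondor:   +MaxRuntime = 3600
        #   Kubernetes: activeDeadlineSeconds: 3600 on the job/pod spec
        max_runtime: 3600
        commands:
          - echo hello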

@clelange commented Sep 2, 2020

I guess at some point things could get a bit complicated if, depending on the compute backend, one always needs to prepend the backend name to the resource. That would make it somewhat annoying to switch from one backend to the other, which is something I would usually want to do: I expect Kubernetes jobs to start much faster, so I would use Kubernetes for validation and HTCondor for the full processing.

There are of course certain resources that only make sense for a given platform, e.g. the accounting group on HTCondor, so those should probably keep the prefix. On the other hand, +MaxRuntime on HTCondor is largely equivalent to activeDeadlineSeconds of a Kubernetes Job spec, see https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-termination-and-cleanup. This does not translate 1-to-1 to a Pod, but is probably pretty close in the HEP case.
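
Put together, the resources entries on a step could then mix backend-agnostic and backend-prefixed items, e.g. (field names here are only a sketch):

resources:
  - max_runtime: 3600                            # generic: maps to +MaxRuntime or activeDeadlineSeconds
  - htcondor_accounting_group: 'physics_group'   # HTCondor-specific, hence the prefix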

For CPU and memory requests, I would think that plain cpu and memory fields can be used, as for Kubernetes Pod resources, where there is a distinction between requests and limits:

resources:
  requests:
    memory: "64Mi"
    cpu: "250m"
  limits:
    memory: "128Mi"
    cpu: "500m"

Not sure if REANA should reflect the defaults on HTCondor, i.e. request cpu: 1000m and memory: 2Gi, since that would make scheduling more difficult due to the limited number of resources, but at the same time it might make jobs more stable.
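
If REANA did mirror those HTCondor defaults for Kubernetes jobs, the equivalent per-step request would be roughly (a sketch using the values quoted above):

resources:
  requests:
    cpu: "1000m"
    memory: "2Gi"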
