adding more notes on registries, monitoring and interfaces
Signed-off-by: vsoch <vsoch@users.noreply.github.com>
vsoch committed Oct 22, 2021
1 parent 403c868 commit c0ee034
Showing 7 changed files with 203 additions and 4 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -115,3 +115,4 @@ to see the server running.
- code blocks should be copy-pasteable
- some code blocks could be runnable!
- site should render into pdf
- spell checking
8 changes: 4 additions & 4 deletions _data/toc.yml
@@ -13,6 +13,8 @@
children:
- title: SLURM
url: docs/schedulers/slurm
- title: Sun Grid Engine
url: docs/schedulers/sge
- title: Building and Packaging
url: docs/packaging
children:
@@ -36,10 +38,8 @@
# url: "docs/open-source"
# - title: Building, Packaging, Application Development
# url: "docs/building"
# - title: Monitoring
# url: "docs/monitoring"
# - title: Web Interfaces
# url: "docs/web-interfaces"
- title: Monitoring
url: "docs/monitoring"
# - title: Continuous Integration and Delivery
# url: "docs/ci-cd"
- title: "Getting Started"
9 changes: 9 additions & 0 deletions _docs/containers/docker/distribution.md
@@ -26,3 +26,12 @@ I should be able to easily:
Generally, a distribution comes down to a strategy that includes software, servers, and databases to fulfill these goals.
The [distribution specification](https://github.com/opencontainers/distribution-spec/blob/master/spec.md) helps to outline a lot of the interactions to make this possible.

## Registry Options

So you have a container: where should you put it? Here are some recommendations.

- [GitHub Packages](https://docs.github.com/en/packages/working-with-a-github-packages-registry/working-with-the-docker-registry): supports both Docker and "OCI registry as storage" artifacts, which includes Singularity images.
- [Docker Hub](https://hub.docker.com/): was of course the "first" container registry; however, be careful using it, because you would be required to put an account token (with access to all of your repositories) in a CI service. They have also toyed with purging old images and have implemented [rate limiting](https://www.docker.com/increase-rate-limits), so while it's an option, it's not highly recommended.
- [Quay.io](https://quay.io/): is provided by Red Hat, and is a nice registry because there are currently no limits on containers or pulling, and you can generate bots with repository-specific tokens and permissions. How do you pronounce it? You probably want to say "KWAY," but I believe it's correctly pronounced "KEY."

If you are deploying from GitHub, there is a nice template [here](https://github.com/autamus/container-builder-template) demonstrating how to push to an OCI registry (Docker Hub, Quay.io) or to GitHub Packages. If you want to deploy Singularity (SIF) images in the same manner, use [this template](https://github.com/singularityhub/github-ci).
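
For reference, here is a minimal sketch of pushing a locally built image to GitHub Packages (the username, image name, and `GITHUB_TOKEN` environment variable are placeholders; the token needs the `write:packages` scope):

```bash
# Authenticate to the GitHub container registry with a personal access token
# (username and image name below are placeholders)
echo "$GITHUB_TOKEN" | docker login ghcr.io -u <your-username> --password-stdin

# Tag the locally built image for the registry and push it
docker tag my-image:latest ghcr.io/<your-username>/my-image:latest
docker push ghcr.io/<your-username>/my-image:latest
```

The same pattern works for Quay.io by swapping `ghcr.io` for `quay.io` and using a robot account token.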
43 changes: 43 additions & 0 deletions _docs/containers/singularity/index.md
@@ -210,6 +210,49 @@ If you build locally and need to transfer to a cluster, you can use scp:
$ scp container.sif <username>@login.<cluster>:/scratch/users/<username>/container.sif
```

## Associated Tools

A landscape of [Associated tools](https://singularityhub.github.io/) is still maintained by @vsoch and other Singularity community members, including (but not limited to):

- [Singularity HPC](https://singularity-hpc.readthedocs.io/en/latest/): allows installing containers as modules, using Singularity (or other container technologies like Podman) under the hood; see the short example after this list.
- [Singularity Compose](https://singularityhub.github.io/singularity-compose/): orchestration for Singularity containers.
- [Singularity Python](https://singularityhub.github.io/singularity-cli/): a Python client for Singularity.
- [Singularity Catalog](https://singularityhub.github.io/singularity-catalog/): a nice place to browse recipes.
- [docker2singularity](https://github.com/singularityhub/docker2singularity): an early tool to convert Docker images to Singularity images on a host.
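
As a quick sketch of how Singularity HPC (shpc) is typically used (the registry entry and module name below are examples and will vary by version):

```bash
# Install Singularity HPC (shpc) into a Python environment
pip install singularity-hpc

# Install a container from the shpc registry; it is wrapped as a module
shpc install python

# Load it like any other software module (exact name depends on the tag installed)
module load python/<version-tag>
```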


## Registry Options

### GitHub Packages

So you have a Singularity container, where should you put it? Since a Singularity image is considered an [ORAS artifact](https://oras.land/), you can push it natively to GitHub Packages, and since packages can live alongside your code and CI and there are no limits, this is the recommended approach.
You can follow [this template](https://github.com/singularityhub/github-ci) to have automated builds and deploys for your containers.
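
As a minimal sketch (the username, repository, and tag below are placeholders, and you need a GitHub personal access token with package write access):

```bash
# Log in to the GitHub container registry for ORAS pushes (prompts for your token)
singularity remote login --username <your-username> oras://ghcr.io

# Push the SIF image as an ORAS artifact, and pull it back anywhere
singularity push container.sif oras://ghcr.io/<your-username>/container:latest
singularity pull oras://ghcr.io/<your-username>/container:latest
```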

### A Docker Registry

To kill two birds with one stone, you can actually build a Docker image, push it to a Docker registry, and then pull it down with Singularity. As an example:

```bash
$ singularity pull docker://vanessa/salad
```

This is another recommended approach, since you can choose a Docker registry without rate or storage limits. See [Docker Registry options]({{ site.baseurl }}/docs/containers/docker/distribution#registry-options) for this use case.
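
A rough sketch of the full round trip, assuming a Dockerfile in the current directory and a Quay.io repository (the organization and image names are placeholders):

```bash
# Build the Docker image and push it to Quay.io (names are placeholders)
docker build -t quay.io/<your-org>/my-tool:latest .
docker login quay.io
docker push quay.io/<your-org>/my-tool:latest

# On the cluster, pull it down as a Singularity (SIF) image
singularity pull my-tool.sif docker://quay.io/<your-org>/my-tool:latest
```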

### Sylabs Cloud

The company Sylabs provides a [cloud](https://cloud.sylabs.io/) where you can create a free account to store your images. If you have a few small images this can work, but note that the space is limited and fills up quickly. You will likely need to pay to use it in a substantial way.
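
A minimal sketch of pushing to the Sylabs library, assuming you have created an access token at cloud.sylabs.io (the namespace and tag are placeholders):

```bash
# Authenticate against the Sylabs cloud (prompts for your access token)
singularity remote login

# Push the image (unsigned) to your library namespace
singularity push -U container.sif library://<your-username>/default/container:latest
```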

### Singularity Registry Server

[Singularity Registry Server](https://singularityhub.github.io/sregistry/) (sregistry) is the open source version of Singularity Hub. It serves the Sylabs-developed library API, so the Singularity software can interact with it natively. It is not an OCI registry proper, so it's not a highly recommended tool, but if your center needs to deploy a registry for users to pull from with Singularity, this will fit the bill. It can be deployed with docker-compose, Ansible, or (upon request) could easily work with Kubernetes.
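
A very rough sketch of a docker-compose based deployment (the settings that need to be customized before starting are described in the sregistry documentation):

```bash
# Clone the server and bring it up with docker-compose
# (customize the settings described in the sregistry docs first)
git clone https://github.com/singularityhub/sregistry
cd sregistry
docker-compose up -d
```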

### Singularity Hub

While [Singularity Hub](https://singularityhub.github.io/singularityhub-docs/) is no longer online, it was the first Singularity container registry,
maintained single-handedly by [@vsoch](https://github.com/vsoch) for the five years she worked at Stanford. She could not take it with her, so the
registry was [converted to an archive](https://singularityhub.github.io/singularityhub-docs/2021/going-read-only/) that is now hosted at Dartmouth via the [Datalad project](https://singularity-hub.org/). As a promise of reproducibility, all of the containers that were available via the `shub://` URI are still pullable, and without the rate limits that Singularity Hub was [forced to implement](https://singularityhub.github.io/singularityhub-docs/2019/release-announcement/) in 2019.



## Project History

36 changes: 36 additions & 0 deletions _docs/interfaces/index.md
@@ -20,6 +20,42 @@ To create a dummy environment to test, you can use [ood-compose](https://github
alongside [this tutorial for app development](https://osc.github.io/ood-documentation/release-1.6/app-development.html).
If you have any questions or issues, there is a [discourse here](https://discourse.osc.edu/c/open-ondemand/5).

## Scientific Applications

It's often the case that a researcher wants to deploy some kind of app, sometimes with extensive memory or other resource requirements.
While there is no single answer for how to best do this, the following approaches are often used:

### Cloud Offerings

It can be fairly easy to deploy a containerized application to a cloud provider. Here are several services that you can use from
different providers:

#### Google Cloud

For all Google Cloud products, you are required to [create a Google project](https://cloud.google.com/gcp). You should check with your university research computing to see if there is a means to get an account through them, as it's common for universities to get discounts on these resources.

- [Compute Engine](https://cloud.google.com/compute/docs): is a straightforward strategy to deploy an instance, and then manage it yourself. You can either set up a simple web server, or run an application in a container. You are in charge of managing the instance.
- [App Engine](https://cloud.google.com/appengine/docs): allows you to manage an app locally, and deploy from the command line. It's easy, but tends to be more expensive.
- [Kubernetes Engine](https://cloud.google.com/kubernetes-engine/): a scaled container cluster that is more complex, and likely only needed to maintain an app that needs to scale better, or multiple apps.
- [Cloud Functions](https://cloud.google.com/functions/): appropriate if you have a set of small scripts (or web services) to run.
- [Cloud Run](https://cloud.google.com/run/): a very simple way to run a container (also serverless).

For any Google Cloud product that you use, make sure to [estimate costs](https://cloud.google.com/products/calculator) and then [set alerts](https://cloud.google.com/billing/docs/how-to/budgets) in case they go over.
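
As a quick sketch, deploying a prebuilt container to Cloud Run (the service name, project, image, and region below are placeholders) can be as simple as:

```bash
# Deploy a container image to Cloud Run; name, project, and region are placeholders
gcloud run deploy my-app \
    --image gcr.io/<your-project>/my-app:latest \
    --region us-central1 \
    --allow-unauthenticated
```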

### Local Offerings

If your center can provide on-premise virtual machines, this is of course another option. This will require someone to manage the application (the server and certificates), so it will likely only be available at larger institutions, if at all.

### Hosted Offerings

#### Static Sites

If your app is static (e.g., uses basic JavaScript), you can host static content easily on [GitHub Pages](https://pages.github.com/) or [Netlify](https://www.netlify.com/).

#### RShiny Apps

If you are familiar with R, you are likely familiar with [Shiny](https://shiny.rstudio.com/), and [shinyapps.io](https://www.shinyapps.io/) continues to offer free hosting for apps (up to 5 apps at 25 hours per month on a free plan, which is suitable for many small research groups).

## Science Gateways

A science gateway is another means to allow access to advanced computational resources, and unlike OnDemand which
23 changes: 23 additions & 0 deletions _docs/monitoring/index.md
@@ -0,0 +1,23 @@
---
title: Monitoring
tags:
- monitoring
description: Tools and practices for monitoring
permalink: /docs/monitoring/
---

# Monitoring

## Research Computing

What tools can you use to monitor the health of a cluster? Here are some recommendations:

- [Grafana](https://grafana.com/): provides operational dashboards to monitor cluster health.
- [Slurm-web](https://edf-hpc.github.io/slurm-web/): web frontend and REST API to [Slurm]({{ site.baseurl }}/docs/schedulers/slurm) workload manager. You can see job states, reservations, and node metrics.
- [Prometheus](https://prometheus.io/): open source monitoring solution, also widely used outside of the HPC space (see the sketch after this list).
- [Telegraf](https://www.influxdata.com/time-series-platform/telegraf/): open source server to collect system metrics.
- [InfluxDB](https://www.influxdata.com/products/influxdb/): time series data platform
- [Netdata](https://www.netdata.cloud/): real-time node-level performance; however, it does not provide much history.
- [XDMoD](https://open.xdmod.org/9.0/index.html): provides historical performance of jobs, queues, and resources.
- [XDMoD SUPReMM](https://supremm.xdmod.org/9.0/supremm-overview.html): an extension that also provides job-level performance
- [Ganglia](http://ganglia.sourceforge.net/): Not real-time, but provides node-level performance and history. Note that in 2019 they were [looking for maintainers](https://sourceforge.net/p/ganglia/mailman/message/36795542/).
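
As a minimal sketch of node-level metric collection with Prometheus (assuming Docker is available on the node; the image names are the official ones, and the config file is a placeholder you would write yourself):

```bash
# Expose node metrics on port 9100 with the Prometheus node exporter
docker run -d --name node-exporter --net host prom/node-exporter

# Run a Prometheus server that scrapes it; prometheus.yml (written by you)
# would list localhost:9100 as a scrape target. Browse the UI on port 9090.
docker run -d --name prometheus -p 9090:9090 \
    -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
    prom/prometheus
```

Grafana can then be pointed at Prometheus as a data source to build dashboards.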
87 changes: 87 additions & 0 deletions _docs/schedulers/sge.md
@@ -0,0 +1,87 @@
---
title: Sun Grid Engine
tags:
- scheduling
- resource-management
description: Getting started with Sun Grid Engine (SGE)
links:
- name: SGE Documentation
url: http://star.mit.edu/cluster/docs/0.93.3/guides/sge.html
- name: Command reference between SLURM and SGE (pdf)
url: https://hpcsupport.utsa.edu/foswiki/pub/Main/SampleSlurmSubmitScripts/SGEtoSLURMconversion.pdf
---

# Sun Grid Engine

Sun Grid Engine (SGE) is an alternative to SLURM, and it provides a similar set of commands, with slight variation.
The [documentation](http://star.mit.edu/cluster/docs/0.93.3/guides/sge.html) is a good place to start, or if you are familiar with
[SLURM]({{ site.baseurl }}/docs/schedulers/slurm), we've translated the [command conversion tables](https://hpcsupport.utsa.edu/foswiki/pub/Main/SampleSlurmSubmitScripts/SGEtoSLURMconversion.pdf) below.

Here are some common commands and flags in SGE and SLURM with their respective equivalents. Is there anything missing? [Let us know](https://github.com/rse-ops/knowledge/issues).

| **User Commands** | **SGE** | **SLURM** |
| --- | --- | --- |
| **Interactive login** | qlogin | `srun --pty bash` or `srun -p <partition> --time=4:0:0 --pty bash` (some clusters provide a wrapper such as `sdev` for a quick dev node) |
| **Job submission** | qsub \[script_file\] | sbatch \[script_file\] |
| **Job deletion** | qdel \[job_id\] | scancel \[job_id\] |
| **Job status by job** | qstat -u \\* \[-j job_id\] | squeue \[job_id\] |
| **Job status by user** | qstat \[-u user_name\] | squeue -u \[user_name\] |
| **Job hold** | qhold \[job_id\] | scontrol hold \[job_id\] |
| **Job release** | qrls \[job_id\] | scontrol release \[job_id\] |
| **Queue list** | qconf -sql | squeue |
| **List nodes** | qhost | sinfo -N OR scontrol show nodes |
| **Cluster status** | qhost -q | sinfo |
| [**GUI**](http://slurm.schedmd.com/sview.html) | qmon | sview |
| **Environmental** | | |
| **Job ID** | $JOB_ID | $SLURM_JOBID |
| **Submit directory** | $SGE\_O\_WORKDIR | $SLURM\_SUBMIT\_DIR |
| **Submit host** | $SGE\_O\_HOST | $SLURM\_SUBMIT\_HOST |
| **Node list** | $PE_HOSTFILE | $SLURM\_JOB\_NODELIST |
| **Job Array Index** | $SGE\_TASK\_ID | $SLURM\_ARRAY\_TASK_ID |
| **Job Specification** | | |
| **Script directive** | #$ | #SBATCH |
| **queue** | -q \[queue\] | -p \[queue\] |
| **count of nodes** | N/A | -N \[min\[-max\]\] |
| **CPU count** | -pe \[PE\] \[count\] | -n \[count\] |
| **Wall clock limit** | -l h_rt=\[seconds\] | -t \[min\] OR -t \[days-hh:mm:ss\] |
| **Standard out file** | -o \[file_name\] | -o \[file_name\] |
| **Standard error file** | -e \[file_name\] | -e \[file_name\] |
| **Combine STDOUT & STDERR files** | -j yes | (use -o without -e) |
| **Copy environment** | -V | --export=\[ALL \| NONE \| variables\] |
| **Event notification** | -m abe | --mail-type=\[events\] |
| **send notification email** | -M \[address\] | --mail-user=\[address\] |
| **Job name** | -N \[name\] | --job-name=\[name\] |
| **Restart job** | -r \[yes\|no\] | --requeue OR --no-requeue (NOTE: <br>configurable default) |
| **Set working directory** | -wd \[directory\] | --workdir=\[dir_name\] |
| **Resource sharing** | -l exclusive | --exclusive OR --shared |
| **Memory size** | -l mem_free=\[memory\]\[K\|M\|G\] | --mem=\[mem\]\[M\|G\|T\] OR --mem-per-cpu= <br>\[mem\]\[M\|G\|T\] |
| **Charge to an account** | -A \[account\] | --account=\[account\] |
| **Tasks per node** | (Fixed allocation_rule in PE) | --ntasks-per-node=\[count\] |
| | | --cpus-per-task=\[count\] |
| **Job dependency** | -hold\_jid \[job\_id \| job_name\] | --depend=\[state:job_id\] |
| **Job project** | -P \[name\] | --wckey=\[name\] |
| **Job host preference** | -q \[queue\]@\[node\] OR -q <br>\[queue\]@@\[hostgroup\] | --nodelist=\[nodes\] AND/OR --exclude= <br>\[nodes\] |
| **Quality of service** | | --qos=\[name\] |
| **Job arrays** | -t \[array_spec\] | --array=\[array_spec\] (Slurm version 2.6+) |
| **Generic Resources** | -l \[resource\]=\[value\] | --gres=\[resource_spec\] |
| **Licenses** | -l \[license\]=\[count\] | --licenses=\[license_spec\] |
| **Begin Time** | -a \[YYMMDDhhmm\] | --begin=YYYY-MM-DD\[THH:MM\[:SS\]\] |

| SGE | SLURM |
| --- | --- |
| qstat <br><br>> qstat -u username <br>> qstat -f | squeue <br><br>> squeue -u username <br>> squeue -al |
| qsub <br><br>> qsub -N jobname <br>> qsub -l h_rt=24:00:00 <br>> qsub -pe dmp4 16 <br>> qsub -l mem=4G <br>> qsub -o filename <br>> qsub -e filename <br>> qsub -l scratch_free=20G | sbatch <br><br>> sbatch -J jobname <br>> sbatch -t 24:00:00 <br>> sbatch -p node -n 16<br><br>> sbatch --mem=4000 <br>> sbatch -o filename <br>> sbatch -e filename |
| \# Interactive run, one core | \# Interactive run, one core |
| qrsh -l h_rt=8:00:00 | salloc -t 8:00:00 <br>interactive -p core -n 1 -t 8:00:00 |
| qdel | scancel |

| SGE for a single-core application | SLURM for a single-core application |
| --- | --- |
| #!/bin/bash<br>#<br>#<br>#$ -N test<br>#$ -j y<br>#$ -o test.output<br>#$ -cwd<br>#$ -M $USER@yourschool.edu<br>#$ -m bea<br>\# Request 5 hours run time<br>#$ -l h_rt=5:0:0<br>#$ -P your\_project\_id_here<br>#<br>#$ -l mem=4G<br>\# <br> <br><call your app here> | #!/bin/bash <br>#<br>#SBATCH -J test<br>#SBATCH -o test."%j".out<br>#SBATCH -e test."%j".err<br>\# Default in slurm<br>#SBATCH --mail-user $USER@yourschool.edu<br>#SBATCH --mail-type=ALL<br>#SBATCH -t 5:0:0 \# Request 5 hours run time<br>#SBATCH --mem=4000<br>#SBATCH -p normal<br><br> <br><load modules, call your app here> |
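
For readability, here is the SGE single-core submission script from the table above written out as a copy-pasteable file (the email address and project ID are placeholders); the SLURM column of the table is its direct equivalent:

```bash
#!/bin/bash
#
#$ -N test                    # Job name
#$ -j y                       # Join STDOUT and STDERR into one file
#$ -o test.output             # Output file
#$ -cwd                       # Run from the current working directory
#$ -M $USER@yourschool.edu    # Notification email (placeholder)
#$ -m bea                     # Mail at begin, end, and abort
#$ -l h_rt=5:0:0              # Request 5 hours run time
#$ -P your_project_id_here    # Project/account (placeholder)
#$ -l mem=4G                  # Request 4G of memory

# call your app here
```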

Comparison of some environment variables set by SGE and SLURM:

| SGE | SLURM |
| --- | --- |
| `$JOB_ID` | `$SLURM_JOB_ID` |
| `$NSLOTS` | `$SLURM_NPROCS` |
