
Getting Started



Getting a Wynton HPC account

Connecting to Wynton HPC

  • ssh
    • Open an ssh client. An ssh client is already installed if you are using macOS or Linux; on Windows, you might need to download an ssh client application.

    • In the example below, replace alice with your actual Wynton user name. Type the first line, then enter your password when prompted:

{local}$ ssh alice@log2.wynton.ucsf.edu
alice@log2.wynton.ucsf.edu's password:
Last login: Thu Jul 16 17:03:28 2020 
[alice@wynlog2 ~]$
  • sftp
    • sftp is a common method for transferring files between two computers. If you are using macOS or Linux, an sftp client application is already installed; on Windows, you might need to download an additional application.

    • In the example below, replace alice with your actual Wynton user name. Type the first line, then enter your password when prompted:

{local}$ sftp alice@log2.wynton.ucsf.edu
alice@log2.wynton.ucsf.edu's password:
Connected to log2.wynton.ucsf.edu.
sftp>
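    • Once connected, files are uploaded and downloaded with the standard sftp commands put and get. A minimal sketch of a transfer session (the file names are placeholders):

sftp> put results.tsv
sftp> get remote_data.txt
sftp> exit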
  • For more information on how to transfer files on Wynton, see Wiki - How to move files.

  • Troubleshooting logging in to Wynton HPC

    • If you have difficulty connecting, make sure you have received confirmation that your account has been created and that the username you are using is correct.
    • Make sure the server hostname you are connecting to is correct. Only the login nodes and the data transfer nodes can be logged into directly from outside the cluster.
    • If your password needs to be reset, please contact the Wynton system administrators.

Linux operating system on Wynton HPC

  • Wynton HPC runs the Linux operating system, specifically CentOS 7 Linux
  • Becoming comfortable with the Linux command line and the bash shell is a very useful skill for interacting with the Wynton HPC environment

Using the Linux command line

Storage

IMPORTANT: Wynton storage is NOT backed up. If your data are important, do not keep the only copy on Wynton.

  • BeeGFS is the parallel file system used by Wynton. It is optimized for HPC
  • Home directory
    • mounted under /wynton/home
    • user home directory quotas are 500GiB
  • Group directory
    • mounted under /wynton/group
      • to check the quota for a group: beegfs-ctl --getquota --gid <group> (see also the quota sketch at the end of this section)
      • For example, there is 100 TB of Gladstone space under /wynton/group/gladstone.
  • Global scratch space
    • mounted as /wynton/scratch and is available as a shared directory from all Wynton nodes
    • If you are copying files that will only be needed temporarily, for example as input to a job, then you have the option of copying them directly to a global scratch space at /wynton/scratch. There is 492TiB of space available for this purpose.
    • /wynton/scratch is automatically purged after two weeks, but you should delete files yourself as soon as you no longer need them.
    • note: it is good practice to first create your own subdirectory there and copy files to that location, for example:
mkdir /wynton/scratch/my_own_space                                          # run on a Wynton node (e.g., a login or dev node)
scp filename.tsv username@dt2.wynton.ucsf.edu:/wynton/scratch/my_own_space  # run from your local machine
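
To check how much of your quota you are using, something like the following should work (a sketch; the --uid form is an assumption based on the --gid example above):

beegfs-ctl --getquota --uid "$USER"    # per-user quota (home directory)
beegfs-ctl --getquota --gid <group>    # per-group quota, as noted above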

Overview of the different kinds of nodes on Wynton

  • There are a few different kinds of nodes (Linux hosts): login, development, data transfer, compute, and GPU compute
  • login nodes
    • login nodes can be logged into directly
    • minimal compute resources
    • dedicated solely to basic tasks such as copying and moving files on the shared file system, submitting jobs, and checking the status on existing jobs
    • node names
      • log1.wynton.ucsf.edu
      • log2.wynton.ucsf.edu
  • development nodes
    • development nodes cannot be logged into directly; they are accessed from the login nodes (see the ssh sketch after this list)
    • node names
      • dev1.wynton.ucsf.edu
      • dev2.wynton.ucsf.edu
      • dev3.wynton.ucsf.edu
      • gpudev1.wynton.ucsf.edu
    • validating scripts, prototyping pipelines, compiling software, etc.
    • interactive work (Python, R, MATLAB)
  • data transfer node
    • like login nodes, the data transfer nodes can be logged into directly
    • node names
      • dt1.wynton.ucsf.edu
      • dt2.wynton.ucsf.edu
    • have access to the outside internet
    • each data transfer node has a 10 Gbps network connection; for comparison, the login nodes have 1 Gbps connections
    • for large transfers, Globus is the preferred transfer method
    • Gladstone users have additional options for high-speed data transfers to/from Gladstone, local, and Dropbox locations: see the internal Confluence docs.
  • compute nodes
    • compute nodes cannot be logged into directly
    • the scheduler will send jobs to compute nodes
    • the majority of compute nodes have Intel processors, a few have AMD
    • local /scratch
      • either a hard disk drive (HDD), solid state drive (SSD), or Non-Volatile Memory Express (NVMe) drive
      • each node has a tiny /tmp (4-8 GiB)
  • GPU nodes (for GPU computation)
    • GPU nodes cannot be logged into directly
    • as of 2019-09-20
      • 38 GPU nodes with a total of 132 GPUs available to all users
        • Among these, 31 GPU nodes, with a total of 108 GPUs, were contributed by different research groups
      • GPU jobs are limited to 2 hours in length when run on GPUs not contributed by the running user's lab.
      • Contributors are not limited to 2-hour GPU jobs on nodes they contributed
      • There is also one GPU development node that is available to all users
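
A minimal sketch of reaching a development node from your local machine in a single command, hopping through a login node (this relies on the standard OpenSSH -J/ProxyJump option; replace alice with your user name):

{local}$ ssh -J alice@log2.wynton.ucsf.edu alice@dev1.wynton.ucsf.edu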

A little about Linux environment modules

  • https://wynton.ucsf.edu/hpc/software/software-modules.html
  • available module repositories (need to be loaded)
    • CBI: Repository of software shared by Computational Biology and Informatics (http://cbi.ucsf.edu) at the UCSF Helen Diller Family Comprehensive Cancer Center
    • Sali: Repository of software shared by the UCSF Sali Lab
  • A list of the available modules in the CBI and Sali repositories is available at https://wynton.ucsf.edu/hpc/software/software-repositories.html, or by using the module avail command after loading a repository with module load. For example, to list all the modules in the CBI repository: module load CBI followed by module avail
  • To load a module, use module load.
    For example, to load the R module from the CBI module repository: module load CBI r (the module commands are collected into an example session after this list)
  • To see what gets set when a module is loaded, use module show. For example, to see what gets set when the mpi module is loaded: module show mpi
  • To see what software modules you currently have loaded, use module list
  • To see what software modules are currently available (in the software repositories you have loaded), use: module avail
  • To disable (“unload”) a previously loaded module, use module unload. For example, to unload the R module if it had been loaded previously: module unload r
  • To disable all loaded software modules and repositories: module purge
  • Other ways of loading software
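
The module commands above, collected into one example session (the r module follows the CBI example; substitute whatever software you need):

module load CBI      # enable the CBI software repository
module avail         # list the modules now available
module load r        # load the R module from the CBI repository
module list          # show currently loaded modules
module show r        # see what loading the r module sets
module unload r      # unload the r module
module purge         # unload all modules and repositories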

Submitting a job to Wynton

  • The current job scheduler is SGE 8.1.9 (Son of Grid Engine); however, Wynton will be transitioning to the Slurm job scheduler in Q4 2020
  • The scheduler coordinates distributing jobs, which get submitted as batch scripts, to the compute nodes of the cluster
  • Example SGE job submission: qsub -l h_rt=00:01:00 -l mem_free=1G my_job.sge (replace the time, memory, and file name with your own choices; a minimal sketch of my_job.sge follows this list)
    • my_job.sge = the batch script file to be submitted
    • -l h_rt = maximum runtime (hh:mm:ss or seconds)
    • -l mem_free = maximum memory (K for kilobytes, M for megabytes, G for gigabytes)
  • Jobs always run on the compute nodes whether they are submitted from a login node or from a development node.
  • To check on the job
    • Current status: qstat or qstat -j 191442 (replace 191442 with the actual SGE job id)
    • After the job ran successfully: grep "usage" my_job.sge.o284740 (replace the output file name with the actual output file name)
    • After a failed job: tail -100000 /opt/sge/wynton/common/accounting | qacct -f -j 191442 (replace 191442 with the actual SGE job id)
  • How much memory to request when submitting a job?
    • With experience and trial & error, you can estimate the memory requirements for various types of jobs
    • Logs, reports and accountings can help provide clues
    • Wynton is relatively forgiving on memory estimates
    • If unsure, try 8GB and then increase/decrease accordingly
  • Tips on submitting jobs
    • For intensive jobs during busy times, you can reserve resources for your job as soon as they become available by including the parameter -R y
    • Compute nodes do not have access to the internet, i.e., you cannot run jobs that include steps such as downloading files from online resources.
    • Development nodes DO have access to the internet.
    • If your script or pipeline requires access to the internet, consider splitting up the work:
      • run a script on a dev node that retrieves the online files and then submits jobs to be run on the compute nodes.
    • Cron jobs can also be run on a dev node to periodically download files, separately from the compute-heavy jobs submitted to the compute nodes
  • To check the job queue metrics of the cluster, go to https://wynton.ucsf.edu/hpc/status/index.html
  • https://wynton.ucsf.edu/hpc/scheduler/submit-jobs.html
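
A minimal sketch of what the my_job.sge script from the qsub example above could look like (the echo/sleep payload is only a placeholder; the #$ directives mirror the qsub options above and the parallel example below):

#!/bin/bash
#$ -S /bin/bash            # run the job with bash
#$ -cwd                    # run in the current working directory
#$ -l h_rt=00:01:00        # maximum runtime (hh:mm:ss)
#$ -l mem_free=1G          # maximum memory

echo "Running on $HOSTNAME"
sleep 30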

Parallel jobs

  • A parallel environment for multithreaded (SMP) jobs is available for use on the cluster
  • This environment must be used for all multithreaded jobs; jobs not running in this PE are subject to being killed by the cluster systems administrator without warning.
  • Example submission for a parallel BLAST job
#!/bin/bash
#
#$ -S /bin/bash
#$ -l arch=linux-x64    # Specify architecture, required
#$ -l mem_free=1G       # Memory usage, required.  Note that this is per slot
#$ -pe smp 2            # Specify parallel environment and number of slots, required
#$ -R yes               # SGE host reservation, highly recommended
#$ -cwd                 # Current working directory

blastall -p blastp -d nr -i in.txt -o out.txt -a $NSLOTS

Notes on the example

  • In the above example, the '-a' flag tells blastall the number of processors it should use.
  • $NSLOTS is the number of slots requested for the parallel environment
  • For more information on parallel and MPI jobs, see https://salilab.org/qb3cluster/Parallel_jobs
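
Assuming the script above is saved as parallel_blast.sge (a hypothetical file name), it is submitted and monitored like any other job:

qsub parallel_blast.sge    # submit the parallel BLAST job
qstat                      # check its status in the queue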

GPU scheduling

  • Compiling GPU applications
    • The CUDA Toolkit is installed on the development nodes
    • Several versions of CUDA are available via software modules. To see the currently available versions, run the command: module avail cuda
  • For more information on GPU jobs, see https://wynton.ucsf.edu/hpc/scheduler/gpu.html
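
A minimal sketch of compiling a CUDA program on a development node (hello.cu is a placeholder, and the exact module name/version may differ; check module avail cuda first):

module avail cuda          # list the CUDA versions that are installed
module load cuda           # load a CUDA toolkit module (the name/version is an assumption)
nvcc -o hello hello.cu     # compile with the NVIDIA CUDA compiler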

Interactive sessions on Wynton HPC

  • It is currently not possible to request interactive jobs via the scheduler
  • There are dedicated development nodes (dev1, dev2, dev3, gpudev1) that can be used for short-term interactive development such as building software and prototyping scripts before submitting them to the scheduler.
  • Interactive python session
    1) ssh to a login node
    2) ssh to a dev node
    3) type python3 to enter the Python REPL for an interactive session
    4) when done, type exit() to quit the session
  • Interactive R session
    1) ssh to a login node
    2) ssh to a dev node
    3) type R to enter the R interactive session
    4) when done, type q() to quit the session
  • Interactive MATLAB session
    1) ssh to a login node
    2) ssh to a dev node
    3) type module load Sali matlab
    4) type matlab
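
The interactive R steps above as a single terminal transcript (a sketch; replace alice with your user name; R is loaded here via the CBI module described in the modules section):

{local}$ ssh alice@log2.wynton.ucsf.edu        # 1) ssh to a login node
[alice@wynlog2 ~]$ ssh dev1.wynton.ucsf.edu    # 2) ssh to a dev node
[alice@dev1 ~]$ module load CBI r              # load R from the CBI repository
[alice@dev1 ~]$ R                              # 3) enter the R interactive session
> q()                                          # 4) quit when done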

More information on working with

Best Practices

  • Back up your data if it is important
  • Use the login nodes (or dev nodes) to submit batch jobs to the cluster, and the dev nodes for interactive work
  • Use local scratch for staging data and computations
  • If using conda environments in Anaconda Python, this is best done inside a Singularity container
  • If writing many files to the file system, for example thousands or more, avoid writing all of the files to a single directory.
    • instead, spread the files across a number of different directories for better performance (see the sketch after this list)
  • For interactive use of GUI applications, X2Go gives better performance than X-forwarding
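
A minimal sketch of spreading many output files across subdirectories instead of one flat directory (the file pattern and bucket count are hypothetical):

# create 16 bucket directories and distribute files by a checksum of their names
mkdir -p out/{0..15}
for f in part_*.tsv; do
  bucket=$(( $(printf '%s' "$f" | cksum | cut -d' ' -f1) % 16 ))
  mv "$f" "out/${bucket}/"
done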

Troubleshooting tips

  • Check the job scheduler logs
    • error log: unless otherwise specified, this will be in the directory that the job was launched from, and the file name will be the job script name followed by .e<jobid>
    • output log: unless otherwise specified, this will be in the directory that the job was launched from, and the file name will be the job script name followed by .o<jobid>.
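
For example, for a job script named my_job.sge that ran as job 191442 (the job id reused from the examples above), the logs could be inspected with:

tail my_job.sge.o191442    # end of the output log (stdout)
tail my_job.sge.e191442    # end of the error log (stderr)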

Getting Additional Help