
Hyak (MOX) Computing


A brief guide to using Hyak (MOX)

Before proceeding, you'll need two things:

  1. Request a security token from UW IT - This will take a few days.
  2. Contact Sean Bennett to add you as a user to the srlab Hyak (MOX) group.

Quick start (all commands below are intended to be executed in a shell/terminal)

  1. Log in to Hyak (MOX):

ssh UWNetID@mox.hyak.uw.edu

  • Replace UWNetID with your own UW NetID.
  2. Enter your UW NetID password when prompted.
  • Your password will not be displayed on the screen as you type.
  3. Enter the passcode shown on your Entrust security token device.
  • Press the green power button on the Entrust device to display the passcode.
  4. Run a job on Hyak (MOX) (see "Utilizing Slurm resource manager" below).
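
Taken together, the login sequence looks roughly like the sketch below; the exact prompt wording is an assumption and may vary slightly:

    ssh UWNetID@mox.hyak.uw.edu
    # Password:          <- type your UW NetID password (nothing is echoed)
    # Enter PASSCODE:    <- type the code currently shown on your Entrust token
    # [UWNetID@mox ~]$   <- you are now on a login node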

Understanding Hyak (MOX)

Storage

  1. User-specific storage
  • Storage allocation: 5GB
  • Located in your home directory (e.g. /usr/lusers/UWnetID)
  2. Group-specific storage
  • Storage allocation: 1500GB
  • Located: /gscratch/srlab/
  • Shared by all srlab members
  3. Temporary storage
  • Storage allocation: 200TB, shared by all UW Hyak (MOX) users
  • Located: /gscratch/scrubbed/
  • Files are automatically deleted 30 days after creation.
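
A quick sketch for checking how much of each allocation is in use (du can be slow on large directories; the mmlsquota invocation is the same one shown in the Considerations section below):

    # Size of your 5GB home allocation (replace UWnetID with your own)
    du -sh /usr/lusers/UWnetID

    # Block and file usage for the shared srlab allocation on /gscratch
    mmlsquota -j srlab gscratch --block-size G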

Nodes:

  1. Login node.
  • [UWnetID@mox ~]$
  • Use for submitting jobs to the queue
  • Use for transferring files to/from Hyak (MOX) (see the sketch after this list)
  2. Interactive node.
  • [UWnetID@nNNNN ~]$
  • Has access to the full computing specs of a physical node (but not to parallel "supercomputing" across multiple nodes)
  • Can be used in two modes: basic or build
    • Basic
      • No internet access
      • Can only be used by one user at a time
    • Build
      • Has internet access
      • Intended for compiling programs for use in parallel computing.
  3. Computing node.
  • [UWnetID@nNNNN ~]$
  • Add explanation/example(s)
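
Two common first tasks, sketched below on the assumption that the srlab partition and account names used elsewhere in this guide apply (file names and paths are placeholders):

    # Copy a file from your local machine to group storage (run this on your local machine)
    scp my_data.fastq UWNetID@mox.hyak.uw.edu:/gscratch/srlab/

    # From the login node, request an interactive session on an srlab node
    srun -p srlab -A srlab --pty /bin/bash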

Advanced Hyak (MOX) usage

Use Jupyter Notebook

Installing software from RPM packages (Guide based on this Stack Exchange question)

  1. Download the desired software:

wget http://address.to.software.rpm

  2. Initiate a build node:

srun -p build --pty /bin/bash

  3. Load the Intel MPI and Intel compiler module (not sure if this is needed):

module load icc_17-impi_2017

  4. Unpack the RPM:

rpm2cpio downloaded_software.rpm > desired_filename.cpio

  5. Extract the CPIO archive created in Step 4:

cpio -idv < desired_filename.cpio

  6. Copy the executable file to a location in your $PATH (e.g. /usr/lusers/UWnetID/bin):

cp ./usr/bin/software /usr/lusers/UWnetID/bin

  • If the software depends on any libraries, the directory containing them needs to be added to $LD_LIBRARY_PATH (e.g. /usr/lusers/UWnetID/lib):

export LD_LIBRARY_PATH=/usr/lusers/UWnetID/lib:$LD_LIBRARY_PATH

  • It may be worthwhile to put the library export command above in your ~/.bashrc file.
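
As a concrete sketch of the two exports, using the example locations above (substitute your own UWnetID), the lines appended to ~/.bashrc might look like this:

    # Add your personal bin directory to the executable search path
    export PATH=/usr/lusers/UWnetID/bin:$PATH
    # Add your personal lib directory to the library search path
    export LD_LIBRARY_PATH=/usr/lusers/UWnetID/lib:$LD_LIBRARY_PATH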

Utilizing Slurm resource manager

  1. Create a bash script to run all of your programs at once. Parts in <> must be supplied by you.

--job-name is the name you want to give your job.
--time is an estimate of the run time; the more accurate this is, the better the limited shared resources can be used.
--mem is the amount of memory to be allocated on the node; we have a maximum of 512GB available.
--workdir is the working directory for your project, where Slurm output will be written.

    #!/bin/bash
    ## Job Name
    #SBATCH --job-name=<name>
    ## Resources
    ## Nodes
    #SBATCH --nodes=1
    ## Walltime (run time)
    #SBATCH --time=<hh:mm:ss>
    ## Memory per node
    #SBATCH --mem=<XXXG>
    ## Specify the working directory for this job
    #SBATCH --workdir=</gscratch/srlab/your/directory/here>
    command.1 -a argument1 -b argument2
    ...
    command.n -a argument1 -b argument2
  2. Test the script on a local machine to make sure everything works as intended.
  3. Log in to Hyak.
  4. Queue the job using the sbatch -p srlab -A srlab job.sh command and record the job ID number returned (see the sketch below).
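
A minimal sketch of what submission looks like (job.sh is the script from Step 1; the job ID shown is the example used in the scontrol output below and will differ for your job):

    sbatch -p srlab -A srlab job.sh
    # Slurm responds with a line like "Submitted batch job 7308";
    # 7308 is the job ID to record for the status checks that follow.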

Checking the status of a running job

  1. Log in to Hyak.
  2. At the login node, type `scontrol show job <JobID#>`, which produces output like the example below. The bolded fields (marked with ** in the example) are the most important ones to check:
    JobState indicates whether the job is running, has completed, or has timed out.
    RunTime indicates the current run time and TimeLimit indicates the maximum allowed run time; both are displayed in day-hour:minute:second format.
     JobId=7308 JobName=Oly-Gap-Filling  
     UserId=seanb80(557445) GroupId=hyak-srlab(415510) MCS_label=N/A  
     Priority=120 Nice=0 Account=srlab QOS=normal  
     **JobState=RUNNING** Reason=None Dependency=(null)  
     Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0  
     **RunTime=02:18:13** **TimeLimit=10-00:00:00** TimeMin=N/A  
     SubmitTime=2017-04-13T09:22:26 EligibleTime=2017-04-13T09:22:26  
     StartTime=2017-04-13T09:22:26 EndTime=2017-04-23T09:22:26 Deadline=N/A  
     PreemptTime=None SuspendTime=None SecsPreSuspend=0   
     Partition=srlab AllocNode:Sid=mox1:45801   
     ReqNodeList=(null) ExcNodeList=(null)   
     NodeList=n2185   
     BatchHost=n2185   
     NumNodes=1 NumCPUs=28 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*   
     TRES=cpu=28,mem=400G,node=1   
     Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*   
     MinCPUsNode=1 MinMemoryNode=400G MinTmpDiskNode=0   
     Features=(null) DelayBoot=00:00:00   
     Gres=(null) Reservation=(null)   
     OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)   
     Command=/gscratch/srlab/data/Oly_gap_filling/PBRun.sh   
     WorkDir=/gscratch/srlab/data/Oly_gap_filling/   
     StdErr=/gscratch/srlab/data/Oly_gap_filling//slurm-7308.out   
     StdIn=/dev/null   
     StdOut=/gscratch/srlab/data/Oly_gap_filling//slurm-7308.out  
     Power=   
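
To see all of your jobs at a glance rather than one job in detail, the standard Slurm squeue command can be used; a brief sketch:

    # List your queued and running jobs (replace UWNetID with your own)
    squeue -u UWNetID

    # List everything currently on the srlab partition
    squeue -p srlab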

Interacting with the Execute Node during a job

  1. Log in to Hyak
  2. Find your execute node via scontrol show job JOBID#
  3. ssh in to the execute node via ssh NODEID
  4. You now have terminal access to the execute node, which is useful for monitoring resource usage with top (see the sketch below).
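
A sketch of the sequence, reusing the node and job IDs from the example scontrol output above (yours will differ):

    scontrol show job 7308    # the NodeList/BatchHost fields show where the job is running (n2185 here)
    ssh n2185                 # connect to that execute node
    top -u UWNetID            # watch your own processes; press q to quit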

Considerations for Hyak Jobs

  • Number of files created, including temporary files:

    Prior to deploying code on Hyak, it is ideal to run a test case on another computer to get an idea of the number of files the program generates, including temporary files. We are currently limited to ~1.5 million files in our shared scratch space, so check that enough file space is available before beginning a run.

    The command to determine current file system usage for the shared scratch directory is

    mmlsquota -j srlab gscratch --block-size G

    and the output looks like

While your job is running