<center><img src=img/MScAI_brand.png width=70%></center>

# ICHEC

<center><img src=https://www.ichec.ie/sites/default/files/logo.png style="background-color:green;" width=50%></center>

ICHEC is the Irish Centre for High-End Computing, a department of NUI Galway hosted in the IT Building. The server farm is in Waterford and some staff are in Dublin.

### The **Kay** supercomputer
* 336 nodes, each with 2x 20-core 2.4 GHz processors and 192 GiB of RAM
* 16 GPU nodes, as above plus 2x NVIDIA Tesla V100
* Some more nodes with bigger processors/more memory
* Login nodes and other service nodes.

* https://www.ichec.ie/about/infrastructure/kay
* https://www.ichec.ie/academic/national-hpc/documentation

### Condominium access

Each Irish institution has "condominium" access to ICHEC, and this access extends to MSc students. https://www.ichec.ie/academic/condonium-service

To get access:

* Register on https://register.ichec.ie/register
* Fill in the Word form (in Blackboard) and send to Noreen Goggin <noreen.goggin@nuigalway.ie>


### `ssh` 

`ssh` is a command for logging in to a remote machine. When you have an ICHEC account, you can run this on your laptop:

```bash
$ ssh my_username@kay.ichec.ie
```

You'll be asked for your password and you'll end up at a Linux prompt on a `kay` login node. The login node is for managing your files and submitting batch jobs. Do not directly **run** any large jobs there!

Access is from within the NUI Galway network only!

Later on, for convenience, you can set up access from home/elsewhere also by generating an SSH keypair: https://www.ichec.ie/academic/national-hpc/documentation/ssh-keys

### Python on ICHEC


Lots of people are using Python for jobs on ICHEC.

https://www.ichec.ie/academic/national-hpc-service/software/python-conda

### Virtual environments

It's recommended to use a **virtual environment** to manage your Python/Anaconda installation. A `venv` is just a collection of software installed with Anaconda.

You can create multiple virtual environments and activate the one you want. For example, you might have one project that uses TensorFlow 1 and another that uses TensorFlow 2, and you want to be able to run both at different times. You can create a `venv` for each project with the right packages installed and just switch between them.
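To see which environment a Python process is actually using, a small check like the following can help. This is a minimal sketch, assuming Python 3; the function name `describe_environment` is made up for illustration, and `CONDA_DEFAULT_ENV` is the environment variable conda sets on activation.

```python
import os
import sys


def describe_environment():
    """Return the interpreter path, its install prefix, and the active conda env name (if any)."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        # conda sets CONDA_DEFAULT_ENV when an environment is activated;
        # outside conda the variable is absent, so this is None.
        "conda_env": os.environ.get("CONDA_DEFAULT_ENV"),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If the reported environment is not the one you expected, the packages you installed will not be importable, so this is worth running before debugging anything else.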

### Setting up your environment

You can do this on a login node:

```sh
# tell kay we will use Anaconda
module load conda/2
# create a new venv
conda create --name my_env
# activate it
conda activate my_env
# install the stuff we need 
conda install tensorflow-gpu networkx scikit-learn
# for completeness: exit the virtual environment
conda deactivate
```

A `venv` doesn't include the contents of your home directory. So, the next step is to make and/or copy your work directories to `kay`, e.g. on a login node:

```bash
mkdir work
```

You'll need to be able to navigate a Unix filesystem, e.g.:

* `mkdir # make directory`
* `mv    # move/rename`
* `cp    # copy`
* `cd    # change directory`
* `ls    # list directory`

### `scp`

`scp` is a command for copying files to a remote computer. When you have an ICHEC account, you can run this on your laptop (notice `-r` is for "recursive copy"):

```bash
$ scp -r my_project_directory my_username@kay.ichec.ie:/ichec/home/users/my_username/work/
```

### NFS

Your home directory on the login node (which we started to populate above) is shared with every compute node on `kay` via `NFS`, the Network File System, so your files are visible on whichever node your jobs run on.

### Interactive testing with `srun`

Then you can request a node for interactive testing, e.g. for 30 minutes. This is just for testing/prototyping, not for large runs.

Run this from the login node:
```bash
srun -p DevQ -N 1 -A nuig02 -t 0:30:00 --pty bash
```


If one is available, you will be delivered to a compute node on DevQ. Then:
```bash
module load conda/2
conda info --envs
source activate my_env
cd work/my_project_directory
python my_prog.py
```

After whatever length of time you requested (e.g. 30m above), your session will expire and you'll be logged out from that node.

### Batch mode

Above, we used `kay` interactively (typing commands at the shell), but **batch mode** is the common way to run large jobs. "Batch mode" is the opposite of "interactive mode". 

That sounds like a downgrade! But HPC systems provide batch mode because:

* jobs are long-running (hours or days), so no benefit to being interactive;
* the system can balance resource allocation for fairness and throughput.

### SLURM

SLURM is a workload manager. It is the mechanism we use to submit batch jobs for execution on `kay`.


<center><img src=img/SLURM.png width=55%></center>

### Queues
When you submit a job for batch execution, you submit it to a **queue**. ICHEC has several distinct queues with different types of nodes, including:

Name  | Node type | Max nodes | Max walltime | Purpose
------|-----------|-----------|--------------|------------
DevQ  | Cluster   | 4         | 1 hour       | Prototyping
ProdQ | Cluster   | 40        | 72 hours     | Large jobs
GpuQ  | GPU       | 4         | 48 hours     | GPU jobs
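Before submitting, it can be handy to check a request against these limits. The following is only a sketch: the limits are copied from the table above (they may change on the real system), and `QUEUE_LIMITS`, `walltime_to_hours` and `fits_queue` are made-up names, not part of SLURM or any ICHEC tool.

```python
# Limits copied from the table above; treat them as assumptions, not live values.
QUEUE_LIMITS = {
    "DevQ": {"max_nodes": 4, "max_hours": 1},
    "ProdQ": {"max_nodes": 40, "max_hours": 72},
    "GpuQ": {"max_nodes": 4, "max_hours": 48},
}


def walltime_to_hours(walltime):
    """Convert an 'H:MM:SS' string (the format used in '-t 0:30:00') to hours."""
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    return hours + minutes / 60 + seconds / 3600


def fits_queue(queue, nodes, walltime):
    """Return True if the node count and walltime are within the queue's limits."""
    limits = QUEUE_LIMITS[queue]
    return (nodes <= limits["max_nodes"]
            and walltime_to_hours(walltime) <= limits["max_hours"])


if __name__ == "__main__":
    print(fits_queue("DevQ", 1, "0:30:00"))   # within the DevQ limits
    print(fits_queue("DevQ", 1, "2:00:00"))   # exceeds the 1 hour DevQ walltime
```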

### Submitting jobs with `sbatch`

To specify a job, we first create a special shell script with the `.sh` suffix, e.g. `work/my_project_directory/my_batch_job.sh`:

<font size=5>
    
```bash
#!/bin/sh

# request 1 node for 1 hour on DevQ queue
#SBATCH --time=1:00:00
#SBATCH --nodes=1
#SBATCH -p DevQ
#SBATCH -A nuig02
#SBATCH --mail-user=my.email@nuigalway.ie
#SBATCH --mail-type=BEGIN,END

module load taskfarm
module load conda/2
# module load cuda/10.0 # uncomment for GPU
source activate my_env

cd $SLURM_SUBMIT_DIR
python my_prog.py
```

</font>

The first line is a standard way of telling the operating system that it is a shell script to be run by `/bin/sh`, a standard shell.

`sbatch` is the Unix command which submits a job to a queue. We run it like this on a login node:

```bash
cd work/my_project_directory
sbatch my_batch_job.sh
```

### Taskfarming

The above procedure only makes sense if `my_prog.py` is a very large job that requires a whole node to itself!


An easier way to use supercomputer facilities is to run many independent runs in parallel, one per core. Suppose our program `my_prog.py` has a hyperparameter `n` which takes on values 0-9, and for each value we want to run 100 repetitions, each with a different random seed. We write our program to accept arguments `my_prog.py <n> <seed>`. 
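A minimal sketch of what such a `my_prog.py` might look like is below. The `run_experiment` function is a placeholder for your real work, and giving both arguments a default of 0 is an assumption made here so the script can be smoke-tested without arguments.

```python
import argparse
import random


def run_experiment(n, seed):
    """Placeholder experiment: the result depends only on n and seed."""
    rng = random.Random(seed)  # private generator, so the same seed repeats exactly
    return n + rng.random()


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="One run of the experiment.")
    parser.add_argument("n", type=int, nargs="?", default=0, choices=range(10),
                        help="hyperparameter value, 0-9")
    parser.add_argument("seed", type=int, nargs="?", default=0,
                        help="random seed for this repetition")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    score = run_experiment(args.n, args.seed)
    # One line per run, so output from many tasks is easy to collect afterwards.
    print(f"{args.n} {args.seed} {score}")


if __name__ == "__main__":
    main()
```

Seeding a private `random.Random` from the command line means any single run can be reproduced later by re-running it with the same two arguments.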

Then we can just put these commands into a file named e.g. `tasks.sh`:

```bash
# each line is one command
python my_prog.py 0 0
python my_prog.py 0 1
python my_prog.py 0 2
...
python my_prog.py 9 99
```
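With 10 values of `n` and 100 seeds, that is 1000 lines, so you would not type them by hand. Here is one way to generate the file, as a sketch; `make_task_lines` and `write_tasks` are made-up names, and the file contains only command lines.

```python
from itertools import product


def make_task_lines(n_values=range(10), seeds=range(100), program="my_prog.py"):
    """Build one command line per (n, seed) combination, with n varying slowest."""
    return [f"python {program} {n} {seed}" for n, seed in product(n_values, seeds)]


def write_tasks(path="tasks.sh"):
    """Write the task lines to `path`, one per line, and return how many were written."""
    lines = make_task_lines()
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")
    return len(lines)
```

Calling `write_tasks()` in your project directory produces `tasks.sh`, starting with `python my_prog.py 0 0` and ending with `python my_prog.py 9 99`.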

Then, instead of `python my_prog.py` at the bottom of `my_batch_job.sh`, we'll put a `taskfarm` command:

```bash
taskfarm tasks.sh
```
It will take each line of `tasks.sh` as a task. It will start tasks, usually one per core, and when they finish it will start new ones, until the whole file is finished.
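To get a feel for this behaviour, or to test a short task file on your laptop, here is a rough local stand-in written with the Python standard library. It is not how `taskfarm` is implemented; the function names are made up, and skipping blank and `#` comment lines is an assumption of this sketch.

```python
import shlex
import subprocess
from concurrent.futures import ThreadPoolExecutor


def read_tasks(path):
    """Return the non-empty, non-comment lines of a task file."""
    with open(path) as f:
        stripped = (line.strip() for line in f)
        return [line for line in stripped if line and not line.startswith("#")]


def run_task(command):
    """Run one command line and return (command, exit code)."""
    completed = subprocess.run(shlex.split(command))
    return command, completed.returncode


def run_all(commands, workers=4):
    """Keep up to `workers` commands running until every command has finished."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as `commands`.
        return list(pool.map(run_task, commands))
```

For example, `run_all(read_tasks("tasks.sh"), workers=4)` runs the tasks four at a time and returns each command with its exit code, which makes failed tasks easy to spot.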

### After submitting

* You can check on your queued jobs: `squeue -u my_username`
* You should receive an email when the job **starts**. 
  * It might contain error messages. To debug, check your account name and check that your tasks run ok interactively.
* When your job finishes, you should receive another email.

### GPU on `kay`

Ensure that your `venv` includes (e.g.) `tensorflow-gpu` (it should as we installed it earlier). Then request a GPU node for interactive testing:

```bash
srun -p GpuQ -N 1 -A nuig02 -t 0:30:00 --pty bash
```

If one is available, you will be delivered to a compute node with GPU. Then run these:
```bash
module load cuda/10.0
module load conda/2
conda info --envs # what envs exist?
source activate my_env
# test that GPU support is working - you should see lots of messages
# including "name: Tesla V100-PCIE-16GB"
python -c 'import tensorflow as tf; print(tf.test.gpu_device_name())'
```

Then make sure to include the line `module load cuda/10.0` in your `my_batch_job.sh` file.

In the **Deep Learning** module (in Semester 2 for full-time students, in Year 2 Semester 2 for part-time), you'll have opportunities to use deep neural network libraries such as TensorFlow or PyTorch on GPU.