# Cluster Computing 

Physics and astronomy often involve large datasets, a large number of computations, or both, for which simply running code on your personal machine will not suffice. 
You could need more firepower (e.g. more cores), more memory or just different processors (e.g. GPUs). This is where computer clusters come in. You take your code, stick it on a cluster and submit jobs to be run on the cluster remotely. Here, we'll go over the basics of how to do that. 

## Connecting to a cluster

In [Section 2](https://github.com/jeffiuliano/Penn-Summer-Computing-Training/blob/main/2_ssh_and_scp/SSH_SCP_Workbook.ipynb), we learnt to securely interact with a non-local machine. To connect to most clusters, we will use SSH as described there. For example, for those of you using *marmalade*, you will do 

`ssh username@marmalade.physics.upenn.edu`

- Note that to connect to *marmalade*, you must be on a Penn secured network (eg. AirPennNet wi-fi) or VPN into Penn. For using the Penn VPN, you need a PennKey (UPenn username and account) and associated account setup. For details of how to VPN into Penn, go [here](https://www.isc.upenn.edu/how-to/university-vpn-getting-started-guide). 
- Note also that *marmalade* access is granted by your PI. Please bother them to get access. 

This will then prompt you for a password. The password for your account should be the same as the password associated with your PennKey. Remember that your terminal will not indicate that you are typing while you enter your password. If you are successful, the terminal will print out something like:

`Last login: Wed May 4 15:58:52 2022 from xx.xxx.xxx.xxx`

You are now on the **head node / login node** on *marmalade*. 

### Head / Login Nodes

#### What is a head node and why will everyone get mad at you if you mess with it?

Most clusters are set up such that you land on a *head or login node* when you connect to them. This is not where actual jobs are run: this node will not perform your computations. Its purpose is to facilitate those computations being run on a *compute node*. **Do not use the login node to run scripts.** You might wind up preventing other people from logging in by consuming resources, and this node will not have enough memory to support large jobs anyway. **This is how you make everyone in your research group and beyond angry with you.** 

Things it is okay to use the login node for:
- view and edit scripts
- view output
- perform git actions 
- submit and manage jobs
- spy on who else is using the cluster resources 

Things I will yell at you for using the login node for:
- running jobs
- debugging scripts
- I'll probably add some more things here

Grey area:
- managing your python environment
- installing other packages 
- installing code 

Note that for the "grey area" points, it's still better to perform these actions on a compute node if you can, because ultimately, those are the nodes that will be running your jobs, not your login node. For example, *marmalade* has AMD nodes as well as Intel nodes. Use the specific compute nodes that you will run your jobs on to compile your code, because there are some differences between the two types of nodes. 

#### Cores vs. Nodes

**Nodes** are effectively self-contained computers. Each node has its own memory, input/output and storage. It also has one or more processors (CPUs). The processors are sometimes referred to as **cores**, but usually, each processor is made up of several cores. 

The cores then share everything the node has - they share memory, I/O and storage. But for parallelisable code, you can parallelise across these cores and make them simultaneously run tasks. 

For *marmalade*, 
- astro has 3 AMD nodes which each have 64 processors or a total of 128 cores
- CM group has Intel nodes, you should bother them for their details 
- CM also has some GPUs 
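
To see what the node you are currently sitting on has to offer, here is a minimal sketch, assuming a Linux node with the standard `nproc` utility installed:

```shell
# Number of cores available to your current shell.
# Inside a job this may be fewer than the node's total, because the
# scheduler can restrict you to the cores you asked for.
n_cores=$(nproc)
echo "cores available here: ${n_cores}"
```

If they are installed, `lscpu` and `free -h` give more detail about the processors and the memory on the node.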

### Home 

The login node will take you to your home directory. This is where you can install your code, edit your `.bashrc` and direct your output. The home directories on *marmalade* are backed up every 24h. 

### Scratch

Depending on the cluster, a separate location called *scratch* is preferred for runtime output. This has faster input/output than the *home* directory. If your job reads or writes often, it's better to send stuff to scratch as opposed to home. 

For *marmalade*, the scratch directories are attached to the compute nodes. So if your job was running on `node11` say, then the scratch in node11 will hold your output. The jobs I run on *m* are basic MCMCs, don't need fast I/O, so I just write to home. 
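
If you do use scratch, one common pattern is to write runtime output there and copy whatever you want to keep to a permanent location at the end of the job. Below is a minimal sketch of that pattern. Everything specific in it is a placeholder: temporary directories stand in for the scratch and permanent locations so the snippet runs anywhere, and `output.txt` is a made-up file name. Ask your cluster's admins for the real scratch path.

```shell
# Placeholder locations: for real use, set SCRATCH_ROOT to the cluster's
# scratch path and KEEP_DIR to a directory in your home.
scratch_root="${SCRATCH_ROOT:-${TMPDIR:-/tmp}}"
keep_dir="${KEEP_DIR:-$(mktemp -d)}"

# Private working directory in scratch for this run
work_dir=$(mktemp -d "${scratch_root}/myjob.XXXXXX")

# Your job would write its runtime output here (stand-in command)
echo "step 1 done" > "${work_dir}/output.txt"

# Copy what you want to keep, then tidy up scratch
mkdir -p "${keep_dir}"
cp "${work_dir}/output.txt" "${keep_dir}/"
rm -rf "${work_dir}"
echo "results saved in ${keep_dir}"
```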

### Data

Again, depending on the setup of the cluster, sometimes a data directory is provided per group to share and store large quantities of data. 

Astro on *m* doesn't have this, but CM does I think. Bother your CM grad students / postdocs to learn more. 

## Cluster set-up

Now that you're logged in, how do you go about setting up your code and ensuring it runs correctly? There are a few simple places to start:
- if it's your own, personal code, you can scp it onto the cluster 
- git clone from a repository 

#### rsync / sftp 

These work similarly to the previously described scp, with some useful differences. 


`rsync` synchronises files between the two endpoints. It functions similarly to scp - you tell it where you want to transfer the file(s) from and to. Then, rsync will just copy the parts of the file(s) that are different, saving time on large files. Adding `-a` (archive mode) recurses into directories and preserves permissions and timestamps, while `-P` keeps partially transferred files and adds a neat progress display. Example:

`rsync -aP "user@cluster.location:/home/user/location/of/files/*" /local/machine/directory/`

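Note the quotation marks around the cluster address. These stop your local shell from trying to expand the `*` itself, so that the pattern is passed on to the cluster intact. This matters in zsh (the default shell on recent versions of macOS), which otherwise stops with a "no matches found" error.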
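
Before a big transfer, it can help to preview what rsync would do. Here is a small sketch using two temporary local directories (rsync works between local paths too), so it runs without a cluster. The file name is made up, and the snippet assumes rsync is installed:

```shell
src=$(mktemp -d)
dst=$(mktemp -d)
echo "some results" > "${src}/results.txt"

if command -v rsync >/dev/null 2>&1; then
    # -n (--dry-run) lists what would be transferred without copying anything
    rsync -avn "${src}/" "${dst}/"

    # The real transfer. The trailing slash on the source means
    # "the contents of this directory", not the directory itself.
    rsync -a "${src}/" "${dst}/"
else
    echo "rsync is not installed on this machine"
fi
```

The same `-n` flag works for remote transfers: add it to the cluster example above to check the file list before committing to a long copy.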


`sftp` stands for SSH file transfer protocol. It's a little different from rsync and scp. This lets you SSH into a remote machine and move around in its directories as well as local directories and simply get or put files in either location. It's great for moving files around both ways in different locations. Example:

`sftp user@cluster.location`

`cd some/remote/directory`

`get some_remote_file`

`lcd some/local/directory`

`put some_local_file`

Here, cd is change remote directory while lcd is local change directory, same for pwd and lpwd. The command get downloads a remote file to the local working directory, while put uploads a local file to the remote working directory. 

### Loading available software

You might also need modules to be able to run your code, e.g. an installation of C++, Fortran or Python. Most clusters will have installations of these already in place, available for you to use. A good place to start to look for these is 

`module avail` 

This prints out all available pre-installed software that you can just load onto your profile on the cluster, e.g. with 

`module load git/xx.xx`

You can then check that you indeed have access to git now with 

`which git`

This should print the location of the git you are now using, something like `...bin/git`.

Useful modules on *m* include a few versions of GCC, git, anaconda, valgrind ... The cluster helpdesk (marmalade-manager@sas.upenn.edu) is usually responsive and will install more software to add to the list of available modules if you make a good case for it (perhaps even if you make a bad one). 

### Saving cluster set-up

Once you know which modules you will need and have compiled your codes based on these modules, you should save this set-up. 
Check which modules you currently have loaded by doing 

`module list`

Then save these in your `.bashrc` or your `.bash_profile` to ensure they are loaded every time you log in:

`module load xxx/xxx.xx`

You should also save your anaconda environment. Someone else will tell you about that. This saves the version numbers of your python packages, so you can load the exact ones for which you have compiled your code and know that it works. To export the package info for the active environment, do 

`conda env export > environment.yml`

These are important programming practices and will save you SO much time when things invariably break or you have to move your work to a different cluster. 

### Source cluster set-up

When you log in, the cluster usually sources your `.bashrc` and `.bash_profile`. This loads all your settings, as long as you saved them there. 

Some clusters are set up such that you don't need to reload these settings when you move to a compute node to run a job interactively or when you submit a job. Some aren't. 

For *m*, I need to source my `.bashrc` and `.bash_profile` every time I run an interactive job, and have gotten into the habit of doing the same for submitted jobs:

`source ~/.bashrc` 


## Running jobs 

Ideally, you'd like to test your code before you try to submit a job and have to wait for it to begin, compute and possibly fail because of some bug. Clusters often have debug queues dedicated to this purpose that you can submit jobs to or directly interact with to run your code. 

Below I'll cover how to do both assuming the cluster uses the SLURM queuing system. 

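### Interactive sessions

An interactive session gives you a shell on a compute node, so you can run and debug your code by hand without touching the login node. With SLURM, you request one using `srun`, for example:

`srun -n 4 -p highcore -t 180 --pty bash` 

Here, 
- `-n 4` asks for 4 tasks (by default, one core each)
- `-p highcore` picks the partition (queue) to run in, in this case one called `highcore`
- `-t 180` sets a time limit of 180 minutes
- `--pty bash` starts a bash shell on the compute node and attaches your terminal to it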

### Submitting jobs

#### Which queue is for you? 

#### Submitting batch jobs 
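
A batch job is a shell script with `#SBATCH` comment lines at the top that tell SLURM what resources you want. Here is a minimal sketch, reusing the `highcore` partition and the resource numbers from the interactive example above. The job name and output file name are made up, and the final `echo` is a stand-in for your own code:

```shell
#!/bin/bash
# Made-up job name, shown in the queue
#SBATCH --job-name=my_test
# Which partition (queue) to run in
#SBATCH --partition=highcore
# Number of tasks
#SBATCH --ntasks=4
# Time limit in minutes
#SBATCH --time=180
# Where printed output goes; %j is replaced by the job ID
#SBATCH --output=my_test_%j.out

# If your cluster does not load your settings for batch jobs
# (see "Source cluster set-up"), uncomment the next line:
# source ~/.bashrc

# SLURM sets these variables inside a job; the defaults after ":-"
# only apply if you run this script by hand outside SLURM.
job_id="${SLURM_JOB_ID:-none}"
n_tasks="${SLURM_NTASKS:-1}"

echo "job ${job_id} running on $(uname -n) with ${n_tasks} task(s)"

# Replace this line with your own code
echo "hello from the cluster"
```

Save this as, say, `my_test.sh` and submit it from the login node with `sbatch my_test.sh`. SLURM prints the job ID, and the output lands in `my_test_<jobid>.out` in the directory you submitted from.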

## Useful SLURM commands
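
A few SLURM commands you are likely to use every day, wrapped here as small shell functions you could paste into your `.bashrc`. The function names are made up; the commands inside them (`squeue`, `sinfo`, `scancel`, `sacct`) are standard SLURM tools, though `sacct` only returns anything if job accounting is enabled on your cluster:

```shell
# List your own queued and running jobs
myjobs() { squeue -u "$USER"; }

# Show the partitions (queues) and the state of their nodes
queues() { sinfo; }

# Cancel a job, e.g. `killjob 12345` (12345 is a made-up job ID)
killjob() { scancel "$1"; }

# Summarise a job: state, run time and peak memory use
jobinfo() { sacct -j "$1" --format=JobID,JobName,State,Elapsed,MaxRSS; }
```

You can of course type the underlying commands directly; the wrappers just save some typing.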

## Bonus: open remote text files with atom