## Setting Up a Simple Slurm Cluster on Amazon Web Services for Personal Use

***Note: This costs money, so be sure to monitor your costs.***

#### Prerequisite:
- AWS account (go here to make an account: https://aws.amazon.com/)

#### Tutorial Used:
- https://www.tecracer.com/blog/2023/05/hello-slurm-getting-started-with-high-performance-computing-on-aws.html

### Step 1: Launch an EC2 Instance
1. Once logged in to AWS, type *ec2* into the search bar.
2. Click the *Launch Instance* button on that page.
3. Fill in the information for the EC2 instance:
    - **Name:** enter a name that is relevant and memorable. Mine was *HPC-BASE13*.
    - **Application and OS Image:** Amazon Linux 2023
    - **Key Pair:** Use an existing key pair or create a new one (I imagine a new one is best practice). Creating one generates a *.pem* file. Save it to your computer; it acts as your login credentials when you `ssh` in.
    - **Network Settings:** I did not touch this.
    - **Configure Storage:** I did not touch this.
    - **Advanced Settings:** I only set the IAM instance profile, choosing an IAM profile that I had given admin access to. This is necessary, or you will get errors.
4. Click *Launch Instance*, and the instance will be set up. It starts as a blank slate with no files on it, so next we have to configure it to act as our slurm base.
5. You can connect to the instance from the AWS console in your browser by clicking *Connect* and then, on the next page, using the *Connect using EC2 Instance Connect* option. I use the terminal command `ssh -i "<ssh_key_name>.pem" ec2-user@<unique_get_from_ssh_tab>.us-west-1.compute.amazonaws.com` instead.
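
If `ssh` refuses the key with an "unprotected private key file" warning, the *.pem* file's permissions are probably too open. Below is a minimal sketch of the usual fix. The key path `~/.ssh/my-key.pem` and the helper name `lock_key` are hypothetical, so substitute your own.

```shell
# Make a private key readable by its owner only, which ssh insists on.
lock_key() {
  chmod 400 "$1"
}

# Hypothetical key location; replace with the .pem file you downloaded.
KEY_FILE="$HOME/.ssh/my-key.pem"

if [ -f "$KEY_FILE" ]; then
  lock_key "$KEY_FILE"
  ls -l "$KEY_FILE"   # the permissions column should now start with -r--------
fi
```

After this, the `ssh -i` command from step 5 should accept the key.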

### Step 2: Configure Slurm Base Instance

The base instance you configure here handles most of the cluster management for you through the `pcluster` command-line tool. Most of this step is pulled directly from the tutorial used.

1. Set up environment

In [None]:
# switch user
sudo su -
su - ec2-user

# install python3-pip and nodejs
sudo yum install python3-pip nodejs -y

# install and create a python virtual environment
python3 -m pip install --user --upgrade virtualenv
python3 -m virtualenv ~/hpc-ve

# activate (or turn on) this virtual environment --- this will need to be done every time you login to the hpc-base, or you can set it up in the `~/.bashrc` file
source ~/hpc-ve/bin/activate

# install the AWS ParallelCluster software that pretty much does everything for you. You might be able to get away without pinning these versions, but I didn't try.
pip install aws-parallelcluster==3.5.0 && pip install flask==2.2.3

#check that you have it installed, it should be the version you specified
pcluster version

# use the software to help you configure your *.yaml* file that will help you set up for slurm base
pcluster configure --config cluster-config.yaml --region=us-west-1

#note: the yaml file name can be anything you want. It will be created with this command. If you want to edit it afterwards, you would use nano or vi
#2nd note: you need to specify the region, and it needs to be the same region as your ec2 instance (mine is called HPC-BASE13)

2. Choose slurm configuration (below is copied from the tutorial, because I don't know much about these settings and micro instances are on the aws free tier)
    - Choose the EC2 Key Pair Name added in the prerequisites step
    - Choose the scheduler to be used: slurm
    - Choose the Operating System: alinux2 (Amazon Linux 2)
    - Choose the HeadNode instance type: t3.micro
    - Number of queues: default (1)
    - Name of queue: default (queue1)
    - Number of compute resources for queue1: default (1)
    - Compute instance type for queue1: t3.micro
    - Maximum instance count: default (10)
    - Automate VPC creation: default
    - Set VPC settings: no automated creation; choose your created subnets during the prerequisites step
3. Alter the *.yaml* file, so that we can have a shared drive with extra space on it. I used part of the config file from the aws github for this software: https://github.com/aws/aws-parallelcluster/blob/release-3.0/cli/tests/pcluster/example_configs/awsbatch.simple.yaml

In [None]:
nano cluster-config.yaml

  3. continued
      - I typed the following information into the yaml file (with proper indenting) to add the shared drive, then saved:
> SharedStorage:  
>> MountDir: /sharet  
Name: ebs2  
StorageType: Ebs  

In [None]:
#my config file
Region: us-west-1
Image:
  Os: alinux2
HeadNode:
  InstanceType: t3.micro
  Networking:
    SubnetId: <id_from_program>
  Ssh:
    KeyName: <ssh_key_name>
Scheduling:
  Scheduler: slurm
  SlurmQueues:
  - Name: queue1
    ComputeResources:
    - Name: t3micro
      Instances:
      - InstanceType: t3.micro
      MinCount: 0
      MaxCount: 10
    Networking:
      SubnetIds:
      - <id_from_program>
SharedStorage:
  - MountDir: /sharet
    Name: ebs2
    StorageType: Ebs

4. Run the command to create the cluster based on the config file. This is where you specify the name of your cluster. It must be a new name that you have not used before. I used *cluster5*.

In [None]:
pcluster create-cluster --cluster-configuration cluster-config.yaml --cluster-name cluster5 --region us-west-1

# to check the progress of the cluster creation
pcluster describe-cluster --cluster-name cluster5 --region=us-west-1
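
Cluster creation takes a while, so rather than re-running `describe-cluster` by hand you could poll it. This is only a sketch: it assumes the JSON printed by `describe-cluster` has a `clusterStatus` field on its own line with values such as `CREATE_IN_PROGRESS` and `CREATE_COMPLETE`, and the helper names `extract_status` and `wait_for_cluster` are made up for illustration.

```shell
# Read describe-cluster JSON on stdin and print the value of "clusterStatus".
extract_status() {
  sed -n 's/.*"clusterStatus": *"\([^"]*\)".*/\1/p' | head -n 1
}

# Print the status every 30 seconds until it is no longer CREATE_IN_PROGRESS.
wait_for_cluster() {
  name="$1"
  region="$2"
  while true; do
    status="$(pcluster describe-cluster --cluster-name "$name" --region "$region" | extract_status)"
    echo "$(date +%T) ${status:-no status returned}"
    if [ "$status" != "CREATE_IN_PROGRESS" ]; then
      break
    fi
    sleep 30
  done
}

# Only start polling where the pcluster CLI is installed (the hpc-base instance).
if command -v pcluster >/dev/null 2>&1; then
  wait_for_cluster cluster5 us-west-1
fi
```

The loop also stops if the command returns nothing (for example, a mistyped cluster name), so check the last printed line before moving on.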

### Step 3: Use Cluster

You now want to `ssh` into the head node (where you submit jobs), which is an instance that `pcluster` creates automatically when you create the cluster from the hpc-base instance. The *.pem* file needs to be on the hpc-base instance for this, so upload it if it's not there.

In [None]:
pcluster ssh --cluster-name cluster5 -i ~/.ssh/<ssh_key_name>.pem --region=us-west-1

#I altered the ~/.bashrc file so that the source command in step 2 and the above command are run automatically when I connect to the hpc-base instance, so that I go straight to the headnode.
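
One way to make that `~/.bashrc` change is sketched below. It appends the activation line only if it is not already present, so running it twice does no harm. The helper name `append_once` and the `RC_FILE` variable are illustrative, not part of any tool.

```shell
# File to edit; defaults to the login shell's rc file on the hpc-base instance.
RC_FILE="${RC_FILE:-$HOME/.bashrc}"

# Append a line to the rc file only if that exact line is not already there.
append_once() {
  grep -qxF -- "$1" "$RC_FILE" 2>/dev/null || printf '%s\n' "$1" >> "$RC_FILE"
}

# Activate the virtual environment from step 2 on every login.
append_once 'source ~/hpc-ve/bin/activate'
```

The `pcluster ssh` command above could be appended the same way, with the caveat that every login to the hpc-base instance would then jump straight to the head node.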

You can write a simple script (a *.sh* file containing something like `echo "Hello World"`) and then test the cluster by submitting it with `sbatch`.

In [None]:
sbatch script.sh

The cluster will create the compute node and process the job and spit out a *slurm-#.out* file with the results in the same working directory where you ran the `sbatch` command.
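
A minimal sketch of such a test script is below. The job name `hello` is arbitrary, and the `--output` line simply spells out the *slurm-#.out* naming that Slurm uses by default (`%j` is the job ID).

```shell
# Write a small batch script; the #SBATCH lines are read by sbatch, not by bash.
cat > script.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --ntasks=1
#SBATCH --output=slurm-%j.out

echo "Hello World"
echo "Ran on $(hostname)"
EOF

# Submit it only where Slurm is available (i.e. on the head node).
if command -v sbatch >/dev/null 2>&1; then
  sbatch script.sh
  squeue   # the job is listed here while it is pending or running
fi
```

With `MinCount: 0` in the config, no compute node exists until a job is submitted, so the first job can sit in the queue for a few minutes while the instance starts.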

### Step 4: Configuration and Actual Runs

I added all my files to the /sharet file space so that if I recreate the head node (by changing the cluster config), I still have everything on the cluster's drive. (I originally named the drive "share13", but this gave me errors with the mamba environment.) I put my GitHub files and my data files in different folders, just to make it easier.

When running my `sbatch` jobs, I kept getting errors. It was a RAM issue: my genomic mapping program, STAR, could not allocate the memory it needed. This was the error in the log:

> EXITING: fatal error trying to allocate genome arrays, exception thrown: std::bad_alloc  
Possible cause 1: not enough RAM. Check if you have enough RAM 2012033883 bytes  
Possible cause 2: not enough virtual memory allowed with ulimit. SOLUTION: run ulimit -v 2012033883  
  
> Jan 10 03:55:20 ...... FATAL ERROR, exiting  

It's taken a bit of troubleshooting. I think it's because I didn't set up the config file correctly or I'm submitting an `sbatch` command that is not appropriate.
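
One thing worth checking is whether the compute node has that much memory at all. A t3.micro instance has 1 GiB of RAM, while the log shows STAR asking for 2012033883 bytes. The sketch below converts that figure to MiB and compares it with the memory of whatever machine it runs on. To inspect a compute node rather than the head node, it would need to run inside an `sbatch` job. This is a diagnostic aid, not a confirmed diagnosis of the error above.

```shell
# Number of bytes STAR tried to allocate, copied from the log above.
REQUESTED_BYTES=2012033883
REQUESTED_MIB=$(( REQUESTED_BYTES / 1024 / 1024 ))
echo "STAR asked for about ${REQUESTED_MIB} MiB"

# Total memory on this machine in MiB (Linux only, read from /proc/meminfo).
if [ -r /proc/meminfo ]; then
  TOTAL_MIB="$(awk '/^MemTotal:/ {print int($2 / 1024)}' /proc/meminfo)"
  echo "This machine has ${TOTAL_MIB} MiB in total"
  if [ "$TOTAL_MIB" -lt "$REQUESTED_MIB" ]; then
    echo "Less RAM than STAR requested; a larger instance type may be needed"
  fi
fi
```

If the compute node does report less memory than requested, changing the compute `InstanceType` in the config file is the setting to look at first.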