## Deploying a GPU Cluster on AWS

Go to the EC2 dashboard and select `Launch Instance`.

- **Step 1: Choose AMI**: Launch `Deep Learning AMI (Ubuntu) Version 22.0`

- **Step 2: Choose an Instance Type**: Choose g3.4xlarge for testing multiple nodes with 1 GPU each

- **Step 3: Configure Instance**: Set `Number of instances` to 4 (or however many you need) and leave the rest as default.

- **Step 4: Add Storage**: Select 200GiB General Purpose. Adjust the size depending on how big the dataset is.

- **Step 5: Add Tags**: Skip this.

- **Step 6: Configure Security Group**: Create a new security group and give it a recognizable name, such as `GPU-group`, so that it is easy to find later.

- **Step 7: Review**: Launch with your own key pair, such as `course-key`.

Additional configuration for the security group is required so that the nodes can communicate with each other.

Go to `Security Groups` and edit the rules:

- **Edit Inbound Rules**: Add a rule with `All traffic` as `Type` and, as `Source`, `Custom` with this security group (type `GPU-group` and the console will show the security group ID).

- **Edit Outbound Rules**: Same as inbound.
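
The same self-referencing rule can also be added from a script. The following is a minimal sketch, assuming `boto3` is installed and AWS credentials are configured; the group ID and region shown are placeholders, not values from this project.

```python
def self_referencing_rule(group_id):
    """Build an 'all traffic' permission whose source is the group itself."""
    return {
        "IpProtocol": "-1",  # -1 means all protocols and all ports
        "UserIdGroupPairs": [{"GroupId": group_id}],
    }


def allow_intra_group_traffic(group_id, region_name):
    """Add the rule to both the inbound and the outbound side of the group."""
    import boto3  # imported lazily so the helper above works without boto3

    ec2 = boto3.client("ec2", region_name=region_name)
    rule = self_referencing_rule(group_id)
    ec2.authorize_security_group_ingress(GroupId=group_id, IpPermissions=[rule])
    ec2.authorize_security_group_egress(GroupId=group_id, IpPermissions=[rule])


if __name__ == "__main__":
    # "sg-0123456789abcdef0" is a placeholder, not a real group ID.
    print(self_referencing_rule("sg-0123456789abcdef0"))
```

Calling `allow_intra_group_traffic("<your group id>", "<your region>")` would apply the rule; the console steps above achieve the same result.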

## Environment Setup

This setup needs to be done for each node individually.

First, activate the PyTorch environment.

`source activate pytorch_p36`

Install the latest PyTorch (1.1 at the time of writing).

`conda install pytorch torchvision cudatoolkit=10.0 -c pytorch`

Install h5pickle (a wrapper around `h5py` for distributed computing):

`conda config --add channels conda-forge`

`conda install h5pickle`

Find the name of the network interface that carries the node's private IP by running `ifconfig` (usually `ens3`) and export it as the NCCL socket interface:

`export NCCL_SOCKET_IFNAME=ens3` (add to `.bashrc` to make this change permanent)
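
If you prefer not to read the `ifconfig` output by hand, the interface name can be guessed from Python. This is only a sketch: `pick_interface` is a hypothetical helper that assumes the first interface that is neither loopback nor a Docker bridge is the right one, which holds for a default single-interface instance but should be checked on anything else.

```python
import socket


def pick_interface(names):
    """Return the first name that is not loopback or a Docker bridge."""
    for name in names:
        if name != "lo" and not name.startswith("docker"):
            return name
    return None


if __name__ == "__main__":
    # socket.if_nameindex() returns (index, name) pairs on Linux.
    names = [name for _, name in socket.if_nameindex()]
    print("export NCCL_SOCKET_IFNAME={}".format(pick_interface(names)))
```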

Upload the scripts to each node, or `git clone` them from the repository.

Also upload the data to each node if you are running without an NFS (Network File System) setup.

## Getting the processed data

Download the data processed using MapReduce by executing: 

`wget https://s3.amazonaws.com/cs205amazonreview/combined_result_5class.h5` 
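
A quick sanity check after the download can catch a truncated file or an error page saved under the `.h5` name. The sketch below only compares the first 8 bytes against the HDF5 file signature; it assumes the file has no user block (so the signature sits at offset 0) and does not validate the contents.

```python
import os

# First 8 bytes of an HDF5 file that has no user block.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def looks_like_hdf5(path):
    """Return True if the file starts with the HDF5 signature."""
    with open(path, "rb") as f:
        return f.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE


if __name__ == "__main__":
    path = "combined_result_5class.h5"
    if os.path.exists(path):
        print(path, "looks like HDF5:", looks_like_hdf5(path))
    else:
        print(path, "not found; run the wget command first")
```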

## Running the sequential version (only needs 1 node)

Run the following command on one node:

`python sequential_rnn.py --dir <Input Path> --batch <Batch Size> --lr <Learning Rate> --epochs <# Epochs> --workers <# Workers> --n_vocab 10003 --filename <Model Name> > log.out &`

where `<Input Path>` is the path to the input file, `<# Workers>` is the number of CPU workers used to load the data, `<Model Name>` is the name under which the RNN model is saved, and `log.out` is the output log file.

One example would be:

`python sequential_rnn.py --dir ../data/combined_result_5class.h5 --batch 128 --lr 0.1 --epochs 10 --workers 8 --n_vocab 10003 --filename model_1n_1g > log_1n_1g_b128.out &`

### Profiling the sequential version

To profile the sequential code, replace `python` with `python -m cProfile -o sequential.prof` in the command above:

`python -m cProfile -o sequential.prof sequential_rnn.py --dir <Input Path> --batch <Batch Size> --lr <Learning Rate> --epochs <# Epochs> --workers <# Workers> --n_vocab 10003 --filename <Model Name> > log.out &`

To visualize the profiling result, install snakeviz:

`pip install snakeviz`

Visualize the profiling by running:

`snakeviz sequential.prof`
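
If no browser is available on the node, the same `.prof` file can be summarized in the terminal with the standard-library `pstats` module. The sketch below falls back to profiling a small stand-in function (`busy_work`, a hypothetical placeholder) when `sequential.prof` does not exist, so that it runs on its own.

```python
import cProfile
import os
import pstats
import tempfile


def busy_work(n):
    """Stand-in workload so the example runs without the training script."""
    return sum(i * i for i in range(n))


def write_demo_profile(path):
    """Profile busy_work and dump the stats in the format `-o` produces."""
    profiler = cProfile.Profile()
    profiler.enable()
    busy_work(10000)
    profiler.disable()
    profiler.dump_stats(path)


def print_top(path, limit=10):
    """Print the `limit` entries with the highest cumulative time."""
    stats = pstats.Stats(path)
    stats.sort_stats("cumulative").print_stats(limit)
    return stats


if __name__ == "__main__":
    # Use sequential.prof if it exists; otherwise fall back to a demo profile.
    path = "sequential.prof"
    if not os.path.exists(path):
        path = os.path.join(tempfile.mkdtemp(), "demo.prof")
        write_demo_profile(path)
    print_top(path)
```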


## Running the distributed version 

Note: this command is for the case where each node keeps a local copy of the data.

On each node, run:

```python -m torch.distributed.launch --nproc_per_node=<#GPU per Node> --nnodes=<Total # of Nodes> --node_rank=<i> --master_addr="<Master Node Private IP>" --master_port=<Free Port> main.py --dir <Input Path> --epochs <# Epochs> --workers <# Workers> --n_vocab <# Words in Dictionary> --dynamic <Dynamic Mode> --filename <Model Name> > log.out &```

where:

- `<#GPU per Node>` is the number of GPUs in each node.
- `<Total # of Nodes>` is the total number of nodes.
- `<i>` is the rank assigned to this node, starting from 0 for the master node.
- `<Master Node Private IP>` is the private IP of the master node, which `ifconfig` shows under `ens3`.
- `<Free Port>` is any free port on the master node.
- `<Input Path>` is the path to the input file.
- `<# Workers>` is the number of CPU workers used to load the data.
- `<# Words in Dictionary>` is the number of words in the dictionary (10003 in our case).
- `<Dynamic Mode>` controls the dynamic load balancer: a negative value disables it, 0 runs it only once after the 1st epoch, and any positive integer `j` rebalances the load every `j` epochs.
- `<Model Name>` is the name under which the RNN model is saved.
- `log.out` is the output log file.
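
To see how the node rank and the GPU count combine, it may help to write out the arithmetic the launch utility uses to number its processes: every node starts `nproc_per_node` processes, and each process receives a global rank derived from the node rank and its local index. The helper names below are illustrative only.

```python
def world_size(nnodes, nproc_per_node):
    """Total number of processes across the whole cluster."""
    return nnodes * nproc_per_node


def global_rank(node_rank, nproc_per_node, local_rank):
    """Rank of one process, given its node and its index on that node."""
    return node_rank * nproc_per_node + local_rank


if __name__ == "__main__":
    # 2 nodes with 1 GPU each, as in the example commands.
    for node in range(2):
        print("node", node, "-> global rank", global_rank(node, 1, 0),
              "of", world_size(2, 1))
```

With one GPU per node, the global rank equals the node rank, which is why the node started with `--node_rank=0` acts as the master.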

For example, to run on 2 nodes without dynamic load balancing, in the background and with log files:

Node 1: 

```python -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr="172.31.35.159" --master_port=23456 dynamic_rnn.py --dir ../data/combined_result_5class.h5  --batch 128 --lr 0.1 --epochs 10 --dynamic -1 --workers 8 --n_vocab 10003 --filename model_2n_1g > log.out &```

Node 2:

```python -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr="172.31.35.159" --master_port=23456 dynamic_rnn.py --dir ../data/combined_result_5class.h5  --batch 128 --lr 0.1 --epochs 10 --dynamic -1 --workers 8 --n_vocab 10003 --filename model_2n_1g > log.out &```
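
Since the per-node commands differ only in `--node_rank`, they can be generated rather than typed by hand. The following sketch builds one command string per node from the flags used above; `launch_command` is a hypothetical helper, and the defaults simply mirror the example.

```python
def launch_command(node_rank, nnodes, master_addr, master_port, script_args,
                   nproc_per_node=1, script="dynamic_rnn.py", log="log.out"):
    """Return the shell command to run on the node with the given rank."""
    parts = [
        "python -m torch.distributed.launch",
        "--nproc_per_node={}".format(nproc_per_node),
        "--nnodes={}".format(nnodes),
        "--node_rank={}".format(node_rank),
        '--master_addr="{}"'.format(master_addr),
        "--master_port={}".format(master_port),
        script,
        script_args,
        "> {} &".format(log),
    ]
    return " ".join(parts)


if __name__ == "__main__":
    args = ("--dir ../data/combined_result_5class.h5 --batch 128 --lr 0.1 "
            "--epochs 10 --dynamic -1 --workers 8 --n_vocab 10003 "
            "--filename model_2n_1g")
    for rank in range(2):
        print(launch_command(rank, 2, "172.31.35.159", 23456, args))
```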

## Configure NFS for file sharing

This is inspired by `Harvard CS205 - Spring 2019 - Infrastructure Guide - I7 - MPI on AWS`, but with modifications to bypass the extra user account created, which is unnecessary in this setting.

Let `master$` denote the master node and `node$` denote any other node.

Run the following commands on the master node:

- Install NFS server: `master$ sudo apt-get install nfs-kernel-server`

- Create NFS directory: `master$ mkdir cloud`

- Export cloud directory: add the following line `/home/ubuntu/cloud *(rw,sync,no_root_squash,no_subtree_check)` to `/etc/exports` by executing `master$ sudo vi /etc/exports`

- Update the changes: `master$ sudo exportfs -a`

Configure the NFS client on other nodes:

- Install NFS client: `node$ sudo apt-get install nfs-common`

- Create NFS directory: `node$ mkdir cloud`

- Mount the shared directory: `node$ sudo mount -t nfs <Master Node Private IP>:/home/ubuntu/cloud /home/ubuntu/cloud`

- Make the mount permanent (optional): add the following line `<Master Node Private IP>:/home/ubuntu/cloud /home/ubuntu/cloud nfs` to `/etc/fstab` by executing `node$ sudo vi /etc/fstab`
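
Before starting a run, it can be worth confirming that the share is actually mounted on each client. The sketch below parses `/proc/mounts` (Linux only) and looks for an NFS entry at the mount point; `is_nfs_mounted` is an illustrative helper, and it accepts any filesystem type starting with `nfs` so that `nfs4` mounts also match.

```python
import os


def is_nfs_mounted(mount_point, mounts_text):
    """Return True if mounts_text lists an NFS filesystem at mount_point."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields are: device, mount point, filesystem type, options, ...
        if (len(fields) >= 3 and fields[1] == mount_point
                and fields[2].startswith("nfs")):
            return True
    return False


if __name__ == "__main__":
    if os.path.exists("/proc/mounts"):
        with open("/proc/mounts") as f:
            print(is_nfs_mounted("/home/ubuntu/cloud", f.read()))
```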

## Running with an NFS-mounted directory

Upload the data to the NFS-mounted directory `cloud` first.

On each node, run:

```python -m torch.distributed.launch --nproc_per_node=<#GPU per Node> --nnodes=<Total # of Nodes> --node_rank=<i> --master_addr="<Master Node Private IP>" --master_port=<Free Port> main.py --dir <Mounted Input Path> --epochs <# Epochs> --workers <# Workers> --n_vocab <# Words in Dictionary> --dynamic <Dynamic Mode> --filename <Model Name> > log.out &```

where:

- `<#GPU per Node>` is the number of GPUs in each node.
- `<Total # of Nodes>` is the total number of nodes.
- `<i>` is the rank assigned to this node, starting from 0 for the master node.
- `<Master Node Private IP>` is the private IP of the master node, which `ifconfig` shows under `ens3`.
- `<Free Port>` is any free port on the master node.
- `<Mounted Input Path>` is the path to the input file inside the mounted directory.
- `<# Workers>` is the number of CPU workers used to load the data.
- `<# Words in Dictionary>` is the number of words in the dictionary (10003 in our case).
- `<Dynamic Mode>` controls the dynamic load balancer: a negative value disables it, 0 runs it only once after the 1st epoch, and any positive integer `j` rebalances the load every `j` epochs.
- `<Model Name>` is the name under which the RNN model is saved.
- `log.out` is the output log file.
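
The three cases of the `--dynamic` value can be read as a small decision rule. The sketch below is one interpretation of the description above, not the code from the training script: `should_rebalance` is a hypothetical name, and it assumes epochs are counted from 1 and the check happens after each epoch finishes.

```python
def should_rebalance(epoch, dynamic):
    """Decide whether to rebalance after the given 1-based epoch."""
    if dynamic < 0:
        return False          # negative: load balancer disabled
    if dynamic == 0:
        return epoch == 1     # zero: once, after the first epoch
    return epoch % dynamic == 0  # positive j: every j epochs


if __name__ == "__main__":
    for mode in (-1, 0, 2):
        epochs = [e for e in range(1, 7) if should_rebalance(e, mode)]
        print("--dynamic", mode, "rebalances after epochs", epochs)
```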

For example, to run on 2 nodes without dynamic load balancing, in the background and with log files:

Node 1: 

```python -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr="172.31.35.159" --master_port=23456 dynamic_rnn.py --dir ../cloud/combined_result_5class.h5  --batch 128 --lr 0.1 --epochs 10 --dynamic -1 --workers 8 --n_vocab 10003 --filename model_2n_1g > log.out &```

Node 2:

```python -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr="172.31.35.159" --master_port=23456 dynamic_rnn.py --dir ../cloud/combined_result_5class.h5  --batch 128 --lr 0.1 --epochs 10 --dynamic -1 --workers 8 --n_vocab 10003 --filename model_2n_1g > log.out &```