<center><img src="https://i.imgur.com/FHMoW3N.png" width=360px><br><b>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Collaborative training <sup>v0.9 alpha</sup></b></center>


This notebook uses a local or Colab GPU to help train ALBERT-large collaboratively. Your instance will compute gradients and exchange them with volunteers around the world. We explain how it works at the bottom of the notebook. For now, please run all cells :)

To start training, you will need to log in to your Hugging Face account. Please fill in the prompts as in the example below (replace `robot-bengali` with your username):

![img](https://i.imgur.com/txuWbJi.png)

Please do not run Colab notebooks from multiple Google accounts: Google doesn't like this.

In [None]:
experiment_name = "bengali_MAIN"
hivemind_version = "0.9.9.post1"
%env EXPERIMENT_ID 15

!echo "Installing dependencies..."
!pip install git+https://github.com/learning-at-home/hivemind.git@{hivemind_version} >> install.log 2>&1
!git clone https://github.com/yandex-research/DeDLOC >> install.log 2>&1
%cd ./DeDLOC/sahajbert
!pip install -r requirements.txt >> install.log 2>&1

from shlex import quote
import torch
from huggingface_auth import authorize_with_huggingface
from runner import run_with_logging

assert torch.cuda.is_available(), "GPU device not found. If running in colab, please retry in a few minutes."
device_name = torch.cuda.get_device_name(0)
microbatch_size = 4 if 'T4' in device_name or 'P100' in device_name else 1
print(f"Running with device {device_name}, local batch size = {microbatch_size}")

authorizer = authorize_with_huggingface()

!ulimit -n 4096 && HIVEMIND_THREADS=256 \
 USERNAME={quote(authorizer.username)} PASSWORD={quote(authorizer.password)} python ./run_trainer.py --client_mode \
 --initial_peers {authorizer.coordinator_ip}:{authorizer.coordinator_port} \
 --averaging_expiration 10 --statistics_expiration 120 \
 --batch_size_lead 400 --per_device_train_batch_size {microbatch_size} --gradient_accumulation_steps 1 \
 --logging_first_step --logging_steps 100 --run_name {quote(authorizer.username)} \
 --output_dir ./outputs --overwrite_output_dir --logging_dir ./logs \
 --experiment_prefix {quote(experiment_name)} --seed 42

### What's up next?
* Check the training progress on public learning curves: https://wandb.ai/learning-at-home/Main_metrics
* Run a second GPU session with Kaggle notebooks: https://www.kaggle.com/yhn112/collaborative-training-d87a28
* View model checkpoints: https://huggingface.co/neuropark/sahajBERT
* See [this tutorial](https://github.com/learning-at-home/hivemind/tree/master/examples/albert) on how to start your own collaborative runs!


Co-created by [yhn112](https://github.com/yhn112), [leshanbog](https://github.com/leshanbog), [foksly](https://github.com/foksly) and [borzunov](https://github.com/borzunov) from [hivemind](https://github.com/learning-at-home/hivemind) (YSDA), [lhoestq](https://github.com/lhoestq), [SaulLu](https://github.com/SaulLu) and [stas00](https://github.com/stas00) from [huggingface](http://huggingface.co).

### How it works

Since peers can join and leave at any time, we can't use a global [Ring All-Reduce](https://towardsdatascience.com/visual-intuition-on-ring-allreduce-for-distributed-deep-learning-d1f34b4911da) for averaging: a single missing peer can break the entire protocol. Instead, peers dynamically assemble into small groups and run all-reduce within each group. Consider an example with 9 GPUs:


<img src="https://i.imgur.com/QcD1mfG.png" width=360px>

The all-reduce protocol within a group could be Ring All-Reduce, but we use a simpler all-to-all algorithm known as butterfly-like all-reduce.<br>

<img src="https://i.imgur.com/ewq3vS6.png" width=380px>
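To make the butterfly-like all-reduce more concrete, here is a minimal single-process sketch. It assumes each peer holds a plain Python list of floats and that peer `i` is responsible for averaging chunk `i`. The function name `butterfly_all_reduce` and the chunking scheme are illustrative assumptions, not the actual hivemind implementation, which exchanges tensors over the network.

```python
def butterfly_all_reduce(vectors):
    """Average equal-length vectors held by the peers of one group.

    Illustrative sketch: peer i collects chunk i from every peer,
    averages it, and sends the averaged chunk back to everyone.
    """
    n_peers = len(vectors)
    dim = len(vectors[0])
    # Chunk i covers indices [bounds[i], bounds[i + 1]).
    bounds = [i * dim // n_peers for i in range(n_peers + 1)]

    # Step 1 (scatter and reduce): peer i averages chunk i over all peers.
    averaged_chunks = []
    for i in range(n_peers):
        lo, hi = bounds[i], bounds[i + 1]
        chunk = [sum(v[k] for v in vectors) / n_peers for k in range(lo, hi)]
        averaged_chunks.append(chunk)

    # Step 2 (gather): every peer receives all averaged chunks.
    full = [x for chunk in averaged_chunks for x in chunk]
    return [list(full) for _ in range(n_peers)]


# Three peers, each holding a 3-element "gradient".
group = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
result = butterfly_all_reduce(group)
print(result[0])  # every peer ends up with the same averaged vector
```

In this scheme each peer sends and receives roughly one full vector per round in total, and all chunks are exchanged in a single all-to-all step rather than being passed around a ring.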

After each successful round, participants shuffle around and find new groups:

<img src="https://i.imgur.com/dexNCL3.png" width=350px>
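The effect of regrouping can be sketched with 9 peers arranged in a hypothetical 3x3 grid: group by rows in the first round, then by columns in the second. This fixed row/column schedule is an assumption chosen for illustration; in the real run, peers find their groups dynamically. Each peer holds a single number here instead of a gradient tensor.

```python
def average_within_groups(values, groups):
    """Replace each peer's value with the mean of its group."""
    new_values = list(values)
    for group in groups:
        mean = sum(values[p] for p in group) / len(group)
        for p in group:
            new_values[p] = mean
    return new_values


side = 3
values = [float(i) for i in range(side * side)]  # peers 0..8 hold 0.0..8.0

# Hypothetical schedule: rows first, then columns of the 3x3 grid.
rows = [[r * side + c for c in range(side)] for r in range(side)]
cols = [[r * side + c for r in range(side)] for c in range(side)]

after_round_1 = average_within_groups(values, rows)
after_round_2 = average_within_groups(after_round_1, cols)

print(after_round_1)
print(after_round_2)
```

With this particular schedule, two rounds of small-group averaging are enough for every peer to reach the global mean, even though no peer ever talks to more than two others per round.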

If one of the peers fails to do its part, this will only affect that peer's local group, and only for a single round.


<img src="https://i.imgur.com/RBmElUi.png" width=340px>

Afterwards, peers from the failed group will find new groupmates according to the algorithm.
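The failure case can be simulated in the same toy setting. The sketch below assumes that when any member of a group fails, the whole group skips that round and its members keep their previous values. That recovery behavior, the `run_round` helper, and the 3x3 row/column schedule are all illustrative assumptions rather than the exact protocol.

```python
def run_round(values, groups, failed_peers=frozenset()):
    """Average within each group; a group with a failed peer skips the round."""
    new_values = list(values)
    for group in groups:
        if any(p in failed_peers for p in group):
            continue  # assumed behavior: members keep their old values
        mean = sum(values[p] for p in group) / len(group)
        for p in group:
            new_values[p] = mean
    return new_values


side = 3
values = [float(i) for i in range(side * side)]  # peers 0..8 hold 0.0..8.0
rows = [[r * side + c for c in range(side)] for r in range(side)]
cols = [[r * side + c for r in range(side)] for c in range(side)]

# Round 1: peer 4 fails, so only its group (peers 3, 4, 5) is affected.
after_round_1 = run_round(values, rows, failed_peers={4})
# Round 2: everyone regroups by columns and averaging continues.
after_round_2 = run_round(after_round_1, cols)

print(after_round_1)
print(after_round_2)
```

In this simulation the other two groups finish round 1 normally, and in round 2 the peers from the failed group average with new groupmates. The values are not yet identical after two rounds, but they are closer to the global mean than before, and further rounds keep shrinking the gap.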