Hack Up the Credentials
-----------------------
Set up the credentials used to submit runs to the `r1z1` cluster.

This step is **not** needed in the production solution for Databricks notebooks.

In [4]:
# CREDENTIALS has been preset with `conda env config vars set CREDENTIALS="xxs"`.

# To generate CREDENTIALS for your local environment, run the following code:
# import base64, json
# data = {
#     "workspace_url": "https://dbc-04ac0685-8857.staging.cloud.databricks.com/",
#     "token": "dapi338xx",
#     "mosaic_token": "Y4x7xx",
# }
# data = json.dumps(data)
# credentials = base64.b64encode(data.encode('utf-8')).decode('utf-8')

import os
from ygong.mosaic import submit, _set_up_environment

credentials = os.environ.get("CREDENTIALS")
_set_up_environment(credentials)
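
The commented-out recipe above can be checked locally with a round trip. The following is a minimal sketch: the helper names `encode_credentials` and `decode_credentials` are hypothetical, the values are placeholders, and the assumption that the consumer decodes the blob by reversing the same two steps is not confirmed by the `ygong.mosaic` source.

```python
import base64
import json


def encode_credentials(data: dict) -> str:
    """Serialize a dict to JSON, then base64-encode it into an ASCII string."""
    raw = json.dumps(data).encode("utf-8")
    return base64.b64encode(raw).decode("utf-8")


def decode_credentials(blob: str) -> dict:
    """Reverse encode_credentials (assumed to mirror what the consumer does)."""
    return json.loads(base64.b64decode(blob).decode("utf-8"))


# Placeholder values only; real tokens should never be committed to a notebook.
example = {
    "workspace_url": "https://example.cloud.databricks.com/",
    "token": "dapi-placeholder",
    "mosaic_token": "mosaic-placeholder",
}
blob = encode_credentials(example)
assert decode_credentials(blob) == example
```

The resulting `blob` is the kind of value that would be stored in the `CREDENTIALS` environment variable.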

Testing on a Single Device (CPU or GPU) on a Single Node
--------------------------------------------------------
The main goal is to make sure `trainer.fit` runs successfully on a single node with a single device. It can use a very small smoke-test dataset and run the model training for only a few batches.

This step can be run on a developer's desktop if the hardware can fit the model, or in a notebook attached to a remote GPU instance.
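
One way to keep such a smoke test short is to cap the number of batches fed to the training loop. The sketch below is framework-agnostic and uses only the standard library; `smoke_batches` is a hypothetical helper and `range(1000)` stands in for a real dataset.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def smoke_batches(samples: Iterable[T], batch_size: int, max_batches: int) -> Iterator[List[T]]:
    """Yield at most `max_batches` batches of up to `batch_size` samples each."""
    it = iter(samples)
    for _ in range(max_batches):
        batch = list(islice(it, batch_size))
        if not batch:
            return  # dataset exhausted before reaching max_batches
        yield batch


# Three batches of four samples from a stand-in dataset of 1000 items.
batches = list(smoke_batches(range(1000), batch_size=4, max_batches=3))
print(len(batches), batches[-1])  # prints: 3 [8, 9, 10, 11]
```

If the training loop is Composer's `Trainer`, a similar effect is usually available by passing a batch-based `max_duration` such as `"2ba"` instead of trimming the data by hand.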

In [21]:
import sys
import importlib
# change this to `/Workspace/Repos/yu.gong@databricks.com/custom-train-demo/src` when running in a Databricks notebook
sys.path.append(os.path.abspath('../src'))

# always reload for local updates to take effect
import train
importlib.reload(train)


<module 'train' from '/Users/yu.gong/workspace/mosaic/custom-train-demo/src/train.py'>

In [20]:
from dataclasses import asdict
import json
name = "custom-train-demo"
config = train.MyConfig(global_train_batch_size=4, name=name)
train.main(config)
print("Done")

Done


Testing on Multiple GPUs on a Single Node
--------------------------------------------------
For this step, I prefer working from the terminal.


1. Manually set up the environment (e.g. clone the repo, install packages) to mimic what will happen in the remote run. In this example, follow this [guide](https://docs.databricks.com/en/repos/index.html) to add [custom-train-demo](https://e2-dogfood.staging.cloud.databricks.com/browse/folders/58238802205414?o=6051921418418893) to the workspace.
2. Open the terminal of the attached interactive GPU instance.
3. [Optional] Install the llm-foundry fork that contains the hacks, mimicking `RunConfig.integrations`. *NOT NEEDED IN PRODUCTION*

```bash
git clone -b prototype "https://github.com/ygong1/llm-foundry.git" ~/llm-foundry
cd ~/llm-foundry && pip install -e ".[gpu]"
```
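4. Run the distributed training command. Here `json_str` is a shell variable holding the JSON-serialized config printed by the notebook cell at the end of this section. *NOTE*: `~/llm-foundry/scripts/train/launcher.py` should be replaced with `composer` in production.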

```bash
cd /Workspace/Repos/yu.gong@databricks.com/custom-train-demo/
~/llm-foundry/scripts/train/launcher.py ./src/train.py "$json_str"
```


In this example, there is a race condition when downloading the CIFAR10 dataset in [train.py](https://e2-dogfood.staging.cloud.databricks.com/?o=6051921418418893#files/58238802205419). In the Databricks workspace notebook, we can edit train.py directly so that only the local rank 0 process downloads the dataset while the other ranks wait for it to be ready, save the changes, and then switch back to the terminal to re-run the launcher command from step 4.

We repeat this iteration until the command runs distributed training successfully on multiple GPUs on the single node.

Finally, commit the code to the repo from the workspace and push to origin.
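
The rank-0-only download described above can be sketched with a marker file that the other ranks poll for. This is a minimal standard-library illustration, not the actual change made to train.py: `ensure_dataset`, the `.download_complete` marker, and the `download` callback are hypothetical, and the code assumes the launcher exports `LOCAL_RANK` to each process, as torchrun-style launchers typically do.

```python
import os
import time
from pathlib import Path
from typing import Callable, Optional


def ensure_dataset(
    data_dir: Path,
    download: Callable[[Path], None],
    local_rank: Optional[int] = None,
    timeout: float = 600.0,
    poll_interval: float = 1.0,
) -> Path:
    """Download on local rank 0 only; every other rank waits for a marker file."""
    if local_rank is None:
        # Assumption: the launcher sets LOCAL_RANK for each worker process.
        local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    data_dir = Path(data_dir)
    marker = data_dir / ".download_complete"

    if local_rank == 0:
        if not marker.exists():
            data_dir.mkdir(parents=True, exist_ok=True)
            download(data_dir)
            marker.touch()  # written last, so it signals a finished download
        return data_dir

    # Non-zero ranks never download; they poll until rank 0 has finished.
    deadline = time.monotonic() + timeout
    while not marker.exists():
        if time.monotonic() > deadline:
            raise TimeoutError(f"dataset not ready in {data_dir} after {timeout}s")
        time.sleep(poll_interval)
    return data_dir
```

When a process group is already initialized, a collective barrier is the more common way to express the same wait; the marker file has the advantage of working before any distributed setup has run.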

In [22]:
config = train.MyConfig(global_train_batch_size=4, name=name)
json_str = json.dumps(asdict(config))
print(json_str)


{"name": "custom-train-demo", "seed": 42, "dist_timeout": 600.0, "global_train_batch_size": 4, "device_train_microbatch_size": 16}


Submit Run Remotely On a Cluster
-------------------------------

Wrap up the config and submit the remote run.




In [26]:
from mcli import RunConfig
from ygong.mosaic import ScalingConfig, submit

config = train.MyConfig(global_train_batch_size=4, name=name)
json_str = json.dumps(asdict(config))
commands = [
    "cd ~/custom-train-demo",
    f'''~/llm-foundry/scripts/train/launcher.py ./src/train.py  '{json_str}' '''
]

scalingConfig = ScalingConfig(gpusNum=8, gpuType="a100_80gb", poolName="r1z1")

# Ideally, the integration would look like the following, so that all local
# file changes are automatically synced to the remote machine for training:
# integrations = [
#     {
#         'integration_type': 'databricks_repo',
#         'git_repo': '/Workspace/Repos/yu.gong@databricks.com/custom-train-demo/',
#     }
# ]
integrations = [
    {
        'integration_type': 'git_repo',
        'git_repo': 'ygong1/llm-foundry',
        'path': '~/llm-foundry',
        'git_branch': 'prototype',
        'pip_install': '-e .[gpu]',
        'ssh_clone': False
    },
    {
        'integration_type': 'git_repo',
        'git_repo': 'ygong1/custom-train-demo',
        'path': '~/custom-train-demo',
        'ssh_clone': False
    },
    {
        'integration_type': 'pip_packages',
        'packages': ['pynvml', 'mosaicml-streaming[databricks]'],
    },
]

config = RunConfig(
    name="custom-train-demo",
    image='mosaicml/llm-foundry:2.2.1_cu121_flash2-latest',
    command="\n".join(commands),
    compute=scalingConfig.toCompute,
    integrations=integrations,
    env_variables={},
)
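
The `commands` list above wraps the JSON in single quotes by hand, which works as long as no config value contains a single quote. A slightly more defensive variant, sketched here with the standard library's `shlex.quote`, lets Python do the shell quoting; the payload is a stand-in chosen to include an awkward character.

```python
import json
import shlex

# Stand-in payload; in the notebook this would be json.dumps(asdict(config)).
json_str = json.dumps({"name": "o'brien run", "global_train_batch_size": 4})

# The launcher path stays unquoted so the shell can still expand the tilde.
launcher = "~/llm-foundry/scripts/train/launcher.py"
command = f"{launcher} ./src/train.py {shlex.quote(json_str)}"

# Splitting the command the way a POSIX shell would recovers the JSON intact.
argv = shlex.split(command)
assert argv[-1] == json_str
assert json.loads(argv[-1])["name"] == "o'brien run"
```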

In [24]:
submit(None, config, scalingConfig, debug=True)

Button(description='cancel the run', style=ButtonStyle())

|   | Run Name | Run ID | Status | Experiment Run |
|---|----------|--------|--------|----------------|
| 0 | custom-train-demo-gZr3uF | ffe617bd-d07a-42d6-a830-36edf56aa6d3 | RUNNING | |


2024-03-24 17:23:45,166 - ygong.mosaic.submit - DEBUG - waiting for the MLFLow experiment run to be ready, run statusRUNNING


KeyboardInterrupt: 
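
The log above shows `submit` polling until the MLflow experiment run is ready, and the cell was stopped by hand. A bounded wait of the kind sketched below is one way to avoid waiting indefinitely. It is an illustration only, not the `ygong.mosaic` implementation: `wait_until` and `fake_poll` are hypothetical names.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_until(poll: Callable[[], Optional[T]], timeout: float = 600.0, interval: float = 5.0) -> T:
    """Call `poll` until it returns a non-None value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = poll()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"gave up after {timeout} seconds")
        time.sleep(interval)


# Stand-in for "fetch the experiment run if it exists yet": ready on the third poll.
attempts = {"count": 0}


def fake_poll() -> Optional[str]:
    attempts["count"] += 1
    return "run-ready" if attempts["count"] >= 3 else None


print(wait_until(fake_poll, timeout=1.0, interval=0.01))  # prints: run-ready
```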