
Add a sample RayJob to fine-tune a PyTorch lightning text classifier with Ray Data #1891

Merged
merged 1 commit into ray-project:master from ray-kueue-example
Jan 31, 2024

Conversation

andrewsykim
Contributor

Why are these changes needed?

Add a sample RayJob that packages the "Fine-tune a PyTorch Lightning Text Classifier with Ray Data" example into a single RayJob.

I plan to use this sample for future documentation with Kueue, but this change could be a good sample on its own as well.
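
For readers less familiar with the RayJob API, the overall shape of such a manifest is sketched below. This is only illustrative: the name, image, Ray version, and dependency list are assumptions and may not match the actual sample added by this PR.

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: pytorch-text-classifier                   # hypothetical name
spec:
  shutdownAfterJobFinishes: true                  # tear down the RayCluster once the job finishes
  entrypoint: python fine-tune-pytorch-text-classifier.py   # script name taken from the log output below
  runtimeEnvYAML: |
    pip:
      - datasets
      - transformers
  rayClusterSpec:
    rayVersion: "2.9.0"                           # assumed Ray version
    headGroupSpec:
      rayStartParams:
        dashboard-host: "0.0.0.0"
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray-ml:2.9.0-gpu    # assumed image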

Related issue number

Part of #1890

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

@andrewsykim
Contributor Author

@kevin85421 PTAL, I plan to use this sample for upcoming Kueue documentation (#1890)
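
For the Kueue use case, the relevant additions on top of the sample are a queue-name label and creating the RayJob suspended so that Kueue controls admission. A minimal sketch, assuming a LocalQueue named user-queue exists; the label key is Kueue's standard one, everything else is illustrative:

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: pytorch-text-classifier              # hypothetical name
  labels:
    kueue.x-k8s.io/queue-name: user-queue    # assumed LocalQueue name
spec:
  suspend: true                              # created suspended; Kueue unsuspends it once quota is admitted
  # entrypoint, runtimeEnvYAML, and rayClusterSpec as in the sample manifest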

@andrewsykim
Contributor Author

Here's the log output from the Job I ran in my own GKE cluster:

2024-01-30 07:44:32,696	INFO cli.py:36 -- Job submission server address: http://ch-text-classifier-pdk2c-raycluster-hmsjj-head-svc.default.svc.cluster.local:8265
2024-01-30 07:44:33,771	SUCC cli.py:60 -- ----------------------------------------------------------------
2024-01-30 07:44:33,771	SUCC cli.py:61 -- Job 'pytorch-text-classifier-pdk2c-kc2x8' submitted successfully
2024-01-30 07:44:33,771	SUCC cli.py:62 -- ----------------------------------------------------------------
2024-01-30 07:44:33,771	INFO cli.py:285 -- Next steps
2024-01-30 07:44:33,771	INFO cli.py:286 -- Query the logs of the job:
2024-01-30 07:44:33,771	INFO cli.py:288 -- ray job logs pytorch-text-classifier-pdk2c-kc2x8
2024-01-30 07:44:33,772	INFO cli.py:290 -- Query the status of the job:
2024-01-30 07:44:33,772	INFO cli.py:292 -- ray job status pytorch-text-classifier-pdk2c-kc2x8
2024-01-30 07:44:33,772	INFO cli.py:294 -- Request the job to be stopped:
2024-01-30 07:44:33,772	INFO cli.py:296 -- ray job stop pytorch-text-classifier-pdk2c-kc2x8
2024-01-30 07:44:33,790	INFO cli.py:303 -- Tailing logs until the job exits (disable with --no-wait):
[2024-01-30 07:44:45,014] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
2024-01-30 07:44:50.800330: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-01-30 07:44:51.906945: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2024-01-30 07:44:51.907175: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2024-01-30 07:44:51.907189: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.

Downloading readme:   0%|          | 0.00/35.3k [00:00<?, ?B/s]
Downloading readme: 100%|██████████| 35.3k/35.3k [00:00<00:00, 12.8MB/s]

Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/37.7k [00:00<?, ?B/s]
Downloading data: 100%|██████████| 37.7k/37.7k [00:00<00:00, 213kB/s]

Downloading data files:  33%|███▎      | 1/3 [00:00<00:00,  5.55it/s]

Downloading data: 100%|██████████| 251k/251k [00:00<00:00, 2.52MB/s]

Downloading data files:  67%|██████▋   | 2/3 [00:00<00:00,  7.43it/s]

Downloading data: 100%|██████████| 37.6k/37.6k [00:00<00:00, 554kB/s]

Downloading data files: 100%|██████████| 3/3 [00:00<00:00,  8.49it/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]
Extracting data files: 100%|██████████| 3/3 [00:00<00:00, 2139.59it/s]

Generating test split:   0%|          | 0/1063 [00:00<?, ? examples/s]
Generating test split: 100%|██████████| 1063/1063 [00:00<00:00, 223912.47 examples/s]

Generating train split:   0%|          | 0/8551 [00:00<?, ? examples/s]
Generating train split: 100%|██████████| 8551/8551 [00:00<00:00, 1892538.31 examples/s]

Generating validation split:   0%|          | 0/1043 [00:00<?, ? examples/s]
Generating validation split: 100%|██████████| 1043/1043 [00:00<00:00, 650990.93 examples/s]
2024-01-30 07:44:57,635	INFO worker.py:1405 -- Using address 10.12.0.63:6379 set in the environment variable RAY_ADDRESS
2024-01-30 07:44:57,635	INFO worker.py:1540 -- Connecting to existing Ray cluster at address: 10.12.0.63:6379...
2024-01-30 07:44:57,643	INFO worker.py:1715 -- Connected to Ray cluster. View the dashboard at http://10.12.0.63:8265 

Downloading:   0%|          | 0.00/29.0 [00:00<?, ?B/s]
Downloading: 100%|██████████| 29.0/29.0 [00:00<00:00, 230kB/s]

Downloading:   0%|          | 0.00/570 [00:00<?, ?B/s]
Downloading: 100%|██████████| 570/570 [00:00<00:00, 4.65MB/s]

Downloading:   0%|          | 0.00/208k [00:00<?, ?B/s]
Downloading: 100%|██████████| 208k/208k [00:00<00:00, 5.40MB/s]

Downloading:   0%|          | 0.00/426k [00:00<?, ?B/s]
Downloading: 100%|██████████| 426k/426k [00:00<00:00, 7.93MB/s]
2024-01-30 07:45:02,970	INFO tune.py:592 -- [output] This will use the new output engine with verbosity 1. To disable the new output and use the legacy output engine, set the environment variable RAY_AIR_NEW_OUTPUT=0. For more information, please see https://github.com/ray-project/ray/issues/36949

View detailed results here: /home/ray/ray_results/ptl-sent-classification
To visualize your results with TensorBoard, run: `tensorboard --logdir /home/ray/ray_results/ptl-sent-classification`
(TrainTrainable pid=1480) [2024-01-30 07:45:07,097] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
(TrainTrainable pid=1480) 2024-01-30 07:45:13.058412: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
(TrainTrainable pid=1480) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(TrainTrainable pid=1480) 2024-01-30 07:45:13.933668: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
(TrainTrainable pid=1480) 2024-01-30 07:45:13.933789: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
(TrainTrainable pid=1480) 2024-01-30 07:45:13.933799: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.

Training started with configuration:
╭──────────────────────────────────────╮
│ Training config                      │
├──────────────────────────────────────┤
│ train_loop_config/batch_size      16 │
│ train_loop_config/eps          1e-08 │
│ train_loop_config/lr           1e-05 │
│ train_loop_config/max_epochs       5 │
╰──────────────────────────────────────╯
(RayTrainWorker pid=1584) Setting up process group for: env:// [rank=0, world_size=2]
(TorchTrainer pid=1480) Started distributed worker processes: 
(TorchTrainer pid=1480) - (ip=10.12.0.63, pid=1584) world_rank=0, local_rank=0, node_rank=0
(TorchTrainer pid=1480) - (ip=10.12.0.63, pid=1585) world_rank=1, local_rank=1, node_rank=0
(RayTrainWorker pid=1584) [2024-01-30 07:45:19,372] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
(RayTrainWorker pid=1585) [2024-01-30 07:45:19,396] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
(SplitCoordinator pid=1755) Auto configuring locality_with_output=['e3a36cb35ddf7b9f7674a3b43415c8eca5c7a91aece18a2d47251cbc', 'e3a36cb35ddf7b9f7674a3b43415c8eca5c7a91aece18a2d47251cbc']
(RayTrainWorker pid=1585) 2024-01-30 07:45:25.227997: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
(RayTrainWorker pid=1585) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(RayTrainWorker pid=1585) 2024-01-30 07:45:26.320797: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
(RayTrainWorker pid=1585) 2024-01-30 07:45:26.320928: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
(RayTrainWorker pid=1585) 2024-01-30 07:45:26.320940: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
(SplitCoordinator pid=1754) Auto configuring locality_with_output=['e3a36cb35ddf7b9f7674a3b43415c8eca5c7a91aece18a2d47251cbc', 'e3a36cb35ddf7b9f7674a3b43415c8eca5c7a91aece18a2d47251cbc']
(RayTrainWorker pid=1584) 
Downloading:   0%|          | 0.00/416M [00:00<?, ?B/s]
(RayTrainWorker pid=1584) 
Downloading:   1%|          | 4.39M/416M [00:00<00:09, 46.0MB/s]
(RayTrainWorker pid=1584) 
Downloading:   3%|▎         | 12.9M/416M [00:00<00:05, 71.5MB/s]
(RayTrainWorker pid=1584) 
Downloading:   5%|▌         | 21.1M/416M [00:00<00:05, 78.0MB/s]
(RayTrainWorker pid=1584) 
Downloading:   7%|▋         | 29.4M/416M [00:00<00:04, 81.6MB/s]
(RayTrainWorker pid=1584) 
Downloading:   9%|▉         | 37.4M/416M [00:00<00:04, 82.6MB/s]
(RayTrainWorker pid=1584) 
Downloading:  11%|█         | 46.3M/416M [00:00<00:04, 86.2MB/s]
(RayTrainWorker pid=1584) 
Downloading:  13%|█▎        | 55.2M/416M [00:00<00:04, 88.3MB/s]
(RayTrainWorker pid=1584) 
Downloading:  15%|█▌        | 64.2M/416M [00:00<00:04, 90.2MB/s]
(RayTrainWorker pid=1584) 
Downloading:  18%|█▊        | 73.2M/416M [00:00<00:03, 91.7MB/s]
(RayTrainWorker pid=1584) 
Downloading:  20%|█▉        | 82.1M/416M [00:01<00:03, 92.0MB/s]
(RayTrainWorker pid=1584) 
Downloading:  22%|██▏       | 91.1M/416M [00:01<00:03, 93.0MB/s]
(RayTrainWorker pid=1584) 
Downloading:  24%|██▍       | 100M/416M [00:01<00:03, 93.6MB/s] 
(RayTrainWorker pid=1584) 
Downloading:  26%|██▋       | 109M/416M [00:01<00:03, 92.9MB/s]
(RayTrainWorker pid=1584) 
Downloading:  28%|██▊       | 118M/416M [00:01<00:03, 91.4MB/s]
(RayTrainWorker pid=1584) 
Downloading:  30%|███       | 127M/416M [00:01<00:03, 88.2MB/s]
(RayTrainWorker pid=1584) 
Downloading:  33%|███▎      | 135M/416M [00:01<00:03, 89.1MB/s]
(RayTrainWorker pid=1584) 
Downloading:  35%|███▍      | 144M/416M [00:01<00:03, 88.3MB/s]
(RayTrainWorker pid=1584) 
Downloading:  37%|███▋      | 152M/416M [00:01<00:03, 87.7MB/s]
(RayTrainWorker pid=1584) 
Downloading:  39%|███▉      | 161M/416M [00:01<00:03, 88.9MB/s]
(RayTrainWorker pid=1584) 
Downloading:  41%|████      | 170M/416M [00:02<00:02, 90.2MB/s]
(RayTrainWorker pid=1584) 
Downloading:  43%|████▎     | 179M/416M [00:02<00:02, 91.0MB/s]
(RayTrainWorker pid=1584) 
Downloading:  45%|████▌     | 188M/416M [00:02<00:02, 91.9MB/s]
(RayTrainWorker pid=1584) 
Downloading:  47%|████▋     | 197M/416M [00:02<00:02, 92.0MB/s]
(RayTrainWorker pid=1584) 2024-01-30 07:45:25.220206: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
(RayTrainWorker pid=1584) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(RayTrainWorker pid=1584) 
Downloading:  49%|████▉     | 205M/416M [00:02<00:02, 90.0MB/s]
(RayTrainWorker pid=1584) 
Downloading:  51%|█████▏    | 214M/416M [00:02<00:02, 90.1MB/s]
(RayTrainWorker pid=1584) 
Downloading:  54%|█████▎    | 223M/416M [00:02<00:02, 90.6MB/s]
(RayTrainWorker pid=1584) 
Downloading:  56%|█████▌    | 231M/416M [00:02<00:02, 90.8MB/s]
(RayTrainWorker pid=1584) 
Downloading:  58%|█████▊    | 240M/416M [00:02<00:02, 91.8MB/s]
(RayTrainWorker pid=1584) 
Downloading:  60%|██████    | 250M/416M [00:02<00:01, 93.1MB/s]
(RayTrainWorker pid=1584) 
Downloading:  62%|██████▏   | 259M/416M [00:03<00:01, 92.8MB/s]
(RayTrainWorker pid=1584) 
Downloading:  64%|██████▍   | 268M/416M [00:03<00:01, 93.3MB/s]
(RayTrainWorker pid=1584) 
Downloading:  67%|██████▋   | 276M/416M [00:03<00:01, 91.6MB/s]
(RayTrainWorker pid=1584) 
Downloading:  69%|██████▊   | 285M/416M [00:03<00:01, 91.7MB/s]
(RayTrainWorker pid=1584) 
Downloading:  71%|███████   | 294M/416M [00:03<00:01, 89.5MB/s]
(RayTrainWorker pid=1584) 2024-01-30 07:45:26.328328: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(RayTrainWorker pid=1584) 2024-01-30 07:45:26.328343: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
(RayTrainWorker pid=1584) 
Downloading:  73%|███████▎  | 303M/416M [00:03<00:01, 90.2MB/s]
(RayTrainWorker pid=1584) 
Downloading:  75%|███████▍  | 312M/416M [00:03<00:01, 91.1MB/s]
(RayTrainWorker pid=1584) 
Downloading:  77%|███████▋  | 321M/416M [00:03<00:01, 91.6MB/s]
(RayTrainWorker pid=1584) 
Downloading:  79%|███████▉  | 329M/416M [00:03<00:00, 92.2MB/s]
(RayTrainWorker pid=1584) 
Downloading:  81%|████████▏ | 338M/416M [00:03<00:00, 92.9MB/s]
(RayTrainWorker pid=1584) 
Downloading:  84%|████████▎ | 347M/416M [00:04<00:00, 93.3MB/s]
(RayTrainWorker pid=1584) 
Downloading:  86%|████████▌ | 356M/416M [00:04<00:00, 93.6MB/s]
(RayTrainWorker pid=1584) 
Downloading:  88%|████████▊ | 365M/416M [00:04<00:00, 91.6MB/s]
(RayTrainWorker pid=1584) 
Downloading:  90%|█████████ | 374M/416M [00:04<00:00, 89.8MB/s]
(RayTrainWorker pid=1584) 
Downloading:  92%|█████████▏| 383M/416M [00:04<00:00, 87.6MB/s]
(RayTrainWorker pid=1584) 
Downloading:  94%|█████████▍| 391M/416M [00:04<00:00, 88.4MB/s]
(RayTrainWorker pid=1584) 
Downloading:  96%|█████████▌| 400M/416M [00:04<00:00, 89.2MB/s]
(RayTrainWorker pid=1584) 
Downloading:  98%|█████████▊| 409M/416M [00:04<00:00, 89.9MB/s]
Downloading: 100%|██████████| 416M/416M [00:04<00:00, 89.9MB/s]
(RayTrainWorker pid=1584) Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias']
(RayTrainWorker pid=1584) - This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
(RayTrainWorker pid=1584) - This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
(RayTrainWorker pid=1584) Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
(RayTrainWorker pid=1584) You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
(RayTrainWorker pid=1584) /home/ray/kuberay/kuberay-ray-kueue-example/ray-operator/config/samples/pytorch-text-classifier/fine-tune-pytorch-text-classifier.py:28: FutureWarning: load_metric is deprecated and will be removed in the next major version of datasets. Use 'evaluate.load' instead, from the new library 🤗 Evaluate: https://huggingface.co/docs/evaluate
(RayTrainWorker pid=1584)   self.metric = load_metric("glue", "cola")
(RayTrainWorker pid=1585) Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias']
(RayTrainWorker pid=1584) 
Downloading builder script:   0%|          | 0.00/1.84k [00:00<?, ?B/s]
Downloading builder script: 5.76kB [00:00, 17.4MB/s]                   
(RayTrainWorker pid=1584) GPU available: True, used: True
(RayTrainWorker pid=1584) TPU available: False, using: 0 TPU cores
(RayTrainWorker pid=1584) IPU available: False, using: 0 IPUs
(RayTrainWorker pid=1584) HPU available: False, using: 0 HPUs
(RayTrainWorker pid=1584) Missing logger folder: /home/ray/ray_results/ptl-sent-classification/TorchTrainer_88684_00000_0_2024-01-30_07-45-03/lightning_logs
(RayTrainWorker pid=1585) LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]
(RayTrainWorker pid=1584) 
(RayTrainWorker pid=1584)   | Name  | Type                          | Params
(RayTrainWorker pid=1584) --------------------------------------------------------
(RayTrainWorker pid=1584) 0 | model | BertForSequenceClassification | 108 M 
(RayTrainWorker pid=1584) --------------------------------------------------------
(RayTrainWorker pid=1584) 108 M     Trainable params
(RayTrainWorker pid=1584) 0         Non-trainable params
(RayTrainWorker pid=1584) 108 M     Total params
(RayTrainWorker pid=1584) 433.247   Total estimated model params size (MB)
(SplitCoordinator pid=1755) Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(tokenize_sentence)] -> OutputSplitter[split(2, equal=True)]
(SplitCoordinator pid=1755) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), exclude_resources=ExecutionResources(cpu=1.0, gpu=2.0, object_store_memory=0.0), locality_with_output=['e3a36cb35ddf7b9f7674a3b43415c8eca5c7a91aece18a2d47251cbc', 'e3a36cb35ddf7b9f7674a3b43415c8eca5c7a91aece18a2d47251cbc'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
(SplitCoordinator pid=1755) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`

(pid=1755) Running 0:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1755) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 0.0 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1755) Running: 1.0/7.0 CPU, 0.0/0.0 GPU, 0.06 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1755) Running: 1.0/7.0 CPU, 0.0/0.0 GPU, 3.13 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:02<?, ?it/s]
(pid=1755) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 3.08 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:02<?, ?it/s]
(pid=1755) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 3.07 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:02<?, ?it/s]
(pid=1755) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 3.07 MiB/1.37 GiB object_store_memory: 100%|██████████| 1/1 [00:02<00:00,  2.03s/it]
(pid=1755) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 3.07 MiB/1.37 GiB object_store_memory:  50%|█████     | 1/2 [00:02<00:02,  2.03s/it]
                                                                                                                                  

(pid=1754) Running 0:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1754) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 0.0 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1754) Running: 1.0/7.0 CPU, 0.0/0.0 GPU, 0.46 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1754) Running: 1.0/7.0 CPU, 0.0/0.0 GPU, 25.68 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1754) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 25.21 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1754) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 25.21 MiB/1.37 GiB object_store_memory: 100%|██████████| 1/1 [00:00<00:00,  1.62it/s]
(pid=1754) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 25.21 MiB/1.37 GiB object_store_memory:  50%|█████     | 1/2 [00:00<00:00,  1.62it/s]
                                                                                                                                   
(RayTrainWorker pid=1585) [W reducer.cpp:1300] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
(RayTrainWorker pid=1585) - This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
(RayTrainWorker pid=1585) - This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
(RayTrainWorker pid=1585) Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
(RayTrainWorker pid=1585) You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
(RayTrainWorker pid=1585) /home/ray/kuberay/kuberay-ray-kueue-example/ray-operator/config/samples/pytorch-text-classifier/fine-tune-pytorch-text-classifier.py:28: FutureWarning: load_metric is deprecated and will be removed in the next major version of datasets. Use 'evaluate.load' instead, from the new library 🤗 Evaluate: https://huggingface.co/docs/evaluate
(RayTrainWorker pid=1585)   self.metric = load_metric("glue", "cola")
(RayTrainWorker pid=1585) Missing logger folder: /home/ray/ray_results/ptl-sent-classification/TorchTrainer_88684_00000_0_2024-01-30_07-45-03/lightning_logs
(RayTrainWorker pid=1584) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
(SplitCoordinator pid=1755) Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(tokenize_sentence)] -> OutputSplitter[split(2, equal=True)] [repeated 2x across cluster]
(SplitCoordinator pid=1755) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), exclude_resources=ExecutionResources(cpu=1.0, gpu=2.0, object_store_memory=0.0), locality_with_output=['e3a36cb35ddf7b9f7674a3b43415c8eca5c7a91aece18a2d47251cbc', 'e3a36cb35ddf7b9f7674a3b43415c8eca5c7a91aece18a2d47251cbc'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False) [repeated 2x across cluster]
(SplitCoordinator pid=1755) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True` [repeated 2x across cluster]
(RayTrainWorker pid=1584) [W reducer.cpp:1300] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())

(pid=1755) Running 0:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1755) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 0.0 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1755) Running: 1.0/7.0 CPU, 0.0/0.0 GPU, 3.13 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1755) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 3.08 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1755) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 3.08 MiB/1.37 GiB object_store_memory: 100%|██████████| 1/1 [00:00<00:00,  9.68it/s]
(pid=1755) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 1.54 MiB/1.37 GiB object_store_memory: 100%|██████████| 1/1 [00:00<00:00,  9.68it/s]
(pid=1755) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 1.54 MiB/1.37 GiB object_store_memory:  50%|█████     | 1/2 [00:00<00:00,  9.68it/s]
                                                                                                                                  
(RayTrainWorker pid=1585) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/ray/ray_results/ptl-sent-classification/TorchTrainer_88684_00000_0_2024-01-30_07-45-03/checkpoint_000000)

Training finished iteration 1 at 2024-01-30 07:47:24. Total running time: 2min 21s
╭──────────────────────────────────────────╮
│ Training result                          │
├──────────────────────────────────────────┤
│ checkpoint_dir_name    checkpoint_000000 │
│ time_this_iter_s               129.66612 │
│ time_total_s                   129.66612 │
│ training_iteration                     1 │
│ epoch                                  0 │
│ matthews_correlation             0.52371 │
│ step                                 268 │
│ train_loss                        0.2402 │
╰──────────────────────────────────────────╯
Training saved a checkpoint for iteration 1 at: (local)/home/ray/ray_results/ptl-sent-classification/TorchTrainer_88684_00000_0_2024-01-30_07-45-03/checkpoint_000000
(RayTrainWorker pid=1584) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/ray/ray_results/ptl-sent-classification/TorchTrainer_88684_00000_0_2024-01-30_07-45-03/checkpoint_000000)

(pid=1754) Running 0:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1754) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 0.0 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
                                                                                                                         
(SplitCoordinator pid=1754) Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(tokenize_sentence)] -> OutputSplitter[split(2, equal=True)]

(pid=1754) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 0.0 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
                                                                                                                         
(SplitCoordinator pid=1754) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), exclude_resources=ExecutionResources(cpu=1.0, gpu=2.0, object_store_memory=0.0), locality_with_output=['e3a36cb35ddf7b9f7674a3b43415c8eca5c7a91aece18a2d47251cbc', 'e3a36cb35ddf7b9f7674a3b43415c8eca5c7a91aece18a2d47251cbc'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)

(pid=1754) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 0.0 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
                                                                                                                         
(SplitCoordinator pid=1754) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`

(pid=1754) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 0.0 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1754) Running: 1.0/7.0 CPU, 0.0/0.0 GPU, 0.46 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1754) Running: 1.0/7.0 CPU, 0.0/0.0 GPU, 25.68 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1754) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 25.21 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1754) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 25.21 MiB/1.37 GiB object_store_memory: 100%|██████████| 1/1 [00:00<00:00,  1.50it/s]
(pid=1754) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 12.61 MiB/1.37 GiB object_store_memory: 100%|██████████| 1/1 [00:00<00:00,  1.50it/s]
(pid=1754) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 12.61 MiB/1.37 GiB object_store_memory:  50%|█████     | 1/2 [00:00<00:00,  1.50it/s]
                                                                                                                                   
(SplitCoordinator pid=1755) Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(tokenize_sentence)] -> OutputSplitter[split(2, equal=True)]
(SplitCoordinator pid=1755) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), exclude_resources=ExecutionResources(cpu=1.0, gpu=2.0, object_store_memory=0.0), locality_with_output=['e3a36cb35ddf7b9f7674a3b43415c8eca5c7a91aece18a2d47251cbc', 'e3a36cb35ddf7b9f7674a3b43415c8eca5c7a91aece18a2d47251cbc'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
(SplitCoordinator pid=1755) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`

(pid=1755) Running 0:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1755) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 0.0 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1755) Running: 1.0/7.0 CPU, 0.0/0.0 GPU, 3.13 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1755) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 3.08 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1755) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 3.07 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1755) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 3.07 MiB/1.37 GiB object_store_memory: 100%|██████████| 1/1 [00:00<00:00,  4.80it/s]
(pid=1755) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 3.07 MiB/1.37 GiB object_store_memory:  50%|█████     | 1/2 [00:00<00:00,  4.80it/s]
                                                                                                                                  
(RayTrainWorker pid=1585) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/ray/ray_results/ptl-sent-classification/TorchTrainer_88684_00000_0_2024-01-30_07-45-03/checkpoint_000001)

Training finished iteration 2 at 2024-01-30 07:49:19. Total running time: 4min 16s
╭──────────────────────────────────────────╮
│ Training result                          │
├──────────────────────────────────────────┤
│ checkpoint_dir_name    checkpoint_000001 │
│ time_this_iter_s               115.05276 │
│ time_total_s                   244.71888 │
│ training_iteration                     2 │
│ epoch                                  1 │
│ matthews_correlation              0.5519 │
│ step                                 536 │
│ train_loss                       0.05616 │
╰──────────────────────────────────────────╯
Training saved a checkpoint for iteration 2 at: (local)/home/ray/ray_results/ptl-sent-classification/TorchTrainer_88684_00000_0_2024-01-30_07-45-03/checkpoint_000001
(RayTrainWorker pid=1584) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/ray/ray_results/ptl-sent-classification/TorchTrainer_88684_00000_0_2024-01-30_07-45-03/checkpoint_000001)

(pid=1754) Running 0:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1754) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 0.0 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
                                                                                                                         
(SplitCoordinator pid=1754) Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(tokenize_sentence)] -> OutputSplitter[split(2, equal=True)]

(pid=1754) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 0.0 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
                                                                                                                         
(SplitCoordinator pid=1754) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), exclude_resources=ExecutionResources(cpu=1.0, gpu=2.0, object_store_memory=0.0), locality_with_output=['e3a36cb35ddf7b9f7674a3b43415c8eca5c7a91aece18a2d47251cbc', 'e3a36cb35ddf7b9f7674a3b43415c8eca5c7a91aece18a2d47251cbc'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)

(pid=1754) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 0.0 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
                                                                                                                         
(SplitCoordinator pid=1754) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`

(pid=1754) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 0.0 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1754) Running: 1.0/7.0 CPU, 0.0/0.0 GPU, 0.46 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1754) Running: 1.0/7.0 CPU, 0.0/0.0 GPU, 25.68 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1754) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 25.21 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1754) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 25.21 MiB/1.37 GiB object_store_memory: 100%|██████████| 1/1 [00:00<00:00,  1.95it/s]
(pid=1754) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 25.21 MiB/1.37 GiB object_store_memory:  50%|█████     | 1/2 [00:00<00:00,  1.95it/s]
                                                                                                                                   
(SplitCoordinator pid=1755) Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(tokenize_sentence)] -> OutputSplitter[split(2, equal=True)]
(SplitCoordinator pid=1755) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), exclude_resources=ExecutionResources(cpu=1.0, gpu=2.0, object_store_memory=0.0), locality_with_output=['e3a36cb35ddf7b9f7674a3b43415c8eca5c7a91aece18a2d47251cbc', 'e3a36cb35ddf7b9f7674a3b43415c8eca5c7a91aece18a2d47251cbc'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
(SplitCoordinator pid=1755) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`

(pid=1755) Running 0:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1755) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 0.0 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1755) Running: 1.0/7.0 CPU, 0.0/0.0 GPU, 3.13 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1755) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 3.08 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1755) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 3.07 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1755) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 3.07 MiB/1.37 GiB object_store_memory: 100%|██████████| 1/1 [00:00<00:00,  9.65it/s]
(pid=1755) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 3.07 MiB/1.37 GiB object_store_memory:  50%|█████     | 1/2 [00:00<00:00,  9.65it/s]
                                                                                                                                  
(RayTrainWorker pid=1585) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/ray/ray_results/ptl-sent-classification/TorchTrainer_88684_00000_0_2024-01-30_07-45-03/checkpoint_000002)

Training finished iteration 3 at 2024-01-30 07:51:15. Total running time: 6min 12s
╭──────────────────────────────────────────╮
│ Training result                          │
├──────────────────────────────────────────┤
│ checkpoint_dir_name    checkpoint_000002 │
│ time_this_iter_s               115.21172 │
│ time_total_s                   359.93061 │
│ training_iteration                     3 │
│ epoch                                  2 │
│ matthews_correlation             0.58089 │
│ step                                 804 │
│ train_loss                       0.01834 │
╰──────────────────────────────────────────╯
(RayTrainWorker pid=1584) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/ray/ray_results/ptl-sent-classification/TorchTrainer_88684_00000_0_2024-01-30_07-45-03/checkpoint_000002)
Training saved a checkpoint for iteration 3 at: (local)/home/ray/ray_results/ptl-sent-classification/TorchTrainer_88684_00000_0_2024-01-30_07-45-03/checkpoint_000002

(pid=1754) Running 0:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1754) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 0.0 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
                                                                                                                         
(SplitCoordinator pid=1754) Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(tokenize_sentence)] -> OutputSplitter[split(2, equal=True)]

(pid=1754) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 0.0 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
                                                                                                                         
(SplitCoordinator pid=1754) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), exclude_resources=ExecutionResources(cpu=1.0, gpu=2.0, object_store_memory=0.0), locality_with_output=['e3a36cb35ddf7b9f7674a3b43415c8eca5c7a91aece18a2d47251cbc', 'e3a36cb35ddf7b9f7674a3b43415c8eca5c7a91aece18a2d47251cbc'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)

(pid=1754) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 0.0 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
                                                                                                                         
(SplitCoordinator pid=1754) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`

(pid=1754) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 0.0 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1754) Running: 1.0/7.0 CPU, 0.0/0.0 GPU, 0.46 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1754) Running: 1.0/7.0 CPU, 0.0/0.0 GPU, 25.68 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1754) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 25.21 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1754) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 25.21 MiB/1.37 GiB object_store_memory: 100%|██████████| 1/1 [00:00<00:00,  1.95it/s]
(pid=1754) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 25.21 MiB/1.37 GiB object_store_memory:  50%|█████     | 1/2 [00:00<00:00,  1.95it/s]
                                                                                                                                   
(SplitCoordinator pid=1755) Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(tokenize_sentence)] -> OutputSplitter[split(2, equal=True)]
(SplitCoordinator pid=1755) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), exclude_resources=ExecutionResources(cpu=1.0, gpu=2.0, object_store_memory=0.0), locality_with_output=['e3a36cb35ddf7b9f7674a3b43415c8eca5c7a91aece18a2d47251cbc', 'e3a36cb35ddf7b9f7674a3b43415c8eca5c7a91aece18a2d47251cbc'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
(SplitCoordinator pid=1755) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`

(pid=1755) Running 0:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1755) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 0.0 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1755) Running: 1.0/7.0 CPU, 0.0/0.0 GPU, 3.13 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1755) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 3.08 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1755) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 3.08 MiB/1.37 GiB object_store_memory: 100%|██████████| 1/1 [00:00<00:00,  9.68it/s]
(pid=1755) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 1.54 MiB/1.37 GiB object_store_memory: 100%|██████████| 1/1 [00:00<00:00,  9.68it/s]
(pid=1755) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 1.54 MiB/1.37 GiB object_store_memory:  50%|█████     | 1/2 [00:00<00:00,  9.68it/s]
                                                                                                                                  
(RayTrainWorker pid=1585) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/ray/ray_results/ptl-sent-classification/TorchTrainer_88684_00000_0_2024-01-30_07-45-03/checkpoint_000003)

Training finished iteration 4 at 2024-01-30 07:53:09. Total running time: 8min 6s
╭──────────────────────────────────────────╮
│ Training result                          │
├──────────────────────────────────────────┤
│ checkpoint_dir_name    checkpoint_000003 │
│ time_this_iter_s                114.2081 │
│ time_total_s                   474.13871 │
│ training_iteration                     4 │
│ epoch                                  3 │
│ matthews_correlation              0.5857 │
│ step                                1072 │
│ train_loss                       0.00919 │
╰──────────────────────────────────────────╯
(RayTrainWorker pid=1584) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/ray/ray_results/ptl-sent-classification/TorchTrainer_88684_00000_0_2024-01-30_07-45-03/checkpoint_000003)
Training saved a checkpoint for iteration 4 at: (local)/home/ray/ray_results/ptl-sent-classification/TorchTrainer_88684_00000_0_2024-01-30_07-45-03/checkpoint_000003

(pid=1754) Running 0:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1754) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 0.0 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
                                                                                                                         
(SplitCoordinator pid=1754) Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(tokenize_sentence)] -> OutputSplitter[split(2, equal=True)]

(pid=1754) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 0.0 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
                                                                                                                         
(SplitCoordinator pid=1754) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), exclude_resources=ExecutionResources(cpu=1.0, gpu=2.0, object_store_memory=0.0), locality_with_output=['e3a36cb35ddf7b9f7674a3b43415c8eca5c7a91aece18a2d47251cbc', 'e3a36cb35ddf7b9f7674a3b43415c8eca5c7a91aece18a2d47251cbc'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)

(pid=1754) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 0.0 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
                                                                                                                         
(SplitCoordinator pid=1754) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`

(pid=1754) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 0.0 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1754) Running: 1.0/7.0 CPU, 0.0/0.0 GPU, 0.46 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1754) Running: 1.0/7.0 CPU, 0.0/0.0 GPU, 25.68 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1754) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 25.21 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1754) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 25.21 MiB/1.37 GiB object_store_memory: 100%|██████████| 1/1 [00:00<00:00,  2.43it/s]
(pid=1754) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 25.21 MiB/1.37 GiB object_store_memory:  50%|█████     | 1/2 [00:00<00:00,  2.43it/s]
                                                                                                                                   
(SplitCoordinator pid=1755) Executing DAG InputDataBuffer[Input] -> TaskPoolMapOperator[MapBatches(tokenize_sentence)] -> OutputSplitter[split(2, equal=True)]
(SplitCoordinator pid=1755) Execution config: ExecutionOptions(resource_limits=ExecutionResources(cpu=None, gpu=None, object_store_memory=None), exclude_resources=ExecutionResources(cpu=1.0, gpu=2.0, object_store_memory=0.0), locality_with_output=['e3a36cb35ddf7b9f7674a3b43415c8eca5c7a91aece18a2d47251cbc', 'e3a36cb35ddf7b9f7674a3b43415c8eca5c7a91aece18a2d47251cbc'], preserve_order=False, actor_locality_enabled=True, verbose_progress=False)
(SplitCoordinator pid=1755) Tip: For detailed progress reporting, run `ray.data.DataContext.get_current().execution_options.verbose_progress = True`

(pid=1755) Running 0:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1755) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 0.0 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1755) Running: 1.0/7.0 CPU, 0.0/0.0 GPU, 3.13 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1755) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 3.08 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1755) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 3.07 MiB/1.37 GiB object_store_memory:   0%|          | 0/1 [00:00<?, ?it/s]
(pid=1755) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 3.07 MiB/1.37 GiB object_store_memory: 100%|██████████| 1/1 [00:00<00:00,  9.65it/s]
(pid=1755) Running: 0.0/7.0 CPU, 0.0/0.0 GPU, 3.07 MiB/1.37 GiB object_store_memory:  50%|█████     | 1/2 [00:00<00:00,  9.65it/s]
                                                                                                                                  
(RayTrainWorker pid=1585) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/ray/ray_results/ptl-sent-classification/TorchTrainer_88684_00000_0_2024-01-30_07-45-03/checkpoint_000004)

Training finished iteration 5 at 2024-01-30 07:55:04. Total running time: 10min 1s
╭──────────────────────────────────────────╮
│ Training result                          │
├──────────────────────────────────────────┤
│ checkpoint_dir_name    checkpoint_000004 │
│ time_this_iter_s                114.3761 │
│ time_total_s                   588.51481 │
│ training_iteration                     5 │
│ epoch                                  4 │
│ matthews_correlation             0.60427 │
│ step                                1340 │
│ train_loss                       0.02815 │
╰──────────────────────────────────────────╯
(RayTrainWorker pid=1584) Checkpoint successfully created at: Checkpoint(filesystem=local, path=/home/ray/ray_results/ptl-sent-classification/TorchTrainer_88684_00000_0_2024-01-30_07-45-03/checkpoint_000004)
Training saved a checkpoint for iteration 5 at: (local)/home/ray/ray_results/ptl-sent-classification/TorchTrainer_88684_00000_0_2024-01-30_07-45-03/checkpoint_000004

Training completed after 5 iterations at 2024-01-30 07:55:20. Total running time: 10min 17s

Result(
  metrics={'train_loss': 0.028154587373137474, 'matthews_correlation': 0.6042729119797927, 'epoch': 4, 'step': 1340},
  path='/home/ray/ray_results/ptl-sent-classification/TorchTrainer_88684_00000_0_2024-01-30_07-45-03',
  filesystem='local',
  checkpoint=Checkpoint(filesystem=local, path=/home/ray/ray_results/ptl-sent-classification/TorchTrainer_88684_00000_0_2024-01-30_07-45-03/checkpoint_000004)
)
2024-01-30 07:55:22,855	SUCC cli.py:60 -- ---------------------------------------------------
2024-01-30 07:55:22,855	SUCC cli.py:61 -- Job 'pytorch-text-classifier-pdk2c-kc2x8' succeeded
2024-01-30 07:55:22,855	SUCC cli.py:62 -- ---------------------------------------------------

- name: kuberay-repo
  mountPath: /home/ray/kuberay
containers:
- name: ray-head
Contributor Author

This RayJob uses a single-node Ray cluster to avoid configuring shared storage for checkpointing. Otherwise I would need to configure something like GCSFuse, which would not work in every environment.
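
Concretely, "single-node" here means both training workers and both GPUs land on the head pod, so every checkpoint is written to one local filesystem under /home/ray/ray_results. A rough sketch of what that looks like in the rayClusterSpec; the image and resource values are assumptions based on the log above:

rayClusterSpec:
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"
    template:
      spec:
        containers:
        - name: ray-head
          image: rayproject/ray-ml:2.9.0-gpu   # assumed image
          resources:
            limits:
              nvidia.com/gpu: "2"              # the log shows two training workers sharing one node
  # no workerGroupSpecs (or a worker group with replicas: 0), so no shared storage is required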

@andrewsykim
Contributor Author

cc @architkulkarni @kevin85421

@@ -0,0 +1,153 @@
import ray
Member

I prefer to embed the Python script in the YAML file to maintain consistency between them. Ensuring the same commit for both the YAML and the Python file requires specifying the commit in the YAML's runtime environment. However, it's easy to forget to update the runtime environment whenever the Python script is modified.
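
For reference, the embed-in-YAML pattern described here usually follows the existing KubeRay samples: the training script lives in a ConfigMap in the same file and is mounted into the head pod, so the YAML and the Python always move together. A trimmed, illustrative sketch (names and paths are not from this PR):

apiVersion: v1
kind: ConfigMap
metadata:
  name: fine-tune-code             # hypothetical name
data:
  fine_tune.py: |
    import ray
    # ...training script body embedded here, versioned together with the manifest

The RayJob's head-pod template would then mount this ConfigMap (for example at /home/ray/samples) and the entrypoint would run python /home/ray/samples/fine_tune.py, so the script cannot drift from the YAML that ships it.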

Contributor Author

Should we just stick to the ConfigMap route then? It's easier, but I don't want users thinking this is the recommended way to upload source for Ray jobs.

Contributor Author

I'm inclined to keep the script separate because:

  1. I like how the Python script is versioned in a .py file instead of being embedded in the YAML.
  2. The sample shows how to use working_dir in the runtime environment (sketched below).
  3. This is more in line with how users will actually use RayJob (I imagine no one is adding their job code in a ConfigMap).
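
The working_dir pattern from item 2 would look roughly like the following in the RayJob spec. The archive URL and pinned commit are placeholders rather than the values this sample actually uses; the script path is taken from the log output above:

spec:
  entrypoint: python ray-operator/config/samples/pytorch-text-classifier/fine-tune-pytorch-text-classifier.py
  runtimeEnvYAML: |
    working_dir: "https://github.com/ray-project/kuberay/archive/<commit-sha>.zip"   # placeholder; pin to a specific commit
    pip:
      - datasets
      - transformers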

Member

I don't want users thinking this is the recommended way to upload source for Ray jobs

This makes sense to me. I will check with the Ray team to see if there is a repository where we can upload Ray example scripts.

Member

I checked with the Ray team, but we didn't reach a consensus on creating a new repository for the Python scripts. Hence, we can initially put them in the KubeRay repository. However, since the KubeRay repository is 186 MB, downloading the script as part of the runtime environment may take some time.

@andrewsykim andrewsykim force-pushed the ray-kueue-example branch 2 times, most recently from fc9ef9e to 376e70c on January 31, 2024 04:15

Add a sample RayJob to fine-tune a PyTorch lightning text classifier with Ray Data

Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
@@ -0,0 +1,41 @@
# This RayJob is based on the "Fine-tune a PyTorch Lightning Text Classifier with Ray Data" example in the Ray documentation.
# See https://docs.ray.io/en/master/train/examples/lightning/lightning_cola_advanced.html for more details.
apiVersion: ray.io/v1alpha1
Contributor Author

@kevin85421 added a v1alpha1 RayJob as well, since Kueue does not fully support v1 until kubernetes-sigs/kueue#1435 is merged.

@kevin85421 kevin85421 merged commit ceb9f01 into ray-project:master Jan 31, 2024
23 checks passed