# Training

This tutorial provides a step-by-step guide to training a model on a dataset.

There are three steps involved in training a model:

1. Tokenizing and loading the data.

2. Creating the model.

3. Training the model.

We will explore each of these steps in detail.

## Tokenizing and loading the data

The first step is to either tokenize a new dataset or load a previously tokenized one. We use the `LMDatasetProvider` class with a dataset from the [Hugging Face datasets library](https://huggingface.co/datasets).

In [1]:
import os
from phyagi.datasets.train.lm.lm_dataset_provider import LMDatasetProvider

os.environ["WANDB_MODE"] = "disabled"
dataset_provider = LMDatasetProvider.from_hub(
    dataset_path="wikitext",
    dataset_name="wikitext-2-raw-v1",
    tokenizer="gpt2",
    cache_dir="cache",
    seq_len=1024,
)

[phyagi] [2025-05-19 10:45:15,798] [INFO] [lm_dataset_provider.py:252:from_hub] Loading non-tokenized dataset...
[phyagi] [2025-05-19 10:45:18,000] [INFO] [lm_dataset_provider.py:285:from_hub] Creating validation split (if necessary)...


Processing dataset with shared memory...:   0%|          | 0/5 [00:00<?, ? examples/s]

Processing dataset with shared memory...:   0%|          | 0/37 [00:00<?, ? examples/s]

Processing dataset with shared memory...:   0%|          | 0/4 [00:00<?, ? examples/s]

[phyagi] [2025-05-19 10:45:19,442] [INFO] [lm_dataset_provider.py:309:from_hub] Saving tokenized dataset: cache


Once the data has been tokenized, a set of NumPy arrays is written to the `cache` directory. If the data has already been tokenized, you can load it directly from that directory with `dataset_provider = LMDatasetProvider.from_cache("cache")`.

These arrays hold contiguous blocks of tokenized data that can be loaded into memory quickly and split into sequences of different lengths. With the dataset provider prepared, you can obtain the training and validation datasets by calling the `get_train_dataset` and `get_val_dataset` methods, as shown below:

In [2]:
train_dataset = dataset_provider.get_train_dataset()
val_dataset = dataset_provider.get_val_dataset()
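
To illustrate why a flat token array is convenient, the sketch below splits one contiguous array into fixed-length sequences for two different values of `seq_len`. It is only an illustration: the helper `split_into_sequences` is hypothetical, and dropping the trailing remainder is an assumption, not necessarily what `LMDatasetProvider` does internally.

```python
import numpy as np


def split_into_sequences(tokens: np.ndarray, seq_len: int) -> np.ndarray:
    """Reshape a flat token array into rows of `seq_len` tokens.

    Hypothetical helper: trailing tokens that do not fill a whole
    sequence are dropped (an assumption made for this sketch).
    """
    n_sequences = len(tokens) // seq_len
    return tokens[: n_sequences * seq_len].reshape(n_sequences, seq_len)


# Ten fake token ids standing in for a tokenized corpus
tokens = np.arange(10, dtype=np.uint16)

# The same array can be viewed with different sequence lengths
sequences_of_4 = split_into_sequences(tokens, seq_len=4)
sequences_of_5 = split_into_sequences(tokens, seq_len=5)

print(sequences_of_4.shape)  # (2, 4)
print(sequences_of_5.shape)  # (2, 5)
```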

## Creating the model

After loading the datasets, you can create a model using a configuration object, which is similar to the Hugging Face API. For example:

In [3]:
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(n_layer=4, n_embd=192)
model = GPT2LMHeadModel(config)

This example uses the default configuration, apart from the number of layers (`n_layer`) and the embedding dimension (`n_embd`). You can change other settings by passing different arguments to `GPT2Config`.
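
To get a feel for how these two arguments affect model size, the following sketch estimates the number of trainable parameters of a GPT-2 style model. It assumes the standard GPT-2 layout (MLP hidden size of `4 * n_embd`, biases on all linear layers, input and output embeddings tied) and the `GPT2Config` defaults `vocab_size=50257` and `n_positions=1024`. The function `gpt2_param_count` is a hypothetical helper, not part of any library.

```python
def gpt2_param_count(
    n_layer: int,
    n_embd: int,
    vocab_size: int = 50257,
    n_positions: int = 1024,
) -> int:
    """Estimate trainable parameters of a GPT-2 style model with tied embeddings."""
    # Token embeddings plus learned position embeddings
    embeddings = vocab_size * n_embd + n_positions * n_embd
    # Per block: attention (4*d^2 + 4*d), MLP with 4*d hidden units (8*d^2 + 5*d),
    # and two LayerNorms (4*d)
    per_block = 12 * n_embd**2 + 13 * n_embd
    # Final LayerNorm (weight and bias); the output head reuses the token embeddings
    final_layer_norm = 2 * n_embd
    return embeddings + n_layer * per_block + final_layer_norm


n_params = gpt2_param_count(n_layer=4, n_embd=192)
print(f"{n_params:,}")  # 11,625,792
```

Most of the parameters in such a small model sit in the token embedding matrix. You can compare the estimate with `sum(p.numel() for p in model.parameters())` on the model created above.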
   
## Training the model  
   
With the data prepared and the model created, we can now train the model. PhyAGI offers three training classes: `DsTrainer`, `HfTrainer` and `PlTrainer`. `DsTrainer` uses DeepSpeed, `HfTrainer` relies on Hugging Face's `transformers.Trainer` class, and `PlTrainer` uses PyTorch Lightning.

### DeepSpeed

In [None]:
from phyagi.trainers.ds.ds_trainer import DsTrainer
from phyagi.trainers.ds.ds_training_args import DsTrainingArguments

training_args = DsTrainingArguments("gpt2-wt103", max_steps=1)
trainer = DsTrainer(
    model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)
trainer.train()

[2025-05-19 10:45:20,528] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed info: version=0.16.7, git-hash=unknown, git-branch=unknown
[2025-05-19 10:45:20,529] [INFO] [config.py:735:__init__] Config mesh_device None world_size = 1
[2025-05-19 10:45:20,588] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2025-05-19 10:45:20,589] [INFO] [logging.py:107:log_dist] [Rank 0] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2025-05-19 10:45:20,590] [INFO] [logging.py:107:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2025-05-19 10:45:20,591] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedAdam
[2025-05-19 10:45:20,592] [INFO] [logging.py:107:log_dist] [Rank 0] Creating fp16 optimizer with dynamic loss scale
[2025-05-19 10:45:20,597] [INFO] [logging.py:107:log_dist] [Rank 0] DeepSpeed Final Optimizer = FP16_Optimizer
[2025-05-19 10:45:20,600] [INFO] [logging.py:107:log_dist] [Ra

  0%|          | 0/1 [00:00<?, ?it/s]

[phyagi] [2025-05-19 10:45:28,362] [INFO] [ds_trainer.py:761:train] {'step/step_runtime': 7.674737930297852, 'step/samples_per_second': 133.42475134655956, 'step/samples_per_second_per_gpu': 133.42475134655956, 'step/tokens_per_second': 133.42475134655956, 'step/tokens_per_second_per_gpu': 133.42475134655956, 'step/tflops': 0.007211353769554979, 'step/tflops_per_gpu': 0.007211353769554979, 'step/mfu': 2.3284965352131026e-05, 'train/total_runtime': 7.676496267318726, 'train/progress': 1.0, 'train/epoch': 0.5, 'train/step': 1, 'train/loss': 10.843445777893066, 'train/loss_scale': 4096, 'train/gradient_norm': 2.3035886052062624, 'train/ppl': 51197.48956505967, 'train/learning_rate': 0.0, 'train/batch_size': 1024, 'train/n_samples': 1024, 'train/n_tokens': 1024, 'train/step_runtime': 7.674936771392822, 'train/samples_per_second': 133.42129459838768, 'train/samples_per_second_per_gpu': 133.42129459838768, 'train/tokens_per_second': 133.42129459838768, 'train/tokens_per_second_per_gpu': 133.

100%|██████████| 1/1 [00:07<00:00,  7.68s/it]

[phyagi] [2025-05-19 10:45:28,365] [INFO] [ds_trainer.py:790:train] Training done.
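
As a quick sanity check on the logged metrics, the reported `train/ppl` is the exponential of `train/loss`, and the loss of a freshly initialized model typically sits close to the natural logarithm of the vocabulary size, which is the loss of a uniform guess over all tokens. The sketch below uses the values copied from the log above and assumes the default GPT-2 vocabulary size of 50257.

```python
import math

train_loss = 10.843445777893066  # 'train/loss' copied from the log above
logged_ppl = 51197.48956505967  # 'train/ppl' copied from the log above
vocab_size = 50257  # assumption: default GPT-2 vocabulary size

# Perplexity is the exponential of the cross-entropy loss
ppl = math.exp(train_loss)

# Cross-entropy of a uniform distribution over the vocabulary
uniform_loss = math.log(vocab_size)

print(f"exp(loss) = {ppl:.2f}, logged ppl = {logged_ppl:.2f}")
print(f"loss of a uniform guess = {uniform_loss:.4f}")
```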





DeepSpeed uses the `deepspeed` launcher for multi-GPU training. For instance, to train the model with 4 GPUs, you can execute the following command:
   
```bash  
deepspeed --num_gpus=4 train.py  
```  
   
Make sure to replace `train.py` with the appropriate Python script containing your training code.

### Hugging Face

In [None]:
from transformers import TrainingArguments
from phyagi.trainers.hf.hf_trainer import HfTrainer

# Re-initialize the model so that training does not continue from the previous run
model = GPT2LMHeadModel(config)

training_args = TrainingArguments("gpt2-wt103", max_steps=1)
trainer = HfTrainer(
    model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)
trainer.train()

Step,Training Loss


TrainOutput(global_step=1, training_loss=10.859561920166016, metrics={'train_runtime': 0.5342, 'train_samples_per_second': 14.976, 'train_steps_per_second': 1.872, 'total_flos': 87482695680.0, 'train_loss': 10.859561920166016, 'epoch': 0.003389830508474576})

The Hugging Face trainer uses the `torchrun` launcher for multi-GPU training. For instance, to train the model with 4 GPUs, you can execute the following command:
   
```bash  
torchrun --nproc-per-node=4 train.py  
```  
   
Make sure to replace `train.py` with the appropriate Python script containing your training code.
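
When a script is started with `torchrun`, each process receives its position in the job through the `RANK`, `LOCAL_RANK` and `WORLD_SIZE` environment variables. The sketch below reads them with single-process defaults, which can help when debugging a launch. The helper `get_dist_info` is hypothetical, and whether the PhyAGI trainers read these variables themselves is not covered in this tutorial.

```python
import os
from typing import Dict


def get_dist_info() -> Dict[str, int]:
    """Read the process layout set by the launcher (hypothetical helper).

    Falls back to a single-process layout when the variables are absent,
    e.g. when the script is started with plain `python`.
    """
    return {
        "rank": int(os.environ.get("RANK", 0)),
        "local_rank": int(os.environ.get("LOCAL_RANK", 0)),
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),
    }


dist_info = get_dist_info()
print(dist_info)
```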

### PyTorch Lightning

In [None]:
from phyagi.trainers.pl.pl_trainer import PlTrainer
from phyagi.trainers.pl.pl_training_args import PlTrainingArguments, PlTrainerArguments, PlLightningModuleArguments

# Re-initialize the model so that training does not continue from the previous run
model = GPT2LMHeadModel(config)

training_args = PlTrainingArguments(
    "gpt2-wt103",
    trainer=PlTrainerArguments(max_steps=1),
    lightning_module=PlLightningModuleArguments(optimizer={"type": "adamw"}),
)
trainer = PlTrainer(
    model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)
trainer.train()

INFO: [rank: 0] Seed set to 42
INFO:lightning.fabric.utilities.seed:[rank: 0] Seed set to 42
INFO: Trainer will use only 1 of 4 GPUs because it is running inside an interactive / notebook environment. You may try to set `Trainer(devices=4)` but please note that multi-GPU inside interactive / notebook environments is considered experimental and unstable. Your mileage may vary.
INFO:lightning.pytorch.utilities.rank_zero:Trainer will use only 1 of 4 GPUs because it is running inside an interactive / notebook environment. You may try to set `Trainer(devices=4)` but please note that multi-GPU inside interactive / notebook environments is considered experimental and unstable. Your mileage may vary.
INFO: GPU available: True (cuda), used: True
INFO:lightning.pytorch.utilities.rank_zero:GPU available: True (cuda), used: True
INFO: TPU available: False, using: 0 TPU cores
INFO:lightning.pytorch.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO: HPU available: False, using: 0 HPU

Sanity Checking: |          | 0/? [00:00<?, ?it/s]

/home/gderosa/miniconda3/envs/phyagisdk/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:476: Your `val_dataloader`'s sampler has shuffling enabled, it is strongly recommended that you turn shuffling off for val/test dataloaders.
/home/gderosa/miniconda3/envs/phyagisdk/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.
/home/gderosa/miniconda3/envs/phyagisdk/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=127` in the `DataLoader` to improve performance.


Training: |          | 0/? [00:00<?, ?it/s]

INFO: `Trainer.fit` stopped: `max_steps=1` reached.
INFO:lightning.pytorch.utilities.rank_zero:`Trainer.fit` stopped: `max_steps=1` reached.


PyTorch Lightning does not require a separate launcher for multi-GPU training. For instance, to train the model with 4 GPUs, you can execute the following command:
   
```bash  
python train.py  
```  
   
Make sure to replace `train.py` with the appropriate Python script containing your training code.

# Customizing the training

PhyAGI is designed to be flexible and customizable, and training-related components are no exception. It also provides a `DsTrainer` class designed to be used with [DeepSpeed](https://www.deepspeed.ai/). In this tutorial, we will show you how to customize the training process with the `DsTrainer` class.

Please note that the approach depicted below is extensible to any trainer in PhyAGI, such as `HfTrainer` and `DDLTrainer`.

## Overriding the DeepSpeed trainer

As mentioned before, this tutorial will guide you in overriding the `DsTrainer` class. PhyAGI attempts to keep a unified structure between its trainers, following the style provided by the Hugging Face API. This means that the `DsTrainer` class exposes methods such as `save_checkpoint`, `load_checkpoint`, `train_step` and `evaluate_step`.

In this example, we override the trainer's `train_step` method to customize the training loop. `train_step` is called once per batch by `train`, the trainer's main training loop, and is responsible for the forward and backward passes of the model.

In [None]:
from typing import Any, Callable, Dict, Iterable, Iterator, List, Optional, Tuple, Union

import torch
from torch.optim.lr_scheduler import LRScheduler
from torch.optim.optimizer import Optimizer
from torch.utils.data import Dataset, Sampler

from phyagi.trainers.ds.ds_trainer import DsTrainer
from phyagi.trainers.ds.ds_trainer_callback import DsTrainerCallback
from phyagi.trainers.ds.ds_training_args import DsTrainingArguments


class MyDsTrainer(DsTrainer):
    def __init__(
        self,
        model: torch.nn.Module,
        args: Optional[DsTrainingArguments] = None,
        data_collator: Optional[Callable] = None,
        sampler: Optional[Sampler] = None,
        train_dataset: Optional[Dataset] = None,
        eval_dataset: Optional[Dataset] = None,
        model_parameters: Optional[Union[Iterable[torch.Tensor], Dict[str, torch.Tensor]]] = None,
        mpu: Optional[Any] = None,
        dist_init_required: Optional[bool] = None,
        dist_timeout: int = 1800,
        callbacks: Optional[List[DsTrainerCallback]] = None,
        optimizers: Optional[Tuple[Optimizer, LRScheduler]] = None,
    ) -> None:
        super().__init__(
            model,
            args=args,
            data_collator=data_collator,
            sampler=sampler,
            train_dataset=train_dataset,
            eval_dataset=eval_dataset,
            model_parameters=model_parameters,
            mpu=mpu,
            dist_init_required=dist_init_required,
            dist_timeout=dist_timeout,
            callbacks=callbacks,
            optimizers=optimizers,
        )

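    def train_step(self, train_iterator: Iterator) -> torch.Tensor:
        # Number of micro-batches that make up one optimizer update
        gradient_accumulation_steps = self.engine.gradient_accumulation_steps()
        total_loss = 0.0

        for _ in range(gradient_accumulation_steps):
            input_ids, labels = next(train_iterator)

            # Move the micro-batch to the device used by the DeepSpeed engine
            input_ids = input_ids.to(self.engine.device)
            labels = labels.to(self.engine.device)

            # Forward pass; the first output is the loss
            outputs = self.engine(input_ids, labels=labels)
            loss = outputs[0].mean()

            # Backward pass; the engine is stepped after every micro-batch and
            # handles gradient accumulation boundaries internally
            self.engine.backward(loss)
            self.engine.step()

            # Detach so the running total does not keep a reference to the graph
            total_loss += loss.detach()

        # Mean loss over the accumulated micro-batches
        return total_loss / gradient_accumulation_steps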

This trainer is used in the same way as in the training section above; the only difference is that `MyDsTrainer` replaces `DsTrainer`. As mentioned before, every method of the `DsTrainer` class can be overridden and customized according to your needs. This is only a simple example that illustrates how to override the `train_step` method.
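
To make the call structure concrete without requiring DeepSpeed, here is a toy stand-in for the pattern: a base class whose `train` loop delegates each step to `train_step`, and a subclass that overrides only that method. `ToyTrainer` and `AccumulatingTrainer` are hypothetical names used for illustration, and plain floats play the role of per-batch losses. This is not how `DsTrainer` is implemented.

```python
from typing import Iterator, List


class ToyTrainer:
    """Hypothetical stand-in for a trainer; not the PhyAGI implementation."""

    def __init__(self, batches: List[float], max_steps: int, accumulation_steps: int = 1) -> None:
        self.batches = batches
        self.max_steps = max_steps
        self.accumulation_steps = accumulation_steps

    def train_step(self, train_iterator: Iterator[float]) -> float:
        # Default behaviour: consume one micro-batch and report its "loss"
        return next(train_iterator)

    def train(self) -> List[float]:
        # The main loop owns the iterator and delegates each step to `train_step`
        train_iterator = iter(self.batches)
        return [self.train_step(train_iterator) for _ in range(self.max_steps)]


class AccumulatingTrainer(ToyTrainer):
    def train_step(self, train_iterator: Iterator[float]) -> float:
        # Overridden step: consume several micro-batches and return their mean
        # loss, mirroring the shape of the accumulation loop in `MyDsTrainer`
        losses = [next(train_iterator) for _ in range(self.accumulation_steps)]
        return sum(losses) / len(losses)


trainer = AccumulatingTrainer([4.0, 2.0, 3.0, 1.0], max_steps=2, accumulation_steps=2)
step_losses = trainer.train()
print(step_losses)  # [3.0, 2.0]
```

Because `train` only depends on the `train_step` signature, the subclass changes what happens per step without touching the outer loop.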