Example Code from Section 4: SFT

## 4.1 Using HuggingFace Models
Loading a HuggingFace model and tokenizer.

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Math-1.5B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Math-1.5B")

**Forward pass.** We run a forward pass on a batch of input IDs and get the logits with the .logits attribute of the output. Then, we can compute the loss between the model's predicted logits and the actual labels.

In [None]:
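import torch.nn.functional as F

input_ids = train_batch["input_ids"].to(device)
labels = train_batch["labels"].to(device)

# logits has shape (batch_size, sequence_length, vocab_size): the model returns
# next-token logits for every position in the sequence, not only the last one.
logits = model(input_ids).logits
loss = F.cross_entropy(..., ...)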
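
One possible way to compute the loss from per-position logits is sketched here. It makes two assumptions that are not in the original: random tensors stand in for the real model outputs, and the labels are already shifted so that labels[b, t] is the token that follows input_ids[b, t]. The shapes are illustrative only.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes (assumptions): 2 sequences, 5 positions, 11-token vocabulary.
batch_size, seq_len, vocab_size = 2, 5, 11

torch.manual_seed(0)
# Stand-in for model(input_ids).logits: one row of logits per position.
logits = torch.randn(batch_size, seq_len, vocab_size)
# Stand-in for train_batch["labels"]: assumed to be pre-shifted next-token IDs.
labels = torch.randint(0, vocab_size, (batch_size, seq_len))

# F.cross_entropy expects (N, C) logits and (N,) integer targets, so the
# batch and sequence dimensions are flattened together.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), labels.reshape(-1))

# Equivalent manual computation: mean negative log-probability of each label.
log_probs = F.log_softmax(logits, dim=-1)
manual = -log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1).mean()
```

Both `loss` and `manual` are scalars averaged over every position in the batch. In a real SFT run you would probably also want to mask out prompt and padding positions, which this sketch does not do.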

**Saving a trained model.** Use the .save_pretrained() method to save a model to a directory after training is finished. We recommend saving the tokenizer as well.

In [None]:
model.save_pretrained(save_directory=output_dir)
tokenizer.save_pretrained(save_directory=output_dir)

**Gradient accumulation.** An 80GB GPU does not have enough memory to support reasonable batch sizes. To use larger effective batch sizes, we use gradient accumulation: rather than updating the model weights after every batch, we accumulate gradients over several batches before taking an optimizer step.

In [None]:
gradient_accumulation_steps = 4
for idx, (inputs, labels) in enumerate(data_loader):
    # Forward pass
    logits = model(inputs)
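    # Divide by the number of accumulation steps so that the summed gradients
    # match the gradient of the mean loss over the larger effective batch
    # (assuming loss_fn averages over each batch).
    loss = loss_fn(logits, labels) / gradient_accumulation_steps

    # Backward pass: backward() adds to each parameter's .grad, so gradients
    # accumulate across batches until zero_grad() is called.
    loss.backward()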

    if (idx + 1) % gradient_accumulation_steps == 0:
        # update weights and zero gradients every `gradient_accumulation_steps` batches
        optimizer.step()
        optimizer.zero_grad()
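
The division by gradient_accumulation_steps can be checked on a toy problem. The sketch below is an illustration under assumed settings (a small linear model, MSE loss, and equal-sized microbatches, none of which come from the original): it compares the gradient from one full batch against the gradient accumulated over four microbatches.

```python
import torch

torch.manual_seed(0)
# Hypothetical toy problem: linear regression with 8 examples.
x = torch.randn(8, 3)
y = torch.randn(8, 1)
loss_fn = torch.nn.MSELoss()  # averages over the batch by default


def make_model() -> torch.nn.Linear:
    torch.manual_seed(1)  # same initial weights on every call
    return torch.nn.Linear(3, 1)


# (a) One backward pass over the full batch of 8.
full = make_model()
loss_fn(full(x), y).backward()

# (b) Four microbatches of 2, each loss divided by the number of steps.
gradient_accumulation_steps = 4
accum = make_model()
microbatches = zip(
    x.chunk(gradient_accumulation_steps), y.chunk(gradient_accumulation_steps)
)
for xb, yb in microbatches:
    loss = loss_fn(accum(xb), yb) / gradient_accumulation_steps
    loss.backward()  # adds into .grad; nothing is zeroed between microbatches

weights_match = torch.allclose(full.weight.grad, accum.weight.grad, atol=1e-6)
bias_match = torch.allclose(full.bias.grad, accum.bias.grad, atol=1e-6)
```

Without the division, the accumulated gradient would be four times larger than the full-batch gradient, which acts like an unintended increase in the learning rate.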

## 4.3 SFT Experiment

Here is some starter code to initialize vLLM and to load the policy weights into the vLLM instance before every rollout phase.

In [None]:
from unittest.mock import patch

import torch
from transformers import PreTrainedModel
from vllm import LLM
from vllm.model_executor import set_random_seed as vllm_set_random_seed

def init_vllm(
    model_id: str,
    device: str,
    seed: int,
    gpu_memory_utilization: float = 0.85,
):
    """
    Start the inference process. Here we use vLLM to hold a model on
    a GPU separate from the policy.
    """
    vllm_set_random_seed(seed)

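    # While the LLM is constructed, make torch.distributed report a world size
    # of 1 and skip vLLM's memory-profiling assertion.
    world_size_patch = patch("torch.distributed.get_world_size", return_value=1)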
    profiling_patch = patch(
        "vllm.worker.worker.Worker._assert_memory_footprint_increased_during_profiling",
        return_value=None,
    )
    with world_size_patch, profiling_patch:
        return LLM(
            model=model_id,
            device=device,
            dtype=torch.bfloat16,
            enable_prefix_caching=True,
            gpu_memory_utilization=gpu_memory_utilization,
        )

def load_policy_into_vllm_instance(policy: PreTrainedModel, llm: LLM):
    """
    Copied from https://github.com/huggingface/trl/blob/
    22759c820867c8659d00082ba8cf004e963873c1/trl/trainer/grpo_trainer.py#L670
    """
    state_dict = policy.state_dict()
    llm_model = llm.llm_engine.model_executor.driver_worker.model_runner.model
    llm_model.load_weights(state_dict.items())

You may find it helpful to log metrics with respect to both the train and validation steps (this will also be useful in later RL experiments). To do this in wandb, you can use the following code:

In [None]:
import wandb

# Set up wandb metrics (call these after wandb.init())
wandb.define_metric("train_step")   # x-axis for training
wandb.define_metric("eval_step")    # x-axis for eval

# everything that starts with train/ is tied to train_step
wandb.define_metric("train/*", step_metric="train_step")

# everything that starts with eval/ is tied to eval_step
wandb.define_metric("eval/*", step_metric="eval_step")

Use gradient clipping with clip value 1.0.
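
A minimal sketch of where clipping fits in the training step follows. It assumes norm-based clipping with torch.nn.utils.clip_grad_norm_ (the note above says only "clip value 1.0", so this interpretation is an assumption), and it uses a small linear layer as a hypothetical stand-in for the policy model.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 2)  # hypothetical stand-in for the policy
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Scale the loss up so that the raw gradient norm is expected to exceed 1.0.
loss = 100.0 * model(torch.randn(8, 4)).pow(2).mean()
loss.backward()

# Rescale all gradients in place so their combined L2 norm is at most max_norm.
# The return value is the norm measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Recompute the combined norm to confirm the effect of clipping.
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

# Clip after backward() and before the optimizer update.
optimizer.step()
optimizer.zero_grad()
```

When combined with gradient accumulation, the clipping call would go inside the branch that runs optimizer.step(), so that it acts on the fully accumulated gradients. The value of `norm_before` is also a useful quantity to log during training.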