# Continued Pre-Training (CPT) at scale with DeepSpeed

This guide provides a step-by-step workflow for continued pre-training of the [`google/gemma-3-4b-pt`](https://huggingface.co/google/gemma-3-4b-pt) model on a multi-GPU Anyscale cluster. It uses LLaMA-Factory as the training framework and DeepSpeed to manage memory efficiently and scale the training process.

CPT is a technique to further adapt a pre-trained base model on large-scale unlabeled text. By continuing to train on high-quality corpora, you adapt the model to new domain knowledge and improve generalization. This notebook performs full fine-tuning of the base model instead of using parameter-efficient fine-tuning (PEFT) techniques.

- **Full fine-tuning vs LoRA:** Full fine-tuning generally yields the best quality but requires significantly more compute, longer training, and large checkpoints. LoRA is much faster and cheaper with small adapter checkpoints, but typically shows the most improvement on curated, simplified corpora (gains on broad/noisy corpora may be limited). See [Compare full vs freeze vs PEFT](https://docs.anyscale.com/llm/fine-tuning#compare-full-vs-freeze-vs-parameter-efficient-fine-tuning-peft) and [LoRA speed and memory optimizations](https://docs.anyscale.com/llm/fine-tuning/speed-and-memory-optimizations#lora).

## Step 1: Set up your environment

### Dependencies
First, ensure your environment has the correct libraries. Start with a pre-built container image and install LLaMA-Factory and DeepSpeed on top of it.

Recommended container image:
```bash
anyscale/ray-llm:2.48.0-py311-cu128
```

Execute the following commands to install the required packages and optional tools for experiment tracking and faster model downloads:


```bash
# Install the specific version of LLaMA-Factory
pip install -q llamafactory==0.9.3

# Install DeepSpeed for large-scale training
pip install -q deepspeed==0.16.9

# (Optional) For accelerated model downloads from Hugging Face
pip install -q hf_transfer==0.1.9

# (Optional) Experiment tracking library
pip install -q mlflow==3.4.0
```


### Model and compute resources

DeepSpeed ZeRO-3 partitions parameters, gradients, and optimizer states across multiple GPUs, enabling CPT of mid-sized LLMs on just 4 GPUs.

| Item | Value |
|------|-------|
| **Base model** | [`google/gemma-3-4b-pt`](https://huggingface.co/google/gemma-3-4b-pt) |
| **Worker nodes** | 4 × L40S or 4 × A100-40G |
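
To see why ZeRO-3 sharding matters here, the following back-of-the-envelope sketch estimates per-GPU memory for model states only. It assumes mixed-precision Adam with 2 bytes per parameter for 16-bit weights, 2 bytes for gradients, and 12 bytes for FP32 optimizer states, and uses a hypothetical round figure of 4 billion parameters. Activations, buffers, and fragmentation are not included, so treat the numbers as a lower bound rather than a measurement.

```python
# Rough per-GPU memory for model states under ZeRO with mixed-precision Adam.
# Assumed byte counts per parameter (not measured on this model):
#   2 for 16-bit weights, 2 for 16-bit gradients,
#   12 for FP32 optimizer states (master weights, momentum, variance).
BYTES_PARAMS, BYTES_GRADS, BYTES_OPTIM = 2, 2, 12


def model_state_gb(num_params: float, num_gpus: int, zero_stage: int) -> float:
    """Approximate per-GPU model-state memory in GB (10^9 bytes)."""
    params, grads, optim = BYTES_PARAMS, BYTES_GRADS, BYTES_OPTIM
    if zero_stage >= 1:  # ZeRO-1 shards optimizer states
        optim /= num_gpus
    if zero_stage >= 2:  # ZeRO-2 also shards gradients
        grads /= num_gpus
    if zero_stage >= 3:  # ZeRO-3 also shards parameters
        params /= num_gpus
    return num_params * (params + grads + optim) / 1e9


NUM_PARAMS = 4e9  # hypothetical round figure for a ~4B-parameter model
for stage in range(4):
    gb = model_state_gb(NUM_PARAMS, num_gpus=4, zero_stage=stage)
    print(f"ZeRO-{stage}: ~{gb:.0f} GB of model states per GPU")
```

Under these assumptions, the unsharded model states need about 64 GB per GPU, while ZeRO-3 across 4 GPUs brings that down to about 16 GB per GPU, leaving headroom for activations.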

## Step 2: Prepare the dataset

### Understand the dataset
This tutorial uses a small JSONL sample of the [C4](https://huggingface.co/datasets/allenai/c4) corpus, which contains cleaned English web text derived from Common Crawl and is widely used for language-model pretraining. Each line is a JSON object with at least a `text` field. For demo purposes, the sample `c4.jsonl`, hosted on S3, contains only the first 100 records of the original C4 dataset to enable quick runs.

**Dataset example**

```json
{"text": "Beginners BBQ Class Taking Place in Missoula!\nDo you want to get better at making delicious BBQ? You will have the opportunity, put this on your calendar now. Thursday, September 22nd join World Class BBQ Champion, Tony Balay from Lonestar Smoke Rangers. He will be teaching a beginner level class for everyone who wants to get better with their culinary skills.\nHe will teach you everything you need to know to compete in a KCBS BBQ competition, including techniques, recipes, timelines, meat selection and trimming, plus smoker and fire information.\nThe cost to be in the class is $35 per person, and for spectators it is free. Included in the cost will be either a t-shirt or apron and you will be tasting samples of each meat that is prepared.", "timestamp": "2019-04-25 12:57:54", "url": "https://klyq.com/beginners-bbq-class-taking-place-in-missoula/"}
```
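
Before training on your own corpus, it can help to confirm that every line parses as JSON and carries a non-empty `text` field. The following is a minimal sketch of such a check; the helper name `validate_jsonl` and the inline sample records are illustrative and not part of LLaMA-Factory.

```python
import json


def validate_jsonl(lines, required_field="text"):
    """Return (valid_count, problems) for an iterable of JSONL lines."""
    valid, problems = 0, []
    for lineno, raw in enumerate(lines, start=1):
        if not raw.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        value = record.get(required_field) if isinstance(record, dict) else None
        if not isinstance(value, str) or not value.strip():
            problems.append((lineno, f"missing or empty '{required_field}'"))
        else:
            valid += 1
    return valid, problems


# Illustrative sample: one good record, one without `text`, one malformed line.
sample = [
    '{"text": "Beginners BBQ Class Taking Place in Missoula!", "url": "https://klyq.com/"}',
    '{"url": "https://example.com/no-text"}',
    "not json",
]
valid, problems = validate_jsonl(sample)
print(f"{valid} valid record(s); problems: {problems}")
```

To check a real file, pass an open file handle instead of the list, for example `validate_jsonl(open("/mnt/cluster_storage/c4.jsonl"))`.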

### Register the dataset

To specify new datasets that are accessible across Ray worker nodes, you must first add a **`dataset_info.json`** to **[storage shared across nodes](https://docs.anyscale.com/configuration/storage#shared)** such as `/mnt/cluster_storage`. This configuration file acts as a central registry for all your datasets. It maps a custom name to your dataset file location, format, and column structure. 

If you plan to run CPT on this text dataset, first complete the setup steps below. Ensure that you place the dataset files in a storage location that all workers can access (for example, a shared mount or object storage). Avoid storing large files on the head node.

`dataset_info.json`
```json
{
  "my_cpt_c4": {
    "file_name": "/mnt/cluster_storage/c4.jsonl",
    "columns": {
      "prompt": "text"
    }
  }
}
```

For a more detailed dataset preparation and formatting guide, see [Choose your data format](https://docs.anyscale.com/llm/fine-tuning/data-preparation#continued-pretraining).
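
If you register several corpora, generating the registry programmatically can reduce copy-paste errors. The sketch below builds the same entry as the `dataset_info.json` shown above; the helper `build_dataset_info` is a hypothetical convenience function, not a LLaMA-Factory API.

```python
import json


def build_dataset_info(name, file_name, text_column="text"):
    """Build one registry entry that maps `prompt` to the raw-text column."""
    return {name: {"file_name": file_name, "columns": {"prompt": text_column}}}


registry = build_dataset_info("my_cpt_c4", "/mnt/cluster_storage/c4.jsonl")
print(json.dumps(registry, indent=2))

# To persist the registry on shared storage (path taken from this guide):
# with open("/mnt/cluster_storage/dataset_info.json", "w") as f:
#     json.dump(registry, f, indent=2)
```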


```bash
# Make sure all files are accessible to worker nodes
# Create a copy of the data in /mnt/cluster_storage
wget https://anyscale-public-materials.s3.us-west-2.amazonaws.com/llm-finetuning/llama-factory/datasets/alpaca/c4.jsonl -O /mnt/cluster_storage/c4.jsonl
# Create a copy of the dataset registry in /mnt/cluster_storage
cp ../dataset-configs/dataset_info.json /mnt/cluster_storage/
```


## Step 3: Create the pre-training config (CPT with DeepSpeed)

Next, create the main YAML configuration file, which serves as the master recipe for your pre-training job. It specifies the base model, the training method (full fine-tuning), the dataset, training hyperparameters, cluster resources, and more.

**Important notes:**
- **MLflow tracking:** To track experiments with MLflow, set `report_to: mlflow` in the config. If you don't want to use MLflow, set `report_to: none` to avoid errors.
- **Access and paths:** The YAML only needs to be on the **head node**, but any referenced paths (`dataset_dir`, `output_dir`) must reside on storage **reachable by all workers** (for example, `/mnt/cluster_storage/`).
- **Gated models:** If your base model has gated access (for example, Gemma) on Hugging Face, set `HF_TOKEN` in the runtime environment.
- **GPU selection and placement:** The config uses a 4xL40S node (`anyscale/accelerator_shape:4xL40S`) so that all 4 GPUs are on the same machine, which is important for efficient DeepSpeed ZeRO-3 communication. You can switch to other multi-GPU nodes such as `4xA100-40G` or any other node type with comparable or more VRAM, depending on your cloud availability.

### Configure LLaMA-Factory with Ray

**Note**: To customize the training configuration, edit `train-configs/cpt_deepspeed.yaml`. 

```yaml
# cpt_deepspeed.yaml

### model
model_name_or_path: google/gemma-3-4b-pt
trust_remote_code: true

### method
stage: pt
do_train: true
finetuning_type: full

### deepspeed
deepspeed: /mnt/cluster_storage/ds_z3_config.json # path to the DeepSpeed config

### dataset
dataset: my_cpt_c4
dataset_dir: /mnt/cluster_storage

template: gemma
cutoff_len: 512
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: gemma3_4b_full_cpt
logging_steps: 2
save_steps: 50
plot_loss: true
report_to: mlflow   # or none

### train
per_device_train_batch_size: 1 # Adjust this depending on your GPU memory and sequence length
gradient_accumulation_steps: 2
num_train_epochs: 2.0
learning_rate: 1.0e-4
bf16: true
lr_scheduler_type: cosine
warmup_ratio: 0.1
ddp_timeout: 180000000

### ray
ray_run_name: gemma3_4b_full_cpt
ray_storage_path: /mnt/cluster_storage/
ray_num_workers: 4  # Number of GPUs to use
resources_per_worker:
  GPU: 1
  # accelerator_type:L40S: 0.001            # Use this to simply specify a GPU type (may place GPUs on separate nodes).
  anyscale/accelerator_shape:4xL40S: 0.001  # Prefer this for DeepSpeed so all 4 GPUs are on the same node.
  # See https://docs.ray.io/en/master/ray-core/accelerator-types.html#accelerator-types for a full list of accelerator types.
ray_init_kwargs:
  runtime_env:
    env_vars:
      # If using gated models like google/gemma-3-4b-pt
      HF_TOKEN: <your_huggingface_token>
      # If hf_transfer is installed
      HF_HUB_ENABLE_HF_TRANSFER: '1'
      # If using mlflow for experiments tracking
      MLFLOW_TRACKING_URI: "https://<your_cloud_id>.cloud.databricks.com"
      MLFLOW_TRACKING_TOKEN: "<mlflow_tracking_token>"
      MLFLOW_EXPERIMENT_NAME: "/Users/<your_user_id>/experiment_name"
```

**Note:**
This configuration assumes `4xL40S` GPUs are available in your cloud environment. If not, you can substitute with `4xA100-40G` (or another supported accelerator with similar VRAM).

Together, `stage: pt` and `finetuning_type: full` configure this run as full continued pre-training on this C4-based corpus, producing full model checkpoints rather than lightweight adapters.
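
The batch-related settings in the config interact, so it helps to work out the effective global batch size before changing any of them. The sketch below uses the values from `cpt_deepspeed.yaml` and assumes one GPU per Ray worker, as configured above. The tokens-per-step figure is an upper bound that assumes every sequence is filled to `cutoff_len`.

```python
# Values copied from cpt_deepspeed.yaml in this guide.
per_device_train_batch_size = 1
gradient_accumulation_steps = 2
ray_num_workers = 4  # one GPU per worker
cutoff_len = 512


def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int) -> int:
    """Sequences consumed per optimizer step across all GPUs."""
    return per_device * grad_accum * num_gpus


global_batch = effective_batch_size(
    per_device_train_batch_size, gradient_accumulation_steps, ray_num_workers
)
max_tokens_per_step = global_batch * cutoff_len  # upper bound

print(f"Effective global batch size: {global_batch} sequences")
print(f"Tokens per optimizer step (at most): {max_tokens_per_step}")
```

With these settings, each optimizer step consumes 8 sequences, or at most 4,096 tokens. If you lower `per_device_train_batch_size` to save memory, raising `gradient_accumulation_steps` by the same factor keeps the global batch size unchanged.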

### DeepSpeed configuration
DeepSpeed is an open-source deep-learning optimization library developed by Microsoft for training large models. Higher ZeRO stages (1→3) and CPU offload reduce GPU memory usage but can slow training.

To enable DeepSpeed, create a separate DeepSpeed config in **[storage shared across nodes](https://docs.anyscale.com/configuration/storage#shared)** and reference it from your main training YAML config with:

```yaml
deepspeed: /mnt/cluster_storage/ds_z3_config.json
```

Below is a sample ZeRO-3 config:

`ds_z3_config.json`
```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": false,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
```

For a more detailed guide on acceleration and optimization methods including DeepSpeed on Ray, see [Speed and memory optimizations](https://docs.anyscale.com/llm/fine-tuning/speed-and-memory-optimizations).
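
Many fields in the sample config are set to `"auto"`. In the Hugging Face Transformers DeepSpeed integration, which LLaMA-Factory builds on, such values are typically filled in from the matching training arguments (for example, batch size and precision) at launch time. The sketch below lists which fields are delegated this way and confirms the ZeRO stage. It works on an abridged inline copy of the config, and the helper `find_auto_fields` is illustrative only.

```python
import json

# Abridged copy of ds_z3_config.json from this guide.
DS_CONFIG = """
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "bf16": {"enabled": "auto"},
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": false,
    "reduce_bucket_size": "auto",
    "stage3_max_live_parameters": 1e9
  }
}
"""


def find_auto_fields(node, prefix=""):
    """Return dotted paths of every field whose value is the string "auto"."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}.{key}" if prefix else key
            if value == "auto":
                found.append(path)
            else:
                found.extend(find_auto_fields(value, path))
    return found


config = json.loads(DS_CONFIG)
auto_fields = find_auto_fields(config)
print("ZeRO stage:", config["zero_optimization"]["stage"])
print("Fields resolved at launch time:", auto_fields)
```

To inspect the real file, replace `json.loads(DS_CONFIG)` with `json.load(open("/mnt/cluster_storage/ds_z3_config.json"))`.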


```bash
# Create a copy of the DeepSpeed configuration file in /mnt/cluster_storage
cp ../deepspeed-configs/ds_z3_config.json /mnt/cluster_storage/
```


## Step 4: Train and monitor

**Note**: For gated models such as [`google/gemma-3-4b-pt`](https://huggingface.co/google/gemma-3-4b-pt), ensure that you accept the license agreement for the models on the Hugging Face site and set `HF_TOKEN` in the runtime environment. If you installed MLflow, configure its credentials. Otherwise, set `report_to: none` in `cpt_deepspeed.yaml` to avoid `api_token not set` errors.
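
A quick pre-flight check can catch a missing file before the cluster spends time starting workers. The sketch below looks for the three files this guide places on shared storage. It only checks visibility from the node where it runs, and the helper `missing_paths` is a hypothetical convenience function.

```python
import os


def missing_paths(paths):
    """Return the subset of paths that don't exist on this node."""
    return [p for p in paths if not os.path.exists(p)]


# Files this guide expects on shared storage.
REQUIRED = [
    "/mnt/cluster_storage/c4.jsonl",
    "/mnt/cluster_storage/dataset_info.json",
    "/mnt/cluster_storage/ds_z3_config.json",
]

missing = missing_paths(REQUIRED)
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All required files are in place.")
```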

With all configurations in place, you can launch pre-training in one of two ways:

### Option A: Run from a workspace (quickstart)

The `USE_RAY=1` prefix tells LLaMA-Factory to run in distributed mode on the Ray cluster attached to your workspace.


```bash
USE_RAY=1 llamafactory-cli train ../train-configs/cpt_deepspeed.yaml
```


### Option B: Run as an Anyscale job (production)

For longer or production runs, submit the training as an **Anyscale job**. Jobs run outside your interactive session for better stability, retries, and durable logs. You package LLaMA-Factory and other libraries in a container image and launch with a short job config. See [Run LLaMA-Factory as an Anyscale job](https://docs.anyscale.com/llm/fine-tuning/llamafactory-jobs) for the step-by-step guide.

### Tracking with MLflow

If you enabled MLflow logging (`report_to: mlflow` in your YAML), LLaMA-Factory logs metrics (loss, learning rate, etc.), parameters, and artifacts to your configured MLflow tracking server.

**Example YAML snippet:**

```yaml
report_to: mlflow

ray_init_kwargs:
  runtime_env:
    env_vars:
      MLFLOW_TRACKING_URI: "https://<your_cloud_id>.cloud.databricks.com"
      MLFLOW_TRACKING_TOKEN: "<mlflow_tracking_token>"
      MLFLOW_EXPERIMENT_NAME: "/Users/<your_user_id>/experiment_name"
```

**MLflow example**

![MLflow](https://anyscale-public-materials.s3.us-west-2.amazonaws.com/llm-finetuning/llama-factory/3.2.4/mlflow.png)

For a more detailed guide on tracking experiments with tools such as Weights & Biases or MLflow, see [Observability and tracking](https://docs.anyscale.com/llm/fine-tuning/observability-and-tracking).

## Step 5: Locate checkpoints

Ray Train writes checkpoints under `<ray_storage_path>/<ray_run_name>`. In this example run, the path is `/mnt/cluster_storage/gemma3_4b_full_cpt`.

Inside, you see a **trainer session** directory named like:
`TorchTrainer_8c6a5_00000_0_2025-09-09_09-53-45/`.

- Ray Train creates `TorchTrainer_*` **when the trainer starts**; the suffix encodes a short run ID and the **start timestamp**.
- Within that directory, Ray Train names checkpoints `checkpoint_000xxx/`, where the number is the checkpoint's sequential index in save order.

Control the save cadence with `save_strategy` and `save_steps`. For instructions on how to resume interrupted training with `resume_from_checkpoint` and more, see [Understand the artifacts directory](https://docs.anyscale.com/llm/fine-tuning/checkpointing#artifacts-directory).
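
To locate the most recent checkpoint programmatically, you can scan the run directory for the layout described above. The following sketch assumes the `TorchTrainer_*/checkpoint_<number>` naming shown in this section. It prefers the most recently modified trainer session and, within it, the highest checkpoint index. The helper `latest_checkpoint` is illustrative and not a Ray or LLaMA-Factory API.

```python
import re
from pathlib import Path
from typing import Optional

CHECKPOINT_RE = re.compile(r"checkpoint_(\d+)")


def latest_checkpoint(run_dir: str) -> Optional[Path]:
    """Return the newest checkpoint directory under run_dir, or None."""
    root = Path(run_dir)
    if not root.is_dir():
        return None
    candidates = []
    for path in root.glob("TorchTrainer_*/checkpoint_*"):
        match = CHECKPOINT_RE.fullmatch(path.name)
        if path.is_dir() and match:
            # Sort key: session modification time first, then checkpoint index.
            key = (path.parent.stat().st_mtime, int(match.group(1)))
            candidates.append((key, path))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]


# Run directory from this guide; prints None if it doesn't exist yet.
print(latest_checkpoint("/mnt/cluster_storage/gemma3_4b_full_cpt"))
```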

## Step 6: Export the model

If you use LoRA, you can keep the base model and adapters separate for [multi-LoRA deployment](https://docs.anyscale.com/llm/serving/multi-lora) or [merge the adapters](https://docs.anyscale.com/llm/fine-tuning/checkpointing#merge-lora) into the base model for low-latency inference. 

For full fine-tuning or freeze-tuning, export the fine-tuned model directly.

You can optionally apply [post-training quantization](https://docs.anyscale.com/llm/fine-tuning/checkpointing#ptq) to merged or full models before serving.
