# Domain Adaptive Pre-Training (DAPT)

## Goal

Given a foundational language model (in this case llama-2-7B) that was pre-trained on a broad, general-purpose corpus, our goal is to further pretrain the model on a specific domain (in this example, ChipDesign) to enhance its understanding of domain-specific language and context. This process is called Domain-Adaptive Pretraining (DAPT). DAPT adapts a general-purpose model to specialized tasks within a particular field. Instead of training from scratch, we aim to “specialize” the model by focusing on a target domain corpus, allowing it to adapt to the unique vocabulary, semantics, and syntax of that field.

Our primary goals with respect to DAPT are as follows:
* Improve the model’s performance and accuracy on domain-specific tasks
* Ensure the model retains general language capabilities
* Minimize pretraining time by leveraging existing knowledge in the model

DAPT typically enhances a model’s efficacy in downstream tasks for the domain by exposing it to domain-relevant texts. This pretraining phase can result in more accurate and context-aware predictions on domain-specific data, as the model gains an understanding of field-specific terminology, abbreviations, and common phrases.

# NeMo Tools and Resources

* [NeMo Framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html)

# Software Requirements
* Access to latest NeMo Framework NGC Containers
* This playbook has been tested on: nvcr.io/nvidia/nemo:dev. It is expected to work similarly on other environments.


#### Launch the NeMo Framework container as follows: 

```
docker run -it -p 8080:8080 -p 8088:8088 --rm --gpus '"device=0,1"' --ipc=host --network host -v $(pwd):/workspace nvcr.io/nvidia/nemo:dev
```

#### Launch Jupyter Notebook as follows: 
```
jupyter notebook --allow-root --ip 0.0.0.0 --port 8088 --no-browser --NotebookApp.token=''

```


# Hardware Requirements

* This playbook has been tested on 2xA100 80G but can be scaled to multiple GPUs as well as multiple nodes by modifying the appropriate parameters

# Data

* In this playbook, we will leverage chip domain/hardware datasets from open-source GitHub repositories, wiki URLs, and academic papers. Data has been processed and curated using [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator/tree/main) as shown in this [playbook](https://github.com/jvamaraju/ndc_dapt_playbook/tree/dapt_jv)

# Notebook Outline

* Step 1: Prepare the data for pretraining. This is a multi-step process, discussed in detail in the corresponding section later in the notebook.

* Step 2: Download the llama-2-7B Hugging Face checkpoint and convert it to .nemo format.

* Step 3: Continue pretraining the llama-2-7b model using the prepared data and the custom-trained tokenizer (from the previous notebook).

* Step 4: Generate Results from llama-2-7b model and trained DAPT Checkpoints 

* Step 5: Calculate evaluation metrics

# Step 0: Clone the Model Checkpoint

This notebook assumes the model has been cloned from [Hugging Face](https://huggingface.co/meta-llama/Llama-2-7b-hf) into the mounted directory ```/dli/task/02_custom_tokenizer_training/models/weight/```

Clone the model: 
```
git lfs install
git clone https://huggingface.co/meta-llama/Llama-2-7b-hf
```

# Step 1: Data Preparation for Pretraining

Identify the different file types (for example, code or text) in the pretraining data. This is typically dataset dependent; in this case we only have 'code' type files.

In [None]:
import os
import json
from pathlib import Path

# Function to count and categorize JSONL files based on 'file_type' field
def identify_jsonl_files(data_path):
    code_files, text_files = [], []
    cnt_code, cnt_text = 0, 0

    for root, _, files in os.walk(data_path):
        for file in files:
            if not file.endswith('.jsonl'):
                continue
            
            file_path = os.path.join(root, file)
            has_code, has_text = False, False

            with open(file_path, 'r') as f:
                for line in f:
                    try:
                        json_obj = json.loads(line.strip())
                        file_type = json_obj.get('file_type', '').lower()

                        if file_type == 'code':
                            has_code = True
                        elif file_type == 'text':
                            has_text = True
                        
                        if has_code and has_text:
                            break  # No need to read further if both types are present

                    except json.JSONDecodeError:
                        continue  # Ignore malformed JSON lines

            if has_code:
                code_files.append(file_path)
                cnt_code += 1
            if has_text:
                text_files.append(file_path)
                cnt_text += 1

    return code_files, text_files, cnt_code, cnt_text

# Path to JSONL dataset
data_path = '/dli/task/02_custom_tokenizer_training/curated_sample_data/curated_data'

# Identify JSONL files
code_files, text_files, cnt_code, cnt_text = identify_jsonl_files(data_path)

# Output results
print(f"\nNumber of files containing 'file_type': 'text': {cnt_text}")
print(f"Number of files containing 'file_type': 'code': {cnt_code}")

### Merging code JSONL files into a single JSONL file for further preprocessing

This is an optional step; it is also possible to use multiple JSONL files in this workflow. This example uses a single merged JSONL file.

In [None]:
import os
import json

def list_jsonl_files(directory):
    jsonl_files = []
    for root, _, files in os.walk(directory):
        for file in files:
            if file.endswith('.jsonl'):
                jsonl_files.append(os.path.join(root, file))
    return jsonl_files

# Function to merge multiple jsonl files into a single file 
def merge_jsonl_files(directory, output_file):
    jsonl_files = list_jsonl_files(directory)
    
    with open(output_file, 'w') as outfile:
        for input_file in jsonl_files:
            with open(input_file, 'r') as infile:
                for line in infile:
                    try:
                        json_object = json.loads(line.strip())
                        json.dump(json_object, outfile)
                        outfile.write('\n')
                    except json.JSONDecodeError:
                        print(f"Skipping invalid JSON in {input_file}: {line.strip()}")

    print(f"Merged {len(jsonl_files)} JSONL files into {output_file}")

In [None]:
directory = '/dli/task/02_custom_tokenizer_training/curated_sample_data/curated_data'
output_file = '/dli/task/02_custom_tokenizer_training/curated_sample_data/code_merged_output.jsonl'
merge_jsonl_files(directory, output_file)

### Data Format Conversion for pretraining: JSONL to bin/idx files 

For efficient pretraining, we convert data from JSONL to bin/idx format. 

JSONL files, while convenient for storing structured text data, are not optimized for high-speed data loading during large language model training. In pretraining workflows, particularly those with large datasets and complex model architectures, the need for fast data access and efficient memory management is essential.

The bin/idx format is a binary format specifically designed to facilitate high-throughput data loading. This format allows direct, randomized access to data samples, which speeds up I/O operations and reduces the memory footprint compared to loading JSONL files. By converting data to bin/idx format, hardware utilization can be maximized and bottlenecks in data processing can be avoided, leading to a more efficient pretraining process.

#### Benefits of bin/idx format for Pretraining:

* **Optimized I/O Performance:** The binary format enables quicker data reads and reduces latency, allowing the model to continuously access data at high speeds.
* **Efficient Memory Usage:** Data in bin/idx format consumes less memory during loading, making it suitable for large datasets and enabling better use of available system resources.
* **Enhanced Scalability:** With bin/idx, it’s easier to handle shuffling and batching of large datasets, which is essential for pretraining on diverse domain-specific data.
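
To make the random-access idea concrete, here is a toy sketch of an indexed binary layout. It is not the actual on-disk format written by the NeMo preprocessing script; the `write_indexed` and `read_sample` helpers, the 4-byte token width, and the file layout are all assumptions chosen for illustration. The point is only that an index of offsets lets a loader jump straight to any sample without parsing the samples before it.

```python
import mmap
import os
import struct
import tempfile

TOKEN = struct.Struct("<i")   # one 4-byte little-endian token id (assumed width)
OFFSET = struct.Struct("<q")  # one 8-byte little-endian sample offset


def write_indexed(samples, prefix):
    """Write all token ids to <prefix>.bin and sample start offsets to <prefix>.idx."""
    with open(prefix + ".bin", "wb") as bin_f, open(prefix + ".idx", "wb") as idx_f:
        position = 0
        for sample in samples:
            idx_f.write(OFFSET.pack(position))
            for token in sample:
                bin_f.write(TOKEN.pack(token))
            position += len(sample)
        idx_f.write(OFFSET.pack(position))  # final entry marks the end of the last sample


def read_sample(prefix, i):
    """Read sample i by seeking to its offset instead of scanning the file."""
    with open(prefix + ".idx", "rb") as idx_f:
        idx_f.seek(i * OFFSET.size)
        start, = OFFSET.unpack(idx_f.read(OFFSET.size))
        end, = OFFSET.unpack(idx_f.read(OFFSET.size))
    with open(prefix + ".bin", "rb") as bin_f:
        with mmap.mmap(bin_f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            raw = mm[start * TOKEN.size:end * TOKEN.size]
    return [t[0] for t in TOKEN.iter_unpack(raw)]


prefix = os.path.join(tempfile.mkdtemp(), "toy")
write_indexed([[1, 2, 3], [4, 5], [6, 7, 8, 9]], prefix)
print(read_sample(prefix, 1))  # [4, 5]
```

A JSONL reader would have to parse every preceding line to reach the same sample, which is why shuffled access over large corpora benefits from an index.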

Modify the `input` to point to the merged `jsonl` file. Similarly, modify the paths for `vocab`, `tokenizer-model`, and `merge-file` to point to the relevant files. `tokenizer-model` should point to the custom tokenizer (trained in the custom tokenizer training notebook) if your data has domain-specific terminology.

In [None]:
# Using default Llama-2 tokenizer for testing purpose
!python3 /opt/NeMo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
--input='/dli/task/02_custom_tokenizer_training/curated_sample_data/code_merged_output.jsonl' \
--json-keys=text \
--tokenizer-library=sentencepiece \
--dataset-impl mmap \
--tokenizer-model '/dli/task/02_custom_tokenizer_training/models/weight/llama2-7b-hf/tokenizer.model' \
--tokenizer-type llama \
--append-eod \
--output-prefix='preprocessed_data'

In [None]:
#### Uncomment to use custom trained tokenizer ####
# !python3 /opt/NeMo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
# --input='/workspace/dapt-custom-tokenization/code_merged_output.jsonl' \
# --json-keys=text \
# --tokenizer-library=sentencepiece \
# --vocab '/workspace/dapt-custom-tokenization/code/code/models/tokenizer/llama2/custom_tokenizer_init_20000.json/vocab.json' \
# --dataset-impl mmap \
# --tokenizer-model '/workspace/Llama-2-7b-hf/tokenizer.model' \
# --tokenizer-type llama \
# --merge-file '/workspace/dapt-custom-tokenization/code/code/models/tokenizer/llama2/custom_tokenizer_init_20000.json/merges.txt' \
# --append-eod \
# --output-prefix='preprocessed_data'

In [None]:
# If the above step runs successfully, two files with the extensions .bin and .idx will be generated
!ls 
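
As an alternative to eyeballing the `ls` output, the check can be scripted. This is a minimal sketch: the `missing_outputs` helper is hypothetical, and it assumes the output files are named `<output-prefix>_<json-key>_document.bin` and `.idx`, which matches the `preprocessed_data_text_document` path used later in this notebook.

```python
from pathlib import Path


def missing_outputs(prefix, json_key="text"):
    """Return the expected .bin/.idx paths that do not exist for an output prefix."""
    stem = f"{prefix}_{json_key}_document"  # assumed naming convention
    expected = [Path(stem + ext) for ext in (".bin", ".idx")]
    return [str(p) for p in expected if not p.is_file()]


missing = missing_outputs("preprocessed_data")
print("All outputs present" if not missing else f"Missing: {missing}")
```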

# Step 2: Convert Llama-2-7b Hugging Face Checkpoint to NeMo2.0 format

The Llama-2-7B model can be automatically downloaded and converted to NeMo2 format with the following script:

* Save the following code snippet as ```convert_ckpt_nemo2.py```
* Run ```python3 convert_ckpt_nemo2.py```

```
from nemo.collections import llm
from nemo.collections.llm import Llama2Config7B

if __name__ == "__main__":
    output = llm.import_ckpt(
        model=llm.LlamaModel(config=Llama2Config7B()),
        source="hf:///workspace/Llama-2-7b-hf",
    )
```

The conversion generates a Llama-2 NeMo2 checkpoint directory in the default ```$NEMO_HOME``` folder, which can be used for continued pretraining with NeMo Toolkit as shown in Step 3. Unless otherwise specified, ```NEMO_HOME``` is set to ```/root/.cache/nemo```.

Alternatively, you can set ```source="hf://meta-llama/Llama-2-7b-hf"``` to pull the model directly from Hugging Face instead of using the locally downloaded copy in ```/workspace```.

```
cd /dli/task/02_custom_tokenizer_training/models/weight
python3 /dli/task/03_domain_adaptive_pretraining/convert_ckpt_nemo2.py

```

In [None]:
!ls /root/.cache/nemo/models/llama2-7b-hf

In [None]:
!ls /root/.cache/nemo/models/llama2-7b-hf/weights/
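
The paths listed above depend on `NEMO_HOME`. If you override that variable, a small helper can resolve the checkpoint location instead of hardcoding it. This is a sketch only: `converted_ckpt_dir` is a hypothetical helper, and the `models/<name>` layout is inferred from the `ls` commands above rather than taken from NeMo documentation.

```python
import os
from pathlib import Path


def converted_ckpt_dir(model_name="llama2-7b-hf", default_home="/root/.cache/nemo"):
    """Resolve <NEMO_HOME>/models/<model_name>, falling back to the default cache dir."""
    nemo_home = Path(os.environ.get("NEMO_HOME", default_home))
    return nemo_home / "models" / model_name  # assumed directory layout


print(converted_ckpt_dir())
```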

# Step 3: Continued Pretraining using Llama2-7b

For this step we use a predefined recipe `llama2_7b.pretrain_recipe` from NeMo Toolkit for continued pretraining. We will modify the `pretrain_recipe` and use it for continued pretraining workflow. Typically this involves changing dataset files and data blends, changing learning rate scheduler, changing default parallelism based on number of devices available, adding connector to resume training, etc.

First, we define the recipe and executor for using NeMo2 as follows:

In [None]:
import nemo_run as run
from nemo.collections import llm
from nemo.collections.llm import Llama2Config7B

# Configure recipe to pre-train based on the default Llama-2-7B recipe
def configure_recipe(nodes: int = 1, gpus_per_node: int = 1):
    recipe = llm.llama2_7b.pretrain_recipe(
        name="llama2_7b_dapt",
        num_nodes=nodes,
        num_gpus_per_node=gpus_per_node,
    )

    # Set parallelism and validation parameters
    strategy = recipe.trainer.strategy
    strategy.context_parallel_size = 1
    strategy.tensor_model_parallel_size = 1
    recipe.trainer.val_check_interval = 10

    return recipe

# Executor for running pretraining 
def local_executor_torchrun(devices: int = 1) -> run.LocalExecutor:
    executor = run.LocalExecutor(ntasks_per_node=devices, launcher="torchrun")
    return executor

Let's instantiate the `recipe` and modify it so that it uses the desired number of GPUs, resuming from the pretrained Llama2-7b checkpoint instead of training from scratch.

The default `recipe` initializes all the essential components required for Llama2 7B pretraining, including the model, dataloader, trainer, logger, and optimizer. The `recipe` is not executed during instantiation, so it is simple to modify it to fit your custom training workflow. In our case, we want to perform DAPT (instead of pretraining from scratch), so all we need to do is add a `resume` config that points to the Llama2 7B checkpoint.

You can easily change the optimizer, parallelism, data as per your use case. Look at the following example for guidance on how to tweak these parameters. Note: you are only configuring your task at this stage; the underlying code is not executed unless you launch the job using the executor.

In [None]:
!rm -r /dli/task/02_custom_tokenizer_training/models/weight/llama2-7b-hf/.cache

In [None]:
import nemo.lightning as nl
from nemo.collections.common.tokenizers import AutoTokenizer

# Define dataset configuration
data = run.Config(
    llm.PreTrainingDataModule,
    paths=['/dli/task/03_domain_adaptive_pretraining/preprocessed_data_text_document'],
    seq_length=4096,
    tokenizer=run.Config(
        AutoTokenizer,
        pretrained_model_name="/dli/task/02_custom_tokenizer_training/models/weight/llama2-7b-hf",
    ),
    micro_batch_size=1,
    global_batch_size=8,
)

# Instantiate the recipe
recipe = configure_recipe(nodes=1, gpus_per_node=2)

# Configure resume settings
recipe.resume = run.Config(
    nl.AutoResume,
    restore_config=run.Config(nl.RestoreConfig, path="/root/.cache/nemo/models/llama2-7b-hf"),
)

# Ensure tokenizer is set
recipe.data.tokenizer = data.tokenizer

# Configure parallelism settings
recipe.trainer.strategy.tensor_model_parallel_size = 2
recipe.trainer.strategy.pipeline_model_parallel_size = 1
recipe.trainer.strategy.context_parallel_size = 1

# Configure training steps and validation intervals
recipe.trainer.max_steps = 20
recipe.trainer.max_epochs = 1
recipe.trainer.val_check_interval = 10
recipe.trainer.limit_val_batches=5

# Set batch size settings
recipe.data.global_batch_size = data.global_batch_size
recipe.data.micro_batch_size = data.micro_batch_size
recipe.data.num_val_samples = 128  # Adjust based on dataset size

# Set checkpoint and log locations
recipe.log.log_dir = "/workspace/logs_03_15"
recipe.log.ckpt.save_optim_on_train_end = True

# Configure learning rate scheduler
recipe.optim.config.lr = 1e-5
recipe.optim.lr_scheduler.min_lr = 1e-6

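# Assign dataset configuration. This replaces the recipe's default data module, so the
# `recipe.data.*` values set above do not carry over: the tokenizer and batch sizes are
# already part of `data`, while `num_val_samples` must be set again after this line if needed
recipe.data = data

# Configure data blending (if needed)
recipe.data.paths = [1, '/dli/task/03_domain_adaptive_pretraining/preprocessed_data_text_document']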
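
The parallelism and batch-size values in this cell interact. The sketch below works through the arithmetic using the usual Megatron-style relationship, where the data-parallel size is the number of GPUs divided by the product of the tensor, pipeline, and context parallel sizes. The `batch_layout` helper is hypothetical and exists only to show the calculation.

```python
def batch_layout(world_size, tp, pp, cp, micro_batch, global_batch, seq_length):
    """Derive data-parallel size, gradient-accumulation steps, and tokens per step."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel:
        raise ValueError("world_size must be divisible by tp * pp * cp")
    dp = world_size // model_parallel
    if global_batch % (micro_batch * dp):
        raise ValueError("global_batch must be divisible by micro_batch * dp")
    return {
        "data_parallel_size": dp,
        "grad_accum_steps": global_batch // (micro_batch * dp),
        "tokens_per_step": global_batch * seq_length,
    }


# Values taken from the configuration in this notebook: 2 GPUs, TP=2, PP=1, CP=1
layout = batch_layout(world_size=2, tp=2, pp=1, cp=1,
                      micro_batch=1, global_batch=8, seq_length=4096)
print(layout)
# {'data_parallel_size': 1, 'grad_accum_steps': 8, 'tokens_per_step': 32768}
```

With `max_steps = 20`, this run therefore processes at most 20 × 32,768 = 655,360 tokens, which is enough to verify the workflow but far too little for meaningful domain adaptation.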

After configuring the training procedure, we can launch training by instantiating the `executor` and passing it to `run.run` from NeMo-Run:

In [None]:
# Launch the pretraining job 
executor = local_executor_torchrun(devices=recipe.trainer.devices)
run.run(recipe, executor=executor)

In [None]:
### Modify checkpoint path
!ls -lh /workspace/logs_03_15/llama2_7b_dapt/2025-03-18_05-20-49/checkpoints/

### To monitor the training, launch TensorBoard from another terminal

Point `--logdir` at the log directory configured in `recipe.log.log_dir`:

`tensorboard --logdir /workspace/logs_03_15 --bind_all`

# Step 4: Generate Results from Llama-2-7b Model and Trained DAPT Checkpoints

We use the `llm.generate` API in NeMo 2.0 to generate results from the trained DAPT checkpoint. Find your last saved checkpoint in your experiment directory, for example `/workspace/logs_03_15/llama2_7b_dapt/2025-03-18_05-20-49/checkpoints`. The timestamped folder name differs for each run.

In [None]:
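### Modify checkpoint path
from pathlib import Path

ckpt_dir = Path("/workspace/logs_03_15/llama2_7b_dapt/2025-03-18_05-20-49/checkpoints/")
# Pick the checkpoint directory whose name ends with "-last"
last_ckpt = next((d for d in ckpt_dir.iterdir() if d.is_dir() and d.name.endswith("-last")), None)
if last_ckpt is None:
    raise FileNotFoundError(f"No '-last' checkpoint directory found in {ckpt_dir}")
dapt_ckpt_path = str(last_ckpt)
print("We will load DAPT checkpoint from:", dapt_ckpt_path)
base_ckpt_path = Path("/root/.cache/nemo/models/llama2-7b-hf/")
print("We will load base model checkpoint from:", base_ckpt_path)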
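
Because the run folder is timestamped, the path above has to be edited after every training run. One way to avoid that is to search the experiment directory for the newest run. The sketch below assumes run folders are named with sortable timestamps such as `2025-03-18_05-20-49`, as in the paths shown in this notebook; `find_last_checkpoint` is a hypothetical helper, not a NeMo API.

```python
from pathlib import Path


def find_last_checkpoint(experiment_dir):
    """Return the '-last' checkpoint directory from the most recent run, or None."""
    runs = sorted(p for p in Path(experiment_dir).iterdir() if p.is_dir())
    for run_dir in reversed(runs):  # timestamped names sort chronologically
        ckpt_dir = run_dir / "checkpoints"
        if not ckpt_dir.is_dir():
            continue
        for candidate in sorted(ckpt_dir.iterdir()):
            if candidate.is_dir() and candidate.name.endswith("-last"):
                return candidate
    return None


experiment_dir = Path("/workspace/logs_03_15/llama2_7b_dapt")
if experiment_dir.is_dir():
    print(find_last_checkpoint(experiment_dir))
```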

When using the `llm.generate` API, you can also pass a [custom data module](https://github.com/NVIDIA/NeMo/blob/main/tutorials/llm/llama/nemo2-sft-peft/nemo2-sft.ipynb). Here we use a sample Verilog dataset to generate predictions. For a quick demonstration, we use the first 30 lines as example input. The input JSONL file should contain `input` and `output` fields (additional keys are optional). In the following example, the predictions from the DAPT checkpoint are saved to `dapt_prediction.jsonl` and those from the base model to `llama2_7b_base_prediction.jsonl`. Note that while continued pretraining required a minimum of 2 GPUs with `tensor_model_parallel_size=2`, generating predictions only requires `tensor_model_parallel_size=1`. The cell below keeps `tensor_model_parallel_size=2` on 2 GPUs, since using multiple GPUs can speed up inference.

In [None]:
!ls /dli/task/03_domain_adaptive_pretraining/data/

In [None]:
%%bash
head -n 30 /dli/task/03_domain_adaptive_pretraining/data/MG-Verilog_high_level_global_summary_in_out_test.jsonl > /dli/task/03_domain_adaptive_pretraining/evals/toy_verilog_test.jsonl
head -n 1 /dli/task/03_domain_adaptive_pretraining/evals/toy_verilog_test.jsonl
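
Before running inference, it can help to confirm that every record in the toy file carries the expected fields. The following is a minimal sketch: `validate_records` is a hypothetical helper, and the two sample records are made up for illustration rather than taken from the MG-Verilog dataset.

```python
import json


def validate_records(lines, required=("input", "output")):
    """Parse JSONL lines and return (records, problems) for the required fields."""
    records, problems = [], []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append(f"line {number}: invalid JSON ({err.msg})")
            continue
        if not isinstance(record, dict):
            problems.append(f"line {number}: expected a JSON object")
            continue
        missing = [key for key in required if key not in record]
        if missing:
            problems.append(f"line {number}: missing {missing}")
        else:
            records.append(record)
    return records, problems


# Illustrative records only; the second one lacks the "output" field
sample_lines = [
    json.dumps({"input": "Describe a 2-to-1 multiplexer.", "output": "module mux2 ..."}),
    json.dumps({"input": "No reference answer here"}),
]
records, problems = validate_records(sample_lines)
print(len(records), problems)
```

To check the real file, pass an open file handle instead of `sample_lines`, since iterating over a file yields its lines.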

In [None]:
input_data="/dli/task/03_domain_adaptive_pretraining/evals/toy_verilog_test.jsonl"
output_path_base="/dli/task/03_domain_adaptive_pretraining/evals/llama2_7b_base_prediction.jsonl"
output_path_dapt="/dli/task/03_domain_adaptive_pretraining/evals/dapt_prediction.jsonl"

In [None]:
from megatron.core.inference.common_inference_params import CommonInferenceParams
from nemo.collections.llm.recipes.precision.mixed_precision import bf16_mixed
from nemo import lightning as nl

def trainer() -> run.Config[nl.Trainer]:
    strategy = run.Config(
        nl.MegatronStrategy,
        tensor_model_parallel_size=2
    )
    trainer = run.Config(
        nl.Trainer,
        accelerator="gpu",
        devices=2,
        num_nodes=1,
        strategy=strategy,
        plugins=bf16_mixed(),
    )
    return trainer

# Configure inference to predict on base model checkpoint
def configure_inference_base():
    return run.Partial(
        llm.generate,
        path=str(base_ckpt_path),
        trainer=trainer(),
        input_dataset=input_data,
        inference_params=CommonInferenceParams(num_tokens_to_generate=50, top_k=1),
        output_path=output_path_base,
    )

# Configure inference to predict on trained DAPT checkpoint
def configure_inference_dapt():
    return run.Partial(
        llm.generate,
        path=str(dapt_ckpt_path),
        trainer=trainer(),
        input_dataset=input_data,
        inference_params=CommonInferenceParams(num_tokens_to_generate=50, top_k=1),
        output_path=output_path_dapt,
    )


def local_executor_torchrun(nodes: int = 1, devices: int = 2) -> run.LocalExecutor:
    # Env vars for jobs are configured here
    env_vars = {
        "TORCH_NCCL_AVOID_RECORD_STREAMS": "1",
        "NCCL_NVLS_ENABLE": "0",
    }

    executor = run.LocalExecutor(ntasks_per_node=devices, launcher="torchrun", env_vars=env_vars)

    return executor

if __name__ == '__main__':
    run.run(configure_inference_base(), executor=local_executor_torchrun())
    run.run(configure_inference_dapt(), executor=local_executor_torchrun())
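
Once both prediction files exist, a quick side-by-side view makes it easier to see where the two checkpoints differ before computing aggregate metrics. This sketch assumes each output row has `label` and `prediction` fields, which are the field names passed to the metric script in Step 5; `load_jsonl` and `side_by_side` are hypothetical helpers, and the demo rows are made up.

```python
import json
from itertools import islice


def load_jsonl(path, limit=None):
    """Load up to `limit` JSON records from a JSONL file (all records if None)."""
    with open(path, "r", encoding="utf-8") as f:
        rows = (json.loads(line) for line in f if line.strip())
        return list(islice(rows, limit))


def side_by_side(base_rows, dapt_rows, field="prediction"):
    """Pair up base and DAPT outputs row by row."""
    return [
        {"label": b.get("label"), "base": b.get(field), "dapt": d.get(field)}
        for b, d in zip(base_rows, dapt_rows)
    ]


# Illustrative rows only, standing in for the contents of the two prediction files
base_rows = [{"label": "assign y = a & b;", "prediction": "assign y = a;"}]
dapt_rows = [{"label": "assign y = a & b;", "prediction": "assign y = a & b;"}]
for row in side_by_side(base_rows, dapt_rows):
    print(row)
```

With the real outputs, the call would look like `side_by_side(load_jsonl(output_path_base, 5), load_jsonl(output_path_dapt, 5))`.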

# Step 5: Calculate Evaluation Metrics

We can evaluate the model's predictions by calculating the Exact Match (EM) and F1 scores.

- Exact Match is a binary measure (0 or 1) that checks whether the model output exactly matches one of the ground-truth answers.
- F1 score is the harmonic mean of precision and recall for the answer words.

Below is a script that computes these metrics. The sample scores can be improved by training the model further and performing hyperparameter tuning. In this notebook, we only train for 20 steps.

In [None]:
# Scores from base model
!python /opt/NeMo/scripts/metric_calculation/peft_metric_calc.py --pred_file /dli/task/03_domain_adaptive_pretraining/evals/llama2_7b_base_prediction.jsonl --label_field "label" --pred_field "prediction"

In [None]:
# Scores from DAPT model
!python /opt/NeMo/scripts/metric_calculation/peft_metric_calc.py --pred_file /dli/task/03_domain_adaptive_pretraining/evals/dapt_prediction.jsonl --label_field "label" --pred_field "prediction"
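
For intuition about what these scores measure, here is a simplified sketch of token-level Exact Match and F1 for a single prediction. It is not the implementation in `peft_metric_calc.py`: the `normalize` step (lowercasing and whitespace splitting) is an assumption, and the real script may normalize text differently.

```python
from collections import Counter


def normalize(text):
    """Lowercase and split on whitespace (a deliberately simple normalization)."""
    return text.lower().split()


def exact_match(prediction, label):
    """Return 1 if the normalized token sequences are identical, else 0."""
    return int(normalize(prediction) == normalize(label))


def f1_score(prediction, label):
    """Harmonic mean of token precision and recall between prediction and label."""
    pred_tokens, label_tokens = normalize(prediction), normalize(label)
    overlap = sum((Counter(pred_tokens) & Counter(label_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(label_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("assign y = a & b;", "assign y = a & b;"))  # 1
print(f1_score("assign y = a | b;", "assign y = a & b;"))     # 5 of 6 tokens overlap, about 0.83
```

Exact Match is strict for code-like outputs, where one differing token gives a score of 0, so F1 is often the more informative of the two on short training runs like this one.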