# Domain Adaptive Pre-Training (DAPT)

## Goal

Given a foundation language model (in this case Llama-2-7B) that was pretrained on a broad, general-purpose corpus, our goal is to further pretrain it on a specific domain (in this example, ChipDesign) to improve its understanding of domain-specific language and context. This process is called Domain-Adaptive Pretraining (DAPT). Instead of training from scratch, DAPT specializes the general-purpose model on a target-domain corpus so that it adapts to the unique vocabulary, semantics, and syntax of that field.

Our primary goals with respect to DAPT are as follows:
* Improve the model’s performance and accuracy on domain-specific tasks
* Ensure the model retains general language capabilities
* Minimize pretraining time by leveraging existing knowledge in the model

DAPT typically enhances a model’s efficacy in downstream tasks for the domain by exposing it to domain-relevant texts. This pretraining phase can result in more accurate and context-aware predictions on domain-specific data, as the model gains an understanding of field-specific terminology, abbreviations, and common phrases.

# NeMo Tools and Resources

* [NeMo Framework](https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html)

# Software Requirements
* Access to latest NeMo Framework NGC Containers
* This playbook has been tested on: nvcr.io/nvidia/nemo:dev. It is expected to work similarly on other environments.


#### Launch the NeMo Framework container as follows: 

```
docker run -it -p 8080:8080 -p 8088:8088 --rm --gpus '"device=0,1"' --ipc=host --network host -v $(pwd):/workspace nvcr.io/nvidia/nemo:dev
```

#### Launch Jupyter Notebook as follows: 
```
jupyter notebook --allow-root --ip 0.0.0.0 --port 8088 --no-browser --NotebookApp.token=''

```


# Hardware Requirements

* This playbook has been tested on 2xA100 80G but can be scaled to multiple GPUs as well as multiple nodes by modifying the appropriate parameters

# Data

* In this playbook, we will leverage chip domain/hardware datasets from open-source GitHub repositories, wiki URLs, and academic papers. Data has been processed and curated using [NeMo Curator](https://github.com/NVIDIA/NeMo-Curator/tree/main) as shown in this [playbook](https://github.com/jvamaraju/ndc_dapt_playbook/tree/dapt_jv)

# Notebook Outline

* Step 1: Prepare the data for pretraining. This is a multi-step process described in detail in the Step 1 section below.

* Step 2: Download the Llama-2-7B Hugging Face checkpoint and convert it to the NeMo2 checkpoint format.

* Step 3: Continue pretraining the Llama-2-7B model using the prepared data and the custom-trained tokenizer (from the previous notebook).

# Step 0: Clone the Model Checkpoint

This notebook assumes the model has been cloned from [Hugging Face](https://huggingface.co/meta-llama/Llama-2-7b-hf) into the mounted directory ```/workspace```.

Clone the model: 
```
git lfs install
git clone https://huggingface.co/meta-llama/Llama-2-7b-hf
```
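
If Git LFS was not active during the clone, the large weight files can be left behind as small text pointer stubs. The following is a minimal sketch for spotting such stubs, assuming the standard Git LFS pointer header. The helper name `find_lfs_pointers` is hypothetical, and the demo runs on a temporary directory rather than the real checkpoint.

```python
import os
import tempfile

# First line of every Git LFS pointer file (per the Git LFS pointer spec).
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(repo_dir):
    """Return the files under repo_dir that are still Git LFS pointer stubs."""
    pointers = []
    for root, dirs, files in os.walk(repo_dir):
        # Skip Git's internal metadata directory.
        dirs[:] = [d for d in dirs if d != ".git"]
        for name in files:
            path = os.path.join(root, name)
            with open(path, "rb") as f:
                if f.read(len(LFS_HEADER)) == LFS_HEADER:
                    pointers.append(path)
    return sorted(pointers)


# Demo on a throwaway directory: one fake pointer stub and one regular file.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "model.safetensors"), "wb") as f:
    f.write(LFS_HEADER + b"\noid sha256:0000\nsize 123\n")
with open(os.path.join(demo_dir, "config.json"), "w") as f:
    f.write("{}")

stubs = find_lfs_pointers(demo_dir)
print(stubs)
```

To check the real clone, call `find_lfs_pointers("/workspace/Llama-2-7b-hf")`; an empty list suggests the weights were downloaded in full.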

# Step 1: Data Preparation for pretraining

Identify the different file types (for example, code or text) in the pretraining data. This is typically dataset dependent; in this case we only have 'code' type files.

In [None]:
import os
import json

from nemo.collections.llm import Llama2Config7B


# Function to count the number of files in each of the different file types- code, text
def identify_jsonl_files(data_path):
    code_files = []
    text_files = []
    cnt_text = 0
    cnt_code = 0
    for root, _, files in os.walk(data_path):
        for file in files:
            if file.endswith('.jsonl'):
                file_path = os.path.join(root, file)
                with open(file_path, 'r') as f:
                    has_code = False
                    has_text = False
                    for line in f:
                        try:
                            json_obj = json.loads(line.strip())
                            file_type = json_obj.get('file_type', '').lower()
                            if file_type == 'code':
                                has_code = True
                            elif file_type == 'text':
                                has_text = True
                            if has_code and has_text:
                                break
                        except json.JSONDecodeError:
                            continue
                if has_code:
                    code_files.append(file_path)
                    cnt_code = cnt_code + 1
                if has_text:
                    text_files.append(file_path)
                    cnt_text = cnt_text + 1
    return code_files, text_files, cnt_code, cnt_text

# Modify data path to point to jsonl data source, in this case data_path='code/data/all_jsonl_data'
data_path = '/workspace/dapt-custom-tokenization/code/data/all_jsonl_data'

code_files, text_files, cnt_code, cnt_text = identify_jsonl_files(data_path)

print("\nNumber of Files containing 'file_type':'text':", cnt_text)
print("Number of Files containing 'file_type':'code':", cnt_code)

### Merging code JSONL files into a single JSONL file for further preprocessing

This step is optional; the workflow also supports multiple JSONL files. This example uses a single merged JSONL file.

In [None]:
import os
import json

def list_jsonl_files(directory):
    jsonl_files = []
    for root, _, files in os.walk(directory):
        for file in files:
            if file.endswith('.jsonl'):
                jsonl_files.append(os.path.join(root, file))
    return jsonl_files

# Function to merge multiple jsonl files into a single file 
def merge_jsonl_files(directory, output_file):
    jsonl_files = list_jsonl_files(directory)
    
    with open(output_file, 'w') as outfile:
        for input_file in jsonl_files:
            with open(input_file, 'r') as infile:
                for line in infile:
                    try:
                        json_object = json.loads(line.strip())
                        json.dump(json_object, outfile)
                        outfile.write('\n')
                    except json.JSONDecodeError:
                        print(f"Skipping invalid JSON in {input_file}: {line.strip()}")

    print(f"Merged {len(jsonl_files)} JSONL files into {output_file}")

In [None]:
directory = '/workspace/dapt-custom-tokenization/code/data/all_jsonl_data'
output_file = '/workspace/dapt-custom-tokenization/code_merged_output.jsonl'
merge_jsonl_files(directory, output_file)
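
The preprocessing step later in this notebook reads the `text` key of each record (`--json-keys=text`), so it can help to confirm that the merged file actually carries non-empty `text` fields. This is a minimal sketch under that assumption; the helper name `count_records_with_key` and the demo records are hypothetical.

```python
import json
import os
import tempfile


def count_records_with_key(jsonl_path, key="text"):
    """Return (records_with_non_empty_key, total_records) for a JSONL file."""
    with_key = 0
    total = 0
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            total += 1
            record = json.loads(line)
            if isinstance(record, dict) and record.get(key):
                with_key += 1
    return with_key, total


# Demo on a small temporary file with made-up records.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl")
demo_records = [
    {"text": "module adder(input a, input b);", "file_type": "code"},
    {"file_type": "code"},              # no text field
    {"text": "", "file_type": "code"},  # empty text field
]
with open(demo_path, "w", encoding="utf-8") as f:
    for rec in demo_records:
        f.write(json.dumps(rec) + "\n")

with_key, total = count_records_with_key(demo_path)
print(f"{with_key} of {total} records have a non-empty 'text' field")
```

On the real data, pass the merged `output_file` path instead of `demo_path`.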

### Data Format Conversion for pretraining: JSONL to bin/idx files 

For efficient pretraining, we convert data from JSONL to bin/idx format. 

JSONL files, while convenient for storing structured text data, are not optimized for high-speed data loading during large language model training. In pretraining workflows, particularly those with large datasets and complex model architectures, the need for fast data access and efficient memory management is essential.

The bin/idx format is a binary format specifically designed to facilitate high-throughput data loading. This format allows direct, randomized access to data samples, which speeds up I/O operations and reduces the memory footprint compared to loading JSONL files. By converting data to bin/idx format, hardware utilization can be maximized and bottlenecks in data processing can be avoided, leading to a more efficient pretraining process.

#### Benefits of bin/idx format for Pretraining:

* **Optimized I/O Performance:** The binary format enables quicker data reads and reduces latency, allowing the model to continuously access data at high speeds.
* **Efficient Memory Usage:** Data in bin/idx format consumes less memory during loading, making it suitable for large datasets and enabling better use of available system resources.
* **Enhanced Scalability:** With bin/idx, it’s easier to handle shuffling and batching of large datasets, which is essential for pretraining on diverse domain-specific data.
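
The idea behind the bin/idx pair can be illustrated with a toy layout: token IDs are packed back to back in a binary file, and a small index records where each sample starts and how long it is. The sketch below is only an illustration of that idea using the Python standard library. It is not the actual on-disk format written by the NeMo preprocessing script, and the function names are hypothetical.

```python
import mmap
import os
import struct
import tempfile


def write_bin_idx(samples, prefix):
    """Write token-id lists to <prefix>.bin and (byte offset, length) pairs to <prefix>.idx."""
    offset = 0
    with open(prefix + ".bin", "wb") as bin_f, open(prefix + ".idx", "wb") as idx_f:
        for tokens in samples:
            # Each token is stored as a little-endian 32-bit integer.
            bin_f.write(struct.pack(f"<{len(tokens)}i", *tokens))
            # Each index entry is two little-endian 64-bit integers (16 bytes).
            idx_f.write(struct.pack("<qq", offset, len(tokens)))
            offset += 4 * len(tokens)


def read_sample(prefix, i):
    """Read sample i directly, without scanning the samples before it."""
    with open(prefix + ".idx", "rb") as idx_f:
        idx_f.seek(16 * i)
        offset, length = struct.unpack("<qq", idx_f.read(16))
    with open(prefix + ".bin", "rb") as bin_f:
        with mmap.mmap(bin_f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return list(struct.unpack_from(f"<{length}i", mm, offset))


# Demo: write three toy "documents" and fetch the last one by index.
prefix = os.path.join(tempfile.mkdtemp(), "toy")
samples = [[1, 2, 3], [10, 20], [7, 8, 9, 11]]
write_bin_idx(samples, prefix)
third = read_sample(prefix, 2)
print(third)
```

Because a reader only needs the index entry and one seek, samples can be fetched in any order, which is what makes shuffling cheap compared with parsing JSONL line by line.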

In [None]:
!ls /workspace/dapt-custom-tokenization/code/code/models/tokenizer/llama2/custom_tokenizer_init_20000.json

Modify `input` to point to the merged `jsonl` file. Similarly, modify the paths for `vocab`, `tokenizer-model`, and `merge-file` to point to the relevant files. If your data has domain-specific terminology, `tokenizer-model` should point to the custom tokenizer (trained in the custom tokenizer training notebook).

In [None]:
#### Uncomment to use custom trained tokenizer ####
# !python3 /opt/NeMo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
# --input='/workspace/dapt-custom-tokenization/code_merged_output.jsonl' \
# --json-keys=text \
# --tokenizer-library=sentencepiece \
# --vocab '/workspace/dapt-custom-tokenization/code/code/models/tokenizer/llama2/custom_tokenizer_init_20000.json/vocab.json' \
# --dataset-impl mmap \
# --tokenizer-model '/workspace/Llama-2-7b-hf/tokenizer.model' \
# --tokenizer-type llama \
# --merge-file '/workspace/dapt-custom-tokenization/code/code/models/tokenizer/llama2/custom_tokenizer_init_20000.json/merges.txt' \
# --append-eod \
# --output-prefix='preprocessed_data'

In [None]:
# Using the default Llama-2 tokenizer for testing purposes
!python3 /opt/NeMo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
--input='/workspace/dapt-custom-tokenization/code_merged_output.jsonl' \
--json-keys=text \
--tokenizer-library=sentencepiece \
--dataset-impl mmap \
--tokenizer-model '/workspace/Llama-2-7b-hf/tokenizer.model' \
--tokenizer-type llama \
--append-eod \
--output-prefix='preprocessed_data'

In [None]:
# If the above step runs successfully, two files with the extensions .bin and .idx will be generated
!ls 
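
Rather than reading the `ls` output by eye, the presence of both files can be checked programmatically. This is a minimal sketch, assuming the output files are named `preprocessed_data_text_document.bin` and `.idx` (the prefix that Step 3 passes to the data module); the helper name `missing_dataset_files` is hypothetical, and the demo uses a temporary directory.

```python
import os
import tempfile


def missing_dataset_files(prefix, directory="."):
    """Return the names of the expected .bin/.idx files that are not present."""
    expected = [prefix + ext for ext in (".bin", ".idx")]
    return [name for name in expected
            if not os.path.isfile(os.path.join(directory, name))]


# Demo: a directory that holds only the .bin file.
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "preprocessed_data_text_document.bin"), "wb").close()

missing = missing_dataset_files("preprocessed_data_text_document", demo_dir)
print("Missing files:", missing)
```

In the notebook, calling `missing_dataset_files("preprocessed_data_text_document")` from the directory where preprocessing ran should return an empty list.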

# Step 2: Download and Import Llama-2-7b Hugging Face checkpoint

The Llama-2-7B checkpoint cloned in Step 0 can be imported and converted to NeMo2 format with the following script:


In [None]:
%%writefile convert2nemo2.py
from nemo.collections import llm
from nemo.collections.llm import Llama2Config7B

if __name__ == "__main__":
    output = llm.import_ckpt(
        model=llm.LlamaModel(config=Llama2Config7B()),
        source="hf:///workspace/Llama-2-7b-hf",
    )

In [None]:
!torchrun convert2nemo2.py

The conversion generates a ```llama-2``` NeMo2 checkpoint directory in the default ```$NEMO_HOME``` folder, which can be used for continued pretraining with NeMo Toolkit as shown in Step 3. Unless otherwise specified, ```NEMO_HOME``` is set to ```/root/.cache/nemo```.

Alternatively, you can set ```source="hf://meta-llama/Llama-2-7b-hf"``` to pull the model directly from Hugging Face instead of using the locally downloaded version in ```/workspace```.

In [None]:
!ls /workspace
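
Since the converted checkpoint is written under `NEMO_HOME` rather than `/workspace`, it can be useful to compute where to look for it. This is a minimal sketch, assuming the `models/<model name>` layout implied by the restore path used in Step 3; the helper name `default_checkpoint_dir` is hypothetical.

```python
import os


def default_checkpoint_dir(model_name="Llama-2-7b-hf", env=None):
    """Build the expected checkpoint path from NEMO_HOME (or its default)."""
    env = os.environ if env is None else env
    nemo_home = env.get("NEMO_HOME", "/root/.cache/nemo")
    # Assumption: imported checkpoints live under <NEMO_HOME>/models/<model name>.
    return os.path.join(nemo_home, "models", model_name)


ckpt_dir = default_checkpoint_dir()
print(ckpt_dir, "exists:", os.path.isdir(ckpt_dir))
```

If the directory does not exist after conversion, check whether `NEMO_HOME` was overridden in the environment that ran `convert2nemo2.py`.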

# Step 3: Continued Pretraining using Llama2-7b with NeMo2

For this step we use a predefined recipe `llama2_7b.pretrain_recipe` from NeMo Toolkit for continued pretraining. We will modify the `pretrain_recipe` and use it for continued pretraining workflow. Typically this involves changing dataset files and data blends, changing learning rate scheduler, changing default parallelism based on number of devices available, adding connector to resume training, etc.

First, we define the recipe and executor for using NeMo2 as follows:

In [None]:
import nemo_run as run
from nemo.collections import llm

# Configure recipe to pre-train based on the default llama-2-7b recipe
def configure_recipe(nodes: int = 1, gpus_per_node: int = 1):
    recipe = llm.llama2_7b.pretrain_recipe(
        name="llama2_7b_dapt",
        # Modify based on number of nodes available
        num_nodes=nodes,
        num_gpus_per_node=gpus_per_node,
    )
    # Override the default parallelism and validation interval
    recipe.trainer.strategy.context_parallel_size = 1
    recipe.trainer.strategy.tensor_model_parallel_size=1
    recipe.trainer.val_check_interval = 100
    return recipe

# Executor for running pretraining 
def local_executor_torchrun(devices: int = 1) -> run.LocalExecutor:
    executor = run.LocalExecutor(ntasks_per_node=devices, launcher="torchrun")
    return executor

Let's instantiate the `recipe` and modify it so that it uses the desired number of GPUs, resuming from the pretrained Llama2-7b checkpoint instead of training from scratch.

The default `recipe` initializes all the essential components required for Llama2 7B pretraining, including the model, dataloader, trainer, logger, and optimizer. The `recipe` is not executed during instantiation, so it is simple to modify it to fit your custom training workflow. In our case, we want to run DAPT instead of pretraining from scratch, so all we need to do is add a `resume` config that points to the Llama2 7B checkpoint.

You can easily change the optimizer, parallelism, data as per your use case. Look at the following example for guidance on how to tweak these parameters. Note: you are only configuring your task at this stage; the underlying code is not executed unless you launch the job using the executor.

In [None]:
import nemo.lightning as nl
from nemo.collections.common.tokenizers import AutoTokenizer

# Instantiate data
data = run.Config(
        llm.PreTrainingDataModule,
        # Pass the path to your data here
        paths=['preprocessed_data_text_document'],
        seq_length=4096,
        tokenizer=run.Config(
            AutoTokenizer,
            pretrained_model_name="/workspace/Llama-2-7b-hf",
        ),
        micro_batch_size=1,
        global_batch_size=8,
    )


# Instantiate the recipe
recipe = configure_recipe(nodes=1, gpus_per_node=2)

# Modify resume connector
resume = run.Config(
            nl.AutoResume,
            restore_config=run.Config(nl.RestoreConfig, path="/root/.cache/nemo/models/Llama-2-7b-hf"),
        )
recipe.resume = resume
recipe.data.tokenizer = run.Config(
        AutoTokenizer,
        pretrained_model_name="/workspace/Llama-2-7b-hf"
    )

# (Optional) Modify the TP/PP/CP settings
recipe.trainer.strategy.tensor_model_parallel_size = 2
recipe.trainer.strategy.pipeline_model_parallel_size = 1
recipe.trainer.strategy.context_parallel_size = 1

# (Optional) Modify the batch size settings
recipe.data.global_batch_size = 8
recipe.data.micro_batch_size = 1

# (Optional) Modify the checkpoint and log location
recipe.log.log_dir= "/workspace/logs_01_31"

# (Optional) Modify the learning rate scheduler
recipe.optim.config.lr = 1e-5
recipe.optim.lr_scheduler.min_lr = 1e-6

# If not configured, the recipe uses mock data for pretraining
recipe.data = data

# (Optional) Modify the data blends
# recipe.data.paths = [0.2, 'path/to/data1', 0.1, 'path/to/data2']
# recipe.data.paths = [1, 'preprocessed_data_text_document']
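
The parallelism and batch size settings above are linked: the global batch size equals the micro batch size times the data-parallel size times the number of gradient accumulation steps, and the data-parallel size is the GPU count divided by the product of the tensor, pipeline, and context parallel sizes. The sketch below works through that arithmetic for the values used in this notebook; the helper name `batch_layout` is hypothetical and is not part of NeMo.

```python
def batch_layout(num_gpus, tp, pp, cp, global_batch_size, micro_batch_size):
    """Return (data_parallel_size, gradient_accumulation_steps)."""
    model_parallel = tp * pp * cp
    if num_gpus % model_parallel != 0:
        raise ValueError("num_gpus must be divisible by tp * pp * cp")
    data_parallel = num_gpus // model_parallel

    samples_per_step = micro_batch_size * data_parallel
    if global_batch_size % samples_per_step != 0:
        raise ValueError(
            "global_batch_size must be divisible by micro_batch_size * data_parallel"
        )
    return data_parallel, global_batch_size // samples_per_step


# Values from this notebook: 2 GPUs, TP=2, PP=1, CP=1, GBS=8, MBS=1.
dp, accum = batch_layout(num_gpus=2, tp=2, pp=1, cp=1,
                         global_batch_size=8, micro_batch_size=1)
print(f"data parallel size = {dp}, gradient accumulation steps = {accum}")
```

With tensor parallelism of 2 on 2 GPUs there is a single data-parallel replica, so each optimizer step accumulates 8 micro batches.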

After configuring the training procedure, we can run the training by instantiating the `executor` and using `nemo_run` to launch the job:

In [None]:
# Launch the pretraining job 
executor = local_executor_torchrun(devices=recipe.trainer.devices)
run.run(recipe, executor=executor)

### To monitor the training, launch Tensorboard from another terminal

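`tensorboard --logdir nemo_experiments --bind_all`

If you changed `recipe.log.log_dir` (it is set to `/workspace/logs_01_31` above), point `--logdir` at that directory instead.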