# NeMo Framework - Training a large language model

## Overview
Large language models (LLMs) such as ChatGPT are remarkably versatile: they can perform tasks such as induction, programming, and translation, with results comparable to or even better than those of human experts. To support pre-training such models, NVIDIA provides the NeMo Framework, which includes capabilities for pre-processing training data and distributing training efficiently across multiple GPUs.

Pre-trained language models are powerful across a variety of tasks but often lack the specialized focus needed for domain-specific applications. Fine-tuning adapts a language model to a domain-specific task. In this notebook, you will learn how to implement two types of tuning methods, **(1) Fine-tuning** and **(2) PEFT methods** such as **LoRA**, to adapt a language model to a specific downstream task using NVIDIA NeMo.

## Table of Contents

This notebook covers the following sections:
1. [Pre-training](#s1)
    - [1.1 Download dataset](#s1.1)
    - [1.2 Data preprocessing](#s1.2)
    - [1.3 Download pre-trained model for continued pre-training](#s1.3)
    - [1.4 Run pre-training](#s1.4)
2. [Instruction Tuning](#s2)
    - [2.1 Download dataset: erhwenkuo/alpaca-data-gpt4-chinese-zhtw](#s2.1)
    - [2.2 Split the data into train, validation and test](#s2.2)
    - [2.3 Full parameter fine-tuning](#s2.3)
    - [2.4 Parameter Efficient Fine-tuning](#s2.4)
3. [Evaluation](#s3)
4. [Export and Deploy a NeMo Checkpoint to TensorRT-LLM](#s4)

## 1. Pre-training <a name='s1'></a>

The first phase is model pre-training, the primary stage in which the model acquires knowledge.

### 1.1 Download dataset <a name='s1.1'></a>

In [20]:
from datasets import load_dataset
dataset = load_dataset('erhwenkuo/wikinews-zhtw')['train']
dataset.to_json('./data/custom_dataset/json/wikinews-zhtw.jsonl', force_ascii=False)

Creating json from Arrow format:   0%|          | 0/10 [00:00<?, ?ba/s]

13914259

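### 1.2 Data preprocessing <a name='s1.2'></a>

In [24]:
# Tokenize the JSONL corpus into Megatron's indexed (mmap) dataset format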
!mkdir -p data/custom_dataset/preprocessed

!python /opt/NeMo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
--input data/custom_dataset/json/wikinews-zhtw.jsonl \
--json-keys text \
--dataset-impl mmap \
--tokenizer-library huggingface \
--tokenizer-type=/workspace/tokenizer-llama32-3B \
--output-prefix data/custom_dataset/preprocessed/wikinews \
--append-eod 

[NeMo I 2024-12-10 06:59:48 tokenizer_utils:178] Getting HuggingFace AutoTokenizer with pretrained_model_name: /workspace/tokenizer-llama32-3B
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Vocab size: 128256
Output prefix: data/custom_dataset/preprocessed/wikinews
Time to startup: 0.2579519748687744
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Processing file data/custom_dataset/json/wikinews-zhtw.jsonl 1/1
[NeMo I 2024-12-10 06:59:48 tokenizer_utils:178] Getting HuggingFace AutoTokenizer with pretrained_model_name: /workspace/tokenizer-llama32-3B
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Processed 100 docu
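
As a quick sanity check after preprocessing, the sketch below lists the files the training step is expected to read. It assumes the script names its outputs `<output-prefix>_<json-key>_document.bin` and `.idx`, which matches the `wikinews_text_document` prefix used for training later in this notebook. `expected_index_files` is a hypothetical helper, not part of NeMo.

```python
from pathlib import Path


def expected_index_files(output_prefix: str, json_key: str = "text") -> list[Path]:
    """Return the .bin/.idx paths assumed to be produced for one JSON key."""
    # Assumed naming convention: <output-prefix>_<json-key>_document.{bin,idx}
    stem = f"{output_prefix}_{json_key}_document"
    return [Path(stem + ".bin"), Path(stem + ".idx")]


files = expected_index_files("data/custom_dataset/preprocessed/wikinews")
for path in files:
    # Report whether each expected file is present on disk
    print(path.as_posix(), "found" if path.exists() else "missing")
```

If either file is reported as missing, check the preprocessing log before starting training.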

### 1.3 Download pre-trained model for continued pre-training <a name='s1.3'></a>

In [30]:
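%%bash
# Replace the placeholder with your own Hugging Face access token; never publish a real token.
export HF_TOKEN='<your_hf_token>'
HF_MODEL=meta-llama/Llama-3.2-3B-Instruct  # model ID to download from the Hugging Face Hub

huggingface-cli download $HF_MODEL

# Note: the converter version used for the recorded run below rejected the
# --llama3/--llama31 flag; drop it if you see "unrecognized arguments".
python /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py \
--input_name_or_path $HF_MODEL \
--output_path Llama-3.2-3B-Instruct.nemo \
--llama31 True \
--precision bf16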

Fetching 16 files: 100%|██████████| 16/16 [00:00<00:00, 170760.47it/s]


/root/.cache/huggingface/hub/models--meta-llama--Llama-3.2-3B-Instruct/snapshots/0cb88a4f764b7a12671c53f0838cd831a0843b95


usage: convert_llama_hf_to_nemo.py [-h] --input_name_or_path
                                   INPUT_NAME_OR_PATH --output_path
                                   OUTPUT_PATH [--hparams_file HPARAMS_FILE]
                                   [--precision PRECISION]
convert_llama_hf_to_nemo.py: error: unrecognized arguments: --llama3 True


CalledProcessError: Command 'b"export HF_TOKEN='<your_hf_token>'\nHF_MODEL=meta-llama/Llama-3.2-3B-Instruct ## Download from HF\n\nhuggingface-cli download $HF_MODEL\n\npython /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py \\\n--input_name_or_path $HF_MODEL \\\n--output_path Llama-3.2-3B-Instruct.nemo \\\n--llama3 True \\\n--precision bf16\n"' returned non-zero exit status 2.

In [19]:
# Set environment variables (replace the placeholder with your own Hugging Face token)
%env HF_MODEL=meta-llama/Llama-3.2-3B-Instruct
%env HF_TOKEN=<your_hf_token>

# Run the conversion script
!python /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py \
    --input_name_or_path $HF_MODEL \
    --output_path Llama-3.2-3B-Instruct.nemo

env: HF_MODEL=meta-llama/Llama-3.2-3B-Instruct
env: HF_TOKEN=<your_hf_token>
[NeMo I 2024-12-10 06:41:22 convert_llama_hf_to_nemo:111] loading checkpoint meta-llama/Llama-3.2-3B-Instruct
    
Traceback (most recent call last):
  File "/opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py", line 312, in <module>
    convert(args)
  File "/opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py", line 112, in convert
    model = LlamaForCausalLM.from_pretrained(args.input_name_or_path)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2981, in from_pretrained
    config, model_kwargs = cls.config_class.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 611, in from_pretrained
    return cls.from_dict(config_dict, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 763, in from_dict
    config = cls(**c

### 1.4 Run pre-training <a name='s1.4'></a>

In [None]:
# Update the continued-training script for nemo:24.05

file_path = '/opt/NeMo/examples/nlp/language_modeling/megatron_gpt_continue_training.py'
insert_line = 158  # 0-based list index at which the new line is inserted
new_line = "        cfg.trainer.precision = None\n"

with open(file_path, 'r', encoding='utf-8') as file:
    lines = file.readlines()

# Insert only once, so re-running this cell does not duplicate the line
if new_line not in lines:
    lines.insert(insert_line, new_line)
    with open(file_path, 'w', encoding='utf-8') as file:
        file.writelines(lines)

In [None]:
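# MODEL must be an existing .nemo checkpoint. Section 1.3 writes Llama-3.2-3B-Instruct.nemo,
# so adjust MODEL_NAME and MODEL to match the checkpoint you actually converted.
%env MODEL_NAME=Llama-3.1-8B
%env MODEL=Llama-3.1-8B-Instruct.nemo
%env NUM_GPUS=2
%env MAX_STEPS=100
%env MBS=1
# GBS must be a multiple of MBS x data-parallel size (NUM_GPUS / (TP * PP) = 2 here)
%env GBS=2
%env TP=1
%env PP=1
%env LR=1e-4
%env DATA_SPLITS='9990,8,2'
%env DATA_PREFIX=[1.0,data/custom_dataset/preprocessed/wikinews_text_document]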

!python /opt/NeMo/examples/nlp/language_modeling/megatron_gpt_continue_training.py \
--config-path=/opt/NeMo-Framework-Launcher/launcher_scripts/conf/training/llama --config-name=llama2_7b \
+restore_from_path=$MODEL \
+base_results_dir=results \
+model.seq_len_interpolation_factor=null \
trainer.num_nodes=1 \
trainer.devices=$NUM_GPUS \
trainer.precision=16 \
trainer.max_steps=$MAX_STEPS \
trainer.limit_val_batches=32 \
trainer.val_check_interval=100 \
exp_manager.explicit_log_dir=/workspace/results/$MODEL_NAME/Pretraining \
exp_manager.wandb_logger_kwargs.name=$MODEL_NAME \
exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \
exp_manager.checkpoint_callback_params.model_parallel_size=$(($TP*$PP)) \
+exp_manager.checkpoint_callback_params.every_n_train_steps=50 \
+exp_manager.checkpoint_callback_params.every_n_epochs=null \
exp_manager.checkpoint_callback_params.monitor="epoch" \
exp_manager.checkpoint_callback_params.save_top_k=-1 \
model.micro_batch_size=$MBS \
model.global_batch_size=$GBS \
model.tensor_model_parallel_size=$TP \
model.pipeline_model_parallel_size=$PP \
model.tokenizer.library=huggingface \
model.tokenizer.type=tokenizer \
model.tokenizer.model=null \
model.optim.lr=$LR \
model.data.splits_string=${DATA_SPLITS} \
model.data.data_prefix=${DATA_PREFIX} \
model.data.num_workers=0 \
model.data.seq_length=1024
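
The batch-size settings interact with the parallelism settings. The following sketch shows the arithmetic, assuming the usual Megatron convention that the global batch size must be a multiple of the micro batch size times the data-parallel size. `grad_accumulation_steps` is a hypothetical helper, not a NeMo function.

```python
def grad_accumulation_steps(num_gpus: int, tp: int, pp: int,
                            micro_batch_size: int, global_batch_size: int) -> int:
    """Return how many micro-batches each data-parallel rank accumulates per step."""
    model_parallel = tp * pp
    if num_gpus % model_parallel != 0:
        raise ValueError("num_gpus must be divisible by TP * PP")
    # GPUs not used for tensor/pipeline parallelism hold data-parallel replicas
    data_parallel = num_gpus // model_parallel
    samples_per_micro_step = micro_batch_size * data_parallel
    if global_batch_size % samples_per_micro_step != 0:
        raise ValueError("GBS must be a multiple of MBS * data-parallel size")
    return global_batch_size // samples_per_micro_step


# Two GPUs, no model parallelism: data-parallel size is 2
print(grad_accumulation_steps(num_gpus=2, tp=1, pp=1,
                              micro_batch_size=1, global_batch_size=2))
# One GPU with GBS=16 and MBS=1: gradients are accumulated over 16 micro-batches
print(grad_accumulation_steps(num_gpus=1, tp=1, pp=1,
                              micro_batch_size=1, global_batch_size=16))
```

With two GPUs and `TP=PP=1`, the data-parallel size is 2, so under this assumption a global batch size of 1 is rejected while 2 is accepted.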

## 2. Instruction Tuning <a name='s2'></a>

We will use [erhwenkuo/alpaca-data-gpt4-chinese-zhtw](https://huggingface.co/datasets/erhwenkuo/alpaca-data-gpt4-chinese-zhtw), a dataset of Traditional Chinese (zh-tw) instruction-following examples generated by GPT-4 from Alpaca prompts, for fine-tuning LLMs.

The dataset was originally shared in [this repository](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM). This version is a translation from English to Chinese.

### 2.1 Download dataset: erhwenkuo/alpaca-data-gpt4-chinese-zhtw <a name='s2.1'></a>
Let's download the dataset and save it as a JSON Lines file first.

In [None]:
import os
import json
from datasets import load_dataset
dataset = load_dataset('erhwenkuo/alpaca-data-gpt4-chinese-zhtw')['train']
output_path = 'data/alpaca/gpt4-chinese-zhtw.jsonl'
os.makedirs(os.path.dirname(output_path), exist_ok=True)

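# Write UTF-8 explicitly, because ensure_ascii=False emits raw Chinese characters
with open(output_path, 'w', encoding='utf-8') as f:
    for human_instruction, human_input, assistant_output in zip(dataset['instruction'], dataset['input'], dataset['output']):
        # Merge the instruction and the optional input into one prompt;
        # the outer strip() removes the trailing newline when the input is empty
        prompt = '\n'.join([human_instruction.strip(), human_input.strip()]).strip()
        record = {'input': prompt, 'output': assistant_output.strip()}
        f.write(json.dumps(record, ensure_ascii=False) + '\n')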

In [None]:
!head -n 1 data/alpaca/gpt4-chinese-zhtw.jsonl
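
To make the record format concrete, here is a small sketch of the same transformation applied to a single made-up example. `to_record` and the sample strings are illustrative only and are not taken from the dataset.

```python
import json


def to_record(instruction: str, extra_input: str, output: str) -> str:
    """Build one JSONL line with 'input' and 'output' keys."""
    # Join instruction and input with a newline; strip() drops the
    # dangling newline when extra_input is empty
    prompt = "\n".join([instruction.strip(), extra_input.strip()]).strip()
    return json.dumps({"input": prompt, "output": output.strip()}, ensure_ascii=False)


# Hypothetical sample with a non-empty input field
line = to_record("Translate to English.", "你好", " Hello ")
print(line)

# Hypothetical sample with an empty input field
print(to_record("Say hi.", "", "Hi"))
```

Because `ensure_ascii=False` is used, Chinese characters are written as-is rather than as `\uXXXX` escapes, which keeps the file readable.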

### 2.2 Split the data into train, validation and test <a name='s2.2'></a>

Generate the train, validation, and test splits. You may use your own script, or save the following sample as `split_train_val.py` and run it; it reads the JSONL file from `data/alpaca` and writes the three splits to the same directory.

In [None]:
import json
import random

input_file = "data/alpaca/gpt4-chinese-zhtw.jsonl"
training_output_file = "data/alpaca/training.jsonl"
validation_output_file = "data/alpaca/validation.jsonl"
test_output_file = "data/alpaca/test.jsonl"

# Specify the proportion of data for training, validation, and test
train_proportion = 0.98
validation_proportion = 0.01
test_proportion = 0.01  # informational: the test split takes whatever remains

# Read the JSONL file and shuffle the JSON objects
with open(input_file, "r", encoding="utf-8") as f:
    lines = f.readlines()
    random.shuffle(lines)

# Calculate split indices
total_lines = len(lines)
train_index = int(total_lines * train_proportion)
val_index = int(total_lines * validation_proportion)

# Distribute JSON objects into training, validation, and test sets
train_data = lines[:train_index]
validation_data = lines[train_index:train_index+val_index]
test_data = lines[train_index+val_index:]

# Write JSON objects to training file
with open(training_output_file, "w", encoding="utf-8") as f:
    for line in train_data:
        f.write(line.strip() + "\n")

# Write JSON objects to validation file
with open(validation_output_file, "w", encoding="utf-8") as f:
    for line in validation_data:
        f.write(line.strip() + "\n")

# Write JSON objects to test file
with open(test_output_file, "w", encoding="utf-8") as f:
    for line in test_data:
        f.write(line.strip() + "\n")

In [None]:
# What the dataset looks like after splitting
!head -1 data/alpaca/training.jsonl
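
`random.shuffle` without a seed produces a different split on every run. If you need a reproducible split, one possible variant is sketched below. `split_lines` and its `seed` parameter are hypothetical names, and the default proportions mirror the ones used above.

```python
import random


def split_lines(lines, train=0.98, validation=0.01, seed=0):
    """Shuffle with a fixed seed and return (train, validation, test) lists."""
    shuffled = list(lines)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    # The test split takes everything left after train and validation
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


rows = [f"row {i}" for i in range(1000)]  # stand-in for the JSONL lines
train_set, val_set, test_set = split_lines(rows)
print(len(train_set), len(val_set), len(test_set))
```

Using a private `random.Random(seed)` instance keeps the split stable without changing the global random state used elsewhere in the notebook.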

### 2.3 Full parameter fine-tuning <a name='s2.3'></a>

In [None]:
%env MODEL_NAME=TinyLlama-1.1B
%env MODEL=results/TinyLlama-1.1B/Pretraining/checkpoints/megatron_llama.nemo
%env NUM_GPUS=1
%env MAX_STEPS=100
%env VAL_INTERVAL=1.0
%env GBS=16
%env MBS=1
%env TP=1
%env PP=1
%env LR=1e-4
%env TRAIN_DS=[data/alpaca/training.jsonl]
%env VALID_DS=[data/alpaca/validation.jsonl]
%env TEST_DS=[data/alpaca/test.jsonl]
%env CONCAT_SAMPLING_PROBS=[1.0]
%env PROMPT_TEMPLATE="<|user|>\n{input}</s>\n<|assistant|>\n{output}"

!python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
--config-path=/opt/NeMo/examples/nlp/language_modeling/tuning/conf --config-name=megatron_gpt_finetuning_config \
trainer.devices=$NUM_GPUS \
trainer.max_steps=$MAX_STEPS \
trainer.precision=16 \
trainer.val_check_interval=$VAL_INTERVAL \
exp_manager.explicit_log_dir=results/$MODEL_NAME/SFT \
exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \
model.tensor_model_parallel_size=$TP \
model.pipeline_model_parallel_size=$PP \
model.restore_from_path=$MODEL \
model.global_batch_size=$GBS \
model.micro_batch_size=$MBS \
model.data.train_ds.file_names=${TRAIN_DS} \
model.data.validation_ds.file_names=${VALID_DS} \
model.data.test_ds.file_names=${TEST_DS} \
model.data.train_ds.num_workers=0 \
model.data.validation_ds.num_workers=0 \
model.data.test_ds.num_workers=0 \
model.data.train_ds.concat_sampling_probabilities=${CONCAT_SAMPLING_PROBS} \
model.data.train_ds.prompt_template="$PROMPT_TEMPLATE" \
model.data.train_ds.max_seq_length=1024 \
model.optim.lr=$LR \
model.peft.peft_scheme=null
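
The prompt template determines what text the model sees during fine-tuning. The sketch below shows, under simplifying assumptions, how a record from the training file would fill the template: everything before `{output}` forms the prompt, and the `output` field is the target text. It assumes `\n` in the template stands for a newline, and it omits the tokenization and truncation that NeMo's data pipeline performs. `render` is a hypothetical helper.

```python
PROMPT_TEMPLATE = "<|user|>\n{input}</s>\n<|assistant|>\n{output}"


def render(template: str, record: dict) -> tuple[str, str]:
    """Return (prompt, target) for one record with 'input' and 'output' keys."""
    # Text before the {output} placeholder is the context given to the model
    prefix, _, _ = template.partition("{output}")
    # str.replace is used instead of str.format so stray braces in the data are harmless
    prompt = prefix.replace("{input}", record["input"])
    return prompt, record["output"]


# Made-up record for illustration
prompt, target = render(PROMPT_TEMPLATE, {"input": "Hi", "output": "Hello"})
print(repr(prompt))
print(repr(target))
```

The same template should be used at evaluation time so that the model is prompted in the format it was trained on.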

### 2.4 Parameter Efficient Fine-tuning <a name='s2.4'></a>
Fine-tuning a language model can be computationally expensive and risks overfitting, especially with small, specialized datasets. Parameter-efficient fine-tuning (PEFT) methods such as LoRA offer a solution. These techniques adapt the model to specific tasks by training only a small number of parameters, reducing computational cost and mitigating the risk of overfitting. In essence, LoRA enables more efficient and targeted adaptation of large language models to specialized tasks.
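
A rough parameter count illustrates the saving. The sketch assumes the standard LoRA formulation, in which a frozen weight matrix of shape `d_out x d_in` receives a trainable update `B @ A`, with `A` of shape `r x d_in` and `B` of shape `d_out x r`. The hidden size of 2048 is a made-up example value, and the rank of 32 assumes that `adapter_dim=32` in this section corresponds to the LoRA rank.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of one LoRA adapter: A (rank x d_in) plus B (d_out x rank)."""
    return rank * d_in + d_out * rank


hidden = 2048   # hypothetical hidden size, for illustration only
rank = 32       # assumed to match adapter_dim

full = hidden * hidden                            # parameters in one square projection matrix
lora = lora_trainable_params(hidden, hidden, rank)
print(f"full matrix: {full:,} parameters")
print(f"LoRA adapter: {lora:,} parameters ({lora / full:.2%} of the full matrix)")
```

The ratio shrinks further as the hidden size grows, because the adapter scales linearly with the layer width while the full matrix scales quadratically.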

In [None]:
%env MODEL_NAME=TinyLlama-1.1B
%env MODEL=results/TinyLlama-1.1B/Pretraining/checkpoints/megatron_llama.nemo
%env NUM_GPUS=1
%env MAX_STEPS=10
%env VAL_INTERVAL=1.0
%env GBS=16
%env MBS=1
%env TP=1
%env PP=1
%env LR=1e-4
%env TRAIN_DS=[data/alpaca/training.jsonl]
%env VALID_DS=[data/alpaca/validation.jsonl]
%env TEST_DS=[data/alpaca/test.jsonl]
%env CONCAT_SAMPLING_PROBS=[1.0]
%env PROMPT_TEMPLATE="<|user|>\n{input}</s>\n<|assistant|>\n{output}"

!python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
--config-path=/opt/NeMo/examples/nlp/language_modeling/tuning/conf --config-name=megatron_gpt_finetuning_config \
trainer.devices=$NUM_GPUS \
trainer.max_steps=$MAX_STEPS \
trainer.precision=16 \
trainer.val_check_interval=$VAL_INTERVAL \
exp_manager.explicit_log_dir=/workspace/results/$MODEL_NAME/PEFT \
exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \
model.tensor_model_parallel_size=$TP \
model.pipeline_model_parallel_size=$PP \
model.restore_from_path=$MODEL \
model.global_batch_size=$GBS \
model.micro_batch_size=$MBS \
model.data.train_ds.file_names=${TRAIN_DS} \
model.data.validation_ds.file_names=${VALID_DS} \
model.data.test_ds.file_names=${TEST_DS} \
model.data.train_ds.num_workers=0 \
model.data.validation_ds.num_workers=0 \
model.data.test_ds.num_workers=0 \
model.data.train_ds.concat_sampling_probabilities=${CONCAT_SAMPLING_PROBS} \
model.data.train_ds.prompt_template="$PROMPT_TEMPLATE" \
model.data.train_ds.max_seq_length=1024 \
model.optim.lr=$LR \
model.peft.peft_scheme=lora \
model.peft.lora_tuning.adapter_dim=32

## 3. Evaluation <a name='s3'></a>

If you want to evaluate an SFT .nemo file:

In [None]:
%env MODEL_NAME=TinyLlama-1.1B
%env MODEL=results/TinyLlama-1.1B/SFT/checkpoints/megatron_gpt_peft_None_tuning.nemo
%env NUM_GPUS=1
%env TEST_DS=[data/alpaca/test.jsonl]
%env OUTPUT=data/alpaca/prediction
%env PROMPT_TEMPLATE="<|user|>\n{input}</s>\n<|assistant|>\n{output}"

!python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_generate.py \
trainer.precision=16 \
trainer.devices=$NUM_GPUS \
model.restore_from_path=$MODEL \
model.tensor_model_parallel_size=$NUM_GPUS \
model.pipeline_model_parallel_size=1 \
model.megatron_amp_O2=False \
model.peft.restore_from_path=null \
model.data.test_ds.file_names=$TEST_DS \
model.data.test_ds.names=\['alpaca_test'] \
model.data.test_ds.global_batch_size=32 \
model.data.test_ds.micro_batch_size=1 \
model.data.test_ds.tokens_to_generate=30 \
model.data.test_ds.label_key='output' \
model.data.test_ds.add_eos=True \
model.data.test_ds.add_sep=False \
model.data.test_ds.add_bos=False \
model.data.test_ds.truncation_field="input" \
model.data.test_ds.prompt_template="$PROMPT_TEMPLATE" \
model.data.test_ds.write_predictions_to_file=True \
model.data.test_ds.output_file_path_prefix=$OUTPUT

In [None]:
import json

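def modify_and_overwrite_jsonl(file_path):
    """Rewrite a JSONL file in place so non-ASCII text (e.g. Chinese) is stored unescaped."""
    data_list = []
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            data = json.loads(line)
            data_list.append(data)
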
    with open(file_path, 'w', encoding='utf-8') as file:
        for data in data_list:
            json_line = json.dumps(data, ensure_ascii=False) + "\n"
            file.write(json_line)

file_path = "/workspace/data/alpaca/prediction_test_alpaca_test_inputs_preds_labels.jsonl"
modify_and_overwrite_jsonl(file_path)
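
Once the predictions are written, a simple first number to look at is an exact-match rate. The sketch assumes each line of the predictions file is a JSON object with `pred` and `label` keys, as the `inputs_preds_labels` file name suggests; verify the key names against your own file. `exact_match_rate` is a hypothetical helper, and the sample lines are made up.

```python
import json


def exact_match_rate(jsonl_lines, pred_key="pred", label_key="label"):
    """Fraction of rows whose prediction equals the label after stripping whitespace."""
    total = 0
    hits = 0
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        total += 1
        if row[pred_key].strip() == row[label_key].strip():
            hits += 1
    return hits / total if total else 0.0


# Made-up rows standing in for the predictions file
sample = [
    '{"input": "q1", "pred": "a", "label": "a"}',
    '{"input": "q2", "pred": "b", "label": "c"}',
]
print(exact_match_rate(sample))
```

To score a real file, pass an open file handle, for example `exact_match_rate(open(file_path, encoding="utf-8"))`. Exact match is strict for free-form generation, so treat it as a smoke test rather than a quality measure.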

If you want to evaluate a PEFT model, you should provide both the base GPT model and the PEFT model `.nemo` file:

In [None]:
%env MODEL_NAME=TinyLlama-1.1B
%env MODEL=results/TinyLlama-1.1B/Pretraining/checkpoints/megatron_llama.nemo
%env PEFT_MODEL=results/TinyLlama-1.1B/PEFT/checkpoints/megatron_gpt_peft_lora_tuning.nemo
%env NUM_GPUS=1
%env TEST_DS=[data/alpaca/test.jsonl]
%env OUTPUT=/workspace/data/alpaca/prediction_peft
%env PROMPT_TEMPLATE="<|user|>\n{input}</s>\n<|assistant|>\n{output}"

!python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_generate.py \
trainer.precision=16 \
trainer.devices=$NUM_GPUS \
model.restore_from_path=$MODEL \
model.megatron_amp_O2=False \
model.peft.restore_from_path=$PEFT_MODEL \
model.peft.peft_scheme=lora \
model.data.test_ds.file_names=$TEST_DS \
model.data.test_ds.names=\['alpaca_test'] \
model.data.test_ds.global_batch_size=32 \
model.data.test_ds.micro_batch_size=1 \
model.data.test_ds.tokens_to_generate=30 \
model.data.test_ds.label_key='output' \
model.data.test_ds.add_eos=True \
model.data.test_ds.add_sep=False \
model.data.test_ds.add_bos=False \
model.data.test_ds.truncation_field="input" \
model.data.test_ds.prompt_template="$PROMPT_TEMPLATE" \
model.data.test_ds.write_predictions_to_file=True \
model.data.test_ds.output_file_path_prefix=$OUTPUT

In [None]:
file_path = "/workspace/data/alpaca/prediction_peft_test_alpaca_test_inputs_preds_labels.jsonl"
modify_and_overwrite_jsonl(file_path)

## 4. Export and Deploy a NeMo Checkpoint to TensorRT-LLM <a name='s4'></a>

Open a terminal and run the following command:

```sh
python /opt/NeMo/scripts/deploy/nlp/deploy_triton.py \
--nemo_checkpoint /workspace/results/TinyLlama-1.1B/SFT/checkpoints/megatron_gpt_peft_None_tuning.nemo \
--model_type llama \
--dtype float16 \
--triton_model_name TinyLlama
```

The command above launches an inference server. Keep it running, then run the following cell to send a request to the server.

In [None]:
!python /opt/NeMo/scripts/deploy/nlp/query.py \
--url "http://localhost:8000" \
--model_name TinyLlama \
--prompt '<|system|>\nYou are a helpful chatbot.</s>\n<|user|>\nHi, how are you?</s>\n<|assistant|>\n'

## Clear your data

In [None]:
!rm -rf data results TinyLlama-1.1B-Chat-v1.0