# NeMo Framework - Training a large language model

## Overview
Large language models (LLMs) such as ChatGPT are remarkably versatile: they can perform tasks such as induction, programming, and translation, with results comparable to, or in some cases better than, those of human experts. To support pre-training such models, NVIDIA provides the NeMo Framework, which includes capabilities for pre-processing training data and distributing training efficiently across multiple GPUs.

Pre-trained language models are powerful across a variety of tasks but often lack the specialized focus needed for domain-specific applications. Fine-tuning can be used to adapt a language model to such a task. In this notebook, you will learn how to implement two types of tuning methods, **(1) Fine-tuning** and **(2) PEFT methods** such as **LoRA**, to adapt a language model to a specific downstream task using NVIDIA NeMo.

## Table of Contents

This course covers the following sections:
1. [Pre-training](#s1)
    - [1.1 Download dataset](#s1.1)
    - [1.2 Data preprocessing](#s1.2)
    - [1.3 Download pre-trained model for continued pre-training](#s1.3)
    - [1.4 Run pre-training](#s1.4)
    
    
2. [Instruction Tuning](#s2)
    - [2.1 Download dataset: erhwenkuo/alpaca-data-gpt4-chinese-zhtw](#s2.1)
    - [2.2 Split the data into train, validation and test](#s2.2)
    - [2.3 Full parameter fine-tuning](#s2.3)
    - [2.4 Parameter-Efficient Fine-tuning](#s2.4)


3. [Evaluation](#s3)

## 1. Pre-training <a name='s1'></a>

The first phase is model pre-training, the primary stage in which the model acquires knowledge.

### 1.1 Download dataset <a name='s1.1'></a>

In [None]:
from datasets import load_dataset
dataset = load_dataset('erhwenkuo/wikinews-zhtw')['train']
dataset.to_json('./data/custom_dataset/json/wikinews-zhtw.jsonl', force_ascii=False)

Generating train split: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9827/9827 [00:00<00:00, 84785.62 examples/s]
Creating json from Arrow format: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 78.31ba/s]


13914259
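
Before preprocessing, it can help to confirm that the exported file is valid JSON Lines and that each record carries the `text` field that the preprocessing step reads. A minimal sketch, assuming the path used in the export cell above; `peek_jsonl` is a hypothetical helper, not part of `datasets` or NeMo:

```python
import json
from itertools import islice


def peek_jsonl(path, n=3, key="text"):
    """Return the first n records of a JSON Lines file, checking that `key` is present."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(islice(f, n), start=1):
            record = json.loads(line)  # raises ValueError on a malformed line
            if key not in record:
                raise KeyError(f"line {line_no} has no '{key}' field")
            records.append(record)
    return records
```

For example, `peek_jsonl('./data/custom_dataset/json/wikinews-zhtw.jsonl')[0]['text'][:80]` would show the start of the first article.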

### 1.2 Data preprocessing <a name='s1.2'></a>

In [None]:
# Data preprocessing
!mkdir -p data/custom_dataset/preprocessed

!python /opt/NeMo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
--input data/custom_dataset/json/wikinews-zhtw.jsonl \
--json-keys text \
--dataset-impl mmap \
--tokenizer-library huggingface \
--tokenizer-type=/workspace/tokenizer-llama31-8B-Instruct \
--output-prefix data/custom_dataset/preprocessed/wikinews \
--append-eod
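
With `--output-prefix data/custom_dataset/preprocessed/wikinews` and `--json-keys text`, the script is expected to write an indexed dataset as a `.bin`/`.idx` pair named `<prefix>_<key>_document`, which matches the `wikinews_text_document` data prefix used for pre-training in section 1.4. A small sketch to check for those files; the helpers are hypothetical and the naming pattern is an assumption inferred from that prefix:

```python
from pathlib import Path


def expected_outputs(output_prefix, json_key="text"):
    """Paths of the .bin/.idx pair assumed to be produced for one JSON key."""
    stem = f"{output_prefix}_{json_key}_document"
    return [Path(stem + ".bin"), Path(stem + ".idx")]


def missing_outputs(output_prefix, json_key="text"):
    """Return the expected files that do not exist yet."""
    return [p for p in expected_outputs(output_prefix, json_key) if not p.exists()]


missing = missing_outputs("data/custom_dataset/preprocessed/wikinews")
print("missing files:", [str(p) for p in missing])
```

An empty list suggests preprocessing finished and the data prefix can be passed to the training script.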


### 1.3 Download pre-trained model for continued pre-training <a name='s1.3'></a>

In [None]:
%%bash
export HF_TOKEN='<your_hf_token>'  # placeholder: set your own Hugging Face access token
HF_MODEL=meta-llama/Llama-3.1-8B-Instruct

# Make sure the model is fully downloaded
huggingface-cli download $HF_MODEL --resume

# Run the conversion
python /opt/NeMo/scripts/checkpoint_converters/convert_llama_hf_to_nemo.py \
--input_name_or_path $HF_MODEL \
--output_path Llama-3.1-8B-Instruct.nemo \
--llama31 True \
--precision bf16

Fetching 17 files:   0%|          | 0/17 [00:00<?, ?it/s]Downloading 'LICENSE' to '/root/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/blobs/a7c3ca16cee30425ed6ad841a809590f2bcbf290.incomplete'
Downloading 'README.md' to '/root/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/blobs/bbd5630a05b65c1a8b25141bd11ec44844107d58.incomplete'
Download complete. Moving file to /root/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/blobs/a7c3ca16cee30425ed6ad841a809590f2bcbf290
Download complete. Moving file to /root/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/blobs/bbd5630a05b65c1a8b25141bd11ec44844107d58
Downloading 'model-00003-of-00004.safetensors' to '/root/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/blobs/fc1cdddd6bfa91128d6e94ee73d0ce62bfcdb7af29e978ddcab30c66ae9ea7fa.incomplete'
Downloading 'model-00004-of-00004.safetensors' to '/root/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruc

/root/.cache/huggingface/hub/models--meta-llama--Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659




[NeMo I 2024-12-12 10:01:31 convert_llama_hf_to_nemo:128] loading checkpoint meta-llama/Llama-3.1-8B-Instruct


Loading checkpoint shards: 100%|██████████| 4/4 [00:03<00:00,  1.22it/s]


hf_config: {'vocab_size': 128256, 'max_position_embeddings': 131072, 'hidden_size': 4096, 'intermediate_size': 14336, 'num_hidden_layers': 32, 'num_attention_heads': 32, 'num_key_value_heads': 8, 'hidden_act': 'silu', 'initializer_range': 0.02, 'rms_norm_eps': 1e-05, 'pretraining_tp': 1, 'use_cache': True, 'rope_theta': 500000.0, 'rope_scaling': {'factor': 8.0, 'low_freq_factor': 1.0, 'high_freq_factor': 4.0, 'original_max_position_embeddings': 8192, 'rope_type': 'llama3'}, 'attention_bias': False, 'attention_dropout': 0.0, 'mlp_bias': False, 'head_dim': 128, 'return_dict': True, 'output_hidden_states': False, 'output_attentions': False, 'torchscript': False, 'torch_dtype': torch.bfloat16, 'use_bfloat16': False, 'tf_legacy_loss': False, 'pruned_heads': {}, 'tie_word_embeddings': False, 'chunk_size_feed_forward': 0, 'is_encoder_decoder': False, 'is_decoder': False, 'cross_attention_hidden_size': None, 'add_cross_attention': False, 'tie_encoder_decoder': False, 'max_length': 20, 'min_len
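
The `rope_scaling` entry in this config (`rope_type: 'llama3'`, `factor: 8.0`, `low_freq_factor: 1.0`, `high_freq_factor: 4.0`, `original_max_position_embeddings: 8192`) describes how Llama 3.1 stretches its rotary frequencies to support the 131072-token context. The sketch below follows the Llama 3.1 scaling rule as it is commonly described; it is a plain-Python illustration, not the code NeMo runs, and the function name is hypothetical:

```python
import math


def llama3_scaled_inv_freqs(head_dim=128, theta=500000.0, factor=8.0,
                            low_freq_factor=1.0, high_freq_factor=4.0,
                            original_max_pos=8192):
    """Rotary inverse frequencies after Llama-3.1-style scaling."""
    low_freq_wavelen = original_max_pos / low_freq_factor
    high_freq_wavelen = original_max_pos / high_freq_factor
    scaled = []
    for i in range(0, head_dim, 2):
        freq = 1.0 / (theta ** (i / head_dim))
        wavelen = 2 * math.pi / freq
        if wavelen < high_freq_wavelen:
            scaled.append(freq)  # short wavelengths are left unchanged
        elif wavelen > low_freq_wavelen:
            scaled.append(freq / factor)  # long wavelengths are stretched by `factor`
        else:
            # Smoothly interpolate between the two regimes.
            smooth = (original_max_pos / wavelen - low_freq_factor) / (
                high_freq_factor - low_freq_factor)
            scaled.append((1 - smooth) * freq / factor + smooth * freq)
    return scaled


freqs = llama3_scaled_inv_freqs()
print(len(freqs), freqs[0], freqs[-1])
```

The defaults mirror `head_dim`, `rope_theta`, and the `rope_scaling` values printed in the config; only the lowest frequencies are divided by the full factor of 8.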

[NeMo W 2024-12-12 10:01:37 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/_graveyard/precision.py:49: The `MixedPrecisionPlugin` is deprecated. Use `pytorch_lightning.plugins.precision.MixedPrecision` instead.
    
    


- model.layers.16.self_attn.q_proj.weight
- model.layers.16.self_attn.k_proj.weight
- model.layers.16.self_attn.v_proj.weight
- model.layers.16.self_attn.o_proj.weight
- model.layers.16.mlp.gate_proj.weight
- model.layers.16.mlp.up_proj.weight
- model.layers.16.mlp.down_proj.weight
- model.layers.16.input_layernorm.weight
- model.layers.16.post_attention_layernorm.weight
- model.layers.17.self_attn.q_proj.weight
- model.layers.17.self_attn.k_proj.weight
- model.layers.17.self_attn.v_proj.weight
- model.layers.17.self_attn.o_proj.weight
- model.layers.17.mlp.gate_proj.weight
- model.layers.17.mlp.up_proj.weight
- model.layers.17.mlp.down_proj.weight
- model.layers.17.input_layernorm.weight
- model.layers.17.post_attention_layernorm.weight
- model.layers.18.self_attn.q_proj.weight
- model.layers.18.self_attn.k_proj.weight
- model.layers.18.self_attn.v_proj.weight
- model.layers.18.self_attn.o_proj.weight
- model.layers.18.mlp.gate_proj.weight
- model.layers.18.mlp.up_proj.weight
- model.

GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[NeMo W 2024-12-12 10:01:38 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/setup.py:177: GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
    
[NeMo W 2024-12-12 10:01:43 megatron_base_model:1186] The model: MegatronGPTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-12 10:01:43 megatron_base_model:1186] The model: MegatronGPTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-12 10:01:43 megatron_base_model:1186] The model: MegatronGPTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-12 10:01:4

nemo_config: {'mcore_gpt': True, 'micro_batch_size': 1, 'global_batch_size': 8, 'tensor_model_parallel_size': 1, 'pipeline_model_parallel_size': 1, 'virtual_pipeline_model_parallel_size': None, 'encoder_seq_length': 131072, 'max_position_embeddings': 131072, 'num_layers': 32, 'hidden_size': 4096, 'ffn_hidden_size': 14336, 'num_attention_heads': 32, 'init_method_std': 0.02, 'use_scaled_init_method': True, 'hidden_dropout': 0.0, 'attention_dropout': 0.0, 'ffn_dropout': 0.0, 'kv_channels': None, 'apply_query_key_layer_scaling': True, 'normalization': 'rmsnorm', 'layernorm_epsilon': 1e-05, 'do_layer_norm_weight_decay': False, 'make_vocab_size_divisible_by': 128, 'pre_process': True, 'post_process': True, 'persist_layer_norm': True, 'bias': False, 'activation': 'fast-swiglu', 'headscale': False, 'transformer_block_type': 'pre_ln', 'openai_gelu': False, 'normalize_attention_scores': True, 'position_embedding_type': 'rope', 'rotary_percentage': 1.0, 'attention_type': 'multihead', 'share_embed

[NeMo W 2024-12-12 10:01:43 megatron_base_model:1186] The model: MegatronGPTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-12 10:01:43 megatron_base_model:1186] The model: MegatronGPTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-12 10:01:43 megatron_base_model:1186] The model: MegatronGPTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-12 10:01:43 megatron_base_model:1186] The model: MegatronGPTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-12 10:01:43 megatron_base_model:1186] The model: MegatronGPTModel() does not have field.name: deterministic_mode in its c

[NeMo I 2024-12-12 10:01:43 tokenizer_utils:184] Getting HuggingFace AutoTokenizer with pretrained_model_name: meta-llama/Llama-3.1-8B-Instruct
[NeMo I 2024-12-12 10:01:44 megatron_base_model:601] Padded vocab_size: 128256, original vocab_size: 128256, dummy tokens: 0.


[NeMo W 2024-12-12 10:01:44 megatron_base_model:1186] The model: MegatronGPTModel() does not have field.name: context_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-12 10:01:44 megatron_base_model:1186] The model: MegatronGPTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-12 10:01:44 megatron_base_model:1186] The model: MegatronGPTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-12 10:01:44 megatron_base_model:1186] The model: MegatronGPTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-12 10:01:44 megatron_base_model:1186] The model: MegatronGPTModel() does not have field.name: deterministic_mode in its c

[NeMo I 2024-12-12 10:02:33 dist_ckpt_io:421] Using TorchDistSaveShardedStrategy(torch_dist, 1) dist-ckpt save strategy.


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


[NeMo I 2024-12-12 10:03:20 convert_llama_hf_to_nemo:335] NeMo model saved to: Llama-3.1-8B-Instruct.nemo
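
To sanity-check the converted checkpoint, one option is to list what it contains. The sketch below assumes the `.nemo` file is a tar archive, which is how NeMo 1.x checkpoints are usually packaged; `list_nemo_members` is a hypothetical helper:

```python
import tarfile


def list_nemo_members(nemo_path, limit=20):
    """List the first entries of a .nemo file, assuming it is a tar archive."""
    names = []
    with tarfile.open(nemo_path, "r:*") as tar:
        for member in tar:
            names.append(member.name)
            if len(names) >= limit:
                break
    return names
```

Calling `list_nemo_members('Llama-3.1-8B-Instruct.nemo')` would typically be expected to show a model config file alongside the weights and tokenizer files, though the exact contents depend on the NeMo version.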


    
    


### 1.4 Run pre-training <a name='s1.4'></a>

# Test code for 8 H100 GPUs (finishes in about 8 minutes)

In [None]:
%%bash

MODEL_NAME=Llama-3.1-8B
MODEL=Llama-3.1-8B-Instruct.nemo
NUM_GPUS=8
MAX_STEPS=100
MBS=2
GBS=8
TP=4
PP=2
CP=1
LR=5e-4
DATA_SPLITS=\'9990,8,2\'
DATA_PREFIX=[1.0,data/custom_dataset/preprocessed/wikinews_text_document]
export CUDA_DEVICE_MAX_CONNECTIONS=1
export HUGGING_FACE_HUB_TOKEN='<your_hf_token>'  # placeholder: set your own Hugging Face access token

python /opt/NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
--config-path=/opt/NeMo-Framework-Launcher/launcher_scripts/conf/training/llama --config-name=llama3_1_8b \
+base_results_dir=results \
trainer.num_nodes=1 \
trainer.devices=$NUM_GPUS \
trainer.max_steps=$MAX_STEPS \
trainer.limit_val_batches=1 \
trainer.val_check_interval=100 \
exp_manager.explicit_log_dir=/workspace/results/$MODEL_NAME/pretrain \
exp_manager.wandb_logger_kwargs.name=$MODEL_NAME \
exp_manager.resume_if_exists=True \
exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \
exp_manager.checkpoint_callback_params.model_parallel_size=$(($TP*$PP)) \
model.micro_batch_size=$MBS \
model.global_batch_size=$GBS \
model.tensor_model_parallel_size=$TP \
model.pipeline_model_parallel_size=$PP \
model.context_parallel_size=$CP \
model.init_method_std=0.02 \
model.optim.lr=$LR \
model.data.splits_string=${DATA_SPLITS} \
model.data.data_prefix=${DATA_PREFIX} \
model.data.num_workers=8 \
+model.restore_from_path=$MODEL \
+model.rotary_base=500000.0 \
+model.seq_len_interpolation_factor=8
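
The parallelism and batch settings in this command are tied together: the GPUs left over after tensor, pipeline, and context parallelism form the data-parallel dimension, and the global batch must divide evenly into micro-batches across it. A small sketch of that arithmetic using the values from the cell above (the function name is hypothetical):

```python
def parallel_layout(num_gpus, tp, pp, cp, micro_batch, global_batch):
    """Derive data-parallel size and micro-batches per step from the launch settings."""
    model_parallel = tp * pp * cp
    if num_gpus % model_parallel:
        raise ValueError("num_gpus must be divisible by TP * PP * CP")
    dp = num_gpus // model_parallel
    if global_batch % (micro_batch * dp):
        raise ValueError("global batch must be divisible by micro_batch * DP")
    num_micro_batches = global_batch // (micro_batch * dp)
    return dp, num_micro_batches


dp, n_micro = parallel_layout(num_gpus=8, tp=4, pp=2, cp=1, micro_batch=2, global_batch=8)
print(f"data-parallel size = {dp}, micro-batches per step = {n_micro}")
```

With 8 GPUs and TP=4, PP=2, every GPU holds a shard of a single model replica, so the data-parallel size is 1 and each optimizer step accumulates 4 micro-batches of 2 samples.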



[NeMo I 2024-12-12 10:43:03 megatron_gpt_pretraining:37] 
    
    ************** Experiment configuration ***********
[NeMo I 2024-12-12 10:43:03 megatron_gpt_pretraining:38] 
    run:
      name: llama3_1_8b
      results_dir: ${base_results_dir}/${.name}
      time_limit: 0-01:30:00
      dependency: singleton
    trainer:
      num_nodes: 1
      devices: 8
      accelerator: gpu
      precision: bf16
      logger: false
      enable_checkpointing: false
      use_distributed_sampler: false
      max_epochs: null
      max_steps: 100
      max_time: 05:23:30:00
      log_every_n_steps: 10
      val_check_interval: 100
      limit_val_batches: 1
      limit_test_batches: 50
      accumulate_grad_batches: 1
      gradient_clip_val: 1.0
    exp_manager:
      explicit_log_dir: /workspace/results/Llama-3.1-8B/pretrain
      exp_dir: null
      name: megatron_llama
      create_wandb_logger: false
      wandb_logger_kwargs:
        project: nemo_llama_pretrain
        name: Llama-3.1-8B

[NeMo W 2024-12-12 10:43:04 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/_graveyard/precision.py:49: The `MixedPrecisionPlugin` is deprecated. Use `pytorch_lightning.plugins.precision.MixedPrecision` instead.
    
    
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
`Trainer(limit_val_batches=1)` was configured so 1 batch will be used.


[NeMo I 2024-12-12 10:43:04 exp_manager:400] ExpManager schema
[NeMo I 2024-12-12 10:43:04 exp_manager:401] {'explicit_log_dir': None, 'exp_dir': None, 'name': None, 'version': None, 'use_datetime_version': True, 'resume_if_exists': False, 'resume_past_end': False, 'resume_ignore_no_checkpoint': False, 'resume_from_checkpoint': None, 'create_tensorboard_logger': True, 'summary_writer_kwargs': None, 'create_wandb_logger': False, 'wandb_logger_kwargs': None, 'create_mlflow_logger': False, 'mlflow_logger_kwargs': {'experiment_name': None, 'tracking_uri': None, 'tags': None, 'save_dir': './mlruns', 'prefix': '', 'artifact_location': None, 'run_id': None, 'log_model': False}, 'create_dllogger_logger': False, 'dllogger_logger_kwargs': {'verbose': False, 'stdout': False, 'json_file': './dllogger.json'}, 'create_clearml_logger': False, 'clearml_logger_kwargs': {'project': None, 'task': None, 'connect_pytorch': False, 'model_name': None, 'tags': None, 'log_model': False, 'log_cfg': False, 'log_

[NeMo W 2024-12-12 10:43:04 exp_manager:784] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/workspace/results/Llama-3.1-8B/pretrain/checkpoints. Training from scratch.


[NeMo I 2024-12-12 10:43:04 exp_manager:459] Experiments will be logged at /workspace/results/Llama-3.1-8B/pretrain
[NeMo I 2024-12-12 10:43:04 exp_manager:1010] TensorboardLogger has been set up


[NeMo W 2024-12-12 10:43:04 exp_manager:1139] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 100. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.


[NeMo I 2024-12-12 10:43:04 exp_manager:593] TFLOPs per sec per GPU will be calculated, conditioned on supported models. Defaults to -1 upon failure.
[NeMo I 2024-12-12 10:43:04 megatron_gpt_pretraining:46] Continual training: loading weights from Llama-3.1-8B-Instruct.nemo


[NeMo W 2024-12-12 10:43:21 megatron_base_model:1186] The model: MegatronGPTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-12 10:43:21 megatron_base_model:1186] The model: MegatronGPTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-12 10:43:21 megatron_base_model:1186] The model: MegatronGPTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-12 10:43:21 megatron_base_model:1186] The model: MegatronGPTModel() does not have field.name: use_te_rng_tracker in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-12 10:43:21 megatron_base_model:1186] The model: MegatronGPTModel() does not have field.name: tp_comm_bulk_wgrad in its cfg.

[NeMo I 2024-12-12 10:43:21 megatron_init:314] Rank 0 has data parallel group : [0]
[NeMo I 2024-12-12 10:43:21 megatron_init:320] Rank 0 has combined group of data parallel and context parallel : [0]
[NeMo I 2024-12-12 10:43:21 megatron_init:325] All data parallel group ranks with context parallel combined: [[0], [1], [2], [3], [4], [5], [6], [7]]
[NeMo I 2024-12-12 10:43:21 megatron_init:328] Ranks 0 has data parallel rank: 0
[NeMo I 2024-12-12 10:43:21 megatron_init:336] Rank 0 has context parallel group: [0]
[NeMo I 2024-12-12 10:43:21 megatron_init:339] All context parallel group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]
[NeMo I 2024-12-12 10:43:21 megatron_init:340] Ranks 0 has context parallel rank: 0
[NeMo I 2024-12-12 10:43:21 megatron_init:347] Rank 0 has model parallel group: [0, 1, 2, 3, 4, 5, 6, 7]
[NeMo I 2024-12-12 10:43:21 megatron_init:348] All model parallel group ranks: [[0, 1, 2, 3, 4, 5, 6, 7]]
[NeMo I 2024-12-12 10:43:21 megatron_init:357] Rank 0 has tensor 

[NeMo I 2024-12-12 10:43:21 tokenizer_utils:184] Getting HuggingFace AutoTokenizer with pretrained_model_name: meta-llama/Meta-Llama-3.1-8B
[NeMo I 2024-12-12 10:43:27 megatron_base_model:601] Padded vocab_size: 128512, original vocab_size: 128256, dummy tokens: 256.
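
The padded vocabulary size in this log line differs from the 128256 reported during conversion because Megatron-style models pad the embedding table so that it divides evenly across tensor-parallel ranks. A sketch of that rounding, assuming the multiple is `make_vocab_size_divisible_by` (128 in the converted config) times the tensor-parallel size:

```python
def padded_vocab_size(orig_vocab_size, divisible_by=128, tensor_parallel_size=1):
    """Round the vocabulary size up to a multiple of divisible_by * tensor_parallel_size."""
    multiple = divisible_by * tensor_parallel_size
    return -(-orig_vocab_size // multiple) * multiple  # ceiling division


for tp in (1, 4):
    padded = padded_vocab_size(128256, 128, tp)
    print(f"TP={tp}: padded={padded}, dummy tokens={padded - 128256}")
```

Under this assumption, TP=1 needs no padding, while TP=4 rounds up to a multiple of 512, which agrees with the 256 dummy tokens logged here.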


[NeMo W 2024-12-12 10:43:27 megatron_base_model:1186] The model: MegatronGPTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-12 10:43:27 megatron_base_model:1186] The model: MegatronGPTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-12 10:43:27 megatron_base_model:1186] The model: MegatronGPTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-12 10:43:27 megatron_base_model:1186] The model: MegatronGPTModel() does not have field.name: use_te_rng_tracker in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-12 10:43:27 megatron_base_model:1186] The model: MegatronGPTModel() does not have field.name: tp_comm_bulk_wgrad in its cfg.

apply rope scaling ...


Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8


apply rope scaling ...
apply rope scaling ...
apply rope scaling ...
apply rope scaling ...
apply rope scaling ...


Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8


apply rope scaling ...


Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/8
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/8
Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/8
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8


apply rope scaling ...


Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 8 processes
----------------------------------------------------------------------------------------------------



[NeMo I 2024-12-12 10:45:49 nlp_overrides:1358] Model MegatronGPTModel was successfully restored from /workspace/Llama-3.1-8B-Instruct.nemo.


[NeMo W 2024-12-12 10:45:49 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:161: You have overridden `MegatronGPTModel.configure_sharded_model` which is deprecated. Please override the `configure_model` hook instead. Instantiation with the newer hook will be created on the device right away and have the right data type depending on the precision setting in the Trainer.
    
[NeMo W 2024-12-12 10:45:49 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:143: You are using the `dataloader_iter` step flavor. If you consume the iterator more than once per step, the `batch_idx` argument in any hook that takes it will not match with the batch index of the last batch consumed. This might have unforeseen effects on callbacks or code that expects to get the correct index. This will also not work well with gradient accumulation. This feature is very experimental and subject t

[NeMo I 2024-12-12 10:45:50 megatron_gpt_model:1680] Pipeline model parallel rank: 0, Tensor model parallel rank: 0, Number of model parameters on device: 1.00e+09. Number of precise model parameters on device: 8033157120.
[NeMo I 2024-12-12 10:45:50 megatron_gpt_model:1524] Building GPT datasets.
[NeMo I 2024-12-12 10:45:50 utils:259] Let split_matrix = [(0, 0.999), (0.999, 0.9998), (0.9998, 1.0)]
[NeMo I 2024-12-12 10:45:50 utils:259] Building dataset splits with cls=GPTDataset, sizes=[800, 16, 400], and config=GPTDatasetConfig(random_seed=1234, sequence_length=8192, blend=(['data/custom_dataset/preprocessed/wikinews_text_document'], [1.0]), blend_per_split=None, renormalize_blend_weights=False, split='9990,8,2', split_matrix=[(0, 0.999), (0.999, 0.9998), (0.9998, 1.0)], num_dataset_builder_threads=1, path_to_cache=None, mmap_bin_files=True, mock=False, tokenizer=<nemo.collections.common.tokenizers.huggingface.auto_tokenizer.AutoTokenizer object at 0x7fb1cb19c640>, reset_position_ids
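
The two log lines above can be reproduced by hand. The split string `9990,8,2` is normalized into cumulative fractions, and the requested sample counts follow from the step counts and the global batch size. A sketch under the assumption that the validation count uses `(max_steps // val_check_interval + 1) * limit_val_batches` evaluation batches; both helper names are hypothetical:

```python
def split_matrix(splits_string):
    """Turn a 'train,valid,test' weight string into cumulative (start, end) fractions."""
    weights = [float(w) for w in splits_string.split(",")]
    total = sum(weights)
    bounds, start = [], 0.0
    for w in weights:
        end = start + w / total
        bounds.append((round(start, 6), round(end, 6)))
        start = end
    return bounds


def requested_samples(max_steps, val_check_interval, limit_val_batches,
                      limit_test_batches, global_batch):
    """Samples requested per split, assuming sizing by step counts."""
    train = max_steps * global_batch
    valid = (max_steps // val_check_interval + 1) * limit_val_batches * global_batch
    test = limit_test_batches * global_batch
    return [train, valid, test]


print(split_matrix("9990,8,2"))
print(requested_samples(max_steps=100, val_check_interval=100, limit_val_batches=1,
                        limit_test_batches=50, global_batch=8))
```

With the settings from this run, the second call yields the same `[800, 16, 400]` that the log reports as `sizes`.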

[NeMo W 2024-12-12 10:45:50 utils:259] Building a BlendedDataset for a single MegatronDataset


[NeMo I 2024-12-12 10:45:50 utils:259] Build and save the BlendedDataset indices
[NeMo I 2024-12-12 10:45:50 utils:259] 	Build and save the dataset and dataset sample indexes


[NeMo W 2024-12-12 10:45:50 utils:259] Unable to save the BlendedDataset indexes because path_to_cache is None
[NeMo W 2024-12-12 10:45:50 utils:259] Building a BlendedDataset for a single MegatronDataset


[NeMo I 2024-12-12 10:45:50 utils:259] Build and save the BlendedDataset indices
[NeMo I 2024-12-12 10:45:50 utils:259] 	Build and save the dataset and dataset sample indexes


[NeMo W 2024-12-12 10:45:50 utils:259] Unable to save the BlendedDataset indexes because path_to_cache is None
[NeMo W 2024-12-12 10:45:50 utils:259] Building a BlendedDataset for a single MegatronDataset


[NeMo I 2024-12-12 10:45:50 utils:259] Build and save the BlendedDataset indices
[NeMo I 2024-12-12 10:45:50 utils:259] 	Build and save the dataset and dataset sample indexes


[NeMo W 2024-12-12 10:45:50 utils:259] Unable to save the BlendedDataset indexes because path_to_cache is None


[NeMo I 2024-12-12 10:45:50 utils:259] Verifying NumPy indices for BlendedDataset train split
[NeMo I 2024-12-12 10:45:50 utils:259] Verifying NumPy indices for BlendedDataset valid split
[NeMo I 2024-12-12 10:45:50 utils:259] Verifying NumPy indices for BlendedDataset test split
[NeMo I 2024-12-12 10:45:50 megatron_gpt_model:1613] Length of train dataset: 804
[NeMo I 2024-12-12 10:45:50 megatron_gpt_model:1615] Length of val dataset: 17
[NeMo I 2024-12-12 10:45:50 megatron_gpt_model:1617] Length of test dataset: 402
[NeMo I 2024-12-12 10:45:50 megatron_gpt_model:1618] Finished building GPT datasets.
[NeMo I 2024-12-12 10:45:50 megatron_gpt_model:1729] Setting up train dataloader with len(len(self._train_ds)): 804 and consumed samples: 0
[NeMo I 2024-12-12 10:45:50 megatron_gpt_model:1627] Building dataloader with consumed samples: 0
[NeMo I 2024-12-12 10:45:50 data_samplers:76] Instantiating MegatronPretrainingSampler with total_samples: 804 and consumed_samples: 0
[NeMo I 2024-12-12 

LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]


[NeMo I 2024-12-12 10:45:50 modelPT:787] Optimizer config = MegatronDistributedFusedAdam (
    Parameter Group 0
        betas: [0.9, 0.95]
        bias_correction: True
        eps: 1e-08
        is_expert: False
        lr: 0.0005
        weight_decay: 0.1
    )
[NeMo I 2024-12-12 10:45:50 lr_scheduler:948] Scheduler "<nemo.core.optim.lr_scheduler.CosineAnnealing object at 0x7fb187364250>" 
    will be used during training (effective maximum steps = 100) - 
    Parameters : 
    (warmup_steps: 500
    constant_steps: 0
    min_lr: 1.0e-05
    max_steps: 100
    )
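
The scheduler reports `warmup_steps: 500` while the run stops at `max_steps: 100`, so under a conventional linear warmup the learning rate would still be ramping up when training ends and would not reach the configured 5e-4. The sketch below is a generic linear-warmup plus cosine-decay schedule for illustration only; NeMo's `CosineAnnealing` may differ in details such as step offsets or how it treats a warmup longer than the run:

```python
import math


def warmup_cosine_lr(step, peak_lr, min_lr, warmup_steps, max_steps):
    """Generic schedule: linear warmup to peak_lr, then cosine decay to min_lr."""
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = max(max_steps - warmup_steps, 1)
    progress = min((step - warmup_steps) / decay_steps, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 50, 100):
    lr = warmup_cosine_lr(step, peak_lr=5e-4, min_lr=1e-5, warmup_steps=500, max_steps=100)
    print(step, lr)
```

For a short test run like this one, lowering the warmup length below `max_steps` is one way to make sure the peak learning rate is actually exercised.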



  | Name  | Type          | Params | Mode 
------------------------------------------------
0 | model | Float16Module | 1.0 B  | train
------------------------------------------------
1.0 B     Trainable params
0         Non-trainable params
1.0 B     Total params
4,016.570 Total estimated model params size (MB)
328       Modules in train mode
0         Modules in eval mode
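
The summary shows about 1.0 B parameters because each rank holds only its own shard: with TP × PP = 8, the 8,033,157,120 total parameters logged earlier come to roughly 1.0e9 per device. The size estimate of about 4,016 MB is consistent with roughly 4 bytes per parameter, although that byte width is inferred from the numbers here rather than stated in the log:

```python
def per_device_params(total_params, tp, pp):
    """Approximate parameters per rank, assuming an even split across TP * PP ranks."""
    return total_params / (tp * pp)


def size_mb(num_params, bytes_per_param=4):
    """Estimated size in megabytes (1 MB = 1e6 bytes)."""
    return num_params * bytes_per_param / 1e6


shard = per_device_params(8_033_157_120, tp=4, pp=2)
print(f"params per device ~ {shard:.3e}, size ~ {size_mb(shard):.1f} MB")
```

The even split is only an approximation, since the embedding and output layers sit on particular pipeline stages.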
[NeMo W 2024-12-12 10:45:50 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/utilities.py:148: Found `dataloader_iter` argument in the `validation_step`. Note that the support for this signature is experimental and the behavior is subject to change.
    

Epoch 0: : 100%|██████████| 100/100 [02:14<00:00, reduced_train_loss=2.050, global_step=99.00, consumed_samples=800.0, train_step_timing in s=1.260]
Validation: |          | 0/? [00:00<?, ?it/s]
Validation:   0%|          | 0/4 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|          | 0/4 [00:00<?, ?it/s]
Validation DataLoader 0: 100%|██████████| 4/4 [00:00<00:00,  9.22it/s]


Epoch 0, global step 100: 'val_loss' reached 1.82502 (best 1.82502), saving model to '/workspace/results/Llama-3.1-8B/pretrain/checkpoints/megatron_llama--val_loss=1.83-step=100-consumed_samples=800.0.ckpt' as top 10


Epoch 0: : 100%|██████████| 100/100 [02:16<00:00, reduced_train_loss=2.050, global_step=99.00, consumed_samples=800.0, train_step_timing in s=1.260, val_loss=1.830][NeMo I 2024-12-12 10:50:35 dist_ckpt_io:421] Using TorchDistSaveShardedStrategy(torch_dist, 1) dist-ckpt save strategy.


      def forward(ctx, input, weight, bias, allreduce_dgrad):
    
      def backward(ctx, grad_output):
    
      def forward(
    
      def backward(ctx, grad_output):
    
      from torch.distributed._sharded_tensor import ShardedTensor as TorchShardedTensor
    
        module is deprecated and will be removed in 0.10.0. Please use 
        'megatron.core.extensions.transformer_engine' instead.
      from megatron.core.transformer.custom_layers.transformer_engine import SplitAlongDim
    
      quantize_op_abstract = torch.library.impl_abstract("tensorrt::quantize_op")(
    
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
      cm = get_cmap("Set1")
    
      if get_gast_version() < LooseVersion("0.5"):
    
      other = LooseVersion(other)
    


Epoch 0: : 100%|██████████| 100/100 [02:54<00:00, reduced_train_loss=2.050, global_step=99.00, consumed_samples=800.0, train_step_timing in s=1.260, val_loss=1.830]
[NeMo I 2024-12-12 10:51:11 perf_metrics_utils:42] train_step_timing in s: [1.26, 1.26, 1.26, 1.26, 1.26, 1.26, 1.27, 1.26, 1.26, 1.26]
[NeMo I 2024-12-12 10:51:11 perf_metrics:86] TFLOPs per sec per GPU=375.93
[NeMo I 2024-12-12 10:51:11 dist_ckpt_io:421] Using TorchDistSaveShardedStrategy(torch_dist, 1) dist-ckpt save strategy.


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

# Original code

In [None]:
%%bash

MODEL_NAME=Llama-3.1-8B
MODEL=Llama-3.1-8B-Instruct.nemo
NUM_GPUS=8
MAX_STEPS=100
MBS=1
GBS=2
TP=2
PP=1
CP=1
LR=1e-4
DATA_SPLITS=\'9990,8,2\'
DATA_PREFIX=[1.0,data/custom_dataset/preprocessed/wikinews_text_document]
export CUDA_DEVICE_MAX_CONNECTIONS=1
export HUGGING_FACE_HUB_TOKEN='<your_hf_token>'  # placeholder: supply your own token, never commit a real one

python /opt/NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
--config-path=/opt/NeMo-Framework-Launcher/launcher_scripts/conf/training/llama --config-name=llama3_1_8b \
+base_results_dir=results \
trainer.num_nodes=1 \
trainer.devices=$NUM_GPUS \
trainer.max_steps=$MAX_STEPS \
trainer.limit_val_batches=1 \
trainer.val_check_interval=50 \
exp_manager.explicit_log_dir=/workspace/results/$MODEL_NAME/pretrain \
exp_manager.wandb_logger_kwargs.name=$MODEL_NAME \
exp_manager.resume_if_exists=True \
exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \
exp_manager.checkpoint_callback_params.model_parallel_size=$(($TP*$PP)) \
model.micro_batch_size=$MBS \
model.global_batch_size=$GBS \
model.tensor_model_parallel_size=$TP \
model.pipeline_model_parallel_size=$PP \
model.context_parallel_size=$CP \
model.init_method_std=0.02 \
model.optim.lr=$LR \
model.data.splits_string=${DATA_SPLITS} \
model.data.data_prefix=${DATA_PREFIX} \
model.data.num_workers=0 \
+model.restore_from_path=$MODEL \
+model.rotary_base=500000.0 \
+model.seq_len_interpolation_factor=8

# Attempt to run offline

In [None]:
%%bash
MODEL_NAME=Llama-3.1-8B
MODEL=Llama-3.1-8B-Instruct.nemo
NUM_GPUS=2
MAX_STEPS=100
MBS=1
GBS=2
TP=2
PP=1
CP=1
LR=1e-4
DATA_SPLITS="9990,8,2"
DATA_PREFIX="[1.0,data/custom_dataset/preprocessed/wikinews_text_document]"
TOKENIZER_DIR=tokenizer-llama3  # Local tokenizer path

# Disable multiple CUDA connections per device
export CUDA_DEVICE_MAX_CONNECTIONS=1

# Run the NeMo training script
python /opt/NeMo/examples/nlp/language_modeling/megatron_gpt_pretraining.py \
--config-path=/opt/NeMo-Framework-Launcher/launcher_scripts/conf/training/llama --config-name=llama3_1_8b \
+base_results_dir=results \
trainer.num_nodes=1 \
trainer.devices=$NUM_GPUS \
trainer.max_steps=$MAX_STEPS \
trainer.limit_val_batches=1 \
trainer.val_check_interval=50 \
exp_manager.explicit_log_dir=/workspace/results/$MODEL_NAME/pretrain \
exp_manager.wandb_logger_kwargs.name=$MODEL_NAME \
exp_manager.resume_if_exists=True \
exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \
exp_manager.checkpoint_callback_params.model_parallel_size=$(($TP*$PP)) \
model.micro_batch_size=$MBS \
model.global_batch_size=$GBS \
model.tensor_model_parallel_size=$TP \
model.pipeline_model_parallel_size=$PP \
model.context_parallel_size=$CP \
model.init_method_std=0.02 \
model.optim.lr=$LR \
model.data.splits_string=${DATA_SPLITS} \
model.data.data_prefix=${DATA_PREFIX} \
model.data.num_workers=0 \
+model.restore_from_path=$MODEL \
+model.rotary_base=500000.0 \
+model.seq_len_interpolation_factor=8 \
model.tokenizer.library=megatron  \
model.tokenizer.type=$TOKENIZER_DIR \
model.tokenizer.vocab_file=$TOKENIZER_DIR/vocab.json \
model.tokenizer.merge_file=$TOKENIZER_DIR/merges.txt \
model.tokenizer.model_file=$TOKENIZER_DIR/tokenizer.model \
+model.tokenizer.use_fast=false
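
The batch and parallelism variables in the two cells above are tied together by simple arithmetic. The following is a minimal sketch, assuming the usual Megatron-style convention that the data-parallel size is the GPU count divided by `TP * PP * CP`, and that the global batch size must be a multiple of `MBS * data-parallel size`. The helper name `num_microbatches` is hypothetical and is not part of NeMo.

```python
def num_microbatches(num_gpus, tp, pp, cp, mbs, gbs):
    """Return (data_parallel_size, micro_batches_per_step).

    Assumption: Megatron-style layout, where the model-parallel size is
    tp * pp * cp and each data-parallel replica accumulates gradients over
    gbs / (mbs * dp) micro-batches per optimizer step.
    """
    model_parallel = tp * pp * cp
    if num_gpus % model_parallel != 0:
        raise ValueError(
            f"{num_gpus} GPUs is not divisible by TP*PP*CP={model_parallel}"
        )
    dp = num_gpus // model_parallel
    if gbs % (mbs * dp) != 0:
        raise ValueError(f"GBS={gbs} is not a multiple of MBS*DP={mbs * dp}")
    return dp, gbs // (mbs * dp)


# Settings of the offline cell: NUM_GPUS=2, TP=2, PP=1, CP=1, MBS=1, GBS=2
print(num_microbatches(2, 2, 1, 1, 1, 2))
```

Under these assumptions, the settings of the first cell (`NUM_GPUS=8`, `TP=2`, `GBS=2`) give a data-parallel size of 4, and `GBS=2` is not a multiple of `MBS*DP=4`, so the helper raises `ValueError`. For the fine-tuning settings in section 2.3 (`NUM_GPUS=8`, `TP=8`, `MBS=2`, `GBS=32`) it returns 16 micro-batches, which agrees with the `setting number of microbatches to constant 16` line in that run's log.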

## 2. Instruction Tuning <a name='s2'></a>

We will be using [erhwenkuo/alpaca-data-gpt4-chinese-zhtw](https://huggingface.co/datasets/erhwenkuo/alpaca-data-gpt4-chinese-zhtw), a dataset of Traditional Chinese (zh-tw) instruction-following examples generated by GPT-4 from Alpaca prompts, to fine-tune the LLM.

The dataset was originally shared in [this repository](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM). This version is a translation from English to Chinese.

### 2.1 Download dataset: erhwenkuo/alpaca-data-gpt4-chinese-zhtw <a name='s2.1'></a>
Let's download the dataset and save it in JSONL format first.

In [None]:
import os
import json
from datasets import load_dataset
dataset = load_dataset('erhwenkuo/alpaca-data-gpt4-chinese-zhtw')['train']
output_path = 'data/alpaca/gpt4-chinese-zhtw.jsonl'
os.makedirs(os.path.dirname(output_path), exist_ok=True)

with open(output_path, 'w') as f:
    for human_instruction, human_input, assistant_output in zip(dataset['instruction'], dataset['input'], dataset['output']):
        f.write(json.dumps({'input': '\n'.join([human_instruction.strip(),human_input.strip()]).strip(), 'output': assistant_output.strip()}, ensure_ascii=False)+ '\n')

  from .autonotebook import tqdm as notebook_tqdm
Generating train split: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 52049/52049 [00:00<00:00, 515662.50 examples/s]


In [None]:
!head -n 1 data/alpaca/gpt4-chinese-zhtw.jsonl

{"input": "給出三個保持健康的小貼士。", "output": "1. 飲食要均衡且富有營養：確保你的餐食包含各種水果、蔬菜、瘦肉、全穀物和健康脂肪。這有助於為身體提供必要的營養，使其發揮最佳功能，並有助於預防慢性疾病。2. 經常參加體育鍛煉：鍛鍊對於保持強壯的骨骼、肌肉和心血管健康至關重要。每週至少要進行150分鐘的中等有氧運動或75分鐘的劇烈運動。3. 獲得足夠的睡眠：獲得足夠的高質量睡眠對身體和心理健康至關重要。它有助於調節情緒，提高認知功能，並支援健康的生長和免疫功能。每晚睡眠目標為7-9小時。"}

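Before splitting, it can help to confirm that every line of the JSONL file has the expected shape. The sketch below mirrors the record construction used in the download cell and adds a small validity check; the function names `build_record` and `is_valid_line` are hypothetical helpers introduced for illustration, and the sample strings are made up.

```python
import json


def build_record(instruction, input_text, output):
    """Merge an Alpaca-style triple into the {"input", "output"} record format."""
    merged = "\n".join([instruction.strip(), input_text.strip()]).strip()
    return {"input": merged, "output": output.strip()}


def is_valid_line(line):
    """True if the line is a JSON object with non-empty 'input' and 'output' strings."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    if not isinstance(record, dict):
        return False
    return all(
        isinstance(record.get(key), str) and record[key]
        for key in ("input", "output")
    )


# Illustrative sample, not taken from the dataset
record = build_record("Translate to English.", "你好", " Hello ")
line = json.dumps(record, ensure_ascii=False)
print(line)
print(is_valid_line(line))
```

Running `is_valid_line` over each line of `data/alpaca/gpt4-chinese-zhtw.jsonl` would flag records with a missing or empty field before they reach training.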

### 2.2 Split the data into train, validation and test <a name='s2.2'></a>

Generate the train, validation, and test splits. You may use your own script, or save the following sample as split_train_val.py in the alpaca directory and run it.

In [None]:
import json
import random

input_file = "data/alpaca/gpt4-chinese-zhtw.jsonl"
training_output_file = "data/alpaca/training.jsonl"
validation_output_file = "data/alpaca/validation.jsonl"
test_output_file = "data/alpaca/test.jsonl"

# Specify the proportion of data for training, validation, and test
train_proportion = 0.98
validation_proportion = 0.01
test_proportion = 0.01

# Read the JSONL file and shuffle the JSON objects
with open(input_file, "r") as f:
    lines = f.readlines()
    random.shuffle(lines)

# Calculate split indices
total_lines = len(lines)
train_index = int(total_lines * train_proportion)
val_index = int(total_lines * validation_proportion)

# Distribute JSON objects into training, validation, and test sets (test takes the remainder)
train_data = lines[:train_index]
validation_data = lines[train_index:train_index+val_index]
test_data = lines[train_index+val_index:]

# Write JSON objects to training file
with open(training_output_file, "w") as f:
    for line in train_data:
        f.write(line.strip() + "\n")

# Write JSON objects to validation file
with open(validation_output_file, "w") as f:
    for line in validation_data:
        f.write(line.strip() + "\n")

# Write JSON objects to test file
with open(test_output_file, "w") as f:
    for line in test_data:
        f.write(line.strip() + "\n")
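
The split sizes follow directly from the proportions and the integer truncation in the script above. As a small worked example, the sketch below reproduces that arithmetic for the 52049 examples reported by the download progress bar; `split_sizes` is a hypothetical helper name, and the test set simply receives whatever remains after the train and validation slices.

```python
def split_sizes(total, train_proportion=0.98, validation_proportion=0.01):
    """Reproduce the index arithmetic of the split script above."""
    train = int(total * train_proportion)      # lines[:train_index]
    val = int(total * validation_proportion)   # lines[train_index:train_index+val_index]
    test = total - train - val                 # everything left over
    return train, val, test


print(split_sizes(52049))
```

The validation and test sizes computed this way (520 and 521) agree with the `Length of val dataset` and `Length of test dataset` lines logged by the fine-tuning run in section 2.3. Note that the script shuffles without a fixed seed, so calling `random.seed(...)` before `random.shuffle(lines)` would be needed to obtain the same split on every run.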

In [None]:
# What the dataset looks like after splitting
!head -1 data/alpaca/training.jsonl

{"input": "生成一句關於從失敗中學習的重要性的句子。", "output": "“接受和從失敗中學習對於成長和成功至關重要，因為它教給我們寶貴的經驗教訓，使我們能夠培養韌性，並提供改進的機會。”"}


### 2.3 Full parameter fine-tuning  <a name='s2.3'></a>
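
The fine-tuning command in this section wraps every record in a Llama 3 style chat prompt through `model.data.train_ds.prompt_template`, with `{input}` and `{output}` standing for the two fields of each JSONL record. The sketch below shows what one rendered training example could look like. It is an illustration only: `render_prompt` is a hypothetical helper, real newlines are used in place of the escaped `\n` of the shell variable, and NeMo's own template handling (tokenization, truncation, loss masking) is not reproduced here.

```python
import re

PROMPT_TEMPLATE = (
    "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n"
    "You are a knowledgeable assistant trained to provide accurate and helpful "
    "information. Please respond to the user's queries promptly and politely.<|eot_id|>\n"
    "<|start_header_id|>user<|end_header_id|>\n"
    "{input}<|eot_id|>\n"
    "<|start_header_id|>assistant<|end_header_id|>\n"
    "{output}"
)


def render_prompt(template, record):
    """Fill {field} placeholders in a single pass.

    A single regex pass means placeholder-like text inside a field value
    is left untouched instead of being substituted a second time.
    """
    return re.sub(
        r"\{(\w+)\}",
        lambda match: record.get(match.group(1), match.group(0)),
        template,
    )


# Illustrative record, not taken from the dataset
print(render_prompt(PROMPT_TEMPLATE, {"input": "Hi", "output": "Hello"}))
```

Printing a few rendered records from `data/alpaca/training.jsonl` this way is a quick check that the special tokens and field order match what the model is expected to see.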

# Training on 8 H100 GPUs

In [None]:
%%bash

MODEL_NAME=Llama-3.1-8B
MODEL=results/Llama-3.1-8B/pretrain/checkpoints/megatron_llama.nemo
NUM_GPUS=8
MAX_STEPS=100
VAL_INTERVAL=1.0
GBS=32
MBS=2
TP=8
PP=1
LR=2e-4
SEQ_LEN=8192
TRAIN_DS=[data/alpaca/training.jsonl]
VALID_DS=[data/alpaca/validation.jsonl]
TEST_DS=[data/alpaca/test.jsonl]
CONCAT_SAMPLING_PROBS=[1.0]
PROMPT_TEMPLATE="\"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\
You are a knowledgeable assistant trained to provide accurate and helpful information. Please respond to the user's queries promptly and politely.<|eot_id|>\n\
<|start_header_id|>user<|end_header_id|>\n\
{input}<|eot_id|>\n\
<|start_header_id|>assistant<|end_header_id|>\n\
{output}\""

python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
--config-path=/opt/NeMo/examples/nlp/language_modeling/tuning/conf --config-name=megatron_gpt_finetuning_config \
trainer.devices=$NUM_GPUS \
trainer.max_epochs=null \
trainer.max_steps=$MAX_STEPS \
trainer.val_check_interval=$VAL_INTERVAL \
exp_manager.explicit_log_dir=/workspace/results/$MODEL_NAME/SFT \
exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \
model.tensor_model_parallel_size=$TP \
model.pipeline_model_parallel_size=$PP \
model.restore_from_path=$MODEL \
model.global_batch_size=$GBS \
model.micro_batch_size=$MBS \
model.data.train_ds.file_names=${TRAIN_DS} \
model.data.validation_ds.file_names=${VALID_DS} \
model.data.test_ds.file_names=${TEST_DS} \
model.data.train_ds.max_seq_length=$SEQ_LEN \
model.data.validation_ds.max_seq_length=$SEQ_LEN \
model.data.test_ds.max_seq_length=$SEQ_LEN \
model.data.train_ds.num_workers=8 \
model.data.validation_ds.num_workers=8 \
model.data.test_ds.num_workers=8 \
model.data.train_ds.concat_sampling_probabilities=${CONCAT_SAMPLING_PROBS} \
model.data.train_ds.prompt_template="$PROMPT_TEMPLATE" \
model.optim.lr=$LR \
model.peft.peft_scheme=null

      def forward(ctx, input, weight, bias, allreduce_dgrad):
    
      def backward(ctx, grad_output):
    
      def forward(
    
      def backward(ctx, grad_output):
    
      from torch.distributed._sharded_tensor import ShardedTensor as TorchShardedTensor
    
        module is deprecated and will be removed in 0.10.0. Please use 
        'megatron.core.extensions.transformer_engine' instead.
      from megatron.core.transformer.custom_layers.transformer_engine import SplitAlongDim
    
      quantize_op_abstract = torch.library.impl_abstract("tensorrt::quantize_op")(
    
      cm = get_cmap("Set1")
    
      if get_gast_version() < LooseVersion("0.5"):
    
      other = LooseVersion(other)
    
    See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(
    


[NeMo I 2024-12-12 11:23:36 megatron_gpt_finetuning:56] 
    
    ************** Experiment configuration ***********
[NeMo I 2024-12-12 11:23:36 megatron_gpt_finetuning:57] 
    name: megatron_gpt_peft_${model.peft.peft_scheme}_tuning
    trainer:
      devices: 8
      accelerator: gpu
      num_nodes: 1
      precision: 16
      logger: false
      enable_checkpointing: false
      use_distributed_sampler: false
      max_epochs: null
      max_steps: 100
      log_every_n_steps: 10
      val_check_interval: 1.0
      gradient_clip_val: 1.0
    exp_manager:
      explicit_log_dir: /workspace/results/Llama-3.1-8B/SFT
      exp_dir: null
      name: ${name}
      create_wandb_logger: false
      wandb_logger_kwargs:
        project: null
        name: null
      resume_if_exists: true
      resume_ignore_no_checkpoint: true
      create_checkpoint_callback: true
      checkpoint_callback_params:
        monitor: validation_${model.data.validation_ds.metric.name}
        save_top_k: 1


      super().__init__(
    
[NeMo W 2024-12-12 11:23:36 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/_graveyard/precision.py:49: The `MixedPrecisionPlugin` is deprecated. Use `pytorch_lightning.plugins.precision.MixedPrecision` instead.
    
      torch.set_autocast_gpu_dtype(dtype)
    
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
`Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..


[NeMo I 2024-12-12 11:23:37 exp_manager:400] ExpManager schema
[NeMo I 2024-12-12 11:23:37 exp_manager:401] {'explicit_log_dir': None, 'exp_dir': None, 'name': None, 'version': None, 'use_datetime_version': True, 'resume_if_exists': False, 'resume_past_end': False, 'resume_ignore_no_checkpoint': False, 'resume_from_checkpoint': None, 'create_tensorboard_logger': True, 'summary_writer_kwargs': None, 'create_wandb_logger': False, 'wandb_logger_kwargs': None, 'create_mlflow_logger': False, 'mlflow_logger_kwargs': {'experiment_name': None, 'tracking_uri': None, 'tags': None, 'save_dir': './mlruns', 'prefix': '', 'artifact_location': None, 'run_id': None, 'log_model': False}, 'create_dllogger_logger': False, 'dllogger_logger_kwargs': {'verbose': False, 'stdout': False, 'json_file': './dllogger.json'}, 'create_clearml_logger': False, 'clearml_logger_kwargs': {'project': None, 'task': None, 'connect_pytorch': False, 'model_name': None, 'tags': None, 'log_model': False, 'log_cfg': False, 'log_

[NeMo W 2024-12-12 11:23:37 exp_manager:862] Exp_manager is logging to /workspace/results/Llama-3.1-8B/SFT, but it already exists.
[NeMo W 2024-12-12 11:23:37 exp_manager:784] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/workspace/results/Llama-3.1-8B/SFT/checkpoints. Training from scratch.


[NeMo I 2024-12-12 11:23:37 exp_manager:459] Experiments will be logged at /workspace/results/Llama-3.1-8B/SFT
[NeMo I 2024-12-12 11:23:37 exp_manager:1010] TensorboardLogger has been set up


[NeMo W 2024-12-12 11:23:37 exp_manager:1139] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 100. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.


[NeMo I 2024-12-12 11:23:37 exp_manager:593] TFLOPs per sec per GPU will be calculated, conditioned on supported models. Defaults to -1 upon failure.


[NeMo W 2024-12-12 11:23:48 megatron_base_model:1186] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-12 11:23:48 megatron_base_model:1186] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-12 11:23:48 megatron_base_model:1186] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-12 11:23:48 megatron_base_model:1186] The model: MegatronGPTSFTModel() does not have field.name: use_te_rng_tracker in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-12 11:23:48 megatron_base_model:1186] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_bulk_wg

[NeMo I 2024-12-12 11:23:48 megatron_init:314] Rank 0 has data parallel group : [0]
[NeMo I 2024-12-12 11:23:48 megatron_init:320] Rank 0 has combined group of data parallel and context parallel : [0]
[NeMo I 2024-12-12 11:23:48 megatron_init:325] All data parallel group ranks with context parallel combined: [[0], [1], [2], [3], [4], [5], [6], [7]]
[NeMo I 2024-12-12 11:23:48 megatron_init:328] Ranks 0 has data parallel rank: 0
[NeMo I 2024-12-12 11:23:48 megatron_init:336] Rank 0 has context parallel group: [0]
[NeMo I 2024-12-12 11:23:48 megatron_init:339] All context parallel group ranks: [[0], [1], [2], [3], [4], [5], [6], [7]]
[NeMo I 2024-12-12 11:23:48 megatron_init:340] Ranks 0 has context parallel rank: 0
[NeMo I 2024-12-12 11:23:48 megatron_init:347] Rank 0 has model parallel group: [0, 1, 2, 3, 4, 5, 6, 7]
[NeMo I 2024-12-12 11:23:48 megatron_init:348] All model parallel group ranks: [[0, 1, 2, 3, 4, 5, 6, 7]]
[NeMo I 2024-12-12 11:23:48 megatron_init:357] Rank 0 has tensor 

[NeMo I 2024-12-12 11:23:48 tokenizer_utils:184] Getting HuggingFace AutoTokenizer with pretrained_model_name: meta-llama/Meta-Llama-3.1-8B
[NeMo I 2024-12-12 11:23:49 megatron_base_model:601] Padded vocab_size: 129024, original vocab_size: 128256, dummy tokens: 768.


[NeMo W 2024-12-12 11:23:49 megatron_base_model:1186] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-12 11:23:49 megatron_base_model:1186] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-12 11:23:49 megatron_base_model:1186] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-12 11:23:49 megatron_base_model:1186] The model: MegatronGPTSFTModel() does not have field.name: use_te_rng_tracker in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-12 11:23:49 megatron_base_model:1186] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_bulk_wg

apply rope scaling ...


Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8


apply rope scaling ...
apply rope scaling ...
apply rope scaling ...
apply rope scaling ...


Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/8
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/8
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8
Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/8


apply rope scaling ...
apply rope scaling ...
apply rope scaling ...


Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 8 processes
----------------------------------------------------------------------------------------------------

      checkpoint.load_state_dict(
    
      device = getattr(value, "device", None)
    


[NeMo I 2024-12-12 11:24:56 nlp_overrides:1358] Model MegatronGPTSFTModel was successfully restored from /workspace/results/Llama-3.1-8B/pretrain/checkpoints/megatron_llama.nemo.
[NeMo I 2024-12-12 11:24:56 megatron_gpt_finetuning:75] Running full finetuning since no peft scheme is given.
      | Name  | Type     | Params | Mode 
    -------------------------------------------
    0 | model | GPTModel | 1.0 B  | train
    -------------------------------------------
    1.0 B     Trainable params
    0         Non-trainable params
    1.0 B     Total params
    4,019.208 Total estimated model params size (MB)
    649       Modules in train mode
    0         Modules in eval mode


[NeMo W 2024-12-12 11:24:56 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:161: You have overridden `MegatronGPTSFTModel.configure_sharded_model` which is deprecated. Please override the `configure_model` hook instead. Instantiation with the newer hook will be created on the device right away and have the right data type depending on the precision setting in the Trainer.
    
[NeMo W 2024-12-12 11:24:56 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:143: You are using the `dataloader_iter` step flavor. If you consume the iterator more than once per step, the `batch_idx` argument in any hook that takes it will not match with the batch index of the last batch consumed. This might have unforeseen effects on callbacks or code that expects to get the correct index. This will also not work well with gradient accumulation. This feature is very experimental and subjec

[NeMo I 2024-12-12 11:24:56 megatron_gpt_sft_model:836] Building GPT SFT validation datasets.
[NeMo I 2024-12-12 11:24:56 text_memmap_dataset:116] Building data files
[NeMo I 2024-12-12 11:24:56 text_memmap_dataset:527] Processing 1 data files using 2 workers


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


[NeMo I 2024-12-12 11:24:56 text_memmap_dataset:542] Time building 0 / 1 mem-mapped files: 0:00:00.079305
[NeMo I 2024-12-12 11:24:57 text_memmap_dataset:527] Processing 1 data files using 2 workers




[NeMo I 2024-12-12 11:24:57 text_memmap_dataset:542] Time building 0 / 1 mem-mapped files: 0:00:00.077116
[NeMo I 2024-12-12 11:24:57 text_memmap_dataset:158] Loading data files
[NeMo I 2024-12-12 11:24:57 text_memmap_dataset:249] Loading data/alpaca/validation.jsonl


      idx_info_dict = pickle.load(open(idx_fn + ".info", "rb"))
    


[NeMo I 2024-12-12 11:24:57 text_memmap_dataset:161] Time loading 1 mem-mapped files: 0:00:00.000836
[NeMo I 2024-12-12 11:24:57 text_memmap_dataset:165] Computing global indices
[NeMo I 2024-12-12 11:24:57 megatron_gpt_sft_model:840] Length of val dataset: 520
[NeMo I 2024-12-12 11:24:57 megatron_gpt_sft_model:828] Building GPT SFT test datasets.
[NeMo I 2024-12-12 11:24:57 text_memmap_dataset:116] Building data files
[NeMo I 2024-12-12 11:24:57 text_memmap_dataset:527] Processing 1 data files using 2 workers




[NeMo I 2024-12-12 11:24:57 text_memmap_dataset:542] Time building 0 / 1 mem-mapped files: 0:00:00.048851
[NeMo I 2024-12-12 11:24:57 text_memmap_dataset:527] Processing 1 data files using 2 workers




[NeMo I 2024-12-12 11:24:57 text_memmap_dataset:542] Time building 0 / 1 mem-mapped files: 0:00:00.045071
[NeMo I 2024-12-12 11:24:57 text_memmap_dataset:158] Loading data files
[NeMo I 2024-12-12 11:24:57 text_memmap_dataset:249] Loading data/alpaca/test.jsonl


      idx_info_dict = pickle.load(open(idx_fn + ".info", "rb"))
    


[NeMo I 2024-12-12 11:24:57 text_memmap_dataset:161] Time loading 1 mem-mapped files: 0:00:00.000649
[NeMo I 2024-12-12 11:24:57 text_memmap_dataset:165] Computing global indices
[NeMo I 2024-12-12 11:24:57 megatron_gpt_sft_model:831] Length of test dataset: 521
[NeMo I 2024-12-12 11:24:57 megatron_gpt_sft_model:847] Building GPT SFT traing datasets.
[NeMo I 2024-12-12 11:24:57 text_memmap_dataset:116] Building data files
[NeMo I 2024-12-12 11:24:57 text_memmap_dataset:527] Processing 1 data files using 2 workers




[NeMo I 2024-12-12 11:24:57 text_memmap_dataset:542] Time building 0 / 1 mem-mapped files: 0:00:00.055687
[NeMo I 2024-12-12 11:24:57 text_memmap_dataset:527] Processing 1 data files using 2 workers




[NeMo I 2024-12-12 11:24:57 text_memmap_dataset:542] Time building 0 / 1 mem-mapped files: 0:00:00.080003
[NeMo I 2024-12-12 11:24:57 text_memmap_dataset:158] Loading data files
[NeMo I 2024-12-12 11:24:57 text_memmap_dataset:249] Loading data/alpaca/training.jsonl


      idx_info_dict = pickle.load(open(idx_fn + ".info", "rb"))
    


[NeMo I 2024-12-12 11:24:57 text_memmap_dataset:161] Time loading 1 mem-mapped files: 0:00:00.000667
[NeMo I 2024-12-12 11:24:57 text_memmap_dataset:165] Computing global indices


      counts = torch.cuda.LongTensor([1])
    


make: Entering directory '/opt/NeMo/nemo/collections/nlp/data/language_modeling/megatron'
make: Nothing to be done for 'default'.
make: Leaving directory '/opt/NeMo/nemo/collections/nlp/data/language_modeling/megatron'
> building indices for blendable datasets ...
 > sample ratios:
   dataset 0, input: 1, achieved: 1
[NeMo I 2024-12-12 11:24:57 blendable_dataset:67] > elapsed time for building blendable dataset indices: 0.03 (sec)
[NeMo I 2024-12-12 11:24:57 megatron_gpt_sft_model:849] Length of train dataset: 3216
[NeMo I 2024-12-12 11:24:57 megatron_gpt_sft_model:854] Building dataloader with consumed samples: 0
[NeMo I 2024-12-12 11:24:57 megatron_gpt_sft_model:854] Building dataloader with consumed samples: 0
[NeMo I 2024-12-12 11:24:57 megatron_gpt_sft_model:854] Building dataloader with consumed samples: 0


LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 4 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 6 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 7 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 5 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]


[NeMo I 2024-12-12 11:24:57 nlp_overrides:274] Configuring DDP for model parallelism.
[NeMo I 2024-12-12 11:24:57 modelPT:787] Optimizer config = FusedAdam (
    Parameter Group 0
        betas: [0.9, 0.98]
        bias_correction: True
        eps: 1e-08
        is_expert: False
        lr: 0.0002
        weight_decay: 0.01
    )
[NeMo I 2024-12-12 11:24:57 lr_scheduler:948] Scheduler "<nemo.core.optim.lr_scheduler.CosineAnnealing object at 0x7f2f70f61de0>" 
    will be used during training (effective maximum steps = 100) - 
    Parameters : 
    (warmup_steps: 50
    min_lr: 0.0
    constant_steps: 0
    max_steps: 100
    )



  | Name  | Type     | Params | Mode 
-------------------------------------------
0 | model | GPTModel | 1.0 B  | train
-------------------------------------------
1.0 B     Trainable params
0         Non-trainable params
1.0 B     Total params
4,019.208 Total estimated model params size (MB)
649       Modules in train mode
0         Modules in eval mode
[NeMo W 2024-12-12 11:24:57 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/utilities.py:148: Found `dataloader_iter` argument in the `validation_step`. Note that the support for this signature is experimental and the behavior is subject to change.
    

Sanity Checking: |          | 0/? [00:00<?, ?it/s][NeMo I 2024-12-12 11:26:05 num_microbatches_calculator:218] setting number of microbatches to constant 16


    
    


Sanity Checking DataLoader 0: 100%|██████████| 2/2 [00:14<00:00,  0.14it/s][NeMo I 2024-12-12 11:26:20 num_microbatches_calculator:218] setting number of microbatches to constant 16


[NeMo W 2024-12-12 11:26:20 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:431: It is recommended to use `self.log('val_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    
[NeMo W 2024-12-12 11:26:20 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:431: It is recommended to use `self.log('validation_loss_dataloader0', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    
[NeMo W 2024-12-12 11:26:20 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:431: It is recommended to use `self.log('validation_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    
[NeMo W 202

Epoch 0: : 100%|██████████| 100/100 [03:51<00:00, reduced_train_loss=2.900, global_step=99.00, consumed_samples=3200.0, train_step_timing in s=1.940]
Validation: |          | 0/? [00:00<?, ?it/s][A[NeMo I 2024-12-12 11:31:22 num_microbatches_calculator:218] setting number of microbatches to constant 16

Validation:   0%|          | 0/17 [00:00<?, ?it/s][A
Validation DataLoader 0:   0%|          | 0/17 [00:00<?, ?it/s][A
Validation DataLoader 0:   6%|▌         | 1/17 [00:01<00:16,  0.94it/s][A
Validation DataLoader 0:  12%|█▏        | 2/17 [00:02<00:15,  0.96it/s][A
Validation DataLoader 0:  18%|█▊        | 3/17 [00:03<00:16,  0.83it/s][A
Validation DataLoader 0:  24%|██▎       | 4/17 [00:05<00:16,  0.78it/s][A
Validation DataLoader 0:  29%|██▉       | 5/17 [00:06<00:15,  0.75it/s][A
Validation DataLoader 0:  35%|███▌      | 6/17 [00:07<00:14,  0.78it/s][A
Validation DataLoader 0:  41%|████      | 7/17 [00:09<00:13,  0.76it/s][A
Validation DataLoader 0:  47%|████▋     | 8/17 [

Epoch 0, global step 100: 'validation_loss' reached 2.63934 (best 2.63934), saving model to '/workspace/results/Llama-3.1-8B/SFT/checkpoints/megatron_gpt_peft_None_tuning--validation_loss=2.639-step=100-consumed_samples=3200.0.ckpt' as top 1
[NeMo W 2024-12-12 11:31:45 nlp_overrides:610] Distributed checkpoints requires DistributedCheckpointIO plugin to be used. Setting up a default now.


Epoch 0: : 100%|██████████| 100/100 [04:16<00:00, reduced_train_loss=2.900, global_step=99.00, consumed_samples=3200.0, train_step_timing in s=1.940, val_loss=2.640][NeMo I 2024-12-12 11:31:45 dist_ckpt_io:421] Using TorchDistSaveShardedStrategy(torch_dist, 1) dist-ckpt save strategy.


    
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
    


Epoch 0: : 100%|██████████| 100/100 [05:35<00:00, reduced_train_loss=2.900, global_step=99.00, consumed_samples=3200.0, train_step_timing in s=1.940, val_loss=2.640]

CalledProcessError: Command 'b'\nMODEL_NAME=Llama-3.1-8B\nMODEL=results/Llama-3.1-8B/pretrain/checkpoints/megatron_llama.nemo\nNUM_GPUS=8\nMAX_STEPS=100\nVAL_INTERVAL=1.0\nGBS=32\nMBS=2\nTP=8\nPP=1\nLR=2e-4\nSEQ_LEN=8192\nTRAIN_DS=[data/alpaca/training.jsonl]\nVALID_DS=[data/alpaca/validation.jsonl]\nTEST_DS=[data/alpaca/test.jsonl]\nCONCAT_SAMPLING_PROBS=[1.0]\nPROMPT_TEMPLATE="\\"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\\n\\\nYou are a knowledgeable assistant trained to provide accurate and helpful information. Please respond to the user\'s queries promptly and politely.<|eot_id|>\\n\\\n<|start_header_id|>user<|end_header_id|>\\n\\\n{input}<|eot_id|>\\n\\\n<|start_header_id|>assistant<|end_header_id|>\\n\\\n{output}\\""\n\npython /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \\\n--config-path=/opt/NeMo/examples/nlp/language_modeling/tuning/conf --config-name=megatron_gpt_finetuning_config \\\ntrainer.devices=$NUM_GPUS \\\ntrainer.max_epochs=null \\\ntrainer.max_steps=$MAX_STEPS \\\ntrainer.val_check_interval=$VAL_INTERVAL \\\nexp_manager.explicit_log_dir=/workspace/results/$MODEL_NAME/SFT \\\nexp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \\\nmodel.tensor_model_parallel_size=$TP \\\nmodel.pipeline_model_parallel_size=$PP \\\nmodel.restore_from_path=$MODEL \\\nmodel.global_batch_size=$GBS \\\nmodel.micro_batch_size=$MBS \\\nmodel.data.train_ds.file_names=${TRAIN_DS} \\\nmodel.data.validation_ds.file_names=${VALID_DS} \\\nmodel.data.test_ds.file_names=${TEST_DS} \\\nmodel.data.train_ds.max_seq_length=$SEQ_LEN \\\nmodel.data.validation_ds.max_seq_length=$SEQ_LEN \\\nmodel.data.test_ds.max_seq_length=$SEQ_LEN \\\nmodel.data.train_ds.num_workers=8 \\\nmodel.data.validation_ds.num_workers=8 \\\nmodel.data.test_ds.num_workers=8 \\\nmodel.data.train_ds.concat_sampling_probabilities=${CONCAT_SAMPLING_PROBS} \\\nmodel.data.train_ds.prompt_template="$PROMPT_TEMPLATE" \\\nmodel.optim.lr=$LR 
\\\nmodel.peft.peft_scheme=null\n'' returned non-zero exit status 1.

# Original code

In [None]:
%%bash

MODEL_NAME=Llama-3.1-8B
MODEL=results/Llama-3.1-8B/pretrain/checkpoints/megatron_llama.nemo
NUM_GPUS=4
MAX_STEPS=100
VAL_INTERVAL=1.0
GBS=16
MBS=1
TP=4
PP=1
LR=1e-4
SEQ_LEN=8192
TRAIN_DS=[data/alpaca/training.jsonl]
VALID_DS=[data/alpaca/validation.jsonl]
TEST_DS=[data/alpaca/test.jsonl]
CONCAT_SAMPLING_PROBS=[1.0]
PROMPT_TEMPLATE="\"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\
You are a knowledgeable assistant trained to provide accurate and helpful information. Please respond to the user's queries promptly and politely.<|eot_id|>\n\
<|start_header_id|>user<|end_header_id|>\n\
{input}<|eot_id|>\n\
<|start_header_id|>assistant<|end_header_id|>\n\
{output}\""

python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
--config-path=/opt/NeMo/examples/nlp/language_modeling/tuning/conf --config-name=megatron_gpt_finetuning_config \
trainer.devices=$NUM_GPUS \
trainer.max_epochs=null \
trainer.max_steps=$MAX_STEPS \
trainer.val_check_interval=$VAL_INTERVAL \
exp_manager.explicit_log_dir=/workspace/results/$MODEL_NAME/SFT \
exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \
model.tensor_model_parallel_size=$TP \
model.pipeline_model_parallel_size=$PP \
model.restore_from_path=$MODEL \
model.global_batch_size=$GBS \
model.micro_batch_size=$MBS \
model.data.train_ds.file_names=${TRAIN_DS} \
model.data.validation_ds.file_names=${VALID_DS} \
model.data.test_ds.file_names=${TEST_DS} \
model.data.train_ds.max_seq_length=$SEQ_LEN \
model.data.validation_ds.max_seq_length=$SEQ_LEN \
model.data.test_ds.max_seq_length=$SEQ_LEN \
model.data.train_ds.num_workers=0 \
model.data.validation_ds.num_workers=0 \
model.data.test_ds.num_workers=0 \
model.data.train_ds.concat_sampling_probabilities=${CONCAT_SAMPLING_PROBS} \
model.data.train_ds.prompt_template="$PROMPT_TEMPLATE" \
model.optim.lr=$LR \
model.peft.peft_scheme=null

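The batch-size variables in the cell above interact with the parallelism settings. As a rough sketch, assuming only tensor and pipeline parallelism are in use (no context or expert parallelism), the data-parallel size is `NUM_GPUS / (TP * PP)`, and each optimizer step accumulates `GBS / (MBS * data_parallel_size)` micro-batches. The helper name `batch_layout` is hypothetical and not part of NeMo.

```python
def batch_layout(num_gpus, tp, pp, gbs, mbs):
    """Return (data_parallel_size, num_microbatches) for one optimizer step.

    Assumes the GPUs are split only across tensor (tp), pipeline (pp),
    and data parallelism.
    """
    model_parallel = tp * pp
    if num_gpus % model_parallel != 0:
        raise ValueError("num_gpus must be divisible by tp * pp")
    dp = num_gpus // model_parallel
    if gbs % (mbs * dp) != 0:
        raise ValueError("gbs must be divisible by mbs * data_parallel_size")
    return dp, gbs // (mbs * dp)


# Values from the cell above: NUM_GPUS=4, TP=4, PP=1, GBS=16, MBS=1
dp, n_micro = batch_layout(num_gpus=4, tp=4, pp=1, gbs=16, mbs=1)
print(dp, n_micro)

# Number of training samples consumed after MAX_STEPS=100 steps
print(100 * 16)
```

With the 8-GPU settings shown in the error message earlier (`GBS=32`, `MBS=2`, `TP=8`), the same arithmetic gives 16 micro-batches, which agrees with the log line `setting number of microbatches to constant 16`.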
### 2.4. Parameter Efficient Fine-tuning <a name='s2.4'></a>
Fine-tuning a language model can be computationally expensive and risks overfitting, especially with small, specialized datasets. Parameter-efficient fine-tuning (PEFT) methods such as LoRA offer a solution. These techniques adapt the model to a specific task by training only a small set of added parameters while the pre-trained weights stay frozen, which reduces computational cost and mitigates the risk of overfitting. In essence, LoRA enables a more efficient and targeted adaptation of large language models to specialized tasks.
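To make the idea concrete, here is a minimal pure-Python sketch of a LoRA layer. It is an illustration under stated assumptions, not NeMo's implementation: the frozen weight `w` is augmented by a low-rank product of two small trainable matrices `a` and `b`, and `alpha` is a hypothetical scaling hyperparameter. The parameter count at the end assumes adapters are attached only to the fused QKV projection of each layer (as the `lora_kqv_adapter` log lines suggest) and uses the Llama-3.1-8B dimensions: hidden size 4096, 32 layers, and a fused QKV output of 4096 + 2 × 1024 = 6144.

```python
def matmul(p, q):
    """Multiply two matrices given as lists of rows."""
    return [[sum(pi * qj for pi, qj in zip(row, col)) for col in zip(*q)]
            for row in p]


def transpose(m):
    return [list(col) for col in zip(*m)]


def lora_forward(x, w, a, b, alpha):
    """Return x W^T + (alpha / r) * x A^T B^T.

    x: (n, d_in), w: (d_out, d_in) frozen,
    a: (r, d_in) and b: (d_out, r) are the only trainable matrices.
    """
    r = len(a)
    base = matmul(x, transpose(w))
    delta = matmul(matmul(x, transpose(a)), transpose(b))
    scale = alpha / r
    return [[bv + scale * dv for bv, dv in zip(brow, drow)]
            for brow, drow in zip(base, delta)]


def lora_param_count(d_in, d_out, r, num_layers):
    """Trainable parameters when one rank-r adapter is added per layer."""
    return num_layers * r * (d_in + d_out)


# Toy example: with b initialised to zero the adapter leaves the output unchanged.
x = [[1.0, 2.0, 3.0]]
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
a = [[1.0, 1.0, 1.0]]
b_zero = [[0.0], [0.0]]
print(lora_forward(x, w, a, b_zero, alpha=1.0))

# Assumed Llama-3.1-8B shapes with adapter_dim=32 on the fused QKV projection.
print(lora_param_count(d_in=4096, d_out=6144, r=32, num_layers=32))
```

Under these assumptions the count comes to 10,485,760, in line with the `10.5 M    Trainable params` reported in the training log for this run.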

# Original code

In [None]:
%%bash

MODEL_NAME=Llama-3.1-8B
MODEL=results/Llama-3.1-8B/pretrain/checkpoints/megatron_llama.nemo
NUM_GPUS=4
MAX_STEPS=100
VAL_INTERVAL=1.0
GBS=16
MBS=1
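TP=4  # not passed to the command below, which hard-codes model.tensor_model_parallel_size=1
PP=1  # not passed to the command below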
LR=1e-4
SEQ_LEN=8192
TRAIN_DS=[data/alpaca/training.jsonl]
VALID_DS=[data/alpaca/validation.jsonl]
TEST_DS=[data/alpaca/test.jsonl]
CONCAT_SAMPLING_PROBS=[1.0]
PROMPT_TEMPLATE="\"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\
You are a knowledgeable assistant trained to provide accurate and helpful information. Please respond to the user's queries promptly and politely.<|eot_id|>\n\
<|start_header_id|>user<|end_header_id|>\n\
{input}<|eot_id|>\n\
<|start_header_id|>assistant<|end_header_id|>\n\
{output}\""

python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
--config-path=/opt/NeMo/examples/nlp/language_modeling/tuning/conf --config-name=megatron_gpt_finetuning_config \
trainer.devices=$NUM_GPUS \
trainer.max_epochs=null \
trainer.max_steps=$MAX_STEPS \
trainer.val_check_interval=$VAL_INTERVAL \
exp_manager.explicit_log_dir=/workspace/results/$MODEL_NAME/PEFT \
exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True \
model.tensor_model_parallel_size=1 \
model.restore_from_path=$MODEL \
model.global_batch_size=$GBS \
model.micro_batch_size=$MBS \
model.data.train_ds.file_names=${TRAIN_DS} \
model.data.validation_ds.file_names=${VALID_DS} \
model.data.test_ds.file_names=${TEST_DS} \
model.data.train_ds.max_seq_length=$SEQ_LEN \
model.data.validation_ds.max_seq_length=$SEQ_LEN \
model.data.test_ds.max_seq_length=$SEQ_LEN \
model.data.train_ds.num_workers=0 \
model.data.validation_ds.num_workers=0 \
model.data.test_ds.num_workers=0 \
model.data.train_ds.concat_sampling_probabilities=${CONCAT_SAMPLING_PROBS} \
model.data.train_ds.prompt_template="$PROMPT_TEMPLATE" \
model.optim.lr=$LR \
model.peft.peft_scheme=lora \
model.peft.lora_tuning.adapter_dim=32

    


[NeMo I 2024-12-12 12:04:02 megatron_gpt_finetuning:56] 
    
    ************** Experiment configuration ***********
[NeMo I 2024-12-12 12:04:02 megatron_gpt_finetuning:57] 
    name: megatron_gpt_peft_${model.peft.peft_scheme}_tuning
    trainer:
      devices: 4
      accelerator: gpu
      num_nodes: 1
      precision: 16
      logger: false
      enable_checkpointing: false
      use_distributed_sampler: false
      max_epochs: null
      max_steps: 100
      log_every_n_steps: 10
      val_check_interval: 1.0
      gradient_clip_val: 1.0
    exp_manager:
      explicit_log_dir: /workspace/results/Llama-3.1-8B/PEFT
      exp_dir: null
      name: ${name}
      create_wandb_logger: false
      wandb_logger_kwargs:
        project: null
        name: null
      resume_if_exists: true
      resume_ignore_no_checkpoint: true
      create_checkpoint_callback: true
      checkpoint_callback_params:
        monitor: validation_${model.data.validation_ds.metric.name}
        save_top_k: 1

    
[NeMo W 2024-12-12 12:04:02 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/_graveyard/precision.py:49: The `MixedPrecisionPlugin` is deprecated. Use `pytorch_lightning.plugins.precision.MixedPrecision` instead.
    
    
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
`Trainer(val_check_interval=1.0)` was configured so validation will run at the end of the training epoch..


[NeMo I 2024-12-12 12:04:03 exp_manager:400] ExpManager schema
[NeMo I 2024-12-12 12:04:03 exp_manager:401] {'explicit_log_dir': None, 'exp_dir': None, 'name': None, 'version': None, 'use_datetime_version': True, 'resume_if_exists': False, 'resume_past_end': False, 'resume_ignore_no_checkpoint': False, 'resume_from_checkpoint': None, 'create_tensorboard_logger': True, 'summary_writer_kwargs': None, 'create_wandb_logger': False, 'wandb_logger_kwargs': None, 'create_mlflow_logger': False, 'mlflow_logger_kwargs': {'experiment_name': None, 'tracking_uri': None, 'tags': None, 'save_dir': './mlruns', 'prefix': '', 'artifact_location': None, 'run_id': None, 'log_model': False}, 'create_dllogger_logger': False, 'dllogger_logger_kwargs': {'verbose': False, 'stdout': False, 'json_file': './dllogger.json'}, 'create_clearml_logger': False, 'clearml_logger_kwargs': {'project': None, 'task': None, 'connect_pytorch': False, 'model_name': None, 'tags': None, 'log_model': False, 'log_cfg': False, 'log_

[NeMo W 2024-12-12 12:04:03 exp_manager:862] Exp_manager is logging to /workspace/results/Llama-3.1-8B/PEFT, but it already exists.
[NeMo W 2024-12-12 12:04:03 exp_manager:784] There were no checkpoints found in checkpoint_dir or no checkpoint folder at checkpoint_dir :/workspace/results/Llama-3.1-8B/PEFT/checkpoints. Training from scratch.


[NeMo I 2024-12-12 12:04:03 exp_manager:459] Experiments will be logged at /workspace/results/Llama-3.1-8B/PEFT
[NeMo I 2024-12-12 12:04:03 exp_manager:1010] TensorboardLogger has been set up


[NeMo W 2024-12-12 12:04:03 exp_manager:1139] The checkpoint callback was told to monitor a validation value and trainer's max_steps was set to 100. Please ensure that max_steps will run for at least 1 epochs to ensure that checkpointing will not error out.


[NeMo I 2024-12-12 12:04:03 exp_manager:593] TFLOPs per sec per GPU will be calculated, conditioned on supported models. Defaults to -1 upon failure.


[NeMo W 2024-12-12 12:04:13 megatron_base_model:1186] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-12 12:04:13 megatron_base_model:1186] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-12 12:04:13 megatron_base_model:1186] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-12 12:04:13 megatron_base_model:1186] The model: MegatronGPTSFTModel() does not have field.name: use_te_rng_tracker in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-12 12:04:13 megatron_base_model:1186] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_bulk_wg

[NeMo I 2024-12-12 12:04:13 megatron_init:314] Rank 0 has data parallel group : [0, 1, 2, 3]
[NeMo I 2024-12-12 12:04:13 megatron_init:320] Rank 0 has combined group of data parallel and context parallel : [0, 1, 2, 3]
[NeMo I 2024-12-12 12:04:13 megatron_init:325] All data parallel group ranks with context parallel combined: [[0, 1, 2, 3]]
[NeMo I 2024-12-12 12:04:13 megatron_init:328] Ranks 0 has data parallel rank: 0
[NeMo I 2024-12-12 12:04:13 megatron_init:336] Rank 0 has context parallel group: [0]
[NeMo I 2024-12-12 12:04:13 megatron_init:339] All context parallel group ranks: [[0], [1], [2], [3]]
[NeMo I 2024-12-12 12:04:13 megatron_init:340] Ranks 0 has context parallel rank: 0
[NeMo I 2024-12-12 12:04:13 megatron_init:347] Rank 0 has model parallel group: [0]
[NeMo I 2024-12-12 12:04:13 megatron_init:348] All model parallel group ranks: [[0], [1], [2], [3]]
[NeMo I 2024-12-12 12:04:13 megatron_init:357] Rank 0 has tensor model parallel group: [0]
[NeMo I 2024-12-12 12:04:13 m


[NeMo I 2024-12-12 12:04:13 tokenizer_utils:184] Getting HuggingFace AutoTokenizer with pretrained_model_name: meta-llama/Meta-Llama-3.1-8B
[NeMo I 2024-12-12 12:04:14 megatron_base_model:601] Padded vocab_size: 128256, original vocab_size: 128256, dummy tokens: 0.


[NeMo W 2024-12-12 12:04:14 megatron_base_model:1186] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-12 12:04:14 megatron_base_model:1186] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-12 12:04:14 megatron_base_model:1186] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-12 12:04:14 megatron_base_model:1186] The model: MegatronGPTSFTModel() does not have field.name: use_te_rng_tracker in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-12 12:04:14 megatron_base_model:1186] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_bulk_wg

apply rope scaling ...


Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4


apply rope scaling ...
apply rope scaling ...
apply rope scaling ...


Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------

    


[NeMo I 2024-12-12 12:05:20 nlp_overrides:1358] Model MegatronGPTSFTModel was successfully restored from /workspace/results/Llama-3.1-8B/pretrain/checkpoints/megatron_llama.nemo.
[NeMo I 2024-12-12 12:05:20 megatron_gpt_finetuning:72] Adding adapter weights to the model for PEFT
[NeMo I 2024-12-12 12:05:20 nlp_adapter_mixins:249] Before adding PEFT params:
      | Name  | Type     | Params | Mode 
    -------------------------------------------
    0 | model | GPTModel | 8.0 B  | train
    -------------------------------------------
    0         Trainable params
    8.0 B     Non-trainable params
    8.0 B     Total params
    32,121.045Total estimated model params size (MB)
    649       Modules in train mode
    0         Modules in eval mode
[NeMo I 2024-12-12 12:05:22 nlp_adapter_mixins:254] After adding PEFT params:
      | Name  | Type     | Params | Mode 
    -------------------------------------------
    0 | model | GPTModel | 8.0 B  | train
    ------------------------------

[NeMo W 2024-12-12 12:05:22 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:161: You have overridden `MegatronGPTSFTModel.configure_sharded_model` which is deprecated. Please override the `configure_model` hook instead. Instantiation with the newer hook will be created on the device right away and have the right data type depending on the precision setting in the Trainer.
    
[NeMo W 2024-12-12 12:05:22 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:143: You are using the `dataloader_iter` step flavor. If you consume the iterator more than once per step, the `batch_idx` argument in any hook that takes it will not match with the batch index of the last batch consumed. This might have unforeseen effects on callbacks or code that expects to get the correct index. This will also not work well with gradient accumulation. This feature is very experimental and subjec

[NeMo I 2024-12-12 12:05:23 megatron_gpt_sft_model:836] Building GPT SFT validation datasets.
[NeMo I 2024-12-12 12:05:23 text_memmap_dataset:116] Building data files
[NeMo I 2024-12-12 12:05:23 text_memmap_dataset:527] Processing 1 data files using 2 workers


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


[NeMo I 2024-12-12 12:05:23 text_memmap_dataset:542] Time building 0 / 1 mem-mapped files: 0:00:00.098930
[NeMo I 2024-12-12 12:05:23 text_memmap_dataset:527] Processing 1 data files using 2 workers




[NeMo I 2024-12-12 12:05:23 text_memmap_dataset:542] Time building 0 / 1 mem-mapped files: 0:00:00.067703
[NeMo I 2024-12-12 12:05:23 text_memmap_dataset:158] Loading data files
[NeMo I 2024-12-12 12:05:23 text_memmap_dataset:249] Loading data/alpaca/validation.jsonl


    


[NeMo I 2024-12-12 12:05:23 text_memmap_dataset:161] Time loading 1 mem-mapped files: 0:00:00.000806
[NeMo I 2024-12-12 12:05:23 text_memmap_dataset:165] Computing global indices
[NeMo I 2024-12-12 12:05:23 megatron_gpt_sft_model:840] Length of val dataset: 520
[NeMo I 2024-12-12 12:05:23 megatron_gpt_sft_model:828] Building GPT SFT test datasets.
[NeMo I 2024-12-12 12:05:23 text_memmap_dataset:116] Building data files
[NeMo I 2024-12-12 12:05:23 text_memmap_dataset:527] Processing 1 data files using 2 workers




[NeMo I 2024-12-12 12:05:23 text_memmap_dataset:542] Time building 0 / 1 mem-mapped files: 0:00:00.048978
[NeMo I 2024-12-12 12:05:23 text_memmap_dataset:527] Processing 1 data files using 2 workers




[NeMo I 2024-12-12 12:05:23 text_memmap_dataset:542] Time building 0 / 1 mem-mapped files: 0:00:00.046652
[NeMo I 2024-12-12 12:05:23 text_memmap_dataset:158] Loading data files
[NeMo I 2024-12-12 12:05:23 text_memmap_dataset:249] Loading data/alpaca/test.jsonl


    


[NeMo I 2024-12-12 12:05:23 text_memmap_dataset:161] Time loading 1 mem-mapped files: 0:00:00.000631
[NeMo I 2024-12-12 12:05:23 text_memmap_dataset:165] Computing global indices
[NeMo I 2024-12-12 12:05:23 megatron_gpt_sft_model:831] Length of test dataset: 521
[NeMo I 2024-12-12 12:05:23 megatron_gpt_sft_model:847] Building GPT SFT traing datasets.
[NeMo I 2024-12-12 12:05:23 text_memmap_dataset:116] Building data files
[NeMo I 2024-12-12 12:05:23 text_memmap_dataset:527] Processing 1 data files using 2 workers




[NeMo I 2024-12-12 12:05:23 text_memmap_dataset:542] Time building 0 / 1 mem-mapped files: 0:00:00.047996
[NeMo I 2024-12-12 12:05:23 text_memmap_dataset:527] Processing 1 data files using 2 workers




[NeMo I 2024-12-12 12:05:23 text_memmap_dataset:542] Time building 0 / 1 mem-mapped files: 0:00:00.047688
[NeMo I 2024-12-12 12:05:23 text_memmap_dataset:158] Loading data files
[NeMo I 2024-12-12 12:05:23 text_memmap_dataset:249] Loading data/alpaca/training.jsonl


    


[NeMo I 2024-12-12 12:05:23 text_memmap_dataset:161] Time loading 1 mem-mapped files: 0:00:00.000648
[NeMo I 2024-12-12 12:05:23 text_memmap_dataset:165] Computing global indices


    


make: Entering directory '/opt/NeMo/nemo/collections/nlp/data/language_modeling/megatron'
make: Nothing to be done for 'default'.
make: Leaving directory '/opt/NeMo/nemo/collections/nlp/data/language_modeling/megatron'
> building indices for blendable datasets ...
 > sample ratios:
   dataset 0, input: 1, achieved: 1
[NeMo I 2024-12-12 12:05:24 blendable_dataset:67] > elapsed time for building blendable dataset indices: 0.03 (sec)
[NeMo I 2024-12-12 12:05:24 megatron_gpt_sft_model:849] Length of train dataset: 1608
[NeMo I 2024-12-12 12:05:24 megatron_gpt_sft_model:854] Building dataloader with consumed samples: 0
[NeMo I 2024-12-12 12:05:24 megatron_gpt_sft_model:854] Building dataloader with consumed samples: 0
[NeMo I 2024-12-12 12:05:24 megatron_gpt_sft_model:854] Building dataloader with consumed samples: 0


LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5,6,7]


[NeMo I 2024-12-12 12:05:24 nlp_overrides:274] Configuring DDP for model parallelism.
[NeMo I 2024-12-12 12:05:24 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-12-12 12:05:24 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-12-12 12:05:24 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-12-12 12:05:24 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-12-12 12:05:24 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-12-12 12:05:24 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-12-12 12:05:24 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-12-12 12:05:24 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-12-12 12:05:24 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-12-12 12:05:24 adapter_mixins:495] Unfrozen adapter : lora_kqv_adapter
[NeMo I 2024-12-12 12:05:24 adapter_mixins:495] Unfrozen adapter : lora_kq


  | Name  | Type     | Params | Mode 
-------------------------------------------
0 | model | GPTModel | 8.0 B  | train
-------------------------------------------
10.5 M    Trainable params
8.0 B     Non-trainable params
8.0 B     Total params
32,162.988Total estimated model params size (MB)
809       Modules in train mode
0         Modules in eval mode
[NeMo W 2024-12-12 12:05:24 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py:424: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=47` in the `DataLoader` to improve performance.
    
[NeMo W 2024-12-12 12:05:24 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/utilities.py:148: Found `dataloader_iter` argument in the `validation_step`. Note that the support for this signature is experimental and the behavior is subject to change.
    


Sanity Checking: |          | 0/? [00:00<?, ?it/s][NeMo I 2024-12-12 12:05:25 num_microbatches_calculator:218] setting number of microbatches to constant 4


      self.activation_dtype = torch.get_autocast_gpu_dtype()
    
    


Sanity Checking DataLoader 0: 100%|██████████| 2/2 [00:13<00:00,  0.15it/s][NeMo I 2024-12-12 12:05:38 num_microbatches_calculator:218] setting number of microbatches to constant 4


[NeMo W 2024-12-12 12:05:38 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:431: It is recommended to use `self.log('val_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    
[NeMo W 2024-12-12 12:05:38 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:431: It is recommended to use `self.log('validation_loss_dataloader0', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    
[NeMo W 2024-12-12 12:05:38 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:431: It is recommended to use `self.log('validation_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    
[NeMo W 202

Epoch 0: : 100%|██████████| 100/100 [02:18<00:00, reduced_train_loss=1.590, global_step=99.00, consumed_samples=1600.0, train_step_timing in s=1.550]
Validation: |          | 0/? [00:00<?, ?it/s][A[NeMo I 2024-12-12 12:07:57 num_microbatches_calculator:218] setting number of microbatches to constant 4

Validation:   0%|          | 0/33 [00:00<?, ?it/s][A
Validation DataLoader 0:   0%|          | 0/33 [00:00<?, ?it/s][A
Validation DataLoader 0:   3%|▎         | 1/33 [00:00<00:08,  3.59it/s][A
Validation DataLoader 0:   6%|▌         | 2/33 [00:00<00:08,  3.58it/s][A
Validation DataLoader 0:   9%|▉         | 3/33 [00:01<00:13,  2.25it/s][A
Validation DataLoader 0:  12%|█▏        | 4/33 [00:02<00:15,  1.90it/s][A
Validation DataLoader 0:  15%|█▌        | 5/33 [00:02<00:16,  1.73it/s][A
Validation DataLoader 0:  18%|█▊        | 6/33 [00:03<00:16,  1.64it/s][A
Validation DataLoader 0:  21%|██        | 7/33 [00:04<00:16,  1.59it/s][A
Validation DataLoader 0:  24%|██▍       | 8/33 [0

Epoch 0, global step 100: 'validation_loss' reached 1.79125 (best 1.79125), saving model to '/workspace/results/Llama-3.1-8B/PEFT/checkpoints/megatron_gpt_peft_lora_tuning--validation_loss=1.791-step=100-consumed_samples=1600.0.ckpt' as top 1
Error executing job with overrides: ['trainer.devices=4', 'trainer.max_epochs=null', 'trainer.max_steps=100', 'trainer.val_check_interval=1.0', 'exp_manager.explicit_log_dir=/workspace/results/Llama-3.1-8B/PEFT', 'exp_manager.checkpoint_callback_params.save_nemo_on_train_end=True', 'model.tensor_model_parallel_size=1', 'model.restore_from_path=results/Llama-3.1-8B/pretrain/checkpoints/megatron_llama.nemo', 'model.global_batch_size=16', 'model.micro_batch_size=1', 'model.data.train_ds.file_names=[data/alpaca/training.jsonl]', 'model.data.validation_ds.file_names=[data/alpaca/validation.jsonl]', 'model.data.test_ds.file_names=[data/alpaca/test.jsonl]', 'model.data.train_ds.max_seq_length=8192', 'model.data.validation_ds.max_seq_length=8192', 'model.

Process is interrupted.
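
The figures in the logs are consistent with the batch-size overrides. The sketch below reproduces them under one assumption that the truncated override list does not confirm: a pipeline-parallel size of 1 for the PEFT training run. The helper name `num_microbatches` is illustrative and is not a NeMo API.

```python
import math

def num_microbatches(global_batch_size, micro_batch_size, devices, tp=1, pp=1):
    """Micro-batches each data-parallel rank processes per global batch."""
    data_parallel = devices // (tp * pp)          # GPUs left for data parallelism
    per_pass = micro_batch_size * data_parallel   # samples consumed per micro-batch pass
    if global_batch_size % per_pass != 0:
        raise ValueError("global batch size must be divisible by micro_batch_size * data_parallel")
    return global_batch_size // per_pass

# PEFT training: global_batch_size=16, micro_batch_size=1, 4 GPUs, TP=1 (PP=1 assumed)
print(num_microbatches(16, 1, devices=4))   # the log reports "constant 4"

# 100 optimizer steps at 16 samples per step
print(100 * 16)                             # the log reports consumed_samples=1600

# PEFT evaluation: 521 test samples at a global batch size of 32 on 4 GPUs
# (micro-batch size of 1 assumed)
print(math.ceil(521 / 32))                  # the progress bar shows 17 batches
print(num_microbatches(32, 1, devices=4))   # the log reports "constant 8"
```

If the global batch size is not divisible by the micro-batch size times the data-parallel size, the helper raises an error instead of silently rounding.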


## 3. Evaluation <a name='s3'></a>

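To evaluate a model produced by full-parameter fine-tuning (SFT), pass its `.nemo` file to `megatron_gpt_generate.py` through `model.restore_from_path`, as in the cells below.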
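
Because the generation script takes several minutes to start, it can help to confirm first that the checkpoint and test set exist. The following is a minimal sketch; `missing_paths` is a hypothetical helper, and the paths are the ones used in the cells of this section.

```python
from pathlib import Path

def missing_paths(paths):
    """Return the entries of `paths` that do not exist on disk."""
    return [str(p) for p in map(Path, paths) if not p.exists()]

required = [
    "results/Llama-3.1-8B/SFT/checkpoints/megatron_gpt_peft_None_tuning.nemo",
    "data/alpaca/test.jsonl",
]

missing = missing_paths(required)
if missing:
    print("Missing before evaluation:")
    for path in missing:
        print("  " + path)
else:
    print("All evaluation inputs found.")
```

This only checks that the files are present. It does not validate the checkpoint contents or the parallelism settings.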

# 8 H100 GPUs

In [None]:
%%bash

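MODEL_NAME=Llama-3.1-8B
MODEL=results/Llama-3.1-8B/SFT/checkpoints/megatron_gpt_peft_None_tuning.nemo  # SFT checkpoint to evaluate
NUM_GPUS=8                        # GPUs on this node
TP=8                              # tensor-parallel size; with 8 GPUs and PP=1 the data-parallel size is 1
GB=64                             # global batch size for generation
SEQ_LEN=8192                      # maximum sequence length
TEST_DS=[data/alpaca/test.jsonl]  # Hydra list of test files
OUTPUT=data/alpaca/prediction     # prefix of the prediction file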
PROMPT_TEMPLATE="\"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\
You are a knowledgeable assistant trained to provide accurate and helpful information. Please respond to the user's queries promptly and politely.<|eot_id|>\n\
<|start_header_id|>user<|end_header_id|>\n\
{input}<|eot_id|>\n\
<|start_header_id|>assistant<|end_header_id|>\n\
{output}\""

python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_generate.py \
trainer.precision=bf16 \
trainer.devices=$NUM_GPUS \
model.restore_from_path=$MODEL \
model.global_batch_size=$GB \
model.tensor_model_parallel_size=$TP \
model.pipeline_model_parallel_size=1 \
model.megatron_amp_O2=True \
model.peft.restore_from_path=null \
model.data.test_ds.file_names=$TEST_DS \
model.data.test_ds.names=\['alpaca_test'] \
model.data.test_ds.global_batch_size=$GB \
model.data.test_ds.tokens_to_generate=128 \
model.data.test_ds.label_key='output' \
model.data.test_ds.add_eos=True \
model.data.test_ds.add_sep=False \
model.data.test_ds.add_bos=False \
model.data.test_ds.max_seq_length=$SEQ_LEN \
model.data.test_ds.truncation_field="input" \
model.data.test_ds.prompt_template="$PROMPT_TEMPLATE" \
model.data.test_ds.write_predictions_to_file=True \
model.data.test_ds.output_file_path_prefix=$OUTPUT

# Original code

In [None]:
%%bash

MODEL_NAME=Llama-3.1-8B
MODEL=results/Llama-3.1-8B/SFT/checkpoints/megatron_gpt_peft_None_tuning.nemo
NUM_GPUS=4
TP=1
GB=32
SEQ_LEN=8192
TEST_DS=[data/alpaca/test.jsonl]
OUTPUT=data/alpaca/prediction
PROMPT_TEMPLATE="\"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\
You are a knowledgeable assistant trained to provide accurate and helpful information. Please respond to the user's queries promptly and politely.<|eot_id|>\n\
<|start_header_id|>user<|end_header_id|>\n\
{input}<|eot_id|>\n\
<|start_header_id|>assistant<|end_header_id|>\n\
{output}\""

python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_generate.py \
trainer.precision=bf16 \
trainer.devices=$NUM_GPUS \
model.restore_from_path=$MODEL \
model.global_batch_size=$GB \
model.tensor_model_parallel_size=$TP \
model.pipeline_model_parallel_size=1 \
model.megatron_amp_O2=True \
model.peft.restore_from_path=null \
model.data.test_ds.file_names=$TEST_DS \
model.data.test_ds.names=\['alpaca_test'] \
model.data.test_ds.global_batch_size=$GB \
model.data.test_ds.tokens_to_generate=128 \
model.data.test_ds.label_key='output' \
model.data.test_ds.add_eos=True \
model.data.test_ds.add_sep=False \
model.data.test_ds.add_bos=False \
model.data.test_ds.max_seq_length=$SEQ_LEN \
model.data.test_ds.truncation_field="input" \
model.data.test_ds.prompt_template="$PROMPT_TEMPLATE" \
model.data.test_ds.write_predictions_to_file=True \
model.data.test_ds.output_file_path_prefix=$OUTPUT

      def forward(ctx, input, weight, bias, allreduce_dgrad):
    
      def backward(ctx, grad_output):
    
      def forward(
    
      def backward(ctx, grad_output):
    
      from torch.distributed._sharded_tensor import ShardedTensor as TorchShardedTensor
    
        module is deprecated and will be removed in 0.10.0. Please use 
        'megatron.core.extensions.transformer_engine' instead.
      from megatron.core.transformer.custom_layers.transformer_engine import SplitAlongDim
    
      quantize_op_abstract = torch.library.impl_abstract("tensorrt::quantize_op")(
    
      cm = get_cmap("Set1")
    
      if get_gast_version() < LooseVersion("0.5"):
    
      other = LooseVersion(other)
    
    See https://hydra.cc/docs/1.2/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
      ret = run_job(
    


[NeMo I 2024-12-12 12:14:38 megatron_gpt_generate:125] 
    
    ************** Experiment configuration ***********
[NeMo I 2024-12-12 12:14:38 megatron_gpt_generate:126] 
    name: megatron_gpt_peft_${model.peft.peft_scheme}_tuning
    trainer:
      devices: 4
      accelerator: gpu
      num_nodes: 1
      precision: bf16
      logger: false
      enable_checkpointing: false
      use_distributed_sampler: false
      max_epochs: 9999
      max_steps: 20000
      log_every_n_steps: 10
      val_check_interval: 200
      gradient_clip_val: 1.0
    exp_manager:
      explicit_log_dir: null
      exp_dir: null
      name: ${name}
      create_wandb_logger: false
      wandb_logger_kwargs:
        project: null
        name: null
      resume_if_exists: true
      resume_ignore_no_checkpoint: true
      create_checkpoint_callback: true
      checkpoint_callback_params:
        monitor: validation_${model.data.test_ds.metric.name}
        save_top_k: 1
        mode: max
        save_nemo

[NeMo W 2024-12-12 12:14:38 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/_graveyard/precision.py:49: The `MixedPrecisionPlugin` is deprecated. Use `pytorch_lightning.plugins.precision.MixedPrecision` instead.
    
      torch.set_autocast_gpu_dtype(dtype)
    
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
Error executing job with overrides: ['trainer.precision=bf16', 'trainer.devices=4', 'model.restore_from_path=results/Llama-3.1-8B/SFT/checkpoints/megatron_gpt_peft_None_tuning.nemo', 'model.global_batch_size=32', 'model.tensor_model_parallel_size=1', 'model.pipeline_model_parallel_size=1', 'model.megatron_amp_O2=True', 'model.peft.restore_from_path=null', 'model.data.test_ds.file_names=[data/alpaca/test.jsonl]', 'model.data.test_ds.names=[alpaca_test]', 'model.data.test_ds.global_batch_size=32', 'model.data.test_ds.tokens_to_generate=128', 'model.data.test_ds.label_key=output', 'model.

CalledProcessError: Command 'b'\nMODEL_NAME=Llama-3.1-8B\nMODEL=results/Llama-3.1-8B/SFT/checkpoints/megatron_gpt_peft_None_tuning.nemo\nNUM_GPUS=4\nTP=1\nGB=32\nSEQ_LEN=8192\nTEST_DS=[data/alpaca/test.jsonl]\nOUTPUT=data/alpaca/prediction\nPROMPT_TEMPLATE="\\"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\\n\\\nYou are a knowledgeable assistant trained to provide accurate and helpful information. Please respond to the user\'s queries promptly and politely.<|eot_id|>\\n\\\n<|start_header_id|>user<|end_header_id|>\\n\\\n{input}<|eot_id|>\\n\\\n<|start_header_id|>assistant<|end_header_id|>\\n\\\n{output}\\""\n\npython /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_generate.py \\\ntrainer.precision=bf16 \\\ntrainer.devices=$NUM_GPUS \\\nmodel.restore_from_path=$MODEL \\\nmodel.global_batch_size=$GB \\\nmodel.tensor_model_parallel_size=$TP \\\nmodel.pipeline_model_parallel_size=1 \\\nmodel.megatron_amp_O2=True \\\nmodel.peft.restore_from_path=null \\\nmodel.data.test_ds.file_names=$TEST_DS \\\nmodel.data.test_ds.names=\\[\'alpaca_test\'] \\\nmodel.data.test_ds.global_batch_size=$GB \\\nmodel.data.test_ds.tokens_to_generate=128 \\\nmodel.data.test_ds.label_key=\'output\' \\\nmodel.data.test_ds.add_eos=True \\\nmodel.data.test_ds.add_sep=False \\\nmodel.data.test_ds.add_bos=False \\\nmodel.data.test_ds.max_seq_length=$SEQ_LEN \\\nmodel.data.test_ds.truncation_field="input" \\\nmodel.data.test_ds.prompt_template="$PROMPT_TEMPLATE" \\\nmodel.data.test_ds.write_predictions_to_file=True \\\nmodel.data.test_ds.output_file_path_prefix=$OUTPUT\n'' returned non-zero exit status 1.

In [None]:
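import json

def modify_and_overwrite_jsonl(file_path):
    """Rewrite a JSONL file in place so that non-ASCII text is stored unescaped."""
    # Load every record first, because the same file is overwritten below.
    data_list = []
    with open(file_path, 'r', encoding='utf-8') as file:
        for line in file:
            if line.strip():  # skip blank lines
                data_list.append(json.loads(line))

    # ensure_ascii=False writes Chinese characters as-is instead of \uXXXX escapes.
    with open(file_path, 'w', encoding='utf-8') as file:
        for data in data_list:
            json_line = json.dumps(data, ensure_ascii=False) + "\n"
            file.write(json_line)

file_path = "data/alpaca/prediction_test_alpaca_test_inputs_preds_labels.jsonl"
modify_and_overwrite_jsonl(file_path)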
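
Once the prediction file is readable, a quick way to inspect it is to compare predictions with labels. The sketch below computes an exact-match rate and assumes each record carries `pred` and `label` keys, as the file name suggests; check the actual keys in your file before relying on it. The sample records are made up for illustration.

```python
import json
from pathlib import Path

def exact_match_rate(lines, pred_key="pred", label_key="label"):
    """Fraction of JSONL records whose prediction equals the label after stripping whitespace."""
    total = hits = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        total += 1
        hits += record[pred_key].strip() == record[label_key].strip()
    return hits / total if total else 0.0

# Made-up records, for illustration only.
sample = [
    '{"input": "q1", "pred": "A", "label": "A"}',
    '{"input": "q2", "pred": "B ", "label": "B"}',
    '{"input": "q3", "pred": "C", "label": "D"}',
    '{"input": "q4", "pred": "E", "label": "F"}',
]
print(exact_match_rate(sample))  # 0.5

# Apply the same function to the real predictions if the file exists.
path = Path("data/alpaca/prediction_test_alpaca_test_inputs_preds_labels.jsonl")
if path.exists():
    with path.open(encoding="utf-8") as f:
        print(exact_match_rate(f))
```

Exact match is a crude measure for free-form instruction responses, so treat it as a sanity check rather than a quality score.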

To evaluate a PEFT model, you need two `.nemo` files: the base GPT model (`model.restore_from_path`) and the PEFT adapter (`model.peft.restore_from_path`).
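
The base model is required because the adapter file holds only the LoRA weights. The logs report 10.5 M trainable parameters for `lora_kqv_adapter`, against 8.0 B frozen ones. The arithmetic below reproduces that figure from the published Llama 3.1 8B dimensions, assuming a LoRA rank of 32 applied to the fused QKV projection in every layer. The rank is an assumption, since it does not appear in this notebook's output, so the match is suggestive rather than conclusive.

```python
def lora_param_count(in_features, out_features, rank, num_layers):
    """Parameters in the two low-rank matrices A (in x r) and B (r x out), summed over layers."""
    return num_layers * rank * (in_features + out_features)

# Llama 3.1 8B dimensions
hidden = 4096
num_layers = 32
head_dim = 128
q_heads, kv_heads = 32, 8                       # grouped-query attention
qkv_out = head_dim * (q_heads + 2 * kv_heads)   # fused Q, K, V output width: 6144

rank = 32  # assumed adapter dimension, not shown in the logs

total = lora_param_count(hidden, qkv_out, rank, num_layers)
print(f"{total:,}")  # 10,485,760
```

At roughly 0.13% of the base model's parameters, the adapter is small enough to store and swap independently of the 8 B checkpoint.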

In [None]:
%%bash

MODEL_NAME=Llama-3.1-8B
MODEL=results/Llama-3.1-8B/pretrain/checkpoints/megatron_llama.nemo
PEFT_MODEL=results/Llama-3.1-8B/PEFT/checkpoints/megatron_gpt_peft_lora_tuning.nemo
NUM_GPUS=4
GB=32
SEQ_LEN=8192
TEST_DS=[data/alpaca/test.jsonl]
OUTPUT=data/alpaca/prediction_peft
PROMPT_TEMPLATE="\"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\
You are a knowledgeable assistant trained to provide accurate and helpful information. Please respond to the user's queries promptly and politely.<|eot_id|>\n\
<|start_header_id|>user<|end_header_id|>\n\
{input}<|eot_id|>\n\
<|start_header_id|>assistant<|end_header_id|>\n\
{output}\""

python /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_generate.py \
model.restore_from_path=$MODEL \
model.peft.restore_from_path=$PEFT_MODEL \
model.peft.peft_scheme=lora \
trainer.devices=$NUM_GPUS \
model.global_batch_size=$GB \
model.data.test_ds.file_names=$TEST_DS \
model.data.test_ds.names=\['alpaca_test'] \
model.data.test_ds.global_batch_size=$GB \
model.data.test_ds.tokens_to_generate=128 \
model.data.test_ds.label_key='output' \
model.data.test_ds.add_eos=True \
model.data.test_ds.add_sep=False \
model.data.test_ds.add_bos=False \
model.data.test_ds.max_seq_length=$SEQ_LEN \
model.data.test_ds.truncation_field="input" \
model.data.test_ds.prompt_template="$PROMPT_TEMPLATE" \
model.data.test_ds.write_predictions_to_file=True \
model.data.test_ds.output_file_path_prefix=$OUTPUT

    


[NeMo I 2024-12-06 09:00:40 megatron_gpt_generate:125] 
    
    ************** Experiment configuration ***********
[NeMo I 2024-12-06 09:00:40 megatron_gpt_generate:126] 
    name: megatron_gpt_peft_${model.peft.peft_scheme}_tuning
    trainer:
      devices: 4
      accelerator: gpu
      num_nodes: 1
      precision: 16
      logger: false
      enable_checkpointing: false
      use_distributed_sampler: false
      max_epochs: 9999
      max_steps: 20000
      log_every_n_steps: 10
      val_check_interval: 200
      gradient_clip_val: 1.0
    exp_manager:
      explicit_log_dir: null
      exp_dir: null
      name: ${name}
      create_wandb_logger: false
      wandb_logger_kwargs:
        project: null
        name: null
      resume_if_exists: true
      resume_ignore_no_checkpoint: true
      create_checkpoint_callback: true
      checkpoint_callback_params:
        monitor: validation_${model.data.test_ds.metric.name}
        save_top_k: 1
        mode: max
        save_nemo_o

      super().__init__(
    
[NeMo W 2024-12-06 09:00:40 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/_graveyard/precision.py:49: The `MixedPrecisionPlugin` is deprecated. Use `pytorch_lightning.plugins.precision.MixedPrecision` instead.
    
      torch.set_autocast_gpu_dtype(dtype)
    
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[NeMo W 2024-12-06 09:01:07 megatron_base_model:1186] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-06 09:01:07 megatron_base_model:1186] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-06 09:01:07 megatron_base_model:1186] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_

[NeMo I 2024-12-06 09:01:07 megatron_init:314] Rank 0 has data parallel group : [0, 1, 2, 3]
[NeMo I 2024-12-06 09:01:07 megatron_init:320] Rank 0 has combined group of data parallel and context parallel : [0, 1, 2, 3]
[NeMo I 2024-12-06 09:01:07 megatron_init:325] All data parallel group ranks with context parallel combined: [[0, 1, 2, 3]]
[NeMo I 2024-12-06 09:01:07 megatron_init:328] Ranks 0 has data parallel rank: 0
[NeMo I 2024-12-06 09:01:07 megatron_init:336] Rank 0 has context parallel group: [0]
[NeMo I 2024-12-06 09:01:07 megatron_init:339] All context parallel group ranks: [[0], [1], [2], [3]]
[NeMo I 2024-12-06 09:01:07 megatron_init:340] Ranks 0 has context parallel rank: 0
[NeMo I 2024-12-06 09:01:07 megatron_init:347] Rank 0 has model parallel group: [0]
[NeMo I 2024-12-06 09:01:07 megatron_init:348] All model parallel group ranks: [[0], [1], [2], [3]]
[NeMo I 2024-12-06 09:01:07 megatron_init:357] Rank 0 has tensor model parallel group: [0]
[NeMo I 2024-12-06 09:01:07 m

[NeMo W 2024-12-06 09:01:07 megatron_base_model:1186] The model: MegatronGPTSFTModel() does not have field.name: expert_model_parallel_size in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-06 09:01:07 megatron_base_model:1186] The model: MegatronGPTSFTModel() does not have field.name: moe_extended_tp in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-06 09:01:07 megatron_base_model:1186] The model: MegatronGPTSFTModel() does not have field.name: finalize_model_grads_func in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-06 09:01:07 megatron_base_model:1186] The model: MegatronGPTSFTModel() does not have field.name: use_te_rng_tracker in its cfg. Add this key to cfg or config_mapping to make to make it configurable.
[NeMo W 2024-12-06 09:01:07 megatron_base_model:1186] The model: MegatronGPTSFTModel() does not have field.name: tp_comm_bulk_wg

[NeMo I 2024-12-06 09:01:07 megatron_base_model:601] Padded vocab_size: 128256, original vocab_size: 128256, dummy tokens: 0.



apply rope scaling ...


Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4


apply rope scaling ...


Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4


apply rope scaling ...
apply rope scaling ...


Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------

      checkpoint.load_state_dict(
    
      device = getattr(value, "device", None)
    


[NeMo I 2024-12-06 09:03:24 nlp_overrides:1358] Model MegatronGPTSFTModel was successfully restored from /workspace/results/Llama-3.1-8B/pretrain/checkpoints/megatron_llama.nemo.
[NeMo I 2024-12-06 09:03:29 nlp_adapter_mixins:249] Before adding PEFT params:
      | Name  | Type     | Params | Mode 
    -------------------------------------------
    0 | model | GPTModel | 8.0 B  | train
    -------------------------------------------
    0         Trainable params
    8.0 B     Non-trainable params
    8.0 B     Total params
    32,121.045Total estimated model params size (MB)
    649       Modules in train mode
    0         Modules in eval mode
[NeMo I 2024-12-06 09:03:31 nlp_adapter_mixins:254] After adding PEFT params:
      | Name  | Type     | Params | Mode 
    -------------------------------------------
    0 | model | GPTModel | 8.0 B  | train
    -------------------------------------------
    10.5 M    Trainable params
    8.0 B     Non-trainable params
    8.0 B     Total p

[NeMo W 2024-12-06 09:03:31 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:161: You have overridden `MegatronGPTSFTModel.configure_sharded_model` which is deprecated. Please override the `configure_model` hook instead. Instantiation with the newer hook will be created on the device right away and have the right data type depending on the precision setting in the Trainer.
    
[NeMo W 2024-12-06 09:03:31 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/configuration_validator.py:143: You are using the `dataloader_iter` step flavor. If you consume the iterator more than once per step, the `batch_idx` argument in any hook that takes it will not match with the batch index of the last batch consumed. This might have unforeseen effects on callbacks or code that expects to get the correct index. This will also not work well with gradient accumulation. This feature is very experimental and subjec

[NeMo I 2024-12-06 09:03:32 megatron_gpt_sft_model:828] Building GPT SFT test datasets.
[NeMo I 2024-12-06 09:03:32 text_memmap_dataset:116] Building data files
[NeMo I 2024-12-06 09:03:32 text_memmap_dataset:527] Processing 1 data files using 127 workers


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

[NeMo I 2024-12-06 09:03:34 text_memmap_dataset:542] Time building 0 / 1 mem-mapped files: 0:00:02.889242
[NeMo I 2024-12-06 09:03:35 text_memmap_dataset:527] Processing 1 data files using 127 workers



[NeMo I 2024-12-06 09:03:38 text_memmap_dataset:542] Time building 0 / 1 mem-mapped files: 0:00:03.145660
[NeMo I 2024-12-06 09:03:38 text_memmap_dataset:158] Loading data files
[NeMo I 2024-12-06 09:03:38 text_memmap_dataset:249] Loading data/alpaca/test.jsonl


      idx_info_dict = pickle.load(open(idx_fn + ".info", "rb"))
    


[NeMo I 2024-12-06 09:03:38 text_memmap_dataset:161] Time loading 1 mem-mapped files: 0:00:00.001094
[NeMo I 2024-12-06 09:03:38 text_memmap_dataset:165] Computing global indices
[NeMo I 2024-12-06 09:03:38 megatron_gpt_sft_model:831] Length of test dataset: 521
[NeMo I 2024-12-06 09:03:38 megatron_gpt_sft_model:854] Building dataloader with consumed samples: 0


LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
[NeMo W 2024-12-06 09:03:38 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/data_connector.py:424: The 'test_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=62` in the `DataLoader` to improve performance.
    
[NeMo W 2024-12-06 09:03:38 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/loops/utilities.py:148: Found `dataloader_iter` argument in the `test_step`. Note that the support for this signature is experimental and the behavior is subject to change.
    


Testing: |          | 0/? [00:00<?, ?it/s]setting number of microbatches to constant 8


      self.activation_dtype = torch.get_autocast_gpu_dtype()
    
      self.activation_dtype = torch.get_autocast_gpu_dtype()
    
      self.activation_dtype = torch.get_autocast_gpu_dtype()
    
      self.activation_dtype = torch.get_autocast_gpu_dtype()
    
      self.activation_dtype = torch.get_autocast_gpu_dtype()
    
      def forward(ctx, xz, conv1d_weight, conv1d_bias, x_proj_weight, delta_proj_weight,
    
      def backward(ctx, dout):
    
      def forward(
    
      def backward(ctx, dout, *args):
    
      def forward(ctx, x, weight, bias, process_group=None, sequence_parallel=True):
    
      def backward(ctx, grad_output):
    
      def forward(ctx, zxbcdt, conv1d_weight, conv1d_bias, dt_bias, A, D, chunk_size, initial_states=None, seq_idx=None, dt_limit=(0.0, float("inf")), return_final_states=False, activation="silu",
    
      def backward(ctx, dout, *args):
    
      input_info_tensor = torch.cuda.FloatTensor(input_info)
    
      string_tensor = torch.a

Testing DataLoader 0:   0%|          | 0/17 [00:00<?, ?it/s]setting number of microbatches to constant 1


      self.activation_dtype = torch.get_autocast_gpu_dtype()
    
      self.activation_dtype = torch.get_autocast_gpu_dtype()
    
      self.activation_dtype = torch.get_autocast_gpu_dtype()
    
    
      self.activation_dtype = torch.get_autocast_gpu_dtype()
    
      self.activation_dtype = torch.get_autocast_gpu_dtype()
    
    


setting number of microbatches to constant 8


      self.activation_dtype = torch.get_autocast_gpu_dtype()
    
      self.activation_dtype = torch.get_autocast_gpu_dtype()
    
      self.activation_dtype = torch.get_autocast_gpu_dtype()
    


Testing DataLoader 0:   6%|▌         | 1/17 [00:31<08:31,  0.03it/s]setting number of microbatches to constant 1


      self.activation_dtype = torch.get_autocast_gpu_dtype()
    
      self.activation_dtype = torch.get_autocast_gpu_dtype()
    
    


setting number of microbatches to constant 8


      self.activation_dtype = torch.get_autocast_gpu_dtype()
    


Testing DataLoader 0:  12%|█▏        | 2/17 [00:51<06:24,  0.04it/s]setting number of microbatches to constant 1


    


setting number of microbatches to constant 8


      self.activation_dtype = torch.get_autocast_gpu_dtype()
    


Testing DataLoader 0:  18%|█▊        | 3/17 [01:10<05:28,  0.04it/s]setting number of microbatches to constant 1


    


setting number of microbatches to constant 8


      self.activation_dtype = torch.get_autocast_gpu_dtype()
    


Testing DataLoader 0:  24%|██▎       | 4/17 [01:26<04:40,  0.05it/s]setting number of microbatches to constant 1


    


setting number of microbatches to constant 8


      self.activation_dtype = torch.get_autocast_gpu_dtype()
    


Testing DataLoader 0:  29%|██▉       | 5/17 [01:41<04:03,  0.05it/s]setting number of microbatches to constant 1


    


setting number of microbatches to constant 8
Testing DataLoader 0:  35%|███▌      | 6/17 [02:01<03:42,  0.05it/s]setting number of microbatches to constant 1
setting number of microbatches to constant 8
Testing DataLoader 0:  41%|████      | 7/17 [02:21<03:21,  0.05it/s]setting number of microbatches to constant 1


[rank2]:W1206 09:06:15.180000 139969535759488 torch/_dynamo/convert_frame.py:744] [4/8] torch._dynamo hit config.cache_size_limit (8)
[rank2]:W1206 09:06:15.180000 139969535759488 torch/_dynamo/convert_frame.py:744] [4/8]    function: 'calculate_cross_entropy_loss' (/opt/megatron-lm/megatron/core/fusions/fused_cross_entropy.py:47)
[rank2]:W1206 09:06:15.180000 139969535759488 torch/_dynamo/convert_frame.py:744] [4/8]    last reason: tensor 'L['exp_logits']' size mismatch at index 0. expected 544, actual 736
[rank2]:W1206 09:06:15.180000 139969535759488 torch/_dynamo/convert_frame.py:744] [4/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
[rank2]:W1206 09:06:15.180000 139969535759488 torch/_dynamo/convert_frame.py:744] [4/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.


setting number of microbatches to constant 8


      self.activation_dtype = torch.get_autocast_gpu_dtype()
    


Testing DataLoader 0:  47%|████▋     | 8/17 [02:37<02:57,  0.05it/s]setting number of microbatches to constant 1


    


setting number of microbatches to constant 8
Testing DataLoader 0:  53%|█████▎    | 9/17 [02:57<02:37,  0.05it/s]setting number of microbatches to constant 1


[rank1]:W1206 09:06:50.536000 140583929025664 torch/_dynamo/convert_frame.py:744] [4/8] torch._dynamo hit config.cache_size_limit (8)
[rank1]:W1206 09:06:50.536000 140583929025664 torch/_dynamo/convert_frame.py:744] [4/8]    function: 'calculate_cross_entropy_loss' (/opt/megatron-lm/megatron/core/fusions/fused_cross_entropy.py:47)
[rank1]:W1206 09:06:50.536000 140583929025664 torch/_dynamo/convert_frame.py:744] [4/8]    last reason: tensor 'L['exp_logits']' size mismatch at index 0. expected 272, actual 688
[rank1]:W1206 09:06:50.536000 140583929025664 torch/_dynamo/convert_frame.py:744] [4/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
[rank1]:W1206 09:06:50.536000 140583929025664 torch/_dynamo/convert_frame.py:744] [4/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.


setting number of microbatches to constant 8
Testing DataLoader 0:  59%|█████▉    | 10/17 [03:14<02:16,  0.05it/s]setting number of microbatches to constant 1


[rank3]:W1206 09:07:07.066000 140178608280704 torch/_dynamo/convert_frame.py:744] [4/8] torch._dynamo hit config.cache_size_limit (8)
[rank3]:W1206 09:07:07.066000 140178608280704 torch/_dynamo/convert_frame.py:744] [4/8]    function: 'calculate_cross_entropy_loss' (/opt/megatron-lm/megatron/core/fusions/fused_cross_entropy.py:47)
[rank3]:W1206 09:07:07.066000 140178608280704 torch/_dynamo/convert_frame.py:744] [4/8]    last reason: tensor 'L['exp_logits']' size mismatch at index 0. expected 352, actual 560
[rank3]:W1206 09:07:07.066000 140178608280704 torch/_dynamo/convert_frame.py:744] [4/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
[rank3]:W1206 09:07:07.066000 140178608280704 torch/_dynamo/convert_frame.py:744] [4/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.


setting number of microbatches to constant 8
Testing DataLoader 0:  65%|██████▍   | 11/17 [03:32<01:55,  0.05it/s]setting number of microbatches to constant 1
setting number of microbatches to constant 8
Testing DataLoader 0:  71%|███████   | 12/17 [03:51<01:36,  0.05it/s]setting number of microbatches to constant 1
setting number of microbatches to constant 8


Testing DataLoader 0:  76%|███████▋  | 13/17 [04:06<01:15,  0.05it/s]setting number of microbatches to constant 1


    


setting number of microbatches to constant 8


[rank0]:W1206 09:08:06.010000 139846295696512 torch/_dynamo/convert_frame.py:744] [4/8] torch._dynamo hit config.cache_size_limit (8)
[rank0]:W1206 09:08:06.010000 139846295696512 torch/_dynamo/convert_frame.py:744] [4/8]    function: 'calculate_cross_entropy_loss' (/opt/megatron-lm/megatron/core/fusions/fused_cross_entropy.py:47)
[rank0]:W1206 09:08:06.010000 139846295696512 torch/_dynamo/convert_frame.py:744] [4/8]    last reason: tensor 'L['exp_logits']' size mismatch at index 0. expected 768, actual 400
[rank0]:W1206 09:08:06.010000 139846295696512 torch/_dynamo/convert_frame.py:744] [4/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
[rank0]:W1206 09:08:06.010000 139846295696512 torch/_dynamo/convert_frame.py:744] [4/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
Testing DataLoader 0:  82%|████████▏ | 14/17 [04:27<00:57,  0.05it/s]setting number of microbatches to constant 1


    


setting number of microbatches to constant 8


Testing DataLoader 0:  88%|████████▊ | 15/17 [04:43<00:37,  0.05it/s]setting number of microbatches to constant 1


    


setting number of microbatches to constant 8
Testing DataLoader 0:  94%|█████████▍| 16/17 [05:02<00:18,  0.05it/s]setting number of microbatches to constant 1
setting number of microbatches to constant 8
Testing DataLoader 0: 100%|██████████| 17/17 [05:16<00:00,  0.05it/s][NeMo I 2024-12-06 09:09:01 megatron_gpt_sft_model:553] skipping autogenerated example example <|start_header_id|>system<|end_header_id|>
    You are a knowledgeable assistant trained to provide accurate and helpful information. Please respond to the user's queries promptly and politely.<|eot_id|>
    <|start_header_id|>user<|end_header_id|>
    給定一份財務資料樣本，計算每月總支出。
    住房：$1,500，食品：$600，交通：$250，娛樂：$400<|eot_id|>
    <|start_header_id|>assistant<|end_header_id|>
     prediction 每月總支出為$2,750。 label 在這種情況下，每個類別的值之和將是每月的總支出：住房+食品+交通+娛樂= $1,500 + $600 + $250 + $400 = $2,750。
[NeMo I 2024-12-06 09:09:01 megatron_gpt_sft_model:553] skipping autogenerated example example <|start_header_id|>system<|end_header_id|>
    You are 

[NeMo W 2024-12-06 09:09:01 megatron_gpt_sft_model:677] No training data found, reconfiguring microbatches based on validation batch sizes.
[NeMo W 2024-12-06 09:09:01 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:431: It is recommended to use `self.log('val_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    
[NeMo W 2024-12-06 09:09:01 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:431: It is recommended to use `self.log('test_loss_alpaca_test', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
    
[NeMo W 2024-12-06 09:09:01 nemo_logging:349] /usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:431: It is recommended to use `self.log('test_loss', ..

In [None]:
file_path = "data/alpaca/prediction_peft_test_alpaca_test_inputs_preds_labels.jsonl"
modify_and_overwrite_jsonl(file_path)
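
To spot-check the prediction file after this step, the records can be loaded and printed. The snippet below is a minimal sketch, not part of the original notebook: `load_jsonl` is a hypothetical helper, and it makes no assumption about which fields each record contains beyond the file being valid JSON Lines.

```python
import json
from pathlib import Path


def load_jsonl(path, limit=None):
    """Read a JSON Lines file and return its records as a list of dicts.

    `limit` (hypothetical convenience parameter) caps how many records are read.
    """
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            if limit is not None and len(records) >= limit:
                break
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            records.append(json.loads(line))
    return records


if __name__ == "__main__":
    path = Path("data/alpaca/prediction_peft_test_alpaca_test_inputs_preds_labels.jsonl")
    if path.exists():
        # ensure_ascii=False keeps the Traditional Chinese text readable
        for record in load_jsonl(path, limit=3):
            print(json.dumps(record, ensure_ascii=False, indent=2))
```

Printing with `ensure_ascii=False` mirrors the `force_ascii=False` setting used when the dataset was first written to JSONL, so Chinese characters are shown directly rather than as escape sequences.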

## 4. Export and Deploy a NeMo Checkpoint to TensorRT-LLM <a name='s4'></a>

Open a terminal and run the following command:

```sh
python /opt/NeMo/scripts/deploy/nlp/deploy_triton.py \
--nemo_checkpoint results/Llama-3.1-8B/SFT/checkpoints/megatron_gpt_peft_None_tuning.nemo \
--model_type llama \
--dtype bfloat16 \
--triton_model_name Llama
```

The command above launches an inference server. Keep it running, then run the following cell to send a request to the server.

In [None]:
%%bash

PROMPT_TEMPLATE="<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\
You are a knowledgeable assistant trained to provide accurate and helpful information. Please respond to the user's queries promptly and politely.<|eot_id|>\n\
<|start_header_id|>user<|end_header_id|>\n\
{input}<|eot_id|>\n\
<|start_header_id|>assistant<|end_header_id|>\n"

INPUT="今天天氣好嗎?"

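PROMPT="${PROMPT_TEMPLATE//\{input\}/$INPUT}"
# Inside double quotes bash keeps "\n" as two literal characters; expand them into real newlines.
printf -v PROMPT '%b' "$PROMPT"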

python /opt/NeMo/scripts/deploy/nlp/query.py \
--url "http://localhost:8000" \
--model_name Llama \
--prompt "$PROMPT"
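
The same prompt can also be assembled in Python, where `\n` inside a regular string literal is already a real newline. The snippet below is a minimal sketch rather than part of the NeMo tooling: `build_prompt` and `SYSTEM_PROMPT` are hypothetical names, and the template text is copied from the cell above.

```python
SYSTEM_PROMPT = (
    "You are a knowledgeable assistant trained to provide accurate and helpful "
    "information. Please respond to the user's queries promptly and politely."
)


def build_prompt(user_input, system=SYSTEM_PROMPT):
    """Wrap a user message in the chat template used during fine-tuning."""
    return (
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n"
        f"{system}<|eot_id|>\n"
        "<|start_header_id|>user<|end_header_id|>\n"
        f"{user_input}<|eot_id|>\n"
        "<|start_header_id|>assistant<|end_header_id|>\n"
    )


if __name__ == "__main__":
    print(build_prompt("今天天氣好嗎?"))
```

The returned string can then be supplied as the `--prompt` value of `query.py`. Building the prompt with f-strings rather than a search-and-replace on `{input}` avoids surprises when the user text itself contains braces.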