<a href="https://colab.research.google.com/github/tuhinmallick/AI-for-Fashion/blob/main/Duplicate%2C_remove%2C_and_reorder_layers_of_an_LLM_Example_with_Llama_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*More details in this article: [From Llama 3 70B to 120B: How to Self-Augment an LLM?](https://newsletter.kaitchup.com/p/from-llama-3-70b-to-120b-how-to-self)*

This notebook shows how to create a new LLM by duplicating, removing, and reordering layers of a base LLM. It uses [mergekit](https://github.com/arcee-ai/mergekit) and its passthrough method.

Several configurations are tried with Llama 3 8B and each resulting model is evaluated with the [evaluation harness](https://github.com/EleutherAI/lm-evaluation-harness).

First, install the following packages:

In [None]:
!git clone https://github.com/arcee-ai/mergekit.git
!cd mergekit && pip install -e .  # install the package and make scripts available
!pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git
!pip install trl peft datasets bitsandbytes accelerate

Cloning into 'mergekit'...
remote: Enumerating objects: 2265, done.
remote: Counting objects: 100% (1354/1354), done.
remote: Compressing objects: 100% (520/520), done.
remote: Total 2265 (delta 1081), reused 947 (delta 833), pack-reused 911
Receiving objects: 100% (2265/2265), 640.50 KiB | 2.44 MiB/s, done.
Resolving deltas: 100% (1584/1584), done.
Obtaining file:///content/mergekit
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Installing backend dependencies ... done
  Preparing editable metadata (pyproject.toml) ... done
Collecting tqdm==4.66.2 (from mergekit==0.0.4.3)
  Downloading tqdm-4.66.2-py3-none-any.whl (78 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.3/78.3 kB 2.9 MB/s eta 0:00:00
Collecting accelerate~=0.27.2 (from mergekit==0.0.4.3)
  Downloading a

Next, log in with a Hugging Face access token. It is required to download Llama 3 and its tokenizer.

In [None]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) n
Token is valid (permission: read).
Your token has been saved to /root/.cache/huggingface/token
Login successful


The following function fine-tunes each model:

*Note: I use QLoRA with 8-bit quantization. Replace "load_in_8bit" with "load_in_4bit" to use 4-bit quantization if you don't have enough GPU memory.*
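
As a rough guide to the memory trade-off mentioned in this note, the weights alone take about one byte per parameter in 8-bit and half a byte in 4-bit. The sketch below is only a back-of-the-envelope estimate: it assumes roughly 8.03 billion parameters for Llama 3 8B and ignores activations, LoRA adapters, optimizer state, and quantization overhead, so real usage will be higher. The function name `weight_memory_gb` is made up for this illustration.

```python
# Back-of-the-envelope estimate of the memory taken by the model weights alone.
# Assumption: about 8.03e9 parameters for Llama 3 8B (approximate figure).
LLAMA3_8B_PARAMS = 8.03e9


def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate size of the weights in gigabytes (1 GB = 10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for bits in (16, 8, 4):
    size = weight_memory_gb(LLAMA3_8B_PARAMS, bits)
    print(f"{bits}-bit weights: ~{size:.1f} GB")
```

Halving the number of bits halves the weight footprint, which is why switching to 4-bit can help when the 8-bit model does not fit on the GPU.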

In [None]:
import torch, os
from datasets import load_dataset
from peft import LoraConfig, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
)
from trl import SFTConfig, SFTTrainer

# Use bf16 and FlashAttention 2 if the GPU supports them; otherwise fall back to fp16 and SDPA
if torch.cuda.is_bf16_supported():
  os.system('pip install flash_attn')
  compute_dtype = torch.bfloat16
  attn_implementation = 'flash_attention_2'
else:
  compute_dtype = torch.float16
  attn_implementation = 'sdpa'


base_model = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(base_model,add_eos_token=True)
# Reuse the EOS token as the padding token
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'left'

def sft(model_dir, model_output=''):
  # By default, save checkpoints in an "sft" subdirectory of the model directory
  if model_output=='':
    model_output = model_dir+'/sft/'
  # Load the model quantized to 8-bit on GPU 0
  model = AutoModelForCausalLM.from_pretrained(
            model_dir, load_in_8bit=True, device_map={"": 0}, attn_implementation=attn_implementation
  )

  model = prepare_model_for_kbit_training(model)
  ds = load_dataset("timdettmers/openassistant-guanaco")


  # Attach LoRA adapters to all attention and MLP projection modules
  peft_config = LoraConfig(
          lora_alpha=16,
          lora_dropout=0.05,
          r=16,
          bias="none",
          task_type="CAUSAL_LM",
          target_modules= ['k_proj', 'q_proj', 'v_proj', 'o_proj', "gate_proj", "down_proj", "up_proj"]
  )

  training_arguments = SFTConfig(
          output_dir=model_output,
          evaluation_strategy="steps",
          do_eval=True,
          optim="paged_adamw_8bit",
          per_device_train_batch_size=4,
          gradient_accumulation_steps=4,
          per_device_eval_batch_size=4,
          log_level="debug",
          save_steps=50,
          logging_steps=25,
          learning_rate=1e-4,
          fp16 = not torch.cuda.is_bf16_supported(),
          bf16 = torch.cuda.is_bf16_supported(),
          eval_steps=25,
          max_steps=100,
          warmup_ratio=0.1,
          lr_scheduler_type="linear",
  )

  trainer = SFTTrainer(
          model=model,
          train_dataset=ds['train'],
          eval_dataset=ds['test'],
          peft_config=peft_config,
          dataset_text_field="text",
          max_seq_length=512,
          tokenizer=tokenizer,
          args=training_arguments,
  )

  trainer.train()


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


# Baseline Fine-tuning

I ran the following cell to measure the baseline performance of QLoRA 8-bit fine-tuning on Llama 3 8B. The models created by layer duplication, deletion, and reordering should perform close to, or better than, this baseline on the validation split of the fine-tuning dataset.



In [None]:
sft("meta-llama/Meta-Llama-3-8B", "./base_8bit/")

config.json:   0%|          | 0.00/654 [00:00<?, ?B/s]

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


model.safetensors.index.json:   0%|          | 0.00/23.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/4 [00:00<?, ?it/s]

model-00001-of-00004.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00004.safetensors:   0%|          | 0.00/5.00G [00:00<?, ?B/s]

model-00003-of-00004.safetensors:   0%|          | 0.00/4.92G [00:00<?, ?B/s]

model-00004-of-00004.safetensors:   0%|          | 0.00/1.17G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/177 [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/395 [00:00<?, ?B/s]

Repo card metadata block was not found. Setting CardData to empty.


Downloading data:   0%|          | 0.00/20.9M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.11M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/9846 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/518 [00:00<?, ? examples/s]

Map:   0%|          | 0/9846 [00:00<?, ? examples/s]

Map:   0%|          | 0/518 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
Using auto half precision backend
Currently training with a batch size of: 4
***** Running training *****
  Num examples = 9,846
  Num Epochs = 1
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 4
  Total optimization steps = 100
  Number of trainable parameters = 41,943,040
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
The input hidden states seems to be silently casted in float32, this might be related to the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in torch.bfloat16.


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 25   | 1.4119        | 1.318529        |
| 50   | 1.2659        | 1.298529        |
| 75   | 1.2786        | 1.292989        |
| 100  | 1.2591        | 1.291749        |
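
The trainable-parameter count reported in the training log above (41,943,040) can be reproduced from the LoRA settings and the model dimensions of Llama 3 8B, which also appear in the `LlamaConfig` dump of this output. A LoRA adapter of rank r on a linear layer with d_in inputs and d_out outputs adds r × (d_in + d_out) parameters. The sketch below is an illustration rather than part of the original run; the helper names are made up, and the key/value projection width is derived as `num_key_value_heads × (hidden_size / num_attention_heads)`.

```python
# Reproduce "Number of trainable parameters = 41,943,040" from the LoRA settings.
HIDDEN = 4096          # hidden_size
INTERMEDIATE = 14336   # intermediate_size
N_HEADS = 32           # num_attention_heads
N_KV_HEADS = 8         # num_key_value_heads
N_LAYERS = 32          # num_hidden_layers
RANK = 16              # r in LoraConfig

HEAD_DIM = HIDDEN // N_HEADS     # 128
KV_DIM = N_KV_HEADS * HEAD_DIM   # 1024, because of grouped-query attention

# (d_in, d_out) of every projection listed in target_modules
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params_per_layer(rank: int = RANK) -> int:
    """LoRA parameters added to one decoder layer: rank * (d_in + d_out) per projection."""
    return sum(rank * (d_in + d_out) for d_in, d_out in PROJECTIONS.values())


def lora_params(num_layers: int, rank: int = RANK) -> int:
    """LoRA parameters for a model with num_layers identical decoder layers."""
    return num_layers * lora_params_per_layer(rank)


print(lora_params_per_layer())  # 1310720
print(lora_params(N_LAYERS))    # 41943040
```

Every decoder layer contributes the same 1,310,720 adapter parameters, so dividing a logged trainable-parameter count by that number gives the number of layers in a merged model.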


***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
Saving model checkpoint to ./base_8bit/checkpoint-50
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B/snapshots/62bd457b6fe961a42a631306577e622c83876cb6/config.json
Model config LlamaConfig {
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 8192,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.40.2",


# Layer Duplication

First, set the configuration that mergekit will use. This particular configuration duplicates the last five layers of the model (layers 27 to 31), so the resulting model has 37 layers instead of 32.

Then, run mergekit to create the model.

Finally, the model is evaluated with the evaluation harness.
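
To make the effect of a passthrough configuration concrete, the sketch below expands a list of `layer_range` entries into the layer stack of the merged model and renders the matching YAML. The ranges behave as half-open intervals, which is how `[0, 27]` followed by `[27, 32]` covers all 32 layers exactly once. The helper names `expand_layers` and `build_config` are hypothetical and not part of mergekit; mergekit itself does the actual merging.

```python
# Sketch: map passthrough slices to the layer stack of the merged model.
MODEL = "meta-llama/Meta-Llama-3-8B"


def expand_layers(ranges):
    """Source-layer indices in the order they appear in the merged model.

    Each (start, end) pair is treated as half-open: start is included, end is not.
    """
    layers = []
    for start, end in ranges:
        layers.extend(range(start, end))
    return layers


def build_config(ranges, model=MODEL, dtype="float16"):
    """Render a passthrough configuration in the same layout as the cells of this notebook."""
    lines = ["slices:"]
    for start, end in ranges:
        lines.append("- sources:")
        lines.append(f"  - layer_range: [{start}, {end}]")
        lines.append(f"    model: {model}")
    lines.append("merge_method: passthrough")
    lines.append(f"dtype: {dtype}")
    return "\n".join(lines) + "\n"


ranges = [(0, 27), (27, 32), (27, 32)]
layers = expand_layers(ranges)
print(len(layers))    # 37
print(layers[-10:])   # [27, 28, 29, 30, 31, 27, 28, 29, 30, 31]
print(build_config(ranges))
```

The same two helpers can be reused to inspect any other list of ranges before running mergekit on it.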

In [None]:
config = '''
slices:
- sources:
  - layer_range: [0, 27]
    model: meta-llama/Meta-Llama-3-8B
- sources:
  - layer_range: [27, 32]
    model: meta-llama/Meta-Llama-3-8B
- sources:
  - layer_range: [27, 32]
    model: meta-llama/Meta-Llama-3-8B
merge_method: passthrough
dtype: float16
'''


with open('config.yaml', 'w') as f:
    f.write(config)

!mergekit-yaml config.yaml ./llama-3-8B-27-32x2 --cuda --lazy-unpickle --allow-crimes &> /dev/null
!lm_eval --model hf --model_args pretrained=./llama-3-8B-27-32x2,load_in_8bit=True --tasks winogrande,hellaswag,arc_challenge --device cuda:0 --num_fewshot 0 --batch_size 16 --output_path ./eval_dup/llama-3-8B-27-32x2

2024-05-18 08:35:09.852234: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-18 08:35:09.852287: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-18 08:35:09.854099: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Downloading builder script: 100% 5.67k/5.67k [00:00<00:00, 19.9MB/s]
2024-05-18:08:35:15,217 INFO     [__main__.py:254] Verbosity set to INFO
2024-05-18:08:35:22,102 INFO     [__main__.py:341] Selected Tasks: ['arc_challenge', 'hellaswag', 'winogrande']
2024-05-18:08:35:22,109 INFO     [evaluator.py:141] Setting random seed to 0 | Setting numpy seed to 1234 | Set

Fine-tune the model created with mergekit:

In [None]:
sft("./llama-3-8B-27-32x2/")

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Downloading readme:   0%|          | 0.00/395 [00:00<?, ?B/s]

Repo card metadata block was not found. Setting CardData to empty.


Downloading data:   0%|          | 0.00/20.9M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.11M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/9846 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/518 [00:00<?, ? examples/s]

Map:   0%|          | 0/9846 [00:00<?, ? examples/s]

Map:   0%|          | 0/518 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
Using auto half precision backend
Currently training with a batch size of: 4
***** Running training *****
  Num examples = 9,846
  Num Epochs = 1
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 4
  Total optimization steps = 100
  Number of trainable parameters = 48,496,640
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
The input hidden states seems to be silently casted in float32, this might be related to the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in torch.bfloat16.


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 25   | 1.8296        | 1.495447        |
| 50   | 1.4004        | 1.412419        |
| 75   | 1.383         | 1.391789        |
| 100  | 1.3513        | 1.385539        |


***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
Saving model checkpoint to ./llama-3-8B-27-32x2//sft/checkpoint-50
tokenizer config file saved in ./llama-3-8B-27-32x2//sft/checkpoint-50/tokenizer_config.json
Special tokens file saved in ./llama-3-8B-27-32x2//sft/checkpoint-50/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
Saving model checkpoint to ./llama-3-8B-27-32x2//sft/checkpoint-100
tokenizer config file saved in ./llama-3-8B-27-32x2//sft/checkpoint-100/tokenizer_config.json
Special tokens file saved in ./llama-3-8B-27-32x2//sft/checkpoint-100/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




In [None]:
config = '''
slices:
- sources:
  - layer_range: [0, 4]
    model: meta-llama/Meta-Llama-3-8B
- sources:
  - layer_range: [0, 4]
    model: meta-llama/Meta-Llama-3-8B
- sources:
  - layer_range: [4, 32]
    model: meta-llama/Meta-Llama-3-8B
merge_method: passthrough
dtype: float16
'''


with open('config.yaml', 'w') as f:
    f.write(config)

!mergekit-yaml config.yaml ./llama-3-8B-0-4x2 --cuda --lazy-unpickle --allow-crimes &> /dev/null
!lm_eval --model hf --model_args pretrained=./llama-3-8B-0-4x2,load_in_8bit=True --tasks winogrande,hellaswag,arc_challenge --device cuda:0 --num_fewshot 0 --batch_size 16 --output_path ./eval_dup/llama-3-8B-0-4x2

2024-05-18 10:16:17.013855: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-18 10:16:17.013910: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-18 10:16:17.015744: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-05-18:10:16:23,271 INFO     [__main__.py:254] Verbosity set to INFO
2024-05-18:10:16:30,293 INFO     [__main__.py:341] Selected Tasks: ['arc_challenge', 'hellaswag', 'winogrande']
2024-05-18:10:16:30,299 INFO     [evaluator.py:141] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-05-18:10:16:30,299 INFO     [eval

In [None]:
sft("./llama-3-8B-0-4x2/")

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Repo card metadata block was not found. Setting CardData to empty.


Map:   0%|          | 0/518 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
Using auto half precision backend
Currently training with a batch size of: 4
***** Running training *****
  Num examples = 9,846
  Num Epochs = 1
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 4
  Total optimization steps = 100
  Number of trainable parameters = 47,185,920
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
The input hidden states seems to be silently casted in float32, this might be related to the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in torch.bfloat16.


***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
Saving model checkpoint to ./llama-3-8B-0-4x2//sft/checkpoint-50
tokenizer config file saved in ./llama-3-8B-0-4x2//sft/checkpoint-50/tokenizer_config.json
Special tokens file saved in ./llama-3-8B-0-4x2//sft/checkpoint-50/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 518
  Batch size = 4


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 25   | 1.6369        | 1.387245        |
| 50   | 1.3144        | 1.336498        |
| 75   | 1.3098        | 1.327007        |
| 100  | 1.2894        | 1.323652        |


***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
Saving model checkpoint to ./llama-3-8B-0-4x2//sft/checkpoint-100
tokenizer config file saved in ./llama-3-8B-0-4x2//sft/checkpoint-100/tokenizer_config.json
Special tokens file saved in ./llama-3-8B-0-4x2//sft/checkpoint-100/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




In [None]:
config = '''
slices:
- sources:
  - layer_range: [0, 32]
    model: meta-llama/Meta-Llama-3-8B
- sources:
  - layer_range: [0, 32]
    model: meta-llama/Meta-Llama-3-8B
merge_method: passthrough
dtype: float16
'''


with open('config.yaml', 'w') as f:
    f.write(config)

!mergekit-yaml config.yaml ./llama-3-8B-0-32x2 --cuda --lazy-unpickle --allow-crimes &> /dev/null
!lm_eval --model hf --model_args pretrained=./llama-3-8B-0-32x2,load_in_8bit=True --tasks winogrande,hellaswag,arc_challenge --device cuda:0 --num_fewshot 0 --batch_size 16 --output_path ./eval_dup/llama-3-8B-0-32x2

2024-05-18 11:56:29.004239: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-18 11:56:29.004288: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-18 11:56:29.005912: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Downloading builder script: 100% 5.67k/5.67k [00:00<00:00, 21.6MB/s]
2024-05-18:11:56:35,255 INFO     [__main__.py:254] Verbosity set to INFO
2024-05-18:11:56:42,268 INFO     [__main__.py:341] Selected Tasks: ['arc_challenge', 'hellaswag', 'winogrande']
2024-05-18:11:56:42,273 INFO     [evaluator.py:141] Setting random seed to 0 | Setting numpy seed to 1234 | Set

In [None]:
sft("./llama-3-8B-0-32x2/")

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]

Downloading readme:   0%|          | 0.00/395 [00:00<?, ?B/s]

Repo card metadata block was not found. Setting CardData to empty.


Downloading data:   0%|          | 0.00/20.9M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.11M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/9846 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/518 [00:00<?, ? examples/s]

Map:   0%|          | 0/9846 [00:00<?, ? examples/s]

Map:   0%|          | 0/518 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
Using auto half precision backend
Currently training with a batch size of: 4
***** Running training *****
  Num examples = 9,846
  Num Epochs = 1
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 4
  Total optimization steps = 100
  Number of trainable parameters = 83,886,080
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
The input hidden states seems to be silently casted in float32, this might be related to the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in torch.bfloat16.


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 25   | 2.0675        | 1.558924        |
| 50   | 1.4442        | 1.449833        |
| 75   | 1.4175        | 1.42296         |
| 100  | 1.3828        | 1.415381        |


***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
Saving model checkpoint to ./llama-3-8B-0-32x2//sft/checkpoint-50
tokenizer config file saved in ./llama-3-8B-0-32x2//sft/checkpoint-50/tokenizer_config.json
Special tokens file saved in ./llama-3-8B-0-32x2//sft/checkpoint-50/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
Saving model checkpoint to ./llama-3-8B-0-32x2//sft/checkpoint-100
tokenizer config file saved in ./llama-3-8B-0-32x2//sft/checkpoint-100/tokenizer_config.json
Special tokens file saved in ./llama-3-8B-0-32x2//sft/checkpoint-100/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




In [None]:
config = '''
slices:
- sources:
  - layer_range: [0, 10]
    model: meta-llama/Meta-Llama-3-8B
- sources:
  - layer_range: [5, 15]
    model: meta-llama/Meta-Llama-3-8B
- sources:
  - layer_range: [10, 20]
    model: meta-llama/Meta-Llama-3-8B
- sources:
  - layer_range: [15, 25]
    model: meta-llama/Meta-Llama-3-8B
- sources:
  - layer_range: [20, 32]
    model: meta-llama/Meta-Llama-3-8B
merge_method: passthrough
dtype: float16
'''


with open('config.yaml', 'w') as f:
    f.write(config)

!mergekit-yaml config.yaml ./llama-3-8B-altx2 --cuda --lazy-unpickle --allow-crimes &> /dev/null
!lm_eval --model hf --model_args pretrained=./llama-3-8B-altx2,load_in_8bit=True --tasks winogrande,hellaswag,arc_challenge --device cuda:0 --num_fewshot 0 --batch_size 16 --output_path ./eval_dup/llama-3-8B-altx2

2024-05-18 14:15:42.254801: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-18 14:15:42.254854: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-18 14:15:42.256677: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-05-18:14:15:49,217 INFO     [__main__.py:254] Verbosity set to INFO
2024-05-18:14:15:56,468 INFO     [__main__.py:341] Selected Tasks: ['arc_challenge', 'hellaswag', 'winogrande']
2024-05-18:14:15:56,474 INFO     [evaluator.py:141] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-05-18:14:15:56,474 INFO     [eval

In [None]:
sft("./llama-3-8B-altx2/")

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]

Repo card metadata block was not found. Setting CardData to empty.


Map:   0%|          | 0/518 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
Using auto half precision backend
Currently training with a batch size of: 4
***** Running training *****
  Num examples = 9,846
  Num Epochs = 1
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 4
  Total optimization steps = 100
  Number of trainable parameters = 68,157,440
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
The input hidden states seems to be silently casted in float32, this might be related to the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in torch.bfloat16.


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 25   | 1.6142        | 1.479523        |
| 50   | 1.4081        | 1.423185        |
| 75   | 1.3845        | 1.3993          |
| 100  | 1.3547        | 1.390922        |


***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
Saving model checkpoint to ./llama-3-8B-altx2//sft/checkpoint-50
tokenizer config file saved in ./llama-3-8B-altx2//sft/checkpoint-50/tokenizer_config.json
Special tokens file saved in ./llama-3-8B-altx2//sft/checkpoint-50/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
Saving model checkpoint to ./llama-3-8B-altx2//sft/checkpoint-100
tokenizer config file saved in ./llama-3-8B-altx2//sft/checkpoint-100/tokenizer_config.json
Special tokens file saved in ./llama-3-8B-altx2//sft/checkpoint-100/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




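# Layer Deletion

This configuration keeps layers 0 to 9 and 21 to 31 of Llama 3 8B and removes the 11 layers in between (10 to 20), which leaves 21 layers.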

In [None]:
config = '''
slices:
- sources:
  - layer_range: [0, 10]
    model: meta-llama/Meta-Llama-3-8B
- sources:
  - layer_range: [21,32]
    model: meta-llama/Meta-Llama-3-8B
merge_method: passthrough
dtype: float16
'''


with open('config.yaml', 'w') as f:
    f.write(config)

!mergekit-yaml config.yaml ./llama-3-8B-0-10_21-32 --cuda --lazy-unpickle --allow-crimes &> /dev/null
!lm_eval --model hf --model_args pretrained=./llama-3-8B-0-10_21-32,load_in_8bit=True --tasks winogrande,hellaswag,arc_challenge --device cuda:0 --num_fewshot 0 --batch_size 16 --output_path ./eval_del/llama-3-8B-0-10_21-32

2024-05-18 15:29:08.850872: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-18 15:29:08.850925: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-18 15:29:08.852727: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
Downloading builder script: 100% 5.67k/5.67k [00:00<00:00, 18.8MB/s]
2024-05-18:15:29:14,189 INFO     [__main__.py:254] Verbosity set to INFO
2024-05-18:15:29:21,253 INFO     [__main__.py:341] Selected Tasks: ['arc_challenge', 'hellaswag', 'winogrande']
2024-05-18:15:29:21,260 INFO     [evaluator.py:141] Setting random seed to 0 | Setting numpy seed to 1234 | Set

In [None]:
sft("./llama-3-8B-0-10_21-32/")

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading readme:   0%|          | 0.00/395 [00:00<?, ?B/s]

Repo card metadata block was not found. Setting CardData to empty.


Downloading data:   0%|          | 0.00/20.9M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.11M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/9846 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/518 [00:00<?, ? examples/s]

Map:   0%|          | 0/9846 [00:00<?, ? examples/s]

Map:   0%|          | 0/518 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
Using auto half precision backend
Currently training with a batch size of: 4
***** Running training *****
  Num examples = 9,846
  Num Epochs = 1
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 4
  Total optimization steps = 100
  Number of trainable parameters = 27,525,120
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
The input hidden states seems to be silently casted in float32, this might be related to the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in torch.bfloat16.


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 25   | 4.4513        | 3.277726        |
| 50   | 2.9257        | 2.752851        |
| 75   | 2.6461        | 2.601437        |
| 100  | 2.5232        | 2.555698        |


***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
Saving model checkpoint to ./llama-3-8B-0-10_21-32//sft/checkpoint-50
tokenizer config file saved in ./llama-3-8B-0-10_21-32//sft/checkpoint-50/tokenizer_config.json
Special tokens file saved in ./llama-3-8B-0-10_21-32//sft/checkpoint-50/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
Saving model checkpoint to ./llama-3-8B-0-10_21-32//sft/checkpoint-100
tokenizer config file saved in ./llama-3-8B-0-10_21-32//sft/checkpoint-100/tokenizer_config.json
Special tokens file saved in ./llama-3-8B-0-10_21-32//sft/checkpoint-100/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




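# Layer Reordering

This configuration places layers 5 to 8 before layers 0 to 3 and keeps layers 10 to 31 in their original order.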
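
Because the `layer_range` bounds behave as half-open intervals, it is worth checking which source layers a configuration actually covers. The sketch below reports missing and repeated layers for a list of ranges; the `audit` helper is hypothetical and not a mergekit function. For the ranges used in the configuration below, it shows that layers 4 and 9 are not covered by any slice, so the resulting model has 30 layers rather than 32. This matches the lower trainable-parameter count in the fine-tuning log of this section (39,321,600, versus 41,943,040 for the 32-layer baseline).

```python
from collections import Counter


def audit(ranges, num_layers=32):
    """Return (missing, repeated, total) for a list of half-open (start, end) ranges.

    missing:  source layers that no slice covers
    repeated: source layers covered by more than one slice
    total:    number of layers in the merged model
    """
    counts = Counter()
    for start, end in ranges:
        counts.update(range(start, end))
    missing = [i for i in range(num_layers) if counts[i] == 0]
    repeated = [i for i in range(num_layers) if counts[i] > 1]
    total = sum(counts.values())
    return missing, repeated, total


missing, repeated, total = audit([(5, 9), (0, 4), (10, 32)])
print(missing)   # [4, 9]
print(repeated)  # []
print(total)     # 30
```

A reordering that keeps all 32 layers would need ranges such as `[5, 10]`, `[0, 5]`, `[10, 32]`; that variant was not run in this notebook, so the results in this section combine reordering with the removal of two layers.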

In [None]:
config = '''
slices:
- sources:
  - layer_range: [5, 9]
    model: meta-llama/Meta-Llama-3-8B
- sources:
  - layer_range: [0,4]
    model: meta-llama/Meta-Llama-3-8B
- sources:
  - layer_range: [10,32]
    model: meta-llama/Meta-Llama-3-8B
merge_method: passthrough
dtype: float16
'''


with open('config.yaml', 'w') as f:
    f.write(config)

!mergekit-yaml config.yaml ./llama-3-8B-5-9_0-4_10-32 --cuda --lazy-unpickle --allow-crimes &> /dev/null
!lm_eval --model hf --model_args pretrained=./llama-3-8B-5-9_0-4_10-32,load_in_8bit=True --tasks winogrande,hellaswag,arc_challenge --device cuda:0 --num_fewshot 0 --batch_size 16 --output_path ./eval_reord/llama-3-8B-5-9_0-4_10-32

2024-05-18 16:33:16.695167: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-18 16:33:16.695218: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-18 16:33:16.697053: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-05-18:16:33:21,925 INFO     [__main__.py:254] Verbosity set to INFO
2024-05-18:16:33:29,050 INFO     [__main__.py:341] Selected Tasks: ['arc_challenge', 'hellaswag', 'winogrande']
2024-05-18:16:33:29,058 INFO     [evaluator.py:141] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2024-05-18:16:33:29,058 INFO     [eval

In [None]:
sft("./llama-3-8B-5-9_0-4_10-32/")

The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

Repo card metadata block was not found. Setting CardData to empty.


Map:   0%|          | 0/518 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
Using auto half precision backend
Currently training with a batch size of: 4
***** Running training *****
  Num examples = 9,846
  Num Epochs = 1
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 4
  Total optimization steps = 100
  Number of trainable parameters = 39,321,600
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.
The input hidden states seems to be silently casted in float32, this might be related to the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in torch.bfloat16.


| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 25   | 9.221         | 8.229436        |
| 50   | 7.7788        | 7.384885        |
| 75   | 7.1803        | 6.823301        |
| 100  | 6.578         | 6.515566        |


***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
Saving model checkpoint to ./llama-3-8B-5-9_0-4_10-32//sft/checkpoint-50
tokenizer config file saved in ./llama-3-8B-5-9_0-4_10-32//sft/checkpoint-50/tokenizer_config.json
Special tokens file saved in ./llama-3-8B-5-9_0-4_10-32//sft/checkpoint-50/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
***** Running Evaluation *****
  Num examples = 518
  Batch size = 4
Saving model checkpoint to ./llama-3-8B-5-9_0-4_10-32//sft/checkpoint-100
tokenizer config file saved in ./llama-3-8B-5-9_0-4_10-32//sft/checkpoint-100/tokenizer_config.json
Special tokens file saved in ./llama-3-8B-5-9_0-4_10-32//sft/checkpoint-100/special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)
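
To compare the runs side by side, the sketch below collects the step-100 validation losses from the logs in this notebook and ranks them against the baseline. The conversion to perplexity with exp(loss) is an addition for readability and assumes that the reported loss is the mean per-token cross-entropy in nats; the `rank_runs` helper is made up for this illustration.

```python
import math

BASELINE = "base_8bit"

# Validation loss at step 100, copied from the training logs of each run.
final_val_loss = {
    "base_8bit": 1.291749,
    "llama-3-8B-27-32x2": 1.385539,
    "llama-3-8B-0-4x2": 1.323652,
    "llama-3-8B-0-32x2": 1.415381,
    "llama-3-8B-altx2": 1.390922,
    "llama-3-8B-0-10_21-32": 2.555698,
    "llama-3-8B-5-9_0-4_10-32": 6.515566,
}


def rank_runs(losses, baseline=BASELINE):
    """Return (name, loss, loss - baseline loss, perplexity) tuples, best loss first."""
    base = losses[baseline]
    rows = [(name, loss, loss - base, math.exp(loss)) for name, loss in losses.items()]
    return sorted(rows, key=lambda row: row[1])


for name, loss, delta, ppl in rank_runs(final_val_loss):
    print(f"{name:26s} loss={loss:.4f}  delta={delta:+.4f}  ppl={ppl:.1f}")
```

In these runs, none of the modified models matches the baseline after 100 fine-tuning steps. Duplicating the first four layers comes closest, while the deletion and reordering configurations remain far behind.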


