<a href="https://colab.research.google.com/github/tuhinmallick/AI-for-Fashion/blob/main/Fine_tuning_Phi_3_5_MoE_and_Mini_With_Code_for_AutoRound_and_Bitsandbytes_Quantization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*All the details are in the article: [Fine-tuning Phi-3.5 MoE and Mini on Your Computer](https://newsletter.kaitchup.com/p/fine-tuning-phi-35-moe-and-mini-on)*

This notebook shows how to quantize Phi-3.5 Mini to 4-bit with AutoRound and how to fine-tune it with QLoRA, using either the AutoRound model or bitsandbytes quantization. Code for standard LoRA fine-tuning is also provided, along with an evaluation section comparing bitsandbytes and AutoRound quantization.

The QLoRA fine-tuning code for Phi-3.5 MoE, using bitsandbytes, is at the end of the notebook.

#Quantization
##AutoRound


In [None]:
!pip install --upgrade transformers auto-round flash_attn optimum auto-gptq

Collecting transformers
  Downloading transformers-4.44.1-py3-none-any.whl.metadata (43 kB)
Collecting auto-round
  Downloading auto_round-0.3-py3-none-any.whl.metadata (18 kB)
Collecting flash_attn
  Downloading flash_attn-2.6.3.tar.gz (2.6 MB)
  Preparing metadata (setup.py) ... done
Collecting optimum
  Downloading optimum-1.21.4-py3-none-any.whl.metadata (19 kB)
Collecting auto-gptq
  Downloading auto_gptq-0.7.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Collecting intel-extension-for-transformers (from auto-round)
  Downloading intel_extension_for_transformers-1.4.2-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (26 kB)
Collecting coloredlogs (from optimum)
  Downloading coloredlogs-15.0.1-


* Requirements:
 * CPU RAM: 13.1 GB
 * GPU: 10.3 GB   

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "microsoft/Phi-3.5-Mini-instruct"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)

from auto_round import AutoRound

# 4-bit symmetric quantization, with weights quantized in groups of 128
bits, group_size, sym = 4, 128, True
autoround = AutoRound(
    model, tokenizer, bits=bits, group_size=group_size, sym=sym,
    batch_size=2, seqlen=512, gradient_accumulate_steps=4, device='cuda',
)
autoround.quantize()
# Save the quantized model to disk
output_dir = "./AutoRound/GPTQ-sym/"
autoround.save_quantized(output_dir)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

2024-08-21 10:03:14 INFO autoround.py L209: using torch.float16 for quantization tuning


Downloading readme:   0%|          | 0.00/373 [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/921 [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/33.3M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/10000 [00:00<?, ? examples/s]

Map:   0%|          | 0/10000 [00:00<?, ? examples/s]

Filter:   0%|          | 0/10000 [00:00<?, ? examples/s]

2024-08-21 10:04:09 INFO autoround.py L1039: quantizing 1/32, model.layers.0
2024-08-21 10:05:37 INFO autoround.py L966: quantized 4/4 layers in the block, loss iter 0: 0.000037 -> iter 196: 0.000004
2024-08-21 10:05:38 INFO autoround.py L1039: quantizing 2/32, model.layers.1
2024-08-21 10:07:05 INFO autoround.py L966: quantized 4/4 layers in the block, loss iter 0: 0.000220 -> iter 191: 0.000035
2024-08-21 10:07:05 INFO autoround.py L1039: quantizing 3/32, model.layers.2
2024-08-21 10:08:32 INFO autoround.py L966: quantized 4/4 layers in the block, loss iter 0: 0.020320 -> iter 21: 0.002543
2024-08-21 10:08:33 INFO autoround.py L1039: quantizing 4/32, model.layers.3
2024-08-21 10:09:59 INFO autoround.py L966: quantized 4/4 layers in the block, loss iter 0: 0.003458 -> iter 109: 0.002240
2024-08-21 10:10:00 INFO autoround.py L1039: quantizing 5/32, model.layers.4
2024-08-21 10:11:27 INFO autoround.py L966: quantized 4/4 layers in the block, loss iter 0: 0.041033 -> iter 86: 0.005105
20
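
To make the `bits`, `group_size`, and `sym` arguments used above concrete, here is a minimal sketch of symmetric 4-bit group quantization with plain round-to-nearest. It is not AutoRound's algorithm: AutoRound tunes the rounding over many iterations, as the per-block loss in the log suggests, whereas this sketch rounds once. The function names, the integer range [-8, 7], and the choice `scale = max|w| / 7` are illustrative assumptions.

```python
import random


def quantize_group_sym(weights, bits=4):
    """Round-to-nearest symmetric quantization of one group of weights."""
    qmax = 2 ** (bits - 1) - 1          # 7 for 4-bit
    qmin = -(2 ** (bits - 1))           # -8 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return q, scale


def dequantize_group(q, scale):
    """Map the integers back to floating-point values."""
    return [v * scale for v in q]


def quantize_row(row, bits=4, group_size=128):
    """Quantize a row group by group; return the dequantized row and the scales."""
    restored, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        q, scale = quantize_group_sym(group, bits)
        restored.extend(dequantize_group(q, scale))
        scales.append(scale)
    return restored, scales


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(512)]   # toy weight row
restored, scales = quantize_row(row, bits=4, group_size=128)
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(f"groups: {len(scales)}, max abs error: {max_error:.6f}")
```

With one scale per group, the rounding error of any weight is bounded by half of its group's scale, which is why a smaller `group_size` usually gives lower error at the cost of storing more scales.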

#Evaluation: Zero-shot MMLU, MMLU-PRO, Arc Challenge

In [None]:
!pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git

Collecting git+https://github.com/EleutherAI/lm-evaluation-harness.git
  Cloning https://github.com/EleutherAI/lm-evaluation-harness.git to /tmp/pip-req-build-_cpj6cgn
  Running command git clone --filter=blob:none --quiet https://github.com/EleutherAI/lm-evaluation-harness.git /tmp/pip-req-build-_cpj6cgn
  Resolved https://github.com/EleutherAI/lm-evaluation-harness.git to commit a4987bba6e9e9b3f22bd3a6c1ecf0abd04fd5622
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting evaluate (from lm_eval==0.4.3)
  Downloading evaluate-0.4.2-py3-none-any.whl.metadata (9.3 kB)
Collecting datasets>=2.16.0 (from lm_eval==0.4.3)
  Downloading datasets-2.21.0-py3-none-any.whl.metadata (21 kB)
Collecting jsonlines (from lm_eval==0.4.3)
  Downloading jsonlines-4.0.0-py3-none-any.whl.metadata (1.6 kB)
Collecting peft>=0.2.0 (from lm_eval==0.4.3)
  Downloading peft-0.12.0-py3-n

###Phi-3 Mini

In [None]:
!lm_eval --model hf --model_args pretrained=microsoft/Phi-3-mini-128k-instruct,dtype=float16 --tasks mmlu,arc_challenge,leaderboard_mmlu_pro --device cuda:0 --num_fewshot 0 --batch_size 4 --output_path ./eval/

2024-08-22 10:35:42.387358: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-08-22 10:35:42.404869: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-22 10:35:42.426572: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-22 10:35:42.433268: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-08-22 10:35:42.449306: I tensorflow/core/platform/cpu_feature_guar

###Original Phi-3.5 Mini

In [None]:
!lm_eval --model hf --model_args pretrained=microsoft/Phi-3.5-mini-instruct,dtype=float16 --tasks mmlu,arc_challenge,leaderboard_mmlu_pro --device cuda:0 --num_fewshot 0 --batch_size 4 --output_path ./eval/

2024-08-21 11:19:06.837008: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-08-21 11:19:06.854233: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-21 11:19:06.875632: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-21 11:19:06.882257: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-08-21 11:19:06.897697: I tensorflow/core/platform/cpu_feature_guar

###Phi-3.5 Mini Quantized to 4-bit with AutoRound

In [None]:
!lm_eval --model hf --model_args pretrained=kaitchup/Phi-3.5-Mini-instruct-AutoRound-4bit,dtype=float16 --tasks mmlu,arc_challenge,leaderboard_mmlu_pro --device cuda:0 --num_fewshot 0 --batch_size 4 --output_path ./eval/

2024-08-21 11:45:46.772098: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-08-21 11:45:46.788982: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-21 11:45:46.810504: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-21 11:45:46.816977: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-08-21 11:45:46.831932: I tensorflow/core/platform/cpu_feature_guar

###Phi-3.5 Mini Quantized to 4-bit with bitsandbytes

In [None]:
!lm_eval --model hf --model_args pretrained=microsoft/Phi-3.5-Mini-instruct,load_in_4bit=True --tasks mmlu,arc_challenge,leaderboard_mmlu_pro --device cuda:0 --num_fewshot 0 --batch_size 4 --output_path ./eval/

2024-08-21 12:48:05.133766: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-08-21 12:48:05.151289: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-21 12:48:05.172697: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-21 12:48:05.179227: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-08-21 12:48:05.194696: I tensorflow/core/platform/cpu_feature_guar
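
The four evaluation runs above differ only in their `--model_args` value. As a sketch, the same command lines can be generated from one table so that the shared flags live in a single place. The helper name `build_lm_eval_command` and the dictionary labels are made up for this illustration; the flags and model identifiers are copied from the cells above.

```python
import shlex

# Flags shared by every evaluation run in this section
SHARED_FLAGS = [
    "--tasks", "mmlu,arc_challenge,leaderboard_mmlu_pro",
    "--device", "cuda:0",
    "--num_fewshot", "0",
    "--batch_size", "4",
    "--output_path", "./eval/",
]

# Only the --model_args value changes between runs
MODEL_ARGS = {
    "Phi-3 Mini": "pretrained=microsoft/Phi-3-mini-128k-instruct,dtype=float16",
    "Phi-3.5 Mini": "pretrained=microsoft/Phi-3.5-mini-instruct,dtype=float16",
    "Phi-3.5 Mini, AutoRound 4-bit": "pretrained=kaitchup/Phi-3.5-Mini-instruct-AutoRound-4bit,dtype=float16",
    "Phi-3.5 Mini, bitsandbytes 4-bit": "pretrained=microsoft/Phi-3.5-Mini-instruct,load_in_4bit=True",
}


def build_lm_eval_command(model_args):
    """Return the lm_eval invocation as an argument list."""
    return ["lm_eval", "--model", "hf", "--model_args", model_args, *SHARED_FLAGS]


for label, model_args in MODEL_ARGS.items():
    print(f"# {label}")
    print(shlex.join(build_lm_eval_command(model_args)))
```

Each argument list can be pasted into a notebook cell after a `!`, or passed to `subprocess.run` when running outside a notebook.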

#Fine-tuning Phi-3.5 Mini


In [None]:
!pip install -qqq --upgrade bitsandbytes transformers peft accelerate datasets trl flash_attn

##QLoRA with bitsandbytes

* A batch size of 1 with sequences of 512 tokens requires 6 GB of CPU RAM and 8 GB of GPU RAM
* A batch size of 8 (as in the code below) with sequences of 512 tokens requires 6 GB of CPU RAM and 13 GB of GPU RAM

In [None]:
import torch
from datasets import load_dataset
from peft import LoraConfig, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer, SFTConfig

#use bf16 and FlashAttention if supported
if torch.cuda.is_bf16_supported():
  compute_dtype = torch.bfloat16
  attn_implementation = 'flash_attention_2'
else:
  compute_dtype = torch.float16
  attn_implementation = 'sdpa'

model_name = "microsoft/Phi-3.5-Mini-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, add_eos_token=True, use_fast=True)
tokenizer.pad_token = tokenizer.unk_token
tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
tokenizer.padding_side = 'left'

ds = load_dataset("timdettmers/openassistant-guanaco")

bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
          model_name, torch_dtype=compute_dtype, trust_remote_code=True, quantization_config=bnb_config, device_map={"": 0}, attn_implementation=attn_implementation
)

model = prepare_model_for_kbit_training(model,gradient_checkpointing_kwargs={'use_reentrant':True})

peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.05,
        r=16,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules= ['k_proj', 'q_proj', 'v_proj', 'o_proj', "gate_proj", "down_proj", "up_proj"]
)


training_arguments = SFTConfig(
        output_dir="./Phi-3.5/Phi-3.5-Mini_QLoRA",
        eval_strategy="steps",
        do_eval=True,
        optim="paged_adamw_8bit",
        per_device_train_batch_size=8,
        gradient_accumulation_steps=4,
        per_device_eval_batch_size=8,
        log_level="debug",
        save_strategy="epoch",
        logging_steps=25,
        learning_rate=1e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        eval_steps=25,
        num_train_epochs=1,
        warmup_ratio=0.1,
        lr_scheduler_type="linear",
        dataset_text_field="text",
        max_seq_length=512
)

trainer = SFTTrainer(
        model=model,
        train_dataset=ds['train'],
        eval_dataset=ds['test'],
        peft_config=peft_config,
        tokenizer=tokenizer,
        args=training_arguments,
)

trainer.train()

tokenizer_config.json:   0%|          | 0.00/3.98k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/306 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/395 [00:00<?, ?B/s]

Repo card metadata block was not found. Setting CardData to empty.


Downloading data:   0%|          | 0.00/20.9M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/1.11M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/9846 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/518 [00:00<?, ? examples/s]

config.json:   0%|          | 0.00/3.45k [00:00<?, ?B/s]

configuration_phi3.py:   0%|          | 0.00/11.2k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3.5-Mini-instruct:
- configuration_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


modeling_phi3.py:   0%|          | 0.00/73.8k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3.5-Mini-instruct:
- modeling_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


model.safetensors.index.json:   0%|          | 0.00/16.3k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/2.67G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/195 [00:00<?, ?B/s]

Map:   0%|          | 0/9846 [00:00<?, ? examples/s]

Map:   0%|          | 0/518 [00:00<?, ? examples/s]

Using auto half precision backend
Currently training with a batch size of: 8
***** Running training *****
  Num examples = 9,846
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 4
  Total optimization steps = 307
  Number of trainable parameters = 8,912,896


| Step | Training Loss | Validation Loss |
|---|---|---|
| 25 | 1.6071 | 1.484774 |
| 50 | 1.3169 | 1.296879 |
| 75 | 1.2442 | 1.26404 |
| 100 | 1.2171 | 1.250713 |
| 125 | 1.1946 | 1.243362 |
| 150 | 1.1586 | 1.238708 |
| 175 | 1.1815 | 1.235508 |
| 200 | 1.1824 | 1.232928 |
| 225 | 1.1802 | 1.231119 |
| 250 | 1.197 | 1.230293 |



***** Running Evaluation *****
  Num examples = 518
  Batch size = 8

Saving model checkpoint to ./Phi-3.5/Phi-3.5-Mini_QLoRA/checkpoint-307
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models-

TrainOutput(global_step=307, training_loss=1.229702399685639, metrics={'train_runtime': 6800.8452, 'train_samples_per_second': 1.448, 'train_steps_per_second': 0.045, 'total_flos': 1.1181274994466816e+17, 'train_loss': 1.229702399685639, 'epoch': 0.9975629569455727})
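
The logged `Total optimization steps = 307` and the final `'epoch': 0.9975...` both follow from the dataset size and the batch settings. The sketch below reproduces them, assuming the data loader keeps the last partial batch and the trainer drops the leftover micro-batches that do not fill a complete gradient-accumulation window. The function name is illustrative.

```python
import math


def optimization_steps(num_examples, per_device_batch_size, grad_accum_steps):
    """Return (micro-batches per epoch, optimizer steps, fraction of the epoch covered)."""
    micro_batches = math.ceil(num_examples / per_device_batch_size)
    steps = micro_batches // grad_accum_steps
    epoch_fraction = steps * grad_accum_steps / micro_batches
    return micro_batches, steps, epoch_fraction


# Values from the run above: 9,846 examples, batch size 8, 4 accumulation steps
micro_batches, steps, epoch_fraction = optimization_steps(9846, 8, 4)
effective_batch_size = 8 * 4

print(f"micro-batches per epoch: {micro_batches}")
print(f"optimizer steps:         {steps}")
print(f"effective batch size:    {effective_batch_size}")
print(f"epoch fraction:          {epoch_fraction:.6f}")
```

This gives 1,231 micro-batches and 307 optimizer steps. The last 3 micro-batches are never used for an update, which is why the reported epoch stops just short of 1.0.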

##QLoRA with AutoRound

* Requirements:
 * Same as QLoRA with bitsandbytes

In [None]:
import torch
from datasets import load_dataset
from peft import LoraConfig, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
)
from trl import SFTTrainer, SFTConfig

#use bf16 and FlashAttention if supported
if torch.cuda.is_bf16_supported():
  compute_dtype = torch.bfloat16
  attn_implementation = 'flash_attention_2'
else:
  compute_dtype = torch.float16
  attn_implementation = 'sdpa'

model_name = "kaitchup/Phi-3.5-Mini-instruct-AutoRound-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, add_eos_token=True, use_fast=True)
tokenizer.pad_token = tokenizer.unk_token
tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
tokenizer.padding_side = 'left'

ds = load_dataset("timdettmers/openassistant-guanaco")


model = AutoModelForCausalLM.from_pretrained(
          model_name, trust_remote_code=True, device_map={"": 0}, attn_implementation=attn_implementation
)

model = prepare_model_for_kbit_training(model,gradient_checkpointing_kwargs={'use_reentrant':True})

peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.05,
        r=16,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules= ['k_proj', 'q_proj', 'v_proj', 'o_proj', "gate_proj", "down_proj", "up_proj"]
)


training_arguments = SFTConfig(
        output_dir="./Phi-3.5/Phi-3.5-Mini_QLoRA_AutoRound",
        eval_strategy="steps",
        do_eval=True,
        optim="paged_adamw_8bit",
        per_device_train_batch_size=8,
        gradient_accumulation_steps=4,
        per_device_eval_batch_size=8,
        log_level="debug",
        save_strategy="epoch",
        logging_steps=25,
        learning_rate=1e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        eval_steps=25,
        num_train_epochs=1,
        warmup_ratio=0.1,
        lr_scheduler_type="linear",
        dataset_text_field="text",
        max_seq_length=512
)

trainer = SFTTrainer(
        model=model,
        train_dataset=ds['train'],
        eval_dataset=ds['test'],
        peft_config=peft_config,
        tokenizer=tokenizer,
        args=training_arguments,
)

trainer.train()

Repo card metadata block was not found. Setting CardData to empty.


configuration_phi3.py:   0%|          | 0.00/11.2k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/kaitchup/Phi-3.5-Mini-instruct-AutoRound-4bit:
- configuration_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


modeling_phi3.py:   0%|          | 0.00/73.8k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/kaitchup/Phi-3.5-Mini-instruct-AutoRound-4bit:
- modeling_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


Map:   0%|          | 0/9846 [00:00<?, ? examples/s]

Map:   0%|          | 0/518 [00:00<?, ? examples/s]

Using auto half precision backend
Currently training with a batch size of: 8
***** Running training *****
  Num examples = 9,846
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 4
  Total optimization steps = 307
  Number of trainable parameters = 8,912,896


| Step | Training Loss | Validation Loss |
|---|---|---|
| 25 | 1.5731 | 1.448355 |
| 50 | 1.2933 | 1.289964 |
| 75 | 1.2382 | 1.258594 |
| 100 | 1.2151 | 1.247414 |
| 125 | 1.1941 | 1.241973 |
| 150 | 1.1598 | 1.239165 |
| 175 | 1.1833 | 1.237587 |
| 200 | 1.184 | 1.235978 |
| 225 | 1.1844 | 1.234958 |
| 250 | 1.2033 | 1.234782 |



***** Running Evaluation *****
  Num examples = 518
  Batch size = 8

Saving model checkpoint to ./Phi-3.5/Phi-3.5-Mini_QLoRA_AutoRound/checkpoint-307
loading configuration file config.json from cache at /root/.cache/huggingface/h

TrainOutput(global_step=307, training_loss=1.2267252366006958, metrics={'train_runtime': 5841.9991, 'train_samples_per_second': 1.685, 'train_steps_per_second': 0.053, 'total_flos': 3224588596002816.0, 'train_loss': 1.2267252366006958, 'epoch': 0.9975629569455727})
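
Both QLoRA runs report 8,912,896 trainable parameters. That number can be reproduced if LoRA is attached only to `o_proj` and `down_proj` in each of the 32 layers. This is what one would expect if the attention and MLP input projections are fused under names such as `qkv_proj` and `gate_up_proj`, so that the `q_proj`, `k_proj`, `v_proj`, `gate_proj`, and `up_proj` entries of `target_modules` match nothing. The AutoRound log earlier in the notebook, which reports 4 quantized layers per block, is consistent with that layout. The sketch below checks the arithmetic. The module names and dimensions (hidden size 3072, MLP size 8192) are assumptions taken from the published Phi-3 Mini configuration, not from this notebook, and the suffix rule is a simplified stand-in for how PEFT matches module names.

```python
# Assumed model dimensions and the LoRA rank used in this notebook
HIDDEN, INTERMEDIATE, NUM_LAYERS, RANK = 3072, 8192, 32, 16

# Assumed linear layers in one decoder block: name -> (in_features, out_features)
BLOCK_LINEARS = {
    "self_attn.qkv_proj": (HIDDEN, 3 * HIDDEN),
    "self_attn.o_proj": (HIDDEN, HIDDEN),
    "mlp.gate_up_proj": (HIDDEN, 2 * INTERMEDIATE),
    "mlp.down_proj": (INTERMEDIATE, HIDDEN),
}

# target_modules as passed to LoraConfig in the cells above
TARGET_MODULES = ["k_proj", "q_proj", "v_proj", "o_proj", "gate_proj", "down_proj", "up_proj"]


def is_targeted(module_name, targets):
    """Simplified rule: exact match, or the name ends with '.<target>'."""
    return any(module_name == t or module_name.endswith("." + t) for t in targets)


def lora_params(in_features, out_features, rank):
    """LoRA adds A (rank x in_features) and B (out_features x rank)."""
    return rank * (in_features + out_features)


matched = [name for name in BLOCK_LINEARS if is_targeted(name, TARGET_MODULES)]
per_layer = sum(lora_params(*BLOCK_LINEARS[name], RANK) for name in matched)
total = per_layer * NUM_LAYERS

print("matched modules:", matched)
print(f"trainable LoRA parameters: {total:,}")
```

Under these assumptions the total equals the logged 8,912,896. If the goal is to adapt every attention and MLP projection, adding the fused module names to `target_modules` would be the thing to try, though the trainable parameter count and the results would then differ from the logs shown here.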

##LoRA
* GPU requirements: 16 GB

In [None]:
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
)
from trl import SFTTrainer, SFTConfig

#use bf16 and FlashAttention if supported
if torch.cuda.is_bf16_supported():
  compute_dtype = torch.bfloat16
  attn_implementation = 'flash_attention_2'
else:
  compute_dtype = torch.float16
  attn_implementation = 'sdpa'

model_name = "microsoft/Phi-3.5-Mini-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, add_eos_token=True, use_fast=True)
tokenizer.pad_token = tokenizer.unk_token
tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
tokenizer.padding_side = 'left'

ds = load_dataset("timdettmers/openassistant-guanaco")


model = AutoModelForCausalLM.from_pretrained(
          model_name, torch_dtype=compute_dtype, trust_remote_code=True,  device_map={"": 0}, attn_implementation=attn_implementation
)

model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={'use_reentrant':True})

peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.05,
        r=16,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules= ['k_proj', 'q_proj', 'v_proj', 'o_proj', "gate_proj", "down_proj", "up_proj"]
)


training_arguments = SFTConfig(
        output_dir="./Phi-3.5/Phi-3.5-Mini_LoRA",
        eval_strategy="steps",
        do_eval=True,
        optim="adamw_torch",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        per_device_eval_batch_size=4,
        log_level="debug",
        save_strategy="epoch",
        logging_steps=25,
        learning_rate=1e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        eval_steps=25,
        num_train_epochs=1,
        warmup_ratio=0.1,
        lr_scheduler_type="linear",
        dataset_text_field="text",
        max_seq_length=512
)

trainer = SFTTrainer(
        model=model,
        train_dataset=ds['train'],
        eval_dataset=ds['test'],
        peft_config=peft_config,
        tokenizer=tokenizer,
        args=training_arguments,
)

trainer.train()

loading file tokenizer.model from cache at /root/.cache/huggingface/hub/models--microsoft--Phi-3.5-Mini-instruct/snapshots/64963004ad95869fa73a30279371c8778509ac84/tokenizer.model
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--microsoft--Phi-3.5-Mini-instruct/snapshots/64963004ad95869fa73a30279371c8778509ac84/tokenizer.json
loading file added_tokens.json from cache at /root/.cache/huggingface/hub/models--microsoft--Phi-3.5-Mini-instruct/snapshots/64963004ad95869fa73a30279371c8778509ac84/added_tokens.json
loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--microsoft--Phi-3.5-Mini-instruct/snapshots/64963004ad95869fa73a30279371c8778509ac84/special_tokens_map.json
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--microsoft--Phi-3.5-Mini-instruct/snapshots/64963004ad95869fa73a30279371c8778509ac84/tokenizer_config.json
Special tokens have been added in the vocabulary, make sure the associ

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

All model checkpoint weights were used when initializing Phi3ForCausalLM.

All the weights of Phi3ForCausalLM were initialized from the model checkpoint at microsoft/Phi-3.5-Mini-instruct.
If your task is similar to the task the model of the checkpoint was trained on, you can already use Phi3ForCausalLM for predictions without further training.
loading configuration file generation_config.json from cache at /root/.cache/huggingface/hub/models--microsoft--Phi-3.5-Mini-instruct/snapshots/64963004ad95869fa73a30279371c8778509ac84/generation_config.json
Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": [
    32007,
    32001,
    32000
  ],
  "pad_token_id": 32000
}

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


Map:   0%|          | 0/518 [00:00<?, ? examples/s]

Using auto half precision backend
Currently training with a batch size of: 4
***** Running training *****
  Num examples = 9,846
  Num Epochs = 1
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 8
  Total optimization steps = 307
  Number of trainable parameters = 8,912,896


| Step | Training Loss | Validation Loss |
|---|---|---|
| 25 | 1.5964 | 1.459 |
| 50 | 1.2693 | 1.270693 |
| 75 | 1.2098 | 1.239383 |
| 100 | 1.1884 | 1.228382 |
| 125 | 1.1665 | 1.221909 |
| 150 | 1.132 | 1.218116 |
| 175 | 1.1536 | 1.215834 |
| 200 | 1.1561 | 1.214221 |
| 225 | 1.148 | 1.213173 |
| 250 | 1.1665 | 1.212524 |



***** Running Evaluation *****
  Num examples = 518
  Batch size = 4

Saving model checkpoint to ./Phi-3.5/Phi-3.5-Mini_LoRA/checkpoint-307
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--

TrainOutput(global_step=307, training_loss=1.2014875536244545, metrics={'train_runtime': 4058.6058, 'train_samples_per_second': 2.426, 'train_steps_per_second': 0.076, 'total_flos': 1.0693751107780608e+17, 'train_loss': 1.2014875536244545, 'epoch': 0.9975629569455727})
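
To compare the three Phi-3.5 Mini runs side by side, the sketch below collects figures logged above: the training runtime from each `TrainOutput` and the validation loss at step 250, the last evaluation shown for each run. It then prints them relative to LoRA. Only values that appear in this notebook's outputs are used; the variable and function names are illustrative.

```python
NUM_TRAIN_EXAMPLES = 9846

# Figures copied from the outputs of the three fine-tuning cells
RUNS = {
    "QLoRA (bitsandbytes)": {"runtime_s": 6800.8452, "val_loss_step_250": 1.230293},
    "QLoRA (AutoRound)": {"runtime_s": 5841.9991, "val_loss_step_250": 1.234782},
    "LoRA": {"runtime_s": 4058.6058, "val_loss_step_250": 1.212524},
}


def throughput(run):
    """Training samples processed per second over the whole run."""
    return NUM_TRAIN_EXAMPLES / run["runtime_s"]


baseline = RUNS["LoRA"]
for name, run in RUNS.items():
    relative_time = run["runtime_s"] / baseline["runtime_s"]
    loss_gap = run["val_loss_step_250"] - baseline["val_loss_step_250"]
    print(
        f"{name:22s} {run['runtime_s'] / 60:6.1f} min  "
        f"{throughput(run):.3f} samples/s  "
        f"{relative_time:.2f}x LoRA time  "
        f"val loss {run['val_loss_step_250']:.4f} ({loss_gap:+.4f})"
    )
```

The runtimes are only indicative: the LoRA run used a per-device batch size of 4 with 8 accumulation steps and the `adamw_torch` optimizer, while both QLoRA runs used a batch size of 8 with 4 accumulation steps and `paged_adamw_8bit`.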

#Fine-tuning Phi-3.5 MoE

##QLoRA with bitsandbytes

In [None]:
import torch
from datasets import load_dataset
from peft import LoraConfig, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer, SFTConfig

#use bf16 and FlashAttention if supported
if torch.cuda.is_bf16_supported():
  compute_dtype = torch.bfloat16
  attn_implementation = 'flash_attention_2'
else:
  compute_dtype = torch.float16
  attn_implementation = 'sdpa'

model_name = "microsoft/Phi-3.5-MoE-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, add_eos_token=True, use_fast=True)
tokenizer.pad_token = tokenizer.unk_token
tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
tokenizer.padding_side = 'left'

ds = load_dataset("timdettmers/openassistant-guanaco")

bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
          model_name, torch_dtype=compute_dtype, trust_remote_code=True, quantization_config=bnb_config, device_map={"": 0}, attn_implementation=attn_implementation
)
print(model)
print(model.get_memory_footprint())


model = prepare_model_for_kbit_training(model,gradient_checkpointing_kwargs={'use_reentrant':True})

peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.05,
        r=16,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules= ['k_proj', 'q_proj', 'v_proj', 'o_proj', "gate_proj", "down_proj", "up_proj","gate","w1","w2","w3"]
)


training_arguments = SFTConfig(
        output_dir="./Phi-3.5/Phi-3.5-MoE_QLoRA",
        eval_strategy="steps",
        do_eval=True,
        optim="paged_adamw_8bit",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=32,
        per_device_eval_batch_size=1,
        log_level="debug",
        save_strategy="epoch",
        logging_steps=25,
        learning_rate=1e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        eval_steps=25,
        num_train_epochs=1,
        warmup_ratio=0.1,
        lr_scheduler_type="linear",
        dataset_text_field="text",
        max_seq_length=512
)

trainer = SFTTrainer(
        model=model,
        train_dataset=ds['train'],
        eval_dataset=ds['test'],
        peft_config=peft_config,
        tokenizer=tokenizer,
        args=training_arguments,
)

trainer.train()
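
The `print(model.get_memory_footprint())` call above reports the size of the loaded model. A rough estimate of the 4-bit weight memory can also be made before downloading anything. The sketch below assumes NF4 with blocks of 64 weights, one absmax constant per block, and double quantization of those constants to 8 bits in groups of 256, as described in the QLoRA paper. It also treats every parameter as quantized. The parameter counts are assumptions (about 3.8 billion for Phi-3.5 Mini and about 42 billion for Phi-3.5 MoE) and should be replaced with the figures from the model cards.

```python
def nf4_bits_per_weight(block_size=64, double_quant=True, second_level_block=256):
    """Average storage per weight: 4 bits plus the quantization constants."""
    if double_quant:
        # 8-bit constants per block, plus 32-bit constants for groups of constants
        overhead = 8 / block_size + 32 / (block_size * second_level_block)
    else:
        # One 32-bit constant per block
        overhead = 32 / block_size
    return 4 + overhead


def weight_memory_gb(num_params, bits_per_weight):
    """Weight storage in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumed parameter counts, not taken from this notebook
ASSUMED_PARAMS = {"Phi-3.5 Mini": 3.8e9, "Phi-3.5 MoE": 42e9}

bits = nf4_bits_per_weight()
for name, num_params in ASSUMED_PARAMS.items():
    size_gb = weight_memory_gb(num_params, bits)
    print(f"{name}: about {size_gb:.1f} GB at {bits:.3f} bits per weight")
```

The real footprint will be somewhat higher, since layers that stay in 16-bit (for example embeddings and normalization layers) are not accounted for here, and training adds activations, LoRA weights, and optimizer states on top. That is one reason the MoE configuration above uses a per-device batch size of 1.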