<a href="https://colab.research.google.com/github/y-hiroki-radiotech/llm-final-task/blob/main/llm_jp_3_13b_task_solve.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from google.colab import drive

drive.mount("/content/drive")
%cd "/content/drive/MyDrive/LLM/2024_大規模言語モデル/05.最終課題/3. task_solve_code"

Mounted at /content/drive
/content/drive/MyDrive/LLM/2024_大規模言語モデル/05.最終課題/3. task_solve_code


Instruction fine-tuning with Unsloth and QLoRA

In [3]:
!pip install wandb unsloth trl datasets accelerate bitsandbytes peft



### HuggingFace Authentication

In [4]:
from google.colab import userdata

HUGGINGFACE_TOKEN = userdata.get('HF_TOKEN_READ')
!huggingface-cli login --token $HUGGINGFACE_TOKEN

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
The token `LLM_Course` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `LLM_Course`


### Loading the Base Model

In [5]:
import torch
from unsloth import FastLanguageModel

model_name = "llm-jp/llm-jp-3-13b"

max_seq_length = 1024
dtype = None
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,  # previously defined but never passed to the loader
    dtype=dtype,
    load_in_4bit=load_in_4bit
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2024.12.4: Fast Llama patching. Transformers:4.46.3.
   \\   /|    GPU: NVIDIA A100-SXM4-40GB. Max memory: 39.564 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 8.0. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors.index.json:   0%|          | 0.00/29.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/6 [00:00<?, ?it/s]

model-00001-of-00006.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00006.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00003-of-00006.safetensors:   0%|          | 0.00/4.88G [00:00<?, ?B/s]

model-00004-of-00006.safetensors:   0%|          | 0.00/4.93G [00:00<?, ?B/s]

model-00005-of-00006.safetensors:   0%|          | 0.00/4.93G [00:00<?, ?B/s]

model-00006-of-00006.safetensors:   0%|          | 0.00/2.71G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/6 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/111 [00:00<?, ?B/s]
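The six shard sizes in the download log add up to about what one would expect for a 13B-parameter checkpoint stored at two bytes per weight. The following is a rough sketch of that arithmetic. It assumes (the log does not say so) that the shards hold 16-bit weights, and it ignores the extra quantization constants that a 4-bit loader keeps, so the last figure is an idealised lower bound rather than a measurement.

```python
# Shard sizes in GB, copied from the download log above.
shard_sizes_gb = [4.98, 4.97, 4.88, 4.93, 4.93, 2.71]

total_gb = sum(shard_sizes_gb)

# Assumption: weights are stored in a 2-byte dtype (for example bfloat16).
BYTES_PER_WEIGHT_16BIT = 2
approx_params_billion = total_gb / BYTES_PER_WEIGHT_16BIT

# Idealised 4-bit footprint: half a byte per weight, no quantization overhead.
approx_4bit_gb = approx_params_billion * 0.5

print(f"checkpoint on disk : {total_gb:.2f} GB")
print(f"implied parameters : ~{approx_params_billion:.1f}B")
print(f"idealised 4-bit    : ~{approx_4bit_gb:.1f} GB")
```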

## Preparing the Dataset

In [6]:
import pandas as pd
import datasets
from datasets import Dataset
from sklearn.model_selection import train_test_split

dataset1 = pd.read_csv("label_data_all.csv")
dataset2 = pd.read_csv("elyza_tasks_100_labels.csv")
dataset2 = dataset2.rename(columns={"input": "questions", "output": "answers"})
dataset = pd.concat([dataset1, dataset2], axis=0)
# Drop rows whose labels were found to be unsuitable
dataset = dataset[dataset["labels"] != "Simple_Fact_Checking"]
dataset = dataset[dataset["labels"] != "Comparison_Similarity"]

# Variant that inflates the data tenfold by duplication
# aug_dataset = pd.concat([dataset] * 10, axis=0)
# aug_dataset = aug_dataset.reset_index(drop=True)
# aug_dataset = aug_dataset.sample(frac=1, random_state=42).reset_index(drop=True)

# train_data, test_data = train_test_split(aug_dataset, stratify=aug_dataset["labels"], test_size=0.1, random_state=42)
# test_data, eval_data = train_test_split(test_data, stratify=test_data["labels"], test_size=0.5, random_state=42)

train_data = Dataset.from_pandas(dataset)
# test_data = Dataset.from_pandas(test_data)
# eval_data = Dataset.from_pandas(eval_data)
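The commented-out lines above rely on `train_test_split(..., stratify=...)` so that every label keeps roughly the same share in each split. As an illustration of that idea only, here is a standard-library stand-in; `stratified_split` and the toy rows are hypothetical and are not part of the notebook or of scikit-learn.

```python
import random
from collections import Counter, defaultdict

def stratified_split(rows, label_key, test_fraction, seed=42):
    """Split rows so each label keeps roughly the same share in both parts."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        # Hold out at least one row per label so no label is missing from the test part.
        n_test = max(1, round(len(group) * test_fraction))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Hypothetical toy rows; the real notebook reads them from CSV files.
rows = (
    [{"questions": f"q{i}", "labels": "Task_Solution"} for i in range(20)]
    + [{"questions": f"k{i}", "labels": "Knowledge_Explanation"} for i in range(10)]
)

train_rows, test_rows = stratified_split(rows, "labels", test_fraction=0.1)
print(Counter(r["labels"] for r in train_rows))
print(Counter(r["labels"] for r in test_rows))
```

Note that duplicating the data before splitting, as the commented-out augmentation does, would place copies of the same row on both sides of the split, so any evaluation loss measured that way would be optimistic.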

### Loading and Processing the Dataset

In [7]:
from unsloth.chat_templates import get_chat_template

# Initialize the tokenizer with the chat template and mapping
tokenizer = get_chat_template(
    tokenizer,
    chat_template="llama-3",
    mapping={"role": "from", "content": "value", "user": "human", "assistant": "gpt"}, # ShareGPT style
    map_eos_token=True, # <|im_end|> to <|eot_id|> instead
)

In [8]:
def select_prompt(labels):
    base_prompt = """前提条件
    このタスクでは以下の品質基準に従って日本語で回答を生成してください。
    - 明確性:説明は具体的でわかりやすいこと
    - 正確性:情報は信頼できるものを使用する
    - 簡潔性:回答は簡潔でわかりやすく
    回答の前に、まずは以下の情報を確認してください。
    - 質問の意図を正確に理解しているか。必要な情報が全て含まれているか。説明は十分か
    """
    if labels == "Task_Solution":
        base_prompt = base_prompt + "質問に対して、具体的な解決策を提案してください。 複数の解決策が考えられる場合は、それぞれを検討し、メリット・デメリットを比較して、最適な解決策を選択してください。 最適な解決策を選択した理由を明確に説明し、その実現可能性についても考察してください。 解決策の実施手順、必要な資源、予想される結果などを具体的に記述してください。"
    elif labels == "Creative_Generation":
        base_prompt = base_prompt + "以下の指示に従って、創造的な作品を作成してください。 指示に記載された要素をすべて含め、オリジナリティと整合性を重視してください。 具体的な指示がない場合でも、創造的な解釈に基づいて作品を完成させてください。 作品の形式、長さ、スタイルなどは、指示に明記されている場合を除き、自由に選択できます。 作成した作品について、自身の創作意図や考え方を簡潔に説明してください。"
    elif labels == "Knowledge_Explanation":
        base_prompt = base_prompt + "質問について、正確で分かりやすい説明を提供してください。 説明は、対象読者の知識レベルを考慮し、専門用語を避け、平易な言葉で記述してください。 重要なポイントを明確にし、例や図表などを用いて理解を深める工夫をしてください。 説明の構成は、箇条書き、段落形式など、自由に選択できます。 説明の信頼性を高めるために、参照元などを明記してください（可能な場合）。"
    elif labels == "Analytical_Reasoning":
        base_prompt = base_prompt + "質問の情報に基づいて、論理的に推論し、結論を導き出してください。 可能な限り多くの推論経路を検討し、それぞれの可能性とその根拠を説明してください。 結論は簡潔に記述し、その根拠となる論理的ステップを詳細に示してください。 仮定や前提があれば明確に示し、結論の信頼性を評価してください。 情報が不足している場合は、それを指摘し、追加情報が必要な理由を説明してください。"
    elif labels == "Information_Extraction":
        base_prompt = base_prompt + "指定された情報またはキーワードを抽出してください。 抽出された情報は、指定された形式（例：リスト、表、段落）で出力してください。 複数の情報が該当する場合は、すべて抽出してください。 テキスト中に該当情報がない場合は、その旨を明記してください。 曖昧な表現が含まれる場合は、その解釈を明確にしてください。"
    elif labels == "Step_by_Step_Calculation":
        base_prompt = base_prompt + "解法手順をステップごとに明確に記述し、各ステップで用いた計算式や論理を説明してください。 計算結果だけでなく、計算過程も重視します。 単位を明記し、計算結果の妥当性を検証してください。 問題に不明点がある場合は、その旨を指摘してください。"
    elif labels == "Role_Play_Response":
        base_prompt = base_prompt + "状況を分析し、あなたの役割にふさわしい行動を、詳細に記述してください。 あなたの判断基準、考え方の根拠、そして期待される結果を明確に説明してください。 "
    elif labels == "Opinion_Perspective":
        base_prompt = base_prompt + "質問に対して、あなたの意見を述べてください。 あなたの意見を支持する根拠を明確に説明し、反対意見についても考慮した上で、あなたの結論を導き出してください。 複数の視点から問題を分析し、客観的な視点も取り入れてください。 あなたの意見は、簡潔で明確、そして論理的に整合性のあるものでなければなりません。"

    return base_prompt
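The `if`/`elif` chain in `select_prompt` maps each label to one suffix, which could also be written as a lookup table. A minimal sketch of that alternative follows; the English strings are shortened, hypothetical stand-ins for the Japanese instruction texts, and `select_prompt_from_table` is an illustrative name.

```python
# Hypothetical, shortened English stand-ins for the Japanese instruction texts.
BASE_PROMPT = "Answer in Japanese, clearly, accurately and concisely.\n"

LABEL_SUFFIXES = {
    "Task_Solution": "Propose a concrete solution and justify it.",
    "Knowledge_Explanation": "Give an accurate, easy-to-follow explanation.",
    "Step_by_Step_Calculation": "Show every calculation step and state units.",
}

def select_prompt_from_table(label):
    """Return the base prompt plus the label-specific suffix, if one exists."""
    # Unknown labels fall back to the bare base prompt, as in the if/elif chain.
    return BASE_PROMPT + LABEL_SUFFIXES.get(label, "")

print(select_prompt_from_table("Task_Solution"))
print(repr(select_prompt_from_table("Unknown_Label")))
```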

In [9]:
def formatting_prompts_func(examples):
    convos = []

    # Iterate through each item in the batch
    for question, label, answer in zip(examples["questions"], examples["labels"], examples["answers"]):
        tool_user = {
            "content": select_prompt(label),
            "role": "system"
        }
        query_user = {
            "content": f"質問:{question}",
            "role": "user"
        }
        assistant = {
            "content": f"回答:{answer}",
            "role": "assistant"
        }
        convos.append([tool_user, query_user, assistant])

    texts = [tokenizer.apply_chat_template(
        convo,
        tokenize=False,
        add_generation_prompt=False,
        return_tensors=None,
    ) + tokenizer.eos_token for convo in convos] # add tokenizer.eos_token

    return {"text": texts}
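To make the shape of the resulting `text` field easier to picture, the sketch below renders a system/user/assistant conversation in a Llama 3 style layout. It is a hand-written approximation of the publicly documented Llama 3 chat format, not the template that `get_chat_template` installs: the BOS token is omitted, the effect of the `mapping` argument is not reproduced, and `render_llama3_style` is a hypothetical helper.

```python
def render_llama3_style(messages, add_generation_prompt=False):
    """Approximate a Llama 3 style chat layout (BOS token omitted)."""
    parts = []
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn for the model to complete.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

conversation = [
    {"role": "system", "content": "(system prompt chosen by label)"},
    {"role": "user", "content": "質問:1+1は?"},
    {"role": "assistant", "content": "回答:2"},
]

print(render_llama3_style(conversation))
```

Printing one real entry of `train_data["text"]` is the reliable way to confirm what the tokenizer actually produced for this model.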

In [10]:
def formatting_prompts_func_test(examples):
    convos = []

    # Iterate through each item in the batch
    for question, label, answer in zip(examples["questions"], examples["labels"], examples["answers"]):
        tool_user = {
            "content": select_prompt(label),
            "role": "system"
        }
        query_user = {
            "content": f"質問:{question}",
            "role": "user"
        }
        assistant = {
            "content": f"回答:",
            "role": "assistant"
        }
        convos.append([tool_user, query_user, assistant])

    texts = [tokenizer.apply_chat_template(
        convo,
        tokenize=False,
        add_generation_prompt=True,
        return_tensors=None,
    ) + tokenizer.eos_token for convo in convos] # TODO: this part should probably be changed

    return {"text": texts}

In [11]:
# Apply the formatting on dataset
train_data = train_data.map(formatting_prompts_func, batched = True,)
# eval_data = eval_data.map(formatting_prompts_func, batched = True,)
# test_data = test_data.map(formatting_prompts_func_test, batched = True,)

Map:   0%|          | 0/436 [00:00<?, ? examples/s]

### Collecting the Linear Layers Eligible for LoRA

In [12]:
import re

model_modules = str(model.modules)
pattern = r"\((\w+)\): Linear"
linear_layer_names = re.findall(pattern, model_modules)
target_modules = list(set(linear_layer_names))
print(target_modules)

['v_proj', 'lm_head', 'o_proj', 'k_proj', 'q_proj', 'down_proj', 'up_proj', 'gate_proj']
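The cell above works by pattern-matching the printed module tree. The sketch below runs the same regular expression on a small, hypothetical excerpt of such a printout (layer names and sizes are illustrative) to show which entries it picks up.

```python
import re

# Hypothetical excerpt of a PyTorch module repr, shortened for illustration.
sample_repr = """
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(99584, 5120)
    (layers): ModuleList(
      (0): LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
          (k_proj): Linear4bit(in_features=5120, out_features=5120, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=5120, out_features=13824, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
      )
    )
  )
  (lm_head): Linear(in_features=5120, out_features=99584, bias=False)
)
"""

# Same pattern as the notebook: a parenthesised name followed by a class
# whose name starts with "Linear" (so "Linear4bit" matches as well).
pattern = r"\((\w+)\): Linear"
names = sorted(set(re.findall(pattern, sample_repr)))
print(names)
```

Because the pattern is a prefix match, it also captures `lm_head`, which is why the output head ends up in `target_modules` and why Unsloth later logs that it is training `lm_head`.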


### Configuring QLoRA

In [13]:
r = 16
lora_alpha = 16

model = FastLanguageModel.get_peft_model(
    model,
    target_modules=target_modules,
    r=r,
    lora_alpha=lora_alpha,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=False,
    loftq_config=None
)

Unsloth: Offloading output_embeddings to disk to save VRAM


  offloaded_W = torch.load(filename, map_location = "cpu", mmap = True)
Unsloth 2024.12.4 patched 40 layers with 40 QKV layers, 40 O layers and 40 MLP layers.


Unsloth: Training lm_head in mixed precision to save VRAM
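A back-of-the-envelope count helps explain how many weights this configuration trains. The sketch below assumes model dimensions that are not shown in this notebook (hidden size 5120, MLP size 13824, 40 layers, vocabulary 99584, no grouped-query attention) and assumes that `lm_head` is trained as a full matrix rather than through a low-rank adapter; both are assumptions to check against the model configuration.

```python
# Assumed model dimensions (not shown in this notebook); treat them as
# hypothetical values to verify against the model's published configuration.
HIDDEN = 5120
INTERMEDIATE = 13824
NUM_LAYERS = 40
VOCAB = 99584
R = 16  # LoRA rank used above

def lora_params(in_features, out_features, rank):
    """A is (rank x in), B is (out x rank): rank * (in + out) weights."""
    return rank * (in_features + out_features)

attention = 4 * lora_params(HIDDEN, HIDDEN, R)        # q, k, v, o
mlp = (2 * lora_params(HIDDEN, INTERMEDIATE, R)       # gate, up
       + lora_params(INTERMEDIATE, HIDDEN, R))        # down
adapters = NUM_LAYERS * (attention + mlp)

# Assumption: lm_head is trained as a full copy rather than through an adapter.
lm_head = HIDDEN * VOCAB

print(f"adapter weights : {adapters:,}")
print(f"lm_head weights : {lm_head:,}")
print(f"total trainable : {adapters + lm_head:,}")
```

Under these assumptions the total comes to 572,456,960, the same figure the training banner reports further down, which suggests that most of the trainable weights sit in `lm_head` rather than in the adapters. Dropping `lm_head` from `target_modules` would be one way to shrink that number, at the cost of leaving the output head frozen.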


### Defining Training Arguments

In [30]:
# Use a collator to specify the response template so that only the tokens that follow it contribute to the loss.
# from trl import DataCollatorForCompletionOnlyLM
# response_template_ids = tokenizer.encode("assistant")
# collator = DataCollatorForCompletionOnlyLM(response_template_ids, tokenizer=tokenizer)

In [14]:
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

training_args = TrainingArguments(
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,  # added: batch size for evaluation
    gradient_accumulation_steps=4,
    warmup_steps=5,
    learning_rate=2e-4,
    num_train_epochs=5,
    fp16=not is_bfloat16_supported(),  # call the function; the bare name is always truthy
    bf16=is_bfloat16_supported(),
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="linear",
    seed=3407,
    output_dir="./outputs",
    report_to="wandb",
    logging_steps=1,
    logging_strategy="steps",
    # Evaluation-related parameters
    # evaluation_strategy="steps",     # run evaluation at fixed step intervals
    # eval_steps=5,                 # evaluate every 5 steps
    save_strategy="steps",          # save checkpoints at fixed step intervals
    save_steps=5,                 # save a checkpoint every 5 steps
    greater_is_better=False,        # lower is better for the tracked metric (loss)
    save_only_model=False
)
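The arguments above determine the number of optimizer steps, the checkpoint schedule, and the learning-rate curve. The following sketch works through that arithmetic for a single GPU. It assumes 436 training examples (the count shown by the `Map` progress bar), assumes that a trailing partial accumulation window is dropped, and uses a hand-written `linear_lr` helper as a stand-in for a linear warmup/decay schedule.

```python
import math

num_examples = 436        # from the Map progress bar
per_device_batch = 16
grad_accum = 4
epochs = 5
warmup_steps = 5
peak_lr = 2e-4
save_steps = 5

micro_batches = math.ceil(num_examples / per_device_batch)
# Assumption: a trailing partial accumulation window is dropped (floor division).
updates_per_epoch = max(micro_batches // grad_accum, 1)
total_steps = updates_per_epoch * epochs
checkpoints = list(range(save_steps, total_steps + 1, save_steps))

def linear_lr(step):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

print("effective batch :", per_device_batch * grad_accum)
print("total steps     :", total_steps)
print("checkpoints at  :", checkpoints)
for step in (0, 5, 20, 35):
    print(f"lr at step {step:>2} : {linear_lr(step):.2e}")
```

The result, 35 steps with checkpoints at every multiple of 5, agrees with the training banner and with the `checkpoint-5` through `checkpoint-35` directories that appear in the training log.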

### Configuring wandb

In [15]:
config = {
    "output_dir": training_args.output_dir,
    "num_train_epochs": training_args.num_train_epochs,
    "per_device_train_batch_size": training_args.per_device_train_batch_size,
    "bf16": training_args.bf16,
    "optim": training_args.optim,
    "learning_rate": training_args.learning_rate,
    "warmup_steps": training_args.warmup_steps,  # the run sets warmup_steps, not warmup_ratio
    "lr_scheduler_type": training_args.lr_scheduler_type
}

In [16]:
# Forward the trainer's loss values to wandb every `log_every` steps
import wandb
from transformers import TrainerCallback

class LoggingCallback(TrainerCallback):
    def __init__(self, log_every=16):
        self.log_every = log_every

    def on_log(self, args, state, control, logs=None, **kwargs):
        # Skip empty log events and steps that are not on the logging interval
        if not logs or state.global_step % self.log_every != 0:
            return
        if "loss" in logs:
            wandb.log({"training_loss": logs["loss"]}, step=state.global_step)
        if "eval_loss" in logs:
            wandb.log({"validation_loss": logs["eval_loss"]}, step=state.global_step)

logging_callback = LoggingCallback(log_every=1)

In [17]:
import os
import wandb
from google.colab import userdata

def setup_wandb(project_name: str, run_name: str, config: dict, job_type=None):
    # set up your API key
    try:
        WANDB_KEY = userdata.get('WANDB_API_KEY')
        wandb.login(key=WANDB_KEY)
        os.environ["WANDB_ENTITY"] = "y-hiroki-rad"
    except KeyError:
        raise EnvironmentError("WANDB_API_KEY is not set in the environment variables.")
    except Exception as e:
        print(f"Error logging into WandB: {e}")

    # Optional: Log models
    os.environ["WANDB_LOG_MODEL"] = "checkpoint"
    os.environ["WANDB_WATCH"] = "all"
    os.environ["WANDB_SILENT"] = "true"

    # Initialize the WandB run
    try:
        wandb.init(project=project_name, name=run_name, config=config, job_type=job_type)
        print(f"WandB run initialized: Project - {project_name}, Run - {run_name}")
    except Exception as e:
        print(f"Error initializing WandB run: {e}")

In [18]:
proj_name = model_name.replace("/", "-")

project_name = f"{proj_name}-{r}-{lora_alpha}-task_solve-llm-jp-13b"
run_name = "elyza-100-fine-tuning-epoch-5"
job_type = "fine-tuning"

setup_wandb(project_name=project_name, run_name=run_name, config=config, job_type=job_type)

wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


wandb: Currently logged in as: y-hiroki-rad. Use `wandb login --relogin` to force relogin


WandB run initialized: Project - llm-jp-llm-jp-3-13b-16-16-task_solve-llm-jp-13b, Run - elyza-100-fine-tuning


### Training with SFTTrainer and Unsloth

In [19]:
from trl import SFTTrainer

trainer = SFTTrainer(
    model = model,
    processing_class = tokenizer,
    train_dataset = train_data,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc=2,
    # eval_dataset  = eval_data,
    packing = False,
    args = training_args,
    callbacks = [logging_callback],
)

Map (num_proc=2):   0%|          | 0/436 [00:00<?, ? examples/s]

In [20]:
# Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA A100-SXM4-40GB. Max memory = 39.564 GB.
8.621 GB of memory reserved.


In [21]:
from unsloth import unsloth_train

trainer_stats = unsloth_train(trainer)

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 436 | Num Epochs = 5
O^O/ \_/ \    Batch size per device = 16 | Gradient Accumulation steps = 4
\        /    Total batch size = 64 | Total steps = 35
 "-____-"     Number of trainable parameters = 572,456,960


Step,Training Loss
1,2.1101
2,2.0902
3,2.0026
4,1.7899
5,1.5415
6,1.2409
7,1.0605
8,0.7799
9,0.6586
10,0.6397


wandb: Adding directory to artifact (./outputs/checkpoint-5)... Done. 5.7s
wandb: Adding directory to artifact (./outputs/checkpoint-10)... Done. 5.9s
wandb: Adding directory to artifact (./outputs/checkpoint-15)... Done. 7.8s
wandb: Adding directory to artifact (./outputs/checkpoint-20)... Done. 7.8s
wandb: Adding directory to artifact (./outputs/checkpoint-25)... Done. 10.6s
wandb: Adding directory to artifact (./outputs/checkpoint-30)... Done. 10.2s
wandb: Adding directory to artifact (./outputs/checkpoint-35)... Done. 6.5s


In [22]:
wandb.finish()


Run history:
train/epoch,▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▅▅▅▅▆▆▆▆▆▇▇▇▇▇████
train/global_step,▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇████
train/grad_norm,██▇▆▄▄▅▄▂▂▂▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
train/learning_rate,▂▄▅▇███▇▇▇▇▆▆▆▆▅▅▅▅▅▄▄▄▄▃▃▃▃▂▂▂▂▁▁▁
train/loss,███▇▆▅▄▃▃▂▂▂▂▂▂▁▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
training_loss,███▇▆▅▄▃▃▂▂▂▂▂▂▁▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁

Run summary:
total_flos,1.1057147768807424e+17
train/epoch,5.0
train/global_step,35.0
train/grad_norm,0.48063
train/learning_rate,0.0
train/loss,0.28
train_loss,0.66175
train_runtime,1939.0475
train_samples_per_second,1.124
train_steps_per_second,0.018
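The throughput figures in the summary can be cross-checked from the run's own numbers. This short sketch recomputes them from the example count, epoch count, step count, and `train_runtime` reported above; nothing here is measured independently.

```python
num_examples = 436
epochs = 5
total_steps = 35
train_runtime_s = 1939.0475  # train_runtime from the summary

samples_per_second = num_examples * epochs / train_runtime_s
steps_per_second = total_steps / train_runtime_s
seconds_per_step = train_runtime_s / total_steps

print(f"samples/s : {samples_per_second:.3f}")
print(f"steps/s   : {steps_per_second:.3f}")
print(f"s/step    : {seconds_per_step:.1f}")
```

The recomputed values round to the reported 1.124 samples per second and 0.018 steps per second, which works out to a little under a minute per optimizer step including checkpoint saves.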


### Merging the Model and Pushing It to Hugging Face

In [23]:
import torch
from peft import PeftModel
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer
)

In [24]:
%ls

elyza_tasks_100_labels.csv     llm-jp-3-13b-task_solve.ipynb      wandb/
huggingface_tokenizers_cache/  outputs/
label_data_all.csv             _unsloth_temporary_saved_buffers/


In [25]:
model_name = "llm-jp/llm-jp-3-13b"
adapter = "outputs/checkpoint-35"
output_dir = f"./{model_name}-{r}-epoch-1-ft"

In [26]:
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map={"": 0}, torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Loading checkpoint shards:   0%|          | 0/6 [00:00<?, ?it/s]

tokenizer_config.json:   0%|          | 0.00/494 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/6.42M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/237 [00:00<?, ?B/s]

In [27]:
model = PeftModel.from_pretrained(model, adapter)
model = model.merge_and_unload()
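Merging folds each adapter into the base weight it was attached to, so the saved model no longer needs PEFT at inference time. In the usual LoRA formulation the merged weight is W + (alpha / r) · B · A. The sketch below shows that update on a tiny hand-made example in plain Python; `matmul` and `merge_lora` are hypothetical helpers, and this is an illustration of the formula rather than PEFT's implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A, the standard LoRA merge."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]

# Hypothetical 2x2 weight with a rank-1 adapter, for illustration only.
W = [[1.0, 0.0],
     [0.0, 1.0]]
A = [[0.5, -0.5]]      # rank x in_features
B = [[2.0],
     [4.0]]            # out_features x rank

merged = merge_lora(W, A, B, alpha=1, rank=1)
print(merged)
```

With the settings used in this notebook (`r = 16`, `lora_alpha = 16`) the scale factor alpha / r is 1, so the product B · A would be added to the base weight unscaled.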

In [28]:
HF_TOKEN_WRITE = userdata.get("HF_TOKEN_WRITE")

!huggingface-cli login --token $HF_TOKEN_WRITE

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
The token `LLM_new_token` has been saved to /root/.cache/huggingface/stored_tokens
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: `LLM_new_token`


In [31]:
repo_id = f'hiroki-rad/{model_name.replace("/", "-")}-{r}-ft'

model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)

  0%|          | 0/6 [00:00<?, ?it/s]

model-00004-of-00006.safetensors:   0%|          | 0.00/4.93G [00:00<?, ?B/s]

model-00003-of-00006.safetensors:   0%|          | 0.00/4.88G [00:00<?, ?B/s]

model-00001-of-00006.safetensors:   0%|          | 0.00/4.98G [00:00<?, ?B/s]

model-00002-of-00006.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00005-of-00006.safetensors:   0%|          | 0.00/4.93G [00:00<?, ?B/s]

model-00006-of-00006.safetensors:   0%|          | 0.00/2.71G [00:00<?, ?B/s]

README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/hiroki-rad/llm-jp-llm-jp-3-13b-16-ft/commit/72c232922cd200ac17b09fbe0682165596fd3547', commit_message='Upload tokenizer', commit_description='', oid='72c232922cd200ac17b09fbe0682165596fd3547', pr_url=None, repo_url=RepoUrl('https://huggingface.co/hiroki-rad/llm-jp-llm-jp-3-13b-16-ft', endpoint='https://huggingface.co', repo_type='model', repo_id='hiroki-rad/llm-jp-llm-jp-3-13b-16-ft'), pr_revision=None, pr_num=None)