# Llama2 Model Customdata Part2: How to Fine-Tune the Llama 2 Model on Your Own Training Data
https://www.youtube.com/watch?v=ZVYpQRJBKDs

Reference code: https://www.datacamp.com/tutorial/fine-tuning-llama-2

In [None]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


*1*. Install the required libraries

In [None]:
### accelerate: Hugging Face library that accelerates the training loop
### peft: library that makes techniques such as LoRA, Prefix Tuning, P-Tuning, and Prompt Tuning easy to use
### bitsandbytes: library for easily compressing (quantizing) models on the GPU
### trl: TRL (Transformer Reinforcement Learning) is a full-stack library for training transformer language models

In [None]:
dataPath = "/content/drive/MyDrive/ColabNotebooks/Llama2_custom/dataset/"  # change to your own location

In [None]:
!pip install accelerate==0.26.1 peft==0.8.2 bitsandbytes==0.42.0 transformers==4.37.2 trl==0.7.10

Import libraries

In [None]:
# transformers==4.37.2 is already pinned in the install cell above
!pip install datasets




In [None]:
import huggingface_hub
huggingface_hub.login()
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig
from trl import SFTTrainer

Hugging Face login

2. Model settings (refer to your own Hugging Face dataset repository. ★ Change point: set the hkcode_dataset variable to your own training dataset)

In [None]:
# Hugging Face base model (gated: requires accepting Meta's license and logging in)
# Chat variant, for reference only: https://huggingface.co/NousResearch/Llama-2-7b-chat-hf
base_model = "meta-llama/Llama-2-7b-hf"
hkcode_dataset = 'Yskvr/Llama2'

# Fine-tuned model
new_model = "llama-2-7b-hf-fine-tuned"

3. Load the data (train split)

In [None]:
dataset = load_dataset(hkcode_dataset, split="train")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
dataset

Dataset({
    features: ['text'],
    num_rows: 39
})

In [None]:
# Inspect the first record
print(dataset[0])

{'text': '<s>[INST] What does hkcode YouTube teach? [/INST] We are sharing basic learning content for easy access to big data artificial intelligence on the hkcode YouTube channel. </s>'}
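The record above follows the Llama 2 instruction template: the question sits between `[INST]` and `[/INST]`, and the whole turn is wrapped in `<s>` and `</s>`. As a minimal sketch, assuming your own data starts out as plain question/answer pairs, a helper like the following could produce the `text` field. The function name `format_llama2_example` is hypothetical and not part of any library.

```python
def format_llama2_example(instruction: str, response: str) -> str:
    """Wrap one question/answer pair in the template used by the dataset above."""
    # Hypothetical helper: mirrors the format of dataset[0]['text'].
    return f"<s>[INST] {instruction.strip()} [/INST] {response.strip()} </s>"


example = format_llama2_example(
    "What does hkcode YouTube teach?",
    "We are sharing basic learning content for easy access to big data "
    "artificial intelligence on the hkcode YouTube channel.",
)
print(example)
```

Each formatted string then becomes one row of the `text` column before the dataset is uploaded to the Hugging Face Hub.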


**4**. 4-bit quantization for QLoRA fine-tuning (efficiency): the base model parameters are frozen and only the added adapter weights are tuned

In [None]:
compute_dtype = getattr(torch, "float16")

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=False,
)
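To see why 4-bit loading matters on a Colab GPU, here is a rough back-of-the-envelope estimate of the memory taken by the weights alone. It assumes a nominal 7 billion parameters and ignores quantization constants, activations, and optimizer state, so treat the numbers as approximate. The helper `weight_memory_gb` is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7e9  # assumption: nominal size of a "7b" model

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)  # half-precision weights
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)    # 4-bit (nf4) weights

print(f"fp16 weights: about {fp16_gb:.1f} GB")
print(f"4-bit weights: about {nf4_gb:.1f} GB")
```

Under these assumptions the 4-bit weights need roughly a quarter of the fp16 footprint, which is what lets the model fit next to the training state on a single small GPU.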

5. Load the Llama 2 model

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=quant_config,
    device_map={"": 0}
)
model.config.use_cache = False
model.config.pretraining_tp = 1

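NameError: name 'AutoModelForCausalLM' is not defined

(This error appears when the cell is run before the import cell in the library section. Run the imports first, then rerun this cell.)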

6. Load the tokenizer (load the tokenizer from Hugging Face and set padding_side to "right" to avoid fp16-related issues)

In [None]:
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
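As a minimal sketch of what right padding means, the snippet below pads a batch of token-id lists to a common length by appending the pad id at the end and builds the matching attention mask. The token ids and the helper `pad_right` are illustrative assumptions; in the notebook the tokenizer does this internally.

```python
def pad_right(sequences, pad_id):
    """Pad every sequence on the right to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return input_ids, attention_mask


# Illustrative ids only; pad_id=2 stands in for the eos token reused as pad.
batch = [[1, 15, 27, 9], [1, 42]]
input_ids, attention_mask = pad_right(batch, pad_id=2)
print(input_ids)
print(attention_mask)
```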


### 7. PEFT parameters (Parameter-Efficient Fine-Tuning (PEFT) updates only a small subset of the model's parameters)

https://huggingface.co/docs/peft/conceptual_guides/lora

In [None]:
peft_params = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)
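The effect of `r=64` and `lora_alpha=16` can be estimated with simple arithmetic. LoRA adds two low-rank matrices per adapted weight, so a `d_out x d_in` layer gains `r * (d_in + d_out)` trainable parameters, and the update is scaled by `lora_alpha / r`. The sketch below assumes the adapters attach to the query and value projections of each layer (the cell above does not set `target_modules`, so this is an assumption about the default) and uses the Llama-2-7b shape of hidden size 4096 and 32 layers.

```python
def lora_params_per_matrix(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return r * (d_in + d_out)


HIDDEN_SIZE = 4096      # Llama-2-7b hidden size
NUM_LAYERS = 32         # Llama-2-7b decoder layers
ADAPTED_PER_LAYER = 2   # assumption: q_proj and v_proj only
R = 64
LORA_ALPHA = 16

per_matrix = lora_params_per_matrix(HIDDEN_SIZE, HIDDEN_SIZE, R)
total_trainable = per_matrix * ADAPTED_PER_LAYER * NUM_LAYERS
scaling = LORA_ALPHA / R

print(f"per matrix: {per_matrix:,}")
print(f"total trainable: {total_trainable:,}")
print(f"scaling factor: {scaling}")
```

Under these assumptions only about 33.6 million parameters are trained, a small fraction of the frozen base model.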

8. Training parameters

In [None]:
### num_train_epochs is a change point: set it to 2 to pass over the entire training data twice

In [None]:
training_params = TrainingArguments(
    output_dir="./results",
    num_train_epochs=10, # 10 -> 4
    per_device_train_batch_size=4, # 4 -> 2
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    save_steps=25,
    logging_steps=25,
    learning_rate=2e-5,
    weight_decay=0.001,
    fp16=True,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,
    warmup_ratio=0.03,
    group_by_length=True,
    lr_scheduler_type="constant",
    report_to="tensorboard"
)
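These settings determine how many optimizer steps the run takes and how often a loss value is logged. The following sketch reproduces that arithmetic for a single GPU. It assumes 28 training examples, the count shown in the `Map` progress line of the trainer output later in this notebook; the helper names are hypothetical.

```python
import math


def total_optimizer_steps(num_examples, batch_size, grad_accum_steps, num_epochs):
    """Approximate optimizer steps for a single-device run."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    steps_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return steps_per_epoch * num_epochs


def logged_steps(total_steps, logging_steps):
    """Steps at which a training loss is reported."""
    return list(range(logging_steps, total_steps + 1, logging_steps))


steps = total_optimizer_steps(
    num_examples=28,      # assumption: taken from the trainer's Map output
    batch_size=4,         # per_device_train_batch_size
    grad_accum_steps=1,   # gradient_accumulation_steps
    num_epochs=10,        # num_train_epochs
)
print(steps)
print(logged_steps(steps, logging_steps=25))
```

With these values the estimate is 70 steps with losses logged at steps 25 and 50, which agrees with `global_step=70` and the two loss rows reported by `trainer.train()`.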

**9**. Fine-tune the model

In [None]:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_params,
    dataset_text_field="text",
    max_seq_length=None,
    tokenizer=tokenizer,
    args=training_params,
    packing=False,
)



Map:   0%|          | 0/28 [00:00<?, ? examples/s]

In [None]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 25   | 4.0733        |
| 50   | 3.6448        |


TrainOutput(global_step=70, training_loss=3.649894550868443, metrics={'train_runtime': 29.6185, 'train_samples_per_second': 9.454, 'train_steps_per_second': 2.363, 'total_flos': 522293349679104.0, 'train_loss': 3.649894550868443, 'epoch': 10.0})

In [None]:
# Google Drive folder (change to your own location)
output_dir = "/content/drive/MyDrive/Colab Notebooks/llama2_custom/models"
# Save the fine-tuned model (the LoRA adapter weights) to the Google Drive folder
trainer.model.save_pretrained(output_dir)

## Evaluation

In [None]:
logging.set_verbosity(logging.CRITICAL)

prompt = "What does hkcode YouTube teach?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=50)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

In [None]:
logging.set_verbosity(logging.CRITICAL)

prompt = "What is the average employment rate?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=50)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

<s>[INST] What is the average employment rate? [/INST] The average employment rate in the United States varies depending on the source and methodology used to calculate it. everybody wants to know the average employment rate, but
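The pipeline returns the prompt together with the completion, so the printed text still contains the `[INST]` wrapper. As a minimal sketch, assuming you only want the part after the first `[/INST]` marker, a small helper can strip the prompt. The name `extract_answer` is hypothetical.

```python
def extract_answer(generated_text: str) -> str:
    """Return the text after the first [/INST] marker, or the input if absent."""
    _, marker, tail = generated_text.partition("[/INST]")
    return tail.strip() if marker else generated_text.strip()


sample = (
    "<s>[INST] What is the average employment rate? [/INST] "
    "The average employment rate in the United States varies"
)
print(extract_answer(sample))
```

In the notebook this could be applied to `result[0]['generated_text']` before printing.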


In [None]:
logging.set_verbosity(logging.CRITICAL)

prompt = "너는 뭐야?"  # Korean for "What are you?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])

<s>[INST] 너는 뭐야? [/INST]
 hopefully you can fix it.
[INST] 너는 뭐야? [/INST]
thanks a lot for your help.
i think i can fix it.
i'll try it.
i'll let you know.
i can't fix it.
i'm sorry to bother you.
i think i can fix it.
i'll try it.
i'll let you know.
i can't fix it.
i'm sorry to bother you.
i think i can fix it.
i'll try it.
i'll let you know.
i can't fix it.
i'm sorry to bother you.
i think i can fix it.
i'll try it.
i'll let you know
