Load the LoRA adapter and the existing pruned model, merge them, and save the result (Jaeyun's LoRA)

In [None]:
from peft import PeftModel
import torch
from model_loader import load_pruned_model

model_path = "/pruned_activation"
base_model, tokenizer = load_pruned_model(
    model_path,
    device=torch.device("cuda")
)
full_model = PeftModel.from_pretrained(base_model, "/lora_tomato")

full_model = full_model.merge_and_unload()

# Save the merged full model together with its tokenizer
full_model.save_pretrained("/full_model")
tokenizer.save_pretrained("/full_model")
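
As a rough illustration of what `merge_and_unload()` does to each adapted linear layer: the low-rank update is folded into the base weight, so the adapter is no longer needed at inference time. The sketch below assumes a standard LoRA layer with scaling `lora_alpha / r` (no rsLoRA or DoRA variants). It uses plain Python lists in place of tensors, and the function names and toy values are hypothetical, not part of PEFT.

```python
# Illustrative only: plain-Python matrices stand in for torch tensors.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, lora_alpha, r):
    """Return W + (lora_alpha / r) * (B @ A).

    Shapes: weight is (out, in), lora_a is (r, in), lora_b is (out, r).
    """
    scaling = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scaling * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x3 layer with a rank-1 adapter (values are made up).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]  # (r=1, in=3)
B = [[0.5], [-1.0]]    # (out=2, r=1)
merged = merge_lora(W, A, B, lora_alpha=2, r=1)
print(merged)
```

Because the merged weight has the same shape as the original, the saved model loads like any pruned checkpoint without PEFT installed.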

Load the LoRA adapter and the existing pruned model, merge them, and save the result (Seongmin's LoRA)

In [None]:
import torch
import json
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig
from safetensors.torch import load_file
from peft import PeftModel
import torch.nn as nn

from google.colab import drive
drive.mount('/content/drive')

tokenizer = AutoTokenizer.from_pretrained("/content/drive/MyDrive/smartfarm_pruning/models/pruned_activation", use_fast=False, local_files_only=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

added_token_path = "/content/drive/MyDrive/smartfarm_pruning/models/pruned_activation/added_tokens.json"

with open(added_token_path, "r", encoding="utf-8") as f:
    data = json.load(f)
    n_added = tokenizer.add_tokens(list(set(data)))
    print(f"Added {n_added} tokens from added_tokens.json")

pruned_path     = "/content/drive/MyDrive/smartfarm_pruning/models/pruned_activation/pruned_structure.json"
state_dict_path = "/content/drive/MyDrive/smartfarm_pruning/models/pruned_activation/model.safetensors"

with open(pruned_path, "r") as f:
    pruned_info = json.load(f)

if "layer_structure" not in pruned_info:
    raise ValueError("❌ 'layer_structure' not found in pruned_structure.json")

layer_sizes = {int(k): v["intermediate_size"] for k, v in pruned_info["layer_structure"].items()}
print(f"Detected {len(layer_sizes)} layers from pruning info.")

config = AutoConfig.from_pretrained("/content/drive/MyDrive/smartfarm_pruning/models/pruned_activation", trust_remote_code=False, local_files_only=True)

# Build the base model skeleton from the config (weights are randomly initialized at this point)
base_model = AutoModelForCausalLM.from_config(config)

for i, layer in enumerate(base_model.model.layers):
    if i in layer_sizes:
        new_dim = layer_sizes[i]
        in_dim  = layer.mlp.up_proj.weight.shape[1]
        # Replace gate/up/down projections with layers of the pruned width
        layer.mlp.gate_proj = nn.Linear(in_dim, new_dim, bias=False)
        layer.mlp.up_proj   = nn.Linear(in_dim, new_dim, bias=False)
        layer.mlp.down_proj = nn.Linear(new_dim, in_dim, bias=False)

print("Structure rebuilt. Loading pruned weights...")
state_dict = load_file(state_dict_path)
missing, unexpected = base_model.load_state_dict(state_dict, strict=False)
print(f"Weights loaded (missing={len(missing)}, unexpected={len(unexpected)})")

device = "cuda" if torch.cuda.is_available() else "cpu"

# Resize the embeddings to match the tokenizer's vocabulary size
base_model.resize_token_embeddings(len(tokenizer))
base_model.to(device)

model = PeftModel.from_pretrained(base_model, "/content/drive/MyDrive/smartfarm_pruning/lora_tomato/LoRA_adapter")

# Merge the LoRA weights into the base model
merged_model = model.merge_and_unload()

# Save the merged model
merged_model.save_pretrained("/content/drive/MyDrive/full_model", safe_serialization=True)
tokenizer.save_pretrained("/content/drive/MyDrive/full_model")
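
Since `load_state_dict(..., strict=False)` only reports key counts in the cell above, a shape check before loading can catch a mismatch between `pruned_structure.json` and the checkpoint early. The helper below is a minimal sketch: the function name is hypothetical, and the key pattern `model.layers.{i}.mlp.*.weight` is an assumption inferred from the attribute path `base_model.model.layers[i].mlp` used above.

```python
def find_mlp_shape_mismatches(layer_sizes, shapes, hidden_size):
    """Compare recorded intermediate sizes with checkpoint tensor shapes.

    layer_sizes: {layer_index: intermediate_size}, as built from pruned_structure.json.
    shapes: {state_dict_key: (rows, cols)}.
    Returns a list of (key, expected_shape, actual_shape) for every disagreement;
    actual_shape is None when the tensor is absent from the checkpoint.
    """
    problems = []
    for idx, inter in sorted(layer_sizes.items()):
        expected = {
            f"model.layers.{idx}.mlp.gate_proj.weight": (inter, hidden_size),
            f"model.layers.{idx}.mlp.up_proj.weight": (inter, hidden_size),
            f"model.layers.{idx}.mlp.down_proj.weight": (hidden_size, inter),
        }
        for key, want in expected.items():
            got = shapes.get(key)
            if got != want:
                problems.append((key, want, got))
    return problems


# Toy data: layer 0 is consistent, layer 1 has a down_proj of the wrong width.
toy_sizes = {0: 6, 1: 4}
toy_shapes = {
    "model.layers.0.mlp.gate_proj.weight": (6, 8),
    "model.layers.0.mlp.up_proj.weight": (6, 8),
    "model.layers.0.mlp.down_proj.weight": (8, 6),
    "model.layers.1.mlp.gate_proj.weight": (4, 8),
    "model.layers.1.mlp.up_proj.weight": (4, 8),
    "model.layers.1.mlp.down_proj.weight": (8, 6),
}
for key, want, got in find_mlp_shape_mismatches(toy_sizes, toy_shapes, hidden_size=8):
    print(f"{key}: expected {want}, found {got}")
```

In the notebook, the `shapes` argument could be built as `{k: tuple(v.shape) for k, v in state_dict.items()}` right after `load_file`.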

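Generate the GGUF file and apply 4-bit quantization (when running locally, remove the leading exclamation marks from the commands below and run them one by one in a terminal)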

In [None]:
# Generate the GGUF file
!git clone https://github.com/ggerganov/llama.cpp.git

%cd /content/llama.cpp
!pip install -r requirements.txt

!mkdir -p build
!cmake -B build
!cmake --build build

!python convert_hf_to_gguf.py /full_model --outfile /quantized_tomato.gguf --outtype f32

# Apply Q4_K_M (4-bit) quantization
!./build/bin/llama-quantize /quantized_tomato.gguf /quantized_tomatoQ4_K_M.gguf Q4_K_M
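
For a local run outside Colab, the two steps above can also be driven from Python instead of typed into a terminal. The sketch below only assembles the argument lists, reusing the flags and paths from the cell above. The function name `build_gguf_commands` is hypothetical, and the commands are assumed to run from inside the llama.cpp checkout.

```python
import shlex


def build_gguf_commands(hf_dir, f32_gguf, quant_gguf, quant_type="Q4_K_M"):
    """Return the convert and quantize commands as argument lists."""
    convert = [
        "python", "convert_hf_to_gguf.py", hf_dir,
        "--outfile", f32_gguf,
        "--outtype", "f32",
    ]
    quantize = ["./build/bin/llama-quantize", f32_gguf, quant_gguf, quant_type]
    return [convert, quantize]


commands = build_gguf_commands(
    "/full_model", "/quantized_tomato.gguf", "/quantized_tomatoQ4_K_M.gguf"
)
for cmd in commands:
    # shlex.join renders the list as a shell-safe string for inspection
    print(shlex.join(cmd))
```

Each list can then be passed to `subprocess.run(cmd, cwd=..., check=True)` with `cwd` pointing at the llama.cpp directory, so that a failed conversion stops the run before quantization is attempted.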

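Modularized code for model merging, quantization, and GGUF conversion (use the function as-is and pass only the file name as a parameter; several models can be processed in one run)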

In [None]:
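!git clone https://github.com/ggerganov/llama.cpp.git

# Install requirements and build inside the clone.
# Paths are relative to the clone location (assumed to be /content, the Colab default).
!pip install -r llama.cpp/requirements.txt

!cmake -S llama.cpp -B llama.cpp/build
!cmake --build llama.cpp/build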

import os

from peft import PeftModel
import torch
from model_loader import load_pruned_model

from google.colab import drive
drive.mount('/content/drive')

def export_pruned_model_to_gguf(name: str):
    pruned_path = f"/content/drive/MyDrive/smartfarm_pruning/1st/models/pruned/activation/{name}"
    lora_path   = f"/content/drive/MyDrive/smartfarm_pruning/1st/models/LoRA/activation/{name}"
    out_hf_path = f"/content/full_model_{name}"
    out_gguf    = f"/content/drive/MyDrive/{name}.gguf"

    # Load the pruned model
    base_model, tokenizer = load_pruned_model(
        pruned_path,
        device="cuda",
        dtype=torch.float16
    )

    # Merge the LoRA adapter into the pruned model
    full_model = PeftModel.from_pretrained(base_model, lora_path)
    full_model = full_model.merge_and_unload()

    # Save the merged model
    full_model.save_pretrained(out_hf_path)
    tokenizer.save_pretrained(out_hf_path)

    # Convert to GGUF (q8_0 quantization is applied during conversion)
    os.system(f"""
    cd /content/llama.cpp && \
    python convert_hf_to_gguf.py {out_hf_path} \
        --outfile {out_gguf} \
        --outtype q8_0
    """)

    print(f"\nQuantization complete: {out_gguf}")

export_pruned_model_to_gguf("pruned_activation_5")