<a href="https://colab.research.google.com/github/softmurata/colab_notebooks/blob/main/llm/minotaur_15B.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Installation

In [None]:
!pip install auto-gptq

In [None]:
!pip install xformers

Load model

In [None]:
from transformers import AutoTokenizer, pipeline
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "TheBloke/minotaur-15B-GPTQ"
model_basename = "gptq_model-4bit-128g"

use_triton = False

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        model_basename=model_basename,
        use_safetensors=True,
        trust_remote_code=False,
        device="cuda:0",
        use_triton=use_triton,
        quantize_config=None)
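
As a rough sanity check before loading, the weight footprint can be estimated from the bit width in the file name. The sketch below assumes roughly 15 billion parameters (taking the "15B" in the model name at face value) and 4-bit weights (from `4bit` in `model_basename`). `estimate_weight_gb` is a hypothetical helper, and it ignores the per-group scales and zero points, activations, and the KV cache, so real GPU usage will be higher.

```python
def estimate_weight_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate size of the quantized weights in gigabytes (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

# Assumed values: 15e9 parameters, 4-bit weights.
print(f"{estimate_weight_gb(15e9, 4):.1f} GB")
```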

Inference

In [None]:
# Note: check the prompt template is correct for this model.
prompt = "Please create a 20-minute presentation about AI."
prompt_template=f'''USER: {prompt}
ASSISTANT:'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
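# do_sample=True is needed for temperature to take effect; without it, generate() decodes greedily.
output = model.generate(inputs=input_ids, do_sample=True, temperature=0.7, max_new_tokens=2048)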
print(tokenizer.decode(output[0], skip_special_tokens=True))
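
Every cell in this notebook builds the same `USER:` / `ASSISTANT:` template by hand. A small helper keeps it in one place. `build_prompt` is a hypothetical name, and the template is simply the one used above, which should still be checked against the model card.

```python
def build_prompt(user_message: str) -> str:
    """Wrap a user message in the USER/ASSISTANT template used in this notebook."""
    return f"USER: {user_message}\nASSISTANT:"

prompt_template = build_prompt("Please create a 20-minute presentation about AI.")
print(prompt_template)
```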

In [None]:
print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,  # required for temperature and top_p to have any effect
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)

print(pipe(prompt_template)[0]['generated_text'])
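
The `generated_text` returned by the pipeline normally begins with the prompt itself. Below is a minimal sketch for keeping only the model's reply, assuming the output starts with the prompt verbatim. `extract_completion` is a hypothetical helper, and it falls back to the full text when that assumption does not hold.

```python
def extract_completion(full_text: str, prompt: str) -> str:
    """Return the text after the prompt, or the full text if the prompt is not a prefix."""
    if full_text.startswith(prompt):
        return full_text[len(prompt):].lstrip()
    return full_text

example = "USER: Hello\nASSISTANT: Hi there!"
print(extract_completion(example, "USER: Hello\nASSISTANT:"))  # Hi there!
```

The text-generation pipeline also accepts `return_full_text=False`, which should make this post-processing unnecessary.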

Summarization

In [None]:
prompt = """Please summarize the following context within 100 words. Context: Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP) but demand massive GPU resources for training. Lowering the threshold for LLMs training would encourage greater participation from researchers,
benefiting both academia and society. While existing approaches have focused
on parameter-efficient fine-tuning, which tunes or adds a small number of parameters, few have addressed the challenge of tuning the full parameters of LLMs
with limited resources. In this work, we propose a new optimizer, LOw-Memory
Optimization (LOMO), which fuses the gradient computation and the parameter
update in one step to reduce memory usage. By integrating LOMO with existing
memory saving techniques, we reduce memory usage to 10.8% compared to the
standard approach (DeepSpeed solution). Consequently, our approach enables the
full parameter fine-tuning of a 65B model on a single machine with 8×RTX 3090,
each with 24GB memory.
1 INTRODUCTION
Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP), demonstrating remarkable abilities such as emergence and grokking (Wei et al., 2022), pushing model size
to become larger and larger. However, training these models with billions of parameters, such as
those with 30B to 175B parameters, raises the bar for NLP research. Tuning LLMs often requires
expensive GPU resources, such as 8×80GB devices, making it difficult for small labs and companies
to participate in this area of research.
Recently, parameter-efficient fine-tuning methods (Ding et al., 2022), such as LoRA (Hu et al., 2022)
and Prefix-tuning (Li & Liang, 2021), provide solutions for tuning LLMs with limited resources.
However, these methods do not offer a practical solution for full parameter fine-tuning, which has
been acknowledged as a more powerful approach than parameter-efficient fine-tuning (Ding et al.,
2022; Sun et al., 2023). In this work, we aim to explore techniques for accomplishing full parameter
fine-tuning in resource-limited scenarios.
We analyze the four aspects of memory usage in LLMs, namely activation, optimizer states, gradient
tensor and parameters, and optimize the training process in three folds: 1) We rethink the functionality of an optimizer from an algorithmic perspective and find that SGD is a good replacement in terms
of fine-tuning full parameter for LLMs. This allows us to remove the entire part of optimizer states
since SGD does not store any intermediate state (Sec-3.1). 2) Our proposed optimizer, LOMO as illustrated in Figure 1, reduces the memory usage of gradient tensors to O(1), equivalent to the largest
gradient tensor’s memory usage (Sec-3.2). 3) To stabilize mix-precision training with LOMO, we
integrate gradient normalization, loss scaling, and transition certain computations to full precision
during training (Sec-3.3)."""

prompt_template=f'''USER: {prompt}
ASSISTANT:'''

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
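# do_sample=True is needed for temperature to take effect.
output = model.generate(inputs=input_ids, do_sample=True, temperature=0.7, max_new_tokens=4096)
# Decode only the newly generated tokens. Slicing the decoded string by len(prompt_template)
# is fragile because decoding does not always reproduce the prompt character for character.
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))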
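
The prompt asks for a summary within 100 words, but nothing guarantees that the model complies. A simple check is to count whitespace-separated words. This is an assumption that works for English but not for languages written without spaces, and `within_word_limit` is a hypothetical helper.

```python
def within_word_limit(text: str, limit: int = 100) -> bool:
    """True if the text has at most `limit` whitespace-separated words."""
    return len(text.split()) <= limit

summary = "LOMO fuses gradient computation and parameter updates to cut memory use."
print(within_word_limit(summary))  # True
```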

In [None]:
# Japanese prompt, in English: "Write a presentation about the latest AI."
prompt = "最新のAIについてプレゼンを書いて"

prompt_template=f'''USER: {prompt}
ASSISTANT:'''

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
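# do_sample=True is needed for temperature to take effect.
output = model.generate(inputs=input_ids, do_sample=True, temperature=0.7, max_new_tokens=4096)
# Decode only the newly generated tokens rather than slicing the decoded string by character count.
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))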