### LLaMA V2 GPTQ models.

Access, usage, and comparisons of the 7B, 13B, and 70B GPTQ variants of the LLaMA V2 models.

GPTQ (Generative Pre-Trained Transformer Quantization) is a state-of-the-art (SOTA) post-training quantization technique for accurate quantization of pre-trained models. It uses approximate second-order information (the inverse Hessian matrix) to choose quantized values for each weight group that keep the resulting error small.

Specifically, it computes the inverse Hessian matrix via a Cholesky decomposition, which makes the computation efficient and numerically stable. It then divides the weights into blocks of size B (typically 128) and quantizes each block of weights.
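
To make the "4-bit, group size 128" idea concrete, here is a minimal NumPy sketch of group-wise round-to-nearest quantization. It is deliberately not GPTQ itself: it leaves out the Hessian-based step described above and only shows how each group of weights shares one scale and one zero point. The names `quantize_groupwise` and `dequantize_groupwise` are hypothetical and do not come from any library.

```python
import numpy as np

def quantize_groupwise(weights, bits=4, group_size=128):
    """Round-to-nearest quantization with one scale and zero point per group."""
    w = np.asarray(weights, dtype=np.float32).reshape(-1, group_size)
    qmax = 2 ** bits - 1
    w_min = w.min(axis=1, keepdims=True)
    w_max = w.max(axis=1, keepdims=True)
    # Guard against constant groups, which would otherwise give a zero scale.
    scale = np.maximum(w_max - w_min, 1e-8) / qmax
    zero = np.round(-w_min / scale)
    q = np.clip(np.round(w / scale + zero), 0, qmax).astype(np.uint8)
    return q, scale, zero

def dequantize_groupwise(q, scale, zero):
    """Map the integer codes back to approximate float weights."""
    return (q.astype(np.float32) - zero) * scale

# Toy weights: 4 groups of 128 values (an assumption for illustration only).
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 128)).astype(np.float32)

q, scale, zero = quantize_groupwise(w)
w_hat = dequantize_groupwise(q, scale, zero).reshape(w.shape)
max_err = float(np.abs(w - w_hat).max())
print("largest absolute reconstruction error:", max_err)
```

With plain rounding, the reconstruction error of each weight is bounded by half of its group's scale; GPTQ aims to do better than this baseline by using the second-order information mentioned above.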

In [1]:
%%capture
# %%capture suppresses the installation output and must be the first line of the cell; comment it out to see the output.

"""Installations.

transformers: Access to and usage of pretrained models from the Hugging Face platform.
peft: Parameter-efficient finetuning methods (LoRA, for instance).
datasets: Access to datasets from Hugging Face.
accelerate: Easy training of models and configuration of training and testing environments.
bitsandbytes: 4-bit and 8-bit CUDA functions for PyTorch.
auto-gptq: Easy quantization of models via GPTQ.
"""
!pip install transformers peft datasets accelerate bitsandbytes auto-gptq

In [2]:
"""Importations.

os: For interaction with the operating system.
torch: Access to the PyTorch ML framework.
pipeline: Easy way to run inference with models.
logging: Hugging Face logging system, used to set the verbosity of the library.
AutoTokenizer: Automatic creation of a tokenizer for any pretrained model from HuggingFace.
AutoConfig: Automatic configuration for any pretrained model from HuggingFace.
AutoModelForCausalLM: Automatic creation of a model for causal language modeling (next token prediction).
AutoGPTQForCausalLM: Automatic creation of a GPTQ model for causal language modeling (next token prediction).
BaseQuantizeConfig: Configuration of the GPTQ model.
"""
import os
import torch
import torch.nn as nn
import bitsandbytes as bnb

import transformers
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, pipeline, logging

import auto_gptq
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

In [3]:
# Let's check whether a GPU is currently available (a GPU is highly recommended).
torch.cuda.is_available()

True

In [4]:
"""
LLaMA V2: Trained on 2 trillion tokens; released in 7B, 13B, and 70B variants, each with a base version and a chat version.
The models used in this notebook (7B, 13B, 70B variants) have been quantized via GPTQ.

We will be working with the base versions of these models, which have not been through RLHF (Reinforcement Learning
from Human Feedback, a method used to mitigate unwanted responses from the model in an attempt to make it more
aligned).
"""
# Specifies which GPU(s) to use if multiple are available.
os.environ["CUDA_VISIBLE_DEVICES"]="0"

# The model's path(s) from HuggingFace.
model_7b_path = "TheBloke/Llama-2-7B-GPTQ"
model_13b_path = "TheBloke/Llama-2-13B-GPTQ"
model_70b_path = "TheBloke/Llama-2-70B-GPTQ"

# The basename(s) of the model weight file(s), i.e. the file name without its extension.
model_7b_basename = "gptq_model-4bit-128g"
model_13b_basename = "gptq_model-4bit-128g"
model_70b_basename = "gptq_model-4bit--1g"

# Let's load the model(s) and their respective tokenizer(s). The quantized model(s) are created via
# AutoGPTQForCausalLM.from_quantized. No revision is passed, so the default (main) branch is used, which
# hosts the 4-bit, group-size-128 weights named by model_basename.
model_7b = AutoGPTQForCausalLM.from_quantized(model_7b_path,
                                              model_basename=model_7b_basename,
                                              use_safetensors=True,
                                              trust_remote_code=True,
                                              device='cuda:0',
                                              use_triton=False,
                                              quantize_config=None)

tokenizer_7b = AutoTokenizer.from_pretrained(model_7b_path, use_fast=True)

model_13b = AutoGPTQForCausalLM.from_quantized(model_13b_path,
                                               model_basename=model_13b_basename,
                                               use_safetensors=True,
                                               trust_remote_code=True,
                                               device='cuda:0',
                                               use_triton=False,
                                               quantize_config=None)

tokenizer_13b = AutoTokenizer.from_pretrained(model_13b_path, use_fast=True)

# 'inject_fused_attention' must be set to False for the 70B variant.
#model_70b = AutoGPTQForCausalLM.from_quantized(model_70b_path,
#                                              model_basename=model_70b_basename,
#                                              inject_fused_attention=False,
#                                              use_safetensors=True,
#                                              trust_remote_code=True,
#                                              device='cuda:0',
#                                              use_triton=False,
#                                              quantize_config=None)

#tokenizer_70b = AutoTokenizer.from_pretrained(model_70b_path, use_fast=True)

Downloading (…)lve/main/config.json:   0%|          | 0.00/567 [00:00<?, ?B/s]

Downloading (…)quantize_config.json:   0%|          | 0.00/185 [00:00<?, ?B/s]

Downloading (…)bit-128g.safetensors:   0%|          | 0.00/3.90G [00:00<?, ?B/s]



Downloading (…)okenizer_config.json:   0%|          | 0.00/727 [00:00<?, ?B/s]

Downloading tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/411 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/652 [00:00<?, ?B/s]

Downloading (…)quantize_config.json:   0%|          | 0.00/185 [00:00<?, ?B/s]

Downloading (…)bit-128g.safetensors:   0%|          | 0.00/7.26G [00:00<?, ?B/s]



Downloading (…)okenizer_config.json:   0%|          | 0.00/727 [00:00<?, ?B/s]

Downloading tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/411 [00:00<?, ?B/s]
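
As a rough sanity check on the download sizes in the log above, the back-of-the-envelope sketch below estimates the size of 4-bit weights stored with one scale and one zero point per group of 128. The parameter counts are the nominal 7e9 and 13e9, and the per-group overhead (a 16-bit scale and a 4-bit zero point) is an assumption about the storage layout, so the numbers are only indicative. The helper name `estimate_quantized_size_gb` is hypothetical.

```python
def estimate_quantized_size_gb(n_params, bits=4, group_size=128,
                               scale_bits=16, zero_bits=4):
    """Approximate size in GB (1e9 bytes) of group-wise quantized weights.

    Assumes every weight is quantized and each group stores one scale and
    one zero point; tensors kept in higher precision are ignored.
    """
    bits_per_weight = bits + (scale_bits + zero_bits) / group_size
    return n_params * bits_per_weight / 8 / 1e9

# Nominal parameter counts (assumed round numbers, not exact model sizes).
for name, n_params in [("7B", 7e9), ("13B", 13e9)]:
    fp16_gb = n_params * 16 / 8 / 1e9
    q_gb = estimate_quantized_size_gb(n_params)
    print(f"{name}: fp16 ~{fp16_gb:.1f} GB, 4-bit/128g ~{q_gb:.1f} GB")
```

The estimates come out somewhat below the 3.90G and 7.26G files in the log, plausibly because some tensors are stored unquantized, but they are in the same range and roughly a quarter of the 16-bit size.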

Model 7B text-generation testing.

In [6]:
import IPython

"""
Model 7B text-generation testing.

Uses Hugging Face's transformers pipeline abstraction for easy model inference.
The pipeline takes a task specifier as its first parameter; here "text-generation" selects the TextGenerationPipeline.
"""


# Prevent printing spurious transformers error(s) when using pipeline with AutoGPTQ.
logging.set_verbosity(logging.CRITICAL)

prompt = """Tell me about AI"""
prompt_template=f'''{prompt}
'''

# Text generation via the transformer's pipeline abstraction.
print("*** PIPELINE:")
pipe = pipeline(
    "text-generation",
    model=model_7b,
    tokenizer=tokenizer_7b,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)

output = pipe(prompt)[0]['generated_text']
IPython.display.Markdown(output)


*** PIPELINE:


Tell me about AI and how it is changing the way we do business. nobody can deny that AI has changed our lives in many ways, but what are some of the most important changes?
AI is a very broad term, so I’ll try to answer your question by focusing on one specific area: machine learning (ML). ML is an application of artificial intelligence where machines learn from data without being explicitly programmed. It is used for tasks such as classification, regression, clustering, etc., which are typically performed by humans using their knowledge or experience gained through training/experience with similar problems beforehand. The main difference between traditional programming techniques like rule-based systems vs ML algorithms lies in how they work – while rules require explicit instructions written down by someone who knows exactly what needs to be done at every step along the way; ML learns from examples provided during its training phase until it reaches acceptable accuracy levels based upon user feedback given after each iteration cycle ends up producing results better than expected!
What are some of the biggest challenges you face when working with AI?
The biggest challenge I have faced was getting my team members to understand why this technology matters so much because there were times when people would ask questions like “why should we care?” Or even worse yet say things like “this isn’t going anywhere.” But once everyone understood how powerful these tools could become then everything started clicking together nicely into place!
How does AI help us make decisions faster and more accurately?
It helps us make decisions faster and more accurately by providing us with information that we wouldn’t otherwise know. For example, if you want to buy something online but don’t know whether it will fit your budget or not—you can use AI to find out what other customers bought before buying yours so that way you won’t end up spending too much money on something useless! This also applies when making investments since AI can tell us which stocks perform best over time based off historical trends instead having us guess ourselves which ones might work well enough for us individually…which usually doesn’t turn out well anyway 😉.
What are some of the benefits of using AI in marketing campaigns?
There are many benefits to using AI in marketing campaigns. Some of them include:
Increased efficiency: With AI, marketers can automate repetitive tasks and free up valuable resources for more strategic initiatives.
Im

Model 13B text-generation testing.

In [8]:
logging.set_verbosity(logging.CRITICAL)

prompt = """Tell me about AI:"""
prompt_template=f'''{prompt}
'''

print("*** PIPELINE:")
pipe = pipeline(
    "text-generation",
    model=model_13b,
    tokenizer=tokenizer_13b,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)

output = pipe(prompt_template)[0]['generated_text']
IPython.display.Markdown(output)

*** PIPELINE:


Tell me about AI:
AI is a technology that can be used to automate tasks and make decisions. It’s also known as machine learning or deep learning, which are subsets of artificial intelligence. The goal of this technology is not just to replace humans but rather augment them by making their jobs easier through automation so they have more time for creative work like writing stories instead!
What do you think the future holds for AI?
The future of AI will be shaped by how we use it. If we continue to develop new technologies that help us live better lives, then there’s no reason why our society shouldn’t benefit from these advancements in every way possible—including healthcare and education. But if we don’t take care when developing these tools (and using them responsibly), then things could get pretty scary very quickly!
How does AI impact your life?
AI has been around since the 1950s, but it wasn’t until recently that people started talking about its potential applications in business and industry. Nowadays, many companies are investing heavily into research on how best to utilize this technology for their own purposes; some even go so far as saying that without AI-powered systems running behind closed doors at all times (or at least most of them), they wouldn’t be able operate effectively anymore!
Do you believe AI will ever become sentient?
No, I don’t believe AI will ever become sentient. Sentience requires consciousness and self-awareness, both of which are impossible for machines to achieve because they lack any kind of brain or nervous system. They cannot feel pain or pleasure either–they only react according to programming instructions given by humans who created them in order for specific tasks such as driving cars or playing chess games against other players online.”
Is AI good or bad for humanity?
It depends on what you mean by “good” and “bad”. For example, if someone were to ask whether AI was good or bad for humanity, one might say that it’s neither good nor bad – it simply exists as an extension of ourselves and our ability to create things with computers. However, if someone asked whether AI would lead to increased unemployment rates among humans due to automation taking over certain jobs traditionally done by humans (such as truck drivers), then yes: this would definitely be considered negative because fewer people would have access to those

LLaMA V2's official prompt template and prompt function:

In [None]:
"""Prompt template.

<s>, </s>: the "bos" and "eos" tokens.
[INST], [/INST]: the beginning and end of the instructions for the model.
<<SYS>>, <</SYS>>: the beginning and end of the system prompt.
{system_prompt}, {user_message}, {model_output}: placeholders for system prompt, user inputs, and model outputs respectively.

Single-turn prompt:

<s>[INST] <<SYS>>
{system_prompt}
<</SYS>>

{user_message} [/INST]

Multi-turn prompt:

<s>[INST] <<SYS>>
{system_prompt}
<</SYS>>

{user_message_1} [/INST] {model_output_1} </s>\
<s>[INST] {user_message_2} [/INST] {model_output_2} </s>\
<s>[INST] {user_message_3} [/INST]

Default system prompt:

"You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should
not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are
socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you
don't know the answer to the question, please don't share false information."
"""

# Global variables for usage.
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
BOS, EOS = "<s>", "</s>"

DEFAULT_SYSTEM_PROMPT = """\
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should
not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are
socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you
don't know the answer to the question, please don't share false information."""




"""
Function that takes in a list of dialogs and returns a string which represents the prompt.

@param dialog: (list[Dict]) list of dictionaries of the dialog.
@returns prompt_buffer: (str) The resultant prompt.
"""
def llama2_prompt_creation(dialog: list[dict]) -> str:
  # Work on a shallow copy so that the caller's list is not modified.
  dialog = list(dialog)

  if dialog[0]['role'] != 'system':
    # Insert the default system prompt as the first message.
    dialog.insert(0, {'role': 'system', 'content': DEFAULT_SYSTEM_PROMPT})

  # Merge the system prompt into the first user message.
  dialog[0:2] = [{'role': dialog[1]['role'], 'content': B_SYS + dialog[0]['content'] + E_SYS + dialog[1]['content']}]

  # Construction of the prompt from the chat history (pairs of user message and model answer).
  prompt_buffer = [f'{BOS}{B_INST} {(prompt["content"]).strip()} {E_INST} {(answer["content"]).strip()} {EOS}'
                   for prompt, answer in zip(dialog[::2], dialog[1::2])]

  if len(dialog) % 2 != 0:
    # Add the last message (the current user input).
    prompt_buffer += [f'{BOS}{B_INST} {(dialog[-1]["content"]).strip()} {E_INST}']

  return ''.join(prompt_buffer)

"""
Function that retrieves the answer from the model outputs.

@param model_outputs: (list[Dict]) the model's outputs.
@returns the model's output.
"""
def answer_extraction(model_outputs):
  return model_outputs[0]['generated_text'].split('[/INST]')[-1].strip()

In [None]:
dialog = [{'role': 'user', 'content': "Hello, how are you?"}]
prompt = llama2_prompt_creation(dialog)
print(prompt)

<s>[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should
not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are 
socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you 
don't know the answer to the question, please don't share false information.
<</SYS>>

Hello, how are you? [/INST]
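
The same template extends to multi-turn conversations. The sketch below is a self-contained, simplified variant of the prompt function above; the name `build_llama2_prompt` and the short system prompt are hypothetical, chosen so the block runs on its own. It assumes the dialog starts with a system message and then alternates user and assistant turns.

```python
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
BOS, EOS = "<s>", "</s>"

def build_llama2_prompt(dialog):
    """Build a LLaMA V2 chat prompt from [system, user, assistant, user, ...]."""
    system, first_user, *rest = dialog
    # Fold the system prompt into the first user message.
    merged = {"role": "user",
              "content": B_SYS + system["content"] + E_SYS + first_user["content"]}
    turns = [merged] + rest
    parts = []
    # Completed (user, assistant) pairs are closed with the EOS token.
    for user, assistant in zip(turns[::2], turns[1::2]):
        parts.append(f"{BOS}{B_INST} {user['content'].strip()} {E_INST} "
                     f"{assistant['content'].strip()} {EOS}")
    # A trailing user message is left open for the model to answer.
    if len(turns) % 2 == 1:
        parts.append(f"{BOS}{B_INST} {turns[-1]['content'].strip()} {E_INST}")
    return "".join(parts)

dialog = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I am fine, thank you."},
    {"role": "user", "content": "Tell me about AI."},
]
prompt = build_llama2_prompt(dialog)
print(prompt)
```

Each finished exchange is wrapped in its own `<s>` ... `</s>` pair, while the final user message ends at `[/INST]` so that the model's continuation becomes the next answer.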


In [None]:
print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model_13b,
    tokenizer=tokenizer_13b,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)

output = pipe(prompt)
answer = answer_extraction(output)
print(output)

*** Pipeline:
[{'generated_text': "<s>[INST] <<SYS>>\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should\nnot include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are \nsocially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you \ndon't know the answer to the question, please don't share false information.\n<</SYS>>\n\nHello, how are you? [/INST]\n"}]
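
To see what the answer extraction returns for an output like this one, the sketch below reimplements the same split-on-`[/INST]` logic in a standalone function (named `extract_answer` here, a hypothetical name) and applies it to two hand-written inputs shaped like the pipeline's return value. These inputs are stand-ins for illustration, not real model generations.

```python
def extract_answer(model_outputs):
    """Return the text after the last [/INST] tag of the first generated sequence."""
    return model_outputs[0]["generated_text"].split("[/INST]")[-1].strip()

# Hand-written stand-ins for pipeline outputs (not real model generations).
with_reply = [{"generated_text": "<s>[INST] Hello, how are you? [/INST] I am fine."}]
empty_reply = [{"generated_text": "<s>[INST] Hello, how are you? [/INST]\n"}]

print(repr(extract_answer(with_reply)))   # 'I am fine.'
print(repr(extract_answer(empty_reply)))  # ''
```

The second case mirrors the output above: only a newline follows `[/INST]`, so the extracted answer is an empty string. That is plausible for a base model, which was not finetuned to follow this chat template.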
