### LLaMA V2 Chat GPTQ models.

Access, usage, and comparisons of the 7B, 13B and 70B GPTQ variants of the LLaMA V2 Chat models.

GPTQ (Generative Pre-trained Transformer Quantization) is a state-of-the-art post-training quantization technique for accurately quantizing pre-trained models. It uses approximate second-order information (the inverse Hessian of each layer's reconstruction error) to choose the quantized values and to adjust the remaining weights so that they compensate for the quantization error.

Specifically, it computes the inverse Hessian via a Cholesky decomposition, which keeps the computation cheap and numerically stable. It then processes the weights in blocks of size B (typically 128), quantizing each block and propagating the resulting error to the weights that have not been quantized yet.
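The procedure described above can be sketched in a few lines of NumPy. This is a simplified illustration rather than the AutoGPTQ implementation: it quantizes one column at a time with no lazy block updates and no activation ordering, and the function names (`make_grid`, `to_grid`, `gptq_sketch`), the damping value and the random calibration data are all assumptions made for the example.

```python
import numpy as np

def make_grid(w_group, bits=4):
    """Per-row asymmetric min/max grid for one group of columns."""
    qmax = 2**bits - 1
    lo = w_group.min(axis=1, keepdims=True)
    hi = w_group.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / qmax
    return scale[:, 0], lo[:, 0]

def to_grid(w_col, scale, lo, bits=4):
    """Round one column to the nearest grid point and map it back to floats."""
    qmax = 2**bits - 1
    q = np.clip(np.round((w_col - lo) / scale), 0, qmax)
    return q * scale + lo

def gptq_sketch(W, X, bits=4, group_size=128, damp=0.01):
    """W: (out_features, in_features) weights, X: (n_samples, in_features) calibration inputs."""
    W = W.astype(np.float64).copy()
    n_in = W.shape[1]
    # Hessian of the layer-wise squared error, with damping for stability (assumed value).
    H = 2.0 * X.T @ X
    H += damp * np.mean(np.diag(H)) * np.eye(n_in)
    # Upper Cholesky factor of the inverse Hessian.
    Hinv = np.linalg.cholesky(np.linalg.inv(H)).T
    Q = np.zeros_like(W)
    for i in range(n_in):
        if i % group_size == 0:
            # A new group starts: fit its grid on the current (already compensated) weights.
            scale, lo = make_grid(W[:, i:i + group_size], bits)
        w = W[:, i]
        q = to_grid(w, scale, lo, bits)
        Q[:, i] = q
        # Spread this column's quantization error over the not-yet-quantized columns.
        err = (w - q) / Hinv[i, i]
        W[:, i + 1:] -= np.outer(err, Hinv[i, i + 1:])
    return Q

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 64)) @ rng.normal(size=(64, 64))  # correlated features
W = rng.normal(size=(8, 64))
Q = gptq_sketch(W, X, bits=4, group_size=32)
print("layer output error:", np.linalg.norm(X @ W.T - X @ Q.T))
```

The `group_size` argument plays the role of the `128g` suffix in the model file names used later in this notebook: every group of columns gets its own scale and zero-point.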

The Chat models were created by first applying supervised fine-tuning (SFT) to the base model(s), which aligns them towards dialogue-style instructions. For this, Meta used 27,540 high-quality annotations from its own vendors.
They then used Reinforcement Learning from Human Feedback (RLHF) for further alignment: human annotators select the better of the model's outputs, and this "feedback" is used to train a reward model, which learns the patterns in the annotators' preferences. Helpfulness and safety were treated separately, so two reward models were trained, one for helpfulness and one for safety.
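To make the reward-model step more concrete, here is a minimal sketch of a pairwise ranking loss, a common formulation for training reward models on human preference pairs. The function name `ranking_loss`, the example scores and the optional `margin` argument are assumptions for illustration; this is not Meta's training code.

```python
import math

def ranking_loss(r_chosen, r_rejected, margin=0.0):
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected - margin)).

    r_chosen / r_rejected are scalar reward scores for the response the
    annotator preferred and the one they rejected.
    """
    d = r_chosen - r_rejected - margin
    # Numerically stable form of log(1 + exp(-d)).
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))

# The loss is small when the preferred response already scores higher...
print(ranking_loss(2.0, 0.0))
# ...and large when the reward model ranks the pair the wrong way round.
print(ranking_loss(0.0, 2.0))
```

Minimizing this loss over many annotated pairs pushes the reward model to score preferred responses above rejected ones; with two reward models, the same loss would be applied separately to helpfulness and safety preference data.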

In [1]:
# To suppress the installation output, uncomment the following line:
#%%capture

"""Installations:

transformers: For access and usage of pretrained models from the Hugging Face Hub.
peft: For usage of parameter-efficient fine-tuning methods (LoRA, for instance).
datasets: For access to datasets from Hugging Face.
accelerate: For easy training of models, configuration of training and testing environments.
bitsandbytes: Package providing 4-bit and 8-bit CUDA functions for PyTorch.
auto-gptq: Package for easy quantization of models via GPTQ.
"""
!pip install transformers peft datasets accelerate bitsandbytes auto-gptq

Collecting transformers
  Downloading transformers-4.31.0-py3-none-any.whl (7.4 MB)
Collecting peft
  Downloading peft-0.4.0-py3-none-any.whl (72 kB)
Collecting datasets
  Downloading datasets-2.14.4-py3-none-any.whl (519 kB)
Collecting accelerate
  Downloading accelerate-0.21.0-py3-none-any.whl (244 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.41.1-py3-none-any.whl (92.6 MB)

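Since the results in this notebook depend on the library versions that pip resolved, it can help to record them. The following is a small sketch using only the standard library; the helper name `installed_versions` is hypothetical, and packages that are not installed are reported as `None`.

```python
from importlib import metadata

PACKAGES = ["transformers", "peft", "datasets", "accelerate", "bitsandbytes", "auto-gptq"]

def installed_versions(packages=PACKAGES):
    """Return {package name: installed version, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

for name, version in installed_versions().items():
    print(f"{name}: {version}")
```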
In [2]:
"""Imports.

os: For interaction with the operating system.
torch: Access of the PyTorch ML framework.
pipeline: Easy way for inference of models.
logging: HuggingFace logging system, for setup of the verbosity of the library.
AutoTokenizer: Automatic creation of a tokenizer for any pretrained model from HuggingFace.
AutoConfig: Automatic configuration for any pretrained model from HuggingFace.
AutoModelForCausalLM: Automatic creation of a model for causal language modeling (next token prediction).
AutoGPTQForCausalLM: Automatic creation of a GPTQ model for causal language modeling (next token prediction).
BaseQuantizeConfig: Configuration of the GPTQ model.
"""
import os
import torch
import torch.nn as nn
import bitsandbytes as bnb

import transformers
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, pipeline, logging

import auto_gptq
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

In [3]:
# Let's check whether a GPU is currently available (a GPU is highly recommended).
torch.cuda.is_available()

True

In [4]:
"""
LLaMA V2: Trained on 2 trillion tokens; 7B, 13B, 34B, and 70B variants (the 34B was not released), in base and chat versions.
The models used in this notebook (7B, 13B, 70B variants) have been quantized via GPTQ.

We will be working with the chat versions of these models, which are the result of SFT followed by RLHF on the base models.
(Reinforcement Learning from Human Feedback, a method used to mitigate unwanted responses from the model in an
attempt to make it more aligned.)
"""
# Specifies which GPU(s) to use if multiple are available.
os.environ["CUDA_VISIBLE_DEVICES"]="0"

# The model's path(s) from HuggingFace.
model_7b_path = "TheBloke/Llama-2-7b-Chat-GPTQ"
model_13b_path = "TheBloke/Llama-2-13B-chat-GPTQ"
model_70b_path = "TheBloke/Llama-2-70B-chat-GPTQ"

# Base name of the weights file for each model (the file name without its extension).
model_7b_basename = "gptq_model-4bit-128g"
model_13b_basename = "gptq_model-4bit-128g"
model_70b_basename = "gptq_model-4bit--1g"

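# Let's load the quantized model(s) via AutoGPTQForCausalLM, together with the respective tokenizer(s).
# NOTE: `revision` names a 32g act-order branch while the basenames above name 128g files, and the
# download log below shows a 128g file being fetched from `main`; keep the two settings in agreement.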
model_7b = AutoGPTQForCausalLM.from_quantized(model_7b_path,
                                              revision="gptq-4bit-32g-actorder_True",
                                              model_basename=model_7b_basename,
                                              use_safetensors=True,
                                              trust_remote_code=True,
                                              device='cuda:0',
                                              use_triton=False,
                                              quantize_config=None)

tokenizer_7b = AutoTokenizer.from_pretrained(model_7b_path, use_fast=True)

model_13b = AutoGPTQForCausalLM.from_quantized(model_13b_path,
                                              revision="gptq-4bit-32g-actorder_True",
                                              model_basename=model_13b_basename,
                                              use_safetensors=True,
                                              trust_remote_code=True,
                                              device='cuda:0',
                                              use_triton=False,
                                              quantize_config=None)

tokenizer_13b = AutoTokenizer.from_pretrained(model_13b_path, use_fast=True)

# Must set 'inject_fused_attention' as False for the 70B variant.
#model_70b = AutoGPTQForCausalLM.from_quantized(model_70b_path,
#                                             model_basename=model_70b_basename,
#                                              inject_fused_attention=False,
#                                              use_safetensors=True,
#                                              trust_remote_code=True,
#                                              device='cuda:0',
#                                              use_triton=False,
#                                              quantize_config=None)

#tokenizer_70b = AutoTokenizer.from_pretrained(model_70b_path, use_fast=True)

Downloading (…)lve/main/config.json:   0%|          | 0.00/572 [00:00<?, ?B/s]

Downloading (…)quantize_config.json:   0%|          | 0.00/185 [00:00<?, ?B/s]

Downloading (…)bit-128g.safetensors:   0%|          | 0.00/3.90G [00:00<?, ?B/s]



Downloading (…)okenizer_config.json:   0%|          | 0.00/727 [00:00<?, ?B/s]

Downloading tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/411 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/624 [00:00<?, ?B/s]

Downloading (…)quantize_config.json:   0%|          | 0.00/185 [00:00<?, ?B/s]

Downloading (…)bit-128g.safetensors:   0%|          | 0.00/7.26G [00:00<?, ?B/s]



Downloading (…)okenizer_config.json:   0%|          | 0.00/727 [00:00<?, ?B/s]

Downloading tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/411 [00:00<?, ?B/s]
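The download log above lists 3.90G and 7.26G for the 7B and 13B weight files. A back-of-the-envelope estimate shows where figures of that order come from. The sketch assumes nominal parameter counts, and a 16-bit scale plus a 4-bit zero-point stored per group; these storage details and the helper name `estimate_weight_gb` are assumptions, and real files come out somewhat larger because some tensors (embeddings, for example) are typically not quantized.

```python
def estimate_weight_gb(n_params, bits=4, group_size=128, scale_bits=16, zero_bits=4):
    """Rough size in GB (1e9 bytes) of the weights, including per-group metadata."""
    bits_per_weight = float(bits)
    if group_size > 0:
        # Each group of `group_size` weights shares one scale and one zero-point.
        bits_per_weight += (scale_bits + zero_bits) / group_size
    # group_size <= 0 (e.g. -1) is treated as having negligible per-group overhead.
    return n_params * bits_per_weight / 8 / 1e9

for name, n_params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    fp16 = estimate_weight_gb(n_params, bits=16, group_size=-1)
    int4 = estimate_weight_gb(n_params, bits=4, group_size=128)
    print(f"{name}: fp16 ~{fp16:.1f} GB, 4-bit/128g ~{int4:.1f} GB")
```

The roughly fourfold reduction is what makes it feasible to load the 7B and 13B models on a single GPU, as done in the cell above.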

Chat Model 7B text-generation testing.

In [7]:
import IPython

"""
Chat Model 7B (GPTQ) text-generation testing.

Uses the Hugging Face Transformers pipeline abstraction for easy model inference.
The pipeline takes a task specifier; in this instance we are using "text-generation" (TextGenerationPipeline).
"""

# Prevent printing spurious transformer error(s) when using pipeline with AutoGPTQ.
logging.set_verbosity(logging.CRITICAL)

prompt="Tell me about AI:"
system_message="You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."

prompt_template=f'''[INST] <<SYS>>
{system_message}
<</SYS>>

{prompt} [/INST]
'''

# Text generation via the transformer's pipeline abstraction.
print("*** PIPELINE:")
pipe = pipeline(
    "text-generation",
    model=model_7b,
    tokenizer=tokenizer_7b,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)
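# NOTE: the raw `prompt` is passed here, so `prompt_template` (the Llama 2 chat format) is unused and the
# recorded output below reads as a plain text continuation. Pass `prompt_template` to query in chat format.
output = pipe(prompt)[0]['generated_text']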
IPython.display.Markdown(output)

*** PIPELINE:


Tell me about AI: What is it, how does it work and what are the benefits of using AI?
Aug 20, 2019 | Artificial Intelligence (AI)
Artificial intelligence (AI) refers to the ability of machines or computer systems to perform tasks that typically require human intelligence. These tasks include learning, problem-solving, decision-making, language processing, and more. Here's a breakdown of what AI is, how it works, and its benefits:
What is AI?
AI is a broad field of research that focuses on creating intelligent machines capable of performing tasks that typically require human intelligence. The goal of AI is to create machines that can learn from data, reason, make decisions, and improve their performance over time.
How does AI work?
There are several approaches to building AI systems, including rule-based systems, machine learning, and deep learning. Rule-based systems use predefined rules to reason and make decisions. Machine learning involves training algorithms to recognize patterns in data and make predictions based on those patterns. Deep learning is a subset of machine learning that uses neural networks to analyze large amounts of data.
Benefits of using AI:
1. Automation: AI can automate repetitive and mundane tasks, freeing up time for more strategic activities.
2. Efficiency: AI can process vast amounts of data faster than humans, making it ideal for applications such as fraud detection and recommendation engines.
3. Personalization: AI can help personalize products and services by analyzing customer data and preferences.
4. Predictive maintenance: AI can predict when equipment or machinery is likely to fail, allowing for proactive maintenance and minimizing downtime.
5. Improved decision-making: AI can provide insights and recommendations to support better decision-making across various industries.
6. Enhanced security: AI-powered systems can detect and prevent cyber threats, protecting sensitive information and systems.
7. Healthcare advancements: AI can assist doctors and medical professionals with diagnosis, treatment, and patient care, leading to improved health outcomes.
8. Natural Language Processing (NLP): AI can understand and generate human language, enabling chatbots, voice assistants, and other NLP applications.
9.
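The recorded output above echoes "Tell me about AI:" and then continues like a web article, which is what one would expect when the raw prompt, rather than the chat-formatted `prompt_template`, is passed to the pipeline. Below is a minimal sketch of a helper that builds the same `[INST] <<SYS>> ... <</SYS>> ... [/INST]` layout used by `prompt_template` in this notebook; the function name `build_llama2_prompt` is hypothetical.

```python
def build_llama2_prompt(user_message, system_message=None):
    """Wrap a single user turn (and optional system message) in the Llama 2 chat markers."""
    if system_message:
        return (
            f"[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n"
            f"{user_message} [/INST]"
        )
    return f"[INST] {user_message} [/INST]"

chat_prompt = build_llama2_prompt(
    "Tell me about AI:",
    system_message="You are a helpful, respectful and honest assistant.",
)
print(chat_prompt)
# In the notebook, this string would be passed to the pipeline instead of the raw
# prompt, e.g. pipe(chat_prompt)[0]['generated_text'].
```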

Chat Model 13B text-generation testing.

In [9]:
logging.set_verbosity(logging.CRITICAL)

prompt="""Tell me about AI:"""
system_message="You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."

prompt_template=f'''[INST] <<SYS>>
{system_message}
<</SYS>>

{prompt} [/INST]
'''

# Text generation via the transformer's pipeline abstraction.
print("*** PIPELINE:")
pipe = pipeline(
    "text-generation",
    model=model_13b,
    tokenizer=tokenizer_13b,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)
# NOTE: as in the 7B cell, the raw `prompt` is passed and `prompt_template` is unused.
output = pipe(prompt)[0]['generated_text']
IPython.display.Markdown(output)

*** PIPELINE:


Tell me about AI: The Somnium Files.
AI: The Somnium Files is a detective adventure game set in a dystopian future where humanity has colonized other planets and artificial intelligence has reached an unprecedented level of sophistication. Players take on the role of a special agent tasked with solving a series of murders that have taken place across the galaxy, all of which seem to be connected to a mysterious figure known only as "The Architect."

As players explore the world and investigate the crimes, they'll encounter a wide variety of characters, each with their own unique personalities and motivations. The game features a branching storyline that changes based on the player's choices, allowing for multiple possible endings and plenty of replayability.

One of the key mechanics in AI: The Somnium Files is the ability to enter into the minds of suspects and witnesses through a process called "neuromodeling." This allows players to experience the memories and thoughts of others firsthand, giving them valuable information and insight into the case.

Another important aspect of the game is the use of advanced AI technology, such as neural networks and machine learning algorithms, to help solve puzzles and analyze evidence. Players will need to use these tools to piece together clues and uncover the truth behind the murders.

Overall, AI: The Somnium Files offers a gripping narrative, engaging gameplay, and a thought-provoking exploration of the intersection between humans and technology. It's a must-play for fans of sci-fi, mystery, and adventure games alike.
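Both pipeline calls set `temperature=0.7` and `top_p=0.95`, which is part of why the two models wander off in different directions from the same prompt. The sketch below illustrates what these two parameters do to a next-token distribution. It is a simplified, pure-Python illustration with made-up logits, not the Transformers implementation, and it leaves out `repetition_penalty`; the function name `top_p_distribution` is hypothetical.

```python
import math
import random

def top_p_distribution(logits, temperature=0.7, top_p=0.95):
    """Apply temperature scaling, then keep the smallest set of tokens whose
    cumulative probability reaches top_p, and renormalize."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Visit tokens from most to least likely until the nucleus is full.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    mass = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered

example_logits = [2.0, 1.0, 0.0, -3.0]  # made-up scores for four candidate tokens
dist = top_p_distribution(example_logits)
token = random.choices(range(len(dist)), weights=dist, k=1)[0]
print(dist, "sampled token index:", token)
```

Lower temperatures sharpen the distribution and smaller `top_p` values shrink the set of candidate tokens, so both make generation more deterministic.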

Let's get the original LLaMA V2 chat models and compare them with the GPTQ versions.

Note: You need to log in to Hugging Face to access the official LLaMA 2 models, since they are not actually open-source and are more like "open access".

In [4]:
# Logging into HuggingFace on notebook.
from huggingface_hub import notebook_login
notebook_login()

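Outside an interactive notebook, the login widget is not available. One alternative is to read an access token from an environment variable and pass it to `huggingface_hub.login`. The sketch below assumes the token is stored in a variable named `HF_TOKEN` (the name is a choice made for this example) and the helper names are hypothetical; the import is deferred so that the token check can run without the library installed.

```python
import os

def get_hf_token(env_var="HF_TOKEN"):
    """Return the access token stored in the given environment variable, or None."""
    token = os.environ.get(env_var, "").strip()
    return token or None

def login_from_env(env_var="HF_TOKEN"):
    """Log in to the Hugging Face Hub non-interactively using a token from the environment."""
    token = get_hf_token(env_var)
    if token is None:
        raise RuntimeError(f"Set the {env_var} environment variable to an access token first.")
    from huggingface_hub import login  # imported lazily; requires huggingface_hub
    login(token=token)

print("token found" if get_hf_token() else "no token in environment")
```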

In [5]:
# The model's path(s) from HuggingFace.
orig_model_7b_path = "meta-llama/Llama-2-7b-chat-hf"
orig_model_13b_path = "meta-llama/Llama-2-13b-chat-hf"
orig_model_70b_path = "meta-llama/Llama-2-70b-chat-hf"

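# Loading of the model(s) and its respective tokenizer(s). Creation of the original model(s).
# NOTE: the repository IDs hard-coded below ("...-hf" without "-chat") are the base checkpoints, not the chat
# paths defined above; pass `orig_model_7b_path` / `orig_model_13b_path` to load the chat models instead.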
orig_model_7b = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",
                                                     torch_dtype=torch.float16,
                                                     load_in_4bit=True,
                                                     device_map='auto')

orig_7b_tokenizer = AutoTokenizer.from_pretrained(orig_model_7b_path, use_fast=True)

orig_model_13b = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-hf",
                                                      torch_dtype=torch.float16,
                                                      load_in_4bit=True,
                                                      device_map='auto')

orig_13b_tokenizer = AutoTokenizer.from_pretrained(orig_model_13b_path, use_fast=True)

#orig_model_70b = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-70b-hf",
#                                                      torch_dtype=torch.float16,
#                                                      load_in_4bit=True,
#                                                     device_map='auto')

#orig_70b_tokenizer = AutoTokenizer.from_pretrained(orig_model_70b_path, use_fast=True)

Downloading (…)lve/main/config.json:   0%|          | 0.00/609 [00:00<?, ?B/s]

Downloading (…)fetensors.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading (…)of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

Downloading (…)of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading (…)neration_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/776 [00:00<?, ?B/s]

Downloading tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/610 [00:00<?, ?B/s]

Downloading (…)fetensors.index.json:   0%|          | 0.00/33.4k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading (…)of-00003.safetensors:   0%|          | 0.00/9.95G [00:00<?, ?B/s]

Downloading (…)of-00003.safetensors:   0%|          | 0.00/9.90G [00:00<?, ?B/s]

Downloading (…)of-00003.safetensors:   0%|          | 0.00/6.18G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading (…)neration_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/776 [00:00<?, ?B/s]

Downloading tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]

Original Chat Model 7B text-generation testing.

In [10]:
import IPython

"""
Original Chat Model 7B text-generation testing.

Uses the Hugging Face Transformers pipeline abstraction for easy model inference.
The pipeline takes a task specifier; in this instance we are using "text-generation" (TextGenerationPipeline).
"""

# Restrict Transformers logging to critical messages while using the pipeline.
logging.set_verbosity(logging.CRITICAL)

prompt="Tell me about AI:"
system_message="You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."

prompt_template=f'''[INST] <<SYS>>
{system_message}
<</SYS>>

{prompt} [/INST]
'''

# Text generation via the transformer's pipeline abstraction.
print("*** PIPELINE:")
pipe = pipeline(
    "text-generation",
    model=orig_model_7b,
    tokenizer=orig_7b_tokenizer,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)
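# NOTE: the raw `prompt` is passed here and `prompt_template` is unused, so the model simply continues the text.
output = pipe(prompt)[0]['generated_text']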
IPython.display.Markdown(output)

*** PIPELINE:


Tell me about AI:
 Unterscheidung zwischen „künstlicher Intelligenz“ und „Künstlichkeit“. In: Frankfurter Allgemeine Zeitung. 2018, abgerufen am 3. Januar 2019 (Artikelanfang).

Original Chat Model 13B text-generation testing.

In [11]:
logging.set_verbosity(logging.CRITICAL)

prompt="Tell me about AI:"
system_message="You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."

prompt_template=f'''[INST] <<SYS>>
{system_message}
<</SYS>>

{prompt} [/INST]
'''

# Text generation via the transformer's pipeline abstraction.
print("*** PIPELINE:")
pipe = pipeline(
    "text-generation",
    model=orig_model_13b,
    tokenizer=orig_13b_tokenizer,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)
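# NOTE: the raw `prompt` is passed here and `prompt_template` is unused, so the model simply continues the text.
output = pipe(prompt)[0]['generated_text']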
IPython.display.Markdown(output)

*** PIPELINE:


Tell me about AI:
AI is a technology that can be used to create intelligent machines. It’s also known as machine learning, which means it uses algorithms and data to learn from experience. The goal of this type of software development is not just making computers smarter but also helping them understand human behavior better so they can interact with us more effectively in our daily lives (e.g., Siri).
What are the benefits of using AI?
The benefits of using AI include increased productivity, improved customer service, and reduced costs. For example, if you have an online store or website where customers can purchase products directly through your site without having any interaction with another person at all times during their shopping experience then this would be considered “automated” because there isn’t anyone physically present who needs assistance when placing orders for goods on behalf of others via computer screens instead!
How does AI work?
Artificial intelligence works by analyzing large amounts of data and finding patterns within it. This process allows computers to make decisions based on what they know rather than relying solely on humans for guidance or direction like we do today with our own brains!
Why should I use AI?
You should use AI because it will help you save time and money while improving your business processes. You can also use AI to improve customer service, increase sales conversions rates, reduce costs associated with hiring new employees or outsourcing tasks outside of the company’s core competencies.
Is AI safe?
Yes, AI is safe. However, there are some risks involved in using artificial intelligence technologies such as cybersecurity threats and privacy concerns. These issues need to be addressed before implementing these systems into your organization’s operations or workflows so that you don’t run into problems later down the road when trying to implement them yourself!
Can AI replace humans?
No, AI cannot replace humans. Humans are still needed to provide input and feedback to the system. In addition, humans are required for certain tasks that require creativity or intuition (such as designing user interfaces).
Will AI take my job away?
No, AI won’t take your job away. But it may change how you do things at work. If you’re worried about losing your job due to automation, here are some tips on how best prepare yourself for this shift in employment opportunities:
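To compare the GPTQ and original models more systematically than by reading four separate cells, the generations can be collected side by side. The helper below is a hypothetical sketch: `compare_generations` takes a dictionary of callables that map a prompt to generated text, and the lambdas shown are stand-ins for the notebook's pipelines so that the example runs without a GPU.

```python
def compare_generations(generators, prompt):
    """Run the same prompt through several generators and return {name: completion}.

    Each generator is a callable taking the prompt and returning the generated
    text; if the text echoes the prompt, the echo is stripped.
    """
    results = {}
    for name, generate in generators.items():
        text = generate(prompt)
        if text.startswith(prompt):
            text = text[len(prompt):]
        results[name] = text.strip()
    return results

# Stand-in generators; in the notebook these could wrap the pipelines, e.g.
# lambda p: pipe_7b(p)[0]['generated_text'] (pipe_7b being a hypothetical name).
fake_generators = {
    "7B GPTQ": lambda p: p + " What is it, how does it work?",
    "7B original": lambda p: "AI is a field of computer science.",
}

for name, completion in compare_generations(fake_generators, "Tell me about AI:").items():
    print(f"--- {name} ---\n{completion}\n")
```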
