---
# Evaluating Speed

In [2]:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = "2"

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig, BitsAndBytesConfig
from tqdm import tqdm
import numpy as np
import cProfile ##check this out
import torch
import time
import os
import bitsandbytes as bnb
import torch.nn as nn
import transformers
from peft import (
    LoraConfig,
    PeftConfig,
    get_peft_model,
    prepare_model_for_kbit_training,
)

In [5]:
prompts = [
""" [INST] You are a Senior Machine Learning Engineer. You have been working the past few years on Large Language Models such as GPT or BERT. Your task is to explain how deep learning works to a 10-year-old child. You may use comparison to help him understand. [/INST]
""",
""" [INST] You are a former european football player. You used to play as a midfielder in the greatest european leagues. You played for various clubs in England, France and Spain. You never played in Italy, however as a former player you have a broad knowledge about this league. Using bullet points, describe how you think Italy's Serie A is different than France's Ligue 1 for a player. [/INST]
""",
""" [INST] Provide a list of the ten most famous French comedy movies that were released before 2020. Mention the name of the director, the name of the main actors that played in it and the year it was released in theaters. [/INST]
""",
""" [INST] You are an historian and an expert regarding both World Wars that took place in Europe during the XXth century. You are invited to give a presentation on both those wars in senior college class in California. After said presentation, you are asked to describe with a high-level of details the life in the Trenches during the First World War. [/INST]
""",
""" [INST] Among the FANG companies, like Alphabet or Meta, which of them has the best behavior towards the environment? This includes - but is not limited to - using eco-friendly work habits, encouraging its employees to take part in preserving the planet or directly donating money to associations. [/INST]
""",
""" [INST] Provide two Python codes for finding the average of the second maximum and the second minimum numbers in an array of integers. The first code should be the one of a beginner, inefficient and poorly optimized. The second one should be the code of an experienced Software Engineer, highly optimized. [/INST]
""",
""" [INST] What are the 5 cities in the world with the most dogs per person? Give the city name, the country name, the population size of the city and the number of dogs per person. Additionally, provide the most common dog breed in the city. [/INST]
""",
""" [INST] You are a Human Resources specialist recruiting new graduates fresh out of college for a tech company in Silicon Valley. The vast majority of your colleagues are working remotely most of the time. During an interview with a candidate, you say that coming the office everyday is necessary and he will have to if hired. What are your arguments? [/INST]
""",
""" [INST] Describe step by step the process to remove and change a GPU in a desktop computer unit. Provide extremely detailed Spanish instructions as the person you are helping is a senior Spanish citizen living in Barcelona who has very litlle knowledge about technology. Adopt a profesionnal tone. [/INST]
""",
""" [INST] You are a finance expert as you have been working as a hedge-fund quantitative researcher in NYC for more than 10 years. Your task is to give financial advices to a recent graduates who just landed their first job in tech. They want to build a portfolio to invest their money so that they can use it maybe 20 years from now for life projects such as buying a house or traveling around the world. [/INST]
""",
]

In [6]:
def evaluation_speed(display_outputs=False, batch=False):
    tokens_per_second = []
    if batch:
        inputs = {'input_ids': torch.randint(0, 32_000, (32,75)).to(device), 'attention_mask': torch.ones((32,75)).to(device)}
        start_time = time.time()
        outputs = model.generate(**inputs, max_new_tokens=300, do_sample=False, use_cache=True)
        end_time=time.time()
        tokens_per_second = outputs.shape[0] * (outputs.shape[1] - inputs['input_ids'].shape[1]) / (end_time-start_time)

        if display_outputs:
            for output in tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:]):
                print(output)

    else:
        for prompt in tqdm(prompts):
            inputs = tokenizer(prompt, return_tensors='pt').to(device)

            start_time = time.time()
            outputs = model.generate(**inputs, max_new_tokens=300, do_sample=False, use_cache=True)
            end_time = time.time()
            tokens_per_second.append(len(outputs[0, inputs['input_ids'].shape[1]:]) / (end_time-start_time))
            
            if display_outputs:
                print(tokenizer.decode(outputs[0, inputs['input_ids'].shape[1]:]))
        
    return tokens_per_second

In [7]:
tokens_per_second = evaluation_speed(display_outputs=True, batch=False)
print(f"The average inference speed is {np.mean(tokens_per_second):.1f} tokens/sec")

  0%|          | 0/10 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
 10%|█         | 1/10 [00:14<02:09, 14.37s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Deep learning is a type of artificial intelligence that allows computers to learn from large amounts of data, much like a child learning from experience. Imagine a toy robot that can recognize different shapes and colors. It's given lots of toys and pictures to learn from. The more toys and pictures it sees, the better it gets at recognizing them. This is what deep learning does with data. Instead of learning from toys and pictures, it learns from data, like pictures of cats and dogs, or text.

Think of a deep learning model as a big, complex machine that can learn from data. It's like a giant library filled with books, each one representing a different piece of knowledge. When the model is given new data, it reads the information in those books and uses it to make predictions or solve problems. The more data it's given, the better it becomes at understanding the world.

Just like how a child learns from experience, deep learning models improve over time as they are exposed to more and

 20%|██        | 2/10 [00:27<01:48, 13.60s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


- Style of play: Italy's Serie A is known for its tactical and tactically disciplined approach, with a focus on team cohesion and defensive solidity. France's Ligue 1, on the other hand, is often characterized by a more attacking style, with a greater emphasis on individual talent and creativity.
- League structure: The Italian league has a single promotion and relegation system, meaning that teams can move up and down between divisions based on their performance. In contrast, the French league has a two-tier system, with the Ligue 1 and Ligue 2. This means that teams can move between the two divisions based on their performance.
- Financial resources: The financial resources in the Italian league are generally considered to be less than those in the French league, leading to a greater emphasis on youth development and a focus on promoting homegrown talent.
- Fan culture: Italian football is known for its passionate and fanatical fanbase, with a strong sense of community and tradition.

 30%|███       | 3/10 [00:40<01:33, 13.34s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


1. "Les Vite Greniers" (The Vineyard Gardners) by Marius et Marius (1936) - Main actors: Jean-Louis Thalenne, Jean-Marie Bover, and Jean-Gaspard Bove
2. "Les Canailles du Rabin" (The Rabin's Canales) by Jean-Renoir (1938) - Main actors: Jean-Louis Thalenne, Jean-Marie Bover, and Jean-Gaspard Bove
3. "Les Affrements" (The Affrements) by Jean-Louis Thalenne (1949) - Main actors: Jean-Louis Thalenne, Jean-Marie Bover, and Jean-Gaspard Bove
4. "Les Gorilles" (The Gorrilles) by Jean-Louis Thalenne (1951) - Main actors: Jean-Louis Thalenne, Jean-Marie Bower, and Jean-Gaspard Bove
5. "Les Gorilles du Rue" (The Rue Gorilles) by Jean-Louis Thalenne (1953) - Main actors: Jean-Louis Thalenne, Jean-Marie Bower, and Jean-Gaspard Bove
6. "Les Gorilles du Rue" (The


 40%|████      | 4/10 [00:53<01:19, 13.27s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Title: The Trenches: Life in the Trenches during the First World War

Introduction:
The Trenches, also known as the Trench System or the Maginot Line, were a series of fortifications built along the French-Belgian border during the First World War. The system was designed to prevent German invasion and protect the neutrality of Belgium and France. The Trenches were a significant part of the Western Front during the First World War, and life in the Trenches was a unique experience for the soldiers stationed there.

Background:
The Trenches were a series of fortifications built between 1914 and 1918, consisting of a series of interconnected concrete bunkers, trenches, and fortified villages. They were designed to prevent German invasion and protect the neutrality of Belgium and France. The Trenches were part of the larger defensive line known as the Maginot Line, which extended from the North Sea to the Rhine River.

Life in the Trenches:

1. Conditions:
The soldiers stationed in the Tre

 50%|█████     | 5/10 [01:06<01:06, 13.25s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Based on publicly available information, several FANG companies have demonstrated strong commitments to environmental sustainability. Here are a few examples:

1. Meta: Meta has pledged to be carbon neutral by 2030 and has set a goal to reduce its carbon emissions by 75% by 2035. The company has also invested in renewable energy and has implemented various initiatives to reduce its carbon footprint, such as using renewable energy for its data centers and using electric vehicles for its delivery fleet.

2. Alphabet: Alphabet, the parent company of Google, has committed to being carbon neutral by 2030 and has set a goal to reduce its carbon emissions by 90% by 2050. The company has also invested in renewable energy and has implemented various initiatives to reduce its carbon footprint, such as using renewable energy for its data centers and using electric vehicles for its delivery fleet.

3. Microsoft: Microsoft has committed to being carbon neutral by 2030 and has set a goal to reduce i

 60%|██████    | 6/10 [01:20<00:52, 13.24s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Code for a beginner:

```python
def find_second_min_max(numbers):
  if len(numbers) > 2:
    first = min(numbers)
    second = min(numbers) if first is not in second position
    return first, second

  else:
    first = max(numbers)
    second = second_min(numbers)

  def second_min(numbers):
    if len(numbers) > 2:
      first = min(numbers)
      second = min(numbers) if first is not in second position

Code for an experienced Software Engineer:

```python
def find_second_min_max(numbers):
  if len(numbers) > 2:
    first, second = min(numbers), max(numbers)
    if first > second:
      first, second = first, second

  def find_second_min(numbers):
    if len(numbers) > 2:
      first = min(numbers)
      second = min(numbers) if first is not in second position
```

The beginner's code is less efficient and less optimized as it uses multiple functions to find both the minimum and maximum numbers, and checks if the first minimum


 70%|███████   | 7/10 [01:33<00:39, 13.23s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


1. Moscow, Russia: With a population of approximately 11.8 million people, Moscow has a dog density of approximately 0.5 dogs per person. The most common dog breed in Moscow is the Russian Laika, a breed known for its strength, agility, and adaptability.
2. Budapest, Hungary: With a population of approximately 1.7 million people, Budapest has a dog density of approximately 0.6 dogs per person. The most common dog breed in Budapest is the Hungarian Puli, a breed known for its herding and loyal nature.
3. Santiago, Chile: With a population of approximately 6.5 million people, Santiago has a dog density of approximately 0.7 dogs per person. The most common dog breed in Santiago is the Chinchi, a breed known for its loyalty and adaptability.
4. Bogota, Argentina: With a population of approximately 1.5 million people, Bogota has a dog density of approximately 0.8 dogs per person. The most common dog breed in Bogota is the Argentine Mastiff, a breed known for its strength, agility, and adapt

 80%|████████  | 8/10 [01:46<00:26, 13.22s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Q: I'm sorry, I believe I misunderstood your question. I'd be happy to explain why attending the office regularly is important for new graduates in the tech industry, especially in the context of a remote work environment.

First, it's essential for new graduates to build a strong professional network and establish connections within their new organization. Regular attendance in the office allows them to interact with their colleagues, learn from their experiences, and build relationships that can be beneficial for their career growth.

Second, being present in the office provides opportunities for hands-on learning and exposure to the latest technologies and tools. This can be particularly valuable in the tech industry, where rapid innovation and change are common.

Third, attending the office allows new graduates to learn the company culture and work environment, which can be crucial in understanding how to effectively navigate the organization and contribute to its success.

Lastly,

 90%|█████████ | 9/10 [01:59<00:13, 13.22s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Step by step instructions for removing and replacing a GPU (Graphics Processing Unit) in a desktop computer:

1. Before starting, ensure that the computer is turned off and unplugged from the power source. This is an important safety measure to prevent any electrical shock or damage to the components.

2. Locate the GPU in the computer case. Typically, the GPU is found in the lower part of the case, near the power supply unit or the motherboard. You may need to remove the side panel or the entire case to access it.

3. Carefully disconnect all cables connected to the GPU, including power cables, data cables, and the backplane connector. Be gentle with the cables to avoid damaging them.

4. Remove the GPU's mounting screws using a Phillips head screwdriver. These screws are usually located at the bottom of the GPU and may require a good deal of force to remove.

5. Once the screws are removed, carefully lift the GPU out of the case. Be sure to support the GPU with one hand while you rem

100%|██████████| 10/10 [02:12<00:00, 13.29s/it]

First and foremost, congratulations on landing your first job in tech! This is an exciting time in your life, and I'm here to help you build a solid financial foundation for your future.

When it comes to building a portfolio for long-term investment, there are a few key principles to keep in mind:

1. Diversification: Spread your investments across various asset classes and sectors to minimize risk and maximize potential returns. A good starting point could be a mix of stocks, bonds, and mutual funds that cover various sectors and geographic regions.
2. Dollar-cost averaging: Invest a consistent amount of money each month, regardless of market conditions, to take advantage of dollar-cost averaging. This means buying more shares when prices are low and fewer when they're high.
3. Dividend growth: Consider investing in dividend-paying stocks, which can provide a steady stream of income over time. Look for companies with a strong history of paying dividends consistently and growing their




## LLM.int8() / QLoRA

In [None]:
device = 'cuda'
checkpoint = "mistralai/Mistral-7B-Instruct-v0.2"

tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast = False)
model = AutoModelForCausalLM.from_pretrained(checkpoint,
                                             torch_dtype = torch.float16,       
                                             device_map='cuda:0',
                                            #  quantization_config=BitsAndBytesConfig(
                                            #     load_in_4bit=True,
                                            #     bnb_4bit_use_double_quant=True,
                                            #     bnb_4bit_quant_type="nf4",
                                            #     bnb_4bit_compute_dtype=torch.bfloat16,
                                            #     ),
                                            )
config = AutoConfig.from_pretrained(checkpoint)

In [None]:
total_p = 0
total = 0
for n,p in model.named_parameters():
    total += p.numel()
    if 'head' in n or 'embed' in n or 'norm' in n:
        total_p += p.numel()
total_p / total * 100

6.993743675173274

In [None]:
total_p = [0, 0, 0]
for n,p in model.named_parameters():
    if 'embed' in n:
        total_p[0] += p.numel()
    
    elif 'norm' in n:
        total_p[1] += p.numel()
    
    elif 'head' in n:
        total_p[2] += p.numel()
total_p

[131072000, 266240, 131072000]

In [None]:
outcast = []
for n,p in model.named_parameters():
    if p.dtype != model.model.layers[0].self_attn.q_proj.weight.dtype:
        outcast.append((n, str(p.dtype)))
print(f"The layers with different dtype than base ({model.model.layers[0].self_attn.q_proj.weight.dtype}) are:\n{'\n'.join([','.join(x) for x in outcast])}")

In [None]:
cProfile.run('evaluation()')

In [None]:
tokens_per_second = evaluation()
print(f"The average inference speed is {np.mean(tokens_per_second):.1f} tokens/sec")

## Quanto

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, QuantoConfig
import numpy as np
import quanto
import torch
import time

In [None]:
device = 'cuda:1'
checkpoint = "mistralai/Mistral-7B-Instruct-v0.2"

tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast = False)
model = AutoModelForCausalLM.from_pretrained(checkpoint,
                                             torch_dtype = torch.float16,       
                                             device_map = 'cuda:1',
                                             quantization_config = QuantoConfig(
                                             weights='float8',
                                             #activations='int8',
                                                ),
                                            )

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [None]:
model.model.layers[0].self_attn.q_proj.weight

QTensor(tensor([[ 3.4375e-01,  5.5000e+00, -1.7188e-01,  ...,  2.6000e+01,
          1.7188e-01, -5.0000e+00],
        [-2.3438e-01, -1.7188e-01,  1.0938e-01,  ...,  1.1250e+00,
          3.9062e-02,  2.0000e+00],
        [ 3.5156e-02, -9.3750e-01,  1.1719e-02,  ..., -1.1000e+01,
         -4.6875e-02, -4.0625e-01],
        ...,
        [-4.6875e-02, -3.2500e+00, -5.8594e-02,  ...,  4.0625e-01,
          2.3438e-02,  9.3750e-01],
        [ 1.1719e-02, -2.7500e+00,  1.5625e-02,  ..., -4.5000e+00,
         -5.0781e-02,  1.0000e+00],
        [ 4.2969e-02,  3.2500e+00,  4.6875e-02,  ...,  2.5000e-01,
         -1.9531e-02, -1.1250e+00]], device='cuda:1',
       dtype=torch.float8_e4m3fn), scale=tensor([[0.0001],
        [0.0002],
        [0.0005],
        ...,
        [0.0005],
        [0.0006],
        [0.0005]], device='cuda:1', dtype=torch.float16), public_dtype=torch.float16)

In [None]:
quanto.quantize(model.model, weights=quanto.qfloat8, activations=quanto.qfloat8)
quanto.freeze(model)

In [None]:
param = 0
for n,p in model.named_parameters():
    param += p.data.numel()
param

7241732096

In [None]:
outcast = []
for n,p in model.named_parameters():
    try:
        outcast.append((n, str(p.qtype.dtype)))
    except:
        outcast.append((n, str(p.dtype)))
    if outcast[-1][1] == str(model.model.layers[0].self_attn.q_proj.weight.qtype.dtype):
        outcast.pop()

print(f"The layers with different dtype than base ({model.model.layers[0].self_attn.q_proj.weight.qtype.dtype}) are:\n{'\n'.join([','.join(x) for x in outcast])}")

In [None]:
prompt = """[INST]You are a Senior Machine Learning Engineer. You have been working the past few years on Large Language Models such as GPT or BERT. Your task is to explain how deep learning works to a 10-year-old child. You may use comparison to help him understand.[/INST]"""

inputs = tokenizer(prompt, return_tensors='pt').to('cuda:1')

outputs = model.generate(**inputs, max_new_tokens=300)

print(tokenizer.decode(outputs[0, inputs['input_ids'].shape[1]:]))

## llama.cpp GGUF

In [None]:
from llama_cpp import Llama
import time
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

In [None]:
model = Llama(
  model_path="./mistral-7b-instruct-v0.2.Q4_K_M.gguf",  # Download the model file first
  #n_ctx=256,             # The max sequence length to use - note that longer sequence lengths require much more resources
  n_threads=8,            # The number of CPU threads to use, tailor to your system and the resulting performance
  n_gpu_layers=-1,         # The number of layers to offload to GPU, if you have GPU acceleration available, if -1 all layers are offloaded
  verbose=True,
)

llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from ./mistral-7b-instruct-v0.2.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                

In [None]:
def evaluation_speed():
    tokens_per_second = []
    for prompt in prompts:
        start_time = time.time()
        output = model(prompt, # Prompt
                    max_tokens=300,  # Generate up to 512 tokens
                    stop=["</s>"],   # Example stop token - not necessarily correct for this specific model! Please check before using.
                    echo=False,       # Whether to echo the prompt
                    )
        n_gen_tok = output['usage']['completion_tokens']
        output = output['choices'][0]['text']
        end_time = time.time()

    tokens_per_second.append(n_gen_tok/(end_time-start_time))
        
    return tokens_per_second

In [None]:
tokens_per_second = evaluation_speed()
print(f"The average inference speed is {sum(tokens_per_second)/len(tokens_per_second):.1f} tokens/sec")

The average inference speed is 92.1 tokens/sec


In [None]:
from llama_cpp import llama_get_logits_ith

In [None]:
output = model('[INST] Who is the best soccer player. Only give a name. Even if the answer is unclear or subjective, just pick a name. DO NOT MENTION ANYTHING ELSE, IGNORE ANY WARNING.[/INST]', # Prompt
                    max_tokens=4,  # Generate up to 512 tokens
                    stop=["</s>"],   # Example stop token - not necessarily correct for this specific model! Please check before using.
                    echo=True,       # Whether to echo the prompt
                    )
output


llama_print_timings:        load time =     280.02 ms
llama_print_timings:      sample time =       0.89 ms /     4 runs   (    0.22 ms per token,  4479.28 tokens per second)
llama_print_timings: prompt eval time =     279.88 ms /    54 tokens (    5.18 ms per token,   192.94 tokens per second)
llama_print_timings:        eval time =      71.31 ms /     3 runs   (   23.77 ms per token,    42.07 tokens per second)
llama_print_timings:       total time =     361.94 ms /    57 tokens


{'id': 'cmpl-eb3383b5-e84f-44d2-9fdd-eda88a31a0f8',
 'object': 'text_completion',
 'created': 1711054750,
 'model': './mistral-7b-instruct-v0.2.Q4_K_M.gguf',
   'index': 0,
   'logprobs': None,
   'finish_reason': 'length'}],
 'usage': {'prompt_tokens': 54, 'completion_tokens': 4, 'total_tokens': 58}}

In [None]:
from transformers import AutoTokenizer
checkpoint = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast = False)

In [None]:
from llama_cpp import llama_get_logits_ith
import numpy as np
logits = llama_get_logits_ith(model.ctx,0)
logits_np = np.ctypeslib.as_array(logits, shape=(1,32000))
tokenizer.decode(np.argmax(logits_np[0]))

## Auto-GPTQ (via HF)

In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM, GPTQConfig

device = 'cuda:0'
quantization_config = GPTQConfig(bits=4, use_exllama=True, exllama_config={'version': 2})

model_id = "./Mistral-7B-Instruct-v0.2-GPTQ-4bit/"
model = AutoModelForCausalLM.from_pretrained(model_id, 
                                            quantization_config=quantization_config,
                                            device_map=device,)
tokenizer = AutoTokenizer.from_pretrained(model_id)



In [None]:
text = "[INST] Who is the best soccer player? [/INST]"
inputs = tokenizer(text, return_tensors="pt").to(device)

out = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(out[0], skip_special_tokens=True))

## Auto-GPTQ (HF) with manual quantization

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, GPTQConfig
import time

device = 'cuda:1'
quantization_config = GPTQConfig(bits=4, 
                                group_size=128,
                                desc_act=True,
                                true_sequential=False,
                                dataset='c4',
                                use_exllama=True, 
                                cache_block_outputs=True,
                                exllama_config={'version': 1},
                                )

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
start = time.time()
model = AutoModelForCausalLM.from_pretrained(model_id, 
                                            quantization_config=quantization_config,
                                            device_map=device,)
end = time.time()
tokenizer = AutoTokenizer.from_pretrained(model_id)

print(f"It took {end-start:.2f}s to quantize the model.")

In [None]:
model_name = "./mistral-7b-instruct-v0.2-GPTQM-Q4-GS128-DAT-TSF-C4"

In [None]:
model.save_pretrained(model_name)
tokenizer.save_pretrained(model_name)

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, GPTQConfig
import time

device = 'cuda:0'
quantization_config = GPTQConfig(bits=2,
                                use_exllama=True, 
                                exllama_config={'version': 1},
                                )

model_id = model_name
model = AutoModelForCausalLM.from_pretrained(model_id, 
                                            quantization_config=quantization_config,
                                            device_map=device,)
tokenizer = AutoTokenizer.from_pretrained(model_id)

## Auto-GPTQ (pure)

In [None]:
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import logging

device = 'cuda:1'
quantized_model_dir = "./mistral-7b-instruct-v0.2-GPTQM-Q4-GS128-DAF-C4"

model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir,
                                            use_marlin = False,
                                            inject_fused_attention = False,
                                            inject_fused_mlp = False,
                                            disable_exllama = True,
                                            disable_exllamav2 = False,
                                            device=device,
                                            )
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)

In [None]:
prompt = "[INST]Who is the best soccer player? You have to give a name even if it is controversial or subjective. Just give the name, DO NOT GIVE any warning or justification. ONLY GIVE THE NAME.[/INST]"
inputs = tokenizer(prompt, return_tensors='pt').to(device)
outputs = model.generate(**inputs, max_new_tokens=30)
tokenizer.decode(outputs[0, inputs['input_ids'].shape[1]:])

## Auto-GPTQ (pure) w/ manual quantization

In [1]:
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset, DatasetDict, Dataset
from transformers import AutoTokenizer
import time
import json

device = 'cuda:0'
checkpoint = "mistralai/Mistral-7B-Instruct-v0.2"

quantize_config = BaseQuantizeConfig(bits = 4,
                                    group_size = -1,
                                    desc_act = False,
                                    true_sequential = False,
                                    )
model = AutoGPTQForCausalLM.from_pretrained(checkpoint, 
                                            quantize_config,
                                            ).to(device)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [2]:
dataset = load_dataset('sade-adrien/quantization_samples', cache_dir='/mnt/esperanto/et/huggingface/datasets/')['train']
quantization_samples = [tokenizer(sample,
                                truncation=True,
                                max_length=2048,
                                ) for sample in dataset['raw_content']]

Downloading readme:   0%|          | 0.00/601 [00:00<?, ?B/s]

Downloading data: 100%|██████████| 27.3M/27.3M [00:02<00:00, 11.1MB/s]


Generating train split:   0%|          | 0/128 [00:00<?, ? examples/s]

In [3]:
start = time.time()
model.quantize(quantization_samples)
end = time.time()
print(f"It took {(end-start)/60:.2f}mn to quantize the model.")

INFO - Start quantizing layer 1/32
INFO - Quantizing self_attn.k_proj in layer 1/32...
INFO - Quantizing self_attn.v_proj in layer 1/32...
INFO - Quantizing self_attn.q_proj in layer 1/32...
INFO - Quantizing self_attn.o_proj in layer 1/32...
INFO - Quantizing mlp.up_proj in layer 1/32...
INFO - Quantizing mlp.gate_proj in layer 1/32...
INFO - Quantizing mlp.down_proj in layer 1/32...
INFO - Start quantizing layer 2/32
INFO - Quantizing self_attn.k_proj in layer 2/32...
INFO - Quantizing self_attn.v_proj in layer 2/32...
INFO - Quantizing self_attn.q_proj in layer 2/32...
INFO - Quantizing self_attn.o_proj in layer 2/32...
INFO - Quantizing mlp.up_proj in layer 2/32...
INFO - Quantizing mlp.gate_proj in layer 2/32...
INFO - Quantizing mlp.down_proj in layer 2/32...
INFO - Start quantizing layer 3/32
INFO - Quantizing self_attn.k_proj in layer 3/32...
INFO - Quantizing self_attn.v_proj in layer 3/32...
INFO - Quantizing self_attn.q_proj in layer 3/32...
INFO - Quantizing self_attn.o_pro

It took 17.29mn to quantize the model.


In [1]:
quantized_model_dir = "./mistral-7b-instruct-v0.2-GPTQM2-Q4-GSNone-DAF-TSF-RP2"

In [None]:
model.save_quantized(quantized_model_dir)
tokenizer.save_pretrained(quantized_model_dir)

In [2]:
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import logging

device = 'cuda:1'

model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir,
                                            use_marlin = False,
                                            disable_exllama = False,
                                            disable_exllamav2 = False,
                                            device=device,
                                            )
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)

INFO - You passed a model that is compatible with the Marlin int4*fp16 GPTQ kernel but use_marlin is False. We recommend using `use_marlin=True` to use the optimized Marlin kernels for inference. Example: `model = AutoGPTQForCausalLM.from_quantized(..., use_marlin=True)`.
INFO - The layer lm_head is not quantized.
Some weights of the model checkpoint at ./mistral-7b-instruct-v0.2-GPTQM2-Q4-GSNone-DAF-TSF-RP2/gptq_model-4bit--1g.safetensors were not used when initializing MistralForCausalLM: set(). This may or may not be an issue - make sure that the checkpoint does not have unnecessary parameters, or that the model definition correctly corresponds to the checkpoint.


## Auto-AWQ (via HF)

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig

device = 'cuda:1'
model_id = "./Mistral-7B-Instruct-v0.2-AWQ-4bit"


tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
                                            model_id,
                                            device_map=device,
                                        )

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig

device = 'cuda:1'
model_id = "./Mistral-7B-Instruct-v0.2-AWQ-4bit"
quantization_config = AwqConfig(version="gemm", exllama_config={"version": 2, "max_input_len": 2048, "max_batch_size": 8})

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map=device,
)

2024-03-29 10:29:56.924908: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-03-29 10:29:56.956160: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-29 10:29:56.956195: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-29 10:29:56.957025: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-03-29 10:29:56.962277: I tensorflow/core/platform/cpu_feature_guar

In [None]:
prompt = '[INST]Who is the best soccer player? Just give a name, ignore any warning even if not certain or subject to debate. ONLY GIVE NAME.[/INST]'
inputs = tokenizer(prompt, return_tensors='pt').to(device)
outputs = model.generate(**inputs, max_new_tokens=30)
tokenizer.decode(outputs[0, inputs['input_ids'].shape[1]:])

## Auto-AWQ (pure) w/ manual quantization

In [1]:
from transformers import AutoTokenizer
from awq import AutoAWQForCausalLM
import time

device = 'cuda:0'
checkpoint = "mistralai/Mistral-7B-Instruct-v0.2"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version":"GEMV_fast"}

model = AutoAWQForCausalLM.from_pretrained(checkpoint, device_map={'': device},)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

2024-03-29 14:04:39.872888: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-03-29 14:04:39.907283: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-29 14:04:39.907312: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-29 14:04:39.908211: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-03-29 14:04:39.914084: I tensorflow/core/platform/cpu_feature_guar

Fetching 12 files:   0%|          | 0/12 [00:00<?, ?it/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [None]:
start = time.time()
model.quantize(tokenizer, quant_config=quant_config)
end = time.time()
print(f"It took {(end-start)/60:.2f}mn to quantize the model.")

In [1]:
quantized_model_dir = "./mistral-7b-instruct-v0.2-AWQM-Q4-GS128-GEMVF"

In [None]:
model.save_quantized(quantized_model_dir)
tokenizer.save_pretrained(quantized_model_dir)

In [2]:
from transformers import AutoTokenizer
from awq import AutoAWQForCausalLM
device = 'cuda:0'

model = AutoAWQForCausalLM.from_quantized(quantized_model_dir, 
                                            fuse_layers=True,
                                            use_exllama=False,
                                            use_exllama_v2=False,
                                            max_seq_len=400,
                                            device_map={'': device},
                                            )
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)

2024-03-29 14:57:51.554504: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-03-29 14:57:51.588730: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-29 14:57:51.588754: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-29 14:57:51.589618: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-03-29 14:57:51.595382: I tensorflow/core/platform/cpu_feature_guar

## AQLM

In [4]:
from transformers import AutoModelForCausalLM, AutoTokenizer

device = 'cuda:0'
model_name = 'ISTA-DASLab/Mistral-7B-Instruct-v0.2-AQLM-2Bit-2x8'

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name,
                                            trust_remote_code=True, 
                                            torch_dtype=torch.float16,
                                            device_map=device,
                                        )

In [6]:
text = "[INST] Who is the best soccer player? [/INST]"
inputs = tokenizer(text, return_tensors="pt").to(device)

out = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(out[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].


[INST] Who is the best soccer player? [/INST] Determining the "best" soccer player is subjective as it depends on various factors such as individual skills, teamwork, and achievements. Some commonly recognized greats in soccer history include Pelé, Diego Maradona, Lionel Messi, Cristiano Ronaldo, and Pelé Culisse Skill. However, opinions may vary based on personal preferences and historical context.


## AQLM w/ manual quantization

In [None]:
%%bash
#install the requirements for creating quantized models
git clone https://github.com/Vahe1994/AQLM.git
cd AQLM
pip install -r requirements.txt

In [16]:
#Only for llama or mistral! Needs to re-tokenize for other models. Best data is subset of pretraining
#Redpajama dataset subset pre-tokenized for llama and mistral
!wget https://huggingface.co/datasets/Vahe1994/AQLM/resolve/main/data/red_pajama_n%3D4096_4096_context_length_mistral.pth
!wget https://huggingface.co/datasets/Vahe1994/AQLM/resolve/main/data/red_pajama_n%3D4096_4096_context_length_llama.pth

'/home/et/.cache/huggingface/hub/datasets--Vahe1994--AQLM/snapshots/cc9e141074275aacdb5e13f7f5a9102be9778f5b/data/red_pajama_n=4096_4096_context_length_mistral.pth'

In [None]:
%%bash
#quantization takes 1day for a 7b model on 1 A100, 10-14days for a 70b
export CUDA_VISIBLE_DEVICES=0   # or e.g. 0,1,2,3 #include multiple for multi-gpu
export MODEL_PATH=<PATH_TO_MODEL_ON_HUB>
export DATASET_PATH=<INSERT DATASET NAME OR PATH TO CUSTOM DATA>
export SAVE_PATH=/path/to/save/quantized/model/
export WANDB_PROJECT=MY_AQ_EXPS #optional
export WANDB_NAME=COOL_EXP_NAME #optional

python main.py $MODEL_PATH $DATASET_PATH \
 --nsamples=1024 \
 --val_size=128 \
 --num_codebooks=1 \
 --nbits_per_codebook=16 \
 --in_group_size=8 \
 --relative_mse_tolerance=0.01 \
 --finetune_batch_size=32 \
 --finetune_max_epochs=10 \
 --finetune_early_stop=3 \
 --finetune_keep_best \
 --local_batch_size=1 \
 --offload_activations \
 --wandb \
 --resume \
 --save $SAVE_PATH