---
# Evaluating Speed

In [1]:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = "0"

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig, BitsAndBytesConfig
from tqdm import tqdm
import numpy as np
import cProfile ##check this out
import torch
import time
import os
import bitsandbytes as bnb
import torch.nn as nn
import transformers
from peft import (
    LoraConfig,
    PeftConfig,
    get_peft_model,
    prepare_model_for_kbit_training,
)

In [16]:
prompts = [
""" [INST] You are a Senior Machine Learning Engineer. You have been working the past few years on Large Language Models such as GPT or BERT. Your task is to explain how deep learning works to a 10-year-old child. You may use comparison to help him understand. [/INST]
""",
""" [INST] You are a former european football player. You used to play as a midfielder in the greatest european leagues. You played for various clubs in England, France and Spain. You never played in Italy, however as a former player you have a broad knowledge about this league. Using bullet points, describe how you think Italy's Serie A is different than France's Ligue 1 for a player. [/INST]
""",
""" [INST] Provide a list of the ten most famous French comedy movies that were released before 2020. Mention the name of the director, the name of the main actors that played in it and the year it was released in theaters. [/INST]
""",
""" [INST] You are an historian and an expert regarding both World Wars that took place in Europe during the XXth century. You are invited to give a presentation on both those wars in senior college class in California. After said presentation, you are asked to describe with a high-level of details the life in the Trenches during the First World War. [/INST]
""",
""" [INST] Among the FANG companies, like Alphabet or Meta, which of them has the best behavior towards the environment? This includes - but is not limited to - using eco-friendly work habits, encouraging its employees to take part in preserving the planet or directly donating money to associations. [/INST]
""",
""" [INST] Provide two Python codes for finding the average of the second maximum and the second minimum numbers in an array of integers. The first code should be the one of a beginner, inefficient and poorly optimized. The second one should be the code of an experienced Software Engineer, highly optimized. [/INST]
""",
""" [INST] What are the 5 cities in the world with the most dogs per person? Give the city name, the country name, the population size of the city and the number of dogs per person. Additionally, provide the most common dog breed in the city. [/INST]
""",
""" [INST] You are a Human Resources specialist recruiting new graduates fresh out of college for a tech company in Silicon Valley. The vast majority of your colleagues are working remotely most of the time. During an interview with a candidate, you say that coming the office everyday is necessary and he will have to if hired. What are your arguments? [/INST]
""",
""" [INST] Describe step by step the process to remove and change a GPU in a desktop computer unit. Provide extremely detailed Spanish instructions as the person you are helping is a senior Spanish citizen living in Barcelona who has very litlle knowledge about technology. Adopt a profesionnal tone. [/INST]
""",
""" [INST] You are a finance expert as you have been working as a hedge-fund quantitative researcher in NYC for more than 10 years. Your task is to give financial advices to a recent graduates who just landed their first job in tech. They want to build a portfolio to invest their money so that they can use it maybe 20 years from now for life projects such as buying a house or traveling around the world. [/INST]
""",
]

In [17]:
def evaluation_speed(display_outputs=False, batch=False):
    tokens_per_second = []
    if batch:
        inputs = {'input_ids': torch.randint(0, 32_000, (32,75)).to(device), 'attention_mask': torch.ones((32,75)).to(device)}
        start_time = time.time()
        outputs = model.generate(**inputs, max_new_tokens=300, do_sample=False, use_cache=True)
        end_time=time.time()
        tokens_per_second = outputs.shape[0] * (outputs.shape[1] - inputs['input_ids'].shape[1]) / (end_time-start_time)

        if display_outputs:
            for output in tokenizer.batch_decode(outputs[:, inputs['input_ids'].shape[1]:]):
                print(output)

    else:
        for prompt in tqdm(prompts):
            inputs = tokenizer(prompt, return_tensors='pt').to(device)

            start_time = time.time()
            outputs = model.generate(**inputs, max_new_tokens=300, do_sample=False, use_cache=True)
            end_time = time.time()
            tokens_per_second.append(len(outputs[0, inputs['input_ids'].shape[1]:]) / (end_time-start_time))
            
            if display_outputs:
                print(tokenizer.decode(outputs[0, inputs['input_ids'].shape[1]:]))
        
    return tokens_per_second

In [18]:
tokens_per_second = evaluation_speed(display_outputs=True, batch=False)
print(f"The average inference speed is {np.mean(tokens_per_second):.1f} tokens/sec")

  0%|          | 0/10 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 10%|█         | 1/10 [00:15<02:18, 15.41s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Imagine a magical storyteller, who lives in a magical land filled of letters and numbers. This storyteller, named "Deep Learning Robot," is very special because he can read and understand the stories of this magical land better than anyone else. He can read and understand the stories so well that he can create new stories from the old ones.

This magical storyteller, Deep Learning Robot, has a big, magical box filled with words and numbers. He uses this box to read and understand the stories, and then he makes new stories from the old ones. He does this by using a special method called "training," where he reads and learns from many stories, and then he makes new stories from the old ones.

This magical storyteller, Deep Learning Robot, is like a super-duper-super-magical-reader, who can read and understand stories better than anyone else. He can read and understand the stories so well that he can make new stories from the old ones, just like a magic robot.

Now, imagine a magical sto

 20%|██        | 2/10 [00:28<01:53, 14.18s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



1. **Competition Level:** While both Italy's Serie A and France's Ligue 1 are top-tier football leagues, there are notable differences in the level of competition.

2. **Domestic Football:** The domestic football culture in Italy and France differs significantly. In Italy, the Serie A is the primary league, while in France, Ligue 1 is the main league.

3. **Player Salaries:** The salary levels for players in Italy and France are distinct. Italian clubs often offer lower salaries compared to French clubs.

4. **Club Size:** The size of clubs in Italy and France varies significantly. Italian clubs tend to have smaller budgets and smaller teams, while French clubs have larger budgets and larger teams.

5. **Player Talent:** The level of talent in Italy and France differs. Italian clubs often have a different mix of players compared to French clubs.

6. **Player Infrastructure:** The infrastructure for players in Italy and France differs. In Italy, the infrastructure is often less organiz

 30%|███       | 3/10 [00:41<01:36, 13.77s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


1. "Les Comicables" (1963) - directed by Marcel Lenzion, starring Jean-Louis Tudor and Michele Souter, released in French theaters in 1963.
2. "Les Nouyages" (1963) - directed by Marcel Lenzion, starring Jean-Louis Tudor and Michele Souter, released in French theaters in 1963.
3. "Les Realites" (1968) - directed by Marcel Lenzion, starring Jean-Louis Tudor and Michele Souter, released in French theaters in 1968.
4. "Les Laughs" (1972) - directed by Marcel Lenzion, starring Jean-Lois Tudor and Michele Souter, released in French theaters in 1972.
5. "Les Laughs" (1973) - directed by Marcel Lenzion, starring Jean-Lois Tudor and Michele Souter, released in French theaters in 1973.
6. "Les Laughs" (1974) - directed by Marcel Lenzion, starring Jean-Lois Tudor and Michele Souter, released in French theaters in 1974.
7


 40%|████      | 4/10 [00:55<01:21, 13.58s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Ladies and gentlemen, I am delighted to share with you my insights on the two World Wars that took place in Europe during the XXth century, specifically focusing on the life in the trenches during the First World War.

The First World War, also known as the Great War, was a global conflict that lasted from 1914 to 1919. It was a war that saw the mobilization of millions of soldiers, and the devastating impact of modern warfare. The trenches, the front lines, were the primary battlegrounds where the armies clashed, and the conditions there were far from ordinary.

The trenches were a place of mud, rain, and muddy water. The soldiers were living in the most inhospitable conditions, with little comfort or comfort. The trenches were a place of constant danger, where the soldiers were living in the most unhealthy conditions.

The soldiers were living in trenches that were often flooded with water, making the living conditions even more challenging. The soldiers were living in trenches that 

 50%|█████     | 5/10 [01:06<01:03, 12.77s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Among the FANG companies, Alphabet, which is the parent company of Google, is known for its commitment to sustainability and environmental initiatives. Google has a long-standing commitment to reducing its carbon footprint and has set ambitious goals to reduce its carbon emissions. The company has implemented various initiatives such as using renewable energy sources, reducing energy consumption, and reducing the carbon footprint of its data centers.

Google has also encouraged its employees to take part in preserving the planet through various initiatives such as the "Green Energy" challenge, which encourages employees to reduce their carbon footprint by using renewable energy sources and reducing energy consumption. The company has also donated millions of dollars to various environmental associations and initiatives.

In addition, Google has pledged to reduce its carbon footprint by 70% by 2030 and has set a goal to use renewable energy for all its data centers by 2030. The company 

 60%|██████    | 6/10 [01:19<00:51, 12.95s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


As a beginner, you might write the following code:
```
function finding_second_min_and_max(array):
# Initialize some variables
min = min(array)
max = max(array)
# Find the second minimum and second maximum
second_min = None
second_max = None
for i in range of array:
if i < min:
if i < max:
if i > min:
then set second_min = i
if i > second_min:
then set second_max = i
# Calculate average of second minimum and second maximum
averages_min_and_max = (min + max + sum_of_array_numbers_between_min_and_max)/2
return averages_min_and_max
```
This code is inefficient and poorly optimized, as it involves unnecessary comparisons and redundant variables.

As an experienced Software Engineer, you might write the following code:
```
function finding_second_min_and_max(array):
# Initialize some variables
min = min(array)
max = max(array)
# Find the second minimum and second maximum
second_min = None
second_max = None
for i in range of array:
if i < min:
if i > min:
if i > max:
if


 70%|███████   | 7/10 [01:33<00:39, 13.06s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


1. Bangi, Bangladesh: With a population of approximately 1.3 million people, Bangi is known for having the highest number of dogs per person. The city is located in the Sylhet district of Pakistan. The most common dog breed in Bangi is the Afghan Hound, a breed that is indigenous to the region.

2. Tbilisi, Georgia: With a population of around 1.5 million people, Tbilisi is the capital city of Georgia. The city is known for its bustling markets and vibrant culture. The most common dog breed in Tbilisi is the Georgian National Dog, a breed that is unique to the region.

3. Lima, Peru: With a population of around 2.5 million people, Lima is the capital city of Peru. The city is known for its beautiful climate and delicious food. The most common dog breed in Lima is the Peruvian Street Dog, a breed that is indigenous to the region.

4. New Delhi, India: With a population of around 10 million people, New Delhi is the capital city of India. The city is known for its bustling markets and cul

 80%|████████  | 8/10 [01:46<00:26, 13.13s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


1. Productivity and Focus: Working in an office setting allows employees to focus on their tasks without distractions. The office setting provides a quiet and focused environment where employees can concentrate on their work and achieve greater productivity.
2. Collaborative Work: Working in an office setting allows employees to collaborate and work together on projects. This can lead to better teamwork, improved problem-solving, and a more efficient workflow.
3. Learning Opportunities: Working in an office setting allows employees to learn from their colleagues and gain valuable experience. This can lead to better career development and professional growth.
4. Professional Development: Working in an office setting allows employees to develop professionally, gain experience, and learn from their colleagues. This can lead to better career opportunities and professional growth.
5. Social Interaction: Working in an office setting allows employees to interact socially with their colleagues

 90%|█████████ | 9/10 [01:59<00:13, 13.18s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


1. Identiate the GPU (Graphics Processing Unit) in the desktop computer unit that needs to be replaced.
2. Identify the model and make of the computer unit.
3. Locate the GPU within the unit. It is usually located on the motherboard.
4. Disassemble the computer unit, carefully and thoroughly.
5. Access the motherboard, where the GPU is located.
6. Remove the motherboard from the computer unit.
7. Inspect the motherboard for the GPU, ensuring the GPU is visible and accessible.
8. Remove the GPU from the motherboard.
9. Replace the GPU with a new one, ensuring the new GPU is compatible with the computer unit.
10. Reinstall the motherboard into the computer unit, ensuring it is securely connected.
11. Reassemble the computer unit, ensuring all components are properly connected.
12. Test the computer unit to ensure the new GPU is functioning properly.
13. Monitor the computer unit for any signs of instability or irregularity.
14. Ensure the computer unit is functioning at optimal performan

100%|██████████| 10/10 [02:13<00:00, 13.31s/it]

1. Diversify your portfolio: The key to building a successful investment portfolio is diversifying it across various asset classes and sectors. This means investing in a mix of stocks, bonds, real estate, and other alternative investments.

2. Invest in broad markets: Start with broad markets, such as the S&P 500 or the FTSE, as a foundation for your portfolio. These markets offer a wide range of investments across various sectors and industries.

3. Invest in individual sectors: Within those broad markets, consider investing in specific sectors that align with your interests and goals. For example, investing in technology, renewable energy, or healthcare sectors.

4. Invest in individual companies: Within those sectors, consider investing in specific companies that align with your interests and goals. For example, investing in companies that are leaders in their respective sectors or have a strong track record.

5. Consider the timing of your investments: Consider the timing of your i




## LLM.int8() / QLoRA

In [None]:
device = 'cuda'
checkpoint = "mistralai/Mistral-7B-Instruct-v0.2"

tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast = False)
model = AutoModelForCausalLM.from_pretrained(checkpoint,
                                             torch_dtype = torch.float16,       
                                             device_map='cuda:0',
                                            #  quantization_config=BitsAndBytesConfig(
                                            #     load_in_4bit=True,
                                            #     bnb_4bit_use_double_quant=True,
                                            #     bnb_4bit_quant_type="nf4",
                                            #     bnb_4bit_compute_dtype=torch.bfloat16,
                                            #     ),
                                            )
config = AutoConfig.from_pretrained(checkpoint)

In [None]:
total_p = 0
total = 0
for n,p in model.named_parameters():
    total += p.numel()
    if 'head' in n or 'embed' in n or 'norm' in n:
        total_p += p.numel()
total_p / total * 100

6.993743675173274

In [None]:
total_p = [0, 0, 0]
for n,p in model.named_parameters():
    if 'embed' in n:
        total_p[0] += p.numel()
    
    elif 'norm' in n:
        total_p[1] += p.numel()
    
    elif 'head' in n:
        total_p[2] += p.numel()
total_p

[131072000, 266240, 131072000]

In [None]:
outcast = []
for n,p in model.named_parameters():
    if p.dtype != model.model.layers[0].self_attn.q_proj.weight.dtype:
        outcast.append((n, str(p.dtype)))
print(f"The layers with different dtype than base ({model.model.layers[0].self_attn.q_proj.weight.dtype}) are:\n{'\n'.join([','.join(x) for x in outcast])}")

In [None]:
cProfile.run('evaluation()')

In [None]:
tokens_per_second = evaluation()
print(f"The average inference speed is {np.mean(tokens_per_second):.1f} tokens/sec")

## Quanto

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, QuantoConfig
import numpy as np
import quanto
import torch
import time

In [None]:
device = 'cuda:1'
checkpoint = "mistralai/Mistral-7B-Instruct-v0.2"

tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast = False)
model = AutoModelForCausalLM.from_pretrained(checkpoint,
                                             torch_dtype = torch.float16,       
                                             device_map = 'cuda:1',
                                             quantization_config = QuantoConfig(
                                             weights='float8',
                                             #activations='int8',
                                                ),
                                            )

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [None]:
model.model.layers[0].self_attn.q_proj.weight

QTensor(tensor([[ 3.4375e-01,  5.5000e+00, -1.7188e-01,  ...,  2.6000e+01,
          1.7188e-01, -5.0000e+00],
        [-2.3438e-01, -1.7188e-01,  1.0938e-01,  ...,  1.1250e+00,
          3.9062e-02,  2.0000e+00],
        [ 3.5156e-02, -9.3750e-01,  1.1719e-02,  ..., -1.1000e+01,
         -4.6875e-02, -4.0625e-01],
        ...,
        [-4.6875e-02, -3.2500e+00, -5.8594e-02,  ...,  4.0625e-01,
          2.3438e-02,  9.3750e-01],
        [ 1.1719e-02, -2.7500e+00,  1.5625e-02,  ..., -4.5000e+00,
         -5.0781e-02,  1.0000e+00],
        [ 4.2969e-02,  3.2500e+00,  4.6875e-02,  ...,  2.5000e-01,
         -1.9531e-02, -1.1250e+00]], device='cuda:1',
       dtype=torch.float8_e4m3fn), scale=tensor([[0.0001],
        [0.0002],
        [0.0005],
        ...,
        [0.0005],
        [0.0006],
        [0.0005]], device='cuda:1', dtype=torch.float16), public_dtype=torch.float16)

In [None]:
quanto.quantize(model.model, weights=quanto.qfloat8, activations=quanto.qfloat8)
quanto.freeze(model)

In [None]:
param = 0
for n,p in model.named_parameters():
    param += p.data.numel()
param

7241732096

In [None]:
outcast = []
for n,p in model.named_parameters():
    try:
        outcast.append((n, str(p.qtype.dtype)))
    except:
        outcast.append((n, str(p.dtype)))
    if outcast[-1][1] == str(model.model.layers[0].self_attn.q_proj.weight.qtype.dtype):
        outcast.pop()

print(f"The layers with different dtype than base ({model.model.layers[0].self_attn.q_proj.weight.qtype.dtype}) are:\n{'\n'.join([','.join(x) for x in outcast])}")

In [None]:
prompt = """[INST]You are a Senior Machine Learning Engineer. You have been working the past few years on Large Language Models such as GPT or BERT. Your task is to explain how deep learning works to a 10-year-old child. You may use comparison to help him understand.[/INST]"""

inputs = tokenizer(prompt, return_tensors='pt').to('cuda:1')

outputs = model.generate(**inputs, max_new_tokens=300)

print(tokenizer.decode(outputs[0, inputs['input_ids'].shape[1]:]))

## llama.cpp GGUF

In [None]:
from llama_cpp import Llama
import time
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

In [None]:
model = Llama(
  model_path="./mistral-7b-instruct-v0.2.Q4_K_M.gguf",  # Download the model file first
  #n_ctx=256,             # The max sequence length to use - note that longer sequence lengths require much more resources
  n_threads=8,            # The number of CPU threads to use, tailor to your system and the resulting performance
  n_gpu_layers=-1,         # The number of layers to offload to GPU, if you have GPU acceleration available, if -1 all layers are offloaded
  verbose=True,
)

llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from ./mistral-7b-instruct-v0.2.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                

In [None]:
def evaluation_speed():
    tokens_per_second = []
    for prompt in prompts:
        start_time = time.time()
        output = model(prompt, # Prompt
                    max_tokens=300,  # Generate up to 512 tokens
                    stop=["</s>"],   # Example stop token - not necessarily correct for this specific model! Please check before using.
                    echo=False,       # Whether to echo the prompt
                    )
        n_gen_tok = output['usage']['completion_tokens']
        output = output['choices'][0]['text']
        end_time = time.time()

    tokens_per_second.append(n_gen_tok/(end_time-start_time))
        
    return tokens_per_second

In [None]:
tokens_per_second = evaluation_speed()
print(f"The average inference speed is {sum(tokens_per_second)/len(tokens_per_second):.1f} tokens/sec")

The average inference speed is 92.1 tokens/sec


In [None]:
from llama_cpp import llama_get_logits_ith

In [None]:
output = model('[INST] Who is the best soccer player. Only give a name. Even if the answer is unclear or subjective, just pick a name. DO NOT MENTION ANYTHING ELSE, IGNORE ANY WARNING.[/INST]', # Prompt
                    max_tokens=4,  # Generate up to 512 tokens
                    stop=["</s>"],   # Example stop token - not necessarily correct for this specific model! Please check before using.
                    echo=True,       # Whether to echo the prompt
                    )
output


llama_print_timings:        load time =     280.02 ms
llama_print_timings:      sample time =       0.89 ms /     4 runs   (    0.22 ms per token,  4479.28 tokens per second)
llama_print_timings: prompt eval time =     279.88 ms /    54 tokens (    5.18 ms per token,   192.94 tokens per second)
llama_print_timings:        eval time =      71.31 ms /     3 runs   (   23.77 ms per token,    42.07 tokens per second)
llama_print_timings:       total time =     361.94 ms /    57 tokens


{'id': 'cmpl-eb3383b5-e84f-44d2-9fdd-eda88a31a0f8',
 'object': 'text_completion',
 'created': 1711054750,
 'model': './mistral-7b-instruct-v0.2.Q4_K_M.gguf',
   'index': 0,
   'logprobs': None,
   'finish_reason': 'length'}],
 'usage': {'prompt_tokens': 54, 'completion_tokens': 4, 'total_tokens': 58}}

In [None]:
from transformers import AutoTokenizer
checkpoint = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast = False)

In [None]:
from llama_cpp import llama_get_logits_ith
import numpy as np
logits = llama_get_logits_ith(model.ctx,0)
logits_np = np.ctypeslib.as_array(logits, shape=(1,32000))
tokenizer.decode(np.argmax(logits_np[0]))

## Auto-GPTQ (via HF)

In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM, GPTQConfig

device = 'cuda:0'
quantization_config = GPTQConfig(bits=4, use_exllama=True, exllama_config={'version': 2})

model_id = "./Mistral-7B-Instruct-v0.2-GPTQ-4bit/"
model = AutoModelForCausalLM.from_pretrained(model_id, 
                                            quantization_config=quantization_config,
                                            device_map=device,)
tokenizer = AutoTokenizer.from_pretrained(model_id)



In [None]:
text = "[INST] Who is the best soccer player? [/INST]"
inputs = tokenizer(text, return_tensors="pt").to(device)

out = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(out[0], skip_special_tokens=True))

## Auto-GPTQ (HF) with manual quantization

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, GPTQConfig
import time

device = 'cuda:1'
quantization_config = GPTQConfig(bits=4, 
                                group_size=128,
                                desc_act=True,
                                true_sequential=False,
                                dataset='c4',
                                use_exllama=True, 
                                cache_block_outputs=True,
                                exllama_config={'version': 1},
                                )

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
start = time.time()
model = AutoModelForCausalLM.from_pretrained(model_id, 
                                            quantization_config=quantization_config,
                                            device_map=device,)
end = time.time()
tokenizer = AutoTokenizer.from_pretrained(model_id)

print(f"It took {end-start:.2f}s to quantize the model.")

In [None]:
model_name = "./mistral-7b-instruct-v0.2-GPTQM-Q4-GS128-DAT-TSF-C4"

In [None]:
model.save_pretrained(model_name)
tokenizer.save_pretrained(model_name)

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, GPTQConfig
import time

device = 'cuda:0'
quantization_config = GPTQConfig(bits=2,
                                use_exllama=True, 
                                exllama_config={'version': 1},
                                )

model_id = model_name
model = AutoModelForCausalLM.from_pretrained(model_id, 
                                            quantization_config=quantization_config,
                                            device_map=device,)
tokenizer = AutoTokenizer.from_pretrained(model_id)

## Auto-GPTQ (pure)

In [None]:
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import logging

device = 'cuda:1'
quantized_model_dir = "./mistral-7b-instruct-v0.2-GPTQM-Q4-GS128-DAF-C4"

model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir,
                                            use_marlin = False,
                                            inject_fused_attention = False,
                                            inject_fused_mlp = False,
                                            disable_exllama = True,
                                            disable_exllamav2 = False,
                                            device=device,
                                            )
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)

In [None]:
prompt = "[INST]Who is the best soccer player? You have to give a name even if it is controversial or subjective. Just give the name, DO NOT GIVE any warning or justification. ONLY GIVE THE NAME.[/INST]"
inputs = tokenizer(prompt, return_tensors='pt').to(device)
outputs = model.generate(**inputs, max_new_tokens=30)
tokenizer.decode(outputs[0, inputs['input_ids'].shape[1]:])

## Auto-GPTQ (pure) w/ manual quantization

In [1]:
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset, DatasetDict, Dataset
from transformers import AutoTokenizer
import time
import json

device = 'cuda:0'
checkpoint = "mistralai/Mistral-7B-Instruct-v0.2"

quantize_config = BaseQuantizeConfig(bits = 4,
                                    group_size = -1,
                                    desc_act = False,
                                    true_sequential = False,
                                    )
model = AutoGPTQForCausalLM.from_pretrained(checkpoint, 
                                            quantize_config,
                                            ).to(device)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [2]:
dataset = load_dataset('sade-adrien/quantization_samples', cache_dir='/mnt/esperanto/et/huggingface/datasets/')['train']
quantization_samples = [tokenizer(sample,
                                truncation=True,
                                max_length=2048,
                                ) for sample in dataset['raw_content']]

Downloading readme:   0%|          | 0.00/601 [00:00<?, ?B/s]

Downloading data: 100%|██████████| 27.3M/27.3M [00:02<00:00, 11.1MB/s]


Generating train split:   0%|          | 0/128 [00:00<?, ? examples/s]

In [3]:
start = time.time()
model.quantize(quantization_samples)
end = time.time()
print(f"It took {(end-start)/60:.2f}mn to quantize the model.")

INFO - Start quantizing layer 1/32
INFO - Quantizing self_attn.k_proj in layer 1/32...
INFO - Quantizing self_attn.v_proj in layer 1/32...
INFO - Quantizing self_attn.q_proj in layer 1/32...
INFO - Quantizing self_attn.o_proj in layer 1/32...
INFO - Quantizing mlp.up_proj in layer 1/32...
INFO - Quantizing mlp.gate_proj in layer 1/32...
INFO - Quantizing mlp.down_proj in layer 1/32...
INFO - Start quantizing layer 2/32
INFO - Quantizing self_attn.k_proj in layer 2/32...
INFO - Quantizing self_attn.v_proj in layer 2/32...
INFO - Quantizing self_attn.q_proj in layer 2/32...
INFO - Quantizing self_attn.o_proj in layer 2/32...
INFO - Quantizing mlp.up_proj in layer 2/32...
INFO - Quantizing mlp.gate_proj in layer 2/32...
INFO - Quantizing mlp.down_proj in layer 2/32...
INFO - Start quantizing layer 3/32
INFO - Quantizing self_attn.k_proj in layer 3/32...
INFO - Quantizing self_attn.v_proj in layer 3/32...
INFO - Quantizing self_attn.q_proj in layer 3/32...
INFO - Quantizing self_attn.o_pro

It took 17.29mn to quantize the model.


In [1]:
quantized_model_dir = "./mistral-7b-instruct-v0.2-GPTQM2-Q4-GSNone-DAF-TSF-RP2"

In [None]:
model.save_quantized(quantized_model_dir)
tokenizer.save_pretrained(quantized_model_dir)

In [2]:
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import logging

device = 'cuda:1'

model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir,
                                            use_marlin = False,
                                            disable_exllama = False,
                                            disable_exllamav2 = False,
                                            device=device,
                                            )
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)

INFO - You passed a model that is compatible with the Marlin int4*fp16 GPTQ kernel but use_marlin is False. We recommend using `use_marlin=True` to use the optimized Marlin kernels for inference. Example: `model = AutoGPTQForCausalLM.from_quantized(..., use_marlin=True)`.
INFO - The layer lm_head is not quantized.
Some weights of the model checkpoint at ./mistral-7b-instruct-v0.2-GPTQM2-Q4-GSNone-DAF-TSF-RP2/gptq_model-4bit--1g.safetensors were not used when initializing MistralForCausalLM: set(). This may or may not be an issue - make sure that the checkpoint does not have unnecessary parameters, or that the model definition correctly corresponds to the checkpoint.


## Auto-AWQ (via HF)

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig

device = 'cuda:1'
model_id = "./Mistral-7B-Instruct-v0.2-AWQ-4bit"


tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
                                            model_id,
                                            device_map=device,
                                        )

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig

device = 'cuda:1'
model_id = "./Mistral-7B-Instruct-v0.2-AWQ-4bit"
quantization_config = AwqConfig(version="gemm", exllama_config={"version": 2, "max_input_len": 2048, "max_batch_size": 8})

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map=device,
)

2024-03-29 10:29:56.924908: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-03-29 10:29:56.956160: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-29 10:29:56.956195: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-29 10:29:56.957025: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-03-29 10:29:56.962277: I tensorflow/core/platform/cpu_feature_guar

In [None]:
prompt = '[INST]Who is the best soccer player? Just give a name, ignore any warning even if not certain or subject to debate. ONLY GIVE NAME.[/INST]'
inputs = tokenizer(prompt, return_tensors='pt').to(device)
outputs = model.generate(**inputs, max_new_tokens=30)
tokenizer.decode(outputs[0, inputs['input_ids'].shape[1]:])

## Auto-AWQ (pure) w/ manual quantization

In [1]:
from transformers import AutoTokenizer
from awq import AutoAWQForCausalLM
import time

device = 'cuda:0'
checkpoint = "mistralai/Mistral-7B-Instruct-v0.2"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version":"GEMV_fast"}

model = AutoAWQForCausalLM.from_pretrained(checkpoint, device_map={'': device},)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

2024-03-29 14:04:39.872888: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-03-29 14:04:39.907283: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-29 14:04:39.907312: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-29 14:04:39.908211: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-03-29 14:04:39.914084: I tensorflow/core/platform/cpu_feature_guar

Fetching 12 files:   0%|          | 0/12 [00:00<?, ?it/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [None]:
start = time.time()
model.quantize(tokenizer, quant_config=quant_config)
end = time.time()
print(f"It took {(end-start)/60:.2f}mn to quantize the model.")

In [1]:
quantized_model_dir = "./mistral-7b-instruct-v0.2-AWQM-Q4-GS128-GEMVF"

In [None]:
model.save_quantized(quantized_model_dir)
tokenizer.save_pretrained(quantized_model_dir)

In [2]:
from transformers import AutoTokenizer
from awq import AutoAWQForCausalLM
device = 'cuda:0'

model = AutoAWQForCausalLM.from_quantized(quantized_model_dir, 
                                            fuse_layers=True,
                                            use_exllama=False,
                                            use_exllama_v2=False,
                                            max_seq_len=400,
                                            device_map={'': device},
                                            )
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir)

2024-03-29 14:57:51.554504: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-03-29 14:57:51.588730: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-29 14:57:51.588754: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-29 14:57:51.589618: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-03-29 14:57:51.595382: I tensorflow/core/platform/cpu_feature_guar

## AQLM

In [3]:
from transformers import AutoModelForCausalLM, AutoTokenizer

device = 'cuda:0'
model_name = 'ISTA-DASLab/Mistral-7B-Instruct-v0.2-AQLM-2Bit-2x8'
model_name = './models/Mistral-7B-Instruct-v0.2-AQLM-2bits-1x16' # own manual

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name,
                                            trust_remote_code=True, 
                                            torch_dtype=torch.float16,
                                            device_map=device,
                                        )

In [13]:
text = "[INST] Who is the best soccer player? [/INST]"
inputs = tokenizer(text, return_tensors="pt").to(device)

out = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(out[0], skip_special_tokens=True))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].


[INST] Who is the best soccer player? [/INST] There are many great soccer players, and it's difficult to definitively say who is the best. However, some players have been consistently recognized for their exceptional skills and performances. Here are some of the top soccer players:

1. Lionel Messi - Argentinian striker for Barcelona and the national team, who has won numerous awards and accolades for his goal scoring abilities and overall performance.
2. Cristian Ronaldo - Portuguese striker for Real Madrid and the


## AQLM w/ manual quantization

In [None]:
%%bash
#install the requirements for creating quantized models
git clone https://github.com/Vahe1994/AQLM.git
cd AQLM
pip install -r requirements.txt

In [16]:
#Only for llama or mistral! Needs to re-tokenize for other models. Best data is subset of pretraining
#Redpajama dataset subset pre-tokenized for llama and mistral
!wget https://huggingface.co/datasets/Vahe1994/AQLM/resolve/main/data/red_pajama_n%3D4096_4096_context_length_mistral.pth
!wget https://huggingface.co/datasets/Vahe1994/AQLM/resolve/main/data/red_pajama_n%3D4096_4096_context_length_llama.pth

'/home/et/.cache/huggingface/hub/datasets--Vahe1994--AQLM/snapshots/cc9e141074275aacdb5e13f7f5a9102be9778f5b/data/red_pajama_n=4096_4096_context_length_mistral.pth'

In [None]:
%%bash
#quantization takes 1day for a 7b model on 1 A100, 10-14days for a 70b
export CUDA_VISIBLE_DEVICES=0   # or e.g. 0,1,2,3 #include multiple for multi-gpu
export MODEL_PATH=<PATH_TO_MODEL_ON_HUB>
export DATASET_PATH=<INSERT DATASET NAME OR PATH TO CUSTOM DATA>
export SAVE_PATH=/path/to/save/quantized/model/
export WANDB_PROJECT=MY_AQ_EXPS #optional
export WANDB_NAME=COOL_EXP_NAME #optional

python main.py $MODEL_PATH $DATASET_PATH \
 --nsamples=1024 \
 --val_size=128 \
 --num_codebooks=1 \
 --nbits_per_codebook=16 \
 --in_group_size=8 \
 --relative_mse_tolerance=0.01 \
 --finetune_batch_size=32 \
 --finetune_max_epochs=10 \
 --finetune_early_stop=3 \
 --finetune_keep_best \
 --local_batch_size=1 \
 --offload_activations \
 --wandb \
 --resume \
 --save $SAVE_PATH