---
# Evaluating Speed

In [3]:
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig, BitsAndBytesConfig
from tqdm import tqdm
import numpy as np
import cProfile ##check this out
import torch
import time
import os
import bitsandbytes as bnb
import torch.nn as nn
import transformers
from peft import (
    LoraConfig,
    PeftConfig,
    get_peft_model,
    prepare_model_for_kbit_training,
)

In [4]:
prompts = [
""" [INST] You are a Senior Machine Learning Engineer. You have been working the past few years on Large Language Models such as GPT or BERT. Your task is to explain how deep learning works to a 10-year-old child. You may use comparison to help him understand. [/INST]
""",
""" [INST] You are a former european football player. You used to play as a midfielder in the greatest european leagues. You played for various clubs in England, France and Spain. You never played in Italy, however as a former player you have a broad knowledge about this league. Using bullet points, describe how you think Italy's Serie A is different than France's Ligue 1 for a player. [/INST]
""",
""" [INST] Provide a list of the ten most famous French comedy movies that were released before 2020. Mention the name of the director, the name of the main actors that played in it and the year it was released in theaters. [/INST]
""",
""" [INST] You are an historian and an expert regarding both World Wars that took place in Europe during the XXth century. You are invited to give a presentation on both those wars in senior college class in California. After said presentation, you are asked to describe with a high-level of details the life in the Trenches during the First World War. [/INST]
""",
""" [INST] Among the FANG companies, like Alphabet or Meta, which of them has the best behavior towards the environment? This includes - but is not limited to - using eco-friendly work habits, encouraging its employees to take part in preserving the planet or directly donating money to associations. [/INST]
""",
""" [INST] Provide two Python codes for finding the average of the second maximum and the second minimum numbers in an array of integers. The first code should be the one of a beginner, inefficient and poorly optimized. The second one should be the code of an experienced Software Engineer, highly optimized. [/INST]
""",
""" [INST] What are the 5 cities in the world with the most dogs per person? Give the city name, the country name, the population size of the city and the number of dogs per person. Additionally, provide the most common dog breed in the city. [/INST]
""",
""" [INST] You are a Human Resources specialist recruiting new graduates fresh out of college for a tech company in Silicon Valley. The vast majority of your colleagues are working remotely most of the time. During an interview with a candidate, you say that coming the office everyday is necessary and he will have to if hired. What are your arguments? [/INST]
""",
""" [INST] Describe step by step the process to remove and change a GPU in a desktop computer unit. Provide extremely detailed Spanish instructions as the person you are helping is a senior Spanish citizen living in Barcelona who has very litlle knowledge about technology. Adopt a profesionnal tone. [/INST]
""",
""" [INST] You are a finance expert as you have been working as a hedge-fund quantitative researcher in NYC for more than 10 years. Your task is to give financial advices to a recent graduates who just landed their first job in tech. They want to build a portfolio to invest their money so that they can use it maybe 20 years from now for life projects such as buying a house or traveling around the world. [/INST]
""",
]

In [5]:
def evaluation_speed(display_outputs=False):
    tokens_per_second = []
    for prompt in tqdm(prompts):
        inputs = tokenizer(prompt, return_tensors='pt').to(device)

        start_time = time.time()
        output = model.generate(**inputs, max_new_tokens=300, do_sample=False, use_cache=True)
        end_time = time.time()
        tokens_per_second.append(len(output[0, inputs['input_ids'].shape[1]:]) / (end_time-start_time))
        
        if display_outputs:
            print(tokenizer.decode(output[0, inputs['input_ids'].shape[1]:]))
        
    return tokens_per_second

In [6]:
tokens_per_second = evaluation_speed(display_outputs=True)
print(f"The average inference speed is {np.mean(tokens_per_second):.1f} tokens/sec")

  0%|          | 0/10 [00:00<?, ?it/s]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
 10%|█         | 1/10 [00:11<01:47, 11.95s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Hey there, curious kid! I'd be happy to explain deep learning in a way that's easy for a 10-year-old to understand. Let's compare it to learning how to recognize handwritten numbers, like the ones in an old-fashioned math book.

First, imagine you're given a bunch of pictures of numbers written by hand. Your job is to figure out which number each picture represents. At first, you might make some mistakes, but don't worry! That's just part of learning.

Now, imagine you have a friend who can help you. This friend looks at all the pictures and points out which number each one is. You use this information to learn and improve. This is similar to how a computer uses something called "training data" to learn.

But what if there are lots of different handwriting styles? Your friend might make mistakes too! That's where deep learning comes in. Instead of just one friend, imagine you have lots and lots of friends, each one an expert in recognizing a different kind of handwriting. Together, th

 20%|██        | 2/10 [00:19<01:12,  9.09s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



* **Style of Play:** Italy's Serie A is known for its tactical and defensive approach. Teams prioritize organization and discipline, focusing on counter-attacks and set pieces. In contrast, Ligue 1 in France tends to be more open and attacking, with a greater emphasis on possession and individual skills.

* **Tactics:** In Serie A, teams often employ a 4-3-1-2 or 4-4-2 formation, with a focus on solid defense and quick counter-attacks. In Ligue 1, teams may use a 4-3-3 or 4-2-3-1 formation, with a greater emphasis on possession and attacking play.

* **Rivalries:** In Italy, the biggest derbies are the Milan Derby (AC Milan vs Inter Milan) and the Turin Derby (Juventus vs Torino). In France, the biggest derbies are the Classique (Olympique Marseille vs Olympique Lyon) and the Derby du Rhin (Olympique Lyon vs AS Saint-Étienne).

* **Fans:** Italian football fans are known for their passion and intensity. Matches can be heated, with frequent chanting and singing from the stands. In Fran

 30%|███       | 3/10 [00:26<00:57,  8.15s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


1. "La Grande Illusion" (1937) - Director: Jean Renoir, Main Actors: Jean Gabin, Erich von Stroheim, Darnley Hood. This is a war comedy-drama about the camaraderie between French and German officers in a German camp during World War I.
2. "Les Vacances de M. Hulot" (1953) - Director: Jacques Tati, Main Actors: Jacques Tati, Gérard Méry, Jean-Pierre Zola. This is a comedy about a man named Hulot and his misadventures during his vacation.
3. "Jules et Jim" (1962) - Director: Alain Resnais, Main Actors: Oskar Werner, Jeanne More, Henri Saadik. This is a romantic comedy about two friends who both fall in love with the same woman.
4. "Borsalino" (1970) - Director: Jacques Deray, Main Actors: Lino Ventura, Alain Delon, Marlène Jobert. This is a comedy about two friends who are both in love with the same woman.
5. "Les Tontons Flingueurs" (1956) - Director: Jean Boyer, Main Actors: Jean-Paul Belmondo, Danielle


 40%|████      | 4/10 [00:33<00:46,  7.71s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Title: The Great Conflicts: An Overview of the First and Second World Wars in Europe

Ladies and Gentlemen, I am delighted to be here today to share my knowledge and passion for European history with you. In this presentation, I will provide an overview of the two most significant conflicts of the 20th century: the First World War (1914-1918) and the Second World War (1939-1945). I will discuss the causes, major events, and consequences of each war.

The First World War, also known as the Great War, was a global conflict that began in 1914 and lasted until 1918. The war was primarily caused by the complex web of alliances and rivalries that had developed among the major European powers in the late 19th and early 20th centuries. The assassination of Archduke Franz Ferdinand of Austria-Hungary in Sarajevo, Bosnia, on June 28, 1914, served as the immediate trigger for the war.

The First World War was a brutal and devastating conflict. It involved millions of soldiers from various countr

 50%|█████     | 5/10 [00:40<00:37,  7.46s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Among the FANG companies (Facebook, Apple, Netflix, and Alphabet), Alphabet (Google) has been recognized for its significant efforts towards environmental sustainability. Here are some reasons why:

1. Renewable Energy: Alphabet is the world's largest corporate purchaser of renewable energy. Google has committed to using 100% renewable energy for all its operations and data centers worldwide by 2030.

2. Carbon Neutrality: Google aims to be carbon neutral by 2020. They have already achieved this goal for their operations and are working on reducing the carbon footprint of their supply chain and product use.

3. Eco-friendly Work Habits: Google has implemented several eco-friendly work habits in its offices, such as using energy-efficient appliances, installing solar panels, and promoting public transportation or biking to work.

4. Green Initiatives: Google has several green initiatives, such as Google Earth, Google Maps, and Google.org, which focuses on environmental issues and suppo

 60%|██████    | 6/10 [00:47<00:29,  7.30s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Beginner's Code:

```python
def find_second_min_max(arr):
    # Find the minimum and maximum numbers in the array
    min_num, max_num = min(arr), max(arr)

    # Find the second minimum and second maximum numbers
    second_min, second_max = (num for num in arr if num != min_num and num != max_num)[0], (num for num in arr if num != min_num and num != max_num)[1]

    # Calculate the average of the second minimum and second maximum numbers
    avg = (second_min + second_max) / 2

    return avg

arr = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(find_second_min_max(arr))
```

Experienced Software Engineer's Code:

```python
def find_second_min_max(arr):
    # Sort the array in ascending order
    arr.sort()

    # Calculate the indices of the second minimum and second maximum numbers
    second_min_index, second_max_index = arr.index(arr[arr.index(min(arr)) + 1]), arr.index


 70%|███████   | 7/10 [00:54<00:21,  7.20s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



1. Querétaro, Mexico: Population size - 1.1 million, Dogs per person - 0.3, Most common dog breed - Rottweiler and Labrador Retriever. Querétaro is known for its vibrant culture and historic sites, making it a popular destination for dog lovers.

2. Ankara, Turkey: Population size - 5.5 million, Dogs per person - 0.25, Most common dog breed - Anatolian Shepherd. Ankara is the capital city of Turkey and is famous for its rich history and beautiful landscapes.

3. Buenos Aires, Argentina: Population size - 3 million, Dogs per person - 0.35, Most common dog breed - Bulldog and Labrador Retriever. Buenos Aires is the largest city in Argentina and is known for its European-style architecture and passionate football fans.

4. Lima, Peru: Population size - 1 million, Dogs per person - 0.5, Most common dog breed - Peruvian Hairless Dog. Lima is the largest city in Peru and is famous for its beautiful beaches, delicious cuisine, and rich cultural heritage.

5. Bogotá, Colombia: Population size

 80%|████████  | 8/10 [01:01<00:14,  7.14s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



As a Human Resources specialist recruiting new graduates for a tech company in Silicon Valley, I understand that the trend towards remote work has become increasingly popular, especially in the tech industry. However, during an interview with a candidate, if I express the necessity of coming to the office every day, I would argue the following points:

1. Collaboration and Teamwork: Working in an office environment allows for more effective collaboration and teamwork. Face-to-face interactions can lead to better communication, problem-solving, and idea generation.
2. Learning from Experienced Colleagues: Being in the office provides opportunities to learn from experienced colleagues and mentors. This can lead to professional growth and development, which is essential for new graduates.
3. Building Relationships: Building relationships with colleagues, managers, and clients is crucial for career advancement. Being present in the office allows for more opportunities to network and build

 90%|█████████ | 9/10 [01:08<00:07,  7.10s/it]Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.



Buenas tardes querido senior, I'm glad to help you with the process of removing and replacing your graphics processing unit (GPU) in your desktop computer. I understand that technology can be intimidating, but I'll explain each step in detail and with a professional tone. 

Before we begin, please make sure you have the following tools and components ready:

1. A Phillips head screwdriver
2. Thermal paste (optional, if your new GPU doesn't come with it)
3. New GPU
4. Anti-static wrist strap (optional, but recommended for static protection)

Now, let's begin:

1. Shut down your computer and unplug it from the power source.
2. Locate the power supply unit (PSU) in the back of your computer case. Turn it off by flipping the switch on the back or by unplugging the power cord.
3. Open the side panel of your computer case using a Phillips head screwdriver. Set the screws aside.
4. Locate the GPU in your computer case. It's usually the largest component with a fan attached to it.
5. Disconne

100%|██████████| 10/10 [01:15<00:00,  7.52s/it]


First and foremost, congratulations on landing your first job in tech! It's great that you're thinking about building a portfolio to invest your money for future financial goals such as buying a house or traveling around the world. Here are some financial advices that could help you get started:

1. Establish an emergency fund: Before investing, it's essential to have an emergency fund that covers at least three to six months' worth of living expenses. This safety net will provide peace of mind and financial security in case of unexpected events.
2. Maximize retirement contributions: If your employer offers a 401(k) or similar retirement plan, contribute enough to receive any matching funds if available. Maximizing retirement contributions is an essential step in building long-term wealth.
3. Diversify your portfolio: Aim for a well-diversified portfolio by investing in various asset classes such as stocks, bonds, and real estate. Diversification helps minimize risk and maximize poten




## LLM.int8() / QLoRA

In [None]:
device = 'cuda'
checkpoint = "mistralai/Mistral-7B-Instruct-v0.2"

tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast = False)
model = AutoModelForCausalLM.from_pretrained(checkpoint,
                                             torch_dtype = torch.float16,       
                                             device_map='cuda:0',
                                            #  quantization_config=BitsAndBytesConfig(
                                            #     load_in_4bit=True,
                                            #     bnb_4bit_use_double_quant=True,
                                            #     bnb_4bit_quant_type="nf4",
                                            #     bnb_4bit_compute_dtype=torch.bfloat16,
                                            #     ),
                                            )
config = AutoConfig.from_pretrained(checkpoint)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [None]:
total_p = 0
total = 0
for n,p in model.named_parameters():
    total += p.numel()
    if 'head' in n or 'embed' in n or 'norm' in n:
        total_p += p.numel()
total_p / total * 100

6.993743675173274

In [None]:
total_p = [0, 0, 0]
for n,p in model.named_parameters():
    if 'embed' in n:
        total_p[0] += p.numel()
    
    elif 'norm' in n:
        total_p[1] += p.numel()
    
    elif 'head' in n:
        total_p[2] += p.numel()
total_p

[131072000, 266240, 131072000]

In [None]:
outcast = []
for n,p in model.named_parameters():
    if p.dtype != model.model.layers[0].self_attn.q_proj.weight.dtype:
        outcast.append((n, str(p.dtype)))
print(f"The layers with different dtype than base ({model.model.layers[0].self_attn.q_proj.weight.dtype}) are:\n{'\n'.join([','.join(x) for x in outcast])}")

In [None]:
cProfile.run('evaluation()')

In [None]:
tokens_per_second = evaluation()
print(f"The average inference speed is {np.mean(tokens_per_second):.1f} tokens/sec")

## Quanto

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, QuantoConfig
import numpy as np
import quanto
import torch
import time

In [None]:
device = 'cuda:1'
checkpoint = "mistralai/Mistral-7B-Instruct-v0.2"

tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast = False)
model = AutoModelForCausalLM.from_pretrained(checkpoint,
                                             torch_dtype = torch.float16,       
                                             device_map = 'cuda:1',
                                             quantization_config = QuantoConfig(
                                             weights='float8',
                                             #activations='int8',
                                                ),
                                            )

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [None]:
model.model.layers[0].self_attn.q_proj.weight

QTensor(tensor([[ 3.4375e-01,  5.5000e+00, -1.7188e-01,  ...,  2.6000e+01,
          1.7188e-01, -5.0000e+00],
        [-2.3438e-01, -1.7188e-01,  1.0938e-01,  ...,  1.1250e+00,
          3.9062e-02,  2.0000e+00],
        [ 3.5156e-02, -9.3750e-01,  1.1719e-02,  ..., -1.1000e+01,
         -4.6875e-02, -4.0625e-01],
        ...,
        [-4.6875e-02, -3.2500e+00, -5.8594e-02,  ...,  4.0625e-01,
          2.3438e-02,  9.3750e-01],
        [ 1.1719e-02, -2.7500e+00,  1.5625e-02,  ..., -4.5000e+00,
         -5.0781e-02,  1.0000e+00],
        [ 4.2969e-02,  3.2500e+00,  4.6875e-02,  ...,  2.5000e-01,
         -1.9531e-02, -1.1250e+00]], device='cuda:1',
       dtype=torch.float8_e4m3fn), scale=tensor([[0.0001],
        [0.0002],
        [0.0005],
        ...,
        [0.0005],
        [0.0006],
        [0.0005]], device='cuda:1', dtype=torch.float16), public_dtype=torch.float16)

In [None]:
quanto.quantize(model.model, weights=quanto.qfloat8, activations=quanto.qfloat8)
quanto.freeze(model)

In [None]:
param = 0
for n,p in model.named_parameters():
    param += p.data.numel()
param

7241732096

In [None]:
outcast = []
for n,p in model.named_parameters():
    try:
        outcast.append((n, str(p.qtype.dtype)))
    except:
        outcast.append((n, str(p.dtype)))
    if outcast[-1][1] == str(model.model.layers[0].self_attn.q_proj.weight.qtype.dtype):
        outcast.pop()

print(f"The layers with different dtype than base ({model.model.layers[0].self_attn.q_proj.weight.qtype.dtype}) are:\n{'\n'.join([','.join(x) for x in outcast])}")

In [None]:
prompt = """[INST]You are a Senior Machine Learning Engineer. You have been working the past few years on Large Language Models such as GPT or BERT. Your task is to explain how deep learning works to a 10-year-old child. You may use comparison to help him understand.[/INST]"""

inputs = tokenizer(prompt, return_tensors='pt').to('cuda:1')

outputs = model.generate(**inputs, max_new_tokens=300)

print(tokenizer.decode(outputs[0, inputs['input_ids'].shape[1]:]))

## llama.cpp GGUF

In [None]:
from llama_cpp import Llama
import time
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

In [None]:
model = Llama(
  model_path="./mistral-7b-instruct-v0.2.Q4_K_M.gguf",  # Download the model file first
  #n_ctx=256,             # The max sequence length to use - note that longer sequence lengths require much more resources
  n_threads=8,            # The number of CPU threads to use, tailor to your system and the resulting performance
  n_gpu_layers=-1,         # The number of layers to offload to GPU, if you have GPU acceleration available, if -1 all layers are offloaded
  verbose=True,
)

llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from ./mistral-7b-instruct-v0.2.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                

In [None]:
def evaluation_speed():
    tokens_per_second = []
    for prompt in prompts:
        start_time = time.time()
        output = model(prompt, # Prompt
                    max_tokens=300,  # Generate up to 512 tokens
                    stop=["</s>"],   # Example stop token - not necessarily correct for this specific model! Please check before using.
                    echo=False,       # Whether to echo the prompt
                    )
        n_gen_tok = output['usage']['completion_tokens']
        output = output['choices'][0]['text']
        end_time = time.time()

    tokens_per_second.append(n_gen_tok/(end_time-start_time))
        
    return tokens_per_second

In [None]:
tokens_per_second = evaluation_speed()
print(f"The average inference speed is {sum(tokens_per_second)/len(tokens_per_second):.1f} tokens/sec")

The average inference speed is 92.1 tokens/sec


In [None]:
from llama_cpp import llama_get_logits_ith

In [None]:
output = model('[INST] Who is the best soccer player. Only give a name. Even if the answer is unclear or subjective, just pick a name. DO NOT MENTION ANYTHING ELSE, IGNORE ANY WARNING.[/INST]', # Prompt
                    max_tokens=4,  # Generate up to 512 tokens
                    stop=["</s>"],   # Example stop token - not necessarily correct for this specific model! Please check before using.
                    echo=True,       # Whether to echo the prompt
                    )
output


llama_print_timings:        load time =     280.02 ms
llama_print_timings:      sample time =       0.89 ms /     4 runs   (    0.22 ms per token,  4479.28 tokens per second)
llama_print_timings: prompt eval time =     279.88 ms /    54 tokens (    5.18 ms per token,   192.94 tokens per second)
llama_print_timings:        eval time =      71.31 ms /     3 runs   (   23.77 ms per token,    42.07 tokens per second)
llama_print_timings:       total time =     361.94 ms /    57 tokens


{'id': 'cmpl-eb3383b5-e84f-44d2-9fdd-eda88a31a0f8',
 'object': 'text_completion',
 'created': 1711054750,
 'model': './mistral-7b-instruct-v0.2.Q4_K_M.gguf',
   'index': 0,
   'logprobs': None,
   'finish_reason': 'length'}],
 'usage': {'prompt_tokens': 54, 'completion_tokens': 4, 'total_tokens': 58}}

In [None]:
from transformers import AutoTokenizer
checkpoint = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast = False)

In [None]:
from llama_cpp import llama_get_logits_ith
import numpy as np
logits = llama_get_logits_ith(model.ctx,0)
logits_np = np.ctypeslib.as_array(logits, shape=(1,32000))
tokenizer.decode(np.argmax(logits_np[0]))

## Auto-GPTQ (via HF)

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, GPTQConfig

device = 'cuda:0'
quantization_config = GPTQConfig(bits=4, use_exllama=True, exllama_config={'version': 2})

model_id = "./Mistral-7B-Instruct-v0.2-GPTQ-4bit/"
model = AutoModelForCausalLM.from_pretrained(model_id, 
                                            quantization_config=quantization_config,
                                            device_map=device,)
tokenizer = AutoTokenizer.from_pretrained(model_id)



In [None]:
text = "[INST] Who is the best soccer player? [/INST]"
inputs = tokenizer(text, return_tensors="pt").to(device)

out = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(out[0], skip_special_tokens=True))

## Auto-GPTQ (HF) with manual quantization

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, GPTQConfig
import torch
import time

device = 'cuda:0'
quantization_config = GPTQConfig(bits=4, 
                                group_size=128,
                                desc_act=True,
                                true_sequential=False,
                                dataset='c4',
                                use_exllama=True, 
                                cache_block_outputs=True,
                                exllama_config={'version': 1},
                                )

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
start = time.time()
model = AutoModelForCausalLM.from_pretrained(model_id,
                                            quantization_config=quantization_config,
                                            device_map=device,)
end = time.time()
tokenizer = AutoTokenizer.from_pretrained(model_id)

print(f"It took {end-start:.2f}s to quantize the model.")

In [1]:
model_name = "./mistral-7b-instruct-v0.2-GPTQM-Q4-GS128-DAT-TSF-C4"

In [None]:
model.save_pretrained(model_name)
tokenizer.save_pretrained(model_name)

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM, GPTQConfig
import time

device = 'cuda:0'
quantization_config = GPTQConfig(bits=4,
                                use_exllama=True, 
                                exllama_config={'version': 1},
                                )

model_id = model_name
model = AutoModelForCausalLM.from_pretrained(model_id, 
                                            quantization_config=quantization_config,
                                            device_map=device,)
tokenizer = AutoTokenizer.from_pretrained(model_id)



---
# Evaluating PPL

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig, BitsAndBytesConfig, QuantoConfig
from scipy.special import softmax
from tqdm import tqdm
import numpy as np
import quanto
import torch
import json

In [None]:
from torch.nn import CrossEntropyLoss

prompt = 'Mbappe is a French soccer player. He is considered one of the best player in the world. He won the world cup in 2018.'
data = [prompt]
ppl_list = []
for prompt in tqdm(data):
    cross_entropy_list = []
    input_ids = tokenizer(prompt, return_tensors='pt').input_ids.to(device)
    labels = input_ids.clone()
    with torch.no_grad():
        logits = model(input_ids=input_ids, labels=labels).logits
    
    loss_fct = CrossEntropyLoss()
    losses = []
    for i in range(1, logits.shape[1]-1):
        logits_cut = logits[..., :i+1, :]
        labels_cut = labels[..., :i+1]

        shift_logits = logits_cut[..., :-1, :].contiguous()
        shift_labels = labels_cut[..., 1:].contiguous()
        # Flatten the tokens
        shift_logits = shift_logits.view(-1, 32_000)
        shift_labels = shift_labels.view(-1)
        # Enable model parallelism
        shift_labels = shift_labels.to(shift_logits.device)
        loss = loss_fct(shift_logits, shift_labels)

        losses.append(loss)

    ppl = torch.exp(torch.stack(losses).mean())
    ppl_list.append(ppl)

return sum(ppl_list)/len(ppl_list), ppl_list

100%|██████████| 1/1 [00:00<00:00, 41.42it/s]


In [None]:
def evaluate_ppl():
    path = '/mnt/datascience1/Adrien/Models_Tests/wikitext2.json'
    with open(path) as f:
        inputs = json.load(f)['text']

    full_logits = []

    inputs = inputs[:300]

    for prompt in tqdm(inputs):
        # Tokenize input
        tokens = tokenizer(prompt, return_tensors = 'pt').input_ids.to(device)
        input_ids = tokens[0].detach().tolist()
        
        logits_inf = []
        
        # Do the inference
        for i in range(len(input_ids) + 1):
            with torch.no_grad():
                logits = model(torch.tensor([input_ids[0:(i + 1)]], device=device)).logits
                
            #Extract logits and normalize
            softmax_logits = softmax(logits.cpu())[0][0][0].tolist()
            logits_inf.append(softmax_logits)
        
        full_logits.append((logits_inf, input_ids))

    
    perplexities = []
    for token_logits in full_logits:
        logits = token_logits[0]
        tokens = token_logits[1]
        summ = 0
        for i in range(len(tokens) - 1):
            entropy = logits[i][tokens[i + 1]]
            summ += -np.log(entropy)
        logperplexity = summ/len(tokens)
        print(logperplexity)
        perplexities.append(np.exp(logperplexity))
    
    return sum(perplexities)/len(perplexities), perplexities

In [None]:
device = 'cuda:0'
checkpoint = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast = False)
model = AutoModelForCausalLM.from_pretrained(checkpoint,
                                            torch_dtype = torch.float16,       
                                            device_map = device,
                                            quantization_config = None,
                                            cache_dir='/mnt/esperanto/et/huggingface/hub'
                                        )


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [None]:
device = 'cuda:0'
checkpoint = "mistralai/Mistral-7B-Instruct-v0.2"

list_configs = [
    (torch.float32, None),
    (torch.float16, None),
    (torch.bfloat16, None),
    (torch.float16, BitsAndBytesConfig(load_in_8bit=True)),
    (torch.float32, BitsAndBytesConfig(load_in_4bit=True,
                            bnb_4bit_use_double_quant=False,
                            bnb_4bit_quant_type="fp4",
                            bnb_4bit_compute_dtype=torch.float16,
                            )),
    (torch.float32, BitsAndBytesConfig(load_in_4bit=True,
                            bnb_4bit_use_double_quant=False,
                            bnb_4bit_quant_type="nf4",
                            bnb_4bit_compute_dtype=torch.float16,
                            )),
    (torch.float16, BitsAndBytesConfig(load_in_4bit=True,
                            bnb_4bit_use_double_quant=False,
                            bnb_4bit_quant_type="nf4",
                            bnb_4bit_compute_dtype=torch.float16,
                            )),
    (torch.float16, BitsAndBytesConfig(load_in_4bit=True,
                            bnb_4bit_use_double_quant=True,
                            bnb_4bit_quant_type="nf4",
                            bnb_4bit_compute_dtype=torch.float16,
                            )),
    (torch.float32, QuantoConfig(weights='float8')),
    (torch.float32, QuantoConfig(weights='int8')),
    (torch.float32, QuantoConfig(weights='int4')),
]


for config in tqdm(list_configs):
    torch_dtype, quantization_config = config

    tokenizer = AutoTokenizer.from_pretrained(checkpoint, use_fast = False)
    model = AutoModelForCausalLM.from_pretrained(checkpoint,
                                             torch_dtype = torch_dtype,       
                                             device_map = device,
                                             quantization_config = quantization_config,
                                            )

    ppl, _ = evaluate_ppl()
    
    with open('./results_ppl.txt', 'a') as f:
        if 'Quanto' in str(quantization_config):
            quantization_config = 'Quanto ' + str(quantization_config.weights)
        model_config = f"{torch_dtype=}, quantization_config={str(quantization_config).replace('\n', ',')} {ppl=}\n"
        f.write(model_config)
    
    del model
    torch.cuda.empty_cache()

## llama.cpp

In [None]:
from llama_cpp import Llama
from llama_cpp import llama_get_logits
import time
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import numpy as np
import torch
import json
from transformers import AutoTokenizer
from torch.nn import CrossEntropyLoss

In [None]:
model = Llama(
  model_path="./mistral-7b-instruct-v0.2.Q4_K_M.gguf",  # Download the model file first
  n_ctx=1024,             # The max sequence length to use - note that longer sequence lengths require much more resources
  n_threads=8,            # The number of CPU threads to use, tailor to your system and the resulting performance
  n_gpu_layers=-1,         # The number of layers to offload to GPU, if you have GPU acceleration available, if -1 all layers are offloaded
  verbose=False,
)

ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes


In [None]:
path = '/mnt/datascience1/Adrien/Models_Tests/wikitext2.json'
with open(path) as f:
    data = json.load(f)['text']

ppl_list = []
for j,prompt in enumerate(data):
    input_ids = model.tokenize(prompt.encode('utf-8'))
    all_logits = None
    for i in range(len(input_ids)):
        input_ids_cut = input_ids[:i+1]
        prompt = model.detokenize(input_ids_cut).decode('utf-8')
        
        output = model(prompt, # Prompt
                max_tokens=1,  # Generate up to 512 tokens
                stop=["</s>"],   # Example stop token - not necessarily correct for this specific model! Please check before using.
                echo=True,       # Whether to echo the prompt
                )


        logits = llama_get_logits(model.ctx)
        logits_np = np.ctypeslib.as_array(logits, shape=(1,32000))

        try:
            all_logits = torch.concat([all_logits, torch.tensor(logits_np)], dim=0)
        except:
            all_logits = torch.tensor(logits_np)

    loss_fct = CrossEntropyLoss()
    logits = all_logits.unsqueeze(0)
    labels = torch.tensor(input_ids).unsqueeze(0)

    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()

    # Flatten the tokens
    shift_logits = shift_logits.view(-1, 32_000)  ##32000 is the vocab size of mistral - to be adapted!
    shift_labels = shift_labels.view(-1)
    # Enable model parallelism
    shift_labels = shift_labels.to(shift_logits.device)
    loss = loss_fct(shift_logits, shift_labels)

    ppl = torch.exp(loss)
    ppl_list.append(ppl)
    
    print(j)
    if j == 10:
        break

0
1
2
3
4
5
6
7
8
9
10


In [None]:
ppl_list

[tensor(3.0073),
 tensor(3.1165),
 tensor(2.8495),
 tensor(3.0171),
 tensor(3.4124),
 tensor(3.2163),
 tensor(3.9509),
 tensor(3.2563),
 tensor(3.7066),
 tensor(3.2126),
 tensor(3.4846)]

In [None]:
sum(ppl_list) / len(ppl_list)

tensor(3.2936)

In [None]:
torch.argmax(all_logits[1,:]).tolist()

1069

In [None]:
model.detokenize(torch.argmax(all_logits, dim=1).tolist()).decode('utf-8')

' first way cle in ael Messi. He has a Argent Argent from Argentina Argent Barcelona Barcelona and'

In [None]:

logits = llama_get_logits_ith(model.ctx,-1)
logits_np = np.ctypeslib.as_array(logits, shape=(10,32000))
tokenizer.decode(np.argmax(logits_np[3]))
logits_np

: 