## AQLM transformers integration example

**Install the `aqlm` library**
- It is the only extra dependency needed to run AQLM models.
- Add `[gpu]` to install the required CUDA-specific dependencies.
- To use features such as `device_map`, you also need to install `accelerate`. For proper AQLM support, install the latest version directly from its GitHub repository so that it includes [PR#2376](https://github.com/huggingface/accelerate/pull/2376).

In [None]:
!pip install aqlm[gpu]==1.0.1
!pip install git+https://github.com/huggingface/accelerate.git@main
!pip install git+https://github.com/BlackSamorez/transformers.git@aqlm
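After installing, it can help to confirm which versions actually ended up in the environment, since two of the packages come from Git branches. The following is a minimal sketch using only the standard library; `installed_versions` is a hypothetical helper name, not part of `aqlm` or `transformers`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Report the three packages installed by the cell above.
for name, ver in installed_versions(["aqlm", "accelerate", "transformers"]).items():
    print(f"{name}: {ver or 'not installed'}")
```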

**Load the model as usual**

The tokenizer is just a normal `Mixtral` tokenizer.

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

quantized_model = AutoModelForCausalLM.from_pretrained(
    "BlackSamorez/Mixtral-8x7b-AQLM-2Bit-1x16-hf-test-dispatch",
    torch_dtype="auto",
    device_map="auto",
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Do a few forward passes to initialize CUDA and let the kernels compile automatically. This is done in a separate cell so that the one-time setup cost does not affect the generation speed benchmark below.

In [None]:
%%time
input_ids = tokenizer("The relationship between humans and AI  ", return_tensors="pt")["input_ids"].cuda()
output = quantized_model.generate(input_ids, min_new_tokens=128, max_new_tokens=128)
print(tokenizer.decode(output[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> The relationship between humans and AI   The relationship between humans and AI is a complex one. On the one hand, AI has the potential to improve our lives in many ways. It can help us to make better decisions, to solve problems more efficiently, and to understand the world around us better. On the other hand, AI can also be used to manipulate and control us. It can be used to manipulate our emotions, to control our thoughts, and to manipulate our behavior. The relationship between humans and AI is a complex one. On the one hand, AI has the potential to improve our lives in many ways. It can help us to make better decisions,
CPU times: user 25.4 s, sys: 97.7 ms, total: 25.5 s
Wall time: 32.1 s
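The warning in the output above appears because only `input_ids` is passed to `generate`. A possible way to avoid it is to forward the tokenizer's full output, which also contains `attention_mask`, and to set `pad_token_id` explicitly. This is a sketch rather than the notebook's own code: `generate_with_mask` is a hypothetical helper, and it assumes `model.device` is where the inputs should be placed.

```python
def generate_with_mask(model, tokenizer, prompt, **generate_kwargs):
    """Generate from `prompt`, passing the attention mask and a pad token id."""
    # The tokenizer output holds both `input_ids` and `attention_mask`.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Mixtral has no dedicated pad token, so fall back to EOS unless the caller set one.
    generate_kwargs.setdefault("pad_token_id", tokenizer.eos_token_id)
    output = model.generate(**inputs, **generate_kwargs)
    return tokenizer.decode(output[0])


# Usage with the objects defined earlier in this notebook (not run here):
# print(generate_with_mask(quantized_model, tokenizer,
#                          "The relationship between humans and AI",
#                          min_new_tokens=128, max_new_tokens=128))
```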


**Measure generation speed**

In [None]:
import textwrap

system_prompt = "A chat between a curious user and an blog writing assistant. "


def get_prompt(human_prompt):
    """Wrap the user's request in the USER/ASSISTANT chat template."""
    return f"{system_prompt}\n\nUSER: {human_prompt} \nASSISTANT: "


def remove_human_text(text):
    """Keep only the text before the next 'USER:' turn, if the model generated one."""
    return text.split("USER:", 1)[0]


def parse_text(data):
    """Print the assistant's reply from each pipeline result, wrapped to 100 columns."""
    for item in data:
        text = item["generated_text"]
        assistant_text_index = text.find("ASSISTANT:")
        if assistant_text_index != -1:
            assistant_text = text[assistant_text_index + len("ASSISTANT:"):].strip()
            assistant_text = remove_human_text(assistant_text)
            wrapped_text = textwrap.fill(assistant_text, width=100)
            print("#####", wrapped_text)
            # return assistant_text
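To see what the parsing helpers do without loading the model, the same logic can be applied to a hand-written string. The sketch below uses a hypothetical variant, `extract_assistant_text`, that returns the reply instead of printing it; the sample text is made up for illustration.

```python
def extract_assistant_text(generated_text):
    """Return the first assistant reply in `generated_text`, or None if there is none."""
    marker = "ASSISTANT:"
    start = generated_text.find(marker)
    if start == -1:
        return None
    reply = generated_text[start + len(marker):]
    # Drop any follow-up 'USER:' turn that the model may have generated.
    return reply.split("USER:", 1)[0].strip()


sample = "A chat.\n\nUSER: Say hi. \nASSISTANT: Hi there!\nUSER: Thanks"
print(extract_assistant_text(sample))  # Hi there!
```

Returning the text rather than printing it makes the helper easier to reuse, for example to count the generated tokens afterwards.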


In [None]:
from transformers import pipeline

In [None]:

pipe = pipeline(
    "text-generation",
    model=quantized_model,
    tokenizer=tokenizer,
    max_length=1200,
    temperature=0.7,
    top_p=0.95,
    do_sample=True,
)

In [None]:
%%time
prompt = '''Write a short and engaging blog post of travelling in Bohol Island.
          '''
raw_output = pipe(get_prompt(prompt))

parse_text(raw_output)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[{'generated_text': "A chat between a curious user and an blog writing assistant. \n\nUSER: Write a short and engaging blog post of travelling in Bohol Island. \n           \nASSISTANT: \nHello! I'm a travel blogger writing about Bohol Island. I'm going to share my experience of visiting Bohol Island, an island in the Philippines.\n\nBohol Island is a beautiful island located in the Philippines. It is known for its stunning beaches, lush forests, and unique wildlife. I visited the island in 2019 and was amazed by its beauty.\n\nOne of the highlights of my trip was visiting the Chocolate Hills. These hills are unique because they are made of limestone and covered in chocolate-colored grass. They are a popular tourist attraction and are a must-see for anyone visiting Bohol Island.\n\nAnother highlight of my trip was visiting the Tarsier Sanctuary. This sanctuary is home to the Philippine tarsier, a small primate that is endangered. It was a unique experience to see these animals up close

In [None]:
!nvidia-smi

Mon Feb 19 07:31:49 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   70C    P0              31W /  70W |  13719MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
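The memory figure in the table above (13719MiB of 15360MiB on a Tesla T4) can also be read programmatically. The sketch below relies on the `--query-gpu` and `--format` options of `nvidia-smi`; the function names `parse_gpu_memory` and `query_gpu_memory` are hypothetical, and the demo line only parses a string copied from the table rather than calling the GPU.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' MiB pairs, one GPU per line, into a list of int tuples."""
    usage = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        usage.append((used, total))
    return usage


def query_gpu_memory():
    """Ask nvidia-smi for used and total memory (MiB) of every visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_memory(result.stdout)


# Parsing the values reported in the table above:
print(parse_gpu_memory("13719, 15360\n"))  # [(13719, 15360)]
```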
                                                                    

In [None]:
%%time
prompt = '''Write a short python code to calculate factorial.
          '''
raw_output = pipe(get_prompt(prompt))

parse_text(raw_output)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


##### Factorial function is a mathematical function that calculates the product of all integers from 1 to
a given number.              The factorial function can be implemented in Python using the factorial
function in the math module.              Here is an example of how to calculate factorial in
Python:              import math              def factorial(n):              return
math.factorial(n)              # Example usage              print(factorial(5))              The
output will be 120.
CPU times: user 3min 13s, sys: 478 ms, total: 3min 13s
Wall time: 3min 17s


Note that `transformers` generation is not the fastest available implementation, and its speed here is heavily influenced by the CPU capabilities of the _Google Colab_ instance.
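The `%%time` cells above report wall time only. To turn that into a tokens-per-second figure, one option is to time a single generation call and divide the number of newly generated tokens by the elapsed time. This is a minimal sketch: `measure_tokens_per_second` is a hypothetical helper, and it assumes the callable returns the full token sequence (prompt plus completion).

```python
import time


def measure_tokens_per_second(generate_fn, prompt_length):
    """Time one call to `generate_fn` and return (new_tokens, seconds, tokens_per_second).

    `generate_fn` takes no arguments and returns the token ids of prompt + completion;
    `prompt_length` is the number of prompt tokens to exclude from the count.
    """
    start = time.perf_counter()
    output_ids = generate_fn()
    elapsed = time.perf_counter() - start
    new_tokens = len(output_ids) - prompt_length
    return new_tokens, elapsed, new_tokens / elapsed


# Usage with the objects defined earlier in this notebook (not run here):
# inputs = tokenizer("The relationship between humans and AI", return_tensors="pt")
# inputs = inputs.to(quantized_model.device)
# stats = measure_tokens_per_second(
#     lambda: quantized_model.generate(**inputs, max_new_tokens=128)[0],
#     inputs["input_ids"].shape[1],
# )
# print(f"{stats[2]:.2f} tokens/s")
```

Running the warm-up cell first keeps kernel compilation out of the measurement.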