# Mixtral-8x7B-v0.1 benchmarks - with AQLM and HQQ quantization

Vaibhav (VB) Srivastav @reach_vb

Run Mixtral 8x7B w/ ~13 GB VRAM 🤯

https://x.com/reach_vb/status/1758237703580111058?t=bD7p-kc7O9TbGttgj4Y0Fw&s=09

On a free colab too, powered by Transformers & AQLM!

AQLM is a new SOTA method for low-bitwidth LLM quantization, targeted to the “extreme” 2-3bit / parameter range.
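
As a rough sanity check on the ~13 GB figure, the sketch below estimates weight storage from a bits-per-parameter budget. The total parameter count of about 46.7 billion for Mixtral 8x7B and the helper name `estimate_weight_size_gb` are assumptions introduced here, and the estimate ignores codebooks, scales, layers kept in higher precision, and activation memory.

```python
def estimate_weight_size_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight storage, in units of 1024**3 bytes, for a given bit budget."""
    total_bytes = n_params * bits_per_param / 8
    return total_bytes / (1024 ** 3)

# Assumption: total parameter count of Mixtral 8x7B (about 46.7 billion).
MIXTRAL_8X7B_PARAMS = 46.7e9

for bits in (16, 4, 3, 2):
    size = estimate_weight_size_gb(MIXTRAL_8X7B_PARAMS, bits)
    print(f"{bits:>2} bits/param -> {size:6.1f} GB")
```

At 2 bits per parameter the estimate lands a little below the 12.20 GB measured on disk later in this notebook, which is plausible given the extra tensors the estimate leaves out.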

## Dependencies

In [19]:
pip install --upgrade aqlm[gpu]


Collecting aqlm[gpu]
  Using cached aqlm-1.1.1-py3-none-any.whl (12 kB)
Collecting torch>=2.2.0
  Using cached torch-2.2.1-cp310-cp310-manylinux1_x86_64.whl (755.5 MB)
Collecting triton>=2.1
  Using cached triton-2.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (167.9 MB)
Collecting nvidia-curand-cu12==10.3.2.106
  Using cached nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl (56.5 MB)
Collecting nvidia-cusolver-cu12==11.4.5.107
  Using cached nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl (124.2 MB)
Collecting nvidia-cuda-nvrtc-cu12==12.1.105
  Using cached nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
Collecting nvidia-cuda-runtime-cu12==12.1.105
  Using cached nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
Collecting nvidia-cuda-cupti-cu12==12.1.105
  Using cached nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
Collecting nvidia-cudnn-cu12==8.9.2.26
  Using cach

In [None]:
#pip uninstall -y flash_attn

In [None]:
pip install --upgrade hqq

In [None]:
pip install --upgrade accelerate # git+https://github.com/huggingface/accelerate.git@main

In [None]:
pip install --upgrade transformers # git+https://github.com/huggingface/transformers.git@main

In [None]:
pip install --upgrade datasets

In [1]:
import importlib.metadata
importlib.metadata.version("torch")

'2.2.1'

In [2]:
importlib.metadata.version("aqlm")

'1.1.1'

In [3]:
importlib.metadata.version("hqq")

'0.1.5'

In [None]:
importlib.metadata.version("flash_attn")

In [4]:
importlib.metadata.version("transformers")

'4.38.2'

In [5]:
importlib.metadata.version("accelerate")

'0.27.2'

In [6]:
importlib.metadata.version("datasets")

'2.18.0'
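
The per-package version checks above can also be gathered in one pass. A minimal sketch, assuming only the standard library; `package_versions` is a hypothetical helper, and it reports `None` instead of raising when a package (for example `flash_attn` after the uninstall) is missing.

```python
import importlib.metadata

def package_versions(names):
    """Return {package: version}, with None for packages that are not installed."""
    versions = {}
    for name in names:
        try:
            versions[name] = importlib.metadata.version(name)
        except importlib.metadata.PackageNotFoundError:
            versions[name] = None
    return versions

packages = ["torch", "aqlm", "hqq", "flash_attn", "transformers", "accelerate", "datasets"]
for name, version in package_versions(packages).items():
    print(f"{name:<14} {version or 'not installed'}")
```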

## BlackSamorez/Mixtral-8x7b-AQLM-2Bit-1x16-hf-test-dispatch

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer_name = "mistralai/Mixtral-8x7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

model_name = "BlackSamorez/Mixtral-8x7b-AQLM-2Bit-1x16-hf-test-dispatch"
model = AutoModelForCausalLM.from_pretrained(model_name, use_safetensors=True, torch_dtype="auto", device_map="auto", low_cpu_mem_usage=True)

In [2]:
import os
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.utils.hub import cached_file

memory_unit_gb = 1024*1024*1024

def get_directory_size(directory):
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(directory):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            total_size += os.path.getsize(fp)
    return total_size

def get_model_path_and_size_on_disk(pretrained_model_id):    
    model_config_file = cached_file(pretrained_model_id, "config.json", local_files_only=True)
    model_directory = os.path.dirname(os.path.dirname(model_config_file))    
    total_size = get_directory_size(model_directory)
    return model_directory,total_size

def display_local_cache(model_name):
    print(f"Model {model_name} downloaded in local cache:")
    path,size = get_model_path_and_size_on_disk(model_name)
    print(f"--> model files size   : {(size/memory_unit_gb):.2f} GB")
    print(f"--> stored in directory: {path}")

In [3]:
display_local_cache(model_name)

Model BlackSamorez/Mixtral-8x7b-AQLM-2Bit-1x16-hf-test-dispatch downloaded in local cache:
--> model files size   : 12.20 GB
--> stored in directory: /models/huggingface/transformers/models--BlackSamorez--Mixtral-8x7b-AQLM-2Bit-1x16-hf-test-dispatch/snapshots
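
The Hugging Face cache stores snapshot files as symlinks to shared blobs, so walking a `snapshots` directory can count the same blob more than once when several revisions are cached. A minimal sketch of a size function that counts each underlying file once; `get_unique_directory_size` is a hypothetical helper, not part of `transformers`.

```python
import os

def get_unique_directory_size(directory):
    """Sum file sizes under `directory`, counting each real file only once."""
    seen = set()
    total_size = 0
    for dirpath, _dirnames, filenames in os.walk(directory):
        for name in filenames:
            # Resolve symlinks so two links to the same blob share one key
            real_path = os.path.realpath(os.path.join(dirpath, name))
            if real_path in seen or not os.path.isfile(real_path):
                continue
            seen.add(real_path)
            total_size += os.path.getsize(real_path)
    return total_size
```

It could replace `get_directory_size` in `get_model_path_and_size_on_disk` without other changes, since it takes the same argument and returns a byte count.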


## mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-3bit-metaoffload-HQQ

In [7]:
!git clone https://github.com/mobiusml/hqq/

Cloning into 'hqq'...
remote: Enumerating objects: 458, done.
remote: Counting objects: 100% (273/273), done.
remote: Compressing objects: 100% (158/158), done.
remote: Total 458 (delta 160), reused 201 (delta 109), pack-reused 185
Receiving objects: 100% (458/458), 157.49 KiB | 6.30 MiB/s, done.
Resolving deltas: 100% (242/242), done.


In [None]:
!source .venv/bin/activate && cd hqq/hqq/kernels && python setup_cuda.py install

In [1]:
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer

model_name = 'mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-3bit-metaoffload-HQQ'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = HQQModelForCausalLM.from_quantized(model_name)



hqq_aten package available. Set backend to HQQBackend.ATEN for faster inference and HQQBackend.ATEN_BACKPROP for faster training!


Fetching 8 files:   0%|          | 0/8 [00:00<?, ?it/s]

100%|██████████| 32/32 [00:01<00:00, 17.20it/s]
100%|██████████| 32/32 [00:00<00:00, 293.15it/s]


In [2]:
# Optional: set the backend / compile
# The CUDA kernels must be installed beforehand
from hqq.core.quantize import *
HQQLinear.set_backend(HQQBackend.ATEN)

In [16]:
# ERROR
#import torch
#model = torch.compile(model)

## CUDA memory

In [3]:
import torch
from datetime import datetime
from IPython.display import HTML
import pickle

memory_unit = 1024*1024
total_memory = torch.cuda.get_device_properties(0).total_memory

def display_memory():
    print(torch.cuda.get_device_name(0))
    print(f"Total    : {(total_memory/memory_unit):8,.1f} MB")
    print("------------------------------")
    free_memory = torch.cuda.mem_get_info()[0]
    reserved_memory = torch.cuda.memory_reserved(0)
    used_memory = torch.cuda.memory_allocated(0)    
    max_used_memory = torch.cuda.max_memory_allocated(0)
    overhead_memory = total_memory - free_memory - reserved_memory
    print(f"Overhead : {(overhead_memory/memory_unit):8,.1f} MB - {int(overhead_memory/total_memory*100):3} %")
    print(f"Reserved : {(reserved_memory/memory_unit):8,.1f} MB - {int(reserved_memory/total_memory*100):3} %")
    print(f"Free     : {(free_memory/memory_unit):8,.1f} MB - {int(free_memory/total_memory*100):3} %")
    print("------------------------------")
    print(f"Used     : {(used_memory/memory_unit):8,.1f} MB - {int(used_memory/total_memory*100):3} %")
    print(f"Max used : {(max_used_memory/memory_unit):8,.1f} MB - {int(max_used_memory/total_memory*100):3} %")
    
def display_memory_summary():
    print(torch.cuda.memory_summary())
    
def release_cached_memory():
    torch.cuda.empty_cache()
    
def reset_peak_memory_stats():
    torch.cuda.reset_peak_memory_stats()

def record_memory_history(enabled):
    torch.cuda.memory._record_memory_history(enabled=enabled)
    
def dump_memory_snapshot():
    filename_datetime = datetime.now().strftime("%Y%m%d_%H%M%S")
    filename = f"memory_snapshot_{filename_datetime}.pickle"
    s = torch.cuda.memory._snapshot(0)
    with open(filename, "wb") as f:
        pickle.dump(s, f)
    print(f"Dumped memory snapshot to file: {filename}")

# https://zdevito.github.io/2022/08/16/memory-snapshots.html
# https://zdevito.github.io/2022/12/09/memory-traces.html

def display_memory_snapshot():
    url = "https://pytorch.org/memory_viz"
    return HTML(f"Call dump_memory_snapshot(), <a href='{url}' target='_blank'>click here to open PyTorch memory viz</a>, then drag and drop the snapshot file")
    
display_memory()

NVIDIA GeForce RTX 4090
Total    : 24,563.5 MB
------------------------------
Overhead :  1,592.1 MB -   6 %
Reserved : 21,794.0 MB -  88 %
Free     :  1,177.4 MB -   4 %
------------------------------
Used     : 20,513.8 MB -  83 %
Max used : 20,513.8 MB -  83 %
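
The `Overhead` line is derived rather than measured: it is the part of device memory that is neither free nor reserved by PyTorch's caching allocator (for example the CUDA context or other processes). A small sketch that reproduces the breakdown from the three measured values above; `memory_breakdown_mb` is a hypothetical helper.

```python
def memory_breakdown_mb(total_mb, free_mb, reserved_mb):
    """Split device memory into overhead / reserved / free with integer percentages."""
    overhead_mb = total_mb - free_mb - reserved_mb
    parts = {"Overhead": overhead_mb, "Reserved": reserved_mb, "Free": free_mb}
    return {name: (round(mb, 1), int(mb / total_mb * 100)) for name, mb in parts.items()}

# Values printed by display_memory() above, in MB
for name, (mb, pct) in memory_breakdown_mb(24563.5, 1177.4, 21794.0).items():
    print(f"{name:<9}: {mb:8,.1f} MB - {pct:3} %")
```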


## Chat test

In [3]:
import transformers 
from threading import Thread

def chat_processor(chat, max_new_tokens=100, do_sample=True):
    tokenizer.use_default_system_prompt = False
    streamer = transformers.TextIteratorStreamer(tokenizer, timeout=10.0, skip_prompt=True, skip_special_tokens=True)

    generate_params = dict(
        # The tokenizer prepends the <s> BOS token itself, so it is left out of the prompt string
        tokenizer("[INST] " + chat + " [/INST] ", return_tensors="pt").to('cuda'),
        streamer=streamer,
        max_new_tokens=max_new_tokens,
        do_sample=do_sample,
        top_p=0.90,
        top_k=50,
        temperature= 0.6,
        num_beams=1,
        repetition_penalty=1.2,
    )

    t = Thread(target=model.generate, kwargs=generate_params)
    t.start()
    outputs = []
    for text in streamer:
        outputs.append(text)
        print(text, end="", flush=True)

    return outputs

In [12]:
outputs = chat_processor("présente-moi les missions du CIC banque privée comme un récit homérique", max_new_tokens=1000, do_sample=True)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Écoutez, noble voyageur, l'histoire héroïque et épique de la mission du CIC Banque Privée. Comme Ulysse dans son périple vers Ithaque, ce guerrier des temps modernes entreprend un long voyage à travers les mers agitées de la finance pour accompagner ses clients vers leur destination ultime : la prospérité et la sécurité financière.

Au début de cette odyssée moderne se trouve le conseiller en gestion de patrimoine, un guide sage et expérimenté qui aide les clients à naviguer les eaux troubles de la planification financière. Il est armé de sa connaissance approfondie des marchés financiers et de sa capacité à comprendre les besoins spécifiques de chaque client. Avec une écoute attentive et un dévouement sans faille, il offre des solutions sur mesure pour atteindre leurs objectifs financiers.

Dans sa quête pour offrir le meilleur service possible, le CIC Banque Privée doit faire face à de nombreux défis, tels que les tempêtes économiques imprévues ou les monstres fiscaux qui menacent le

In [14]:
len([token for token in outputs if token!='']) / 115

3.034782608695652

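Generation speed: about 3 non-empty text chunks per second, assuming the divisor 115 is the elapsed time in seconds. The streamer emits text at word boundaries, so this roughly counts words and the actual token rate is likely higher.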
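
The speed above relies on a hard-coded divisor and on counting streamed chunks. A hedged sketch of a helper that times the stream itself and counts tokens with a caller-supplied function; `measure_stream_throughput` and `count_tokens` are hypothetical names, and the commented usage assumes the `tokenizer` and `streamer` objects from `chat_processor`.

```python
import time

def measure_stream_throughput(stream, count_tokens):
    """Consume a text stream and return (chunks, tokens_per_second)."""
    start = time.perf_counter()
    chunks = [text for text in stream]
    elapsed = time.perf_counter() - start
    n_tokens = sum(count_tokens(text) for text in chunks if text)
    return chunks, (n_tokens / elapsed if elapsed > 0 else float("nan"))

# Possible use inside the notebook (not run here):
#   count = lambda text: len(tokenizer(text, add_special_tokens=False).input_ids)
#   chunks, tps = measure_stream_throughput(streamer, count)

# Stand-alone demonstration with a fake stream and whitespace-separated "tokens"
def fake_stream():
    for word in ["Hello ", "from ", "a ", "fake ", "stream"]:
        time.sleep(0.01)
        yield word

chunks, tps = measure_stream_throughput(fake_stream(), lambda text: len(text.split()))
print(f"{len(chunks)} chunks, {tps:.1f} tokens/sec")
```

Re-tokenizing each chunk separately may not give exactly the number of generated token ids, so the result is an approximation; it also includes the time to first token.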

## Perplexity test

In [4]:
with open("/workspace/hftoken", 'r') as file:
    myhftoken = file.read().strip()

In [5]:
from datasets import load_dataset

dataset_name_fr = "frenchtext/banque-fr-2311"
dataset_fr = load_dataset(dataset_name_fr, token=myhftoken)

#dataset_name_en = "frenchtext/bank-en-2401"
#dataset_en = load_dataset(dataset_name_en, token=myhftoken)

#dataset_name_de = "frenchtext/bank-de-2401"
#dataset_de = load_dataset(dataset_name_de, token=myhftoken)

#dataset_name_es = "frenchtext/bank-es-2401"
#dataset_es = load_dataset(dataset_name_es, token=myhftoken)

dataset_name = dataset_name_fr
split = "valid"
dataset = dataset_fr[split]

Resolving data files:   0%|          | 0/42 [00:00<?, ?it/s]




Loading dataset shards:   0%|          | 0/24 [00:00<?, ?it/s]


In [6]:
def get_dataset_batches(dataset, batch_size=32):
    filtered_dataset = dataset.filter(lambda example: example["Words"]>15)
    sorted_dataset = filtered_dataset.sort("Words",reverse=True)
    
    dataset_length = len(sorted_dataset)
    for start_idx in range(0, dataset_length, batch_size):
        end_idx = min(start_idx + batch_size, dataset_length)
        yield sorted_dataset[start_idx:end_idx]

def get_encoding_offsets(encoding):
    start_index = encoding.offsets[0][0]
    end_index = encoding.offsets[-1][1]
    if end_index==0: end_index = -1
    return (start_index, end_index)

def encode_dataset_batch(tokenizer, dataset_batch, stride=256):
    encodings = tokenizer(text = dataset_batch["Text"], add_special_tokens=True, 
                      padding="longest", truncation=True, return_overflowing_tokens=True, stride=stride,
                      # Tensor Core shape guidance (2020): https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html#tensor-core-shape
                      # As of 2023 this matters less: newer drivers and CUDA versions can use Tensor Cores even without this aligned padding
                      pad_to_multiple_of=16, return_tensors="pt")

    encodings["overflow_to_sample_uri"] = list(map(lambda sample_id: dataset_batch["Uri"][sample_id.item()], encodings["overflow_to_sample_mapping"]))
    encodings["overflow_to_sample_offset"] = list(map(get_encoding_offsets, encodings.encodings))
    
    return encodings

def get_encodings_batches(tokenizer, dataset, batch_size=32, stride=256):
    for dataset_batch in get_dataset_batches(dataset, batch_size):
        encodings = encode_dataset_batch(tokenizer, dataset_batch, stride)
        
        encodings_length = len(encodings.encodings)
        for start_idx in range(0, encodings_length, batch_size):
            end_idx = min(start_idx + batch_size, encodings_length)
            yield {key: encodings[key][start_idx:end_idx] for key in encodings.data.keys()}
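
With `return_overflowing_tokens=True`, a text longer than the maximum length is split into several windows, and `stride` is the number of tokens shared by consecutive windows. The pure-Python sketch below mimics that windowing on a list of token ids; it ignores special tokens and padding, and `sliding_windows` is a hypothetical helper rather than the tokenizer's implementation.

```python
def sliding_windows(token_ids, max_length, stride):
    """Split token_ids into windows of at most max_length that overlap by `stride` tokens."""
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    step = max_length - stride
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break
        start += step
    return windows

for window in sliding_windows(list(range(10)), max_length=4, stride=2):
    print(window)
```

Since the perplexity code in this notebook masks only padding, tokens in the overlap appear to be scored once per window that contains them.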

In [7]:
import torch.nn.functional as F

class PPLu():
    
    def __init__(self, dataset_iterator, tokenizer, device):
        if hasattr(tokenizer,"vocab"):
            self.vocab_size = len(tokenizer.vocab)
        else:
            self.vocab_size = tokenizer.vocab_size
        dataset_token_id_counts = torch.zeros(self.vocab_size+1, dtype=torch.int64)
        dataset_tokens_count = 0
        
        for idx,dataset_batch in enumerate(dataset_iterator):
            encodings = tokenizer(text = dataset_batch["Text"], add_special_tokens=True, padding="longest", return_tensors="pt")
            
            # Padding tokens should be ignored: count them as token_id=vocabulary_size
            token_ids = encodings.input_ids*encodings.attention_mask + self.vocab_size*(1-encodings.attention_mask)
            
            token_id_counts = torch.bincount(token_ids.view(-1), minlength=self.vocab_size+1)
            tokens_count = encodings.attention_mask.sum()

            dataset_token_id_counts += token_id_counts
            dataset_tokens_count += tokens_count
            if idx%100==9: print(f"... {dataset_tokens_count:,} tokens")
        
        # Then discard the tokens count for token_id=vocabulary_size
        self.token_id_probs =  (dataset_token_id_counts[:-1] / dataset_tokens_count).unsqueeze(1).to(device)
        self.perplexity_loss = torch.nn.CrossEntropyLoss(ignore_index=-100, reduction="none")
        print(f"Done: {dataset_tokens_count:,} tokens")

    def __call__(self, input_ids, attention_mask, output_logits):
        # Next-token prediction: shift prediction scores and input ids by one
        logits = output_logits[:, :-1, :].permute(0, 2, 1).contiguous()
        labels = input_ids[:, 1:].contiguous()
        # Despite its name, this mask is 1 for real tokens (scored) and 0 for padding (ignored)
        labels_to_ignore = attention_mask[:, 1:]

        # Number of tokens predicted, ignoring padding tokens
        predicted_tokens_count = labels_to_ignore.sum(dim=1)
        
        # Cross entropy loss (ignore_index=-100)
        labels_for_crossentropy = labels*labels_to_ignore -100*(1-labels_to_ignore)
        batch_perplexity_losses = (1/predicted_tokens_count)*self.perplexity_loss(logits, labels_for_crossentropy).sum(1)
        
        # Unigram probability loss
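        # squeeze(-1) drops only the trailing embedding dimension, so a batch of size 1 keeps its batch dimension
        labels_probs = F.embedding(labels, self.token_id_probs).squeeze(-1)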
        # prob = 1 for padding tokens => log prob = 0, ignored in the sum below
        labels_probs = labels_probs*labels_to_ignore + (1-labels_to_ignore) 
        batch_unigram_losses = -(1/predicted_tokens_count)*torch.log(labels_probs).sum(dim=1)
        
        # Unigram-normalized perplexities
        perplexities = torch.exp(batch_perplexity_losses)
        unigram_normalized_perplexities = torch.exp(batch_perplexity_losses - batch_unigram_losses)
        
        return predicted_tokens_count, batch_perplexity_losses, batch_unigram_losses, perplexities, unigram_normalized_perplexities

class NormalizedPerplexityLogger:
    def __init__(self, dataset_name, split, model_name):
        self.filename = f"{dataset_name.replace('/','_')}_{split}_{model_name.replace('/','_')}_pplu.csv"
        self.file = open(self.filename, 'w')
        
    def log_batch(self, ppl, pplu, uri, span):
        self.file.write(f"{ppl},{pplu},{uri},{span}\n")
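
Written out, the quantities returned by `PPLu.__call__` for one sequence of $N$ predicted tokens are as follows. This is a reading of the code above, with $p_\theta$ denoting the model's next-token distribution and $p_{\mathrm{uni}}$ the unigram frequencies counted over the dataset (both symbols are introduced here).

```latex
\begin{aligned}
\mathcal{L}_{\mathrm{model}} &= -\frac{1}{N}\sum_{i=1}^{N} \log p_\theta(t_i \mid t_{<i}),
&\qquad
\mathcal{L}_{\mathrm{uni}} &= -\frac{1}{N}\sum_{i=1}^{N} \log p_{\mathrm{uni}}(t_i), \\[4pt]
\mathrm{PPL} &= \exp\left(\mathcal{L}_{\mathrm{model}}\right),
&\qquad
\mathrm{PPLu} &= \exp\left(\mathcal{L}_{\mathrm{model}} - \mathcal{L}_{\mathrm{uni}}\right)
= \frac{\mathrm{PPL}}{\exp\left(\mathcal{L}_{\mathrm{uni}}\right)}.
\end{aligned}
```

Dividing by the unigram perplexity is what makes the values small, which is why the results are reported multiplied by 1000.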

In [8]:
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Optimize perf on RTX 4090
tokenizer.model_max_length = 8192
    
print(f"Computing perplexity on dataset {dataset_name}:{split} for {model_name}")
print(f"- model vocabulary: {len(tokenizer.vocab)}")
print(f"- model sequence length: {int(tokenizer.model_max_length)}")
print(f"- model torch dtype: {model.dtype}")

Computing perplexity on dataset frenchtext/banque-fr-2311:valid for mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-3bit-metaoffload-HQQ
- model vocabulary: 32000
- model sequence length: 8192
- model torch dtype: torch.float16


In [9]:
%%time
pplu_loss = PPLu(get_dataset_batches(dataset), tokenizer, model.device)

Token indices sequence length is longer than the specified maximum sequence length for this model (335203 > 8192). Running this sequence through the model will result in indexing errors


... 5,765,378 tokens
... 12,887,127 tokens
... 15,169,531 tokens
Done: 15,453,930 tokens
CPU times: user 39.8 s, sys: 12.3 s, total: 52.1 s
Wall time: 10.4 s


In [10]:
batch_size = 1
stride = 256

print(f"- dataset examples: {len(dataset)}")
print(f"- batch_size={batch_size}, stride={stride}")

- dataset examples: 8522
- batch_size=1, stride=256


In [11]:
%%capture
output = model.generate(tokenizer("", return_tensors="pt")["input_ids"].cuda(), max_new_tokens=10)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [None]:
%%time
import math

logger = NormalizedPerplexityLogger(dataset_name, split, model_name)

def display_perplexities(pred_tokens_count, ppl_losses, unigram_losses):        
    pt_pred_tokens_count = torch.Tensor(pred_tokens_count)
    total_pred_tokens_count = pt_pred_tokens_count.sum().item()
    
    pt_ppl_losses = torch.Tensor(ppl_losses)
    pt_unigram_losses = torch.Tensor(unigram_losses)    
    pt_pplu_losses = pt_ppl_losses - pt_unigram_losses

    ppl = math.exp((pt_ppl_losses*pt_pred_tokens_count).sum().item() / total_pred_tokens_count)
    pplu = math.exp((pt_pplu_losses*pt_pred_tokens_count).sum().item() / total_pred_tokens_count)

    print(f"-> perplexity = {ppl:.3f}")
    print(f"-> unigram-normalized perplexity = {pplu*1000:.3f} (x1000)")
    
pred_tokens_count = [] 
ppl_losses = []   
unigram_losses = [] 
for idx,encodings_batch in enumerate(get_encodings_batches(tokenizer, dataset, batch_size=batch_size, stride=stride)):
    with torch.no_grad():
        # predict next token
        inputs = encodings_batch["input_ids"].to(model.device)
        attention_mask = encodings_batch["attention_mask"].to(model.device)
        outputs = model(input_ids=inputs, attention_mask=attention_mask, use_cache=False, output_attentions=False, output_hidden_states=False)

        batch_pred_tokens_count, batch_ppl_losses, batch_unigram_losses, batch_ppl, batch_pplu = pplu_loss(inputs, attention_mask, outputs.logits)
        
        pred_tokens_count.extend(batch_pred_tokens_count.tolist())
        ppl_losses.extend(batch_ppl_losses.tolist())
        unigram_losses.extend(batch_unigram_losses.tolist())

    for ppl,pplu,uri,span in zip(batch_ppl.tolist(), batch_pplu.tolist(), encodings_batch["overflow_to_sample_uri"], encodings_batch["overflow_to_sample_offset"]):
        logger.log_batch(ppl, pplu, uri, span)

    #if idx%10 == 0:
    print(f"{(idx+1)*batch_size} encodings processed")
    display_perplexities(pred_tokens_count, ppl_losses, unigram_losses)

print(f"FINAL RESULT: {(idx+1)*batch_size} encodings processed")
display_perplexities(pred_tokens_count, ppl_losses, unigram_losses)

display_memory()

Computing perplexity on dataset frenchtext/banque-fr-2311:valid for mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-3bit-metaoffload-HQQ
- model vocabulary: 32000
- model sequence length: 8192
- model torch dtype: torch.float16
- dataset examples: 8522
- batch_size=1, stride=256
- 15,453,930 tokens in 10.4 s
- perplexity = 3.102
- unigram-normalized perplexity = 4.041 (x1000)

Max used : 23,865.6 MB -  97 %
Wall time: 5h 39min 3s
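
The final figures are token-weighted: `display_perplexities` multiplies each per-sequence mean loss by its token count before averaging, so long windows weigh more than short ones. A minimal pure-Python sketch of that aggregation with made-up numbers; `corpus_perplexity` is a hypothetical helper.

```python
import math

def corpus_perplexity(mean_losses, token_counts):
    """Token-weighted perplexity from per-sequence mean losses (in nats)."""
    total_tokens = sum(token_counts)
    if total_tokens == 0:
        raise ValueError("no tokens to aggregate")
    weighted_loss = sum(loss * n for loss, n in zip(mean_losses, token_counts))
    return math.exp(weighted_loss / total_tokens)

# Illustrative values only: a short easy sequence and a longer, harder one
losses = [math.log(2.0), math.log(4.0)]
counts = [1, 3]
print(f"token-weighted perplexity: {corpus_perplexity(losses, counts):.3f}")
print(f"unweighted perplexity    : {math.exp(sum(losses) / len(losses)):.3f}")
```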

Comparison with the unquantized version (unigram-normalized perplexity 3.967 (x1000), batch size 6, wall time 1h 30min):
- unigram-normalized perplexity: only +1.86 %
- wall time: 3.76x longer
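
The two ratios can be recomputed from the figures reported above. A short sketch; the helper names are hypothetical and the inputs are copied from the results in this notebook.

```python
def relative_increase_percent(new, reference):
    """Relative change of `new` with respect to `reference`, in percent."""
    return (new / reference - 1.0) * 100.0

def to_seconds(hours=0, minutes=0, seconds=0):
    return hours * 3600 + minutes * 60 + seconds

pplu_quantized, pplu_unquantized = 4.041, 3.967  # both (x1000)
wall_quantized = to_seconds(hours=5, minutes=39, seconds=3)
wall_unquantized = to_seconds(hours=1, minutes=30)

print(f"pplu increase: {relative_increase_percent(pplu_quantized, pplu_unquantized):+.2f} %")
print(f"wall time    : {wall_quantized / wall_unquantized:.2f}x")
```

The two runs used different batch sizes (1 for the quantized model, 6 for the unquantized one), so the wall-time ratio is not a like-for-like speed comparison.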

Computing perplexity on dataset frenchtext/bank-en-2401:valid for lightonai/alfred-40b-1023
- model vocabulary: 65024
- model sequence length: 8192
- model torch dtype: torch.bfloat16
- dataset examples: 2555
- batch_size=4, stride=256
- 9,243,621 tokens in 14 sec
- perplexity = 4.690
- unigram-normalized perplexity = 6.273 (x1000)

Wall time: 2h 22min

Computing perplexity on dataset frenchtext/banque-fr-2311:valid for lightonai/alfred-40b-1023
- model vocabulary: 65024
- model sequence length: 8192
- model torch dtype: torch.bfloat16
- dataset examples: 8522
- batch_size=4, stride=256
- 13,622,486 tokens in 14 sec
- perplexity = 3.- 5
- unigram-normalized perplexity = 4.098 (x1000)

Wall time: 3h 17min

## Generation speed test with vLLM

In [None]:
pip install vllm==0.2.2

In [1]:
import importlib.metadata
importlib.metadata.version("torch")

'2.1.2'

In [2]:
importlib.metadata.version("vllm")

'0.2.2'

In [3]:
importlib.metadata.version("hqq")

'0.1.5'

In [4]:
from hqq.engine.vllm import HQQLLM
from hqq.core.quantize import HQQLinear, HQQBackend

model_name = 'mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-3bit-metaoffload-HQQ'
model = HQQLLM.from_quantized(model_name)

HQQLinear.set_backend(HQQBackend.ATEN_BACKPROP)



Langchain not installed. You can install it via "pip install langchain"


Fetching 8 files:   0%|          | 0/8 [00:00<?, ?it/s]

AssertionError: Model architecture MixtralForCausalLM not supported yet.