# Mixtral in Colab

Welcome! In this notebook you can run [Mixtral-8x7B-Instruct](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1) with decent generation speed **right in Google Colab or on a consumer-grade GPU**. This was made possible by quantizing the original model in mixed precision and implementing a MoE-specific offloading strategy.

To learn more, read our [tech report](https://arxiv.org/abs/2312.17238) or check out the [repo](https://github.com/dvmazur/mixtral-offloading) on GitHub.

You will need approximately 16 GB of VRAM and 11 GB of RAM to run this notebook and generate somewhat long texts.


<details>

<summary>How to balance between RAM and GPU VRAM usage</summary>

You can trade RAM usage against GPU VRAM usage by changing the <code>offload_per_layer</code> variable in the <a href="#scrollTo=_mIpePTMFyRY&line=10&uniqifier=1">Initialize model</a> section. Increasing <code>offload_per_layer</code> decreases GPU VRAM usage, increases RAM usage, and decreases generation speed. Decreasing <code>offload_per_layer</code> has the opposite effect.

Note that this notebook should run normally in Google Colab with <code>offload_per_layer = 4</code>, but may crash with other values. If you run it somewhere else, you're free to experiment with this variable.
</details>
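
To make the trade-off concrete, here is a small sketch that mirrors the `main_size` and `offload_size` expressions used in the Initialize model section. The helper name `expert_split` is ours and purely illustrative; the values 32 layers and 8 experts per layer are what we expect `config.num_hidden_layers` and `config.num_local_experts` to report for Mixtral-8x7B, so treat them as assumptions and read them from the config in practice.

```python
def expert_split(num_layers, num_experts, offload_per_layer):
    """Return (experts kept on the GPU, experts offloaded to RAM).

    Illustrative helper, not part of the repo: it only repeats the
    arithmetic that the notebook passes to OffloadConfig.
    """
    if not 0 <= offload_per_layer <= num_experts:
        raise ValueError("offload_per_layer must be between 0 and num_experts")
    main_size = num_layers * (num_experts - offload_per_layer)
    offload_size = num_layers * offload_per_layer
    return main_size, offload_size


# Assumed Mixtral-8x7B shape: 32 decoder layers, 8 experts per layer.
for offload_per_layer in (4, 5):
    on_gpu, in_ram = expert_split(32, 8, offload_per_layer)
    print(f"offload_per_layer={offload_per_layer}: {on_gpu} on GPU, {in_ram} in RAM")
```

Moving from 4 to 5 shifts one more expert per layer from the GPU to RAM, which is why it lowers VRAM usage at the cost of more transfers during generation.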

## Install and import libraries

In [1]:
from IPython.display import clear_output

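import os

# Fix Triton in Colab. Each `!export` runs in its own throwaway subshell and
# does not persist, so set the variables on the notebook process instead.
os.environ["LC_ALL"] = "en_US.UTF-8"
os.environ["LD_LIBRARY_PATH"] = "/usr/lib64-nvidia"
os.environ["LIBRARY_PATH"] = "/usr/local/cuda/lib64/stubs"
!ldconfig /usr/lib64-nvidia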

!git clone https://github.com/type-shangshu/mixtral-offloading.git --quiet
!cd mixtral-offloading && pip install -q -r requirements.txt
!huggingface-cli download lavawolfiee/Mixtral-8x7B-Instruct-v0.1-offloading-demo --quiet --local-dir Mixtral-8x7B-Instruct-v0.1-offloading-demo

clear_output()

In [6]:
!ls

Mixtral-8x7B-Instruct-v0.1-offloading-demo  mixtral-offloading


In [5]:
import sys
import os

# Get the absolute path to the project root directory
project_root = os.path.abspath(os.path.join(os.getcwd(),"mixtral-offloading"))
print(f"Project root: {project_root}")

# Add the project root to Python path
sys.path.insert(0, project_root)

# Now import other dependencies
import torch
from torch.nn import functional as F
from hqq.core.quantize import BaseQuantizeConfig
from huggingface_hub import snapshot_download
from IPython.display import clear_output
from tqdm.auto import trange
from transformers import AutoConfig, AutoTokenizer
from transformers.utils import logging as hf_logging

# Import project specific modules
from src.build_model import OffloadConfig, QuantConfig, build_model

Project root: /kaggle/working/mixtral-offloading


## Log in to Hugging Face

In [9]:
from huggingface_hub import notebook_login

notebook_login()


## Initialize model

In [None]:
# Initialize model parameters
model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"
quantized_model_name = "lavawolfiee/Mixtral-8x7B-Instruct-v0.1-offloading-demo"
state_path = "Mixtral-8x7B-Instruct-v0.1-offloading-demo"

config = AutoConfig.from_pretrained(quantized_model_name)

device = torch.device("cuda:0")

##### Change this to 5 if you have only 12 GB of GPU VRAM #####
offload_per_layer = 4
# offload_per_layer = 5
###############################################################

num_experts = config.num_local_experts

##### Eviction policy for the expert cache; change as needed #####
cache_strategy = "random"  # Options: "lru", "lfu", "random"

offload_config = OffloadConfig(
    main_size=config.num_hidden_layers * (num_experts - offload_per_layer),
    offload_size=config.num_hidden_layers * offload_per_layer,
    buffer_size=4,
    offload_per_layer=offload_per_layer,
    cache_strategy=cache_strategy,
)


attn_config = BaseQuantizeConfig(
    nbits=4,
    group_size=64,
    quant_zero=True,
    quant_scale=True,
)
attn_config["scale_quant_params"]["group_size"] = 256


ffn_config = BaseQuantizeConfig(
    nbits=2,
    group_size=16,
    quant_zero=True,
    quant_scale=True,
)
quant_config = QuantConfig(ffn_config=ffn_config, attn_config=attn_config)


model = build_model(
    device=device,
    quant_config=quant_config,
    offload_config=offload_config,
    state_path=state_path,
)
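
If the `nbits` and `group_size` settings above are unfamiliar, the sketch below shows the general idea of group-wise quantization in plain Python. It is a simplified min-max scheme written for illustration only: it is not HQQ's algorithm (HQQ uses a more elaborate optimization and can also quantize the per-group scales and zero-points, which is what `quant_scale` and `quant_zero` request), and the function names are ours.

```python
def quantize_groupwise(weights, nbits, group_size):
    """Min-max quantize a flat list of floats in groups of `group_size`.

    Illustrative only. Each group stores integer codes in [0, 2**nbits - 1]
    plus one scale and one zero-point.
    """
    levels = (1 << nbits) - 1
    groups = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for constant groups
        codes = [round((w - lo) / scale) for w in group]
        groups.append((codes, scale, lo))
    return groups


def dequantize_groupwise(groups):
    """Rebuild approximate floats from (codes, scale, zero) groups."""
    return [code * scale + zero for codes, scale, zero in groups for code in codes]


weights = [i / 10 for i in range(32)]
groups = quantize_groupwise(weights, nbits=2, group_size=16)
restored = dequantize_groupwise(groups)
print("max abs error:", max(abs(a - b) for a, b in zip(weights, restored)))
```

Fewer bits shrink the codes, while smaller groups add more scales and zero-points but track the weights more closely. That is the intuition behind pairing `nbits=2` with the small `group_size=16` for the experts.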




## Specify cache strategy

In [None]:
new_strategy = "random" #options: "lru", "lfu", "random"
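
For intuition about what the three options mean, here is a toy fixed-capacity cache with a switchable eviction policy. It is a sketch under our own assumptions, not the repo's expert cache: the class name `TinyCache`, its `get` method, and the tie-breaking behavior are all hypothetical.

```python
import random
from collections import OrderedDict


class TinyCache:
    """Toy cache that evicts by "lru", "lfu", or "random" (illustrative only)."""

    def __init__(self, capacity, strategy="lru", seed=0):
        self.capacity = capacity
        self.strategy = strategy
        self.items = OrderedDict()  # ordered from least to most recently used
        self.hits = {}              # key -> access count, used by "lfu"
        self.rng = random.Random(seed)

    def _pick_victim(self):
        if self.strategy == "lru":
            return next(iter(self.items))  # least recently used key
        if self.strategy == "lfu":
            return min(self.items, key=lambda k: self.hits[k])
        if self.strategy == "random":
            return self.rng.choice(list(self.items))
        raise ValueError(f"unknown strategy: {self.strategy}")

    def get(self, key, load):
        """Return the cached value, calling `load(key)` on a miss."""
        if key in self.items:
            self.items.move_to_end(key)
            self.hits[key] += 1
            return self.items[key]
        if len(self.items) >= self.capacity:
            victim = self._pick_victim()
            del self.items[victim]
            del self.hits[victim]
        self.items[key] = load(key)
        self.hits[key] = 1
        return self.items[key]


cache = TinyCache(capacity=2, strategy="lru")
for expert_id in [0, 1, 0, 2]:
    cache.get(expert_id, load=lambda k: f"expert-{k}")
print(sorted(cache.items))  # expert 1 was least recently used, so 0 and 2 remain
```

With `"lru"` the victim is whichever entry was touched longest ago, with `"lfu"` it is the entry with the fewest accesses, and with `"random"` it is chosen uniformly. Which policy works best depends on how often the model reuses the same experts across consecutive tokens.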

## Run the model

In [None]:
from transformers import AutoTokenizer, TextStreamer

tokenizer = AutoTokenizer.from_pretrained(model_name)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

chat_history = []

def switch_cache_strategy(model, new_strategy: str):
    """Switch cache strategy during inference"""
    for layer in model.model.layers:
        if hasattr(layer, 'block_sparse_moe'):
            if hasattr(layer.block_sparse_moe, 'expert_cache'):
                layer.block_sparse_moe.expert_cache.switch_cache_strategy(new_strategy)
    print(f"Cache strategy switched to: {new_strategy}")

switch_cache_strategy(model, new_strategy)

while True:
    print("User: ", end="")
    user_input = input()
    # print(f"{user_input}\n")

    chat_history.append(dict(role="user", content=user_input))

    input_ids = tokenizer.apply_chat_template(chat_history, return_tensors="pt").to(device)
    attention_mask = torch.ones_like(input_ids)

    print("Mixtral: ", end="")
    result = model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        streamer=streamer,
        do_sample=True,
        temperature=0.9,
        top_p=0.9,
        max_new_tokens=512,
        pad_token_id=tokenizer.eos_token_id,
        return_dict_in_generate=True,
        output_hidden_states=False,
    )
    print("\n")

    output_ids = result["sequences"][0]
    reply = tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True)

    chat_history.append(dict(role="assistant", content=reply))

Cache strategy switched to: random
User: how to implement fifo cache


Mixtral: A First-In-First-Out (FIFO) cache is a cache replacement algorithm in which the cache item that has been in the cache the longest (measured in the number of cache updates) is the first to be replaced when the cache is full and a new item needs to be added. Here is a basic outline of how to implement a FIFO cache in a systems with a fixed-size cache:

1. Initialize the cache as an empty array of fixed size, and use a separate data structure (such as a queue) to keep track of the order in which cache entries are added.
2. When an item is added to the cache, add it to the end of the array and also add it to the front of the queue.
3. When an item is to be removed from the cache to make room for a new item, remove the item at the front of the queue (the item that has been in the cache the longest) and also remove the corresponding entry from the array.
4. When an item is needed from the cache, check the array for the item. If it is present, 

KeyboardInterrupt: 