<a href="https://colab.research.google.com/github/zxfpro/work_space/blob/main/YIYIAI_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

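# 一意AI增效家 WeChat Official Account

Mixtral-8x7B-Instruct-v0.1 runs smoothly with only 12 GB of VRAM plus 11 GB of system RAM.

The model files are too large and have not finished downloading on my home computer yet, so a fully local version may follow later.

For the deployment and inference walkthrough and a breakdown of the code, follow the official account.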

## Install dependencies and libraries

In [1]:
!unzip YIYIAI_NOTEBOOKS.zip

Archive:  YIYIAI_NOTEBOOKS.zip
  inflating: notebooks/YIYIAI_demo.ipynb  
  inflating: requirements.txt        
  inflating: src/build_model.py      
  inflating: src/custom_layers.py    
  inflating: src/expert_cache.py     
  inflating: src/expert_wrapper.py   
  inflating: src/packing.py          
  inflating: src/triton_kernels.py   
  inflating: src/utils.py            


In [2]:
import numpy
from IPython.display import clear_output

# `!export` only affects a throwaway subshell; `%env` sets the variable on the kernel process.
%env LC_ALL=en_US.UTF-8
%env LD_LIBRARY_PATH=/usr/lib64-nvidia
%env LIBRARY_PATH=/usr/local/cuda/lib64/stubs
!ldconfig /usr/lib64-nvidia

!pip install -q -r requirements.txt
clear_output()
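
In a notebook, each `!` command runs in its own subshell, so an `export` issued that way is gone as soon as the command returns. The sketch below illustrates this with plain `subprocess` calls on a Linux runtime such as Colab. The variable name `YIYIAI_DEMO_VAR` is hypothetical and exists only for this demonstration.

```python
import os
import subprocess

VAR = "YIYIAI_DEMO_VAR"  # hypothetical variable name, used only for this demo
os.environ.pop(VAR, None)  # start from a clean state

# An `export` inside a child shell disappears when that shell exits.
subprocess.run(f"export {VAR}=child", shell=True, check=True)
seen_after_shell_export = os.environ.get(VAR)

# Setting the variable on the current process makes it visible
# to every child process started afterwards.
os.environ[VAR] = "parent"
child_value = subprocess.run(
    f'printf %s "${VAR}"',
    shell=True,
    capture_output=True,
    text=True,
    check=True,
).stdout

print(seen_after_shell_export, child_value)
```

Variables that must reach later cells or child processes are therefore better set with `os.environ` or the `%env` magic than with `!export`.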

In [3]:
import sys

sys.path.append("mixtral-offloading")
import torch
from torch.nn import functional as F
from hqq.core.quantize import BaseQuantizeConfig
from huggingface_hub import snapshot_download
from IPython.display import clear_output
from tqdm.auto import trange
from transformers import AutoConfig, AutoTokenizer
from transformers.utils import logging as hf_logging

from src.build_model import OffloadConfig, QuantConfig, build_model

hf_logging.disable_progress_bar()

hqq_aten package not installed. HQQBackend.ATEN backend will not work unless you install the hqq_aten lib in hqq/kernels.


## Download and initialize the model

In [7]:
# Replace the placeholder with your own Hugging Face token; `%env` (unlike `!export`) persists in the kernel.
%env HF_TOKEN=<your-hf-token>

In [None]:
model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"
quantized_model_name = "lavawolfiee/Mixtral-8x7B-Instruct-v0.1-offloading-demo"

config = AutoConfig.from_pretrained(quantized_model_name)
state_path = snapshot_download(quantized_model_name)

device = torch.device("cuda:0")

############### If the GPU has only 12 GB of VRAM, set this to 5 ###########
offload_per_layer = 4
# offload_per_layer = 5
###############################################################

num_experts = config.num_local_experts

offload_config = OffloadConfig(
    main_size=config.num_hidden_layers * (num_experts - offload_per_layer),
    offload_size=config.num_hidden_layers * offload_per_layer,
    buffer_size=4,
    offload_per_layer=offload_per_layer,
)


attn_config = BaseQuantizeConfig(
    nbits=4,
    group_size=64,
    quant_zero=True,
    quant_scale=True,
)
attn_config["scale_quant_params"]["group_size"] = 256


ffn_config = BaseQuantizeConfig(
    nbits=2,
    group_size=16,
    quant_zero=True,
    quant_scale=True,
)
quant_config = QuantConfig(ffn_config=ffn_config, attn_config=attn_config)


model = build_model(
    device=device,
    quant_config=quant_config,
    offload_config=offload_config,
    state_path=state_path,
)
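
To make the effect of `offload_per_layer` concrete, here is a minimal sketch of the arithmetic behind `main_size` and `offload_size`. It assumes the published Mixtral-8x7B shapes (32 layers, 8 experts per layer, hidden size 4096, intermediate size 14336, three weight matrices per expert). The helper names `split_experts` and `raw_expert_mib` are hypothetical and not part of `src/build_model.py`. The size estimate counts only the raw 2-bit weights and ignores quantization scales and zero points, attention weights, the KV cache, and buffers, so real usage will be higher.

```python
from typing import Tuple


def split_experts(num_layers: int, num_experts: int, offload_per_layer: int) -> Tuple[int, int]:
    """Return (experts kept on the GPU, experts offloaded to host RAM)."""
    if not 0 <= offload_per_layer <= num_experts:
        raise ValueError("offload_per_layer must be between 0 and num_experts")
    main_size = num_layers * (num_experts - offload_per_layer)
    offload_size = num_layers * offload_per_layer
    return main_size, offload_size


def raw_expert_mib(hidden_size: int, intermediate_size: int, nbits: int) -> float:
    """Raw weight payload of one expert in MiB, ignoring quantization metadata."""
    params = 3 * hidden_size * intermediate_size  # three projection matrices per expert
    return params * nbits / 8 / 2**20


# Assumed Mixtral-8x7B shapes; in the notebook these come from `config`.
NUM_LAYERS, NUM_EXPERTS = 32, 8
HIDDEN_SIZE, INTERMEDIATE_SIZE = 4096, 14336

per_expert = raw_expert_mib(HIDDEN_SIZE, INTERMEDIATE_SIZE, nbits=2)
for per_layer in (4, 5):
    on_gpu, on_cpu = split_experts(NUM_LAYERS, NUM_EXPERTS, per_layer)
    print(
        f"offload_per_layer={per_layer}: "
        f"{on_gpu} experts on GPU (~{on_gpu * per_expert / 1024:.2f} GiB raw), "
        f"{on_cpu} experts in RAM (~{on_cpu * per_expert / 1024:.2f} GiB raw)"
    )
```

Raising `offload_per_layer` from 4 to 5 moves one more expert per layer, 32 experts in total, from the GPU to host RAM. This trades VRAM for more host-to-device transfers during generation.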

## Run it!

In [None]:
from transformers import TextStreamer


tokenizer = AutoTokenizer.from_pretrained(model_name)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
past_key_values = None
sequence = None

seq_len = 0
while True:
  print("User: ", end="")
  user_input = input()
  print("\n")

  user_entry = dict(role="user", content=user_input)
  input_ids = tokenizer.apply_chat_template([user_entry], return_tensors="pt").to(device)

  if past_key_values is None:
    attention_mask = torch.ones_like(input_ids)
  else:
    seq_len = input_ids.size(1) + past_key_values[0][0][0].size(1)
    attention_mask = torch.ones([1, seq_len - 1], dtype=torch.int, device=device)

  print("Mixtral: ", end="")
  result = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    past_key_values=past_key_values,
    streamer=streamer,
    do_sample=True,
    temperature=0.9,
    top_p=0.9,
    max_new_tokens=512,
    pad_token_id=tokenizer.eos_token_id,
    return_dict_in_generate=True,
    output_hidden_states=True,
  )
  print("\n")

  sequence = result["sequences"]
  past_key_values = result["past_key_values"]

User: Write a funny poem about Python, please


Mixtral: There once was a language so bright,
named Python, a helpful little wight.
It slithered through code with ease,
No bug stood a chance, to say the least.

Its syntax so clean, and easy on the eyes,
Programmers from far and wide would shout their surprise.
But unlike its namesake, it's not sneaky or sly,
It's open source, and free to the sky!

With libraries so vast, it's a data geek's dream,
From AI to web scraping, it's the ultimate scheme.
It's the Swiss Army knife of coding, or perhaps a black belt,
In the world of programming, it's the ultimate feel.

So, whether you're a beginner or a seasoned coder,
Python's the language that will make you feel higher.
With a community so welcoming and a syntax so neat,
It's no wonder that Python is simply hard to beat!

So here's to the snake, in programming so great,
May it continue to dominate, until a much later date.
In the world of technology, it's here to stay,
Long live Python, progr