# ZKML Task 3
by Rayan Singh\
rayans2@illinois.edu\
rayanpurisingh@gmail.com

### 1. Convert Mistral 7B to a single-batch, fixed-context-length ONNX or TFLite format

In [1]:
%pip install optimum[exporters]

Note: you may need to restart the kernel to use updated packages.


#### Export Mistral-7B to ONNX

In [2]:
import os

Turn off TensorFlow's oneDNN optimizations to avoid floating-point round-off differences. The variable must be set before TensorFlow is imported.

In [3]:
os.environ['TF_ENABLE_ONEDNN_OPTS'] = '0'

#### Define Prompt + Generation Length

In [22]:
prompt = "Hello Professor Kang! My name is Rayan Singh and I am interested in joining your research group. "
text_generation_length = 100
num_batches = 1 # Single Batch

In [5]:
!optimum-cli export onnx --model mistralai/Mistral-7B-v0.1 mistral_onnx --batch_size {num_batches} --sequence_length {text_generation_length}

Framework not specified. Using pt to export the model.
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:47<00:00, 23.80s/it]
Automatic task detection to text-generation-with-past (possible synonyms are: causal-lm-with-past).
Using the export variant default. Available variants are:
    - default: The default ONNX variant.
Using framework PyTorch: 2.2.2+cu121
Overriding 1 configuration item(s)
	- use_cache -> True
  if (input_shape[-1] > 1 or self.sliding_window is not None) and self.is_causal:
  if past_key_values_length > 0:
  if seq_len > self.max_seq_len_cached:
  if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
In-place op on output of tensor.shape. See https://pytorch.org/docs/master/onnx.html#avoid-inplace-operations-when-using-tensor-shape-in-tracing-mode
In-place op on output of tensor.shape. See https://pytorch.org/docs/master/onnx.html#avoid-inplace-operations-when-using-tensor-shape-in-tracing-mode
In-place op on output of tensor.shape. See https://pytorch

#### Import Necessary Libraries

In [6]:
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForCausalLM



#### Load ONNX Model and Tokenizer. Create Inference Pipeline.

In [7]:
model_name = "mistralai/Mistral-7B-v0.1"

In [8]:
onnx_model = ORTModelForCausalLM.from_pretrained("./mistral_onnx")

In [9]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

#### Run Inference on ONNX Model

In [10]:
inputs = tokenizer(prompt, return_tensors="pt")
generated_ids = onnx_model.generate(**inputs, do_sample=False, max_length=text_generation_length)
onnx_generated_text = tokenizer.batch_decode(generated_ids)[0]

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [11]:
onnx_generated_text

'<s> Hello Professor Kang! My name is Rayan Singh and I am interested in joining your research group.  I am a senior at the University of California, Irvine, majoring in Biological Sciences with a minor in Chemistry.  I am currently working in a lab at UCI, where I am studying the effects of the protein, CPEB, on the development of the nervous system.  I am also a member of the UCI chapter of the Society for Ne'

### 2. Confirm that it matches the output of the PyTorch reference code.

In [12]:
from transformers import AutoModelForCausalLM

#### Load PyTorch Model and Tokenizer

In [13]:
pt_model = AutoModelForCausalLM.from_pretrained(model_name)
pt_model.eval()

Loading checkpoint shards: 100%|██████████| 2/2 [00:52<00:00, 26.28s/it]


MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralSdpaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): MistralRMSNorm()
        (post_attention_layernorm): MistralRMSNorm()
      )
    )
    (norm): MistralRMSNorm(

#### Run Inference on PyTorch Model

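Note: `max_length` caps the total sequence length (prompt plus generated tokens). To generate exactly `text_generation_length` new tokens, pass `max_new_tokens=text_generation_length` instead.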

In [14]:
model_inputs = tokenizer([prompt], return_tensors="pt")
generated_ids = pt_model.generate(**model_inputs, do_sample=False, max_length=text_generation_length)
pytorch_generated_text = tokenizer.batch_decode(generated_ids)[0]

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [15]:
pytorch_generated_text

'<s> Hello Professor Kang! My name is Rayan Singh and I am interested in joining your research group.  I am a senior at the University of California, Irvine, majoring in Biological Sciences with a minor in Chemistry.  I am currently working in a lab at UCI, where I am studying the effects of the protein, CPEB, on the development of the nervous system.  I am also a member of the UCI chapter of the Society for Ne'

#### Compare Generated Text

In [18]:
print(f"{prompt=},{text_generation_length=}\n\n")
print(f"PyTorch Model Generated Text: {pytorch_generated_text}\n")
print(f"ONNX Model Generated Text: {onnx_generated_text}")

prompt='Hello Professor Kang! My name is Rayan Singh and I am interested in joining your research group. ',text_generation_length=100


PyTorch Model Generated Text: <s> Hello Professor Kang! My name is Rayan Singh and I am interested in joining your research group.  I am a senior at the University of California, Irvine, majoring in Biological Sciences with a minor in Chemistry.  I am currently working in a lab at UCI, where I am studying the effects of the protein, CPEB, on the development of the nervous system.  I am also a member of the UCI chapter of the Society for Ne

ONNX Model Generated Text: <s> Hello Professor Kang! My name is Rayan Singh and I am interested in joining your research group.  I am a senior at the University of California, Irvine, majoring in Biological Sciences with a minor in Chemistry.  I am currently working in a lab at UCI, where I am studying the effects of the protein, CPEB, on the development of the nervous system.  I am also a member of the UCI chapter 
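
Comparing the two printed strings by eye is error-prone, so the check can also be done programmatically on token IDs. The following is a minimal sketch, not part of the original notebook: `first_divergence`, `onnx_ids`, and `pt_ids` are hypothetical names, and the ID lists are placeholders. In the notebook, each list would come from `generated_ids[0].tolist()`, saved under a separate name for each model, since both cells above reuse `generated_ids`.

```python
from typing import Optional, Sequence


def first_divergence(a: Sequence[int], b: Sequence[int]) -> Optional[int]:
    """Return the first index where two token-ID sequences differ.

    Returns None when the sequences are identical. If one sequence is a
    strict prefix of the other, the index of the first extra token is returned.
    """
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


# Placeholder IDs standing in for the real generated sequences (assumption).
onnx_ids = [1, 10, 11, 12]
pt_ids = [1, 10, 11, 12]

idx = first_divergence(onnx_ids, pt_ids)
if idx is None:
    print("Token sequences match exactly.")
else:
    print(f"Sequences diverge at position {idx}.")
```

With greedy decoding (`do_sample=False`), an exact token match is a reasonable expectation. If the sequences did diverge, the reported index would show where small numerical differences between the two runtimes first changed the argmax.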

### 3. Compute the number of FLOPs in the model.

In [20]:
from calflops import calculate_flops

In [24]:
flops, macs, params = calculate_flops(model=pt_model,
                                      input_shape=(num_batches, text_generation_length),
                                      transformer_tokenizer=tokenizer)
print("Mistral_7B FLOPs:%s   MACs:%s   Params:%s \n" %(flops, macs, params))

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.



------------------------------------- Calculate Flops Results -------------------------------------
Notations:
number of parameters (Params), number of multiply-accumulate operations(MACs),
number of floating-point operations (FLOPs), floating-point operations per second (FLOPS),
fwd FLOPs (model forward propagation FLOPs), bwd FLOPs (model backward propagation FLOPs),
default model backpropagation takes 2.00 times as much computation as forward propagation.

Total Training Params:                                                  7.24 B  
fwd MACs:                                                               711.04 GMACs
fwd FLOPs:                                                              1.42 TFLOPS
fwd+bwd MACs:                                                           2.13 TMACs
fwd+bwd FLOPs:                                                          4.27 TFLOPS

-------------------------------- Detailed Calculated FLOPs Results --------------------------------
Each module cacul
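
As a sanity check on the reported totals, the parameter and MAC counts can be estimated by hand from the layer shapes in the printed `MistralForCausalLM` module tree. The sketch below is a back-of-the-envelope estimate, not the `calflops` algorithm. It counts only the linear projections (one MAC per weight per token) and ignores the attention score and value matmuls, activations, and normalization. It also assumes an untied `lm_head` of shape 4096 x 32000, which does not appear in the truncated printout above.

```python
# Shapes read off the printed module tree.
HIDDEN = 4096   # embedding dim; q_proj / o_proj width
KV_DIM = 1024   # k_proj / v_proj out_features
FFN = 14336     # gate_proj / up_proj out_features
LAYERS = 32
VOCAB = 32000
SEQ_LEN = 100   # text_generation_length
BATCH = 1       # num_batches


def linear_weights_per_layer() -> int:
    """Weights in the seven bias-free Linear modules of one decoder layer."""
    attn = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_DIM  # q, o and k, v
    mlp = 3 * HIDDEN * FFN                            # gate, up, down
    return attn + mlp


def total_params() -> int:
    """Parameter count, including embeddings and RMSNorm weights."""
    embed = VOCAB * HIDDEN
    lm_head = HIDDEN * VOCAB           # assumption: untied output head
    norms = HIDDEN * (2 * LAYERS + 1)  # two norms per layer plus the final norm
    return LAYERS * linear_weights_per_layer() + embed + lm_head + norms


def forward_linear_macs(batch: int = BATCH, seq_len: int = SEQ_LEN) -> int:
    """MACs for one forward pass, linear layers only (embedding lookup excluded)."""
    per_token = LAYERS * linear_weights_per_layer() + HIDDEN * VOCAB
    return batch * seq_len * per_token


params = total_params()
macs = forward_linear_macs()
flops = 2 * macs  # one multiply and one add per MAC

print(f"params           ~ {params / 1e9:.2f} B")
print(f"fwd linear MACs  ~ {macs / 1e9:.2f} G")
print(f"fwd linear FLOPs ~ {flops / 1e12:.2f} T")
```

Under these assumptions the estimate gives about 7.24 B parameters, 711.04 GMACs, and 1.42 TFLOPs for a 1 x 100 input, which agrees with the totals reported above and suggests that the linear layers account for essentially all of the counted forward-pass work at this sequence length.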

### 4. Convert the first few layers to a ZKML framework of your choice (or write your own).