### Onboarding the GPT2 LLM on the Cloud AI 100 Platform

##### Download the open-source GPT2-based Hugging Face model and save it in the local *cache* directory
###### We modify the GPT2 classes using the optimized software library to generate a model for Cloud AI 100.
###### Users can disable this optimization by passing `transform=False` in the `from_pretrained` call.
###### The generated models include the following optimizations:

* RMS Norm fixes for FP16 overflows and underflows
* Causal mask fix
* Handling of FP16 overflows
* KV cache (retention changes)
* Triu/Tril ops support
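
To make the causal mask and FP16 overflow items concrete, here is a minimal pure-Python sketch of an additive causal mask that uses a finite fill value instead of negative infinity. It is an illustration only and not the library's actual transform: the function name `causal_mask` and the fill value `-10000.0` are assumptions chosen for the example.

```python
# Illustrative sketch only; this is NOT the QEfficient implementation.
FP16_MAX = 65504.0  # largest finite float16 value


def causal_mask(seq_len, fill=-10000.0):
    """Build a seq_len x seq_len additive attention mask.

    Position i may attend to positions j <= i (the lower triangle, i.e. what
    a tril op would select). Future positions receive a large negative but
    finite value, so the mask stays representable in FP16.
    """
    if abs(fill) > FP16_MAX:
        raise ValueError("fill value is not representable in FP16")
    return [[0.0 if j <= i else fill for j in range(seq_len)] for i in range(seq_len)]


for row in causal_mask(4):
    print(row)
```

Each row `i` contains zeros up to and including column `i`, and the fill value afterwards.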

In [None]:
# Instantiate the original Transformers model
import os

from QEfficient import QEFFAutoModelForCausalLM as AutoModelForCausalLM

# Uncomment and set an appropriate cache directory for transformers if you don't want to use the default ~/.cache dir.
# os.environ["TRANSFORMERS_CACHE"] = "/local/mnt/workspace/hf_cache"

# ROOT_DIR = os.path.dirname(os.path.abspath(""))
# CACHE_DIR = os.path.join(ROOT_DIR, "tmp")  # To use a different location for just one model, pass this as cache_dir in the API call below.

# Model card name to onboard (the Hugging Face model card name), e.g. https://huggingface.co/gpt2-xl
model_name = "gpt2"  # Similarly, change the model name to generate other models, provided the library supports them.

qeff_model = AutoModelForCausalLM.from_pretrained(model_name)
print(f"{model_name} optimized for AI 100 \n", qeff_model)

##### Export the Optimized PyTorch Model to the ONNX Framework

In [None]:
import QEfficient
from QEfficient.utils import load_hf_tokenizer
# We can now export the modified model to the ONNX framework.
# This generates a single ONNX model for both the prefill and decode variations,
# optimized for the Cloud AI 100 platform.

# The export step:
#   1. generates the ONNX model and clips overflowing constants to the FP16 range,
#   2. verifies the model on ONNX Runtime against PyTorch,
#   3. generates the inputs and the custom IO YAML file required for compilation.

# KV-style models are generated with the flag kv=True.
# BERT-style models have no KV cache optimizations and are the unoptimized version.
# Using kv=True is recommended for better performance.
tokenizer = load_hf_tokenizer(model_name, use_cache=True)
base_path, onnx_path = QEfficient.export(
    model_name=model_name,
    model_kv=qeff_model,
    tokenizer=tokenizer,
    kv=True,
    form_factor="cloud",
    return_path=True,
)
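
The export comments mention clipping overflowing constants to FP16. The sketch below shows the idea with the standard library only: a constant that is safe in FP32 (such as a large negative mask value) is clamped into the finite half-precision range. The helper names `clip_to_fp16` and `fits_fp16` are hypothetical and are not part of the QEfficient API; how the library actually performs this step is not shown here.

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def clip_to_fp16(value):
    """Clamp a Python float into the finite float16 range."""
    return max(-FP16_MAX, min(FP16_MAX, value))


def fits_fp16(value):
    """Return True if value is finite and packs as half precision without overflow."""
    if not math.isfinite(value):
        return False
    try:
        struct.pack("<e", value)  # "e" is the half-precision format code
    except (OverflowError, struct.error):
        return False
    return True


# Example constants: a normal weight, the FP32 minimum (a common mask value),
# and an arbitrary large number.
constants = [1.0, -3.4028234663852886e38, 1e10]
clipped = [clip_to_fp16(c) for c in constants]
print(clipped)  # [1.0, -65504.0, 65504.0]
print([fits_fp16(c) for c in constants], [fits_fp16(c) for c in clipped])
```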

##### Compile the Optimized KV Cache Single Model on Cloud AI 100 (**Config: 16C; 32PL; 128CTX; FP16**)

In [None]:
# Use the platform SDK to check the number of cores (num_cores) available on your card.

generated_qpc_path = QEfficient.compile(
    onnx_path=onnx_path,
    num_cores=14,
    qpc_path=os.path.dirname(base_path),
    mxfp6=False,
    device_group=[0],
)
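
As a rough worked example of what the 128CTX and FP16 settings imply for the retained KV cache, the arithmetic below estimates its size for the base `gpt2` checkpoint (12 layers, 12 heads, head dimension 64). It assumes a batch size of 1 and a cache stored in FP16; the actual on-device layout and any padding are not covered, so treat the number as an estimate.

```python
def kv_cache_bytes(n_layer, n_head, head_dim, ctx_len, batch_size=1, bytes_per_elem=2):
    """Estimate KV cache size in bytes.

    The leading factor 2 accounts for one key tensor and one value tensor
    per layer; bytes_per_elem=2 corresponds to FP16.
    """
    return 2 * n_layer * batch_size * n_head * ctx_len * head_dim * bytes_per_elem


# Base gpt2 dimensions; ctx_len=128 is taken from the config in the heading.
size = kv_cache_bytes(n_layer=12, n_head=12, head_dim=64, ctx_len=128)
print(f"{size} bytes = {size / 2**20:.2f} MiB")  # 4718592 bytes = 4.50 MiB
```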

##### Execute the Optimized KV Model on H/W and Print the Latency Stats *(tok/sec)*

In [None]:
from QEfficient.generation.text_generation_inference import get_compilation_batch_size

# After compilation, we can print the latency stats for the KV model. An API is provided to print token and latency stats on AI 100.
# The compiled prefill and decode QPC is needed to compute the generated tokens. Generation uses greedy sampling.
batch_size = get_compilation_batch_size(generated_qpc_path)
QEfficient.cloud_ai_100_exec_kv(
    batch_size=batch_size,
    tokenizer=tokenizer,
    qpc_path=generated_qpc_path,
    device_id=[0],
    prompt=["My name is"],
)
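
For readers unfamiliar with greedy sampling and the tok/sec figure, the following self-contained sketch shows both in miniature. It runs on the CPU with a toy stand-in for the model: `toy_logits` and `greedy_decode` are hypothetical names invented for this example and do not reflect how `cloud_ai_100_exec_kv` is implemented.

```python
import time


def greedy_decode(next_logits, prompt_ids, max_new_tokens, eos_id=None):
    """Repeatedly pick the highest-scoring token (greedy sampling)."""
    ids = list(prompt_ids)
    generated = []
    for _ in range(max_new_tokens):
        logits = next_logits(ids)
        token = max(range(len(logits)), key=logits.__getitem__)  # argmax
        ids.append(token)
        generated.append(token)
        if token == eos_id:
            break
    return generated


def toy_logits(ids):
    """Hypothetical stand-in for a model: favors (last token + 1) mod vocab size."""
    vocab = 5
    target = (ids[-1] + 1) % vocab
    return [1.0 if t == target else 0.0 for t in range(vocab)]


start = time.perf_counter()
tokens = greedy_decode(toy_logits, prompt_ids=[0], max_new_tokens=4)
elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero interval

print(tokens)  # [1, 2, 3, 4]
print(f"{len(tokens) / elapsed:.1f} tok/sec")
```

The throughput is simply the number of generated tokens divided by the wall-clock decode time; on real hardware the prefill and decode phases are usually timed separately.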