✨ This notebook can run on an NVIDIA Tesla T4 GPU, for example the **free** T4 instance on Google Colab. To run it, select **Runtime**, then **Run all**.

This notebook demonstrates how to:

<div>
<ul style="list-style-type: '✅ ';">
<li>Fine-tune a TinyLlama model using the QLoRA technique</li>
<li>Optimize the model for inference with the ONNX Runtime</li>
<li>Save the model and adapters separately for multi-LoRA model serving with ONNX Runtime</li>
<li>Run the model using the ONNX Runtime</li>
</ul>
</div>

In [None]:
%%capture

!pip install git+https://github.com/microsoft/Olive
!pip install transformers accelerate bitsandbytes datasets trl protobuf einops peft numpy optimum onnxruntime-gpu flash-attn
!pip install onnxruntime-genai-cuda --index-url=https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/
!sudo apt-get -y install cudnn9-cuda-12

In this example, you'll fine-tune a phrase classifier: given an English phrase, the model classifies it into one of four tone categories (joy, surprise, fear, or sadness). The dataset, which is available on Hugging Face, is shown below.

In [8]:
from datasets import load_dataset

dataset = load_dataset("xxyyzzz/phrase_classification")
dataset['train'].to_pandas().head()

Unnamed: 0,phrase,tone
0,I'm thrilled to start my new job!,joy
1,I can't believe I lost my keys again.,surprise
2,This haunted house is terrifying!,fear
3,Winning the lottery is a dream come true.,joy
4,Missing the concert is really disappointing.,sadness


Next, run the `olive finetune` command. This single command not only fine-tunes the model but also optimizes it to run with good quality and performance on the [ONNX Runtime](https://onnxruntime.ai/).

🧠 Olive supports the following models out of the box: Phi, Llama, Mistral, Gemma, and Qwen. You will need more compute power to fine-tune these models.

In [None]:
# Alternative models (these need more compute); swap one in for --model_name_or_path:
#   microsoft/Phi-3.5-mini-instruct
#   meta-llama/Meta-Llama-3.1-8B-Instruct
#   Qwen/Qwen2-VL-2B-Instruct
# They are kept outside the command so that a "#" line cannot interrupt
# the backslash line continuations and drop the arguments that follow it.
!olive finetune \
    --method qlora \
    --model_name_or_path TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --trust_remote_code \
    --use_ort_genai \
    --data_name xxyyzzz/phrase_classification \
    --text_template "<|user|>\n{phrase}</s>\n<|assistant|>\n{tone}</s>" \
    --max_steps 50
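
To make the `--text_template` argument concrete, the sketch below renders the template against the first dataset row shown earlier. It assumes that Olive fills `{phrase}` and `{tone}` from the dataset columns of the same name, much like Python's `str.format`, and that `\n` ends up as a real newline; both are assumptions rather than confirmed Olive behavior. The `render_example` helper is hypothetical and exists only for illustration.

```python
# Template copied from the --text_template argument of the command above.
TEXT_TEMPLATE = "<|user|>\n{phrase}</s>\n<|assistant|>\n{tone}</s>"

# Prompt format used by the inference cell later in this notebook.
CHAT_TEMPLATE = "<|user|>\n{input}</s>\n<|assistant|>"


def render_example(row: dict) -> str:
    """Hypothetical helper: fill the training template from a row's columns."""
    return TEXT_TEMPLATE.format(**row)


# First row of the dataset preview shown earlier.
row = {"phrase": "I'm thrilled to start my new job!", "tone": "joy"}

rendered = render_example(row)
prompt = CHAT_TEMPLATE.format(input=row["phrase"])

print(rendered)
# The inference prompt should be a prefix of the training text, so the
# fine-tuned model only has to continue with the tone label.
print(rendered.startswith(prompt))
```

Keeping the inference prompt as a prefix of the training text is the design point here: whatever follows `<|assistant|>` at generation time corresponds to the `{tone}` slot seen during fine-tuning.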

🧪 Test the model in the ONNX Runtime. You will be prompted to enter an input. Here are some phrases to try:

- "Cricket is a great game"
- "I was taken aback by the size of the whale"
- "there was concern about the dark lighting on the street"

In [None]:
import onnxruntime_genai as og
import numpy as np
import os

model_folder = "optimized-model"

print("loading model and weights...", end="")
model = og.Model(model_folder)
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

weights_file = os.path.join(model_folder, "adapter_weights.npz")
adapter_weights = np.load(weights_file)
print("done!")

# Set the max length to something sensible by default,
# since otherwise it will be set to the entire context length
search_options = {}
search_options['max_length'] = 200
search_options['past_present_share_buffer'] = False

chat_template = "<|user|>\n{input}</s>\n<|assistant|>"

text = input("Input: ")

while text != "exit":
  if not text:
    # Re-prompt instead of generating from an empty input
    print("Error, input cannot be empty")
    text = input("Input: ")
    continue

  prompt = chat_template.format(input=text)

  input_tokens = tokenizer.encode(prompt)

  params = og.GeneratorParams(model)
  # Pass each LoRA adapter tensor to the model as a named input
  for key in adapter_weights.keys():
    params.set_model_input(key, adapter_weights[key])
  params.set_search_options(**search_options)
  params.input_ids = input_tokens
  generator = og.Generator(model, params)


  print("Output: ", end='', flush=True)

  try:
    while not generator.is_done():
      generator.compute_logits()
      generator.generate_next_token()

      new_token = generator.get_next_tokens()[0]
      print(tokenizer_stream.decode(new_token), end='', flush=True)
  except KeyboardInterrupt:
    print("  --control+c pressed, aborting generation--")

  print()
  text = input("Input: ")

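# Release native resources; generator only exists if at least one prompt was run
if "generator" in globals():
  del generator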
del model
del tokenizer
del tokenizer_stream
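
The model emits free-form text, so code that consumes its output may want to map that text back to one of the four tone labels. The helper below is a minimal sketch that assumes the label appears as a whole word somewhere in the generated text. `extract_tone` and its `None` fallback are illustrative names and choices, not part of Olive or ONNX Runtime.

```python
import re
from typing import Optional

# The four categories used by the phrase classification dataset.
TONES = ("joy", "surprise", "fear", "sadness")

# Matches any known tone as a whole word.
_TONE_PATTERN = re.compile(r"\b(" + "|".join(TONES) + r")\b")


def extract_tone(generated: str) -> Optional[str]:
    """Return the first known tone found in the text (case-insensitive), else None."""
    match = _TONE_PATTERN.search(generated.lower())
    return match.group(1) if match else None


print(extract_tone("joy</s>"))
print(extract_tone("Surprise"))
print(extract_tone("unsure"))
```

Returning `None` for unrecognized output, rather than guessing a label, lets the caller decide whether to retry or to flag the phrase for review.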