https://huggingface.co/docs/optimum/main/en/onnxruntime/usage_guides/models

In [2]:
!pip install optimum[onnxruntime]

Collecting optimum[onnxruntime]
  Downloading optimum-1.24.0-py3-none-any.whl.metadata (21 kB)
Collecting onnx (from optimum[onnxruntime])
  Downloading onnx-1.17.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (16 kB)
Collecting onnxruntime>=1.11.0 (from optimum[onnxruntime])
  Downloading onnxruntime-1.21.1-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.5 kB)
Collecting datasets>=1.2.1 (from optimum[onnxruntime])
  Downloading datasets-3.5.1-py3-none-any.whl.metadata (19 kB)
Collecting evaluate (from optimum[onnxruntime])
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting transformers>=4.29 (from optimum[onnxruntime])
  Downloading transformers-4.48.3-py3-none-any.whl.metadata (44 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets>=1.2.1->optimum[onnxruntime])
  Downloading dill-0.3.8-py3-none-any.

https://huggingface.co/openai-community/gpt2/discussions/84

In [3]:
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForCausalLM

# Load the ONNX export stored in the repo's "onnx" subfolder
model = ORTModelForCausalLM.from_pretrained("openai-community/gpt2", subfolder="onnx")


# Load tokenizer (must match the original model)
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
result = pipe("He never went out without a book under his arm")
print(result[0]["generated_text"])

The ONNX file decoder_model_merged.onnx is not a regular name used in optimum.onnxruntime that are ['model.onnx', 'model_quantized.onnx', 'model_optimized.onnx', 'decoder_with_past_model.onnx', 'decoder_with_past_model_quantized.onnx', 'decoder_with_past_model_optimized.onnx'], the ORTModelForCausalLM might not behave as expected.
ORTModelForCausalLM loaded a legacy ONNX model with no position_ids input, although this input is required for batched generation for the architecture gpt2. We strongly encourage to re-export the model with optimum>=1.14 for position_ids and batched inference support.
Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


He never went out without a book under his arm, not once."

He once said: "They wouldn't even call me a bookie. It wasn't a matter of looking as good as they did. That's the way it should


In [2]:
# -*- coding: utf-8 -*-
# Declare the source encoding (UTF-8)

from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForCausalLM
import logging

# --- Settings ---
# Model name on the Hugging Face Hub
model_name = "openai-community/gpt2"
# Subfolder that contains the ONNX files
onnx_subfolder = "onnx"
# Prompt to generate text from
prompt = "He never went out without a book under his arm"
# Maximum length of the generated text in tokens (including the prompt)
max_generation_length = 50  # you can change this number

# --- Load and prepare the model ---

# Set the logging level to show info messages and warnings (optional)
logging.basicConfig(level=logging.INFO)
print(f"Loading ONNX model from: {model_name}, subfolder: {onnx_subfolder}")

# Load the ONNX model with Optimum
# ONNX Runtime is used as the inference engine
# Note: the warnings about the file name and the legacy ONNX format may still appear,
# because they relate to the files hosted on the Hugging Face Hub itself.
# Fixing the "legacy ONNX model" warning requires re-exporting the original model with a newer version of optimum.
model = ORTModelForCausalLM.from_pretrained(
    model_name,
    subfolder=onnx_subfolder,
    # You can add provider='CUDAExecutionProvider' if ONNX Runtime GPU is installed and you want to use the GPU
    # provider='CPUExecutionProvider'  # this is usually the default
)
print("ONNX model loaded.")

# Load the tokenizer that matches the original model
print(f"Loading tokenizer for: {model_name}")
tokenizer = AutoTokenizer.from_pretrained(model_name)
print("Tokenizer loaded.")

# --- Create the text-generation pipeline ---
print("Creating text-generation pipeline...")
# Use the loaded (ONNX) model and the tokenizer
# device=-1 selects the CPU (or remove the argument to rely on the default)
# You can try device=0 if you have a GPU and want to use it (requires onnxruntime-gpu)
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device=-1  # -1 for CPU, 0 for the first GPU (if available and supported)
)
print("Pipeline ready.")

# --- Generate text ---
print(f"\nPrompt:\n{prompt}\n")
print(f"Generating text (max_length={max_generation_length})...")

# Call the pipeline to generate text
# pad_token_id is sometimes needed to avoid warnings with some models when using max_length
result = pipe(
    prompt,
    max_length=max_generation_length,
    pad_token_id=tokenizer.eos_token_id  # use the end-of-sequence token as the pad token
)
print("Generation complete.")

# --- Show the result ---
# The pipeline returns a list; take the first element and show the generated text
generated_text = result[0]["generated_text"]
print("\nGenerated text:")
print(generated_text)

# Print only the added text (optional)
added_text = generated_text[len(prompt):]
print("\nAdded text only:")
print(added_text.strip())  # .strip() removes leading and trailing whitespace

Loading ONNX model from: openai-community/gpt2, subfolder: onnx


The ONNX file decoder_model_merged.onnx is not a regular name used in optimum.onnxruntime that are ['model.onnx', 'model_quantized.onnx', 'model_optimized.onnx', 'decoder_with_past_model.onnx', 'decoder_with_past_model_quantized.onnx', 'decoder_with_past_model_optimized.onnx'], the ORTModelForCausalLM might not behave as expected.
ORTModelForCausalLM loaded a legacy ONNX model with no position_ids input, although this input is required for batched generation for the architecture gpt2. We strongly encourage to re-export the model with optimum>=1.14 for position_ids and batched inference support.


ONNX model loaded.
Loading tokenizer for: openai-community/gpt2


Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Tokenizer loaded.
Creating text-generation pipeline...
Pipeline ready.

Prompt:
He never went out without a book under his arm

Generating text (max_length=50)...
Generation complete.

Generated text:
He never went out without a book under his arm to read.

So that means he must have read to others.

This is the point I am trying to make here. I think it's fair.

On his journey to

Added text only:
to read.

So that means he must have read to others.

This is the point I am trying to make here. I think it's fair.

On his journey to
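
The run above passes `max_length=50`, which counts the prompt tokens as well as the generated ones. A minimal sketch of the arithmetic, assuming a hypothetical prompt length of 11 tokens (the real count depends on the tokenizer); `new_token_budget` is an illustrative helper, not part of Transformers or Optimum:

```python
def new_token_budget(prompt_token_count: int, max_length: int) -> int:
    """Tokens left for generation when max_length includes the prompt."""
    return max(max_length - prompt_token_count, 0)


# Hypothetical prompt length, for illustration only.
print(new_token_budget(prompt_token_count=11, max_length=50))  # 39
```

Passing `max_new_tokens` instead, as some later cells in this notebook do, bounds only the generated part and avoids this subtraction.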


In [7]:
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForCausalLM

# Load the ONNX export of GPT-2 published under the "optimum" organization
model = ORTModelForCausalLM.from_pretrained("optimum/gpt2")


# Load tokenizer (must match the original model)
tokenizer = AutoTokenizer.from_pretrained("optimum/gpt2")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
result = pipe("He never went out without a book under his arm")
print(result[0]["generated_text"])

ORTModelForCausalLM loaded a legacy ONNX model with no position_ids input, although this input is required for batched generation for the architecture gpt2. We strongly encourage to re-export the model with optimum>=1.14 for position_ids and batched inference support.
Device set to use cpu


He never went out without a book under his arm." He called it his "first big break" and he still doesn't recall where he went from there.

The man says "when the cops came he said "oh my God, you


In [None]:
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForCausalLM

# Explicitly specify the 'transformers' library and the ONNX file name
model = ORTModelForCausalLM.from_pretrained(
    "onnx-community/granite-3.0-2b-instruct",
    subfolder="onnx",
    file_name="model.onnx",
    provider="CPUExecutionProvider",
    library_name="transformers",
    use_cache=True,
    use_merged=False,
    device_map="auto"

)


# Load tokenizer (must match the original model)
tokenizer = AutoTokenizer.from_pretrained("onnx-community/granite-3.0-2b-instruct")

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Generate text with parameters
result = pipe(
    "what is granite-3.0 model?",
    max_new_tokens=128,
    temperature=0.4,  # Adjust as needed
    top_k=40,         # Adjust as needed
    top_p=0.5,         # Adjust as needed
    repetition_penalty=1.2,  # Adjust as needed
    do_sample=True      # Set to False for deterministic generation
)
print(result[0]["generated_text"])

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Device set to use cpu


what is granite-3.0 model?

The Granite-3.0 model is a machine learning framework developed by IBM for building and deploying AI models. It provides a unified platform for developing, training, and serving deep learning models. The model supports various types of neural network architectures and offers features like automatic differentiation, distributed training, and model optimization. It is designed to be flexible and scalable, allowing users to build and deploy AI models for a wide range of applications.


In [None]:
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForCausalLM

# Explicitly specify the 'transformers' library and the ONNX file name
model = ORTModelForCausalLM.from_pretrained(
    "onnx-community/granite-3.0-2b-instruct",
    subfolder="onnx",
    file_name="model.onnx",
    provider="CPUExecutionProvider",
    library_name="transformers",
    use_cache=True,
    use_merged=False,
    device_map="auto"

)


# Load tokenizer (must match the original model)
tokenizer = AutoTokenizer.from_pretrained("onnx-community/granite-3.0-2b-instruct")

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
#result = pipe("He never went out without a book under his arm")
result = pipe("who is ai?", max_new_tokens=55)
print(result[0]["generated_text"])

In [None]:
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForCausalLM

# Explicitly specify the 'transformers' library and the ONNX file name
model = ORTModelForCausalLM.from_pretrained(
    "onnx-community/granite-3.0-2b-instruct",
    subfolder="onnx",
    file_name="model.onnx",
    provider="CPUExecutionProvider",
    library_name="transformers",
    use_cache=True,
    use_merged=False,
    device_map="auto"

)


# Load tokenizer (must match the original model)
tokenizer = AutoTokenizer.from_pretrained("onnx-community/granite-3.0-2b-instruct")

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=11)
result = pipe("He never went out without a book under his arm")
print(result[0]["generated_text"])

https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/phi-3-tutorial.md









https://huggingface.co/onnx-community/Llama-3.2-3B-Instruct-ONNX

In [1]:
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForCausalLM

In [9]:
model = ORTModelForCausalLM.from_pretrained("onnx-community/Llama-3.2-1B", subfolder="Llama-3.2-1B/onnx/model.onnx")

FileNotFoundError: Could not find any ONNX model file for the regex ['^((?!decoder).)*.onnx', '(.*)?decoder(.*)?with_past(.*)?\\.onnx'] in onnx-community/Llama-3.2-1B/Llama-3.2-1B/onnx/model.onnx.
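
The error above follows from passing a file path as `subfolder`. A minimal sketch of splitting a repo-relative path into a directory part (for `subfolder`) and a file part (for `file_name`); `split_onnx_path` is a hypothetical helper, not part of Optimum, and the example path is illustrative rather than a claim about this repository's layout:

```python
import posixpath


def split_onnx_path(repo_path: str) -> tuple[str, str]:
    """Split a repo-relative path such as 'onnx/model.onnx' into (subfolder, file_name)."""
    subfolder, file_name = posixpath.split(repo_path.strip("/"))
    return subfolder, file_name


subfolder, file_name = split_onnx_path("onnx/model.onnx")
print(subfolder, file_name)  # onnx model.onnx
```

The two values would then be passed separately, as `subfolder=subfolder` and `file_name=file_name`.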

In [None]:
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

In [None]:
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
result = pipe("He never went out without a book under his arm")
print(result[0]["generated_text"])

In [None]:

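- from transformers import AutoModelForCausalLM
+ from optimum.onnxruntime import ORTModelForCausalLM

- model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B") # PyTorch checkpoint
+ model = ORTModelForCausalLM.from_pretrained("onnx-community/Llama-3.2-1B", subfolder="onnx") # ONNX checkpoint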





In [12]:
model = ORTModelForCausalLM.from_pretrained("onnx-community/Llama-3.2-1B", subfolder="Llama-3.2-1B/onnx")


FileNotFoundError: Could not find any ONNX model file for the regex ['^((?!decoder).)*.onnx', '(.*)?decoder(.*)?with_past(.*)?\\.onnx'] in onnx-community/Llama-3.2-1B/Llama-3.2-1B/onnx.

In [13]:
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForCausalLM

model = ORTModelForCausalLM.from_pretrained(
    "onnx-community/Llama-3.2-1B",
    subfolder="Llama-3.2-1B/onnx",
    file_name="decoder_model.onnx",  # Specify the actual file name
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
result = pipe("He never went out without a book under his arm")
print(result[0]["generated_text"])

EntryNotFoundError: 404 Client Error. (Request ID: Root=1-680ff288-7e97bf1b3dd4b5bd55a40e36;bb8b3e63-10c3-4229-af68-cbd4e4d59555)

Entry Not Found for url: https://huggingface.co/onnx-community/Llama-3.2-1B/resolve/main/Llama-3.2-1B/onnx/decoder_model.onnx.

In [26]:
!rm -rf /root/.cache

In [24]:
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForCausalLM

model = ORTModelForCausalLM.from_pretrained(
    "onnx-community/Llama-3.2-1B",
    subfolder="Llama-3.2-1B/onnx",
    file_name="model.onnx_data",  # Specify the actual file name
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
result = pipe("He never went out without a book under his arm")
print(result[0]["generated_text"])

EntryNotFoundError: 404 Client Error. (Request ID: Root=1-680ff396-1cdeb42d758840f83f2c8077;4d39bd0c-60f2-451f-b7b2-c6c82d13da7f)

Entry Not Found for url: https://huggingface.co/onnx-community/Llama-3.2-1B/resolve/main/Llama-3.2-1B/onnx/model.onnx_data.
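
Both 404 errors above come from guessing file names. A sketch of picking the loadable graph files out of a repository listing instead; the listing below is a hypothetical example, and in practice it could come from `huggingface_hub.list_repo_files(repo_id)`. Files ending in `.onnx_data` hold external weights that accompany a graph, so they are skipped rather than passed as `file_name`. `find_onnx_graphs` is an illustrative helper, not an Optimum function:

```python
def find_onnx_graphs(repo_files: list[str]) -> list[str]:
    """Return repo-relative paths of ONNX graph files, skipping external-data files."""
    return sorted(path for path in repo_files if path.endswith(".onnx"))


# Hypothetical listing, for illustration only.
example_files = [
    "config.json",
    "onnx/model.onnx",
    "onnx/model.onnx_data",
    "tokenizer.json",
]
print(find_onnx_graphs(example_files))  # ['onnx/model.onnx']
```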

In [27]:
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForCausalLM

# Load the ONNX model - corrected path and removed file_name parameter
model = ORTModelForCausalLM.from_pretrained(
    "onnx-community/Llama-3.2-1B",
    provider="CPUExecutionProvider"  # or "CUDAExecutionProvider" if you have GPU
)

# Load the tokenizer from the original model (it should match the ONNX export)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

# Create the pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Generate text
result = pipe("He never went out without a book under his arm")
print(result[0]["generated_text"])

EntryNotFoundError: 404 Client Error. (Request ID: Root=1-680ff466-732f72607c92f6a457bce2bb;bc45ec41-c260-476d-b07a-4830ae7d3957)

Entry Not Found for url: https://huggingface.co/onnx-community/Llama-3.2-1B/resolve/main/model.onnx.

In [28]:
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForCausalLM

# Load the ONNX model - corrected path and removed file_name parameter
model = ORTModelForCausalLM.from_pretrained(
    "onnx-community/opus-mt-en-fr",
    provider="CPUExecutionProvider"  # or "CUDAExecutionProvider" if you have GPU
)

# Load the tokenizer from the original model
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")  # corrected model name

# Create the pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Generate text
result = pipe("He never went out without a book under his arm")
print(result[0]["generated_text"])

The ONNX file decoder_model_merged.onnx is not a regular name used in optimum.onnxruntime that are ['model.onnx', 'model_quantized.onnx', 'model_optimized.onnx', 'decoder_with_past_model.onnx', 'decoder_with_past_model_quantized.onnx', 'decoder_with_past_model_optimized.onnx'], the ORTModelForCausalLM might not behave as expected.


EntryNotFoundError: 404 Client Error. (Request ID: Root=1-680ff4a0-6571420d6e915aeb65992f9d;5c967d71-86a3-471f-a1dd-6c63c1ff40a6)

Entry Not Found for url: https://huggingface.co/onnx-community/opus-mt-en-fr/resolve/main/decoder_model_merged.onnx.
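
Apart from the missing file, `opus-mt-en-fr` is a Marian translation model, which is an encoder-decoder architecture, so a causal-LM class and a Llama tokenizer are not a natural fit for it. A small sketch of choosing between the two `optimum.onnxruntime` class names from a model config; `pick_ort_class` is a hypothetical helper and the config dictionaries are illustrative:

```python
def pick_ort_class(config: dict) -> str:
    """Return the name of the optimum.onnxruntime class that fits the config."""
    if config.get("is_encoder_decoder", False):
        return "ORTModelForSeq2SeqLM"  # encoder-decoder, e.g. translation
    return "ORTModelForCausalLM"       # decoder-only text generation


print(pick_ort_class({"model_type": "marian", "is_encoder_decoder": True}))
print(pick_ort_class({"model_type": "gpt2"}))
```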

In [29]:
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForCausalLM

# Try loading a Llama 2 7B ONNX export
model = ORTModelForCausalLM.from_pretrained(
    "optimum/llama2-7b-onnx",  # guessed repo ID; the Hub reports it as not found
    provider="CPUExecutionProvider",  # or "CUDAExecutionProvider" with onnxruntime-gpu
)

# Load tokenizer (must match the original model)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Create text-generation pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Generate text
result = pipe("He never went out without a book under his arm", max_length=50)
print(result[0]["generated_text"])

RepositoryNotFoundError: 401 Client Error. (Request ID: Root=1-680ff4d3-2c77b816371e0cb3164d09de;c3a1ff67-a815-4306-ac87-48f432a02a6a)

Repository Not Found for url: https://huggingface.co/api/models/optimum/llama2-7b-onnx/tree/main?recursive=True&expand=False.
Please make sure you specified the correct `repo_id` and `repo_type`.
If you are trying to access a private or gated repo, make sure you are authenticated. For more details, see https://huggingface.co/docs/huggingface_hub/authentication
Invalid username or password.

In [30]:
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForCausalLM

# Try loading the Llama 3.2 3B Instruct Hexagon NPU assets repo
model = ORTModelForCausalLM.from_pretrained(
    "onnx-community/Llama-3.2-3B-instruct-hexagon-npu-assets",  # repo ID under test
    provider="CPUExecutionProvider",  # or "CUDAExecutionProvider" with onnxruntime-gpu
)

# Load tokenizer (must match the original model)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-instruct")

# Create text-generation pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Generate text
result = pipe("He never went out without a book under his arm", max_length=50)
print(result[0]["generated_text"])

ValueError: The library name could not be automatically inferred. If using the command-line, please provide the argument --library {transformers,diffusers,timm,sentence_transformers}. Example: `--library diffusers`.

In [31]:
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForCausalLM

# Explicitly specify the 'transformers' library
model = ORTModelForCausalLM.from_pretrained(
    "onnx-community/Llama-3.2-3B-instruct-hexagon-npu-assets",
    provider="CPUExecutionProvider",
    library_name="transformers" # Add this line
)

# Load tokenizer (must match the original model)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-instruct")

# Create text-generation pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Generate text
result = pipe("He never went out without a book under his arm", max_length=50)
print(result[0]["generated_text"])

ValueError: The library name could not be automatically inferred. If using the command-line, please provide the argument --library {transformers,diffusers,timm,sentence_transformers}. Example: `--library diffusers`.

def __init__(model: onnxruntime.InferenceSession, config: 'PretrainedConfig', use_io_binding: Optional[bool]=None, model_save_dir: Optional[Union[str, Path, TemporaryDirectory]]=None, preprocessors: Optional[List]=None, generation_config: Optional[GenerationConfig]=None, use_cache: Optional[bool]=None, **kwargs)
ONNX model with a causal language modeling head for ONNX Runtime inference. This class officially supports bloom, codegen, falcon, gpt2, gpt_bigcode, gpt_neo, gpt_neox, gptj, llama.

This model inherits from `onnxruntime.modeling_ort.ORTModel`; check its documentation for the generic methods the library implements for all its models (such as downloading or saving).

This class should be initialized using the `onnxruntime.modeling_ort.ORTModel.from_pretrained` method.
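
The docstring above names the architectures that `ORTModelForCausalLM` officially supports. A minimal sketch of checking a `model_type` against that list before trying to load a repository; the set is copied from the quoted docstring and may differ in other Optimum versions, and `is_officially_supported` is a hypothetical helper:

```python
# Architectures named in the ORTModelForCausalLM docstring quoted above.
OFFICIALLY_SUPPORTED = {
    "bloom", "codegen", "falcon", "gpt2", "gpt_bigcode",
    "gpt_neo", "gpt_neox", "gptj", "llama",
}


def is_officially_supported(model_type: str) -> bool:
    """True if the architecture appears in the docstring's supported list."""
    return model_type.lower() in OFFICIALLY_SUPPORTED


print(is_officially_supported("gpt2"))     # True
print(is_officially_supported("granite"))  # False
```

An architecture missing from the list may still load, as the Granite cells in this notebook suggest, but it falls outside what the docstring promises.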

In [None]:
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForCausalLM

# Explicitly specify the 'transformers' library
model = ORTModelForCausalLM.from_pretrained(
    "openai-community/gpt2",
    provider="CPUExecutionProvider",
    library_name="transformers" # Add this line
)

# Load tokenizer (must match the original model)
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")

# Create text-generation pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Generate text
result = pipe("He never went out without a book under his arm", max_length=50)
print(result[0]["generated_text"])

In [34]:
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForCausalLM

# Load the ONNX export stored in the repo's "onnx" subfolder
model = ORTModelForCausalLM.from_pretrained("openai-community/gpt2", subfolder="onnx")


# Load tokenizer (must match the original model)
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")

The ONNX file decoder_model_merged.onnx is not a regular name used in optimum.onnxruntime that are ['model.onnx', 'model_quantized.onnx', 'model_optimized.onnx', 'decoder_with_past_model.onnx', 'decoder_with_past_model_quantized.onnx', 'decoder_with_past_model_optimized.onnx'], the ORTModelForCausalLM might not behave as expected.
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
ORTModelForCausalLM loaded a legacy ONNX model with no position_ids input, although this input is required for batched generation for the architecture gpt2. We strongly encourage to re-export the model with optimum>=1.14 for position_ids and batched inference support.

In [36]:
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
result = pipe("He never went out without a book under his arm")
print(result[0]["generated_text"])

Device set to use cpu
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


He never went out without a book under his arm. He never left his books unattended so that I, as well as my siblings, could read them. He could even sit in the armchair listening to those who would listen and write. He


In [35]:
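def onnx_inference(question, answer):
    # Assumes `tokenizer` and an onnxruntime.InferenceSession named `ort_session` already exist.
    # Tokenize straight to NumPy arrays, the input type ONNX Runtime expects.
    inputs = tokenizer(question, answer, return_tensors="np")
    # Feed only the inputs that the ONNX graph declares.
    inputs_onnx = {
        node.name: inputs[node.name] for node in ort_session.get_inputs()
    }
    outputs = ort_session.run(None, inputs_onnx)
    # First output tensor, first item of the batch.
    return outputs[0][0]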

In [1]:
# -*- coding: utf-8 -*-
# Declare the source encoding (UTF-8)

from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForCausalLM
import logging

# --- Settings ---
# Model name on the Hugging Face Hub
model_name = "openai-community/gpt2"
# Subfolder that contains the ONNX files
onnx_subfolder = "onnx"
# Prompt to generate text from
prompt = "He never went out without a book under his arm"
# Maximum length of the generated text in tokens (including the prompt)
max_generation_length = 50  # you can change this number

# --- Load and prepare the model ---

# Set the logging level to show info messages and warnings (optional)
logging.basicConfig(level=logging.INFO)
print(f"Loading ONNX model from: {model_name}, subfolder: {onnx_subfolder}")

# Load the ONNX model with Optimum
# ONNX Runtime is used as the inference engine
# Note: the warnings about the file name and the legacy ONNX format may still appear,
# because they relate to the files hosted on the Hugging Face Hub itself.
# Fixing the "legacy ONNX model" warning requires re-exporting the original model with a newer version of optimum.
model = ORTModelForCausalLM.from_pretrained(
    model_name,
    subfolder=onnx_subfolder,
    # You can add provider='CUDAExecutionProvider' if ONNX Runtime GPU is installed and you want to use the GPU
    # provider='CPUExecutionProvider'  # this is usually the default
)
print("ONNX model loaded.")

# Load the tokenizer that matches the original model
print(f"Loading tokenizer for: {model_name}")
tokenizer = AutoTokenizer.from_pretrained(model_name)
print("Tokenizer loaded.")

# --- Create the text-generation pipeline ---
print("Creating text-generation pipeline...")
# Use the loaded (ONNX) model and the tokenizer
# device=-1 selects the CPU (or remove the argument to rely on the default)
# You can try device=0 if you have a GPU and want to use it (requires onnxruntime-gpu)
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device=-1  # -1 for CPU, 0 for the first GPU (if available and supported)
)
print("Pipeline ready.")

# --- Generate text ---
print(f"\nPrompt:\n{prompt}\n")
print(f"Generating text (max_length={max_generation_length})...")

# Call the pipeline to generate text
# pad_token_id is sometimes needed to avoid warnings with some models when using max_length
result = pipe(
    prompt,
    max_length=max_generation_length,
    pad_token_id=tokenizer.eos_token_id  # use the end-of-sequence token as the pad token
)
print("Generation complete.")

# --- Show the result ---
# The pipeline returns a list; take the first element and show the generated text
generated_text = result[0]["generated_text"]
print("\nGenerated text:")
print(generated_text)

# Print only the added text (optional)
added_text = generated_text[len(prompt):]
print("\nAdded text only:")
print(added_text.strip())  # .strip() removes leading and trailing whitespace

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.

Loading ONNX model from: openai-community/gpt2, subfolder: onnx


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
The ONNX file decoder_model_merged.onnx is not a regular name used in optimum.onnxruntime that are ['model.onnx', 'model_quantized.onnx', 'model_optimized.onnx', 'decoder_with_past_model.onnx', 'decoder_with_past_model_quantized.onnx', 'decoder_with_past_model_optimized.onnx'], the ORTModelForCausalLM might not behave as expected.
ORTModelForCausalLM loaded a legacy ONNX model with no position_ids input, although this input is required for batched generation for the architecture gpt2. We strongly encourage to re-export the model with optimum>=1.14 for position_ids and batched infer

ONNX model loaded.
Loading tokenizer for: openai-community/gpt2


Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Tokenizer loaded.
Creating text-generation pipeline...
Pipeline ready.

Prompt:
He never went out without a book under his arm

Generating text (max_length=50)...
Generation complete.

Generated text:
He never went out without a book under his arm and had a great deal to say about the matter, and what I told him was very different. He said, "Why don't you just sit this book down and write your own story or that

Added text only:
and had a great deal to say about the matter, and what I told him was very different. He said, "Why don't you just sit this book down and write your own story or that

In [5]:
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForCausalLM

# Load the ONNX export of GPT-2 published under the "optimum" organization
model = ORTModelForCausalLM.from_pretrained("optimum/gpt2")


# Load tokenizer (must match the original model)
tokenizer = AutoTokenizer.from_pretrained("optimum/gpt2")

decoder_with_past_model.onnx:   0%|          | 0.00/653M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

The ONNX model was probably exported with an older version of optimum. We are updating the input/output dimensions and overwriting the model file with new dimensions. This is necessary for the model to work correctly with the current version of optimum. If you encounter any issues, please re-export the model with the latest version of optimum for optimal performance.
ORTModelForCausalLM loaded a legacy ONNX model with no position_ids input, although this input is required for batched generation for the architecture gpt2. We strongly encourage to re-export the model with optimum>=1.14 for position_ids and batched inference support.


In [6]:
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
result = pipe("He never went out without a book under his arm")
print(result[0]["generated_text"])

Device set to use cpu


He never went out without a book under his arm. But he was always the first one to go out the door with everything he had.
Meredith made the first movie, where the cast played a woman they were going to marry. A friend


In [None]:
use_cache
use_merged

In [14]:
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForCausalLM

# Failed attempt: `subfolder` expects a repo-relative directory such as "onnx", not a web URL
model = ORTModelForCausalLM.from_pretrained("onnx-community/granite-3.0-2b-instruct", subfolder="https://huggingface.co/onnx-community/granite-3.0-2b-instruct/tree/main/onnx")


# Load tokenizer (must match the original model)
tokenizer = AutoTokenizer.from_pretrained("onnx-community/granite-3.0-2b-instruct")

FileNotFoundError: Could not find any ONNX model file for the regex ['^((?!decoder).)*.onnx', '(.*)?decoder(.*)?with_past(.*)?\\.onnx'] in onnx-community/granite-3.0-2b-instruct/https://huggingface.co/onnx-community/granite-3.0-2b-instruct/tree/main/onnx.

In [None]:
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
result = pipe("He never went out without a book under his arm")
print(result[0]["generated_text"])

In [None]:
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForCausalLM

# Point to the ONNX subfolder and file name explicitly
model = ORTModelForCausalLM.from_pretrained(
    "onnx-community/granite-3.0-2b-instruct",
    subfolder="onnx",
    file_name="model.onnx"  # Replace with the actual file name of your desired ONNX model
)


# Load tokenizer (must match the original model)
tokenizer = AutoTokenizer.from_pretrained("onnx-community/granite-3.0-2b-instruct")

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
result = pipe("He never went out without a book under his arm")
print(result[0]["generated_text"])

model.onnx:   0%|          | 0.00/1.31M [00:00<?, ?B/s]

model.onnx_data:   0%|          | 0.00/10.1G [00:00<?, ?B/s]
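
The FileNotFoundError earlier lists the two patterns used to look for a default ONNX file. The sketch below checks a few file names against them with `re.search`. How Optimum applies the patterns internally (search or match, bare names or full paths) is an assumption here, so treat this as a way to read the patterns rather than a reproduction of the library's lookup.

```python
import re

# Patterns copied from the FileNotFoundError message.
PATTERNS = [
    r"^((?!decoder).)*.onnx",
    r"(.*)?decoder(.*)?with_past(.*)?\.onnx",
]


def matches_default_pattern(file_name):
    """Return True if the file name matches at least one of the patterns."""
    return any(re.search(pattern, file_name) for pattern in PATTERNS)


candidates = [
    "model.onnx",
    "model_quantized.onnx",
    "decoder_with_past_model.onnx",
    "decoder_model_merged.onnx",
    "config.json",
]
for name in candidates:
    print(f"{name}: {matches_default_pattern(name)}")
```

Under these assumptions `model.onnx` matches the first pattern, while `decoder_model_merged.onnx` matches neither, which is consistent with it being reported as a non-standard name.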

In [1]:
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForCausalLM

# Explicitly specify the 'transformers' library and the ONNX file name
model = ORTModelForCausalLM.from_pretrained(
    "onnx-community/granite-3.0-2b-instruct",
    subfolder="onnx",
    file_name="model.onnx",
    provider="CPUExecutionProvider",
    library_name="transformers",
    use_cache=True,
    use_merged=False,
    device_map="auto",
)


# Load tokenizer (must match the original model)
tokenizer = AutoTokenizer.from_pretrained("onnx-community/granite-3.0-2b-instruct")

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
result = pipe("He never went out without a book under his arm")
print(result[0]["generated_text"])

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/5.64k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/777k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/442k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/3.48M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/87.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/701 [00:00<?, ?B/s]

Device set to use cpu


He never went out without a book under his arm.

He was a man of many interests.

He was a man of many


In [None]:
# Add max_new_tokens=11 to the pipe(...) call to cap the number of generated tokens

In [None]:
result = pipe("He never went out without a book under his arm", max_new_tokens=11)

In [1]:
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForCausalLM

# Explicitly specify the 'transformers' library and the ONNX file name
model = ORTModelForCausalLM.from_pretrained(
    "onnx-community/granite-3.0-2b-instruct",
    subfolder="onnx",
    file_name="model.onnx",
    provider="CPUExecutionProvider",
    library_name="transformers",
    use_cache=True,
    use_merged=False,
    device_map="auto",
)


# Load tokenizer (must match the original model)
tokenizer = AutoTokenizer.from_pretrained("onnx-community/granite-3.0-2b-instruct")

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
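# The output below comes from the book prompt; the alternative prompt is kept for reference
result = pipe("He never went out without a book under his arm", max_new_tokens=55)
#result = pipe("who is ai?", max_new_tokens=55)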
print(result[0]["generated_text"])

Device set to use cpu


He never went out without a book under his arm.

He was a man of many interests.

He was a man of many interests.

He was a man of many interests.

He was a man of many interests.

He was a man of many interests
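
The output above repeats the same sentence until the token budget runs out. Loops like this are typical when decoding keeps picking the most likely next token, and a repetition penalty is one common remedy. The sketch below applies the usual rule (divide positive logits of already-generated tokens by the penalty, multiply negative ones) to made-up logits. The function names and numbers are illustrative assumptions, not the code that runs inside the pipeline.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Return a copy of `logits` with already-generated token ids made less likely."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Dividing a positive logit or multiplying a negative one both lower it.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


logits = [2.0, 1.9, -1.0, 0.5]  # toy scores for a 4-token vocabulary
history = [0, 0, 2]             # token ids generated so far

plain = argmax(logits)
penalised = argmax(apply_repetition_penalty(logits, history, penalty=1.2))
print(plain, penalised)  # 0 1
```

Without the penalty the already-used token 0 wins again; with it, token 1 overtakes it.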


In [1]:
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForCausalLM

# Explicitly specify the 'transformers' library and the ONNX file name
model = ORTModelForCausalLM.from_pretrained(
    "onnx-community/granite-3.0-2b-instruct",
    subfolder="onnx",
    file_name="model.onnx",
    provider="CPUExecutionProvider",
    library_name="transformers",
    use_cache=True,
    use_merged=False,
    device_map="auto",
)


# Load tokenizer (must match the original model)
tokenizer = AutoTokenizer.from_pretrained("onnx-community/granite-3.0-2b-instruct")

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Generate text with parameters
result = pipe(
    "He never went out without a book under his arm",
    max_new_tokens=11,
    temperature=0.4,  # Adjust as needed
    top_k=40,         # Adjust as needed
    top_p=0.5,         # Adjust as needed
    repetition_penalty=1.2,  # Adjust as needed
    do_sample=True      # Set to False for deterministic generation
)
print(result[0]["generated_text"])

Device set to use cpu


He never went out without a book under his arm.

The man was always seen with a book
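
To make the effect of `temperature`, `top_k` and `top_p` in the cell above more concrete, here is a simplified sketch in plain Python of how such filters can narrow a next-token distribution before sampling. The logits are made up, `filter_distribution` is a hypothetical helper, and the exact order and tie-breaking used by the real generation code may differ.

```python
import math


def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_id: probability} after temperature, top-k and top-p filtering."""
    scaled = [x / temperature for x in logits]
    # Rank token ids from most to least likely.
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]
    # Softmax over the surviving tokens (shifted by the max for stability).
    peak = scaled[ranked[0]]
    weights = [math.exp(scaled[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Nucleus step: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token_id, p in zip(ranked, probs):
        kept.append((token_id, p))
        cumulative += p
        if cumulative >= top_p:
            break
    norm = sum(p for _, p in kept)
    return {token_id: p / norm for token_id, p in kept}


logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for a 4-token vocabulary
print(filter_distribution(logits, temperature=0.4, top_k=40, top_p=0.5))
print(filter_distribution(logits, temperature=1.0, top_p=0.9))
```

With the notebook's settings (temperature 0.4, top_p 0.5) only the single most likely toy token survives, so sampling is close to deterministic; at temperature 1.0 and top_p 0.9 three of the four tokens remain candidates.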


In [1]:
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForCausalLM

# Explicitly specify the 'transformers' library and the ONNX file name
model = ORTModelForCausalLM.from_pretrained(
    "onnx-community/granite-3.0-2b-instruct",
    subfolder="onnx",
    file_name="model.onnx",
    provider="CPUExecutionProvider",
    library_name="transformers",
    use_cache=True,
    use_merged=False,
    device_map="auto",
)


# Load tokenizer (must match the original model)
tokenizer = AutoTokenizer.from_pretrained("onnx-community/granite-3.0-2b-instruct")

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Generate text with parameters
result = pipe(
    "what is granite-3.0 model?",
    max_new_tokens=128,
    temperature=0.4,  # Adjust as needed
    top_k=40,         # Adjust as needed
    top_p=0.5,         # Adjust as needed
    repetition_penalty=1.2,  # Adjust as needed
    do_sample=True      # Set to False for deterministic generation
)
print(result[0]["generated_text"])

Device set to use cpu


what is granite-3.0 model?

The Granite-3.0 model is a machine learning framework developed by IBM for building and deploying AI models. It provides a unified platform for developing, training, and serving deep learning models. The model supports various types of neural network architectures and offers features like automatic differentiation, distributed training, and model optimization. It is designed to be flexible and scalable, allowing users to build and deploy AI models for a wide range of applications.
