Description
The Olive-optimized ONNX graph of Llama-3.2-1B, generated with the `olive auto-opt` command from the getting started example, fails to run on the DirectML execution provider with the following error:
```
2025-05-29 15:58:14.0223186 [E:onnxruntime:onnxruntime-genai, inference_session.cc:2094 onnxruntime::InferenceSession::Initialize] This session cannot use the graph capture feature as requested by the user as all compute graph nodes have not been partitioned to the DmlExecutionProvider
Traceback (most recent call last):
  File "C:\Users\rashi\dev\rai14\RyzenAI-SW-14\llm\gpu_run\run_model.py", line 140, in <module>
    model = model_load(args.model_dir)
  File "C:\Users\rashi\dev\rai14\RyzenAI-SW-14\llm\gpu_run\run_model.py", line 31, in model_load
    model = og.Model(model_dir)
RuntimeError: This session cannot use the graph capture feature as requested by the user as all compute graph nodes have not been partitioned to the DmlExecutionProvider
```
Could you please assist in debugging this error?
I used the same commands as the getting started example, changing only two arguments (`--device`, `--provider`) to run on the DML EP.
Optimizing the model runs successfully:
```shell
olive auto-opt \
    --model_name_or_path meta-llama/Llama-3.2-1B-Instruct \
    --trust_remote_code \
    --output_path models/llama/ao \
    --device gpu \
    --provider DmlExecutionProvider \
    --use_ort_genai \
    --precision int4 \
    --log_level 1
```
The script used to run the model fails with the error pasted above:
```python
import onnxruntime_genai as og


def model_load(model_dir: str):
    model = og.Model(model_dir)
    return model


def get_tokenizer(model):
    tokenizer = og.Tokenizer(model)
    tokenizer_stream = tokenizer.create_stream()
    return tokenizer, tokenizer_stream


def run():
    prompt = "The sun is shining outside."
    model = model_load("path/to/olive/optimized/model")
    tokenizer, tokenizer_stream = get_tokenizer(model)
    input_tokens = tokenizer.encode(prompt)

    search_options = {}
    search_options['do_sample'] = True
    params = og.GeneratorParams(model)
    params.set_search_options(**search_options)

    print("Creating generator")
    generator = og.Generator(model, params)
    generator.append_tokens(input_tokens)

    num_tokens = 0
    try:
        while not generator.is_done():
            generator.generate_next_token()
            new_token = generator.get_next_tokens()[0]
            print(tokenizer_stream.decode(new_token), end='', flush=True)
            num_tokens += 1
    except KeyboardInterrupt:
        print(" --control+c pressed, aborting generation--")


run()
```
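One thing I checked while debugging: with `--use_ort_genai`, Olive emits a `genai_config.json` alongside the model, and onnxruntime-genai reads its session and provider options from there. Below is a minimal sketch for inspecting those options; the config layout (`model.decoder.session_options.provider_options`) is an assumption based on typical Olive/GenAI output and may differ by version, and the demo config is a hypothetical stand-in, not my actual file.

```python
import json
import os
import tempfile


def read_provider_options(model_dir: str):
    """Return the provider_options list from a model's genai_config.json.

    The nesting below (model -> decoder -> session_options) is assumed from
    typical genai_config.json files and may vary across tool versions.
    """
    with open(os.path.join(model_dir, "genai_config.json")) as f:
        config = json.load(f)
    session = config["model"]["decoder"]["session_options"]
    return session.get("provider_options", [])


# Demo with a hypothetical config (stand-in for the real Olive output):
with tempfile.TemporaryDirectory() as d:
    sample = {
        "model": {
            "decoder": {
                "session_options": {"provider_options": [{"dml": {}}]}
            }
        }
    }
    with open(os.path.join(d, "genai_config.json"), "w") as f:
        json.dump(sample, f)
    print(read_provider_options(d))  # prints [{'dml': {}}]
```

If the printed options do not mention DML at all, the session would silently fall back to CPU, which could explain the partitioning complaint.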
System configuration:
- HW: AMD Ryzen AI 9 365 w/ Radeon 880M
- GPU Driver: 32.0.21001.9024
- OS: Windows 11
Conda Environment:
```
Package                    Version
-------------------------- -----------
accelerate                 1.7.0
alembic                    1.16.1
annotated-types            0.7.0
bitsandbytes               0.46.0
certifi                    2025.4.26
charset-normalizer         3.4.2
colorama                   0.4.6
coloredlogs                15.0.1
colorlog                   6.9.0
filelock                   3.18.0
flatbuffers                25.2.10
fsspec                     2025.5.1
greenlet                   3.2.2
huggingface-hub            0.32.2
humanfriendly              10.0
idna                       3.10
Jinja2                     3.1.6
lightning-utilities        0.14.3
Mako                       1.3.10
MarkupSafe                 3.0.2
ml_dtypes                  0.5.1
mpmath                     1.3.0
networkx                   3.4.2
numpy                      1.26.4
olive-ai                   0.10.0.dev0
onnx                       1.18.0
onnxruntime-directml       1.22.0
onnxruntime-genai-directml 0.8.0
onnxscript                 0.2.7
optimum                    1.25.3
optuna                     4.3.0
packaging                  25.0
pandas                     2.2.3
peft                       0.15.2
pip                        25.1
protobuf                   6.31.1
psutil                     7.0.0
pydantic                   2.11.5
pydantic_core              2.33.2
pyreadline3                3.5.4
python-dateutil            2.9.0.post0
pytz                       2025.2
PyYAML                     6.0.2
regex                      2024.11.6
requests                   2.32.3
safetensors                0.5.3
scipy                      1.15.3
setuptools                 78.1.1
six                        1.17.0
SQLAlchemy                 2.0.41
sympy                      1.14.0
tokenizers                 0.19.1
tomli                      2.2.1
torch                      2.7.0
torchmetrics               1.7.2
tqdm                       4.67.1
transformers               4.44.2
typing_extensions          4.13.2
typing-inspection          0.4.1
tzdata                     2025.2
urllib3                    2.4.0
wheel                      0.45.1
```