Performance #1382


Open
naveen-kurra opened this issue Apr 7, 2025 · 5 comments

Comments

@naveen-kurra

naveen-kurra commented Apr 7, 2025

Hello Community,
I'm building my application with onnxruntime-genai using the Phi-3.5 ONNX model.
I have noticed that the generator.append_tokens() function takes around 1-1.5 s to process. I want to learn how to solve this issue, as every millisecond of latency matters for my application. I'm using Python. Please let me know if you need any further logs to solve this issue; I'm happy to provide them.

@aciddelgado
Contributor

Hello,
In order to assist, it would be useful to have the following information: what platform are you running on, what are its specs, and where did you acquire the Phi-3.5 model from? A sample of your script would also be useful. Thank you!

@naveen-kurra
Author

naveen-kurra commented Apr 8, 2025

Hello @aciddelgado , thanks for your response.

  1. I'm running on an NVIDIA Jetson Orin NX with 16 GB RAM (up to 157 TOPS).
  2. I'm using Microsoft Phi-3.5 mini instruct ONNX int4 (CUDA version) from Hugging Face.

While I cannot share the entire code, here's a snippet that could help:

```python
def generate_response(user_tokens, static_tokens, max_tokens=150, temperature=0.2):
    global model, tokenizer, generator, params
    try:
        # Prepare the prompt and encode user input
        all_tokens = list(static_tokens[2]) + list(static_tokens[0]) + user_tokens + list(static_tokens[1])
        log_event("reached gen")
        generator.append_tokens(all_tokens)
        log_event("Generating_start")
        current_text = ""
        llm_output_list = []  # Collect decoded output pieces
        # ... generation loop omitted ...
    except Exception:
        # ... error handling omitted ...
        raise

def get_static_tokens(static_prompt):
    return list(tokenizer.encode(static_prompt))

user_tokens = list(tokenizer.encode(text))
```

The time between the two log statements above and below generator.append_tokens(all_tokens) is around 1-1.7 s.
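For reference, a more precise way to isolate this measurement (a minimal sketch; it assumes the same global generator and the all_tokens list built in the snippet above) is to time just the call:

```python
import time

# Sketch: time only the append_tokens call, using the same `generator`
# and `all_tokens` as in the snippet above.
start = time.perf_counter()
generator.append_tokens(all_tokens)
elapsed = time.perf_counter() - start
print(f"append_tokens took {elapsed:.3f} s for {len(all_tokens)} tokens")
```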
Thank you.

@naveen-kurra
Author

Hello @aciddelgado
I also tried pulling the model again from the source (the Microsoft repo), but I'm still having the same issue.
Please help me resolve it.

Thank you

@aciddelgado
Contributor

Hello @naveen-kurra
Apologies for the delay, things have been very busy. What you've shown here seems alright; I would start by ensuring that the model is running on CUDA and not accidentally on CPU. You can force the CUDA EP by creating a config like:

```python
config = og.Config(args.model_path)
config.clear_providers()
if args.execution_provider != "cpu":
    if args.verbose: print(f"Setting model to {args.execution_provider}")
    config.append_provider(args.execution_provider)
model = og.Model(config)
```
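As an additional sanity check (a hedged suggestion; this assumes the onnxruntime Python package is importable alongside onnxruntime-genai), you can also confirm that the CUDA execution provider is available in your build at all:

```python
import onnxruntime as ort

# If "CUDAExecutionProvider" is not in this list, the installed onnxruntime
# build has no CUDA support and inference will silently fall back to CPU.
print(ort.get_available_providers())
```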

The other two things I can think of that may be influencing performance are the length of the token list you are appending and the max length you are using; it would be useful to know both.
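For illustration only (a sketch, not a confirmed fix, reusing the names from the snippet earlier in this thread), logging the prompt length and experimenting with a smaller max_length could look like:

```python
# Log how many tokens are being appended (one of the two values asked about above).
print(f"Prompt tokens to append: {len(all_tokens)}")

# Sketch: rebuild the generator with a much smaller max_length to see whether
# the configured maximum sequence length affects append_tokens latency.
params = og.GeneratorParams(model)
params.set_search_options(max_length=512, batch_size=1)
generator = og.Generator(model, params)
generator.append_tokens(all_tokens)
```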

@naveen-kurra
Author

naveen-kurra commented Apr 11, 2025

Hello @aciddelgado,
Thanks for your response.
Please find below the snippet for model initialization.

```python
config = og.Config(model_config_path)

execution_provider = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using execution provider: {execution_provider}")
if execution_provider == "cuda":
    # provider_options = {
    #     "arena_extend_strategy": "kNextPowerOfTwo",
    #     "cudnn_conv_algo_search": "EXHAUSTIVE",
    #     "do_copy_in_default_stream": True
    # }
    config.append_provider(execution_provider)
else:
    config.append_provider(execution_provider)

def initialize_model():
    model = og.Model(config)
    tokenizer = og.Tokenizer(model)
    params = og.GeneratorParams(model)
    params.set_search_options(max_length=10000, batch_size=1)
    generator = og.Generator(model, params)
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return model, tokenizer, generator, params

global model, tokenizer, generator, params, done_listening
model, tokenizer, generator, params = initialize_model()
```

And here are a few test cases for length of tokens (LoT):
Test case 1: when LoT == 35, the latency goes up to 1-1.2 s, and sometimes it takes only 0.3 s; it's very variable.
Test case 2: append_tokens took 1.4 s when LoT == 72.

And the max_length is 10000.

I look forward to your response and really appreciate your time.

Thank you.
