Performance #1382


Open
naveen-kurra opened this issue Apr 7, 2025 · 5 comments

Comments

@naveen-kurra

naveen-kurra commented Apr 7, 2025

Hello Community,
I'm building my application with onnxruntime-genai using the Phi-3.5 ONNX model.
I have noticed that the generator.append_tokens() function takes around 1-1.5 s to process. I want to learn how to solve this issue, as every millisecond of latency matters for my application. I'm using Python. Please let me know if you need any further logs to solve this issue; I'm happy to provide them.

@aciddelgado
Contributor

Hello,
In order to assist, it would be useful to have the following information: what platform are you running on, what are its specs, and where did you acquire the Phi-3.5 model from? A sample of your script would also be useful. Thank you!

@naveen-kurra
Author

naveen-kurra commented Apr 8, 2025

Hello @aciddelgado , thanks for your response.

  1. I'm running on an NVIDIA Jetson Orin NX with 16 GB RAM (up to 157 TOPS).
  2. I'm using Microsoft Phi-3.5 mini instruct ONNX int4 (CUDA version) from Hugging Face.

While I cannot share the entire code, here's a snippet that could help:

```python
def generate_response(user_tokens, static_tokens, max_tokens=150, temperature=0.2):
    global model, tokenizer, generator, params
    try:
        # Prepare the prompt and encode user input
        all_tokens = list(static_tokens[2]) + list(static_tokens[0]) + user_tokens + list(static_tokens[1])
        log_event("reached gen")
        generator.append_tokens(all_tokens)
        log_event("Generating_start")
        current_text = ""
        llm_output_list = []  # Collect decoded output pieces
        # ... generation loop omitted ...
    except Exception:
        # ... error handling omitted ...
        raise

def get_static_tokens(static_prompt):
    return list(tokenizer.encode(static_prompt))

user_tokens = list(tokenizer.encode(text))
```

The time between the two log statements above and below generator.append_tokens(all_tokens) is around 1-1.7 s.
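For reference, a more precise way to isolate this measurement (a minimal sketch; it assumes the same global generator and the all_tokens list built in the snippet above) is to time just the call:

```python
import time

# Sketch: time only the append_tokens call, using the same `generator`
# and `all_tokens` as in the snippet above.
start = time.perf_counter()
generator.append_tokens(all_tokens)
elapsed = time.perf_counter() - start
print(f"append_tokens took {elapsed:.3f} s for {len(all_tokens)} tokens")
```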
Thank you.

@naveen-kurra
Author

Hello @aciddelgado
I also tried pulling the model again from the source (the Microsoft repo), but I'm still having the same issue.
Please help me resolve it.

Thank you

@aciddelgado
Contributor

Hello @naveen-kurra
Apologies for the delay, things have been very busy. What you've shown here seems alright; I would start by ensuring that the model is running on CUDA and not accidentally on CPU. You can force the CUDA EP by creating a config like:

```python
config = og.Config(args.model_path)
config.clear_providers()
if args.execution_provider != "cpu":
    if args.verbose: print(f"Setting model to {args.execution_provider}")
    config.append_provider(args.execution_provider)
model = og.Model(config)
```
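As an additional sanity check (a hedged suggestion; this assumes the onnxruntime Python package is importable alongside onnxruntime-genai), you can also confirm that the CUDA execution provider is available in your build at all:

```python
import onnxruntime as ort

# If "CUDAExecutionProvider" is not in this list, the installed onnxruntime
# build has no CUDA support and inference will silently fall back to CPU.
print(ort.get_available_providers())
```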

The other two things I can think of that may be influencing performance are the length of the token list you are appending and the max length you are using; it would be useful to know both.
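For illustration only (a sketch, not a confirmed fix, reusing the names from the snippet earlier in this thread), logging the prompt length and experimenting with a smaller max_length could look like:

```python
# Log how many tokens are being appended (one of the two values asked about above).
print(f"Prompt tokens to append: {len(all_tokens)}")

# Sketch: rebuild the generator with a much smaller max_length to see whether
# the configured maximum sequence length affects append_tokens latency.
params = og.GeneratorParams(model)
params.set_search_options(max_length=512, batch_size=1)
generator = og.Generator(model, params)
generator.append_tokens(all_tokens)
```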

@naveen-kurra
Author

naveen-kurra commented Apr 11, 2025

Hello @aciddelgado,
Thanks for your response.
Please find below the snippet for model initialization.

```python
config = og.Config(model_config_path)

execution_provider = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using execution provider: {execution_provider}")
if execution_provider == "cuda":
    # provider_options = {
    #     "arena_extend_strategy": "kNextPowerOfTwo",
    #     "cudnn_conv_algo_search": "EXHAUSTIVE",
    #     "do_copy_in_default_stream": True
    # }
    config.append_provider(execution_provider)
else:
    config.append_provider(execution_provider)

def initialize_model():
    model = og.Model(config)
    tokenizer = og.Tokenizer(model)
    params = og.GeneratorParams(model)
    params.set_search_options(max_length=10000, batch_size=1)
    generator = og.Generator(model, params)
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return model, tokenizer, generator, params

global model, tokenizer, generator, params, done_listening
model, tokenizer, generator, params = initialize_model()
```

And here are a few test cases for length of tokens (LoT):
Test case 1: when LoT == 35, the latency goes up to 1-1.2 s, and sometimes it takes only 0.3 s; it's very variable.
Test case 2: append_tokens took 1.4 s when LoT == 72.

And the max_length is 10000.

I look forward to your response and really appreciate your time.

Thank you.
