
Phi Silica text completions are really slow #5334

Open
@alejandro5042

Description


While embedding performance is OK (about 2,000 tokens/second), text completions are crazy slow: 5 seconds for roughly 250 prompt tokens and 20 completion tokens.
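
To put that in perspective (rough numbers from my own measurements above): (250 + 20) tokens / 5 s is about 54 tokens/second end-to-end, and only about 20 / 5 = 4 generated tokens/second if the whole 5 seconds is attributed to decoding -- versus roughly 2,000 tokens/second for embeddings.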

I don't think my use case is unreasonable -- it's a small subset (measured in both token count and required intelligence) of a much larger AI problem we are effectively solving with GPT-4o mini today.

The output quality is borderline workable and will need a lot of prompt engineering. But even if the model nailed the answer, for this to be usable in a UI -- and for Phi Silica to win over cloud models for this specific use case -- I need the latency under 1 second, ideally under 500 ms.

Are these chat completion token counts in the ballpark of what this feature is designed for? Am I missing some key configuration, a model selector, or something else? What performance can I expect from Phi Silica?

For reference, my code looks roughly like:

using Microsoft.Windows.AI.Generative;

if (!LanguageModel.IsAvailable()) await LanguageModel.MakeAvailableAsync();

using LanguageModel languageModel = await LanguageModel.CreateAsync();

var prompt = "...";

var options = new LanguageModelOptions();
options.Temp = ...;
options.Top_p = ...;

// I run the next line over 10 to 100 iterations, and average the iteration time:
var result = await languageModel.GenerateResponseAsync(options, prompt);
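
The measurement loop itself is essentially the sketch below (a minimal version: the warm-up call is an assumption on my part here, and in practice I vary the iteration count between 10 and 100):

// Minimal timing sketch; assumes languageModel, options, and prompt from above.
const int iterations = 20;

// One untimed warm-up call so model load / first-call overhead is not counted.
_ = await languageModel.GenerateResponseAsync(options, prompt);

var stopwatch = System.Diagnostics.Stopwatch.StartNew();
for (int i = 0; i < iterations; i++)
{
    var response = await languageModel.GenerateResponseAsync(options, prompt);
}
stopwatch.Stop();

Console.WriteLine($"Average per call: {stopwatch.Elapsed.TotalMilliseconds / iterations:F0} ms");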

My setup is:

  • Dell Latitude 7455 with Snapdragon X Elite (X1E80100)
  • Windows 26120.3863
  • Microsoft.WindowsAppSDK v1.7.250127003-experimental3
  • .NET 8.0
