Description
While embedding performance is OK (about 2,000 tokens/second), text completions are crazy slow: 5 seconds for roughly 250 prompt tokens and 20 completion tokens — about 54 tokens/second end to end.
I don't think my use case is too crazy -- it's a subset of a much larger AI problem (measured in tokens & desired intelligence) we are effectively solving with GPT 4o mini today.
The model's output is borderline workable and will need lots of prompt engineering. But even if it nailed the quality, for this to be usable in a UI — and for us to choose Phi Silica over cloud models for this specific use case — I need the latency under 1 second, ideally under 500 ms.
Are these chat completion latencies in the ballpark of what this feature is designed for? Am I missing some key configuration, a model selector, or something else? What performance can I expect out of Phi Silica?
For reference, my code looks roughly like:
```csharp
using Microsoft.Windows.AI.Generative;

if (!LanguageModel.IsAvailable()) await LanguageModel.MakeAvailableAsync();
using LanguageModel languageModel = await LanguageModel.CreateAsync();

var prompt = "...";
var options = new LanguageModelOptions();
options.Temp = ...;
options.Top_p = ...;

// I run the next line over 10 to 100 iterations, and average the iteration time:
var result = await languageModel.GenerateResponseAsync(options, prompt);
```
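For completeness, the averaging loop referenced in the comment above looks roughly like this (a minimal sketch; the iteration count and the warm-up call are just how I happen to benchmark, not anything prescribed by the SDK — the warm-up is there so first-run model load doesn't skew the average):

```csharp
// Hypothetical benchmarking loop around GenerateResponseAsync.
// Assumes languageModel, options, and prompt are set up as above.
const int iterations = 20;

// Warm-up call so one-time model initialization isn't counted.
await languageModel.GenerateResponseAsync(options, prompt);

var stopwatch = System.Diagnostics.Stopwatch.StartNew();
for (int i = 0; i < iterations; i++)
{
    var result = await languageModel.GenerateResponseAsync(options, prompt);
}
stopwatch.Stop();

// Average wall-clock time per completion.
Console.WriteLine($"Average latency: {stopwatch.ElapsedMilliseconds / (double)iterations:F0} ms");
```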
My setup is:
- Dell Latitude 7455 with Snapdragon X Elite (X1E80100)
- Windows 26120.3863
- Microsoft.WindowsAppSDK v1.7.250127003-experimental3
- .NET 8.0