Task-based evaluation for LLM usage of Hanko SDK #2572

gyxlucy · 2026-03-30T15:48:08Z

Hi, thanks for building Hanko. It's a really interesting auth framework!

We’ve been exploring how well LLMs actually use SDKs like Hanko in practice — not just from docs, but through structured, task-based evaluation.

We’ve started generating a set of representative tasks that simulate how developers (or agents) would realistically use Hanko's SDK.

The goal is to capture things like:

Even with this early setup, we’re already seeing some interesting differences across models — especially around multi-step reliability.

Not trying to promote anything — just wanted to sanity-check whether this kind of task-level visibility would be useful from your perspective.

Happy to share more or tailor a few tasks specifically for this repo if helpful.

Curious if this resonates at all?