Evaluation toolkit for LLM-as-judge calibration: Cohen's kappa, Kendall's tau, and regression gates.
Updated May 22, 2026 · Python
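As a rough illustration of the metrics named in this listing (not code taken from the repository), the sketch below computes Cohen's kappa and Kendall's tau between human labels and judge labels, then applies a simple regression gate. The function names, the tau-a variant, and the `tolerance` default are assumptions made for this example.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Chance-corrected agreement between two equal-length label lists."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    # Fraction of items on which the two raters agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def passes_gate(metric, baseline, tolerance=0.05):
    """Hypothetical regression gate: fail if the metric drops too far."""
    return metric >= baseline - tolerance


human = ["pass", "pass", "fail", "fail"]
judge = ["pass", "fail", "fail", "fail"]
kappa = cohens_kappa(human, judge)
tau = kendall_tau_a([1, 2, 3, 4], [1, 3, 2, 4])
gate_ok = passes_gate(kappa, baseline=0.52)
```

A gate like this would typically run in CI, comparing the judge's agreement with human labels against a stored baseline before a new judge prompt or model is accepted.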
Open evaluation harness for mental health LLM responses: five clinically grounded rubrics, LLM-as-judge scoring with bias controls, and crisis-detection routing to 988 protocols.