ai-eval

Here are 2 public repositories matching this topic...

Mike-E-Log / ai-eval-toolkit

Eval toolkit for LLM-as-judge calibration — Cohen's kappa, Kendall-tau, regression gates.

python mcp calibration kappa cohens-kappa inter-rater-agreement kendall-tau evals llm-evaluation llm-as-judge mt-bench ai-eval

Updated May 22, 2026
Python

KarmaEnchanter / mental-health-llm-eval

Open evaluation harness for mental health LLM responses. 5 clinically-grounded rubrics, LLM-as-judge with bias controls, crisis-detection routing to 988 protocols.

psychology cbt ai-safety conversational-ai clinical-ai cohen-kappa ollama llm-evaluation llm-as-judge mental-health-ai ai-eval inter-rater-reliability eval-harness lifeline-988 open-source-eval

Updated May 24, 2026
Python
