
PLawbench: A Rubric-Based Benchmark for Evaluating LLMs in Real-World Legal Practice

Introduction

PLawBench is a rubric-based benchmark designed to evaluate the performance of large language models (LLMs) in legal practice. It includes three legal tasks: legal consultation, case analysis, and legal document drafting, covering a wide range of real-world legal domains such as personal affairs, marriage and family law, intellectual property, and criminal litigation. The benchmark aims to evaluate LLMs' practical capabilities in handling real-world legal tasks.

  1. In the public legal consultation task, we draw on situations commonly encountered by lawyers to simulate the interaction between clients and lawyers. This task tests whether the model can correctly understand users' legal needs and can identify and elicit key facts that the parties have left undisclosed.

  2. In the case analysis task, each case is structured into four parts: conclusion, legal facts, reasoning, and legal provisions, with dedicated rubrics designed for each. For selected questions, we further specify particular legal reasoning paths to assess the model’s ability to conduct structured and sound legal reasoning in real-world cases.

  3. In the legal document drafting task, models are required to generate legal documents, such as complaints and statements of defense, based on provided scenarios. This task aims to evaluate the models' proficiency in professional legal writing.


Dataset Description

practical_case_analysis_250.jsonl consists of case analysis questions. We have open-sourced a total of 250 questions, including the questions, reference answers, scoring rubrics, and score sheets.

public_legal_consultation_18.json consists of legal consultation questions. We have open-sourced a total of 18 questions, including the consultation scenarios and scoring rubrics.

Defendants_Statement.json and Plantiffs_Statement.json are legal writing tasks for drafting statements of defense and complaints, respectively. We have open-sourced a total of 12 questions, including the writing scenarios and scoring rubrics.
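
The released files follow standard JSON/JSONL layouts; the snippet below is a minimal loading sketch for inspecting them. The per-record fields are not described here, so treat the layout as an assumption to check against the actual files.

```python
import json

def load_jsonl(path):
    """Read one JSON object per line (e.g. practical_case_analysis_250.jsonl)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def load_json(path):
    """Read a single JSON document (e.g. public_legal_consultation_18.json)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

case_analysis = load_jsonl("practical_case_analysis_250.jsonl")
consultation = load_json("public_legal_consultation_18.json")
plaintiff = load_json("Plantiffs_Statement.json")
defendant = load_json("Defendants_Statement.json")

print(f"{len(case_analysis)} case analysis questions loaded")
```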


Contributions

Our work makes three main contributions:

1. More realistic simulation of legal practice: We faithfully simulate real-world legal practice scenarios, with all tasks adapted from authentic cases. The benchmark organizes legal tasks into three hierarchical levels (public legal consultation, practical case analysis, and legal document generation), reflecting the full workflow of legal practitioners and enabling a comprehensive evaluation of LLM performance across diverse legal tasks.

In real legal practice, user queries are often vague, logically inconsistent, emotionally charged, or even intentionally incomplete. While preserving the core logic of real cases, we deliberately amplify these cognitively challenging elements—such as ambiguous descriptions, omitted key facts, and misleading details—to assess whether LLMs can effectively operate under realistic legal consultation conditions.

2. Fine-Grained Reasoning Steps: Beyond evaluating final outcomes, our benchmark explicitly incorporates fine-grained legal reasoning steps into task design and evaluation. This allows us to examine whether LLMs can perform multi-stage legal reasoning, including issue identification, fact clarification, legal analysis, and conclusion validation, rather than relying on shallow pattern matching or surface-level reasoning.

3. Task-Specific Rubrics: Our evaluation framework adopts personalized, task-specific rubrics annotated by legal experts, moving beyond purely outcome-based or form-based metrics to assess substantive legal reasoning and decision-making processes. For each type of legal task, legal experts first define a rubric framework tailored to the task's reasoning requirements. Subsequently, they annotate case-specific rubrics for each individual legal scenario. This two-stage annotation process ensures that evaluation criteria are both principled and context-sensitive, enabling a more fine-grained, comprehensive, and realistic assessment of LLM performance in legal practice settings.
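
As an illustration only (the benchmark's actual scoring procedure and rubric schema are not reproduced here), the sketch below computes a rubric-based score as the weighted fraction of rubric items a response satisfies. All field names, weights, and example criteria are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class RubricItem:
    description: str   # criterion the response must meet
    weight: float      # relative importance assigned by the annotator
    satisfied: bool    # judgment from a human or LLM grader

def rubric_score(items: list[RubricItem]) -> float:
    """Weighted fraction of satisfied rubric items, scaled to 0-100."""
    total = sum(i.weight for i in items)
    earned = sum(i.weight for i in items if i.satisfied)
    return 100.0 * earned / total if total else 0.0

# Example: a three-item case-analysis rubric where the statute citation is missed.
items = [
    RubricItem("States the correct conclusion", 2.0, True),
    RubricItem("Identifies the key legal facts", 1.0, True),
    RubricItem("Cites the applicable statute", 1.0, False),
]
print(rubric_score(items))  # 75.0
```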

Ranking

Scores are reported per task: Task 1 = public legal consultation, Task 2 = practical case analysis, and Task 3 = legal document drafting (Plaintiff = complaint, Defendant = statement of defense).

Family Model Overall Task2-Avg Task2-Conclusion Task2-Facts Task2-Reasoning Task2-Statute Task1 Task3-Avg Task3-Plaintiff Task3-Defendant
Claude Claude-sonnet-4-20250514 53.55 55.91 62.57 79.75 49.54 35.66 58.24 46.48 39.60 53.35
Claude Claude-sonnet-4-5-20250929 65.88 67.57 67.61 88.32 64.05 47.22 70.98 59.67 52.05 67.29
Claude Claude-opus-4-5-20251101 66.47 68.00 69.82 83.61 65.49 53.61 68.92 62.27 56.54 68.01
DeepSeek DeepSeek-V3.2 57.97 64.01 65.56 82.89 61.64 43.27 60.12 46.48 44.43 48.52
DeepSeek DeepSeek-V3.2-thinking-inner 63.23 64.11 63.91 86.35 60.24 44.19 72.09 55.87 46.13 65.61
Doubao Doubao-seed-1-6-250615 63.88 61.79 64.53 83.12 56.87 43.69 75.02 59.95 48.59 71.32
Ernie Ernie-5.0-thinking-preview 51.39 62.89 64.17 85.84 58.33 40.96 27.79 47.94 46.70 49.18
Gemini Gemini-2.5-flash 62.07 63.85 64.94 88.95 60.64 37.00 68.77 54.65 46.44 62.87
Gemini Gemini-2.5-pro 64.05 64.14 68.12 78.48 64.49 43.15 70.31 59.73 55.35 64.11
Gemini Gemini-3.0-pro-preview 66.35 64.95 72.03 77.79 65.00 46.42 70.17 66.13 63.84 68.42
GLM Glm-4.6 60.49 61.37 66.34 77.05 60.00 42.26 63.19 57.21 51.65 62.77
GPT GPT-4o-20240806 35.76 41.66 54.79 67.90 34.46 15.82 47.86 17.86 17.81 17.92
GPT GPT-5-0807-global 67.76 62.92 66.77 86.21 60.27 34.18 78.71 68.54 61.05 76.03
GPT GPT-5.2-1211-global 69.67 66.37 69.93 88.26 60.38 48.59 79.57 68.58 58.25 63.42
Grok Grok-4.1-fast 52.67 58.17 63.32 89.94 53.37 21.55 59.07 39.23 30.25 48.22
Kimi Kimi-k2 57.27 60.26 64.62 80.96 56.66 40.27 62.73 48.66 41.78 55.54
Qwen Qwen-4b-instruct-2507 50.53 53.72 52.61 89.98 44.77 23.76 58.49 39.37 31.08 47.66
Qwen Qwen-4b-thinking-2507 44.80 52.70 56.61 89.97 45.97 26.73 41.88 39.83 30.22 49.44
Qwen Qwen-8b 43.11 49.91 53.65 77.92 42.72 27.62 48.96 30.46 31.27 29.64
Qwen Qwen3-30b-a3b-instruct-2507 55.73 59.81 59.65 90.13 53.44 32.60 61.19 45.30 32.21 58.39
Qwen Qwen3-30b-a3b-thinking-2507 50.29 51.19 57.74 75.43 44.77 30.60 52.01 47.63 39.61 55.65
Qwen Qwen3-235b-a22b-instruct-2507 63.08 65.57 64.34 91.90 60.07 42.52 67.79 55.78 42.04 69.51
Qwen Qwen3-235b-a22b-thinking-2507 62.82 64.26 64.39 88.68 59.93 41.97 61.32 61.41 51.95 70.86
Qwen Qwen3-max 64.75 67.17 67.52 90.97 62.75 45.10 75.76 53.38 49.33 57.43
