PLawBench is a rubric-based benchmark designed to evaluate the performance of large language models (LLMs) in legal practice. It comprises three legal tasks: legal consultation, case analysis, and legal document drafting, covering a wide range of real-world legal domains such as personal affairs, marriage and family law, intellectual property, and criminal litigation. The benchmark aims to evaluate LLMs’ practical capabilities in handling real-world legal tasks.
- In the public legal consultation task, we draw on situations commonly encountered by lawyers to simulate the interaction between clients and lawyers. This task tests whether the model can correctly understand users’ legal needs and identify and elicit key facts that the parties leave undisclosed.
- In the case analysis task, each case is structured into four parts: conclusion, legal facts, reasoning, and legal provisions, with dedicated rubrics designed for each (see the record sketch after this list). For selected questions, we further specify particular legal reasoning paths to assess the model’s ability to conduct structured and sound legal reasoning in real-world cases.
- In the legal document drafting task, models are required to generate legal documents, such as complaints and statements of defense, based on provided scenarios. This task aims to evaluate the models' proficiency in professional legal writing.
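
For illustration, here is a minimal sketch of how a single case-analysis record could be laid out. The field names (`question`, `reference_answer`, `rubrics`, and so on) are assumptions made for readability, not the exact schema of the released files.

```python
# Hypothetical layout of one case-analysis item. Field names are
# illustrative assumptions, not the exact schema of the released files.
case_item = {
    "question": "Full case description presented to the model ...",
    "reference_answer": {
        "conclusion": "Expected legal conclusion ...",
        "legal_facts": "Key facts that must be identified ...",
        "reasoning": "Expected chain of legal reasoning ...",
        "legal_provisions": ["Relevant statute or article ..."],
    },
    # One rubric list per part, so each part can be scored separately.
    "rubrics": {
        "conclusion": ["States the correct outcome of the claim"],
        "legal_facts": ["Identifies the decisive facts of the dispute"],
        "reasoning": ["Applies the correct legal test to those facts"],
        "legal_provisions": ["Cites the governing provision"],
    },
}
```
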
- `practical_case_analysis_250.jsonl` contains the case analysis questions. We have open-sourced a total of 250 questions, including the questions, reference answers, scoring rubrics, and score sheets.
- `public_legal_consultation_18.json` contains the legal consultation questions. We have open-sourced a total of 18 questions, including the consultation scenarios and scoring rubrics.
- `Defendants_Statement.json` and `Plantiffs_Statement.json` are legal writing tasks for drafting statements of defense and complaints, respectively. We have open-sourced a total of 12 questions, including the writing scenarios and scoring rubrics.
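
As a quick-start reference, the snippet below shows one way to load the released files with the Python standard library. The exact fields inside each record depend on the schemas described above.

```python
import json

# Case analysis: one JSON object per line (.jsonl).
with open("practical_case_analysis_250.jsonl", encoding="utf-8") as f:
    case_analysis = [json.loads(line) for line in f if line.strip()]

# Consultation and drafting tasks: regular JSON files.
with open("public_legal_consultation_18.json", encoding="utf-8") as f:
    consultation = json.load(f)

with open("Defendants_Statement.json", encoding="utf-8") as f:
    defendant_statements = json.load(f)

with open("Plantiffs_Statement.json", encoding="utf-8") as f:
    plaintiff_statements = json.load(f)

print(f"Loaded {len(case_analysis)} case-analysis items")
```
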
Our work makes three main contributions:
1. More realistic simulation of legal practice: We faithfully simulate real-world legal practice scenarios, with all tasks adapted from authentic cases. The benchmark organizes legal tasks into three hierarchical levels (public legal consultation, practical case analysis, and legal document generation), reflecting the full workflow of legal practitioners and enabling a comprehensive evaluation of LLM performance across diverse legal tasks.
In real legal practice, user queries are often vague, logically inconsistent, emotionally charged, or even intentionally incomplete. While preserving the core logic of real cases, we deliberately amplify these cognitively challenging elements—such as ambiguous descriptions, omitted key facts, and misleading details—to assess whether LLMs can effectively operate under realistic legal consultation conditions.
2. Fine-Grained Reasoning Steps: Beyond evaluating final outcomes, our benchmark explicitly incorporates fine-grained legal reasoning steps into task design and evaluation. This allows us to examine whether LLMs can perform multi-stage legal reasoning, including issue identification, fact clarification, legal analysis, and conclusion validation, rather than relying on shallow pattern matching or surface-level reasoning.
3. Task-Specific Rubrics: Our evaluation framework adopts personalized, task-specific rubrics annotated by legal experts, moving beyond purely outcome-based or form-based metrics to assess substantive legal reasoning and decision-making processes. For each type of legal task, legal experts first define a rubric framework tailored to the task’s reasoning requirements. Subsequently, they annotate case-specific rubrics for each individual legal scenario. This two-stage annotation process ensures that evaluation criteria are both principled and context-sensitive, enabling a more fine-grained, comprehensive, and realistic assessment of LLM performance in legal practice settings.
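
To make the rubric-based setup concrete, here is a minimal scoring sketch. It assumes a rubric is a list of weighted criteria and that a judge (human or LLM) marks each criterion as satisfied; the `RubricItem` structure and the weighted-average aggregation are illustrative assumptions, not PLawBench’s official scoring procedure.

```python
from dataclasses import dataclass

@dataclass
class RubricItem:
    criterion: str   # e.g. "Cites the governing statute for the claim"
    weight: float    # relative importance assigned by the annotator
    satisfied: bool  # judgment from a human or LLM judge

def rubric_score(items: list[RubricItem]) -> float:
    """Weighted share of satisfied criteria, scaled to 0-100."""
    total = sum(item.weight for item in items)
    if total == 0:
        return 0.0
    earned = sum(item.weight for item in items if item.satisfied)
    return 100.0 * earned / total

# Toy rubric for the "legal provisions" part of a case.
items = [
    RubricItem("Cites the governing statute", 2.0, True),
    RubricItem("Cites the relevant judicial interpretation", 1.0, False),
    RubricItem("Does not cite inapplicable provisions", 1.0, True),
]
print(rubric_score(items))  # 75.0
```
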
The table below reports scores by task (Task 1: public legal consultation, Task 2: practical case analysis, Task 3: legal document drafting).

| Family | Model | Overall | Task2-Avg | Task2-Conclusion | Task2-Facts | Task2-Reasoning | Task2-Statute | Task1 | Task3-Avg | Task3-Plaintiff | Task3-Defendant |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude | Claude-sonnet-4-20250514 | 53.55 | 55.91 | 62.57 | 79.75 | 49.54 | 35.66 | 58.24 | 46.48 | 39.60 | 53.35 |
| Claude | Claude-sonnet-4-5-20250929 | 65.88 | 67.57 | 67.61 | 88.32 | 64.05 | 47.22 | 70.98 | 59.67 | 52.05 | 67.29 |
| Claude | Claude-opus-4-5-20251101 | 66.47 | 68.00 | 69.82 | 83.61 | 65.49 | 53.61 | 68.92 | 62.27 | 56.54 | 68.01 |
| DeepSeek | DeepSeek-V3.2 | 57.97 | 64.01 | 65.56 | 82.89 | 61.64 | 43.27 | 60.12 | 46.48 | 44.43 | 48.52 |
| DeepSeek | DeepSeek-V3.2-thinking-inner | 63.23 | 64.11 | 63.91 | 86.35 | 60.24 | 44.19 | 72.09 | 55.87 | 46.13 | 65.61 |
| Doubao | Doubao-seed-1-6-250615 | 63.88 | 61.79 | 64.53 | 83.12 | 56.87 | 43.69 | 75.02 | 59.95 | 48.59 | 71.32 |
| Ernie | Ernie-5.0-thinking-preview | 51.39 | 62.89 | 64.17 | 85.84 | 58.33 | 40.96 | 27.79 | 47.94 | 46.70 | 49.18 |
| Gemini | Gemini-2.5-flash | 62.07 | 63.85 | 64.94 | 88.95 | 60.64 | 37.00 | 68.77 | 54.65 | 46.44 | 62.87 |
| Gemini | Gemini-2.5-pro | 64.05 | 64.14 | 68.12 | 78.48 | 64.49 | 43.15 | 70.31 | 59.73 | 55.35 | 64.11 |
| Gemini | Gemini-3.0-pro-preview | 66.35 | 64.95 | 72.03 | 77.79 | 65.00 | 46.42 | 70.17 | 66.13 | 63.84 | 68.42 |
| GLM | Glm-4.6 | 60.49 | 61.37 | 66.34 | 77.05 | 60.00 | 42.26 | 63.19 | 57.21 | 51.65 | 62.77 |
| GPT | GPT-4o-20240806 | 35.76 | 41.66 | 54.79 | 67.90 | 34.46 | 15.82 | 47.86 | 17.86 | 17.81 | 17.92 |
| GPT | GPT-5-0807-global | 67.76 | 62.92 | 66.77 | 86.21 | 60.27 | 34.18 | 78.71 | 68.54 | 61.05 | 76.03 |
| GPT | GPT-5.2-1211-global | 69.67 | 66.37 | 69.93 | 88.26 | 60.38 | 48.59 | 79.57 | 68.58 | 58.25 | 63.42 |
| Grok | Grok-4.1-fast | 52.67 | 58.17 | 63.32 | 89.94 | 53.37 | 21.55 | 59.07 | 39.23 | 30.25 | 48.22 |
| Kimi | Kimi-k2 | 57.27 | 60.26 | 64.62 | 80.96 | 56.66 | 40.27 | 62.73 | 48.66 | 41.78 | 55.54 |
| Qwen | Qwen-4b-instruct-2507 | 50.53 | 53.72 | 52.61 | 89.98 | 44.77 | 23.76 | 58.49 | 39.37 | 31.08 | 47.66 |
| Qwen | Qwen-4b-thinking-2507 | 44.80 | 52.70 | 56.61 | 89.97 | 45.97 | 26.73 | 41.88 | 39.83 | 30.22 | 49.44 |
| Qwen | Qwen-8b | 43.11 | 49.91 | 53.65 | 77.92 | 42.72 | 27.62 | 48.96 | 30.46 | 31.27 | 29.64 |
| Qwen | Qwen3-30b-a3b-instruct-2507 | 55.73 | 59.81 | 59.65 | 90.13 | 53.44 | 32.60 | 61.19 | 45.30 | 32.21 | 58.39 |
| Qwen | Qwen3-30b-a3b-thinking-2507 | 50.29 | 51.19 | 57.74 | 75.43 | 44.77 | 30.60 | 52.01 | 47.63 | 39.61 | 55.65 |
| Qwen | Qwen3-235b-a22b-instruct-2507 | 63.08 | 65.57 | 64.34 | 91.90 | 60.07 | 42.52 | 67.79 | 55.78 | 42.04 | 69.51 |
| Qwen | Qwen3-235b-a22b-thinking-2507 | 62.82 | 64.26 | 64.39 | 88.68 | 59.93 | 41.97 | 61.32 | 61.41 | 51.95 | 70.86 |
| Qwen | Qwen3-max | 64.75 | 67.17 | 67.52 | 90.97 | 62.75 | 45.10 | 75.76 | 53.38 | 49.33 | 57.43 |


