sebimo/LegalSum

Paper

This is the code used for the Summarization of German Court Rulings paper.

The dataset is available via this Dropbox link.

LegalSum

Codebase for the summarization of German court rulings. This includes:

  • data scraping & preprocessing pipeline
  • GUI for manual data validation
  • custom data loading & preprocessing
  • model implementation & training routines
  • evaluation routines

Dataset

Around 100,000 guiding principles (legal summaries) from German courts, with a total of 300k-400k summary sentences.

The dataset consists of one JSON file per judgment. Each judgment has at least one summarization sentence. The data within a JSON file is structured via the following entries:

| Key | Value |
| --- | --- |
| "id" | Assigned "Aktenzeichen" or ID for that judgment |
| "date" | Date format: YYYY-MM-DD |
| "court" | Which court produced the judgment. Format [type] [place], with [type] denoting the instance level. Not necessarily only two words. |
| "normchain" | The normchain assigned to the judgment. List[str] with each entry one normchain part. |
| "norms" | Dictionary of norms and norm placeholders. Where possible, the norms within the texts are replaced with a placeholder. This might introduce some errors, but the norms can easily be substituted back into the texts. |
| "inst" | Previous instances that dealt with the specific case. List[str] |
| "keywords" | Keywords assigned to the judgment. List[str] |
| "title" | Title assigned to the judgment. List[str] |
| "guiding_principle" | Contains two list entries, each with a varying number of sentences. The first entry corresponds to the "official" guiding principle sentences assigned by a court, whereas the second entry contains the "editorial" sentences by a third party. In the paper, they were treated equally and simply concatenated, as they were always taken from the same original source judgment. List[List[str]] |
| "tenor" | The immediate legal consequences of the judgment. It is also a kind of summary, but less interesting to study because it can be generated much more easily with templates. List[str] |
| "facts" | The facts of the judgment. List[str] |
| "reasoning" | The reasoning of the judgment. List[str] |
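
As a rough illustration of the structure above, the following sketch reads one judgment file and concatenates the official and editorial guiding principle sentences, as described for the "guiding_principle" entry. It is not the loading code from this repository. The function names are hypothetical, and treating "facts" plus "reasoning" as the text to be summarized is an assumption made only for this example.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_judgment(path: Path) -> Dict[str, Any]:
    """Read a single judgment JSON file into a dictionary."""
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def summary_sentences(judgment: Dict[str, Any]) -> List[str]:
    """Flatten "guiding_principle" (List[List[str]]) into one list.

    The official sentences (first entry) come before the editorial
    sentences (second entry), i.e. the two groups are simply concatenated.
    """
    return [
        sentence
        for group in judgment["guiding_principle"]
        for sentence in group
    ]


def source_sentences(judgment: Dict[str, Any]) -> List[str]:
    """Collect the sentences of the judgment body.

    Assumption for this sketch: the body is "facts" followed by
    "reasoning". Missing keys are treated as empty lists.
    """
    return list(judgment.get("facts", [])) + list(judgment.get("reasoning", []))
```

Calling `summary_sentences(load_judgment(path))` on a judgment file would then return the target summary as a flat list of sentences, while `source_sentences` returns the text it summarizes under the stated assumption.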

Usage

Install the requirements from environment.yml into a conda environment.

  • entry point for training is main.py
  • evaluation is done with oracle.py (for the extractive labels) and evaluate.py (for all other methods)

Directory Structure

  • src/ contains all the preprocessing, dataloading, training and evaluation code for the extractive and abstractive summarization methods
  • data/ contains all the code for the acquisition of the dataset (scraping, processing, validation,...)

The dataset is not included in this repository and should be copied to data/dataset. Some other binary files are also not included; information about them can be found within their folders.
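
Before starting a training run, it can help to check that the dataset has actually been copied into place. The sketch below is a hypothetical helper, not part of this repository. It assumes the judgment files carry a `.json` extension and are located somewhere below `data/dataset`.

```python
from pathlib import Path
from typing import List


def list_judgment_files(dataset_dir: Path = Path("data/dataset")) -> List[Path]:
    """Return all JSON files below dataset_dir, sorted by path.

    Raises FileNotFoundError if the directory does not exist, which
    usually means the dataset has not been copied yet.
    """
    if not dataset_dir.is_dir():
        raise FileNotFoundError(
            f"{dataset_dir} not found; copy the dataset there first"
        )
    # Assumption: one ".json" file per judgment; rglob also covers
    # the case where files are nested in subfolders.
    return sorted(dataset_dir.rglob("*.json"))


if __name__ == "__main__":
    files = list_judgment_files()
    print(f"Found {len(files)} judgment files")
```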

Research

If you make use of this dataset in your research, we ask that you please cite our paper:

@inproceedings{glaser-etal-2021-summarization,
    title = "Summarization of {G}erman Court Rulings",
    author = "Glaser, Ingo  and
      Moser, Sebastian  and
      Matthes, Florian",
    booktitle = "Proceedings of the Natural Legal Language Processing Workshop 2021",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.nllp-1.19",
    pages = "180--189",
    abstract = "Historically speaking, the German legal language is widely neglected in NLP research, especially in summarization systems, as most of them are based on English newspaper articles. In this paper, we propose the task of automatic summarization of German court rulings. Due to their complexity and length, it is of critical importance that legal practitioners can quickly identify the content of a verdict and thus be able to decide on the relevance for a given legal case. To tackle this problem, we introduce a new dataset consisting of 100k German judgments with short summaries. Our dataset has the highest compression ratio among the most common summarization datasets. German court rulings contain much structural information, so we create a pre-processing pipeline tailored explicitly to the German legal domain. Additionally, we implement multiple extractive as well as abstractive summarization systems and build a wide variety of baseline models. Our best model achieves a ROUGE-1 score of 30.50. Therefore with this work, we are laying the crucial groundwork for further research on German summarization systems.",
}
