# Exploring the LEDGAR Dataset

The LEDGAR dataset was introduced in the [paper](https://aclanthology.org/2020.lrec-1.155.pdf) *LEDGAR: A Large-Scale Multilabel Corpus for
Text Classification of Legal Provisions in Contracts*.

The dataset is available at https://drive.switch.ch/index.php/s/j9S0GRMAbGZKa1A

## Background

This dataset provides **contract provisions and their labels**. The labels are parsed from the provision titles.

The contract corpus comes from the **U.S. Securities and Exchange Commission** (SEC). The paper reports "12,000 labels
annotated in almost 100,000 provisions in over 60,000 contracts".

## Download

Download the whole dataset:
```Bash
# Quote the URL: an unquoted "&" would send wget to the background and cut the URL short
wget -O LEDGAR.zip 'https://drive.switch.ch/index.php/s/j9S0GRMAbGZKa1A/download?path=%2F'

# unzip the downloaded archive
unzip LEDGAR.zip

# Then get the LEDGAR folder
# There are four files: LEDGAR_2016-2019_clean.jsonl.zip  LEDGAR_2016-2019.jsonl.zip  README.txt  sec_crawl_data.tgz

unzip LEDGAR_2016-2019.jsonl.zip
# get sec_corpus_2016-2019.jsonl

unzip LEDGAR_2016-2019_clean.jsonl.zip
# get LEDGAR_2016-2019_clean.jsonl
```
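The unzipped files are large, so it can help to look at the first few records before loading everything into memory. The following is a minimal sketch, assuming the files are UTF-8 JSON Lines with one object per line; `peek_jsonl` is a hypothetical helper introduced here for illustration, and the path in the usage comment is an assumption about where the archive was unzipped.

```python
import json
from itertools import islice


def peek_jsonl(fname, n=3):
    """Return the first n records of a JSON Lines file without reading the rest."""
    with open(fname, encoding="utf8") as f:
        return [json.loads(line) for line in islice(f, n)]


# Example usage (assumed path; adjust it to your local setup):
# for record in peek_jsonl("LEDGAR/LEDGAR_2016-2019_clean.jsonl"):
#     print(record["label"], record["source"])
```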


## Statistics

```python
import json
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
from IPython.display import display
```

```python
def load_json(fname):
    """Load a single JSON document."""
    with open(fname, encoding='utf8') as f:
        return json.load(f)


def load_jsonl(fname):
    """Load a JSON Lines file: one JSON object per line."""
    with open(fname, encoding='utf8') as f:
        return [json.loads(line) for line in f]


# Functions for statistics and visualization
def count_labs(labs, no_print=False):
    """Statistics of a list of labels, including n_unique and most common"""
    TOP_K = 5
    ct = Counter(labs)
    # most_common() sorts by frequency; ct.items() would only give insertion order
    most_comm = ct.most_common(TOP_K)
    if not no_print:
        print(f'Number of unique: {len(ct)}')
        print(f'{most_comm}')
    return ct
```

```python
# Path to the unzipped LEDGAR folder; adjust to your local setup
ds_folder = Path("/storage/rhshui/workspace/datasets/ledgar/LEDGAR")

clean_data = load_jsonl(ds_folder / 'LEDGAR_2016-2019_clean.jsonl')

raw_data = load_jsonl(ds_folder / 'sec_corpus_2016-2019.jsonl')
```

### Overview

There are two versions: *raw* and *clean*.

The *raw* version, described in Section 3.2 of the paper, contains 1,850,284 labeled provisions in
72,605 contracts, with a label set of size 183,622.

The *clean* version, described in Section 3.3, contains 846,274 provisions and 12,608 labels.
- The cleanup includes splitting compound labels (e.g., on connectives such as "&" and "and", and on commas), pruning, and other steps.
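To make the label-splitting step concrete, here is a minimal sketch of splitting a compound provision title on "&", the word "and", and commas. It is an illustration only, not the paper's actual cleanup procedure: the regular expression and the helper name `split_label` are assumptions, and the sketch does not cover pruning or any other normalization.

```python
import re

# Split on "&", a comma, or the standalone word "and" (an illustrative rule set only).
_SPLIT_PATTERN = re.compile(r"\s*(?:&|,|\band\b)\s*")


def split_label(title):
    """Split a compound provision title into lower-cased candidate labels."""
    parts = _SPLIT_PATTERN.split(title.lower())
    # Drop empty fragments produced by adjacent separators such as ", and".
    return [p.strip() for p in parts if p.strip()]


print(split_label("Successors and Assigns"))   # ['successors', 'assigns']
print(split_label("Fees & Expenses, Taxes"))   # ['fees', 'expenses', 'taxes']
```

The `\b` word boundaries keep words that merely contain "and" (for example "standards") intact.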

**Format**: each sample consists of three fields: 
- `provision`: string of the provision text
- `label`: a list of provision labels
- `source`: SEC identifier of the contract the provision comes from
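The `source` values in this dataset look like `2019/QTR1/000119312519044328/d691151dex101.htm`. A minimal sketch for splitting such a string into its parts follows. The interpretation of the four components as year, quarter, filing identifier, and file name is an assumption based on how the string looks, and `SourceParts` and `parse_source` are hypothetical names.

```python
from typing import NamedTuple


class SourceParts(NamedTuple):
    year: int
    quarter: str
    filing_id: str   # assumed meaning of the third path component
    filename: str


def parse_source(source):
    """Split a `source` string of the assumed form year/quarter/filing_id/filename."""
    year, quarter, filing_id, filename = source.split("/")
    return SourceParts(int(year), quarter, filing_id, filename)


parts = parse_source("2019/QTR1/000119312519044328/d691151dex101.htm")
print(parts.year, parts.quarter)  # 2019 QTR1
```

With a parsed year and quarter, provisions can be grouped by filing period, for example to check how contracts are distributed over 2016 to 2019.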

```python
# Show these statistics
# For raw version
raw_df = pd.DataFrame(raw_data)
print('*'*20 + ' Raw ' + '*'*20)
display(raw_df.head(3))
print(f'Number of provisions: {len(raw_df)}')
print(f'Num of labels: {raw_df["label"].explode().nunique()}')
print(f'Num of contracts: {raw_df.source.nunique()}')

# For clean version
clean_df = pd.DataFrame(clean_data)
print('*'*20 + ' Clean ' + '*'*20)
display(clean_df.head(3))
print(f'Number of provisions: {len(clean_df)}')
print(f'Num of labels: {clean_df["label"].explode().nunique()}')
print(f'Num of contracts: {clean_df.source.nunique()}')
```

```text
******************** Raw ********************
   provision                                          label              source
0  Section and Subsection headings in this Amendm...  [headings]         2019/QTR1/000119312519044328/d691151dex101.htm
1  THIS AMENDMENT AND THE RIGHTS AND OBLIGATIONS ...  [applicable law]   2019/QTR1/000119312519044328/d691151dex101.htm
2  This Amendment may be executed in any number o...  [counterparts]     2019/QTR1/000119312519044328/d691151dex101.htm
Number of provisions: 1850284
Num of labels: 183622
Num of contracts: 72605
******************** Clean ********************
   provision                                          label              source
0  Section and Subsection headings in this Amendm...  [headings]         2019/QTR1/000119312519044328/d691151dex101.htm
1  THIS AMENDMENT AND THE RIGHTS AND OBLIGATIONS ...  [applicable laws]  2019/QTR1/000119312519044328/d691151dex101.htm
2  This Amendment may be executed in any number o...  [counterparts]     2019/QTR1/000119312519044328/d691151dex101.htm
Number of provisions: 846274
Num of labels: 12608
Num of contracts: 60540
```


### Label Sub-sampling

Section 4.1 of the paper discusses the selection of labels. We replicate these selections.

**Prototypical** refers to the 13 most common labels, shown as follows:

```python
lab_counts = clean_df.label.explode().value_counts()
# value_counts() sorts by frequency, so the first 13 items are the 13 most common labels
proto_labels = list(lab_counts.items())[:13]
# uncomment the following line to sort alphabetically
# proto_labels.sort(key = lambda k: k[0])
print(proto_labels)
```

```text
[('governing laws', 17377), ('amendments', 13262), ('entire agreements', 11825), ('counterparts', 11708), ('notices', 10359), ('waivers', 9354), ('expenses', 9066), ('severability', 9023), ('successors', 8508), ('assigns', 6363), ('assignments', 6246), ('survival', 6226), ('representations', 6082)]
```
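Once a label subset is chosen, the corpus can be restricted to the provisions that carry at least one of those labels. The sketch below shows one possible way to do this on toy data. The helper names `top_k_labels` and `filter_by_labels` are hypothetical, the toy records are not real LEDGAR entries, and dropping the labels outside the subset is a design choice made here, not necessarily what the paper does.

```python
import pandas as pd


def top_k_labels(df, k):
    """Return the k most frequent labels across all provisions."""
    return df["label"].explode().value_counts().head(k).index.tolist()


def filter_by_labels(df, keep):
    """Keep provisions with at least one label in `keep`; drop all other labels."""
    keep = set(keep)
    out = df.copy()
    out["label"] = out["label"].apply(lambda labs: [lab for lab in labs if lab in keep])
    # Remove provisions that have no remaining label.
    return out[out["label"].map(len) > 0].reset_index(drop=True)


# Toy data standing in for clean_df (not real LEDGAR records).
toy_df = pd.DataFrame({
    "provision": ["p1", "p2", "p3", "p4"],
    "label": [["governing laws"], ["notices", "waivers"],
              ["governing laws", "notices"], ["taxes"]],
})
keep = top_k_labels(toy_df, k=2)
subset = filter_by_labels(toy_df, keep)
print(keep)
print(subset)
```

On the real data, the same calls would be `top_k_labels(clean_df, k=13)` followed by `filter_by_labels(clean_df, keep)`.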


```python
# The 20 most frequent labels of the clean version
top_counts = lab_counts.head(20)
top_counts
```

```text
label
governing laws          17377
amendments              13262
entire agreements       11825
counterparts            11708
notices                 10359
waivers                  9354
expenses                 9066
severability             9023
successors               8508
assigns                  6363
assignments              6246
survival                 6226
representations          6082
warranties               5894
terminations             5436
taxes                    5376
compliance with laws     4991
terms                    4840
insurances               4677
fees                     4504
Name: count, dtype: int64
```

```python
# The 13 most frequent labels of the raw version, for comparison
raw_df.label.explode().value_counts().head(13)
```

```text
label
governing law             36106
counterparts              30665
severability              30196
entire agreement          21805
headings                  16053
notices                   14923
successors and assigns    12748
survival                  11968
waiver of jury trial       8602
further assurances         8321
waiver                     8003
amendment                  7920
amendments                 7557
Name: count, dtype: int64
```