# CLAUDETTE: Dataset for Unfair Clauses in Online Terms of Service (TOS)

## Overview

This dataset was introduced in the paper ["CLAUDETTE: an Automated Detector of Potentially Unfair Clauses in Online Terms of Service"](https://arxiv.org/pdf/1805.01217#page=4.90).

The dataset can be accessed on this [website](http://claudette.eui.eu/corpora/index.html).
- The website hosts a series of datasets related to clause detection, published by the CLAUDETTE project of the European University Institute (EUI).
- The project aims to "empower consumers and civil society" and focuses on "evaluating terms of service and privacy policies of online platforms and apps".
- The legal basis is EU law: the "Unfair Contract Terms Directive 93/13 and the GDPR".

## Introduction
The authors identify 8 categories of unfair clauses and annotate 50 contracts.

### Format
Each field is stored in its own child directory. A directory contains one file per contract, named after the contract ID and holding that contract's values for the field.

- There are 9 label fields: a general `Labels` field plus one `Labels_<cat>` field for each of the 8 categories. Each file has one value per line, either -1 or 1, indicating whether the corresponding sentence falls into the unfairness category.
- The `OriginalTaggedDocuments` and `Sentences` fields provide the clause text, with clauses separated by `\n`. The annotations correspond line by line to the `Sentences` field.
- The remaining two fields, `Postags` and `Trees`, hold the part-of-speech tags and parse trees.
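
The layout described above can be illustrated with a minimal sketch. It builds a tiny stand-in corpus in a temporary directory rather than reading the real data; the `read_field` helper, the `Toy` contract ID, and the `.txt` extension are assumptions made for illustration only.

```python
import tempfile
from pathlib import Path

def read_field(root, field, cont_id):
    """Return the lines of one contract file stored under <root>/<field>/."""
    # The '.txt' extension is an assumption; adjust it to the real file names.
    text = (Path(root) / field / f"{cont_id}.txt").read_text(encoding="utf8")
    return text.strip("\n").split("\n")

# Build a tiny stand-in corpus so the sketch runs without the real dataset.
root = Path(tempfile.mkdtemp())
toy = {
    "Sentences": "you agree to our terms .\nwe may terminate your account at any time .",
    "Labels": "-1\n1",
    "Labels_TER": "-1\n1",
}
for field, content in toy.items():
    (root / field).mkdir()
    (root / field / "Toy.txt").write_text(content + "\n", encoding="utf8")

# Line i of a label file annotates line i of the Sentences file.
sentences = read_field(root, "Sentences", "Toy")
labels = [int(v) for v in read_field(root, "Labels", "Toy")]
pairs = list(zip(sentences, labels))
print(pairs)
```

Pairing by line index only works because every field file of a contract has the same number of lines, which is the property the loader relies on.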

### Statistics
There are **9414** sentences, of which **1032** are labeled as positive.

In [1]:
import json
from pathlib import Path
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
from IPython.display import display

In [99]:
def load_json(fname):
    """Load a JSON file."""
    with open(fname, encoding = 'utf8') as f:
        return json.load(f)

def load_jsonl(fname):
    """Load a JSON Lines file (one JSON object per line)."""
    with open(fname, encoding = 'utf8') as f:
        return [json.loads(k) for k in f]

# Functions of statistics and visualization
def count_labs(labs, no_print = False):
    """Statistics of a list of labels, including n_unique and most common"""
    TOP_K = 5
    ct = Counter(labs)
    # most_common sorts by frequency; slicing ct.items() only follows insertion order
    most_comm = ct.most_common(TOP_K)
    if not no_print:
        print(f'Number of unique: {len(ct)}')
        print(f'{most_comm}')
    return ct

def load_claudette_raw(path):
    """
    Load the claudette dataset.

    Each field is stored under a directory. Load them and combine them by contracts.
    """
    path = Path(path)
    # Load all fields and merge them into a dataframe. The first column is the
    # contract id and the remaining columns are the fields
    data = None
    for field_dir in path.iterdir():
        field_name = field_dir.name
        field_df = pd.DataFrame([{'cont_id': cont_file.stem,
                                  field_name: cont_file.read_text()} 
                                    for cont_file in field_dir.iterdir()]).astype('str')
        # display(field_df.head(3))
        if data is None:
            data = field_df
        else:
            data = data.merge(field_df, on = 'cont_id', how = 'outer')
    return data

def load_claudette(path = None, raw_df = None):
    """Load and processing and return by sentences"""
    data_df = raw_df if raw_df is not None else load_claudette_raw(path)
    # remove unwanted fields
    data_df = data_df.drop(['OriginalTaggedDocuments', 'Trees', 'Postags'], axis = 1)
    # Get the pairs of sentences and their labels
    lines = []
    for _, cont in data_df.iterrows():
        cont_id = cont['cont_id']
        # fields of sentences and labels
        whole = cont.drop(labels = 'cont_id')
        # split by sentences
        whole = whole.apply(lambda k: k.strip('\n').split('\n'))
        keys = whole.index.tolist()
        values = whole.tolist()
        # transpose
        for line in list(zip(*values)):
            line_dt = {'cont_id': cont_id, **dict(zip(keys, line))}
            lines.append(line_dt)
    return pd.DataFrame(lines)
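
The split-and-transpose step in `load_claudette` can be sketched on toy data as follows. Note that `zip` silently stops at the shortest field, so this sketch adds an explicit length check; the helper name `rows_from_contract` and the toy strings are hypothetical and not part of the original notebook.

```python
def rows_from_contract(cont_id, fields):
    """Turn newline-separated field strings into one dict per sentence."""
    # Split every field into its per-sentence values.
    split = {name: text.strip("\n").split("\n") for name, text in fields.items()}
    # zip() would truncate to the shortest field, so fail loudly on a mismatch.
    lengths = {name: len(vals) for name, vals in split.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"{cont_id}: fields differ in length: {lengths}")
    # Transpose: one tuple per sentence, then attach the field names.
    return [{"cont_id": cont_id, **dict(zip(split, row))}
            for row in zip(*split.values())]

rows = rows_from_contract(
    "Toy",
    {"Sentences": "first sentence .\nsecond sentence .\n", "Labels": "-1\n1\n"},
)
print(rows)
```

Whether the real corpus ever contains mismatched files is not stated in the source; the check is only a cheap safeguard.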


In [108]:
home = Path('/storage/rhshui/workspace/datasets/legal/claudette/ToS')

# demo to load one field
dir_labs = home / 'Labels'
field_df = pd.DataFrame([{'cont_id': cont_file.stem,
                                  'Labels': cont_file.read_text()} 
                                    for cont_file in dir_labs.iterdir()]).astype('str')

# Load the raw fields
raw_df = load_claudette_raw(home)
display(raw_df.head(3))

# Load and process fields
data_df = load_claudette(raw_df = raw_df)
# reorder the columns: contract id, sentence, then the sorted label columns
lab_names = sorted(filter(lambda k: k.startswith('Labels'), data_df.columns))
data_df = data_df.reindex(['cont_id', 'Sentences', *lab_names], axis = 1)
# rename the sentence column
data_df = data_df.rename(columns = {'Sentences': 'sent'})
# cast the label values from str to int
data_df[lab_names] = data_df[lab_names].astype('int')
display(data_df.head(5))

Unnamed: 0,cont_id,OriginalTaggedDocuments,Labels_A,Labels_LAW,Postags,Labels_J,Labels_CR,Sentences,Trees,Labels_LTD,Labels_USE,Labels,Labels_CH,Labels_TER
0,eBay,"Introduction\nThis User Agreement, the User Pr...",-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1...,-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1...,"DT NN NN , DT NN NNP NNP , DT NNP NNPS NNS , C...",-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1...,-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1...,"this user agreement , the user privacy notice ...",(ROOT (S (NP (NP (DT This) (NN User) (NN Agree...,-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1...,-1\n-1\n-1\n1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\...,-1\n-1\n-1\n1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\...,-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1...,-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1...
1,Crowdtangle,"CrowdTangle, Inc. (“CrowdTangle,” “we” or “us”...",-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1...,-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1...,"NNP , NNP -LRB- `` NNP , '' `` PRP '' CC `` PR...",-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1...,-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1...,"crowdtangle , inc. -lrb- `` crowdtangle , '' `...","(ROOT (S (NP (NP (NNP CrowdTangle)) (, ,) (NP ...",-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1...,-1\n-1\n1\n-1\n-1\n1\n-1\n-1\n-1\n-1\n-1\n-1\n...,-1\n-1\n1\n1\n-1\n1\n-1\n-1\n-1\n-1\n-1\n-1\n-...,-1\n-1\n-1\n1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\...,-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1...
2,Academia,"Last Updated Date: May 15, 2017\nAcademia, Inc...",-1\n-1\n-1\n-1\n1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\...,-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1...,"JJ NNP NNP : NNP CD , CD \nNNP , NNP -LRB- `` ...",-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1...,-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1...,"last updated date : may 15 , 2017 \nacademia ,...",(ROOT (NP (NP (JJ Last) (NNP Updated) (NNP Dat...,-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1...,-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1...,-1\n-1\n-1\n-1\n1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\...,-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1...,-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1...


| | cont_id | sent | Labels | Labels_A | Labels_CH | Labels_CR | Labels_J | Labels_LAW | Labels_LTD | Labels_TER | Labels_USE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | eBay | this user agreement , the user privacy notice ... | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 |
| 1 | eBay | you can find an overview of our policies here . | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 |
| 2 | eBay | all policies , the mobile devices terms , and ... | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 |
| 3 | eBay | you agree to comply with all of the above when... | 1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | 1 |
| 4 | eBay | the entity you are contracting with is ebay in... | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 |


### Statistics

In [111]:
print(f'Total sentences: {len(data_df)}')
print(data_df['Labels'].value_counts())

Total sentences: 9414
Labels
-1    8382
 1    1032
Name: count, dtype: int64
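
The counts above give a positive rate of 1032 / 9414, or about 11%. The same breakdown can be computed per category; the following is a minimal sketch on a toy frame with the same -1/1 encoding, where `toy_df` and its values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in with the same -1/1 encoding as data_df (values are invented).
toy_df = pd.DataFrame({
    "Labels":     [-1, 1, 1, -1],
    "Labels_TER": [-1, 1, -1, -1],
    "Labels_USE": [-1, -1, 1, -1],
})

lab_cols = [c for c in toy_df.columns if c.startswith("Labels")]
# Count the positives (value 1) in each label column.
pos_counts = (toy_df[lab_cols] == 1).sum()
# Share of sentences flagged in each column.
pos_rate = pos_counts / len(toy_df)
print(pd.DataFrame({"positives": pos_counts, "rate": pos_rate}))
```

Applied to `data_df[lab_names]`, the same two lines would show how the positives are distributed over the 8 categories.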


In [112]:
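# Label matrix: column 0 is the general `Labels` flag, columns 1.. are the 8 categories
labs = data_df.iloc[:,2:].to_numpy()
# True where at least one category label is positive
multi = (labs[:,1:] == 1).any(axis = 1)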

In [113]:
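# Caution: `multi` is bool but labs[:,0] is -1/1, so any row without a positive category makes this False; compare with (labs[:,0] == 1) instead
np.all(multi == labs[:,0])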

False
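
A `False` result here does not by itself show that `Labels` disagrees with the category columns, because a boolean array is being compared with -1/1 values. A minimal sketch on a toy matrix, where the array below is invented for illustration, shows the difference between the two comparisons:

```python
import numpy as np

# Toy label matrix: column 0 is the general flag, columns 1.. are categories.
labs = np.array([
    [-1, -1, -1],
    [ 1,  1, -1],
    [ 1, -1,  1],
])

# True where at least one category is positive.
multi = (labs[:, 1:] == 1).any(axis=1)

# Comparing bool with -1/1: False never equals -1, so negative rows mismatch.
naive = np.all(multi == labs[:, 0])
# Converting the flag to bool first compares like with like.
fixed = np.all(multi == (labs[:, 0] == 1))
print(naive, fixed)
```

On this toy matrix the flag is consistent with the categories by construction; whether the same holds for the real data would need to be checked by rerunning the corrected comparison.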

In [115]:
labs[0]

array([-1, -1, -1, -1, -1, -1, -1, -1, -1])

In [60]:
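# `cont` is one row of raw_df (a pandas Series), presumably left over from an earlier cell that is not shown
type(cont)
cont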

cont_id                                                                 eBay
OriginalTaggedDocuments    Introduction\nThis User Agreement, the User Pr...
Labels_A                   -1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1...
Labels_LAW                 -1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1...
Postags                    DT NN NN , DT NN NNP NNP , DT NNP NNPS NNS , C...
Labels_J                   -1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1...
Labels_CR                  -1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1...
Sentences                  this user agreement , the user privacy notice ...
Trees                      (ROOT (S (NP (NP (DT This) (NN User) (NN Agree...
Labels_LTD                 -1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1...
Labels_USE                 -1\n-1\n-1\n1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\...
Labels                     -1\n-1\n-1\n1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\...
Labels_CH                  -1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1...

In [61]:
next(cont.items())

('cont_id', 'eBay')

In [66]:
cont.tolist()

['eBay',
 '-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1\n-1