# CLAUDETTE: Dataset for Unfair Clauses in Online Terms of Service (TOS)

## Overview

This dataset was introduced in the paper "CLAUDETTE: an Automated Detector of Potentially Unfair Clauses in Online Terms of Service" ([link](https://arxiv.org/pdf/1805.01217#page=4.90)).

The dataset can be accessed on this [website](http://claudette.eui.eu/corpora/index.html).
- This website hosts a series of datasets related to clause detection, published by the CLAUDETTE project of the European University Institute (EUI).
- The project aims to "empower consumers and civil society" and focuses on "evaluating terms of service and privacy policies of online platforms and apps".
- The legal basis is EU law: the "Unfair Contract Terms Directive 93/13 and the GDPR".

## Introduction
The authors identify 8 categories of unfair clauses and annotate 50 contracts.

### Format
The dataset folder has several subfolders, whose names denote "*fields*".

- **Label Fields**: there are 9 label fields: one `Labels_<cat>` field per category (8 in total) plus an overall `Labels` field. Each field subfolder contains one file per contract, named after the contract, with one value per line: -1 (negative) or 1 (positive).
- **Doc Fields**: two fields hold the contract contents. `OriginalTaggedDocuments` contains the original documents, with unfair clauses marked by `<lab></lab>` tags. `Sentences` contains the clauses separated by '\n', line-aligned with the label annotations.
- The remaining two fields (`Postags` and `Trees`) hold PoS tags and parse trees and are not used here.
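
A minimal sketch of how a label file lines up with `Sentences`, assuming each field folder holds one plain-text file per contract. The `.txt` extension, the contract name `Demo`, and the helper `read_contract` are hypothetical and used only for illustration.

```python
import tempfile
from pathlib import Path


def read_contract(root, cont_name, label_field="Labels"):
    """Pair each sentence of one contract with its -1/1 label."""
    root = Path(root)
    # The file extension is not assumed; match any file with this stem.
    sent_file = next((root / "Sentences").glob(f"{cont_name}.*"))
    lab_file = next((root / label_field).glob(f"{cont_name}.*"))
    sentences = sent_file.read_text(encoding="utf8").strip("\n").split("\n")
    labels = [int(v) for v in lab_file.read_text(encoding="utf8").split()]
    return list(zip(sentences, labels))


# Build a tiny, made-up layout that mimics the folder structure.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "Sentences").mkdir()
    (root / "Labels").mkdir()
    (root / "Sentences" / "Demo.txt").write_text(
        "first clause .\nsecond clause .\n", encoding="utf8")
    (root / "Labels" / "Demo.txt").write_text("-1\n1\n", encoding="utf8")
    pairs = read_contract(root, "Demo")

print(pairs)
```

Line `i` of a label file is taken to describe line `i` of the matching `Sentences` file, which is the alignment the loader in this notebook relies on.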

### Statistics
There are **9414** sentences, of which **1032** are labeled as positive. [Link](#count-of-sentences-and-unfair-clauses)
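
For a quick sense of the class imbalance, the two counts above give the share of positive sentences. This is plain arithmetic on the reported numbers, not a new measurement.

```python
total_sentences = 9414
positive = 1032

negative = total_sentences - positive
positive_rate = positive / total_sentences

print(f"negative: {negative}")
print(f"positive share: {positive_rate:.1%}")
```

Roughly one sentence in nine is positive, so a classifier evaluated on this data is better judged by precision, recall, and F1 than by accuracy.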

## Labels
There are 8 labels of unfair clause categories.
- **Arbitration** (Labels_A)
  - **Def**: Clauses requiring or allowing the parties to resolve their disputes through an arbitration process before the case can go to court.
  - **Unfair**: The arbitration must take place in a state other than the consumer's state of residence, or is based on the arbiter's discretion.
  - **Potentially Unfair**: Arbitration is not fully optional.
- **Unilateral Change** (Labels_CH)
  - **Def**: Clauses specifying the conditions under which the service provider could amend and modify the terms of service and/or the service itself.
  - **Potentially Unfair**: Always marked as such.
- **Content Removal** (Labels_CR)
  - **Def**: Clauses giving the provider a right to modify/delete the user’s content, including in-app purchases, with or without specified conditions.
  - **Potentially Unfair**: Always marked as such.
- **Jurisdiction** (Labels_J)
  - **Def**: Which courts have the competence to adjudicate disputes under the contract.
  - **Fair**: Clauses giving consumers a right to bring disputes in their place of residence.
  - **Unfair**: Clauses stating that any judicial proceeding takes place away from the consumer's residence (e.g., in a different city or country).
- **Choice of Law** (Labels_LAW)
  - **Def**: Which law governs the contract and applies to potential disputes.
  - **Fair**: Clauses defining the applicable law as the law of the consumer’s country of residence.
  - **Potentially Unfair**: All other cases, since assessing them requires considering conditions beyond the wording of the clause.
- **Limitation of Liability** (Labels_LTD)
  - **Def**: Clauses stipulating that the duty to pay damages is limited or excluded for certain kinds of losses under certain conditions.
  - **Fair**: Clauses that explicitly affirm non-excludable providers’ liabilities.
  - **Potentially Unfair**: Clauses that reduce, limit, or exclude the liability of the service provider for broad categories of losses or their causes, *e.g.*, suspension of the service.
  - **Unfair**: Clauses meant to reduce, limit, or exclude the liability of the service provider for physical injuries, intentional damages, or gross negligence.
- **Unilateral Termination** (Labels_TER)
  - **Def**: Clauses giving the provider the right to suspend and/or terminate the service and/or the contract, sometimes detailing the circumstances under which the provider claims this right.
  - **Potentially Unfair**: The clause specifies reasons for termination.
  - **Unfair**: Termination can happen at any time, for any or no reason, and/or without notice.
- **Contract by Using** (Labels_USE)
  - **Def**: Clauses stipulating that by using the service, the consumer agrees to the terms of use without needing to confirm they have read and accepted them.
  - **Potentially Unfair**: Always marked as such.

The statistics of clause types: [Link]()
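
The column suffixes map to the category names listed above. A small lookup table, sketched below, makes label rows easier to read. `CATEGORY_NAMES`, `active_categories`, and the example row are illustrative and not part of the dataset.

```python
# Field name -> category name, following the list in this section.
CATEGORY_NAMES = {
    "Labels_A": "Arbitration",
    "Labels_CH": "Unilateral Change",
    "Labels_CR": "Content Removal",
    "Labels_J": "Jurisdiction",
    "Labels_LAW": "Choice of Law",
    "Labels_LTD": "Limitation of Liability",
    "Labels_TER": "Unilateral Termination",
    "Labels_USE": "Contract by Using",
}


def active_categories(row):
    """Return the names of all categories marked 1 in a row of labels."""
    return [name for col, name in CATEGORY_NAMES.items()
            if int(row.get(col, -1)) == 1]


# Made-up row: every category negative except two.
example_row = {col: -1 for col in CATEGORY_NAMES}
example_row["Labels_CR"] = 1
example_row["Labels_TER"] = 1

decoded = active_categories(example_row)
print(decoded)
```

Because a sentence can carry several categories at once, the function returns a list rather than a single name.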

In [10]:
import json
from pathlib import Path
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
from collections import Counter
from IPython.display import display

In [4]:
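def load_json(fname):
    """Load a JSON file."""
    with open(fname, encoding = 'utf8') as f:
        return json.load(f)

def load_jsonl(fname):
    """Load a JSON Lines file (one JSON object per line)."""
    with open(fname, encoding = 'utf8') as f:
        return [json.loads(k) for k in f]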

# Functions of statistics and visualization
def count_labs(labs, no_print = False):
    """Statistics of a list of labels, including n_unique and most common"""
    TOP_K = 5
    ct = Counter(labs)
    # most_common sorts by frequency; list(ct.items()) only keeps insertion order
    most_comm = ct.most_common(TOP_K)
    if not no_print:
        print(f'Number of unique: {len(ct)}')
        print(f'{most_comm}')
    return ct

def load_claudette_raw(path):
    """
    Load the claudette dataset.

    Each field is stored under a directory. Load them and combine them by contracts.
    """
    path = Path(path)
    # Load all fields and merge into a dataframe. The first column is the contract id
    # and the remaining columns are the fields
    data = None
    for field_dir in path.iterdir():
        if not field_dir.is_dir():
            continue  # skip stray files next to the field folders
        field_name = field_dir.name
        field_df = pd.DataFrame([{'cont_id': cont_file.stem,
                                  field_name: cont_file.read_text(encoding = 'utf8')}
                                    for cont_file in field_dir.iterdir()]).astype('str')
        # display(field_df.head(3))
        if data is None:
            data = field_df
        else:
            data = data.merge(field_df, on = 'cont_id', how = 'outer')
    return data

def load_claudette(path = None, raw_df = None):
    """Load the dataset, process it, and return one row per sentence"""
    data_df = raw_df if raw_df is not None else load_claudette_raw(path)
    # remove unwanted fields
    data_df = data_df.drop(['OriginalTaggedDocuments', 'Trees', 'Postags'], axis = 1)
    # Get the pairs of sentences and their labels
    lines = []
    for _, cont in data_df.iterrows():
        cont_id = cont['cont_id']
        # fields of sentences and labels
        whole = cont.drop(labels = 'cont_id')
        # split by sentences
        whole = whole.apply(lambda k: k.strip('\n').split('\n'))
        keys = whole.index.tolist()
        values = whole.tolist()
        # transpose
        for line in list(zip(*values)):
            line_dt = {'cont_id': cont_id, **dict(zip(keys, line))}
            lines.append(line_dt)
    return pd.DataFrame(lines)
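
Note that `zip(*values)` stops at the shortest field, so a contract whose `Sentences` and label files have different line counts would be truncated without warning. A defensive variant, sketched here on plain strings, checks the line counts first. `split_fields` is a hypothetical helper, not part of the notebook.

```python
def split_fields(fields):
    """Turn {field name: newline-separated text} into one dict per line.

    Raises ValueError if the fields do not all have the same line count.
    """
    columns = {name: text.strip("\n").split("\n")
               for name, text in fields.items()}
    lengths = {name: len(col) for name, col in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"line counts differ across fields: {lengths}")
    # Transpose: one row per line, keyed by field name.
    return [dict(zip(columns, row)) for row in zip(*columns.values())]


rows = split_fields({
    "Sentences": "first clause .\nsecond clause .\n",
    "Labels": "-1\n1\n",
})
print(rows)

# A field with a missing line is reported instead of silently truncated.
try:
    split_fields({"Sentences": "only one clause .\n", "Labels": "-1\n1\n"})
    mismatch_detected = False
except ValueError as err:
    mismatch_detected = True
    print(err)
```

On Python 3.10 or later, `zip(*values, strict=True)` gives a similar guarantee with less code.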


In [5]:
home = Path('/storage/rhshui/workspace/datasets/legal/claudette/ToS')

# demo to load one field
# dir_labs = home / 'Labels'
# field_df = pd.DataFrame([{'cont_id': cont_file.stem,
#                                   'Labels': cont_file.read_text()} 
#                                     for cont_file in dir_labs.iterdir()]).astype('str')

# Load the raw fields
raw_df = load_claudette_raw(home)
# display(raw_df.head(3))
print(raw_df.columns)

# Load and process fields
data_df = load_claudette(raw_df = raw_df)
# reorder the columns
lab_names = sorted(filter(lambda k: k.startswith('Labels'), data_df.columns))
data_df = data_df.reindex(['cont_id', 'Sentences', *lab_names], axis = 1)
# rename the sentence column
data_df = data_df.rename(columns = {'Sentences': 'sent'})
# cast the label values from str to int
data_df[lab_names] = data_df[lab_names].astype('int')
display(data_df.head(5))

Index(['cont_id', 'OriginalTaggedDocuments', 'Labels_A', 'Labels_LAW',
       'Postags', 'Labels_J', 'Labels_CR', 'Sentences', 'Trees', 'Labels_LTD',
       'Labels_USE', 'Labels', 'Labels_CH', 'Labels_TER'],
      dtype='object')


| | cont_id | sent | Labels | Labels_A | Labels_CH | Labels_CR | Labels_J | Labels_LAW | Labels_LTD | Labels_TER | Labels_USE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | eBay | this user agreement , the user privacy notice ... | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 |
| 1 | eBay | you can find an overview of our policies here . | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 |
| 2 | eBay | all policies , the mobile devices terms , and ... | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 |
| 3 | eBay | you agree to comply with all of the above when... | 1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | 1 |
| 4 | eBay | the entity you are contracting with is ebay in... | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 |


In [5]:
data_df[(data_df['cont_id'] == 'eBay') & (data_df['Labels'] == 1)]

| | cont_id | sent | Labels | Labels_A | Labels_CH | Labels_CR | Labels_J | Labels_LAW | Labels_LTD | Labels_TER | Labels_USE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 3 | eBay | you agree to comply with all of the above when... | 1 | -1 | -1 | -1 | -1 | -1 | -1 | -1 | 1 |
| 40 | eBay | failure to meet these standards may result in ... | 1 | -1 | -1 | -1 | -1 | -1 | -1 | 1 | -1 |
| 41 | eBay | if we believe you are abusing ebay in any way ... | 1 | -1 | -1 | 1 | -1 | -1 | -1 | 1 | -1 |
| 44 | eBay | we may cancel unconfirmed accounts or accounts... | 1 | -1 | -1 | -1 | -1 | -1 | -1 | 1 | -1 |
| 45 | eBay | additionally , we reserve the right to refuse ... | 1 | -1 | -1 | -1 | -1 | -1 | -1 | 1 | -1 |
| 49 | eBay | we may change our seller fees from time to tim... | 1 | -1 | 1 | -1 | -1 | -1 | -1 | -1 | -1 |
| 62 | eBay | • we may revise data in the ebay product catal... | 1 | -1 | 1 | -1 | -1 | -1 | -1 | -1 | -1 |
| 102 | eBay | the permission to use catalog content is subje... | 1 | -1 | 1 | -1 | -1 | -1 | -1 | 1 | -1 |
| 161 | eBay | we may suspend the ebay money back guarantee i... | 1 | -1 | -1 | -1 | -1 | -1 | -1 | 1 | -1 |
| 168 | eBay | in addition , to the extent permitted by appli... | 1 | -1 | -1 | -1 | -1 | -1 | 1 | -1 | -1 |
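
The ten eBay rows displayed above also show that the annotation is multi-label: one sentence can belong to several categories. The tally below is a sketch that transcribes the active columns of those rows by hand. The dictionary `ebay_positive` is an illustrative structure, not something the notebook produces.

```python
from collections import Counter

# Row index -> category columns set to 1, copied from the output above.
ebay_positive = {
    3: ["Labels_USE"],
    40: ["Labels_TER"],
    41: ["Labels_CR", "Labels_TER"],
    44: ["Labels_TER"],
    45: ["Labels_TER"],
    49: ["Labels_CH"],
    62: ["Labels_CH"],
    102: ["Labels_CH", "Labels_TER"],
    161: ["Labels_TER"],
    168: ["Labels_LTD"],
}

# Count how often each category is active across the rows.
per_category = Counter(col for cols in ebay_positive.values() for col in cols)
# Rows that carry more than one category.
multi_label = [idx for idx, cols in ebay_positive.items() if len(cols) > 1]

print(per_category.most_common())
print(f"rows with more than one category: {multi_label}")
```

On the full dataframe, `(data_df[lab_names] == 1).sum()` would give the same kind of per-column count without transcribing anything.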


In [1]:
# ori_doc = raw_df.iloc[0,1]
# print(ori_doc)

## Statistics

### Count of sentences and unfair clauses

In [111]:
print(f'Total sentences: {len(data_df)}')
print(data_df['Labels'].value_counts())

Total sentences: 9414
Labels
-1    8382
 1    1032
Name: count, dtype: int64


### Analyze tags

In [6]:
tagged_docs = raw_df['OriginalTaggedDocuments']

total_raw_sent = tagged_docs.apply(lambda k: len(k.split('\n'))).sum()
print(f'Total raw sent: {total_raw_sent}')

Total raw sent: 8451


In [18]:
doc = tagged_docs.iloc[1]
for k in re.finditer(r'<[a-z]+?[0-9]>.+</[a-z]+?[0-9]>', doc):
    print(k)

<re.Match object; span=(545, 699), match='<use2> In order to access and use our Services, y>
<re.Match object; span=(702, 1092), match='<ch2>CrowdTangle may, in its sole discretion, upd>
<re.Match object; span=(2652, 2768), match='<ter3>CrowdTangle may terminate the license grant>
<re.Match object; span=(7693, 8071), match='<ter3>CrowdTangle, in its sole discretion, has th>
<re.Match object; span=(9566, 10277), match='<ltd2>TO THE MAXIMUM EXTENT PERMITTED BY APPLICAB>
<re.Match object; span=(10279, 11417), match='<ltd3>WITH THE EXCEPTION OF ANY ACTS OR OMISSIONS>
<re.Match object; span=(12616, 12711), match='<ltd2>In those cases, CrowdTangle will not be hel>
<re.Match object; span=(13654, 13824), match='<ch2>CrowdTangle may change these Terms of Servic>
<re.Match object; span=(14959, 15537), match='<j3>You will resolve any claim, cause of action o>
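
Each tag name in the matches above combines a category code with a digit. The sketch below splits the two and uses a backreference so that a closing tag must match its opening tag. The reading of the digit (1 = fair, 2 = potentially unfair, 3 = unfair) is an assumption based on the CLAUDETTE paper, and `parse_tags` and the sample text are made up for illustration.

```python
import re

# <code><digit> ... </code><digit>, with the closing tag tied to the opening one.
TAG_RE = re.compile(r"<([a-z]+)([0-9])>(.+?)</\1\2>", re.DOTALL)

# Assumed meaning of the digit; check against the paper before relying on it.
LEVELS = {1: "fair", 2: "potentially unfair", 3: "unfair"}


def parse_tags(doc):
    """Return (category code, level, clause text) for each tagged span."""
    return [(m.group(1), int(m.group(2)), m.group(3).strip())
            for m in TAG_RE.finditer(doc)]


sample = ("<ch2>We may change these terms.</ch2> Plain text. "
          "<ter3>We may terminate at any time.</ter3>")

parsed = parse_tags(sample)
for code, level, text in parsed:
    print(code, LEVELS.get(level, "unknown"), "|", text)
```

If one tag is nested inside another, this pattern reports only the outer span, so nested annotations would need a second pass over the captured text.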


In [12]:
# print(raw_df.iloc[1]['Sentences'])