In [1]:
import pickle

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme()

import sys
sys.path.append('..')

# Examine Points by Case to help build per-Case datasets, using the clean data
# worked out in explore.ipynb.

In [2]:
VERSION = '181021'

def load_clean(name):
    # Load one cleaned table and close the file handle afterwards
    with open(f'../data/{name}_{VERSION}_clean.pkl', 'rb') as f:
        return pickle.load(f)

cases = load_clean('cases')
documents = load_clean('documents')
points = load_clean('points')
services = load_clean('services')
topics = load_clean('topics')

In [3]:
# Things simplify if we focus on English docs only. It might eventually be possible to train a few models
# for French or other languages, but it's not worth the effort without more data.
# This also removes Points without docs (which could be useful for a couple of Case
# models, but is probably not worth the effort right now).
documents = documents[documents.lang == 'en']
points = points[points.lang == 'en']

In [5]:
# Join Points with Documents, so we can check the quote contexts
points = pd.merge(points, documents, how='left', left_on='document_id', right_index=True, suffixes=['_point', '_doc'])
# Also join to attach Case info
points = pd.merge(points, cases, left_on='case_id', right_index=True, suffixes=['_point', '_case'])
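As a sanity check on the joins above, one option is to let pandas validate the key relationship and flag Points that found no matching Document. The sketch below runs on toy frames; apart from `document_id` and `quoteText`, the column names and all values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the real tables (hypothetical values)
toy_points = pd.DataFrame(
    {"document_id": [10, 10, 11, 99], "quoteText": ["a", "b", "c", "d"]}
)
toy_documents = pd.DataFrame({"name": ["Terms", "Privacy"]}, index=[10, 11])

merged = pd.merge(
    toy_points,
    toy_documents,
    how="left",
    left_on="document_id",
    right_index=True,
    validate="many_to_one",  # raises MergeError if a Document index value repeats
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Points whose document_id has no matching Document
n_without_doc = int((merged["_merge"] == "left_only").sum())
print(n_without_doc)
```

With `how='left'`, unmatched Points stay in the frame with empty Document columns, so counting `left_only` rows shows whether the language filter really removed every Point without a doc.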

In [6]:
points_per_case = points.groupby(['case_id', 'title_case']).size().sort_values(ascending=False)
print(f"Points per Case stats:\n{points.case_id.value_counts().describe()}\n")
print(points_per_case.head(15))
points_per_case.tail(10)

Points per Case stats:
count    241.000000
mean      66.987552
std       72.775310
min        1.000000
25%       18.000000
50%       41.000000
75%       89.000000
max      551.000000
Name: case_id, dtype: float64

case_id  title_case                                                                                                         
331      There is a date of the last update of the agreements                                                                   551
152      This service is only available to users over a certain age                                                             358
286      The service is provided 'as is' and to be used at your sole risk                                                       339
323      You are tracked via web beacons, tracking pixels, browser fingerprinting, and/or device fingerprinting                 278
146      You agree to defend, indemnify, and hold the service harmless in case of a claim related to your use of the service    256
3

case_id  title_case                                                                                            
317      You aren’t allowed to publicly post private messages                                                      3
497      Prices and fees may be changed at any time, without notice to you                                         2
168      The service is not transparent regarding government requests or inquiries that may involve your data.     1
309      You have the right to request lower Charges from Third Party Providers                                    1
141      Inconvenient process for obtaining personal data                                                          1
330      The service disables software that you are not licensed to use.                                           1
378      Service fines users for Terms of Service violations                                                       1
194      The service does not index or open files that you upload    
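Given the long tail of Cases with only a handful of Points, it may help to pick a minimum size up front and see how much of the data it keeps. This is a minimal sketch on toy data; `MIN_POINTS` is a hypothetical cutoff, not a value taken from this notebook.

```python
import pandas as pd

MIN_POINTS = 50  # hypothetical cutoff for "large enough to train on"

# Toy stand-in for points.case_id
toy_case_ids = pd.Series([331] * 60 + [152] * 55 + [317] * 3 + [168] * 1)

counts = toy_case_ids.value_counts()
large_enough = counts[counts >= MIN_POINTS]

usable_cases = large_enough.index.tolist()    # Cases that pass the cutoff
coverage = large_enough.sum() / counts.sum()  # share of Points they cover
print(sorted(usable_cases), round(coverage, 3))
```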

In [7]:
points['quote_len'] = points.quoteEnd - points.quoteStart
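To check a quote in context, one approach is to slice the Document text around the quote offsets. The sketch below assumes `quoteStart` and `quoteEnd` are character offsets into the Document text, and that the text lives in a column called `text`; both are assumptions, not confirmed by the data shown here.

```python
def quote_context(doc_text: str, start: int, end: int, window: int = 40) -> str:
    """Return the quote wrapped in [[ ]] with up to `window` characters on each side."""
    lo = max(0, start - window)
    hi = min(len(doc_text), end + window)
    return doc_text[lo:start] + "[[" + doc_text[start:end] + "]]" + doc_text[end:hi]

# Toy document (made up for illustration)
doc = "Welcome. Last updated: March 23, 2020. Please read these terms carefully."
start = doc.index("Last updated")
end = start + len("Last updated: March 23, 2020")
snippet = quote_context(doc, start, end, window=9)
print(snippet)

# On the merged frame this could look like the following, if the text column is named `text`:
# points.apply(lambda r: quote_context(r.text, r.quoteStart, r.quoteEnd), axis=1)
```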

In [8]:
# My plan is to work through the Cases above in descending size order. At some point the per-Case datasets will get too small for ML.

# First up is "There is a date of the last update of the agreements"
print('\n-------------------\n'.join(points[points.case_id == 331].quoteText.sample(20).values))

Last modified: 27/03/2017
-------------------
<p>Last updated: 04/28/2020
-------------------
<br>Last modified: 30th December 2016
-------------------
<u>Last updated on December 8, 2020
-------------------
Updated October 4, 2019
-------------------
itch.io Terms of Service<blockquote>
<p>
<strong>Updated June 8 2018
-------------------
 <p>Spark Terms of Use (“Agreement") is effective as of May 22nd, 2018 (the "Effective Date"),
-------------------
Updated: February 11, 2019
-------------------
Last version 3rd March 2021.
-------------------
This policy has been modified on 28 Jan 2021 to reflect the replacement of BitPay with Coinbase as payment method for cryptocurrencies.
-------------------
Last updated: March 23, 2020</p>

-------------------
Last Revised August 2020.
-------------------
Effective: &nbsp;January 1, 2021
-------------------
THESE TERMS OF USE WERE UPDATED ON FEBRUARY 11TH 2019
-------------------
April 28th, 2016
-------------------
These Terms of Use are effec
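The same sampling line is repeated for every Case below, and `sample(20)` raises a `ValueError` for a Case with fewer than 20 Points. A small helper along these lines could cover both; `sample_quotes` and its `seed` parameter are hypothetical names introduced here, and the toy frame is made up.

```python
import pandas as pd

SEPARATOR = "\n-------------------\n"

def sample_quotes(points, case_id, n=20, seed=0):
    """Join up to n randomly sampled quotes for one Case into a printable string."""
    quotes = points.loc[points.case_id == case_id, "quoteText"].dropna()
    if quotes.empty:
        return ""
    k = min(n, len(quotes))  # never ask for more rows than the Case has
    return SEPARATOR.join(quotes.sample(k, random_state=seed).values)

# Toy stand-in for the merged points frame
toy = pd.DataFrame(
    {
        "case_id": [331, 331, 168],
        "quoteText": ["Last updated: 2020", "Updated 2019", "No transparency report"],
    }
)
out = sample_quotes(toy, 168)
print(out)
```

Fixing `random_state` keeps the printed sample stable between runs, which makes it easier to refer back to a specific quote.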

In [10]:
# "This service is only available to users over a certain age"
print('\n-------------------\n'.join(points[points.case_id == 152].quoteText.sample(20).values))

 is indicated at the top of the text.
We ask all users to ensure that they are familiar with the most current wording of the Privacy Policy.
The amendment of
-------------------
Our Websites are not directed to persons under the age of 18 or the applicable age of majority in the jurisdiction from which the Websites are accessed (“minors”), and we prohibit minors from using the Websites.
-------------------
Those under 13 years of age are not authorized to use this Site or our Services, with or without registering.
-------------------
Our Services are not directed to children.
You’re not allowed to access or use our Services if you’re under the age of 13 (or 16 in Europe).
If you register as a user or otherwise use our Services, you represent that you’re at least 13 (or 16 in Europe).
You may use our Services only if you can legally form a binding contract with us.
In other words, if you’re under 18 years of age (or the legal age of majority where you live), you can only use our Service

In [11]:
# "The service is provided 'as is' and to be used at your sole risk"
print('\n-------------------\n'.join(points[points.case_id == 286].quoteText.sample(20).values))

THE SERVICES ARE PROVIDED "AS IS." 
-------------------
<p>THE SITE, THE GLYMPSE MATERIALS AND THE SERVICES ARE PROVIDED ON AN “AS IS” AND “AS AVAILABLE” BASIS WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED.
-------------------
THE PLATFORM, INCLUDING, WITHOUT LIMITATION, THE WEBSITE, THE APPS AND ALL CONTENT AND SERVICES ACCESSED THROUGH OR VIA THE WEBSITE, THE APPS, THE SERVICES OR OTHERWISE, ARE PROVIDED "AS IS", "AS AVAILABLE", AND "WITH ALL FAULTS
-------------------
THE SERVICE IS PROVIDED ON AN "AS IS" AND "AS AVAILABLE" BASIS.
-------------------
We provide our service as is, and we make no promises or guarantees about this service
-------------------
<p>The Services and the Content are provided on an "as is" basis and your use of the Services and the Content is at your own risk.
-------------------
THE SITE IS PROVIDED "AS IS" AND "AS AVAILABLE"
-------------------
THE WEBSITE AND ALL CONTENT ARE PROVIDED "AS IS" AND "WITH ALL FAULTS" AND THE ENTIRE RISK AS TO THE Q

In [12]:
# "You are tracked via web beacons, tracking pixels, browser fingerprinting, and/or device fingerprinting"
print('\n-------------------\n'.join(points[points.case_id == 323].quoteText.sample(20).values))


<p>When you interact with the Service, we also may use various technologies, including cookies, web beacons, pixel tags, log files, local shared objects (Flash cookies), HTML5 cookies, or other tracking technologies, 
-------------------
 device identification
-------------------
These third parties may collect information about you and your online activities, either on the Sites or on other websites, through cookies, web beacons, and other technologies to understand your interests and tailor certain advertisements to your interests.</p>
<p
-------------------
We and service providers acting on our behalf store log files and use tracking technologies such as cookies, web beacons, tracking pixels, and local shared objects, also known as flash cookies, to collect information relating to you and your use of the Service.
-------------------
Verizon Media&nbsp;may use device IDs, cookies, and other signals, including information obtained from third parties, to associate accounts and/or dev
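Many of the sampled quotes carry HTML tags, entities such as `&nbsp;`, and tags cut off at the quote boundary. If the quotes are to be used as model input, a light cleaning pass might look like the sketch below. It uses only the standard library; `clean_quote` is a hypothetical helper, and the regex approach is a rough assumption that would also strip a bare `<` followed by ordinary text.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]*>")          # complete tags such as <p> or </strong>
PARTIAL_TAG_RE = re.compile(r"<[^>]*$")  # a tag cut off at the end of the quote

def clean_quote(text: str) -> str:
    """Strip HTML tags, decode entities, and collapse whitespace."""
    text = PARTIAL_TAG_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)      # e.g. &nbsp; becomes a non-breaking space
    return " ".join(text.split())   # split() also treats non-breaking spaces as whitespace

print(clean_quote("<p>Last updated: 04/28/2020"))
print(clean_quote("Effective: &nbsp;January 1, 2021"))
```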

In [13]:
# "The court of law governing the terms is in location X"
print('\n-------------------\n'.join(points[points.case_id == 163].quoteText.sample(20).values))


We are registered in England and Wales under company number 08804411 and have our registered office at 7 Westferry Circus, Canary Wharf, London, England, E14 4HD.</p>
-------------------
a company incorporated under the law of Belgium
-------------------
this Agreement and the relationship between you and Apple shall be governed by the laws of the State of California
-------------------
These Terms will be governed by and construed in accordance with the laws of the State of California, without giving effect to any conflict of laws rules or provisions.</p>
<p>You agree that any action of whatever nature arising from or relating to these Terms, the Site, or Services will be filed only in the state or federal courts located in Los Angeles County, California.
-------------------
Except to the extent any applicable law provides otherwise, these Terms of Service and any access to or use of the Service will be governed by the laws of the state of California, U.S.A., excluding its conflict of

In [14]:
# "Third-party cookies are used for statistics"
print('\n-------------------\n'.join(points[points.case_id == 325].quoteText.sample(20).values))

we work with a number of analytics partners, including Google Analytics
-------------------
On specific sites, Wolfram may use third-party cookies when working with outside partners for analytics and to optimize delivery of information that may be of interest to you.
-------------------

</p>Analytics and research<p>To help us improve and understand how people use our Services.
</p>
<p>For example, cookies help us test different versions of our services to see which particular features or content users prefer or to share information about our services to you as you interact online across the internet.
We might also optimize and improve your experience on Remind by using cookies to see how you interact with our services, such as when and how often you use them and what links you click on.
We may use Google Analytics to assist us with this.</p>

-------------------
Some of the cookies used on our Website are set by us, and some are set by third parties that are delivering services on our

In [15]:
# "You can request access, correction and/or deletion of your data"
print('\n-------------------\n'.join(points[points.case_id == 195].quoteText.sample(20).values))

You also have the right to request that it be corrected, blocked, or deleted.
-------------------
Right to access</strong> - The right to request (I) copies of your personal Data or (II) access to the information you submited and we hold at any time.</li>
<li>
<strong>Right to correct</strong> - The right to have your Data rectified if it is inaccurate or incomplete.*</li>
<li>
<strong>Right to erase</strong> - The right to request delete or remove your Data from our servers.</li>
<li>
-------------------
 to ask us to make any corrections to inaccurate or incomplete Personal Data we have about you.&nbsp.
You can also request that we erase your Personal Data when it is no longer needed for the purposes for which you provided it
-------------------

To correct, update, or remove personally identifiable information, please email us at dpo@eb.com.
-------------------
 If you would like to review or correct your personal information we maintain, please contact us.
Calyx will use good faith