# Goal

In this notebook, we inspect the complexity of these terms of service. We're not lawyers, but we can read. So we'll measure the reading ease and grade level ([Flesch-Kincaid readability tests](https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests)) of each document, along with the estimated time it would take to read it.

Finally, we will explore some visualizations of that data and hopefully make something interactive for the masses to engage with the results!
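
For reference, here is a minimal sketch of the two Flesch-Kincaid formulas named above, written as plain Python over pre-counted words, sentences, and syllables. The function names and the sample counts are hypothetical and are not part of spacy_readability, which does its own counting; the constants are the standard published ones.

```python
def fk_reading_ease(words, sentences, syllables):
    """Flesch reading ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def fk_grade_level(words, sentences, syllables):
    """Flesch-Kincaid grade level: roughly the US school grade needed to read the text."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# hypothetical counts: 100 words, 5 sentences, 150 syllables
ease = fk_reading_ease(100, 5, 150)
grade = fk_grade_level(100, 5, 150)
print(round(ease, 3), round(grade, 3))
```

Longer sentences and more syllables per word lower the reading ease and raise the grade level, which is why the two scores tend to move in opposite directions.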

## Import Tooling

In [1]:
import pandas as pd
import spacy
from spacy_readability import Readability
nlp = spacy.load('en')
read = Readability()
nlp.add_pipe(read, last=True)

from tqdm import tqdm
tqdm.pandas()
import readtime

from IPython.display import display, Markdown
import numpy as np

## Import Data

In [2]:
agreements = pd.read_csv('../data/processed/agreements.csv', parse_dates = [4])

In [3]:
agreements.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9241 entries, 0 to 9240
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   relativePath  9241 non-null   object        
 1   companyName   9241 non-null   object        
 2   documentType  9241 non-null   object        
 3   documentName  9241 non-null   object        
 4   timestamp     9241 non-null   datetime64[ns]
 5   fullFilePath  9241 non-null   object        
 6   fullText      9232 non-null   object        
dtypes: datetime64[ns](1), object(6)
memory usage: 505.5+ KB


## Testing Spacy Readability

We want to be sure that the function we plan to use with .apply on a pandas DataFrame will work. Here, we run a small test example to confirm that everything is installed and running smoothly on spaCy's end.

In [4]:
# example
doc = nlp("I am some really difficult text. I use obnoxiously large words.")
print('flesch kincaid grade level: ', doc._.flesch_kincaid_grade_level)
print('flesch kincaid reading ease: ', doc._.flesch_kincaid_reading_ease)
print('SMOG: ', doc._.smog)
print('Coleman Liau Index :', doc._.coleman_liau_index)
print('Automated Readability Index: ', doc._.automated_readability_index)
print('Forcast: ', doc._.forcast)

flesch kincaid grade level:  5.864090909090908
flesch kincaid reading ease:  62.81613636363636
SMOG:  0
Coleman Liau Index : 6.080000000000002
Automated Readability Index:  3.15727272727273
Forcast:  0


## Create function for df.apply using spaCy readability

Below, we create a function that returns four readability scores plus the estimated read time in a single dictionary. That dictionary can later be expanded into columns in the DataFrame. Because the function returns a single dictionary per document, it should play well with pd.Series.apply().

In [5]:
def get_all_scores(string):
    # numpy (np) is already imported in the first cell

    # construct empty dictionary to hold scores
    tempDict = {}

    # from spacy-readability
    try:
        # create the doc with nlp and assign scores, rounded to 3 decimal places
        doc = nlp(string)
        tempDict['flesch_kincaid_grade_level'] = round(doc._.flesch_kincaid_grade_level, 3)
        tempDict['flesch_kincaid_reading_ease'] = round(doc._.flesch_kincaid_reading_ease, 3)
        tempDict['smog'] = round(doc._.smog, 3)
        tempDict['coleman_liau_index'] = round(doc._.coleman_liau_index, 3)

    except Exception:
        # scoring failed (for example, on a missing fullText value),
        # so write null data for each of the score keys
        tempDict['flesch_kincaid_grade_level'] = np.nan
        tempDict['flesch_kincaid_reading_ease'] = np.nan
        tempDict['smog'] = np.nan
        tempDict['coleman_liau_index'] = np.nan

    # use readtime to get the read time in seconds per agreement text
    try:
        readtime_obj = readtime.of_text(string)
        tempDict['read_time'] = readtime_obj.seconds

    except Exception:
        # if exception, populate dictionary at read_time with null
        tempDict['read_time'] = np.nan

    # return a dictionary with all scores
    return tempDict
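
To illustrate how a column of dictionaries like the one returned above can be expanded into DataFrame columns, here is a small sketch on toy data. The values are made up, and only two of the five keys are shown to keep it short.

```python
import numpy as np
import pandas as pd

# toy stand-in for the output of get_all_scores on three documents
score_dicts = pd.Series([
    {'smog': 10.1, 'read_time': 120},
    {'smog': 12.4, 'read_time': 300},
    {'smog': np.nan, 'read_time': np.nan},  # a document that failed scoring
])

# build one column per dictionary key, keeping the original row index
expanded = pd.DataFrame(score_dicts.tolist(), index=score_dicts.index)
print(expanded)
```

`score_dicts.apply(pd.Series)` should give the same columns; building the frame from a list of dictionaries tends to be faster on large inputs.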


## Generating our Scoring Data

In [6]:
# # apply our function to the full text for each agreement in the dataset
# # progress apply gives you a progress bar using tqdm

# agreements['scoreDict'] = agreements.fullText.progress_apply(get_all_scores)

In [7]:
# agreements[[
#     'flesch_kincaid_grade_level', 'flesch_kincaid_reading_ease', 'smog',
#     'coleman_liau_index', 'read_time'
# ]] = agreements['scoreDict'].apply(pd.Series)

# # since this took 1:53:46, write the scored dataset to a csv
# 100%|██████████| 9241/9241 [1:53:46<00:00,  1.35it/s]
# agreements.to_csv('../data/processed/scoredAgreements.csv', header=True, index=False)
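
Because the scoring run is slow, one alternative to commenting the cells out is a compute-or-load pattern that only rescans when the CSV is missing. The sketch below uses a hypothetical helper named `load_or_compute` and a toy scoring function in a temporary directory; it is not the notebook's actual workflow.

```python
import os
import tempfile

import pandas as pd


def load_or_compute(csv_path, compute_fn):
    """Return the cached DataFrame at csv_path, computing and saving it first if missing."""
    if os.path.exists(csv_path):
        return pd.read_csv(csv_path)
    df = compute_fn()
    df.to_csv(csv_path, header=True, index=False)
    return df


calls = []


def expensive_scoring():
    # toy stand-in for the long-running progress_apply step
    calls.append(1)
    return pd.DataFrame({'companyName': ['A', 'B'], 'read_time': [60.0, 120.0]})


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'scored.csv')
    first = load_or_compute(path, expensive_scoring)   # computes and writes the CSV
    second = load_or_compute(path, expensive_scoring)  # reads the cached CSV

print(len(calls))
```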

## Read in the Scored Dataset

In [8]:
agreements_scored = pd.read_csv('../data/processed/scoredAgreements.csv', parse_dates = [4])

In [9]:
agreements_scored.dtypes

relativePath                           object
companyName                            object
documentType                           object
documentName                           object
timestamp                      datetime64[ns]
fullFilePath                           object
fullText                               object
scoreDict                              object
flesch_kincaid_grade_level            float64
flesch_kincaid_reading_ease           float64
smog                                  float64
coleman_liau_index                    float64
read_time                             float64
dtype: object

## Analysis

## Isolate Terms of Service Documents from the Dataset

In [10]:
tos_df = agreements_scored[agreements_scored['documentType'] ==
                           'Terms of Service']

In [11]:
## Expand metrics to all 

### How many companies and how many documents?

In [12]:
# determine how many companies are represented in dataset
display(
    Markdown('#### There are {} unique companies in the dataset.'.format(
        agreements_scored.companyName.nunique())))

#### There are 174 unique companies in the dataset.

In [13]:
# determine how many documents (rows) are included in the dataset
display(
    Markdown('#### There are {} unique documents in the dataset.'.format(
        len(agreements_scored))))

#### There are 9241 unique documents in the dataset.

### How many companies have terms of service agreements?

In [14]:
val = tos_df['companyName'].nunique()

# display an aesthetic print statement
display(
    Markdown(
        '#### There are {} unique companies with a Terms of Service agreement in the dataset.'
        .format(val)))

#### There are 151 unique companies with a Terms of Service agreement in the dataset.

### What's the oldest/youngest ToS Document we have?

In [15]:
earliest_timestamp = tos_df['timestamp'].min()

# these are the companies with the earliest terms of service documents
earliest_company = tos_df[tos_df['timestamp'] == earliest_timestamp]['companyName']


# display an aesthetic print statement
display(
    Markdown(
        '#### The earliest ToS agreement we have is from {} from {}.'
        .format(str(earliest_timestamp)[:-9], earliest_company.values[0])))

latest_timestamp = tos_df['timestamp'].max()

# these are the companies with the latest terms of service documents
latest_companies = tos_df[tos_df['timestamp'] == latest_timestamp]['companyName']


# display an aesthetic print statement
display(
    Markdown(
        '#### The latest ToS agreement we have is from {} from {}.'
        .format(str(latest_timestamp)[:-9],latest_companies.values[0])))

#### The earliest ToS agreement we have is from 2012-10-04 from Windstream.

#### The latest ToS agreement we have is from 2021-01-05 from WordPress.com.
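
The `str(timestamp)[:-9]` slice above works because these timestamps print as a date plus `00:00:00`. A slightly more robust sketch, using a toy timestamp, formats the date explicitly with `strftime`:

```python
import pandas as pd

ts = pd.Timestamp('2012-10-04')

# slicing relies on the string ending in exactly ' 00:00:00'
sliced = str(ts)[:-9]

# strftime states the intended format directly
formatted = ts.strftime('%Y-%m-%d')

# with fractional seconds the slice no longer isolates the date
ts_frac = pd.Timestamp('2012-10-04 12:30:00.500000')
sliced_frac = str(ts_frac)[:-9]
formatted_frac = ts_frac.strftime('%Y-%m-%d')

print(sliced, formatted, formatted_frac)
```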

### Average and Median Reading Level for Most Recent ToS Per company

In [16]:
# sort by datetime ascending
tos_df = tos_df.sort_values(by='timestamp', ascending=True)

# dedupe to company level and keep the last (most recent) record
tos_latest = tos_df.drop_duplicates(subset='companyName', keep='last')

# get mean scores
mean_fk_grade_level = round(np.mean(tos_latest['flesch_kincaid_grade_level']),
                            3)
mean_fk_ease = round(np.mean(tos_latest['flesch_kincaid_reading_ease']), 3)

# get median scores
median_fk_grade_level = round(
    np.median(tos_latest['flesch_kincaid_grade_level']), 3)
median_fk_ease = round(np.median(tos_latest['flesch_kincaid_reading_ease']), 3)

# display an aesthetic print statement
display(
    Markdown(''' 
     - The mean ToS Flesch-Kincaid reading ease is {fk_1}
     - The mean ToS Flesch-Kincaid grade level is {fk_2}
     - The median ToS Flesch-Kincaid reading ease is {fk_3}
     - The median ToS Flesch-Kincaid grade level is {fk_4}.'''.format(
        fk_1=mean_fk_ease,
        fk_2=mean_fk_grade_level,
        fk_3=median_fk_ease,
        fk_4=median_fk_grade_level)))

 
     - The mean ToS Flesch-Kincaid reading ease is 46.922
     - The mean ToS Flesch-Kincaid grade level is 11.522
     - The median ToS Flesch-Kincaid reading ease is 46.38
     - The median ToS Flesch-Kincaid grade level is 11.552.
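
The sort-then-dedupe step used above can be checked on a toy frame. The company names and scores below are made up; the point is only that sorting by timestamp ascending and keeping the last duplicate leaves each company's most recent row.

```python
import pandas as pd

toy = pd.DataFrame({
    'companyName': ['A', 'B', 'A', 'B'],
    'timestamp': pd.to_datetime(
        ['2015-01-01', '2016-06-01', '2020-03-01', '2014-02-01']),
    'flesch_kincaid_grade_level': [10.0, 12.0, 11.5, 9.0],
})

# sort oldest to newest, then keep the final (most recent) row per company
toy = toy.sort_values(by='timestamp', ascending=True)
latest = toy.drop_duplicates(subset='companyName', keep='last')

latest_scores = latest.set_index('companyName')['flesch_kincaid_grade_level'].to_dict()
print(latest_scores)
```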

### What is the median and mean reading time?

In [17]:
mean_rt = round(np.mean(tos_latest['read_time']/60), 0)
median_rt = round(np.median(tos_latest['read_time']/60), 0)

# display an aesthetic print statement
display(
    Markdown(''' 
     - The mean reading time for the latest ToS agreements is {r1} minutes
     - The median reading time for the latest ToS agreements is {r2} minutes'''.format(
        r1=mean_rt,
        r2=median_rt)))

 
     - The mean reading time for the latest ToS agreements is 21.0 minutes
     - The median reading time for the latest ToS agreements is 18.0 minutes

### How much has the mean reading ease, grade level, and read time increased since each company put out their first TOS?

In [18]:
# sort by datetime descending
tos_df = tos_df.sort_values(by='timestamp', ascending=False)

# dedupe to company level and keep the last (earliest) record
tos_earliest = tos_df.drop_duplicates(subset='companyName', keep='last')

# get mean scores
mean_fk_grade_level = round(np.mean(tos_earliest['flesch_kincaid_grade_level']),
                            3)
mean_fk_ease = round(np.mean(tos_earliest['flesch_kincaid_reading_ease']), 3)

# get median scores
median_fk_grade_level = round(
    np.median(tos_earliest['flesch_kincaid_grade_level']), 3)
median_fk_ease = round(np.median(tos_earliest['flesch_kincaid_reading_ease']), 3)

# display an aesthetic print statement
display(
    Markdown(''' 
     - The mean ToS Flesch-Kincaid reading ease is {fk_1}
     - The mean ToS Flesch-Kincaid grade level is {fk_2}
     - The median ToS Flesch-Kincaid reading ease is {fk_3}
     - The median ToS Flesch-Kincaid grade level is {fk_4}.'''.format(
        fk_1=mean_fk_ease,
        fk_2=mean_fk_grade_level,
        fk_3=median_fk_ease,
        fk_4=median_fk_grade_level)))


mean_rt = round(np.mean(tos_earliest['read_time']/60), 0)
median_rt = round(np.median(tos_earliest['read_time']/60), 0)

# display an aesthetic print statement
display(
    Markdown(''' 
     - The mean reading time for the earliest ToS agreements is {r1} minutes
     - The median reading time for the earliest ToS agreements is {r2} minutes'''.format(
        r1=mean_rt,
        r2=median_rt)))

 
     - The mean ToS Flesch-Kincaid reading ease is 46.888
     - The mean ToS Flesch-Kincaid grade level is 11.063
     - The median ToS Flesch-Kincaid reading ease is 46.48
     - The median ToS Flesch-Kincaid grade level is 11.252.

 
     - The mean reading time for the earliest ToS agreements is 23.0 minutes
     - The median reading time for the earliest ToS agreements is 19.0 minutes
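
To answer the heading's question directly, the means printed by cell 18 (each company's earliest ToS) can be subtracted from the means printed by cells 16 and 17 (each company's latest ToS). This is a back-of-the-envelope sketch that copies those printed figures by hand rather than recomputing them from the data.

```python
# figures copied from the printed outputs above
latest = {'reading_ease': 46.922, 'grade_level': 11.522, 'read_time_min': 21.0}
earliest = {'reading_ease': 46.888, 'grade_level': 11.063, 'read_time_min': 23.0}

# positive values mean the latest documents score higher than the earliest
change = {key: round(latest[key] - earliest[key], 3) for key in latest}

for key, value in change.items():
    print(f'{key}: {value:+.3f}')
```

On these printed means, grade level rose by a little under half a grade, reading ease was essentially flat, and the mean read time was about two minutes shorter for the latest documents.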

### Which company added the most complexity to their TOS between their first and last agreement?

In [19]:
# isolate earliest records of interest
left_df = tos_earliest[['companyName', 'timestamp', 'flesch_kincaid_grade_level', 'flesch_kincaid_reading_ease', 'smog',
    'coleman_liau_index', 'read_time']]


# isolate most recent records of interest
right_df =  tos_latest[['companyName', 'timestamp', 'flesch_kincaid_grade_level', 'flesch_kincaid_reading_ease', 'smog',
    'coleman_liau_index', 'read_time']]

# inner join companies' earliest data and their most recent with some suffixes for field names
combined_df = left_df.merge(right_df, on=['companyName'], how ='inner', suffixes=('__first', '__last'))

combined_df.shape

(151, 13)
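
As a quick check on how the suffixes behave in the merge above, here is a toy version with made-up values and only two non-key columns. Every column other than the join key gets a `__first` or `__last` suffix, which is why the real frame has 1 + 6 × 2 = 13 columns.

```python
import pandas as pd

cols = ['companyName', 'timestamp', 'read_time']
first = pd.DataFrame(
    [['A', '2015-01-01', 600], ['B', '2014-02-01', 900]], columns=cols)
last = pd.DataFrame(
    [['A', '2020-03-01', 720], ['B', '2016-06-01', 840]], columns=cols)

# one row per company, with earliest and latest values side by side
merged = first.merge(last, on=['companyName'], how='inner',
                     suffixes=('__first', '__last'))

# positive means the latest document takes longer to read
merged['read_time_diff'] = merged['read_time__last'] - merged['read_time__first']
print(merged)
```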

In [20]:
# create fields of interest

# difference between first and last ToS doc
combined_df['days_diff'] = combined_df['timestamp__last'] - combined_df[
    'timestamp__first']

# difference in scores per company
combined_df['flesch_kincaid_grade_level_diff'] = combined_df[
    'flesch_kincaid_grade_level__last'] - combined_df[
        'flesch_kincaid_grade_level__first']

combined_df['flesch_kincaid_reading_ease_diff'] = combined_df[
    'flesch_kincaid_reading_ease__last'] - combined_df[
        'flesch_kincaid_reading_ease__first']

combined_df['read_time_diff'] = combined_df['read_time__last'] - combined_df[
    'read_time__first']



In [21]:
# get max and min diffs
max_fk_gl = combined_df['flesch_kincaid_grade_level_diff'].max()
min_fk_gl = combined_df['flesch_kincaid_grade_level_diff'].min()

max_fk_ease = combined_df['flesch_kincaid_reading_ease_diff'].max()
min_fk_ease = combined_df['flesch_kincaid_reading_ease_diff'].min()



In [22]:
combined_df.loc[combined_df['flesch_kincaid_grade_level_diff'].idxmax(),'companyName']


# display an aesthetic print statement
display(
    Markdown(''' 
     - The Company with the largest increase in Flesch-Kincaid grade level is {c1}.
     - The Company with the largest decrease in Flesch-Kincaid grade level is {c2}.
     
     - The Company with the largest increase in Flesch-Kincaid reading ease is {c3}.
     - The Company with the largest decrease in Flesch-Kincaid reading ease is {c4}.
     '''.format(
        c1=combined_df.loc[combined_df['flesch_kincaid_grade_level_diff'].idxmax(),'companyName'],
        c2=combined_df.loc[combined_df['flesch_kincaid_grade_level_diff'].idxmin(),'companyName'],
        c3=combined_df.loc[combined_df['flesch_kincaid_reading_ease_diff'].idxmax(),'companyName'],
        c4=combined_df.loc[combined_df['flesch_kincaid_reading_ease_diff'].idxmin(),'companyName'],

    )))



 
     - The Company with the largest increase in Flesch-Kincaid grade level is Coursera.
     - The Company with the largest decrease in Flesch-Kincaid grade level is WebProNews.
     
     - The Company with the largest increase in Flesch-Kincaid reading ease is WebProNews.
     - The Company with the largest decrease in Flesch-Kincaid reading ease is Coursera.
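
The lookup pattern used above, `idxmax()` or `idxmin()` on a diff column followed by `.loc`, can be sketched on toy data. The companies and diffs below are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'companyName': ['A', 'B', 'C'],
    'flesch_kincaid_grade_level_diff': [1.2, -3.0, 0.4],
    'flesch_kincaid_reading_ease_diff': [-5.0, 9.5, 0.1],
})

# idxmax returns the row label of the largest value; .loc then pulls the name
harder = toy.loc[toy['flesch_kincaid_grade_level_diff'].idxmax(), 'companyName']
easier = toy.loc[toy['flesch_kincaid_reading_ease_diff'].idxmax(), 'companyName']
print(harder, easier)
```

Note the direction of each metric: a rise in grade level and a drop in reading ease both point to harder text, which is consistent with Coursera filling both of those slots in the results above.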
     