# Goal

In this notebook, we inspect the complexity of these terms of service documents. We're not lawyers, but we can read. So we will measure the reading ease and grade level ([Flesch-Kincaid tests](https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests)) of each document, as well as the estimated time it would take to read it.

Finally, we will explore some visualizations of that data and hopefully make something interactive for the masses to engage with the results!
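
For reference, the two Flesch-Kincaid scores mentioned above come from fixed formulas over word, sentence, and syllable counts. The sketch below only illustrates those published formulas. The function names are hypothetical, the counts are assumed to be supplied by the caller, and this is not necessarily how `spacy_readability` computes its values internally.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch reading ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade_level(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level: approximate US school grade needed."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts for a short passage: 100 words, 5 sentences, 150 syllables.
ease = flesch_reading_ease(100, 5, 150)
grade = flesch_kincaid_grade_level(100, 5, 150)
print(ease, grade)
```

Longer sentences and more syllables per word push reading ease down and grade level up, which is why dense legal text tends to score as hard to read.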

## Import Tooling

In [1]:
import pandas as pd
import spacy
from spacy_readability import Readability
nlp = spacy.load('en')
read = Readability()
nlp.add_pipe(read, last=True)

from tqdm import tqdm
tqdm.pandas()
import readtime
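
The `readtime` package imported above supplies the estimated reading time used later in this notebook. As a rough illustration of the idea only, reading time can be approximated as word count divided by reading speed. The function name and the 265 words-per-minute default below are assumptions for this sketch, not a description of how `readtime` works internally.

```python
import math


def estimate_read_seconds(text: str, words_per_minute: int = 265) -> int:
    """Approximate reading time in seconds from a whitespace word count.

    words_per_minute is an assumed average reading speed.
    """
    word_count = len(text.split())
    return math.ceil(word_count / words_per_minute * 60)


# A hypothetical 530-word document at the assumed speed takes about two minutes.
sample = "word " * 530
print(estimate_read_seconds(sample))
```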

## Import Data

In [2]:
agreements = pd.read_csv('../data/processed/agreements.csv', parse_dates = [4])

In [3]:
agreements.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9241 entries, 0 to 9240
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   relativePath  9241 non-null   object        
 1   companyName   9241 non-null   object        
 2   documentType  9241 non-null   object        
 3   documentName  9241 non-null   object        
 4   timestamp     9241 non-null   datetime64[ns]
 5   fullFilePath  9241 non-null   object        
 6   fullText      9232 non-null   object        
dtypes: datetime64[ns](1), object(6)
memory usage: 505.5+ KB


In [4]:
# example
doc = nlp("I am some really difficult text. I use obnoxiously large words.")
print(doc._.flesch_kincaid_grade_level)
print(doc._.flesch_kincaid_reading_ease)
print(doc._.dale_chall)
print(doc._.smog)
print(doc._.coleman_liau_index)
print(doc._.automated_readability_index)
print(doc._.forcast)

5.864090909090908
62.81613636363636
8.215663636363637
0
6.080000000000002
3.15727272727273
0


## Create functions for df.apply using spaCy readability

These functions wrap the spaCy readability scores so that they can be passed to `df.apply`.

In [13]:
def get_fk_grade_level(string):
    doc = nlp(string)
    return doc._.flesch_kincaid_grade_level


def get_fk_reading_ease(string):
    doc = nlp(string)
    return doc._.flesch_kincaid_reading_ease


def get_smog(string):
    doc = nlp(string)
    return doc._.smog


def get_coleman_liau_index(string):
    doc = nlp(string)
    return doc._.coleman_liau_index


def get_automated_readability_index(string):
    doc = nlp(string)
    return doc._.automated_readability_index

def get_all_scores(string):
    import numpy as np
    tempDict = {}

    # scores from spacy-readability (one nlp() pass per document)
    try:
        doc = nlp(string)
        tempDict['flesch_kincaid_grade_level'] = round(doc._.flesch_kincaid_grade_level, 3)
        tempDict['flesch_kincaid_reading_ease'] = round(doc._.flesch_kincaid_reading_ease, 3)
        tempDict['smog'] = round(doc._.smog, 3)
        tempDict['coleman_liau_index'] = round(doc._.coleman_liau_index, 3)

    except Exception:
        # e.g. fullText is missing; mark the scores as NaN, matching read_time below
        tempDict['flesch_kincaid_grade_level'] = np.nan
        tempDict['flesch_kincaid_reading_ease'] = np.nan
        tempDict['smog'] = np.nan
        tempDict['coleman_liau_index'] = np.nan

    try:
        # estimated reading time in seconds, from readtime
        readtime_obj = readtime.of_text(string)
        tempDict['read_time'] = readtime_obj.seconds

    except Exception:
        tempDict['read_time'] = np.nan

    # return a dictionary with all scores
    return tempDict


In [15]:
# TESTING TESTING 123

# tiny_df = agreements[:25].copy()  # .copy() avoids the SettingWithCopyWarning shown in the output below

# tiny_df['scoreDict'] = tiny_df.fullText.progress_apply(get_all_scores)

# tiny_df

100%|██████████| 25/25 [00:15<00:00,  1.64it/s]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  tiny_df['scoreDict'] = tiny_df.fullText.progress_apply(get_all_scores)


| | relativePath | companyName | documentType | documentName | timestamp | fullFilePath | fullText | scoreDict |
|---|---|---|---|---|---|---|---|---|
| 0 | ../data/raw/dataset-2021-01-06-e365c67 | 123Greetings | Privacy Policy | 2012-10-03--20-16-21.md | 2012-10-03 20:16:21 | ../data/raw/dataset-2021-01-06-e365c67\123Gree... | Home Corporate Info Company Profile Careers In... | {'flesch_kincaid_grade_level': 12.761, 'flesch... |
| 1 | ../data/raw/dataset-2021-01-06-e365c67 | 123Greetings | Privacy Policy | 2012-10-10--21-18-01.md | 2012-10-10 21:18:01 | ../data/raw/dataset-2021-01-06-e365c67\123Gree... | Home Corporate Info Company Profile ... | {'flesch_kincaid_grade_level': 11.069, 'flesch... |
| 2 | ../data/raw/dataset-2021-01-06-e365c67 | 123Greetings | Privacy Policy | 2018-05-10--05-43-56.md | 2018-05-10 05:43:56 | ../data/raw/dataset-2021-01-06-e365c67\123Gree... | Home Corporate Info Company Profile ... | {'flesch_kincaid_grade_level': 10.896, 'flesch... |
| 3 | ../data/raw/dataset-2021-01-06-e365c67 | 123Greetings | Privacy Policy | 2018-05-19--05-45-06.md | 2018-05-19 05:45:06 | ../data/raw/dataset-2021-01-06-e365c67\123Gree... | Home Corporate Info Company Profile ... | {'flesch_kincaid_grade_level': 10.902, 'flesch... |
| 4 | ../data/raw/dataset-2021-01-06-e365c67 | 123Greetings | Privacy Policy | 2018-05-20--05-48-46.md | 2018-05-20 05:48:46 | ../data/raw/dataset-2021-01-06-e365c67\123Gree... | Home Corporate Info Company Profile ... | {'flesch_kincaid_grade_level': 10.902, 'flesch... |
| 5 | ../data/raw/dataset-2021-01-06-e365c67 | 123Greetings | Privacy Policy | 2018-05-23--05-46-34.md | 2018-05-23 05:46:34 | ../data/raw/dataset-2021-01-06-e365c67\123Gree... | Home Corporate Info Company Profile ... | {'flesch_kincaid_grade_level': 12.467, 'flesch... |
| 6 | ../data/raw/dataset-2021-01-06-e365c67 | 123Greetings | Privacy Policy | 2018-05-24--05-43-56.md | 2018-05-24 05:43:56 | ../data/raw/dataset-2021-01-06-e365c67\123Gree... | Home Corporate Info Company Profile ... | {'flesch_kincaid_grade_level': 12.472, 'flesch... |
| 7 | ../data/raw/dataset-2021-01-06-e365c67 | 123Greetings | Privacy Policy | 2020-02-28--07-00-30.md | 2020-02-28 07:00:30 | ../data/raw/dataset-2021-01-06-e365c67\123Gree... | Home Ecards Photocards Connect ... | {'flesch_kincaid_grade_level': 12.246, 'flesch... |
| 8 | ../data/raw/dataset-2021-01-06-e365c67 | 123Greetings | Privacy Policy | 2020-02-29--06-59-42.md | 2020-02-29 06:59:42 | ../data/raw/dataset-2021-01-06-e365c67\123Gree... | Home Ecards Photocards Connect ... | {'flesch_kincaid_grade_level': 11.984, 'flesch... |
| 9 | ../data/raw/dataset-2021-01-06-e365c67 | 123Greetings | Privacy Policy | 2020-03-02--07-04-39.md | 2020-03-02 07:04:39 | ../data/raw/dataset-2021-01-06-e365c67\123Gree... | Home Ecards Photocards Connect ... | {'flesch_kincaid_grade_level': 11.905, 'flesch... |


In [None]:
agreements['scoreDict'] = agreements.fullText.progress_apply(get_all_scores)

agreements.to_csv('../data/processed/scoredAgreements.csv', header=True, index=False)

 93%|█████████▎| 8639/9241 [1:46:54<19:53,  1.98s/it]   
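
One thing to keep in mind about the cell above: writing the `scoreDict` column to CSV stores each dictionary as its string representation, so reading `scoredAgreements.csv` back gives strings rather than dicts. The sketch below shows one way to recover them with `ast.literal_eval`. The helper name `parse_score_dict` is hypothetical, and the fallback to an empty dict is an assumption: `literal_eval` cannot parse a bare `nan`, so rows whose scores were recorded as NaN are returned empty here.

```python
import ast


def parse_score_dict(cell):
    """Turn a stringified score dictionary back into a dict.

    Returns an empty dict when the cell is not a parseable dict literal
    (for example when it contains a bare nan or is not a string).
    """
    try:
        parsed = ast.literal_eval(cell)
    except (ValueError, SyntaxError, TypeError):
        return {}
    return parsed if isinstance(parsed, dict) else {}


# Hypothetical cell value, shaped like the dictionaries built by get_all_scores.
cell = "{'flesch_kincaid_grade_level': 12.761, 'read_time': 300}"
scores = parse_score_dict(cell)
print(scores['read_time'])
```

With pandas, the parsed dictionaries could then be expanded into one column per score, for example with `scored.scoreDict.apply(parse_score_dict).apply(pd.Series)`, where `scored` is assumed to be the reloaded DataFrame.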

In [None]:
agreements.head(25)

In [10]:
# newCols = ['fk_grade_level', 'fk_reading_ease', 'smog', 'cl_index', 'automated_readability_index']
# funcs = [get_fk_grade_level, get_fk_reading_ease, get_smog, get_coleman_liau_index, get_automated_readability_index]