# Goal

The goal of this notebook is to combine these terms of service (TOS) documents into a single tabular dataset. Each record should contain the timestamp of the TOS file, the company that published it, and the full text of the document.

## Import tooling

In [1]:
import pandas as pd
import glob
import os
from datetime import datetime
from collections import Counter

from IPython.display import display, Markdown

from tqdm.notebook import tqdm
tqdm.pandas()


## Determine All the Relevant Files and their Paths

In [2]:
# get relative path for each company in the dataset parent directory
company_directories = glob.glob('../data/raw/dataset-2021-01-06-e365c67/*',
                                recursive=True)

# keep only the company name from each path
# (os.path.basename uses the host platform's separators, so this is not Windows-only)
company_names = [os.path.basename(x) for x in company_directories]

# determine how many types of legal documents are included in the dataset
document_types = glob.glob('../data/raw/dataset-2021-01-06-e365c67/*/*',
                           recursive=True)
# keep only the document type from each path
document_names = [os.path.basename(x) for x in document_types]
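
Paths returned by `glob` use the host platform's separator, and only the last component of each path is needed here. As a minimal sketch of a separator-agnostic way to pull out that component, `pathlib.PureWindowsPath` accepts both `/` and `\` as separators. The sample paths and the `last_component` helper are hypothetical and not part of the notebook.

```python
from pathlib import PureWindowsPath


def last_component(path):
    """Return the final path component, accepting '/' or '\\' separators."""
    # PureWindowsPath parses both separator styles and never touches the disk
    return PureWindowsPath(path).name


# hypothetical sample paths, one per separator style
windows_style = '../data/raw/example-dataset\\ExampleCo\\Terms of Service'
posix_style = '../data/raw/example-dataset/ExampleCo/Terms of Service'

names = [last_component(windows_style), last_component(posix_style)]
```

Both entries in `names` come out as the document type, regardless of which separator the path used.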

## Inspecting the Dataset

In [3]:
# determine how many companies are represented in dataset
display(
    Markdown('#### There are {} unique companies in the dataset.'.format(
        len(set(company_names)))))

# determine how many types of documents are represented in dataset
display(
    Markdown(
        '#### There are {} unique types of documents in the dataset, but Terms of Service are the most common.'
        .format(len(set(document_names)))))

counts_of_documents = pd.DataFrame.from_dict(Counter(document_names),
                                             orient='index')
counts_of_documents.columns = ['count']
counts_of_documents.sort_values(by='count', ascending=False)

#### There are 174 unique companies in the dataset.

#### There are 23 unique types of documents in the dataset, but Terms of Service are the most common.

| Document type | count |
|---|---|
| Terms of Service | 151 |
| Privacy Policy | 138 |
| Trackers Policy | 12 |
| Developer Terms | 11 |
| Copyright Claims Policy | 6 |
| Community Guidelines | 6 |
| Acceptable Use Policy | 5 |
| Data Processor Agreement | 4 |
| Commercial Terms | 4 |
| Parent Organization Privacy Policy | 4 |
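
For a quick look at the most frequent document types without building a dataframe, `Counter.most_common` returns the entries already sorted by descending count. The sketch below uses a small hypothetical list in place of `document_names`.

```python
from collections import Counter

# hypothetical document-type labels standing in for document_names
sample_names = [
    'Terms of Service', 'Privacy Policy', 'Terms of Service',
    'Trackers Policy', 'Privacy Policy', 'Terms of Service',
]

counts = Counter(sample_names)

# the two most frequent labels, as (label, count) pairs
top_two = counts.most_common(2)
```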


## Transform the Nested File Structure into a Tabular Representation

In [4]:
# find all the files within each company, each type of document, and each version
# each item in the list should represent a unique document
file_list = glob.glob('../data/raw/dataset-2021-01-06-e365c67/*/*/*',
                      recursive=True)

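# construct a list of lists, where each interior list will ultimately be a new row in a pandas dataframe
# each row is [dataset root, company, document type, file name]; normalizing the path first
# makes the split work with either Windows or POSIX separators
list_of_lists = [
    ['../data/raw/dataset-2021-01-06-e365c67'] +
    os.path.normpath(x).split(os.sep)[-3:] for x in file_list
]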

# construct a pandas dataframe from this nested list

file_df = pd.DataFrame(
    list_of_lists,
    columns=['relativePath', 'companyName', 'documentType', 'documentName'])
file_df.sample(4)

| | relativePath | companyName | documentType | documentName |
|---|---|---|---|---|
| 3575 | ../data/raw/dataset-2021-01-06-e365c67 | Allstate | Privacy Policy | 2017-05-22--04-58-00.md |
| 9180 | ../data/raw/dataset-2021-01-06-e365c67 | Zillow | Privacy Policy | 2019-10-11--06-01-58.md |
| 203 | ../data/raw/dataset-2021-01-06-e365c67 | ACDelco | Privacy Policy | 2013-08-10--04-28-34.md |
| 6212 | ../data/raw/dataset-2021-01-06-e365c67 | Xanga | Privacy Policy | 2014-09-07--04-29-49.md |


In [5]:
# convert filename string to a datetime
file_df['timestamp'] = file_df['documentName'].apply(
    lambda x: datetime.strptime(x[:-3], '%Y-%m-%d--%H-%M-%S'))

file_df['fullFilePath'] = file_list

file_df.sample(3)

| | relativePath | companyName | documentType | documentName | timestamp | fullFilePath |
|---|---|---|---|---|---|---|
| 8033 | ../data/raw/dataset-2021-01-06-e365c67 | Xanga | Privacy Policy | 2019-11-24--06-59-57.md | 2019-11-24 06:59:57 | ../data/raw/dataset-2021-01-06-e365c67\Xanga\P... |
| 6417 | ../data/raw/dataset-2021-01-06-e365c67 | Xanga | Privacy Policy | 2015-04-01--04-40-01.md | 2015-04-01 04:40:01 | ../data/raw/dataset-2021-01-06-e365c67\Xanga\P... |
| 4379 | ../data/raw/dataset-2021-01-06-e365c67 | Myspace | Terms of Service | 2020-08-25--15-30-33.md | 2020-08-25 15:30:33 | ../data/raw/dataset-2021-01-06-e365c67\Myspace... |
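
The file names encode the snapshot time as `YYYY-MM-DD--HH-MM-SS` followed by the `.md` extension. As a small standalone sketch of that conversion, the helper below strips the extension with `os.path.splitext` rather than slicing a fixed number of characters. The names `FILENAME_FORMAT` and `parse_snapshot_timestamp` are hypothetical.

```python
import os
from datetime import datetime

# timestamp layout used by the snapshot file names
FILENAME_FORMAT = '%Y-%m-%d--%H-%M-%S'


def parse_snapshot_timestamp(filename):
    """Convert a name like '2017-05-22--04-58-00.md' into a datetime."""
    # drop the extension, whatever its length
    stem, _extension = os.path.splitext(filename)
    return datetime.strptime(stem, FILENAME_FORMAT)


parsed = parse_snapshot_timestamp('2017-05-22--04-58-00.md')
```

A name that does not match the layout raises `ValueError`, which makes malformed file names easy to spot before they reach the dataframe.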


## Append the Text from Each File to DataFrame

First, we'll define a function that takes a file path, reads the contents, removes some markdown characters, and returns the contents as a single string. Then we can use .apply to run this function on each file. Since the file path is already stored in the dataframe, this should be a simple one-liner.

In [6]:
def get_file_contents(filepath):
    with open(filepath, 'r', encoding='utf-8') as file:
        # get the full contents of the file, one list item per line
        contents = file.readlines()

    # strip newlines, asterisks (markdown emphasis), apostrophes, and » characters
    contents = [
        x.replace("\n", '').replace("*", '').replace("'", '').replace("»", '')
        for x in contents
    ]

    # return file contents as a single string
    return ' '.join(contents)
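
To see what this kind of cleanup does to a document, the sketch below runs the same style of character removal on a small temporary markdown file. The `read_and_clean` function is a hypothetical stand-in written for this illustration, and the sample text is made up.

```python
import os
import tempfile


def read_and_clean(filepath):
    """Hypothetical stand-in mirroring the cleanup steps described above."""
    with open(filepath, 'r', encoding='utf-8') as file:
        lines = file.readlines()
    # drop newlines, asterisks, and apostrophes from every line
    cleaned = [
        line.replace('\n', '').replace('*', '').replace("'", '')
        for line in lines
    ]
    return ' '.join(cleaned)


# write a small markdown sample to a temporary file
with tempfile.NamedTemporaryFile('w', suffix='.md', delete=False,
                                 encoding='utf-8') as handle:
    handle.write("**Terms**\nDon't share your password.\n")
    sample_path = handle.name

result = read_and_clean(sample_path)

# clean up the temporary file
os.remove(sample_path)
```

Note that removing apostrophes also changes contractions ("Don't" becomes "Dont"), which may matter for later text analysis.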

### Applying our Function

In [7]:
file_df['fullText'] = file_df['fullFilePath'].progress_apply(get_file_contents)

  0%|          | 0/9241 [00:00<?, ?it/s]

### Writing out the tabular representation of the documents to a CSV

In [8]:
file_df.to_csv('../data/processed/agreements.csv', index=False, header=True)
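
Because the full text contains commas and quotation marks, it can be worth confirming that the CSV round-trips cleanly. The sketch below does this in memory with a one-row hypothetical dataframe; `parse_dates` restores the timestamp column as datetimes when reading back.

```python
import io

import pandas as pd

# hypothetical one-row sample with text that needs CSV quoting
sample = pd.DataFrame({
    'companyName': ['ExampleCo'],
    'timestamp': pd.to_datetime(['2017-05-22 04:58:00']),
    'fullText': ['Terms, with commas and "quotes" inside.'],
})

# write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False, header=True)
buffer.seek(0)

# read the data back, parsing the timestamp column as datetimes
restored = pd.read_csv(buffer, parse_dates=['timestamp'])
```

The same `parse_dates=['timestamp']` argument should apply when loading the processed file in a later notebook, assuming the column name is unchanged.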