# Tokenize privacy policies into a sentences corpus

We'll use NLTK's [`sent_tokenize`](https://www.nltk.org/api/nltk.tokenize.html) to split each policy in our privacy policies corpus into sentences. The resulting sentences corpus will feed our crowdsourced labeling tool.

In [17]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/javi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [2]:
import pandas as pd
from nltk.tokenize import sent_tokenize

In [19]:
policies_corpus = pd.read_pickle("data/1360_privacy_policies.pkl")

## Try with a single privacy policy first

Let's look at the first few rows of the corpus, then tokenize just the first privacy policy and see what we get.

In [20]:
policies_corpus[:10]

Unnamed: 0,name,policy
0,google.co.nz,Privacy Policy Last mo...
1,ibnlive.in.com,Privacy Policy IBN7IBN7...
2,gocomics.com,Privacy Policy At Univ...
3,petsmart.com,This document provides ...
4,duolingo.com,Privacy Policy ...
5,usda.gov,Privacy Policy Thank ...
6,fcbarcelona.com,PRIVACY POLICY. PROTECT...
7,zara.com,Zara´s Privacy Statemen...
8,infowars.com,"Infowars LLC, Terms of ..."
9,change.org,Privacy Policy About Te...


In [21]:
first_policy = policies_corpus.iloc[0]

In [22]:
policy_sentences = sent_tokenize(first_policy.policy)

In [23]:
policy_sentences

['                       Privacy Policy  Last modified: December 20, 2013 (view archived versions)  There are many different ways you can use our services – to search for and share information, to communicate with other people or to create new content.',
 'When you share information with us, for example by creating a Google Account, we can make those services even better – to show you more relevant search results and ads, to help you connect with people or to make sharing with others quicker and easier.',
 "As you use our services, we want you to be clear how we're using information and the ways in which you can protect your privacy.",
 'Our Privacy Policy explains:  What information we collect and why we collect it.',
 'How we use that information.',
 'The choices we offer, including how to access and update information.',
 "We've tried to keep it as simple as possible, but if you're not familiar with terms like cookies, IP addresses, pixel tags and browsers, then read about these key

In [24]:
policy_sentences[0]

'                       Privacy Policy  Last modified: December 20, 2013 (view archived versions)  There are many different ways you can use our services – to search for and share information, to communicate with other people or to create new content.'
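The first sentence above still carries the leading whitespace of the scraped page, and the page header ("Privacy Policy Last modified: ...") is fused into it. If the whitespace is unwanted in the labeling tool, one option is to normalize it before or after tokenizing. The helper below is a hypothetical sketch, not part of the original pipeline; `normalize_whitespace` is a name introduced here for illustration, and it does not separate the header from the sentence.

```python
import re


def normalize_whitespace(text):
    """Collapse every run of whitespace into one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


# Shortened stand-in for the first tokenized sentence shown above.
raw = "      Privacy Policy  Last modified: December 20, 2013  There are many ways."
cleaned = normalize_whitespace(raw)
print(cleaned)
```

In the notebook this could be applied as `normalize_whitespace(first_policy.policy)` before calling `sent_tokenize`, assuming whitespace carries no meaning we want to keep.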

## Tokenize the whole policies dataset

Now let's apply what we learned about NLTK tokenization to the whole privacy policies dataset and generate our sentences corpus.

In [25]:
sentences_cols = ["policy_id", "text"]
sentences_corpus = pd.DataFrame(columns = sentences_cols)
sentences_corpus

Unnamed: 0,policy_id,text


In [26]:
# Note: DataFrame.append was removed in pandas 2.0, so this cell needs an older pandas.
for policy_index, row in policies_corpus.iterrows():
    print("Tokenizing policy number: ", policy_index, row['name'])
    tokenized_privacy_policy = sent_tokenize(row['policy'])
    for sentence in tokenized_privacy_policy:
        # Append one (policy_id, text) row per sentence.
        sentences_corpus = sentences_corpus.append(
            pd.Series([policy_index, sentence], index=sentences_cols),
            ignore_index=True)
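Appending to a DataFrame one row at a time copies the whole frame on every call, and `DataFrame.append` no longer exists in pandas 2.0 and later. A common alternative is to collect the rows in a plain list and build the DataFrame once at the end. The sketch below illustrates that pattern; `build_sentence_rows`, `rows_to_frame` and `toy_tokenize` are hypothetical names introduced here, and the toy tokenizer only stands in for `sent_tokenize` so the example runs without NLTK data.

```python
def build_sentence_rows(policies, tokenize):
    """Return one {"policy_id", "text"} dict per sentence.

    policies: iterable of (policy_id, policy_text) pairs.
    tokenize: callable mapping a string to a list of sentences,
              for example nltk.tokenize.sent_tokenize.
    """
    rows = []
    for policy_id, policy_text in policies:
        for sentence in tokenize(policy_text):
            rows.append({"policy_id": policy_id, "text": sentence})
    return rows


def rows_to_frame(rows):
    """Build the sentences DataFrame in a single call (pandas imported lazily)."""
    import pandas as pd
    return pd.DataFrame(rows, columns=["policy_id", "text"])


def toy_tokenize(text):
    """Naive splitter on periods; a stand-in for sent_tokenize in this sketch only."""
    return [part.strip() + "." for part in text.split(".") if part.strip()]


demo_rows = build_sentence_rows(
    [(0, "We collect data. We share it."), (1, "No cookies.")],
    toy_tokenize,
)
print(demo_rows)
```

With the notebook's objects, this would look like `rows_to_frame(build_sentence_rows(policies_corpus['policy'].items(), sent_tokenize))`. Built this way, `policy_id` would most likely get an integer dtype instead of `object`, so the `describe()` summary would look different from the one shown in this notebook.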

In [30]:
sentences_corpus.describe()

Unnamed: 0,policy_id,text
count,115813,115813.0
unique,1360,97837.0
top,474,2.0
freq,936,102.0


In [32]:
sentences_corpus.to_pickle("data/sentences_corpus.pkl")

In [3]:
sentences_corpus = pd.read_pickle("data/sentences_corpus.pkl")

In [4]:
sentences_corpus.to_csv("data/sentences_corpus.csv")