# Introduction and Importing Data
Welcome to my code notebook for this project! Here, you'll find my code for the project as well as some documentation for the steps I take. If you haven't already, check my `README.md` for information about my project, my data, and my licensing.   
  
The first step of any data analysis process is finding and importing the data. I'll begin by importing some Python packages and a CSV file that lists all the websites in my dataset.

In [1]:
import glob
import nltk
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
data = pd.read_csv('../fulldata/sites.csv')

In [3]:
data

Unnamed: 0,filename,websites,titles
0,theatlantic.txt,theatlantic.com,The Atlantic
1,imdb.txt,imdb.com,IMDB
2,nytimes.txt,nytimes.com,The New York Times
3,voxmedia.txt,voxmedia.com,Vox
4,nbcuniversal.txt,nbcuniversal.com,NBC Universal Media
...,...,...,...
108,dailynews.txt,dailynews.com,Los Angeles Daily News
109,lids.txt,lids.com,Lids
110,sports-reference.txt,sports-reference.com,Sports Reference
111,foxsports.txt,foxsports.com,Fox Sports Insider


Our dataframe here is 113 rows x 3 columns, so there are a total of 113 websites in my dataset. These websites include news sites, social media sites, and business sites. 

  
Next, I will import the content of the privacy policies from the text files that I converted from HTML. I was going to try to read in the HTML files with a Python package called BeautifulSoup, but it was giving me trouble. I use the glob package here to read each file's contents and match them up to their respective websites. 

In [4]:
filepath = '../fulldata/textpolicies/'
def readtxt(fn):
    # glob returns a list of matching paths; take the first match
    with open(glob.glob(filepath + fn)[0]) as f: # the file is closed automatically
        return f.read()

data['content'] = data['filename'].apply(readtxt)

data.head()

Unnamed: 0,filename,websites,titles,content
0,theatlantic.txt,theatlantic.com,The Atlantic,"*Privacy Policy *\n\n*Effective: January 1, 20..."
1,imdb.txt,imdb.com,IMDB,"IMDb Privacy Notice\n\n|||Last Updated, Decemb..."
2,nytimes.txt,nytimes.com,The New York Times,"*Privacy Policy *\n\nLast Updated on June 10, ..."
3,voxmedia.txt,voxmedia.com,Vox,Vox Media Privacy Policy\n\n|||*Updated as of ...
4,nbcuniversal.txt,nbcuniversal.com,NBC Universal Media,Full Privacy Policy\n\nLast updated: 14 Januar...


In [5]:
len(data)

113

# Data Cleaning
As you can see from the dataframe above, there are a lot of non-alphanumeric symbols in the content column. There are asterisks for denoting bold, newline characters, and three vertical bars that denote headings. These aren't relevant to the analysis I'm going to perform, so I am going to drop these characters (along with colons) by splitting the strings by a character into an array of strings, joining them back together, and repeating until the asterisks, newlines, vertical bars, and colons are gone. 

In [6]:
def strip_markup(text):
    # replace each marker with a space: headings (|||), bold (*), newlines, colons
    for marker in ('|||', '*', '\n', ':'):
        text = ' '.join(text.split(marker))
    return text.split(' ') # a list of words; empty strings are dropped later

# apply to the whole column at once instead of assigning cell by cell,
# which avoids pandas' chained-assignment pitfalls
data['content'] = data['content'].apply(strip_markup)

In [7]:
data

Unnamed: 0,filename,websites,titles,content
0,theatlantic.txt,theatlantic.com,The Atlantic,"[, Privacy, Policy, , , , , Effective, , Janua..."
1,imdb.txt,imdb.com,IMDB,"[IMDb, Privacy, Notice, , , Last, Updated,, De..."
2,nytimes.txt,nytimes.com,The New York Times,"[, Privacy, Policy, , , , Last, Updated, on, J..."
3,voxmedia.txt,voxmedia.com,Vox,"[Vox, Media, Privacy, Policy, , , , Updated, a..."
4,nbcuniversal.txt,nbcuniversal.com,NBC Universal Media,"[Full, Privacy, Policy, , Last, updated, , 14,..."
...,...,...,...,...
108,dailynews.txt,dailynews.com,Los Angeles Daily News,"[, PRIVACY, POLICY, , , , This, policy, descri..."
109,lids.txt,lids.com,Lids,"[Privacy, Policy, , , Last, updated, , August,..."
110,sports-reference.txt,sports-reference.com,Sports Reference,"[SPORTS, REFERENCE, LLC, -, Privacy, Statement..."
111,foxsports.txt,foxsports.com,Fox Sports Insider,"[Privacy, Policy, Effective, Date, , June, 11,..."


All of those unneeded characters are out of the way, but now there are lots of empty strings/strings that are just spaces in the array of words. I'll drop those. I'm also dropping punctuation after I tokenize the words with NLTK's word_tokenize function. I'm also going to set everything to lowercase so it's easier to look at type-token ratio later on. This way, a capitalized word and a lowercase word won't be counted as two different words. See my comments in the following cells for details on what each portion does. 

In [None]:
def drop_empty(words): # dropping empty strings
    kept = [w for w in words if w not in ('', ' ')]
    return ' '.join(kept) # joining the array of strings back together into one string

data['content'] = data['content'].apply(drop_empty)

In [None]:
wordtokens = data.content.map(nltk.word_tokenize) # tokenizing the words in each of the entries in content cols

In [None]:
data['tokens'] = wordtokens # creating a new column for the tokenized words

In [None]:
def removepunc(s): # a function for removing commas, periods, etc, as they are unimportant
    words = s
    words = [word.lower() for word in words if word.isalnum()] # making word tokens lowercase too!
    return words

In [None]:
data['tokens'] = data['tokens'].apply(removepunc) # removing punctuation... 

In [None]:
data # done!
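To make the punctuation filter concrete, here is a small sketch on a hand-written token list. The list stands in for the output of `nltk.word_tokenize` (an assumption for illustration; the real tokens come from the policies), and `remove_punc_demo` is a hypothetical name that mirrors `removepunc` above.

```python
# Toy tokens standing in for tokenizer output; not taken from any policy.
tokens = ["Privacy", "Policy", ",", "Effective", ":", "January", "1", ",",
          "2020", ".", "third-party"]

def remove_punc_demo(tokens):
    # str.isalnum() is False for ",", ":" and ".", and also for hyphenated
    # tokens such as "third-party", so those are dropped too.
    return [t.lower() for t in tokens if t.isalnum()]

cleaned = remove_punc_demo(tokens)
print(cleaned)
```

The side effect is worth keeping in mind: any token containing a hyphen or apostrophe fails `isalnum()` and disappears from the word counts along with the punctuation.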

# Analysis
## Length in Words
Our original data is clean, so let's get into some analysis. There are several ways to judge whether a privacy policy is a good one. Length in words, average word length, and type-token ratio (word uniqueness) are the three metrics that I'll be looking at. I'll start by just looking at the total length in words. 

In [None]:
data['length'] = data['tokens'].apply(len) # number of tokens in each policy

In [None]:
data

Now, I'm going to check the shortest and longest policies by word. 

In [None]:
data['length'].max()

In [None]:
data['length'].min()

As we can see here, there's a pretty large range as far as the number of words goes. For the sake of visualization, I'm going to make another column that shows which word-count range each policy falls into. These ranges will be in 500-word intervals for counts less than 5000, and 1000-word intervals for word counts of 5000 or more. 

In [None]:
def length_category(n):
    # 500-word bins below 5000 words, 1000-word bins from 5000 through 7999
    if n >= 8000:
        return None # no category is defined above 7999 words
    width = 500 if n < 5000 else 1000
    low = (n // width) * width
    return f'{low}-{low + width - 1}'

data['lencat'] = data['tokens'].apply(len).apply(length_category)

In [None]:
data

Let's take a look at these categories to see what ranges are the most common. 

In [None]:
data['lencat'].value_counts()

In [None]:
lencounts = data['lencat'].value_counts()

In [None]:
lencounts.plot(kind='bar', figsize=(7,5))

The most common lengths for privacy policies seem to be 3000-3499, 1000-1499, and 0-499 words. According to [wordcounter.io](https://wordcounter.io/faq/how-many-pages-is-1500-words/), 3000-3499 words is like a 12-14 page paper, double spaced. 1000-1499 words is 4-6 pages, double spaced. 500 words is 2 pages, double spaced. It is promising to see quite a few policies below 1000 words. A shorter privacy policy means that people are probably more inclined to read it.  
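The page estimates here all work out to roughly 250 words per double-spaced page. That figure is my own inference from the numbers quoted above (500 words is about 2 pages), so treat this small sketch as an approximation; `pages_double_spaced` is a hypothetical helper.

```python
# Assumption: about 250 words per double-spaced page, inferred from the
# figures quoted in the text (500 words is roughly 2 pages).
WORDS_PER_PAGE = 250

def pages_double_spaced(n_words):
    return n_words / WORDS_PER_PAGE

for n in (500, 1500, 3000, 7000):
    print(n, "words is about", pages_double_spaced(n), "pages")
```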
  
However, there are a couple policies that are worryingly large. There are two policies that are from 7000-7999 words long, which is approximately 30 pages, double spaced. This is ten pages longer than my limit for my research paper during my senior year of high school, and to expect someone to be able to read these policies is ridiculous. Let's check out which policies are over 7000 words.  

In [None]:
data['titles'][pd.Index(data['lencat']).get_loc('7000-7999')]

This is interesting. Barnes and Noble's policy being long makes a good bit of sense, since there's probably some form of data collection for online orders. I don't know much about Latin Post, though. While it makes sense that B&N's policy is long, that doesn't mean it should be, especially for a website that handles transactions. It can be a complicated process, sure, but it should also be a transparent process that doesn't take half an hour to understand the extent of.  
  
What about the shortest policy with 81 words?

In [None]:
data['titles'][pd.Index(data['length']).get_loc(81)]

In [None]:
data['content'][pd.Index(data['length']).get_loc(81)]

This is a pretty solid and easy-to-read policy. It gives the reader a way to opt out of email information being collected, and it specifies that the data Tanger Outlets collects from its customers is not sold or redistributed in any way. However, short privacy policies like this one may not be sufficient, as it doesn't describe what is defined as "personal identifiable information" (or PII). Lengthier policies are more likely to describe what constitutes this information. They're also more likely to elaborate on what the website uses this PII for. 
  
  
While I did anticipate that a shorter policy in terms of word count would be easy to read, that doesn't necessarily mean it's a good policy. A human readable policy should be on the lower end of word count, yes, but too short of a word count can mean that important information and definitions regarding PII could potentially be omitted. This has the potential to leave the reader as clueless for a tiny policy as they would be for a huge, novel-length policy. In short, a lower word count tends to be better, but a website must be careful to not make the policy too short. This way, readers won't be overwhelmed by the amount of information, but they'll still be informed. 

## Average Word Length
The next thing we'll work with is the average word length. In general, shorter words tend to be more common and easy to use, while longer words tend to be more difficult to use. This doesn't go for every single word in the English language, so this method of analysis by itself should be taken with a grain of salt. However, when combined with other metrics, it can be telling. Let's get started by creating a function that will get the average word length for each of the policies. 

In [None]:
def wordlen(t): # mean length of the alphabetic tokens in one policy
    lengths = [len(w) for w in t if w.isalpha()]
    return np.mean(lengths)

data['avg_wlen'] = data['tokens'].apply(wordlen)
data
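As a quick worked example of the calculation, here is the same idea applied to a handful of made-up tokens. The tokens are invented for illustration, and the sketch uses `statistics.mean` from the standard library rather than `np.mean` only so that it runs without NumPy.

```python
from statistics import mean

# Toy tokens for illustration only; they are not taken from any policy.
tokens = ["we", "collect", "your", "email", "address", "2020"]

# Same idea as wordlen: keep alphabetic tokens, then average their lengths.
lengths = [len(w) for w in tokens if w.isalpha()]
avg = mean(lengths)
print(lengths, avg)
```

The numeric token `"2020"` is skipped by `isalpha()`, so the average is taken over the five remaining words.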

Alright, just from what we see here, the average of the average word lengths is around five characters, no matter the length. Let's double check that, though. 

In [None]:
data['avg_wlen'].max()

In [None]:
data['avg_wlen'].min()

Okay, so there is a difference between the largest average word length and the smallest average word length, even if it is just by a character. Let's look at where these policies are from and check out their word counts. 

In [None]:
data['titles'][pd.Index(data['avg_wlen']).get_loc(data['avg_wlen'].max())]

In [None]:
data['length'][pd.Index(data['avg_wlen']).get_loc(data['avg_wlen'].max())]

This is a short policy... let's read it:

In [None]:
data['content'][pd.Index(data['avg_wlen']).get_loc(data['avg_wlen'].max())]

This is a pretty spot-on privacy policy. There's little legalese and while it is daunting to look at in one big blob of text like this, it's very probable that with appropriate headings and spaces, it's easy to read. It's under 1000 words, it reports on what services it uses with PII (Google Analytics, etc.), it references specific laws, and is very clear under what circumstances PII is released. This goes to show that average word length isn't really that good of a metric to measure whether a privacy policy is readable or not. However, as I mentioned before, it can be combined with other metrics to make a more informed argument.  
  
I'm still curious as to what the minimum average word length's policy looks like. 

In [None]:
data['titles'][pd.Index(data['avg_wlen']).get_loc(data['avg_wlen'].min())]

In [None]:
data['length'][pd.Index(data['avg_wlen']).get_loc(data['avg_wlen'].min())]

Another, even shorter one! Since it's short (less than 1000 words), we can look at it like we did with the Dallas County Community College District privacy policy above. 

In [None]:
data['content'][pd.Index(data['avg_wlen']).get_loc(data['avg_wlen'].min())]

This is another great policy. It clearly states what it defines as PII, and even defines a cookie to those who might not be as Internet-literate. This definition and explanation of cookies is especially important to ensure that people know what they're signing up for. It even specifies that PII will be removed upon request if a user wishes to do so and instructs readers on how to do it.
  
  
While DCCCD's policy is a bit longer, there is more PII that a community college may collect on its students. School websites have the justification to do this, so it makes sense that more regulations are outlined in DCCCD. Both privacy policies that I've looked at do all of the things a privacy policy should: keep it short and human-readable, but still outline the policy in a thorough way so users won't be left guessing over how much of their information is collected by the website. This thoroughness is what the Tanger Outlets policy lacks. It tells us about how PII is not shared with third parties, but does not specify anything about what PII exactly it collects. 
  
  
To conclude this section, though: average word length should *not* be the only metric used in determining whether a privacy policy is readable to the average Internet user. Both the policies with the longest and shortest average word length were perfectly readable and accomplished everything a privacy policy should. This doesn't mean this section was completely useless, however. As I've mentioned before, this metric can be a supplement for other metrics. That said, let's move on. 

## Type-Token Ratio
The final metric that I'll be looking at is type-token ratio (TTR). TTR is a measure of lexical diversity, or word uniqueness. It takes the number of unique words (types) and divides that number by the total number of words (tokens) in a given text. A high TTR in this case could be an indication of lots of legalese being used. I will begin here by defining a function, much like the functions I've defined above, that gets the TTR for all of these policies. 

In [None]:
def get_ttr(tokens):
    lower = [w.lower() for w in tokens]
    return len(set(lower))/len(lower)
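To see what the ratio looks like on a tiny input, here is a sketch with a made-up sentence. `ttr_demo` is a hypothetical stand-in for `get_ttr`, with one added guard for empty token lists, which is my own assumption about how that case might be handled.

```python
def ttr_demo(tokens):
    # Guard against empty token lists, which would otherwise divide by zero.
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Toy sentence: "we" and "data" each appear twice.
tokens = "we collect data and we share data".split()
print(len(set(tokens)), "types /", len(tokens), "tokens =", round(ttr_demo(tokens), 3))
```

Five distinct words over seven total gives a TTR of about 0.71; every repeated word pulls the ratio down.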

In [None]:
data['TTR'] = data.tokens.map(get_ttr)
data

We're already seeing some interesting things here. Despite Latin Post having an incredibly long word count, it has an incredibly low TTR. We'll just repeat the procedure we did with the other two metrics:

In [None]:
data['TTR'].max()

In [None]:
data['titles'][pd.Index(data['TTR']).get_loc(data['TTR'].max())]

We already looked at Tanger Outlets, and their privacy policy was very short. Shorter privacy policies tend to have a higher TTR just because they're short. We should probably check out some of the higher TTRs for policies that are longer. Let's do that next by looking at policies with at least 100 words and a TTR of 0.5 or more.

In [None]:
for i in range(len(data)):
    if data['length'][i] >= 100 and data['TTR'][i] >= 0.5:
        print(data['titles'][i], "- length:", data['length'][i])

All of these policies have at least 100 words in them, but they also have fewer than 500 words. Again, shorter policies usually have higher TTRs not because of lexical diversity, but because there's less text in which words can repeat. This doesn't mean these are good or bad policies; it really depends on the content. Let's take a peek at one of them:

In [None]:
data['content'][pd.Index(data['length']).get_loc(195)] # community coffee privacy policy

This policy does the job. It isn't perfect, but it explains things like cookies and tells the reader some of what they consider PII. It explains what this PII is used for (creating a "more personalized shopping experience") and  ensures encryption of the PII it collects. Let's look at one more policy just to be sure. 

In [None]:
data['content'][pd.Index(data['length']).get_loc(210)] # dog breed info center

Another decent policy. It outlines what PII is used for (submitting photos/classifieds/surveys) and says that it doesn't use cookies. It also notifies the user of Google Adsense being used, which is essentially a third party. It redirects the user to Google Adsense sites and explains Google Adsense on a surface level in order to answer any questions a reader might have. Now, we should maybe take a peek at longer policies that don't necessarily have a higher TTR than these shorter ones, but a higher TTR nonetheless. 

In [None]:
for i in range(len(data)):
    if data['length'][i] >= 2000 and data['TTR'][i] >= 0.25: # 8 page paper double spaced, TTR 25% uniqueness
        print(data['titles'][i], "- length:", data['length'][i], "- TTR:", data['TTR'][i])

I'm going to come back to these later... let's check out the smallest TTR. 

In [None]:
data['TTR'].min()

In [None]:
data['titles'][pd.Index(data['TTR']).get_loc(data['TTR'].min())]

In [None]:
data['length'][pd.Index(data['TTR']).get_loc(data['TTR'].min())]

This also makes a bit of sense. TTR has a lot to do with text length here. This is why we need to combine it with different metrics and only look at texts of a certain length, which is what I'll be doing next. 
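One way to see the length effect, and one possible way around it, is to compute TTR over a fixed number of tokens. This is not what the rest of this notebook does; it is a swapped-in illustration, and `ttr_first_n` is a hypothetical helper.

```python
def ttr(tokens):
    return len(set(tokens)) / len(tokens)

def ttr_first_n(tokens, n):
    """TTR of the first n tokens, or None when the text is shorter than n."""
    if len(tokens) < n:
        return None
    return ttr(tokens[:n])

# Toy text: six distinct words, then the same six words repeated ten times.
short_text = "we may share your personal data".split()
long_text = short_text * 10

# Same vocabulary, but the longer text gets a much lower TTR.
print(ttr(short_text), ttr(long_text))
# With a fixed window of six tokens, the two texts score the same.
print(ttr_first_n(short_text, 6), ttr_first_n(long_text, 6))
```

The two toy texts use exactly the same words, so the gap in raw TTR comes entirely from length.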

## Moving On and Combining Metrics
TTR tends to be tricky to work with because of its dependency on length. Because of this, I'm going to have to take some alternative approaches to looking at some of these TTR numbers. What I'm going to do next is look at all of the longer unique words in a couple of these policies. 

In [None]:
def unique(ls): 
    unq = [] 
    for x in ls: 
        # keep only the first occurrence of each word
        if x not in unq: 
            unq.append(x) 
    return unq

In [None]:
unq_pbs = unique(data['tokens'][pd.Index(data['length']).get_loc(2237)]) # looking at pbs's website!

In [None]:
for x in unq_pbs:
    if len(x) > 8: # longer words are generally more complicated
        print(x, end=' ')
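The list of longer unique words can also be built and counted in one step. This is a sketch with made-up tokens: `long_unique_words` is a hypothetical helper, and it uses `dict.fromkeys` as an order-preserving alternative to the `unique` function above.

```python
def long_unique_words(tokens, min_len=9):
    # dict.fromkeys keeps the first occurrence of each token, in order.
    # min_len=9 matches the len(x) > 8 check used in the cells here.
    return [w for w in dict.fromkeys(tokens) if len(w) >= min_len]

# Toy tokens for illustration only.
tokens = ["we", "share", "aggregate", "information", "with", "affiliates",
          "and", "aggregate", "information", "partners"]
words = long_unique_words(tokens)
print(len(words), words)
```

Calling `len` on the result gives the number of longer unique words directly.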

Quite a few big words being used here - 129, to be exact. It's hard to say whether they can be considered *too* complicated because we can't see the context in which they're being used, but you have to wonder whether these words, like "aggregate" and "affiliates" can be properly understood at an eighth grade level. 
  
There is another policy in here with a higher TTR than PBS by a notable amount; it might look like only a 2% increase, but depending on what words we find here, this policy could be worse. Or, it could be better, and the increase is just because there are fewer words in this policy -- the Gawker policy. We'll simply repeat the process we did above with the PBS policy. 

In [None]:
unq_gawker = unique(data['tokens'][pd.Index(data['titles']).get_loc('Gawker')])

In [None]:
for x in unq_gawker:
    if len(x) > 8: # longer words are generally more complicated
        print(x, end=' ')

There are 134 longer words here, and this doesn't seem to be any better than PBS's policy as far as words that a middle schooler would be able to understand. We see "aggregate" and "affiliate" again, which are words that I admittedly had to look up the definitions for. 

Let's go back again and check out the unique words in Honda's policy. It's a lengthy policy, but it has a low TTR. Is this because it has a low level of lexical diversity, or is it just a matter of length? 

In [None]:
unq_honda = unique(data['tokens'][pd.Index(data['titles']).get_loc('Honda')])

In [None]:
for x in unq_honda:
    if len(x) > 8: # longer words are generally more complicated
        print(x, end=' ')

Looks like it's just a matter of length. There's still a large amount of lengthier words. Some are easy to understand and define, but some aren't. As far as the eighth grader-readable thing goes, I'm sure 13-year-olds probably won't have any reason to go onto a Honda website, but that readability is important to ensure anyone who needs to use that website can use it in an informed manner.  
  
  
Next, I'm going to look at the averages of each of these metrics to find the "average" privacy policies. Then I'll see how many longer words those policies use. 

In [None]:
data['TTR'].mean() # average TTR 

In [None]:
data['length'].mean() # average word count

In [None]:
data['avg_wlen'].mean() # average word length

In [None]:
for i in range(len(data)): # finding policies that loosely fit these averages
    if (1500 <= data['length'][i] < 3000
            and 4.9 <= data['avg_wlen'][i] < 5.2
            and 0.26 <= data['TTR'][i] < 0.31):
        print(data['titles'][i], "- length:", data['length'][i], "- TTR:", data['TTR'][i], "- average word length:", data['avg_wlen'][i] )

In writing the above lines of code, I had to stretch the averages a bit so I could get more policies to look at. We already looked at PBS and Gawker, so I'm going to peek at two of these other policies: IMDB and Post Gazette, websites that I'm familiar with. I'm just going to call `unique` on these two policies to see if there are any complicated words. 

In [None]:
unq_imdb = unique(data['tokens'][pd.Index(data['titles']).get_loc('IMDB')])
for x in unq_imdb:
    if len(x) > 8: # longer words are generally more complicated
        print(x, end=' ')

I found these longer words from IMDB's site easier to understand than the previous sites I looked at. There are some words that aren't really perfect, like "subsidiaries", and "affiliated" shows up again, but I'm seeing more common, less confusing words here. This is a good sign, because IMDB isn't a site that seems to handle transactions or super sensitive information. Let's see if the same goes with Post Gazette. It isn't the same type of website as IMDB, but we'll see how it compares:

In [None]:
unq_pg = unique(data['tokens'][pd.Index(data['titles']).get_loc('Post Gazette')])
for x in unq_pg:
    if len(x) > 8: # longer words are generally more complicated
        print(x, end=' ')

Even fewer hard-to-understand words here! "Affiliate" doesn't make an appearance here, which is already pretty great. It isn't perfect (none of these policies are, I'm sure), but Post Gazette does a good job of keeping it easy to understand. Since there aren't too many complicated words, the TTR is relatively low, and the word count is under 2000, we can conclude that (by these metrics, at least) the Post Gazette privacy policy does its job. 
  
  
The last thing I'll investigate is the stats of some bigger-name websites: Google, Amazon, Instagram, and Yahoo. Whether or not I'll print out the unique words of these sites depends on the stats. Some may be too long to even consider looking at, because I know they'll already be hard enough to read! Let's begin with Google. 

In [None]:
google_ind = pd.Index(data['titles']).get_loc('Google')
print(data['titles'][google_ind])
print("- length:", data['length'][google_ind])
print("- TTR:", data['TTR'][google_ind])
print("- average word length:", data['avg_wlen'][google_ind])

Google's policy, surprisingly enough, looks pretty good! There are fewer than 3000 words and a relatively low TTR. Let's peek at the unique words while we're at it. 

In [None]:
unq_google = unique(data['tokens'][google_ind])
for x in unq_google:
    if len(x) > 8: # longer words are generally more complicated
        print(x, end=' ')

Another instance of not perfect, but pretty good. From what I can see, it's likely that Google outlines ways to opt out of collection ("uninstall"). "Transparency" is also a good sign. Google seems to want its users to know what they're collecting and why they're collecting it. 
  
  
How about another search engine?

In [None]:
yahoo_ind = pd.Index(data['titles']).get_loc('Yahoo!')
print(data['titles'][yahoo_ind])
print("- length:", data['length'][yahoo_ind])
print("- TTR:", data['TTR'][yahoo_ind])
print("- average word length:", data['avg_wlen'][yahoo_ind])

Pretty great on the word count! TTR could be better, but that could just be because the word count is lower. This is another site whose longer unique words are worth a look. 

In [None]:
unq_yahoo = unique(data['tokens'][yahoo_ind])
for x in unq_yahoo:
    if len(x) > 8: # longer words are generally more complicated
        print(x, end=' ')

It's good to see that these two search engines have relatively short and easy-to-understand policies (at least, by these metrics), because search engines are an Internet user's gateway to a practically infinite amount of information. Let's compare them to the world's most famous shopping site: Amazon.

In [None]:
az_ind = pd.Index(data['titles']).get_loc('Amazon')
print(data['titles'][az_ind])
print("- length:", data['length'][az_ind])
print("- TTR:", data['TTR'][az_ind])
print("- average word length:", data['avg_wlen'][az_ind])

We're closer in length here to Google's policy than Yahoo's. This isn't necessarily a bad thing, because the policy of a website like Amazon, which handles location data and transactions, *should* probably be a bit longer. Longer unique words could make or break the policy, though. Certain longer words like "transmission" or anything money-related are good signs. 

In [None]:
unq_az = unique(data['tokens'][az_ind])
for x in unq_az:
    if len(x) > 8: # longer words are generally more complicated
        print(x, end=' ')

We're seeing money-related terms, which is a good sign. "Affiliated" is there again, but from these words alone, it looks like the policy is doing what it needs to do. It could probably use clearer wording around some of the words that seem confusing. Websites like Amazon handle more sensitive information than other websites, so clarity and true informed consent are incredibly important here. 
  
  
We looked at search engines, news sites, shopping sites, and more. I'm going to conclude this section by looking at Instagram to cover some social media. Instagram has questionable privacy practices, especially because of it being owned by Facebook. But we're not here to discuss these privacy policies in practice; we're here to see if it's reasonable for someone to read them. Let's get on with it!

In [None]:
ig_ind = pd.Index(data['titles']).get_loc('Instagram')
print(data['titles'][ig_ind])
print("- length:", data['length'][ig_ind])
print("- TTR:", data['TTR'][ig_ind])
print("- average word length:", data['avg_wlen'][ig_ind])

A shorter average word length and TTR, but a higher word count... it's probably worth it to take a look at the unique words like we did before. 

In [None]:
unq_ig = unique(data['tokens'][ig_ind])
for x in unq_ig:
    if len(x) > 8: # longer words are generally more complicated
        print(x, end=' ')

Some interesting things here. Not sure how much of a place the word "beautiful" has in a privacy policy, so that's a bit of a red flag. There are more legal-related terms here, with "jurisdiction(s)" and "governmental" -- governmental isn't really a hard-to-understand word, but it's a word we haven't seen in the policies we looked at before. Many of these words are easy to understand and there aren't too many longer words, but some of the words like the ones I mentioned previously make me feel a bit more uneasy about whether or not this policy does what it needs to. 

All in all, I think I've concluded that this is just a hard task to analyze computationally. To determine something as "human-readable" should involve a human, shouldn't it...? I'll address this more in the next and final heading!
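The per-site checks in this section print the same three numbers each time. A small helper could compute them from a token list in one call. This is a sketch only: `summarize` is a hypothetical function, the tokens are made up, and it uses `statistics.mean` in place of `np.mean` so that it runs with the standard library alone.

```python
from statistics import mean

def summarize(tokens):
    """Return word count, TTR, and average word length for one token list."""
    alpha = [w for w in tokens if w.isalpha()]
    return {
        "length": len(tokens),
        "TTR": len(set(tokens)) / len(tokens),
        "avg_wlen": mean(len(w) for w in alpha),
    }

# Toy tokens for illustration only.
stats = summarize("we collect data and we share data".split())
print(stats)
```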

# Concluding Remarks
What follows is a bulleted list (for the sake of ease of reading -- take notes, policy writers!) of what I've learned in completing this project, including advice for privacy policy writers as well as general things about my process of text processing. 

## Some disclaimers
- Word count, word length, and TTR are only so telling on their own; even combined, there are still questions that need answers. All of the above comments on my code with these metrics should be taken with a grain of salt. Each metric on its own has its flaws:
    - Word count is probably the easiest metric to judge privacy policies by, but again, shorter doesn't mean better. A long policy could still be easy to understand. 
    - Average word length can be affected by words like articles and prepositions that are short, but not that meaningful on their own. 
    - TTR is largely dependent on the length of the policy. A higher TTR *could* mean lexical diversity, or it could just mean the policy is really short. 
- I did not get too many conclusive results out of this, and it's okay. This was less of a paper and more of an experiment with a hypothesis. My hypothesis that these metrics would be incredibly telling of the readability of privacy policies wasn't entirely correct, and that's fine. 
- Determining whether a policy is human-readable or not is a hard thing for a computer to do, so it's no surprise that I wasn't given all the answers. This idea is similar to the idea that I based this project off of:
    - In my Data Science for Linguists (LING1340) class, we used these same metrics to determine the proficiency of L2 English speakers. This was a hard task for a computer to do, mostly for the same reasons as I stated above.
    - I could only look at so many of these policies in full; printing out the entire policy would create blobs of text that would be daunting to read, which is exactly what I'm trying to tell these policy writers to avoid! 
- If I say a website's privacy policy is a good one, I am speaking on my data metrics. A good policy according to these metrics doesn't mean a good policy in practice. I'm well aware that Google has questionable privacy practices, but *based on my metrics*, its privacy policy is at least human-readable. 

## Advice for Policy Writers
- A shorter policy does not always mean a better policy! You can notify the reader whether or not your site uses cookies, but without an explanation of cookies, this doesn't inform them of much of anything.
- Even though none of the metrics I used is of much use on its own, word count is probably still what turns people off from reading privacy policies. 
    - A reader will be overwhelmed if they have to read anything more than the equivalent of a few pages, double spaced. Keep it less than 8 pages, double spaced, which is equivalent to around 2000 words. Anything more and people will feel either overwhelmed by the amount of content or get bored halfway through and start skimming. 
- The fact that there aren't any legal-specific terms in your privacy policy doesn't necessarily mean it's easier to understand. There are quite a few words that I saw that an eighth grader still wouldn't get. This lack of understanding means a lack of informed consent. 
- Be *especially* careful when writing your policies that involve transactions. This is one of those cases where it's okay to be a bit lengthier.
- Have outsiders proofread the policy just to make sure it makes sense. This sort of review is important because something that makes sense to a policy writer may not make the same amount of sense to someone who isn't a policy writer. 

## Advice for Internet Users
- These policies can be overwhelming, and it isn't necessarily the fault of the user that some of them are too daunting to take on, but there are some steps you can take to navigate the longer policies:
    - Use CTRL+F to look for privacy-related keywords, like "personal identifiable information", "cookies", "collection", etc. If a policy is full of fluff, it'll be easier to find what you really need to know if you look for certain keywords. 
    - Be informed! Look up the definitions of words like the above ("cookies", for example) if you don't know what they mean. Your understanding of them is essential to giving your true informed consent. If the policy doesn't define these things for you (which they should, if they're longer), then look them up.
- Again, a shorter policy doesn't mean a better policy. If a privacy policy is lacking, fill in the gaps.
    - See above with "be informed"! If a policy lacks certain definitions, look them up. 
    - If there isn't a list of what the website classifies as personal identifiable information, find a way to get into contact with the website to find out what exactly the site will be collecting. 
    
## Final Statements
This was a fun experiment to do. I don't have too many tools to expand on this further at the moment, but as I spend more time at Pitt and grow more proficient with text analysis, I may expand on this project to produce better results! I also tried to keep the writing that I did in my markdown cells within the range of a 6-8 page-ish paper, so I knew I'd be doing an equivalent amount of work to my peers who wrote papers. 
  
  
Thank you for reading, and stay informed!
