# Nonprofit Version of "Survey of the top 100,000 most popular websites"

This notebook contains the analysis used to generate summary statistics for our story. It's a nonprofit-specific version of the original ["Survey of the top 100,000 most popular websites"](https://github.com/the-markup/investigation-blacklight-the-high-cost-of-free/blob/master/0-100k-scan.ipynb) notebook.

The original analysis was used in the Blacklight 'Show Your Work' piece [How We Built a Real-time Privacy Inspector](https://themarkup.org/blacklight/2020/09/22/how-we-built-a-real-time-privacy-inspector) and the story [The High Privacy Cost of a “Free” Website](https://themarkup.org/blacklight/2020/09/22/the-high-privacy-cost-of-a-free-website).

For more information about the columns of the reports used in this analysis, please refer to the [Blacklight Reporter](https://github.com/the-markup/blacklight-reporter#reports) GitHub repository.

In [306]:
import os
import json
import pandas as pd

In [307]:
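def get_summary_col_pct(df, col, total_count):
    """Return (rounded percent, count) of distinct origin domains where `col` is True."""
    num = set(df[df[col] == True].origin_domain)
    return (round(len(num)/total_count*100), len(num))

def get_sites_with_canvas_fp(df, total_count):
    """Return (rounded percent, count) of distinct origin domains with any canvas fingerprinter."""
    # Filter the `df` argument rather than the global `df_data`.
    fp_origin_domain = df[(df.has_third_party_canvas_fingerprinters == True) | (df.has_first_party_canvas_fingerprinters == True)].origin_domain
    num = set(fp_origin_domain)
    return (round(len(num)/total_count*100), len(num))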
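
As a quick illustration of how these helpers count, here is a minimal sketch on made-up data. The toy rows and domain names are assumptions for illustration, not values from the survey. The point it shows is that the helper counts distinct `origin_domain` values, so a site that appears in several rows is only counted once.

```python
import pandas as pd

def get_summary_col_pct(df, col, total_count):
    # Distinct domains where the flag column is True.
    num = set(df[df[col] == True].origin_domain)
    return (round(len(num) / total_count * 100), len(num))

# Hypothetical toy data: a.org appears twice but is counted once.
toy = pd.DataFrame({
    "origin_domain": ["a.org", "a.org", "b.org", "c.org", "d.org"],
    "has_key_loggers": [True, True, False, True, False],
})

total = toy.origin_domain.nunique()  # 4 distinct domains
pct, count = get_summary_col_pct(toy, "has_key_loggers", total)
print(f"{pct}% ({count}) of the {total} toy sites log keystrokes")
```

Because the denominator is passed in separately, the caller decides whether percentages are relative to all attempted sites or only to successful captures.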

In [308]:
input_dir = "../data/maddy-nonprofit-list-oct-13"

In [309]:
df = pd.read_csv('../data/nonprofit-websites.csv')
df = df.drop_duplicates()

In [311]:
s_df = pd.read_csv(os.path.join(input_dir,'summary.csv'))
# When origin_domain is missing, fall back to the second-to-last segment of inspection_path.
s_df.origin_domain = s_df.apply(lambda x: x.inspection_path.split('/')[-2] if pd.isna(x.origin_domain) else x.origin_domain, axis=1)

u_df = df.rename(columns={'WEBSITE': 'origin_domain'})

# Join the results from the survey with the nonprofit list.
summary = pd.merge(s_df, u_df, how='left', on="origin_domain")
# Keep only scanned domains that matched a nonprofit record (rows with an EIN).
summary = summary[~summary['EIN'].isna()].copy()
# A capture counts as failed when no_data is missing or True.
summary["failed"] = summary.no_data.apply(lambda x: pd.isna(x) or x is True)
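
The join logic in this cell can be sketched on a few made-up rows. Everything below (the domains, the EIN values, and the `out/<domain>/inspection.json` path layout) is a hypothetical stand-in, not the real data; it only shows how the path fallback, the left join, and the `failed` flag interact.

```python
import pandas as pd

# Hypothetical scan results; the second row is missing origin_domain and no_data.
scan = pd.DataFrame({
    "origin_domain": ["a.org", None, "x.com"],
    "inspection_path": [
        "out/a.org/inspection.json",
        "out/b.org/inspection.json",
        "out/x.com/inspection.json",
    ],
    "no_data": [False, None, False],
})

# Fall back to the second-to-last path segment when origin_domain is missing.
scan["origin_domain"] = scan.apply(
    lambda r: r.inspection_path.split("/")[-2] if pd.isna(r.origin_domain) else r.origin_domain,
    axis=1,
)

# Hypothetical nonprofit list: x.com is not on it.
nonprofits = pd.DataFrame({"origin_domain": ["a.org", "b.org"], "EIN": [111, 222]})

merged = pd.merge(scan, nonprofits, how="left", on="origin_domain")
merged = merged[merged["EIN"].notna()].copy()  # drop scans with no nonprofit match
merged["failed"] = merged["no_data"].apply(lambda x: pd.isna(x) or x is True)

print(merged[["origin_domain", "failed"]])
```

In this sketch `x.com` is dropped because it has no EIN after the left join, and `b.org` is marked as failed because its `no_data` value is missing.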


In [312]:
s_df['origin_domain'].nunique()

28675

In [313]:
# Dataframe with successful captures.
df_data = summary[summary.no_data == False]
# Dataframe of failed captures.
failed = summary[summary['failed'] == True]
total_count = summary.origin_domain.nunique()
success_count = total_count - failed.origin_domain.nunique()
print(f"We attempted to scan {total_count} URLs and got parsable results for {success_count} of them, giving us a {round((success_count/total_count)*100, 2)}% success rate.\nAll percentages mentioned are from the collection of successful captures.")

We attempted to scan 24356 URLs and got parsable results for 23856 of them, giving us a 97.95% success rate.
All percentages mentioned are from the collection of successful captures.


In [315]:
canvas_fp = get_sites_with_canvas_fp(df_data, success_count)
session_recorder_pct = get_summary_col_pct(df_data, "has_session_recorders", success_count)
key_loggers_pct = get_summary_col_pct(df_data, "has_key_loggers", success_count)

print(f"{canvas_fp[0]}% ({canvas_fp[1]}) of the {success_count} sites use canvas fingerprinting.\n"
      f"{session_recorder_pct[0]}% ({session_recorder_pct[1]}) of the {success_count} sites use session recorders.\n"
      f"{key_loggers_pct[0]}% ({key_loggers_pct[1]}) of the {success_count} sites log keystrokes.")

6% (1379) of the 23856 sites use canvas fingerprinting.
2% (465) of the 23856 sites use session recorders.
2% (449) of the 23856 sites log keystrokes.


In [316]:
# Number of NTEE group F (mental health) sites flagged with session recorders.
# This result is not displayed because it is not the last expression in the cell.
df_data[df_data['has_session_recorders'] & df_data['NTEE_GROUP'].eq('F')]['origin_domain'].nunique()
sr = pd.read_csv(f'{input_dir}/session_recorders.csv')

# Across all successfully scanned sites: how many sites had session recorders
# with a known owner, and how many distinct owners were found.
print(sr[
    sr['origin_domain'].isin(df_data['origin_domain']) &
    (~sr['script_domain_owner'].isna())][['origin_domain','script_domain_owner']].drop_duplicates().nunique())

# The same counts, restricted to mental health sites (NTEE group F).
print(sr[
    sr['origin_domain'].isin(df_data[df_data['NTEE_GROUP'].eq('F')]['origin_domain']) &
    (~sr['script_domain_owner'].isna())][['origin_domain','script_domain_owner']].drop_duplicates().nunique())

origin_domain          439
script_domain_owner      7
dtype: int64
origin_domain          89
script_domain_owner     5
dtype: int64
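
To make the two-line outputs above easier to read, here is a minimal sketch of what `drop_duplicates().nunique()` returns on a two-column frame. The rows and the owner name `VendorA` are invented for illustration. Calling `nunique()` on a DataFrame returns one distinct count per column, so the first number is distinct sites and the second is distinct script owners.

```python
import pandas as pd

# Hypothetical session recorder rows; c.org has no known owner.
sr_toy = pd.DataFrame({
    "origin_domain": ["a.org", "a.org", "b.org", "c.org"],
    "script_domain_owner": ["VendorA", "VendorA", "VendorA", None],
})

# Keep rows with a known owner, then reduce to unique (site, owner) pairs.
pairs = sr_toy[sr_toy["script_domain_owner"].notna()][
    ["origin_domain", "script_domain_owner"]
].drop_duplicates()

# One distinct count per column.
counts = pairs.nunique()
print(counts)
```

Here two sites share one owner, so the counts are 2 for `origin_domain` and 1 for `script_domain_owner`; `c.org` is excluded because its owner is missing.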


In [317]:
df_data = summary[summary.no_data == False]
# Note: this filter also excludes sites with third-party canvas fingerprinters,
# which the message below does not mention.
no_tpt = df_data[(df_data.has_third_party_cookies == False) & (df_data.has_tracking_requests == False) & (df_data.has_third_party_canvas_fingerprinters == False)]
url_count = no_tpt.origin_domain.nunique()
f"{round(url_count/success_count*100,2)}% of sites with no third party cookies or tracking network requests"

'22.03% of sites with no third party cookies or tracking network requests'

## Tracking Technology

In [318]:
tpt_df = pd.read_csv(os.path.join(input_dir,'third_party_trackers.csv'))
tpt_df = tpt_df[tpt_df['origin_domain'].isin(u_df.origin_domain)].copy()

In [319]:
tpt = df_data[(df_data.has_third_party_cookies == True) | (df_data.has_tracking_requests == True)]
url_count = tpt.origin_domain.nunique()
f"{round(url_count/success_count*100,2)}% ({url_count}) of the top {success_count} sites had third party cookies or tracking network requests"

'86.05% (20528) of the top 23856 sites had third party cookies or tracking network requests'

In [320]:
median_tpt = tpt_df.groupby(["origin_domain"]).script_domain.nunique().median()
print(f"Median number of third party trackers {median_tpt}")

Median number of third party trackers 3.0
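
The median above is computed per site over distinct tracker domains. A minimal sketch with invented rows shows the mechanics. The domains are hypothetical, and the sketch assumes, as seems likely for this report, that a site with no trackers simply has no rows in the tracker table, which would mean such sites do not pull the median down.

```python
import pandas as pd

# Hypothetical tracker rows: a.org loads t1.com twice, which counts once.
trackers = pd.DataFrame({
    "origin_domain": ["a.org", "a.org", "a.org", "b.org", "c.org", "c.org"],
    "script_domain": ["t1.com", "t1.com", "t2.com", "t1.com", "t1.com", "t3.com"],
})

# Distinct tracker domains per site, then the median across sites.
per_site = trackers.groupby("origin_domain").script_domain.nunique()
print(per_site.to_dict())
print(per_site.median())
```

Under that assumption, the figure is best read as the median among sites that load at least one third-party tracker.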


## Google

In [321]:
google_df = tpt_df[tpt_df.fillna("")["script_domain_owner"].str.contains('Google')]
domains = ['google-analytics.com', 'doubleclick.net', 'googletagmanager.com',"googletagservices","googlesyndication.com","googleadservices","2mdn.net"]
gtrack_df = google_df[google_df.script_domain.isin(domains)]
pct_google = gtrack_df.origin_domain.nunique()/success_count
f"Percentage of sites with Google tracking technology {round(pct_google,2)*100}% ({gtrack_df.origin_domain.nunique()})"

'Percentage of sites with Google tracking technology 54.0% (12963)'
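
A note on the rounding pattern used in these percentage cells: rounding the fraction to two decimals and then multiplying by 100 can leave binary floating-point residue in the printed value, depending on the fraction. The sketch below uses a made-up fraction to compare that order of operations with scaling first and rounding afterwards.

```python
frac = 0.1429  # hypothetical share of sites

round_then_scale = round(frac, 2) * 100  # 0.14 * 100, subject to float residue
scale_then_round = round(frac * 100)     # whole-number percent, returned as an int

print(round_then_scale)
print(scale_then_round)
```

Scaling first and rounding last yields a clean integer for display; `round(frac * 100, 1)` would work the same way if one decimal place is wanted.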

In [322]:
domains = ['google-analytics.com']
ga_df = google_df[google_df.script_domain.isin(domains)]
pct_google = ga_df.origin_domain.nunique()/success_count
f"Percentage of sites using Google Analytics {round(pct_google,2)*100}% ({ga_df.origin_domain.nunique()})"

'Percentage of sites using Google Analytics 53.0% (12526)'

In [323]:
ga = google_df[(google_df.script_url.str.contains("stats.g.doubleclick")) & (google_df.script_url.str.contains("UA-"))]
pct_ga_ra = ga.origin_domain.nunique()/success_count
f"Percentage of sites using the Google Analytics 'Remarketing Audiences' feature: {round(pct_ga_ra,2)*100}%"

"Percentage of sites using the Google Analytics 'Remarketing Audiences' feature: 18.0%"

## AddThis

In [324]:
addthis_df = tpt_df[tpt_df.fillna("")["script_domain"].str.contains('addthis')]

In [325]:
pct_addthis = addthis_df.origin_domain.nunique()/success_count
f"Percentage of sites with AddThis scripts {round(pct_addthis,2)*100}% ({addthis_df.origin_domain.nunique()})"

'Percentage of sites with AddThis scripts 3.0% (663)'

## Facebook

In [326]:
fb_pixel = pd.read_csv(os.path.join(input_dir,'fb_pixel_events.csv'))
fb_pixel = fb_pixel[fb_pixel['origin_domain'].isin(u_df.origin_domain)].copy()
fb_pixel_pct = fb_pixel.origin_domain.nunique()/success_count
f"Percentage of the top {success_count} sites with the Facebook pixel is {round(fb_pixel_pct,2)*100}% ({fb_pixel.origin_domain.nunique()})"

'Percentage of the top 23856 sites with the Facebook pixel is 11.0% (2528)'

In [327]:
fb_df = tpt_df[tpt_df.fillna("")["script_domain_owner"].str.contains('Facebook')]
domains = ['facebook.com', 'facebook.net', 'atdmt.com']
fbtrack_df = fb_df[fb_df.script_domain.isin(domains)]
pct_fb = fbtrack_df.origin_domain.nunique()/success_count
f"Percentage of sites with Facebook tracking technology {round(pct_fb*100)}% ({fbtrack_df.origin_domain.nunique()})"

'Percentage of sites with Facebook tracking technology 14% (3409)'

## Third-party Cookies

In [328]:
cookie_df = pd.read_csv(os.path.join(input_dir,'cookies.csv'))
# Parenthesize the comparison: `&` binds more tightly than `==`.
tpc_df = cookie_df[(cookie_df.cookie_is_third_party == True) & cookie_df['origin_domain'].isin(u_df.origin_domain)]
tg = tpc_df.groupby("origin_domain")["cookie_domain"].nunique().sort_values(ascending=False)
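
When combining boolean masks in pandas, operator precedence matters: Python's `&` binds more tightly than `==`, so a comparison that is combined with another mask needs its own parentheses. The sketch below uses invented cookie rows and a hypothetical one-site allow list to show how the two spellings select different rows.

```python
import pandas as pd

# Hypothetical cookie rows; only a.org is on the site list.
cookies = pd.DataFrame({
    "origin_domain": ["a.org", "a.org", "x.com", "x.com"],
    "cookie_is_third_party": [True, False, True, False],
})
in_list = cookies["origin_domain"].isin(["a.org"])

# Parsed as: cookie_is_third_party == (True & in_list)
unparenthesized = cookies[cookies.cookie_is_third_party == True & in_list]

# Parsed as intended: (cookie_is_third_party == True) & in_list
parenthesized = cookies[(cookies.cookie_is_third_party == True) & in_list]

print(len(unparenthesized), len(parenthesized))
```

Without parentheses, the filter keeps every row where the third-party flag equals list membership, which in this toy data includes a first-party cookie from `x.com`, a site that is not on the list.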


In [329]:
print(f"The median site loaded cookies from {round(tg.median(),0)} third-party domains.")

The median site loaded cookies from 2.0 third-party domains.
