# Survey of the top 100,000 most popular websites

This notebook contains the analysis for the data collected from our survey of the most popular 100,000 websites as determined by the [Tranco List](https://tranco-list.eu/list/V3KN/full). 

This analysis is referenced in our 'Show Your Work' article [How We Built a Real-time Privacy Inspector](https://themarkup.org/blacklight/2020/09/22/how-we-built-a-real-time-privacy-inspector) and in the story [The High Privacy Cost of a “Free” Website](https://themarkup.org/blacklight/2020/09/22/the-high-privacy-cost-of-a-free-website).

For more information about the columns of the reports used in this analysis, please refer to the [Blacklight Reporter](https://github.com/the-markup/blacklight-reporter#reports) GitHub repository.

In [2]:
import os
import json
import pandas as pd

In [3]:
def get_summary_col_pct(df, col, total_count):
    num = set(df[df[col] == True].origin_domain)
    return (round(len(num)/total_count*100),len(num))

def get_sites_with_canvas_fp(df, total_count):
    # Use the `df` argument rather than the global `df_data`.
    has_fp = (df.has_third_party_canvas_fingerprinters == True) | (df.has_first_party_canvas_fingerprinters == True)
    num = set(df[has_fp].origin_domain)
    return (round(len(num)/total_count*100),len(num))
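
As a quick sanity check of the percentage helper above, here is a minimal sketch on a made-up frame. The domain names and flag values are hypothetical, not survey data. It illustrates that a domain appearing in several rows is counted once, because the helper builds a set of `origin_domain` values.

```python
import pandas as pd

def get_summary_col_pct(df, col, total_count):
    # Unique domains for which the flag column is True.
    num = set(df[df[col] == True].origin_domain)
    return (round(len(num) / total_count * 100), len(num))

# Hypothetical toy data: "a.example" appears twice and should be counted once.
toy = pd.DataFrame({
    "origin_domain": ["a.example", "a.example", "b.example", "c.example"],
    "has_key_loggers": [True, True, False, True],
})

toy_total = toy.origin_domain.nunique()  # 3 unique domains
pct, count = get_summary_col_pct(toy, "has_key_loggers", toy_total)
print(pct, count)
```

Note that the helper rounds the percentage to a whole number, so small differences between columns can disappear in the printed summary.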

In [4]:
input_dir = "data/tranco-V3Kn-100k-2020-09-07"

In [4]:
s_df = pd.read_csv(os.path.join(input_dir,'summary.csv'))
s_df.origin_domain = s_df.apply( lambda x: x.inspection_path.split('/')[-2] if pd.isna(x.origin_domain)  else x.origin_domain, axis=1) 

# `delim_whitespace` is deprecated in recent pandas releases; `sep=r'\s+'` splits on whitespace the same way.
u_df = pd.read_table(os.path.join('data','tranco-V3Kn-100k.txt'), sep=r'\s+')
u_df['rank'] = u_df.index
u_df["origin_domain"] = u_df.urls.apply(lambda x: x.replace('http://',""))  
del u_df["urls"]

# Join the results from the survey with the ranked list.
summary = pd.merge(u_df, s_df, how='left', on='origin_domain')
# A capture failed if the site has no summary row (NaN after the left join) or reported no data.
summary["failed"] = summary.no_data.apply(lambda x: pd.isna(x) or x is True)
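
The join and the `failed` flag can be illustrated on a tiny made-up example. The sketch below assumes a ranked list of three hypothetical domains and a summary that covers only two of them. The names `ranked`, `scanned`, and `merged` are illustrative stand-ins, not objects from this notebook. The domain without a summary row gets `NaN` for `no_data` after the left merge and is therefore flagged as failed.

```python
import pandas as pd

# Hypothetical ranked list (stands in for u_df).
ranked = pd.DataFrame({
    "rank": [0, 1, 2],
    "origin_domain": ["a.example", "b.example", "c.example"],
})

# Hypothetical survey summary (stands in for s_df); "c.example" was never captured.
scanned = pd.DataFrame({
    "origin_domain": ["a.example", "b.example"],
    "no_data": [False, True],
})

# The left join keeps every ranked domain, even those with no summary row.
merged = pd.merge(ranked, scanned, how="left", on="origin_domain")

# Failed = no summary row at all (NaN) or a row that reported no data.
merged["failed"] = merged.no_data.apply(lambda x: bool(pd.isna(x) or x == True))
print(merged[["origin_domain", "failed"]])
```

Only `a.example` counts as a successful capture here, which mirrors how the success rate is derived from the merged table.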


In [5]:
# Dataframe with successful captures.
df_data = summary[summary.no_data == False]
# Dataframe of failed captures.
failed = summary[summary['failed'] == True]
total_count = summary.origin_domain.nunique()
success_count = total_count - failed.origin_domain.nunique()
print(f"We attempted to scan {total_count} urls and got parsable results for {success_count} of them giving us an {round((success_count/total_count)*100, 2)}% success rate.\nAll percentages mentioned are from the collection of successful captures.")

We attempted to scan 100000 urls and got parsable results for 81617 of them giving us an 81.62% success rate.
All percentages mentioned are from the collection of successful captures.


In [6]:
canvas_fp = get_sites_with_canvas_fp(df_data, success_count)
session_recorder_pct = get_summary_col_pct(df_data, "has_session_recorders", success_count)
key_loggers_pct = get_summary_col_pct(df_data, "has_key_loggers", success_count)

print(f"{canvas_fp[0]}% ({canvas_fp[1]}) of the top {success_count} sites use canvas fingerprinting.\n"
f"{session_recorder_pct[0]}% ({session_recorder_pct[1]}) of the top {success_count} sites use session recorders.\n"
f"{key_loggers_pct[0]}% ({key_loggers_pct[1]}) of the top {success_count} sites log keystrokes")

6% (5214) of the top 81617 sites use canvas fingerprinting.
15% (12457) of the top 81617 sites use session recorders.
4% (3534) of the top 81617 sites log keystrokes


In [7]:
df_data = summary[summary.no_data == False]
no_tpt = df_data[(df_data.has_third_party_cookies == False) & (df_data.has_tracking_requests == False)& (df_data.has_third_party_canvas_fingerprinters == False)]
url_count = no_tpt.origin_domain.nunique()
f"{round(url_count/success_count*100,2)}% of sites had no third-party cookies, tracking network requests, or third-party canvas fingerprinters"

'12.7% of sites had no third-party cookies, tracking network requests, or third-party canvas fingerprinters'

## Tracking Technology

In [8]:
tpt_df = pd.read_csv(os.path.join(input_dir,'third_party_trackers.csv'))

In [10]:
tpt = df_data[(df_data.has_third_party_cookies == True) | (df_data.has_tracking_requests == True)]
url_count = tpt.origin_domain.nunique()
f"{round(url_count/success_count*100,2)}% ({url_count}) of the top {success_count} sites had third party cookies or tracking network requests"

'87.27% (71225) of the top 81617 sites had third party cookies or tracking network requests'

In [11]:
median_tpt = tpt_df.groupby(["origin_domain"]).script_domain.nunique().median()
print(f"Median number of third party trackers {median_tpt}")

Median number of third party trackers 7.0
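
To make the median computation concrete, here is a minimal sketch on hypothetical tracker rows (the domains are made up). It shows that `nunique` counts each tracker domain once per site, and that the median is taken over sites that have at least one row in the tracker table; a site with no rows would not enter the calculation.

```python
import pandas as pd

# Hypothetical tracker rows: one row per (site, script) observation.
toy_tpt = pd.DataFrame({
    "origin_domain": ["a.example", "a.example", "a.example",
                      "b.example", "c.example", "c.example"],
    "script_domain": ["t1.example", "t1.example", "t2.example",
                      "t1.example", "t2.example", "t3.example"],
})

# Distinct tracker domains per site: a.example -> 2, b.example -> 1, c.example -> 2.
per_site = toy_tpt.groupby("origin_domain").script_domain.nunique()
toy_median = per_site.median()
print(per_site.to_dict(), toy_median)
```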


## Google

In [12]:
google_df = tpt_df[tpt_df.fillna("")["script_domain_owner"].str.contains('Google') ]
domains = ['google-analytics.com', 'doubleclick.net', 'googletagmanager.com',"googletagservices","googlesyndication.com","googleadservices","2mdn.net"]
gtrack_df = google_df[google_df.script_domain.isin(domains)]
pct_google = gtrack_df.origin_domain.nunique()/success_count
f"Percentage of sites with Google tracking technology {round(pct_google,2)*100}%"

'Percentage of sites with Google tracking technology 74.0%'
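
One detail worth keeping in mind when reading the filter above: `Series.isin` matches whole values, so a list entry without a top-level domain only matches if the column holds exactly that string. The sketch below uses made-up domains to contrast exact matching with substring matching. How the survey's `script_domain` column actually stores its values is not shown here, so this is only an illustration of the matching semantics.

```python
import re
import pandas as pd

# Hypothetical script domains, not survey data.
toy_domains = pd.Series(["googletagservices.com", "doubleclick.net", "example.org"])
wanted = ["googletagservices", "doubleclick.net"]

# Exact match: "googletagservices" does not equal "googletagservices.com".
exact = toy_domains.isin(wanted)

# Substring match: escape each entry so "." is treated literally.
pattern = "|".join(re.escape(w) for w in wanted)
substring = toy_domains.str.contains(pattern, regex=True)

print(exact.tolist(), substring.tolist())
```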

In [13]:
domains = ['google-analytics.com']
ga_df = google_df[google_df.script_domain.isin(domains)]
pct_google = ga_df.origin_domain.nunique()/success_count
f"Percentage of sites using Google Analytics {round(pct_google,2)*100}% ({ga_df.origin_domain.nunique()})"

'Percentage of sites using Google Analytics 69.0% (56464)'

In [14]:
ga = google_df[(google_df.script_url.str.contains("stats.g.doubleclick")) & (google_df.script_url.str.contains("UA-"))]
pct_ga_ra = ga.origin_domain.nunique()/success_count
f"Percentage of sites that use the Google Analytics 'Remarketing Audiences' feature: {round(pct_ga_ra,2)*100}%"

"Percentage of sites that use the Google Analytics 'Remarketing Audiences' feature: 50.0%"

## AddThis

In [15]:
addthis_df = tpt_df[tpt_df.fillna("")["script_domain"].str.contains('addthis') ]

In [16]:
pct_addthis = addthis_df.origin_domain.nunique()/success_count
f"Percentage of sites with AddThis scripts {round(pct_addthis,2)*100}% ({addthis_df.origin_domain.nunique()})"

'Percentage of sites with AddThis scripts 5.0% (4113)'

## Facebook

In [17]:
fb_pixel = pd.read_csv(os.path.join(input_dir,'fb_pixel_events.csv'))
fb_pixel_pct = fb_pixel.origin_domain.nunique()/success_count
f"Percentage of the top {success_count} sites with the Facebook pixel {round(fb_pixel_pct,2)*100} %"

'Percentage of the top 81617 sites with the Facebook pixel 30.0 %'

In [18]:
fb_df = tpt_df[tpt_df.fillna("")["script_domain_owner"].str.contains('Facebook') ]
domains = ['facebook.com', 'facebook.net', 'atdmt.com']
fbtrack_df = fb_df[fb_df.script_domain.isin(domains)]
pct_fb = fbtrack_df.origin_domain.nunique()/success_count
f"Percentage of sites with Facebook tracking technology {round(pct_fb,2)*100}% ({fbtrack_df.origin_domain.nunique()})"

'Percentage of sites with Facebook tracking technology 33.0% (26651)'

## Third-party Cookies

In [5]:
cookie_df = pd.read_csv(os.path.join(input_dir,'cookies.csv'))
tpc_df =  cookie_df[cookie_df.cookie_is_third_party == True]
tg = tpc_df.groupby("origin_domain")["cookie_domain"].nunique().sort_values(ascending = False)


In [6]:
print(f"The median site loaded cookies from {round(tg.median(),0)} third-party domains.")

The median site loaded cookies from 3.0 third-party domains.
