# Analysis: 

My analysis checks which websites and scripts in this dataset read the user's location (geolocation), language preferences and country code, and what percentage of the dataset they represent. Websites typically read these values to serve content customized to the user's preferences (e.g. location, language).

### Dataset used: Sample 10 percent
 - [sample 10 percent](https://public-data.telemetry.mozilla.org/bigcrawl/sample_10percent.parquet.tar.bz2) - 3.7GB download / 7.4GB on disk

## Step 1. Importing all the required libraries

In [1]:
#importing dask
import dask.dataframe as dd

#importing pandas
import pandas as pd

#importing Dask.distributed for distributed computing
from dask.distributed import Client, progress

#importing os
import os

#importing tldextract
import tldextract


In [2]:
# Extract domain function using tldextract
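def extract_domain(url):
    """Use tldextract to return the domain (including any subdomain) from a url"""
    try:
        extracted = tldextract.extract(url)
        parts = (extracted.subdomain, extracted.domain, extracted.suffix)
        # join only the non-empty parts so a url without a subdomain gets no leading dot
        return '.'.join(part for part in parts if part)
    except Exception:
        return 'ERROR'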

In [3]:
# setting up the Dask.distributed client
client=Client()
client

Client: Scheduler: tcp://127.0.0.1:59049, Dashboard: http://127.0.0.1:8787/status
Cluster: Workers: 4, Cores: 8, Memory: 17.12 GB


## Step 2. Loading the required dataset

In [4]:
# data directory where the data set is stored
# raw string so the backslashes in the Windows path are not read as escape sequences
DATA_DIR = r'D:\Outreachy\Datasets_Compressed\Datasets'

dataset = os.path.join(DATA_DIR, '10percent/')

In [5]:
# loading the dataset and creating the dask dataframe using dd.read_parquet
df = dd.read_parquet(dataset, engine='pyarrow', columns = ['location', 'script_url', 'operation', 'symbol', 'value'], index = False)

#checking what columns the dataset has
df.columns

Index(['location', 'script_url', 'operation', 'symbol', 'value'], dtype='object')

In [6]:
# memory usage statistics
memory_usage = df.memory_usage(deep=True).compute()
memory_usage.sum() / 1e9

30.625854378

## Step 3. Previewing the dataset

In [7]:
# First few rows (head)
df.head()

Unnamed: 0,location,script_url,operation,symbol,value
0,https://vk.com/widget_comments.php?app=2297596...,https://vk.com/js/api/xdm.js?1449919642,get,window.name,fXDcab74
1,https://vk.com/widget_comments.php?app=2297596...,https://vk.com/js/api/xdm.js?1449919642,get,window.name,fXDcab74
2,https://vk.com/widget_comments.php?app=2297596...,https://vk.com/js/al/aes_light.js?592436914,get,window.navigator.userAgent,Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko...
3,https://pos.baidu.com/s?hei=70&wid=670&di=u313...,https://cpro.baidustatic.com/cpro/ui/noexpire/...,get,window.navigator.userAgent,Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko...
4,http://serienjunkies.org/smilf/smilf-season-1-...,https://apis.google.com/js/plusone.js?_=151338...,get,window.document.cookie,_ga=GA1.2.1529583939.1513387469; _gid=GA1.2.17...


In [8]:
# dataframe structure
df

,location,script_url,operation,symbol,value
npartitions=299,,,,,
,object,object,int8,int16,object
,...,...,...,...,...
...,...,...,...,...,...
,...,...,...,...,...
,...,...,...,...,...


In [9]:
# dataframe website urls / location
total_location_unique_count = df.location.nunique().compute()
total_script_url_unique_count = df.script_url.nunique().compute()

# printing no. of entries
print('Total no. unique websites in this dataset:', total_location_unique_count)
print('Total no. unique script_urls in this dataset:', total_script_url_unique_count)

Total no. unique websites in this dataset: 205949
Total no. unique script_urls in this dataset: 166862


## Step 4. Language preferences
The `NavigatorLanguage.languages` read-only property returns an array of DOMStrings representing the user's preferred languages. The language is described using BCP 47 language tags. In the returned array they are ordered by preference with the most preferred language first.

The value of `navigator.language` is the first element of the returned array.

When the user changes their preferred languages, a `languagechange` event is fired on the Window object.

- Let's see which websites/scripts are checking the user's language preference (`window.navigator.language`).
- The `value` column tells us which language the script read from the browser.
- The `operation` column tells us the access type, so we can verify whether the `navigator.language` accesses are `get` operations.
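
The `value` column holds BCP 47 tags such as `en-US`, which combine a language with an optional region (country code). The following is a minimal sketch of how such a tag could be split, not part of the original notebook: the function name `split_language_tag` is hypothetical, and the parsing is deliberately simplified (it ignores grandfathered tags and stops at extension or private-use subtags).

```python
import re

def split_language_tag(tag):
    """Return (language, region) from a simple BCP 47 tag such as 'en-US'.

    region is None when the tag carries no region subtag.
    """
    subtags = tag.split('-')
    language = subtags[0].lower()
    region = None
    for subtag in subtags[1:]:
        # a single-character subtag starts an extension or private-use section
        if len(subtag) == 1:
            break
        # a region subtag is two letters (ISO 3166-1) or three digits (UN M.49)
        if re.fullmatch(r'[A-Za-z]{2}|[0-9]{3}', subtag):
            region = subtag.upper()
            break
    return language, region

print(split_language_tag('en-US'))       # ('en', 'US')
print(split_language_tag('zh-Hant-TW'))  # ('zh', 'TW')
print(split_language_tag('de'))          # ('de', None)
```

Applied to the `value` column, a helper like this would separate the language from the country code that a script reads.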

In [10]:
language_pref_df = df[df.symbol == 'window.navigator.language']
language_pref_df = language_pref_df[['location', 'script_url', 'operation', 'value']].drop_duplicates().persist()
progress(language_pref_df, notebook=False)

[########################################] | 100% Completed | 26.8s

In [11]:
# computing the language_pref_df dataframe
language_pref_df = language_pref_df.compute()

# previewing the head of language_pref_df dataframe
language_pref_df.head()

Unnamed: 0,location,script_url,operation,value
46,https://www.canada.ca/en/services.html,https://www.google-analytics.com/analytics.js,get,en-US
253,https://maniform.world.tmall.com/category-1282...,https://g.alicdn.com/secdev/sufei_data/3.2.2/i...,get,en-US
484,https://www.coches.net/fiat/segunda-mano/,https://jssdk.pulse.schibsted.com/autoTrackerC...,get,en-US
507,https://www.coches.net/fiat/segunda-mano/,https://www.coches.net/ztkieflaaxcvaiwh121837.js,get,en-US
573,https://www.coches.net/fiat/segunda-mano/,https://script.hotjar.com/modules-526d80f8c014...,get,en-US


- The value returned here is `en-US`, which means the browser's preferred language is English (US).
- The `operation` column shows that these values are being read with `get`.

*Note: `get` in the `operation` column means that the script read the value of the property. It is a JavaScript property access, not an `HTTP GET` request.*


In [12]:
#len(language_pref_df.value.str.contains('en-US'))
#en_US_df = language_pref_df[language_pref_df.value == 'en-US']
#en_US_df.compute()

# total websites checking for language preference and country code
lang_pref_location_count = language_pref_df.location.nunique()
print('Total no. of websites checking for users browser language preference:', lang_pref_location_count)
print('Total % of websites checking for users browser language preference:', round(lang_pref_location_count*100/total_location_unique_count, 2),'% \n')

# total scripts checking for language preference and country code
lang_pref_script_url_count = language_pref_df.script_url.nunique()
print('Total no. of scripts making the call for users browser language preference:', lang_pref_script_url_count)
print('Total % of scripts making the call for users browser language preference:', round(lang_pref_script_url_count*100/total_script_url_unique_count, 2),'%')

Total no. of websites checking for users browser language preference: 46482
Total % of websites checking for users browser language preference: 22.57 % 

Total no. of scripts making the call for users browser language preference: 7700
Total % of scripts making the call for users browser language preference: 4.61 %


- We can see that `46482` (`22.57`%) of the websites are checking the user's language preference.
- We can see that `7700` (`4.61`%) of the scripts are making calls to check the user's language preference.
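
As a quick check of the arithmetic, the percentages can be recomputed directly from the counts printed above. This is a minimal sketch; the helper name `share` is hypothetical and simply wraps the same `round(part*100/total, 2)` expression used in the cell.

```python
def share(part, total):
    """Percentage of `part` in `total`, rounded to two decimals."""
    return round(part * 100 / total, 2)

# counts taken from the notebook output above
print(share(46482, 205949))  # websites reading the language preference -> 22.57
print(share(7700, 166862))   # scripts reading the language preference  -> 4.61
```

Note that the denominators are the numbers of unique `location` and `script_url` values, so the percentages are relative to distinct URLs in this sample.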

#### Unique script domains:

- Let's now check which domains serve the scripts that read the user's browser language and country code.

In [13]:
language_pref_df.script_url.value_counts().head(10)

https://www.google-analytics.com/analytics.js                            17246
http://www.google-analytics.com/analytics.js                              5661
https://d31qbv1cthcecs.cloudfront.net/atrk.js                             2288
http://www.google-analytics.com/ga.js                                     2155
https://ssl.google-analytics.com/ga.js                                    1909
https://mc.yandex.ru/metrika/watch.js                                     1906
https://bat.bing.com/bat.js                                               1892
https://securepubads.g.doubleclick.net/gpt/pubads_impl_170.js             1881
https://script.hotjar.com/modules-526d80f8c01454f84b75838f21c8706e.js     1467
http://mc.yandex.ru/metrika/watch.js                                      1298
Name: script_url, dtype: int64

- We can see that many of these scripts come from third-party analytics and CDN providers and run on many different websites.
- We can now extract the domain (including the subdomain) from each script URL, dropping the path and query string.
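
To illustrate what "domain with subdomain, query string removed" means, here is a minimal standard-library sketch. It is an approximation that takes the host from `urllib.parse.urlsplit` rather than splitting on the public suffix list as `tldextract` does, and the name `script_host` is hypothetical.

```python
from urllib.parse import urlsplit

def script_host(url):
    """Return the lower-cased host of a url, dropping scheme, port, path and query string."""
    try:
        # hostname is None when the url has no network location
        return urlsplit(url).hostname or 'ERROR'
    except ValueError:
        return 'ERROR'

print(script_host('https://www.google-analytics.com/analytics.js'))
print(script_host('https://cdn4.forter.com/script.js?sn=d379f257f86d'))
```

Because the notebook keeps the subdomain, the host is usually the same string that `extract_domain()` returns; the two differ when only the registered domain is wanted, which needs the public suffix list.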

In [14]:
# Applying extract_domain() to script_urls and saving the domains in a new column inside language_pref_df
language_pref_df['scripts_domain'] = language_pref_df.script_url.apply(extract_domain)
language_pref_df.scripts_domain.value_counts().head(10)

www.google-analytics.com          25310
hm.baidu.com                       4712
mc.yandex.ru                       3306
bat.bing.com                       2554
d31qbv1cthcecs.cloudfront.net      2301
securepubads.g.doubleclick.net     2112
ssl.google-analytics.com           2034
pagead2.googlesyndication.com      1573
script.hotjar.com                  1467
stats.g.doubleclick.net             902
Name: scripts_domain, dtype: int64

In [15]:
# Calculating no. of unique domains
language_pref_unique_domains = language_pref_df.scripts_domain.nunique()
language_pref_unique_domains

2271

## Step 5. Geolocation capturing
The Geolocation interface represents an object able to programmatically obtain the position of the device. It gives Web content access to the location of the device. The API itself is agnostic of the underlying location information sources. Common sources of location information include Global Positioning System (GPS) and location inferred from network signals such as IP address, RFID, WiFi and Bluetooth MAC addresses, and GSM/CDMA cell IDs, as well as user input. No guarantee is given that the API returns the device's actual location.

The API is designed to enable both "one-shot" position requests and repeated position updates, as well as the ability to explicitly query the cached positions. This allows a Web site or app to offer customized results based on the user's location.

An object with this interface is obtained using the `navigator.geolocation` property implemented by the Navigator object.


- Let's see which websites/scripts are checking the user's geolocation (`window.navigator.geolocation`).

In [16]:
geolocation_df = df[df.symbol == 'window.navigator.geolocation']
geolocation_df = geolocation_df[['location', 'script_url','value']].drop_duplicates().persist()
progress(geolocation_df, notebook=False)

[########################################] | 100% Completed | 23.8s

In [17]:
# computing the geolocation_df dataframe
geolocation_df = geolocation_df.compute()

# previewing the head of geolocation_df dataframe
geolocation_df.head()

Unnamed: 0,location,script_url,value
4332,https://www.citilink.ru/catalog/for_gamers/igr...,https://static.citilink.ru/build/js/commons.bu...,{}
5338,https://www.bunnings.com.au/our-range/brands/a...,https://www.bunnings.com.au/assets/styleguide/...,{}
6160,http://www.elcorteingles.es/moda/accesorios/es...,http://www.elcorteingles.es/akam/10/240f2be0,{}
6277,http://www.elcorteingles.es/moda/accesorios/es...,http://analytics-static.ugc.bazaarvoice.com/pr...,{}
12036,https://gzhls.at/b/wm/C/hdaustria_wonder_set5_...,https://gzhls.at/b/wm/C/hdaustria_wonder_set5_...,{}


- We can see here that most `value` fields are an empty object (`{}`), which suggests that the crawler recorded the access to `navigator.geolocation` but did not capture the calls that were made on it.
- What is strange is that at index `36401` the crawler picked up a record in which a `getCurrentPosition()` function appears. It is shown below.

In [18]:
# this index location has the function in the value field, unlike the other rows
geolocation_df.loc[36401]

location      http://www.cracked.com/article_25101_7-stories...
script_url    http://www.cracked.com/article_25101_7-stories...
value                         {"getCurrentPosition":"FUNCTION"}
Name: 36401, dtype: object

- Compared to most of the `value` fields, where the braces are empty, this one, `{"getCurrentPosition":"FUNCTION"}`, is not empty and contains one function, `getCurrentPosition`.
- More information: [https://www.w3.org/TR/geolocation-API/#geolocation_interface](https://www.w3.org/TR/geolocation-API/#geolocation_interface)
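
Since the `value` field is a JSON string, the rows with a non-empty object can be separated from the `{}` rows by parsing it. The following is a minimal sketch under that assumption; the function name `accessed_members` is hypothetical and not part of the original notebook.

```python
import json

def accessed_members(value):
    """Return the sorted member names recorded in a serialized `value` string."""
    try:
        parsed = json.loads(value)
    except (TypeError, ValueError):
        # not a string, or not valid JSON
        return []
    return sorted(parsed) if isinstance(parsed, dict) else []

print(accessed_members('{}'))                                 # []
print(accessed_members('{"getCurrentPosition":"FUNCTION"}'))  # ['getCurrentPosition']
```

Applying this to `geolocation_df.value` would list every row where the crawler recorded a member such as `getCurrentPosition`, instead of looking up a single index by hand.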


- Let's calculate the total no. of unique websites and scripts checking the user's location.

In [19]:
# total websites using geolocation api
geolocation_location_count = geolocation_df.location.nunique()
print('Total no. of websites checking for users location:', geolocation_location_count)
print('Total % of websites checking for users location:', round(geolocation_location_count*100/total_location_unique_count, 2),'% \n')

# total scripts using geolocation api
geolocation_script_url_count = geolocation_df.script_url.nunique()
print('Total no. of scripts making the call to check for users location:', geolocation_script_url_count)
print('Total % of scripts making the call to check for users location:', round(geolocation_script_url_count*100/total_script_url_unique_count, 2),'%')

Total no. of websites checking for users location: 2216
Total % of websites checking for users location: 1.08 % 

Total no. of scripts making the call to check for users location: 1359
Total % of scripts making the call to check for users location: 0.81 %


- We can see that `2216` (`1.08`%) of the websites are checking the user's location.
- We can see that `1359` (`0.81`%) of the scripts are making calls to check the user's current location.

#### Unique script domains:

- Let's now check which domains serve the scripts that look for the user's geolocation.

In [20]:
geolocation_df.script_url.value_counts().head(10)

https://analytics-static.ugc.bazaarvoice.com/prod/static/3/bv-analytics.js                           123
http://analytics-static.ugc.bazaarvoice.com/prod/static/3/bv-analytics.js                            123
http://www.castorama.fr/store/js/main.js?v=16122017                                                   51
https://static.citilink.ru/build/js/commons.bundle.js?1513352112                                      33
https://www.bunnings.com.au/assets/styleguide/v1.0.884/bunnings-main-site/assets/js/siteJs.min.js     30
https://px.wayfair.com/px/client/main.min.js                                                          29
https://cdn4.forter.com/script.js?sn=d379f257f86d                                                     29
https://cdn.perfdrive.com/aperture/aperture.js                                                        21
http://stat.sputnik.ru/cnt.js                                                                         20
http://js.3conline.com/min2/temp/v2/plugin-locate,plugi

- Here the list mixes third-party providers (e.g. the Bazaarvoice analytics script) with scripts hosted on the websites' own domains.
- We can now extract the domain (including the subdomain) from each script URL, dropping the path and query string.

In [21]:
# Applying extract_domain() to script_urls and saving the domains in a new column inside geolocation_df
geolocation_df['scripts_domain'] = geolocation_df.script_url.apply(extract_domain)
geolocation_df.scripts_domain.value_counts().head(10)

analytics-static.ugc.bazaarvoice.com    246
www.coupang.com                         169
live.sekindo.com                        164
www.elcorteingles.es                    112
d26opx5dl8t69i.cloudfront.net            86
www.castorama.fr                         59
www.adidas.ca                            51
www.qvc.com                              51
pixel.yabidos.com                        49
www.johnlewis.com                        39
Name: scripts_domain, dtype: int64

In [22]:
# Calculating no. of unique domains
geolocation_unique_domains = geolocation_df.scripts_domain.nunique()
geolocation_unique_domains

400

## Inference:

### Overall:
+ Out of the total of __205949__ websites/locations in this dataset, __46482__ (`22.57`%) were found to be checking the preferred language of the user, usually the language of the browser UI. The corresponding locations and scripts can be found in the `language_pref_df` dataframe.

+ Out of the total of __205949__ websites/locations in this dataset, __2216__ (`1.08`%) were found to be checking the user's location using the geolocation API. The corresponding locations and scripts can be found in the `geolocation_df` dataframe.

### Scripts:

- Out of the total of __166862__ scripts in this dataset, __7700__ (`4.61`%) were found to be making calls to check the preferred language of the user, usually the language of the browser UI. These scripts can be found in the `language_pref_df.script_url` column.

- Out of the total of __166862__ scripts in this dataset, __1359__ (`0.81`%) were found to be making calls to check the user's location using the geolocation API. These scripts can be found in the `geolocation_df.script_url` column.

### Domains:

- For _geolocation tracking_: the scripts were served from `400` unique domains (`geolocation_unique_domains`).
- For _language & country code tracking_: the scripts were served from `2271` unique domains (`language_pref_unique_domains`).


These figures come from the 10 percent sample; running the analysis on the full dataset may yield more accurate results.
  
