# Attempt to replicate SMALL parts of the StatsNL paper

[2016 paper](https://www.cbs.nl/-/media/_pdf/2016/40/measuring-the-internet-economy.pdf)

[2020 paper](https://www.cbs.nl/en-gb/background/2020/19/measuring-the-internet-economy-with-big-data)

This notebook is mostly informed by the original paper (2016), which is simpler and spends more time explaining the techniques for assigning websites to categories. The later paper (2020) spends more time discussing how to link the businesses/websites to pre-existing StatsNL datasets and assessing trends from 2016 to 2018.

Neither paper provides much detail about how the private firm DataProvider collected and collated information about the websites, which was then provided to StatsNL as a large table with extensive information about each website. However, the fields of this dataset are explained, which can provide inspiration for the information that we should *attempt* to scrape from NZ websites (i.e. an idealistic, probably unattainable goal if done using only Stats NZ expertise/personnel).

## Overview
Categories of businesses:
- A: Businesses without a website
- B1: Passive online presence
- B2: Active online presence
- C: Online stores
- D: Online services
- E: Internet related ICT

Categories C, D, & E constitute the "core of the internet economy". Section 5.1 explains the techniques used to allocate websites to categories.

Websites are assigned to Category C by detecting shopping carts and online payment methods. DataProvider uses a "machine-learning algorithm" to estimate the probability that a website is an "e-commerce website", and StatsNL uses an 85% cut-off point as its decision rule. StatsNL then used an online list of the most popular online stores ([jouwaanbieding.nl](jouwaanbieding.nl)) to make manual adjustments. Finally, they used the Call To Action dataset to "further refine the category".
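
A minimal sketch of the decision rule described above, not the StatsNL implementation. It assumes a hypothetical table with an `ecommerce_prob` column standing in for the DataProvider probability, and a hypothetical hand-maintained set `known_stores` standing in for the list of popular online stores. The function name `assign_category_c`, the example domains, and the choice of `>=` at the cut-off are all assumptions; the papers do not publish the model or its interface, and the Call To Action refinement is not covered here.

```python
import pandas as pd

ECOMMERCE_CUTOFF = 0.85  # cut-off point reported in the 2016 paper


def assign_category_c(sites, known_stores, cutoff=ECOMMERCE_CUTOFF):
    """Flag sites as Category C by probability cut-off, then apply manual overrides."""
    above_cutoff = sites["ecommerce_prob"] >= cutoff      # assumed inclusive
    manual_override = sites["Netloc"].isin(known_stores)  # known online stores
    return above_cutoff | manual_override


# Made-up example data; column names and domains are illustrative only
example = pd.DataFrame({
    "Netloc": ["shop-a.example", "blog-b.example", "store-c.example"],
    "ecommerce_prob": [0.93, 0.10, 0.60],
})
known_stores = {"store-c.example"}

example["Is_category_C"] = assign_category_c(example, known_stores)
print(example)
```

In this sketch the third site falls below the cut-off but is still flagged because it appears in the manual list, which mirrors the manual adjustment step.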

Websites are assigned to Categories D & E using keyword detection. The detected keywords are also used to assign websites to subcategories and topics within Categories D & E, which are detailed in 2016 Appendix B. The process has three steps:
1. Keyword selection
2. Refinement using other fields in the DataProvider dataset and the linked General Business Register (GBR)
3. Manual adjustments - the 100 most important websites for every category were manually inspected, which informed both manual reallocations and refinement of earlier steps (e.g. improving linking between DataProvider dataset and GBR)

**This keyword selection process should be testable using the current commoncrawl text output. The techniques used to assign websites to Category C may require more expertise than Stats NZ currently has, and at the very least would require re-processing the commoncrawl dataset to output the raw HTML from each website/webpage (so cannot be tried yet).**

In [1]:
import json
from collections import Counter
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
stops = TfidfVectorizer(stop_words='english').get_stop_words()

In [3]:
sites = pd.read_csv("site_agg_output/clusters10-seed777/clustered_websites.csv", index_col=[0])
sites['Counts'] = [json.loads(counts_str.replace("'", '"')) for counts_str in sites['Counts']]
sites['Filt_counts'] = [
    {word: count for word, count in countsi.items() if word not in stops}
    for countsi in sites['Counts']
]
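
The quote-swapping before `json.loads` works as long as no word in `Counts` contains a quote character. A hedged alternative, assuming the column holds the string form of a Python dict, is `ast.literal_eval` from the standard library. The helper name `parse_counts` is hypothetical.

```python
import ast


def parse_counts(counts_str):
    """Parse the string form of a Python dict, e.g. "{'is': 1, 'this': 1}"."""
    parsed = ast.literal_eval(counts_str)
    if not isinstance(parsed, dict):
        raise ValueError(f"Expected a dict literal, got {type(parsed).__name__}")
    return parsed


# A key containing an apostrophe is written by Python with double quotes,
# which would break the quote-swapping approach but not literal_eval.
example_counts_str = str({"don't": 2, "domain": 1})
print(parse_counts(example_counts_str))
```

If this were adopted, the list comprehension over `sites['Counts']` could call `parse_counts(counts_str)` in place of the `json.loads(...)` expression.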

In [4]:
sites.head()

Unnamed: 0,Netloc,Text,Counts,KM_cluster,Filt_counts
0,0064.co.nz,Is this your domain name?\n\nActivate your dom...,"{'is': 1, 'this': 1, 'your': 3, 'domain': 2, '...",5,"{'domain': 2, 'activate': 1, 'instantly': 1, '..."
1,01builders.co.nz,Welcome to WordPress. This is your first post....,"{'welcome': 1, 'to': 1, 'wordpress': 1, 'this'...",9,"{'welcome': 1, 'wordpress': 1, 'post': 1, 'edi..."
2,08004ducting.co.nz,This document uses a frameset.\n\nClick Here t...,"{'this': 1, 'document': 1, 'uses': 1, 'framese...",9,"{'document': 1, 'uses': 1, 'frameset': 1, 'cli..."
3,0800easymovers.co.nz,Thanks For Subscribe! Please wait it will be r...,"{'thanks': 4, 'for': 8, 'subscribe': 1, 'pleas...",9,"{'thanks': 4, 'subscribe': 1, 'wait': 1, 'redi..."
4,0800nz.co.nz,Hit enter to search or ESC to close | Hit ente...,"{'hit': 2, 'enter': 2, 'to': 5, 'search': 2, '...",9,"{'hit': 2, 'enter': 2, 'search': 2, 'esc': 2, ..."


In [5]:
category_mappings = {
    'D': {
        'Leisure': ['Hotels', 'Flights', 'Holidays', 'Food'], 
        'News and entertainment': ['News', 'Blogs', 'Vlogs', 'Games', 'Videos, music', 'Books', 'e-learning', 'Gambling', 'Adult'], 
        'Business': ['Advertising', 'Finance', 'Consultancy', 'Jobs'], 
        'Retail': ['Housing', 'Price comparison', 'Tickets', 'Auctions', 'Online trade'], 
        'General services': ['Dating', 'Visualisations', 'Transport', 'Online payment']
    },
    'E': {
        'Hosting and cloud': ['Webhosting', 'Cloud services, Datacentres', 'Website design, developing', 'App design, developing'], 
        'Software': ['Software products and services'],
        'Marketing and consultancy': ['Internet marketing', 'Internet consultancy'], 
        'Infrastructure and security': ['Firewalls', 'Datamining & Big Data']
    }
}

topic_to_keywords = {
    'Hotels': ['hotel','hotels','resort','resorts','hostel','booking'],
    'Flights': ['airlines','airline','tickets','ticket'],
    'Holidays': ['travel','tour','trip','vacation','vacations','travels'],
    'Food': ['food','recipes','restaurant','dining','cooking'],  # 'home delivery'
    'News': ['news','weather','media','magazine'],
    'Blogs': ['blog','blogger'],
    'Vlogs': ['vlog','vlogger'],
    'Games': ['game','games','play','plays'],
    'Videos, music': ['tv','movie','movies','series','series','clip','clips','videos','video','music'],
    'Books': ['book','books'],
    'e-learning': ['learning','learn','course','courses','learns'],
    'Gambling': ['gambling','casino','casinos','gamble','slots'],
    'Adult': ['escort','porno','porn','sex','kinky','gay','erotic'],
    'Advertising': ['advertise','advertising'],  # 'advertise agency', 'advertise agencies'
    'Finance': ['banking','bank','investing','investors','investor','investments'],  # 'online banking'
    'Consultancy': ['advice','consultancy'],
    'Jobs': ['job','jobs','vacancy','vacancies','recruitment','recruiter','career','employment','part-time'],  # 'job site', 'job sites', 'employment agency' etc.
    'Housing': ['house','housing','mortgage','mortgages','broker'],  # 'supply', 'property for sale', 'real estate agents', 'rental properties'
    'Price comparison': ['energy','insurance','subscriptions','prices','providers','cheapest','save','subscription','mobile'],  # some
    'Tickets': ['tickets','ticketing','ticket'],
    'Auctions': ['auction','auctions','auctioneer'],  # some others
    'Online trade': ['marketplace'],  # 'second hand'
    'Dating': ['date','dating'],  # some others
    'Visualisations': ['visualisation','visualisations','visualization','visualizations','animation','animations','design'],
    'Transport': ['car','carpool','lift','parking'],  # 'trip planner', 'route planner'
    'Online payment': ['service','services','payment'],  # 'payment solutions'
    'Webhosting': ['hosting','domain','server','service','cloud','development'],  # 'domain name', 'web hosting'; NOTE: likely to also capture parked websites
    'Cloud services, Datacentres': ['cloud','databases','database'],  # 'cloud services', 'data center' etc.
    'Website design, developing': ['software','graphic','developer','development'],
    'App design, developing': ['mobile','iphone','android','tablet'],
    'Software products and services': ['software','develop','databases'],
    'Internet marketing': ['ecommerce','emarketing','adwords','b2b','b2c','sales','purchasing','marketing','seo','optimization',
                           'intelligence','analytics','advertisers','advertiser'],  # 'internet marketing' etc.
    'Internet consultancy': ['internet','online','development','consultancy','research','advice','analysis','technology',
                             'communication','wiki','community','hangout','skype','youtube','tweets','twitter','snapchat','facebook'],  # some others
    'Firewalls': ['firewall','firewalls','cyber','vpn','spyware','antispam','hacking','hackers','security','cybercrime','spam',
                  'phishing','tracking','pharming','risk','incident','virus','solutions','protects','specialist','burglary',
                  'mutilation','passwords','protect','secured','blockade'],  # some others
    'Datamining & Big Data': ['robots','automate','crawler','data','mining','text','supplier','web','google','discover','patterns',
                              'intelligence','platform','recognition','machine','collect','collects','science','big']
}

categories = pd.DataFrame(
    [
        [category, subcategory, topic] 
        for category, subcategories in category_mappings.items() 
        for subcategory, topics in subcategories.items() 
        for topic in topics
    ],
    columns=['Category', 'Subcategory', 'Topic']
)
categories['Keywords'] = categories['Topic'].map(topic_to_keywords)

print(f"Topics not mapped yet: {categories[categories['Keywords'].isnull()]['Topic'].tolist()}\n")

categories.head(10)

Topics not mapped yet: []



Unnamed: 0,Category,Subcategory,Topic,Keywords
0,D,Leisure,Hotels,"[hotel, hotels, resort, resorts, hostel, booking]"
1,D,Leisure,Flights,"[airlines, airline, tickets, ticket]"
2,D,Leisure,Holidays,"[travel, tour, trip, vacation, vacations, trav..."
3,D,Leisure,Food,"[food, recipes, restaurant, dining, cooking]"
4,D,News and entertainment,News,"[news, weather, media, magazine]"
5,D,News and entertainment,Blogs,"[blog, blogger]"
6,D,News and entertainment,Vlogs,"[vlog, vlogger]"
7,D,News and entertainment,Games,"[game, games, play, plays]"
8,D,News and entertainment,"Videos, music","[tv, movie, movies, series, series, clip, clip..."
9,D,News and entertainment,Books,"[book, books]"


## Notes on reconstruction of keyword matching process

The keywords consist of most of the 'Keywords: base' entries in 2016 Appendix B. Some were excluded because they didn't seem relevant, and some were removed because they translated (from Dutch) to two English words, which cannot be matched against single-word counts. \
e.g. thuisbezorgd -> 'home delivery' is currently not included.
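
Two-word keywords cannot be matched against the per-word `Counts`, but they could be searched for in the raw `Text` column instead. The sketch below is one possible approach, not something taken from the papers. The dict `topic_to_phrases` and the function `get_phrase_topics` are hypothetical; the phrases themselves come from the comments next to `topic_to_keywords`.

```python
import re

# Hypothetical phrase lists, taken from the comments beside topic_to_keywords
topic_to_phrases = {
    "Food": ["home delivery"],
    "Webhosting": ["domain name", "web hosting"],
}


def get_phrase_topics(text, topic_to_phrases):
    """Return topics with at least one phrase in text (case-insensitive, whole words)."""
    # Lower-case and collapse whitespace so phrases split over lines still match
    normalised = re.sub(r"\s+", " ", text.lower())
    return [
        topic
        for topic, phrases in topic_to_phrases.items()
        if any(re.search(rf"\b{re.escape(phrase)}\b", normalised) for phrase in phrases)
    ]


print(get_phrase_topics("Is this your domain\nname? Activate it now.", topic_to_phrases))
```

The resulting topics could then be appended to the output of `get_topics` before voting, although how much weight a phrase match should carry relative to a single keyword is an open design choice.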

I do not understand how the 'Keywords: base' vs 'Keywords: combination' columns in 2016 Appendix B interact/are used.

The list of topic keywords includes both singular and plural versions of some words, but any singular/plural form left out of the list will be missed. A better method would singularise the words on each website and include only singular forms in the list of topic keywords, which would be more consistent.
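
A rough sketch of the singularising idea, under strong assumptions: the rule used here (strip one trailing "s") is deliberately naive, and `IRREGULAR_SINGULARS`, `singularise` and `merge_singular_counts` are hypothetical names. A proper lemmatiser would likely be more reliable. The counts in the example are made up for illustration.

```python
from collections import Counter

# Hypothetical overrides for plurals that the naive rule would get wrong
IRREGULAR_SINGULARS = {"vacancies": "vacancy", "series": "series"}


def singularise(word):
    """Naive singulariser: use known irregulars, else strip one trailing 's'."""
    if word in IRREGULAR_SINGULARS:
        return IRREGULAR_SINGULARS[word]
    if len(word) > 3 and word.endswith("s") and not word.endswith(("ss", "us", "is")):
        return word[:-1]
    return word


def merge_singular_counts(counts):
    """Sum the counts of words that share a singular form."""
    merged = Counter()
    for word, count in counts.items():
        merged[singularise(word)] += count
    return merged


# Made-up counts for illustration
example_counts = {"trees": 218, "tree": 150, "care": 90, "services": 40, "service": 35}
print(merge_singular_counts(example_counts).most_common(3))
```

Merging counts this way would also stop a singular and its plural (for example `tree` and `trees`) from each taking one of the five keyword slots for a website.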

In [6]:
categories.sample(10).sort_index()

Unnamed: 0,Category,Subcategory,Topic,Keywords
8,D,News and entertainment,"Videos, music","[tv, movie, movies, series, series, clip, clip..."
10,D,News and entertainment,e-learning,"[learning, learn, course, courses, learns]"
11,D,News and entertainment,Gambling,"[gambling, casino, casinos, gamble, slots]"
13,D,Business,Advertising,"[advertise, advertising]"
18,D,Retail,Price comparison,"[energy, insurance, subscriptions, prices, pro..."
20,D,Retail,Auctions,"[auction, auctions, auctioneer]"
26,E,Hosting and cloud,Webhosting,"[hosting, domain, server, service, cloud, deve..."
28,E,Hosting and cloud,"Website design, developing","[software, graphic, developer, development]"
29,E,Hosting and cloud,"App design, developing","[mobile, iphone, android, tablet]"
34,E,Infrastructure and security,Datamining & Big Data,"[robots, automate, crawler, data, mining, text..."


In [7]:
# Find the 5 most common words in each website
sites['Keywords'] = [
    [word for word, count in Counter(filt_countsi).most_common(5)] 
    for filt_countsi in sites['Filt_counts']
]
sites.head()

Unnamed: 0,Netloc,Text,Counts,KM_cluster,Filt_counts,Keywords
0,0064.co.nz,Is this your domain name?\n\nActivate your dom...,"{'is': 1, 'this': 1, 'your': 3, 'domain': 2, '...",5,"{'domain': 2, 'activate': 1, 'instantly': 1, '...","[domain, activate, instantly, connecting, mult..."
1,01builders.co.nz,Welcome to WordPress. This is your first post....,"{'welcome': 1, 'to': 1, 'wordpress': 1, 'this'...",9,"{'welcome': 1, 'wordpress': 1, 'post': 1, 'edi...","[welcome, wordpress, post, edit, delete]"
2,08004ducting.co.nz,This document uses a frameset.\n\nClick Here t...,"{'this': 1, 'document': 1, 'uses': 1, 'framese...",9,"{'document': 1, 'uses': 1, 'frameset': 1, 'cli...","[document, uses, frameset, click, view]"
3,0800easymovers.co.nz,Thanks For Subscribe! Please wait it will be r...,"{'thanks': 4, 'for': 8, 'subscribe': 1, 'pleas...",9,"{'thanks': 4, 'subscribe': 1, 'wait': 1, 'redi...","[thanks, movers, expected, help, team]"
4,0800nz.co.nz,Hit enter to search or ESC to close | Hit ente...,"{'hit': 2, 'enter': 2, 'to': 5, 'search': 2, '...",9,"{'hit': 2, 'enter': 2, 'search': 2, 'esc': 2, ...","[hit, enter, search, esc, close]"


In [8]:
def get_topics(site_keywords, topic_to_keywords):
    """
    Return list of topics corresponding to the site_keywords.
    """
    site_topics = [
        topic for topic, topic_keywords in topic_to_keywords.items() 
        for site_keyword in site_keywords
        if site_keyword in topic_keywords
    ]
    
    return site_topics

def get_most_common_topics(site_keywords, topic_to_keywords):
    """
    Construct a list of the topic(s) corresponding to the site_keywords and take
    a vote to determine the most common topic(s).
    This voting system allows multiple topics to share the same keyword; the
    presence of other keywords on the website may (or may not) then decide
    between them.
    
    Return a list[str] of topic(s). This list will have:
    > zero elements if none of the site keywords correspond to a topic
    > one element if there is a single most common topic
    > multiple elements if there are multiple most common topics
    """
    site_topics = get_topics(site_keywords, topic_to_keywords)
    if len(site_topics) == 0:
        return []
    
    topics_ordered = Counter(site_topics).most_common()
    topics_counter_max = topics_ordered[0][1]
    
    most_common_topics = []
    for topic, topic_count in topics_ordered:
        if topic_count < topics_counter_max:
            break
        most_common_topics.append(topic)
    
    return most_common_topics


test_site_keywords = ['software', 'develop', 'iphone', 'android']
print("All topics:        ", get_topics(test_site_keywords, topic_to_keywords))
print("Most common topics:", get_most_common_topics(test_site_keywords, topic_to_keywords))

All topics:         ['Website design, developing', 'App design, developing', 'App design, developing', 'Software products and services', 'Software products and services']
Most common topics: ['App design, developing', 'Software products and services']


In [9]:
sites['Topics'] = [get_most_common_topics(keywordsi, topic_to_keywords) for keywordsi in sites['Keywords']]

In [10]:
# See which websites had at least one topic detected
sites[sites['Topics'].apply(len) > 0].head()

Unnamed: 0,Netloc,Text,Counts,KM_cluster,Filt_counts,Keywords,Topics
0,0064.co.nz,Is this your domain name?\n\nActivate your dom...,"{'is': 1, 'this': 1, 'your': 3, 'domain': 2, '...",5,"{'domain': 2, 'activate': 1, 'instantly': 1, '...","[domain, activate, instantly, connecting, mult...",[Webhosting]
9,0800treetrim.nz,Trees And First Impressions When it comes to R...,"{'trees': 218, 'and': 411, 'first': 24, 'impre...",9,"{'trees': 218, 'impressions': 10, 'comes': 7, ...","[tree, trees, care, large, house]",[Housing]
17,101home.co.nz,Under Covid-19 Alert Level 3 our store will be...,"{'under': 877, 'covid': 867, 'alert': 867, 'le...",3,"{'covid': 867, 'alert': 867, 'level': 875, 'st...","[store, level, click, collect, closed]",[Datamining & Big Data]
20,107fm.co.nz,Lorem Ipsum is simply dummy text of the printi...,"{'lorem': 2, 'ipsum': 2, 'is': 1, 'simply': 1,...",9,"{'lorem': 2, 'ipsum': 2, 'simply': 1, 'dummy':...","[lorem, ipsum, dummy, text, industry]",[Datamining & Big Data]
21,1080design.co.nz,From concept through to completion we provide ...,"{'from': 1, 'concept': 1, 'through': 2, 'to': ...",9,"{'concept': 1, 'completion': 1, 'provide': 1, ...","[services, concept, completion, provide, range]",[Online payment]


In [11]:
# See which websites had multiple topics detected
sites[sites['Topics'].apply(len) > 1].head()

Unnamed: 0,Netloc,Text,Counts,KM_cluster,Filt_counts,Keywords,Topics
27,10tenentertainment.co.nz,10 Ten Entertainment is a Whakatane-based ente...,"{'ten': 1, 'entertainment': 4, 'is': 6, 'whaka...",9,"{'entertainment': 4, 'whakatane': 1, 'based': ...","[entertainment, reuben, provide, service, able]","[Online payment, Webhosting]"
48,185chairs.co.nz,Sign the Online Guest Book 54 physical comment...,"{'sign': 1, 'the': 3, 'online': 2, 'guest': 1,...",9,"{'sign': 1, 'online': 2, 'guest': 1, 'book': 1...","[online, comments, sign, guest, book]","[Books, Internet consultancy]"
115,2kcupsi.co.nz,Championship\n\nWhether you just want to be pa...,"{'championship': 10, 'whether': 1, 'you': 4, '...",9,"{'championship': 10, 'just': 3, 'want': 2, 'fu...","[cars, cup, committee, series, car]","[Videos, music, Transport]"
119,2ndnature.co.nz,Virtual Mimic® is a software application train...,"{'virtual': 2, 'mimic': 2, 'is': 1, 'software'...",9,"{'virtual': 2, 'mimic': 2, 'software': 2, 'app...","[video, virtual, mimic, software, application]","[Videos, music, Website design, developing, So..."
138,360development.co.nz,360 DEVELOPMENT RACKING AND SHELVING INSTALLAT...,"{'development': 4, 'racking': 3, 'and': 8, 'sh...",9,"{'development': 4, 'racking': 3, 'shelving': 3...","[development, racking, shelving, installation,...","[Webhosting, Website design, developing, Inter..."


Now map topics back to subcategories and categories.

In [12]:
topic_to_subcategory = {topic: subcategory for subcategory, topic in zip(categories['Subcategory'], categories['Topic'])}
sites['Subcategories'] = [
    set([topic_to_subcategory[topic] for topic in topicsi])
    for topicsi in sites['Topics']
]

topic_to_category = {topic: category for category, topic in zip(categories['Category'], categories['Topic'])}
sites['Categories'] = [
    set([topic_to_category[topic] for topic in topicsi])
    for topicsi in sites['Topics']
]

In [13]:
print(f"Currently {sites[sites['Topics'].apply(len) > 0].shape[0]} / {sites.shape[0]} websites have allocated topics/subcategories/categories.")
print()
print("Number of websites with multiple:")
print(f"Topics           | {sites[sites['Topics'].apply(len) > 1].shape[0]}")
print(f"Subcategories    | {sites[sites['Subcategories'].apply(len) > 1].shape[0]}")
print(f"Categories       | {sites[sites['Categories'].apply(len) > 1].shape[0]}")

Currently 11728 / 41166 websites have allocated topics/subcategories/categories.

Number of websites with multiple:
Topics           | 2542
Subcategories    | 2391
Categories       | 1752


## Results
We cannot judge whether the number of websites assigned to Category D or E is reasonable until we add a process to assign websites to Category C.

There is a fair bit of overlap even at category level; this is where further refinement can help. Section 5.1.4 of the 2016 paper discusses *dealing with overlap*, but only after all categorisation and refinement steps had been applied to websites in Categories C, D, & E (they reported only 4% of websites being allocated to more than one category after all these steps).
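
To put the counts printed above in proportion, the short calculation below converts them to shares. The variable names are illustrative. Which denominator the paper used for its 4% figure is not stated here, so both candidates are shown and the comparison should be treated as indicative only.

```python
n_sites = 41166           # all websites in the dataset
n_with_topics = 11728     # websites with at least one topic
n_multi_category = 1752   # websites allocated to both Category D and E

share_allocated = n_with_topics / n_sites
share_multi_of_all = n_multi_category / n_sites
share_multi_of_allocated = n_multi_category / n_with_topics

print(f"Allocated to D or E:              {share_allocated:.1%}")
print(f"Multiple categories (all sites):  {share_multi_of_all:.1%}")
print(f"Multiple categories (allocated):  {share_multi_of_allocated:.1%}")
```

Roughly 28.5% of websites receive a topic. Websites in both categories make up about 4.3% of all websites but about 14.9% of those allocated, and this is before any Category C assignment or refinement.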