ContentSquare Data Science Test
===========
## Goal
The goal of the exercise is to predict the category of a webpage based on its URL string and custom variables. The solution should be able to classify pages into categories even for new websites.

## Instructions
* You should not spend more than 5 hours working on the test. We're aware that it's not enough to fully answer the
  problem, but that's OK. This is done on purpose and we don't expect you to finish the exercise
* To get started using the dataset you can refer to the [getting started notebook](getting_started.ipynb)
* You're encouraged to use comments to describe the steps you're taking and explain your code
* If you use external libraries please make sure to list them in a `requirements.txt` file
* Provide a `README.md` explaining what you did and specifying which code to run to reproduce your results
* Submission should include all your scripts/notebooks as well as the `requirements.txt` and `README.md` files

## Assessment
We will assess the exercise based on the following criteria
* Depth and quality of the scientific approach
* Code quality and cleanliness
* Logic and relevance of your chosen steps
* Clarity of the comments and README

## Expectations
This test is part of a project that we are working on since several months. As a result, we don't expect from you to
find a working solution. Depending on how you think you will manage your time, you can either:
* craft an end to end solution (business analysis, training data analysis, modeling and training, results interpretation) and iterate on it
* or choose some sections only and produce an in-depth analysis.

If you go for the second option, chosen sections must at least include business analysis & training data analysis.

As a result, training a machine learning model is not a requirement to succeed the test. However if you don't train a model, tell us how you would do it or what other system you would implement to solve the problem.

If you have any doubts or questions, do not hesitate to send an email to your contact point at ContentSquare.

# Solution

## Play with data

Import the data with pandas, see what it looks like, list ideas of features and distribution of the different targets thanks to the code in the tutorial notebook

In [9]:
import pandas as pd

# Load with pandas and show a few elements

website_df = pd.read_csv('/Users/victormoeneclaey/Downloads/dataset.csv')
website_df.head(10)

Unnamed: 0,url,prefix,category_name
0,https://www.pizzahut.co.uk/order/deals/?cs-pop...,www.pizzahut.co.uk,Other
1,https://us.pandora.net/on/demandware.store/Sit...,us.pandora.net,My account
2,https://www.oakleysi.com/en-us/my-account/edit...,www.oakleysi.com,My account
3,https://www.gites-de-france.com/fr/search?depa...,www.gites-de-france.com,Search
4,https://www.hermes.com/us/en/product/candy-san...,www.hermes.com,Product
5,https://www.ralphlauren.fr//emailorderdetails?...,www.ralphlauren.fr,Other
6,https://www.travelrepublic.co.uk/holidays/hote...,www.travelrepublic.co.uk,Category
7,https://www.ochsner-shoes.ch/CH/fr/shop/enfant...,www.ochsner-shoes.ch,Category
8,https://www.spacenk.com/uk/home?&cm_mmc=PPC%7c...,www.spacenk.com,Home
9,https://www.ikks.com/fr/tee-shirt-blanc-a-impr...,www.ikks.com,Product


In [22]:
website_df["url"][0]

'https://www.pizzahut.co.uk/order/deals/?cs-popin-modal--DealBotModalB$$Device Type=desktop&Login Status=logged-out&Platform Type=website&Screen Type=landing-page&Trace ID=737897fa-effb-4436-ab44-d9dc477340bb'

In [8]:
# Get size of the dataset

print(website_df.size)

300000


The url format is the following: {schema}+{prefix}+{path}?{query}$${custom_variables}

In [11]:
website_df['c_vars'] = website_df.url.apply(lambda url: url.split('$$')[-1])
website_df['query'] = website_df.url.apply(lambda url: url.split('$$')[0].split('?')[-1])
website_df['path'] = website_df.apply(
    lambda row: row.url.split('$$')[0].split('?')[0].split(row.prefix)[-1], axis=1
)

In [12]:
website_df[['prefix', 'path', 'query','c_vars', 'category_name']].head(10)

Unnamed: 0,prefix,path,query,c_vars,category_name
0,www.pizzahut.co.uk,/order/deals/,cs-popin-modal--DealBotModalB,Device Type=desktop&Login Status=logged-out&Pl...,Other
1,us.pandora.net,/on/demandware.store/Sites-en-US-Site/en_US/Or...,orderNumber=PND11077497,country=us&customer_login_stat=Logged in&page_...,My account
2,www.oakleysi.com,/en-us/my-account/edit-address/11813567168535,,Action=US:EN:R::Generic &Page Name=:Generic&Pa...,My account
3,www.gites-de-france.com,/fr/search,department=81&departments=100&destination=Tarn,pageCategory=resultatsDeRecherche&peopleNumber=2,Search
4,www.hermes.com,/us/en/product/candy-sandal-H211070Zv9G375/mod...,,category=Z/Z01&clienttype=non-connected&depart...,Product
5,www.ralphlauren.fr,//emailorderdetails,orderEmail=CS_ANONYMIZED_EMAIL&orderID=7000813164,Login State=guest&Page Level 1=account&Page Le...,Other
6,www.travelrepublic.co.uk,/holidays/hotels/c2f165a2-f433-4703-b721-70edf...,fbt=3&fcd=2%7C218701&fstr=4,Page path=/holidays/hotels/c2f165a2-f433-4703-...,Category
7,www.ochsner-shoes.ch,/CH/fr/shop/enfants/enfants-chaussures/enfants...,filter-size=212@26,Country ID=ch&Lang ID=fr&Pagetype=CATEGORY,Category
8,www.spacenk.com,/uk/home,&cm_mmc=PPC%7cGoogle%7cUK-_-Brand%7cCore-_-Spa...,Currency=GBP&Is Ndulge=Yes&LogStatus=Soft&Page...,Home
9,www.ikks.com,/fr/tee-shirt-blanc-a-imprime-cours-en-all-ove...,,"brand=OUTLET&category=Tee-shirt, top, chemise&...",Product


In [111]:
category_and_proba = website_df.category_name.value_counts(normalize=True) * 100
category_and_proba

Product                   36.928666
Category                  18.071541
Search                    12.437898
Home                       8.970564
Checkout                   5.570506
Cart                       3.483895
Confirmation               3.356587
My account                 2.639314
Formations / services      2.628964
Other                      2.270845
Brand image                1.360023
Store locator              0.752463
Information / legals       0.586859
Help / support             0.366399
Press / news               0.307403
Offers & services          0.154219
Favorites / wishlist       0.064172
Form                       0.017595
Appointments / booking     0.016560
Careers & applications     0.015525
Name: category_name, dtype: float64

In [197]:
# Define useful global variables

from pprint import pprint

ALL_CATEGORIES = list(res.axes[0])
ALL_CATEGORIES_SET = set(ALL_CATEGORIES)
CATEGORY_TO_PRIOR_PROBABILITY = {
    category: proba
    for category, proba in zip(ALL_CATEGORIES, list(category_and_proba))
}
print(ALL_CATEGORIES)
pprint(CATEGORY_TO_PRIOR_PROBABILITY)

['Product', 'Category', 'Search', 'Home', 'Checkout', 'Cart', 'Confirmation', 'My account', 'Formations / services', 'Other', 'Brand image', 'Store locator', 'Information / legals', 'Help / support', 'Press / news', 'Offers & services', 'Favorites / wishlist', 'Form', 'Appointments / booking', 'Careers & applications']
{'Appointments / booking': 0.016560404073859402,
 'Brand image': 1.3600231845657036,
 'Careers & applications': 0.01552537881924319,
 'Cart': 3.4838950070381713,
 'Category': 18.071540945599075,
 'Checkout': 5.5705059203444565,
 'Confirmation': 3.356586900720378,
 'Favorites / wishlist': 0.06417156578620518,
 'Form': 0.017595429328475617,
 'Formations / services': 2.6289641467251803,
 'Help / support': 0.36639894013413926,
 'Home': 8.970563881758716,
 'Information / legals': 0.5868593193673926,
 'My account': 2.6393143992713424,
 'Offers & services': 0.15421876293781567,
 'Other': 2.2708454086279706,
 'Press / news': 0.30740250062101515,
 'Product': 36.92866605945185,
 '

The dataset is not balanced -> be careful, maybe rebalance some features or carry out some balanced training (sample weight). Is the dataset biased ? If not, there is a prior on the probabiity to belong to a certain class.<br>
The "Other" category is already in the dataset

In [103]:
website_df.prefix.value_counts(normalize=True)[:20] * 100

www2.hm.com                    14.790
www.gites-de-france.com         7.827
www.pizzahut.co.uk              4.094
www.cosstores.com               4.069
www.gucci.com                   3.693
www.oakley.com                  3.316
www.travelrepublic.co.uk        3.276
www.specsavers.co.uk            3.206
www.spacenk.com                 2.943
www.funkypigeon.com             2.665
www.hobbycraft.co.uk            2.491
uk.pandora.net                  2.380
www.pecheur.com                 2.240
us.pandora.net                  2.197
www.which.co.uk                 2.150
www.hermes.com                  2.013
www.bottegaveneta.com           1.584
app.contentsquare.com           1.583
www.tomtom.com                  1.517
www.natureetdecouvertes.com     1.498
Name: prefix, dtype: float64

In [15]:
website_df.groupby('path')['path'].count().sort_values(ascending=False)[:20]

path
/fr/search                     3861
/                              3171
/de_de/search-results.html     1707
/en_us/search-results.html     1199
/order/deal/                   1088
/search                        1025
/en_gb/checkout                 996
/en/checkout                    851
/advancedsearchresults.aspx     847
/order/deals/                   786
/en/shopping-bag                705
/en/                            627
/uk/cart                        612
/en/search                      585
/order/pizzas/                  581
/glasses/all-glasses            538
/us/en/                         482
/en                             410
/en-us/search                   357
/en-us/cart                     331
Name: path, dtype: int64

## Explore and get some stats

My first idea was to investigate a given category and watch the most frequent words that those urls handle. We might expect "Search" urls to contain certain keywords that other urls don't contain often, like "search" for instance. Those words can be strong clues towards a given category.<br>
Similarly, specific websites or actions (c_vars) on those website can helps us guess which category the url belongs to

In [59]:
# Most frequent path chunks for a given page category. Example with "search" category pages

from collections import Counter

def get_most_occurrent_path_chunks(category):

    category_path_chunks = [
        chunk
        for path, cat in zip(website_df["path"], website_df["category_name"])
        for chunk in path.split("/")
        if cat == category and chunk
    ]

    print(f"Number of different path chunks: {len(set(category_path_chunks))}")
    pprint(sorted(Counter(category_path_chunks).items(), key=lambda x: -x[1]))

get_most_occurrent_path_chunks("Search")

Number of different path chunks: 408
[('search', 6680),
 ('fr', 4055),
 ('search-results.html', 2995),
 ('de_de', 1724),
 ('en', 1307),
 ('en_us', 1198),
 ('Search', 504),
 ('us', 434),
 ('st', 359),
 ('newsearchpage', 359),
 ('en-us', 337),
 ('recherche', 327),
 ('search.html', 310),
 ('recherche.asp', 223),
 ('uk', 210),
 ('Index', 200),
 ('de', 111),
 ('it', 106),
 ('search-results', 101),
 ('en_gb', 92),
 ('en_eur', 87),
 ('nl', 69),
 ('en_gbp', 69),
 ('Cerca', 69),
 ('ja_jp', 66),
 ('catalogsearch', 56),
 ('result', 56),
 ('en-ca', 55),
 ('jp', 50),
 ('cs', 49),
 ('vyhledavani', 49),
 ('en-gb', 40),
 ('kr', 40),
 ('es_us', 39),
 ('ja', 38),
 ('Recherche', 38),
 ('en_usd', 37),
 ('CH', 36),
 ('shop', 36),
 ('ko', 35),
 ('eu', 35),
 ('au', 32),
 ('en_au', 31),
 ('pl', 29),
 ('fr_fr', 29),
 ('particulieren', 28),
 ('zoek.html', 28),
 ('es', 27),
 ('SearchDisplay', 23),
 ('sk', 21),
 ('vyhladavanie', 21),
 ('10', 21),
 ('de-de', 16),
 ('Zoeken', 15),
 ('zh', 14),
 ('gb', 14),
 ('it-it

We see a mixture of relevant words like "search" "recherche" and non-relevant words like language ones, that should appear in all categories. TF-IDF should solve that

In [60]:
# Same for "Product"

get_most_occurrent_path_chunks("Product")

Number of different path chunks: 34343
[('fr', 4718),
 ('en', 3178),
 ('de_de', 2796),
 ('en_us', 2569),
 ('product', 2565),
 ('women', 2550),
 ('us', 2279),
 ('pr', 1556),
 ('womenswear', 1493),
 ('uk', 1443),
 ('en_eur', 1326),
 ('en-us', 1291),
 ('__', 957),
 ('men', 941),
 ('order', 940),
 ('card', 919),
 ('de', 893),
 ('analyze', 782),
 ('deal', 759),
 ('auvergne-rhone-alpes', 630),
 ('bretagne', 599),
 ('model-selection', 585),
 ('eu', 511),
 ('ja_jp', 507),
 ('personalized-products', 497),
 ('video', 470),
 ('nouvelle-aquitaine', 460),
 ('it', 447),
 ('pages', 447),
 ('zoning-v2', 444),
 ('occitanie', 437),
 ('es', 417),
 ('skincare', 414),
 ('menswear', 405),
 ('zh', 397),
 ('dresses', 388),
 ('hotels', 380),
 ('normandie', 376),
 ('v3', 371),
 ('shop', 355),
 ('jp', 344),
 ('accessories', 342),
 ('en_gb', 340),
 ('tops', 336),
 ('customisecards.aspx', 330),
 ('makeup', 307),
 ('CH', 289),
 ('editor', 289),
 ('en_usd', 289),
 ('en_gbp', 289),
 ('nl', 286),
 ('preview', 284),
 (

 ('6936729', 4),
 ('retro-gamer-magazine-subscription.thtml', 4),
 ('used-ford-ecosport', 4),
 ('hollywood-flawless-filter-MUK200024604.html', 4),
 ('creme-de-la-mer-moisturizing-cream-UK200020044.html', 4),
 ('productpage.0757903026.html', 4),
 ('product.oversized-shirt-dress-turquoise.0984571002.html', 4),
 ('product.teddy-zip-up-jacket-orange.0935368001.html', 4),
 ('productpage.0812748001.html', 4),
 ('product.patch-pocket-cardigan-green.0960249001.html', 4),
 ('6936669', 4),
 ('playstation-official-magazine-subscription.thtml', 4),
 ('mathematiques', 4),
 ('pf', 4),
 ('W0OO9334', 4),
 ('99470', 4),
 ('boys-0-36m', 4),
 ('portefeuilles-et-petits-accessoires', 4),
 ('W0OO4123OSI', 4),
 ('demande-essai', 4),
 ('wellness', 4),
 ('supplements', 4),
 ('elefanten', 4),
 ('lawnmowers', 4),
 ('trousers-and-shorts-for-men', 4),
 ('W0OO9438', 4),
 ('biologie-geographie', 4),
 ('decoration-exterieur', 4),
 ('qa', 4),
 ('black-rose-skin-infusion-cream-MUK200019244.html', 4),
 ('sandals-for-gir

 ('anglais-le-corps-humain-et-ses-mouvements', 2),
 ('emotional-rescue-niece-birthday-card-takes-after-auntie', 2),
 ('129685', 2),
 ('gite-de-suzette-ouessant-29g9753', 2),
 ('648924ZJP271061', 2),
 ('%E9%A6%99%E6%B3%A2', 2),
 ('beanies-for-men', 2),
 ('productpage.0923134002.html', 2),
 ('00002001853193', 2),
 ('nana-me-to-you-birthday-card', 2),
 ('139501', 2),
 ('W0OO9361', 2),
 ('mini-mallette-massage-pierres-chaudes-15211040', 2),
 ('productpage.0863598001.html', 2),
 ('sk-7928081992002828037,10', 2),
 ('32', 2),
 ('macchina-fotografica-compatta-leica-q2-nero-miglior-prezzo', 2),
 ('341312.html', 2),
 ('product.double-chain-necklace-gold.0960640001.html', 2),
 ('productpage.0897723002.html', 2),
 ('product.ribbed-cardigan-brown.0961411004.html', 2),
 ('productpage.0954232003.html', 2),
 ('la-chouette-blanche-22g170914', 2),
 ('tiger-espadrilles', 2),
 ('product.sculpt-recycled-polyamide-slip-dress-beige.0893745002.html', 2),
 ('54634', 2),
 ('gg-supreme-backpack-p-406370KLQAX9772

 ('product.jersey-twill-shirt-jacket-blue.0631787024.html', 2),
 ('cotton-boxer-3-pack-573853.html', 2),
 ('productpage.0870524021.html', 2),
 ('alisa', 2),
 ('ultra-sun-protection-spf-45-pa--anti-glycation-primer-MUK200006779.html', 2),
 ('product.long-organic-cotton-tunic-shirt-white.0967678001.html', 2),
 ('en-bh', 2),
 ('apprendre-lire-compter', 2),
 ('292695', 2),
 ('product.merino-wool-cotton-mix-a-line-dress-green.0896281002.html', 2),
 ('sides', 2),
 ('me-to-you-uncle-birthday-card', 2),
 ('138982', 2),
 ('cache-poubelles-composteurs', 2),
 ('composteur-a-acces-direct-design-209l-92531250', 2),
 ('product.wool-tabard-knitted-vest-black.0940941001.html', 2),
 ('911537', 2),
 ('bracelet-en-oeil-de-tigre-et-obsidienne-92061800', 2),
 ('la-guyonnais-22g570896', 2),
 ('tiger-stretch-canvas-espadrilles', 2),
 ('chemise-oxford-fantaisie-cintree-rayee-567779.html', 2),
 ('lifestyle-06', 2),
 ('all-day-luminous-weightless-foundation-UK200014584.html', 2),
 ('product.wool-cashmere-tailor

 ('product.abstract-art-detail-sheer-roll-neck-jumper-blue.0934565001.html', 2),
 ('476466K9GVT8856', 2),
 ('large-brush-set-3-pieces', 2),
 ('639662-1000', 2),
 ('spring-summer-2021-pre-order', 2),
 ('achat-tete-plombee-scratch-tackle-finess-nose-jig-head-par-5-193186.html',
  2),
 ('coussin-masseur-cervical-autonome-15211190', 2),
 ('iphone-xs-max-256-go-oro-sbloccato-da-tutti-gli-operatori-miglior-prezzo',
  2),
 ('175710.html', 2),
 ('the-stars-and-the-sky-anniversary-card-40-years-still-strong', 2),
 ('134145', 2),
 ('Varese*Dea*Damen*Chelsea*Boot*Rot.prod', 2),
 ('productpage.0768858001.html', 2),
 ('achat-canne-spinning-sakura-ionizer-finesse-light-game-insf-202178.html', 2),
 ('camicia-in-lino-relaxed-fit-563817.html', 2),
 ('%E3%83%88%E3%83%BC%E3%83%88%E3%83%90%E3%83%83%E3%82%B0_cod45551296jq.html',
  2),
 ('jogging-gris-chine-bande-logo-laterale-fjog21002sgryd1.html', 2),
 ('shoulder-bag_cod45551196wm.html', 2),
 ('a-quoi-ca-sert-de-porter-un-masque', 2),
 ('product.slightly-

 ('36830.html', 2),
 ('64545492TCG8563', 2),
 ('gite-de-rousseau-h85g015675', 2),
 ('rainbow-dust-leaf-green-progel-food-colouring-25g', 2),
 ('608953-1006', 2),
 ('15', 2),
 ('gift-ideas', 2),
 ('for-her', 2),
 ('W0OO9013R', 2),
 ('bearn-compact-wallet-H039790CK89', 2),
 ('la-chaumepinette-27g1255', 2),
 ('chemise-cintree-en-lin-565790.html', 2),
 ('soldes', 2),
 ('mille-et-une-nuits-de-charme', 2),
 ('en-vn', 2),
 ('full-conditioner-MUK200007187.html', 2),
 ('6466353G0011072', 2),
 ('knitcraft-pink-cosy-on-up-yarn-200g', 2),
 ('650032-1002', 2),
 ('coumessac-30g12542', 2),
 ('grands-outils-de-jardin-enfant-vilac-91190360', 2),
 ('dips', 2),
 ('2747087', 2),
 ('294805', 2),
 ('mon-tresor-8bs010a0kkf0kur', 2),
 ('product.straight-fit-t-shirt-white.0252867004.html', 2),
 ('shoes-man', 2),
 ('productpage.0916289002.html', 2),
 ('cute-elephant-anniversary-card-elliott-and-buttons', 2),
 ('138182', 2),
 ('product.straight-contrast-panel-trousers-beige.0974446002.html', 2),
 ('productpage.0

 ('reverence-aromatique-hand-wash-MUK300055877.html', 1),
 ('small-wallet_cod22009999kn.html', 1),
 ('turnschuhe-old-skool--tulipes--kenzo%2F-vans', 1),
 ('FA55SN601F87.39.39.html', 1),
 ('bain-magique-bleu-tinti.html', 1),
 ('achat-tresse-hearty-rise-valley-hunter-x8-150m-vert-204214.html', 1),
 ('outerwear_cod41975185kn.html', 1),
 ('shirts-and-blouses_cod26191867425002707.html', 1),
 ('achat-arome-fun-fishing-elite-flavour-50ml-35167.html', 1),
 ('pegase-belt-buckle-reversible-leather-strap-32mm-U_BELT_32_HOMMEpH077933CB86pH073967CAAE080',
  1),
 ('productpage.0667499003.html', 1),
 ('gerippte-damensocken-197204.html', 1),
 ('00002001875092', 1),
 ('Superfit+Rush+GoreTex+M%C3%A4dchen+Sneaker+Lila.prod', 1),
 ('kenzo-sport--little-x--polo-shirt-', 1),
 ('FA65PO0504SK.56.L.html', 1),
 ('celestial-black-diamond-eye-mask-MUK200020236.html', 1),
 ('샌들_cod11693025uh.html', 1),
 ('warm-up-fleece-H800165Ev03XS', 1),
 ('les-fougeres-22g351314', 1),
 ('tiny-monogram-card-case-in-grained-leath

 ('barley-bear-from-your-little-girl-mothers-day-card', 1),
 ('176606', 1),
 ('le-rieu-73g70107', 1),
 ('americana-largo-medio-pata-de-gallo-pinos-joya-mujer', 1),
 ('BR40315-29.html', 1),
 ('productpage.0686564027.html', 1),
 ('productpage.0906649001.html', 1),
 ('intro-to-soap-making-kit', 1),
 ('643470-1000', 1),
 ('le-puits-vaillant-02g126', 1),
 ('productpage.0579381077.html', 1),
 ('cricut-infusible-ink-black-transfer-sheets-2-pack', 1),
 ('647933-1000', 1),
 ('le-bourrut-40g10278', 1),
 ('productpage.0808168003.html', 1),
 ('productpage.0803468009.html', 1),
 ('product.ribbed-cashmere-slippers-pink.0760298007.html', 1),
 ('laquo-le-renard-de-morlange-raquo-de-alain-surget-25-juin', 1),
 ('achat-boite-a-leurres-effzett-water-proof-lure-cases-v2-183701.html', 1),
 ('bottes_cod11746644hm.html', 1),
 ('productpage.0971777001.html', 1),
 ('le-chalet-de-mon-pere-73g232552', 1),
 ('sante', 1),
 ('ma-bible-de-l-herboristerie-edition-luxe-10228690', 1),
 ('625807HVK709765', 1),
 ('produc

 ('461731', 1),
 ('thank-you-very-much-pink-card', 1),
 ('143540', 1),
 ('residence-u-quarciu-20g58804', 1),
 ('275209', 1),
 ('double-knit-full-zip-hoodie-565865.html', 1),
 ('camiseta-algodon-estampado-david-bowie-negro-htsc21044kbla55.html', 1),
 ('kaia-small-satchel-in-nubuck-and-shearling-619740BTO8W9276.html', 1),
 ('W0OJ9007', 1),
 ('p-8bt346abvlf1d3q', 1),
 ('demaquillant-pour-les-yeux-biphase-dr-hauschka.html', 1),
 ('radiance-brightening-dark-circle-eye-cream-MUK300056694.html', 1),
 ('envelope-large-bag-in-mix-matelasse-grain-de-poudre-embossed-leather-600166BOW981000.html',
  1),
 ('%E3%83%8F%E3%83%BC%E3%83%95%E3%82%A4%E3%83%B3%E3%82%B0%E3%83%AA%E3%83%83%E3%82%B7%E3%83%A5%E3%83%AA%E3%83%96-%E3%82%AB%E3%83%BC%E3%83%87%E3%82%A3%E3%82%AC%E3%83%B3-211M2Y501816.html',
  1),
 ('21a7eb40-f718-49ad-90ae-ff2283516015', 1),
 ('8576464', 1),
 ('hp-probook-430-g3-13-core-i5-23-ghz-ssd-128-gb-8gb-tastiera-spagnolo-miglior-prezzo',
  1),
 ('453344.html', 1),
 ('la-grange-14g2211', 1),
 (

 ('BR95809-02.html', 1),
 ('productpage.0901497002.html', 1),
 ('kate-medium-in-grain-de-poudre-embossed-leather-364021BOW0J9906.html', 1),
 ('productpage.0946392001.html', 1),
 ('lily-sugar-n-cream-jute-yarn-70g', 1),
 ('636317-1003', 1),
 ('62555196IWG8745', 1),
 ('vestes_cod16003788fg.html', 1),
 ('women%E2%80%99s-black-%E2%80%9Ci-love-la%E2%80%9D-linen-v-neck-t-shirt', 1),
 ('BQ10575-02.html', 1),
 ('vyhodny-sucet', 1),
 ('%E3%82%B7%E3%83%A7%E3%83%AB%E3%83%80%E3%83%BC%E3%83%90%E3%83%83%E3%82%B0_cod343549805745250.html',
  1),
 ('productpage.0682289012.html', 1),
 ('stay-all-day-16h-long-lasting-make-up-zid7500820001', 1),
 ('gabardina-529626.html', 1),
 ('monogram-chain-wallet-in-crocodile-embossed-shiny-leather-377829DND1N1000.html',
  1),
 ('productpage.0929321005.html', 1),
 ('aigue-marine-971g1483', 1),
 ('product.zip-up-puffer-vest-dark-navy.0774326004.html', 1),
 ('buy-leurre-coulant-bone-dash-90s-9cm-185118.html', 1),
 ('gite-de-letang-79g134', 1),
 ('used-ford-transit-custo

 ('productpage.0889564001.html', 1),
 ('palms-20g25311', 1),
 ('achat-flotteur-a-oeillet-rive-didier-delannoy-232-213760.html', 1),
 ('gite-aux-fleurs-87g6043', 1),
 ('guo-min-paris-paris', 1),
 ('LT3L01', 1),
 ('kauf-gummikoder-savage-gear-sandeel-lures-56921.html', 1),
 ('necklace_cod50249709et.html', 1),
 ('asus-vivobook-s410un-eb075t-14-core-i7-18-ghz-hdd-1-tb-8gb-tastiera-francese-miglior-prezzo',
  1),
 ('298031.html', 1),
 ('happy-birthday-cherry-blossom-card', 1),
 ('166986', 1),
 ('productpage.0787235002.html', 1),
 ('diabolo-card-holder-H078297CK0F', 1),
 ('moisture-defining-whip-MUK200023410.html', 1),
 ('studio-la-ville-mauny-35g111413', 1),
 ('5382-belt-buckle-reversible-leather-strap-32mm-U_BELT_32_HOMMEpH080029CY89pH073967CAAD080',
  1),
 ('productpage.0890481002.html', 1),
 ('productpage.0894668005.html', 1),
 ('productpage.0685816061.html', 1),
 ('00002001458168', 1),
 ('Nike*Court*Borough*Kinder*Sneaker*Schwarz.prod', 1),
 ('243532', 1),
 ('3-years-leather-anniversary

 ('cf4efa5b-2c21-4e06-9e20-ca7857856735', 1),
 ('117398', 1),
 ('productpage.0925813001.html', 1),
 ('productpage.0665542015.html', 1),
 ('lzg343a-gites-de-france-3-epis-4-pers-saint-georges-de-levejac-48g123431',
  1),
 ('la-cailletiere-17g16081', 1),
 ('productpage.0593434005.html', 1),
 ('productpage.0963466002.html', 1),
 ('je-sais-le-faire-11203080', 1),
 ('le-val-boury-76g26107', 1),
 ('corolla-business-promo', 1),
 ('productpage.0948357003.html', 1),
 ('black-wool-scarf', 1),
 ('p-fxs124afhpf0qa1', 1),
 ('productpage.0866585005.html', 1),
 ('skiing-greetings-card', 1),
 ('10901', 1),
 ('used-ford-grand-c-max', 1),
 ('id-1653351841', 1),
 ('dads-army-dont-panic-youre-70-personalised-card', 1),
 ('171773', 1),
 ('productpage.0918126002.html', 1),
 ('short_cod22527730565982608.html', 1),
 ('p-fb0689a7d5f0abb', 1),
 ('exfoliating-body-mitt-UK200019035.html', 1),
 ('parsley-seed-anti-oxidant-serum-MUK200006996.html', 1),
 ('gg-supreme-carry-on-p-451003K5RMN9769', 1),
 ('iconic', 1),


 ('borse-shopping_cod2204324140561790.html', 1),
 ('%E3%82%A6%E3%82%A3%E3%83%A1%E3%83%B3%E3%82%B9%E3%82%99%E3%82%A6%E3%82%A9%E3%83%83%E3%83%81-%E3%82%A6%E3%82%A3%E3%83%A1%E3%83%B3%E3%82%B9%E3%82%99',
  1),
 ('p-fow424a12if10r3', 1),
 ('productpage.0743530036.html', 1),
 ('W1OX8107A0L', 1),
 ('falda-de-gasa-estampado-floral-rojo-largo-midi-', 1),
 ('emergence-des-bourgeois-un-nouveau-paysage-social-au-moyen-age', 1),
 ('pebeo-vitrea-160-iridescent-medium-45ml', 1),
 ('506099-1000', 1),
 ('crossbody-and-belt-bags_cod45551164ti.html', 1),
 ('sandals_cod11958398gv.html', 1),
 ('RX7142%20MALE%20rb7142-tortoise', 1),
 ('8053672824193', 1),
 ('rainbow-dust-red-sugar-crystals-50g', 1),
 ('571306-1013', 1),
 ('6936904', 1),
 ('airgun-shooter-magazine-single-issue.thtml', 1),
 ('productpage.0904286001.html', 1),
 ('CustomisePoster.aspx', 1),
 ('la-grange-neuve-73g121101', 1),
 ('cheval-de-fete-scarf-90-H003748Sv01', 1),
 ('soutien-gorge-emboitant-decollete-plongeant-naturel-wish.html', 1),
 ('le

 ('blouson-en-cuir-d-agneau-femme', 1),
 ('BN48025-02.html', 1),
 ('jogginghose-aus-baumwollpique-569273.html', 1),
 ('addict-sneaker-H201108Zv01365', 1),
 ('FB55SF300FQ9.99.TU.html', 1),
 ('product.merino-crew-neck-sweater-orange.0775814007.html', 1),
 ('productpage.0733253008.html', 1),
 ('29g17140-29g17140', 1),
 ('uptown-medium-tote-in-grain-de-poudre-embossed-leather-5576531KA0J9207.html',
  1),
 ('achat-canne-gunki-dots-lure-219602.html', 1),
 ('g912005-91g912005', 1),
 ('product.sleeveless-pleated-shirt-dress-white.0899701006.html', 1),
 ('mas-de-muratel-13g614', 1),
 ('harriet', 1),
 ('baby-handprint-frame-art-set-15cm-x-10cm', 1),
 ('611794-1000', 1),
 ('productpage.0878828002.html', 1),
 ('product.organic-cotton-ruffled-denim-dress-blue.0914584001.html', 1),
 ('la-metairie-du-chateau-du-courbat-h36g002234', 1),
 ('product.leather-overshirt-black.0985665001.html', 1),
 ('seawhite-a3-portrait-cupcycling-eco-starter-sketchbook', 1),
 ('651244-1000', 1),
 ('blue-and-white-joggers

 ('dmc-blue-mouline-special-25-cotton-thread-8m-3846', 1),
 ('564378-1200', 1),
 ('achat-lunettes-polarisantes-costa-riconcito-580g-192183.html', 1),
 ('productpage.0928111001.html', 1),
 ('le-logis-daline-27g996', 1),
 ('productpage.0916866002.html', 1),
 ('eradikate-blemish-spot-treatment-MUK200023194.html', 1),
 ('product.linen-shirt-detail-playsuit-blue.0937909002.html', 1),
 ('les-agrots-71g1084', 1),
 ('grapefruit-face-cleanser-MUK200006427.html', 1),
 ('product.sleeveless-cotton-mix-top-white.0908580001.html', 1),
 ('hemline-tapestry-needles-size-24_26-6-pack', 1),
 ('567160-1000', 1),
 ('mountain-out-of-a-molehill-personalised-card', 1),
 ('172067', 1),
 ('crystal-retinal-1-stable-retinal-night-serum-MUK200028314.html', 1),
 ('snazaroo-bright-pink-face-paint-compact-18ml', 1),
 ('639332-1002', 1),
 ('fenetres-sur-le-golfe-56g24050', 1),
 ('productpage.0871690001.html', 1),
 ('doudoune-navy-bi-matiere-garnissage-recycle-garcon-', 1),
 ('XS41023-48-14A.html', 1),
 ('productpage.0

 ('tailored-fit-golf-polohemd-449740.html', 1),
 ('buy-waders-stocking-neoprene-scierra-kenai-neo-4mm-chest-foot-206132.html',
  1),
 ('shampooing-sweetie-demelant-et-hydratant-pachamamai.html', 1),
 ('product.gestuftes-hemdblusenkleid-in-maxil%C3%A4nge-steel-blue.0959071001.html',
  1),
 ('le-toulourenc-du-ventoux-26g188004', 1),
 ('productpage.0873292001.html', 1),
 ('productpage.0925175002.html', 1),
 ('mortgage-calculators', 1),
 ('stamp-duty-calculator', 1),
 ('productpage.0862103003.html', 1),
 ('trench-coat-en-coton-stretch-479597.html', 1),
 ('412667', 1),
 ('%E3%83%8D%E3%83%83%E3%82%AF%E3%83%AC%E3%82%B9', 1),
 ('%E3%83%A0%E3%83%BC%E3%83%B3%E3%83%9A%E3%83%B3%E3%83%80%E3%83%B3%E3%83%88%E3%83%8D%E3%83%83%E3%82%AF%E3%83%AC%E3%82%B9%EF%BC%88%E3%83%A1%E3%82%BF%E3%83%AB%EF%BC%89-622692Y15008030.html',
  1),
 ('h-belt-buckle-leather-strap-32mm-U_BELT_32_HOMMEpH064544CM2MpH067123CA78095',
  1),
 ('notions-de-web-et-d-interface-homme-machine', 1),
 ('h-strie-belt-buckle-reversible-leath

 ('les-roses-h67g013458', 1),
 ('tee-shirt-blanc-arty-homme', 1),
 ('MN10373-01.html', 1),
 ('productpage.0605939004.html', 1),
 ('lotus-youth-preserve-moisturiser-MUK200026338.html', 1),
 ('logan-peacoat-460109.html', 1),
 ('logo-double-knit-track-trouser-560365.html', 1),
 ('low-top-sneakers_cod1050808984459.html', 1),
 ('product.draped-boxy-shirt-dress-midnight-blue.0729228001.html', 1),
 ('baskets-a-patin-isla-cuir-refendu-561742.html', 1),
 ('lzg338a-gite-de-france-3-epis-2-pers-saint-georges-de-levejac-48g123381', 1),
 ('crayola-supertips-superwashable-felt-tips-12-pack', 1),
 ('574148-1000', 1),
 ('deep-sleep-pillow-spray-MUK200008263.html', 1),
 ('W0OO9463A', 1),
 ('productpage.0912572001.html', 1),
 ('start-62', 1),
 ('ga3191-1', 1),
 ('productpage.0959128002.html', 1),
 ('FB55MU104P60.69.41.html', 1),
 ('samsung-galaxy-a3-2015-16-go-oro-compatibile-con-tutti-gli-operatori-miglior-prezzo',
  1),
 ('2344.html', 1),
 ('bath-and-body_cod51119967tu.html', 1),
 ('achat-cuiller-ondu

 ('productpage.0814762001.html', 1),
 ('RB3016%20UNISEX%20clubmaster%20marble-wrinkled%20black', 1),
 ('8056597260275', 1),
 ('la-pierre-ecrite-04g15082', 1),
 ('p-8ag9296dmf0ggh', 1),
 ('bae-birthday-card', 1),
 ('168711', 1),
 ('cotton-mesh-quarter-zip-jumper-3616419331849.html', 1),
 ('custom-fit-gingham-shirt-555584.html', 1),
 ('shoulder-bag_cod45494370tm.html', 1),
 ('chardonniere-74g273027', 1),
 ('west-yorkshire-spinners-deep-teal-colourlab-dk-yarn-100g', 1),
 ('647241-1000', 1),
 ('gucci-flora-18k-ring-with-diamonds-p-629827J85408000', 1),
 ('dmc-gold-plastic-storage-wallet', 1),
 ('564370-1000', 1),
 ('productpage.0949800001.html', 1),
 ('free-pattern-knit-a-bunny-onesie-pattern', 1),
 ('615513-1000', 1),
 ('pantalon-slim-sullivan-velours-cotele-565796.html', 1),
 ('bague-aventurine-71148460', 1),
 ('achat-trousse-a-cuiller-suissex-ii-120108.html', 1),
 ('productpage.0868810001.html', 1),
 ('no.-3-hair-perfector-MUK300053878.html', 1),
 ('product.teddy-fleece-clutch-white.092

 ('8056597140423', 1),
 ('bouteille-isotherme-impression-bois-53154350', 1),
 ('clin-d-oeil-en-mediterranee-l-etna', 1),
 ('00002001872212', 1),
 ('Rieker*Damen*Stiefelette*Schwarz.prod', 1),
 ('product.organic-cotton-fitted-denim-shirt-dress-black.0936158002.html', 1),
 ('productpage.0908729009.html', 1),
 ('t-shirt-fy06261z2f0crh', 1),
 ('top-handle-bag_cod45494479ta.html', 1),
 ('clic-h-guepards-bracelet-H701415FOB8PM', 1),
 ('productpage.0882037001.html', 1),
 ('maille', 1),
 ('pull-en-cachemire-603087YALJ21000.html', 1),
 ('productpage.0929599002.html', 1),
 ('productpage.0806757009.html', 1),
 ('chalet-le-virolet-74g33044', 1),
 ('00002001735809', 1),
 ('Timberland+Cross+Mark+PT+Chukka+Herren+Schn%C3%BCrboot+Cognac+.prod', 1),
 ('crewneck-bodysuit-DP21LOOK17FEMME', 1),
 ('prices-and-specifications', 1),
 ('product.technical-body-bag-blue.0827514002.html', 1),
 ('flaxby-nature-s-creation-all-seasons-mattress', 1),
 ('131-00718', 1),
 ('grosse-beuteltasche-bellport-aus-leder-532309

 ('longue-surchemise-kenzo-x-kansaiyamamoto', 1),
 ('FB55CH5619KE.71.M.html', 1),
 ('gite-de-la-fontaine-68g1373', 1),
 ('h-optique-belt-buckle-reversible-leather-strap-38mm-U_BELT_38_HOMMEpH080033CP2KpH075406CABC085',
  1),
 ('17g16038-17g16038', 1),
 ('damestrui-fantasie-harig-breigaren-elektrisch-groen', 1),
 ('BR18155-54.html', 1),
 ('lzg343b-gites-de-france-3-epis-4-pers-saint-georges-de-levejac-48g123432',
  1),
 ('FA62TS7204SJ.99.L.html', 1),
 ('%EC%95%84%EC%9A%B0%ED%84%B0%EC%9B%A8%EC%96%B4_cod41975206oc.html', 1),
 ('women39s-jersey-crewneck-tee-CYO4000EUB.html', 1),
 ('en-de', 1),
 ('kaia-small-satchel-in-smooth-leather-619740BWR0W6475.html', 1),
 ('shiatsuk-lyon', 1),
 ('LRJQ01', 1),
 ('sugar-lip-treatment-advanced-therapy-MUK200025594.html', 1),
 ('birthday-cool-shoes-photo-card', 1),
 ('174642', 1),
 ('medium-adley-shoulder-bag-3616418769292.html', 1),
 ('sevres-cap-H202008NvH857', 1),
 ('oasis-sandal-H071002Zv03380', 1),
 ('buy-lure-box-meiho-vs-7055-137940.html', 1),
 ('p

 ('650798-1000', 1),
 ('qu-est-ce-qu-une-bonne-information', 1),
 ('happy-mothers-day-photo-card', 1),
 ('149018', 1),
 ('%E3%83%8F%E3%82%99%E3%82%B1%E3%82%99%E3%83%83%E3%83%88-8bs017a72vf1aqa', 1),
 ('belt-with-paved-buckle-in-smooth-leather-554776BOO0Y1000.html', 1),
 ('achat-moulinet-mer-shimano-tld-10664.html', 1),
 ('productpage.0923682001.html', 1),
 ('productpage.0948519003.html', 1),
 ('productpage.0730086001.html', 1),
 ('productpage.0903727002.html', 1),
 ('kapuzenjacke-polo-clot-565651.html', 1),
 ('silentnight-newbury-800-pocket-eco-pillowtop-mattress', 1),
 ('116-00147', 1),
 ('as-de-coeur-belt-buckle-reversible-leather-strap-13mm-U_BELT_13pH081666CDZ2pH065538CAAB070',
  1),
 ('W0OO9081', 1),
 ('womens-leather-platform-espadrille-p-646386A3N009022', 1),
 ('mini-constance-martelee-belt-buckle-reversible-leather-strap-24mm-U_BELT_24pH075395CDZ2pH052150CABV070',
  1),
 ('productpage.0928189001.html', 1),
 ('avant-garden-sweetbriar-moss-eau-de-parfum-MUK200021051.html', 1),
 (

 ('tie-7-h-maillon-tie-H006222Tv07', 1),
 ('les-combes-74g270503', 1),
 ('pop-h-15-belt-H081087CK89085', 1),
 ('49ecf7bf-76be-4787-8538-1cec196ea98e', 1),
 ('273317', 1),
 ('classic-sac-de-jour-nano-in-grain-de-poudre-embossed-leather-392035BOWEN1000.html',
  1),
 ('productpage.0867557006.html', 1),
 ('product.ribbed-wool-hybrid-scarf-black.0764983001.html', 1),
 ('domaine-de-la-croix-h36g004448', 1),
 ('pull-a-col-en%C2%A0v-superpose-motif-a-pois-561067.html', 1),
 ('hamac-simple-rayures-52151350', 1),
 ('productpage.0928290001.html', 1),
 ('productpage.0590928025.html', 1),
 ('messengers', 1),
 ('sid-messenger-bag-in-antiqued-lambskin-6099271GE0E1000.html', 1),
 ('productpage.0881400001.html', 1),
 ('medium-messenger-bag-with-double-g-p-6489331U10T1000', 1),
 ('gant-de-crepe-noir-tade.html', 1),
 ('chaleco-de-plumon-reversible-3616413368339.html', 1),
 ('lavender-blue-foam-sheet-225cm-x-30cm', 1),
 ('647978-1017', 1),
 ('striped-long-sleeve-shirt-560256.html', 1),
 ('92ad41b5-ddcf-45

 ('productpage.0188590037.html', 1),
 ('k-skate-tiger-laceless-sneakers', 1),
 ('FB55SN100F80.99.39.html', 1),
 ('cricut-chrome-foil-iron_on-12-x-24-inches', 1),
 ('645292-1007', 1),
 ('bac-sur-pieds-design-grand-modele-92531200', 1),
 ('12353042', 1),
 ('le-bastidon-cote-riviere-84g4028', 1),
 ('ultra-facial-cleanser-UK200003314.html', 1),
 ('la-maison-de-la-plage-29g22370', 1),
 ('via-52', 1),
 ('baguette-magique-tinti.html', 1),
 ('anglais-les-verbes-d-action', 1),
 ('achat-bouchons-d-oreilles-beretta-passif-off-shot-par-3-182189.html', 1),
 ('powerfull-5-liquid-lip-balm-zid9276540001', 1),
 ('sistahood-photo-birthday-card', 1),
 ('154240', 1),
 ('productpage.0866126011.html', 1),
 ('29g28890-29g28890', 1),
 ('productpage.0963381004.html', 1),
 ('lot-de-3-loupes-avec-support-92622750', 1),
 ('p-8bh383adp6f1cn7', 1),
 ('productpage.0894140006.html', 1),
 ('product.baseball-cap-yellow.0594226014.html', 1),
 ('slim-fit-oxfordhemd-mit-streifen-555288.html', 1),
 ('envelope-small-en-cuir

 ('136-01375', 1),
 ('product.cropped-knitted-jacket-teal.0950328006.html', 1),
 ('p-8br600abhjf1aqf', 1),
 ('una-familia-76g6124', 1),
 ('productpage.0693575005.html', 1),
 ('quelques-grands-recits-adaptes-en-bd-chez-delcourt', 1),
 ('productpage.0888331005.html', 1),
 ('65g123911-65g123911', 1),
 ('ja-JP', 1),
 ('low-top-sneakers_cod560971904026424.html', 1),
 ('00002001861862', 1),
 ('Beach+Mountain+Damen+Schn%C3%BCrboot+Gr%C3%BCn.prod', 1),
 ('productpage.0810172014.html', 1),
 ('productpage.0921073001.html', 1),
 ('pop-h-pendant-H147991Fv85', 1),
 ('pantalon-en-denim-confort-211M288PJ2010.html', 1),
 ('aux-deux-berriaudes-h18g009367', 1),
 ('logo-cotton-french-terry-short-548403.html', 1),
 ('interlock-henley-shirt-536235.html', 1),
 ('to-the-stars-gorgeous-boyfriend-birthday', 1),
 ('127656', 1),
 ('aeaf9a95-1e43-4f2e-85ad-231fe5a4bdc3', 1),
 ('product.wool-mix-teddy-half-zip-closure-dress-blue.0945337001.html', 1),
 ('polo-coupe-ajustee-en-coton-eponge-566300.html', 1),
 ('the-o

 ('168146', 1),
 ('achat-chaussures-de-wading-vision-nahka-michelin-203929.html', 1),
 ('productpage.0540930016.html', 1),
 ('schulterriemen-aus-satin-in-rosa', 1),
 ('p-8av181aej6f1d9s', 1),
 ('icon-ring-in-yellow-gold-p-325964J85V58062', 1),
 ('mens-jackets', 1),
 ('productpage.0905320003.html', 1),
 ('productpage.0808445001.html', 1),
 ('60th-birthday-card-lavender-milestone-birthday', 1),
 ('137144', 1),
 ('charms_cod16494023980402535.html', 1),
 ('buy-link-stonfo-soft-154015.html', 1),
 ('00002001861197', 1),
 ('Beach*Mountain*Passenger*M%C3%A4dchen*Midcut*Rosa.prod', 1),
 ('polo-in-pique-custom-slim-fit-489304.html', 1),
 ('gruener-kulturbeutel-aus-biobaumwollgaze-gabrielle-paris', 1),
 ('YR01385-99-TU.html', 1),
 ('frequent-styler-MUK200025840.html', 1),
 ('gite-communal-de-bouzais-h18g009110', 1),
 ('gucci-horsebit-1955-shoulder-bag-p-602204H58AK2599', 1),
 ('cinturones_cod8008779904926382.html', 1),
 ('productpage.0914078003.html', 1),
 ('sophia-denim-shorts-573508.html', 1),


 ('BM29315-02.html', 1),
 ('p-7u1398a8c7f1c2p', 1),
 ('productpage.0688873001.html', 1),
 ('cotton-fabric-bolt-119cm-x-2m', 1),
 ('648308-1000', 1),
 ('73g132268-73g132268', 1),
 ('gite-du-moulin-de-chapiteau-16g5040', 1),
 ('animapolis-scarf-90-H003275Sv08', 1),
 ('8cdc268f-2256-47e9-9214-a7832e3ce69f', 1),
 ('4815387', 1),
 ('happy-birthday-you-massive-bellend-card', 1),
 ('170992', 1),
 ('doraemon-x-gucci-womens-rhyton-sneaker-p-655037DRW009522', 1),
 ('455276J84000701', 1),
 ('kauf-pipette-gegen-ungeziefer-frontline-spot-on-214698.html', 1),
 ('au-nid-douillet-14g2627', 1),
 ('chatterbox-mothers-day-card-emotional-rescue', 1),
 ('138120', 1),
 ('p-fw1048afetf0qa1', 1),
 ('les-cabrettes-73g250104', 1),
 ('doze-quilted-waterproof-anti-allergy-mattress-protector', 1),
 ('733-00059', 1),
 ('product.organic-cotton-shrub-print-socks-orange.0949754002.html', 1),
 ('les-peyroliers-26g308001', 1),
 ('womens-fake', 1),
 ('not-print-slide-sandal-p-6363452GC008252', 1),
 ('29g13870-29g13870', 

 ('au-bonheur-normand-76g6085', 1),
 ('le-blavet-56g13017', 1),
 ('product.folded-cotton-a-line-dress-turquoise.0741172001.html', 1),
 ('00002001871765', 1),
 ('Bench*sneaker*femmes*bleu.prod', 1),
 ('dreaming-of-cricket-birthday-card-jolly-follies', 1),
 ('productpage.0602673029.html', 1),
 ('F965SN200F70.99.40.html', 1),
 ('p-8br771a72vf17b8', 1),
 ('les-granits-roses-h44g011983', 1),
 ('nantucket-watch-17-x-23mm-W052181WW00', 1),
 ('wilton-12_inch-disposable-decorating-bags-24-pack', 1),
 ('614849-1000', 1),
 ('interlocking-g-necklace-in-silver-p-479219J84008106', 1),
 ('361-01703-configurable', 1),
 ('productpage.0932743003.html', 1),
 ('productpage.0810169018.html', 1),
 ('productpage.0589599042.html', 1),
 ('la-terrasse-du-chene-84g4129', 1),
 ('productpage.0323155025.html', 1),
 ('452d29b4-cccd-40f9-835f-3d4105f9fd87', 1),
 ('shoulder-bag_cod45551210lr.html', 1),
 ('jardiniere-treillage-arc-chevrefeuille-91024870', 1),
 ('sac-bowling-noir-femme', 1),
 ('BM95329-02.html', 1),
 ('

 ('gutermann-cream-sew-all-thread-100m-1', 1),
 ('566304-1000', 1),
 ('productpage.0870710001.html', 1),
 ('W0OO9296HC', 1),
 ('productpage.0809752001.html', 1),
 ('100-emotions', 1),
 ('I-Page3_12', 1),
 ('productpage.0934255002.html', 1),
 ('productpage.0943938001.html', 1),
 ('cheche-noir-a-carreaux-blancs-et-rouges-i.code', 1),
 ('QR90084-02.html', 1),
 ('roman', 1),
 ('diffuseur-d-huiles-essentielles-airom-91681610', 1),
 ('product.sleeveless-fluid-wool-mix-dress-black.0937490001.html', 1),
 ('gite-le-marlet-gma186-48g151860', 1),
 ('productpage.0748355020.html', 1),
 ('mens-rhyton-sneaker-with-mouth-print-p-552089A9L009522', 1),
 ('productpage.0781820001.html', 1),
 ('men-s-black-sweatshirt', 1),
 ('MM15013-02.html', 1),
 ('tompkins-skinny-jeans-with-polo-437769.html', 1),
 ('uniball-px203-paint-permanent-marker-in-silver', 1),
 ('571159-1001', 1),
 ('chaqueta-de-sarga-de-algodon-553373.html', 1),
 ('00002001830526', 1),
 ('Varese*James*chaussure*%C3%A0*lacets*hommes*cognac.prod'

 ('%ED%8E%9C%EB%94%94-%EC%84%A0%EC%83%A4%EC%9D%B8-%EB%9D%BC%EC%A7%80-%EB%82%B4%EC%B6%94%EB%9F%B4-%EA%B0%80%EC%A3%BD-%EC%87%BC%ED%8D%BC',
  1),
 ('la-ferme-de-letang-2-22g131716', 1),
 ('gite-de-vauls-15g480', 1),
 ('chemise-col-officier-bleu-ciel-boutonnee-hccl22076ksky01.html', 1),
 ('les-sapins-01g358010', 1),
 ('otros-accesorios', 1),
 ('calcetines-largos-211MCS93581.html', 1),
 ('la-clef-des-champs-76g3061', 1),
 ('khaki-military-overshirt-hcc16144kak01.html', 1),
 ('productpage.0954625001.html', 1),
 ('productpage.0228257001.html', 1),
 ('blazer-en-laine-melangee-499404.html', 1),
 ('jupe-longue-plissee-en-voile-imprime-tropical', 1),
 ('BQ27275-31-40.html', 1),
 ('productpage.0831736002.html', 1),
 ('productpage.0871241011.html', 1),
 ('les-mas-de-galine-loranger-07g272703', 1),
 ('navy-satin-bias-binding-15mm-x-2m', 1),
 ('571864-1011', 1),
 ('hand-and-nail-cream-MUK300000003.html', 1),
 ('W0OO9364R', 1),
 ('294339', 1),
 ('jogginghose-polo-team-aus-fleece-3616419557218.html', 1

 ('productpage.0934262002.html', 1),
 ('pants-and-shorts-for-men', 1),
 ('the-north-face-logo-x-gucci-web-print-silk-shorts-p-654771ZAGUR9117', 1),
 ('hose-andela-aus-wollmischung-3616530634904.html', 1),
 ('bandeau-aus-seide-in-orange', 1),
 ('p-fxt011adnkf0eu5', 1),
 ('polo-sport-fleece-sweatshirt-567798.html', 1),
 ('RB2186%20UNISEX%20state%20street-grey', 1),
 ('8056597177689', 1),
 ('product.sterling-silver-abstract-bangle-silver.0873405001.html', 1),
 ('cotton-mesh-quarter-zip-pullover-569272.html', 1),
 ('4939309C2VT8745', 1),
 ('achat-leurre-suspending-rapala-jointed-shad-rap-7cm-14262.html', 1),
 ('productpage.0968883001.html', 1),
 ('productpage.0955462002.html', 1),
 ('productpage.0811993006.html', 1),
 ('achat-casquette-homme-savage-gear-salt-uv-cap-bleu-183019.html', 1),
 ('6772-05g6772', 1),
 ('bg', 1),
 ('baby-kit-printed-jersey-baby-kit', 1),
 ('baby-kit-buk056st8f0c11', 1),
 ('productpage.0881369001.html', 1),
 ('lip', 1),
 ('pillow-talk-lip-secrets-kit-MUK200027805.ht

In [61]:
# Same for "Confirmation"

get_most_occurrent_path_chunks("Confirmation")

Number of different path chunks: 2702
[('orderConfirmation', 2327),
 ('checkout', 2267),
 ('de_de', 717),
 ('en_gb', 451),
 ('en_us', 301),
 ('on', 287),
 ('demandware.store', 287),
 ('nl_nl', 266),
 ('MCheckout-ThankYouPage', 212),
 ('sv_se', 159),
 ('klarna', 125),
 ('pl_pl', 118),
 ('en_GB', 114),
 ('Sites-en-GB-Site', 109),
 ('confirmation', 109),
 ('thankyou.aspx', 104),
 ('en', 104),
 ('us', 99),
 ('carconfigurator', 81),
 ('workflow', 81),
 ('summary', 81),
 ('store', 73),
 ('confirmation.html', 73),
 ('book', 62),
 ('Sites-IKKS_COM-Site', 60),
 ('COSummary-Confirmation', 59),
 ('fr', 56),
 ('ru_ru', 54),
 ('de_at', 47),
 ('join', 46),
 ('complete', 46),
 ('Checkout', 41),
 ('Confirmation', 41),
 ('uk', 40),
 ('en_eur', 39),
 ('en-us', 38),
 ('Sites-en-US-Site', 34),
 ('en_US', 34),
 ('en_gbp', 33),
 ('checkout-v2', 27),
 ('finance_calculator', 27),
 ('fc_workflow', 27),
 ('quotation-completed', 27),
 ('printOrderConfirmation', 25),
 ('next', 23),
 ('Sites-fr-FR-Site', 21),
 ('f

 ('26422347175', 1),
 ('18328452643', 1),
 ('27358124724', 1),
 ('27373046814', 1),
 ('27337428004', 1),
 ('18350842963', 1),
 ('27398331954', 1),
 ('27460040244', 1),
 ('b8962a12-c1ff-4410-a6bd-9acc130dc236', 1),
 ('18337644703', 1),
 ('27432467324', 1),
 ('27368689004', 1),
 ('30703963260', 1),
 ('30733969720', 1),
 ('18340027693', 1),
 ('923612910', 1),
 ('5190757', 1),
 ('31553354926', 1),
 ('30717366700', 1),
 ('31572665226', 1),
 ('26378750205', 1),
 ('30720814480', 1),
 ('18324382283', 1),
 ('18331142113', 1),
 ('4b1d7ad9-5cb6-42a1-b333-19506ae6a7d1', 1),
 ('1277bca1-9dbd-4237-97b0-e73a8e58d6ed', 1),
 ('31569193166', 1),
 ('31548035096', 1),
 ('31568686346', 1),
 ('30708132590', 1),
 ('30751707090', 1),
 ('18314999783', 1),
 ('31580660736', 1),
 ('30743063680', 1),
 ('27448515834', 1),
 ('30743126270', 1),
 ('30759891730', 1),
 ('31599199286', 1),
 ('924017610', 1),
 ('26401878125', 1),
 ('19b8dd31-275c-4355-886e-efb6a91b0d3c', 1),
 ('924383330', 1),
 ('1e52e699-5dc2-4dbf-a96c-0

In [62]:
# Same for "Checkout"

get_most_occurrent_path_chunks("Checkout")

Number of different path chunks: 1663
[('checkout', 2658),
 ('en_gb', 1047),
 ('en', 895),
 ('book', 665),
 ('stores', 590),
 ('order', 481),
 ('store', 238),
 ('checkout.html', 238),
 ('select-a-store', 207),
 ('appointment-type', 196),
 ('single', 186),
 ('appointment-triage-questionnaire', 176),
 ('en-us', 169),
 ('card-details', 149),
 ('ordersummary', 147),
 ('on', 142),
 ('demandware.store', 142),
 ('personal-details', 139),
 ('date-and-time', 135),
 ('fr', 124),
 ('MCheckout-FormHandler', 120),
 ('recap', 87),
 ('de', 86),
 ('en_GB', 81),
 ('Sites-en-GB-Site', 80),
 ('shipping', 78),
 ('hearing', 70),
 ('de_de', 63),
 ('us', 60),
 ('Checkout', 55),
 ('uk', 55),
 ('it', 48),
 ('OnePageCheckout', 48),
 ('Confirmation', 44),
 ('earwax', 43),
 ('earwax-removal', 43),
 ('nl_nl', 40),
 ('nl', 39),
 ('HostedPaymentPage', 38),
 ('pl', 38),
 ('holidays', 36),
 ('booking', 36),
 ('fr_fr', 35),
 ('location', 31),
 ('orderConfirmation', 31),
 ('summary', 29),
 ('Pages', 27),
 ('CustomerPass

## Feature ideas and design

**Idea**: there is a notion of specificity of a word with respect to a given category

We will try to go through the dataset and find, for each atomic part (chunks) of the prefixes, to get the resulting distributions over all webpage categories. We might want to consider "relevant" chunks only, i.e. those that appear at least *threshold* times in the dataset. Let's set threshold=10

In [291]:
# Get an idea of how many meaningful path chunks there are

threshold = 10

all_path_chunks = [
        chunk
        for path, cat in zip(website_df["path"], website_df["category_name"])
        for chunk in path.split("/")
        if chunk
    ]

print(f"Total number of path chunks is {len(set(all_path_chunks))}")
filtered_path_chunks = [
    chunk
    for chunk, count in sorted(Counter(all_path_chunks).items(), key = lambda x: -x[1])
    if count >= threshold
]
print(f"Total number of relevant path chunks that occur at least 10 times is {len(filtered_path_chunks)}")

Total number of path chunks is 53017
Total number of relevant path chunks that occur at least 10 times is 1651


In [141]:
# Proba of being a search page if "search" or "recherche" is in url is p_search("search") or p_search("recherche")
# -> could create an embedding emb("search") = [p_search("search"), p_product("search") etc]

from collections import defaultdict

chunk_to_category_counts = defaultdict(lambda: defaultdict(int))

for path, category in zip(website_df["path"], website_df["category_name"]):
    for chunk in path.split("/"):
        if not chunk or category not in ALL_CATEGORIES_SET:
            continue
        chunk_to_category_counts[chunk][category] += 1

chunk_to_category_scores = defaultdict(lambda: defaultdict(float))
for chunk, category_counts in chunk_to_category_counts.items():
    chunk_counts = sum(category_counts.values())
    if chunk_counts < 10:
        continue
    for category, count in category_counts.items():
        chunk_to_category_scores[chunk][category] = count/chunk_counts

pprint(chunk_to_category_scores)

defaultdict(<function <lambda> at 0x1407c4940>,
            {'%D0%B6%D0%B5%D0%BD%D1%89%D0%B8%D0%BD%D1%8B': defaultdict(<class 'float'>,
                                                                       {'Category': 0.1,
                                                                        'Product': 0.9}),
             '%D0%BC%D1%83%D0%B6%D1%87%D0%B8%D0%BD%D1%8B': defaultdict(<class 'float'>,
                                                                       {'Category': 0.2727272727272727,
                                                                        'Product': 0.7272727272727273}),
             '%D0%BE%D0%B4%D0%B5%D0%B6%D0%B4%D0%B0': defaultdict(<class 'float'>,
                                                                 {'Category': 0.08333333333333333,
                                                                  'Product': 0.9166666666666666}),
             '%E3%82%A6%E3%82%A3%E3%83%A1%E3%83%B3%E3%82%BA': defaultdict(<class 'float'>,
                 

                                                  {'Category': 0.023809523809523808,
                                                   'Product': 0.9761904761904762}),
             'body-wash': defaultdict(<class 'float'>, {'Product': 1.0}),
             'boilers': defaultdict(<class 'float'>,
                                    {'Formations / services': 1.0}),
             'bol-accessoires-relaxation': defaultdict(<class 'float'>,
                                                       {'Product': 1.0}),
             'bolsos': defaultdict(<class 'float'>,
                                   {'Category': 0.7727272727272727,
                                    'Product': 0.22727272727272727}),
             'bonus.html': defaultdict(<class 'float'>,
                                       {'My account': 0.9375,
                                        'Other': 0.0625}),
             'book': defaultdict(<class 'float'>,
                                 {'Category': 0.12394705174488568,
     

             'deals': defaultdict(<class 'float'>,
                                  {'Cart': 0.014492753623188406,
                                   'Category': 0.6099033816425121,
                                   'Home': 0.006038647342995169,
                                   'My account': 0.006038647342995169,
                                   'Other': 0.358695652173913,
                                   'Product': 0.0036231884057971015,
                                   'Store locator': 0.0012077294685990338}),
             'debetne-karty': defaultdict(<class 'float'>,
                                          {'Category': 0.08333333333333333,
                                           'Product': 0.9166666666666666}),
             'deco-maison': defaultdict(<class 'float'>,
                                        {'Category': 0.09352517985611511,
                                         'Product': 0.9064748201438849}),
             'decor': defaultdict(<class 'float'>,
     

                                          'Category': 0.9411764705882353}),
             'f2': defaultdict(<class 'float'>,
                               {'Cart': 0.02631578947368421,
                                'Checkout': 0.10526315789473684,
                                'Other': 0.02631578947368421,
                                'Product': 0.8421052631578947}),
             'fabrics-and-fat-quarters': defaultdict(<class 'float'>,
                                                     {'Category': 1.0}),
             'face-masks': defaultdict(<class 'float'>,
                                       {'Category': 0.8333333333333334,
                                        'Formations / services': 0.08333333333333333,
                                        'Other': 0.08333333333333333}),
             'face-oils': defaultdict(<class 'float'>, {'Product': 1.0}),
             'face-suncream': defaultdict(<class 'float'>, {'Product': 1.0}),
             'facture.asp': defaultdict(<c

                                   'Product': 0.6851351351351351,
                                   'Search': 0.0891891891891892}),
             'jacken-maentel': defaultdict(<class 'float'>,
                                           {'Category': 0.8461538461538461,
                                            'Other': 0.15384615384615385}),
             'jacken-maentel.html': defaultdict(<class 'float'>,
                                                {'Category': 0.8461538461538461,
                                                 'Other': 0.15384615384615385}),
             'jacken.html': defaultdict(<class 'float'>,
                                        {'Category': 0.9,
                                         'Other': 0.1}),
             'jackets': defaultdict(<class 'float'>,
                                    {'Category': 0.23076923076923078,
                                     'Product': 0.7692307692307693}),
             'jackets-coats': defaultdict(<class 'float'>, {'Ca

             'occitanie': defaultdict(<class 'float'>, {'Product': 1.0}),
             'off-duty': defaultdict(<class 'float'>, {'Category': 1.0}),
             'offer-not-found.thtml': defaultdict(<class 'float'>,
                                                  {'Other': 1.0}),
             'offers': defaultdict(<class 'float'>,
                                   {'Category': 0.34545454545454546,
                                    'Information / legals': 0.01818181818181818,
                                    'My account': 0.45454545454545453,
                                    'Other': 0.18181818181818182}),
             'oise': defaultdict(<class 'float'>, {'Product': 1.0}),
             'on': defaultdict(<class 'float'>,
                               {'Brand image': 0.015286624203821656,
                                'Cart': 0.05477707006369427,
                                'Checkout': 0.18089171974522292,
                                'Confirmation': 0.365605095541401

             'shop-damen': defaultdict(<class 'float'>, {'Category': 1.0}),
             'shop-men': defaultdict(<class 'float'>, {'Category': 1.0}),
             'shop-women': defaultdict(<class 'float'>,
                                       {'Category': 0.9952380952380953,
                                        'Other': 0.004761904761904762}),
             'shopping-bag': defaultdict(<class 'float'>,
                                         {'Cart': 0.9771825396825397,
                                          'My account': 0.022817460317460316}),
             'shorts': defaultdict(<class 'float'>,
                                   {'Category': 0.12903225806451613,
                                    'Product': 0.8709677419354839}),
             'shorts.html': defaultdict(<class 'float'>, {'Category': 1.0}),
             'shoulder-bags': defaultdict(<class 'float'>,
                                          {'Category': 0.18032786885245902,
                                       

                                'Cart': 0.04369434665562228,
                                'Category': 0.17726237316214538,
                                'Checkout': 0.012424932698281217,
                                'Confirmation': 0.02050113895216401,
                                'Favorites / wishlist': 0.0008283288465520812,
                                'Form': 0.0006212466349140609,
                                'Help / support': 0.0010354110581901014,
                                'Home': 0.11762269621039553,
                                'Information / legals': 0.0012424932698281218,
                                'My account': 0.010975357216815076,
                                'Offers & services': 0.007869124042244772,
                                'Other': 0.010768275005177056,
                                'Press / news': 0.002277904328018223,
                                'Product': 0.47194036032304826,
                                'Search': 0.

**N.B.**<br>
Those probabilities may be biased because categories are unbalanced. Actually, if the word "car" has been seen 250 times within the "Product" category and 100 times within the "Career & applications" category, this does not mean that "car" is a bigger clue towards "Product" because the "Product" category appears more than 2.5 times more than the "Career & applications" category. Thus, for "car", we might consider 250 vs 100 or 250/prior_proba("Product") vs 100/prior_proba("Career & applications"). This motivates the creation of two modes ("unbalanced" and "balanced") later in the notebook

**Idea**: could be the same for urls

In [126]:
# Proba of belonging to the "search" category given the website/prefix is p'_search(prefix)
# -> could create a second embedding emb'("search") = [p'_search(prefix), p'_product(prefix) etc]

from collections import defaultdict

prefix_to_category_counts = defaultdict(lambda: defaultdict(int))

for prefix, category in zip(website_df["prefix"], website_df["category_name"]):
    if not prefix or category not in all_categories:
        continue
    prefix_to_category_counts[prefix][category] += 1

prefix_to_category_scores = defaultdict(lambda: defaultdict(float))
for prefix, category_counts in prefix_to_category_counts.items():
    prefix_counts = sum(category_counts.values())
    for category, count in category_counts.items():
        prefix_to_category_scores[prefix][category] = count/prefix_counts
pprint(prefix_to_category_scores)

defaultdict(<function <lambda> at 0x133bee160>,
            {'accv-www.cosstores.com': defaultdict(<class 'float'>,
                                                   {'Product': 1.0}),
             'action.which.co.uk': defaultdict(<class 'float'>,
                                               {'Formations / services': 1.0}),
             'app.contentsquare.com': defaultdict(<class 'float'>,
                                                  {'Home': 0.24652087475149106,
                                                   'Other': 0.1172962226640159,
                                                   'Product': 0.6361829025844931}),
             'archives.ikks.com': defaultdict(<class 'float'>, {'Home': 1.0}),
             'asm.hm.com': defaultdict(<class 'float'>, {'Other': 1.0}),
             'at.pandora.net': defaultdict(<class 'float'>,
                                           {'Brand image': 0.125,
                                            'Cart': 0.1875,
                     

                                             'Category': 0.2078272604588394,
                                             'Checkout': 0.0103463787674314,
                                             'Home': 0.03418803418803419,
                                             'Information / legals': 0.02114260008996851,
                                             'My account': 0.005847953216374269,
                                             'Other': 0.002249212775528565,
                                             'Product': 0.5330634278002699,
                                             'Search': 0.11291048133153396}),
             'www.pizzahut.co.uk': defaultdict(<class 'float'>,
                                               {'Cart': 0.022716170004885197,
                                                'Category': 0.35784074255007325,
                                                'Checkout': 0.1172447484123107,
                                                'Home': 0.1316560820

**Idea**: could be the same for actions too

## Train a basic model and evaluate 5-fold accuracy metrics

### Design a vectorizer

We will split the dataset into a training and a test set, and the training test will again be split between a set for probability training (i.e. a map from path chunk (resp. prefixes, actions) to category probas) and a set for model training. <br>
We will implement a multiclass Logistic Regression with category probabilities for paths/prefixes/actions as inputs and one-hot encodings for categories as outputs

**Category probabilities by url chunk (website, path, actions = "c_vars") definitions**

In [286]:
from collections import defaultdict

def get_path_chunk_to_category_scores(paths, categories, balanced=False):
    path_chunk_to_category_counts = (
        defaultdict(lambda: defaultdict(float))
        if balanced
        else defaultdict(lambda: defaultdict(int))
    )
    

    for path, category in zip(paths, categories):
        for chunk in path.split("/"):
            if not chunk or category not in ALL_CATEGORIES_SET:
                continue
            path_chunk_to_category_counts[chunk][category] += (
                1/CATEGORY_TO_PRIOR_PROBABILITY[category]
                if balanced
                else 1
            )

    path_chunk_to_category_scores = defaultdict(lambda: defaultdict(float))
    for chunk, category_counts in path_chunk_to_category_counts.items():
        chunk_counts = sum(category_counts.values())
        if chunk_counts < 10:
            continue
        for category, count in category_counts.items():
            path_chunk_to_category_scores[chunk][category] = count/chunk_counts

    return path_chunk_to_category_scores

In [273]:
def get_website_to_category_scores(prefixes, categories, balanced=False):
    prefix_to_category_counts = (
        defaultdict(lambda: defaultdict(float))
        if balanced
        else defaultdict(lambda: defaultdict(int))
    )
    

    for prefix, category in zip(prefixes, categories):
        if not prefix or category not in ALL_CATEGORIES_SET:
            continue
        prefix_to_category_counts[prefix][category] += (
            1/CATEGORY_TO_PRIOR_PROBABILITY[category]
            if balanced
            else 1
        )

    prefix_to_category_scores = defaultdict(lambda: defaultdict(float))
    for prefix, category_counts in prefix_to_category_counts.items():
        prefix_counts = sum(category_counts.values())
        if prefix_counts < 10:
            continue
        for category, count in category_counts.items():
            prefix_to_category_scores[prefix][category] = count/prefix_counts

    return prefix_to_category_scores

In [274]:
def get_action_chunk_to_category_scores(actions, categories, balanced=False):
    action_chunk_to_category_counts = (
        defaultdict(lambda: defaultdict(float))
        if balanced
        else defaultdict(lambda: defaultdict(int))
    )
    

    for action, category in zip(actions, categories):
        for chunk in action.replace("=", "&").split("&"):
            chunk = chunk.strip()
            if not chunk or category not in ALL_CATEGORIES_SET:
                continue
            action_chunk_to_category_counts[chunk][category] += (
                1/CATEGORY_TO_PRIOR_PROBABILITY[category]
                if balanced
                else 1
            )

    action_chunk_to_category_scores = defaultdict(lambda: defaultdict(float))
    for chunk, category_counts in action_chunk_to_category_counts.items():
        chunk_counts = sum(category_counts.values())
        if chunk_counts < 10:
            continue
        for category, count in category_counts.items():
            action_chunk_to_category_scores[chunk][category] = count/chunk_counts

    return action_chunk_to_category_scores

**Associated vectorizers**

In [275]:
def vectorize_path(path, path_chunk_to_category_scores):
    return np.mean(
        [
            [
                path_chunk_to_category_scores[chunk][category]
                for category in ALL_CATEGORIES
            ]
            for chunk in path.split("/")
        ],
        axis=0
    ).tolist()

In [276]:
def vectorize_website(prefix, prefix_to_category_scores):
    return [
        prefix_to_category_scores[prefix][category]
        for category in ALL_CATEGORIES
    ]

In [277]:
def vectorize_action(action, action_chunk_to_category_scores):
    return np.mean(
        [
            [
                action_chunk_to_category_scores[chunk.strip()][category]
                for category in ALL_CATEGORIES
            ]
            for chunk in action.replace("=", "&").split("&")
        ],
        axis=0
    ).tolist()

### Carry out stratified k-fold (5 folds)

In [285]:
from tqdm.notebook import tqdm

category_to_one_hot_embedding = {}
idx_to_category = {i: category for i, category in enumerate(ALL_CATEGORIES)}
for category in ALL_CATEGORIES:
    category_to_one_hot_embedding[category] = [
        int(i == len(category_to_one_hot_embedding))
        for i in range(len(ALL_CATEGORIES))
    ]

urls = []
paths = []
prefixes = []
actions = []
categories = []

print("Computing dataset arrays: urls, paths, prefixes, actions, categories")

for url, path, prefix, action, category in tqdm(
    zip(
        np.array(website_df["url"]),
        np.array(website_df["path"]),
        np.array(website_df["prefix"]),
        np.array(website_df["c_vars"]),
        np.array(website_df["category_name"])
    )
):
    if category not in ALL_CATEGORIES_SET:
        continue
    urls.append(url)
    paths.append(path)
    prefixes.append(prefix)
    actions.append(action)
    categories.append(category)

urls = np.array(urls)
paths = np.array(paths)
prefixes = np.array(prefixes)
actions = np.array(actions)
categories = np.array(categories)

Computing dataset arrays: urls, paths, prefixes, actions, categories


0it [00:00, ?it/s]

In [280]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

n_splits = 5
fold_accuracy_scores = []
fold_accuracy_scores_categorywise = []

skf = StratifiedKFold(n_splits=n_splits)
for idx, (train_index, test_index) in tqdm(enumerate(skf.split(urls, categories)), leave=False):
    print(f"Preparing training and test features/targets for batch {idx + 1}/{n_splits}...")
    X_train = []
    y_train = []
    X_test = []
    y_test = []

    print(f"Splitting our data on fold {idx + 1}/{n_splits}...")
    urls_train, urls_test = urls[train_index], urls[test_index]
    categories_train, categories_test = categories[train_index], categories[test_index]
    prefixes_train, prefixes_test = prefixes[train_index], prefixes[test_index]
    paths_train, paths_test = paths[train_index], paths[test_index]
    actions_train, actions_test = actions[train_index], paths[test_index]

    print("Splitting our data into probability and model training parts...")
    batch_skf = StratifiedKFold(n_splits=2)
    for batch_train_probas_index, batch_train_model_index in tqdm(batch_skf.split(urls_train, categories_train), leave=False):
        urls_train_probas, urls_train_model = urls_train[batch_train_probas_index], urls_train[batch_train_model_index]
        categories_train_probas, categories_train_model = categories_train[batch_train_probas_index], categories_train[batch_train_model_index]
        prefixes_train_probas, prefixes_train_model = prefixes_train[batch_train_probas_index], prefixes_train[batch_train_model_index]
        paths_train_probas, paths_train_model = paths_train[batch_train_probas_index], paths_train[batch_train_model_index]
        actions_train_probas, actions_train_model = actions_train[batch_train_probas_index], actions_train[batch_train_model_index]
        
        print("Training category probabilities on half train set (probability training part)...")
        path_chunk_to_category_scores = get_path_chunk_to_category_scores(
            paths_train_probas,
            categories_train_probas,
            balanced=False
        )
        prefix_to_category_scores = get_website_to_category_scores(
            prefixes_train_probas,
            categories_train_probas,
            balanced=False
        )
        action_chunk_to_category_scores = get_action_chunk_to_category_scores(actions_train_probas, categories_train_probas)
        print(f"Computing features and targets for our model training on fold {idx + 1}/{n_splits} on the other half of the training set...")
        for path, prefix, action, category in zip(
            paths_train_model,
            prefixes_train_model,
            actions_train_model,
            categories_train_model
        ):
            X_train.append(
                vectorize_path(
                    path,
                    path_chunk_to_category_scores
                ) + vectorize_website(
                    prefix,
                    prefix_to_category_scores
                ) + vectorize_action(
                    action,
                    action_chunk_to_category_scores
                )
            )
            y_train.append(category)
        
    X_train = np.array(X_train)
    y_train = np.array(y_train)
    print(f"Defining and training our model on fold {idx + 1}/{n_splits}...")
    model = LogisticRegression(
        penalty='l2',
        dual=False,
        tol=0.001,
        C=1.0,
        fit_intercept=True,
        intercept_scaling=1,
        class_weight="balanced", # Balance the class weights
        random_state=None,
        solver='lbfgs',
        max_iter=500,
        multi_class='auto',
        verbose=0,
        warm_start=False,
        n_jobs=None,
        l1_ratio=None
    )
    model.fit(X_train, y_train)
    
    print(f"Training category probabilities on full train set for fold {idx + 1}/{n_splits}...")
    path_chunk_to_category_scores = get_path_chunk_to_category_scores(
        paths_train,
        categories_train,
        balanced=False
    )
    prefix_to_category_scores = get_website_to_category_scores(
        prefixes_train,
        categories_train,
        balanced=False
    )
    action_chunk_to_category_scores = get_action_chunk_to_category_scores(
        actions_train,
        categories_train,
        balanced=False
    )

    print(f"Computing model predictions on test for fold {idx + 1}/{n_splits} and computing metrics...")
    for path, prefix, action, category in zip(
        paths_test,
        prefixes_test,
        actions_test,
        categories_test
    ):
        X_test.append(
            vectorize_path(
                path,
                path_chunk_to_category_scores
            ) + vectorize_website(
                prefix,
                prefix_to_category_scores
            ) + vectorize_action(
                action,
                action_chunk_to_category_scores
            )
        )
        y_test.append(category)
    X_test = np.array(X_test)
    y_test = np.array(y_test)
    y_pred = model.predict(X_test)

    print(f"Computing metrics on test for fold {idx + 1}/{n_splits}...")
    fold_accuracy_score = accuracy_score(y_pred, y_test)
    fold_accuracy_scores.append(fold_accuracy_score)
    
    true_category_to_prediction = defaultdict(list)
    category_to_accuracy_score = {}
    for pred, truth in zip(y_pred, y_test):
        true_category_to_prediction[truth].append(pred)
    for category, predictions in true_category_to_prediction.items():
        category_to_accuracy_score[category] = np.mean([int(pred == category) for pred in predictions])
    fold_accuracy_scores_categorywise.append(category_to_accuracy_score)
    print("------------------------------------------------------------------------------------------")


0it [00:00, ?it/s]

Preparing training and test features/targets for batch 1/5...
Splitting our data on fold 1/5...
Splitting our data into probability and model training parts...


0it [00:00, ?it/s]

Training category probabilities on half train set (probability training part)...
Computing features and targets for our model training on fold 1/5 on the other half of the training set...
Training category probabilities on half train set (probability training part)...
Computing features and targets for our model training on fold 1/5 on the other half of the training set...
Defining and training our model on fold 1/5...
Training category probabilities on full train set for fold 1/5...
Computing model predictions on test for fold 1/5 and computing metrics...
Computing metrics on test for fold 1/5...
------------------------------------------------------------------------------------------
Preparing training and test features/targets for batch 2/5...
Splitting our data on fold 2/5...
Splitting our data into probability and model training parts...


0it [00:00, ?it/s]

Training category probabilities on half train set (probability training part)...
Computing features and targets for our model training on fold 2/5 on the other half of the training set...
Training category probabilities on half train set (probability training part)...
Computing features and targets for our model training on fold 2/5 on the other half of the training set...
Defining and training our model on fold 2/5...
Training category probabilities on full train set for fold 2/5...
Computing model predictions on test for fold 2/5 and computing metrics...
Computing metrics on test for fold 2/5...
------------------------------------------------------------------------------------------
Preparing training and test features/targets for batch 3/5...
Splitting our data on fold 3/5...
Splitting our data into probability and model training parts...


0it [00:00, ?it/s]

Training category probabilities on half train set (probability training part)...
Computing features and targets for our model training on fold 3/5 on the other half of the training set...
Training category probabilities on half train set (probability training part)...
Computing features and targets for our model training on fold 3/5 on the other half of the training set...
Defining and training our model on fold 3/5...
Training category probabilities on full train set for fold 3/5...
Computing model predictions on test for fold 3/5 and computing metrics...
Computing metrics on test for fold 3/5...
------------------------------------------------------------------------------------------
Preparing training and test features/targets for batch 4/5...
Splitting our data on fold 4/5...
Splitting our data into probability and model training parts...


0it [00:00, ?it/s]

Training category probabilities on half train set (probability training part)...
Computing features and targets for our model training on fold 4/5 on the other half of the training set...
Training category probabilities on half train set (probability training part)...
Computing features and targets for our model training on fold 4/5 on the other half of the training set...
Defining and training our model on fold 4/5...
Training category probabilities on full train set for fold 4/5...
Computing model predictions on test for fold 4/5 and computing metrics...
Computing metrics on test for fold 4/5...
------------------------------------------------------------------------------------------
Preparing training and test features/targets for batch 5/5...
Splitting our data on fold 5/5...
Splitting our data into probability and model training parts...


0it [00:00, ?it/s]

Training category probabilities on half train set (probability training part)...
Computing features and targets for our model training on fold 5/5 on the other half of the training set...
Training category probabilities on half train set (probability training part)...
Computing features and targets for our model training on fold 5/5 on the other half of the training set...
Defining and training our model on fold 5/5...
Training category probabilities on full train set for fold 5/5...
Computing model predictions on test for fold 5/5 and computing metrics...
Computing metrics on test for fold 5/5...
------------------------------------------------------------------------------------------


### Results for both "unbalanced" and "balanced" modes

In [281]:
# Unbalanced mode

print(f"Overall accuracy: {np.mean(fold_accuracy_scores)}")
print("Accuracy by category:")
category_to_accuracy_score = {
    category: np.mean(
        [
            category_to_accuracy_score[category]
            for category_to_accuracy_score in fold_accuracy_scores_categorywise
        ]
    )
    for category in ALL_CATEGORIES
}
pprint(category_to_accuracy_score)

Overall accuracy: 0.5164668989402215
Accuracy by category:
{'Appointments / booking': 0.45,
 'Brand image': 0.7473601718282877,
 'Careers & applications': 1.0,
 'Cart': 0.7816389698458119,
 'Category': 0.6553837342497136,
 'Checkout': 0.5120805762944707,
 'Confirmation': 0.971321501265004,
 'Favorites / wishlist': 0.8551282051282051,
 'Form': 0.6499999999999999,
 'Formations / services': 0.9704724409448818,
 'Help / support': 0.8502615694164989,
 'Home': 0.6244410190674144,
 'Information / legals': 0.5911504424778762,
 'My account': 0.6454901960784314,
 'Offers & services': 0.9195402298850575,
 'Other': 0.46674467708885914,
 'Press / news': 0.8248022598870056,
 'Product': 0.18902031135594446,
 'Search': 0.8308021448508278,
 'Store locator': 0.7675956542276807}


In [271]:
# Balanced mode

print(f"Overall accuracy: {np.mean(fold_accuracy_scores)}")
print("Accuracy by category:")
category_to_accuracy_score = {
    category: np.mean(
        [
            category_to_accuracy_score[category]
            for category_to_accuracy_score in fold_accuracy_scores_categorywise
        ]
    )
    for category in ALL_CATEGORIES
}
pprint(category_to_accuracy_score)

Overall accuracy: 0.5901507120350076
Accuracy by category:
{'Appointments / booking': 0.05,
 'Brand image': 0.7321104112849389,
 'Careers & applications': 1.0,
 'Cart': 0.5852513877804771,
 'Category': 0.4670103092783505,
 'Checkout': 0.5226667426038872,
 'Confirmation': 0.9546714793889934,
 'Favorites / wishlist': 0.8705128205128206,
 'Form': 0.5166666666666666,
 'Formations / services': 0.9527559055118109,
 'Help / support': 0.9180684104627768,
 'Home': 0.6619413768019002,
 'Information / legals': 0.7531904983698183,
 'My account': 0.5980392156862745,
 'Offers & services': 0.9933333333333334,
 'Other': 0.5893520974402181,
 'Press / news': 0.9157627118644067,
 'Product': 0.4447710592638449,
 'Search': 0.9408345987371582,
 'Store locator': 0.8554558337269722}


Results show a clear signal in our algorithm, with more than 50% accuracy over 20 categories (random model would perform around 0.05 accuracy score). Our algorithm nevertheless sometimes performs poorly on specific categories, like "Product" for the unbalanced mode or "Appointments / booking" for the balanced mode. This has to raise our attention and could be a focus for error analysis and a reason for carrying a mixture of experts on both models.
While averaging our accuracy score on all the categories (with same weight on every category) we find that both methods score an approximate 71.5% accuracy across categories.

## Ideas of improvement

This algorithm could be improved in several ways:
- Perform error analysis to improve features and focus on errors and categories with a weak accuracy score
- Clean maps from chunks to category scores: remove figures, explicit categories of words, stop words etc
- Create a system for merging similar words or words with the same grammatical root/meaning/translation Ex: search <-> searches, search <-> recherche, knitwear <-> sportswear etc. Is useful to handle new expressions in the url, namely actions. Use stemming or pre-trained embedding models like Word2Vec.
- Improve the path chunk probability computations by taking into account the prior/bias on the category
- Grid-search hyperparameters (Logistic l2 penalization, threshold on minimal occurrence for chunk relevance) etc
- Try scaling inputs before feeding to the penalized Logistic Regression
- Try other models: MLPClassifier, non-linear classifiers (Random Forest/Gradient Boosting etc) ...
- Dig in the specificity of a given url chunk for a given category -> use TF-IDF-like features. For instance, chunks like ".fr" or "us" are frequent in all the different categories, there are not discriminant. Find a way of implementing this in the vecto
- Combine our "balanced" and "unbalanced" modes to leverage the strengths of both of them. Merge them, or simply average the predict_proba of the two models
- Correct the randomness of fold splits to be able to get perfectly comparable results (the models can still remain random during training though)