## Random words from statistics of the IKEA catalogue

What we will do:

- scrape the IKEA catalogue
- do some **housekeeping** on the data
- find all the transitions between pairs of characters
- create words by doing a random walk on the graph

### Context

Names of IKEA products have always been a mystery to me
and, it seems, to
[Chuck Palahniuk](https://en.wikipedia.org/wiki/Chuck_Palahniuk).

Apparently they are [not random](https://qz.com/896146/how-ikea-names-its-products-the-curious-taxonomy-behind-billy-poang-malm-kallax-and-rens/):

- Bathroom articles = Names of Swedish lakes and bodies of water
- Bed textiles = Flowers and plants
- Beds, wardrobes, hall furniture = Norwegian place names
- Bookcases = Professions, Scandinavian boys' names

etc.

But we are going to compose new names by doing a random walk 
on a graph built from data culled from the IKEA catalogue.

#### Examples

BOLSSJUMNDERINHÖDRERÖR BALA BOFIGSKN BISÖ <br>
BEDNÖVIG BALFÅD BJUSORA BIGILINGGAR BORNG BIKELYUSIK

### Notes

I wrote this to try out some new stuff:

- in Python 3.6.6 
- on an Asus C301SA Chromebook
- using
[JupyterLab](https://blog.jupyter.org/jupyterlab-is-ready-for-users-5a6f039b8906) served from an Ubuntu Xenial in a [Crouton chroot](https://github.com/dnschneid/crouton).

In [1]:
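import re
import urllib.request

# Product name and description, as they appear on a category page
prod_pp = re.compile(r'<h3 class ="noBold"><span class="productTitle floatLeft">(.*?)</span>'
                     r'.*?<span class="productDesp">(.*?)</span></h3>', re.DOTALL)

# Link target and link text of every anchor that carries a class attribute
prod_cat_pp = re.compile(r'<a class=".*?" href="(.*?)">(.*?)</a>', re.DOTALL)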
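
To see what the product pattern captures, here is a small sketch that runs it on a made-up HTML fragment. The fragment and the descriptions in it are assumptions written to fit the pattern, not markup copied from ikea.com.

```python
import re

# Same product pattern as in the cell above, written with raw strings.
prod_pp = re.compile(
    r'<h3 class ="noBold"><span class="productTitle floatLeft">(.*?)</span>'
    r'.*?<span class="productDesp">(.*?)</span></h3>',
    re.DOTALL,
)

# Hypothetical fragment, invented for this example.
sample_html = (
    '<h3 class ="noBold"><span class="productTitle floatLeft">BILLY</span>\n'
    '  <span class="productDesp">bookcase</span></h3>\n'
    '<h3 class ="noBold"><span class="productTitle floatLeft">MALM</span>\n'
    '  <span class="productDesp">bed frame</span></h3>\n'
)

# findall returns one (name, description) tuple per product heading.
pairs = prod_pp.findall(sample_html)
print(pairs)

# zip(*pairs) turns the list of pairs into a tuple of names and a tuple of descriptions.
names, descs = zip(*pairs)
print(names)
```

`re.DOTALL` matters here because the name and the description may sit on different lines, and `.` would otherwise stop at a newline.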

I had to dump the page to a file and open it with vi because I was having trouble copying from the inspect window in Chrome.

In [3]:
url = 'https://www.ikea.com/us/catalog/allproducts/alphabetical/'
fp = urllib.request.urlopen(url)
tt = fp.read()
# tt is bytes, so write it unchanged to a file opened in binary mode
with open('rr.txt','wb') as out:
    out.write(tt)

Begin by
1. pulling the main index page
1. extracting the URLs for the different categories

In [62]:
url = 'https://www.ikea.com/us/catalog/allproducts/alphabetical/'

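fp = urllib.request.urlopen(url)
# Decode to text: str() on bytes gives the repr, where whitespace shows up as literal \t, \r, \n
tt = fp.read().decode('utf8')

data = prod_cat_pp.findall(tt)
# Keep only the category links and trim the whitespace around the link text
data2 = [ (x, y.strip()) for x,y in data if 'categories' in x]
catelog_urls, categories_tags = list(zip(*data2))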
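
A short illustration of why it is safer to decode the downloaded bytes than to call `str()` on them before stripping. The byte string below is a made-up stand-in for a link text surrounded by whitespace.

```python
raw = b'\r\n\t\tKitchen\r\n'     # hypothetical link text, as bytes

as_repr = str(raw)[2:-1]          # repr of the bytes without the b'...' wrapper
as_text = raw.decode('utf8')      # real text with real whitespace

# strip() takes a set of characters, so this removes any backslash, t, r or n
# from both ends, including the final letter of the word.
stripped_repr = as_repr.strip('\\t\\r\\n')
stripped_text = as_text.strip()

print(stripped_repr)   # Kitche
print(stripped_text)   # Kitchen
```

Names that begin or end with `t`, `r` or `n` lose letters in the first variant, while the decoded text only loses whitespace.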

Scrape all the category pages and get the product names


In [73]:
prods = []
failed = []
for category_url in catelog_urls:
    url = 'https://www.ikea.com' + category_url
    print(url)
    try:
        fp = urllib.request.urlopen(url)
        tt = fp.read().decode("utf8")
        dd = list(zip(* prod_pp.findall(tt) ) )
        prods.append(dd)
    except Exception:
        failed.append(url)
        prods.append([])  # placeholder so prods stays aligned with the category list
        print('Failed %s' % url)
    

https://www.ikea.com/us/en/catalog/categories/departments/living_room/39130/
https://www.ikea.com/us/en/catalog/categories/departments/living_room/16239/
https://www.ikea.com/us/en/catalog/categories/departments/childrens_ikea/31772/
https://www.ikea.com/us/en/catalog/categories/departments/childrens_ikea/18690/
https://www.ikea.com/us/en/catalog/categories/departments/childrens_ikea/18716/
https://www.ikea.com/us/en/catalog/categories/departments/cooking/20636/
https://www.ikea.com/us/en/catalog/categories/departments/dining/16244/
https://www.ikea.com/us/en/catalog/categories/departments/bathroom/20519/
https://www.ikea.com/us/en/catalog/categories/departments/bathroom/39269/
https://www.ikea.com/us/en/catalog/categories/departments/bathroom/10555/
https://www.ikea.com/us/en/catalog/categories/departments/bathroom/10736/
https://www.ikea.com/us/en/catalog/categories/departments/bathroom/20490/
https://www.ikea.com/us/en/catalog/categories/departments/bathroom/20802/
https://www.ikea.

Check if there were any exceptions.

In [74]:
failed

[]

## Housekeeping

We should have done this in the scraping loop
but it's not too late.

Plant names are not interesting, being mostly Greek, so delete them.

In [166]:
' '.join(catalog['Outdoor pots & plants'][0])

"SOCKER SOCKER SOCKER SOCKER INGEFÄRA SOCKER IKEA PS 2014 BITTERGURKA SOCKER VATTENKRASSE KANELSTÅNG LANTLIV LANTLIV LANTLIV SATSUMAS SATSUMAS ASKHOLMEN SATSUMAS ASKHOLMEN SOCKER SOCKER VILDAPEL TOMAT TOMAT IKEA PS 2002 DRACAENA DRACAENA ODLA HIMALAYAMIX SUCCULENT CACTACEAE KALANCHOE PHALAENOPSIS PHALAENOPSIS HIPPEASTRUM PHALAENOPSIS BROMELIACEAE SUCCULENT SUCCULENT ASPLENIUM 'CRISPY WAVE' DRACAENA MARGINATA ALOE VERA PEPEROMIA DRACAENA MARGINATA EUPHORBIA CHAMAECYPARIS ZAMIOCULCAS FICUS LYRATA BAMBINO FICUS MICROCARPA GINSENG FICUS MICROCARPA GINSENG CRASSULA SPATHIPHYLLUM CHAMAEDOREA ELEGANS CHAMAEDOREA CATARA DRACAENA MASSANGEANA PACHIRA AQUATICA BEAUCARNEA RECURVATA RAVENEA AECHMEA YUCCA ELEPHANTIPES TROPISK SANSEVIERIA TRIFASCIATA SCHLUMBERGERA ARAUCARIA CODIAEUM PACHIRA AQUATICA ASKHOLMEN ASKHOLMEN SANSEVIERIA ANANAS CALATHEA DRACAENA"

In [167]:
catalog = dict( zip( categories_tags, prods) )
kill = [x for x in catalog.keys() if 'plant' in x]
print(kill)

for x in kill:
    del catalog[x]

In [77]:
import pickle
with open('ikea.pkl','wb') as fp:
    pickle.dump(catalog, fp)
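
Reloading the pickle in a later session avoids scraping again. A minimal round-trip sketch, using a toy catalog with made-up entries and a temporary directory so that a real `ikea.pkl` is not overwritten:

```python
import os
import pickle
import tempfile

# Toy stand-in for the scraped catalog (made-up entries).
catalog = {'Toy category': [('BILLY', 'MALM'), ('bookcase', 'bed frame')]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'ikea.pkl')
    # Pickle files are binary, hence 'wb' and 'rb'.
    with open(path, 'wb') as fp:
        pickle.dump(catalog, fp)
    with open(path, 'rb') as fp:
        restored = pickle.load(fp)

print(restored == catalog)
```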

- Clean up some more by getting rid of multiple entries
- Make a word bag of all product names

In [80]:
cat2 = { x: set(y[0]) for x,y in catalog.items() if y}

In [131]:
import itertools
# this was a test on a small subset
# dd = [ list(y) for x,y in cat2.items() if 'bed' in x.lower()]
dd = [ list(y) for x,y in cat2.items() ]
word_bag  = list(itertools.chain(*dd) )
word_bag.append(' ')        

Kill any words that have digits in them.

In [101]:
pp = re.compile(r'\d')
words = ' '.join([ x for x in word_bag if not pp.search(x)])
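
The housekeeping steps above (one set of names per category, one flat word bag, no names containing digits) can be tried on toy data. The category names below are invented; the product names are borrowed from the plant listing earlier. Sorting is an addition of this sketch, only there to make the output repeatable.

```python
import itertools
import re

# Toy catalog: category -> [names, descriptions], or [] when a page had no matches.
catalog = {
    'Toy category A': [('SOCKER', 'SOCKER', 'IKEA PS 2014', 'ODLA'), ('a', 'b', 'c', 'd')],
    'Toy category B': [('ODLA', 'TOMAT'), ('e', 'f')],
    'Empty category': [],
}

# Drop empty categories and duplicate names within each category.
cat2 = {x: set(y[0]) for x, y in catalog.items() if y}

# Flatten all categories into one list; duplicates across categories survive.
word_bag = sorted(itertools.chain.from_iterable(cat2.values()))

# Remove every name that contains a digit.
digit = re.compile(r'\d')
kept = [x for x in word_bag if not digit.search(x)]
print(kept)
```

Note that `ODLA` appears twice in the result because it occurs in two categories, which gives its letter pairs more weight later on.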

## Make a list of all transitions

In [173]:
transitions = [ words[i:i+2] for i in range(0,len(words) - 1)]
transitions[:10]

['KN', 'NI', 'IS', 'SL', 'LI', 'IN', 'NG', 'GE', 'E ', ' L']
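
The same sliding window on a short toy string, next to an equivalent formulation with `zip`, as a quick check of what a transition is. The toy string is an assumption for illustration.

```python
words = 'BILLY MALM '   # toy input, with the trailing space that ends the last word

# Every pair of adjacent characters, spaces included.
transitions = [words[i:i + 2] for i in range(len(words) - 1)]

# Equivalent: pair each character with its successor.
transitions_zip = [a + b for a, b in zip(words, words[1:])]

print(transitions)
print(transitions == transitions_zip)
```

Pairs such as `'Y '` and `' M'` record which letters end a word and which letters start one.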

You can calculate frequencies like this, but we don't need to use them.

In [93]:
from collections import Counter

# Count every transition that starts with 'N'
tx = [x for x in transitions if x[0] == 'N']
tx.sort()
txx = Counter(tx)
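
For reference, the counts can be turned into estimated transition probabilities by dividing by their total. A small sketch on a toy string (the string is an assumption, built from names that appear earlier in this notebook):

```python
from collections import Counter

words = 'KNISLINGE ANANAS '
transitions = [words[i:i + 2] for i in range(len(words) - 1)]

# Count the transitions that leave 'N'.
counts = Counter(t for t in transitions if t[0] == 'N')

# Relative frequency of each successor of 'N'.
total = sum(counts.values())
probs = {t: c / total for t, c in counts.items()}

print(counts.most_common(1))   # [('NA', 2)]
print(probs)
```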

## Random walk

Make a dictionary for the random walk:

- keys are single characters
- values are lists of the characters that follow the key in `words`

In [134]:
keys = set(words)
trans_dict = { x:[] for x in keys}
for x in transitions:
    trans_dict[x[0]].append(x[1])
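
The successor lists keep duplicates, and that is what makes the explicit frequencies unnecessary: picking uniformly from a list in which `'N'` appears twice and `'S'` once already favours `'N'` two to one. A toy sketch, with `defaultdict` swapped in for the pre-filled dictionary:

```python
from collections import Counter, defaultdict

words = 'ANANAS '   # toy input

# Map each character to the list of characters seen right after it.
trans = defaultdict(list)
for a, b in zip(words, words[1:]):
    trans[a].append(b)

print(dict(trans))            # {'A': ['N', 'N', 'S'], 'N': ['A', 'A'], 'S': [' ']}
print(Counter(trans['A']))    # Counter({'N': 2, 'S': 1})
```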

In [159]:
import random

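def mk_word(xx):
    # Start at character xx and keep stepping to a randomly chosen
    # successor until the walk reaches a space, which ends the word
    word = ''
    while xx != ' ':
        word += xx
        xx = random.choice(trans_dict[xx])
    return word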
    

In [162]:
ikea_words = [ mk_word('B') for i in range(20) ]
ikea_words = [ x for x in ikea_words if len(x) > 3]
' '.join(ikea_words)

'BOLSSJUMNDERINHÖDRERÖR BALA BOFIGSKN BISÖ BEDNÖVIG BALFÅD BJUSORA BIGILINGGAR BORNG BIKELYUSIK'
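
The whole walk can also be tried end to end on a tiny corpus. This is a sketch under assumptions: the corpus is a toy, `build_transitions` and `make_word` are hypothetical helper names, and the seeded generator and the `max_len` cap are additions for repeatable, bounded runs that the cells above do not have.

```python
import random
from collections import defaultdict

def build_transitions(words):
    """Map each character to the list of characters that follow it."""
    table = defaultdict(list)
    for a, b in zip(words, words[1:]):
        table[a].append(b)
    return table

def make_word(start, table, rng, max_len=30):
    """Random walk from `start` until a space is reached or max_len letters are used."""
    word = ''
    ch = start
    while ch != ' ' and len(word) < max_len:
        word += ch
        ch = rng.choice(table[ch])
    return word

# Toy corpus; the trailing space guarantees every letter has a successor.
words = 'BILLY BITTERGURKA MALM KALLAX SOCKER '
table = build_transitions(words)

rng = random.Random(0)   # seeded so that reruns give the same words
generated = [make_word('B', table, rng) for _ in range(5)]
print(generated)
```

Each generated word only ever uses letter pairs that occur in the corpus, which is why the results sound plausible even when they are nonsense.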

In [None]:
BORDISASTEBRARVAS