The goal of this notebook was to make exploring embeddings as interactive as possible, while learning the [Bokeh](https://bokeh.org/) library along the way. I found Bokeh well suited to this task: it has good documentation and examples, it is flexible enough, and it makes it easy to add client-side interactivity without a running Python server. I ran into a few minor bugs while working with it, but my overall impression is that it is a very nice plotting library.

As my focus was on visualization and interactivity, I do not create embeddings here; instead I build on [others' work](https://www.kaggle.com/konradb/product-embeddings) (thanks @konradb).

### Challenges/known issues

One of the challenges in working on this notebook was how to include product thumbnails in it. There seems to be no way to reference images from a dataset directly, but I found that you can reference any file from the notebook's output files (`/kaggle/working`). While this does the job, it has some drawbacks:

- you have to create 100K thumbnails on each notebook re-run, which takes time and space. Even though 128x128 thumbnails do not take up much space individually, the overall number of files is quite large, which causes slowdowns.
- image URLs seem to expire, so while you can open the notebook and view images, you can't pregenerate thumbnails once in one notebook and then reference them from another. Also, if you open the notebook, keep it open for a while (a few hours) and then hover over the plot, you might see that the image links are broken; to fix this you'll need to reload the page.
- it takes some time for the server to respond (up to half a minute), not because the notebook itself is large, but because there are lots of tiny files.

So I would like to hear if there are better ways to include thumbnails in the notebook.
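
For reference, the hover tooltip in the code cell builds each thumbnail URL relative to the notebook output directory. Here is a minimal sketch of that mapping, assuming article IDs are 10-digit strings whose leading zero is lost once the ID is parsed as an integer. The helper name `thumb_url` is hypothetical and is not used anywhere in the notebook.

```python
def thumb_url(article_id: int, root_url: str = "") -> str:
    """Return the relative URL of the thumbnail for an integer article ID.

    Assumption: thumbnail files are named after the 10-digit article ID,
    so the integer is zero-padded back to 10 digits.
    """
    return f"{root_url}thumbs/{article_id:010d}.jpg"


print(thumb_url(108775015))
```

The tooltip template itself simply prepends a literal `0` to `@article_id`, which gives the same result as long as every ID has exactly nine significant digits.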

Another issue is that the notebook output itself is quite large (currently about 40MB), as it includes almost all of `articles.csv` plus the 2D embeddings. It is possible to load this data progressively by extracting it from the notebook into a separate file, which would improve the UX, but I see this as a less critical problem than the first one, where you are stuck with a blank screen for half a minute. Nevertheless, once the notebook is loaded it is quite fast and interactive despite the 100K points, thanks to Bokeh's WebGL renderer.
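
The progressive-loading idea does not depend on the browser: read the file in batches, append each batch to per-column buffers, and report progress after every batch. The notebook does this client-side in JavaScript, so the following is only a Python sketch of the same pattern. The three-row CSV is made up, and the names `load_in_chunks` and `on_progress` are hypothetical.

```python
import csv
import io
from itertools import islice


def load_in_chunks(fileobj, chunk_rows, on_progress):
    """Read CSV rows in batches and append them to per-column lists."""
    reader = csv.DictReader(fileobj)
    columns = {name: [] for name in reader.fieldnames}
    loaded = 0
    while True:
        # Pull at most `chunk_rows` rows from the reader.
        batch = list(islice(reader, chunk_rows))
        if not batch:
            break
        for name in columns:
            columns[name].extend(row[name] for row in batch)
        loaded += len(batch)
        on_progress(loaded)  # e.g. update a progress bar
    return columns


# Tiny in-memory stand-in for the real CSV (made-up values).
sample = io.StringIO("index,x,y\n0,0.1,0.2\n1,0.3,0.4\n2,0.5,0.6\n")
progress = []
data = load_in_chunks(sample, chunk_rows=2, on_progress=progress.append)
print(progress)
print(data["x"])
```

The plot can be redrawn after every batch, so the first points appear long before the whole file has been downloaded.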

### TODO

- Solve loading perf issues (✅ partially)
- Add color legend (✅)
- Add other embeddings

### Updates 

*01.03.2022*<br>
Partially solved the loading performance issue by extracting `articles.csv` and the embeddings into a separate file and loading it incrementally.

*03.03.2022*<br>
Added a color legend.

In [None]:
!pip install opentsne -qq

import os
import umap
import colorcet
import itertools
import pandas as pd
import bokeh as bk
import bokeh.models as bkm
import bokeh.layouts as bkl
import bokeh.resources as bkr
import bokeh.transform as bkt

from glob import glob
from PIL import Image
from openTSNE import TSNE
from tqdm.auto import tqdm
from multiprocessing import Pool
from IPython.display import HTML
from bokeh.plotting import output_notebook, show, figure

output_notebook(resources=bkr.INLINE, hide_banner=True)

display(HTML("""
<style>
    div.output_subarea  {
        padding: 0px !important;
    }
    
    .rendered_html hr {
        width: auto !important;
    }
    
    progress {
      appearance: none;
      -webkit-appearance: none;
    }

    progress::-webkit-progress-bar {
      height: 20px;
      background-color: #eee;
      border-radius: 2px;
      box-shadow: 0 2px 5px rgba(0, 0, 0, 0.25) inset;
    }
    
    .legend {
    }
    
    .legend-item-bar {
        height: 10px;
        margin-bottom: 5px;
    }
</style>
"""))

DEV = False

# generate product image thumbnails

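def resize(i):
    # Save a thumbnail (at most 128x128, aspect ratio preserved) of image `i`
    # into ./thumbs under the same file name.
    fname = os.path.basename(i)
    im = Image.open(i)
    im.thumbnail((128, 128))
    # exist_ok=True makes this safe to call from several worker processes.
    os.makedirs('thumbs', exist_ok=True)
    im.save(f'thumbs/{fname}', quality=85)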

def make_thumbs():        
    images = glob('../input/h-and-m-personalized-fashion-recommendations/images/**/*.jpg')
    with Pool(4) as p:
        p.map(resize, images)
        
if not DEV:
    make_thumbs()

# load data & embeddings

articles = pd.read_csv('../input/h-and-m-personalized-fashion-recommendations/articles.csv')
emb_shop = pd.read_csv('../input/product-embeddings/prodemb_shop_20.csv')
emb_img = pd.read_csv('../input/product-embeddings/prodemb_img_128.csv')
emb_img['article_id'] = emb_img['image_id'].str[-14:-4].astype('int')
emb_img = emb_img.drop(columns='image_id').set_index('article_id')

# project embedding into 2D

articles_emb = articles
emb_names = []

emb_name = 'img_tsne_euclidean'
reducer = TSNE(verbose=DEV, n_jobs=-1, metric='euclidean')
emb2d = reducer.fit(emb_img.values)
emb2d = pd.DataFrame(emb2d.astype('float32'), columns=[f'x_{emb_name}', f'y_{emb_name}'], index=emb_img.index)
articles_emb = articles_emb.join(emb2d, on='article_id')
emb_names.append(emb_name)

if not DEV:
    emb_name = 'img_umap_euclidean'
    reducer = umap.UMAP(verbose=DEV, n_neighbors=30, metric='euclidean')
    emb2d = reducer.fit_transform(emb_img.values)
    emb2d = pd.DataFrame(emb2d.astype('float32'), columns=[f'x_{emb_name}', f'y_{emb_name}'], index=emb_img.index)
    articles_emb = articles_emb.join(emb2d, on='article_id')
    emb_names.append(emb_name)

    emb_name = 'img_tsne_cosine'
    reducer = TSNE(verbose=DEV, n_jobs=-1, metric='cosine')
    emb2d = reducer.fit(emb_img.values)
    emb2d = pd.DataFrame(emb2d.astype('float32'), columns=[f'x_{emb_name}', f'y_{emb_name}'], index=emb_img.index)
    articles_emb = articles_emb.join(emb2d, on='article_id')
    emb_names.append(emb_name)

    emb_name = 'img_umap_cosine'
    reducer = umap.UMAP(verbose=DEV, n_neighbors=30, metric='cosine')
    emb2d = reducer.fit_transform(emb_img.values)
    emb2d = pd.DataFrame(emb2d.astype('float32'), columns=[f'x_{emb_name}', f'y_{emb_name}'], index=emb_img.index)
    articles_emb = articles_emb.join(emb2d, on='article_id')
    emb_names.append(emb_name)
    
articles_emb.index.name = 'index'
articles_emb.to_csv('articles_emb.csv')

# plot

p = figure(output_backend='webgl', sizing_mode='stretch_width', title='Product Embeddings', 
           tools=['pan', 'box_zoom', 'wheel_zoom', 'reset'], active_scroll='wheel_zoom', name='figure')
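# Repeat the 10-color palette 20 times, matching the `category10` entry of the JS `palettes`
# table that the legend indexes into, so plot and legend colors agree for every factor.
color_mapper = bkt.factor_cmap('product_type_name', palette=list(bk.palettes.Category10_10) * 20, 
                           factors=articles['product_type_name'].value_counts().index.tolist())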

if DEV:
    source = articles_emb.sample(frac=0.1)
else:
    source = articles_emb.head(0)
    
scatter = p.circle(source=source.filter(regex='x_|y_|article_id|_name|detail_desc'), 
                   x=f'x_{emb_names[0]}', y=f'y_{emb_names[0]}', size=5, alpha=0.2, color=color_mapper)

p.title.align = 'center'
p.title.text_font_size='12pt'
p.toolbar.logo = None

# configure controls

root_url = ''
hover = bkm.HoverTool(attachment='above', tooltips=f"""
    <div style="clear: both; width: 350px;">
        <img src="{root_url}thumbs/0@article_id.jpg" style="float: left; 
            padding-right: 10px; padding-bottom: 5px; width: 128px; ">
        <div>
            <strong>@prod_name</strong><br><hr style="margin: 3px 0px;">
            <strong>Product group:</strong> @product_group_name<br>
            <strong>Product type:</strong> @product_type_name<br>
            <strong>Color group:</strong> @colour_group_name<br><hr style="margin: 3px 0px;">
            <em>@detail_desc</em>
        </div>
    </div>
""")

p.add_tools(hover)
p.select(bkm.WheelZoomTool).zoom_on_axis = False

progress_bar = bkm.Div(text=f'<div style="display: flex; padding-left: 20px;"><progress class="success" value=0 max=100 style="flex-grow: 1; height: 20px; border-radius: 0px;"></progress><div style="flex-shrink: 1; padding-left: 15px;">articles 0/{len(articles_emb)}</div></div>', name='progress_bar', sizing_mode='stretch_width', style={'width': '100%'})
legend = bkm.Div(name='legend', sizing_mode='stretch_width', css_classes=['legend'], style={'width': '100%'})

select_emb = bkm.Select(title='Embedding', options=emb_names, margin=(70, 5, 5, 5))
select_emb.js_on_change('value', bkm.CustomJS(args={'p': p, 'scatter': scatter}, code="""
    scatter.glyph.x.field = `x_${this.value}`;
    scatter.glyph.y.field = `y_${this.value}`;
    scatter.data_source.change.emit();
    p.reset.emit();
"""))

select_color_dim = bkm.Select(title='Color dimension', name='color_dimension', value='product_type_name',
                              options=[c for c in articles.columns if c.endswith('name') and c != 'prod_name'])
select_color_dim.js_on_change('value', bkm.CustomJS(args={'scatter': scatter, 'color_mapper': color_mapper['transform']}, code="""
    const column = this.value;
    scatter.glyph.line_color.field = column;
    scatter.glyph.fill_color.field = column;
    color_mapper.factors = valueCounts(scatter.data_source.data[column]).map(x => x[0])
    color_mapper.change.emit();
    renderLegend();
"""))

select_cmap = bkm.Select(title='Colormap', options=['bokeh', 'category10', 'category20', 
                                                    'glasbey_bw', 'glasbey_category10', 'glasbey_dark', 'glasbey_light',
                                                    'spectral','set1', 'set2', 'set3'],
                        name='palette_name', value='category10')
select_cmap.js_on_change('value', bkm.CustomJS(args={'color_mapper': color_mapper['transform']}, code="""
    // `palettes` is the global lookup table defined in the "palettes" script below
    color_mapper.palette = palettes[this.value];
    color_mapper.change.emit();
    renderLegend();
"""))

slider_marker_size = bkm.Slider(title='Marker size', start=1, end=10, value=5, step=1)
slider_marker_size.js_link('value', scatter.glyph, 'size')

slider_marker_alpha = bkm.Slider(title='Marker alpha', start=0, end=1, value=0.2, step=0.1)
slider_marker_alpha.js_link('value', scatter.glyph, 'fill_alpha')
slider_marker_alpha.js_link('value', scatter.glyph, 'line_alpha')

show(bkl.row(bkl.column(select_emb, select_color_dim, select_cmap, 
                        slider_marker_size, slider_marker_alpha, legend, width=200), 
             bkl.column(progress_bar, p, sizing_mode='stretch_width')))

# palettes

display(HTML(f"""
<script>
    var palettes = {{
        bokeh: {list(bk.palettes.Bokeh8) * 20}, 
        category10: {list(bk.palettes.Category10_10) * 20}, 
        category20: {list(bk.palettes.Category20_20) * 20}, 
        glasbey_bw: {colorcet.b_glasbey_bw},
        glasbey_category10: {colorcet.b_glasbey_category10},
        glasbey_dark: {colorcet.b_glasbey_bw_minc_20_maxl_70},
        glasbey_light: {colorcet.b_glasbey_bw_minc_20_minl_30},
        spectral: {list(bk.palettes.Spectral11) * 20}, 
        set1: {list(bk.palettes.Set1_9) * 20}, 
        set2: {list(bk.palettes.Set2_8) * 20}, 
        set3: {list(bk.palettes.Set3_12) * 20},
    }}
</script>
"""))

# legend

display(HTML("""
<script>
    function waitForBokeh(fn, maxAttempts) {
        function _waitForBokeh() {
            if (window.Bokeh !== undefined && Bokeh.documents.length > 0 && Bokeh.documents[Bokeh.documents.length - 1].is_idle) {
                clearInterval(timer);
                fn();
            } else {
                attempts ++;
                if (attempts > maxAttempts) {
                    clearInterval(timer);
                }
            }
        }

        let timer = setInterval(_waitForBokeh, 50);
        let attempts = 0;
    }
    
    function valueCounts(list) {
        let counts = {};
        for(let x of list) {
            if (!counts[x]) {
                counts[x] = 0;
            }
            counts[x] += 1;
        }
        counts = Object.keys(counts).map(x => [x, counts[x]]);
        counts.sort((x, y) => y[1] - x[1]);
        return counts;
    }

    function renderLegend() {
        let doc = Bokeh.documents[Bokeh.documents.length - 1];
        let figure = doc.get_model_by_name('figure');
        let dataSource = figure.renderers[0].data_source;
        let el = doc.get_model_by_name('legend');
        let paletteName = doc.get_model_by_name('palette_name').value;
        let palette = palettes[paletteName];
        let colorDim = doc.get_model_by_name('color_dimension').value;
        let counts = valueCounts(dataSource.data[colorDim]);
        let html = '<br>';
        for (let i = 0; i < counts.length; i ++) {
            html += `${counts[i][0]}: ${counts[i][1]} <div class='legend-item-bar' style='width: ${Math.max(5, counts[i][1] / counts[0][1] * 150)}px; 
                     background-color: ${palette[i]}'>&nbsp;</div>`;
        }
        el.text = html;
    }

    waitForBokeh(renderLegend, 100);

</script>
"""))

# incremental data loading (applicable only for rendered notebook)

if not DEV:
    display(HTML(f"""
    <script>
        var totalArticles = {len(articles_emb)};
        var loadedArticles = 0;
    </script>
    """))

    display(HTML("""
    <script>
        require.config({
            paths: {
                Papa: 'https://unpkg.com/papaparse@5.3.1/papaparse'
            }
        });

        require(['Papa'], function(Papa) {
            function loadData() {
                var doc = Bokeh.documents[Bokeh.documents.length - 1];
                var figure = doc.get_model_by_name('figure');
                var dataSource = figure.renderers[0].data_source;
                var progressBar = doc.get_model_by_name('progress_bar');
                var columns = Object.keys(dataSource.data);
                Papa.parse('articles_emb.csv', {
                    download: true,
                    header: true,
                    dynamicTyping: true,
                    chunkSize: 2000000,
                    chunk: function(results) {
                        var rows = results.data.filter(x => x.index !== null);
                        loadedArticles += rows.length;
                        progressBar.text = `<div style="display: flex; padding-left: 20px;"><progress class="success" value=${loadedArticles} max=${totalArticles} style="flex-grow: 1; height: 20px; border-radius: 0px;"></progress><div style="flex-shrink: 1; padding-left: 15px;">articles ${loadedArticles}/${totalArticles}</div></div>`;
                        for (var col of columns) {
                            var colValues = rows.map(r => r[col]);
                            if (col.startsWith('x_') || col.startsWith('y_')) {
                                var oldValues = dataSource.data[col];
                                var combinedValues = new Float32Array(oldValues.length + colValues.length);
                                combinedValues.set(oldValues, 0);
                                combinedValues.set(colValues, oldValues.length);
                                dataSource.data[col] = combinedValues;
                            } else {
                                dataSource.data[col].push(...colValues);
                            }
                        }
                        dataSource.change.emit();
                        renderLegend();
                    }
                });
            }
            
            waitForBokeh(loadData, 100);
        });
    </script>
    """))